Programmers Guide
- This page is a lightly updated version of the PDF guide accompanying the v2.0 SDK.
- This page freely mixes abstract API concepts with concrete C++ classes from the original Vamp plugin SDK. If you are using some other SDK, many of the details may not apply. The concepts and diagrams should still be of interest.
A Vamp plugin is a chunk of compiled program code that carries out analysis of a digital audio signal, returning results in some form other than an audio signal. These results are often called audio features.
Vamp plugins are distributed in shared library files with extension .dll, .so, or .dylib depending on the platform, with one or more plugins in each library. A plugin is usually identified to the computer using the shared library name plus a short text identifier naming the plugin within the library. The plugin cannot be used on its own, but only with a conforming “host” application which loads the plugin from its shared library and calls functions within the plugin code to configure it, supply it with data, and run it.
Plugins are not supplied with the whole of their audio input at once, but instead are fed it as a series of short blocks: in this respect they resemble real-time audio processing plugins, but in all other respects they are executed “off-line”, meaning that they are not expected to run in real-time and are not obliged to return features as they occur.
The Vamp plugin binary interface is defined in C, and the plugin SDK (software development kit) which will be described here is written in C++. Plugin authors are recommended to use the C++ interfaces in preference to plain C. This document explains the concepts and structures necessary to write Vamp plugins; it is not an API reference, which can be found at http://www.vamp-plugins.org/code-doc/, and it does not cover host programming.
Vamp is not an acronym. It contains letters suggestive of audio, plugins, visualisation and so on, but not in the right order.
Throughout this document we will use as examples three types of feature extractor that already exist as Vamp plugins. (These are arbitrary real-world examples – not the same as the example plugins provided in the SDK.) We will not consider the processing work necessary to implement these techniques, only describe how they communicate with the host as plugins. These examples will appear occasionally in indented sections like this one. They are:
Note onset detector. This estimates the times at which new note events begin in the signal, and returns those times as results. There are no particular values associated with the times. For the purposes of this example, our onset detector will actually have two outputs – the first will report the times as described above, and the second will report a measure of the likelihood that there is an onset within the current input block.
Chromagram. This example of a visualisation data extractor analyses audio and produces from it a grid of values, with a time step as the X coordinate and a pitch bin within the octave as the Y coordinate. Each value in the grid describes the “strength” of the corresponding pitch in the music within the given time range. Another way to describe this is to say that it returns a single column containing a fixed number of pitch strength values, from each of a series of frames or blocks of input audio.
Amplitude follower. This calculates structurally simple regularly-sampled time-series data from the audio, to be displayed by the host as a data plot or used for some other purpose.
As these examples show, Vamp plugins do not actually do any display or user interaction: they just return data. In most cases, these data are not the final result of the work the host is doing, but are useful for something else – either for a host to display and the user to interact with in some way, or as intermediate calculations for a particular purpose such as semantically meaningful audio editing. Vamp plugins do not do swirly visualisation and are typically not useful in real-time contexts such as interacting with musical performers.
A Vamp plugin needs to make a certain amount of information available to the host.
Every plugin, no matter how simple, should provide the following:
- The identifier, name, and description of the plugin itself. See Identifiers, Names and Descriptions below.
- The name of the maker of the plugin, and the plugin's copyright status and version number.
- The input domain which the plugin would like its audio provided in. See Inputs below.
- The plugin's preferred step size and block size for audio input. See Step and Block Size below.
- The minimum and maximum number of input audio channels the plugin is capable of handling. See Inputs below for discussion of these.
- A list of output descriptors that contain information about the structure of the results that the plugin may produce. See Outputs below.
- Implementations of standard functions that set up, reset, and run the plugin. See Inputs below for discussion of these.
Some plugins have parameters that can be set to adjust the way they do their processing. These plugins will also need to provide:
- A list of parameter descriptors that contain information about the editable parameters of the plugin. The host may use these descriptors to show the user a control window for the plugin, for example. See Parameters below.
- Implementations of standard functions that retrieve and set the values of parameters.
A plugin may also have a set of pre-defined configurations that can be offered to the user by name. These are known as programs and a plugin that supports them will also need to provide:
- A list of program names. See Programs below.
- Implementations of standard functions that retrieve the current program name and select a new program. The C++ base class provides virtual methods to override for these.
The Vamp SDK contains one class from which plugin implementation classes should be derived. This class, Plugin, exposes pure virtual methods for most of the accessor and action functions that a plugin class needs to implement. Those that are not directly defined in Plugin are themselves inherited from a further class called PluginBase, which contains virtual methods for things that are not specific to the output structures used in Vamp – plugin name, maker, parameters, program names, etc. These classes, like everything in the SDK, are found in the Vamp namespace.
The Plugin and PluginBase classes also contain a number of data classes that are used when returning bundles of information about features (Feature, FeatureList, FeatureSet), outputs (OutputDescriptor), parameters (ParameterDescriptor) and so on. These will be referred to in the appropriate sections of this document.
Vamp uses a combination of “identifier”, “name” and “description” strings to describe several sorts of object. Most obviously, the plugin itself must implement getIdentifier, getName and getDescription methods (inheriting from pure virtual methods in PluginBase) that return textual information about the plugin. Similar data are included as public data members in the ParameterDescriptor and OutputDescriptor classes.
In all of these cases, the purposes of the three strings are:
- identifier – This should contain a short string that the host can use to refer to the object, within the immediate surrounding scope. That is, the plugin identifier needs to be unique within the plugin's library; an output descriptor's identifier needs to be unique among output descriptors for the plugin; similarly for parameter descriptors. Identifiers are very limited in the characters they may include: upper and lower case ASCII alphabetical characters, digits 0 to 9, and the dash (“-”) and underscore (“_”) characters only.
- name – This is a text that may be shown to the user by the host as the normal label for the object.
- description – This is optional, and may contain extra text to describe the purpose of the object in a way that adds to the information in the name. Hosts that show the description to the user will normally do so in addition to the name, so it should not duplicate information already in the name.
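The character restriction on identifiers is easy to check mechanically. The following is a minimal sketch of such a check; the function name isValidVampIdentifier is hypothetical and not part of the SDK, and treating the empty string as invalid is an assumption of this sketch.

```cpp
#include <string>

// Hypothetical helper, not part of the Vamp SDK.
// Returns true if `id` contains only the characters the guide permits in
// identifiers: ASCII letters, digits 0 to 9, '-' and '_'.
// Rejecting the empty string is an assumption made for this sketch.
bool isValidVampIdentifier(const std::string &id)
{
    if (id.empty()) return false;
    for (char c : id) {
        bool ok = (c >= 'a' && c <= 'z') ||
                  (c >= 'A' && c <= 'Z') ||
                  (c >= '0' && c <= '9') ||
                  c == '-' || c == '_';
        if (!ok) return false;
    }
    return true;
}
```

A plugin author might run a check like this over the plugin, output, and parameter identifiers in a unit test, since a host has no obligation to cope with identifiers outside the permitted set.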
The Vamp API itself does not provide any way for the host to categorise plugins by type or purpose. However, the host extension classes in the SDK do include a method to load plugin categories from category files (with .cat extension) that may be found in the Vamp plugin load path alongside the plugin libraries. These are text files which contain lines of the form
vamp:libraryname:pluginidentifier::category
The category string is a series of category names separated by “ > ”, which describe a possibly multi-level path into a category tree. For example,
vamp:vamp-example-plugins:percussiononsets::Time > Onsets
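To make the line format concrete, here is a small sketch of a parser for a single category line. It is not the SDK's own loader: the CategoryEntry struct and parseCategoryLine function are hypothetical names, and the sketch assumes that neither the library name nor the plugin identifier contains a colon.

```cpp
#include <string>
#include <vector>

// Hypothetical result type for this sketch (not an SDK class).
struct CategoryEntry {
    std::string library;            // e.g. "vamp-example-plugins"
    std::string plugin;             // e.g. "percussiononsets"
    std::vector<std::string> path;  // e.g. {"Time", "Onsets"}
};

// Parses "vamp:libraryname:pluginidentifier::category" where the category
// is a series of names separated by " > ". Returns false on malformed input.
bool parseCategoryLine(const std::string &line, CategoryEntry &out)
{
    const std::string prefix = "vamp:";
    if (line.compare(0, prefix.size(), prefix) != 0) return false;

    // "::" separates the plugin key from the category path.
    std::string::size_type sep = line.find("::", prefix.size());
    if (sep == std::string::npos) return false;

    std::string key = line.substr(prefix.size(), sep - prefix.size());
    std::string::size_type colon = key.find(':');
    if (colon == std::string::npos) return false;
    out.library = key.substr(0, colon);
    out.plugin = key.substr(colon + 1);

    // Split the remainder on " > " into the levels of the category tree.
    out.path.clear();
    const std::string arrow = " > ";
    std::string rest = line.substr(sep + 2);
    std::string::size_type start = 0, pos;
    while ((pos = rest.find(arrow, start)) != std::string::npos) {
        out.path.push_back(rest.substr(start, pos - start));
        start = pos + arrow.size();
    }
    out.path.push_back(rest.substr(start));
    return true;
}
```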
The input to a Vamp plugin is audio data, with one or more channels. The audio is non-interleaved, so the plugin receives a set of pointers to data, one per channel. The plugin can specify how many channels it will accept using its getMinChannelCount and getMaxChannelCount methods.
The number of channels, as well as the block size and step size that will be used when running the plugin, are fixed when the plugin's initialise method is called.
bool initialise(size_t inputChannels, size_t stepSize, size_t blockSize);
If the plugin finds the values supplied to initialise unacceptable, it should return false to indicate that initialisation has failed.
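As an illustration of rejecting unacceptable values, the sketch below shows an initialise implementation for a plugin that can only work with one particular block size. The class FixedBlockPluginSketch is a hypothetical stand-in that compiles without the SDK; a real plugin would derive from Vamp::Plugin and override its virtual methods. The channel limits, the block size of 2048, and the rejection of a zero step size are all assumptions of this example.

```cpp
#include <cstddef>

// Hypothetical stand-in for a plugin class, so the sketch compiles
// without the SDK headers.
class FixedBlockPluginSketch {
public:
    FixedBlockPluginSketch() : m_channels(0), m_stepSize(0), m_blockSize(0) {}

    std::size_t getMinChannelCount() const { return 1; }
    std::size_t getMaxChannelCount() const { return 2; }
    std::size_t getPreferredBlockSize() const { return 2048; }

    bool initialise(std::size_t inputChannels,
                    std::size_t stepSize,
                    std::size_t blockSize)
    {
        // Channel count must lie within the declared limits.
        if (inputChannels < getMinChannelCount() ||
            inputChannels > getMaxChannelCount()) return false;
        // This particular sketch cannot work with any other block size.
        if (blockSize != getPreferredBlockSize()) return false;
        // A zero step would never advance through the audio (assumption).
        if (stepSize == 0) return false;

        // Remember the values actually in use; preferences were only hints.
        m_channels = inputChannels;
        m_stepSize = stepSize;
        m_blockSize = blockSize;
        return true;
    }

    std::size_t getBlockSize() const { return m_blockSize; }

private:
    std::size_t m_channels;
    std::size_t m_stepSize;
    std::size_t m_blockSize;
};
```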
After initialisation, to supply audio data and run the plugin, the host calls the plugin's process method repeatedly. The process method receives a set of input pointers, and a timestamp.
FeatureSet process(const float *const *inputBuffers, RealTime timestamp);
Each time process is called, it is passed a single block of audio of size in samples equal to the block size that was passed to initialise. The difference in sample count between the input to one process call and that to the next is equal to the step size.
As with channel count, the plugin can influence the step and block size by returning its desired values through its getPreferredStepSize and getPreferredBlockSize methods; unlike channel count, the preferred step and block size are only hints, so you should always check the actual values used in initialise if they are important to your code. You don't have to specify a preference for these if you don't want to: return zero for the host to use its defaults, and see Default Step and Block Sizes below.
The audio may be provided in either time domain or frequency domain form. Time domain audio input is conventional PCM sampled digital audio with a floating-point sample type; frequency domain input is the result of applying a windowed short-time Fourier transform to each input block. The input domain is specified by the plugin using its getInputDomain method.
Note onset detector. The input domain and preferred step and block sizes are likely to depend on the method used for onset detection. The example plugin in the SDK requires frequency-domain input and can in theory handle any input step or block size. In practice it declares a preference for block size and expects the host to set the step size to something sensible accordingly.
Chromagram. The constant Q transform used for a chromagram needs, as input, the result of a short-time Fourier transform whose size depends on the sample rate, Q factor, and minimum output frequency of the constant Q transform. The chromagram plugin can therefore ask for a frequency-domain input, and make its preferred block size depend on the sample rate it was constructed with and on its bins-per-octave parameter. (See also What Can Depend on a Parameter? below.) It can not accept a different block size, and its initialise function will fail if provided with one. It may reasonably choose to leave the preferred step size unspecified.
Amplitude follower. This time-domain method is likely to work with any input step and block size, and so will probably leave them unspecified.
When a plugin requests time domain input, the host divides the audio input stream up into a series of blocks of equal size, and feeds one to each successive call to process. The process call may then return features derived from that audio input block, according to its whim. The inputBuffers argument to process will point to one array of floats for each input channel. For example, inputBuffers[0][blockSize-1] will be the last audio sample in the current block for the first input channel.
When all of the audio data has been supplied, the host calls getRemainingFeatures, and the plugin returns any features that are now known and have not yet been returned from earlier process calls.
When supplying time domain input, it is most usual for the step size to be equal to the block size as shown above. This means that the plugin is receiving every sample in the audio input exactly once, in a series of contiguous blocks of data. This does not have to be the case – the plugin can return a different value for getPreferredStepSize to that returned from getPreferredBlockSize if for any reason it would prefer to receive overlapping or non-contiguous blocks.
Vamp currently makes no provision for partial input blocks. If the audio input ends in the middle of a block, the host will fill the block with zero values up to the block size.
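The relationship between step size, block size, and zero padding can be sketched from the host's point of view as follows. This is only an illustration of the behaviour described above for a single channel of time domain audio: the function sliceIntoBlocks is hypothetical, and real hosts stream audio rather than copying it into a vector of blocks.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side helper. Cuts one channel of audio into blocks of
// `blockSize` samples, advancing by `stepSize` samples each time. A block
// that runs past the end of the audio is filled with zeros, as the guide
// describes for partial input blocks.
std::vector<std::vector<float>> sliceIntoBlocks(const std::vector<float> &audio,
                                                std::size_t stepSize,
                                                std::size_t blockSize)
{
    std::vector<std::vector<float>> blocks;
    if (stepSize == 0 || blockSize == 0) return blocks;

    for (std::size_t start = 0; start < audio.size(); start += stepSize) {
        std::vector<float> block(blockSize, 0.0f);  // zero-filled by default
        for (std::size_t i = 0; i < blockSize && start + i < audio.size(); ++i) {
            block[i] = audio[start + i];
        }
        blocks.push_back(block);  // one block per call to process
    }
    return blocks;
}
```

With a step size equal to the block size, each sample appears in exactly one block; with a smaller step the blocks overlap, and with a larger one some samples are skipped.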
If the plugin requests frequency domain input, each block of audio input is processed by the host using a windowed short time Fourier transform before being supplied to the process function.
In this situation, it is most usual for the input blocks to overlap – that is, for the step size to be half of the block size, or less. This is because the original time-domain data needs to be shaped using a cosine or similar window before the Fourier transform is applied, in order to avoid generating spectral noise because of the discontinuities at the edges of the input blocks. This windowing is the host's responsibility, but the plugin needs to be aware that it will happen so that it can choose sensible preferred step and block sizes.
The plugin does not get any control over what shape of window is used, or any other details of the time- to frequency-domain processing apart from the step and block size. If more control is needed, you will need to ask for time domain input and carry out the processing in the plugin instead.
When receiving frequency domain input, the inputBuffers argument to process will point to one array of floats for each input channel as for time domain input, but the arrays of float have a particular layout. Each channel contains blockSize+2 floats, which are alternately real and imaginary components of the Fourier transform's complex output bins.
For example:
- inputBuffers[0][0] and inputBuffers[0][1] contain the real and imaginary components of the DC bin for this block in the first input channel. The imaginary component for this bin should be zero.
- inputBuffers[0][2] and inputBuffers[0][3] contain the real and imaginary components of the bin with frequency sampleRate / blockSize for the first input channel.
- inputBuffers[0][blockSize] and inputBuffers[0][blockSize+1] contain the real and imaginary components of the bin with “Nyquist” frequency sampleRate / 2. Again, this imaginary component should be zero.
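The layout above generalises to any bin index from 0 to blockSize/2. A plugin might wrap the index arithmetic in small helpers like the following; binValue and binFrequency are hypothetical names used only in this sketch.

```cpp
#include <complex>
#include <cstddef>

// Hypothetical helpers for reading the frequency-domain layout.
// `channel` points to blockSize+2 floats laid out as
// re0, im0, re1, im1, ..., re(N/2), im(N/2).

// Returns the complex value of bin `bin`, where 0 <= bin <= blockSize/2.
std::complex<float> binValue(const float *channel, std::size_t bin)
{
    return std::complex<float>(channel[2 * bin], channel[2 * bin + 1]);
}

// Returns the frequency in Hz of bin `bin`:
// bin 0 is DC, bin 1 is sampleRate / blockSize, bin blockSize/2 is Nyquist.
float binFrequency(std::size_t bin, float sampleRate, std::size_t blockSize)
{
    return (static_cast<float>(bin) * sampleRate) / static_cast<float>(blockSize);
}
```

Inside process, the magnitude of bin k in the first channel would then be std::abs(binValue(inputBuffers[0], k)).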
If the plugin does not care about either the step or block size, it should return zero as its preference.
If the plugin returns zero as its block size preference, the host will pick a block size that is practical for its own processing purposes. For time domain inputs this could reasonably be almost anything, from relatively small (e.g. 512) to huge (e.g. the length in sample frames of the entire audio file that the host is processing). For frequency domain inputs of course the host can reasonably be expected to use some block size that is generally considered appropriate for block-by-block short-time Fourier transform processing, such as 1024 or 2048.
If the plugin returns zero as its step size preference, the host will typically use a step size equal to the block size for time domain inputs, or half the block size for frequency domain inputs.
In either case, the host may alternatively offer the choice of step or block size to the user.
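The typical host behaviour described above could be sketched as follows. The InputDomain enumerators mirror the names used by the SDK, but the BlockConfig struct, the chooseSizes function, and the host default of 1024 are assumptions of this example; an actual host is free to choose differently.

```cpp
#include <cstddef>

// Stand-in for the SDK's input domain enumeration.
enum InputDomain { TimeDomain, FrequencyDomain };

// Hypothetical result type for this sketch.
struct BlockConfig {
    std::size_t stepSize;
    std::size_t blockSize;
};

// Resolves the plugin's preferences (zero meaning "no preference") into
// the sizes a typical host would pass to initialise.
BlockConfig chooseSizes(InputDomain domain,
                        std::size_t preferredStep,
                        std::size_t preferredBlock,
                        std::size_t hostDefaultBlock = 1024)
{
    BlockConfig c;
    c.blockSize = (preferredBlock != 0) ? preferredBlock : hostDefaultBlock;
    if (preferredStep != 0) {
        c.stepSize = preferredStep;
    } else if (domain == FrequencyDomain) {
        c.stepSize = c.blockSize / 2;   // overlapping blocks for the STFT
    } else {
        c.stepSize = c.blockSize;       // contiguous time-domain blocks
    }
    return c;
}
```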
A plugin may return features in two places: from the process call, and from getRemainingFeatures. The process call is made repeatedly to provide the plugin with input data; see “Input” above. When all of the input has been consumed, getRemainingFeatures is called once, and the plugin should return any other features that it has computed or can now compute but has not yet returned.
The return type from process and getRemainingFeatures is called FeatureSet. This is an STL map whose key is an output number and whose value is a FeatureList, which is an STL vector of Feature objects. The use of a FeatureList allows the plugin to return features with more than one timestamp from a single process call, or to return all features for the entire audio input in a single FeatureSet from getRemainingFeatures.
A Feature has an optional timestamp and duration (see “Sample Types and Timestamps”, below), a vector of zero or more values, and an optional label. Note that even a Feature with zero values, no timestamp, and no label could still be a valid feature; it may indicate that “something happened in the block of audio passed to this process call”, with the interpretation of “something” depending on which output of the plugin was returning the feature.
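The nesting of these types is easier to see in code. The sketch below uses simplified stand-ins for Feature, FeatureList, and FeatureSet so that it compiles without the SDK; in particular it stores times as seconds in a double, whereas the SDK uses its RealTime class. The function makeOnsetFeatures is hypothetical and shows what the two-output onset detector example might return from one process call.

```cpp
#include <map>
#include <string>
#include <vector>

// Simplified stand-ins for the SDK types (times as double seconds here).
struct Feature {
    bool hasTimestamp;
    double timestamp;
    bool hasDuration;
    double duration;
    std::vector<float> values;
    std::string label;
    Feature() : hasTimestamp(false), timestamp(0.0),
                hasDuration(false), duration(0.0) {}
};
typedef std::vector<Feature> FeatureList;
typedef std::map<int, FeatureList> FeatureSet;

// Output 0: onset times (explicit timestamp, no values).
// Output 1: detection function (one value, implicit time).
FeatureSet makeOnsetFeatures(double onsetTime, float likelihood)
{
    FeatureSet fs;

    Feature onset;
    onset.hasTimestamp = true;
    onset.timestamp = onsetTime;
    fs[0].push_back(onset);      // key 0 matches output descriptor 0

    Feature df;
    df.values.push_back(likelihood);
    fs[1].push_back(df);         // key 1 matches output descriptor 1

    return fs;
}
```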
A lot about the interpretation of returned features depends on which output the feature is associated with. A plugin has a fixed number of outputs, and it must provide a getOutputDescriptors method that returns data about all of them, as a vector of OutputDescriptor objects. The descriptor with index zero in this vector contains information about the output whose values are found with a key of zero in the feature sets returned by the plugin, and so on.
For a plugin to be of any use, it must provide at least one OutputDescriptor (that is, have at least one output). This contains all of the information provided by the plugin about the “meaning” of values associated with an output of the plugin. OutputDescriptors for all of the plugin's outputs, in order, must be returned by a call to getOutputDescriptors.
The OutputDescriptor contains:
- The identifier, name and description of this output (see Identifiers, Names and Descriptions above).
- Optionally, the unit (as a string) of all of the values associated with this output.
- Optionally, the number of value “bins” that features associated with this output have (via hasFixedBinCount and binCount). The bin count might be zero for “time only” features like the simple onset detector, one for “time and value” features like an amplitude tracker, or many for “column” features like a chromagram that have a series of a fixed number of values in each feature. Some features might have a variable number of values, and they will need to leave hasFixedBinCount false.
- Optionally, if hasFixedBinCount is true, names for the value bins (in binNames). For example, a chromagram plugin might have bin names describing the frequencies whose strengths are represented in the bin values.
- Optionally, the extents for values associated with this output (i.e. their minimum and maximum values), via hasKnownExtents, minValue and maxValue. Like the unit, these apply to values in all bins, if the output has more than one bin.
- Optionally, the quantization of the values associated with this output, via isQuantized and quantizeStep. A feature whose values can only fall within a certain subset of real numbers (for example, a feature whose values are always integers) may wish to set these. They are analogous to the quantization for parameters described in Parameters below, except that it is not possible to associate names with the quantize steps as it is for parameters.
- The sample type and rate for the output, via sampleType and sampleRate. See Sample Types and Timestamps below for discussion of these.
- Whether the features associated with this output are expected to have durations (via hasDuration). (This field was added to the OutputDescriptor in version 2.0 of the Vamp API and SDK.) See Sample Types and Timestamps below for discussion of this.
Every feature that is returned from a Vamp plugin has a time associated with it. This is the time at which the feature is considered to “start”, as a fractional number of seconds since the start of the audio input.
The time may be explicit – stored in the timestamp of the feature structure itself – or implicit – omitted from the feature itself and instead deduced by the host on the basis of the sample type and rate defined for the output in which the feature is returned. Whether the time is implicit or explicit depends on the sampleType field of the output and the hasTimestamp field of the feature.
The permitted sampleType values for an output are:
- OneSamplePerStep – Implicit time. The effective time of any feature returned by process for that output is the same as the time that was passed in to the process function by the host. The plugin should not set a timestamp on the output feature, and should set the feature's hasTimestamp field to false. The host should not read the timestamp from the output feature, even if the feature's hasTimestamp field is erroneously found to be true. Any features returned from getRemainingFeatures will all be effectively timed to the end of the input audio.
- FixedSampleRate – Implicit or explicit time. If the output feature's hasTimestamp field is true, the host should read and use the output feature's timestamp. The host may round the timestamp according to the sample rate given in the output descriptor's sampleRate field (which must be non-zero). If the feature's hasTimestamp field is false, its time will be implicitly calculated by incrementing the time of the previous feature according to the sample rate.
- VariableSampleRate – Explicit time. The time of each feature is returned in the timestamp of the feature, and the feature's hasTimestamp field must be set to true. The feature may have a “resolution” given by the sampleRate value for that output, but the host may ignore it. The sampleRate should be zero if there is no sensible resolution.
The hasTimestamp field in the feature structure is only significant when using FixedSampleRate. If the sample type is VariableSampleRate, the host must always read and use the timestamp; if the sample type is OneSamplePerStep, the host should never read the timestamp.
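These rules can be summarised from the host's side as a single function. The sketch below is an illustration rather than SDK code: the SampleType enumerators mirror the names above, times are plain doubles in seconds, and effectiveTime is a hypothetical helper. It applies to features returned from process only, it does not perform the optional rounding for FixedSampleRate, and it leaves the caller to decide what the "previous feature time" is for the very first feature.

```cpp
// Stand-in for the sample type enumeration described above.
enum SampleType { OneSamplePerStep, FixedSampleRate, VariableSampleRate };

// Returns the effective time in seconds of a feature returned from process.
//   outputSampleRate    - sampleRate from the output descriptor
//   hasTimestamp        - the feature's hasTimestamp field
//   featureTimestamp    - the feature's own timestamp (if any)
//   processTimestamp    - the time the host passed to process
//   previousFeatureTime - effective time of the previous feature on this output
double effectiveTime(SampleType type,
                     float outputSampleRate,
                     bool hasTimestamp,
                     double featureTimestamp,
                     double processTimestamp,
                     double previousFeatureTime)
{
    switch (type) {
    case OneSamplePerStep:
        // Implicit: never read the feature's timestamp.
        return processTimestamp;
    case FixedSampleRate:
        // Explicit if present, otherwise one sample period after the last.
        if (hasTimestamp) return featureTimestamp;
        return previousFeatureTime + 1.0 / outputSampleRate;
    case VariableSampleRate:
        // Explicit: always read the feature's timestamp.
        return featureTimestamp;
    }
    return featureTimestamp;
}
```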
The difference between FixedSampleRate and VariableSampleRate is more significant than it may at first appear. An output whose sample type is FixedSampleRate can be said to be “dense” and, for example, may be plotted naturally on a grid if it has a constant number of output bins. An output whose sample type is VariableSampleRate is in principle sparse, may need a different representation, and may be treated differently by a host – even if its features are in fact evenly spaced in time.
A feature may also have a duration, stored in the duration field of the feature structure. If this duration is present, the hasDuration field of the feature must be set to true. Duration is always explicit; unlike the timestamp, it cannot be calculated implicitly. The Feature struct constructor initialises hasDuration to false, so you do not have to set it unless you need it. (Duration was added to the feature in version 2.0 of the Vamp API and SDK; version 1 plugins have no way to specify this property.)
- If your output returns things that are always regularly-spaced in time, and there is one such thing returned for every process block, and the calculation is causal so that results are available immediately, and there is no latency added beyond the length of the processing block, then you probably want to use OneSamplePerStep sample type and omit the feature timestamps.
- If your output returns things that are regularly-spaced in time but the other limitations above are not true, use FixedSampleRate sample type, set the output sample rate to the (perhaps fractional) number of returned features per second, and provide a timestamp with each feature.
- If your output returns anything else, use VariableSampleRate sample type, set the output sample rate to zero unless you know better, and provide a timestamp with each feature.
Note onset detector. Our onset detector as described has two outputs, one for the note onsets and the other for a function describing the likelihood that there is a note onset in a given block of audio. One thing to observe here is that the note onsets should be the first output, so that a host which defaults to using the first available output will get the results it expects, given the stated purpose of the plugin.
The note onsets output is most likely to have VariableSampleRate, with each feature timestamped explicitly with the estimated note onset position. If the plugin's detection method has a limited resolution in audio samples (for example if it can only detect whether an onset happened “during this block” but not where exactly within the block), then the plugin may also wish to report that by setting the sampleRate field of its output descriptor.
The onset detection function output could use either OneSamplePerStep or FixedSampleRate. If the detection function is always calculated directly from the input data block given to the process call, and thus results in one result for each input block, then OneSamplePerStep is the natural sample type. If the input block is subdivided within the plugin for analysis purposes, or the processing block size is otherwise different from the input block size, then it may use FixedSampleRate, giving the input sample rate divided by the processing block size as the output's sampleRate.
Chromagram. If the chromagram plugin demands frequency domain input with a block size equal to the expected FFT size for its constant Q transform, then OneSamplePerStep is the natural sampleType for the chroma data output, because one column of output data is produced for every input block.
Amplitude follower. The output sampleType may depend on the processing method, but as with the onset detector's detection function output, OneSamplePerStep or FixedSampleRate is likely.
The principal method for users and hosts to adjust the working of a Vamp plugin is via parameters. A parameter is a named value that may be set (and retrieved) by the host at any point between instantiation and initialisation of the plugin.
The plugin must provide a getParameterDescriptors method, which returns a list of ParameterDescriptor objects describing the available parameters of the plugin. The host refers to a parameter using the identifier given in its descriptor, and the plugin must provide getParameter and setParameter methods to retrieve and set the current value of the parameter.
Parameter values for Vamp plugins are always floating-point numbers. The parameter descriptor must provide limits for the permissible values through its minValue and maxValue fields. It may also provide a quantization for values, through its isQuantized and quantizeStep fields; if, for example, a parameter has isQuantized set to true and quantizeStep is set to 1.0f, then the host will only provide integer values for this parameter (or, strictly, floating point values that are the closest to integers). The descriptor may also include names to be associated with the quantize steps for a quantized parameter through its valueNames field: a graphical host may use these names to offer the user a list of named options instead of a numeric entry field or controller for the parameter.
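A host (or a defensive plugin) might constrain a requested value to the descriptor's limits and quantization along the following lines. This is a sketch only: ParameterRange is a cut-down stand-in holding the four relevant descriptor fields, constrainParameter is a hypothetical helper, and measuring quantize steps from minValue is an assumption that the text above does not spell out.

```cpp
#include <cmath>

// Cut-down stand-in for the relevant ParameterDescriptor fields.
struct ParameterRange {
    float minValue;
    float maxValue;
    bool isQuantized;
    float quantizeStep;
};

// Snaps `value` to the nearest quantize step (counted from minValue, an
// assumption of this sketch) and then clamps it to [minValue, maxValue].
float constrainParameter(const ParameterRange &r, float value)
{
    if (r.isQuantized && r.quantizeStep > 0.0f) {
        float steps = std::floor((value - r.minValue) / r.quantizeStep + 0.5f);
        value = r.minValue + steps * r.quantizeStep;
    }
    if (value < r.minValue) value = r.minValue;
    if (value > r.maxValue) value = r.maxValue;
    return value;
}
```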
Some plugins may have combinations of parameters that are known to be effective for particular sorts of tasks. For example, an onset detector may have certain parameter settings that work well for music with only soft onsets, other parameters that work well with percussive onsets, and so on. These things can be encapsulated to some degree using programs. A program is simply a name that a host may select or query, such that at most one program may be active at once.
A plugin that supports programs should provide a getPrograms method that returns the names of all known programs, a getCurrentProgram method that returns the name of the currently selected program if any (or an empty string otherwise), and a selectProgram method that configures the plugin for the given program. The plugin is free to rewrite its own parameter values when a new program is selected; it's the host's responsibility to read the new parameter values afterwards if necessary.
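The interaction between programs and parameters might look like this in outline. OnsetDetectorSketch is a hypothetical stand-in class that compiles without the SDK; the "sensitivity" parameter, the program names, and their values are invented for illustration. The method that lists programs is named getPrograms here, matching the name used in the list of fixed plugin properties later in this guide. Clearing the current program when a parameter is set by hand is a design choice of this sketch, not something the guide requires.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in; a real plugin would override the corresponding
// virtual methods inherited from PluginBase.
class OnsetDetectorSketch {
public:
    OnsetDetectorSketch() : m_sensitivity(50.0f) {}

    std::vector<std::string> getPrograms() const {
        std::vector<std::string> names;
        names.push_back("Soft onsets");
        names.push_back("Percussive onsets");
        return names;
    }

    // Empty string means no program is currently selected.
    std::string getCurrentProgram() const { return m_program; }

    // Selecting a program rewrites the plugin's own parameter values.
    void selectProgram(std::string name) {
        if (name == "Soft onsets") m_sensitivity = 80.0f;
        else if (name == "Percussive onsets") m_sensitivity = 40.0f;
        else return;  // unknown program: leave everything unchanged
        m_program = name;
    }

    float getParameter(std::string id) const {
        return (id == "sensitivity") ? m_sensitivity : 0.0f;
    }

    void setParameter(std::string id, float value) {
        if (id == "sensitivity") {
            m_sensitivity = value;
            m_program = "";  // settings no longer match a named program
        }
    }

private:
    float m_sensitivity;
    std::string m_program;
};
```

After calling selectProgram, a host that displays parameter values would call getParameter again for each parameter to pick up the rewritten values.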
One difference between Vamp plugins and most real-time audio processing plugin APIs is that the Vamp plugin may need to make significant aspects of its inputs and outputs dependent upon the parameters used to configure it.
For this reason, a Vamp plugin must be completely configured by the host, with its parameters and programs set, before it is initialised; the parameters cannot be changed afterwards. Furthermore, some properties of the plugin can depend on the values passed in to the initialise function, so the host must query these again after initialisation but before calling process.
A plugin may change the following at any time, up to and including during its initialise function:
- The sampleType and sampleRate for an output (in the OutputDescriptor as returned from getOutputDescriptors).
- The number of bin values a feature associated with an output may have (hasFixedBinCount, binCount, and binNames in the OutputDescriptor).
- The extents of values taken by features associated with an output (hasKnownExtents, minValue, maxValue, isQuantized and quantizeStep in the OutputDescriptor).
- The units of values taken by features associated with an output (unit in the OutputDescriptor).
In addition, the plugin may change its mind about the following at any point before initialise has been called:
- Its preferred input step size and block size (returned from getPreferredStepSize and getPreferredBlockSize).
Although a plugin may change these properties after construction, it should not do so unless it is necessary. The plugin should ensure that these properties have plausible default values right from the moment of construction, so that the host can make reasonable observations about the default or likely behaviour of the plugin without needing to initialise it. If the host does not in fact change any parameters or programs, and supplies the plugin with its preferred step and block size, then any values it has already queried from the plugin should remain valid after initialisation.
The plugin may not change any of the following after construction:
- Its total number of outputs.
- The minimum and maximum number of audio input channels it accepts.
- The input domain it requires audio to be supplied in.
- The set of available parameters returned through getParameterDescriptors, nor anything in any of the ParameterDescriptors themselves.
- The set of available programs returned through getPrograms.
- Any other descriptive data, such as the identifier, name or description text for the plugin itself or for any of its parameters or outputs.
Chromagram. The chromagram plugin produces a series of columns of output data, with each column containing a certain number of constant Q (pitch) bins spanning a range of an octave.
It is desirable to make the bins-per-octave value for the chromagram adjustable as a parameter. This means that the number of bin values declared for each output feature should be variable depending on the state of the plugin parameters.
Also, the processing block size for the chromagram depends upon factors including the bins-per-octave value, as well as on the input sample rate. This means that the preferred block size for the plugin should be variable depending on the sample rate given at construction time, as well as on the plugin parameters. Similarly if the plugin chooses to declare any FixedSampleRate outputs, that rate will also likely depend on the processing block size and therefore on the input sample rate and plugin parameters.
Finally, it is useful to offer column normalization as an option in the plugin, selectable through a parameter. For this to be possible, the extents of feature values should also be variable depending on plugin parameters.
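The dependencies in this example can be sketched as a class whose reported bin count and preferred block size are recomputed from its current state. ChromagramSketch is a hypothetical stand-in, not the actual chromagram plugin. The block size calculation assumes the common constant Q relation Q = 1 / (2^(1/binsPerOctave) - 1), with a transform length of at least Q * sampleRate / minFrequency rounded up to a power of two; the default minimum frequency of 110 Hz is also an assumption, and a real plugin may compute these differently.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical stand-in illustrating properties that depend on the
// construction-time sample rate and on a parameter.
class ChromagramSketch {
public:
    explicit ChromagramSketch(float inputSampleRate)
        : m_sampleRate(inputSampleRate),
          m_binsPerOctave(12),
          m_minFrequency(110.0f) {}

    // Stands in for setParameter("bpo", ...); only valid before initialise.
    void setBinsPerOctave(int bpo) { if (bpo > 0) m_binsPerOctave = bpo; }

    // Would be reported as binCount in the output descriptor:
    // one bin per pitch step within the octave.
    std::size_t getBinCount() const {
        return static_cast<std::size_t>(m_binsPerOctave);
    }

    // Depends on both the sample rate and the bins-per-octave parameter.
    std::size_t getPreferredBlockSize() const {
        double q = 1.0 / (std::pow(2.0, 1.0 / m_binsPerOctave) - 1.0);
        double needed = std::ceil(q * m_sampleRate / m_minFrequency);
        std::size_t size = 1;
        while (static_cast<double>(size) < needed) size *= 2;
        return size;
    }

private:
    float m_sampleRate;
    int m_binsPerOctave;
    float m_minFrequency;
};
```

A host that changes the bins-per-octave parameter would need to query the preferred block size and the output descriptors again before calling initialise, since both may have changed.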