Programmers Guide

The Vamp Audio Analysis Plugin API: A Programmer's Guide

This page is a lightly updated version of the PDF guide accompanying the v2.0 SDK.
This page freely mixes abstract API concepts with concrete C++ classes from the original Vamp plugin SDK. If you are using some other SDK, many of the details may not apply. The concepts and diagrams should still be of interest.

1. Overview

A Vamp plugin is a chunk of compiled program code that carries out analysis of a digital audio signal, returning results in some form other than an audio signal. These results are often called audio features.

Vamp plugins are distributed in shared library files with extension .dll, .so, or .dylib depending on the platform, with one or more plugins in each library. A plugin is usually identified to the computer using the shared library name plus a short text identifier naming the plugin within the library. The plugin cannot be used on its own, but only with a conforming “host” application which loads the plugin from its shared library and calls functions within the plugin code to configure it, supply it with data, and run it.

Block diagram of Vamp plugin. Audio tracks feeding into a plugin which emits multiple possible output types: points, values, and grids.

Plugins are not supplied with the whole of their audio input at once, but instead are fed it as a series of short blocks: in this respect they resemble real-time audio processing plugins, but in all other respects they are executed “off-line”, meaning that they are not expected to run in real-time and are not obliged to return features as they occur.

The Vamp plugin binary interface is defined in C, and the plugin SDK (software development kit) described here is written in C++. Plugin authors are encouraged to use the C++ interfaces in preference to plain C. This document explains the concepts and structures necessary to write Vamp plugins; it is not an API reference, which can be found at http://www.vamp-plugins.org/code-doc/, and it does not cover host programming.

Vamp is not an acronym. It contains letters suggestive of audio, plugins, visualisation and so on, but not in the right order.

Examples

Throughout this document we will use as examples three types of feature extractor that already exist as Vamp plugins. (These are arbitrary real-world examples – not the same as the example plugins provided in the SDK.) We will not consider the processing work necessary to implement these techniques, only describe how they communicate with the host as plugins. These examples will appear occasionally in indented sections like this one. They are:

Note onset detector. This estimates the times at which new note events begin in the signal, and returns those times as results. There are no particular values associated with the times. For the purposes of this example, our onset detector will actually have two outputs – the first will report the times as described above, and the second will report a measure of the likelihood that there is an onset within the current input block.

Chromagram. This example of a visualisation data extractor analyses audio and produces from it a grid of values, with a time step as the X coordinate and a pitch bin within the octave as the Y coordinate. Each value in the grid describes the “strength” of the corresponding pitch in the music within the given time range. Another way to describe this is to say that it returns a single column containing a fixed number of pitch strength values, from each of a series of frames or blocks of input audio.

Amplitude follower. This calculates structurally simple regularly-sampled time-series data from the audio, to be displayed by the host as a data plot or used for some other purpose.

As these examples show, Vamp plugins do not actually do any display or user interaction: they just return data. In most cases, these data are not the final result of the work the host is doing, but are useful for something else – either for a host to display and the user to interact with in some way, or as intermediate calculations for a particular purpose such as semantically meaningful audio editing. Vamp plugins do not do swirly visualisation and are typically not useful in real-time contexts such as interacting with musical performers.

2. What does a Plugin Contain?

A Vamp plugin needs to make a certain amount of information available to the host.

Every plugin, no matter how simple, should provide the following:

The identifier, name, and description of the plugin itself. See Identifiers, Names and Descriptions below.
The name of the maker of the plugin, and the plugin's copyright status and version number.
The input domain which the plugin would like its audio provided in. See Inputs below.
The plugin's preferred step size and block size for audio input. See Step and Block Size below.
The minimum and maximum number of input audio channels the plugin is capable of handling. See Inputs below for discussion of these.
A list of output descriptors that contain information about the structure of the results that the plugin may produce. See Outputs below.
Implementations of standard functions that set up, reset, and run the plugin. See Inputs below for discussion of these.

Some plugins have parameters that can be set to adjust the way they do their processing. These plugins will also need to provide:

A list of parameter descriptors that contain information about the editable parameters of the plugin. The host may use these descriptors to show the user a control window for the plugin, for example. See Parameters below.
Implementations of standard functions that retrieve and set the values of parameters.

A plugin may also have a set of pre-defined configurations that can be offered to the user by name. These are known as programs and a plugin that supports them will also need to provide:

A list of program names. See Programs below.
Implementations of standard functions that retrieve the current program name and select a new program. The C++ base class provides virtual methods to override for these.

Base classes

The Vamp SDK contains one class from which plugin implementation classes should be derived. This class, Plugin, exposes pure virtual methods for most of the accessor and action functions that a plugin class needs to implement. Those that are not directly defined in Plugin are themselves inherited from a further class called PluginBase, which contains virtual methods for things that are not specific to the output structures used in Vamp – plugin name, maker, parameters, program names, etc. These classes, like everything in the SDK, are found in the Vamp namespace.

The Plugin and PluginBase classes also contain a number of data classes that are used when returning bundles of information about features (Feature, FeatureList, FeatureSet), outputs (OutputDescriptor), parameters (ParameterDescriptor) and so on. These will be referred to in the appropriate sections of this document.

3. Identifiers, Names and Descriptions

Vamp uses a combination of “identifier”, “name” and “description” strings to describe several sorts of object. Most obviously, the plugin itself must implement getIdentifier, getName and getDescription methods (inheriting from pure virtual methods in PluginBase) that return textual information about the plugin. Similar data are included as public data members in the ParameterDescriptor and OutputDescriptor classes.

In all of these cases, the purposes of the three strings are:

identifier – This should contain a short string that the host can use to refer to the object, within the immediate surrounding scope. That is, the plugin identifier needs to be unique within the plugin's library; an output descriptor's identifier needs to be unique among output descriptors for the plugin; similarly for parameter descriptors. Identifiers are very limited in the characters they may include: upper and lower case ASCII alphabetical characters, digits 0 to 9, and the dash (“-”) and underscore (“_”) characters only.
name – This is a text that may be shown to the user by the host as the normal label for the object.
description – This is optional, and may contain extra text to describe the purpose of the object in a way that adds to the information in the name. Hosts that show the description to the user will normally do so in addition to the name, so it should not duplicate information already in the name.
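The character rule for identifiers is easy to check mechanically. The following is a minimal sketch of such a check; the function name isValidIdentifier is hypothetical and not part of the SDK, and rejecting the empty string is an assumption that the text above does not state.

```cpp
#include <cassert>
#include <string>

// Returns true if every character of `id` is one that the guide permits in
// an identifier: A-Z, a-z, 0-9, '-' or '_'.
// Assumption: an empty identifier is rejected.
inline bool isValidIdentifier(const std::string &id)
{
    if (id.empty()) return false;
    for (char c : id) {
        const bool ok = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                        (c >= '0' && c <= '9') || c == '-' || c == '_';
        if (!ok) return false;
    }
    return true;
}
```

A plugin author might run a check like this over the plugin, output and parameter identifiers in a unit test.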

4. Inputs

The input to a Vamp plugin is audio data, with one or more channels. The audio is non-interleaved, so the plugin receives a set of pointers to data, one per channel. The plugin can specify how many channels it will accept using its getMinChannelCount and getMaxChannelCount methods.

The number of channels, as well as the block size and step size that will be used when running the plugin, are fixed when the plugin's initialise method is called.

bool initialise(size_t inputChannels, size_t stepSize, size_t blockSize);

If the plugin finds the values supplied to initialise unacceptable, it should return false to indicate that initialisation has failed.
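As an illustration of rejecting unacceptable values, here is a sketch of an initialise implementation for a plugin that can only work with one particular block size. FixedBlockPlugin is a stand-in class written so that the example compiles on its own; it is not derived from Vamp::Plugin, and the specific limits (one or two channels, non-zero step) are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in class, NOT derived from Vamp::Plugin, so that it compiles alone.
class FixedBlockPlugin {
public:
    explicit FixedBlockPlugin(std::size_t requiredBlockSize)
        : m_required(requiredBlockSize), m_channels(0), m_step(0), m_block(0) {}

    std::size_t getMinChannelCount() const { return 1; }
    std::size_t getMaxChannelCount() const { return 2; }
    std::size_t getPreferredBlockSize() const { return m_required; }
    std::size_t getPreferredStepSize() const { return 0; } // no preference

    // Returns false, leaving the plugin unconfigured, if any value is unusable.
    bool initialise(std::size_t inputChannels, std::size_t stepSize,
                    std::size_t blockSize)
    {
        if (inputChannels < getMinChannelCount() ||
            inputChannels > getMaxChannelCount()) return false;
        if (blockSize != m_required) return false; // preference was only a hint
        if (stepSize == 0) return false;
        m_channels = inputChannels;
        m_step = stepSize;
        m_block = blockSize;
        return true;
    }

private:
    std::size_t m_required;
    std::size_t m_channels, m_step, m_block;
};
```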

After initialisation, to supply audio data and run the plugin, the host calls the plugin's process method repeatedly. The process method receives a set of input pointers, and a timestamp.

FeatureSet process(const float *const *inputBuffers, RealTime timestamp);

Each time process is called, it is passed a single block of audio of size in samples equal to the block size that was passed to initialise. The difference in sample count between the input to one process call and that to the next is equal to the step size.

As with channel count, the plugin can influence the step and block size by returning its desired values through its getPreferredStepSize and getPreferredBlockSize methods; unlike channel count, the preferred step and block size are only hints, so you should always check the actual values used in initialise if they are important to your code. You don't have to specify a preference for these if you don't want to: return zero for the host to use its defaults, and see Default Step and Block Sizes below.

The audio may be provided in either time domain or frequency domain form. Time domain audio input is conventional PCM sampled digital audio with a floating-point sample type; frequency domain input is the result of applying a windowed short-time Fourier transform to each input block. The input domain is specified by the plugin using its getInputDomain method.

Examples

Note onset detector. The input domain and preferred step and block sizes are likely to depend on the method used for onset detection. The example plugin in the SDK requires frequency-domain input and can in theory handle any input step or block size. In practice it declares a preference for block size and expects the host to set the step size to something sensible accordingly.

Chromagram. The constant Q transform used for a chromagram needs, as input, the result of a short-time Fourier transform whose size depends on the sample rate, Q factor, and minimum output frequency of the constant Q transform. The chromagram plugin can therefore ask for a frequency-domain input, and make its preferred block size depend on the sample rate it was constructed with and on its bins-per-octave parameter. (See also What Can Depend on a Parameter? below.) It cannot accept a different block size, and its initialise function will fail if provided with one. It may reasonably choose to leave the preferred step size unspecified.

Amplitude follower. This time-domain method is likely to work with any input step and block size, and so will probably leave them unspecified.

Time Domain Input

When a plugin requests time domain input, the host divides the audio input stream up into a series of blocks of equal size, and feeds one to each successive call to process. The process call may then return features derived from that audio input block, according to its whim. The inputBuffers argument to process will point to one array of floats for each input channel. For example, inputBuffers[0][blockSize-1] will be the last audio sample in the current block for the first input channel.
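To make the buffer layout concrete, the sketch below computes the peak absolute sample value of one channel of a block, roughly what an amplitude follower might do inside process. The helper name peakAmplitude is hypothetical; only the shape of inputBuffers is taken from the text above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Largest absolute sample value in one channel of the current block.
// `inputBuffers` has the same shape as the argument to process():
// one pointer per channel, each pointing at blockSize floats.
inline float peakAmplitude(const float *const *inputBuffers,
                           std::size_t channel, std::size_t blockSize)
{
    assert(inputBuffers != nullptr && inputBuffers[channel] != nullptr);
    float peak = 0.0f;
    for (std::size_t i = 0; i < blockSize; ++i) {
        const float a = std::fabs(inputBuffers[channel][i]);
        if (a > peak) peak = a;
    }
    return peak;
}
```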

When all of the audio data has been supplied, the host calls getRemainingFeatures, and the plugin returns any features that are now known and have not yet been returned from earlier process calls.

Diagram showing time domain audio input, divided into fixed-size non-overlapping blocks, and audio feature outputs from a Vamp plugin.

When supplying time domain input, it is most usual for the step size to be equal to the block size as shown above. This means that the plugin is receiving every sample in the audio input exactly once, in a series of contiguous blocks of data. This does not have to be the case – the plugin can return a different value for getPreferredStepSize to that returned from getPreferredBlockSize if for any reason it would prefer to receive overlapping or non-contiguous blocks.

Vamp currently makes no provision for partial input blocks. If the audio input ends in the middle of a block, the host will fill the block with zero values up to the block size.
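Although this guide does not cover host programming, it can help to see what the blocking described above amounts to. The following sketch slices a single-channel signal into blocks and zero-fills the final partial block. It is an illustration only, not the code of any real host; the function name makeBlocks is hypothetical, and it assumes that the first block starts at sample zero and that blocks are produced for as long as the block start lies within the signal.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splits `signal` into blocks of blockSize samples whose start positions are
// stepSize samples apart. Samples beyond the end of the signal are zero.
inline std::vector<std::vector<float>>
makeBlocks(const std::vector<float> &signal,
           std::size_t stepSize, std::size_t blockSize)
{
    assert(stepSize > 0 && blockSize > 0);
    std::vector<std::vector<float>> blocks;
    for (std::size_t start = 0; start < signal.size(); start += stepSize) {
        std::vector<float> block(blockSize, 0.0f); // zero-filled to begin with
        for (std::size_t i = 0; i < blockSize && start + i < signal.size(); ++i) {
            block[i] = signal[start + i];
        }
        blocks.push_back(block);
    }
    return blocks;
}
```

With a step size equal to the block size this yields contiguous blocks; with a smaller step size the blocks overlap, and with a larger one samples are skipped.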

Frequency Domain Input

Diagram showing frequency domain audio input, with overlapping blocks pre-processed by the host using windowed short-time Fourier transforms.

If the plugin requests frequency domain input, each block of audio input is processed by the host using a windowed short time Fourier transform before being supplied to the process function.

In this situation, it is most usual for the input blocks to overlap – that is, for the step size to be half of the block size, or less. This is because the original time-domain data needs to be shaped using a cosine or similar window before the Fourier transform is applied, in order to avoid generating spectral noise because of the discontinuities at the edges of the input blocks. This windowing is the host's responsibility, but the plugin needs to be aware that it will happen so that it can choose sensible preferred step and block sizes.

The plugin does not get any control over what shape of window is used, or any other details of the time- to frequency-domain processing apart from the step and block size. If more control is needed, you will need to ask for time domain input and carry out the processing in the plugin instead.

When receiving frequency domain input, the inputBuffers argument to process will point to one array of floats for each input channel, as for time domain input, but each array has a particular layout. Each channel contains blockSize+2 floats, which are alternately the real and imaginary components of the blockSize/2+1 complex output bins of the Fourier transform.

For example:

inputBuffers[0][0] and inputBuffers[0][1] contain the real and imaginary components of the DC bin for this block in the first input channel. The imaginary component for this bin should be zero.
inputBuffers[0][2] and inputBuffers[0][3] contain the real and imaginary components of the bin with frequency sampleRate / blockSize for the first input channel.
inputBuffers[0][blockSize] and inputBuffers[0][blockSize+1] contain the real and imaginary components of the bin with “Nyquist” frequency sampleRate / 2. Again, this imaginary component should be zero.
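The layout above can be wrapped in a few small accessors. This is a sketch with hypothetical helper names (binCount, binValue, binFrequency); it assumes an even block size and takes a pointer to one channel's data, such as inputBuffers[0].

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Number of complex bins in a frequency-domain block: DC up to Nyquist.
inline std::size_t binCount(std::size_t blockSize)
{
    return blockSize / 2 + 1;
}

// Complex value of bin `bin`: real part at index 2*bin, imaginary at 2*bin+1.
inline std::complex<float> binValue(const float *channel, std::size_t bin)
{
    return std::complex<float>(channel[2 * bin], channel[2 * bin + 1]);
}

// Centre frequency in Hz of bin `bin`.
inline float binFrequency(std::size_t bin, float sampleRate,
                          std::size_t blockSize)
{
    assert(blockSize > 0);
    return (static_cast<float>(bin) * sampleRate) /
           static_cast<float>(blockSize);
}
```

For example, the magnitude of a bin can then be obtained with std::abs(binValue(inputBuffers[0], k)).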

Default Step and Block Sizes

If the plugin does not care about either the step or block size, it should return zero as its preference.

If the plugin returns zero as its block size preference, the host will pick a block size that is practical for its own processing purposes. For time domain inputs this could reasonably be almost anything, from relatively small (e.g. 512) to huge (e.g. the length in sample frames of the entire audio file that the host is processing). For frequency domain inputs of course the host can reasonably be expected to use some block size that is generally considered appropriate for block-by-block short-time Fourier transform processing, such as 1024 or 2048.

If the plugin returns zero as its step size preference, the host will typically use a step size equal to the block size for time domain inputs, or half the block size for frequency domain inputs.

In either case, the host may alternatively offer the choice of step or block size to the user.
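The defaulting rules in this section can be summarised as a small function. This is a sketch of typical host behaviour as described above, not a requirement on hosts; the names BlockSizes and chooseSizes are hypothetical, and hostDefaultBlock stands for whatever block size the host considers practical.

```cpp
#include <cassert>
#include <cstddef>

struct BlockSizes {
    std::size_t stepSize;
    std::size_t blockSize;
};

// Resolves a plugin's preferences (zero meaning "no preference") into the
// values a host might pass to initialise().
inline BlockSizes chooseSizes(std::size_t preferredStep,
                              std::size_t preferredBlock,
                              bool frequencyDomain,
                              std::size_t hostDefaultBlock)
{
    assert(hostDefaultBlock > 0);
    BlockSizes s;
    s.blockSize = (preferredBlock != 0) ? preferredBlock : hostDefaultBlock;
    if (preferredStep != 0) {
        s.stepSize = preferredStep;
    } else {
        // Typical defaults: half-overlap for frequency domain, none otherwise.
        s.stepSize = frequencyDomain ? s.blockSize / 2 : s.blockSize;
    }
    return s;
}
```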

5. Outputs

Feature Structure

A plugin may return features in two places: from the process call, and from getRemainingFeatures. The process call is made repeatedly to provide the plugin with input data; see Inputs above. When all of the input has been consumed, getRemainingFeatures is called once, and the plugin should return any other features that it has computed or can now compute but has not yet returned.

Diagram showing possible feature structures at the outputs of a Vamp plugin.

The return type from process and getRemainingFeatures is called FeatureSet. This is an STL map whose key is an output number and whose value is a FeatureList, which is an STL vector of Feature objects. The use of a FeatureList allows the plugin to return features with more than one timestamp from a single process call, or to return all features for the entire audio input in a single FeatureSet from getRemainingFeatures.

A Feature has an optional timestamp and duration (see “Sample Types and Timestamps”, below), a vector of zero or more values, and an optional label. Note that even a Feature with zero values, no timestamp, and no label could still be a valid feature; it may indicate that “something happened in the block of audio passed to this process call”, with the interpretation of “something” depending on which output of the plugin was returning the feature.

A lot about the interpretation of returned features depends on which output the feature is associated with. A plugin has a fixed number of outputs, and it must provide a getOutputDescriptors method that returns data about all of them, as a vector of OutputDescriptor objects. The descriptor with index zero in this vector contains information about the output whose values are found with a key of zero in the feature sets returned by the plugin, and so on.
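The nesting of FeatureSet, FeatureList and Feature may be easier to follow in code. The sketch below uses simplified stand-in types with the same shape as those described above, so that it compiles without the SDK; in particular the timestamp is a plain number of seconds here, whereas the SDK uses its RealTime type. The function makeOnsetResult is hypothetical and shows what the two-output onset detector from the examples might return from one process call.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Simplified stand-ins for the SDK's Feature, FeatureList and FeatureSet.
struct Feature {
    bool hasTimestamp = false;
    double timestamp = 0.0;       // stand-in: seconds (the SDK uses RealTime)
    std::vector<float> values;    // zero or more values
    std::string label;            // optional
};
typedef std::vector<Feature> FeatureList;
typedef std::map<int, FeatureList> FeatureSet; // key is the output number

// Output 0: onset times, with no values. Output 1: one likelihood per block.
inline FeatureSet makeOnsetResult(bool onsetFound, double onsetTime,
                                  float likelihood)
{
    FeatureSet fs;
    if (onsetFound) {
        Feature onset;            // a valid feature with zero values
        onset.hasTimestamp = true;
        onset.timestamp = onsetTime;
        fs[0].push_back(onset);
    }
    Feature df;                   // no timestamp set on this feature
    df.values.push_back(likelihood);
    fs[1].push_back(df);
    return fs;
}
```

Note that an output with nothing to report for a given block simply has no entry in the returned map.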

Output Descriptors

For a plugin to be of any use, it must provide at least one OutputDescriptor (that is, have at least one output). This contains all of the information provided by the plugin about the “meaning” of values associated with an output of the plugin. OutputDescriptors for all of the plugin's outputs, in order, must be returned by a call to getOutputDescriptors.

The OutputDescriptor contains:

The identifier, name and description of this output (see Identifiers, Names and Descriptions above).
Optionally, the unit (as a string) of all of the values associated with this output.
Optionally, the number of value “bins” that features associated with this output have (via hasFixedBinCount and binCount). The bin count might be zero for “time only” features like the simple onset detector, one for “time and value” features like an amplitude tracker, or many for “column” features like a chromagram that have a series of a fixed number of values in each feature. Some features might have a variable number of values, and they will need to leave hasFixedBinCount false.
Optionally, if hasFixedBinCount is true, names for the value bins (in binNames). For example, a chromagram plugin might have bin names describing the frequencies whose strengths are represented in the bin values.
Optionally, the extents for values associated with this output (i.e. their minimum and maximum values), via hasKnownExtents, minValue and maxValue. Like the unit, these apply to values in all bins, if the output has more than one bin.
Optionally, the quantization of the values associated with this output, via isQuantized and quantizeStep. A feature whose values can only fall within a certain subset of real numbers (for example, a feature whose values are always integers) may wish to set these. They are analogous to the quantization for parameters described in Parameters below, except that it is not possible to associate names with the quantize steps as it is for parameters.
The sample type and rate for the output, via sampleType and sampleRate. See Sample Types and Timestamps below for discussion of these.
Whether the features associated with this output are expected to have durations (via hasDuration). (This field was added to the OutputDescriptor in version 2.0 of the Vamp API and SDK.) See Sample Types and Timestamps below for discussion of this.
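As an illustration of these fields, the sketch below fills in descriptors for the two outputs of the onset detector used as an example in this guide. The OutputDescriptor struct here is a stand-in that only mirrors the field names listed above, so that the example compiles without the SDK; the identifier, name and description strings are invented for the example, and the sample types chosen are discussed in Sample Types and Timestamps.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum SampleType { OneSamplePerStep, FixedSampleRate, VariableSampleRate };

// Stand-in mirroring the fields described in this section.
struct OutputDescriptor {
    std::string identifier, name, description, unit;
    bool hasFixedBinCount = false;
    std::size_t binCount = 0;
    std::vector<std::string> binNames;
    bool hasKnownExtents = false;
    float minValue = 0.0f, maxValue = 0.0f;
    bool isQuantized = false;
    float quantizeStep = 0.0f;
    SampleType sampleType = OneSamplePerStep;
    float sampleRate = 0.0f;
    bool hasDuration = false;
};

inline std::vector<OutputDescriptor> makeOnsetOutputs()
{
    std::vector<OutputDescriptor> outputs;

    OutputDescriptor onsets;              // index 0: "time only" features
    onsets.identifier = "onsets";
    onsets.name = "Note Onsets";
    onsets.description = "Estimated times at which new notes begin";
    onsets.hasFixedBinCount = true;
    onsets.binCount = 0;
    onsets.sampleType = VariableSampleRate;
    onsets.sampleRate = 0.0f;             // no particular resolution
    outputs.push_back(onsets);

    OutputDescriptor df;                  // index 1: one value per block
    df.identifier = "detection-function";
    df.name = "Onset Detection Function";
    df.description = "Likelihood of an onset within each input block";
    df.hasFixedBinCount = true;
    df.binCount = 1;
    df.sampleType = OneSamplePerStep;
    outputs.push_back(df);

    return outputs;
}
```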

Sample Types and Timestamps

Every feature that is returned from a Vamp plugin has a time associated with it. This is the time at which the feature is considered to “start”, as a fractional number of seconds since the start of the audio input.

The time may be explicit – stored in the timestamp of the feature structure itself – or implicit – omitted from the feature itself and instead deduced by the host on the basis of the sample type and rate defined for the output in which the feature is returned. Whether the time is implicit or explicit depends on the sampleType field of the output and the hasTimestamp field of the feature.

The permitted sampleType values for an output are:

OneSamplePerStep – Implicit time. The effective time of any feature returned by process for that output is the same as the time that was passed in to the process function by the host. The plugin should not set a timestamp on the output feature, and should set the feature's hasTimestamp field to false. The host should not read the timestamp from the output feature, even if the feature's hasTimestamp field is erroneously found to be true. Any features returned from getRemainingFeatures will all be effectively timed to the end of the input audio.
FixedSampleRate – Implicit or explicit time. If the output feature's hasTimestamp field is true, the host should read and use the output feature's timestamp. The host may round the timestamp according to the sample rate given in the output descriptor's sampleRate field (which must be non-zero). If the feature's hasTimestamp field is false, its time will be implicitly calculated by incrementing the time of the previous feature according to the sample rate.
VariableSampleRate – Explicit time. The time of each feature is returned in the timestamp of the feature, and the feature's hasTimestamp field must be set to true. The feature may have a “resolution” given by the sampleRate value for that output, but the host may ignore it. The sampleRate should be zero if there is no sensible resolution.

The hasTimestamp field in the feature structure is only significant when using FixedSampleRate. If the sample type is VariableSampleRate, the host must always read and use the timestamp; if the sample type is OneSamplePerStep, the host should never read the timestamp.
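Seen from the host's side, the rules above amount to the following logic. This is a sketch, not the code of any real host: TimeResolver is a hypothetical helper, times are plain seconds rather than the SDK's RealTime type, the optional rounding to the output sample rate is omitted, and treating the first implicitly timed FixedSampleRate feature as occurring at time zero is an assumption.

```cpp
#include <cassert>

enum SampleType { OneSamplePerStep, FixedSampleRate, VariableSampleRate };

// Works out the effective time, in seconds, of successive features on one
// output.
struct TimeResolver {
    SampleType sampleType;
    float sampleRate;
    bool havePrevious;
    double previousTime;

    TimeResolver(SampleType type, float rate)
        : sampleType(type), sampleRate(rate),
          havePrevious(false), previousTime(0.0) {}

    // blockTime is the timestamp that was passed to process() for the block.
    double resolve(bool hasTimestamp, double timestamp, double blockTime)
    {
        double t = 0.0;
        switch (sampleType) {
        case OneSamplePerStep:          // implicit: ignore the feature's own time
            t = blockTime;
            break;
        case VariableSampleRate:        // explicit: always use the timestamp
            t = timestamp;
            break;
        case FixedSampleRate:
            if (hasTimestamp) {
                t = timestamp;
            } else if (havePrevious) {
                assert(sampleRate > 0.0f);
                t = previousTime + 1.0 / sampleRate;
            } else {
                t = 0.0;                // assumption: first feature at time zero
            }
            break;
        }
        havePrevious = true;
        previousTime = t;
        return t;
    }
};
```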

The difference between FixedSampleRate and VariableSampleRate is more significant than it may at first appear. An output whose sample type is FixedSampleRate can be said to be “dense” and, for example, may be plotted naturally on a grid if it has a constant number of output bins. An output whose sample type is VariableSampleRate is in principle sparse, may need a different representation, and may be treated differently by a host – even if its features are in fact evenly spaced in time.

A feature may also have a duration, stored in the duration field of the feature structure. If this duration is present, the hasDuration field of the feature must be set to true. Duration is always explicit; unlike the timestamp, it cannot be calculated implicitly. The Feature struct constructor initialises hasDuration to false, so you do not have to set it unless you need it. (Duration was added to the feature in version 2.0 of the Vamp API and SDK; version 1 plugins have no way to specify this property.)

Rules of Thumb for Choosing Sample Types

If your output returns things that are always regularly-spaced in time, and there is one such thing returned for every process block, and the calculation is causal so that results are available immediately, and there is no latency added beyond the length of the processing block, then you probably want to use OneSamplePerStep sample type and omit the feature timestamps.
If your output returns things that are regularly-spaced in time but the other limitations above are not true, use FixedSampleRate sample type, set the output sample rate to the (perhaps fractional) number of returned features per second, and provide a timestamp with each feature.
If your output returns anything else, use VariableSampleRate sample type, set the output sample rate to zero unless you know better, and provide a timestamp with each feature.
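These rules of thumb can be written down as a decision function. The sketch below is only a restatement of the three rules; the function name and the way the conditions are grouped into three flags are choices made for the example.

```cpp
#include <cassert>

enum SampleType { OneSamplePerStep, FixedSampleRate, VariableSampleRate };

// regularlySpaced: features are always evenly spaced in time.
// oneFeaturePerBlock: exactly one feature is returned per process block.
// causalNoExtraLatency: results are available immediately, with no latency
//                       beyond the length of the processing block.
inline SampleType chooseSampleType(bool regularlySpaced,
                                   bool oneFeaturePerBlock,
                                   bool causalNoExtraLatency)
{
    if (regularlySpaced && oneFeaturePerBlock && causalNoExtraLatency) {
        return OneSamplePerStep;    // omit feature timestamps
    }
    if (regularlySpaced) {
        return FixedSampleRate;     // set sampleRate, timestamp each feature
    }
    return VariableSampleRate;      // timestamp each feature
}
```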

Examples

Note onset detector. Our onset detector as described has two outputs, one for the note onsets and the other for a function describing the likelihood that there is a note onset in a given block of audio. One thing to observe here is that the note onsets should be the first output, so that a host which defaults to using the first available output will get the results it expects, given the stated purpose of the plugin.

The note onsets output is most likely to have VariableSampleRate, with each feature timestamped explicitly with the estimated note onset position. If the plugin's detection method has a limited resolution in audio samples (for example if it can only detect whether an onset happened “during this block” but not where exactly within the block), then the plugin may also wish to report that by setting the sampleRate field of its output descriptor.

The onset detection function output could use either OneSamplePerStep or FixedSampleRate. If the detection function is always calculated directly from the input data block given to the process call, and thus results in one result for each input block, then OneSamplePerStep is the natural sample type. If the input block is subdivided within the plugin for analysis purposes, or the processing block size is otherwise different from the input block size, then it may use FixedSampleRate, giving the input sample rate divided by the processing block size as the output's sampleRate.

Chromagram. If the chromagram plugin demands frequency domain input with a block size equal to the expected FFT size for its constant Q transform, then OneSamplePerStep is the natural sampleType for the chroma data output, because one column of output data is produced for every input block.

Amplitude follower. The output sampleType may depend on the processing method, but as with the onset detector's detection function output, OneSamplePerStep or FixedSampleRate is likely.

6. Configuration

Parameters

The principal method for users and hosts to adjust the working of a Vamp plugin is via parameters. A parameter is a named value that may be set (and retrieved) by the host at any point between instantiation and initialisation of the plugin.

The plugin must provide a getParameterDescriptors method, which returns a list of ParameterDescriptor objects describing the available parameters of the plugin. The host refers to a parameter using the identifier given in its descriptor, and the plugin must provide getParameter and setParameter methods to retrieve and set the current value of the parameter.

Parameter values for Vamp plugins are always floating-point numbers. The parameter descriptor must provide limits for the permissible values through its minValue and maxValue fields. It may also provide a quantization for values, through its isQuantized and quantizeStep fields; if, for example, a parameter has isQuantized set to true and quantizeStep is set to 1.0f, then the host will only provide integer values for this parameter (or, strictly, floating point values that are the closest to integers). The descriptor may also include names to be associated with the quantize steps for a quantized parameter through its valueNames field: a graphical host may use these names to offer the user a list of named options instead of a numeric entry field or controller for the parameter.
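A host, or a defensive plugin, might bring an arbitrary value into line with a descriptor as sketched below. The ParameterDescriptor struct is a stand-in holding only the four fields used here, and constrainValue is a hypothetical helper. That quantize steps are counted from minValue is an assumption; the text above does not say where the quantization grid is anchored.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Stand-in with only the fields this example needs.
struct ParameterDescriptor {
    float minValue;
    float maxValue;
    bool isQuantized;
    float quantizeStep;
};

// Clamps `value` to [minValue, maxValue] and, for quantized parameters,
// snaps it to the nearest step (assumption: steps are counted from minValue).
inline float constrainValue(const ParameterDescriptor &d, float value)
{
    assert(d.minValue <= d.maxValue);
    float v = std::min(std::max(value, d.minValue), d.maxValue);
    if (d.isQuantized && d.quantizeStep > 0.0f) {
        const float steps = std::round((v - d.minValue) / d.quantizeStep);
        v = d.minValue + steps * d.quantizeStep;
        v = std::min(v, d.maxValue);
    }
    return v;
}
```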

Programs

Some plugins may have combinations of parameters that are known to be effective for particular sorts of tasks. For example, an onset detector may have certain parameter settings that work well for music with only soft onsets, other parameters that work well with percussive onsets, and so on. These things can be encapsulated to some degree using programs. A program is simply a name that a host may select or query, such that at most one program may be active at once.

A plugin that supports programs should provide a getPrograms method that returns the names of all known programs, a getCurrentProgram method that returns the name of the currently selected program if any (or an empty string otherwise), and a selectProgram method that configures the plugin for the given program. The plugin is free to rewrite its own parameter values when a new program is selected; it's the host's responsibility to read the new parameter values afterwards if necessary.
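The sketch below shows how selecting a program might rewrite a parameter. OnsetPluginSketch is a stand-in class, not derived from the SDK base classes; the "sensitivity" parameter, its values and the program names are all invented for the example. The program-listing method is spelled getPrograms here, as in the list of fixed properties later in this guide.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in class, NOT derived from Vamp::Plugin, so that it compiles alone.
class OnsetPluginSketch {
public:
    OnsetPluginSketch() : m_sensitivity(50.0f) {}

    std::vector<std::string> getPrograms() const
    {
        return {"Soft onsets", "Percussive onsets"};
    }

    // Empty string means that no program has been selected.
    std::string getCurrentProgram() const { return m_program; }

    // Selecting a program rewrites the plugin's own parameter values.
    void selectProgram(const std::string &name)
    {
        if (name == "Soft onsets") {
            m_sensitivity = 80.0f;
        } else if (name == "Percussive onsets") {
            m_sensitivity = 30.0f;
        } else {
            return; // unknown program name: leave everything unchanged
        }
        m_program = name;
    }

    float getParameter(const std::string &identifier) const
    {
        return identifier == "sensitivity" ? m_sensitivity : 0.0f;
    }

private:
    std::string m_program;
    float m_sensitivity;
};
```

After calling selectProgram, a host that displays parameter values would call getParameter again to pick up the new settings.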

What Can Depend on a Parameter?

One difference between Vamp plugins and most real-time audio processing plugin APIs is that the Vamp plugin may need to make significant aspects of its inputs and outputs dependent upon the parameters used to configure it.

For this reason, a Vamp plugin must be completely configured by the host, with its parameters and programs set, before it is initialised; the parameters cannot be changed afterwards. Furthermore, some properties of the plugin can depend on the values passed in to the initialise function, so the host must query these again after initialisation but before calling process.

A plugin may change the following at any time, up to and including during its initialise function:

The sampleType and sampleRate for an output (in the OutputDescriptor as returned from getOutputDescriptors).
The number of bin values a feature associated with an output may have (hasFixedBinCount, binCount, and binNames in the OutputDescriptor).
The extents of values taken by features associated with an output (hasKnownExtents, minValue, maxValue, isQuantized and quantizeStep in the OutputDescriptor).
The units of values taken by features associated with an output (unit in the OutputDescriptor).

In addition, the plugin may change its mind about the following at any point before initialise has been called:

Its preferred input step size and block size (returned from getPreferredStepSize and getPreferredBlockSize).

Although a plugin may change these properties after construction, it should not do so unless it is necessary. The plugin should ensure that these properties have plausible default values right from the moment of construction, so that the host can make reasonable observations about the default or likely behaviour of the plugin without needing to initialise it. If the host does not in fact change any parameters or programs, and supplies the plugin with its preferred step and block size, then any values it has already queried from the plugin should remain valid after initialisation.

The plugin may not change any of the following after construction:

Its total number of outputs.
The minimum and maximum number of audio input channels it accepts.
The input domain it requires audio to be supplied in.
The set of available parameters returned through getParameterDescriptors, nor anything in any of the ParameterDescriptors themselves.
The set of available programs returned through getPrograms.
Any other descriptive data, such as the identifier, name or description text for the plugin itself or for any of its parameters or outputs.

Example

Chromagram. The chromagram plugin produces a series of columns of output data, with each column containing a certain number of constant Q (pitch) bins spanning a range of an octave.

It is desirable to make the bins-per-octave value for the chromagram adjustable as a parameter. This means that the number of bin values declared for each output feature should be variable depending on the state of the plugin parameters.

Also, the processing block size for the chromagram depends upon factors including the bins-per-octave value, as well as on the input sample rate. This means that the preferred block size for the plugin should be variable depending on the sample rate given at construction time, as well as on the plugin parameters. Similarly if the plugin chooses to declare any FixedSampleRate outputs, that rate will also likely depend on the processing block size and therefore on the input sample rate and plugin parameters.

Finally, it is useful to offer column normalization as an option in the plugin, selectable through a parameter. For this to be possible, the extents of feature values should also be variable depending on plugin parameters.
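To give a feel for these dependencies, the sketch below derives a preferred block size and an output bin count from the sample rate and a bins-per-octave parameter. It uses the textbook constant Q relation Q = 1 / (2^(1/b) - 1) and rounds the longest analysis window up to a power of two; this is an assumption made for illustration and not necessarily how any real chromagram plugin computes its sizes. The function names and the minFrequency argument are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Smallest power of two that is not less than n.
inline std::size_t nextPowerOfTwo(std::size_t n)
{
    std::size_t p = 1;
    while (p < n) p *= 2;
    return p;
}

// Block size needed to resolve the lowest bin: the window length in samples
// is Q * sampleRate / minFrequency, rounded up to a power of two.
inline std::size_t preferredBlockSize(double sampleRate, int binsPerOctave,
                                      double minFrequency)
{
    assert(sampleRate > 0.0 && binsPerOctave > 0 && minFrequency > 0.0);
    const double q = 1.0 / (std::pow(2.0, 1.0 / binsPerOctave) - 1.0);
    const double longestWindow = std::ceil(q * sampleRate / minFrequency);
    return nextPowerOfTwo(static_cast<std::size_t>(longestWindow));
}

// One output bin per pitch bin within the octave.
inline std::size_t chromaBinCount(int binsPerOctave)
{
    assert(binsPerOctave > 0);
    return static_cast<std::size_t>(binsPerOctave);
}
```

Under these assumptions, doubling the bins-per-octave parameter from 12 to 24 at 44100 Hz with a 110 Hz minimum frequency doubles the preferred block size from 8192 to 16384, which is why the host must query the preferred sizes again after setting parameters.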
