Vocobox : Voice Controller for Digital Instruments





Vocobox intends to provide singers with software that turns the voice into a musical controller. Voice features (pitch, volume, ...) are used to control external software or hardware producing music.

We want to build a voice-to-instrument application rather than an audio-to-MIDI application. For this reason, we found it sufficient to control the synthesizer in terms of frequency and amplitude, without explicitly defining note on/off events. This makes mapping easier, and the result is good enough.

If you want to go straight to example output, go here.

VOCOBOX 1.0 (01/01/2015)

At this stage we are mainly evaluating pitch detection algorithms using the Human Voice Dataset, a dataset we built to gather examples of singers' voices (e.g. all notes in the voice range). We define scores such as pitch detection latency and precision and compare them graphically.

We also evaluate pitch detection in real time by recording the voice with a microphone as input and by generating a synthesizer sound as output.

See the Component section of this document to learn more about algorithms used in this project.


To get notified of future versions, simply follow Vocobox on Twitter or here on GitHub.

We are currently having fun with sequence detection.

Collaborators are most welcome! See end of this page.


Controlling Synthesizers with CSV files

Our first attempt to analyze voice signal was written in R using Seewave and Aubio via an R binding written for the experiment.

To control a JSyn synthesizer, we export frequency and amplitude change commands in two CSV files. Each file contains two columns: the first is the elapsed time since the song started, the second is the new value (frequency changes for pitch.csv, amplitude changes for envelope.csv). Note that frequency and amplitude can change independently.
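The two-column layout described above can be read with a few lines of Java. This is only a sketch under assumptions: it supposes comma-separated columns and no header row, which this page does not specify, and CommandCsv, TimedValue and parse are hypothetical names rather than Vocobox classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Vocobox: parses "elapsed,value" lines. */
final class CommandCsv {

    /** One change command: elapsed time since the song started, and the new value. */
    static final class TimedValue {
        final double time;
        final double value;

        TimedValue(double time, double value) {
            this.time = time;
            this.value = value;
        }
    }

    /** Parses lines of the assumed form "elapsed,value"; blank lines are skipped. */
    static List<TimedValue> parse(List<String> lines) {
        List<TimedValue> events = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] cols = line.split(",");
            events.add(new TimedValue(
                    Double.parseDouble(cols[0].trim()),
                    Double.parseDouble(cols[1].trim())));
        }
        return events;
    }

    public static void main(String[] args) {
        // Made-up sample rows standing in for the content of pitch.csv.
        List<TimedValue> pitch = parse(Arrays.asList("0.00,220.0", "0.25,246.9"));
        for (TimedValue e : pitch) {
            System.out.println(e.time + " -> " + e.value);
        }
    }
}
```

Because pitch.csv and envelope.csv share the same layout, the same parser would serve both files, and the two event lists can then be replayed independently.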

Having the original wav file available makes it possible to play the audio source in the background while executing command events.

To run synthesizer control based on CSV files, see VocoboxControllerCsv.

Controlling Synthesizers with WAV files

The pitch and amplitude change events of a wav file are sent to a synthesizer via its sendFrequency() / sendAmplitude() methods. In these demonstrations, we use JSyn-based synthesizers. As direct control of the oscillator's amplitude from the input file is good enough to mimic notes, we do not need additional computation to define note on and note off.

Below are a few synthesized sounds and the wav files that control them.
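The control flow described above can be sketched as follows. Only the method names sendFrequency() and sendAmplitude() come from this page; the parameter types, the Synth interface, the Event class and the LoggingSynth stand-in are assumptions made for illustration, and event timing is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class ControlSketch {

    /** Assumed shape of the synthesizer contract; only the two method names come from the text. */
    interface Synth {
        void sendFrequency(float hertz);
        void sendAmplitude(float level);
    }

    /** Hypothetical analysis event: a pitch and an amplitude observed at the same instant. */
    static final class Event {
        final float hertz;
        final float level;

        Event(float hertz, float level) {
            this.hertz = hertz;
            this.level = level;
        }
    }

    /** Forwards every event to the synth. No note on/off is derived: a zero amplitude simply silences the oscillator. */
    static void play(List<Event> events, Synth synth) {
        for (Event e : events) {
            synth.sendFrequency(e.hertz);
            synth.sendAmplitude(e.level);
        }
    }

    /** Stand-in synth that records commands as text instead of producing sound. */
    static final class LoggingSynth implements Synth {
        final List<String> log = new ArrayList<>();

        @Override
        public void sendFrequency(float hertz) {
            log.add("f=" + hertz);
        }

        @Override
        public void sendAmplitude(float level) {
            log.add("a=" + level);
        }
    }

    public static void main(String[] args) {
        LoggingSynth synth = new LoggingSynth();
        List<Event> events = new ArrayList<>();
        events.add(new Event(261.6f, 0.8f)); // a sung note
        events.add(new Event(261.6f, 0.0f)); // silence: plays the role of "note off"
        play(events, synth);
        System.out.println(synth.log);
    }
}
```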

Input | Do-re-mi piano source | Do-re-mi voice source
Output | Do-re-mi synth controlled by piano | Do-re-mi synth controlled by voice

See this examples folder for more input/output/chart results.

To run synthesizer control based on a wav file, see VocoboxControllerFileRead.

Controlling Synthesizers in real time with available audio inputs (microphone, lines)

When the application starts, TarsosDSP lists the available audio sources and a pitch estimation algorithm is proposed. We found that Yin performs best. Running live synthesizer control shows that pitch detection is fairly efficient.

To run synthesizer control based on live voice, see VocoboxControllerMic.

Benchmark Pitch Detection algorithms on note datasets

This document explains how we use the Human Voice Dataset (a series of wav files containing sung notes) to evaluate pitch detection algorithms on isolated notes.


Audio analysis

Audio signal analysis is powered by TarsosDSP. In our tests, its Yin implementation outperformed the other pitch detection algorithms we tried, and it has become the default implementation for the voice analysis module.

Vocobox delivers pitch detection through the following analyzers:

Analyzer | Comment
VoiceInputStreamListen | Analyses the audio signal from available inputs (microphones, but also lines, etc.). When a Jack server is running, the audio sources made available by Jack appear in the source list.
VoiceFileRead | Analyses the audio signal from (mono) wav files. After reading, the audio analysis events are collected and can be sent to a synthesizer.

Note that you can process FFT using Spectro Edit, as provided by Jzy3d Spectro. It is used below to draw note signal analysis. JSyn and TarsosDSP also provide FFT processing.


Synthesizers powered by JSyn are available in a dedicated Maven module. The implementations below are basic; much more can be done with JSyn.

Synthesizer | Comment
JsynMonoscilloSynth | A single oscillator.
JsynMonoscilloRampSynth | A single oscillator with a LinearRamp on frequency and amplitude change commands, which handles numerous pitch / amplitude change events without audio artifacts.
JsynOcclusiveNoiseSynth | A synthesizer that uses a sound with no defined frequency (here, white noise) when the confidence value of pitch detection is below a threshold. It allows a kind of audio debugging of pitch detection. Abrupt tone changes make the synthesizer sound harsh, but smooth changes in tone balance could produce interesting effects.
JsynCircuitSynth | A synthesizer based on JSyn Circuit, allowing easier abstraction of groups of synthesizer elements. Here, we use the circuit SynthCircuitBlaster, which is derived from the JSyn examples. Note that the circuit provides its control panel to the Vocobox UI.
JsynOscilloSpectroHarpSynth | An experimental synthesizer based on FFT analysis of a file. A file is played, its FFT is processed, and the energy of each frequency band defines the amplitude of one of the 93 oscillators covering 0-4 kHz.
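The smoothing idea behind JsynMonoscilloRampSynth can be illustrated without JSyn. The sketch below is a plain-Java linear ramp written for this explanation; it is not JSyn's LinearRamp unit, and the class and method names are hypothetical. Instead of jumping to each new value, the parameter moves toward it in equal steps over a number of samples, which is what avoids clicks when many change commands arrive.

```java
/** Hypothetical plain-Java linear ramp, for illustration only (not the JSyn LinearRamp unit). */
final class RampSketch {
    private double current;
    private double target;
    private double stepPerSample;

    RampSketch(double initial) {
        this.current = initial;
        this.target = initial;
    }

    /** Requests a new value, to be reached linearly over the given number of samples. */
    void setTarget(double newTarget, int samples) {
        target = newTarget;
        stepPerSample = (newTarget - current) / Math.max(1, samples);
    }

    /** Advances by one sample and returns the smoothed value. */
    double next() {
        if (current != target) {
            current += stepPerSample;
            // Snap to the target once the step reaches or overshoots it.
            boolean reached = stepPerSample > 0 ? current >= target : current <= target;
            if (reached) {
                current = target;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        RampSketch amplitude = new RampSketch(0.0);
        amplitude.setTarget(1.0, 4); // fade in over 4 samples
        for (int i = 0; i < 6; i++) {
            System.out.println(amplitude.next());
        }
    }
}
```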


Charts are powered by Jzy3d. They are used as synthesizer command logs: parameter changes of the synthesizer are tracked and mapped to multiple 2D charts. Below is the list of available charts. See here a video of charts in action.

Chart | Comment
Frequency chart | Shows the synthesizer frequency changes with a pink scatter plot. Confidence is used to define alpha, so nothing is displayed if pitch detection has confidence 0.
Amplitude chart | Shows the synthesizer amplitude changes with a cyan scatter plot. Amplitude events below the note relevance threshold (default 0.1) are drawn in gray.
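The two coloring rules above can be written down as a small sketch. It assumes that confidence lies in the range 0 to 1 and that colors are RGBA float arrays; the exact pink and gray shades and the class name are illustrative choices, not values taken from Vocobox or Jzy3d.

```java
/** Hypothetical color rules for the two charts; shades and names are assumptions. */
final class ChartColorSketch {

    /** Default note relevance threshold stated in the text. */
    static final float NOTE_RELEVANCE_THRESHOLD = 0.1f;

    /** Pink point whose alpha equals the detection confidence, clamped to [0, 1]. */
    static float[] frequencyColor(float confidence) {
        float alpha = Math.max(0f, Math.min(1f, confidence));
        return new float[] {1f, 0.75f, 0.8f, alpha}; // assumed pink shade
    }

    /** Gray point below the threshold, cyan point otherwise. */
    static float[] amplitudeColor(float amplitude) {
        if (amplitude < NOTE_RELEVANCE_THRESHOLD) {
            return new float[] {0.5f, 0.5f, 0.5f, 1f};
        }
        return new float[] {0f, 1f, 1f, 1f};
    }

    public static void main(String[] args) {
        // Confidence 0 gives alpha 0, so the point is invisible.
        System.out.println(frequencyColor(0f)[3]);
        // A quiet event is drawn in gray (equal red, green and blue).
        System.out.println(java.util.Arrays.toString(amplitudeColor(0.05f)));
    }
}
```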

A few interesting features of Jzy3d:

  • easy charting
  • performance and liveness
  • coming soon: a log chart will let frequency charts look like note charts without having to do the frequency-to-note conversion ourselves.
  • the underlying JOGL lets it run everywhere (any Java windowing toolkit, including Android)
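For reference, the frequency-to-note conversion mentioned in the list above is itself logarithmic, which is why a log-scale frequency axis spaces notes evenly. A minimal sketch, assuming equal temperament with A4 = 440 Hz mapped to MIDI note 69 (the usual convention, not something this page states); the class and method names are hypothetical.

```java
/** Hypothetical helper: equal-temperament frequency-to-note conversion, A4 = 440 Hz = MIDI 69. */
final class NoteSketch {
    private static final String[] NAMES =
            {"C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"};

    /** Nearest MIDI note number: 69 + 12 * log2(f / 440). */
    static int midiNote(double hertz) {
        double semitonesFromA4 = 12.0 * (Math.log(hertz / 440.0) / Math.log(2.0));
        return (int) Math.round(69.0 + semitonesFromA4);
    }

    /** Note name with octave number, where MIDI note 60 is called C4. */
    static String noteName(double hertz) {
        int midi = midiNote(hertz);
        return NAMES[Math.floorMod(midi, 12)] + (Math.floorDiv(midi, 12) - 1);
    }

    public static void main(String[] args) {
        System.out.println(noteName(440.0));
        System.out.println(noteName(261.63));
    }
}
```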

Real time

Human perception of real time

We found in P. Brossier's thesis that humans cannot perceive two audio events as distinct when they are too close in time, with a limit ranging from a few milliseconds to 50 ms depending on the sounds:

As auditory nerve cells need to rest after firing, several phenomena may occur within the inner ear. Depending on the nature of the sources, two or more events will be merged into one sensation. In some cases, events will need to be separated by only a few millisecond to be perceived as two distinct events, while some other sounds will be merged if they occur within 50 ms, and sometimes even longer. These effects, known as the psychoacoustic masking effects, are complex, and depend not only of the loudness of both sources, masker and maskee, but also on their frequency and timbre [Zwicker and Fastl, 1990]. The different masking effects can be divided in three kinds [Bregman, 1990]. Pre-masking occurs when a masked event is followed immediately by a louder event. Post-masking instead occurs when a loud event is followed by a quiet noise. In both case, the quiet event will not be perceived – i.e. it will be masked. The third kind of masking effect is simultaneous masking, also referred to as frequency masking, as it is strongly dependent on the spectrum of both the masker and the maskee.

We thus consider 5 ms to be the time frame within which computation must complete in order to produce audio without perceptible delay. It is comparable to rendering each image of an animation in less than 1/25 s, so that the display rate is compatible with persistence of vision.
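To make the 5 ms figure concrete, it can be translated into a number of audio frames. The sketch below assumes a 44.1 kHz sample rate, which this page does not state; the class and method names are hypothetical.

```java
/** Hypothetical helper converting a time budget into audio frames and back. */
final class LatencyBudgetSketch {

    /** Number of whole frames that fit in the given budget at the given sample rate. */
    static int framesWithin(double budgetMillis, double sampleRateHz) {
        return (int) Math.floor(budgetMillis * sampleRateHz / 1000.0);
    }

    /** Duration in milliseconds of a block of frames at the given sample rate. */
    static double millisFor(int frames, double sampleRateHz) {
        return frames * 1000.0 / sampleRateHz;
    }

    public static void main(String[] args) {
        double sampleRate = 44100.0; // assumed sample rate
        int frames = framesWithin(5.0, sampleRate);
        System.out.println(frames + " frames fit in 5 ms");
        // A 128-frame block is a common power-of-two size below that limit.
        System.out.println("128 frames last " + millisFor(128, sampleRate) + " ms");
    }
}
```

Under that assumption, each block of about 220 frames has to be analysed and synthesized before the next one is due.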

Real time capabilities of the Java platform

In this project we are working on data rather than delivering production-ready software, so it does not matter much if standard versions of Java cannot meet the above timing constraints because of the unpredictability of garbage collection.

But Java can be a good platform for real time, as shown by Metronome GC, a garbage collector able to work in deterministic time, with the promise of not spending more than 3 ms collecting garbage. 3 ms is great because it is below the 5 ms time frame derived above from the limits of human perception.

Performance of components

We noticed that TarsosDSP processes files much faster than a player would read them.


Several interesting papers related to voice frequency detection can be found in the doc/papers folder.

Getting and building source code

Create a Vocobox directory

cd dev
mkdir vocobox
cd vocobox
mkdir external
mkdir public
cd public

Get the voice dataset

git clone https://github.com/vocobox/human-voice-dataset

Get and build Vocobox

git clone https://github.com/vocobox/vocobox
cd vocobox/dev/java
mvn clean install

Maven should retrieve TarsosDSP, JSyn, and Jzy3d from Jzy3d's Maven repository.

The following is not necessary, but if you want to build the dependencies yourself, you can get our forks that enable JSyn and TarsosDSP on Maven:

cd ../external/
git clone https://github.com/vocobox/jsyn
git clone https://github.com/vocobox/TarsosDSP tarsosdsp
cd jsyn
mvn clean install -DskipTests
cd ../tarsosdsp
mvn clean install -DskipTests


Please join us and share your contributions through pull-requests.

You can contact martin@vocobox.org for questions.




Thanks to Phil Burk and Joren Six for their kind help and advice regarding the excellent tools JSyn and TarsosDSP.