Voice Controller for Digital Instruments
Vocobox aims to provide singers with software that turns the voice into a musical controller. Voice features (pitch, volume, ...) are used to control external software or hardware that produces music.
We want to build a voice-to-instrument application rather than an audio-to-MIDI application. For this reason, we found it sufficient to control the synthesizer in terms of frequency and amplitude, without explicitly defining note on/off events. This makes the mapping easier, and the result is good enough.
If you want to go straight to example output, go here.
VOCOBOX 1.0 (01/01/2015)
At this stage we are mainly evaluating pitch detection algorithms using the Human Voice Dataset, a dataset we built to gather examples of singers' voices (e.g. all notes in the voice range). We define scores such as pitch detection latency and precision and compare them graphically.
We also evaluate pitch detection in real time by recording the voice with a microphone as input and generating a synthesizer sound as output.
See the Component section of this document to learn more about algorithms used in this project.
To get notified of future versions, simply follow Vocobox on Twitter or here on GitHub.
We are currently having fun with sequence detection.
Collaborators are most welcome! See end of this page.
Controlling Synthesizers with CSV files
To control a JSyn synthesizer, we export frequency and amplitude change commands to two CSV files. Each file contains two columns: the first is the time elapsed since the song started, the second is the new value (frequency for pitch.csv, amplitude for envelope.csv). Note that frequency and amplitude can change independently.
Having the original WAV file available allows playing the audio source in the background while executing command events.
To run synthesizer control based on CSV files, see VocoboxControllerCsv.
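As a rough illustration of the two-file layout described above, the sketch below parses both files and interleaves their rows into one time-ordered command list. The comma separator, the absence of a header row, seconds as the time unit, and every class and method name here are assumptions made for the example; this is not the code of VocoboxControllerCsv.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: merges pitch.csv and envelope.csv rows into one time-ordered list. */
final class CsvCommandSketch {
    enum Kind { FREQUENCY, AMPLITUDE }

    /** One value change at a given elapsed time (time unit assumed to be seconds). */
    static final class Command {
        final double time;
        final Kind kind;
        final double value;

        Command(double time, Kind kind, double value) {
            this.time = time;
            this.kind = kind;
            this.value = value;
        }
    }

    /** Parses "time,value" lines; blank lines are skipped. Assumes no header row. */
    static List<Command> parse(List<String> lines, Kind kind) {
        List<Command> out = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] cols = line.split(",");
            double time = Double.parseDouble(cols[0].trim());
            double value = Double.parseDouble(cols[1].trim());
            out.add(new Command(time, kind, value));
        }
        return out;
    }

    /** Frequency and amplitude change independently, so both streams are interleaved by time. */
    static List<Command> merge(List<Command> pitch, List<Command> envelope) {
        List<Command> all = new ArrayList<>(pitch);
        all.addAll(envelope);
        // List.sort is stable: on equal times, pitch commands stay ahead of envelope commands.
        all.sort(Comparator.comparingDouble((Command c) -> c.time));
        return all;
    }
}
```

A player loop would then walk the merged list, wait until each command's time, and forward the value to the synthesizer's frequency or amplitude input depending on `kind`.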
Controlling Synthesizers with WAV files
The pitch and amplitude change events of a WAV file are sent to a synthesizer via its sendFrequency() / sendAmplitude() methods. In these demonstrations, we use JSyn-based synthesizers. Since directly controlling the oscillator's amplitude from the input file mimics notes well enough, we do not need additional computation to define note on and note off.
Below are a few synthesized sounds and the WAV files that control them.
|Input||Do-re-mi piano source||Do-re-mi voice source|
|Output||Do-re-mi synth controlled by piano||Do-re-mi synth controlled by voice|
See this examples folder for more input/output/chart results.
To run synthesizer control based on a WAV file, see VocoboxControllerFileRead.
Controlling Synthesizers in real time with available audio inputs (microphone, lines)
When the application starts, Tarsos lists the available audio sources and proposes an estimation algorithm. We found that Yin performs best. Running live synthesizer control shows that pitch detection is quite effective.
To run synthesizer control based on live voice, see VocoboxControllerMic.
Benchmark Pitch Detection algorithm on note datasets
Audio signal analysis is powered by TarsosDSP. In our evaluation, its Yin implementation outperformed the other pitch detection algorithms we tried, and it has become the default implementation for the voice analysis module.
Vocobox delivers pitch detection through the following analyzers:
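To give an idea of what a Yin-style estimator computes, here is a simplified, dependency-free sketch of its three core steps: the difference function, the cumulative mean normalized difference, and the absolute threshold. It is not the TarsosDSP implementation and it omits refinements such as parabolic interpolation, so its resolution is limited to whole-sample periods. The class name, method name, and the threshold value used in the usage note are assumptions.

```java
/** Simplified Yin-style pitch estimator (illustrative sketch, not the TarsosDSP code). */
final class YinSketch {

    /** Returns the estimated pitch in Hz, or -1 if no lag falls below the threshold. */
    static double estimatePitch(float[] buffer, float sampleRate, double threshold) {
        int half = buffer.length / 2;
        double[] d = new double[half];

        // Step 1: difference function d(tau) = sum over j of (x[j] - x[j + tau])^2
        for (int tau = 1; tau < half; tau++) {
            double sum = 0;
            for (int j = 0; j < half; j++) {
                double delta = buffer[j] - buffer[j + tau];
                sum += delta * delta;
            }
            d[tau] = sum;
        }

        // Step 2: cumulative mean normalized difference, d'(tau) = d(tau) * tau / sum(d(1..tau))
        d[0] = 1;
        double running = 0;
        for (int tau = 1; tau < half; tau++) {
            running += d[tau];
            // A zero running sum means silence so far: treat it as "no periodicity" (value 1).
            d[tau] = (running == 0) ? 1 : d[tau] * tau / running;
        }

        // Step 3: first lag below the threshold, then walk down to the local minimum
        for (int tau = 2; tau < half; tau++) {
            if (d[tau] < threshold) {
                while (tau + 1 < half && d[tau + 1] < d[tau]) {
                    tau++;
                }
                return sampleRate / tau;
            }
        }
        return -1;
    }
}
```

Called on a 2048-sample buffer at 44.1 kHz with a threshold around 0.15, this sketch can only report periods of up to about 1023 samples, which puts its lowest detectable pitch near 43 Hz.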
|VoiceInputStreamListen||Analyse audio signal from available inputs (microphones, but also lines, etc). When running a Jack server, audio sources made available by Jack appear in source list!|
|VoiceFileRead||Analyse audio signal from (mono) wav files. After reading, the audio analysis events are collected and can be sent to a synthesizer.|
Note that you can process FFT using Spectro Edit, as provided by Jzy3d Spectro. It is used below to draw note signal analysis. JSyn and TarsosDSP also provide FFT processing.
|JsynMonoscilloSynth||A single oscillator.|
|JsynMonoscilloRampSynth||A single oscillator having a LinearRamp on frequency and amplitude change commands, handling numerous pitch / amplitude change events without audio artifacts.|
|JsynOcclusiveNoiseSynth||A synthesizer using a sound with no defined frequency (here: white noise) when the confidence value of pitch detection is below a threshold. It allows a kind of audio debugging of pitch detection. Abrupt tone changes make the synthesizer sound harsh, but smooth changes in tone balance could produce interesting effects.|
|JsynCircuitSynth||A synthesizer based on JSyn Circuit, allowing easier abstraction of synthesizer element groups. Here, we use the circuit SynthCircuitBlaster, which is derived from JSyn examples. Note that the circuit provides its control panel to the Vocobox UI.|
|JsynOscilloSpectroHarpSynth||An experimental synthesizer based on FFT analysis of a file. A file is played, its FFT is processed, and the energy of each frequency band defines the amplitude of one of the 93 oscillators covering 0-4 kHz.|
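The ramp idea behind JsynMonoscilloRampSynth can be illustrated without JSyn: instead of jumping to each new frequency or amplitude, the output moves linearly toward the target over a fixed number of samples, which avoids the clicks caused by sudden jumps. The class below is a plain-Java sketch with hypothetical names; JSyn's own LinearRamp unit has a different API that is not shown here.

```java
/** Hypothetical per-sample linear ramp toward a target value (not the JSyn LinearRamp API). */
final class LinearRampSketch {
    private final int rampSamples;
    private double current;
    private double target;
    private double step;
    private int remaining;

    LinearRampSketch(double initial, int rampSamples) {
        this.current = initial;
        this.target = initial;
        this.rampSamples = Math.max(1, rampSamples);
    }

    /** Starts a new ramp from wherever the output currently is, even in the middle of a ramp. */
    void setTarget(double newTarget) {
        target = newTarget;
        step = (newTarget - current) / rampSamples;
        remaining = rampSamples;
    }

    /** Returns the next output value; stays on the target once the ramp is finished. */
    double next() {
        if (remaining > 0) {
            current += step;
            remaining--;
            if (remaining == 0) {
                current = target; // snap to the target to avoid accumulated rounding error
            }
        }
        return current;
    }
}
```

Because `setTarget` restarts from the current value, a burst of change events arriving faster than the ramp length still produces a continuous output.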
Charts are powered by Jzy3d. They are used as synthesizer command logs: parameter changes of the synthesizer are tracked and mapped to multiple 2D charts. Below is the list of available charts. See here a video of charts in action.
|Frequency chart||Shows the synthesizer frequency changes with a pink scatter plot. Confidence is used to define alpha, so nothing is displayed if pitch detection has confidence 0.|
|Amplitude chart||Shows the synthesizer amplitude changes with a cyan scatter plot. Amplitude events below the note relevance threshold (default 0.1) are drawn in gray.|
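The coloring rules in the table above can be summarized in a few lines. This sketch uses plain RGBA float arrays rather than Jzy3d color objects; the exact RGB values for pink, cyan and gray, the direct use of confidence as alpha, and all names are assumptions made for illustration.

```java
/** Hypothetical sketch of the chart coloring rules, using RGBA float arrays in [0, 1]. */
final class ChartColorSketch {
    /** Default note relevance threshold stated in the table above. */
    static final float RELEVANCE_THRESHOLD = 0.1f;

    /** Pink point whose alpha is the pitch confidence, clamped to [0, 1]. */
    static float[] frequencyColor(float confidence) {
        float alpha = Math.max(0f, Math.min(1f, confidence));
        return new float[] {1f, 0.4f, 0.7f, alpha}; // RGB chosen for "pink" is an assumption
    }

    /** Cyan for relevant amplitudes, gray for amplitudes below the threshold. */
    static float[] amplitudeColor(float amplitude) {
        if (amplitude < RELEVANCE_THRESHOLD) {
            return new float[] {0.5f, 0.5f, 0.5f, 1f}; // gray
        }
        return new float[] {0f, 1f, 1f, 1f}; // cyan
    }
}
```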
A few interesting features of Jzy3d:
- easy charting
- performance and liveness
- coming soon: log charts will let frequency charts look like note charts without our having to do the frequency-to-note conversion ourselves.
- the underlying JOGL lets it run everywhere (any Java windowing toolkit, including Android)
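For reference, the frequency-to-note conversion mentioned in the list above is a logarithmic mapping, which is why a log-scaled frequency axis spaces semitones evenly. The sketch below uses the common equal-temperament convention with A4 = 440 Hz as MIDI note 69 and C4 as MIDI note 60; these conventions and all names are assumptions, not something Vocobox defines.

```java
/** Hypothetical frequency-to-note helper (equal temperament, A4 = 440 Hz assumed). */
final class NoteSketch {
    private static final String[] NAMES =
        {"C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"};

    /** Fractional MIDI note number: 69 is A4, and each octave spans 12 semitones. */
    static double midiNumber(double hz) {
        return 69 + 12 * (Math.log(hz / 440.0) / Math.log(2));
    }

    /** Name of the nearest equal-tempered note, e.g. "A4". */
    static String nearestNoteName(double hz) {
        int midi = (int) Math.round(midiNumber(hz));
        int octave = Math.floorDiv(midi, 12) - 1; // MIDI 60 is written C4 in this convention
        return NAMES[Math.floorMod(midi, 12)] + octave;
    }
}
```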
Human perception of real time
We found in P. Brossier's thesis that humans cannot distinguish audio events separated by less than a threshold that ranges from a few milliseconds to 50 ms, depending on the sounds:
As auditory nerve cells need to rest after firing, several phenomena may occur within the inner ear. Depending on the nature of the sources, two or more events will be merged into one sensation. In some cases, events will need to be separated by only a few millisecond to be perceived as two distinct events, while some other sounds will be merged if they occur within 50 ms, and sometimes even longer. These effects, known as the psychoacoustic masking effects, are complex, and depend not only of the loudness of both sources, masker and maskee, but also on their frequency and timbre [Zwicker and Fastl, 1990]. The different masking effects can be divided in three kinds [Bregman, 1990]. Pre-masking occurs when a masked event is followed immediately by a louder event. Post-masking instead occurs when a loud event is followed by a quiet noise. In both case, the quiet event will not be perceived – i.e. it will be masked. The third kind of masking effect is simultaneous masking, also referred to as frequency masking, as it is strongly dependent on the spectrum of both the masker and the maskee.
We thus consider 5 ms to be the time frame within which all computational work must complete in order to produce audio without perceptible delay. This is similar to rendering each image of an animation in less than 1/25 s so that it displays at a rate compatible with persistence of vision.
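To make this budget concrete, the duration covered by one audio buffer follows directly from its size and the sample rate. The sketch below computes it in both directions; the 44.1 kHz sample rate and the buffer sizes in the usage note are illustrative assumptions, not Vocobox settings.

```java
/** Hypothetical helper relating audio buffer size, sample rate and a latency budget. */
final class LatencyBudgetSketch {

    /** Duration in milliseconds of a buffer of the given size. */
    static double bufferMillis(int bufferSize, double sampleRate) {
        return 1000.0 * bufferSize / sampleRate;
    }

    /** Largest whole number of samples that fits within the given budget. */
    static int maxBufferSize(double budgetMillis, double sampleRate) {
        return (int) Math.floor(budgetMillis * sampleRate / 1000.0);
    }
}
```

At 44.1 kHz, `maxBufferSize(5, 44100)` gives 220 samples, while a 1024-sample buffer spans about 23 ms. A pitch detector that needs a buffer that long cannot report a new pitch within 5 ms of its onset, whatever the processing speed.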
Real time capabilities of the Java platform
In this project we focus on working with data rather than delivering production-ready software, so it matters little that standard versions of Java cannot guarantee the above timing constraints because garbage collection is unpredictable.
But Java can be a good platform for real time, as shown by Metronome GC, a garbage collector able to work in deterministic time, with the promise of not spending more than 3 ms collecting garbage. 3 ms is great because it is below the perception threshold of our brain discussed above.
Performance of components
We noticed that Tarsos processes files much faster than a player would play them back.
Several interesting papers related to voice frequency detection can be found in the doc/papers folder
Getting and building source code
Create a Vocobox directory
    cd dev
    mkdir vocobox
    cd vocobox
    mkdir external
    mkdir public
    cd public
Get the voice dataset
git clone https://github.com/vocobox/human-voice-dataset
Get and build Vocobox
    git clone https://github.com/vocobox/vocobox
    cd vocobox/dev/java
    mvn clean install
Maven should retrieve TarsosDSP, JSyn, and Jzy3d from Jzy3d's maven repository.
The following is not required, but if you want to build the dependencies yourself, you can get our forks that enable JSyn and TarsosDSP on Maven:
    cd ../external/
    git clone https://github.com/vocobox/jsyn
    git clone https://github.com/vocobox/TarsosDSP tarsosdsp
    cd jsyn
    mvn clean install -D skipTests
    cd ../tarsosdsp
    mvn clean install -D skipTests
Please join us and share your contributions through pull-requests.
You can contact email@example.com for questions.
IF YOU INTEND TO REUSE THIS SOFTWARE, PLEASE VERIFY COMPONENTS LICENCE! Licensing
To Phil Burk and Joren Six for their kind help and advice regarding the excellent tools JSyn and TarsosDSP.