NSynth: Neural Audio Synthesis
NSynth is a WaveNet-based autoencoder for synthesizing audio.
WaveNet is an expressive model for temporal sequences such as speech and music. As a deep autoregressive network of dilated convolutions, it models sound one sample at a time, similar to a nonlinear infinite impulse response filter. Since the context of this filter is currently limited to several thousand samples (about half a second), long-term structure requires a guiding external signal. Prior work demonstrated this in the case of text-to-speech and used previously learned linguistic embeddings to create impressive results.
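The "several thousand samples" of context comes from the dilated convolution stack: each layer doubles its dilation, so the receptive field grows exponentially with depth. The layer configuration below is illustrative (a common WaveNet-style setup), not the exact NSynth architecture, but it shows how the context length is computed:

```python
# Receptive field of stacked dilated convolutions (illustrative config,
# not the exact NSynth architecture): 3 stacks of 10 layers with
# dilations 1, 2, 4, ..., 512 and filter width 2.

def receptive_field(num_stacks=3, layers_per_stack=10, filter_width=2):
    dilations = [2 ** i for i in range(layers_per_stack)] * num_stacks
    # Each layer adds (filter_width - 1) * dilation samples of context.
    return sum((filter_width - 1) * d for d in dilations) + 1

samples = receptive_field()
print(samples)            # 3070 samples of context
print(samples / 16000)    # a fraction of a second at 16 kHz
```

Doubling the number of layers per stack would roughly double the depth but grow the context exponentially, which is why dilation (rather than more layers at dilation 1) is the key trick.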
In NSynth, we removed the need for conditioning on external features by employing a WaveNet-style autoencoder to learn its own temporal embeddings.
This repository contains a baseline spectral autoencoder model and a WaveNet autoencoder model, each in their respective directories. The baseline model uses a spectrogram with fft_size 1024 and hop_size 256, MSE loss on the magnitudes, and the Griffin-Lim algorithm for reconstruction. The WaveNet model trains on mu-law encoded waveform chunks of size 6144. It learns embeddings with 16 dimensions that are downsampled by 512 in time.
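To make the preprocessing concrete, here is a sketch of 8-bit mu-law companding, the encoding the WaveNet model trains on, along with the embedding-shape arithmetic. The repository has its own implementation; this NumPy version is only to illustrate the transform:

```python
import numpy as np

# Sketch of 8-bit mu-law companding (mu = 255): compress float audio
# in [-1, 1] logarithmically, then quantize to 256 integer levels.

def mu_law_encode(x, mu=255):
    """Map float audio in [-1, 1] to integers in [0, mu]."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(y, mu=255):
    """Invert mu_law_encode back to floats in [-1, 1]."""
    compressed = 2 * y.astype(np.float32) / mu - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

x = np.linspace(-1, 1, 5)
roundtrip = mu_law_decode(mu_law_encode(x))
print(np.max(np.abs(roundtrip - x)))  # small quantization error

# Embedding shape arithmetic: a 6144-sample training chunk, downsampled
# by 512 in time, yields 12 frames of 16-dimensional embeddings.
print(6144 // 512)  # 12
```

The logarithmic compression spends more of the 256 levels on quiet samples, which matches the perceptual sensitivity of hearing and makes the 256-way categorical output of WaveNet tractable.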
Given the difficulty of training, we've included weights of models pretrained on the NSynth dataset, available for download as TensorFlow checkpoints.
The most straightforward way to create your own sounds with NSynth is to generate them directly from .wav files without altering the embeddings. You can do this for sounds of any length, as long as you set the sample_length flag high enough. Keep in mind that the WaveNet decoder works at 16 kHz. The script below will take all .wav files in the source_path directory and create generated samples in the save_path directory. If you've installed via the pip package, you can call the scripts directly without invoking bazel.
Example Usage (Generate from .wav files):
nsynth_generate \
  --checkpoint_path=/<path>/wavenet-ckpt/model.ckpt-200000 \
  --source_path=/<path> \
  --save_path=/<path> \
  --batch_size=4
We've included scripts for saving embeddings from your own wave files. This will save a single .npy file for each .wav file in the source_path directory. You can then alter those embeddings (for example, interpolating) and synthesize new sounds from them.
Example Usage (Save Embeddings):
bazel run //magenta/models/nsynth/baseline:save_embeddings -- \
  --tfrecord_path=/<path>/nsynth-test.tfrecord \
  --checkpoint_path=/<path>/baseline-ckpt/model.ckpt-200000 \
  --savedir=/<path>
nsynth_save_embeddings \
  --checkpoint_path=/<path>/wavenet-ckpt/model.ckpt-200000 \
  --source_path=/<path> \
  --save_path=/<path> \
  --batch_size=4
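Altering the saved embeddings is just array arithmetic on the .npy files. Below is a sketch of linear interpolation between two embeddings; in practice the arrays would come from np.load on files written by the save-embeddings scripts, but here we use random stand-ins with a plausible shape (a 4-second note at 16 kHz, downsampled by 512, gives 125 time frames of 16 dimensions):

```python
import numpy as np

# Sketch of altering saved embeddings before resynthesis: linearly
# blending two embeddings. Real inputs would be np.load("<file>.npy")
# arrays from the save-embeddings scripts; these are random stand-ins.

def interpolate(emb_a, emb_b, alpha=0.5):
    """Blend two embedding arrays of the same shape."""
    assert emb_a.shape == emb_b.shape
    return alpha * emb_a + (1 - alpha) * emb_b

rng = np.random.default_rng(0)
emb_a = rng.standard_normal((125, 16))  # (time frames, 16 dims)
emb_b = rng.standard_normal((125, 16))
mix = interpolate(emb_a, emb_b, alpha=0.5)
print(mix.shape)  # (125, 16)
# np.save("mix.npy", mix)  # then resynthesize with nsynth_generate
```

Any elementwise manipulation with a matching shape (scaling, splicing in time, averaging many sources) can be saved back to .npy and decoded the same way.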
Example Usage (Generate from .npy Embeddings):
nsynth_generate \
  --checkpoint_path=/<path>/wavenet-ckpt/model.ckpt-200000 \
  --source_path=/<path> \
  --save_path=/<path> \
  --encodings=true \
  --batch_size=4
To train the model you first need a dataset containing raw audio. We have built a very large dataset of musical notes that you can use for this purpose: the NSynth Dataset.
Training both of these models is very expensive and likely difficult for many practical setups. Nevertheless, we've included the training code for completeness and transparency. The WaveNet model takes around 10 days on 32 K40 GPUs (synchronous) to converge at ~200k iterations. The baseline model takes about 5 days on 6 K40 GPUs (asynchronous).
bazel run //magenta/models/nsynth/baseline:train -- \
  --train_path=/<path>/nsynth-train.tfrecord \
  --logdir=/<path>
bazel run //magenta/models/nsynth/wavenet:train -- \
  --train_path=/<path>/nsynth-train.tfrecord \
  --logdir=/<path>
Training the WaveNet model also requires TensorFlow 1.1.0-rc1 or later.