Based on the script `train_fastspeech2.py`.

This example code shows you how to train FastSpeech2 from scratch with TensorFlow 2 using a custom training loop and `tf.function`. The data used for this example is LJSpeech; you can download the dataset at link.
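For context, here is a minimal sketch of the custom-training-loop pattern this example is built on (the model and loss below are toy placeholders, not the trainer's actual API):

```python
import tensorflow as tf

# Toy stand-ins; the real trainer builds FastSpeech2 and its losses.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(1e-3)

@tf.function  # compiles the step into a graph for speed
def train_step(x, y):
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)
        loss = tf.reduce_mean(tf.square(y - y_pred))
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

for step in range(10):  # the custom loop: no model.fit()
    loss = train_step(tf.random.normal([8, 4]), tf.random.normal([8, 1]))
```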
First, you need to define a data loader based on the `AbstractDataset` class (see `abstract_dataset.py`). In this example, the dataloader reads the dataset from a path and uses file suffixes to classify whether a file contains character IDs, durations, or a mel-spectrogram (see `fastspeech2_dataset.py`). If you already have a preprocessed version of your target dataset, you don't need this example dataloader; just refer to my dataloader and modify the generator function to fit your case. Normally, a generator function should return `[charactor_ids, duration, f0, energy, mel]`. Please see the tacotron2 example (Extract Duration) to learn how to extract durations.
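As an illustration, such a generator might look like the sketch below (the directory layout and file suffixes here are assumptions; adapt them to your own dump structure):

```python
import numpy as np

def generator(utt_ids):
    """Yields one training example per utterance id.

    The paths and suffixes below are hypothetical; match them to the
    files produced by your own preprocessing.
    """
    for utt_id in utt_ids:
        charactor_ids = np.load(f"./dump/train/ids/{utt_id}-ids.npy")
        duration = np.load(f"./dump/train/durations/{utt_id}-durations.npy")
        f0 = np.load(f"./dump/train/f0/{utt_id}-raw-f0.npy")
        energy = np.load(f"./dump/train/energies/{utt_id}-raw-energy.npy")
        mel = np.load(f"./dump/train/norm-feats/{utt_id}-norm-feats.npy")
        yield charactor_ids, duration, f0, energy, mel
```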
After you redefine your dataloader, please modify the input arguments `train_dataset` and `valid_dataset` in `train_fastspeech2.py`. Here is an example command line for training FastSpeech2 from scratch:
```bash
CUDA_VISIBLE_DEVICES=0 python examples/fastspeech2/train_fastspeech2.py \
  --train-dir ./dump/train/ \
  --dev-dir ./dump/valid/ \
  --outdir ./examples/fastspeech2/exp/train.fastspeech2.v1/ \
  --config ./examples/fastspeech2/conf/fastspeech2.v1.yaml \
  --use-norm 1 \
  --f0-stat ./dump/stats_f0.npy \
  --energy-stat ./dump/stats_energy.npy \
  --mixed_precision 1 \
  --resume ""
```
If you want to use multiple GPUs for training, you can replace `CUDA_VISIBLE_DEVICES=0` with, for example, `CUDA_VISIBLE_DEVICES=0,1,2,3`. You also need to tune the `batch_size` for each GPU (in the config file) yourself to maximize performance. Note that multi-GPU is currently supported for training but not yet for decoding.
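Under the hood, multi-GPU training in TF2 follows the `tf.distribute.MirroredStrategy` pattern sketched below (the trainer wires this up for you; the model here is a placeholder):

```python
import tensorflow as tf

# MirroredStrategy replicates the model across all GPUs made visible by
# CUDA_VISIBLE_DEVICES and splits each global batch between them, which
# is why the per-GPU batch_size in the config needs tuning.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Placeholder model; the real trainer creates FastSpeech2 in this scope.
    model = tf.keras.Sequential([tf.keras.layers.Dense(80)])
    optimizer = tf.keras.optimizers.Adam(1e-3)
```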
In case you want to resume training, pass the checkpoint path to `--resume`, as in the following example:
```bash
--resume ./examples/fastspeech2/exp/train.fastspeech2.v1/checkpoints/ckpt-100000
```
If you want to fine-tune a model, pass your model filename to `--pretrained`, like this:
```bash
--pretrained pretrained.h5
```
You can also define `var_train_expr` in the config file to train only certain layers, in case you want to fine-tune on your dataset with the same pretrained language and processor. For example, `var_train_expr: "embeddings|encoder|decoder"` means we train only the variables whose names contain `embeddings`, `encoder`, or `decoder`.
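Conceptually, such a filter is a regex match over variable names. A minimal sketch of the idea (not the trainer's exact code):

```python
import re

def filter_trainable_variables(model, var_train_expr):
    """Keep only variables whose names match the regex, e.g.
    "embeddings|encoder|decoder". An illustration of the idea; the
    trainer's actual implementation may differ."""
    pattern = re.compile(var_train_expr)
    return [v for v in model.trainable_variables if pattern.search(v.name)]
```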
To generate mel-spectrograms from a trained checkpoint, run the decoding script, for example:

```bash
CUDA_VISIBLE_DEVICES=0 python examples/fastspeech2/decode_fastspeech2.py \
  --rootdir ./dump/valid \
  --outdir ./predictions/fastspeech2.v1/ \
  --config ./examples/fastspeech2/conf/fastspeech2.v1.yaml \
  --checkpoint ./examples/fastspeech2/checkpoints/model-150000.h5 \
  --batch-size 8
```
- It's not easy for the model to learn to predict f0/energy at the mel-frame level as the paper does. Instead, I average f0/energy over each character's duration to get character-level f0/energy, then add it to the encoder hidden states before passing through the Length-Regulator (see the sketch after this list).
- I apply mean/std normalization to both f0 and energy. Note that before calculating the mean and std values over the whole training set, I remove all outliers from f0 and energy.
- Instead of using 256 bins for F0 and energy as in the FastSpeech2 paper, I let the model learn to predict the real f0/energy values and then pass them through a single Conv1D layer with `kernel_size` 9 to upsample the f0/energy scalars to vectors, as the FastPitch paper suggests.
- There are other modifications to make it work; please read the code carefully to make sure you won't miss anything :D.
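To make the first and third points concrete, here is a hedged sketch of character-level averaging and the FastPitch-style Conv1D projection (shapes and `hidden_size` are assumptions, not the exact code from this repo):

```python
import numpy as np
import tensorflow as tf

def average_by_duration(x, durations):
    """Average frame-level f0/energy over each character's duration.

    Assumes x has shape [num_frames] and durations has shape [num_chars]
    with sum(durations) == num_frames.
    """
    char_values = np.zeros(len(durations), dtype=np.float32)
    start = 0
    for i, d in enumerate(durations):
        d = int(d)
        if d > 0:
            char_values[i] = x[start:start + d].mean()
        start += d
    return char_values  # [num_chars]

# FastPitch-style projection: one Conv1D (kernel_size=9) turns the
# predicted scalar f0/energy sequence into vectors that are added to
# the encoder hidden states. hidden_size is an assumed hyperparameter.
hidden_size = 384
f0_embedding = tf.keras.layers.Conv1D(
    filters=hidden_size, kernel_size=9, padding="same")

char_f0 = tf.random.normal([1, 12])  # [batch, num_chars], dummy data
f0_vectors = f0_embedding(char_f0[..., tf.newaxis])  # [batch, num_chars, hidden_size]
# encoder_hidden_states = encoder_hidden_states + f0_vectors
```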
Model | Conf | Lang | Fs [Hz] | Mel range [Hz] | FFT / Hop / Win [pt] | # iters |
---|---|---|---|---|---|---|
fastspeech2.v1 | link | EN | 22.05k | 80-7600 | 1024 / 256 / None | 150k |
fastspeech2.kss.v1 | link | KO | 22.05k | 80-7600 | 1024 / 256 / None | 200k |
fastspeech2.kss.v2 | link | KO | 22.05k | 80-7600 | 1024 / 256 / None | 200k |