taneliang/gst-tacotron2
GST-Tacotron 2

An emotional speech synthesis research project conducted as part of IS4152 coursework. This repository contains code that can be used to train a speech synthesis model that attempts to generate speech-like sounds to express a chosen emotion.

Prerequisites

  1. NVIDIA GPU + CUDA + cuDNN

Set up repository

  1. Clone this repo: git clone https://github.com/taneliang/mellotron.git
  2. CD into this repo: cd mellotron
  3. Initialize submodule: git submodule init; git submodule update

Set up dependencies

  1. Check CUDA toolkit version: nvcc --version. NB: This is the toolkit version, which may be different from the version reported by nvidia-smi.
  2. Create a Python 3 virtual environment: python3 -m venv .env-cuda<CUDA version>
  3. Activate the venv by running one of the following:
    • bash/sh: source .env-cuda<CUDA version>/bin/activate
    • csh: source .env-cuda<CUDA version>/bin/activate.csh
    • fish: source .env-cuda<CUDA version>/bin/activate.fish
  4. Install PyTorch. At the time this was written, these were the instructions:
    • CUDA 10.0: pip install torch==1.4.0+cu100 torchvision==0.5.0+cu100 -f https://download.pytorch.org/whl/cu100/torch_stable.html
    • CUDA 10.1: pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
    • CUDA 10.2 or 11.0: pip install torch torchvision
  5. Install Apex:
    pushd ..
    git clone https://github.com/NVIDIA/apex
    cd apex
    pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
    popd
  6. Install Python requirements: pip install -r requirements.txt
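The CUDA toolkit check in step 1 can be scripted when setting up several machines. The sketch below is illustrative only (the sample string follows the usual last line of `nvcc --version` output); it extracts the release number you would use to pick the matching PyTorch wheel:

```python
import re

def cuda_toolkit_version(nvcc_output: str) -> str:
    """Extract the CUDA toolkit release (e.g. '10.1') from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    if match is None:
        raise ValueError("could not find a CUDA release in nvcc output")
    return match.group(1)

# Typical last line of `nvcc --version` output:
sample = "Cuda compilation tools, release 10.1, V10.1.243"
print(cuda_toolkit_version(sample))  # 10.1
```

Remember that this reports the toolkit version, not the driver version shown by nvidia-smi.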

Set up data for training

  1. EmoV-DB:
    1. Download the EmoV-DB dataset
    2. Normalize it: ls */*/*.wav | xargs -I % sh -c 'mkdir -p ../out/$(dirname %) && sox % --rate 16000 -c 1 -b 16 ../out/%'
    3. Trim leading and trailing silences: ls */*/*.wav | xargs -I @ sh -c 'mkdir -p ../out-no-silence/$(dirname @) && sox @ --rate 16000 -c 1 -b 16 ../out-no-silence/@ silence 1 0.1 1% reverse silence 1 0.1 1% reverse'
    4. (Optional) Manually trim non-verbal expressions:
      1. Generate a CSV file to be manually filled in with trim timestamps: ./genmanualtrimlist.py
      2. Use the CSV file to trim files: ./createcleanemovdb.py
  2. LJSpeech:
    1. Download the LJSpeech dataset.
    2. Normalize it: mkdir ../../LJSpeech-1.1/wavs && ls *.wav | xargs -I % sh -c 'sox % --rate 16000 -c 1 -b 16 ../../LJSpeech-1.1/wavs/%'
  3. Generate filelist files:
    cd scripts
    vim ./genfilelist.py # Configure the script before running
    ./genfilelist.py
    cd ..
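After the sox passes above, every clip should be 16 kHz, mono, 16-bit, and the filelists reference clips one line each. The stdlib-only sketch below is illustrative: the header check mirrors the sox target, and the pipe-separated `path|text|speaker` layout is the usual Tacotron-style filelist format (the actual fields and order are configured in ./genfilelist.py):

```python
import wave

def is_normalized(path: str) -> bool:
    """Check that a wav matches the sox target: 16 kHz, mono, 16-bit."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)

def filelist_line(wav_path: str, text: str, speaker_id: int) -> str:
    """One pipe-separated filelist entry (illustrative field order)."""
    return f"{wav_path}|{text}|{speaker_id}"
```

Running is_normalized over the output folders is a quick way to catch clips the conversion missed before starting a long training run.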

Training

  1. Update the filelists inside the filelists folder to point to your data
  2. python train.py --output_directory=outdir --log_directory=logdir
  3. (Optional) tensorboard --logdir=outdir/logdir

Training using a pre-trained model

Training from a pre-trained model can lead to faster convergence.
By default, the emotion embedding layer is ignored.

  1. python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start
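Warm starting behaves roughly as sketched below: weights for layers marked as ignored (here, the emotion embedding) are dropped from the loaded checkpoint, so the new model re-initializes them from scratch. This is a hedged, stdlib-only illustration, not the actual train.py code; the parameter names are made up:

```python
def warm_start_state(pretrained: dict, ignore_layers: list) -> dict:
    """Keep pre-trained weights except those under an ignored layer prefix."""
    return {
        name: weights
        for name, weights in pretrained.items()
        if not any(name.startswith(prefix) for prefix in ignore_layers)
    }

# Hypothetical checkpoint contents:
checkpoint = {"encoder.lstm.weight": [0.1], "emotion_embedding.weight": [0.2]}
kept = warm_start_state(checkpoint, ["emotion_embedding"])
print(sorted(kept))  # ['encoder.lstm.weight']
```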

Multi-GPU (distributed) and Automatic Mixed Precision Training

  1. python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
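The --hparams flag packs overrides into a comma-separated key=value string. A minimal parser sketch (illustrative only, not the project's actual hparams code) shows how distributed_run=True,fp16_run=True is interpreted:

```python
def parse_hparams(overrides: str) -> dict:
    """Parse 'k1=v1,k2=v2' overrides, mapping True/False to booleans."""
    result = {}
    for pair in overrides.split(","):
        key, _, value = pair.partition("=")
        result[key] = {"True": True, "False": False}.get(value, value)
    return result

print(parse_hparams("distributed_run=True,fp16_run=True"))
# {'distributed_run': True, 'fp16_run': True}
```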

Inference demo

  1. jupyter notebook --ip=127.0.0.1 --port=31337
  2. Load inference.ipynb
  3. (Optional) Download the published WaveGlow model

Related repos

WaveGlow: a faster-than-real-time flow-based generative network for speech synthesis.

Acknowledgements

This project is a slight modification of Mellotron, developed by Rafael Valle, Jason Li, Ryan Prenger and Bryan Catanzaro.

In turn, Mellotron uses code from repos by Keith Ito, Prem Seetharaman, Chengqi Deng, and Patrice Guyot, as described in our code.
