taneliang/gst-tacotron2
GST-Tacotron 2

An emotional speech synthesis research project conducted as part of IS4152 coursework. This repository contains code that can be used to train a speech synthesis model that attempts to generate speech-like sounds to express a chosen emotion.

Prerequisites

  1. NVIDIA GPU + CUDA + cuDNN

Set up repository

  1. Clone this repo: git clone https://github.com/taneliang/mellotron.git
  2. CD into this repo: cd mellotron
  3. Initialize submodule: git submodule init; git submodule update

Set up dependencies

  1. Check CUDA toolkit version: nvcc --version. NB: This is the toolkit version, which may be different from the version reported by nvidia-smi.
  2. Create a Python 3 virtual environment: python3 -m venv .env-cuda<CUDA version>
  3. Activate the venv by running one of the following:
    • bash/sh: source .env-cuda<CUDA version>/bin/activate
    • csh: source .env-cuda<CUDA version>/bin/activate.csh
    • fish: source .env-cuda<CUDA version>/bin/activate.fish
  4. Install PyTorch. At the time this was written, these were the instructions:
    • CUDA 10.0: pip install torch==1.4.0+cu100 torchvision==0.5.0+cu100 -f https://download.pytorch.org/whl/cu100/torch_stable.html
    • CUDA 10.1: pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
    • CUDA 10.2 or 11.0: pip install torch torchvision
  5. Install Apex:
    pushd ..
    git clone https://github.com/NVIDIA/apex
    cd apex
    pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
    popd
  6. Install Python requirements: pip install -r requirements.txt
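The CUDA toolkit check in step 1 can be scripted when setting up several machines. The sketch below is illustrative only (the sample string follows the usual last line of `nvcc --version` output); it extracts the release number you would use to pick the matching PyTorch wheel:

```python
import re

def cuda_toolkit_version(nvcc_output: str) -> str:
    """Extract the CUDA toolkit release (e.g. '10.1') from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    if match is None:
        raise ValueError("could not find a CUDA release in nvcc output")
    return match.group(1)

# Typical last line of `nvcc --version` output:
sample = "Cuda compilation tools, release 10.1, V10.1.243"
print(cuda_toolkit_version(sample))  # 10.1
```

Remember that this reports the toolkit version, not the driver version shown by nvidia-smi.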

Set up data for training

  1. EmoV-DB:
    1. Download the EmoV-DB dataset
    2. Normalize it: ls */*/*.wav | xargs -I % sh -c 'mkdir -p ../out/$(dirname %) && sox % --rate 16000 -c 1 -b 16 ../out/%'
    3. Trim leading and trailing silences: ls */*/*.wav | xargs -I @ sh -c 'mkdir -p ../out-no-silence/$(dirname @) && sox @ --rate 16000 -c 1 -b 16 ../out-no-silence/@ silence 1 0.1 1% reverse silence 1 0.1 1% reverse'
    4. (Optional) Manually trim non-verbal expressions:
      1. Generate a CSV file to be manually filled in with trim timestamps: ./genmanualtrimlist.py
      2. Use the CSV file to trim files: ./createcleanemovdb.py
  2. LJSpeech:
    1. Download the LJSpeech dataset.
    2. Normalize it: mkdir ../../LJSpeech-1.1/wavs && ls *.wav | xargs -I % sh -c 'sox % --rate 16000 -c 1 -b 16 ../../LJSpeech-1.1/wavs/%'
  3. Generate filelist files:
    cd scripts
    vim ./genfilelist.py # Configure the script before running
    ./genfilelist.py
    cd ..
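After the sox passes above, every clip should be 16 kHz, mono, 16-bit, and the filelists reference clips one line each. The stdlib-only sketch below is illustrative: the header check mirrors the sox target, and the pipe-separated `path|text|speaker` layout is the usual Tacotron-style filelist format (the actual fields and order are configured in ./genfilelist.py):

```python
import wave

def is_normalized(path: str) -> bool:
    """Check that a wav matches the sox target: 16 kHz, mono, 16-bit."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)

def filelist_line(wav_path: str, text: str, speaker_id: int) -> str:
    """One pipe-separated filelist entry (illustrative field order)."""
    return f"{wav_path}|{text}|{speaker_id}"
```

Running is_normalized over the output folders is a quick way to catch clips the conversion missed before starting a long training run.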

Training

  1. Update the filelists inside the filelists folder to point to your data
  2. python train.py --output_directory=outdir --log_directory=logdir
  3. (Optional) tensorboard --logdir=outdir/logdir

Training using a pre-trained model

Training from a pre-trained model can lead to faster convergence.
By default, the emotion embedding layer is ignored.

  1. python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start
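Warm starting behaves roughly as sketched below: weights for layers marked as ignored (here, the emotion embedding) are dropped from the loaded checkpoint, so the new model re-initializes them from scratch. This is a hedged, stdlib-only illustration, not the actual train.py code; the parameter names are made up:

```python
def warm_start_state(pretrained: dict, ignore_layers: list) -> dict:
    """Keep pre-trained weights except those under an ignored layer prefix."""
    return {
        name: weights
        for name, weights in pretrained.items()
        if not any(name.startswith(prefix) for prefix in ignore_layers)
    }

# Hypothetical checkpoint contents:
checkpoint = {"encoder.lstm.weight": [0.1], "emotion_embedding.weight": [0.2]}
kept = warm_start_state(checkpoint, ["emotion_embedding"])
print(sorted(kept))  # ['encoder.lstm.weight']
```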

Multi-GPU (distributed) and Automatic Mixed Precision Training

  1. python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
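The --hparams flag packs overrides into a comma-separated key=value string. A minimal parser sketch (illustrative only, not the project's actual hparams code) shows how distributed_run=True,fp16_run=True is interpreted:

```python
def parse_hparams(overrides: str) -> dict:
    """Parse 'k1=v1,k2=v2' overrides, mapping True/False to booleans."""
    result = {}
    for pair in overrides.split(","):
        key, _, value = pair.partition("=")
        result[key] = {"True": True, "False": False}.get(value, value)
    return result

print(parse_hparams("distributed_run=True,fp16_run=True"))
# {'distributed_run': True, 'fp16_run': True}
```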

Inference demo

  1. jupyter notebook --ip=127.0.0.1 --port=31337
  2. Load inference.ipynb
  3. (Optional) Download the published WaveGlow model

Related repos

WaveGlow: a faster-than-real-time flow-based generative network for speech synthesis.

Acknowledgements

This project is a slight modification of Mellotron, developed by Rafael Valle, Jason Li, Ryan Prenger and Bryan Catanzaro.

In turn, Mellotron uses code from repos by Keith Ito, Prem Seetharaman, Chengqi Deng, and Patrice Guyot, as described in our code.
