# Tacotron2: WaveNet-based text-to-speech demo

- Tacotron2 (mel-spectrogram prediction part): https://github.com/Rayhane-mamah/Tacotron-2
- WaveNet: https://github.com/r9y9/wavenet_vocoder

This is a proof of concept for Tacotron2 text-to-speech synthesis. The models used here were trained on the [LJSpeech dataset](https://keithito.com/LJ-Speech-Dataset/).

**Notice**: Waveform generation is very slow because it uses naive autoregressive generation rather than the parallel generation method described in [Parallel WaveNet](https://arxiv.org/abs/1711.10433).
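
To illustrate why naive autoregressive generation is slow, here is a minimal sketch. The `toy_next_sample` function is a hypothetical stand-in for a WaveNet forward pass, not the actual model. The point is only that each sample depends on the samples before it, so the loop cannot be parallelized over time.

```python
import math


def toy_next_sample(history, receptive_field=4):
    """Hypothetical stand-in for one WaveNet forward pass.

    Predicts the next sample from the last `receptive_field` samples.
    A real model would run a deep network here instead.
    """
    context = history[-receptive_field:]
    return math.tanh(0.9 * sum(context) / len(context) + 0.1)


def generate_autoregressive(num_samples, initial=0.0):
    """Generate samples one at a time; step t needs the output of step t-1."""
    samples = [initial]
    for _ in range(num_samples):
        samples.append(toy_next_sample(samples))
    return samples[1:]  # drop the initial seed value


waveform = generate_autoregressive(8)
print(len(waveform))  # one model call per output sample
```

With a real vocoder, every iteration of that loop is a full network evaluation, which is where the long generation times come from.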

**Estimated time to complete**:  2~3 hours 

Modified for GPU inference in Google Colab by **[Hyungon Ryu](mailto://hryu@nvidia.com)** | Sr. Solution Architect
- Estimated setup time: about 10 minutes, depending on network traffic.
  - Framework
    - Install CUDA 9.0
    - Install tensorflow-gpu == 1.9.0
    - Install pytorch == 0.4.1
  - git clone
    - Tacotron2
    - WaveNet

- Generation time: about 26 minutes for a 1-second sentence.
  - Mel generation: 20 sec
  - Waveform generation: 25 min
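
As a rough sanity check on these timings, the arithmetic can be worked through as follows. The 22,050 Hz sample rate is an assumption based on LJSpeech (the actual value comes from `hparams.sample_rate` in the WaveNet preset), and the timings are the approximate figures quoted above, not new measurements.

```python
# Assumed values: 22,050 Hz is the LJSpeech sample rate; the timings are the
# rough figures quoted in the list above.
sample_rate = 22050                # samples per second of audio (assumption)
audio_seconds = 1.0                # length of the synthesized sentence
wave_generation_seconds = 25 * 60  # "25 min" of waveform generation

num_samples = int(sample_rate * audio_seconds)
samples_per_second = num_samples / wave_generation_seconds
real_time_factor = wave_generation_seconds / audio_seconds

print(f"{num_samples} samples to generate")
print(f"{samples_per_second:.1f} samples generated per wall-clock second")
print(f"{real_time_factor:.0f}x slower than real time")
```

Under these assumptions the vocoder produces on the order of 15 samples per second, which is why short sentences are recommended for a first run.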

In [0]:
!nvidia-smi

## Setup

### Install dependencies

In [0]:
%%bash
pip3 uninstall -y tensorflow tensorflow-gpu pytorch torch
rm -rf cuda-repo*
wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda-repo-ubuntu1704-9-0-local_9.0.176-1_amd64-deb
apt-get install -y dirmngr
dpkg -i cuda-repo-ubuntu1704-9-0-local_9.0.176-1_amd64-deb
apt-key add /var/cuda-repo-9-0-local/7fa2af80.pub
apt-get update
apt-get install  -y --no-install-recommends  \
 cuda-core-9-0 \
 cuda-cublas-9-0 cuda-cublas-dev-9-0 cuda-cudart-9-0 cuda-cudart-dev-9-0 \
 cuda-cufft-9-0 cuda-cufft-dev-9-0 cuda-curand-9-0 cuda-curand-dev-9-0 \
 cuda-cusolver-9-0 cuda-cusolver-dev-9-0 cuda-cusparse-9-0 \
 cuda-cusparse-dev-9-0 \
 cuda-libraries-9-0 cuda-libraries-dev-9-0 \
 cuda-misc-headers-9-0 cuda-npp-9-0 cuda-npp-dev-9-0 \
 cuda-nvgraph-9-0 cuda-nvgraph-dev-9-0 cuda-nvml-dev-9-0 cuda-nvrtc-9-0 \
 cuda-nvrtc-dev-9-0
rm -rf cuda-repo*
rm -rf wget-log*
pip3 install -q tensorflow-gpu==1.9.0  
pip3 install -q torch

In [0]:
%%time
import os
from os.path import exists, join, expanduser

os.chdir(expanduser("~"))

wavenet_dir = "wavenet_vocoder"
if not exists(wavenet_dir):
  ! git clone https://github.com/r9y9/$wavenet_dir
    
taco2_dir = "Tacotron-2"
if not exists(taco2_dir):
  ! git clone https://github.com/r9y9/$taco2_dir
  ! cd $taco2_dir && git checkout -B wavenet3 origin/wavenet3

In [0]:
# Install dependencies
os.chdir(join(expanduser("~"), taco2_dir))
!pip install -q -r requirements.txt

os.chdir(join(expanduser("~"), wavenet_dir))
!pip install -q -e '.[train]'

In [0]:
import torch
import tensorflow
print('TensorFlow version (for Tacotron2):', tensorflow.__version__)
print('PyTorch version (for WaveNet):', torch.__version__)

In [0]:
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config )

In [0]:
!nvidia-smi

### Download pretrained models

#### Tacotron2 (mel-spectrogram prediction part)

In [0]:
%%time
os.chdir(join(expanduser("~"), taco2_dir))
! mkdir -p logs-Tacotron
if not exists("logs-Tacotron/pretrained"):
  ! curl -O -L "https://www.dropbox.com/s/vx7y4qqs732sqgg/pretrained.tar.gz"
  ! tar xzvf pretrained.tar.gz
  ! mv pretrained logs-Tacotron

#### WaveNet

In [0]:
%%time
os.chdir(join(expanduser("~"), wavenet_dir))
wn_preset = "20180510_mixture_lj_checkpoint_step000320000_ema.json"
wn_checkpoint_path = "20180510_mixture_lj_checkpoint_step000320000_ema.pth"

if not exists(wn_preset):
  !curl -O -L "https://www.dropbox.com/s/0vsd7973w20eskz/20180510_mixture_lj_checkpoint_step000320000_ema.json"
if not exists(wn_checkpoint_path):
  !curl -O -L "https://www.dropbox.com/s/zdbfprugbagfp2w/20180510_mixture_lj_checkpoint_step000320000_ema.pth"
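
Downloads from share links can occasionally yield an HTML page or a truncated file rather than the expected content, which only surfaces later as a confusing load error. A minimal sketch of a sanity check follows, assuming only that the preset is a JSON object. `looks_like_json_preset` is a hypothetical helper and is not part of either repository.

```python
import json
import os


def looks_like_json_preset(path):
    """Return True if `path` exists and parses as a JSON object."""
    if not os.path.isfile(path):
        return False
    try:
        with open(path, encoding="utf-8") as f:
            return isinstance(json.load(f), dict)
    except ValueError:
        # json.JSONDecodeError and UnicodeDecodeError are both subclasses of
        # ValueError; an HTML page or a truncated download ends up here.
        return False
```

In this notebook it could be called as `looks_like_json_preset(wn_preset)` right after the download cell.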

## Input texts to be synthesized

Choose your favorite sentences :)

In [0]:
os.chdir(join(expanduser("~"), taco2_dir))

In [0]:
%%bash
cat << EOS > text_list.txt
This will was a deliberate forgery.
EOS

cat text_list.txt

## Mel-spectrogram prediction by Tacotron2

In [0]:
# Remove old output files if they exist
!rm -rf tacotron_output

In [0]:
%%time
!python synthesize.py --model='Tacotron' --mode='eval' \
  --hparams='symmetric_mels=False,max_abs_value=4.0,power=1.1,outputs_per_step=1' \
  --text_list=./text_list.txt

In [0]:
!ls -alh ../Tacotron-2/tacotron_output/eval
!ls -alh ../Tacotron-2/tacotron_output/logs-eval/plots
!ls -alh ../Tacotron-2/tacotron_output/logs-eval/wavs

In [0]:
# Copy outputs to /content so they show up in the Colab file browser
!cp ../Tacotron-2/tacotron_output/eval/speech-mel-00001.npy /content/.
!cp ../Tacotron-2/tacotron_output/logs-eval/wavs/* /content/.
!cp ../Tacotron-2/tacotron_output/logs-eval/plots/* /content/.

## Waveform synthesis by WaveNet

In [0]:
%%time
%%bash
pip install ipykernel

In [0]:
import librosa.display
import IPython
from IPython.display import Audio
import numpy as np
import torch

In [0]:
%%time
os.chdir(join(expanduser("~"), wavenet_dir))

# Setup WaveNet vocoder hparams
from hparams import hparams
with open(wn_preset) as f:
    hparams.parse_json(f.read())

# Setup WaveNet vocoder
from train import build_model
from synthesis import wavegen
import torch

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

model = build_model().to(device)
print("Load checkpoint from {}".format(wn_checkpoint_path))
# map_location lets the checkpoint load even when no GPU is available
checkpoint = torch.load(wn_checkpoint_path, map_location=device)
model.load_state_dict(checkpoint["state_dict"])

In [0]:
%%time
from glob import glob
from tqdm import tqdm

with open("../Tacotron-2/tacotron_output/eval/map.txt") as f:
  maps = f.readlines()
# Strip only the line ending so a final line without "\n" keeps its last character
maps = list(map(lambda x: x.rstrip("\n").split("|"), maps))
# filter out invalid ones
maps = list(filter(lambda x:len(x) == 2, maps))

print("List of texts to be synthesized")
for idx, (text,_) in enumerate(maps):
  print(idx, text)
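
The parsing above assumes each line of `map.txt` has the form `text|path/to/mel.npy`. A slightly more defensive sketch of the same idea is shown here. `parse_map_lines` is a hypothetical helper, and the file format is inferred from the code in this notebook rather than from the Tacotron-2 documentation.

```python
def parse_map_lines(lines):
    """Parse `text|mel_path` lines, skipping anything that does not match."""
    pairs = []
    for line in lines:
        # rstrip only removes line endings, so a final line without a
        # trailing newline does not lose its last character.
        fields = line.rstrip("\r\n").split("|")
        if len(fields) == 2 and all(fields):
            pairs.append((fields[0], fields[1]))
    return pairs


# Illustrative input; the mel path mirrors the output layout used in this notebook.
example = [
    "This will was a deliberate forgery.|tacotron_output/eval/speech-mel-00001.npy\n",
    "malformed line without a separator\n",
]
for idx, (text, mel_path) in enumerate(parse_map_lines(example)):
    print(idx, text, mel_path)
```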

In [0]:
print(model)

### Waveform generation

**Note**: This will take hours to finish, depending on the number and length of the texts. Try short sentences first if you would like to hear samples quickly.

In [0]:
waveforms = []

for idx, (text, mel) in enumerate(maps):
  print("\n", idx, text)
  mel_path = join("../Tacotron-2", mel)
  c = np.load(mel_path)
  if c.shape[1] != hparams.num_mels:
    # np.swapaxes returns a new view; assign it back so c is (frames, num_mels)
    c = np.swapaxes(c, 0, 1)
  # Range [0, 4] was used for training Tacotron2 but WaveNet vocoder assumes [0, 1]
  c = np.interp(c, (0, 4), (0, 1))
 
  # Generate wave
  waveform = wavegen(model, c=c, g=None, fast=True, tqdm=tqdm)
  # Audio
  IPython.display.display(Audio(waveform, rate=hparams.sample_rate))
  waveforms.append(waveform)
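
The rescaling step in the loop maps Tacotron2's mel range [0, 4] linearly onto the [0, 1] range expected by the WaveNet vocoder, and `np.interp` also clamps values that fall outside the source range. Below is a pure-Python sketch of the same mapping for a single value; `rescale` is a hypothetical helper used only for illustration.

```python
def rescale(value, src=(0.0, 4.0), dst=(0.0, 1.0)):
    """Linearly map `value` from `src` to `dst`, clamping at the edges.

    Mirrors what np.interp(value, src, dst) does for a two-point mapping.
    """
    lo, hi = src
    clamped = min(max(value, lo), hi)
    fraction = (clamped - lo) / (hi - lo)
    return dst[0] + fraction * (dst[1] - dst[0])


for v in (-1.0, 0.0, 2.0, 4.0, 5.0):
    print(v, "->", rescale(v))
```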
  




## Summary: audio samples

In [0]:
for idx, (text, mel) in enumerate(maps):
  print(idx, text)
  IPython.display.display(Audio(waveforms[idx], rate=hparams.sample_rate))

For more information, please visit https://github.com/r9y9/wavenet_vocoder. More samples can be found at https://r9y9.github.io/wavenet_vocoder/.