# Kalmyk Text-to-Speech with Tacotron2 and Waveglow

This is a Kalmyk female-voice TTS demo built on the open-source projects [NVIDIA/tacotron2](https://github.com/NVIDIA/tacotron2) and [NVIDIA/waveglow](https://github.com/NVIDIA/waveglow).

You can download the Kalmyk TTS dataset from [mongolian-nlp](https://github.com/tugstugi/mongolian-nlp#datasets).

For other deep-learning Colab notebooks, visit [tugstugi/dl-colab-notebooks](https://github.com/tugstugi/dl-colab-notebooks).

## Install Tacotron2 and Waveglow

In [1]:
#
# clone Tacotron2/Waveglow
#
%tensorflow_version 1.x
import os
from os.path import exists, join, basename, splitext

git_repo_url = 'https://github.com/NVIDIA/tacotron2.git'
project_name = splitext(basename(git_repo_url))[0]
if not exists(project_name):
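  # clone with submodules (WaveGlow is checked out under tacotron2/waveglow)
  !git clone -q --recursive {git_repo_url}
  # install the Python dependencies; a bare "!cd" would have no lasting effect
  # because every "!" command runs in its own subshell
  !pip install -q librosa unidecode gdown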
  
import sys
sys.path.append(join(project_name, 'waveglow/'))
sys.path.append(project_name)

#
# download pretrained models
#
tacotron2_pretrained_model = 'kalmyk_tacotron2_19000'
if not exists(tacotron2_pretrained_model):
  # download the Tacotron2 pretrained model
  !gdown https://drive.google.com/uc?id=1MFzP3Xxwd0NQsew_rszrjfNOwB6gyRBX
waveglow_pretrained_model = 'waveglow_256channels_universal_v5.pt'
if not exists(waveglow_pretrained_model):
  # download the Waveglow pretrained model  
  !gdown https://drive.google.com/uc?id=1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF

#
# Update symbols
#

symbols = """
from text import cmudict
_pad        = '_'
_punctuation = '!,.? '
_special = '-'
_letters = 'абвгдеёжзийклмноөпрстуүфхцчшъыьэюяәһҗң'
_arpabet = ['@' + s for s in cmudict.valid_symbols]
symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters)
"""
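# overwrite the English symbol table shipped with Tacotron2; write as UTF-8 so
# the Cyrillic letters survive, and close the file before `text` is imported
with open(join(project_name, 'text', 'symbols.py'), 'wt', encoding='utf-8') as f:
  f.write(symbols)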

#
# initialize tacotron2/waveglow
#

import IPython.display as ipd
import matplotlib.pyplot as plt
import numpy as np
import torch

from hparams import create_hparams
from model import Tacotron2
from layers import TacotronSTFT
from audio_processing import griffin_lim
from text import text_to_sequence
from denoiser import Denoiser

def plot_data(data, figsize=(16, 4)):
    # show each 2-D array in `data` (e.g. mel spectrogram, alignment) side by side
    # squeeze=False keeps `axes` two-dimensional even when len(data) == 1
    fig, axes = plt.subplots(1, len(data), figsize=figsize, squeeze=False)
    for i in range(len(data)):
        axes[0][i].imshow(data[i], aspect='auto', origin='lower',
                          interpolation='none', cmap='viridis')

torch.set_grad_enabled(False)
        
# initialize Tacotron2 with the pretrained model
hparams = create_hparams()
hparams.sampling_rate = 22050
model = Tacotron2(hparams)
model.load_state_dict(torch.load(tacotron2_pretrained_model)['state_dict'])
_ = model.cuda().eval()#.half()

# initialize Waveglow with the pretrained model
# waveglow = torch.load(waveglow_pretrained_model)['model']
# WORKAROUND for: https://github.com/NVIDIA/tacotron2/issues/182
import json
from glow import WaveGlow
waveglow_config = json.load(open('%s/waveglow/config.json' % project_name))['waveglow_config']
waveglow = WaveGlow(**waveglow_config)
waveglow.load_state_dict(torch.load(waveglow_pretrained_model)['model'].state_dict())
_ = waveglow.cuda().eval()#.half()
for k in waveglow.convinv:
    k.float()
denoiser = Denoiser(waveglow)

#
# synthesizer
#

def synthesize(text, sigma=0.666, strength=0.01):
  # encode the text as a batch containing one sequence of symbol IDs
  sequence = np.array(text_to_sequence(text, ['basic_cleaners']))[None, :]
  sequence = torch.from_numpy(sequence).long().cuda()

  # text -> mel spectrogram (Tacotron2) -> waveform (WaveGlow) -> denoised waveform
  mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)
  audio = waveglow.infer(mel_outputs_postnet, sigma=sigma)
  # pass the caller's denoiser strength through instead of a hard-coded 0.01
  audio_denoised = denoiser(audio, strength=strength)[:, 0]
  return ipd.Audio(audio_denoised.cpu().numpy(), rate=hparams.sampling_rate)

TensorFlow 1.x selected.
Downloading...
From: https://drive.google.com/uc?id=1MFzP3Xxwd0NQsew_rszrjfNOwB6gyRBX
To: /content/kalmyk_tacotron2_19000
338MB [00:04, 75.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF
To: /content/waveglow_256channels_universal_v5.pt
676MB [00:06, 99.5MB/s]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.


## Synthesize

In [2]:
#@markdown * allowed characters: абвгдеёжзийклмноөпрстуүфхцчшъыьэюяәһҗң

text = "Кезәнәс нааран мана өөрднр Цаһан Сар байриг темдглдг йоста. Эн байр сарин литәр шин җилиг угтҗах байр болҗана." #@param {type:"string"}
synthesize(text)
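
Characters outside the allowed list are not in the symbol table, so it can help to check an input before synthesizing it. The following is a minimal sketch, assuming the `basic_cleaners` pipeline only lowercases the text and collapses whitespace; `find_unsupported_chars` is a hypothetical helper and not part of Tacotron2.

```python
import re

# Symbol inventory copied from the setup cell (the padding symbol '_' is left out).
ALLOWED = set('-!,.? ' + 'абвгдеёжзийклмноөпрстуүфхцчшъыьэюяәһҗң')

def find_unsupported_chars(text):
    """Hypothetical helper: list characters that fall outside the symbol set."""
    # Assumption: mirrors `basic_cleaners` (lowercase, collapse whitespace).
    cleaned = re.sub(r'\s+', ' ', text.lower())
    unsupported = []
    for ch in cleaned:
        if ch not in ALLOWED and ch not in unsupported:
            unsupported.append(ch)
    return unsupported

print(find_unsupported_chars('Эн байр 2020!'))
```

An empty list means every character has a symbol ID. Digits, Latin letters, and other punctuation show up in the result and would need to be spelled out or removed by hand.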

In [3]:
#@markdown You can also synthesize Mongolian text. The generated speech will have a Kalmyk accent.

text = "Энэ асуудалд иргэд нэгдсэн ойлголттой биш байгаа." #@param {type:"string"}
synthesize(text)
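
Mongolian input works because Mongolian Cyrillic shares nearly all of its letters with the Kalmyk symbol set. The sketch below compares the two alphabets. The Mongolian alphabet string is supplied here for illustration and does not appear in the notebook.

```python
# Kalmyk letters copied from the setup cell.
kalmyk_letters = set('абвгдеёжзийклмноөпрстуүфхцчшъыьэюяәһҗң')

# Assumption: the standard 35-letter Mongolian Cyrillic alphabet.
mongolian_letters = set('абвгдеёжзийклмноөпрстуүфхцчшщъыьэюя')

# Letters a Mongolian text may contain that the Kalmyk model has no symbol for.
missing = sorted(mongolian_letters - kalmyk_letters)
print(missing)
```

Under these assumptions only `щ` is missing. It occurs mainly in loanwords, so most Mongolian sentences map onto the symbol table completely.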