# Mongolian Text To Speech

This is an open-source Mongolian text-to-speech (TTS) system that implements the following paper:
```
Hideyuki Tachibana, Katsuya Uenoyama, Shunsuke Aihara
Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
https://arxiv.org/abs/1710.08969
```

The implementation lives in this repository: [https://github.com/tugstugi/pytorch-dc-tts](https://github.com/tugstugi/pytorch-dc-tts). The models were trained on a Mongolian Bible audiobook.

## Setup

### Install dependencies

In [0]:
import os
from os.path import exists, join, expanduser

project_name = "pytorch-dc-tts"
if not exists(project_name):
  ! git clone --quiet https://github.com/tugstugi/{project_name}
  ! cd {project_name} && pip install -q -r requirements.txt

### Download pretrained models

In [0]:
# download text2mel
if not exists("mbspeech-text2mel.pth"):
  ! wget -q -O mbspeech-text2mel.pth https://www.dropbox.com/s/wu26k6tu5hz8hq1/step-200K.pth

# download SSRN
if not exists("mbspeech-ssrn.pth"):
  ! wget -q -O mbspeech-ssrn.pth https://www.dropbox.com/s/tel0xcqa7kkwqze/step-165K.pth

## Synthesize

### Prepare models


In [0]:
import sys
sys.path.append(project_name)

import warnings
warnings.filterwarnings("ignore")  # ignore warnings in this notebook

import numpy as np
import torch

from tqdm import tqdm
import IPython
from IPython.display import Audio

from hparams import HParams as hp
from audio import save_to_wav
from models import Text2Mel, SSRN
from datasets.mb_speech import vocab, idx2char, get_test_data, number2word

In [0]:
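torch.set_grad_enabled(False)  # inference only, no gradients needed

# torch.load returns an object exposing state_dict(); copy its weights into fresh models.
# map_location="cpu" lets checkpoints saved on a GPU load on a CPU-only machine.
text2mel = Text2Mel(vocab)
text2mel.load_state_dict(torch.load("mbspeech-text2mel.pth", map_location="cpu").state_dict())
text2mel = text2mel.eval()
ssrn = SSRN()
ssrn.load_state_dict(torch.load("mbspeech-ssrn.pth", map_location="cpu").state_dict())
ssrn = ssrn.eval()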

### Allowed characters

абвгдеёжзийклмноөпрстуүфхцчшъыьэюя-.,!?
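
The character list above can be turned into a small filter for checking input text before synthesis. The following is a minimal sketch, not the repository's own normalizer: `ALLOWED` and `keep_allowed` are hypothetical names, and treating the space as an allowed character is an assumption based on the spaces that survive in the printed sentences at the end of this notebook.

```python
# Allowed characters copied from the list above; the trailing space is an assumption.
ALLOWED = set("абвгдеёжзийклмноөпрстуүфхцчшъыьэюя-.,!? ")

def keep_allowed(text):
    """Drop every character whose lowercase form is not in ALLOWED,
    then collapse the runs of spaces left behind by removed tokens."""
    kept = "".join(c for c in text if c.lower() in ALLOWED)
    return " ".join(kept.split())

print(keep_allowed("Монголын сайхан орон. ABC 42"))
```

Latin letters and digits are simply discarded here. The synthesis cell further down handles digits differently: it first converts digit-only tokens to words with `number2word` and only then filters by `vocab`.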

### Sentences to synthesize

In [0]:
SENTENCES = [
    "Хэнтий, Хангай, Соёны өндөр сайхан нуруунууд. Хойд зүгийн чимэг болсон ой хөвч уулнууд.",
    "Мэнэн, Шарга, Номины өргөн их говиуд. Өмнө зүгийн манлай болсон элсэн манхан далайнууд.", 
    "Энэ бол миний төрсөн нутаг. Монголын сайхан орон."
]

### Synthesize on CPU

In [0]:
# synthesize the sentences one by one because batch processing has a bug
for i in range(len(SENTENCES)):
    # spell out digit-only tokens as words, then drop characters outside the vocabulary
    sentence = ' '.join([number2word(s) if s.isdigit() else s for s in SENTENCES[i].split()])
    normalized_sentence = "".join([c if c.lower() in vocab else '' for c in sentence])
    print(normalized_sentence)

    sentences = [normalized_sentence]
    max_N = len(normalized_sentence)
    L = torch.from_numpy(get_test_data(sentences, max_N))
    zeros = torch.from_numpy(np.zeros((1, hp.n_mels, 1), np.float32))
    Y = zeros  # start from a single all-zero mel frame
    A = None

    # generate mel frames autoregressively, at most hp.max_T steps
    for t in range(hp.max_T):
        _, Y_t, A = text2mel(L, Y, monotonic_attention=True)
        Y = torch.cat((zeros, Y_t), -1)
        # index of the character the newest frame attends to most strongly
        _, attention = torch.max(A[0, :, -1], 0)
        attention = attention.item()
        if L[0, attention] == vocab.index('E'):  # EOS
            break

    # SSRN upsamples the coarse mel spectrogram to a full spectrogram
    _, Z = ssrn(Y)

    Z = Z.cpu().detach().numpy()
    save_to_wav(Z[0, :, :].T, '%d.wav' % (i + 1))
    IPython.display.display(Audio('%d.wav' % (i + 1), rate=hp.sr))

Хэнтий, Хангай, Соёны өндөр сайхан нуруунууд. Хойд зүгийн чимэг болсон ой хөвч уулнууд.


Мэнэн, Шарга, Номины өргөн их говиуд. Өмнө зүгийн манлай болсон элсэн манхан далайнууд.


Энэ бол миний төрсөн нутаг. Монголын сайхан орон.
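
The synthesis loop above stops as soon as the newest mel frame attends most strongly to the end-of-sentence character `E`. The toy sketch below isolates that stopping rule using NumPy only. `toy_attention` is a hypothetical stand-in for the attention matrix returned by `text2mel`, with a peak that advances one character every two frames; it is an illustration of the rule, not the model's actual behavior.

```python
import numpy as np

EOS = "E"  # end-of-sentence marker, mirroring vocab.index('E') in the loop above

def toy_attention(n_chars, n_frames):
    """Hypothetical attention matrix of shape (n_chars, n_frames):
    the peak moves one character to the right every two frames."""
    A = np.full((n_chars, n_frames), 0.01, dtype=np.float32)
    for t in range(n_frames):
        A[min(t // 2, n_chars - 1), t] = 1.0
    return A

def frames_until_eos(text, max_T=50):
    """Grow the output one frame at a time and stop once the newest
    frame's strongest attention falls on the EOS character."""
    for t in range(1, max_T + 1):
        A = toy_attention(len(text), t)
        attended = int(np.argmax(A[:, -1]))  # character the newest frame looks at
        if text[attended] == EOS:
            return t
    return max_T  # EOS never reached: fall back to the frame limit

print(frames_until_eos("abcE"))  # 7
```

If the attention never reaches `E`, the loop runs to the frame limit, which is why the real loop is also bounded by `hp.max_T`.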
