# Text-to-speech demo

This is a quick text-to-speech (TTS) demo. It does not include an option for custom voice cloning.

The process is split into two steps:
1. A Tacotron 2 model generates mel spectrograms from text, and
2. A WaveGlow model synthesizes speech from those mel spectrograms.

<img src="https://pytorch.org/assets/images/tacotron2_diagram.png" alt="alt" width="30%"/>

This implementation of the Tacotron 2 model differs from the one described in the paper: it uses Dropout instead of Zoneout to regularize the LSTM layers. In addition, WaveGlow replaces the WaveNet vocoder used in the paper, for faster training and inference.
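
To make the Dropout/Zoneout difference concrete, here is a minimal sketch of both regularizers applied to a single hidden-state vector. It is plain Python for illustration only and is not the code used by the NVIDIA implementation; `dropout` and `zoneout` are hypothetical helper names, and the probabilities in the usage lines are arbitrary.

```python
import random


def dropout(h_new, p, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(h_new)
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in h_new]


def zoneout(h_prev, h_new, p, training=True, rng=random):
    """Zoneout: each unit keeps its previous value with probability p."""
    if not training:
        # At inference time, use the expected value of the random mask.
        return [p * a + (1.0 - p) * b for a, b in zip(h_prev, h_new)]
    return [a if rng.random() < p else b for a, b in zip(h_prev, h_new)]


# Illustrative values only: a previous and a freshly computed hidden state.
h_prev = [1.0, 1.0, 1.0, 1.0]
h_new = [3.0, 3.0, 3.0, 3.0]
rng = random.Random(0)
print(dropout(h_new, p=0.5, rng=rng))          # units are either 0.0 or rescaled
print(zoneout(h_prev, h_new, p=0.5, rng=rng))  # units are either old or new values
```

Dropout discards information by zeroing units, whereas Zoneout carries some units over unchanged from the previous time step.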

This notebook draws heavily on NVIDIA's [demo colab](https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/nvidia_deeplearningexamples_tacotron2.ipynb#scrollTo=thermal-voice).

Import the required libraries.

In [1]:
import datetime
import os
from typing import Any

import torch
from scipy.io.wavfile import write

Let's define the TTS class.

In [2]:
class TTS:
    """ Text to speech using Tacotron2 and WaveGlow
    """
    def __init__(self):
        self.tacotron2 = None
        self.waveglow = None
        self.utils = None
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'

    def load_models(self):
        # load pretrained tacotron2 model
        tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp16')
        self.tacotron2 = tacotron2.to(self.device)
        self.tacotron2.eval()
        # load pretrained waveglow model
        waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow', model_math='fp16')
        waveglow = waveglow.remove_weightnorm(waveglow)
        self.waveglow = waveglow.to(self.device)
        self.waveglow.eval()
        # load utils
        self.utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')

    def run_inference(self, text: str, mode='save', filename:str = '', rate=22050):
        sequences, lengths = self.utils.prepare_input_sequence([text])
        with torch.no_grad():
            mel, _, _ = self.tacotron2.infer(sequences, lengths)
            audio = self.waveglow.infer(mel)
        audio_numpy = audio[0].data.cpu().numpy()

        if mode=='save':
            # fall back to a timestamp when no filename is given
            if not filename:
                filename = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
            # create the output directory if needed, then always save the audio
            os.makedirs('out', exist_ok=True)
            write(f"out/{filename}.wav", rate, audio_numpy)
        elif mode=='play':
            try:
                # vscode notebook specific workaround
                from vscode_audio import Audio
                return Audio(audio_numpy, rate)
            except ImportError:
                from IPython.display import Audio
                return Audio(audio_numpy, rate=rate)
        else:
            raise ValueError(f"Invalid mode {mode}: choose from 'save' or 'play' instead.")
        return

Initialize a TTS instance and load the models. If this is your first run (or the models are not cached yet), this might take a while.

In [3]:
tts = TTS()
tts.load_models()

Using cache found in /home/richie/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub
Using cache found in /home/richie/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at  ../aten/src/ATen/native/BatchLinearAlgebra.cpp:1937.)
  W = torch.qr(torch.FloatTensor(c, c).normal_())[0]
Using cache found in /home/richie/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub


You can now run inference. You can either save the audio to a file or play it in the notebook.

In [4]:
tts.run_inference('Hello, nice to meet you. My name is Richie!', mode='play')
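
A note on the saved file: `scipy.io.wavfile.write` stores samples in the dtype of the array it receives, so a floating-point waveform is saved as a floating-point WAV. If a player expects 16-bit PCM instead, the waveform can be converted first. The sketch below uses only the standard library and assumes the samples are floats in roughly [-1, 1]; `float_to_pcm16` and `pcm16_wav_bytes` are hypothetical helpers, not part of this notebook or of the NVIDIA utilities.

```python
import io
import struct
import wave


def float_to_pcm16(samples):
    """Clip floats to [-1, 1] and scale them to signed 16-bit integers."""
    return [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]


def pcm16_wav_bytes(samples, rate=22050):
    """Return the bytes of a mono 16-bit PCM WAV file for the given samples."""
    pcm = float_to_pcm16(samples)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes per sample = 16 bit
        wav.setframerate(rate)   # same default rate as run_inference
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buffer.getvalue()
```

The returned bytes can be written to disk with an ordinary binary file write, for example `open('out/hello.wav', 'wb').write(pcm16_wav_bytes(samples))`, where `samples` stands for the generated waveform.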