# Text to Speech with Tacotron2 and WaveGlow

---

[Github](https://github.com/eugenesiow/practical-ml/blob/master/notebooks/Remove_Image_Background_DeepLabV3.ipynb) | More Notebooks @ [eugenesiow/practical-ml](https://github.com/eugenesiow/practical-ml)

---

Notebook to convert (synthesize) an input piece of text into a speech audio file automatically.

[Text-To-Speech synthesis](https://paperswithcode.com/task/text-to-speech-synthesis) is the task of converting written text in natural language to speech.

The pipeline combines a [Tacotron 2](https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/) model, which produces mel spectrograms from input text using an encoder-decoder architecture, with a [WaveGlow](https://pytorch.org/hub/nvidia_deeplearningexamples_waveglow/) flow-based model, which consumes the mel spectrograms to generate speech.

Both steps in the pipeline use pre-trained models published by NVIDIA on the PyTorch Hub. Both the Tacotron 2 and WaveGlow models are trained on the publicly available [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) dataset.

Do note that the models are under a [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause).

The notebook is structured as follows:
* Setting up the Environment
* Using the Model (Running Inference)
* Apply Speech Enhancement/Noise Reduction

# Setting up the Environment

#### Ensure we have a GPU runtime

If you're running this notebook in Google Colab, select `Runtime` > `Change Runtime Type` from the menubar. Ensure that `GPU` is selected as the `Hardware accelerator`. This will allow us to use the GPU to run inference with the models subsequently.
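
Before loading the models, it can help to confirm that the runtime actually exposes a GPU. The sketch below is one way to do that. `pick_device` is a hypothetical helper name, not part of the notebook, and it falls back to the CPU if PyTorch is missing or no CUDA device is visible.

```python
import importlib.util


def pick_device() -> str:
    """Return 'cuda' when PyTorch can see a GPU, otherwise 'cpu'."""
    # Guard the import so the check also works before PyTorch is installed.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The cells in this notebook pass `'cuda'` directly, so a result of `cpu` here suggests the runtime type still needs to be changed.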

#### Setup Dependencies

We need to install `unidecode` for this example to run, so execute the command below to set up the dependencies.

In [1]:
!pip install -q unidecode


# Using the Model (Running Inference)

Now we want to load the Tacotron2 and WaveGlow models from PyTorch hub and prepare the models for inference.

Specifically we are running the following steps:

* `torch.hub.load()` - Downloads and loads the pre-trained models from PyTorch Hub. In particular, we load the `nvidia_tacotron2` and `nvidia_waveglow` entry points from the `nvidia/DeepLearningExamples:torchhub` repository.
* `.to('cuda')` - We move both models to the `GPU` for inference.

In [2]:
import torch

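# Load the pre-trained Tacotron 2 model and move it to the GPU
tacotron2 = torch.hub.load('nvidia/DeepLearningExamples:torchhub', 'nvidia_tacotron2')
tacotron2 = tacotron2.to('cuda')
tacotron2.eval()  # switch to inference mode

# Load the pre-trained WaveGlow model, remove weight normalization, and move it to the GPU
waveglow = torch.hub.load('nvidia/DeepLearningExamples:torchhub', 'nvidia_waveglow')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda')
waveglow.eval()  # switch to inference mode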

Downloading: "https://github.com/nvidia/DeepLearningExamples/zipball/torchhub" to /root/.cache/torch/hub/torchhub.zip
Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/tacotron2_pyt_ckpt_fp32/versions/19.09.0/files/nvidia_tacotron2pyt_fp32_20190427
  ckpt = torch.load(ckpt_file)
Using cache found in /root/.cache/torch/hub/nvidia_DeepLearningExamples_torchhub
Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/waveglow_ckpt_fp32/versions/19.09.0/files/nvidia_waveglowpyt_fp32_20190427
  ckpt = torch.load(ckpt_file)
  WeightNorm.apply(module, name, dim)


In [9]:
from IPython.display import Audio, display
import numpy as np
import torch

# Import the utility functions
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')

example_text = 'What is umbrage? According to the Oxford Languages dictionary, Umbrage is a noun that means offence or annoyance.'

# Preprocess the text using the utility function
sequences, lengths = utils.prepare_input_sequence([example_text])

# Move the sequence to the GPU
sequences = sequences.to(device='cuda', dtype=torch.int64)

# Run the models
with torch.no_grad():
    # infer() returns (mel, mel_lengths, alignments); the mel spectrogram comes first
    mel, _, _ = tacotron2.infer(sequences, lengths)

    # Run WaveGlow inference on the mel spectrogram (80 mel channels expected)
    audio = waveglow.infer(mel)

audio_numpy = audio[0].data.cpu().numpy()
rate = 22050

# Play the audio
display(Audio(audio_numpy, rate=rate))


Using cache found in /root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub

Now we define the `example_text` variable, a piece of text that we want to convert to a speech audio file. Next, we synthesize the audio.

* `utils.prepare_input_sequence()` - Creates a tensor representation of the input text (`example_text`) and returns it together with the sequence lengths.
* `tacotron2.infer()` - Tacotron 2 generates a mel spectrogram from the tensor representation produced in the previous step (`sequences`).
* `waveglow.infer()` - WaveGlow generates the audio waveform from the mel spectrogram.
* `display()` - The notebook then displays a playback widget for the audio sample, `audio_numpy`.
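
To make the first step more concrete, the sketch below shows the general idea of turning text into a sequence of integer symbol IDs. It is a simplified illustration only: the symbol table `SYMBOLS`, the helper `toy_text_to_sequence`, and the cleaning rules are assumptions made for this example and do not reproduce the symbol set or text cleaners used by the NVIDIA utilities.

```python
import unicodedata
from typing import List

# Hypothetical symbol table for illustration: padding, punctuation, letters.
SYMBOLS = ["_"] + list(" !',.?") + list("abcdefghijklmnopqrstuvwxyz")
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}


def toy_text_to_sequence(text: str) -> List[int]:
    """Lower-case, strip accents, and map each known character to an ID."""
    # Decompose accented characters, then drop the non-ASCII combining marks.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Characters outside the toy symbol table are skipped.
    return [SYMBOL_TO_ID[ch] for ch in ascii_text if ch in SYMBOL_TO_ID]


ids = toy_text_to_sequence("What is umbrage?")
print(ids)
print(len(ids))
```

In the notebook itself, the real conversion is handled by the utility functions loaded from the hub, which return the encoded sequences together with their lengths.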

In [3]:
from IPython.display import Audio, display
import numpy as np

example_text = 'What is umbrage? According to the Oxford Languages dictionary, Umbrage is a noun that means offence or annoyance.'

# preprocessing: load the hub utilities and convert the text to ID tensors
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence([example_text])

# run the models
with torch.no_grad():
    # infer() returns (mel, mel_lengths, alignments); only the mel spectrogram is needed
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)
audio_numpy = audio[0].data.cpu().numpy()
rate = 22050

display(Audio(audio_numpy, rate=rate))

The generated sample contains some slight noise, which can be reduced with a speech enhancement step to improve the quality of the speech. We try this in the next section. This is entirely optional.

# Apply Speech Enhancement/Noise Reduction

We use the simple and convenient LogMMSE algorithm (Log Minimum Mean Square Error) with the [logmmse library](https://github.com/wilsonchingg/logmmse).

In [None]:
!pip install -q logmmse

Run the LogMMSE algorithm on the generated audio (`audio_numpy`) and display the enhanced audio sample in an audio player.

In [None]:
import numpy as np
from logmmse import logmmse

enhanced = logmmse(audio_numpy, rate, output_file=None, initial_noise=1, window_size=160, noise_threshold=0.15)
display(Audio(enhanced, rate=rate))

Save the enhanced audio to file.

In [None]:
from scipy.io.wavfile import write

write('/content/audio.wav', rate, enhanced)
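
When given a floating-point array, `scipy.io.wavfile.write` stores floating-point WAV data, which some players do not handle. If that is a concern, one option is to convert the samples to 16-bit PCM first. The sketch below does this with the standard library only. `write_pcm16` is a hypothetical helper, and it assumes the samples are mono floats roughly in the range [-1, 1]. The demo uses a synthetic tone rather than the notebook's audio.

```python
import math
import os
import struct
import tempfile
import wave
from typing import Iterable


def write_pcm16(path: str, samples: Iterable[float], rate: int) -> int:
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        # Clip to the valid range, then scale to the int16 range.
        clipped = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2  # number of samples written


# Demo: a 0.1 second 440 Hz tone at the notebook's sampling rate.
demo_rate = 22050
tone = [0.5 * math.sin(2 * math.pi * 440 * n / demo_rate) for n in range(demo_rate // 10)]
demo_path = os.path.join(tempfile.gettempdir(), "pcm16_demo.wav")
n_written = write_pcm16(demo_path, tone, demo_rate)
print(n_written)
```

In the notebook, the equivalent call would be `write_pcm16('/content/audio.wav', enhanced, rate)`.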

We can connect to Google Drive with the following code. You can also click the `Files` icon on the left panel and click `Mount Drive` to mount your Google Drive.

The root of your Google Drive will be mounted to `/content/drive/My Drive/`. If you have problems mounting the drive, you can check out this [tutorial](https://towardsdatascience.com/downloading-datasets-into-google-drive-via-google-colab-bcb1b30b0166).

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

You can move the output files which are saved in the `/content/` directory to the root of your Google Drive.

In [None]:
import shutil
shutil.move('/content/audio.wav', '/content/drive/My Drive/audio.wav')
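
If the Drive is not mounted or the file was never written, the move fails with an error that can be hard to interpret. A small wrapper that checks both paths first makes the failure clearer. The sketch below is one way to write it. `move_if_possible` is a hypothetical helper, and the demo uses temporary folders as stand-ins for `/content/` and the Drive mount.

```python
import os
import shutil
import tempfile


def move_if_possible(src: str, dest_dir: str) -> str:
    """Move src into dest_dir and return the new path.

    Raises FileNotFoundError if the source file or destination folder is missing.
    """
    if not os.path.isfile(src):
        raise FileNotFoundError(f"Source file not found: {src}")
    if not os.path.isdir(dest_dir):
        raise FileNotFoundError(f"Destination folder not found (is Drive mounted?): {dest_dir}")
    dest = os.path.join(dest_dir, os.path.basename(src))
    return shutil.move(src, dest)


# Demo with temporary folders standing in for /content and the Drive mount.
src_dir = tempfile.mkdtemp()
drive_dir = tempfile.mkdtemp()
src_path = os.path.join(src_dir, "audio.wav")
with open(src_path, "wb") as f:
    f.write(b"placeholder")
new_path = move_if_possible(src_path, drive_dir)
print(new_path)
```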

More notebooks are available at [eugenesiow/practical-ml](https://github.com/eugenesiow/practical-ml). Do star the repository or drop us some feedback on how to improve the notebooks on the [GitHub repo](https://github.com/eugenesiow/practical-ml/).