
<img src="https://nv-adlr.github.io/images/waveglow_logo.png" width=300 align=center >




# Part1. Voice Synthesis with NVIDIA WaveGlow Model
 


by **Hyungon Ryu** | Sr. Solution Architect at NVIDIA


---



----



  **Content**
- **Part1. Voice Synthesis with NVIDIA  WaveGlow Model**
- Part2. Voice Synthesis with NVIDIA Tacotron2 + WaveGlow



In this Jupyter notebook, I demonstrate voice synthesis from Mel spectrograms with the WaveGlow model. You can reproduce the voice synthesis on a V100 GPU with the provided pretrained WaveGlow parameters. If you have already configured a Jupyter environment on a V100, you can replay the notebook within 10 minutes, including the time needed to download the weight file. With a Tesla T4 or Tesla V100, you can synthesize voice in real time.
Visit NVIDIA ADLR's WaveGlow [blog](https://nv-adlr.github.io/WaveGlow) to hear the sound quality of the WaveGlow model.


----


## Step1.  DevOps for Tesla V100

I assume you have already launched Jupyter using the steps below.

- step1. pull docker image from dockerhub 
```
docker pull hryu/pytorch:t3-1
```

- step2. run the container with GPU access and expose Jupyter on port 8888
```
nvidia-docker run -ti -v Target:/mnt/home/demo -p 8888:8888 hryu/pytorch:t3-1 bash
```

- step3. launch jupyter
```
cd /mnt/home/demo
jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root --NotebookApp.token='' --notebook-dir=/mnt/home/demo
```

- step4. access jupyter
Open Jupyter in a browser such as Chrome. If the IP address of the server is 10.10.10.10, the notebook is available at
`10.10.10.10:8888`
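
If the page does not load, it can help to first confirm that the port is reachable at all. The following is a minimal sketch using only the Python standard library; the helper name `is_port_open` is hypothetical, and `10.10.10.10` is just the example address from the step above.

```python
import socket

def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

# Example (replace the address with your own server):
# is_port_open("10.10.10.10", 8888)
```

A `False` result usually points to the `-p` port mapping or a firewall, not to Jupyter itself.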

#### check Tesla GPU
This demonstration runs on a Tesla V100.
You can see the assigned GPU with the simple command `nvidia-smi`.

In [1]:
!nvidia-smi | grep Tesla

|   0  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |


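#### system information and configuration
The `nvidia-smi` tool lets you modify the application clocks of the Tesla V100, whose maximum graphics clock is 1530 MHz.
Check the maximum application clocks with the `nvidia-smi -q` command. As the output below shows, changing the clocks requires sufficient permissions.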

In [2]:
%%bash
# check the environment
echo "Check H/W"
lscpu | grep 'CPU(s):            '
lscpu | grep GHz
echo "memory" && free -m | cut -c-49 |  head -n 2 
echo "storage" && df -h |  cut -c-60 | head -n 2
df -h |  grep '/dev/sda1'
echo " " && nvidia-smi -L | cut -c-17
echo "confure Max Application Clock forTesla V100"
nvidia-smi -ac 877,1530 && nvidia-smi -pm 1
echo " " &&echo "Check S/W"
cat /etc/*-release | grep PRETTY_NAME
python --version 
nvcc --version | grep  tools

Check H/W
CPU(s):                80
Model name:            Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
memory
              total        used        free      
Mem:         515894       49309       11147      
storage
Filesystem                                          Size  Us
overlay                                             440G  11
 
GPU 0: Tesla V100
confure Max Application Clock forTesla V100
The current user does not have permission to change clocks for GPU 00000000:0A:00.0.
 Run 'nvidia-smi -acp UNRESTRICTED' as root/admin to enable this option for all users.
Terminating early due to previous errors.
 
Check S/W
PRETTY_NAME="Ubuntu 16.04.4 LTS"
Cuda compilation tools, release 9.0, V9.0.176


Python 3.6.5 :: Anaconda, Inc.



### clone WaveGlow  Model


Copy NVIDIA's
[WaveGlow](https://github.com/NVIDIA/waveglow) model with the `git clone` command. Note that the WaveGlow repository uses tacotron2 as a submodule to create Mel spectrograms.

This notebook is based on commit [ f4c04e2 ](https://github.com/NVIDIA/waveglow/commit/f4c04e2d968de01b22d2fb092bbbf0cec0b6586f), the latest commit as of October 10, 2018, and on the environment available on that date.


**SKIP** this step if you have already cloned the repository while running Part1. Voice Synthesis with NVIDIA WaveGlow Model. This notebook uses the same repository.

In [2]:
%%bash
rm -rf waveglow
git clone https://github.com/NVIDIA/waveglow.git
cd waveglow
git submodule init
git submodule update

Submodule 'tacotron2' (http://github.com/NVIDIA/tacotron2) registered for path 'tacotron2'
Submodule path 'tacotron2': checked out 'fc0cf6a89a47166350b65daa1beaa06979e4cddf'


Cloning into 'waveglow'...
Cloning into 'tacotron2'...
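
The clone above takes whatever commit is current, while the text refers to a specific commit. As a sketch of how one might pin the checkout for reproducibility, the helper below builds the corresponding git commands. The function names `pinned_clone_commands` and `run_all` are hypothetical, and the actual run is left commented out because it needs network access.

```python
import subprocess

REPO_URL = "https://github.com/NVIDIA/waveglow.git"
COMMIT = "f4c04e2d968de01b22d2fb092bbbf0cec0b6586f"  # commit referenced in the text

def pinned_clone_commands(repo_url, commit, dest="waveglow"):
    """Build the git commands that clone a repo and pin it to one commit."""
    return [
        ["git", "clone", repo_url, dest],
        ["git", "-C", dest, "checkout", commit],
        # Check out the submodule revision recorded at that commit.
        ["git", "-C", dest, "submodule", "update", "--init"],
    ]

def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)

# run_all(pinned_clone_commands(REPO_URL, COMMIT))  # uncomment to execute
```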


### install requirements



The WaveGlow model has been tested with PyTorch 0.4.0. You also need libraries such as librosa to handle audio and Mel spectrogram files. Installation takes about one minute, depending on the network environment.

If you pulled my Docker image `hryu/pytorch:t3-1`, you don't need to install anything.

In [None]:
%%time
%%bash 
#pip install torch==0.4.0 matplotlib==2.1.0 tensorflow  inflect==0.2.5 \
# librosa==0.6.0 scipy==1.0.0 tensorboardX==1.1 Unidecode==1.0.22 pillow 

---

## Step2. Prepare WaveGlow Weight Files



### 2-1 WaveGlow weight from  NVIDIA ADLR


NVIDIA provides pre-trained WaveGlow weights for voice synthesis.



### 2-2 download checkpoint file directly from Google Drive
You can download the checkpoint file directly from Google Drive.

#### define python function 
I borrowed charlesreid1's [python code](https://gist.githubusercontent.com/charlesreid1/4f3d676b33b95fce83af08e4ec261822/raw/4ec8b6b6f306a70fc229d01404ded90162f56a82/get_drive_file.py).

In [3]:
import requests

def download_file_from_google_drive(id, destination):
    def get_confirm_token(response):
        for key, value in response.cookies.items():
            if key.startswith('download_warning'):
                return value

        return None

    def save_response_content(response, destination):
        CHUNK_SIZE = 32768

        with open(destination, "wb") as f:
            for chunk in response.iter_content(CHUNK_SIZE):
                if chunk: # filter out keep-alive new chunks
                    f.write(chunk)

    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)




#### download waveglow_old.pt (2GB)
It takes about 15 seconds to download the checkpoint file directly from Google Drive.

In [7]:
%%time
destination="/mnt/home/demo/waveglow_old.pt"
file_id="1cjKPHbtAMh_4HTHmuIGNkbOkPBD9qwhj"
download_file_from_google_drive(file_id, destination)

CPU times: user 3.46 s, sys: 4.26 s, total: 7.71 s
Wall time: 15.9 s
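
A download like this can fail quietly: if the confirmation step does not go through, the saved file may be a small HTML page instead of the checkpoint. That behavior is an assumption about Google Drive, not something shown in this notebook. The sketch below uses hypothetical helper names to check the size and the first bytes of the file before it is loaded.

```python
import os

def looks_like_html(path, probe_bytes=512):
    """Return True if the file starts like an HTML page rather than binary data."""
    with open(path, "rb") as f:
        head = f.read(probe_bytes).lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")

def check_download(path, min_bytes):
    """Return the file size, or raise ValueError if the download looks wrong."""
    if not os.path.isfile(path):
        raise ValueError("missing file: {}".format(path))
    size = os.path.getsize(path)
    if size < min_bytes:
        raise ValueError("file is only {} bytes".format(size))
    if looks_like_html(path):
        raise ValueError("file looks like an HTML page, not a checkpoint")
    return size

# Example for the 2GB checkpoint above (the threshold is an arbitrary choice):
# check_download("/mnt/home/demo/waveglow_old.pt", min_bytes=10**9)
```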


In [35]:
!ls -lah /mnt/home/git/waveglow-20181115/waveglow/ch/waveglow_4998

-rw-r--r--+ 1 25223 dip 167M Nov 15 00:28 /mnt/home/git/waveglow-20181115/waveglow/ch/waveglow_4998


Check the WaveGlow checkpoint file in the local storage of the COLAB VM.

In [8]:
%%bash
ls -alh "/mnt/home/demo/waveglow_old.pt"

-rw-r--r--+ 1 25223 dip 2.0G Nov 14 19:30 /mnt/home/demo/waveglow_old.pt



## Step3. Voice Synthesis from provided Mel files

### 3-1.  public Mel files

NVIDIA provides [Mel files](https://drive.google.com/file/d/1g_VXK2lpP9J25dQFhQwx7doWl_p20fXA/view?usp=sharing) generated from real voice on [github](https://github.com/NVIDIA/waveglow), so that you can reproduce the samples on the ADLR [WaveGlow page](https://nv-adlr.github.io/WaveGlow).






### 3-2 download Mel files 
It takes only about one second to download the Mel files (1.5MB) provided by the ADLR WaveGlow team.

In [None]:
%%time
destination="/mnt/home/demo/mel_spectrograms.zip"
file_id="1g_VXK2lpP9J25dQFhQwx7doWl_p20fXA"
download_file_from_google_drive(file_id, destination)

In [None]:
%%bash
ls -alh "/mnt/home/demo/mel_spectrograms.zip"

In [None]:
%%bash
rm -rf /mnt/home/demo/mel_spectrogram
ls /mnt/home/demo/mel_spectrogram

### 3-3. Decompress Mel files

An abnormal phenomenon was observed in COLAB. The root cause was that the compressed file includes some macOS-related files. Delete all MACOSX-related files that come out of the zip file.

In [None]:
%%bash
unzip mel_spectrograms.zip
rm -rf /mnt/home/demo/mel_spectrogram/.DS_Store
rm -rf __MACOSX 
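
As an alternative to deleting the macOS files after extraction, they can be skipped during extraction. This is a minimal sketch with Python's standard `zipfile` module; the function name `extract_without_macos_metadata` is hypothetical and is not part of the WaveGlow repository.

```python
import zipfile

def extract_without_macos_metadata(zip_path, dest_dir):
    """Extract zip_path into dest_dir, skipping __MACOSX/ and .DS_Store entries."""
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            parts = name.split("/")
            if "__MACOSX" in parts or parts[-1] == ".DS_Store":
                continue  # macOS metadata, not a Mel file
            archive.extract(name, dest_dir)
            extracted.append(name)
    return extracted

# Example with the paths used in this notebook:
# extract_without_macos_metadata("/mnt/home/demo/mel_spectrograms.zip", "/mnt/home/demo")
```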

### 3-4. Generate Audio

Now we synthesize voice from the provided Mel spectrograms. Note that loading the 2GB parameter file takes some time.

In [1]:
%%bash
rm -rf audio_mel_ref 
mkdir audio_mel_ref 
cd waveglow
python inference.py -f <(ls /mnt/home/demo/mel_spectrograms/*.pt) -w  /mnt/home/git/waveglow-20181115/waveglow/ch/waveglow_1428  -o /mnt/home/demo/audio_mel_ref   -s 0.6

/mnt/home/demo/audio_mel_ref/LJ001-0015.wav_synthesis.wav
/mnt/home/demo/audio_mel_ref/LJ001-0051.wav_synthesis.wav
/mnt/home/demo/audio_mel_ref/LJ001-0063.wav_synthesis.wav
/mnt/home/demo/audio_mel_ref/LJ001-0072.wav_synthesis.wav
/mnt/home/demo/audio_mel_ref/LJ001-0079.wav_synthesis.wav
/mnt/home/demo/audio_mel_ref/LJ001-0094.wav_synthesis.wav
/mnt/home/demo/audio_mel_ref/LJ001-0096.wav_synthesis.wav
/mnt/home/demo/audio_mel_ref/LJ001-0102.wav_synthesis.wav
/mnt/home/demo/audio_mel_ref/LJ001-0153.wav_synthesis.wav
/mnt/home/demo/audio_mel_ref/LJ001-0173.wav_synthesis.wav


  import imp


### 3-5. Compare Voice Quality

**Generated Voice** from provided Mel
Check one example, LJ001-0153.wav.

Sentence: "DUMMY/LJ001-0153.wav| only nominally so, however, in many cases, since when he uses a headline he counts that in,"

In [2]:
audio_file_synth = "/mnt/home/demo/audio_mel_ref/LJ001-0153.wav_synthesis.wav"
import IPython.display as ipd
ipd.Audio(audio_file_synth, rate=22050)


## Step4. Check Voice Synthesis Quality from Mel of Real Audio

You can also generate audio from real voice files. Visit NVIDIA ADLR's WaveGlow [blog](https://nv-adlr.github.io/WaveGlow) to hear the sound quality of the WaveGlow model. You can reproduce the results with the provided pretrained WaveGlow parameters.



In [None]:
%%time
destination="/mnt/home/demo/LJ001-0153.wav"
file_id="1kM_7q5dVGkf4CV97cc7rY07JLwB9VaAL"
download_file_from_google_drive(file_id, destination)

In [None]:
!ls -alh /mnt/home/demo/LJ001-0153.wav

### 4-2. Generate Mel from Real Audio
I create the Mel spectrogram from the real audio (LJ001-0153.wav) in the **`Mel_real`** folder, using the settings in config.json.

```
    "data_config": {
        "training_files":"train_files.txt",
        "segment_length": 16000,
        "sampling_rate": 22050,
        "filter_length": 1024,
        "hop_length": 256,
        "win_length": 1024,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0
    },
```
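
To get a feel for these settings, the short calculation below derives two quantities from the values in the config above: how many Mel frames cover one second of audio, and how long a training segment is. It is plain arithmetic on the listed numbers; the helper `samples_for_frames` is hypothetical and only approximates the relationship between frames and samples.

```python
sampling_rate = 22050   # samples per second
hop_length = 256        # samples between consecutive Mel frames
segment_length = 16000  # samples per training segment

frames_per_second = sampling_rate / hop_length
segment_seconds = segment_length / sampling_rate

def samples_for_frames(num_frames, hop=hop_length):
    """Approximate number of audio samples covered by num_frames Mel frames."""
    return num_frames * hop

print("{:.2f} Mel frames per second".format(frames_per_second))  # 86.13
print("{:.3f} s per training segment".format(segment_seconds))   # 0.726
```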

  

In [38]:
%%time
%%bash
rm -rf Mel_real
mkdir Mel_real
cd waveglow
ls /mnt/home/demo/LJ001-0153.wav > /mnt/home/demo/waveglow/test_files.txt
# mel2samp.py also reads the training_files entry (train_files.txt) from config.json
ls /mnt/home/demo/LJ001-0153.wav > /mnt/home/demo/waveglow/train_files.txt 
python mel2samp.py -f test_files.txt -o /mnt/home/demo/Mel_real -c config.json
ls /mnt/home/demo/Mel_real/

/mnt/home/demo/Mel_real/LJ001-0153.wav.pt
LJ001-0153.wav.pt
CPU times: user 0 ns, sys: 88 ms, total: 88 ms
Wall time: 2.22 s


### 4-3. Generate Synthetic Audio
About 6 seconds of voice can be generated in about 20 seconds. Most of that time is spent loading the 2GB weight file. The actual speech synthesis runs in real time, and saving the speech file takes some additional time.

In [63]:
%%time
%%bash
rm -rf audio_real
mkdir audio_real
ls /mnt/home/demo/Mel_real/*.pt > /mnt/home/demo/waveglow/mel_files_real.txt
cd waveglow
python inference.py -f mel_files_real.txt -w /mnt/home/git/waveglow-20181115/waveglow/ch/waveglow_15300 -o /mnt/home/demo/audio_real  -s 0.6

Traceback (most recent call last):
  File "inference.py", line 74, in <module>
    args.output_dir, args.sampling_rate, args.is_fp16)
  File "inference.py", line 36, in main
    waveglow = torch.load(waveglow_path)['model']
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/serialization.py", line 306, in load
    return _load(f, map_location, pickle_module)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/serialization.py", line 475, in _load
    result = unpickler.load()
AttributeError: Can't get attribute '_rebuild_parameter' on <module 'torch._utils' from '/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/_utils.py'>


CPU times: user 0 ns, sys: 148 ms, total: 148 ms
Wall time: 2.92 s
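
The traceback above says that `torch._utils` has no `_rebuild_parameter` attribute. One possible explanation, which is an assumption here and not something this notebook confirms, is that the checkpoint was saved with a newer PyTorch release than the one installed in the container. The sketch below only checks whether the installed PyTorch provides that attribute; the function name is hypothetical.

```python
import importlib

def torch_has_rebuild_parameter():
    """Return True/False if torch is importable, or None if it is not installed."""
    try:
        torch_utils = importlib.import_module("torch._utils")
    except ImportError:
        return None
    return hasattr(torch_utils, "_rebuild_parameter")

status = torch_has_rebuild_parameter()
if status is None:
    print("PyTorch is not installed in this environment")
elif status:
    print("torch._utils._rebuild_parameter is available")
else:
    print("torch._utils._rebuild_parameter is missing; "
          "the checkpoint may come from a newer PyTorch")
```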


### 4-4. Compare Audio Quality

Sentence: "DUMMY/LJ001-0153.wav| only nominally so, however, in many cases, since when he uses a headline he counts that in,"


**Real Audio**


In [40]:
import IPython.display as ipd
audio_file_real ="/mnt/home/demo/LJ001-0153.wav"
ipd.Audio(audio_file_real, rate=22050)

**Synthesized Audio**



In [52]:
audio_file_synth = "/mnt/home/demo/audio_real/LJ001-0153.wav_synthesis.wav"
ipd.Audio(audio_file_synth, rate=22050)

If the sound quality differs from the real audio, the preprocessing options may be set incorrectly, as discussed in [issue7](https://github.com/NVIDIA/waveglow/issues/7).


---

### Inference with FP16
With the `--is_fp16` option, the WaveGlow model can synthesize voice in half precision.

In [10]:
%%time
%%bash
rm -rf audio_real_fp16
mkdir audio_real_fp16
ls /mnt/home/demo/Mel_real/*.pt > /mnt/home/demo/waveglow/mel_files_real_fp16.txt
cd waveglow
python inference.py -f mel_files_real_fp16.txt -w /mnt/home/git/waveglow-20181115/waveglow/ch/waveglow_3264 -o /mnt/home/demo/audio_real_fp16  -s 0.6 --is_fp16 

/mnt/home/demo/audio_real_fp16/LJ001-0153.wav_synthesis.wav
CPU times: user 0 ns, sys: 8 ms, total: 8 ms
Wall time: 10.4 s


In [11]:
audio_file_synth = "/mnt/home/demo/audio_real_fp16/LJ001-0153.wav_synthesis.wav"
ipd.Audio(audio_file_synth, rate=22050)
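
Half precision trades numeric resolution for speed. As a rough illustration of what that means for individual values, the sketch below rounds a few Python floats to IEEE 754 half precision with the standard `struct` module and prints the error. It does not use WaveGlow or PyTorch, and the helper name `to_fp16_and_back` is hypothetical.

```python
import struct

def to_fp16_and_back(value):
    """Round a Python float to IEEE 754 half precision and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

for x in (0.1, 0.6, 1000.1):
    y = to_fp16_and_back(x)
    print("{!r} -> {!r} (abs error {:.2e})".format(x, y, abs(x - y)))
```

The absolute error grows with the magnitude of the value, because half precision keeps only about three significant decimal digits.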


### Step-by-step Voice synthesis
So far, I have worked with the Mel files and the inference.py script of the WaveGlow model. Now, let's call `waveglow.infer()` directly for voice synthesis.
#### load required modules
Add the folders where waveglow is installed to the module search path so that the imports work.

In [53]:
import os
import sys
import time
import numpy as np
from scipy.io.wavfile import write
import torch
sys.path.insert(0, 'waveglow')
sys.path.insert(0, 'waveglow/tacotron2')

#### load WaveGlow weight parameters
Load the 2 GB weight file and set the model to half precision for fast inference.

In [62]:
%%time
is_fp16=True
waveglow_path = "/mnt/home/git/waveglow-20181115/waveglow/ch/waveglow_15300"
waveglow = torch.load(waveglow_path)['model']
#waveglow.remove_weightnorm() # for old model.
waveglow.cuda().eval()
if is_fp16:
    waveglow.half()
    for k in waveglow.convinv:
        k.float()



AttributeError: Can't get attribute '_rebuild_parameter' on <module 'torch._utils' from '/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/_utils.py'>

#### configure mel files list
Point to the text file that lists the Mel files to synthesize.

In [55]:
mel_files="/mnt/home/demo/waveglow/mel_files_real.txt"

Set the working directory where the created voice file will be saved.

In [56]:
%%time
import time 

from mel2samp import files_to_list, MAX_WAV_VALUE
sampling_rate=22050
sigma=0.6

mel_files = files_to_list(mel_files)
!rm -rf Mel_fp16_subroutine
!mkdir Mel_fp16_subroutine
output_dir="Mel_fp16_subroutine"

CPU times: user 0 ns, sys: 204 ms, total: 204 ms
Wall time: 1.19 s



#### inference
Each Mel file in the list is loaded and moved to GPU memory through PyTorch. `waveglow.infer()` then synthesizes voice from the Mel spectrogram held in memory.



In [58]:
for i, file_path in enumerate(mel_files):
    file_name = os.path.splitext(os.path.basename(file_path))[0]
    # load the Mel spectrogram onto the GPU and add a batch dimension
    mel = torch.load(file_path)
    mel = torch.autograd.Variable(mel.cuda())
    mel = torch.unsqueeze(mel, 0)
    mel = mel.half() if is_fp16 else mel
    mel = mel.data
    # time only the waveglow.infer() call
    start = time.perf_counter()
    with torch.no_grad():
        audio = MAX_WAV_VALUE * waveglow.infer(mel, sigma=sigma)[0]
    duration = time.perf_counter() - start
    print("inference time {:.2f}s/it".format(duration))
    # convert to 16-bit integers and save as a wav file
    audio = audio.cpu().numpy()
    audio = audio.astype('int16')
    audio_path_fp16_subroutine = os.path.join(
        output_dir, "{}_synthesis.wav".format(file_name))
    write(audio_path_fp16_subroutine, sampling_rate, audio)
    print(audio_path_fp16_subroutine)

inference time 0.35s/it
Mel_fp16_subroutine/LJ001-0153.wav_synthesis.wav
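
The printed inference time can be related to the length of the generated audio. The sketch below computes a simple real-time factor; the function name is hypothetical, and the 6-second clip length is taken from the rough figure quoted earlier in this notebook, not measured here.

```python
def real_time_factor(num_samples, sampling_rate, inference_seconds):
    """Seconds of audio produced per second of inference (above 1.0 is faster than real time)."""
    audio_seconds = num_samples / sampling_rate
    return audio_seconds / inference_seconds

# Assumed 6-second clip at 22050 Hz with the 0.35 s timing printed above.
rtf = real_time_factor(num_samples=6 * 22050, sampling_rate=22050, inference_seconds=0.35)
print("about {:.1f}x real time".format(rtf))
```

In the loop above, the same value could be computed per file from `len(audio)`, `sampling_rate`, and `duration`.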


#### Listen to the synthesized voice

In [59]:
ipd.Audio(audio_path_fp16_subroutine, rate=22050)

## Summary

With this notebook you can easily demonstrate speech synthesis.

I would especially like to thank Rafael Valle for the urgent commit he made while I was validating this notebook.


## Reference
- paper  https://arxiv.org/abs/1811.00002

- blog https://nv-adlr.github.io/WaveGlow 

- github https://github.com/NVIDIA/waveglow
