# Download and Log data to W&B

For this tutorial, we will use a small part of the Hi-Fi Multi-Speaker English TTS (Hi-Fi TTS) dataset. You can read more about the dataset [here](https://arxiv.org/abs/2104.01497). We will use speaker 9017 as the target speaker, and only a 5-minute subset of audio will be used for this fine-tuning example. We additionally resample the audio to 22050 Hz.

In [79]:
from types import SimpleNamespace

config = SimpleNamespace(SPEAKER_ID = "9017")

In [80]:
!wget https://multilangaudiosamples.s3.us-east-2.amazonaws.com/"{config.SPEAKER_ID}_5_mins.tar.gz"  # Contains 10MB of data
!tar -xzf "{config.SPEAKER_ID}_5_mins.tar.gz"

--2022-12-06 15:14:12--  https://multilangaudiosamples.s3.us-east-2.amazonaws.com/9017_5_mins.tar.gz
Resolving multilangaudiosamples.s3.us-east-2.amazonaws.com (multilangaudiosamples.s3.us-east-2.amazonaws.com)... 52.219.178.42
Connecting to multilangaudiosamples.s3.us-east-2.amazonaws.com (multilangaudiosamples.s3.us-east-2.amazonaws.com)|52.219.178.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10802737 (10M) [application/x-gzip]
Saving to: ‘9017_5_mins.tar.gz.1’


2022-12-06 15:14:12 (38.0 MB/s) - ‘9017_5_mins.tar.gz.1’ saved [10802737/10802737]



Looking at `manifest.json`, we see a standard NeMo JSON manifest in which each line contains the file path, text, and duration of one audio clip. Please note that our `manifest.json` contains relative paths.

Let's make sure that the entries look something like this:

```
{"audio_filepath": "audio/presentpictureofnsw_02_mann_0532.wav", "text": "not to stop more than ten minutes by the way", "duration": 2.6, "text_no_preprocessing": "not to stop more than ten minutes by the way,", "text_normalized": "not to stop more than ten minutes by the way,"}
```

In [81]:
!head -n 1 ./{config.SPEAKER_ID}_5_mins/manifest.json

{"audio_filepath": "audio/dartagnan03part1_027_dumas_0047.wav", "text": "yes monsieur", "duration": 1.04, "text_no_preprocessing": "Yes, monsieur.", "text_normalized": "Yes, monsieur."}
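
To check every entry rather than only the first one, the manifest can be parsed line by line. The sketch below is a minimal example and not part of NeMo: `summarize_manifest` is a hypothetical helper, and treating `audio_filepath`, `text`, and `duration` as the required fields is an assumption based on the entries shown above.

```python
import json

# Assumption: these are the fields this tutorial relies on.
REQUIRED_KEYS = {"audio_filepath", "text", "duration"}

def summarize_manifest(lines):
    """Validate JSON-lines manifest entries; return (entry count, total seconds)."""
    count, total = 0, 0.0
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        count += 1
        total += float(entry["duration"])
    return count, total

# Usage with the extracted archive (path assumes SPEAKER_ID "9017"):
# with open("9017_5_mins/manifest.json") as f:
#     n_entries, total_seconds = summarize_manifest(f)
```

The total duration returned by the helper is also a quick way to confirm that the subset really is about five minutes long.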


Let's hold out the last 2 samples of the dataset as a validation set and use all remaining samples as the training set.

As mentioned, since the paths in the manifest are relative, we also create a symbolic link to the audio folder such that `audio/` goes to the correct directory.

In [82]:
!cat ./{config.SPEAKER_ID}_5_mins/manifest.json | tail -n 2 > ./{config.SPEAKER_ID}_manifest_valid_local.json
!cat ./{config.SPEAKER_ID}_5_mins/manifest.json | head -n -2 > ./{config.SPEAKER_ID}_manifest_train_local.json
!ln -s ./{config.SPEAKER_ID}_5_mins/audio audio
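
The negative count in `head -n -2` is a GNU coreutils feature and may not be available in other `head` implementations. As a portable alternative, the same split can be sketched in plain Python. `split_manifest` is a hypothetical helper introduced here for illustration; it mirrors the shell commands above by holding out the last entries for validation.

```python
def split_manifest(lines, n_valid=2):
    """Return (train_lines, valid_lines), holding out the last n_valid entries."""
    entries = [ln for ln in lines if ln.strip()]  # ignore blank lines
    if not 0 < n_valid < len(entries):
        raise ValueError("n_valid must be between 1 and len(entries) - 1")
    return entries[:-n_valid], entries[-n_valid:]

# Usage (file names follow the cell above, assuming SPEAKER_ID "9017"):
# with open("9017_5_mins/manifest.json") as f:
#     train, valid = split_manifest(f.readlines())
# with open("9017_manifest_train_local.json", "w") as f:
#     f.writelines(train)
# with open("9017_manifest_valid_local.json", "w") as f:
#     f.writelines(valid)
```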

Let's create a W&B Table to inspect these files.

In [83]:
import wandb
import json
import pandas as pd

In [84]:
train_df = pd.read_json(f"{config.SPEAKER_ID}_manifest_train_local.json", lines=True)
train_df

| | audio_filepath | text | duration | text_no_preprocessing | text_normalized |
|---|---|---|---|---|---|
| 0 | audio/dartagnan03part1_027_dumas_0047.wav | yes monsieur | 1.04 | Yes, monsieur. | Yes, monsieur. |
| 1 | audio/dartagnan01_42_dumas_0220.wav | asked he in an undertone | 1.66 | asked he, in an undertone. | asked he, in an undertone. |
| 2 | audio/dartagnan01_38_dumas_0123.wav | grimaud entered | 1.20 | Grimaud entered. | Grimaud entered. |
| 3 | audio/dartagnan01_53_dumas_0059.wav | in the morning when they entered milady's cham... | 3.70 | In the morning, when they entered Milady's cha... | In the morning, when they entered Milady's cha... |
| 4 | audio/dartagnan03part3_66_dumas_0203.wav | yes monseigneur | 1.42 | “Yes, monseigneur. | Yes, monseigneur. |
| ... | ... | ... | ... | ... | ... |
| 71 | audio/dartagnan03part3_09_dumas_0218.wav | and so you are determined to sign the sale of ... | 8.76 | “And so you are determined to sign the sale of... | And so you are determined to sign the sale of ... |
| 72 | audio/dartagnan01_62_dumas_0190.wav | what | 0.58 | “What?” | "What?" |
| 73 | audio/dartagnan01_33_dumas_0018.wav | well what is to be done | 1.90 | “Well, what is to be done?” | "Well, what is to be done?" |
| 74 | audio/dartagnan03part3_62_dumas_0243.wav | said grimaud addressing athos and pointing to ... | 7.88 | said Grimaud, addressing Athos and pointing to... | said Grimaud, addressing Athos and pointing to... |


Next, we create a `wandb.Table` from the `DataFrame`:
- We first need to convert the audio file paths to `wandb.Audio` objects so that the clips can be played back in the W&B UI.

In [85]:
train_df.audio_filepath = train_df.audio_filepath.apply(wandb.Audio)

In [86]:
train_table = wandb.Table(dataframe=train_df)

In [87]:
wandb.init(project="nemo", job_type="log_dataset", config=config)

In [88]:
wandb.log({"train_data": train_table})

We can do the same with the validation data:

In [89]:
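valid_df = pd.read_json(f"{config.SPEAKER_ID}_manifest_valid_local.json", lines=True)
valid_df.audio_filepath = valid_df.audio_filepath.apply(wandb.Audio)
valid_table = wandb.Table(dataframe=valid_df)
# Log the validation table to the same run, as we did for the training data
wandb.log({"valid_data": valid_table})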

In [90]:
wandb.finish()


Let's also download the pretrained checkpoint that we want to fine-tune from. NeMo saves downloaded checkpoints under `~/.cache`, so let's copy the checkpoint to our current directory.

*Note: check that `home_path` refers to your home folder. Otherwise, set it manually.*
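
The copy step can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the exact code used by NeMo: `copy_cached_checkpoint` is a hypothetical helper, and the checkpoint file name in the usage comment is a placeholder that you would replace with the name of the file NeMo actually downloaded.

```python
import shutil
from pathlib import Path

def copy_cached_checkpoint(cache_root, filename, dest_dir="."):
    """Copy the first file named `filename` found under cache_root into dest_dir."""
    matches = sorted(Path(cache_root).glob(f"**/{filename}"))
    if not matches:
        raise FileNotFoundError(f"{filename} not found under {cache_root}")
    dest = Path(dest_dir) / filename
    shutil.copy2(matches[0], dest)  # copy rather than move, so the cache stays intact
    return dest

# Hypothetical usage; the file name below is a placeholder, not a real checkpoint name:
# home_path = Path.home()
# copy_cached_checkpoint(home_path / ".cache", "your_checkpoint.nemo")
```

Copying instead of moving leaves the cached file in place, so a later download of the same checkpoint does not have to fetch it again.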

## Download Necessary files for training

To fine-tune the FastPitch model on the file lists created above, we use the `examples/tts/fastpitch_finetune.py` script with the `fastpitch_align_v1.05.yaml` configuration.

Let's grab those files.

In [96]:
BRANCH = "master"

In [97]:
!wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/fastpitch_finetune.py

!mkdir -p conf \
&& cd conf \
&& wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/conf/fastpitch_align_v1.05.yaml \
&& cd ..

--2022-12-06 15:22:23--  https://raw.githubusercontent.com/nvidia/NeMo/master/examples/tts/fastpitch_finetune.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1751 (1.7K) [text/plain]
Saving to: ‘fastpitch_finetune.py.1’


2022-12-06 15:22:23 (39.5 MB/s) - ‘fastpitch_finetune.py.1’ saved [1751/1751]

--2022-12-06 15:22:23--  https://raw.githubusercontent.com/nvidia/NeMo/master/examples/tts/conf/fastpitch_align_v1.05.yaml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6831 (6.7K) [text/plain]
Saving to: ‘fastpitch_align_v1.05.ya

We also need some additional files (see `FastPitch_MixerTTS_Training.ipynb` tutorial for more details) for training. Let's download these, too.

In [98]:
# additional files
!mkdir -p tts_dataset_files && cd tts_dataset_files \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tts_dataset_files/cmudict-0.7b_nv22.10 \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tts_dataset_files/heteronyms-052722 \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/nemo_text_processing/text_normalization/en/data/whitelist/lj_speech.tsv \
&& cd ..

--2022-12-06 15:22:25--  https://raw.githubusercontent.com/NVIDIA/NeMo/master/scripts/tts_dataset_files/cmudict-0.7b_nv22.10
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3741429 (3.6M) [text/plain]
Saving to: ‘cmudict-0.7b_nv22.10’


2022-12-06 15:22:25 (52.1 MB/s) - ‘cmudict-0.7b_nv22.10’ saved [3741429/3741429]

--2022-12-06 15:22:25--  https://raw.githubusercontent.com/NVIDIA/NeMo/master/scripts/tts_dataset_files/heteronyms-052722
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1606 (1.6K) [text/plain]
Saving to: ‘heteronyms