# Download and Log data to W&B

For our tutorial, we will use a small part of the Hi-Fi Multi-Speaker English TTS (Hi-Fi TTS) dataset. You can read more about the dataset [here](https://arxiv.org/abs/2104.01497). We will use speaker 9017 as the target speaker, and only a 5-minute subset of audio will be used for this fine-tuning example. We additionally resample the audio to 22050 Hz (22.05 kHz).

In [1]:
import wandb
import json
import pandas as pd

In [2]:
SPEAKER_ID = "lukas"
WANDB_PROJECT = "tts-lukas"
WANDB_ENTITY = "capecape" # replace with your wandb username or team

In [3]:
# !unzip -q lukas.zip

Looking at `manifest.json`, we see a standard NeMo JSON manifest that contains the file path, text, and duration of each clip. Note that the paths in our `manifest.json` are relative.

In [4]:
df = pd.read_json(f"{SPEAKER_ID}/manifest.json", lines=True)

In [5]:
df.head()

Unnamed: 0,audio_filepath,text,duration,text_no_preprocessing,text_normalized
0,lukas/seg0.wav,Today we're going to talk about the big picture.,3.28,Today we're going to talk about the big picture.,Today we're going to talk about the big picture.
1,lukas/seg1.wav,What is machine learning? What is deep learning?,2.64,What is machine learning? What is deep learning?,What is machine learning? What is deep learning?
2,lukas/seg2.wav,How does it really work and where can we appl...,2.88,How does it really work and where can we appl...,How does it really work and where can we apply...
3,lukas/seg3.wav,And unlike some of the other videos that we'r...,2.8,And unlike some of the other videos that we'r...,And unlike some of the other videos that we're...
4,lukas/seg4.wav,this isn't just for engineers.,1.52,this isn't just for engineers.,this isn't just for engineers.
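The manifest holds one JSON object per line, so it can also be inspected without pandas. The following is a minimal sketch using only the standard library; the helper names `read_manifest` and `total_duration_seconds` are hypothetical and not part of NeMo or W&B. It assumes each entry has a numeric `duration` field in seconds, as in the table above.

```python
import json


def read_manifest(path):
    """Parse a JSON Lines manifest: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                records.append(json.loads(line))
    return records


def total_duration_seconds(records):
    """Sum the `duration` field (assumed to be seconds) over all entries."""
    return sum(float(record["duration"]) for record in records)
```

For example, `total_duration_seconds(read_manifest(f"{SPEAKER_ID}/manifest.json")) / 60` gives the total amount of audio in minutes, which is a quick way to check how much data the subset actually contains.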


## Normalizing

In [6]:
# !wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/dataset_processing/tts/compute_speaker_stats.py

In [7]:
from nemo_text_processing.text_normalization.normalize import Normalizer
normalizer = Normalizer(input_case='cased', lang='en')

[NeMo W 2022-12-08 13:49:53 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.


[NeMo I 2022-12-08 13:49:56 tokenize_and_classify:87] Creating ClassifyFst grammars.


In [8]:
df.text_normalized = df.text.apply(normalizer.normalize)

In [9]:
dict_records = df.to_dict('records')

In [10]:
import ndjson
with open(f"{SPEAKER_ID}/manifest.json", 'w', encoding="utf-8") as f:
    ndjson.dump(dict_records, f, ensure_ascii=False)

In [11]:
# # this line does not work!
# df.to_json("manifest_n.json", orient="records", lines=True)
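If the `ndjson` package is not available, the same one-object-per-line layout can be produced with the standard `json` module. This is a sketch of an alternative, not the method used in this notebook; `write_manifest` is a hypothetical helper name.

```python
import json


def write_manifest(records, path):
    """Write one JSON object per line, keeping non-ASCII text unescaped."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps leaves "/" in file paths as-is and, with
            # ensure_ascii=False, writes non-ASCII characters directly.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

It would be called as `write_manifest(df.to_dict("records"), f"{SPEAKER_ID}/manifest.json")`.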

## Save to W&B

Let's log this raw data to W&B.

In [12]:
wandb.init(project=WANDB_PROJECT, entity=WANDB_ENTITY, job_type="log_dataset", config={"speaker_id":SPEAKER_ID})

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: capecape. Use `wandb login --relogin` to force relogin


In [13]:
at = wandb.Artifact("lukas_data", type="dataset", description=f"Speaker {SPEAKER_ID} from ML course from YouTube")

In [14]:
at.add_dir(f"lukas")

wandb: Adding directory to artifact (./lukas)... Done. 0.2s


In [15]:
wandb.log_artifact(at)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f28d5b7a640>

In [16]:
wandb.finish()

### Train/Val split

Let's hold out the last 5 samples of the dataset as a validation set and use all remaining samples as the training set.

As mentioned, since the paths in the manifest are relative, we also create a symbolic link to the audio folder such that `audio/` goes to the correct directory.

In [17]:
!cat ./{SPEAKER_ID}/manifest.json | tail -n 5 > ./{SPEAKER_ID}_manifest_valid_local.json
!cat ./{SPEAKER_ID}/manifest.json | head -n -5 > ./{SPEAKER_ID}_manifest_train_local.json
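The shell commands above rely on `head -n -5`, which is a GNU extension and may not work on every system. A portable sketch of the same split in Python could look like the following; `split_manifest` and `split_manifest_file` are hypothetical helper names, and the default of 5 validation lines simply mirrors the commands above.

```python
def split_manifest(lines, n_valid):
    """Return (train, valid), holding out the last `n_valid` lines."""
    if not 0 < n_valid < len(lines):
        raise ValueError("n_valid must be between 1 and len(lines) - 1")
    return lines[:-n_valid], lines[-n_valid:]


def split_manifest_file(manifest_path, train_path, valid_path, n_valid=5):
    """Split a JSON Lines manifest on disk; returns (n_train, n_valid)."""
    with open(manifest_path, encoding="utf-8") as f:
        # Keep only non-empty lines and make sure each ends with a newline.
        lines = [line.rstrip("\n") + "\n" for line in f if line.strip()]
    train, valid = split_manifest(lines, n_valid)
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(train)
    with open(valid_path, "w", encoding="utf-8") as f:
        f.writelines(valid)
    return len(train), len(valid)
```

Like the shell version, this takes the validation samples from the end of the file rather than at random, so the split is deterministic.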

Let's log the split files to W&B

In [18]:
run = wandb.init(project=WANDB_PROJECT, entity=WANDB_ENTITY,  job_type="dataset_split", config={"speaker_id":SPEAKER_ID})

In [19]:
run.use_artifact(f'{WANDB_ENTITY}/{WANDB_PROJECT}/lukas_data:latest', type='dataset')

<Artifact QXJ0aWZhY3Q6Mjk1MjY4NzQz>

In [20]:
at = wandb.Artifact("lukas_split", type="dataset_split", description=f"Train/valid split for Speaker {SPEAKER_ID}")

In [21]:
at.add_file(f"./{SPEAKER_ID}_manifest_train_local.json")
at.add_file(f"./{SPEAKER_ID}_manifest_valid_local.json")

<ManifestEntry digest: z5Wzrg8vIQ1IoGpUvnWHJw==>

In [22]:
wandb.log_artifact(at)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f28d5ab7fd0>

## 👀 Visualizing the dataset (or playing the audio 🤣)

Let's create a W&B Table to inspect these files

In [23]:
train_df = pd.read_json(f"{SPEAKER_ID}_manifest_train_local.json", lines=True)
train_df

Unnamed: 0,audio_filepath,text,duration,text_no_preprocessing,text_normalized
0,lukas/seg0.wav,Today we're going to talk about the big picture.,3.28,Today we're going to talk about the big picture.,Today we're going to talk about the big picture.
1,lukas/seg1.wav,What is machine learning? What is deep learning?,2.64,What is machine learning? What is deep learning?,What is machine learning? What is deep learning?
2,lukas/seg2.wav,How does it really work and where can we appl...,2.88,How does it really work and where can we appl...,How does it really work and where can we apply...
3,lukas/seg3.wav,And unlike some of the other videos that we'r...,2.80,And unlike some of the other videos that we'r...,And unlike some of the other videos that we're...
4,lukas/seg4.wav,this isn't just for engineers.,1.52,this isn't just for engineers.,this isn't just for engineers.
...,...,...,...,...,...
233,lukas/seg233.wav,specific API that's common to all machine lea...,6.00,specific API that's common to all machine lea...,specific API that's common to all machine lear...
234,lukas/seg234.wav,"thinking, okay, is my problem suitable for ma...",5.00,"thinking, okay, is my problem suitable for ma...","thinking, okay, is my problem suitable for mac..."
235,lukas/seg235.wav,"asking yourself is, can I turn it into this k...",6.00,"asking yourself is, can I turn it into this k...","asking yourself is, can I turn it into this ki..."
236,lukas/seg236.wav,numbers as input and a fixed length of number...,5.00,numbers as input and a fixed length of number...,numbers as input and a fixed length of numbers...


Create a `wandb.Table` from a `DataFrame`:
- We need to convert the audio file paths to `wandb.Audio` objects.

In [24]:
train_df.audio_filepath = train_df.audio_filepath.apply(wandb.Audio)

In [25]:
train_table = wandb.Table(dataframe=train_df)

In [26]:
wandb.log({"train_data": train_table})

We can do the same with the validation data:

In [27]:
valid_df = pd.read_json(f"{SPEAKER_ID}_manifest_valid_local.json", lines=True)
valid_df.audio_filepath = valid_df.audio_filepath.apply(wandb.Audio)
valid_table = wandb.Table(dataframe=valid_df)

In [28]:
wandb.log({"valid_data": valid_table})

In [29]:
wandb.finish()

wandb: Network error (ConnectTimeout), entering retry loop.
