## Validation

In [1]:
import os
import json
from pathlib import Path

import wandb

import torch
import pandas as pd
import IPython.display as ipd
from matplotlib.pyplot import imshow
from matplotlib import pyplot as plt

In [2]:
SPEAKER_ID = "lukas"
MODEL_NAME = "tts_en_fastpitch"

WANDB_PROJECT = "tts-lukas"
WANDB_ENTITY = "capecape" # replace with your wandb username or team

In [3]:
# which split we are using
validation_split_artifact = f'{WANDB_ENTITY}/{WANDB_PROJECT}/lukas_split:latest'

# which model
model_artifact = f'{WANDB_ENTITY}/{WANDB_PROJECT}/model-2022-12-08_13-54-17:v3'
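
Both references above follow the W&B artifact naming pattern `entity/project/name:alias`, where the alias is either a moving tag such as `latest` or a fixed version such as `v3`. A small helper like the following (hypothetical, not part of the original notebook) makes that structure explicit:

```python
def artifact_ref(entity: str, project: str, name: str, alias: str = "latest") -> str:
    """Build a fully qualified W&B artifact reference (illustrative helper)."""
    return f"{entity}/{project}/{name}:{alias}"

# Same values as in the cells above
print(artifact_ref("capecape", "tts-lukas", "lukas_split"))
print(artifact_ref("capecape", "tts-lukas", "model-2022-12-08_13-54-17", alias="v3"))
```

Pinning the model to a fixed version (`v3`) keeps the validation reproducible, while `latest` follows whichever split was logged most recently.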

## Synthesize Samples from Finetuned Checkpoints

Once the FastPitch model is fine-tuned, we can synthesize audio for a given text in two steps: FastPitch generates a spectrogram from the text, and a HiFi-GAN vocoder trained on LJSpeech converts that spectrogram into a waveform.

We also define a few helper functions.

In [4]:
from nemo.collections.tts.models import HifiGanModel
from nemo.collections.tts.models import FastPitchModel

[NeMo W 2022-12-08 14:20:52 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2022-12-08 14:20:52 experimental:27] Module <class 'nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.IPATokenizer'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-12-08 14:20:52 experimental:27] Module <class 'nemo.collections.tts.models.radtts.RadTTSModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.


We load a pretrained HiFi-GAN vocoder to generate the audio from the spectrogram (we will fine-tune this model later).

In [5]:
vocoder = HifiGanModel.from_pretrained("tts_hifigan")
vocoder = vocoder.eval().cuda()

[NeMo I 2022-12-08 14:20:52 cloud:56] Found existing object /home/tcapelle/.cache/torch/NeMo/NeMo_1.14.0rc0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.
[NeMo I 2022-12-08 14:20:52 cloud:62] Re-using file from: /home/tcapelle/.cache/torch/NeMo/NeMo_1.14.0rc0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo
[NeMo I 2022-12-08 14:20:52 common:912] Instantiating model from pre-trained checkpoint


[NeMo W 2022-12-08 14:20:55 modelPT:142] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/train_finetune.txt
      min_duration: 0.75
      n_segments: 8192
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 64
      num_workers: 4
    
[NeMo W 2022-12-08 14:20:55 modelPT:149] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/val_finetune.txt
      min_duration: 3
      n_segments: 66150


[NeMo I 2022-12-08 14:20:55 features:267] PADDING: 0


[NeMo W 2022-12-08 14:20:55 features:244] Using torch_stft is deprecated and has been removed. The values have been forcibly set to False for FilterbankFeatures and AudioToMelSpectrogramPreprocessor. Please set exact_pad to True as needed.


[NeMo I 2022-12-08 14:20:55 features:267] PADDING: 0
[NeMo I 2022-12-08 14:20:58 save_restore_connector:243] Model HifiGanModel was successfully restored from /home/tcapelle/.cache/torch/NeMo/NeMo_1.14.0rc0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.


We start a W&B run so that we can download the fine-tuned model from its `wandb` artifact and later log the model predictions to W&B.

In [6]:
wandb.init(entity=WANDB_ENTITY, project=WANDB_PROJECT, job_type="fastpitch_validation")

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: capecape. Use `wandb login --relogin` to force relogin


In [7]:
split_artifact = wandb.use_artifact(validation_split_artifact)
split_artifact_dir = split_artifact.download()

model_artifact = wandb.use_artifact(model_artifact, type='model')
model_artifact_dir = model_artifact.download()

wandb:   2 of 2 files downloaded.
wandb: Downloading large artifact model-2022-12-08_13-54-17:v3, 524.07MB. 1 files...
wandb:   1 of 1 files downloaded.
Done. 0:0:0.4


In [8]:
def ls(path): return list(Path(path).iterdir())

In [9]:
ls(split_artifact_dir)

[PosixPath('artifacts/lukas_split:v0/lukas_manifest_train_local.json'),
 PosixPath('artifacts/lukas_split:v0/lukas_manifest_valid_local.json')]

In [10]:
ls(model_artifact_dir)

[PosixPath('artifacts/model-2022-12-08_13-54-17:v3/model.ckpt')]
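
The artifact above holds a single `model.ckpt`, so indexing the listing with `[0]` is enough here. If an artifact ever contains more than one file, selecting by extension is safer. A minimal sketch, assuming checkpoints end in `.ckpt` and that the most recently modified one is the one wanted (`find_checkpoint` is a hypothetical helper, not part of the original notebook):

```python
from pathlib import Path

def find_checkpoint(artifact_dir):
    """Return the most recently modified *.ckpt file in artifact_dir."""
    ckpts = sorted(Path(artifact_dir).glob("*.ckpt"), key=lambda p: p.stat().st_mtime)
    if not ckpts:
        raise FileNotFoundError(f"no .ckpt file found in {artifact_dir}")
    return ckpts[-1]
```

It could then replace the index lookup, for example `str(find_checkpoint(model_artifact_dir))`.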

In [11]:
def infer(spec_gen_model, vocoder_model, str_input, speaker=None):
    """
    Synthesizes spectrogram and audio from a text string given a spectrogram synthesis and vocoder model.
    
    Args:
        spec_gen_model: Spectrogram generator model (FastPitch in our case)
        vocoder_model: Vocoder model (HiFiGAN in our case)
        str_input: Text input for the synthesis
        speaker: Speaker ID
    
    Returns:
        spectrogram and waveform of the synthesized audio.
    """
    with torch.no_grad():
        parsed = spec_gen_model.parse(str_input)
        if speaker is not None:
            speaker = torch.tensor([speaker]).long().to(device=spec_gen_model.device)
        spectrogram = spec_gen_model.generate_spectrogram(tokens=parsed, speaker=speaker)
        audio = vocoder_model.convert_spectrogram_to_audio(spec=spectrogram)
        
    if spectrogram is not None:
        if isinstance(spectrogram, torch.Tensor):
            spectrogram = spectrogram.to('cpu').numpy()
        if len(spectrogram.shape) == 3:
            spectrogram = spectrogram[0]
    if isinstance(audio, torch.Tensor):
        audio = audio.to('cpu').numpy()
    return spectrogram, audio

In [12]:
last_ckpt = str(ls(model_artifact_dir)[0])
print(last_ckpt)

spec_model = FastPitchModel.load_from_checkpoint(last_ckpt)
spec_model.eval().cuda();

artifacts/model-2022-12-08_13-54-17:v3/model.ckpt
[NeMo I 2022-12-08 14:21:08 tokenize_and_classify:87] Creating ClassifyFst grammars.


[NeMo W 2022-12-08 14:21:40 experimental:27] Module <class 'nemo_text_processing.g2p.modules.IPAG2P'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-12-08 14:21:41 modules:95] apply_to_oov_word=None, This means that some of words will remain unchanged if they are not handled by any of the rules in self.parse_one_word(). This may be intended if phonemes and chars are both valid inputs, otherwise, you may see unexpected deletions in your input.
[NeMo W 2022-12-08 14:21:41 modelPT:142] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.torch.data.TTSDataset
      manifest_filepath: lukas_manifest_train_local.json
      sample_rate: 22050
      sup_data_path: ./fastpitch_sup_data
      sup_data_types:
      - align_prior_matrix
      - pitch
      

[NeMo I 2022-12-08 14:21:41 features:267] PADDING: 1


In [13]:
def generate_audio(text, speaker_id):
    "Generate MEL and Synth Audio"
    spec, audio = infer(spec_model, vocoder, text, speaker=speaker_id)
    return spec, audio.flatten()

In [14]:
new_speaker_id = 42
duration_mins = 5
mixing = False
original_speaker_id = "ljspeech"

In [15]:
valid_df = pd.read_json(Path(split_artifact_dir)/f"{SPEAKER_ID}_manifest_valid_local.json", lines=True)
valid_df.head()

| Unnamed: 0 | audio_filepath | text | duration | text_no_preprocessing | text_normalized |
|---|---|---|---|---|---|
| 0 | lukas/seg238.wav | is yes, then you really do have a machine lea... | 5 | is yes, then you really do have a machine lea... | is yes, then you really do have a machine lear... |
| 1 | lukas/seg239.wav | excited enough about all the applications of ... | 4 | excited enough about all the applications of ... | excited enough about all the applications of m... |
| 2 | lukas/seg240.wav | videos that explain actually how to build the... | 4 | videos that explain actually how to build the... | videos that explain actually how to build thes... |
| 3 | lukas/seg241.wav | we're going to keep creating these videos so ... | 4 | we're going to keep creating these videos so ... | we're going to keep creating these videos so y... |
| 4 | lukas/seg242.wav | first to know when a new video comes out. | 21 | first to know when a new video comes out. | first to know when a new video comes out. |
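
`pd.read_json(..., lines=True)` works because the manifest stores one JSON object per line rather than a single JSON array. The same file can be read with the standard library alone, which is handy for quick checks such as the total duration of the validation set. A minimal sketch, assuming each line carries at least the `audio_filepath`, `text` and `duration` fields shown above and that `duration` is in seconds (`read_manifest` and `total_duration` are hypothetical helpers):

```python
import json
from pathlib import Path

def read_manifest(path):
    """Parse a JSON-lines manifest into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def total_duration(records):
    """Sum the `duration` field (assumed to be seconds) over all records."""
    return sum(r["duration"] for r in records)
```

For example, `total_duration(read_manifest(Path(split_artifact_dir)/f"{SPEAKER_ID}_manifest_valid_local.json"))` would give the length of the validation split.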


In [16]:
# One row per validation utterance: input text, real audio, synthesized audio, spectrogram
table = wandb.Table(columns=['Text', 'Real validation audio', f'Audio Speaker {new_speaker_id}', 'Spec'])

# same value as sample_rate in the FastPitch training config
sample_rate = 22050

for _, val_record in valid_df.iterrows():
    # generate_audio already returns a flattened waveform
    speaker_spec, speaker_audio = generate_audio(val_record['text'], speaker_id=new_speaker_id)
    row = [val_record["text_no_preprocessing"],
           wandb.Audio(val_record['audio_filepath'], sample_rate=sample_rate),
           wandb.Audio(speaker_audio, sample_rate=sample_rate),
           wandb.Image(speaker_spec)]
    table.add_data(*row)

wandb.log({"fastpitch_predictions": table})
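
To listen to a sample outside of W&B, the synthesized waveform can also be written to disk. A minimal sketch using only the standard library, assuming the audio is a 1-D sequence of floats roughly in [-1, 1] at 22050 Hz (`save_wav` is a hypothetical helper, not part of the original notebook; audio libraries such as `soundfile` can do the same in a single call):

```python
import struct
import wave

def save_wav(path, samples, sample_rate=22050):
    """Write mono float samples as 16-bit PCM, clipping to [-1, 1]."""
    pcm = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))       # clip out-of-range values
        pcm += struct.pack("<h", int(s * 32767))  # little-endian int16
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
```

Inside the loop above, a call such as `save_wav("sample.wav", speaker_audio)` would store the most recent prediction.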

In [17]:
wandb.finish()
