# Getting Started: Sample Conversational AI application
This notebook shows how to use NVIDIA NeMo (https://github.com/NVIDIA/NeMo) to build a toy demo that translates a French audio file into an English one.

The demo shows how to:

* Instantiate pre-trained NeMo models from NVIDIA NGC.
* Transcribe audio with a French speech recognition model.
* Translate the text to English with a machine translation model.
* Generate English audio with text-to-speech models.

## Import all necessary packages

In [1]:
# Import NeMo and its ASR, NLP and TTS collections
import nemo
# Import Speech Recognition collection
import nemo.collections.asr as nemo_asr
# Import Natural Language Processing collection
import nemo.collections.nlp as nemo_nlp
# Import Speech Synthesis collection
import nemo.collections.tts as nemo_tts
# We'll use this to listen to audio
import IPython

import soundfile
import uuid
import io
import base64
import json
import logging

import mlflow
import os
from mlflow.types.schema import Schema, ColSpec
from mlflow.types import ParamSchema, ParamSpec
from mlflow.models import ModelSignature

## Test whether NeMo is properly imported

In this cell, we list the NeMo Automatic Speech Recognition models available on NGC to confirm that the workspace can load NeMo and connect to NGC.

* ``list_available_models()`` lists all models currently available on NGC, together with their names.



In [2]:
# Here is an example of all CTC-based models:
nemo_asr.models.EncDecCTCModel.list_available_models()
# More ASR Models are available - see: nemo_asr.models.ASRModel.list_available_models()

[PretrainedModelInfo(
 	pretrained_model_name=QuartzNet15x5Base-En,
 	description=QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 for 600 epochs. The model achieves a WER of 3.79% on LibriSpeech dev-clean, and a WER of 10.05% on dev-other. Please visit https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels for further details.,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemospeechmodels/versions/1.0.0a5/files/QuartzNet15x5Base-En.nemo
 ),
 PretrainedModelInfo(
 	pretrained_model_name=stt_en_quartznet15x5,
 	description=For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_quartznet15x5,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_quartznet15x5/versions/1.0.0rc1/files/stt_en_quartznet15x5.nemo
 ),
 PretrainedModelInfo(
 	pre
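
The list printed above can be long, so it may help to filter it by name. The following is a minimal sketch, assuming each entry exposes the `pretrained_model_name` attribute shown in the output; `find_models` is a hypothetical helper, not part of NeMo.

```python
def find_models(model_infos, name_fragment):
    """Return the names of models whose pretrained_model_name contains name_fragment.

    model_infos is any iterable of objects with a pretrained_model_name
    attribute, such as the entries printed by list_available_models().
    """
    return [
        info.pretrained_model_name
        for info in model_infos
        if name_fragment in info.pretrained_model_name
    ]


# Hypothetical usage in this notebook (requires NeMo and access to NGC):
# find_models(nemo_asr.models.EncDecCTCModel.list_available_models(), "stt_fr")
```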

## Loading from local saved models

Here, instead of downloading the models directly from NGC in code, we load models that were downloaded previously through the AI Studio assets manager.

In [6]:
# Speech Recognition model - QuartzNet15x5 for French speech (stt_fr_quartznet15x5)
asr_model = nemo_asr.models.EncDecCTCModel.restore_from("/home/jovyan/datafabric/stt-quartznet/stt_fr_quartznet15x5.nemo")

# Neural Machine Translation model
nmt_model = nemo_nlp.models.MTEncDecModel.restore_from("/home/jovyan/datafabric/fr-en-transformer/nmt_fr_en_transformer12x2.nemo")

# Spectrogram generator which takes text as an input and produces spectrogram
spectrogram_generator = nemo_tts.models.FastPitchModel.restore_from("/home/jovyan/datafabric/en-fastpitch/tts_en_fastpitch_align_ipa.nemo")

# Vocoder model which takes spectrogram and produces actual audio
vocoder = nemo_tts.models.HifiGanModel.restore_from("/home/jovyan/datafabric/en-hifigan/tts_hifigan.nemo")

[NeMo W 2024-06-10 18:33:13 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /raid/noneval.json
    sample_rate: 16000
    labels:
    - ' '
    - a
    - b
    - c
    - d
    - e
    - f
    - g
    - h
    - i
    - j
    - k
    - l
    - m
    - 'n'
    - o
    - p
    - q
    - r
    - s
    - t
    - u
    - v
    - w
    - x
    - 'y'
    - z
    - ''''
    - ç
    - é
    - â
    - ê
    - î
    - ô
    - û
    - à
    - è
    - ù
    - ë
    - ï
    - ü
    - ÿ
    batch_size: 32
    trim_silence: true
    max_duration: 16.7
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    num_workers: 8
    pin_memory: true
    
[NeMo W 2024-06-10 18:33:13 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() m

[NeMo I 2024-06-10 18:33:13 features:289] PADDING: 16
[NeMo I 2024-06-10 18:33:14 save_restore_connector:249] Model EncDecCTCModel was successfully restored from /home/jovyan/datafabric/fr-en-models/stt_fr_quartznet15x5.nemo.
[NeMo I 2024-06-10 18:35:07 tokenizer_utils:179] Getting YouTokenToMeTokenizer with model: /tmp/tmpr_nt7i45/tokenizer.75.32000.BPE.model with r2l: False.
[NeMo I 2024-06-10 18:35:07 tokenizer_utils:179] Getting YouTokenToMeTokenizer with model: /tmp/tmpr_nt7i45/tokenizer.75.32000.BPE.model with r2l: False.


[NeMo W 2024-06-10 18:35:07 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    src_file_name: /raid/batches.tokens.75.16000.pkl
    tgt_file_name: /raid/batches.tokens.75.16000.pkl
    tokens_in_batch: 16000
    clean: true
    max_seq_length: 512
    cache_ids: false
    cache_data_per_node: false
    use_cache: false
    shuffle: true
    num_samples: -1
    drop_last: false
    pin_memory: false
    num_workers: 8
    load_from_cached_dataset: true
    reverse_lang_direction: true
    load_from_tarred_dataset: false
    metadata_path: null
    tar_shuffle_n: 100
    
[NeMo W 2024-06-10 18:35:07 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation c

[NeMo I 2024-06-10 18:35:11 nlp_overrides:752] Model MTEncDecModel was successfully restored from /home/jovyan/datafabric/fr-en-models/nmt_fr_en_transformer12x2.nemo.


 NeMo-text-processing :: INFO     :: Creating ClassifyFst grammars.
I0610 18:35:23.338866 140234841343808 tokenize_and_classify.py:86] Creating ClassifyFst grammars.
[NeMo W 2024-06-10 18:35:49 deprecated:63] Function ``g2p_backward_compatible_support`` is deprecated. But it will not be removed until a further notice. G2P object root directory `nemo_text_processing.g2p` has been replaced with `nemo.collections.tts.g2p`. Please use the latter instead as of NeMo 1.18.0.
[NeMo W 2024-06-10 18:35:49 experimental:26] `<class 'nemo.collections.tts.g2p.models.i18n_ipa.IpaG2p'>` is experimental and not ready for production yet. Use at your own risk.
[NeMo W 2024-06-10 18:35:50 i18n_ipa:124] apply_to_oov_word=None, This means that some of words will remain unchanged if they are not handled by any of the rules in self.parse_one_word(). This may be intended if phonemes and chars are both valid inputs, otherwise, you may see unexpected deletions in your input.
[NeMo W 2024-06-10 18:35:50 experimen

[NeMo I 2024-06-10 18:35:50 features:289] PADDING: 1
[NeMo I 2024-06-10 18:35:51 save_restore_connector:249] Model FastPitchModel was successfully restored from /home/jovyan/datafabric/fr-en-models/tts_en_fastpitch_align_ipa.nemo.


[NeMo W 2024-06-10 18:36:31 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/train_finetune.txt
      min_duration: 0.75
      n_segments: 8192
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 64
      num_workers: 4
    
[NeMo W 2024-06-10 18:36:31 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/val_finetune.txt
      min_duration: 3
      n_segments: 66150


[NeMo I 2024-06-10 18:36:31 features:289] PADDING: 0


[NeMo W 2024-06-10 18:36:31 features:266] Using torch_stft is deprecated and has been removed. The values have been forcibly set to False for FilterbankFeatures and AudioToMelSpectrogramPreprocessor. Please set exact_pad to True as needed.


[NeMo I 2024-06-10 18:36:31 features:289] PADDING: 0


    


[NeMo I 2024-06-10 18:36:33 save_restore_connector:249] Model HifiGanModel was successfully restored from /home/jovyan/datafabric/fr-en-models/tts_hifigan.nemo.


In [12]:
audio_sample = 'June18.mp3'
# To listen to it, click the play button below
IPython.display.Audio(audio_sample)

In [13]:
cuda_asr_model = asr_model.cuda()
transcribed_text = cuda_asr_model.transcribe([audio_sample])
print(transcribed_text)

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

["l'honneur le beau sence l'intérêt supérieur de la patrie"]


In [14]:
cuda_nmt_model = nmt_model.cuda()
translated_text = cuda_nmt_model.translate(transcribed_text)

print(translated_text)

['Honor the fine sence the higher interest of the homeland']


In [15]:
cuda_spectrogram_generator = spectrogram_generator.cuda()
cuda_vocoder = vocoder.cuda()


parsed = cuda_spectrogram_generator.parse(translated_text[0])
spectrogram = cuda_spectrogram_generator.generate_spectrogram(tokens=parsed, speaker=2)
audio = cuda_vocoder.convert_spectrogram_to_audio(spec=spectrogram)
IPython.display.Audio(audio.to('cpu').detach().numpy(), rate=22050)


[NeMo W 2024-06-10 19:32:52 fastpitch:291] parse() is meant to be called in eval mode.
[NeMo W 2024-06-10 19:32:52 fastpitch:368] generate_spectrogram() is meant to be called in eval mode.
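
The two warnings above say that `parse()` and `generate_spectrogram()` are meant to be called in eval mode. One way to address them is to switch the models to eval mode before inference. The following is a minimal sketch, assuming each model exposes the standard PyTorch `eval()` method; `prepare_for_inference` is a hypothetical helper name.

```python
def prepare_for_inference(*models):
    """Switch every model to eval mode and return them in the same order.

    Assumes each object exposes eval() the way torch.nn.Module does.
    """
    for model in models:
        model.eval()
    return models


# Hypothetical usage in this notebook, before calling parse():
# spectrogram_generator, vocoder = prepare_for_inference(spectrogram_generator, vocoder)
```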


In [16]:
class NemoTranslationModel(mlflow.pyfunc.PythonModel):
    def transcribe_audio(self, inputs):
        """
        Decodes base64-serialized audio, writes its first channel to a WAV file,
        and transcribes that file with the ASR model.
        The audio is assumed to be in WAV format for simplicity.
        """
        serialized_audio = inputs['source_serialized_audio'][0]
        audio_buffer = io.BytesIO(base64.b64decode(serialized_audio))
        audio, self.framerate = soundfile.read(audio_buffer)
        if len(audio.shape) > 1 and audio.shape[1] > 1:
            audio = audio[:, 0] #Get single channel  
        wave_file = "/phoenix/mlflow/tmp/{}.wav".format(self.file_id)
        soundfile.write(wave_file, audio, self.framerate)
        text = self.asr_model.cuda().transcribe([wave_file])
        return text

    def text_to_audio(self, text):
        """
        Generates audio from text using the TTS models (spectrogram generator and vocoder).
        """
        parsed = self.spectrogram_generator.cuda().parse(text)
        spectrogram = self.spectrogram_generator.cuda().generate_spectrogram(tokens=parsed, speaker=2)
        audio = self.vocoder.cuda().convert_spectrogram_to_audio(spec=spectrogram)
        return audio.to('cpu').detach().numpy()

    def serialize_audio(self, audio_np):
        """
        Serializes an audio NumPy array to a base64 string representing a WAV file.
        """
        # Keep a copy of the generated audio on disk
        wave_file = "/phoenix/mlflow/tmp/out_{}.wav".format(self.file_id)
        soundfile.write(wave_file, audio_np, samplerate=22050, format='WAV')

        # Encode the same audio in memory so the result matches the string output schema
        with io.BytesIO() as audio_buffer:
            soundfile.write(audio_buffer, audio_np, samplerate=22050, format='WAV')
            audio_buffer.seek(0)
            audio_output = audio_buffer.read()
        return base64.b64encode(audio_output).decode("utf-8")

    def load_context(self, context):
        model_name=context.artifacts["model"]
        self.asr_model = nemo_asr.models.EncDecCTCModel.restore_from("{}/enc_dec_CTC.nemo".format(model_name))
        self.nmt_model = nemo_nlp.models.MTEncDecModel.restore_from("{}/MT_enc_dec.nemo".format(model_name))
        self.spectrogram_generator = nemo_tts.models.FastPitchModel.restore_from("{}/fast_pitch.nemo".format(model_name))
        self.vocoder = nemo_tts.models.HifiGanModel.restore_from("{}/hifi_gan.nemo".format(model_name))
        self.framerate = 44100
        if not os.path.isdir("/phoenix/mlflow/tmp"):
            os.mkdir("/phoenix/mlflow/tmp")
         
    def predict(self, context, model_input, params):
        source_text = ""
        self.file_id = uuid.uuid1()
        if params["use_audio"]:
            source_text = self.transcribe_audio(model_input)[0]
        else:
            source_text = model_input['source_text'][0]
        translated_text = self.nmt_model.cuda().translate([source_text])[0]
        translated_audio = ""
        if params["use_audio"]:
            audio = self.text_to_audio(translated_text)
            translated_audio = self.serialize_audio(audio[0])
        return {"original_text": source_text, "translated_text": translated_text, "translated_serialized_audio": translated_audio}

    @classmethod
    def log_model(cls, model_name, nemo_models, demo_folder):  # e.g. ('nemo_fr_en', model_set, 'demo')
        import shutil
        input_schema = Schema(
            [
                ColSpec("string", "source_text"),
                ColSpec("string", "source_serialized_audio"),
            ]
        )
        output_schema = Schema(
            [
                ColSpec("string", "original_text"),
                ColSpec("string", "translated_text"),
                ColSpec("string", "translated_serialized_audio"),
            ]
        )
        
        params_schema = ParamSchema(
            [
                ParamSpec("use_audio", "boolean", False)
            ]
        )
      
        signature = ModelSignature(inputs=input_schema, outputs=output_schema, params=params_schema)

        if not os.path.isdir(model_name):
            os.mkdir(model_name)
        if "enc_dec_CTC" in nemo_models:
            nemo_models["enc_dec_CTC"].save_to("{}/enc_dec_CTC.nemo".format(model_name))
        if "MT_enc_dec" in nemo_models:
            nemo_models["MT_enc_dec"].save_to("{}/MT_enc_dec.nemo".format(model_name))
        if "fast_pitch" in nemo_models:
            nemo_models["fast_pitch"].save_to("{}/fast_pitch.nemo".format(model_name))
        if "hifi_gan" in nemo_models:
            nemo_models["hifi_gan"].save_to("{}/hifi_gan.nemo".format(model_name))
  
        mlflow.pyfunc.log_model(
            model_name,
            python_model=cls(),
            artifacts={"model": model_name, "demo": demo_folder},
            signature=signature
        )            
        
        shutil.rmtree(model_name)
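
The class above reads `source_serialized_audio` as a base64 string and, according to the `serialize_audio` docstring, returns the translated audio as a base64 string as well. The following is a minimal sketch of the client side of that contract, using only the standard library. The helper names (`make_silent_wav`, `encode_audio`, `decode_audio`, `build_request`) are hypothetical, and the dict-of-lists request layout is an assumption chosen to match the `[0]` indexing in `predict`.

```python
import base64
import io
import wave


def make_silent_wav(seconds=0.1, framerate=16000):
    """Return the bytes of a mono 16-bit WAV file containing silence (test data only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(framerate)
        wav.writeframes(b"\x00\x00" * int(seconds * framerate))
    return buffer.getvalue()


def encode_audio(wav_bytes):
    """Encode raw WAV bytes as a base64 string."""
    return base64.b64encode(wav_bytes).decode("ascii")


def decode_audio(serialized_audio):
    """Decode a base64 string back into raw WAV bytes."""
    return base64.b64decode(serialized_audio)


def build_request(serialized_audio="", source_text=""):
    """Build an input with the two columns declared in the input schema."""
    return {
        "source_text": [source_text],
        "source_serialized_audio": [serialized_audio],
    }


request = build_request(serialized_audio=encode_audio(make_silent_wav()))
```

In a real call, the bytes would come from an actual recording (for example `open(path, "rb").read()`) and the request would be passed to the model together with `params={"use_audio": True}`.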


In [20]:
mlflow.set_experiment(experiment_name='NeMo_translation')

with mlflow.start_run(run_name='NeMo_fr_en_translation') as run:
    model_set = {
        "enc_dec_CTC": asr_model,
        "MT_enc_dec": nmt_model,
        "fast_pitch": spectrogram_generator,
        "hifi_gan": vocoder
    }

    NemoTranslationModel.log_model(model_name='nemo_fr_en', nemo_models=model_set, demo_folder = "demo")
    mlflow.register_model(model_uri = f"runs:/{run.info.run_id}/nemo_fr_en", name="nemo_fr_en")

    


W0610 21:37:54.836454 140234841343808 file_store.py:308] Malformed experiment 'tmp'. Detailed error Yaml file '/phoenix/mlflow/tmp/meta.yaml' does not exist.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/file_store.py", line 304, in search_experiments
    exp = self._get_experiment(exp_id, view_type)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/file_store.py", line 397, in _get_experiment
    meta = FileStore._read_yaml(experiment_dir, FileStore.META_DATA_FILE_NAME)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/file_store.py", line 1306, in _read_yaml
    return _read_helper(root, file_name, attempts_remaining=retries)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/file_store.py", line 1299, in _read_helper
    result = read_yaml(root, file_name)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/file_utils.py", line 282, in read_yaml
    raise Mi

Downloading artifacts:   0%|          | 0/4 [00:00<?, ?it/s]

Downloading artifacts:   0%|          | 0/85 [00:00<?, ?it/s]

2024/06/10 21:41:12 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false
Registered model 'nemo_fr_en' already exists. Creating a new version of this model...
2024/06/10 21:42:48 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: nemo_fr_en, version 3
Created version '3' of model 'nemo_fr_en'.
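
The `Malformed experiment 'tmp'` warning in the output above appears to be related to the scratch directory `/phoenix/mlflow/tmp` that `load_context` creates: if `/phoenix/mlflow` is the root of the MLflow file store, the store may treat that directory as an experiment and look for a `meta.yaml` inside it. One possible way to avoid this is to write scratch WAV files to the system temporary directory instead. The following is a minimal sketch under that assumption; `write_scratch_wav` is a hypothetical helper.

```python
import os
import tempfile


def write_scratch_wav(wav_bytes, prefix="nemo_"):
    """Write WAV bytes to a uniquely named file in the system temp directory.

    Returns the file path. The caller is responsible for deleting the file.
    """
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=".wav")
    with os.fdopen(fd, "wb") as handle:
        handle.write(wav_bytes)
    return path
```

The returned path could then be passed to `transcribe()` in place of the path under `/phoenix/mlflow/tmp`, and removed with `os.remove(path)` once the transcription is done.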
