# Getting Started: Sample Conversational AI application
This notebook shows how to use NVIDIA NeMo (https://github.com/NVIDIA/NeMo) to build a toy demo that translates a French audio file into an English one.

The demo shows how to:

* Instantiate pre-trained NeMo models from NVIDIA NGC.
* Transcribe audio with a (French) speech recognition model.
* Translate text with a machine translation model.
* Generate audio with text-to-speech models.

## Installation
NeMo can be installed with a single pip command.
This takes about 4 minutes.

(The installation method below should work inside a new Conda environment or in an NVIDIA Docker container.)

In [1]:
BRANCH = 'r1.5.0'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

Collecting nemo_toolkit[all]
  Cloning https://github.com/NVIDIA/NeMo.git (to revision r1.5.0) to /tmp/pip-install-2ejxio0p/nemo-toolkit_cbe5487477bb45dd98b39498bb12f509
  Running command git clone -q https://github.com/NVIDIA/NeMo.git /tmp/pip-install-2ejxio0p/nemo-toolkit_cbe5487477bb45dd98b39498bb12f509
  Running command git checkout -b r1.5.0 --track origin/r1.5.0
  Switched to a new branch 'r1.5.0'
  Branch 'r1.5.0' set up to track remote branch 'r1.5.0' from 'origin'.
Collecting onnx>=1.7.0
  Downloading onnx-1.10.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (12.7 MB)
Collecting ruamel.yaml
  Downloading ruamel.yaml-0.17.17-py3-none-any.whl (109 kB)
Collecting sentencepiece<1.0.0
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Coll

## Import all necessary packages

In [2]:
# Import NeMo and its ASR, NLP and TTS collections
import nemo
# Import Speech Recognition collection
import nemo.collections.asr as nemo_asr
# Import Natural Language Processing collection
import nemo.collections.nlp as nemo_nlp
# Import Speech Synthesis collection
import nemo.collections.tts as nemo_tts
# We'll use this to listen to audio
import IPython

[NeMo W 2021-12-09 09:52:51 optimizers:50] Apex was not found. Using the lamb or fused_adam optimizer will error out.
################################################################################
###          (please add 'export KALDI_ROOT=<your_path>' in your $HOME/.profile)
###          (or run as: KALDI_ROOT=<your_path> python <your_script>.py)
################################################################################

[NeMo W 2021-12-09 09:52:54 experimental:28] Module <class 'nemo.collections.nlp.data.text_normalization.decoder_dataset.TextNormalizationDecoderDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2021-12-09 09:52:54 experimental:28] Module <class 'nemo.collections.nlp.data.text_normalization.tagger_dataset.TextNormalizationTaggerDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2021-12-09 09:52:54 experimental:28] Module <class 'nemo.collecti

## Instantiate pre-trained NeMo models

Every NeMo model has these methods:

* ``list_available_models()`` lists all models currently available on NGC, together with their names.

* ``from_pretrained(...)`` downloads a model from NGC by name and initializes it.
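
The list returned by ``list_available_models()`` can be long, so it may help to filter it by name. The sketch below does that on the `pretrained_model_name` field that appears in the outputs of this notebook. `ModelInfo` is a hypothetical stand-in for NeMo's `PretrainedModelInfo` so that the snippet runs without NeMo installed, and `find_models` is a helper name made up for this example.

```python
from collections import namedtuple

# Hypothetical stand-in for NeMo's PretrainedModelInfo (assumption: only the
# `pretrained_model_name` attribute is needed for filtering).
ModelInfo = namedtuple("ModelInfo", ["pretrained_model_name"])


def find_models(infos, fragment):
    """Return the names of all models whose name contains `fragment`."""
    return [info.pretrained_model_name for info in infos
            if fragment in info.pretrained_model_name]


# Model names copied from the outputs in this notebook.
catalog = [
    ModelInfo("QuartzNet15x5Base-En"),
    ModelInfo("stt_en_quartznet15x5"),
    ModelInfo("stt_fr_quartznet15x5"),
]
french_models = find_models(catalog, "_fr_")
print(french_models)
```

With NeMo installed, the same helper should work on the real list, for example `find_models(nemo_asr.models.EncDecCTCModel.list_available_models(), "_fr_")`, assuming each entry exposes `pretrained_model_name` as an attribute.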


In [3]:
# Here is an example of all CTC-based models:
nemo_asr.models.EncDecCTCModel.list_available_models()
# More ASR Models are available - see: nemo_asr.models.ASRModel.list_available_models()

[PretrainedModelInfo(
 	pretrained_model_name=QuartzNet15x5Base-En,
 	description=QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 for 600 epochs. The model achieves a WER of 3.79% on LibriSpeech dev-clean, and a WER of 10.05% on dev-other. Please visit https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels for further details.,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemospeechmodels/versions/1.0.0a5/files/QuartzNet15x5Base-En.nemo
 ), PretrainedModelInfo(
 	pretrained_model_name=stt_en_quartznet15x5,
 	description=For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_quartznet15x5,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_quartznet15x5/versions/1.0.0rc1/files/stt_en_quartznet15x5.nemo
 ), PretrainedModelInfo(
 	pretr

In [4]:
nemo_nlp.models.MTEncDecModel.list_available_models()

[PretrainedModelInfo(
 	pretrained_model_name=nmt_en_de_transformer12x2,
 	description=En->De translation model. See details here: https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_en_de_transformer12x2,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/nmt_en_de_transformer12x2/versions/1.0.0rc1/files/nmt_en_de_transformer12x2.nemo
 ), PretrainedModelInfo(
 	pretrained_model_name=nmt_de_en_transformer12x2,
 	description=De->En translation model. See details here: https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_de_en_transformer12x2,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/nmt_de_en_transformer12x2/versions/1.0.0rc1/files/nmt_de_en_transformer12x2.nemo
 ), PretrainedModelInfo(
 	pretrained_model_name=nmt_en_es_transformer12x2,
 	description=En->Es translation model. See details here: https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_en_es_transformer12x2,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/nmt_en_es_transformer12x2/ve

In [5]:
# Speech Recognition model - QuartzNet15x5 for French (stt_fr_quartznet15x5)
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_fr_quartznet15x5").cuda()
# Neural Machine Translation model
nmt_model = nemo_nlp.models.MTEncDecModel.from_pretrained(model_name='nmt_fr_en_transformer24x6').cuda()
# Spectrogram generator which takes text as an input and produces spectrogram
spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch").cuda()
# Vocoder model which takes spectrogram and produces actual audio
vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_hifigan").cuda()

[NeMo I 2021-12-09 09:55:12 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_fr_quartznet15x5/versions/1.0.0rc1/files/stt_fr_quartznet15x5.nemo to /root/.cache/torch/NeMo/NeMo_1.5.0/stt_fr_quartznet15x5/ad2ff3fc7d157ba778d3551caec449cc/stt_fr_quartznet15x5.nemo
[NeMo I 2021-12-09 09:55:14 common:728] Instantiating model from pre-trained checkpoint


[NeMo W 2021-12-09 09:55:15 modelPT:131] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /raid/noneval.json
    sample_rate: 16000
    labels:
    - ' '
    - a
    - b
    - c
    - d
    - e
    - f
    - g
    - h
    - i
    - j
    - k
    - l
    - m
    - 'n'
    - o
    - p
    - q
    - r
    - s
    - t
    - u
    - v
    - w
    - x
    - 'y'
    - z
    - ''''
    - ç
    - é
    - â
    - ê
    - î
    - ô
    - û
    - à
    - è
    - ù
    - ë
    - ï
    - ü
    - ÿ
    batch_size: 32
    trim_silence: true
    max_duration: 16.7
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    num_workers: 8
    pin_memory: true
    
[NeMo W 2021-12-09 09:55:15 modelPT:138] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() m

[NeMo I 2021-12-09 09:55:15 features:265] PADDING: 16
[NeMo I 2021-12-09 09:55:15 features:282] STFT using torch
[NeMo I 2021-12-09 09:55:27 save_restore_connector:149] Model EncDecCTCModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.5.0/stt_fr_quartznet15x5/ad2ff3fc7d157ba778d3551caec449cc/stt_fr_quartznet15x5.nemo.
[NeMo I 2021-12-09 09:55:27 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/nmt_fr_en_transformer24x6/versions/1.5/files/fr_en_24x6.nemo to /root/.cache/torch/NeMo/NeMo_1.5.0/fr_en_24x6/8b2edc09043b633be7ec2e96d73bfc91/fr_en_24x6.nemo
[NeMo I 2021-12-09 09:56:04 common:728] Instantiating model from pre-trained checkpoint
[NeMo I 2021-12-09 09:56:24 tokenizer_utils:163] Getting YouTokenToMeTokenizer with model: /tmp/tmplggx7myg/c5a378c4fb184011bfb0c7f53deb3998_shared_tokenizer.32000.BPE.model with r2l: False.
[NeMo I 2021-12-09 09:56:24 tokenizer_utils:163] Getting YouTokenToMeTokenizer with model: /tmp/tmplggx7myg/6d8a39da43c3

[NeMo W 2021-12-09 09:56:24 modelPT:131] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    src_file_name: null
    tgt_file_name: null
    use_tarred_dataset: true
    tar_files: /data/tarred_dataset_4k_tokens/parallel.batches.tokens.4000._OP_0..7945_CL_.tar
    metadata_file: /data/tarred_dataset_4k_tokens/metadata.tokens.4000.json
    lines_per_dataset_fragment: 1000000
    num_batches_per_tarfile: 100
    shard_strategy: scatter
    tokens_in_batch: 512
    clean: true
    max_seq_length: 512
    min_seq_length: 1
    cache_ids: false
    cache_data_per_node: false
    use_cache: false
    shuffle: true
    num_samples: -1
    drop_last: false
    pin_memory: false
    num_workers: 8
    reverse_lang_direction: true
    load_from_tarred_dataset: false
    metadata_path: null
    tar_shuffle_n: 100
    n_preproc_jobs: -2
    tar_file_prefix: p

[NeMo I 2021-12-09 09:56:32 save_restore_connector:149] Model MTEncDecModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.5.0/fr_en_24x6/8b2edc09043b633be7ec2e96d73bfc91/fr_en_24x6.nemo.
[NeMo I 2021-12-09 09:56:33 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/1.4.0/files/tts_en_fastpitch_align.nemo to /root/.cache/torch/NeMo/NeMo_1.5.0/tts_en_fastpitch_align/b50e16c5d695b00855ae53d6ba4e4f7f/tts_en_fastpitch_align.nemo
[NeMo I 2021-12-09 09:56:37 common:728] Instantiating model from pre-trained checkpoint


[NeMo W 2021-12-09 09:56:39 modelPT:131] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.asr.data.audio_to_text.AudioToCharWithPriorAndPitchDataset
      manifest_filepath: /raid/LJSpeech/nvidia_ljspeech_train.json
      max_duration: null
      min_duration: 0.1
      int_values: false
      normalize: true
      sample_rate: 22050
      trim: false
      sup_data_path: /raid/LJSpeech/prior
      n_window_stride: 256
      n_window_size: 1024
      pitch_fmin: 80
      pitch_fmax: 640
      pitch_avg: 211.27540199742586
      pitch_std: 52.1851002822779
      vocab:
        notation: phonemes
        punct: true
        spaces: true
        stresses: true
        add_blank_at: None
        pad_with_space: true
        chars: true
        improved_version_g2p: true
    dataloader_params:
      drop_las

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Package cmudict is already up-to-date!
[NeMo I 2021-12-09 09:56:41 features:265] PADDING: 1
[NeMo I 2021-12-09 09:56:41 features:282] STFT using torch
[NeMo I 2021-12-09 09:56:42 save_restore_connector:149] Model FastPitchModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.5.0/tts_en_fastpitch_align/b50e16c5d695b00855ae53d6ba4e4f7f/tts_en_fastpitch_align.nemo.
[NeMo I 2021-12-09 09:56:42 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/files/tts_hifigan.nemo to /root/.cache/torch/NeMo/NeMo_1.5.0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo
[NeMo I 2021-12-09 09:56:48 common:728] Instantiating model from pre-trained checkpoint


[NeMo W 2021-12-09 09:56:52 modelPT:131] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/train_finetune.txt
      min_duration: 0.75
      n_segments: 8192
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 64
      num_workers: 4
    
[NeMo W 2021-12-09 09:56:52 modelPT:138] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/val_finetune.txt
      min_duration: 3
      n_segments: 66150


[NeMo I 2021-12-09 09:56:52 features:265] PADDING: 0
[NeMo I 2021-12-09 09:56:52 features:282] STFT using torch


[NeMo W 2021-12-09 09:56:52 features:243] Using torch_stft is deprecated and will be removed in 1.1.0. Please set stft_conv and stft_exact_pad to False for FilterbankFeatures and AudioToMelSpectrogramPreprocessor. Please set exact_pad to True as needed.


[NeMo I 2021-12-09 09:56:52 features:265] PADDING: 0
[NeMo I 2021-12-09 09:56:52 features:282] STFT using torch
[NeMo I 2021-12-09 09:56:53 save_restore_connector:149] Model HifiGanModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.5.0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.


## Get an audio sample in French


In [179]:
#@markdown Record from microphone { run: "auto" }


import ipywidgets as widgets
import numpy as np
from scipy.io import wavfile
from IPython.display import Audio, display, clear_output
from colab_utils import (record_audio,
                         audio_bytes_to_np,
                         upload_audio)

record_seconds =   5#@param {type:"number", min:1, max:10, step:1}
sample_rate = 16000
# _apply_vad is not defined in this notebook, so voice activity detection stays off
use_VAD = "No"

def _recognize(audio):
  # Play the recording back, then save it as 16-bit PCM
  display(Audio(audio, rate=sample_rate, autoplay=True))
  if use_VAD == "Yes":
    audio = _apply_vad(audio)
  wavfile.write('test.wav', sample_rate, (32767*audio).numpy().astype(np.int16))

def _record_audio(b):
  clear_output()
  audio = record_audio(record_seconds)
  # wavfile.write produces WAV data, so the file gets a .wav extension
  wavfile.write('audio_sample.wav', sample_rate, (32767*audio).numpy().astype(np.int16))
  _recognize(audio)


# Clicking the button records `record_seconds` seconds from the microphone
button = widgets.Button(description="Record Speech")
button.on_click(_record_audio)
display(button)


Starting recording for 5 seconds...



Finished recording!


In [182]:
audio_samples = ["ia_16k.mp3", "bonjour_16k.mp3", "content_16k.mp3", "chardonnay_16k.mp3"]
IPython.display.Audio(audio_samples[0])

## Transcribe audio file
We will use the speech recognition model to convert the audio into text.


In [183]:
transcribed_text = asr_model.transcribe(audio_samples)
print(transcribed_text)

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]



["l'intelligence artificielle va conquérir le mond", "bonjour je suis très content d'ête là", 'on est vraiment très content de vouvoir collaborer avec vous', "quand il fait chaud e préfère un bon chardonné plutôt qu'une bière"]
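
The transcripts contain small errors (for example `mond` where `monde` is expected). A common way to quantify this is the word error rate (WER) quoted in the model descriptions above: the word-level edit distance divided by the number of reference words. The sketch below is a plain-Python version. The reference sentence is an assumption made for illustration (what the speaker presumably said), not ground truth shipped with this notebook.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Assumed (hypothetical) reference versus the first transcript above.
reference = "l'intelligence artificielle va conquérir le monde"
hypothesis = "l'intelligence artificielle va conquérir le mond"
print(word_error_rate(reference, hypothesis))
```

Here one of six reference words is wrong, so the score is 1/6. Scores on a handful of sentences are only anecdotal; a meaningful WER needs a proper test set.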


## Translate French text into English
NeMo's NMT models have a handy ``.translate()`` method.

In [184]:
english_text = nmt_model.translate(transcribed_text)
print(english_text)



['Artificial Intelligence Will Conquer the World', 'hello i am very happy to be here', 'we are really happy to work with you', "when it's hot e prefers a good thistle rather than a beer"]


## Generate English audio from text
Speech generation from text typically has two steps:
* Generate a spectrogram from the text. In this example we use the FastPitch model for this.
* Generate the actual audio from the spectrogram. In this example we use the HifiGan model for this.


In [185]:
# A helper function which combines FastPitch and HifiGan to go directly from 
# text to audio
def text_to_audio(text):
  parsed = spectrogram_generator.parse(text)
  spectrogram = spectrogram_generator.generate_spectrogram(tokens=parsed)
  audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
  return audio.squeeze().to('cpu').detach().numpy()

In [186]:
# Listen to generated audio in English
IPython.display.Audio(text_to_audio(english_text[-1]), rate=22050)

In [187]:
def save_audio_en(audio, fname='out.wav', sample_rate=22050):
  # Scale float audio in [-1, 1] to 16-bit PCM and write it as a WAV file
  out_audio = (audio*np.iinfo(np.int16).max).astype(np.int16)
  wavfile.write(fname, sample_rate, out_audio)


english_audios = [text_to_audio(eng_text) for eng_text in english_text]
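
The scaling step in `save_audio_en` assumes that the samples lie in [-1, 1]; values outside that range can overflow when cast to 16-bit integers. The sketch below illustrates the clip-then-scale idea in plain Python so it runs without NumPy. `float_to_pcm16` is a name made up for this example and is not part of NeMo or SciPy.

```python
from array import array


def float_to_pcm16(samples):
    """Clip floats to [-1, 1] and scale them to 16-bit signed integers."""
    return array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))


# 2.0 is out of range and gets clipped to full scale instead of overflowing.
pcm = float_to_pcm16([0.0, 1.0, -1.0, 2.0, 0.5])
print(list(pcm))
```

With NumPy arrays, the equivalent would be to apply `np.clip(audio, -1.0, 1.0)` before the multiplication.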

## Wandb 🏋️‍♀️

Logging rich media types to wandb.
- We can log audio files and play them back on the dashboard!

In [188]:
!pip install -Uqqq wandb

In [189]:
import wandb
wandb.login()

True

In [190]:
wandb.init(project="NeMo")

In [191]:
table = wandb.Table(columns=['audio_input', 'transcribed_text', 'translated_text', 'audio_output'])

In [192]:
def build_row(audio_fr_fname, fr_txt, eng_txt, audio_en_np):
  "Save the generated English audio to a file and return one table row"
  fname_en = audio_fr_fname.split('.')[0] + '_en.wav'
  save_audio_en(audio_en_np, fname=fname_en, sample_rate=22050)
  return [wandb.Audio(audio_fr_fname, sample_rate=16000), fr_txt, eng_txt, wandb.Audio(fname_en, sample_rate=22050)]
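
Deriving the output name with `split('.')[0]` works for the sample files used here, but it would cut the name short if a path contained another dot (for example in a directory name). A slightly more defensive sketch using `pathlib` is shown below; `english_output_name` is a hypothetical helper, not something the notebook defines.

```python
from pathlib import Path


def english_output_name(source_name):
    """Replace the extension of `source_name` with the suffix '_en.wav'."""
    source = Path(source_name)
    # `stem` drops only the last extension and keeps any directory part intact.
    return str(source.with_name(source.stem + "_en.wav"))


print(english_output_name("bonjour_16k.mp3"))
```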

In [193]:
for audio_fr_fname, fr_txt, eng_txt, audio_en in zip(audio_samples, transcribed_text, english_text, english_audios):
  table.add_data(*build_row(audio_fr_fname, fr_txt, eng_txt, audio_en))

In [194]:
wandb.log({"table": table})

In [195]:
wandb.finish()
