# Getting Started: Sample Conversational AI application
This notebook shows how to use NVIDIA NeMo (https://github.com/NVIDIA/NeMo) to build a toy demo that translates a Mandarin audio file into an English one.

The demo shows how to:

* Instantiate pre-trained NeMo models from NVIDIA NGC.
* Transcribe audio with a Mandarin speech recognition model.
* Translate text with a machine translation model.
* Generate audio with text-to-speech models.
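
The steps above chain into a single speech-to-speech pipeline. The sketch below shows only the data flow: `speech_to_speech` and its three callable parameters are hypothetical names introduced for illustration, not part of NeMo, and the stand-in lambdas at the bottom take the place of real models.

```python
from typing import Callable, List


def speech_to_speech(
    audio_paths: List[str],
    transcribe: Callable[[List[str]], List[str]],
    translate: Callable[[List[str]], List[str]],
    synthesize: Callable[[str], object],
) -> List[object]:
    """Chain ASR -> translation -> TTS over a batch of audio files.

    All three callables are placeholders (assumptions for this sketch):
    `transcribe` maps file paths to source-language text, `translate` maps
    source text to target text, and `synthesize` maps one sentence to audio.
    """
    source_text = transcribe(audio_paths)
    target_text = translate(source_text)
    return [synthesize(sentence) for sentence in target_text]


# Stand-in callables so the sketch runs without any model downloaded.
demo_audio = speech_to_speech(
    ["sample.mp3"],
    transcribe=lambda paths: ["我们尽了最大努力" for _ in paths],
    translate=lambda texts: ["We tried our best" for _ in texts],
    synthesize=lambda sentence: f"<audio for: {sentence}>",
)
print(demo_audio)
```

With the models loaded later in this notebook, the three callables would plausibly be `asr_model.transcribe`, `nmt_model.translate`, and the `text_to_audio` helper.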

## Installation
NeMo can be installed with a single pip command.
Installation takes about 4 minutes.

In [1]:
BRANCH = 'r1.22.0'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]


DEPRECATION: git+https://github.com/NVIDIA/NeMo.git@r1.22.0#egg=nemo_toolkit[all] contains an egg fragment with a non-PEP 508 name pip 25.0 will enforce this behaviour change. A possible replacement is to use the req @ url syntax, and remove the egg fragment. Discussion can be found at https://github.com/pypa/pip/issues/11617
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting nemo_toolkit (from nemo_toolkit[all])
  Cloning https://github.com/NVIDIA/NeMo.git (to revision r1.22.0) to /tmp/pip-install-0p53req7/nemo-toolkit_eebe09e4997442009d00edef80013aed
  Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA/NeMo.git /tmp/pip-install-0p53req7/nemo-toolkit_eebe09e4997442009d00edef80013aed
  Running command git checkout -b r1.22.0 --track origin/r1.22.0
  Switched to a new branch 'r1.22.0'
  Branch 'r1.22.0' set up to track remote branch 'r1.22.0' from 'origin'.
  Resolved https://github.com/NVIDIA/NeMo.git to com
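
Once the install finishes, it can be useful to confirm which version ended up in the environment. The sketch below uses the standard-library `importlib.metadata`; `installed_version` is a hypothetical helper name, and it assumes the distribution is registered as `nemo_toolkit`, the name used in the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints None when the package is not installed in the current environment.
print("nemo_toolkit:", installed_version("nemo_toolkit"))
```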

## Import all necessary packages

In [2]:
# Import NeMo and its ASR, NLP and TTS collections
import nemo
# Import Speech Recognition collection
import nemo.collections.asr as nemo_asr
# Import Natural Language Processing collection
import nemo.collections.nlp as nemo_nlp
# Import Speech Synthesis collection
import nemo.collections.tts as nemo_tts
# We'll use this to listen to audio
import IPython

## Instantiate pre-trained NeMo models

Every NeMo model class provides these methods:

* ``list_available_models()`` lists all models currently available on NGC, along with their names.

* ``from_pretrained(...)`` downloads a model from NGC by name and initializes it.
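
The listing returned by ``list_available_models()`` can be long, so filtering it by name is often handy. The sketch below assumes each returned entry exposes a `pretrained_model_name` attribute; `find_model_names` is a hypothetical helper, and the `SimpleNamespace` entries are stand-ins so the example runs without NeMo installed.

```python
from types import SimpleNamespace
from typing import Iterable, List


def find_model_names(model_infos: Iterable[object], keyword: str) -> List[str]:
    """Return names of listed models whose name contains `keyword`.

    Assumes every entry has a `pretrained_model_name` attribute.
    """
    return [
        info.pretrained_model_name
        for info in model_infos
        if keyword in info.pretrained_model_name
    ]


# Stand-in entries so the sketch runs offline. With NeMo installed, you would
# pass nemo_asr.models.EncDecCTCModel.list_available_models() instead.
fake_listing = [
    SimpleNamespace(pretrained_model_name="stt_zh_citrinet_1024_gamma_0_25"),
    SimpleNamespace(pretrained_model_name="stt_en_citrinet_1024"),
]
print(find_model_names(fake_listing, "_zh_"))
```

A name found this way is what ``from_pretrained(model_name=...)`` expects.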


In [4]:
# Speech Recognition model - Citrinet initially trained on Multilingual LibriSpeech English corpus, and fine-tuned on the open source Aishell-2
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_zh_citrinet_1024_gamma_0_25").cuda()

# Neural Machine Translation model
nmt_model = nemo_nlp.models.MTEncDecModel.from_pretrained(model_name='nmt_zh_en_transformer6x6').cuda()

# Spectrogram generator which takes text as input and produces a spectrogram
spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch").cuda()

# Vocoder model which takes a spectrogram and produces the actual audio
vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_en_hifigan").cuda()

[NeMo I 2024-01-31 17:10:23 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_zh_citrinet_1024_gamma_0_25/versions/1.0.0/files/stt_zh_citrinet_1024_gamma_0_25.nemo to /root/.cache/torch/NeMo/NeMo_1.22.0/stt_zh_citrinet_1024_gamma_0_25/e4a8b1119971335507d9672e03bc80f4/stt_zh_citrinet_1024_gamma_0_25.nemo
[NeMo I 2024-01-31 17:11:47 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-01-31 17:12:01 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 32
    labels:
    - ' '
    - ''''
    - A
    - B
    - C
    - D
    - E
    - F
    - G
    - H
    - I
    - J
    - K
    - L
    - M
    - 'N'
    - O
    - P
    - Q
    - R
    - S
    - T
    - U
    - V
    - W
    - X
    - 'Y'
    - Z
    - 㶧
    - 䶮
    - 一
    - 丁
    - 七
    - 万
    - 丈
    - 三
    - 上
    - 下
    - 不
    - 与
    - 丐
    - 丑
    - 专
    - 且
    - 丕
    - 世
    - 丘
    - 丙
    - 业
    - 丛
    - 东
    - 丝
    - 丞
    - 丢
    - 两
    - 严
    - 丧
    - 个
    - 丫
    - 中
    - 丰
    - 串
    - 临
    - 丸
    - 丹
    - 为
    - 主
    - 丽
    - 举
    - 乃
    - 久
    - 么
    - 义
    - 之
    - 乌
    - 乍
    - 乎
    - 乏
    - 乐
    - 乒
    - 乓
    - 乔
    - 乖
    - 乘
    - 乙

[NeMo I 2024-01-31 17:12:02 features:289] PADDING: 16
[NeMo I 2024-01-31 17:12:08 save_restore_connector:249] Model EncDecCTCModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.22.0/stt_zh_citrinet_1024_gamma_0_25/e4a8b1119971335507d9672e03bc80f4/stt_zh_citrinet_1024_gamma_0_25.nemo.
[NeMo I 2024-01-31 17:12:08 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/nmt_zh_en_transformer6x6/versions/1.0.0rc1/files/nmt_zh_en_transformer6x6.nemo to /root/.cache/torch/NeMo/NeMo_1.22.0/nmt_zh_en_transformer6x6/eff3792e6f4420ba83436be889e92d79/nmt_zh_en_transformer6x6.nemo
[NeMo I 2024-01-31 17:15:25 common:913] Instantiating model from pre-trained checkpoint
[NeMo I 2024-01-31 17:15:31 tokenizer_utils:179] Getting YouTokenToMeTokenizer with model: /tmp/tmp7seaoms1/tokenizer.decoder.32000.BPE.model with r2l: False.
[NeMo I 2024-01-31 17:15:31 tokenizer_utils:179] Getting YouTokenToMeTokenizer with model: /tmp/tmp7seaoms1/tokenizer.encoder.32000.BPE.model

[NeMo W 2024-01-31 17:15:31 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    src_file_name: /raid/tarred_data_accaligned_16k_tokens_32k_vocab_cov_0.999/batches.tokens.16000._OP_1..144_CL_.tar
    tgt_file_name: /raid/tarred_data_accaligned_16k_tokens_32k_vocab_cov_0.999/batches.tokens.16000._OP_1..144_CL_.tar
    tokens_in_batch: 16000
    clean: true
    max_seq_length: 512
    cache_ids: false
    cache_data_per_node: false
    use_cache: false
    shuffle: true
    num_samples: -1
    drop_last: false
    pin_memory: false
    num_workers: 8
    load_from_cached_dataset: false
    reverse_lang_direction: true
    load_from_tarred_dataset: true
    metadata_path: /raid/tarred_data_accaligned_16k_tokens_32k_vocab_cov_0.999/metadata.json
    tar_shuffle_n: 100
    
[NeMo W 2024-01-31 17:15:31 modelPT:168] If you intend to do valida

[NeMo I 2024-01-31 17:15:35 save_restore_connector:249] Model MTEncDecModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.22.0/nmt_zh_en_transformer6x6/eff3792e6f4420ba83436be889e92d79/nmt_zh_en_transformer6x6.nemo.
[NeMo I 2024-01-31 17:15:35 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.22.0/tts_en_fastpitch_align/b7d086a07b5126c12d5077d9a641a38c/tts_en_fastpitch_align.nemo.
[NeMo I 2024-01-31 17:15:35 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.22.0/tts_en_fastpitch_align/b7d086a07b5126c12d5077d9a641a38c/tts_en_fastpitch_align.nemo
[NeMo I 2024-01-31 17:15:35 common:913] Instantiating model from pre-trained checkpoint


 NeMo-text-processing :: INFO     :: Creating ClassifyFst grammars.
I0131 17:15:39.019015 139623426499008 tokenize_and_classify.py:86] Creating ClassifyFst grammars.
[NeMo W 2024-01-31 17:16:15 en_us_arpabet:66] apply_to_oov_word=None, This means that some of words will remain unchanged if they are not handled by any of the rules in self.parse_one_word(). This may be intended if phonemes and chars are both valid inputs, otherwise, you may see unexpected deletions in your input.
[NeMo W 2024-01-31 17:16:15 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.torch.data.TTSDataset
      manifest_filepath: /ws/LJSpeech/nvidia_ljspeech_train_clean_ngc.json
      sample_rate: 22050
      sup_data_path: /raid/LJSpeech/supplementary
      sup_data_types:
      - align_prior_matrix
      - pitch
  

[NeMo I 2024-01-31 17:16:15 features:289] PADDING: 1
[NeMo I 2024-01-31 17:16:16 save_restore_connector:249] Model FastPitchModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.22.0/tts_en_fastpitch_align/b7d086a07b5126c12d5077d9a641a38c/tts_en_fastpitch_align.nemo.
[NeMo I 2024-01-31 17:16:16 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/files/tts_hifigan.nemo to /root/.cache/torch/NeMo/NeMo_1.22.0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo
[NeMo I 2024-01-31 17:17:28 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-01-31 17:17:30 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/train_finetune.txt
      min_duration: 0.75
      n_segments: 8192
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 64
      num_workers: 4
    
[NeMo W 2024-01-31 17:17:30 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/val_finetune.txt
      min_duration: 3
      n_segments: 66150


[NeMo I 2024-01-31 17:17:30 features:289] PADDING: 0


[NeMo W 2024-01-31 17:17:30 features:266] Using torch_stft is deprecated and has been removed. The values have been forcibly set to False for FilterbankFeatures and AudioToMelSpectrogramPreprocessor. Please set exact_pad to True as needed.


[NeMo I 2024-01-31 17:17:30 features:289] PADDING: 0


    


[NeMo I 2024-01-31 17:17:32 save_restore_connector:249] Model HifiGanModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.22.0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.
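
The `.cuda()` calls in the loading cell above assume a CUDA-capable GPU is present. A minimal sketch of a fallback follows, using PyTorch's `torch.cuda.is_available()` as the check. `pick_device` is a hypothetical helper name, and the commented line shows how it might replace `.cuda()` through the standard `.to(device)` method of PyTorch modules.

```python
def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)

# Assumed usage with the models from this notebook:
# asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
#     model_name="stt_zh_citrinet_1024_gamma_0_25"
# ).to(device)
```

Running all four models on CPU should still work, though inference is likely to be noticeably slower.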


## Get an audio sample in Mandarin

In [6]:
# Download the audio sample we'll use
# This is a sample from the MCV 6.1 Dev dataset - the model hasn't seen it before
# IMPORTANT: The audio must be mono with a 16 kHz sampling rate
audio_sample = 'common_voice_zh-CN_21347786.mp3'
!wget 'https://nemo-public.s3.us-east-2.amazonaws.com/zh-samples/common_voice_zh-CN_21347786.mp3'
# To listen to it, click the play button below
IPython.display.Audio(audio_sample)

--2024-01-31 17:17:33--  https://nemo-public.s3.us-east-2.amazonaws.com/zh-samples/common_voice_zh-CN_21347786.mp3
Resolving nemo-public.s3.us-east-2.amazonaws.com (nemo-public.s3.us-east-2.amazonaws.com)... 52.219.97.34, 52.219.110.90, 52.219.92.122, ...
Connecting to nemo-public.s3.us-east-2.amazonaws.com (nemo-public.s3.us-east-2.amazonaws.com)|52.219.97.34|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24813 (24K) [audio/mp3]
Saving to: ‘common_voice_zh-CN_21347786.mp3.1’


2024-01-31 17:17:34 (135 KB/s) - ‘common_voice_zh-CN_21347786.mp3.1’ saved [24813/24813]
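
The cell above notes that the input must be mono at 16 kHz. For audio that has been converted to WAV, the standard-library `wave` module can check both properties before transcription. This is a sketch under that assumption: `wave` does not read MP3, so it would not apply to the downloaded sample directly. `is_mono_16k` and `write_silence` are hypothetical helper names, and the silent files exist only to exercise the check.

```python
import os
import tempfile
import wave


def is_mono_16k(wav_path: str, expected_rate: int = 16000) -> bool:
    """Return True if the WAV file has one channel at `expected_rate` Hz."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getnchannels() == 1 and wav_file.getframerate() == expected_rate


def write_silence(wav_path: str, channels: int, rate: int, frames: int = 160) -> None:
    """Write a short silent 16-bit PCM WAV file (demo input only)."""
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * frames * channels)


with tempfile.TemporaryDirectory() as tmp_dir:
    good = os.path.join(tmp_dir, "mono_16k.wav")
    bad = os.path.join(tmp_dir, "stereo_44k.wav")
    write_silence(good, channels=1, rate=16000)
    write_silence(bad, channels=2, rate=44100)
    results = (is_mono_16k(good), is_mono_16k(bad))

print(results)
```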



## Transcribe audio file
We will use the speech recognition model to convert the audio into text.


In [7]:
transcribed_text = asr_model.transcribe([audio_sample])
print(transcribed_text)

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

['我们尽了最大努力']


## Translate Chinese text into English
NeMo's NMT models have a handy ``.translate()`` method.

In [8]:
english_text = nmt_model.translate(transcribed_text)
print(english_text)

['We tried our best']


## Generate English audio from text
Speech generation from text typically takes two steps:
* Generate a spectrogram from the text. In this example we use the FastPitch model.
* Generate the actual audio from the spectrogram. In this example we use the HifiGan model.


In [9]:
# A helper function that combines FastPitch and HifiGan to go directly from
# text to audio
def text_to_audio(text):
    parsed = spectrogram_generator.parse(text)
    spectrogram = spectrogram_generator.generate_spectrogram(tokens=parsed)
    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
    return audio.to('cpu').detach().numpy()

In [10]:
# Listen to generated audio in English
IPython.display.Audio(text_to_audio(english_text[0]), rate=22050)

[NeMo W 2024-01-31 17:17:46 fastpitch:291] parse() is meant to be called in eval mode.
[NeMo W 2024-01-31 17:17:46 fastpitch:368] generate_spectrogram() is meant to be called in eval mode.
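
To keep the generated speech instead of only playing it inline, the samples can be written to a WAV file. The sketch below uses only the standard library and assumes mono float samples in the range [-1.0, 1.0] at the 22050 Hz rate used above. `save_wav` is a hypothetical helper, and the four hand-written samples stand in for real model output.

```python
import os
import struct
import tempfile
import wave


def save_wav(samples, path, sample_rate=22050):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM; return frame count."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        # WAV stores PCM samples as little-endian integers.
        wav_file.writeframes(struct.pack("<{}h".format(len(pcm)), *pcm))
    return len(pcm)


# Demo input: four hand-written samples, the last one deliberately out of range.
out_path = os.path.join(tempfile.gettempdir(), "demo_tts_output.wav")
frames_written = save_wav([0.0, 0.5, -1.0, 2.0], out_path)
print(frames_written)
```

With the notebook's helper, the call would look something like `save_wav(text_to_audio(english_text[0])[0], "output.wav")`, assuming the returned array has shape `(1, num_samples)`.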
