<a href="https://colab.research.google.com/github/tcapelle/nvidia_nemo_wandb/blob/main/nemo_conversational_ai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started: Sample Conversational AI application
This notebook shows how to use NVIDIA NeMo (https://github.com/NVIDIA/NeMo) to build a toy demo that translates a French audio file into an English one.

The demo shows how to:

* Instantiate pre-trained NeMo models from NVIDIA NGC.
* Transcribe audio with a (French) speech recognition model.
* Translate text with a machine translation model.
* Generate audio with text-to-speech models.

## Installation
NeMo can be installed with a single pip command.
This takes about 4 minutes.

(The installation method below should work inside your new Conda environment or in an NVIDIA docker container.)

In [None]:
!git clone https://github.com/tcapelle/nvidia_nemo_wandb/

BRANCH = 'r1.5.0'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

## Import all necessary packages

In [None]:
from pathlib import Path

# Import NeMo and its ASR, NLP and TTS collections
import nemo
# Import Speech Recognition collection
import nemo.collections.asr as nemo_asr
# Import Natural Language Processing collection
import nemo.collections.nlp as nemo_nlp
# Import Speech Synthesis collection
import nemo.collections.tts as nemo_tts
# We'll use this to listen to audio
import IPython

## Instantiate pre-trained NeMo models

Every NeMo model has these methods:

* ``list_available_models()`` lists all models currently available on NGC, along with their names.

* ``from_pretrained(...)`` downloads and initializes a model directly from NGC, given its model name.


In [None]:
# Here is an example of all CTC-based models:
nemo_asr.models.EncDecCTCModel.list_available_models()
# More ASR Models are available - see: nemo_asr.models.ASRModel.list_available_models()

In [None]:
nemo_nlp.models.MTEncDecModel.list_available_models()

In [None]:
# Speech Recognition model - French QuartzNet15x5 speech-to-text model
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_fr_quartznet15x5").cuda()
# Neural Machine Translation model
nmt_model = nemo_nlp.models.MTEncDecModel.from_pretrained(model_name='nmt_fr_en_transformer12x2').cuda()
# Spectrogram generator which takes text as an input and produces spectrogram
spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch").cuda()
# Vocoder model which takes spectrogram and produces actual audio
vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_hifigan").cuda()

## Get an audio sample in French


In [None]:
audio_samples = [str(f) for f in Path("nvidia_nemo_wandb/audio_samples").iterdir()]
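
`Path.iterdir()` yields entries in arbitrary order and returns every entry in the folder, while the later cells pair each file with its transcript and translation by position. A minimal sketch of a more defensive listing follows; the helper name `list_audio_files` and the `.wav` suffix filter are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path
import tempfile

def list_audio_files(folder, suffixes=(".wav",)):
    """Hypothetical helper: sorted paths of files in `folder` with an audio suffix."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )

# Demonstrate on a temporary folder so the sketch runs without the cloned repo.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.wav", "a.wav", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [Path(f).name for f in list_audio_files(tmp)]

print(found)
```

Sorting makes the order reproducible between runs, which helps when comparing logged results.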

In [None]:
IPython.display.Audio(audio_samples[0])

## Transcribe audio file
We will use the speech recognition model to convert the audio into text.


In [None]:
transcribed_text = asr_model.transcribe(audio_samples)
print(transcribed_text)

## Translate French text into English
NeMo's NMT models have a handy ``.translate()`` method.

In [None]:
english_text = nmt_model.translate(transcribed_text)
print(english_text)

## Generate English audio from text
Speech generation from text typically has two steps:
* Generate a spectrogram from the text. This example uses the FastPitch model for this.
* Generate the actual audio from the spectrogram. This example uses the HifiGan model for this.


In [None]:
# A helper function which combines FastPitch and HifiGan to go directly from 
# text to audio
def text_to_audio(text):
  parsed = spectrogram_generator.parse(text)
  spectrogram = spectrogram_generator.generate_spectrogram(tokens=parsed)
  audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
  return audio.squeeze().to('cpu').detach().numpy()

In [None]:
# Listen to generated audio in English
IPython.display.Audio(text_to_audio(english_text[-1]), rate=22050)

In [None]:
from scipy.io import wavfile
import numpy as np

def save_audio_en(audio, fname='out.wav', sample_rate=22050):
  "Save float audio (assumed to lie in [-1, 1]) as a 16-bit PCM WAV file"
  out_audio = (audio*np.iinfo(np.int16).max).astype(np.int16)
  wavfile.write(fname, sample_rate, out_audio)


english_audios = [text_to_audio(eng_text) for eng_text in english_text]
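
The int16 scaling in `save_audio_en` assumes the samples lie in [-1, 1]; values outside that range do not convert reliably when cast to `int16`. A minimal sketch of the conversion with clipping added, assuming float input; the helper name `float_to_int16` is hypothetical and not part of the original notebook.

```python
import numpy as np

def float_to_int16(audio):
    """Hypothetical helper: scale float audio to int16, clipping to [-1, 1] first."""
    audio = np.asarray(audio, dtype=np.float32)
    clipped = np.clip(audio, -1.0, 1.0)
    # Scale to the int16 range; astype truncates toward zero.
    return (clipped * np.iinfo(np.int16).max).astype(np.int16)

# The last sample is out of range and is clipped to full scale.
samples = float_to_int16([0.0, 0.5, 1.0, -1.0, 1.5])
print(samples.dtype, samples.tolist())
```

Clipping trades a small amount of distortion on overshooting samples for the guarantee that the cast stays within the int16 range.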

## Wandb 🏋️‍♀️

Weights & Biases (wandb) can log rich media types.
- We can log audio files and play them back on the dashboard!

In [None]:
!pip install -Uqqq wandb

In [None]:
import wandb
wandb.login()

In [None]:
wandb.init(project="NeMo")

We will create a `wandb.Table` to hold the output of each data processing stage:

In [None]:
table = wandb.Table(columns=['audio_input', 'transcribed_text', 'translated_text', 'audio_output'])

In [None]:
def _build_row(audio_fr_fname, fr_txt, eng_txt, audio_en_np):
  "Save the generated English audio to a file and build one table row"
  fname_en = audio_fr_fname.split('.')[0] + '_en.wav'
  # Use the function argument rather than relying on a global variable
  save_audio_en(audio_en_np, fname=fname_en, sample_rate=22050)
  return [wandb.Audio(audio_fr_fname, sample_rate=16000), fr_txt, eng_txt, wandb.Audio(fname_en, sample_rate=22050)]
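
Note that `split('.')[0]` cuts at the first dot anywhere in the string, so a path such as `./audio.v2/clip.wav` would produce just `_en.wav`. A minimal sketch of a more robust alternative using `pathlib`; the helper name `english_audio_name` and the example paths are hypothetical.

```python
from pathlib import Path

def english_audio_name(audio_fr_fname):
    """Hypothetical helper: 'dir/clip.wav' -> 'dir/clip_en.wav'."""
    path = Path(audio_fr_fname)
    # with_name swaps only the final path component; stem drops the last suffix.
    return str(path.with_name(path.stem + "_en.wav"))

print(english_audio_name("audio_samples/clip.wav"))
print(english_audio_name("./audio.v2/clip.wav"))
```

This keeps the output file next to the input file regardless of dots in directory names.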

In [None]:
for audio_fr_fname, fr_txt, eng_txt, audio_en in zip(audio_samples, transcribed_text, english_text, english_audios):
  table.add_data(*_build_row(audio_fr_fname, fr_txt, eng_txt, audio_en))

In [None]:
wandb.log({"table": table})

In [None]:
wandb.finish()