<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# 8.0 ASR-NLP-TTS Deployment
## (part of Lab 2)

In this notebook, you'll deploy a full pipeline with [NVIDIA Riva](https://developer.nvidia.com/riva). After building a "plain vanilla" end-to-end application using out-of-the-box models, you'll customize the pipeline for a specific restaurant use case.

**[8.1 Full Pipeline With OOTB Models](#8.1-Full-Pipeline-With-OOTB-Models)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[8.1.1 Model Deployment](#8.1.1-Model-Deployment)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[8.1.2 Exercise: Riva Configuration](#8.1.2-Excercise:-Riva-Configuration)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[8.1.3 Riva Start Services](#8.1.3-Riva-Start-Services)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[8.1.4 OOTB Pipeline Demo](#8.1.4-OOTB-Pipeline-Demo)<br>
**[8.2 ASR Customization](#8.2-ASR-Customization)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[8.2.1 Word Boosting](#8.2.1-Word-Boosting)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[8.2.2 Exercise: Negative Word Boost](#8.2.2-Exercise:-Negative-Word-Boost)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[8.2.3 Lexicon Customization](#8.2.3-Lexicon-Customization)<br>
**[8.3 NER Customization](#8.3-NER-Customization)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[8.3.1 IOB Tagging](#8.3.1-IOB-Tagging)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[8.3.2 Restaurant Context for NER](#8.3.2-Restaurant-Context-for-NER)<br>
**[8.4 TTS Customization](#8.4-TTS-Customization)<br>**
**[8.5 Customized Restaurant Pipeline Demo](#8.5-Customized-Restaurant-Pipeline-Demo)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[8.5.1 Exercise: Run a Custom Pipeline](#8.5.1-Exercise:-Run-a-Custom-Pipeline)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[8.5.2 Shut Down Riva](#8.5.2-Shut-Down-Riva)<br>
**[8.6 Shut Down the Kernel](#8.6-Shut-Down-the-Kernel)<br>**

### Notebook Dependencies
The steps in this notebook assume that you have:

1. **NGC Credentials**<br>Be sure you have added your NGC credentials as described in the [NGC Setup notebook](003_Intro_NGC_Setup.ipynb).

---
# 8.1 Full Pipeline With OOTB Models

In previous notebooks, we took a close look at ASR and TTS speech models.  In this notebook, we'll add NLP (Natural Language Processing) models to build a full pipeline. NLP models interpret text in various ways so that action can be taken based on the text's meaning. We'll use the standard OOTB (out-of-the-box) models that Riva pulls from NGC.



## 8.1.1 Model Deployment
The default ASR and TTS service models were deployed earlier in their own model repositories, `/dli_workspace/riva-asr-model-repo` and `/dli_workspace/riva-tts-model-repo`.  For the full pipeline, we'll need to deploy default models for ASR, TTS, and NLP services.  The `riva_init.sh` command loads and builds models specific to the GPU you are using, but to save time for this course, these have been preloaded.

The optimized NLP models are already located in `/dli_workspace/riva-full-model-repo`, but let's go ahead and copy the ASR and TTS models there as well to save some build time:

In [1]:
# Set the Riva Quick Start directory
WORKSPACE='/dli_workspace'
RIVA_QS = WORKSPACE + "/riva_quickstart"
RIVA_MODEL_REPO = WORKSPACE + "/riva-full-model-repo"
!mkdir -p $RIVA_MODEL_REPO

In [3]:
%%bash
# Copy all the ASR and TTS models for convenience (faster deployment)
# Time is about 1-2 minutes for the copy
cp -rn  /dli_workspace/riva-asr-model-repo/* \
    /dli_workspace/riva-full-model-repo/
cp -rn  /dli_workspace/riva-tts-model-repo/* \
    /dli_workspace/riva-full-model-repo/

In [4]:
# check to see what models are there now
!ls $RIVA_MODEL_REPO/models

conformer-en-US-asr-offline
conformer-en-US-asr-offline-ctc-decoder-cpu-streaming-offline
conformer-en-US-asr-offline-endpointing-streaming-offline
conformer-en-US-asr-offline-feature-extractor-streaming-offline
conformer-en-US-asr-streaming
conformer-en-US-asr-streaming-ctc-decoder-cpu-streaming
conformer-en-US-asr-streaming-endpointing-streaming
conformer-en-US-asr-streaming-feature-extractor-streaming
conformer-es-US-asr-offline
conformer-es-US-asr-offline-ctc-decoder-cpu-streaming-offline
conformer-es-US-asr-offline-endpointing-streaming-offline
conformer-es-US-asr-offline-feature-extractor-streaming-offline
conformer-es-US-asr-streaming
conformer-es-US-asr-streaming-ctc-decoder-cpu-streaming
conformer-es-US-asr-streaming-endpointing-streaming
conformer-es-US-asr-streaming-feature-extractor-streaming
fastpitch_hifigan_ensemble-English-US
intent_slot_detokenizer
intent_slot_label_tokens_weather
intent_slot_tokenizer-en-US-weather
qa_qa_postprocessor
qa_tokenizer-en-US
riva-onnx-fast

## 8.1.2 Excercise: Riva Configuration

Open [config.sh](dli_workspace/riva_quickstart/config.sh) and modify it to deploy all three services (ASR, NLP, TTS) using the `/dli_workspace/riva-full-model-repo` location that we've preloaded with all the models.  Save your work.

If you're not sure what to change, take a peek at the [solution](solutions/ex8.1.2_config.sh).
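
As a rough guide, the edit comes down to a handful of shell variable assignments. The sketch below parses `KEY=value` lines from config-like text and prints the settings of interest. The variable names (`service_enabled_asr`, `service_enabled_nlp`, `service_enabled_tts`, `riva_model_loc`) are assumptions based on the usual Riva Quick Start layout, and the sample text is made up for illustration, so check both against your own `config.sh`.

```python
import re

# Hypothetical excerpt of a config.sh; the real file contains many more settings.
SAMPLE_CONFIG = '''
# Enable or disable Riva services
service_enabled_asr=true
service_enabled_nlp=true
service_enabled_tts=true
riva_model_loc="/dli_workspace/riva-full-model-repo"
'''

def read_shell_assignments(text):
    """Return a dict of simple KEY=value assignments, ignoring comment lines."""
    settings = {}
    for line in text.splitlines():
        match = re.match(r'^\s*([A-Za-z_][A-Za-z0-9_]*)=(.*)$', line)
        if match:
            key, value = match.groups()
            # Strip optional surrounding quotes from the value
            settings[key] = value.strip().strip('"').strip("'")
    return settings

settings = read_shell_assignments(SAMPLE_CONFIG)
for key in ("service_enabled_asr", "service_enabled_nlp",
            "service_enabled_tts", "riva_model_loc"):
    print(f"{key} = {settings.get(key)}")
```

To inspect the real file, the same function could be given the contents of `dli_workspace/riva_quickstart/config.sh` instead of the sample string.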

Check your work.  The `diff` comparison in the following cell should have no output.

In [6]:
# Check your work
!diff solutions/ex8.1.2_config.sh dli_workspace/riva_quickstart/config.sh

In [5]:
# Quick fix!
!cp solutions/ex8.1.2_config.sh dli_workspace/riva_quickstart/config.sh

## 8.1.3 Riva Start Services

The `riva_init.sh` script downloads the Riva containers needed, downloads the models listed in `config.sh`, and optimizes models as required with [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt). Since we've already downloaded the containers and preloaded the optimized models, `riva_init.sh` won't have much to do, but it is provided here for completeness.

The `riva_start.sh` script starts the server.

In [6]:
# Initialize Riva 
!cd $RIVA_QS && bash riva_init.sh config.sh

Logging into NGC docker registry if necessary...
Pulling required docker images if necessary...
Note: This may take some time, depending on the speed of your Internet connection.
> Pulling Riva Speech Server images.
  > Image nvcr.io/nvidia/riva/riva-speech:2.8.1 exists. Skipping.
  > Image nvcr.io/nvidia/riva/riva-speech:2.8.1-servicemaker exists. Skipping.

Downloading models (RMIRs) from NGC...
Note: this may take some time, depending on the speed of your Internet connection.
To skip this process and use existing RMIRs set the location and corresponding flag in config.sh.

=== Riva Speech Skills ===

NVIDIA Release  (build 49655095)
Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

https://developer.nvidia.com/tensorrt

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVI

In [7]:
# Start the Riva server (about 1 minute)
!cd $RIVA_QS && bash riva_start.sh config.sh

Starting Riva Speech Services. This may take several minutes depending on the number of models deployed.
bd5c6254cc79998ac192204cb84a4d34afdc315775882e75ee818968d8101426
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Riva server is ready...
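
If you script these steps outside the notebook, it can help to wait for the server port before sending requests. The helper below is a minimal sketch using only the standard library; `wait_for_port` is a hypothetical name, and an open port only shows that something is listening, not that every model has finished loading, so the "Riva server is ready..." message remains the authoritative signal. The demo runs against a throwaway local listener so it works without Riva.

```python
import socket
import time

def wait_for_port(host, port, timeout_s=60.0, interval_s=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Nothing is listening yet; wait and retry
            time.sleep(interval_s)
    return False

# Demo against a temporary local listener (stand-in for the Riva server).
listener = socket.socket()
listener.bind(("127.0.0.1", 0))       # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
port_is_open = wait_for_port("127.0.0.1", demo_port, timeout_s=5.0)
listener.close()
print(port_is_open)
```

For the server in this notebook, the call would look like `wait_for_port("localhost", 50051)`, matching the URI used by the client cells.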


## 8.1.4 OOTB Pipeline Demo
Now that we have the Riva server running with all the OOTB models, let's build an application that will:

1. Transcribe audio into text (ASR)
2. Find a name in the text (NER)
3. Determine what text to output in response (DM)
4. Output audio of the text response (TTS)

<img src="images/pipeline/full_pipeline_ootb.png">

We've already explored how ASR (step 1) and TTS (step 4) work, but now we add a Named Entity Recognition (NER) model and a simple Dialog Manager (DM) to the pipeline.  

NER is a Natural Language Processing (NLP) task.  NER, also referred to as entity chunking, identification, or extraction, is the task of detecting and classifying key information (entities) in text. In other words, an NER model takes a piece of text as input and for each word in the text, the model identifies a category the word belongs to. For example, in a sentence: "Mary lives in Santa Clara and works at NVIDIA", the NER model should detect that Mary is a person, Santa Clara is a location and NVIDIA is a company.

The DM is responsible for keeping track of the conversation state and determining responses.  For our purposes, we will use very simple slot-filling to create a response.

Begin by importing the Riva client and some other useful libraries.

In [8]:
import riva.client
import numpy as np
import IPython.display as ipd
import io
import time
import librosa

**Step 1: ASR**

Create an ASR function to transcribe audio to text. Then give it a try using a simple sentence.

In [9]:
# Define a Python function to transcribe text from audio
def asr_predict(SAMPLE):
    auth = riva.client.Auth(uri='localhost:50051')
    riva_asr = riva.client.ASRService(auth)
    # Set up an offline/batch recognition request
    config = riva.client.RecognitionConfig()
    config.language_code = "en-US"                    # Language code of the audio clip
    config.max_alternatives = 1                       # How many top-N hypotheses to return
    config.enable_automatic_punctuation = True        # Add punctuation when end of VAD detected
    config.audio_channel_count = 1                    # Mono audio
    # Read in an audio file from local disk.
    # This example uses a .wav file with LINEAR_PCM encoding.
    with io.open(SAMPLE, 'rb') as fh:
        content = fh.read()
    response = riva_asr.offline_recognize(content, config)
    # Keep the top hypothesis of the first result
    transcript = response.results[0].alternatives[0].transcript
    return transcript

In [10]:
SAMPLE="/dli_workspace/data/audio_sample_resampled2.wav"

with io.open(SAMPLE, 'rb') as fh:
    content = fh.read()
ipd.Audio(SAMPLE, autoplay=True)

In [11]:
transcript=asr_predict(SAMPLE)
print("ASR Transcript:", transcript)

ASR Transcript: Hi, my name is Dana and I Work for NVIDIA. 
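
Note that `asr_predict` indexes `results[0]` directly, which would raise an `IndexError` if a response came back with no results (for example, for a silent clip). A defensive variant might look like the sketch below. It assumes only that the response exposes `results` and `alternatives` as sequences, as used above, and it uses `SimpleNamespace` stand-ins rather than real Riva responses; `first_transcript` is a hypothetical helper name.

```python
from types import SimpleNamespace

def first_transcript(response):
    """Return the top transcript of the first non-empty result, or "" if none."""
    for result in response.results:
        if result.alternatives:
            return result.alternatives[0].transcript
    return ""

# Stand-ins that mimic the shape of a recognition response (not real Riva objects).
empty_response = SimpleNamespace(results=[])
spoken_response = SimpleNamespace(
    results=[SimpleNamespace(
        alternatives=[SimpleNamespace(transcript="Hi, my name is Dana.")]
    )]
)

print(repr(first_transcript(empty_response)))
print(repr(first_transcript(spoken_response)))
```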


**Step 2: NER**

Create an NER function to find special words, such as persons, locations, and organizations, within the transcribed text.  For the sentence "Hi, my name is Dana and I work for NVIDIA", the NER model should recognize "Dana" as a name and "NVIDIA" as an organization.

In [12]:
# Define a Python function to extract entities from text
def ner_predict(text):
    auth = riva.client.Auth(uri='localhost:50051')
    service = riva.client.NLPService(auth)
    tokens, slots, slot_confidences, starts, ends = riva.client.extract_most_probable_token_classification_predictions(
        service.classify_tokens(input_strings=text, model_name='riva_ner'))
    return tokens[0],slots[0]

In [13]:
tokens,slots=ner_predict(transcript)
print(tokens)
print(slots)

['dana', 'nvidia']
['PER', 'ORG']


**Step 3: DM**

For demonstration purposes, we'll build a very basic dialog manager that just looks for a person's name in the sentence and uses that name in the response text (if it exists).

In [14]:
# Dialog Manager
def dm_predict(slots, tokens):
    if "PER" in slots:
        index = slots.index("PER")
        response="Hi " + tokens[index] + ", how can I help you?"
    else:
        response="Hi, how can I help you?"
    return response

In [15]:
response = dm_predict(slots, tokens)
print(response)

Hi dana, how can I help you?
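
The NER tokens come back lowercased, so the response reads "Hi dana". If that matters for display, one option is to capitalize the name inside the dialog manager. The variant below is a small sketch, not part of the course code; `dm_predict_titlecase` is a hypothetical name, and `str.title()` is a simple heuristic that will not suit every name.

```python
def dm_predict_titlecase(slots, tokens):
    """Like dm_predict, but capitalizes the detected person name."""
    if "PER" in slots:
        name = tokens[slots.index("PER")].title()
        return "Hi " + name + ", how can I help you?"
    return "Hi, how can I help you?"

greeting = dm_predict_titlecase(['PER', 'ORG'], ['dana', 'nvidia'])
print(greeting)
```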


**Step 4: TTS**

Finally, we'll use the TTS model to output the response sentence as audio.

In [16]:
sample_rate_hz = 44100

# helper function for more readable output
def remove_braces(braced_text):
    return braced_text.replace("{@","").replace("}","")

# Define a Python function to create speech from text
def tts_predict(text):
    auth = riva.client.Auth(uri='localhost:50051')
    riva_tts = riva.client.SpeechSynthesisService(auth)
    req = { 
            "language_code"  : "en-US",
            "encoding"       : riva.client.AudioEncoding.LINEAR_PCM ,   # Currently only LINEAR_PCM is supported
            "sample_rate_hz" : sample_rate_hz,                          # Generate 44.1 kHz audio
            "voice_name"     : "English-US.Female-1"                    # The name of the voice to generate
    }
    req["text"] = text
    resp = riva_tts.synthesize(**req)
    audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
    return audio_samples, remove_braces(resp.meta.processed_text)

In [17]:
audio_samples, processed_text =tts_predict(response)
print(processed_text)
ipd.Audio(audio_samples, rate=sample_rate_hz, autoplay=True)

 ˈhaɪ ˈdeɪnə, ˈhaʊ CAN ˈaɪ ˈhɛɫp ˈju? 
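
To keep the synthesized speech for later use, the raw samples can be written to a WAV file. The sketch below uses only the standard library and assumes 16-bit mono LINEAR_PCM, as requested in `tts_predict`. `save_pcm16_wav` is a hypothetical helper, and a generated tone stands in for real TTS output so the example runs without the server.

```python
import math
import os
import struct
import tempfile
import wave

def save_pcm16_wav(path, pcm_bytes, sample_rate_hz=44100, channels=1):
    """Write raw 16-bit little-endian PCM bytes to a WAV file."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)               # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate_hz)
        wav_file.writeframes(pcm_bytes)

# Stand-in for synthesized audio: half a second of a 440 Hz tone.
rate = 44100
tone = b"".join(
    struct.pack("<h", int(12000 * math.sin(2 * math.pi * 440 * n / rate)))
    for n in range(rate // 2)
)

wav_path = os.path.join(tempfile.gettempdir(), "tts_demo.wav")
save_pcm16_wav(wav_path, tone, sample_rate_hz=rate)

# Read the file back to confirm how many frames were written.
with wave.open(wav_path, "rb") as check:
    frames_written = check.getnframes()
print(frames_written)
```

With the notebook's variables, the equivalent call would be `save_pcm16_wav("response.wav", audio_samples.tobytes(), sample_rate_hz)`.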


We have all the pieces, and have tested them individually.  Now let's run it as one pipeline.

In [18]:
# Put it all together

SAMPLE="/dli_workspace/data/audio_sample_resampled2.wav"
print("First Audio sample:")
ipd.display(ipd.Audio(SAMPLE, rate=sample_rate_hz, autoplay=True))

# get input audio duration
d=librosa.get_duration(filename=SAMPLE)
# call Riva ASR  
transcript = asr_predict(SAMPLE)
# call Riva NER
tokens,slots = ner_predict(transcript)
# call Dialog Manager
dm_response = dm_predict(slots, tokens)
# call Riva TTS
synth_audio, processed_text = tts_predict(dm_response)

time.sleep(d)
print("Virtual Assistant Response:")
ipd.display(ipd.Audio(synth_audio, rate=sample_rate_hz, autoplay=True))

First Audio sample:


Virtual Assistant Response:
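
The same chain can be wrapped in a single function that receives each stage as an argument, which makes it easy to swap in customized stages later. This is a sketch rather than the course's implementation: `run_pipeline` is a hypothetical name, and the stub stages below only imitate the results shown earlier so the example runs without a Riva server.

```python
def run_pipeline(sample, asr, ner, dm, tts):
    """Chain ASR -> NER -> DM -> TTS and return every intermediate result."""
    transcript = asr(sample)
    tokens, slots = ner(transcript)
    reply_text = dm(slots, tokens)
    audio, processed_text = tts(reply_text)
    return {
        "transcript": transcript,
        "tokens": tokens,
        "slots": slots,
        "reply_text": reply_text,
        "audio": audio,
        "processed_text": processed_text,
    }

# Stub stages (stand-ins for asr_predict, ner_predict, dm_predict, tts_predict).
def fake_asr(sample):
    return "Hi, my name is Dana and I work for NVIDIA."

def fake_ner(text):
    return ['dana', 'nvidia'], ['PER', 'ORG']

def fake_dm(slots, tokens):
    if "PER" in slots:
        return "Hi " + tokens[slots.index("PER")] + ", how can I help you?"
    return "Hi, how can I help you?"

def fake_tts(text):
    return b"", text   # empty audio, text echoed back

result = run_pipeline("sample.wav", fake_asr, fake_ner, fake_dm, fake_tts)
print(result["reply_text"])
```

In the notebook, the call would be `run_pipeline(SAMPLE, asr_predict, ner_predict, dm_predict, tts_predict)`.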


---
# 8.2 ASR Customization

Imagine an application with the same basic pipeline that receives restaurant orders.  It may be the case that the standard ASR model does not recognize a particular dish available on the restaurant menu, so recognition of that dish needs to be added to the dictionary.  For our example, let's say that "couscous" has been ordered.  

In [19]:
SAMPLE="/dli_workspace/data/couscous-left.wav"
ipd.display(ipd.Audio(SAMPLE, rate=sample_rate_hz, autoplay=True))

# get input audio duration
d=librosa.get_duration(filename=SAMPLE)
# call Riva ASR  
transcript=asr_predict(SAMPLE)
print(transcript)

Hello, I would like to order a Css. 


The spoken word, "couscous", is not correctly transcribed by the ASR.  Instead, it thinks the spelling is "Css".  Is that because it doesn't know the word, or because it just didn't recognize it?  We can check by searching for the correct spelling in the Conformer-CTC model lexicon.

In [20]:
import os
CONFORMER_OFFLINE = "conformer-en-US-asr-offline-ctc-decoder-cpu-streaming-offline"
LEXICON = os.path.join(RIVA_MODEL_REPO, "models", CONFORMER_OFFLINE, "1", "lexicon.txt")

! grep couscous $LEXICON

couscous	▁co us co us


The word "couscous" is already part of the Riva ASR vocabulary, but was not recognized as the likely transcription. Since this is a restaurant context, we want to improve the likelihood that "couscous" is transcribed, which we can do with word boosting. 

_Note: If we have a large list of words of interest like this, we could fine-tune the language model, which would make word boosting unnecessary._

## 8.2.1 Word Boosting 

Word boosting is the easiest way to customize Riva ASR. Boosting is requested on the client side, when querying for a transcription. The user specifies a list of words of interest that are likely to appear and assigns them new (higher) scores, which are then used when decoding the output of the acoustic model.

To boost the word "couscous", we can use the [`riva.client.add_word_boosting_to_config()`](https://github.com/nvidia-riva/python-clients/blob/928c63273176a939500e01ce176c463f1606a1ff/riva_api/asr.py#L78) function to add the list of words and their score to the recognition config. 
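
For intuition only, the effect of a boost can be pictured as adding a bonus to any hypothesis that contains a boosted word. The toy re-ranking below is not how the Riva decoder is implemented (boosting is applied during decoding, not to finished sentences), and the scores are invented for illustration.

```python
def rescore(hypotheses, boosted_words, boost):
    """Add `boost` to a hypothesis score for each boosted word it contains."""
    rescored = {}
    for text, score in hypotheses.items():
        hits = sum(1 for word in text.lower().split() if word in boosted_words)
        rescored[text] = score + boost * hits
    return rescored

def best(hypotheses):
    """Return the hypothesis text with the highest score."""
    return max(hypotheses, key=hypotheses.get)

# Made-up scores for two competing hypotheses (higher is better).
hypotheses = {
    "i would like to order a css": -3.0,
    "i would like to order a couscous": -5.5,
}

for boost in (-4.0, 0.0, 2.0, 4.0):
    winner = best(rescore(hypotheses, {"couscous"}, boost))
    print(boost, "->", winner)
```

With these invented numbers, a small boost is not enough to overtake the competing hypothesis, while a larger one is; a negative boost pushes the word further down.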


In [21]:
# predict with word boosting
def asr_predict_WB(SAMPLE, boost, score):
    auth = riva.client.Auth(uri='localhost:50051')
    riva_asr = riva.client.ASRService(auth)
    # Set up an offline/batch recognition request
    config = riva.client.RecognitionConfig()
    config.language_code = "en-US"                    # Language code of the audio clip
    config.max_alternatives = 1                       # How many top-N hypotheses to return
    config.enable_automatic_punctuation = True        # Add punctuation when end of VAD detected
    config.audio_channel_count = 1                    # Mono audio
    riva.client.add_word_boosting_to_config(config, [boost], score)    # ****** WORD BOOSTING ******
    # Read in an audio file from local disk.
    # This example uses a .wav file with LINEAR_PCM encoding.
    with io.open(SAMPLE, 'rb') as fh:
        content = fh.read()
    response = riva_asr.offline_recognize(content, config)
    # Keep the top hypothesis of the first result
    transcript = response.results[0].alternatives[0].transcript
    return transcript

In [22]:
SAMPLE="/dli_workspace/data/couscous-left.wav"
ipd.display(ipd.Audio(SAMPLE, rate=sample_rate_hz, autoplay=True))

# transcribe while boosting couscous with the score 4.0
transcript=asr_predict_WB(SAMPLE,"couscous", 4.0)
print(transcript)

Hello, I would like to order a couscous. 


## 8.2.2 Exercise: Negative Word Boost

It's also possible to boost with negative numbers to reduce the likelihood of a particular transcription.  Try a few boost numbers, both negative and positive, to see what value is required to get the correct transcription of "couscous".

In [23]:
# transcribe while boosting couscous with various `myscore` values such as:  -4.0, 1.0, 2.0, 3.0
myscore = -4.0

SAMPLE="/dli_workspace/data/couscous-left.wav"
transcript=asr_predict_WB(SAMPLE,"couscous", myscore)
print(transcript)

Hello, I would like to order a Css. 
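
Rather than editing `myscore` by hand, the scores can be swept in a loop. The sketch below takes the transcription function as an argument so that it runs here with a stand-in; `sweep_boost` and `fake_transcribe` are hypothetical names, and the threshold inside the stand-in is invented, so it says nothing about the value the real model needs.

```python
def sweep_boost(transcribe, sample, word, scores):
    """Call transcribe(sample, word, score) for each score and collect the results."""
    return {score: transcribe(sample, word, score) for score in scores}

def fake_transcribe(sample, word, score):
    # Stand-in for asr_predict_WB: the 3.0 threshold is arbitrary, for demonstration only.
    dish = word if score >= 3.0 else "Css"
    return f"Hello, I would like to order a {dish}."

results = sweep_boost(fake_transcribe, "couscous-left.wav", "couscous",
                      [-4.0, 1.0, 2.0, 3.0, 4.0])
for score, text in results.items():
    print(f"{score:>5}: {text}")
```

Against the running server, the call would be `sweep_boost(asr_predict_WB, SAMPLE, "couscous", [-4.0, 1.0, 2.0, 3.0])`.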


## 8.2.3 Lexicon Customization

The lexicon is a raw text file that maps each word in the vocabulary to its tokenized form (the word and its tokens are separated by a tab). Tokens must be part of the acoustic model's vocabulary.<br> For example:

``` 
as      ▁as
with    ▁with
not     ▁not
don't   ▁don ' t
```
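
A file in this format is straightforward to read programmatically. The sketch below parses lexicon text into a dictionary that maps each word to its list of token sequences; `parse_lexicon` is a hypothetical helper, and the sample text is a hand-written excerpt rather than the real file.

```python
from collections import defaultdict

def parse_lexicon(text):
    """Map each word to the list of token sequences that spell it."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        if "\t" not in line:
            continue                       # skip blank or malformed lines
        word, tokens = line.split("\t", 1)
        lexicon[word].append(tokens.split())
    return dict(lexicon)

# Hand-written excerpt: word, a tab, then space-separated tokens.
sample = "as\t▁as\ndon't\t▁don ' t\nbruschetta\t▁b ru s ch e t t a\n"
lexicon = parse_lexicon(sample)
print(lexicon["don't"])
print(lexicon["bruschetta"])
```

A word can appear on several lines, one per accepted token sequence, which is why each value is a list of sequences.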

Customizing the lexicon file creates explicit pronunciations of terms, in the form of tokenized sequences. 
Let's see this in action with the word `bruschetta` that could be pronounced `brusketa`. 

In [24]:
# load sample audio

SAMPLE="/dli_workspace/data/bruschetta_resampled.wav"
ipd.display(ipd.Audio(SAMPLE, rate=sample_rate_hz, autoplay=True))

# call Riva ASR  
transcript=asr_predict(SAMPLE)
print(transcript)

Hi, I would like to order a Bruce Keta. 


The transcription is "Bruce Keta", which is not accurate. Let's check if "bruschetta" is part of the vocabulary. 

In [25]:
! grep bruschetta $LEXICON

bruschetta	▁b ru s ch e t t a


The word "bruschetta" is in the lexicon, but it was not recognized by the Riva ASR pipeline. The only pronunciation configured is `▁b ru s ch e t t a`. 

Let's provide more sequences to recognize the word when pronouncing it "bruce keta".  

We'll need to stop the Riva server, customize the pronunciation, and then restart with the update.

In [26]:
# Stop the Riva server. 
!bash $RIVA_QS/riva_stop.sh

Shutting down docker containers...


Let's locate and load the tokenizer and test the sequence of tokens for the sentence "Hi, my name is Dana".

In [27]:
import glob
import sentencepiece as spm

# locate the _tokenizer.model file in the Riva models repo
mydir = os.path.join(RIVA_MODEL_REPO, "models", CONFORMER_OFFLINE, "1")
# glob on the full path instead of calling os.chdir(), so the notebook's
# working directory (and its relative paths) stays unchanged
model_files = sorted(glob.glob(os.path.join(mydir, "*.model")))
tokenizer = model_files[-1]

# Load the tokenizer
s = spm.SentencePieceProcessor(model_file=tokenizer)

# tokenize the sentence
s.encode("Hi, my name is Dana",out_type=str)

['▁', 'h', 'i', ',', '▁my', '▁', 'n', 'a', 'me', '▁is', '▁da', 'n', 'a']

To generate new lexicon entries for "bruschetta" using the `encode()` function, we need:
- PRONUNCIATION: What the word or phrase should sound like (our example: "bruce keta")
- TOKEN: The desired written form of the word (our example: "bruschetta")

Let's query for five variants of the sequence of tokens leading to this sound. 

In [28]:
TOKEN="bruschetta"
PRONUNCIATION="bruce keta"

for n in range(5):
    print(TOKEN + '\t' + ' '.join(s.encode(PRONUNCIATION, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))

bruschetta	▁b ru ce ▁k e t a
bruschetta	▁b r u c e ▁ k e t a
bruschetta	▁b ru c e ▁ k e t a
bruschetta	▁ b r u ce ▁k e t a
bruschetta	▁b r u c e ▁k e t a


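Add pronunciation options like these to the Riva offline lexicon file. Because sampling is random, the variants printed on your run may differ from the ones appended here.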

In [29]:
!echo -e "bruschetta\t▁b ru ce ▁k e t a" >> $LEXICON
!echo -e "bruschetta\t▁b ru c e ▁ k e t a" >> $LEXICON
!echo -e "bruschetta\t▁b r u c e ▁k e t a" >> $LEXICON
!echo -e "bruschetta\t▁ b ru c e ▁k e t a" >> $LEXICON
!echo -e "bruschetta\t▁ b ru ce ▁ k e t a" >> $LEXICON

In [30]:
# Check that Riva lexicon is updated
! grep bruschetta $LEXICON

bruschetta	▁b ru s ch e t t a
bruschetta	▁b ru ce ▁k e t a
bruschetta	▁b ru c e ▁ k e t a
bruschetta	▁b r u c e ▁k e t a
bruschetta	▁ b ru c e ▁k e t a
bruschetta	▁ b ru ce ▁ k e t a
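
Re-running the `echo ... >> $LEXICON` cell appends the same lines again. If you expect to re-run the notebook, an append that skips existing entries avoids duplicates. The following is a minimal sketch with a hypothetical helper, `add_lexicon_entries`, demonstrated on a temporary file rather than the real lexicon; it assumes the file ends with a newline.

```python
import os
import tempfile

def add_lexicon_entries(lexicon_path, word, token_sequences):
    """Append word<TAB>tokens lines not already in the file; return how many were added."""
    with open(lexicon_path, encoding="utf-8") as fh:
        existing = {line.rstrip("\n") for line in fh}
    new_lines = []
    for tokens in token_sequences:
        entry = f"{word}\t{tokens}"
        if entry not in existing:
            existing.add(entry)
            new_lines.append(entry)
    if new_lines:
        with open(lexicon_path, "a", encoding="utf-8") as fh:
            fh.write("\n".join(new_lines) + "\n")
    return len(new_lines)

# Demo on a throwaway file that mimics the lexicon format.
demo_path = os.path.join(tempfile.mkdtemp(), "lexicon.txt")
with open(demo_path, "w", encoding="utf-8") as fh:
    fh.write("bruschetta\t▁b ru s ch e t t a\n")

variants = ["▁b ru ce ▁k e t a", "▁b ru c e ▁ k e t a"]
first_run = add_lexicon_entries(demo_path, "bruschetta", variants)
second_run = add_lexicon_entries(demo_path, "bruschetta", variants)
print(first_run, second_run)
```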


In [31]:
# Start the Riva server (about 1 minute)
!cd $RIVA_QS && bash riva_start.sh config.sh

Starting Riva Speech Services. This may take several minutes depending on the number of models deployed.
0afebbdb16f20228c178632de2b5d734f0106624150cb52c2e7f43f6a25523cc
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Riva server is ready...


Let's query the customized Riva ASR service again with the updated lexicon.

In [32]:
SAMPLE = "/dli_workspace/data/bruschetta_resampled.wav"
ipd.display(ipd.Audio(SAMPLE, rate=sample_rate_hz, autoplay=True))

# call Riva ASR  
transcript = asr_predict(SAMPLE)
print("Riva transcription After Lexicon Mapping:\n", transcript)

Riva transcription After Lexicon Mapping:
 Hi. I would like to order a bruschetta. 


---
# 8.3 NER Customization
In our restaurant application, we will need to pick entities out of a conversation beyond names and organizations.  For example, we might like to identify cuisine or dishes in the conversation.  The [MIT Restaurant Corpus](https://groups.csail.mit.edu/sls/downloads/) dataset has labels identified with _IOB Tagging_, which is what we need for fine-tuning an NER model with NeMo.  

The actual training for the NER model is out of scope for this course, but the process is covered in detail in the DLI course, [Building Transformer-Based Natural Language Processing Applications](https://www.nvidia.com/en-us/training/instructor-led-workshops/natural-language-processing/).  For this class, we will use a custom NER restaurant model that was fine-tuned in NeMo using the Restaurant data.

## 8.3.1 IOB Tagging

The sentences and labels in the NER restaurant dataset map to each other with inside, outside, beginning (IOB) tagging. Anything separated by white space is a word, including punctuation. 

In [33]:
# let's take a look at the data 
print('Text:')
! head -n 8 /dli_workspace/data/restaurant/text_train.txt

print('\nLabels:')
! head -n 8 /dli_workspace/data/restaurant/labels_train.txt

Text:
2 start restaurants with inside dining 
34 
5 star resturants in my town 
98 hong kong restaurant reasonable prices 
a great lunch spot but open till 2 a m passims kitchen 
a place that serves soft serve ice cream 
a restaurant that is good for groups 
a salad would make my day 

Labels:
B-Rating I-Rating O O B-Amenity I-Amenity 
O 
B-Rating I-Rating O B-Location I-Location I-Location 
O B-Restaurant_Name I-Restaurant_Name O B-Price O 
O O O O O B-Hours I-Hours I-Hours I-Hours I-Hours B-Restaurant_Name I-Restaurant_Name 
O O O O B-Dish I-Dish I-Dish I-Dish 
O O O O B-Rating B-Amenity I-Amenity 
O B-Dish O O O O 


The first eight lines of the training dataset are mapped to eight lines of labels. For example, look at the 6th line, "a place that serves soft serve ice cream":

```text
a    place    that    serves    soft    serve    ice    cream
O    O        O       O         B-Dish  I-Dish   I-Dish I-Dish
```

The IOB tags indicate that "soft" is the beginning of a "Dish" entity, and that the next three words, "serve ice cream" are also part of that entity.  The full entity identified is therefore "soft serve ice cream" as a "Dish". None of the other words in the sentence are identified as entities.  
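
The grouping described above can be sketched in a few lines of Python. `iob_to_entities` is a hypothetical helper written for this illustration, not part of NeMo; it treats an `I-` tag that does not continue the current entity as outside any entity.

```python
def iob_to_entities(words, tags):
    """Group IOB-tagged words into (entity_type, phrase) pairs."""
    entities = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity begins; close the previous one, if any
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the current entity
            current_words.append(word)
        else:
            # "O" (or a stray I- tag) ends the current entity
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities

words = "a place that serves soft serve ice cream".split()
tags = "O O O O B-Dish I-Dish I-Dish I-Dish".split()
print(iob_to_entities(words, tags))
```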

The possible entity labels for this dataset are listed in [label_ids.csv](dli_workspace/data/restaurant/label_ids.csv).

## 8.3.2 Restaurant Context for NER
Let's try the basic OOTB NER model with a few restaurant queries.  In this example, we'll use NeMo rather than the Riva client.

In [34]:
import nemo
import nemo.collections.nlp as nemo_nlp
pretrained_ner_model = nemo_nlp.models.TokenClassificationModel.from_pretrained(model_name="ner_en_bert") 

    
[NeMo W 2023-07-27 07:36:58 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-07-27 07:37:02 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.


[NeMo I 2023-07-27 07:37:02 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/ner_en_bert/versions/1.10/files/ner_en_bert.nemo to /root/.cache/torch/NeMo/NeMo_1.14.0/ner_en_bert/8186f86c83b11d70b43b9ead695e7eda/ner_en_bert.nemo
[NeMo I 2023-07-27 07:37:23 common:912] Instantiating model from pre-trained checkpoint
[NeMo I 2023-07-27 07:37:26 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: bert-base-uncased, vocab_file: /tmp/tmpxlfrfseb/tokenizer.vocab_file, merges_files: None, special_tokens_dict: {}, and use_fast: False


Using eos_token, but it is not set yet.
Using bos_token, but it is not set yet.
[NeMo W 2023-07-27 07:37:26 modelPT:222] You tried to register an artifact under config key=tokenizer.vocab_file but an artifact for it has already been registered.
[NeMo W 2023-07-27 07:37:26 modelPT:142] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    text_file: text_train.txt
    labels_file: labels_train.txt
    shuffle: true
    num_samples: -1
    batch_size: 64
    
[NeMo W 2023-07-27 07:37:26 modelPT:149] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    text_file: text_dev.txt
    labels_file: labels_dev.txt
    shuffle: false
    num_samples: -1
    batch_size: 64
    

[NeMo I 2023-07-27 07:37:38 save_restore_connector:243] Model TokenClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.14.0/ner_en_bert/8186f86c83b11d70b43b9ead695e7eda/ner_en_bert.nemo.


In [35]:
# define a list of queries for inference
request_bruschetta = "I would like to order a bruschetta for 6pm"
request_pasta = "Can you recommend a good Italian pasta?"

queries = [request_bruschetta, request_pasta]
results = pretrained_ner_model.add_predictions(queries)

for query, result in zip(queries, results):
    print()
    print(f'Query : {query}')
    print(f'Result: {result.strip()}\n')

[NeMo I 2023-07-27 07:37:38 token_classification_dataset:123] Setting Max Seq length to: 15
[NeMo I 2023-07-27 07:37:38 data_preprocessing:404] Some stats of the lengths of the sequences:
[NeMo I 2023-07-27 07:37:38 data_preprocessing:406] Min: 10 |                  Max: 15 |                  Mean: 12.5 |                  Median: 12.5
[NeMo I 2023-07-27 07:37:38 data_preprocessing:412] 75 percentile: 13.75
[NeMo I 2023-07-27 07:37:38 data_preprocessing:413] 99 percentile: 14.95


[NeMo W 2023-07-27 07:37:38 token_classification_dataset:152] 0 are longer than 15


[NeMo I 2023-07-27 07:37:38 token_classification_dataset:155] *** Example ***
[NeMo I 2023-07-27 07:37:38 token_classification_dataset:156] i: 0
[NeMo I 2023-07-27 07:37:38 token_classification_dataset:157] subtokens: [CLS] i would like to order a br ##us ##chet ##ta for 6 ##pm [SEP]
[NeMo I 2023-07-27 07:37:38 token_classification_dataset:158] loss_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[NeMo I 2023-07-27 07:37:38 token_classification_dataset:159] input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[NeMo I 2023-07-27 07:37:38 token_classification_dataset:160] subtokens_mask: 0 1 1 1 1 1 1 1 0 0 0 1 1 0 0

Query : I would like to order a bruschetta for 6pm
Result: I would like to order a bruschetta[B-ORG] for 6pm[B-TIME]


Query : Can you recommend a good Italian pasta?
Result: Can you recommend a good Italian[B-GPE] pasta?



If our NER thinks "bruschetta" is an organization, it will be difficult to design a DM that detects what dish a customer wants to order! 
Let's try this again with a customized model based on the Restaurant dataset.

In [36]:
my_ner_model = nemo_nlp.models.TokenClassificationModel.restore_from(restore_path="/dli_workspace/riva-full-model-repo/ner_restaurant.nemo") 

[NeMo I 2023-07-27 07:37:44 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: bert-base-uncased, vocab_file: /tmp/tmpsbki0fy7/f17adb2e1b9646cca9cc4f0a08c1748a_vocab.txt, merges_files: None, special_tokens_dict: {}, and use_fast: False


Using eos_token, but it is not set yet.
Using bos_token, but it is not set yet.
[NeMo W 2023-07-27 07:37:44 modelPT:222] You tried to register an artifact under config key=tokenizer.vocab_file but an artifact for it has already been registered.
[NeMo W 2023-07-27 07:37:44 modelPT:142] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    text_file: text_train.txt
    labels_file: labels_train.txt
    shuffle: true
    num_samples: 1000
    batch_size: 64
    
[NeMo W 2023-07-27 07:37:44 modelPT:149] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    text_file: text_dev.txt
    labels_file: labels_dev.txt
    shuffle: false
    num_samples: 1000
    batch_size: 64


[NeMo I 2023-07-27 07:37:52 save_restore_connector:243] Model TokenClassificationModel was successfully restored from /dli_workspace/riva-full-model-repo/ner_restaurant.nemo.


In [37]:
results = my_ner_model.add_predictions(queries)
results

[NeMo I 2023-07-27 07:37:52 token_classification_dataset:123] Setting Max Seq length to: 15
[NeMo I 2023-07-27 07:37:52 data_preprocessing:404] Some stats of the lengths of the sequences:
[NeMo I 2023-07-27 07:37:52 data_preprocessing:406] Min: 10 |                  Max: 15 |                  Mean: 12.5 |                  Median: 12.5
[NeMo I 2023-07-27 07:37:52 data_preprocessing:412] 75 percentile: 13.75
[NeMo I 2023-07-27 07:37:52 data_preprocessing:413] 99 percentile: 14.95


[NeMo W 2023-07-27 07:37:52 token_classification_dataset:152] 0 are longer than 15


[NeMo I 2023-07-27 07:37:52 token_classification_dataset:155] *** Example ***
[NeMo I 2023-07-27 07:37:52 token_classification_dataset:156] i: 0
[NeMo I 2023-07-27 07:37:52 token_classification_dataset:157] subtokens: [CLS] i would like to order a br ##us ##chet ##ta for 6 ##pm [SEP]
[NeMo I 2023-07-27 07:37:52 token_classification_dataset:158] loss_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[NeMo I 2023-07-27 07:37:52 token_classification_dataset:159] input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[NeMo I 2023-07-27 07:37:52 token_classification_dataset:160] subtokens_mask: 0 1 1 1 1 1 1 1 0 0 0 1 1 0 0


['I would like to order a bruschetta[B-Dish] for 6pm[B-Hours]',
 'Can you recommend a good[B-Rating] Italian[B-Cuisine] pasta[B-Dish]?']

In [38]:
# Create a new definition of ner_predict using the custom model
def ner_predict(text):
    results = my_ner_model.add_predictions([text])
    return results[0]
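
The strings returned by `add_predictions()` mark each recognized word with an IOB tag in square brackets, as in the output above. Downstream code often wants those tags as structured data rather than as an annotated sentence. The following is a minimal sketch of one way to pull them out with a regular expression. The helper name `extract_entities` is hypothetical, and the tag format is inferred only from the two example outputs shown in this notebook, so treat it as an illustration rather than part of the NeMo API.

```python
import re

# Matches a word immediately followed by an IOB tag in square brackets,
# for example "bruschetta[B-Dish]". The format is assumed from the outputs above.
TAGGED_TOKEN = re.compile(r"(\S+?)\[([BI])-([^\]]+)\]")

def extract_entities(tagged_text):
    """Return a list of (word, label) pairs found in a tagged prediction string."""
    return [(m.group(1), m.group(3)) for m in TAGGED_TOKEN.finditer(tagged_text)]

sample = "I would like to order a bruschetta[B-Dish] for 6pm[B-Hours]"
print(extract_entities(sample))
# [('bruschetta', 'Dish'), ('6pm', 'Hours')]
```

This sketch drops the `B`/`I` prefix and does not merge multi-word entities, which is enough for the single-word dishes used in this lab.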

## 8.3.2 Restaurant DM

In [39]:
# Menu options offered for each dish the DM knows about
Options = {
    "bruschetta": ["Tomato and Olive Oil", "Mozarella and Basil", "Cherry Tomato and Garlic"],
    "pasta": ["Spaghetti Marinara", "Fettuccine Alfredo", "Linguini and Clams"],
}

# Create a response with options for specific dishes
def dm_predict_restaurant(res):
    if "B-Dish" in res:
        # The tagged word sits right before "[B-Dish]": cut at the "[" and take the last word
        index = res.index("B-Dish")
        dish = res[:index-1].split().pop()
        if dish in Options:
            list_options = Options[dish]
            response = "What " + dish + " option would you like? We have : "
            for option in list_options:
                response += option + ". "
        else:
            response = dish + ". What else would you like?"
    else:
        response = "Hi, how can I help you?"
    return response
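
To see how the DM finds the dish name, it can help to trace the string slicing on one of the tagged sentences from the NER output. The sketch below repeats the same three steps used in `dm_predict_restaurant` on a hard-coded example, so it runs without the NER model. Note that `str.index` returns only the first match, so a request that names two dishes would be answered for the first one only.

```python
# Tagged sentence copied from the NER output earlier in this notebook
res = "I would like to order a bruschetta[B-Dish] for 6pm[B-Hours]"

# Position of the tag name, which starts one character after the "["
index = res.index("B-Dish")

# Everything before the "[" that opens the tag
prefix = res[:index - 1]
print(prefix)

# The last whitespace-separated word of the prefix is the tagged dish
dish = prefix.split().pop()
print(dish)
```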

In [40]:
dm_response_bruschetta = dm_predict_restaurant(ner_predict(request_bruschetta))
print("\n{}".format(request_bruschetta))
print(dm_response_bruschetta)

[NeMo I 2023-07-27 07:37:52 token_classification_dataset:123] Setting Max Seq length to: 15
[NeMo I 2023-07-27 07:37:52 data_preprocessing:404] Some stats of the lengths of the sequences:
[NeMo I 2023-07-27 07:37:52 data_preprocessing:406] Min: 15 |                  Max: 15 |                  Mean: 15.0 |                  Median: 15.0
[NeMo I 2023-07-27 07:37:52 data_preprocessing:412] 75 percentile: 15.00
[NeMo I 2023-07-27 07:37:52 data_preprocessing:413] 99 percentile: 15.00


[NeMo W 2023-07-27 07:37:52 token_classification_dataset:152] 0 are longer than 15


[NeMo I 2023-07-27 07:37:52 token_classification_dataset:155] *** Example ***
[NeMo I 2023-07-27 07:37:52 token_classification_dataset:156] i: 0
[NeMo I 2023-07-27 07:37:52 token_classification_dataset:157] subtokens: [CLS] i would like to order a br ##us ##chet ##ta for 6 ##pm [SEP]
[NeMo I 2023-07-27 07:37:52 token_classification_dataset:158] loss_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[NeMo I 2023-07-27 07:37:52 token_classification_dataset:159] input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[NeMo I 2023-07-27 07:37:52 token_classification_dataset:160] subtokens_mask: 0 1 1 1 1 1 1 1 0 0 0 1 1 0 0

I would like to order a bruschetta for 6pm
What bruschetta option would you like? We have : Tomato and Olive Oil. Mozarella and Basil. Cherry Tomato and Garlic. 


In [41]:
dm_response_pasta = dm_predict_restaurant(ner_predict(request_pasta))
print("\n{}".format(request_pasta))
print(dm_response_pasta)

[NeMo I 2023-07-27 07:37:52 token_classification_dataset:123] Setting Max Seq length to: 10
[NeMo I 2023-07-27 07:37:52 data_preprocessing:404] Some stats of the lengths of the sequences:
[NeMo I 2023-07-27 07:37:52 data_preprocessing:406] Min: 10 |                  Max: 10 |                  Mean: 10.0 |                  Median: 10.0
[NeMo I 2023-07-27 07:37:52 data_preprocessing:412] 75 percentile: 10.00
[NeMo I 2023-07-27 07:37:52 data_preprocessing:413] 99 percentile: 10.00


[NeMo W 2023-07-27 07:37:52 token_classification_dataset:152] 0 are longer than 10


[NeMo I 2023-07-27 07:37:52 token_classification_dataset:155] *** Example ***
[NeMo I 2023-07-27 07:37:52 token_classification_dataset:156] i: 0
[NeMo I 2023-07-27 07:37:52 token_classification_dataset:157] subtokens: [CLS] can you recommend a good italian pasta ? [SEP]
[NeMo I 2023-07-27 07:37:52 token_classification_dataset:158] loss_mask: 1 1 1 1 1 1 1 1 1 1
[NeMo I 2023-07-27 07:37:52 token_classification_dataset:159] input_mask: 1 1 1 1 1 1 1 1 1 1
[NeMo I 2023-07-27 07:37:52 token_classification_dataset:160] subtokens_mask: 0 1 1 1 1 1 1 1 0 0

Can you recommend a good Italian pasta?
What pasta option would you like? We have : Spaghetti Marinara. Fettuccine Alfredo. Linguini and Clams. 


---
# 8.4 TTS Customization
The correct pronunciation of "bruschetta" may be up for debate depending on where in the world you are. For our application, we need to settle on one consistent pronunciation, so assume we want the "bruce keta" style of pronunciation we focused on earlier. Without any customization, the TTS model does not pronounce it that way.

In [42]:
# Generate speech with the OOTB TTS model
dm_response = dm_response_bruschetta
synth_audio, processed_text = tts_predict(dm_response)
ipd.display(ipd.Audio(synth_audio, rate=sample_rate_hz, autoplay=True))

We can try different pronunciations with the NeMo aligner model as follows:

In [43]:
from nemo.collections.tts.models import AlignerModel
aligner = AlignerModel.from_pretrained("tts_en_radtts_aligner_ipa")

[NeMo W 2023-07-27 07:37:55 experimental:27] Module <class 'nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.IPATokenizer'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-07-27 07:37:55 experimental:27] Module <class 'nemo.collections.tts.models.radtts.RadTTSModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.


[NeMo I 2023-07-27 07:37:55 cloud:56] Found existing object /root/.cache/torch/NeMo/NeMo_1.14.0/Aligner/0cfa131db81f64e49f9c47f286991019/Aligner.nemo.
[NeMo I 2023-07-27 07:37:55 cloud:62] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.14.0/Aligner/0cfa131db81f64e49f9c47f286991019/Aligner.nemo
[NeMo I 2023-07-27 07:37:55 common:912] Instantiating model from pre-trained checkpoint
[NeMo I 2023-07-27 07:37:58 tokenize_and_classify:87] Creating ClassifyFst grammars.


[NeMo W 2023-07-27 07:38:22 experimental:27] Module <class 'nemo_text_processing.g2p.modules.IPAG2P'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-07-27 07:38:22 modules:344] apply_to_oov_word=None, This means that some of words will remain unchanged if they are not handled by any of the rules in self.parse_one_word(). This may be intended if phonemes and chars are both valid inputs, otherwise, you may see unexpected deletions in your input.
[NeMo W 2023-07-27 07:38:22 modelPT:142] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.torch.data.TTSDataset
      manifest_filepath: /raid/LJSpeech/nvidia_ljspeech_train.json
      sample_rate: 22050
      sup_data_path: /raid/LJSpeech/aligner_train_supp/
      sup_data_types:
      - align_prior_ma

[NeMo I 2023-07-27 07:38:22 features:267] PADDING: 1
[NeMo I 2023-07-27 07:38:23 save_restore_connector:243] Model AlignerModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.14.0/Aligner/0cfa131db81f64e49f9c47f286991019/Aligner.nemo.


In [44]:
input_string = "broo sketah"
text_g2p = aligner.tokenizer.g2p(input_string)
print(text_g2p)
text_tokens = aligner.tokenizer(input_string)
print(text_tokens)
print("\n" + ''.join(text_g2p))
synth_audio, processed_text = tts_predict(''.join(text_g2p))
ipd.display(ipd.Audio(synth_audio, rate=sample_rate_hz, autoplay=True))

['B', 'R', 'O', 'O', ' ', 'S', 'K', 'E', 'T', 'A', 'H']
[93, 23, 39, 36, 36, 93, 40, 32, 26, 41, 22, 29, 93]

BROO SKETAH


In [45]:
input_string = "brous KETA"
text_g2p = aligner.tokenizer.g2p(input_string)
print(text_g2p)
text_tokens = aligner.tokenizer(input_string)
print(text_tokens)
print("\n" + ''.join(text_g2p))
synth_audio, processed_text = tts_predict(''.join(text_g2p))
ipd.display(ipd.Audio(synth_audio, rate=sample_rate_hz, autoplay=True))

['B', 'R', 'O', 'U', 'S', ' ', 'ˈ', 'k', 'ɛ', 't', 'ə']
[93, 23, 39, 36, 42, 40, 93, 32, 26, 41, 22, 93]

BROUS ˈkɛtə
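
The two outputs above suggest that the G2P step converts some words to IPA phonemes and passes others through as uppercase letters, which matches the earlier warning that words not handled by any rule remain unchanged. The following sketch labels each word in a G2P result accordingly. The helper `label_words` is hypothetical, and the rule that uppercase ASCII means "passed through as graphemes" is an assumption drawn only from these two outputs.

```python
# G2P output copied from the cell above
g2p_output = ['B', 'R', 'O', 'U', 'S', ' ', 'ˈ', 'k', 'ɛ', 't', 'ə']

def label_words(symbols):
    """Join G2P symbols into words and label each as graphemes or phonemes.

    Assumption: a word made only of uppercase ASCII letters was passed
    through unchanged, and anything else was converted to IPA.
    """
    words = "".join(symbols).split()
    return [
        (word, "graphemes" if word.isascii() and word.isupper() else "phonemes")
        for word in words
    ]

print(label_words(g2p_output))
# [('BROUS', 'graphemes'), ('ˈkɛtə', 'phonemes')]
```

A check like this makes it easier to see which part of a trial spelling the model is pronouncing letter by letter and which part it is reading from a phoneme entry.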


After trying a few spellings, pick the one that sounds right and substitute it into the string output from the DM with a simple `replace()` before passing that string to the TTS model.

In [46]:
# Generate speech with the substituted pronunciation
custom_dm_response = dm_response.replace("bruschetta", "BROO SKETAH")

synth_audio, processed_text = tts_predict(custom_dm_response)
ipd.display(ipd.Audio(synth_audio, rate=sample_rate_hz, autoplay=True))
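
A plain `replace()` is case-sensitive and matches substrings, so it would miss "Bruschetta" at the start of a sentence. If the DM grows beyond this example, one option is a small lookup table applied with word-boundary, case-insensitive matching. The sketch below is only an illustration under that assumption. The names `PRONUNCIATIONS` and `apply_pronunciations` are hypothetical and are not part of Riva or NeMo.

```python
import re

# Hypothetical table: written word -> respelling to send to the TTS model
PRONUNCIATIONS = {"bruschetta": "BROO SKETAH"}

def apply_pronunciations(text, pronunciations=PRONUNCIATIONS):
    """Replace whole words, ignoring case, with their TTS respellings."""
    for word, respelling in pronunciations.items():
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", flags=re.IGNORECASE)
        # A function replacement avoids any backslash processing in the respelling
        text = pattern.sub(lambda _match, r=respelling: r, text)
    return text

print(apply_pronunciations("What bruschetta option would you like?"))
# What BROO SKETAH option would you like?
```

The result of `apply_pronunciations(dm_response)` could then be passed to `tts_predict()` in place of the string built with `replace()`.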

---
# 8.5 Customized Restaurant Pipeline Demo

We've customized several parts of our pipeline:
1. ASR - added detection of additional pronunciations with the lexicon
2. NER - incorporated a fine-tuned NeMo model for restaurant context
3. DM - added refinements for our simple example
4. TTS - added pronunciation substitution

<img src="images/pipeline/restaurant_pipeline.png">



## 8.5.1 Exercise: Run a Custom Pipeline

In the following cell, complete the "TODO" sections to run the full pipeline with customizations:
1. ASR customized lexicon for "bruschetta" (this should already be in place)
2. NER customized NeMo model
3. DM with the restaurant options
4. TTS with substituted pronunciation for "bruschetta"

If you get stuck, you can check the [solution](solutions/ex8.5.1.ipynb).

In [47]:
# Restaurant Pipeline
# Put it all together

SAMPLE="/dli_workspace/data/bruschetta_resampled.wav"
print("First Audio sample:")
ipd.display(ipd.Audio(SAMPLE, rate=sample_rate_hz, autoplay=True))

# get input audio duration
d=librosa.get_duration(filename=SAMPLE)

# TODO call Riva ASR  
# TODO call NeMo NER
# TODO call Dialog Manager
# TODO call Riva TTS

time.sleep(d)
print("Virtual Assistant Response:")
ipd.display(ipd.Audio(synth_audio, rate=sample_rate_hz, autoplay=True))

First Audio sample:


Virtual Assistant Response:


## 8.5.2 Shut Down Riva

In [48]:
# Stop the Riva server 
!bash $RIVA_QS/riva_stop.sh

Shutting down docker containers...


---
# 8.6 Shut Down the Kernel
<h3 style="color:red;">Important!</h3>

From the menu above, choose ***Kernel->Shut Down Kernel*** to fully clear GPU memory before moving on.

---
<h2 style="color:green;">Congratulations!</h2>

In this notebook, you have:
- Built a full conversational AI pipeline
- Customized ASR, NLP, and TTS in a full pipeline
- Built a full pipeline customized for a restaurant context

This concludes the TTS portion of the course.<br>
Next, you'll work with deployment of Riva at scale using Kubernetes, starting with [Enabling GPU within Kubernetes](009_K8s_Enable.ipynb).

<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>