<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# Deploying a Text Classification Model in Riva

[Train Adapt Optimize (TAO) Toolkit](https://developer.nvidia.com/tao-toolkit) provides the capability to export your model in a format that can be deployed using [NVIDIA Riva](https://developer.nvidia.com/riva), a highly performant application framework for multi-modal conversational AI services using GPUs.

This tutorial takes a .riva model, the result of the `tao text_classification export` command, and uses the Riva ServiceMaker framework to aggregate all the artifacts needed to deploy it to a target environment. Once the model is deployed in Riva, you can issue inference requests to the server. We will demonstrate how quick and straightforward this whole process is.

## Learning Objectives
In this notebook, you will learn how to:  
- Use Riva ServiceMaker to take a TAO exported .riva and convert it to .rmir
- Deploy the model(s) locally on the Riva Server
- Send inference requests from a demo client using Riva API bindings

## Pre-requisites
To follow along, please make sure:
- You have access to NVIDIA NGC, and are able to download the Riva Quickstart [resources](https://ngc.nvidia.com/catalog/resources/nvidia:riva:riva_quickstart/)
- You have a .riva model file that you wish to deploy. You can obtain this from `tao <task> export` (with `export_format=RIVA`). Please refer to the tutorial on *Text Classification using Train Adapt Optimize (TAO) Toolkit* for more details on training and exporting a .riva model.

## Riva ServiceMaker
ServiceMaker is a set of tools that aggregates all the necessary artifacts (models, files, configurations, and user settings) for Riva deployment to a target environment. It has two main components, described below:

### 1. Riva-build

This step builds a Riva-ready version of the model. Its only output is an intermediate format (called an RMIR) of an end-to-end pipeline for the supported services within Riva. Here we consider a BERT text classification model.

`riva-build` combines one or more exported models (.riva files) into a single file in an intermediate format called Riva Model Intermediate Representation (.rmir). This file contains a deployment-agnostic specification of the whole end-to-end pipeline, along with all the assets required for the final deployment and inference. Please check out the [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/service-nlp.html#pipeline-configuration) to find out more.
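
To make the `file:key` argument convention concrete, the sketch below assembles the `riva-build` invocation as an argument list. It mirrors the `docker run` command used later in this notebook; the helper name `riva_build_command` is hypothetical, and the function only builds the command without running it.

```python
import shlex


def riva_build_command(container, model_dir, riva_file, rmir_file, key,
                       task="text_classification"):
    """Return the docker invocation for riva-build as a list of arguments.

    Hypothetical helper: it only assembles the arguments and runs nothing.
    """
    return [
        "docker", "run", "--rm", "--gpus", "1",
        "-v", f"{model_dir}:/data",        # mount the model directory at /data
        container, "--",
        "riva-build", task, "-f",
        f"/data/{rmir_file}:{key}",        # output .rmir and its encryption key
        f"/data/{riva_file}:{key}",        # input .riva and its encryption key
    ]


cmd = riva_build_command(
    container="nvcr.io/nvidia/riva/riva-speech:1.7.0-beta-servicemaker",
    model_dir="/dli/task/domainclassification_english_bert_vdeployable_v1.0",
    riva_file="domain_classification_bert.riva",
    rmir_file="tc-model.rmir",
    key="tlt_encode",
)
print(shlex.join(cmd))
```

Note that the output file comes first and the input file second, each suffixed with `:<encryption_key>`.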

In [1]:
# Set some path and file names we need 

# ServiceMaker Docker
RIVA_SM_CONTAINER = "nvcr.io/nvidia/riva/riva-speech:1.7.0-beta-servicemaker"

# Directory where the .riva model is stored $MODEL_LOC/*.riva
MODEL_LOC = "/dli/task/domainclassification_english_bert_vdeployable_v1.0"

# Name of the .riva file
MODEL_NAME = "domain_classification_bert.riva"

# Key that model is encrypted with, while exporting with TAO
KEY = 'tlt_encode'

In [2]:
# Get the ServiceMaker docker
! docker pull $RIVA_SM_CONTAINER

1.7.0-beta-servicemaker: Pulling from nvidia/riva/riva-speech

In [3]:
# Syntax: riva-build <task-name> output-dir-for-rmir/model.rmir:key dir-for-riva/model.riva:key
# riva-build text_classification \
#            --domain_name="<your custom domain name>" \
#            /servicemaker-dev/<rmir_filename>:<encryption_key> \
#            /servicemaker-dev/<riva_filename>:<encryption_key>

! docker run --rm --gpus 1 -v $MODEL_LOC:/data $RIVA_SM_CONTAINER -- \
            riva-build text_classification -f /data/tc-model.rmir:$KEY /data/$MODEL_NAME:$KEY


=== Riva Speech Skills ===

NVIDIA Release 21.10 (build 29079244)

Copyright (c) 2018-2021, NVIDIA CORPORATION.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.

NOTE: Legacy NVIDIA Driver detected.  Compatibility mode ENABLED.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for the inference server.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...

2023-03-31 14:28:57,592 [INFO] Packing binaries for self/PyTorch : {'class_labels_file': ('nemo.collections.nlp.models.text_classification.text_classification_model.TextClassificationModel', 'intent_labels.csv')}
2023-03-31 14:28:57,592 [INFO] Copying class_labels_file:intent_labels.csv -> self:self-intent_labels.csv
2023-03-31 14:28:57,592 [INFO] Packing bi

### 2. Riva-deploy

The deployment tool takes as input one or more Riva Model Intermediate Representation (RMIR) files and a target model repository directory. It creates an ensemble configuration specifying the pipeline for the execution and finally writes all those assets to the output model repository directory.
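
After `riva-deploy` has run, the generated repository can be inspected with a few lines of Python. This is a minimal sketch, assuming the layout `<repository>/<model name>/<version>/` that the log paths in this notebook suggest; `list_repository_models` is a hypothetical helper, and the demonstration uses a throwaway directory with a made-up model name rather than real output.

```python
import tempfile
from pathlib import Path


def list_repository_models(repo_dir):
    """Map each model directory name to its sorted numeric version subdirectories."""
    models = {}
    for model_dir in sorted(p for p in Path(repo_dir).iterdir() if p.is_dir()):
        versions = sorted(
            int(v.name) for v in model_dir.iterdir()
            if v.is_dir() and v.name.isdigit()
        )
        models[model_dir.name] = versions
    return models


# Demonstration on a temporary directory that imitates the assumed layout.
# "example-model" is a placeholder, not a name produced by riva-deploy.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example-model" / "1").mkdir(parents=True)
    demo_models = list_repository_models(tmp)
    print(demo_models)
```

In this notebook the repository would be at `$MODEL_LOC/models` on the host, since `$MODEL_LOC` is mounted as `/data` in the container.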

In [4]:
# Syntax: riva-deploy -f dir-for-rmir/model.rmir:key output-dir-for-repository
! docker run --rm --gpus 1 -v $MODEL_LOC:/data $RIVA_SM_CONTAINER -- \
            riva-deploy -f /data/tc-model.rmir:$KEY /data/models


=== Riva Speech Skills ===

NVIDIA Release 21.10 (build 29079244)

Copyright (c) 2018-2021, NVIDIA CORPORATION.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.

NOTE: Legacy NVIDIA Driver detected.  Compatibility mode ENABLED.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for the inference server.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...

2023-03-31 14:29:28,170 [INFO] Writing Riva model repository to '/data/models'...
2023-03-31 14:29:28,170 [INFO] The riva model repo target directory is /data/models
2023-03-31 14:29:32,062 [INFO] Extract_binaries for language_model -> /data/models/riva-trt-riva_text_classification_default-nn-bert-base-uncased/1
2023-03-31 14:29:32,062 [INFO] extracting {'ck

## Start Riva Server
Once the model repository is generated, we are ready to start the Riva server. The remaining steps require the Riva QuickStart resource from NGC.
Set the path to its directory here:

In [5]:
# Set the Riva QuickStart directory
RIVA_DIR = "/dli/task/riva_quickstart_v1.7.0-beta"

Next, we modify `config.sh` to enable the relevant Riva services (NLP for the text classification model), provide the encryption key, and set the path to the model repository (`riva_model_loc`) generated in the previous step, among other configurations.

For instance, if the model repository was generated at `$MODEL_LOC/models`, set `riva_model_loc` to the same directory as `MODEL_LOC`.

Pretrained versions of the models listed in `models_asr`, `models_nlp`, and `models_tts` are fetched from NGC. Since we are using our own custom model, we can comment out the corresponding entries in `models_nlp` (and any others that are not relevant to your use case).

#### config.sh snippet
```
# Enable or Disable Riva Services 
service_enabled_asr=false                                 ## MAKE CHANGES HERE
service_enabled_nlp=true                                  ## MAKE CHANGES HERE
service_enabled_tts=false                                 ## MAKE CHANGES HERE

# Specify one or more GPUs to use
# specifying more than one GPU is currently an experimental feature, and may result in undefined behaviours.
gpus_to_use="device=0"

# Specify the encryption key to use to deploy models
MODEL_DEPLOY_KEY="tlt_encode"                             ## Set the model encryption key

# Locations to use for storing models artifacts
...
riva_model_loc="<add path>"                              ## Replace with MODEL_LOC

# The default RMIRs are downloaded from NGC by default in the above $riva_rmir_loc directory
# If you'd like to skip the download from NGC and use the existing RMIRs in the $riva_rmir_loc
# then set the below $use_existing_rmirs flag to true.
...
use_existing_rmirs=true                                  ## Set to True
```
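
If you prefer to apply these changes programmatically rather than by hand, the following is a minimal sketch of one way to rewrite `name=value` assignments in the text of a shell config file. The helper `set_shell_vars` is hypothetical, and `sample_config` is a simplified stand-in whose starting values are made up; the real `config.sh` contains more settings. The sketch replaces everything after `=` on a matching line, including any trailing comment, and always double-quotes the new value.

```python
import re


def set_shell_vars(config_text, updates):
    """Rewrite `name=...` assignments that start a line; fail if a name is missing."""
    for name, value in updates.items():
        pattern = re.compile(rf"^{re.escape(name)}=.*$", flags=re.MULTILINE)
        replacement = f'{name}="{value}"'
        # A function is used as the replacement so that backslashes in the
        # value are inserted literally instead of being read as escapes.
        config_text, count = pattern.subn(lambda _m: replacement, config_text)
        if count == 0:
            raise KeyError(f"{name} not found in config")
    return config_text


# Simplified stand-in for config.sh (starting values are illustrative only).
sample_config = """service_enabled_asr=true
service_enabled_nlp=true
service_enabled_tts=true
MODEL_DEPLOY_KEY="old_key"
riva_model_loc="riva-model-repo"
use_existing_rmirs=false
"""

updated_config = set_shell_vars(sample_config, {
    "service_enabled_asr": "false",
    "service_enabled_tts": "false",
    "MODEL_DEPLOY_KEY": "tlt_encode",
    "riva_model_loc": "/dli/task/domainclassification_english_bert_vdeployable_v1.0",
    "use_existing_rmirs": "true",
})
print(updated_config)
```

Raising on a missing name is a deliberate choice here: silently skipping a misspelled setting would leave the server starting with the wrong configuration.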

In [6]:
# Execute this cell to copy the solution config.sh into the quickstart directory
! cp solutions/config.sh $RIVA_DIR

In [7]:
# Ensure you have permission to execute these scripts.
! cd $RIVA_DIR && chmod +x ./riva_init.sh && chmod +x ./riva_start.sh

In [8]:
# Run Riva Init. This will fetch the containers/models
# You can skip this step if you already ran riva-deploy above
! cd $RIVA_DIR && ./riva_init.sh config.sh

Logging into NGC docker registry if necessary...
Pulling required docker images if necessary...
Note: This may take some time, depending on the speed of your Internet connection.
> Pulling Riva Speech Server images.
  > Pulling nvcr.io/nvidia/riva/riva-speech:1.7.0-beta-server. This may take some time...
  > Pulling nvcr.io/nvidia/riva/riva-speech-client:1.7.0-beta. This may take some time...
  > Image nvcr.io/nvidia/riva/riva-speech:1.7.0-beta-servicemaker exists. Skipping.

Converting RMIRs at /dli/task/domainclassification_english_bert_vdeployable_v1.0/rmir to Riva Model repository.
+ docker run --init -it --rm --gpus '"device=0"' -v /dli/task/domainclassification_english_bert_vdeployable_v1.0:/data -e MODEL_DEPLOY_KEY=tlt_encode --name riva-service-maker nvcr.io/nvidia/riva/riva-speech:1.7.0-beta-servicemaker deploy_all_models /data/rmir /data/models

=== Riva Speech Skills ===

NVIDIA Release 21.10 (build 29079244)

Copyright (c) 2018-2021, NVIDIA CORPORATION.  All rights reserved

In [9]:
# Run Riva Start. This will deploy the model(s).
! cd $RIVA_DIR && bash riva_start.sh config.sh

Starting Riva Speech Services. This may take several minutes depending on the number of models deployed.
Waiting for Riva server to load all models...retrying in 10 seconds
Riva server is ready...


## Run Inference
Once the Riva server is up and running with your models, you can send inference requests to the server.
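
Before sending requests from a script, it can help to confirm that the server port is accepting connections. The sketch below is one simple way to do that with the standard library; `wait_for_port` is a hypothetical helper, and the port `50051` is taken from the client call used in this notebook. An open port only shows that something is listening, not that every model has finished loading, so treat it as a coarse check.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False


# Short timeout so the check returns quickly when no server is running.
ready = wait_for_port("localhost", 50051, timeout_s=1.0, interval_s=0.5)
print("Riva port reachable:", ready)
```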

To send gRPC requests, install the Riva Python API bindings for the client. They are available as a pip `.whl` with the QuickStart.

The following code sample shows how you can perform inference using Riva Python API gRPC bindings:

In [10]:
import grpc
import riva_api.riva_nlp_pb2 as rnlp
import riva_api.riva_nlp_pb2_grpc as rnlp_srv


class BertTextClassifyClient(object):
    def __init__(self, grpc_server, model_name):
        # report which deployed model the requests will target
        print("Using model: {}".format(model_name))

        self.model_name = model_name
        # open an unencrypted gRPC channel to the Riva server
        self.channel = grpc.insecure_channel(grpc_server)
        self.riva_nlp = rnlp_srv.RivaLanguageUnderstandingStub(self.channel)

        self.has_bos_eos = False

    # extract the top-1 (class name, score) pair from each classification result
    def postprocess_labels_server(self, ct_response):
        results = []

        for result in ct_response.results:
            top_label = result.labels[0]
            results.append((top_label.class_name, top_label.score))

        return results

    # accept a list of strings, return a list of tuples ('intent', scores)
    def run(self, input_strings):
        if isinstance(input_strings, str):
            # user probably passed a single string instead of a list/iterable
            input_strings = [input_strings]

        # get intent of the query
        request = rnlp.TextClassRequest()
        request.model.model_name = self.model_name
        for q in input_strings:
            request.text.append(q)
        ct_response = self.riva_nlp.ClassifyText(request)

        return self.postprocess_labels_server(ct_response)


def run_text_classify(server, model, query):
    print("Client app to test text classification on Riva")
    client = BertTextClassifyClient(server, model_name=model)
    result = client.run(query)
    print(result)

In [11]:
# The model name depends on the dataset and the domain on which the model was trained.
# Check `docker logs <container name>` and replace the name below accordingly (the logs
# include a table of models with their status displayed next to them). See the
# documentation for more information.

run_text_classify(server="localhost:50051",
                model="riva_text_classification_default",
                query="How is the weather tomorrow?")

Client app to test text classification on Riva
Using model: riva_text_classification_default
[('misty.weather', 0.9970700144767761)]
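
The client returns a list of `(label, score)` tuples, so downstream code can apply its own acceptance rule. As a minimal sketch, the hypothetical helper below replaces low-confidence predictions with a fallback label; the threshold of `0.5` and the label `"unknown"` are arbitrary choices for illustration, not Riva defaults.

```python
def accept_predictions(results, min_score=0.5, fallback="unknown"):
    """Keep labels whose score reaches min_score; use the fallback label otherwise."""
    return [label if score >= min_score else fallback for label, score in results]


# The tuple below is the result printed by the inference cell in this notebook.
results = [("misty.weather", 0.9970700144767761)]
accepted = accept_predictions(results)
print(accepted)
```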


`NOTE`: You could also run the above inference code from inside the Riva Client container. The QuickStart provides a script `riva_start_client.sh` to run the container. It has more examples for different services.

You can stop all Docker containers before shutting down the Jupyter kernel. Caution: the following command stops every container on the host, not only those started by Riva.

In [12]:
! docker stop $(docker ps -a -q)

d13cfb4d6ccd


## What's next?
You can train your own custom models in TAO and deploy them in Riva. To scale up your deployment, you can use Kubernetes with the Riva AI Services Helm Chart, which pulls the relevant images, downloads model artifacts from NGC, generates the model repository, and starts and exposes the Riva speech services.
