In [None]:
# Copyright 2019 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# Torch-TensorRT Getting Started - CitriNet

## Overview

PyTorch is one of the most approachable tools for developing and experimenting with machine learning models. Its power comes from its deep integration with Python, its flexibility, and its approach to automatic differentiation and execution (eager execution). When a model moves from research into production, however, the requirements change: the deep Python integration may no longer be desirable, and the model should be optimized for the best possible performance on the deployment platform.

PyTorch 1.0 introduced TorchScript as a way to separate a PyTorch model from Python and make it portable and optimizable. TorchScript uses PyTorch's JIT compiler to transform ordinary PyTorch code, which is normally executed by the Python interpreter, into an intermediate representation (IR). Optimizations can be run on this IR, and at runtime it is executed by the PyTorch JIT interpreter. This opens up new possibilities for PyTorch, including deployment from other languages such as C++. It also introduces a structured, graph-based format that can be used for kernel-level optimization of models for inference.

When deploying on NVIDIA GPUs, TensorRT, NVIDIA's deep learning optimization SDK and runtime, can take models from any major framework and tune them for specific target hardware in the NVIDIA family, be it an A100, TITAN V, Jetson Xavier, or NVIDIA's Deep Learning Accelerator. TensorRT achieves this with two sets of optimizations: it fuses layers and tensors in the model graph, and it then uses a large kernel library to select the implementations that perform best on the target GPU. TensorRT also has strong support for reduced-precision execution, which lets users leverage the Tensor Cores on Volta and newer GPUs and reduces the memory and computation footprint on the device.
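To make the memory side of reduced precision concrete, the following sketch compares the storage needed for one input batch in FP32 and FP16. It uses NumPy rather than TensorRT, and the shape `(8, 80, 1488)` is borrowed from the benchmark inputs used later in this notebook. It illustrates only storage size and rounding, not how TensorRT selects or runs its kernels.

```python
import numpy as np

# Shape borrowed from the benchmark inputs later in this notebook (batch of 8).
shape = (8, 80, 1488)

fp32_batch = np.zeros(shape, dtype=np.float32)
fp16_batch = fp32_batch.astype(np.float16)

fp32_bytes = fp32_batch.nbytes  # 4 bytes per element
fp16_bytes = fp16_batch.nbytes  # 2 bytes per element
print(f"FP32: {fp32_bytes} bytes, FP16: {fp16_bytes} bytes")

# Half precision keeps fewer mantissa bits, so values are rounded more coarsely.
rounding_error = abs(float(np.float16(0.1)) - 0.1)
print(f"Rounding error of 0.1 in FP16: {rounding_error:.2e}")
```

Halving the element size halves the bytes that must be moved for the same tensor, at the cost of a coarser representation of each value.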

Torch-TensorRT is a compiler that uses TensorRT to optimize TorchScript code, compiling standard TorchScript modules into ones that internally run with TensorRT optimizations. This lets you stay in the PyTorch ecosystem and keep using features such as module composability, the flexible tensor implementation, data loaders, and more. Torch-TensorRT can be used with both PyTorch and LibTorch.

### Learning objectives

This notebook demonstrates the steps for compiling a TorchScript module with Torch-TensorRT on a pretrained CitriNet network, and running it to test the speedup obtained.

## Content
1. [Requirements](#1)
1. [CitriNet Overview](#2)
1. [Creating TorchScript modules](#3)
1. [Compiling with Torch-TensorRT](#4)
1. [Conclusion](#5)

<a id="1"></a>
## 1. Requirements

Follow the steps in `notebooks/README` to prepare a Docker container, within which you can run this notebook. 

1. Start the Docker container:

In [None]:
# Before starting this notebook, make sure you're running this docker container
!docker run --gpus all -it --rm -v $PWD:/benchmark --net=host nvcr.io/nvidia/pytorch:21.12-py3

2. Now that you are inside the container, install the required dependencies.

In [None]:
# Install dependencies
!pip install wget
!apt-get update && DEBIAN_FRONTEND=noninteractive  apt-get install -y libsndfile1 ffmpeg
!pip install Cython

## Install NeMo
!pip install "nemo_toolkit[all]==1.5.1"

<a id="2"></a>
## 2. CitriNet Overview

CitriNet models are end-to-end neural automatic speech recognition (ASR) models that transcribe segments of audio to text.



### Model Description

CitriNet is a version of [QuartzNet](https://arxiv.org/pdf/1910.10261.pdf) that extends [ContextNet](https://arxiv.org/pdf/2005.03191.pdf). It uses subword encoding (via WordPiece tokenization) and the Squeeze-and-Excitation (SE) mechanism, and its models are therefore smaller than QuartzNet models.

CitriNet models take in audio segments and transcribe them to letter, byte pair, or word piece sequences. The pretrained models here can be used immediately for fine-tuning or dataset evaluation.



<img src="https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/_images/jasper_vertical.png" alt="alt" width="50%"/>




Download the NeMo CitriNet model and export it to TorchScript:

In [2]:
import nemo
import torch

import nemo.collections.asr as nemo_asr
from nemo.core import typecheck
typecheck.set_typecheck_enabled(False) 

    


In [3]:
precisions_str = 'fp32'
variant = 'stt_en_citrinet_256'
batch_sizes = [1, 2, 8]

precisions = []
if 'fp32' in precisions_str:
    precisions.append(torch.float32)
if 'fp16' in precisions_str:
    precisions.append(torch.half)

print(f"Downloading and saving {variant}...")
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name=variant)
asr_model.export(f"{variant}.ts")

Downloading and saving stt_en_citrinet_256...
[NeMo I 2022-02-25 12:34:21 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_citrinet_256/versions/1.0.0rc1/files/stt_en_citrinet_256.nemo to /root/.cache/torch/NeMo/NeMo_1.5.1/stt_en_citrinet_256/91a9cc5850784b2065e8a0aa3d526fd9/stt_en_citrinet_256.nemo
100% [.................................................................] 38872168 / 38872168[NeMo I 2022-02-25 12:34:31 common:728] Instantiating model from pre-trained checkpoint
[NeMo I 2022-02-25 12:34:33 mixins:146] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2022-02-25 12:34:34 modelPT:130] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 32
    trim_silence: true
    max_duration: 16.7
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    use_start_end_token: false
    
[NeMo W 2022-02-25 12:34:34 modelPT:137] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 32
    shuffle: false
    use_start_end_token: false
    
[NeMo W 2022-02-25 12:34:34 modelPT:143] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a va

[NeMo I 2022-02-25 12:34:34 features:265] PADDING: 16
[NeMo I 2022-02-25 12:34:34 features:282] STFT using torch
[NeMo I 2022-02-25 12:34:39 save_restore_connector:149] Model EncDecCTCModelBPE was successfully restored from /root/.cache/torch/NeMo/NeMo_1.5.1/stt_en_citrinet_256/91a9cc5850784b2065e8a0aa3d526fd9/stt_en_citrinet_256.nemo.


[NeMo W 2022-02-25 12:34:39 export_utils:198] Swapped 0 modules
[NeMo W 2022-02-25 12:34:39 conv_asr:73] Turned off 235 masked convolutions
[NeMo W 2022-02-25 12:34:39 export_utils:198] Swapped 0 modules
    
    


(['stt_en_citrinet_256.ts'],
 ['nemo.collections.asr.models.ctc_bpe_models.EncDecCTCModelBPE exported to ONNX'])

### Benchmark utility

Let us define a helper function to benchmark a model.

In [4]:
from __future__ import print_function
from __future__ import absolute_import
from __future__ import division

import argparse
import timeit
import numpy as np
import torch
import torch_tensorrt as trtorch
import torch.backends.cudnn as cudnn

def benchmark(model, input_tensor, num_loops, model_name, batch_size):
    def timeGraph(model, input_tensor, num_loops):
        print("Warm up ...")
        with torch.no_grad():
            for _ in range(20):
                features = model(input_tensor)

        torch.cuda.synchronize()
        print("Start timing ...")
        timings = []
        with torch.no_grad():
            for i in range(num_loops):
                start_time = timeit.default_timer()
                features = model(input_tensor)
                torch.cuda.synchronize()
                end_time = timeit.default_timer()
                timings.append(end_time - start_time)
                # print("Iteration {}: {:.6f} s".format(i, end_time - start_time))
        return timings
    def printStats(graphName, timings, batch_size):
        times = np.array(timings)
        steps = len(times)
        speeds = batch_size / times
        time_mean = np.mean(times)
        time_med = np.median(times)
        time_99th = np.percentile(times, 99)
        time_std = np.std(times, ddof=0)
        speed_mean = np.mean(speeds)
        speed_med = np.median(speeds)
        msg = ("\n%s =================================\n"
                "batch size=%d, num iterations=%d\n"
                "  Median samples/s: %.1f, mean: %.1f\n"
                "  Median latency (s): %.6f, mean: %.6f, 99th_p: %.6f, std_dev: %.6f\n"
                ) % (graphName,
                    batch_size, steps,
                    speed_med, speed_mean,
                    time_med, time_mean, time_99th, time_std)
        print(msg)
    timings = timeGraph(model, input_tensor, num_loops)
    printStats(model_name, timings, batch_size)

precisions_str = 'fp32' # Precision (default=fp32, fp16)
variant = 'stt_en_citrinet_256' # Nemo Citrinet variant
batch_sizes = [1, 8, 32, 128] # Batch sizes (default=1,8,32,128)
trt = False # If True, infer with Torch-TensorRT engine. Else, infer with Pytorch model.
precision = torch.float32 if precisions_str =='fp32' else torch.float16

for batch_size in batch_sizes:
    if trt:
        model_name = f"{variant}_bs{batch_size}_{precision}.torch-tensorrt"
    else:
        model_name = f"{variant}.ts"

    print(f"Loading model: {model_name}") 
    # Load the TorchScript module and move it to the GPU
    model = torch.jit.load(model_name).cuda()
    cudnn.benchmark = True
    # Create random input tensor of certain size
    torch.manual_seed(12345)
    input_shape=(batch_size, 80, 1488)
    input_tensor = torch.randn(input_shape).cuda()

    # Timing graph inference
    benchmark(model, input_tensor, 50, model_name, batch_size)
    
#     timings = timeGraph(model, input_tensor, 50)
#     printStats(model_name, timings, batch_size)

Loading model: stt_en_citrinet_256.ts
Warm up ...
Start timing ...

batch size=1, num iterations=50
  Median samples/s: 44.7, mean: 44.6
  Median latency (s): 0.022363, mean: 0.022444, 99th_p: 0.024401, std_dev: 0.000538

Loading model: stt_en_citrinet_256.ts
Warm up ...
Start timing ...

batch size=8, num iterations=50
  Median samples/s: 334.4, mean: 332.8
  Median latency (s): 0.023922, mean: 0.024059, 99th_p: 0.027440, std_dev: 0.000718

Loading model: stt_en_citrinet_256.ts
Warm up ...
Start timing ...

batch size=32, num iterations=50
  Median samples/s: 643.9, mean: 643.8
  Median latency (s): 0.049701, mean: 0.049703, 99th_p: 0.049983, std_dev: 0.000107

Loading model: stt_en_citrinet_256.ts
Warm up ...
Start timing ...

batch size=128, num iterations=50
  Median samples/s: 708.3, mean: 708.2
  Median latency (s): 0.180723, mean: 0.180746, 99th_p: 0.181572, std_dev: 0.000220
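The throughput and latency figures printed by the helper are two views of the same measurement: samples per second is the batch size divided by the time of one forward pass. As a quick sanity check, the sketch below recomputes throughput from the median latencies copied from the output above. The function name `throughput` is chosen here for illustration and is not part of the notebook's helper.

```python
# Median latencies (seconds) copied from the TorchScript baseline output above.
median_latency_s = {1: 0.022363, 8: 0.023922, 32: 0.049701, 128: 0.180723}


def throughput(batch_size: int, latency_s: float) -> float:
    """Samples processed per second for one batch at the given latency."""
    return batch_size / latency_s


samples_per_s = {bs: throughput(bs, lat) for bs, lat in median_latency_s.items()}
for bs, sps in samples_per_s.items():
    print(f"batch size {bs:>3}: {sps:7.1f} samples/s")
```

The results agree with the reported median samples/s to within rounding. Latency barely changes between batch sizes 1 and 8, which is why throughput grows almost linearly there before levelling off at larger batches.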



Confirming the GPU we are using here:

In [5]:
!nvidia-smi

Fri Feb 25 12:35:38 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100S-PCI...  On   | 00000000:04:00.0 Off |                    0 |
| N/A   39C    P0    46W / 250W |   2645MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100S-PCI...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   28C    P0    27W / 250W |      4MiB / 32510MiB |      0%      Default |
|       

<a id="3"></a>
## 3. Creating TorchScript modules

To compile with Torch-TensorRT, the model must first be in **TorchScript**. TorchScript is a programming language included in PyTorch that removes the Python dependency that normal PyTorch models have. The conversion is done by a JIT compiler which, given a PyTorch Module, generates an equivalent TorchScript Module. There are two paths for generating TorchScript: **Tracing** and **Scripting**.

- Tracing follows the execution of a PyTorch module on example inputs and generates TorchScript ops corresponding to what it observes.
- Scripting analyzes the Python code and generates TorchScript from it, which allows the resulting graph to include control flow, something tracing cannot capture.
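The difference between the two paths is easiest to see on a function with a data-dependent branch. The following is a minimal sketch using a toy function, `shift`, invented here for illustration; it is unrelated to CitriNet and runs on the CPU.

```python
import warnings

import torch


def shift(x: torch.Tensor) -> torch.Tensor:
    # Data-dependent branch: which path runs depends on the input values.
    if x.sum() > 0:
        return x + 10
    return x - 10


positive = torch.ones(3)
negative = -torch.ones(3)

# Tracing records only the ops executed for the example input (the "+ 10" path).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the TracerWarning about the branch
    traced = torch.jit.trace(shift, positive)

# Scripting compiles the Python source, so the if/else is kept in the graph.
scripted = torch.jit.script(shift)

print("traced  :", traced(negative))    # follows the recorded "+ 10" path
print("scripted:", scripted(negative))  # takes the "- 10" branch
```

For the negative input, the traced function still adds 10 because the branch was frozen at trace time, while the scripted function evaluates the condition and subtracts 10.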

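This notebook does not trace or script the model by hand. Instead, the cell below loads the TorchScript file that NeMo exported in the previous section (`stt_en_citrinet_256.ts`) and compiles it with Torch-TensorRT for each combination of batch size and precision.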

In [6]:
import torch
import torch.nn as nn
import torch_tensorrt as trtorch
import argparse

# trtorch.logging.set_reportable_log_level(trtorch.logging.Level.Info)

import nemo.collections.asr as nemo_asr
from nemo.core import typecheck
typecheck.set_typecheck_enabled(False) 


arg_precisions = "fp32,fp16"
arg_batch_sizes = "1,8,32,128"
arg_variant = "stt_en_citrinet_256"


precisions_str = arg_precisions.split(',')
precisions = []
if 'fp32' in precisions_str:
    precisions.append(torch.float32)
if 'fp16' in precisions_str:
    precisions.append(torch.half)

batch_sizes = [int(x) for x in arg_batch_sizes.split(',')]

# print(f"Downloading and saving {arg_variant}...")
# asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name=arg_variant)
# asr_model.export(f"{arg_variant}.ts")


model = torch.jit.load(f"{arg_variant}.ts")


for precision in precisions:
    for batch_size in batch_sizes:
        compile_settings = {
            "inputs": [trtorch.Input(shape=[batch_size, 80, 1488])],
            "enabled_precisions": {precision},
            "workspace_size": 2000000000,
            "truncate_long_and_double": True,
        }
        print(f"Generating Torchscript-TensorRT module for batchsize {batch_size} precision {precision}")
        trt_ts_module = trtorch.compile(model, **compile_settings)
        torch.jit.save(trt_ts_module, f"{arg_variant}_bs{batch_size}_{precision}.torch-tensorrt")

Generating Torchscript-TensorRT module for batchsize 1 precision torch.float32
Generating Torchscript-TensorRT module for batchsize 8 precision torch.float32
Generating Torchscript-TensorRT module for batchsize 32 precision torch.float32
Generating Torchscript-TensorRT module for batchsize 128 precision torch.float32
Generating Torchscript-TensorRT module for batchsize 1 precision torch.float16
Generating Torchscript-TensorRT module for batchsize 8 precision torch.float16
Generating Torchscript-TensorRT module for batchsize 32 precision torch.float16
Generating Torchscript-TensorRT module for batchsize 128 precision torch.float16
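The compile cell builds its precision list with `torch.half`, while the benchmark cells later construct file names with `torch.float16`. Both names refer to the same dtype, and its string form is what ends up in the saved file name. The short check below illustrates this with the same f-string pattern used in the notebook; the variable names `saved_name` and `loaded_name` are introduced here for illustration.

```python
import torch

variant = "stt_en_citrinet_256"
batch_size = 1

# torch.half is an alias of torch.float16, so both format to the same string.
assert torch.half is torch.float16

saved_name = f"{variant}_bs{batch_size}_{torch.half}.torch-tensorrt"
loaded_name = f"{variant}_bs{batch_size}_{torch.float16}.torch-tensorrt"
print(saved_name)
```

Because the names match, the modules saved by the compile cell can be found by the benchmark cells without any extra mapping between precision strings and dtypes.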


<a id="4"></a>
## 4. Compiling with Torch-TensorRT

TorchScript modules behave just like normal PyTorch modules, and the two are interoperable. From TorchScript we can compile a TensorRT-based module. This module is still implemented in TorchScript, but all the computation is done in TensorRT.

The cells below load the Torch-TensorRT modules that were compiled and saved in the previous section and benchmark them with the same helper used for the TorchScript baseline.

Note that we show benchmarking results of two precisions: FP32 (single precision) and FP16 (half precision).

### FP32 (single precision)

In [7]:
# nemo_asr.models.ASRModel.list_available_models()
precisions_str = 'fp32' # Precision (default=fp32, fp16)
variant = 'stt_en_citrinet_256' # Nemo Citrinet variant
batch_sizes = [1, 8, 32, 128] # Batch sizes (default=1,8,32,128)
trt = True # If True, infer with Torch-TensorRT engine. Else, infer with Pytorch model.
precision = torch.float32 if precisions_str =='fp32' else torch.float16

for batch_size in batch_sizes:
    if trt:
        model_name = f"{variant}_bs{batch_size}_{precision}.torch-tensorrt"
    else:
        model_name = f"{variant}.ts"

    print(f"Loading model: {model_name}") 
    # Load the compiled module and move it to the GPU
    model = torch.jit.load(model_name).cuda()
    cudnn.benchmark = True
    # Create random input tensor of certain size
    torch.manual_seed(12345)
    input_shape=(batch_size, 80, 1488)
    input_tensor = torch.randn(input_shape).cuda()

    # Timing graph inference
    benchmark(model, input_tensor, 50, model_name, batch_size)

Loading model: stt_en_citrinet_256_bs1_torch.float32.torch-tensorrt
Warm up ...
Start timing ...

batch size=1, num iterations=50
  Median samples/s: 92.1, mean: 89.8
  Median latency (s): 0.010855, mean: 0.011154, 99th_p: 0.012702, std_dev: 0.000535

Loading model: stt_en_citrinet_256_bs8_torch.float32.torch-tensorrt
Warm up ...
Start timing ...

batch size=8, num iterations=50
  Median samples/s: 696.3, mean: 675.2
  Median latency (s): 0.011489, mean: 0.011920, 99th_p: 0.014809, std_dev: 0.000986

Loading model: stt_en_citrinet_256_bs32_torch.float32.torch-tensorrt
Warm up ...
Start timing ...

batch size=32, num iterations=50
  Median samples/s: 1064.6, mean: 1064.6
  Median latency (s): 0.030057, mean: 0.030057, 99th_p: 0.030247, std_dev: 0.000073

Loading model: stt_en_citrinet_256_bs128_torch.float32.torch-tensorrt
Warm up ...
Start timing ...

batch size=128, num iterations=50
  Median samples/s: 1244.4, mean: 1240.0
  Median latency (s): 0.102863, mean: 0.103233, 99th_p: 0.105

### FP16 (half precision)

In [8]:
precisions_str = 'fp16' # Precision (default=fp32, fp16)
variant = 'stt_en_citrinet_256' # Nemo Citrinet variant
batch_sizes = [1, 8, 32, 128] # Batch sizes (default=1,8,32,128)
trt = True # If True, infer with Torch-TensorRT engine. Else, infer with Pytorch model.
precision = torch.float32 if precisions_str =='fp32' else torch.float16

for batch_size in batch_sizes:
    if trt:
        model_name = f"{variant}_bs{batch_size}_{precision}.torch-tensorrt"
    else:
        model_name = f"{variant}.ts"

    print(f"Loading model: {model_name}") 
    # Load the compiled module and move it to the GPU
    model = torch.jit.load(model_name).cuda()
    cudnn.benchmark = True
    # Create random input tensor of certain size
    torch.manual_seed(12345)
    input_shape=(batch_size, 80, 1488)
    input_tensor = torch.randn(input_shape).cuda()

    # Timing graph inference
    benchmark(model, input_tensor, 50, model_name, batch_size)

Loading model: stt_en_citrinet_256_bs1_torch.float16.torch-tensorrt
Warm up ...
Start timing ...

batch size=1, num iterations=50
  Median samples/s: 145.9, mean: 140.5
  Median latency (s): 0.006855, mean: 0.007173, 99th_p: 0.009178, std_dev: 0.000673

Loading model: stt_en_citrinet_256_bs8_torch.float16.torch-tensorrt
Warm up ...
Start timing ...

batch size=8, num iterations=50
  Median samples/s: 999.9, mean: 959.6
  Median latency (s): 0.008001, mean: 0.008412, 99th_p: 0.011683, std_dev: 0.000893

Loading model: stt_en_citrinet_256_bs32_torch.float16.torch-tensorrt
Warm up ...
Start timing ...

batch size=32, num iterations=50
  Median samples/s: 1817.0, mean: 1767.1
  Median latency (s): 0.017612, mean: 0.018151, 99th_p: 0.020383, std_dev: 0.000907

Loading model: stt_en_citrinet_256_bs128_torch.float16.torch-tensorrt
Warm up ...
Start timing ...

batch size=128, num iterations=50
  Median samples/s: 2249.1, mean: 2237.0
  Median latency (s): 0.056912, mean: 0.057233, 99th_p: 0.0

<a id="5"></a>
## 5. Conclusion

In this notebook, we walked through the complete process of compiling a TorchScript CitriNet model with Torch-TensorRT and measured the performance impact of the optimization. Based on the median throughput reported above on the Tesla V100S GPU, Torch-TensorRT gives a speedup of roughly **1.7X to 2.1X** with FP32 and **2.8X to 3.3X** with FP16 over the TorchScript baseline, depending on batch size.
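The per-batch-size speedups can be recomputed from the median throughput figures printed by the benchmark cells above. The sketch below copies those numbers by hand; the helper name `speedups` is introduced here for illustration.

```python
# Median samples/s copied from the benchmark outputs in this notebook.
batch_sizes = [1, 8, 32, 128]
baseline = [44.7, 334.4, 643.9, 708.3]     # TorchScript, FP32
trt_fp32 = [92.1, 696.3, 1064.6, 1244.4]   # Torch-TensorRT, FP32
trt_fp16 = [145.9, 999.9, 1817.0, 2249.1]  # Torch-TensorRT, FP16


def speedups(optimized, reference):
    """Element-wise throughput ratio, rounded to two decimals."""
    return [round(o / r, 2) for o, r in zip(optimized, reference)]


fp32_speedup = speedups(trt_fp32, baseline)
fp16_speedup = speedups(trt_fp16, baseline)

for bs, s32, s16 in zip(batch_sizes, fp32_speedup, fp16_speedup):
    print(f"batch size {bs:>3}: FP32 {s32:.2f}x, FP16 {s16:.2f}x")
```

These ratios apply only to the hardware, software versions, and random inputs used in this run; results on other GPUs or with real audio features may differ.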

### What's next
Now it's time to try Torch-TensorRT on your own model. If you run into problems, file an issue at https://github.com/NVIDIA/Torch-TensorRT. Your involvement will help the future development of Torch-TensorRT.
