# NLP Transformers Inference Optimization

![](https://mk0spaceflightnoa02a.kinstacdn.com/wp-content/uploads/2019/06/65025135_2531780803519285_6381814664434548736_o.jpg)

Hello everyone!

In this notebook, we'll compare the inference performance of our model on CPU and GPU after several optimizations. These optimizations can be applied to many NLP [transformers](https://github.com/huggingface/transformers), including BERT, DistilBERT, RoBERTa, ALBERT, GPT-2, and DistilGPT2.

We'll look at three things that can be done after training to improve inference speed:
* [TorchScript](https://pytorch.org/docs/stable/jit.html)
* [Dynamic Quantization](https://pytorch.org/docs/stable/quantization.html)
* [ONNX](https://pytorch.org/docs/stable/onnx.html) and [ONNX Runtime](https://github.com/microsoft/onnxruntime)

As an example, we'll use a DistilBERT model trained on the Amazon reviews dataset. The training part is in this [notebook](https://www.kaggle.com/alexalex02/sentiment-analysis-distilbert-amazon-reviews). The model reached 96.22% accuracy, and its size was 255 MB.

### Measurements and Environments

We'll use a batch size of 1, which is typical for online inference. Maximum sequence lengths: [64, 128, 256, 512].

Timing: `%timeit -r 30 -n 3`, to get stable results.
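
Outside a notebook, the same measurement can be approximated with the standard `timeit` module. The sketch below is only an illustration: `time_inference` and `dummy_forward` are hypothetical names, and the dummy workload stands in for a real model call.

```python
import statistics
import timeit


def time_inference(fn, repeat=30, number=3):
    """Time `fn` roughly like `%timeit -r 30 -n 3`.

    Runs `repeat` rounds of `number` calls each and returns the
    mean and standard deviation of the per-call time in milliseconds.
    """
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    per_call_ms = [total / number * 1000.0 for total in totals]
    return statistics.mean(per_call_ms), statistics.stdev(per_call_ms)


def dummy_forward():
    # Hypothetical stand-in for something like `model(*model_input)`.
    return sum(i * i for i in range(10_000))


mean_ms, std_ms = time_inference(dummy_forward)
print(f"{mean_ms:.3f} ms +/- {std_ms:.3f} ms per call")
```

When timing a model on GPU this way, the timed function should also wait for the device to finish (for example with `torch.cuda.synchronize()`), because CUDA kernels are launched asynchronously.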

Kaggle Kernel Setup

CPU: Intel(R) Xeon(R) CPU @ 2.30GHz 4 CPU(s)

GPU: Tesla P100 16GB, Intel(R) Xeon(R) CPU @ 2.00GHz 2 CPU(s)


In [None]:
import os
os.environ['WANDB_SILENT'] = 'True'

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import torch
from transformers import AutoTokenizer
from scipy.special import softmax

# TorchScript

TorchScript is a way to create serializable and optimizable models from PyTorch code. The resulting models can run independently of the Python environment, for example from a C++ program.

To trace our model, we must first define the model input.

*Note: For GPU inference we must change device to 'cuda'.*

In [None]:
MODEL_NAME = 'distilbert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
def sentence_input(sentence: str, max_len: int = 512, device = 'cpu'):
    encoded = tokenizer.encode_plus(sentence, add_special_tokens=True, 
                                    pad_to_max_length=True, max_length=max_len, 
                                    return_tensors="pt",).to(device)
    model_input = (encoded['input_ids'], encoded['attention_mask'])
    return model_input

In [None]:
test_sentence = "Super Cute: First of all, I LOVE this product. When I bought it my husband jokingly said that it looked cute and small in the picture, but was really HUGE in real life. Don't tell him I said so, but he was right. It is huge and the cord is really long. Although I wish it was smaller, I still love it. It works really well when we travel and need to plug a lot of things in and although the length is annoying, it's very useful."
model_input = sentence_input(test_sentence)

In [None]:
print(test_sentence)
print(model_input)

### Converting model - CPU and GPU

In [None]:
import torch.nn as nn
import torch

In [None]:
class DistilBert(nn.Module):
    """
    Simplified version of the same class by HuggingFace.
    See transformers/modeling_distilbert.py in the transformers repository.
    """

    def __init__(self, pretrained_model_name: str, num_classes: int = None):
        """
        Args:
            pretrained_model_name (str): HuggingFace model name.
                See transformers/modeling_auto.py
            num_classes (int): the number of class labels
                in the classification task
        """
        super().__init__()

        config = AutoConfig.from_pretrained(
             pretrained_model_name)

        self.distilbert = AutoModel.from_pretrained(pretrained_model_name,
                                                    config=config)
        self.pre_classifier = nn.Linear(config.dim, config.dim)
        self.classifier = nn.Linear(config.dim, num_classes)
        self.dropout = nn.Dropout(config.seq_classif_dropout)

    def forward(self, features, attention_mask=None, head_mask=None):
        """Compute class logits for the input sequence.

        Args:
            features (torch.Tensor): ids of each token,
                size [bs, seq_length]
            attention_mask (torch.Tensor): binary tensor, used to select
                tokens which are used to compute attention scores
                in the self-attention heads, size [bs, seq_length]
            head_mask (torch.Tensor): 1.0 in head_mask indicates that
                we keep the head, size: [num_heads]
                or [num_hidden_layers x num_heads]
        Returns:
            PyTorch Tensor with class logits (apply softmax for probabilities)
        """
        assert attention_mask is not None, "attention mask is none"
        distilbert_output = self.distilbert(input_ids=features,
                                            attention_mask=attention_mask,
                                            head_mask=head_mask)
        # we only need the hidden state here and don't need
        # transformer output, so index 0
        hidden_state = distilbert_output[0]  # (bs, seq_len, dim)
        # we take embeddings from the [CLS] token, so again index 0
        pooled_output = hidden_state[:, 0]  # (bs, dim)
        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)
        pooled_output = nn.ReLU()(pooled_output)  # (bs, dim)
        pooled_output = self.dropout(pooled_output)  # (bs, dim)
        logits = self.classifier(pooled_output)  # (bs, num_classes)

        return logits

In [None]:
from transformers import AutoConfig, AutoTokenizer, AutoModel
model = DistilBert(pretrained_model_name=MODEL_NAME,
                                           num_classes=2)

In [None]:
from catalyst.dl.utils import trace
def load_checkpoint(model, path):
    # read the checkpoint and copy the trained weights into the model
    mod = trace.load_checkpoint(path)
    model.load_state_dict(mod['model_state_dict'])
    return model

In [None]:
model = load_checkpoint(model, '../input/sentiment-all-models/last 0.9622.pth')


In [None]:
model.eval()

traced_cpu = torch.jit.trace(model, model_input)
torch.jit.save(traced_cpu, "cpu.pth")

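# To load the traced model back:
cpu_model = torch.jit.load("cpu.pth")

# GPU: build the example input on the GPU first, then trace the CUDA model
# gpu_model_input = sentence_input(test_sentence, device='cuda')
# traced_gpu = torch.jit.trace(model.cuda(), gpu_model_input)
# torch.jit.save(traced_gpu, "gpu.pth")

# gpu_model = torch.jit.load("gpu.pth")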

In [None]:
print(cpu_model.graph)

# Dynamic Quantization

Post-training dynamic quantization is the form of quantization in which the weights are quantized ahead of time, while the activations are quantized dynamically during inference.

Dynamic quantization support in PyTorch converts a float model to a quantized model with static int8 or float16 data types for the weights and dynamic quantization for the activations. The activations are quantized dynamically (per batch) to int8 when the weights are quantized to int8.

The mapping is performed by converting the floating point tensors using:

![](https://pytorch.org/docs/stable/_images/math-quantizer-equation.png)
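
For intuition, the affine mapping `q = round(x / scale + zero_point)` can be sketched in plain Python. This is a simplified illustration rather than PyTorch's actual kernel: the helper names and the min/max rule for picking `scale` and `zero_point` are assumptions made for this example.

```python
def choose_qparams(values, qmin=-128, qmax=127):
    """Pick scale and zero_point so that [min, max] of `values` maps onto [qmin, qmax]."""
    lo = min(min(values), 0.0)  # the range must contain 0 so it is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 if all values are 0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an int8 code: round(x / scale + zero_point), clamped to the int8 range."""
    q = round(x / scale + zero_point)
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an int8 code back to an approximate float."""
    return (q - zero_point) * scale


values = [-1.0, -0.25, 0.0, 0.5, 2.0]
scale, zero_point = choose_qparams(values)
for v in values:
    q = quantize(v, scale, zero_point)
    print(f"{v:+.3f} -> {q:4d} -> {dequantize(q, scale, zero_point):+.4f}")
```

Each value comes back with an error of at most half a quantization step (`scale / 2`), which is why a small accuracy drop after quantization is expected.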

In [None]:
quantized_model = torch.quantization.quantize_dynamic(model)

In [None]:
print(quantized_model)

Our model size decreased from 255 MB to 132 MB. The word embedding table alone takes about 4 (bytes per FP32 value) * 30522 (vocabulary size) * 768 (embedding size) ≈ 90 MB, and it is not quantized here. Excluding it, the rest of the model shrank from 165 MB to 42 MB (INT8).
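
The size arithmetic can be checked with a few lines of Python. The sketch uses only the figures quoted in this notebook (255 MB, 132 MB, vocabulary and embedding sizes) and the rounded 90 MB embedding estimate; the variable names are just for illustration.

```python
BYTES_PER_FP32 = 4
VOCAB_SIZE = 30522
EMBEDDING_DIM = 768

# Exact size of the FP32 word embedding table, in decimal megabytes.
embedding_mb = BYTES_PER_FP32 * VOCAB_SIZE * EMBEDDING_DIM / 1e6

# The notebook rounds the embedding table to ~90 MB; use that figure below.
EMBEDDING_MB_ROUNDED = 90
fp32_model_mb = 255
quantized_model_mb = 132

fp32_rest_mb = fp32_model_mb - EMBEDDING_MB_ROUNDED       # everything except embeddings, FP32
int8_rest_mb = quantized_model_mb - EMBEDDING_MB_ROUNDED  # same part after quantization
compression = fp32_rest_mb / int8_rest_mb

print(f"embedding table: {embedding_mb:.1f} MB")
print(f"rest of the model: {fp32_rest_mb} MB -> {int8_rest_mb} MB ({compression:.1f}x smaller)")
```

The ratio for the non-embedding part comes out close to 4x, which is what we would expect when 4-byte FP32 weights are replaced by 1-byte INT8 weights.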

# ONNX Runtime

ONNX provides an open source format for AI models, both deep learning and traditional ML. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types.

ONNX Runtime is a performance-focused engine (written in C++) for ONNX models that runs inference efficiently across multiple platforms and hardware.

In [None]:
!pip install onnx onnxruntime onnxruntime-tools
#For GPU Inference: install onnxruntime-gpu

To export the model, we first put it in eval() mode and then provide model_input. We use `dynamic_axes` to mark the batch dimension as variable; the sequence length stays fixed to the length of the example input.

*Note: for 4 sequence lengths - we need 4 different models.*
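
With one exported model per sequence length, something has to decide which model serves a given input. One possible approach, sketched below, is to pick the smallest length that fits the tokenized input. The helper names and the file names other than `optimized_512.onnx` are hypothetical and only follow the naming used in this notebook.

```python
from bisect import bisect_left

SEQ_LENGTHS = [64, 128, 256, 512]  # the maximum sequence lengths benchmarked here


def pick_max_len(num_tokens, buckets=SEQ_LENGTHS):
    """Return the smallest bucket that fits `num_tokens`.

    Inputs longer than the largest bucket fall back to the largest one
    and would have to be truncated by the tokenizer.
    """
    idx = bisect_left(buckets, num_tokens)
    return buckets[min(idx, len(buckets) - 1)]


def model_file_for(num_tokens):
    """Hypothetical naming scheme: one optimized ONNX file per sequence length."""
    return f"optimized_{pick_max_len(num_tokens)}.onnx"


for n in (10, 64, 65, 300, 1000):
    print(n, "->", model_file_for(n))
```

Shorter inputs then run through a shorter (and faster) model instead of always being padded to 512 tokens.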

In [None]:
torch.onnx.export(model, model_input, "model_512.onnx",
                  export_params=True,
                  input_names=["input_ids", "attention_mask"],
                  output_names=["targets"],
                  dynamic_axes={
                      "input_ids": {0: "batch_size"},
                      "attention_mask": {0: "batch_size"},
                      "targets": {0: "batch_size"}
                  },
                  verbose=True)

To check that the model is well formed

In [None]:
import onnx
onnx_model = onnx.load('model_512.onnx')
onnx.checker.check_model(onnx_model, full_check=True)
onnx.helper.printable_graph(onnx_model.graph)

To optimize our model, we import the optimizer. `opt_level` sets the graph optimization level: 0 - disable all (default), 1 - basic, 2 - extended, 99 - all. Set `use_gpu` for GPU inference.

In [None]:
from onnxruntime_tools import optimizer
optimized_model_512 = optimizer.optimize_model("model_512.onnx", model_type='bert', 
                                               num_heads=12, hidden_size=768,
                                              use_gpu=False, opt_level=99)

optimized_model_512.save_model_to_file("optimized_512.onnx")

For GPU inference, we can use the following methods:
* change_input_to_int32() - int32 is used for the inputs, which can give better performance.
* change_input_output_float32_to_float16() - half precision is used in computation.
* convert_model_float32_to_float16() - decreases the model size (255MB -> 128MB)

In order to run the model with ONNX Runtime, we need to create an inference session for the model.

In [None]:
import onnxruntime as ort
print(ort.get_device())
OPTIMIZED_512 = ort.InferenceSession('./optimized_512.onnx')

In [None]:
def to_numpy(tensor):
    if tensor.requires_grad:
        return tensor.detach().cpu().numpy()
    return tensor.cpu().numpy()

def prediction_onnx(model, sentence: str, max_len: int = 512):
    encoded = tokenizer.encode_plus(sentence, add_special_tokens=True, 
                                    pad_to_max_length=True, max_length=max_len,
                                    return_tensors="pt",)
    # compute ONNX Runtime output prediction
    input_ids = to_numpy(encoded['input_ids'])
    attention_mask = to_numpy(encoded['attention_mask'])
    onnx_input = {"input_ids": input_ids, "attention_mask": attention_mask}
    logits = model.run(None, onnx_input)
    preds = softmax(logits[0][0])
    print(f"Class: {['Negative' if preds.argmax() == 0 else 'Positive'][0]}, Probability: {preds.max():.4f}")

In [None]:
prediction_onnx(OPTIMIZED_512, test_sentence)

# CPU Results

Inference time is given in milliseconds.

In [None]:
df = pd.DataFrame([[506, 273, 151, 89.1, 0],
                  [507, 263, 145, 82.7, 5.2],
                  [516, 237, 126, 72.4, 19],
                  [388, 180, 92.2, 49.7, 56.2]], index = ['Pytorch', 'TorchScript', 
                                                    'ONNX Runtime', 'Quantization'],
                  columns=['512', '256', '128', '64', "Av.SpeedUp (%)"])
display(df)

In [None]:
cpu = pd.DataFrame([[64, 'Pytorch', 89.1],
                  [64, 'TorchScript', 82.7],
                  [64, 'ONNX Runtime', 72.4],
                  [64, 'Quantization', 49.4],
                   [128, 'Pytorch', 151],
                   [256, 'Pytorch', 273],
                   [512, 'Pytorch', 506],
                   [128, 'TorchScript', 145],
                   [256, 'TorchScript', 263],
                   [512, 'TorchScript', 507],
                   [128, 'ONNX Runtime', 126],
                   [256, 'ONNX Runtime', 237],
                   [512, 'ONNX Runtime', 516],
                   [128, 'Quantization', 92.2],
                   [256, 'Quantization', 180],
                    [512, 'Quantization', 388]],
                  columns=['Sequence', 'Optimization', 'Time (ms)'])

sns.set_style("darkgrid")
sns.catplot(x='Optimization', y='Time (ms)', data=cpu, kind='bar',
            ci=None, col='Sequence', col_wrap=2,
           col_order=[512,256,128,64]);

Here we can see that quantization gave the most significant improvement in inference speed. Validation accuracy drops from 96.22% to 96.03%, which is minor given the reduction in model size and the speedup. If we extend the set of maximum sequence lengths down to 32 and 16, the speedup is about 85% across [16, 32, 64, 128].

# GPU Results

PyTorch doesn't provide GPU support for quantization yet.

In [None]:
gpu_df = pd.DataFrame([[16.1, 12.1, 11.9, 11.9, 0],
                  [15.9, 11.2, 9.2, 8.92, 18],
                  [14.2, 10, 8.14, 7.57, 35]], index = ['Pytorch', 'TorchScript', 
                                                    'ONNX Runtime'],
                  columns=['512', '256', '128', '64', "Av.SpeedUp (%)"])
display(gpu_df)

In [None]:
gpu = pd.DataFrame([[64, 'Pytorch', 11.9],
                  [64, 'TorchScript', 8.92],
                  [64, 'ONNX Runtime', 7.57],
                   [128, 'Pytorch', 11.9],
                   [256, 'Pytorch', 12.1],
                   [512, 'Pytorch', 16.1],
                   [128, 'TorchScript', 9.2],
                   [256, 'TorchScript', 11.2],
                   [512, 'TorchScript', 15.9],
                   [128, 'ONNX Runtime', 8.14],
                   [256, 'ONNX Runtime', 10],
                   [512, 'ONNX Runtime', 14.2]],
                  columns=['Sequence', 'Optimization', 'Time (ms)'])

sns.catplot(x='Optimization', y='Time (ms)', data=gpu, kind='bar',
            ci=None, col='Sequence', col_wrap=2,
           col_order=[512,256,128,64]);

Although TorchScript wasn't designed as a speed optimization, it still yields a solid boost of about 18% on average over the non-traced model (see the table above).
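
The notebook doesn't say how the "Av.SpeedUp (%)" column was computed. One plausible definition, assumed here, is the mean over sequence lengths of `(baseline_time / optimized_time - 1) * 100`. The sketch below applies it to the GPU timings from the table; the function name is hypothetical.

```python
from statistics import mean

# GPU timings in ms for sequence lengths 512, 256, 128, 64 (from the table above).
gpu_ms = {
    "Pytorch": [16.1, 12.1, 11.9, 11.9],
    "TorchScript": [15.9, 11.2, 9.2, 8.92],
    "ONNX Runtime": [14.2, 10, 8.14, 7.57],
}


def average_speedup_pct(baseline, optimized):
    """Mean relative speedup in percent, assuming speedup = baseline / optimized - 1."""
    return mean((b / o - 1.0) * 100.0 for b, o in zip(baseline, optimized))


for name in ("TorchScript", "ONNX Runtime"):
    pct = average_speedup_pct(gpu_ms["Pytorch"], gpu_ms[name])
    print(f"{name}: {pct:.1f}% faster on average")
```

Under this assumption the results land close to the 18% and 35% reported for GPU. The per-length numbers also show that most of the gain comes from the shorter sequences.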

The FP16 ONNX model showed very good performance gains. There are more optimizations available, such as disabling or enabling some fusions and GPU support for quantization.