![nebullvm nebuly AI accelerate inference optimize DeepLearning](https://user-images.githubusercontent.com/38586138/201391643-a80407e5-2c28-409c-90c9-327795cd27e8.png)

# Accelerate Hugging Face PyTorch DistilBERT with Speedster


Hi and welcome 👋

In this notebook you will see how, in just a few steps, you can reduce the inference response time of a deep learning model using Speedster, an app from the open-source library nebullvm.

With Speedster's latest API, you can speed up models by up to 10 times without any loss of accuracy (option A), or by up to 20-30 times by specifying how much accuracy you are willing to trade for even lower response time (option B). To accelerate your model, Speedster uses deep learning compilers in both options and, in option B only, additional techniques such as quantization and half precision.

Let's jump to the code.

In [None]:
%env CUDA_VISIBLE_DEVICES=0

# Installation

Install Speedster:

In [None]:
!pip install speedster

Install deep learning compilers:

In [None]:
!python -m nebullvm.installers.auto_installer --frameworks huggingface --compilers all

## Model and Dataset setup

Add the TensorRT installation path to the `LD_LIBRARY_PATH` environment variable to activate the TensorrtExecutionProvider for ONNX Runtime.

In [None]:
import os

tensorrt_path = "/usr/local/lib/python3.8/dist-packages/tensorrt"  # Change this path according to your TensorRT location

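if os.path.exists(tensorrt_path):
    # LD_LIBRARY_PATH may be unset, so read it with a default instead of using +=
    current = os.environ.get("LD_LIBRARY_PATH", "")
    os.environ["LD_LIBRARY_PATH"] = f"{current}:{tensorrt_path}" if current else tensorrt_path
else:
    print("Unable to find TensorRT path. ONNXRuntime won't use TensorrtExecutionProvider.")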

We chose DistilBERT as the pre-trained model that we want to optimize. Let's download both the pre-trained model and the tokenizer from the Hugging Face model hub.

In [None]:
import torch
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased", torchscript=True)

# Move the model to the GPU if available and set eval mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).eval()

Let's create an example dataset of 100 texts, each built by joining between 1 and 30 randomly chosen sentences so that the input length varies:

In [None]:
import random

sentences = [
    "Mars is the fourth planet from the Sun.",
    "has a crust primarily composed of elements",
    "However, it is unknown",
    "can be viewed from Earth",
    "It was the Romans",
]

len_dataset = 100

texts = []
for _ in range(len_dataset):
    n_times = random.randint(1, 30)
    texts.append(" ".join(random.choice(sentences) for _ in range(n_times)))

In [None]:
encoded_inputs = [tokenizer(text, return_tensors="pt") for text in texts]

## Speed up inference with Speedster: no metric drop

Now it's time to improve inference speed. Let's use `Speedster`.

In [None]:
from speedster import optimize_model, save_model, load_model

Using Speedster is straightforward: call the `optimize_model` function with the model, some example input data, and the optimization time mode. Optionally, you can also pass a `dynamic_info` dictionary to support inputs with dynamic shapes.

In [None]:
dynamic_info = {
    "inputs": [
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch', 1: 'num_tokens'}
    ],
    "outputs": [
        {0: 'batch', 1: 'num_tokens'}
    ]
}

optimized_model = optimize_model(
    model=model,
    input_data=encoded_inputs,
    optimization_time="constrained",
    ignore_compilers=["tensor_rt", "tvm"],  # TensorRT does not work for this model; TVM is skipped as well
    dynamic_info=dynamic_info,
)
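
To make the `dynamic_info` dictionary above more concrete, here is a minimal sketch of what the axis mappings express. The `describe_shape` helper is hypothetical and not part of Speedster; it only relabels the dimensions that the dictionary marks as dynamic. The comments assume the two inputs are `input_ids` and `attention_mask`, in the order the DistilBERT tokenizer returns them, and that the single output is the last hidden state.

```python
# Illustrative only: shows how a dynamic-axes mapping labels tensor dimensions.
# `describe_shape` is a hypothetical helper, not a Speedster function.

def describe_shape(shape, dynamic_axes):
    """Replace the size of each dynamic dimension with its symbolic name."""
    return [dynamic_axes.get(i, size) for i, size in enumerate(shape)]


dynamic_info = {
    "inputs": [
        {0: "batch", 1: "num_tokens"},  # assumed: input_ids
        {0: "batch", 1: "num_tokens"},  # assumed: attention_mask
    ],
    "outputs": [
        {0: "batch", 1: "num_tokens"},  # assumed: last hidden state
    ],
}

# One tokenized text of 12 tokens: input_ids has shape (1, 12), and
# DistilBERT-base produces hidden states of shape (1, 12, 768).
input_ids_shape = describe_shape((1, 12), dynamic_info["inputs"][0])
hidden_shape = describe_shape((1, 12, 768), dynamic_info["outputs"][0])

print(input_ids_shape)
print(hidden_shape)
```

Dimensions that are not listed, such as the hidden size of 768, are treated as fixed. Only the batch size and the number of tokens are declared free to change between calls.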

In [None]:
import time

# Move inputs to the GPU if available
encoded_inputs = [tokenizer(text, return_tensors="pt").to(device) for text in texts]

Let's run the prediction on all 100 inputs, after a 30-iteration warmup, to calculate the average response time of the original model.

In [None]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = model(**encoded_input)

# Benchmark
for encoded_input in encoded_inputs:
    st = time.time()
    with torch.no_grad():
        final_out = model(**encoded_input)
    times.append(time.time()-st)
original_model_time = sum(times)/len(times)*1000
print(f"Average response time for original DistilBERT: {original_model_time} ms")
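
The benchmark loop above can be factored into a reusable helper. The following is a minimal sketch, not part of Speedster: `average_latency_ms` and its `sync` parameter are names introduced here for illustration. It uses `time.perf_counter`, which is better suited to measuring short intervals than `time.time`, and accepts an optional synchronization callable. This matters on a GPU because CUDA kernels are launched asynchronously, so the clock may otherwise be read before the work has finished.

```python
import time


def average_latency_ms(fn, inputs, warmup=30, sync=None):
    """Return the mean wall-clock time of fn(x) over inputs, in milliseconds.

    `sync` is an optional zero-argument callable invoked before each clock
    read, for example torch.cuda.synchronize when the model runs on a GPU.
    """
    # Warmup: run the first `warmup` inputs without timing them
    for x in inputs[:warmup]:
        fn(x)

    durations = []
    for x in inputs:
        if sync is not None:
            sync()  # make sure no earlier work is still pending
        start = time.perf_counter()
        fn(x)
        if sync is not None:
            sync()  # wait for the call's work to finish before stopping the clock
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations) * 1000


# Stand-in workload so the sketch runs without a model
latency = average_latency_ms(lambda n: sum(range(n)), [10_000] * 100)
print(f"Average latency: {latency:.4f} ms")
```

With the objects defined in this notebook, the helper could be called inside a `torch.no_grad()` block as `average_latency_ms(lambda enc: model(**enc), encoded_inputs, sync=torch.cuda.synchronize if device.type == "cuda" else None)`.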

Let's see the output of the original model:

In [None]:
model(**encoded_input)

Let's run the prediction on all 100 inputs, after a 30-iteration warmup, to calculate the average response time of the optimized model.

In [None]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = optimized_model(**encoded_input)

# Benchmark
for encoded_input in encoded_inputs:
    st = time.time()
    with torch.no_grad():
        final_out = optimized_model(**encoded_input)
    times.append(time.time()-st)
optimized_model_time = sum(times)/len(times)*1000
print(f"Average response time for optimized DistilBERT (no metric drop): {optimized_model_time} ms")
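
Once both averages are available, the speedup is simply their ratio. The sketch below uses a hypothetical `speedup` helper and placeholder latencies, not measured results. In the notebook you would pass `original_model_time` and `optimized_model_time` instead.

```python
def speedup(original_ms, optimized_ms):
    """Return how many times faster the optimized model is than the original."""
    if optimized_ms <= 0:
        raise ValueError("optimized_ms must be positive")
    return original_ms / optimized_ms


# Placeholder latencies for illustration only, not measured results
example_original_ms = 12.0
example_optimized_ms = 4.0

print(f"Speedup: {speedup(example_original_ms, example_optimized_ms):.2f}x")
```

A value above 1 means the optimized model is faster. A value below 1 means the optimization made inference slower on this hardware.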

Let's see the output of the optimized model:

In [None]:
optimized_model(**encoded_input)
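
Rather than comparing the two printed outputs by eye, you can quantify how far they differ. The sketch below is one simple way to do that and is not necessarily the metric Speedster uses internally. `max_abs_diff` is a hypothetical helper that works on plain Python lists, and the values are placeholders standing in for flattened model outputs.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


# Placeholder values standing in for flattened model outputs
reference = [0.10, -0.25, 0.40]
candidate = [0.10, -0.24, 0.41]

print(f"Max absolute difference: {max_abs_diff(reference, candidate):.4f}")
```

Assuming both models return a tuple whose first element is the last hidden state tensor, the inputs could be obtained with `model(**encoded_input)[0].flatten().tolist()` and `optimized_model(**encoded_input)[0].flatten().tolist()`.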

## Speed up inference with Speedster: metric drop

This time we will use the `metric_drop_ths` argument to accept a small drop in precision, which enables quantization and can yield a higher speedup.

In [None]:
optimized_model = optimize_model(
    model=model,
    input_data=encoded_inputs,
    optimization_time="constrained",
    ignore_compilers=["tensor_rt", "tvm"],  # TensorRT does not work for this model; TVM is skipped as well
    dynamic_info=dynamic_info,
    metric_drop_ths=0.1,
)

In [None]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = model(**encoded_input)

# Benchmark
for encoded_input in encoded_inputs:
    st = time.time()
    with torch.no_grad():
        final_out = model(**encoded_input)
    times.append(time.time()-st)
original_model_time = sum(times)/len(times)*1000
print(f"Average response time for original DistilBERT: {original_model_time} ms")

In [None]:
model(**encoded_input)

In [None]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = optimized_model(**encoded_input)

# Benchmark
for encoded_input in encoded_inputs:
    st = time.time()
    with torch.no_grad():
        final_out = optimized_model(**encoded_input)
    times.append(time.time()-st)
optimized_model_time = sum(times)/len(times)*1000
print(f"Average response time for optimized DistilBERT (metric drop): {optimized_model_time} ms")

In [None]:
optimized_model(**encoded_input)

## Save and reload the optimized model

We can easily save the optimized model to disk with the following line:

In [None]:
save_model(optimized_model, "model_save_path")

We can then load the model again:

In [None]:
optimized_model = load_model("model_save_path")

Great! Was it easy? How are the results? Do you have any comments?
Share your optimization results and thoughts with <a href="https://discord.gg/RbeQMu886J" target="_blank"> our community on Discord</a>, where we chat about Speedster and AI acceleration.

Note that the speedup Speedster achieves depends heavily on your hardware configuration and your AI model. Given the same input model, Speedster can accelerate it by 10 times on some machines and perform poorly on others.

If you want to learn more about how Speedster works, look at other tutorials and performance benchmarks, check out the links below or write to us on Discord.

<center> 
    <a href="https://discord.com/invite/RbeQMu886J" target="_blank" style="text-decoration: none;"> Join the community </a> |
    <a href="https://nebuly.gitbook.io/nebuly/welcome/questions-and-contributions" target="_blank" style="text-decoration: none;"> Contribute to the library </a>
</center>

<center> 
    <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#key-concepts" target="_blank" style="text-decoration: none;"> How Speedster works </a> •
    <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#documentation" target="_blank" style="text-decoration: none;"> Documentation </a> •
    <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#quick-start" target="_blank" style="text-decoration: none;"> Quick start </a> 
</center>