![nebullvm nebuly AI accelerate inference optimize DeepLearning](https://user-images.githubusercontent.com/38586138/201391643-a80407e5-2c28-409c-90c9-327795cd27e8.png)

# Accelerate Hugging Face T5 with Speedster


Hi and welcome 👋

In this notebook we show how, in just a few steps, you can reduce the inference response time of a deep learning model using Speedster, an app from the open-source library nebullvm.

With Speedster's latest API, you can speed up models by up to 10 times without any loss of accuracy (option A), or by up to 20-30 times by setting a self-defined amount of accuracy/precision that you are willing to trade off for an even lower response time (option B). To accelerate your model, Speedster takes advantage of various optimization techniques: deep learning compilers (in both option A and option B), and quantization, half precision, and so on (option B only).

Let's jump to the code.

In [None]:
%env CUDA_VISIBLE_DEVICES=0

# Installation

Install Speedster:

In [None]:
!pip install speedster

Install deep learning compilers:

In [None]:
!python -m nebullvm.installers.auto_installer --frameworks huggingface --compilers all

## Model and Dataset setup

Add the TensorRT installation path to the `LD_LIBRARY_PATH` environment variable in order to activate the TensorrtExecutionProvider for ONNX Runtime.

In [None]:
import os

tensorrt_path = "/usr/local/lib/python3.8/dist-packages/tensorrt"  # Change this path according to your TensorRT location

if os.path.exists(tensorrt_path):
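    # LD_LIBRARY_PATH may be unset, so fall back to an empty string instead of raising KeyError
    current_path = os.environ.get("LD_LIBRARY_PATH", "")
    os.environ["LD_LIBRARY_PATH"] = f"{current_path}:{tensorrt_path}" if current_path else tensorrt_path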
else:
    print("Unable to find TensorRT path. ONNXRuntime won't use TensorrtExecutionProvider.")

We chose T5-efficient-base as the pre-trained model that we want to optimize. Let's download both the pre-trained model and the tokenizer from the Hugging Face model hub.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "google/t5-efficient-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torchscript=True).to(device)

# set the model to eval mode
_ = model.eval()

Let's create an example dataset from a few sample sentences, repeated to obtain 100 inputs.

In [None]:
texts = [
    """BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.""",
    """GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences.""",
    """With T5, we propose reframing all NLP tasks into a unified text-to-text-format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. Our text-to-text framework allows us to use the same model, loss function, and hyperparameters on any NLP task.""",
    """LayoutLMv3 is a pre-trained multimodal Transformer for Document AI with unified text and image masking. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model. For example, LayoutLMv3 can be fine-tuned for both text-centric tasks, including form understanding, receipt understanding, and document visual question answering, and image-centric tasks such as document image classification and document layout analysis.""",
    """XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking."""
]
texts = texts*20

In [None]:
encoded_inputs = [tokenizer(text, padding="longest", return_tensors="pt") for text in texts]

## Speed up inference with Speedster: no metric drop

Now it's time to improve the inference speed. Let's use `Speedster`.

In [None]:
from speedster import optimize_model, save_model, load_model

Speedster is usually simple and straightforward: call the `optimize_model` function and provide the model, some example input data, and the optimization time mode. Encoder-decoder models need a few extra steps, because the current version of Speedster does not support them directly. These models contain both an encoder and a decoder. For example, BERT models are encoder models and GPT models are decoder models, while T5 has both.

In [None]:
# First, we get the encoder and decoder from the model
encoder = model.get_encoder()
decoder = model.get_decoder()

Optionally, a `dynamic_info` dictionary can also be provided in order to support inputs with dynamic shapes.

In [None]:
dynamic_info = {
    "inputs": [
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch', 1: 'num_tokens'}
    ],
    "outputs": [
        {0: 'batch', 1: 'num_tokens'},
    ]
}
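
Each entry in `dynamic_info` maps an axis index to a label for a dimension whose size may change between calls. As a rough illustration only, the sketch below checks that every axis that actually varies across a set of input shapes has been declared dynamic. The helper names `varying_axes` and `uncovered_axes` and the example shapes are hypothetical and are not part of Speedster.

```python
from typing import Dict, Sequence, Set


def varying_axes(shapes: Sequence[Sequence[int]]) -> Set[int]:
    """Return the axis indices whose size differs across the given shapes."""
    first = shapes[0]
    return {
        axis
        for axis in range(len(first))
        if any(shape[axis] != first[axis] for shape in shapes)
    }


def uncovered_axes(shapes: Sequence[Sequence[int]], declared: Dict[int, str]) -> Set[int]:
    """Return the varying axes that are missing from the declared dynamic axes."""
    return varying_axes(shapes) - set(declared)


# Hypothetical (batch, num_tokens) shapes of tokenized inputs
example_shapes = [(1, 64), (1, 87), (4, 64)]
declared_axes = {0: "batch", 1: "num_tokens"}
print(uncovered_axes(example_shapes, declared_axes))  # set()
```

An empty result means that every dimension that changes across the examples is covered by the declaration.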

In [None]:
# Create the optimized encoder model separately
optimized_encoder_model = optimize_model(
    model=encoder,
    input_data=encoded_inputs,
    optimization_time="constrained",
    ignore_compilers=["tensor_rt", "tvm"],  # TensorRT does not work for this model
    dynamic_info=dynamic_info,
)

In [None]:
# Create the optimized decoder model separately
optimized_decoder_model = optimize_model(
    model=decoder,
    input_data=encoded_inputs,
    optimization_time="constrained",
    ignore_compilers=["tensor_rt", "tvm"],  # TensorRT does not work for this model
    dynamic_info=dynamic_info,
)

In [None]:
import time

# Move inputs to gpu if available
encoded_inputs = [tokenizer(text, padding="longest", return_tensors="pt").to(device) for text in texts]

Let's run the prediction 100 times to calculate the average response time of the original model.

In [None]:
times = []
# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        encoder_out = encoder(**encoded_input)
        decoder_out = decoder(**encoded_input,encoder_hidden_states=encoder_out[0])

# Benchmark
for encoded_input in encoded_inputs:
    st = time.time()
    with torch.no_grad():
        encoder_out = encoder(**encoded_input)
        decoder_out = decoder(**encoded_input,encoder_hidden_states=encoder_out[0])
    times.append(time.time()-st)
original_model_time = sum(times)/len(times)*1000
print(f"Average response time for original T5: {original_model_time} ms")
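
The warmup-then-measure pattern used above can be wrapped in a small reusable helper. This is a minimal sketch, not part of Speedster: the name `benchmark_ms` is hypothetical, and a plain Python function stands in for the encoder and decoder calls so that the example runs on its own.

```python
import time
from statistics import mean
from typing import Any, Callable, Iterable


def benchmark_ms(fn: Callable[[Any], Any], inputs: Iterable[Any], warmup: int = 30) -> float:
    """Return the average latency of fn over inputs, in milliseconds."""
    items = list(inputs)
    if not items:
        raise ValueError("inputs must not be empty")
    # Warmup calls are executed but not timed
    for item in items[:warmup]:
        fn(item)
    timings = []
    for item in items:
        start = time.perf_counter()
        fn(item)
        timings.append(time.perf_counter() - start)
    return mean(timings) * 1000


# Stand-in workload; in the notebook this would wrap the encoder and decoder calls
average_ms = benchmark_ms(lambda n: sum(range(n)), [10_000] * 100)
print(f"Average response time: {average_ms:.3f} ms")
```

`time.perf_counter()` is a monotonic, high-resolution clock, which makes it a safer choice than `time.time()` for short intervals. On a GPU, PyTorch launches CUDA kernels asynchronously, so a real measurement would likely also call `torch.cuda.synchronize()` before reading the clock.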

In real-world use cases, the decoder output is passed to `model.lm_head` to get the actual prediction. Here we are only measuring the performance improvement, so that step is skipped.

Let's see the output of the original model

In [None]:
encoder(**encoded_input)

In [None]:
decoder(**encoded_input,encoder_hidden_states=encoder_out[0])

Let's run the prediction 100 times to calculate the average response time of the optimized model.

In [None]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        encoder_out = optimized_encoder_model(**encoded_input)
        decoder_out = optimized_decoder_model(**encoded_input,encoder_hidden_states=encoder_out[0])

# Benchmark
for encoded_input in encoded_inputs:
    st = time.time()
    with torch.no_grad():
        encoder_out = optimized_encoder_model(**encoded_input)
        decoder_out = optimized_decoder_model(**encoded_input,encoder_hidden_states=encoder_out[0])
    times.append(time.time()-st)
optimized_model_time = sum(times)/len(times)*1000
print(f"Average response time for optimized T5 (no metric drop): {optimized_model_time} ms")

Let's see the output of the optimized_model

In [None]:
optimized_encoder_model(**encoded_input)

In [None]:
optimized_decoder_model(**encoded_input,encoder_hidden_states=encoder_out[0])
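
With both averages available, the speedup is simply the ratio of the two response times. The sketch below uses a hypothetical helper `speedup` and made-up timings purely for illustration; in the notebook you would pass `original_model_time` and `optimized_model_time` instead.

```python
def speedup(original_ms: float, optimized_ms: float) -> float:
    """Return how many times faster the optimized model is than the original."""
    if optimized_ms <= 0:
        raise ValueError("optimized_ms must be positive")
    return original_ms / optimized_ms


# Hypothetical timings in milliseconds, not measured results
print(f"Speedup: {speedup(40.0, 16.0):.2f}x")  # Speedup: 2.50x
```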

## Speed up inference with Speedster: metric drop

This time we will use the `metric_drop_ths` argument to accept a small drop in precision, in order to enable quantization and obtain a higher speedup.

In [None]:
optimized_encoder_model = optimize_model(
    model=encoder,
    input_data=encoded_inputs,
    optimization_time="constrained",
    ignore_compilers=["tensor_rt", "tvm"],  # TensorRT does not work for this model
    dynamic_info=dynamic_info,
    metric_drop_ths=0.1,
)

In [None]:
optimized_decoder_model = optimize_model(
    model=decoder,
    input_data=encoded_inputs,
    optimization_time="constrained",
    ignore_compilers=["tensor_rt", "tvm"],  # TensorRT does not work for this model
    dynamic_info=dynamic_info,
    metric_drop_ths=0.1,
)
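
When trading precision for speed, it can help to check how far the optimized outputs drift from the original ones. The sketch below is only a hypothetical sanity check on flat lists of numbers: it does not reproduce the metric that Speedster computes internally for `metric_drop_ths`, and the names `mean_relative_error` and `within_threshold` are assumptions made for this example.

```python
from typing import Sequence


def mean_relative_error(reference: Sequence[float], candidate: Sequence[float], eps: float = 1e-8) -> float:
    """Return the mean of |reference - candidate| / (|reference| + eps)."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    total = sum(abs(r - c) / (abs(r) + eps) for r, c in zip(reference, candidate))
    return total / len(reference)


def within_threshold(reference: Sequence[float], candidate: Sequence[float], threshold: float = 0.1) -> bool:
    """Return True if the mean relative error does not exceed the threshold."""
    return mean_relative_error(reference, candidate) <= threshold


# Hypothetical output values from an original and a quantized model
print(within_threshold([1.0, 2.0], [1.05, 1.9]))  # True
```

In practice the values would come from flattening the original and optimized output tensors for the same input.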

In [None]:
times = []
# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        encoder_out = encoder(**encoded_input)
        decoder_out = decoder(**encoded_input,encoder_hidden_states=encoder_out[0])

# Benchmark
for encoded_input in encoded_inputs:
    st = time.time()
    with torch.no_grad():
        encoder_out = encoder(**encoded_input)
        decoder_out = decoder(**encoded_input,encoder_hidden_states=encoder_out[0])
    times.append(time.time()-st)
original_model_time = sum(times)/len(times)*1000
print(f"Average response time for original T5: {original_model_time} ms")

In [None]:
encoder(**encoded_input)

In [None]:
decoder(**encoded_input,encoder_hidden_states=encoder_out[0])

In [None]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        encoder_out = optimized_encoder_model(**encoded_input)
        decoder_out = optimized_decoder_model(**encoded_input,encoder_hidden_states=encoder_out[0])

# Benchmark
for encoded_input in encoded_inputs:
    st = time.time()
    with torch.no_grad():
        encoder_out = optimized_encoder_model(**encoded_input)
        decoder_out = optimized_decoder_model(**encoded_input,encoder_hidden_states=encoder_out[0])
    times.append(time.time()-st)
optimized_model_time = sum(times)/len(times)*1000
print(f"Average response time for optimized T5 (metric drop): {optimized_model_time} ms")

## Save and reload the optimized model

We can easily save the optimized models to disk with the following lines:

In [None]:
save_model(optimized_encoder_model, "encoder_model_save_path")
save_model(optimized_decoder_model, "decoder_model_save_path")

We can then load the models again:

In [None]:
optimized_encoder_model = load_model("encoder_model_save_path")
optimized_decoder_model = load_model("decoder_model_save_path")

Great! Was it easy? How are the results? Do you have any comments?
Share your optimization results and thoughts with <a href="https://discord.gg/RbeQMu886J" target="_blank"> our community on Discord</a>, where we chat about Speedster and AI acceleration.

Note that the acceleration of Speedster depends very much on the hardware configuration and your AI model. Given the same input model, Speedster can accelerate it by 10 times on some machines and perform poorly on others.

If you want to learn more about how Speedster works, look at other tutorials and performance benchmarks, check out the links below or write to us on Discord.

<center> 
    <a href="https://discord.com/invite/RbeQMu886J" target="_blank" style="text-decoration: none;"> Join the community </a> |
    <a href="https://nebuly.gitbook.io/nebuly/welcome/questions-and-contributions" target="_blank" style="text-decoration: none;"> Contribute to the library </a>
</center>

<center> 
    <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#key-concepts" target="_blank" style="text-decoration: none;"> How Speedster works </a> •
    <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#documentation" target="_blank" style="text-decoration: none;"> Documentation </a> •
    <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#quick-start" target="_blank" style="text-decoration: none;"> Quick start </a> 
</center>