# Making Transformers Efficient in Production

In this notebook, we will explore four techniques that can be used to speed up the predictions and reduce the memory footprint of our transformer model:

- `Knowledge Distillation`
- `Quantization` 
- `Pruning`
- `Graph Optimization`

Graph optimization is carried out with the `Open Neural Network Exchange (ONNX)` format and `ONNX Runtime (ORT)`. Let's see how these techniques can be combined to produce significant performance gains.

## Intent Detection as a Case Study

As a baseline, we are using a fine-tuned BERT-base model that achieved 95% accuracy on the `CLINC150` dataset. This dataset includes 22,500 queries across 150 intents and 10 domains like banking and travel, and also includes 1,200 out-of-scope queries that belong to an `oos` intent class.

In [None]:
from transformers import pipeline

bert_ckpt = "transformersbook/bert-base-uncased-finetuned-clinc"
pipe = pipeline("text-classification", model = bert_ckpt)

In [None]:
## Let's pass a query to get the predicted intent and confidence score from the model.

query = """Hey, I'd like to rent a vehicle from Nov 1st to Nov 15th in
Paris and I need a 15 passenger van"""

pipe(query)

The predicted intent should correspond to renting a car, which makes sense for this query.

## Creating a Performance Benchmark

Like other machine learning models, deploying transformers in production environments involves a trade-off among several constraints, the most common being:

- *Model performance*: How well does our model perform on a well-crafted test set that reflects production data?
- *Latency*: How fast can our model deliver predictions?
- *Memory*: How can we deploy billion-parameter models like GPT-2 or T5 that require gigabytes of disk storage and RAM?

In [None]:
## A skeleton of what we'll need is given by the following class:

class PerformanceBenchmark():
    def __init__(self, pipeline, dataset, optim_type = "BERT baseline"):
        self.pipeline = pipeline
        self.dataset = dataset
        self.optim_type = optim_type

    def compute_accuracy(self):
        ## We'll define this later
        pass

    def compute_size(self):
        ## We'll define this later
        pass

    def time_pipeline(self):
        ## We'll define this later
        pass

    def run_benchmark(self):
        metrics = {}
        metrics[self.optim_type] = self.compute_size()
        metrics[self.optim_type].update(self.time_pipeline())
        metrics[self.optim_type].update(self.compute_accuracy())
        return metrics    

We've defined an `optim_type` parameter to keep track of the different optimization techniques that we'll cover in this notebook. We'll use the `run_benchmark` method to collect all the metrics in a dictionary, with keys given by `optim_type`.
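To make the shape of the returned dictionary concrete, here is a minimal sketch. It uses a hypothetical `DummyBenchmark` class whose three methods return hard-coded placeholder values (not real measurements), standing in for the methods we have yet to implement:

```python
class DummyBenchmark:
    """Hypothetical stand-in that mimics the structure of PerformanceBenchmark."""

    def __init__(self, optim_type="BERT baseline"):
        self.optim_type = optim_type

    # Placeholder values for illustration only, not real measurements
    def compute_size(self):
        return {"size_mb": 1.0}

    def time_pipeline(self):
        return {"time_avg_ms": 2.0, "time_std_ms": 0.5}

    def compute_accuracy(self):
        return {"accuracy": 0.9}

    def run_benchmark(self):
        metrics = {}
        metrics[self.optim_type] = self.compute_size()
        metrics[self.optim_type].update(self.time_pipeline())
        metrics[self.optim_type].update(self.compute_accuracy())
        return metrics


# Results from several runs can be merged into a single dictionary
all_metrics = DummyBenchmark("BERT baseline").run_benchmark()
all_metrics.update(DummyBenchmark("DistilBERT").run_benchmark())
print(all_metrics)
```

Each optimization technique then contributes one top-level key, which makes it easy to compare runs side by side later on.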

Let's download the `CLINC150` dataset to compute accuracy.

In [None]:
from datasets import load_dataset
## Here the "plus" configuration refers to the subset that contains out-of-scope training examples.
clinc = load_dataset("clinc_oos", "plus")

In [None]:
sample = clinc["test"][42]
sample

In [None]:
intents = clinc["test"].features["intent"]
intents.int2str(sample["intent"])

Now that we have a basic understanding of the contents of the `CLINC150` dataset, let's implement the `compute_accuracy` method of `PerformanceBenchmark`. Since the dataset is balanced across the intent classes, we'll use accuracy as our metric.

In [None]:
from evaluate import load

accuracy_score = load("accuracy")

The accuracy metric expects the predictions and references (i.e., the ground truth labels) to be integers. We can use the pipeline to extract the predictions from the `text` field and then use the `str2int()` method of our `intents` object to map each prediction to its corresponding ID. The following code collects all the predictions and labels in lists before returning the accuracy on the dataset. Let's also add it to our `PerformanceBenchmark` class:

In [None]:
def compute_accuracy(self):
    preds, labels = [], []
    for example in self.dataset:
        pred = self.pipeline(example['text'])
        label = example['intent']
        preds.append(intents.str2int(pred[0]['label']))
        labels.append(label)
    accuracy = accuracy_score.compute(predictions = preds, references = labels)
    print(f"Accuracy on test set - {accuracy['accuracy']:.3f}")
    return accuracy            

PerformanceBenchmark.compute_accuracy = compute_accuracy
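As a small illustration of the label mapping, the following sketch replaces the pipeline and the `intents` object with plain Python structures. The three label names, the scores, and the ground truth IDs are all made up for the example; only the shape of the pipeline output (a list holding one `label`/`score` dictionary per query) follows the code above.

```python
# Hypothetical three-class label set standing in for the CLINC150 intents
names = ["car_rental", "pin_change", "oos"]
str2int = {name: idx for idx, name in enumerate(names)}

# Pipeline-style outputs: one list with a label/score dict per query
pipeline_outputs = [
    [{"label": "car_rental", "score": 0.9}],
    [{"label": "oos", "score": 0.6}],
    [{"label": "pin_change", "score": 0.8}],
    [{"label": "oos", "score": 0.5}],
]
labels = [0, 2, 1, 0]  # made-up ground truth IDs

# Map each predicted label name to its integer ID
preds = [str2int[out[0]["label"]] for out in pipeline_outputs]
# Accuracy is the fraction of predictions that match the references
accuracy = sum(p == l for p, l in zip(preds, labels)) / len(labels)
print(preds, accuracy)
```

Three of the four predictions match their reference, so the accuracy here is 0.75.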

Next, let’s compute the size of our model by using the `torch.save()` function from PyTorch to serialize the model to disk.

In [None]:
list(pipe.model.state_dict().items())[42]

We can clearly see that each key/value pair corresponds to a specific layer and tensor in BERT. So if we save our model with:

In [None]:
import torch

torch.save(pipe.model.state_dict(), "model.pt")

we can then use the `Path.stat()` function from Python's `pathlib` module to get information about the underlying file. In particular, `Path("model.pt").stat().st_size` will give us the model size in bytes. Let's put this all together in the `compute_size()` function and add it to `PerformanceBenchmark`:

In [None]:
from pathlib import Path

def compute_size(self):
    state_dict = self.pipeline.model.state_dict()
    tmp_path = Path("model.pt")
    torch.save(state_dict, tmp_path)
    ## Calculate size in megabytes 
    size_mb = Path(tmp_path).stat().st_size / (1024 * 1024)
    ## Delete temporary file
    tmp_path.unlink()
    print(f"Model Size (MB) - {size_mb:.2f}")
    return {"size_mb": size_mb}

PerformanceBenchmark.compute_size = compute_size
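The same measure-on-disk pattern can be tried without a model. The sketch below is only an illustration: it uses Python's `pickle` as a stand-in for `torch.save()`, and the helper name `file_size_mb` is hypothetical. It writes into a temporary directory so that nothing is left behind.

```python
import pickle
import tempfile
from pathlib import Path


def file_size_mb(obj):
    """Serialize obj to a temporary file and return its size in megabytes."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir) / "obj.pkl"
        with open(tmp_path, "wb") as f:
            pickle.dump(obj, f)
        # st_size is in bytes, so divide by 1024 * 1024
        return tmp_path.stat().st_size / (1024 * 1024)


# 1024 * 1024 raw bytes plus a small amount of pickle overhead
payload = bytes(1024 * 1024)
size_mb = file_size_mb(payload)
print(f"Size (MB) - {size_mb:.4f}")
```

Using a temporary directory is one possible alternative to calling `unlink()` by hand; either approach cleans up the file once the size has been read.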


Finally, let's implement the `time_pipeline()` function so that we can time the average latency per query. We can use `perf_counter()` from Python's `time` module to time our pipeline by passing our test query and calculating the time difference in milliseconds between the start and end:

In [None]:
from time import perf_counter

for _ in range(3):
    start_time = perf_counter()
    _ = pipe(query)
    latency = perf_counter() - start_time
    print(f"Latency (ms): {1000 * latency:.3f}")

Timing a single pass can give noticeably different results from run to run, so instead we'll collect the latencies over many runs and then use the resulting distribution to calculate the mean and standard deviation, which will give us an idea about the spread in values. The following code does what we need and includes a phase to warm up the CPU before performing the actual timed run:

In [None]:
import numpy as np

def time_pipeline(self, query = "What is the pin number for my account?"):
    latencies = []
    ## Warm up the CPU
    for _ in range(10):
        _ = self.pipeline(query)
    ## Timed run
    for _ in range(100):
        start_time = perf_counter()
        _ = self.pipeline(query)
        latency = perf_counter() - start_time
        latencies.append(latency)
    ## Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    print(f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}")
    return {"time_avg_ms": time_avg_ms, "time_std_ms": time_std_ms}

PerformanceBenchmark.time_pipeline = time_pipeline    
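The warm-up-then-measure pattern is not specific to pipelines. As a rough sketch, the hypothetical helper `time_callable` below applies it to any zero-argument callable, using only the standard library. `statistics.pstdev` computes the population standard deviation, which matches the default behavior of `np.std`.

```python
from statistics import mean, pstdev
from time import perf_counter


def time_callable(fn, n_warmup=10, n_runs=100):
    """Return the mean and standard deviation of fn's latency in milliseconds."""
    # Warm-up calls are executed but not timed
    for _ in range(n_warmup):
        fn()
    latencies = []
    for _ in range(n_runs):
        start_time = perf_counter()
        fn()
        latencies.append(perf_counter() - start_time)
    return {
        "time_avg_ms": 1000 * mean(latencies),
        "time_std_ms": 1000 * pstdev(latencies),
    }


# Hypothetical workload standing in for a pipeline call
stats = time_callable(lambda: sum(range(10_000)))
print(stats)
```

The absolute numbers depend entirely on the machine, so they are only meaningful when compared against other runs on the same hardware.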

In [None]:
## The PerformanceBenchmark class is complete, so let's test it on our BERT baseline
pb = PerformanceBenchmark(pipe, clinc["test"])
perf_metrics = pb.run_benchmark()

## Making Models Smaller via Knowledge Distillation

Given the trend toward pretraining language models with ever-increasing parameter counts (the largest at the time of writing having over one trillion parameters), knowledge distillation has become a popular strategy to compress these huge models and make them more suitable for building practical applications.

### Knowledge Distillation for Fine-Tuning

For supervised tasks like fine-tuning, the main idea is to augment the ground truth labels with a distribution of **soft probabilities** from the large pre-trained model, which provide complementary information for the smaller model to learn from.

Mathematically, suppose we feed an input sequence x to the large pre-trained model to generate a vector of logits 

$$
z(x) = [z_1(x), z_2(x),..., z_n(x)]
$$

We can convert these logits into probabilities by applying a softmax function:

$$
\frac{\exp(z_i(x))}{\sum_j \exp(z_j(x))}
$$

In many cases the large pre-trained model will assign a high probability to one class, with all other class probabilities close to zero. When that happens, this large model doesn’t provide much additional information beyond the ground truth labels, so instead we “soften” the probabilities by scaling the logits with a temperature hyperparameter T before applying the softmax:

$$
p_i(x) = \frac{\exp(z_i(x)/T)}{\sum_j \exp(z_j(x)/T)}
$$

Higher values of $T$ produce a softer probability distribution over the classes and reveal much more information about the decision boundary that the large pre-trained model has learned for each training example. When $T = 1$ we recover the original softmax distribution.
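The effect of the temperature is easy to see numerically. The following sketch implements the softened softmax in plain Python and applies it to a made-up set of three logits; the function name `softmax_with_temperature` and the logit values are assumptions chosen for illustration.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits / T, computed stably by subtracting the max."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [5.0, 2.0, 1.0]  # hypothetical logits for three classes
for T in (1.0, 2.0, 10.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

As $T$ grows, the probability of the top class shrinks and the remaining classes receive more mass, while the ranking of the classes stays the same.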

Since the smaller model also produces softened probabilities $q_i(x)$ of its own, we can use the **Kullback-Leibler (KL)** divergence to measure the difference between the two probability distributions:

$$
D_{KL}(p, q) = \sum_i p_i(x) \log \frac{p_i(x)}{q_i(x)}
$$

With the KL divergence, we can calculate how much is lost when we approximate the probability distribution of the large pre-trained model with that of the smaller one. This allows us to define a knowledge distillation loss:

$$
L_{KD} = T^2 D_{KL}
$$

where $T^2$ is a normalization factor to account for the fact that the magnitude of the gradients produced by soft labels scales as $1/T^2$.
For classification tasks, the smaller model's loss is then a weighted average of the distillation loss and the usual cross-entropy loss $L_{CE}$ on the ground truth labels:

$$
L_{smaller} = \alpha L_{CE} + (1 - \alpha) L_{KD}
$$

where $\alpha$ is a hyperparameter that controls the relative strength of each loss.
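To check the pieces of this loss by hand, here is a minimal pure-Python sketch for a single example. The function names, the logits, and the default values of `alpha` and `T` are assumptions made for illustration; the cross-entropy term is computed at $T = 1$ and the KL term on the softened distributions.

```python
import math


def softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """D_KL(p, q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(logits_large, logits_small, label, alpha=0.5, T=2.0):
    # Cross-entropy of the smaller model against the ground truth label
    loss_ce = -math.log(softmax(logits_small)[label])
    # KL divergence between the softened distributions, rescaled by T^2
    p = softmax(logits_large, T)
    q = softmax(logits_small, T)
    loss_kd = T ** 2 * kl_divergence(p, q)
    return alpha * loss_ce + (1 - alpha) * loss_kd


# Hypothetical logits for a single example with three classes
logits_large = [4.0, 1.0, 0.5]
logits_small = [2.5, 1.5, 0.2]
print(distillation_loss(logits_large, logits_small, label=0))
```

Two limiting cases are worth noting: with `alpha=1.0` only the cross-entropy term remains, and with `alpha=0.0` the loss vanishes exactly when the two models produce identical logits.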

### Creating a Knowledge Distillation Trainer 

Since we already have a fine-tuned BERT-base model, let's see how we can use knowledge distillation to fine-tune a smaller and faster model. To do that we'll need a way to augment the cross-entropy loss with an $L_{KD}$ term. Fortunately we can do this by creating our own trainer!

To implement a knowledge distillation trainer, we need to add a few things to the `Trainer` base class:

- The new hyperparameters $\alpha$ and $T$, which control the relative weight of the distillation loss and how much the probability distribution of the labels should be smoothed.
- The fine-tuned model, which in our case is BERT-base.
- A new loss function that combines the cross-entropy loss with the knowledge distillation loss.


In [None]:
from transformers import TrainingArguments

class DistillationTrainingArguments(TrainingArguments):
    def __init__(self, *args, alpha = 0.5, temperature = 2.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.alpha = alpha
        self.temperature = temperature

For the trainer itself, we need a new loss function. The way to implement this is by subclassing `Trainer` and overriding the `compute_loss()` method to include the knowledge distillation loss term $L_{KD}$:

In [None]:
import torch.nn as nn
import torch.nn.functional as F
from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, larger_model = None, **kwargs):
        super().__init__(*args, **kwargs)
        self.larger_model = larger_model

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs_small = model(**inputs)
        ## Extract cross-entropy loss and logits from smaller
        loss_ce = outputs_small.loss
        logits_small = outputs_small.logits
        ## Extract logits from larger model
        with torch.no_grad():
            outputs_larger = self.larger_model(**inputs) 
            logits_larger = outputs_larger.logits
        ## Soften probabilites and compute distillation loss
        loss_fct = nn.KLDivLoss(reduction = "batchmean")
        loss_kd = self.args.temperature ** 2 * loss_fct(
            F.log_softmax(logits_small / self.args.temperature, dim = -1),
            F.softmax(logits_larger / self.args.temperature, dim = -1)
        )
        ## Return weighted smaller model loss
        loss = self.args.alpha * loss_ce + (1. - self.args.alpha) * loss_kd
        return (loss, outputs_small) if return_outputs else loss

### Choosing a Good Smaller Model Initialization

A good rule of thumb from the literature is that knowledge distillation works best when the larger and smaller models are of the same model type. Different model types can have different output embedding spaces, which can hinder the ability of the smaller model to mimic the larger one.

In our case study the larger model is BERT, so DistilBERT is a natural candidate to initialize the student with, since it has 40% fewer parameters and has been shown to achieve strong results on downstream tasks.

In [None]:
from transformers import AutoTokenizer

smaller_ckpt = "distilbert-base-uncased"
smaller_tokenizer = AutoTokenizer.from_pretrained(smaller_ckpt)

def tokenize_text(batch):
    return smaller_tokenizer(batch['text'], truncation = False)

clinc_enc = clinc.map(tokenize_text, batched = True, remove_columns = ['text'])
clinc_enc = clinc_enc.rename_column("intent", "labels")

In [None]:
from huggingface_hub import notebook_login
notebook_login()

As we did in the performance benchmark, we'll use accuracy as the main metric. This means we can reuse our `accuracy_score` metric in the `compute_metrics()` function that we'll pass to `DistillationTrainer`:

In [None]:
def compute_metrics(pred):
    predictions, labels = pred
    predictions = np.argmax(predictions, axis = 1)
    return accuracy_score.compute(predictions = predictions, references = labels)

In this function, the predictions from the sequence modeling head come in the form of logits, so we use the `np.argmax()` function to find the most confident class prediction and compare it against the ground truth label.

Next, we need to define the training arguments. To warm up, we'll set $\alpha$ = 1 to see how well DistilBERT performs without any signal from the BERT baseline.

In [None]:
batch_size = 16

finetuned_ckpt = "distilbert-base-uncased-finetuned-clinc"
smaller_training_args = DistillationTrainingArguments(
    output_dir = finetuned_ckpt, eval_strategy = "epoch",
    num_train_epochs = 5, learning_rate = 2e-5,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size, alpha = 1, weight_decay = 0.01,
    push_to_hub = True
)

In [None]:
id2label = pipe.model.config.id2label
label2id = pipe.model.config.label2id

In [None]:
from transformers import AutoConfig

num_labels = intents.num_classes
smaller_config = (AutoConfig.from_pretrained(smaller_ckpt, num_labels = num_labels,
                                             id2label = id2label, label2id = label2id))

In [None]:
import torch 
from transformers import AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def smaller_init():
    return (AutoModelForSequenceClassification
            .from_pretrained(smaller_ckpt, config = smaller_config).to(device))

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=smaller_tokenizer)

In [None]:
larger_ckpt = "transformersbook/bert-base-uncased-finetuned-clinc"
larger_model = (AutoModelForSequenceClassification
                .from_pretrained(larger_ckpt, num_labels = num_labels)
                .to(device))

distilbert_trainer = DistillationTrainer(model_init = smaller_init, larger_model = larger_model,
                                         args = smaller_training_args,
                                         train_dataset = clinc_enc['train'], eval_dataset = clinc_enc['validation'],
                                         compute_metrics = compute_metrics, data_collator = data_collator)

distilbert_trainer.train()

| Metric              | Value                        |
| ------------------- | ---------------------------- |
| **Total Epochs**    | 5                            |
| **Final Accuracy**  | **94.1%** on validation set  |
| **Training Loss**   | Dropped from **4.29 → 0.09** |
| **Validation Loss** | Dropped from **1.80 → 0.26** |
| **Runtime**         | \~10.6 minutes               |
| **Samples/sec**     | 119                          |
| **Steps/sec**       | 7.47                         |

Performance Observations:

- Excellent convergence: both training and validation losses drop consistently.
- No overfitting: accuracy plateaus but doesn't drop, which is a good sign.
- Strong starting point: because $\alpha = 1$ here, the smaller model is trained on the ground truth labels only, without any distillation signal from the larger model.

In [None]:
distilbert_trainer.push_to_hub("Training Completed!")

With our model now safely stored on the Hub, we can immediately use it in a pipeline for our performance benchmark:

In [None]:
finetuned_ckpt = "vjay98/distilbert-base-uncased-finetuned-clinc"
pipe = pipeline("text-classification", model=finetuned_ckpt)

In [None]:
optim_type = "DistilBERT"
pb = PerformanceBenchmark(pipe, clinc['test'], optim_type = optim_type)
perf_metrics.update(pb.run_benchmark())

| Metric              | BERT Baseline | DistilBERT                         |
| ------------------- | ------------- | ---------------------------------- |
| **Model Size (MB)** | 418.15        | **255.88** (⬇ \~39%)               |
| **Latency (ms)**    | 8.77 ± 2.30   | **4.20 ± 0.69** (⬇ \~52%)          |
| **Test Accuracy**   | 0.867         | **0.885** (+1.8 percentage points) |

Key Takeaways:

- Much faster inference (roughly 2× lower latency).
- Smaller footprint, which helps deployment on edge or low-resource environments.
- Slightly better test accuracy, even though with $\alpha = 1$ the smaller model was trained on the ground truth labels only, without the distillation term.
- No signs of underfitting: the smaller model learns the task well from the labels alone.
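The relative changes quoted above follow directly from the numbers in the comparison table. As a quick worked check (the dictionary and helper names are ours, the values are copied from the table):

```python
# Values copied from the comparison table above
bert = {"size_mb": 418.15, "time_avg_ms": 8.77, "accuracy": 0.867}
distilbert = {"size_mb": 255.88, "time_avg_ms": 4.20, "accuracy": 0.885}


def relative_reduction(before, after):
    """Fraction by which a quantity shrank relative to its starting value."""
    return (before - after) / before


size_reduction = relative_reduction(bert["size_mb"], distilbert["size_mb"])
latency_reduction = relative_reduction(bert["time_avg_ms"], distilbert["time_avg_ms"])
speedup = bert["time_avg_ms"] / distilbert["time_avg_ms"]
# Accuracy difference expressed in percentage points
accuracy_gain_points = 100 * (distilbert["accuracy"] - bert["accuracy"])

print(f"Size reduction: {size_reduction:.1%}")
print(f"Latency reduction: {latency_reduction:.1%}")
print(f"Speedup: {speedup:.2f}x")
print(f"Accuracy gain: {accuracy_gain_points:.1f} points")
```

Note that the accuracy change is an absolute difference in percentage points, not a relative improvement.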

To compare these results against our baseline, let’s create a scatter plot of the accuracy against the latency, with the radius of each point corresponding to the size of the model on disk. The following function does what we need and marks the current optimization type as a dashed circle to aid the comparison to previous results:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

def plot_metrics(perf_metrics, current_optim_type):
    df = pd.DataFrame.from_dict(perf_metrics, orient = 'index')

    for idx in df.index:
        df_opt = df.loc[idx]
        ## Add a dashed circle around the current optimization type
        if idx == current_optim_type:
            plt.scatter(df_opt["time_avg_ms"], df_opt["accuracy"] * 100,
                        alpha = 0.5, s = df_opt["size_mb"], label = idx,
                        marker = "$\u25CC$")
        else:
            plt.scatter(df_opt["time_avg_ms"], df_opt['accuracy'] * 100,
                        alpha = 0.5, s = df_opt["size_mb"], label = idx)

    legend = plt.legend(bbox_to_anchor = (1,1))
    for handle in legend.legend_handles:
        handle.set_sizes([20])

    plt.ylim(80,90)
    ## Use the slowest model to define x-axis range
    xlim = int(perf_metrics["BERT baseline"]["time_avg_ms"] + 3)
    plt.xlim(1, xlim)
    plt.ylabel("Accuracy (%)")
    plt.xlabel("Average latency (ms)")
    plt.show()                

plot_metrics(perf_metrics, optim_type)    

From the plot, we can see that by using a smaller model we've managed to significantly decrease the average latency while also slightly improving accuracy.

Let's see if we can improve this further by including the distillation loss from the larger BERT baseline model and finding good values for $\alpha$ and $T$.

### Finding Good Hyperparameters with Optuna

To find good values of $\alpha$ and $T$, we could do a grid search over the 2D parameter space. But a much better alternative is to use `Optuna`, which is an optimization framework designed for just this type of task. Optuna formulates the search problem in terms of an objective function that is optimized through multiple trials. For example, suppose we wished to minimize Rosenbrock's banana function:

$$
f(x,y) = (1 - x)^2 + 100(y - x^2)^2
$$

which is a famous test case for optimization frameworks. The function gets its name from its curved contours and has a global minimum at $(x, y) = (1, 1)$. Finding the valley is an easy optimization problem, but converging to the global minimum is not.
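A few direct evaluations help build intuition for why the valley is easy to find but the minimum is not. The sketch below is plain Python with no optimizer involved, and the sample points are arbitrary choices for illustration.

```python
def rosenbrock(x, y):
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2


# At (1, 1) both squared terms vanish, so this is the global minimum
print(rosenbrock(1, 1))

# Along the valley floor y = x^2 the second term is zero, so f reduces to
# (1 - x)^2, which changes only slowly as we move toward x = 1
for x in (-1.0, 0.0, 0.5, 0.9):
    print(x, rosenbrock(x, x ** 2))

# Stepping off the valley floor is penalized heavily by the factor of 100
print(rosenbrock(0.0, 0.5))
```

The steep walls guide an optimizer into the valley quickly, but once there the function is nearly flat, so progress toward $(1, 1)$ is slow.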

In Optuna, we can find the minimum of $f(x, y)$ by defining an `objective()` function that returns the value of $f(x, y)$:

In [None]:
def objective(trial):
    x = trial.suggest_float("x", -2, 2)
    y = trial.suggest_float("y", -2, 2)
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

In [None]:
import optuna

study = optuna.create_study()
study.optimize(objective, n_trials = 1000)

In [None]:
study.best_params

We see that with one thousand trials, Optuna has managed to find values for x and y that are reasonably close to the global minimum. To use Optuna in Transformers, we use similar logic by first defining the hyperparameter space that we wish to optimize over. In addition to α and T, we’ll include the number of training epochs as follows:

In [None]:
def hp_space(trial):
    return {"num_train_epochs": trial.suggest_int("num_train_epochs", 5, 10),
            "alpha": trial.suggest_float("alpha", 0, 1),
            "temperature": trial.suggest_int("temperature", 2, 20)}

In [None]:
best_run = distilbert_trainer.hyperparameter_search(
    n_trials = 20, direction = "maximize", hp_space = hp_space
)
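To get a feel for what a single trial looks like, the `hp_space()` function above can be exercised without running any training. The sketch below uses a hypothetical `RandomTrial` class that draws values uniformly at random. It is not part of Optuna and does not reproduce Optuna's sampling strategy; it only mimics the two `suggest_*` methods that `hp_space()` calls.

```python
import random


class RandomTrial:
    """Hypothetical stand-in for an Optuna trial that samples uniformly at random."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def suggest_int(self, name, low, high):
        # Both bounds are inclusive
        return self.rng.randint(low, high)

    def suggest_float(self, name, low, high):
        return self.rng.uniform(low, high)


def hp_space(trial):
    return {"num_train_epochs": trial.suggest_int("num_train_epochs", 5, 10),
            "alpha": trial.suggest_float("alpha", 0, 1),
            "temperature": trial.suggest_int("temperature", 2, 20)}


# Each trial yields one candidate set of hyperparameters
for seed in range(3):
    print(hp_space(RandomTrial(seed)))
```

Each dictionary corresponds to one training run in the search, which is why the number of trials dominates the total cost.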