# A Practical Guide to Improving Performance: Optimizing For Throughput

This notebook is a practical guide to tuning the throughput of your model on Tenstorrent hardware by increasing the input batch size. It also shows how to benchmark models on AI hardware correctly by measuring compilation time separately from run time.

The tutorial will walk through an example of running the [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) model on Tenstorrent AI accelerator hardware. The model weights will be directly downloaded from the [HuggingFace library](https://huggingface.co/docs/transformers/model_doc/bert) and executed through the PyBUDA SDK.

## Guide Overview

In this guide, we will walk through the steps for running a BERT model fine-tuned on the [SST2](https://nlp.stanford.edu/sentiment/index.html) dataset for the **Text Classification** task.

You will learn how to vary the input batch size of the model to achieve higher throughput performance. You will also learn how to configure a benchmark framework for evaluating the model performance.

## Step 1: Import libraries

Make sure that you have an active Python environment with the latest version of PyBUDA installed.

We will start by installing the `evaluate` library with pip. It is used to calculate the accuracy metric.

In [None]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install evaluate==0.4.0

In [None]:
# import the pybuda library and additional libraries required for this tutorial
import time
from typing import Any, Dict, List, Tuple

import evaluate
import pybuda
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader, Dataset
from transformers import BertForSequenceClassification, BertTokenizer

## Step 2: Create helper classes and functions

We will create some helper classes and functions to improve code reusability throughout this tutorial.

* `SST2Dataset` -- Python class to hold a preprocessed version of the SST2 dataset used for evaluation
* `eval_fn` -- function to compute the evaluation score

In [None]:
# Create a Dataset Class to preprocess the data
class SST2Dataset(Dataset):
    """Configurable SST-2 Dataset."""

    def __init__(self, dataset: Any, tokenizer: Any, split: str, seq_len: int):
        """
        Init and preprocess SST-2 dataset.

        Parameters
        ----------
        dataset : Any
            SST-2 dataset
        tokenizer : Any
            tokenizer object from HuggingFace
        split : str
            Which split to use, one of ["train", "validation", "test"]
        seq_len : int
            Sequence length
        """
        self.sst2 = dataset[split]
        self.data = [
            (
                tokenizer(
                    item["sentence"],
                    return_tensors="pt",
                    max_length=seq_len,
                    padding="max_length",
                    return_token_type_ids=False,
                    truncation=True,
                ),
                item["label"],
            )
            for item in self.sst2
        ]

        for data in self.data:
            tokenized = data[0]
            for item in tokenized:
                tokenized[item] = tokenized[item].squeeze()

    def __len__(self) -> int:
        """
        Return length of dataset.

        Returns
        -------
        int
            Length of dataset
        """
        return len(self.data)

    def __getitem__(self, index: int) -> Tuple[Dict[str, torch.Tensor], int]:
        """
        Return sample from dataset.

        Parameters
        ----------
        index : int
            Index of sample

        Returns
        -------
        Tuple
            Data sample in format of X, y
        """
        X, y = self.data[index]
        return X, y

In [None]:
# Define evaluation function
def eval_fn(outputs: List[torch.Tensor], labels: List[torch.Tensor], metric_type: str) -> float:
    """
    Evaluation function for measuring model accuracy.

    Parameters
    ----------
    outputs : List[torch.Tensor]
        Predicted outputs (logits) from the model, one tensor per batch
    labels : List[torch.Tensor]
        True labels, one tensor per batch
    metric_type : str
        Type of metric to return e.g. accuracy, recall, precision, etc.

    Returns
    -------
    float
        Evaluation score.
    """

    # set evaluation metric for dataset
    accuracy_metric = evaluate.load(metric_type)

    # initialize lists to store predictions and labels
    pred_labels = []
    true_labels = []

    # store all predictions as plain Python ints
    for output in outputs:
        pred_labels.extend(torch.argmax(output, dim=-1).tolist())

    # store all labels as plain Python ints
    for label in labels:
        true_labels.extend(label.tolist())

    # compute the accuracy
    eval_score = accuracy_metric.compute(references=true_labels, predictions=pred_labels)

    return eval_score[metric_type]
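
To make the metric concrete, here is a minimal pure-Python sketch of what `eval_fn` computes when `metric_type` is `"accuracy"`: take the argmax of each row of logits, flatten the per-batch predictions and labels, and count the matches. The helper names `argmax` and `accuracy` and the toy logits are made up for illustration. The notebook itself relies on the `evaluate` library rather than this code.

```python
from typing import List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Return the index of the largest value in one row of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logit_batches: List[List[List[float]]], label_batches: List[List[int]]) -> float:
    """Flatten per-batch logits and labels, then return the fraction of correct predictions."""
    predictions = [argmax(row) for batch in logit_batches for row in batch]
    references = [label for batch in label_batches for label in batch]
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return correct / len(references)


# Two hypothetical batches of two samples each (made-up logits for two classes)
toy_logits = [[[0.1, 0.9], [2.0, -1.0]], [[0.3, 0.2], [-0.5, 0.5]]]
toy_labels = [[1, 0], [1, 1]]
toy_accuracy = accuracy(toy_logits, toy_labels)
print(f"accuracy: {toy_accuracy:.2f}")
```

Three of the four toy predictions match their labels, so the sketch reports an accuracy of 0.75.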

## Step 3: Download the model weights from HuggingFace

In [None]:
# Load BERT tokenizer and model from HuggingFace for text classification task
model_ckpt = "textattack/bert-base-uncased-SST-2"
tokenizer = BertTokenizer.from_pretrained(model_ckpt)
model = BertForSequenceClassification.from_pretrained(model_ckpt)

## Step 4: Set optimal configurations

For every model, you can adjust TT-BUDA configuration parameters to achieve optimized performance. Some key parameters include:

* Data format e.g. BFP8, FP16_b, FP16, etc.
* Math fidelity
* Balancer policy
* etc...

For a full list of tuneable parameters, please refer to the TT-BUDA documentation: <https://docs.tenstorrent.com/tenstorrent/>

In [None]:
# Set optimal configurations
compiler_cfg = pybuda.config._get_global_compiler_config()
compiler_cfg.default_df_override = pybuda._C.DataFormat.Float16_b
compiler_cfg.enable_auto_transposing_placement = True
compiler_cfg.balancer_policy = "Ribbon"

## Step 5: Instantiate Tenstorrent device

The first time we use PyBUDA, we must initialize a `TTDevice` object which serves as the abstraction over the target hardware.

In [None]:
tt0 = pybuda.TTDevice(
    name="tt_device_0",  # here we can give our device any name we wish, for tracking purposes
    arch=pybuda.BackendDevice.Grayskull  # we set the target device architecture to compile for
)

## Step 6: Create a PyBUDA module from PyTorch model

Next, we must abstract the PyTorch model loaded from HuggingFace into a `pybuda.PyTorchModule` object. This will let the BUDA compiler know which model architecture and AI framework it has to compile.

We then "place" this module onto the previously initialized `TTDevice`.

In [None]:
# Create module
pybuda_module = pybuda.PyTorchModule(
    name="pt_bert_text_classification",  # give the module a name, this will be used for tracking purposes
    module=model  # specify the model that is being targeted for compilation
)

# Place module on device
tt0.place_module(module=pybuda_module)

## Step 7: Load the SST2 dataset for evaluation

In [None]:
dataset = SST2Dataset(dataset=load_dataset("glue", "sst2"), tokenizer=tokenizer, split="validation", seq_len=128)

## Step 8: Set the batch size, prep the dataset, and load a sample input

In [None]:
# set batch size
batch_size = 64

# prepare the dataset for specified batch size
generator = DataLoader(dataset, batch_size=batch_size, shuffle=False, drop_last=True)

# get sample input
sample_input, _ = next(iter(generator))
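
Because the model is compiled for a fixed batch size in the next step, `drop_last=True` discards any final partial batch so that every batch has the same shape. The sketch below shows the resulting arithmetic. The helper name `batches_with_drop_last` is hypothetical, and the figure of 872 validation sentences is an assumption about the SST2 validation split. Check `len(dataset)` in your own run.

```python
from typing import Tuple


def batches_with_drop_last(num_samples: int, batch_size: int) -> Tuple[int, int, int]:
    """Return (full batches, samples evaluated, samples dropped) when drop_last=True."""
    num_batches = num_samples // batch_size
    evaluated = num_batches * batch_size
    return num_batches, evaluated, num_samples - evaluated


# Assumption: the validation split holds 872 sentences
num_batches, evaluated, dropped = batches_with_drop_last(num_samples=872, batch_size=64)
print(f"{num_batches} batches, {evaluated} samples evaluated, {dropped} dropped")
```

Under that assumption, a batch size of 64 gives 13 full batches covering 832 samples, with 40 samples left out of both the throughput and accuracy figures.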

## Step 9: Compile the model with fixed batch size

In [None]:
start_compilation_time = time.time()
output_q = pybuda.initialize_pipeline(training=False, sample_inputs=list(sample_input.values()))
end_compilation_time = time.time()
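
The timing pattern used here, one measurement for the one-time compilation and a separate one for the steady-state loop, can be sketched independently of PyBUDA. In the example below, `timed`, `fake_compile`, and `fake_inference_loop` are hypothetical stand-ins that only sleep. It also uses `time.perf_counter()`, a monotonic clock intended for measuring intervals, in place of the `time.time()` calls in this notebook.

```python
import time
from typing import Callable, Tuple


def timed(fn: Callable[[], object]) -> Tuple[object, float]:
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def fake_compile() -> str:
    """Hypothetical stand-in for a one-time compilation step."""
    time.sleep(0.05)
    return "compiled"


def fake_inference_loop(num_batches: int = 5) -> int:
    """Hypothetical stand-in for the steady-state inference loop."""
    for _ in range(num_batches):
        time.sleep(0.01)
    return num_batches


# Measure the one-time cost and the repeated cost separately
_, compile_seconds = timed(fake_compile)
batches_run, run_seconds = timed(fake_inference_loop)

batch_size = 64
throughput = batches_run * batch_size / run_seconds
print(f"compile: {compile_seconds:.3f}s, run: {run_seconds:.3f}s, throughput: {throughput:.1f} samples/s")
```

Keeping the two measurements apart means the throughput figure reflects only the work that repeats for every batch, not the setup cost that is paid once.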

## Step 10: Run benchmark on SST2 dataset with `batch_size==64`

In [None]:
# Run benchmark loop
store_outputs = []
store_labels = []
start_runtime_time = time.time()
for batch, labels in generator:
    # push input to Tenstorrent device
    tt0.push_to_inputs(batch)

    # run inference on Tenstorrent device
    pybuda.run_forward(input_count=1)
    output = output_q.get()  # fetch this batch's result from the output queue returned by initialize_pipeline

    # store outputs
    store_labels.append(labels)
    store_outputs.append(output[0].value())
end_runtime_time = time.time()

# Process output times
total_runtime_time = end_runtime_time - start_runtime_time
total_compilation_time = end_compilation_time - start_compilation_time
total_samples = len(generator) * batch_size
eval_score = eval_fn(store_outputs, store_labels, "accuracy")

In [None]:
# Display results
print("Benchmark Result")
print(f" Model compilation time: {total_compilation_time:.3f}s")
print(f" Total run time for {total_samples} inputs: {total_runtime_time:.3f}s")
print(f" Throughput: {(total_samples / total_runtime_time):.1f} samples/s")
print(f" Accuracy: {(eval_score * 100):.1f}%")
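
To compare batch sizes, the benchmark has to be repeated once per size. Since the model is compiled for a fixed batch size, each run needs its own compilation, which should stay outside the timed region. The sketch below shows one way to organize such a sweep. `sweep_batch_sizes` and `dummy_benchmark` are hypothetical names, and the dummy workload only sleeps, so its numbers say nothing about real hardware.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def sweep_batch_sizes(
    run_benchmark: Callable[[int], Tuple[int, float]],
    batch_sizes: Iterable[int],
) -> Dict[int, float]:
    """Call run_benchmark(batch_size) for each size and return samples/s keyed by batch size.

    run_benchmark must return (samples processed, run time in seconds), with any
    setup or compilation time already excluded from the run time.
    """
    throughput = {}
    for batch_size in batch_sizes:
        samples, run_seconds = run_benchmark(batch_size)
        throughput[batch_size] = samples / run_seconds
    return throughput


def dummy_benchmark(batch_size: int, num_samples: int = 256) -> Tuple[int, float]:
    """Hypothetical stand-in workload: fixed setup cost, then a fixed cost per batch."""
    time.sleep(0.01)  # stand-in for per-batch-size setup, deliberately not timed
    num_batches = num_samples // batch_size
    start = time.perf_counter()
    for _ in range(num_batches):
        time.sleep(0.001)  # pretend each batch takes the same time regardless of size
    return num_batches * batch_size, time.perf_counter() - start


results = sweep_batch_sizes(dummy_benchmark, batch_sizes=[8, 32, 64])
for size, samples_per_second in results.items():
    print(f"batch_size={size}: {samples_per_second:.1f} samples/s")
```

In a real sweep, `run_benchmark` would wrap Steps 8 to 10 of this notebook for the given batch size and return `total_samples` and `total_runtime_time`.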

## Step 11: Shutdown PyBUDA

In [None]:
pybuda.shutdown()