# A Practical Guide to Serving AI Models on Tenstorrent Hardware: Deploying BERT with FastAPI

This notebook is a practical guide to deploying an AI model on Tenstorrent hardware as an inference service using FastAPI.

The tutorial walks through running the [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) model on Tenstorrent AI accelerator hardware. The model weights are downloaded directly from the [HuggingFace library](https://huggingface.co/docs/transformers/model_doc/bert) and executed through the PyBUDA SDK. FastAPI is used to build the RESTful API.

## Step 1: Import libraries

Make sure that you have an active Python environment with the latest version of PyBUDA installed.

Start by installing the libraries required to build the RESTful API with pip: `fastapi`, `uvicorn`, and `nest-asyncio`.

In [None]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install fastapi==0.85.1 uvicorn==0.19.0 nest-asyncio==1.5.8
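Before moving on, it can help to confirm that the required packages are importable from the current kernel. The sketch below uses only the standard library. The helper name `find_missing` is made up for this example, and note that the import name `nest_asyncio` differs from the pip name `nest-asyncio`.

```python
from importlib.util import find_spec
from typing import Iterable, List


def find_missing(module_names: Iterable[str]) -> List[str]:
    """Return the top-level module names that the import system cannot find."""
    return [name for name in module_names if find_spec(name) is None]


# Import names (not pip names) of the libraries used in this tutorial
required = ["pybuda", "torch", "transformers", "fastapi", "uvicorn", "nest_asyncio", "requests"]
missing = find_missing(required)
print("Missing modules:", missing if missing else "none")
```

If the list is not empty, install the missing packages into the same environment that runs the notebook kernel.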

In [None]:
# import the pybuda library and additional libraries required for this tutorial
import os
from threading import Thread
from typing import Dict, Tuple, Union

import nest_asyncio
import pybuda
import requests
import torch
import uvicorn
from fastapi import FastAPI
from transformers import BertForSequenceClassification, BertTokenizer

## Step 2: Build a Handler class

We're going to build a Handler class that will act as the interface to be deployed.

The class will hold the following methods:

* `initialize` -- initialize / compile the model
* `preprocess` -- preprocess the user input for the model
* `inference` -- run inference on the model
* `postprocessing` -- postprocess the model outputs (logits)
* `handle` -- pull all of the steps together
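
The model is compiled for a fixed input shape, so the `preprocess` step has to turn text of any length into exactly `seqlen` tokens. The handler delegates this to the HuggingFace tokenizer. The following is only a simplified, plain-Python sketch of the padding and truncation idea: the function name `pad_or_truncate` and the token ids are illustrative, and unlike the real tokenizer it does not preserve special tokens when truncating.

```python
from typing import List, Tuple


def pad_or_truncate(token_ids: List[int], seqlen: int, pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Force a token sequence to exactly `seqlen` entries and build its attention mask."""
    kept = token_ids[:seqlen]                        # drop anything past seqlen
    n_pad = seqlen - len(kept)                       # number of padding slots to fill
    input_ids = kept + [pad_id] * n_pad              # pad on the right
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


ids, mask = pad_or_truncate([101, 7592, 102], seqlen=6)  # made-up token ids
print(ids)   # [101, 7592, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0]
```

The attention mask tells the model which positions hold real tokens, so the padding does not influence the prediction.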

In [None]:
class BERTHandler:
    """
    A class to represent a BERT model RESTful API handler.

    ...

    Attributes
    ----------
    initialized : bool
        Flag to mark if the model has been compiled or not
    device0 : pybuda.TTDevice
        Tenstorrent device object which represents the hardware target to deploy model on
    seqlen : int
        Input sequence length

    Methods
    -------
    initialize():
        Initializes the model by downloading the weights, selecting the hardware target, and compiling the model
    preprocess(input_text):
        Preprocess the input (apply tokenization)
    inference(processed_inputs):
        Run inference on device
    postprocessing(logits):
        Run post-processing on logits from model
    handle(text_input):
        Run all of the steps on user inputs
    """

    def __init__(self, seqlen: int = 128):
        """
        Constructs all the necessary attributes for the BERTHandler object.

        Parameters
        ----------
        seqlen : int, optional
            Input sequence length, by default 128
        """
        self.initialized = False
        self.device0 = None
        self.seqlen = seqlen

    def initialize(self):
        """
        Initialize and compile model pipeline.
        """

        # Set logging levels
        os.environ["LOGURU_LEVEL"] = "ERROR"
        os.environ["LOGGER_LEVEL"] = "ERROR"

        # Load BERT tokenizer and model from HuggingFace for text classification task
        model_ckpt = "assemblyai/bert-large-uncased-sst2"
        model = BertForSequenceClassification.from_pretrained(model_ckpt)
        self.tokenizer = BertTokenizer.from_pretrained(model_ckpt)

        # Initialize TTDevice object
        tt0 = pybuda.TTDevice(
            name="tt_device_0",  # here we can give our device any name we wish, for tracking purposes
        )

        # Create PyBUDA module
        pybuda_module = pybuda.PyTorchModule(
            name="pt_bert_text_classification",  # give the module a name, this will be used for tracking purposes
            module=model  # specify the model that is being targeted for compilation
        )

        # Place module on device
        tt0.place_module(module=pybuda_module)
        self.device0 = tt0

        # Load data sample to compile model
        sample_input = self.preprocess("sample input text")

        # Push input to model
        self.device0.push_to_inputs(*sample_input)

        # Compile & initialize the pipeline for inference, with given shapes
        output_q = pybuda.run_inference()
        _ = output_q.get()

        # Configure initialization flag
        self.initialized = True
        print("BERTHandler initialized.")

    def preprocess(self, input_text: str) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Preprocess the user inputs.

        Parameters
        ----------
        input_text : str
            User input

        Returns
        -------
        Tuple[torch.Tensor, torch.Tensor]
            Processed outputs: `input_ids` and `attention_mask`
        """

        input_tokens = self.tokenizer(
            input_text,
            max_length=self.seqlen,  # set the maximum input context length
            padding="max_length",  # pad to max length for fixed input size
            truncation=True,  # truncate to max length
            return_tensors="pt",  # return PyTorch tensor
        )

        return (input_tokens["input_ids"], input_tokens["attention_mask"])

    def inference(self, processed_inputs: Tuple[torch.Tensor, torch.Tensor]) -> torch.Tensor:
        """
        Run inference on Tenstorrent hardware.

        Parameters
        ----------
        processed_inputs : Tuple[torch.Tensor, torch.Tensor]
            Processed inputs: `input_ids` and `attention_mask`

        Returns
        -------
        torch.Tensor
            Output logits from model
        """

        self.device0.push_to_inputs(*processed_inputs)
        output_q = pybuda.run_inference()
        output = output_q.get()
        logits = output[0].value().detach()
        return logits

    def postprocessing(self, logits: torch.Tensor) -> Dict[str, Union[str, float]]:
        """
        Post-process logits and return dictionary with prediction and confidence score.

        Parameters
        ----------
        logits : torch.Tensor
            Predicted logits from model

        Returns
        -------
        Dict[str, Union[str, float]]
            Output dictionary with predicted class and confidence score
        """

        probabilities = torch.softmax(logits, dim=1)
        confidences, predicted_classes = torch.max(probabilities, dim=1)
        confidence = confidences.item()  # batch size is 1, so extract the single value
        predicted_class = predicted_classes.item()  # class index: 1 is treated as positive
        output = {
            "predicted sentiment": "positive" if predicted_class == 1 else "negative",
            "confidence": confidence,
        }

        return output

    def handle(self, text_input: str) -> Dict[str, Union[str, float]]:
        """
        Handler function which runs end-to-end model pipeline

        Parameters
        ----------
        text_input : str
            User input

        Returns
        -------
        Dict[str, Union[str, float]]
            Output dictionary with predicted class and confidence score
        """

        # Data preprocessing
        processed_text = self.preprocess(text_input)

        # Run inference
        logits = self.inference(processed_text)

        # Data postprocessing
        output = self.postprocessing(logits)

        return output
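
To see what the `postprocessing` step computes without needing a device, here is a minimal sketch in plain Python that stands in for `torch.softmax` and `torch.max`. The function name `postprocess_logits` and the logit values are made up for illustration. Following the handler above, class index 1 is assumed to mean "positive".

```python
import math
from typing import Dict, List, Union


def postprocess_logits(logits: List[float]) -> Dict[str, Union[str, float]]:
    """Turn two raw class scores into a label and a confidence using a softmax."""
    shift = max(logits)                                   # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probabilities = [e / total for e in exps]             # softmax: values sum to 1
    predicted_class = probabilities.index(max(probabilities))  # argmax
    return {
        "predicted sentiment": "positive" if predicted_class == 1 else "negative",
        "confidence": probabilities[predicted_class],
    }


# Made-up logits in which class 1 has the larger score
print(postprocess_logits([-1.2, 2.3]))
```

The confidence is simply the softmax probability of the winning class, so it always lies between 0.5 and 1.0 for a two-class model.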


## Step 3: Create FastAPI App

We're going to use FastAPI to develop a simple RESTful API. You can experiment with alternative frameworks such as Flask and TorchServe to build your own application!

In [None]:
# Create FastAPI app
app = FastAPI(
    title="BERT Sentiment Analysis",
    description="Inference engine to classify texts.",
    version="1.0",
)

# Initialize model on startup
@app.on_event("startup")
async def startup():
    global model
    model = BERTHandler()
    model.initialize()

# Safely shutdown on exit
@app.on_event("shutdown")
async def shutdown():
    pybuda.shutdown()
    pybuda.pybuda_reset()

# Call handler on post request
@app.post("/sentiment_v1/")
async def sentiment_v1(input_text: str) -> Dict[str, Union[str, float]]:
    response = model.handle(input_text)
    return response

## Step 4: Launch App on localhost

Launch the app on localhost. The model first needs to initialize and compile, which can take 1-2 minutes.

You can query the model once you see the following message:

```
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
BERTHandler initialized.
```

In [None]:
# Run asyncio in Jupyter
nest_asyncio.apply()

# Define `uvicorn` command to launch app on localhost
def run():
    uvicorn.run(app, port=8000, host="localhost")

# Start app on thread
thread = Thread(target=run)
thread.start()
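
Because the server runs in a background thread and compilation takes a while, it can be convenient to block until the app is reachable instead of watching the log. The sketch below polls the TCP port using only the standard library. The helper name `wait_for_port` and its default timeouts are assumptions for this example. It also assumes that the server only starts accepting connections once startup has finished; if your setup opens the port earlier, treat a successful connection as a hint rather than a guarantee.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 300.0, interval_s: float = 2.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True              # something is listening on the port
        except OSError:
            time.sleep(interval_s)       # not reachable yet; retry shortly
    return False


# Example usage after starting the thread above:
# ready = wait_for_port("localhost", 8000)
```

A return value of `False` means nothing accepted a connection within the timeout, in which case the server log is the place to look for errors.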

## Step 5: Query the model

Send POST requests to your deployed model using the following code.

Try changing `INPUT_TEXT` to sentences with different sentiment and observe the outputs.

In [None]:
# ↓↓↓↓↓↓↓↓ CONFIGURE INPUT ↓↓↓↓↓↓↓↓
INPUT_TEXT = "TT-BUDA is awesome!"
# ↑↑↑↑↑↑↑↑ CONFIGURE INPUT ↑↑↑↑↑↑↑↑

# Set localhost url for app
url = "http://localhost:8000/sentiment_v1/"

# Issue post request
input_text = {"input_text": INPUT_TEXT}
response = requests.post(url, params=input_text).json()

# Display outputs
print(f"Statement: {INPUT_TEXT}\nPredicted sentiment: {response['predicted sentiment']}\nConfidence: {response['confidence']*100:.0f}%")
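
Note that the request above passes the text through `params`, so it travels in the URL query string rather than in a JSON body. This matches the endpoint, because FastAPI treats a plain `str` function argument such as `input_text` as a query parameter. As a rough sketch of the URL this produces, here is the encoding done with the standard library (the helper name `build_query_url` is made up for this example):

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, input_text: str) -> str:
    """Append `input_text` to the URL as a percent-encoded query parameter."""
    return f"{base_url}?{urlencode({'input_text': input_text})}"


print(build_query_url("http://localhost:8000/sentiment_v1/", "TT-BUDA is awesome!"))
# http://localhost:8000/sentiment_v1/?input_text=TT-BUDA+is+awesome%21
```

Sending the text in the query string keeps the example short, but long inputs are usually better carried in a request body.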

Congratulations on deploying your first RESTful API on Tenstorrent AI hardware!

With this framework, you can now build your own AI applications on Tenstorrent AI hardware and deploy them as real services.