<a target="_blank" href="https://colab.research.google.com/github/shaankhosla/optimizingllms/blob/main/notebooks/Inference_Optimizations.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [None]:
%%capture
!pip3 install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cpu
!pip3 install transformers

# Compile Model

[Source](https://pytorch.org/blog/Accelerating-Hugging-Face-and-TIMM-models/)

In [None]:
import time

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Eager (uncompiled) model, used as the baseline.
slow_model = BertModel.from_pretrained("bert-base-uncased")
# torch.compile wraps the same model; compilation is deferred until the first call.
fast_model = torch.compile(slow_model)
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors="pt")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Baseline: time 10 forward passes of the uncompiled model.
for _ in range(10):
    st_time = time.time()
    output = slow_model(**encoded_input)
    print(time.time() - st_time)

0.28278350830078125
0.1829512119293213
0.3486652374267578
0.46073079109191895
0.1964707374572754
0.15879392623901367
0.14906883239746094
0.15548968315124512
0.23986124992370605
0.3142368793487549


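With the compiled model, the first call is slow because compilation happens on that call: in the run below it takes about 52 s, while later calls take about 0.11 to 0.12 s each (compared with roughly 0.15 to 0.46 s for the uncompiled model above). For that reason it is common practice to warm up a model before benchmarking it.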

In [None]:
# Same loop with the compiled model; the first iteration includes compilation time.
for _ in range(10):
    st_time = time.time()
    output = fast_model(**encoded_input)
    print(time.time() - st_time)

52.09566831588745
0.11190652847290039
0.11991548538208008
0.11152386665344238
0.12228822708129883
0.11652398109436035
0.12348484992980957
0.11616969108581543
0.11203479766845703
0.11759781837463379
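
The warm-up practice mentioned above can be sketched as a small timing helper. This is a minimal illustration, not part of the original notebook: the names `benchmark` and `FirstCallSlow` are hypothetical, and `FirstCallSlow` is only a stand-in for a model whose first call pays a one-off cost such as compilation.

```python
import time
from typing import Callable, List


def benchmark(fn: Callable[[], object], warmup: int = 1, runs: int = 10) -> List[float]:
    """Return per-call wall-clock times for fn, measured after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()  # untimed: absorbs one-off costs such as compilation
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


class FirstCallSlow:
    """Hypothetical stand-in for a compiled model: only the first call is slow."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self) -> None:
        self.calls += 1
        if self.calls == 1:
            time.sleep(0.05)  # simulated one-off setup cost


fake_model = FirstCallSlow()
timings = benchmark(fake_model, warmup=1, runs=5)
print(f"timed {len(timings)} calls after warm-up, slowest: {max(timings):.6f}s")
```

In this notebook the same helper could be called as `benchmark(lambda: fast_model(**encoded_input))`, so that the compilation call is excluded from the reported timings.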


# Batching Inference

[Source](https://huggingface.co/docs/transformers/main/main_classes/pipelines#pipeline-chunk-batching)

In [None]:
from transformers import pipeline
from torch.utils.data import Dataset
from tqdm.auto import tqdm

# device=0 places the pipeline on the first CUDA GPU, so this cell needs a GPU runtime.
pipe = pipeline("text-classification", device=0)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [None]:
class FastDataset(Dataset):
    """5000 identical short inputs, so items in a batch all have the same length."""

    def __len__(self):
        return 5000

    def __getitem__(self, i):
        return "This is a test"


fast_dataset = FastDataset()

for batch_size in [1, 8, 64, 256]:
    print("-" * 30)
    print(f"Streaming batch_size={batch_size}")
    for out in tqdm(pipe(fast_dataset, batch_size=batch_size), total=len(fast_dataset)):
        pass

------------------------------
Streaming batch_size=1


  0%|          | 0/5000 [00:00<?, ?it/s]

------------------------------
Streaming batch_size=8


  0%|          | 0/5000 [00:00<?, ?it/s]

------------------------------
Streaming batch_size=64


  0%|          | 0/5000 [00:00<?, ?it/s]

------------------------------
Streaming batch_size=256


  0%|          | 0/5000 [00:00<?, ?it/s]

In [None]:
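class SlowDataset(Dataset):
    """5000 inputs where every 64th item is 100 times longer than the rest."""

    def __len__(self):
        return 5000

    def __getitem__(self, i):
        # Items at indices divisible by 64 repeat the sentence 100 times.
        n = 100 if i % 64 == 0 else 1
        return "This is a test" * n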


slow_dataset = SlowDataset()

for batch_size in [1, 8, 64, 256]:
    print("-" * 30)
    print(f"Streaming batch_size={batch_size}")
    for out in tqdm(pipe(slow_dataset, batch_size=batch_size), total=len(slow_dataset)):
        pass

------------------------------
Streaming batch_size=1


  0%|          | 0/5000 [00:00<?, ?it/s]

------------------------------
Streaming batch_size=8


  0%|          | 0/5000 [00:00<?, ?it/s]

------------------------------
Streaming batch_size=64


  0%|          | 0/5000 [00:00<?, ?it/s]

------------------------------
Streaming batch_size=256


  0%|          | 0/5000 [00:00<?, ?it/s]
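
The pipeline documentation linked above notes that batching can hurt when a few inputs are much longer than the rest, because every item in a batch is padded to the longest one. The sketch below estimates that padding cost for `SlowDataset` at each batch size. It is a rough illustration under stated assumptions: it counts whitespace-separated words instead of real tokenizer tokens, and the helper names `item_length` and `padded_total` are hypothetical.

```python
def item_length(i: int) -> int:
    """Whitespace word count of SlowDataset item i (a crude proxy for token count)."""
    n = 100 if i % 64 == 0 else 1
    return len(("This is a test" * n).split())


def padded_total(num_items: int, batch_size: int) -> int:
    """Total positions processed when each batch is padded to its longest item."""
    total = 0
    for start in range(0, num_items, batch_size):
        end = min(start + batch_size, num_items)
        lengths = [item_length(i) for i in range(start, end)]
        total += max(lengths) * len(lengths)
    return total


NUM_ITEMS = 5000
unpadded = sum(item_length(i) for i in range(NUM_ITEMS))
for batch_size in [1, 8, 64, 256]:
    padded = padded_total(NUM_ITEMS, batch_size)
    print(f"batch_size={batch_size}: padded/unpadded = {padded / unpadded:.1f}x")
```

Under this proxy, batch sizes of 64 and 256 put one long item in every batch, so all 5000 items are padded to the long length. With `FastDataset` all items have the same length and no padding is wasted at any batch size.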