<a target="_blank" href="https://colab.research.google.com/github/shaankhosla/optimizingllms/blob/main/notebooks/Inference_Optimizations.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Use a GPU runtime for this notebook: the batching section runs its pipeline on `device=0`.

In [1]:
%%capture
!pip3 install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cpu
!pip3 install transformers

# Compile Model

[Source](https://pytorch.org/blog/Accelerating-Hugging-Face-and-TIMM-models/)

In [2]:
import time

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
slow_model = BertModel.from_pretrained("bert-base-uncased")
# torch.compile returns a wrapper right away; compilation itself runs on the first call.
fast_model = torch.compile(slow_model)
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors="pt")

In [3]:
for _ in range(10):
    st_time = time.time()
    output = slow_model(**encoded_input)
    print(time.time() - st_time)

0.36992669105529785
0.31711244583129883
0.2365107536315918
0.14289307594299316
0.11475658416748047
0.09571218490600586
0.08169341087341309
0.12516355514526367
0.1393725872039795
0.21643781661987305


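The first call to a compiled model is slow because `torch.compile` compiles lazily, on that first call. In the compiled run below it takes about 57 seconds, against roughly 0.12 seconds for each later call. Subsequent calls reuse the compiled code, so it's common practice to warm up your model before you start benchmarking it.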
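To keep warm-up calls out of the measurements, the timing loop can be wrapped in a small helper. This is a minimal sketch using only the standard library; `benchmark` and its `warmup` and `runs` parameters are hypothetical names introduced here, and the workload is a stand-in for a model call.

```python
import time
from typing import Callable, List


def benchmark(fn: Callable[[], object], warmup: int = 1, runs: int = 10) -> List[float]:
    """Call fn `warmup` times untimed, then return `runs` timings in seconds."""
    for _ in range(warmup):
        fn()  # untimed: absorbs one-off costs such as compilation
    timings = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic clock intended for short intervals
        fn()
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in workload. In this notebook the callable would be something like
# lambda: fast_model(**encoded_input)
timings = benchmark(lambda: sum(range(10_000)), warmup=2, runs=5)
print(timings)
```

Passing a zero-argument callable keeps the helper independent of the model, so the same code can time both `slow_model` and `fast_model`.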

In [4]:
for _ in range(10):
    st_time = time.time()
    output = fast_model(**encoded_input)
    print(time.time() - st_time)

56.6559853553772
0.12535476684570312
0.20423340797424316
0.12020540237426758
0.1245884895324707
0.12730884552001953
0.123870849609375
0.12383508682250977
0.11982321739196777
0.1189727783203125
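The two runs are easier to compare once the first call is set aside. The sketch below copies the timings printed by the two cells above and compares medians; `steady_state` is a hypothetical helper, and dropping exactly one call is an assumption about how long warm-up lasts.

```python
from statistics import median

# Timings in seconds, copied from the two cell outputs above.
eager = [
    0.36992669105529785, 0.31711244583129883, 0.2365107536315918,
    0.14289307594299316, 0.11475658416748047, 0.09571218490600586,
    0.08169341087341309, 0.12516355514526367, 0.1393725872039795,
    0.21643781661987305,
]
compiled = [
    56.6559853553772, 0.12535476684570312, 0.20423340797424316,
    0.12020540237426758, 0.1245884895324707, 0.12730884552001953,
    0.123870849609375, 0.12383508682250977, 0.11982321739196777,
    0.1189727783203125,
]


def steady_state(timings, skip=1):
    """Drop the first `skip` calls, which include warm-up or compilation."""
    return timings[skip:]


eager_median = median(steady_state(eager))
compiled_median = median(steady_state(compiled))
print(f"eager median:    {eager_median:.3f} s")
print(f"compiled median: {compiled_median:.3f} s")
print(f"extra cost of first compiled call: {compiled[0] - compiled_median:.1f} s")
```

In this run the medians are close (about 0.139 s eager against 0.124 s compiled), while the first compiled call costs roughly 56.5 s more than a later one, so the compilation cost only pays off over many calls.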


# Batching Inference

[Source](https://huggingface.co/docs/transformers/main/main_classes/pipelines#pipeline-chunk-batching)

In [5]:
from transformers import pipeline
from torch.utils.data import Dataset
from tqdm.auto import tqdm

pipe = pipeline("text-classification", device=0)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [6]:
class FastDataset(Dataset):
    # Every item is the same short sentence, so all items in a batch have equal length.
    def __len__(self):
        return 5000

    def __getitem__(self, i):
        return "This is a test"


fast_dataset = FastDataset()

# Stream the whole dataset through the pipeline once per batch size.
for batch_size in [1, 8, 64, 256]:
    print("-" * 30)
    print(f"Streaming batch_size={batch_size}")
    for out in tqdm(pipe(fast_dataset, batch_size=batch_size), total=len(fast_dataset)):
        pass

------------------------------
Streaming batch_size=1
------------------------------
Streaming batch_size=8
------------------------------
Streaming batch_size=64
------------------------------
Streaming batch_size=256
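The `batch_size` argument controls how many dataset items are grouped into one forward pass. As a conceptual sketch only (this is not the pipeline's actual implementation, and `batched` is a hypothetical helper), the grouping looks like this:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to `batch_size` items."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


texts = ["This is a test"] * 10
print([len(b) for b in batched(texts, 4)])  # [4, 4, 2]
```

Because the helper is a generator, items are consumed lazily, which matches the streaming style of the loop above.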

In [7]:
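class SlowDataset(Dataset):
    # Every 64th item is 100x longer than the rest, so any batch that
    # contains it has to be padded to that length.
    def __len__(self):
        return 5000

    def __getitem__(self, i):
        if i % 64 == 0:
            n = 100
        else:
            n = 1
        return "This is a test" * n


slow_dataset = SlowDataset()

# Stream the whole dataset through the pipeline once per batch size.
for batch_size in [1, 8, 64, 256]:
    print("-" * 30)
    print(f"Streaming batch_size={batch_size}")
    for out in tqdm(pipe(slow_dataset, batch_size=batch_size), total=len(slow_dataset)):
        pass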

------------------------------
Streaming batch_size=1
------------------------------
Streaming batch_size=8
------------------------------
Streaming batch_size=64
------------------------------
Streaming batch_size=256
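One way to see why `SlowDataset` can hurt large batches is to count how many tokens the model processes once each batch is padded to its longest item. The sketch below is a rough estimate, not a measurement: it assumes the short sentence is about 4 tokens and the long item about 400, and `token_length` and `padded_tokens` are hypothetical helpers.

```python
def token_length(i: int) -> int:
    """Approximate token count of item i, mirroring SlowDataset above."""
    # Assumption: "This is a test" is about 4 tokens, so the 100x item is about 400.
    return 4 * (100 if i % 64 == 0 else 1)


def padded_tokens(n_items: int, batch_size: int) -> int:
    """Tokens processed when each batch is padded to its longest item."""
    total = 0
    for start in range(0, n_items, batch_size):
        end = min(start + batch_size, n_items)
        lengths = [token_length(i) for i in range(start, end)]
        total += max(lengths) * len(lengths)
    return total


unpadded = padded_tokens(5000, 1)  # batch_size=1 needs no padding
for batch_size in [1, 8, 64, 256]:
    total = padded_tokens(5000, batch_size)
    print(f"batch_size={batch_size:>3}: {total:>9,} tokens "
          f"({total / unpadded:.1f}x the unpadded total)")
```

Under these assumptions, batch sizes 64 and 256 both pad every item to 400 tokens, for 2,000,000 tokens in total against 51,284 unpadded. Whether that extra work outweighs the gains from batching depends on the hardware, so it is worth measuring on your own data.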