<a href="https://colab.research.google.com/github/rahiakela/small-language-models-fine-tuning/blob/main/domain-specific-small-language-models/05-generate-python-code/03_benchmark_inference_performance2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Benchmarking Python Code Generation with Vanilla and 8-bit Quantized StarCoder2 Models


This notebook benchmarks inference performance (latency and throughput) when generating Python code with a vanilla [StarCoder2](https://huggingface.co/bigcode/starcoder2-3b) 3B model and with an 8-bit quantized version of the same model. It requires hardware acceleration (a GPU).

Install the missing requirements in the Colab VM (only HF's Optimum for the ONNX Runtime, and bitsandbytes).

In [None]:
!pip install optimum[onnxruntime-gpu]==1.21.2
!pip install -U bitsandbytes

Upgrade the NumPy and HF Transformers packages to their latest versions. The VM must be restarted afterwards.

In [None]:
!pip install -U numpy transformers

### Vanilla Model

Download the StarCoder2-3B model (loaded in bfloat16) and its tokenizer from the HF Hub.

In [None]:
from transformers import AutoTokenizer

model_id = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [2]:
import torch
from transformers import AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             device_map='auto',
                                             torch_dtype=torch.bfloat16)
model.eval()

config.json:   0%|          | 0.00/700 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors:   0%|          | 0.00/12.1G [00:00<?, ?B/s]

Starcoder2ForCausalLM(
  (model): Starcoder2Model(
    (embed_tokens): Embedding(49152, 3072)
    (layers): ModuleList(
      (0-29): 30 x Starcoder2DecoderLayer(
        (self_attn): Starcoder2Attention(
          (q_proj): Linear(in_features=3072, out_features=3072, bias=True)
          (k_proj): Linear(in_features=3072, out_features=256, bias=True)
          (v_proj): Linear(in_features=3072, out_features=256, bias=True)
          (o_proj): Linear(in_features=3072, out_features=3072, bias=True)
        )
        (mlp): Starcoder2MLP(
          (c_fc): Linear(in_features=3072, out_features=12288, bias=True)
          (c_proj): Linear(in_features=12288, out_features=3072, bias=True)
          (act): GELUTanh()
        )
        (input_layernorm): LayerNorm((3072,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((3072,), eps=1e-05, elementwise_affine=True)
      )
    )
    (norm): LayerNorm((3072,), eps=1e-05, elementwise_affine=True)
    (rotary_emb
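As a rough cross-check of the printed architecture, the parameter count can be estimated from the layer shapes above. The sketch below is plain arithmetic, not a Transformers API. It assumes the output head shares its weights with `embed_tokens` (the printed module tree is truncated, so this is an assumption) and ignores any non-trainable buffers.

```python
# Shapes copied from the printed module tree above.
vocab_size, hidden, kv_dim, mlp_dim, num_layers = 49152, 3072, 256, 12288, 30

def linear(in_features, out_features):
    # Weight matrix plus bias vector (all Linear layers above have bias=True).
    return in_features * out_features + out_features

def layer_norm(dim):
    # Elementwise affine LayerNorm: one weight and one bias per feature.
    return 2 * dim

attention = 2 * linear(hidden, hidden) + 2 * linear(hidden, kv_dim)  # q, o, k, v
mlp = linear(hidden, mlp_dim) + linear(mlp_dim, hidden)              # c_fc, c_proj
per_layer = attention + mlp + 2 * layer_norm(hidden)

# Assumption: the LM head is tied to embed_tokens, so it adds no parameters.
total_params = vocab_size * hidden + num_layers * per_layer + layer_norm(hidden)

print(f"Estimated parameters: {total_params:,}")
for name, bytes_per_param in [("float32", 4), ("bfloat16", 2), ("int8", 1)]:
    print(f"{name}: ~{total_params * bytes_per_param / 1e9:.2f} GB of weights")
```

Under these assumptions the estimate is about 3.03 billion parameters, i.e. roughly 12.1 GB in float32 (consistent with the size of the `model.safetensors` download shown above) and about half that in bfloat16. The int8 figure should be read as a lower bound for a quantized model, since layers such as embeddings and layer norms are typically kept in higher precision.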

Set a text prompt (a Python function header) to be used across benchmarks.

In [3]:
prompt = "def print_hello_world():"

The following cell only verifies that the model and tokenizer were downloaded properly; you can skip it.

In [None]:
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs)

In [6]:
print(tokenizer.decode(outputs[0]))

def print_hello_world():
    print("Hello World")

def print_hello_world_with_name(name


In [None]:
# This reassignment overrides the earlier prompt; it is the one used in the benchmarks below.
prompt = "def fibonacci(n):"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs)

In [8]:
print(tokenizer.decode(outputs[0]))

def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return


Set up a Transformers pipeline for inference with the vanilla model.

In [9]:
from transformers import pipeline

pipe = pipeline("text-generation",
            model=model,
            tokenizer=tokenizer,
            do_sample=True,
            use_cache=True,
            temperature=0.2,
            top_p=0.95,
            max_length=14
)

Device set to use cuda:0


Test the pipeline.

In [10]:
result = pipe(prompt)
print(result[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


def fibonacci(n):
    if n == 0:
       


Save the checkpoint locally so that it can be reused for quantization later.

In [11]:
checkpoint_save_dir = 'local-pt-checkpoint'
tokenizer.save_pretrained(checkpoint_save_dir)
model.save_pretrained(checkpoint_save_dir)

Define some utilities for benchmarking (they are covered in more detail in chapter 6 of the book).

In [12]:
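from contextlib import contextmanager
from dataclasses import dataclass
from time import perf_counter
from typing import Optional

@contextmanager
def track_infer_time(time_buffer):
    # Append the wall-clock duration (in seconds) of the wrapped block to time_buffer.
    start_time = perf_counter()
    yield
    end_time = perf_counter()

    time_buffer.append(end_time - start_time)

@dataclass
class BenchmarkInferenceResult:
    model_inference_time: list[float]    # per-call durations, in seconds
    optimized_model_path: Optional[str]  # None for the models benchmarked here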
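A minimal usage sketch of the timing helper, with `time.sleep` standing in for a model call (the stand-in workload and the 10 ms figure are illustrative assumptions). The helper is repeated so that the snippet runs on its own.

```python
from contextlib import contextmanager
from time import perf_counter, sleep

@contextmanager
def track_infer_time(time_buffer):
    # Same idea as the helper above: record the elapsed wall-clock seconds of the block.
    start_time = perf_counter()
    yield
    time_buffer.append(perf_counter() - start_time)

durations = []
for _ in range(3):
    with track_infer_time(durations):
        sleep(0.01)  # hypothetical stand-in for pipe(prompt)

# One entry per timed block, each roughly the sleep duration or longer.
print([f"{d * 1e3:.1f} ms" for d in durations])
```

Because the buffer is a plain list of seconds, the same measurements can later feed both the mean and the percentile calculations.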

Define a custom function to be reused across benchmarks for the different versions of the model under evaluation.

In [13]:
from tqdm import trange

def benchmark_inference(providers_dict, pipe, prompt, results):
    # providers_dict is an iterable of (device, label) pairs; only the label is used here.
    for device, label in providers_dict:
        # Warm-up runs are excluded from the measurements.
        for _ in trange(10, desc="Warming up"):
            pipe(prompt)

        time_buffer = []
        for _ in trange(100, desc=f"Tracking inference time ({label})"):
            with track_infer_time(time_buffer):
                pipe(prompt)

        results[label] = BenchmarkInferenceResult(
            time_buffer,
            None
        )

    return results

Execute the benchmarks for the StarCoder2 vanilla model.

In [14]:
results = {}
PROVIDERS = {
    ("gpu", "PyTorch GPU"),
}
results = benchmark_inference(PROVIDERS, pipe, prompt, results)

Warming up:   0%|          | 0/10 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Warming up:  10%|█         | 1/10 [00:00<00:05,  1.73it/s]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Warming up:  20%|██        | 2/10 [00:01<00:03,  2.01it/s]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Warming up:  30%|███       | 3/10 [00:01<00:03,  2.15it/s]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Warming up:  40%|████      | 4/10 [00:01<00:02,  2.21it/s]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Warming up:  50%|█████     | 5/10 [00:02<00:02,  2.26it/s]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Warming up:  60%|██████    | 6/10 [00:02<00:01,  2.27it/s]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Warming up:  70%|███████   | 7/10 [00:03<00:01,  2.26it/s]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


### 8-bit Quantization

To prevent potential out-of-memory issues, let's clean up VRAM and RAM first.

In [15]:
import gc

model.cpu()
del model
del pipe
gc.collect()
torch.cuda.empty_cache()

Let's quantize the original model to 8 bits using the bitsandbytes library and save it to disk.

In [16]:
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(checkpoint_save_dir)
quantized_model = AutoModelForCausalLM.from_pretrained(checkpoint_save_dir,
                                        quantization_config=quantization_config)
quantized_model.eval()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Starcoder2ForCausalLM(
  (model): Starcoder2Model(
    (embed_tokens): Embedding(49152, 3072)
    (layers): ModuleList(
      (0-29): 30 x Starcoder2DecoderLayer(
        (self_attn): Starcoder2Attention(
          (q_proj): Linear8bitLt(in_features=3072, out_features=3072, bias=True)
          (k_proj): Linear8bitLt(in_features=3072, out_features=256, bias=True)
          (v_proj): Linear8bitLt(in_features=3072, out_features=256, bias=True)
          (o_proj): Linear8bitLt(in_features=3072, out_features=3072, bias=True)
        )
        (mlp): Starcoder2MLP(
          (c_fc): Linear8bitLt(in_features=3072, out_features=12288, bias=True)
          (c_proj): Linear8bitLt(in_features=12288, out_features=3072, bias=True)
          (act): GELUTanh()
        )
        (input_layernorm): LayerNorm((3072,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((3072,), eps=1e-05, elementwise_affine=True)
      )
    )
    (norm): LayerNorm((3072,), eps=1e-05, elem

The following cell only verifies that the quantized model and the tokenizer were loaded properly; you can skip it.

In [17]:
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = quantized_model.generate(inputs)
print(tokenizer.decode(outputs[0]))

def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return


In [18]:
quantized_model.save_pretrained('local-8bit-checkpoint')

In [19]:
checkpoint_8bit_save_dir = 'local-8bit-checkpoint'

# Load the quantized model from the specified directory
quantized_model_loaded = AutoModelForCausalLM.from_pretrained(checkpoint_8bit_save_dir)
quantized_model_loaded.eval()

Starcoder2ForCausalLM(
  (model): Starcoder2Model(
    (embed_tokens): Embedding(49152, 3072)
    (layers): ModuleList(
      (0-29): 30 x Starcoder2DecoderLayer(
        (self_attn): Starcoder2Attention(
          (q_proj): Linear8bitLt(in_features=3072, out_features=3072, bias=True)
          (k_proj): Linear8bitLt(in_features=3072, out_features=256, bias=True)
          (v_proj): Linear8bitLt(in_features=3072, out_features=256, bias=True)
          (o_proj): Linear8bitLt(in_features=3072, out_features=3072, bias=True)
        )
        (mlp): Starcoder2MLP(
          (c_fc): Linear8bitLt(in_features=3072, out_features=12288, bias=True)
          (c_proj): Linear8bitLt(in_features=12288, out_features=3072, bias=True)
          (act): GELUTanh()
        )
        (input_layernorm): LayerNorm((3072,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((3072,), eps=1e-05, elementwise_affine=True)
      )
    )
    (norm): LayerNorm((3072,), eps=1e-05, elem

Set up the pipeline for inference with the quantized model.

In [20]:
pipe = pipeline("text-generation",
            model=quantized_model_loaded,
            tokenizer=tokenizer,
            do_sample=True,
            use_cache=True,
            temperature=0.2,
            top_p=0.95,
            max_length=14,
)

Device set to use cuda:0


Verify that the pipeline works as expected.

In [21]:
result = pipe(prompt)
result

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


[{'generated_text': 'def fibonacci(n):\n    if n == 0:\n       '}]

Repeat the benchmark on the quantized model.

In [22]:
PROVIDERS = {
    ("ort", "Quant GPU"),
}
results = benchmark_inference(PROVIDERS, pipe, prompt, results)

Warming up:   0%|          | 0/10 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Warming up:  10%|█         | 1/10 [00:01<00:10,  1.17s/it]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Warming up:  20%|██        | 2/10 [00:02<00:09,  1.17s/it]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Warming up:  30%|███       | 3/10 [00:03<00:08,  1.17s/it]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Warming up:  40%|████      | 4/10 [00:04<00:07,  1.19s/it]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Warming up:  50%|█████     | 5/10 [00:05<00:05,  1.19s/it]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Warming up:  60%|██████    | 6/10 [00:07<00:04,  1.18s/it]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Warming up:  70%|███████   | 7/10 [00:08<00:03,  1.27s/it]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


### Results of the Benchmarks

Visually compare the average inference times across benchmarks for the 2 different versions of the model.

In [23]:
import numpy as np
import plotly.express as px

# Compute average inference time
time_results = {k: np.mean(v.model_inference_time) * 1e3 for k, v in results.items()}

fig = px.bar(x=time_results.keys(), y=time_results.values(),
             title="Average inference time (ms) for each provider",
             labels={'x':'Provider', 'y':'Avg Inference time (ms)'},
             text_auto='.2s')
fig.show()

Calculate latency and throughput metrics for the 2 benchmark sets and put them into a pandas DataFrame.

In [24]:
time_results = {k: np.mean(v.model_inference_time) * 1e3 for k, v in results.items()}
time_results_std = {k: np.std(v.model_inference_time) * 1000 for k, v in results.items()}

In [25]:
perf_results = {}
for k, v in results.items():
  latency_list = v.model_inference_time
  latency_50 = np.percentile(latency_list, 50) * 1e3
  latency_75 = np.percentile(latency_list, 75) * 1e3
  latency_90 = np.percentile(latency_list, 90) * 1e3
  latency_95 = np.percentile(latency_list, 95) * 1e3
  latency_99 = np.percentile(latency_list, 99) * 1e3

  average_latency = np.mean(v.model_inference_time) * 1e3
  throughput = 1 * (1000 / average_latency)

  perf_results[k] = (
        average_latency,
        latency_50,
        latency_75,
        latency_90,
        latency_95,
        latency_99,
        throughput,
    )
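To make the metric definitions concrete, here is a small worked example on five made-up latencies. It is a sketch in pure Python rather than the notebook's NumPy code; the `percentile` helper is a hypothetical reimplementation of the linear-interpolation rule that `np.percentile` uses by default.

```python
import math
from statistics import mean

def percentile(values, q):
    # Linear interpolation between the two closest ranks (q in [0, 100]).
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

# Made-up per-call durations in seconds (not measurements from this notebook).
sample_latencies = [0.40, 0.42, 0.44, 0.46, 0.50]

average_latency_ms = mean(sample_latencies) * 1e3   # about 444 ms
p50_ms = percentile(sample_latencies, 50) * 1e3     # about 440 ms
p90_ms = percentile(sample_latencies, 90) * 1e3     # about 484 ms
throughput = 1000 / average_latency_ms              # calls per second

print(f"avg={average_latency_ms:.1f} ms, P50={p50_ms:.1f} ms, "
      f"P90={p90_ms:.1f} ms, throughput={throughput:.2f} calls/s")
```

Throughput here is the number of sequential `pipe(prompt)` calls completed per second, which is how the notebook's `1000 / average_latency` expression should be read.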

In [26]:
import pandas as pd

index_labels = ['Average_latency (ms)', 'Latency_P50', 'Latency_P75',
                'Latency_P90', 'Latency_P95', 'Latency_P99', 'Throughput']
perf_df = pd.DataFrame(data=perf_results, index=index_labels)
perf_df

| Metric | PyTorch GPU | Quant GPU |
|---|---|---|
| Average_latency (ms) | 451.989805 | 1227.648008 |
| Latency_P50 | 441.700732 | 1176.418144 |
| Latency_P75 | 453.452057 | 1210.200198 |
| Latency_P90 | 506.745469 | 1418.15164 |
| Latency_P95 | 523.371302 | 1507.512108 |
| Latency_P99 | 546.437064 | 1530.378054 |
| Throughput | 2.212439 | 0.814566 |
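The relative gap between the two runs can be read off the table with a couple of divisions. The snippet simply reuses the average latencies reported above; the variable names are illustrative.

```python
# Average latencies (ms) copied from the table above.
avg_latency_ms = {"PyTorch GPU": 451.989805, "Quant GPU": 1227.648008}

slowdown = avg_latency_ms["Quant GPU"] / avg_latency_ms["PyTorch GPU"]
throughput = {label: 1000 / ms for label, ms in avg_latency_ms.items()}

print(f"8-bit model is ~{slowdown:.1f}x slower on average")
for label, calls_per_s in throughput.items():
    print(f"{label}: {calls_per_s:.2f} calls/s")
```

In this run the 8-bit model took roughly 2.7 times longer per call than the bfloat16 model, which suggests that 8-bit quantization with bitsandbytes here trades speed for a smaller memory footprint rather than improving latency.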


Visually compare inference durations across benchmarks for the 2 different versions of the model.

In [27]:
results_df = pd.DataFrame(columns=['Provider', 'Inference_time'])
for k, v in results.items():
  for i in range(len(v.model_inference_time)):
    results_df.loc[len(results_df.index)] = [k, v.model_inference_time[i] * 1e3]

fig = px.box(results_df, x="Provider", y="Inference_time",
             points="all",
             labels={'Provider':'Provider', 'Inference_time':'Inference durations (ms)'})
fig.show()

### 4-bit Quantization