<a href="https://colab.research.google.com/github/rahiakela/small-language-models-fine-tuning/blob/main/domain-specific-small-language-models/05-generate-python-code/01_benchmark_inference_performance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Benchmarking Python Code Generation with Vanilla, ONNX Converted and Quantized CodeGen Models

  
This notebook benchmarks inference performance (latency and throughput) when generating Python code with three versions of the [CodeGen](https://huggingface.co/Salesforce/codegen-350M-mono) 350M mono model: the vanilla PyTorch model, the same model after ONNX conversion, and the ONNX model after 8-bit quantization. It doesn't require hardware acceleration.  

Install the missing requirements in the Colab VM (only Optimum with the ONNX Runtime extra).

In [None]:
!pip install optimum[onnxruntime]==1.21.2

Pin the Transformers library to version 4.44. A runtime restart is needed afterwards.

In [None]:
!pip install -U transformers==4.44

In [2]:
import transformers
print(transformers.__version__)

4.44.0


### Vanilla Model

Download the CodeGen 350M mono model and its tokenizer from the Hugging Face Hub.

In [None]:
from transformers import AutoTokenizer

device = "cpu"
model_id = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [4]:
from transformers import CodeGenForCausalLM

model = CodeGenForCausalLM.from_pretrained(model_id).to(device)
model.eval()

config.json:   0%|          | 0.00/999 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/797M [00:00<?, ?B/s]

CodeGenForCausalLM(
  (transformer): CodeGenModel(
    (wte): Embedding(51200, 1024)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-19): 20 x CodeGenBlock(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): CodeGenAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (qkv_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (out_proj): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (mlp): CodeGenMLP(
          (fc_in): Linear(in_features=1024, out_features=4096, bias=True)
          (fc_out): Linear(in_features=4096, out_features=1024, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1024, out_features=51200, bias=True)
)
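
As a sanity check on the "350M" in the model name, the layer shapes printed above can be summed by hand. The sketch below is plain arithmetic on those shapes; it counts weights and biases only and ignores any non-parameter buffers the model may carry.

```python
# Shapes copied from the printed CodeGenForCausalLM architecture
vocab, hidden, ffn, n_blocks = 51200, 1024, 4096, 20

def linear(n_in, n_out, bias):
    """Parameter count of a Linear layer."""
    return n_in * n_out + (n_out if bias else 0)

layer_norm = 2 * hidden                       # weight + bias
embedding = vocab * hidden                    # wte

block = (
    layer_norm                                # ln_1
    + linear(hidden, 3 * hidden, bias=False)  # qkv_proj
    + linear(hidden, hidden, bias=False)      # out_proj
    + linear(hidden, ffn, bias=True)          # fc_in
    + linear(ffn, hidden, bias=True)          # fc_out
)

total = (
    embedding
    + n_blocks * block
    + layer_norm                              # ln_f
    + linear(hidden, vocab, bias=True)        # lm_head
)

print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

The total comes to 356,712,448 parameters, which is consistent with the 350M label.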

Set a text prompt (a Python function header) to be used across benchmarks.

In [5]:
prompt = "def hello_world():"

The following cell only verifies that the model and tokenizer have been downloaded properly. You can skip its execution.

In [None]:
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=12)

In [9]:
print(tokenizer.decode(generated_ids[0],
                       skip_special_tokens=True,
                       pad_token_id=50256))

def hello_world():
    print("Hello World")


We can also pass a prompt in pure natural language.

In [12]:
prompt = "Create a bar chart with matplotlib"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=1000)

In [None]:
print(tokenizer.decode(generated_ids[0],
                       skip_special_tokens=True,
                       pad_token_id=50256))

In [14]:
prompt = "Create empty tensor with PyTorch"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=1000)

In [15]:
print(tokenizer.decode(generated_ids[0],
                       skip_special_tokens=True,
                       pad_token_id=50256))

Create empty tensor with PyTorch's default data type
        """
        return torch.zeros(self.shape, dtype=torch.float32)

    def __repr__(self):
        """
        :return: String representation of the tensor
        """
        return "Tensor(shape={}, dtype={}, device={})".format(self.shape, self.dtype, self.device)

    def __len__(self):
        """
        :return: Number of elements in the tensor
        """
        return self.shape[0]

    def __getitem__(self, index):
        """
        :param index: Index of the element to retrieve
        :return: The element at the given index
        """
        return self.data[index]

    def __setitem__(self, index, value):
        """
        :param index: Index of the element to set
        :param value: New value of the element
        """
        self.data[index] = value

    def __iter__(self):
        """
        :return: Iterator over the elements of the tensor
        """
        return iter(self.data)

    def __add__(se

In [16]:
prompt = """
Create a basic web app with Dash
"""
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=1000)

In [17]:
print(tokenizer.decode(generated_ids[0],
                       skip_special_tokens=True,
                       pad_token_id=50256))


Create a basic web app with Dash

"""

import dash
import dash_core_components as dcc
import dash_html_components as html
import plotly.express as px
import pandas as pd
import numpy as np
import plotly.graph_objects as go

external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']

app = dash.Dash(__name__, external_stylesheets=external_stylesheets)

# Load the data
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv')

# Create the app layout
app.layout = html.Div(children=[
    html.H1('Gapminder Data', style={'textAlign': 'center'}),

    dcc.Graph(
        id='example-graph',
        figure={
            'data': [
                go.Scatter(
                    x=df['Year'],
                    y=df['gdpPercap'],
                    mode='lines+markers',
                    name='Scatter'
                ),
                go.Scatter(
                    x=df['Year'],
                    y=df['lifeExp'],
                

We can also generate the body of a function, given only its signature and docstring.

In [18]:
prompt = '''
def increment_elements(int_list):
  """
  Return list with elements incremented by 1
  """
'''
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=1000)

In [19]:
print(tokenizer.decode(generated_ids[0],
                       skip_special_tokens=True,
                       pad_token_id=50256))


def increment_elements(int_list):
  """
  Return list with elements incremented by 1
  """
  new_list = []
  for i in int_list:
    new_list.append(i + 1)
  return new_list

def test_increment_elements():
  assert increment_elements([1,2,3]) == [1,3,4]
  assert increment_elements([1,2,3,4,5]) == [1,3,5,7,9]
  assert increment_elements([1,2,3,4,5,6]) == [1,3,6,10,15,20]

def test_increment_elements_2():
  assert increment_elements([1,2,3,4,5]) == [1,2,3,4,5,6]
  assert increment_elements([1,2,3,4,5,6,7,8,9,10]) == [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]

def test_increment_elements_3():
  assert increment_elements([1,2,3,4,5,6]) == [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100]
  assert increment_elements([1,2,3,4,5,6,7,8,9,10,11,

Set up a Transformers pipeline for inference with the vanilla model.

In [13]:
from transformers import pipeline

pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                pad_token_id=50256,
                truncation=True,
                max_length=500
      )

Test the pipeline.

In [14]:
prompt = '''
def increment_elements(int_list):
  """
  Return list with elements incremented by 1
  """
'''

result = pipe(prompt)
print(result[0]['generated_text'])


def increment_elements(int_list):
  """
  Return list with elements incremented by 1
  """
  print(f"Increasing {int_list} by 1")
  return int_list + 1
# Run test
first_int = 7
second_int = 1
third_int = 6
increment_elements(first_int)
increment_elements(second_int)
print(first_int)
increment_elements(third_int)
print(second_int)
print(third_int)


In [15]:
tokenizer.save_pretrained("local-pt-checkpoint")
model.save_pretrained("local-pt-checkpoint")

Non-default generation parameters: {'max_length': 50, 'do_sample': True}


Define some utils for benchmarking (more details about them in chapter 6 of the book).

In [16]:
from contextlib import contextmanager
from dataclasses import dataclass
from time import perf_counter

@contextmanager
def track_infer_time(time_buffer):
    start_time = perf_counter()
    yield
    end_time = perf_counter()

    time_buffer.append(end_time - start_time)

@dataclass
class BenchmarkInferenceResult:
    model_inference_time: list[float]  # per-call durations in seconds
    optimized_model_path: str
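
To see how the timing context manager behaves, here is a minimal, self-contained usage sketch. It repeats the helper so it can run on its own, and uses `sleep` as a stand-in for a pipeline call (an assumption made purely for illustration).

```python
from contextlib import contextmanager
from time import perf_counter, sleep

@contextmanager
def track_infer_time(time_buffer):
    start_time = perf_counter()
    yield
    # Append the elapsed wall-clock time (seconds) of the wrapped block
    time_buffer.append(perf_counter() - start_time)

durations = []
for _ in range(3):
    with track_infer_time(durations):
        sleep(0.01)  # stand-in for a pipeline call

print([f"{d * 1e3:.1f} ms" for d in durations])
```

Each `with` block appends exactly one duration, so the buffer length always equals the number of timed calls.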

Define a custom function to be reused across benchmarks with the different versions of the model under evaluation.

In [17]:
from tqdm import trange

def benchmark_inference(providers, pipe, prompt, results):
    # providers: iterable of (execution provider, label) pairs
    for _provider, label in providers:
        # Warm-up runs are executed but not timed
        for _ in trange(10, desc="Warming up"):
            pipe(prompt)

        time_buffer = []
        for _ in trange(100, desc=f"Tracking inference time ({label})"):
            with track_infer_time(time_buffer):
                pipe(prompt)

        results[label] = BenchmarkInferenceResult(
            time_buffer,
            None
        )

    return results
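
Since a full benchmark takes more than an hour on CPU, it can help to dry-run the warm-up and measurement loop first. The sketch below does that with a trivial stand-in callable; `fake_pipe` and `dry_run` are hypothetical names introduced here and are not part of the notebook.

```python
from time import perf_counter

def fake_pipe(prompt):
    # Hypothetical stand-in for a text-generation pipeline
    return [{"generated_text": prompt + "\n  pass"}]

def dry_run(pipe, prompt, warmup=2, runs=5):
    for _ in range(warmup):       # executed but not timed
        pipe(prompt)
    timings = []
    for _ in range(runs):
        start = perf_counter()
        pipe(prompt)
        timings.append(perf_counter() - start)
    return timings

timings = dry_run(fake_pipe, "def hello_world():")
print(len(timings), "timed runs")
```

Only the timed runs end up in the returned list; the warm-up calls are discarded, as in the benchmark function.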

Execute the benchmarks for the CodeGen vanilla model.

In [18]:
results = {}
PROVIDERS = {
    ("cpu", "PyTorch CPU"),
}
results = benchmark_inference(PROVIDERS, pipe, prompt, results)

Warming up: 100%|██████████| 10/10 [06:27<00:00, 38.71s/it]
Tracking inference time (PyTorch CPU): 100%|██████████| 100/100 [1:05:22<00:00, 39.22s/it]


### ONNX Conversion

To prevent potential out-of-memory issues, delete the original model and the pipeline that still references it.

In [19]:
import gc

# The pipeline holds a reference to the model, so both names must go
del model, pipe
gc.collect()

0
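
Keep in mind that `del` only removes a name; the object itself is released once nothing else refers to it, and a pipeline built on a model keeps such a reference. The sketch below illustrates this with `weakref` and two hypothetical stand-in classes (`FakeModel`, `FakePipeline`), not the real Transformers objects.

```python
import gc
import weakref

class FakeModel:
    """Hypothetical stand-in for a large model object."""

class FakePipeline:
    """Hypothetical stand-in for a pipeline that keeps its model."""
    def __init__(self, model):
        self.model = model

model = FakeModel()
pipe = FakePipeline(model)
probe = weakref.ref(model)        # observes the object without keeping it alive

del model
gc.collect()
alive_after_first_del = probe() is not None    # still referenced by pipe.model

del pipe
gc.collect()
alive_after_second_del = probe() is not None   # no references remain

print(alive_after_first_del, alive_after_second_del)
```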

Convert the CodeGen 350M mono model using the Optimum package.

In [None]:
from optimum.onnxruntime import ORTModelForCausalLM

model_id = 'Salesforce/codegen-350M-mono'
model = ORTModelForCausalLM.from_pretrained(model_id,
                                            export=True,
                                            provider="CPUExecutionProvider",
                                            )

Save the converted model to disk.

In [21]:
from pathlib import Path

onnx_path = Path("onnx")
model.save_pretrained(onnx_path)

Set up a pipeline for inference with the ONNX converted CodeGen 350M mono model.

In [24]:
from transformers import pipeline

pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                pad_token_id=50256,
                truncation=True,
                max_length=500
                )

Verify that the pipeline works.

In [25]:
result = pipe(prompt)
result

[{'generated_text': '\ndef increment_elements(int_list):\n  """\n  Return list with elements incremented by 1\n  """\n  for index, value in enumerate(int_list):\n    value += 1\n    int_list[index] = value\n  return int_list\n\ndef sum_up_elements(sum_list):\n  """\n  Return list with summed elements and sum is 0\n  """\n  sum_list = map(lambda x: x + float(x), sum_list)\n  sum_tuple = tuple(sum_list)\n  sum_list.insert(0, 0.0)\n  return sum_list, sum_tuple'}]

Repeat the benchmark on the ONNX converted model.

In [26]:
PROVIDERS = {
    ("CPUExecutionProvider", "ONNX CPU"),
}
results = benchmark_inference(PROVIDERS, pipe, prompt, results)

Warming up: 100%|██████████| 10/10 [04:55<00:00, 29.52s/it]
Tracking inference time (ONNX CPU): 100%|██████████| 100/100 [1:02:41<00:00, 37.62s/it]


### 8-bit Quantization

To prevent potential out-of-memory issues, let's delete the pipeline from memory.

In [27]:
del pipe
gc.collect()

4110

Apply dynamic 8-bit quantization to the ONNX converted model and save the result to disk.

In [28]:
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

dynamic_quantizer = ORTQuantizer.from_pretrained(model)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False,
                                              per_channel=False)

model_quantized_path = dynamic_quantizer.quantize(
    save_dir=onnx_path,
    quantization_config=dqconfig,
)

Creating dynamic quantizer: QOperator (mode: IntegerOps, schema: u8/s8, channel-wise: False)
Quantizing model...
Saving quantized model at: onnx (external data format: False)
Configuration saved in onnx/ort_config.json
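
To build intuition for what 8-bit quantization does to a tensor, here is a deliberately simplified sketch of symmetric per-tensor quantization to signed 8-bit integers. It is an illustration in plain Python with made-up weights, not ONNX Runtime's actual implementation (the log above reports a u8/s8 schema with integer operators).

```python
def quantize_s8(values):
    """Symmetric per-tensor quantization to signed 8-bit integers."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.42, -1.27, 0.003, 0.9, -0.55]   # made-up float weights
q, scale = quantize_s8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"scale={scale:.4f}, max error={max_error:.4f}")
```

Each value is stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step (`scale / 2`).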


Load the quantized model into memory before setting up its pipeline.

In [29]:
quantized_model = ORTModelForCausalLM.from_pretrained("onnx", file_name="model_quantized.onnx")
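
A quick way to check the effect of quantization on disk is to compare file sizes in the output directory. The helper below is a small sketch using only `pathlib`; `file_sizes_mb` is a hypothetical name, and it simply lists whatever `.onnx` files it finds in the given directory.

```python
from pathlib import Path

def file_sizes_mb(directory, pattern="*.onnx"):
    """Map each matching file name to its size in MB."""
    return {
        p.name: p.stat().st_size / 1e6
        for p in sorted(Path(directory).glob(pattern))
    }

# "onnx" is the directory used for saving in this notebook
for name, size in file_sizes_mb("onnx").items():
    print(f"{name}: {size:.1f} MB")
```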

Set up the pipeline for inference with the quantized model.

In [32]:
pipe = pipeline("text-generation",
                model=quantized_model,
                tokenizer=tokenizer,
                pad_token_id=50256,
                truncation=True,
                max_length=500
                )

Verify that the pipeline works as expected.

In [33]:
result = pipe(prompt)
result

[{'generated_text': '\ndef increment_elements(int_list):\n  """\n  Return list with elements incremented by 1\n  """\n  for i in int_list:\n    int_list[i] += 1\n  return int_list\n\nprint(incdec_elements(["a", 2, "b"]))\nprint(incdec_elements([1, "a", "-b"], ))\nprint(incdec_elements([1, "a", "", "-1"], ))\n'}]

Repeat the benchmark on the quantized model.

In [34]:
PROVIDERS = {
    ("CPUExecutionProvider", "ONNX Quant CPU"),
}
results = benchmark_inference(PROVIDERS, pipe, prompt, results)

Warming up: 100%|██████████| 10/10 [03:42<00:00, 22.24s/it]
Tracking inference time (ONNX Quant CPU): 100%|██████████| 100/100 [28:25<00:00, 17.06s/it]


### Results of the Benchmarks

Visually compare the average inference times across benchmarks for the 3 different versions of the model.

In [35]:
import numpy as np
import plotly.express as px

# Compute average inference time
time_results = {k: np.mean(v.model_inference_time) * 1e3 for k, v in results.items()}

fig = px.bar(x=time_results.keys(), y=time_results.values(),
             title="Average inference time (ms) for each provider",
             labels={'x':'Provider', 'y':'Avg Inference time (ms)'},
             text_auto='.2s')
fig.show()

Calculate latency and throughput metrics for the 3 benchmark sets and put them into a Pandas DataFrame.

In [36]:
time_results = {k: np.mean(v.model_inference_time) * 1e3 for k, v in results.items()}
time_results_std = {k: np.std(v.model_inference_time) * 1000 for k, v in results.items()}

In [37]:
perf_results = {}
for k, v in results.items():
  latency_list = v.model_inference_time
  latency_50 = np.percentile(latency_list, 50) * 1e3
  latency_75 = np.percentile(latency_list, 75) * 1e3
  latency_90 = np.percentile(latency_list, 90) * 1e3
  latency_95 = np.percentile(latency_list, 95) * 1e3
  latency_99 = np.percentile(latency_list, 99) * 1e3

  average_latency = np.mean(v.model_inference_time) * 1e3
  throughput = 1 * (1000 / average_latency)

  perf_results[k] = (
        average_latency,
        latency_50,
        latency_75,
        latency_90,
        latency_95,
        latency_99,
        throughput,
    )
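
For readers who want to see what these metrics mean without NumPy, the sketch below recomputes them in plain Python on a tiny made-up sample. The `percentile` helper is an illustrative re-implementation of the linear interpolation that `np.percentile` uses by default.

```python
from statistics import mean

def percentile(values, q):
    """q-th percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (rank - lower) * (ordered[upper] - ordered[lower])

latencies_s = [1.0, 2.0, 3.0, 4.0]    # made-up durations in seconds

average_ms = mean(latencies_s) * 1e3
p50_ms = percentile(latencies_s, 50) * 1e3
p90_ms = percentile(latencies_s, 90) * 1e3
throughput = 1000 / average_ms         # calls per second

print(average_ms, p50_ms, p90_ms, throughput)
```

Throughput here is simply the reciprocal of the average latency, so it describes sequential calls, one prompt at a time.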

In [38]:
import pandas as pd

index_labels = ['Average_latency (ms)', 'Latency_P50 (ms)', 'Latency_P75 (ms)',
                'Latency_P90 (ms)', 'Latency_P95 (ms)', 'Latency_P99 (ms)',
                'Throughput (requests/s)']
perf_df = pd.DataFrame(data=perf_results, index=index_labels)
perf_df

| Metric | PyTorch CPU | ONNX CPU | ONNX Quant CPU |
|---|---|---|---|
| Average_latency (ms) | 39221.551424 | 37615.522899 | 17057.765902 |
| Latency_P50 (ms) | 32732.47733 | 27171.710279 | 11113.8899 |
| Latency_P75 (ms) | 64431.931898 | 70561.3061 | 29230.478066 |
| Latency_P90 (ms) | 77617.432435 | 71551.577962 | 37607.597394 |
| Latency_P95 (ms) | 78955.183386 | 71858.288423 | 37766.39973 |
| Latency_P99 (ms) | 82057.127682 | 74111.910064 | 40044.424899 |
| Throughput (requests/s) | 0.025496 | 0.026585 | 0.058624 |
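
The relative gains are easier to read as speedup factors over the PyTorch baseline. The short sketch below derives them from the average latencies reported above.

```python
# Average latencies (ms) copied from the results above
avg_latency_ms = {
    "PyTorch CPU": 39221.551424,
    "ONNX CPU": 37615.522899,
    "ONNX Quant CPU": 17057.765902,
}

baseline = avg_latency_ms["PyTorch CPU"]
speedup = {name: baseline / ms for name, ms in avg_latency_ms.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.2f}x")
```

On these runs, ONNX conversion alone gives roughly a 1.04x speedup on average, while the quantized ONNX model is about 2.30x faster than the vanilla PyTorch model.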


Visually compare inference durations across benchmarks for the 3 different versions of the model.

In [39]:
# One row per timed call, with durations converted to milliseconds
results_df = pd.DataFrame(
    [(k, t * 1e3) for k, v in results.items() for t in v.model_inference_time],
    columns=['Provider', 'Inference_time'],
)

fig = px.box(results_df, x="Provider", y="Inference_time",
             points="all",
             labels={'Provider':'Provider', 'Inference_time':'Inference durations (ms)'})
fig.show()
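
The durations spread widely within each provider. One likely contributor is that the pipeline outputs shown earlier differ from run to run, so each call may generate a different number of tokens. A possible refinement, sketched below with made-up numbers, is to normalize durations by the number of newly generated tokens. The notebook does not record token counts, so `new_token_counts` and `seconds_per_token` are assumptions introduced only for this illustration.

```python
def seconds_per_token(durations_s, new_token_counts):
    """Average generation time per newly generated token."""
    if len(durations_s) != len(new_token_counts):
        raise ValueError("one token count is needed per timed run")
    return sum(durations_s) / sum(new_token_counts)

# Made-up values for illustration only
durations_s = [30.0, 60.0, 15.0]
new_token_counts = [200, 400, 100]

per_token = seconds_per_token(durations_s, new_token_counts)
print(f"{per_token * 1e3:.0f} ms/token")
```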