This notebook shows how to use optimum-benchmark to benchmark LLMs. It focuses on benchmarking quantization algorithms for Mistral 7B.

We need to install the following packages. bitsandbytes, auto-gptq and autoawq are only necessary if you benchmark models quantized with these algorithms.

In [None]:
!python -m pip install git+https://github.com/huggingface/optimum-benchmark.git
!pip install bitsandbytes
!pip install auto-gptq
!pip install autoawq
!pip install --upgrade transformers #Google Colab does not ship the latest version of Transformers by default

Collecting git+https://github.com/huggingface/optimum-benchmark.git
  Cloning https://github.com/huggingface/optimum-benchmark.git to /tmp/pip-req-build-w7nad0pk
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/optimum-benchmark.git /tmp/pip-req-build-w7nad0pk
  Resolved https://github.com/huggingface/optimum-benchmark.git to commit ef70214a33902d33896d4edd663e08480682c05f
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pyrsmi@ git+https://github.com/RadeonOpenCompute/pyrsmi.git (from optimum-benchmark==0.0.1)
  Cloning https://github.com/RadeonOpenCompute/pyrsmi.git to /tmp/pip-install-ixjgfm_5/pyrsmi_fe0ad5f50f4b4933bd5593e62a4c335e
  Running command git clone --filter=blob:none --quiet https://github.com/RadeonOpenCompute/pyrsmi.git /tmp/pip-install-ixjgfm_5/pyrsmi_fe0ad5f50f4b4933bd5593e62a4c335e
  Resolved https:
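
Once the installation finishes, it can be useful to confirm which versions were actually installed, since the optional quantization packages are only needed for some of the benchmarks. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is ours, and the package list simply mirrors the install commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the pip commands above.
PACKAGES = ["optimum-benchmark", "bitsandbytes", "auto-gptq", "autoawq", "transformers"]

for package, version in installed_versions(PACKAGES).items():
    print(f"{package}: {version or 'not installed'}")
```

A package reported as "not installed" only matters if you plan to run the corresponding benchmark.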

Define the configuration for optimum-benchmark.

Here we benchmark inference with Mistral 7B loaded in fp16, using different batch sizes.
If you run this notebook on Google Colab, only this part requires the A100. The benchmarks in the following sections can run on the T4.
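
A back-of-the-envelope estimate shows why the fp16 run needs the larger GPU. The sketch below only counts the memory taken by the weights, assuming roughly 7.24 billion parameters for Mistral 7B (an approximation) and ignoring activations, the KV cache, and quantization metadata, all of which come on top. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7.24e9  # assumption: approximate parameter count of Mistral 7B

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
four_bit_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {four_bit_gib:.1f} GiB")
```

With fp16 weights already close to the 16 GB of a T4, little memory is left for generating 1000 tokens at larger batch sizes, whereas 4-bit weights leave much more headroom.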

In [None]:
YAML_DEFAULT="""
defaults:
  - backend: pytorch # default backend
  - benchmark: inference # we will monitor the inference
  - launcher: process
  - experiment # inheriting from experiment config
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

hydra:
  run:
    dir: experiments/${experiment_name} #The results will be reported in this directory. Note that "experiment_name" refers to the configuration field name "experiment_name" below
  sweep:
    dir: experiments/${experiment_name}
  job:
    chdir: true
    env_set: #These are environment variables that you may want to set before running the benchmark
      CUDA_VISIBLE_DEVICES: 0
      CUDA_DEVICE_ORDER: PCI_BUS_ID
  sweeper:
    params:
      benchmark.input_shapes.batch_size: 1,2,4,8,16 #we will try all these batch sizes

experiment_name: fp16-batch_size(${benchmark.input_shapes.batch_size})-sequence_length(${benchmark.input_shapes.sequence_length})-new_tokens(${benchmark.new_tokens})
model: mistralai/Mistral-7B-v0.1 #The model that we want to evaluate. It can be from the Hugging Face Hub or local directory
device: cuda #Which device to use for the benchmark. We will use CUDA, i.e., the GPU

backend:
  torch_dtype: float16 #The model will be loaded with fp16

benchmark:
  memory: true #We will monitor the memory usage
  warmup_runs: 10 #Before the monitoring starts, the inference will be run 10 times for warming up

  new_tokens: 1000 #Inference will generate 1000 tokens
  input_shapes:
    sequence_length: 512 #Prompt will have 512 tokens
"""

with open("mistral_7b_ob.yaml", 'w') as f:
  f.write(YAML_DEFAULT)
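
With `--multirun`, Hydra launches one job per value listed under `hydra.sweeper.params` and resolves the `${...}` interpolations in `experiment_name` for each job, so every batch size gets its own results directory. The following sketch only mimics that string interpolation in plain Python to show which directory names to expect. It does not call Hydra, and `experiment_dir` is a hypothetical helper.

```python
def experiment_dir(prefix, batch_size, sequence_length=512, new_tokens=1000):
    """Rebuild the directory name produced by the experiment_name pattern in the config."""
    name = (
        f"{prefix}-batch_size({batch_size})"
        f"-sequence_length({sequence_length})"
        f"-new_tokens({new_tokens})"
    )
    return f"experiments/{name}"


# Same values as benchmark.input_shapes.batch_size in the sweeper section.
batch_sizes = [1, 2, 4, 8, 16]
dirs = [experiment_dir("fp16", batch_size) for batch_size in batch_sizes]

for directory in dirs:
    print(directory)
```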

In [None]:
!optimum-benchmark --config-dir ./ --config-name mistral_7b_ob --multirun

2024-01-16 04:32:34.977174: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-16 04:32:34.977223: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-16 04:32:34.978673: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2024-01-16 04:32:39,634][HYDRA] Launching 5 jobs locally
[2024-01-16 04:32:39,635][HYDRA] 	#0 : benchmark.input_shapes.batch_size=1
[2024-01-16 04:32:39,799][launcher][INFO] - Configuring process launcher
[2024-01-16 04:32:39,799][process][INFO] - Setting multip

Benchmarking BNB's NF4 quantization without double quantization

In [None]:
YAML_DEFAULT="""
defaults:
  - backend: pytorch # default backend
  - benchmark: inference # we will monitor the inference
  - launcher: process
  - experiment # inheriting from experiment config
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

hydra:
  run:
    dir: experiments/${experiment_name} #The results will be reported in this directory. Note that "experiment_name" refers to the configuration field name "experiment_name" below
  sweep:
    dir: experiments/${experiment_name}
  job:
    chdir: true
    env_set: #These are environment variables that you may want to set before running the benchmark
      CUDA_VISIBLE_DEVICES: 0
      CUDA_DEVICE_ORDER: PCI_BUS_ID
  sweeper:
    params:
      benchmark.input_shapes.batch_size: 1,2,4,8,16 #we will try all these batch sizes

experiment_name: bnb-batch_size(${benchmark.input_shapes.batch_size})-sequence_length(${benchmark.input_shapes.sequence_length})-new_tokens(${benchmark.new_tokens})
model: mistralai/Mistral-7B-v0.1 #The model that we want to evaluate. It can be from the Hugging Face Hub or local directory
device: cuda #Which device to use for the benchmark. We will use CUDA, i.e., the GPU

backend:
  torch_dtype: float16 #The model will be loaded with fp16
  quantization_scheme: bnb
  quantization_config:
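    load_in_4bit: true #Note: bnb_4bit_quant_type is not set here; transformers' BitsAndBytesConfig defaults to fp4, so add bnb_4bit_quant_type: nf4 if NF4 is intended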
    bnb_4bit_compute_dtype: float16

benchmark:
  memory: true #We will monitor the memory usage
  warmup_runs: 10 #Before the monitoring starts, the inference will be run 10 times for warming up

  new_tokens: 1000 #Inference will generate 1000 tokens
  input_shapes:
    sequence_length: 512 #Prompt will have 512 tokens
"""

with open("mistral_7b_bnb_ob.yaml", 'w') as f:
  f.write(YAML_DEFAULT)

In [None]:
!optimum-benchmark --config-dir ./ --config-name mistral_7b_bnb_ob --multirun

2024-01-16 04:46:52.647047: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-16 04:46:52.647091: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-16 04:46:52.648589: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2024-01-16 04:46:57,283][HYDRA] Launching 5 jobs locally
[2024-01-16 04:46:57,283][HYDRA] 	#0 : benchmark.input_shapes.batch_size=1
[2024-01-16 04:46:57,450][launcher][INFO] - Configuring process launcher
[2024-01-16 04:46:57,451][process][INFO] - Setting multip

Benchmarking BNB's NF4 with double quantization
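
Double quantization, as described in the QLoRA paper, also quantizes the per-block quantization constants. The arithmetic below is a rough sketch under that paper's settings (blocks of 64 weights with one fp32 constant each, then constants re-quantized to 8 bits in blocks of 256), which we assume match the bitsandbytes defaults. It also assumes about 7.24 billion parameters for Mistral 7B and treats all of them as quantized, which slightly overstates the saving.

```python
BLOCK_SIZE = 64            # weights sharing one quantization constant
SECOND_LEVEL_BLOCK = 256   # constants sharing one second-level fp32 constant


def overhead_bits_per_param(double_quant):
    """Extra bits per weight spent on quantization constants."""
    if not double_quant:
        # One fp32 constant per block of weights.
        return 32 / BLOCK_SIZE
    # One 8-bit constant per block, plus one fp32 constant per group of constants.
    return 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)


NUM_PARAMS = 7.24e9  # assumption: approximate parameter count of Mistral 7B

saved_bits = overhead_bits_per_param(False) - overhead_bits_per_param(True)
saved_gib = saved_bits * NUM_PARAMS / 8 / 1024**3

print(f"Overhead without double quantization: {overhead_bits_per_param(False):.3f} bits/param")
print(f"Overhead with double quantization:    {overhead_bits_per_param(True):.3f} bits/param")
print(f"Estimated saving for the whole model: {saved_gib:.2f} GiB")
```

The expected memory saving is therefore modest compared with the size of the 4-bit weights themselves, which is worth keeping in mind when comparing the two bitsandbytes runs.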

In [None]:
YAML_DEFAULT="""
defaults:
  - backend: pytorch # default backend
  - benchmark: inference # we will monitor the inference
  - launcher: process
  - experiment # inheriting from experiment config
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

hydra:
  run:
    dir: experiments/${experiment_name} #The results will be reported in this directory. Note that "experiment_name" refers to the configuration field name "experiment_name" below
  sweep:
    dir: experiments/${experiment_name}
  job:
    chdir: true
    env_set: #These are environment variables that you may want to set before running the benchmark
      CUDA_VISIBLE_DEVICES: 0
      CUDA_DEVICE_ORDER: PCI_BUS_ID
  sweeper:
    params:
      benchmark.input_shapes.batch_size: 1,2,4,8,16 #we will try all these batch sizes

experiment_name: bnb_dq-batch_size(${benchmark.input_shapes.batch_size})-sequence_length(${benchmark.input_shapes.sequence_length})-new_tokens(${benchmark.new_tokens})
model: mistralai/Mistral-7B-v0.1 #The model that we want to evaluate. It can be from the Hugging Face Hub or local directory
device: cuda #Which device to use for the benchmark. We will use CUDA, i.e., the GPU

backend:
  torch_dtype: float16 #The model will be loaded with fp16
  quantization_scheme: bnb
  quantization_config:
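    load_in_4bit: true #Note: bnb_4bit_quant_type is not set here; transformers' BitsAndBytesConfig defaults to fp4, so add bnb_4bit_quant_type: nf4 if NF4 is intended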
    bnb_4bit_compute_dtype: float16
    bnb_4bit_use_double_quant: true

benchmark:
  memory: true #We will monitor the memory usage
  warmup_runs: 10 #Before the monitoring starts, the inference will be run 10 times for warming up

  new_tokens: 1000 #Inference will generate 1000 tokens
  input_shapes:
    sequence_length: 512 #Prompt will have 512 tokens
"""

with open("mistral_7b_bnb_dq_ob.yaml", 'w') as f:
  f.write(YAML_DEFAULT)

In [None]:
!optimum-benchmark --config-dir ./ --config-name mistral_7b_bnb_dq_ob --multirun

2024-01-16 05:08:43.112464: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-16 05:08:43.112511: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-16 05:08:43.113993: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2024-01-16 05:08:47,787][HYDRA] Launching 5 jobs locally
[2024-01-16 05:08:47,787][HYDRA] 	#0 : benchmark.input_shapes.batch_size=1
[2024-01-16 05:08:47,955][launcher][INFO] - Configuring process launcher
[2024-01-16 05:08:47,956][process][INFO] - Setting multip

Benchmarking AWQ

In [None]:
YAML_DEFAULT="""
defaults:
  - backend: pytorch # default backend
  - benchmark: inference # we will monitor the inference
  - launcher: process
  - experiment # inheriting from experiment config
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

hydra:
  run:
    dir: experiments/${experiment_name} #The results will be reported in this directory. Note that "experiment_name" refers to the configuration field name "experiment_name" below
  sweep:
    dir: experiments/${experiment_name}
  job:
    chdir: true
    env_set: #These are environment variables that you may want to set before running the benchmark
      CUDA_VISIBLE_DEVICES: 0
      CUDA_DEVICE_ORDER: PCI_BUS_ID
  sweeper:
    params:
      benchmark.input_shapes.batch_size: 1,2,4,8,16 #we will try all these batch sizes

experiment_name: awq-batch_size(${benchmark.input_shapes.batch_size})-sequence_length(${benchmark.input_shapes.sequence_length})-new_tokens(${benchmark.new_tokens})
model: kaitchup/Mistral-7B-awq-4bit #The model that we want to evaluate. It can be from the Hugging Face Hub or local directory
device: cuda #Which device to use for the benchmark. We will use CUDA, i.e., the GPU

backend:
  torch_dtype: float16 #The model will be loaded with fp16

benchmark:
  memory: true #We will monitor the memory usage
  warmup_runs: 10 #Before the monitoring starts, the inference will be run 10 times for warming up

  new_tokens: 1000 #Inference will generate 1000 tokens
  input_shapes:
    sequence_length: 512 #Prompt will have 512 tokens
"""

with open("mistral_7b_awq_ob.yaml", 'w') as f:
  f.write(YAML_DEFAULT)
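
Unlike the bitsandbytes configs, this one sets no `quantization_scheme`: the checkpoint is already quantized, and Transformers-format quantized checkpoints normally record how in the `quantization_config` entry of their `config.json`. The sketch below shows how that entry can be inspected. `detect_quant_method` is a hypothetical helper, and the sample dictionaries are illustrative rather than copied from the actual checkpoints.

```python
def detect_quant_method(model_config):
    """Return the quantization method recorded in a model config dict, or None."""
    quant_config = model_config.get("quantization_config") or {}
    return quant_config.get("quant_method")


# Illustrative configs (assumptions, not the real files from the Hub).
awq_like_config = {
    "model_type": "mistral",
    "quantization_config": {"quant_method": "awq", "bits": 4},
}
unquantized_config = {"model_type": "mistral"}

print(detect_quant_method(awq_like_config))
print(detect_quant_method(unquantized_config))
```

For a real checkpoint, the dictionary would come from calling `json.load` on the downloaded `config.json`.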

In [None]:
!optimum-benchmark --config-dir ./ --config-name mistral_7b_awq_ob --multirun

2024-01-16 04:17:09.135766: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-16 04:17:09.135815: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-16 04:17:09.137536: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2024-01-16 04:17:13,735][HYDRA] Launching 5 jobs locally
[2024-01-16 04:17:13,735][HYDRA] 	#0 : benchmark.input_shapes.batch_size=1
[2024-01-16 04:17:13,900][launcher][INFO] - Configuring process launcher
[2024-01-16 04:17:13,900][process][INFO] - Setting multip

Benchmarking GPTQ

In [None]:
YAML_DEFAULT="""
defaults:
  - backend: pytorch # default backend
  - benchmark: inference # we will monitor the inference
  - launcher: process
  - experiment # inheriting from experiment config
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

hydra:
  run:
    dir: experiments/${experiment_name} #The results will be reported in this directory. Note that "experiment_name" refers to the configuration field name "experiment_name" below
  sweep:
    dir: experiments/${experiment_name}
  job:
    chdir: true
    env_set: #These are environment variables that you may want to set before running the benchmark
      CUDA_VISIBLE_DEVICES: 0
      CUDA_DEVICE_ORDER: PCI_BUS_ID
  sweeper:
    params:
      benchmark.input_shapes.batch_size: 1,2,4,8,16 #we will try all these batch sizes

experiment_name: gptq-batch_size(${benchmark.input_shapes.batch_size})-sequence_length(${benchmark.input_shapes.sequence_length})-new_tokens(${benchmark.new_tokens})
model: flozi00/Mistral-7B-v0.1-4bit-autogptq #The model that we want to evaluate. It can be from the Hugging Face Hub or local directory
device: cuda #Which device to use for the benchmark. We will use CUDA, i.e., the GPU

backend:
  torch_dtype: float16 #The model will be loaded with fp16

benchmark:
  memory: true #We will monitor the memory usage
  warmup_runs: 10 #Before the monitoring starts, the inference will be run 10 times for warming up

  new_tokens: 1000 #Inference will generate 1000 tokens
  input_shapes:
    sequence_length: 512 #Prompt will have 512 tokens
"""

with open("mistral_7b_gptq_ob.yaml", 'w') as f:
  f.write(YAML_DEFAULT)

In [None]:
!optimum-benchmark --config-dir ./ --config-name mistral_7b_gptq_ob --multirun

2024-01-16 05:34:18.710547: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-16 05:34:18.710597: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-16 05:34:18.712068: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2024-01-16 05:34:23,360][HYDRA] Launching 5 jobs locally
[2024-01-16 05:34:23,360][HYDRA] 	#0 : benchmark.input_shapes.batch_size=1
[2024-01-16 05:34:23,525][launcher][INFO] - Configuring process launcher
[2024-01-16 05:34:23,526][process][INFO] - Setting multip

# Benchmarking QLoRA training

In [None]:
YAML_DEFAULT="""
defaults:
  - backend: pytorch # default backend
  - benchmark: training # we will monitor the training
  - launcher: process
  - experiment # inheriting from experiment config
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

hydra:
  run:
    dir: experiments_training/${experiment_name} #The results will be reported in this directory. Note that "experiment_name" refers to the configuration field name "experiment_name" below
  sweep:
    dir: experiments_training/${experiment_name}
  job:
    chdir: true
    env_set: #These are environment variables that you may want to set before running the benchmark
      CUDA_VISIBLE_DEVICES: 0
      CUDA_DEVICE_ORDER: PCI_BUS_ID
  sweeper:
    params:
      benchmark.training_arguments.per_device_train_batch_size: 1,2,4,8 #we will try all these batch sizes

experiment_name: qlora-batch_size(${benchmark.training_arguments.per_device_train_batch_size})
model: mistralai/Mistral-7B-v0.1 #The model that we want to evaluate. It can be from the Hugging Face Hub or local directory
device: cuda #Which device to use for the benchmark. We will use CUDA, i.e., the GPU

backend:
  no_weights: true
  torch_dtype: float16 #The model will be loaded with fp16
  #peft_model: kaitchup/Mistral-7B-v0.1-SFT-ultrachat
  peft_strategy: lora
  peft_config:
    task_type: CAUSAL_LM
  quantization_scheme: bnb
  quantization_config:
    load_in_4bit: true
    bnb_4bit_compute_dtype: float16
    bnb_4bit_use_double_quant: true



benchmark:
  memory: true
  warmup_steps: 40
  dataset_shapes:
    dataset_size: 160
    sequence_length: 256
  training_arguments:
    max_steps: 140
    per_device_train_batch_size: 1
"""

with open("mistral_7b_qlora_ob.yaml", 'w') as f:
  f.write(YAML_DEFAULT)
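
The training settings interact: each run performs `max_steps` optimizer steps, the first `warmup_steps` of which we assume are excluded from the reported throughput, and the synthetic dataset holds `dataset_size` samples of `sequence_length` tokens. A small sketch of the resulting arithmetic for each batch size in the sweep follows. The helper name `training_budget` is ours, and it assumes no gradient accumulation.

```python
MAX_STEPS = 140
WARMUP_STEPS = 40
DATASET_SIZE = 160
SEQUENCE_LENGTH = 256


def training_budget(batch_size):
    """Summarize how much work one training run performs for a given batch size."""
    samples_seen = MAX_STEPS * batch_size
    return {
        "measured_steps": MAX_STEPS - WARMUP_STEPS,
        "tokens_per_step": batch_size * SEQUENCE_LENGTH,
        "passes_over_dataset": samples_seen / DATASET_SIZE,
    }


# Same values as per_device_train_batch_size in the sweeper section.
for batch_size in [1, 2, 4, 8]:
    print(batch_size, training_budget(batch_size))
```

At larger batch sizes the run iterates over the 160 samples several times, which is harmless for a throughput and memory benchmark but would matter for an actual fine-tuning run.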

In [None]:
!optimum-benchmark --config-dir ./ --config-name mistral_7b_qlora_ob --multirun

2024-01-15 14:27:41.423277: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-15 14:27:41.423331: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-15 14:27:41.424639: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2024-01-15 14:27:46,109][HYDRA] Launching 4 jobs locally
[2024-01-15 14:27:46,109][HYDRA] 	#0 : benchmark.training_arguments.per_device_train_batch_size=1
[2024-01-15 14:27:46,283][launcher][INFO] - Configuring process launcher
[2024-01-15 14:27:46,283][process][INF

# Generating plots

We will use the script prepared by optimum-benchmark:

In [None]:
!wget https://raw.githubusercontent.com/huggingface/optimum-benchmark/ef70214a33902d33896d4edd663e08480682c05f/examples/running-mistrals/report.py

--2024-01-16 05:49:42--  https://raw.githubusercontent.com/huggingface/optimum-benchmark/ef70214a33902d33896d4edd663e08480682c05f/examples/running-mistrals/report.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8097 (7.9K) [text/plain]
Saving to: ‘report.py’


2024-01-16 05:49:42 (68.7 MB/s) - ‘report.py’ saved [8097/8097]



In [None]:
!pip install flatten-dict

Collecting flatten-dict
  Downloading flatten_dict-0.4.2-py2.py3-none-any.whl (9.7 kB)
Installing collected packages: flatten-dict
Successfully installed flatten-dict-0.4.2
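
Before generating the report, it can help to check which runs are actually present under `experiments`. The sketch below parses the directory names back into their fields, relying only on the `experiment_name` pattern used in the inference configs above. `parse_experiment_name` and `list_experiments` are hypothetical helpers and do not depend on how `report.py` reads the results.

```python
import re
from pathlib import Path

# Mirrors the experiment_name pattern of the inference configs.
NAME_PATTERN = re.compile(
    r"^(?P<scheme>.+)-batch_size\((?P<batch_size>\d+)\)"
    r"-sequence_length\((?P<sequence_length>\d+)\)"
    r"-new_tokens\((?P<new_tokens>\d+)\)$"
)


def parse_experiment_name(name):
    """Split an experiment directory name into its fields, or return None if it does not match."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "scheme": fields["scheme"],
        "batch_size": int(fields["batch_size"]),
        "sequence_length": int(fields["sequence_length"]),
        "new_tokens": int(fields["new_tokens"]),
    }


def list_experiments(root="experiments"):
    """Return the parsed names of all matching run directories under root."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    parsed = (parse_experiment_name(p.name) for p in sorted(root_path.iterdir()) if p.is_dir())
    return [fields for fields in parsed if fields is not None]


for run in list_experiments():
    print(run)
```

A missing combination of scheme and batch size usually means that the corresponding job failed, for instance by running out of memory.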


In [None]:
!python report.py -e experiments

  inference_report = pd.concat(inference_reports, axis=0, ignore_index=True)
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃           ┃           ┃            ┃           ┃    Forward ┃           ┃   Generate ┃           ┃
┃           ┃           ┃    Forward ┃   Forward ┃       Peak ┃  Generate ┃       Peak ┃           ┃
┃ Experime… ┃     Batch ┃    Latency ┃ Throughp… ┃     Memory ┃ Throughp… ┃     Memory ┃ Quantiza… ┃
┃ Name      ┃      Size ┃  