<a href="https://colab.research.google.com/github/tuhinmallick/AI-for-Fashion/blob/main/GPU_Benchmarking_for_LoRA%2C_QLoRA_Fine_tuning%2C_and_Inference_with_and_without_4_bit_Quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*More details in this article: [GPU Benchmarking: What Is the Best GPU for LoRA, QLoRA, and Inference?](https://newsletter.kaitchup.com/p/gpu-benchmarking-what-is-the-best)*

This notebook benchmarks your hardware configuration for LoRA and QLoRA fine-tuning, and for inference with and without 4-bit quantization.

It uses Hugging Face Transformers, TRL, and PEFT for the fine-tuning part. For the inference part, the benchmarking is done through optimum-benchmark.

For fine-tuning, I use the training time returned by the logs as the benchmarking metric.
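
As a hedged sketch of how that training time could be read programmatically instead of from the printed logs: in Transformers, `trainer.train()` returns a `TrainOutput` whose `metrics` dictionary includes `train_runtime` (in seconds). The helper name `summarize_training` is hypothetical, and the `LoRA` function in this notebook would need to keep the return value of `trainer.train()` for this to be usable.

```python
def summarize_training(metrics):
    """Pick the timing-related entries out of a Trainer metrics dict.

    `metrics` is assumed to be `trainer.train().metrics`; keys that are
    missing are simply skipped.
    """
    wanted = ("train_runtime", "train_samples_per_second", "train_steps_per_second")
    summary = {key: metrics[key] for key in wanted if key in metrics}
    if "train_runtime" in summary:
        # Convenience value: runtime in minutes, easier to compare across GPUs.
        summary["train_runtime_minutes"] = summary["train_runtime"] / 60.0
    return summary


# Usage inside the notebook (assumption: LoRA() is changed to keep the result):
#   train_output = trainer.train()
#   print(summarize_training(train_output.metrics))
```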

For inference, optimum-benchmark produces a CSV file with many fields that you can use as benchmarking metrics (memory consumption, latency, etc.).
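
A minimal sketch for inspecting such a CSV report with the standard library. The column names written by optimum-benchmark are not assumed here; the hypothetical helpers `load_report` and `columns_matching` only list whatever columns exist and filter them by keyword.

```python
import csv


def load_report(path):
    """Read a CSV report and return (column_names, rows_as_dicts)."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = list(reader.fieldnames or [])
    return columns, rows


def columns_matching(columns, keyword):
    """Return the column names that contain `keyword` (case-insensitive)."""
    return [name for name in columns if keyword.lower() in name.lower()]


# Usage (the file name follows the pattern used later in this notebook):
#   columns, rows = load_report("benchmark_inference_report_Meta-Llama-3-8B.csv")
#   print(columns_matching(columns, "memory"))
#   print(columns_matching(columns, "latency"))
```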

*Note: This notebook only works with Ampere and more recent NVIDIA GPUs*
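
To check this requirement before running anything, a small sketch: NVIDIA Ampere GPUs report CUDA compute capability 8.x, so a major version of 8 or higher indicates Ampere or newer. The helper names are hypothetical; `torch.cuda.get_device_capability` is the PyTorch call used to query the device.

```python
def is_ampere_or_newer(capability):
    """True if a CUDA compute capability (major, minor) is 8.0 or higher."""
    major, _minor = capability
    return major >= 8


def current_capability(device_index=0):
    """Return the compute capability of a CUDA device, or None if unavailable."""
    try:
        import torch  # imported lazily so the check above works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device_index)


capability = current_capability()
if capability is None:
    print("No CUDA device detected.")
elif is_ampere_or_newer(capability):
    print(f"Compute capability {capability}: Ampere or newer.")
else:
    print(f"Compute capability {capability}: older than Ampere.")
```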

In [None]:
!pip install optimum-benchmark bitsandbytes datasets peft trl
!pip install nvidia-ml-py


Method for LoRA and QLoRA fine-tuning:

In [None]:
import gc  # used by LoRA() below to free memory after training
import multiprocessing
import os

import torch
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)
from trl import SFTTrainer, SFTConfig

#use bf16 and FlashAttention if supported
if torch.cuda.is_bf16_supported():
  os.system('pip install flash_attn')
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'


def LoRA(model_id, q=False):
  model_name =  model_id.split('/')[1]
  #Tokenizer
  tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
  tokenizer.pad_token = tokenizer.eos_token
  tokenizer.padding_side = 'left'

  ds = load_dataset("timdettmers/openassistant-guanaco")
  #Add the EOS token (read from the tokenizer; "<|end_of_text|>" for Llama 3)
  eos_token = tokenizer.eos_token
  def process(row):
      row["text"] = row["text"]+eos_token
      return row

  ds = ds.map(
      process,
      num_proc= multiprocessing.cpu_count(),
      load_from_cache_file=False,
  )
  if q:
    bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=compute_dtype,
            bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
              model_id, quantization_config=bnb_config, device_map={"": 0}, attn_implementation=attn_implementation
    )
    model = prepare_model_for_kbit_training(model, gradient_checkpointing_kwargs={'use_reentrant':True})
    output_dir = model_name+"_QLoRA"
  else:
    model = AutoModelForCausalLM.from_pretrained(
            model_id, device_map={"": 0}, attn_implementation=attn_implementation, torch_dtype=compute_dtype
    )
    model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={'use_reentrant':True})
    output_dir = model_name+"_LoRA"

  peft_config = LoraConfig(
          lora_alpha=16,
          lora_dropout=0.05,
          r=16,
          bias="none",
          task_type="CAUSAL_LM",
          target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
  )


  training_arguments = SFTConfig(
          output_dir="./"+output_dir,
          optim="adamw_8bit",
          per_device_train_batch_size=2,
          gradient_accumulation_steps=8,
          log_level="debug",
          save_strategy="epoch",
          logging_steps=10,
          learning_rate=1e-4,
          fp16 = not torch.cuda.is_bf16_supported(),
          bf16 = torch.cuda.is_bf16_supported(),
          num_train_epochs=1,
          warmup_ratio=0.1,
          lr_scheduler_type="linear",
          dataset_text_field="text",
          max_seq_length=512,
  )

  trainer = SFTTrainer(
          model=model,
          train_dataset=ds['train'],
          peft_config=peft_config,
          tokenizer=tokenizer,
          args=training_arguments,
  )

  trainer.train()
  del model
  del trainer
  gc.collect()
  torch.cuda.empty_cache()

Collecting flash_attn
  Downloading flash_attn-2.5.9.post1.tar.gz (2.6 MB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting einops (from flash_attn)
  Downloading einops-0.8.0-py3-none-any.whl.metadata (12 kB)
Downloading einops-0.8.0-py3-none-any.whl (43 kB)
Building wheels for collected packages: flash_attn
  Building wheel for flash_attn (setup.py): started
  Building wheel for flash_attn (setup.py): finished with status 'done'
  Created wheel for flash_attn: filename=flash_attn-2.5.9.post1-cp310-cp310-linux_x86_64.whl size=121711011 sha256=c55cb075a15591ebea0e8df05daea937ae206afd67ad1c50524a36f37cbd8d1d
  Stored in directory: /root/.cache/pip/wheels/cc/ad/


Method for benchmarking the inference:

Note:
- Change the input shape for different batch size and sequence length
- Change "model_names" to add the models you wish to benchmark
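
To benchmark several batch sizes and sequence lengths in one go, a grid of input shapes can be generated first. This is only a sketch: the helper name `make_input_shapes` is hypothetical, the dictionary keys mirror the `input_shapes` dictionary used in the cell that follows, and the values 8 and 1000 are illustrative. Using the grid would also require giving `inference_bench` an extra parameter for the shapes, which it does not have as written.

```python
from itertools import product


def make_input_shapes(batch_sizes, sequence_lengths, num_choices=1):
    """Build one input_shapes dict per (batch size, sequence length) pair."""
    return [
        {"batch_size": b, "num_choices": num_choices, "sequence_length": s}
        for b, s in product(batch_sizes, sequence_lengths)
    ]


# Illustrative grid: 2 batch sizes x 2 sequence lengths = 4 configurations.
for shapes in make_input_shapes([1, 8], [200, 1000]):
    print(shapes)
```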

In [None]:
from optimum_benchmark import Benchmark, BenchmarkConfig, TorchrunConfig, InferenceConfig, TrainingConfig, PyTorchConfig
from optimum_benchmark.logging_utils import setup_logging
from transformers import set_seed
import gc
set_seed(1234)

model_names = ["meta-llama/Meta-Llama-3-8B"]

def inference_bench(model_id, quant=False):
    model_name = model_id.split('/')[1]
    launcher_config = TorchrunConfig(nproc_per_node=1)
    input_shapes = {"batch_size": 1, "num_choices": 1, "sequence_length": 200}

    scenario_config = InferenceConfig(latency=True, memory=True, input_shapes=input_shapes)
    if quant:
      name = "benchmark_inference_report_quant"+model_name+".csv"
      quantization_scheme = 'bnb'
      quantization_config = {
                              "bnb_4bit_compute_dtype": "float16",
                              "bnb_4bit_quant_type": "nf4",
                              "bnb_4bit_use_double_quant": True,
                              "llm_int8_enable_fp32_cpu_offload": False,
                              "llm_int8_has_fp16_weight": False,
                              "llm_int8_threshold": 6.0,
                              "load_in_4bit": True,
                              "load_in_8bit": False,
                            }
      backend_config = PyTorchConfig(model=model_id, quantization_scheme=quantization_scheme, torch_dtype="bfloat16", quantization_config=quantization_config, device="cuda", device_ids="0", no_weights=False)
    else:
      name = "benchmark_inference_report_"+model_name+".csv"
      backend_config = PyTorchConfig(model=model_id, device="cuda", torch_dtype="bfloat16", device_ids="0", no_weights=False)
    benchmark_config = BenchmarkConfig(
        name="pytorch_"+model_name,
        scenario=scenario_config,
        launcher=launcher_config,
        backend=backend_config,
    )
    benchmark_report = Benchmark.launch(benchmark_config)


    benchmark_report.log()
    benchmark_config.to_dict()
    benchmark_report.save_csv(name)


for m in model_names:
  LoRA(m, True) #QLoRA
  LoRA(m) #LoRA

  inference_bench(m) #Inference, not quantized (bfloat16)
  inference_bench(m, True) #Inference, quantized


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Downloading readme:   0%|          | 0.00/395 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/20.9M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.11M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/9846 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/518 [00:00<?, ? examples/s]

Map (num_proc=48):   0%|          | 0/9846 [00:00<?, ? examples/s]

Map (num_proc=48):   0%|          | 0/518 [00:00<?, ? examples/s]

config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/177 [00:00<?, ?B/s]

Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

Using auto half precision backend
Currently training with a batch size of: 2
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 8
  Total optimization steps = 615
  Number of trainable parameters = 41,943,040
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.
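
The two figures in this log, 41,943,040 trainable parameters and 615 optimization steps, can be reproduced by hand. The sketch below assumes the Llama 3 8B dimensions from the `LlamaConfig` printed in this notebook's logs, the LoRA rank and target modules from `LoraConfig`, and a step count computed as batches per epoch floor-divided by the gradient accumulation steps (an assumption that matches the logged value).

```python
import math


def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# Model dimensions from the printed LlamaConfig.
hidden_size = 4096
intermediate_size = 14336
num_hidden_layers = 32
num_attention_heads = 32
num_key_value_heads = 8
head_dim = hidden_size // num_attention_heads
kv_dim = num_key_value_heads * head_dim  # k_proj/v_proj output size (grouped-query attention)

r = 16  # LoRA rank from LoraConfig

# (in_features, out_features) of each targeted linear layer in one decoder block.
target_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_dim),
    "v_proj": (hidden_size, kv_dim),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}

per_layer = sum(lora_params(i, o, r) for i, o in target_shapes.values())
trainable = per_layer * num_hidden_layers

# Optimization steps for one epoch, using the values from SFTConfig and the log.
num_examples = 9846
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
optimization_steps = batches_per_epoch // gradient_accumulation_steps

print(f"{trainable:,}")        # 41,943,040
print(optimization_steps)      # 615
```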


Step,Training Loss
10,1.6694
20,1.5952
30,1.4482
40,1.3575
50,1.3457
60,1.3482
70,1.3138
80,1.3882
90,1.3081
100,1.2655


Saving model checkpoint to ./Meta-Llama-3-8B_QLoRA/checkpoint-615
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/config.json
Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.42.3",
  "use_cache": true,
  "vocab_size": 128256
}

tokenizer config file saved in ./Meta-Llama-3-8B_QLoRA/ch

Map (num_proc=48):   0%|          | 0/9846 [00:00<?, ? examples/s]

Map (num_proc=48):   0%|          | 0/518 [00:00<?, ? examples/s]

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/config.json
Model config LlamaConfig {
  "_name_or_path": "meta-llama/Meta-Llama-3-8B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.42.3",
  "use_cache": true,
  "vocab_size": 128256
}

loading weights file model.safetensors from cache at /root/.cache/huggingf

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

All model checkpoint weights were used when initializing LlamaForCausalLM.

All the weights of LlamaForCausalLM were initialized from the model checkpoint at meta-llama/Meta-Llama-3-8B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": 128001,
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info d

Step,Training Loss
10,1.6283
20,1.5879
30,1.4404
40,1.3179
50,1.3019
60,1.3063
70,1.2839
80,1.3523
90,1.2779
100,1.2343


Saving model checkpoint to ./Meta-Llama-3-8B_LoRA/checkpoint-615

tokenizer config file saved in ./Meta-Llama-3-8B_LoRA/chec

[ISOLATED-PROCESS][2024-07-05 11:42:07,633][torchrun][INFO] - 	+ Starting benchmark in isolated process
[RANK-PROCESS-0][2024-07-05 11:42:10,826][torchrun][INFO] - 	+ Setting torch.distributed cuda device to 0
[RANK-PROCESS-0][2024-07-05 11:42:10,841][torchrun][INFO] - 	+ Initializing torch.distributed process group
[RANK-PROCESS-0][2024-07-05 11:42:10,884][datasets][INFO] - PyTorch version 2.1.0+cu118 available.
[RANK-PROCESS-0][2024-07-05 11:42:12,280][pytorch][INFO] - Allocating pytorch backend
[RANK-PROCESS-0][2024-07-05 11:42:12,281][pytorch][INFO] - 	+ Seeding backend with 42
[RANK-PROCESS-0][2024-07-05 11:42:12,284][pytorch][INFO] - 	+ Benchmarking a Transformers model
[RANK-PROCESS-0][2024-07-05 11:42:13,918][pytorch][INFO] - 	+ Using aut

Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.16it/s]


[RANK-PROCESS-0][2024-07-05 11:42:17,820][pytorch][INFO] - 	+ Moving Transformers model to device: cuda
[RANK-PROCESS-0][2024-07-05 11:42:19,809][pytorch][INFO] - 	+ Turning on model's eval mode
[RANK-PROCESS-0][2024-07-05 11:42:19,813][inference][INFO] - Allocating inference scenario
[RANK-PROCESS-0][2024-07-05 11:42:19,814][inference][INFO] - 	+ Creating input generator
[RANK-PROCESS-0][2024-07-05 11:42:19,814][inference][INFO] - 	+ Generating Text Generation inputs
[RANK-PROCESS-0][2024-07-05 11:42:19,815][inference][INFO] - 	+ Updating Text Generation kwargs with default values
[RANK-PROCESS-0][2024-07-05 11:42:19,816][inference][INFO] - 	+ Initializing Text Generation report
[RANK-PROCESS-0][2024-07-05 11:42:19,817][inference][INFO] - 	+ Pre



[RANK-PROCESS-0][2024-07-05 11:42:26,370][inference][INFO] - 	+ Running Text Generation memory tracking
[RANK-PROCESS-0][2024-07-05 11:42:26,371][memory][INFO] - 	+ Tracking RAM memory of process [6131]
[RANK-PROCESS-0][2024-07-05 11:42:26,372][memory][INFO] - 	+ Tracking VRAM memory of CUDA devices [0]
[RANK-PROCESS-0][2024-07-05 11:42:26,372][memory][INFO] - 	+ Tracking Allocated/Reserved memory of 1 Pytorch CUDA devices
[RANK-PROCESS-0][2024-07-05 11:42:43,185][memory][INFO] - 		+ prefill memory:
[RANK-PROCESS-0][2024-07-05 11:42:43,186][memory][INFO] - 			- max RAM: 1326.043136 (MB)
[RANK-PROCESS-0][2024-07-05 11:42:43,187][memory][INFO] - 			- max global VRAM: 17474.781184 (MB)
[RANK-PROCESS-0][2024-07-05 11:42:43,187][memory][INFO] - 			- m

Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.23it/s]


[RANK-PROCESS-0][2024-07-05 11:43:54,951][pytorch][INFO] - 	+ Turning on model's eval mode
[RANK-PROCESS-0][2024-07-05 11:43:54,955][inference][INFO] - Allocating inference scenario
[RANK-PROCESS-0][2024-07-05 11:43:54,956][inference][INFO] - 	+ Creating input generator
[RANK-PROCESS-0][2024-07-05 11:43:54,956][inference][INFO] - 	+ Generating Text Generation inputs
[RANK-PROCESS-0][2024-07-05 11:43:54,957][inference][INFO] - 	+ Updating Text Generation kwargs with default values
[RANK-PROCESS-0][2024-07-05 11:43:54,957][inference][INFO] - 	+ Initializing Text Generation report
[RANK-PROCESS-0][2024-07-05 11:43:54,958][inference][INFO] - 	+ Preparing inputs for Inference
[RANK-PROCESS-0][2024-07-05 11:43:54,997][inference][INFO] - 	+ Preparing ba



[RANK-PROCESS-0][2024-07-05 11:44:03,154][inference][INFO] - 	+ Running Text Generation memory tracking
[RANK-PROCESS-0][2024-07-05 11:44:03,155][memory][INFO] - 	+ Tracking RAM memory of process [6510]
[RANK-PROCESS-0][2024-07-05 11:44:03,156][memory][INFO] - 	+ Tracking VRAM memory of CUDA devices [0]
[RANK-PROCESS-0][2024-07-05 11:44:03,156][memory][INFO] - 	+ Tracking Allocated/Reserved memory of 1 Pytorch CUDA devices
[RANK-PROCESS-0][2024-07-05 11:44:21,035][memory][INFO] - 		+ prefill memory:
[RANK-PROCESS-0][2024-07-05 11:44:21,036][memory][INFO] - 			- max RAM: 1057.619968 (MB)
[RANK-PROCESS-0][2024-07-05 11:44:21,037][memory][INFO] - 			- max global VRAM: 7100.170240 (MB)
[RANK-PROCESS-0][2024-07-05 11:44:21,038][memory][INFO] - 			- ma
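
The two "max global VRAM" values reported during prefill can be compared directly. This sketch assumes the first run (17474.781184 MB) is the bfloat16 model and the second (7100.170240 MB) is the 4-bit one, as the order of the calls in the loop suggests. The helper name `vram_summary` is hypothetical.

```python
def vram_summary(baseline_mb, quantized_mb):
    """Compare two peak-VRAM readings given in MB."""
    saved_mb = baseline_mb - quantized_mb
    return {
        "saved_mb": saved_mb,
        "reduction_fraction": saved_mb / baseline_mb,
        "ratio": baseline_mb / quantized_mb,
    }


# Prefill "max global VRAM" values copied from the logs above.
bf16_prefill_vram_mb = 17474.781184
nf4_prefill_vram_mb = 7100.170240

summary = vram_summary(bf16_prefill_vram_mb, nf4_prefill_vram_mb)
print(f"Saved: {summary['saved_mb']:.1f} MB")
print(f"Reduction: {summary['reduction_fraction']:.1%}")
print(f"bfloat16 / 4-bit ratio: {summary['ratio']:.2f}x")
```

Note that global VRAM covers everything resident on the device (weights, activations, CUDA context), so the ratio is smaller than the nominal 16-bit to 4-bit weight compression.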