This notebook compares GPTQ and bitsandbytes for quantizing Llama 2.
It runs on Google Colab Pro; the bitsandbytes part also runs on the free instance. It can also run on a machine with at least 24 GB of CPU RAM and a GPU with 12 GB of VRAM.

For more details check out this article: https://kaitchup.substack.com/p/gptq-or-bitsandbytes-which-quantization

We need the latest version of AutoGPTQ, so we will install it from GitHub.

In [None]:
!git clone https://github.com/PanQiWei/AutoGPTQ.git

Cloning into 'AutoGPTQ'...
remote: Enumerating objects: 3125, done.
remote: Counting objects: 100% (923/923), done.
remote: Compressing objects: 100% (379/379), done.
remote: Total 3125 (delta 655), reused 633 (delta 543), pack-reused 2202
Receiving objects: 100% (3125/3125), 7.59 MiB | 6.85 MiB/s, done.
Resolving deltas: 100% (2054/2054), done.


First, we patch the repository to enable use_auth_token support. Skip this step if your model doesn't require an access token. This patch may also become obsolete soon, so you can try without it first.

In [None]:
!wget https://about.benjaminmarie.com/data/py/auto-gptq-patch/_utils.py
!wget https://about.benjaminmarie.com/data/py/auto-gptq-patch/auto.py

!mv _utils.py AutoGPTQ/auto_gptq/modeling/
!mv auto.py AutoGPTQ/auto_gptq/modeling/

--2023-08-21 12:30:02--  https://about.benjaminmarie.com/data/py/auto-gptq-patch/_utils.py
Resolving about.benjaminmarie.com (about.benjaminmarie.com)... 192.95.30.6
Connecting to about.benjaminmarie.com (about.benjaminmarie.com)|192.95.30.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10839 (11K) [text/x-python]
Saving to: ‘_utils.py’


2023-08-21 12:30:03 (130 MB/s) - ‘_utils.py’ saved [10839/10839]

--2023-08-21 12:30:03--  https://about.benjaminmarie.com/data/py/auto-gptq-patch/auto.py
Resolving about.benjaminmarie.com (about.benjaminmarie.com)... 192.95.30.6
Connecting to about.benjaminmarie.com (about.benjaminmarie.com)|192.95.30.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4697 (4.6K) [text/x-python]
Saving to: ‘auto.py’


2023-08-21 12:30:04 (331 MB/s) - ‘auto.py’ saved [4697/4697]



In [None]:
!pip install bitsandbytes accelerate
!pip install nvidia-ml-py3
%cd AutoGPTQ
!pip install .


Collecting bitsandbytes
  Downloading bitsandbytes-0.41.1-py3-none-any.whl (92.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92.6/92.6 MB 20.0 MB/s eta 0:00:00
Collecting accelerate
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 244.2/244.2 kB 28.4 MB/s eta 0:00:00
Installing collected packages: bitsandbytes, accelerate
Successfully installed accelerate-0.21.0 bitsandbytes-0.41.1
Collecting nvidia-ml-py3
  Downloading nvidia-ml-py3-7.352.0.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: nvidia-ml-py3
  Building wheel for nvidia-ml-py3 (setup.py) ... done
  Created wheel for nvidia-ml-py3: filename=nvidia_ml_py3-7.352.0-py3-none-any.whl size=19174 sha256=1c233b47298e63d07bf7cd14423af74ccb75862acc30568b7a9a4200705e3d7a
  Stored in directory: /root/.cache/pip/wheels/5c/d8/c

We import all the necessary libraries. I use pynvml to monitor the VRAM consumption.

In [None]:
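import time

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def print_gpu_utilization():
    """Print the memory currently in use on GPU 0, as reported by NVML."""
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    # info.used is in bytes; dividing by 1024**2 converts it to mebibytes.
    print(f"GPU memory occupied: {info.used//1024**2} MB.")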
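A note on units: the helper divides by 1024**2, so the figure it prints is in mebibytes (MiB) even though the label says MB. The minimal sketch below shows the difference; the function names and the sample byte count are hypothetical and only serve the illustration.

```python
MIB = 1024 ** 2  # mebibyte: the divisor used by print_gpu_utilization()
MB = 1000 ** 2   # decimal megabyte

def bytes_to_mib(n_bytes):
    """Whole mebibytes in n_bytes."""
    return n_bytes // MIB

def bytes_to_mb(n_bytes):
    """Whole decimal megabytes in n_bytes."""
    return n_bytes // MB

# Hypothetical raw byte count, chosen to correspond to exactly 6063 MiB.
used_bytes = 6063 * MIB

print(bytes_to_mib(used_bytes), "MiB")
print(bytes_to_mb(used_bytes), "MB")
```

The two figures differ by about 5%, which matters when comparing these readings against sizes quoted in decimal units.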

Load the tokenizer.

In [None]:
# Replace the following with your own Hugging Face access token.
access_token = ""

# The tokenizer of Llama 2
pretrained_model_tokenizer = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_tokenizer, use_fast=True, use_auth_token=access_token)

I uploaded a version of Llama 2 quantized with AutoGPTQ to the Hugging Face Hub.
The code runs 5 prompts and reports the average throughput in tokens per second.
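As a minimal sketch of how that average is formed: the benchmark divides the total number of tokens by the total generation time, which weights long generations more heavily than a plain mean of the per-prompt rates would. The function name `average_throughput` and the sample numbers are hypothetical and are not measurements from this notebook.

```python
def average_throughput(token_counts, durations):
    """Total tokens divided by total seconds (a token-weighted average)."""
    if len(token_counts) != len(durations):
        raise ValueError("need one duration per token count")
    total_time = sum(durations)
    if total_time <= 0:
        raise ValueError("total duration must be positive")
    return sum(token_counts) / total_time

# Hypothetical measurements: tokens produced and seconds taken for two prompts.
counts = [100, 900]
seconds = [10.0, 30.0]

# Token-weighted average: 1000 tokens over 40 seconds.
weighted = average_throughput(counts, seconds)
# Plain mean of the two per-prompt rates (10 and 30 tokens/second).
mean_of_rates = sum(c / s for c, s in zip(counts, seconds)) / len(counts)

print(f"token-weighted average: {weighted:.1f} tokens/second")
print(f"mean of per-prompt rates: {mean_of_rates:.1f} tokens/second")
```

The two figures only coincide when every prompt takes the same time, so it is worth stating which one a benchmark reports.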

In [None]:
# Name of the quantized model on the Hugging Face Hub
quantized_model_dir = "kaitchup/llama-2-7b-4bit-autogptq"
# Load the quantized model; no access token is passed for this repository
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, use_safetensors=True, device="cuda:0", use_auth_token=False)
# print_gpu_utilization() prints the reading itself and returns None,
# so the outer print() adds the "None" line seen in the output.
print(print_gpu_utilization())

# Test prompts
prompts = [
    "Tell me about gravity.",
    "What is AI?",
    "Write an essay about intelligence.",
    "Cite 20 famous people.",
    "Give me the recipe for the best chicken curry.",
]

duration = 0.0
total_length = 0
for prompt in prompts:
  model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
  start_time = time.time()
  output = model.generate(**model_inputs, max_length=1000)[0]
  # Measure once so the per-prompt rate and the running total use the same timing
  elapsed = time.time() - start_time
  duration += elapsed
  # len(output) counts the prompt tokens as well as the generated ones
  total_length += len(output)
  tok_sec_prompt = round(len(output)/elapsed, 3)
  print("Prompt --- %s tokens/seconds ---" % (tok_sec_prompt))
  print(print_gpu_utilization())

# Average over all prompts: total tokens divided by total generation time
tok_sec = round(total_length/duration, 3)
print("Average --- %s tokens/seconds ---" % (tok_sec))




Downloading (…)lve/main/config.json:   0%|          | 0.00/630 [00:00<?, ?B/s]

Downloading (…)quantize_config.json:   0%|          | 0.00/211 [00:00<?, ?B/s]

Downloading (…)bit-128g.safetensors:   0%|          | 0.00/3.90G [00:00<?, ?B/s]



GPU memory occupied: 6063 MB.
None
Prompt --- 21.743 tokens/seconds ---
GPU memory occupied: 6695 MB.
None
Prompt --- 28.831 tokens/seconds ---
GPU memory occupied: 7555 MB.
None
Prompt --- 28.983 tokens/seconds ---
GPU memory occupied: 7555 MB.
None
Prompt --- 28.706 tokens/seconds ---
GPU memory occupied: 7555 MB.
None
Prompt --- 29.309 tokens/seconds ---
GPU memory occupied: 7555 MB.
None
Average --- 27.898 tokens/seconds ---
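The first prompt in the output above is noticeably slower than the other four (21.7 versus roughly 29 tokens/second), which may reflect one-off warm-up costs. Below is a minimal sketch of a harness that runs an untimed warm-up pass first and counts only newly generated tokens. It assumes a hypothetical `generate_fn(prompt)` callable returning `(prompt_token_count, output_token_count)`; with the model above, such a callable would wrap the tokenizer and `model.generate` calls. The stand-in `fake_generate` exists only so the sketch runs without a GPU.

```python
import time

def benchmark(generate_fn, prompts, warmup=1):
    """Time generate_fn on each prompt after `warmup` untimed runs.

    generate_fn is a hypothetical callable that returns
    (prompt_token_count, output_token_count) for one prompt.
    Returns (per_prompt_rates, overall_rate) in generated tokens/second.
    """
    if not prompts:
        raise ValueError("need at least one prompt")
    for _ in range(warmup):
        generate_fn(prompts[0])  # untimed: absorbs one-off start-up costs

    rates, total_new, total_time = [], 0, 0.0
    for p in prompts:
        start = time.perf_counter()
        n_prompt, n_output = generate_fn(p)
        elapsed = time.perf_counter() - start
        new_tokens = n_output - n_prompt  # exclude the prompt tokens
        rates.append(new_tokens / elapsed)
        total_new += new_tokens
        total_time += elapsed
    return rates, total_new / total_time

# Stand-in generator: sleeps briefly and pretends that a 5-token prompt
# produced a 55-token output (50 new tokens).
def fake_generate(prompt):
    time.sleep(0.01)
    return 5, 55

rates, overall = benchmark(fake_generate, ["a", "b", "c"], warmup=1)
print(f"{len(rates)} prompts, overall {overall:.0f} generated tokens/second")
```

Excluding the prompt tokens matters most for short generations, where the prompt makes up a larger share of the returned sequence.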


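*Note: If you ran the previous cell, which generates with the AutoGPTQ-quantized model, disconnect and delete the runtime and start a new one to be sure that all the memory is free before running bitsandbytes nf4. The cell below reuses the imports, the helper function, and the tokenizer, so re-run the setup cells above (installation, imports, and tokenizer) first.*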

The following code benchmarks bitsandbytes nf4 with double quantization. Double quantization slows down inference, so you may want to set "bnb_4bit_use_double_quant" to False for faster inference, at the cost of higher memory usage.
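To give a rough sense of how much memory double quantization saves, the sketch below reproduces the per-parameter overhead arithmetic from the QLoRA paper: 32-bit quantization constants over blocks of 64 weights, which double quantization replaces with 8-bit constants plus a second level of 32-bit constants over blocks of 256. That the bitsandbytes version used here applies exactly these block sizes is an assumption, and the 7 billion parameter count is a round figure, so treat the result as an order-of-magnitude estimate.

```python
def overhead_bits_per_param(block_size=64, const_bits=32,
                            double_quant=False,
                            dq_const_bits=8, dq_block_size=256):
    """Average bits per weight spent on quantization constants.

    Defaults follow the block sizes described in the QLoRA paper
    (an assumption about the configuration used in this notebook).
    """
    if not double_quant:
        # One const_bits-wide constant per block of weights.
        return const_bits / block_size
    # One small constant per block, plus one full-precision constant
    # per group of dq_block_size first-level constants.
    return dq_const_bits / block_size + const_bits / (block_size * dq_block_size)

plain = overhead_bits_per_param()
nested = overhead_bits_per_param(double_quant=True)
saved_bits = plain - nested

n_params = 7e9  # assumption: roughly 7 billion quantized weights
saved_gb = saved_bits * n_params / 8 / 1e9

print(f"overhead without double quantization: {plain:.3f} bits/parameter")
print(f"overhead with double quantization:    {nested:.3f} bits/parameter")
print(f"estimated saving for {n_params:.0e} parameters: {saved_gb:.2f} GB")
```

Under these assumptions the saving is a few hundred megabytes for a 7B model, which is the memory you give back when you disable double quantization for speed.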

In [None]:
import time
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM

# Reuses access_token, pretrained_model_tokenizer, tokenizer and
# print_gpu_utilization() from the cells above.
compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,  # set to False for faster inference, at a higher memory cost
)
model = AutoModelForCausalLM.from_pretrained(
          pretrained_model_tokenizer, quantization_config=bnb_config, device_map={"": 0}, use_auth_token=access_token
)

# Same test prompts as for the AutoGPTQ benchmark
prompts = [
    "Tell me about gravity.",
    "What is AI?",
    "Write an essay about intelligence.",
    "Cite 20 famous people.",
    "Give me the recipe for the best chicken curry.",
]

duration = 0.0
total_length = 0
for prompt in prompts:
  model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
  start_time = time.time()
  output = model.generate(**model_inputs, max_length=1000)[0]
  # Measure once so the per-prompt rate and the running total use the same timing
  elapsed = time.time() - start_time
  duration += elapsed
  # len(output) counts the prompt tokens as well as the generated ones
  total_length += len(output)
  tok_sec_prompt = round(len(output)/elapsed, 3)
  print("Prompt --- %s tokens/seconds ---" % (tok_sec_prompt))
  # print_gpu_utilization() returns None, hence the "None" lines in the output
  print(print_gpu_utilization())

# Average over all prompts: total tokens divided by total generation time
tok_sec = round(total_length/duration, 3)
print("Average --- %s tokens/seconds ---" % (tok_sec))

Downloading (…)lve/main/config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]



Downloading (…)fetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

Prompt --- 12.881 tokens/seconds ---
GPU memory occupied: 7411 MB.
None
Prompt --- 13.669 tokens/seconds ---
GPU memory occupied: 7411 MB.
None
Prompt --- 13.683 tokens/seconds ---
GPU memory occupied: 7411 MB.
None
Prompt --- 13.691 tokens/seconds ---
GPU memory occupied: 7411 MB.
None
Prompt --- 13.946 tokens/seconds ---
GPU memory occupied: 7411 MB.
None
Average --- 13.513 tokens/seconds ---
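To put the two runs side by side, the short sketch below recomputes the speed ratio and the memory difference from the averages and final GPU readings printed above. It only summarizes this single run on this hardware; the dictionary layout and key names are chosen here for illustration.

```python
# Figures copied from the outputs above (average throughput, last GPU reading).
results = {
    "AutoGPTQ 4-bit": {"tokens_per_s": 27.898, "gpu_mb": 7555},
    "bitsandbytes nf4 (double quant)": {"tokens_per_s": 13.513, "gpu_mb": 7411},
}

gptq = results["AutoGPTQ 4-bit"]
bnb = results["bitsandbytes nf4 (double quant)"]

speed_ratio = gptq["tokens_per_s"] / bnb["tokens_per_s"]
mem_diff = gptq["gpu_mb"] - bnb["gpu_mb"]

print(f"AutoGPTQ throughput is {speed_ratio:.2f}x that of bitsandbytes nf4")
print(f"AutoGPTQ occupies {mem_diff} MB more GPU memory at the end of the run")
```

In this run, the AutoGPTQ model generates about twice as fast while occupying slightly more GPU memory.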
