This notebook demonstrates how to quantize Llama 2 with AutoAWQ. It also compares AWQ with GPTQ.
It runs on Google Colab Pro. Quantization of Llama 2 7B can be done with a GPU with 6 GB of VRAM but you need at least 36 GB of CPU RAM.

For inference, note that AWQ models only run on Ampere or more recent GPUs. In Google Colab, only the A100 can run AWQ models.
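
One way to check this requirement before loading a model is to look at the GPU's CUDA compute capability. The sketch below assumes that "Ampere or more recent" corresponds to compute capability 8.0 and above (the A100 reports 8.0, while the T4 reports 7.5). The helper names `is_ampere_or_newer` and `check_current_gpu` are illustrative and not part of AutoAWQ.

```python
def is_ampere_or_newer(major, minor=0):
    """Return True for CUDA compute capability 8.0 or higher (assumed AWQ requirement)."""
    return (major, minor) >= (8, 0)


def check_current_gpu():
    """Check GPU 0 if PyTorch and a CUDA device are available, otherwise return None."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return is_ampere_or_newer(major, minor)


print("A100 (8.0):", is_ampere_or_newer(8, 0))
print("T4 (7.5):", is_ampere_or_newer(7, 5))
print("Current GPU:", check_current_gpu())
```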

For more details check out this article: [Fast and Small Llama 2 with Activation-Aware Quantization (AWQ)](https://kaitchup.substack.com/p/fast-and-small-llama-2-with-activation)

Install autoawq. I only use nvidia-ml-py3 to monitor the GPU memory usage.

In [None]:
!pip install autoawq
!pip install nvidia-ml-py3

Collecting autoawq
  Downloading autoawq-0.1.2-cp310-cp310-manylinux2014_x86_64.whl (17.4 MB)
Collecting transformers>=4.32.0 (from autoawq)
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting tokenizers>=0.12.1 (from autoawq)
  Downloading tokenizers-0.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting accelerate (from autoawq)
  Downloading accelerate-0.23.0-py3-none-any.whl (258 kB)
Collecting sentencepiece (from autoawq)
  Downloading sente

We will use Llama 2, so we need to be connected to the HF hub.

In [None]:
from huggingface_hub import notebook_login

notebook_login()


Quantize Llama 2 7B with AWQ in 4-bit. The model is saved with safetensors.

In [None]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'meta-llama/Llama-2-7b-hf'
quant_path = 'Llama-2-7b-awq-4bit'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized("./"+quant_path, safetensors=True)
tokenizer.save_pretrained("./"+quant_path)



AWQ: 100%|██████████| 32/32 [18:26<00:00, 34.59s/it]
Using pad_token, but it is not set yet.


('./Llama-2-7b-awq-4bit/tokenizer_config.json',
 './Llama-2-7b-awq-4bit/special_tokens_map.json',
 './Llama-2-7b-awq-4bit/tokenizer.model',
 './Llama-2-7b-awq-4bit/added_tokens.json',
 './Llama-2-7b-awq-4bit/tokenizer.json')
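
To make the three entries of `quant_config` more concrete, here is a minimal pure-Python sketch of group-wise 4-bit quantization with a zero point: weights are split into groups of `q_group_size`, and each group gets its own scale and zero point. This is plain round-to-nearest quantization, not AWQ itself: AWQ additionally rescales weights using activation statistics before quantizing, which this sketch omits. The function names are illustrative, and including 0.0 in each group's range is a simplification I chose, not something taken from AutoAWQ.

```python
import math


def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) round-to-nearest quantization of one group."""
    qmax = 2 ** w_bit - 1  # 15 for 4-bit
    # Include 0.0 in the range so the zero point always fits in [0, qmax]
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero group: any scale reconstructs it exactly
    zero = round(-w_min / scale)
    q = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Map the integers back to approximate floating-point weights."""
    return [(v - zero) * scale for v in q]


def quantize(weights, q_group_size=128, w_bit=4):
    """Quantize a flat list of weights group by group."""
    return [quantize_group(weights[i:i + q_group_size], w_bit)
            for i in range(0, len(weights), q_group_size)]


# Toy weights standing in for one row of a weight matrix
weights = [0.1 * math.sin(i) for i in range(300)]
groups = quantize(weights)
restored = [w for q, s, z in groups for w in dequantize_group(q, s, z)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("groups:", len(groups), "max abs error:", max_err)
```

A smaller group size gives each scale fewer weights to cover, which tends to lower the rounding error at the cost of storing more scales and zero points.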

Utilities to benchmark the VRAM consumption and the perplexity of the model.

In [None]:
from datasets import load_dataset
from tqdm import tqdm
import torch

from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

dataset = load_dataset("timdettmers/openassistant-guanaco")['test']


# Return the perplexity of the model on the dataset.
# Each example is scored individually. Examples longer than max_length (2048 tokens)
# are scored with a sliding window that advances by stride (512 tokens).
def ppl_model(model, tokenizer, dataset):
    nlls = []
    max_length = 2048
    stride = 512
    texts = dataset['text']  # read the column once instead of once per example
    for text in tqdm(texts):
        encodings = tokenizer(text, return_tensors="pt")
        seq_len = encodings.input_ids.size(1)
        prev_end_loc = 0
        for begin_loc in range(0, seq_len, stride):
            end_loc = min(begin_loc + max_length, seq_len)
            # Only the tokens not covered by the previous window are scored
            trg_len = end_loc - prev_end_loc
            input_ids = encodings.input_ids[:, begin_loc:end_loc].to("cuda")
            target_ids = input_ids.clone()
            # -100 masks the overlapping context tokens out of the loss
            target_ids[:, :-trg_len] = -100
            with torch.no_grad():
                outputs = model(input_ids, labels=target_ids)
                # outputs.loss is the mean negative log-likelihood of the scored tokens
                neg_log_likelihood = outputs.loss
            nlls.append(neg_log_likelihood)
            prev_end_loc = end_loc
            if end_loc == seq_len:
                break
    # Perplexity is the exponential of the mean window-level loss
    ppl = torch.exp(torch.stack(nlls).mean())
    return ppl

Repo card metadata block was not found. Setting CardData to empty.
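
The windowing logic in `ppl_model` is easier to follow in isolation. The sketch below reproduces only the index arithmetic of that loop, with no model involved, and prints the windows for a hypothetical 3000-token example. The generator name `sliding_windows` is mine.

```python
def sliding_windows(seq_len, max_length=2048, stride=512):
    """Yield (begin_loc, end_loc, trg_len) the same way the loop in ppl_model does."""
    prev_end_loc = 0
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc  # tokens scored in this window
        yield begin_loc, end_loc, trg_len
        prev_end_loc = end_loc
        if end_loc == seq_len:
            break


for window in sliding_windows(3000):
    print(window)
# (0, 2048, 2048)
# (512, 2560, 512)
# (1024, 3000, 440)
```

Every token is scored exactly once (2048 + 512 + 440 = 3000), and an example of 2048 tokens or fewer produces a single window. Note that `ppl_model` averages one loss per window, so a 440-token window weighs as much as a 2048-token one in the final mean.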


Run this cell and the next cell to benchmark GPTQ.

In [None]:
!pip install auto-gptq optimum

Collecting auto-gptq
  Downloading auto_gptq-0.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
Collecting optimum
  Downloading optimum-1.13.2.tar.gz (300 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting rouge (from auto-gptq)
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Collecting peft (from auto-gptq)
  Downloading peft-0.5.0-py3-none-any.whl (85 kB)
Building wheels for collected packages: optimum
  Building wheel for optimum (pyproject.toml) ...

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the GPTQ model on GPU 0
model = AutoModelForCausalLM.from_pretrained("kaitchup/Llama-2-7b-gptq-4bit", device_map={"": 0})
# print_gpu_utilization() prints the value itself and returns None, so it is not wrapped in print()
print_gpu_utilization()
tokenizer = AutoTokenizer.from_pretrained("kaitchup/Llama-2-7b-gptq-4bit", use_fast=True)
ppl = ppl_model(model, tokenizer, dataset)
print(ppl)

GPU memory occupied: 5863 MB.


100%|██████████| 518/518 [00:39<00:00, 13.25it/s]


tensor(5.4423, device='cuda:0')


Benchmark AWQ for memory consumption and measure perplexity.

In [None]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized('./Llama-2-7b-awq-4bit', 'model.safetensors', fuse_layers=True)
# print_gpu_utilization() prints the value itself and returns None, so it is not wrapped in print()
print_gpu_utilization()
tokenizer = AutoTokenizer.from_pretrained("./Llama-2-7b-awq-4bit")
ppl = ppl_model(model, tokenizer, dataset)
print(ppl)

Replacing layers...: 100%|██████████| 32/32 [00:04<00:00,  6.77it/s]

GPU memory occupied: 8713 MB.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
100%|██████████| 518/518 [01:18<00:00,  6.56it/s]

tensor(5.3631, device='cuda:0')
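
To put the two benchmark runs side by side, the short sketch below computes relative differences from the numbers reported in the cell outputs above (memory in MB, perplexity, and the wall-clock time of the 518-example evaluation taken from the progress bars). Keep in mind that the memory figure is what NVML reports as used on GPU 0, which covers everything resident on the device at that moment, not only the model weights.

```python
# Values copied from the GPTQ and AWQ cell outputs
gptq = {"memory_mb": 5863, "perplexity": 5.4423, "eval_seconds": 39}
awq = {"memory_mb": 8713, "perplexity": 5.3631, "eval_seconds": 78}


def relative_change(new, old):
    """Relative difference of new with respect to old (0.10 means +10%)."""
    return (new - old) / old


memory_change = relative_change(awq["memory_mb"], gptq["memory_mb"])
ppl_change = relative_change(awq["perplexity"], gptq["perplexity"])
time_change = relative_change(awq["eval_seconds"], gptq["eval_seconds"])

print(f"AWQ vs GPTQ memory:     {memory_change:+.1%}")
print(f"AWQ vs GPTQ perplexity: {ppl_change:+.1%}")
print(f"AWQ vs GPTQ eval time:  {time_change:+.1%}")
```

In this run, AWQ reaches a slightly lower perplexity while the reported memory usage and evaluation time are higher than for GPTQ.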





Run 5 different prompts and measure the number of tokens generated per second.

In [None]:
import time

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized('./Llama-2-7b-awq-4bit', 'model.safetensors', fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained("./Llama-2-7b-awq-4bit")
# print_gpu_utilization() prints the value itself and returns None, so it is not wrapped in print()
print_gpu_utilization()

# Test prompts
prompts = [
    "Tell me about gravity.",
    "What is AI?",
    "Write an essay about intelligence.",
    "Cite 20 famous people.",
    "Give me the recipe for the best chicken curry.",
]

duration = 0.0
total_length = 0
for prompt in prompts:
    model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    start_time = time.time()
    # The log below shows that max_new_tokens (=4096) takes precedence over max_length here
    output = model.generate(**model_inputs, max_length=1000)[0]
    elapsed = time.time() - start_time
    duration += elapsed
    # len(output) counts the prompt tokens as well as the newly generated ones
    total_length += len(output)
    tok_sec_prompt = round(len(output) / elapsed, 3)
    print("Prompt --- %s tokens/seconds ---" % (tok_sec_prompt))
    print_gpu_utilization()

tok_sec = round(total_length / duration, 3)
print("Average --- %s tokens/seconds ---" % (tok_sec))

Replacing layers...: 100%|██████████| 32/32 [00:04<00:00,  7.21it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
GPU memory occupied: 14539 MB.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Both `max_new_tokens` (=4096) and `max_length`(=1000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.

Prompt --- 76.199 tokens/seconds ---
GPU memory occupied: 14899 MB.
Prompt --- 79.28 tokens/seconds ---
GPU memory occupied: 14899 MB.
Prompt --- 79.794 tokens/seconds ---
GPU memory occupied: 14899 MB.
Prompt --- 79.415 tokens/seconds ---
GPU memory occupied: 14899 MB.
Prompt --- 79.618 tokens/seconds ---
GPU memory occupied: 14899 MB.
Average --- 78.838 tokens/seconds ---
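
The benchmark cell above divides `len(output)` by the elapsed time, and for a decoder-only model the output sequence also contains the prompt tokens. With short prompts and long generations the difference is small, but a variant that counts only newly generated tokens could look like the sketch below. `fake_generate` is a hypothetical stand-in for `model.generate` so that the example runs without a GPU, and `time_generation` is an illustrative helper name.

```python
import time


def time_generation(generate_fn, prompt_ids):
    """Return (number of new tokens, elapsed seconds) for one generation call."""
    start = time.perf_counter()
    output_ids = generate_fn(prompt_ids)
    elapsed = time.perf_counter() - start
    # Subtract the prompt length so only generated tokens are counted
    new_tokens = len(output_ids) - len(prompt_ids)
    return new_tokens, elapsed


def fake_generate(prompt_ids, n_new=50):
    """Stand-in for model.generate: returns the prompt followed by n_new dummy tokens."""
    time.sleep(0.01)
    return list(prompt_ids) + [0] * n_new


total_tokens, total_time = 0, 0.0
for prompt_ids in ([1, 2, 3], [1, 2, 3, 4, 5]):
    n, t = time_generation(fake_generate, prompt_ids)
    total_tokens += n
    total_time += t

# Dividing the totals weights each prompt by its generation time
print(round(total_tokens / total_time, 1), "tokens/second")
```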
