<a href="https://colab.research.google.com/github/tuhinmallick/AI-for-Fashion/blob/main/Fast_and_Small_Llama_3_with_Activation_Aware_Quantization_(AWQ).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates how to quantize Llama 3 with AutoAWQ. It also compares AWQ with GPTQ.
It runs on Google Colab Pro. Quantizing Llama 3 8B works on a GPU with 12 GB of VRAM, but you also need at least 14 GB of CPU RAM.

For inference, note that AWQ models only run on Ampere or more recent GPUs. In Google Colab, only the A100 and L4 GPUs can run AWQ models.
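
To check a GPU before trying to load an AWQ model, you can look at its CUDA compute capability. The sketch below assumes that "Ampere or more recent" means compute capability 8.0 or higher (the A100 reports 8.0 and the L4 reports 8.9, while the T4 reports 7.5). The function names and the threshold are my own illustration, not something AutoAWQ exposes.

```python
def supports_awq_kernels(capability, minimum=(8, 0)):
    """Return True if a (major, minor) compute capability meets the minimum.

    The (8, 0) default is an assumption taken from the "Ampere or more recent"
    requirement stated above.
    """
    return tuple(capability) >= tuple(minimum)


def describe_current_gpu():
    """Report whether GPU 0 meets the assumed minimum, if PyTorch can see one."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA GPU detected"
    cap = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    verdict = "meets" if supports_awq_kernels(cap) else "is below"
    return f"{name}: compute capability {cap[0]}.{cap[1]} {verdict} the assumed 8.0 minimum"


if __name__ == "__main__":
    print(describe_current_gpu())
```

On a Colab T4 runtime, this check would report the GPU as below the assumed minimum.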

For more details check out this article: [Fast and Small Llama 3 with Activation-Aware Quantization (AWQ)](https://kaitchup.substack.com/p/fast-and-small-llama-2-with-activation)

Install AutoAWQ and bitsandbytes. I only use nvidia-ml-py3 to monitor the GPU memory usage.

In [None]:
!pip install autoawq bitsandbytes
!pip install nvidia-ml-py3

Collecting autoawq
  Downloading autoawq-0.2.4-cp310-cp310-manylinux2014_x86_64.whl (80 kB)
Collecting transformers<=4.38.2,>=4.35.0 (from autoawq)
  Downloading transformers-4.38.2-py3-none-any.whl (8.5 MB)
Collecting accelerate (from autoawq)
  Downloading accelerate-0.29.3-py3-none-any.whl (297 kB)
Collecting datasets (from autoawq)
  Downloading datasets-2.19.0-py3-none-any.whl (542 kB)



# AWQ Quantization
Quantize Llama 3 8B with AWQ in 4-bit. The model is saved with safetensors.

In [None]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'meta-llama/Meta-Llama-3-8B'
quant_path = 'Llama-3-8b-awq-4bit'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized("./"+quant_path, safetensors=True)
tokenizer.save_pretrained("./"+quant_path)
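
To get a feel for what the `quant_config` above means in storage terms, here is a rough back-of-the-envelope sketch. It assumes each group of `q_group_size` weights stores one 16-bit scale and, when `zero_point` is enabled, one `w_bit`-wide zero point. That layout is my assumption for illustration; the function name is hypothetical and not part of AutoAWQ.

```python
def effective_bits_per_weight(w_bit=4, q_group_size=128, scale_bits=16, zero_point=True):
    """Average bits stored per quantized weight, including per-group metadata.

    Assumption: one scale (scale_bits) and, if zero_point is True, one zero
    point (w_bit) are stored for every group of q_group_size weights.
    """
    overhead_bits = scale_bits + (w_bit if zero_point else 0)
    return w_bit + overhead_bits / q_group_size


bits = effective_bits_per_weight(w_bit=4, q_group_size=128)
print(f"{bits:.3f} bits per weight")                     # 4 + 20/128
print(f"{16 / bits:.2f}x smaller than 16-bit weights")   # for quantized layers only
```

Under these assumptions the quantized layers take a little more than 4 bits per weight. The whole model shrinks by less than this ratio suggests if some tensors, such as the embeddings, stay in 16-bit.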

Utilities to benchmark the VRAM consumption and the perplexity of the model.

In [None]:
from datasets import load_dataset
from tqdm import tqdm
import torch, time
from transformers import AutoTokenizer
from pynvml import *

if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
else:
  compute_dtype = torch.float16

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

dataset = load_dataset("timdettmers/openassistant-guanaco")['test']

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", use_fast=True)
# Return the perplexity of the model on the dataset.
# The perplexity is computed on each example individually, with a sliding window (stride of 512 tokens) for examples longer than 2048 tokens.
def ppl_model(model, tokenizer, dataset):
  nlls= []
  max_length = 2048
  stride = 512
  for s in tqdm(range(len(dataset['text']))):
      encodings = tokenizer(dataset['text'][s], return_tensors="pt")
      seq_len = encodings.input_ids.size(1)
      prev_end_loc = 0
      for begin_loc in range(0, seq_len, stride):
          end_loc = min(begin_loc + max_length, seq_len)
          trg_len = end_loc - prev_end_loc
          input_ids = encodings.input_ids[:, begin_loc:end_loc].to("cuda")
          target_ids = input_ids.clone()
          target_ids[:, :-trg_len] = -100
          with torch.no_grad():
              outputs = model(input_ids, labels=target_ids)
              neg_log_likelihood = outputs.loss
          nlls.append(neg_log_likelihood)
          prev_end_loc = end_loc
          if end_loc == seq_len:
              break
  ppl = torch.exp(torch.stack(nlls).mean())
  print("\nPerplexity: "+str(ppl.item()))
  # Return the value as well, so that callers can store or print it
  return ppl

def eval_model(model):
  total_tokens = 0
  total_duration = 0
  p = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. \n\n Tell me about gravity."
  for b in range(5):


    inputs = tokenizer(p, return_tensors="pt").to("cuda")
    generation_time = time.time()
    outputs = model.generate(**inputs, pad_token_id=tokenizer.eos_token_id, max_new_tokens=300)
    duration = time.time() - generation_time
    total_duration += duration

    for output in outputs:
      # Count only the newly generated tokens (not the characters of the decoded text, and not the prompt)
      nb_tokens = len(output) - inputs.input_ids.shape[1]
      total_tokens += nb_tokens
    print("--- Speed: %s tokens/second ---" % (round(nb_tokens/duration,2)))
  print("--- Average speed: %s tokens/second ---" % (round(total_tokens/total_duration,2)))
  ppl_model(model, tokenizer, dataset)

Repo card metadata block was not found. Setting CardData to empty.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
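
To make the sliding window in `ppl_model` easier to follow, here is a small sketch without any model. It reproduces only the window arithmetic, using the same `max_length` and `stride`, plus the final step that turns negative log-likelihoods into a perplexity. The helper names `sliding_windows` and `perplexity` are mine, for illustration.

```python
import math


def sliding_windows(seq_len, max_length=2048, stride=512):
    """List (begin, end, trg_len) for each window, mirroring the loop in ppl_model.

    trg_len is the number of tokens at the end of the window that are scored;
    the tokens before them only serve as context (their labels are set to -100).
    """
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == seq_len:
            break
    return windows


def perplexity(nlls):
    """Perplexity as the exponential of the mean negative log-likelihood."""
    return math.exp(sum(nlls) / len(nlls))


# A 3000-token example needs three windows; a 100-token example needs one.
for begin, end, trg_len in sliding_windows(3000):
    print(f"tokens {begin}:{end}, scoring the last {trg_len}")
print(sliding_windows(100))
```

Each token is scored exactly once, since the `trg_len` values add up to the sequence length. Note that `ppl_model` averages one loss per window, so short windows weigh as much as long ones in the final mean.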


Baseline: Llama 3 8B without quantization

In [None]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", torch_dtype=compute_dtype, device_map={"": 0})
print(print_gpu_utilization())
eval_model(model)

config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/126 [00:00<?, ?B/s]

GPU memory occupied: 16035 MB.
None
--- Speed: 83.39 tokens/second ---
--- Speed: 92.0 tokens/second ---
--- Speed: 91.67 tokens/second ---
--- Speed: 91.42 tokens/second ---
--- Speed: 91.3 tokens/second ---
--- Average speed: 89.83 tokens/second ---


100%|██████████| 518/518 [01:28<00:00,  5.84it/s]



Perplexity: 5.793899059295654


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("kaitchup/Llama-3-8b-awq-4bit", torch_dtype=torch.float16, device_map={"": 0})
print(print_gpu_utilization())
eval_model(model)

config.json:   0%|          | 0.00/978 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/60.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.68G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.05G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/142 [00:00<?, ?B/s]

GPU memory occupied: 6193 MB.
None
--- Speed: 158.23 tokens/second ---
--- Speed: 94.2 tokens/second ---
--- Speed: 621.21 tokens/second ---
--- Speed: 114.61 tokens/second ---
--- Speed: 201.15 tokens/second ---
--- Average speed: 131.46 tokens/second ---


100%|██████████| 518/518 [02:43<00:00,  3.16it/s]


Perplexity: 6.115833759307861
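
To put the two runs side by side, the short sketch below computes the relative change between the baseline and the AWQ model from the numbers printed above. The memory figures are what NVML reports as occupied on the whole GPU, so they include more than the model weights. The helper `relative_change` is only for illustration.

```python
# Values copied from the outputs of the two cells above
baseline = {"gpu_mb": 16035, "ppl": 5.793899059295654}
awq = {"gpu_mb": 6193, "ppl": 6.115833759307861}


def relative_change(new, old):
    """Change from old to new, in percent (negative means a reduction)."""
    return (new - old) / old * 100


mem_change = relative_change(awq["gpu_mb"], baseline["gpu_mb"])
ppl_change = relative_change(awq["ppl"], baseline["ppl"])
print(f"GPU memory occupied: {mem_change:+.1f}%")
print(f"Perplexity: {ppl_change:+.1f}%")
```

In this run, the AWQ model occupies roughly 61% less GPU memory, while its perplexity is about 5.6% higher.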





Benchmark AWQ for memory consumption and measure perplexity.

In [None]:

from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized('./Llama-2-7b-awq-4bit', 'model.safetensors',  fuse_layers=True)
print(print_gpu_utilization())
tokenizer = AutoTokenizer.from_pretrained("./Llama-2-7b-awq-4bit")
ppl = ppl_model(model, tokenizer, dataset)
print(ppl)

Replacing layers...: 100%|██████████| 32/32 [00:04<00:00,  6.77it/s]


GPU memory occupied: 8713 MB.
None


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
100%|██████████| 518/518 [01:18<00:00,  6.56it/s]

tensor(5.3631, device='cuda:0')



