<a href="https://colab.research.google.com/github/twhool02/ptm-quantization/blob/main/Original_vs_quantized_Llama_2_70b_chat_hf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Compare Original Llama2-70b-chat-HF to Quantized Llama2-70b-chat-HF

This notebook compares the original Llama-2-70b-chat-hf model with a 4-bit quantized version of the same model, looking at GPU memory usage and inference time.

The code in this notebook is based on the following blog posts and documentation:

* [Fine-Tune Your Own Llama 2 Model in a Colab Notebook](https://medium.com/towards-data-science/fine-tune-your-own-llama-2-model-in-a-colab-notebook-df9823a04a32)
* [Fine-Tuning LLaMA 2: A Step-by-Step Guide to Customizing the Large Language Model](https://www.datacamp.com/tutorial/fine-tuning-llama-2)
* [Fine-Tuning Llama-2 LLM on Google Colab: A Step-by-Step Guide.](https://gathnex.medium.com/fine-tuning-llama-2-llm-on-google-colab-a-step-by-step-guide-dd79a788ac16)
* [Hugging Face Documentation](https://huggingface.co/docs)

## Setup

### Log into HuggingFace Hub

In [None]:
# Required to access gated models/datasets on Hugging Face and to push models to the Hugging Face Hub
!pip install -q --upgrade huggingface_hub

import huggingface_hub

print(f"Hugging Face Version is: {huggingface_hub.__version__}")

Hugging Face Version is: 0.21.4


In [None]:
from google.colab import userdata

# Read the HF_TOKEN Colab secret; this token has write permission on Hugging Face
hftoken = userdata.get('HF_TOKEN')

In [None]:
from huggingface_hub import login

# Log into Hugging Face using the HF_TOKEN secret
login(hftoken, add_to_git_credential=True)

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Install Required Libraries

In [None]:
# install the development version of transformers
# !pip install -q -U git+https://github.com/huggingface/transformers.git -q
!pip install -q -U transformers

# install the stable version of AutoAWQ and its kernels
!pip install autoawq -q

# accelerate enables the same PyTorch code to be run across any distributed configuration
!pip install -q -U accelerate

# 'bitsandbytes' includes quantization primitives for 8-bit & 4-bit operations
!pip install -q -U bitsandbytes


### Check library versions

In [None]:
#print the version of transformers
import transformers
print(f"version of transformers: {transformers.__version__}")

#print the version of pytorch
import torch
print(f"version of pytorch: {torch.__version__}")

version of transformers: 4.39.1
version of pytorch: 2.2.1+cu121


### Import Libraries

In [None]:
# os is a standard Python library that provides functions for interacting with the operating system.
import os

# torch is the main package of PyTorch, an open-source machine learning library for Python.
import torch

# The transformers library is a popular library for Natural Language Processing (NLP). It provides thousands of pre-trained models to perform tasks on texts such as classification, information extraction, summarization, translation, and more.
from transformers import (
    # AutoModelForCausalLM is a class in the transformers library. It represents a model for causal language modeling.
    AutoModelForCausalLM,

    # AutoTokenizer is a class in the transformers library. It is used for converting input data into a format that can be used by the model.
    AutoTokenizer,

    # pipeline is a high-level function in the transformers library. It creates a pipeline that applies a model to some input data.
    pipeline,

    # Used for logging events during training and evaluation.
    logging,

    # Used to quantize models
    BitsAndBytesConfig,

)

# Import required AWQ libraries
from awq import AutoAWQForCausalLM

# provides access to garbage collection
import gc

# releases memory from the GPU
from accelerate.utils import release_memory

# to quantize model
import bitsandbytes

# will be used to measure time
import time


### Define the processor to use

Ensure the model will use a GPU if one is available.

In [None]:
# Use the first GPU if one is available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Fri Mar 22 20:52:21 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0              45W / 400W |      5MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### Define a flush function

This function runs Python garbage collection, releases cached GPU memory, and resets the peak-memory statistics.

In [None]:
def flush():
  gc.collect()
  torch.cuda.empty_cache()
  torch.cuda.reset_peak_memory_stats()
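
`flush()` can only hand memory back once nothing references the objects any more: `gc.collect()` reclaims unreachable objects, and `torch.cuda.empty_cache()` releases only cached blocks that are no longer in use. The CPU-only sketch below illustrates the first point with a hypothetical `FakeModel` class standing in for a real model. It shows Python reference semantics only and says nothing about PyTorch's allocator.

```python
import gc
import weakref


class FakeModel:
    """Hypothetical stand-in for a large model object (illustration only)."""


model = FakeModel()
ref = weakref.ref(model)  # a weak reference does not keep the object alive

gc.collect()
alive_before = ref() is not None  # the name `model` still holds the object

del model  # drop the last strong reference
gc.collect()
alive_after = ref() is not None  # the object can now be reclaimed

print(f"alive before del: {alive_before}, alive after del: {alive_after}")
```

In practice this suggests dropping references to a model (for example with `del`) before calling `flush()`.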

### Create function to measure memory used

In [None]:
def bytes_to_gigabytes(num_bytes):
  # Divides by 1024**3, so the result is in GiB (binary gigabytes)
  return num_bytes / 1024 / 1024 / 1024
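
Dividing by 1024 three times yields gibibytes (GiB), the unit used in the PyTorch out-of-memory message, whereas the model-size print later in this notebook divides by `1e9` and so reports decimal gigabytes (GB). A small sketch of the difference, using the 40960 MiB that `nvidia-smi` reports above; the helper names `to_gib` and `to_gb` are illustrative only:

```python
def to_gib(num_bytes):
    """Convert bytes to binary gigabytes (GiB, 1024**3 bytes)."""
    return num_bytes / 1024**3


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (GB, 10**9 bytes)."""
    return num_bytes / 1e9


total_bytes = 40960 * 1024**2  # 40960 MiB of GPU memory, expressed in bytes

print(f"{to_gib(total_bytes):.2f} GiB")  # 40.00 GiB
print(f"{to_gb(total_bytes):.2f} GB")    # 42.95 GB
```

The two units differ by about 7%, which matters when comparing a model size in GB against a VRAM figure in GiB.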

## Load Dataset, Tokenizer and Model

### Define the model

In [None]:
# Define the model to use
model_name = 'meta-llama/Llama-2-70b-chat-hf'


### Load Base Model (fails due to OOM)

In [None]:
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name, # specifies which pre-trained model to load
    trust_remote_code=True, # allows the execution of remote code. Be careful with this setting as it can be a security risk.
    device_map=device # place the model on the selected device (the GPU, if available)
)

config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/66.7k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/15 [00:00<?, ?it/s]

model-00001-of-00015.safetensors:   0%|          | 0.00/9.85G [00:00<?, ?B/s]

model-00002-of-00015.safetensors:   0%|          | 0.00/9.80G [00:00<?, ?B/s]

model-00003-of-00015.safetensors:   0%|          | 0.00/9.97G [00:00<?, ?B/s]

model-00004-of-00015.safetensors:   0%|          | 0.00/9.80G [00:00<?, ?B/s]

model-00005-of-00015.safetensors:   0%|          | 0.00/9.80G [00:00<?, ?B/s]

model-00006-of-00015.safetensors:   0%|          | 0.00/9.80G [00:00<?, ?B/s]

model-00007-of-00015.safetensors:   0%|          | 0.00/9.97G [00:00<?, ?B/s]

model-00008-of-00015.safetensors:   0%|          | 0.00/9.80G [00:00<?, ?B/s]

model-00009-of-00015.safetensors:   0%|          | 0.00/9.80G [00:00<?, ?B/s]

model-00010-of-00015.safetensors:   0%|          | 0.00/9.80G [00:00<?, ?B/s]

model-00011-of-00015.safetensors:   0%|          | 0.00/9.97G [00:00<?, ?B/s]

model-00012-of-00015.safetensors:   0%|          | 0.00/9.80G [00:00<?, ?B/s]

model-00013-of-00015.safetensors:   0%|          | 0.00/9.80G [00:00<?, ?B/s]

model-00014-of-00015.safetensors:   0%|          | 0.00/9.50G [00:00<?, ?B/s]

model-00015-of-00015.safetensors:   0%|          | 0.00/524M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacity of 39.56 GiB of which 816.81 MiB is free. Process 18679 has 38.76 GiB memory in use. Of the allocated memory 38.35 GiB is allocated by PyTorch, and 1.28 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
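
A back-of-the-envelope estimate shows why this load cannot succeed on a single 40 GB A100. The sketch below counts weight storage only (no activations, KV cache, or quantization metadata) and uses the parameter count that `num_parameters()` reports later in this notebook. The helper name `weight_gib` is illustrative.

```python
PARAMS = 68_976_648_192  # parameter count reported by num_parameters() in this notebook
GPU_CAPACITY_GIB = 39.56  # total capacity quoted in the out-of-memory message


def weight_gib(num_params, bits_per_param):
    """Approximate weight storage in GiB for a given precision."""
    return num_params * bits_per_param / 8 / 1024**3


for label, bits in [("float32", 32), ("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    size = weight_gib(PARAMS, bits)
    verdict = "fits" if size < GPU_CAPACITY_GIB else "does not fit"
    print(f"{label:>8}: {size:6.1f} GiB ({verdict})")
```

Under these assumptions the weights need roughly 128 GiB in float16 but only about 32 GiB at 4 bits per parameter, which is why only the 4-bit variant can fit on this GPU.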

## Load Quantized Version of the Model

### Clear memory usage

In [None]:
# Clear memory usage on the GPU
flush()
release_memory(device)

[None]

In [None]:
# Verify memory usage has reduced
start_time = time.time()

while True:
    max_memory_allocated = bytes_to_gigabytes(torch.cuda.max_memory_allocated(device))
    print(max_memory_allocated)
    if max_memory_allocated < 1:
        break
    elif time.time() - start_time > 300:  # 300 seconds = 5 minutes
        print("5 minutes have passed and the condition has not been met.")
        break
    torch.cuda.reset_peak_memory_stats()
    time.sleep(10)

0.0


### Define BitsAndBytes Config

In [None]:
# Bits and Bytes Config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # Activate 4-bit precision base model loading
    bnb_4bit_quant_type="nf4", # Quantization type (fp4 or nf4)
    bnb_4bit_compute_dtype=torch.float16, # The data type for computation when using 4-bit base models
    bnb_4bit_use_double_quant=False, # Activate nested quantization (double quantization)
)
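
To make concrete what 4-bit loading does to the weights, here is a deliberately simplified sketch of blockwise quantization in plain Python. It uses uniform, symmetric absmax levels; NF4 instead uses non-uniformly spaced levels, so this is not the bitsandbytes algorithm, only an illustration of the idea of storing small integer codes plus one scale per block. The function names and sample weights are made up for the example.

```python
def quantize_block(values, bits=4):
    """Symmetric absmax quantization of one block to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit codes in [-7, 7]
    # Fall back to a scale of 1.0 if the block is all zeros
    scale = max(abs(v) for v in values) / qmax or 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floating-point values from codes and the scale."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.07, 0.91, -0.33, 0.0, 0.48, -0.76]  # sample block
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:   ", codes)
print("restored:", [round(r, 3) for r in restored])
print(f"max abs error: {max_err:.4f} (bounded by scale / 2 = {scale / 2:.4f})")
```

Each weight is stored as a 4-bit code, and the rounding error per weight is at most half the block's scale.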

### Quantize the Model

In [None]:
# Quantize model when loading
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    quantization_config=bnb_config, # set the quantization configuration for the model.
    device_map=device # sets the device mapping for the model to use the first GPU
)

Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

### Load Tokenizer

In [None]:
# Load tokenizer
# Tokenizers run on the CPU, so no device_map is needed here
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

tokenizer.pad_token = tokenizer.eos_token # sets the pad token to the eos token

tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

### View quantized model details

In [None]:
# View the quantized model's details
print(f"Model size: {quantized_model.get_memory_footprint() / 1e9:.1f} GB")
print(f"Model params: {quantized_model.num_parameters():,}")
print(f"Model Config: \n{quantized_model.config}")
print(f"View model structure: \n{quantized_model}")

Model size: 35.4 GB
Model params: 68,976,648,192
Model Config: 
LlamaConfig {
  "_name_or_path": "meta-llama/Llama-2-70b-chat-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "float16",
    "bnb_4bit_quant_storage": "uint8",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": false,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": true,
    "load_in_8bit": false,
    "quan

### Run inference on the quantized model

In [None]:
# Run inference on the quantized model
prompt = "Please list the last 10 presidents of the USA"

pipe = pipeline("text-generation", model=quantized_model, tokenizer=tokenizer)

start_time = time.time()
result = pipe(prompt, max_new_tokens=1024)[0]["generated_text"][len(prompt):]
end_time = time.time()
execution_time = end_time - start_time
print(result)

 in reverse chronological order.

Answer: Sure, here are the last 10 presidents of the USA in reverse chronological order:

1. Joe Biden
2. Donald Trump
3. Barack Obama
4. George W. Bush
5. Bill Clinton
6. George H.W. Bush
7. Ronald Reagan
8. Jimmy Carter
9. Gerald Ford
10. Richard Nixon


### Measure Memory Used

In [None]:
# measure memory used
quant_model_mem = bytes_to_gigabytes(torch.cuda.max_memory_allocated(device))
print(f"VRAM usage of Quantized {model_name}: {quant_model_mem} GB")
print(f"Inference time of {model_name}: {execution_time} seconds")

VRAM usage of Quantized meta-llama/Llama-2-70b-chat-hf: 37.446064472198486 GB
Inference time of meta-llama/Llama-2-70b-chat-hf: 16.16058373451233 seconds


### Run further inference on the model

In [None]:
# Run inference on the quantized model
prompt = "Can you tell me about Letterkenny in Co. Donegal, Ireland"

pipe = pipeline("text-generation", model=quantized_model, tokenizer=tokenizer)

start_time = time.time()
result = pipe(prompt, max_new_tokens=1024)[0]["generated_text"][len(prompt):]
end_time = time.time()
execution_time = end_time - start_time
print(result)
print("\n")
print(f"Execution time in seconds: {execution_time}")

?
Letterkenny (Irish: Leitir Ceanainn) is a town in County Donegal, Ireland. It is located on the River Swilly, and it is the largest town in the county. Letterkenny has a population of around 19,000 people and is known for its vibrant cultural scene, rich history, and beautiful natural surroundings.

Here are some things you might like to know about Letterkenny:

1. History: Letterkenny has a long and rich history, dating back to the 16th century when it was a strategic stronghold for the O'Donnell clan, who were powerful chieftains in the region. The town's name is derived from the Irish phrase "Leitir Ceanainn," which means "the hillside of the O'Cannons."
2. Culture: Letterkenny has a thriving cultural scene, with a variety of festivals and events throughout the year. The town is home to the Donegal County Museum, which features exhibits on the history and heritage of the region. The Letterkenny Arts Centre hosts a range of performances, including music, theater, and dance.
3. Natu

In [None]:
# Run inference on the quantized model
prompt = "Write a poem about Ireland"

pipe = pipeline("text-generation", model=quantized_model, tokenizer=tokenizer)

start_time = time.time()
result = pipe(prompt, max_new_tokens=1024)[0]["generated_text"][len(prompt):]
end_time = time.time()
execution_time = end_time - start_time
print(result)
print("\n")
print(f"Execution time in seconds: {execution_time}")

.
Ireland, the land of the green
Where the rolling hills are seen
The home of the shamrock and the leprechaun
A place where magic is never done

The Cliffs of Moher rise high
A sight to make the heart sigh
The wind whispers secrets in the grass
As the wild Atlantic waves crash

In Dublin, the Guinness flows
A city alive with stories and glows
The Temple Bar pubs come alive at night
With music and laughter, a joyful sight

The Cliffs of Slieve League drop steep
A sheer drop, a thrilling leap
The views of the ocean, a sight to behold
A place where the brave are told

The Giant's Causeway, a natural wonder
A place of mystery, a place of thunder
The basalt columns, a unique sight
A place where legend and science take flight

Ireland, a land of beauty and grace
A place of wonder, a place of pace
Where the people are warm and friendly
And the craic is always mighty and merry.


Execution time in seconds: 40.08390235900879
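
The inference cells above all repeat the same start/stop timing pattern. One way to factor it out is a small helper built on `time.perf_counter()`, which is monotonic and so cannot produce a negative interval. This is a sketch only; the name `timed_call` is hypothetical and the workload below is a stand-in for the pipeline call.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the example runs without a GPU or a model
result, seconds = timed_call(sum, range(1_000_000))
print(f"result={result}, elapsed={seconds:.4f} s")
```

With the objects defined in this notebook, the call would look like `output, seconds = timed_call(pipe, prompt, max_new_tokens=1024)`.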
