<a href="https://colab.research.google.com/github/twhool02/ptm-quantization/blob/main/Quantize_Llama_2_7b_HF_AutoGPTQ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quantization of Llama-2-7b-HF using AutoGPTQ

This notebook implements model quantization using GPTQ, specifically using the AutoGPTQ library through its integration with Hugging Face Transformers.

The following blogs and documentation were referred to when writing the code in this notebook:
* [AutoGPTQ GitHub Repository](https://github.com/AutoGPTQ/AutoGPTQ)
* [Making LLMs lighter with AutoGPTQ and transformers](https://huggingface.co/blog/gptq-integration)
* [GPTQ Quantization on a Llama 2 7B Fine-Tuned Model With HuggingFace](https://towardsai.net/p/machine-learning/gptq-quantization-on-a-llama-2-7b-fine-tuned-model-with-huggingface)
* [4-bit Quantization with GPTQ](https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34)
* [GPTQConfig Documentation](https://huggingface.co/docs/transformers/main/en/main_classes/quantization#transformers.GPTQConfig)


## Setup

### Log in to the Hugging Face Hub

In [2]:
# huggingface_hub is required to access gated models and datasets on Hugging Face and to push models to the Hub
!pip install --upgrade huggingface_hub

import huggingface_hub

print(f"Hugging Face Version is: {huggingface_hub.__version__}")

Collecting huggingface_hub
  Downloading huggingface_hub-0.22.2-py3-none-any.whl (388 kB)
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.20.3
    Uninstalling huggingface-hub-0.20.3:
      Successfully uninstalled huggingface-hub-0.20.3
Successfully installed huggingface_hub-0.22.2
Hugging Face Version is: 0.22.2


In [3]:
from google.colab import userdata

# Read the HF_TOKEN Colab secret; this token has write permission on Hugging Face
hftoken = userdata.get('HF_TOKEN')

In [4]:
from huggingface_hub import login

# Log in to Hugging Face using the HF_TOKEN secret
login(hftoken, add_to_git_credential=True)

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful
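
The login cells above depend on Colab's secret store. As a minimal sketch for running the same steps outside Colab, the token can be looked up in the Colab secret first and in an environment variable second. The helper name `resolve_hf_token` is hypothetical, and the fallback to an `HF_TOKEN` environment variable is an assumption about how the token is supplied locally.

```python
import os


def resolve_hf_token(name: str = "HF_TOKEN"):
    """Return a Hugging Face token from Colab secrets or, failing that, the environment."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata

        token = userdata.get(name)
        if token:
            return token
    except Exception:
        # Not running in Colab, or the secret is missing or inaccessible.
        pass
    # Hypothetical fallback: an environment variable with the same name.
    return os.environ.get(name)


print("token found:", resolve_hf_token() is not None)
```

The result could then be passed to `login(...)` exactly as in the cell above; if it is `None`, no token was found in either place.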


### Install Required libraries

In [5]:
!pip install auto-gptq
!pip install git+https://github.com/huggingface/optimum.git

# The transformers library provides pretrained models and tokenizers for PyTorch and TensorFlow.
# The [sentencepiece] extra also installs the SentencePiece dependency used by the Llama tokenizer.
!pip install transformers[sentencepiece]
!pip install -U -q accelerate

# The datasets library provides one-liners to download and pre-process the major public datasets on the Hugging Face Datasets Hub.
!pip install -U -q datasets

Collecting auto-gptq
  Downloading auto_gptq-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.5 MB)
Collecting accelerate>=0.26.0 (from auto-gptq)
  Downloading accelerate-0.28.0-py3-none-any.whl (290 kB)
Collecting datasets (from auto-gptq)
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting rouge (from auto-gptq)
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Collecting gekko (from auto-gptq)
  Downloading gekko-1.1.0-py3-none-any.whl (13.2 MB)
Collecting

### Check library versions

In [6]:
#print the version of transformers
import transformers
print(f"version of transformers: {transformers.__version__}")

# print the version of the accelerate library
import accelerate
print(f"version of accelerate: {accelerate.__version__}")

# print the version of the auto-gptq library
import auto_gptq
print(f"version of auto-gptq: {auto_gptq.__version__}")

version of transformers: 4.38.2
version of accelerate: 0.28.0
version of auto-gptq: 0.7.1
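
Printing versions is useful for the record, but a programmatic check can fail early when a dependency is too old. The sketch below uses only the standard library; the helper names are hypothetical, and the only bound taken from this notebook is `accelerate>=0.26.0`, which appears in the pip output for auto-gptq above. The version parsing is deliberately simplified and ignores pre-release ordering.

```python
import re
from importlib import metadata


def version_tuple(version: str):
    """Turn '0.28.0' or '0.7.1+cu118' into a comparable tuple of ints (simplified)."""
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def meets_minimum(package: str, minimum: str) -> bool:
    """True if the installed package is at least `minimum`; False if older or missing."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False
    return version_tuple(installed) >= version_tuple(minimum)


# Bound taken from the pip output above: auto-gptq requires accelerate>=0.26.0.
print("accelerate >= 0.26.0:", meets_minimum("accelerate", "0.26.0"))
```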


### Import libraries

In [7]:
# Import the torch library, PyTorch is an open source machine learning library based on the Torch library.
import torch

# Import the nn module from the torch library. This module contains various building blocks for creating neural networks.
import torch.nn as nn

# AutoTokenizer: This class can automatically guess and download the correct tokenizer based on the model’s name.
# AutoConfig: This class can automatically guess and download the correct configuration based on the model’s name.
# AutoModelForCausalLM: This class can automatically guess and download a model for causal language modeling based on the model’s name.
# pipeline(): a high-level, easy-to-use API for running inference on a variety of downstream tasks.
# GPTQConfig: holds the GPTQ quantization settings. logging: controls the verbosity of transformers messages.
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, pipeline, GPTQConfig, logging

### Define the processor to use

Use a GPU if one is available; otherwise fall back to the CPU.

In [8]:
# Load the model directly onto GPU (if available)
device = "cuda:0" if torch.cuda.is_available() else "cpu"

## Load Model, Tokenizer and Dataset

### Define model

In [9]:
# Define the model
model_name = 'meta-llama/Llama-2-7b-chat-hf'

### Load the Tokenizer

In [10]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

### Quantization Configuration

To quantize a model:
* Create a GPTQConfig instance
* Set the number of bits to quantize to
* Define a dataset to calibrate the weights for quantization
* Define a tokenizer to prepare the dataset

In [11]:
gptq_config = GPTQConfig(
    bits=4,
    exllama_config={"version":2}, # Specify the version of the exllama kernel, defaults to {"version": 1}
    dataset='c4', # The dataset has to be one of 'wikitext2','c4','c4-new','ptb','ptb-new'
    tokenizer=tokenizer,
    group_size=128,  # the group size to use for quantization. Recommended value is 128
    desc_act=False,  # Whether to quantize columns in order of decreasing activation size. False can significantly speed up inference, but perplexity may become slightly worse
    damp_percent=0.01 # The percent of the average Hessian diagonal to use for dampening. The GPTQConfig documentation recommends 0.1; 0.01 is used here
    )
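
To make `bits` and `group_size` more concrete, here is a minimal sketch of group-wise 4-bit quantization in plain Python. It uses simple round-to-nearest, which is not the GPTQ algorithm (GPTQ additionally uses second-order information from the calibration data to compensate for rounding error). It only illustrates that each group of 128 weights shares one scale and one offset and is stored as integer codes from 0 to 15. All function names are hypothetical.

```python
import random


def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    levels = 2 ** bits - 1  # 15 for 4-bit, so integer codes run from 0 to 15
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0  # guard against a constant group
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, w_min):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + w_min for c in codes]


def quantize_row(row, bits=4, group_size=128):
    """Quantize a weight row in consecutive groups, each with its own scale and offset."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]  # synthetic weights, not model data
groups = quantize_row(row)
restored = [w for group in groups for w in dequantize_group(*group)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(groups)}, max abs error: {max_err:.6f}")
```

A smaller `group_size` means more scales and offsets to store but a tighter fit per group, which is the trade-off behind the recommended value of 128.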

### Quantize the model

Load a model to quantize and pass the gptq_config to the from_pretrained() method.

In [12]:
# Load base model
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=gptq_config, # set the quantization configuration for the model.
    device_map=device, # place the model on the device selected above (the first GPU if available)
)

config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/41.1k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/319M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Quantizing model.layers blocks :   0%|          | 0/32 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

The cos_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class
The sin_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class


### View the model details

In [13]:
print(f"Model size: {quantized_model.get_memory_footprint() / 1e9:.1f} GB")
print(f"Model params: {quantized_model.num_parameters():,}")
print(f"Model config: {quantized_model.config}")
print(f"View model structure: {quantized_model}")

Model size: 4.0 GB
Model params: 262,410,240
Model config: LlamaConfig {
  "_name_or_path": "meta-llama/Llama-2-7b-chat-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "quantization_config": {
    "batch_size": 1,
    "bits": 4,
    "block_name_to_quantize": null,
    "cache_block_outputs": true,
    "damp_percent": 0.01,
    "dataset": "c4",
    "desc_act": false,
    "exllama_config": {
      "version": 1
    },
    "group_size": 128,
    "max_input_length": null,
    "model_seqlen": null,
    "module_name_preceding_first_block": null,
    "modules_in_block_to_quantize": null,
    "pad_token_id": null,
    "quan

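The reported parameter count of 262,410,240 is far below 7 billion. A likely explanation, offered here as an assumption rather than something the output states, is that `num_parameters()` only counts tensors still registered as parameters, while the packed 4-bit weights of the quantized linear layers are stored differently. The arithmetic below checks this against the config values printed above. The vocabulary size of 32,000 is an assumption, since it is not visible in the truncated output, and fp16 storage for the unquantized tensors is also assumed.

```python
# Values from the config printed above.
hidden_size = 4096
intermediate_size = 11008
num_hidden_layers = 32
bits = 4
group_size = 128
# Assumption: vocabulary size of 32,000 (not visible in the truncated output).
vocab_size = 32_000

# Tensors left unquantized: token embeddings, output head, and RMSNorm weights.
embeddings = vocab_size * hidden_size
lm_head = vocab_size * hidden_size
norms = num_hidden_layers * 2 * hidden_size + hidden_size
unquantized = embeddings + lm_head + norms

# Linear layers inside each block: 4 attention projections and 3 MLP projections,
# matching the "0/7" progress bar shown during quantization.
attention = 4 * hidden_size * hidden_size
mlp = 3 * hidden_size * intermediate_size
quantized = num_hidden_layers * (attention + mlp)

# Rough storage estimate: packed 4-bit weights, one fp16 scale and one 4-bit zero
# point per group, and fp16 (2 bytes) for everything unquantized (an assumption).
weight_bytes = quantized * bits / 8
group_bytes = (quantized / group_size) * (2 + bits / 8)
unquantized_bytes = unquantized * 2
total_gb = (weight_bytes + group_bytes + unquantized_bytes) / 1e9

print(f"unquantized params: {unquantized:,}")
print(f"quantized params:   {quantized:,}")
print(f"total params:       {unquantized + quantized:,}")
print(f"estimated size:     {total_gb:.2f} GB")
```

Under these assumptions the unquantized count equals the 262,410,240 reported above, and the size estimate lands close to the reported 4.0 GB footprint.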
## Inference

This section verifies that inference runs successfully on the quantized model.  
More in-depth evaluation is done in the notebook ‘Model Evaluation’.

### Using transformers pipeline

In [14]:
pipeline = pipeline(
    "text-generation",
    model=quantized_model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map=device,
)

sequences = pipeline(
    "Write a poem about Ireland",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
    truncation=True
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: Write a poem about Ireland.

Poem:
In Ireland's green fields, where the sun shines bright,
Lies a land of beauty, a sight to behold.
From the rugged coastlines, to the rolling hills,
Ireland is a place, where the heart is made of gold.

The Emerald Isle, a land of magic and might,
Where the whiskey flows, and the music takes flight.
From Dublin to Galway, the charm never fades,
In Ireland, the heart beats strong, with a love that never wanes.

The people are warm, with a smile that's always there,
In Ireland, you'll find, a place to share.
From the castles to the pubs, the history to the land,
In Ireland, you'll find, a place to stand.

So come and visit, and see for yourself,
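
The generation call above uses `do_sample=True` with `top_k=10`. As a rough illustration of what top-k sampling does at each step, the sketch below keeps only the k highest-scoring candidates, renormalizes their probabilities with a softmax, and samples from them. It is a simplified stand-in written with the standard library, not the transformers implementation, and the logits are made-up numbers.

```python
import math
import random


def top_k_sample(logits, k=10, rng=random):
    """Sample an index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for numerical stability).
    peak = logits[top[0]]
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


random.seed(0)
logits = [2.0, 0.5, -1.0, 3.0, 0.0, 1.5]  # made-up scores for six candidate tokens
samples = [top_k_sample(logits, k=3) for _ in range(1000)]
print({i: samples.count(i) for i in sorted(set(samples))})
```

With `k=3`, only the three highest-scoring indices (3, 0 and 5) can ever be drawn, which is why a small `top_k` keeps sampled text from drifting to unlikely tokens.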



In [15]:
sequences = pipeline(
   "What is a large language model?",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=200
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: What is a large language model?

A large language model is a type of artificial intelligence (AI) model that is trained on a large dataset of text to generate language outputs that are coherent and natural-sounding. These models are designed to mimic the language abilities of humans, and can be used for a variety of applications such as language translation, text summarization, and language generation.

The key advantage of large language models is their ability to generate text that is more natural and coherent than that produced by smaller models or rule-based systems. This is because they have been trained on much larger datasets, which allows them to learn more complex patterns and relationships in language. As a result, they are often able to produce text that is more nuanced and contextually appropriate than that produced by smaller models.

Some examples of large language models include:

1. BERT (Bidirectional Encoder Representations from Transformers


In [16]:
sequences = pipeline(
    "Tell me what you know about Co. Donegal in Ireland",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=200
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: Tell me what you know about Co. Donegal in Ireland.
Co. Donegal is a county located in the northwest of Ireland, known for its rugged coastline, picturesque villages, and rich cultural heritage. Here are some of the top things to do and see in Co. Donegal:
1. Slieve League Cliffs: The second-highest sea cliffs in Europe, stretching for five miles along the Atlantic coast. Take a guided hike or simply enjoy the breathtaking views.
2. Glencolmcille Folk Village Museum: Explore the traditional Irish village, featuring thatched cottages, a stone church, and a museum showcasing local history and culture.
3. Killybegs: A charming fishing town with a bustling harbor, offering boat tours and fishing excursions.
4. Donegal Castle: A 15


### Using the system prompt

In [17]:
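# The earlier cell rebound the name `pipeline` to a pipeline instance, shadowing the imported function.
# Delete it so the function can be re-imported and a new instance created.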
del pipeline

from transformers import pipeline

In [18]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run the text generation pipeline with the quantized model
system_prompt = """
You are a helpful, respectful and honest assistant.
Always answer as helpfully as possible, with short concise answers.
"""
prompt = "Can you name the all previous presidents of Ireland?"
pipe = pipeline(task="text-generation", model=quantized_model, tokenizer=tokenizer, max_length=2048)
result = pipe(f"<s>[INST]<<SYS>>{system_prompt}<</SYS>>{prompt} [/INST]</s>")
print(result[0]['generated_text'])

<s>[INST]<<SYS>>
You are a helpful, respectful and honest assistant.
Always answer as helpfully as possible, with short concise answers.
<</SYS>>Can you name the all previous presidents of Ireland? [/INST]</s> The President of Ireland is the head of state of Ireland. Here are the names of the previous presidents of Ireland:

1. Douglas Hyde (1938-1945)
2. Éamon de Valera (1945-1959)
3. Séamus Ó Tuama (1959-1966)
4. Éamon de Valera (1966-1973)
5. Cearbhall Ó Dálaigh (1974-1976)
6. Patrick Hillery (1976-1990)
7. Mary Robinson (1990-1997)
8. Mary McAleese (1997-2011)
9. Michael D. Higgins (2011-present)
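
The prompt in the cell above is assembled by hand and ends with `</s>`. In the Llama 2 chat layout as commonly documented, `</s>` closes a completed assistant reply, so a prompt that is waiting for a reply normally stops at `[/INST]`. The helper below is a minimal sketch of that single-turn layout. The function name is hypothetical, and it assumes the tokenizer prepends the `<s>` token itself, which is why the string does not include it.

```python
def build_llama2_prompt(user_message, system_prompt=None):
    """Format a single-turn prompt in the Llama 2 chat layout."""
    user_message = user_message.strip()
    if system_prompt:
        body = f"<<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{user_message}"
    else:
        body = user_message
    # No trailing </s>: that token marks the end of a completed assistant turn.
    # No leading <s> either, assuming the tokenizer adds the BOS token itself.
    return f"[INST] {body} [/INST]"


example = build_llama2_prompt(
    "Can you name the previous presidents of Ireland?",
    system_prompt="You are a helpful, respectful and honest assistant.",
)
print(example)
```

The resulting string could be passed to `pipe(...)` in place of the hand-written f-string. Whether it changes the quality of the answer for this model has not been tested here.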


### Chat with the model

In [19]:
messages = [
    {"role": "user", "content": "What is your favourite TV show?"},
    {"role": "assistant", "content": "I liked Breaking Bad and Band of Brothers"},
    {"role": "user", "content": "Do you have any recommendations of other shows I might like?"}
]

model_inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(device)

generated_ids = quantized_model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

<s> [INST] What is your favourite TV show? [/INST] I liked Breaking Bad and Band of Brothers </s><s> [INST] Do you have any recommendations of other shows I might like? [/INST]  Sure! Based on your interest in "Breaking Bad" and "Band of Brothers," here are some other TV shows you might enjoy:

1. "The Sopranos" - A classic HBO drama about a New Jersey mob boss and his family.
2. "The Wire" - A gritty and intense HBO series that explores the drug trade in Baltimore from multiple perspectives.
3. "Mad Men" - A period drama set in the 1960s that follows the lives of advertising executives on Madison Avenue.
4. "The Shield" - A gritty FX series about a corrupt police detective and his team.
5. "The Americans" - A spy thriller set during the Cold War, following a pair of Soviet spies living in the United States.
6. "Narcos" - A Netflix series that chronicles the rise and fall of Colombian drug lord Pablo Escobar.
7. "Peaky Blinders" - A historical crime drama set in post-World War I Englan

## Push the model to Hugging Face

### Push Model

In [20]:
quantized_model.push_to_hub("twhoool02/Llama-2-7b-hf-AutoGPTQ")
tokenizer.push_to_hub("twhoool02/Llama-2-7b-hf-AutoGPTQ")

README.md:   0%|          | 0.00/654 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.90G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/twhoool02/Llama-2-7b-hf-AutoGPTQ/commit/287a305836750557fcbd01e3b00a33c96e03c305', commit_message='Upload tokenizer', commit_description='', oid='287a305836750557fcbd01e3b00a33c96e03c305', pr_url=None, pr_revision=None, pr_num=None)

### Add Model Card

In [21]:
# Add custom model card
from huggingface_hub import ModelCard, ModelCardData

card_data = ModelCardData(
    base_model=model_name,
    language='en',  # The language the model was trained on
    license='other',  # The license for the model
    library=['Transformers','GPTQ'],  # The library used to train the model
    model_name='Llama-2-7b-hf-AutoGPTQ',  # The name of the model
    model_type='llama',
    pipeline_tag='text-generation',
    quantized_by="twhoool02",
    tags=["meta","GPTQ","facebook", "llama", "llama2"],  # Tags for the model
)

content = f"""
---
{ card_data.to_yaml() }
---

# Model Card for twhoool02/Llama-2-7b-hf-AutoGPTQ

## Model Details

This model is a GPTQ quantized version of the {model_name} model.

- **Developed by:** Ted Whooley
- **Library:** Transformers, GPTQ
- **Model type:** llama
- **Model name:** Llama-2-7b-hf-AutoGPTQ
- **Pipeline tag:** text-generation
- **Quantized by:** twhoool02
- **Language(s) (NLP):** en
- **License:** other
"""

card = ModelCard(content)
print(card)

---
language: en
license: other
tags:
- meta
- GPTQ
- facebook
- llama
- llama2
base_model: meta-llama/Llama-2-7b-chat-hf
model_name: Llama-2-7b-hf-AutoGPTQ
library:
- Transformers
- GPTQ
model_type: llama
pipeline_tag: text-generation
quantized_by: twhoool02
---

# Model Card for twhoool02/Llama-2-7b-hf-AutoGPTQ

## Model Details

This model is a GPTQ quantized version of the meta-llama/Llama-2-7b-chat-hf model.

- **Developed by:** Ted Whooley
- **Library:** Transformers, GPTQ
- **Model type:** llama
- **Model name:** Llama-2-7b-hf-AutoGPTQ
- **Pipeline tag:** text-generation
- **Quantized by:** twhoool02
- **Language(s) (NLP):** en
- **License:** other



### Upload the model card

In [22]:
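try:
    card.validate()
    card.save('README.md')
    card.push_to_hub(repo_id="twhoool02/Llama-2-7b-hf-AutoGPTQ")
except Exception as err:
    # Chain the original exception so the underlying cause stays visible
    raise ValueError("Model card could not be validated or uploaded; please check it.") from err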