<a href="https://colab.research.google.com/github/shake/colab-Llama-2-ipynb/blob/main/collect/Demo_AWQ_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers meets AWQ quantization (AutoAWQ and LLM-AWQ) for lighter and faster quantized inference of LLMs

![img](https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/Thumbnail.png)

In June 2023, Ji Lin et al. published [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/pdf/2306.00978.pdf). The paper details an algorithm that compresses any transformer-based language model to a few bits per weight with only a tiny performance degradation. To learn more about this quantization method, see the excellent [talk](https://hanlab.mit.edu/projects/awq) given by Professor Song Han.

We now support loading models that are quantized with the AWQ algorithm in 🤗 transformers from two different libraries: [LLM-AWQ](https://github.com/mit-han-lab/llm-awq) and [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).

In this notebook, let's go through the different options this integration offers: quantizing a model, pushing a quantized model to the 🤗 Hub, loading an already quantized model from the Hub, and more!

## Load required libraries

Let us first install the required libraries: 🤗 transformers, accelerate, and autoawq.

In [None]:
!pip install -q transformers accelerate

AutoAWQ defaults to CUDA 12.1. Since Google Colab has CUDA < 12.1 installed, we will install the wheels built for CUDA 11.8. For CUDA 12.1, you can simply run `pip install autoawq`.

In [None]:
!pip install -q -U https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl

## LLM-AWQ integration with Transformers

LLM-AWQ is not supported on T4 devices (such as the one we use on free-tier Google Colab instances), so you need access to hardware that is compatible with that repository. Follow the [instructions](https://github.com/mit-han-lab/llm-awq/tree/main) provided in the llm-awq repository.

You can follow the steps in [this example](https://github.com/mit-han-lab/llm-awq/blob/main/examples/chat_demo.ipynb), then use the conversion script available [here](https://github.com/mit-han-lab/llm-awq/blob/main/examples/convert_to_hf.py) to convert your model into a transformers-compatible version.

## AutoAWQ integration with Transformers

Let's first quantize `opt-125m` using `autoawq`!

In [None]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-125m"
quant_path = "opt-125m-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version":"GEMM"}

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

Repo card metadata block was not found. Setting CardData to empty.
AWQ: 100%|██████████| 12/12 [01:38<00:00,  8.18s/it]


In [None]:
quant_config

{'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}
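To give a feel for what `w_bit`, `q_group_size` and `zero_point` mean, here is a minimal pure-Python sketch of group-wise quantization with a zero point. It only illustrates the storage scheme these settings describe; it is not AutoAWQ's implementation, it skips the activation-aware scaling that gives AWQ its name, and the function names are hypothetical. It also assumes each group contains both negative and positive weights.

```python
def quantize_group(weights, w_bit=4):
    """Quantize one group of floats to unsigned w_bit integers with a zero point."""
    q_max = 2 ** w_bit - 1                      # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = max(w_max - w_min, 1e-5) / q_max    # floor the range to avoid dividing by zero
    # Integer offset that maps w_min to 0; clamped so it is itself storable in w_bit bits.
    zero = min(max(round(-w_min / scale), 0), q_max)
    q = [min(max(round(w / scale) + zero, 0), q_max) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Recover approximate floats from the integers, scale, and zero point."""
    return [(v - zero) * scale for v in q]


def quantize_row(row, q_group_size=128, w_bit=4):
    """Split a weight row into consecutive groups and quantize each one independently."""
    return [
        quantize_group(row[start:start + q_group_size], w_bit)
        for start in range(0, len(row), q_group_size)
    ]


# Toy row of 256 weights -> two groups of 128, each with its own scale and zero point.
row = [((i * 37) % 101 - 50) / 500 for i in range(256)]
groups = quantize_row(row)
restored = [w for q, scale, zero in groups for w in dequantize_group(q, scale, zero)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), max_error)
```

A smaller group size means more scales and zero points to store, but each group covers a narrower range of values, so the rounding error per weight tends to shrink.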

To make the quantized model compatible with transformers, we need to modify its config file.

In [None]:
from transformers import AwqConfig, AutoConfig
from huggingface_hub import HfApi

# modify the config file so that it is compatible with transformers integration
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# the pretrained transformers model is stored in the `model` attribute, and its config expects a plain dict
model.model.config.quantization_config = quantization_config
# a second solution would be to use AutoConfig and push to the Hub (what we do for llm-awq)


# save model weights
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)



('opt-125m-awq/tokenizer_config.json',
 'opt-125m-awq/special_tokens_map.json',
 'opt-125m-awq/vocab.json',
 'opt-125m-awq/merges.txt',
 'opt-125m-awq/added_tokens.json',
 'opt-125m-awq/tokenizer.json')
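Before uploading, it can be worth checking that the quantization settings actually landed in the saved config. The sketch below assumes the save step wrote a `config.json` into the `opt-125m-awq` folder; `parse_quantization_config` is a hypothetical helper, not part of transformers or AutoAWQ.

```python
import json
from pathlib import Path


def parse_quantization_config(config_text):
    """Return the `quantization_config` entry of a config.json string, or None if absent."""
    return json.loads(config_text).get("quantization_config")


# Assumption: the save step above wrote a config.json into the quantized folder.
config_file = Path("opt-125m-awq") / "config.json"
if config_file.exists():
    print(parse_quantization_config(config_file.read_text()))
else:
    print(f"{config_file} not found; run the save cell first")
```

If the helper returns `None`, the config was saved without the quantization settings, and loading the folder through transformers would likely not treat the weights as AWQ-quantized.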

In [None]:
# optional -> push the quantized weights to the hub
! huggingface-cli login


    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the '

In [None]:
api = HfApi()
api.upload_folder(
    folder_path="opt-125m-awq",
    repo_id="ybelkada/opt-125m-awq",
    repo_type="model",
)

model.safetensors:   0%|          | 0.00/202M [00:00<?, ?B/s]

'https://huggingface.co/ybelkada/opt-125m-awq/tree/main/'

Now we can use our model with the transformers library to run inference!

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ybelkada/opt-125m-awq")
model = AutoModelForCausalLM.from_pretrained("ybelkada/opt-125m-awq").to(0)

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to(0)

out = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Downloading (…)okenizer_config.json:   0%|          | 0.00/669 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/548 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/979 [00:00<?, ?B/s]

You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.


Downloading model.safetensors:   0%|          | 0.00/202M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

Hello my name is katie and I


## Loading a large model on Google Colab

Let's now use this integration to load a large model that would not fit on a single Google Colab GPU in half precision. The integration is compatible with any AWQ model under the [`TheBloke`](https://huggingface.co/TheBloke) namespace! For our demo we will use `TheBloke/Llama-2-13B-chat-AWQ`. That model would require ~26GB in float16 (13B parameters at 2 bytes each), but thanks to AWQ its checkpoint is only about 7.25GB, which fits comfortably on a 16GB GPU!
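The memory figures can be checked with a quick back-of-the-envelope calculation. This is a rough sketch, not a measurement: it assumes a rounded 13B parameter count and 4-bit weights that share one float16 scale and one 4-bit zero point per group of 128 (the group size is an assumption about this checkpoint). It ignores layers that typically stay in float16, such as embeddings, as well as activations and the KV cache, so the real footprint is somewhat higher.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory needed to store n_params weights at the given bit width, in GB (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 13e9  # rounded parameter count for a 13B model

fp16_gb = weight_memory_gb(n_params, 16)

# Assumption: 4-bit weights with one fp16 scale and one 4-bit zero point per group of 128.
group_size = 128
awq_bits_per_weight = 4 + (16 + 4) / group_size
awq_gb = weight_memory_gb(n_params, awq_bits_per_weight)

print(f"float16: {fp16_gb:.1f} GB, 4-bit AWQ estimate: {awq_gb:.2f} GB")
```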

<img src="https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/thebloke-screenshot.png" width="800"/>

In [None]:
model_id = "TheBloke/Llama-2-13B-chat-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda")

Downloading (…)okenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/750 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/7.25G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

In [None]:
text = "User:\nHello can you provide me with top-3 cool places to visit in Paris?\n\nAssistant:\n"
inputs = tokenizer(text, return_tensors="pt").to(0)

out = model.generate(**inputs, max_new_tokens=300)
print(tokenizer.decode(out[0], skip_special_tokens=True))

User:
Hello can you provide me with top-3 cool places to visit in Paris?

Assistant:
Bonjour! There are so many amazing places to visit in Paris, but here are my top three recommendations:

1. The Eiffel Tower - an iconic symbol of the city, this tower offers breathtaking views of the city from its observation decks.
2. The Louvre Museum - home to some of the world's most famous artworks, including the Mona Lisa, this museum is a must-visit for art lovers.
3. Notre Dame Cathedral - a beautiful and historic church that is one of the most famous landmarks in Paris.

I hope you enjoy your visit to Paris! Is there anything else you'd like to know?
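The prompt above is written by hand, and the decoded output echoes it back. A small pair of helpers can build the prompt and strip the echo, as sketched here. `build_prompt` and `extract_reply` are hypothetical names, and the `User:` / `Assistant:` layout is simply the one used in this notebook, not necessarily the chat template the model was trained with.

```python
def build_prompt(user_message):
    """Wrap a user message in the ad hoc User/Assistant layout used in this notebook."""
    return f"User:\n{user_message}\n\nAssistant:\n"


def extract_reply(decoded_text, prompt):
    """Drop the echoed prompt from the decoded generation, if it is present verbatim."""
    if decoded_text.startswith(prompt):
        return decoded_text[len(prompt):].strip()
    return decoded_text.strip()  # fall back when decoding did not reproduce the prompt exactly


prompt = build_prompt("Hello can you provide me with top-3 cool places to visit in Paris?")
decoded = prompt + "Bonjour! There are so many amazing places to visit in Paris."
print(extract_reply(decoded, prompt))
```

In the notebook, `decoded` would be the string returned by `tokenizer.decode(out[0], skip_special_tokens=True)`.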
