<a href="https://colab.research.google.com/github/saraHuang/LLM_study/blob/main/%E4%BD%BF%E7%94%A8_AWQ_%E4%BE%86%E9%87%8F%E5%8C%96_Tinyllama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This notebook shows how to quantize TinyLlama/TinyLlama-1.1B-Chat-v0.1 with AutoAWQ.

* `MODEL_ID`: `TinyLlama/TinyLlama-1.1B-Chat-v0.1`


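## Author Contact and Social Media

If you have any questions or would like to discuss further, feel free to message me or follow me on social media:

* **GitHub**: [My GitHub](https://github.com/Heng-xiu)
* **Hugging Face**: [My Hugging Face](https://huggingface.co/Heng666)
* **Blog**: [My Medium](https://r23456999.medium.com/)

Thank you for your support. I hope these channels lead to more discussion with anyone interested in generative AI, Agentic AI Systems, or other technical fields!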

<div class="align-center">
  <a href="https://ko-fi.com/hengshiousheu"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a>
</div>


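## Let's start quantizing the model

### First, log in to Hugging Face

We will download the base model (the `MODEL_ID` listed above) from the Hugging Face Hub and upload the quantized model back to the Hub, so let's log in to Hugging Face first.

#### A newer Google Colab feature
I store my Hugging Face token in the Secrets tab on the left. The benefit is that the token is never exposed in the notebook, and the same secret can be made available to all of my Colab notebooks.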

In [None]:
from google.colab import userdata
from huggingface_hub import HfApi

HF_TOKEN = userdata.get("HF_TOKEN")

api = HfApi(token=HF_TOKEN)
username = api.whoami()['name']
print(username)

Heng666
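
If the same code also needs to run outside Colab, one option is to fall back to an environment variable when `google.colab` cannot be imported. The helper below is a minimal sketch: `get_hf_token` is a hypothetical name, and reading an `HF_TOKEN` environment variable is an assumption that simply mirrors the secret name used above.

```python
import os


def get_hf_token(name="HF_TOKEN"):
    """Return the token from Colab secrets, or from the environment elsewhere."""
    try:
        # This module is only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        # Assumption: outside Colab the token is exported as an environment variable.
        return os.environ.get(name)
    return userdata.get(name)
```

With this in place, the client could be created as `HfApi(token=get_hf_token())` in either environment.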


# Set up the environment

## Install the autoawq package

The demo below needs autoawq to quantize the base model, so install it before starting.

In [None]:
!pip install --quiet -U autoawq
!pip install --quiet -U "transformers>=4.41.0"  # quoted so the shell does not treat ">" as a redirect
!pip install --quiet -U flash_attn


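## Set parameters, download the model, quantize the model

### Set parameters

The parameters can follow TheBloke's configuration:
https://huggingface.co/TheBloke/WestLake-7B-v2-AWQ/blob/main/quant_config.json

AutoAWQ provides GEMM and GEMV kernel versions. For LLM service deployment, GEMM is generally the one to choose.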
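
To make the `w_bit`, `q_group_size`, and `zero_point` fields of the `quant_config` in the next cell more concrete, here is a small pure-Python sketch of group-wise asymmetric quantization. It is only plain round-to-nearest on a toy list of numbers: AWQ additionally searches for activation-aware per-channel scales before rounding, which this sketch omits. The function names `quantize_groupwise` and `dequantize_groupwise` are hypothetical and not part of AutoAWQ.

```python
import math


def quantize_groupwise(weights, w_bit=4, q_group_size=128):
    """Round-to-nearest asymmetric quantization, one (scale, zero point) per group."""
    q_max = 2 ** w_bit - 1  # 15 for 4-bit codes
    groups = []
    for start in range(0, len(weights), q_group_size):
        group = weights[start:start + q_group_size]
        w_min, w_max = min(group), max(group)
        # The scale maps the group's value range onto the integer codes 0..q_max.
        scale = max(w_max - w_min, 1e-5) / q_max
        # The zero point is the integer code that represents the real value 0.0.
        zero_point = min(q_max, max(0, -round(w_min / scale)))
        codes = [min(q_max, max(0, round(w / scale) + zero_point)) for w in group]
        groups.append((codes, scale, zero_point))
    return groups


def dequantize_groupwise(groups):
    """Map integer codes back to approximate floating-point weights."""
    restored = []
    for codes, scale, zero_point in groups:
        restored.extend((code - zero_point) * scale for code in codes)
    return restored


# Toy weights: 256 values, so two groups of 128.
weights = [math.sin(i) for i in range(256)]
groups = quantize_groupwise(weights)
restored = dequantize_groupwise(groups)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"groups: {len(groups)}, max round-trip error: {max_error:.4f}")
```

A smaller `q_group_size` gives each group a tighter value range and therefore a smaller rounding error, at the cost of storing more scales and zero points.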

In [None]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer


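# Base model to quantize: the MODEL_ID stated at the top of this notebook
model_path = 'TinyLlama/TinyLlama-1.1B-Chat-v0.1'
quant_path = model_path+'-AWQ'
quant_config = {
    "zero_point": True,   # asymmetric quantization with a zero point
    "q_group_size": 128,  # number of weights sharing one scale and zero point
    "w_bit": 4,           # bits per quantized weight
    "version": "GEMM",    # kernel version
    "modules_to_not_convert": []  # modules to leave unquantized (none here)
}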

I have left this message as the final dev message to help you transition.

Important Notice:
- AutoAWQ is officially deprecated and will no longer be maintained.
- The last tested configuration used Torch 2.6.0 and Transformers 4.51.3.
- If future versions of Transformers break AutoAWQ compatibility, please report the issue to the Transformers project.

Alternative:
- AutoAWQ has been adopted by the vLLM Project: https://github.com/vllm-project/llm-compressor

For further inquiries, feel free to reach out:
- X: https://x.com/casper_hansen_
- LinkedIn: https://www.linkedin.com/in/casper-hansen-804005170/



ImportError: cannot import name 'BaseImageProcessor' from 'transformers' (/usr/local/lib/python3.11/dist-packages/transformers/__init__.py)
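
The deprecation notice above names the last tested versions (Torch 2.6.0 and Transformers 4.51.3), so a reasonable first step when hitting this `ImportError` is to compare them with what is installed. The snippet below is a rough sketch using only the standard library; `parse_version`, `newer_than_tested`, and `report_environment` are hypothetical helper names, and the version parsing is deliberately simplistic.

```python
import re
from importlib import metadata

# Versions named in the AutoAWQ deprecation notice above.
LAST_TESTED = {"torch": "2.6.0", "transformers": "4.51.3"}


def parse_version(text):
    """Turn '4.51.3' or '2.6.0+cu124' into a comparable tuple of integers."""
    release = text.split("+")[0]  # drop local build tags such as '+cu124'
    numbers = []
    for piece in release.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at non-numeric parts such as 'dev0'
        numbers.append(int(match.group()))
    return tuple(numbers)


def newer_than_tested(installed, tested):
    """True when the installed version is newer than the last tested one."""
    return parse_version(installed) > parse_version(tested)


def report_environment():
    """Print each installed version next to the last tested version."""
    for package, tested in LAST_TESTED.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            print(f"{package}: not installed (last tested with {tested})")
            continue
        if newer_than_tested(installed, tested):
            note = "newer than last tested"
        else:
            note = "not newer than last tested"
        print(f"{package}: {installed} (last tested with {tested}) -> {note}")


report_environment()
```

If `transformers` turns out to be newer than 4.51.3, pinning it to that version is one thing to try; whether that clears the `BaseImageProcessor` error is not verified here.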

### Download and load the model

Let's download it.

In [None]:
%%time

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

## Quantize the model
Now for the main step. Let's start!

In [None]:
%%time

model.quantize(tokenizer, quant_config=quant_config)
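
For a rough sense of what this step buys, the arithmetic below estimates weight storage for `w_bit=4` and `q_group_size=128`. It is a back-of-the-envelope sketch under stated assumptions: one 16-bit scale and one 4-bit zero point per group, every parameter quantized, and a parameter count of 1.1e9 taken from the model name. Real checkpoints are larger because layers such as the embeddings and `lm_head` usually stay in higher precision. `estimate_weight_bytes` is a hypothetical helper.

```python
def estimate_weight_bytes(num_params, w_bit=4, q_group_size=128,
                          scale_bits=16, zero_bits=4):
    """Approximate storage for group-quantized weights, in bytes."""
    # Each group of q_group_size weights shares one scale and one zero point.
    bits_per_weight = w_bit + (scale_bits + zero_bits) / q_group_size
    return num_params * bits_per_weight / 8


num_params = 1.1e9  # assumption: parameter count taken from the model name
fp16_bytes = num_params * 16 / 8
awq_bytes = estimate_weight_bytes(num_params)
print(f"FP16: {fp16_bytes / 1e9:.2f} GB")
print(f"4-bit, group size 128: {awq_bytes / 1e9:.2f} GB")
print(f"compression ratio: {fp16_bytes / awq_bytes:.2f}x")
```

Under these assumptions the weights shrink from about 2.2 GB to about 0.57 GB, a ratio of roughly 3.85x rather than a full 4x because of the per-group scales and zero points.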

## Save the files
After all that conversion work, the result is worth keeping.

In [None]:
# Save quantized model
MODEL_NAME = model_path.split('/')[-1]
model.save_quantized(MODEL_NAME)
tokenizer.save_pretrained(MODEL_NAME)
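
After saving, it can be useful to confirm that the folder really records the quantization settings. The sketch below assumes the saved `config.json` carries them under a `quantization_config` key, as Hugging Face-format AWQ checkpoints typically do; that layout is an assumption, and `read_quantization_config` is a hypothetical helper.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the `quantization_config` section of config.json, or None if absent."""
    config_file = Path(model_dir) / "config.json"
    if not config_file.exists():
        return None
    with config_file.open(encoding="utf-8") as handle:
        config = json.load(handle)
    return config.get("quantization_config")
```

Calling `read_quantization_config(MODEL_NAME)` right after the cell above should then show the bit width and group size that were used, or `None` if the key is missing.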

## Upload the quantized model to the Hugging Face Hub

Nice! Next, upload the quantized model to the Hugging Face Hub.

In [None]:
!pip install -q huggingface_hub
from huggingface_hub import HfApi
from google.colab import userdata

# `username` was already set in the login cell above

# Defined in the secrets tab in Google Colab
api = HfApi(token=userdata.get("HF_TOKEN"))

# Create empty repo
api.create_repo(
    repo_id = f"{username}/{MODEL_NAME}-AWQ",
    repo_type="model",
    exist_ok=True,
)

# Upload the quantized model folder
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=f"{username}/{MODEL_NAME}-AWQ",
    # allow_patterns can restrict the upload to specific files; by default the whole folder is uploaded
)
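
Before uploading, it may help to preview which files a given `allow_patterns` list would select. The sketch below approximates that with the standard library's `fnmatch`; the actual selection is done inside `huggingface_hub` and may differ in details, so treat this only as a local sanity check. `preview_upload` is a hypothetical helper, and the patterns in the usage note are illustrative.

```python
from fnmatch import fnmatch
from pathlib import Path


def preview_upload(folder_path, allow_patterns=None):
    """List files under folder_path whose relative path matches any allowed pattern."""
    root = Path(folder_path)
    selected = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(root).as_posix()
        # With no patterns given, every file is selected.
        if allow_patterns is None or any(fnmatch(relative, p) for p in allow_patterns):
            selected.append(relative)
    return selected
```

For example, `preview_upload(MODEL_NAME, ["*.safetensors", "*.json"])` would list only weight and JSON files, which makes it easy to notice when a pattern would leave out something the repository needs.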

In [None]:
!pip install llmcompressor



In [None]:
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor import oneshot

recipe = [
    AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
]

# Apply quantization using the built-in open_platypus dataset.
#   * See examples for demos showing how to pass a custom calibration set
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-AWQ-W4A16",  # name reflects the 4-bit weight scheme
    max_seq_length=2048,
    num_calibration_samples=512,
)

ImportError: cannot import name 'BaseImageProcessor' from 'transformers' (/usr/local/lib/python3.11/dist-packages/transformers/__init__.py)