<a href="https://colab.research.google.com/github/saraHuang/LLM_study/blob/main/%E4%BD%BF%E7%94%A8_GGUF_%E5%92%8C_llama_cpp_%E4%BE%86%E9%87%8F%E5%8C%96_Phi_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This notebook shows how to quantize microsoft/phi-2 with GGUF and llama.cpp.

* `MODEL_ID`: `microsoft/phi-2`
* `QUANTIZATION_METHOD`: the quantization method to use.
  - Q5_K_M: 5-bit, recommended, low quality loss.
  - Q4_K_M: 4-bit, recommended, balanced quality.
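
To get a feel for what these bit widths mean for file size, here is a rough back-of-the-envelope sketch. It assumes about 2.7 billion parameters for phi-2 (the figure on its model card) and uses the nominal bit widths only. Real K-quant files tend to be somewhat larger, because they store scaling data and keep some tensors at higher precision. The helper name `estimate_size_gb` is hypothetical.

```python
# Rough size estimate from nominal bits per weight (an approximation, not a measurement).
# Assumption: phi-2 has roughly 2.7e9 parameters.
N_PARAMS = 2.7e9


def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Return the approximate file size in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP16", 16), ("Q5_K_M", 5), ("Q4_K_M", 4)]:
    print(f"{name}: ~{estimate_size_gb(N_PARAMS, bits):.2f} GB")
```

The point is the ratio rather than the exact numbers: a 4-bit file is roughly a quarter of the FP16 size.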


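## Author contact and social media

If you have any questions or would like to discuss further, feel free to message me or follow me on social media:

* **GitHub**: [My GitHub](https://github.com/Heng-xiu)
* **Hugging Face**: [My Hugging Face](https://huggingface.co/Heng666)
* **Blog**: [My Medium](https://r23456999.medium.com/)

Thank you for your support. I hope these channels help me connect with more people interested in generative AI, Agentic AI Systems,
or other technical fields.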

<div class="align-center">
  <a href="https://ko-fi.com/hengshiousheu"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a>
</div>


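## Let's start quantizing the model

### First, log in to Hugging Face

We will download the base model `microsoft/phi-2` from the Hugging Face Hub and upload the quantized model back to it, so let's log in to Hugging Face first.

#### New Google Colab feature
I store my Hugging Face token in the Secrets tab on the left. This keeps the token out of the notebook, and the same secret can be shared across all my Colab notebooks.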

In [None]:
from google.colab import userdata
from huggingface_hub import HfApi

HF_TOKEN = userdata.get("HF_TOKEN")

api = HfApi(token=HF_TOKEN)
username = api.whoami()['name']
print(username)

Heng666


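## Set up the environment

### Install llama.cpp

The walkthrough below uses llama.cpp to quantize the base model, so let's install it before we start.

In [None]:
!git clone https://github.com/ggml-org/llama.cpp
!cd llama.cpp && cmake -B build && cmake --build build --config Release # optionally, add -DGGML_CUDA=ON to activate CUDA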

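## Download, convert, and quantize the model

### Download the model

In this sequence of steps, we first download the base model, then convert it to FP16, and finally quantize it.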


In [None]:
# Variables
MODEL_ID = "microsoft/phi-2"
QUANTIZATION_METHODS = ["q4_k_m", "q3_k_m"]

# Constants
MODEL_NAME = MODEL_ID.split('/')[-1]
print(MODEL_NAME)

# Download model
!git lfs install

!git clone https://huggingface.co/{MODEL_ID}

phi-2
Git LFS initialized.
Cloning into 'phi-2'...
remote: Enumerating objects: 127, done.
remote: Total 127 (delta 0), reused 0 (delta 0), pack-reused 127 (from 1)
Receiving objects: 100% (127/127), 1.15 MiB | 3.55 MiB/s, done.
Resolving deltas: 100% (64/64), done.
Filtering content: 100% (2/2), 1.17 GiB | 6.33 MiB/s, done.
Encountered 1 file(s) that may not have been copied correctly on Windows:
	model-00001-of-00002.safetensors

See: `git lfs help smudge` for more details.
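
The conversion and quantization steps could be sketched as follows. This is a minimal sketch under stated assumptions, not the notebook's confirmed method: it assumes llama.cpp was cloned into `llama.cpp` and built with CMake (so `llama-quantize` sits in `llama.cpp/build/bin`), that the Python requirements of `convert_hf_to_gguf.py` are installed, and that output files follow the `{model}.{METHOD}.gguf` naming used by the inference cell. The helper `build_commands` is hypothetical, and it only builds the command lines so they can be inspected before running.

```python
from pathlib import Path

# Assumed layout: llama.cpp cloned next to the model folder and built with CMake.
LLAMA_CPP_DIR = Path("llama.cpp")
MODEL_NAME = "phi-2"
QUANTIZATION_METHODS = ["q4_k_m", "q3_k_m"]


def build_commands(model_name, methods, llama_dir=LLAMA_CPP_DIR):
    """Return (convert_cmd, quantize_cmds) as argument lists."""
    fp16_path = Path(model_name) / f"{model_name.lower()}.fp16.gguf"
    # Step 1: convert the Hugging Face checkpoint to an FP16 GGUF file.
    convert_cmd = [
        "python", str(llama_dir / "convert_hf_to_gguf.py"), model_name,
        "--outtype", "f16", "--outfile", str(fp16_path),
    ]
    # Step 2: quantize the FP16 file once per requested method.
    quantize_bin = llama_dir / "build" / "bin" / "llama-quantize"
    quantize_cmds = []
    for method in methods:
        out_path = Path(model_name) / f"{model_name.lower()}.{method.upper()}.gguf"
        quantize_cmds.append(
            [str(quantize_bin), str(fp16_path), str(out_path), method.upper()]
        )
    return convert_cmd, quantize_cmds


convert_cmd, quantize_cmds = build_commands(MODEL_NAME, QUANTIZATION_METHODS)
print(" ".join(convert_cmd))
for cmd in quantize_cmds:
    print(" ".join(cmd))
# To actually run a command: subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to check the paths; each list can then be passed to `subprocess.run(cmd, check=True)`.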


### Run inference

Now that the model is quantized, let's run a quick test.

In [None]:
import os

# Collect the GGUF files produced for this model
model_list = [file for file in os.listdir(MODEL_NAME) if file.endswith(".gguf")]
print("Available models: " + ", ".join(model_list))

prompt = input("Enter your prompt: ")
chosen_model = input("Name of the model (options: " + ", ".join(model_list) + "): ")

# Verify the chosen file is in the list
if chosen_model not in model_list:
    print("Invalid name")
else:
    # The chosen name is already a file name inside the model folder
    model_path = f"{MODEL_NAME}/{chosen_model}"
    # Recent llama.cpp builds ship llama-cli in build/bin (the older ./main binary was renamed)
    !./llama.cpp/build/bin/llama-cli -m {model_path} -n 128 --color -ngl 35 -p "{prompt}"

Available models: 
Enter your prompt: Hi there
Name of the model (options: ): 
Invalid name


## Upload the quantized model to the Hugging Face Hub

Nice. Next, upload the quantized model to the Hugging Face Hub.

In [None]:
!pip install -q huggingface_hub
from huggingface_hub import HfApi
from google.colab import userdata

# HF_TOKEN is defined in the secrets tab in Google Colab
api = HfApi(token=userdata.get("HF_TOKEN"))
username = api.whoami()["name"]
repo_id = f"{username}/{MODEL_NAME}-GGUF"

# Create the repo if it does not exist yet
api.create_repo(
    repo_id=repo_id,
    repo_type="model",
    exist_ok=True,
)

# Upload only the GGUF files
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=repo_id,
    allow_patterns="*.gguf",
)

CommitInfo(commit_url='https://huggingface.co/Heng666/phi-2-GGUF/commit/06d9503c33f4a5ddf77d8b6bf60383d4ccf51478', commit_message='Upload folder using huggingface_hub', commit_description='', oid='06d9503c33f4a5ddf77d8b6bf60383d4ccf51478', pr_url=None, pr_revision=None, pr_num=None)
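
To confirm the upload, one option is to list the files in the repo and keep only the GGUF ones. The sketch below assumes `api` is the `HfApi` instance created above; `list_gguf_files` is a hypothetical helper built on `HfApi.list_repo_files`.

```python
def list_gguf_files(api, repo_id):
    """Return the sorted .gguf file names found in a Hub repo.

    `api` is expected to be a huggingface_hub.HfApi instance.
    """
    return sorted(f for f in api.list_repo_files(repo_id) if f.endswith(".gguf"))


# Usage in the notebook (needs network access and the variables defined above):
# print(list_gguf_files(api, f"{username}/{MODEL_NAME}-GGUF"))
```

An empty list would suggest that no GGUF files were present in the local folder when `upload_folder` ran.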

## Write a good README.md

As a final step, write a good README.md. TheBloke's model cards are a useful format to follow; the more important fields are listed here, including language, tags, and license.

---
license: apache-2.0
pipeline_tag: text-generation
tags:
- finetuned
inference: false
base_model: microsoft/phi-2
model_creator: Microsoft
model_name: microsoft/phi-2
model_type: phi-2
prompt_template: 'Instruct: {prompt}

  Output:

  '
quantized_by: Heng666
---
# microsoft/phi-2 - GGUF

This is a quantized model for `microsoft/phi-2`. Two quantization methods were used:
- Q5_K_M: 5-bit, preserves most of the model's performance
- Q4_K_M: 4-bit, smaller footprint and lower memory use
  
<!-- description start -->
## Description

This repo contains GGUF format model files for [microsoft/phi-2](https://huggingface.co/microsoft/phi-2).

This model was quantized in Google Colab.
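
A model card like the one above can also be assembled and uploaded from the notebook. The following is a minimal sketch, assuming the `api`, `username`, and `MODEL_NAME` variables from the upload cell; `build_model_card` is a hypothetical helper that only handles flat `key: value` metadata, and the upload call is left commented out because it needs network access.

```python
def build_model_card(front_matter, body):
    """Return README.md text: YAML front matter followed by a markdown body."""
    lines = ["---"]
    lines += [f"{key}: {value}" for key, value in front_matter.items()]
    lines += ["---", "", body]
    return "\n".join(lines) + "\n"


card = build_model_card(
    {
        "pipeline_tag": "text-generation",
        "inference": "false",
        "base_model": "microsoft/phi-2",
        "quantized_by": "Heng666",
    },
    "# microsoft/phi-2 - GGUF\n\nThis repo contains GGUF format model files for microsoft/phi-2.",
)
print(card)

# Upload as README.md (assumes `api`, `username`, and `MODEL_NAME` exist):
# api.upload_file(
#     path_or_fileobj=card.encode("utf-8"),
#     path_in_repo="README.md",
#     repo_id=f"{username}/{MODEL_NAME}-GGUF",
# )
```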

In [None]:
!cd llama.cpp && cmake -B build && cmake --build build --config Release # optionally, add -DGGML_CUDA=ON to activate CUDA

/bin/bash: line 1: cd: llama.cpp: No such file or directory


In [None]:
!wget https://github.com/ggml-org/llama.cpp/releases/download/b5787/llama-b5787-bin-ubuntu-x64.zip

--2025-07-01 06:15:06--  https://github.com/ggml-org/llama.cpp/releases/download/b5787/llama-b5787-bin-ubuntu-x64.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/612354784/d950c0b7-f80d-46f5-934a-4f9b63745b90?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20250701%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250701T061506Z&X-Amz-Expires=1800&X-Amz-Signature=0012aa416f2f82b1556b4c92471fca9167a1a2040775e27c6dfe39ae0a10bf0e&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dllama-b5787-bin-ubuntu-x64.zip&response-content-type=application%2Foctet-stream [following]
--2025-07-01 06:15:06--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/612354784/d950c0b7-f80d-46f5-934a-4f9b63745b90?X-Amz-Algori

In [None]:
# Unzip the archive
!unzip llama-b5787-bin-ubuntu-x64.zip

Archive:  llama-b5787-bin-ubuntu-x64.zip
  inflating: build/bin/LICENSE       
  inflating: build/bin/LICENSE-curl  
  inflating: build/bin/LICENSE-httplib  
  inflating: build/bin/LICENSE-jsonhpp  
  inflating: build/bin/LICENSE-linenoise  
  inflating: build/bin/libggml-base.so  
  inflating: build/bin/libggml-cpu-alderlake.so  
  inflating: build/bin/libggml-cpu-haswell.so  
  inflating: build/bin/libggml-cpu-icelake.so  
  inflating: build/bin/libggml-cpu-sandybridge.so  
  inflating: build/bin/libggml-cpu-sapphirerapids.so  
  inflating: build/bin/libggml-cpu-skylakex.so  
  inflating: build/bin/libggml-cpu-sse42.so  
  inflating: build/bin/libggml-cpu-x64.so  
  inflating: build/bin/libggml-rpc.so  
  inflating: build/bin/libggml.so    
  inflating: build/bin/libllama.so   
  inflating: build/bin/libmtmd.so    
  inflating: build/bin/llama-batched-bench  
  inflating: build/bin/llama-bench   
  inflating: build/bin/llama-cli     
  inflating: build/bin/llama-gemma3-cli  
  inflat

In [None]:
%cd build/bin

/content/build/bin


MODEL_ID = "microsoft/phi-2"
QUANTIZATION_METHODS = ["q4_k_m", "q3_k_m"]

In [None]:
ls

libggml-base.so*                libmtmd.so*           llama-llava-cli*
libggml-cpu-alderlake.so*       LICENSE               llama-minicpmv-cli*
libggml-cpu-haswell.so*         LICENSE-curl          llama-mtmd-cli*
libggml-cpu-icelake.so*         LICENSE-httplib       llama-perplexity*
libggml-cpu-sandybridge.so*     LICENSE-jsonhpp       llama-quantize*
libggml-cpu-sapphirerapids.so*  LICENSE-linenoise     llama-qwen2vl-cli*
libggml-cpu-skylakex.so*        llama-batched-bench*  llama-run*
libggml-cpu-sse42.so*           llama-bench*          llama-server*
libggml-cpu-x64.so*             llama-cli*            llama-tokenize*
libggml-rpc.so*                 llama-gemma3-cli*     llama-tts

In [None]:
# Shell commands need a leading "!" in a notebook cell (a bare command raises a SyntaxError); -hf expects a repo that hosts GGUF files
!./llama-cli -hf microsoft/phi-2:Q4_0