# Quantize Llama 2 Models using GGUF and llama.cpp

This notebook demonstrates how to quantize Llama 2 models using the GGUF format and the llama.cpp library for efficient inference.

## Overview
- **Quantization Format**: GGUF (GPT-Generated Unified Format)
- **Library**: llama.cpp
- **Benefits**: CPU-friendly inference, reduced memory usage, multiple quantization methods

## Configuration

### Variables
* `MODEL_ID`: The HuggingFace model ID to quantize
* `QUANTIZATION_METHODS`: List of quantization methods to apply

### Recommended Quantization Methods
- **Q5_K_M**: Best quality-size tradeoff (recommended for most use cases)
- **Q4_K_M**: Good quality with smaller size
- **Q6_K**: Highest quality of the methods listed here, largest size
- **Q3_K_M**: Smaller size, acceptable quality loss
- **Q2_K**: Smallest size, significant quality loss (not recommended)
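
To make the size differences concrete, here is a rough back-of-the-envelope sketch. It assumes about 6.74 billion parameters for a 7B Llama model and uses the nominal bits per weight of each k-quant block type. The `_M` mixes keep some tensors at higher precision, so real files come out somewhat larger. The names `NOMINAL_BITS_PER_WEIGHT` and `estimate_size_gib` are illustrative and not part of llama.cpp.

```python
# Nominal bits per weight of the underlying k-quant block types.
# Assumption: the *_M mixes store some tensors at higher precision,
# so the files produced by quantize are somewhat larger than this.
NOMINAL_BITS_PER_WEIGHT = {
    "Q2_K": 2.625,
    "Q3_K_M": 3.4375,
    "Q4_K_M": 4.5,
    "Q5_K_M": 5.5,
    "Q6_K": 6.5625,
    "F16": 16.0,
}

def estimate_size_gib(n_params: float, method: str) -> float:
    """Return a rough file size in GiB for the given quantization method."""
    bits = NOMINAL_BITS_PER_WEIGHT[method.upper()]
    return n_params * bits / 8 / 1024**3

N_PARAMS = 6.74e9  # assumed parameter count for a 7B Llama model

for name in NOMINAL_BITS_PER_WEIGHT:
    print(f"{name:8s} ~{estimate_size_gib(N_PARAMS, name):5.2f} GiB")
```

The sizes reported by the quantization step later in the notebook are the authoritative numbers; this estimate only helps when choosing which methods to try.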


## Step 1: Set Configuration

Define the model to quantize and the quantization methods to use.


In [None]:
# Configuration
MODEL_ID = "mlabonne/EvolCodeLlama-7b"
QUANTIZATION_METHODS = ["q4_k_m", "q5_k_m"]

# Extract model name
MODEL_NAME = MODEL_ID.split('/')[-1]
print(f"Model: {MODEL_NAME}")
print(f"Quantization methods: {', '.join(QUANTIZATION_METHODS)}")
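
A typo in a method name only surfaces much later, after the download and FP16 conversion have finished. The sketch below checks the names up front. `KNOWN_METHODS` is an assumed subset of the type names accepted by llama.cpp's quantize tool, and `find_unknown_methods` is a hypothetical helper, so confirm the list against the help output of your own build.

```python
# Assumed subset of quantization type names accepted by llama.cpp's quantize tool.
KNOWN_METHODS = {
    "q4_0", "q4_1", "q5_0", "q5_1", "q8_0",
    "q2_k", "q3_k_s", "q3_k_m", "q3_k_l",
    "q4_k_s", "q4_k_m", "q5_k_s", "q5_k_m", "q6_k",
}

def find_unknown_methods(methods):
    """Return the entries of `methods` not found in KNOWN_METHODS (case-insensitive)."""
    return [m for m in methods if m.lower() not in KNOWN_METHODS]

# Same values as QUANTIZATION_METHODS in the configuration cell
unknown = find_unknown_methods(["q4_k_m", "q5_k_m"])
if unknown:
    raise ValueError(f"Unknown quantization methods: {unknown}")
print("All quantization methods recognized.")
```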


## Step 2: Install llama.cpp

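Clone and compile llama.cpp with CUDA support for GPU acceleration. The commands in this notebook (`make` with `LLAMA_CUBLAS=1`, `convert.py`, `quantize`, `main`) match older llama.cpp revisions. Newer revisions build with CMake and have renamed these tools, so adjust the commands or check out an older commit if a step fails.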


In [None]:
# Clone and build llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt
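
Shell commands run with `!` do not stop the notebook when they fail, so a broken build can go unnoticed until a later step. A minimal sketch of a check follows, assuming the tools used later (`convert.py`, `quantize`, `main`) sit at the top level of the `llama.cpp` directory. `missing_tools` is a hypothetical helper name.

```python
from pathlib import Path

def missing_tools(build_dir, names=("convert.py", "quantize", "main")):
    """Return the tool names that do not exist as files inside build_dir."""
    root = Path(build_dir)
    return [name for name in names if not (root / name).is_file()]

missing = missing_tools("llama.cpp")
if missing:
    print(f"Build incomplete, missing: {', '.join(missing)}")
else:
    print("llama.cpp tools found.")
```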


## Step 3: Download the Model

Download the base model from HuggingFace.


In [None]:
# Download model from HuggingFace
!git lfs install
!git clone https://huggingface.co/{MODEL_ID}


## Step 4: Convert to FP16

First, convert the model to FP16 format, which serves as the base for quantization.


In [None]:
# Convert to FP16 format (the output is a GGUF file despite the .bin extension)
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.fp16.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}
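
Before spending time on quantization, it can help to confirm that the conversion produced a usable file. GGUF files begin with the four ASCII bytes `GGUF`, so a quick check is possible without loading the model. The sketch below assumes the conversion script writes GGUF output, and `is_gguf` is a hypothetical helper name.

```python
def is_gguf(path):
    """Return True if the file at `path` starts with the 4-byte GGUF magic."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```

In this notebook the call would be `is_gguf(fp16)`, run right after the conversion cell.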


## Step 5: Quantize with Multiple Methods

Quantize the model using each specified quantization method.


In [None]:
# Quantize the model for each method
for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    print(f"Quantizing with method: {method}")
    !./llama.cpp/quantize {fp16} {qtype} {method}
    print(f"Created: {qtype}\n")
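
Once the loop finishes, comparing file sizes shows what each method actually saved. This is a small sketch using only the standard library. `gguf_sizes_gib` is a hypothetical helper, and the directory name is assumed to match `MODEL_NAME` from the configuration cell.

```python
import os

def gguf_sizes_gib(directory):
    """Map each .gguf file name in `directory` to its size in GiB."""
    sizes = {}
    for name in sorted(os.listdir(directory)):
        if name.endswith(".gguf"):
            path = os.path.join(directory, name)
            sizes[name] = os.path.getsize(path) / 1024**3
    return sizes

model_dir = "EvolCodeLlama-7b"  # assumed to equal MODEL_NAME in this notebook
if os.path.isdir(model_dir):
    for name, size in gguf_sizes_gib(model_dir).items():
        print(f"{name}: {size:.2f} GiB")
```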


## Step 6: Run Inference

Test a quantized model by choosing one of the generated files and running it on a prompt you type in.


In [None]:
import os

# List available quantized models
model_list = [file for file in os.listdir(MODEL_NAME) if file.endswith(".gguf")]
print("Available quantized models:")
for i, model in enumerate(model_list, 1):
    print(f"{i}. {model}")

# Interactive inference
prompt = input("\nEnter your prompt: ")
chosen_model = input(f"Name of the model (options: {', '.join(model_list)}): ")

# Verify the chosen model file is in the list
if chosen_model not in model_list:
    print("Invalid model name!")
else:
    model_path = f"{MODEL_NAME}/{chosen_model}"
    print(f"\nRunning inference with {chosen_model}...")
    # -ngl 35 offloads up to 35 layers to the GPU, which covers every layer of a 7B model
    # (adjust for larger models or limited VRAM)
    !./llama.cpp/main -m {model_path} -n 128 --color -ngl 35 -p "{prompt}"
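
A prompt that contains double quotes or shell metacharacters would break the `-p "{prompt}"` interpolation in the command above. One way around this, sketched below, is to build the command as an argument list so that no shell quoting is involved. `build_main_command` is a hypothetical helper and the model path is a placeholder. The flags are the ones already used in this notebook.

```python
import shlex

def build_main_command(model_path, prompt, n_tokens=128, gpu_layers=35):
    """Build the argument list for llama.cpp's main binary."""
    return [
        "./llama.cpp/main",
        "-m", model_path,
        "-n", str(n_tokens),
        "--color",
        "-ngl", str(gpu_layers),
        "-p", prompt,
    ]

# Placeholder model path and a prompt with characters the shell would interpret
cmd = build_main_command("EvolCodeLlama-7b/model.gguf", 'Say "hello" and $HOME')

# shlex.join quotes each argument, so the printed line is safe to paste into a shell
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` runs the binary without a shell and raises an error if it exits with a non-zero status.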


## Step 7: Upload to HuggingFace Hub (Optional)

Upload the quantized models to HuggingFace for sharing and deployment.


In [None]:
!pip install -q huggingface_hub

from huggingface_hub import HfApi

# Replace with your HuggingFace username
username = "your-username"

# Use the token from Colab secrets, or log in manually
# In Colab: add HF_TOKEN in the Secrets panel
try:
    # Only available inside Google Colab
    from google.colab import userdata
    api = HfApi(token=userdata.get("HF_TOKEN"))
except Exception:
    # Fallback when not in Colab or when the secret is missing: log in manually
    from huggingface_hub import notebook_login
    notebook_login()
    api = HfApi()

# Create repository through the authenticated client so the token is used
api.create_repo(
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    repo_type="model",
    exist_ok=True,
)

# Upload all GGUF files
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    allow_patterns="*.gguf",
)

print(f"✓ Models uploaded to: https://huggingface.co/{username}/{MODEL_NAME}-GGUF")
