# ExLlamaV2: High-Performance LLM Quantization

This notebook demonstrates quantization of large language models using ExLlamaV2, a high-performance library for running LLMs efficiently.

## Overview
- **Quantization Method**: EXL2, the quantization format used by ExLlamaV2
- **Use Case**: Fast inference with reduced memory footprint
- **Target**: Large language models (7B+ parameters)

## Key Features
- Efficient quantization with calibration dataset
- Configurable bits per weight (BPW)
- Optimized for inference speed


## Step 1: Install ExLlamaV2

First, we'll clone and install the ExLlamaV2 library from its GitHub repository.


In [None]:
# Install ExLlamaV2
!git clone https://github.com/turboderp/exllamav2
!pip install -e exllamav2


## Step 2: Configure Model and Quantization Parameters

Set the model name and the target bits per weight (BPW) for quantization. A lower BPW produces a smaller model that needs less memory, at the cost of some output quality.


In [None]:
MODEL_NAME = "zephyr-7b-beta"
BPW = 5.0  # Bits per weight - adjust based on your memory constraints
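
To get a feel for how BPW maps to model size, the weights take roughly `parameters × BPW / 8` bytes. The sketch below applies that arithmetic; `estimate_weight_size_gib` and `ASSUMED_PARAMS` are illustrative names, and the 7e9 parameter count is an assumed round figure rather than the exact size of any particular model. The estimate covers weights only and ignores runtime overhead such as the cache.

```python
def estimate_weight_size_gib(num_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB (params * bits / 8 bytes)."""
    return num_params * bpw / 8 / 1024**3


# Assumed parameter count for a 7B-class model; substitute the real figure.
ASSUMED_PARAMS = 7e9

for bpw in (3.0, 4.0, 5.0, 6.0, 8.0):
    size = estimate_weight_size_gib(ASSUMED_PARAMS, bpw)
    print(f"{bpw:.1f} bpw -> ~{size:.2f} GiB")
```

Comparing these numbers against the available GPU memory is a quick way to pick a starting value for `BPW`.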


## Step 3: Download Base Model

Download the model from HuggingFace and prepare it for quantization.


In [None]:
# Download model from HuggingFace
!git lfs install
!git clone https://huggingface.co/HuggingFaceH4/{MODEL_NAME}
!mv {MODEL_NAME} base_model
!rm -f base_model/*.bin  # Drop the PyTorch .bin weights; the safetensors copies are kept
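
Before moving on, it can help to confirm that the download left the expected files in place. The helper below is a minimal sketch: `summarize_model_dir` is a hypothetical name, and it assumes the model directory holds a `config.json` alongside `*.safetensors` weight files.

```python
from pathlib import Path


def summarize_model_dir(model_dir: str) -> dict:
    """Report which expected files are present in a downloaded model directory."""
    root = Path(model_dir)
    return {
        # Model configuration file (assumed to be named config.json)
        "has_config": (root / "config.json").is_file(),
        # Weight shards that should remain after cleanup
        "safetensors": sorted(p.name for p in root.glob("*.safetensors")),
        # Any .bin weights that were not removed
        "leftover_bin": sorted(p.name for p in root.glob("*.bin")),
    }
```

Calling `summarize_model_dir("base_model")` should show at least one safetensors file and an empty `leftover_bin` list if the cell above ran cleanly.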


## Step 4: Download Calibration Dataset

Download the test split of the WikiText-103 dataset (a single Parquet file) to use as calibration data during quantization.


In [None]:
# Download calibration dataset
!wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3/wikitext-103-v1/wikitext-test.parquet


## Step 5: Quantize the Model

Run the quantization process using the specified BPW and calibration dataset.


In [None]:
# Quantize the model
!mkdir -p quant  # -p avoids an error if the directory already exists
!python exllamav2/convert.py \
    -i base_model \
    -o quant \
    -c wikitext-test.parquet \
    -b {BPW}
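
If you prefer to launch the conversion from Python rather than a shell magic, the same command can be assembled as an argument list. This is a sketch only: `build_convert_command` is a hypothetical helper, and it uses just the four flags shown above (`-i`, `-o`, `-c`, `-b`).

```python
import shlex
import sys


def build_convert_command(input_dir, output_dir, calibration_file, bpw,
                          script="exllamav2/convert.py"):
    """Build the argument list for the conversion script, mirroring the cell above."""
    return [
        sys.executable, script,
        "-i", input_dir,         # directory with the unquantized model
        "-o", output_dir,        # working/output directory
        "-c", calibration_file,  # calibration dataset (Parquet)
        "-b", str(bpw),          # target bits per weight
    ]


cmd = build_convert_command("base_model", "quant", "wikitext-test.parquet", 5.0)
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems if any of the paths contain spaces.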


## Step 6: Prepare Quantized Model

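Remove the temporary `out_tensor` directory left by the conversion, then copy the remaining model files from `base_model` into `quant`, skipping the original safetensors weights and hidden files.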


In [None]:
# Remove temporary conversion output, then copy everything except weights and hidden files
!rm -rf quant/out_tensor
!rsync -av --exclude='*.safetensors' --exclude='.*' ./base_model/ ./quant/
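
On systems without `rsync`, a similar copy can be done with the standard library. The sketch below assumes Python 3.8 or newer (for `dirs_exist_ok`); `copy_support_files` is an illustrative name, and the two patterns mirror the `--exclude` options above.

```python
import shutil


def copy_support_files(src: str, dst: str) -> None:
    """Copy src into dst, skipping safetensors weights and hidden files or directories."""
    shutil.copytree(
        src,
        dst,
        ignore=shutil.ignore_patterns("*.safetensors", ".*"),
        dirs_exist_ok=True,  # dst already exists and holds the quantized weights
    )
```

Calling `copy_support_files("base_model", "quant")` leaves files already in `quant` untouched unless a file of the same name exists in `base_model`.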


## Step 7: Test the Quantized Model

Run inference with the quantized model to verify it works correctly.


In [None]:
# Test inference with the quantized model
!python exllamav2/test_inference.py -m quant/ -p "I have a dream"


## Step 8: Upload to HuggingFace Hub (Optional)

Upload the quantized model to HuggingFace for sharing and deployment.


In [None]:
# Install required packages for uploading
!pip install -q huggingface_hub
!git config --global credential.helper store


In [None]:
from huggingface_hub import notebook_login, HfApi
import locale
# Force UTF-8 as the preferred encoding; works around locale errors in some notebook environments
locale.getpreferredencoding = lambda: "UTF-8"

# Login to HuggingFace
notebook_login()

# Initialize API
api = HfApi()

# Replace 'your-username' with your HuggingFace username
username = "your-username"

# Build the repository ID once so both calls below use the same name
repo_id = f"{username}/{MODEL_NAME}-{BPW:.1f}bpw-exl2"

# Create the repository (no error if it already exists)
api.create_repo(
    repo_id=repo_id,
    repo_type="model",
    exist_ok=True
)

# Upload the quantized model
api.upload_folder(
    repo_id=repo_id,
    folder_path="quant",
)
