# 📦 LMFast: Quantization & Export

**Shrink your models for deployment while keeping quality loss small!**

## What You'll Learn
- 4-bit (NF4, the format popularized by QLoRA) vs 8-bit quantization
- Export to GGUF (for llama.cpp / Ollama)
- Export to ONNX (for standard runtimes)
- Understand AWQ vs GPTQ

## Quick Guide
| Format | Best For | Speed (more ⭐ = faster) | Size (more ⭐ = smaller) |
|--------|----------|-------|------|
| **GGUF** | CPU / Mac / Edge | ⭐⭐⭐ | ⭐⭐⭐ |
| **Int4** | GPU Serving | ⭐⭐⭐ | ⭐⭐⭐ |
| **ONNX** | Browser / Web | ⭐⭐ | ⭐⭐ |
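
As a rough back-of-the-envelope check on the Size column, weight storage scales with bits per parameter. The sketch below assumes about 135 million parameters (inferred from the name of the demo model used later) and ignores per-block scales, layers kept at higher precision, and file metadata, so real exported files will come out somewhat larger. The helper name `estimate_weight_size_mb` is made up for this illustration.

```python
def estimate_weight_size_mb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in MiB, ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 1024 / 1024


NUM_PARAMS = 135_000_000  # assumption: approximate parameter count of a 135M model

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    size = estimate_weight_size_mb(NUM_PARAMS, bits)
    print(f"{label:>5}: ~{size:.0f} MB")
```

Halving the bits per weight halves the estimate, which is why 4-bit formats are attractive for edge devices.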

**Time to complete:** ~10 minutes

## 1️⃣ Setup

In [None]:
!pip install -q lmfast[all]

import lmfast
lmfast.setup_colab_env()

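import torch

# get_device_name(0) raises an error on machines without a CUDA device, so check first
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("GPU: none detected")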

## 2️⃣ Load a Model

We'll use a small model for demonstration.

In [None]:
# Using the base model for export demos
MODEL_ID = "HuggingFaceTB/SmolLM-135M-Instruct"

# You can also point to your locally trained model:
# MODEL_ID = "./my_first_slm"

## 3️⃣ Export to GGUF (llama.cpp)

GGUF is the model file format used by llama.cpp and Ollama, and a popular choice for running LLMs on consumer hardware.

**Note:** LMFast automatically clones and sets up `llama.cpp` for you if it's not found in your environment.

In [None]:
import os

from lmfast.inference import export_gguf

GGUF_PATH = "./smollm-135m-q4.gguf"

print("📦 Exporting to GGUF (q4_k_m)...")

try:
    export_gguf(
        model_path=MODEL_ID,
        output_path=GGUF_PATH,
        quantization="q4_k_m"  # Balanced 4-bit quantization
    )
    print("✅ GGUF Export Successful!")

    # Check size
    size_mb = os.path.getsize(GGUF_PATH) / 1024 / 1024
    print(f"File Size: {size_mb:.2f} MB")

except Exception as e:
    print(f"⚠️ GGUF Export failed: {e}")
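
A quick way to sanity-check the exported file without loading it is to read its header. GGUF files begin with the four magic bytes `GGUF` followed by a 32-bit format version. The sketch below is a minimal check that assumes a little-endian file (the common case); `read_gguf_version` is a hypothetical helper, not part of LMFast or llama.cpp.

```python
import struct


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise ValueError if the magic bytes are wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Assumption: little-endian layout, so the version is an unsigned 32-bit int
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Usage, once the export above has succeeded:
# print(read_gguf_version("./smollm-135m-q4.gguf"))
```

This only confirms the file is a GGUF container; it says nothing about whether the weights inside are usable.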

## 4️⃣ In-Place Quantization (Int4 / Int8)

If you want to serve the model using Python (transformers/bitsandbytes), you can save a quantized version locally.

In [None]:
from lmfast.inference import quantize_model

print("⚖️ Quantizing to 4-bit (NF4)...")

quantize_model(
    model_path=MODEL_ID,
    output_path="./smollm-int4",
    method="int4"  # Uses bitsandbytes NF4
)

print("✅ Int4 Model Saved!")
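
To build intuition for what 4-bit storage costs in accuracy, here is a simplified sketch of blockwise absmax quantization in plain Python. It is not the NF4 scheme that bitsandbytes uses (NF4 maps values onto a non-uniform set of levels), and it is not how `quantize_model` is implemented; the function names and the block size of 4 are assumptions chosen for illustration.

```python
def quantize_block(values, levels=7):
    """Map floats to integer codes in [-levels, levels], scaled by the block's absolute maximum."""
    absmax = max(abs(v) for v in values)
    scale = absmax / levels if absmax > 0 else 1.0  # avoid dividing by zero on all-zero blocks
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from integer codes and the stored scale."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.91, -0.07, 0.33, 0.02, -0.88, 0.45]
BLOCK_SIZE = 4  # illustrative; real schemes use larger blocks

restored = []
for start in range(0, len(weights), BLOCK_SIZE):
    block = weights[start:start + BLOCK_SIZE]
    codes, scale = quantize_block(block)  # 15 possible codes fit in 4 bits
    restored.extend(dequantize_block(codes, scale))

max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"Max round-trip error: {max_error:.4f}")
```

Each block stores one scale plus a 4-bit code per weight, and the round-trip error is bounded by half the block's scale. Smaller blocks reduce the error at the cost of storing more scales.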

## 5️⃣ Export to ONNX

Great for running in the browser or cross-platform apps.

In [None]:
from lmfast.deployment import export_for_browser

print("🌐 Exporting for Browser (ONNX)...")

# The browser exporter handles ONNX conversion, optimization, and demo generation
artifacts = export_for_browser(
    model_path=MODEL_ID,
    output_dir="./onnx_model",
    target="onnx",
    quantization="int8",
    create_demo=False
)

print("✅ ONNX Export complete!")
print(f"Artifacts: {list(artifacts.keys())}")
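
Both the Int4 and ONNX exports write a directory rather than a single file, so comparing their footprint means summing the files inside. The helper below is a small standard-library sketch; `dir_size_mb` is a hypothetical name, and the two paths are the output locations used in this notebook.

```python
import os


def dir_size_mb(path: str) -> float:
    """Total size of all files under `path`, in MiB."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / 1024 / 1024


for out_dir in ["./smollm-int4", "./onnx_model"]:
    if os.path.isdir(out_dir):
        print(f"{out_dir}: {dir_size_mb(out_dir):.2f} MB")
    else:
        print(f"{out_dir}: not found (run the export cell first)")
```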

## 6️⃣ Verify the Quantized Model

LMFast's `SLMServer` automatically detects quantization and optimizes inference.

In [None]:
from lmfast.inference import SLMServer

print("🚀 Loading Quantized Model...")

# Load the int4 model we saved earlier
server = SLMServer("./smollm-int4")

prompt = "What is the speed of light? Answer briefly."
response = server.generate(prompt)

print(f"\nPrompt: {prompt}")
print(f"Response: {response}")
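
A single response shows the model loads, but not how fast it runs. The sketch below averages wall-clock latency over a few calls; `time_generate` is a hypothetical helper that accepts any callable taking a prompt, so it makes no assumptions about `SLMServer` beyond the `generate(prompt)` call already used above.

```python
import time


def time_generate(generate_fn, prompt, repeats=3):
    """Return the average wall-clock seconds per call over `repeats` runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        generate_fn(prompt)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


# Usage with the server loaded above:
# avg = time_generate(server.generate, "What is the speed of light? Answer briefly.")
# print(f"Average latency: {avg:.2f} s")
```

The first call often includes warm-up cost, so consider discarding it or raising `repeats` before comparing models.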

## 🎉 Summary

You've learned how to:
- ✅ Create **GGUF** files for edge devices
- ✅ Save **Int4/Int8** models for high-performance Python serving
- ✅ Export **ONNX** models for browser deployment

### Next Steps
- `15_browser_deployment.ipynb`: Use the ONNX model in a web app!
- `16_edge_deployment.ipynb`: Run the GGUF model on a Raspberry Pi