# EdgeLLM Fine-Tuning on Google Colab

**Fine-tune your own LLM for FREE on Google Colab!**

This notebook walks you through:
1. Setting up the environment
2. Preparing your dataset
3. Fine-tuning with QLoRA
4. Merging and quantizing the model
5. Deploying to edge devices

**Requirements:**
- Google Colab (FREE tier works!)
- T4 GPU (free) or better
- ~15 minutes for a small model

Let's get started!

## 1. Setup Environment

First, let's install the required packages.

In [None]:
# Install dependencies
!pip install -q torch transformers peft bitsandbytes datasets accelerate trl
!pip install -q sentencepiece protobuf

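# Check GPU (stop early with a clear message if no GPU runtime is attached)
import torch
assert torch.cuda.is_available(), "No GPU detected. In Colab: Runtime > Change runtime type > T4 GPU"
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")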

In [None]:
# Imports
import json
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

## 2. Select Base Model

Choose a base model based on your target hardware:

| Model | Parameters | BitNet size | Target hardware | Speed |
|-------|------------|-------------|-----------------|-------|
| SmolLM-135M | 135M | 35MB | Pi Zero 2 W | 5-10 tok/s |
| SmolLM-360M | 360M | 90MB | Pi Zero 2 W | 3-6 tok/s |
| Qwen2-0.5B | 500M | 125MB | Pi 4 | 8-15 tok/s |
| Llama-3.2-1B | 1B | 200MB | Pi 5 | 20-40 tok/s |
| Phi-3-mini | 3.8B | 750MB | Jetson | 10-20 tok/s |
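
The BitNet sizes for the smaller models are close to what two bits per weight would give. Here is a rough sketch of that arithmetic, assuming 2-bit packed storage and ignoring per-row scales, the file header, and any tensors kept at higher precision. The helper name `packed_size_mb` is made up for illustration.

```python
def packed_size_mb(n_params: int, bits_per_weight: int = 2) -> float:
    """Approximate size in MB of n_params weights stored at bits_per_weight bits each."""
    return n_params * bits_per_weight / 8 / 1e6

# Parameter counts are rounded from the model names in the table.
for name, n_params in [
    ("SmolLM-135M", 135_000_000),
    ("SmolLM-360M", 360_000_000),
    ("Qwen2-0.5B", 500_000_000),
]:
    print(f"{name}: ~{packed_size_mb(n_params):.0f} MB")
```

Real files will differ somewhat from this estimate because of the scales and metadata stored alongside the packed weights.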

In [None]:
# Configuration
MODEL_ID = "HuggingFaceTB/SmolLM-135M"  # Change this!
OUTPUT_DIR = "./finetuned_model"

# Training settings
EPOCHS = 3
BATCH_SIZE = 4
LEARNING_RATE = 2e-4
MAX_LENGTH = 512

# LoRA settings
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05

## 3. Prepare Your Dataset

You can either:
1. Upload your own JSONL file
2. Use a HuggingFace dataset
3. Create data directly in the notebook

### Data Format (JSONL)
```json
{"instruction": "What is 2+2?", "response": "4"}
{"instruction": "Say hello", "response": "Hello! How can I help you?"}
```
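
Before training on your own file, it can help to check that every line matches this format. The sketch below is one way to do that with only the standard library. The function name `validate_jsonl_lines` and the rule that both fields must be non-empty strings are assumptions for illustration, not something the training code enforces.

```python
import json

REQUIRED_KEYS = ("instruction", "response")

def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means every record is usable."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty '{key}'"))
    return problems

sample = [
    '{"instruction": "What is 2+2?", "response": "4"}',
    '{"instruction": "Say hello"}',
    'not json',
]
print(validate_jsonl_lines(sample))
```

To check a real file, pass an open file handle instead of a list, since iterating over a file yields its lines.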

In [None]:
# Option 1: Create sample data directly
sample_data = [
    {"instruction": "Turn on the living room lights", "response": "Turning on the living room lights now."},
    {"instruction": "What's the temperature?", "response": "The current temperature is 72°F (22°C)."},
    {"instruction": "Set a timer for 5 minutes", "response": "Timer set for 5 minutes. I'll let you know when it's done."},
    {"instruction": "Play some music", "response": "Playing your favorite playlist now."},
    {"instruction": "What's on my calendar today?", "response": "You have a meeting at 2 PM and a dentist appointment at 4 PM."},
    # Add more examples here!
]

# Save as JSONL
with open("dataset.jsonl", "w") as f:
    for item in sample_data:
        f.write(json.dumps(item) + "\n")

print(f"Created dataset with {len(sample_data)} examples")

In [None]:
# Option 2: Upload your own file
# from google.colab import files
# uploaded = files.upload()  # Upload your dataset.jsonl

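# Load dataset
def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    data = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # a trailing empty line would otherwise crash json.loads
                data.append(json.loads(line))
    return Dataset.from_list(data)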

dataset = load_jsonl("dataset.jsonl")
print(f"Loaded {len(dataset)} examples")
print(f"\nSample: {dataset[0]}")

## 4. Load Model and Train

We use QLoRA for efficient fine-tuning:
- 4-bit quantized base model
- LoRA adapters for training
- Uses only ~4GB VRAM!

In [None]:
# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 has no native bfloat16; matches fp16=True in the training arguments
    bnb_4bit_use_double_quant=True,
)

# Load tokenizer
print(f"Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load model
print(f"Loading model: {MODEL_ID}")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Prepare for training
model = prepare_model_for_kbit_training(model)
model.gradient_checkpointing_enable()

print("Model loaded!")

In [None]:
# Configure LoRA
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
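    # Attention projections as named in Llama-style models (SmolLM, Qwen2, Llama).
    # Phi-3 fuses Q/K/V into a single qkv_proj, so adjust this list for it.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],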
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

# Print trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
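
The percentage printed above follows from the adapter shapes. For a linear layer with `d_in` inputs and `d_out` outputs, a rank-`r` LoRA adapter adds `r * (d_in + d_out)` weights. Here is a small sketch of that arithmetic. The layer dimensions are hypothetical values chosen for illustration, and it assumes all four adapted projections are square, which does not hold for models that use fewer key/value heads.

```python
LORA_R = 16
LORA_ALPHA = 32

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Hypothetical dimensions for illustration; read the real ones from model.config.
hidden_size = 576
num_layers = 30
adapted_per_layer = 4  # q_proj, k_proj, v_proj, o_proj, all assumed square here

per_adapter = lora_param_count(hidden_size, hidden_size, LORA_R)
total = per_adapter * adapted_per_layer * num_layers
print(f"Per adapter: {per_adapter:,}  Total: {total:,}")
print(f"LoRA scaling factor (alpha / r): {LORA_ALPHA / LORA_R}")
```

Doubling `LORA_R` doubles the adapter size. With `LORA_ALPHA` held fixed it also halves the scaling factor applied to the adapter output.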

In [None]:
# Format dataset for training
def format_instruction(example):
    if "input" in example and example["input"]:
        text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['response']}"
    else:
        text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    return text
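
To see what the model is actually trained on, it helps to render one example by hand. The sketch below rebuilds the same layout in a standalone helper. The name `build_prompt` and the `context` parameter are made up for illustration. It also shows that the prompt used at inference time is the training text with the response left off.

```python
def build_prompt(instruction: str, response: str = "", context: str = "") -> str:
    """Render one example in the '### Instruction / ### Input / ### Response' layout."""
    parts = [f"### Instruction:\n{instruction}"]
    if context:
        parts.append(f"### Input:\n{context}")
    parts.append(f"### Response:\n{response}")
    return "\n\n".join(parts)

training_text = build_prompt("Say hello", "Hello! How can I help you?")
inference_prompt = build_prompt("Say hello")

print(training_text)
# The inference prompt is the training text with the response left off.
assert training_text.startswith(inference_prompt)
```

Keeping the two strings in step matters: if the inference prompt uses different markers or spacing than the training text, the fine-tuned model sees a format it was never trained on.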

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=4,
    learning_rate=LEARNING_RATE,
    warmup_ratio=0.03,
    weight_decay=0.01,
    logging_steps=10,
    save_steps=50,
    save_total_limit=2,
    fp16=True,
    report_to="none",
)

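# Create trainer
# Note: the keyword arguments below follow older TRL releases. Newer releases
# take `processing_class=` instead of `tokenizer=` and read the sequence length
# from `SFTConfig`, so pin a TRL version or adapt these names if this call fails.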
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    formatting_func=format_instruction,
    max_seq_length=MAX_LENGTH,
)

# Train!
print("Starting training...")
trainer.train()
print("Training complete!")
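
The batch size and gradient accumulation settings together decide how many weight updates a run performs. Here is a rough sketch of that arithmetic for a single GPU. The helper name `optimizer_steps` is made up, and the trainer's exact rounding may differ between library versions.

```python
import math

def optimizer_steps(n_examples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Approximate number of weight updates for a single-GPU run."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    updates_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum))
    return updates_per_epoch * epochs

effective_batch = 4 * 4  # BATCH_SIZE * gradient_accumulation_steps
print(f"Effective batch size: {effective_batch}")
print(f"Updates with 5 examples: {optimizer_steps(5, 4, 4, 3)}")
print(f"Updates with 1,000 examples: {optimizer_steps(1000, 4, 4, 3)}")
```

With only the five sample examples, the run performs just a handful of updates. That is fewer than `logging_steps=10` and `save_steps=50`, so you should not expect intermediate loss logs or checkpoints until you add more data or lower those values.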

In [None]:
# Save the fine-tuned model
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"Model saved to {OUTPUT_DIR}")

## 5. Test Your Model

Let's test the fine-tuned model before quantization.

In [None]:
# Test the model
def generate(prompt, max_tokens=50):
    # Move inputs to wherever the model was placed instead of hard-coding "cuda"
    inputs = tokenizer(f"### Instruction:\n{prompt}\n\n### Response:\n", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_tokens, do_sample=True, temperature=0.7)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test prompts
test_prompts = [
    "Turn on the kitchen lights",
    "What time is it?",
    "Set an alarm for 7 AM",
]

print("Testing model...\n")
for prompt in test_prompts:
    print(f"User: {prompt}")
    response = generate(prompt)
    # Extract just the response part
    if "### Response:" in response:
        response = response.split("### Response:")[1].strip()
    print(f"Assistant: {response}\n")

## 6. Merge LoRA Weights

Merge the LoRA adapters into an unquantized FP16 copy of the base model, so the result is a standalone model that no longer needs PEFT at load time.

In [None]:
from peft import PeftModel

# Load base model (unquantized, half precision / FP16)
print("Loading base model...")
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="cpu",
    trust_remote_code=True,
)

# Load and merge LoRA
print("Merging LoRA weights...")
merged_model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
merged_model = merged_model.merge_and_unload()

# Save merged model
MERGED_DIR = f"{OUTPUT_DIR}_merged"
merged_model.save_pretrained(MERGED_DIR)
tokenizer.save_pretrained(MERGED_DIR)
print(f"Merged model saved to {MERGED_DIR}")
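
Conceptually, merging folds each adapter into its weight matrix as `W + (alpha / r) * B @ A`. The toy sketch below shows that update on plain Python lists. The helper names and the tiny matrices are made up for illustration, and this is not how PEFT implements the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, d_out x d_in
A = [[1.0, 2.0]]              # LoRA A, r x d_in (r = 1)
B = [[0.5], [0.0]]            # LoRA B, d_out x r

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```

After the merge, the adapter matrices are no longer needed at inference time: the merged matrix has the same shape as the original weight.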

## 7. Quantize to BitNet (Optional)

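Convert the merged weights to a ternary, BitNet-style 1.58-bit format for edge deployment. Each weight becomes -1, 0, or +1 with one scale per output row, and the values are packed as four 2-bit codes per byte.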

In [None]:
import struct
import numpy as np

def quantize_to_bitnet(model, output_path):
    """Simple BitNet quantization."""
    with open(output_path, "wb") as f:
        # Write header
        f.write(b"TMAC")
        f.write(struct.pack("I", 2))  # version
        
        config = model.config
        f.write(struct.pack("I", config.hidden_size))
        f.write(struct.pack("I", config.num_hidden_layers))
        f.write(struct.pack("I", config.num_attention_heads))
        f.write(struct.pack("I", config.vocab_size))
        f.write(struct.pack("I", 2))  # bits
        f.write(struct.pack("I", 4))  # group_size
        
        # Quantize weights
        for name, param in model.named_parameters():
            if "weight" in name and param.dim() >= 2:
                # Quantize to ternary
                weight = param.data.cpu().float()
                scale = weight.abs().max(dim=-1, keepdim=True)[0]
                normalized = weight / (scale + 1e-8)
                
                quantized = torch.zeros_like(normalized, dtype=torch.int8)
                quantized[normalized > 0.5] = 1
                quantized[normalized < -0.5] = -1
                
                # Pack four 2-bit codes per byte (-1 -> 0, 0 -> 1, +1 -> 2).
                # Vectorized with NumPy; a per-element Python loop is far too slow here.
                codes = (quantized.flatten().numpy() + 1).astype(np.uint8)
                codes = np.pad(codes, (0, -len(codes) % 4)).reshape(-1, 4)
                packed = (
                    codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)
                ).astype(np.uint8)
                
                # Write tensor
                name_bytes = name.encode()
                f.write(struct.pack("I", len(name_bytes)))
                f.write(name_bytes)
                f.write(struct.pack("I", len(param.shape)))
                for dim in param.shape:
                    f.write(struct.pack("I", dim))
                f.write(packed.tobytes())
                f.write(scale.squeeze(-1).numpy().tobytes())
    
    print(f"Quantized model saved to {output_path}")

# Quantize
OUTPUT_BITNET = "model.tmac2.bin"
quantize_to_bitnet(merged_model, OUTPUT_BITNET)

import os
original_size = sum(p.numel() * 2 for p in merged_model.parameters())  # FP16
quantized_size = os.path.getsize(OUTPUT_BITNET)
print(f"\nOriginal size (FP16): {original_size / 1e6:.1f} MB")
print(f"Quantized size: {quantized_size / 1e6:.1f} MB")
print(f"Compression: {original_size / quantized_size:.1f}x")
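
The packing step above can be hard to picture, so here is a small pure-Python round trip using the same encoding (-1 becomes 0, 0 becomes 1, +1 becomes 2, lowest bits first). The names `pack_ternary` and `unpack_ternary` are made up for illustration. A real loader on the device would need matching logic, which this sketch only approximates.

```python
CODE_FOR = {-1: 0, 0: 1, 1: 2}
VALUE_FOR = {code: value for value, code in CODE_FOR.items()}

def pack_ternary(values):
    """Pack ternary values into bytes, four 2-bit codes per byte, lowest bits first."""
    packed = bytearray((len(values) + 3) // 4)
    for i, value in enumerate(values):
        packed[i // 4] |= CODE_FOR[value] << ((i % 4) * 2)
    return bytes(packed)

def unpack_ternary(packed, count):
    """Recover the first `count` ternary values from packed bytes."""
    return [VALUE_FOR[(packed[i // 4] >> ((i % 4) * 2)) & 0b11] for i in range(count)]

weights = [-1, 0, 1, 0, 1]
packed = pack_ternary(weights)
assert unpack_ternary(packed, len(weights)) == weights
print(packed.hex())
```

The explicit `count` matters because unused slots in the last byte hold code 0, which would otherwise decode as -1.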

## 8. Download Your Model

Download the quantized model to deploy on your edge device.

In [None]:
from google.colab import files

# Download the quantized model
files.download(OUTPUT_BITNET)
print(f"Downloaded {OUTPUT_BITNET}")
print(f"\nNext steps:")
print(f"  1. Copy to your edge device (Raspberry Pi, Jetson, etc.)")
print(f"  2. Run: edgellm serve --model {OUTPUT_BITNET} --port 8080")
print(f"  3. Test: curl localhost:8080/v1/chat/completions ...")
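
Before copying the file to a device, you may want to confirm that its header looks right. The sketch below decodes the fixed-size header that the quantization step writes: the `TMAC` magic bytes followed by seven unsigned 32-bit integers. It assumes little-endian byte order, as on x86-64 Colab hosts, and the field names are labels chosen here for readability.

```python
import struct

HEADER_FORMAT = "<4s7I"  # assumption: little-endian, 4-byte magic plus seven uint32 fields
HEADER_FIELDS = ("version", "hidden_size", "num_layers", "num_heads", "vocab_size", "bits", "group_size")

def parse_header(raw: bytes) -> dict:
    """Decode the fixed-size header found at the start of the quantized file."""
    size = struct.calcsize(HEADER_FORMAT)
    magic, *values = struct.unpack(HEADER_FORMAT, raw[:size])
    if magic != b"TMAC":
        raise ValueError(f"unexpected magic bytes: {magic!r}")
    return dict(zip(HEADER_FIELDS, values))

# Demo on an in-memory header with made-up dimensions.
demo = struct.pack(HEADER_FORMAT, b"TMAC", 2, 576, 30, 9, 49152, 2, 4)
print(parse_header(demo))
```

For a real file, read the first `struct.calcsize(HEADER_FORMAT)` bytes in binary mode and pass them to `parse_header`.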

## Next Steps

1. **Deploy to Edge Device:**
   ```bash
   # On your Raspberry Pi / edge device
   edgellm serve --model model.tmac2.bin --port 8080
   ```

2. **Test the API:**
   ```bash
   curl localhost:8080/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{"messages": [{"role": "user", "content": "Turn on the lights"}]}'
   ```

3. **Benchmark:**
   ```bash
   edgellm benchmark --model model.tmac2.bin
   ```
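
If you would rather call the API from Python than from `curl`, the same request can be built with the standard library. This is a minimal sketch. The helper name `build_chat_request` is made up, and it assumes the server replies with JSON, which the steps above do not specify.

```python
import json
import urllib.request

def build_chat_request(prompt: str, base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_chat_request("Turn on the lights")
print(request.full_url)

# Uncomment once the server is running on the device:
# with urllib.request.urlopen(request, timeout=30) as reply:
#     print(json.loads(reply.read()))
```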

## Resources

- [EdgeLLM GitHub](https://github.com/yourusername/edgellm)
- [Documentation](https://github.com/yourusername/edgellm/docs)
- [Hardware Guide](https://github.com/yourusername/edgellm/docs/hardware-guide.md)