# Lab 3: Model Customization with QLoRA

In this lab, you'll learn how to **fine-tune** open-source language models using **QLoRA** (Quantized Low-Rank Adaptation), a technique that makes fine-tuning feasible on consumer hardware.

## What is QLoRA?
QLoRA allows you to fine-tune large models by:
1. **Quantizing** the base model to 4-bit precision (reduces memory)
2. **Adding small trainable adapters** (LoRA) instead of updating all weights
3. Training only the adapters while keeping the base model frozen
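
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. The 2.7B parameter count matches the phi-2 model used later in this lab; the helper names and the 2560 x 2560 layer size are illustrative assumptions, and the estimate covers weights only (no activations, optimizer state, or quantization constants).

```python
def weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)


num_params = 2_700_000_000  # roughly phi-2
print(f"fp16 weights:  {weight_memory_gb(num_params, 16):.2f} GB")
print(f"4-bit weights: {weight_memory_gb(num_params, 4):.2f} GB")

# One square projection matrix (assumed size) adapted with rank 16
full = 2560 * 2560
added = lora_params(2560, 2560, r=16)
print(f"Full matrix: {full:,} weights, LoRA adapter: {added:,} weights")
```

The 4-bit base model needs about a quarter of the fp16 memory, and each adapted matrix gains only a small number of trainable weights relative to its full size.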

## What You'll Learn
- Loading quantized models
- Preparing training data
- Configuring LoRA parameters
- Fine-tuning with Hugging Face
- Merging and saving adapters

## Requirements
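- **CUDA GPU recommended** (8 GB+ VRAM). Without one, the lab falls back to plain LoRA on an unquantized model (CPU or Apple MPS), which is only practical for small models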
- Python packages: transformers, peft, datasets, bitsandbytes, accelerate

## 1. Setup

In [None]:
!pip install transformers peft datasets accelerate bitsandbytes -q

In [None]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import Dataset
import json

# Check for GPU
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using device: {device}")

if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 2. Load a Quantized Model

In [None]:
# Model to fine-tune (using a small model for demo)
model_name = "microsoft/phi-2"  # 2.7B parameters, good for learning

# For larger models with GPU:
# model_name = "meta-llama/Llama-2-7b-hf"  # Requires HF token
# model_name = "mistralai/Mistral-7B-v0.1"

In [None]:
# Quantization config for 4-bit loading
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

print(f"Tokenizer loaded: {model_name}")
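
As a rough illustration of what 4-bit quantization does, the sketch below quantizes one block of weights with a shared scale and then restores it. This is a simplified uniform absmax scheme, not the NF4 format that bitsandbytes implements (NF4 uses non-uniform levels, and double quantization additionally compresses the per-block scales). The function names are hypothetical.

```python
def quantize_block(values, bits=4):
    """Absmax-quantize one block of floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                          # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax or 1.0   # avoid a zero scale
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from integer codes and the block scale."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.1, 0.0]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print(f"scale: {scale:.4f}, max reconstruction error: {max_error:.6f}")
```

Each weight is stored as a small integer plus one scale per block, which is where the memory saving comes from; the price is a bounded rounding error per weight.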

In [None]:
# Load model with quantization (GPU required for bitsandbytes)
if device == "cuda":
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )
    model = prepare_model_for_kbit_training(model)
else:
    # For CPU/MPS, load without quantization (slower but works)
    print("Loading without quantization (no CUDA GPU detected)")
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float32,
        trust_remote_code=True,
    )

print(f"Model loaded with {model.num_parameters():,} parameters")

## 3. Configure LoRA

In [None]:
# LoRA configuration
lora_config = LoraConfig(
    r=16,                      # Rank of the update matrices
    lora_alpha=32,             # Scaling factor
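    # Module names are model-specific. These match phi-2 in recent transformers
    # releases; Llama and Mistral use "o_proj" instead of "dense".
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # Layers to apply LoRA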
    lora_dropout=0.05,         # Dropout for regularization
    bias="none",               # Don't train biases
    task_type="CAUSAL_LM"      # Task type
)

# Apply LoRA to model
model = get_peft_model(model, lora_config)

# Print trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
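
The effect of `r` and `lora_alpha` can be sketched in plain Python. Assuming the standard LoRA formulation, an adapted layer computes `W x + (lora_alpha / r) * B (A x)`, where `W` is frozen and only `A` and `B` are trained. The tiny dimensions and helper names below are illustrative; with the configuration above (`r=16`, `lora_alpha=32`) the scaling factor would be 2.0.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Compute W x + (lora_alpha / r) * B (A x)."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


random.seed(0)
d, r, lora_alpha = 4, 2, 4                                          # toy sizes
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]      # frozen, d x d
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]   # trainable, r x d
B = [[0.0] * r for _ in range(d)]                                   # trainable, d x r, starts at zero
x = [1.0, 2.0, 3.0, 4.0]

before = lora_forward(W, A, B, x, r, lora_alpha)
print("matches base model at start:", before == matvec(W, x))

B[0][0] = 1.0  # pretend training changed one adapter weight
after = lora_forward(W, A, B, x, r, lora_alpha)
print("output changed:", after != before)
```

Because `B` starts at zero, the adapted model initially behaves exactly like the base model, and training moves it away from that starting point only through the small `A` and `B` matrices.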

## 4. Prepare Training Data

In [None]:
# Sample training data - customize for your use case
# Format: instruction-response pairs

training_data = [
    {
        "instruction": "What is the capital of France?",
        "response": "The capital of France is Paris."
    },
    {
        "instruction": "Explain photosynthesis in simple terms.",
        "response": "Photosynthesis is how plants make food. They use sunlight, water, and carbon dioxide to create sugar and oxygen."
    },
    {
        "instruction": "Write a haiku about coding.",
        "response": "Lines of code flow free\nBugs emerge from the shadows\nDebug, iterate"
    },
    {
        "instruction": "What is machine learning?",
        "response": "Machine learning is a type of AI where computers learn patterns from data instead of being explicitly programmed."
    },
    {
        "instruction": "Translate 'hello' to Spanish.",
        "response": "Hello in Spanish is 'hola'."
    },
    {
        "instruction": "What is 15% of 200?",
        "response": "15% of 200 is 30."
    },
    {
        "instruction": "Name three programming languages.",
        "response": "Three popular programming languages are Python, JavaScript, and Java."
    },
    {
        "instruction": "What causes rain?",
        "response": "Rain occurs when water vapor in clouds condenses into droplets that become heavy enough to fall to the ground."
    }
]

print(f"Training examples: {len(training_data)}")
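
For a real use case you would usually keep the examples in a file rather than in the notebook. The sketch below parses JSON Lines text (one JSON object per line) into the same list-of-dicts shape as `training_data`. The `load_pairs` helper and the sample text are hypothetical; only the `instruction` and `response` keys come from this lab.

```python
import json


def load_pairs(jsonl_text: str) -> list:
    """Parse JSONL text into instruction-response dicts, skipping blank lines."""
    pairs = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = {"instruction", "response"} - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing {sorted(missing)}")
        pairs.append({"instruction": record["instruction"],
                      "response": record["response"]})
    return pairs


sample = """
{"instruction": "What is the capital of Italy?", "response": "The capital of Italy is Rome."}
{"instruction": "Name a prime number.", "response": "7 is a prime number."}
"""
pairs = load_pairs(sample)
print(f"Loaded {len(pairs)} examples")
```

To read from disk, pass the file contents, for example `load_pairs(open("train.jsonl", encoding="utf-8").read())`, where `train.jsonl` is a file you provide.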

In [None]:
# Format data for training
def format_prompt(example):
    """Format instruction-response pair into training text."""
    return f"""### Instruction:
{example['instruction']}

### Response:
{example['response']}"""

# Create formatted texts
formatted_data = [{"text": format_prompt(ex)} for ex in training_data]

# Show example
print("Example formatted prompt:")
print(formatted_data[0]["text"])

In [None]:
# Create Hugging Face dataset
dataset = Dataset.from_list(formatted_data)

# Tokenize the data
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
print(f"Tokenized dataset: {tokenized_dataset}")
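
Padding every example to 512 tokens means most positions are padding. The sketch below shows, in plain Python, how a causal-LM collator such as the `DataCollatorForLanguageModeling` used in the next section typically builds labels: it copies `input_ids` and replaces pad positions with -100 so the loss ignores them. The token ids and the `build_labels` helper are made up for illustration.

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips targets with this value by default


def build_labels(input_ids, pad_token_id):
    """Copy input_ids to labels, masking every pad position."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


eos_id = 2  # hypothetical id; also used as the pad id, as in this lab
input_ids = [15, 27, 42, eos_id, eos_id, eos_id]  # 3 real tokens + padding
labels = build_labels(input_ids, pad_token_id=eos_id)
print(labels)
```

One consequence of setting `pad_token` to `eos_token`: any genuine end-of-sequence token in the text is masked as well, so the model gets no training signal for when to stop. For a toy run this is acceptable; for real training a dedicated pad token is a common alternative.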

## 5. Training

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./lora_output",
    num_train_epochs=3,
    per_device_train_batch_size=1,  # Adjust based on memory
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
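    warmup_steps=1,    # This run has only ~6 optimizer steps (8 examples / 4 per step x 3 epochs)
    logging_steps=1,   # Log every step so the loss is visible on a tiny run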
    save_strategy="epoch",
    fp16=device == "cuda",  # Use fp16 only with CUDA
    report_to="none",  # Disable wandb/tensorboard
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Causal LM, not masked LM
)

print("Training configuration ready")
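
It helps to know how many optimizer steps a run will actually take before choosing `warmup_steps` and `logging_steps`. The following is a simplified estimate; the exact rounding inside `Trainer` may differ between versions, and the `optimizer_steps` helper is hypothetical.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    epochs, num_devices=1):
    """Estimate the effective batch size and the total number of optimizer steps."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = max(1, math.ceil(num_examples / effective_batch))
    return effective_batch, steps_per_epoch * epochs


# Values from this lab: 8 examples, batch size 1, 4 accumulation steps, 3 epochs
effective_batch, total_steps = optimizer_steps(8, 1, 4, 3)
print(f"Effective batch size: {effective_batch}")
print(f"Total optimizer steps: {total_steps}")
```

If `warmup_steps` is larger than the total, the learning rate never reaches its target value; if `logging_steps` is larger, no training loss is logged during the run.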

In [None]:
# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

print("Trainer initialized")

In [None]:
# Train! (This will take a few minutes)
print("Starting training...")
trainer.train()
print("Training complete!")

## 6. Save the Adapter

In [None]:
# Save the LoRA adapter
adapter_path = "./my_lora_adapter"
model.save_pretrained(adapter_path)
tokenizer.save_pretrained(adapter_path)

print(f"Adapter saved to: {adapter_path}")

# List saved files
import os
print("\nSaved files:")
for f in os.listdir(adapter_path):
    size = os.path.getsize(os.path.join(adapter_path, f))
    print(f"  {f}: {size/1024:.1f} KB")
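
A saved adapter directory normally contains an `adapter_config.json` that records the LoRA settings and the base model it belongs to. The sketch below reads a few of those fields. The `summarize_adapter` helper is hypothetical, and it runs against a stand-in directory so that it works without a trained adapter; point it at `adapter_path` to inspect the real one.

```python
import json
import os
import tempfile


def summarize_adapter(adapter_dir):
    """Return key LoRA settings from a saved adapter's adapter_config.json."""
    config_path = os.path.join(adapter_dir, "adapter_config.json")
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    keys = ("base_model_name_or_path", "r", "lora_alpha", "target_modules")
    return {key: config.get(key) for key in keys}


# Stand-in directory with a minimal config (illustrative values from this lab)
with tempfile.TemporaryDirectory() as tmp:
    demo_config = {
        "base_model_name_or_path": "microsoft/phi-2",
        "r": 16,
        "lora_alpha": 32,
        "target_modules": ["q_proj", "k_proj", "v_proj", "dense"],
    }
    with open(os.path.join(tmp, "adapter_config.json"), "w", encoding="utf-8") as f:
        json.dump(demo_config, f)
    summary = summarize_adapter(tmp)

print(summary)
```

Checking the recorded base model name before loading is a cheap way to avoid attaching an adapter to the wrong model.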

## 7. Test the Fine-tuned Model

In [None]:
# Test generation
def generate_response(instruction: str, max_new_tokens: int = 100):
    prompt = f"""### Instruction:
{instruction}

### Response:
"""
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    model.eval()  # Disable dropout (including LoRA dropout) for inference
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract just the response part
    response = response.split("### Response:")[-1].strip()
    return response

# Test with training-like prompts
test_prompts = [
    "What is the capital of Germany?",
    "Explain gravity in simple terms.",
    "Write a haiku about coffee."
]

print("Testing fine-tuned model:\n")
for prompt in test_prompts:
    print(f"Q: {prompt}")
    print(f"A: {generate_response(prompt)}\n")
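
A model trained on so few examples often keeps generating after its answer, sometimes inventing a new "### Instruction:" block. The sketch below is one possible, stricter way to extract the answer: it keeps the text after the first response marker and cuts at the next instruction marker. The `extract_response` helper is hypothetical and works on plain strings, so it can be tried without a model.

```python
def extract_response(generated_text,
                     response_marker="### Response:",
                     next_marker="### Instruction:"):
    """Return the first response in the text, dropping any follow-on turns."""
    answer = generated_text.split(response_marker, 1)[-1]
    answer = answer.split(next_marker, 1)[0]
    return answer.strip()


raw = (
    "### Instruction:\nWhat is the capital of Germany?\n\n"
    "### Response:\nThe capital of Germany is Berlin.\n\n"
    "### Instruction:\nWhat is the capital of Spain?\n\n"
    "### Response:\nMadrid."
)
print(extract_response(raw))
```

Splitting on the last response marker, as `generate_response` does, would return the invented second answer in this case.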

## 8. Load Adapter Later

In [None]:
# To load the adapter later:
from peft import PeftModel

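# Example code (uncomment to use). Load the base model the same way as in
# section 2 (add quantization_config=bnb_config on CUDA), then attach the adapter:
# base_model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
# model_with_adapter = PeftModel.from_pretrained(base_model, adapter_path)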

print("See code above for loading saved adapters")

## 9. Convert to Ollama (Optional)

After fine-tuning, you can merge the adapter and convert to Ollama format.

In [None]:
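# Merge adapter with base model.
# Merging into 4-bit weights is lossy and a poor starting point for GGUF conversion,
# so on CUDA reload the base model in fp16 and attach the saved adapter first.
from peft import PeftModel

if device == "cuda":
    base_model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
    )
    model = PeftModel.from_pretrained(base_model, adapter_path)
merged_model = model.merge_and_unload()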
merged_path = "./merged_model"
merged_model.save_pretrained(merged_path)
tokenizer.save_pretrained(merged_path)

print(f"Merged model saved to: {merged_path}")
print("\nTo convert to Ollama:")
print("1. Convert to GGUF format using llama.cpp")
print("2. Create a Modelfile")
print("3. Run: ollama create my-model -f Modelfile")
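
Step 2 can be sketched as follows. An Ollama Modelfile names the GGUF file with `FROM`, sets the prompt format with `TEMPLATE`, and sets generation options with `PARAMETER`. The snippet builds such a file as a string, reusing the prompt format from this lab. The `build_modelfile` helper and the GGUF filename are hypothetical; the GGUF file itself must come from the llama.cpp conversion in step 1.

```python
def build_modelfile(gguf_path, temperature=0.7):
    """Return Modelfile text that mirrors this lab's instruction/response format."""
    # {{ .Prompt }} is Ollama's template placeholder for the user's input
    template = "### Instruction:\n{{ .Prompt }}\n\n### Response:\n"
    return (
        f"FROM {gguf_path}\n"
        f'TEMPLATE """{template}"""\n'
        f"PARAMETER temperature {temperature}\n"
        'PARAMETER stop "### Instruction:"\n'
    )


modelfile_text = build_modelfile("./merged_model.gguf")  # hypothetical filename
print(modelfile_text)

# To write it to disk for `ollama create my-model -f Modelfile`:
# with open("Modelfile", "w", encoding="utf-8") as f:
#     f.write(modelfile_text)
```

The stop parameter keeps the served model from continuing into a new "### Instruction:" block after it has answered.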

## Summary

In this lab, you learned how to:
- Load models with 4-bit quantization
- Configure LoRA adapters
- Prepare instruction-response training data
- Fine-tune with Hugging Face Trainer
- Save and load LoRA adapters
- Merge adapters with base models

**Key takeaways:**
- QLoRA makes fine-tuning accessible on consumer hardware
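- Only a small fraction of the parameters is trained (under 1% with the settings used here; the exact figure is printed in section 3)
- Adapters are small (tens of MB here, versus several GB for the full model) and easy to share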
- You can have multiple adapters for different tasks

**Tips for better results:**
- Use more training data (100+ examples)
- Train for more epochs
- Experiment with learning rate and LoRA rank
- Use higher quality, diverse examples

**Next Lab:** Image and Multimodal