<a href="https://colab.research.google.com/github/sandeepdcoder/SandeepFDC/blob/main/Llama_4bit_after_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

================================================================================
📋 WORKSHOP: Emotion Classification Fine-Tuning with QLoRA
================================================================================

📋 PURPOSE:
This notebook fine-tunes Llama 3.2 3B on the emotion classification task using
QLoRA (Quantized Low-Rank Adaptation) for ultra-efficient training. After
training, we test on the SAME sentences from the baseline to measure improvement.

🎯 KEY CONCEPT:
We're training the model to classify emotions into 6 categories using only
1,000 examples. QLoRA combines 4-bit quantization + LoRA adapters to train
efficiently on consumer GPUs without modifying the entire 3B parameter model.

🎯 LEARNING OBJECTIVES:
- Understand QLoRA: 4-bit quantized base model + LoRA adapters
- Apply LoRA adapters while keeping base model frozen in 4-bit
- Fine-tune on emotion dataset with proper formatting
- Use SFTTrainer for supervised fine-tuning
- Compare before/after results on same test cases
- Save and load fine-tuned adapters

⚙️ REQUIREMENTS:
- Google Colab with GPU (T4 recommended, 15GB VRAM)
- ~15-20 minutes runtime (including training)
- Run baseline test first (llama4bit_pretraining.py) for comparison

🔬 WHAT THIS DEMONSTRATES:
- QLoRA training: 4-bit base model + rank 32 LoRA (only ~1.5% of params trained)
- Extreme memory efficiency: ~2GB of model weights (vs ~6GB for 16-bit LoRA)
- Dramatic improvement from baseline (poor) to fine-tuned (80-90%+ accuracy)
- Production-ready workflow: load 4bit → add LoRA → train → test → save

📚 REFERENCE:
QLoRA paper by Tim Dettmers et al. (2023): "QLoRA: Efficient Finetuning of
Quantized LLMs" - enables training 65B models on single 48GB GPU

================================================================================

In [None]:
#============================================================================
# 🔧 STEP 1: INSTALLATION
#============================================================================

print("="*80)
print("📦 Installing Unsloth and Dependencies for Fine-Tuning")
print("="*80)


import os
if "COLAB_" not in "".join(os.environ.keys()):
    # Local installation (simpler)
    !uv pip install unsloth
else:
    # Colab installation (optimized for Colab environment)
    !uv pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !uv pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !uv pip install --no-deps unsloth

# 💡 KEY LIBRARIES FOR FINE-TUNING:
# - peft: Parameter-Efficient Fine-Tuning (LoRA implementation)
# - trl: Transformer Reinforcement Learning (SFTTrainer for supervised training)
# - xformers: Memory-efficient attention operations
# - datasets: Hugging Face datasets library (loads emotion data)

print("✅ Installation complete!\n")

In [None]:
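#============================================================================
# 🔍 STEP 2: LOAD BASE MODEL
#============================================================================

print("="*80)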
print("🔍 Loading Base Model (Same as Baseline Test)")
print("="*80)

from unsloth import FastLanguageModel
import torch

# Model configuration (same as baseline for fair comparison)
max_seq_length = 2048  # Maximum context window
dtype = None           # Auto-detect (FP16 for T4, BF16 for Ampere+)
load_in_4bit = True    # 4-bit quantization to save memory

# Load the same model used in baseline testing
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# 💡 WHY SAME MODEL AS BASELINE?
# We want to measure the impact of fine-tuning alone
# By starting with the same model, we can directly compare:
# - Baseline (no training) vs Fine-tuned (after training)

print(f"✅ Base model loaded: {model.config.model_type}")
print(f"✅ Total parameters: ~3 Billion")
print(f"✅ Memory: ~1.5-2 GB (4-bit quantized)")

print("-"*80)
print("="*80 + "\n")
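
The weight-memory figure printed above can be sanity-checked with simple arithmetic. The sketch below assumes a round 3 billion parameters and counts weight storage only; the helper name `weight_memory_gb` is hypothetical. Real 4-bit checkpoints are somewhat larger, because quantization constants and non-quantized layers such as embeddings take extra space.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 3e9  # assumption: round figure for a "3B" model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16 bits per weight
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4 bits per weight
reduction = 1 - nf4_gb / fp16_gb            # fraction of memory saved

print(f"FP16 weights : {fp16_gb:.1f} GB")
print(f"4-bit weights: {nf4_gb:.1f} GB")
print(f"Reduction    : {reduction:.0%}")
```

Activations, gradients for the adapters, and optimizer state come on top of this during training, so the number is a lower bound rather than a peak-usage estimate.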

In [None]:
#============================================================================
# 🎯 STEP 3: APPLY QLoRA (4-BIT BASE + LoRA ADAPTERS)
#============================================================================

print("="*80)
print("🔧 Applying QLoRA: 4-bit Quantized Base + LoRA Adapters")
print("="*80)

model = FastLanguageModel.get_peft_model(
    model,
    r = 32,  # LoRA rank: Higher = more capacity, more memory (8/16/32/64)
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",    # Attention layers
        "gate_proj", "up_proj", "down_proj",       # MLP layers
    ],
    lora_alpha = 64,              # LoRA scaling factor (typically 2× rank)
    lora_dropout = 0,             # Dropout rate (0 is optimized for Unsloth)
    bias = "none",                # Bias training ("none" is optimized)
    use_gradient_checkpointing = "unsloth",  # Memory-efficient backprop
    random_state = 3407,          # Random seed for reproducibility
    use_rslora = False,           # Rank-Stabilized LoRA (not needed here)
    loftq_config = None,          # LoftQ quantization (not needed)
)

# 💡 WHAT IS QLoRA?
# QLoRA = Quantized LoRA, combining two techniques:
#
# 1. BASE MODEL: 4-bit NormalFloat (NF4) quantization
#    - Frozen at 4-bit precision (~1.5GB memory)
#    - Not trained, just used for forward pass
#    - 75% memory reduction vs FP16 (1.5GB vs 6GB)
#
# 2. LoRA ADAPTERS: Low-Rank Adaptation in FP16/BF16
#    - Small trainable matrices added alongside each targeted layer
#    - Original weight W (d_out × d_in, frozen in 4-bit)
#    - LoRA adds: ΔW = (alpha / r) × B × A, with B (d_out × r) and A (r × d_in) in 16-bit
#    - Effective weight: W' = W + ΔW
#    - Only A and B are trained (~1.5% of parameters at rank 32, ~100MB in 16-bit)
#
# QLoRA Architecture:
# ┌─────────────────────────────────────┐
# │ Base Model (Frozen)                 │
# │ - 4-bit NF4 quantization            │
# │ - ~1.5 GB memory                    │
# │ - Not updated during training       │
# └─────────────────────────────────────┘
#          ↓
# ┌─────────────────────────────────────┐
# │ LoRA Adapters (Trainable)           │
# │ - FP16/BF16 precision               │
# │ - ~100 MB memory                    │
# │ - Updated during training           │
# └─────────────────────────────────────┘
#          ↓
# Total weights: ~2GB (vs ~6GB for 16-bit LoRA, ~12GB+ for full FP16 fine-tuning); activations and optimizer state add more
#
# Benefits:
# - Extreme memory efficiency (train 3B on T4, 65B on A100)
# - Fast training (fewer parameters to update)
# - Quality on par with 16-bit LoRA on the benchmarks reported in the QLoRA paper
# - Easy to swap (keep base model, change adapters)
# - Typically close to full fine-tuning quality on narrow tasks like this one

# 💡 RANK EXPLAINED:
# Rank r = 32 is the inner dimension of each adapter pair: A is r × d_in, B is d_out × r
# Higher rank = more capacity to learn, but more memory
# - Rank 8: Fastest, lowest memory, good for simple tasks
# - Rank 16: Balanced (common choice)
# - Rank 32: Higher capacity, better for complex tasks (our choice)
# - Rank 64+: Most capacity, but more memory and more risk of overfitting small datasets

print("✅ QLoRA configuration:")
print(f"   Base model: 4-bit NF4 quantization (~1.5GB, frozen)")
print(f"   LoRA adapters: FP16/BF16 (~100MB at rank 32, trainable)")
print(f"   LoRA rank: 32 (higher capacity for better accuracy)")
print(f"   LoRA alpha: 64 (2× rank, standard scaling)")
print(f"   Target modules: 7 (Attention + MLP layers)")
print("\n📊 Trainable Parameters:")
model.print_trainable_parameters()
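
The trainable-parameter count can also be estimated by hand: each adapted linear layer gains `r × (d_in + d_out)` parameters. The sketch below assumes the Llama 3.2 3B layer dimensions (hidden size 3072, MLP size 8192, 28 layers, 8 key/value heads of size 128) and a base size of about 3.2 billion parameters. None of these values are stated in this notebook, so treat them as assumptions and check `model.config`. The output of `model.print_trainable_parameters()` remains the authoritative number.

```python
# Assumed Llama 3.2 3B dimensions (verify against model.config before relying on them)
HIDDEN = 3072
INTERMEDIATE = 8192
NUM_LAYERS = 28
KV_DIM = 8 * 128  # 8 key/value heads x head_dim 128 (grouped-query attention)

# (in_features, out_features) of every projection listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank: int) -> int:
    """A is (rank x d_in) and B is (d_out x rank), so each layer adds rank * (d_in + d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS

BASE_PARAMS = 3.2e9  # assumption: approximate size of the frozen base model

trainable = lora_param_count(32)
fraction = trainable / (BASE_PARAMS + trainable)
adapter_mb_16bit = trainable * 2 / 1e6  # 2 bytes per parameter in FP16/BF16

print(f"Trainable LoRA parameters: {trainable:,}")
print(f"Share of all parameters  : {fraction:.2%}")
print(f"Adapter size at 16-bit   : {adapter_mb_16bit:.0f} MB")
```

Halving the rank halves all three numbers, which is why rank is the main knob for trading adapter capacity against memory.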


**[Dataset: dair-ai/emotion (train split viewer)](https://huggingface.co/datasets/dair-ai/emotion/viewer/split/train?views%5B%5D=split_train)**

In [None]:
#============================================================================
# 📝 STEP 4: LOAD AND FORMAT DATASET
#============================================================================

print("="*80)
print("📊 Loading Emotion Dataset from Hugging Face")
print("="*80)

from datasets import load_dataset

# Define emotion labels (same as baseline)
EMOTION_LABELS = {
    0: "sadness",
    1: "joy",
    2: "love",
    3: "anger",
    4: "fear",
    5: "surprise"
}

def to_llama3_format(example):
    """
    Convert text-label pair to Llama 3.2 chat format.

    Uses the tokenizer's apply_chat_template() for proper formatting.
    This ensures the model sees data in the exact format it expects.
    """
    text = example['text']
    label = example['label']
    emotion_name = EMOTION_LABELS[label]

    # Create messages in chat format
    messages = [
        {"role": "system", "content": "Identify the emotion in the following sentence and provide the emotion label."},
        {"role": "user", "content": text},
        {"role": "assistant", "content": f"{label} ({emotion_name})"}  # Expected output
    ]

    # Use tokenizer's built-in chat template
    # add_generation_prompt=False because we include the assistant's response
    formatted_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False  # We have the full conversation
    )

    return {"text": formatted_text}

# Load dataset
dataset = load_dataset("dair-ai/emotion")

# Use first 1000 samples for faster training (workshop demo)
# For production, use full dataset: dataset['train']
train_dataset = dataset['train'].select(range(1000)).map(
    to_llama3_format,
    remove_columns=['text', 'label']
)

print(f"✅ Dataset: dair-ai/emotion")
print(f"✅ Training samples: {len(train_dataset):,}")
print(f"✅ Emotion classes: {len(EMOTION_LABELS)}")
print(f"\n📄 Sample formatted training example:")
print("-" * 80)
print(train_dataset[0]["text"][:300] + "...")
print("-" * 80 + "\n")

# 💡 WHY ONLY 1000 SAMPLES?
# For workshop/demo purposes:
# - Faster training (~10-15 min vs roughly 16× longer for the full 16k-sample dataset)
# - Still shows dramatic improvement over baseline
# - For production: use full 16k training samples
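
To make the chat format concrete, here is a hand-rolled approximation of the Llama 3 header layout. It is a simplified sketch, not the tokenizer's actual template: `render_llama3_chat` is a hypothetical helper, the example sentence is made up, and the real template may inject extra system text. Use `tokenizer.apply_chat_template()` for actual training and inference.

```python
SYSTEM_PROMPT = "Identify the emotion in the following sentence and provide the emotion label."

def render_llama3_chat(messages, add_generation_prompt=False):
    """Simplified rendering of Llama 3 style chat markup (illustration only)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model writes the answer next
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

# Training example: the full conversation, including the expected answer
train_text = render_llama3_chat([
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "i feel wonderful today"},  # made-up sentence
    {"role": "assistant", "content": "1 (joy)"},
])

# Inference prompt: same conversation without the answer, ending in an open assistant turn
inference_prompt = render_llama3_chat([
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "i feel wonderful today"},
], add_generation_prompt=True)

print(train_text)
print(inference_prompt)
```

The training text and the inference prompt share the same prefix, so after fine-tuning the model continues the open assistant turn with the label format it saw during training.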

In [None]:
#============================================================================
# 🏋️ STEP 5: CONFIGURE TRAINING
#============================================================================

print("="*80)
print("⚙️  Configuring Training Parameters")
print("="*80)

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",       # Field name in dataset
    max_seq_length = max_seq_length,   # Max tokens per example
    dataset_num_proc = 2,               # Parallel data loading
    packing = False,                    # Don't pack multiple examples together

    args = TrainingArguments(
        # Batch configuration
        per_device_train_batch_size = 2,   # Batch size per GPU
        gradient_accumulation_steps = 4,   # Effective batch = 2 × 4 = 8

        # Training duration
        num_train_epochs = 2,               # Number of passes through data
        # max_steps = 60,                   # Alternative: fixed number of steps

        # Learning rate
        learning_rate = 2e-4,               # How fast to learn
        warmup_steps = 5,                   # Gradual learning rate warmup
        lr_scheduler_type = "cosine",       # Learning rate decay schedule

        # Optimization
        optim = "adamw_8bit",               # Memory-efficient optimizer
        weight_decay = 0.01,                # Regularization strength

        # Precision (auto-detect based on GPU)
        fp16 = not is_bfloat16_supported(), # Use FP16 on older GPUs (T4, V100)
        bf16 = is_bfloat16_supported(),     # Use BF16 on newer GPUs (A100, A6000)

        # Logging and saving
        logging_steps = 1,                  # Log every step
        output_dir = "outputs",             # Where to save checkpoints
        report_to = "none",                 # Disable W&B/TensorBoard

        # Reproducibility
        seed = 3407,
    ),
)

# 💡 KEY TRAINING PARAMETERS EXPLAINED:
#
# Effective Batch Size = 2 × 4 = 8:
#   - Real batch size: 2 (fits in memory)
#   - Gradient accumulation: 4 (accumulate gradients from 4 batches)
#   - Result: Same as training with batch size 8, but uses less memory
#
# Learning Rate = 2e-4:
#   - Standard for LoRA fine-tuning
#   - Higher than full fine-tuning (which typically uses around 1e-5)
#   - LoRA tolerates higher learning rates because only the small adapter matrices are updated
#
# Cosine Schedule:
#   - Learning rate starts at 2e-4
#   - Gradually decreases following cosine curve
#   - Helps model converge smoothly
#
# 2 Epochs:
#   - Model sees each of 1000 examples twice
#   - Total steps: ~250 (1000 / 8 batch size × 2 epochs)
#   - Training time: ~10-15 minutes on T4

print(f"✅ Effective batch size: {2 * 4}")
print(f"✅ Training epochs: 2")
print(f"✅ Learning rate: 2e-4 (with cosine decay)")
print(f"✅ Optimizer: AdamW 8-bit (memory efficient)")
print(f"✅ Expected training time: ~10-15 min on T4 GPU")
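
The step count and learning-rate curve described above can be reproduced with a few lines of arithmetic. This is a minimal sketch of linear warmup followed by cosine decay using the values from the training configuration. The function name `lr_at` is hypothetical, and the trainer's own step accounting may differ slightly.

```python
import math

NUM_SAMPLES = 1000
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 4
EPOCHS = 2
BASE_LR = 2e-4
WARMUP_STEPS = 5

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
steps_per_epoch = math.ceil(NUM_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS

def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"Effective batch size: {effective_batch}")
print(f"Steps per epoch     : {steps_per_epoch}")
print(f"Total steps         : {total_steps}")
for step in (0, WARMUP_STEPS, total_steps // 2, total_steps):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

The rate climbs to its peak of 2e-4 within the first 5 steps and then falls smoothly to zero by the last step, so most of the large updates happen early in the first epoch.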

In [None]:
#============================================================================
# 🚀 STEP 6: TRAIN THE MODEL
#============================================================================

print("="*80)
print("🚀 Starting Fine-Tuning Training...")
print("="*80)
print("Training on 1,000 emotion examples")
print("Watch the loss decrease - this shows the model is learning!\n")

# Train!
trainer_stats = trainer.train()

print("\n" + "="*80)
print("✅ Training Complete!")
print("="*80)
print(f"📊 Final training loss: {trainer_stats.training_loss:.4f}")
print(f"⏱️  Training time: {trainer_stats.metrics.get('train_runtime', 0):.1f} seconds")
print("="*80 + "\n")

# 💡 WHAT HAPPENED DURING TRAINING?
# 1. Model processed 1000 emotion examples, 2 times (2 epochs)
# 2. Learned to map text → emotion labels
# 3. Learned the output format: "0 (sadness)", "1 (joy)", etc.
# 4. Only LoRA adapters were trained (~1.5% of parameters, ~100MB)
# 5. Base model weights remain frozen in 4-bit (QLoRA technique)
# 6. Model weights occupy ~2GB (vs ~6GB for 16-bit LoRA); training adds activations and optimizer state


In [None]:
#============================================================================
# 🧪 STEP 7: TEST THE FINE-TUNED MODEL
#============================================================================

print("="*80)
print("🧪 Testing Fine-Tuned Model")
print("="*80)
print("Using SAME test sentences from baseline for fair comparison\n")

# Enable inference mode (faster, no gradient calculation)
FastLanguageModel.for_inference(model)

def predict_emotion(text):
    """
    Predict emotion using the fine-tuned model.

    Same function as baseline test, but now using trained model.
    """
    messages = [
        {"role": "system", "content": "Identify the emotion in the following sentence and provide the emotion label."},
        {"role": "user", "content": text}
    ]

    # Format with chat template
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True  # Append the assistant header so the model answers next
    )

    # Tokenize and generate
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=False,  # Greedy decoding: same input → same output
    )

    # Decode only the newly generated tokens (everything after the prompt)
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

    return response

# SAME test sentences as baseline (for comparison)
test_sentences = [
    "i didnt feel humiliated",
    "im grabbing a minute to post i feel greedy wrong",
    "i am ever feeling nostalgic about the fireplace i will know that it is still on the property",
    "i am feeling grouchy",
    "ive been taking or milligrams or times recommended amount and ive fallen asleep a lot faster but i also feel like so funny",
    "i feel as confused about life as a teenager or as jaded as a year old man",
    "i need you i need someone i need to be protected and feel safe i am small now i find myself in a season of no words"
]

print("Running predictions on test sentences...")
print("="*80 + "\n")

In [None]:
# Test and display results
results = []
for i, sentence in enumerate(test_sentences, 1):
    prediction = predict_emotion(sentence)
    results.append({
        "input": sentence,
        "output": prediction
    })
    print(f"[{i}/{len(test_sentences)}] {sentence[:60]}...")
    print(f"→ {prediction}\n")

print("="*80)
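
The loop above only prints raw predictions. To put a number on accuracy, the responses need to be parsed back into label ids and compared with gold labels. The sketch below shows one way to do that. The helper names `parse_prediction` and `accuracy` are hypothetical, and the example responses and gold labels are invented for illustration, not results from this notebook.

```python
import re

EMOTION_LABELS = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}
NAME_TO_ID = {name: idx for idx, name in EMOTION_LABELS.items()}

def parse_prediction(response: str):
    """Map a model response such as '0 (sadness)' to a label id, or None if unparseable."""
    match = re.match(r"\s*([0-5])\b", response)
    if match:
        return int(match.group(1))
    # Fall back to looking for an emotion name anywhere in the response
    lowered = response.lower()
    for name, idx in NAME_TO_ID.items():
        if name in lowered:
            return idx
    return None

def accuracy(predictions, gold_ids):
    """Fraction of responses whose parsed label matches the gold label id."""
    if not gold_ids:
        return 0.0
    correct = sum(parse_prediction(p) == g for p, g in zip(predictions, gold_ids))
    return correct / len(gold_ids)

# Invented responses and gold labels, purely to exercise the helpers
example_outputs = ["0 (sadness)", "3 (anger)", "joy", "not sure"]
example_gold = [0, 3, 1, 4]
print(f"Accuracy: {accuracy(example_outputs, example_gold):.0%}")
```

Running the same scoring on the baseline and fine-tuned outputs, ideally on a held-out split rather than a handful of sentences, turns the before/after comparison into a measured result.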

In [None]:
print("\n" + "="*80)
print("📊 FINE-TUNING RESULTS SUMMARY")
print("="*80)

print("\n✅ EXPECTED IMPROVEMENTS FROM BASELINE:")
print("   1. OUTPUT FORMAT: Now consistently follows '0 (sadness)' format")
print("   2. ACCURACY: 80-90%+ correct emotion identification")
print("   3. CONSISTENCY: Same input → same output (reproducible)")

print("\n💡 WHAT THE MODEL LEARNED:")
print("   ✓ 6 emotion categories (sadness, joy, love, anger, fear, surprise)")
print("   ✓ Specific output format with number + name")
print("   ✓ Emotion patterns in text (keywords, context, sentiment)")
print("   ✓ Task-specific consistency")

print("\n📈 TRAINING STATISTICS:")
print(f"   Training samples: 1,000")
print(f"   Epochs: 2")
print(f"   Trainable parameters: ~1.5% of total (QLoRA)")
print(f"   Memory usage: ~2GB (4-bit base + LoRA adapters)")
print(f"   Training time: {trainer_stats.metrics.get('train_runtime', 0):.1f}s")
print(f"   Final loss: {trainer_stats.training_loss:.4f}")

print("\n🔍 COMPARE THESE RESULTS TO BASELINE:")
print("   Run llama4bit_pretraining.py to see the baseline (untrained)")
print("   You should see dramatic improvement in:")
print("   - Format adherence (was messy → now clean)")
print("   - Emotion accuracy (was random → now 80-90%+)")
print("   - Consistency (was varied → now deterministic)")

print("\n" + "="*80)
print("✅ Fine-tuning demonstration complete!")
print("="*80 + "\n")
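
Saving and reloading the adapters is the last step of the workflow. The sketch below assumes the standard `save_pretrained` methods on the PEFT model and tokenizer, and that `FastLanguageModel.from_pretrained` accepts a saved adapter directory as `model_name`. The directory name `lora_model` and both helper names are hypothetical.

```python
def save_adapters(model, tokenizer, out_dir="lora_model"):
    """Write the LoRA adapter weights and tokenizer files to out_dir.

    Only the adapters are saved, not the 4-bit base model.
    """
    model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)
    return out_dir

def load_adapters(out_dir="lora_model", max_seq_length=2048):
    """Reload the base model with the saved adapters attached (requires a GPU and unsloth)."""
    from unsloth import FastLanguageModel  # imported lazily so this cell runs without unsloth

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=out_dir,   # assumption: adapter directory written by save_adapters
        max_seq_length=max_seq_length,
        dtype=None,
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)
    return model, tokenizer

# Usage in this notebook (after training):
#   save_adapters(model, tokenizer)
#   model, tokenizer = load_adapters()
```

Because only the adapters are written to disk, several task-specific adapters can share one copy of the base model.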



"""
================================================================================
🎯 WORKSHOP FACILITATOR NOTES
================================================================================

1. BEFORE AND AFTER STORY:
   - Show baseline results first (poor, inconsistent, wrong format)
   - Run this training script (takes ~10-15 min)
   - Show dramatic improvement on SAME test cases
   - This visceral before/after is the key teaching moment

2. QLoRA EFFICIENCY:
   - QLoRA = 4-bit quantized base model + LoRA adapters
   - Only ~1.5% of parameters trained (LoRA adapters)
   - Model weights: ~2GB (vs ~6GB for 16-bit LoRA, ~12GB+ for full FP16)
   - Adapters are small (~100 MB vs 6GB full model)
   - Training is fast (10-15 min vs hours for full fine-tuning)
   - Quality is typically close to full fine-tuning (QLoRA paper reports parity with 16-bit LoRA)
   - Can swap adapters: same base model, different tasks
   - Breakthrough: Tim Dettmers' QLoRA paper (2023) enabled training 65B on single GPU

3. KEY HYPERPARAMETERS:
   - Rank 32: Higher than default (8/16) for better accuracy
   - Learning rate 2e-4: Standard for LoRA (higher than full fine-tuning)
   - Cosine schedule: Smooth learning rate decay
   - 2 epochs: Enough for 1000 samples (more epochs on larger datasets)

4. COMMON ISSUES:
   - If loss doesn't decrease: Check data formatting
   - If outputs still wrong: May need more epochs or data
   - If CUDA OOM: Reduce batch size or sequence length
   - If slow: Check GPU is being used (expect roughly 5-8 min/epoch on a T4)

5. PRODUCTION CONSIDERATIONS:
   - Use full dataset (16k samples) not just 1000
   - Add validation split to monitor overfitting
   - Increase epochs to 3-5 for full dataset
   - Save checkpoints periodically
   - Test on held-out test set for final evaluation

6. REAL-WORLD APPLICATIONS:
   - Customer support: Classify ticket categories, urgency
   - Content moderation: Detect toxic, spam, inappropriate
   - Healthcare: Classify symptoms, triage severity
   - Education: Grade sentiment in student feedback
   - Any classification task with 100-10000 examples

7. COST COMPARISON:
   - Fine-tuning cost: $0.50-2 on Colab Pro (includes GPU time)
   - API cost (no fine-tuning): $0.01 per 1k tokens × volume
   - Crossover: Rough rule of thumb is >50k-200k queries; the exact point depends on tokens per query and hosting costs
   - Plus benefits: Privacy, control, customization

================================================================================
"""