
================================================================================
📋 Emotion Classification Baseline Test (BEFORE Fine-Tuning)
================================================================================

📋 PURPOSE:
This notebook tests Llama 3.2 3B's ability to classify emotions BEFORE any
fine-tuning. This establishes a baseline to measure improvement after training.

🎯 KEY CONCEPT:
We're testing what the model ALREADY KNOWS without any emotion training data.
This shows the "before" picture - after fine-tuning, we'll compare the "after".

🎯 LEARNING OBJECTIVES:
- Load larger models (3B parameters) with 4-bit quantization
- Understand chat templates for proper prompt formatting
- Test model capabilities on classification tasks
- Establish baseline performance metrics
- See why fine-tuning is needed for specialized tasks

⚙️ REQUIREMENTS:
- Google Colab with GPU (T4 recommended; about 15 GB usable VRAM)
- ~5-10 minutes runtime

🔬 WHAT THIS IS:
- Tests the pre-trained model, before any fine-tuning, on emotion classification
- Establishes "before" metrics for comparison

================================================================================

In [1]:
#============================================================================
# 🔧 STEP 1: INSTALLATION
#============================================================================
# Install necessary libraries for running the language model

# Optional: Install 'uv' first if you want faster package installation
# Uncomment the line below if uv is not already installed
# !pip install uv

# 💡 WHAT IS UV?
# 'uv' is a Rust-based Python package installer that's 10-100x faster than pip
# It parallelizes downloads and has better dependency resolution
# Think of it as "pip on steroids" - same commands, much faster execution

print("="*80)
print("📦 Installing Unsloth and Dependencies")
print("="*80)

# Install Unsloth - The core library for efficient LLM inference
!uv pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# 💡 WHY UNSLOTH?
# - Up to ~2x faster than standard Hugging Face code, per Unsloth's own banner
# - Substantially lower memory usage (exact savings vary by model and setup)
# - Optimized GPU kernels for attention and other operations
# - Automatic mixed precision handling
# - Works seamlessly with Hugging Face ecosystem

# Install Transformers - Hugging Face's core library for language models
# Quote the requirement: an unquoted ">" is treated by the shell as output redirection
!uv pip install --no-deps "transformers>=4.39.0"

# 💡 WHY --no-deps?
# Prevents conflicting dependency versions since Unsloth already installed compatible ones
# This avoids the common "dependency hell" problem in ML projects

# Install supporting libraries
!uv pip install accelerate bitsandbytes

# 💡 LIBRARY BREAKDOWN (minimal for baseline testing):
# - accelerate: Mixed precision, device management
# - bitsandbytes: 4-bit/8-bit quantization for memory efficiency

print("✅ Installation complete!\n")

📦 Installing Unsloth and Dependencies
Using Python 3.12.12 environment at: /usr
Resolved 86 packages in 5.20s
Prepared 12 packages in 2.31s
Uninstalled 4 packages in 398ms
Installed 12 packages in 69ms
 + bitsandbytes==0.48.1
 + cut-cross-entropy==25.1.1
 - datasets==4.0.0
 + datasets==4.3.0
 + msgspec==0.19.0
 - pyarrow==18.1.0
 + pyarrow==22.0.0
 + shtab==1.7.2
 - torchao==0.10.0
 + torchao==0.14.1
 - transformers==4.57.1
 + transformers==4.56.2
 + trl==0.23.0
 + tyro==0.9.35
 + unsloth==202

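The install log above shows that package versions can shift during installation (for example, transformers moving from 4.57.1 to 4.56.2). A minimal sketch for checking what actually ended up installed, using only the standard library; the helper name `installed_versions` is made up for this illustration:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# The four packages this notebook installs explicitly
report = installed_versions(["unsloth", "transformers", "accelerate", "bitsandbytes"])
for name, ver in report.items():
    print(f"{name:<14} {ver or 'not installed'}")
```

Running this right after the install cell gives a record of the environment the baseline was measured in, which helps when comparing against the fine-tuned run later.
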
In [2]:
#============================================================================
# 📦 STEP 2: IMPORT LIBRARIES & MEMORY CLEANUP
#============================================================================

import gc            # Garbage collector - Python's automatic memory management
import torch         # PyTorch - The deep learning framework powering everything
import warnings      # For suppressing non-critical warning messages

# 💡 WHAT IS PyTorch (torch)?
# PyTorch is the foundation for modern deep learning:
# - Tensor operations (like NumPy but on GPU)
# - Automatic differentiation (calculates gradients for training)
# - Neural network building blocks
# - GPU acceleration (often orders of magnitude faster than CPU for large matrix math)
# Most LLM libraries (Unsloth, Transformers, etc.) are built on PyTorch

# Clean up before starting to ensure maximum available memory
warnings.filterwarnings("ignore")  # Hide ALL Python warnings for cleaner output
gc.collect()                       # Free unreachable Python objects first...
torch.cuda.empty_cache()           # ...then release PyTorch's cached GPU memory

# 💡 WHY MEMORY CLEANUP MATTERS:
# Colab notebooks can accumulate memory from previous runs
# Clearing cache prevents "CUDA out of memory" errors
# This is especially important when loading large models (3B parameters!)

print("✅ Memory cleaned and libraries imported\n")


✅ Memory cleaned and libraries imported



In [3]:
#============================================================================
# 🤖 STEP 3: LOAD PRE-TRAINED MODEL WITH 4-BIT QUANTIZATION
#============================================================================

from unsloth import FastLanguageModel  # Unsloth's optimized model loader

# Model configuration constants
MODEL_NAME = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit"  # Pre-quantized instruct model
MAX_SEQ_LENGTH = 512    # Maximum tokens the model can process at once
DTYPE = None            # Auto-detect best precision (FP16 for T4, BF16 for A100)
LOAD_IN_4BIT = True     # Use 4-bit quantization to save memory

# 💡 MODEL CHOICE EXPLAINED:
# "unsloth/Llama-3.2-3B-Instruct-bnb-4bit"
# - Llama 3.2: A 2024 release in Meta's open-weight Llama family
# - 3B: 3 billion parameters (good balance of quality and speed)
# - Instruct: Fine-tuned to follow instructions (a better fit for prompting than base models)
# - bnb: bitsandbytes quantization (memory-efficient)
# - 4bit: Uses 4-bit precision (about 75% less weight memory than FP16)

# 💡 PARAMETER COUNT COMPARISON:
# TinyLlama: 1.1B parameters  → Good for demos, limited capability
# Llama 3.2 3B: 3B parameters → This file, good quality/speed balance
# Llama 3 8B: 8B parameters   → Higher quality, needs more VRAM
# Llama 3 70B: 70B parameters → Best quality, requires multi-GPU

# 💡 MAX_SEQ_LENGTH = 512:
# For emotion classification, sentences are short (typically 10-50 tokens)
# 512 tokens is more than enough (vs 2048 for longer documents)
# Using smaller context window saves memory

print("="*80)
print(f"🔍 Loading Model: {MODEL_NAME}")
print("="*80)

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,         # Which model to download/load
    max_seq_length=MAX_SEQ_LENGTH, # Context window size (512 tokens ≈ 400 words)
    dtype=DTYPE,                   # None = auto-detect (FP16 on T4, BF16 on A100)
    load_in_4bit=LOAD_IN_4BIT     # Enable 4-bit quantization
)

# 💡 WHAT IS QUANTIZATION?
# Normal models store weights in 16-bit floating point (FP16):
#   3B parameters × 2 bytes = 6 GB memory
#
# 4-bit quantization stores weights in 4 bits:
#   3B parameters × 0.5 bytes = 1.5 GB memory
#
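# Result: about 75% less weight memory than FP16, usually at a small cost in quality.
# The real checkpoint is somewhat larger (the download below is 2.24 GB) because
# some layers stay in 16-bit and quantization adds bookkeeping constants.
# This is why we can run 3B models on free Colab T4 GPU (15GB VRAM)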

print(f"✅ Model loaded: {MODEL_NAME}")
print(f"✅ Parameters: ~3 Billion")
print(f"✅ Quantization: 4-bit (saves ~75% memory)")
print(f"✅ Memory usage: ~1.5-2 GB (vs ~6 GB for FP16)")
print(f"✅ Context window: {MAX_SEQ_LENGTH} tokens (~400 words)")
print(f"✅ Status: Ready for baseline testing (NO fine-tuning applied)")
print("="*80 + "\n")


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




🦥 Unsloth Zoo will now patch everything to make training faster!
🔍 Loading Model: unsloth/Llama-3.2-3B-Instruct-bnb-4bit
==((====))==  Unsloth 2025.10.9: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

✅ Model loaded: unsloth/Llama-3.2-3B-Instruct-bnb-4bit
✅ Parameters: ~3 Billion
✅ Quantization: 4-bit (saves ~75% memory)
✅ Memory usage: ~1.5-2 GB (vs ~6 GB for FP16)
✅ Context window: 512 tokens (~400 words)
✅ Status: Ready for baseline testing (NO fine-tuning applied)
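
The memory figures quoted in the quantization comments follow from simple arithmetic: parameter count times bits per weight. A small sketch of that calculation, assuming the rounded 3-billion parameter count used in this notebook; the helper `weight_memory_gb` is hypothetical and counts weights only:

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 3e9  # rounded parameter count, as in the comments above

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 2 bytes per weight
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 0.5 bytes per weight
saving = 1 - int4_gb / fp16_gb

print(f"FP16 weights : {fp16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
print(f"Reduction    : {saving:.0%}")
```

This is a lower bound, not a measurement. The download bar above reports 2.24 GB for `model.safetensors`, and running the model also needs memory for activations and the attention cache.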



In [4]:
#============================================================================
# 🔤 STEP 4: CONFIGURE THE TOKENIZER
#============================================================================
# The tokenizer converts text into numbers (tokens) that the model understands

# Set padding token to be the same as end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

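# Set padding to happen on the right side of the text
# (has no effect here because we send one sentence at a time; batched generation
# with decoder-only models usually needs padding_side = "left")
tokenizer.padding_side = "right"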

# 💡 WHAT IS A TOKENIZER?
# Language models don't understand text - they understand numbers!
# The tokenizer breaks text into pieces (tokens) and converts them to IDs:
#
# Example: "I'm feeling happy today!"
#   → Tokens: ["I", "'m", " feeling", " happy", " today", "!"]
#   → Token IDs: [40, 2846, 8430, 6380, 3432, 0]
#   → Model processes these numbers
#   → Output numbers converted back to text
#
# Different models use different tokenizers:
# - GPT models use Byte-Pair Encoding (BPE)
# - Llama 1 and 2 use SentencePiece BPE (32k vocab)
# - Llama 3.x uses a tiktoken-style BPE (~128k vocab, see the size printed below)
# - Typical vocab size: 32k-128k tokens

# 💡 WHY PAD_TOKEN = EOS_TOKEN?
# When processing batches of different length sentences:
# "I'm happy" (3 tokens) and "I'm feeling very happy today" (6 tokens)
# need to be padded to same length for GPU efficiency:
# "I'm happy [PAD] [PAD] [PAD]"
# Reusing EOS (end-of-sequence) as the pad token gives padding a valid token ID;
# the attention mask is what actually tells the model to ignore those positions

print("✅ Tokenizer configured")
print(f"   Vocabulary size: {len(tokenizer):,} tokens")
print(f"   Padding token: {tokenizer.pad_token}")
print(f"   Padding side: {tokenizer.padding_side}")
print(f"   EOS token: {tokenizer.eos_token}\n")


✅ Tokenizer configured
   Vocabulary size: 128,256 tokens
   Padding token: <|eot_id|>
   Padding side: right
   EOS token: <|eot_id|>
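
To make the padding idea concrete, here is a minimal pure-Python sketch of what a tokenizer does when it pads a batch: shorter sequences are filled with the pad ID, and an attention mask marks which positions are real. The function `pad_batch`, the pad ID of 0, and the token IDs are all made up for illustration; this is not the tokenizer's actual implementation.

```python
def pad_batch(sequences, pad_id, side="right"):
    """Pad token-ID lists to equal length and build the matching attention mask."""
    if side not in ("right", "left"):
        raise ValueError("side must be 'right' or 'left'")
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads = [pad_id] * n_pad
        if side == "right":
            input_ids.append(list(seq) + pads)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
        else:
            input_ids.append(pads + list(seq))
            attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


PAD_ID = 0                            # hypothetical pad token ID
shorter = [11, 12, 13]                # stands in for a 3-token sentence
longer = [21, 22, 23, 24, 25, 26]     # stands in for a 6-token sentence

ids, mask = pad_batch([shorter, longer], PAD_ID)
print(ids[0])   # [11, 12, 13, 0, 0, 0]
print(mask[0])  # [1, 1, 1, 0, 0, 0]
```

The zeros in the mask, not the choice of pad token, are what keep the padded positions from influencing the result.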



In [5]:
#============================================================================
# 🔮 STEP 5: CREATE PREDICTION FUNCTION (BASELINE TESTING)
#============================================================================

def predict_emotion(text):
    """
    Predict emotion from text using the model's chat template.

    This tests the model's baseline performance with NO emotion-specific fine-tuning.
    We expect poor/inconsistent results - that's normal and expected!

    Args:
        text (str): The sentence to classify (e.g., "I'm feeling happy")

    Returns:
        str: Model's predicted emotion response

    Process:
        1. Format input with system + user messages
        2. Apply chat template (converts to model's expected format)
        3. Tokenize (text → numbers)
        4. Generate (model prediction)
        5. Decode (numbers → text)
        6. Extract assistant's response
    """
    # STEP 1: Create message structure (chat format)
    messages = [
        {
            "role": "system",
            "content": "Identify the emotion in the following sentence and provide the emotion label."
        },
        {
            "role": "user",
            "content": text
        }
    ]

    # 💡 MESSAGES FORMAT:
    # This is a list of dicts, similar to OpenAI's API format
    # - system: Instructions for the AI
    # - user: The actual input to process

    # STEP 2: Apply chat template (converts messages to model-specific format)
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,           # Return string, not token IDs yet
        add_generation_prompt=True  # Append the assistant header so the model writes the reply
    )

    # 💡 WHAT IS apply_chat_template()?
    # Different models expect different formats:
    # - Llama 3.x wraps each turn as <|start_header_id|>role<|end_header_id|> ... <|eot_id|>
    # - Zephyr-style models use <|system|>, <|user|>, <|assistant|>
    # - Llama 2 and Mistral use [INST] and [/INST]
    # apply_chat_template() handles this automatically!

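    # STEP 3: Tokenize (convert text to numbers) and move to GPU
    # add_special_tokens=False: the chat template already inserted <|begin_of_text|>,
    # so the tokenizer must not prepend a second one
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")

    # 💡 .to("cuda"):
    # Moves data from CPU memory to GPU memory
    # Model is on GPU, so inputs must be too
    # GPUs run the model's large matrix operations far faster than a CPU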

    # STEP 4: Generate prediction
    outputs = model.generate(
        **inputs,                 # Unpack input token IDs
        max_new_tokens=20,       # Maximum length of response (20 tokens ≈ 15 words)
        temperature=0.1,         # Low randomness for consistent classification
        do_sample=True           # Use sampling (vs greedy always-pick-top-word)
    )

    # 💡 GENERATION PARAMETERS EXPLAINED:
    #
    # max_new_tokens=20:
    #   - Limits response length to prevent endless generation
    #   - For emotion classification, we only need "0 (sadness)" ≈ 3-5 tokens
    #   - 20 is safe upper limit
    #
    # temperature=0.1:
    #   - Rescales the token probabilities before sampling
    #   - Near 0 = Almost always picks the most likely token (near-deterministic)
    #   - 0.7 = Balanced randomness (good for conversation)
    #   - 1.0+ = More creative/random (good for stories)
    #   - For classification, we want CONSISTENT answers → use low temp
    #   - Transformers rejects temperature=0 when sampling; for fully
    #     deterministic output use do_sample=False instead
    #
    # do_sample=True:
    #   - True: Sample from the temperature-scaled distribution
    #   - False: Always pick the most likely token (greedy decoding, temperature ignored)
    #   - With sampling on, repeated runs can differ slightly; use do_sample=False
    #     or a fixed seed if the baseline must be exactly reproducible

    # STEP 5: Decode (convert numbers back to text)
    # generate() returns the prompt tokens followed by the new tokens,
    # so slice off the prompt before decoding
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

    # 💡 skip_special_tokens=True:
    # Removes special tokens such as <|begin_of_text|> and <|eot_id|>
    # Makes output cleaner and human-readable

    # STEP 6: Extract just the assistant's response
    # Only the newly generated tokens were decoded, so trimming whitespace is enough
    response = response.strip()

    # 💡 WHY SLICE INSTEAD OF SPLITTING ON "assistant"?
    # Splitting the decoded text on the word "assistant" breaks whenever the
    # input sentence or the reply itself contains that word
    # Slicing by prompt length returns exactly what the model generated

    return response
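
To show roughly what the chat template step produces, here is a simplified pure-Python sketch of the Llama 3 turn format. It is an approximation for illustration only: the function `render_llama3_prompt` is hypothetical, and the template shipped with the model may add extra system text, so `tokenizer.apply_chat_template()` remains the source of truth.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3 chat layout for a list of role/content dicts."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: a role header, a blank line, the content, then an end-of-turn token
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content'].strip()}<|eot_id|>")
    if add_generation_prompt:
        # An open assistant header invites the model to write the next turn
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Identify the emotion in the following sentence and provide the emotion label."},
    {"role": "user", "content": "i am feeling grouchy"},
]

prompt = render_llama3_prompt(messages)
print(prompt)
```

Without `add_generation_prompt=True` the prompt would end after the user turn, and the model might continue the user's text instead of answering.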

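The effect of `temperature` described in the generation comments can be seen with a few lines of arithmetic. The sketch below divides made-up scores (logits) for three candidate tokens by the temperature and then applies softmax; the function name and the logit values are assumptions for illustration, not values taken from the model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be greater than 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for t in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At a temperature of 0.1 nearly all probability lands on the top-scoring token, which is why sampling at that setting behaves almost like greedy decoding.
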
In [6]:
#============================================================================
# 🧪 STEP 6: RUN BASELINE TESTS (BEFORE FINE-TUNING)
#============================================================================

print("="*80)
print("🧪 BASELINE TESTING: Emotion Classification (UNTRAINED MODEL)")
print("="*80)
print("\n⚠️  CRITICAL: This model has NOT been fine-tuned on emotion data!")
print("   We expect POOR, VAGUE, or INCORRECT results.")
print("   This is the 'BEFORE' picture - fine-tuning will be the 'AFTER'.\n")
print("="*80 + "\n")

# Test sentences covering different emotions from the emotion dataset
test_sentences = [
    "i didnt feel humiliated",
    "im grabbing a minute to post i feel greedy wrong",
    "i am ever feeling nostalgic about the fireplace i will know that it is still on the property",
    "i am feeling grouchy",
    "ive been taking or milligrams or times recommended amount and ive fallen asleep a lot faster but i also feel like so funny",
    "i feel as confused about life as a teenager or as jaded as a year old man"
]



print(f"Running predictions on {len(test_sentences)} test sentences...\n")
print("="*80 + "\n")

# Storage for results
results = []

# Test each sentence
for i, sentence in enumerate(test_sentences, 1):
    print(f"[Test {i}/{len(test_sentences)}]")
    print(f"📝 Input: {sentence}")
    print("-" * 80)

    # Generate prediction
    prediction = predict_emotion(sentence)

    print(f"🤖 Baseline Prediction: {prediction}")
    print("=" * 80 + "\n")

    # Store result
    results.append({
        "input": sentence,
        "baseline_output": prediction
    })
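
The predictions stored in `results` are free-form sentences, so turning them into a baseline metric needs a step that maps each reply to a dataset label. A minimal sketch of that step, assuming a six-class label set (sadness, joy, love, anger, fear, surprise); the label list and both helper functions are assumptions for illustration, and the third sample reply is invented:

```python
import re

# Assumed label set; adjust to match the dataset used for fine-tuning
EMOTION_LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def extract_label(response, labels=EMOTION_LABELS):
    """Return the first known label mentioned in a free-form reply, else None."""
    for word in re.findall(r"[a-z]+", response.lower()):
        if word in labels:
            return word
    return None


def parse_rate(responses, labels=EMOTION_LABELS):
    """Fraction of replies that mention any known label."""
    if not responses:
        return 0.0
    hits = sum(extract_label(r, labels) is not None for r in responses)
    return hits / len(responses)


sample_replies = [
    'The emotion in the sentence is "humiliation".',  # from the baseline output
    "The emotion in the sentence is guilt.",          # from the baseline output
    "I would label this as sadness.",                 # invented example
]

for reply in sample_replies:
    print(f"{extract_label(reply)!s:<8} <- {reply}")
print(f"Parse rate: {parse_rate(sample_replies):.2f}")
```

Replies such as "humiliation" or "guilt" fall outside the assumed label set and map to `None`, which gives a simple way to quantify how often the model answers outside the expected labels before fine-tuning.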

🧪 BASELINE TESTING: Emotion Classification (UNTRAINED MODEL)

⚠️  CRITICAL: This model has NOT been fine-tuned on emotion data!
   We expect POOR, VAGUE, or INCORRECT results.
   This is the 'BEFORE' picture - fine-tuning will be the 'AFTER'.


Running predictions on 6 test sentences...


[Test 1/6]
📝 Input: i didnt feel humiliated
--------------------------------------------------------------------------------
🤖 Baseline Prediction: The emotion in the sentence is "humiliation".

[Test 2/6]
📝 Input: im grabbing a minute to post i feel greedy wrong
--------------------------------------------------------------------------------
🤖 Baseline Prediction: The emotion in the sentence is guilt. 

The emotion label is: Guilt

[Test 3/6]
📝 Input: i am ever feeling nostalgic about the fireplace i will know that it is still on the property
--------------------------------------------------------------------------------
🤖 Baseline Prediction: The emotion in the sentence is: Nostalgia.

Emotion Labe