# Project 17: Comparative Analysis — Base vs Instruction-Tuned Mistral

## Goal
Rigorously compare base Mistral with the instruction-tuned version from Project 16. Measure instruction-following improvement and capability preservation, and identify when fine-tuning helps and when it hurts.

## Learning Objectives
- Load base + tuned models; generate on same prompts to compare qualitatively
- Define evaluation metrics: length, instruction keyword coverage, specificity, accuracy (when applicable)
- Analyze instruction-following improvements with concrete examples
- Identify failure modes: catastrophic forgetting, overfitting to training domain
- Measure general capability preservation (does tuning break general knowledge?)
- Propose best practices for fine-tuning LLMs

## Prerequisites
- Project 14 (Pretraining): Understand loss metrics and model evaluation
- Project 15 (Analysis): Know how to compare models systematically
- Project 16 (Mistral Fine-Tuning): Have a tuned checkpoint to compare against

## What You'll Build
- Comparative evaluation framework: load base + tuned, generate on benchmark
- Heuristic metrics: response length, keyword coverage, instruction adherence
- Qualitative analysis: side-by-side examples of base vs. tuned outputs
- Capability preservation checks: do general knowledge tasks work?
- Failure analysis: when does tuning hurt performance?
- Report: summary statistics, visualizations, recommendations

## Estimated Time
- Setup + loading: 15-30 min
- Evaluation on benchmark: 30-60 min
- Analysis + visualization: 30-45 min
- Deep dives (optional): 1-2 hours

## Usage Guide

This notebook:
1. Sets up model loading with graceful handling of missing adapters
2. Loads base Mistral and tuned version (if available)
3. Generates responses on a benchmark prompt set
4. Computes heuristic metrics (length, keyword coverage, etc.)
5. Creates comparison tables (plots are left to the extensions at the end)
6. Provides qualitative analysis framework
7. Saves results to CSV and JSON

Key functions:
- `safe_load(model_id)` → load a model and tokenizer, returning `(None, None)` on failure
- `try_load_tuned(base_model_id, adapter_path)` → load the base model and merge LoRA adapters when present
- `safe_generate(m, tok, prompt, **gen_kwargs)` → generate a response, or return a placeholder string on failure
- `keyword_coverage(text, prompt)` → fraction of prompt keywords that appear in the output

---

In [7]:
# Setup
import mlx.core as mx
from mlx_lm import load, generate
import pandas as pd
import matplotlib.pyplot as plt

print("Ready for comparative analysis!")

Ready for comparative analysis!


## Plan
We will:
1. Define configuration (model names, adapter paths, prompt set, quick mode toggle).
2. Provide helper loaders to gracefully handle missing tuned adapters.
3. Load base model; attempt to load tuned model + merge LoRA if present.
4. Generate responses for a benchmark prompt set.
5. Compute simple heuristic metrics (length, overlap, instruction keyword coverage).
6. Tabulate and save results (CSV + JSON metrics).
7. Summarize differences and propose further evaluation steps.

In [9]:
# 1) Configuration
from pathlib import Path
import json, time, math
import pandas as pd
from dataclasses import dataclass

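# Quick mode selects a small model name for smoke tests. Note that the loading cell
# below currently skips model loading entirely when quick_mode is True.
quick_mode = True  # set False to use the full Mistral model (may be multi-GB download)
small_model_name = 'mlx-community/Qwen2.5-0.5B-Instruct'  # fallback small model
# Note: this checkpoint is already instruction-tuned; Project 16 adapters are applied on top of it.
full_model_name = 'mistralai/Mistral-7B-Instruct-v0.2'     # preferred full model
model_name = small_model_name if quick_mode else full_model_name
max_new_tokens = 128
# temperature and top_p are recorded here but are not passed to generate() in this notebook
temperature = 0.7
top_p = 0.9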

# Locate repo root and artifacts
repo_root = None
cur = Path.cwd()
for parent in [cur] + list(cur.parents):
    if (parent / 'requirements.txt').exists() or (parent / '.git').exists():
        repo_root = parent
        break
if repo_root is None:
    repo_root = cur
project16_artifacts = repo_root / 'projects' / 'phase3_llm_tuning' / 'project16_mistral_tuning' / 'artifacts'
tuned_adapters_path = project16_artifacts / 'lora_adapters.safetensors'
project17_dir = repo_root / 'projects' / 'phase3_llm_tuning' / 'project17_comparative_analysis'
analysis_dir = project17_dir / 'analysis_artifacts'
analysis_dir.mkdir(parents=True, exist_ok=True)

# Prompt set for head-to-head comparison
prompts = [
    "Explain LoRA in one paragraph.",
    "Write a Python function to compute Fibonacci numbers.",
    "Give three bullet tips for learning ML.",
    "Summarize the benefits of parameter-efficient fine-tuning.",
    "What are common failure modes when instruction-tuning LLMs?",
]

print('Config ready:')
print('  quick_mode:', quick_mode)
print('  model_name:', model_name)
print('  tuned_adapters exist:', tuned_adapters_path.exists())
print('  analysis_dir:', analysis_dir)

Config ready:
  quick_mode: True
  model_name: mlx-community/Qwen2.5-0.5B-Instruct
  tuned_adapters exist: False
  analysis_dir: /Users/mark/git/learning-ml-to-llm/projects/phase3_llm_tuning/project17_comparative_analysis/analysis_artifacts


In [2]:
# 2) Helper functions for safe loading and generation
from typing import Tuple, Dict, Any
import mlx.core as mx
from mlx_lm import load, generate

def safe_load(model_id: str):
    """Load a model/tokenizer with graceful error handling."""
    try:
        m, tok = load(model_id)
        return m, tok
    except Exception as e:
        print(f'Failed to load {model_id}:', repr(e))
        return None, None

def try_load_tuned(base_model_id: str, adapter_path: Path):
    """Attempt to load base model then apply LoRA adapters if available."""
    m, tok = safe_load(base_model_id)
    if m is None:
        return None, None
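    # Attempt adapter merge if the file exists. NOTE: the `mlx_lm.peft` import below is an
    # assumption about the installed mlx_lm version; if the module is missing, the
    # ImportError is caught and the base model is returned unchanged.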
    if adapter_path.exists():
        try:
            from mlx_lm.peft import apply_lora, merge_lora_weights, load_lora_parameters, LoraConfig
            # We don't know original config used; we load saved weights directly
            load_lora_parameters(m, str(adapter_path))
            tuned_model = merge_lora_weights(m)
            print('Merged LoRA adapters into tuned model.')
            return tuned_model, tok
        except Exception as e:
            print('Could not load/merge LoRA adapters:', repr(e))
            print('Proceeding with base model as tuned fallback.')
            return m, tok
    else:
        print('Adapter file not found; using base model as tuned placeholder.')
        return m, tok

def safe_generate(m, tok, prompt: str, **gen_kwargs):
    if m is None or tok is None:
        return '<model unavailable>'
    try:
        return generate(m, tok, prompt=prompt, **gen_kwargs)
    except Exception as e:
        return f'<generation error: {repr(e)}>'

In [4]:
# 3) Load base and tuned models (may download on first run)
if quick_mode:
    base_model, base_tok = None, None  # Skip heavy downloads in quick_mode
    tuned_model, tuned_tok = None, None
    tuned_kind = 'quick_mode_skipped_load'
else:
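    base_model, base_tok = safe_load(model_name)
    tuned_model, tuned_tok = try_load_tuned(full_model_name, tuned_adapters_path)
    # try_load_tuned falls back to the plain base model when adapters are missing or
    # fail to merge, so a non-None model alone does not prove adapters were applied.
    if tuned_model is None:
        tuned_kind = 'load_failed'
    elif tuned_adapters_path.exists():
        tuned_kind = 'adapters_found'  # check the log above to confirm the merge succeeded
    else:
        tuned_kind = 'fallback_base'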

print('Models ready:')
print('  base:', 'ok' if base_model is not None else 'skipped')
print('  tuned:', 'ok' if tuned_model is not None else 'skipped', '| mode:', tuned_kind)

Models ready:
  base: skipped
  tuned: skipped | mode: quick_mode_skipped_load


In [5]:
# 4) Run generations and build comparison table
rows = []
for i, p in enumerate(prompts, start=1):
    b_out = safe_generate(base_model, base_tok, prompt=p, max_tokens=max_new_tokens, verbose=False)
    t_out = safe_generate(tuned_model, tuned_tok, prompt=p, max_tokens=max_new_tokens, verbose=False)
    rows.append({'id': i, 'prompt': p, 'base_output': b_out, 'tuned_output': t_out})
df = pd.DataFrame(rows)
csv_path = analysis_dir / 'base_vs_tuned_outputs.csv'
df.to_csv(csv_path, index=False)
print('Saved outputs to', csv_path)
df.head(2)

Saved outputs to /Users/mark/git/learning-ml-to-llm/projects/phase3_llm_tuning/project17_comparative_analysis/analysis_artifacts/base_vs_tuned_outputs.csv


|   | id | prompt | base_output | tuned_output |
|---|----|--------|-------------|--------------|
| 0 | 1 | Explain LoRA in one paragraph. | `<model unavailable>` | `<model unavailable>` |
| 1 | 2 | Write a Python function to compute Fibonacci n... | `<model unavailable>` | `<model unavailable>` |


In [6]:
# 5) Simple heuristic metrics
import re
def keyword_coverage(text: str, prompt: str):
    # Fraction (0.0-1.0) of unique prompt words longer than 3 letters that appear in the output
    pw = {w.lower() for w in re.findall(r'[A-Za-z]+', prompt) if len(w) > 3}
    if not pw:
        return 0.0
    ow = {w.lower() for w in re.findall(r'[A-Za-z]+', text)}
    return len(pw & ow) / len(pw)

metrics = []
for r in rows:
    base_len = len(r['base_output'].split()) if isinstance(r['base_output'], str) else 0
    tuned_len = len(r['tuned_output'].split()) if isinstance(r['tuned_output'], str) else 0
    base_cov = keyword_coverage(r['base_output'], r['prompt']) if isinstance(r['base_output'], str) else 0.0
    tuned_cov = keyword_coverage(r['tuned_output'], r['prompt']) if isinstance(r['tuned_output'], str) else 0.0
    metrics.append({
        'id': r['id'],
        'prompt': r['prompt'],
        'base_len': base_len,
        'tuned_len': tuned_len,
        'base_keyword_coverage': base_cov,
        'tuned_keyword_coverage': tuned_cov,
        'len_delta': tuned_len - base_len,
        'coverage_delta': tuned_cov - base_cov,
    })
metrics_df = pd.DataFrame(metrics)
metrics_path = analysis_dir / 'heuristic_metrics.csv'
metrics_df.to_csv(metrics_path, index=False)
summary = {
    'avg_len_delta': float(metrics_df['len_delta'].mean()),
    'avg_cov_delta': float(metrics_df['coverage_delta'].mean()),
    'prompts': len(metrics_df),
    'quick_mode': quick_mode,
    'tuned_kind': tuned_kind,
}
summary_path = analysis_dir / 'summary_metrics.json'
with open(summary_path, 'w') as f:
    json.dump(summary, f, indent=2)
print('Saved metrics to', metrics_path)
print('Saved summary to', summary_path)
metrics_df.head(2)

Saved metrics to /Users/mark/git/learning-ml-to-llm/projects/phase3_llm_tuning/project17_comparative_analysis/analysis_artifacts/heuristic_metrics.csv
Saved summary to /Users/mark/git/learning-ml-to-llm/projects/phase3_llm_tuning/project17_comparative_analysis/analysis_artifacts/summary_metrics.json


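|   | id | prompt | base_len | tuned_len | base_keyword_coverage | tuned_keyword_coverage | len_delta | coverage_delta |
|---|----|--------|----------|-----------|-----------------------|------------------------|-----------|----------------|
| 0 | 1 | Explain LoRA in one paragraph. | 2 | 2 | 0.0 | 0.0 | 0 | 0.0 |
| 1 | 2 | Write a Python function to compute Fibonacci n... | 2 | 2 | 0.0 | 0.0 | 0 | 0.0 |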
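
In the quick-mode output above, both lengths are 2 because the placeholder string `<model unavailable>` is itself split into two words. The following is a minimal sketch of one way to treat such placeholders as missing data instead. The helper names `is_placeholder` and `word_count_or_none` are hypothetical and not part of the notebook; the two sentinel prefixes are the ones returned by `safe_generate`.

```python
from typing import Optional

def is_placeholder(text: str) -> bool:
    """True for the sentinel strings that safe_generate returns on failure."""
    return text.startswith('<model unavailable>') or text.startswith('<generation error:')

def word_count_or_none(text: str) -> Optional[int]:
    """Whitespace word count, or None when the text is a failure sentinel."""
    if is_placeholder(text):
        return None
    return len(text.split())

print(word_count_or_none('<model unavailable>'))           # None
print(word_count_or_none('LoRA adds low-rank adapters.'))  # 4
```

Storing `None` rather than 2 keeps skipped generations out of the averages in `summary_metrics.json`, since pandas ignores missing values when computing a mean.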


# Exercises & Extensions

## Warm-up

1. **Baseline Metrics**: Generate responses from base model on 5 prompts. Compute length, keyword coverage, specificity. Are there patterns? (Do some prompts get longer responses?)
2. **Tuned vs. Base Side-by-Side**: Pick 3 prompts. Show base output next to tuned output. Qualitatively: is tuned more instruction-following? More specific?
3. **Metric Correlation**: Compute correlation between response length and keyword coverage. Are they independent metrics or correlated?
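
For the metric correlation exercise, a Pearson coefficient is one reasonable starting point. The sketch below uses only the standard library; `pearson_r` is a hypothetical helper, and the sample numbers are illustrative values, not measurements from this notebook.

```python
import math
from typing import Sequence

def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns NaN when either series has zero variance."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError('need two equal-length sequences with at least 2 points')
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float('nan')
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only: response lengths and keyword coverage for 5 prompts.
lengths = [40, 55, 60, 80, 120]
coverage = [0.2, 0.4, 0.4, 0.6, 0.8]
print(round(pearson_r(lengths, coverage), 3))
```

Note that in quick mode every length is identical, so the variance is zero and the function returns NaN. With only five prompts, any coefficient should be read as a hint rather than evidence.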

## Intermediate

4. **Out-of-Domain Generalization**: Test on prompts NOT in fine-tuning dataset (e.g., if tuned on Q&A, test on creative writing). Does tuning hurt general capabilities? Plot performance on in-domain vs. out-of-domain.
5. **Response Distribution**: Generate 10 responses per prompt using sampling (temperature > 0.7). Measure variance. Is tuned model more consistent? More diverse?
6. **Failure Mode Analysis**: Find prompts where base > tuned (tuned is worse). What domain are these? Do they indicate catastrophic forgetting?
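
For the response distribution exercise, two simple statistics are the spread of response lengths and the fraction of distinct n-grams across samples. The sketch below assumes the sampled responses for one prompt are already collected in a list; `distinct_n` and `length_spread` are hypothetical helper names.

```python
from statistics import mean, pstdev
from typing import Dict, List

def distinct_n(responses: List[str], n: int = 2) -> float:
    """Fraction of n-grams across all responses that are unique (higher = more diverse)."""
    ngrams = []
    for text in responses:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def length_spread(responses: List[str]) -> Dict[str, float]:
    """Mean and population standard deviation of word counts (needs at least one response)."""
    lengths = [len(r.split()) for r in responses]
    return {'mean_len': mean(lengths), 'std_len': pstdev(lengths)}

# Two identical samples share every bigram, so only half of the bigrams are unique.
print(distinct_n(['a b c', 'a b c'], n=2))   # 0.5
print(length_spread(['a b', 'a b c d']))
```

A tuned model that is more consistent would show a lower `std_len` and a lower `distinct_n` than the base model on the same prompt; whether that is desirable depends on the task.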

## Advanced

7. **Benchmark Evaluation**: Use a real benchmark (e.g., GSM8K for math, HumanEval for code). Score both models. Compute Δ performance. On what tasks does tuning help most?
8. **Scaling Comparison**: Create multiple tuned models with different LoRA ranks (8, 16, 32, 64). Compare all on the benchmark. Plot performance vs. compute cost.
9. **Human Preference Study**: Have 2-3 people rate base vs. tuned outputs on instruction-adherence (scale 1-5) on 20 prompts. Compute inter-rater agreement + average preference.
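
For the human preference study, Cohen's kappa is a common agreement statistic for two raters. The sketch below is a plain implementation under the assumption that ratings are treated as unordered categories; `cohens_kappa` is a hypothetical helper and the ratings are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Agreement between two raters, corrected for agreement expected by chance."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError('need two non-empty, equal-length rating lists')
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)

# Illustrative 1-5 ratings from two raters on six prompts.
print(round(cohens_kappa([5, 4, 4, 2, 1, 3], [5, 4, 3, 2, 1, 3]), 3))
```

Because the 1-5 scale is ordinal, a weighted kappa may be a better fit, and with three raters Fleiss' kappa is the usual generalization.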

---

# Summary & Bridge Forward

## What You Learned

- **Instruction-Following Improvement**: Fine-tuning typically makes models more helpful and specific to user instructions
- **Trade-offs**: Careless tuning (imbalanced data, overfitting to a narrow domain) can hurt general capabilities
- **Measurement Challenges**: Hard metrics (exact match, code execution) are more reliable than heuristics such as keyword coverage, which rewards echoing the prompt rather than answering it
- **Domain Specificity**: Models tuned on one domain often struggle on others
- **Capability Preservation**: Need evaluation on both tuned domain AND held-out general tasks
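
The hard metrics mentioned above can be sketched in a few lines. The helpers below are hypothetical (`normalize`, `exact_match`, `last_number`, `passes_fib_tests`), and the Fibonacci check assumes the convention `fib(0) == 0`, which the prompt in this notebook does not specify.

```python
import re
from typing import Dict, Optional

def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace before comparing."""
    return re.sub(r'\s+', ' ', text.strip().lower())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

def last_number(text: str) -> Optional[str]:
    """Last integer or decimal in the text, a common heuristic for math answers."""
    numbers = re.findall(r'-?\d+(?:\.\d+)?', text.replace(',', ''))
    return numbers[-1] if numbers else None

def passes_fib_tests(source: str, func_name: str = 'fib') -> bool:
    """Execute generated source and check a few values. Run untrusted code only in a sandbox."""
    namespace: Dict[str, object] = {}
    try:
        exec(source, namespace)
        fn = namespace[func_name]
        return [fn(i) for i in range(7)] == [0, 1, 1, 2, 3, 5, 8]
    except Exception:
        return False

good = "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a\n"
print(exact_match('  Paris ', 'paris'), last_number('so the total is 1,234 apples'), passes_fib_tests(good))
```

Unlike keyword coverage, these checks fail when the model merely repeats the prompt, which is why they are better suited to comparing base and tuned outputs.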

## Why This Matters

Fine-tuning evaluation determines **whether a model is ready for production**:

1. **Quality Assurance**:
   - Fine-tuning should improve target domain performance
   - Must NOT degrade general knowledge
   - Need benchmarks to catch regressions

2. **Business Metrics**:
   - Instruction-following → user satisfaction
   - General preservation → reliability
   - Inference speed → cost
   - These need to align with product requirements

3. **Science**:
   - When does tuning help? (structured data, ample examples)
   - When does it hurt? (small dataset, domain drift, overfitting)
   - How much tuning is enough? (learning curves)
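
A benchmark-based regression check can be as simple as comparing per-task scores. The sketch below is one possible gate, not a standard tool: `find_regressions` is a hypothetical helper, the scores are illustrative, and the 2% default threshold is an assumption taken from the rule of thumb in the Performance Notes.

```python
from typing import Dict, List

def find_regressions(base: Dict[str, float], tuned: Dict[str, float],
                     max_relative_drop: float = 0.02) -> List[str]:
    """Return tasks whose tuned score fell more than max_relative_drop below base."""
    flagged = []
    for task, base_score in base.items():
        tuned_score = tuned.get(task)
        if tuned_score is None or base_score <= 0:
            continue  # skip tasks that were not scored for the tuned model
        if (base_score - tuned_score) / base_score > max_relative_drop:
            flagged.append(task)
    return flagged

# Illustrative scores only: the tuned model improves on QA but drops on code.
base_scores = {'qa': 0.80, 'code': 0.50}
tuned_scores = {'qa': 0.85, 'code': 0.40}
print(find_regressions(base_scores, tuned_scores))  # ['code']
```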

## Bridge to Next Projects

This is the **end of the learning path**, but here's how to continue:

- **Production Deployment**:
  - Serve tuned model via API (LM Studio, vLLM, SGLang)
  - Monitor outputs for drift
  - Collect user feedback and retune periodically

- **Research Directions**:
  - Scaling laws: tune increasingly large models
  - Data efficiency: how little data is needed for good tuning?
  - Alignment: how to tune for safety + performance?
  - Multi-task: can a single model be tuned for multiple domains?

- **Advanced Techniques**:
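  - DPO (Direct Preference Optimization): optimize directly on human preference pairs, without training a separate reward model
  - RLHF (Reinforcement Learning from Human Feedback): train a reward model on human preferences, then optimize the model against it with reinforcement learning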
  - Mixture of Experts: specialize different components to different domains

## Your Takeaway

> **Fine-tuning is where research meets practice.** Pretraining learns general language; fine-tuning adapts to specific tasks. Good evaluation ensures fine-tuning improves targets without breaking generalization. This is how production LLM systems are built.

---

# Performance Notes

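These are rough rules of thumb rather than results measured in this notebook:

- **Instruction-Following**: Typically 20-50% improvement on the target domain with 1000+ instruction examples
- **General Capability**: A 1-2% drop on general benchmarks is usually acceptable; a drop above 10% suggests overfitting or catastrophic forgetting
- **Data Requirements**: 100-1000 examples for noticeable improvement; 10000+ for substantial change
- **Tuning Convergence**: 1-5 epochs is typical; more epochs increase the risk of overfitting
- **Inference Speed**: Once LoRA weights are merged, the tuned model has the same architecture and parameter count as the base, so inference speed is unchanged; unmerged adapters add a small extra cost per layer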
- **Cost/Benefit**: Fine-tuning cost (hours on 1 GPU) << Pretraining cost (weeks on clusters); clear ROI
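
The inference-speed point follows from how a LoRA update is folded into the frozen weight. The sketch below uses the scaling from the original LoRA formulation, `W_merged = W + (alpha / r) * B @ A`; libraries may expose the scale differently, and the matrices here are tiny illustrative values written with plain lists.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def scale(a: Matrix, s: float) -> Matrix:
    return [[s * x for x in row] for row in a]

# Frozen base weight W (2x3) and LoRA factors B (2x1), A (1x3), so rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
B = [[0.5], [-1.0]]
A = [[1.0, 2.0, 0.0]]
alpha, r = 2.0, 1
x = [[1.0], [2.0], [3.0]]  # input as a column vector
scaling = alpha / r

# Unmerged: the adapter path adds two small matmuls on top of W @ x.
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))
# Merged: fold the update into W once, then do a single matmul as in the base model.
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)
print(unmerged, merged)  # both are [[12.0], [-11.0]]
```

Both paths give the same output, but the merged path does exactly the work of the base layer, which is why a merged model runs at base speed.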

---

# Curriculum Summary: From Classical ML to LLMs

**Phase 1: Classical ML Foundations** (Projects 1-11.75)
- Linear regression → logistic regression → neural networks → RNNs
- Learn optimization, backpropagation, and sequence modeling
- Understand the vanishing gradient problem

**Phase 2: Transformers & Pretraining** (Projects 12-15)
- Attention mechanisms → embeddings → full transformer
- Pretraining on character-level text
- Analyze what models learn via pretraining

**Phase 3: LLM Fine-Tuning** (Projects 16-17)
- Fine-tune production models (Mistral 7B)
- Compare base vs. tuned systematically
- Understand trade-offs and deployment considerations

**Key Insights**:
1. Modern LLMs are scaled transformers + massive pretraining + careful fine-tuning
2. Attention replaced recurrence; parallelization unlocked scaling
3. Pretraining is expensive but essential; fine-tuning is cheap and practical
4. Evaluation is critical: good metrics catch regressions before production

**Next Steps**:
- Deploy a fine-tuned model as an API
- Evaluate on real-world benchmarks (MMLU, HumanEval, etc.)
- Experiment with advanced techniques (DPO, RLHF, MoE)
- Build applications on top of fine-tuned models

## Summary & Next Steps
- This notebook compares base vs tuned outputs on a small prompt set and records heuristic metrics (length and prompt keyword coverage).
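- In quick_mode, model loading is skipped entirely, so both output columns contain the `<model unavailable>` placeholder. This validates the pipeline plumbing without heavy downloads, not the models. Disable quick_mode to use the full model and load adapters.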
- If adapters are missing, generate them by running Project 16 with `DRY_RUN=False` and saving LoRA adapters.

Next improvements:
- Add a richer evaluation set and human preference judgments.
- Include task-specific checks (e.g., code execution tests for coding prompts).
- Plot distributions and deltas; compute statistical significance over larger samples.
- Optionally compute log-prob scores with the model for stronger comparisons.
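
For the statistical significance item, a percentile bootstrap over per-prompt deltas is one simple option. The sketch below assumes a list of tuned-minus-base differences such as the `coverage_delta` column; `paired_bootstrap_ci` is a hypothetical helper and the sample deltas are illustrative.

```python
import random
from typing import List, Tuple

def paired_bootstrap_ci(deltas: List[float], n_resamples: int = 2000,
                        confidence: float = 0.95, seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean delta, lower bound, upper bound) from a percentile bootstrap."""
    if not deltas:
        raise ValueError('need at least one delta')
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    n = len(deltas)
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((1 - confidence) / 2 * n_resamples)
    hi_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return sum(deltas) / n, means[lo_idx], means[hi_idx]

# Illustrative per-prompt deltas (tuned minus base), not measured values.
mean_delta, low, high = paired_bootstrap_ci([0.10, 0.20, -0.05, 0.30, 0.15])
print(f'mean={mean_delta:.3f}  95% CI=[{low:.3f}, {high:.3f}]')
```

If the interval excludes zero, the difference is unlikely to be resampling noise. With only five prompts the interval will be wide, which is the practical argument for a larger evaluation set.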