# Project 17: Comparative Analysis - Base vs Instruction-Tuned

## Goal
Systematically compare the Mistral base model against your tuned version from Project 16.

## Learning Objectives
- Instruction-following improvements
- General capability preservation
- Attention pattern changes
- Failure mode analysis

In [7]:
# Setup
import mlx.core as mx
from mlx_lm import load, generate
import pandas as pd
import matplotlib.pyplot as plt

print("Ready for comparative analysis!")

Ready for comparative analysis!


## Plan
We will:
1. Define configuration (model names, adapter paths, prompt set, quick mode toggle).
2. Provide helper loaders to gracefully handle missing tuned adapters.
3. Load base model; attempt to load tuned model + merge LoRA if present.
4. Generate responses for a benchmark prompt set.
5. Compute simple heuristic metrics (output length, prompt keyword coverage).
6. Tabulate and save results (CSV + JSON metrics).
7. Summarize differences and propose further evaluation steps.

In [9]:
# 1) Configuration
from pathlib import Path
import json
import pandas as pd

# Quick mode is a smoke test: step 3 skips model loading entirely, so nothing is downloaded
quick_mode = True  # set False to load the full Mistral model (may be a multi-GB download)
small_model_name = 'mlx-community/Qwen2.5-0.5B-Instruct'  # small fallback; selected in quick mode but not loaded
full_model_name = 'mistralai/Mistral-7B-Instruct-v0.2'     # preferred full model (note: an Instruct checkpoint, not a raw base model)
model_name = small_model_name if quick_mode else full_model_name
max_new_tokens = 128
temperature = 0.7
top_p = 0.9

# Locate repo root and artifacts
repo_root = None
cur = Path.cwd()
for parent in [cur] + list(cur.parents):
    if (parent / 'requirements.txt').exists() or (parent / '.git').exists():
        repo_root = parent
        break
if repo_root is None:
    repo_root = cur
project16_artifacts = repo_root / 'projects' / 'phase3_llm_tuning' / 'project16_mistral_tuning' / 'artifacts'
tuned_adapters_path = project16_artifacts / 'lora_adapters.safetensors'
project17_dir = repo_root / 'projects' / 'phase3_llm_tuning' / 'project17_comparative_analysis'
analysis_dir = project17_dir / 'analysis_artifacts'
analysis_dir.mkdir(parents=True, exist_ok=True)

# Prompt set for head-to-head comparison
prompts = [
    "Explain LoRA in one paragraph.",
    "Write a Python function to compute Fibonacci numbers.",
    "Give three bullet tips for learning ML.",
    "Summarize the benefits of parameter-efficient fine-tuning.",
    "What are common failure modes when instruction-tuning LLMs?",
]

print('Config ready:')
print('  quick_mode:', quick_mode)
print('  model_name:', model_name)
print('  tuned_adapters exist:', tuned_adapters_path.exists())
print('  analysis_dir:', analysis_dir)

Config ready:
  quick_mode: True
  model_name: mlx-community/Qwen2.5-0.5B-Instruct
  tuned_adapters exist: False
  analysis_dir: /Users/mark/git/learning-ml-to-llm/projects/phase3_llm_tuning/project17_comparative_analysis/analysis_artifacts


In [2]:
# 2) Helper functions for safe loading and generation
from pathlib import Path  # needed for the try_load_tuned signature if this cell runs first
import mlx.core as mx
from mlx_lm import load, generate

def safe_load(model_id: str):
    """Load a model/tokenizer with graceful error handling."""
    try:
        m, tok = load(model_id)
        return m, tok
    except Exception as e:
        print(f'Failed to load {model_id}:', repr(e))
        return None, None

def try_load_tuned(base_model_id: str, adapter_path: Path):
    """Attempt to load base model then apply LoRA adapters if available."""
    m, tok = safe_load(base_model_id)
    if m is None:
        return None, None
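    # Attempt adapter merge if the file exists.
    # NOTE: the `mlx_lm.peft` import below is unverified and may not exist in the
    # installed mlx-lm; if it fails, the except branch falls back to the base model.
    # mlx-lm's `load()` also accepts an `adapter_path` directory, which may be the
    # more reliable route depending on how Project 16 saved its adapters.
    if adapter_path.exists():
        try:
            from mlx_lm.peft import merge_lora_weights, load_lora_parameters
            # The original LoRA config is unknown here, so load the saved weights directly
            load_lora_parameters(m, str(adapter_path))
            tuned_model = merge_lora_weights(m)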
            print('Merged LoRA adapters into tuned model.')
            return tuned_model, tok
        except Exception as e:
            print('Could not load/merge LoRA adapters:', repr(e))
            print('Proceeding with base model as tuned fallback.')
            return m, tok
    else:
        print('Adapter file not found; using base model as tuned placeholder.')
        return m, tok

def safe_generate(m, tok, prompt: str, **gen_kwargs):
    if m is None or tok is None:
        return '<model unavailable>'
    try:
        return generate(m, tok, prompt=prompt, **gen_kwargs)
    except Exception as e:
        return f'<generation error: {repr(e)}>'

In [4]:
# 3) Load base and tuned models (may download on first run)
if quick_mode:
    base_model, base_tok = None, None  # Skip heavy downloads in quick_mode
    tuned_model, tuned_tok = None, None
    tuned_kind = 'quick_mode_skipped_load'
else:
    base_model, base_tok = safe_load(model_name)
    tuned_model, tuned_tok = try_load_tuned(full_model_name, tuned_adapters_path)
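    # try_load_tuned returns the plain base model when adapters are missing,
    # so a non-None model alone does not prove that adapters were merged
    tuned_kind = 'adapters_merged' if (tuned_model is not None and tuned_adapters_path.exists()) else 'fallback_base'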

print('Models ready:')
print('  base:', 'ok' if base_model is not None else 'skipped')
print('  tuned:', 'ok' if tuned_model is not None else 'skipped', '| mode:', tuned_kind)

Models ready:
  base: skipped
  tuned: skipped | mode: quick_mode_skipped_load


In [5]:
# 4) Run generations and build comparison table
rows = []
for i, p in enumerate(prompts, start=1):
    b_out = safe_generate(base_model, base_tok, prompt=p, max_tokens=max_new_tokens, verbose=False)
    t_out = safe_generate(tuned_model, tuned_tok, prompt=p, max_tokens=max_new_tokens, verbose=False)
    rows.append({'id': i, 'prompt': p, 'base_output': b_out, 'tuned_output': t_out})
df = pd.DataFrame(rows)
csv_path = analysis_dir / 'base_vs_tuned_outputs.csv'
df.to_csv(csv_path, index=False)
print('Saved outputs to', csv_path)
df.head(2)

Saved outputs to /Users/mark/git/learning-ml-to-llm/projects/phase3_llm_tuning/project17_comparative_analysis/analysis_artifacts/base_vs_tuned_outputs.csv


|   | id | prompt | base_output | tuned_output |
|---|----|--------|-------------|--------------|
| 0 | 1 | Explain LoRA in one paragraph. | `<model unavailable>` | `<model unavailable>` |
| 1 | 2 | Write a Python function to compute Fibonacci n... | `<model unavailable>` | `<model unavailable>` |


In [6]:
# 5) Simple heuristic metrics
import re
def keyword_coverage(text: str, prompt: str):
    # Fraction (between 0 and 1) of unique alphabetic prompt words longer than 3 letters that appear in the output
    pw = {w.lower() for w in re.findall(r'[A-Za-z]+', prompt) if len(w) > 3}
    if not pw:
        return 0.0
    ow = {w.lower() for w in re.findall(r'[A-Za-z]+', text)}
    return len(pw & ow) / len(pw)

metrics = []
for r in rows:
    base_len = len(r['base_output'].split()) if isinstance(r['base_output'], str) else 0
    tuned_len = len(r['tuned_output'].split()) if isinstance(r['tuned_output'], str) else 0
    base_cov = keyword_coverage(r['base_output'], r['prompt']) if isinstance(r['base_output'], str) else 0.0
    tuned_cov = keyword_coverage(r['tuned_output'], r['prompt']) if isinstance(r['tuned_output'], str) else 0.0
    metrics.append({
        'id': r['id'],
        'prompt': r['prompt'],
        'base_len': base_len,
        'tuned_len': tuned_len,
        'base_keyword_coverage': base_cov,
        'tuned_keyword_coverage': tuned_cov,
        'len_delta': tuned_len - base_len,
        'coverage_delta': tuned_cov - base_cov,
    })
metrics_df = pd.DataFrame(metrics)
metrics_path = analysis_dir / 'heuristic_metrics.csv'
metrics_df.to_csv(metrics_path, index=False)
summary = {
    'avg_len_delta': float(metrics_df['len_delta'].mean()),
    'avg_cov_delta': float(metrics_df['coverage_delta'].mean()),
    'prompts': len(metrics_df),
    'quick_mode': quick_mode,
    'tuned_kind': tuned_kind,
}
summary_path = analysis_dir / 'summary_metrics.json'
with open(summary_path, 'w') as f:
    json.dump(summary, f, indent=2)
print('Saved metrics to', metrics_path)
print('Saved summary to', summary_path)
metrics_df.head(2)

Saved metrics to /Users/mark/git/learning-ml-to-llm/projects/phase3_llm_tuning/project17_comparative_analysis/analysis_artifacts/heuristic_metrics.csv
Saved summary to /Users/mark/git/learning-ml-to-llm/projects/phase3_llm_tuning/project17_comparative_analysis/analysis_artifacts/summary_metrics.json


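|   | id | prompt | base_len | tuned_len | base_keyword_coverage | tuned_keyword_coverage | len_delta | coverage_delta |
|---|----|--------|----------|-----------|-----------------------|------------------------|-----------|----------------|
| 0 | 1 | Explain LoRA in one paragraph. | 2 | 2 | 0.0 | 0.0 | 0 | 0.0 |
| 1 | 2 | Write a Python function to compute Fibonacci n... | 2 | 2 | 0.0 | 0.0 | 0 | 0.0 |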


## Summary & Next Steps
- This notebook compares base vs tuned outputs on a small prompt set and records heuristic metrics (length and prompt keyword coverage).
- In quick_mode, model loading is skipped entirely, so both outputs are the `<model unavailable>` placeholder and every delta is 0. This validates the pipeline plumbing only. Set `quick_mode = False` to load the full model and adapters.
- If adapters are missing, generate them by running Project 16 with `DRY_RUN=False` and saving LoRA adapters.
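
As a sanity check on the heuristics above, the following sketch re-implements the notebook's word-count and `keyword_coverage` logic in standalone form and applies it to two strings: the `<model unavailable>` placeholder and a hand-written answer. The hand-written answer is a hypothetical string made up for illustration, not real model output.

```python
import re

def keyword_coverage(text: str, prompt: str) -> float:
    """Fraction of unique prompt words (longer than 3 letters) found in text."""
    prompt_words = {w.lower() for w in re.findall(r'[A-Za-z]+', prompt) if len(w) > 3}
    if not prompt_words:
        return 0.0
    output_words = {w.lower() for w in re.findall(r'[A-Za-z]+', text)}
    return len(prompt_words & output_words) / len(prompt_words)

prompt = "Explain LoRA in one paragraph."

# The placeholder produced when no model is loaded
placeholder = "<model unavailable>"
placeholder_len = len(placeholder.split())               # 2 whitespace-separated tokens
placeholder_cov = keyword_coverage(placeholder, prompt)  # 0.0: no prompt keyword appears

# Hypothetical hand-written answer, for illustration only
example = "LoRA adds small low-rank matrices to frozen weights."
example_len = len(example.split())
example_cov = keyword_coverage(example, prompt)  # only "lora" of {explain, lora, paragraph} matches

print(placeholder_len, placeholder_cov)
print(example_len, round(example_cov, 3))
```

The placeholder scores a length of 2 and a coverage of 0.0, which is what the quick-mode metrics table reports. Those rows therefore measure the placeholder string, not model behavior.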

Next improvements:
- Add a richer evaluation set and human preference judgments.
- Include task-specific checks (e.g., code execution tests for coding prompts).
- Plot distributions and deltas; compute statistical significance over larger samples.
- Optionally compute log-prob scores with the model for stronger comparisons.
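
One of the improvements above, an execution test for coding prompts, could look roughly like the sketch below for the Fibonacci prompt. It is a minimal illustration under stated assumptions: the helper names (`extract_python_code`, `passes_fibonacci_check`) are hypothetical, the model is assumed to name its function `fibonacci` and to use 0-based indexing with `fibonacci(0) == 0`, and the sample output is hand-written. It runs code with `exec`, so real model output should only be checked inside a sandbox.

```python
import re

FENCE = "`" * 3  # built indirectly so the example nests cleanly inside markdown

def extract_python_code(output: str) -> str:
    """Return the first fenced code block in a model output, or the whole text if none."""
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, output, flags=re.DOTALL)
    return match.group(1) if match else output

def passes_fibonacci_check(output: str, func_name: str = "fibonacci") -> bool:
    """Execute the extracted code and compare func_name(n) against known values."""
    namespace = {}
    try:
        # WARNING: exec runs untrusted code; sandbox this in real evaluations
        exec(extract_python_code(output), namespace)
        func = namespace[func_name]
        return [func(n) for n in range(7)] == [0, 1, 1, 2, 3, 5, 8]
    except Exception:
        # Syntax errors, missing function, or runtime failures all count as a fail
        return False

# Hypothetical hand-written model output, for illustration only
sample_output = (
    "Here is a function:\n"
    + FENCE + "python\n"
    + "def fibonacci(n):\n"
    + "    a, b = 0, 1\n"
    + "    for _ in range(n):\n"
    + "        a, b = b, a + b\n"
    + "    return a\n"
    + FENCE
)

print(passes_fibonacci_check(sample_output))
print(passes_fibonacci_check("<model unavailable>"))
```

A pass/fail column like this could sit next to the length and coverage metrics, giving the coding prompt a correctness signal that the text heuristics cannot provide.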