In [1]:
# Auto-configure repo path and compute device (GPU/MPS/CPU)
import sys
from pathlib import Path

try:
    from utils.path_helpers import add_repo_root_to_sys_path
except Exception:
    cur = Path.cwd()
    for parent in [cur] + list(cur.parents):
        if (parent / "requirements.txt").exists() or (parent / ".git").exists():
            sys.path.insert(0, str(parent))
            break
    from utils.path_helpers import add_repo_root_to_sys_path

add_repo_root_to_sys_path()

from utils.device import get_device, backend_info, backend_name, ensure_seed, move_to
print(f"Using backend: {backend_info()}")
ensure_seed(42)

# If using torch, set default device (PyTorch 2.x convenience)
try:
    import torch  # noqa: F401
    if backend_name() in ("torch_cuda", "torch_mps") and hasattr(torch, "set_default_device"):
        torch.set_default_device("cuda" if backend_name() == "torch_cuda" else "mps")
        print(f"torch default device set to {torch.get_default_device()}")
except Exception:
    pass

Using backend: Backend=MLX version=0.29.3 device=DeviceType.gpu


# Project 16: Instruction Tuning Mistral 7B

## Goal
Fine-tune a production large language model (Mistral 7B) on instruction-response pairs using parameter-efficient fine-tuning (LoRA). Learn how to adapt pretrained models to new tasks without retraining from scratch.

## Learning Objectives
- Load a pretrained LLM (Mistral 7B) and understand its architecture
- Implement LoRA (Low-Rank Adaptation): train small weight matrices instead of full weights
- Prepare instruction datasets and tokenize them properly
- Build a training loop: forward → loss → backward → optimizer step
- Compare base vs. fine-tuned model outputs on same prompts
- Quantize models to fit on consumer hardware (4-bit quantization)
- Save and load fine-tuned adapters
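
To make the LoRA objective concrete, here is a minimal NumPy sketch of one adapted linear layer. The helper name `lora_forward`, the toy shapes, and the zero initialization of `B` are illustrative assumptions that follow the usual LoRA formulation; this is not the `mlx_lm` implementation.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    """Frozen base projection plus a scaled low-rank update."""
    base = x @ W.T                   # W: [out, in], frozen pretrained weight
    update = (x @ A.T) @ B.T         # A: [r, in], B: [out, r], both trainable
    return base + (alpha / r) * update

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 48, 4, 8
W = rng.normal(size=(d_out, d_in))        # stays frozen during fine-tuning
A = rng.normal(size=(r, d_in)) * 0.01     # small random init (assumed)
B = np.zeros((d_out, r))                  # zero init, so training starts from the base model
x = rng.normal(size=(2, d_in))

y = lora_forward(x, W, A, B, alpha, r)
full_params = W.size
lora_params = A.size + B.size
print("adapter is a no-op before training:", np.allclose(y, x @ W.T))
print("full weight params:", full_params, "| LoRA params:", lora_params)
```

Only `A` and `B` would receive gradients; the saving grows with layer size because the adapter adds `r * (d_in + d_out)` parameters rather than `d_in * d_out`.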

## Prerequisites
- Project 14 (Pretraining): Understand training loops and loss computation
- Project 15 (Analysis): Know why fine-tuning outperforms random initialization
- MLX framework: Installed and working on an Apple silicon Mac (GPU acceleration via Metal)

## What You'll Build
- Quantized Mistral 7B loaded via MLX
- LoRA adapter injection into transformer layers
- Instruction dataset (toy + real format)
- Fine-tuning loop with validation + checkpointing
- Before/after generation comparison
- LoRA weight saving/loading
- Optional: merged model for inference

## Estimated Time
- Setup + baseline generation: 30 min
- Demo fine-tuning (300 steps): 1-3 hours
- Full fine-tuning (3000+ steps): 10-24 hours
- Analysis + comparison: 30-60 min

## Usage Guide

This notebook:
1. Sets up MLX device detection and model loading
2. Loads Mistral 7B with 4-bit quantization
3. Creates toy instruction dataset (format: {"instruction": "...", "output": "..."})
4. Applies LoRA adapters to linear layers
5. Trains for N steps with validation loss tracking
6. Saves LoRA weights and merges for inference
7. Compares generations before/after tuning

Key functions:
- `load()` → load pretrained model from HF Hub
- `apply_lora()` → inject rank-16 adaptations
- `get_batch()` → tokenize and batch instruction pairs
- `train_step()` → forward + loss + backward for LoRA weights only
- `merge_lora_weights()` → combine LoRA adapters with base weights

---

## Configuration
```
Model: Mistral 7B
Method: LoRA (rank 32 for a full run; the demo cells below use rank 16)
Batch size: 32-64
Dataset: ~10k instruction examples
Memory usage: ~20-30GB
Training time: Hours per run
```

In [2]:
# Setup
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load, generate
import numpy as np
import json

print(f"MLX version: {mx.__version__}")
print(f"Device: {mx.default_device()}")

MLX version: 0.29.3
Device: Device(gpu, 0)


## Quick Plan
We will:
1. Configure paths, hyperparameters, and MLX backend.
2. Load Mistral 7B in MLX and run a baseline generation.
3. Prepare a small instruction dataset (toy JSONL) and a collate function.
4. Inject LoRA adapters into targeted linear layers.
5. Train with parameter-efficient fine-tuning (few steps for demo).
6. Save LoRA adapters and show how to merge for inference.
7. Compare generations before vs after tuning and summarize.

In [3]:
# 1) Configuration and utility helpers
from pathlib import Path
import json, math, time, csv, datetime
import numpy as np
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load, generate

project_dir = Path().resolve()
artifacts_dir = project_dir / 'artifacts'
artifacts_dir.mkdir(exist_ok=True)

model_name = 'mistralai/Mistral-7B-Instruct-v0.2'  # can switch to base if desired
use_4bit = True  # quantization for memory
lora_rank = 16   # keep small for Mac memory
lora_alpha = 32
lora_dropout = 0.05
warmup_steps = 50
max_steps = 300  # demo length; increase for real tuning
grad_accum = 4
learning_rate = 5e-5
eval_interval = 50
save_every = 100
patience = 6
seed = 42

mx.random.seed(seed)
print('Config ready. Device:', mx.default_device())

Config ready. Device: Device(gpu, 0)
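
The `patience` and `eval_interval` settings suggest early stopping on validation loss. The sketch below shows one way they could be used; `should_stop` and `min_delta` are hypothetical names, not part of the notebook or of `mlx_lm`.

```python
def should_stop(val_losses, patience, min_delta=0.0):
    """Return True when the last `patience` evaluations failed to beat the earlier best loss."""
    if len(val_losses) <= patience:
        return False  # not enough history yet
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

# One validation loss is appended every `eval_interval` steps (toy values)
history = []
stopped_at = None
for i, loss in enumerate([3.0, 2.5, 2.4, 2.41, 2.42, 2.43, 2.44]):
    history.append(loss)
    if should_stop(history, patience=3):
        stopped_at = i
        break
print("stopped at evaluation index:", stopped_at)
```

With `patience = 6` and `eval_interval = 50`, a rule like this would tolerate 300 steps without improvement before stopping.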


In [4]:
# 2) Load base model and tokenizer; baseline generation
try:
    model, tokenizer = load(model_name, quantize='q4' if use_4bit else None)
except TypeError:
    # Fallback for mlx_lm versions without 'quantize' kwarg support
    print("mlx_lm.load() does not support 'quantize' in this version; loading without quantization.")
    model, tokenizer = load(model_name)

system_prompt = "You are a helpful assistant."
baseline_prompts = [
    "Explain LoRA in one paragraph.",
    "Write a Python function to compute Fibonacci numbers.",
]

print('Baseline generations:')
for p in baseline_prompts:
    try:
        text = generate(model, tokenizer, prompt=p, max_tokens=128, verbose=False)
        print('\n--- Prompt ---\n', p)
        print('\n--- Output ---\n', text)
    except Exception as e:
        print('\n--- Prompt ---\n', p)
        print('Generation error:', repr(e))

mlx_lm.load() does not support 'quantize' in this version; loading without quantization.


Cancellation requested; stopping current tasks.


KeyboardInterrupt: 

In [None]:
# 3) Build a tiny instruction dataset (toy JSONL)
import json
from pathlib import Path

dataset_path = artifacts_dir / 'toy_instructions.jsonl'
if not dataset_path.exists():
    examples = [
        {"instruction": "Summarize: LoRA is a method for parameter-efficient fine-tuning.",
         "output": "LoRA fine-tunes large models by training small low-rank adapters while freezing original weights."},
        {"instruction": "Write a function: factorial in Python.",
         "output": "def factorial(n):\n    return 1 if n<=1 else n*factorial(n-1)"},
        {"instruction": "Give three bullet tips for learning ML.",
         "output": "- Start with linear models\n- Practice on small datasets\n- Read research and implement"},
        {"instruction": "Fix bug: reverse a list in Python.",
         "output": "def reverse_list(xs):\n    return xs[::-1]"},
    ]
    with open(dataset_path, 'w') as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
print('Toy dataset at', dataset_path)

Toy dataset at /Users/mark/git/learning-ml-to-llm/projects/phase3_llm_tuning/project16_mistral_tuning/artifacts/toy_instructions.jsonl


In [None]:
# 4) Tokenize dataset and prepare batches
with open(dataset_path) as f:
    records = [json.loads(line) for line in f]

def format_example(rec):
    return f"Instruction: {rec['instruction']}\nResponse:"  # model predicts the response tokens

# Build tokenized pairs
inputs = []
labels = []
for r in records:
    prompt = format_example(r)
    full = prompt + " " + r['output']
    inp_ids = tokenizer.encode(prompt)
    full_ids = tokenizer.encode(full)
    # Labels: -100 for prompt part (so we only learn response) -> mimic instruction tuning masking
    label_ids = [-100]*len(inp_ids) + full_ids[len(inp_ids):]
    inputs.append(mx.array(full_ids))
    labels.append(mx.array(label_ids))

max_len = max(x.shape[0] for x in inputs)
print('Max length:', max_len)

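# The wrapped Hugging Face tokenizer exposes `pad_token_id` (not `pad_id`), and
# Mistral's tokenizer may leave it unset, so fall back to the EOS id.
PAD_ID = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id

def pad(arr, length, pad_id=PAD_ID):
    if arr.shape[0] >= length:
        return arr[:length]
    return mx.concatenate([arr, mx.array([pad_id]*(length - arr.shape[0]))])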

input_batch = mx.stack([pad(x, max_len) for x in inputs])
label_batch = mx.stack([pad(mx.array(l), max_len, pad_id=-100) for l in labels])
print('Batch shape:', input_batch.shape, label_batch.shape)

NameError: name 'tokenizer' is not defined
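
Because the cell above could not run without a loaded tokenizer, here is a small pure-Python sketch of the same label-masking idea. The toy vocabulary and the helper names `build_labels` and `shift_for_next_token` are hypothetical stand-ins for the real tokenizer and batching code.

```python
IGNORE_INDEX = -100

def build_labels(prompt_ids, full_ids, ignore_index=IGNORE_INDEX):
    """Mask the prompt so only response tokens contribute to the loss."""
    n_prompt = len(prompt_ids)
    return [ignore_index] * n_prompt + list(full_ids[n_prompt:])

def shift_for_next_token(input_ids, labels):
    """Align position t of the inputs with the label for token t+1."""
    return input_ids[:-1], labels[1:]

# Toy vocabulary standing in for a real tokenizer (hypothetical ids)
vocab = {"Instruction:": 1, "Add": 2, "1+1": 3, "Response:": 4, "2": 5}
prompt_ids = [vocab[t] for t in "Instruction: Add 1+1 Response:".split()]
full_ids = prompt_ids + [vocab["2"]]

labels = build_labels(prompt_ids, full_ids)
model_inputs, targets = shift_for_next_token(full_ids, labels)
print(labels)        # [-100, -100, -100, -100, 5]
print(model_inputs)  # [1, 2, 3, 4]
print(targets)       # [-100, -100, -100, 5]
```

A causal language model predicts token t+1 at position t, which is why the labels are shifted by one before the loss. With a real subword tokenizer, the encoding of the prompt alone may not be an exact prefix of the encoding of prompt plus response, so the mask boundary is worth checking on a few examples.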

In [None]:
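# 5) Apply LoRA adapters (if available in mlx_lm)
# Note: the `mlx_lm.peft` module path and the helper names imported below are assumptions.
# Released mlx_lm versions keep their LoRA utilities under `mlx_lm.tuner`, so this import
# may fail; the guard then skips LoRA and training instead of crashing.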
LORA_AVAILABLE = False
try:
    from mlx_lm.peft import LoraConfig, apply_lora, mark_only_lora_as_trainable, save_lora_parameters, merge_lora_weights
    LORA_AVAILABLE = True
except Exception as e:
    print('LoRA utilities not found in mlx_lm.peft, proceeding without training. Error:', repr(e))

if LORA_AVAILABLE:
    lcfg = LoraConfig(
        r=lora_rank,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    )
    apply_lora(model, lcfg)
    mark_only_lora_as_trainable(model)
    print('Applied LoRA to target modules and marked as trainable.')
else:
    print('Skipping LoRA application. You can upgrade mlx_lm to a version with PEFT support.')

In [None]:
# 6) Training loop (guarded; short demo)
DRY_RUN = True  # set to False to attempt a few training steps

if not LORA_AVAILABLE:
    print('LoRA not available; training is skipped.')
elif DRY_RUN:
    print('DRY_RUN=True: skipping heavy training. Set DRY_RUN=False to try a few steps (may be slow).')
else:
    import mlx.optimizers as optim

    def cross_entropy_ignore_index(logits, targets, ignore_index=-100):
        # logits: [B,T,V], targets: [B,T]
        # Boolean-mask indexing is not generally supported on MLX arrays, so compute
        # per-token losses everywhere and zero out the ignored positions instead.
        mask = (targets != ignore_index)
        safe_targets = mx.where(mask, targets, 0)  # placeholder id for masked slots
        nll = nn.losses.cross_entropy(logits, safe_targets, reduction='none')
        return (nll * mask).sum() / mx.maximum(mask.sum(), 1)

    opt = optim.AdamW(learning_rate)

    def loss_fn(_model):
        # mlx_lm models return the logits array directly (no `.logits` attribute).
        # Shift by one so position t is scored against token t+1.
        logits = _model(input_batch[:, :-1])
        return cross_entropy_ignore_index(logits, label_batch[:, 1:])

    val_and_grad = nn.value_and_grad(model, loss_fn)

    for step in range(min(20, max_steps)):
        loss, grads = val_and_grad(model)
        opt.update(model, grads)
        mx.eval(model.parameters(), opt.state)  # force the lazy update to run
        if (step+1) % 5 == 0:
            print(f'step {step+1} loss {loss.item():.4f}')

    # Save LoRA params if utilities present
    try:
        save_lora_parameters(model, str(artifacts_dir / 'lora_adapters.safetensors'))
        print('Saved LoRA adapters to artifacts directory.')
    except Exception as e:
        print('Could not save LoRA adapters:', repr(e))
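
The masked loss can be sanity-checked on a case small enough to compute by hand. This NumPy sketch is an independent illustration (the function name `masked_cross_entropy` is hypothetical), not the notebook's MLX code.

```python
import numpy as np

def masked_cross_entropy(logits, targets, ignore_index=-100):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    mask = targets != ignore_index
    safe_targets = np.where(mask, targets, 0)                  # placeholder id for masked slots
    shifted = logits - logits.max(axis=-1, keepdims=True)      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -np.take_along_axis(log_probs, safe_targets[..., None], axis=-1)[..., 0]
    return (nll * mask).sum() / max(mask.sum(), 1)

logits = np.zeros((1, 3, 4))              # uniform predictions over a 4-token vocabulary
targets = np.array([[-100, 2, 1]])        # first position is a masked prompt token
loss = masked_cross_entropy(logits, targets)
print(round(float(loss), 4))              # 1.3863, i.e. ln(4)
```

Uniform logits over four tokens give a loss of ln(4) per unmasked position, and the masked position leaves the mean unchanged, which is the behavior expected from `ignore_index`.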

In [None]:
# 7) Merge adapters for inference (optional)
if LORA_AVAILABLE:
    try:
        merged_model = merge_lora_weights(model)
        print('Merged LoRA weights into a copy of the model for inference.')
    except Exception as e:
        merged_model = model
        print('Could not merge LoRA (using current model):', repr(e))
else:
    merged_model = model

# Compare generations after tuning (or baseline if skipped)
for p in baseline_prompts:
    text = generate(merged_model, tokenizer, prompt=p, max_tokens=128, verbose=False)
    print('\n=== AFTER / CURRENT MODEL ===\nPrompt:', p, '\nOutput:', text)
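
Merging relies on the adapter being a plain additive update to the weight. The NumPy sketch below checks that identity on random matrices; `merge_lora` is a hypothetical helper written for this illustration, not the function imported above.

```python
import numpy as np

def merge_lora(W, A, B, alpha, r):
    """Fold the low-rank update into the base weight: W + (alpha/r) * B @ A."""
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 32, 16, 4, 8
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
x = rng.normal(size=(3, d_in))

adapter_out = x @ W.T + (alpha / r) * (x @ A.T) @ B.T   # separate adapter path
merged_out = x @ merge_lora(W, A, B, alpha, r).T         # single matmul after merging
print("outputs match:", np.allclose(adapter_out, merged_out))
```

After merging, inference uses one matrix multiply per layer, which is why a merged model runs at base-model speed. If the base weights are quantized they generally need to be dequantized before the addition, which this sketch does not cover.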

# Exercises & Extensions

## Warm-up

1. **Baseline Generation**: Run model on 3-5 prompts BEFORE fine-tuning. Save outputs. These will be your "before" samples for comparison.
2. **Dataset Inspection**: Print first 5 instruction/response pairs from the dataset. Are they diverse? Do they cover multiple tasks?
3. **LoRA Rank Sensitivity**: Train with lora_rank = 8, 16, 32, 64. For each, plot final loss and memory usage. What's the trade-off?

## Intermediate

4. **Fine-tuning Convergence**: Plot training loss vs. step. Does it decrease monotonically? Does validation loss follow similar trends? Where does overfitting start?
5. **Before/After Comparison**: Generate text with the same prompts before and after fine-tuning. Is the tuned model more instruction-following? Save side-by-side comparison.
6. **Learning Rate Scheduling**: Train with fixed lr, then try cosine decay or linear warmup. Which converges fastest to lowest validation loss?
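
For the learning-rate exercise, a schedule with linear warmup followed by cosine decay might look like the sketch below. The defaults mirror the notebook's `learning_rate`, `warmup_steps`, and `max_steps`; the function name `lr_at` and the `min_lr` parameter are assumptions made for illustration.

```python
import math

def lr_at(step, base_lr=5e-5, warmup_steps=50, max_steps=300, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr at max_steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 49, 50, 175, 300):
    print(s, f"{lr_at(s):.2e}")
```

The value returned for each step would be assigned to the optimizer before its update; how that is done depends on the optimizer API in use.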

## Advanced

7. **Adapter Analysis**: Extract and visualize the LoRA weights (the two low-rank factors, usually called A and B). Do they learn meaningful structure? Compute the singular values of their product. What is the effective rank?
8. **Domain Transfer**: Fine-tune on domain-specific instructions (e.g., medical Q&A, code generation). Does targeted tuning help? Test on out-of-domain prompts to measure generalization.
9. **Few-Shot Evaluation**: Create a small evaluation set with exact gold answers. Fine-tune with 1%, 10%, 100% of data. Plot exact-match accuracy vs. data fraction. How much data needed for reasonable performance?
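
For the few-shot evaluation exercise, exact-match accuracy can be computed with a few lines of standard-library Python. The normalization rules here (lowercasing, stripping punctuation, collapsing whitespace) are one reasonable choice, not a fixed standard, and the helper names are hypothetical.

```python
import string

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["Paris.", "  4", "blue"]
golds = ["paris", "4", "green"]
em = exact_match(preds, golds)
print(f"exact match: {em:.3f}")
```

Running this once per data fraction (1%, 10%, 100%) gives the points for the accuracy-versus-data plot.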

---

# Summary & Bridge Forward

## What You Learned

- **Fine-Tuning vs. Pretraining**: Pretraining learns from billions of tokens (expensive); fine-tuning adapts to task on thousands of tokens (cheap)
- **LoRA (Low-Rank Adaptation)**: Train small rank-16 matrices instead of all 7B parameters; this cuts trainable parameters by more than 99% and often reaches quality close to full fine-tuning on the target task
- **Instruction Tuning**: Format data as "system + instruction + response"; model learns to follow instructions rather than just predict text
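- **Quantization**: 4-bit weights need roughly a quarter of the float16 memory (slightly more once per-group scales are stored), usually with only a small quality loss; the smaller footprint is what lets a 7B model fit on consumer hardware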
- **Adapter Merging**: LoRA weights stay separate; can save 100MB instead of 14GB; merge at inference time for speed
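
The quantization figure above can be checked with back-of-the-envelope arithmetic. The sketch assumes about 7.24 billion parameters and a group-wise scheme that stores a 16-bit scale and a 16-bit bias per group of 64 weights; both numbers are assumptions and the exact layout depends on the quantizer.

```python
def weight_memory_gb(n_params, bits, group_size=None, scale_bits=16, bias_bits=16):
    """Approximate weight storage in GB (10^9 bytes), with optional per-group scale and bias."""
    bits_per_weight = bits
    if group_size is not None:
        bits_per_weight += (scale_bits + bias_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7.24e9  # approximate Mistral 7B parameter count (assumption)
fp16 = weight_memory_gb(n_params, bits=16)
q4 = weight_memory_gb(n_params, bits=4, group_size=64)
print(f"fp16: {fp16:.1f} GB, 4-bit: {q4:.1f} GB, ratio: {fp16 / q4:.2f}x")
```

Under these assumptions the effective cost is 4.5 bits per weight, so the reduction is somewhat under 4×. Activations, the KV cache, and optimizer state for the adapters come on top of the weight storage.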

## Why This Matters

Fine-tuning is the **practical entry point to LLM development**:

1. **Cost**: Fine-tuning is 100-1000× cheaper than pretraining
   - Pretraining Mistral: weeks on 1000s of GPUs = millions of $
   - Fine-tuning Mistral: hours on 1 GPU = hundreds of $

2. **Speed**: Accessible to researchers, startups, enterprises
   - Democratizes LLM access
   - Enables rapid prototyping
   - Allows domain specialization (medical LLMs, code LLMs, etc.)

3. **Production Viability**:
   - Start with Mistral/GPT/LLaMA
   - Fine-tune on your data
   - Deploy as API or edge model
   - This is how most LLM products work today

## Bridge to Next Projects

- **Project 17 (Comparative Analysis)**: Compare outputs of base vs. tuned models systematically
  - Quantitative metrics: perplexity, exact match, BLEU/ROUGE
  - Qualitative analysis: does tuning improve instruction following?
  - When does tuning help? When does it hurt?

- **Further Work**:
  - **Scaling**: Tune larger models (13B, 34B, 70B)
  - **Evaluation**: Benchmark on published datasets (MMLU, HellaSwag, TruthfulQA)
  - **Deployment**: Package as API (LM Studio, vLLM, SGLang)
  - **Monitoring**: Track prompt-response patterns; detect drift

## Your Takeaway

> **Fine-tuning is practical LLM customization.** LoRA reduces trainable parameters from billions to millions while retaining performance. This makes LLM adaptation accessible and affordable, enabling rapid task-specific deployments.

---

# Performance Notes

- **LoRA Memory**: Full 7B parameters = 14GB (float16); LoRA rank-16 = 50-100MB overhead
- **4-Bit Quantization**: Reduces memory ~4×; quality loss minimal on instruction tasks
- **Training Speed**: ~100-200 tokens/second on M4 GPU with MLX
- **Convergence**: Instruction tuning typically converges in 1-5 epochs
- **Inference**: Base model ≈ 10-20 tok/s; LoRA merged has same speed as base
- **Typical LoRA Rank**: 8-32 for most tasks; 64+ for complex domains
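
The LoRA overhead figure can be reproduced from layer shapes. The sketch below assumes the Mistral 7B dimensions from its published model config (hidden size 4096, key/value projection width 1024, feed-forward width 14336, 32 layers) and the seven target modules used in this notebook; treat the dimensions as assumptions to verify against the actual checkpoint.

```python
# Mistral 7B shapes (assumed from the published model config)
HIDDEN, KV_DIM, FFN, LAYERS = 4096, 1024, 14336, 32

# (in_features, out_features) for each projection the notebook targets
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}

def lora_param_count(rank, targets=TARGETS, layers=LAYERS):
    """Each adapted layer adds A [rank, in] and B [out, rank]: rank * (in + out) parameters."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in targets.values())
    return per_layer * layers

for rank in (8, 16, 32, 64):
    n = lora_param_count(rank)
    print(f"rank {rank:>2}: {n / 1e6:6.1f}M params, {n * 2 / 1e6:6.1f} MB in float16")
```

At rank 16 this gives about 42M adapter parameters, roughly 84 MB in float16 and well under 1% of the base model, which is consistent with the 50-100MB estimate above. Parameter count scales linearly with rank.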

## Summary & Next Steps
**What we did:** Loaded Mistral 7B (4-bit where the installed `mlx_lm` supports it, otherwise unquantized), built a tiny toy instruction dataset, optionally applied LoRA adapters (if available), and demonstrated a guarded training loop plus generation comparison.

**Why parameter-efficient tuning:** LoRA updates a small set of low-rank matrices instead of all billions of weights, drastically cutting memory & compute while retaining performance gains on the target domain.

**To continue:**
1. Set `DRY_RUN=False` and increase `max_steps` for a longer fine-tune.
2. Replace `toy_instructions.jsonl` with a real dataset (e.g. Alpaca, Dolly) formatted into instruction/response pairs.
3. Experiment with different `lora_rank`, `learning_rate`, and `quantize` settings.
4. Evaluate using perplexity or task benchmarks (e.g., few-shot QA).
5. Package LoRA adapters for distribution or merge & export a fully tuned model.
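
For step 2, records from a public dataset need to be mapped onto the `{"instruction", "output"}` schema used by the toy file. The sketch below assumes Alpaca-style fields (`instruction`, `input`, `output`) and Dolly-style fields (`instruction`, `context`, `response`); check the dataset card before relying on these names. The helper names are hypothetical.

```python
import json

def to_notebook_record(rec):
    """Map an Alpaca- or Dolly-style record onto {"instruction", "output"}."""
    instruction = rec["instruction"].strip()
    context = (rec.get("input") or rec.get("context") or "").strip()
    if context:
        instruction = f"{instruction}\n\n{context}"   # append optional context to the instruction
    output = rec.get("output", rec.get("response", "")).strip()
    return {"instruction": instruction, "output": output}

def convert(records, out_path):
    """Write converted records as JSONL, skipping entries without a response."""
    kept = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            row = to_notebook_record(rec)
            if row["output"]:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
                kept += 1
    return kept

alpaca_style = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
dolly_style = {"instruction": "What is LoRA?", "context": "", "response": "A low-rank adapter method."}
print(to_notebook_record(alpaca_style))
print(to_notebook_record(dolly_style))
```

The resulting file can replace `toy_instructions.jsonl` without changes to the tokenization cell, since the keys match.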

**Potential Improvements:**
- Add proper masking for system/user/assistant roles.
- Implement curriculum or mixed precision.
- Add an evaluation harness that tracks exact match, BLEU, or ROUGE depending on the task.

Proceed to Project 17 for comparative analysis between base vs tuned outputs.