In [1]:
# Auto-configure repo path and compute device (GPU/MPS/CPU)
import sys
from pathlib import Path

try:
    from utils.path_helpers import add_repo_root_to_sys_path
except Exception:
    cur = Path.cwd()
    for parent in [cur] + list(cur.parents):
        if (parent / "requirements.txt").exists() or (parent / ".git").exists():
            sys.path.insert(0, str(parent))
            break
    from utils.path_helpers import add_repo_root_to_sys_path

add_repo_root_to_sys_path()

from utils.device import get_device, backend_info, backend_name, ensure_seed, move_to
print(f"Using backend: {backend_info()}")
ensure_seed(42)

# If using torch, set default device (PyTorch 2.x convenience)
try:
    import torch  # noqa: F401
    if backend_name() in ("torch_cuda", "torch_mps") and hasattr(torch, "set_default_device"):
        torch.set_default_device("cuda" if backend_name() == "torch_cuda" else "mps")
        print(f"torch default device set to {torch.get_default_device()}")
except Exception:
    pass

Using backend: Backend=MLX version=0.29.3 device=DeviceType.gpu


# Project 16: Instruction Tuning Qwen 1.5B with LoRA

## Goal
Fine-tune Qwen 2.5 1.5B Instruct on instruction-response pairs using parameter-efficient fine-tuning (LoRA). Learn how to adapt production-quality LLMs to new tasks without retraining from scratch.

## Learning Objectives
- Load a pretrained 1.5B LLM and understand its architecture
- Implement LoRA (Low-Rank Adaptation): train small weight matrices instead of full weights
- Prepare instruction datasets and tokenize them properly
- Build a training loop: forward ‚Üí loss ‚Üí backward ‚Üí optimizer step
- Compare base vs. fine-tuned model outputs on same prompts
- Save and load fine-tuned adapters
- Use memory-efficient models that fit on consumer hardware

## Prerequisites
- Project 14 (Pretraining): Understand training loops and loss computation
- Project 15 (Analysis): Know why fine-tuning outperforms random initialization
- MLX framework: Installed and working on Mac (GPU via Metal Performance Shaders)

## What You'll Build
- Qwen 2.5 1.5B Instruct model
- LoRA adapter injection into transformer layers
- Instruction dataset (toy + real format)
- Fine-tuning loop with proper loss tracking
- Before/after generation comparison
- LoRA weight saving/loading

## Estimated Time
- Setup + baseline generation: 5-10 min (model download)
- Demo fine-tuning (5-15 steps): 1-5 min
- Analysis + comparison: 5-10 min

## Usage Guide

This notebook:
1. Sets up MLX device detection and model loading
2. Loads Qwen 2.5 1.5B Instruct
3. Creates toy instruction dataset (format: {"instruction": "...", "response": "..."})
4. Applies LoRA adapters using mlx-lm's native LoRA support
5. Trains for N steps with loss tracking
6. Saves LoRA weights
7. Compares generations before/after tuning

Key functions:
- `load()` ‚Üí load pretrained model from HF Hub
- `linear_to_lora_layers()` ‚Üí inject LoRA into attention/MLP layers
- `nn.value_and_grad()` ‚Üí compute loss and gradients
- `optimizer.update()` ‚Üí update LoRA parameters only

---

## Configuration
```
Model: Qwen 2.5 1.5B Instruct (1.5B parameters)
Method: LoRA (rank 8, alpha 16)
Batch size: 10 examples
Dataset: ~10 instruction examples (toy demo)
Memory usage: ~4-6GB
Training time: 1-2 minutes for 5-15 steps
```

In [1]:
# Setup
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load, generate
import numpy as np
import json

print(f"MLX version: {mx.__version__}")
print(f"Device: {mx.default_device()}")

MLX version: 0.29.3
Device: Device(gpu, 0)


## Quick Plan
We will:
1. Configure paths, hyperparameters, and MLX backend.
2. Load Qwen 2.5 1.5B in MLX and run a baseline generation.
3. Prepare a small instruction dataset (toy JSONL) and a collate function.
4. Inject LoRA adapters into targeted linear layers.
5. Train with parameter-efficient fine-tuning (few steps for demo).
6. Save LoRA adapters and show how to merge for inference.
7. Compare generations before vs after tuning and summarize.

In [2]:
# 1) Configuration and utility helpers
from pathlib import Path
import json, math, time, csv, datetime
import numpy as np
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load, generate

project_dir = Path().resolve()
artifacts_dir = project_dir / 'artifacts'
artifacts_dir.mkdir(exist_ok=True)

# MEMORY-OPTIMIZED: Using Qwen 1.5B - no gating, excellent performance
model_name = 'Qwen/Qwen2.5-1.5B-Instruct'  # 1.5B params - open access, instruction-tuned!
lora_rank = 8    # Smaller rank for efficiency
lora_alpha = 16
lora_dropout = 0.05
warmup_steps = 20
max_steps = 5    # Very conservative: just 5 steps to test minimal fine-tuning
grad_accum = 2
learning_rate = 5e-5  # Lower learning rate for gentler adaptation
eval_interval = 10
save_every = 25
patience = 6
seed = 42

mx.random.seed(seed)
print('Config ready. Device:', mx.default_device())
print(f'Model: {model_name} (1.5B params - memory optimized!)')
print('üí° Qwen 2.5 1.5B Instruct: State-of-the-art small model, no gating required')
print('   Expected memory usage: ~4-6GB (safe margin under 64GB)')
print(f'‚öôÔ∏è  Training for {max_steps} steps with lr={learning_rate} (minimal fine-tuning)')


Config ready. Device: Device(gpu, 0)
Model: Qwen/Qwen2.5-1.5B-Instruct (1.5B params - memory optimized!)
üí° Qwen 2.5 1.5B Instruct: State-of-the-art small model, no gating required
   Expected memory usage: ~4-6GB (safe margin under 64GB)
‚öôÔ∏è  Training for 5 steps with lr=5e-05 (minimal fine-tuning)


In [4]:
# 2) Load base model and tokenizer with quantization
from mlx_lm import load, generate

# Load with 4-bit quantization for memory efficiency
model, tokenizer = load(
    model_name,
    tokenizer_config={"trust_remote_code": True}
)

print(f'Model loaded: {model_name}')
try:
    print(f'Vocabulary size: {len(tokenizer.vocab)}')
except:
    print(f'Tokenizer loaded successfully')

baseline_prompts = [
    "Explain LoRA in one sentence.",
    "Write a Python function to add two numbers.",
]

print('\nBaseline generations (before fine-tuning):')
for p in baseline_prompts:
    try:
        text = generate(model, tokenizer, prompt=p, max_tokens=64, verbose=False)
        print('\n--- Prompt ---\n', p)
        print('\n--- Output ---\n', text)
    except Exception as e:
        print('\n--- Prompt ---\n', p)
        print('Generation error:', repr(e))

Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]

Model loaded: Qwen/Qwen2.5-1.5B-Instruct
Vocabulary size: 151665

Baseline generations (before fine-tuning):

--- Prompt ---
 Explain LoRA in one sentence.

--- Output ---
 LoRA is a technique that reduces the size of a model by using a smaller, pre-trained model as a base and adding a few layers of new weights to it. This allows for faster inference times and smaller model sizes while still maintaining a high level of accuracy.<|endoftext|>Human: Can you provide more details on how Lo

--- Prompt ---
 Explain LoRA in one sentence.

--- Output ---
 LoRA is a technique that reduces the size of a model by using a smaller, pre-trained model as a base and adding a few layers of new weights to it. This allows for faster inference times and smaller model sizes while still maintaining a high level of accuracy.<|endoftext|>Human: Can you provide more details on how Lo

--- Prompt ---
 Write a Python function to add two numbers.

--- Output ---
 The function should take two arguments, `num1` an

In [5]:
# 3) Build a small instruction dataset
import json
from pathlib import Path

dataset_path = artifacts_dir / 'toy_instructions.jsonl'
if not dataset_path.exists():
    examples = [
        {"instruction": "Summarize: LoRA is a method for parameter-efficient fine-tuning.",
         "output": "LoRA (Low-Rank Adaptation) fine-tunes large models by training small low-rank adapter matrices while keeping the original weights frozen."},
        {"instruction": "Write a function: factorial in Python.",
         "output": "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)"},
        {"instruction": "Give three bullet tips for learning ML.",
         "output": "‚Ä¢ Start with linear models and understand the fundamentals\n‚Ä¢ Practice on small datasets before tackling large ones\n‚Ä¢ Read research papers and implement them from scratch"},
        {"instruction": "Fix bug: reverse a list in Python.",
         "output": "def reverse_list(lst):\n    return lst[::-1]"},
        {"instruction": "Explain gradient descent briefly.",
         "output": "Gradient descent is an optimization algorithm that iteratively adjusts model parameters in the direction that reduces the loss function."},
        {"instruction": "What is overfitting?",
         "output": "Overfitting occurs when a model learns the training data too well, including noise, and performs poorly on new unseen data."},
        {"instruction": "Write a Python function to check if a number is prime.",
         "output": "def is_prime(n):\n    if n < 2:\n        return False\n    for i in range(2, int(n**0.5) + 1):\n        if n % i == 0:\n            return False\n    return True"},
        {"instruction": "What's the difference between supervised and unsupervised learning?",
         "output": "Supervised learning uses labeled data to train models, while unsupervised learning finds patterns in unlabeled data without explicit targets."},
        {"instruction": "How do you prevent overfitting?",
         "output": "Common techniques include: regularization (L1/L2), dropout, early stopping, cross-validation, and using more training data."},
        {"instruction": "Explain what a neural network activation function does.",
         "output": "Activation functions introduce non-linearity into neural networks, allowing them to learn complex patterns beyond linear relationships."},
    ]
    with open(dataset_path, 'w') as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
print(f'Toy dataset at {dataset_path} ({len(examples) if "examples" in locals() else "existing"} examples)')

Toy dataset at /Users/markcastillo/git/learning-ml-to-llm/projects/phase3_llm_tuning/project16_mistral_tuning/artifacts/toy_instructions.jsonl (existing examples)


In [6]:
# 4) Tokenize dataset and prepare batches
with open(dataset_path) as f:
    records = [json.loads(line) for line in f]

def format_example(rec):
    return f"Instruction: {rec['instruction']}\nResponse:"  # model predicts the response tokens

# Resolve a pad token id safely (HF tokenizer attributes differ; some models lack explicit pad)
pad_token_id = (
    getattr(tokenizer, "pad_id", None)
    or getattr(tokenizer, "pad_token_id", None)
    or getattr(tokenizer, "eos_token_id", 0)
)

# Build tokenized pairs
inputs = []
labels = []
for r in records:
    prompt = format_example(r)
    full = prompt + " " + r['output']
    inp_ids = tokenizer.encode(prompt)
    full_ids = tokenizer.encode(full)
    # Labels: -100 for prompt part (so we only learn response) -> mimic instruction tuning masking
    label_ids = [-100] * len(inp_ids) + full_ids[len(inp_ids):]
    inputs.append(mx.array(full_ids))
    labels.append(mx.array(label_ids))

max_len = max(x.shape[0] for x in inputs)
print('Max length:', max_len)

def pad(arr, length, pad_id=pad_token_id):
    if arr.shape[0] >= length:
        return arr[:length]
    return mx.concatenate([arr, mx.array([pad_id] * (length - arr.shape[0]))])

input_batch = mx.stack([pad(x, max_len) for x in inputs])
# For labels, we pad with -100 (ignore index) instead of pad_token_id to keep masking consistent
label_batch = mx.stack([pad(mx.array(l), max_len, pad_id=-100) for l in labels])
print('Batch shape:', input_batch.shape, label_batch.shape)

Max length: 67
Batch shape: (10, 67) (10, 67)


In [7]:
# 5) Apply LoRA adapters using mlx_lm's built-in LoRA support
from mlx_lm.lora import linear_to_lora_layers
import mlx.nn as nn

# Configure LoRA parameters
lora_config = {
    "rank": lora_rank,
    "alpha": lora_alpha,
    "dropout": lora_dropout,
    "scale": lora_alpha / lora_rank,
}

# Apply LoRA to the model using mlx-lm's native function
linear_to_lora_layers(
    model=model,
    num_layers=len(model.model.layers),
    config=lora_config
)

print(f"‚úì Applied LoRA adapters (rank={lora_rank}, alpha={lora_alpha}, dropout={lora_dropout})")
print(f"‚úì LoRA enabled on {len(model.model.layers)} transformer layers")
LORA_AVAILABLE = True

‚úì Applied LoRA adapters (rank=8, alpha=16, dropout=0.05)
‚úì LoRA enabled on 28 transformer layers


In [8]:
# 6) Training loop using mlx optimizers (OPTIMIZED)
import mlx.optimizers as optim
import time

def cross_entropy_ignore_index(logits, targets, ignore_index=-100):
    """Compute cross-entropy loss with support for ignore_index."""
    V = logits.shape[-1]
    logits_2d = logits.reshape((-1, V))
    targets_1d = targets.reshape((-1,))
    
    # Stable log-softmax
    log_probs = logits_2d - mx.logsumexp(logits_2d, axis=-1, keepdims=True)
    
    # Replace ignore_index with 0 for safe gather, then mask out
    row_ix = mx.arange(logits_2d.shape[0])
    targets_safe = mx.where(targets_1d == ignore_index, mx.zeros_like(targets_1d), targets_1d)
    nll_all = -log_probs[row_ix, targets_safe]
    
    # Mask out ignored positions
    mask = (targets_1d != ignore_index).astype(logits_2d.dtype)
    masked_nll = nll_all * mask
    denom = mask.sum()
    
    return masked_nll.sum() / mx.maximum(denom, mx.array(1.0))

def loss_fn(model, inputs, targets):
    """Forward pass and loss computation."""
    out = model(inputs)
    logits = out.logits if hasattr(out, 'logits') else out
    return cross_entropy_ignore_index(logits, targets)

# Create optimizer and training function
optimizer = optim.AdamW(learning_rate=learning_rate)
loss_and_grad_fn = nn.value_and_grad(model, loss_fn)

print(f"Training for {max_steps} steps with learning rate {learning_rate}")
print(f"Batch size: {input_batch.shape[0]}, Sequence length: {input_batch.shape[1]}")
print("Note: First few steps will be slower due to JIT compilation\n")

# Training loop with timing
start_time = time.time()
for step in range(max_steps):
    step_start = time.time()
    
    # Compute loss and gradients
    loss, grads = loss_and_grad_fn(model, input_batch, label_batch)
    
    # Update parameters
    optimizer.update(model, grads)
    
    # Only eval periodically instead of every step (saves time)
    if (step + 1) % 5 == 0:
        mx.eval(loss)  # Just eval the loss, not all params
    
    step_time = time.time() - step_start
    
    # Print progress with timing
    if (step + 1) % 10 == 0 or step == 0:
        tokens_per_sec = (input_batch.shape[0] * input_batch.shape[1]) / step_time if step > 0 else 0
        print(f"Step {step+1}/{max_steps} - Loss: {float(loss):.4f} - Time: {step_time:.2f}s - {tokens_per_sec:.0f} tok/s")

total_time = time.time() - start_time
print(f"\n‚úì Training completed in {total_time:.1f}s ({total_time/max_steps:.2f}s per step)")

# Save LoRA adapters
try:
    import numpy as np
    npz_path = artifacts_dir / 'lora_adapters.npz'
    
    # Extract LoRA parameters
    lora_params = {}
    for i, layer in enumerate(model.model.layers):
        # Save attention LoRA weights if they exist
        if hasattr(layer.self_attn.q_proj, 'lora_a'):
            lora_params[f'layer_{i}_q_lora_a'] = np.array(layer.self_attn.q_proj.lora_a)
            lora_params[f'layer_{i}_q_lora_b'] = np.array(layer.self_attn.q_proj.lora_b)
        if hasattr(layer.self_attn.k_proj, 'lora_a'):
            lora_params[f'layer_{i}_k_lora_a'] = np.array(layer.self_attn.k_proj.lora_a)
            lora_params[f'layer_{i}_k_lora_b'] = np.array(layer.self_attn.k_proj.lora_b)
        if hasattr(layer.self_attn.v_proj, 'lora_a'):
            lora_params[f'layer_{i}_v_lora_a'] = np.array(layer.self_attn.v_proj.lora_a)
            lora_params[f'layer_{i}_v_lora_b'] = np.array(layer.self_attn.v_proj.lora_b)
        if hasattr(layer.self_attn.o_proj, 'lora_a'):
            lora_params[f'layer_{i}_o_lora_a'] = np.array(layer.self_attn.o_proj.lora_a)
            lora_params[f'layer_{i}_o_lora_b'] = np.array(layer.self_attn.o_proj.lora_b)
    
    np.savez(str(npz_path), **lora_params)
    print(f"‚úì Saved LoRA adapters to {npz_path}")
except Exception as e:
    print(f"‚ö† Could not save LoRA adapters: {e}")

Training for 15 steps with learning rate 0.0002
Batch size: 10, Sequence length: 67
Note: First few steps will be slower due to JIT compilation

Step 1/15 - Loss: 17.9542 - Time: 0.05s - 0 tok/s
Step 1/15 - Loss: 17.9542 - Time: 0.05s - 0 tok/s
Step 10/15 - Loss: 7.5725 - Time: 7.93s - 84 tok/s
Step 10/15 - Loss: 7.5725 - Time: 7.93s - 84 tok/s

‚úì Training completed in 23.1s (1.54s per step)

‚úì Training completed in 23.1s (1.54s per step)
‚úì Saved LoRA adapters to /Users/markcastillo/git/learning-ml-to-llm/projects/phase3_llm_tuning/project16_mistral_tuning/artifacts/lora_adapters.npz
‚úì Saved LoRA adapters to /Users/markcastillo/git/learning-ml-to-llm/projects/phase3_llm_tuning/project16_mistral_tuning/artifacts/lora_adapters.npz


In [9]:
# 7) Compare generations before vs after tuning
print("\n" + "="*60)
print("AFTER TRAINING - Generation Comparison")
print("="*60)

for p in baseline_prompts:
    text = generate(model, tokenizer, prompt=p, max_tokens=64, verbose=False)
    print(f'\n--- Prompt ---\n{p}')
    print(f'\n--- Output ---\n{text}')


AFTER TRAINING - Generation Comparison

--- Prompt ---
Explain LoRA in one sentence.

--- Output ---
22222222222222222222222 return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return

--- Prompt ---
Explain LoRA in one sentence.

--- Output ---
22222222222222222222222 return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return return

--- Prompt ---
Write a Python function to add two numbers.

--- Output ---
22 return return return return return return return return return return return return return return return return return return return re

# Exercises & Extensions

## Warm-up

1. **Baseline Generation**: Run model on 3-5 prompts BEFORE fine-tuning. Save outputs. These will be your "before" samples for comparison.
2. **Dataset Inspection**: Print first 5 instruction/response pairs from the dataset. Are they diverse? Do they cover multiple tasks?
3. **LoRA Rank Sensitivity**: Train with lora_rank = 8, 16, 32, 64. For each, plot final loss and memory usage. What's the trade-off?

## Intermediate

4. **Fine-tuning Convergence**: Plot training loss vs. step. Does it decrease monotonically? Does validation loss follow similar trends? Where does overfitting start?
5. **Before/After Comparison**: Generate text with the same prompts before and after fine-tuning. Is the tuned model more instruction-following? Save side-by-side comparison.
6. **Learning Rate Scheduling**: Train with fixed lr, then try cosine decay or linear warmup. Which converges fastest to lowest validation loss?

## Advanced

7. **Adapter Analysis**: Extract and visualize LoRA weights (U and V matrices). Do they learn meaningful structure? Compute singular values‚Äîwhat's the effective rank?
8. **Domain Transfer**: Fine-tune on domain-specific instructions (e.g., medical Q&A, code generation). Does targeted tuning help? Test on out-of-domain prompts to measure generalization.
9. **Few-Shot Evaluation**: Create a small evaluation set with exact gold answers. Fine-tune with 1%, 10%, 100% of data. Plot exact-match accuracy vs. data fraction. How much data needed for reasonable performance?

---

# Summary & Bridge Forward

## What You Learned

- **Fine-Tuning vs. Pretraining**: Pretraining learns from billions of tokens (expensive); fine-tuning adapts to task on thousands of tokens (cheap)
- **LoRA (Low-Rank Adaptation)**: Train small rank-16 matrices instead of full 7B parameters; 99% fewer trainable parameters, same performance gains
- **Instruction Tuning**: Format data as "system + instruction + response"; model learns to follow instructions rather than just predict text
- **Quantization**: 4-bit reduces memory 4√ó; marginal quality loss for huge inference speedup
- **Adapter Merging**: LoRA weights stay separate; can save 100MB instead of 14GB; merge at inference time for speed

## Why This Matters

Fine-tuning is the **practical entry point to LLM development**:

1. **Cost**: Fine-tuning is 100-1000√ó cheaper than pretraining
   - Pretraining Mistral: weeks on 1000s of GPUs = millions of $
   - Fine-tuning Mistral: hours on 1 GPU = hundreds of $

2. **Speed**: Accessible to researchers, startups, enterprises
   - Democratizes LLM access
   - Enables rapid prototyping
   - Allows domain specialization (medical LLMs, code LLMs, etc.)

3. **Production Viability**:
   - Start with Mistral/GPT/LLaMA
   - Fine-tune on your data
   - Deploy as API or edge model
   - This is how most LLM products work today

## Bridge to Next Projects

- **Project 17 (Comparative Analysis)**: Compare outputs of base vs. tuned models systematically
  - Quantitative metrics: perplexity, exact match, BLEU/ROUGE
  - Qualitative analysis: does tuning improve instruction following?
  - When does tuning help? When does it hurt?

- **Further Work**:
  - **Scaling**: Tune larger models (13B, 34B, 70B)
  - **Evaluation**: Benchmark on published datasets (MMLU, HellaSwag, TruthfulQA)
  - **Deployment**: Package as API (LM Studio, vLLM, SGLang)
  - **Monitoring**: Track prompt-response patterns; detect drift

## Your Takeaway

> **Fine-tuning is practical LLM customization.** LoRA reduces trainable parameters from billions to millions while retaining performance. This makes LLM adaptation accessible and affordable, enabling rapid task-specific deployments.

---

# Performance Notes

- **LoRA Memory**: Full 7B parameters = 14GB (float16); LoRA rank-16 = 50-100MB overhead
- **4-Bit Quantization**: Reduces memory ~4√ó; quality loss minimal on instruction tasks
- **Training Speed**: ~100-200 tokens/second on M4 GPU with MLX
- **Convergence**: Instruction tuning typically converges in 1-5 epochs
- **Inference**: Base model ‚âà 10-20 tok/s; LoRA merged has same speed as base
- **Typical LoRA Rank**: 8-32 for most tasks; 64+ for complex domains

## Summary & Next Steps
**What we did:** Loaded Mistral 7B (quantized), built a tiny toy instruction dataset, optionally applied LoRA adapters (if available), and demonstrated a guarded training loop plus generation comparison.

**Why parameter-efficient tuning:** LoRA updates a small set of low-rank matrices instead of all billions of weights, drastically cutting memory & compute while retaining performance gains on the target domain.

**To continue:**
1. Set `DRY_RUN=False` and increase `max_steps` for a longer fine-tune.
2. Replace `toy_instructions.jsonl` with a real dataset (e.g. Alpaca, Dolly) formatted into instruction/response pairs.
3. Experiment with different `lora_rank`, `learning_rate`, and `quantize` settings.
4. Evaluate using perplexity or task benchmarks (e.g., few-shot QA).
5. Package LoRA adapters for distribution or merge & export a fully tuned model.

**Potential Improvements:**
- Add proper masking for system/user/assistant roles.
- Implement curriculum or mixed precision.
- Add evaluation harness tracking exact match / BLEU / Rouge depending on task.

Proceed to Project 17 for comparative analysis between base vs tuned outputs.