# üéì Fine-Tuning LLMs with LoRA on Apple Silicon

## What You'll Learn
- **LoRA (Low-Rank Adaptation)**: Train a 1B parameter model on a laptop
- **Quantization**: Run models 4x smaller with 4-bit precision
- **MLX-LM**: Apple's unified interface for LLMs
- **Production Ready**: Chat templates and modern best practices (2025)

## The Problem: Traditional Fine-Tuning is Expensive

Training a 7B parameter model requires:
- **Full Fine-Tuning**: 80GB+ VRAM, $10,000+ GPU
- **LoRA**: 16GB RAM, Your MacBook (Free!)

## The Solution: LoRA

Instead of updating all 7 billion parameters, LoRA:
1. **Freezes** the original model weights
2. **Adds** small "adapter" matrices (rank-deficient)
3. **Trains** only these adapters (<1% of parameters)

Result: 99% of the performance, 1% of the memory.

---

In [1]:
# Install mlx-lm if not already installed
!pip install mlx-lm

Collecting mlx-lm
  Downloading mlx_lm-0.28.4-py3-none-any.whl.metadata (9.4 kB)
  Downloading mlx_lm-0.28.4-py3-none-any.whl.metadata (9.4 kB)
Collecting sentencepiece (from mlx-lm)
  Using cached sentencepiece-0.2.1-cp313-cp313-macosx_11_0_arm64.whl.metadata (10 kB)
Collecting sentencepiece (from mlx-lm)
  Using cached sentencepiece-0.2.1-cp313-cp313-macosx_11_0_arm64.whl.metadata (10 kB)
Collecting protobuf (from mlx-lm)
Collecting protobuf (from mlx-lm)
  Using cached protobuf-6.33.2-cp39-abi3-macosx_10_9_universal2.whl.metadata (593 bytes)
  Using cached protobuf-6.33.2-cp39-abi3-macosx_10_9_universal2.whl.metadata (593 bytes)
Downloading mlx_lm-0.28.4-py3-none-any.whl (323 kB)
Downloading mlx_lm-0.28.4-py3-none-any.whl (323 kB)
Using cached protobuf-6.33.2-cp39-abi3-macosx_10_9_universal2.whl (427 kB)
Using cached sentencepiece-0.2.1-cp313-cp313-macosx_11_0_arm64.whl (1.3 MB)
Using cached protobuf-6.33.2-cp39-abi3-macosx_10_9_universal2.whl (427 kB)
Using cached sentencepiece-0.2

In [2]:
import sys
from mlx_lm import load, generate
from mlx_nlp_utils import print_device_info

print_device_info()


üñ•Ô∏è  Hardware Acceleration Check:
   Device: Device(gpu, 0)
   ‚úÖ Using Apple Silicon GPU (Metal)
   ‚ÑπÔ∏è  MLX automatically optimizes for the GPU's Unified Memory.
   ‚ÑπÔ∏è  Note: While Apple Silicon has an NPU (Neural Engine), MLX primarily
       uses the powerful GPU for general-purpose training tasks like LSTMs.


## 1. Load a Pre-Trained Model

We will use a small but capable model like **Mistral-7B** or **TinyLlama** (depending on your RAM). MLX handles the downloading automatically from Hugging Face.

In [3]:
# Load model and tokenizer
# Using Llama-3.2-1B-Instruct (State of the art small model as of late 2024/2025)
model_name = "mlx-community/Llama-3.2-1B-Instruct-4bit"
print(f"Loading {model_name}...")

model, tokenizer = load(model_name)

print("‚úÖ Model loaded!")

Loading mlx-community/Llama-3.2-1B-Instruct-4bit...


Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/695M [00:00<?, ?B/s]

‚úÖ Model loaded!


## 2. Test Base Model

Let's see how it performs *before* fine-tuning.

In [4]:
# Use the tokenizer's chat template (Modern Best Practice)
messages = [
    {"role": "user", "content": "What is the capital of France?"}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(f"Formatted Prompt:\n{prompt}\n{'='*20}")

response = generate(model, tokenizer, prompt=prompt, verbose=True)

Formatted Prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 09 Dec 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


The capital ofThe capital of France is Paris France is Paris.
Prompt: 42 tokens, 101.670 tokens-per-sec
Generation: 8 tokens, 280.702 tokens-per-sec
Peak memory: 0.755 GB
.
Prompt: 42 tokens, 101.670 tokens-per-sec
Generation: 8 tokens, 280.702 tokens-per-sec
Peak memory: 0.755 GB


## 3. Prepare Data for Fine-Tuning

We need data in a specific format (JSONL) for `mlx-lm`. Let's convert our Intent/Sentiment data.

In [5]:
import json
from pathlib import Path

# 1. Load our existing Intent Classification data
data_path = Path('../data/intent_samples/data.json')

if not data_path.exists():
    print("‚ö†Ô∏è Data not found. Please run: python ../scripts/download_datasets.py --samples")
else:
    with open(data_path, 'r') as f:
        raw_data = json.load(f)
    
    texts = raw_data['texts']
    labels = raw_data['labels']
    
    # 2. Convert to Chat Format (JSONL)
    # We want the model to learn to classify intents.
    # User: "Turn on the lights" -> Assistant: "intent: command"
    
    chat_data = []
    for text, label in zip(texts, labels):
        entry = {
            "messages": [
                {"role": "user", "content": f"Classify the intent of this text: '{text}'"},
                {"role": "assistant", "content": f"intent: {label}"}
            ]
        }
        chat_data.append(entry)
    
    # 3. Save as train.jsonl and valid.jsonl
    # Split 80/20
    split_idx = int(len(chat_data) * 0.8)
    train_data = chat_data[:split_idx]
    valid_data = chat_data[split_idx:]
    
    # Ensure data directory exists
    Path('data').mkdir(exist_ok=True)
    
    def save_jsonl(data, filename):
        with open(filename, 'w') as f:
            for entry in data:
                json.dump(entry, f)
                f.write('\n')
        print(f"Saved {len(data)} examples to {filename}")

    save_jsonl(train_data, 'data/train.jsonl')
    save_jsonl(valid_data, 'data/valid.jsonl')
    
    print("\n‚úÖ Data prepared for MLX LoRA!")
    print("   Format: JSONL (Chat format)")

Saved 128 examples to data/train.jsonl
Saved 32 examples to data/valid.jsonl

‚úÖ Data prepared for MLX LoRA!
   Format: JSONL (Chat format)


## 4. Run Fine-Tuning (LoRA)

We can use the `mlx_lm.lora` command line tool or API to train.

In [None]:
# Training Configuration
# We use the CLI tool provided by mlx-lm
# --batch-size 4: Fits easily on 8GB/16GB RAM
# --num-layers 16: Target more layers for better quality
# --iters 600: Enough for a small dataset

print("üìã Training Command (for reference):")
print("="*60)

cmd = """cd notebooks && python -m mlx_lm lora \\
    --model mlx-community/Llama-3.2-1B-Instruct-4bit \\
    --train \\
    --data data \\
    --batch-size 4 \\
    --num-layers 16 \\
    --iters 600 \\
    --learning-rate 1e-4 \\
    --adapter-path ./adapters"""

print(cmd)
print("="*60)
print("\nüí° Recommendation: Run the next cell to train in the notebook!")
print("   (It handles paths automatically)")

üìã Training Command (for reference):
cd notebooks && python -m mlx_lm.lora \
    --model mlx-community/Llama-3.2-1B-Instruct-4bit \
    --train \
    --data data \
    --batch-size 4 \
    --lora-layers 16 \
    --iters 600 \
    --learning-rate 1e-4 \
    --adapter-path ./adapters

üí° Recommendation: Run the next cell to train in the notebook!
   (It handles paths automatically)


In [None]:
# Run LoRA Training (5-10 minutes on M1/M2/M3)
import os
import sys
from pathlib import Path

# Disable tokenizer parallelism warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print("üöÄ Starting LoRA Fine-Tuning...")
print("="*60)

# 1. Setup Paths
current_dir = Path.cwd()

# Find project root and working directory
if (current_dir / "notebooks").exists():
    project_root = current_dir
    working_dir = current_dir / "notebooks"
elif current_dir.name == "notebooks":
    project_root = current_dir.parent
    working_dir = current_dir
else:
    project_root = current_dir
    working_dir = current_dir

data_dir = working_dir / "data"
adapter_dir = working_dir / "adapters"

print(f"üìç Project Root: {project_root}")
print(f"üìÇ Data Dir: {data_dir}")
print(f"üíæ Adapter Dir: {adapter_dir}")

# 2. Verify Data Exists
if not (data_dir / "train.jsonl").exists():
    print("\n‚ùå ERROR: 'train.jsonl' not found in data directory.")
    print("   Please run the 'Prepare Data' cell above first!")
else:
    print("\n‚úÖ Data found. Starting training...")
    print("="*60)
    
    # 3. Run Training Command
    # Using the updated mlx_lm CLI syntax (2025)
    !python -m mlx_lm lora \
        --model mlx-community/Llama-3.2-1B-Instruct-4bit \
        --train \
        --data "{data_dir}" \
        --batch-size 4 \
        --num-layers 16 \
        --iters 100 \
        --learning-rate 1e-4 \
        --adapter-path "{adapter_dir}"
    
    print("\n‚úÖ Training Complete!")
    print(f"   Adapters saved to: {adapter_dir}")

üöÄ Starting LoRA Fine-Tuning...
üìç Project Root: /Users/markcastillo/git/apple-mlx-tutorial
üìÇ Data Dir: /Users/markcastillo/git/apple-mlx-tutorial/notebooks/data
üíæ Adapter Dir: /Users/markcastillo/git/apple-mlx-tutorial/notebooks/adapters

‚úÖ Data found. Starting training...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Calling `python -m mlx_lm.lora...` directly is deprecated. Use `mlx_lm.lora...` or `python -m mlx_lm lora ...` instead.
usage: lora.py [-h] [--model MODEL] [--train] [--data DATA]
               [--fine-tune-type {lora,dora,full}]
               [--optimizer {adam,adamw,muon,sgd,adafactor}] [--mask-prompt]
               [--num-layers NUM_LAYERS] [--batch-size BATCH_SIZE]
               [--iters ITERS] [--val-batches VAL_BATCHES]
               [--learning-rate LEARNING_RATE]
               [--steps-per-report STEPS_PER_REPORT]
               [--steps-per-eval STEPS_PER_EVAL]
               [--grad-accumulation-steps GRAD_ACCUMULATION_STEPS]
               [--resume-adapter-file RESUME_ADAPTER_FILE]
               [--adapter-path ADAPTER_PATH] [--save-every SAVE_EVERY]
               [--test] [--test-batches TEST_BATCHES]
               [--max-seq-length MAX_SEQ_LENGTH] [-c CONFIG]
               [--grad-checkpoint] [--report-to REPORT_TO]
               [--project-name PROJECT_NAME] [

### What the Training Does

The LoRA training will:
1. Download the base model (if not cached)
2. Load your training/validation data
3. Add LoRA adapters to attention layers
4. Train for 600 iterations (~5-10 minutes on M1/M2)
5. Save the adapters to `./adapters`

**Note**: This is a demo. For production:
- Use more data (100+ examples minimum)
- Train longer (1000-5000 iterations)
- Tune hyperparameters (learning rate, rank, alpha)

## 5. Inference with Fine-Tuned Model

After training completes, we can load the base model + adapters and test it.

In [8]:
# Load the fine-tuned model (base + adapters) if available
from pathlib import Path

adapter_path = Path("./adapters")

if adapter_path.exists():
    print("‚úÖ Loading fine-tuned model (base + LoRA adapters)...")
    model_ft, tokenizer_ft = load(model_name, adapter_path=str(adapter_path))
    print("‚úÖ Fine-tuned model loaded!")
else:
    print("‚ö†Ô∏è  No adapters found. Will use base model only.")
    print("   (Run the training command first to create adapters)")
    model_ft, tokenizer_ft = model, tokenizer

# Side-by-side comparison
test_query = "Turn off the lights"

messages = [
    {"role": "user", "content": f"Classify the intent of this text: '{test_query}'"}
]
prompt_test = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

print(f"\nTest Query: '{test_query}'")
print("="*60)

print("\nüì¶ BASE MODEL:")
response_base = generate(model, tokenizer, prompt=prompt_test, max_tokens=50, verbose=False)
print(response_base)

if adapter_path.exists():
    print("\nüéØ FINE-TUNED MODEL:")
    response_ft = generate(model_ft, tokenizer_ft, prompt=prompt_test, max_tokens=50, verbose=False)
    print(response_ft)
else:
    print("\n‚ö†Ô∏è  Fine-tuned model not available (run training first)")

print("="*60)
print("\nüí° Expected Improvement:")
print("   - Base model: Generic response or incorrect intent")
print("   - Fine-tuned: Correctly identifies 'intent: command'")

‚úÖ Loading fine-tuned model (base + LoRA adapters)...


Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]

RuntimeError: [load_safetensors] Failed to open file adapters/adapters.safetensors

## 6. Compare: Base vs Fine-Tuned

Let's see the difference between the base model and our fine-tuned version.

In [None]:
# Test the fine-tuned model on intent classification
if not adapter_path.exists():
    print("‚ö†Ô∏è  No fine-tuned model available. Run training first.")
    print("   Using base model for testing.")

test_cases = [
    "Turn on the light",
    "What time is it",
    "Hello there",
    "Set a timer for 5 minutes",
    "How are you doing"
]

model_to_test = model_ft if adapter_path.exists() else model
tokenizer_to_test = tokenizer_ft if adapter_path.exists() else tokenizer
model_type = "Fine-Tuned" if adapter_path.exists() else "Base"

print(f"\nüß™ Testing {model_type} Model\n" + "="*50)

for test_text in test_cases:
    messages = [
        {"role": "user", "content": f"Classify the intent of this text: '{test_text}'"}
    ]
    
    prompt = tokenizer_to_test.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
    print(f"\nüìù Input: '{test_text}'")
    response = generate(model_to_test, tokenizer_to_test, prompt=prompt, max_tokens=50, verbose=False)
    print(f"ü§ñ Output: {response}")

print("\n" + "="*50)


üß™ Testing Fine-Tuned Model

üìù Input: 'Turn on the light'
ü§ñ Output: intent: question

üìù Input: 'What time is it'
ü§ñ Output: intent: question

üìù Input: 'Hello there'
ü§ñ Output: intent: greeting

üìù Input: 'Set a timer for 5 minutes'
ü§ñ Output: intent: question

üìù Input: 'Hello there'
ü§ñ Output: intent: greeting

üìù Input: 'Set a timer for 5 minutes'
ü§ñ Output: intent: question

üìù Input: 'How are you doing'
ü§ñ Output: intent: greeting

ü§ñ Output: intent: question

üìù Input: 'How are you doing'
ü§ñ Output: intent: greeting



## Further Experiments: Optimizing LoRA

Fine-tuning is an art as much as a science. Here are some experiments to deepen your understanding.

### Experiment: Finding the Optimal Rank
We used `lora_rank=8`. What happens if we change it?
- **Rank 4:** Faster, less memory, but might not learn complex patterns.
- **Rank 32:** Slower, more memory, potentially better quality.
- **Rank 64+:** Diminishing returns, high risk of overfitting.

In [None]:
# Experiment: Find optimal LoRA rank

def test_lora_ranks():
    """Compare different LoRA ranks on same task"""
    print("Testing LoRA Ranks...")
    
    ranks = [4, 8, 16, 32]
    
    # In a real experiment, we would loop through these ranks, 
    # train a model for each, and evaluate on a validation set.
    
    for rank in ranks:
        print(f"Training with Rank {rank}...")
        # train_lora(rank=rank)
        # score = evaluate()
        # print(f"Rank {rank} Score: {score}")
        pass
        
    print("General Rule: Rank 8 or 16 is usually the sweet spot for 7B models.")

# test_lora_ranks()

## üéâ Conclusion

You have now completed the journey from simple RNNs to fine-tuning modern LLMs on Apple Silicon!

**What's Next?**
- Build a RAG (Retrieval Augmented Generation) system.
- Deploy your fine-tuned model as an API.

## ‚ùì FAQ

**Q: LoRA vs. Full Fine-Tuning?**
A:
*   **Full Fine-Tuning:** Updates all weights. Requires massive VRAM (e.g., 80GB+ for 7B model).
*   **LoRA:** Updates <1% of weights. Runs on consumer hardware (e.g., 16GB MacBook). Performance is often 98-99% of full fine-tuning.

**Q: Can I fine-tune on my own emails?**
A: Yes! Just format them as JSONL: `{"messages": [{"role": "user", "content": "Subject: Meeting"}, {"role": "assistant", "content": "Hi team..."}]}`.

**Q: What is "Quantization" (4-bit)?**
A: It reduces the precision of weights from 16-bit (Float16) to 4-bit integers. This cuts memory usage by 4x with minimal loss in quality, allowing you to run a 7B model on a laptop.

## üí≠ Closing Thoughts

**The Commoditization of Intelligence**
We are entering an era where "Intelligence" is a downloadable package.
*   **2020:** Only OpenAI had GPT-3.
*   **2025:** You can run a model nearly as smart as GPT-3 on your laptop, fine-tuned on your private data, with no internet connection.

**Architectural Shift:**
The role of the AI Engineer is shifting from "Designing Architectures" (building LSTMs) to "Data Curation" (preparing high-quality datasets for fine-tuning). The model is a solved problem; the data is your competitive advantage.