# Supervised Fine-Tuning

This notebook demonstrates how to fine-tune a language model. We'll use parameter-efficient fine-tuning (PEFT) techniques and memory optimization strategies.

## Cell 1: Install Required Dependencies

We install the necessary packages for model training, including Hugging Face transformers, PEFT for parameter-efficient fine-tuning, and bitsandbytes for quantization.

In [None]:
#%pip install transformers datasets peft bitsandbytes accelerate python-dotenv psutil

## Cell 2: Import Libraries

Import all required libraries including transformers for model loading, datasets for data handling, and PEFT for efficient fine-tuning.

In [None]:
import json
import torch
import os
from dotenv import load_dotenv
load_dotenv()
from datasets import Dataset, load_dataset
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM, 
    TrainingArguments, 
    Trainer,
    DataCollatorForLanguageModeling
)

from peft import get_peft_model, LoraConfig, TaskType
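# bitsandbytes is only needed for 4-bit loading and the paged optimizers. It may be
# missing on non-CUDA machines (e.g. macOS), so a failed import is tolerated here.
try:
    import bitsandbytes as bnb
except ImportError:
    bnb = None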


## Cell 2b: Hardware Detection & Configuration

Automatically detect hardware capabilities and configure training settings appropriately:
- **Windows/NVIDIA (4GB VRAM)**: Use 4-bit quantization, small batches, gradient checkpointing
- **Mac M3 Ultra (512GB unified)**: No quantization, larger batches, full precision available
- **CPU fallback**: Supported but slow

The notebook auto-detects available hardware and adjusts settings. You can override with environment variables in `.env`.
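The NVIDIA tiers above correspond to GPU-memory thresholds in `detect_hardware` (6 GB and 16 GB in the cell below). A minimal sketch of that mapping as a pure function follows; it can be used to check which tier a given card would fall into without a GPU present. `recommend_cuda_settings` is a hypothetical helper for illustration and is not part of the notebook.

```python
def recommend_cuda_settings(vram_gb: float) -> dict:
    """Map GPU memory (in GB) to settings, mirroring the thresholds in detect_hardware."""
    if vram_gb <= 6:
        # Small cards: quantize, smallest batch, paged 8-bit optimizer
        return {
            "use_quantization": True,
            "use_gradient_checkpointing": True,
            "recommended_batch_size": 1,
            "recommended_optimizer": "paged_adamw_8bit",
        }
    if vram_gb <= 16:
        # Mid-range cards: still quantize, slightly larger batch
        return {
            "use_quantization": True,
            "use_gradient_checkpointing": True,
            "recommended_batch_size": 2,
            "recommended_optimizer": "paged_adamw_32bit",
        }
    # Large cards: no quantization, no checkpointing
    return {
        "use_quantization": False,
        "use_gradient_checkpointing": False,
        "recommended_batch_size": 4,
        "recommended_optimizer": "adamw_torch",
    }


for vram in (4, 12, 24):
    settings = recommend_cuda_settings(vram)
    print(f"{vram:>2} GB -> batch {settings['recommended_batch_size']}, "
          f"optimizer {settings['recommended_optimizer']}")
```

Keeping the thresholds in one small function like this makes them easy to adjust if a model turns out to need more headroom than expected.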

In [None]:
import platform
import psutil

def detect_hardware():
    """Detect available hardware and return configuration dict."""
    config = {
        "device_type": "cpu",
        "device_name": "CPU",
        "total_memory_gb": psutil.virtual_memory().total / (1024**3),
        "use_quantization": True,
        "use_gradient_checkpointing": True,
        "recommended_batch_size": 1,
        "recommended_optimizer": "paged_adamw_8bit",
        "use_fp16": True,
        "use_bf16": False
    }
    
    # Check for CUDA (NVIDIA GPU)
    if torch.cuda.is_available():
        config["device_type"] = "cuda"
        config["device_name"] = torch.cuda.get_device_name(0)
        vram_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)
        config["gpu_memory_gb"] = vram_gb
        
        # Up to 6GB VRAM: aggressive optimization
        if vram_gb <= 6:
            config["use_quantization"] = True
            config["recommended_batch_size"] = 1
            config["recommended_optimizer"] = "paged_adamw_8bit"
        # 6-16GB VRAM: moderate optimization
        elif vram_gb <= 16:
            config["use_quantization"] = True
            config["recommended_batch_size"] = 2
            config["recommended_optimizer"] = "paged_adamw_32bit"
        # >16GB VRAM: minimal optimization
        else:
            config["use_quantization"] = False
            config["recommended_batch_size"] = 4
            config["recommended_optimizer"] = "adamw_torch"
            config["use_gradient_checkpointing"] = False
    
    # Check for Apple Silicon (MPS)
    elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
        config["device_type"] = "mps"
        config["device_name"] = f"Apple Silicon ({platform.processor()})"
        
        # M3 Ultra with 512GB unified memory - use full power
        if config["total_memory_gb"] > 256:
            config["use_quantization"] = False
            config["use_gradient_checkpointing"] = False
            config["recommended_batch_size"] = 8
            config["recommended_optimizer"] = "adamw_torch"
            config["use_fp16"] = False
            config["use_bf16"] = True  # Better for Apple Silicon
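        # Other M-series chips
        else:
            # bitsandbytes 4-bit loading has generally required CUDA, so this branch
            # loads unquantized bf16 weights and keeps gradient checkpointing on.
            config["use_quantization"] = False
            config["recommended_batch_size"] = 2
            config["recommended_optimizer"] = "adamw_torch"
            config["use_fp16"] = False
            config["use_bf16"] = True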
    
    # CPU fallback
    else:
        config["device_type"] = "cpu"
        config["use_quantization"] = False
        config["recommended_batch_size"] = 1
        config["recommended_optimizer"] = "adamw_torch"
        config["use_fp16"] = False
    
    return config

# Detect hardware and apply overrides from .env
hw_config = detect_hardware()

# Allow manual overrides from environment
auto_detect = os.getenv("AUTO_DETECT_HARDWARE", "true").lower() in ("true", "1", "yes")
if not auto_detect:
    hw_config["device_type"] = os.getenv("DEVICE_TYPE", hw_config["device_type"])
    force_quant = os.getenv("FORCE_QUANTIZATION", "").lower() in ("true", "1", "yes")
    if force_quant:
        hw_config["use_quantization"] = True

print("=" * 60)
print("HARDWARE CONFIGURATION")
print("=" * 60)
print(f"Device Type: {hw_config['device_type'].upper()}")
print(f"Device Name: {hw_config['device_name']}")
print(f"Total System Memory: {hw_config['total_memory_gb']:.1f} GB")
if "gpu_memory_gb" in hw_config:
    print(f"GPU Memory: {hw_config['gpu_memory_gb']:.1f} GB")
print(f"Use Quantization: {hw_config['use_quantization']}")
print(f"Gradient Checkpointing: {hw_config['use_gradient_checkpointing']}")
print(f"Recommended Batch Size: {hw_config['recommended_batch_size']}")
print(f"Recommended Optimizer: {hw_config['recommended_optimizer']}")
print(f"FP16: {hw_config['use_fp16']} | BF16: {hw_config['use_bf16']}")
print("=" * 60)

## Cell 3: Load and Prepare Dataset

Load JSONL files from a directory or a single file and convert into a Hugging Face Dataset. The function automatically detects whether `DATASET_PATH` points to a directory (loads all `.jsonl` files) or a single file.

**Behavior:**
- If `DATASET_PATH` is a directory: recursively loads every file ending in `.jsonl` (this includes files named `*.train.jsonl`), in sorted order
- If `DATASET_PATH` is a file: loads that single JSONL file
- Combines all data into a single unified dataset
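For reference, here is a minimal sketch of the record format this loader expects, assuming the `instruction` and `output` field names that Cell 6 reads. It writes two records to a temporary `.jsonl` file and reads them back line by line. The helper `read_jsonl` and the sample rows are illustrative only; unlike the notebook's loader, the helper also reports the line number of any malformed record.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Read one JSON object per non-empty line; report the line number on bad JSON."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # blank lines are skipped, as in the notebook's loader
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({err})") from err
    return records


# Illustrative rows; real data would come from DATASET_PATH
rows = [
    {"instruction": "Translate to French: cat", "output": "chat"},
    {"instruction": "What is 2 + 2?", "output": "4"},
]

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.train.jsonl"
    # One JSON object per line, with a trailing blank line to show it is ignored
    sample.write_text("\n".join(json.dumps(r) for r in rows) + "\n\n", encoding="utf-8")
    loaded = read_jsonl(sample)

print(f"Loaded {len(loaded)} records; keys: {sorted(loaded[0])}")
```

Every record should carry the same keys, since the combined list is later converted into a single `Dataset`.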

In [None]:
from pathlib import Path

def load_jsonl_dataset(path):
    """Load JSONL file(s) and convert to Dataset format.
    
    Args:
        path: Can be a file path (loads single file) or directory path (loads all .jsonl files)
    
    Returns:
        Dataset: Combined dataset from all JSONL files found
    """
    data = []
    path = Path(path)
    
    if path.is_file():
        # Single file
        print(f"Loading single file: {path}")
        with open(path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if line:
                    data.append(json.loads(line))
    elif path.is_dir():
        # Directory - load all .jsonl files
        jsonl_files = sorted(path.rglob('*.jsonl'))
        if not jsonl_files:
            raise ValueError(f"No .jsonl files found in directory: {path}")
        
        print(f"Found {len(jsonl_files)} JSONL files in {path}")
        for jsonl_file in jsonl_files:
            print(f"  Loading: {jsonl_file.name}")
            with open(jsonl_file, 'r', encoding='utf-8') as f:
                for line in f:
                    line = line.strip()
                    if line:
                        data.append(json.loads(line))
    else:
        raise ValueError(f"Path does not exist: {path}")
    
    print(f"Total examples loaded: {len(data)}")
    return Dataset.from_list(data)

# Load from directory or single file
dataset_path = os.getenv("DATASET_PATH")
if not dataset_path:
    raise ValueError("DATASET_PATH not set in .env file")

dataset = load_jsonl_dataset(dataset_path)
print(f"\nDataset size: {len(dataset)}")
print(f"Sample example:\n{dataset[0]}")

## Cell 4: Initialize Model and Tokenizer

Load model with hardware-appropriate configuration:
- **Low memory (4GB)**: 4-bit quantization, aggressive optimization
- **High memory (M3 Ultra)**: No quantization; weights are loaded in bf16 on MPS for maximum performance

The quantization and device mapping are automatically configured based on detected hardware.
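To see why quantization matters on a 4 GB card, a rough back-of-the-envelope estimate of the memory taken by the weights alone can help. The sketch below multiplies the parameter count by the bytes per parameter for each storage format. It is an approximation under stated assumptions: it ignores activations, gradients, optimizer state, and the small overhead that 4-bit formats add for quantization constants. The 1.1 billion parameter count is taken from the default model's name.

```python
# Approximate storage cost per parameter, in bytes (assumption: no quantization overhead)
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "nf4": 0.5,  # 4 bits per weight
}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Estimate memory needed to hold the weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


num_params = 1.1e9  # roughly the size of the default TinyLlama model
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(num_params, dtype):.2f} GB")
```

Training needs noticeably more than these figures, which is why the low-memory path also enables gradient checkpointing and a paged optimizer.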

In [None]:
from transformers import BitsAndBytesConfig

# Load model name and sanitize
model_name = os.getenv("MODEL_NAME", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
if isinstance(model_name, str) and model_name.strip().startswith("MODEL_NAME="):
    model_name = model_name.split("=", 1)[1]
model_name = model_name.strip().strip('"').strip("'")

print(f"\n{'='*60}")
print(f"Loading Model: {model_name}")
print(f"{'='*60}")

# Prepare model loading arguments based on hardware
model_kwargs = {
    "trust_remote_code": False
}

# Configure quantization (only for low-memory systems)
if hw_config["use_quantization"]:
    print("✓ Using 4-bit quantization (low-memory mode)")
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16
    )
    model_kwargs["quantization_config"] = bnb_config
    model_kwargs["device_map"] = "auto"
else:
    print("✓ Loading unquantized model (high-memory mode)")
    # On MPS (Apple Silicon), skip device_map; the model is moved to the device after loading
    if hw_config["device_type"] == "mps":
        model_kwargs["torch_dtype"] = torch.float16 if hw_config["use_fp16"] else torch.bfloat16
    else:
        model_kwargs["device_map"] = "auto"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load model
model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)

# Apply memory optimizations based on hardware
if hw_config["use_gradient_checkpointing"]:
    print("✓ Gradient checkpointing enabled")
    model.gradient_checkpointing_enable()
    model.config.use_cache = False
else:
    print("✓ Gradient checkpointing disabled (high-memory mode)")

# Move to MPS if available and not using device_map
if hw_config["device_type"] == "mps" and "device_map" not in model_kwargs:
    model = model.to("mps")
    print("✓ Model moved to Apple Silicon (MPS)")

print(f"{'='*60}\n")

## Cell 5: Configure Parameter-Efficient Fine-Tuning (LoRA)

Set up LoRA configuration to drastically reduce trainable parameters while maintaining model performance.
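To make the reduction concrete: for a frozen linear layer with weight shape `(d_out, d_in)`, LoRA trains two small matrices of shapes `(r, d_in)` and `(d_out, r)`, which adds `r * (d_in + d_out)` trainable parameters. The sketch below counts them for the configuration used in this notebook. The layer dimensions are assumptions believed to match the default TinyLlama model (hidden size 2048, 22 layers, and a 256-wide value projection due to grouped-query attention); check the model's `config.json` before relying on them.

```python
def lora_params_per_layer(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


def lora_total_params(num_layers: int, target_shapes: dict, r: int) -> int:
    """Sum the LoRA parameters over every targeted projection in every layer."""
    per_block = sum(lora_params_per_layer(d_in, d_out, r)
                    for d_in, d_out in target_shapes.values())
    return num_layers * per_block


# Assumed shapes (d_in, d_out) for the two targeted projections
assumed_shapes = {
    "q_proj": (2048, 2048),
    "v_proj": (2048, 256),
}

total = lora_total_params(num_layers=22, target_shapes=assumed_shapes, r=32)
print(f"LoRA trainable parameters: {total:,}")
print(f"Share of a 1.1B-parameter model: {100 * total / 1.1e9:.2f}%")
```

The figure reported by `model.print_trainable_parameters()` in the next cell is the authoritative one; this calculation only shows where that number comes from.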

In [None]:
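# LoRA configuration for parameter-efficient training
from peft import prepare_model_for_kbit_training

# Gradients must flow to the adapters through the frozen base model.
if hw_config["use_quantization"]:
    # Prepares a 4-bit/8-bit model for training
    model = prepare_model_for_kbit_training(
        model, use_gradient_checkpointing=hw_config["use_gradient_checkpointing"]
    )
elif hw_config["use_gradient_checkpointing"]:
    # With checkpointing on a frozen base model, the inputs must require grad
    model.enable_input_require_grads()

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=32,  # Rank of the low-rank update matrices
    lora_alpha=64,  # Updates are scaled by lora_alpha / r = 2
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"]  # Query and value projections in each attention layer
)

# Wrap model with LoRA adapters
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()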

## Cell 6: Preprocess Dataset

Tokenize the dataset and format it for causal language modeling. We concatenate prompt and completion for training.
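Padding interacts with the loss: padded positions should not contribute to it, and the usual convention is to set their labels to `-100`, the index that PyTorch's cross-entropy loss ignores. The sketch below illustrates this with plain Python lists, assuming right-side padding and made-up token IDs. Because this notebook sets the pad token equal to the EOS token, masking by token ID would also hide the real end-of-sequence token, so the sketch masks by attention mask instead. `pad_and_mask` is a hypothetical helper, not the notebook's preprocessing function.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss


def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate/pad to max_length and mask labels at padded positions only."""
    ids = list(token_ids[:max_length])
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad
    # Labels copy the inputs; padded positions are replaced by IGNORE_INDEX
    labels = [tok if keep == 1 else IGNORE_INDEX
              for tok, keep in zip(input_ids, attention_mask)]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


EOS = 2  # assumed EOS id, reused as the pad id as in this notebook
example = pad_and_mask([5, 6, 7, EOS], max_length=6, pad_id=EOS)
for key, value in example.items():
    print(f"{key:>14}: {value}")
```

In this example the real EOS at position 3 keeps its label, while the two padded positions are ignored by the loss.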

In [None]:
def preprocess_function(examples):
    """Tokenize text and prepare for causal language modeling"""
    PROMPT_COLUMN = "instruction"
    COMPLETION_COLUMN = "output"
    texts = [
        f"{prompt} {completion}{tokenizer.eos_token}"
        for prompt, completion in zip(examples[PROMPT_COLUMN], examples[COMPLETION_COLUMN])
    ]
    
    # Pad every example to max_length (512 tokens) so that all examples have the
    # same length and can be stacked into batches without further padding.
    model_inputs = tokenizer(
        texts,
        max_length=512,
        truncation=True,
        padding="max_length"
    )
    
    # For causal language modeling the labels are a copy of the input IDs.
    # Note: in current transformers releases, DataCollatorForLanguageModeling (Cell 8)
    # rebuilds labels from input_ids and sets pad-token positions to -100.
    model_inputs["labels"] = model_inputs["input_ids"].copy()
    
    return model_inputs

# Apply preprocessing
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset.column_names
)

## Cell 7: Configure Training Arguments

Set up training parameters optimized for limited memory (small batch sizes, gradient accumulation, and memory-saving options). These values are read from environment variables (see `.env` or `.env.example`) and fall back to safe defaults if not set.

Important behavior:
- The notebook reads training settings from `.env` (for example: `OUTPUT_DIR`, `MODEL_NAME`, `PER_DEVICE_TRAIN_BATCH_SIZE`, `GRADIENT_ACCUMULATION_STEPS`, `NUM_TRAIN_EPOCHS`, `LEARNING_RATE`, `FP16`, `LOGGING_STEPS`, `SAVE_STEPS`, `SAVE_TOTAL_LIMIT`, `OPTIMIZER`).
- Before training, the notebook automatically creates a **model-specific output directory** at `OUTPUT_DIR/<sanitized MODEL_NAME>` (slashes and colons in `MODEL_NAME` are replaced with `_`), and `TrainingArguments.output_dir` is set to that path.

Verification & workflow:
- After changing `.env`, **restart the kernel** and re-run Cell 2 (which calls `load_dotenv()`), then re-run this cell and the diagnostic cell immediately after it to confirm the resolved settings and final output directory.
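Batch size and gradient accumulation together determine how many optimizer updates a run performs. A minimal sketch of that arithmetic follows, assuming a single device. The numbers are illustrative, and the `Trainer` may round the final partial step differently depending on the library version.

```python
import math


def training_steps(num_examples, per_device_batch_size, grad_accum_steps,
                   num_epochs, num_devices=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs


# Illustrative run: 1,000 examples with the low-memory defaults (batch 1, accumulation 8)
effective, total_steps = training_steps(
    num_examples=1000, per_device_batch_size=1, grad_accum_steps=8, num_epochs=3
)
print(f"Effective batch size: {effective}")
print(f"Approximate optimizer steps: {total_steps}")
```

Comparing the step count against `SAVE_STEPS` and `LOGGING_STEPS` shows how many checkpoints and log lines a run will produce.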

In [None]:
# Memory-efficient training arguments (load from env with safe defaults)
# Create a model-specific output directory under OUTPUT_DIR (e.g. ./results/TinyLlama_TinyLlama-1.1B-Chat-v1.0)
base_output = os.getenv("OUTPUT_DIR", "./results")
model_name_env = model_name  # Name resolved and cleaned in Cell 4, so the directory matches the loaded model
# Sanitize model name to safe directory name
sanitized_model_name = model_name_env.replace("/", "_").replace(":", "_")
model_output_dir = os.path.join(base_output, sanitized_model_name)

# Ensure directory exists
os.makedirs(model_output_dir, exist_ok=True)

# Apply hardware-aware defaults with .env overrides
# If user sets value in .env, use that; otherwise use hardware recommendation
def get_env_or_default(key, hw_default):
    """Get env var or use hardware-recommended default."""
    env_val = os.getenv(key)
    if env_val is None:
        return hw_default
    # Parse based on type
    if isinstance(hw_default, bool):
        return env_val.lower() in ("1", "true", "yes")
    elif isinstance(hw_default, int):
        return int(env_val)
    elif isinstance(hw_default, float):
        return float(env_val)
    return env_val

# Determine precision settings
use_fp16 = get_env_or_default("FP16", hw_config["use_fp16"])
use_bf16 = get_env_or_default("BF16", hw_config["use_bf16"])

# MPS doesn't support fp16 in training args - use bf16 instead
if hw_config["device_type"] == "mps" and use_fp16:
    print("⚠ MPS detected: switching from FP16 to BF16 for training")
    use_fp16 = False
    use_bf16 = True

training_args = TrainingArguments(
    output_dir=model_output_dir,
    per_device_train_batch_size=get_env_or_default("PER_DEVICE_TRAIN_BATCH_SIZE", hw_config["recommended_batch_size"]),
    gradient_accumulation_steps=get_env_or_default("GRADIENT_ACCUMULATION_STEPS", 8),
    num_train_epochs=get_env_or_default("NUM_TRAIN_EPOCHS", 3),
    learning_rate=get_env_or_default("LEARNING_RATE", 2e-4),
    fp16=use_fp16,
    bf16=use_bf16,
    logging_steps=get_env_or_default("LOGGING_STEPS", 10),
    save_steps=get_env_or_default("SAVE_STEPS", 100),
    save_total_limit=get_env_or_default("SAVE_TOTAL_LIMIT", 2),
    report_to=os.getenv("REPORT_TO", "none"),
    dataloader_pin_memory=get_env_or_default("DATALOADER_PIN_MEMORY", False),
    remove_unused_columns=get_env_or_default("REMOVE_UNUSED_COLUMNS", False),
    optim=get_env_or_default("OPTIMIZER", hw_config["recommended_optimizer"])
)

print(f"\n{'='*60}")
print("TRAINING CONFIGURATION")
print(f"{'='*60}")
print(f"Output Directory: {model_output_dir}")
print(f"Batch Size: {training_args.per_device_train_batch_size}")
print(f"Gradient Accumulation: {training_args.gradient_accumulation_steps}")
print(f"Effective Batch Size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"Epochs: {training_args.num_train_epochs}")
print(f"Learning Rate: {training_args.learning_rate}")
print(f"Optimizer: {training_args.optim}")
print(f"FP16: {training_args.fp16} | BF16: {training_args.bf16}")
print(f"{'='*60}\n")

In [None]:
# Quick check: settings actually resolved into training_args (after .env overrides
# and hardware-aware defaults), plus the model output directory
resolved_training_args = {
    "output_dir": training_args.output_dir,
    "per_device_train_batch_size": training_args.per_device_train_batch_size,
    "gradient_accumulation_steps": training_args.gradient_accumulation_steps,
    "num_train_epochs": training_args.num_train_epochs,
    "learning_rate": training_args.learning_rate,
    "fp16": training_args.fp16,
    "bf16": training_args.bf16,
    "logging_steps": training_args.logging_steps,
    "save_steps": training_args.save_steps,
    "save_total_limit": training_args.save_total_limit,
    "report_to": training_args.report_to,
    "dataloader_pin_memory": training_args.dataloader_pin_memory,
    "remove_unused_columns": training_args.remove_unused_columns,
    "optim": str(getattr(training_args.optim, "value", training_args.optim))
}

print("Model name:", model_name_env)
print("Sanitized model name:", sanitized_model_name)
print("Model output directory:", model_output_dir)
print("Resolved training args:")
print(json.dumps(resolved_training_args, indent=2))


## Output directories & run organization

- **Base output directory** is configured via `.env` using **OUTPUT_DIR** (default `./results`).
- The notebook creates a **model-specific** subdirectory automatically: **OUTPUT_DIR/<sanitized MODEL_NAME>**.
  - The notebook sanitizes `MODEL_NAME` by replacing `/` and `:` with `_` (e.g. `TinyLlama/TinyLlama-1.1B-Chat-v1.0` → `TinyLlama_TinyLlama-1.1B-Chat-v1.0`).
- This subdirectory is created before training starts so all Trainer checkpoints and logs go into that folder.
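The directory naming rule above can be sketched as a small function. `model_output_path` is a hypothetical name used for illustration; the notebook performs the same two replacements inline in Cell 7.

```python
import os


def model_output_path(base_output: str, model_name: str) -> str:
    """Build OUTPUT_DIR/<sanitized MODEL_NAME>, replacing '/' and ':' with '_'."""
    sanitized = model_name.replace("/", "_").replace(":", "_")
    return os.path.join(base_output, sanitized)


path = model_output_path("./results", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
print(path)
```

Replacing the slash keeps a Hugging Face repository ID such as `org/model` from being interpreted as a nested directory.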

Recommended model candidates (start with one at a time):
- **Llama 3.2 (3B)**: `meta-llama/Llama-3.2-3B-Instruct`
  - Good quality for instruction-following. The repository is gated, so it requires a Hugging Face access token with approved access. Best starting point for the 3B family.
- **RedPajama-INCITE (3B)**: `togethercomputer/RedPajama-INCITE-Chat-3B-v1`
  - Open alternative with similar scale. It uses the GPT-NeoX architecture, so the LoRA `target_modules` in Cell 5 would need to change (for example to `["query_key_value"]`).
- **Mistral (3B family)**: use a Mistral 3B variant from Hugging Face (e.g., `mistralai/*`).
  - Competitive performance; pick an official 3B HF repo if available.

Tips when trying a larger model:
- Use 4-bit quantization (`BitsAndBytesConfig`) and set `bnb_4bit_compute_dtype=torch.float16`.
- Load with `device_map='auto'` and allow CPU offload so the bigger model can use remaining host RAM.
- Enable `gradient_checkpointing` and set `model.config.use_cache = False` to reduce peak memory during training.

How to test a candidate:
1. Update `MODEL_NAME` in `.env` to the chosen model (e.g., `MODEL_NAME=meta-llama/Llama-3.2-3B-Instruct`).
2. Restart the kernel, re-run Cell 2 (which loads `.env`), then re-run Cell 4 to confirm that the model loads, followed by Cell 7 and the diagnostic cell to check that the model output directory and settings resolve correctly.
3. If the model fails to load on your laptop, revert to a smaller candidate or consider using a remote GPU (recommended for >3B models).

## Cell 8: Initialize Trainer and Start Training

Create the trainer with our model, dataset, and training configuration, then start the fine-tuning process.

In [None]:
# Data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Causal language modeling
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)
 
# Start training
trainer.train()

## Cell 9: Save the Fine-Tuned Model

Save the trained LoRA adapters and tokenizer for later use. The base model weights are not saved, which keeps the saved output small.

In [None]:
# Save only the LoRA adapters (not the full model)
# Create model-specific subdirectory under fine-tuned-model
fine_tuned_base = "./fine-tuned-model"
fine_tuned_model_dir = os.path.join(fine_tuned_base, sanitized_model_name)
os.makedirs(fine_tuned_model_dir, exist_ok=True)

model.save_pretrained(fine_tuned_model_dir)
tokenizer.save_pretrained(fine_tuned_model_dir)
print(f"Model saved successfully to: {fine_tuned_model_dir}")

## Cell 10: Test the Fine-Tuned Model

Test the model with a sample prompt to verify the fine-tuning results.

In [None]:
# Test the model
model.eval()
prompt = os.getenv("TEST_PROMPT", "Tell me a joke about cats.")
print(f"Using TEST_PROMPT: {prompt!r}")

# Send inputs to whichever device holds the model's parameters. This also covers
# models placed by device_map="auto" or a manual DEVICE_TYPE override.
device = next(model.parameters()).device

inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        do_sample=True
    )
    
print(tokenizer.decode(outputs[0], skip_special_tokens=True))