# Supervised Fine-Tuning

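This notebook demonstrates how to fine-tune a language model with supervised fine-tuning (SFT). We'll use parameter-efficient fine-tuning (PEFT) with LoRA adapters, plus memory optimizations such as 4-bit quantization and gradient checkpointing.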

## Cell 1: Install Required Dependencies

We install the necessary packages for model training, including Hugging Face transformers, PEFT for parameter-efficient fine-tuning, and bitsandbytes for quantization.

In [None]:
#%pip install transformers datasets peft bitsandbytes accelerate python-dotenv

## Cell 2: Import Libraries

Import all required libraries including transformers for model loading, datasets for data handling, and PEFT for efficient fine-tuning.

In [None]:
import json
import torch
import os
from dotenv import load_dotenv
load_dotenv()
from datasets import Dataset, load_dataset
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM, 
    TrainingArguments, 
    Trainer,
    DataCollatorForLanguageModeling
)

from peft import get_peft_model, LoraConfig, TaskType
import bitsandbytes as bnb


## Cell 3: Load and Prepare Dataset

Load the JSONL file and convert it into a Hugging Face `Dataset`. Each line must be a JSON object with `instruction` and `output` fields, which are the column names the preprocessing code in Cell 6 reads.

In [None]:
def load_jsonl_dataset(file_path):
    """Load JSONL file and convert to Dataset format"""
    data = []
    with open(file_path, 'r') as f:
        for line in f:
            data.append(json.loads(line))
    return Dataset.from_list(data)

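# Path to the JSONL training file, read from DATASET_PATH in .env
dataset_path = os.getenv("DATASET_PATH")
if not dataset_path:
    raise ValueError("Set DATASET_PATH in .env to the path of your JSONL file")
dataset = load_jsonl_dataset(dataset_path)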
print(f"Dataset size: {len(dataset)}")
print(dataset[0])
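
Before loading a large file, it can help to check that every line parses and carries the expected fields. The following is a minimal sketch, assuming the `instruction` and `output` field names that the preprocessing code in Cell 6 reads; `find_invalid_records` and the two sample records are hypothetical and exist only for illustration.

```python
import json
import os
import tempfile

# Assumed schema: these two field names match what Cell 6 reads.
REQUIRED_FIELDS = ("instruction", "output")

def find_invalid_records(file_path, required=REQUIRED_FIELDS):
    """Return (line_number, reason) pairs for lines the loader could not use."""
    problems = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            missing = [key for key in required if key not in record]
            if missing:
                problems.append((line_number, f"missing fields: {missing}"))
    return problems

# Two-line sample file: the second record lacks "output".
sample_lines = [
    json.dumps({"instruction": "Translate 'cat' to French.", "output": "chat"}),
    json.dumps({"instruction": "Say hello."}),
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    sample_path = tmp.name

sample_problems = find_invalid_records(sample_path)
os.remove(sample_path)
print(sample_problems)
```

Running a check like this on `dataset_path` before `load_jsonl_dataset` reports bad lines by number instead of stopping at the first failure.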

## Cell 4: Initialize Model and Tokenizer

Load a small model that fits in roughly 4 GB of memory. The weights are quantized to 4-bit NF4 at load time via bitsandbytes to reduce memory usage.

In [None]:
from transformers import BitsAndBytesConfig

# Configuration for 4-bit quantization (use float16 for consumer GPUs)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

# Load model and tokenizer (small model for limited memory)
model_name = os.getenv("MODEL_NAME", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Sanitize accidental values and whitespace (handles cases like 'MODEL_NAME=...')
if isinstance(model_name, str) and model_name.strip().startswith("MODEL_NAME="):
    model_name = model_name.split("=", 1)[1]
model_name = model_name.strip().strip('"').strip("'")

print("Using MODEL_NAME:", repr(model_name))

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # allow offload if needed
    trust_remote_code=False
)

# Memory-savers
model.gradient_checkpointing_enable()
model.config.use_cache = False
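
To see why 4-bit loading matters, a back-of-the-envelope estimate of weight storage is useful. This is a rough sketch only: it counts the weights alone and ignores quantization constants, activations, optimizer state, and CUDA overhead. The 1.1 billion parameter count is taken from the default model's name, and `estimate_weight_memory_gb` is a hypothetical helper.

```python
def estimate_weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

tinyllama_params = 1.1e9  # nominal size taken from the model name

# Compare common storage precisions for the same parameter count.
for label, bits in (("float16", 16), ("int8", 8), ("nf4", 4)):
    size_gb = estimate_weight_memory_gb(tinyllama_params, bits)
    print(f"{label:>8}: {size_gb:.2f} GB")
```

Actual usage during training is higher than these figures, which is why the cell also enables gradient checkpointing and disables the KV cache.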


## Cell 5: Configure Parameter-Efficient Fine-Tuning (LoRA)

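Set up LoRA so that only small low-rank adapter matrices on the attention projections are trained while the base model weights stay frozen. This cuts the number of trainable parameters to a small fraction of the full model.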

In [None]:
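from peft import prepare_model_for_kbit_training

# Prepare the quantized model for training: freezes the base weights, upcasts the
# remaining half-precision parameters (such as layer norms) to float32, and makes
# the inputs require gradients so that gradient checkpointing works
model = prepare_model_for_kbit_training(model)

# LoRA configuration for parameter-efficient training
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,  # Rank of the low-rank update matrices
    lora_alpha=32,  # The adapter update is scaled by lora_alpha / r
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"]  # Attention query and value projections
)

# Wrap model with LoRA adapters
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()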
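
The number reported by `print_trainable_parameters()` can be sanity-checked by hand: a LoRA adapter on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` parameters. The sketch below assumes TinyLlama-1.1B layer shapes that are not stated in this notebook (hidden size 2048, 22 layers, key/value width 256); check them against `model.config` before relying on the result.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)

# Assumed TinyLlama-1.1B shapes (not stated in this notebook; verify with model.config).
hidden_size = 2048
num_layers = 22
kv_dim = 256  # grouped-query attention: assumed 4 key/value heads of 64 dims each

rank = 8
lora_alpha = 32

per_layer = (
    lora_param_count(hidden_size, hidden_size, rank)  # q_proj adapter
    + lora_param_count(hidden_size, kv_dim, rank)     # v_proj adapter
)
total_trainable = per_layer * num_layers
scaling = lora_alpha / rank  # the adapter update is multiplied by this factor

print(f"Trainable LoRA parameters: {total_trainable:,}")
print(f"LoRA scaling factor: {scaling}")
```

If the printed count differs from what PEFT reports, the assumed shapes are the first thing to check.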

## Cell 6: Preprocess Dataset

Tokenize the dataset and format it for causal language modeling. Each training text is the prompt (`instruction`) followed by the completion (`output`) and the EOS token.

In [None]:
def preprocess_function(examples):
    """Tokenize text and prepare for causal language modeling"""
    # Combine prompt and completion
    PROMPT_COLUMN = "instruction"
    COMPLETION_COLUMN = "output"
    texts = [
        f"{prompt} {completion}{tokenizer.eos_token}"
        for prompt, completion in zip(examples[PROMPT_COLUMN], examples[COMPLETION_COLUMN])
    ]
    
    # Tokenize with truncation
    model_inputs = tokenizer(
        texts,
        max_length=512,
        truncation=True,
        padding=False
    )
    
    # Do not build labels here: the causal-LM data collator in Cell 8 copies
    # input_ids into labels (with padding masked as -100), and the model shifts
    # them by one position internally when it computes the loss.
    
    return model_inputs

# Apply preprocessing to dataset
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset.column_names
)
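
For causal language modeling, labels are normally a copy of `input_ids`; Hugging Face causal LM models shift them by one position internally, so no manual shift is needed. The sketch below illustrates that idea on plain Python lists, including the optional variant that masks prompt tokens so only the completion contributes to the loss. The token IDs and both helper functions are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def build_labels(input_ids, prompt_length=0):
    """Copy input_ids into labels, optionally masking the first prompt_length tokens."""
    labels = list(input_ids)
    for i in range(min(prompt_length, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

def shifted_pairs(input_ids, labels):
    """Mimic the internal shift: the token at position t predicts the label at t + 1."""
    return [
        (input_ids[t], labels[t + 1])
        for t in range(len(input_ids) - 1)
        if labels[t + 1] != IGNORE_INDEX
    ]

# Hypothetical token IDs: three prompt tokens, two completion tokens, then EOS (2).
example_ids = [101, 102, 103, 201, 202, 2]

full_labels = build_labels(example_ids)
completion_only = build_labels(example_ids, prompt_length=3)
trained_pairs = shifted_pairs(example_ids, completion_only)

print(full_labels)
print(completion_only)
print(trained_pairs)
```

Note that `DataCollatorForLanguageModeling` rebuilds labels from `input_ids`, so prompt masking like this only takes effect with a collator that preserves and pads an existing `labels` column, such as `DataCollatorForSeq2Seq`.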

## Cell 7: Configure Training Arguments

Set up training parameters optimized for limited memory (small batch sizes, gradient accumulation, and memory-saving options). These values are read from environment variables (see `.env` or `.env.example`) and fall back to safe defaults if not set.

Important behavior:
- The notebook reads training settings from `.env` (for example: `OUTPUT_DIR`, `MODEL_NAME`, `PER_DEVICE_TRAIN_BATCH_SIZE`, `GRADIENT_ACCUMULATION_STEPS`, `NUM_TRAIN_EPOCHS`, `LEARNING_RATE`, `FP16`, `LOGGING_STEPS`, `SAVE_STEPS`, `SAVE_TOTAL_LIMIT`, `OPTIMIZER`).
- Before training, the notebook automatically creates a **model-specific output directory** at `OUTPUT_DIR/<sanitized MODEL_NAME>` (slashes and colons in `MODEL_NAME` are replaced with `_`), and `TrainingArguments.output_dir` is set to that path.

Verification & workflow:
- After changing `.env`, **restart the kernel** (by default, `load_dotenv()` does not overwrite variables that are already set in the running process) and re-run Cell 2, which calls `load_dotenv()`. Then re-run this cell and the diagnostic cell immediately after it to confirm the resolved settings and final output directory.

Timestamped run directories (e.g., `OUTPUT_DIR/<model>/2026-01-09_14-20-00`) and automatic retention of older runs are not implemented in this notebook.

In [None]:
# Memory-efficient training arguments (load from env with safe defaults)
# Create a model-specific output directory under OUTPUT_DIR (e.g. ./results/TinyLlama_TinyLlama-1.1B-Chat-v1.0)
base_output = os.getenv("OUTPUT_DIR", "./results")
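# Use the same default as Cell 4 so the directory matches the model that is actually loaded
model_name_env = os.getenv("MODEL_NAME", "TinyLlama/TinyLlama-1.1B-Chat-v1.0").strip()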
# Sanitize model name to safe directory name
sanitized_model_name = model_name_env.replace("/", "_").replace(":", "_")
model_output_dir = os.path.join(base_output, sanitized_model_name)

# Ensure directory exists
os.makedirs(model_output_dir, exist_ok=True)

training_args = TrainingArguments(
    output_dir=model_output_dir,
    per_device_train_batch_size=int(os.getenv("PER_DEVICE_TRAIN_BATCH_SIZE", "1")),
    gradient_accumulation_steps=int(os.getenv("GRADIENT_ACCUMULATION_STEPS", "8")),
    num_train_epochs=int(os.getenv("NUM_TRAIN_EPOCHS", "3")),
    learning_rate=float(os.getenv("LEARNING_RATE", "2e-4")),
    fp16=(os.getenv("FP16", "True").lower() in ("1", "true", "yes")),
    logging_steps=int(os.getenv("LOGGING_STEPS", "10")),
    save_steps=int(os.getenv("SAVE_STEPS", "100")),
    save_total_limit=int(os.getenv("SAVE_TOTAL_LIMIT", "2")),
    report_to=os.getenv("REPORT_TO", "none"),
    dataloader_pin_memory=(os.getenv("DATALOADER_PIN_MEMORY", "False").lower() in ("1", "true", "yes")),
    remove_unused_columns=(os.getenv("REMOVE_UNUSED_COLUMNS", "False").lower() in ("1", "true", "yes")),
    optim=os.getenv("OPTIMIZER", "paged_adamw_8bit")
)
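
The batch size and accumulation settings combine into an effective batch size, which in turn determines how many optimizer updates a run performs. The arithmetic can be sketched as follows; the 1,000-example dataset is a hypothetical figure, and the Trainer's exact rounding may differ slightly between library versions.

```python
import math

def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices

def optimizer_steps(num_examples, per_device_batch, grad_accum_steps, epochs, num_devices=1):
    """Approximate count of optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = max(1, batches_per_epoch // grad_accum_steps)
    return updates_per_epoch * epochs

# Defaults from this cell: batch size 1, 8 accumulation steps, 3 epochs.
batch = effective_batch_size(per_device_batch=1, grad_accum_steps=8)
steps = optimizer_steps(num_examples=1000, per_device_batch=1, grad_accum_steps=8, epochs=3)

print(f"Effective batch size: {batch}")
print(f"Approximate optimizer steps: {steps}")
```

With the default `SAVE_STEPS=100`, a run of this length would write checkpoints at steps 100, 200, and 300, and `SAVE_TOTAL_LIMIT=2` would keep only the two most recent.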


In [None]:
# Quick check: resolved training args (read from environment) and model output directory
resolved_training_args = {
    "output_dir": model_output_dir,
    "per_device_train_batch_size": int(os.getenv("PER_DEVICE_TRAIN_BATCH_SIZE", "1")),
    "gradient_accumulation_steps": int(os.getenv("GRADIENT_ACCUMULATION_STEPS", "8")),
    "num_train_epochs": int(os.getenv("NUM_TRAIN_EPOCHS", "3")),
    "learning_rate": float(os.getenv("LEARNING_RATE", "2e-4")),
    "fp16": os.getenv("FP16", "True").lower() in ("1", "true", "yes"),
    "logging_steps": int(os.getenv("LOGGING_STEPS", "10")),
    "save_steps": int(os.getenv("SAVE_STEPS", "100")),
    "save_total_limit": int(os.getenv("SAVE_TOTAL_LIMIT", "2")),
    "report_to": os.getenv("REPORT_TO", "none"),
    "dataloader_pin_memory": os.getenv("DATALOADER_PIN_MEMORY", "False").lower() in ("1", "true", "yes"),
    "remove_unused_columns": os.getenv("REMOVE_UNUSED_COLUMNS", "False").lower() in ("1", "true", "yes"),
    "optim": os.getenv("OPTIMIZER", "paged_adamw_8bit")
}

print("Model name:", model_name_env)
print("Sanitized model name:", sanitized_model_name)
print("Model output directory:", model_output_dir)
print("Resolved training args:")
print(json.dumps(resolved_training_args, indent=2))
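
The two cells above repeat the same parsing expressions for every setting. One way to keep them in sync is a pair of small helpers, sketched below. `env_flag` and `env_number` are hypothetical names, not part of any library, and the `DEMO_*` variables are set only for this illustration.

```python
import os

TRUTHY = ("1", "true", "yes")  # same accepted values as the cells above

def env_flag(name, default=False):
    """Read a boolean setting; unset variables fall back to the default."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY

def env_number(name, default, cast=int):
    """Read a numeric setting and fail with a message that names the variable."""
    raw = os.getenv(name)
    if raw is None or not raw.strip():
        return default
    try:
        return cast(raw.strip())
    except ValueError as exc:
        raise ValueError(f"{name}={raw!r} is not a valid {cast.__name__}") from exc

# Demonstration with throwaway variables.
os.environ["DEMO_FP16"] = " Yes "
os.environ["DEMO_BATCH"] = "4"

demo_fp16 = env_flag("DEMO_FP16")
demo_batch = env_number("DEMO_BATCH", 1)
demo_lr = env_number("DEMO_UNSET_LR", 2e-4, cast=float)

print(demo_fp16, demo_batch, demo_lr)
```

Building `TrainingArguments` and the diagnostic dictionary from the same helper calls avoids the two drifting apart when a default changes.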


## Output directories & run organization

- **Base output directory** is configured via `.env` using **OUTPUT_DIR** (default `./results`).
- The notebook creates a **model-specific** subdirectory automatically: **OUTPUT_DIR/<sanitized MODEL_NAME>**.
  - The notebook sanitizes `MODEL_NAME` by replacing `/` and `:` with `_` (e.g. `TinyLlama/TinyLlama-1.1B-Chat-v1.0` → `TinyLlama_TinyLlama-1.1B-Chat-v1.0`).
- This subdirectory is created before training starts so all Trainer checkpoints and logs go into that folder.
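
The directory rule above can be written as a small function. The sketch below mirrors the sanitization in Cell 7 and adds an optional timestamped subdirectory in the `2026-01-09_14-20-00` style; the timestamp option is not something the notebook currently does, and `model_output_dir_for` is a hypothetical name.

```python
import os
from datetime import datetime

def model_output_dir_for(base_output, model_name, run_time=None):
    """Build OUTPUT_DIR/<sanitized model name>, optionally with a timestamped run folder."""
    sanitized = model_name.replace("/", "_").replace(":", "_")
    path = os.path.join(base_output, sanitized)
    if run_time is not None:
        # Optional extension: one subdirectory per run.
        path = os.path.join(path, run_time.strftime("%Y-%m-%d_%H-%M-%S"))
    return path

plain_dir = model_output_dir_for("./results", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
stamped_dir = model_output_dir_for(
    "./results",
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    run_time=datetime(2026, 1, 9, 14, 20, 0),
)

print(plain_dir)
print(stamped_dir)
```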

Recommended model candidates (start with one at a time):
- **Llama 3.2 (3B)**: `meta-llama/Llama-3.2-3B-Instruct`
  - Good quality for instruction-following and the best starting point for the 3B family. The repository is gated, so it requires accepting the license and using a Hugging Face access token. (Llama 2 has no 3B variant; its smallest size is 7B.)
- **RedPajama-INCITE (3B)**: `togethercomputer/RedPajama-INCITE-Chat-3B-v1`
  - Open alternative with similar scale and good community support. It uses the GPT-NeoX architecture, so `target_modules` in Cell 5 must be changed to `["query_key_value"]`.
- **Mistral (3B family)**: use a Mistral 3B variant from Hugging Face (e.g., `mistralai/*`).
  - Competitive performance; pick an official 3B HF repo if available.

Tips when trying a larger model:
- Use 4-bit quantization (`BitsAndBytesConfig`) and set `bnb_4bit_compute_dtype=torch.float16`.
- Load with `device_map='auto'` and allow CPU offload so the bigger model can use remaining host RAM.
- Enable `gradient_checkpointing` and set `model.config.use_cache = False` to reduce peak memory during training.

How to test a candidate:
1. Update `MODEL_NAME` in `.env` to the chosen model (e.g., `MODEL_NAME=meta-llama/Llama-3.2-3B-Instruct`).
2. Restart the kernel, re-run Cell 2 (which loads `.env`), then re-run Cell 4 to check that the model loads, followed by Cell 7 and the diagnostic cell to confirm that the output directory and settings resolve correctly.
3. If the model fails to load on your laptop, revert to a smaller candidate or consider using a remote GPU (recommended for >3B models).
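
To compare several candidates in one pass, a small load-test loop can record which ones load and why the others fail. This is a sketch under stated assumptions: `load_test` is a hypothetical helper, the candidate names are placeholders, and `fake_loader` stands in for a real loader so the example runs without downloading anything.

```python
def load_test(candidates, loader):
    """Try each candidate with the given loader and record the outcome."""
    results = {}
    for name in candidates:
        try:
            loader(name)
        except Exception as exc:  # record the failure and keep going
            results[name] = f"failed: {type(exc).__name__}: {exc}"
        else:
            results[name] = "ok"
    return results

# Stand-in loader for illustration only. In the notebook it would wrap
# AutoModelForCausalLM.from_pretrained(name, quantization_config=bnb_config,
# device_map="auto") and release the model afterwards.
def fake_loader(name):
    if "too-big" in name:
        raise MemoryError("not enough memory")
    return object()

report = load_test(["org/small-model", "org/too-big-model"], fake_loader)
for name, outcome in report.items():
    print(f"{name}: {outcome}")
```

With a real loader, deleting each model and calling `torch.cuda.empty_cache()` between attempts keeps one candidate's memory from counting against the next.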


## Cell 8: Initialize Trainer and Start Training

Create the trainer with our model, dataset, and training configuration, then start the fine-tuning process.

In [None]:
# Data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Causal language modeling
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)

# Start training
trainer.train()
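
With `mlm=False`, the collator pads each batch and derives labels from `input_ids`, replacing every position that holds the pad token ID with `-100`. The plain-Python imitation below is a sketch of that behavior, not the library's code. It also shows a side effect of setting `pad_token = eos_token` in Cell 4: the genuine EOS token at the end of each sequence is masked too, so the model gets no training signal for when to stop.

```python
IGNORE_INDEX = -100

def collate_causal_lm(batch_input_ids, pad_token_id):
    """Right-pad a batch and derive labels, ignoring every pad-ID position."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch_input_ids:
        padding = [pad_token_id] * (max_len - len(ids))
        padded = list(ids) + padding
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * len(padding))
        # Any position whose ID equals the pad ID is excluded from the loss.
        labels.append([IGNORE_INDEX if tok == pad_token_id else tok for tok in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

EOS_ID = 2  # hypothetical ID; here the pad token is the same as EOS
batch = collate_causal_lm([[5, 6, 7, EOS_ID], [8, EOS_ID]], pad_token_id=EOS_ID)

print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```

If the fine-tuned model tends to run on past the end of its answer, assigning a dedicated pad token that differs from EOS is one thing to try.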

## Cell 9: Save the Fine-Tuned Model

Save the trained adapters and tokenizer for later use. The base model is not saved, which keeps the output small; to reuse the adapters, load the base model again and attach them.

In [None]:
# Save only the LoRA adapters (not the full model)
model.save_pretrained("./fine-tuned-model")
tokenizer.save_pretrained("./fine-tuned-model")
print("Model saved successfully!")
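
A quick way to confirm that the save produced an adapter rather than a full model is to look for the adapter files. The file names below are an assumption based on recent PEFT releases (older versions wrote `adapter_model.bin`), and `looks_like_adapter_dir` is a hypothetical helper; the demonstration uses empty placeholder files in a temporary directory.

```python
import os
import tempfile

# Assumed file names from recent PEFT releases; older versions used adapter_model.bin.
ADAPTER_CONFIG = "adapter_config.json"
ADAPTER_WEIGHTS = ("adapter_model.safetensors", "adapter_model.bin")

def looks_like_adapter_dir(path):
    """Return True when the directory holds an adapter config and one weights file."""
    has_config = os.path.isfile(os.path.join(path, ADAPTER_CONFIG))
    has_weights = any(
        os.path.isfile(os.path.join(path, name)) for name in ADAPTER_WEIGHTS
    )
    return has_config and has_weights

# Demonstration on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as demo_dir:
    empty_result = looks_like_adapter_dir(demo_dir)
    for name in (ADAPTER_CONFIG, ADAPTER_WEIGHTS[0]):
        open(os.path.join(demo_dir, name), "w").close()
    filled_result = looks_like_adapter_dir(demo_dir)

print(empty_result, filled_result)
```

In a later session, the adapters can be attached to a freshly loaded base model with `PeftModel.from_pretrained(base_model, "./fine-tuned-model")`.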

## Cell 10: Test the Fine-Tuned Model

Test the model with a sample prompt to verify the fine-tuning results.

In [None]:
# Test the model
model.eval()
prompt = os.getenv("TEST_PROMPT", "Tell me a joke about cats.")
print(f"Using TEST_PROMPT: {prompt!r}")
# Move inputs to the device the model was placed on instead of assuming CUDA
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        do_sample=True
    )
    
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
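
The `temperature=0.7` setting divides the next-token scores by 0.7 before the softmax, which sharpens the distribution toward the most likely tokens. The sketch below shows the effect on three hypothetical scores; it illustrates the principle and is not the library's sampling code.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert scores to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical next-token scores

probs_default = softmax_with_temperature(logits, 1.0)
probs_sharper = softmax_with_temperature(logits, 0.7)

print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_sharper])
```

Values below 1.0 make the output more deterministic, and values above 1.0 make it more varied. With `do_sample=False`, generation is greedy and the temperature is not used.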