\renewcommand{\thesection}{}
\renewcommand{\thesubsection}{}

# CAS4133 - Assignment 2 (Due **5/23** at **11:59 PM**)

## [Summary]

Your goal is to use alignment algorithms (for example SFT, RLHF, or DPO, though you are not limited to these) so that your model scores well against your base model on an alignment benchmark.

This notebook provides an example SFT + DPO training pipeline, but you are free to use any other approach.

## [What to submit]

- An `.ipynb` file named: `CAS4133-assn2-studentnumber.ipynb`.  
- **[Important]** HuggingFace Repository containing your final model in the format: `${User_Name}/${Repo_Name}`  
- Describe and justify **every design choice** in the pipeline (alignment algorithms, hyperparameters, datasets, and models selected).  
- A comparison of your model's **benchmark scores** against those of a **naïvely trained model**, using the evaluation pipeline provided here.

## [Grading: 10 pts total]

Evaluation is based on **downstream performance** on a private benchmark.

**Scoring breakdown:**
- **+6 pts** Win rate above 30% over the base model on the undisclosed benchmark
- **+2 pts** Submitted a PDF report with the details required in the [What to submit] section
- **+1 pt** Your model is in the top 70% in the benchmark evaluation
- **+1 pt** Your model is in the top 40% in the benchmark evaluation
- **+1 pt** (*Extra credit*) Your model is in the top 3 in the benchmark evaluation  
  *(Can be transferred to another assignment as bonus)*


## [Notes]

- The assignment should be done on a single RTX 3090 GPU (24 GB VRAM)
- TA: Hojin Kim / E-mail: hojinkimirl@gmail.com
- On the TA's setup, the full pipeline takes ~3 hours

- **HuggingFace Hub submission requirements:**
  - If you're using LoRA adapters, make sure to merge them into the model before pushing to Hugging Face.
  - Generate a write-access token beforehand ([official documentation](https://huggingface.co/docs/hub/security-tokens))
  - Securely store your token
  - Authenticate via terminal:
    ```bash
    huggingface-cli login
    ```

**Remark: The maximum allowed runtime for Assignment 2 is 72 hours per account. If your total exceeds 72 hours, your assigned credits will be depleted and you will no longer be able to proceed.**

---

## **SFT and DPO Training Pipeline with Unsloth**

This notebook demonstrates:
1. Fine-tuning a Llama-3.2-1B model using Supervised Fine-Tuning (SFT) with Unsloth
2. Evaluating the SFT model against the base model
3. Performing DPO training on the SFT model
4. Evaluating the DPO model against the base model

## Setup
Create a directory for logs and install the necessary packages.

In [1]:
# Create logs directory
!mkdir -p ./outputs/logs

# Start log
!echo "=== STARTING OPTIMIZED SFT AND DPO RUN ===" > ./outputs/logs/training.log

In [None]:
# I prefer to use 'uv' which is the future of environment management tools
!uv sync

In [None]:
# !pip install ipykernel==6.29.5
# !pip install unsloth==2025.4.3
# !pip install unsloth-zoo==2025.4.2
# !pip install torch==2.7.0+cu121 --index-url https://download.pytorch.org/whl/cu121
# !pip install gdown==5.2.0
# !pip install huggingface_hub==0.30.2
# !pip install wandb==0.19.9  # Optional
# !pip install bitsandbytes==0.45.5
# !pip install transformers==4.51.3

In [2]:
# Import libraries
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'

import time
start_time = time.time()

import os
import random
import json
import numpy as np
from collections import defaultdict

try:
    from unsloth.chat_templates import get_chat_template
    from unsloth import FastLanguageModel, is_bfloat16_supported
    from trl import SFTTrainer, DPOTrainer
    from peft import PeftModel
    from datasets import load_dataset
    from transformers import (
        AutoTokenizer, 
        TrainingArguments, 
        TextStreamer,
        AutoModelForCausalLM,
    )
    print("✅ [CHECKPOINT] Imports successful")
except ImportError as e:
    print(f"❌ ImportError: {e}")
    raise

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth Zoo will now patch everything to make training faster!
✅ [CHECKPOINT] Imports successful


In [3]:
# Check CUDA
print("CUDA available:", torch.cuda.is_available())
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

if torch.cuda.is_available():
    print("GPU Info:")
    print(f"- Device count: {torch.cuda.device_count()}")
    print(f"- Current device: {torch.cuda.current_device()}")
    print(f"- Device name: {torch.cuda.get_device_name(0)}")

CUDA available: True
Using device: cuda
GPU Info:
- Device count: 1
- Current device: 0
- Device name: NVIDIA GeForce RTX 3090


In [4]:
# Seed for reproducibility
def set_seed(seed=1):
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

set_seed(1)
print("✅ [CHECKPOINT] Seed set")

✅ [CHECKPOINT] Seed set


## Load Model
Load the model with Unsloth's optimized loader.

In [5]:
print("Loading model - this may take a moment...")
model_load_start = time.time()

max_seq_length = 2048
dtype = None
load_in_4bit = True

try:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Llama-3.2-1B-unsloth-bnb-4bit",
        max_seq_length=max_seq_length,
        load_in_4bit=load_in_4bit,
        dtype=dtype,
    )
    print(f"✅ [CHECKPOINT] Model loaded in {time.time() - model_load_start:.2f}s")
    print("Model config:", model.config)
except Exception as e:
    print(f"❌ Failed to load model: {e}")
    raise

Loading model - this may take a moment...
==((====))==  Unsloth 2025.5.7: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 23.684 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.6. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
✅ [CHECKPOINT] Model loaded in 105.94s
Model config: LlamaConfig {
  "_attn_implementation_autoset": true,
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 

## Setup PEFT/LoRA
Configure model for Parameter-Efficient Fine-Tuning using LoRA.

In [6]:
peft_start = time.time()
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_rslora=True,
    use_gradient_checkpointing=True
)
print(f"✅ [CHECKPOINT] PEFT model created in {time.time() - peft_start:.2f}s")

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.5.7 patched 16 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


✅ [CHECKPOINT] PEFT model created in 2.45s
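
The LoRA settings above determine both how many weights are trained and how strongly the adapter update is scaled. The following is a minimal sketch of that arithmetic, not part of the original pipeline. `hidden_size`, `head_dim` and the 16 patched layers come from the outputs above; `num_attention_heads = 32` and `num_key_value_heads = 8` are assumptions taken from the published Llama-3.2-1B configuration, because the printed config is cut off before those fields.

```python
import math

# Shapes for Llama-3.2-1B. The two head counts are assumptions (see the text above).
hidden_size = 2048
head_dim = 64
num_attention_heads = 32   # assumption
num_key_value_heads = 8    # assumption
num_layers = 16

r, lora_alpha = 32, 64

# (in_features, out_features) of each targeted projection
shapes = {
    "q_proj": (hidden_size, num_attention_heads * head_dim),
    "k_proj": (hidden_size, num_key_value_heads * head_dim),
    "v_proj": (hidden_size, num_key_value_heads * head_dim),
    "o_proj": (num_attention_heads * head_dim, hidden_size),
}

# LoRA adds A (r x in) and B (out x r) per projection: r * (in + out) weights.
per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
trainable = per_layer * num_layers

standard_scale = lora_alpha / r            # classic LoRA scaling
rslora_scale = lora_alpha / math.sqrt(r)   # rank-stabilized scaling (use_rslora=True)

print(f"trainable LoRA parameters: {trainable:,}")
print(f"update scale: standard={standard_scale:.2f}, rsLoRA={rslora_scale:.2f}")
```

Under these assumptions the count agrees with the `Trainable parameters = 6,815,744` line in the training log further down. With `use_rslora=True` the same `lora_alpha` produces a noticeably larger update scale than classic LoRA, which is worth keeping in mind when choosing the learning rate.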


## Prepare Dataset
Prepare the Alpaca dataset for fine-tuning.

In [7]:
# Prepare dataset formatting function
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

In [8]:
# Load and process the dataset
dataset_start = time.time()
print("Loading dataset...")

try:
    dataset = load_dataset("yahma/alpaca-cleaned", split="train[:30000]")
    dataset = dataset.map(formatting_prompts_func, batched=True)
    print(f"✅ [CHECKPOINT] Dataset loaded and processed in {time.time() - dataset_start:.2f}s with {len(dataset)} examples")
    
    # Display a sample
    print("\nSample from dataset:")
    print(dataset[0]["text"][:500] + "...")


    print(f"Filtered dataset size: {len(dataset)} (original had ~52k)")
except Exception as e:
    print(f"❌ Failed to load/process dataset: {e}")
    raise

Loading dataset...


Generating train split: 100%|██████████| 51760/51760 [00:00<00:00, 158900.85 examples/s]
Map: 100%|██████████| 30000/30000 [00:00<00:00, 141601.71 examples/s]

✅ [CHECKPOINT] Dataset loaded and processed in 8.34s with 30000 examples

Sample from dataset:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Input:


### Response:
1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help pr...
Filtered dataset size: 30000 (original had ~52k)
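
The sample above shows that examples without an input still render an empty `### Input:` section, while the test and evaluation prompts later in this notebook omit that section entirely. One possible alternative, sketched below under stated assumptions, is to switch templates depending on whether the input is empty. `format_example` and the two template constants are hypothetical names, and the default `eos_token` is a placeholder; in the notebook the value would come from `tokenizer.eos_token`.

```python
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)


def format_example(instruction, input_text, output, eos_token="<|end_of_text|>"):
    """Render one training example, dropping the Input section when it is empty."""
    template = ALPACA_WITH_INPUT if input_text.strip() else ALPACA_NO_INPUT
    # str.format ignores unused keyword arguments, so both templates accept the same call.
    text = template.format(instruction=instruction, input=input_text, output=output)
    return text + eos_token


no_input = format_example("Give three tips for staying healthy.", "", "1. Eat a balanced diet.")
with_input = format_example("Summarize the text.", "LoRA trains low-rank updates.", "LoRA is efficient.")
print(no_input)
```

Whether this helps on the benchmark is an empirical question; the point is only that training and evaluation prompts can be made to share one format.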





## Configure Training
Set up the training configuration using SFTTrainer.

In [15]:
# Configure the SFT Trainer
output_dir = "./outputs"

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=12,
    packing=True,
    args=TrainingArguments(
        per_device_train_batch_size=32,
        gradient_accumulation_steps=1,
        warmup_steps=100,
        num_train_epochs=2.0,
        learning_rate=1e-5,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=50,
        logging_first_step=True,
        optim="adamw_8bit",
        weight_decay=0.02,
        lr_scheduler_type="cosine",
        seed=3407,
        output_dir=output_dir,
        report_to="none",
        save_strategy="steps",
        save_steps=100,
        gradient_checkpointing="unsloth",
        max_grad_norm=0.3,
        dataloader_num_workers=8,
        dataloader_pin_memory=True,
    ),
)

print("✅ [MAXIMUM PERFORMANCE CONFIG] Trainer configured for optimal quality training")

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Unsloth: Hugging Face's packing is currently buggy - we're disabling it for now!
✅ [MAXIMUM PERFORMANCE CONFIG] Trainer configured for optimal quality training


## Train the Model
Start the training process.

In [None]:
# Train the model
train_start = time.time()

try:
    trainer_stats = trainer.train()
    print(f"✅ [CHECKPOINT] Training completed in {time.time() - train_start:.2f}s")
    print(f"Training stats: {trainer_stats}")
except Exception as e:
    print(f"❌ Training failed: {e}")
    raise

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 30,000 | Num Epochs = 2 | Total steps = 1,876
O^O/ \_/ \    Batch size per device = 32 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (32 x 1 x 1) = 32
 "-____-"     Trainable parameters = 6,815,744/1,000,000,000 (0.68% trained)


Step,Training Loss
1,1.2862
50,1.2506
100,1.2467
150,1.2247
200,1.2219
250,1.2135
300,1.206
350,1.1964
400,1.2046
450,1.1895


## Save the Model
Save the trained LoRA adapter.

In [None]:
# Save adapter
save_start = time.time()
sft_adapter_path = "./outputs/sft_lora_adapter"

try:
    model.save_pretrained(
        sft_adapter_path,
        save_adapter=True,
        save_config=True
    )
    print(f"✅ [CHECKPOINT] SFT model adapter saved in {time.time() - save_start:.2f}s")
except Exception as e:
    print(f"❌ Failed to save model: {e}")
    raise

## Test the Fine-Tuned Model
We can test our fine-tuned model with a sample prompt.

In [None]:
# Test the fine-tuned model
print("Testing fine-tuned model with a sample prompt...")

try:
    # Load the adapter onto a fresh model instance
    test_model, test_tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Llama-3.2-1B-unsloth-bnb-4bit",
        max_seq_length=max_seq_length,
        load_in_4bit=load_in_4bit
    )
    
    # Load the PEFT adapter
    test_model = PeftModel.from_pretrained(test_model, sft_adapter_path)
    
    # Create a text streamer for nice output
    streamer = TextStreamer(test_tokenizer, skip_prompt=True)
    
    # Sample prompt
    test_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Explain the concept of fine-tuning in machine learning.

### Response:
"""
    
    print("\nPrompt:\n", test_prompt)
    print("\nGenerated Response:")
    
    # Generate response
    inputs = test_tokenizer(test_prompt, return_tensors="pt").to(device)
    
    output = test_model.generate(
        **inputs,
        max_new_tokens=256,
        use_cache=True,
        streamer=streamer
    )
except Exception as e:
    print(f"❌ Testing failed: {e}")

## Set up Evaluation
Now we'll set up a comprehensive evaluation comparing our fine-tuned model against the base model using a judge model.

In [None]:
# Set up evaluation
eval_start = time.time()

print("Loading models for evaluation...")

def load_base_model():
    print("Loading base model")
    try:
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name="unsloth/Llama-3.2-1B-unsloth-bnb-4bit",
            max_seq_length=1024,
            load_in_4bit=True,
            dtype=None,
        )
        return model, tokenizer
    except Exception as e:
        print(f"❌ Failed to load base model: {e}")
        raise

def load_sft_model():
    print("Loading SFT model")
    try:
        base_model, tokenizer = FastLanguageModel.from_pretrained(
            model_name="unsloth/Llama-3.2-1B-unsloth-bnb-4bit",
            max_seq_length=1024,
            load_in_4bit=True,
            dtype=None,
        )
        peft_model = PeftModel.from_pretrained(base_model, sft_adapter_path)
        return peft_model, tokenizer
    except Exception as e:
        print(f"❌ Failed to load SFT model: {e}")
        raise

def load_judge_model():
    print("Loading judge model")
    try:
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name="unsloth/Llama-3.1-8B-Instruct-unsloth-bnb-4bit",
            max_seq_length=2048,
            load_in_4bit=True,
            dtype=None,
        )
        return model, tokenizer
    except Exception as e:
        print(f"❌ Failed to load judge model: {e}")
        raise

try:
    base_model, base_tokenizer = load_base_model()
    sft_model, sft_tokenizer = load_sft_model()
    judge_model, judge_tokenizer = load_judge_model()
    print("✅ [CHECKPOINT] Evaluation models loaded")
except Exception as e:
    print(f"❌ Failed to load evaluation models: {e}")
    raise

## Run Comparative Evaluation
Let's run a comparative evaluation between the base model and our fine-tuned model.

In [None]:
import time
import torch
from datasets import load_dataset

# Define helper functions for evaluation
def format_prompt(instruction):
    return f"""Below is an instruction that describes a task. Write a response that completes the request. You will be judged on the following criteria:
    1. **Objectivity**: How objective and formal the tone is. 
    2. **Instruction Following**: How well the response directly addresses the instruction's requirements.
    3. **Specificity**: How specific and long the response is.
    4. **Clarity**: How readable and coherent the response is, with no extraneous characters.
    5. **Accuracy**: Factual correctness and logical consistency.
    6. **Helpfulness**: Practicality and usefulness of the response.

### Instruction:
{instruction}

### Response:"""

def generate_response(model, tokenizer, prompt):
    inputs = tokenizer(
        prompt, 
        return_tensors="pt", 
        truncation=True, 
        max_length=2048,
    ).to("cuda")
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.7,
            top_p=0.9,
            top_k=50,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.2,
        )
    
    # Decode only the newly generated tokens so the prompt can never leak into the response
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
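
The sampling parameters in `generate_response` (`temperature`, `top_k`, `top_p`) each narrow the distribution that the next token is drawn from. The sketch below illustrates the idea on a toy list of logits using only the standard library. It is a simplified illustration, not the Transformers implementation; `filter_distribution` is a hypothetical helper and the repetition penalty is not modeled.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return the renormalized token probabilities that sampling would draw from."""
    # 1. Temperature: divide logits before the softmax (values below 1 sharpen it).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [(token_id, e / total) for token_id, e in enumerate(exps)]

    # 2. Top-k: keep the k most likely tokens and renormalize.
    ranked = sorted(probs, key=lambda item: item[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    ranked = [(token_id, p / k_total) for token_id, p in ranked]

    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}


dist = filter_distribution([2.0, 1.0, 0.5, -1.0], top_k=3)
print(dist)
```

Because the evaluation samples with these settings, two runs over the same prompt can produce different responses, so win rates carry some run-to-run noise.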

In [None]:
def get_judgment(instruction, sft_response, base_response):
    prompt = f"""You are an impartial judge evaluating two responses to an instruction. Score based on the following criteria, ignoring the order of the responses, prioritizing these metrics:

1. **Objectivity**: How objective and formal the tone is.
2. **Instruction Following**: How well the response directly addresses the instruction's requirements.
3. **Specificity**: How specific and long the response is.
4. **Clarity**: How readable and coherent the response is, with no extraneous characters.
5. **Accuracy**: Factual correctness and logical consistency.
6. **Helpfulness**: Practicality and usefulness of the response.


**Special Rule**: If the First Response (SFT Model) is empty, irrelevant, or fails to provide any meaningful answer, score the Second Response (Base Model) as significantly better unless it is also empty or irrelevant.

Return ONLY a single number:
1 = First response (SFT Model) is significantly better
2 = Second response (Base Model) is significantly better
0 = Both are comparable (similar quality or minor differences)

Instruction:
{instruction}

First Response (SFT Model):
{sft_response}

Second Response (Base Model):
{base_response}

Judgment (1/2/0):"""
    
    inputs = judge_tokenizer(
        prompt,
        return_tensors="pt",
        max_length=2048,
        truncation=True
    ).to("cuda")
    
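    with torch.no_grad():
        outputs = judge_model.generate(
            **inputs,
            max_new_tokens=3,
            do_sample=False,  # greedy decoding; no temperature is needed when sampling is off
        )
    
    # Decode only the generated tokens so digits in the prompt (e.g. "(1/2/0)") are never parsed
    prompt_length = inputs["input_ids"].shape[1]
    verdict = judge_tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    # Use the first valid verdict digit; anything else is treated as a tie
    verdict_digits = [c for c in verdict if c in "012"]
    return int(verdict_digits[0]) if verdict_digits else 0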

# Load evaluation dataset
print("Loading evaluation dataset...")
MAX_SAMPLES = 500

try:
    eval_dataset = load_dataset("tatsu-lab/alpaca_eval", "alpaca_eval", trust_remote_code=True)["eval"]
    eval_dataset = eval_dataset.shuffle(seed=420).select(range(min(len(eval_dataset), MAX_SAMPLES)))
    print(f"Loaded evaluation dataset with {len(eval_dataset)} examples")
except Exception as e:
    print(f"❌ Failed to load evaluation dataset: {e}")
    raise
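
`get_judgment` always shows the tuned model's response first, and the instruction to ignore the order is the only thing keeping the judge from favoring one position. A common mitigation, sketched here as a suggestion rather than as part of the provided pipeline, is to randomize the presentation order and map the verdict back. `debiased_verdict` and `toy_judge` are hypothetical names; using this with the real judge would also require neutral labels in the prompt instead of "(SFT Model)" and "(Base Model)".

```python
import random


def debiased_verdict(judge_fn, instruction, candidate, baseline, rng):
    """Query the judge with a random presentation order and map the answer back.

    judge_fn(instruction, first, second) must return 1 (first wins),
    2 (second wins) or 0 (tie). The returned value uses the notebook's
    convention: 1 = candidate wins, 2 = baseline wins, 0 = tie.
    """
    if rng.random() < 0.5:
        return judge_fn(instruction, candidate, baseline)
    swapped = judge_fn(instruction, baseline, candidate)
    return {1: 2, 2: 1}.get(swapped, 0)


def toy_judge(instruction, first, second):
    """Stand-in judge for demonstration only: prefers the longer response."""
    if len(first) == len(second):
        return 0
    return 1 if len(first) > len(second) else 2


rng = random.Random(0)
verdicts = [
    debiased_verdict(toy_judge, "Say hi", "Hello there!", "Hi", rng)
    for _ in range(10)
]
print(verdicts)
```

With an order-insensitive judge like the toy one, the verdict is the same whichever order is drawn; with a real judge, randomizing spreads any position bias evenly across both models.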

In [None]:
# Run evaluation
print(f"Running evaluation on {len(eval_dataset)} examples...")
eval_start = time.time()
results = []
cnt = 0

for i, example in enumerate(eval_dataset):
    try:
        
        print(f"\nExample {i+1}/{len(eval_dataset)}")
        if cnt%10==0: print(f"Evaluating: {example['instruction'][:100]}...")
        
        prompt = format_prompt(example["instruction"])
        base_response = generate_response(base_model, base_tokenizer, prompt)
        sft_response = generate_response(sft_model, sft_tokenizer, prompt)

        if cnt%10==0:
            print("\nBase Model Response:")
            print(base_response)
            print("\nSFT Model Response:")
            print(sft_response)
        
        # Check for empty or non-meaningful responses
        base_empty = not base_response.strip() or len(base_response.strip()) < 5 or base_response.strip().lower() in ["n/a", "none", "no response"]
        sft_empty = not sft_response.strip() or len(sft_response.strip()) < 5 or sft_response.strip().lower() in ["n/a", "none", "no response"]
        
        if base_empty and not sft_empty:
            verdict = 1
            print("Base response empty or non-meaningful - verdict: 1 (SFT wins)")
        elif sft_empty and not base_empty:
            verdict = 2
            print("SFT response empty or non-meaningful - verdict: 2 (Base wins)")
        elif base_empty and sft_empty:
            verdict = 0
            print("Both responses empty or non-meaningful - verdict: 0 (tie)")
        elif base_response.strip() == sft_response.strip():
            verdict = 0
            print("Responses identical - verdict: 0 (tie)")
        else:
            verdict = get_judgment(example["instruction"], sft_response, base_response)
            print(f"\nVerdict: {verdict} ({['Tie', 'SFT wins', 'Base wins'][verdict]})")
        
        results.append({
            "instruction": example["instruction"],
            "sft_response": sft_response,
            "base_response": base_response,
            "verdict": verdict
        })
        cnt += 1

    except Exception as e:
        print(f"Error processing example {i+1}: {e}")
        continue

# Calculate and display summary statistics
total = len(results)
if total > 0:
    base_wins = sum(1 for r in results if r["verdict"] == 2)
    sft_wins = sum(1 for r in results if r["verdict"] == 1)
    ties = sum(1 for r in results if r["verdict"] == 0)
    
    print("\n=== Evaluation Summary ===")
    print(f"Total examples evaluated: {total}")
    print(f"Base Model wins: {base_wins} ({base_wins/total:.1%})")
    print(f"SFT Model wins: {sft_wins} ({sft_wins/total:.1%})")
    print(f"Ties: {ties} ({ties/total:.1%})")

print(f"\n✅ Evaluation completed in {time.time() - eval_start:.2f}s")

## Analyze SFT Results
We can analyze and save the evaluation results.

In [None]:
# Calculate statistics
verdicts = [r["verdict"] for r in results]
sft_stats = {
    "base_wins": sum(1 for v in verdicts if v == 2),
    "sft_wins": sum(1 for v in verdicts if v == 1),
    "ties": sum(1 for v in verdicts if v == 0),
    "total": len(verdicts),
    "sft_win_rate": sum(1 for v in verdicts if v == 1) / len(verdicts) if verdicts else 0,
}

# Print summary statistics
print("\n=== SFT Evaluation Summary ===")
print(f"Total examples evaluated: {sft_stats['total']}")
print(f"Base Model wins: {sft_stats['base_wins']} ({sft_stats['base_wins']/sft_stats['total']:.1%})")
print(f"SFT Model wins: {sft_stats['sft_wins']} ({sft_stats['sft_wins']/sft_stats['total']:.1%})")
print(f"Ties: {sft_stats['ties']} ({sft_stats['ties']/sft_stats['total']:.1%})")

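# Save results
sft_output_file = "./outputs/alpacaeval_sft_results.json"
with open(sft_output_file, "w", encoding="utf-8") as f:
    json.dump({"stats": sft_stats, "results": results}, f, ensure_ascii=False, indent=2)
print(f"Results saved to {sft_output_file}")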
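
Since grading depends on clearing a win-rate threshold, it can help to see how much uncertainty a given sample size leaves around the measured rate. The sketch below computes the win rate with a Wilson score interval. It assumes ties count as non-wins; how the private benchmark treats ties is not stated, so that choice and the helper name `win_rate_summary` are assumptions.

```python
import math


def win_rate_summary(verdicts, z=1.96):
    """Win rate of the tuned model (verdict == 1) with a Wilson score interval.

    Ties (0) and losses (2) both count as non-wins here, which is an assumption.
    """
    n = len(verdicts)
    if n == 0:
        return None
    wins = sum(1 for v in verdicts if v == 1)
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return {"n": n, "win_rate": p, "low": center - half, "high": center + half}


# Illustrative verdicts only, not results from this notebook.
summary = win_rate_summary([1] * 150 + [2] * 250 + [0] * 100)
print(summary)
```

In the notebook this could be called on the `verdicts` list built above. A measured rate that sits close to the threshold with a wide interval suggests evaluating on more examples before drawing conclusions.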

## Prepare for DPO Training
Prepare for Direct Preference Optimization (DPO) training using the SFT model as our starting point.

In [None]:
# Load the SFT model as our policy model
print("Loading SFT model for DPO training...")

try:
    dpo_model, dpo_tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Llama-3.2-1B-unsloth-bnb-4bit",
        max_seq_length=max_seq_length,
        load_in_4bit=True,
        dtype=None,
    )
    
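    # is_trainable=True keeps the LoRA weights trainable; by default the adapter is loaded frozen for inference
    dpo_model = PeftModel.from_pretrained(dpo_model, sft_adapter_path, is_trainable=True)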
    
    ref_model, _ = FastLanguageModel.from_pretrained(
        model_name="unsloth/Llama-3.2-1B-unsloth-bnb-4bit",
        max_seq_length=max_seq_length,
        load_in_4bit=True,
        dtype=None,
    )
    
    print("✅ [CHECKPOINT] DPO models loaded")
except Exception as e:
    print(f"❌ Failed to load DPO models: {e}")
    raise

In [None]:
# Load preference dataset for DPO
print("Loading preference dataset for DPO...")
try:
    NUM_EXAMPLES = 3000
    
    # Load dataset
    dataset = load_dataset("Intel/orca_dpo_pairs")
    
    # Use specified number of examples for training
    dpo_dataset = dataset["train"].select(range(min(NUM_EXAMPLES, len(dataset["train"]))))
    
    # Format the dataset for DPO and perform validation
    def format_dpo_example(example):
        return {
            "prompt": example["question"],
            "chosen": example["chosen"],
            "rejected": example["rejected"],
        }
    
    def filter_valid_examples(example):
        if not example["chosen"].strip() or not example["rejected"].strip():
            return False
        if len(example["question"]) > 3000:  # 3000 characters, roughly 750 tokens at ~4 characters per token
            return False
        return True
    
    # Format dataset
    dpo_dataset = dpo_dataset.map(format_dpo_example)
    dpo_dataset = dpo_dataset.filter(filter_valid_examples)
    
    print(f"Loaded DPO dataset with {len(dpo_dataset)} examples (from target of {NUM_EXAMPLES})")
    
    # Display a sample
    print("\nSample from DPO dataset:")
    print("Prompt:", dpo_dataset[0]["prompt"][:500] + "..." if len(dpo_dataset[0]["prompt"]) > 500 else dpo_dataset[0]["prompt"])
    print("Chosen response:", dpo_dataset[0]["chosen"][:500] + "..." if len(dpo_dataset[0]["chosen"]) > 500 else dpo_dataset[0]["chosen"])
    print("Rejected response:", dpo_dataset[0]["rejected"][:500] + "..." if len(dpo_dataset[0]["rejected"]) > 500 else dpo_dataset[0]["rejected"])
    
    # Dataset statistics
    prompt_lengths = [len(example["prompt"]) for example in dpo_dataset]
    chosen_lengths = [len(example["chosen"]) for example in dpo_dataset]
    rejected_lengths = [len(example["rejected"]) for example in dpo_dataset]
    
    print("\nDataset Statistics:")
    print(f"Average prompt length: {sum(prompt_lengths)/len(prompt_lengths):.1f} chars")
    print(f"Average chosen response length: {sum(chosen_lengths)/len(chosen_lengths):.1f} chars")
    print(f"Average rejected response length: {sum(rejected_lengths)/len(rejected_lengths):.1f} chars")
    
    # Pre-tokenize the dataset for DPOTrainer
    def tokenize_dataset(example):
        # Tokenize prompt
        prompt_tokens = dpo_tokenizer(example["prompt"], truncation=True, max_length=256, padding=False)
        
        # Tokenize chosen and rejected responses
        chosen_tokens = dpo_tokenizer(example["chosen"], truncation=True, max_length=max_seq_length, padding=False)
        rejected_tokens = dpo_tokenizer(example["rejected"], truncation=True, max_length=max_seq_length, padding=False)
        
        return {
            "prompt": example["prompt"],
            "chosen": example["chosen"],
            "rejected": example["rejected"],
            "input_ids": prompt_tokens["input_ids"],
            "attention_mask": prompt_tokens["attention_mask"],
            "chosen_input_ids": chosen_tokens["input_ids"],
            "chosen_attention_mask": chosen_tokens["attention_mask"],
            "rejected_input_ids": rejected_tokens["input_ids"],
            "rejected_attention_mask": rejected_tokens["attention_mask"],
        }
        
    tokenized_dataset = dpo_dataset.map(
        tokenize_dataset,
        batched=False,
        num_proc=4,
    )
    
    print("✅ Dataset tokenized successfully")
    
except Exception as e:
    print(f"❌ Failed to load/process DPO dataset: {e}")
    raise
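
The DPO prompts above are raw questions, whereas the SFT model was trained and is evaluated on `### Instruction:` / `### Response:` prompts. One option, sketched here as a suggestion and not as the notebook's method, is to wrap each DPO prompt in the same template. `to_dpo_record` and `PROMPT_TEMPLATE` are hypothetical names, and the use of a `system` column is an assumption about the dataset's schema, so it is read defensively.

```python
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def to_dpo_record(example):
    """Build a prompt/chosen/rejected record whose prompt matches the SFT format."""
    system = (example.get("system") or "").strip()  # assumed optional column
    question = example["question"].strip()
    instruction = f"{system}\n\n{question}" if system else question
    return {
        "prompt": PROMPT_TEMPLATE.format(instruction=instruction),
        "chosen": example["chosen"].strip(),
        "rejected": example["rejected"].strip(),
    }


# Illustrative record only, not taken from the dataset.
record = to_dpo_record({
    "system": "You are a helpful assistant.",
    "question": "What is 2 + 2?",
    "chosen": "2 + 2 equals 4.",
    "rejected": "5",
})
print(record["prompt"])
```

A function like this could replace `format_dpo_example` in the `map` call; whether it improves the final win rate would need to be checked empirically.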

## Configure DPO Training
Set up the DPO trainer with appropriate parameters.

In [None]:
# Configure DPO trainer
dpo_output_dir = "./outputs/dpo_model"
# Define DPO training arguments
dpo_training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    warmup_steps=100,
    num_train_epochs=2.0,
    learning_rate=1e-7,
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    logging_steps=100,
    optim="adamw_torch",
    weight_decay=0.01,
    lr_scheduler_type="cosine_with_restarts",
    seed=3407,
    output_dir=dpo_output_dir,
    report_to="none",
    save_strategy="steps",
    save_steps=100,
    gradient_checkpointing=True,
    max_grad_norm=2.0,
    dataloader_num_workers=2,
    dataloader_pin_memory=True,
    group_by_length=True,
)


dpo_trainer = DPOTrainer(
    model=dpo_model,
    ref_model=ref_model,
    args=dpo_training_args,
    beta=0.2,
    train_dataset=tokenized_dataset,
    tokenizer=dpo_tokenizer,
    max_length=max_seq_length,
    max_prompt_length=256,
    generate_during_eval=True,
)
print("✅ [CHECKPOINT] DPO trainer configured with optimized hyperparameters")
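
To make the role of `beta` concrete, here is a minimal sketch of the standard sigmoid DPO loss for a single preference pair, computed from sequence log-probabilities under the policy and the reference model. It shows the formula only and is not the trainer's internal code; `dpo_loss` is a hypothetical helper and the log-probabilities are made-up numbers.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.2):
    """Sigmoid DPO loss for one pair: -log(sigmoid(beta * (chosen gap - rejected gap)))."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) in a numerically stable form
    if margin > 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, margin


# Policy identical to the reference: margin 0, loss log(2).
start_loss, start_margin = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has raised the chosen response and lowered the rejected one.
better_loss, better_margin = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(f"start:  loss={start_loss:.4f}, margin={start_margin:.2f}")
print(f"better: loss={better_loss:.4f}, margin={better_margin:.2f}")
```

A larger `beta` makes the same log-probability shift count for more, which penalizes drifting from the reference more sharply. Note also that the loss starts at log(2) only when policy and reference agree; here the policy carries the SFT adapter while `ref_model` is the plain base model, so the initial margin will generally be non-zero.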

## Run DPO Training
Perform DPO training on the SFT model.

In [None]:
# Train with DPO
print("Starting DPO training...")
dpo_train_start = time.time()

try:
    dpo_trainer.train()
    print(f"✅ [CHECKPOINT] DPO training completed in {time.time() - dpo_train_start:.2f}s")
except Exception as e:
    print(f"❌ DPO training failed: {e}")
    raise

## Save DPO Model
Save the DPO-trained model adapter.

In [None]:
# Save DPO adapter
dpo_adapter_path = "./outputs/dpo_lora_adapter"

try:
    dpo_trainer.model.save_pretrained(
        dpo_adapter_path,
        save_adapter=True,
        save_config=True
    )
    print(f"✅ [CHECKPOINT] DPO model adapter saved to {dpo_adapter_path}")
except Exception as e:
    print(f"❌ Failed to save DPO model: {e}")
    raise
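
The submission notes ask for LoRA adapters to be merged before pushing. Numerically, merging folds the low-rank update into the dense weight so that no adapter is needed at inference time. The toy sketch below demonstrates that equivalence with plain Python lists. It is an illustration of the math, assuming the rank-stabilized scale `lora_alpha / sqrt(r)` implied by `use_rslora=True`; in PEFT the merge itself is typically done with `merge_and_unload()`, and the exact procedure for a 4-bit Unsloth model is not shown in this notebook.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r, use_rslora=True):
    """Return W + scale * (B @ A), the dense weight equivalent to base plus adapter."""
    scale = lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy layer with a rank-1 adapter; all values are illustrative only.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # out=2, in=3
A = [[0.5, -0.5, 1.0]]                     # r x in
B = [[0.2], [-0.1]]                        # out x r
x = [[1.0], [2.0], [3.0]]                  # column input

scale = 2 / math.sqrt(1)
merged_out = matmul(merge_lora(W, A, B, lora_alpha=2, r=1), x)

# Same computation with the adapter kept separate: W x + scale * B (A x)
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
separate_out = [[b[0] + scale * l[0]] for b, l in zip(base_out, lora_out)]

print(merged_out, separate_out)
```

Both paths give the same output up to floating-point rounding, which is why a merged checkpoint can be loaded as an ordinary model from the Hub.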

## Evaluate DPO Model
Now let's evaluate the DPO model against the base model using the same evaluation framework.

In [None]:
import torch
from peft import PeftModel
from unsloth import FastLanguageModel

# Clear GPU cache first
torch.cuda.empty_cache()

# 1. Load base model (4-bit quantized)
model, dpo_eval_tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-unsloth-bnb-4bit",
    max_seq_length=1024,  # Reduce if OOM (original: 2048)
    dtype=torch.float16,  # Use float16 for efficiency
    load_in_4bit=True,
)

# 2. Define a custom device_map to offload some layers to CPU
device_map = {
    "model.embed_tokens": 0,  # GPU
    "model.layers.0": 0,
    "model.layers.1": 0,
    # ... (assign more layers to GPU if possible)
    "model.norm": "cpu",  # Offload norm layer to CPU
    "lm_head": "cpu",     # Offload final layer to CPU
}

# 3. Load DPO adapter with manual offloading
try:
    dpo_eval_model = PeftModel.from_pretrained(
        model,
        dpo_adapter_path,
        device_map=device_map,
    )
    print("✅ [SUCCESS] DPO model loaded with CPU offloading!")
except Exception as e:
    print(f"❌ [ERROR] Failed to load DPO adapter: {e}")
    raise

In [None]:
# Run DPO evaluation
print(f"Running evaluation on {len(eval_dataset)} examples...")
eval_start = time.time()
results = []


def is_empty_response(text):
    """Return True for blank, very short, or placeholder responses."""
    stripped = text.strip()
    return len(stripped) < 5 or stripped.lower() in {"n/a", "none", "no response"}


for i, example in enumerate(eval_dataset):
    try:
        print(f"\nExample {i+1}/{len(eval_dataset)}")
        print(f"Evaluating: {example['instruction'][:100]}...")

        prompt = format_prompt(example["instruction"])
        base_response = generate_response(base_model, base_tokenizer, prompt)
        dpo_response = generate_response(dpo_eval_model, dpo_eval_tokenizer, prompt)

        print("\nDPO Model Response:")
        print(dpo_response)
        print("\nBase Model Response:")
        print(base_response)

        # Check for empty or non-meaningful responses before calling the judge
        base_empty = is_empty_response(base_response)
        dpo_empty = is_empty_response(dpo_response)

        if base_empty and not dpo_empty:
            verdict = 1
            print("Base response empty or non-meaningful - verdict: 1 (DPO wins)")
        elif dpo_empty and not base_empty:
            verdict = 2
            print("DPO response empty or non-meaningful - verdict: 2 (Base wins)")
        elif base_empty and dpo_empty:
            verdict = 0
            print("Both responses empty or non-meaningful - verdict: 0 (tie)")
        elif base_response.strip() == dpo_response.strip():
            verdict = 0
            print("Responses identical - verdict: 0 (tie)")
        else:
            verdict = get_judgment(example["instruction"], dpo_response, base_response)
            print(f"\nVerdict: {verdict} ({['Tie', 'DPO wins', 'Base wins'][verdict]})")

        results.append({
            "instruction": example["instruction"],
            "base_response": base_response,
            "dpo_response": dpo_response,
            "verdict": verdict
        })
    except Exception as e:
        # Skip examples that fail so one bad generation does not abort the run
        print(f"Error processing example {i+1}: {e}")

# Calculate summary statistics (verdict: 0 = tie, 1 = DPO wins, 2 = base wins)
verdicts = [r["verdict"] for r in results]
total = len(verdicts)
dpo_wins = sum(1 for v in verdicts if v == 1)
dpo_stats = {
    "base_wins": sum(1 for v in verdicts if v == 2),
    "dpo_wins": dpo_wins,
    "ties": sum(1 for v in verdicts if v == 0),
    "total": total,
    "dpo_win_rate": dpo_wins / total if total else 0,
}

# Print summary statistics (guarded so an empty result list does not divide by zero)
print("\n=== DPO Evaluation Summary ===")
print(f"Total examples evaluated: {total}")
if total:
    print(f"Base Model wins: {dpo_stats['base_wins']} ({dpo_stats['base_wins']/total:.1%})")
    print(f"DPO Model wins: {dpo_stats['dpo_wins']} ({dpo_stats['dpo_wins']/total:.1%})")
    print(f"Ties: {dpo_stats['ties']} ({dpo_stats['ties']/total:.1%})")
else:
    print("No examples were evaluated successfully.")

print(f"\n✅ Evaluation completed in {time.time() - eval_start:.2f}s")
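
With a small evaluation set, the win rate printed above can shift by several points from run to run. One way to gauge that uncertainty is a Wilson score interval on the DPO win rate. The sketch below is illustrative and not part of the provided pipeline: `wilson_interval` is a hypothetical helper, the counts in the example are made up, and ties are treated as non-wins to match how `dpo_win_rate` is computed.

```python
import math


def wilson_interval(wins, total, z=1.96):
    """Wilson score interval for a win rate (z=1.96 is roughly 95% confidence)."""
    if total == 0:
        return 0.0, 1.0  # no data: the rate is unconstrained
    p_hat = wins / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Example with made-up counts: 50 wins out of 100 comparisons
low, high = wilson_interval(50, 100)
print(f"95% interval for the win rate: {low:.1%} to {high:.1%}")
```

In the notebook, calling `wilson_interval(dpo_stats['dpo_wins'], dpo_stats['total'])` would give the interval for the run above.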

## Analyze DPO Results
Save the DPO evaluation statistics and per-example results to a JSON file so the comparison against the base model can be analyzed later.

In [None]:
# Save DPO results
dpo_output_file = "./outputs/alpacaeval_dpo_results.json"

with open(dpo_output_file, "w") as f:
    json.dump({
        "statistics": dpo_stats,
        "results": results,
        "config": {
            "base_model": "unsloth/Llama-3.2-1B-unsloth-bnb-4bit",
            # Record the adapter path that was actually used when saving and loading
            "dpo_adapter": dpo_adapter_path,
            "judge_model": "unsloth/Llama-3.2-8B-Instruct-unsloth-bnb-4bit",
            "num_samples": len(results)
        }
    }, f, indent=2)

print(f"✅ [CHECKPOINT] DPO results saved to {dpo_output_file}")

## Compare SFT and DPO Results
Let's compare the performance of SFT and DPO models.

In [None]:
print("\n=== Model Comparison Summary ===")
print(f"SFT Win Rate: {sft_stats['sft_win_rate']:.1%}")
print(f"DPO Win Rate: {dpo_stats['dpo_win_rate']:.1%}")
print(f"\nSFT Performance: Base Wins: {sft_stats['base_wins']} | SFT Wins: {sft_stats['sft_wins']} | Ties: {sft_stats['ties']}")
print(f"DPO Performance: Base Wins: {dpo_stats['base_wins']} | DPO Wins: {dpo_stats['dpo_wins']} | Ties: {dpo_stats['ties']}")

# Calculate improvement
improvement = (dpo_stats['dpo_win_rate'] - sft_stats['sft_win_rate']) * 100
print(f"\nDPO improvement over SFT: {improvement:.1f} percentage points")

## Final Summary
Print a final summary of the entire pipeline.

In [None]:
print("\n=== FINAL SUMMARY ===")
print(f"SFT Win Rate: {sft_stats['sft_win_rate']:.1%}")
print(f"DPO Win Rate: {dpo_stats['dpo_win_rate']:.1%}")
print(f"DPO Improvement over SFT: {improvement:.1f} percentage points")
print(f"\nFULL PIPELINE COMPLETED SUCCESSFULLY in {time.time() - start_time:.2f}s")

# **IMPORTANT**: 
Remember to upload your model to HuggingFace!
**If you're using LoRA, make sure to merge the LoRA adapters into the base model before pushing to HuggingFace. Otherwise, evaluation will automatically be carried out on just the base model.**
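
The merge step can be sketched as follows. This is a minimal sketch, not the official submission script: `merge_and_export` is a hypothetical helper, and it assumes `peft_model` is a PEFT LoRA model (which provides `merge_and_unload()`) and that the tokenizer and merged model expose the usual `save_pretrained` and `push_to_hub` methods. Merging into a 4-bit quantized base can introduce rounding differences, so one common approach is to reload the base model in half precision before attaching the adapter; the model ID, adapter path, and repository name in the commented usage are placeholders.

```python
def merge_and_export(peft_model, tokenizer, output_dir, repo_id=None):
    """Fold LoRA weights into the base model, save locally, and optionally push to the Hub."""
    # merge_and_unload() returns the base model with the adapter weights merged in
    merged = peft_model.merge_and_unload()
    merged.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    if repo_id is not None:
        # Requires a prior `huggingface-cli login` with a write-access token
        merged.push_to_hub(repo_id)
        tokenizer.push_to_hub(repo_id)
    return merged


# Hypothetical usage (placeholders, not run here):
# import torch
# from transformers import AutoModelForCausalLM, AutoTokenizer
# from peft import PeftModel
# base = AutoModelForCausalLM.from_pretrained("<base-model-id>", torch_dtype=torch.float16)
# tok = AutoTokenizer.from_pretrained("<base-model-id>")
# peft_model = PeftModel.from_pretrained(base, "./outputs/dpo_lora_adapter")
# merge_and_export(peft_model, tok, "./outputs/merged_model", repo_id="<User_Name>/<Repo_Name>")
```

If you stay within Unsloth, its `save_pretrained_merged` and `push_to_hub_merged` helpers appear to cover the same step; check the Unsloth documentation for the exact arguments.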