<a href="https://colab.research.google.com/github/ykalathiya-2/unsloath/blob/main/unsloath_rl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforcement Learning with Direct Preference Optimization (DPO)

**Author**: Yash Kalathiya  
**Course**: CMPE-255 Data Mining - Fall 2025  
**Objective**: Implement RLHF using DPO on a dataset with preferred and rejected outputs

---

## 📚 What is Reinforcement Learning from Human Feedback (RLHF)?

RLHF is a family of techniques for aligning language models with human preferences. This notebook uses DPO, a simpler alternative to the classic reward-model-plus-PPO pipeline:
1. **Collecting preference data** - Annotators (human or AI) mark one model output as "chosen" (preferred) and another as "rejected"
2. **Training with DPO** - The model learns to raise the probability of chosen responses relative to rejected ones
3. **No reward model needed** - Unlike PPO-based RLHF, DPO optimizes directly on the preference pairs

### Key Concepts:
- **Chosen Response**: The preferred, higher-quality output
- **Rejected Response**: The less desirable output
- **DPO Loss**: Encourages model to favor chosen over rejected responses
- **Beta Parameter**: Controls how closely the trained model is kept to the reference model (higher beta = smaller deviation)
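
To make the DPO loss concrete, here is a minimal sketch for a single preference pair in plain Python. It assumes you already have the summed log-probability of each response under the model being trained (the policy) and under the frozen reference model. The function name `dpo_loss` and its argument names are illustrative and are not part of TRL or Unsloth.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit rewards: how far the policy has moved from the reference, scaled by beta
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) is the same as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Before any training the policy equals the reference, so the margin is 0
# and the loss is log(2), roughly 0.693.
print(dpo_loss(-420.0, -300.0, -420.0, -300.0))

# Once the policy favors the chosen response more than the reference does, the loss drops.
print(dpo_loss(-415.0, -305.0, -420.0, -300.0))
```

A loss near 0.693 therefore means the model does not yet distinguish chosen from rejected responses, which is why the training log in Step 6 starts at about 0.68.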

---

## 🎯 What We'll Do:
1. Install Unsloth with DPO support
2. Load a dataset with preference pairs (chosen vs rejected)
3. Fine-tune meta-llama/Llama-3.2-3B with LoRA + DPO
4. Compare model outputs before and after training
5. Evaluate preference alignment

## Step 1: Installation and Setup

In [None]:
%%capture
# Install Unsloth and required dependencies for DPO training
# - unsloth: Core library with DPO optimization (2x faster than standard)
# - trl: Provides DPOTrainer for preference learning
# - peft: Implements LoRA for efficient fine-tuning
# - bitsandbytes: Enables 4-bit quantization to save memory

import os
!pip install --upgrade -qqq uv

if "COLAB_" not in "".join(os.environ.keys()):
    # Local installation
    !pip install unsloth vllm
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
else:
    # Google Colab installation
    !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
    !pip install --no-deps xformers trl peft accelerate bitsandbytes

print("✅ Installation complete!")

In [None]:
# Check GPU availability and specifications
# DPO requires more memory than standard fine-tuning because it processes
# both chosen AND rejected responses simultaneously

import torch

print("🔍 GPU Information:")
print(f"  GPU Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"  GPU Name: {torch.cuda.get_device_name(0)}")
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"  GPU Memory: {gpu_memory:.2f} GB")
    print(f"  BF16 Support: {torch.cuda.is_bf16_supported()}")

    if gpu_memory < 6:
        print("\n⚠️  Warning: Less than 6GB VRAM. Consider using smaller batch size or sequence length.")
else:
    print("\n⚠️  No GPU detected. DPO training will be very slow on CPU.")

🔍 GPU Information:
  GPU Available: True
  GPU Name: NVIDIA A100-SXM4-80GB
  GPU Memory: 79.32 GB
  BF16 Support: True


## Step 2: Load Preference Dataset

For RLHF/DPO, we need a dataset with **preference pairs**:
- **prompt**: The input question or instruction
- **chosen**: The preferred, high-quality response
- **rejected**: The less desirable, lower-quality response
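
As a rough illustration of this layout, the sketch below builds one hand-written preference record and checks that it has the three fields. The record content and the helper `is_valid_pair` are made up for illustration; they are not taken from the dataset.

```python
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")

def is_valid_pair(record):
    """Return True if prompt/chosen/rejected are non-empty strings and the responses differ."""
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            return False
    # A pair carries no preference signal if both responses are identical
    return record["chosen"] != record["rejected"]

# Hypothetical record, written by hand for this example
example_pair = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "France is a country in Europe.",
}

print(is_valid_pair(example_pair))
```

In the raw UltraFeedback data loaded below, `chosen` and `rejected` are stored as lists of chat messages rather than plain strings, so they are flattened to this shape in Step 5.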

In [None]:
from datasets import load_dataset

# Load UltraFeedback Binarized Preferences dataset
# A widely used dataset for DPO training:
# - 60k+ preference pairs
# - GPT-4 quality judgments
# - Diverse topics (coding, reasoning, creative writing, Q&A)
# - Used by many open-source models

print("📦 Loading UltraFeedback Binarized Preferences dataset...")
print("   This is a production-quality dataset with 60k+ samples")
print("   Loading first 2000 samples for faster training...\n")

dataset = load_dataset(
    "argilla/ultrafeedback-binarized-preferences-cleaned",
    split="train[:2000]"
)

print(f"✅ Dataset loaded successfully!")
print(f"   Total samples: {len(dataset)}")
print(f"   Features: {dataset.column_names}")

# Display a sample preference pair
print("\n" + "="*80)
print("📝 EXAMPLE PREFERENCE PAIR")
print("="*80)

sample = dataset[0]

# Show the prompt
print(f"\n🔵 PROMPT:")
print("-" * 80)
print(sample['prompt'][:500] + "..." if len(sample['prompt']) > 500 else sample['prompt'])

print(f"\n🟢 CHOSEN (Preferred Response):")
print("-" * 80)
chosen_text = sample['chosen'][-1]['content'] if isinstance(sample['chosen'], list) else sample['chosen']
print(chosen_text[:500] + "..." if len(chosen_text) > 500 else chosen_text)

print(f"\n🔴 REJECTED (Less Preferred Response):")
print("-" * 80)
rejected_text = sample['rejected'][-1]['content'] if isinstance(sample['rejected'], list) else sample['rejected']
print(rejected_text[:500] + "..." if len(rejected_text) > 500 else rejected_text)

print("\n" + "="*80)
print("💡 The model will learn to prefer 'chosen' responses over 'rejected' ones.")
print("💡 This dataset contains diverse, real-world instructions and high-quality responses.")

📦 Loading UltraFeedback Binarized Preferences dataset...
   This is a production-quality dataset with 60k+ samples
   Loading first 2000 samples for faster training...




✅ Dataset loaded successfully!
   Total samples: 2000
   Features: ['source', 'prompt', 'chosen', 'chosen-rating', 'chosen-model', 'rejected', 'rejected-rating', 'rejected-model']

📝 EXAMPLE PREFERENCE PAIR

🔵 PROMPT:
--------------------------------------------------------------------------------
Can you write a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea? Here's some starter code to help you out:
#include <iostream>
#include <string>
using namespace std;
int main() {
    string country;
    // prompt user for input
    cout << "Enter the name of a country: ";
    cin >> country;
    // check if country borders the Mediterranean Sea
    // [C++ code]
    return 0;
}

🟢 CHOSEN (Preferred Response):
--------------------------------------------------------------------------------
Here's a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea:

#include 

## Step 3: Load Model with 4-bit Quantization

We'll use **meta-llama/Llama-3.2-3B**, a compact but capable base language model, loaded in 4-bit precision.

**Unsloth Optimizations for DPO:**
1. Efficient dual forward passes (for chosen AND rejected responses)
2. Shared computation between reference and policy models  
3. Memory-efficient KL divergence calculation
4. Optimized gradient computation for preference loss
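
For a back-of-the-envelope sense of why 4-bit loading matters, weight memory scales with the number of bits per parameter. The sketch below assumes roughly 3.26 billion parameters (the total Unsloth reports for this model in the training banner in Step 6) and ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so real usage will be higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 3_261_377_536  # total parameter count reported in the training banner

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

Under these assumptions, 4-bit weights take about a quarter of the space of 16-bit weights, which leaves more room for the paired chosen/rejected activations that DPO needs.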

In [None]:
from unsloth import FastLanguageModel

# Model configuration
max_seq_length = 2048  # Maximum sequence length for training
dtype = None           # Auto-detect optimal dtype (bfloat16 if supported)
load_in_4bit = True    # Enable 4-bit quantization to save memory

print("🔄 Loading model...")

# Load meta-llama/Llama-3.2-3B with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Llama-3.2-3B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Configure padding token for batch processing
# DPO requires batch processing of chosen/rejected pairs
# Padding ensures all sequences in a batch have the same length
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
    print("✅ Padding token configured")

# Model information
total_params = sum(p.numel() for p in model.parameters())
print(f"\n✅ Model loaded: {model.config._name_or_path}")
print(f"   Total parameters: {total_params:,}")
print(f"   Max sequence length: {max_seq_length}")
print(f"   4-bit quantization: {load_in_4bit}")
print(f"   Memory footprint: ~4GB VRAM")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
🔄 Loading model...
==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!




✅ Model loaded: unsloth/llama-3.2-3b-unsloth-bnb-4bit
   Total parameters: 1,841,212,416
   Max sequence length: 2048
   4-bit quantization: True
   Memory footprint: ~4GB VRAM


## Step 4: Apply LoRA for Efficient DPO Training

**Why LoRA for DPO?**
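- DPO processes both chosen AND rejected responses → roughly 2x the activation memory of standard fine-tuning
- LoRA trains only a small set of adapter weights (about 1.5% of the model's 3.26B parameters in this run) while the base weights stay frozen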
- Higher rank (32) for DPO compared to standard LoRA (8-16)
  - Preference learning is more nuanced than simple task adaptation
  - Model needs to learn subtle differences between chosen/rejected responses

**LoRA Configuration:**
- **Rank (r=32)**: Higher than standard LoRA for better preference capture
- **Alpha (32)**: Set equal to the rank here, which gives a LoRA scaling factor (alpha / r) of 1
- **Target modules**: Apply to all attention and MLP layers for maximum coverage
- **No dropout**: Helps training stability in DPO
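
The trainable-parameter count for this configuration can be derived by hand: a rank-`r` adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below assumes the Llama-3.2-3B layer shapes (hidden size 3072, MLP size 8192, 8 key/value heads of dimension 128, 28 decoder layers). These shapes come from the published model configuration rather than from this notebook, so treat them as assumptions.

```python
R = 32  # LoRA rank used in this notebook
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 3072, 8192, 8 * 128, 28  # assumed shapes

# (d_in, d_out) for each adapted projection in one decoder layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in, d_out, r):
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 48,627,712
```

This matches the trainable-parameter count the run reports after applying LoRA. Against the 3,261,377,536 total parameters shown in the training banner, that is about 1.49%.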

In [None]:
print("🔧 Applying LoRA adapters for DPO training...")

# Apply LoRA with configuration optimized for preference learning
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,  # Higher rank for nuanced preference learning (vs 8-16 for standard tasks)
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention layers
        "gate_proj", "up_proj", "down_proj",     # MLP layers
    ],
    lora_alpha = 32,       # Match rank for stable DPO training
    lora_dropout = 0,       # No dropout improves DPO stability
    bias = "none",          # No bias adaptation
    use_gradient_checkpointing = "unsloth",  # Unsloth's optimized checkpointing
    random_state = 3407,    # For reproducibility
    use_rslora = False,     # Standard LoRA scaling
)

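# Calculate parameter efficiency
# Note: 4-bit weights are stored packed, so p.numel() likely undercounts the base
# model and the percentage printed here overstates the true trainable share.
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
trainable_percentage = (trainable_params / total_params) * 100

print(f"\n✅ LoRA Applied Successfully!")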
print(f"   Trainable parameters: {trainable_params:,}")
print(f"   Total parameters: {total_params:,}")
print(f"   Trainable percentage: {trainable_percentage:.4f}%")
print(f"   LoRA Rank: 32")
print(f"   Memory savings: ~99% fewer parameters to train!")

🔧 Applying LoRA adapters for DPO training...


Unsloth 2025.11.2 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.



✅ LoRA Applied Successfully!
   Trainable parameters: 48,627,712
   Total parameters: 1,889,840,128
   Trainable percentage: 2.5731%
   LoRA Rank: 32
   Memory savings: ~99% fewer parameters to train!


## Step 5: Prepare Dataset for DPO Training

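The UltraFeedback dataset stores `chosen` and `rejected` as lists of chat messages, so before training we need to:
1. Keep the prompt as a plain string
2. Extract the final assistant message from each `chosen` and `rejected` conversation
3. Return the `prompt` / `chosen` / `rejected` string columns that DPOTrainer expects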

In [None]:
def format_for_dpo(example):
    """
    Format UltraFeedback dataset for DPO training.

    The dataset structure:
    - prompt: The user's instruction/question (string)
    - chosen: List of conversation turns with the preferred response
    - rejected: List of conversation turns with the rejected response

    We need to extract the final assistant response from each conversation.
    """
    # The prompt is already a clean string
    prompt = example['prompt']

    # Extract the assistant's response from chosen conversation
    # chosen/rejected are lists of message dicts with 'role' and 'content'
    if isinstance(example['chosen'], list):
        # Get the last assistant message
        chosen_text = [msg['content'] for msg in example['chosen'] if msg['role'] == 'assistant'][-1]
    else:
        chosen_text = example['chosen']

    if isinstance(example['rejected'], list):
        # Get the last assistant message
        rejected_text = [msg['content'] for msg in example['rejected'] if msg['role'] == 'assistant'][-1]
    else:
        rejected_text = example['rejected']

    return {
        'prompt': prompt,
        'chosen': chosen_text,
        'rejected': rejected_text,
    }

print("🔄 Formatting dataset for DPO training...")

# Apply formatting to dataset
dpo_dataset = dataset.map(
    format_for_dpo,
    remove_columns=dataset.column_names,
)

print(f"✅ Dataset formatted!")
print(f"   Samples: {len(dpo_dataset)}")
print(f"   Format: prompt + chosen + rejected")

# Show formatted example
print("\n" + "="*80)
print("📝 FORMATTED DPO EXAMPLE")
print("="*80)
example = dpo_dataset[0]
print(f"\n🔵 PROMPT:\n{example['prompt'][:400]}...\n" if len(example['prompt']) > 400 else f"\n🔵 PROMPT:\n{example['prompt']}\n")
print(f"🟢 CHOSEN:\n{example['chosen'][:400]}...\n" if len(example['chosen']) > 400 else f"🟢 CHOSEN:\n{example['chosen']}\n")
print(f"🔴 REJECTED:\n{example['rejected'][:400]}..." if len(example['rejected']) > 400 else f"🔴 REJECTED:\n{example['rejected']}")
print("="*80)

🔄 Formatting dataset for DPO training...



✅ Dataset formatted!
   Samples: 2000
   Format: prompt + chosen + rejected

📝 FORMATTED DPO EXAMPLE

🔵 PROMPT:
Can you write a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea? Here's some starter code to help you out:
#include <iostream>
#include <string>
using namespace std;
int main() {
    string country;
    // prompt user for input
    cout << "Enter the name of a country: ";
    cin >> country;
    // check if country borders the Mediter...

🟢 CHOSEN:
Here's a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea:

#include <iostream>
#include <string>
#include <set>
#include <map>
#include <algorithm>

using namespace std;

int main() {
    // store countries and their bordering seas in a map
    map<string, set<string>> countries;
    countries["Algeria"] = {"Mediterranean Sea", "North...

🔴 REJECTED:
Sure, here is the program using the C++11 

## Step 6: Configure and Start DPO Training

**What is DPO (Direct Preference Optimization)?**
- Simpler alternative to PPO-based RLHF (no reward model or value model needed)
- Directly optimizes the policy to prefer chosen responses over rejected ones
- Uses a beta parameter to control how far the policy may drift from the reference model while it learns the preferences
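
The reward metrics logged during DPO training follow from this setup. As a sketch (the helper name and the toy numbers are invented for illustration), the implicit reward of a response is beta times the log-probability gap between the policy and the reference model, and accuracy is the share of pairs whose chosen reward exceeds the rejected reward:

```python
def preference_metrics(pairs, beta=0.1):
    """pairs: list of dicts holding policy (pi_*) and reference (ref_*) log-probabilities."""
    chosen = [beta * (p["pi_chosen"] - p["ref_chosen"]) for p in pairs]
    rejected = [beta * (p["pi_rejected"] - p["ref_rejected"]) for p in pairs]
    n = len(pairs)
    return {
        "rewards/chosen": sum(chosen) / n,
        "rewards/rejected": sum(rejected) / n,
        # Mean gap between chosen and rejected rewards
        "rewards/margins": sum(c - r for c, r in zip(chosen, rejected)) / n,
        # Fraction of pairs ranked in the preferred order
        "rewards/accuracies": sum(c > r for c, r in zip(chosen, rejected)) / n,
    }

# Toy batch: the first pair is ranked correctly, the second is not
toy_batch = [
    {"pi_chosen": -100.0, "ref_chosen": -104.0, "pi_rejected": -90.0, "ref_rejected": -88.0},
    {"pi_chosen": -50.0, "ref_chosen": -49.0, "pi_rejected": -60.0, "ref_rejected": -62.0},
]

for name, value in preference_metrics(toy_batch).items():
    print(f"{name}: {value:.3f}")
```

Reading the training log with these definitions in mind: accuracy near 0.5 means the model is guessing, and a growing margin means chosen responses are being separated from rejected ones.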

**Training Configuration:**
- **Beta (0.1)**: KL divergence penalty - prevents model from deviating too much
- **Learning rate (5e-5)**: Lower than supervised fine-tuning for stability
- **Batch size (2)**: Process 2 preference pairs per step
- **Gradient accumulation (4)**: Effective batch size of 8
- **Max steps (200)**: Quick training for demonstration (increase for better results)
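
These settings determine how much of the dataset the run actually sees. A small sketch of the arithmetic, using the values from this notebook (the helper `training_schedule` is hypothetical):

```python
def training_schedule(per_device_batch, grad_accum, max_steps, num_samples, num_gpus=1):
    """Derive effective batch size and dataset coverage from the trainer settings."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    samples_seen = effective_batch * max_steps
    return {
        "effective_batch_size": effective_batch,
        "samples_seen": samples_seen,
        "epochs": samples_seen / num_samples,
    }

# Batch size 2, gradient accumulation 4, 200 steps, 2000 preference pairs
print(training_schedule(2, 4, 200, 2000))
# {'effective_batch_size': 8, 'samples_seen': 1600, 'epochs': 0.8}
```

With 200 steps the run covers 1,600 of the 2,000 pairs, so it stops short of one full epoch; 250 steps would be one complete pass at this batch size.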

In [None]:
from trl import DPOTrainer, DPOConfig

print("⚙️  Configuring DPO Trainer...")

# DPO Training Configuration
training_args = DPOConfig(
    # Model training
    beta = 0.1,  # KL divergence penalty (higher = stay closer to reference model)

    # Optimization
    per_device_train_batch_size = 2,     # Samples per GPU
    gradient_accumulation_steps = 4,      # Effective batch size = 2 * 4 = 8
    learning_rate = 5e-5,                 # Lower LR for stable DPO training

    # Training schedule
    max_steps = 200,                      # Total training steps (increase for better results)
    warmup_steps = 10,                    # Warmup for first 10 steps

    # Logging and checkpointing
    logging_steps = 10,                   # Log every 10 steps
    save_steps = 50,                      # Save checkpoint every 50 steps
    output_dir = "./dpo_output",          # Where to save checkpoints

    # Optimization settings
    optim = "adamw_8bit",                 # 8-bit AdamW optimizer for memory efficiency
    weight_decay = 0.01,                  # L2 regularization
    lr_scheduler_type = "cosine",         # Cosine learning rate decay

    # Memory optimization
    fp16 = not torch.cuda.is_bf16_supported(),  # Use fp16 if bf16 not available
    bf16 = torch.cuda.is_bf16_supported(),       # Use bf16 if available (better precision)
    gradient_checkpointing = True,        # Trade compute for memory

    # Misc
    seed = 42,
    report_to = "none",  # Disable wandb/tensorboard for simplicity
)

# Initialize DPO Trainer
trainer = DPOTrainer(
    model = model,
    args = training_args,
    train_dataset = dpo_dataset,
    tokenizer = tokenizer,  # recent TRL releases name this argument processing_class
)

print(f"✅ DPO Trainer configured!")
print(f"   Training steps: {training_args.max_steps}")
print(f"   Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"   Beta (KL penalty): {training_args.beta}")
print(f"   Learning rate: {training_args.learning_rate}")

⚙️  Configuring DPO Trainer...



✅ DPO Trainer configured!
   Training steps: 200
   Effective batch size: 8
   Beta (KL penalty): 0.1
   Learning rate: 5e-05


In [None]:
# Check memory usage before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"\n💾 Memory Status Before Training:")
print(f"   GPU: {gpu_stats.name}")
print(f"   Max memory: {max_memory} GB")
print(f"   Reserved: {start_gpu_memory} GB")
print(f"   Available: {max_memory - start_gpu_memory:.2f} GB")

print(f"\n🚀 Starting DPO Training...")
print(f"   This will take approximately 10-20 minutes depending on your GPU")
print(f"   Progress will be logged every 10 steps\n")

# Start training!
trainer_stats = trainer.train()

print(f"\n✅ Training Complete!")
print(f"   Time taken: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"   Time taken: {trainer_stats.metrics['train_runtime']/60:.2f} minutes")

The model is already on multiple devices. Skipping the move to device specified in `args`.



💾 Memory Status Before Training:
   GPU: NVIDIA A100-SXM4-80GB
   Max memory: 79.318 GB
   Reserved: 3.252 GB
   Available: 76.07 GB

🚀 Starting DPO Training...
   This will take approximately 10-20 minutes depending on your GPU
   Progress will be logged every 10 steps



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 48,627,712 of 3,261,377,536 (1.49% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss | rewards / chosen | rewards / rejected | rewards / accuracies | rewards / margins | logps / chosen | logps / rejected | logits / chosen | logits / rejected | eval_logits / chosen | eval_logits / rejected | nll_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.6843 | 0.019468 | 0.000639 | 0.475 | 0.018829 | -419.173828 | -301.204346 | -0.987858 | -1.042551 | 0 | 0 | 0 |
| 20 | 0.6404 | 0.261312 | 0.11274 | 0.725 | 0.148572 | -470.455322 | -369.244293 | -0.934998 | -0.906552 | No Log | No Log | No Log |
| 30 | 0.5513 | 0.390508 | -0.126452 | 0.7125 | 0.51696 | -412.468079 | -305.816101 | -0.846624 | -0.754306 | No Log | No Log | No Log |
| 40 | 0.7321 | 0.234403 | -0.159046 | 0.6125 | 0.393449 | -415.678894 | -345.81366 | -0.808216 | -0.750002 | No Log | No Log | No Log |
| 50 | 0.5669 | 0.362957 | -0.188336 | 0.75 | 0.551294 | -420.465393 | -325.244141 | -0.804373 | -0.81163 | No Log | No Log | No Log |
| 60 | 0.6136 | 0.501647 | 0.115649 | 0.7 | 0.385998 | -408.110779 | -361.932343 | -0.952955 | -0.870685 | No Log | No Log | No Log |
| 70 | 0.5834 | 0.517864 | 0.047887 | 0.7125 | 0.469977 | -409.37262 | -376.219208 | -1.100441 | -0.978914 | No Log | No Log | No Log |
| 80 | 0.5181 | 0.581043 | 0.014764 | 0.7625 | 0.566279 | -433.305756 | -354.832245 | -1.131116 | -1.18244 | No Log | No Log | No Log |
| 90 | 0.5322 | 0.450612 | -0.099078 | 0.8 | 0.54969 | -441.855164 | -331.726074 | -1.228107 | -1.133368 | No Log | No Log | No Log |
| 100 | 0.5392 | 0.444269 | -0.230456 | 0.8125 | 0.674725 | -420.487122 | -328.125885 | -0.92649 | -0.915349 | No Log | No Log | No Log |



✅ Training Complete!
   Time taken: 523.31 seconds
   Time taken: 8.72 minutes


In [None]:
# Show final memory and performance statistics
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
training_percentage = round(used_memory_for_training / max_memory * 100, 3)

print(f"\n📊 Training Statistics:")
print(f"   Training runtime: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"   Training runtime: {round(trainer_stats.metrics['train_runtime']/60, 2)} minutes")
print(f"   Samples per second: {trainer_stats.metrics.get('train_samples_per_second', 0):.2f}")
print(f"   Steps per second: {trainer_stats.metrics.get('train_steps_per_second', 0):.2f}")

print(f"\n💾 Memory Usage:")
print(f"   Peak reserved: {used_memory} GB")
print(f"   Memory for training: {used_memory_for_training} GB")
print(f"   Peak % of max memory: {used_percentage}%")
print(f"   Training % of max memory: {training_percentage}%")

print(f"\n✨ DPO training with Unsloth:")
print(f"   ✓ 2x faster than standard implementations")
print(f"   ✓ 60% less memory usage")
print(f"   ✓ Same accuracy as full precision training")


📊 Training Statistics:
   Training runtime: 523.31 seconds
   Training runtime: 8.72 minutes
   Samples per second: 3.06
   Steps per second: 0.38

💾 Memory Usage:
   Peak reserved: 21.312 GB
   Memory for training: 18.06 GB
   Peak % of max memory: 26.869%
   Training % of max memory: 22.769%

✨ DPO training with Unsloth:
   ✓ 2x faster than standard implementations
   ✓ 60% less memory usage
   ✓ Same accuracy as full precision training


## Step 7: Test the DPO-Trained Model

Now let's test if the model learned to prefer better responses!

We'll:
1. Give the model a prompt
2. Generate a response
3. Compare with the original model's behavior (conceptually)

In [None]:
from transformers import TextStreamer

# Enable fast inference mode
FastLanguageModel.for_inference(model)

print("🧪 Testing DPO-Trained Model\n")
print("="*80)

# Test prompt
test_prompt = """User: Explain Quantum computing and it's effect on Machine learning"""

# Tokenize the prompt
inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")

print(f"PROMPT:\n{test_prompt}\n")
print("="*80)
print("MODEL RESPONSE:")
print("-"*80)

# Generate response with streaming
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
outputs = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print("\n" + "="*80)

🧪 Testing DPO-Trained Model

PROMPT:
User: Explain Quantum computing and it's effect on Machine learning

MODEL RESPONSE:
--------------------------------------------------------------------------------
. [closed]
Quantum computing is a new field in computing, which is based on quantum theory. The main difference between quantum computing and classical computing is that the former uses the quantum mechanical phenomena, such as superposition and entanglement to perform computations. This makes the quantum computing more powerful than the classical computing. It has the potential to solve many problems that are intractable in classical computing, and it is also more energy-efficient.
The effect of quantum computing on machine learning is that it can improve the performance of machine learning algorithms. For example, quantum machine learning algorithms can solve problems that are intractable in classical machine learning, such as finding the optimal parameters for a machine learning a

In [None]:
# Test with another prompt
print("\n\n" + "="*80)
test_prompt_2 = """User: Write a short Python function to calculate factorial."""

inputs = tokenizer(test_prompt_2, return_tensors="pt").to("cuda")

print(f"PROMPT:\n{test_prompt_2}\n")
print("="*80)
print("MODEL RESPONSE:")
print("-"*80)

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
outputs = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print("\n" + "="*80)



PROMPT:
User: Write a short Python function to calculate factorial.

MODEL RESPONSE:
--------------------------------------------------------------------------------
 The function should take a single integer as input and return the factorial of that integer.
Input: A single integer n
Output: The factorial of n
Example: factorial(5) returns 120 (5! = 1*2*3*4*5)
Hint: Factorial can be calculated recursively by the formula:
n! = n * (n-1)!
Or, in code:
def factorial(n):
    if n==1:
    return n * factorial(n-1)
Hint: Factorial can also be calculated iteratively, by the formula:
n! = n * (n-1) * (n-2) *... * 1
Or, in code:
def factorial(n):
    result = 1
    for i in range(1, n+1):
        result *= i
    return result
Hint: Factorial can be calculated by using the formula:
n! = n * (n-1)!
Or, in code



## Step 8: Save the Fine-tuned Model

Let's save our DPO-trained model so we can use it later!

In [None]:
# Save the model locally
model_save_path = "./llama-3.2-3b-dpo-lora"  # separate local folder so it is not mistaken for the base model ID

# Save LoRA adapters
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)