# Preference Fine-tuning with Direct Preference Optimization (DPO) using LoRA

## Introduction

This notebook demonstrates how to implement Direct Preference Optimization (DPO) using only Low-Rank Adaptation (LoRA) adapters, so that both training stages update a small fraction of the model's parameters.

Preference fine-tuning optimizes a language model on pairs of preferred and rejected responses, teaching it to produce more aligned, helpful, harmless, and honest responses.

**We'll cover two main steps:**
1. Supervised Fine-Tuning (SFT) with LoRA
2. Direct Preference Optimization (DPO) with LoRA

**Prerequisites:**
- Basic understanding of language models and fine-tuning
- Access to a GPU for training

Let's get started!

## Install Required Packages

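First, let's install the necessary packages for our fine-tuning process. The trainer calls in this notebook pass `tokenizer`, `dataset_text_field`, and `max_seq_length` directly to the trainers; newer TRL releases have renamed or moved some of these arguments, so pin an older `trl` version if you hit unexpected-keyword errors.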

In [None]:
# Install required packages
!pip install -q accelerate peft bitsandbytes transformers trl sentencepiece datasets
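
Because the trainer APIs in these libraries change between releases, it can help to record which versions were actually installed. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_versions` is ours and not part of any of the packages.

```python
from importlib import metadata

PACKAGES = ["accelerate", "peft", "bitsandbytes", "transformers",
            "trl", "sentencepiece", "datasets"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:15s} {version or 'not installed'}")
```

Keeping this output next to your results makes it easier to reproduce a run later.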

## 1. Supervised Fine-Tuning (SFT) with LoRA

Before we can perform preference optimization, we first need to fine-tune our base model on instruction data. We'll use LoRA for parameter-efficient fine-tuning, which significantly reduces the number of trainable parameters.
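
To get a feel for the savings, the sketch below counts the trainable parameters LoRA adds to a single weight matrix and compares that with training the full matrix. The layer width of 2048 is an assumption chosen for illustration, and the rank of 16 matches the value used later in this notebook.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (r, d_in) and B has shape (d_out, r)
    return r * d_in + d_out * r

d_in = d_out = 2048   # assumed layer width, for illustration only
r = 16

full = d_in * d_out
lora = lora_param_count(d_in, d_out, r)
print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA (r={r}):      {lora:,} parameters ({100 * lora / full:.2f}% of full)")
```

The adapter size grows linearly with the rank, while the full matrix grows with the product of its dimensions, which is why the ratio gets more favorable for wider layers.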

### 1.1 Data Preparation for SFT

We'll prepare our dataset for the SFT stage. We'll use a sample of the UltraChat dataset for this demonstration.

In [None]:
from transformers import AutoTokenizer
from datasets import load_dataset

# Load the chat variant's tokenizer only for its chat template;
# the model trained in section 1.2 is the base checkpoint
template_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(template_model_name)

def format_prompt(example):
    """Format the prompt using the chat template the model was trained with"""
    chat = example["messages"]
    prompt = tokenizer.apply_chat_template(chat, tokenize=False)
    return {"text": prompt}

# Load and format the dataset (limit to a small sample for demonstration)
dataset = (
    load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft")
    .shuffle(seed=42)
    .select(range(1_000))  # Using a small sample for demonstration
)
dataset = dataset.map(format_prompt)

In [None]:
# Display an example to verify formatting
print(dataset[0]["text"])
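
If you want to see what a chat template does without downloading a tokenizer, the following hand-rolled sketch renders messages with the same `<|user|>` / `<|assistant|>` markers and `</s>` terminator that section 2.1 uses. It is an approximation for illustration: the function `render_chat` is hypothetical, and the real template is defined by the tokenizer and may differ in whitespace.

```python
EOS = "</s>"  # end-of-sequence marker, as used in section 2.1

def render_chat(messages, add_generation_prompt=False):
    """Hand-rolled approximation of a Zephyr-style chat template."""
    parts = [f"<|{m['role']}|>\n{m['content']}{EOS}\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|assistant|>\n")
    return "".join(parts)

chat = [
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "A parameter-efficient fine-tuning method."},
]
print(render_chat(chat))                                  # full conversation, as used for SFT
print(render_chat(chat[:1], add_generation_prompt=True))  # prompt only, as used for generation
```

For training, every turn (including the assistant's) is rendered; for generation, the prompt ends with an open assistant turn.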

### 1.2 Model Setup with LoRA

Now, let's set up our model with LoRA for efficient fine-tuning.

In [None]:
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# We'll use a small pre-trained model for this demonstration
base_model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

# Load the model to train
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype=torch.float16,  # Use float16 precision for efficiency
)
model.config.use_cache = False  # Disable KV caching for training
model.config.pretraining_tp = 1  # Set tensor parallelism to 1

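# Reuse the tokenizer from section 1.1, which carries the chat template
if "tokenizer" not in globals():
    tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Make sure a padding token is set, whether or not the tokenizer was just loaded
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # Right padding for causal language modeling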

### 1.3 LoRA Configuration

Next, we'll set up our LoRA configuration. LoRA works by adding small trainable rank decomposition matrices to key layers of the model, significantly reducing the number of parameters that need to be trained.

In [None]:
# Define LoRA Configuration
lora_config = LoraConfig(
    r=16,               # Rank of the update matrices
    lora_alpha=32,      # LoRA scaling factor
    lora_dropout=0.05,  # Dropout probability for LoRA layers
    bias="none",        # Whether to train bias parameters
    task_type="CAUSAL_LM",  # Task type (causal language modeling)
    # Target modules to apply LoRA to
    target_modules=[
        "q_proj",     # Query projection
        "k_proj",     # Key projection
        "v_proj",     # Value projection
        "o_proj",     # Output projection
        "gate_proj",  # Gating projection (for MLP blocks)
        "up_proj",    # Upward projection (for MLP blocks)
        "down_proj",  # Downward projection (for MLP blocks)
    ]
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Print trainable parameters information
model.print_trainable_parameters()
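
The sketch below illustrates what `r` and `lora_alpha` do inside an adapted layer, using plain Python lists instead of tensors. It assumes the standard LoRA formulation, where the layer output is `W x + (lora_alpha / r) * B (A x)` with `A` initialized randomly and `B` initialized to zero; the helper names and toy sizes are ours.

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, r, alpha):
    """Compute W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d, r, alpha = 6, 2, 4   # toy sizes; the notebook uses r=16, lora_alpha=32
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen weight
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # trainable, random init
B = [[0.0] * r for _ in range(d)]                               # trainable, zero init
x = [random.gauss(0, 1) for _ in range(d)]

# With B = 0 the adapter contributes nothing, so training starts from the base model
print(lora_forward(W, A, B, x, r, alpha) == matvec(W, x))
```

Both the toy values and the notebook's configuration give a scaling factor of `lora_alpha / r = 2`.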

### 1.4 Training Configuration for SFT

In [None]:
from transformers import TrainingArguments
from trl import SFTTrainer

# Output directory for saving models
output_dir = "./sft_lora_model"

# Configure training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,     # Batch size per GPU
    gradient_accumulation_steps=4,     # Micro-batches to accumulate before each optimizer step
    learning_rate=2e-4,                # Learning rate
    num_train_epochs=1,                # Number of training epochs
    lr_scheduler_type="cosine",        # Learning rate scheduler type
    warmup_ratio=0.1,                  # Warmup ratio
    optim="adamw_torch",              # Optimizer
    logging_steps=50,                  # Log every X steps
    save_strategy="epoch",             # Save at the end of each epoch
    fp16=True,                         # Use mixed precision training
    gradient_checkpointing=True,       # Use gradient checkpointing to save memory
    remove_unused_columns=True,        # Drop the raw text columns once the dataset is tokenized
)

# Create SFT trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=512,    # Maximum sequence length
    # No peft_config here: the model was already wrapped by get_peft_model above
)
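
The batch settings above interact with the dataset size, so it is worth checking how many optimizer steps the run will actually take. The arithmetic below is a rough sketch that assumes a single GPU and no sequence packing; the exact rounding depends on the Trainer version.

```python
import math

per_device_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1            # assumption: a single GPU
num_examples = 1_000
warmup_ratio = 0.1

effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
warmup_steps = math.ceil(warmup_ratio * steps_per_epoch)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: ~{steps_per_epoch}")
print(f"warmup steps: ~{warmup_steps}")
```

With only about 63 optimizer steps in the epoch, `logging_steps=50` produces a single log entry; lower it if you want to watch the loss curve.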

### 1.5 Train the Model with SFT

In [None]:
# Train the model
trainer.train()

# Save the trained LoRA adapters
model.save_pretrained(output_dir)

### 1.6 Test the SFT Model

Let's see how our SFT model performs before moving to the DPO stage.

In [None]:
from peft import AutoPeftModelForCausalLM
from transformers import pipeline

# Create a test prompt in the model's chat format
test_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What are the key benefits of fine-tuning with LoRA?"}],
    tokenize=False,
    add_generation_prompt=True  # End the prompt with the assistant tag so the model answers
)

# Load the fine-tuned LoRA model
sft_model = AutoPeftModelForCausalLM.from_pretrained(
    output_dir,
    device_map="auto",
    torch_dtype=torch.float16
)

# Create a text generation pipeline
pipe = pipeline(
    "text-generation",
    model=sft_model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    do_sample=True,  # Needed for temperature and top_p to take effect
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1
)

# Generate a response
response = pipe(test_prompt)[0]["generated_text"]
print(response)

## 2. Direct Preference Optimization (DPO) with LoRA

Now that we have our SFT model, we can move on to preference optimization. DPO trains the model to prefer certain outputs over others based on preference data: pairs of chosen and rejected responses, labeled by humans or by a judge model.
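
For a single preference pair, the DPO loss compares how much the policy has raised the log-probability of the chosen response relative to a frozen reference model with how much it has raised the rejected one. The sketch below implements that formula in plain Python from summed sequence log-probabilities; the numeric inputs are made up for illustration, and `beta=0.1` matches the value used later in this notebook.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from summed sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: the margin is 0 and the loss is log(2)
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted probability toward the chosen response: the loss drops
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

A larger `beta` amplifies the margin, so the same shift away from the reference has a stronger effect on the loss.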

### 2.1 Load Preference Data

We'll use a preference dataset containing pairs of responses where one is preferred over the other.

In [None]:
from datasets import load_dataset

# Function to format data for DPO
def format_dpo_data(example):
    """Format the data for DPO training"""
    # Only emit a system block when the system message is present and non-empty
    system = example["system"] if "system" in example else ""
    system_prompt = f"<|system|>\n{system}</s>\n" if system else ""

    prompt = f"<|user|>\n{example['input']}</s>\n<|assistant|>\n"
    chosen = example['chosen'] + "</s>\n"
    rejected = example['rejected'] + "</s>\n"
    
    return {
        "prompt": system_prompt + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

# Load a preference dataset
dpo_dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

# Filter the dataset to include only high-quality preference pairs
dpo_dataset = dpo_dataset.filter(
    lambda r: 
        r["status"] != "tie" and  # Exclude tied preferences
        r["chosen_score"] >= 8 and  # Include only high-scored chosen responses
        len(r["chosen"]) < 1500 and  # Limit length for efficiency
        len(r["rejected"]) < 1500    # Limit length for efficiency
)

# Format the dataset for DPO
dpo_dataset = dpo_dataset.map(
    format_dpo_data, 
    remove_columns=dpo_dataset.column_names  # Remove original columns
)

# Take a smaller sample for demonstration
dpo_dataset = dpo_dataset.shuffle(seed=42).select(range(2000))

In [None]:
# Display the first example
print("PROMPT:\n", dpo_dataset[0]["prompt"])
print("\nCHOSEN:\n", dpo_dataset[0]["chosen"])
print("\nREJECTED:\n", dpo_dataset[0]["rejected"])
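
To check the filtering and formatting logic without downloading the dataset, you can run it on a couple of hand-written records. The sketch below uses a standalone variant of the formatting function (it skips the system block when the system message is empty) and the same filter conditions; the records and their field values are hypothetical.

```python
def format_pair(record):
    """Standalone variant of the DPO formatting logic, for a dry run."""
    system = record.get("system", "")
    system_prompt = f"<|system|>\n{system}</s>\n" if system else ""
    prompt = f"<|user|>\n{record['input']}</s>\n<|assistant|>\n"
    return {
        "prompt": system_prompt + prompt,
        "chosen": record["chosen"] + "</s>\n",
        "rejected": record["rejected"] + "</s>\n",
    }

def keep(record, min_score=8, max_chars=1500):
    """Same conditions as the dataset filter: no ties, high score, bounded length."""
    return (record["status"] != "tie"
            and record["chosen_score"] >= min_score
            and len(record["chosen"]) < max_chars
            and len(record["rejected"]) < max_chars)

# Hypothetical records; the values are made up for illustration
records = [
    {"system": "", "input": "Name a prime number.", "chosen": "2 is prime.",
     "rejected": "4 is prime.", "status": "unchanged", "chosen_score": 9},
    {"system": "Be brief.", "input": "Capital of France?", "chosen": "Paris.",
     "rejected": "Paris.", "status": "tie", "chosen_score": 9},
]
pairs = [format_pair(r) for r in records if keep(r)]
print(len(pairs))
print(repr(pairs[0]["prompt"]))
```

The tied record is dropped, and the remaining prompt ends with an open assistant turn so that `chosen` and `rejected` read as its two possible continuations.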

### 2.2 Configure LoRA for DPO

Now we'll define a second, smaller LoRA configuration for the DPO phase. The new adapter is trained on top of the SFT model from section 1.

In [None]:
# Define LoRA configuration for DPO
dpo_lora_config = LoraConfig(
    r=8,                # Using a smaller rank for DPO
    lora_alpha=16,      # Scaling factor
    lora_dropout=0.05,  # Dropout probability
    bias="none",        # Don't train bias parameters
    task_type="CAUSAL_LM",
    # Target modules to apply LoRA to (same as SFT)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj", 
        "gate_proj", "up_proj", "down_proj"
    ]
)

### 2.3 DPO Training Configuration

In [None]:
from trl import DPOConfig, DPOTrainer

# Output directory for DPO model
dpo_output_dir = "./dpo_lora_model"

# Configure DPO training arguments
dpo_args = DPOConfig(
    output_dir=dpo_output_dir,
    per_device_train_batch_size=2,     # Smaller batch size due to memory constraints
    gradient_accumulation_steps=4,     # Accumulate gradients to compensate for smaller batch size
    learning_rate=5e-5,                # Lower learning rate for DPO
    lr_scheduler_type="cosine",        # Learning rate scheduler
    max_steps=500,                     # Fixed number of training steps
    warmup_ratio=0.1,                  # Warmup ratio
    optim="adamw_torch",              # Optimizer
    logging_steps=10,                  # Log every X steps
    save_strategy="steps",             # Save strategy
    save_steps=100,                    # Save every X steps
    fp16=True,                         # Use mixed precision training
    gradient_checkpointing=True,       # Use gradient checkpointing
    remove_unused_columns=False,       # Keep all columns
    beta=0.1,                          # DPO beta parameter (controls KL penalty)
    max_prompt_length=512,             # Maximum prompt length
    max_length=1024,                   # Maximum sequence length
)
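
Because this configuration sets `max_steps` rather than a number of epochs, it helps to translate the step budget into passes over the data. The arithmetic below is a small sketch assuming a single GPU and the 2,000-pair sample prepared in section 2.1.

```python
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1          # assumption: a single GPU
max_steps = 500
num_pairs = 2_000
warmup_ratio = 0.1

effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
pairs_seen = effective_batch * max_steps
epochs = pairs_seen / num_pairs
warmup_steps = int(warmup_ratio * max_steps)

print(f"effective batch size: {effective_batch} preference pairs")
print(f"pairs seen in {max_steps} steps: {pairs_seen} (about {epochs:.1f} epochs)")
print(f"warmup steps: {warmup_steps}")
```

Under these assumptions the run makes roughly two passes over the sample; adjust `max_steps` if you change the sample size or batch settings.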

### 2.4 Initialize DPO Trainer

In [None]:
# With ref_model=None and a peft_config, DPOTrainer does not load a second model:
# it computes reference log-probabilities from the same model with the new adapter disabled
dpo_trainer = DPOTrainer(
    model=sft_model,             # Use the SFT model as the starting point
    ref_model=None,              # Reference = this model with the DPO adapter disabled
    args=dpo_args,               # Training arguments
    train_dataset=dpo_dataset,   # Training data
    tokenizer=tokenizer,         # Tokenizer
    peft_config=dpo_lora_config, # LoRA configuration for DPO
)

### 2.5 Train with DPO

In [None]:
# Train the model with DPO
dpo_trainer.train()

# Save the trained LoRA adapters
dpo_trainer.model.save_pretrained(dpo_output_dir)

### 2.6 Merge LoRA Adapters and Test

Finally, let's merge the adapters into the base weights and test our DPO-tuned model against the base and SFT models to see the improvements.

In [None]:
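from peft import PeftModel

def load_base_model():
    """Load a fresh copy of the base model.

    PEFT injects and merges adapters into the model object it is given,
    so each variant we want to compare needs its own copy.
    """
    return AutoModelForCausalLM.from_pretrained(
        base_model_name,
        device_map="auto",
        torch_dtype=torch.float16
    )

# Untouched base model, kept only for comparison
base_model = load_base_model()

# SFT model: base weights with the SFT LoRA adapters merged in
merged_sft_model = PeftModel.from_pretrained(load_base_model(), output_dir).merge_and_unload()

# DPO model: base + SFT adapters, then the DPO adapters merged on top
dpo_base = PeftModel.from_pretrained(load_base_model(), output_dir).merge_and_unload()
merged_dpo_model = PeftModel.from_pretrained(dpo_base, dpo_output_dir).merge_and_unload()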
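
Merging folds an adapter into the weight it modifies, so the merged layer computes the same output without any extra matrices at inference time. The sketch below shows this on a toy 2x2 example, assuming the standard LoRA update `W + (lora_alpha / r) * B A`; the helper names and values are ours. Note that PEFT's `merge_and_unload()` operates on the wrapped model's modules in place, whereas this sketch returns a new matrix.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, r, alpha):
    """Return W + (alpha / r) * B A as a new matrix, leaving W untouched."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (toy values)
A = [[1.0, 2.0]]               # shape (r, d_in) with r = 1
B = [[0.5], [-1.0]]            # shape (d_out, r)
x = [[3.0], [4.0]]             # input as a column vector

W_merged = merge_lora(W, A, B, r=1, alpha=2)

# Output of the unmerged layer: W x + (alpha / r) * B (A x), with alpha / r = 2
adapter_out = [[b[0] + 2 * u[0]]
               for b, u in zip(matmul(W, x), matmul(B, matmul(A, x)))]

print(W_merged)
print(matmul(W_merged, x) == adapter_out)
```

Because the merged weight already contains the update, a second adapter can be loaded and merged on top of it in the same way.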

In [None]:
# Test various prompts to compare
test_prompts = [
    "What are the ethical considerations when developing AI systems?",
    "How can I optimize my study habits for better retention?",
    "What are some strategies for dealing with workplace stress?"
]

# Function to generate responses from a model
def generate_response(model, prompt, tokenizer):
    formatted_prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True  # End the prompt with the assistant tag
    )

    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=256,
        do_sample=True,  # Needed for temperature and top_p to take effect
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1
    )

    return pipe(formatted_prompt)[0]["generated_text"]

# Compare base model, SFT model, and DPO model
for prompt in test_prompts:
    print(f"\n\n===== PROMPT: {prompt} =====")
    
    print("\n----- BASE MODEL RESPONSE -----")
    print(generate_response(base_model, prompt, tokenizer))
    
    print("\n----- SFT MODEL RESPONSE -----")
    print(generate_response(merged_sft_model, prompt, tokenizer))
    
    print("\n----- DPO MODEL RESPONSE -----")
    print(generate_response(merged_dpo_model, prompt, tokenizer))

## Conclusion

In this notebook, we've demonstrated a complete workflow for preference fine-tuning using LoRA:

1. **SFT Stage**: We first performed supervised fine-tuning with LoRA adapters to teach the model to follow instructions.
2. **DPO Stage**: We then applied Direct Preference Optimization with LoRA to improve the model's outputs using pairs of chosen and rejected responses.

The key benefits of this approach include:

- **Efficiency**: LoRA allows us to fine-tune models with minimal computational resources
- **Quality**: DPO helps models produce more preferred outputs without reinforcement learning
- **Modularity**: We can stack multiple LoRA adapters for different tasks

By combining these techniques, we can create more helpful, accurate, and aligned language models with relatively limited computational resources.