# FIPO Focused Learning: Preference Learning Models (DPO/IPO)

## 🎯 Learning Objectives

This notebook focuses on building a deep understanding of **Preference Learning**, one of FIPO's most important contributions. We will:

1. Understand how Direct Preference Optimization (DPO) works
2. Compare DPO with Identity Preference Optimization (IPO)
3. Implement the loss functions and training loops
4. Visualize the preference learning process

## 📚 Paper References

- **Section 2.4**: Fine-tuning Strategies (Equations 9-11)
- **Figure 3**: Strategic Fine-tuning approaches
- **Table 3**: Comparison of different fine-tuning strategies

## 1. Theoretical Foundation

### 1.1 Why Preference Learning?

Instead of learning only from "good examples" (SFT), preference learning learns from both "good" and "bad" examples, which gives the model a clearer signal about quality.

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Tuple, Dict, List
from dataclasses import dataclass
import pandas as pd

# Set style for better visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

In [None]:
@dataclass
class PreferenceExample:
    """Represents a preference learning example"""
    prompt: str
    chosen: str  # xo+ in paper
    rejected: str  # xo- in paper
    
# Example preference pairs from FIPO
examples = [
    PreferenceExample(
        prompt="Calculate the average of [12, 34, 56, 75]",
        chosen="To find the average: 1) Add all numbers: 12+34+56+75=177, 2) Divide by count: 177/4=44.25",
        rejected="Find the average of the given numbers"
    ),
    PreferenceExample(
        prompt="What is the capital of France?",
        chosen="Identify the capital city of France. Provide a direct answer: The capital of France is Paris",
        rejected="Tell me about France's capital"
    )
]

print("Preference Learning Examples:")
for i, ex in enumerate(examples):
    print(f"\nExample {i+1}:")
    print(f"Prompt: {ex.prompt}")
    print(f"✅ Chosen: {ex.chosen}")
    print(f"❌ Rejected: {ex.rejected}")

### 1.2 Mathematical Foundation

FIPO uses preference learning to learn an implicit reward function from human preferences.
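
One common way to formalize this idea is the Bradley-Terry model, in which the probability that the chosen prompt is preferred equals the sigmoid of a reward difference. The notebook does not name this model explicitly, so treat the sketch below as an illustrative assumption; the reward values are hypothetical.

```python
import math

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen item beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

# Hypothetical reward values, for illustration only.
print(preference_probability(2.0, 0.5))  # larger reward gap -> probability above 0.5
print(preference_probability(1.0, 1.0))  # equal rewards -> no preference either way
```

The right-hand plot in the next cell has the same shape, with the reward gap replaced by β · Δ.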

In [None]:
class PreferenceLearningVisualizer:
    """Visualize preference learning concepts"""
    
    @staticmethod
    def visualize_preference_distribution():
        """Visualize how preference learning works"""
        
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
        
        # Left: Quality distribution
        quality_scores = np.linspace(0, 10, 100)
        chosen_dist = np.exp(-(quality_scores - 8)**2 / 2) / np.sqrt(2 * np.pi)
        rejected_dist = np.exp(-(quality_scores - 3)**2 / 2) / np.sqrt(2 * np.pi)
        
        ax1.fill_between(quality_scores, chosen_dist, alpha=0.5, label='Chosen (xo+)', color='green')
        ax1.fill_between(quality_scores, rejected_dist, alpha=0.5, label='Rejected (xo-)', color='red')
        ax1.set_xlabel('Quality Score')
        ax1.set_ylabel('Probability Density')
        ax1.set_title('Quality Distribution of Prompts')
        ax1.legend()
        ax1.grid(True, alpha=0.3)
        
        # Right: Preference probability
        delta = np.linspace(-5, 5, 100)
        beta = 0.1
        preference_prob = 1 / (1 + np.exp(-beta * delta))
        
        ax2.plot(delta, preference_prob, linewidth=3, label=f'β={beta}')
        ax2.axvline(0, color='gray', linestyle='--', alpha=0.5)
        ax2.axhline(0.5, color='gray', linestyle='--', alpha=0.5)
        ax2.set_xlabel('Δ = log p(chosen) - log p(rejected)')
        ax2.set_ylabel('P(chosen > rejected)')
        ax2.set_title('Preference Probability Function')
        ax2.grid(True, alpha=0.3)
        ax2.legend()
        
        plt.tight_layout()
        plt.show()

visualizer = PreferenceLearningVisualizer()
visualizer.visualize_preference_distribution()

## 2. Direct Preference Optimization (DPO)

### 2.1 DPO Loss Function

DPO directly optimizes for human preferences without explicit reward modeling:

$$L_{DPO}(M_o) = -E_{(x_n, \hat{y}_n, y_n, x_o^+, x_o^-) \sim D}[\log \sigma(\beta \cdot \Delta)]$$

where $\Delta = \log \frac{M_o(x_o^+|x_n, \hat{y}_n, y_n)}{M_{ref}(x_o^+|x_n, \hat{y}_n, y_n)} - \log \frac{M_o(x_o^-|x_n, \hat{y}_n, y_n)}{M_{ref}(x_o^-|x_n, \hat{y}_n, y_n)}$
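
Differentiating the per-example loss with respect to $\Delta$ makes the role of $\beta$ explicit. This is standard calculus applied to the formula above, not a result quoted from the paper:

$$\frac{\partial}{\partial \Delta}\left[-\log \sigma(\beta \Delta)\right] = -\beta\,\bigl(1 - \sigma(\beta \Delta)\bigr) = -\beta\,\sigma(-\beta \Delta)$$

The magnitude $\beta\,\sigma(-\beta \Delta)$ approaches $\beta$ when $\Delta$ is very negative and decays toward zero as $\Delta$ grows, so pairs that are already well separated contribute little gradient.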

In [None]:
class DPOLoss(nn.Module):
    """Direct Preference Optimization Loss (Equation 9)"""
    
    def __init__(self, beta: float = 0.01):
        super().__init__()
        self.beta = beta
        
    def forward(
        self,
        chosen_logps: torch.Tensor,     # log p(chosen|prompt)
        rejected_logps: torch.Tensor,   # log p(rejected|prompt)
        reference_chosen_logps: torch.Tensor,   # log p_ref(chosen|prompt)
        reference_rejected_logps: torch.Tensor  # log p_ref(rejected|prompt)
    ) -> Tuple[torch.Tensor, Dict]:
        """Compute DPO loss"""
        
        # Compute log ratios
        chosen_logratios = chosen_logps - reference_chosen_logps
        rejected_logratios = rejected_logps - reference_rejected_logps
        
        # Compute delta
        delta = chosen_logratios - rejected_logratios
        
        # DPO loss
        loss = -F.logsigmoid(self.beta * delta).mean()
        
        # Compute metrics for logging
        with torch.no_grad():
            accuracy = (delta > 0).float().mean()
            
        return loss, {
            'loss': loss.item(),
            'accuracy': accuracy.item(),
            'delta_mean': delta.mean().item(),
            'delta_std': delta.std().item()
        }

# Test DPO loss
dpo_loss = DPOLoss(beta=0.01)

# Simulate log probabilities
batch_size = 4
chosen_logps = torch.randn(batch_size) - 1  # Slightly negative
rejected_logps = torch.randn(batch_size) - 2  # More negative
ref_chosen_logps = torch.randn(batch_size) - 1.5
ref_rejected_logps = torch.randn(batch_size) - 2.5

loss, metrics = dpo_loss(chosen_logps, rejected_logps, ref_chosen_logps, ref_rejected_logps)

print("DPO Loss Computation:")
print(f"Loss: {metrics['loss']:.4f}")
print(f"Accuracy: {metrics['accuracy']:.2%}")
print(f"Delta mean: {metrics['delta_mean']:.4f}")
print(f"Delta std: {metrics['delta_std']:.4f}")

### 2.2 Visualizing DPO Behavior

In [None]:
def visualize_dpo_loss_landscape():
    """Visualize how DPO loss changes with different parameters"""
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # 1. Loss vs Delta for different beta values
    ax = axes[0, 0]
    deltas = np.linspace(-5, 5, 100)
    betas = [0.01, 0.1, 0.5, 1.0]
    
    for beta in betas:
        losses = -np.log(1 / (1 + np.exp(-beta * deltas)))
        ax.plot(deltas, losses, label=f'β={beta}', linewidth=2)
    
    ax.set_xlabel('Δ (log ratio difference)')
    ax.set_ylabel('DPO Loss')
    ax.set_title('DPO Loss vs Delta for Different β')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # 2. Gradient magnitude
    ax = axes[0, 1]
    for beta in betas:
        gradients = beta * np.exp(-beta * deltas) / (1 + np.exp(-beta * deltas))
        ax.plot(deltas, gradients, label=f'β={beta}', linewidth=2)
    
    ax.set_xlabel('Δ (log ratio difference)')
    ax.set_ylabel('Gradient Magnitude')
    ax.set_title('DPO Gradient vs Delta')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # 3. Training dynamics simulation
    ax = axes[1, 0]
    steps = 100
    delta_trajectory = np.zeros(steps)
    delta_trajectory[0] = -2  # Start with rejected > chosen
    
    learning_rate = 0.1
    beta = 0.1
    
    for t in range(1, steps):
        gradient = beta * np.exp(-beta * delta_trajectory[t-1]) / (1 + np.exp(-beta * delta_trajectory[t-1]))
        delta_trajectory[t] = delta_trajectory[t-1] + learning_rate * gradient
    
    ax.plot(delta_trajectory, linewidth=2, color='purple')
    ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel('Training Step')
    ax.set_ylabel('Δ Value')
    ax.set_title('DPO Training Dynamics')
    ax.grid(True, alpha=0.3)
    
    # 4. Preference accuracy vs delta
    ax = axes[1, 1]
    accuracy = 1 / (1 + np.exp(-deltas))
    ax.plot(deltas, accuracy, linewidth=3, color='green')
    ax.axvline(0, color='gray', linestyle='--', alpha=0.5)
    ax.axhline(0.5, color='gray', linestyle='--', alpha=0.5)
    ax.fill_between(deltas[deltas > 0], 0.5, accuracy[deltas > 0], alpha=0.3, color='green')
    ax.set_xlabel('Δ (log ratio difference)')
    ax.set_ylabel('P(chosen > rejected)')
    ax.set_title('Preference Accuracy')
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

visualize_dpo_loss_landscape()
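
The plotting code computes the loss as `-np.log(1 / (1 + np.exp(-beta * deltas)))`, which is fine on the plotted range but can overflow when `beta * delta` is very negative. Below is a minimal sketch of an equivalent, numerically stable form in plain Python; the function name `dpo_loss_stable` is hypothetical.

```python
import math

def dpo_loss_stable(delta: float, beta: float) -> float:
    """-log(sigmoid(beta * delta)), computed as softplus(-beta * delta)."""
    z = -beta * delta
    # softplus(z) = max(z, 0) + log1p(exp(-|z|)); the exponent is never positive
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

naive = -math.log(1.0 / (1.0 + math.exp(-0.1 * 2.0)))
print(abs(dpo_loss_stable(2.0, 0.1) - naive) < 1e-12)  # agrees with the naive formula
print(dpo_loss_stable(-1e6, 1.0))  # the naive formula would overflow in math.exp here
```

In the PyTorch cells this concern is already handled, because `F.logsigmoid` is implemented in a stable way.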

## 3. Identity Preference Optimization (IPO)

### 3.1 IPO Loss Function

IPO is a regularized version of DPO with squared loss:

$$L_{IPO}(M_o) = E_{(x_n, \hat{y}_n, y_n, x_o^+, x_o^-) \sim D}\left[\left(\Delta - \frac{1}{2\beta}\right)^2\right]$$

In [None]:
class IPOLoss(nn.Module):
    """Identity Preference Optimization Loss (Equation 10)"""
    
    def __init__(self, beta: float = 0.01):
        super().__init__()
        self.beta = beta
        
    def forward(
        self,
        chosen_logps: torch.Tensor,
        rejected_logps: torch.Tensor,
        reference_chosen_logps: torch.Tensor,
        reference_rejected_logps: torch.Tensor
    ) -> Tuple[torch.Tensor, Dict]:
        """Compute IPO loss"""
        
        # Compute log ratios
        chosen_logratios = chosen_logps - reference_chosen_logps
        rejected_logratios = rejected_logps - reference_rejected_logps
        
        # Compute delta
        delta = chosen_logratios - rejected_logratios
        
        # IPO loss - squared loss for regularization
        target = 1 / (2 * self.beta)
        loss = ((delta - target) ** 2).mean()
        
        # Compute metrics
        with torch.no_grad():
            accuracy = (delta > 0).float().mean()
            
        return loss, {
            'loss': loss.item(),
            'accuracy': accuracy.item(),
            'delta_mean': delta.mean().item(),
            'delta_std': delta.std().item(),
            'target': target
        }

# Compare DPO and IPO
ipo_loss = IPOLoss(beta=0.01)

# Use same data
ipo_result, ipo_metrics = ipo_loss(chosen_logps, rejected_logps, ref_chosen_logps, ref_rejected_logps)
dpo_result, dpo_metrics = dpo_loss(chosen_logps, rejected_logps, ref_chosen_logps, ref_rejected_logps)

print("Loss Comparison:")
print(f"DPO Loss: {dpo_metrics['loss']:.4f}")
print(f"IPO Loss: {ipo_metrics['loss']:.4f}")
print(f"IPO Target: {ipo_metrics['target']:.4f}")
print(f"\nBoth achieve {ipo_metrics['accuracy']:.2%} preference accuracy")
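
To see how the two objectives differ in practice, the sketch below evaluates both per-example losses at a few values of Δ. It uses β = 0.1, an arbitrary illustrative value, and the helper names are hypothetical. The DPO loss keeps shrinking as Δ grows, whereas the IPO loss is smallest at Δ = 1/(2β) and rises again beyond it.

```python
import math

def dpo_pointwise(delta: float, beta: float) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * delta))."""
    return math.log1p(math.exp(-beta * delta))

def ipo_pointwise(delta: float, beta: float) -> float:
    """Per-example IPO loss: (delta - 1/(2*beta))**2."""
    return (delta - 1.0 / (2.0 * beta)) ** 2

beta = 0.1  # illustrative value, so the IPO target is 1/(2*beta) = 5
for delta in (0.0, 5.0, 10.0):
    print(f"delta={delta:5.1f}  DPO={dpo_pointwise(delta, beta):.4f}  "
          f"IPO={ipo_pointwise(delta, beta):.4f}")
```

This is why IPO is described as targeting a specific margin: pushing Δ past the target is penalized rather than rewarded.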

### 3.2 DPO vs IPO Comparison

In [None]:
def compare_dpo_ipo():
    """Compare DPO and IPO loss functions"""
    
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    
    deltas = np.linspace(-3, 3, 100)
    beta = 0.1
    
    # Row 1: Loss functions
    ax = axes[0, 0]
    dpo_losses = -np.log(1 / (1 + np.exp(-beta * deltas)))
    ipo_losses = (deltas - 1/(2*beta))**2
    
    ax.plot(deltas, dpo_losses, label='DPO', linewidth=2, color='blue')
    ax.plot(deltas, ipo_losses, label='IPO', linewidth=2, color='orange')
    ax.axvline(0, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel('Δ')
    ax.set_ylabel('Loss')
    ax.set_title('Loss Functions')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Gradients
    ax = axes[0, 1]
    # Both curves show dL/dΔ, so DPO's derivative is negative everywhere
    dpo_grads = -beta * np.exp(-beta * deltas) / (1 + np.exp(-beta * deltas))
    ipo_grads = 2 * (deltas - 1/(2*beta))
    
    ax.plot(deltas, dpo_grads, label='DPO', linewidth=2, color='blue')
    ax.plot(deltas, ipo_grads, label='IPO', linewidth=2, color='orange')
    ax.axvline(0, color='gray', linestyle='--', alpha=0.5)
    ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel('Δ')
    ax.set_ylabel('Gradient')
    ax.set_title('Gradient Comparison')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Gradient magnitude
    ax = axes[0, 2]
    ax.plot(deltas, np.abs(dpo_grads), label='DPO', linewidth=2, color='blue')
    ax.plot(deltas, np.abs(ipo_grads), label='IPO', linewidth=2, color='orange')
    ax.set_xlabel('Δ')
    ax.set_ylabel('|Gradient|')
    ax.set_title('Gradient Magnitude')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Row 2: Training dynamics
    ax = axes[1, 0]
    steps = 200
    lr = 0.05
    
    # DPO trajectory
    delta_dpo = np.zeros(steps)
    delta_dpo[0] = -2
    for t in range(1, steps):
        grad = beta * np.exp(-beta * delta_dpo[t-1]) / (1 + np.exp(-beta * delta_dpo[t-1]))
        delta_dpo[t] = delta_dpo[t-1] + lr * grad
    
    # IPO trajectory
    delta_ipo = np.zeros(steps)
    delta_ipo[0] = -2
    for t in range(1, steps):
        grad = -2 * (delta_ipo[t-1] - 1/(2*beta))
        delta_ipo[t] = delta_ipo[t-1] + lr * grad
    
    ax.plot(delta_dpo, label='DPO', linewidth=2, color='blue')
    ax.plot(delta_ipo, label='IPO', linewidth=2, color='orange')
    ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
    ax.axhline(1/(2*beta), color='orange', linestyle='--', alpha=0.5, label='IPO target')
    ax.set_xlabel('Training Step')
    ax.set_ylabel('Δ')
    ax.set_title('Training Trajectories')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Loss over time
    ax = axes[1, 1]
    dpo_loss_traj = -np.log(1 / (1 + np.exp(-beta * delta_dpo)))
    ipo_loss_traj = (delta_ipo - 1/(2*beta))**2
    
    ax.semilogy(dpo_loss_traj, label='DPO', linewidth=2, color='blue')
    ax.semilogy(ipo_loss_traj, label='IPO', linewidth=2, color='orange')
    ax.set_xlabel('Training Step')
    ax.set_ylabel('Loss (log scale)')
    ax.set_title('Loss Convergence')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Beta sensitivity
    ax = axes[1, 2]
    betas = np.logspace(-3, 0, 50)
    delta_test = 1.0
    
    dpo_losses_beta = [-np.log(1 / (1 + np.exp(-b * delta_test))) for b in betas]
    ipo_losses_beta = [(delta_test - 1/(2*b))**2 for b in betas]
    
    ax.loglog(betas, dpo_losses_beta, label='DPO', linewidth=2, color='blue')
    ax.loglog(betas, ipo_losses_beta, label='IPO', linewidth=2, color='orange')
    ax.set_xlabel('β')
    ax.set_ylabel('Loss')
    ax.set_title('β Sensitivity (Δ=1.0)')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

compare_dpo_ipo()

## 4. Practical Implementation

### 4.1 Mini Preference Learning Model

In [None]:
class SimplePromptOptimizer(nn.Module):
    """Simplified prompt optimizer for demonstration"""
    
    def __init__(self, vocab_size: int = 1000, hidden_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)
        
    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        embeds = self.embedding(input_ids)
        lstm_out, _ = self.lstm(embeds)
        logits = self.output(lstm_out)
        return logits
    
    def get_logprobs(self, input_ids: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
        """Get log probabilities for target sequences"""
        logits = self.forward(input_ids)
        log_probs = F.log_softmax(logits, dim=-1)
        
        # Gather log probs for target tokens
        target_log_probs = log_probs.gather(2, target_ids.unsqueeze(-1)).squeeze(-1)
        
        # Sum over sequence length
        return target_log_probs.sum(dim=1)

# Create model and reference model
model = SimplePromptOptimizer()
ref_model = SimplePromptOptimizer()
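ref_model.load_state_dict(model.state_dict())  # Same initialization
ref_model.eval()
for param in ref_model.parameters():
    param.requires_grad_(False)  # Freeze reference model (eval() alone does not disable gradients)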

print("Model architecture:")
print(model)
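
`get_logprobs` sums over every position, which works for the fixed-length dummy batches used in this notebook. With real, variable-length prompts, padded positions would normally be excluded from the sum. A framework-free sketch of that idea follows; the function `masked_sequence_logprob` and the mask convention (1 = real token, 0 = padding) are assumptions for illustration, not part of FIPO.

```python
import math
from typing import List

def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def masked_sequence_logprob(
    logits: List[List[float]], targets: List[int], mask: List[int]
) -> float:
    """Sum log p(target_t) over positions where mask is 1; padding is skipped."""
    total = 0.0
    for step_logits, target, keep in zip(logits, targets, mask):
        if keep:
            total += log_softmax(step_logits)[target]
    return total

# Toy sequence: 3 positions, vocabulary of 2 tokens, last position is padding.
toy_logits = [[0.0, 0.0], [2.0, 0.0], [5.0, -5.0]]
print(masked_sequence_logprob(toy_logits, targets=[0, 0, 1], mask=[1, 1, 0]))
```

In a tensor implementation the same effect is usually obtained by multiplying the per-token log-probabilities by the mask before summing.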

### 4.2 Training Loop with DPO/IPO

In [None]:
class PreferenceTrainer:
    """Trainer for preference learning"""
    
    def __init__(self, model: nn.Module, ref_model: nn.Module, loss_type: str = "DPO", beta: float = 0.01):
        self.model = model
        self.ref_model = ref_model
        self.loss_type = loss_type
        
        if loss_type == "DPO":
            self.loss_fn = DPOLoss(beta)
        else:
            self.loss_fn = IPOLoss(beta)
            
        self.optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        self.history = {'loss': [], 'accuracy': [], 'delta_mean': []}
    
    def create_dummy_batch(self, batch_size: int = 8):
        """Create dummy preference data"""
        seq_len = 10
        vocab_size = 1000
        
        # Random sequences
        prompts = torch.randint(0, vocab_size, (batch_size, seq_len))
        chosen = torch.randint(0, vocab_size, (batch_size, seq_len))
        rejected = torch.randint(0, vocab_size, (batch_size, seq_len))
        
        return prompts, chosen, rejected
    
    def train_step(self):
        """Single training step"""
        self.model.train()
        
        # Get batch
        prompts, chosen, rejected = self.create_dummy_batch()
        
        # Get log probabilities
        with torch.no_grad():
            ref_chosen_logps = self.ref_model.get_logprobs(prompts, chosen)
            ref_rejected_logps = self.ref_model.get_logprobs(prompts, rejected)
        
        chosen_logps = self.model.get_logprobs(prompts, chosen)
        rejected_logps = self.model.get_logprobs(prompts, rejected)
        
        # Compute loss
        loss, metrics = self.loss_fn(chosen_logps, rejected_logps, ref_chosen_logps, ref_rejected_logps)
        
        # Backward pass
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Record metrics
        for key in ['loss', 'accuracy', 'delta_mean']:
            self.history[key].append(metrics[key])
        
        return metrics
    
    def train(self, num_steps: int = 100):
        """Training loop"""
        pbar = range(num_steps)
        
        for step in pbar:
            metrics = self.train_step()
            
            if step % 10 == 0:
                print(f"Step {step}: Loss={metrics['loss']:.4f}, Acc={metrics['accuracy']:.2%}, Δ={metrics['delta_mean']:.3f}")
    
    def plot_training_curves(self):
        """Plot training history"""
        fig, axes = plt.subplots(1, 3, figsize=(15, 4))
        
        axes[0].plot(self.history['loss'], linewidth=2)
        axes[0].set_xlabel('Step')
        axes[0].set_ylabel('Loss')
        axes[0].set_title(f'{self.loss_type} Loss')
        axes[0].grid(True, alpha=0.3)
        
        axes[1].plot(self.history['accuracy'], linewidth=2, color='green')
        axes[1].set_xlabel('Step')
        axes[1].set_ylabel('Preference Accuracy')
        axes[1].set_title('Training Accuracy')
        axes[1].grid(True, alpha=0.3)
        
        axes[2].plot(self.history['delta_mean'], linewidth=2, color='purple')
        axes[2].axhline(0, color='gray', linestyle='--', alpha=0.5)
        axes[2].set_xlabel('Step')
        axes[2].set_ylabel('Δ (mean)')
        axes[2].set_title('Average Delta')
        axes[2].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()

# Train with DPO
print("Training with DPO...")
dpo_trainer = PreferenceTrainer(model, ref_model, loss_type="DPO", beta=0.1)
dpo_trainer.train(num_steps=50)
dpo_trainer.plot_training_curves()

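# Reset model and train with IPO (fresh policy plus a matching frozen reference)
model2 = SimplePromptOptimizer()
ref_model2 = SimplePromptOptimizer()
ref_model2.load_state_dict(model2.state_dict())  # Reference starts from the same weights as the policy
ref_model2.eval()
print("\nTraining with IPO...")
ipo_trainer = PreferenceTrainer(model2, ref_model2, loss_type="IPO", beta=0.1)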
ipo_trainer.train(num_steps=50)
ipo_trainer.plot_training_curves()

## 5. Key Insights from FIPO's Preference Learning

### 5.1 Why IPO Outperforms DPO in FIPO?

In [None]:
def analyze_fipo_results():
    """Analyze FIPO's experimental results from Table 3"""
    
    # Results from paper
    results = pd.DataFrame({
        'Method': ['Naive', 'SFT-70B', 'DPO-70B', 'IPO-70B', 'IPL-DPO-70B', 'IPL-IPO-70B'],
        'GSM8K': [24.77, 21.43, 27.74, 25.00, 25.13, 26.67],
        'BBH': [36.21, 32.92, 35.56, 39.21, 35.25, 39.60],
        'PiQA': [73.35, 74.39, 74.17, 76.84, 74.95, 77.11],
        'CosmosQA': [51.17, 49.97, 54.93, 56.01, 50.46, 56.71],
        'MMLU': [51.22, 51.55, 52.73, 54.29, 52.12, 56.02],
        'Weighted_Avg': [47.79, 46.96, 49.07, 50.94, 48.10, 52.13]
    })
    
    # Visualize results
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Bar plot comparison
    methods = results['Method']
    x = np.arange(len(methods))
    width = 0.35
    
    ax1.bar(x, results['Weighted_Avg'], width, label='Weighted Average', color='skyblue')
    ax1.axhline(results.loc[0, 'Weighted_Avg'], color='red', linestyle='--', alpha=0.5, label='Naive baseline')
    
    ax1.set_xlabel('Method')
    ax1.set_ylabel('Performance (%)')
    ax1.set_title('FIPO Fine-tuning Strategies Comparison')
    ax1.set_xticks(x)
    ax1.set_xticklabels(methods, rotation=45)
    ax1.legend()
    ax1.grid(True, alpha=0.3, axis='y')
    
    # Heatmap of improvements
    improvements = results.iloc[:, 1:-1].values - results.loc[0, results.columns[1:-1]].values
    
    im = ax2.imshow(improvements.T, cmap='RdYlGn', aspect='auto')
    ax2.set_xticks(np.arange(len(methods)))
    ax2.set_yticks(np.arange(len(results.columns[1:-1])))
    ax2.set_xticklabels(methods, rotation=45)
    ax2.set_yticklabels(results.columns[1:-1])
    ax2.set_title('Improvement over Naive Baseline')
    
    # Add text annotations
    for i in range(len(methods)):
        for j in range(len(results.columns[1:-1])):
            text = ax2.text(i, j, f'{improvements[i, j]:.1f}',
                           ha="center", va="center", color="black", fontsize=8)
    
    plt.colorbar(im, ax=ax2)
    plt.tight_layout()
    plt.show()
    
    # Key insights
    print("\n🔍 Key Insights from FIPO Results:")
    print("1. IPO outperforms DPO on every benchmark except GSM8K, and on the weighted average")
    print("2. IPL-IPO achieves best overall performance (52.13%)")
    print("3. SFT alone scores below the naive baseline on the weighted average")
    print("4. Preference learning is crucial for prompt optimization")
    print("\n📊 Performance Rankings:")
    print(results[['Method', 'Weighted_Avg']].sort_values('Weighted_Avg', ascending=False))

analyze_fipo_results()
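
The ranking claims can be checked directly against the table values. The sketch below uses plain Python with the numbers copied from the `results` DataFrame in the cell above; the helper `wins` is hypothetical.

```python
# Scores copied from the results table used in analyze_fipo_results().
scores = {
    "DPO-70B": {"GSM8K": 27.74, "BBH": 35.56, "PiQA": 74.17, "CosmosQA": 54.93, "MMLU": 52.73},
    "IPO-70B": {"GSM8K": 25.00, "BBH": 39.21, "PiQA": 76.84, "CosmosQA": 56.01, "MMLU": 54.29},
    "IPL-DPO-70B": {"GSM8K": 25.13, "BBH": 35.25, "PiQA": 74.95, "CosmosQA": 50.46, "MMLU": 52.12},
    "IPL-IPO-70B": {"GSM8K": 26.67, "BBH": 39.60, "PiQA": 77.11, "CosmosQA": 56.71, "MMLU": 56.02},
}

def wins(a: str, b: str) -> list:
    """Benchmarks on which method `a` scores strictly higher than method `b`."""
    return [bench for bench in scores[a] if scores[a][bench] > scores[b][bench]]

print(wins("IPO-70B", "DPO-70B"))          # every benchmark except GSM8K
print(wins("IPL-IPO-70B", "IPL-DPO-70B"))  # all five benchmarks
```

So the IPO advantage holds on all five benchmarks only in the IPL setting; without IPL, DPO keeps the lead on GSM8K.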

### 5.2 Understanding Beta Parameter Impact

In [None]:
def beta_parameter_study():
    """Study the impact of beta parameter on preference learning"""
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    betas = [0.001, 0.01, 0.1, 1.0]
    colors = plt.cm.viridis(np.linspace(0, 1, len(betas)))
    
    # Effect on optimal delta
    ax = axes[0, 0]
    optimal_deltas_ipo = [1/(2*beta) for beta in betas]
    ax.bar(range(len(betas)), optimal_deltas_ipo, color=colors)
    ax.set_xticks(range(len(betas)))
    ax.set_xticklabels([f'β={b}' for b in betas])
    ax.set_ylabel('Optimal Δ (IPO)')
    ax.set_title('IPO Target Delta vs Beta')
    ax.set_yscale('log')
    ax.grid(True, alpha=0.3, axis='y')
    
    # Learning dynamics
    ax = axes[0, 1]
    for i, beta in enumerate(betas):
        steps = 100
        delta = np.zeros(steps)
        delta[0] = -1
        lr = 0.1
        
        for t in range(1, steps):
            # IPO gradient
            grad = -2 * (delta[t-1] - 1/(2*beta))
            delta[t] = delta[t-1] + lr * grad
        
        ax.plot(delta, label=f'β={beta}', color=colors[i], linewidth=2)
    
    ax.set_xlabel('Step')
    ax.set_ylabel('Δ')
    ax.set_title('IPO Convergence for Different β')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Gradient scale
    ax = axes[1, 0]
    delta_range = np.linspace(-2, 2, 100)
    
    for i, beta in enumerate(betas):
        # DPO gradient magnitude
        dpo_grads = beta * np.exp(-beta * delta_range) / (1 + np.exp(-beta * delta_range))
        ax.plot(delta_range, dpo_grads, label=f'β={beta}', color=colors[i], linewidth=2)
    
    ax.set_xlabel('Δ')
    ax.set_ylabel('DPO Gradient')
    ax.set_title('DPO Gradient Scale vs Beta')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Practical recommendations
    ax = axes[1, 1]
    ax.text(0.5, 0.8, "FIPO Beta Selection Guidelines", 
            horizontalalignment='center', fontsize=16, weight='bold', transform=ax.transAxes)
    
    guidelines = [
        "β = 0.01 (FIPO default):",
        "  • Stable training",
        "  • Good for large models",
        "  • IPO target Δ = 50",
        "",
        "Smaller β (0.001):",
        "  • Stronger preferences",
        "  • Risk of overfitting",
        "",
        "Larger β (0.1-1.0):",
        "  • Weaker preferences",
        "  • More exploration"
    ]
    
    y_pos = 0.6
    for line in guidelines:
        ax.text(0.1, y_pos, line, transform=ax.transAxes, fontsize=12)
        y_pos -= 0.06
    
    ax.axis('off')
    
    plt.tight_layout()
    plt.show()

beta_parameter_study()
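
The IPO convergence curves have a closed form in this simplified setting, where Δ itself is treated as the only parameter. Each gradient step shrinks the distance to the target 1/(2β) by a factor of (1 − 2·lr), so β changes where the curve ends up but not how fast it gets there in relative terms. A sketch with hypothetical helper names, checked against the iterative update used in the cell above:

```python
import math

def ipo_delta_closed_form(delta0: float, beta: float, lr: float, t: int) -> float:
    """Delta after t gradient steps on (delta - 1/(2*beta))**2."""
    target = 1.0 / (2.0 * beta)
    return target + (delta0 - target) * (1.0 - 2.0 * lr) ** t

def ipo_delta_iterative(delta0: float, beta: float, lr: float, t: int) -> float:
    """Same quantity, computed step by step as in beta_parameter_study()."""
    target = 1.0 / (2.0 * beta)
    delta = delta0
    for _ in range(t):
        delta -= lr * 2.0 * (delta - target)
    return delta

for beta in (0.001, 0.01, 0.1, 1.0):
    closed = ipo_delta_closed_form(-1.0, beta, lr=0.1, t=99)
    stepped = ipo_delta_iterative(-1.0, beta, lr=0.1, t=99)
    print(f"beta={beta}: target={1 / (2 * beta):g}, closed={closed:.4f}, "
          f"match={math.isclose(closed, stepped, rel_tol=1e-9)}")
```

In a real model Δ depends on many parameters, so this only describes the idealized one-dimensional dynamics plotted above.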

## 6. Summary & Key Takeaways

### Core Concepts Learned:

1. **Preference Learning Fundamentals**:
   - Learn from both positive (chosen) and negative (rejected) examples
   - Model implicit reward function through pairwise comparisons
   - Provides a contrastive signal that SFT on positive examples alone lacks

2. **DPO (Direct Preference Optimization)**:
   - Uses logistic loss: $-\log \sigma(\beta \cdot \Delta)$
   - Directly optimizes for preferences without explicit reward model
   - Gradient vanishes as model becomes confident

3. **IPO (Identity Preference Optimization)**:
   - Uses squared loss: $(\Delta - \frac{1}{2\beta})^2$
   - Regularized version provides more stable training
   - Targets specific delta value rather than maximizing

4. **FIPO's Success Factors**:
   - IPO > DPO on the weighted average, plausibly due to regularization benefits
   - Beta = 0.01 provides good balance
   - Preference learning crucial for prompt optimization

### Practical Implementation Tips:

- Always maintain frozen reference model
- Monitor delta values during training
- Use IPO for more stable convergence
- Careful beta tuning based on model size
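
As one way to act on the "monitor delta values" tip, the sketch below summarizes a batch of Δ values with the standard library. The helper `summarize_deltas` and the sample values are hypothetical; `frac_positive` corresponds to the `accuracy` metric returned by the loss classes in this notebook.

```python
import statistics
from typing import Dict, List

def summarize_deltas(deltas: List[float]) -> Dict[str, float]:
    """Summary statistics for one batch of delta values."""
    return {
        "mean": statistics.fmean(deltas),
        # Sample standard deviation; needs at least two values
        "std": statistics.stdev(deltas) if len(deltas) > 1 else 0.0,
        "min": min(deltas),
        # Fraction of pairs where the chosen prompt is already favored
        "frac_positive": sum(d > 0 for d in deltas) / len(deltas),
    }

print(summarize_deltas([-0.5, 0.2, 1.3, 2.0]))  # made-up values for illustration
```

With IPO, a mean Δ drifting well past 1/(2β) or a growing spread would be a reasonable signal to revisit β or the learning rate.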