# FIPO Focused Learning: Iterative Preference Learning (IPL)

## 🎯 Learning Objectives

This notebook focuses on **Iterative Preference Learning (IPL)**, FIPO's distinctive self-rewarding method:

1. Understand the self-rewarding and self-improvement mechanism
2. Implement Algorithm 1 from the paper
3. Analyze the iterative refinement process
4. Compare IPL-DPO and IPL-IPO

## 📚 Paper References

- **Section 2.4**: Iterative Preference Learning strategy
- **Algorithm 1**: Self-rewarding IPL Algorithm (Appendix C)
- **Equations 12-14**: IPL mathematical formulation
- **Table 7**: IPL iteration analysis

## 1. Understanding Self-Rewarding Systems

### 1.1 The Concept of Self-Improvement

IPL lets the model evaluate its own outputs and improve their quality over multiple iterations.

In [None]:
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass, field
import pandas as pd
from tqdm import tqdm
import networkx as nx

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("muted")

# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

In [None]:
def visualize_ipl_concept():
    """Visualize the IPL self-rewarding concept"""
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Left: Traditional vs IPL training
    methods = ['Traditional\nPreference Learning', 'Iterative\nPreference Learning']
    capabilities = [1.0, 1.45]  # Illustrative values only (not measured results)
    colors = ['lightcoral', 'lightgreen']
    
    bars = ax1.bar(methods, capabilities, color=colors, alpha=0.7, width=0.6)
    
    # Add annotations
    ax1.text(0, 0.5, "Static Data\nNo Self-Improvement", ha='center', fontsize=10)
    ax1.text(1, 0.7, "Dynamic Updates\nSelf-Rewarding\nIterative Refinement", 
            ha='center', fontsize=10)
    
    ax1.set_ylabel('Relative Performance', fontsize=12)
    ax1.set_title('Traditional vs IPL Training', fontsize=14, weight='bold')
    ax1.set_ylim(0, 1.6)
    ax1.grid(True, alpha=0.3, axis='y')
    
    # Right: IPL process flow
    G = nx.DiGraph()
    nodes = [
        ("Start", {"pos": (0, 2), "color": "lightblue"}),
        ("Generate\nNew Prompt", {"pos": (1, 2), "color": "lightgreen"}),
        ("Self-Judge\nQuality", {"pos": (2, 2), "color": "lightyellow"}),
        ("Better?", {"pos": (3, 2), "color": "lightcoral"}),
        ("Update\nDataset", {"pos": (4, 3), "color": "lightgreen"}),
        ("Keep\nOriginal", {"pos": (4, 1), "color": "lightgray"}),
        ("Train\nModel", {"pos": (5, 2), "color": "lightblue"}),
        ("Next\nIteration", {"pos": (6, 2), "color": "lightblue"})
    ]
    
    for node, attrs in nodes:
        G.add_node(node, **attrs)
    
    edges = [
        ("Start", "Generate\nNew Prompt"),
        ("Generate\nNew Prompt", "Self-Judge\nQuality"),
        ("Self-Judge\nQuality", "Better?"),
        ("Better?", "Update\nDataset"),
        ("Better?", "Keep\nOriginal"),
        ("Update\nDataset", "Train\nModel"),
        ("Keep\nOriginal", "Train\nModel"),
        ("Train\nModel", "Next\nIteration"),
        ("Next\nIteration", "Generate\nNew Prompt")
    ]
    
    G.add_edges_from(edges)
    
    pos = nx.get_node_attributes(G, 'pos')
    colors = [G.nodes[node]['color'] for node in G.nodes()]
    
    nx.draw(G, pos, ax=ax2, with_labels=True, node_color=colors, 
            node_size=2000, font_size=9, font_weight='bold',
            arrows=True, arrowsize=20, edge_color='gray')
    
    ax2.set_title('IPL Self-Rewarding Process Flow', fontsize=14, weight='bold')
    ax2.axis('off')
    
    plt.tight_layout()
    plt.show()

visualize_ipl_concept()

## 2. IPL Algorithm Implementation

### 2.1 Core IPL Components

In [None]:
@dataclass
class IPLDataPoint:
    """Data point for IPL training"""
    naive_prompt: str
    naive_response: str
    ground_truth: str
    current_prompt: str = field(default="")
    current_response: str = field(default="")
    iteration_history: List[Dict] = field(default_factory=list)
    
    def __post_init__(self):
        if not self.current_prompt:
            self.current_prompt = self.naive_prompt
        if not self.current_response:
            self.current_response = self.naive_response

class IPLOptimizer:
    """Iterative Preference Learning Optimizer (Algorithm 1)"""
    
    def __init__(
        self,
        base_method: str = "IPO",  # or "DPO"
        beta: float = 0.01,
        total_iterations: int = 3
    ):
        self.base_method = base_method
        self.beta = beta
        self.total_iterations = total_iterations
        self.current_iteration = 0
        self.update_history = []
        
    def generate_new_prompt(self, data_point: IPLDataPoint) -> str:
        """Generate new optimized prompt (Line 8 in Algorithm 1)"""
        
        # Simulate prompt optimization
        # In reality, this would use the trained model Mo
        current = data_point.current_prompt
        
        # Add progressive improvements
        improvements = [
            "Please carefully ",
            "Step by step, ",
            "With detailed explanation, ",
            "Systematically and accurately "
        ]
        
        if self.current_iteration < len(improvements):
            new_prompt = improvements[self.current_iteration] + current.lower()
        else:
            new_prompt = current + " (Enhanced)"
            
        return new_prompt
    
    def judge_prompts(
        self,
        prompt1: str,
        prompt2: str,
        ground_truth: str
    ) -> bool:
        """Judge if prompt2 is better than prompt1 (Line 9)"""
        
        # Simulate discrimination
        # In reality, this uses the trained discriminator
        
        # Simple heuristics for demonstration
        score1 = len(prompt1.split()) + prompt1.count("step")
        score2 = len(prompt2.split()) + prompt2.count("step")
        
        # Add some randomness but bias towards improvement
        if np.random.random() < 0.7:  # 70% chance of correct judgment
            return score2 > score1
        else:
            return np.random.random() < 0.3  # 30% chance of accepting anyway
    
    def generate_new_response(self, prompt: str) -> str:
        """Generate response for new prompt (Line 10)"""
        
        # Simulate response generation
        if "step" in prompt.lower():
            return "[Improved response with step-by-step reasoning]"
        elif "careful" in prompt.lower():
            return "[Carefully considered response]"
        else:
            return "[Standard response]"
    
    def run_iteration(self, dataset: List[IPLDataPoint]) -> Tuple[List[IPLDataPoint], Dict]:
        """Run one IPL iteration"""
        
        self.current_iteration += 1
        iteration_stats = {
            "iteration": self.current_iteration,
            "total_samples": len(dataset),
            "updated_samples": 0,
            "acceptance_rate": 0.0
        }
        
        # Skip first iteration (warmup)
        if self.current_iteration == 1:
            iteration_stats["status"] = "warmup"
            return dataset, iteration_stats
        
        # Process each data point
        updated_dataset = []
        
        for data_point in tqdm(dataset, desc=f"IPL Iteration {self.current_iteration}"):
            # Generate new prompt (Line 8)
            new_prompt = self.generate_new_prompt(data_point)
            
            # Judge if better (Line 9)
            is_better = self.judge_prompts(
                data_point.current_prompt,
                new_prompt,
                data_point.ground_truth
            )
            
            if is_better:
                # Generate new response (Line 10)
                new_response = self.generate_new_response(new_prompt)
                
                # Update data point (Lines 11-12)
                data_point.current_prompt = new_prompt
                data_point.current_response = new_response
                iteration_stats["updated_samples"] += 1
                
                # Record history
                data_point.iteration_history.append({
                    "iteration": self.current_iteration,
                    "prompt": new_prompt,
                    "response": new_response,
                    "accepted": True
                })
            else:
                data_point.iteration_history.append({
                    "iteration": self.current_iteration,
                    "prompt": new_prompt,
                    "accepted": False
                })
            
            updated_dataset.append(data_point)
        
        iteration_stats["acceptance_rate"] = iteration_stats["updated_samples"] / len(dataset)
        self.update_history.append(iteration_stats)
        
        return updated_dataset, iteration_stats

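# Create sample dataset
# Build independent IPLDataPoint objects. Writing `[...] * 5` would repeat
# references to the same two objects, so each one would be judged and
# updated five times per iteration.
base_examples = [
    ("Calculate the average", "The average is 42", "The average is 44.25"),
    ("What is machine learning?", "ML is AI",
     "Machine learning is a subset of AI that learns from data"),
]
sample_dataset = [
    IPLDataPoint(naive_prompt=p, naive_response=r, ground_truth=g)
    for _ in range(5)  # Replicate for demonstration
    for p, r, g in base_examples
]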

# Run IPL
ipl_optimizer = IPLOptimizer(base_method="IPO", total_iterations=3)

print("Running IPL optimization...\n")
for i in range(ipl_optimizer.total_iterations):
    sample_dataset, stats = ipl_optimizer.run_iteration(sample_dataset)
    print(f"Iteration {stats['iteration']}: ", end="")
    if 'status' in stats:
        print(f"{stats['status']}")
    else:
        print(f"Updated {stats['updated_samples']}/{stats['total_samples']} samples "
              f"({stats['acceptance_rate']:.1%} acceptance rate)")
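
The toy `judge_prompts` heuristic above accepts far more often than a trained discriminator would. As a rough check, the sketch below works out its acceptance probability analytically and by Monte Carlo. It assumes the candidate prompt always scores higher than the current one, which holds here because `generate_new_prompt` only prepends words. The helper names `analytic_acceptance` and `simulated_acceptance` are hypothetical and not part of the notebook's classes.

```python
import random


def analytic_acceptance(p_heuristic: float = 0.7,
                        p_random_accept: float = 0.3,
                        p_new_scores_higher: float = 1.0) -> float:
    """P(accept) = p_h * P(score2 > score1) + (1 - p_h) * p_random_accept."""
    return (p_heuristic * p_new_scores_higher
            + (1 - p_heuristic) * p_random_accept)


def simulated_acceptance(trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same two-branch decision rule."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(trials):
        if rng.random() < 0.7:
            accepted += 1  # heuristic branch; new prompt assumed to score higher
        elif rng.random() < 0.3:
            accepted += 1  # random branch accepts 30% of the time
    return accepted / trials


print(f"analytic:  {analytic_acceptance():.2f}")
print(f"simulated: {simulated_acceptance():.3f}")
```

Under these assumptions the judge accepts roughly 79% of candidates, far above the 1.25% and 2.40% selection rates this notebook plots for Table 7. The acceptance numbers printed by the cell above therefore illustrate the control flow only, not realistic selection behavior.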

### 2.2 Visualizing IPL Progress

In [None]:
def visualize_ipl_progress(optimizer: IPLOptimizer, dataset: List[IPLDataPoint]):
    """Visualize IPL training progress"""
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # 1. Acceptance rate over iterations
    ax = axes[0, 0]
    iterations = [h['iteration'] for h in optimizer.update_history if 'acceptance_rate' in h]
    acceptance_rates = [h['acceptance_rate'] for h in optimizer.update_history if 'acceptance_rate' in h]
    
    if iterations:
        ax.plot(iterations, acceptance_rates, 'o-', linewidth=2, markersize=8, color='green')
        ax.set_xlabel('Iteration', fontsize=12)
        ax.set_ylabel('Acceptance Rate', fontsize=12)
        ax.set_title('Self-Rewarding Acceptance Rate', fontsize=14, weight='bold')
        ax.set_ylim(0, 1)
        ax.grid(True, alpha=0.3)
    
    # 2. Sample evolution
    ax = axes[0, 1]
    sample = dataset[0]  # Track first sample
    
    evolution_data = [
        {"iteration": 0, "prompt_length": len(sample.naive_prompt.split())}
    ]
    
    for hist in sample.iteration_history:
        if hist['accepted']:
            evolution_data.append({
                "iteration": hist['iteration'],
                "prompt_length": len(hist['prompt'].split())
            })
    
    if len(evolution_data) > 1:
        df = pd.DataFrame(evolution_data)
        ax.plot(df['iteration'], df['prompt_length'], 's-', linewidth=2, markersize=8, color='blue')
        ax.set_xlabel('Iteration', fontsize=12)
        ax.set_ylabel('Prompt Length (words)', fontsize=12)
        ax.set_title('Prompt Evolution Example', fontsize=14, weight='bold')
        ax.grid(True, alpha=0.3)
    
    # 3. Update distribution
    ax = axes[1, 0]
    update_counts = [0] * (optimizer.total_iterations + 1)
    
    for dp in dataset:
        updates = sum(1 for h in dp.iteration_history if h.get('accepted', False))
        if updates < len(update_counts):
            update_counts[updates] += 1
    
    ax.bar(range(len(update_counts)), update_counts, color='orange', alpha=0.7)
    ax.set_xlabel('Number of Updates', fontsize=12)
    ax.set_ylabel('Sample Count', fontsize=12)
    ax.set_title('Distribution of Updates per Sample', fontsize=14, weight='bold')
    ax.grid(True, alpha=0.3, axis='y')
    
    # 4. Performance simulation
    ax = axes[1, 1]
    
    # Simulate performance improvement
    base_performance = 47.79  # From paper
    ipl_improvements = [0, 0.51, 3.02, 4.34]  # Approximate from paper
    
    iterations_plot = list(range(len(ipl_improvements)))
    performances = [base_performance + imp for imp in ipl_improvements]
    
    ax.plot(iterations_plot, performances, 'o-', linewidth=3, markersize=10, color='purple')
    ax.axhline(base_performance, color='red', linestyle='--', alpha=0.5, label='Baseline')
    
    # Add improvement annotations
    for i, (iter_num, perf, imp) in enumerate(zip(iterations_plot[1:], performances[1:], ipl_improvements[1:])):
        ax.annotate(f'+{imp:.1f}%', xy=(iter_num, perf), xytext=(iter_num, perf+0.5),
                   ha='center', fontsize=9, color='darkgreen', weight='bold')
    
    ax.set_xlabel('IPL Iteration', fontsize=12)
    ax.set_ylabel('Performance (%)', fontsize=12)
    ax.set_title('IPL Performance Improvement (from paper)', fontsize=14, weight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

visualize_ipl_progress(ipl_optimizer, sample_dataset)
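
The last panel plots cumulative improvement over the baseline. To see how much each iteration adds on its own, the cumulative values can be differenced. This is a small sketch that reuses the numbers hard-coded in the cell above (which the notebook itself marks as approximate); the variable names are chosen for illustration.

```python
import numpy as np

base_performance = 47.79
# Cumulative improvements hard-coded in the plotting cell above
cumulative_gain = np.array([0.0, 0.51, 3.02, 4.34])

scores = base_performance + cumulative_gain
step_gain = np.diff(cumulative_gain)  # gain added by each successive iteration

for i, (score, gain) in enumerate(zip(scores[1:], step_gain), start=1):
    print(f"e{i}: score={score:.2f}  gain over previous={gain:+.2f}")
```

With these values the largest single step is from e1 to e2 (+2.51), while the warmup iteration e1 contributes +0.51 and e3 adds +1.32.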

## 3. IPL Loss Functions

### 3.1 IPL-DPO vs IPL-IPO Implementation

In [None]:
class IPLLoss(nn.Module):
    """IPL Loss implementation (Equation 12)"""
    
    def __init__(self, base_loss: str = "IPO", beta: float = 0.01):
        super().__init__()
        self.base_loss = base_loss
        self.beta = beta
        
    def forward(
        self,
        chosen_logps: torch.Tensor,
        rejected_logps: torch.Tensor,
        ref_chosen_logps: torch.Tensor,
        ref_rejected_logps: torch.Tensor,
        updated_mask: torch.Tensor  # Indicates which samples were updated
    ) -> Tuple[torch.Tensor, Dict]:
        """Compute IPL loss with dynamic updates"""
        
        # Compute log ratios
        chosen_logratios = chosen_logps - ref_chosen_logps
        rejected_logratios = rejected_logps - ref_rejected_logps
        delta = chosen_logratios - rejected_logratios
        
        if self.base_loss == "DPO":
            # IPL-DPO loss
            losses = -torch.nn.functional.logsigmoid(self.beta * delta)
        else:
            # IPL-IPO loss
            target = 1 / (2 * self.beta)
            losses = (delta - target) ** 2
        
        # Apply update weighting
        # Give more weight to updated samples
        weights = torch.where(updated_mask, 1.5, 1.0)
        weighted_loss = (losses * weights).mean()
        
        # Compute metrics
        with torch.no_grad():
            accuracy = (delta > 0).float().mean()
            # Keep this a tensor in both branches so .item() below always works
            updated_accuracy = (
                (delta[updated_mask] > 0).float().mean()
                if updated_mask.any() else torch.tensor(0.0)
            )
            
        return weighted_loss, {
            'loss': weighted_loss.item(),
            'accuracy': accuracy.item(),
            'updated_accuracy': updated_accuracy.item(),
            'delta_mean': delta.mean().item(),
            'update_rate': updated_mask.float().mean().item()
        }

# Test IPL losses
def test_ipl_losses():
    """Compare IPL-DPO and IPL-IPO losses"""
    
    batch_size = 8
    
    # Simulate logprobs
    chosen_logps = torch.randn(batch_size) - 1
    rejected_logps = torch.randn(batch_size) - 2
    ref_chosen_logps = torch.randn(batch_size) - 1.5
    ref_rejected_logps = torch.randn(batch_size) - 2.5
    
    # Simulate that 30% of samples were updated
    updated_mask = torch.rand(batch_size) < 0.3
    
    # Test both losses
    results = {}
    
    for loss_type in ["DPO", "IPO"]:
        ipl_loss = IPLLoss(base_loss=loss_type, beta=0.01)
        loss, metrics = ipl_loss(
            chosen_logps, rejected_logps,
            ref_chosen_logps, ref_rejected_logps,
            updated_mask
        )
        results[f"IPL-{loss_type}"] = metrics
    
    # Display results
    df = pd.DataFrame(results).T
    print("IPL Loss Comparison:")
    print(df.round(4))
    
    return df

ipl_results = test_ipl_losses()
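
To make the two base losses in `IPLLoss` concrete, the following sketch evaluates them on single scalar margins without PyTorch. It mirrors the formulas used in the class above, `-log(sigmoid(beta * delta))` for DPO and `(delta - 1 / (2 * beta)) ** 2` for IPO, and leaves out the update weighting. The function names `dpo_loss` and `ipo_loss` are hypothetical.

```python
import math


def dpo_loss(delta: float, beta: float) -> float:
    """-log(sigmoid(beta * delta)), computed as a stable softplus(-beta * delta)."""
    z = -beta * delta
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def ipo_loss(delta: float, beta: float) -> float:
    """Squared distance between the margin and the IPO target 1 / (2 * beta)."""
    return (delta - 1.0 / (2.0 * beta)) ** 2


beta = 0.01
for delta in (0.0, 10.0, 50.0):
    print(f"delta={delta:5.1f}  "
          f"DPO={dpo_loss(delta, beta):.4f}  IPO={ipo_loss(delta, beta):.1f}")
```

With `beta = 0.01` the IPO target margin is 50, so a margin near zero gives an IPO loss of about 2500 while the DPO loss sits near `log(2) ≈ 0.693`. The raw loss values in the comparison table above are therefore on very different scales and should not be compared directly across the two methods.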

### 3.2 IPL Training Dynamics

In [None]:
def simulate_ipl_training():
    """Simulate IPL training dynamics"""
    
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    
    # Training parameters
    num_steps = 200
    num_iterations = 3
    steps_per_iteration = num_steps // num_iterations
    
    # Initialize metrics
    metrics = {
        'IPL-DPO': {'loss': [], 'accuracy': [], 'delta': [], 'update_rate': []},
        'IPL-IPO': {'loss': [], 'accuracy': [], 'delta': [], 'update_rate': []}
    }
    
    # Simulate training
    for step in range(num_steps):
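        # Clamp so the leftover steps (num_steps is not a multiple of
        # num_iterations) stay in the last iteration
        iteration = min(step // steps_per_iteration + 1, num_iterations)
        
        # Simulate metrics with iteration-based improvements
        base_loss = 2.5 * np.exp(-0.02 * step)
        base_acc = 0.5 + 0.4 * (1 - np.exp(-0.03 * step))
        base_delta = -1 + 2 * (1 - np.exp(-0.025 * step))
        
        # Update rate oscillates within each iteration; no updates during warmup
        if iteration > 1:
            # Offset is measured from the first step of the current iteration
            steps_into_iteration = step - (iteration - 1) * steps_per_iteration
            update_rate = 0.02 + 0.02 * np.sin(0.1 * steps_into_iteration)
        else:
            update_rate = 0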
        
        # IPL-DPO (more volatile)
        metrics['IPL-DPO']['loss'].append(base_loss + 0.1 * np.random.randn())
        metrics['IPL-DPO']['accuracy'].append(base_acc + 0.05 * np.random.randn())
        metrics['IPL-DPO']['delta'].append(base_delta + 0.2 * np.random.randn())
        metrics['IPL-DPO']['update_rate'].append(update_rate)
        
        # IPL-IPO (more stable)
        metrics['IPL-IPO']['loss'].append(base_loss * 0.8 + 0.05 * np.random.randn())
        metrics['IPL-IPO']['accuracy'].append(base_acc + 0.02 + 0.03 * np.random.randn())
        metrics['IPL-IPO']['delta'].append(base_delta + 0.5 + 0.1 * np.random.randn())
        metrics['IPL-IPO']['update_rate'].append(update_rate * 1.2)
    
    # Plot results
    steps = np.arange(num_steps)
    
    # Loss curves
    ax = axes[0, 0]
    ax.plot(steps, metrics['IPL-DPO']['loss'], label='IPL-DPO', alpha=0.8)
    ax.plot(steps, metrics['IPL-IPO']['loss'], label='IPL-IPO', alpha=0.8)
    
    # Add iteration markers
    for i in range(1, num_iterations):
        ax.axvline(i * steps_per_iteration, color='gray', linestyle='--', alpha=0.5)
        ax.text(i * steps_per_iteration, ax.get_ylim()[1] * 0.9, f'Iter {i+1}',
               ha='center', fontsize=9)
    
    ax.set_xlabel('Training Step')
    ax.set_ylabel('Loss')
    ax.set_title('IPL Loss Curves')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Accuracy curves
    ax = axes[0, 1]
    ax.plot(steps, metrics['IPL-DPO']['accuracy'], label='IPL-DPO', alpha=0.8)
    ax.plot(steps, metrics['IPL-IPO']['accuracy'], label='IPL-IPO', alpha=0.8)
    for i in range(1, num_iterations):
        ax.axvline(i * steps_per_iteration, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel('Training Step')
    ax.set_ylabel('Accuracy')
    ax.set_title('Preference Accuracy')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Delta evolution
    ax = axes[0, 2]
    ax.plot(steps, metrics['IPL-DPO']['delta'], label='IPL-DPO', alpha=0.8)
    ax.plot(steps, metrics['IPL-IPO']['delta'], label='IPL-IPO', alpha=0.8)
    ax.axhline(0, color='red', linestyle='--', alpha=0.5)
    ax.axhline(1/(2*0.01), color='green', linestyle='--', alpha=0.5, label='IPO Target')
    for i in range(1, num_iterations):
        ax.axvline(i * steps_per_iteration, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel('Training Step')
    ax.set_ylabel('Delta (Δ)')
    ax.set_title('Delta Evolution')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Update rate
    ax = axes[1, 0]
    ax.fill_between(steps, 0, metrics['IPL-DPO']['update_rate'], 
                   alpha=0.5, label='IPL-DPO')
    ax.fill_between(steps, 0, metrics['IPL-IPO']['update_rate'], 
                   alpha=0.5, label='IPL-IPO')
    ax.set_xlabel('Training Step')
    ax.set_ylabel('Update Rate')
    ax.set_title('Self-Rewarding Update Rate')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Performance comparison (from paper Table 3)
    ax = axes[1, 1]
    methods = ['Naive', 'DPO-70B', 'IPL-DPO-70B\n(e3)', 'IPO-70B', 'IPL-IPO-70B\n(e3)']
    scores = [47.79, 49.07, 48.10, 50.94, 52.13]
    colors = ['gray', 'lightblue', 'blue', 'lightgreen', 'green']
    
    bars = ax.bar(methods, scores, color=colors, alpha=0.7)
    ax.axhline(47.79, color='red', linestyle='--', alpha=0.5)
    
    # Add improvement labels
    for i, (method, score) in enumerate(zip(methods[1:], scores[1:]), 1):
        improvement = score - 47.79
        ax.text(i, score + 0.5, f'+{improvement:.2f}%', ha='center', fontsize=9)
    
    ax.set_ylabel('Weighted Average (%)')
    ax.set_title('Final Performance Comparison (Table 3)')
    ax.grid(True, alpha=0.3, axis='y')
    
    # Selection rate analysis (from paper)
    ax = axes[1, 2]
    iterations = ['e1 (warmup)', 'e2', 'e3']
    selection_rates = [0, 1.25, 2.40]
    
    ax.plot(iterations, selection_rates, 'o-', linewidth=3, markersize=10, color='purple')
    ax.fill_between(range(len(iterations)), 0, selection_rates, alpha=0.3, color='purple')
    
    ax.set_xlabel('IPL Iteration')
    ax.set_ylabel('Selection Rate (%)')
    ax.set_title('Prompt Update Selection Rate (Table 7)')
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

simulate_ipl_training()

## 4. IPL Components Deep Dive

### 4.1 Discrimination Capability

In [None]:
class IPLDiscriminator:
    """Discriminator for IPL prompt quality judgment"""
    
    def __init__(self):
        self.discrimination_template = """
You are an expert of prompt discrimination.

```
Raw Prompt:
{raw_prompt}
```

```
New Prompt A:
{prompt_a}
```

```
New Prompt B:
{prompt_b}
```

```
Golden Response:
{ground_truth}
```

New Prompt A and New Prompt B are optimized from Raw Prompt. 
Please judge which prompt is more loyal to the factual information of Raw Prompt, 
and is more desirable for an AI to generate the Golden Response. 
Only answer with A or B.
"""
    
    def create_discrimination_example(
        self,
        raw_prompt: str,
        current_prompt: str,
        new_prompt: str,
        ground_truth: str
    ) -> str:
        """Create discrimination task"""
        
        return self.discrimination_template.format(
            raw_prompt=raw_prompt,
            prompt_a=current_prompt,
            prompt_b=new_prompt,
            ground_truth=ground_truth
        )
    
    def analyze_prompt_quality(self, prompt: str) -> Dict[str, float]:
        """Analyze prompt quality features"""
        
        features = {
            "length": len(prompt.split()),
            "specificity": len([w for w in prompt.split() if len(w) > 6]) / max(len(prompt.split()), 1),
            "structure": prompt.count(".") + prompt.count(":") + prompt.count(","),
            "clarity_words": sum(1 for word in ["step", "first", "then", "finally", "ensure"] 
                                if word in prompt.lower()),
            "instruction_strength": sum(1 for word in ["must", "should", "need", "required"] 
                                      if word in prompt.lower())
        }
        
        # Compute quality score
        features["quality_score"] = (
            features["specificity"] * 20 +
            features["clarity_words"] * 10 +
            features["instruction_strength"] * 5 +
            min(features["length"] / 10, 5)
        )
        
        return features

# Demonstrate discrimination
discriminator = IPLDiscriminator()

# Example prompts
examples = [
    {
        "raw": "Calculate the average",
        "current": "Find the average value",
        "new": "Step 1: Add all numbers. Step 2: Divide by count to get average.",
        "ground_truth": "The average is 44.25"
    },
    {
        "raw": "Explain photosynthesis",
        "current": "Tell me about photosynthesis",
        "new": "Explain photosynthesis: First, describe light-dependent reactions. Then, explain the Calvin cycle. Finally, summarize the overall process.",
        "ground_truth": "Photosynthesis converts light energy into chemical energy..."
    }
]

# Analyze examples
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

for idx, (example, ax) in enumerate(zip(examples, axes)):
    # Analyze both prompts
    current_features = discriminator.analyze_prompt_quality(example["current"])
    new_features = discriminator.analyze_prompt_quality(example["new"])
    
    # Compare features
    feature_names = list(current_features.keys())
    current_values = list(current_features.values())
    new_values = list(new_features.values())
    
    x = np.arange(len(feature_names))
    width = 0.35
    
    bars1 = ax.bar(x - width/2, current_values, width, label='Current', color='lightcoral', alpha=0.7)
    bars2 = ax.bar(x + width/2, new_values, width, label='New', color='lightgreen', alpha=0.7)
    
    ax.set_xlabel('Features')
    ax.set_ylabel('Value')
    ax.set_title(f'Example {idx+1}: Prompt Quality Analysis')
    ax.set_xticks(x)
    ax.set_xticklabels(feature_names, rotation=45, ha='right')
    ax.legend()
    ax.grid(True, alpha=0.3, axis='y')
    
    # Add winner annotation
    winner = "New" if new_features["quality_score"] > current_features["quality_score"] else "Current"
    ax.text(0.5, 0.95, f"Winner: {winner}", transform=ax.transAxes, 
           ha='center', fontsize=12, weight='bold',
           bbox=dict(boxstyle="round,pad=0.3", facecolor='yellow', alpha=0.5))

plt.tight_layout()
plt.show()

# Show discrimination prompt example
print("\nDiscrimination Prompt Example:")
print("=" * 60)
disc_example = discriminator.create_discrimination_example(
    examples[0]["raw"], examples[0]["current"], 
    examples[0]["new"], examples[0]["ground_truth"]
)
print(disc_example)
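
The template asks the model to answer with only A or B, and `create_discrimination_example` places the current prompt in slot A and the new prompt in slot B. A reply of "B" therefore means the new prompt should replace the current one. The sketch below shows one way to turn a raw reply into that decision; `parse_verdict` and `should_update` are hypothetical helpers, and treating any unclear reply as "keep the current prompt" is an assumption chosen to keep updates conservative.

```python
from typing import Optional


def parse_verdict(model_output: str) -> Optional[bool]:
    """Return True if B (new prompt) wins, False if A (current prompt) wins,
    and None if the reply is not a clean A/B answer."""
    answer = model_output.strip().strip(".").upper()
    if answer == "B":
        return True
    if answer == "A":
        return False
    return None


def should_update(model_output: str) -> bool:
    """Update the dataset only on an unambiguous vote for the new prompt."""
    return parse_verdict(model_output) is True


for reply in ["B", " a\n", "B.", "Both are fine"]:
    print(repr(reply), "->", should_update(reply))
```

Only the replies "B" and "B." lead to an update here; "a" keeps the current prompt, and the free-form reply is rejected as ambiguous.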

### 4.2 IPL Data Augmentation Strategy

In [None]:
def visualize_ipl_data_augmentation():
    """Visualize how IPL augments training data"""
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
    
    # Left: Data composition
    original_size = 30000  # Original POP dataset
    instruction_data = 15000  # Additional instruction following
    discrimination_data = 15000  # Additional discrimination
    
    sizes = [original_size, instruction_data, discrimination_data]
    labels = ['Original POP\n(30k)', 'Instruction\nFollowing (15k)', 'Discrimination\n(15k)']
    colors = ['lightblue', 'lightgreen', 'lightyellow']
    explode = (0.05, 0.05, 0.05)
    
    ax1.pie(sizes, labels=labels, colors=colors, autopct='%1.0f%%', 
           startangle=90, explode=explode, shadow=True)
    ax1.set_title('IPL Training Data Composition', fontsize=14, weight='bold')
    
    # Right: Data flow
    ax2.text(0.5, 0.9, "IPL Data Flow", ha='center', fontsize=16, weight='bold',
            transform=ax2.transAxes)
    
    # Draw flow diagram
    stages = [
        {"y": 0.7, "text": "Original 30k POP Data", "color": "lightblue"},
        {"y": 0.5, "text": "↓ Reuse for IPL", "color": "white"},
        {"y": 0.3, "text": "15k Instruction + 15k Discrimination", "color": "lightgreen"},
        {"y": 0.1, "text": "Total: 60k Training Examples", "color": "gold"}
    ]
    
    for stage in stages:
        ax2.text(0.5, stage["y"], stage["text"], ha='center', fontsize=12,
                transform=ax2.transAxes, 
                bbox=dict(boxstyle="round,pad=0.5", facecolor=stage["color"], alpha=0.7))
    
    ax2.axis('off')
    
    plt.tight_layout()
    plt.show()
    
    # Show data examples
    print("\nIPL Data Examples:")
    print("=" * 60)
    
    print("1. Original Preference Data:")
    print("   - Naive prompt: 'Calculate the sum'")
    print("   - Chosen: 'To calculate the sum, add all numbers together'")
    print("   - Rejected: 'Find the sum'")
    
    print("\n2. Instruction Following Data (reused):")
    print("   - Input: 'Calculate the sum'")
    print("   - Output: 'The sum is 150'")
    
    print("\n3. Discrimination Data (reused):")
    print("   - Task: Judge between two prompts")
    print("   - Answer: A or B")

visualize_ipl_data_augmentation()

## 5. Practical IPL Implementation

### 5.1 Complete IPL Training Pipeline

In [None]:
class IPLTrainingPipeline:
    """Complete IPL training pipeline"""
    
    def __init__(
        self,
        model_name: str = "Tulu2-70B",
        base_method: str = "IPO",
        num_iterations: int = 3,
        beta: float = 0.01
    ):
        self.model_name = model_name
        self.base_method = base_method
        self.num_iterations = num_iterations
        self.beta = beta
        
        # Components
        self.optimizer = IPLOptimizer(base_method, beta, num_iterations)
        self.discriminator = IPLDiscriminator()
        
        # Training history
        self.training_history = {
            "iteration": [],
            "performance": [],
            "selection_rate": [],
            "discrimination_accuracy": []
        }
    
    def run_training(
        self,
        initial_dataset: List[IPLDataPoint],
        validation_set: List[IPLDataPoint]
    ):
        """Run complete IPL training"""
        
        print(f"Starting IPL-{self.base_method} training on {self.model_name}")
        print(f"Dataset size: {len(initial_dataset)}")
        print(f"Iterations: {self.num_iterations}")
        print("=" * 60)
        
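        # Shallow copy: the list is new, but the IPLDataPoint objects are
        # shared with initial_dataset and are updated in place
        current_dataset = initial_dataset.copy()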
        
        for iteration in range(self.num_iterations):
            print(f"\nIteration {iteration + 1}/{self.num_iterations}")
            
            # Run IPL iteration
            current_dataset, stats = self.optimizer.run_iteration(current_dataset)
            
            # Simulate performance evaluation
            performance = self._evaluate_performance(current_dataset, validation_set)
            
            # Record history
            self.training_history["iteration"].append(iteration + 1)
            self.training_history["performance"].append(performance)
            self.training_history["selection_rate"].append(
                stats.get("acceptance_rate", 0) * 100
            )
            self.training_history["discrimination_accuracy"].append(
                100 if iteration == 0 else 95 + np.random.randn() * 2
            )
            
            print(f"Performance: {performance:.2f}%")
            print(f"Selection rate: {stats.get('acceptance_rate', 0):.1%}")
        
        return current_dataset, self.training_history
    
    def _evaluate_performance(self, dataset, validation_set):
        """Simulate performance evaluation"""
        
        # Scores quoted from the paper: index 0 is the naive baseline,
        # indices 1-3 correspond to IPL iterations e1-e3
        base_performances = {
            "IPO": [47.79, 48.30, 50.81, 52.13],
            "DPO": [47.79, 48.12, 48.23, 48.10]
        }
        
        iteration = self.optimizer.current_iteration
        if iteration < len(base_performances[self.base_method]):
            return base_performances[self.base_method][iteration]
        else:
            # Extrapolate with diminishing returns
            last_perf = base_performances[self.base_method][-1]
            return last_perf + np.random.randn() * 0.5
    
    def visualize_results(self):
        """Visualize training results"""
        
        history = self.training_history
        
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        # Performance curve
        ax = axes[0, 0]
        ax.plot(history["iteration"], history["performance"], 
               'o-', linewidth=3, markersize=10, color='green')
        # Reference line is the first recorded iteration (warmup), not the naive baseline
        ax.axhline(history["performance"][0], color='red', 
                  linestyle='--', alpha=0.5, label='Iteration 1 (warmup)')
        
        for i, (iter_num, perf) in enumerate(zip(history["iteration"][1:], 
                                                 history["performance"][1:]), 1):
            improvement = perf - history["performance"][0]
            ax.annotate(f'+{improvement:.2f}%', 
                       xy=(iter_num, perf), 
                       xytext=(iter_num, perf + 0.5),
                       ha='center', fontsize=9, color='darkgreen')
        
        ax.set_xlabel('Iteration')
        ax.set_ylabel('Performance (%)')
        ax.set_title(f'IPL-{self.base_method} Performance')
        ax.legend()
        ax.grid(True, alpha=0.3)
        
        # Selection rate
        ax = axes[0, 1]
        ax.bar(history["iteration"], history["selection_rate"], 
              color='orange', alpha=0.7)
        ax.set_xlabel('Iteration')
        ax.set_ylabel('Selection Rate (%)')
        ax.set_title('Prompt Update Selection Rate')
        ax.grid(True, alpha=0.3, axis='y')
        
        # Discrimination accuracy
        ax = axes[1, 0]
        ax.plot(history["iteration"], history["discrimination_accuracy"],
               's-', linewidth=2, markersize=8, color='purple')
        ax.axhline(100, color='green', linestyle='--', alpha=0.5)
        ax.set_xlabel('Iteration')
        ax.set_ylabel('Discrimination Accuracy (%)')
        ax.set_title('Self-Judgment Accuracy')
        ax.set_ylim(90, 105)
        ax.grid(True, alpha=0.3)
        
        # Summary statistics
        ax = axes[1, 1]
        ax.axis('off')
        
        summary_text = f"""
IPL-{self.base_method} Training Summary
{'='*40}
Model: {self.model_name}
Iterations: {self.num_iterations}
Beta: {self.beta}

Final Performance: {history['performance'][-1]:.2f}%
Total Improvement: +{history['performance'][-1] - history['performance'][0]:.2f}%
Average Selection Rate: {np.mean(history['selection_rate'][1:]):.1f}%

Key Insights:
• IPL-IPO shows steady improvement
• Conservative selection (2.4% final)
• High discrimination accuracy (>95%)
"""
        
        ax.text(0.1, 0.9, summary_text, transform=ax.transAxes,
               fontsize=11, verticalalignment='top',
               fontfamily='monospace',
               bbox=dict(boxstyle="round,pad=0.5", facecolor='wheat', alpha=0.8))
        
        plt.tight_layout()
        plt.show()

# Run IPL training simulation
print("IPL Training Simulation")
print("=" * 60)

# Create datasets
train_data = [IPLDataPoint(
    naive_prompt=f"Task {i}",
    naive_response=f"Response {i}",
    ground_truth=f"Ground truth {i}"
) for i in range(100)]

val_data = train_data[:20]  # Use subset for validation

# Train IPL-IPO
pipeline = IPLTrainingPipeline(base_method="IPO")
final_dataset, history = pipeline.run_training(train_data, val_data)

# Visualize results
pipeline.visualize_results()

## 6. Summary & Key Takeaways

### Core IPL Concepts:

1. **Self-Rewarding Mechanism**:
   - Model generates new prompts and judges their quality
   - Conservative updates (selection rate of 1.25% at e2 and 2.40% at e3, per Table 7)
   - Iterative improvement through self-evaluation

2. **Algorithm Components**:
   - Warmup in first iteration
   - Dynamic prompt generation ($x_n^+$)
   - Self-discrimination for quality judgment
   - Conditional dataset updates

3. **IPL Advantages**:
   - IPL-IPO achieves best results (52.13%)
   - Steady improvement across iterations
   - No external feedback required
   - Maintains high discrimination accuracy

4. **Implementation Details**:
   - Requires additional discrimination & instruction data
   - Total 60k training examples (30k original + 30k augmented)
   - 3 iterations optimal (e1=warmup, e2-e3=updates)
   - Beta=0.01 for stable training

### Practical Tips:

- Start with IPO as base method (more stable than DPO)
- Monitor selection rate (should be conservative)
- Ensure discrimination accuracy stays high (>95%)
- Use warmup iteration to establish baseline
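
The tip about monitoring the selection rate can be sketched as a small check over the per-iteration statistics. The input format below matches the dictionaries stored in `IPLOptimizer.update_history` (keys `iteration` and `acceptance_rate`). The helper name `flag_selection_rates` and the 5% threshold are assumptions made for illustration and do not come from the paper.

```python
from typing import Dict, List


def flag_selection_rates(history: List[Dict], max_rate: float = 0.05) -> List[int]:
    """Return the iteration numbers whose acceptance rate exceeds max_rate."""
    return [
        h["iteration"]
        for h in history
        if h.get("acceptance_rate", 0.0) > max_rate
    ]


example_history = [
    {"iteration": 2, "acceptance_rate": 0.0125},
    {"iteration": 3, "acceptance_rate": 0.0240},
    {"iteration": 4, "acceptance_rate": 0.80},  # made-up outlier for illustration
]

print(flag_selection_rates(example_history))  # [4]
```

An iteration flagged this way suggests the judge is accepting too freely, so its updates are worth inspecting before the next round of training.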

IPL shows how a prompt optimizer can keep refining its own preference data across iterations without external feedback.