# 🎯 LLM Fine-Tuning: Modern Approaches and When to Use Them

Welcome to the comprehensive guide on fine-tuning Large Language Models!

---

## 📚 What You'll Learn

By the end of this tutorial, you will:

1. **Understand the modern philosophy** of fine-tuning (what changed and why)
2. **Master three fine-tuning approaches:**
   - **Supervised Fine-Tuning (SFT)** - Traditional approach for teaching formats and styles
   - **Preference-Based Fine-Tuning** - Learning from comparisons and rankings
   - **Reinforcement Learning from Human Feedback (RLHF)** - Advanced behavioral alignment
3. **Calculate costs** and understand economic trade-offs
4. **Make informed decisions** about which approach to use for your use case
5. **Prepare datasets** appropriate for each method
6. **Evaluate results** with proper metrics

---

## ⚠️ Important Note About This Notebook

**This notebook is primarily educational and will NOT execute actual fine-tuning.**

Why?
- Fine-tuning is **expensive** (can cost hundreds to thousands of dollars)
- It requires significant compute time (hours to days)
- Mock datasets are used for illustration purposes

**What you WILL get:**
- ✅ Complete code examples showing how to fine-tune on OpenAI's platform
- ✅ Mock datasets demonstrating proper data formatting
- ✅ Cost calculations and economic analysis
- ✅ Decision frameworks for choosing approaches
- ✅ Best practices and evaluation strategies

**What you WON'T do:**
- ❌ Actually train models (too expensive for a tutorial)
- ❌ Use real API keys for fine-tuning
- ❌ Wait hours for training to complete

Think of this as a **comprehensive blueprint** - when you're ready to fine-tune for real, you'll know exactly what to do!

---

## 🗺️ Tutorial Structure

**Part 1:** Introduction & Modern Philosophy (15-20 min)

**Part 2:** Cost Understanding (10 min)

**Part 3:** Supervised Fine-Tuning (SFT) - 30 min

**Part 4:** Preference-Based Fine-Tuning - 30 min

**Part 5:** RLHF - 30 min

**Part 6:** Decision Framework - 20 min

**Part 7:** Best Practices & Evaluation - 15 min

**Part 8:** Summary & Resources - 10 min

---

Let's begin! 🚀

---

# Part 1: The Paradigm Shift in Fine-Tuning

## 🔄 The Old Approach vs. The New Approach

### The Old Days (2018-2020)

**The Problem:**
- Base models (BERT, GPT-2) had limited general knowledge
- Needed fine-tuning to teach them domain-specific facts
- Fine-tuning was about **adding knowledge**

**Example:**
```
❌ Base GPT-2: "What is Kubernetes?" → Confused or wrong answer
✅ Fine-tuned GPT-2: Train on DevOps docs → Now knows Kubernetes
```

### The Modern Era (2023+)

**The Revolution:**
- Modern LLMs (GPT-4, Claude, Gemini) are trained on massive datasets
- They **already know** most facts, concepts, and domains
- Fine-tuning is now about **behavior, style, and format**

**Example:**
```
✅ GPT-4: "What is Kubernetes?" → Already knows perfectly well
🎯 Fine-tuned GPT-4: Make it respond in your company's specific tone,
                      format output as JSON, follow brand guidelines
```

---

## ✅ What Fine-Tuning IS Good For (Modern Use Cases)

### 1. **Teaching Output Format**
- **Problem:** Need consistent JSON structure, specific field names
- **Solution:** Fine-tune on examples of perfect formatting
- **Example:** Extract invoice data → always output same JSON schema

### 2. **Establishing Brand Voice/Style**
- **Problem:** Company has specific tone (formal, casual, technical)
- **Solution:** Fine-tune on company's approved responses
- **Example:** Customer service chatbot matching brand personality

### 3. **Following Internal Guidelines**
- **Problem:** Company has specific policies, templates, procedures
- **Solution:** Fine-tune on examples following these rules
- **Example:** Medical advice bot following clinical protocols

### 4. **Consistent Edge Case Handling**
- **Problem:** Model should refuse certain requests in specific ways
- **Solution:** Fine-tune on examples of proper refusals
- **Example:** Financial advisor bot declining legal advice politely

### 5. **Reducing Latency + Cost (Post-Distillation)**
- **Problem:** GPT-4 is expensive and slow
- **Solution:** Fine-tune smaller model (gpt-5-nano) on GPT-4 outputs
- **Example:** Customer service with 10x lower costs, 5x faster

---

## ❌ What Fine-Tuning IS NOT Good For

### 1. **Teaching New Knowledge/Facts**
- **Why:** Modern LLMs already know most information
- **Better approach:** Use RAG (Retrieval Augmented Generation)
- **Example:** Instead of fine-tuning on company docs, use RAG to retrieve relevant sections

**Why RAG beats fine-tuning for knowledge:**
- ✅ Updates in real-time (no retraining needed)
- ✅ Cheaper (no training costs)
- ✅ Explainable (shows sources)
- ✅ Handles changing information better
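
To make the RAG alternative concrete, here is a minimal sketch of the retrieve-then-prompt pattern. It assumes a tiny in-memory document list and ranks relevance by naive word overlap. `DOCS`, `tokenize`, `retrieve`, and `build_prompt` are hypothetical names for illustration; a real system would more likely use embeddings and a vector database.

```python
# Toy retrieve-then-prompt sketch. Word overlap stands in for a real
# embedding search; the documents below are made-up examples.
DOCS = [
    "Returns are accepted within 30 days of delivery.",
    "Standard shipping takes 3 to 5 business days.",
    "Gift cards never expire and cannot be refunded.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of punctuation-free words."""
    return {word.strip(".,?!").lower() for word in text.split()}

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, docs: list[str]) -> str:
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"

print(build_prompt("How many days does shipping take?", DOCS))
```

Updating the knowledge here means editing `DOCS`; no training run is involved, which is the point of the comparison above.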

### 2. **Making Model "Smarter"**
- **Why:** You can't fine-tune a small model to match GPT-4's reasoning
- **Reality:** Fine-tuning teaches patterns, not intelligence
- **Example:** Fine-tuning GPT-3.5 won't make it reason like GPT-4

### 3. **Fixing Fundamental Model Limitations**
- **Why:** Fine-tuning doesn't change core capabilities
- **Better approach:** Use a better base model
- **Example:** If model can't do math, fine-tuning won't help → use tool calling instead

### 4. **Reducing Hallucinations Significantly**
- **Why:** Fine-tuning can actually increase hallucinations
- **Reality:** Training on confident-sounding examples teaches the model to sound confident even when it is wrong
- **Better approach:** Use RAG, citations, temperature tuning

---

## 🎯 The Golden Rule of Modern Fine-Tuning

**Ask yourself:**

> *"Could I solve this with better prompts, RAG, or tool calling?"*

**If YES → Don't fine-tune yet!**

**If NO → Fine-tuning might be worth it**

### Decision Hierarchy (Try in Order):

1. **Prompt Engineering** (minutes, free)
   - Costs: $0
   - Time: Minutes
   - Example: Clear instructions, few-shot examples

2. **RAG for Knowledge** (hours, low cost)
   - Costs: $50-500 setup
   - Time: Hours to days
   - Example: Vector database + retrieval

3. **Tool Calling for Accuracy** (days, medium cost)
   - Costs: $100-1000 setup
   - Time: Days to weeks
   - Example: Calculator, search, database queries

4. **Fine-Tuning for Behavior** (weeks, high cost)
   - Costs: $500-10,000+
   - Time: Weeks to months
   - Example: Brand voice, output format, style

---

## 📊 Real-World Success Example

**Company:** E-commerce customer support

**Initial idea:** "Let's fine-tune on all our product docs!"

**What they actually did:**
1. **Prompt engineering** → Got 60% success rate (1 day)
2. **Added RAG** → 85% success rate (1 week, $200 cost)
3. **Added tool calling** → 92% success rate (2 weeks, $500 cost)
4. **Fine-tuned for tone** → 95% success rate (4 weeks, $2000 cost)

**Lesson:** Fine-tuning delivered the **final 3 percentage points**, not the first step!
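
The arithmetic behind that lesson can be sketched as follows. The success rates and dollar figures are copied from the list above; treating each figure as the cost of that step alone, and prompt engineering as $0, are assumptions made for illustration.

```python
# (step, success rate in %, assumed cost of that step in dollars)
steps = [
    ("Prompt engineering", 60, 0),
    ("RAG", 85, 200),
    ("Tool calling", 92, 500),
    ("Fine-tuning for tone", 95, 2000),
]

def marginal_gains(steps):
    """Return (name, points gained, cost per point) for each step after the first."""
    rows = []
    for (_, prev_rate, _), (name, rate, cost) in zip(steps, steps[1:]):
        gained = rate - prev_rate
        rows.append((name, gained, cost / gained))
    return rows

for name, gained, per_point in marginal_gains(steps):
    print(f"{name}: +{gained} points at ${per_point:,.2f} per point")
```

Under these assumptions, each later step buys fewer points at a higher price per point, which is why the order of the hierarchy matters.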

---

## 🚀 The Three Modern Fine-Tuning Approaches

Now that you understand **when** to fine-tune, let's learn **how**.

### 1. Supervised Fine-Tuning (SFT)
- **Teaching:** "Here are perfect examples, copy this pattern"
- **Data:** Input-output pairs showing desired behavior
- **Use case:** Consistent formatting, style, structure
- **Complexity:** ⭐⭐☆☆☆ (Easiest)

### 2. Preference-Based Fine-Tuning
- **Teaching:** "Response A is better than Response B"
- **Data:** Pairs of outputs with quality rankings
- **Use case:** Subjective quality, tone, helpfulness
- **Complexity:** ⭐⭐⭐☆☆ (Medium)

### 3. RLHF (Reinforcement Learning from Human Feedback)
- **Teaching:** "Here's a reward score across multiple dimensions"
- **Data:** Outputs with multi-dimensional quality scores
- **Use case:** Complex quality with multiple criteria
- **Complexity:** ⭐⭐⭐⭐⭐ (Advanced)
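
As a rough illustration of how the training data differs between the three approaches, the sketch below builds one record of each kind. Only the `messages` layout mirrors the chat format used later in this tutorial; `prompt`, `chosen`, `rejected`, `response`, and `scores` are hypothetical field names, not any platform's required schema.

```python
# Illustrative record shapes for the three approaches. Every field name
# other than "messages" is a hypothetical placeholder.
prompt = [{"role": "user", "content": "Where is my order?"}]

# SFT: one input paired with one ideal output.
sft_record = {
    "messages": prompt + [
        {"role": "assistant", "content": "Let me check that for you right away!"}
    ]
}

# Preference-based: two candidate outputs, one marked as better.
preference_record = {
    "prompt": prompt,
    "chosen": "Let me check that for you right away!",
    "rejected": "Orders ship eventually.",
}

# RLHF-style: one output scored on several quality dimensions.
scored_record = {
    "prompt": prompt,
    "response": "Let me check that for you right away!",
    "scores": {"helpfulness": 0.9, "tone": 0.8, "accuracy": 1.0},
}

for name, record in [("SFT", sft_record),
                     ("Preference", preference_record),
                     ("Scored", scored_record)]:
    print(f"{name}: {sorted(record)}")
```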

---

Let's explore each approach in depth! 📖

---

# Part 2: Understanding Costs

Before diving into techniques, let's understand the **economic reality** of fine-tuning.

## 💰 Two Types of Costs

### 1. Training Cost (One-Time)
- Pay per token during training
- Depends on: dataset size, number of epochs, model size
- **This is visible and obvious**

### 2. Inference Cost (Ongoing)
- Fine-tuned models cost MORE per token than base models
- For high-volume applications, this adds up quickly
- **This is hidden and often forgotten!**

---

## 📊 OpenAI Pricing (gpt-5-nano)

### Training Costs:
- **Input tokens:** $0.10 per 1M tokens
- **Output tokens:** $0.80 per 1M tokens
- *Note: You're charged for BOTH user prompts AND assistant responses in your training data*

### Inference Costs:
- **Base gpt-5-nano:**
  - Input: $0.05 / 1M tokens
  - Output: $0.40 / 1M tokens
  
- **Fine-tuned gpt-5-nano:**
  - Input: $0.15 / 1M tokens (3x more!)
  - Output: $1.20 / 1M tokens (3x more!)
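
Because input and output tokens are billed at different rates, the cost of a request depends on how its tokens split between prompt and completion. The sketch below applies the rates listed above to a single request; the 300/100 token split is a hypothetical example, and `RATES` and `request_cost` are illustrative names.

```python
# Per-1M-token rates copied from the pricing list above.
RATES = {
    "base": {"input": 0.05, "output": 0.40},
    "fine_tuned": {"input": 0.15, "output": 1.20},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, billing input and output tokens separately."""
    rate = RATES[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000

# Hypothetical request: 300 prompt tokens and 100 completion tokens.
base = request_cost("base", 300, 100)
tuned = request_cost("fine_tuned", 300, 100)
print(f"base: ${base:.6f}  fine-tuned: ${tuned:.6f}  ratio: {tuned / base:.1f}x")
```

Since both rates triple, the ratio stays at 3x whatever the split; the absolute cost, however, rises quickly with longer completions.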

---

In [None]:
# Cost Calculator Function

def calculate_finetuning_costs(
    num_examples: int,
    avg_tokens_per_example: int,
    epochs: int = 3,
    model: str = "gpt-5-nano"
) -> dict:
    """
    Calculate the costs of fine-tuning a model on OpenAI's platform.

    Args:
        num_examples: Number of training examples
        avg_tokens_per_example: Average tokens per example (input + output combined)
        epochs: Number of training epochs (default: 3)
        model: Model to fine-tune (only gpt-5-nano supported in this example)

    Returns:
        Dictionary with cost breakdown
    """

    # Pricing for gpt-5-nano (as of 2025)
    if model == "gpt-5-nano":
        # Training costs (per 1M tokens)
        # Simplified: applies the input rate (0.10) to every training token.
        # Output tokens are billed at 0.80, so real training costs run higher.
        training_cost_per_1m = 0.10

        # Inference costs (per 1M tokens)
        base_inference_cost = 0.225  # Average of input (0.05) and output (0.40)
        finetuned_inference_cost = 0.675  # Average of input (0.15) and output (1.20)
    else:
        raise ValueError(f"Model {model} not supported in this example")

    # Calculate training costs
    total_training_tokens = num_examples * avg_tokens_per_example * epochs
    training_cost = (total_training_tokens / 1_000_000) * training_cost_per_1m

    # Calculate inference cost difference (for 1 million requests)
    # Assuming each inference uses similar token count
    inference_requests = 1_000_000  # 1 million requests for comparison
    total_inference_tokens = inference_requests * avg_tokens_per_example

    base_inference_total = (total_inference_tokens / 1_000_000) * base_inference_cost
    finetuned_inference_total = (total_inference_tokens / 1_000_000) * finetuned_inference_cost
    inference_cost_increase = finetuned_inference_total - base_inference_total

    # Calculate break-even point
    # Number of requests at which the cumulative extra inference cost equals the one-time training cost
    cost_per_request_increase = (finetuned_inference_cost - base_inference_cost) * (avg_tokens_per_example / 1_000_000)
    break_even_requests = training_cost / cost_per_request_increase if cost_per_request_increase > 0 else float('inf')

    return {
        "training_cost": round(training_cost, 2),
        "total_training_tokens": total_training_tokens,
        "base_inference_cost_per_1m_requests": round(base_inference_total, 2),
        "finetuned_inference_cost_per_1m_requests": round(finetuned_inference_total, 2),
        "inference_cost_increase_per_1m_requests": round(inference_cost_increase, 2),
        "break_even_requests": int(break_even_requests) if break_even_requests != float('inf') else "N/A",
        "cost_per_example": round(training_cost / num_examples, 4)
    }

def print_cost_analysis(results: dict, num_examples: int):
    """
    Pretty print cost analysis results.
    """
    print(f"\n{'='*70}")
    print(f"💰 COST ANALYSIS: {num_examples:,} Training Examples")
    print(f"{'='*70}\n")

    print(f"🔧 TRAINING COSTS (One-Time)")
    print(f"   Total cost: ${results['training_cost']:,.2f}")
    print(f"   Cost per example: ${results['cost_per_example']:.4f}")
    print(f"   Total tokens processed: {results['total_training_tokens']:,}\n")

    print(f"📊 INFERENCE COSTS (Per 1M Requests)")
    print(f"   Base model: ${results['base_inference_cost_per_1m_requests']:,.2f}")
    print(f"   Fine-tuned model: ${results['finetuned_inference_cost_per_1m_requests']:,.2f}")
    print(f"   Additional cost: ${results['inference_cost_increase_per_1m_requests']:,.2f} (⬆️ {((results['finetuned_inference_cost_per_1m_requests'] / results['base_inference_cost_per_1m_requests']) - 1) * 100:.0f}%)\n")

    print(f"⚖️ BREAK-EVEN ANALYSIS")
    if results['break_even_requests'] != "N/A":
        print(f"   Break-even at: {results['break_even_requests']:,} requests")
        print(f"   After this point, the cumulative inference premium exceeds the training cost!\n")
    else:
        print(f"   Break-even: Not applicable\n")

    print(f"{'='*70}")

print("✅ Cost calculator functions loaded!")

## 💡 Let's Calculate Real Scenarios

In [None]:
# Scenario 1: Small fine-tuning project (100 examples)
print("\n📝 Scenario 1: Small Project (Quick Experiment)")
print("   Use case: Testing if fine-tuning helps with your use case")
print("   Dataset: 100 examples, ~500 tokens each")

results_small = calculate_finetuning_costs(
    num_examples=100,
    avg_tokens_per_example=500,
    epochs=3
)
print_cost_analysis(results_small, 100)

print("\n💭 Analysis:")
print("   ✅ Low training cost - good for experimentation")
print("   ⚠️ May not be enough data for significant improvement")
print("   ✅ Quick to iterate if results aren't good")

In [None]:
# Scenario 2: Medium fine-tuning project (500 examples)
print("\n📝 Scenario 2: Medium Project (Standard Production)")
print("   Use case: Customer service chatbot with brand voice")
print("   Dataset: 500 examples, ~400 tokens each")

results_medium = calculate_finetuning_costs(
    num_examples=500,
    avg_tokens_per_example=400,
    epochs=3
)
print_cost_analysis(results_medium, 500)

print("\n💭 Analysis:")
print("   ✅ Balanced cost-benefit ratio")
print("   ✅ Usually sufficient for style/format learning")
print("   ⚠️ Inference costs start adding up at scale")

In [None]:
# Scenario 3: Large fine-tuning project (1000 examples)
print("\n📝 Scenario 3: Large Project (Comprehensive)")
print("   Use case: Multi-domain chatbot with complex behaviors")
print("   Dataset: 1000 examples, ~600 tokens each")

results_large = calculate_finetuning_costs(
    num_examples=1000,
    avg_tokens_per_example=600,
    epochs=3
)
print_cost_analysis(results_large, 1000)

print("\n💭 Analysis:")
print("   ⚠️ Significant upfront investment")
print("   ✅ High-quality results expected")
print("   ⚠️ Inference costs become major factor at scale")

In [None]:
# Scenario 4: Enterprise fine-tuning (5000 examples)
print("\n📝 Scenario 4: Enterprise Project (Maximum Quality)")
print("   Use case: Mission-critical application with strict requirements")
print("   Dataset: 5000 examples, ~500 tokens each")

results_enterprise = calculate_finetuning_costs(
    num_examples=5000,
    avg_tokens_per_example=500,
    epochs=3
)
print_cost_analysis(results_enterprise, 5000)

print("\n💭 Analysis:")
print("   ⚠️ Very expensive - ensure this is necessary!")
print("   ⚠️ Consider if prompt engineering + RAG could work instead")
print("   ⚠️ Inference costs will be substantial - budget accordingly")
print("   ✅ Only justified for high-value, high-volume applications")

## 🎯 Key Cost Insights

### 1. **Training costs scale linearly**
- 2x examples = 2x cost
- BUT: Diminishing returns on quality!
- Sweet spot: Usually 500-1500 examples

### 2. **Inference costs are ongoing**
- Fine-tuned models cost 3x as much per token as the base model
- At high volume, the cumulative inference premium exceeds the one-time training cost
- **Critical consideration:** Calculate your expected volume!
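
A minimal sketch of that volume calculation, assuming the blended per-1M-token rates used by the cost calculator (0.225 base, 0.675 fine-tuned) and the simplified $0.10 training rate. `break_even_requests` is an illustrative helper, not part of any library.

```python
def break_even_requests(training_cost: float,
                        tokens_per_request: int,
                        base_rate: float,
                        tuned_rate: float) -> float:
    """Requests after which the cumulative inference premium equals training_cost.

    Rates are blended dollars per 1M tokens.
    """
    premium_per_request = (tuned_rate - base_rate) * tokens_per_request / 1_000_000
    if premium_per_request <= 0:
        return float("inf")  # no premium, so the training cost is never overtaken
    return training_cost / premium_per_request

# 500 examples x 400 tokens x 3 epochs at the simplified $0.10 per 1M training rate.
training_cost = 500 * 400 * 3 / 1_000_000 * 0.10
print(round(break_even_requests(training_cost, 400, 0.225, 0.675)))
```

With these inputs the premium overtakes the training bill after a few hundred requests, so for any production workload the ongoing inference cost is the number to budget for.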

### 3. **Quality vs. Quantity trade-off**
- 100 high-quality examples > 1000 mediocre examples
- Focus on data quality, not just quantity
- Manual curation is worth the effort

### 4. **The Total Cost of Ownership**

**Example:** Chatbot from Scenario 3 (1,000 examples, ~600 tokens each, priced with the calculator above)
- Training: $0.18 (one-time)
- Inference (1M requests/month): $270/month extra cost
- Annual cost: $0.18 + ($270 × 12) = $3,240.18

**Question to ask:** *Is the improved performance worth about $3,240/year?*

---

## 💡 Cost Optimization Strategies

### Strategy 1: Start Small
```
100 examples → Test → Works? → Add 400 more → Evaluate
```

### Strategy 2: Distillation Path
```
1. Use GPT-4 for quality
2. Generate many outputs
3. Fine-tune gpt-5-nano on GPT-4 outputs
4. Get 80% of quality at 10% of cost
```

### Strategy 3: Hybrid Approach
```
Fine-tune for format → Use base model with RAG for knowledge
```

### Strategy 4: Batch Processing
```
Non-urgent requests → Process in batches → 50% cost reduction
```

---

**Remember:** The cheapest fine-tuning is the one you don't need to do!

Always exhaust prompt engineering and RAG before considering fine-tuning.

---

# Part 3: Supervised Fine-Tuning (SFT)

## 📖 What is Supervised Fine-Tuning?

**Supervised Fine-Tuning (SFT)** is the most straightforward fine-tuning approach:

**Core Concept:**
> "Here are perfect examples of inputs and their ideal outputs. Learn to copy this pattern."

**How it works:**
1. Provide pairs of (input → desired output)
2. Model learns to predict outputs given inputs
3. Model adjusts weights to minimize difference between its output and your examples

---

## 🔬 Technical Deep Dive

**Training Process:**
```
For each example in dataset:
    1. Model sees: "User: How do I reset my password?"
    2. Model predicts your example response one token at a time
    3. Compare each prediction to your example: "Here's how to reset your password: 1) Click..."
    4. Calculate loss (how unlikely the model found your tokens)
    5. Update model weights to reduce this loss
    6. Repeat over the whole dataset for each epoch
```
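
The loss in the loop above is commonly a token-level cross-entropy: the average negative log of the probability the model assigned to each token of your example response. The sketch below computes it for made-up probabilities; `cross_entropy` is an illustrative helper, and a real training run computes this inside the training framework.

```python
import math

def cross_entropy(token_probs: list[float]) -> float:
    """Mean negative log-likelihood of the target tokens.

    token_probs[i] is the probability the model assigned to the i-th
    token of the example response (made-up numbers in this sketch).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

before = cross_entropy([0.10, 0.20, 0.05])  # model rarely predicts the target tokens
after = cross_entropy([0.80, 0.90, 0.70])   # after updates, the targets are more likely
print(f"loss before: {before:.3f}, loss after: {after:.3f}")
```

Lower loss means the model finds your examples more probable, which is all that "learning the pattern" amounts to at this level.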

**What the model learns:**
- **Patterns in format:** If all examples use JSON → model outputs JSON
- **Consistency in style:** If all examples are formal → model becomes formal
- **Specific phrasings:** If all examples start with "Here's how" → model copies this
- **Edge case handling:** If examples show refusals → model learns when to refuse

**What the model does NOT learn:**
- ❌ New facts (use RAG instead)
- ❌ Better reasoning (use better base model)
- ❌ External knowledge (use tools/APIs)

---

## ✅ When to Use SFT

### Perfect for:
1. **Consistent output formatting** (JSON schemas, XML, structured data)
2. **Brand voice enforcement** (specific tone, emoji usage, greetings)
3. **Template following** (company-specific response structures)
4. **Code style compliance** (naming conventions, documentation patterns)
5. **Reducing latency** (distilling larger model behavior into smaller model)

### Signs SFT is right:
- ✅ You have clear "right answers" or "correct formats"
- ✅ Consistency is more important than creativity
- ✅ You can describe success objectively
- ✅ Prompt engineering gets close but not reliable enough

---

## ❌ When NOT to Use SFT

### Bad for:
1. **Teaching new information** → Use RAG
2. **Subjective quality improvements** → Use preference-based tuning
3. **Complex multi-objective optimization** → Use RLHF
4. **Fixing hallucinations** → Won't work, may make it worse

### Signs SFT is wrong:
- ❌ You need to update knowledge frequently
- ❌ "Better" is subjective and varies by context
- ❌ You want model to be "smarter"
- ❌ You don't have clear examples of correct behavior

---

## 📊 Data Requirements

**Minimum viable:** 50-100 examples
- Good for: Testing if fine-tuning helps
- Quality: May be inconsistent

**Recommended:** 500-1000 examples
- Good for: Production use
- Quality: Reliable and consistent

**Maximum useful:** 5000-10,000 examples
- Good for: Complex behaviors, multiple patterns
- Quality: Diminishing returns above this

**Data quality > Data quantity**
- 100 perfect examples > 1000 mediocre examples
- Consistency is critical
- Remove outliers and errors
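
One way to act on these points is a small cleaning pass before training. The sketch below drops exact duplicates and replies that are far longer than the median. The 3x-median threshold and the names `clean_dataset` and `make` are assumptions for illustration; real curation usually also needs manual review.

```python
def clean_dataset(examples: list[dict], max_length_ratio: float = 3.0) -> list[dict]:
    """Drop exact duplicates and examples whose reply is unusually long.

    An example is an outlier when its final assistant message is more than
    max_length_ratio times the median reply length (a hypothetical
    heuristic; tune or replace it for your data).
    """
    seen, unique = set(), []
    for ex in examples:
        key = tuple((m["role"], m["content"]) for m in ex["messages"])
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    if not unique:
        return []

    lengths = sorted(len(ex["messages"][-1]["content"]) for ex in unique)
    median = lengths[len(lengths) // 2]
    return [ex for ex in unique
            if len(ex["messages"][-1]["content"]) <= max_length_ratio * median]

def make(reply: str) -> dict:
    """Build a one-turn chat example with the given assistant reply."""
    return {"messages": [{"role": "user", "content": "Hi"},
                         {"role": "assistant", "content": reply}]}

data = [make("Hello there!"), make("Hello there!"), make("Hey, welcome!"), make("x" * 500)]
print(len(clean_dataset(data)))
```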

---

Let's see SFT in action with three real-world use cases! 🚀

---

## 🔧 How to Actually Fine-Tune with OpenAI (SFT)

**Note:** This code is educational and will NOT be executed in this notebook.

Here's the complete process for Supervised Fine-Tuning on OpenAI's platform:

In [None]:
# Step 1: Prepare your data in JSONL format
# Each line is a JSON object with 'messages' array

import json

# Example: Converting our customer service data to JSONL
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, how are you?"},
            {"role": "assistant", "content": "I'm doing well, thank you!"}
        ]
    },
    # ... more examples
]

# Save as JSONL file
with open('training_data.jsonl', 'w') as f:
    for example in training_data:
        f.write(json.dumps(example) + '\n')

print("✅ Training data prepared: training_data.jsonl")

In [None]:
# Step 2: Upload the training file to OpenAI

from openai import OpenAI

# Initialize client (reads the OPENAI_API_KEY environment variable by default,
# which keeps the key out of the notebook)
client = OpenAI()

# Upload training file (the context manager closes the file handle afterwards)
with open("training_data.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")

print(f"✅ File uploaded: {training_file.id}")
print(f"   Status: {training_file.status}")
print(f"   Filename: {training_file.filename}")

In [None]:
# Step 3: Create a fine-tuning job

# Start fine-tuning
fine_tune_job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # or another fine-tunable model; availability changes, so check OpenAI's docs
    hyperparameters={
        "n_epochs": 3,  # Number of training epochs (default: auto)
    },
    suffix="customer-service-v1"  # Optional: custom name suffix
)

print(f"✅ Fine-tuning job created: {fine_tune_job.id}")
print(f"   Status: {fine_tune_job.status}")
print(f"   Model: {fine_tune_job.model}")

In [None]:
# Step 4: Monitor training progress

import time

while True:
    job = client.fine_tuning.jobs.retrieve(fine_tune_job.id)

    print(f"Status: {job.status}")

    if job.status == "succeeded":
        print(f"\n✅ Fine-tuning completed!")
        print(f"   Fine-tuned model: {job.fine_tuned_model}")
        break
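    elif job.status in ("failed", "cancelled"):
        # Also stop on cancellation; otherwise this loop would poll forever
        print(f"\n❌ Fine-tuning ended with status '{job.status}': {job.error}")
        break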

    # Wait before checking again
    time.sleep(60)  # Check every minute

# Optional: View training events
events = client.fine_tuning.jobs.list_events(fine_tune_job.id, limit=10)
for event in events.data:
    print(f"{event.created_at}: {event.message}")

In [None]:
# Step 5: Use your fine-tuned model

fine_tuned_model_id = job.fine_tuned_model

# Make a request to your fine-tuned model
response = client.chat.completions.create(
    model=fine_tuned_model_id,
    messages=[
        {"role": "system", "content": "You are a customer service agent."},
        {"role": "user", "content": "My order hasn't arrived yet."}
    ],
    temperature=0.7,
    max_tokens=500
)

print("\n🤖 Fine-tuned model response:")
print(response.choices[0].message.content)

# Compare with base model
base_response = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",  # Base model
    messages=[
        {"role": "system", "content": "You are a customer service agent."},
        {"role": "user", "content": "My order hasn't arrived yet."}
    ],
    temperature=0.7,
    max_tokens=500
)

print("\n📊 Base model response:")
print(base_response.choices[0].message.content)

### 💰 Cost Estimation for This Job

```python
# Calculate costs
num_examples = 500
avg_tokens_per_example = 400
epochs = 3

total_tokens = num_examples * avg_tokens_per_example * epochs
training_cost = (total_tokens / 1_000_000) * 0.10  # $0.10 per 1M tokens

print(f"Estimated training cost: ${training_cost:.2f}")
```

**Remember:**
- Fine-tuned models cost 3x more per request for inference
- Budget for both training AND ongoing usage
- Start small (100-200 examples) to test effectiveness
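
Testing effectiveness on a small pilot is easier with a held-out set that the model never trains on. The sketch below shows one way to split the examples; `split_dataset`, the 20% fraction, and the fixed seed are assumptions for illustration. The held-out portion could be written to its own JSONL file and passed as the `validation_file` argument when creating the job.

```python
import random

def split_dataset(examples: list, validation_fraction: float = 0.2, seed: int = 42):
    """Shuffle a copy of the examples and split it into (train, validation)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

examples = [{"id": i} for i in range(150)]  # stand-in for real training records
train, validation = split_dataset(examples)
print(len(train), len(validation))
```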

---

## 💼 SFT Use Case #1: Customer Service Brand Voice

**Scenario:**
- Company: Hip e-commerce startup targeting Gen Z
- Problem: Generic AI responses don't match brand personality
- Goal: Every response should feel on-brand (casual, emoji-friendly, empathetic)

**Why SFT?**
- ✅ Clear pattern to copy (brand voice guidelines)
- ✅ Consistency is key (all responses should feel similar)
- ✅ Prompt engineering gets close but drifts over time

**What we're teaching:**
- Tone (casual but professional)
- Emoji usage (tasteful, not excessive)
- Greeting/closing patterns
- Empathy expressions
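
Since the model copies whatever the examples do, it can help to check the examples for consistency before training. The sketch below flags replies whose emoji count falls outside a target range. It uses Unicode category `So` as a rough proxy for emoji, and the 1 to 3 range plus the names `count_symbols` and `flag_off_brand` are assumptions for illustration.

```python
import unicodedata

def count_symbols(text: str) -> int:
    """Count characters in Unicode category 'So' (Symbol, other).

    Most pictographic emoji fall in this category. This is a rough proxy,
    not a full emoji parser: keycap digit sequences are not counted.
    """
    return sum(1 for ch in text if unicodedata.category(ch) == "So")

def flag_off_brand(replies: list[str], low: int = 1, high: int = 3) -> list[str]:
    """Return replies whose symbol count falls outside [low, high]."""
    return [r for r in replies if not low <= count_symbols(r) <= high]

replies = [
    "Oh no, I totally get why you're worried! 😟 Let me look into this. 💙",
    "Your refund has been processed.",
]
print(flag_off_brand(replies))
```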

In [None]:
import pandas as pd

# Mock Dataset: Customer Service Brand Voice
# This shows the EXACT format OpenAI expects for SFT

sft_customer_service_data = [
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a customer service agent for TrendyThreads, a Gen Z fashion brand. Be friendly, casual, use appropriate emojis, and show genuine empathy. Keep responses concise but warm."
            },
            {
                "role": "user",
                "content": "My order hasn't arrived yet and it's been 5 days. I'm getting worried."
            },
            {
                "role": "assistant",
                "content": "Oh no, I totally get why you're worried! 😟 Let me look into this for you right away.\n\nI've checked your order #TT-12847 and I can see it's currently with our shipping partner. It looks like there was a small delay at the distribution center, but the good news is it's already on its way to you! 📦\n\nYou should receive it within the next 2 business days. To make up for the wait, I'm adding a 15% discount code to your account (code: SORRY15) for your next purchase.\n\nIs there anything else I can help you with today? 💙"
            }
        ]
    },
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a customer service agent for TrendyThreads, a Gen Z fashion brand. Be friendly, casual, use appropriate emojis, and show genuine empathy. Keep responses concise but warm."
            },
            {
                "role": "user",
                "content": "The hoodie I received is the wrong size. Can I return it?"
            },
            {
                "role": "assistant",
                "content": "Ah, sizing issues are the worst! 😅 Don't worry though, we've got you covered.\n\nYes, absolutely! You can return it within 30 days for a full refund or exchange. Here's the super easy process:\n\n1️⃣ Go to our website → Your Orders → Select this hoodie\n2️⃣ Click \"Return or Exchange\"\n3️⃣ Print the prepaid shipping label\n4️⃣ Drop it off at any post office\n\nWe'll send you the correct size as soon as we receive your return! Usually takes about 5-7 days total.\n\nNeed any help with the process or have questions about sizing? I'm here! ✨"
            }
        ]
    },
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a customer service agent for TrendyThreads, a Gen Z fashion brand. Be friendly, casual, use appropriate emojis, and show genuine empathy. Keep responses concise but warm."
            },
            {
                "role": "user",
                "content": "Do you have this jacket in blue?"
            },
            {
                "role": "assistant",
                "content": "Great question! 💙 Let me check that for you.\n\nYes! The CloudFlex Jacket is available in Ocean Blue in sizes XS-XL. It's actually one of our most popular colors this season!\n\nWould you like me to send you a direct link? Or I can help you add it to your cart right now! 🛍️"
            }
        ]
    },
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a customer service agent for TrendyThreads, a Gen Z fashion brand. Be friendly, casual, use appropriate emojis, and show genuine empathy. Keep responses concise but warm."
            },
            {
                "role": "user",
                "content": "I got charged twice for the same order! This is unacceptable."
            },
            {
                "role": "assistant",
                "content": "Oh wow, that's definitely not okay! 😟 I'm really sorry this happened - let me fix this immediately.\n\nI've checked your account and you're right - there was a duplicate charge of $67.99. This was a processing error on our end, and I've already initiated a refund for the duplicate charge. You should see it back in your account within 3-5 business days depending on your bank.\n\nAs an apology for this hassle, I'm also crediting your account with $15 store credit (already applied!).\n\nI've flagged this issue with our payment team to make sure it doesn't happen again. Is there anything else I can do to make this right? 💙"
            }
        ]
    },
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a customer service agent for TrendyThreads, a Gen Z fashion brand. Be friendly, casual, use appropriate emojis, and show genuine empathy. Keep responses concise but warm."
            },
            {
                "role": "user",
                "content": "Thanks for helping me with my return!"
            },
            {
                "role": "assistant",
                "content": "You're so welcome! 🌟 Happy to help anytime!\n\nIf you need anything else or have questions when your exchange arrives, just reach out. We're always here for you! 💙\n\nHave an awesome day! ✨"
            }
        ]
    }
]

# Display as a readable format
print("\n" + "="*80)
print("💼 SFT USE CASE #1: Customer Service Brand Voice")
print("="*80 + "\n")

for idx, example in enumerate(sft_customer_service_data, 1):
    print(f"\n{'─'*80}")
    print(f"📝 Example {idx}")
    print(f"{'─'*80}\n")

    messages = example["messages"]

    # Skip system message in display (but it's included in training)
    for msg in messages:
        if msg["role"] == "user":
            print(f"👤 USER:\n{msg['content']}\n")
        elif msg["role"] == "assistant":
            print(f"🤖 ASSISTANT:\n{msg['content']}\n")

print("\n" + "="*80)
print("🎯 WHAT THE MODEL LEARNS FROM THESE EXAMPLES:")
print("="*80)
print("""\n
1. **Tone & Voice:**
   - Casual but professional (\"Let me check that for you\")
   - Empathetic acknowledgment (\"I totally get why you're worried\")
   - Positive language (\"good news\", \"Great question!\")

2. **Emoji Usage:**
   - 1-3 emojis per response (not excessive)
   - Contextually appropriate (😟 for problems, 💙 for warmth)
   - Enhances feeling, doesn't replace words

3. **Structure:**
   - Acknowledge emotion/issue first
   - Provide solution or information
   - Offer additional help
   - Warm closing

4. **Problem Handling:**
   - Immediate apology when appropriate
   - Take ownership (\"I've already initiated...\")
   - Offer compensation for errors
   - Follow up question

5. **Consistency:**
   - Always ends with warmth
   - Uses \"we/us\" for brand
   - Balances friendliness with professionalism
   - Never dismissive or robotic
""")

print("\n" + "="*80)
print("⚠️ IMPORTANT: These examples show CONSISTENT patterns!")
print("   The model learns from repetition across many examples.")
print("   In production, you'd need 500-1000 examples like these.")
print("="*80)

## 📊 SFT Use Case #2: Structured JSON Output

**Scenario:**
- Company: Invoice processing automation
- Problem: Need to extract data from invoices into exact JSON schema
- Goal: 100% consistent JSON format, specific field names, proper data types

**Why SFT?**
- ✅ Exact format requirements (can't vary)
- ✅ Prompt engineering gets 80% right, but 20% have format errors
- ✅ Even with structured outputs, need specific business field names
- ✅ Need consistent handling of edge cases (missing data, unclear amounts)

**What we're teaching:**
- Exact JSON schema
- Field naming conventions
- Data type consistency
- Null handling for missing data
- Edge case patterns
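
A schema check like the following can be used both to vet training examples and to measure how often a model's output is malformed. The field names are the ones listed in this use case's system prompt; `validate_invoice` and its specific checks are a hypothetical sketch, not a complete validator.

```python
import json

TOP_LEVEL_FIELDS = {"invoice_number", "invoice_date", "vendor_name",
                    "vendor_address", "total_amount", "tax_amount", "line_items"}
LINE_ITEM_FIELDS = {"description", "quantity", "unit_price", "total_price"}

def validate_invoice(raw: str) -> list[str]:
    """Return a list of schema problems; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]

    problems = []
    if set(data) != TOP_LEVEL_FIELDS:
        problems.append(f"unexpected or missing fields: {sorted(set(data) ^ TOP_LEVEL_FIELDS)}")
    items = data.get("line_items")
    if not isinstance(items, list):
        problems.append("line_items must be an array")
        items = []
    for i, item in enumerate(items):
        if not isinstance(item, dict) or set(item) != LINE_ITEM_FIELDS:
            problems.append(f"line_items[{i}] has the wrong fields")
    for field in ("total_amount", "tax_amount"):
        value = data.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly
        if value is not None and (isinstance(value, bool) or not isinstance(value, (int, float))):
            problems.append(f"{field} must be a number or null")
    return problems

good = ('{"invoice_number": "INV-1", "invoice_date": "2024-01-15", "vendor_name": null, '
        '"vendor_address": null, "total_amount": 10.5, "tax_amount": null, "line_items": []}')
print(validate_invoice(good))
print(validate_invoice('{"invoice_no": "INV-1"}'))
```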

In [None]:
# Mock Dataset: Structured JSON Output for Invoice Extraction

sft_json_extraction_data = [
    {
        "messages": [
            {
                "role": "system",
                "content": "You are an invoice data extraction system. Extract information and output in JSON format with these exact fields: invoice_number, invoice_date, vendor_name, vendor_address, total_amount, tax_amount, line_items (array of objects with: description, quantity, unit_price, total_price). Use null for missing values."
            },
            {
                "role": "user",
                "content": "INVOICE\n\nInvoice #: INV-2024-001\nDate: January 15, 2024\n\nFrom:\nTech Supplies Inc.\n123 Silicon Valley Rd, San Jose, CA 95110\n\nLine Items:\n- Dell Monitor 27\" x 5 @ $299.99 = $1,499.95\n- USB-C Cable x 10 @ $12.99 = $129.90\n- Wireless Mouse x 5 @ $34.99 = $174.95\n\nSubtotal: $1,804.80\nTax (8.5%): $153.41\nTotal: $1,958.21"
            },
            {
                "role": "assistant",
                "content": "{\n  \"invoice_number\": \"INV-2024-001\",\n  \"invoice_date\": \"2024-01-15\",\n  \"vendor_name\": \"Tech Supplies Inc.\",\n  \"vendor_address\": \"123 Silicon Valley Rd, San Jose, CA 95110\",\n  \"total_amount\": 1958.21,\n  \"tax_amount\": 153.41,\n  \"line_items\": [\n    {\n      \"description\": \"Dell Monitor 27\\\"\",\n      \"quantity\": 5,\n      \"unit_price\": 299.99,\n      \"total_price\": 1499.95\n    },\n    {\n      \"description\": \"USB-C Cable\",\n      \"quantity\": 10,\n      \"unit_price\": 12.99,\n      \"total_price\": 129.90\n    },\n    {\n      \"description\": \"Wireless Mouse\",\n      \"quantity\": 5,\n      \"unit_price\": 34.99,\n      \"total_price\": 174.95\n    }\n  ]\n}"
            }
        ]
    },
    {
        "messages": [
            {
                "role": "system",
                "content": "You are an invoice data extraction system. Extract information and output in JSON format with these exact fields: invoice_number, invoice_date, vendor_name, vendor_address, total_amount, tax_amount, line_items (array of objects with: description, quantity, unit_price, total_price). Use null for missing values."
            },
            {
                "role": "user",
                "content": "Invoice Number: 2024-Q1-789\nDate: 03/22/2024\n\nBill From: Office Depot\n500 Commerce St, Austin TX 78701\n\nItems Purchased:\nPrinter Paper (Case) - Qty: 3 - $42.50 each\nStapler - Qty: 8 - $8.99 each\n\nTotal Due: $199.42 (includes tax)"
            },
            {
                "role": "assistant",
                "content": "{\n  \"invoice_number\": \"2024-Q1-789\",\n  \"invoice_date\": \"2024-03-22\",\n  \"vendor_name\": \"Office Depot\",\n  \"vendor_address\": \"500 Commerce St, Austin TX 78701\",\n  \"total_amount\": 199.42,\n  \"tax_amount\": null,\n  \"line_items\": [\n    {\n      \"description\": \"Printer Paper (Case)\",\n      \"quantity\": 3,\n      \"unit_price\": 42.50,\n      \"total_price\": 127.50\n    },\n    {\n      \"description\": \"Stapler\",\n      \"quantity\": 8,\n      \"unit_price\": 8.99,\n      \"total_price\": 71.92\n    }\n  ]\n}"
            }
        ]
    },
    {
        "messages": [
            {
                "role": "system",
                "content": "You are an invoice data extraction system. Extract information and output in JSON format with these exact fields: invoice_number, invoice_date, vendor_name, vendor_address, total_amount, tax_amount, line_items (array of objects with: description, quantity, unit_price, total_price). Use null for missing values."
            },
            {
                "role": "user",
                "content": "Receipt #ACC-9921\n2024-02-10\n\nAcme Consulting Services\nAddress not provided\n\nConsulting Services (February) - $5,000.00\n\nAmount Due: $5,000.00"
            },
            {
                "role": "assistant",
                "content": "{\n  \"invoice_number\": \"ACC-9921\",\n  \"invoice_date\": \"2024-02-10\",\n  \"vendor_name\": \"Acme Consulting Services\",\n  \"vendor_address\": null,\n  \"total_amount\": 5000.00,\n  \"tax_amount\": null,\n  \"line_items\": [\n    {\n      \"description\": \"Consulting Services (February)\",\n      \"quantity\": 1,\n      \"unit_price\": 5000.00,\n      \"total_price\": 5000.00\n    }\n  ]\n}"
            }
        ]
    },
    {
        "messages": [
            {
                "role": "system",
                "content": "You are an invoice data extraction system. Extract information and output in JSON format with these exact fields: invoice_number, invoice_date, vendor_name, vendor_address, total_amount, tax_amount, line_items (array of objects with: description, quantity, unit_price, total_price). Use null for missing values."
            },
            {
                "role": "user",
                "content": "Invoice: WEB-2024-456\nIssued: 01/30/2024\n\nCompany: WebHost Pro\nLocation: 789 Server Lane, Seattle WA 98101\n\nServices:\n• Premium Hosting Plan (Annual) x 1 @ $599.00\n• Domain Registration (.com) x 2 @ $14.99\n• SSL Certificate x 1 @ $89.00\n\nSubtotal: $717.98\nSales Tax (10%): $71.80\nGrand Total: $789.78"
            },
            {
                "role": "assistant",
                "content": "{\n  \"invoice_number\": \"WEB-2024-456\",\n  \"invoice_date\": \"2024-01-30\",\n  \"vendor_name\": \"WebHost Pro\",\n  \"vendor_address\": \"789 Server Lane, Seattle WA 98101\",\n  \"total_amount\": 789.78,\n  \"tax_amount\": 71.80,\n  \"line_items\": [\n    {\n      \"description\": \"Premium Hosting Plan (Annual)\",\n      \"quantity\": 1,\n      \"unit_price\": 599.00,\n      \"total_price\": 599.00\n    },\n    {\n      \"description\": \"Domain Registration (.com)\",\n      \"quantity\": 2,\n      \"unit_price\": 14.99,\n      \"total_price\": 29.98\n    },\n    {\n      \"description\": \"SSL Certificate\",\n      \"quantity\": 1,\n      \"unit_price\": 89.00,\n      \"total_price\": 89.00\n    }\n  ]\n}"
            }
        ]
    },
    {
        "messages": [
            {
                "role": "system",
                "content": "You are an invoice data extraction system. Extract information and output in JSON format with these exact fields: invoice_number, invoice_date, vendor_name, vendor_address, total_amount, tax_amount, line_items (array of objects with: description, quantity, unit_price, total_price). Use null for missing values."
            },
            {
                "role": "user",
                "content": "BILL\n\nRef: MISC-045\nWhen: April 5, 2024\n\nSupplier: Random Parts Co\n\nPurchased: Various hardware components\nCost: $2,450\n\nPaid in full."
            },
            {
                "role": "assistant",
                "content": "{\n  \"invoice_number\": \"MISC-045\",\n  \"invoice_date\": \"2024-04-05\",\n  \"vendor_name\": \"Random Parts Co\",\n  \"vendor_address\": null,\n  \"total_amount\": 2450.00,\n  \"tax_amount\": null,\n  \"line_items\": [\n    {\n      \"description\": \"Various hardware components\",\n      \"quantity\": 1,\n      \"unit_price\": 2450.00,\n      \"total_price\": 2450.00\n    }\n  ]\n}"
            }
        ]
    }
]

# Display examples
print("\n" + "="*80)
print("📊 SFT USE CASE #2: Structured JSON Output")
print("="*80 + "\n")

for idx, example in enumerate(sft_json_extraction_data, 1):
    print(f"\n{'─'*80}")
    print(f"📝 Example {idx}")
    print(f"{'─'*80}\n")

    messages = example["messages"]
    user_msg = next(m["content"] for m in messages if m["role"] == "user")
    assistant_msg = next(m["content"] for m in messages if m["role"] == "assistant")

    print(f"📄 INPUT (Invoice Text):\n{user_msg}\n")
    print(f"📋 OUTPUT (Extracted JSON):\n{assistant_msg}\n")

print("\n" + "="*80)
print("🎯 WHAT THE MODEL LEARNS FROM THESE EXAMPLES:")
print("="*80)
print("""
1. **Exact Schema Adherence:**
   - ALWAYS includes all fields (invoice_number, invoice_date, etc.)
   - NEVER adds extra fields not in schema
   - NEVER renames fields

2. **Data Type Consistency:**
   - Numbers are numbers (not strings): 1958.21, not \"1958.21\"
   - Dates in ISO format: \"2024-01-15\"
   - Quantities are integers: 5, not 5.0
   - Prices are floats with 2 decimals: 299.99

3. **Null Handling:**
   - Missing tax? → \"tax_amount\": null
   - Missing address? → \"vendor_address\": null
   - NEVER uses empty strings or \"N/A\"

4. **Edge Case Patterns:**
   - Vague items (\"Various components\") → Still extracted
   - Different date formats → Normalized to ISO
   - Included tax in total → tax_amount: null (not calculable)
   - Single line item → Still use array format

5. **Format Variations Handled:**
   - \"Invoice #\" vs \"Receipt #\" vs \"Ref:" → All extracted
   - Different layouts (table vs list vs paragraph)
   - Bullet points (•, -, no marker)
   - Currency symbols ($) → Stripped from numbers

⚠️ CRITICAL: Without fine-tuning, the base model might:
   - Add \"currency\" field → Schema violation
   - Use \"N/A\" instead of null → Type error
   - Format dates inconsistently
   - Include dollar signs in numbers

✅ After fine-tuning: near-100% schema compliance (still validate every output)
""")

print("\n" + "="*80)
print("💡 WHY SFT BEATS PROMPT ENGINEERING HERE:")
print("="*80)
print("""
Prompt engineering can get you 80-90% accuracy, but:
- Occasional schema violations (extra fields, wrong names)
- Inconsistent null handling
- Type errors (strings vs numbers)
- Date format variations

Fine-tuning can reach 98-99% accuracy with:
- Near-perfect schema adherence
- Consistent edge case handling
- Reliable null usage
- Proper data types

For production systems with strict requirements, this improvement of
roughly 10-20 percentage points is worth the investment.
""")
print("="*80)
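
Schema compliance is only useful as a metric if it can be measured. The following is a minimal sketch of a validator for the invoice schema described in the system prompt above, using only the standard library. The names `TOP_LEVEL`, `LINE_ITEM`, `_check_fields`, and `validate_invoice_json` are illustrative, and the allowed types are an assumption based on the "Data Type Consistency" and "Null Handling" rules listed above.

```python
import json

# Expected fields and the Python types allowed after json.loads.
# NoneType is allowed wherever the system prompt says to use null.
NULL = type(None)
TOP_LEVEL = {
    "invoice_number": (str, NULL),
    "invoice_date": (str, NULL),
    "vendor_name": (str, NULL),
    "vendor_address": (str, NULL),
    "total_amount": (int, float, NULL),
    "tax_amount": (int, float, NULL),
    "line_items": (list,),
}
LINE_ITEM = {
    "description": (str, NULL),
    "quantity": (int, NULL),  # 5 passes, 5.0 is flagged
    "unit_price": (int, float, NULL),
    "total_price": (int, float, NULL),
}


def _check_fields(obj: dict, spec: dict, where: str) -> list[str]:
    """Compare one JSON object against a field spec and list the problems."""
    problems = []
    if set(obj) != set(spec):
        missing = sorted(set(spec) - set(obj))
        extra = sorted(set(obj) - set(spec))
        problems.append(f"{where}: missing={missing} extra={extra}")
    for key, allowed in spec.items():
        if key in obj:
            value = obj[key]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, allowed):
                problems.append(f"{where}.{key}: unexpected type {type(value).__name__}")
    return problems


def validate_invoice_json(raw: str) -> list[str]:
    """Return a list of schema problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]
    problems = _check_fields(data, TOP_LEVEL, "invoice")
    items = data.get("line_items")
    if isinstance(items, list):
        for i, item in enumerate(items):
            if isinstance(item, dict):
                problems += _check_fields(item, LINE_ITEM, f"line_items[{i}]")
            else:
                problems.append(f"line_items[{i}]: not an object")
    return problems


good = (
    '{"invoice_number": "ACC-9921", "invoice_date": "2024-02-10", '
    '"vendor_name": "Acme Consulting Services", "vendor_address": null, '
    '"total_amount": 5000.00, "tax_amount": null, "line_items": '
    '[{"description": "Consulting Services (February)", "quantity": 1, '
    '"unit_price": 5000.00, "total_price": 5000.00}]}'
)
bad = '{"invoice_number": "ACC-9921", "total_amount": "5000.00", "currency": "USD"}'

print(validate_invoice_json(good))
print(validate_invoice_json(bad))
```

Running a check like this over both the training examples and a held-out set of model outputs gives a compliance rate you can compare before and after fine-tuning, rather than relying on an estimate.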

## 💻 SFT Use Case #3: Code Generation with Style Guide

**Scenario:**
- Company: Enterprise software development team
- Problem: AI generates code that doesn't follow company standards
- Goal: Every generated function follows exact style guide

**Why SFT?**
- ✅ Strict company coding standards (docstrings, type hints, naming)
- ✅ Consistency across entire codebase
- ✅ Prompt engineering produces "good" code but wrong style
- ✅ Need to enforce non-negotiable patterns

**What we're teaching:**
- Specific docstring format (Google style)
- Type hints everywhere
- Naming conventions (snake_case functions, PascalCase classes)
- Error handling patterns
- Logging practices

In [None]:
# Mock Dataset: Code Generation with Style Guide

sft_code_generation_data = [
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a code generation assistant for TechCorp. Generate Python code following our style guide: Google-style docstrings, full type hints, snake_case for functions, PascalCase for classes, explicit error handling with logging, and comprehensive input validation."
            },
            {
                "role": "user",
                "content": "Create a function that calculates the average of a list of numbers."
            },
            {
                "role": "assistant",
                "content": """import logging\nfrom typing import List, Union\n\nlogger = logging.getLogger(__name__)\n\n\ndef calculate_average(numbers: List[Union[int, float]]) -> float:\n    \"\"\"Calculate the arithmetic mean of a list of numbers.\n    \n    Args:\n        numbers: A list of integers or floats to average.\n            \n    Returns:\n        The arithmetic mean of the input numbers.\n        \n    Raises:\n        ValueError: If the input list is empty.\n        TypeError: If the input contains non-numeric values.\n        \n    Examples:\n        >>> calculate_average([1, 2, 3, 4, 5])\n        3.0\n        >>> calculate_average([10.5, 20.5])\n        15.5\n    \"\"\"\n    logger.debug(f\"Calculating average for {len(numbers)} numbers\")\n    \n    # Input validation\n    if not numbers:\n        logger.error(\"Cannot calculate average of empty list\")\n        raise ValueError(\"Input list cannot be empty\")\n    \n    if not all(isinstance(n, (int, float)) for n in numbers):\n        logger.error(\"Input list contains non-numeric values\")\n        raise TypeError(\"All elements must be integers or floats\")\n    \n    # Calculate average\n    result = sum(numbers) / len(numbers)\n    logger.info(f\"Average calculated: {result}\")\n    \n    return result"""
            }
        ]
    },
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a code generation assistant for TechCorp. Generate Python code following our style guide: Google-style docstrings, full type hints, snake_case for functions, PascalCase for classes, explicit error handling with logging, and comprehensive input validation."
            },
            {
                "role": "user",
                "content": "Create a class that represents a User with name, email, and age."
            },
            {
                "role": "assistant",
                "content": """import logging\nimport re\nfrom typing import Optional\n\nlogger = logging.getLogger(__name__)\n\n\nclass User:\n    \"\"\"Represents a user in the system.\n    \n    This class encapsulates user information including name, email, and age,\n    with built-in validation for each field.\n    \n    Attributes:\n        name: The user's full name.\n        email: The user's email address.\n        age: The user's age in years.\n        \n    Examples:\n        >>> user = User(name=\"John Doe\", email=\"john@example.com\", age=30)\n        >>> print(user.email)\n        john@example.com\n    \"\"\"\n    \n    def __init__(\n        self,\n        name: str,\n        email: str,\n        age: int\n    ) -> None:\n        \"\"\"Initialize a new User instance.\n        \n        Args:\n            name: The user's full name (must be non-empty).\n            email: The user's email address (must be valid format).\n            age: The user's age in years (must be positive).\n            \n        Raises:\n            ValueError: If any input validation fails.\n        \"\"\"\n        logger.debug(f\"Creating new User: {name}\")\n        \n        self._validate_and_set_name(name)\n        self._validate_and_set_email(email)\n        self._validate_and_set_age(age)\n        \n        logger.info(f\"User created successfully: {self.email}\")\n    \n    def _validate_and_set_name(self, name: str) -> None:\n        \"\"\"Validate and set the user's name.\n        \n        Args:\n            name: The name to validate and set.\n            \n        Raises:\n            ValueError: If name is empty or not a string.\n        \"\"\"\n        if not isinstance(name, str):\n            logger.error(\"Name must be a string\")\n            raise ValueError(\"Name must be a string\")\n        \n        if not name.strip():\n            logger.error(\"Name cannot be empty\")\n            raise ValueError(\"Name cannot be empty\")\n        \n        
self.name = name.strip()\n    \n    def _validate_and_set_email(self, email: str) -> None:\n        \"\"\"Validate and set the user's email address.\n        \n        Args:\n            email: The email address to validate and set.\n            \n        Raises:\n            ValueError: If email format is invalid.\n        \"\"\"\n        email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'\n        \n        if not isinstance(email, str) or not re.match(email_pattern, email):\n            logger.error(f\"Invalid email format: {email}\")\n            raise ValueError(\"Invalid email format\")\n        \n        self.email = email.lower()\n    \n    def _validate_and_set_age(self, age: int) -> None:\n        \"\"\"Validate and set the user's age.\n        \n        Args:\n            age: The age to validate and set.\n            \n        Raises:\n            ValueError: If age is not a positive integer.\n        \"\"\"\n        if not isinstance(age, int):\n            logger.error(\"Age must be an integer\")\n            raise ValueError(\"Age must be an integer\")\n        \n        if age <= 0 or age > 150:\n            logger.error(f\"Age out of valid range: {age}\")\n            raise ValueError(\"Age must be between 1 and 150\")\n        \n        self.age = age\n    \n    def __repr__(self) -> str:\n        \"\"\"Return a string representation of the User.\n        \n        Returns:\n            A string representation including name, email, and age.\n        \"\"\"\n        return f\"User(name='{self.name}', email='{self.email}', age={self.age})\""""
            }
        ]
    },
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a code generation assistant for TechCorp. Generate Python code following our style guide: Google-style docstrings, full type hints, snake_case for functions, PascalCase for classes, explicit error handling with logging, and comprehensive input validation."
            },
            {
                "role": "user",
                "content": "Write a function that reads a JSON file and returns the parsed data."
            },
            {
                "role": "assistant",
                "content": """import json\nimport logging\nfrom pathlib import Path\nfrom typing import Any, Dict, Union\n\nlogger = logging.getLogger(__name__)\n\n\ndef read_json_file(file_path: Union[str, Path]) -> Dict[str, Any]:\n    \"\"\"Read and parse a JSON file.\n    \n    Args:\n        file_path: Path to the JSON file (string or Path object).\n        \n    Returns:\n        A dictionary containing the parsed JSON data.\n        \n    Raises:\n        FileNotFoundError: If the specified file does not exist.\n        json.JSONDecodeError: If the file contains invalid JSON.\n        PermissionError: If the file cannot be read due to permissions.\n        \n    Examples:\n        >>> data = read_json_file('config.json')\n        >>> print(data['api_key'])\n        'abc123'\n    \"\"\"\n    logger.debug(f\"Attempting to read JSON file: {file_path}\")\n    \n    # Convert string to Path object\n    if isinstance(file_path, str):\n        file_path = Path(file_path)\n    \n    # Validate file exists\n    if not file_path.exists():\n        logger.error(f\"File not found: {file_path}\")\n        raise FileNotFoundError(f\"File not found: {file_path}\")\n    \n    # Validate it's a file (not directory)\n    if not file_path.is_file():\n        logger.error(f\"Path is not a file: {file_path}\")\n        raise ValueError(f\"Path is not a file: {file_path}\")\n    \n    # Read and parse JSON\n    try:\n        with open(file_path, 'r', encoding='utf-8') as file:\n            data = json.load(file)\n            logger.info(f\"Successfully read JSON file: {file_path}\")\n            return data\n    \n    except json.JSONDecodeError as e:\n        logger.error(f\"Invalid JSON in file {file_path}: {str(e)}\")\n        raise json.JSONDecodeError(\n            f\"Invalid JSON in file {file_path}\",\n            e.doc,\n            e.pos\n        )\n    \n    except PermissionError as e:\n        logger.error(f\"Permission denied reading file {file_path}: {str(e)}\")\n   
     raise PermissionError(f\"Cannot read file {file_path}: {str(e)}\")\n    \n    except Exception as e:\n        logger.error(f\"Unexpected error reading file {file_path}: {str(e)}\")\n        raise RuntimeError(f\"Failed to read file {file_path}: {str(e)}\")"""
            }
        ]
    }
]

# Display examples
print("\n" + "="*80)
print("💻 SFT USE CASE #3: Code Generation with Style Guide")
print("="*80 + "\n")

for idx, example in enumerate(sft_code_generation_data, 1):
    print(f"\n{'─'*80}")
    print(f"📝 Example {idx}")
    print(f"{'─'*80}\n")

    messages = example["messages"]
    user_msg = next(m["content"] for m in messages if m["role"] == "user")
    assistant_msg = next(m["content"] for m in messages if m["role"] == "assistant")

    print(f"💬 REQUEST:\n{user_msg}\n")
    print(f"💻 GENERATED CODE:\n{assistant_msg}\n")

print("\n" + "="*80)
print("🎯 WHAT THE MODEL LEARNS FROM THESE EXAMPLES:")
print("="*80)
print("""
1. **Docstring Format (Google Style):**
   - Summary line first
   - Blank line
   - Detailed description
   - Args section with type info
   - Returns section
   - Raises section
   - Examples section with doctests

2. **Type Hints Everywhere:**
   - Function parameters: name: str
   - Return types: -> float
   - Complex types: List[Union[int, float]]
   - Optional types: Optional[str]
   - Import from typing module

3. **Naming Conventions:**
   - Functions: snake_case (calculate_average, read_json_file)
   - Classes: PascalCase (User, DataProcessor)
   - Private methods: _validate_and_set_name
   - Constants: UPPER_SNAKE_CASE (implied)

4. **Error Handling Pattern:**
   - Validate all inputs
   - Log errors before raising
   - Raise specific exceptions
   - Descriptive error messages
   - Try-except with specific handling

5. **Logging Practice:**
   - Import logger at module level
   - Debug for input values
   - Info for successful operations
   - Error for exceptions
   - Structured log messages

6. **Input Validation:**
   - Check types explicitly
   - Validate ranges/formats
   - Handle edge cases (empty list, None, etc.)
   - Provide clear error messages

7. **Code Organization:**
   - Imports at top (grouped: stdlib, third-party, local)
   - Logger after imports
   - Public methods first
   - Private methods (prefixed with _) last
   - Blank lines for readability

⚠️ WITHOUT FINE-TUNING, GPT might:
   - Use different docstring formats (Sphinx, NumPy)
   - Omit type hints on some parameters
   - Skip input validation
   - Use generic exceptions (Exception instead of ValueError)
   - Inconsistent logging (or none at all)
   - Mix naming conventions

✅ AFTER FINE-TUNING (typical outcome, verify on your own evaluation set):
   - Much higher compliance with the style guide
   - Consistent patterns across generated code
   - Error handling applied far more reliably
   - Code that usually passes linting without changes
""")

print("\n" + "="*80)
print("💡 BUSINESS VALUE:")
print("="*80)
print("""
For a development team (illustrative figures, not measured results):
- Code review time reduced by 40% (no style discussions)
- Onboarding faster (AI generates consistent examples)
- Fewer bugs (consistent validation patterns)
- Better maintainability (uniform codebase)
- Automated code generation for boilerplate

ROI Example:
- 10 developers × 5 hours/week saved = 50 hours/week
- At $50/hour = $2,500/week = $130,000/year
- Fine-tuning cost: ~$500-2,000
- Break-even: < 1 week
""")
print("="*80)
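
Style-guide compliance can be spot-checked mechanically rather than by eye. Here is a minimal sketch, assuming the standard-library `ast` module is enough for a first pass: it checks docstrings, type hints, and naming for each function and class in a piece of generated code. `check_style` and the two sample snippets are hypothetical, and the check does not cover everything in the style guide above (logging and validation patterns, `*args`/`**kwargs`, and docstring sections are left out).

```python
import ast


def check_style(source: str) -> list[str]:
    """Return style problems found in `source` (an empty list means none found)."""
    problems = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                problems.append(f"{node.name}: missing docstring")
            if node.returns is None:
                problems.append(f"{node.name}: missing return type hint")
            args = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
            for arg in args:
                if arg.arg in ("self", "cls"):
                    continue
                if arg.annotation is None:
                    problems.append(f"{node.name}: parameter '{arg.arg}' has no type hint")
            # Rough snake_case test: no uppercase letters in the name.
            if node.name != node.name.lower():
                problems.append(f"{node.name}: function name is not snake_case")
        elif isinstance(node, ast.ClassDef):
            if ast.get_docstring(node) is None:
                problems.append(f"{node.name}: missing docstring")
            # Rough PascalCase test: starts uppercase, no underscores.
            if not node.name[0].isupper() or "_" in node.name:
                problems.append(f"{node.name}: class name is not PascalCase")
    return problems


compliant = '''
def calculate_average(numbers: list[float]) -> float:
    """Calculate the arithmetic mean of a list of numbers."""
    return sum(numbers) / len(numbers)
'''

non_compliant = '''
def calcAverage(numbers):
    return sum(numbers) / len(numbers)
'''

print(check_style(compliant))
for problem in check_style(non_compliant):
    print(problem)
```

Applied to a sample of outputs from the base model and from the fine-tuned model, a check like this turns "follows the style guide" into a number you can track.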

## 📊 SFT Summary

### ✅ Strengths

1. **Simplicity**: Easiest fine-tuning approach to understand and implement
2. **Clear Success Criteria**: You know exactly what "right" looks like
3. **Data Collection**: Straightforward - just input-output pairs
4. **Fast Training**: Converges quickly (usually 1-3 epochs)
5. **Consistency**: Excellent for enforcing patterns and formats

### ❌ Limitations

1. **Can't Teach Subjective Quality**: "Better" must be objective
2. **Risk of Overfitting**: Too many epochs or too narrow a dataset → loss of generalization
3. **No Nuance**: All examples weighted equally (good or bad)
4. **Knowledge Limitations**: Can't add new facts effectively
5. **Binary Learning**: Either follows pattern or doesn't - no middle ground

---

### 💡 Data Collection Tips for SFT

#### 1. Quality Over Quantity
```
❌ 1000 mediocre examples with inconsistencies
✅ 500 perfect examples that are 100% consistent
```

#### 2. Remove Outliers
```
Review your dataset for:
- Inconsistent formatting
- Different writing styles
- Errors or typos
- Edge cases that shouldn't be patterns
```

#### 3. Cover Edge Cases
```
Include examples of:
- Empty inputs
- Maximum/minimum values
- Unusual but valid inputs
- Proper error handling
```

#### 4. Balance Your Dataset
```
If you have multiple patterns:
- 30% common case A
- 30% common case B
- 20% edge case C
- 20% edge case D

Don't do:
- 90% case A, 10% everything else
```

#### 5. Use Real Data
```
✅ Actual customer questions + your best responses
❌ Made-up examples that don't reflect reality
```
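
Some of these tips can be partly automated. The sketch below is one possible audit for a chat-format SFT dataset, assuming every example is a single system/user/assistant exchange like the mock datasets in this part. `audit_sft_dataset` and `toy_dataset` are illustrative names, and the checks are a starting point rather than a complete quality gate.

```python
from collections import Counter


def audit_sft_dataset(examples: list[dict]) -> dict:
    """Summarize simple consistency signals for a chat-format SFT dataset."""
    system_prompts = Counter()
    duplicates = []
    malformed = []
    seen_user_messages = set()

    for idx, example in enumerate(examples):
        messages = example.get("messages", [])
        roles = [m.get("role") for m in messages]
        # Expect exactly one system, user, and assistant turn, in that order.
        if roles != ["system", "user", "assistant"]:
            malformed.append(idx)
            continue
        system, user, assistant = (m.get("content", "") for m in messages)
        if not user.strip() or not assistant.strip():
            malformed.append(idx)
            continue
        system_prompts[system] += 1
        # Treat user messages that differ only in case or padding as duplicates.
        key = user.strip().lower()
        if key in seen_user_messages:
            duplicates.append(idx)
        seen_user_messages.add(key)

    return {
        "total": len(examples),
        "distinct_system_prompts": len(system_prompts),
        "duplicate_user_messages": duplicates,
        "malformed_examples": malformed,
    }


# Hypothetical examples, for illustration only.
toy_dataset = [
    {"messages": [
        {"role": "system", "content": "You are a support assistant."},
        {"role": "user", "content": "Where is my order?"},
        {"role": "assistant", "content": "Let me check that for you!"},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a support assistant."},
        {"role": "user", "content": "where is my order?"},
        {"role": "assistant", "content": "I'll look into it right away."},
    ]},
    {"messages": [
        {"role": "user", "content": "Hello"},
    ]},
]
print(audit_sft_dataset(toy_dataset))
```

More than one distinct system prompt, or a long list of duplicates, is usually worth a manual look before you pay for a training run.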

---

### 🎯 When SFT is the Best Choice

**Use SFT when you can answer YES to all:**

1. ✅ Do I have clear examples of correct behavior?
2. ✅ Is "correct" objective and consistent?
3. ✅ Is the goal consistency rather than knowledge?
4. ✅ Have I exhausted prompt engineering?
5. ✅ Am I okay with the cost and inference overhead?

**Common successful use cases:**
- Customer service brand voice ✅
- Structured output generation (JSON, XML) ✅
- Code style enforcement ✅
- Response template following ✅
- Model distillation (larger teacher model → smaller student model, e.g. GPT-4o → gpt-4o-mini) ✅

---

Next, let's explore **Preference-Based Fine-Tuning** for situations where "better" is subjective! 🎯

---

# Part 4: Preference-Based Fine-Tuning

## 📖 What is Preference-Based Fine-Tuning?

**Preference-Based Fine-Tuning** teaches models subjective quality by showing comparisons:

**Core Concept:**
> "Response A is better than Response B for this input. Learn what makes responses better."

**Key Difference from SFT:**
- **SFT**: "This is the ONE correct answer"
- **Preferences**: "Both answers are valid, but this one is BETTER"

---

## 🔬 Technical Deep Dive

**Training Process:**
```
For each example:
    1. Model sees: "How do I improve my credit score?"
    2. Two responses provided:
       Response A (chosen): Empathetic, actionable, comprehensive
       Response B (rejected): Correct but terse and unhelpful
    3. Model learns: Response A > Response B
    4. Adjusts to increase probability of A-style responses
    5. Adjusts to decrease probability of B-style responses
```

**What the model learns:**
- Subjective quality (helpfulness, tone, depth)
- Contextual appropriateness
- Balancing multiple objectives (accuracy + empathy)
- When to be verbose vs. concise

**What it does NOT learn:**
- New facts (use RAG)
- Complex multi-dimensional quality (use RLHF)
- Exact formatting (use SFT)

---

## 🔑 Key Concept: Both Responses Are "Correct"

This is the critical insight that differentiates preference-based tuning:

**Example:**
```
User: "How do I reset my password?"

Response A (Chosen): ✅
"I'd be happy to help you reset your password! Here's how:
1. Click 'Forgot Password' on the login page
2. Enter your email address
3. Check your email for a reset link
4. Click the link and create your new password

The link expires in 24 hours. If you don't receive it within 5
minutes, check your spam folder. Need any other help?"

Response B (Rejected): ✅ (Still correct!)
"Click 'Forgot Password', enter your email, and follow the link."
```

**Why A is better:**
- More helpful and complete
- Anticipates follow-up questions
- Warmer tone
- Provides context

**But B isn't wrong:**
- Factually accurate
- Answers the question
- Just less helpful

---

## ✅ When to Use Preference-Based Tuning

### Perfect for:
1. **Subjective quality improvements** (helpfulness, tone, depth)
2. **Style preferences** (verbose vs. concise, formal vs. casual)
3. **Contextual appropriateness** (when to be detailed vs. brief)
4. **Empathy and tone** (warm vs. professional)
5. **Refusal quality** (graceful vs. blunt)

### Signs preference tuning is right:
- ✅ Multiple "correct" answers exist
- ✅ Quality is subjective but rankable
- ✅ You can get humans to consistently pick the better response
- ✅ You want to optimize for user satisfaction
- ✅ SFT is too rigid (you don't want ONE exact pattern)

---

## ❌ When NOT to Use Preference-Based Tuning

### Bad for:
1. **Exact formatting requirements** → Use SFT
2. **Teaching new knowledge** → Use RAG
3. **Complex multi-objective optimization** → Use RLHF
4. **Binary correctness** (right vs. wrong) → Use SFT

### Signs preference tuning is wrong:
- ❌ You need exact output format
- ❌ There's only ONE correct answer
- ❌ Humans disagree on which response is better
- ❌ You're optimizing for many dimensions simultaneously

---

## 📊 Data Requirements

**Minimum viable:** 100-200 comparison pairs
- Good for: Testing if preferences help
- Quality: May see some improvement

**Recommended:** 500-1500 comparison pairs
- Good for: Production use
- Quality: Reliable improvements

**Maximum useful:** 5000-10,000 pairs
- Good for: Fine-grained quality
- Quality: Diminishing returns above this

**Critical requirement: Consistency**
- Humans must agree on rankings (>80% agreement)
- Clear criteria for "better"
- Same evaluators throughout dataset
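
The 80% agreement threshold can be checked before any training happens. A minimal sketch, assuming each annotator labels the same items with the name of the response they prefer: `pairwise_agreement` and the `labels` data are hypothetical. Raw agreement does not correct for chance, so a chance-corrected statistic such as Cohen's kappa may be a better fit for small or skewed datasets.

```python
from itertools import combinations


def pairwise_agreement(labels_by_annotator: dict[str, list[str]]) -> float:
    """Fraction of (annotator pair, item) comparisons where both chose the same response."""
    annotators = list(labels_by_annotator)
    if len(annotators) < 2:
        raise ValueError("Need at least two annotators")
    n_items = len(labels_by_annotator[annotators[0]])
    if n_items == 0 or any(len(v) != n_items for v in labels_by_annotator.values()):
        raise ValueError("Every annotator must label the same non-empty set of items")

    agree = 0
    total = 0
    for first, second in combinations(annotators, 2):
        for x, y in zip(labels_by_annotator[first], labels_by_annotator[second]):
            agree += x == y
            total += 1
    return agree / total


# Hypothetical labels: which response ("A" or "B") each annotator preferred per item.
labels = {
    "annotator_1": ["A", "A", "B", "A", "B"],
    "annotator_2": ["A", "A", "B", "B", "B"],
    "annotator_3": ["A", "B", "B", "A", "B"],
}
rate = pairwise_agreement(labels)
print(f"Agreement: {rate:.1%}")
print("OK to train" if rate > 0.80 else "Tighten the ranking criteria first")
```

If agreement falls below the threshold, the usual remedy is clearer written criteria for "better" rather than more data.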

---

## 🔬 DPO (Direct Preference Optimization)

**Quick Technical Note:**

Traditional preference learning used **RLHF** (complex, requires reward model).

**DPO** (Direct Preference Optimization) is a newer, simpler approach:
- Skip reward model training
- Directly optimize from preferences
- Faster and more stable
- What OpenAI likely uses under the hood

**You don't need to know the math, just know:**
- DPO is the modern way to do preference learning
- It's simpler and more reliable than old methods
- OpenAI abstracts this complexity away
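
For readers who want a peek at the math anyway, the DPO objective for a single preference pair is compact enough to write out. The sketch below is a plain-Python illustration of the loss from the DPO formulation, `-log(sigmoid(beta * ((policy - reference margin on the chosen response) - (same margin on the rejected response))))`. The function name `dpo_loss` and the log-probability values are made up for illustration; real implementations compute these log-probabilities from the model being trained and a frozen reference copy.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair, given summed token log-probabilities."""
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) rewritten as log(1 + exp(-logits)).
    return math.log1p(math.exp(-logits))


# Before training the policy equals the reference, so the margins cancel.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After some training the policy favors the chosen response more than the
# reference does, and the rejected response less.
later = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(f"loss at start: {start:.4f}")
print(f"loss later:    {later:.4f}")
```

The loss falls as the model shifts probability toward chosen responses relative to the reference, and `beta` controls how strongly that shift is rewarded, which is the same `beta` exposed as a hyperparameter by DPO training tools.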

---

Let's see preference-based tuning in action with three use cases! 🚀

---

## 🔧 How to Actually Fine-Tune with OpenAI (Preferences)

**Note:** OpenAI's fine-tuning API supports preference-based fine-tuning through Direct Preference Optimization (DPO). The set of supported base models changes over time, so check the current fine-tuning documentation before starting a job.

**Two practical options:**

1. **Option A: OpenAI's fine-tuning API (DPO method)**
   - Upload comparison pairs as a JSONL file
   - Each line holds the input messages plus one preferred and one non-preferred response
   - Training runs on OpenAI's infrastructure

2. **Option B: Use third-party tools**
   - Hugging Face TRL library (Transformer Reinforcement Learning)
   - Supports DPO (Direct Preference Optimization)
   - Requires running your own infrastructure

**Option A, step by step:**

In [None]:
# Prepare preference data in the JSONL format OpenAI expects for DPO.
# Each record has the input messages, a preferred output, and a non-preferred output.

import json

preference_data = [
    {
        "input": {
            "messages": [{"role": "user", "content": "How do I reset my password?"}]
        },
        "preferred_output": [
            {
                "role": "assistant",
                "content": "I'd be happy to help! Here's how: 1) Click 'Forgot Password'..."
            }
        ],
        "non_preferred_output": [
            {
                "role": "assistant",
                "content": "Click forgot password and follow the link."
            }
        ]
    },
    # ... more pairs
]

# Save as JSONL (one JSON object per line)
with open('preference_data.jsonl', 'w', encoding='utf-8') as f:
    for example in preference_data:
        f.write(json.dumps(example) + '\n')

print("✅ Preference data prepared")

In [None]:
# Illustrative only: this notebook does not run real fine-tuning jobs.

from openai import OpenAI

client = OpenAI(api_key="your-api-key")

# Upload the training file
with open("preference_data.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

# Create a DPO fine-tuning job
fine_tune_job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-2024-08-06",  # Must be a model that supports DPO; check the docs
    method={
        "type": "dpo",
        "dpo": {
            "hyperparameters": {
                "n_epochs": 3,
                "beta": 0.1  # Higher = stay closer to the base model
            }
        }
    }
)

print(f"Job created: {fine_tune_job.id}")

### 🔧 Practical Alternative: Using Hugging Face

For open-weight models, or when you need full control over training, use Hugging Face's TRL library:

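```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Load a small base model and its tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

# Prepare dataset with prompt/chosen/rejected columns
train_dataset = Dataset.from_list([
    {
        "prompt": "How do I reset my password?",
        "chosen": "I'd be happy to help! Click 'Forgot Password' on the login page...",
        "rejected": "Click forgot password and follow the link.",
    },
    # ... more pairs
])

# Train with DPO
# Note: argument names vary by TRL version. Recent releases take beta via
# DPOConfig and the tokenizer via processing_class; older ones used
# beta=... and tokenizer=... directly on DPOTrainer.
training_args = DPOConfig(
    output_dir="dpo-gpt2",
    beta=0.1,  # Strength of the penalty for drifting from the reference model
)
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)

trainer.train()
```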

**Trade-offs:**
- ✅ Full control over preference training
- ✅ Use any base model
- ❌ Requires your own infrastructure
- ❌ More complex than OpenAI API
- ❌ Need ML engineering expertise

---

## 💬 Preference Use Case #1: Conciseness vs. Verbosity

**Scenario:**
- Company: Technical support for software developers
- Problem: Responses are either too brief (unhelpful) or too verbose (time-wasting)
- Goal: Teach model to be concise yet complete

**Why Preference Tuning?**
- ✅ Both responses are technically correct
- ✅ "Better" is subjective (depends on context)
- ✅ Can't use SFT (no single "correct" format)
- ✅ Humans can consistently rank responses

**What we're teaching:**
- When to be detailed vs. brief
- How to anticipate needs without over-explaining
- Balance between completeness and efficiency
- Professional yet approachable tone

In [None]:
# Mock Dataset: Preference-Based Fine-Tuning (Conciseness)
# Format: {prompt, chosen, rejected}

preference_conciseness_data = [
    {
        "prompt": "How do I check the version of Python installed on my system?",
        "chosen": "You can check your Python version by running:\n\n```bash\npython --version\n```\n\nor\n\n```bash\npython3 --version\n```\n\nThis will display something like \"Python 3.11.2\". If you need programmatic access, use `sys.version` in Python code.",
        "rejected": "To check the version of Python installed on your system, you have multiple options available to you. The most common and straightforward method is to use the command line or terminal. You can open your command prompt (on Windows) or terminal (on macOS or Linux) and type the command \"python --version\" or \"python3 --version\". This will return the version number of Python that is currently installed. For example, you might see something like \"Python 3.11.2\" or \"Python 3.9.7\". Additionally, if you're already inside a Python interpreter or script, you can import the sys module and print sys.version to get detailed version information. Another way is to use python -V (with a capital V) which is a shorthand for --version. It's worth noting that on some systems, especially Unix-based ones, you might have both Python 2 and Python 3 installed, so you might need to specify python3 instead of just python. The version number is important because different Python versions have different features and syntax, so knowing which version you're using helps ensure your code is compatible."
    },
    {
        "prompt": "What's the difference between git merge and git rebase?",
        "chosen": "**Git merge:**\n- Creates a new commit that combines two branches\n- Preserves complete history (all commits from both branches)\n- Results in non-linear history with merge commits\n- Safer for public/shared branches\n\n**Git rebase:**\n- Moves your commits to the tip of another branch\n- Creates a linear history (looks cleaner)\n- Rewrites commit history (changes commit SHAs)\n- Best for cleaning up local branches before merging\n\n**Rule of thumb:** Use merge for shared branches, rebase for local cleanup.",
        "rejected": "Great question! Git merge and git rebase are both used to integrate changes from one branch into another, but they work quite differently.\n\nGit merge takes all the changes from one branch and merges them into another branch using a special merge commit. This creates a new commit that has two parent commits, preserving the complete history of both branches. The advantage is that it's non-destructive and maintains a complete record of what happened. The disadvantage is that if you have many merges, your history can become cluttered with merge commits.\n\nGit rebase, on the other hand, takes your commits and \"replays\" them on top of another branch. Instead of creating a merge commit, it actually rewrites the commit history by creating new commits with new SHA hashes. This results in a linear history that looks much cleaner. However, because it rewrites history, it can be dangerous if used on branches that other people are working on.\n\nSo when should you use each? Generally speaking, you should use merge when you're working on a shared branch or when you want to preserve the complete history. Use rebase when you're cleaning up your local branch before pushing it, or when you want a cleaner, more linear history.\n\nThere are also other considerations like interactive rebasing, which lets you squash commits, reorder them, or edit them, which can be very powerful for cleaning up your commit history before sharing it with others."
    },
    {
        "prompt": "How do I exit vim?",
        "chosen": "To exit vim:\n\n- `:q` - Quit (if no changes)\n- `:q!` - Quit without saving changes\n- `:wq` or `:x` - Save and quit\n- `ZZ` - Save and quit (shortcut)\n\nPress `Esc` first to ensure you're in normal mode, then type one of the above.",
        "rejected": "To exit vim, you need to first understand that vim operates in different modes. By default, when you open a file, you're in \"normal mode\". To execute commands, you need to type a colon (:) which puts you into \"command mode\". Once you're in command mode, you can type various commands. To quit vim, you would type :q and press Enter. However, if you've made changes to the file, vim won't let you quit without either saving or explicitly discarding those changes. If you want to quit without saving, you need to use :q! (the exclamation mark forces the quit). If you want to save your changes and then quit, you can use :wq (write and quit) or alternatively :x which does the same thing. There's also a keyboard shortcut ZZ (capital Z twice) that saves and quits. Make sure you press Escape first to ensure you're in normal mode before typing any of these commands, otherwise vim might just insert those characters into your file."
    },
    {
        "prompt": "What does the 'async' keyword do in JavaScript?",
        "chosen": "The `async` keyword makes a function return a Promise. It allows you to:\n\n1. Use `await` inside the function to pause execution until a Promise resolves\n2. Write asynchronous code that looks synchronous\n\n```javascript\nasync function getData() {\n  const response = await fetch('/api/data');\n  return response.json();\n}\n```\n\nAny value you `return` gets automatically wrapped in `Promise.resolve()`.",
        "rejected": "`async` is a very important keyword in JavaScript that was introduced in ES2017 (ES8) as part of the async/await syntax. When you put the async keyword in front of a function declaration, it does several things. First and most importantly, it makes the function automatically return a Promise. Even if your function returns a simple value like a number or string, JavaScript will automatically wrap that value in a resolved Promise. This is really useful because it means you can use the function with Promise chains or with await in other async functions. The second main thing that async does is it allows you to use the await keyword inside that function. The await keyword can only be used inside async functions (or at the top level of modules in modern JavaScript). When you await a Promise, JavaScript pauses the execution of that async function until the Promise resolves, but importantly, it doesn't block the entire program - other code can still run. This makes async/await a much more readable way to handle asynchronous operations compared to traditional Promise chains or callbacks. It's worth noting that async functions always return Promises, so if your async function throws an error, that error gets wrapped in a rejected Promise automatically."
    },
    {
        "prompt": "How do I center a div in CSS?",
        "chosen": "**Horizontal centering:**\n```css\n.div {\n  margin: 0 auto;\n  width: 500px; /* must have width */\n}\n```\n\n**Vertical and horizontal (flexbox):**\n```css\n.parent {\n  display: flex;\n  justify-content: center;\n  align-items: center;\n  height: 100vh;\n}\n```\n\n**Vertical and horizontal (grid):**\n```css\n.parent {\n  display: grid;\n  place-items: center;\n  height: 100vh;\n}\n```",
        "rejected": "Centering a div in CSS is one of the most common questions in web development, and there are actually many different ways to do it depending on your specific needs and browser support requirements. Let me walk you through the main approaches.\n\nFor horizontal centering only, the classic method is to set the div's left and right margins to 'auto' and give it a specific width. This works because auto margins will split the available space equally on both sides. However, this doesn't work for vertical centering.\n\nFor both horizontal and vertical centering, there are several modern approaches. The most popular these days is using Flexbox. You set the parent element to display: flex, then use justify-content: center for horizontal centering and align-items: center for vertical centering. This is very flexible and works great.\n\nAnother modern approach is CSS Grid, where you can use display: grid on the parent and then place-items: center, which centers in both directions at once. This is even more concise than Flexbox.\n\nThere's also the old position: absolute method where you position the div absolutely, set top: 50% and left: 50%, then use transform: translate(-50%, -50%) to adjust for the div's own dimensions. This works but is generally considered more of a hack.\n\nThe method you choose depends on your browser support requirements and the context of what you're building."
    }
]

# Display examples
print("\n" + "="*80)
print("💬 PREFERENCE USE CASE #1: Conciseness vs. Verbosity")
print("="*80 + "\n")

for idx, example in enumerate(preference_conciseness_data, 1):
    print(f"\n{'─'*80}")
    print(f"📝 Example {idx}")
    print(f"{'─'*80}\n")

    print(f"❓ PROMPT:\n{example['prompt']}\n")

    print(f"✅ CHOSEN RESPONSE (Better):\n{example['chosen']}\n")

    print(f"❌ REJECTED RESPONSE (Worse):\n{example['rejected'][:300]}...\n")
    print("   [Truncated for display - actual response is much longer]\n")

print("\n" + "="*80)
print("🎯 WHAT THE MODEL LEARNS:")
print("="*80)
print("""\n
1. **Efficient Communication:**
   - Get to the point quickly
   - Include essential context, skip unnecessary elaboration
   - Use examples/code snippets effectively

2. **Structure:**
   - Bullet points for clarity
   - Code blocks for technical content
   - Clear section headers when needed

3. **Completeness:**
   - Answer the question fully
   - Anticipate immediate follow-ups
   - Provide actionable information

4. **What to Avoid:**
   - Lengthy explanations of obvious concepts
   - Excessive historical context
   - Redundant phrasing
   - Over-explaining edge cases upfront

✨ KEY INSIGHT: Both responses are technically correct and helpful.
   But the chosen response respects the reader's time while still
   being complete and useful. This is subjective quality that
   SFT can't teach effectively.
""")
print("="*80)
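
Before training on pairs like these, it can help to run a quick sanity check. The sketch below assumes the `{prompt, chosen, rejected}` record shape used in this section. The helper names and sample records are hypothetical and exist only for illustration.

```python
def validate_preference_pairs(records):
    """Split records into usable pairs and a list of (index, problem) tuples."""
    required = ("prompt", "chosen", "rejected")
    valid, problems = [], []
    for idx, record in enumerate(records):
        missing = [key for key in required if not str(record.get(key) or "").strip()]
        if missing:
            problems.append((idx, "missing or empty: " + ", ".join(missing)))
        elif record["chosen"].strip() == record["rejected"].strip():
            problems.append((idx, "chosen and rejected are identical"))
        else:
            valid.append(record)
    return valid, problems


def share_chosen_shorter(records):
    """Fraction of pairs whose chosen response is shorter than the rejected one."""
    if not records:
        return 0.0
    shorter = sum(len(r["chosen"]) < len(r["rejected"]) for r in records)
    return shorter / len(records)


sample_pairs = [
    {
        "prompt": "How do I exit vim?",
        "chosen": "Press Esc, then type :q! and press Enter.",
        "rejected": "Vim is a modal editor, so before quitting it helps to "
                    "understand each of its modes in some detail...",
    },
    {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "4"},
    {"prompt": "", "chosen": "Yes.", "rejected": "No."},
]

valid_pairs, problems = validate_preference_pairs(sample_pairs)
print(f"{len(valid_pairs)} valid, {len(problems)} rejected")
for idx, reason in problems:
    print(f"  record {idx}: {reason}")
print(f"chosen is shorter in {share_chosen_shorter(valid_pairs):.0%} of valid pairs")
```

If the chosen response is shorter in nearly every pair, the model may pick up "shorter wins" instead of "concise yet complete". Including some pairs where the longer answer is preferred is one way to guard against that.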

## 🤝 Preference Use Case #2: Helpfulness & Empathy

**Scenario:**
- Company: Mental health support chatbot
- Problem: Responses are either too clinical (cold) or too casual (unprofessional)
- Goal: Balance empathy with professionalism

**Why Preference Tuning?**
- ✅ Both responses can be factually correct
- ✅ "Better" depends on emotional intelligence
- ✅ Humans can judge which feels more supportive
- ✅ Can't reduce to a template (needs flexibility)

**What we're teaching:**
- Empathetic acknowledgment
- Professional boundaries
- When to be warm vs. when to be directive
- Avoiding dismissiveness

In [None]:
# Mock Dataset: Preference-Based Fine-Tuning (Helpfulness & Empathy)

preference_empathy_data = [
    {
        "prompt": "I'm feeling really overwhelmed with work lately and I don't know what to do.",
        "chosen": "I'm sorry to hear you're feeling overwhelmed - that's a really difficult place to be. It's important to recognize that this feeling is valid and that you're not alone in experiencing it.\n\nHere are some steps that might help:\n\n1. **Prioritize**: Make a list of tasks and identify what's truly urgent vs. what can wait\n2. **Break it down**: Large projects feel less daunting when broken into smaller steps\n3. **Set boundaries**: It's okay to say no or to ask for help\n4. **Take breaks**: Even 5-10 minutes away from work can help reset your mind\n\nIf these feelings persist or worsen, please consider reaching out to a mental health professional who can provide personalized support. Would you like to talk more about any specific challenges you're facing?",
        "rejected": "Feeling overwhelmed with work is common. You should make a to-do list and prioritize your tasks. Break large projects into smaller tasks. Take regular breaks. Consider talking to your manager about workload. If stress continues, see a therapist."
    },
    {
        "prompt": "I'm having trouble sleeping because I keep thinking about my mistakes at work.",
        "chosen": "It sounds like you're experiencing a lot of worry about work, and that's affecting your sleep - that must be exhausting. It's actually very common to replay difficult moments, even though it doesn't feel good.\n\nA few things that might help:\n\n**For immediate relief:**\n- Try the 4-7-8 breathing technique before bed (inhale 4 seconds, hold 7, exhale 8)\n- Keep a notepad by your bed - if worries come up, write them down to address tomorrow\n- Consider a brief \"worry period\" earlier in the evening to process these thoughts\n\n**Longer-term perspective:**\n- Remember that everyone makes mistakes - they're opportunities to learn\n- Ask yourself: \"Will this matter in 5 years?\" for perspective\n- If specific mistakes are fixable, make a plan to address them tomorrow\n\nIf sleep problems continue for more than a few weeks, it might be worth discussing with a doctor. You deserve rest. Is there a particular mistake that's weighing on you?",
        "rejected": "Stop thinking about work mistakes before bed. Use breathing exercises. Write down your worries. Remember everyone makes mistakes. See a doctor if sleep problems persist."
    },
    {
        "prompt": "I don't think therapy is working for me. Should I quit?",
        "chosen": "Thank you for sharing that - it takes courage to acknowledge when something isn't feeling right. Before making a decision, let's think through this together.\n\n**Questions to consider:**\n\n1. **How long have you been in therapy?** Sometimes it takes 4-6 sessions to build rapport and see progress\n2. **Have you discussed this with your therapist?** They may have insights or be willing to adjust their approach\n3. **Is it the therapist or the process?** Sometimes the fit isn't right, and that's okay - finding another therapist might help\n4. **What were you hoping to achieve?** Comparing expectations to reality can clarify if progress is happening\n\n**It's completely valid to:**\n- Switch therapists if the connection isn't there\n- Take a break and return later\n- Try a different therapy approach (CBT, DBT, EMDR, etc.)\n\nTherapy should feel like it's helping more than it's frustrating. What specifically feels like it's not working?",
        "rejected": "Therapy takes time to work. You should talk to your therapist about your concerns. If you still want to quit after discussing it, you can try a different therapist or take a break. Don't give up entirely though."
    }
]

print("\n" + "="*80)
print("🤝 PREFERENCE USE CASE #2: Helpfulness & Empathy")
print("="*80 + "\n")

for idx, example in enumerate(preference_empathy_data, 1):
    print(f"\n{'─'*80}")
    print(f"📝 Example {idx}")
    print(f"{'─'*80}\n")
    print(f"❓ PROMPT:\n{example['prompt']}\n")
    print(f"✅ CHOSEN (Better - Empathetic & Complete):\n{example['chosen']}\n")
    print(f"❌ REJECTED (Worse - Terse & Clinical):\n{example['rejected']}\n")

print("\n" + "="*80)
print("🎯 WHAT THE MODEL LEARNS:")
print("="*80)
print("""\n
1. **Empathetic Acknowledgment:**
   - Validate feelings first
   - Use phrases like "I'm sorry to hear" and "that must be difficult"
   - Show understanding before problem-solving

2. **Structured Support:**
   - Provide actionable steps
   - Organize information clearly
   - Balance immediate help with long-term perspective

3. **Professional Boundaries:**
   - Suggest professional help when appropriate
   - Don't diagnose or prescribe
   - Empower user to make their own decisions

4. **Follow-up Engagement:**
   - Ask open-ended questions
   - Show genuine interest
   - Create space for further discussion

✨ KEY INSIGHT: The rejected responses aren't wrong - they provide correct
   information. But they lack warmth and fail to make the user feel heard.
   This is subjective quality that requires preference-based learning.
""")
print("="*80)

## 🚫 Preference Use Case #3: Refusal Quality

**Scenario:**
- Company: AI assistant for financial advice
- Problem: Model sometimes refuses correctly but rudely/unhelpfully
- Goal: Teach graceful, helpful refusals

**Why Preference Tuning?**
- ✅ Both responses correctly refuse
- ✅ Quality is about tone and helpfulness
- ✅ SFT would enforce one exact refusal format (too rigid)
- ✅ Context determines best refusal style

**What we're teaching:**
- Polite but firm boundaries
- Offering alternatives
- Explaining why (when appropriate)
- Maintaining positive relationship

In [None]:
# Mock Dataset: Preference-Based Fine-Tuning (Refusal Quality)

preference_refusal_data = [
    {
        "prompt": "Can you review my specific stock portfolio and tell me what to buy/sell?",
        "chosen": "I appreciate you thinking of me for this, but I'm not able to provide personalized investment advice or specific buy/sell recommendations. Here's why:\n\n**What I can't do:**\n- Analyze your specific portfolio\n- Recommend individual stocks to buy or sell\n- Provide financial advice tailored to your situation\n\n**What I can help with:**\n- Explain general investment concepts (diversification, risk tolerance, etc.)\n- Provide educational information about different investment types\n- Help you understand financial terms and strategies\n- Suggest questions to ask a financial advisor\n\n**Better alternative:**\nConsider consulting with a licensed financial advisor who can review your complete financial picture, understand your goals, and provide personalized recommendations. Would you like help understanding any general investment concepts instead?",
        "rejected": "I cannot provide personalized investment advice or recommend specific stocks to buy or sell. You should consult with a licensed financial advisor for that."
    },
    {
        "prompt": "Can you write me a college essay and make it sound like I wrote it?",
        "chosen": "I can't write your college essay for you, but I'd love to help you write a great one yourself! Here's why and how I can assist:\n\n**Why I can't write it for you:**\n- It would be academic dishonesty\n- Admissions officers want to hear YOUR voice\n- You'd miss the valuable reflection process\n\n**How I CAN help:**\n- Brainstorm ideas and experiences to write about\n- Review your draft and suggest improvements\n- Help with structure and organization\n- Explain effective writing techniques\n- Provide examples of strong essay elements\n\n**Want to get started?**\n- Tell me about an experience that shaped you\n- Share what makes you unique\n- Discuss what you want colleges to know about you\n\nI can help you find and express your authentic story. What topics are you considering?",
        "rejected": "I cannot write your college essay for you as that would be cheating. You need to write it yourself. I can help you brainstorm or review a draft."
    },
    {
        "prompt": "Help me craft a message to get back together with my ex who blocked me.",
        "chosen": "I understand this must be a painful situation, and I want to help in a way that's truly in your best interest. However, I need to respectfully decline helping you contact someone who has blocked you. Here's why:\n\n**Why reaching out might not be helpful:**\n- Blocking is a clear boundary they've set\n- Attempting contact could be perceived as harassment\n- It's unlikely to achieve the outcome you want\n- It could cause legal issues or restraining orders\n\n**What might help instead:**\n- Give them (and yourself) space and time\n- Focus on your own healing and growth\n- Talk to a therapist about processing these feelings\n- Reconnect with friends and hobbies\n- If they want to reconnect later, they'll reach out\n\n**I can help with:**\n- Coping strategies for breakups\n- Understanding healthy relationships\n- Building self-esteem after a relationship ends\n\nI know this isn't the answer you wanted, but respecting boundaries - even when it hurts - is crucial. Would you like to talk about healing instead?",
        "rejected": "I cannot help you contact someone who has blocked you. That would not be respecting their boundaries. You should move on and respect their decision."
    }
]

print("\n" + "="*80)
print("🚫 PREFERENCE USE CASE #3: Refusal Quality")
print("="*80 + "\n")

for idx, example in enumerate(preference_refusal_data, 1):
    print(f"\n{'─'*80}")
    print(f"📝 Example {idx}")
    print(f"{'─'*80}\n")
    print(f"❓ PROMPT:\n{example['prompt']}\n")
    print(f"✅ CHOSEN (Better - Graceful & Helpful):\n{example['chosen']}\n")
    print(f"❌ REJECTED (Worse - Blunt & Unhelpful):\n{example['rejected']}\n")

print("\n" + "="*80)
print("🎯 WHAT THE MODEL LEARNS:")
print("="*80)
print("""\n
1. **Graceful Refusal Pattern:**
   - Acknowledge the request positively
   - Clearly state what you CAN'T do
   - Explain WHY (builds trust)
   - Offer what you CAN do
   - Maintain warm, helpful tone

2. **Structure:**
   - "I understand... but I can't because..."
   - Bullet points for clarity
   - Alternative offerings
   - Re-engagement question

3. **Tone Balance:**
   - Firm boundaries (not apologetic or wishy-washy)
   - Empathetic (acknowledge disappointment)
   - Constructive (redirect to helpful alternatives)
   - Professional (not preachy)

4. **Context-Appropriate:**
   - Academic dishonesty: Focus on learning value
   - Boundary violations: Focus on respect
   - Regulated advice: Focus on safety/legality

✨ KEY INSIGHT: Both responses correctly refuse the request. But the
   chosen response turns a "no" into a "yes, but differently," maintaining
   the relationship and providing value. This nuanced quality judgment is
   what preference-based tuning excels at.
""")
print("="*80)

## 📊 Preference-Based Tuning Summary

### ✅ Strengths

1. **Handles Subjectivity**: Suited to tasks where a response is "better" rather than "correct"
2. **More Natural**: Doesn't enforce rigid patterns
3. **User Satisfaction**: Directly optimizes for what humans prefer
4. **Flexible**: Model adapts appropriately to context
5. **Less Brittle**: More robust than SFT to variations

### ❌ Limitations

1. **Needs Consistency**: Humans must agree on rankings
2. **More Data**: Requires 2x examples (chosen + rejected for each)
3. **Harder to Collect**: Each prompt needs at least two candidate responses plus a consistent human judgment between them
4. **Single Dimension**: Optimizes one quality axis at a time
5. **Can't Handle Multi-Objective**: If "better" means "more A AND more B," use RLHF

---

### 💡 Data Collection Tips

#### 1. Ensure Inter-Rater Agreement
```
✅ 3+ people rank the same pairs
✅ Track agreement rate (should be >80%)
❌ If agreement <70%, criteria aren't clear enough
```
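
The tip above does not say how the agreement rate is computed. One simple option, sketched below, is the fraction of rater pairs that picked the same response, averaged over items. The function name and sample votes are hypothetical. Chance-corrected statistics such as Fleiss' kappa are stricter alternatives.

```python
from collections import Counter


def pairwise_agreement(votes_per_item):
    """Average fraction of rater pairs that agree on each item.

    votes_per_item: list of lists, one label ("A" or "B") per rater.
    """
    rates = []
    for votes in votes_per_item:
        n = len(votes)
        if n < 2:
            continue  # Agreement is undefined with a single rater
        counts = Counter(votes)
        agreeing_pairs = sum(c * (c - 1) // 2 for c in counts.values())
        total_pairs = n * (n - 1) // 2
        rates.append(agreeing_pairs / total_pairs)
    return sum(rates) / len(rates) if rates else 0.0


# Three raters judged four response pairs
votes = [
    ["A", "A", "A"],  # unanimous
    ["A", "A", "B"],  # one dissent
    ["B", "B", "B"],  # unanimous
    ["A", "B", "B"],  # one dissent
]
rate = pairwise_agreement(votes)
print(f"Agreement rate: {rate:.0%}")
if rate < 0.70:
    print("Below 70%: tighten the ranking criteria before collecting more data")
```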

#### 2. Generate Rejection Candidates
```
Approach A: Take real model outputs, rank them
Approach B: Deliberately create "worse" versions
Approach C: Use different models (GPT-4 = chosen, GPT-3.5 = rejected)
```
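
Approach A can be sketched as turning one ranked list of model outputs into several training pairs. `pairs_from_ranking` and its `min_rank_gap` parameter are hypothetical names used only for this illustration.

```python
from itertools import combinations


def pairs_from_ranking(prompt, ranked_responses, min_rank_gap=1):
    """Build chosen/rejected pairs from responses ordered best-first.

    min_rank_gap skips neighbors in the ranking, which tend to be close calls.
    """
    pairs = []
    indexed = list(enumerate(ranked_responses))
    for (i, better), (j, worse) in combinations(indexed, 2):
        if j - i >= min_rank_gap:
            pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs


ranked = [
    "Run `python --version`.",                       # rank 1 (best)
    "Open a terminal and type python --version.",    # rank 2
    "Python versions matter for many reasons...",    # rank 3 (worst)
]
all_pairs = pairs_from_ranking("How do I check my Python version?", ranked)
clear_pairs = pairs_from_ranking("How do I check my Python version?", ranked, min_rank_gap=2)
print(len(all_pairs), "pairs in total;", len(clear_pairs), "with a rank gap of at least 2")
```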

#### 3. Clear Ranking Criteria
```
Document exactly what makes a response "better":
- More concise?
- More empathetic?
- Better structured?
- More actionable?

Raters need these guidelines!
```

#### 4. Avoid Close Calls
```
❌ When responses are similarly good, preferences are noise
✅ Choose pairs with clear quality differences
```
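
One way to act on this tip is to drop pairs where raters were nearly split. The sketch below assumes each pair carries a list of "A"/"B" votes. The record fields, function name, and the `min_margin` default are assumptions made for illustration.

```python
def filter_clear_preferences(rated_pairs, min_margin=0.5):
    """Keep pairs whose vote margin is at least min_margin.

    Margin is |share_A - share_B|: 1.0 is unanimous, 0.0 is an even split.
    """
    kept = []
    for pair in rated_pairs:
        votes = pair["votes"]
        share_a = votes.count("A") / len(votes)
        margin = abs(2 * share_a - 1)
        if margin < min_margin:
            continue  # Raters were close to split: treat the label as noise
        a_wins = share_a > 0.5
        kept.append({
            "prompt": pair["prompt"],
            "chosen": pair["response_a"] if a_wins else pair["response_b"],
            "rejected": pair["response_b"] if a_wins else pair["response_a"],
        })
    return kept


rated = [
    {"prompt": "p1", "response_a": "a1", "response_b": "b1", "votes": ["A", "A", "A", "A"]},
    {"prompt": "p2", "response_a": "a2", "response_b": "b2", "votes": ["A", "B", "B", "B"]},
    {"prompt": "p3", "response_a": "a3", "response_b": "b3", "votes": ["A", "A", "B"]},
]
clear = filter_clear_preferences(rated)
print(f"Kept {len(clear)} of {len(rated)} pairs")
```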

#### 5. Representative Distribution
```
Include:
- Different prompt types
- Various difficulty levels
- Edge cases
- Common scenarios
```

---

### 🎯 When to Use Preference-Based Tuning

**Use when you answer YES to:**

1. ✅ Are multiple correct answers possible?
2. ✅ Is quality subjective but rankable by humans?
3. ✅ Do you want flexibility, not rigid formatting?
4. ✅ Can humans consistently agree on which is better?
5. ✅ Is it one-dimensional quality (helpfulness OR conciseness, not both simultaneously)?

**Perfect use cases:**
- Chatbot helpfulness ✅
- Response tone/empathy ✅
- Conciseness vs. completeness ✅
- Refusal quality ✅
- Writing style preferences ✅

---

Next: **RLHF** for when you need to optimize multiple quality dimensions simultaneously! 🎯

---

# Part 5: RLHF (Reinforcement Learning from Human Feedback)

## 📖 What is RLHF?

**RLHF** is the most sophisticated fine-tuning approach, used to train models like ChatGPT and Claude.

**Core Concept:**
> "Optimize for multiple quality dimensions simultaneously using scored feedback."

**Key Differences:**
- **SFT**: "Here's the correct answer"
- **Preferences**: "Response A is better than Response B"
- **RLHF**: "Response gets 8/10 on helpfulness, 6/10 on safety, 9/10 on accuracy"
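
As a rough illustration, the three kinds of feedback can be pictured as differently shaped training records. The field names below follow this tutorial's mock datasets and are assumptions, not a schema required by any particular API.

```python
# Illustrative record shapes only; field names are this tutorial's conventions.

# SFT: one demonstration of the desired answer
sft_record = {
    "prompt": "How do I reset my password?",
    "response": "Click 'Forgot password' and follow the emailed link.",
}

# Preferences: two candidate answers and which one won
preference_record = {
    "prompt": "How do I reset my password?",
    "chosen": "Click 'Forgot password' and follow the emailed link.",
    "rejected": "Passwords can be reset.",
}

# RLHF-style: one answer scored on several dimensions
scored_record = {
    "prompt": "How do I reset my password?",
    "response": "Click 'Forgot password' and follow the emailed link.",
    "scores": {"helpfulness": 8, "safety": 6, "accuracy": 9},
}

for name, record in [("SFT", sft_record),
                     ("Preference", preference_record),
                     ("Scored", scored_record)]:
    print(f"{name}: fields = {sorted(record)}")
```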

---

## 🔬 Technical Overview

**Traditional RLHF has 3 stages:**

### Stage 1: Supervised Fine-Tuning (SFT)
- Start with base model
- Fine-tune on high-quality demonstrations
- Creates the "initial policy"

### Stage 2: Reward Model Training
- Collect human rankings of outputs
- Train a separate "reward model" to predict human preferences
- This model scores outputs (e.g., 0-10)

### Stage 3: RL Optimization (PPO)
- Use the reward model to guide the main model
- Model generates outputs → Reward model scores them → Model learns
- Uses PPO (Proximal Policy Optimization) algorithm
- Balances: maximize reward while staying close to original model
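
Stages 2 and 3 are commonly written as two objectives: a pairwise loss for the reward model, then a KL-regularized reward objective for the policy. The symbols are introduced here for illustration and are not used elsewhere in this tutorial: $r_\phi$ is the reward model, $\pi_\theta$ the model being trained, $\pi_{\text{SFT}}$ the Stage 1 model, $y_w$ and $y_l$ the preferred and rejected responses to prompt $x$, $\sigma$ the logistic function, and $\beta$ the KL penalty weight.

$$
\mathcal{L}_{\text{RM}}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]
$$

$$
\max_{\theta}\;\mathbb{E}_{x}\Big[\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\text{SFT}}(\cdot \mid x)\big)\Big]
$$

The KL term is the "staying close to original model" constraint from the last bullet: a larger $\beta$ keeps outputs nearer the SFT model, at the cost of a lower achievable reward.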

**OpenAI's Approach (Simplified):**
- You likely won't implement full RLHF yourself
- OpenAI may abstract this to: "Provide scores for multiple dimensions"
- The complexity is handled behind the scenes

---

## ✅ When to Use RLHF

### Perfect for:
1. **Multi-dimensional quality** (accuracy + safety + helpfulness all matter)
2. **Complex alignment** (human values, safety, ethical behavior)
3. **Nuanced trade-offs** (sometimes verbose is better, sometimes concise)
4. **Continuous quality scales** (not just "better/worse" but "how much better?")
5. **Flagship products** (ChatGPT, Claude - where quality is critical)

### Signs RLHF is right:
- ✅ Quality has multiple important dimensions
- ✅ Trade-offs between dimensions are contextual
- ✅ Simple preferences can't capture the nuance
- ✅ You have resources for complex data collection
- ✅ You're building a high-value, high-impact product

---

## ❌ When NOT to Use RLHF

### Overkill for:
1. **Single quality dimension** → Use preference tuning
2. **Exact formatting** → Use SFT
3. **Quick prototypes** → Too complex and expensive
4. **Simple patterns** → Simpler methods work fine

### Signs RLHF is wrong:
- ❌ Your problem is straightforward
- ❌ You don't have significant resources
- ❌ Simpler methods haven't been tried yet
- ❌ You can't collect multi-dimensional scores

---

## 📊 Data Requirements

**Minimum viable:** 500-1000 scored examples
- Each example needs scores on 3-5 dimensions
- Multiple raters per example for reliability

**Recommended:** 5,000-20,000 examples
- Production quality alignment
- Robust across edge cases

**Reality check:**
- This is expensive and time-consuming
- Typically reserved for flagship products
- Most companies should try SFT or preferences first
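
Because each example is scored by multiple raters, the scores have to be combined before training. A minimal sketch, assuming each rater supplies one score per dimension: average the scores and flag dimensions where raters disagree widely. The function name and the `max_spread` threshold are hypothetical.

```python
from statistics import mean


def aggregate_ratings(ratings, max_spread=2):
    """Average per-dimension scores and list dimensions with wide disagreement.

    ratings: list of dicts (one per rater) mapping dimension -> score.
    """
    aggregated, disputed = {}, []
    for dimension in ratings[0]:
        values = [rating[dimension] for rating in ratings]
        aggregated[dimension] = mean(values)
        if max(values) - min(values) > max_spread:
            disputed.append(dimension)  # Send back for re-rating or discussion
    return aggregated, disputed


rater_scores = [
    {"helpfulness": 8, "safety": 6, "accuracy": 9},
    {"helpfulness": 9, "safety": 9, "accuracy": 9},
    {"helpfulness": 8, "safety": 5, "accuracy": 9},
]
averages, disputed = aggregate_ratings(rater_scores)
print({dim: round(score, 2) for dim, score in averages.items()})
print("Needs review:", disputed)
```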

---

Let's see RLHF-style data with three complex use cases! 🚀

---

## 🔧 How to Actually Fine-Tune with RLHF

**Reality Check:** True RLHF is not available via OpenAI's fine-tuning API.

**Why?**
- RLHF is extremely complex (3-stage process)
- Requires significant compute infrastructure
- Needs thousands of examples with multi-dimensional scores
- Reserved for flagship products (ChatGPT, Claude)

**Your Options:**

### Option 1: Use OpenAI's Pre-Aligned Models
- GPT-4, GPT-4o, GPT-5-nano are already RLHF-tuned
- These models already have excellent alignment
- For most use cases, prompt engineering + RAG is sufficient

### Option 2: Build Your Own RLHF Pipeline
If you have significant resources and expertise:

In [None]:
# CONCEPTUAL: RLHF training pipeline (high-level)
# This is NOT runnable code - just illustrative

# Stage 1: Supervised Fine-Tuning (SFT)
base_model = load_model("gpt2-large")
sft_model = supervised_fine_tune(
    model=base_model,
    dataset=high_quality_demonstrations,
    epochs=3
)

# Stage 2: Reward Model Training
# Collect human preferences with multi-dimensional scores
reward_model = train_reward_model(
    preference_data=scored_comparisons,  # Thousands of examples
    dimensions=["helpfulness", "safety", "accuracy"],
    model_size="large"
)

# Stage 3: RL Optimization (PPO)
final_model = ppo_training(
    policy_model=sft_model,
    reward_model=reward_model,
    prompts=training_prompts,
    iterations=10000,
    kl_penalty=0.1  # Stay close to original model
)

print("RLHF training complete (in theory!)")

### 🏢 Enterprise Solutions

If you need RLHF-level quality:

**1. OpenAI Custom Models Program**
- Contact OpenAI for custom model training
- They handle RLHF complexity
- Minimum budget: $100K+
- Turnaround: Months

**2. Anthropic Custom Training**
- Similar to OpenAI's program
- Uses Constitutional AI (which adds AI-generated feedback, often called RLAIF, to the RLHF recipe)
- Enterprise-scale budgets

**3. Build In-House**
- Use Hugging Face TRL + Accelerate
- Requires: ML team, GPU cluster, months of work
- Budget: $50K-500K+

### 💡 Practical Advice

**For 99% of use cases:**
```python
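# Instead of RLHF, work through these steps in order.
# The two helpers below are placeholders for your own retrieval and tool code.

def retrieve_relevant_docs(query):
    return ["<document text returned by your retriever>"]

def use_calculator_tool(expression):
    return "<result returned by your calculator tool>"

query = "What is the refund policy?"
math_problem = "1499 * 0.85"

# 1. Start with prompt engineering
prompt = """You are a helpful assistant. Be accurate, safe, and empathetic."""

# 2. Add RAG for knowledge
context = retrieve_relevant_docs(query)

# 3. Use tool calling for accuracy
result = use_calculator_tool(math_problem)

# 4. Fine-tune for format/style (SFT)
# Only if the above isn't sufficient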
```

**When you actually need RLHF:**
- Building a flagship AI product (ChatGPT competitor)
- Safety-critical application (medical diagnosis AI)
- Multi-million user base
- Budget: $100K+ for training

**For everyone else:**
- Use GPT-4 (already RLHF-tuned) ✅
- Focus on prompt engineering + RAG ✅
- Consider SFT for format/style ✅
- Save the RLHF complexity for later ✅

---

## 🏥 RLHF Use Case #1: Medical Information (Multi-Dimensional Quality)

**Scenario:**
- Company: Health information chatbot
- Problem: Need to balance accuracy, safety, empathy, and actionability
- Goal: Provide helpful health information that's safe and appropriate

**Why RLHF?**
- ✅ Multiple critical dimensions (can't optimize just one)
- ✅ Trade-offs are contextual (sometimes empathy > brevity, sometimes not)
- ✅ Safety is paramount but so is helpfulness
- ✅ Preferences alone can't capture all dimensions

**Quality Dimensions:**
1. **Medical Accuracy** (0-10): Factual correctness
2. **Safety** (0-10): Appropriate disclaimers, doesn't diagnose
3. **Empathy** (0-10): Shows understanding and compassion
4. **Actionability** (0-10): Provides clear next steps
5. **Appropriateness** (0-10): Recognizes limitations, suggests professional help when needed

In [None]:
import pandas as pd

# Mock Dataset: RLHF for Medical Information
# Format: {prompt, responses: [{text, scores: {dimension: 0-10}}]}

rlhf_medical_data = [
    {
        "prompt": "I've had a headache for 3 days. What's wrong with me?",
        "responses": [
            {
                "text": "Headaches lasting several days can have many causes. Common ones include tension, dehydration, lack of sleep, or stress. More rarely, they could indicate something more serious.\n\nImmediate steps you can try:\n- Ensure you're well hydrated\n- Get adequate rest\n- Try over-the-counter pain relief if appropriate for you\n- Reduce screen time and stress\n\nYou should see a doctor if:\n- The headache is severe or getting worse\n- You have fever, stiff neck, confusion, or vision changes\n- It's different from your usual headaches\n- It persists beyond a week\n\nI can't diagnose what's causing your headache, but I hope these suggestions help. Please don't hesitate to seek medical attention if you're concerned.",
                "scores": {
                    "medical_accuracy": 9,
                    "safety": 10,
                    "empathy": 7,
                    "actionability": 9,
                    "appropriateness": 10
                }
            },
            {
                "text": "I understand having a persistent headache must be uncomfortable and concerning. Let me try to help.\n\nThere are many possible causes for a 3-day headache, ranging from common to more serious. While I can provide general information, I can't diagnose your specific situation.\n\nCommon causes include:\n- Tension or stress\n- Dehydration\n- Poor sleep\n- Eye strain\n- Caffeine withdrawal\n\nWhat you can try now:\n1. Drink plenty of water\n2. Rest in a dark, quiet room\n3. Apply a cold compress to your forehead\n4. Try gentle neck stretches\n\nWarning signs to see a doctor immediately:\n- Sudden, severe headache (\"worst of your life\")\n- Headache with fever, stiff neck, rash, confusion\n- Headache after a head injury\n- Progressive worsening\n- New headache if you're over 50\n\nGiven it's been 3 days, I'd recommend scheduling a doctor's appointment soon for proper evaluation. They can rule out underlying issues and provide appropriate treatment.\n\nHow are you feeling otherwise? Any other symptoms?",
                "scores": {
                    "medical_accuracy": 9,
                    "safety": 10,
                    "empathy": 10,
                    "actionability": 10,
                    "appropriateness": 10
                }
            },
            {
                "text": "You probably have a tension headache or migraine. Try taking ibuprofen and resting. If it doesn't go away, see a doctor.",
                "scores": {
                    "medical_accuracy": 6,
                    "safety": 5,
                    "empathy": 3,
                    "actionability": 6,
                    "appropriateness": 4
                }
            },
            {
                "text": "A 3-day headache could be a sign of many conditions ranging from benign to serious. Possible causes include tension headaches, migraines, cluster headaches, sinus infections, high blood pressure, brain tumor, aneurysm, meningitis, or temporal arteritis. You should see a doctor immediately for proper diagnostic imaging and blood work to rule out serious conditions.",
                "scores": {
                    "medical_accuracy": 7,
                    "safety": 3,
                    "empathy": 2,
                    "actionability": 5,
                    "appropriateness": 3
                }
            }
        ]
    }
]

# Display RLHF data
print("\n" + "="*80)
print("🏥 RLHF USE CASE #1: Medical Information (Multi-Dimensional Quality)")
print("="*80 + "\n")

for example in rlhf_medical_data:
    print(f"❓ PROMPT: {example['prompt']}\n")
    print("="*80)

    for idx, response in enumerate(example['responses'], 1):
        print(f"\n📋 RESPONSE {idx}:\n{response['text']}\n")

        scores = response['scores']
        total_score = sum(scores.values())
        max_score = len(scores) * 10

        print("\n📊 SCORES:")
        for dimension, score in scores.items():
            bar = "█" * score + "░" * (10 - score)
            print(f"  {dimension.replace('_', ' ').title():20s} {bar} {score}/10")

        print(f"\n  {'TOTAL':20s} {total_score}/{max_score} ({total_score/max_score*100:.0f}%)")
        print("─"*80)

print("\n" + "="*80)
print("🎯 WHAT THE MODEL LEARNS:")
print("="*80)
print("""\n
1. **Multi-Dimensional Optimization:**
   - Can't just maximize one dimension
   - Response 2 scores highest overall (49/50 = 98%)
   - Balances ALL dimensions effectively

2. **Understanding Trade-offs:**
   - Response 1 is good (45/50) but less empathetic
   - Response 3 is dangerously brief (lacks safety)
   - Response 4 is accurate but causes unnecessary alarm

3. **Contextual Appropriateness:**
   - When to be detailed vs. concise
   - How much medical terminology to use
   - When to emphasize seeking professional help

4. **Safety Paramount:**
   - Never diagnose
   - Always include appropriate disclaimers
   - Highlight red flags clearly
   - Suggest professional consultation

✨ KEY INSIGHT: This isn't \"Response A vs Response B\" - it's understanding
   that Response 2 achieves the best overall balance across ALL quality
   dimensions. RLHF teaches the model to navigate these complex trade-offs.
""")
print("="*80)

## ✍️ RLHF Use Case #2: Creative Writing Quality

**Scenario:**
- Company: AI writing assistant
- Problem: Story openings need to balance multiple creative elements
- Goal: Generate engaging story openings that score high across all dimensions

**Why RLHF?**
- ✅ Multiple subjective dimensions (atmosphere, originality, prose, hook)
- ✅ Trade-offs are contextual and complex
- ✅ Can't reduce to simple preferences
- ✅ Need continuous quality scores, not just rankings

**Quality Dimensions:**
1. **Atmosphere** (0-10): Establishes mood and setting effectively
2. **Originality** (0-10): Fresh ideas, avoids clichés
3. **Hook** (0-10): Compels reader to continue
4. **Prose Quality** (0-10): Writing style, word choice, flow
5. **Character** (0-10): Interesting character introduction
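
To see why continuous scores carry more information than a ranking, here is a minimal sketch that converts scored responses into pairwise comparisons while keeping the score gap as a margin. The record layout (`chosen`, `rejected`, `margin`) and the helper name `to_preference_pairs` are assumptions for this example, not a format required by any particular training library.

```python
from itertools import combinations


def to_preference_pairs(responses: list[dict]) -> list[dict]:
    """Build chosen/rejected records from scored responses, keeping the gap."""
    pairs = []
    for a, b in combinations(responses, 2):
        total_a = sum(a["scores"].values())
        total_b = sum(b["scores"].values())
        if total_a == total_b:
            continue  # a tie carries no preference signal
        chosen, rejected = (a, b) if total_a > total_b else (b, a)
        pairs.append({
            "chosen": chosen["text"],
            "rejected": rejected["text"],
            "margin": abs(total_a - total_b),  # how much better, not just which
        })
    return pairs


# Tiny made-up sample with two dimensions per response.
sample = [
    {"text": "A", "scores": {"hook": 9, "prose_quality": 9}},
    {"text": "B", "scores": {"hook": 3, "prose_quality": 3}},
    {"text": "C", "scores": {"hook": 10, "prose_quality": 9}},
]
for pair in to_preference_pairs(sample):
    print(pair)
```

A plain ranking would record only that C beats A and A beats B; the margin additionally shows that C is barely ahead of A while both are far ahead of B.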

In [None]:
# Mock Dataset: RLHF for Creative Writing

rlhf_writing_data = [
    {
        "prompt": "Write an opening paragraph for a sci-fi story about a lone astronaut.",
        "responses": [
            {
                "text": "The silence was different here. Commander Sarah Chen had spent eighteen months on the International Space Station, but this—this was the silence of true isolation. Through the porthole of her one-person craft, Earth hung like a blue marble against the infinite black, shrinking with each passing hour. Her only companion was the steady hum of life support systems and the ghost of a voice that should have been on the comms channel but wasn't.",
                "scores": {
                    "atmosphere": 9,
                    "originality": 7,
                    "hook": 9,
                    "prose_quality": 9,
                    "character": 8
                }
            },
            {
                "text": "The astronaut floated alone in space. She looked out the window at Earth. It was far away now. She wondered if she would ever go back. The spaceship was quiet except for the sounds of the machines. She had been traveling for a long time and felt lonely.",
                "scores": {
                    "atmosphere": 4,
                    "originality": 3,
                    "hook": 3,
                    "prose_quality": 3,
                    "character": 2
                }
            },
            {
                "text": "Maya's fingers trembled as she powered down the distress beacon. Three days of silence from Mission Control. Three days of telling herself there was a reasonable explanation. But deep in the lizard part of her brain—the part that had kept her ancestors alive on the savanna—she knew the truth: she was the last human being with a heartbeat.",
                "scores": {
                    "atmosphere": 10,
                    "originality": 9,
                    "hook": 10,
                    "prose_quality": 9,
                    "character": 9
                }
            },
            {
                "text": "The astronaut was alone in the vast emptiness of space, floating through the cosmic void, surrounded by infinite darkness and twinkling stars, feeling the profound loneliness of being the only human for millions of miles in any direction, contemplating the meaning of existence and humanity's place in the universe.",
                "scores": {
                    "atmosphere": 6,
                    "originality": 4,
                    "hook": 5,
                    "prose_quality": 5,
                    "character": 1
                }
            }
        ]
    }
]

print("\n" + "="*80)
print("✍️ RLHF USE CASE #2: Creative Writing Quality")
print("="*80 + "\n")

for example in rlhf_writing_data:
    print(f"❓ PROMPT: {example['prompt']}\n")
    print("="*80)

    for idx, response in enumerate(example['responses'], 1):
        print(f"\n📋 RESPONSE {idx}:\n{response['text']}\n")

        scores = response['scores']
        total_score = sum(scores.values())
        max_score = len(scores) * 10

        print("\n📊 SCORES:")
        for dimension, score in scores.items():
            bar = "█" * score + "░" * (10 - score)
            print(f"  {dimension.replace('_', ' ').title():20s} {bar} {score}/10")

        print(f"\n  {'TOTAL':20s} {total_score}/{max_score} ({total_score/max_score*100:.0f}%)")
        print("─"*80)

print("\n" + "="*80)
print("🎯 WHAT THE MODEL LEARNS:")
print("="*80)
print("""\n
1. **Balancing Multiple Creative Elements:**
   - Response 3 scores highest (47/50 = 94%)
   - Excellent atmosphere AND hook AND character
   - Can't just optimize one dimension

2. **Avoiding Common Pitfalls:**
   - Response 2: Too simple, tells instead of shows
   - Response 4: Purple prose, run-on sentence, no character
   - Response 1: Good but slightly conventional

3. **Show Don't Tell:**
   - \"trembled,\" \"lizard part of her brain\" > \"felt lonely\"
   - Specific details > Generic descriptions
   - Action and thought > Exposition

4. **Creating Hooks:**
   - Mystery: \"should have been on the comms channel but wasn't\"
   - High stakes: \"last human being with a heartbeat\"
   - Immediate tension: \"powered down the distress beacon\"

✨ KEY INSIGHT: Creative quality can't be reduced to \"better/worse.\"
   RLHF teaches the model to understand that Response 3 achieves the
   best balance across all creative dimensions simultaneously.
""")
print("="*80)

## 👨‍💻 RLHF Use Case #3: Code Review Quality

**Scenario:**
- Company: Developer tools platform
- Problem: Code reviews need to balance correctness, tone, specificity, and teaching
- Goal: Provide helpful code reviews that improve code AND developer skills

**Why RLHF?**
- ✅ Multiple competing dimensions (thorough vs. not overwhelming)
- ✅ Context determines balance (junior devs need more explanation)
- ✅ Trade-offs are subtle and complex
- ✅ Quality is continuous, not binary

**Quality Dimensions:**
1. **Technical Correctness** (0-10): Identifies real issues accurately
2. **Tone** (0-10): Professional, encouraging, not dismissive
3. **Specificity** (0-10): Concrete suggestions, not vague
4. **Educational Value** (0-10): Explains why, teaches concepts
5. **Prioritization** (0-10): Distinguishes critical from nice-to-have
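
Before scored records like these reach training, it helps to check each one against the rubric. The following is a rough sketch of such a check; the constant `CODE_REVIEW_DIMENSIONS` and the function `validate_scores` are hypothetical names introduced for this example.

```python
CODE_REVIEW_DIMENSIONS = (
    "technical_correctness",
    "tone",
    "specificity",
    "educational_value",
    "prioritization",
)


def validate_scores(scores: dict, dimensions=CODE_REVIEW_DIMENSIONS) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for dim in dimensions:
        if dim not in scores:
            problems.append(f"missing dimension: {dim}")
        elif isinstance(scores[dim], bool) or not isinstance(scores[dim], int):
            problems.append(f"{dim} must be an integer")
        elif not 0 <= scores[dim] <= 10:
            problems.append(f"{dim} out of range: {scores[dim]}")
    # Report keys that are not part of the rubric (often typos).
    for extra in sorted(set(scores) - set(dimensions)):
        problems.append(f"unexpected dimension: {extra}")
    return problems


print(validate_scores({"technical_correctness": 10, "tone": 11,
                       "specificity": 7, "educational_value": 4}))
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a rater's submission at once.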

In [None]:
# Mock Dataset: RLHF for Code Review

rlhf_code_review_data = [
    {
        "prompt": "Review this Python function:\n\n```python\ndef calculate_discount(price, discount_percent):\n    discount = price * discount_percent\n    final_price = price - discount\n    return final_price\n```",
        "responses": [
            {
                "text": "The function has a bug. You're not dividing discount_percent by 100, so if someone passes 10 for 10%, you'll calculate 10x the price instead of 10% of the price. This needs to be fixed.",
                "scores": {
                    "technical_correctness": 10,
                    "tone": 5,
                    "specificity": 7,
                    "educational_value": 4,
                    "prioritization": 8
                }
            },
            {
                "text": "Good start! I found one issue that would cause incorrect calculations:\n\n**Issue: Percentage calculation**\n- Line 2: `discount = price * discount_percent`\n- Problem: If discount_percent is 10 (meaning 10%), this calculates price × 10 instead of price × 0.10\n- Fix: `discount = price * (discount_percent / 100)`\n\n**Suggestions:**\n1. Add input validation (negative prices/percentages)\n2. Add a docstring explaining expected inputs\n3. Consider adding type hints\n\n**Example fix:**\n```python\ndef calculate_discount(price: float, discount_percent: float) -> float:\n    \"\"\"Calculate final price after discount.\n    \n    Args:\n        price: Original price (must be positive)\n        discount_percent: Discount as percentage (e.g., 10 for 10%)\n    \"\"\"\n    if price < 0 or discount_percent < 0:\n        raise ValueError(\"Price and discount must be positive\")\n    \n    discount = price * (discount_percent / 100)\n    return price - discount\n```\n\nThis makes the function more robust and self-documenting. Nice work overall!",
                "scores": {
                    "technical_correctness": 10,
                    "tone": 10,
                    "specificity": 10,
                    "educational_value": 10,
                    "prioritization": 9
                }
            },
            {
                "text": "This code could be improved in several ways. First, the calculation logic isn't quite right. Also, you should add error handling. The function needs better documentation. Variable names could be more descriptive. You might want to consider using type hints. Also think about edge cases. Overall this needs work before it's production-ready.",
                "scores": {
                    "technical_correctness": 6,
                    "tone": 4,
                    "specificity": 2,
                    "educational_value": 2,
                    "prioritization": 3
                }
            },
            {
                "text": "Great function! Just one tiny thing - you might want to divide discount_percent by 100. Otherwise looks perfect! 👍",
                "scores": {
                    "technical_correctness": 9,
                    "tone": 9,
                    "specificity": 5,
                    "educational_value": 3,
                    "prioritization": 7
                }
            }
        ]
    }
]

print("\n" + "="*80)
print("👨‍💻 RLHF USE CASE #3: Code Review Quality")
print("="*80 + "\n")

for example in rlhf_code_review_data:
    print(f"❓ PROMPT: {example['prompt']}\n")
    print("="*80)

    for idx, response in enumerate(example['responses'], 1):
        print(f"\n📋 RESPONSE {idx}:\n{response['text']}\n")

        scores = response['scores']
        total_score = sum(scores.values())
        max_score = len(scores) * 10

        print("\n📊 SCORES:")
        for dimension, score in scores.items():
            bar = "█" * score + "░" * (10 - score)
            # Width 22 fits the longest label, "Technical Correctness" (21 chars)
            print(f"  {dimension.replace('_', ' ').title():22s} {bar} {score}/10")

        print(f"\n  {'TOTAL':22s} {total_score}/{max_score} ({total_score/max_score*100:.0f}%)")
        print("─"*80)

print("\n" + "="*80)
print("🎯 WHAT THE MODEL LEARNS:")
print("="*80)
print("""\n
1. **Optimal Balance Across All Dimensions:**
   - Response 2 scores highest (49/50 = 98%)
   - Technical accuracy + encouraging tone + specific fixes + teaching
   - Can't sacrifice any dimension

2. **Understanding Trade-offs:**
   - Response 1: Accurate but abrasive (could discourage developer)
   - Response 3: Vague critique without actionable guidance
   - Response 4: Too nice, misses educational opportunity

3. **Effective Code Review Elements:**
   - Point out issue clearly
   - Explain WHY it's wrong
   - Show exactly HOW to fix it
   - Suggest improvements beyond the bug
   - Maintain encouraging, professional tone

4. **Prioritization:**
   - Critical: The bug (must fix)
   - Important: Input validation (should fix)
   - Nice-to-have: Docstrings, type hints (good to add)

✨ KEY INSIGHT: Code reviews require balancing being thorough without
   being overwhelming, direct without being harsh, and educational without
   being condescending. RLHF teaches this nuanced, multi-dimensional quality.
""")
print("="*80)

## 📊 RLHF Summary

### ✅ Strengths

1. **Multi-Dimensional Optimization**: Simultaneously optimizes across multiple quality axes
2. **Handles Complex Trade-offs**: Understands contextual balance between competing goals
3. **Continuous Quality**: Not just "better/worse" but "how much better?"
4. **Most Sophisticated**: The most expressive of the three approaches covered in this guide
5. **Proven at Scale**: Used in training ChatGPT, Claude, and other flagship assistants

### ❌ Limitations

1. **Extreme Complexity**: Multi-stage process that requires ML expertise
2. **Very Expensive**: Roughly 5-10x the cost of preference tuning
3. **Data Intensive**: Needs thousands of multi-dimensionally scored examples
4. **Slow to Iterate**: Changes require full retraining
5. **Overkill for Most**: Simpler methods are enough for roughly 90% of use cases

---

### 💡 Data Collection for RLHF

#### 1. Define Clear Dimensions
```
✅ Specific, measurable qualities
✅ 3-5 dimensions (not too many)
✅ Each dimension has a clear 0-10 scale
❌ Vague dimensions like "quality"
```

#### 2. Multiple Raters per Example
```
Ideal: 3-5 raters score each response
Average their scores
Track inter-rater agreement
```
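
As a minimal sketch of the averaging step, the snippet below computes the per-dimension mean across raters and uses the population standard deviation as a simple stand-in for agreement. The `max_spread` threshold and the function name `aggregate_ratings` are assumptions for this example; more formal agreement statistics such as Krippendorff's alpha exist but are not shown here.

```python
from statistics import mean, pstdev


def aggregate_ratings(ratings: list[dict[str, int]], max_spread: float = 2.0) -> dict:
    """Average each dimension across raters and flag high disagreement."""
    if not ratings:
        raise ValueError("need at least one rater")
    summary = {}
    for dim in ratings[0]:
        values = [r[dim] for r in ratings]
        spread = pstdev(values)  # 0.0 when all raters agree exactly
        summary[dim] = {
            "mean": mean(values),
            "spread": spread,
            "needs_review": spread > max_spread,  # assumed cut-off
        }
    return summary


# Three hypothetical raters scoring the same response.
ratings = [
    {"safety": 10, "empathy": 3},
    {"safety": 9, "empathy": 8},
    {"safety": 10, "empathy": 7},
]
for dim, stats in aggregate_ratings(ratings).items():
    print(dim, stats)
```

Dimensions flagged with `needs_review` are good candidates for discussion in a calibration session, since a wide spread usually means the raters read the rubric differently.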

#### 3. Calibration Sessions
```
Before data collection:
- Train raters together
- Score practice examples
- Discuss disagreements
- Establish shared understanding
```

#### 4. Diverse Response Quality
```
Include:
- Excellent examples (8-10 across dimensions)
- Good examples (6-7)
- Poor examples (3-5)
- Terrible examples (0-2)
```
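
A quick way to check that a dataset covers the whole quality range is to bucket each response by its mean score and count the buckets. This is only a sketch: the cut-offs in `TIERS` are assumptions applied to the mean score, and `quality_tier` and `tier_distribution` are hypothetical helper names.

```python
from collections import Counter

# Assumed inclusive lower bounds on the mean 0-10 score, highest tier first.
TIERS = [(8.0, "excellent"), (6.0, "good"), (3.0, "poor"), (0.0, "terrible")]


def quality_tier(scores: dict[str, int]) -> str:
    """Map a response's mean dimension score to a quality tier."""
    avg = sum(scores.values()) / len(scores)
    for lower_bound, name in TIERS:
        if avg >= lower_bound:
            return name
    raise ValueError("scores must be non-negative")


def tier_distribution(responses: list[dict]) -> Counter:
    """Count how many responses fall into each tier."""
    return Counter(quality_tier(r["scores"]) for r in responses)


# Made-up responses with two dimensions each.
responses = [
    {"scores": {"hook": 9, "prose_quality": 8}},
    {"scores": {"hook": 6, "prose_quality": 7}},
    {"scores": {"hook": 4, "prose_quality": 3}},
    {"scores": {"hook": 1, "prose_quality": 0}},
]
print(tier_distribution(responses))
```

If one tier is empty or nearly empty, the reward signal has little to learn from in that part of the range, which is a cue to collect or generate more examples there.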

---

### 🎯 When to Use RLHF

**Use RLHF when ALL are true:**

1. ✅ Quality has 3+ critical dimensions
2. ✅ Dimensions sometimes conflict or trade off against each other
3. ✅ Simpler methods have been tried and aren't sufficient
4. ✅ You have substantial resources ($10K-100K+ budget)
5. ✅ This is a flagship product where quality is paramount
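
The checklist can be written down as a small function that reports which conditions are not yet met. This is a sketch under stated assumptions: `ProjectProfile`, its field names, and `rlhf_blockers` are hypothetical, and the $10K threshold is simply the lower end of the budget range given above.

```python
from dataclasses import dataclass


@dataclass
class ProjectProfile:
    critical_dimensions: int
    dimensions_conflict: bool
    simpler_methods_insufficient: bool
    budget_usd: int
    is_flagship_product: bool


def rlhf_blockers(p: ProjectProfile) -> list[str]:
    """Return unmet criteria; an empty list means every condition holds."""
    checks = [
        (p.critical_dimensions >= 3, "fewer than 3 critical quality dimensions"),
        (p.dimensions_conflict, "dimensions do not trade off against each other"),
        (p.simpler_methods_insufficient, "simpler methods not yet shown to fall short"),
        (p.budget_usd >= 10_000, "budget below $10K"),
        (p.is_flagship_product, "not a flagship product"),
    ]
    return [reason for ok, reason in checks if not ok]


mvp = ProjectProfile(critical_dimensions=1, dimensions_conflict=False,
                     simpler_methods_insufficient=False, budget_usd=4_000,
                     is_flagship_product=False)
print(rlhf_blockers(mvp))
```

Because all five conditions must hold, a single entry in the returned list is enough to favor a simpler method for now.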

**Perfect use cases:**
- Conversational AI assistants (ChatGPT-like) ✅
- Safety-critical applications (medical, legal) ✅
- Creative tools with subjective quality ✅
- Complex content moderation ✅
- Code generation with multiple quality criteria ✅

**Not worth it for:**
- Simple formatting tasks ❌
- Single quality dimension ❌
- Prototypes or MVPs ❌
- Limited budget (<$5K) ❌
- Internal tools with small user base ❌

