# Focused Learning: Efficient SFT with Model-Generated Data

## Learning Objectives
1. Understand why model-generated data outperforms human-written solutions for training
2. Learn the multi-stage generation process for creating high-quality training data
3. Master efficient supervised fine-tuning (SFT) techniques for code generation
4. Implement data-efficiency strategies that match or beat much larger datasets with roughly 40x fewer training samples

## Concept Source
- **Paper Section**: Section 4 (Efficient Training) and Section 2.1 (Data Collection)
- **Key Table**: Table 4 (Model SFT-training results)
- **Critical Finding**: "2.6K model-generated LeetCode samples achieved superior performance on HumanEval (79.9%) and MBPP (77.5%), surpassing models trained on much larger datasets (9.5K–111.1K rows)"
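
As a quick check on the scale of that gap, the dataset sizes quoted above can be turned into ratios. This is only a small arithmetic sketch; the variable names are illustrative and not from the paper.

```python
# Dataset sizes (rows) quoted in the critical finding above
leetcode_rows = 2_600
baseline_rows = {"smallest baseline": 9_500, "largest baseline": 111_100}

# How many times more rows each baseline uses than the 2.6K LeetCode set
size_ratios = {name: rows / leetcode_rows for name, rows in baseline_rows.items()}

for name, ratio in size_ratios.items():
    print(f"{name}: {ratio:.1f}x more rows")
```

The baselines use about 3.7x and 42.7x as many rows, which is where the "roughly 40x" data-efficiency figure comes from.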

## 1. The Data Quality vs. Quantity Paradigm

### Why Model-Generated Data Works Better

The paper reveals a counterintuitive finding: **Model-generated solutions are better for training than human-written ones**.

Key insights:
- **Human solutions**: Often terse, minimal comments, optimized for contests
- **Model-generated solutions**: Explicit reasoning, clear structure, educational value
- **Training efficiency**: Quality trumps quantity in code generation tasks

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    TrainingArguments, Trainer,
    DataCollatorForLanguageModeling
)
import pandas as pd
import numpy as np
from typing import List, Dict, Tuple, Optional
import json
import matplotlib.pyplot as plt
import seaborn as sns
from dataclasses import dataclass
import wandb
from tqdm import tqdm
import re

# Set up visualization
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

## 2. Multi-Stage Data Generation Process

Following the paper's approach using Qwen2.5-Coder-32B-Instruct:

In [None]:
@dataclass
class GenerationConfig:
    """Configuration for multi-stage generation"""
    model_name: str = "Qwen/Qwen2.5-Coder-32B-Instruct"
    high_temp: float = 1.0  # For diversity
    low_temp: float = 0.2   # For quality
    max_attempts: int = 5
    use_ground_truth_hints: bool = True

class ModelBasedDataGenerator:
    """Generate high-quality training data using LLMs"""
    
    def __init__(self, config: GenerationConfig):
        self.config = config
        # In practice, you'd load the actual model
        # self.tokenizer = AutoTokenizer.from_pretrained(config.model_name)
        # self.model = AutoModelForCausalLM.from_pretrained(config.model_name)
        
    def generate_solution_variants(self, problem: Dict, 
                                 num_variants: int = 5) -> List[Dict]:
        """Stage 1: Generate diverse solution candidates with high temperature"""
        
        # Prompt template for real generation; this simulation does not send it to a model
        prompt_template = """
You are an expert competitive programmer. Solve this problem with clear reasoning:

Problem: {problem_description}

Starter Code:
{starter_code}

Provide a complete solution with step-by-step reasoning. Focus on:
1. Understanding the problem
2. Identifying the algorithm/approach
3. Implementation with clear variable names
4. Time/space complexity analysis
"""
        
        variants = []
        
        for i in range(num_variants):
            # Simulate diverse generation with different approaches
            approaches = [
                self._generate_iterative_solution(problem),
                self._generate_mathematical_solution(problem),
                self._generate_two_pointer_solution(problem),
                self._generate_binary_search_solution(problem),
                self._generate_optimized_solution(problem)
            ]
            
            variant = approaches[i % len(approaches)]
            variant['generation_id'] = i
            variant['temperature'] = self.config.high_temp
            
            variants.append(variant)
        
        return variants
    
    def _generate_iterative_solution(self, problem: Dict) -> Dict:
        """Generate iterative approach solution"""
        return {
            'approach': 'iterative',
            'reasoning': """
Step 1: Understanding the Problem
We need to find the missing number in an arithmetic progression.
In an AP, consecutive differences are constant.

Step 2: Algorithm
- Calculate the expected common difference
- Iterate through array to find the gap
- Return the missing number

Step 3: Implementation
""",
            'code': """
class Solution:
    def missingNumber(self, arr: List[int]) -> int:
        n = len(arr)
        # Calculate expected common difference
        total_diff = arr[-1] - arr[0]
        common_diff = total_diff // n
        
        # Find the missing number by checking consecutive differences
        for i in range(n - 1):
            if arr[i + 1] - arr[i] != common_diff:
                return arr[i] + common_diff
        
        # Edge case: missing at end
        return arr[-1] + common_diff
""",
            'complexity': 'Time: O(n), Space: O(1)'
        }
    
    def _generate_mathematical_solution(self, problem: Dict) -> Dict:
        """Generate mathematical approach"""
        return {
            'approach': 'mathematical',
            'reasoning': """
Step 1: Mathematical Insight
A complete AP with n + 1 terms sums to (n + 1) * (first + last) / 2.
We are given only n of those terms, so exactly one is missing.

Step 2: Formula Application
- Calculate expected sum of complete AP
- Subtract actual sum to find missing number

Step 3: Edge Case Handling
""",
            'code': """
class Solution:
    def missingNumber(self, arr: List[int]) -> int:
        n = len(arr)
        # Expected sum of complete arithmetic progression
        # Formula: (n+1) * (first + last) / 2
        expected_sum = (n + 1) * (arr[0] + arr[-1]) // 2
        
        # Actual sum of given array
        actual_sum = sum(arr)
        
        # Missing number is the difference
        return expected_sum - actual_sum
""",
            'complexity': 'Time: O(n), Space: O(1)'
        }
    
    def _generate_two_pointer_solution(self, problem: Dict) -> Dict:
        """Generate two-pointer approach"""
        return {
            'approach': 'two_pointer',
            'reasoning': """
Step 1: Two Pointer Strategy
Keep left/right pointers and binary-search for the first index whose value breaks the progression.

Step 2: Logic
- Calculate expected value at each position
- Use binary search to find the break point

Step 3: Optimization
""",
            'code': """
class Solution:
    def missingNumber(self, arr: List[int]) -> int:
        n = len(arr)
        common_diff = (arr[-1] - arr[0]) // n
        
        left, right = 0, n - 1
        
        while left <= right:
            mid = (left + right) // 2
            expected_val = arr[0] + mid * common_diff
            
            if arr[mid] == expected_val:
                left = mid + 1
            else:
                right = mid - 1
        
        return arr[0] + left * common_diff
""",
            'complexity': 'Time: O(log n), Space: O(1)'
        }
    
    def _generate_binary_search_solution(self, problem: Dict) -> Dict:
        """Generate binary search solution"""
        return self._generate_two_pointer_solution(problem)  # Similar approach
    
    def _generate_optimized_solution(self, problem: Dict) -> Dict:
        """Generate most optimized solution"""
        return self._generate_mathematical_solution(problem)  # Demo reuses the sum-formula solution

# Test the generator
config = GenerationConfig()
generator = ModelBasedDataGenerator(config)

test_problem = {
    'problem_description': "Find missing number in arithmetic progression",
    'starter_code': "class Solution:\n    def missingNumber(self, arr: List[int]) -> int:"
}

variants = generator.generate_solution_variants(test_problem, num_variants=3)

print("Generated Solution Variants:")
for i, variant in enumerate(variants):
    print(f"\nVariant {i+1} ({variant['approach']}):")
    print(f"Complexity: {variant['complexity']}")
    print("Reasoning excerpt:", variant['reasoning'][:100] + "...")

## 3. Automated Test Case Verification

### Stage 2: Filter Functionally Correct Responses

In [None]:
class SolutionVerifier:
    """Verify solution correctness against test cases"""
    
    def __init__(self, timeout: int = 5):
        self.timeout = timeout
        
    def verify_solutions(self, variants: List[Dict], 
                        test_cases: List[Dict]) -> List[Dict]:
        """Filter variants by correctness"""
        verified_solutions = []
        
        for variant in variants:
            verification_result = self._verify_single_solution(
                variant, test_cases
            )
            
            if verification_result['passed']:
                variant['verification'] = verification_result
                verified_solutions.append(variant)
        
        return verified_solutions
    
    def _verify_single_solution(self, variant: Dict, 
                               test_cases: List[Dict]) -> Dict:
        """Verify a single solution against test cases"""
        results = {
            'total_tests': len(test_cases),
            'passed_tests': 0,
            'failed_tests': [],
            'execution_time': 0,
            'passed': False
        }
        
        # Simulate test execution for each test case
        for i, test_case in enumerate(test_cases):
            try:
                # In real implementation, execute the code safely
                # result = self._execute_code(variant['code'], test_case)
                
                # Simulate execution result
                if self._simulate_test_execution(variant, test_case):
                    results['passed_tests'] += 1
                else:
                    results['failed_tests'].append({
                        'test_id': i,
                        'input': test_case['input'],
                        'expected': test_case['output'],
                        'actual': 'simulated_wrong_output'
                    })
                    
            except Exception as e:
                results['failed_tests'].append({
                    'test_id': i,
                    'error': str(e)
                })
        
        # Solution passes only if every test case passes (and at least one test exists)
        total = results['total_tests']
        results['passed'] = total > 0 and results['passed_tests'] == total
        results['pass_rate'] = results['passed_tests'] / total if total else 0.0
        
        return results
    
    def _simulate_test_execution(self, variant: Dict, test_case: Dict) -> bool:
        """Simulate test execution (replace with actual execution)"""
        # Simulate different success rates for different approaches
        success_rates = {
            'mathematical': 0.95,
            'iterative': 0.90,
            'two_pointer': 0.85
        }
        
        approach = variant.get('approach', 'unknown')
        success_rate = success_rates.get(approach, 0.80)
        
        return np.random.random() < success_rate
    
    def generate_verification_report(self, variants: List[Dict], 
                                   verified: List[Dict]) -> Dict:
        """Generate comprehensive verification report"""
        report = {
            'initial_variants': len(variants),
            'verified_solutions': len(verified),
            'success_rate': len(verified) / len(variants) if variants else 0,
            'approach_performance': {},
            'quality_metrics': {}
        }
        
        # Analyze by approach
        for variant in variants:
            approach = variant['approach']
            if approach not in report['approach_performance']:
                report['approach_performance'][approach] = {
                    'total': 0, 'verified': 0
                }
            report['approach_performance'][approach]['total'] += 1
        
        for variant in verified:
            approach = variant['approach']
            report['approach_performance'][approach]['verified'] += 1
        
        # Calculate success rates by approach
        for approach, stats in report['approach_performance'].items():
            stats['success_rate'] = stats['verified'] / stats['total'] if stats['total'] > 0 else 0
        
        # Quality metrics
        if verified:
            pass_rates = [v.get('verification', {}).get('pass_rate', 0) for v in verified]
            report['quality_metrics'] = {
                'avg_pass_rate': np.mean(pass_rates),
                'min_pass_rate': min(pass_rates),
                'max_pass_rate': max(pass_rates)
            }
        
        return report

# Test the verifier
test_cases = [
    {'input': {'arr': [5, 7, 11, 13]}, 'output': 9},
    {'input': {'arr': [15, 13, 12]}, 'output': 14},
    {'input': {'arr': [1, 3, 5, 9]}, 'output': 7}
]

verifier = SolutionVerifier()
verified_solutions = verifier.verify_solutions(variants, test_cases)
verification_report = verifier.generate_verification_report(variants, verified_solutions)

print("Verification Report:")
print(json.dumps(verification_report, indent=2))

print(f"\nVerified {len(verified_solutions)}/{len(variants)} solutions")
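
The verifier above only simulates test outcomes with random draws. As a hedged sketch of what real execution could look like, the hypothetical helper `run_candidate` below runs a candidate `Solution` class against the same test-case format. It uses plain `exec`, which is **not** a sandbox; a real pipeline would need process isolation and timeouts.

```python
from typing import Any, Dict, List

def run_candidate(code: str, method_name: str,
                  test_cases: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Execute `code` (which must define a `Solution` class) and score it."""
    # LeetCode-style snippets assume `List` is already in scope
    namespace: Dict[str, Any] = {"List": List}
    try:
        exec(code, namespace)  # WARNING: not sandboxed, demo only
        solver = getattr(namespace["Solution"](), method_name)
    except Exception as exc:
        return {"passed": False, "pass_rate": 0.0,
                "failed_tests": [{"error": repr(exc)}]}

    failed = []
    for i, case in enumerate(test_cases):
        try:
            actual = solver(**case["input"])
        except Exception as exc:
            failed.append({"test_id": i, "error": repr(exc)})
            continue
        if actual != case["output"]:
            failed.append({"test_id": i, "expected": case["output"], "actual": actual})

    total = len(test_cases)
    return {
        "passed": total > 0 and not failed,
        "pass_rate": (total - len(failed)) / total if total else 0.0,
        "failed_tests": failed,
    }

candidate = '''
class Solution:
    def missingNumber(self, arr: List[int]) -> int:
        n = len(arr)
        expected_sum = (n + 1) * (arr[0] + arr[-1]) // 2
        return expected_sum - sum(arr)
'''
# Deliberately broken variant: uses n terms instead of n + 1
buggy = candidate.replace("(n + 1)", "n")

cases = [
    {"input": {"arr": [5, 7, 11, 13]}, "output": 9},
    {"input": {"arr": [15, 13, 12]}, "output": 14},
    {"input": {"arr": [1, 3, 5, 9]}, "output": 7},
]

good_result = run_candidate(candidate, "missingNumber", cases)
bad_result = run_candidate(buggy, "missingNumber", cases)
print(good_result["passed"], bad_result["passed"])
```

The correct candidate passes all three cases, while the buggy variant fails each one, so only the first would survive Stage 2 filtering.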

## 4. Ground Truth Integration for Difficult Problems

### Stage 3: Handle Persistently Failing Problems

In [None]:
class GroundTruthIntegrator:
    """Integrate ground truth hints for difficult problems"""
    
    def __init__(self, failure_threshold: float = 0.3):
        self.failure_threshold = failure_threshold
        
    def identify_difficult_problems(self, problem_results: Dict[str, List[Dict]]) -> List[str]:
        """Identify problems with low success rates"""
        difficult_problems = []
        
        for problem_id, solutions in problem_results.items():
            if not solutions:  # No solutions passed
                difficult_problems.append(problem_id)
            else:
                # Check if verification pass rate is low
                avg_pass_rate = np.mean([
                    sol.get('verification', {}).get('pass_rate', 0) 
                    for sol in solutions
                ])
                
                if avg_pass_rate < self.failure_threshold:
                    difficult_problems.append(problem_id)
        
        return difficult_problems
    
    def create_hint_enhanced_prompts(self, problem: Dict, 
                                   ground_truth_solution: str) -> List[str]:
        """Create prompts with progressive hints"""
        
        # Extract key insights from ground truth
        hints = self._extract_solution_hints(ground_truth_solution)
        
        prompts = []
        
        # Level 1: Algorithmic hint
        prompts.append(f"""
Problem: {problem['description']}

Hint: This problem can be solved using {hints['algorithm_type']}.
Think about {hints['key_insight']}.

Provide a complete solution with explanation.
""")
        
        # Level 2: Implementation hint
        prompts.append(f"""
Problem: {problem['description']}

Algorithm: {hints['algorithm_type']}
Key insight: {hints['key_insight']}
Implementation approach: {hints['implementation_hint']}

Complete the solution:
""")
        
        # Level 3: Code structure hint
        prompts.append(f"""
Problem: {problem['description']}

Solution structure:
{hints['code_structure']}

Fill in the implementation details:
""")
        
        return prompts
    
    def _extract_solution_hints(self, ground_truth: str) -> Dict[str, str]:
        """Extract hints from ground truth solution"""
        # Analyze the ground truth to extract patterns
        hints = {
            'algorithm_type': 'mathematical approach',
            'key_insight': 'arithmetic progression sum formula',
            'implementation_hint': 'calculate expected sum vs actual sum',
            'code_structure': """
# Step 1: Calculate expected sum of complete AP
# Step 2: Calculate actual sum of given array  
# Step 3: Return the difference
"""
        }
        
        # In practice, use AST analysis or LLM to extract hints
        if 'binary' in ground_truth.lower():
            hints['algorithm_type'] = 'binary search'
        elif 'two' in ground_truth.lower() and 'pointer' in ground_truth.lower():
            hints['algorithm_type'] = 'two pointer technique'
        elif 'dp' in ground_truth.lower() or 'dynamic' in ground_truth.lower():
            hints['algorithm_type'] = 'dynamic programming'
        
        return hints
    
    def generate_hint_based_solutions(self, difficult_problems: List[str],
                                    problem_data: Dict[str, Dict],
                                    ground_truth_solutions: Dict[str, str]) -> Dict[str, List[Dict]]:
        """Generate solutions with hints for difficult problems"""
        
        hint_solutions = {}
        
        for problem_id in difficult_problems:
            if problem_id not in ground_truth_solutions:
                continue
                
            problem = problem_data[problem_id]
            ground_truth = ground_truth_solutions[problem_id]
            
            # Create hint-enhanced prompts
            hint_prompts = self.create_hint_enhanced_prompts(problem, ground_truth)
            
            # Generate solutions with different hint levels
            solutions = []
            for i, prompt in enumerate(hint_prompts):
                # In practice, call LLM with this prompt
                solution = self._simulate_hint_based_generation(prompt, ground_truth, i)
                solution['hint_level'] = i + 1
                solution['prompt_used'] = prompt
                solutions.append(solution)
            
            hint_solutions[problem_id] = solutions
        
        return hint_solutions
    
    def _simulate_hint_based_generation(self, prompt: str, 
                                      ground_truth: str, 
                                      hint_level: int) -> Dict:
        """Simulate hint-based solution generation"""
        # Simulate increasing success rate with more hints
        success_rates = [0.6, 0.8, 0.95]  # Level 1, 2, 3 hints
        success_rate = success_rates[min(hint_level, len(success_rates) - 1)]
        
        if np.random.random() < success_rate:
            # Return a solution based on ground truth
            return {
                'approach': 'hint_guided',
                'code': ground_truth,  # Simplified for demo
                'reasoning': f"Used hint level {hint_level + 1} to solve",
                'success': True
            }
        else:
            return {
                'approach': 'hint_guided',
                'code': "# Failed to generate correct solution",
                'reasoning': "Could not solve even with hints",
                'success': False
            }

# Test ground truth integration
integrator = GroundTruthIntegrator(failure_threshold=0.5)

# Simulate problem results
problem_results = {
    'easy_problem': verified_solutions,  # Has solutions
    'hard_problem': [],  # No solutions
    'medium_problem': [{'verification': {'pass_rate': 0.2}}]  # Low pass rate
}

difficult_problems = integrator.identify_difficult_problems(problem_results)
print(f"Identified difficult problems: {difficult_problems}")

# Mock data for hint generation
problem_data = {
    'hard_problem': {'description': 'Complex algorithm problem'},
    'medium_problem': {'description': 'Medium difficulty problem'}
}

ground_truth_solutions = {
    'hard_problem': 'def solve(): return binary_search_solution()',
    'medium_problem': 'def solve(): return mathematical_solution()'
}

hint_solutions = integrator.generate_hint_based_solutions(
    difficult_problems, problem_data, ground_truth_solutions
)

print(f"\nGenerated hint-based solutions for {len(hint_solutions)} problems")
for problem_id, solutions in hint_solutions.items():
    print(f"  {problem_id}: {len(solutions)} solutions with different hint levels")
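
The three stages can be tied together in one control loop: sample candidates, keep the first one that verifies, and escalate to hinted prompts only when every unhinted attempt fails. The sketch below is an illustration under assumptions: `generate_with_fallback`, `generate_fn`, and `is_correct` are hypothetical names, and using the low temperature for the hinted stage is an assumption rather than something the paper states.

```python
from typing import Callable, List, Optional, Tuple

def generate_with_fallback(
    base_prompt: str,
    hint_prompts: List[str],
    generate_fn: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> code
    is_correct: Callable[[str], bool],         # hypothetical: runs the test cases
    max_attempts: int = 5,
    high_temp: float = 1.0,
    low_temp: float = 0.2,
) -> Tuple[Optional[str], int]:
    """Return (solution, hint_level); level 0 means no hint, -1 means failure."""
    # Stages 1-2: sample without hints and keep the first verified candidate
    for _ in range(max_attempts):
        candidate = generate_fn(base_prompt, high_temp)
        if is_correct(candidate):
            return candidate, 0

    # Stage 3: escalate through progressively stronger hints
    # (low temperature here is an assumption of this sketch)
    for level, prompt in enumerate(hint_prompts, start=1):
        candidate = generate_fn(prompt, low_temp)
        if is_correct(candidate):
            return candidate, level

    return None, -1

# Stub generator: only "solves" the problem once the implementation hint is present
def fake_generate(prompt: str, temperature: float) -> str:
    return "correct" if "implementation hint" in prompt else "wrong"

solution, hint_level = generate_with_fallback(
    "solve it",
    ["algorithm hint", "implementation hint", "code structure hint"],
    fake_generate,
    lambda code: code == "correct",
)
unsolved = generate_with_fallback("solve it", [], fake_generate,
                                  lambda code: code == "correct")
print(solution, hint_level, unsolved)
```

Recording the hint level alongside each solution makes it possible to check later how much of the dataset depended on ground-truth hints.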

## 5. Training Dataset Construction

### Creating High-Quality Training Pairs

In [None]:
@dataclass
class TrainingExample:
    """Single training example for SFT"""
    problem_id: str
    difficulty: str
    query: str  # Problem description + starter code
    response: str  # Model-generated solution with reasoning
    metadata: Dict

class TrainingDatasetBuilder:
    """Build high-quality training dataset from verified solutions"""
    
    def __init__(self):
        self.examples = []
        
    def build_training_dataset(self, 
                             verified_solutions: Dict[str, List[Dict]],
                             problem_data: Dict[str, Dict],
                             max_examples_per_problem: int = 2) -> List[TrainingExample]:
        """Build training dataset from verified solutions"""
        
        training_examples = []
        
        for problem_id, solutions in verified_solutions.items():
            if problem_id not in problem_data:
                continue
                
            problem = problem_data[problem_id]
            
            # Select best solutions for training
            selected_solutions = self._select_best_solutions(
                solutions, max_examples_per_problem
            )
            
            for solution in selected_solutions:
                # Create query following LiveCodeBench format
                query = self._format_query(problem)
                
                # Create response with reasoning + code
                response = self._format_response(solution)
                
                example = TrainingExample(
                    problem_id=problem_id,
                    difficulty=problem.get('difficulty', 'Unknown'),
                    query=query,
                    response=response,
                    metadata={
                        'approach': solution['approach'],
                        'verification_score': solution.get('verification', {}).get('pass_rate', 0),
                        'generation_temperature': solution.get('temperature', 0.2)
                    }
                )
                
                training_examples.append(example)
        
        return training_examples
    
    def _select_best_solutions(self, solutions: List[Dict], 
                             max_count: int) -> List[Dict]:
        """Select the best solutions for training"""
        # Sort by verification score and approach diversity
        scored_solutions = []
        
        for solution in solutions:
            score = solution.get('verification', {}).get('pass_rate', 0)
            
            # Fixed per-approach preference bonus (diversity is enforced further down)
            approach_bonus = {
                'mathematical': 0.1,
                'iterative': 0.05,
                'two_pointer': 0.08
            }.get(solution['approach'], 0)
            
            final_score = score + approach_bonus
            scored_solutions.append((final_score, solution))
        
        # Sort by score and take top solutions
        scored_solutions.sort(key=lambda x: x[0], reverse=True)
        
        # Ensure approach diversity
        selected = []
        used_approaches = set()
        
        for score, solution in scored_solutions:
            if len(selected) >= max_count:
                break
                
            approach = solution['approach']
            if approach not in used_approaches or len(selected) == 0:
                selected.append(solution)
                used_approaches.add(approach)
        
        # Fill remaining slots if needed
        for score, solution in scored_solutions:
            if len(selected) >= max_count:
                break
            if solution not in selected:
                selected.append(solution)
        
        return selected[:max_count]
    
    def _format_query(self, problem: Dict) -> str:
        """Format query following LiveCodeBench construction"""
        return f"""
Solve the following coding problem:

{problem['description']}

```python
{problem['starter_code']}
```

Please provide a complete solution with explanation.
""".strip()
    
    def _format_response(self, solution: Dict) -> str:
        """Format response with reasoning + code"""
        return f"""
{solution['reasoning']}

```python
{solution['code']}
```

**Complexity Analysis:**
{solution.get('complexity', 'Time: O(n), Space: O(1)')}
""".strip()
    
    def analyze_dataset_quality(self, examples: List[TrainingExample]) -> Dict:
        """Analyze the quality of the training dataset"""
        analysis = {
            'total_examples': len(examples),
            'difficulty_distribution': {},
            'approach_distribution': {},
            'avg_verification_score': 0,
            'unique_problems': len(set(ex.problem_id for ex in examples)),
            'avg_response_length': 0
        }
        
        if not examples:
            return analysis
        
        # Difficulty distribution (cast NumPy counts to int so json.dumps can serialize them)
        difficulties = [ex.difficulty for ex in examples]
        analysis['difficulty_distribution'] = {
            k: int(v) for k, v in pd.Series(difficulties).value_counts().items()
        }
        
        # Approach distribution
        approaches = [ex.metadata['approach'] for ex in examples]
        analysis['approach_distribution'] = {
            k: int(v) for k, v in pd.Series(approaches).value_counts().items()
        }
        
        # Average verification score
        scores = [ex.metadata['verification_score'] for ex in examples]
        analysis['avg_verification_score'] = float(np.mean(scores))
        
        # Average response length (in characters)
        lengths = [len(ex.response) for ex in examples]
        analysis['avg_response_length'] = float(np.mean(lengths))
        
        return analysis
    
    def save_dataset(self, examples: List[TrainingExample], 
                    filepath: str, format: str = 'jsonl'):
        """Save training dataset in specified format"""
        if format == 'jsonl':
            with open(filepath, 'w') as f:
                for example in examples:
                    data = {
                        'problem_id': example.problem_id,
                        'difficulty': example.difficulty,
                        'query': example.query,
                        'response': example.response,
                        'metadata': example.metadata
                    }
                    f.write(json.dumps(data) + '\n')
        elif format == 'hf':  # Hugging Face format
            dataset_dict = {
                'problem_id': [ex.problem_id for ex in examples],
                'difficulty': [ex.difficulty for ex in examples],
                'text': [f"### Query:\n{ex.query}\n\n### Response:\n{ex.response}" 
                        for ex in examples]
            }
            df = pd.DataFrame(dataset_dict)
            df.to_parquet(filepath.replace('.jsonl', '.parquet'))

# Test the dataset builder
builder = TrainingDatasetBuilder()

# Mock verified solutions and problem data
mock_verified_solutions = {
    'problem_1': verified_solutions[:2],  # Use our earlier solutions
    'problem_2': verified_solutions[1:3] if len(verified_solutions) > 2 else verified_solutions
}

mock_problem_data = {
    'problem_1': {
        'description': 'Find missing number in arithmetic progression',
        'starter_code': 'class Solution:\n    def missingNumber(self, arr: List[int]) -> int:',
        'difficulty': 'Easy'
    },
    'problem_2': {
        'description': 'Another problem description',
        'starter_code': 'class Solution:\n    def solve(self, nums: List[int]) -> int:',
        'difficulty': 'Medium'
    }
}

training_examples = builder.build_training_dataset(
    mock_verified_solutions, mock_problem_data, max_examples_per_problem=2
)

print(f"Built training dataset with {len(training_examples)} examples")

# Analyze dataset quality
quality_analysis = builder.analyze_dataset_quality(training_examples)
print("\nDataset Quality Analysis:")
print(json.dumps(quality_analysis, indent=2))

# Show sample training example
if training_examples:
    print("\nSample Training Example:")
    example = training_examples[0]
    print(f"Problem ID: {example.problem_id}")
    print(f"Difficulty: {example.difficulty}")
    print("Query:", example.query[:200] + "...")
    print("Response:", example.response[:200] + "...")
    print("Metadata:", example.metadata)
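
`save_dataset` is defined above but never exercised. The following is a minimal round-trip sketch, assuming the same JSONL fields and the same `### Query` / `### Response` text layout; the two records are made-up placeholders, not real dataset rows.

```python
import json
import os
import tempfile

# Placeholder records with the same fields that save_dataset writes
records = [
    {"problem_id": "problem_1", "difficulty": "Easy", "query": "Q1",
     "response": "R1", "metadata": {"approach": "mathematical"}},
    {"problem_id": "problem_2", "difficulty": "Medium", "query": "Q2",
     "response": "R2", "metadata": {"approach": "iterative"}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.jsonl")

    # Write one JSON object per line
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read the file back, skipping any blank lines
    with open(path, encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

# Flatten each record into a single training string
sft_texts = [
    f"### Query:\n{r['query']}\n\n### Response:\n{r['response']}" for r in loaded
]
print(len(loaded), sft_texts[0])
```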

## 6. Efficient Supervised Fine-Tuning Implementation

### Following the Paper's Training Configuration
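
To make the configuration in the cell below concrete (3 epochs, learning rate 1e-5, warmup ratio 0.1, cosine schedule, batch size 32), here is a rough sketch of the resulting step counts and learning-rate curve. It assumes all 2,600 examples are used for training (the notebook itself holds out 10%), and `lr_at` is a hypothetical helper that follows the usual linear-warmup-then-cosine shape.

```python
import math

num_examples = 2_600       # size of the model-generated LeetCode set
effective_batch_size = 32  # 8 per device x 4 gradient-accumulation steps
num_epochs = 3
warmup_ratio = 0.1
peak_lr = 1e-5

steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

def lr_at(step: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the whole run is only 246 optimizer steps, 25 of them warmup, which shows how small the training budget is at this dataset size.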

In [None]:
# Assumed minimal definition: create_dataset() below references CustomDataset,
# which is not defined earlier in this notebook.
class CustomDataset(Dataset):
    """List-backed dataset of formatted training examples"""
    
    def __init__(self, examples: List[Dict]):
        self.examples = examples
    
    def __len__(self) -> int:
        return len(self.examples)
    
    def __getitem__(self, idx: int) -> Dict:
        return self.examples[idx]

class EfficientSFTTrainer:
    """Efficient SFT trainer following paper's configuration"""
    
    def __init__(self, model_name: str = "Qwen/Qwen2.5-Coder-7B"):
        self.model_name = model_name
        # In practice, load the actual model
        # self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # self.model = AutoModelForCausalLM.from_pretrained(model_name)
        
    def prepare_training_config(self, num_examples: int) -> TrainingArguments:
        """Prepare training configuration following paper's setup"""
        
        # Paper's configuration:
        # - 3 epochs
        # - Learning rate 1e-5
        # - Warmup ratio 0.1
        # - Cosine learning rate scheduling
        # - Batch size 32
        
        return TrainingArguments(
            output_dir="./leetcode_sft_model",
            num_train_epochs=3,
            learning_rate=1e-5,
            per_device_train_batch_size=8,  # Adjusted for demo
            gradient_accumulation_steps=4,  # Effective batch size = 32
            warmup_ratio=0.1,
            lr_scheduler_type="cosine",
            logging_steps=10,
            save_steps=500,
            eval_steps=100,
            evaluation_strategy="steps",  # Renamed to eval_strategy in newer transformers releases
            save_strategy="steps",
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",
            greater_is_better=False,
            report_to="wandb",  # For tracking
            run_name=f"leetcode_sft_{num_examples}_examples",
            dataloader_num_workers=4,
            fp16=True,  # Mixed precision to save memory; requires a CUDA GPU
            remove_unused_columns=False
        )
    
    def create_dataset(self, examples: List[TrainingExample], 
                      split_ratio: float = 0.9) -> Tuple[Dataset, Dataset]:
        """Create train/validation datasets"""
        
        # Convert to training format
        formatted_examples = []
        for example in examples:
            # Format as instruction-following
            text = f"<|im_start|>user\n{example.query}<|im_end|>\n<|im_start|>assistant\n{example.response}<|im_end|>"
            
            formatted_examples.append({
                'text': text,
                'problem_id': example.problem_id,
                'difficulty': example.difficulty
            })
        
        # Split into train/validation
        split_idx = int(len(formatted_examples) * split_ratio)
        train_examples = formatted_examples[:split_idx]
        val_examples = formatted_examples[split_idx:]
        
        # Create datasets
        train_dataset = CustomDataset(train_examples)
        val_dataset = CustomDataset(val_examples)
        
        return train_dataset, val_dataset
    
    def calculate_efficiency_metrics(self, 
                                   dataset_size: int, 
                                   baseline_datasets: List[Dict]) -> Dict:
        """Calculate efficiency metrics vs. baseline datasets"""
        
        # Paper's baseline comparisons
        baselines = {
            'Magicoder Evol-Instruct-110K': {'size': 111100, 'humaneval': 77.4, 'mbpp': 74.1},
            'Magicoder OSS-Instruct-75K': {'size': 75100, 'humaneval': 73.8, 'mbpp': 76.5},
            'Open-R1 CodeForces-CoT': {'size': 9500, 'humaneval': 79.9, 'mbpp': 74.1},
            'OpenThoughts 114k': {'size': 19900, 'humaneval': 77.4, 'mbpp': 75.7},
            'LeetCodeDataset (model)': {'size': 2600, 'humaneval': 79.9, 'mbpp': 77.5}
        }
        
        # Calculate efficiency
        efficiency_metrics = {
            'dataset_size': dataset_size,
            # Fraction of the largest baseline's row count (smaller means fewer rows)
            'size_reduction_vs_largest': dataset_size / max(b['size'] for b in baselines.values()),
            'efficiency_comparisons': []
        }
        
        leetcode_performance = baselines['LeetCodeDataset (model)']
        
        for name, baseline in baselines.items():
            if name == 'LeetCodeDataset (model)':
                continue
                
            size_ratio = dataset_size / baseline['size']
            performance_ratio_he = leetcode_performance['humaneval'] / baseline['humaneval']
            performance_ratio_mbpp = leetcode_performance['mbpp'] / baseline['mbpp']
            
            efficiency_comparisons = {
                'baseline': name,
                'size_ratio': size_ratio,
                'performance_ratio_humaneval': performance_ratio_he,
                'performance_ratio_mbpp': performance_ratio_mbpp,
                'efficiency_score': (performance_ratio_he + performance_ratio_mbpp) / (2 * size_ratio)
            }
            
            efficiency_metrics['efficiency_comparisons'].append(efficiency_comparisons)
        
        return efficiency_metrics
    
    def simulate_training_results(self, dataset_size: int) -> Dict:
        """Simulate training results with a toy curve (illustrative only).

        Only the 2.6K point (HumanEval 79.9, MBPP 77.5) comes from the paper;
        all other values are assumptions of this notebook, not measurements.
        """
        
        # Reference size: the 2.6K model-generated LeetCode training set
        optimal_size = 2600
        
        if dataset_size <= optimal_size:
            # Assumed linear improvement up to the reference size
            # (the starting scores of 45 and 40 are placeholders)
            size_factor = dataset_size / optimal_size
            humaneval_score = 45 + (79.9 - 45) * size_factor
            mbpp_score = 40 + (77.5 - 40) * size_factor
        else:
            # Diminishing returns modeled as a plateau: extra data neither
            # helps nor hurts in this toy model
            humaneval_score = 79.9
            mbpp_score = 77.5
        
        return {
            'dataset_size': dataset_size,
            'humaneval_pass1': round(humaneval_score, 1),
            'mbpp_pass1': round(mbpp_score, 1),
            'training_time_hours': dataset_size * 0.01,  # Rough placeholder estimate
            'is_optimal_size': abs(dataset_size - optimal_size) < 500
        }

class CustomDataset(Dataset):
    """Custom dataset for training"""
    
    def __init__(self, examples: List[Dict]):
        self.examples = examples
    
    def __len__(self):
        return len(self.examples)
    
    def __getitem__(self, idx):
        return self.examples[idx]

# Test the SFT trainer
trainer = EfficientSFTTrainer()

# Calculate efficiency metrics
efficiency_metrics = trainer.calculate_efficiency_metrics(
    dataset_size=len(training_examples), 
    baseline_datasets=[]
)

print("Efficiency Metrics:")
print(json.dumps(efficiency_metrics, indent=2))

# Simulate training results for different dataset sizes
sizes = [500, 1000, 2600, 5000, 10000, 50000, 111000]
results = []

for size in sizes:
    result = trainer.simulate_training_results(size)
    results.append(result)

# Visualize efficiency curve
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

sizes_plot = [r['dataset_size'] for r in results]
humaneval_scores = [r['humaneval_pass1'] for r in results]
mbpp_scores = [r['mbpp_pass1'] for r in results]

# Performance vs dataset size
ax1.plot(sizes_plot, humaneval_scores, 'o-', label='HumanEval', linewidth=2, markersize=8)
ax1.plot(sizes_plot, mbpp_scores, 's-', label='MBPP', linewidth=2, markersize=8)
ax1.axvline(x=2600, color='red', linestyle='--', alpha=0.7, label='Optimal Size (2.6K)')
ax1.set_xlabel('Dataset Size')
ax1.set_ylabel('Pass@1 Score (%)')
ax1.set_title('Performance vs Dataset Size')
ax1.set_xscale('log')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Efficiency ratio
efficiency_ratios = []
for r in results:
    # Efficiency = Performance / Size
    avg_performance = (r['humaneval_pass1'] + r['mbpp_pass1']) / 2
    efficiency = avg_performance / (r['dataset_size'] / 1000)  # Per 1K examples
    efficiency_ratios.append(efficiency)

ax2.plot(sizes_plot, efficiency_ratios, 'D-', color='green', linewidth=2, markersize=8)
ax2.axvline(x=2600, color='red', linestyle='--', alpha=0.7, label='Optimal Size')
ax2.set_xlabel('Dataset Size')
ax2.set_ylabel('Efficiency (Performance / 1K Examples)')
ax2.set_title('Data Efficiency Analysis')
ax2.set_xscale('log')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey Findings:")
optimal_result = next(r for r in results if r['dataset_size'] == 2600)
largest_result = results[-1]

print(f"Optimal size (2.6K): HumanEval {optimal_result['humaneval_pass1']}%, MBPP {optimal_result['mbpp_pass1']}%")
print(f"Simulated 111K run: HumanEval {largest_result['humaneval_pass1']}%, MBPP {largest_result['mbpp_pass1']}%")
print(f"Relative size: 2.6K is {2600/111000:.1%} of the largest dataset")
print(f"HumanEval at 2.6K relative to the simulated 111K run: {optimal_result['humaneval_pass1']/largest_result['humaneval_pass1']:.1%}")
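
`create_dataset` above splits the examples in input order, so a list sorted by problem ID or difficulty can produce a validation set that looks unlike the training set. The sketch below shows one possible alternative: a seeded split that is stratified by difficulty. `stratified_split` is a hypothetical helper, not part of the paper or of `EfficientSFTTrainer`, and it assumes each example is a dict with a `'difficulty'` key, as in the formatted examples.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_split(examples: List[Dict], split_ratio: float = 0.9,
                     seed: int = 42) -> Tuple[List[Dict], List[Dict]]:
    """Shuffle within each difficulty bucket, then split each bucket by split_ratio."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    buckets = defaultdict(list)
    for example in examples:
        buckets[example.get('difficulty', 'unknown')].append(example)

    train, val = [], []
    for difficulty in sorted(buckets):  # sorted keys keep the result deterministic
        bucket = buckets[difficulty][:]
        rng.shuffle(bucket)
        cut = int(len(bucket) * split_ratio)
        train.extend(bucket[:cut])
        val.extend(bucket[cut:])
    return train, val


# Toy data: 10 examples per difficulty level (illustrative only)
toy_examples = [{'problem_id': i, 'difficulty': d}
                for d in ('Easy', 'Medium', 'Hard') for i in range(10)]
train_split, val_split = stratified_split(toy_examples)
print(len(train_split), len(val_split))
```

The returned lists can be passed to `CustomDataset` in place of the sliced lists. Very small buckets may end up with no validation examples, so it is worth checking the per-difficulty counts.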

## 7. Comparative Analysis: Human vs Model-Generated Data

### Understanding the Quality Difference

In [None]:
class DataQualityAnalyzer:
    """Analyze quality differences between human and model-generated data"""
    
    def compare_solution_quality(self, human_solutions: List[str], 
                               model_solutions: List[str]) -> Dict:
        """Compare quality metrics between human and model solutions"""
        if not human_solutions or not model_solutions:
            raise ValueError("Both solution lists must be non-empty")
        
        comparison = {
            'human_solutions': self._analyze_solution_set(human_solutions),
            'model_solutions': self._analyze_solution_set(model_solutions),
            'quality_differences': {}
        }
        
        human_metrics = comparison['human_solutions']
        model_metrics = comparison['model_solutions']
        
        def safe_ratio(key: str) -> float:
            # A zero human baseline (e.g. no comments at all) has no finite ratio
            numerator = float(model_metrics[key])
            denominator = float(human_metrics[key])
            if denominator == 0:
                return float('inf') if numerator > 0 else float('nan')
            return numerator / denominator
        
        # Ratios above 1 mean the model solutions score higher on that metric
        comparison['quality_differences'] = {
            'avg_length_ratio': safe_ratio('avg_length'),
            'comment_density_ratio': safe_ratio('comment_density'),
            'reasoning_score_ratio': safe_ratio('reasoning_score'),
            'readability_ratio': safe_ratio('readability_score')
        }
        
        return comparison
    
    def _analyze_solution_set(self, solutions: List[str]) -> Dict:
        """Analyze a set of solutions"""
        if not solutions:
            return {}
            
        metrics = {
            'count': len(solutions),
            'avg_length': np.mean([len(sol) for sol in solutions]),
            'comment_density': np.mean([self._calculate_comment_density(sol) for sol in solutions]),
            'reasoning_score': np.mean([self._calculate_reasoning_score(sol) for sol in solutions]),
            'readability_score': np.mean([self._calculate_readability_score(sol) for sol in solutions]),
            'variable_naming_score': np.mean([self._calculate_variable_naming_score(sol) for sol in solutions])
        }
        
        return metrics
    
    def _calculate_comment_density(self, solution: str) -> float:
        """Calculate density of comments in solution"""
        lines = solution.split('\n')
        comment_lines = [line for line in lines if line.strip().startswith('#')]
        code_lines = [line for line in lines if line.strip() and not line.strip().startswith('#')]
        
        if not code_lines:
            return 0
        
        return len(comment_lines) / len(code_lines)
    
    def _calculate_reasoning_score(self, solution: str) -> float:
        """Calculate reasoning quality score"""
        reasoning_indicators = [
            'step', 'because', 'since', 'therefore', 'thus',
            'algorithm', 'approach', 'strategy', 'intuition',
            'complexity', 'time', 'space', 'optimization'
        ]
        
        solution_lower = solution.lower()
        score = sum(1 for indicator in reasoning_indicators 
                   if indicator in solution_lower)
        
        # Normalize by solution length
        return score / max(len(solution.split()), 1) * 100
    
    def _calculate_readability_score(self, solution: str) -> float:
        """Calculate readability score based on variable names and structure"""
        lines = solution.split('\n')
        code_lines = [line for line in lines if line.strip() and not line.strip().startswith('#')]
        
        if not code_lines:
            return 0
        
        # Count descriptive variable names
        descriptive_vars = 0
        total_vars = 0
        
        for line in code_lines:
            # Simple heuristic: look for assignment operators
            if '=' in line and not line.strip().startswith('def'):
                var_part = line.split('=')[0].strip()
                if var_part and not var_part.startswith(('if', 'for', 'while')):
                    total_vars += 1
                    # Check if variable name is descriptive (length > 2)
                    if len(var_part) > 2 and var_part.isidentifier():
                        descriptive_vars += 1
        
        if total_vars == 0:
            return 50  # Neutral score
        
        return (descriptive_vars / total_vars) * 100
    
    def _calculate_variable_naming_score(self, solution: str) -> float:
        """Score variable naming quality"""
        # Extract assigned names with a regex (`re` is imported at the top).
        # The lookahead skips `==`; keyword arguments such as f(x=1) still match.
        var_pattern = r'\b([a-zA-Z_][a-zA-Z0-9_]*)\s*=(?!=)'
        variables = re.findall(var_pattern, solution)
        
        if not variables:
            return 50  # Neutral score
        
        good_names = 0
        for var in variables:
            # Good naming: longer than two characters, or a conventional loop index (i, j, k)
            if len(var) > 2 or var in ['i', 'j', 'k']:
                good_names += 1
        
        return (good_names / len(variables)) * 100
    
    def create_quality_visualization(self, comparison: Dict):
        """Create visualization of quality comparison"""
        human = comparison['human_solutions']
        model = comparison['model_solutions']
        
        metrics = ['comment_density', 'reasoning_score', 'readability_score', 'variable_naming_score']
        human_values = [human[metric] for metric in metrics]
        model_values = [model[metric] for metric in metrics]
        
        x = np.arange(len(metrics))
        width = 0.35
        
        fig, ax = plt.subplots(figsize=(12, 6))
        
        bars1 = ax.bar(x - width/2, human_values, width, label='Human Solutions', alpha=0.8)
        bars2 = ax.bar(x + width/2, model_values, width, label='Model Solutions', alpha=0.8)
        
        ax.set_xlabel('Quality Metrics')
        ax.set_ylabel('Score')
        ax.set_title('Human vs Model-Generated Solution Quality')
        ax.set_xticks(x)
        ax.set_xticklabels([m.replace('_', ' ').title() for m in metrics])
        ax.legend()
        
        # Add value labels on bars
        def autolabel(bars):
            for bar in bars:
                height = bar.get_height()
                ax.annotate(f'{height:.1f}',
                           xy=(bar.get_x() + bar.get_width() / 2, height),
                           xytext=(0, 3),
                           textcoords="offset points",
                           ha='center', va='bottom')
        
        autolabel(bars1)
        autolabel(bars2)
        
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.show()

# Create sample solutions for comparison
human_solutions = [
    """
class Solution:
    def missingNumber(self, arr):
        n = len(arr)
        d = (arr[-1] - arr[0]) // n
        for i in range(n-1):
            if arr[i+1] - arr[i] != d:
                return arr[i] + d
        return arr[-1] + d
""",
    """
class Solution:
    def missingNumber(self, arr):
        total = (len(arr) + 1) * (arr[0] + arr[-1]) // 2
        return total - sum(arr)
"""
]

model_solutions = [
    """
class Solution:
    def missingNumber(self, arr: List[int]) -> int:
        # Step 1: Understanding the problem
        # We have an arithmetic progression with one missing element
        # Need to find that missing number
        
        n = len(arr)
        
        # Step 2: Calculate the common difference
        # In a complete AP, total difference = (last - first)
        # With n+1 elements, common difference = total_diff / n
        total_difference = arr[-1] - arr[0]
        common_diff = total_difference // n
        
        # Step 3: Find the missing number
        # Check each consecutive pair for the gap
        for i in range(n - 1):
            expected_next = arr[i] + common_diff
            if arr[i + 1] != expected_next:
                return expected_next
        
        # Edge case: missing number is at the end
        return arr[-1] + common_diff
""",
    """
class Solution:
    def missingNumber(self, arr: List[int]) -> int:
        # Mathematical approach using sum formula
        # For arithmetic progression: sum = n * (first + last) / 2
        
        array_length = len(arr)
        first_element = arr[0]
        last_element = arr[-1]
        
        # Calculate expected sum of complete progression
        # We have (n+1) elements in complete sequence
        complete_sequence_length = array_length + 1
        expected_sum = complete_sequence_length * (first_element + last_element) // 2
        
        # Calculate actual sum of given array
        actual_sum = sum(arr)
        
        # The difference gives us the missing number
        missing_number = expected_sum - actual_sum
        
        return missing_number
"""
]

# Analyze quality differences
analyzer = DataQualityAnalyzer()
quality_comparison = analyzer.compare_solution_quality(human_solutions, model_solutions)

print("Quality Comparison Results:")
print(json.dumps(quality_comparison, indent=2))

# Create visualization
analyzer.create_quality_visualization(quality_comparison)

# Key insights
differences = quality_comparison['quality_differences']
print("\nKey Quality Differences:")
print(f"Model solutions are {differences['avg_length_ratio']:.1f}x longer (more detailed)")
print(f"Model solutions have {differences['comment_density_ratio']:.1f}x more comments")
print(f"Model solutions have {differences['reasoning_score_ratio']:.1f}x more reasoning indicators")
print(f"Model solutions have {differences['readability_ratio']:.1f}x better readability scores")
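
The heuristics in `DataQualityAnalyzer` work line by line: any `=` is treated as an assignment, and only full-line `#` comments are counted. As a rough alternative, the standard-library `ast` and `tokenize` modules can extract the same signals from the parsed code. The sketch below assumes each solution is syntactically valid Python; `assigned_names` and `count_comments` are hypothetical helper names, not part of the analyzer above.

```python
import ast
import io
import tokenize
from typing import List


def assigned_names(source: str) -> List[str]:
    """Return every name that is bound by an assignment, loop target, or similar."""
    tree = ast.parse(source)
    return [node.id for node in ast.walk(tree)
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)]


def count_comments(source: str) -> int:
    """Count comment tokens, including trailing comments that follow code."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type == tokenize.COMMENT)


sample = '''
class Solution:
    def missingNumber(self, arr):
        n = len(arr)  # number of elements present
        d = (arr[-1] - arr[0]) // n
        for i in range(n - 1):
            if arr[i + 1] - arr[i] != d:  # comparison, not an assignment
                return arr[i] + d
        return arr[-1] + d
'''

names = sorted(set(assigned_names(sample)))
print(names)
print(count_comments(sample))
```

Function parameters such as `arr` are not reported, because they are `ast.arg` nodes, not stored names. Whether they should count toward a naming score is a design choice.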

## 8. Training Efficiency Analysis

### Quantifying the ~40x Data-Efficiency Gain
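
As a worked example of where a figure of roughly 40x comes from, take the efficiency score used in this notebook: the mean of HumanEval and MBPP pass@1 divided by the dataset size in thousands of rows. This score is the notebook's own metric, not one defined in the paper. The numbers are the Table 4 values quoted above.

$$
E = \frac{(\text{HumanEval} + \text{MBPP})/2}{\text{rows}/1000}
$$

$$
E_{\text{LeetCode (model)}} = \frac{(79.9 + 77.5)/2}{2.6} = \frac{78.7}{2.6} \approx 30.3,
\qquad
E_{\text{Evol-Instruct}} = \frac{(77.4 + 74.1)/2}{111.1} = \frac{75.75}{111.1} \approx 0.68
$$

$$
\frac{E_{\text{LeetCode (model)}}}{E_{\text{Evol-Instruct}}} \approx 44.4,
\qquad
\frac{111.1\text{K rows}}{2.6\text{K rows}} \approx 42.7
$$

Almost all of the gain comes from the roughly 43x reduction in rows. The average benchmark scores themselves differ by only about 4%.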

In [None]:
class EfficiencyAnalyzer:
    """Analyze training efficiency improvements"""
    
    def __init__(self):
        # Paper's benchmark results
        self.benchmark_results = {
            'Magicoder Evol-Instruct-110K': {
                'size': 111100, 'humaneval': 77.4, 'mbpp': 74.1,
                'type': 'large_synthetic'
            },
            'Magicoder OSS-Instruct-75K': {
                'size': 75100, 'humaneval': 73.8, 'mbpp': 76.5,
                'type': 'large_oss'
            },
            'Open-R1 CodeForces-CoT': {
                'size': 9500, 'humaneval': 79.9, 'mbpp': 74.1,
                'type': 'medium_reasoning'
            },
            'OpenThoughts 114k': {
                'size': 19900, 'humaneval': 77.4, 'mbpp': 75.7,
                'type': 'medium_synthetic'
            },
            'LeetCodeDataset (human)': {
                'size': 2600, 'humaneval': 55.5, 'mbpp': 53.4,
                'type': 'small_human'
            },
            'LeetCodeDataset (model)': {
                'size': 2600, 'humaneval': 79.9, 'mbpp': 77.5,
                'type': 'small_model'
            }
        }
    
    def calculate_efficiency_scores(self) -> pd.DataFrame:
        """Calculate comprehensive efficiency scores"""
        data = []
        
        for name, results in self.benchmark_results.items():
            # Calculate efficiency metrics
            avg_performance = (results['humaneval'] + results['mbpp']) / 2
            size_k = results['size'] / 1000  # Size in thousands
            
            # Efficiency: average pass@1 per 1K training examples
            efficiency_score = avg_performance / size_k
            
            data.append({
                'dataset': name,
                'size': results['size'],
                'size_k': size_k,
                'humaneval': results['humaneval'],
                'mbpp': results['mbpp'],
                'avg_performance': avg_performance,
                'efficiency_score': efficiency_score,
                'type': results['type']
            })
        
        return pd.DataFrame(data)
    
    def analyze_improvement_factors(self, df: pd.DataFrame) -> Dict:
        """Analyze improvement factors"""
        leetcode_model = df[df['dataset'] == 'LeetCodeDataset (model)'].iloc[0]
        leetcode_human = df[df['dataset'] == 'LeetCodeDataset (human)'].iloc[0]
        largest_dataset = df.loc[df['size'].idxmax()]
        
        analysis = {
            'model_vs_human_same_size': {
                'humaneval_improvement': leetcode_model['humaneval'] / leetcode_human['humaneval'],
                'mbpp_improvement': leetcode_model['mbpp'] / leetcode_human['mbpp'],
                'avg_improvement': leetcode_model['avg_performance'] / leetcode_human['avg_performance']
            },
            'model_vs_largest': {
                'size_reduction': leetcode_model['size'] / largest_dataset['size'],
                'performance_retention': leetcode_model['avg_performance'] / largest_dataset['avg_performance'],
                'efficiency_gain': leetcode_model['efficiency_score'] / largest_dataset['efficiency_score']
            },
            'overall_efficiency': {
                'best_efficiency': df.loc[df['efficiency_score'].idxmax(), 'dataset'],
                'worst_efficiency': df.loc[df['efficiency_score'].idxmin(), 'dataset'],
                'efficiency_range': df['efficiency_score'].max() / df['efficiency_score'].min()
            }
        }
        
        return analysis
    
    def create_efficiency_dashboard(self, df: pd.DataFrame):
        """Create comprehensive efficiency visualization"""
        fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
        
        # 1. Size vs Performance scatter
        colors = {'large_synthetic': 'red', 'large_oss': 'orange', 
                 'medium_reasoning': 'blue', 'medium_synthetic': 'purple',
                 'small_human': 'gray', 'small_model': 'green'}
        
        for _, row in df.iterrows():
            ax1.scatter(row['size_k'], row['avg_performance'], 
                       c=colors[row['type']], s=200, alpha=0.7,
                       label=row['type'] if row['type'] not in ax1.get_legend_handles_labels()[1] else "")
            # Use the full name: several datasets share the same first word
            ax1.annotate(row['dataset'], 
                        (row['size_k'], row['avg_performance']),
                        xytext=(5, 5), textcoords='offset points', fontsize=8)
        
        ax1.set_xlabel('Dataset Size (thousands)')
        ax1.set_ylabel('Average Performance (%)')
        ax1.set_title('Dataset Size vs Performance')
        ax1.set_xscale('log')
        ax1.legend()
        ax1.grid(True, alpha=0.3)
        
        # 2. Efficiency scores
        df_sorted = df.sort_values('efficiency_score', ascending=True)
        bars = ax2.barh(range(len(df_sorted)), df_sorted['efficiency_score'])
        ax2.set_yticks(range(len(df_sorted)))
        # Full names keep the human and model LeetCodeDataset rows distinguishable
        ax2.set_yticklabels(list(df_sorted['dataset']))
        ax2.set_xlabel('Efficiency Score (Performance / 1K examples)')
        ax2.set_title('Dataset Efficiency Ranking')
        
        # Highlight LeetCodeDataset
        for i, (_, row) in enumerate(df_sorted.iterrows()):
            if 'LeetCodeDataset (model)' in row['dataset']:
                bars[i].set_color('green')
                bars[i].set_alpha(0.8)
        
        # 3. HumanEval vs MBPP
        ax3.scatter(df['humaneval'], df['mbpp'], s=200, alpha=0.7)
        for _, row in df.iterrows():
            # Use the full name: several datasets share the same first word
            ax3.annotate(row['dataset'], 
                        (row['humaneval'], row['mbpp']),
                        xytext=(5, 5), textcoords='offset points', fontsize=8)
        
        ax3.plot([0, 100], [0, 100], 'k--', alpha=0.5, label='Equal Performance')
        ax3.set_xlabel('HumanEval Pass@1 (%)')
        ax3.set_ylabel('MBPP Pass@1 (%)')
        ax3.set_title('HumanEval vs MBPP Performance')
        ax3.legend()
        ax3.grid(True, alpha=0.3)
        
        # 4. Efficiency improvement over size
        df_by_size = df.sort_values('size')
        ax4.plot(df_by_size['size_k'], df_by_size['efficiency_score'], 'o-', linewidth=2, markersize=8)
        
        # Highlight optimal point
        optimal_idx = df['efficiency_score'].idxmax()
        optimal_row = df.loc[optimal_idx]
        ax4.scatter(optimal_row['size_k'], optimal_row['efficiency_score'], 
                   color='red', s=300, marker='*', label='Optimal Efficiency')
        
        ax4.set_xlabel('Dataset Size (thousands)')
        ax4.set_ylabel('Efficiency Score')
        ax4.set_title('Efficiency vs Dataset Size')
        ax4.set_xscale('log')
        ax4.legend()
        ax4.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
    
    def generate_efficiency_report(self, df: pd.DataFrame, analysis: Dict) -> str:
        """Generate comprehensive efficiency report"""
        report = f"""
# LeetCodeDataset Efficiency Analysis Report

## Key Findings

### 1. Model-Generated vs Human-Written Data (Same Size)
- HumanEval improvement: {analysis['model_vs_human_same_size']['humaneval_improvement']:.1f}x
- MBPP improvement: {analysis['model_vs_human_same_size']['mbpp_improvement']:.1f}x
- Average improvement: {analysis['model_vs_human_same_size']['avg_improvement']:.1f}x

### 2. Model-Generated vs Largest Baseline
- Dataset size reduction: {1/analysis['model_vs_largest']['size_reduction']:.0f}x smaller
- Performance retention: {analysis['model_vs_largest']['performance_retention']:.1%}
- Efficiency gain: {analysis['model_vs_largest']['efficiency_gain']:.1f}x

### 3. Overall Efficiency Ranking
- Most efficient: {analysis['overall_efficiency']['best_efficiency']}
- Least efficient: {analysis['overall_efficiency']['worst_efficiency']}
- Efficiency range: {analysis['overall_efficiency']['efficiency_range']:.1f}x difference

## Data Details
"""
        
        # Add detailed table
        report += "\n### Dataset Comparison Table\n\n"
        report += df.to_string(index=False, float_format='%.1f')
        
        return report

# Run efficiency analysis
# A separate name avoids overwriting the DataQualityAnalyzer instance from Section 7
efficiency_analyzer = EfficiencyAnalyzer()
efficiency_df = efficiency_analyzer.calculate_efficiency_scores()
improvement_analysis = efficiency_analyzer.analyze_improvement_factors(efficiency_df)

print("Efficiency Analysis:")
print(json.dumps(improvement_analysis, indent=2))

# Create visualization dashboard
efficiency_analyzer.create_efficiency_dashboard(efficiency_df)

# Generate report
efficiency_report = efficiency_analyzer.generate_efficiency_report(efficiency_df, improvement_analysis)
print(efficiency_report)

## 9. Key Takeaways and Best Practices

### Critical Insights from the Paper:

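1. **Quality > Quantity**: On HumanEval and MBPP, 2.6K model-generated examples (79.9% / 77.5%) match or beat training sets of 9.5K to 111.1K rows
2. **Model-Generated > Human**: On the same 2.6K problems, model-generated responses reach 79.9% on HumanEval versus 55.5% for human-written solutions, plausibly because they spell out their reasoning
3. **Multi-Stage Generation**: High-temperature sampling for diversity → verification against test cases → ground-truth hints for problems that still fail
4. **Efficiency Sweet Spot**: A few thousand verified examples were enough here; treat 2.6K as the size used in the paper, not as a proven optimum, since the size sweep in this notebook is simulated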

### Implementation Best Practices:

1. **Generation Strategy**:
   - Use high temperature (T=1.0) for diversity
   - Generate multiple solution approaches per problem
   - Implement robust verification with comprehensive test cases

2. **Quality Control**:
   - Verify ALL solutions against test cases
   - Use ground-truth hints for persistently failing problems
   - Prioritize solutions with explicit reasoning and clear explanations

3. **Training Configuration**:
   - 3 epochs, learning rate 1e-5
   - Warmup ratio 0.1, cosine scheduling
   - Batch size 32 (adjust for hardware)

4. **Data Efficiency**:
   - Aim for 2-3K high-quality examples rather than 10K+ lower-quality ones
   - Focus on approach diversity within the dataset
   - Monitor efficiency metrics during collection
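
To make the training configuration above concrete, the sketch below computes the learning rate over a run with a 0.1 warmup ratio and cosine decay from a peak of 1e-5. It follows the usual shape of linear warmup followed by half-cosine decay. The helper `lr_at_step` and the step counts are assumptions for illustration (2,600 examples, batch size 32, 3 epochs, no gradient accumulation); actual trainers may round warmup steps differently.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 1e-5,
               warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 2,600 examples, batch size 32, 3 epochs
steps_per_epoch = math.ceil(2600 / 32)
total_steps = steps_per_epoch * 3

# Print the learning rate at the start, end of warmup, midpoint, and final step
for step in (0, int(total_steps * 0.1), total_steps // 2, total_steps):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

With so few optimizer steps, the warmup phase lasts only a couple dozen updates, which is one reason a small peak learning rate is a cautious default for a dataset of this size.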

### Research Implications:

1. **Rethink Data Collection**: Manual curation may be less effective than automated high-quality generation
2. **Model as Teacher**: Use strong models to create training data for weaker models
3. **Domain-Specific Optimization**: Competitive programming benefits from reasoning-rich solutions
4. **Efficiency Metrics**: Track performance-per-example rather than just absolute performance

### Future Directions:

1. **Cross-Domain Application**: Apply these principles to other coding domains
2. **Dynamic Generation**: Adapt generation strategy based on model weaknesses
3. **Multi-Modal Integration**: Include visual/algorithmic reasoning in solutions
4. **Continuous Learning**: Update training data as new problems emerge