# LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs

## Paper Information
- **Title**: LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs
- **Authors**: Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, Xiaolong Xu
- **Link**: https://arxiv.org/abs/2504.14655
- **Date**: April 20, 2025

## Paper Summary
This paper introduces LeetCodeDataset, a high-quality benchmark for evaluating and training code-generation models. The dataset addresses two gaps in code LLM research:
1. **Lack of reasoning-focused coding benchmarks**: it provides problems with rich metadata (difficulty, topic tags, release dates) for fine-grained evaluation
2. **Lack of self-contained training testbeds**: it supports contamination-free evaluation and efficient supervised fine-tuning (SFT) from a single source

Key contributions:
- Curated 2,869 LeetCode Python problems with 100+ test cases per problem
- Temporal splits (pre/post July 2024) for contamination-free evaluation
- Demonstrated that reasoning models significantly outperform non-reasoning counterparts
- Achieved comparable performance with only 2.6K model-generated solutions vs. 110K-sample datasets

## 1. Environment Setup and Installation

In [None]:
# Install required packages
!pip install -q langchain langchain-openai langchain-anthropic langchain-community
!pip install -q langgraph langsmith deepeval
!pip install -q datasets transformers torch
!pip install -q pandas numpy matplotlib seaborn
!pip install -q python-dotenv tqdm

In [None]:
import os
import json
import pandas as pd
import numpy as np
from datetime import datetime
from typing import List, Dict, Optional, Tuple
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

# LangChain imports
# Imported from langchain_core, where these classes are defined
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

# Set up environment
from dotenv import load_dotenv
load_dotenv()

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

## 2. Data Loading and Exploration

The actual LeetCodeDataset is available on Hugging Face. To keep this notebook self-contained, we build a small mock dataset with a similar schema and explore that instead.

In [None]:
# Load dataset from Hugging Face
from typing import Any
from datasets import load_dataset

# Note: The actual dataset is at: https://huggingface.co/datasets/newfacade/LeetCodeDataset
# For demonstration, we'll create a mock dataset structure

class LeetCodeProblem(BaseModel):
    """Schema for a LeetCode problem"""
    slug: str = Field(description="URL identifier and primary key")
    question_id: int = Field(description="Unique sequential number")
    difficulty: str = Field(description="Easy/Medium/Hard")
    problem_description: str = Field(description="Full text with examples and constraints")
    starter_code: str = Field(description="Language template code")
    topic_tags: List[str] = Field(description="Problem tags like Array, Dynamic Programming")
    release_date: str = Field(description="Problem release date")
    # typing.Any, not the builtin function any()
    test_cases: List[Dict[str, Any]] = Field(description="Input-output test cases")

# Create mock dataset for demonstration
def create_mock_leetcode_dataset():
    """Create a mock dataset for demonstration purposes"""
    problems = []
    
    # Example problem from the paper
    problem1 = LeetCodeProblem(
        slug="missing-number-in-arithmetic-progression",
        question_id=1228,
        difficulty="Easy",
        problem_description="""In some array arr, the values were in arithmetic progression: 
the values arr[i + 1] - arr[i] are all equal for every 0 <= i < arr.length - 1.
A value from arr was removed that was not the first or last value in the array.
Given arr, return the removed value.

Example 1:
Input: arr = [5,7,11,13]
Output: 9
Explanation: The previous array was [5,7,9,11,13].

Example 2:
Input: arr = [15,13,12]
Output: 14
Explanation: The previous array was [15,14,13,12].

Constraints:
3 <= arr.length <= 1000
0 <= arr[i] <= 10^5
The given array is guaranteed to be a valid array.""",
        starter_code="""class Solution:
    def missingNumber(self, arr: List[int]) -> int:
        """,
        topic_tags=["Array", "Math"],
        release_date="2019-10-13",
        test_cases=[
            {"input": {"arr": [5, 7, 11, 13]}, "output": 9},
            {"input": {"arr": [15, 13, 12]}, "output": 14},
            {"input": {"arr": [1, 2, 3, 5]}, "output": 4},
        ]
    )
    problems.append(problem1)
    
    # Add more mock problems
    difficulties = ["Easy", "Medium", "Hard"]
    tags = ["Array", "String", "Dynamic Programming", "Binary Search", "Tree", "Graph"]
    
    for i in range(2, 11):  # Create 9 more mock problems (10 in total)
        problems.append(LeetCodeProblem(
            slug=f"problem-{i}",
            question_id=1000 + i,
            difficulty=str(np.random.choice(difficulties)),
            problem_description=f"Mock problem {i} description",
            starter_code=f"class Solution:\n    def solve{i}(self, nums: List[int]) -> int:\n        ",
            topic_tags=[str(t) for t in np.random.choice(tags, size=np.random.randint(1, 4), replace=False)],
            # randint's upper bound is exclusive: months 1-12, days 1-28
            release_date=f"2024-{np.random.randint(1, 13):02d}-{np.random.randint(1, 29):02d}",
            test_cases=[{"input": {"nums": [1, 2, 3]}, "output": i}]
        ))
    
    return problems

# Create mock dataset
mock_problems = create_mock_leetcode_dataset()
print(f"Created {len(mock_problems)} mock problems for demonstration")
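The temporal split described in the summary can be sketched directly on the `release_date` field. This is a minimal illustration, not the authors' code: the exact cutoff day (`2024-07-01`) and the choice to place problems released on the cutoff day in the test set are assumptions, and the slugs `problem-a` and `problem-b` are hypothetical.

```python
from datetime import date
from typing import Dict, List, Tuple

# Assumed cutoff, based on the "pre/post July 2024" split described above.
CUTOFF = date(2024, 7, 1)

def temporal_split(problems: List[Dict], cutoff: date = CUTOFF) -> Tuple[List[Dict], List[Dict]]:
    """Split problems into (train, test) by ISO-formatted release_date."""
    train, test = [], []
    for problem in problems:
        released = date.fromisoformat(problem["release_date"])
        # Problems released on or after the cutoff are held out for evaluation.
        (test if released >= cutoff else train).append(problem)
    return train, test

demo = [
    {"slug": "missing-number-in-arithmetic-progression", "release_date": "2019-10-13"},
    {"slug": "problem-a", "release_date": "2024-06-30"},  # hypothetical
    {"slug": "problem-b", "release_date": "2024-07-01"},  # hypothetical
]
train_set, test_set = temporal_split(demo)
print(f"train={len(train_set)} test={len(test_set)}")
```

Because the split depends only on release dates, a model whose training data predates the cutoff cannot have seen the held-out problems.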

## 3. Algorithm Implementation: LeetCodeDataset Construction Pipeline

We'll implement the key components of the dataset construction pipeline using LangChain and LangGraph for better structure and reasoning capabilities.

In [None]:
# Why use LangChain/LangGraph for this implementation?
# 1. LangChain provides structured prompting and output parsing - essential for generating test cases
# 2. LangGraph enables workflow orchestration for the multi-stage pipeline
# 3. Built-in support for multiple LLMs (GPT-4, Claude, etc.) for response generation
# 4. Robust error handling and retry mechanisms

from langchain.chains import LLMChain
from langgraph.graph import StateGraph, END
from typing import TypedDict

class DatasetConstructionState(TypedDict):
    """State for the dataset construction pipeline"""
    problem: Dict
    metadata: Dict
    canonical_solution: Optional[str]
    entry_point: Optional[str]
    test_inputs: List[Dict]
    test_cases: List[Dict]
    model_generated_solution: Optional[str]
    
# Initialize LLM for test case generation
llm = ChatOpenAI(model="gpt-4o", temperature=0.2)

# Step 1: Metadata Acquisition (simulated)
def acquire_metadata(state: DatasetConstructionState) -> DatasetConstructionState:
    """Acquire problem metadata from LeetCode API"""
    # In real implementation, this would call LeetCode GraphQL API
    # Reference: https://github.com/fspv/python-leetcode
    state["metadata"] = {
        "slug": state["problem"]["slug"],
        "question_id": state["problem"]["question_id"],
        "difficulty": state["problem"]["difficulty"],
        "topic_tags": state["problem"]["topic_tags"],
        "release_date": state["problem"]["release_date"]
    }
    return state

# Step 2: Entry Point Identification
def identify_entry_point(state: DatasetConstructionState) -> DatasetConstructionState:
    """Identify the function entry point from starter code"""
    import re
    starter_code = state["problem"]["starter_code"]
    
    # Extract function name using regex
    match = re.search(r'def\s+(\w+)\s*\(', starter_code)
    if match:
        state["entry_point"] = match.group(1)
    else:
        state["entry_point"] = None
    return state

# Step 3: Input Generation using LangChain
class TestInput(BaseModel):
    """Schema for test input"""
    inputs: List[Dict] = Field(description="List of input dictionaries")

def generate_test_inputs(state: DatasetConstructionState) -> DatasetConstructionState:
    """Generate test inputs using LLM with one-shot prompting"""
    
    # Create parser
    parser = PydanticOutputParser(pydantic_object=TestInput)
    
    # Create prompt template (from Figure 4 in paper).
    # (role, template) tuples are used instead of SystemMessage/HumanMessage objects:
    # message objects are passed through verbatim, so their {placeholders} would never be filled.
    # Literal JSON braces in the one-shot example are escaped as {{ }}.
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are an expert Python programmer. You will be given a question (including a problem specification and starter code). Your task is to generate inputs that are consistent with the problem specification and starter code."),
        ("human", """**** Example ****
#### Question:
Given an array of integers, return the sum.
class Solution:
    def arraySum(self, nums: List[int]) -> int:

#### Some valid inputs of the starter code (json format):
```json
[{{"nums": [1, 2, 3]}}, {{"nums": [4, 5, 6]}}, {{"nums": [-1, 0, 1]}}]
```

**** Now Your Task ****
#### Question:
{problem_description}
{starter_code}

#### Some valid inputs of the starter code (json format):
{format_instructions}""")
    ])
    
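    # Generate inputs (this call needs OPENAI_API_KEY to be set)
    chain = LLMChain(llm=llm, prompt=prompt)
    result = chain.invoke({
        "problem_description": state["problem"]["problem_description"],
        "starter_code": state["problem"]["starter_code"],
        "format_instructions": parser.get_format_instructions()
    })
    
    # A real run would parse the response, e.g. parser.parse(result["text"]).inputs.
    # For demonstration, the response is ignored and mock inputs are used instead.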
    state["test_inputs"] = [
        {"arr": [5, 7, 11, 13]},
        {"arr": [15, 13, 12]},
        {"arr": [1, 2, 3, 5]},
        {"arr": [100, 200, 400]},  # Complex case
        {"arr": [10, 20, 30, 40, 60]},  # Complex case
    ]
    
    return state

# Step 4: Test Case Generation with Sandboxed Execution
def generate_test_cases(state: DatasetConstructionState) -> DatasetConstructionState:
    """Generate test cases by executing canonical solution in sandbox"""
    
    # In real implementation, this would:
    # 1. Set up sandboxed execution environment
    # 2. Execute canonical solution with generated inputs
    # 3. Capture outputs
    
    # For demonstration, create mock test cases
    test_cases = []
    for input_data in state["test_inputs"]:
        # Simulate execution of canonical solution
        if "arr" in input_data:
            arr = input_data["arr"]
            # Mock canonical solution: the original progression had len(arr) + 1
            # terms, so its sum is (len(arr) + 1) * (first + last) / 2
            if len(arr) >= 2:
                expected_sum = (len(arr) + 1) * (arr[0] + arr[-1]) // 2
                actual_sum = sum(arr)
                output = expected_sum - actual_sum
            else:
                output = 0
        else:
            output = 0
            
        test_cases.append({
            "input": input_data,
            "output": output
        })
    
    state["test_cases"] = test_cases
    return state

# Step 5: Model-Generated Solution (for SFT)
def generate_model_solution(state: DatasetConstructionState) -> DatasetConstructionState:
    """Generate high-quality solution using Qwen2.5-Coder-32B-Instruct"""
    
    # In real implementation, this would use Qwen2.5-Coder-32B-Instruct
    # with temperature=1.0 for diverse solutions
    
    # (role, template) tuples so that {problem_description} and {starter_code} are filled in
    solution_prompt = ChatPromptTemplate.from_messages([
        ("system", "You are an expert competitive programmer. Provide a clean, efficient solution with clear reasoning."),
        ("human", """Solve this problem:
{problem_description}

Starter code:
{starter_code}

Provide a complete solution with step-by-step reasoning.""")
    ])
    
    # For demonstration, provide a mock solution
    state["model_generated_solution"] = """class Solution:
    def missingNumber(self, arr: List[int]) -> int:
        # Step 1: Calculate the common difference
        # In an arithmetic progression, the difference should be constant
        n = len(arr)
        total_diff = arr[-1] - arr[0]
        common_diff = total_diff // n
        
        # Step 2: Find the missing number
        # Check each consecutive pair for the anomaly
        for i in range(n - 1):
            expected_next = arr[i] + common_diff
            if arr[i + 1] != expected_next:
                return expected_next
        
        # No gap found: all values are equal (common_diff == 0),
        # so the removed value is the same as every other value
        return arr[0] + common_diff
"""
    
    return state
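Step 4 above is only mocked. The following is a minimal sketch of what executing a canonical solution on generated inputs could look like. It is an illustration under stated assumptions, not the paper's sandbox: the helper name `run_canonical_solution` is hypothetical, a child process provides process isolation and a timeout but no resource or filesystem restrictions, and the solution is assumed not to print to stdout.

```python
import json
import subprocess
import sys
from typing import Any, Dict, List

def run_canonical_solution(solution_code: str, entry_point: str,
                           test_inputs: List[Dict[str, Any]],
                           timeout: float = 10.0) -> List[Dict[str, Any]]:
    """Run solution_code in a child Python process and pair each input with its output."""
    # Inputs arrive as JSON on stdin and results leave as JSON on stdout,
    # so no test data is interpolated into the generated source.
    harness = (
        "import json, sys\n"
        "from typing import *\n"
        + solution_code + "\n"
        "inputs = json.load(sys.stdin)\n"
        f"fn = getattr(Solution(), {entry_point!r})\n"
        "print(json.dumps([fn(**kwargs) for kwargs in inputs]))\n"
    )
    proc = subprocess.run(
        [sys.executable, "-c", harness],
        input=json.dumps(test_inputs),
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    outputs = json.loads(proc.stdout)
    return [{"input": i, "output": o} for i, o in zip(test_inputs, outputs)]

canonical = '''
class Solution:
    def missingNumber(self, arr: List[int]) -> int:
        # Sum of the full progression (one more term than arr) minus the actual sum.
        return (len(arr) + 1) * (arr[0] + arr[-1]) // 2 - sum(arr)
'''
cases = run_canonical_solution(
    canonical, "missingNumber",
    [{"arr": [5, 7, 11, 13]}, {"arr": [15, 13, 12]}],
)
print(cases)
```

Passing data through stdin and stdout as JSON limits this sketch to JSON-serializable inputs and outputs; problems with tree or linked-list arguments would need an extra encoding layer.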

In [None]:
# Build the LangGraph workflow
workflow = StateGraph(DatasetConstructionState)

# Add nodes
workflow.add_node("acquire_metadata", acquire_metadata)
workflow.add_node("identify_entry_point", identify_entry_point)
workflow.add_node("generate_inputs", generate_test_inputs)
workflow.add_node("generate_test_cases", generate_test_cases)
workflow.add_node("generate_solution", generate_model_solution)

# Define edges
workflow.add_edge("acquire_metadata", "identify_entry_point")
workflow.add_edge("identify_entry_point", "generate_inputs")
workflow.add_edge("generate_inputs", "generate_test_cases")
workflow.add_edge("generate_test_cases", "generate_solution")
workflow.add_edge("generate_solution", END)

# Set entry point
workflow.set_entry_point("acquire_metadata")

# Compile the graph
dataset_pipeline = workflow.compile()

# Test the pipeline with a mock problem
test_problem = mock_problems[0]
initial_state = DatasetConstructionState(
    problem=test_problem.dict(),
    metadata={},
    canonical_solution=None,
    entry_point=None,
    test_inputs=[],
    test_cases=[],
    model_generated_solution=None
)

# Run the pipeline
result = dataset_pipeline.invoke(initial_state)
print(f"Pipeline completed. Generated {len(result['test_cases'])} test cases.")
print(f"Entry point identified: {result['entry_point']}")
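The pipeline stores a model-generated solution without checking it. A plausible acceptance step, assuming a candidate is kept for SFT only if it reproduces every expected output, could look like the sketch below. The helper `passes_all_tests` is hypothetical, and the in-process `exec` is suitable only for trusted demo code; untrusted model output should run in an isolated process.

```python
from typing import Any, Dict, List

def passes_all_tests(solution_code: str, entry_point: str,
                     test_cases: List[Dict[str, Any]]) -> bool:
    """Return True only if the solution reproduces every expected output."""
    namespace: Dict[str, Any] = {"List": List}  # the starter code annotates with List
    try:
        exec(solution_code, namespace)  # trusted demo code only
        fn = getattr(namespace["Solution"](), entry_point)
        return all(fn(**tc["input"]) == tc["output"] for tc in test_cases)
    except Exception:
        # Any crash (syntax error, missing method, runtime error) counts as a failure.
        return False

tests = [
    {"input": {"arr": [5, 7, 11, 13]}, "output": 9},
    {"input": {"arr": [15, 13, 12]}, "output": 14},
    {"input": {"arr": [1, 2, 3, 5]}, "output": 4},
]

good_candidate = '''
class Solution:
    def missingNumber(self, arr: List[int]) -> int:
        step = (arr[-1] - arr[0]) // len(arr)
        for prev, cur in zip(arr, arr[1:]):
            if cur - prev != step:
                return prev + step
        return arr[0]
'''

bad_candidate = '''
class Solution:
    def missingNumber(self, arr: List[int]) -> int:
        return arr[0]
'''

accepted = [name for name, code in [("good", good_candidate), ("bad", bad_candidate)]
            if passes_all_tests(code, "missingNumber", tests)]
print(accepted)
```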

## 4. Model Training and Evaluation

We'll implement the evaluation framework and demonstrate how to use deepeval for metrics.

In [None]:
# Why use deepeval for this evaluation?
# 1. Provides comprehensive code evaluation metrics
# 2. Supports custom metrics for pass@k evaluation
# 3. Built-in test case execution and validation
# 4. Integration with LangChain for end-to-end testing

from deepeval import assert_test
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase
import ast
import subprocess
import tempfile

class CodeExecutionMetric(BaseMetric):
    """Custom metric for code execution evaluation"""
    
    def __init__(self, test_cases: List[Dict], timeout: int = 5):
        self.test_cases = test_cases
        self.timeout = timeout
        self.threshold = 1.0  # All test cases must pass
        
    @property
    def __name__(self):
        # deepeval reads the metric's display name from the __name__ property
        return "Code Execution Pass Rate"
    
    def execute_code(self, code: str, test_case: Dict) -> Tuple[bool, str]:
        """Execute code with a test case in a subprocess (process isolation only, not a hardened sandbox)"""
        import sys
        
        path = None
        try:
            # Build the test script; !r embeds inputs and outputs as valid Python literals
            test_code = f"""
from typing import List

{code}

# Test execution
solution = Solution()
result = solution.{test_case['function_name']}(**{test_case['input']!r})
expected = {test_case['output']!r}
assert result == expected, f"Expected {{expected}}, got {{result}}"
print("Test passed!")
"""
            # Write the script to a temporary file and close it before running
            with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
                f.write(test_code)
                path = f.name
                
            # Execute with the same interpreter that runs this notebook
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                text=True,
                timeout=self.timeout
            )
            
            if result.returncode == 0:
                return True, "Test passed"
            return False, result.stderr
                
        except subprocess.TimeoutExpired:
            return False, "Execution timeout"
        except Exception as e:
            return False, str(e)
        finally:
            # Clean up on every path, including timeouts
            if path is not None:
                os.unlink(path)
    
    def measure(self, test_case: LLMTestCase) -> float:
        """Measure the pass rate of generated code"""
        code = test_case.actual_output
        
        passed = 0
        total = len(self.test_cases)
        
        for tc in self.test_cases:
            success, _ = self.execute_code(code, tc)
            if success:
                passed += 1
        
        # Guard against an empty test-case list
        self.score = passed / total if total else 0.0
        self.success = self.score >= self.threshold
        return self.score
    
    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)
    
    def is_successful(self) -> bool:
        return self.success
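Chat models often wrap code in a markdown fence even when asked for code only, and such a response would fail as a Python script before any test runs. A small pre-processing helper, sketched here with the hypothetical name `extract_code`, can strip the fence before the response reaches the metric. The fence string is built programmatically so the example itself stays easy to embed.

```python
import re

FENCE = "`" * 3  # a markdown code fence

# Matches the first fenced block, with an optional "python" or "py" language tag.
_FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:python|py)?[ \t]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL | re.IGNORECASE,
)

def extract_code(response: str) -> str:
    """Return the first fenced code block in response, or the stripped text if there is none."""
    match = _FENCED_BLOCK.search(response)
    return (match.group(1) if match else response).strip()

reply = f"Here is my solution:\n{FENCE}python\nclass Solution:\n    pass\n{FENCE}\nHope it helps."
print(extract_code(reply))
```

Taking only the first block is a simplification: a response with several blocks, or with explanation mixed into the code, would need a stricter prompt or a more careful parser.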

In [None]:
# Implement evaluation pipeline
class LeetCodeEvaluator:
    """Evaluator for LeetCode problems"""
    
    def __init__(self, model_name: str = "gpt-4o"):
        self.llm = ChatOpenAI(model=model_name, temperature=0.2)
        self.results = []
        
    def generate_solution(self, problem: Dict) -> str:
        """Generate solution for a problem"""
        # (role, template) tuples so that {problem_description} and {starter_code} are filled in;
        # SystemMessage/HumanMessage objects would be sent verbatim
        prompt = ChatPromptTemplate.from_messages([
            ("system", "You are an expert competitive programmer. Provide only the code solution without explanation."),
            ("human", """Problem:
{problem_description}

Starter code:
{starter_code}

Complete the solution:""")
        ])
        
        chain = LLMChain(llm=self.llm, prompt=prompt)
        response = chain.invoke({
            "problem_description": problem["problem_description"],
            "starter_code": problem["starter_code"]
        })
        
        return response["text"]
    
    def evaluate_problem(self, problem: Dict) -> Dict:
        """Evaluate a single problem"""
        # Generate solution
        solution = self.generate_solution(problem)
        
        # Create test case for deepeval
        test_case = LLMTestCase(
            input=problem["problem_description"],
            actual_output=solution,
            expected_output=None  # We use test cases for validation
        )
        
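        # Prepare test cases with the function name parsed from the starter code
        import re
        match = re.search(r'def\s+(\w+)\s*\(', problem["starter_code"])
        function_name = match.group(1) if match else None
        enhanced_test_cases = []
        for tc in problem["test_cases"]:
            enhanced_tc = tc.copy()
            enhanced_tc["function_name"] = function_name
            enhanced_test_cases.append(enhanced_tc)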
        
        # Create metric
        metric = CodeExecutionMetric(test_cases=enhanced_test_cases)
        
        # Measure performance
        score = metric.measure(test_case)
        
        return {
            "problem_id": problem["question_id"],
            "difficulty": problem["difficulty"],
            "topic_tags": problem["topic_tags"],
            "pass_rate": score,
            "passed": metric.is_successful()
        }
    
    def evaluate_dataset(self, problems: List[Dict], limit: Optional[int] = None) -> pd.DataFrame:
        """Evaluate multiple problems (all of them when limit is None)"""
        results = []
        
        for problem in tqdm(problems[:limit], desc="Evaluating problems"):
            result = self.evaluate_problem(problem)
            results.append(result)
            
        return pd.DataFrame(results)

# Demonstrate evaluation
evaluator = LeetCodeEvaluator(model_name="gpt-4o")

# Convert mock problems to dict format
problems_dict = [p.dict() for p in mock_problems[:3]]

# Note: In real implementation, this would evaluate actual problems
# results_df = evaluator.evaluate_dataset(problems_dict)
print("Evaluation framework implemented successfully!")
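The metric above reports a per-problem pass rate for a single sample. When several samples are drawn per problem, pass@k is commonly computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021): with `n` samples of which `c` are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether the LeetCodeDataset authors use this exact estimator is an assumption here; for `n = k = 1` it reduces to plain pass/fail.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 correct: chance that at least one of k drawn samples passes.
for k in (1, 5, 10):
    print(k, round(pass_at_k(10, 3, k), 4))
```

The dataset-level score is the mean of this value over all problems.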

## 5. Results Analysis and Visualization

We'll create visualizations similar to those in the paper to analyze model performance.

In [None]:
# Create mock results data based on paper's findings
models_data = {
    "Model": ["GPT-4o-0806", "Claude-3.7-Sonnet", "DeepSeek-V3", "DeepSeek-R1", "Qwen2.5-Max", "QwQ-Plus"],
    "Easy": [81.48, 87.04, 77.78, 94.44, 74.07, 92.59],
    "Medium": [32.76, 54.31, 31.90, 68.97, 25.00, 62.93],
    "Hard": [10.47, 23.26, 13.95, 41.86, 10.47, 24.42],
    "Overall": [35.55, 50.78, 35.55, 65.23, 30.47, 56.25],
    "Type": ["Non-Reasoning", "Non-Reasoning", "Non-Reasoning", "Reasoning", "Non-Reasoning", "Reasoning"]
}

results_df = pd.DataFrame(models_data)

# Visualization 1: Pass rates by difficulty level
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Bar plot for overall performance
colors = ['#FF6B6B' if t == "Non-Reasoning" else '#4ECDC4' for t in results_df['Type']]
bars = ax1.bar(results_df['Model'], results_df['Overall'], color=colors)
ax1.set_xlabel('Model')
ax1.set_ylabel('Pass@1 Rate (%)')
ax1.set_title('Overall Model Performance on LeetCodeDataset')
ax1.set_xticklabels(results_df['Model'], rotation=45, ha='right')

# Add value labels on bars
for bar, value in zip(bars, results_df['Overall']):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 0.5,
             f'{value:.1f}%', ha='center', va='bottom')

# Create legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor='#4ECDC4', label='Reasoning Model'),
                   Patch(facecolor='#FF6B6B', label='Non-Reasoning Model')]
ax1.legend(handles=legend_elements, loc='upper right')

# Grouped bar plot for difficulty levels
difficulty_data = results_df[['Model', 'Easy', 'Medium', 'Hard']].set_index('Model')
difficulty_data.plot(kind='bar', ax=ax2, width=0.8)
ax2.set_xlabel('Model')
ax2.set_ylabel('Pass@1 Rate (%)')
ax2.set_title('Model Performance by Difficulty Level')
ax2.set_xticklabels(results_df['Model'], rotation=45, ha='right')
ax2.legend(title='Difficulty')

plt.tight_layout()
plt.show()
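The gap between the two model groups in the chart can also be stated as a number. The sketch below averages the overall pass@1 values from the table above by model type. With only two reasoning models in the table, the averages are a rough summary rather than a statistical claim.

```python
from statistics import mean

# Overall pass@1 (%) and model type, copied from the results table above.
overall = {
    "GPT-4o-0806": (35.55, "Non-Reasoning"),
    "Claude-3.7-Sonnet": (50.78, "Non-Reasoning"),
    "DeepSeek-V3": (35.55, "Non-Reasoning"),
    "DeepSeek-R1": (65.23, "Reasoning"),
    "Qwen2.5-Max": (30.47, "Non-Reasoning"),
    "QwQ-Plus": (56.25, "Reasoning"),
}

by_type = {}
for score, kind in overall.values():
    by_type.setdefault(kind, []).append(score)

averages = {kind: mean(scores) for kind, scores in by_type.items()}
gap = averages["Reasoning"] - averages["Non-Reasoning"]
print(f"Reasoning: {averages['Reasoning']:.2f}%  "
      f"Non-Reasoning: {averages['Non-Reasoning']:.2f}%  gap: {gap:.2f} points")
```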

In [None]:
# Visualization 2: Topic-wise performance heatmap
# Create mock topic performance data
topics = ['Array', 'String', 'Dynamic Programming', 'Binary Search', 'Tree', 'Graph']
models = ['GPT-4o', 'DeepSeek-R1', 'Claude-3.7']

# Mock data based on paper's findings
topic_performance = np.array([
    [32.1, 67.9, 51.2],  # Array
    [37.3, 68.7, 49.3],  # String
    [10.5, 70.2, 31.6],  # Dynamic Programming
    [7.7, 73.1, 30.8],   # Binary Search
    [27.3, 72.7, 9.1],   # Tree
    [40.0, 66.7, 53.3]   # Graph
])

plt.figure(figsize=(10, 8))
sns.heatmap(topic_performance, 
            xticklabels=models, 
            yticklabels=topics,
            annot=True, 
            fmt='.1f',
            cmap='YlOrRd',
            cbar_kws={'label': 'Pass Rate (%)'})
plt.title('Model Performance Across Different Topic Tags')
plt.xlabel('Model')
plt.ylabel('Topic Tag')
plt.tight_layout()
plt.show()

In [None]:
# Visualization 3: Temporal analysis
# Create mock monthly performance data
months = ['2024-07', '2024-08', '2024-09', '2024-10', '2024-11', '2024-12', '2025-01', '2025-02']
model_monthly_performance = {
    'GPT-4o-0806': [38.5, 42.1, 35.2, 33.8, 31.5, 36.9, 35.1, 34.3],
    'DeepSeek-R1': [68.2, 70.5, 63.8, 65.1, 62.9, 67.4, 64.2, 63.7],
    'Claude-3.7-Sonnet': [52.3, 55.1, 48.9, 50.2, 49.5, 52.8, 50.1, 49.6]
}

plt.figure(figsize=(12, 6))
for model, performance in model_monthly_performance.items():
    plt.plot(months, performance, marker='o', label=model, linewidth=2)

plt.xlabel('LeetCode Problem Release Month')
plt.ylabel('Pass@1 Rate (%)')
plt.title('Monthly Pass Rates of Various Models on LeetCodeDataset')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
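The monthly series above is mock data. With real per-problem results, the same curve can be derived by grouping pass/fail outcomes on the release month. The sketch below assumes each result record carries an ISO `release_date` and a boolean `passed` field; the records shown are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

def monthly_pass_rate(results: List[Dict]) -> Dict[str, float]:
    """Map each release month (YYYY-MM) to its pass rate in percent."""
    totals = defaultdict(lambda: [0, 0])  # month -> [passed, attempted]
    for record in results:
        month = record["release_date"][:7]  # "2024-07-03" -> "2024-07"
        totals[month][0] += int(record["passed"])
        totals[month][1] += 1
    return {m: 100.0 * p / t for m, (p, t) in sorted(totals.items())}

# Hypothetical per-problem outcomes for one model.
records = [
    {"release_date": "2024-07-03", "passed": True},
    {"release_date": "2024-07-21", "passed": False},
    {"release_date": "2024-08-10", "passed": True},
    {"release_date": "2024-08-11", "passed": True},
]
rates = monthly_pass_rate(records)
print(rates)
```

A sharp drop in pass rate for months after a model's training cutoff would suggest that earlier problems were seen during training.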

## 6. Efficient SFT Training Analysis

Analyze the efficiency of training with model-generated data vs. human-written solutions.

In [None]:
# Training efficiency comparison
training_data = {
    'Dataset': ['Magicoder Evol-Instruct-110K', 'Magicoder OSS-Instruct-75K', 
                'Open-R1 CodeForces-CoT', 'OpenThoughts 114k',
                'LeetCodeDataset (human)', 'LeetCodeDataset (model)'],
    'Rows': [111100, 75100, 9500, 19900, 2600, 2600],
    'HumanEval': [77.4, 73.8, 79.9, 77.4, 55.5, 79.9],
    'MBPP': [74.1, 76.5, 74.1, 75.7, 53.4, 77.5],
    'Type': ['Large', 'Large', 'Medium', 'Medium', 'Small', 'Small']
}

sft_df = pd.DataFrame(training_data)

# Create efficiency plot
fig, ax = plt.subplots(figsize=(12, 8))

# Create bubble chart
colors = {'Large': '#FF6B6B', 'Medium': '#4ECDC4', 'Small': '#95E1D3'}
for _, row in sft_df.iterrows():
    ax.scatter(row['Rows'], row['HumanEval'], 
               s=row['MBPP']*10,  # Size based on MBPP score
               alpha=0.6,
               color=colors[row['Type']],
               edgecolors='black',
               linewidth=2)
    
    # Add labels
    ax.annotate(row['Dataset'],  # Full name: the first word alone is ambiguous
                (row['Rows'], row['HumanEval']),
                xytext=(5, 5), 
                textcoords='offset points',
                fontsize=9)

ax.set_xlabel('Training Dataset Size (rows)')
ax.set_ylabel('HumanEval Pass@1 (%)')
ax.set_title('Training Efficiency: Dataset Size vs. Performance')
ax.set_xscale('log')
ax.grid(True, alpha=0.3)

# Add reference lines
ax.axvline(x=2600, color='green', linestyle='--', alpha=0.5, label='LeetCodeDataset Size')
ax.axhline(y=79.9, color='green', linestyle='--', alpha=0.5, label='LeetCodeDataset Performance')

# Create legend
from matplotlib.patches import Patch
# Ranges chosen to match the 'Type' column above (9.5K and 19.9K rows are 'Medium')
legend_elements = [Patch(facecolor=colors['Large'], label='Large Dataset (>75K)'),
                   Patch(facecolor=colors['Medium'], label='Medium Dataset (5K-75K)'),
                   Patch(facecolor=colors['Small'], label='Small Dataset (<5K)')]
ax.legend(handles=legend_elements, loc='lower right')

plt.tight_layout()
plt.show()

# Key insight
print("\nKey Insight:")
print("LeetCodeDataset with only 2.6K model-generated samples achieves 79.9% on HumanEval,")
print("matching or exceeding datasets up to ~40x larger (e.g. 77.4% for the 111K-row Evol-Instruct set).")
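The size comparison can be worked out from the table itself. The sketch below computes, for each baseline, how many times larger it is than the 2.6K-row model-generated set and how its HumanEval score differs. All figures are copied from the table above.

```python
# (rows, HumanEval pass@1 %) copied from the comparison table above.
baselines = {
    "Magicoder Evol-Instruct-110K": (111_100, 77.4),
    "Magicoder OSS-Instruct-75K": (75_100, 73.8),
    "Open-R1 CodeForces-CoT": (9_500, 79.9),
    "OpenThoughts 114k": (19_900, 77.4),
}
ours_rows, ours_score = 2_600, 79.9  # LeetCodeDataset (model)

# For each baseline: (size ratio, HumanEval advantage of LeetCodeDataset in points).
comparison = {
    name: (rows / ours_rows, ours_score - score)
    for name, (rows, score) in baselines.items()
}
for name, (ratio, delta) in comparison.items():
    print(f"{name}: {ratio:.1f}x larger, HumanEval delta {delta:+.1f}")
```

The largest baseline is about 42.7 times bigger yet scores 2.5 points lower on HumanEval, while the 9.5K-row CodeForces-CoT set ties at 79.9%.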

## 7. Research Template: Extending LeetCodeDataset

Template for researchers to build upon this work.

In [None]:
# Research Extension Template
class ResearchExtension:
    """Template for extending LeetCodeDataset research"""
    
    def __init__(self, research_focus: str):
        self.research_focus = research_focus
        self.experiments = []
        
    def define_hypothesis(self, hypothesis: str):
        """Define your research hypothesis"""
        self.hypothesis = hypothesis
        print(f"Research Hypothesis: {hypothesis}")
        
    def design_experiment(self, name: str, description: str, metrics: List[str]):
        """Design an experiment"""
        experiment = {
            "name": name,
            "description": description,
            "metrics": metrics,
            "status": "designed"
        }
        self.experiments.append(experiment)
        return experiment
    
    def implement_metric(self, metric_name: str):
        """Template for implementing custom metrics"""
        # Example: Complexity analysis metric
        if metric_name == "time_complexity":
            return """
class TimeComplexityMetric(BaseMetric):
    def measure(self, code: str, test_cases: List[Dict]) -> Dict:
        # 1. Parse code to AST
        # 2. Analyze loops and recursive calls
        # 3. Run with increasing input sizes
        # 4. Measure execution time
        # 5. Fit complexity curve
        pass
"""
        
    def suggest_improvements(self):
        """Suggest dataset improvements based on research focus"""
        suggestions = {
            "complexity_analysis": [
                "Add time/space complexity ground truth",
                "Include performance test cases",
                "Create complexity-aware evaluation metrics"
            ],
            "multi_language": [
                "Extend to Java, C++, JavaScript",
                "Create language-specific test harnesses",
                "Study cross-language performance"
            ],
            "reasoning_analysis": [
                "Capture intermediate reasoning steps",
                "Create reasoning complexity taxonomy",
                "Develop reasoning-aware metrics"
            ],
            "robustness": [
                "Add adversarial test cases",
                "Include edge case detection",
                "Develop mutation testing framework"
            ]
        }
        
        return suggestions.get(self.research_focus, [])

# Example research directions
print("Potential Research Directions:\n")

# Direction 1: Complexity Analysis
research1 = ResearchExtension("complexity_analysis")
research1.define_hypothesis("Models that generate efficient solutions (O(n)) perform better on larger test cases")
research1.design_experiment(
    "complexity_correlation",
    "Analyze correlation between solution complexity and model architecture",
    ["time_complexity", "space_complexity", "pass_rate_large_inputs"]
)

# Direction 2: Reasoning Enhancement
research2 = ResearchExtension("reasoning_analysis")
research2.define_hypothesis("Explicit reasoning steps in training data improve performance on hard problems")
research2.design_experiment(
    "reasoning_ablation",
    "Compare models trained with/without reasoning traces",
    ["hard_problem_pass_rate", "reasoning_step_quality", "solution_correctness"]
)

# Show improvement suggestions
print("\nSuggested improvements for complexity analysis research:")
for suggestion in research1.suggest_improvements():
    print(f"- {suggestion}")
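Steps 3 to 5 of the `TimeComplexityMetric` template (run at increasing sizes, measure time, fit a curve) can be approximated with a log-log regression: if runtime grows like n^p, the slope of log(time) against log(n) is p. The sketch below covers only the fitting step and uses synthetic timings. The helper name `fit_exponent` is hypothetical, and real measurements would be noisy and need repeated runs.

```python
import math
from typing import Sequence

def fit_exponent(sizes: Sequence[int], times: Sequence[float]) -> float:
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

# Synthetic timings (seconds) standing in for measured runs.
sizes = [1_000, 2_000, 4_000, 8_000]
linear_times = [2e-6 * n for n in sizes]
quadratic_times = [3e-9 * n * n for n in sizes]

print(round(fit_exponent(sizes, linear_times), 3))
print(round(fit_exponent(sizes, quadratic_times), 3))
```

A slope near 1 suggests linear behavior and a slope near 2 quadratic. Logarithmic factors such as n log n show up as slopes slightly above 1 and are hard to separate from noise with this method.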

## 8. Conclusion and Key Takeaways

### Summary of LeetCodeDataset Contributions:

1. **Comprehensive Coverage**: 2,869 Python problems (>90% of the Python problems on LeetCode)
2. **Temporal Split**: Clean separation for contamination-free evaluation
3. **Rich Metadata**: Difficulty levels, topic tags, release dates
4. **Robust Testing**: 100+ test cases per problem
5. **Exceptional Efficiency**: 2.6K samples match 110K dataset performance

### Key Findings:

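1. **Reasoning Models Dominate**: DeepSeek-R1 (65.23% pass@1) leads all evaluated models; the best non-reasoning model, Claude-3.7-Sonnet, reaches 50.78%
2. **Model-Generated > Human Data**: SFT on model-generated solutions reaches 79.9% on HumanEval, versus 55.5% for human-written solutions to the same problems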
3. **Topic-Specific Strengths**: Large performance gaps in DP, Binary Search, Tree problems
4. **Data Quality > Quantity**: Careful curation beats large-scale collection

### Implementation Notes:

- **Why LangChain/LangGraph**: Structured prompting, workflow orchestration, multi-LLM support
- **Why deepeval**: Comprehensive code evaluation metrics, custom metric support
- **Future Work**: Complexity analysis, multi-language support, reasoning enhancement