# Focused Learning: AICoderEval Benchmark Design & Data Curation

## Learning Objectives

This notebook takes a close look at how the AICoderEval benchmark is designed and built, one of the paper's most important contributions. We will:

1. Understand how data is collected from HuggingFace Hub and PyTorch Hub and then processed
2. Learn how to design the structure of a benchmark for AI code generation
3. Implement a data curation workflow with filtering and validation
4. Build test cases modeled on HumanEval

## Excerpts from the Paper

**Section 2: Benchmark Construction** (pages 3-5)
- "We leverage the power of GPT-4 to process data collected from the web and format it into a structured form"
- "Each generated file is meticulously structured to encompass a comprehensive suite of components necessary for robust testing"
- Figure 2 shows the CoderGen architecture, including the data generation pipeline

## 1. Theory: Designing a Benchmark for AI Code Generation

### 1.1. Challenges in Evaluating AI Code Generation

$$\text{Evaluation Challenge} = \text{Correctness} + \text{Library Usage} + \text{Real-world Applicability}$$

The paper addresses the following challenges:
- **Domain Specificity**: Code must use the correct libraries (HuggingFace, PyTorch, TensorFlow)
- **Executable Validation**: Code must run and pass its test cases
- **Diversity**: Tasks cover many domains (NLP, CV, Audio, etc.)

### 1.2. Structure of a Benchmark Entry

Each entry in AICoderEval consists of:
1. **Task Description**: A clear statement of the requirement
2. **Complete Code File**: Imports, implementation, and tests
3. **Test Cases**: Three levels: normal, edge case, and correctness
4. **Metadata**: Domain, model info, performance metrics
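
As a small illustration of the three test levels listed above, the sketch below checks whether a set of test cases covers all of them. `REQUIRED_TEST_TYPES` and `missing_test_types` are hypothetical helper names, and the level labels are the strings used in this notebook, not identifiers taken from the paper.

```python
from typing import Iterable, Set

# Assumption: the three levels are labelled with these strings, matching
# the labels used in this notebook rather than anything fixed by the paper.
REQUIRED_TEST_TYPES = {"normal", "edge_case", "correctness"}


def missing_test_types(test_types: Iterable[str]) -> Set[str]:
    """Return the required test levels that are absent from test_types."""
    return REQUIRED_TEST_TYPES - set(test_types)


print(missing_test_types(["normal", "correctness"]))  # {'edge_case'}
```

An empty result means the entry has at least one test case at every level.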

In [None]:
# Import libraries
import json
import ast
import subprocess
import tempfile
import os
from typing import Dict, List, Tuple, Optional, Any
from dataclasses import dataclass, field
from enum import Enum
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

## 2. Implementation: Benchmark Data Structure

In [None]:
class TaskDomain(Enum):
    """Domains covered in AICoderEval"""
    NLP = "Natural Language Processing"
    CV = "Computer Vision"
    TABULAR = "Tabular Data"
    AUDIO = "Audio and Speech"
    MULTIMODAL = "Multimodal"
    RL = "Reinforcement Learning"
    CLASSIFICATION = "Classification"

@dataclass
class TestCase:
    """Structure for a single test case"""
    name: str
    test_type: str  # "normal", "edge_case", "correctness"
    input_data: Any
    expected_output: Any
    description: str

@dataclass
class BenchmarkEntry:
    """Complete structure for a benchmark entry following paper design"""
    # Basic information
    task_id: str
    domain: TaskDomain
    task_description: str
    
    # Model information
    model_name: str
    model_description: str
    libraries_used: List[str]
    
    # Code components
    imports: List[str]
    function_signature: str
    function_docstring: str
    implementation: str
    
    # Test cases
    test_cases: List[TestCase]
    
    # Metadata
    created_date: str = field(default_factory=lambda: datetime.now().isoformat())
    validation_status: str = "pending"
    execution_time: Optional[float] = None
    
    def to_code_file(self) -> str:
        """Generate complete Python file following paper's format"""
        code_parts = []
        
        # Package installation comments
        code_parts.append("# Package installation")
        for lib in self.libraries_used:
            code_parts.append(f"# pip install {lib}")
        code_parts.append("")
        
        # Imports
        code_parts.append("# Imports")
        code_parts.extend(self.imports)
        code_parts.append("")
        
        # Main function
        code_parts.append("# Main function")
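        # Signatures are stored without a trailing colon; add exactly one so the file parses
        code_parts.append(self.function_signature.rstrip().rstrip(":") + ":")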
        code_parts.append(f'    """{self.function_docstring}"""')
        code_parts.append(self.implementation)
        code_parts.append("")
        
        # Test functions
        code_parts.append("# Test functions")
        code_parts.append(self._generate_test_functions())
        
        # Main execution
        code_parts.append("")
        code_parts.append("if __name__ == '__main__':")
        code_parts.append("    test_main()")
        
        return "\n".join(code_parts)
    
    def _generate_test_functions(self) -> str:
        """Generate test functions following paper's format"""
        test_code = []
        
        test_code.append("def test_main():")
        test_code.append('    print("Testing started.")')
        test_code.append("    ")
        
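        total = len(self.test_cases)
        for i, test_case in enumerate(self.test_cases, 1):
            call = self._generate_test_call(test_case)
            test_code.append(f'    # Test case {i}: {test_case.test_type}')
            test_code.append(f'    print("Testing case [{i}/{total}] started")')
            test_code.append("    try:")
            if test_case.test_type == "edge_case" and isinstance(test_case.expected_output, str):
                # Edge cases store the expected exception's class name, so the
                # generated test succeeds only if that exception is raised
                test_code.append(f"        {call}")
                test_code.append(f'        print("Test case [{i}/{total}] failed: no exception raised")')
                test_code.append(f"    except {test_case.expected_output}:")
                test_code.append(f'        print("Test case [{i}/{total}] succeeded")')
            else:
                test_code.append(f"        result = {call}")
                test_code.append(f"        assert result == {repr(test_case.expected_output)}")
                test_code.append(f'        print("Test case [{i}/{total}] succeeded")')
            test_code.append("    except Exception as e:")
            test_code.append(f'        print(f"Test case [{i}/{total}] failed: {{e}}")')
            test_code.append("    ")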
        
        test_code.append('    print("Testing finished.")')
        
        return "\n".join(test_code)
    
    def _generate_test_call(self, test_case: TestCase) -> str:
        """Generate function call for test"""
        func_name = self.function_signature.split("(")[0].replace("def ", "")
        return f"{func_name}({repr(test_case.input_data)})"
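
The `split("(")` approach in `_generate_test_call` works for the simple signatures used in this notebook. As a hedged alternative, the sketch below extracts the name by parsing the signature with the standard `ast` module, which also rejects text that is not a function definition. `function_name_from_signature` is a hypothetical helper, not part of the paper's tooling.

```python
import ast


def function_name_from_signature(signature: str) -> str:
    """Extract the function name from a one-line `def` signature."""
    # Normalise the trailing colon, then add a stub body so the text parses.
    source = signature.strip().rstrip(":") + ":\n    pass\n"
    node = ast.parse(source).body[0]
    if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
        raise ValueError(f"Not a function signature: {signature!r}")
    return node.name


sig = "def classify_text(text: str, model_name: str = 'bert-base-uncased') -> Dict[str, float]"
print(function_name_from_signature(sig))  # classify_text
```

Annotations such as `Dict[str, float]` are only parsed, never evaluated, so the helper needs no imports from `typing`. Text that is not valid Python at all raises `SyntaxError` instead of `ValueError`.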

## 3. Data Collection Simulation

In [None]:
class DataCollector:
    """Simulates data collection from HuggingFace Hub and PyTorch Hub"""
    
    def __init__(self):
        # Mock data representing scraped model information
        self.mock_model_data = [
            {
                "source": "huggingface",
                "model_name": "bert-base-uncased",
                "domain": "NLP",
                "task": "text-classification",
                "description": "BERT model for text classification",
                "example_code": "from transformers import pipeline\nclassifier = pipeline('text-classification')"
            },
            {
                "source": "pytorch",
                "model_name": "resnet50",
                "domain": "CV",
                "task": "image-classification",
                "description": "ResNet50 for image classification",
                "example_code": "import torchvision.models as models\nmodel = models.resnet50(pretrained=True)"
            },
            {
                "source": "huggingface",
                "model_name": "wav2vec2-base",
                "domain": "Audio",
                "task": "speech-recognition",
                "description": "Wav2Vec2 for automatic speech recognition",
                "example_code": "from transformers import pipeline\nasr = pipeline('automatic-speech-recognition')"
            }
        ]
    
    def collect_raw_data(self) -> List[Dict[str, Any]]:
        """Simulate collecting raw data from web sources"""
        print("Collecting data from HuggingFace Hub and PyTorch Hub...")
        return self.mock_model_data
    
    def preprocess_data(self, raw_data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Preprocess and structure the collected data"""
        processed_data = []
        
        for item in raw_data:
            processed_item = {
                "domain": self._map_domain(item["domain"]),
                "model_name": item["model_name"],
                "model_description": item["description"],
                "task_type": item["task"],
                "source_library": item["source"],
                "example_code": item["example_code"],
                "metadata": {
                    "collected_date": datetime.now().isoformat(),
                    "source": item["source"]
                }
            }
            processed_data.append(processed_item)
        
        return processed_data
    
    def _map_domain(self, domain_str: str) -> TaskDomain:
        """Map string domain to TaskDomain enum"""
        mapping = {
            "NLP": TaskDomain.NLP,
            "CV": TaskDomain.CV,
            "Audio": TaskDomain.AUDIO,
            "Tabular": TaskDomain.TABULAR,
            "Multimodal": TaskDomain.MULTIMODAL,
            "RL": TaskDomain.RL
        }
        return mapping.get(domain_str, TaskDomain.CLASSIFICATION)
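
`preprocess_data` indexes each record directly, so a scraped record with a missing field would raise `KeyError`. A minimal sketch of a pre-filter, assuming the six fields read above are all mandatory; `REQUIRED_FIELDS` and `split_complete_records` are hypothetical names introduced for illustration.

```python
from typing import Any, Dict, List, Tuple

# Assumption: these are the fields preprocess_data reads from each record.
REQUIRED_FIELDS = ("source", "model_name", "domain", "task", "description", "example_code")


def split_complete_records(
    raw_data: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Separate records whose required fields are non-empty strings from the rest."""
    complete, incomplete = [], []
    for record in raw_data:
        has_all = all(
            isinstance(record.get(name), str) and record[name].strip()
            for name in REQUIRED_FIELDS
        )
        (complete if has_all else incomplete).append(record)
    return complete, incomplete


records = [
    {"source": "huggingface", "model_name": "bert-base-uncased", "domain": "NLP",
     "task": "text-classification", "description": "BERT model",
     "example_code": "from transformers import pipeline"},
    {"source": "pytorch", "model_name": "resnet50", "domain": "CV"},  # missing fields
]
good, bad = split_complete_records(records)
print(len(good), len(bad))  # 1 1
```

Keeping the rejected records, rather than silently dropping them, makes it possible to report how much raw data was lost at this stage.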

## 4. Benchmark Generation with a GPT-4 Simulation

In [None]:
class BenchmarkGenerator:
    """Generate benchmark entries following paper's approach"""
    
    def __init__(self):
        self.entry_counter = 0
    
    def generate_from_model_info(self, model_info: Dict[str, Any]) -> BenchmarkEntry:
        """Generate a complete benchmark entry from model information"""
        self.entry_counter += 1
        
        # Generate task description
        task_description = self._generate_task_description(model_info)
        
        # Generate function components
        func_components = self._generate_function_components(model_info)
        
        # Generate test cases following paper's 3-level approach
        test_cases = self._generate_test_cases(model_info)
        
        # Create benchmark entry
        entry = BenchmarkEntry(
            task_id=f"task_{self.entry_counter:04d}",
            domain=model_info["domain"],
            task_description=task_description,
            model_name=model_info["model_name"],
            model_description=model_info["model_description"],
            libraries_used=self._extract_libraries(model_info),
            imports=func_components["imports"],
            function_signature=func_components["signature"],
            function_docstring=func_components["docstring"],
            implementation=func_components["implementation"],
            test_cases=test_cases
        )
        
        return entry
    
    def _generate_task_description(self, model_info: Dict[str, Any]) -> str:
        """Generate one-sentence task description"""
        templates = {
            "text-classification": "Create a function that classifies text into categories using {model_name}.",
            "image-classification": "Implement image classification using {model_name} to identify objects in images.",
            "speech-recognition": "Build a speech recognition function using {model_name} to transcribe audio."
        }
        
        template = templates.get(model_info["task_type"], 
                                "Implement a function using {model_name} for {task_type}.")
        return template.format(
            model_name=model_info["model_name"],
            task_type=model_info["task_type"]
        )
    
    def _generate_function_components(self, model_info: Dict[str, Any]) -> Dict[str, Any]:
        """Generate function signature, docstring, and implementation"""
        
        # Map task type to function details
        function_templates = {
            "text-classification": {
                "signature": "def classify_text(text: str, model_name: str = '{model_name}') -> Dict[str, float]",
                "docstring": """Classify text using {model_name}.
                
                Args:
                    text: Input text to classify
                    model_name: Name of the model to use
                    
                Returns:
                    Dict containing label and confidence score
                    
                Raises:
                    ValueError: If text is empty
                    RuntimeError: If model loading fails
                """,
                "imports": [
                    "from transformers import pipeline",
                    "from typing import Dict",
                    "import warnings",
                    "warnings.filterwarnings('ignore')"
                ],
                "implementation": """    if not text or not text.strip():
        raise ValueError("Text cannot be empty")
    
    try:
        classifier = pipeline('text-classification', model=model_name)
        results = classifier(text)
        return {results[0]['label']: results[0]['score']}
    except Exception as e:
        raise RuntimeError(f"Failed to classify text: {e}")"""
            },
            "image-classification": {
                "signature": "def classify_image(image_path: str) -> Dict[str, float]",
                "docstring": """Classify image using ResNet50.
                
                Args:
                    image_path: Path to the image file
                    
                Returns:
                    Dict containing top prediction and confidence
                    
                Raises:
                    FileNotFoundError: If image file not found
                    ValueError: If image format is invalid
                """,
                "imports": [
                    "import torch",
                    "import torchvision.models as models",
                    "import torchvision.transforms as transforms",
                    "from PIL import Image",
                    "from typing import Dict",
                    "import os"
                ],
                "implementation": """    if not os.path.exists(image_path):
        raise FileNotFoundError(f"Image not found: {image_path}")
    
    try:
        # Load and preprocess image
        image = Image.open(image_path).convert('RGB')
        preprocess = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])
        
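        # Load model and classify (weights= replaces the deprecated pretrained=True in torchvision >= 0.13)
        model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)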
        model.eval()
        
        input_tensor = preprocess(image).unsqueeze(0)
        with torch.no_grad():
            output = model(input_tensor)
            
        # Get top prediction
        probabilities = torch.nn.functional.softmax(output[0], dim=0)
        top_prob, top_class = torch.topk(probabilities, 1)
        
        return {f"class_{top_class.item()}": top_prob.item()}
    except Exception as e:
        raise ValueError(f"Failed to process image: {e}")"""
            },
            "speech-recognition": {
                "signature": "def transcribe_audio(audio_path: str) -> str",
                "docstring": """Transcribe audio using Wav2Vec2.
                
                Args:
                    audio_path: Path to audio file
                    
                Returns:
                    Transcribed text
                    
                Raises:
                    FileNotFoundError: If audio file not found
                    RuntimeError: If transcription fails
                """,
                "imports": [
                    "from transformers import pipeline",
                    "import os"
                ],
                "implementation": """    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")
    
    try:
        asr = pipeline('automatic-speech-recognition', model='facebook/wav2vec2-base-960h')
        result = asr(audio_path)
        return result['text']
    except Exception as e:
        raise RuntimeError(f"Transcription failed: {e}")"""
            }
        }
        
        template = function_templates.get(model_info["task_type"], function_templates["text-classification"])
        
        return {
            "signature": template["signature"].format(model_name=model_info["model_name"]),
            "docstring": template["docstring"].format(model_name=model_info["model_name"]),
            "imports": template["imports"],
            "implementation": template["implementation"]
        }
    
    def _generate_test_cases(self, model_info: Dict[str, Any]) -> List[TestCase]:
        """Generate 3 test cases: normal, edge case, correctness"""
        
        test_templates = {
            "text-classification": [
                TestCase(
                    name="test_normal_execution",
                    test_type="normal",
                    input_data="This is a great product!",
                    expected_output={"POSITIVE": 0.95},
                    description="Test normal text classification"
                ),
                TestCase(
                    name="test_edge_case",
                    test_type="edge_case",
                    input_data="",
                    expected_output="ValueError",
                    description="Test empty input handling"
                ),
                TestCase(
                    name="test_correctness",
                    test_type="correctness",
                    input_data="The movie was terrible and boring.",
                    expected_output={"NEGATIVE": 0.92},
                    description="Test correct sentiment detection"
                )
            ],
            "image-classification": [
                TestCase(
                    name="test_normal_execution",
                    test_type="normal",
                    input_data="test_image.jpg",
                    expected_output={"class_281": 0.89},
                    description="Test normal image classification"
                ),
                TestCase(
                    name="test_edge_case",
                    test_type="edge_case",
                    input_data="nonexistent.jpg",
                    expected_output="FileNotFoundError",
                    description="Test missing file handling"
                ),
                TestCase(
                    name="test_correctness",
                    test_type="correctness",
                    input_data="cat_image.jpg",
                    expected_output={"class_285": 0.94},
                    description="Test correct cat detection"
                )
            ],
            "speech-recognition": [
                TestCase(
                    name="test_normal_execution",
                    test_type="normal",
                    input_data="sample_audio.wav",
                    expected_output="HELLO WORLD",
                    description="Test normal transcription"
                ),
                TestCase(
                    name="test_edge_case",
                    test_type="edge_case",
                    input_data="missing_audio.wav",
                    expected_output="FileNotFoundError",
                    description="Test missing audio file"
                ),
                TestCase(
                    name="test_correctness",
                    test_type="correctness",
                    input_data="speech_sample.wav",
                    expected_output="THE QUICK BROWN FOX",
                    description="Test accurate transcription"
                )
            ]
        }
        
        return test_templates.get(model_info["task_type"], test_templates["text-classification"])
    
    def _extract_libraries(self, model_info: Dict[str, Any]) -> List[str]:
        """Extract required libraries from model info"""
        library_mapping = {
            "huggingface": ["transformers", "torch"],
            "pytorch": ["torch", "torchvision", "pillow"]
        }
        return library_mapping.get(model_info["source_library"], ["transformers"])
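
The test templates above store confidence scores such as `0.95` as expected outputs, and the generated assertions compare them with `==`. Real model scores rarely reproduce an exact float, so a tolerance-based comparison may be more practical. The sketch below is one possible approach; `outputs_match` and the default tolerance of `0.05` are assumptions for illustration, not values from the paper.

```python
import math
from typing import Any


def outputs_match(actual: Any, expected: Any, abs_tol: float = 0.05) -> bool:
    """Compare outputs, allowing an absolute tolerance on float values."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        # Labels must agree exactly; only the scores get a tolerance.
        if actual.keys() != expected.keys():
            return False
        return all(outputs_match(actual[k], expected[k], abs_tol) for k in expected)
    if isinstance(expected, float) and isinstance(actual, (int, float)):
        return math.isclose(actual, expected, abs_tol=abs_tol)
    return actual == expected


print(outputs_match({"POSITIVE": 0.97}, {"POSITIVE": 0.95}))  # True
print(outputs_match({"NEGATIVE": 0.97}, {"POSITIVE": 0.95}))  # False
```

Non-float values such as transcribed strings still fall through to exact equality.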

## 5. Data Curation and Validation

In [None]:
class BenchmarkCurator:
    """Curate and validate benchmark entries following paper's approach"""
    
    def __init__(self):
        self.validation_results = []
    
    def validate_entry(self, entry: BenchmarkEntry) -> Tuple[bool, Dict[str, Any]]:
        """Validate a benchmark entry through execution"""
        validation_result = {
            "task_id": entry.task_id,
            "syntax_valid": False,
            "imports_valid": False,
            "tests_passed": 0,
            "total_tests": len(entry.test_cases),
            "execution_time": None,
            "errors": []
        }
        
        # Step 1: Validate syntax
        code = entry.to_code_file()
        try:
            ast.parse(code)
            validation_result["syntax_valid"] = True
        except SyntaxError as e:
            validation_result["errors"].append(f"Syntax error: {e}")
            return False, validation_result
        
        # Step 2: Check imports (mock validation)
        validation_result["imports_valid"] = self._validate_imports(entry.imports)
        
        # Step 3: Execute tests (mock execution)
        import time
        import random
        
        start_time = time.time()
        
        # Simulate test execution
        for test_case in entry.test_cases:
            if random.random() > 0.2:  # 80% success rate
                validation_result["tests_passed"] += 1
            else:
                validation_result["errors"].append(
                    f"Test {test_case.name} failed"
                )
        
        validation_result["execution_time"] = time.time() - start_time
        
        # Entry is valid if all tests pass
        is_valid = validation_result["tests_passed"] == validation_result["total_tests"]
        
        return is_valid, validation_result
    
    def _validate_imports(self, imports: List[str]) -> bool:
        """Check if imports are valid"""
        # Mock validation - in real implementation would check actual imports
        forbidden_imports = ["os.system", "subprocess.call", "eval", "exec"]
        
        for imp in imports:
            if any(forbidden in imp for forbidden in forbidden_imports):
                return False
        return True
    
    def curate_dataset(self, entries: List[BenchmarkEntry], 
                      min_success_rate: float = 1.0) -> List[BenchmarkEntry]:
        """Curate dataset by filtering based on validation results"""
        print(f"\nCurating {len(entries)} benchmark entries...")
        curated_entries = []
        
        for entry in entries:
            is_valid, result = self.validate_entry(entry)
            self.validation_results.append(result)
            
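            # Guard against entries that define no test cases
            total = result["total_tests"]
            success_rate = result["tests_passed"] / total if total else 0.0
            
            # Reject entries whose imports failed the safety check
            if result["imports_valid"] and success_rate >= min_success_rate: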
                entry.validation_status = "validated"
                entry.execution_time = result["execution_time"]
                curated_entries.append(entry)
                print(f"✓ {entry.task_id}: All tests passed")
            else:
                print(f"✗ {entry.task_id}: {result['tests_passed']}/{result['total_tests']} tests passed")
        
        print(f"\nCuration complete: {len(curated_entries)}/{len(entries)} entries validated")
        return curated_entries
    
    def generate_statistics(self) -> pd.DataFrame:
        """Generate statistics about validation results"""
        if not self.validation_results:
            return pd.DataFrame()
        
        stats_df = pd.DataFrame(self.validation_results)
        stats_df['success_rate'] = stats_df['tests_passed'] / stats_df['total_tests']
        
        return stats_df
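
The substring check in `_validate_imports` is a mock: it would flag `import evaluate` because the line contains `eval`, and it never sees calls made inside the function body. A hedged sketch of a stricter check that walks the syntax tree with the standard `ast` module follows. The deny-lists and the name `find_unsafe_usage` are illustrative assumptions, not a policy described in the paper.

```python
import ast
from typing import List

# Assumption: an illustrative deny-list; a real policy would be project-specific.
FORBIDDEN_MODULES = {"subprocess", "socket", "ctypes"}
FORBIDDEN_CALLS = {"eval", "exec"}


def find_unsafe_usage(code: str) -> List[str]:
    """Return descriptions of forbidden imports and calls found in code."""
    problems: List[str] = []
    for node in ast.walk(ast.parse(code)):
        # Collect top-level module names from both import forms.
        if isinstance(node, ast.Import):
            modules = {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom):
            modules = {(node.module or "").split(".")[0]}
        else:
            modules = set()
        for name in sorted(modules & FORBIDDEN_MODULES):
            problems.append(f"forbidden import: {name}")
        # Flag direct calls to built-ins such as eval(...) or exec(...).
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in FORBIDDEN_CALLS):
            problems.append(f"forbidden call: {node.func.id}")
    return problems


print(find_unsafe_usage("import evaluate\nmetric = evaluate.load('accuracy')"))  # []
print(find_unsafe_usage("import subprocess\neval('1 + 1')"))
```

A static check like this only catches direct usage; it does not replace running untrusted code in an isolated environment.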

## 6. Putting It All Together: Complete Benchmark Pipeline

In [None]:
# Initialize components
collector = DataCollector()
generator = BenchmarkGenerator()
curator = BenchmarkCurator()

# Step 1: Collect raw data
print("=== Phase 1: Data Collection ===")
raw_data = collector.collect_raw_data()
processed_data = collector.preprocess_data(raw_data)
print(f"Collected {len(processed_data)} model entries")

# Step 2: Generate benchmark entries
print("\n=== Phase 2: Benchmark Generation ===")
benchmark_entries = []
for data in processed_data:
    entry = generator.generate_from_model_info(data)
    benchmark_entries.append(entry)
    print(f"Generated: {entry.task_id} - {entry.task_description[:50]}...")

# Step 3: Curate and validate
print("\n=== Phase 3: Curation & Validation ===")
curated_entries = curator.curate_dataset(benchmark_entries)

# Step 4: Generate statistics
print("\n=== Phase 4: Statistics ===")
stats_df = curator.generate_statistics()
print("\nValidation Statistics:")
print(stats_df)

## 7. Visualization and Analysis

In [None]:
# Visualize benchmark composition
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Domain distribution
ax1 = axes[0, 0]
domain_counts = pd.Series([e.domain.value for e in benchmark_entries]).value_counts()
ax1.pie(domain_counts.values, labels=domain_counts.index, autopct='%1.1f%%')
ax1.set_title('Domain Distribution in Generated Benchmark')

# 2. Validation success rates
ax2 = axes[0, 1]
if not stats_df.empty:
    ax2.hist(stats_df['success_rate'], bins=5, edgecolor='black', alpha=0.7)
    ax2.set_xlabel('Success Rate')
    ax2.set_ylabel('Count')
    ax2.set_title('Distribution of Test Success Rates')

# 3. Implementation length (lines of code)
ax3 = axes[1, 0]
code_lengths = [len(e.implementation.split('\n')) for e in benchmark_entries]
ax3.bar(range(len(code_lengths)), code_lengths)
ax3.set_xlabel('Benchmark Entry')
ax3.set_ylabel('Lines of Code')
ax3.set_title('Code Complexity by Entry')

# 4. Library usage
ax4 = axes[1, 1]
all_libs = []
for e in benchmark_entries:
    all_libs.extend(e.libraries_used)
lib_counts = pd.Series(all_libs).value_counts()
ax4.bar(lib_counts.index, lib_counts.values)
ax4.set_xlabel('Library')
ax4.set_ylabel('Usage Count')
ax4.set_title('Library Usage Frequency')
ax4.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 8. Export Benchmark to File

In [None]:
# Example: Export one entry as executable Python file
if curated_entries:
    example_entry = curated_entries[0]
    
    # Generate complete code file
    code_content = example_entry.to_code_file()
    
    print("=== Generated Benchmark Entry ===")
    print(f"Task ID: {example_entry.task_id}")
    print(f"Description: {example_entry.task_description}")
    print(f"Domain: {example_entry.domain.value}")
    print("\n--- Generated Code ---")
    print(code_content[:1000] + "..." if len(code_content) > 1000 else code_content)
    
    # Save to file
    output_file = f"benchmark_{example_entry.task_id}.py"
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(code_content)
    print(f"\nSaved to: {output_file}")
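
The curator above simulates test execution with random outcomes. Once a file has been exported, one way to validate it for real is to run it in a child process and count the success and failure markers that `test_main` prints. The sketch below assumes that marker format; `run_benchmark_file` and the 60-second timeout are hypothetical choices. A plain subprocess is not a sandbox, so untrusted code would still need stronger isolation.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Dict


def run_benchmark_file(path: str, timeout_s: float = 60.0) -> Dict[str, int]:
    """Execute a generated file and count its success/failure markers."""
    try:
        completed = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"passed": 0, "failed": 0, "timed_out": 1}
    stdout = completed.stdout
    return {
        "passed": len(re.findall(r"Test case \[\d+/\d+\] succeeded", stdout)),
        "failed": len(re.findall(r"Test case \[\d+/\d+\] failed", stdout)),
        "timed_out": 0,
    }


# Demo with a tiny stand-in file that prints the same markers.
demo_source = (
    'print("Test case [1/2] succeeded")\n'
    'print("Test case [2/2] failed: boom")\n'
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_task.py"
    demo_path.write_text(demo_source, encoding="utf-8")
    demo_result = run_benchmark_file(str(demo_path))

print(demo_result)  # {'passed': 1, 'failed': 1, 'timed_out': 0}
```

The returned counts have the same meaning as `tests_passed` in `validate_entry`, so they could replace the random simulation there.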

## Key Takeaways

1. **Benchmark Design Philosophy**:
   - Focus on real-world AI tasks that use specific libraries
   - Structured format inspired by HumanEval
   - Comprehensive testing at three levels

2. **Data Curation Process**:
   - Initial generation: ~9,000 files
   - After filtering: ~2,000 files (at least one test passes)
   - Final benchmark: ~500 files (all tests pass)

3. **Quality Assurance**:
   - Syntax validation
   - Import safety checks
   - Execution in sandboxed environment
   - Multiple test cases per task

4. **Practical Applications**:
   - The benchmark can be used to train and evaluate LLMs
   - The framework can be adapted to other domains
   - The data format is standardized, which makes it easy to extend
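
The two filtering stages in the curation process (at least one test passes, then all tests pass) can be sketched as a simple funnel over validation results. The dictionary keys mirror those produced by `validate_entry` in this notebook. `curation_funnel` is a hypothetical helper, and the toy numbers are made up for illustration, not figures from the paper.

```python
from typing import Dict, List


def curation_funnel(results: List[Dict[str, int]]) -> Dict[str, int]:
    """Count files surviving each filtering stage."""
    generated = len(results)
    some_pass = sum(1 for r in results if r["tests_passed"] >= 1)
    all_pass = sum(
        1 for r in results
        if r["total_tests"] > 0 and r["tests_passed"] == r["total_tests"]
    )
    return {"generated": generated, "at_least_one_pass": some_pass, "all_pass": all_pass}


# Toy validation results (made up for illustration).
toy_results = [
    {"tests_passed": 3, "total_tests": 3},
    {"tests_passed": 1, "total_tests": 3},
    {"tests_passed": 0, "total_tests": 3},
    {"tests_passed": 2, "total_tests": 3},
]
print(curation_funnel(toy_results))
# {'generated': 4, 'at_least_one_pass': 3, 'all_pass': 1}
```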