# AICoderEval: Improving AI Domain Code Generation - Main Implementation

## 1. Paper Overview

**Paper title**: AICoderEval: Improving AI Domain Code Generation of Large Language Models

**Authors**: Yinghui Xia (AutoAgents.ai), Yuyan Chen (Fudan University), Tianyu Shi (University of Toronto), Jun Wang (East China Normal University), Jinsong Yang (AutoAgents.ai)

**Link**: https://arxiv.org/abs/2406.04712v1

**Summary**: The paper introduces AICoderEval, a benchmark dataset focused on code generation tasks in the AI domain that rely on popular libraries such as HuggingFace, PyTorch, and TensorFlow. It also proposes CoderGen, an agent-based framework that improves LLM code generation on domain-specific tasks, and trains AICoder, a stronger model fine-tuned from Llama-3.

### Main contributions:
- **Benchmark Construction**: Builds the AICoderEval dataset of 492 AI tasks
- **Framework Design**: Designs the CoderGen framework to generate high-quality training data
- **Model Evaluation**: Evaluates multiple LLMs and demonstrates the effectiveness of the approach

## 2. Environment and Library Setup

In [None]:
# Install the required libraries
!pip install -q langchain langchain-openai langchain-community
!pip install -q langgraph
!pip install -q deepeval
!pip install -q transformers torch
!pip install -q huggingface-hub datasets  # `datasets` is needed for load_dataset below
!pip install -q pandas numpy matplotlib seaborn
!pip install -q python-dotenv

In [None]:
import os
import json
import pandas as pd
import numpy as np
from typing import List, Dict, Any, Optional, Tuple
import matplotlib.pyplot as plt
import seaborn as sns
from dotenv import load_dotenv

# LangChain imports (current package layout; the legacy `langchain.*` paths are deprecated)
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain_core.messages import BaseMessage
from langchain_community.callbacks import get_openai_callback

# LangGraph imports
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver

# DeepEval imports for evaluation
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

load_dotenv()

# Set up visualization
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

## 3. Loading and Exploring the AICoderEval Dataset

In [None]:
from datasets import load_dataset

# Load the AICoderEval dataset from the HuggingFace Hub
print("Loading AICoderEval dataset...")
dataset = load_dataset("vixuowis/AICoderEval", split="train")

print(f"\nDataset size: {len(dataset)} samples")
print(f"Features: {dataset.features}")

# Inspect a sample record
print("\nSample data:")
sample = dataset[0]
for key, value in sample.items():
    if isinstance(value, str) and len(value) > 200:
        print(f"{key}: {value[:200]}...")
    else:
        print(f"{key}: {value}")

In [None]:
# Task counts per category in the dataset (hardcoded here; they sum to 492)
categories = {
    "Natural Language Processing": 383,
    "Computer Vision": 50,
    "Tabular Data": 18,
    "Audio and Speech": 17,
    "Classification": 12,
    "Multimodal": 9,
    "Reinforcement Learning": 3
}

# Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Bar chart
ax1.bar(categories.keys(), categories.values())
ax1.set_xlabel('Category')
ax1.set_ylabel('Count')
ax1.set_title('Distribution of Tasks by Category')
ax1.tick_params(axis='x', rotation=45)

# Pie chart
ax2.pie(categories.values(), labels=categories.keys(), autopct='%1.1f%%')
ax2.set_title('Percentage Distribution of Tasks')

plt.tight_layout()
plt.show()

## 4. Implementing the CoderGen Framework with LangChain and LangGraph

### Why LangChain/LangGraph:
- **LangChain**: Provides good abstractions for interacting with LLMs, managing prompts, and processing output
- **LangGraph**: Well suited to building an agent-based framework with state transitions like those in CoderGen
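
To make the state-transition idea concrete, here is a minimal, dependency-free sketch of the generate, test, and fix loop that such a graph encodes. The names `refine_loop`, `toy_generate`, `toy_run_tests`, and `toy_fix` are hypothetical stand-ins for illustration; they are not part of LangGraph or of the paper's code.

```python
from typing import Callable, Optional, Tuple


def refine_loop(
    task: str,
    generate: Callable[[str], str],
    run_tests: Callable[[str], Optional[str]],
    fix: Callable[[str, str], str],
    max_iterations: int = 5,
) -> Tuple[str, int, bool]:
    """Generate code, then repair it until tests pass or the budget runs out.

    `run_tests` returns None on success or an error traceback on failure.
    Returns (code, iterations_used, success).
    """
    code = generate(task)
    for iteration in range(1, max_iterations + 1):
        error = run_tests(code)
        if error is None:
            return code, iteration, True
        if iteration < max_iterations:
            code = fix(code, error)
    return code, max_iterations, False


# Toy stand-ins: the first draft forgets an import, the fix adds it.
def toy_generate(task: str) -> str:
    return "model = pipeline('text-classification')"


def toy_run_tests(code: str) -> Optional[str]:
    if "import pipeline" not in code:
        return "NameError: name 'pipeline' is not defined"
    return None


def toy_fix(code: str, error: str) -> str:
    return "from transformers import pipeline\n" + code


final_code, iterations, success = refine_loop(
    "classify text", toy_generate, toy_run_tests, toy_fix
)
print(iterations, success)
```

In a graph formulation, each callable becomes a node and the `if error is None` check becomes a conditional edge.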

In [None]:
from typing import TypedDict, Annotated, Sequence
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
import operator

# Define the state for the CoderGen agent
class CoderGenState(TypedDict):
    """State definition for CoderGen agent"""
    task_description: str
    generated_code: str
    error_traceback: Optional[str]
    iteration_count: int
    test_results: Dict[str, Any]
    messages: Annotated[Sequence[BaseMessage], operator.add]
    is_completed: bool
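
The `messages` field above is annotated with `operator.add`, which marks it as a field whose updates are combined with the previous value rather than overwriting it. The sketch below is a simplified model of that merging rule in plain Python, assuming a hypothetical `REDUCERS` table and `apply_update` helper; it is not LangGraph's actual implementation.

```python
import operator
from typing import Any, Callable, Dict

# Hypothetical reducer table: keys listed here are merged, all others are overwritten.
REDUCERS: Dict[str, Callable[[Any, Any], Any]] = {"messages": operator.add}


def apply_update(state: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new state with a node's partial update merged in."""
    merged = dict(state)
    for key, value in update.items():
        if key in REDUCERS and key in merged:
            merged[key] = REDUCERS[key](merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"messages": ["draft 1"], "iteration_count": 1}

# A node that returns only what changed: the message list grows by one.
partial = apply_update(state, {"messages": ["draft 2"], "iteration_count": 2})
print(partial["messages"])

# A node that returns the whole, already extended list: entries are duplicated.
full = apply_update(state, {"messages": state["messages"] + ["draft 2"]})
print(full["messages"])
```

The second call illustrates why nodes working with a reducer-annotated field usually return only the new items.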

In [None]:
class CoderGenFramework:
    """CoderGen Framework implementation using LangChain and LangGraph"""
    
    def __init__(self, model_name: str = "gpt-3.5-turbo-1106"):
        self.llm = ChatOpenAI(model=model_name, temperature=0.6)
        self.max_iterations = 5
        self._setup_prompts()
        self._build_graph()
    
    def _setup_prompts(self):
        """Setup prompts for different stages of code generation"""
        
        # Initial code generation prompt
        self.code_gen_prompt = ChatPromptTemplate.from_messages([
            SystemMessagePromptTemplate.from_template(
                """You are an expert AI developer specializing in HuggingFace, PyTorch, and TensorFlow.
                Generate Python code that implements the given task using appropriate AI libraries.
                Follow these guidelines:
                1. Use proper imports and package installations
                2. Include comprehensive error handling
                3. Add test functions with 3 test cases (normal, edge case, correctness)
                4. Follow Google Python Style Guide for documentation"""
            ),
            HumanMessagePromptTemplate.from_template(
                """Task: {task_description}
                
                Generate complete Python code with:
                - All necessary imports
                - Main function implementation
                - Test functions with 3 test cases
                - Proper documentation"""
            )
        ])
        
        # Error analysis and fix prompt
        self.error_fix_prompt = ChatPromptTemplate.from_messages([
            SystemMessagePromptTemplate.from_template(
                """You are debugging Python code. Analyze the error and provide a fixed version.
                Focus on:
                1. Understanding the error traceback
                2. Identifying the root cause
                3. Providing a corrected implementation"""
            ),
            HumanMessagePromptTemplate.from_template(
                """Original code:
                {generated_code}
                
                Error traceback:
                {error_traceback}
                
                Please fix the code and ensure it runs without errors."""
            )
        ])
    
    def _build_graph(self):
        """Build the LangGraph workflow for CoderGen"""
        workflow = StateGraph(CoderGenState)
        
        # Add nodes
        workflow.add_node("generate_code", self._generate_code)
        workflow.add_node("execute_tests", self._execute_tests)
        workflow.add_node("analyze_errors", self._analyze_errors)
        workflow.add_node("regenerate_code", self._regenerate_code)
        
        # Define edges
        workflow.set_entry_point("generate_code")
        workflow.add_edge("generate_code", "execute_tests")
        
        # Conditional edges based on test results
        workflow.add_conditional_edges(
            "execute_tests",
            self._check_test_results,
            {
                "success": END,
                "failure": "analyze_errors"
            }
        )
        
        workflow.add_edge("analyze_errors", "regenerate_code")
        workflow.add_edge("regenerate_code", "execute_tests")
        
        # Compile the graph
        memory = MemorySaver()
        self.app = workflow.compile(checkpointer=memory)
    
    def _generate_code(self, state: CoderGenState) -> Dict[str, Any]:
        """Generate initial code based on task description"""
        print(f"\n[Iteration {state['iteration_count']}] Generating code...")
        
        messages = self.code_gen_prompt.format_messages(
            task_description=state["task_description"]
        )
        
        with get_openai_callback() as cb:
            response = self.llm.invoke(messages)
            print(f"Tokens used: {cb.total_tokens}")
        
        # Return only the changed keys: `messages` uses the operator.add reducer,
        # so returning the full list would append it to itself.
        return {"generated_code": response.content, "messages": [response]}
    
    def _execute_tests(self, state: CoderGenState) -> Dict[str, Any]:
        """Execute the generated code and run tests"""
        print("\nExecuting tests...")
        
        # Mock execution for demonstration
        # In a real implementation, this would execute the code in a sandboxed environment
        import random
        
        if state["iteration_count"] < 2 and random.random() < 0.7:
            # Simulate error
            error_traceback = """Traceback (most recent call last):
  File "test.py", line 15, in <module>
    model = pipeline('text-classification')
NameError: name 'pipeline' is not defined"""
            test_results = {"passed": 0, "failed": 3, "errors": [error_traceback]}
        else:
            # Simulate success
            error_traceback = None
            test_results = {"passed": 3, "failed": 0, "errors": []}
        
        # Record completion here: state changes made inside a routing function
        # are not persisted by LangGraph.
        return {
            "error_traceback": error_traceback,
            "test_results": test_results,
            "is_completed": test_results["failed"] == 0,
        }
    
    def _check_test_results(self, state: CoderGenState) -> str:
        """Check test results and decide next action"""
        if state["test_results"]["failed"] == 0:
            print("\n✅ All tests passed!")
            return "success"
        elif state["iteration_count"] >= self.max_iterations:
            print("\n❌ Max iterations reached")
            return "success"  # Routes to END; is_completed stays False
        else:
            print(f"\n⚠️ {state['test_results']['failed']} tests failed")
            return "failure"
    
    def _analyze_errors(self, state: CoderGenState) -> Dict[str, Any]:
        """Analyze errors and prepare for regeneration"""
        print("\nAnalyzing errors...")
        # Partial update: only the iteration counter changes in this step
        return {"iteration_count": state["iteration_count"] + 1}
    
    def _regenerate_code(self, state: CoderGenState) -> Dict[str, Any]:
        """Regenerate code based on error analysis"""
        print(f"\n[Iteration {state['iteration_count']}] Regenerating code with fixes...")
        
        messages = self.error_fix_prompt.format_messages(
            generated_code=state["generated_code"],
            error_traceback=state["error_traceback"]
        )
        
        with get_openai_callback() as cb:
            response = self.llm.invoke(messages)
            print(f"Tokens used: {cb.total_tokens}")
        
        # Return only the changed keys: `messages` uses the operator.add reducer,
        # so returning the full list would append it to itself.
        return {"generated_code": response.content, "messages": [response]}
    
    def generate(self, task_description: str) -> Dict[str, Any]:
        """Main method to generate code for a given task"""
        initial_state = {
            "task_description": task_description,
            "generated_code": "",
            "error_traceback": None,
            "iteration_count": 1,
            "test_results": {},
            "messages": [],
            "is_completed": False
        }
        
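        # Use a fresh thread per call so checkpointed messages from earlier tasks do not carry over
        import uuid
        config = {"configurable": {"thread_id": f"codergen-{uuid.uuid4()}"}}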
        final_state = self.app.invoke(initial_state, config)
        
        return {
            "code": final_state["generated_code"],
            "iterations": final_state["iteration_count"],
            "success": final_state["is_completed"],
            "test_results": final_state["test_results"]
        }
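
The `_execute_tests` method above mocks execution and notes that a real implementation would run the code in a sandboxed environment. As a rough starting point, the sketch below runs a code string in a separate Python process with a timeout and captures the traceback. The helper name `run_in_subprocess` and the file name `candidate.py` are assumptions for illustration, and a subprocess only provides process isolation, not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def run_in_subprocess(code: str, timeout_s: float = 30.0) -> Tuple[bool, Optional[str]]:
    """Run `code` in a fresh Python process; return (ok, stderr text or None).

    This shields the notebook from crashes and hangs, but it is NOT a
    security sandbox: the child process still has full access to the machine.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        script = Path(tmp_dir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            completed = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp_dir,
            )
        except subprocess.TimeoutExpired:
            return False, f"Timed out after {timeout_s} seconds"
    if completed.returncode == 0:
        return True, None
    return False, completed.stderr


ok, error = run_in_subprocess("assert 1 + 1 == 2")
print(ok, error)

ok, error = run_in_subprocess("model = pipeline('text-classification')")
print(ok)
print(error)
```

The returned traceback text could be stored in `error_traceback` and fed to the error-fix prompt. For untrusted model output, a container or a restricted user account would be a safer choice than a bare subprocess.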

## 5. Using CoderGen to Generate Code

In [None]:
# Initialize CoderGen
codergen = CoderGenFramework(model_name="gpt-3.5-turbo-1106")

# Example task from paper
task_description = """
Create a text classification function using HuggingFace transformers.
The function should:
1. Load a pre-trained sentiment analysis model
2. Accept text input and return sentiment (positive/negative/neutral)
3. Handle batch processing for multiple texts
4. Include proper error handling
"""

# Generate code
result = codergen.generate(task_description)

print("\n" + "="*50)
print("GENERATED CODE:")
print("="*50)
print(result["code"])
print("\n" + "="*50)
print(f"Iterations: {result['iterations']}")
print(f"Success: {result['success']}")
print(f"Test Results: {result['test_results']}")
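
The framework stores the raw model reply in `result["code"]`. Chat models may wrap code in Markdown fences and add explanatory text around it, so a post-processing step can be useful before the code is executed. The helper below is a hypothetical sketch (the name `extract_python_code` is not from the paper or from LangChain) that returns the first fenced block if one exists and otherwise the reply unchanged.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence


def extract_python_code(response_text: str) -> str:
    """Return the first fenced code block in an LLM reply, or the reply itself."""
    pattern = re.compile(
        FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL | re.IGNORECASE
    )
    match = pattern.search(response_text)
    if match:
        return match.group(1).strip()
    return response_text.strip()


reply = (
    "Here is the implementation:\n"
    + FENCE + "python\n"
    + "from transformers import pipeline\n"
    + "clf = pipeline('sentiment-analysis')\n"
    + FENCE + "\n"
    + "Let me know if you need changes."
)
print(extract_python_code(reply))
```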

## 6. Evaluation with DeepEval

### Why DeepEval:
- **Fits the paper**: The paper evaluates with the SR@All and SR@Any metrics
- **DeepEval mapping**: 
  - SR@All → all test cases pass (DeepEval's test suite)
  - SR@Any → at least one test case passes
  - Code quality → the AnswerRelevancy and Faithfulness metrics
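
As a small worked example of the two success rates described above, the sketch below aggregates per-task test outcomes into SR@All and SR@Any. The function name `success_rates` and the choice to count a task without test cases as a failure are assumptions made for this illustration.

```python
from typing import Dict, List


def success_rates(per_task_results: List[List[bool]]) -> Dict[str, float]:
    """Compute SR@All and SR@Any (in percent) from per-task test outcomes.

    Each inner list holds one boolean per test case of a task.
    A task with no test cases counts as a failure for both metrics.
    """
    if not per_task_results:
        return {"SR@All": 0.0, "SR@Any": 0.0}
    n_tasks = len(per_task_results)
    all_pass = sum(1 for r in per_task_results if r and all(r))
    any_pass = sum(1 for r in per_task_results if any(r))
    return {
        "SR@All": 100.0 * all_pass / n_tasks,
        "SR@Any": 100.0 * any_pass / n_tasks,
    }


results = [
    [True, True, True],     # counts toward SR@All and SR@Any
    [True, False, True],    # counts toward SR@Any only
    [False, False, False],  # counts toward neither
    [True, True, False],    # counts toward SR@Any only
]
print(success_rates(results))
```

With these four tasks, one passes every test and three pass at least one, so SR@All is 25% and SR@Any is 75%.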

In [None]:
class AICoderEvaluator:
    """Evaluator for AICoderEval using DeepEval metrics"""
    
    def __init__(self):
        self.relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
        self.faithfulness_metric = FaithfulnessMetric(threshold=0.7)
    
    def evaluate_code_generation(self, task: str, generated_code: str, 
                                test_results: Dict[str, Any]) -> Dict[str, float]:
        """Evaluate generated code using multiple metrics"""
        
        # Calculate SR@All and SR@Any (a task with no executed tests counts as failed)
        total_tests = test_results.get("passed", 0) + test_results.get("failed", 0)
        sr_all = 1.0 if total_tests > 0 and test_results.get("failed", 0) == 0 else 0.0
        sr_any = 1.0 if test_results.get("passed", 0) > 0 else 0.0
        
        # Create test case for DeepEval
        # FaithfulnessMetric reads `retrieval_context`, so the reference snippets go there
        test_case = LLMTestCase(
            input=task,
            actual_output=generated_code,
            expected_output="Working code implementation",  # Simplified
            retrieval_context=["AI code generation", "HuggingFace", "PyTorch"]
        )
        
        # Evaluate with DeepEval metrics
        relevancy_score = self.relevancy_metric.measure(test_case)
        faithfulness_score = self.faithfulness_metric.measure(test_case)
        
        # Calculate code metrics
        code_lines = len(generated_code.split('\n'))
        code_tokens = len(generated_code.split())
        
        return {
            "sr_all": sr_all,
            "sr_any": sr_any,
            "relevancy": relevancy_score,
            "faithfulness": faithfulness_score,
            "code_lines": code_lines,
            "code_tokens": code_tokens,
            "passed_tests": test_results.get("passed", 0),
            "total_tests": total_tests
        }
    
    def benchmark_models(self, models: List[str], tasks: List[str]) -> pd.DataFrame:
        """Benchmark multiple models on tasks"""
        results = []
        
        for model in models:
            print(f"\nEvaluating {model}...")
            codergen = CoderGenFramework(model_name=model)
            
            model_results = {
                "model": model,
                "sr_all_scores": [],
                "sr_any_scores": [],
                "code_lines": [],
                "code_tokens": []
            }
            
            for task in tasks:
                try:
                    result = codergen.generate(task)
                    eval_scores = self.evaluate_code_generation(
                        task, result["code"], result["test_results"]
                    )
                    
                    model_results["sr_all_scores"].append(eval_scores["sr_all"])
                    model_results["sr_any_scores"].append(eval_scores["sr_any"])
                    model_results["code_lines"].append(eval_scores["code_lines"])
                    model_results["code_tokens"].append(eval_scores["code_tokens"])
                except Exception as e:
                    print(f"Error evaluating {model} on task: {e}")
                    model_results["sr_all_scores"].append(0)
                    model_results["sr_any_scores"].append(0)
                    model_results["code_lines"].append(0)
                    model_results["code_tokens"].append(0)
            
            # Calculate averages
            results.append({
                "Model": model,
                "SR@All": np.mean(model_results["sr_all_scores"]) * 100,
                "SR@Any": np.mean(model_results["sr_any_scores"]) * 100,
                "Avg Code Lines": np.mean(model_results["code_lines"]),
                "Avg Code Tokens": np.mean(model_results["code_tokens"])
            })
        
        return pd.DataFrame(results)

In [None]:
# Example evaluation
evaluator = AICoderEvaluator()

# Sample tasks for evaluation
sample_tasks = [
    "Create a text classification function using HuggingFace transformers",
    "Implement image classification using PyTorch with pretrained ResNet",
    "Build a speech recognition pipeline using HuggingFace models"
]

# Models to evaluate (simplified for demo)
models_to_test = ["gpt-3.5-turbo-1106"]

# Run benchmark
benchmark_results = evaluator.benchmark_models(models_to_test, sample_tasks[:1])
print("\nBenchmark Results:")
print(benchmark_results.to_string(index=False))

## 7. Visualizing the Results

In [None]:
# Recreate results from paper for visualization
paper_results = pd.DataFrame({
    'Model': ['GPT-3.5-turbo', 'llama-2-7b', 'llama-2-13b', 'llama-2-70b', 
              'codellama-7b', 'codellama-13b', 'codellama-34b', 'llama-3-8b'],
    'SR@All_Original': [9.16, 1.23, 2.76, 6.32, 19.58, 20.46, 23.68, 30.49],
    'SR@All_Agent': [13.03, 1.83, 3.98, 8.16, 23.86, 23.88, 25.78, 32.11],
    'SR@Any_Original': [46.84, 26.02, 42.04, 65.89, 66.95, 67.22, 70.19, 85.80],
    'SR@Any_Agent': [60.63, 33.41, 51.24, 78.68, 78.18, 75.67, 77.33, 86.82]
})

# Calculate improvements
paper_results['SR@All_Improvement'] = (
    (paper_results['SR@All_Agent'] - paper_results['SR@All_Original']) / 
    paper_results['SR@All_Original'] * 100
)

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. SR@All comparison
ax1 = axes[0, 0]
x = np.arange(len(paper_results))
width = 0.35
ax1.bar(x - width/2, paper_results['SR@All_Original'], width, label='Original', alpha=0.8)
ax1.bar(x + width/2, paper_results['SR@All_Agent'], width, label='With Agent', alpha=0.8)
ax1.set_xlabel('Model')
ax1.set_ylabel('SR@All (%)')
ax1.set_title('SR@All: Original vs With ReAct Agent')
ax1.set_xticks(x)
ax1.set_xticklabels(paper_results['Model'], rotation=45, ha='right')
ax1.legend()
ax1.grid(axis='y', alpha=0.3)

# 2. SR@Any comparison
ax2 = axes[0, 1]
ax2.bar(x - width/2, paper_results['SR@Any_Original'], width, label='Original', alpha=0.8)
ax2.bar(x + width/2, paper_results['SR@Any_Agent'], width, label='With Agent', alpha=0.8)
ax2.set_xlabel('Model')
ax2.set_ylabel('SR@Any (%)')
ax2.set_title('SR@Any: Original vs With ReAct Agent')
ax2.set_xticks(x)
ax2.set_xticklabels(paper_results['Model'], rotation=45, ha='right')
ax2.legend()
ax2.grid(axis='y', alpha=0.3)

# 3. Improvement percentage
ax3 = axes[1, 0]
ax3.bar(paper_results['Model'], paper_results['SR@All_Improvement'], 
        color='green', alpha=0.7)
ax3.set_xlabel('Model')
ax3.set_ylabel('Improvement (%)')
ax3.set_title('SR@All Improvement with ReAct Agent')
ax3.tick_params(axis='x', rotation=45)
ax3.grid(axis='y', alpha=0.3)

# 4. Code efficiency (lines vs tokens)
code_metrics = pd.DataFrame({
    'Model': ['GPT-3.5', 'llama-2-7b', 'llama-2-13b', 'llama-2-70b', 
              'codellama-7b', 'codellama-13b', 'codellama-34b', 'llama-3-8b'],
    'Code_Lines': [8.6, 16.2, 18.5, 13.1, 21.5, 18.9, 18.4, 11.02],
    'Code_Tokens': [62.9, 112.9, 116.3, 107.8, 128.3, 116.3, 114.4, 96.97]
})

ax4 = axes[1, 1]
scatter = ax4.scatter(code_metrics['Code_Lines'], code_metrics['Code_Tokens'], 
                     s=100, alpha=0.6, c=range(len(code_metrics)), cmap='viridis')
for i, model in enumerate(code_metrics['Model']):
    ax4.annotate(model, (code_metrics['Code_Lines'][i], code_metrics['Code_Tokens'][i]),
                xytext=(5, 5), textcoords='offset points', fontsize=9)
ax4.set_xlabel('Average Code Lines')
ax4.set_ylabel('Average Code Tokens')
ax4.set_title('Code Efficiency: Lines vs Tokens')
ax4.grid(alpha=0.3)

plt.tight_layout()
plt.show()
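
The `SR@All_Improvement` column in the cell above is a relative gain, not a difference in percentage points. As a quick check of that formula, the sketch below applies it to two rows of the `paper_results` table; the helper name `relative_improvement` is introduced here only for illustration.

```python
def relative_improvement(original: float, agent: float) -> float:
    """Relative gain of `agent` over `original`, in percent."""
    if original == 0:
        raise ValueError("original score must be non-zero")
    return (agent - original) / original * 100.0


# SR@All values copied from the paper_results table above
gpt35 = relative_improvement(9.16, 13.03)    # GPT-3.5-turbo
llama3 = relative_improvement(30.49, 32.11)  # llama-3-8b
print(f"{gpt35:.2f}% {llama3:.2f}%")
```

A model with a low baseline can show a large relative gain from a small absolute change, which is worth keeping in mind when reading the improvement chart.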

## 8. Template for Your Own Research

### How to use CoderGen in your research:

1. **Prepare your own dataset**:
   - Collect task descriptions for your specific domain
   - Format them following the AICoderEval structure

2. **Customize CoderGen**:
   - Adapt the prompts to your domain
   - Add specialized error handlers
   - Integrate domain-specific validators

3. **Fine-tune a model**:
   - Use the generated data for fine-tuning
   - Experiment with different base models

4. **Evaluation**:
   - Define metrics that suit your domain
   - Compare against baselines

In [None]:
# Template for custom research
class CustomCoderGen(CoderGenFramework):
    """Template for customizing CoderGen for your research"""
    
    def __init__(self, model_name: str, domain: str):
        self.domain = domain
        super().__init__(model_name)
    
    def _setup_prompts(self):
        """Override to customize prompts for your domain"""
        super()._setup_prompts()
        
        # Add domain-specific instructions
        self.code_gen_prompt = ChatPromptTemplate.from_messages([
            SystemMessagePromptTemplate.from_template(
                f"""You are an expert in {self.domain} development.
                Generate code following best practices for {self.domain}.
                [Add your domain-specific guidelines here]"""
            ),
            HumanMessagePromptTemplate.from_template(
                "Task: {task_description}\n\nGenerate complete implementation."
            )
        ])
    
    def add_domain_validator(self, code: str) -> bool:
        """Add custom validation logic for your domain"""
        # Implement domain-specific validation
        return True

# Example usage
# custom_gen = CustomCoderGen("gpt-3.5-turbo", "robotics")
# result = custom_gen.generate("Create a ROS2 node for obstacle detection")
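
The `add_domain_validator` method in the template is left as a stub. One possible way to fill it in, sketched below under the assumption that a domain can be characterized by the modules its code must import, is a static check with the standard `ast` module. The function names and the `rclpy` example are hypothetical; the check parses the code without running it, so the required packages do not need to be installed.

```python
import ast
from typing import Iterable, Set


def imported_modules(code: str) -> Set[str]:
    """Return top-level module names imported by `code` (raises SyntaxError if invalid)."""
    modules: Set[str] = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return modules


def validate_domain_code(code: str, required_modules: Iterable[str]) -> bool:
    """True if `code` parses and imports every module in `required_modules`."""
    try:
        found = imported_modules(code)
    except SyntaxError:
        return False
    return set(required_modules).issubset(found)


good = "import rclpy\nfrom rclpy.node import Node\n"
print(validate_domain_code(good, ["rclpy"]))
print(validate_domain_code("import os", ["rclpy"]))
print(validate_domain_code("def broken(:", ["rclpy"]))
```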

## 9. Conclusion

### Key Takeaways:

1. **AICoderEval Dataset**: An important benchmark for AI code generation with 492 real-world tasks

2. **CoderGen Framework**: 
   - Agent-based approach with iterative refinement
   - Error analysis and automatic fixing
   - Substantially improves performance (28.20% on average for SR@All)

3. **Implementation with LangChain/LangGraph**:
   - LangChain: Manages prompts and LLM interactions effectively
   - LangGraph: Well suited to agent workflows with state management
   - DeepEval: Suitable for assessing code quality and test success rates

4. **Future Work**:
   - Extend to more programming languages
   - Integrate more AI frameworks
   - Improve the sandboxed execution environment
   - Fine-tune models for specific domains

### Resources:
- Paper: https://arxiv.org/abs/2406.04712v1
- Dataset: https://huggingface.co/datasets/vixuowis/AICoderEval
- This implementation: [Your GitHub repo]