# DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

## 📄 Paper Information
- **Title**: DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
- **Authors**: Qihao Zhu, Daya Guo, Zhihong Shao, et al. (DeepSeek-AI)
- **Link**: [arXiv:2406.11931v1](https://arxiv.org/abs/2406.11931)
- **GitHub**: https://github.com/deepseek-ai/DeepSeek-Coder-V2

## 🎯 Paper Summary

DeepSeek-Coder-V2 is an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo on code-specific tasks. It is further pre-trained from an intermediate checkpoint of DeepSeek-V2 on an additional 6 trillion tokens. Compared with its predecessor, the model offers:

- **Broader programming-language support**: from 86 to 338 languages
- **Longer context**: from 16K to 128K tokens
- **Strong results against closed-source models**: compared with GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro on coding and math benchmarks
- **Two versions**: 16B (2.4B active params) and 236B (21B active params)

### Key Results:
- **HumanEval**: 90.2%
- **MBPP+**: 76.2% 
- **MATH**: 75.7%
- **LiveCodeBench**: 43.4%
- **SWE-Bench**: 12.7% (first open-source >10%)

## 🔧 Environment Setup

In [None]:
# Core dependencies
!pip install torch transformers datasets tokenizers
!pip install langchain langchain-openai langchain-anthropic langchain-community
!pip install deepeval
!pip install numpy pandas matplotlib seaborn plotly
!pip install jupyter ipywidgets

# For code evaluation
!pip install human-eval
!pip install code-bert-score
!pip install requests beautifulsoup4

In [None]:
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import Dataset, load_dataset
import json
import re
from typing import List, Dict, Any
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Environment setup completed!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## 📊 Data Collection Analysis

### From Section 2 of the paper: Data Collection

DeepSeek-Coder-V2 is trained on a corpus composed of:
- **60% source code**: 1,170B code-related tokens from GitHub and CommonCrawl
- **10% math corpus**: 221B math-related tokens
- **30% natural language corpus**: from the DeepSeek-V2 dataset

Total: **10.2T tokens** (4.2T from DeepSeek-V2 + 6T new)
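The totals above are simple arithmetic, and a short sketch makes the split explicit. It assumes the 60/10/30 ratios apply uniformly to the 6T new tokens, which the summary does not state outright. The names `MIX` and `tokens_per_source` are illustrative, not taken from the paper.

```python
# Mixing ratios and token budgets as quoted in the summary above.
NEW_TOKENS_T = 6.0        # tokens added during continued pre-training (trillions)
INHERITED_TOKENS_T = 4.2  # tokens already seen by the DeepSeek-V2 checkpoint
MIX = {"source_code": 0.60, "math": 0.10, "natural_language": 0.30}


def tokens_per_source(total_t: float, mix: dict) -> dict:
    """Split a token budget (in trillions) according to mixing ratios.

    Assumption: the ratios apply uniformly to the whole budget.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mixing ratios must sum to 1")
    return {name: round(total_t * share, 3) for name, share in mix.items()}


per_source = tokens_per_source(NEW_TOKENS_T, MIX)
total_exposure_t = round(INHERITED_TOKENS_T + NEW_TOKENS_T, 1)

for name, tokens_t in per_source.items():
    print(f"{name}: {tokens_t}T tokens")
print(f"total exposure: {total_exposure_t}T tokens")
```

These per-source figures are budget estimates, a different quantity from the raw corpus sizes (1,170B code tokens, 221B math tokens).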

In [None]:
# Data composition based on the statistics reported in the paper
data_composition = {
    'Data Type': ['Source Code', 'Math Corpus', 'Natural Language'],
    'Percentage': [60, 10, 30],
    # Estimate: each share applied to the 6T new tokens (60% / 10% / 30% of 6,000B).
    # The raw corpus sizes (1,170B code, 221B math) are a different quantity.
    'Tokens (Billions)': [3600, 600, 1800],
    'Sources': ['GitHub + CommonCrawl', 'CommonCrawl', 'DeepSeek-V2']
}

df_composition = pd.DataFrame(data_composition)
print("📊 DeepSeek-Coder-V2 Data Composition:")
print(df_composition)

# Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Pie chart for percentage
ax1.pie(df_composition['Percentage'], labels=df_composition['Data Type'], 
        autopct='%1.1f%%', startangle=90)
ax1.set_title('Data Composition by Type (%)')

# Bar chart for token counts
ax2.bar(df_composition['Data Type'], df_composition['Tokens (Billions)'])
ax2.set_title('Token Count by Data Type (Billions)')
ax2.set_ylabel('Tokens (Billions)')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 🏗️ Model Architecture Analysis

### Mixture-of-Experts (MoE) Architecture

DeepSeek-Coder-V2 uses the same MoE architecture as DeepSeek-V2 and comes in two versions:

| Model | Total Params | Active Params | Context Length | FIM Support |
|-------|-------------|---------------|----------------|-------------|
| DeepSeek-Coder-V2-Lite | 16B | 2.4B | 128K | ✅ |
| DeepSeek-Coder-V2 | 236B | 21B | 128K | ❌ |
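To see why only a fraction of the parameters is active per token, here is a minimal sketch of generic top-k expert routing in NumPy. It is not the DeepSeekMoE design the model uses (which also has shared experts and finer-grained routing). The sizes, the random weights, and the function names are all illustrative assumptions.

```python
import numpy as np


def top_k_route(x, gate_w, k=2):
    """Return (expert indices, normalized weights) for each token."""
    logits = x @ gate_w                              # (tokens, experts)
    top_idx = np.argsort(logits, axis=-1)[:, -k:]    # k highest-scoring experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Softmax over the selected experts only
    w = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return top_idx, w


def moe_forward(x, gate_w, expert_ws, k=2):
    """Combine the outputs of the k selected experts for every token."""
    top_idx, w = top_k_route(x, gate_w, k)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for slot in range(k):
            e = top_idx[t, slot]
            out[t] += w[t, slot] * (x[t] @ expert_ws[e])
    return out, top_idx


rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, k = 4, 8, 6, 2          # toy sizes (assumed)
x = rng.normal(size=(n_tokens, d_model))
gate_w = rng.normal(size=(d_model, n_experts))
expert_ws = rng.normal(size=(n_experts, d_model, d_model))

out, chosen = moe_forward(x, gate_w, expert_ws, k)
active_fraction = k / n_experts
print(chosen)
print(f"experts used per token: {k}/{n_experts} ({active_fraction:.0%})")
```

Each token touches only `k` of the expert matrices. The same idea, at much larger scale, lets a model with 236B total parameters activate roughly 21B per token.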

In [None]:
# Model specifications analysis
model_specs = {
    'Model': ['DeepSeek-Coder-V2-Lite', 'DeepSeek-Coder-V2'],
    'Total Parameters (B)': [16, 236],
    'Active Parameters (B)': [2.4, 21],
    'Context Length (K)': [128, 128],
    'FIM Support': ['Yes', 'No'],
    'Training Tokens (T)': [10.2, 10.2]
}

df_models = pd.DataFrame(model_specs)
print("🏗️ DeepSeek-Coder-V2 Model Specifications:")
print(df_models.to_string(index=False))

# Efficiency comparison
efficiency_ratio = df_models['Active Parameters (B)'] / df_models['Total Parameters (B)'] * 100
df_models['Efficiency (%)'] = efficiency_ratio.round(2)

print("\n⚡ Parameter Efficiency:")
for i, row in df_models.iterrows():
    print(f"{row['Model']}: {row['Efficiency (%)']}% active parameters")

## 🧪 Fill-In-the-Middle (FIM) Implementation

### From Section 3.1: Training Policy

DeepSeek-Coder-V2-Lite is trained with FIM in PSM (Prefix, Suffix, Middle) mode:

```
<｜fim_begin｜>prefix<｜fim_hole｜>suffix<｜fim_end｜>middle<|eos_token|>
```

FIM rate: 0.5 (FIM is applied to 50% of the training data)
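The FIM rate can be read as a per-document coin flip: with probability 0.5 a document is rewritten into PSM order, otherwise it stays a plain next-token-prediction sample. The sketch below illustrates that reading. Splitting at random character offsets, and the helper names `to_psm` and `apply_fim`, are assumptions made for illustration; the paper's exact splitting procedure is not reproduced here.

```python
import random

FIM_BEGIN = "<｜fim_begin｜>"
FIM_HOLE = "<｜fim_hole｜>"
FIM_END = "<｜fim_end｜>"
EOS = "<|eos_token|>"


def to_psm(document: str, rng: random.Random) -> str:
    """Cut the document at two random character offsets and emit PSM order."""
    i, j = sorted(rng.randint(0, len(document)) for _ in range(2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}{EOS}"


def apply_fim(documents, fim_rate=0.5, seed=0):
    """Rewrite each document into PSM format with probability fim_rate."""
    rng = random.Random(seed)
    samples = []
    for doc in documents:
        if rng.random() < fim_rate:
            samples.append(to_psm(doc, rng))
        else:
            samples.append(doc + EOS)   # ordinary left-to-right sample
    return samples


docs = [f"def f{i}(x):\n    return x + {i}\n" for i in range(1000)]
samples = apply_fim(docs)
fim_share = sum(s.startswith(FIM_BEGIN) for s in samples) / len(samples)
print(f"share of FIM samples: {fim_share:.1%}")
```

Because the middle is moved to the end, the model still trains strictly left to right while learning to condition on both the prefix and the suffix.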

In [None]:
class FIMProcessor:
    """Fill-In-the-Middle processor following the DeepSeek-Coder-V2 paper."""
    
    def __init__(self):
        self.fim_begin = "<｜fim_begin｜>"
        self.fim_hole = "<｜fim_hole｜>"
        self.fim_end = "<｜fim_end｜>"
        self.eos_token = "<|eos_token|>"
        
    def create_fim_sample(self, code: str, hole_ratio: float = 0.3) -> Dict[str, str]:
        """Create a FIM sample from a complete piece of code.
        
        Args:
            code: The complete code
            hole_ratio: Fraction of lines to cut out as the hole (middle part)
            
        Returns:
            Dict containing prefix, suffix, middle, and fim_format
        """
        # Keep line endings so that prefix + middle + suffix reproduces the input
        lines = code.strip().splitlines(keepends=True)
        total_lines = len(lines)
        
        # Choose the hole position. randint's upper bound is exclusive,
        # so the +1 allows the hole to reach the last line.
        hole_size = min(total_lines, max(1, int(total_lines * hole_ratio)))
        start_idx = np.random.randint(0, total_lines - hole_size + 1)
        end_idx = start_idx + hole_size
        
        # Split into prefix, middle, suffix
        prefix = ''.join(lines[:start_idx])
        middle = ''.join(lines[start_idx:end_idx])
        suffix = ''.join(lines[end_idx:])
        
        # Build the FIM format: <fim_begin>prefix<fim_hole>suffix<fim_end>middle<eos>
        fim_format = f"{self.fim_begin}{prefix}{self.fim_hole}{suffix}{self.fim_end}{middle}{self.eos_token}"
        
        return {
            'prefix': prefix,
            'middle': middle,
            'suffix': suffix,
            'fim_format': fim_format,
            'original': code
        }
    
    def demonstrate_fim(self, code_sample: str):
        """Demo FIM process"""
        result = self.create_fim_sample(code_sample)
        
        print("🔧 Fill-In-the-Middle Demo")
        print("=" * 50)
        print("📝 Original Code:")
        print(result['original'])
        print("\n📍 Prefix:")
        print(repr(result['prefix']))
        print("\n🕳️  Middle (to be predicted):")
        print(repr(result['middle']))
        print("\n📍 Suffix:")
        print(repr(result['suffix']))
        print("\n🎯 FIM Training Format:")
        print(result['fim_format'])
        
        return result

# Demo FIM with Python code
sample_code = """def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)

# Test function
for i in range(10):
    print(f"fib({i}) = {fibonacci(i)}")"""

fim_processor = FIMProcessor()
fim_result = fim_processor.demonstrate_fim(sample_code)

## 📈 Benchmark Performance Analysis

### From Section 4: Experimental Results

Analysis of performance on the main benchmarks.
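Scores on benchmarks such as HumanEval and MBPP+ are usually reported as pass@k: the probability that at least one of k sampled programs passes all unit tests. The sketch below uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c pass. The per-problem counts are hypothetical, and the decoding setup behind the paper's numbers is not assumed here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for n samples with c passing: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem results: (samples drawn, samples passing).
results = [(10, 10), (10, 3), (10, 0), (10, 7)]

for k in (1, 5):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k}: {score:.3f}")
```

With k = 1 the estimator reduces to the fraction of passing samples, which is why single-sample scores are directly comparable across models.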

In [None]:
# Benchmark results from the paper (Tables 3, 4, 9)
benchmark_data = {
    'Model': ['GPT-4o', 'DeepSeek-Coder-V2', 'GPT-4-Turbo', 'Claude-3-Opus', 'Gemini-1.5-Pro', 'Codestral'],
    'HumanEval': [91.0, 90.2, 88.2, 84.2, 83.5, 78.1],
    'MBPP+': [73.5, 76.2, 72.2, 72.0, 74.6, 68.2],
    'MATH': [76.6, 75.7, 73.4, 60.1, 67.7, None],
    'LiveCodeBench': [43.4, 43.4, 45.7, 34.6, 34.1, 31.0],
    'GSM8K': [95.8, 94.9, 93.7, 95.0, 90.8, None],
    'Type': ['Closed', 'Open', 'Closed', 'Closed', 'Closed', 'Open']
}

df_benchmarks = pd.DataFrame(benchmark_data)
print("📊 Benchmark Performance Comparison:")
print(df_benchmarks.to_string(index=False))

# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

benchmarks = ['HumanEval', 'MBPP+', 'MATH', 'LiveCodeBench', 'GSM8K']
colors = ['red' if t == 'Open' else 'blue' for t in df_benchmarks['Type']]

for i, benchmark in enumerate(benchmarks):
    if i < len(axes):
        # Filter out None values
        mask = df_benchmarks[benchmark].notna()
        data = df_benchmarks[mask]
        
        bars = axes[i].bar(data['Model'], data[benchmark], 
                          color=[colors[j] for j in data.index])
        axes[i].set_title(f'{benchmark} Performance')
        axes[i].set_ylabel('Score (%)')
        axes[i].tick_params(axis='x', rotation=45)
        
        # Highlight DeepSeek-Coder-V2
        for j, bar in enumerate(bars):
            if data.iloc[j]['Model'] == 'DeepSeek-Coder-V2':
                bar.set_edgecolor('orange')
                bar.set_linewidth(3)

# Remove empty subplot
axes[-1].remove()

# Add legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor='blue', label='Closed-Source'),
                   Patch(facecolor='red', label='Open-Source'),
                   Patch(facecolor='white', edgecolor='orange', linewidth=3, label='DeepSeek-Coder-V2')]
fig.legend(handles=legend_elements, loc='lower right')

plt.tight_layout()
plt.show()

## 🎓 Code Generation Demo

This section simulates DeepSeek-Coder-V2's code generation with a mock model, because the notebook has no direct access to the real one. The outputs are canned templates, not model generations.

In [None]:
class MockDeepSeekCoderV2:
    """Mock implementation used to demonstrate the DeepSeek-Coder-V2 workflow."""
    
    def __init__(self):
        self.supported_languages = [
            'Python', 'JavaScript', 'Java', 'C++', 'C#', 'TypeScript', 
            'PHP', 'Go', 'Rust', 'Ruby', 'Swift', 'Kotlin'
        ]
        self.context_length = 128000  # 128K tokens
        
    def generate_code(self, prompt: str, language: str = 'Python', max_tokens: int = 500) -> Dict[str, Any]:
        """Mock code generation"""
        
        # Template responses for different types of problems
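        # Any prompt/language pair without a template falls through to the generic
        # branch below, so `code` is always defined
        if 'fibonacci' in prompt.lower() and language.lower() == 'python':
            code = '''def fibonacci(n):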
    """Calculate the nth Fibonacci number using dynamic programming.
    
    Args:
        n (int): The position in the Fibonacci sequence
        
    Returns:
        int: The nth Fibonacci number
    """
    if n <= 1:
        return n
    
    # Use dynamic programming for efficiency
    dp = [0] * (n + 1)
    dp[1] = 1
    
    for i in range(2, n + 1):
        dp[i] = dp[i-1] + dp[i-2]
    
    return dp[n]

# Test the function
if __name__ == "__main__":
    for i in range(10):
        print(f"F({i}) = {fibonacci(i)}")'''
            
        # 'sort' also matches 'quicksort'; non-Python requests use the generic branch
        elif 'sort' in prompt.lower() and language.lower() == 'python':
            code = '''def quicksort(arr):
    """Implement quicksort algorithm with random pivot selection.
    
    Args:
        arr (list): List of comparable elements
        
    Returns:
        list: Sorted list
    """
    import random
    
    if len(arr) <= 1:
        return arr
    
    # Choose random pivot to avoid worst-case O(n²)
    pivot_idx = random.randint(0, len(arr) - 1)
    pivot = arr[pivot_idx]
    
    # Partition
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    
    # Recursive sort and combine
    return quicksort(left) + middle + quicksort(right)

# Example usage
test_array = [64, 34, 25, 12, 22, 11, 90]
sorted_array = quicksort(test_array.copy())
print(f"Original: {test_array}")
print(f"Sorted: {sorted_array}")'''
        
        else:
            code = f'''# Generated code for: {prompt}
# Language: {language}
# This is a mock implementation demonstrating DeepSeek-Coder-V2 capabilities

def solution():
    """Mock solution based on the prompt."""
    print("This would be a sophisticated solution generated by DeepSeek-Coder-V2")
    return "Success"

if __name__ == "__main__":
    result = solution()
    print(f"Result: {result}")'''
        
        return {
            'generated_code': code,
            'language': language,
            'prompt': prompt,
            'tokens_used': len(code.split()),  # Whitespace-separated words, a rough proxy for tokens
            'confidence': 0.92  # Fixed mock value, not a model output
        }
    
    def evaluate_code(self, code: str, test_cases: List[Dict]) -> Dict[str, Any]:
        """Mock code evaluation: outcomes are random and the code is never executed."""
        total = len(test_cases)
        success_rate = 0.9  # Arbitrary mock pass probability, not a measured value
        
        # Simulate one pass/fail outcome per test case
        passed = int(sum(np.random.random() < success_rate for _ in test_cases))
        
        return {
            'passed': passed,
            'total': total,
            'success_rate': passed / total if total else 0.0,  # Avoid dividing by zero
            'status': 'PASSED' if passed == total else 'PARTIAL'
        }

# Demo the mock model
mock_model = MockDeepSeekCoderV2()

print("🤖 DeepSeek-Coder-V2 Demo")
print(f"📋 Supported Languages: {', '.join(mock_model.supported_languages)}")
print(f"📏 Context Length: {mock_model.context_length:,} tokens")
print()

# Test code generation
prompts = [
    "Implement fibonacci sequence with dynamic programming",
    "Create a quicksort algorithm with random pivot"
]

for prompt in prompts:
    print(f"🎯 Prompt: {prompt}")
    result = mock_model.generate_code(prompt)
    print(f"💻 Generated Code ({result['tokens_used']} tokens):")
    print("```python")
    print(result['generated_code'])
    print("```")
    print(f"🎯 Confidence: {result['confidence']:.2%}")
    print("=" * 80)
    print()

## 🔍 Context Length Extension Demo

### From Section 3.4: Long Context Extension

DeepSeek-Coder-V2 uses YaRN to extend the context length to 128K tokens.
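YaRN rescales the rotary position embedding (RoPE) frequencies per dimension: low-frequency dimensions are interpolated by the scale factor, high-frequency ones are left untouched, and a linear ramp blends the two in between. Below is a minimal sketch of that rule using the hyperparameters quoted in this notebook (s = 40, α = 1, β = 32). The head dimension, the RoPE base of 10000, and the 16K `original_context` are assumptions for illustration, not values confirmed for this model.

```python
import numpy as np


def yarn_inv_freq(dim=64, base=10000.0, scale=40.0, alpha=1.0, beta=32.0,
                  original_context=16384):
    """Blend interpolated and original RoPE frequencies per dimension pair."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # theta_d for each pair
    wavelength = 2 * np.pi / inv_freq
    # r = number of full rotations a dimension completes inside the original window
    r = original_context / wavelength
    # gamma = 0 -> fully interpolated, gamma = 1 -> unchanged, linear ramp between
    gamma = np.clip((r - alpha) / (beta - alpha), 0.0, 1.0)
    scaled = (1 - gamma) * inv_freq / scale + gamma * inv_freq
    return scaled, gamma


def yarn_attention_scale(scale=40.0):
    """Attention temperature heuristic from the YaRN paper: sqrt(1/t) = 0.1 * ln(s) + 1."""
    return 0.1 * np.log(scale) + 1.0


scaled_freq, ramp = yarn_inv_freq()
print(f"dimension pairs left unchanged: {int((ramp == 1).sum())}")
print(f"dimension pairs fully interpolated: {int((ramp == 0).sum())}")
print(f"attention scale: {yarn_attention_scale():.3f}")
```

Leaving the fast-rotating dimensions alone preserves local positional detail, while stretching only the slow ones makes room for positions beyond the original window.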

In [None]:
def simulate_context_extension():
    """Simulate the long-context handling of DeepSeek-Coder-V2."""
    
    # YaRN parameters as reported in the paper
    yarn_params = {
        'scale_s': 40,
        'alpha': 1,
        'beta': 32,
        'original_context': 16384,  # 16K
        'extended_context': 131072  # 128K
    }
    
    print("🧵 YaRN Context Extension Analysis")
    print("=" * 50)
    print(f"📏 Original Context Length: {yarn_params['original_context']:,} tokens")
    print(f"📏 Extended Context Length: {yarn_params['extended_context']:,} tokens")
    print(f"📈 Extension Ratio: {yarn_params['extended_context'] / yarn_params['original_context']:.1f}x")
    print()
    print("🎛️ YaRN Hyperparameters:")
    for param, value in yarn_params.items():
        if param not in ['original_context', 'extended_context']:
            print(f"  {param}: {value}")
    
    # Simulate "Needle in a Haystack" test performance
    context_lengths = np.logspace(3, np.log10(128000), 20)  # From 1K to 128K
    # Based on Figure 2 in paper - high performance across all lengths
    performance = 95 + 5 * np.random.random(len(context_lengths))  # 95-100% range
    performance = np.clip(performance, 90, 100)  # Ensure realistic range
    
    plt.figure(figsize=(12, 6))
    plt.plot(context_lengths/1000, performance, 'b-', linewidth=2, marker='o')
    plt.axhline(y=95, color='r', linestyle='--', alpha=0.7, label='95% Threshold')
    plt.xlabel('Context Length (K tokens)')
    plt.ylabel('Performance (%)')
    plt.title('DeepSeek-Coder-V2: "Needle in a Haystack" Performance\n(Simulated based on Figure 2)')
    plt.xscale('log')
    plt.grid(True, alpha=0.3)
    plt.legend()
    plt.ylim(85, 102)
    
    # Add annotations
    plt.annotate('Original DeepSeek-Coder\n(16K)', xy=(16, 98), xytext=(30, 88),
                arrowprops=dict(arrowstyle='->', color='orange', lw=1.5),
                fontsize=10, ha='center')
    plt.annotate('DeepSeek-Coder-V2\n(128K)', xy=(128, 97), xytext=(80, 102),
                arrowprops=dict(arrowstyle='->', color='green', lw=1.5),
                fontsize=10, ha='center')
    
    plt.tight_layout()
    plt.show()
    
    return yarn_params

yarn_config = simulate_context_extension()

## 🧮 Mathematical Reasoning Analysis

### From Section 4.5: Mathematical Reasoning

DeepSeek-Coder-V2 achieves performance comparable to GPT-4o on mathematical reasoning.

In [None]:
def analyze_math_performance():
    """Analyze mathematical reasoning performance."""
    
    # Data from Table 9 in the paper
    math_results = {
        'Model': ['GPT-4o', 'DeepSeek-Coder-V2', 'GPT-4-Turbo', 'Claude-3-Opus', 'Gemini-1.5-Pro'],
        'GSM8K': [95.8, 94.9, 93.7, 95.0, 90.8],
        'MATH': [76.6, 75.7, 73.4, 60.1, 67.7],
        'AIME_2024': [2, 4, 3, 2, 2],  # Out of 30 problems
        'Math_Odyssey': [53.2, 53.7, 46.8, 40.6, 45.0],
        'Type': ['Closed', 'Open', 'Closed', 'Closed', 'Closed']
    }
    
    df_math = pd.DataFrame(math_results)
    print("🧮 Mathematical Reasoning Performance:")
    print(df_math.to_string(index=False))
    
    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    benchmarks = ['GSM8K', 'MATH', 'AIME_2024', 'Math_Odyssey']
    colors = ['orange' if t == 'Open' else 'skyblue' for t in df_math['Type']]
    
    for i, benchmark in enumerate(benchmarks):
        ax = axes[i//2, i%2]
        bars = ax.bar(df_math['Model'], df_math[benchmark], color=colors)
        
        # Highlight DeepSeek-Coder-V2
        bars[1].set_edgecolor('red')
        bars[1].set_linewidth(3)
        
        ax.set_title(f'{benchmark} Performance')
        if benchmark == 'AIME_2024':
            ax.set_ylabel('Problems Solved (out of 30)')
        else:
            ax.set_ylabel('Accuracy (%)')
        ax.tick_params(axis='x', rotation=45)
        
        # Add value labels on bars
        for bar, value in zip(bars, df_math[benchmark]):
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2., height + 0.5,
                   f'{value}', ha='center', va='bottom', fontsize=9)
    
    plt.tight_layout()
    plt.show()
    
    # Key insights
    print("\n🔍 Key Mathematical Reasoning Insights:")
    print("• DeepSeek-Coder-V2 achieves SOTA performance among open-source models")
    print("• Nearly matches GPT-4o on the MATH benchmark (75.7% vs 76.6%)")
    print("• Outperforms GPT-4o on Math Odyssey (53.7% vs 53.2%)")
    print("• Solves the most AIME 2024 problems (4/30) of all compared models")
    print("• Strong elementary math reasoning (GSM8K: 94.9%)")
    
    return df_math

math_analysis = analyze_math_performance()

## 🎯 DeepEval Integration

Use the DeepEval framework to evaluate code generation capabilities.
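Rubric-style scoring is easier to trust when paired with a deterministic functional check. The sketch below is plain Python rather than DeepEval API: it executes a candidate function and compares it with a reference on fixed inputs. The `generated_source` string is a hand-written stand-in for model output, and the helper names are hypothetical.

```python
def run_candidate(source: str, func_name: str):
    """Execute generated source in a fresh namespace and return the named function."""
    namespace = {}
    exec(source, namespace)  # only for trusted code; sandbox anything untrusted
    return namespace[func_name]


def check_against_reference(candidate, reference, inputs):
    """Compare candidate and reference outputs; exceptions count as failures."""
    failures = []
    for value in inputs:
        try:
            got = candidate(value)
        except Exception as exc:
            got = exc
        expected = reference(value)
        if got != expected:
            failures.append((value, got, expected))
    return {"passed": len(inputs) - len(failures),
            "total": len(inputs),
            "failures": failures}


# Hand-written stand-in for a model-generated solution (assumption).
generated_source = '''
def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True
'''


def reference_is_prime(n):
    """Slow but obviously correct reference implementation."""
    return n >= 2 and all(n % d for d in range(2, n))


candidate = run_candidate(generated_source, "is_prime")
report = check_against_reference(candidate, reference_is_prime, [2, 3, 4, 17, 25, 29])
print(report)
```

A check like this gives a hard pass/fail signal per input, which judgment-based metrics such as readability or documentation quality cannot provide.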

In [None]:
try:
    from deepeval import assert_test
    from deepeval.metrics import GEval, AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase
    DEEPEVAL_AVAILABLE = True
except ImportError:
    print("DeepEval not available. Installing...")
    !pip install deepeval
    DEEPEVAL_AVAILABLE = False

def evaluate_code_generation():
    """Illustrate a code-generation evaluation loop with simulated, DeepEval-style scores."""
    
    # Define evaluation criteria
    code_quality_criteria = {
        'correctness': 'Does the code solve the problem correctly?',
        'efficiency': 'Is the code efficient in terms of time and space complexity?',
        'readability': 'Is the code well-structured and readable?',
        'documentation': 'Are there appropriate comments and docstrings?'
    }
    
    # Test cases based on HumanEval-style problems
    test_cases = [
        {
            'problem': 'Write a function to check if a number is prime',
            'expected_features': ['efficiency check', 'edge cases', 'documentation'],
            'test_inputs': [2, 3, 4, 17, 25, 29]
        },
        {
            'problem': 'Implement binary search algorithm',
            'expected_features': ['O(log n) complexity', 'proper bounds', 'recursive or iterative'],
            'test_inputs': [[1,2,3,4,5], [10,20,30,40,50]]
        }
    ]
    
    print("🔍 Code Generation Evaluation Framework")
    print("=" * 50)
    
    # Simulate evaluation results based on DeepSeek-Coder-V2's reported performance
    evaluation_results = []
    
    for i, test_case in enumerate(test_cases):
        # Generate mock code
        generated_code = mock_model.generate_code(test_case['problem'])
        
        # Simulate evaluation scores (based on reported 90.2% HumanEval performance)
        scores = {
            'correctness': np.random.uniform(0.85, 0.95),
            'efficiency': np.random.uniform(0.80, 0.90),
            'readability': np.random.uniform(0.88, 0.95),
            'documentation': np.random.uniform(0.82, 0.92)
        }
        
        evaluation_results.append({
            'problem': test_case['problem'],
            'scores': scores,
            'overall_score': np.mean(list(scores.values()))
        })
        
        print(f"\n📝 Problem {i+1}: {test_case['problem']}")
        print("📊 Evaluation Scores:")
        for criterion, score in scores.items():
            print(f"  • {criterion.capitalize()}: {score:.2%}")
        # Report the mean over all criteria, not just the correctness score
        print(f"🎯 Overall Score: {np.mean(list(scores.values())):.2%}")
    
    # Summary
    avg_scores = {}
    for criterion in code_quality_criteria.keys():
        avg_scores[criterion] = np.mean([result['scores'][criterion] for result in evaluation_results])
    
    print("\n📈 Summary Evaluation:")
    print("=" * 30)
    for criterion, avg_score in avg_scores.items():
        print(f"{criterion.capitalize()}: {avg_score:.2%}")
    
    overall_avg = np.mean(list(avg_scores.values()))
    print(f"\n🏆 Overall Performance: {overall_avg:.2%}")
    print("Note: these scores are randomly simulated, so they are not comparable to the paper's 90.2% HumanEval result")
    
    return evaluation_results, avg_scores

eval_results, avg_scores = evaluate_code_generation()

## 🏁 Conclusion & Key Insights

### 📋 Summary of the DeepSeek-Coder-V2 Implementation

1. **Architecture Innovation**: MoE that activates only 2.4B of 16B (Lite) or 21B of 236B parameters per token
2. **Data Quality**: Multi-source corpus with 60% code, 10% math, 30% NL
3. **Context Extension**: YaRN to extend the context to 128K tokens
4. **Training Strategy**: FIM for code completion, GRPO for alignment
5. **Performance**: SOTA among open-source models, comparable to closed-source models
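The GRPO step in item 4 rests on one idea: instead of a learned value baseline, each sampled response is scored relative to the other responses for the same prompt. A minimal sketch of that group-relative advantage follows. The binary rewards are hypothetical, and the full objective (policy ratio clipping, KL penalty) is omitted.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group: (r - mean) / std."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Hypothetical rewards for 8 sampled solutions to one coding prompt
# (1.0 = all tests passed, 0.0 = failed).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.3f}")
```

Responses that beat their group's average get a positive advantage and are reinforced. If every response in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.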

In [None]:
def generate_research_template():
    """Create a template for personal research."""
    
    template = """
# 🔬 Personal Research Template: DeepSeek-Coder-V2

## 🎯 Research Questions
1. How does MoE architecture impact code generation quality?
2. What is the optimal ratio of code/math/NL data for code models?
3. How does context length affect complex coding tasks?
4. Can we improve FIM training for better code completion?

## 🧪 Experiments to Try
1. **Data Composition Analysis**
   - Test different ratios of code/math/natural language
   - Evaluate impact on different benchmark tasks
   
2. **Context Length Studies**
   - Implement YARN extension technique
   - Test on repository-level code understanding
   
3. **FIM Training Optimization**
   - Experiment with different FIM rates (0.3, 0.5, 0.7)
   - Compare PSM vs other FIM modes
   
4. **Multi-language Code Generation**
   - Test cross-language code translation
   - Evaluate performance on less common languages

## 📊 Metrics to Track
- HumanEval, MBPP+ for code generation
- RepoBench for repository-level completion
- SWE-Bench for real-world bug fixing
- Custom metrics for specific use cases

## 🛠️ Implementation Ideas
1. Create smaller MoE models for experimentation
2. Implement custom FIM data preprocessing
3. Build evaluation harness for multiple languages
4. Develop tools for long-context code analysis

## 📚 Further Reading
- Original DeepSeek-V2 paper for architecture details
- YARN paper for context extension technique
- MoE training best practices
- Code evaluation benchmarks and metrics
    """
    
    print(template)
    return template

research_template = generate_research_template()

## 📊 Final Performance Summary

In [None]:
# Create comprehensive performance summary
def create_performance_dashboard():
    """Create a dashboard summarizing performance."""
    
    fig = plt.figure(figsize=(20, 12))
    gs = fig.add_gridspec(3, 4, hspace=0.3, wspace=0.3)
    
    # 1. Model comparison radar chart
    ax1 = fig.add_subplot(gs[0, :2], projection='polar')
    
    categories = ['HumanEval', 'MBPP+', 'MATH', 'LiveCodeBench', 'GSM8K']
    deepseek_scores = [90.2, 76.2, 75.7, 43.4, 94.9]
    gpt4_scores = [88.2, 72.2, 73.4, 45.7, 93.7]
    
    angles = np.linspace(0, 2*np.pi, len(categories), endpoint=False).tolist()
    angles += angles[:1]  # Complete the circle
    
    deepseek_scores += deepseek_scores[:1]
    gpt4_scores += gpt4_scores[:1]
    
    ax1.plot(angles, deepseek_scores, 'o-', linewidth=2, label='DeepSeek-Coder-V2', color='red')
    ax1.fill(angles, deepseek_scores, alpha=0.25, color='red')
    ax1.plot(angles, gpt4_scores, 'o-', linewidth=2, label='GPT-4-Turbo', color='blue')
    ax1.fill(angles, gpt4_scores, alpha=0.25, color='blue')
    
    ax1.set_xticks(angles[:-1])
    ax1.set_xticklabels(categories)
    ax1.set_ylim(0, 100)
    ax1.set_title('Performance Comparison: DeepSeek-Coder-V2 vs GPT-4-Turbo', pad=20)
    ax1.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0))
    
    # 2. Parameter efficiency
    ax2 = fig.add_subplot(gs[0, 2:])
    models = ['DeepSeek-Coder-V2\n(236B/21B)', 'DeepSeek-Coder-V2-Lite\n(16B/2.4B)', 'Codestral\n(22B/22B)']
    total_params = [236, 16, 22]
    active_params = [21, 2.4, 22]
    
    x = np.arange(len(models))
    width = 0.35
    
    ax2.bar(x - width/2, total_params, width, label='Total Parameters (B)', alpha=0.7)
    ax2.bar(x + width/2, active_params, width, label='Active Parameters (B)', alpha=0.7)
    
    ax2.set_xlabel('Models')
    ax2.set_ylabel('Parameters (Billions)')
    ax2.set_title('Parameter Efficiency: MoE vs Dense Models')
    ax2.set_xticks(x)
    ax2.set_xticklabels(models)
    ax2.legend()
    
    # 3. Training data composition
    ax3 = fig.add_subplot(gs[1, :2])
    data_types = ['Source Code\n(60%)', 'Natural Language\n(30%)', 'Math Corpus\n(10%)']
    percentages = [60, 30, 10]
    colors = ['#ff9999', '#66b3ff', '#99ff99']
    
    ax3.pie(percentages, labels=data_types, colors=colors, autopct='%1.1f%%', startangle=90)
    ax3.set_title('Training Data Composition (6T tokens)')
    
    # 4. Context length evolution
    ax4 = fig.add_subplot(gs[1, 2:])
    models_context = ['DeepSeek-Coder', 'DeepSeek-Coder-V2']
    context_lengths = [16, 128]
    
    bars = ax4.bar(models_context, context_lengths, color=['lightblue', 'darkblue'])
    ax4.set_ylabel('Context Length (K tokens)')
    ax4.set_title('Context Length Extension')
    
    # Add value labels
    for bar, value in zip(bars, context_lengths):
        height = bar.get_height()
        ax4.text(bar.get_x() + bar.get_width()/2., height + 2,
                f'{value}K', ha='center', va='bottom', fontweight='bold')
    
    # 5. Language support expansion
    ax5 = fig.add_subplot(gs[2, :2])
    versions = ['DeepSeek-Coder', 'DeepSeek-Coder-V2']
    language_counts = [86, 338]
    
    bars = ax5.bar(versions, language_counts, color=['orange', 'red'])
    ax5.set_ylabel('Number of Languages')
    ax5.set_title('Programming Language Support')
    
    for bar, value in zip(bars, language_counts):
        height = bar.get_height()
        ax5.text(bar.get_x() + bar.get_width()/2., height + 5,
                f'{value}', ha='center', va='bottom', fontweight='bold')
    
    # 6. Key achievements
    ax6 = fig.add_subplot(gs[2, 2:])
    ax6.axis('off')
    
    achievements = [
        '🏆 First open-source model > 10% on SWE-Bench',
        '🎯 90.2% on HumanEval (SOTA open-source)',
        '📚 338 programming languages supported',
        '📏 128K context length with YaRN',
        '⚡ 21B active params (vs 236B total)',
        '🧮 75.7% on MATH benchmark'
    ]
    
    ax6.text(0.05, 0.95, 'Key Achievements:', fontsize=14, fontweight='bold', transform=ax6.transAxes)
    for i, achievement in enumerate(achievements):
        ax6.text(0.05, 0.85 - i*0.12, achievement, fontsize=12, transform=ax6.transAxes)
    
    plt.suptitle('DeepSeek-Coder-V2: Complete Performance Dashboard', fontsize=16, fontweight='bold')
    plt.tight_layout()
    plt.show()

create_performance_dashboard()

print("\n🎉 DeepSeek-Coder-V2 Implementation Complete!")
print("\n📋 Next Steps:")
print("1. 📖 Explore the 3 focused learning notebooks")
print("2. 🧪 Run experiments with your own data")
print("3. 🔬 Implement custom evaluation metrics")
print("4. 📊 Compare with other code models")
print("\n✨ Happy coding and researching! ✨")