# Focused Learning 3: CodeBLEU vs Exact Match Evaluation

## Learning Objectives
Understand evaluation metrics for code generation in depth, in particular the difference between Exact Match (EM) and CodeBLEU, based on Section 3.5 of the paper.

## Excerpts from the Paper

### Section 3.5: The Evaluation Measures

#### Exact Match (EM)
> "*Exact Match (EM) [4, 5, 6] is the number of the generated revised code that is the same as the actual revised code in the testing dataset. We use this measure since it is widely used for evaluating code review automation approaches [1, 4, 6]. To compare the generated revised code with the actual revised code, we first tokenize both revised code to sequences of tokens. Then, we compared the sequence of tokens of the generated revised code with the sequence of tokens of the actual revised code. A high value of EM indicates that a model can generate revised code that is the same as the actual revised code in the testing dataset.*"

#### CodeBLEU
> "*CodeBLEU [22] is the extended version of BLEU (i.e., an n-gram overlap between the translation generated by a deep learning model and the translation in ground truth) [43] for automatic evaluation of the generated code. We do not measure BLEU like in prior work [5, 6] since Ren et al. [22] found that this measure ignores syntactic and semantic correctness of the generated code. In addition to BLEU, CodeBLEU considers the weighted n-gram match, matched syntactic information (i.e., abstract syntax tree: AST) and matched semantic information (i.e., data flow: DF) when computing the similarity between the generated revised code and the actual revised code.*"
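
To make the "n-gram overlap" in the quote concrete, the following is a minimal sketch of the clipped n-gram precision that BLEU is built on. The helper names `ngrams` and `clipped_ngram_precision`, the whitespace tokenization, and the example pair are illustrative assumptions, not code from the paper. Full BLEU additionally combines several n-gram orders with a geometric mean and applies a brevity penalty, which this sketch leaves out.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> List[tuple]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" step of BLEU).
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


# Hypothetical example pair, tokenized on whitespace for simplicity
reference = "return a + b ;".split()
candidate = "return a - b ;".split()

p1 = clipped_ngram_precision(candidate, reference, 1)  # 4 of 5 unigrams match
p2 = clipped_ngram_precision(candidate, reference, 2)  # 2 of 4 bigrams match
print(p1, p2)
```

A single changed operator leaves the unigram precision high but halves the bigram precision. Neither number says whether the candidate still parses or behaves correctly, which is the gap the AST and data-flow components of CodeBLEU are meant to address.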

### Formula from the CodeBLEU Paper [22]
$$\text{CodeBLEU} = \alpha \cdot \text{BLEU} + \beta \cdot \text{BLEU}_{\text{weight}} + \gamma \cdot \text{Match}_{\text{AST}} + \delta \cdot \text{Match}_{\text{DF}}$$

Where:
- $\alpha + \beta + \gamma + \delta = 1$
- $\text{BLEU}$: Traditional n-gram overlap
- $\text{BLEU}_{\text{weight}}$: Weighted n-gram match that gives programming keywords a higher weight
- $\text{Match}_{\text{AST}}$: Abstract syntax tree (AST) match
- $\text{Match}_{\text{DF}}$: Data-flow (DF) match
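
As a worked example of the formula, the sketch below combines four component scores into one CodeBLEU value. The function name `combine_codebleu`, the dictionary keys, the component scores, and the second weight setting are all made up for illustration; only the weighted-sum structure comes from the formula above.

```python
from typing import Dict, Tuple


def combine_codebleu(
    components: Dict[str, float],
    weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """Weighted sum of the four CodeBLEU components."""
    alpha, beta, gamma, delta = weights
    # The formula assumes the four weights sum to 1
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return (
        alpha * components["bleu"]
        + beta * components["bleu_weight"]
        + gamma * components["match_ast"]
        + delta * components["match_df"]
    )


# Made-up component scores, chosen only to illustrate the arithmetic
scores = {"bleu": 0.40, "bleu_weight": 0.50, "match_ast": 0.80, "match_df": 0.70}

uniform = combine_codebleu(scores)  # 0.25 * (0.40 + 0.50 + 0.80 + 0.70)
structure_heavy = combine_codebleu(scores, (0.1, 0.1, 0.4, 0.4))
print(round(uniform, 3), round(structure_heavy, 3))
```

With uniform weights the result is 0.60; shifting weight toward the AST and data-flow components raises it to 0.69, because those two components are the highest in this made-up example. The same pair of programs can therefore receive noticeably different CodeBLEU scores depending on the chosen weights.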

## Theory of Evaluation Metrics for Code Generation

### Exact Match (EM)

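**Definition**: A binary, per-example metric: the score is 1 if the generated token sequence is identical to the reference token sequence, and 0 otherwise. The paper reports the number of test examples that match.

**Advantages**:
- Simple and interpretable
- No ambiguity in evaluation
- Widely used in the literature

**Disadvantages**:
- Too strict: rejects semantically equivalent code
- Sensitive to any token-level difference (for example, redundant parentheses); only whitespace differences are removed by tokenization
- Doesn't capture partial correctness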
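
The strictness described above can be seen in a few lines. This is a minimal sketch, assuming a simple regex tokenizer (`tokenize`) that is not the paper's tokenizer; the code snippets are hypothetical.

```python
import re
from typing import List


def tokenize(code: str) -> List[str]:
    """Split code into words (identifiers, numbers) and single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", code)


def exact_match(generated: str, reference: str) -> int:
    """Return 1 if the two token sequences are identical, else 0."""
    return int(tokenize(generated) == tokenize(reference))


reference = "int total = a + b;"

same_but_spaced = exact_match("int total=a+b ;", reference)      # whitespace differs only
extra_parens = exact_match("int total = (a + b);", reference)    # same behavior, two extra tokens
swapped_operands = exact_match("int total = b + a;", reference)  # same value for ints
print(same_but_spaced, extra_parens, swapped_operands)
```

Only the first variant scores 1. The other two compute the same value as the reference but receive 0, with no partial credit.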

### CodeBLEU

**Components Analysis**:

1. **N-gram BLEU**: Surface-level similarity
2. **Weighted BLEU**: Emphasizes programming keywords
3. **AST Matching**: Structural similarity
4. **Data Flow Matching**: Semantic similarity

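**Advantages over EM**:
- Approximates semantic similarity through data-flow matching (it does not prove equivalence)
- Gives partial credit for similar solutions
- More robust to surface differences, because the AST and data-flow components compare structure rather than exact token sequences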

**Challenges**:
- Complex implementation
- Parameter tuning required
- Language-specific adaptations needed
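
To illustrate what "AST matching" means in practice, the sketch below compares two Python snippets by the subtrees of their syntax trees. It is an assumption-laden stand-in: it uses Python's built-in `ast` module, so it only works on Python code, and the helpers `subtree_signatures` and `ast_match` are hypothetical names. A real CodeBLEU implementation needs a parser for each target language, which is one reason for the language-specific adaptations listed above.

```python
import ast
from collections import Counter


def subtree_signatures(code: str) -> Counter:
    """Count structural signatures of every subtree in the parsed code.

    A signature keeps only node type names, so identifier names and
    literal values do not affect the comparison.
    """
    def signature(node: ast.AST) -> str:
        children = [signature(child) for child in ast.iter_child_nodes(node)]
        return f"{type(node).__name__}({','.join(children)})"

    tree = ast.parse(code)
    return Counter(signature(node) for node in ast.walk(tree))


def ast_match(candidate: str, reference: str) -> float:
    """Share of reference subtrees that also appear in the candidate (clipped)."""
    cand = subtree_signatures(candidate)
    ref = subtree_signatures(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, cand[sig]) for sig, count in ref.items())
    return matched / total


reference = "def add(values):\n    return sum(values)"
renamed = "def add(numbers):\n    return sum(numbers)"
restructured = (
    "def add(values):\n"
    "    total = 0\n"
    "    for v in values:\n"
    "        total += v\n"
    "    return total"
)

renamed_score = ast_match(renamed, reference)            # identical structure
restructured_score = ast_match(restructured, reference)  # partial structural overlap
print(renamed_score, round(restructured_score, 3))
```

Renaming a parameter leaves the score at 1.0 because names are ignored, while the loop-based rewrite receives partial credit for the subtrees it shares with the reference. Exact Match would score both variants 0.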

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple, Any, Optional
import re
import ast
import difflib
from collections import Counter, defaultdict
from dataclasses import dataclass
import warnings
warnings.filterwarnings('ignore')

# Math and statistics
import math
from scipy import stats
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Text processing
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import nltk

# Try to download nltk data (ignore failures such as no network access)
try:
    nltk.download('punkt', quiet=True)
except Exception:
    pass

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
np.random.seed(42)

print("📚 Libraries imported for evaluation metrics analysis!")

## Implementation: Exact Match and CodeBLEU from Scratch

Implement both metrics to understand their differences and use cases.

In [None]:
@dataclass
class EvaluationResult:
    """Container for evaluation results"""
    exact_match: float
    code_bleu: float
    bleu_components: Dict[str, float]
    ast_similarity: float
    token_overlap: float
    example_id: str = ""

class CodeTokenizer:
    """Enhanced tokenizer for code, following paper's approach"""
    
    def __init__(self):
        # Programming keywords to emphasize
        self.keywords = {
            'java': ['public', 'private', 'static', 'void', 'int', 'String', 'boolean', 'if', 'else', 'for', 'while', 'return', 'class', 'interface'],
            'python': ['def', 'class', 'if', 'else', 'elif', 'for', 'while', 'return', 'import', 'from', 'try', 'except', 'with'],
            'javascript': ['function', 'var', 'let', 'const', 'if', 'else', 'for', 'while', 'return', 'class', 'async', 'await'],
            'go': ['func', 'var', 'const', 'if', 'else', 'for', 'range', 'return', 'struct', 'interface', 'package']
        }
    
    def tokenize(self, code: str, language: str = 'java') -> List[str]:
        """Tokenize code into meaningful tokens"""
        # Words (identifiers, keywords, numbers) become one token each; every
        # other non-whitespace character becomes its own token, so operators
        # such as '?', ':' and '%' are kept. Whitespace is discarded.
        tokens = re.findall(r'\w+|[^\w\s]', code)

        # Case is preserved: identifiers are case-sensitive in all four
        # languages, and the keyword table above contains 'String'
        return tokens
    
    def get_keyword_weights(self, tokens: List[str], language: str = 'java') -> List[float]:
        """Get weights for tokens, emphasizing keywords"""
        lang_keywords = self.keywords.get(language, [])
        weights = []
        
        for token in tokens:
            if token in lang_keywords:
                weights.append(2.0)  # Emphasize keywords
            elif re.match(r'^[a-zA-Z_][a-zA-Z0-9_]*$', token):  # Identifiers
                weights.append(1.5)
            else:  # Operators and punctuation
                weights.append(1.0)
        
        return weights

class ExactMatchEvaluator:
    """Implementation of the Exact Match metric from the paper"""
    
    def __init__(self):
        self.tokenizer = CodeTokenizer()
    
    def evaluate(self, generated_code: str, reference_code: str, language: str = 'java') -> float:
        """Calculate Exact Match score"""
        # Tokenize both codes
        generated_tokens = self.tokenizer.tokenize(generated_code, language)
        reference_tokens = self.tokenizer.tokenize(reference_code, language)
        
        # Exact comparison of token sequences
        if generated_tokens == reference_tokens:
            return 1.0
        else:
            return 0.0
    
    def detailed_comparison(self, generated_code: str, reference_code: str, language: str = 'java') -> Dict[str, Any]:
        """Detailed comparison for analysis"""
        generated_tokens = self.tokenizer.tokenize(generated_code, language)
        reference_tokens = self.tokenizer.tokenize(reference_code, language)
        
        # Calculate various similarity measures
        exact_match = 1.0 if generated_tokens == reference_tokens else 0.0
        
        # Token overlap
        if not reference_tokens:
            token_overlap = 0.0
        else:
            common_tokens = set(generated_tokens) & set(reference_tokens)
            token_overlap = len(common_tokens) / len(set(reference_tokens))
        
        # Sequence similarity (using difflib)
        sequence_similarity = difflib.SequenceMatcher(None, generated_tokens, reference_tokens).ratio()
        
        # Length difference
        length_ratio = len(generated_tokens) / max(len(reference_tokens), 1)
        
        return {
            'exact_match': exact_match,
            'token_overlap': token_overlap,
            'sequence_similarity': sequence_similarity,
            'length_ratio': length_ratio,
            'generated_length': len(generated_tokens),
            'reference_length': len(reference_tokens),
            'generated_tokens': generated_tokens,
            'reference_tokens': reference_tokens
        }

class SimplifiedCodeBLEUEvaluator:
    """Simplified implementation of CodeBLEU (without full AST/DF analysis)"""
    
    def __init__(self, weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25)):
        self.tokenizer = CodeTokenizer()
        self.alpha, self.beta, self.gamma, self.delta = weights  # BLEU, Weighted-BLEU, AST, DF
        self.smoothing = SmoothingFunction()
    
    def calculate_ngram_bleu(self, generated_tokens: List[str], reference_tokens: List[str], max_n: int = 4) -> float:
        """Calculate n-gram BLEU score"""
        if not reference_tokens:
            return 0.0
        
        # Use NLTK's BLEU implementation with smoothing
        try:
            bleu_score = sentence_bleu(
                [reference_tokens], 
                generated_tokens,
                weights=tuple(1/max_n for _ in range(max_n)),
                smoothing_function=self.smoothing.method1
            )
            return bleu_score
        except Exception:  # degenerate inputs are scored as no overlap
            return 0.0
    
    def calculate_weighted_bleu(self, generated_tokens: List[str], reference_tokens: List[str], language: str = 'java') -> float:
        """Calculate weighted BLEU with keyword emphasis"""
        if not reference_tokens:
            return 0.0
        
        # Weight every reference token; the weight depends only on the token
        # itself, so a token-to-weight lookup table is sufficient
        ref_weights = self.tokenizer.get_keyword_weights(reference_tokens, language)
        weight_of = dict(zip(reference_tokens, ref_weights))
        total_ref_weight = sum(ref_weights)

        if total_ref_weight == 0:
            return 0.0

        # Clipped bag-of-words matching: this is a weighted unigram recall
        # against the reference, a simplification of CodeBLEU's weighted
        # n-gram match
        gen_counter = Counter(generated_tokens)
        ref_counter = Counter(reference_tokens)

        weighted_overlap = 0.0
        for token, ref_count in ref_counter.items():
            matched_count = min(gen_counter.get(token, 0), ref_count)
            weighted_overlap += matched_count * weight_of[token]

        return weighted_overlap / total_ref_weight
    
    def calculate_ast_similarity(self, generated_code: str, reference_code: str, language: str = 'java') -> float:
        """Simplified AST similarity (using structural patterns)"""
        
        # For simplicity, we'll use pattern-based structural similarity
        # In a full implementation, this would use actual AST parsing
        
        def extract_structural_patterns(code: str) -> List[str]:
            patterns = []
            
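            # Control flow patterns (word boundaries avoid matching inside
            # identifiers such as 'format' or 'notify')
            def has_word(word: str) -> bool:
                return re.search(rf'\b{word}\b', code) is not None

            if has_word('if'): patterns.append('conditional')
            if has_word('for') or has_word('while'): patterns.append('loop')
            if has_word('return'): patterns.append('return')
            if has_word('function') or has_word('def') or has_word('func'): patterns.append('function_def')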
            
            # Operator patterns
            if '?' in code and ':' in code: patterns.append('ternary')
            if '&&' in code or '||' in code: patterns.append('logical_op')
            if '+=' in code or '-=' in code: patterns.append('compound_assignment')
            
            # Bracket patterns
            patterns.append(f"braces_{code.count('{')}")  
            patterns.append(f"parens_{code.count('(')}")  
            patterns.append(f"brackets_{code.count('[')}")  
            
            return patterns
        
        gen_patterns = set(extract_structural_patterns(generated_code))
        ref_patterns = set(extract_structural_patterns(reference_code))
        
        if not ref_patterns:
            return 0.0
        
        common_patterns = gen_patterns & ref_patterns
        return len(common_patterns) / len(ref_patterns)
    
    def calculate_dataflow_similarity(self, generated_code: str, reference_code: str, language: str = 'java') -> float:
        """Simplified data flow similarity (using variable usage patterns)"""
        
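        def extract_variable_patterns(code: str) -> Dict[str, List[str]]:
            patterns = defaultdict(list)
            control_keywords = {'if', 'else', 'while', 'for', 'switch', 'return'}

            # Keep punctuation as tokens: a word-only split would drop '=' and
            # '(' and the assignment and call checks below could never fire
            tokens = re.findall(r'\w+|[^\w\s]', code)

            for i, token in enumerate(tokens):
                # Only identifiers take part in data flow
                if not re.match(r'^[A-Za-z_]\w*$', token) or token in control_keywords:
                    continue

                prev = tokens[i - 1] if i > 0 else ''
                before_prev = tokens[i - 2] if i > 1 else ''
                nxt = tokens[i + 1] if i + 1 < len(tokens) else ''
                after_nxt = tokens[i + 2] if i + 2 < len(tokens) else ''

                # Variable assignment pattern: `x = ...` but not `x == ...`
                if nxt == '=' and after_nxt != '=':
                    patterns[token].append('assigned')

                # Variable usage in conditions: `if x` or `if (x`
                if prev in ('if', 'while') or (prev == '(' and before_prev in ('if', 'while')):
                    patterns[token].append('condition')

                # Function call pattern: identifier directly followed by '('
                if nxt == '(':
                    patterns[token].append('function_call')

            return dict(patterns)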
        
        gen_patterns = extract_variable_patterns(generated_code)
        ref_patterns = extract_variable_patterns(reference_code)
        
        if not ref_patterns:
            return 0.0
        
        # Calculate overlap in variable usage patterns
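        # Count distinct patterns per variable so that identical code scores 1.0
        # (the overlap below is also computed on sets)
        total_ref_patterns = sum(len(set(patterns)) for patterns in ref_patterns.values())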
        if total_ref_patterns == 0:
            return 0.0
        
        common_patterns = 0
        for var, ref_var_patterns in ref_patterns.items():
            if var in gen_patterns:
                gen_var_patterns = gen_patterns[var]
                common = set(ref_var_patterns) & set(gen_var_patterns)
                common_patterns += len(common)
        
        return common_patterns / total_ref_patterns
    
    def evaluate(self, generated_code: str, reference_code: str, language: str = 'java') -> EvaluationResult:
        """Calculate comprehensive CodeBLEU score"""
        
        # Tokenize
        generated_tokens = self.tokenizer.tokenize(generated_code, language)
        reference_tokens = self.tokenizer.tokenize(reference_code, language)
        
        # Calculate components
        ngram_bleu = self.calculate_ngram_bleu(generated_tokens, reference_tokens)
        weighted_bleu = self.calculate_weighted_bleu(generated_tokens, reference_tokens, language)
        ast_similarity = self.calculate_ast_similarity(generated_code, reference_code, language)
        dataflow_similarity = self.calculate_dataflow_similarity(generated_code, reference_code, language)
        
        # Combine with weights
        code_bleu = (
            self.alpha * ngram_bleu +
            self.beta * weighted_bleu +
            self.gamma * ast_similarity +
            self.delta * dataflow_similarity
        )
        
        # Calculate exact match for comparison
        exact_match = 1.0 if generated_tokens == reference_tokens else 0.0
        
        # Token overlap
        if reference_tokens:
            common_tokens = set(generated_tokens) & set(reference_tokens)
            token_overlap = len(common_tokens) / len(set(reference_tokens))
        else:
            token_overlap = 0.0
        
        return EvaluationResult(
            exact_match=exact_match,
            code_bleu=code_bleu,
            bleu_components={
                'ngram_bleu': ngram_bleu,
                'weighted_bleu': weighted_bleu,
                'ast_similarity': ast_similarity,
                'dataflow_similarity': dataflow_similarity
            },
            ast_similarity=ast_similarity,
            token_overlap=token_overlap
        )

print("🔧 Evaluation metrics implemented!")

## Test Cases: Diverse Code Examples

Create diverse test cases to demonstrate the differences between EM and CodeBLEU.

In [None]:
@dataclass
class TestCase:
    """Test case for evaluation metrics comparison"""
    case_id: str
    description: str
    generated_code: str
    reference_code: str
    language: str
    expected_em: float
    expected_relationship: str  # "em_higher", "codebleu_higher", "similar"

def create_comprehensive_test_cases() -> List[TestCase]:
    """Create comprehensive test cases covering different scenarios"""
    
    test_cases = [
        # Case 1: Perfect match
        TestCase(
            case_id="perfect_match",
            description="Identical code - both metrics should be 1.0",
            generated_code="String result = condition ? \"true\" : \"false\";",
            reference_code="String result = condition ? \"true\" : \"false\";",
            language="java",
            expected_em=1.0,
            expected_relationship="similar"
        ),
        
        # Case 2: Whitespace differences  
        TestCase(
            case_id="whitespace_diff",
            description="Whitespace differences only - EM stays 1.0 after tokenization, CodeBLEU should also be high",
            generated_code="String result=condition?\"true\":\"false\";",
            reference_code="String result = condition ? \"true\" : \"false\";",
            language="java",
            expected_em=1.0,  # After tokenization, should be same
            expected_relationship="similar"
        ),
        
        # Case 3: Semantically equivalent but syntactically different
        TestCase(
            case_id="semantic_equivalent",
            description="Semantically equivalent - CodeBLEU should be higher than EM",
            generated_code="if (condition) { result = \"true\"; } else { result = \"false\"; }",
            reference_code="String result = condition ? \"true\" : \"false\";",
            language="java",
            expected_em=0.0,
            expected_relationship="codebleu_higher"
        ),
        
        # Case 4: Variable name changes
        TestCase(
            case_id="variable_names",
            description="Different variable names - CodeBLEU should be more forgiving",
            generated_code="String flag = status ? \"active\" : \"inactive\";",
            reference_code="String result = condition ? \"true\" : \"false\";",
            language="java",
            expected_em=0.0,
            expected_relationship="codebleu_higher"
        ),
        
        # Case 5: Partial correctness
        TestCase(
            case_id="partial_correct",
            description="Partially correct code - CodeBLEU gives partial credit",
            generated_code="for (String item : items) { process(item); }",
            reference_code="for (String item : items) { process(item); validate(item); }",
            language="java",
            expected_em=0.0,
            expected_relationship="codebleu_higher"
        ),
        
        # Case 6: Different control structures
        TestCase(
            case_id="control_structure",
            description="Different loop types but same effect",
            generated_code="for (int i = 0; i < items.size(); i++) { process(items.get(i)); }",
            reference_code="for (String item : items) { process(item); }",
            language="java",
            expected_em=0.0,
            expected_relationship="codebleu_higher"
        ),
        
        # Case 7: Wrong code (low similarity)
        TestCase(
            case_id="wrong_code",
            description="Completely different code - both metrics should be low",
            generated_code="System.out.println(\"Hello World\");",
            reference_code="String result = condition ? \"true\" : \"false\";",
            language="java",
            expected_em=0.0,
            expected_relationship="similar"
        ),
        
        # Case 8: Python example - function definition
        TestCase(
            case_id="python_function",
            description="Python function with different parameter names",
            generated_code="def calculate_sum(numbers): return sum(numbers)",
            reference_code="def calculate_sum(values): return sum(values)",
            language="python",
            expected_em=0.0,
            expected_relationship="codebleu_higher"
        ),
        
        # Case 9: JavaScript arrow function vs regular function
        TestCase(
            case_id="js_function_style",
            description="Different function syntax in JavaScript",
            generated_code="const multiply = (a, b) => a * b;",
            reference_code="function multiply(a, b) { return a * b; }",
            language="javascript",
            expected_em=0.0,
            expected_relationship="codebleu_higher"
        ),
        
        # Case 10: Go error handling styles
        TestCase(
            case_id="go_error_handling",
            description="Different error handling approaches in Go",
            generated_code="if err != nil { return err }",
            reference_code="if err != nil { log.Fatal(err) }",
            language="go",
            expected_em=0.0,
            expected_relationship="codebleu_higher"
        ),
        
        # Case 11: Complex Java example - method with different implementation
        TestCase(
            case_id="java_method_impl",
            description="Same method signature, different implementation style",
            generated_code="public boolean isEmpty() { return size == 0; }",
            reference_code="public boolean isEmpty() { return this.size() == 0; }",
            language="java",
            expected_em=0.0,
            expected_relationship="codebleu_higher"
        ),
        
        # Case 12: Operator precedence (subtle difference)
        TestCase(
            case_id="operator_precedence",
            description="Parentheses change the result - EM is 0, but CodeBLEU still gives partial credit",
            generated_code="result = (a + b) * c;",
            reference_code="result = a + b * c;",
            language="java",
            expected_em=0.0,
            expected_relationship="codebleu_higher"
        )
    ]
    
    return test_cases

# Create test cases
test_cases = create_comprehensive_test_cases()

print(f"📝 Created {len(test_cases)} comprehensive test cases:")
for case in test_cases[:5]:  # Show first 5
    print(f"   • {case.case_id}: {case.description}")
print(f"   ... and {len(test_cases)-5} more cases")

## Comparative Evaluation

Run the evaluation on all test cases and analyze the differences between EM and CodeBLEU.

In [None]:
def run_comparative_evaluation(test_cases: List[TestCase]) -> pd.DataFrame:
    """Run comparative evaluation of EM vs CodeBLEU"""
    
    print("🔍 Running comparative evaluation...")
    
    # Initialize evaluators
    em_evaluator = ExactMatchEvaluator()
    codebleu_evaluator = SimplifiedCodeBLEUEvaluator()
    
    results = []
    
    for case in test_cases:
        print(f"Evaluating {case.case_id}...", end="\r")
        
        # EM evaluation
        em_score = em_evaluator.evaluate(case.generated_code, case.reference_code, case.language)
        em_details = em_evaluator.detailed_comparison(case.generated_code, case.reference_code, case.language)
        
        # CodeBLEU evaluation
        codebleu_result = codebleu_evaluator.evaluate(case.generated_code, case.reference_code, case.language)
        
        # Store results
        result = {
            'case_id': case.case_id,
            'description': case.description,
            'language': case.language,
            'em_score': em_score,
            'codebleu_score': codebleu_result.code_bleu,
            'ngram_bleu': codebleu_result.bleu_components['ngram_bleu'],
            'weighted_bleu': codebleu_result.bleu_components['weighted_bleu'],
            'ast_similarity': codebleu_result.bleu_components['ast_similarity'],
            'dataflow_similarity': codebleu_result.bleu_components['dataflow_similarity'],
            'token_overlap': em_details['token_overlap'],
            'sequence_similarity': em_details['sequence_similarity'],
            'length_ratio': em_details['length_ratio'],
            'expected_relationship': case.expected_relationship,
            'generated_length': em_details['generated_length'],
            'reference_length': em_details['reference_length']
        }
        
        # Calculate differences
        result['codebleu_minus_em'] = result['codebleu_score'] - result['em_score']
        result['advantage'] = 'CodeBLEU' if result['codebleu_score'] > result['em_score'] else ('EM' if result['em_score'] > result['codebleu_score'] else 'Tie')
        
        results.append(result)
    
    print("\n✅ Evaluation completed!")
    return pd.DataFrame(results)

# Run evaluation
evaluation_df = run_comparative_evaluation(test_cases)

# Display summary
print("\n📊 Evaluation Summary:")
print(f"Total test cases: {len(evaluation_df)}")
print(f"Cases where CodeBLEU > EM: {len(evaluation_df[evaluation_df['advantage'] == 'CodeBLEU'])}")
print(f"Cases where EM > CodeBLEU: {len(evaluation_df[evaluation_df['advantage'] == 'EM'])}")
print(f"Ties: {len(evaluation_df[evaluation_df['advantage'] == 'Tie'])}")

print("\n🔍 Detailed Results (first 5 cases):")
display_cols = ['case_id', 'em_score', 'codebleu_score', 'advantage', 'token_overlap']
print(evaluation_df[display_cols].head().to_string(index=False))

## Visualization and Analysis

Create comprehensive visualizations to understand how the metrics behave.

In [None]:
def create_comprehensive_visualizations(df: pd.DataFrame):
    """Create comprehensive visualizations for metric comparison"""
    
    fig, axes = plt.subplots(3, 2, figsize=(16, 18))
    
    # Plot 1: EM vs CodeBLEU scatter plot
    ax1 = axes[0, 0]
    
    # Color by advantage
    colors = {'CodeBLEU': 'red', 'EM': 'blue', 'Tie': 'gray'}
    for advantage, group in df.groupby('advantage'):
        ax1.scatter(group['em_score'], group['codebleu_score'], 
                   c=colors[advantage], label=f'{advantage} advantage', 
                   alpha=0.7, s=60)
    
    # Add diagonal line (points on it have equal EM and CodeBLEU scores)
    ax1.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='y = x (equal scores)')
    
    ax1.set_xlabel('Exact Match Score')
    ax1.set_ylabel('CodeBLEU Score')
    ax1.set_title('EM vs CodeBLEU Comparison')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_xlim(-0.05, 1.05)
    ax1.set_ylim(-0.05, 1.05)
    
    # Plot 2: Score differences by case
    ax2 = axes[0, 1]
    
    case_indices = range(len(df))
    ax2.bar(case_indices, df['codebleu_minus_em'], 
           color=['red' if x > 0 else 'blue' for x in df['codebleu_minus_em']],
           alpha=0.7)
    
    ax2.set_xlabel('Test Case Index')
    ax2.set_ylabel('CodeBLEU - EM')
    ax2.set_title('Score Differences by Test Case')
    ax2.axhline(y=0, color='black', linestyle='-', alpha=0.3)
    ax2.grid(True, alpha=0.3)
    
    # Add case labels
    ax2.set_xticks(case_indices[::2])  # Every other case
    ax2.set_xticklabels([df.iloc[i]['case_id'][:8] for i in case_indices[::2]], rotation=45)
    
    # Plot 3: Component analysis (CodeBLEU breakdown)
    ax3 = axes[1, 0]
    
    components = ['ngram_bleu', 'weighted_bleu', 'ast_similarity', 'dataflow_similarity']
    component_means = [df[comp].mean() for comp in components]
    
    bars = ax3.bar(components, component_means, 
                   color=['skyblue', 'lightgreen', 'orange', 'pink'], alpha=0.8)
    
    ax3.set_ylabel('Average Score')
    ax3.set_title('CodeBLEU Component Analysis')
    # Fix the tick positions before relabeling them
    ax3.set_xticks(range(len(components)))
    ax3.set_xticklabels([c.replace('_', '\n') for c in components], rotation=45)
    
    # Add value labels
    for bar, value in zip(bars, component_means):
        height = bar.get_height()
        ax3.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                f'{value:.3f}', ha='center', va='bottom')
    
    # Plot 4: Language-specific comparison
    ax4 = axes[1, 1]
    
    lang_comparison = df.groupby('language')[['em_score', 'codebleu_score']].mean()
    
    x_pos = np.arange(len(lang_comparison))
    width = 0.35
    
    ax4.bar(x_pos - width/2, lang_comparison['em_score'], width, 
           label='EM', alpha=0.8, color='blue')
    ax4.bar(x_pos + width/2, lang_comparison['codebleu_score'], width, 
           label='CodeBLEU', alpha=0.8, color='red')
    
    ax4.set_xlabel('Programming Language')
    ax4.set_ylabel('Average Score')
    ax4.set_title('Language-specific Performance')
    ax4.set_xticks(x_pos)
    ax4.set_xticklabels(lang_comparison.index)
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    
    # Plot 5: Token overlap vs metrics relationship
    ax5 = axes[2, 0]
    
    ax5.scatter(df['token_overlap'], df['em_score'], 
               alpha=0.7, color='blue', label='EM', s=60)
    ax5.scatter(df['token_overlap'], df['codebleu_score'], 
               alpha=0.7, color='red', label='CodeBLEU', s=60)
    
    ax5.set_xlabel('Token Overlap')
    ax5.set_ylabel('Score')
    ax5.set_title('Token Overlap vs Metric Scores')
    ax5.legend()
    ax5.grid(True, alpha=0.3)
    
    # Plot 6: Metric advantage distribution
    ax6 = axes[2, 1]
    
    advantage_counts = df['advantage'].value_counts()
    colors_pie = [colors[adv] for adv in advantage_counts.index]
    
    wedges, texts, autotexts = ax6.pie(advantage_counts.values, 
                                      labels=advantage_counts.index,
                                      colors=colors_pie,
                                      autopct='%1.1f%%',
                                      startangle=90)
    
    ax6.set_title('Metric Advantage Distribution')
    
    plt.tight_layout()
    plt.show()
    
    # Create correlation analysis
    print("\n📊 Correlation Analysis:")
    corr_matrix = df[['em_score', 'codebleu_score', 'token_overlap', 'sequence_similarity', 'ast_similarity']].corr()
    
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, annot=True, fmt='.3f', cmap='coolwarm', center=0, 
                square=True, cbar_kws={'label': 'Correlation Coefficient'})
    plt.title('Correlation Matrix of Evaluation Metrics')
    plt.tight_layout()
    plt.show()
    
    # Print statistical summary
    print("\n📈 Statistical Summary:")
    print(f"EM - Mean: {df['em_score'].mean():.3f}, Std: {df['em_score'].std():.3f}")
    print(f"CodeBLEU - Mean: {df['codebleu_score'].mean():.3f}, Std: {df['codebleu_score'].std():.3f}")
    print(f"Correlation (EM, CodeBLEU): {df['em_score'].corr(df['codebleu_score']):.3f}")
    
    # Identify most interesting cases
    print("\n🔍 Most Interesting Cases:")
    
    # Largest CodeBLEU advantage
    max_codebleu_adv = df.loc[df['codebleu_minus_em'].idxmax()]
    print(f"Largest CodeBLEU advantage: {max_codebleu_adv['case_id']} (+{max_codebleu_adv['codebleu_minus_em']:.3f})")
    
    # Perfect EM but low CodeBLEU (if any)
    perfect_em_low_cb = df[(df['em_score'] == 1.0) & (df['codebleu_score'] < 0.9)]
    if not perfect_em_low_cb.empty:
        print(f"Perfect EM, low CodeBLEU: {perfect_em_low_cb['case_id'].iloc[0]}")
    
    # High CodeBLEU but zero EM
    high_cb_zero_em = df[(df['em_score'] == 0.0) & (df['codebleu_score'] > 0.5)]
    if not high_cb_zero_em.empty:
        best_case = high_cb_zero_em.loc[high_cb_zero_em['codebleu_score'].idxmax()]
        print(f"High CodeBLEU, zero EM: {best_case['case_id']} (CodeBLEU: {best_case['codebleu_score']:.3f})")

# Create visualizations
create_comprehensive_visualizations(evaluation_df)

## Case Study Analysis

Deep dive into specific cases to understand how the metrics behave.

In [None]:
def analyze_specific_cases(df: pd.DataFrame, test_cases: List[TestCase]):
    """Analyze specific cases in detail"""
    
    print("🔍 Detailed Case Study Analysis")
    print("=" * 60)
    
    # Select interesting cases for analysis
    interesting_cases = [
        'semantic_equivalent',  # Should show CodeBLEU advantage
        'partial_correct',      # Partial credit scenario
        'control_structure',    # Different but equivalent structures
        'wrong_code',          # Both should be low
        'perfect_match'        # Both should be high
    ]
    
    tokenizer = CodeTokenizer()
    
    for case_id in interesting_cases:
        if case_id not in df['case_id'].values:
            continue
            
        # Get case data
        case_row = df[df['case_id'] == case_id].iloc[0]
        test_case = next(tc for tc in test_cases if tc.case_id == case_id)
        
        print(f"\n📋 Case: {case_id.upper()}")
        print(f"Description: {test_case.description}")
        print(f"Language: {test_case.language}")
        
        print(f"\nGenerated: {test_case.generated_code}")
        print(f"Reference: {test_case.reference_code}")
        
        # Tokenization analysis
        gen_tokens = tokenizer.tokenize(test_case.generated_code, test_case.language)
        ref_tokens = tokenizer.tokenize(test_case.reference_code, test_case.language)
        
        print(f"\n🔤 Tokenization:")
        print(f"Generated tokens: {gen_tokens}")
        print(f"Reference tokens: {ref_tokens}")
        
        # Token differences
        common_tokens = set(gen_tokens) & set(ref_tokens)
        gen_only = set(gen_tokens) - set(ref_tokens)
        ref_only = set(ref_tokens) - set(gen_tokens)
        
        print(f"Common tokens: {common_tokens}")
        if gen_only:
            print(f"Generated only: {gen_only}")
        if ref_only:
            print(f"Reference only: {ref_only}")
        
        # Scores
        print(f"\n📊 Scores:")
        print(f"EM Score: {case_row['em_score']:.4f}")
        print(f"CodeBLEU Score: {case_row['codebleu_score']:.4f}")
        print(f"Difference (CB - EM): {case_row['codebleu_minus_em']:.4f}")
        
        # Component breakdown
        print(f"\n🧩 CodeBLEU Components:")
        print(f"N-gram BLEU: {case_row['ngram_bleu']:.4f}")
        print(f"Weighted BLEU: {case_row['weighted_bleu']:.4f}")
        print(f"AST Similarity: {case_row['ast_similarity']:.4f}")
        print(f"DataFlow Similarity: {case_row['dataflow_similarity']:.4f}")
        
        # Additional metrics
        print(f"\n📏 Additional Metrics:")
        print(f"Token Overlap: {case_row['token_overlap']:.4f}")
        print(f"Sequence Similarity: {case_row['sequence_similarity']:.4f}")
        print(f"Length Ratio: {case_row['length_ratio']:.4f}")
        
        # Analysis
        print(f"\n💡 Analysis:")
        if case_row['em_score'] == 0 and case_row['codebleu_score'] > 0.3:
            print("✅ CodeBLEU successfully captures semantic similarity despite syntax differences")
        elif case_row['em_score'] == case_row['codebleu_score'] == 1.0:
            print("✅ Perfect match - both metrics agree")
        elif case_row['em_score'] > case_row['codebleu_score']:
            print("⚠️  EM higher than CodeBLEU - possible tokenization artifacts")
        else:
            print("📊 Mixed results - context-dependent evaluation")
        
        print("-" * 40)
    
    # Summary insights
    print("\n🎯 Key Insights from Case Studies:")
    
    semantic_cases = df[df['case_id'].isin(['semantic_equivalent', 'control_structure', 'js_function_style'])]
    if not semantic_cases.empty:
        avg_em = semantic_cases['em_score'].mean()
        avg_cb = semantic_cases['codebleu_score'].mean()
        print(f"• Semantically equivalent cases: EM={avg_em:.3f}, CodeBLEU={avg_cb:.3f}")
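        if avg_em > 0:
            print(f"  → CodeBLEU gives {((avg_cb - avg_em) / avg_em * 100):.1f}% higher scores for semantic equivalence")
        else:
            # EM is 0 for every case here, so a relative gain is undefined; report the absolute gap
            print(f"  → CodeBLEU scores {avg_cb - avg_em:.3f} higher (absolute) for semantic equivalence")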
    
    partial_cases = df[df['case_id'].isin(['partial_correct', 'variable_names'])]
    if not partial_cases.empty:
        avg_em = partial_cases['em_score'].mean()
        avg_cb = partial_cases['codebleu_score'].mean()
        print(f"• Partial correctness cases: EM={avg_em:.3f}, CodeBLEU={avg_cb:.3f}")
        print(f"  → CodeBLEU provides partial credit while EM gives zero")
    
    # Component importance analysis
    print(f"\n🧩 Component Importance:")
    components = ['ngram_bleu', 'weighted_bleu', 'ast_similarity', 'dataflow_similarity']
    
    for comp in components:
        corr_with_total = df[comp].corr(df['codebleu_score'])
        print(f"• {comp.replace('_', ' ').title()}: correlation with CodeBLEU = {corr_with_total:.3f}")
    
    # Recommendation based on analysis
    print(f"\n💡 Practical Recommendations:")
    
    high_agreement = len(df[abs(df['codebleu_minus_em']) < 0.1]) / len(df)
    print(f"• Metrics agree (±0.1) in {high_agreement:.1%} of cases")
    
    codebleu_advantage = len(df[df['codebleu_minus_em'] > 0.2]) / len(df)
    print(f"• CodeBLEU shows significant advantage (>0.2) in {codebleu_advantage:.1%} of cases")
    
    if codebleu_advantage > 0.3:
        print("  → Recommend using CodeBLEU for more nuanced evaluation")
    else:
        print("  → Both metrics provide valuable but different perspectives")

# Run case study analysis
analyze_specific_cases(evaluation_df, test_cases)

## Paper Findings Validation

Validate the paper's use of both metrics and understand how they complement each other.
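One part of the validation compares the histogram entropy of the two score distributions. As a sketch of why that comparison tends to favor a continuous metric: a strictly binary score can carry at most 1 bit, while a 10-bin histogram can carry up to $\log_2 10 \approx 3.32$ bits. The scores below are made-up illustrative values, not results from the paper or from this notebook.

```python
import numpy as np


def histogram_entropy(scores, bins: int = 10) -> float:
    """Shannon entropy (in bits) of scores binned over [0, 1]."""
    counts, _ = np.histogram(scores, bins=bins, range=(0, 1))
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * np.log2(probs)).sum())


# Illustrative values only
binary_scores = [0, 0, 0, 1]              # EM-like: every score is 0 or 1
graded_scores = [0.05, 0.35, 0.65, 0.95]  # CodeBLEU-like: spread over four bins

print(f"binary: {histogram_entropy(binary_scores):.3f} bits")  # binary: 0.811 bits
print(f"graded: {histogram_entropy(graded_scores):.3f} bits")  # graded: 2.000 bits
```

Higher entropy here only means the scores are more spread out; it does not by itself show that the extra spread tracks code quality.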

In [None]:
def validate_paper_metric_usage(df: pd.DataFrame):
    """Validate paper's approach to using both EM and CodeBLEU"""
    
    print("📄 Validating Paper's Metric Usage")
    print("=" * 50)
    
    # Paper's rationale analysis
    print("\n📊 Paper's Rationale for Using Both Metrics:")
    print("\n1. EXACT MATCH (EM):")
    print("   • Widely used in code review automation literature")
    print("   • Provides strict, unambiguous evaluation")
    print("   • Binary outcome: perfect match or not")
    print("   • Easy to interpret and compare across studies")
    
    print("\n2. CODEBLEU:")
    print("   • Addresses BLEU's limitations for code evaluation")
    print("   • Considers syntactic (AST) and semantic (DF) information")
    print("   • Provides partial credit for similar solutions")
    print("   • More robust to formatting and style differences")
    
    # Statistical analysis of paper's approach
    print(f"\n📈 Statistical Validation:")
    
    # Correlation analysis
    correlation = df['em_score'].corr(df['codebleu_score'])
    print(f"\n• Correlation between EM and CodeBLEU: {correlation:.3f}")
    
    if correlation > 0.7:
        print("  ✅ Strong positive correlation - metrics generally agree")
    elif correlation > 0.4:
        print("  ⚠️  Moderate correlation - metrics capture different aspects")
    else:
        print("  ⚠️  Weak correlation - metrics measure very different things")
    
    # Disagreement analysis
    strong_disagreement = df[abs(df['codebleu_minus_em']) > 0.3]
    print(f"\n• Cases with strong disagreement (>0.3 difference): {len(strong_disagreement)}/{len(df)}")
    
    if not strong_disagreement.empty:
        print("  Disagreement cases:")
        for _, row in strong_disagreement.iterrows():
            print(f"    - {row['case_id']}: EM={row['em_score']:.3f}, CB={row['codebleu_score']:.3f}")
    
    # Sensitivity analysis
    print(f"\n📏 Sensitivity Analysis:")
    
    # EM sensitivity to small changes
    em_zero_count = len(df[df['em_score'] == 0])
    em_one_count = len(df[df['em_score'] == 1])
    em_middle_count = len(df) - em_zero_count - em_one_count
    
    print(f"\n• EM Distribution:")
    print(f"  - Zero (no match): {em_zero_count}/{len(df)} ({em_zero_count/len(df):.1%})")
    print(f"  - One (perfect match): {em_one_count}/{len(df)} ({em_one_count/len(df):.1%})")
    print(f"  - Middle values: {em_middle_count}/{len(df)} ({em_middle_count/len(df):.1%})")
    
    if em_middle_count == 0:
        print("  → EM is purely binary (0 or 1) - very strict evaluation")
    
    # CodeBLEU distribution
    cb_quartiles = df['codebleu_score'].quantile([0.25, 0.5, 0.75])
    print(f"\n• CodeBLEU Distribution:")
    print(f"  - Q1 (25th percentile): {cb_quartiles[0.25]:.3f}")
    print(f"  - Median: {cb_quartiles[0.5]:.3f}")
    print(f"  - Q3 (75th percentile): {cb_quartiles[0.75]:.3f}")
    print(f"  → CodeBLEU provides gradual scoring across full range")
    
    # Information content analysis
    print(f"\n🔍 Information Content Analysis:")
    
    # Entropy calculation (measure of information)
    def calculate_entropy(scores, bins=10):
        hist, _ = np.histogram(scores, bins=bins, range=(0, 1))
        hist = hist / hist.sum()  # Normalize
        hist = hist[hist > 0]  # Remove zeros
        return -np.sum(hist * np.log2(hist))
    
    em_entropy = calculate_entropy(df['em_score'])
    cb_entropy = calculate_entropy(df['codebleu_score'])
    
    print(f"\n• Information Entropy:")
    print(f"  - EM entropy: {em_entropy:.3f} bits")
    print(f"  - CodeBLEU entropy: {cb_entropy:.3f} bits")
    
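    if cb_entropy > em_entropy:
        if em_entropy > 0:
            print(f"  → CodeBLEU provides {((cb_entropy - em_entropy) / em_entropy * 100):.1f}% more information")
        else:
            # All EM scores fall in one bin (entropy 0), so a relative gain is undefined
            print(f"  → CodeBLEU provides {cb_entropy - em_entropy:.3f} more bits of information")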
    
    # Practical implications from paper
    print(f"\n🎯 Practical Implications (from Paper Context):")
    
    print(f"\n1. FOR RESEARCH COMPARISON:")
    print(f"   • Use EM for compatibility with prior work")
    print(f"   • Report both metrics for comprehensive evaluation")
    print(f"   • EM enables direct comparison with existing approaches")
    
    print(f"\n2. FOR MODEL DEVELOPMENT:")
    print(f"   • Use CodeBLEU for training signal (gradual feedback)")
    print(f"   • Use EM for final evaluation (strict criterion)")
    print(f"   • Monitor both during fine-tuning process")
    
    print(f"\n3. FOR PRACTICAL DEPLOYMENT:")
    print(f"   • CodeBLEU better for user experience (partial credit)")
    print(f"   • EM better for automated systems (clear threshold)")
    print(f"   • Consider domain-specific weight tuning for CodeBLEU")
    
    # Validation of paper's dual-metric approach
    print(f"\n✅ Validation of Paper's Approach:")
    
    # Check if using both metrics provides more insight
    # Note: EM is binary, so ties in its top-5 are broken by row order (nlargest keeps the first rows)
    em_only_ranking = df.nlargest(5, 'em_score')['case_id'].tolist()
    cb_only_ranking = df.nlargest(5, 'codebleu_score')['case_id'].tolist()
    
    ranking_overlap = len(set(em_only_ranking) & set(cb_only_ranking))
    print(f"\n• Top-5 ranking overlap: {ranking_overlap}/5 cases")
    
    if ranking_overlap < 5:
        print(f"  → Using both metrics reveals different aspects of quality")
        print(f"  → Paper's dual-metric approach is justified")
    else:
        print(f"  → Metrics largely agree on quality ranking")
    
    # Final recommendation
    print(f"\n🎯 FINAL VALIDATION:")
    print(f"\n✅ Paper's use of both EM and CodeBLEU is JUSTIFIED because:")
    print(f"   1. EM provides strict, literature-compatible evaluation")
    print(f"   2. CodeBLEU captures semantic similarity missed by EM")
    print(f"   3. Together they provide comprehensive code quality assessment")
    print(f"   4. Different use cases benefit from different metric emphases")
    
    if correlation < 0.9:
        print(f"   5. Metrics are sufficiently different to justify using both")
    
    return {
        'correlation': correlation,
        'em_entropy': em_entropy,
        'cb_entropy': cb_entropy,
        'disagreement_cases': len(strong_disagreement),
        'ranking_overlap': ranking_overlap
    }

# Validate paper's metric usage
validation_results = validate_paper_metric_usage(evaluation_df)

## Practical Implementation Guide

Generate a practical guide for implementing evaluation metrics in code review systems.
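The guide's weight-tuning advice rests on the CodeBLEU formula quoted earlier: a weighted sum of four component scores whose weights sum to 1. A minimal sketch of that combination step, assuming the four component scores are already computed; the function name `combine_codebleu` and the example numbers are hypothetical.

```python
from typing import Sequence


def combine_codebleu(
    components: Sequence[float],
    weights: Sequence[float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """Weighted sum of (BLEU, weighted BLEU, AST match, data-flow match)."""
    if len(components) != 4 or len(weights) != 4:
        raise ValueError("expected four components and four weights")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * c for w, c in zip(weights, components))


# Hypothetical component scores for one generated/reference pair
scores = (0.40, 0.50, 0.80, 0.90)
print(combine_codebleu(scores))                        # uniform weights
print(combine_codebleu(scores, (0.1, 0.1, 0.4, 0.4)))  # structure-heavy weights
```

Shifting weight toward the AST and data-flow terms raises the score for outputs that are structurally right but lexically different, so the same pair can rank differently under different weightings.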

In [None]:
def generate_implementation_guide(validation_results: Dict[str, float], df: pd.DataFrame):
    """Generate practical implementation guide for evaluation metrics"""
    
    # Raw f-string so regex escapes in the embedded snippets (\s, \n) are printed literally
    guide = rf"""
# 🛠️ Practical Implementation Guide: EM vs CodeBLEU

## 📊 Key Findings from Analysis

### Metric Characteristics
- **Correlation**: {validation_results['correlation']:.3f} (Pearson correlation between EM and CodeBLEU)
- **Information Content**: CodeBLEU entropy minus EM entropy = {(validation_results['cb_entropy'] - validation_results['em_entropy']):.3f} bits
- **Disagreement Cases**: {validation_results['disagreement_cases']}/{len(df)} cases show significant differences
- **Ranking Agreement**: {validation_results['ranking_overlap']}/5 top cases overlap

### When Each Metric Excels

#### Exact Match (EM) is Better For:
- 🎯 **Binary decision making** (accept/reject code)
- 📚 **Literature comparison** (standardized across papers)
- 🔍 **Debugging evaluation** (clear pass/fail)
- ⚡ **Fast computation** (simple token comparison)
- 📊 **Significance testing** (binary outcomes suit paired tests such as McNemar's)

#### CodeBLEU is Better For:
- 🎓 **Training feedback** (gradual improvement signal)
- 👥 **User experience** (partial credit recognition)
- 🔄 **Iterative development** (progress tracking)
- 🌐 **Cross-language evaluation** (structural similarity)
- 🎨 **Style-agnostic assessment** (semantic focus)

## 🔧 Implementation Recipes

### Recipe 1: Research Paper Evaluation
```python
def research_evaluation(generated_codes, reference_codes):
    """Standard evaluation for research papers"""
    
    em_scores = []
    codebleu_scores = []
    
    for gen, ref in zip(generated_codes, reference_codes):
        # Calculate both metrics
        em_score = exact_match(gen, ref)
        cb_score = code_bleu(gen, ref)
        
        em_scores.append(em_score)
        codebleu_scores.append(cb_score)
    
    # Report both for comprehensive evaluation
    return {{
        'exact_match': np.mean(em_scores),
        'code_bleu': np.mean(codebleu_scores),
        'em_std': np.std(em_scores),
        'cb_std': np.std(codebleu_scores)
    }}
```

### Recipe 2: Model Training Feedback
```python
def training_loss_with_codebleu(logits, target_ids, decoded_output, target_code, language='java'):
    """Track CodeBLEU alongside the training loss"""
    
    # Primary loss (e.g., cross-entropy); this is the only term that carries gradients
    primary_loss = cross_entropy_loss(logits, target_ids)
    
    # CodeBLEU is computed on decoded text and is not differentiable,
    # so adding (1 - CodeBLEU) to the loss would not change the gradients.
    # Log it as feedback, or use it as a reward in an RL-style objective.
    cb_score = code_bleu(decoded_output, target_code, language)
    
    return primary_loss, cb_score
```

### Recipe 3: Production System Evaluation
```python
def production_evaluation(generated_code, reference_code, threshold=0.7):
    """Dual-metric evaluation for production systems"""
    
    em_score = exact_match(generated_code, reference_code)
    cb_score = code_bleu(generated_code, reference_code)
    
    # Decision logic
    if em_score == 1.0:
        return "PERFECT", 1.0
    elif cb_score >= threshold:
        return "ACCEPTABLE", cb_score
    else:
        return "REJECT", cb_score  # EM is 0 on this branch, so report the CodeBLEU score
```

### Recipe 4: Fine-tuning Progress Monitoring
```python
def monitor_finetuning_progress(epoch_results):
    """Monitor both metrics during fine-tuning"""
    
    epochs = list(range(len(epoch_results)))
    em_scores = [r['em'] for r in epoch_results]
    cb_scores = [r['codebleu'] for r in epoch_results]
    
    # Plot progress
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    plt.plot(epochs, em_scores, 'b-', label='Exact Match')
    plt.xlabel('Epoch')
    plt.ylabel('EM Score')
    plt.title('EM Progress (Strict Evaluation)')
    
    plt.subplot(1, 2, 2)
    plt.plot(epochs, cb_scores, 'r-', label='CodeBLEU')
    plt.xlabel('Epoch')
    plt.ylabel('CodeBLEU Score')
    plt.title('CodeBLEU Progress (Gradual Feedback)')
    
    plt.tight_layout()
    plt.show()
    
    # Early stopping: stop once a metric has not improved over the last 3 epochs
    min_delta = 0.001  # assumed tolerance; tune per task
    if len(em_scores) >= 4:
        if max(em_scores[-3:]) - em_scores[-4] < min_delta:  # EM plateau
            return "CONVERGED_EM"
        elif max(cb_scores[-3:]) - cb_scores[-4] < min_delta:  # CodeBLEU plateau
            return "CONVERGED_CB"
    
    return "CONTINUE"
```

## ⚙️ Configuration Guidelines

### CodeBLEU Weight Tuning
```python
# Uniform weights, the usual default in CodeBLEU implementations
DEFAULT_WEIGHTS = (0.25, 0.25, 0.25, 0.25)  # BLEU, Weighted-BLEU, AST, DF

# Illustrative code-review weights (a starting point, not validated here)
CODE_REVIEW_WEIGHTS = (0.2, 0.3, 0.3, 0.2)  # Emphasize structure and keywords

# Language-specific adjustments (illustrative; tune on held-out data)
LANGUAGE_WEIGHTS = {{
    'java': (0.2, 0.3, 0.3, 0.2),      # Structure-heavy
    'python': (0.3, 0.2, 0.25, 0.25),  # More flexible syntax
    'javascript': (0.25, 0.25, 0.2, 0.3),  # Dynamic typing emphasis
    'go': (0.2, 0.3, 0.35, 0.15)       # Simple, structured
}}
```

### Tokenization Best Practices
```python
import re

def robust_tokenization(code, language='java'):
    """Regex-based tokenization sketch (not a full lexer)"""
    
    code = code.strip()
    
    # Language-specific preprocessing must run before whitespace is collapsed,
    # otherwise the newlines these patterns look for are already gone
    if language == 'python':
        # Mark indented lines (does not track indentation depth)
        code = re.sub(r'\n[ \t]+', ' INDENT ', code)
    elif language == 'javascript':
        # Mark line breaks that are not followed by a closing bracket
        code = re.sub(r'\n(?!\s*[}}\])])', ' NEWLINE ', code)
    
    # Normalize remaining whitespace
    code = re.sub(r'\s+', ' ', code)
    
    # Standard tokenization
    tokens = re.findall(r'\w+|[{{}}()\[\];,.<>=!&|+\-*/"\']', code)
    return tokens  # keep case: identifiers are case-sensitive
```

## 🚨 Common Pitfalls and Solutions

### Pitfall 1: Over-reliance on EM
**Problem**: Missing semantically correct solutions
**Solution**: Use CodeBLEU for initial filtering, EM for final validation

### Pitfall 2: CodeBLEU Weight Sensitivity
**Problem**: Results vary with component weights
**Solution**: Validate weights on held-out data, use domain-specific tuning

### Pitfall 3: Language Bias
**Problem**: Metrics favor certain programming languages
**Solution**: Language-specific normalization and weight adjustments

### Pitfall 4: Evaluation Dataset Quality
**Problem**: Poor reference code affects both metrics
**Solution**: Multiple reference implementations, human validation

## 📈 Performance Optimization

### Fast EM Implementation
```python
import numpy as np

def fast_exact_match(generated_codes, reference_codes):
    """Batch EM computation"""
    
    # Batch tokenization
    gen_tokens = [tokenize(code) for code in generated_codes]
    ref_tokens = [tokenize(code) for code in reference_codes]
    
    # List comparison is cheap, so a plain loop is enough here; a
    # multiprocessing.Pool cannot pickle a lambda and would add overhead
    results = [1.0 if g == r else 0.0 for g, r in zip(gen_tokens, ref_tokens)]
    
    return np.array(results)
```

### Cached CodeBLEU
```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_code_bleu(gen_code, ref_code, language):
    """Cached CodeBLEU for repeated evaluations"""
    return compute_code_bleu(gen_code, ref_code, language)
```

## 🎯 Decision Matrix: When to Use Which Metric

| Use Case | Primary Metric | Secondary Metric | Rationale |
|----------|---------------|------------------|----------|
| Research Paper | EM | CodeBLEU | Literature compatibility |
| Model Training | CodeBLEU | EM | Gradual feedback |
| Production Filtering | CodeBLEU | EM | User experience |
| Final Validation | EM | CodeBLEU | Strict quality gate |
| Cross-language Study | CodeBLEU | EM | Structural similarity |
| Automated Testing | EM | CodeBLEU | Binary pass/fail |
| Human Evaluation | CodeBLEU | EM | Partial credit |

## 🚀 Advanced Techniques

### Ensemble Evaluation
```python
def ensemble_evaluation(gen_code, ref_code, weights=(0.6, 0.4)):
    """Combine metrics with fixed weights (defaults are illustrative, not learned)"""
    
    em_score = exact_match(gen_code, ref_code)
    cb_score = code_bleu(gen_code, ref_code)
    
    # Weighted combination
    ensemble_score = weights[0] * em_score + weights[1] * cb_score
    
    return ensemble_score
```

### Adaptive Thresholding
```python
def adaptive_threshold(cb_score, difficulty_level):
    """Adjust CodeBLEU threshold based on task difficulty"""
    
    base_threshold = 0.7
    difficulty_adjustment = {{
        'easy': 0.1,      # Higher threshold for easy tasks
        'medium': 0.0,    # Standard threshold
        'hard': -0.15     # Lower threshold for hard tasks
    }}
    
    threshold = base_threshold + difficulty_adjustment.get(difficulty_level, 0.0)
    return cb_score >= threshold
```

## ✅ Validation Checklist

Before deploying evaluation metrics:

- [ ] **Tokenization tested** on target programming languages
- [ ] **Metric correlation analyzed** on your specific dataset
- [ ] **Component weights tuned** for your domain
- [ ] **Performance benchmarked** for your scale
- [ ] **Edge cases handled** (empty code, syntax errors)
- [ ] **Human evaluation baseline** established
- [ ] **Cross-language consistency** verified
- [ ] **Threshold sensitivity** analyzed

## 📚 References and Further Reading

- **Original CodeBLEU Paper**: Ren et al. "CodeBLEU: a Method for Automatic Evaluation of Code Synthesis"
- **Code Review Automation**: Li et al. "Automating Code Review Activities by Large-scale Pre-training"
- **Evaluation Best Practices**: Our analysis shows complementary nature of metrics
"""
    
    return guide

# Generate implementation guide
implementation_guide = generate_implementation_guide(validation_results, evaluation_df)
print(implementation_guide)

print("\n" + "="*80)
print("🎓 FOCUSED LEARNING COMPLETED: CodeBLEU vs Exact Match Evaluation")
print("✅ Deep understanding of evaluation metrics for code generation achieved!")
print("🔧 Ready to implement robust evaluation systems for code review automation!")
print("="*80)