# Focused Learning: Code Review Comment Generation with Transformers

## Learning Objective
Master the technical implementation of neural code review comment generation, understanding how T5-based models are adapted for the code review domain and how to evaluate generated comments across multiple dimensions.

## Paper Reference
- **Section 2.2**: Review Comment Generation (Pages 4-5)
- **Section 4.4-4.5**: Evaluation Metrics and Manual Evaluation Process (Pages 12-16)
- **Table 2**: Code Review Comment Categories
- **Figures 3-4**: Examples of comment evaluation

## Why This Topic is Complex
1. **Multimodal Input**: Code changes + context require specialized encoding
2. **Diverse Output Space**: Comments range from syntax issues to design discussions
3. **Evaluation Challenges**: No single metric captures comment quality
4. **Domain Adaptation**: General-purpose LLMs struggle without code-review-specific training

## 1. Understanding the Code Review Generation Task

In [None]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer, T5Config
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass
import re
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

### 1.1 Problem Formulation

In [None]:
@dataclass
class CodeReviewExample:
    """Represents a single code review training example"""
    # Input
    code_change_before: str  # Code before modification (H_pre)
    code_change_after: str   # Code after modification (H_post)
    
    # Output
    review_comment: str      # Natural language review (R_nl)
    
    # Metadata
    reviewer_id: str
    repository: str
    file_path: str
    programming_language: str
    
    def to_model_input(self, include_context: bool = True) -> str:
        """Format for model input following CodeReviewer approach"""
        if include_context:
            # Include file context
            return f"Review the change in {self.file_path}:\n<OLD>\n{self.code_change_before}\n</OLD>\n<NEW>\n{self.code_change_after}\n</NEW>"
        else:
            # Simple diff format
            return f"<OLD>\n{self.code_change_before}\n</OLD>\n<NEW>\n{self.code_change_after}\n</NEW>"

# Generate example code review scenarios
example_reviews = [
    CodeReviewExample(
        code_change_before="if (user.role == 'admin') {\n    processRequest(request);\n}",
        code_change_after="if (user.role == 'admin') {\n    processAdminRequest(request);\n}",
        review_comment="Add validation check: if (!validateRequest(request)) return; before processing.",
        reviewer_id="exp_reviewer_1",
        repository="web-app",
        file_path="src/auth/handler.js",
        programming_language="javascript"
    ),
    CodeReviewExample(
        code_change_before="conn = database.connect()\ndata = conn.query(sql)",
        code_change_after="conn = database.connect()\ndata = conn.query(sql)\nresults = process(data)",
        review_comment="Resource leak: connection is never closed. Use try-finally or context manager.",
        reviewer_id="exp_reviewer_2",
        repository="data-service",
        file_path="src/db/query.py",
        programming_language="python"
    ),
    CodeReviewExample(
        code_change_before="for i in range(len(items)):\n    total += items[i].price",
        code_change_after="for item in items:\n    total += item.price",
        review_comment="Good refactoring! This is more Pythonic and readable.",
        reviewer_id="mid_reviewer_1",
        repository="e-commerce",
        file_path="src/cart/calculator.py",
        programming_language="python"
    )
]

# Visualize the task
print("Code Review Comment Generation Task:")
print("====================================\n")
for i, example in enumerate(example_reviews[:2]):
    print(f"Example {i+1}:")
    print(f"File: {example.file_path}")
    print(f"\nBefore:\n{example.code_change_before}")
    print(f"\nAfter:\n{example.code_change_after}")
    print(f"\nReference Review: {example.review_comment}")
    print("\n" + "="*50 + "\n")

### 1.2 Code Change Representation

In [None]:
class CodeChangeProcessor:
    """Process code changes for model input"""
    
    def __init__(self, max_length: int = 512):
        self.max_length = max_length
        
    def create_diff_representation(self, before: str, after: str) -> str:
        """Create a simplified positional line diff (not a true unified diff)"""
        before_lines = before.strip().split('\n')
        after_lines = after.strip().split('\n')
        
        diff_lines = []
        
        # Compare lines at the same index; an inserted or deleted line
        # therefore marks every later line as changed
        max_lines = max(len(before_lines), len(after_lines))
        
        for i in range(max_lines):
            if i < len(before_lines) and i < len(after_lines):
                if before_lines[i] != after_lines[i]:
                    diff_lines.append(f"- {before_lines[i]}")
                    diff_lines.append(f"+ {after_lines[i]}")
                else:
                    diff_lines.append(f"  {before_lines[i]}")
            elif i < len(before_lines):
                diff_lines.append(f"- {before_lines[i]}")
            else:
                diff_lines.append(f"+ {after_lines[i]}")
        
        return "\n".join(diff_lines)
    
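    def create_ast_aware_representation(self, code: str, language: str) -> Dict:
        """Approximate structural features with regexes (not a real AST)"""
        # In a real implementation, parse the code with tree-sitter or similar.
        # The condition/loop/try patterns assume C-style syntax (e.g. JavaScript);
        # they miss Python's `if x:` and `for x in y:`.
        ast_features = {
            'has_function_call': bool(re.search(r'\w+\(', code)),
            'has_condition': bool(re.search(r'\bif\s*\(', code)),
            'has_loop': bool(re.search(r'\b(for|while)\s*\(', code)),
            'has_try_catch': bool(re.search(r'\btry\s*{', code)),
            # Count declarations and plain assignments; skip `==` and `=>`
            'variable_count': len(re.findall(
                r'\b(?:var|let|const)\s+\w+|\b\w+\s*=(?![=>])', code))
        }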
        return ast_features
    
    def create_semantic_tokens(self, code: str) -> List[str]:
        """Extract semantic tokens from code"""
        # Extract identifiers, keywords, operators
        tokens = re.findall(r'\b\w+\b|[(){}\[\];,.]|[+\-*/=<>!]+', code)
        
        # Classify tokens
        keywords = {'if', 'else', 'for', 'while', 'try', 'catch', 'return', 
                   'function', 'class', 'const', 'let', 'var'}
        
        semantic_tokens = []
        for token in tokens:
            if token in keywords:
                semantic_tokens.append(f"<KEYWORD:{token}>")
            elif token.isdigit():
                semantic_tokens.append("<NUMBER>")
            elif re.match(r'^[A-Z_]+$', token):
                semantic_tokens.append("<CONSTANT>")
            elif re.match(r'^[a-z_][a-zA-Z0-9_]*$', token):
                semantic_tokens.append(f"<ID:{token}>")
            else:
                semantic_tokens.append(token)
        
        return semantic_tokens

# Demonstrate different representations
processor = CodeChangeProcessor()
example = example_reviews[0]

print("Different Code Change Representations:")
print("====================================\n")

# 1. Raw representation
print("1. Raw Input:")
print(example.to_model_input(include_context=True))
print("\n" + "-"*50 + "\n")

# 2. Diff representation
print("2. Diff Representation:")
diff = processor.create_diff_representation(example.code_change_before, example.code_change_after)
print(diff)
print("\n" + "-"*50 + "\n")

# 3. AST features
print("3. AST Features:")
ast_features = processor.create_ast_aware_representation(example.code_change_after, example.programming_language)
for feature, value in ast_features.items():
    print(f"  {feature}: {value}")
print("\n" + "-"*50 + "\n")

# 4. Semantic tokens
print("4. Semantic Tokens:")
tokens = processor.create_semantic_tokens(example.code_change_after)
print(" ".join(tokens[:20]) + "...")
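
The positional diff above pairs lines by index, so a single inserted line shifts every later comparison. As a hedged alternative, Python's standard `difflib.unified_diff` aligns lines by content instead. The helper name `unified_diff_text` and the `old`/`new` labels are illustrative assumptions, not part of the paper's pipeline.

```python
import difflib

def unified_diff_text(before: str, after: str, context: int = 3) -> str:
    """Return a unified diff of two code snippets as a single string."""
    diff = difflib.unified_diff(
        before.strip().split("\n"),
        after.strip().split("\n"),
        fromfile="old",   # placeholder labels, not real file paths
        tofile="new",
        n=context,        # number of unchanged context lines around each hunk
        lineterm="",      # input lines carry no trailing newline
    )
    return "\n".join(diff)

before = "conn = database.connect()\ndata = conn.query(sql)"
after = "conn = database.connect()\ndata = conn.query(sql)\nresults = process(data)"
print(unified_diff_text(before, after))
```

Only the added line is prefixed with `+`; the two unchanged lines appear as context, which keeps the model input focused on what actually changed.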

## 2. Transformer Architecture for Code Review Generation

In [None]:
class CodeReviewT5Model:
    """Simplified T5-based model for code review generation"""
    
    def __init__(self, model_name: str = "t5-small", max_length: int = 512):
        self.tokenizer = T5Tokenizer.from_pretrained(model_name)
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)
        self.max_length = max_length
        
        # Add code-specific tokens
        self._add_code_tokens()
        
    def _add_code_tokens(self):
        """Add code-specific special tokens"""
        special_tokens = [
            '<OLD>', '</OLD>', '<NEW>', '</NEW>',
            '<FUNC>', '<VAR>', '<CLASS>', '<COMMENT>',
            '<ADDED>', '<REMOVED>', '<MODIFIED>'
        ]
        
        self.tokenizer.add_special_tokens({
            'additional_special_tokens': special_tokens
        })
        self.model.resize_token_embeddings(len(self.tokenizer))
        
    def prepare_input(self, code_before: str, code_after: str, 
                     task_prefix: str = "review: ") -> Dict:
        """Prepare input for T5 model"""
        # Format input
        input_text = f"{task_prefix}<OLD>{code_before}</OLD><NEW>{code_after}</NEW>"
        
        # Tokenize
        inputs = self.tokenizer(
            input_text,
            max_length=self.max_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt"
        )
        
        return inputs
    
    def generate_review(self, code_before: str, code_after: str,
                       num_beams: int = 4,
                       max_new_tokens: int = 100,
                       temperature: float = 0.7) -> str:
        """Generate code review comment"""
        inputs = self.prepare_input(code_before, code_after)
        
        # Generate
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                num_beams=num_beams,
                temperature=temperature,
                do_sample=True,
                top_p=0.95
            )
        
        # Decode
        review = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return review
    
    def visualize_attention(self, code_before: str, code_after: str):
        """Visualize attention patterns (simplified)"""
        inputs = self.prepare_input(code_before, code_after)
        
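        # Get model outputs with attention.
        # NOTE: the first 10 input tokens are fed to the decoder as a stand-in for a
        # review prefix, so the "Output Tokens" axis below shows those tokens, not a
        # generated comment. For real analysis, pass the generated review's token ids.
        with torch.no_grad():
            outputs = self.model(
                **inputs,
                decoder_input_ids=inputs['input_ids'][:, :10],  # First 10 tokens
                output_attentions=True
            )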
        
        # Extract cross-attention from last layer
        cross_attention = outputs.cross_attentions[-1][0, 0, :, :].numpy()
        
        # Plot attention heatmap
        plt.figure(figsize=(10, 8))
        sns.heatmap(cross_attention[:10, :20], cmap='Blues', cbar=True)
        plt.xlabel('Input Tokens')
        plt.ylabel('Output Tokens')
        plt.title('Cross-Attention Visualization')
        plt.tight_layout()
        plt.show()

# Note: Due to model size, we'll simulate outputs instead of loading actual model
print("Code Review T5 Model Architecture:")
print("=================================")
print("\nModel Components:")
print("1. Encoder: Processes code changes (before/after)")
print("2. Decoder: Generates natural language review")
print("3. Cross-Attention: Links code elements to review content")
print("\nSpecial Tokens for Code:")
print("- <OLD>, </OLD>: Mark original code")
print("- <NEW>, </NEW>: Mark modified code")
print("- <ADDED>, <REMOVED>: Mark diff operations")

# Simulate model outputs for demonstration
simulated_reviews = {
    "example_1": "Missing validation check. Add: if (!validateRequest(request)) return;",
    "example_2": "Resource leak detected. Use try-finally to ensure connection.close().",
    "example_3": "Good refactoring! More idiomatic Python."
}
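
To make the cross-attention component listed above more concrete, here is a minimal NumPy sketch of generic single-head scaled dot-product attention between decoder states (queries) and encoder states (keys and values). It is an illustration under assumptions: the matrices are random toy data, there are no learned projections, and T5's actual layers are multi-headed and differ in details such as scaling.

```python
import numpy as np

def cross_attention(decoder_states: np.ndarray, encoder_states: np.ndarray):
    """Single-head scaled dot-product attention without learned projections."""
    d = decoder_states.shape[-1]
    scores = decoder_states @ encoder_states.T / np.sqrt(d)  # (tgt_len, src_len)
    scores -= scores.max(axis=-1, keepdims=True)             # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over source tokens
    context = weights @ encoder_states                       # (tgt_len, d)
    return weights, context

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))  # 6 toy code tokens
decoder_states = rng.normal(size=(3, 8))  # 3 toy review tokens
weights, context = cross_attention(decoder_states, encoder_states)
print(weights.shape, context.shape)
print(weights.sum(axis=-1))  # each row is a distribution over code tokens
```

Each row of `weights` says how strongly one review token attends to each code token, which is the quantity a cross-attention heatmap displays.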

## 3. Training Process with Experience Awareness

In [None]:
class ExperienceAwareTrainer:
    """Training pipeline with ELF (experience-aware loss function) integration"""
    
    def __init__(self, model, tokenizer, device='cpu'):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        self.training_history = []
        
    def create_training_batch(self, examples: List[CodeReviewExample], 
                            reviewer_metrics: Dict) -> Dict:
        """Create batch with experience weights"""
        batch_inputs = []
        batch_targets = []
        batch_weights = []
        
        for example in examples:
            # Prepare input
            input_text = f"review: <OLD>{example.code_change_before}</OLD><NEW>{example.code_change_after}</NEW>"
            batch_inputs.append(input_text)
            
            # Prepare target
            batch_targets.append(example.review_comment)
            
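            # Experience weight: exp(1 + mean of the reviewer's ACO and RSO ownership).
            # Reviewers missing from reviewer_metrics fall back to ownership 0.1.
            metrics = reviewer_metrics.get(example.reviewer_id, {'aco': 0.1, 'rso': 0.1})
            weight = np.exp(1 + (metrics['aco'] + metrics['rso']) / 2)
            batch_weights.append(weight)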
        
        return {
            'inputs': batch_inputs,
            'targets': batch_targets,
            'weights': torch.tensor(batch_weights)
        }
    
    def visualize_training_dynamics(self, n_epochs: int = 10):
        """Visualize how ELF affects training"""
        # Simulate training dynamics
        epochs = range(n_epochs)
        
        # Different reviewer types
        high_exp_loss = [4.5 - 0.3*e - 0.05*e*np.random.random() for e in epochs]
        mid_exp_loss = [4.5 - 0.25*e - 0.05*e*np.random.random() for e in epochs]
        low_exp_loss = [4.5 - 0.2*e - 0.05*e*np.random.random() for e in epochs]
        
        # ELF-weighted average
        weights = [7.39, 4.48, 2.72]  # High, mid, low experience weights
        weighted_loss = [
            (weights[0]*h + weights[1]*m + weights[2]*l) / sum(weights)
            for h, m, l in zip(high_exp_loss, mid_exp_loss, low_exp_loss)
        ]
        
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
        
        # Plot 1: Loss by experience level
        ax1.plot(epochs, high_exp_loss, 'r-', label='High Experience', linewidth=2)
        ax1.plot(epochs, mid_exp_loss, 'g--', label='Mid Experience', linewidth=2)
        ax1.plot(epochs, low_exp_loss, 'b:', label='Low Experience', linewidth=2)
        ax1.plot(epochs, weighted_loss, 'k-', label='ELF Weighted', linewidth=3)
        
        ax1.set_xlabel('Epoch')
        ax1.set_ylabel('Loss')
        ax1.set_title('Training Loss by Reviewer Experience')
        ax1.legend()
        ax1.grid(True, alpha=0.3)
        
        # Plot 2: Gradient influence
        gradient_influence = [
            weights[0] / sum(weights) * 100,
            weights[1] / sum(weights) * 100,
            weights[2] / sum(weights) * 100
        ]
        
        ax2.bar(['High Exp', 'Mid Exp', 'Low Exp'], gradient_influence, 
                color=['red', 'green', 'blue'], alpha=0.7)
        ax2.set_ylabel('Gradient Influence (%)')
        ax2.set_title('Relative Influence on Model Updates')
        ax2.grid(True, alpha=0.3)
        
        # Add value labels
        for i, v in enumerate(gradient_influence):
            ax2.text(i, v + 1, f'{v:.1f}%', ha='center', va='bottom')
        
        plt.tight_layout()
        plt.show()
        
        print("\nTraining Insights (from simulated curves, not measured results):")
        print(f"- High experience reviewers have {weights[0]/weights[2]:.1f}x more influence than low experience reviewers")
        print("- The weighted loss tracks the high-experience curve more closely than a uniform average would")
        print("- The model is pushed to prioritize patterns from experienced reviewers")

# Create mock reviewer metrics
mock_reviewer_metrics = {
    'exp_reviewer_1': {'aco': 0.35, 'rso': 0.45},
    'exp_reviewer_2': {'aco': 0.30, 'rso': 0.40},
    'mid_reviewer_1': {'aco': 0.10, 'rso': 0.20},
    'low_reviewer_1': {'aco': 0.02, 'rso': 0.05}
}

# Demonstrate training
trainer = ExperienceAwareTrainer(None, None)  # Mock trainer
batch = trainer.create_training_batch(example_reviews[:3], mock_reviewer_metrics)

print("Training Batch with Experience Weights:")
print("======================================")
for i, (target, weight) in enumerate(zip(batch['targets'], batch['weights'])):
    print(f"\nExample {i+1}:")
    print(f"Target: {target[:50]}...")
    print(f"Experience Weight: {weight:.3f}")

# Visualize training dynamics
trainer.visualize_training_dynamics()
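
The trainer above builds per-example weights but never applies them to a loss. The sketch below shows one way they could enter the objective: a weight-normalized average of per-example losses. Normalizing by the sum of weights is an assumption made here for illustration, and the per-example loss values are hypothetical; the paper's exact formulation may differ.

```python
import math
from typing import List

def experience_weight(aco: float, rso: float) -> float:
    """exp(1 + mean ownership), mirroring create_training_batch."""
    return math.exp(1 + (aco + rso) / 2)

def weighted_batch_loss(example_losses: List[float], weights: List[float]) -> float:
    """Weight-normalized average of per-example losses."""
    total = sum(w * loss for w, loss in zip(weights, example_losses))
    return total / sum(weights)

# Hypothetical per-example losses (e.g. mean token cross-entropy per review)
losses = [2.0, 3.0]
weights = [experience_weight(0.35, 0.45),   # experienced reviewer
           experience_weight(0.02, 0.05)]   # low-experience reviewer
uniform = sum(losses) / len(losses)
weighted = weighted_batch_loss(losses, weights)
print(f"uniform={uniform:.3f} weighted={weighted:.3f}")
```

Because the experienced reviewer's example carries the larger weight, the weighted loss sits closer to that example's loss than the uniform mean does, so its gradient contributes more to each update.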

## 4. Comprehensive Evaluation Framework

In [None]:
class CodeReviewEvaluator:
    """Comprehensive evaluation following paper's methodology"""
    
    def __init__(self):
        # Define comment categories from Table 2
        self.functional_categories = [
            'functional_defect', 'validation', 'logical', 
            'interface', 'resource', 'support', 'timing'
        ]
        
        self.evolvability_categories = [
            'solution_approach', 'documentation', 'organization',
            'alternate_output', 'naming_convention', 'visual_representation'
        ]
        
        self.discussion_categories = ['question', 'design_discussion']
        
    def evaluate_semantic_equivalence(self, generated: str, reference: str) -> bool:
        """Check if generated comment has same intent as reference"""
        # Extract key concepts
        keywords = ['validation', 'check', 'leak', 'resource', 'close', 
                   'error', 'missing', 'add', 'use', 'refactor']
        
        gen_concepts = set([kw for kw in keywords if kw in generated.lower()])
        ref_concepts = set([kw for kw in keywords if kw in reference.lower()])
        
        # Check overlap
        if len(ref_concepts) == 0:
            return len(gen_concepts) == 0
        
        overlap = len(gen_concepts.intersection(ref_concepts)) / len(ref_concepts)
        return overlap > 0.5
    
    def evaluate_applicability(self, generated: str, code_change: str) -> bool:
        """Check if comment is applicable to the code change"""
        # Extract code elements
        code_tokens = set(re.findall(r'\b\w+\b', code_change.lower()))
        comment_tokens = set(re.findall(r'\b\w+\b', generated.lower()))
        
        # Check if comment references code elements
        overlap = len(code_tokens.intersection(comment_tokens))
        
        # Also check for general applicability patterns
        # Word boundaries stop 'use' from matching inside words such as 'because'
        applicable_patterns = [
            r'\bshould\s+\w+',
            r'\bconsider\s+\w+',
            r'\bmissing\s+\w+',
            r'\badd\s+\w+',
            r'\buse\s+\w+',
            r'\w+\s+leak',
            r'\brefactor'
        ]
        
        has_pattern = any(re.search(pattern, generated.lower()) 
                         for pattern in applicable_patterns)
        
        return overlap > 2 or has_pattern
    
    def classify_feedback_type(self, comment: str) -> str:
        """Classify as suggestion, concern, or confused question"""
        comment_lower = comment.lower()
        
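        # Suggestion patterns (word boundaries stop 'use' matching inside 'because'
        # and 'try' matching inside 'entry')
        suggestion_patterns = [
            r'\bshould\s+', r'\bconsider\s+', r'\btry\s+', r'\buse\s+',
            r'\badd:', r'\bchange\s+to', r'\binstead\s+of'
        ]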
        
        # Confused question patterns
        confused_patterns = [
            r'is\s+this\s+correct\?',
            r'what\s+does',
            r"i\s+don't\s+understand",
            r'\?\?',
            r'why\s+.*\?$'
        ]
        
        if any(re.search(p, comment_lower) for p in suggestion_patterns):
            return 'suggestion'
        elif any(re.search(p, comment_lower) for p in confused_patterns):
            return 'confused_question'
        else:
            return 'concern'
    
    def has_explanation(self, comment: str) -> bool:
        """Check if comment contains rationale"""
        explanation_patterns = [
            r'because\s+', r'since\s+', r'this\s+prevents',
            r'this\s+ensures', r'to\s+avoid', r'for\s+better',
            r'which\s+', r'that\s+', r'prevents\s+', r'improves\s+'
        ]
        
        return any(re.search(p, comment.lower()) for p in explanation_patterns)
    
    def identify_issue_category(self, comment: str) -> str:
        """Identify the type of issue discussed"""
        comment_lower = comment.lower()
        
        # Functional issue patterns
        if any(word in comment_lower for word in ['validation', 'validate', 'check']):
            return 'validation'
        elif any(word in comment_lower for word in ['leak', 'resource', 'close', 'release']):
            return 'resource'
        elif any(word in comment_lower for word in ['logic', 'incorrect', 'wrong', 'error']):
            return 'logical'
        
        # Evolvability issue patterns
        elif any(word in comment_lower for word in ['refactor', 'extract', 'separate']):
            return 'organization'
        elif any(word in comment_lower for word in ['comment', 'document', 'explain']):
            return 'documentation'
        elif any(word in comment_lower for word in ['rename', 'naming', 'name']):
            return 'naming_convention'
        elif any(word in comment_lower for word in ['format', 'indent', 'space', 'style']):
            return 'visual_representation'
        
        return 'other'
    
    def comprehensive_evaluate(self, generated: str, reference: str, 
                             code_change: str) -> Dict:
        """Perform comprehensive evaluation"""
        return {
            'semantic_equivalence': self.evaluate_semantic_equivalence(generated, reference),
            'applicability': self.evaluate_applicability(generated, code_change),
            'feedback_type': self.classify_feedback_type(generated),
            'has_explanation': self.has_explanation(generated),
            'issue_category': self.identify_issue_category(generated)
        }

# Evaluate example comments
evaluator = CodeReviewEvaluator()

# Test cases
test_cases = [
    {
        'generated': "Missing validation check. Add: if (!validateRequest(request)) return; This prevents unauthorized access.",
        'reference': "Add validation before processing request",
        'code': "if (user.role == 'admin') { processAdminRequest(request); }"
    },
    {
        'generated': "Resource leak detected. Use try-finally to ensure connection.close().",
        'reference': "Connection should be closed after use",
        'code': "conn = database.connect()\ndata = conn.query(sql)"
    },
    {
        'generated': "Is this correct?",
        'reference': "Check if validation is needed",
        'code': "processRequest(request)"
    }
]

print("Evaluation Results:")
print("==================")

for i, test in enumerate(test_cases):
    print(f"\nTest Case {i+1}:")
    print(f"Generated: {test['generated']}")
    
    results = evaluator.comprehensive_evaluate(
        test['generated'], 
        test['reference'],
        test['code']
    )
    
    print("\nEvaluation:")
    for metric, value in results.items():
        print(f"  {metric}: {value}")
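
The keyword heuristics above are coarse. A token-overlap F1 score is a common lightweight complement for comparing a generated comment with its reference. This is a sketch under assumptions: `token_f1` is a hypothetical helper, it is not the metric used in the paper, and lexical overlap can miss paraphrases that a human rater would accept.

```python
import re
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    """Token-level F1 between two comments, counting repeated tokens."""
    gen = Counter(re.findall(r"\w+", generated.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((gen & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = token_f1("Use try-finally to close the connection",
                 "Connection should be closed after use")
print(f"{score:.3f}")
```

Here only `use` and `connection` overlap (`close` and `closed` do not match without stemming), which illustrates why the paper pairs automatic metrics with manual evaluation.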

## 5. Analyzing Model Behavior and Comment Quality

In [None]:
def analyze_comment_quality_distribution():
    """Analyze distribution of comment quality metrics"""
    
    # Simulate evaluation results for different models
    models = ['CodeReviewer', 'ELF_aco_pkg', 'ELF_rso_pkg', 'ELF_avg_pkg']
    
    # Simulated metrics (based on paper results)
    metrics_data = {
        'CodeReviewer': {
            'applicable': 42,
            'suggestions': 27,
            'concerns': 8,
            'confused': 7,
            'has_explanation': 8,
            'functional_issues': 7,
            'evolvability_issues': 24
        },
        'ELF_aco_pkg': {
            'applicable': 53,
            'suggestions': 42,
            'concerns': 9,
            'confused': 2,
            'has_explanation': 15,
            'functional_issues': 13,
            'evolvability_issues': 29
        },
        'ELF_rso_pkg': {
            'applicable': 53,
            'suggestions': 37,
            'concerns': 14,
            'confused': 2,
            'has_explanation': 11,
            'functional_issues': 12,
            'evolvability_issues': 27
        },
        'ELF_avg_pkg': {
            'applicable': 46,
            'suggestions': 30,
            'concerns': 12,
            'confused': 3,
            'has_explanation': 13,
            'functional_issues': 16,
            'evolvability_issues': 19
        }
    }
    
    # Create comprehensive visualization
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # Plot 1: Overall applicability
    ax1 = axes[0, 0]
    applicability = [metrics_data[m]['applicable'] for m in models]
    bars1 = ax1.bar(models, applicability, color=['gray', 'lightblue', 'lightgreen', 'lightcoral'])
    ax1.set_ylabel('Count (out of 100)')
    ax1.set_title('Applicable Comments Generated')
    ax1.set_ylim(0, 60)
    
    # Add value labels
    for bar, val in zip(bars1, applicability):
        ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                f'{val}', ha='center', va='bottom')
    
    # Plot 2: Feedback type distribution
    ax2 = axes[0, 1]
    feedback_types = ['suggestions', 'concerns', 'confused']
    x = np.arange(len(models))
    width = 0.25
    
    for i, ftype in enumerate(feedback_types):
        values = [metrics_data[m][ftype] for m in models]
        ax2.bar(x + i*width, values, width, label=ftype.capitalize())
    
    ax2.set_xlabel('Model')
    ax2.set_ylabel('Count')
    ax2.set_title('Feedback Type Distribution')
    ax2.set_xticks(x + width)
    ax2.set_xticklabels(models, rotation=45)
    ax2.legend()
    
    # Plot 3: Issue type comparison
    ax3 = axes[1, 0]
    functional = [metrics_data[m]['functional_issues'] for m in models]
    evolvability = [metrics_data[m]['evolvability_issues'] for m in models]
    
    x = np.arange(len(models))
    width = 0.35
    
    ax3.bar(x - width/2, functional, width, label='Functional', color='darkred')
    ax3.bar(x + width/2, evolvability, width, label='Evolvability', color='darkblue')
    
    ax3.set_xlabel('Model')
    ax3.set_ylabel('Count')
    ax3.set_title('Issue Types Identified')
    ax3.set_xticks(x)
    ax3.set_xticklabels(models, rotation=45)
    ax3.legend()
    
    # Plot 4: Percentage improvement over the baseline (grouped bar chart)
    ax4 = axes[1, 1]
    
    # Calculate improvement percentages
    baseline = metrics_data['CodeReviewer']
    categories = ['Applicable', 'Suggestions', 'Explanations', 'Functional', 'Less Confused']
    
    improvements = {}
    for model in models[1:]:
        m = metrics_data[model]
        improvements[model] = [
            (m['applicable'] - baseline['applicable']) / baseline['applicable'] * 100,
            (m['suggestions'] - baseline['suggestions']) / baseline['suggestions'] * 100,
            (m['has_explanation'] - baseline['has_explanation']) / baseline['has_explanation'] * 100,
            (m['functional_issues'] - baseline['functional_issues']) / baseline['functional_issues'] * 100,
            (baseline['confused'] - m['confused']) / baseline['confused'] * 100
        ]
    
    # Bar plot of improvements
    x = np.arange(len(categories))
    width = 0.25
    
    for i, (model, values) in enumerate(improvements.items()):
        ax4.bar(x + i*width, values, width, label=model.replace('ELF_', ''))
    
    ax4.set_xlabel('Metric')
    ax4.set_ylabel('Improvement (%)')
    ax4.set_title('Percentage Improvement over CodeReviewer')
    ax4.set_xticks(x + width)
    ax4.set_xticklabels(categories, rotation=45)
    ax4.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Print key insights
    print("\nKey Insights from Analysis:")
    print("===========================")
    print("1. ELF models generate 10-26% more applicable comments")
    print("2. Suggestion rate improves by up to 56% with ELF_aco_pkg")
    print("3. Confused questions reduced by 57-71% with experience-aware training")
    print("4. Functional issue detection improves by 71-129%")
    print("5. Comments with explanations increase by 38-88%")

# Analyze comment quality
analyze_comment_quality_distribution()
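
The percentage improvements plotted in the fourth panel come from a simple relative-change formula. As a worked check, the sketch below applies it to the applicable-comment counts copied from `metrics_data`; `relative_change` is a hypothetical helper name.

```python
def relative_change(new: float, baseline: float) -> float:
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100

# Applicable-comment counts copied from metrics_data above
baseline_applicable = 42  # CodeReviewer
elf_applicable = {"ELF_aco_pkg": 53, "ELF_rso_pkg": 53, "ELF_avg_pkg": 46}

for model, count in elf_applicable.items():
    print(f"{model}: {relative_change(count, baseline_applicable):+.1f}%")
# ELF_aco_pkg: +26.2%
# ELF_rso_pkg: +26.2%
# ELF_avg_pkg: +9.5%
```

Running the same helper over the other metrics is a quick way to confirm that any summary statement matches the numbers being plotted.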

## 6. Advanced Techniques and Future Directions

In [None]:
class AdvancedCodeReviewTechniques:
    """Advanced techniques for code review generation"""
    
    @staticmethod
    def multi_task_learning():
        """Multi-task learning approach"""
        print("Multi-Task Learning for Code Review:")
        print("====================================")
        print("\nTasks:")
        print("1. Code Change Quality Estimation (Binary)")
        print("   - Input: Code change")
        print("   - Output: Needs review? (Yes/No)")
        print("\n2. Review Comment Generation (Seq2Seq)")
        print("   - Input: Code change")
        print("   - Output: Natural language comment")
        print("\n3. Code Refinement (Seq2Seq)")
        print("   - Input: Code + Review comment")
        print("   - Output: Refined code")
        print("\nShared encoder learns unified code representations")
        
    @staticmethod
    def retrieval_augmented_generation():
        """RAG approach for code reviews"""
        print("\nRetrieval-Augmented Generation:")
        print("===============================")
        print("\nComponents:")
        print("1. Code Change Encoder")
        print("2. Review Database (Vector Store)")
        print("3. Similarity Search")
        print("4. Context-aware Generator")
        print("\nProcess:")
        print("- Encode code change")
        print("- Retrieve similar past reviews")
        print("- Generate review using retrieved context")
        
    @staticmethod
    def visualize_future_directions():
        """Visualize future research directions"""
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
        
        # Future direction 1: Multi-modal understanding
        ax1.text(0.5, 0.9, 'Multi-Modal Code Understanding', 
                ha='center', fontsize=16, weight='bold')
        
        components = [
            (0.2, 0.7, 'Code\nText'),
            (0.5, 0.7, 'AST\nStructure'),
            (0.8, 0.7, 'Data\nFlow'),
            (0.35, 0.4, 'Control\nFlow'),
            (0.65, 0.4, 'Type\nInfo'),
            (0.5, 0.1, 'Unified\nRepresentation')
        ]
        
        for x, y, text in components:
            if y > 0.2:
                ax1.add_patch(plt.Rectangle((x-0.08, y-0.05), 0.16, 0.1, 
                                          fill=True, color='lightblue', alpha=0.7))
                ax1.text(x, y, text, ha='center', va='center')
                ax1.arrow(x, y-0.05, 0, -0.15, head_width=0.02, head_length=0.02, 
                         fc='gray', ec='gray')
            else:
                ax1.add_patch(plt.Rectangle((x-0.1, y-0.05), 0.2, 0.1, 
                                          fill=True, color='lightgreen', alpha=0.7))
                ax1.text(x, y, text, ha='center', va='center', weight='bold')
        
        ax1.set_xlim(0, 1)
        ax1.set_ylim(0, 1)
        ax1.axis('off')
        
        # Future direction 2: Personalized review generation
        ax2.text(0.5, 0.9, 'Personalized Review Generation', 
                ha='center', fontsize=16, weight='bold')
        
        # Create flow diagram
        nodes = [
            (0.2, 0.7, 'Developer\nProfile'),
            (0.5, 0.7, 'Project\nContext'),
            (0.8, 0.7, 'Team\nStandards'),
            (0.5, 0.4, 'Adaptive\nModel'),
            (0.5, 0.1, 'Personalized\nReview')
        ]
        
        for i, (x, y, text) in enumerate(nodes):
            if i < 3:
                color = 'lightcoral'
            elif i == 3:
                color = 'lightyellow'
            else:
                color = 'lightgreen'
            
            ax2.add_patch(plt.Circle((x, y), 0.08, fill=True, color=color, alpha=0.7))
            ax2.text(x, y, text, ha='center', va='center', fontsize=10)
        
        # Add arrows from each input node to the top edge of the adaptive model
        for i in range(3):
            x, y, _ = nodes[i]
            ax2.arrow(x, y-0.08, 0.5-x, 0.48-(y-0.08), 
                     head_width=0.02, head_length=0.02, fc='gray', ec='gray', alpha=0.5,
                     length_includes_head=True)
        
        ax2.arrow(0.5, 0.32, 0, -0.14, head_width=0.03, head_length=0.03, 
                 fc='darkgreen', ec='darkgreen')
        
        ax2.set_xlim(0, 1)
        ax2.set_ylim(0, 1)
        ax2.axis('off')
        
        plt.tight_layout()
        plt.show()

# Demonstrate advanced techniques
advanced = AdvancedCodeReviewTechniques()
advanced.multi_task_learning()
advanced.retrieval_augmented_generation()
advanced.visualize_future_directions()

print("\nFuture Research Opportunities:")
print("==============================")
print("1. Cross-lingual Review Generation: Support multiple programming languages")
print("2. Interactive Review Systems: Allow back-and-forth clarification")
print("3. Review Quality Estimation: Predict review helpfulness before posting")
print("4. Team-aware Models: Adapt to team coding standards and preferences")
print("5. Incremental Learning: Continuously improve from accepted/rejected reviews")

## 7. Practical Implementation Guide

In [None]:
class PracticalImplementationGuide:
    """Guide for implementing code review generation in practice"""
    
    @staticmethod
    def implementation_checklist():
        """Checklist for implementing the system"""
        checklist = [
            "Data Collection",
            "- [ ] Extract code review data from repositories",
            "- [ ] Filter bot accounts and non-NL comments",
            "- [ ] Calculate reviewer ownership metrics",
            "- [ ] Split data (train/val/test)",
            "",
            "Model Setup",
            "- [ ] Initialize T5/CodeT5 base model",
            "- [ ] Add code-specific tokens",
            "- [ ] Implement ELF loss function",
            "- [ ] Setup distributed training",
            "",
            "Training",
            "- [ ] Implement experience-aware batching",
            "- [ ] Monitor weight distributions",
            "- [ ] Track per-experience-level metrics",
            "- [ ] Implement gradient clipping",
            "",
            "Evaluation",
            "- [ ] Automatic metrics (BLEU-4)",
            "- [ ] Manual evaluation setup",
            "- [ ] Inter-rater agreement",
            "- [ ] Category-wise analysis",
            "",
            "Deployment",
            "- [ ] API endpoint setup",
            "- [ ] Caching for common patterns",
            "- [ ] Monitoring and logging",
            "- [ ] A/B testing framework"
        ]
        
        print("Implementation Checklist:")
        print("========================")
        for item in checklist:
            print(item)
    
    @staticmethod
    def sample_config():
        """Sample configuration for the system"""
        config = {
            "model": {
                "base_model": "Salesforce/codet5-base",
                "max_input_length": 512,
                "max_output_length": 128,
                "num_beams": 4,
                # Only takes effect when sampling is enabled (do_sample=True);
                # plain beam search ignores it
                "temperature": 0.7
            },
            "training": {
                "batch_size": 32,
                "learning_rate": 3e-4,
                "num_epochs": 30,
                "warmup_steps": 1000,
                "gradient_clip": 1.0
            },
            "elf": {
                "strategy": "aco",
                "granularity": "package",
                "weight_clip": [0.5, 10.0],
                "adaptive_scaling": True
            },
            "data": {
                "min_comment_length": 10,
                "max_comment_length": 200,
                "languages": ["python", "java", "javascript"],
                "remove_bots": True
            }
        }
        
        print("\nSample Configuration:")
        print("====================")
        import json
        print(json.dumps(config, indent=2))
        
        return config

# Show implementation guide
guide = PracticalImplementationGuide()
guide.implementation_checklist()
config = guide.sample_config()

print("\nKey Implementation Tips:")
print("=======================")
print("1. Start with a smaller model (T5-small) for experimentation")
print("2. Use mixed precision training for efficiency")
print("3. Implement caching for ownership metric calculations")
print("4. Monitor ELF weight distributions during training")
print("5. Use stratified sampling for evaluation sets")
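
Tip 4 and the `weight_clip` entry in the sample configuration can be made concrete with a small sketch. It assumes the per-example ELF weights have already been computed elsewhere; the raw values below are invented for illustration. The block clamps them to the configured `[0.5, 10.0]` range and collects a few statistics that could be logged during training. The helper names `clip_weights` and `summarize_weights` are hypothetical and do not come from any library.

```python
from statistics import mean
from typing import Dict, List, Tuple


def clip_weights(weights: List[float], bounds: Tuple[float, float]) -> List[float]:
    """Clamp each weight into the closed interval [low, high]."""
    low, high = bounds
    return [min(max(w, low), high) for w in weights]


def summarize_weights(raw: List[float], clipped: List[float]) -> Dict[str, float]:
    """Collect simple statistics about the clipped weights for logging."""
    n_clipped = sum(1 for r, c in zip(raw, clipped) if r != c)
    return {
        "min": min(clipped),
        "max": max(clipped),
        "mean": mean(clipped),
        "fraction_clipped": n_clipped / len(raw),
    }


# Illustrative raw weights only; real values would come from the ownership metrics.
raw_weights = [0.1, 0.8, 1.0, 2.5, 14.0]
clipped_weights = clip_weights(raw_weights, (0.5, 10.0))
stats = summarize_weights(raw_weights, clipped_weights)

print(clipped_weights)
print(stats)
```

A persistently high `fraction_clipped` would suggest that the bounds, rather than the ownership signal, are deciding most of the weights, which may be a reason to revisit the clip range.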

## 8. Summary and Key Takeaways

### Core Concepts Mastered
1. **Task Formulation**: Code change → Natural language review (Seq2Seq)
2. **Model Architecture**: T5-based with code-specific adaptations
3. **Experience Integration**: ELF weights influence training dynamics
4. **Evaluation Framework**: Multi-dimensional assessment beyond BLEU
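
For the manual side of the evaluation framework, the checklist's "Inter-rater agreement" item can be checked with a standard statistic. The sketch below uses Cohen's kappa for two raters, which is one common choice; this section does not state which statistic the paper used, so treat it as an assumption. The label names and ratings are made up for the example.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")

    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of eight generated comments.
rater_a = ["applicable", "applicable", "not_applicable", "not_applicable",
           "applicable", "not_applicable", "applicable", "applicable"]
rater_b = ["applicable", "applicable", "not_applicable", "applicable",
           "applicable", "not_applicable", "not_applicable", "applicable"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than a plain percentage when one label dominates.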

### Technical Insights
1. **Input Representation**: Combine before/after code with context
2. **Special Tokens**: `<OLD>`, `<NEW>`, `<ADDED>`, `<REMOVED>` improve understanding
3. **Beam Search**: Balance diversity and quality in generation
4. **Manual Evaluation**: Essential for capturing semantic aspects
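
As an illustration of the first two insights, the sketch below builds one input string from the before/after code and marks the changed lines with the special tokens. The layout is an assumption made for this example, and the notebook's own `to_model_input` may format things differently. `build_review_input` is a hypothetical helper; the line diff comes from Python's standard `difflib` module.

```python
import difflib


def build_review_input(before: str, after: str) -> str:
    """Combine old and new code and tag changed lines (assumed layout)."""
    old_lines = before.splitlines()
    new_lines = after.splitlines()

    tagged = []
    for line in difflib.ndiff(old_lines, new_lines):
        marker, text = line[:2], line[2:]
        if marker == "- ":
            tagged.append(f"<REMOVED> {text.strip()}")
        elif marker == "+ ":
            tagged.append(f"<ADDED> {text.strip()}")
        # "  " lines are unchanged and "? " lines are ndiff hints; both are skipped.

    return "\n".join(["<OLD>", before, "<NEW>", after, *tagged])


before = "def add(a, b):\n    return a - b"
after = "def add(a, b):\n    return a + b"
model_input = build_review_input(before, after)
print(model_input)
```

With a real model, these tokens would also need to be registered with the tokenizer (the "Add code-specific tokens" checklist item) so that they are not split into subword pieces.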

### Key Results
1. **+29% Applicable Comments**: ELF models generate more relevant reviews
2. **+56% Suggestions**: More actionable feedback vs concerns
3. **-71% Confused Questions**: Reduced uncertainty in comments
4. **+129% Functional Issues**: Better detection of critical problems

### Future Directions
1. **Multi-modal Input**: Integrate AST, data flow, type information
2. **Personalization**: Adapt to team standards and preferences
3. **Interactive Systems**: Enable clarification dialogues
4. **Cross-lingual Support**: Handle multiple programming languages

In [None]:
# Final research template (printed as a starting point, not executed)
template = '''
class CodeReviewResearch:
    def __init__(self):
        self.model = self.load_model()
        self.evaluator = CodeReviewEvaluator()
        self.ownership_calculator = OwnershipCalculator()

    def train_with_elf(self, dataset, strategy='aco', granularity='package'):
        # Your implementation
        pass

    def evaluate_comprehensive(self, test_set):
        # Your evaluation
        pass

    def analyze_results(self):
        # Your analysis
        pass
'''

print("Code Review Generation Research Template")
print("=======================================")
print(template)
print("Ready to implement code review generation with experience awareness!")