# Leveraging Reviewer Experience in Code Review Comment Generation - Main Implementation

## Paper Information
- **Title**: Leveraging Reviewer Experience in Code Review Comment Generation
- **Authors**: Hong Yi Lin, Patanamon Thongtanunam, Christoph Treude, Michael W. Godfrey, Chunhua Liu, Wachiraphan Charoenwet
- **Link**: [arXiv:2409.10959v1](https://arxiv.org/abs/2409.10959v1)
- **Institutions**: The University of Melbourne, Singapore Management University, University of Waterloo

## Abstract Summary
This paper proposes Experience-aware Loss Functions (ELF) to improve code review comment generation by leveraging reviewers' past authoring and reviewing experience. The method scales each training example's loss by a weight that grows with the reviewer's experience, so comments from experienced reviewers have more influence on model behavior. In the paper's evaluation, ELF produced 29% more applicable comments, 56% more suggestions, and 129% more identified functional issues than state-of-the-art models.

## Key Contributions
1. Experience-aware training methods for code review comment generation
2. Analysis of emergent behaviors after experience-aware training
3. Large-scale datasets with commit/PR histories for 826 GitHub repositories
4. Augmented CodeReviewer dataset tagged with ownership metrics

## 1. Environment Setup and Dependencies

In [None]:
# Install required packages
!pip install torch transformers datasets
!pip install langchain langchain-openai langchain-community
!pip install numpy pandas matplotlib seaborn
!pip install scikit-learn
!pip install deepeval
!pip install pygithub pydriller
!pip install nltk  # required by calculate_bleu4 in Section 6

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass
import json
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

## 2. Data Structures and Mock Data Generation

In [None]:
@dataclass
class CodeReviewComment:
    """Data structure for code review comments"""
    comment_id: str
    reviewer_id: str
    repository: str
    subsystem: str  # Top-level directory
    package: str    # Immediate folder
    code_change: str
    comment_text: str
    timestamp: datetime
    
@dataclass
class ReviewerMetrics:
    """Reviewer ownership metrics at different granularities"""
    reviewer_id: str
    aco_repo: float  # Authoring Code Ownership - Repository
    aco_sys: float   # Authoring Code Ownership - Subsystem
    aco_pkg: float   # Authoring Code Ownership - Package
    rso_repo: float  # Review-Specific Ownership - Repository
    rso_sys: float   # Review-Specific Ownership - Subsystem
    rso_pkg: float   # Review-Specific Ownership - Package

In [None]:
def generate_mock_data(n_samples: int = 1000) -> Tuple[List[CodeReviewComment], Dict[str, ReviewerMetrics]]:
    """Generate mock code review data for demonstration"""
    
    # Mock reviewers with varying experience levels
    reviewers = [
        {"id": "exp_reviewer_1", "exp_level": "high"},
        {"id": "exp_reviewer_2", "exp_level": "high"},
        {"id": "mid_reviewer_1", "exp_level": "medium"},
        {"id": "mid_reviewer_2", "exp_level": "medium"},
        {"id": "new_reviewer_1", "exp_level": "low"},
        {"id": "new_reviewer_2", "exp_level": "low"},
    ]
    
    # Mock code changes and corresponding review comments
    code_change_templates = [
        {
            "code": "if (user.role == 'admin') { processAdminRequest(request); }",
            "high_exp_comment": "Missing validation check. Add: if (!validateRequest(request)) return;",
            "low_exp_comment": "Is this correct?",
            "type": "validation"
        },
        {
            "code": "for i in range(len(data)): result.append(data[i] * 2)",
            "high_exp_comment": "Use list comprehension for better performance: result = [x * 2 for x in data]",
            "low_exp_comment": "Please add spaces around operators",
            "type": "performance"
        },
        {
            "code": "connection = db.connect()\ndata = connection.query(sql)",
            "high_exp_comment": "Resource leak. Use try-finally or context manager to ensure connection.close()",
            "low_exp_comment": "Add comment here",
            "type": "resource"
        }
    ]
    
    # Generate reviewer metrics
    reviewer_metrics = {}
    for reviewer in reviewers:
        if reviewer["exp_level"] == "high":
            base_aco = np.random.uniform(0.15, 0.35)
            base_rso = np.random.uniform(0.25, 0.45)
        elif reviewer["exp_level"] == "medium":
            base_aco = np.random.uniform(0.05, 0.15)
            base_rso = np.random.uniform(0.10, 0.25)
        else:
            base_aco = np.random.uniform(0.01, 0.05)
            base_rso = np.random.uniform(0.02, 0.10)
            
        # Ownership increases at finer granularities
        reviewer_metrics[reviewer["id"]] = ReviewerMetrics(
            reviewer_id=reviewer["id"],
            aco_repo=base_aco,
            aco_sys=base_aco * 1.2,
            aco_pkg=base_aco * 1.5,
            rso_repo=base_rso,
            rso_sys=base_rso * 1.3,
            rso_pkg=base_rso * 1.6
        )
    
    # Generate code review comments
    comments = []
    for i in range(n_samples):
        reviewer = np.random.choice(reviewers)
        template = np.random.choice(code_change_templates)
        
        # Select comment based on reviewer experience
        if reviewer["exp_level"] == "high":
            comment_text = template["high_exp_comment"]
        else:
            comment_text = template["low_exp_comment"]
            
        # Draw the module once so the package path nests under its subsystem
        module_id = np.random.randint(1, 4)
        comment = CodeReviewComment(
            comment_id=f"comment_{i}",
            reviewer_id=reviewer["id"],
            repository="mock_repo",
            subsystem=f"src/module_{module_id}",
            package=f"src/module_{module_id}/submodule_{np.random.randint(1, 3)}",
            code_change=template["code"],
            comment_text=comment_text,
            timestamp=datetime.now()
        )
        comments.append(comment)
    
    return comments, reviewer_metrics

# Generate mock data
mock_comments, mock_metrics = generate_mock_data(1000)
print(f"Generated {len(mock_comments)} mock comments")
print(f"Generated metrics for {len(mock_metrics)} reviewers")

## 3. Ownership Metrics Calculation (ACO & RSO)

Implementation of Authoring Code Ownership (ACO) and Review-Specific Ownership (RSO) metrics from the paper.

In [None]:
class OwnershipCalculator:
    """Calculate ACO and RSO metrics at different granularities"""
    
    def __init__(self):
        self.commits = {}  # {granularity: {developer: count}}
        self.reviews = {}  # {granularity: {developer: count}}
        
    def calculate_aco(self, developer: str, granularity: str, 
                     review_timestamp: datetime) -> float:
        """
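        Calculate Authoring Code Ownership (ACO) - Equation (1) from paper
        ACO(D,G) = α(D,G) / C(G)
        where:
        - α(D,G) = commits by developer D at granularity G
        - C(G) = total commits at granularity G

        Note: review_timestamp is accepted but not used in this simplified
        version, so counts are not restricted to activity before the review.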
        """
        if granularity not in self.commits:
            return 0.0
            
        developer_commits = self.commits[granularity].get(developer, 0)
        total_commits = sum(self.commits[granularity].values())
        
        if total_commits == 0:
            return 0.0
            
        return developer_commits / total_commits
    
    def calculate_rso(self, developer: str, granularity: str,
                     review_timestamp: datetime) -> float:
        """
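        Calculate Review-Specific Ownership (RSO) - Equation (2) from paper
        RSO(D,G) = r(D,G) / ρ(G)
        where:
        - r(D,G) = PRs reviewed by developer D at granularity G
        - ρ(G) = total PRs at granularity G

        Note: review_timestamp is accepted but not used in this simplified
        version, and ρ(G) is approximated by summing review records across
        developers, which double-counts PRs with several reviewers.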
        """
        if granularity not in self.reviews:
            return 0.0
            
        developer_reviews = self.reviews[granularity].get(developer, 0)
        total_reviews = sum(self.reviews[granularity].values())
        
        if total_reviews == 0:
            return 0.0
            
        return developer_reviews / total_reviews
    
    def add_commit(self, developer: str, granularity: str, timestamp: datetime):
        """Add a commit record"""
        if granularity not in self.commits:
            self.commits[granularity] = {}
        self.commits[granularity][developer] = self.commits[granularity].get(developer, 0) + 1
        
    def add_review(self, developer: str, granularity: str, timestamp: datetime):
        """Add a review record"""
        if granularity not in self.reviews:
            self.reviews[granularity] = {}
        self.reviews[granularity][developer] = self.reviews[granularity].get(developer, 0) + 1

# Demonstrate ownership calculation
calculator = OwnershipCalculator()

# Simulate historical data
for i in range(1000):
    dev = f"exp_reviewer_{np.random.randint(1, 3)}"
    gran = np.random.choice(["repository", "subsystem", "package"])
    calculator.add_commit(dev, gran, datetime.now())
    
for i in range(1500):
    dev = f"exp_reviewer_{np.random.randint(1, 3)}"
    gran = np.random.choice(["repository", "subsystem", "package"])
    calculator.add_review(dev, gran, datetime.now())

# Calculate metrics
print("Sample Ownership Metrics:")
for dev in ["exp_reviewer_1", "exp_reviewer_2"]:
    aco_repo = calculator.calculate_aco(dev, "repository", datetime.now())
    rso_repo = calculator.calculate_rso(dev, "repository", datetime.now())
    print(f"{dev}: ACO(repo)={aco_repo:.3f}, RSO(repo)={rso_repo:.3f}")
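
The `OwnershipCalculator` above takes a `review_timestamp` argument but counts all recorded activity regardless of date. A minimal sketch of a time-bounded variant follows, assuming ownership should reflect only commits made before the review in question. The `CommitRecord` type and the `aco_before` helper are hypothetical names introduced for illustration, not part of the paper's code.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class CommitRecord:
    developer: str
    granularity: str   # e.g. a package path such as "src/module_1/submodule_2"
    timestamp: datetime

def aco_before(commits: List[CommitRecord], developer: str,
               granularity: str, review_timestamp: datetime) -> float:
    """ACO(D, G) computed only from commits strictly before review_timestamp."""
    prior = [c for c in commits
             if c.granularity == granularity and c.timestamp < review_timestamp]
    if not prior:
        return 0.0
    own = sum(1 for c in prior if c.developer == developer)
    return own / len(prior)

history = [
    CommitRecord("alice", "pkg_a", datetime(2024, 1, 1)),
    CommitRecord("alice", "pkg_a", datetime(2024, 2, 1)),
    CommitRecord("bob",   "pkg_a", datetime(2024, 3, 1)),
    CommitRecord("alice", "pkg_a", datetime(2024, 6, 1)),  # after the review, ignored
]
alice_aco = aco_before(history, "alice", "pkg_a", datetime(2024, 4, 1))
print(f"{alice_aco:.3f}")
```

The same filter would apply to RSO by swapping commit records for review records.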

## 4. Experience-Aware Loss Functions (ELF) Implementation

Core implementation of the ELF method with four weighting strategies.
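
As a worked sketch of the four strategies, each weight has the form e^(1 + x), where x is the ACO value, the RSO value, their average, or their maximum. The `elf_weight` helper and the ownership values below are illustrative assumptions, not figures from the paper.

```python
import math

def elf_weight(aco: float, rso: float, strategy: str) -> float:
    """Weight for one reviewer: e^(1 + ownership), with ownership in [0, 1]."""
    ownership = {
        "aco": aco,
        "rso": rso,
        "avg": (aco + rso) / 2,
        "max": max(aco, rso),
    }[strategy]
    return math.exp(1 + ownership)

# Illustrative ownership values (assumed, not taken from the paper's data)
aco, rso = 0.3, 0.5
for strategy in ("aco", "rso", "avg", "max"):
    print(f"{strategy}: {elf_weight(aco, rso, strategy):.3f}")
```

Because ownership lies in [0, 1], every weight falls between e (about 2.718) and e² (about 7.389), so the most experienced reviewer's comment can count at most e times as much as a newcomer's.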

In [None]:
class ExperienceAwareLoss(nn.Module):
    """Experience-Aware Loss Function (ELF) - Equation (3) from paper"""
    
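    def __init__(self, strategy: str = "aco", granularity: str = "package"):
        """
        Args:
            strategy: One of ["aco", "rso", "avg", "max"]
            granularity: One of ["repository", "subsystem", "package"]
        """
        super().__init__()
        # Fail fast on typos; otherwise calculate_weight silently falls back
        # to the "max" strategy or the "package" granularity
        if strategy not in ("aco", "rso", "avg", "max"):
            raise ValueError(f"Unknown strategy: {strategy!r}")
        if granularity not in ("repository", "subsystem", "package"):
            raise ValueError(f"Unknown granularity: {granularity!r}")
        self.strategy = strategy
        self.granularity = granularity
        self.base_loss = nn.CrossEntropyLoss(reduction='none')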
        
    def calculate_weight(self, metrics: ReviewerMetrics) -> float:
        """Calculate weight ω based on strategy and granularity"""
        
        # Get ownership values based on granularity
        if self.granularity == "repository":
            aco = metrics.aco_repo
            rso = metrics.rso_repo
        elif self.granularity == "subsystem":
            aco = metrics.aco_sys
            rso = metrics.rso_sys
        else:  # package
            aco = metrics.aco_pkg
            rso = metrics.rso_pkg
            
        # Apply strategy-specific weight calculation
        if self.strategy == "aco":
            # ω_aco = e^(1+aco)
            weight = np.exp(1 + aco)
        elif self.strategy == "rso":
            # ω_rso = e^(1+rso)
            weight = np.exp(1 + rso)
        elif self.strategy == "avg":
            # ω_avg = e^(1+(rso+aco)/2)
            weight = np.exp(1 + (rso + aco) / 2)
        else:  # max
            # ω_max = e^(1+max(rso,aco))
            weight = np.exp(1 + max(rso, aco))
            
        # np.exp returns a NumPy scalar; convert to match the float annotation
        return float(weight)
    
    def forward(self, logits: torch.Tensor, targets: torch.Tensor, 
                reviewer_metrics: List[ReviewerMetrics]) -> torch.Tensor:
        """
        Calculate experience-aware loss
        L_RCG = ω * Σ(-log P(w_t|c,w_<t))
        """
        # Per-token loss, reshaped back to (batch, seq_len).
        # reshape (rather than view) also handles non-contiguous tensors.
        batch_size, seq_len, vocab_size = logits.shape
        base_losses = self.base_loss(
            logits.reshape(-1, vocab_size), targets.reshape(-1)
        ).reshape(batch_size, seq_len)
        
        # One experience-aware weight per sample, on the loss's device and dtype
        weights = torch.tensor(
            [self.calculate_weight(m) for m in reviewer_metrics],
            dtype=base_losses.dtype, device=base_losses.device,
        )
        
        # Token losses are averaged (not summed) per sequence, so the weight
        # does not depend on comment length; padding tokens are not masked here
        return (weights * base_losses.mean(dim=1)).mean()

# Demonstrate ELF with different strategies
print("ELF Weight Examples:")
for strategy in ["aco", "rso", "avg", "max"]:
    elf = ExperienceAwareLoss(strategy=strategy, granularity="package")
    
    # High experience reviewer
    high_exp_metrics = mock_metrics["exp_reviewer_1"]
    high_weight = elf.calculate_weight(high_exp_metrics)
    
    # Low experience reviewer
    low_exp_metrics = mock_metrics["new_reviewer_1"]
    low_weight = elf.calculate_weight(low_exp_metrics)
    
    print(f"\nStrategy: {strategy}")
    print(f"  High exp weight: {high_weight:.3f}")
    print(f"  Low exp weight: {low_weight:.3f}")
    print(f"  Weight ratio: {high_weight/low_weight:.2f}x")
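
`ExperienceAwareLoss.forward` above averages over every token position, including any padding. The following is a sketch of a padding-aware variant, assuming padded target positions use token id 0. The `masked_weighted_loss` function and its `pad_id` parameter are hypothetical; the correct padding id depends on the tokenizer in use.

```python
import torch
import torch.nn.functional as F

def masked_weighted_loss(logits: torch.Tensor, targets: torch.Tensor,
                         weights: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Per-sample weighted cross-entropy that skips padding tokens."""
    batch_size, seq_len, vocab_size = logits.shape
    # ignore_index zeroes the loss at padded positions
    token_loss = F.cross_entropy(
        logits.reshape(-1, vocab_size),
        targets.reshape(-1),
        ignore_index=pad_id,
        reduction="none",
    ).reshape(batch_size, seq_len)
    mask = (targets != pad_id).float()
    # Mean over real tokens only; clamp avoids division by zero for empty rows
    per_sample = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    return (weights * per_sample).mean()

logits = torch.zeros(2, 3, 5)                    # uniform predictions over 5 tokens
targets = torch.tensor([[1, 2, 0], [3, 0, 0]])   # 0 marks padding
weights = torch.tensor([2.718, 4.482])           # illustrative ELF weights
loss = masked_weighted_loss(logits, targets, weights)
print(f"{loss.item():.4f}")
```

With uniform logits every real token costs ln 5, so each sequence's mean loss is the same regardless of how much padding it carries; only the reviewer weight differs.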

## 5. Code Review Comment Generation Model

Simplified stand-in for demonstration: the class builds LangChain prompt templates but returns canned responses instead of calling an LLM. The paper itself uses the T5-based CodeReviewer model.

In [None]:
# Only PromptTemplate is used below; it lives in langchain_core, which is
# installed as a dependency of langchain
from langchain_core.prompts import PromptTemplate

class CodeReviewGenerator:
    """Simplified code review comment generator using LangChain"""
    
    def __init__(self, experience_aware: bool = True):
        self.experience_aware = experience_aware
        
        # Define prompts based on experience awareness
        if experience_aware:
            self.prompt_template = PromptTemplate(
                input_variables=["code_change", "reviewer_experience"],
                template="""
You are an experienced code reviewer with {reviewer_experience} level expertise.
Review the following code change and provide a detailed, technical comment:

Code Change:
{code_change}

Provide a review comment that:
1. Identifies specific issues (functional, validation, resource, etc.)
2. Suggests concrete improvements with code examples when applicable
3. Explains the rationale behind your suggestions

Review Comment:"""
            )
        else:
            self.prompt_template = PromptTemplate(
                input_variables=["code_change"],
                template="""
Review the following code change:

Code Change:
{code_change}

Review Comment:"""
            )
    
    def generate_comment(self, code_change: str, reviewer_metrics: Optional[ReviewerMetrics] = None) -> str:
        """Generate a code review comment"""
        
        if self.experience_aware and reviewer_metrics:
            # Determine experience level based on metrics
            avg_ownership = (reviewer_metrics.aco_pkg + reviewer_metrics.rso_pkg) / 2
            if avg_ownership > 0.2:
                exp_level = "high"
            elif avg_ownership > 0.1:
                exp_level = "medium"
            else:
                exp_level = "low"
                
            # For demonstration, return experience-aware mock responses
            if exp_level == "high":
                if "validation" in code_change or "admin" in code_change:
                    return "Missing validation check. Add: if (!validateRequest(request)) return; This prevents unauthorized access."
                elif "connect" in code_change:
                    return "Resource leak detected. Use try-finally or context manager to ensure connection.close(). This prevents database connection exhaustion."
                else:
                    return "Consider refactoring this logic into a separate method for better testability and reusability."
            else:
                return "Please review this code change."
        else:
            return "Code looks fine, but please add comments."

# Demonstrate comment generation
generator_elf = CodeReviewGenerator(experience_aware=True)
generator_base = CodeReviewGenerator(experience_aware=False)

test_code = "if (user.role == 'admin') { processAdminRequest(request); }"

print("Code Change:")
print(test_code)
print("\nELF-based Review (High Experience):")
print(generator_elf.generate_comment(test_code, mock_metrics["exp_reviewer_1"]))
print("\nELF-based Review (Low Experience):")
print(generator_elf.generate_comment(test_code, mock_metrics["new_reviewer_1"]))
print("\nBaseline Review:")
print(generator_base.generate_comment(test_code))

## 6. Evaluation Metrics Implementation

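Heuristic implementations of the evaluation dimensions used in the paper. BLEU-4 is computed with NLTK and the remaining checks are keyword-based simplifications; deepeval is installed above but not used in this notebook.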

In [None]:
from typing import List, Dict, Tuple
import re
from collections import Counter

class CodeReviewEvaluator:
    """Evaluate generated code review comments"""
    
    def __init__(self):
        self.comment_categories = {
            "functional": ["functional defect", "validation", "logical", "interface", "resource", "timing"],
            "evolvability": ["solution approach", "documentation", "organization", "naming", "visual"],
            "discussion": ["question", "design discussion"]
        }
        
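    def calculate_bleu4(self, reference: str, hypothesis: str) -> float:
        """Simplified sentence-level BLEU-4 on whitespace tokens (requires nltk)"""
        from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
        reference_tokens = reference.lower().split()
        hypothesis_tokens = hypothesis.lower().split()
        # Smoothing avoids zero scores (and NLTK warnings) when short comments
        # share no 3-grams or 4-grams with the reference
        return sentence_bleu([reference_tokens], hypothesis_tokens,
                             weights=(0.25, 0.25, 0.25, 0.25),
                             smoothing_function=SmoothingFunction().method1)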
    
    def is_semantically_equivalent(self, reference: str, hypothesis: str) -> bool:
        """Check if comments have same intent (simplified)"""
        # Extract key concepts
        ref_concepts = set(re.findall(r'\b(validation|resource|performance|refactor|leak|check)\b', 
                                    reference.lower()))
        hyp_concepts = set(re.findall(r'\b(validation|resource|performance|refactor|leak|check)\b', 
                                    hypothesis.lower()))
        
        # Check overlap
        if len(ref_concepts) == 0:
            return False
        overlap = len(ref_concepts.intersection(hyp_concepts)) / len(ref_concepts)
        return overlap > 0.5
    
    def is_applicable(self, code_change: str, comment: str) -> bool:
        """Check if comment is applicable to code change"""
        # Simple heuristics for demonstration
        code_keywords = set(re.findall(r'\b\w+\b', code_change.lower()))
        comment_keywords = set(re.findall(r'\b\w+\b', comment.lower()))
        
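        # Check if comment references code elements.
        # Caveat: the length fallback below accepts almost any comment over
        # 20 characters, so treat this as a placeholder, not a real judgment.
        overlap = len(code_keywords.intersection(comment_keywords))
        return overlap > 2 or len(comment) > 20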
    
    def classify_feedback_type(self, comment: str) -> str:
        """Classify comment as suggestion, concern, or confused question"""
        comment_lower = comment.lower()
        
        # Patterns for suggestions; word boundaries stop "use" from matching
        # inside "because" or "try" inside "entry"
        if re.search(r"\b(should|consider|try|use)\b|add:|change to", comment_lower):
            return "suggestion"
        
        # Patterns for confused questions
        if any(pattern in comment_lower for pattern in
               ["is this correct?", "what does", "i don't understand", "??"]):
            return "confused_question"
        
        # Default to concern
        return "concern"
    
    def has_explanation(self, comment: str) -> bool:
        """Check if comment contains explanation/rationale"""
        # "which" and "that" are weak signals, so match whole words only
        explanation_pattern = (
            r"\b(because|since|this prevents|this ensures|"
            r"to avoid|for better|which|that)\b"
        )
        return re.search(explanation_pattern, comment.lower()) is not None
    
    def identify_issue_type(self, comment: str) -> str:
        """Identify the type of issue discussed in comment"""
        comment_lower = comment.lower()
        
        # Check functional issues
        if any(word in comment_lower for word in ["validation", "validate", "check"]):
            return "validation"
        if any(word in comment_lower for word in ["leak", "resource", "close", "release"]):
            return "resource"
        if any(word in comment_lower for word in ["logic", "incorrect", "wrong"]):
            return "logical"
            
        # Check evolvability issues
        if any(word in comment_lower for word in ["refactor", "extract", "separate"]):
            return "organization"
        if any(word in comment_lower for word in ["comment", "document", "explain"]):
            return "documentation"
        if any(word in comment_lower for word in ["rename", "naming", "variable name"]):
            return "naming"
            
        return "other"

# Evaluate sample comments
evaluator = CodeReviewEvaluator()

# Test comments
test_comments = [
    {
        "code": "if (user.role == 'admin') { processAdminRequest(request); }",
        "reference": "Add validation check before processing",
        "generated": "Missing validation check. Add: if (!validateRequest(request)) return; This prevents unauthorized access."
    },
    {
        "code": "connection = db.connect()",
        "reference": "Close connection after use",
        "generated": "Resource leak detected. Use try-finally to ensure connection.close()."
    }
]

print("Evaluation Results:")
for i, test in enumerate(test_comments):
    print(f"\nExample {i+1}:")
    print(f"Code: {test['code']}")
    print(f"Generated: {test['generated']}")
    
    # Evaluate
    bleu = evaluator.calculate_bleu4(test['reference'], test['generated'])
    sem_eq = evaluator.is_semantically_equivalent(test['reference'], test['generated'])
    applicable = evaluator.is_applicable(test['code'], test['generated'])
    feedback_type = evaluator.classify_feedback_type(test['generated'])
    has_exp = evaluator.has_explanation(test['generated'])
    issue_type = evaluator.identify_issue_type(test['generated'])
    
    print(f"\nMetrics:")
    print(f"  BLEU-4: {bleu:.3f}")
    print(f"  Semantically Equivalent: {sem_eq}")
    print(f"  Applicable: {applicable}")
    print(f"  Feedback Type: {feedback_type}")
    print(f"  Has Explanation: {has_exp}")
    print(f"  Issue Type: {issue_type}")
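
To make the BLEU-4 metric above less opaque, here is a dependency-free sketch of how a sentence-level score can be computed: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and a brevity penalty. The `simple_bleu4` function is hypothetical, and its add-one smoothing is an assumption of this sketch, so its scores will not match NLTK's exactly.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu4(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Add-one smoothing keeps short sentences above zero
        log_precisions.append(math.log((matches + 1) / (total + 1)))
    # Brevity penalty: hypotheses shorter than the reference are scaled down
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / 4)

reference = "Add validation check before processing"
print(simple_bleu4(reference, reference))
print(simple_bleu4(reference, "Missing validation check before the request is processed"))
```

Review comments are short and phrased freely, which is one reason n-gram overlap scores stay low even when a generated comment raises the same issue as the reference.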

## 7. Results Visualization and Analysis

In [None]:
# Evaluation figures entered by hand for illustration; verify them against
# the paper's tables before citing
results_data = {
    "Model": ["CodeReviewer", "ELF_aco_pkg", "ELF_rso_pkg", "ELF_avg_pkg", "ELF_max_pkg"],
    "BLEU-4": [7.27, 7.46, 7.55, 7.45, 7.38],
    "Applicable": [42, 53, 53, 46, 44],
    "Suggestions": [27, 42, 37, 30, 30],
    "Functional_Issues": [7, 13, 12, 16, 11],
    "Has_Explanation": [8, 15, 11, 13, 15]
}

results_df = pd.DataFrame(results_data)

# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Plot 1: BLEU-4 scores
ax1 = axes[0, 0]
bars1 = ax1.bar(results_df["Model"], results_df["BLEU-4"], color=['gray'] + ['skyblue']*4)
ax1.set_title("BLEU-4 Scores", fontsize=14)
ax1.set_ylabel("BLEU-4")
ax1.set_ylim(7.0, 7.6)
ax1.tick_params(axis='x', rotation=45)

# Plot 2: Applicable comments
ax2 = axes[0, 1]
bars2 = ax2.bar(results_df["Model"], results_df["Applicable"], color=['gray'] + ['lightgreen']*4)
ax2.set_title("Applicable Comments (out of 100)", fontsize=14)
ax2.set_ylabel("Count")
ax2.tick_params(axis='x', rotation=45)

# Plot 3: Comment quality breakdown
ax3 = axes[1, 0]
quality_metrics = results_df[["Model", "Suggestions", "Functional_Issues", "Has_Explanation"]]
quality_metrics.set_index("Model").plot(kind='bar', ax=ax3, width=0.8)
ax3.set_title("Comment Quality Metrics", fontsize=14)
ax3.set_ylabel("Count")
ax3.legend(loc='upper left')
ax3.tick_params(axis='x', rotation=45)

# Plot 4: Improvement percentages
ax4 = axes[1, 1]
baseline = results_df.iloc[0]
improvements = []
for i in range(1, len(results_df)):
    model_data = results_df.iloc[i]
    imp = {
        "Model": model_data["Model"].replace("ELF_", ""),
        "Applicable": ((model_data["Applicable"] - baseline["Applicable"]) / baseline["Applicable"]) * 100,
        "Suggestions": ((model_data["Suggestions"] - baseline["Suggestions"]) / baseline["Suggestions"]) * 100,
        "Functional": ((model_data["Functional_Issues"] - baseline["Functional_Issues"]) / baseline["Functional_Issues"]) * 100
    }
    improvements.append(imp)

imp_df = pd.DataFrame(improvements)
imp_df.set_index("Model").plot(kind='bar', ax=ax4, width=0.8)
ax4.set_title("Percentage Improvement over CodeReviewer", fontsize=14)
ax4.set_ylabel("Improvement (%)")
ax4.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
ax4.legend(loc='upper left')
ax4.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Display summary statistics
print("\nKey Findings:")
print(f"- Best BLEU-4: ELF_rso_pkg with {results_df.loc[2, 'BLEU-4']:.2f} (+{((results_df.loc[2, 'BLEU-4'] - results_df.loc[0, 'BLEU-4']) / results_df.loc[0, 'BLEU-4'] * 100):.1f}%)")
print(f"- Most Applicable Comments: ELF_aco_pkg and ELF_rso_pkg with {results_df.loc[1, 'Applicable']} (+{((results_df.loc[1, 'Applicable'] - results_df.loc[0, 'Applicable']) / results_df.loc[0, 'Applicable'] * 100):.0f}%)")
print(f"- Most Suggestions: ELF_aco_pkg with {results_df.loc[1, 'Suggestions']} (+{((results_df.loc[1, 'Suggestions'] - results_df.loc[0, 'Suggestions']) / results_df.loc[0, 'Suggestions'] * 100):.0f}%)")
print(f"- Most Functional Issues: ELF_avg_pkg with {results_df.loc[3, 'Functional_Issues']} (+{((results_df.loc[3, 'Functional_Issues'] - results_df.loc[0, 'Functional_Issues']) / results_df.loc[0, 'Functional_Issues'] * 100):.0f}%)")
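
The percentages printed above all use the same relative-improvement formula, (value - baseline) / baseline × 100. A small sketch with the counts from `results_data` makes the arithmetic explicit; the `relative_improvement` helper is a hypothetical name introduced here.

```python
def relative_improvement(baseline: float, value: float) -> float:
    """Percentage change of value relative to baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline * 100

# Baseline (CodeReviewer) vs. best ELF count, taken from results_data above
pairs = {
    "Applicable": (42, 53),
    "Suggestions": (27, 42),
    "Functional issues": (7, 16),
}
for name, (base, best) in pairs.items():
    print(f"{name}: {relative_improvement(base, best):+.1f}%")
```

With these counts the suggestion and functional-issue gains round to 56% and 129%, matching the abstract summary, while the applicable-comment gain comes out near 26% rather than 29%, so the transcribed counts may be worth checking against the paper.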

## 8. Personal Research Template

Use this section to experiment with different configurations and explore the method further.
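
As a starting point, the full configuration grid can be enumerated before running anything. This sketch assumes the naming pattern of the model labels used in the results table (for example `ELF_aco_pkg`); the `SUFFIX` mapping is a hypothetical helper that mirrors the field suffixes of `ReviewerMetrics`.

```python
from itertools import product

STRATEGIES = ("aco", "rso", "avg", "max")
GRANULARITIES = ("repository", "subsystem", "package")
SUFFIX = {"repository": "repo", "subsystem": "sys", "package": "pkg"}

# Every (strategy, granularity) pair with a readable label
configs = [
    (strategy, granularity, f"ELF_{strategy}_{SUFFIX[granularity]}")
    for strategy, granularity in product(STRATEGIES, GRANULARITIES)
]

print(len(configs))
for strategy, granularity, label in configs[:3]:
    print(label, strategy, granularity)
```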

In [None]:
# Template for experimenting with different ELF configurations

class ResearchExperiment:
    """Template for conducting ELF experiments"""
    
    def __init__(self, name: str):
        self.name = name
        self.results = []
        
    def run_experiment(self, strategy: str, granularity: str, 
                      test_comments: List[CodeReviewComment],
                      reviewer_metrics: Dict[str, ReviewerMetrics]):
        """Run an experiment with specific ELF configuration"""
        
        print(f"\nRunning experiment: {strategy}_{granularity}")
        
        # Initialize ELF loss
        elf_loss = ExperienceAwareLoss(strategy=strategy, granularity=granularity)
        
        # Calculate average weights for different experience levels
        exp_levels = {"high": [], "medium": [], "low": []}
        
        for comment in test_comments:
            metrics = reviewer_metrics[comment.reviewer_id]
            weight = elf_loss.calculate_weight(metrics)
            
            # Classify experience level
            avg_ownership = (metrics.aco_pkg + metrics.rso_pkg) / 2
            if avg_ownership > 0.2:
                exp_levels["high"].append(weight)
            elif avg_ownership > 0.1:
                exp_levels["medium"].append(weight)
            else:
                exp_levels["low"].append(weight)
        
        # Calculate statistics
        result = {
            "strategy": strategy,
            "granularity": granularity,
            "avg_weight_high": np.mean(exp_levels["high"]) if exp_levels["high"] else 0,
            "avg_weight_medium": np.mean(exp_levels["medium"]) if exp_levels["medium"] else 0,
            "avg_weight_low": np.mean(exp_levels["low"]) if exp_levels["low"] else 0,
            "weight_ratio_high_low": np.mean(exp_levels["high"]) / np.mean(exp_levels["low"]) 
                                    if exp_levels["high"] and exp_levels["low"] else 0
        }
        
        self.results.append(result)
        
        print(f"  High exp avg weight: {result['avg_weight_high']:.3f}")
        print(f"  Low exp avg weight: {result['avg_weight_low']:.3f}")
        print(f"  Weight ratio: {result['weight_ratio_high_low']:.2f}x")
        
        return result
    
    def compare_strategies(self):
        """Compare different strategies"""
        if not self.results:
            print("No experiments run yet!")
            return
            
        results_df = pd.DataFrame(self.results)
        
        # Visualization
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
        
        # Weight distribution by strategy
        strategies = results_df['strategy'].unique()
        x = np.arange(len(strategies))
        width = 0.25
        
        for i, gran in enumerate(['repository', 'subsystem', 'package']):
            gran_data = results_df[results_df['granularity'] == gran]
            ax1.bar(x + i*width, gran_data['weight_ratio_high_low'], 
                   width, label=gran)
        
        ax1.set_xlabel('Strategy')
        ax1.set_ylabel('Weight Ratio (High/Low Experience)')
        ax1.set_title('Experience Weight Ratios by Strategy and Granularity')
        ax1.set_xticks(x + width)
        ax1.set_xticklabels(strategies)
        ax1.legend()
        
        # Heatmap of average weights
        pivot_data = results_df.pivot(index='strategy', columns='granularity', 
                                     values='avg_weight_high')
        sns.heatmap(pivot_data, annot=True, fmt='.2f', cmap='YlOrRd', ax=ax2)
        ax2.set_title('Average Weights for High Experience Reviewers')
        
        plt.tight_layout()
        plt.show()

# Run experiments
experiment = ResearchExperiment("ELF_Analysis")

# Test all combinations
for strategy in ["aco", "rso", "avg", "max"]:
    for granularity in ["repository", "subsystem", "package"]:
        experiment.run_experiment(strategy, granularity, mock_comments, mock_metrics)

# Compare results
experiment.compare_strategies()

## 9. Conclusions and Future Directions

### Key Takeaways
1. **Experience Matters**: Weighting the loss by reviewer experience (ELF) yields more applicable comments, more suggestions, and more functional-issue findings than the CodeReviewer baseline
2. **Granularity is Important**: Package-level ownership metrics tend to perform best
3. **Strategy Selection**: Both ACO and RSO strategies are effective, with complementary strengths
4. **Quality Over Quantity**: ELF models generate more functional issue detections and helpful suggestions

### Implementation Notes
- This notebook uses LangChain only to define prompt templates; the generator returns canned responses and no LLM is called
- The paper uses the T5-based CodeReviewer model fine-tuned on GitHub data
- deepeval is installed but not used here; it could support more sophisticated evaluation of generated comments

### Future Research Directions
1. **Dynamic Weight Adjustment**: Explore adaptive weighting based on comment quality feedback
2. **Multi-modal Experience**: Incorporate other signals like code complexity and review history
3. **Cross-project Transfer**: Study how experience metrics transfer across different projects
4. **Real-time Adaptation**: Develop online learning approaches for continuous improvement

In [None]:
# Save experiment results
import pickle

experiment_data = {
    "mock_comments": mock_comments,
    "mock_metrics": mock_metrics,
    "experiment_results": experiment.results
}

with open("elf_experiment_results.pkl", "wb") as f:
    pickle.dump(experiment_data, f)

print("Experiment data saved!")
print(f"Total experiments run: {len(experiment.results)}")
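# The largest weight ratio shows which configuration separates reviewers
# most strongly; it is not a measure of model quality
largest = max(experiment.results, key=lambda x: x['weight_ratio_high_low'])
print(f"Largest high/low weight ratio: {largest['strategy']}_{largest['granularity']} ({largest['weight_ratio_high_low']:.2f}x)")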
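
A pickle file can only be reloaded where the `CodeReviewComment` and `ReviewerMetrics` classes are defined, so a JSON export may be more portable for sharing just the numeric results. A sketch, assuming each result is a flat dict like those produced by `run_experiment`; the sample values and the file location are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical results in the same shape as experiment.results
results = [
    {"strategy": "aco", "granularity": "package", "weight_ratio_high_low": 1.42},
    {"strategy": "rso", "granularity": "package", "weight_ratio_high_low": 1.61},
]

path = os.path.join(tempfile.gettempdir(), "elf_experiment_results.json")
with open(path, "w") as f:
    # default=float converts NumPy scalars that json cannot serialize natively
    json.dump(results, f, indent=2, default=float)

with open(path) as f:
    reloaded = json.load(f)

print(reloaded == results)
```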