# Focused Learning: Measuring Code Review Quality Metrics

## Learning Objectives
1. Understand the **multi-dimensional nature** of code review quality
2. Implement metrics for measuring review effectiveness (from Table II)
3. Map paper metrics to the **DeepEval** evaluation framework
4. Build a comprehensive quality assessment system

## Paper Context
**Section Reference**: Section II-D (Data Analysis) and Section III-B (RQ1: Impact on Quality Issues Found)

**Key Metrics from Paper**:
- Number of reported quality issues
- Length of code review (sentences)
- Covered code locations
- Issue severity classification
- Inter-rater agreement (Cohen's Kappa = 0.315)

**Table Reference**: Table II - Variables used in the study

## 1. Setup and Data Models

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Any, Tuple, Optional
from dataclasses import dataclass, field
from enum import Enum
import re
from collections import Counter
from sklearn.metrics import cohen_kappa_score, confusion_matrix
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

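# Download NLTK tokenizer data ('punkt_tab' is what newer NLTK releases load)
for resource in ('punkt', 'punkt_tab'):
    try:
        nltk.download(resource, quiet=True)
    except Exception:
        # Offline or sandboxed environment: sent_tokenize fails later if the data is missing
        pass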

# Set up visualization
plt.style.use('seaborn-v0_8-darkgrid')
np.random.seed(42)

In [None]:
# Define issue taxonomy based on Mäntylä and Lassenius (referenced in paper)
class IssueCategory(Enum):
    """Issue categories from Mäntylä and Lassenius taxonomy"""
    # Evolvability issues (77% in paper)
    DOCUMENTATION = "documentation"
    STRUCTURE = "structure"
    VISUAL_REPRESENTATION = "visual_representation"
    SOLUTION_APPROACH = "solution_approach"
    
    # Functional issues (23% in paper)
    LOGIC = "logic"
    RESOURCE = "resource"
    CHECK = "check"
    INTERFACE = "interface"

@dataclass
class CodeReviewComment:
    """Detailed code review comment with metadata"""
    id: str
    text: str
    file_path: str
    line_start: int
    line_end: int
    category: IssueCategory
    severity: str  # low, medium, high
    author_type: str  # human, llm, comprehensive
    confidence: float = 0.5  # Reviewer confidence in the comment
    
@dataclass
class CodeReviewMetrics:
    """Comprehensive metrics for a code review"""
    # Basic metrics from paper
    num_issues_reported: int
    review_length_sentences: int
    covered_lines: int
    covered_files: int
    
    # Quality metrics
    issues_by_severity: Dict[str, int]
    issues_by_category: Dict[str, int]
    
    # Effectiveness metrics
    true_positives: int = 0
    false_positives: int = 0
    false_negatives: int = 0
    
    # Time metrics
    time_total_seconds: int = 0
    time_per_issue: float = 0.0
    
    # Advanced metrics
    comment_clarity_score: float = 0.0
    actionability_score: float = 0.0
    coverage_uniformity: float = 0.0
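
The `true_positives`, `false_positives`, and `false_negatives` fields are declared above but are never turned into scores later in this notebook. A minimal sketch of how they could be used, assuming the standard precision/recall/F1 definitions; the helper name `effectiveness_scores` and the example counts are hypothetical and do not come from the paper.

```python
from typing import Dict

def effectiveness_scores(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall and F1 from confusion counts (hypothetical helper)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up counts: 6 reported issues are real, 2 are spurious, 4 real issues were missed
scores = effectiveness_scores(tp=6, fp=2, fn=4)
print(scores)
```

Computing these requires a ground-truth list of issues per reviewed change, which the synthetic data in this notebook does not provide.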

## 2. Quality Metrics Implementation

In [None]:
class CodeReviewQualityAnalyzer:
    """Analyzes code review quality using multiple metrics"""
    
    def __init__(self):
        self.severity_weights = {
            'low': 1,
            'medium': 2,
            'high': 3
        }
    
    def calculate_basic_metrics(self, comments: List[CodeReviewComment]) -> Dict[str, Any]:
        """Calculate basic metrics as defined in the paper"""
        
        # Number of issues reported
        num_issues = len(comments)
        
        # Review length in sentences
        all_text = ' '.join([c.text for c in comments])
        sentences = sent_tokenize(all_text) if all_text else []
        review_length = len(sentences)
        
        # Covered code locations
        covered_lines = set()
        covered_files = set()
        
        for comment in comments:
            covered_files.add(comment.file_path)
            for line in range(comment.line_start, comment.line_end + 1):
                covered_lines.add((comment.file_path, line))
        
        return {
            'num_issues_reported': num_issues,
            'review_length_sentences': review_length,
            'covered_lines': len(covered_lines),
            'covered_files': len(covered_files),
            'avg_comment_length': np.mean([len(c.text.split()) for c in comments]) if comments else 0
        }
    
    def calculate_severity_distribution(self, comments: List[CodeReviewComment]) -> Dict[str, int]:
        """Calculate distribution of issues by severity"""
        severity_counts = {'low': 0, 'medium': 0, 'high': 0}
        
        for comment in comments:
            severity_counts[comment.severity] += 1
        
        return severity_counts
    
    def calculate_category_distribution(self, comments: List[CodeReviewComment]) -> Dict[str, int]:
        """Calculate distribution by issue category"""
        category_counts = Counter([c.category.value for c in comments])
        return dict(category_counts)
    
    def calculate_weighted_score(self, comments: List[CodeReviewComment]) -> float:
        """Calculate weighted quality score based on severity"""
        if not comments:
            return 0.0
        
        total_weight = sum(self.severity_weights[c.severity] for c in comments)
        max_possible = len(comments) * self.severity_weights['high']
        
        return total_weight / max_possible if max_possible > 0 else 0.0
    
    def calculate_clarity_score(self, comments: List[CodeReviewComment]) -> float:
        """Estimate clarity of review comments"""
        if not comments:
            return 0.0
        
        clarity_indicators = [
            'should', 'must', 'need to', 'consider', 'recommend',
            'instead of', 'rather than', 'because', 'since', 'due to'
        ]
        
        scores = []
        for comment in comments:
            text_lower = comment.text.lower()
            
            # Check for clarity indicators
            indicator_count = sum(1 for ind in clarity_indicators if ind in text_lower)
            
            # Check for specific line references
            has_line_ref = bool(re.search(r'line \d+|lines? \d+-\d+', text_lower))
            
            # Check for code snippets
            has_code = bool(re.search(r'`[^`]+`|```[^`]+```', comment.text))
            
            # Calculate clarity score
            score = min(1.0, (indicator_count * 0.2 + (0.3 if has_line_ref else 0) + 
                            (0.3 if has_code else 0) + 0.2))
            scores.append(score)
        
        return np.mean(scores)
    
    def calculate_actionability_score(self, comments: List[CodeReviewComment]) -> float:
        """Measure how actionable the comments are"""
        if not comments:
            return 0.0
        
        action_keywords = [
            'change', 'modify', 'update', 'fix', 'remove', 'add',
            'replace', 'refactor', 'rename', 'move', 'extract'
        ]
        
        actionable_count = 0
        for comment in comments:
            text_lower = comment.text.lower()
            if any(keyword in text_lower for keyword in action_keywords):
                actionable_count += 1
        
        return actionable_count / len(comments)
    
    def calculate_coverage_uniformity(self, comments: List[CodeReviewComment], 
                                    total_files: int) -> float:
        """Measure how uniformly the review covers the codebase"""
        if not comments or total_files == 0:
            return 0.0
        
        # Count comments per file
        file_comment_counts = Counter([c.file_path for c in comments])
        
        # Calculate entropy as measure of uniformity
        total_comments = len(comments)
        entropy = 0
        
        for count in file_comment_counts.values():
            if count > 0:
                p = count / total_comments
                entropy -= p * np.log2(p)
        
        # Normalize by maximum possible entropy
        max_entropy = np.log2(min(total_files, total_comments))
        
        return entropy / max_entropy if max_entropy > 0 else 0.0

# Create analyzer instance
analyzer = CodeReviewQualityAnalyzer()
print("Code Review Quality Analyzer initialized!")
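
To make the entropy-based uniformity score concrete, here is a standalone sketch of the same calculation as `calculate_coverage_uniformity`, rewritten with only the standard library. The function takes plain file paths instead of `CodeReviewComment` objects; that simplification and the example paths are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List

def coverage_uniformity(file_paths: List[str], total_files: int) -> float:
    """Normalized Shannon entropy of comments per file (0 = one file, 1 = even spread)."""
    if not file_paths or total_files == 0:
        return 0.0
    counts = Counter(file_paths)
    n = len(file_paths)
    # Shannon entropy of the per-file comment shares
    entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    # Entropy is maximal when comments are spread evenly over as many files as possible
    max_entropy = math.log2(min(total_files, n))
    return entropy / max_entropy if max_entropy > 0 else 0.0

even = coverage_uniformity(["a.py", "b.py", "c.py", "d.py"], total_files=4)
skewed = coverage_uniformity(["a.py", "a.py", "a.py", "b.py"], total_files=4)
single = coverage_uniformity(["a.py"] * 4, total_files=4)
print(even, skewed, single)
```

An even spread scores 1, a review that only touches one file scores 0, and the 3-to-1 split lands in between.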

## 3. Simulating Review Data

In [None]:
def generate_synthetic_review(treatment: str, num_issues: int = 10) -> List[CodeReviewComment]:
    """Generate synthetic review data based on paper findings"""
    
    comments = []
    
    # Treatment-specific distributions based on paper
    if treatment == "MCR":
        severity_probs = [0.4, 0.4, 0.2]  # More balanced
        category_weights = {
            IssueCategory.DOCUMENTATION: 0.15,
            IssueCategory.STRUCTURE: 0.20,
            IssueCategory.SOLUTION_APPROACH: 0.25,
            IssueCategory.LOGIC: 0.15,
            IssueCategory.CHECK: 0.15,
            IssueCategory.INTERFACE: 0.10
        }
    elif treatment == "ACR":
        severity_probs = [0.6, 0.3, 0.1]  # More low-severity issues
        category_weights = {
            IssueCategory.DOCUMENTATION: 0.25,
            IssueCategory.STRUCTURE: 0.30,
            IssueCategory.SOLUTION_APPROACH: 0.20,
            IssueCategory.LOGIC: 0.10,
            IssueCategory.CHECK: 0.10,
            IssueCategory.INTERFACE: 0.05
        }
    else:  # CCR
        severity_probs = [0.3, 0.4, 0.3]  # Comprehensive coverage
        category_weights = {
            IssueCategory.DOCUMENTATION: 0.20,
            IssueCategory.STRUCTURE: 0.20,
            IssueCategory.SOLUTION_APPROACH: 0.20,
            IssueCategory.LOGIC: 0.15,
            IssueCategory.CHECK: 0.15,
            IssueCategory.INTERFACE: 0.10
        }
    
    # Generate comments
    for i in range(num_issues):
        severity = np.random.choice(['low', 'medium', 'high'], p=severity_probs)
        category = np.random.choice(list(category_weights.keys()), 
                                  p=list(category_weights.values()))
        
        # Generate realistic comment text
        comment_templates = {
            'low': [
                "Consider adding documentation for this method.",
                "This variable name could be more descriptive.",
                "Minor: inconsistent spacing in this section."
            ],
            'medium': [
                "This method is too long and should be refactored.",
                "Missing error handling for edge cases.",
                "Performance issue: string concatenation in loop."
            ],
            'high': [
                "Critical bug: incorrect logic in condition.",
                "Security vulnerability: SQL injection risk.",
                "Memory leak: resources not properly released."
            ]
        }
        
        text = np.random.choice(comment_templates[severity])
        
        comment = CodeReviewComment(
            id=f"{treatment}-{i+1}",
            text=text,
            file_path=f"src/module_{np.random.randint(1, 5)}.py",
            line_start=np.random.randint(1, 100),
            line_end=0,  # Will be set below
            category=category,
            severity=severity,
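            # MCR -> human, ACR -> llm, CCR -> comprehensive (the values listed on CodeReviewComment)
            author_type={'MCR': 'human', 'ACR': 'llm'}.get(treatment, 'comprehensive'),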
            confidence=np.random.uniform(0.6, 0.95)
        )
        comment.line_end = comment.line_start + np.random.randint(0, 5)
        
        comments.append(comment)
    
    return comments

# Generate sample reviews
mcr_review = generate_synthetic_review("MCR", num_issues=8)
acr_review = generate_synthetic_review("ACR", num_issues=12)
ccr_review = generate_synthetic_review("CCR", num_issues=10)

print(f"Generated synthetic reviews:")
print(f"  MCR: {len(mcr_review)} comments")
print(f"  ACR: {len(acr_review)} comments")
print(f"  CCR: {len(ccr_review)} comments")
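
The generator above draws from the global `np.random` state, so its output depends on every random call made earlier in the session. A small sketch of the same weighted severity sampling with a local, seeded generator from the standard library; the probabilities mirror the ACR branch above, while `sample_severities` and the seed values are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List

ACR_SEVERITY_PROBS: Dict[str, float] = {"low": 0.6, "medium": 0.3, "high": 0.1}

def sample_severities(probs: Dict[str, float], n: int, seed: int) -> List[str]:
    """Draw n severity labels with the given probabilities (hypothetical helper)."""
    if abs(sum(probs.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    rng = random.Random(seed)  # local generator, global random state is untouched
    return rng.choices(list(probs), weights=list(probs.values()), k=n)

draws = sample_severities(ACR_SEVERITY_PROBS, n=10_000, seed=42)
shares = {sev: count / len(draws) for sev, count in Counter(draws).items()}
print(shares)
```

With a large sample the observed shares should sit close to the configured probabilities, which is a quick sanity check on any synthetic generator.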

## 4. Computing Comprehensive Metrics

In [None]:
def compute_review_metrics(comments: List[CodeReviewComment], 
                         treatment: str,
                         time_seconds: Optional[int] = None) -> CodeReviewMetrics:
    """Compute comprehensive metrics for a code review"""
    
    # Basic metrics
    basic = analyzer.calculate_basic_metrics(comments)
    
    # Quality distributions
    severity_dist = analyzer.calculate_severity_distribution(comments)
    category_dist = analyzer.calculate_category_distribution(comments)
    
    # Advanced scores
    clarity = analyzer.calculate_clarity_score(comments)
    actionability = analyzer.calculate_actionability_score(comments)
    uniformity = analyzer.calculate_coverage_uniformity(comments, total_files=5)
    
    # Time metrics (from paper averages)
    if time_seconds is None:
        time_seconds = {
            'MCR': 42 * 60,  # 42 minutes
            'ACR': 56 * 60,  # 56 minutes
            'CCR': 57 * 60   # 57 minutes
        }.get(treatment, 50 * 60)
    
    metrics = CodeReviewMetrics(
        num_issues_reported=basic['num_issues_reported'],
        review_length_sentences=basic['review_length_sentences'],
        covered_lines=basic['covered_lines'],
        covered_files=basic['covered_files'],
        issues_by_severity=severity_dist,
        issues_by_category=category_dist,
        time_total_seconds=time_seconds,
        time_per_issue=time_seconds / basic['num_issues_reported'] if basic['num_issues_reported'] > 0 else 0,
        comment_clarity_score=clarity,
        actionability_score=actionability,
        coverage_uniformity=uniformity
    )
    
    return metrics

# Compute metrics for all treatments
metrics_mcr = compute_review_metrics(mcr_review, 'MCR')
metrics_acr = compute_review_metrics(acr_review, 'ACR')
metrics_ccr = compute_review_metrics(ccr_review, 'CCR')

# Display metrics comparison
metrics_data = {
    'Treatment': ['MCR', 'ACR', 'CCR'],
    'Issues Reported': [metrics_mcr.num_issues_reported, metrics_acr.num_issues_reported, metrics_ccr.num_issues_reported],
    'Review Length (sentences)': [metrics_mcr.review_length_sentences, metrics_acr.review_length_sentences, metrics_ccr.review_length_sentences],
    'Clarity Score': [metrics_mcr.comment_clarity_score, metrics_acr.comment_clarity_score, metrics_ccr.comment_clarity_score],
    'Actionability': [metrics_mcr.actionability_score, metrics_acr.actionability_score, metrics_ccr.actionability_score],
    'Time per Issue (min)': [metrics_mcr.time_per_issue/60, metrics_acr.time_per_issue/60, metrics_ccr.time_per_issue/60]
}

metrics_df = pd.DataFrame(metrics_data)
print("\nCode Review Metrics Comparison:")
print(metrics_df.round(2).to_string(index=False))
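
The "Time per Issue" column is plain arithmetic on two inputs: the average review times used as defaults in `compute_review_metrics` and the issue counts chosen for the synthetic reviews. A worked version, using exactly those numbers:

```python
# Average review time per treatment in minutes (defaults in compute_review_metrics)
avg_minutes = {"MCR": 42, "ACR": 56, "CCR": 57}
# Issue counts chosen for the synthetic reviews in this notebook
issue_counts = {"MCR": 8, "ACR": 12, "CCR": 10}

minutes_per_issue = {t: avg_minutes[t] / issue_counts[t] for t in avg_minutes}
for treatment, value in minutes_per_issue.items():
    print(f"{treatment}: {value:.2f} min/issue")
# MCR: 5.25 min/issue
# ACR: 4.67 min/issue
# CCR: 5.70 min/issue
```

Because the issue counts are arbitrary inputs to the synthetic generator, the ranking in this column reflects those choices and should not be read as an empirical result.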

## 5. DeepEval Integration

In [None]:
class CodeReviewRelevanceMetric(BaseMetric):
    """Custom DeepEval metric for code review relevance"""
    
    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        self.score = 0
        self.reason = ""
        self.success = False
    
    def measure(self, test_case: LLMTestCase):
        """Measure relevance of code review comments to actual code issues"""
        
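        # Extract code issues from expected output, skipping blank lines
        # (''.split('\n') returns [''], so an unfiltered set is never empty)
        expected_issues = {line.strip() for line in (test_case.expected_output or '').lower().split('\n') if line.strip()}
        
        # Extract issues from actual output, skipping blank lines
        actual_issues = {line.strip() for line in (test_case.actual_output or '').lower().split('\n') if line.strip()}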
        
        # Calculate relevance score
        if not expected_issues:
            self.score = 0
        else:
            # Keyword heuristic: count review lines that use issue-like wording
            # (lines are not matched against expected_issues)
            relevance_keywords = ['bug', 'error', 'issue', 'problem', 'incorrect', 
                                'missing', 'wrong', 'fix', 'should', 'must']
            
            relevant_count = 0
            for actual in actual_issues:
                if any(keyword in actual for keyword in relevance_keywords):
                    relevant_count += 1
            
            self.score = relevant_count / len(actual_issues) if actual_issues else 0
        
        self.success = self.score >= self.threshold
        self.reason = f"Relevance score: {self.score:.2f} (threshold: {self.threshold})"
        
        return self.score
    
    def is_successful(self) -> bool:
        return self.success

class CodeReviewCompletenessMetric(BaseMetric):
    """Measure completeness of code review coverage"""
    
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.score = 0
        self.reason = ""
        self.success = False
    
    def measure(self, test_case: LLMTestCase):
        """Measure how completely the review covers known issues"""
        
        # Parse expected issues (ground truth)
        expected_lines = []
        for line in test_case.expected_output.split('\n'):
            # Extract line numbers from expected output (re is imported at the top of the notebook)
            line_nums = re.findall(r'line (\d+)', line.lower())
            expected_lines.extend([int(n) for n in line_nums])
        
        # Parse actual review
        actual_lines = []
        for line in test_case.actual_output.split('\n'):
            line_nums = re.findall(r'line (\d+)', line.lower())
            actual_lines.extend([int(n) for n in line_nums])
        
        # Calculate completeness
        if not expected_lines:
            self.score = 1.0 if not actual_lines else 0.0
        else:
            covered = len(set(actual_lines) & set(expected_lines))
            self.score = covered / len(set(expected_lines))
        
        self.success = self.score >= self.threshold
        self.reason = f"Completeness: {self.score:.2f} (covered {len(set(actual_lines) & set(expected_lines))} of {len(set(expected_lines))} expected locations)"
        
        return self.score
    
    def is_successful(self) -> bool:
        return self.success

# Initialize custom metrics
relevance_metric = CodeReviewRelevanceMetric(threshold=0.7)
completeness_metric = CodeReviewCompletenessMetric(threshold=0.5)

print("Custom DeepEval metrics for code review initialized!")
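
The completeness logic can be tried without DeepEval installed. The sketch below reproduces the line-number overlap from `CodeReviewCompletenessMetric.measure` on plain strings; the helper names `referenced_lines` and `completeness` and the sample review texts are hypothetical.

```python
import re
from typing import Set

def referenced_lines(text: str) -> Set[int]:
    """Collect every number that directly follows the word 'line' (case-insensitive)."""
    return {int(n) for n in re.findall(r"line (\d+)", text.lower())}

def completeness(expected: str, actual: str) -> float:
    """Share of expected line locations that the actual review also mentions."""
    expected_lines = referenced_lines(expected)
    actual_lines = referenced_lines(actual)
    if not expected_lines:
        # Nothing to find: a review that flags nothing is complete
        return 1.0 if not actual_lines else 0.0
    return len(expected_lines & actual_lines) / len(expected_lines)

expected = "Line 10: off-by-one in loop\nLine 42: missing null check"
actual = "Line 10 looks wrong, the loop skips the last item.\nConsider renaming x on line 7."
score = completeness(expected, actual)  # 1 of 2 expected locations is covered
print(score)
```

One limitation of the pattern is visible here: a range written as "lines 10-12" is not matched by `line (\d+)`, so reviews that cite ranges are undercounted.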

## 6. Inter-rater Agreement Analysis

In [None]:
def simulate_inter_rater_agreement(n_issues: int = 50, n_raters: int = 2):
    """Simulate inter-rater agreement for severity classification.

    Exactly two raters are simulated; n_raters is currently unused.
    """
    
    # Generate ground truth severities
    true_severities = np.random.choice(['low', 'medium', 'high'], 
                                     size=n_issues, 
                                     p=[0.5, 0.3, 0.2])
    
    # Simulate rater classifications with some noise
    # Paper reports Cohen's Kappa of 0.315 (fair agreement)
    rater1_severities = []
    rater2_severities = []
    
    for true_sev in true_severities:
        # Rater 1
        if np.random.random() < 0.6:  # rater matches the ground truth 60% of the time
            rater1_severities.append(true_sev)
        else:
            # Random disagreement
            options = ['low', 'medium', 'high']
            options.remove(true_sev)
            rater1_severities.append(np.random.choice(options))
        
        # Rater 2
        if np.random.random() < 0.6:
            rater2_severities.append(true_sev)
        else:
            options = ['low', 'medium', 'high']
            options.remove(true_sev)
            rater2_severities.append(np.random.choice(options))
    
    # Calculate Cohen's Kappa
    kappa = cohen_kappa_score(rater1_severities, rater2_severities)
    
    # Create confusion matrix
    cm = confusion_matrix(rater1_severities, rater2_severities, 
                         labels=['low', 'medium', 'high'])
    
    return {
        'rater1': rater1_severities,
        'rater2': rater2_severities,
        'kappa': kappa,
        'confusion_matrix': cm
    }

# Simulate agreement
agreement_data = simulate_inter_rater_agreement(n_issues=100)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(agreement_data['confusion_matrix'], 
            annot=True, fmt='d', cmap='Blues',
            xticklabels=['Low', 'Medium', 'High'],
            yticklabels=['Low', 'Medium', 'High'])
plt.title(f"Inter-rater Agreement Matrix\nCohen's Kappa: {agreement_data['kappa']:.3f}")
plt.xlabel('Rater 2 Classification')
plt.ylabel('Rater 1 Classification')
plt.show()

print(f"\nInter-rater Agreement Analysis:")
print(f"Cohen's Kappa: {agreement_data['kappa']:.3f}")
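kappa_value = agreement_data['kappa']
# Landis and Koch (1977) agreement bands
kappa_label = ('Poor' if kappa_value < 0 else 'Slight' if kappa_value <= 0.2 else
               'Fair' if kappa_value <= 0.4 else 'Moderate' if kappa_value <= 0.6 else
               'Substantial' if kappa_value <= 0.8 else 'Almost perfect')
print(f"Interpretation: {kappa_label}")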
print(f"\nNote: Paper reported κ = 0.315 (Fair agreement)")
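
Cohen's kappa corrects the observed agreement $p_o$ for the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it by hand for two raters, as a cross-check on `cohen_kappa_score`. It also shows why the simulation's 60% setting is not the agreement rate between raters: each rater matches the hidden truth independently, so the two raters agree with each other less often. The helper names and the four-item example are hypothetical.

```python
from collections import Counter
from typing import Sequence

def cohen_kappa(r1: Sequence[str], r2: Sequence[str]) -> float:
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n  # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement from each rater's marginal label frequencies
    p_e = sum((c1[label] / n) * (c2[label] / n) for label in set(c1) | set(c2))
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when both raters use one identical label
    return (p_o - p_e) / (1 - p_e)

def raw_agreement(accuracy: float, n_labels: int) -> float:
    """P(two independent raters agree) when each matches the truth with
    probability `accuracy` and otherwise picks a wrong label uniformly."""
    return accuracy ** 2 + (1 - accuracy) ** 2 / (n_labels - 1)

rater1 = ["low", "low", "high", "high"]
rater2 = ["low", "high", "high", "high"]
kappa = cohen_kappa(rater1, rater2)  # p_o = 0.75, p_e = 0.5
agree = raw_agreement(0.6, 3)        # about 0.44 for the simulation's settings
print(kappa, agree)
```

With three labels and 60% per-rater accuracy, the raters agree on roughly 44% of items, which is why the simulated kappa can land well below what the 0.6 parameter might suggest.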

## 7. Quality Metric Visualization

In [None]:
def visualize_quality_metrics(metrics_dict: Dict[str, CodeReviewMetrics]):
    """Create comprehensive visualization of review quality metrics"""
    
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    axes = axes.flatten()
    
    treatments = list(metrics_dict.keys())
    
    # 1. Issue severity distribution
    ax = axes[0]
    severity_data = pd.DataFrame([
        metrics.issues_by_severity for metrics in metrics_dict.values()
    ], index=treatments)
    severity_data.plot(kind='bar', ax=ax, color=['lightgreen', 'orange', 'red'])
    ax.set_title('Issue Severity Distribution by Treatment')
    ax.set_xlabel('Treatment')
    ax.set_ylabel('Number of Issues')
    ax.legend(title='Severity')
    
    # 2. Time efficiency
    ax = axes[1]
    time_data = [m.time_per_issue/60 for m in metrics_dict.values()]
    bars = ax.bar(treatments, time_data, color=['skyblue', 'lightcoral', 'lightgreen'])
    ax.set_title('Time Efficiency: Minutes per Issue Found')
    ax.set_ylabel('Minutes per Issue')
    
    # Add value labels on bars
    for bar, val in zip(bars, time_data):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1, 
                f'{val:.1f}', ha='center', va='bottom')
    
    # 3. Quality scores
    ax = axes[2]
    quality_metrics = ['Clarity', 'Actionability', 'Coverage Uniformity']
    quality_data = pd.DataFrame({
        treatment: [
            metrics.comment_clarity_score,
            metrics.actionability_score,
            metrics.coverage_uniformity
        ] for treatment, metrics in metrics_dict.items()
    }, index=quality_metrics)
    
    quality_data.plot(kind='bar', ax=ax)
    ax.set_title('Quality Scores Comparison')
    ax.set_ylabel('Score (0-1)')
    ax.set_ylim(0, 1.1)
    ax.legend(title='Treatment')
    
    # 4. Issue category distribution (stacked)
    ax = axes[3]
    category_data = pd.DataFrame([
        metrics.issues_by_category for metrics in metrics_dict.values()
    ], index=treatments).fillna(0)
    
    category_data.T.plot(kind='bar', stacked=True, ax=ax)
    ax.set_title('Issue Categories by Treatment')
    ax.set_xlabel('Issue Category')
    ax.set_ylabel('Number of Issues')
    ax.legend(title='Treatment', bbox_to_anchor=(1.05, 1), loc='upper left')
    
    # 5. Radar chart for multi-dimensional comparison
    ax = axes[4]
    ax.remove()
    ax = fig.add_subplot(2, 3, 5, projection='polar')
    
    # Metrics for radar chart
    radar_metrics = ['Coverage', 'Clarity', 'Actionability', 'Efficiency', 'Completeness']
    
    for treatment, metrics in metrics_dict.items():
        values = [
            min(1.0, metrics.covered_lines / 100),  # Normalized coverage
            metrics.comment_clarity_score,
            metrics.actionability_score,
            1 - min(1.0, metrics.time_per_issue / 3600),  # Inverse time (efficiency)
            min(1.0, metrics.num_issues_reported / 15)  # Normalized completeness
        ]
        
        # Add first value to close the polygon
        values += values[:1]
        
        # Angles for each metric
        angles = np.linspace(0, 2 * np.pi, len(radar_metrics), endpoint=False).tolist()
        angles += angles[:1]
        
        ax.plot(angles, values, 'o-', linewidth=2, label=treatment)
        ax.fill(angles, values, alpha=0.15)
    
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(radar_metrics)
    ax.set_ylim(0, 1)
    ax.set_title('Multi-dimensional Quality Comparison', y=1.08)
    ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
    ax.grid(True)
    
    # 6. Summary statistics table
    ax = axes[5]
    ax.axis('tight')
    ax.axis('off')
    
    summary_data = []
    for treatment, metrics in metrics_dict.items():
        summary_data.append([
            treatment,
            metrics.num_issues_reported,
            f"{metrics.time_total_seconds/60:.0f}",
            f"{metrics.comment_clarity_score:.2f}",
            f"{metrics.actionability_score:.2f}",
            f"{sum(metrics.issues_by_severity.values())}"
        ])
    
    table = ax.table(cellText=summary_data,
                     colLabels=['Treatment', 'Issues', 'Time (min)', 'Clarity', 'Actionability', 'Total'],
                     cellLoc='center',
                     loc='center')
    table.auto_set_font_size(False)
    table.set_fontsize(10)
    table.scale(1.2, 1.5)
    ax.set_title('Summary Statistics', pad=20)
    
    plt.tight_layout()
    plt.show()

# Visualize all metrics
all_metrics = {
    'MCR': metrics_mcr,
    'ACR': metrics_acr,
    'CCR': metrics_ccr
}

visualize_quality_metrics(all_metrics)

## 8. Creating a Composite Quality Score

In [None]:
class CompositeQualityScorer:
    """Calculate composite quality score for code reviews"""
    
    def __init__(self):
        # Illustrative weights reflecting assumed importance (they sum to 1.0)
        self.weights = {
            'severity_score': 0.25,      # Finding high-severity issues
            'completeness': 0.20,        # Coverage of codebase
            'clarity': 0.15,             # Clear communication
            'actionability': 0.15,       # Actionable feedback
            'efficiency': 0.15,          # Time efficiency
            'uniformity': 0.10          # Balanced coverage
        }
    
    def calculate_severity_score(self, metrics: CodeReviewMetrics) -> float:
        """Score based on severity-weighted issues found"""
        severity_weights = {'low': 1, 'medium': 2, 'high': 3}
        
        weighted_sum = sum(
            count * severity_weights[sev] 
            for sev, count in metrics.issues_by_severity.items()
        )
        
        # Normalize by maximum possible (all high severity)
        max_possible = metrics.num_issues_reported * 3
        
        return weighted_sum / max_possible if max_possible > 0 else 0
    
    def calculate_completeness_score(self, metrics: CodeReviewMetrics) -> float:
        """Score based on code coverage"""
        # Normalize by expected coverage (e.g., 100 lines)
        return min(1.0, metrics.covered_lines / 100)
    
    def calculate_efficiency_score(self, metrics: CodeReviewMetrics) -> float:
        """Score based on time efficiency"""
        # Inverse of time per issue, normalized
        # Assume 5 minutes per issue is excellent
        target_time = 5 * 60  # 5 minutes in seconds
        
        if metrics.time_per_issue == 0:
            return 0
        
        return min(1.0, target_time / metrics.time_per_issue)
    
    def calculate_composite_score(self, metrics: CodeReviewMetrics) -> Dict[str, float]:
        """Calculate overall composite quality score"""
        
        scores = {
            'severity_score': self.calculate_severity_score(metrics),
            'completeness': self.calculate_completeness_score(metrics),
            'clarity': metrics.comment_clarity_score,
            'actionability': metrics.actionability_score,
            'efficiency': self.calculate_efficiency_score(metrics),
            'uniformity': metrics.coverage_uniformity
        }
        
        # Calculate weighted composite
        composite = sum(score * self.weights[name] for name, score in scores.items())
        
        return {
            'component_scores': scores,
            'composite_score': composite
        }

# Calculate composite scores
scorer = CompositeQualityScorer()

composite_results = {}
for treatment, metrics in all_metrics.items():
    composite_results[treatment] = scorer.calculate_composite_score(metrics)

# Visualize composite scores
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Overall composite scores
treatments = list(composite_results.keys())
composite_scores = [r['composite_score'] for r in composite_results.values()]

bars = ax1.bar(treatments, composite_scores, color=['#3498db', '#e74c3c', '#2ecc71'])
ax1.set_title('Composite Quality Scores by Treatment', fontsize=14)
ax1.set_ylabel('Composite Score (0-1)')
ax1.set_ylim(0, 1)

# Add value labels
for bar, score in zip(bars, composite_scores):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{score:.3f}', ha='center', va='bottom')

# Component breakdown
component_data = pd.DataFrame([
    result['component_scores'] for result in composite_results.values()
], index=treatments)

component_data.plot(kind='bar', ax=ax2, stacked=False)
ax2.set_title('Quality Score Components by Treatment', fontsize=14)
ax2.set_ylabel('Component Score (0-1)')
ax2.legend(title='Component', bbox_to_anchor=(1.05, 1), loc='upper left')
ax2.set_ylim(0, 1.1)

plt.tight_layout()
plt.show()

# Print detailed results
print("\nDetailed Composite Quality Analysis:")
print("=" * 60)
for treatment, result in composite_results.items():
    print(f"\n{treatment}:")
    print(f"  Composite Score: {result['composite_score']:.3f}")
    print("  Component Scores:")
    for component, score in result['component_scores'].items():
        weight = scorer.weights[component]
        contribution = score * weight
        print(f"    {component:15} {score:.3f} (weight: {weight:.2f}, contribution: {contribution:.3f})")
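
The composite is a weighted sum, so it stays in the 0-1 range only if every component is in that range and the weights sum to 1. A standalone sketch of the calculation with both checks made explicit; the weights are copied from `CompositeQualityScorer`, while the `composite` helper and the component scores in `example` are made up for illustration.

```python
from typing import Dict

WEIGHTS: Dict[str, float] = {
    "severity_score": 0.25, "completeness": 0.20, "clarity": 0.15,
    "actionability": 0.15, "efficiency": 0.15, "uniformity": 0.10,
}

def composite(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of component scores (hypothetical helper)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same components")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[name] * weights[name] for name in weights)

# Made-up component scores, for illustration only
example = {"severity_score": 0.6, "completeness": 0.3, "clarity": 0.5,
           "actionability": 0.4, "efficiency": 1.0, "uniformity": 0.9}
total = composite(example)
# 0.15 + 0.06 + 0.075 + 0.06 + 0.15 + 0.09 = 0.585
print(round(total, 3))
```

Printing the per-component contributions, as the cell above does, matters because two reviews with the same composite can have very different profiles.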

## 9. Building a Quality Prediction Model

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

def build_quality_predictor():
    """Build a model to predict review quality based on features"""
    
    # Generate synthetic training data
    n_samples = 1000
    data = []
    
    for _ in range(n_samples):
        # Simulate review characteristics
        treatment = np.random.choice(['MCR', 'ACR', 'CCR'])
        
        # Treatment-specific distributions
        if treatment == 'MCR':
            num_issues = np.random.poisson(8)
            clarity = np.random.beta(8, 2)  # Higher clarity
            time_minutes = np.random.normal(42, 10)
        elif treatment == 'ACR':
            num_issues = np.random.poisson(12)
            clarity = np.random.beta(6, 4)  # Medium clarity
            time_minutes = np.random.normal(56, 15)
        else:  # CCR
            num_issues = np.random.poisson(10)
            clarity = np.random.beta(7, 3)  # Good clarity
            time_minutes = np.random.normal(57, 12)
        
        # Other features
        actionability = np.random.beta(5, 5)
        coverage = np.random.beta(4, 2)
        high_severity_ratio = np.random.beta(2, 8) if treatment == 'ACR' else np.random.beta(3, 7)
        
        # Calculate quality score (simplified)
        quality = (
            0.3 * high_severity_ratio +
            0.2 * clarity +
            0.2 * actionability +
            0.2 * coverage +
            0.1 * (1 - min(1, time_minutes / 100))
        )
        
        data.append({
            'treatment_MCR': 1 if treatment == 'MCR' else 0,
            'treatment_ACR': 1 if treatment == 'ACR' else 0,
            'treatment_CCR': 1 if treatment == 'CCR' else 0,
            'num_issues': num_issues,
            'clarity': clarity,
            'actionability': actionability,
            'coverage': coverage,
            'time_minutes': time_minutes,
            'high_severity_ratio': high_severity_ratio,
            'quality': quality
        })
    
    # Create DataFrame
    df = pd.DataFrame(data)
    
    # Prepare features and target
    feature_cols = [col for col in df.columns if col != 'quality']
    X = df[feature_cols]
    y = df['quality']
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train model
    model = RandomForestRegressor(n_estimators=100, random_state=42, max_depth=10)
    model.fit(X_train, y_train)
    
    # Evaluate
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Feature importance
    feature_importance = pd.DataFrame({
        'feature': feature_cols,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    return model, feature_importance, {'mse': mse, 'r2': r2}

# Build and evaluate model
model, feature_importance, metrics = build_quality_predictor()

# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Feature importance
feature_importance.plot(x='feature', y='importance', kind='bar', ax=ax1)
ax1.set_title('Feature Importance for Quality Prediction')
ax1.set_xlabel('Feature')
ax1.set_ylabel('Importance')

# Model performance
ax2.text(0.1, 0.7, f"Model Performance:", fontsize=16, weight='bold')
ax2.text(0.1, 0.5, f"R² Score: {metrics['r2']:.3f}", fontsize=14)
ax2.text(0.1, 0.3, f"MSE: {metrics['mse']:.4f}", fontsize=14)
ax2.axis('off')

plt.tight_layout()
plt.show()

print("\nQuality Prediction Model Built!")
print(f"Model R² Score: {metrics['r2']:.3f}")
print("\nTop 3 Most Important Features:")
for _, row in feature_importance.head(3).iterrows():
    print(f"  {row['feature']}: {row['importance']:.3f}")
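
When reading the R² above, keep in mind that the `quality` target is a fixed weighted sum of five of the input features, so the score measures how well the forest approximates a known formula and not predictive power on real reviews. For reference, both reported metrics can be computed by hand; the helper names and the sample values below are hypothetical.

```python
from typing import Sequence

def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [0.40, 0.50, 0.60, 0.70]
mean_prediction = [sum(y_true) / len(y_true)] * len(y_true)

perfect = r2(y_true, y_true)            # predictions equal the targets
baseline = r2(y_true, mean_prediction)  # always predicting the mean
print(perfect, baseline, mse(y_true, mean_prediction))
```

An R² of 1 means the predictions reproduce the targets exactly, and 0 means the model does no better than predicting the mean.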

## 10. Key Insights and Best Practices

In [None]:
insights = {
    "Metric Design Principles": [
        "Multi-dimensional assessment captures review quality better than single metrics",
        "Severity weighting is crucial - not all issues are equal",
        "Time efficiency should be balanced with thoroughness",
        "Cohen's kappa of 0.315 (fair agreement) shows that quality assessment is partly subjective"
    ],
    
    "Key Findings from Analysis": [
        "ACR generates more comments but lower severity-weighted quality",
        "MCR shows better balance between efficiency and effectiveness",
        "CCR has high coverage but diminishing returns on time investment",
        "Clarity and actionability vary significantly between treatments"
    ],
    
    "DeepEval Integration Benefits": [
        "Automated quality assessment enables continuous improvement",
        "Custom metrics can capture domain-specific quality aspects",
        "Relevance and completeness metrics align with paper findings",
        "Can be integrated into CI/CD pipelines for quality gates"
    ],
    
    "Practical Implementation Guidelines": [
        "Track both quantitative metrics (coverage, count) and qualitative (clarity, actionability)",
        "Use composite scores but understand component contributions",
        "Consider reviewer experience and confidence in quality assessment",
        "Regular calibration sessions can improve inter-rater agreement",
        "Automate metric collection to reduce manual overhead"
    ],
    
    "Future Research Directions": [
        "Develop ML models to predict review quality from code characteristics",
        "Study correlation between review metrics and post-release defects",
        "Create adaptive metrics that learn from reviewer feedback",
        "Investigate cultural and team factors affecting quality perception"
    ]
}

print("\n" + "="*80)
print("KEY INSIGHTS: Measuring Code Review Quality")
print("="*80)

for category, items in insights.items():
    print(f"\n{category}:")
    for item in items:
        print(f"  • {item}")

print("\n" + "="*80)
print("\nConclusion:")
print("Quality measurement in code review is multi-faceted. While automated")
print("tools can increase quantity of feedback, true quality requires balancing")
print("severity detection, clarity, actionability, and efficiency. The framework")
print("presented here provides a foundation for comprehensive quality assessment.")