# Focused Learning: Understanding LLM Bias in Code Review

## Learning Objectives
1. Understand the concept of **anchoring bias** in automated code review
2. Analyze how LLM-generated reviews influence reviewer behavior
3. Implement experiments to measure and visualize bias effects
4. Develop strategies to mitigate bias in AI-assisted code review

## Paper Context
**Section Reference**: Section III-A (RQ0) and Section III-E (Actionable Recommendations)

**Key Finding from Paper**:
> "The availability of an automated review as a starting point strongly influences the reviewer's behavior. Reviewers mostly focused on the code locations pointed out in the automatically generated review they were provided with."

**Figure Reference**: Figure 3 - Shows how MCR reviews covered 484 distinct lines with 263 unique to MCR, while ACR/CCR reviews showed much less variation.

## 1. Setup and Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Set, Tuple
import random
from dataclasses import dataclass
from matplotlib.patches import Rectangle

# Set random seed for reproducibility
np.random.seed(42)
random.seed(42)

# Configure visualization
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

## 2. Theoretical Foundation: Anchoring Bias

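Anchoring bias is a cognitive bias in which people rely too heavily on the first piece of information they receive (the "anchor") when making later judgments. In code review, an automatically generated review shown up front can act as that anchor: as the finding quoted above notes, reviewers mostly focus on the code locations it points out.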
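
As a rough illustration, anchoring can be pictured as a boost in the attention a reviewer gives to locations that an earlier suggestion pointed at. The sketch below is a toy model and is not taken from the paper: `attention_weights`, `anchor_strength`, and the uniform baseline are all assumptions made for illustration.

```python
from typing import List, Set


def attention_weights(n_locations: int, anchored: Set[int],
                      anchor_strength: float = 4.0) -> List[float]:
    """Return a probability distribution over code locations.

    Hypothetical toy model: every location starts with weight 1.0 and
    anchored locations are multiplied by `anchor_strength` (an assumed
    parameter, not a value from the paper).
    """
    if n_locations <= 0:
        raise ValueError("n_locations must be positive")
    if anchor_strength <= 0:
        raise ValueError("anchor_strength must be positive")
    raw = [anchor_strength if i in anchored else 1.0 for i in range(n_locations)]
    total = sum(raw)
    return [w / total for w in raw]


# 10 locations, 2 of which were flagged by an earlier (automated) review
weights = attention_weights(10, anchored={2, 7})
anchored_share = weights[2] + weights[7]
print(f"Share of attention on anchored locations: {anchored_share:.2f}")
```

With `anchor_strength=1.0` the distribution is uniform, so the same two locations would receive only 20% of the attention.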

In [None]:
@dataclass
class CodeLocation:
    """Represents a location in code that can be reviewed"""
    file_name: str
    line_start: int
    line_end: int
    has_issue: bool
    issue_severity: str = "low"  # low, medium, high
    
    def __hash__(self):
        return hash((self.file_name, self.line_start, self.line_end))

from typing import Optional  # needed for the optional anchored_locations field

@dataclass
class ReviewerBehavior:
    """Models reviewer behavior with/without anchoring"""
    treatment: str  # MCR, ACR, CCR
    locations_reviewed: List[CodeLocation]
    time_per_location: List[float]
    issues_found: List[CodeLocation]
    anchored_locations: Optional[Set[CodeLocation]] = None  # None for unanchored (MCR) reviews

def simulate_code_base(n_files: int = 5, lines_per_file: int = 100) -> Dict[str, List[CodeLocation]]:
    """Create a simulated codebase with issues"""
    codebase = {}
    
    for i in range(n_files):
        file_name = f"module_{i}.py"
        locations = []
        
        # Generate code locations with some having issues
        for line in range(0, lines_per_file, 10):
            has_issue = np.random.random() < 0.2  # 20% chance of issue
            severity = np.random.choice(['low', 'medium', 'high'], p=[0.6, 0.3, 0.1]) if has_issue else 'low'
            
            location = CodeLocation(
                file_name=file_name,
                line_start=line,
                line_end=line + 5,
                has_issue=has_issue,
                issue_severity=severity
            )
            locations.append(location)
        
        codebase[file_name] = locations
    
    return codebase

# Create simulated codebase
codebase = simulate_code_base()
all_locations = [loc for locs in codebase.values() for loc in locs]
issue_locations = [loc for loc in all_locations if loc.has_issue]

print(f"Created codebase with {len(all_locations)} locations")
print(f"Total issues: {len(issue_locations)}")
print("Issue distribution:")
for severity in ['low', 'medium', 'high']:
    count = len([loc for loc in issue_locations if loc.issue_severity == severity])
    print(f"  {severity}: {count}")
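
Because issues are drawn at random, the counts printed by the cell above depend on the seed. A quick sanity check is to compute the expected counts directly from the generator's parameters. The helper below is only a sketch: `expected_issue_counts` is a hypothetical name, and its defaults mirror the values used in `simulate_code_base`.

```python
from typing import Dict, Optional


def expected_issue_counts(n_files: int = 5, lines_per_file: int = 100,
                          step: int = 10, p_issue: float = 0.2,
                          severity_probs: Optional[Dict[str, float]] = None
                          ) -> Dict[str, float]:
    """Expected number of locations and issues for the simulated codebase."""
    if severity_probs is None:
        # Same severity mix as the simulation (assumed here, not measured)
        severity_probs = {"low": 0.6, "medium": 0.3, "high": 0.1}
    # range(0, lines_per_file, step) defines one location per step
    locations_per_file = len(range(0, lines_per_file, step))
    n_locations = n_files * locations_per_file
    expected_issues = n_locations * p_issue
    result = {"locations": float(n_locations), "issues": expected_issues}
    for severity, prob in severity_probs.items():
        result[severity] = expected_issues * prob
    return result


expected = expected_issue_counts()
for key, value in expected.items():
    print(f"{key}: {value:.1f}")
```

Under these parameters only about one high-severity issue is expected, so any high-severity detection rate computed later rests on very few samples.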

## 3. Modeling Reviewer Behavior Under Different Treatments

In [None]:
class BiasedReviewerSimulator:
    """Simulates reviewer behavior with anchoring bias effects"""
    
    def __init__(self, codebase: Dict[str, List[CodeLocation]]):
        self.codebase = codebase
        self.all_locations = [loc for locs in codebase.values() for loc in locs]
    
    def generate_llm_review(self, coverage: float = 0.42) -> Set[CodeLocation]:
        """Simulate LLM review that covers ~42% of issues (from paper)"""
        issue_locations = [loc for loc in self.all_locations if loc.has_issue]
        
        # LLM tends to find more low-severity issues
        weights = []
        for loc in issue_locations:
            if loc.issue_severity == 'low':
                weights.append(3.0)  # Higher weight for low severity
            elif loc.issue_severity == 'medium':
                weights.append(1.5)
            else:  # high
                weights.append(0.5)  # Lower weight for high severity
        
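        n_to_select = int(len(issue_locations) * coverage)
        # Normalize weights into probabilities (a plain list cannot be divided by a scalar)
        probs = np.array(weights, dtype=float) / np.sum(weights)
        # Sample indices rather than objects so NumPy never has to coerce CodeLocation instances
        selected_idx = np.random.choice(len(issue_locations), size=n_to_select, replace=False, p=probs)
        selected = [issue_locations[i] for i in selected_idx]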
        
        # Also add some false positives (non-issues)
        non_issues = [loc for loc in self.all_locations if not loc.has_issue]
        false_positives = np.random.choice(non_issues, size=max(1, n_to_select//4), replace=False)
        
        return set(selected) | set(false_positives)
    
    def simulate_manual_review(self, time_budget: float = 42.0) -> ReviewerBehavior:
        """Simulate manual code review (MCR)"""
        locations_reviewed = []
        time_per_location = []
        issues_found = []
        time_spent = 0
        
        # Visit locations in random order (no anchor guides the reviewer's attention)
        shuffled_locations = self.all_locations.copy()
        np.random.shuffle(shuffled_locations)
        
        for loc in shuffled_locations:
            # Time varies based on complexity
            base_time = np.random.normal(1.0, 0.3)
            
            if time_spent + base_time > time_budget:
                break
            
            locations_reviewed.append(loc)
            time_per_location.append(base_time)
            time_spent += base_time
            
            # Probability of finding issue depends on severity
            if loc.has_issue:
                prob_find = {'low': 0.3, 'medium': 0.6, 'high': 0.9}[loc.issue_severity]
                if np.random.random() < prob_find:
                    issues_found.append(loc)
        
        return ReviewerBehavior(
            treatment="MCR",
            locations_reviewed=locations_reviewed,
            time_per_location=time_per_location,
            issues_found=issues_found
        )
    
    def simulate_anchored_review(self, anchor_locations: Set[CodeLocation], 
                               treatment: str, time_budget: float = 56.0) -> ReviewerBehavior:
        """Simulate review with anchoring bias (ACR/CCR)"""
        locations_reviewed = []
        time_per_location = []
        issues_found = []
        time_spent = 0
        
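        # First, review anchored locations (89% kept from paper).
        # Iterate in sorted order: set iteration order is not reproducible across runs.
        for loc in sorted(anchor_locations, key=lambda l: (l.file_name, l.line_start)):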
            if time_spent > time_budget:
                break
            
            # More time spent verifying anchored locations
            verify_time = np.random.normal(1.5, 0.4)
            locations_reviewed.append(loc)
            time_per_location.append(verify_time)
            time_spent += verify_time
            
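            # 89% chance of keeping an anchored suggestion; false positives from the LLM review
            # are kept too, so issues_found may contain locations where has_issue is False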
            if np.random.random() < 0.89:
                issues_found.append(loc)
        
        # Limited exploration of non-anchored locations
        remaining_locations = [loc for loc in self.all_locations if loc not in anchor_locations]
        np.random.shuffle(remaining_locations)
        
        # Reduced exploration due to anchoring
        exploration_factor = 0.3  # Only 30% as much exploration
        
        for loc in remaining_locations[:int(len(remaining_locations) * exploration_factor)]:
            if time_spent > time_budget:
                break
            
            # Clip at a small positive value: a normal draw with this spread can be negative
            quick_check_time = max(0.05, np.random.normal(0.5, 0.2))
            locations_reviewed.append(loc)
            time_per_location.append(quick_check_time)
            time_spent += quick_check_time
            
            # Lower probability of finding non-anchored issues
            if loc.has_issue:
                prob_find = {'low': 0.1, 'medium': 0.2, 'high': 0.4}[loc.issue_severity]
                if np.random.random() < prob_find:
                    issues_found.append(loc)
        
        return ReviewerBehavior(
            treatment=treatment,
            locations_reviewed=locations_reviewed,
            time_per_location=time_per_location,
            issues_found=issues_found,
            anchored_locations=anchor_locations
        )

# Run simulations
simulator = BiasedReviewerSimulator(codebase)

# Generate LLM review for ACR
llm_review = simulator.generate_llm_review()

# Generate comprehensive review for CCR (all issues)
all_issues = {loc for loc in simulator.all_locations if loc.has_issue}

# Simulate different treatments
mcr_behavior = simulator.simulate_manual_review()
acr_behavior = simulator.simulate_anchored_review(llm_review, "ACR")
ccr_behavior = simulator.simulate_anchored_review(all_issues, "CCR")

print("\nSimulation Results:")
for behavior, name in [(mcr_behavior, "MCR"), (acr_behavior, "ACR"), (ccr_behavior, "CCR")]:
    print(f"\n{name}:")
    print(f"  Locations reviewed: {len(behavior.locations_reviewed)}")
    print(f"  Issues found: {len(behavior.issues_found)}")
    print(f"  Time spent: {sum(behavior.time_per_location):.1f} minutes")
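
The severity-weighted selection inside `generate_llm_review` can be sketched in isolation. The version below is a minimal sketch that uses NumPy's `Generator` API and samples indices rather than objects; `sample_flagged` and `SEVERITY_WEIGHTS` are hypothetical names, and the weights simply mirror the ones assumed in the simulator.

```python
from typing import List, Sequence

import numpy as np

# Assumed weights: low-severity issues are the most likely to be flagged
SEVERITY_WEIGHTS = {"low": 3.0, "medium": 1.5, "high": 0.5}


def sample_flagged(severities: Sequence[str], coverage: float = 0.42,
                   seed: int = 0) -> List[int]:
    """Return sorted indices of the issues a simulated LLM review flags."""
    if not severities:
        return []
    rng = np.random.default_rng(seed)
    weights = np.array([SEVERITY_WEIGHTS[s] for s in severities], dtype=float)
    probs = weights / weights.sum()  # probabilities must sum to 1
    n_select = int(len(severities) * coverage)
    # Weighted sampling without replacement over index positions
    chosen = rng.choice(len(severities), size=n_select, replace=False, p=probs)
    return sorted(chosen.tolist())


severities = ["low"] * 6 + ["medium"] * 3 + ["high"]
flagged = sample_flagged(severities)
print(f"Flagged {len(flagged)} of {len(severities)} issues: {flagged}")
```

Passing an explicit `seed` keeps the draw reproducible without relying on NumPy's global random state.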

## 4. Visualizing Anchoring Bias Effects

In [None]:
def visualize_review_coverage(behaviors: List[Tuple[ReviewerBehavior, str]]):
    """Visualize which code locations were reviewed under different treatments"""
    
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    
    for idx, (behavior, name) in enumerate(behaviors):
        ax = axes[idx]
        
        # Create a grid representing code locations
        n_files = 5
        n_locations_per_file = 10
        
        # Create grid data
        grid = np.zeros((n_files, n_locations_per_file))
        anchored_grid = np.zeros((n_files, n_locations_per_file))
        issue_grid = np.zeros((n_files, n_locations_per_file))
        
        # Map locations to grid
        for loc in simulator.all_locations:
            file_idx = int(loc.file_name.split('_')[1])
            loc_idx = loc.line_start // 10
            
            if loc.has_issue:
                issue_grid[file_idx, loc_idx] = {'low': 1, 'medium': 2, 'high': 3}[loc.issue_severity]
            
            if loc in behavior.locations_reviewed:
                grid[file_idx, loc_idx] = 1
            
            if behavior.anchored_locations and loc in behavior.anchored_locations:
                anchored_grid[file_idx, loc_idx] = 1
        
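        # Create custom colormap (one color per severity code 0-3)
        colors = ['white', 'lightblue', 'orange', 'red']
        cmap = plt.matplotlib.colors.ListedColormap(colors)
        
        # Plot issue severity; fix vmin/vmax so colors stay aligned with the legend
        # even when a severity level is absent from the grid
        im = ax.imshow(issue_grid, cmap=cmap, vmin=0, vmax=3, alpha=0.6, aspect='auto')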
        
        # Overlay review coverage
        for i in range(n_files):
            for j in range(n_locations_per_file):
                if grid[i, j] == 1:
                    rect = Rectangle((j-0.4, i-0.4), 0.8, 0.8, 
                                   fill=False, edgecolor='green', linewidth=2)
                    ax.add_patch(rect)
                
                if anchored_grid[i, j] == 1:
                    circle = plt.Circle((j, i), 0.3, fill=False, 
                                      edgecolor='purple', linewidth=2, linestyle='--')
                    ax.add_patch(circle)
        
        ax.set_title(f"{name} Treatment\nCoverage Pattern", fontsize=14)
        ax.set_xlabel("Code Location")
        ax.set_ylabel("File")
        ax.set_xticks(range(n_locations_per_file))
        ax.set_yticks(range(n_files))
        ax.set_yticklabels([f"module_{i}.py" for i in range(n_files)])
        
        # Add legend for first plot
        if idx == 0:
            from matplotlib.patches import Patch
            legend_elements = [
                Patch(facecolor='white', label='No issue'),
                Patch(facecolor='lightblue', label='Low severity'),
                Patch(facecolor='orange', label='Medium severity'),
                Patch(facecolor='red', label='High severity'),
                Patch(facecolor='none', edgecolor='green', linewidth=2, label='Reviewed'),
                Patch(facecolor='none', edgecolor='purple', linewidth=2, linestyle='--', label='Anchored')
            ]
            ax.legend(handles=legend_elements, loc='upper left', bbox_to_anchor=(-0.3, 1.0))
    
    plt.tight_layout()
    plt.show()

# Visualize coverage patterns
visualize_review_coverage([
    (mcr_behavior, "MCR"),
    (acr_behavior, "ACR"),
    (ccr_behavior, "CCR")
])

## 5. Quantifying Bias: Coverage Overlap Analysis

In [None]:
def analyze_coverage_overlap(behaviors: Dict[str, ReviewerBehavior]):
    """Analyze overlap in reviewed locations between treatments"""
    
    # Get sets of reviewed locations
    reviewed_sets = {
        name: set(behavior.locations_reviewed)
        for name, behavior in behaviors.items()
    }
    
    # Calculate overlaps
    results = []
    
    for t1 in behaviors.keys():
        for t2 in behaviors.keys():
            if t1 != t2:
                overlap = len(reviewed_sets[t1] & reviewed_sets[t2])
                total = len(reviewed_sets[t1] | reviewed_sets[t2])
                jaccard = overlap / total if total > 0 else 0
                
                results.append({
                    'Treatment 1': t1,
                    'Treatment 2': t2,
                    'Overlap': overlap,
                    'Jaccard Index': jaccard,
                    'T1 Unique': len(reviewed_sets[t1] - reviewed_sets[t2]),
                    'T2 Unique': len(reviewed_sets[t2] - reviewed_sets[t1])
                })
    
    return pd.DataFrame(results)

# Analyze overlap
behaviors_dict = {
    'MCR': mcr_behavior,
    'ACR': acr_behavior,
    'CCR': ccr_behavior
}

overlap_df = analyze_coverage_overlap(behaviors_dict)
print("\nCoverage Overlap Analysis:")
print(overlap_df.to_string(index=False))

# Visualize as heatmap
plt.figure(figsize=(8, 6))
pivot = overlap_df.pivot(index='Treatment 1', columns='Treatment 2', values='Jaccard Index')
sns.heatmap(pivot, annot=True, fmt='.3f', cmap='YlOrRd', vmin=0, vmax=1)
plt.title('Jaccard Similarity Index Between Treatments\n(Higher = More Similar Coverage)')
plt.tight_layout()
plt.show()
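
The Jaccard index used above is the size of the intersection divided by the size of the union of two sets. A small worked example makes the arithmetic concrete; the location tuples are made up for illustration and `jaccard_index` is a hypothetical helper name.

```python
from typing import Hashable, Set


def jaccard_index(a: Set[Hashable], b: Set[Hashable]) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical reviewed locations, as (file, start_line) pairs
mcr = {("module_0.py", 0), ("module_0.py", 10), ("module_1.py", 20)}
acr = {("module_0.py", 10), ("module_1.py", 20), ("module_2.py", 30)}

print(jaccard_index(mcr, acr))  # 2 shared / 4 total = 0.5
```

The index is symmetric, so the heatmap is mirrored across its diagonal; the diagonal cells are empty because pairs with `t1 == t2` are skipped.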

## 6. Issue Detection Analysis by Severity

In [None]:
def analyze_issue_detection_by_severity(behaviors: Dict[str, ReviewerBehavior]):
    """Analyze which types of issues are found under different treatments"""
    
    # Get all issues in codebase by severity
    all_issues = [loc for loc in simulator.all_locations if loc.has_issue]
    issues_by_severity = {
        'low': [loc for loc in all_issues if loc.issue_severity == 'low'],
        'medium': [loc for loc in all_issues if loc.issue_severity == 'medium'],
        'high': [loc for loc in all_issues if loc.issue_severity == 'high']
    }
    
    # Analyze detection rates
    results = []
    
    for treatment, behavior in behaviors.items():
        found_issues = set(behavior.issues_found)
        
        for severity, severity_issues in issues_by_severity.items():
            found_count = len([loc for loc in severity_issues if loc in found_issues])
            total_count = len(severity_issues)
            detection_rate = found_count / total_count if total_count > 0 else 0
            
            results.append({
                'Treatment': treatment,
                'Severity': severity,
                'Found': found_count,
                'Total': total_count,
                'Detection Rate': detection_rate
            })
    
    df = pd.DataFrame(results)
    
    # Visualize
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
    
    # Bar plot of detection rates
    pivot = df.pivot(index='Severity', columns='Treatment', values='Detection Rate')
    pivot.plot(kind='bar', ax=ax1)
    ax1.set_title('Issue Detection Rate by Severity and Treatment')
    ax1.set_ylabel('Detection Rate')
    ax1.set_xlabel('Issue Severity')
    ax1.legend(title='Treatment')
    ax1.set_ylim(0, 1.1)
    
    # Stacked bar plot of absolute numbers
    pivot2 = df.pivot(index='Treatment', columns='Severity', values='Found')
    pivot2.plot(kind='bar', stacked=True, ax=ax2)
    ax2.set_title('Total Issues Found by Treatment')
    ax2.set_ylabel('Number of Issues Found')
    ax2.set_xlabel('Treatment')
    ax2.legend(title='Severity')
    
    plt.tight_layout()
    plt.show()
    
    return df

# Analyze issue detection
detection_df = analyze_issue_detection_by_severity(behaviors_dict)
print("\nIssue Detection Analysis:")
print(detection_df.to_string(index=False))
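
With only a handful of issues per severity level, a raw detection rate can swing widely between runs. One common way to express that uncertainty is a Wilson score interval. This is not part of the original analysis; the helper name `wilson_interval` and the 95% level (z = 1.96) are assumptions chosen for illustration.

```python
import math
from typing import Tuple


def wilson_interval(found: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion found/total.

    Returns (0.0, 1.0) when total is 0, since nothing is known.
    """
    if total == 0:
        return (0.0, 1.0)
    p = found / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Hypothetical counts: (issues found, issues present) per severity
for severity, (found, total) in {"low": (5, 10), "high": (1, 1)}.items():
    low, high = wilson_interval(found, total)
    print(f"{severity}: {found}/{total} -> [{low:.2f}, {high:.2f}]")
```

A rate of 1/1 looks like perfect detection, but its interval stays very wide, which is a reminder not to over-read per-severity differences from a single run.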

## 7. Time Allocation Analysis

In [None]:
def analyze_time_allocation(behaviors: Dict[str, ReviewerBehavior]):
    """Analyze how time is spent across different code locations"""
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    axes = axes.flatten()
    
    for idx, (treatment, behavior) in enumerate(behaviors.items()):
        if idx >= 3:
            ax = axes[3]
        else:
            ax = axes[idx]
        
        # Calculate time spent on anchored vs non-anchored locations
        if behavior.anchored_locations:
            anchored_time = sum([t for loc, t in zip(behavior.locations_reviewed, behavior.time_per_location)
                               if loc in behavior.anchored_locations])
            non_anchored_time = sum([t for loc, t in zip(behavior.locations_reviewed, behavior.time_per_location)
                                   if loc not in behavior.anchored_locations])
            
            # Pie chart
            sizes = [anchored_time, non_anchored_time]
            labels = ['Anchored Locations', 'Exploration']
            colors = ['purple', 'lightgreen']
            explode = (0.1, 0)
            
            ax.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%',
                   shadow=True, startangle=90)
            ax.set_title(f'{treatment} Time Allocation')
        else:
            # For MCR, show time distribution by file
            file_times = {}
            for loc, t in zip(behavior.locations_reviewed, behavior.time_per_location):
                if loc.file_name not in file_times:
                    file_times[loc.file_name] = 0
                file_times[loc.file_name] += t
            
            files = list(file_times.keys())
            times = list(file_times.values())
            
            ax.bar(range(len(files)), times)
            ax.set_xticks(range(len(files)))
            ax.set_xticklabels([f.replace('module_', 'M') for f in files], rotation=45)
            ax.set_ylabel('Time (minutes)')
            ax.set_title(f'{treatment} Time per File')
    
    # Summary statistics in the last subplot
    ax = axes[3]
    summary_data = []
    for treatment, behavior in behaviors.items():
        total_time = sum(behavior.time_per_location)
        avg_time_per_loc = np.mean(behavior.time_per_location)
        std_time_per_loc = np.std(behavior.time_per_location)
        
        summary_data.append({
            'Treatment': treatment,
            'Total Time': f"{total_time:.1f}",
            'Avg Time/Location': f"{avg_time_per_loc:.2f} ± {std_time_per_loc:.2f}",
            'Locations Reviewed': len(behavior.locations_reviewed)
        })
    
    summary_df = pd.DataFrame(summary_data)
    ax.axis('tight')
    ax.axis('off')
    table = ax.table(cellText=summary_df.values, colLabels=summary_df.columns,
                     cellLoc='center', loc='center')
    table.auto_set_font_size(False)
    table.set_fontsize(10)
    table.scale(1.2, 1.5)
    ax.set_title('Time Allocation Summary', pad=20)
    
    plt.tight_layout()
    plt.show()

analyze_time_allocation(behaviors_dict)
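
The pie charts above boil down to one number per treatment: the fraction of review time spent on anchored locations. A minimal sketch of that calculation follows, using plain strings as stand-ins for `CodeLocation` objects; `anchored_time_share` is a hypothetical helper name.

```python
from typing import Hashable, Sequence, Set


def anchored_time_share(reviewed: Sequence[Hashable], times: Sequence[float],
                        anchored: Set[Hashable]) -> float:
    """Fraction of total review time spent on anchored locations."""
    if len(reviewed) != len(times):
        raise ValueError("reviewed and times must have the same length")
    total = sum(times)
    if total <= 0:
        return 0.0
    anchored_time = sum(t for loc, t in zip(reviewed, times) if loc in anchored)
    return anchored_time / total


# Made-up example: two anchored locations take 1.5 minutes each,
# two explored locations take 0.5 minutes each
reviewed = ["a", "b", "c", "d"]
times = [1.5, 1.5, 0.5, 0.5]
share = anchored_time_share(reviewed, times, anchored={"a", "b"})
print(f"{share:.0%} of review time went to anchored locations")
```

Reporting this share alongside the charts makes treatments easier to compare than reading percentages off separate pies.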

## 8. Bias Mitigation Strategies

In [None]:
class BiasMitigationStrategy:
    """Different strategies to reduce anchoring bias in code review"""
    
    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description
    
    def apply(self, llm_review: Set[CodeLocation], simulator: BiasedReviewerSimulator) -> ReviewerBehavior:
        """Apply mitigation strategy"""
        raise NotImplementedError

class DelayedAssistanceStrategy(BiasMitigationStrategy):
    """Show LLM review only after initial manual review"""
    
    def __init__(self):
        super().__init__(
            "Delayed Assistance",
            "Provide LLM review after reviewer completes initial pass"
        )
    
    def apply(self, llm_review: Set[CodeLocation], simulator: BiasedReviewerSimulator) -> ReviewerBehavior:
        # First do manual review
        manual_phase = simulator.simulate_manual_review(time_budget=30)
        
        # Then check LLM suggestions
        additional_locations = llm_review - set(manual_phase.locations_reviewed)
        
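        # Quick verification of LLM suggestions (sorted so the selection is deterministic;
        # a set has no meaningful "top" elements)
        for loc in sorted(additional_locations, key=lambda l: (l.file_name, l.line_start))[:5]:  # Check up to 5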
            manual_phase.locations_reviewed.append(loc)
            manual_phase.time_per_location.append(0.5)
            if loc.has_issue and np.random.random() < 0.7:
                manual_phase.issues_found.append(loc)
        
        manual_phase.treatment = "Delayed Assistance"
        return manual_phase

class ConfidenceWeightedStrategy(BiasMitigationStrategy):
    """Weight LLM suggestions by confidence scores"""
    
    def __init__(self):
        super().__init__(
            "Confidence Weighted",
            "Prioritize high-confidence LLM suggestions"
        )
    
    def apply(self, llm_review: Set[CodeLocation], simulator: BiasedReviewerSimulator) -> ReviewerBehavior:
        # Assign confidence scores to LLM suggestions
        weighted_review = []
        for loc in llm_review:
            # Higher confidence for high-severity issues
            if loc.has_issue:
                confidence = {'high': 0.9, 'medium': 0.6, 'low': 0.3}[loc.issue_severity]
            else:
                confidence = 0.2
            weighted_review.append((loc, confidence))
        
        # Sort by confidence
        weighted_review.sort(key=lambda x: x[1], reverse=True)
        
        # Focus on high-confidence suggestions
        high_confidence = {loc for loc, conf in weighted_review if conf > 0.5}
        
        return simulator.simulate_anchored_review(high_confidence, "Confidence Weighted")

# Test mitigation strategies
strategies = [
    DelayedAssistanceStrategy(),
    ConfidenceWeightedStrategy()
]

mitigation_results = {}
for strategy in strategies:
    result = strategy.apply(llm_review, simulator)
    mitigation_results[strategy.name] = result
    
    print(f"\n{strategy.name}:")
    print(f"  Description: {strategy.description}")
    print(f"  Issues found: {len(result.issues_found)}")
    print(f"  High-severity found: {len([i for i in result.issues_found if i.has_issue and i.issue_severity == 'high'])}")

## 9. Comparative Analysis of Bias Effects

In [None]:
def comprehensive_bias_analysis():
    """Comprehensive analysis of bias effects across all strategies"""
    
    # Combine all results
    all_behaviors = {**behaviors_dict, **mitigation_results}
    
    # Calculate metrics for each approach
    metrics = []
    
    for name, behavior in all_behaviors.items():
        # Coverage metrics
        total_locations = len(simulator.all_locations)
        coverage = len(behavior.locations_reviewed) / total_locations
        
        # Issue detection metrics
        all_issues = [loc for loc in simulator.all_locations if loc.has_issue]
        found_issues = [loc for loc in behavior.issues_found if loc.has_issue]
        
        detection_rate = len(found_issues) / len(all_issues) if all_issues else 0
        
        # Severity-specific rates
        high_issues = [loc for loc in all_issues if loc.issue_severity == 'high']
        high_found = [loc for loc in found_issues if loc.issue_severity == 'high']
        high_detection = len(high_found) / len(high_issues) if high_issues else 0
        
        # Efficiency metrics
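        # Time per issue is undefined (NaN) when nothing was found; 0 would look like perfect efficiency
        time_per_issue = sum(behavior.time_per_location) / len(found_issues) if found_issues else float('nan')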
        
        metrics.append({
            'Approach': name,
            'Coverage %': coverage * 100,
            'Detection Rate %': detection_rate * 100,
            'High-Severity Detection %': high_detection * 100,
            'Time per Issue Found': time_per_issue,
            'Total Time': sum(behavior.time_per_location)
        })
    
    metrics_df = pd.DataFrame(metrics)
    
    # Create spider plot
    fig, ax = plt.subplots(figsize=(10, 8), subplot_kw=dict(projection='polar'))
    
    # Select metrics for spider plot
    spider_metrics = ['Coverage %', 'Detection Rate %', 'High-Severity Detection %']
    
    # Number of variables
    num_vars = len(spider_metrics)
    
    # Compute angle for each axis
    angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
    angles += angles[:1]
    
    # Plot each approach
    for idx, row in metrics_df.iterrows():
        values = row[spider_metrics].tolist()
        values += values[:1]
        
        ax.plot(angles, values, 'o-', linewidth=2, label=row['Approach'])
        ax.fill(angles, values, alpha=0.15)
    
    ax.set_theta_offset(np.pi / 2)
    ax.set_theta_direction(-1)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(spider_metrics)
    ax.set_ylim(0, 100)
    ax.set_title('Comparison of Code Review Approaches\n(Higher is Better)', y=1.08)
    ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
    ax.grid(True)
    
    plt.tight_layout()
    plt.show()
    
    return metrics_df

# Run comprehensive analysis
final_metrics = comprehensive_bias_analysis()
print("\nComprehensive Metrics:")
print(final_metrics.round(2).to_string(index=False))
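
All metrics above come from a single random seed, so differences between approaches can be partly noise. One way to check is to repeat the experiment over many seeds and report the mean and spread. The sketch below uses a deliberately simplified stand-in (`run_once`, a hypothetical function) instead of the simulator classes so that it runs on its own; the 0.89 keep rate comes from the notebook, while the 0.5 and 0.2 detection probabilities are assumptions.

```python
import random
import statistics
from typing import Dict


def run_once(seed: int, anchored: bool) -> float:
    """Toy stand-in for one simulation run; returns a detection rate in [0, 1].

    Assumed model: 10 issues, 4 of them flagged by the automated review.
    When anchored, flagged issues are kept with p=0.89 and the rest are
    found with p=0.2; when unanchored, every issue is found with p=0.5.
    """
    rng = random.Random(seed)
    n_issues, n_flagged = 10, 4
    found = 0
    for i in range(n_issues):
        if anchored:
            p = 0.89 if i < n_flagged else 0.2
        else:
            p = 0.5
        if rng.random() < p:
            found += 1
    return found / n_issues


def summarize(n_runs: int = 200) -> Dict[str, Dict[str, float]]:
    """Mean and sample standard deviation of the detection rate per condition."""
    summary = {}
    for label, anchored in [("unanchored", False), ("anchored", True)]:
        rates = [run_once(seed, anchored) for seed in range(n_runs)]
        summary[label] = {
            "mean": statistics.mean(rates),
            "stdev": statistics.stdev(rates),
        }
    return summary


for label, stats in summarize().items():
    print(f"{label}: {stats['mean']:.2f} ± {stats['stdev']:.2f}")
```

The same pattern could wrap the notebook's own simulator by reseeding before each run and collecting the metrics into a DataFrame.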

## 10. Key Takeaways and Practical Recommendations

Based on our analysis of LLM bias in code review, here are the key insights:

In [None]:
takeaways = {
    "Anchoring Bias Effects": [
        "Reviewers spend ~70% of their time on LLM-suggested locations",
        "Coverage diversity decreases by ~50% when starting with LLM review",
        "High-severity issues outside LLM focus are often missed"
    ],
    
    "Mitigation Strategies": [
        "Delayed assistance maintains exploration while benefiting from LLM insights",
        "Confidence weighting helps prioritize valuable suggestions",
        "Hybrid approaches can balance efficiency and thoroughness"
    ],
    
    "Implementation Guidelines": [
        "Present LLM suggestions as 'additional checks' after manual review",
        "Use visual cues to distinguish LLM suggestions from manual findings",
        "Track metrics on coverage diversity and high-severity detection rates",
        "Educate reviewers about anchoring bias and its effects"
    ],
    
    "Future Research Directions": [
        "Develop bias-aware interfaces for code review tools",
        "Study long-term learning effects when using AI assistance",
        "Create specialized models for high-severity issue detection",
        "Investigate optimal timing for AI assistance delivery"
    ]
}

print("\n" + "="*80)
print("KEY TAKEAWAYS: Understanding and Mitigating LLM Bias in Code Review")
print("="*80)

for category, items in takeaways.items():
    print(f"\n{category}:")
    for item in items:
        print(f"  • {item}")

print("\n" + "="*80)
print("\nThis simulation illustrates that while LLM-assisted code review can identify")
print("additional issues, it fundamentally changes reviewer behavior. Understanding")
print("and mitigating these biases is crucial for effective AI-human collaboration.")