# Focused Learning: Reviewer Ownership Metrics (ACO/RSO) Implementation

## Learning Objective
Master the calculation and application of Authoring Code Ownership (ACO) and Review-Specific Ownership (RSO) metrics, understanding how to measure developer experience at multiple granularities (repository, subsystem, package).

## Paper Reference
- **Section 3.1**: Reviewer Experience Heuristics (Pages 6-7)
- **Equation (1)**: ACO(D,G) = α(D,G) / C(G)
- **Equation (2)**: RSO(D,G) = r(D,G) / ρ(G)
- **Algorithm 1 & 2**: ACO and RSO Implementation
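
As a minimal worked example of Equations (1) and (2): this sketch reads α(D,G) as the number of commits developer D authored at granularity G, C(G) as the total number of commits at G, r(D,G) as the number of PRs D reviewed at G, and ρ(G) as the total number of PRs at G, which matches how the calculator later in this notebook counts them. The helper name `ownership_ratio` and the counts are hypothetical and chosen only for illustration.

```python
def ownership_ratio(developer_count: int, total_count: int) -> float:
    """Return developer_count / total_count, or 0.0 when there is no activity."""
    if total_count == 0:
        return 0.0
    return developer_count / total_count

# Hypothetical counts for one developer in one subsystem
aco = ownership_ratio(developer_count=12, total_count=80)  # alpha(D,G) / C(G)
rso = ownership_ratio(developer_count=9, total_count=30)   # r(D,G) / rho(G)
print(f"ACO = {aco:.3f}, RSO = {rso:.3f}")
```

Both metrics are shares in [0, 1], so they can be compared across granularities even when the underlying counts differ by orders of magnitude.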

## Why Ownership Metrics are Complex
1. **Temporal Dynamics**: Metrics must be calculated at specific timestamps
2. **Multi-granularity**: Repository, subsystem, and package levels capture different expertise
3. **Large-scale Computation**: Processing millions of commits and PRs efficiently
4. **Data Quality Issues**: Missing data, bot accounts, and rebasing all affect accuracy

## 1. Foundation: Understanding Code Ownership

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from typing import Dict, List, Tuple, Optional, Set
from collections import defaultdict, Counter
import networkx as nx
from dataclasses import dataclass, field
import json

# Configure visualization
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

### 1.1 Data Structures for Repository History

In [None]:
@dataclass
class Commit:
    """Represents a single commit in the repository"""
    commit_id: str
    author_id: str
    timestamp: datetime
    files_changed: List[str]
    is_merge: bool = False
    
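    # Static helpers: they depend only on the path, so no Commit instance is needed
    @staticmethod
    def get_subsystem(file_path: str) -> str:
        """Extract subsystem (top-level directory) from file path"""
        # Drop the file name; files at the repository root map to 'root'
        dirs = file_path.strip('/').split('/')[:-1]
        return dirs[0] if dirs else 'root'
    
    @staticmethod
    def get_package(file_path: str) -> str:
        """Extract package (up to two directory levels) from file path"""
        # Drop the file name so 'docs/api.md' maps to 'docs', not 'docs/api.md'
        dirs = file_path.strip('/').split('/')[:-1]
        return '/'.join(dirs[:2]) if dirs else 'root'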

@dataclass
class PullRequest:
    """Represents a pull request/code review"""
    pr_id: str
    reviewers: List[str]  # List of reviewer IDs who commented
    timestamp: datetime
    files_changed: List[str]
    status: str = "closed"  # closed, merged, open

@dataclass
class ReviewComment:
    """Represents a single review comment"""
    comment_id: str
    reviewer_id: str
    pr_id: str
    timestamp: datetime
    file_path: str
    repository: str
    content: str
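
The mapping from a file path to its subsystem and package can also be sketched as a standalone function. This is a hypothetical helper (`granularity_targets` is not part of the notebook's classes), and it assumes that the subsystem is the top-level directory, the package is at most the first two directory levels, and files at the repository root map to `'root'`.

```python
from pathlib import PurePosixPath

def granularity_targets(file_path: str) -> dict:
    """Map a repository-relative path to its subsystem and package.

    Assumption: subsystem = top-level directory, package = first two
    directory levels; files at the repository root map to 'root'.
    """
    dirs = PurePosixPath(file_path.strip('/')).parts[:-1]  # drop the file name
    if not dirs:
        return {'subsystem': 'root', 'package': 'root'}
    return {'subsystem': dirs[0], 'package': '/'.join(dirs[:2])}

for path in ['src/core/engine.py', 'docs/api.md', 'README.md']:
    print(path, '->', granularity_targets(path))
```

Dropping the file name before slicing keeps shallow paths such as `docs/api.md` from being treated as their own package.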

### 1.2 Repository History Generator (Mock Data)

In [None]:
class RepositoryHistoryGenerator:
    """Generate realistic repository history for demonstration"""
    
    def __init__(self, seed=42):
        np.random.seed(seed)
        self.developers = self._generate_developers()
        self.file_structure = self._generate_file_structure()
        
    def _generate_developers(self, n=20):
        """Generate developer profiles with different activity levels"""
        developers = []
        
        # Core maintainers (high activity)
        for i in range(3):
            developers.append({
                'id': f'maintainer_{i}',
                'type': 'maintainer',
                'commit_prob': 0.3,
                'review_prob': 0.4
            })
        
        # Regular contributors
        for i in range(7):
            developers.append({
                'id': f'contributor_{i}',
                'type': 'contributor',
                'commit_prob': 0.15,
                'review_prob': 0.2
            })
        
        # Occasional contributors
        for i in range(10):
            developers.append({
                'id': f'occasional_{i}',
                'type': 'occasional',
                'commit_prob': 0.05,
                'review_prob': 0.1
            })
        
        return developers
    
    def _generate_file_structure(self):
        """Generate realistic file structure"""
        structure = {
            'src': [
                'src/core/engine.py',
                'src/core/utils.py',
                'src/core/config.py',
                'src/api/routes.py',
                'src/api/middleware.py',
                'src/api/auth.py',
                'src/models/user.py',
                'src/models/product.py',
                'src/models/order.py'
            ],
            'tests': [
                'tests/unit/test_engine.py',
                'tests/unit/test_utils.py',
                'tests/integration/test_api.py',
                'tests/integration/test_models.py'
            ],
            'docs': [
                'docs/api.md',
                'docs/setup.md',
                'docs/contributing.md'
            ],
            'config': [
                'config/production.yml',
                'config/development.yml',
                'config/testing.yml'
            ]
        }
        
        # Flatten to list
        all_files = []
        for category, files in structure.items():
            all_files.extend(files)
        return all_files
    
    def generate_commits(self, n_commits=1000, time_span_days=365):
        """Generate commit history"""
        commits = []
        start_date = datetime.now() - timedelta(days=time_span_days)
        
        for i in range(n_commits):
            # Select developer based on activity probability
            dev_probs = [d['commit_prob'] for d in self.developers]
            dev_probs = np.array(dev_probs) / sum(dev_probs)
            developer = np.random.choice(self.developers, p=dev_probs)
            
            # Generate timestamp
            days_offset = np.random.uniform(0, time_span_days)
            timestamp = start_date + timedelta(days=days_offset)
            
            # Select files changed (developers tend to work on specific areas)
            n_files = np.random.poisson(2) + 1  # At least 1 file
            if developer['type'] == 'maintainer':
                # Maintainers work across the codebase
                files = np.random.choice(self.file_structure, min(n_files, 5), replace=False)
            else:
                # Others tend to focus on specific subsystems
                subsystem = np.random.choice(['src', 'tests', 'docs', 'config'])
                subsystem_files = [f for f in self.file_structure if f.startswith(subsystem)]
                files = np.random.choice(subsystem_files, 
                                       min(n_files, len(subsystem_files)), 
                                       replace=False)
            
            commit = Commit(
                commit_id=f"commit_{i:04d}",
                author_id=developer['id'],
                timestamp=timestamp,
                files_changed=list(files),
                is_merge=np.random.random() < 0.1  # 10% merge commits
            )
            commits.append(commit)
        
        # Sort by timestamp
        commits.sort(key=lambda x: x.timestamp)
        return commits
    
    def generate_pull_requests(self, n_prs=300, time_span_days=365):
        """Generate pull request history"""
        prs = []
        start_date = datetime.now() - timedelta(days=time_span_days)
        
        for i in range(n_prs):
            # Generate timestamp
            days_offset = np.random.uniform(0, time_span_days)
            timestamp = start_date + timedelta(days=days_offset)
            
            # Select reviewers (usually 1-3)
            n_reviewers = np.random.poisson(1.5) + 1
            review_probs = [d['review_prob'] for d in self.developers]
            review_probs = np.array(review_probs) / sum(review_probs)
            reviewers = np.random.choice(self.developers, 
                                       min(n_reviewers, 3), 
                                       replace=False,
                                       p=review_probs)
            reviewer_ids = [r['id'] for r in reviewers]
            
            # Select files changed
            n_files = np.random.poisson(3) + 1
            files = np.random.choice(self.file_structure, min(n_files, 10), replace=False)
            
            pr = PullRequest(
                pr_id=f"pr_{i:04d}",
                reviewers=reviewer_ids,
                timestamp=timestamp,
                files_changed=list(files),
                status="closed"
            )
            prs.append(pr)
        
        # Sort by timestamp
        prs.sort(key=lambda x: x.timestamp)
        return prs

# Generate mock repository history
generator = RepositoryHistoryGenerator()
commits = generator.generate_commits(1000, 365)
pull_requests = generator.generate_pull_requests(300, 365)

print(f"Generated {len(commits)} commits and {len(pull_requests)} pull requests")
print(f"Time span: {commits[0].timestamp.date()} to {commits[-1].timestamp.date()}")
print(f"\nDevelopers: {len(generator.developers)}")
print(f"Files: {len(generator.file_structure)}")

## 2. Implementing ACO and RSO Calculations

In [None]:
class OwnershipCalculator:
    """Calculate ACO and RSO metrics following Algorithm 1 & 2 from the paper"""
    
    def __init__(self, commits: List[Commit], pull_requests: List[PullRequest]):
        self.commits = commits
        self.pull_requests = pull_requests
        self._preprocess_data()
        
    def _preprocess_data(self):
        """Preprocess data for efficient calculation"""
        # Remove merge commits as per paper
        self.commits = [c for c in self.commits if not c.is_merge]
        
        # Index commits by granularity for faster lookup
        self.commits_by_repo = defaultdict(list)
        self.commits_by_subsystem = defaultdict(list)
        self.commits_by_package = defaultdict(list)
        
        for commit in self.commits:
            # Repository level
            self.commits_by_repo['repository'].append(commit)
            
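            # Subsystem and package level: index each commit once per target,
            # even when it touches several files in the same subsystem/package
            subsystems = {commit.get_subsystem(fp) for fp in commit.files_changed}
            packages = {commit.get_package(fp) for fp in commit.files_changed}
            
            for subsystem in subsystems:
                self.commits_by_subsystem[subsystem].append(commit)
            for package in packages:
                self.commits_by_package[package].append(commit)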
        
        # Similarly for PRs
        self.prs_by_repo = defaultdict(list)
        self.prs_by_subsystem = defaultdict(list)
        self.prs_by_package = defaultdict(list)
        
        for pr in self.pull_requests:
            if pr.status == "closed":  # Only closed PRs as per paper
                self.prs_by_repo['repository'].append(pr)
                
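                # Index each PR once per target to avoid double counting
                path_helper = Commit(None, None, None, [])
                subsystems = {path_helper.get_subsystem(fp) for fp in pr.files_changed}
                packages = {path_helper.get_package(fp) for fp in pr.files_changed}
                
                for subsystem in subsystems:
                    self.prs_by_subsystem[subsystem].append(pr)
                for package in packages:
                    self.prs_by_package[package].append(pr)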
    
    def calculate_aco(self, developer_id: str, granularity: str, 
                     target: str, review_timestamp: datetime) -> float:
        """
        Calculate Authoring Code Ownership (ACO) - Algorithm 1
        ACO(D,G) = α(D,G) / C(G)
        """
        if granularity == "repository":
            commits_at_g = self.commits_by_repo[target]
        elif granularity == "subsystem":
            commits_at_g = self.commits_by_subsystem[target]
        else:  # package
            commits_at_g = self.commits_by_package[target]
        
        # Filter commits before review timestamp
        prior_commits = [c for c in commits_at_g if c.timestamp < review_timestamp]
        
        if not prior_commits:
            return 0.0
        
        # Count developer's commits
        developer_commits = sum(1 for c in prior_commits if c.author_id == developer_id)
        total_commits = len(prior_commits)
        
        return developer_commits / total_commits
    
    def calculate_rso(self, developer_id: str, granularity: str,
                     target: str, review_timestamp: datetime) -> float:
        """
        Calculate Review-Specific Ownership (RSO) - Algorithm 2
        RSO(D,G) = r(D,G) / ρ(G)
        """
        if granularity == "repository":
            prs_at_g = self.prs_by_repo[target]
        elif granularity == "subsystem":
            prs_at_g = self.prs_by_subsystem[target]
        else:  # package
            prs_at_g = self.prs_by_package[target]
        
        # Filter PRs before review timestamp
        prior_prs = [pr for pr in prs_at_g if pr.timestamp < review_timestamp]
        
        if not prior_prs:
            return 0.0
        
        # Count PRs reviewed by developer
        developer_reviews = sum(1 for pr in prior_prs if developer_id in pr.reviewers)
        total_prs = len(prior_prs)
        
        return developer_reviews / total_prs
    
    def calculate_all_metrics(self, developer_id: str, file_path: str, 
                            review_timestamp: datetime) -> Dict[str, float]:
        """Calculate all ownership metrics for a developer at a given time"""
        # Determine granularity targets
        subsystem = Commit(None, None, None, []).get_subsystem(file_path)
        package = Commit(None, None, None, []).get_package(file_path)
        
        metrics = {
            'aco_repo': self.calculate_aco(developer_id, "repository", "repository", review_timestamp),
            'aco_sys': self.calculate_aco(developer_id, "subsystem", subsystem, review_timestamp),
            'aco_pkg': self.calculate_aco(developer_id, "package", package, review_timestamp),
            'rso_repo': self.calculate_rso(developer_id, "repository", "repository", review_timestamp),
            'rso_sys': self.calculate_rso(developer_id, "subsystem", subsystem, review_timestamp),
            'rso_pkg': self.calculate_rso(developer_id, "package", package, review_timestamp)
        }
        
        return metrics

# Create calculator and test
calculator = OwnershipCalculator(commits, pull_requests)

# Test calculation for a specific developer
test_timestamp = datetime.now()
test_file = "src/core/engine.py"
test_developer = "maintainer_0"

metrics = calculator.calculate_all_metrics(test_developer, test_file, test_timestamp)

print(f"Ownership Metrics for {test_developer} at {test_file}:")
print(f"\nAuthoring Code Ownership (ACO):")
print(f"  Repository: {metrics['aco_repo']:.3f}")
print(f"  Subsystem:  {metrics['aco_sys']:.3f}")
print(f"  Package:    {metrics['aco_pkg']:.3f}")
print(f"\nReview-Specific Ownership (RSO):")
print(f"  Repository: {metrics['rso_repo']:.3f}")
print(f"  Subsystem:  {metrics['rso_sys']:.3f}")
print(f"  Package:    {metrics['rso_pkg']:.3f}")
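
To see why the `review_timestamp` cutoff matters, here is a small hand-checkable sketch. The history and the names `history` and `aco_at` are hypothetical; the only behaviour carried over from the calculator is that events are counted strictly before the cutoff.

```python
from datetime import datetime

# Hypothetical history: (author, timestamp) for commits in one package
history = [
    ('alice', datetime(2024, 1, 5)),
    ('bob',   datetime(2024, 1, 9)),
    ('alice', datetime(2024, 2, 1)),
    ('alice', datetime(2024, 3, 15)),
]

def aco_at(author: str, cutoff: datetime) -> float:
    """Share of commits authored by `author` strictly before `cutoff`."""
    prior = [a for a, ts in history if ts < cutoff]
    return prior.count(author) / len(prior) if prior else 0.0

early = aco_at('alice', datetime(2024, 1, 10))  # 1 of 2 prior commits
late = aco_at('alice', datetime(2024, 4, 1))    # 3 of 4 prior commits
print(early, late)
```

The same developer has a different ACO depending on when the review happens, which is why the metrics cannot be computed once and reused for every review comment.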

## 3. Visualizing Ownership Distributions

In [None]:
def analyze_ownership_distributions(calculator: OwnershipCalculator, developers: List[Dict]):
    """Analyze and visualize ownership distributions across developers"""
    
    # Calculate metrics for all developers
    all_metrics = []
    timestamp = datetime.now()
    
    for dev in developers:
        # Calculate for a common file
        metrics = calculator.calculate_all_metrics(
            dev['id'], 
            "src/core/engine.py", 
            timestamp
        )
        metrics['developer_id'] = dev['id']
        metrics['developer_type'] = dev['type']
        all_metrics.append(metrics)
    
    df_metrics = pd.DataFrame(all_metrics)
    
    # Create visualizations
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    
    # Plot ACO distributions
    for idx, gran in enumerate(['repo', 'sys', 'pkg']):
        ax = axes[0, idx]
        
        # Group by developer type
        for dev_type in ['maintainer', 'contributor', 'occasional']:
            data = df_metrics[df_metrics['developer_type'] == dev_type][f'aco_{gran}']
            ax.hist(data, alpha=0.6, label=dev_type, bins=20)
        
        ax.set_title(f'ACO Distribution - {gran.capitalize()} Level')
        ax.set_xlabel('ACO Value')
        ax.set_ylabel('Count')
        ax.legend()
        ax.grid(True, alpha=0.3)
    
    # Plot RSO distributions
    for idx, gran in enumerate(['repo', 'sys', 'pkg']):
        ax = axes[1, idx]
        
        # Group by developer type
        for dev_type in ['maintainer', 'contributor', 'occasional']:
            data = df_metrics[df_metrics['developer_type'] == dev_type][f'rso_{gran}']
            ax.hist(data, alpha=0.6, label=dev_type, bins=20)
        
        ax.set_title(f'RSO Distribution - {gran.capitalize()} Level')
        ax.set_xlabel('RSO Value')
        ax.set_ylabel('Count')
        ax.legend()
        ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Create correlation heatmap
    fig, ax = plt.subplots(1, 1, figsize=(10, 8))
    
    # Select numeric columns
    numeric_cols = ['aco_repo', 'aco_sys', 'aco_pkg', 'rso_repo', 'rso_sys', 'rso_pkg']
    correlation_matrix = df_metrics[numeric_cols].corr()
    
    sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
                center=0, square=True, ax=ax)
    ax.set_title('Correlation between Ownership Metrics', fontsize=14)
    plt.tight_layout()
    plt.show()
    
    return df_metrics

# Analyze ownership distributions
df_metrics = analyze_ownership_distributions(calculator, generator.developers)

# Print summary statistics
print("\nSummary Statistics by Developer Type:")
summary = df_metrics.groupby('developer_type')[['aco_repo', 'rso_repo']].agg(['mean', 'std'])
print(summary)

## 4. Temporal Dynamics of Ownership

In [None]:
def analyze_temporal_ownership(calculator: OwnershipCalculator, 
                             developer_id: str,
                             file_path: str,
                             time_points: int = 12):
    """Analyze how ownership changes over time"""
    
    # Get time range from commits
    min_time = min(c.timestamp for c in calculator.commits)
    max_time = max(c.timestamp for c in calculator.commits)
    
    # Create time points
    time_delta = (max_time - min_time) / time_points
    timestamps = [min_time + time_delta * i for i in range(1, time_points + 1)]
    
    # Calculate metrics at each time point
    temporal_metrics = []
    for ts in timestamps:
        metrics = calculator.calculate_all_metrics(developer_id, file_path, ts)
        metrics['timestamp'] = ts
        temporal_metrics.append(metrics)
    
    df_temporal = pd.DataFrame(temporal_metrics)
    
    # Visualize temporal evolution
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10), sharex=True)
    
    # Plot ACO evolution
    ax1.plot(df_temporal['timestamp'], df_temporal['aco_repo'], 
             'b-', label='Repository', linewidth=2)
    ax1.plot(df_temporal['timestamp'], df_temporal['aco_sys'], 
             'g--', label='Subsystem', linewidth=2)
    ax1.plot(df_temporal['timestamp'], df_temporal['aco_pkg'], 
             'r:', label='Package', linewidth=2)
    ax1.set_ylabel('ACO Value')
    ax1.set_title(f'Temporal Evolution of ACO for {developer_id}', fontsize=14)
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(0, max(df_temporal[['aco_repo', 'aco_sys', 'aco_pkg']].max()) * 1.1)
    
    # Plot RSO evolution
    ax2.plot(df_temporal['timestamp'], df_temporal['rso_repo'], 
             'b-', label='Repository', linewidth=2)
    ax2.plot(df_temporal['timestamp'], df_temporal['rso_sys'], 
             'g--', label='Subsystem', linewidth=2)
    ax2.plot(df_temporal['timestamp'], df_temporal['rso_pkg'], 
             'r:', label='Package', linewidth=2)
    ax2.set_xlabel('Time')
    ax2.set_ylabel('RSO Value')
    ax2.set_title(f'Temporal Evolution of RSO for {developer_id}', fontsize=14)
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim(0, max(df_temporal[['rso_repo', 'rso_sys', 'rso_pkg']].max()) * 1.1)
    
    # Format x-axis
    ax2.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    return df_temporal

# Analyze temporal dynamics for a maintainer
df_temporal = analyze_temporal_ownership(calculator, "maintainer_0", "src/core/engine.py")

# Calculate growth rates
print("\nOwnership Growth Analysis:")
for metric in ['aco_repo', 'aco_sys', 'aco_pkg', 'rso_repo', 'rso_sys', 'rso_pkg']:
    initial = df_temporal[metric].iloc[0]
    final = df_temporal[metric].iloc[-1]
    growth = (final - initial) / (initial + 1e-6) * 100  # Avoid division by zero
    print(f"{metric}: {initial:.3f} → {final:.3f} (Growth: {growth:+.1f}%)")
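
The `1e-6` guard above keeps the division defined, but when the initial value is 0 it produces an enormous percentage that is hard to interpret. One possible alternative, sketched here with a hypothetical helper `relative_growth`, is to treat relative growth from a zero baseline as undefined and report it as such.

```python
from typing import Optional

def relative_growth(initial: float, final: float) -> Optional[float]:
    """Percentage growth from initial to final, or None when initial is 0."""
    if initial == 0:
        return None  # relative growth is undefined for a zero baseline
    return (final - initial) / initial * 100

# Hypothetical (initial, final) ownership values
for initial, final in [(0.10, 0.15), (0.0, 0.20)]:
    growth = relative_growth(initial, final)
    label = f"{growth:+.1f}%" if growth is not None else "n/a (zero baseline)"
    print(f"{initial:.3f} -> {final:.3f}: {label}")
```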

## 5. Granularity Analysis: Repository vs Subsystem vs Package

In [None]:
def analyze_granularity_effects(calculator: OwnershipCalculator, developers: List[Dict]):
    """Analyze how granularity affects ownership metrics"""
    
    # Calculate metrics for multiple files at different levels
    test_files = [
        "src/core/engine.py",
        "src/api/routes.py",
        "tests/unit/test_engine.py",
        "docs/api.md"
    ]
    
    timestamp = datetime.now()
    results = []
    
    for dev in developers[:10]:  # First 10 developers (maintainers and regular contributors)
        for file_path in test_files:
            metrics = calculator.calculate_all_metrics(dev['id'], file_path, timestamp)
            
            result = {
                'developer': dev['id'],
                'dev_type': dev['type'],
                'file': file_path,
                'subsystem': Commit(None, None, None, []).get_subsystem(file_path),
                'package': Commit(None, None, None, []).get_package(file_path),
                **metrics
            }
            results.append(result)
    
    df_gran = pd.DataFrame(results)
    
    # Visualization 1: Ownership increase by granularity
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Calculate average increase factors
    df_gran['aco_sys_factor'] = df_gran['aco_sys'] / (df_gran['aco_repo'] + 1e-6)
    df_gran['aco_pkg_factor'] = df_gran['aco_pkg'] / (df_gran['aco_repo'] + 1e-6)
    df_gran['rso_sys_factor'] = df_gran['rso_sys'] / (df_gran['rso_repo'] + 1e-6)
    df_gran['rso_pkg_factor'] = df_gran['rso_pkg'] / (df_gran['rso_repo'] + 1e-6)
    
    # Box plot of increase factors
    factor_data = [
        df_gran['aco_sys_factor'].dropna(),
        df_gran['aco_pkg_factor'].dropna(),
        df_gran['rso_sys_factor'].dropna(),
        df_gran['rso_pkg_factor'].dropna()
    ]
    
    positions = [1, 2, 4, 5]
    colors = ['lightblue', 'darkblue', 'lightgreen', 'darkgreen']
    
    bp = ax1.boxplot(factor_data, positions=positions, widths=0.6, patch_artist=True)
    for patch, color in zip(bp['boxes'], colors):
        patch.set_facecolor(color)
    
    ax1.set_xticks([1.5, 4.5])
    ax1.set_xticklabels(['ACO', 'RSO'])
    ax1.set_ylabel('Ownership Increase Factor')
    ax1.set_title('Ownership Increase from Repository to Finer Granularities')
    ax1.axhline(y=1, color='red', linestyle='--', alpha=0.5)
    ax1.grid(True, alpha=0.3)
    
    # Add legend
    from matplotlib.patches import Patch
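    legend_elements = [
        Patch(facecolor='lightblue', label='ACO Subsystem/Repo'),
        Patch(facecolor='darkblue', label='ACO Package/Repo'),
        Patch(facecolor='lightgreen', label='RSO Subsystem/Repo'),
        Patch(facecolor='darkgreen', label='RSO Package/Repo')
    ]
    ax1.legend(handles=legend_elements)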
    
    # Visualization 2: Scatter plot of ACO vs RSO at different granularities
    for idx, (gran, marker) in enumerate([('repo', 'o'), ('sys', 's'), ('pkg', '^')]):
        ax2.scatter(df_gran[f'aco_{gran}'], df_gran[f'rso_{gran}'], 
                   label=f'{gran.capitalize()} level',
                   alpha=0.6, s=100, marker=marker)
    
    ax2.set_xlabel('ACO (Authoring Code Ownership)')
    ax2.set_ylabel('RSO (Review-Specific Ownership)')
    ax2.set_title('ACO vs RSO Relationship at Different Granularities')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    # Add diagonal line
    max_val = max(ax2.get_xlim()[1], ax2.get_ylim()[1])
    ax2.plot([0, max_val], [0, max_val], 'k--', alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Statistical analysis
    print("\nGranularity Effect Statistics:")
    print("\nAverage Ownership Increase Factors:")
    print(f"ACO Subsystem/Repository: {df_gran['aco_sys_factor'].mean():.2f}x")
    print(f"ACO Package/Repository: {df_gran['aco_pkg_factor'].mean():.2f}x")
    print(f"RSO Subsystem/Repository: {df_gran['rso_sys_factor'].mean():.2f}x")
    print(f"RSO Package/Repository: {df_gran['rso_pkg_factor'].mean():.2f}x")
    
    return df_gran

# Analyze granularity effects
df_granularity = analyze_granularity_effects(calculator, generator.developers)

## 6. Efficient Implementation for Large-Scale Processing

In [None]:
class OptimizedOwnershipCalculator:
    """Optimized implementation for processing millions of commits/PRs"""
    
    def __init__(self):
        self.commit_cache = {}
        self.pr_cache = {}
        self.metric_cache = {}
        self.cache_hits = 0
        self.cache_misses = 0
        
    def _get_cache_key(self, developer_id: str, granularity: str, 
                      target: str, timestamp: datetime) -> str:
        """Generate cache key for metrics"""
        # Truncate timestamp to the hour so nearby queries share a cache entry
        # (cached metrics are therefore approximate within that hour)
        rounded_ts = timestamp.replace(minute=0, second=0, microsecond=0)
        return f"{developer_id}:{granularity}:{target}:{rounded_ts.isoformat()}"
    
    def batch_calculate_metrics(self, review_comments: List[ReviewComment],
                              commits: List[Commit],
                              pull_requests: List[PullRequest]) -> pd.DataFrame:
        """Efficiently calculate metrics for a batch of review comments"""
        
        # Build indices for fast lookup
        print("Building indices...")
        commit_index = self._build_commit_index(commits)
        pr_index = self._build_pr_index(pull_requests)
        
        results = []
        total = len(review_comments)
        
        print(f"Processing {total} review comments...")
        for idx, comment in enumerate(review_comments):
            if idx % 100 == 0:
                print(f"Progress: {idx}/{total} ({idx/total*100:.1f}%)")
            
            # Check cache first
            cache_key = self._get_cache_key(
                comment.reviewer_id,
                "all",
                comment.file_path,
                comment.timestamp
            )
            
            if cache_key in self.metric_cache:
                metrics = self.metric_cache[cache_key]
                self.cache_hits += 1
            else:
                # Calculate metrics
                metrics = self._calculate_metrics_fast(
                    comment.reviewer_id,
                    comment.file_path,
                    comment.timestamp,
                    commit_index,
                    pr_index
                )
                self.metric_cache[cache_key] = metrics
                self.cache_misses += 1
            
            result = {
                'comment_id': comment.comment_id,
                'reviewer_id': comment.reviewer_id,
                'timestamp': comment.timestamp,
                'file_path': comment.file_path,
                **metrics
            }
            results.append(result)
        
        total_lookups = self.cache_hits + self.cache_misses
        print(f"\nCache performance: {self.cache_hits} hits, {self.cache_misses} misses")
        if total_lookups:  # Avoid division by zero on an empty batch
            print(f"Cache hit rate: {self.cache_hits/total_lookups*100:.1f}%")
        
        return pd.DataFrame(results)
    
    def _build_commit_index(self, commits: List[Commit]) -> Dict:
        """Build efficient index for commits"""
        index = {
            'by_author': defaultdict(list),
            'by_subsystem': defaultdict(list),
            'by_package': defaultdict(list)
        }
        
        for commit in commits:
            if not commit.is_merge:
                index['by_author'][commit.author_id].append(commit)
                
                # Index each commit once per subsystem/package to avoid double counting
                for subsystem in {commit.get_subsystem(fp) for fp in commit.files_changed}:
                    index['by_subsystem'][subsystem].append(commit)
                for package in {commit.get_package(fp) for fp in commit.files_changed}:
                    index['by_package'][package].append(commit)
        
        return index
    
    def _build_pr_index(self, pull_requests: List[PullRequest]) -> Dict:
        """Build efficient index for pull requests"""
        index = {
            'by_reviewer': defaultdict(list),
            'by_subsystem': defaultdict(list),
            'by_package': defaultdict(list)
        }
        
        for pr in pull_requests:
            if pr.status == "closed":
                for reviewer_id in pr.reviewers:
                    index['by_reviewer'][reviewer_id].append(pr)
                
                # Index each PR once per subsystem/package to avoid double counting
                path_helper = Commit(None, None, None, [])
                for subsystem in {path_helper.get_subsystem(fp) for fp in pr.files_changed}:
                    index['by_subsystem'][subsystem].append(pr)
                for package in {path_helper.get_package(fp) for fp in pr.files_changed}:
                    index['by_package'][package].append(pr)
        
        return index
    
    def _calculate_metrics_fast(self, developer_id: str, file_path: str,
                              timestamp: datetime, commit_index: Dict,
                              pr_index: Dict) -> Dict[str, float]:
        """Fast calculation using indices"""
        subsystem = Commit(None, None, None, []).get_subsystem(file_path)
        package = Commit(None, None, None, []).get_package(file_path)
        
        # Calculate ACO metrics
        aco_repo = self._calculate_aco_fast(
            developer_id, 
            [c for author_commits in commit_index['by_author'].values() 
             for c in author_commits],
            timestamp
        )
        
        aco_sys = self._calculate_aco_fast(
            developer_id,
            commit_index['by_subsystem'][subsystem],
            timestamp
        )
        
        aco_pkg = self._calculate_aco_fast(
            developer_id,
            commit_index['by_package'][package],
            timestamp
        )
        
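        # Calculate RSO metrics
        # A PR with several reviewers appears once per reviewer in 'by_reviewer',
        # so deduplicate by pr_id before counting repository-level PRs
        rso_repo = self._calculate_rso_fast(
            developer_id,
            list({pr.pr_id: pr for reviewer_prs in pr_index['by_reviewer'].values()
                  for pr in reviewer_prs}.values()),
            timestamp
        )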
        
        rso_sys = self._calculate_rso_fast(
            developer_id,
            pr_index['by_subsystem'][subsystem],
            timestamp
        )
        
        rso_pkg = self._calculate_rso_fast(
            developer_id,
            pr_index['by_package'][package],
            timestamp
        )
        
        return {
            'aco_repo': aco_repo,
            'aco_sys': aco_sys,
            'aco_pkg': aco_pkg,
            'rso_repo': rso_repo,
            'rso_sys': rso_sys,
            'rso_pkg': rso_pkg
        }
    
    def _calculate_aco_fast(self, developer_id: str, commits: List[Commit],
                           timestamp: datetime) -> float:
        """Fast ACO calculation"""
        prior_commits = [c for c in commits if c.timestamp < timestamp]
        if not prior_commits:
            return 0.0
        
        developer_commits = sum(1 for c in prior_commits if c.author_id == developer_id)
        return developer_commits / len(prior_commits)
    
    def _calculate_rso_fast(self, developer_id: str, prs: List[PullRequest],
                           timestamp: datetime) -> float:
        """Fast RSO calculation"""
        prior_prs = [pr for pr in prs if pr.timestamp < timestamp]
        if not prior_prs:
            return 0.0
        
        developer_reviews = sum(1 for pr in prior_prs if developer_id in pr.reviewers)
        return developer_reviews / len(prior_prs)

# Demonstrate optimized calculation
print("Generating mock review comments...")
mock_comments = []
for i in range(500):  # Simulate 500 review comments
    reviewer = np.random.choice(generator.developers)
    file_path = np.random.choice(generator.file_structure)
    
    comment = ReviewComment(
        comment_id=f"comment_{i:04d}",
        reviewer_id=reviewer['id'],
        pr_id=f"pr_{np.random.randint(0, 300):04d}",
        timestamp=datetime.now() - timedelta(days=np.random.randint(0, 365)),
        file_path=file_path,
        repository="mock_repo",
        content="Mock review comment"
    )
    mock_comments.append(comment)

# Calculate metrics
opt_calculator = OptimizedOwnershipCalculator()
df_results = opt_calculator.batch_calculate_metrics(mock_comments, commits, pull_requests)

print(f"\nProcessed {len(df_results)} comments")
print("\nSample results:")
print(df_results.head())

# Performance statistics
print("\nOwnership Statistics:")
print(df_results[['aco_repo', 'aco_sys', 'aco_pkg', 'rso_repo', 'rso_sys', 'rso_pkg']].describe())

## 7. Practical Challenges and Solutions

In [None]:
class OwnershipChallenges:
    """Common challenges when calculating ownership metrics"""
    
    @staticmethod
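    def handle_bot_accounts(commits: List[Commit], pull_requests: List[PullRequest]):
        """Filter bot-authored commits out of ownership calculations.

        `pull_requests` is accepted for symmetry with the other helpers but is
        not filtered here; only the human-authored commits are returned.
        """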
        print("Challenge: Bot Account Detection")
        
        # Common bot patterns
        bot_patterns = [
            lambda x: x.endswith('[bot]'),
            lambda x: x.endswith('-bot'),
            lambda x: 'dependabot' in x.lower(),
            lambda x: 'renovate' in x.lower(),
            lambda x: 'github-actions' in x.lower()
        ]
        
        def is_bot(user_id: str) -> bool:
            return any(pattern(user_id) for pattern in bot_patterns)
        
        # Filter commits
        human_commits = [c for c in commits if not is_bot(c.author_id)]
        bot_commits = [c for c in commits if is_bot(c.author_id)]
        
        print(f"Filtered {len(bot_commits)} bot commits out of {len(commits)}")
        print(f"Remaining human commits: {len(human_commits)}")
        
        return human_commits
    
    @staticmethod
    def handle_file_renames(file_history: Dict[str, List[str]]):
        """Track file renames to maintain accurate ownership"""
        print("\nChallenge: File Rename Tracking")
        
        # Build rename graph
        rename_graph = nx.DiGraph()
        
        for old_path, new_paths in file_history.items():
            for new_path in new_paths:
                rename_graph.add_edge(old_path, new_path)
        
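        # Find weakly connected components (paths linked by a chain of renames).
        # weakly_connected_components is only defined for directed graphs, so
        # pass the DiGraph itself rather than an undirected copy.
        file_groups = list(nx.weakly_connected_components(rename_graph))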
        
        print(f"Found {len(file_groups)} unique files after resolving renames")
        
        # Create mapping
        file_mapping = {}
        for group in file_groups:
            canonical_name = sorted(group)[0]  # Use first name alphabetically
            for file_name in group:
                file_mapping[file_name] = canonical_name
        
        return file_mapping
    
    @staticmethod
    def handle_large_scale_data():
        """Strategies for handling millions of commits/PRs"""
        print("\nChallenge: Large-Scale Data Processing")
        print("Solutions:")
        print("1. Use incremental processing with checkpoints")
        print("2. Implement parallel processing for independent calculations")
        print("3. Use database indexing for fast lookups")
        print("4. Implement time-based partitioning")
        print("5. Cache frequently accessed metrics")
        
        # Example: Time-based partitioning
        class TimePartitionedCalculator:
            def __init__(self, partition_days=30):
                self.partition_days = partition_days
                self.partitions = {}
            
            def add_to_partition(self, item, timestamp):
                # A date cannot be floor-divided by a timedelta, so bucket on
                # the proleptic Gregorian day number instead.
                partition_key = timestamp.toordinal() // self.partition_days
                if partition_key not in self.partitions:
                    self.partitions[partition_key] = []
                self.partitions[partition_key].append(item)
        
        return TimePartitionedCalculator
    
    @staticmethod
    def visualize_data_quality_issues():
        """Visualize common data quality problems"""
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        
        # Issue 1: Missing author information
        ax1 = axes[0, 0]
        missing_rates = np.random.beta(2, 20, 12)  # Monthly missing rates
        months = pd.date_range('2023-01', periods=12, freq='MS')
        ax1.plot(months, missing_rates * 100, 'r-', marker='o')
        ax1.set_title('Missing Author Information Over Time')
        ax1.set_ylabel('Missing Rate (%)')
        ax1.tick_params(axis='x', rotation=45)
        ax1.grid(True, alpha=0.3)
        
        # Issue 2: Bot activity spikes
        ax2 = axes[0, 1]
        human_commits = np.random.poisson(100, 52)  # Weekly
        bot_commits = np.random.poisson(20, 52)
        bot_commits[10:15] = np.random.poisson(100, 5)  # Spike
        
        weeks = range(52)
        ax2.bar(weeks, human_commits, label='Human', alpha=0.7)
        ax2.bar(weeks, bot_commits, bottom=human_commits, label='Bot', alpha=0.7)
        ax2.set_title('Weekly Commit Activity (Human vs Bot)')
        ax2.set_xlabel('Week')
        ax2.set_ylabel('Number of Commits')
        ax2.legend()
        ax2.grid(True, alpha=0.3)
        
        # Issue 3: File rename frequency
        ax3 = axes[1, 0]
        subsystems = ['src', 'tests', 'docs', 'config', 'scripts']
        rename_counts = [45, 23, 12, 5, 8]
        ax3.bar(subsystems, rename_counts, color='orange')
        ax3.set_title('File Renames by Subsystem')
        ax3.set_ylabel('Number of Renames')
        ax3.grid(True, alpha=0.3)
        
        # Issue 4: Reviewer coverage gaps
        ax4 = axes[1, 1]
        # Create heatmap data
        developers = [f'Dev{i}' for i in range(10)]
        subsystems = ['src/core', 'src/api', 'tests', 'docs']
        coverage = np.random.random((len(developers), len(subsystems)))
        coverage[5:8, 2:] = 0  # Coverage gaps
        
        im = ax4.imshow(coverage, cmap='YlOrRd', aspect='auto')
        ax4.set_xticks(range(len(subsystems)))
        ax4.set_xticklabels(subsystems, rotation=45)
        ax4.set_yticks(range(len(developers)))
        ax4.set_yticklabels(developers)
        ax4.set_title('Review Coverage Heatmap')
        
        # Add colorbar
        plt.colorbar(im, ax=ax4)
        
        plt.tight_layout()
        plt.show()

# Demonstrate challenges
challenges = OwnershipChallenges()

# Handle bot accounts
filtered_commits = challenges.handle_bot_accounts(commits, pull_requests)

# Handle file renames
mock_renames = {
    'src/old_module.py': ['src/core/module.py'],
    'src/core/module.py': ['src/core/engine.py'],
    'tests/test_old.py': ['tests/unit/test_engine.py']
}
file_mapping = challenges.handle_file_renames(mock_renames)

# Show large-scale strategies
challenges.handle_large_scale_data()

# Visualize data quality issues
challenges.visualize_data_quality_issues()

## 8. Integration with ELF and Practical Applications
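
The next cell computes each ELF weight as `exp(1 + ownership)`. As a quick sanity check on what that formula implies, the sketch below evaluates it at a few ownership values. The helper name `elf_weight` is hypothetical and introduced only for illustration. Because ownership is a proportion in [0, 1], every weight lies between e and e², so the ratio between any two weights can never exceed e (about 2.72).

```python
import math

def elf_weight(ownership: float) -> float:
    """exp(1 + ownership), mirroring the weight expression in the next cell.

    Hypothetical helper name; not part of the notebook's API.
    """
    return math.exp(1.0 + ownership)

for own in (0.0, 0.5, 1.0):
    print(f"ownership={own:.1f} -> weight={elf_weight(own):.3f}")

# Upper bound on the high/low weight ratio when ownership is in [0, 1]
max_ratio = elf_weight(1.0) / elf_weight(0.0)
print(f"max ratio = {max_ratio:.3f}")
```

This bound is worth keeping in mind when reading the weight ratios printed later: under this formula, a ratio near 1 means the two groups have similar average ownership.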

In [None]:
def demonstrate_elf_integration(df_metrics: pd.DataFrame):
    """Show how ownership metrics integrate with ELF"""
    
    # Calculate ELF weights for each metric combination
    strategies = ['aco', 'rso', 'avg', 'max']
    granularities = ['repo', 'sys', 'pkg']
    
    # Add ELF weights to dataframe
    for strategy in strategies:
        for gran in granularities:
            if strategy == 'aco':
                df_metrics[f'weight_{strategy}_{gran}'] = np.exp(1 + df_metrics[f'aco_{gran}'])
            elif strategy == 'rso':
                df_metrics[f'weight_{strategy}_{gran}'] = np.exp(1 + df_metrics[f'rso_{gran}'])
            elif strategy == 'avg':
                df_metrics[f'weight_{strategy}_{gran}'] = np.exp(1 + 
                    (df_metrics[f'aco_{gran}'] + df_metrics[f'rso_{gran}']) / 2)
            else:  # max
                df_metrics[f'weight_{strategy}_{gran}'] = np.exp(1 + 
                    np.maximum(df_metrics[f'aco_{gran}'], df_metrics[f'rso_{gran}']))
    
    # Visualize weight distributions
    fig, axes = plt.subplots(3, 4, figsize=(20, 15))
    
    for i, gran in enumerate(granularities):
        for j, strategy in enumerate(strategies):
            ax = axes[i, j]
            
            # Plot weight distribution
            weights = df_metrics[f'weight_{strategy}_{gran}']
            ax.hist(weights, bins=30, alpha=0.7, edgecolor='black')
            
            # Add statistics (mean as a marker line, std in the legend)
            mean_w = weights.mean()
            std_w = weights.std()
            ax.axvline(mean_w, color='red', linestyle='--',
                       label=f'μ={mean_w:.2f}, σ={std_w:.2f}')
            
            ax.set_title(f'{strategy.upper()} - {gran.capitalize()} Level')
            ax.set_xlabel('ELF Weight')
            ax.set_ylabel('Count')
            ax.legend()
            ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Show impact on training
    print("\nELF Weight Impact Analysis:")
    print("\nWeight Ratios (High Experience / Low Experience):")
    
    for strategy in strategies:
        for gran in granularities:
            weight_col = f'weight_{strategy}_{gran}'
            high_exp = df_metrics[df_metrics['developer_type'] == 'maintainer'][weight_col].mean()
            low_exp = df_metrics[df_metrics['developer_type'] == 'occasional'][weight_col].mean()
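            # Weights are exp(...) and therefore always positive; the fallback
            # only triggers when a group is empty (mean is NaN), so report NaN.
            ratio = high_exp / low_exp if low_exp > 0 else np.nan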
            
            print(f"{strategy}_{gran}: {ratio:.2f}x")
    
    return df_metrics

# Integrate with ELF
df_with_weights = demonstrate_elf_integration(df_metrics)

# Best configuration analysis
print("\nBest ELF Configurations (highest weight differentiation):")
weight_cols = [col for col in df_with_weights.columns if col.startswith('weight_')]
weight_stds = df_with_weights[weight_cols].std().sort_values(ascending=False)
print(weight_stds.head())

## 9. Summary and Key Takeaways

### Core Concepts Mastered
1. **ACO Formula**: α(D,G) / C(G) - Proportion of commits by developer
2. **RSO Formula**: r(D,G) / ρ(G) - Proportion of PRs reviewed
3. **Granularity Levels**: Repository → Subsystem → Package (increasing specialization)
4. **Temporal Aspects**: Metrics are calculated as of the review timestamp, using only commits and PRs that precede it
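
A minimal worked example of the two formulas on toy data. The `ToyCommit` and `ToyPR` records and their values are hypothetical and are not the notebook's data model. Only events strictly before the evaluation timestamp are counted, matching `_calculate_aco_fast` and `_calculate_rso_fast` above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class ToyCommit:            # hypothetical, illustration only
    author_id: str
    timestamp: datetime

@dataclass
class ToyPR:                # hypothetical, illustration only
    reviewers: List[str]
    timestamp: datetime

def aco(dev: str, commits: List[ToyCommit], at: datetime) -> float:
    """ACO(D, G) = commits by D in G / all commits in G, before `at`."""
    prior = [c for c in commits if c.timestamp < at]
    if not prior:
        return 0.0
    return sum(c.author_id == dev for c in prior) / len(prior)

def rso(dev: str, prs: List[ToyPR], at: datetime) -> float:
    """RSO(D, G) = PRs in G reviewed by D / all PRs in G, before `at`."""
    prior = [p for p in prs if p.timestamp < at]
    if not prior:
        return 0.0
    return sum(dev in p.reviewers for p in prior) / len(prior)

commits = [
    ToyCommit("alice", datetime(2024, 1, 1)),
    ToyCommit("bob", datetime(2024, 1, 5)),
    ToyCommit("alice", datetime(2024, 1, 10)),
    ToyCommit("alice", datetime(2024, 2, 1)),   # after the cutoff, ignored
]
prs = [
    ToyPR(["alice", "bob"], datetime(2024, 1, 3)),
    ToyPR(["bob"], datetime(2024, 1, 8)),
    ToyPR(["alice"], datetime(2024, 1, 12)),
    ToyPR(["bob"], datetime(2024, 1, 14)),
]

at = datetime(2024, 1, 15)
aco_alice = aco("alice", commits, at)   # 2 of the 3 prior commits
rso_alice = rso("alice", prs, at)       # 2 of the 4 prior PRs
print(f"ACO={aco_alice:.3f}, RSO={rso_alice:.3f}")
```

Restricting `commits` and `prs` to one subsystem or package before calling these functions gives the finer-grained variants.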

### Implementation Insights
1. **Preprocessing is Critical**: Index data by granularity for efficiency
2. **Cache Aggressively**: Ownership values change slowly
3. **Handle Edge Cases**: Bot accounts, file renames, missing data
4. **Batch Processing**: Essential for large-scale datasets
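
One way to act on the indexing point is sketched below, under stated assumptions. The `_calculate_*_fast` helpers above rescan the full list for every query. Keeping sorted timestamps per group and per (group, developer) lets each "how many before t?" lookup run as a binary search instead. The class `TimeIndexedCounts` and its method names are hypothetical, not something the notebook defines.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import datetime
from typing import Iterable

class TimeIndexedCounts:
    """Hypothetical helper: time-bounded ownership via binary search."""

    def __init__(self):
        self._all = defaultdict(list)      # group -> event timestamps
        self._by_dev = defaultdict(list)   # (group, developer) -> timestamps
        self._sorted = True

    def add(self, group: str, developers: Iterable[str], ts: datetime) -> None:
        """Register one event (a commit or a PR) in `group`.

        `developers` is the single author for a commit, or every reviewer for
        a PR; the event counts once in the denominator either way.
        """
        self._all[group].append(ts)
        for dev in set(developers):
            self._by_dev[(group, dev)].append(ts)
        self._sorted = False

    def _ensure_sorted(self) -> None:
        if not self._sorted:
            for timestamps in self._all.values():
                timestamps.sort()
            for timestamps in self._by_dev.values():
                timestamps.sort()
            self._sorted = True

    def ownership(self, group: str, dev: str, at: datetime) -> float:
        """Share of events in `group` strictly before `at` involving `dev`."""
        self._ensure_sorted()
        # bisect_left returns the number of timestamps strictly less than `at`
        total = bisect_left(self._all.get(group, []), at)
        if total == 0:
            return 0.0
        mine = bisect_left(self._by_dev.get((group, dev), []), at)
        return mine / total

idx = TimeIndexedCounts()
idx.add("src", ["alice"], datetime(2024, 1, 1))
idx.add("src", ["bob"], datetime(2024, 1, 5))
idx.add("src", ["alice"], datetime(2024, 1, 10))
idx.add("docs", ["bob"], datetime(2024, 1, 2))

share = idx.ownership("src", "alice", datetime(2024, 1, 7))  # 1 of 2 prior events
print(f"alice owns {share:.2f} of src as of 2024-01-07")
```

Build one index from commits (for ACO) and a separate one from PRs (for RSO), keyed by whichever granularity is needed. The sort happens lazily on the first query after new data is added.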

### Key Findings
1. **Ownership Increases at Finer Granularities**: ~1.5-2x at package level
2. **ACO < RSO**: Developers review more broadly than they code
3. **High Correlation**: ACO and RSO are strongly but not perfectly correlated, so each captures a different aspect of experience
4. **Temporal Stability**: Ownership evolves gradually over months

In [None]:
# Final implementation template
class ProductionOwnershipCalculator:
    """Production-ready ownership calculator template"""
    
    def __init__(self, config):
        self.cache_size = config.get('cache_size', 10000)
        self.batch_size = config.get('batch_size', 1000)
        self.parallel_workers = config.get('parallel_workers', 4)
        self.bot_patterns = config.get('bot_patterns', [])
        
    def process_repository(self, repo_path: str) -> pd.DataFrame:
        """
        Process entire repository to calculate ownership metrics
        
        Steps:
        1. Extract commit history
        2. Extract PR/review history  
        3. Filter bot accounts
        4. Handle file renames
        5. Calculate metrics in batches
        6. Save results
        """
        # Your implementation here
        raise NotImplementedError("process_repository is a template stub")
    
    def update_metrics_incremental(self, new_commits, new_prs):
        """Incrementally update metrics with new data"""
        # Your implementation here
        raise NotImplementedError("update_metrics_incremental is a template stub")

print("Ownership Metrics Implementation Complete!")
print("\nNext Steps:")
print("1. Apply to your repository using PyGithub/PyDriller")
print("2. Experiment with different granularity levels")
print("3. Analyze ownership patterns in your codebase")
print("4. Integrate with ELF for model training")