# Data Analysis and Visualization with ghops

This notebook demonstrates how to analyze and visualize your repository portfolio using ghops data. Learn to create insightful dashboards, perform time series analysis, and build interactive visualizations.

## Table of Contents
1. [Loading and Preparing Data](#loading-data)
2. [Repository Portfolio Analysis](#portfolio-analysis)
3. [Creating Visualizations](#creating-visualizations)
4. [Interactive Dashboards](#interactive-dashboards)
5. [Network Graph Analysis](#network-analysis)
6. [Time Series Analysis](#time-series)
7. [Statistical Analysis](#statistical-analysis)
8. [Custom Metrics and KPIs](#custom-metrics)
9. [Reporting and Export](#reporting)
10. [Exercises](#exercises)

## Setup and Imports

In [None]:
import subprocess
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import networkx as nx
from datetime import datetime, timedelta
from pathlib import Path
import tempfile
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Create workspace
workspace = tempfile.mkdtemp(prefix="ghops_analysis_")
print(f"Workspace: {workspace}")

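# Helper functions
def run_command(cmd):
    """Run a shell command and return (stdout, stderr, returncode).

    Uses shell=True, so only pass command strings you trust.
    """
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout, result.stderr, result.returncode

def parse_jsonl(output):
    """Parse JSON Lines text into a list of objects, skipping blank or malformed lines."""
    results = []
    for line in output.strip().split('\n'):
        if line:
            try:
                results.append(json.loads(line))
            except json.JSONDecodeError:
                continue
    return results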
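
The notebook defines `parse_jsonl` but does not call it in later cells. As a minimal sketch of its behavior, the snippet below feeds it a small hand-written JSON Lines string. The sample records and their field names are invented for illustration and are not real ghops output.

```python
import json

def parse_jsonl(output):
    """Parse JSON Lines text, skipping blank or malformed lines."""
    results = []
    for line in output.strip().split('\n'):
        if line:
            try:
                results.append(json.loads(line))
            except json.JSONDecodeError:
                continue
    return results

# Hypothetical sample: two valid records and one malformed line.
sample_output = (
    '{"name": "web-project-001", "language": "Python"}\n'
    'not valid json\n'
    '{"name": "api-project-002", "language": "Go"}\n'
)

records = parse_jsonl(sample_output)
print(len(records))                  # 2
print([r["name"] for r in records])  # ['web-project-001', 'api-project-002']
```

A list like `records` can be passed directly to `pd.DataFrame(records)` to get a table for analysis.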

## 1. Loading and Preparing Data {#loading-data}

First, generate a synthetic sample portfolio of 50 repositories to stand in for real ghops output, then load it into a pandas DataFrame for analysis.

In [None]:
# Generate sample repository data
np.random.seed(42)

# Repository categories
categories = ['web', 'api', 'ml', 'data', 'tool', 'lib', 'docs']
languages = ['Python', 'JavaScript', 'Go', 'Rust', 'TypeScript', 'Java']
licenses = ['MIT', 'Apache-2.0', 'GPL-3.0', 'BSD-3-Clause', 'Proprietary']

# Generate sample repositories
num_repos = 50
repos_data = []

for i in range(num_repos):
    category = np.random.choice(categories)
    language = np.random.choice(languages)
    
    repo = {
        'name': f'{category}-project-{i:03d}',
        'category': category,
        'language': language,
        'license': np.random.choice(licenses),
        'stars': np.random.randint(0, 1000) if np.random.random() > 0.3 else 0,
        'forks': np.random.randint(0, 100),
        'issues': np.random.randint(0, 50),
        'pull_requests': np.random.randint(0, 20),
        'commits': np.random.randint(10, 1000),
        'contributors': np.random.randint(1, 20),
        'lines_of_code': np.random.randint(100, 50000),
        'test_coverage': np.random.uniform(0, 100) if np.random.random() > 0.2 else 0,
        'last_commit_days': np.random.randint(0, 365),
        'created_date': datetime.now() - timedelta(days=np.random.randint(30, 1500)),
        'size_mb': np.random.uniform(0.1, 500),
        'dependencies': np.random.randint(0, 50),
        'is_private': np.random.random() > 0.7,
        'has_ci': np.random.random() > 0.3,
        'has_docs': np.random.random() > 0.4,
        'complexity_score': np.random.uniform(1, 10)
    }
    
    # Calculate health score
    health_factors = [
        min(repo['test_coverage'] / 100, 1) * 25,
        (1 - min(repo['last_commit_days'] / 365, 1)) * 25,
        min(repo['contributors'] / 10, 1) * 25,
        (1 if repo['has_docs'] else 0) * 15,
        (1 if repo['has_ci'] else 0) * 10
    ]
    repo['health_score'] = sum(health_factors)
    
    repos_data.append(repo)

# Convert to DataFrame
df = pd.DataFrame(repos_data)
df['created_date'] = pd.to_datetime(df['created_date'])

print(f"Loaded {len(df)} repositories")
print("\nDataset Overview:")
print(df.info())
print("\nFirst 5 repositories:")
df.head()
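
The health score in the cell above is a weighted sum of five factors (25 points each for coverage, recency, and contributors, 15 for docs, 10 for CI). The sketch below restates it as a standalone function and works one example by hand. The function name `health_score` and the input values are chosen for illustration only.

```python
def health_score(test_coverage, last_commit_days, contributors, has_docs, has_ci):
    """Weighted 0-100 score using the same five factors as the generation loop."""
    coverage_pts = min(test_coverage / 100, 1) * 25            # up to 25 points
    recency_pts = (1 - min(last_commit_days / 365, 1)) * 25    # up to 25 points
    team_pts = min(contributors / 10, 1) * 25                  # saturates at 10 contributors
    docs_pts = 15 if has_docs else 0
    ci_pts = 10 if has_ci else 0
    return coverage_pts + recency_pts + team_pts + docs_pts + ci_pts

# Illustrative inputs: 80% coverage, last commit 73 days ago, 5 contributors,
# documentation present, no CI.
score = health_score(test_coverage=80, last_commit_days=73,
                     contributors=5, has_docs=True, has_ci=False)
print(round(score, 1))  # 67.5  (20 + 20 + 12.5 + 15 + 0)
```

Because the contributor factor saturates at 10 and the recency factor at 365 days, the score always stays within 0 to 100.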

## 2. Repository Portfolio Analysis {#portfolio-analysis}

Analyze your repository portfolio composition and characteristics.

In [None]:
# Portfolio summary statistics
print("Portfolio Summary")
print("=" * 60)
print(f"Total Repositories: {len(df)}")
print(f"Private Repositories: {df['is_private'].sum()} ({df['is_private'].mean()*100:.1f}%)")
print(f"Total Lines of Code: {df['lines_of_code'].sum():,}")
print(f"Total Contributors: {df['contributors'].sum()}")
print(f"Average Health Score: {df['health_score'].mean():.1f}/100")
print(f"\nLanguage Distribution:")
print(df['language'].value_counts().head())
print(f"\nCategory Distribution:")
print(df['category'].value_counts())

In [None]:
# Portfolio composition visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Language distribution
ax1 = axes[0, 0]
df['language'].value_counts().plot(kind='pie', ax=ax1, autopct='%1.1f%%')
ax1.set_title('Language Distribution')
ax1.set_ylabel('')

# Category distribution
ax2 = axes[0, 1]
df['category'].value_counts().plot(kind='bar', ax=ax2, color='skyblue')
ax2.set_title('Repository Categories')
ax2.set_xlabel('Category')
ax2.set_ylabel('Count')

# License distribution
ax3 = axes[0, 2]
df['license'].value_counts().plot(kind='barh', ax=ax3, color='lightgreen')
ax3.set_title('License Types')
ax3.set_xlabel('Count')

# Health score distribution
ax4 = axes[1, 0]
ax4.hist(df['health_score'], bins=20, color='coral', edgecolor='black')
ax4.set_title('Health Score Distribution')
ax4.set_xlabel('Health Score')
ax4.set_ylabel('Count')
ax4.axvline(df['health_score'].mean(), color='red', linestyle='--', label=f'Mean: {df["health_score"].mean():.1f}')
ax4.legend()

# Activity status (time since last commit)
ax5 = axes[1, 1]
activity_bins = [0, 7, 30, 90, 180, 365]
activity_labels = ['< 1 week', '1-4 weeks', '1-3 months', '3-6 months', '> 6 months']
# include_lowest=True keeps repositories with 0 days since the last commit in the first bin
df['activity_category'] = pd.cut(df['last_commit_days'], bins=activity_bins,
                                 labels=activity_labels, include_lowest=True)
# sort_index() keeps the bins in chronological order so each color matches its label
activity_counts = df['activity_category'].value_counts().sort_index()
colors = ['darkgreen', 'green', 'yellow', 'orange', 'red']
ax5.bar(range(len(activity_counts)), activity_counts.values, color=colors)
ax5.set_xticks(range(len(activity_counts)))
ax5.set_xticklabels(activity_counts.index, rotation=45)
ax5.set_title('Repository Activity Status')
ax5.set_ylabel('Count')

# Size distribution
ax6 = axes[1, 2]
ax6.scatter(df['lines_of_code'], df['size_mb'], alpha=0.6, c=df['health_score'], cmap='viridis')
ax6.set_title('Repository Size Analysis')
ax6.set_xlabel('Lines of Code')
ax6.set_ylabel('Size (MB)')
ax6.set_xscale('log')
ax6.set_yscale('log')
cbar = plt.colorbar(ax6.collections[0], ax=ax6)
cbar.set_label('Health Score')

plt.tight_layout()
plt.show()

## 3. Creating Visualizations {#creating-visualizations}

Create various types of visualizations to understand your data.

In [None]:
# Advanced visualizations with Plotly

# 1. Sunburst chart for hierarchical data
hierarchy_df = df.groupby(['language', 'category']).size().reset_index(name='count')
hierarchy_df['path'] = hierarchy_df['language'] + '/' + hierarchy_df['category']

fig_sunburst = px.sunburst(
    hierarchy_df,
    path=['language', 'category'],
    values='count',
    title='Repository Hierarchy: Language → Category'
)
fig_sunburst.show()

# 2. Scatter matrix for correlation analysis
metrics = ['stars', 'forks', 'contributors', 'health_score', 'test_coverage']
fig_scatter = px.scatter_matrix(
    df[metrics],
    dimensions=metrics,
    color=df['health_score'],
    title='Repository Metrics Correlation Matrix',
    height=800
)
fig_scatter.show()

In [None]:
# 3. Box plots for comparative analysis
fig_box = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Health Score by Language', 'Test Coverage by Category',
                   'Contributors by Language', 'Complexity by Category')
)

# Health Score by Language
for lang in df['language'].unique():
    lang_data = df[df['language'] == lang]['health_score']
    fig_box.add_trace(
        go.Box(y=lang_data, name=lang, showlegend=False),
        row=1, col=1
    )

# Test Coverage by Category
for cat in df['category'].unique():
    cat_data = df[df['category'] == cat]['test_coverage']
    fig_box.add_trace(
        go.Box(y=cat_data, name=cat, showlegend=False),
        row=1, col=2
    )

# Contributors by Language
for lang in df['language'].unique():
    lang_data = df[df['language'] == lang]['contributors']
    fig_box.add_trace(
        go.Box(y=lang_data, name=lang, showlegend=False),
        row=2, col=1
    )

# Complexity by Category  
for cat in df['category'].unique():
    cat_data = df[df['category'] == cat]['complexity_score']
    fig_box.add_trace(
        go.Box(y=cat_data, name=cat, showlegend=False),
        row=2, col=2
    )

fig_box.update_layout(height=800, title_text="Comparative Analysis")
fig_box.show()

## 4. Interactive Dashboards {#interactive-dashboards}

Build an interactive dashboard that filters the portfolio by language, category, and minimum health score.

In [None]:
# Create an interactive dashboard
from ipywidgets import interact, widgets, VBox, HBox
import IPython.display as display

def create_dashboard(language='All', category='All', min_health=0):
    """Create filtered dashboard based on selections"""
    
    # Filter data
    filtered_df = df.copy()
    if language != 'All':
        filtered_df = filtered_df[filtered_df['language'] == language]
    if category != 'All':
        filtered_df = filtered_df[filtered_df['category'] == category]
    filtered_df = filtered_df[filtered_df['health_score'] >= min_health]
    
    # Create dashboard
    fig = make_subplots(
        rows=2, cols=3,
        specs=[[{'type': 'indicator'}, {'type': 'indicator'}, {'type': 'indicator'}],
               [{'type': 'bar'}, {'type': 'scatter', 'colspan': 2}, None]],
        subplot_titles=('Total Repos', 'Avg Health', 'Total Contributors',
                       'Top Repos by Stars', 'Health vs Activity')
    )
    
    # KPI Indicators
    fig.add_trace(
        go.Indicator(
            mode="number+delta",
            value=len(filtered_df),
            delta={'reference': len(df)},
            title={'text': 'Repositories'},
        ),
        row=1, col=1
    )
    
    fig.add_trace(
        go.Indicator(
            mode="gauge+number",
            value=filtered_df['health_score'].mean(),
            title={'text': 'Avg Health'},
            gauge={'axis': {'range': [0, 100]},
                  'bar': {'color': 'darkgreen'},
                  'steps': [
                      {'range': [0, 50], 'color': 'lightgray'},
                      {'range': [50, 80], 'color': 'yellow'}],
                  'threshold': {'value': 80, 'thickness': 0.75}}
        ),
        row=1, col=2
    )
    
    fig.add_trace(
        go.Indicator(
            mode="number",
            value=filtered_df['contributors'].sum(),
            title={'text': 'Total Contributors'},
        ),
        row=1, col=3
    )
    
    # Top repositories by stars
    top_repos = filtered_df.nlargest(10, 'stars')[['name', 'stars']]
    fig.add_trace(
        go.Bar(x=top_repos['stars'], y=top_repos['name'], orientation='h'),
        row=2, col=1
    )
    
    # Health vs Activity scatter
    fig.add_trace(
        go.Scatter(
            x=filtered_df['last_commit_days'],
            y=filtered_df['health_score'],
            mode='markers',
            marker=dict(
                size=filtered_df['contributors'] * 2,
                color=filtered_df['test_coverage'],
                colorscale='Viridis',
                showscale=True,
                colorbar=dict(x=1.1)
            ),
            text=filtered_df['name'],
            hovertemplate='%{text}<br>Health: %{y:.1f}<br>Days since commit: %{x}'
        ),
        row=2, col=2
    )
    
    fig.update_xaxes(title_text="Days Since Last Commit", row=2, col=2)
    fig.update_yaxes(title_text="Health Score", row=2, col=2)
    
    fig.update_layout(
        height=600,
        showlegend=False,
        title_text=f"Repository Dashboard (Filtered: {len(filtered_df)} repos)"
    )
    
    fig.show()
    
    # Summary statistics
    print("\nFiltered Statistics:")
    print("=" * 40)
    print(f"Repositories: {len(filtered_df)}")
    print(f"Avg Health Score: {filtered_df['health_score'].mean():.1f}")
    print(f"Avg Test Coverage: {filtered_df['test_coverage'].mean():.1f}%")
    print(f"Total Lines of Code: {filtered_df['lines_of_code'].sum():,}")

# Create interactive widgets
language_widget = widgets.Dropdown(
    options=['All'] + list(df['language'].unique()),
    value='All',
    description='Language:'
)

category_widget = widgets.Dropdown(
    options=['All'] + list(df['category'].unique()),
    value='All',
    description='Category:'
)

health_widget = widgets.IntSlider(
    value=0,
    min=0,
    max=100,
    step=10,
    description='Min Health:'
)

# Create interactive dashboard
interact(create_dashboard, 
         language=language_widget,
         category=category_widget,
         min_health=health_widget)
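
`create_dashboard` mixes the filtering step with the plotting. One possible refactor, sketched below under the assumption that pandas is available, pulls the filter into a small pure function so it can be checked without rendering a figure. The name `filter_repos` and the three-row DataFrame are hypothetical.

```python
import pandas as pd

def filter_repos(df, language='All', category='All', min_health=0):
    """Return the rows matching the dropdown and slider selections."""
    mask = df['health_score'] >= min_health
    if language != 'All':
        mask = mask & (df['language'] == language)
    if category != 'All':
        mask = mask & (df['category'] == category)
    return df[mask]

# Hypothetical three-row portfolio for illustration.
toy = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'language': ['Python', 'Go', 'Python'],
    'category': ['web', 'api', 'ml'],
    'health_score': [72.0, 45.0, 30.0],
})

print(filter_repos(toy, language='Python', min_health=50)['name'].tolist())  # ['a']
print(len(filter_repos(toy)))                                                # 3
```

A filter can return zero rows, in which case `mean()` on the empty result is NaN, so a dashboard built on it may want to handle that case before plotting.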

## 5. Network Graph Analysis {#network-analysis}

Visualize relationships between repositories.

In [None]:
# Create repository network based on shared characteristics
import networkx as nx

# Create network graph
G = nx.Graph()

# Add nodes for repositories
for idx, repo in df.iterrows():
    G.add_node(repo['name'], 
              language=repo['language'],
              category=repo['category'],
              health=repo['health_score'])

# Add edges based on similarity
threshold = 0.7  # Similarity threshold

for i, repo1 in df.iterrows():
    for j, repo2 in df.iterrows():
        if i < j:  # Avoid duplicates
            # Calculate similarity based on multiple factors
            similarity = 0
            if repo1['language'] == repo2['language']:
                similarity += 0.4
            if repo1['category'] == repo2['category']:
                similarity += 0.3
            if abs(repo1['health_score'] - repo2['health_score']) < 20:
                similarity += 0.3
            
            if similarity >= threshold:
                G.add_edge(repo1['name'], repo2['name'], weight=similarity)

# Calculate network metrics
degree_centrality = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
communities = nx.community.greedy_modularity_communities(G)

print(f"Network Statistics:")
print(f"  Nodes: {G.number_of_nodes()}")
print(f"  Edges: {G.number_of_edges()}")
print(f"  Communities: {len(communities)}")
print(f"  Average Degree: {sum(dict(G.degree()).values()) / G.number_of_nodes():.2f}")
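
The edge rule above adds 0.4 for a shared language, 0.3 for a shared category, and 0.3 when health scores are within 20 points, then keeps pairs that reach 0.7. The sketch below applies the same rule to four made-up repositories in plain Python. It scores in integer tenths, a choice made here so the comparison at the threshold does not depend on floating-point rounding. The function and variable names are hypothetical.

```python
from itertools import combinations

def similarity_tenths(a, b):
    """Similarity in tenths: language 4, category 3, health within 20 points 3."""
    score = 0
    if a['language'] == b['language']:
        score += 4
    if a['category'] == b['category']:
        score += 3
    if abs(a['health'] - b['health']) < 20:
        score += 3
    return score

# Hypothetical repositories for illustration.
repos = [
    {'name': 'r1', 'language': 'Python', 'category': 'web', 'health': 80},
    {'name': 'r2', 'language': 'Python', 'category': 'api', 'health': 70},
    {'name': 'r3', 'language': 'Go', 'category': 'api', 'health': 65},
    {'name': 'r4', 'language': 'Rust', 'category': 'ml', 'health': 20},
]

THRESHOLD_TENTHS = 7  # corresponds to the 0.7 threshold in the cell above

edges = [
    (a['name'], b['name'])
    for a, b in combinations(repos, 2)
    if similarity_tenths(a, b) >= THRESHOLD_TENTHS
]
print(edges)  # [('r1', 'r2')]
```

Note that `r2` and `r3` share a category and have similar health (6 tenths) but are not linked: under this rule, an edge always requires a shared language.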

In [None]:
# Visualize network with Plotly
pos = nx.spring_layout(G, k=1, iterations=50)

# Build one line trace per edge from the node positions
edge_trace = []
for edge in G.edges():
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_trace.append(
        go.Scatter(x=[x0, x1, None], y=[y0, y1, None],
                  mode='lines',
                  line=dict(width=0.5, color='gray'),
                  hoverinfo='none')
    )

# Node trace
node_x = []
node_y = []
node_text = []
node_color = []

for node in G.nodes():
    x, y = pos[node]
    node_x.append(x)
    node_y.append(y)
    node_text.append(f"{node}<br>Connections: {G.degree(node)}")
    node_color.append(G.degree(node))

node_trace = go.Scatter(
    x=node_x, y=node_y,
    mode='markers',
    hoverinfo='text',
    text=node_text,
    marker=dict(
        showscale=True,
        colorscale='YlOrRd',
        size=10,
        color=node_color,
        colorbar=dict(
            thickness=15,
            # title.side replaces the deprecated colorbar `titleside` attribute
            title=dict(text='Connections', side='right'),
            xanchor='left'
        )
    )
)

# Create figure
fig_network = go.Figure(data=edge_trace + [node_trace],
                       layout=go.Layout(
                           title='Repository Network Graph',
                           showlegend=False,
                           hovermode='closest',
                           margin=dict(b=0,l=0,r=0,t=40),
                           xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                           yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                           height=600
                       ))
fig_network.show()

## 6. Time Series Analysis {#time-series}

Analyze repository trends over time.

In [None]:
# Generate time series data
dates = pd.date_range(start='2023-01-01', end='2024-12-31', freq='D')
time_series_data = []

for date in dates:
    # Simulate daily metrics
    daily_commits = np.random.poisson(5)
    daily_prs = np.random.poisson(2)
    daily_issues = np.random.poisson(3)
    active_repos = np.random.randint(20, 40)
    
    # Add weekly pattern
    if date.weekday() >= 5:  # Weekend
        daily_commits *= 0.3
        daily_prs *= 0.2
    
    time_series_data.append({
        'date': date,
        'commits': int(daily_commits),
        'pull_requests': int(daily_prs),
        'issues': int(daily_issues),
        'active_repos': active_repos
    })

ts_df = pd.DataFrame(time_series_data)
ts_df['date'] = pd.to_datetime(ts_df['date'])
ts_df.set_index('date', inplace=True)

# Calculate rolling averages
ts_df['commits_7d_avg'] = ts_df['commits'].rolling(window=7).mean()
ts_df['commits_30d_avg'] = ts_df['commits'].rolling(window=30).mean()
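
To make the rolling averages concrete, here is a plain-Python sketch of what a trailing mean over a fixed window computes. It is an illustration, not pandas' implementation; the function name `rolling_mean` and the sample counts are invented.

```python
def rolling_mean(values, window):
    """Trailing mean over `window` points; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out

commits = [4, 6, 5, 1, 2, 7, 3]
smoothed = rolling_mean(commits, window=3)
print(smoothed[:4])  # [None, None, 5.0, 4.0]
```

With an integer window, pandas' `rolling(window=7).mean()` likewise leaves the first six entries as NaN by default, which is why the moving-average lines in the chart start slightly later than the raw series. Passing `min_periods=1` to `rolling` would instead average over whatever history is available.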

In [None]:
# Create time series visualizations
fig_ts = make_subplots(
    rows=3, cols=1,
    subplot_titles=('Daily Commits with Moving Averages',
                   'Weekly Activity Pattern',
                   'Monthly Trends'),
    specs=[[{'secondary_y': False}],
           [{'secondary_y': False}],
           [{'secondary_y': True}]]
)

# Daily commits with moving averages
fig_ts.add_trace(
    go.Scatter(x=ts_df.index, y=ts_df['commits'],
              name='Daily Commits', line=dict(color='lightblue', width=1)),
    row=1, col=1
)
fig_ts.add_trace(
    go.Scatter(x=ts_df.index, y=ts_df['commits_7d_avg'],
              name='7-Day Avg', line=dict(color='blue', width=2)),
    row=1, col=1
)
fig_ts.add_trace(
    go.Scatter(x=ts_df.index, y=ts_df['commits_30d_avg'],
              name='30-Day Avg', line=dict(color='darkblue', width=2)),
    row=1, col=1
)

# Weekly activity pattern
weekly_pattern = ts_df.groupby(ts_df.index.dayofweek).mean()
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
fig_ts.add_trace(
    go.Bar(x=days, y=weekly_pattern['commits'],
          name='Avg Commits', marker_color='green'),
    row=2, col=1
)

# Monthly trends
monthly_data = ts_df.resample('M').sum()
fig_ts.add_trace(
    go.Bar(x=monthly_data.index, y=monthly_data['commits'],
          name='Monthly Commits', marker_color='orange'),
    row=3, col=1
)
fig_ts.add_trace(
    go.Scatter(x=monthly_data.index, y=monthly_data['active_repos'].cumsum(),
              name='Cumulative Active', line=dict(color='red', width=2)),
    row=3, col=1, secondary_y=True
)

fig_ts.update_xaxes(title_text="Date", row=3, col=1)
fig_ts.update_yaxes(title_text="Commits", row=1, col=1)
fig_ts.update_yaxes(title_text="Avg Commits", row=2, col=1)
fig_ts.update_yaxes(title_text="Monthly Commits", row=3, col=1)
fig_ts.update_yaxes(title_text="Cumulative Active", row=3, col=1, secondary_y=True)

fig_ts.update_layout(height=900, title_text="Repository Activity Time Series Analysis")
fig_ts.show()

## 7. Statistical Analysis {#statistical-analysis}

Perform statistical analysis to find insights.

In [None]:
# Correlation analysis
correlation_matrix = df[['stars', 'forks', 'issues', 'contributors', 
                         'health_score', 'test_coverage', 'complexity_score']].corr()

# Create heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=1)
plt.title('Repository Metrics Correlation Heatmap')
plt.tight_layout()
plt.show()

# Statistical summary
print("\nStatistical Summary:")
print("=" * 60)
print(df[['health_score', 'test_coverage', 'contributors', 'complexity_score']].describe())

In [None]:
# Hypothesis testing
from scipy import stats

# Test: Do repositories with CI/CD have better health scores?
ci_repos = df[df['has_ci'] == True]['health_score']
no_ci_repos = df[df['has_ci'] == False]['health_score']

t_stat, p_value = stats.ttest_ind(ci_repos, no_ci_repos)

print("Hypothesis Test: CI/CD Impact on Health Score")
print("=" * 50)
print(f"Mean health score with CI/CD: {ci_repos.mean():.2f}")
print(f"Mean health score without CI/CD: {no_ci_repos.mean():.2f}")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Result: Significant difference (p < 0.05)")
else:
    print("Result: No significant difference")

# ANOVA: Health score by language
language_groups = [df[df['language'] == lang]['health_score'].values 
                  for lang in df['language'].unique()]
f_stat, p_value_anova = stats.f_oneway(*language_groups)

print("\nANOVA: Health Score by Programming Language")
print("=" * 50)
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value_anova:.4f}")
if p_value_anova < 0.05:
    print("Result: Significant difference between languages")
else:
    print("Result: No significant difference between languages")
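
By default, `stats.ttest_ind` assumes equal variances (`equal_var=True`) and computes the pooled two-sample statistic. The sketch below reproduces that statistic by hand with the standard library, using two tiny made-up samples, to show what the reported number means. It omits the p-value, which requires the t distribution.

```python
import math
import statistics

def pooled_t_statistic(a, b):
    """Two-sample t statistic under the equal-variance assumption."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # sample variances (n - 1)
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(a) - statistics.mean(b)) / std_err

# Illustrative health scores, not taken from the generated dataset.
with_ci = [70, 80, 90]
without_ci = [50, 60, 70]

t_value = pooled_t_statistic(with_ci, without_ci)
print(round(t_value, 4))  # 2.4495
```

If the two groups have clearly different variances or sizes, passing `equal_var=False` to `stats.ttest_ind` runs Welch's t-test instead. Also note that `has_ci` contributes 10 points to the health score by construction, so a difference between the groups is partly built into the synthetic data.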

## 8. Custom Metrics and KPIs {#custom-metrics}

Define and track custom metrics for your portfolio.

In [None]:
# Define custom KPIs
class PortfolioKPIs:
    """Calculate custom KPIs for repository portfolio"""
    
    @staticmethod
    def technical_debt_score(row):
        """Calculate technical debt based on multiple factors"""
        score = 0
        
        # Factor in test coverage
        if row['test_coverage'] < 30:
            score += 30
        elif row['test_coverage'] < 60:
            score += 15
        
        # Factor in complexity
        score += row['complexity_score'] * 3
        
        # Factor in age without updates
        if row['last_commit_days'] > 180:
            score += 20
        elif row['last_commit_days'] > 90:
            score += 10
        
        # Factor in documentation
        if not row['has_docs']:
            score += 15
        
        return min(score, 100)  # Cap at 100
    
    @staticmethod
    def maintenance_priority(row):
        """Calculate maintenance priority"""
        priority = 0
        
        # High activity repos need more maintenance
        if row['contributors'] > 10:
            priority += 30
        elif row['contributors'] > 5:
            priority += 20
        
        # Critical issues
        if row['issues'] > 20:
            priority += 25
        
        # Low health score
        if row['health_score'] < 50:
            priority += 25
        
        # No CI/CD
        if not row['has_ci']:
            priority += 20
        
        return priority
    
    @staticmethod
    def innovation_index(row):
        """Calculate innovation index based on activity and growth"""
        index = 0
        
        # Recent activity
        if row['last_commit_days'] < 7:
            index += 40
        elif row['last_commit_days'] < 30:
            index += 25
        
        # Contributor growth
        index += min(row['contributors'] * 2, 30)
        
        # Complexity (innovative projects might be more complex)
        if row['complexity_score'] > 7:
            index += 15
        
        # Size growth
        if row['lines_of_code'] > 10000:
            index += 15
        
        return min(index, 100)

# Calculate custom KPIs
df['technical_debt'] = df.apply(PortfolioKPIs.technical_debt_score, axis=1)
df['maintenance_priority'] = df.apply(PortfolioKPIs.maintenance_priority, axis=1)
df['innovation_index'] = df.apply(PortfolioKPIs.innovation_index, axis=1)

# Display top repositories by each KPI
print("Custom KPI Analysis")
print("=" * 60)

print("\nTop 5 Repositories by Technical Debt:")
print(df.nlargest(5, 'technical_debt')[['name', 'technical_debt', 'test_coverage', 'complexity_score']])

print("\nTop 5 Repositories by Maintenance Priority:")
print(df.nlargest(5, 'maintenance_priority')[['name', 'maintenance_priority', 'issues', 'health_score']])

print("\nTop 5 Most Innovative Repositories:")
print(df.nlargest(5, 'innovation_index')[['name', 'innovation_index', 'last_commit_days', 'contributors']])
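
As a worked example of the technical debt rule, the sketch below copies the scoring logic into a standalone function and evaluates it on two hand-made rows. The rows and the `risk_band` helper are hypothetical; `risk_band` uses the cut points 30 and 60, matching the threshold of 60 that the report uses for high-risk repositories.

```python
def technical_debt_score(row):
    """Same rule as PortfolioKPIs.technical_debt_score, applied to a dict."""
    score = 0
    if row['test_coverage'] < 30:
        score += 30
    elif row['test_coverage'] < 60:
        score += 15
    score += row['complexity_score'] * 3
    if row['last_commit_days'] > 180:
        score += 20
    elif row['last_commit_days'] > 90:
        score += 10
    if not row['has_docs']:
        score += 15
    return min(score, 100)

def risk_band(score):
    """Hypothetical helper: right-closed bands (0, 30], (30, 60], (60, 100]."""
    if score <= 30:
        return 'Low Risk'
    if score <= 60:
        return 'Medium Risk'
    return 'High Risk'

# Hypothetical rows for illustration.
stale = {'test_coverage': 25, 'complexity_score': 8,
         'last_commit_days': 200, 'has_docs': False}
healthy = {'test_coverage': 85, 'complexity_score': 2,
           'last_commit_days': 3, 'has_docs': True}

print(technical_debt_score(stale), risk_band(technical_debt_score(stale)))      # 89 High Risk
print(technical_debt_score(healthy), risk_band(technical_debt_score(healthy)))  # 6 Low Risk
```

The stale row scores 30 (low coverage) + 24 (complexity) + 20 (inactive) + 15 (no docs) = 89, while the healthy row only picks up the complexity term.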

In [None]:
# KPI Dashboard
fig_kpi = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Technical Debt Distribution',
                   'Maintenance Priority Matrix',
                   'Innovation vs Health',
                   'Portfolio Risk Assessment'),
    specs=[[{'type': 'histogram'}, {'type': 'scatter'}],
           [{'type': 'scatter'}, {'type': 'pie'}]]
)

# Technical debt distribution
fig_kpi.add_trace(
    go.Histogram(x=df['technical_debt'], nbinsx=20,
                marker_color='red', name='Tech Debt'),
    row=1, col=1
)

# Maintenance priority matrix
fig_kpi.add_trace(
    go.Scatter(x=df['maintenance_priority'], y=df['health_score'],
              mode='markers',
              marker=dict(size=df['issues'], color=df['technical_debt'],
                         colorscale='RdYlGn_r', showscale=True),
              text=df['name'],
              hovertemplate='%{text}<br>Priority: %{x}<br>Health: %{y:.1f}'),
    row=1, col=2
)

# Innovation vs Health
fig_kpi.add_trace(
    go.Scatter(x=df['innovation_index'], y=df['health_score'],
              mode='markers',
              marker=dict(size=10, color=df['category'].astype('category').cat.codes),
              text=df['name']),
    row=2, col=1
)

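# Risk assessment pie
risk_categories = pd.cut(df['technical_debt'], 
                         bins=[0, 30, 60, 100],
                         labels=['Low Risk', 'Medium Risk', 'High Risk'])
# sort_index() restores Low/Medium/High order so the colors below line up with the labels
risk_counts = risk_categories.value_counts().sort_index()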

fig_kpi.add_trace(
    go.Pie(labels=risk_counts.index, values=risk_counts.values,
          marker=dict(colors=['green', 'yellow', 'red'])),
    row=2, col=2
)

# Update axes labels
fig_kpi.update_xaxes(title_text="Technical Debt Score", row=1, col=1)
fig_kpi.update_xaxes(title_text="Maintenance Priority", row=1, col=2)
fig_kpi.update_xaxes(title_text="Innovation Index", row=2, col=1)
fig_kpi.update_yaxes(title_text="Count", row=1, col=1)
fig_kpi.update_yaxes(title_text="Health Score", row=1, col=2)
fig_kpi.update_yaxes(title_text="Health Score", row=2, col=1)

fig_kpi.update_layout(height=800, title_text="Custom KPI Dashboard", showlegend=False)
fig_kpi.show()

## 9. Reporting and Export {#reporting}

Generate reports and export data for stakeholders.

In [None]:
# Generate executive report
from datetime import datetime

def generate_executive_report(df, output_file='executive_report.html'):
    """Generate comprehensive executive report"""
    
    # Calculate key metrics
    total_repos = len(df)
    avg_health = df['health_score'].mean()
    total_contributors = df['contributors'].sum()
    total_loc = df['lines_of_code'].sum()
    
    # High risk repos
    high_risk = df[df['technical_debt'] > 60]
    
    # Active repos
    active_repos = df[df['last_commit_days'] < 30]
    
    # Create HTML report
    html_content = f"""
    <!DOCTYPE html>
    <html>
    <head>
        <title>Repository Portfolio Executive Report</title>
        <style>
            body {{ font-family: Arial, sans-serif; margin: 40px; }}
            h1 {{ color: #333; }}
            h2 {{ color: #666; border-bottom: 2px solid #ddd; padding-bottom: 10px; }}
            .metric {{ display: inline-block; margin: 20px; padding: 20px; 
                      background: #f0f0f0; border-radius: 10px; }}
            .metric-value {{ font-size: 36px; font-weight: bold; color: #2196F3; }}
            .metric-label {{ font-size: 14px; color: #666; margin-top: 5px; }}
            table {{ border-collapse: collapse; width: 100%; margin: 20px 0; }}
            th, td {{ border: 1px solid #ddd; padding: 12px; text-align: left; }}
            th {{ background-color: #2196F3; color: white; }}
            .warning {{ background-color: #fff3cd; padding: 10px; border-radius: 5px; }}
            .success {{ background-color: #d4edda; padding: 10px; border-radius: 5px; }}
        </style>
    </head>
    <body>
        <h1>Repository Portfolio Executive Report</h1>
        <p>Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}</p>
        
        <h2>Key Metrics</h2>
        <div>
            <div class="metric">
                <div class="metric-value">{total_repos}</div>
                <div class="metric-label">Total Repositories</div>
            </div>
            <div class="metric">
                <div class="metric-value">{avg_health:.1f}</div>
                <div class="metric-label">Avg Health Score</div>
            </div>
            <div class="metric">
                <div class="metric-value">{total_contributors}</div>
                <div class="metric-label">Total Contributors</div>
            </div>
            <div class="metric">
                <div class="metric-value">{total_loc:,}</div>
                <div class="metric-label">Lines of Code</div>
            </div>
        </div>
        
        <h2>Portfolio Composition</h2>
        <table>
            <tr>
                <th>Language</th>
                <th>Count</th>
                <th>Percentage</th>
            </tr>
    """
    
    # Add language distribution
    for lang, count in df['language'].value_counts().items():
        percentage = (count / total_repos) * 100
        html_content += f"""
            <tr>
                <td>{lang}</td>
                <td>{count}</td>
                <td>{percentage:.1f}%</td>
            </tr>
        """
    
    html_content += f"""
        </table>
        
        <h2>Risk Assessment</h2>
        <div class="warning">
            <strong>High Risk Repositories:</strong> {len(high_risk)} repositories have technical debt > 60
        </div>
        
        <h2>Activity Status</h2>
        <div class="success">
            <strong>Active Repositories:</strong> {len(active_repos)} repositories updated in last 30 days
        </div>
        
        <h2>Top Performers</h2>
        <table>
            <tr>
                <th>Repository</th>
                <th>Health Score</th>
                <th>Stars</th>
                <th>Contributors</th>
            </tr>
    """
    
    # Add top performers
    top_repos = df.nlargest(5, 'health_score')[['name', 'health_score', 'stars', 'contributors']]
    for _, repo in top_repos.iterrows():
        html_content += f"""
            <tr>
                <td>{repo['name']}</td>
                <td>{repo['health_score']:.1f}</td>
                <td>{repo['stars']}</td>
                <td>{repo['contributors']}</td>
            </tr>
        """
    
    html_content += """
        </table>
    </body>
    </html>
    """
    
    # Save report
    report_path = Path(workspace) / output_file
    with open(report_path, 'w', encoding='utf-8') as f:
        f.write(html_content)
    
    print(f"Executive report generated: {report_path}")
    return report_path

# Generate report
report_file = generate_executive_report(df)

# Export data to different formats
print("\nExporting data...")

# CSV export
csv_file = Path(workspace) / 'repository_data.csv'
df.to_csv(csv_file, index=False)
print(f"  CSV: {csv_file}")

# JSON export
json_file = Path(workspace) / 'repository_data.json'
df.to_json(json_file, orient='records', indent=2, default_handler=str)
print(f"  JSON: {json_file}")

# Excel export with multiple sheets
excel_file = Path(workspace) / 'repository_analysis.xlsx'
with pd.ExcelWriter(excel_file) as writer:
    df.to_excel(writer, sheet_name='All Repositories', index=False)
    df.nlargest(10, 'health_score').to_excel(writer, sheet_name='Top Performers', index=False)
    df.nlargest(10, 'technical_debt').to_excel(writer, sheet_name='High Risk', index=False)
    df.groupby('language').agg({
        'name': 'count',
        'health_score': 'mean',
        'test_coverage': 'mean'
    }).to_excel(writer, sheet_name='Language Summary')
print(f"  Excel: {excel_file}")
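
`generate_executive_report` interpolates repository names straight into the HTML. That is fine for the synthetic names used here, but real names or descriptions may contain characters such as `<` or `&`. A minimal sketch of escaping such values with the standard library's `html.escape` follows; `render_row` is a hypothetical helper, not part of the report function above.

```python
from html import escape

def render_row(name, health_score, stars, contributors):
    """Build one table row, escaping the free-text name for safe HTML output."""
    return (
        "<tr>"
        f"<td>{escape(str(name))}</td>"
        f"<td>{health_score:.1f}</td>"
        f"<td>{stars}</td>"
        f"<td>{contributors}</td>"
        "</tr>"
    )

row_html = render_row("a<b>&co", 67.5, 12, 3)
print(row_html)
# <tr><td>a&lt;b&gt;&amp;co</td><td>67.5</td><td>12</td><td>3</td></tr>
```

Only the free-text field is escaped here; the numeric fields are formatted by Python and cannot carry markup.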

## 10. Exercises {#exercises}

Practice data analysis and visualization skills.

### Exercise 1: Custom Visualization
Create a custom visualization showing the relationship between three variables.

In [None]:
# TODO: Create a 3D scatter plot showing:
# - X axis: Test coverage
# - Y axis: Number of contributors
# - Z axis: Health score
# - Color: Language
# - Size: Lines of code

# Your code here:

### Exercise 2: Predictive Analysis
Build a model to predict repository health score.

In [None]:
# TODO: Use machine learning to predict health score
# 1. Select features
# 2. Split data into training and testing
# 3. Train a model (e.g., Random Forest)
# 4. Evaluate performance
# 5. Identify most important features

# Your code here:

### Exercise 3: Custom Dashboard
Create an interactive dashboard for a specific use case.

In [None]:
# TODO: Create a dashboard for:
# - Security monitoring
# - Performance tracking
# - Team productivity
# Choose one and implement

# Your code here:

## Cleanup

In [None]:
# Clean up workspace
import os  # needed for os.path.exists below
import shutil
if 'workspace' in locals() and os.path.exists(workspace):
    # List generated files
    print("Generated files:")
    for file in Path(workspace).glob('*'):
        print(f"  - {file.name}")
    
    # Clean up
    shutil.rmtree(workspace)
    print(f"\nCleaned up workspace: {workspace}")

## Summary

In this notebook, you learned:
- Loading and preparing repository data for analysis
- Comprehensive portfolio analysis techniques
- Creating various types of visualizations
- Building interactive dashboards
- Network graph analysis for repository relationships
- Time series analysis for trends
- Statistical analysis and hypothesis testing
- Defining custom KPIs and metrics
- Generating reports and exporting data

## Key Takeaways

1. **Data-Driven Decisions**: Use analytics to make informed decisions about your portfolio
2. **Visual Insights**: Visualizations reveal patterns not visible in raw data
3. **Interactive Exploration**: Dashboards enable real-time monitoring and exploration
4. **Custom Metrics**: Define KPIs that matter for your specific needs
5. **Statistical Rigor**: Use statistical methods to validate hypotheses
6. **Automated Reporting**: Generate consistent reports for stakeholders
7. **Predictive Analytics**: Use historical data to predict future trends

## Next Steps

- Apply these techniques to your actual repository data
- Build custom dashboards for your team
- Integrate analytics into your CI/CD pipeline
- Share insights with stakeholders
- Contribute visualizations to the ghops community

## Resources

- [Plotly Documentation](https://plotly.com/python/)
- [Pandas Documentation](https://pandas.pydata.org/)
- [Seaborn Gallery](https://seaborn.pydata.org/examples/)
- [NetworkX Documentation](https://networkx.org/)