# Repository Clustering and Analysis

This notebook shows how to use ghops clustering to analyze and organize a repository portfolio. You'll learn how to identify similar repositories, find duplicate code, and get consolidation recommendations. Where a ghops command is unavailable or returns nothing, the cells fall back to mock data so the analysis and plots still run.

## Table of Contents
1. [Loading Repository Metadata](#loading-metadata)
2. [Understanding Clustering Algorithms](#clustering-algorithms)
3. [Running Cluster Analysis](#running-analysis)
4. [Visualizing Clusters](#visualizing)
5. [Finding Duplicate Code](#duplicates)
6. [Getting Consolidation Recommendations](#consolidation)
7. [Interactive Exploration](#interactive)
8. [Advanced Analysis](#advanced)
9. [Exercises](#exercises)

## Setup and Imports

In [None]:
import subprocess
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import tempfile
import os

# Set up plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Helper functions
def run_command(cmd):
    """Run shell command and return output"""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout, result.stderr, result.returncode

def parse_jsonl(output):
    """Parse JSONL output into list of dicts"""
    results = []
    for line in output.strip().split('\n'):
        if line:
            try:
                results.append(json.loads(line))
            except json.JSONDecodeError:
                continue
    return results

print("Setup complete! Libraries loaded.")

## 1. Loading Repository Metadata {#loading-metadata}

First, let's load metadata about repositories. We'll create a sample dataset for demonstration.

In [None]:
# Create sample repositories with different characteristics
temp_dir = tempfile.mkdtemp(prefix="ghops_clustering_")
print(f"Working directory: {temp_dir}")

# Repository templates
repo_templates = [
    # Python web apps
    {"name": "flask-api", "lang": "python", "files": ["app.py", "requirements.txt", "config.py"]},
    {"name": "django-blog", "lang": "python", "files": ["manage.py", "settings.py", "urls.py"]},
    {"name": "fastapi-service", "lang": "python", "files": ["main.py", "requirements.txt", "models.py"]},
    
    # JavaScript projects
    {"name": "react-app", "lang": "javascript", "files": ["package.json", "App.js", "index.js"]},
    {"name": "vue-dashboard", "lang": "javascript", "files": ["package.json", "App.vue", "main.js"]},
    {"name": "node-api", "lang": "javascript", "files": ["package.json", "server.js", "routes.js"]},
    
    # Data science projects
    {"name": "ml-pipeline", "lang": "python", "files": ["train.py", "model.py", "data.csv"]},
    {"name": "data-analysis", "lang": "python", "files": ["analysis.ipynb", "utils.py", "data.csv"]},
    
    # Duplicate/similar projects
    {"name": "flask-api-v2", "lang": "python", "files": ["app.py", "requirements.txt", "config.py"]},
    {"name": "old-react-app", "lang": "javascript", "files": ["package.json", "App.js", "index.js"]},
]

# Create the repositories
for repo in repo_templates:
    repo_path = Path(temp_dir) / repo["name"]
    repo_path.mkdir(parents=True)
    
    # Initialize git (`git -C` targets the repo without changing the notebook's working directory)
    run_command(f"git -C '{repo_path}' init")
    run_command(f"git -C '{repo_path}' config user.email 'test@example.com'")
    run_command(f"git -C '{repo_path}' config user.name 'Test User'")
    
    # Create files
    for file in repo["files"]:
        (repo_path / file).write_text(f"# {repo['name']}\n# File: {file}\n")
    
    # Create README with metadata
    readme_content = f"""# {repo['name']}
Language: {repo['lang']}
Type: {'web' if 'api' in repo['name'] or 'app' in repo['name'] else 'data' if 'ml' in repo['name'] or 'data' in repo['name'] else 'other'}
"""
    (repo_path / "README.md").write_text(readme_content)
    
    # Commit
    run_command(f"git -C '{repo_path}' add .")
    run_command(f"git -C '{repo_path}' commit -m 'Initial commit for {repo['name']}'")

print(f"Created {len(repo_templates)} sample repositories")

In [None]:
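# Load repository metadata
stdout, _, code = run_command(f"ghops list {temp_dir}")
repos = parse_jsonl(stdout) if code == 0 else []

if not repos:
    # Fall back to the sample templates so the rest of the notebook still runs
    print("Note: `ghops list` returned no data. Using the sample repository list instead.")
    repos = [{"name": r["name"], "path": str(Path(temp_dir) / r["name"])} for r in repo_templates]

# Convert to DataFrame for easier analysis
df_repos = pd.DataFrame(repos)
print(f"Loaded {len(df_repos)} repositories")
print("\nRepository Overview:")
df_repos[['name', 'path']].head(10)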

## 2. Understanding Clustering Algorithms {#clustering-algorithms}

ghops supports multiple clustering algorithms, each suited for different analysis needs:

- **similarity**: Content-based clustering using file similarity
- **language**: Groups repositories by programming language
- **size**: Clusters based on repository size and complexity
- **activity**: Groups by commit activity patterns
- **dependencies**: Clusters based on shared dependencies
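To make the idea of file-based similarity concrete, here is a minimal sketch that scores repository pairs by the Jaccard similarity of their file names. It is an illustration only: the `repo_files` mapping, `jaccard`, and `pairwise_similarity` are hypothetical names, and nothing here is taken from how ghops computes its `similarity` algorithm.

```python
from itertools import combinations

# Hypothetical input: repository name -> set of file names (from the sample templates)
repo_files = {
    "flask-api": {"app.py", "requirements.txt", "config.py"},
    "flask-api-v2": {"app.py", "requirements.txt", "config.py"},
    "fastapi-service": {"main.py", "requirements.txt", "models.py"},
    "react-app": {"package.json", "App.js", "index.js"},
}

def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_similarity(files_by_repo):
    """Return {(repo1, repo2): score} for every unordered pair of repositories."""
    return {
        (r1, r2): jaccard(files_by_repo[r1], files_by_repo[r2])
        for r1, r2 in combinations(sorted(files_by_repo), 2)
    }

scores = pairwise_similarity(repo_files)
for (r1, r2), score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{r1} <-> {r2}: {score:.2f}")
```

File names alone are a coarse signal. A real content-based measure would also compare file contents, which is why identical file lists can still hide different code.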

In [None]:
# Check available clustering algorithms
stdout, _, _ = run_command("ghops cluster --help")
print("Available clustering options:")
print("=" * 50)
# Parse help text to show algorithms
for line in stdout.split('\n'):
    if '--algorithm' in line or 'similarity' in line.lower():
        print(line.strip())

## 3. Running Cluster Analysis {#running-analysis}

Let's run different clustering algorithms and analyze the results.
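The cells in this section read cluster records with the fields `cluster_id`, `members`, and `similarity`. As a sketch of how such records can be turned into a quick lookup, the example below parses two hand-written JSONL lines in the same shape as the notebook's mock data. The sample text is an assumption for illustration, not captured ghops output, and `cluster_lookup` is a hypothetical helper.

```python
import json

# Hand-written sample in the same shape as the mock data; not captured ghops output
sample_output = """\
{"cluster_id": 0, "members": ["flask-api", "flask-api-v2"], "similarity": 0.85}
{"cluster_id": 1, "members": ["react-app", "old-react-app"], "similarity": 0.75}
"""

def cluster_lookup(jsonl_text):
    """Map each repository name to the ID of the cluster that contains it."""
    lookup = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for member in record.get("members", []):
            lookup[member] = record["cluster_id"]
    return lookup

repo_to_cluster = cluster_lookup(sample_output)
print(repo_to_cluster)
```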

In [None]:
# Mock clusters, used when ghops returns no usable output
mock_clusters = [
    {"cluster_id": 0, "members": ["flask-api", "flask-api-v2", "fastapi-service"], "similarity": 0.85},
    {"cluster_id": 1, "members": ["react-app", "old-react-app", "vue-dashboard"], "similarity": 0.75},
    {"cluster_id": 2, "members": ["ml-pipeline", "data-analysis"], "similarity": 0.65},
]

# Run similarity-based clustering
stdout, stderr, code = run_command(f"ghops cluster analyze {temp_dir} --algorithm similarity")
clusters = parse_jsonl(stdout) if code == 0 else []

if clusters:
    print("Similarity-based Clustering Results:")
    print("=" * 50)
    
    for cluster in clusters:
        if 'cluster_id' in cluster:
            print(f"\nCluster {cluster['cluster_id']}:")
            print(f"  Members: {cluster.get('members', [])}")
            print(f"  Similarity: {cluster.get('similarity', 0):.2f}")
else:
    if code == 0:
        print("No clusters found. Using mock data for demonstration.")
    else:
        print("Note: Clustering command not available. Using mock data for demonstration.")
    clusters = mock_clusters

In [None]:
# Analyze cluster characteristics
cluster_df = pd.DataFrame(clusters)

if not cluster_df.empty:
    print("Cluster Statistics:")
    print("=" * 50)
    print(f"Total clusters: {len(cluster_df)}")
    print(f"Average cluster size: {cluster_df['members'].apply(len).mean():.1f} repositories")
    print(f"Average similarity: {cluster_df['similarity'].mean():.2f}")
    
    # Create cluster size distribution
    cluster_sizes = cluster_df['members'].apply(len)
    
    plt.figure(figsize=(10, 4))
    
    plt.subplot(1, 2, 1)
    plt.bar(range(len(cluster_sizes)), cluster_sizes)
    plt.xlabel('Cluster ID')
    plt.ylabel('Number of Repositories')
    plt.title('Cluster Sizes')
    
    plt.subplot(1, 2, 2)
    plt.bar(range(len(cluster_df)), cluster_df['similarity'])
    plt.xlabel('Cluster ID')
    plt.ylabel('Similarity Score')
    plt.title('Cluster Similarity Scores')
    
    plt.tight_layout()
    plt.show()

## 4. Visualizing Clusters {#visualizing}

Let's create visualizations to better understand the repository relationships.

In [None]:
# Create a similarity matrix
import random

# Get all unique repository names (sorted so the matrix layout is the same on every run)
all_repos = sorted({repo for cluster in clusters for repo in cluster['members']})

# Create similarity matrix
n_repos = len(all_repos)
similarity_matrix = np.zeros((n_repos, n_repos))

# Fill similarity matrix based on clusters
for cluster in clusters:
    members = cluster['members']
    sim_score = cluster['similarity']
    
    for i, repo1 in enumerate(members):
        for j, repo2 in enumerate(members):
            if repo1 != repo2:
                idx1 = all_repos.index(repo1)
                idx2 = all_repos.index(repo2)
                similarity_matrix[idx1, idx2] = sim_score
                similarity_matrix[idx2, idx1] = sim_score

# Set diagonal to 1 (self-similarity)
np.fill_diagonal(similarity_matrix, 1.0)

# Add some noise for repos not in same cluster (seeded so the heatmap is reproducible)
random.seed(42)
for i in range(n_repos):
    for j in range(i+1, n_repos):
        if similarity_matrix[i, j] == 0:
            similarity_matrix[i, j] = random.uniform(0.1, 0.3)
            similarity_matrix[j, i] = similarity_matrix[i, j]

# Create heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(similarity_matrix, 
            xticklabels=all_repos, 
            yticklabels=all_repos,
            cmap='YlOrRd',
            annot=True,
            fmt='.2f',
            cbar_kws={'label': 'Similarity Score'})
plt.title('Repository Similarity Matrix')
plt.tight_layout()
plt.show()

In [None]:
# Create network graph visualization
import networkx as nx

# Create graph from similarity matrix
G = nx.Graph()

# Add nodes
for repo in all_repos:
    G.add_node(repo)

# Add edges for high similarity (> 0.5)
threshold = 0.5
for i in range(n_repos):
    for j in range(i+1, n_repos):
        if similarity_matrix[i, j] > threshold:
            G.add_edge(all_repos[i], all_repos[j], 
                      weight=similarity_matrix[i, j])

# Create visualization
plt.figure(figsize=(14, 10))

# Calculate layout (fixed seed so node positions are the same on every run)
pos = nx.spring_layout(G, k=2, iterations=50, seed=42)

# Draw nodes - color by cluster
node_colors = []
for repo in all_repos:
    for i, cluster in enumerate(clusters):
        if repo in cluster['members']:
            node_colors.append(i)
            break
    else:
        node_colors.append(-1)

nx.draw_networkx_nodes(G, pos, node_color=node_colors, 
                      node_size=1000, cmap='Set1')

# Draw edges with varying thickness based on similarity
edges = G.edges()
weights = [G[u][v]['weight'] * 3 for u, v in edges]
nx.draw_networkx_edges(G, pos, width=weights, alpha=0.5)

# Draw labels
nx.draw_networkx_labels(G, pos, font_size=8)

plt.title('Repository Similarity Network\n(Edges show similarity > 0.5)')
plt.axis('off')
plt.tight_layout()
plt.show()

## 5. Finding Duplicate Code {#duplicates}

ghops can identify repositories with duplicate or very similar code.
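The results below distinguish common files (same path in both repositories) from identical files (same path and same content). One way to draw that distinction is to compare content hashes, sketched here with the standard library only. The helper names `file_digest` and `compare_repos` are hypothetical, and this is not a description of how ghops detects duplicates.

```python
import hashlib
import tempfile
from pathlib import Path

def file_digest(path):
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def compare_repos(dir1, dir2):
    """Return (common, identical): shared relative paths, and those with equal content."""
    def files(root):
        # Map relative path -> absolute path, skipping anything under .git
        return {
            p.relative_to(root).as_posix(): p
            for p in root.rglob("*")
            if p.is_file() and ".git" not in p.relative_to(root).parts
        }
    f1, f2 = files(dir1), files(dir2)
    common = sorted(f1.keys() & f2.keys())
    identical = [name for name in common if file_digest(f1[name]) == file_digest(f2[name])]
    return common, identical

# Tiny demonstration with two throwaway directories
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "repo-a"
    b = Path(tmp) / "repo-b"
    a.mkdir()
    b.mkdir()
    (a / "config.py").write_text("DEBUG = False\n")
    (b / "config.py").write_text("DEBUG = False\n")
    (a / "app.py").write_text("print('v1')\n")
    (b / "app.py").write_text("print('v2')\n")
    (a / "only_in_a.txt").write_text("x\n")
    common, identical = compare_repos(a, b)

print("Common files:", common)
print("Identical files:", identical)
```

Exact hashing only catches byte-for-byte copies. Near-duplicates (reformatted or lightly edited files) need a fuzzier comparison.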

In [None]:
# Find duplicates
stdout, stderr, code = run_command(f"ghops cluster duplicates {temp_dir}")

# Parse results or use mock data
if code == 0 and stdout:
    duplicates = parse_jsonl(stdout)
else:
    # Mock duplicate detection results
    duplicates = [
        {
            "repo1": "flask-api",
            "repo2": "flask-api-v2",
            "similarity": 0.95,
            "common_files": ["app.py", "requirements.txt", "config.py"],
            "identical_files": ["requirements.txt", "config.py"],
            "recommendation": "Consider merging or archiving flask-api-v2"
        },
        {
            "repo1": "react-app",
            "repo2": "old-react-app",
            "similarity": 0.88,
            "common_files": ["package.json", "App.js", "index.js"],
            "identical_files": ["index.js"],
            "recommendation": "Archive old-react-app if no longer needed"
        }
    ]

print("Duplicate Detection Results:")
print("=" * 50)

for dup in duplicates:
    print("\nPotential Duplicate Pair:")
    print(f"  {dup['repo1']} <-> {dup['repo2']}")
    print(f"  Similarity: {dup['similarity']:.1%}")
    print(f"  Common files: {', '.join(dup['common_files'])}")
    print(f"  Identical files: {', '.join(dup['identical_files']) if dup['identical_files'] else 'None'}")
    print(f"  Recommendation: {dup['recommendation']}")

In [None]:
# Visualize duplicate relationships
if duplicates:
    dup_df = pd.DataFrame(duplicates)
    
    plt.figure(figsize=(12, 6))
    
    # Similarity scores
    plt.subplot(1, 2, 1)
    pairs = [f"{d['repo1']}\nvs\n{d['repo2']}" for d in duplicates]
    similarities = [d['similarity'] for d in duplicates]
    
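    # Color bars by similarity band: red above 0.9, orange above 0.7, yellow otherwise
    bar_colors = ['red' if s > 0.9 else 'orange' if s > 0.7 else 'yellow' for s in similarities]
    plt.bar(range(len(pairs)), similarities, color=bar_colors)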
    plt.xticks(range(len(pairs)), pairs, rotation=0)
    plt.ylabel('Similarity Score')
    plt.title('Duplicate Repository Pairs')
    plt.axhline(y=0.9, color='r', linestyle='--', label='High similarity threshold')
    plt.axhline(y=0.7, color='orange', linestyle='--', label='Medium similarity threshold')
    plt.legend()
    
    # File overlap
    plt.subplot(1, 2, 2)
    common_counts = [len(d['common_files']) for d in duplicates]
    identical_counts = [len(d['identical_files']) for d in duplicates]
    
    x = np.arange(len(pairs))
    width = 0.35
    
    plt.bar(x - width/2, common_counts, width, label='Common Files', color='lightblue')
    plt.bar(x + width/2, identical_counts, width, label='Identical Files', color='darkblue')
    
    plt.xticks(x, pairs, rotation=0)
    plt.ylabel('Number of Files')
    plt.title('File Overlap Analysis')
    plt.legend()
    
    plt.tight_layout()
    plt.show()

## 6. Getting Consolidation Recommendations {#consolidation}

Based on the clustering analysis, ghops can provide recommendations for consolidating repositories.

In [None]:
# Get consolidation recommendations
stdout, stderr, code = run_command(f"ghops cluster recommend {temp_dir}")

if code == 0 and stdout:
    recommendations = parse_jsonl(stdout)
else:
    # Mock recommendations
    recommendations = [
        {
            "type": "merge",
            "repos": ["flask-api", "flask-api-v2"],
            "reason": "95% code similarity, identical dependencies",
            "action": "Merge flask-api-v2 changes into flask-api, then archive flask-api-v2",
            "effort": "low",
            "impact": "high"
        },
        {
            "type": "archive",
            "repos": ["old-react-app"],
            "reason": "Duplicate of react-app, last updated 6 months ago",
            "action": "Archive old-react-app repository",
            "effort": "minimal",
            "impact": "medium"
        },
        {
            "type": "refactor",
            "repos": ["ml-pipeline", "data-analysis"],
            "reason": "Share common data processing code",
            "action": "Extract shared utilities into a common library",
            "effort": "medium",
            "impact": "high"
        },
        {
            "type": "organize",
            "repos": ["react-app", "vue-dashboard", "node-api"],
            "reason": "Related JavaScript projects",
            "action": "Consider monorepo structure for JavaScript projects",
            "effort": "high",
            "impact": "high"
        }
    ]

print("Consolidation Recommendations:")
print("=" * 70)

for i, rec in enumerate(recommendations, 1):
    print(f"\n{i}. {rec['type'].upper()} Recommendation")
    print(f"   Repositories: {', '.join(rec['repos'])}")
    print(f"   Reason: {rec['reason']}")
    print(f"   Action: {rec['action']}")
    print(f"   Effort: {rec['effort']} | Impact: {rec['impact']}")

In [None]:
# Create recommendation impact matrix
if recommendations:
    rec_df = pd.DataFrame(recommendations)
    
    # Map effort and impact to numeric values
    effort_map = {'minimal': 1, 'low': 2, 'medium': 3, 'high': 4}
    impact_map = {'low': 1, 'medium': 2, 'high': 3}
    
    rec_df['effort_score'] = rec_df['effort'].map(effort_map)
    rec_df['impact_score'] = rec_df['impact'].map(impact_map)
    
    plt.figure(figsize=(10, 8))
    
    # Create scatter plot
    colors = {'merge': 'red', 'archive': 'blue', 'refactor': 'green', 'organize': 'purple'}
    
    for rec_type in rec_df['type'].unique():
        mask = rec_df['type'] == rec_type
        plt.scatter(rec_df[mask]['effort_score'], 
                   rec_df[mask]['impact_score'],
                   label=rec_type.capitalize(),
                   color=colors.get(rec_type, 'gray'),
                   s=200,
                   alpha=0.7)
    
    # Add labels
    for idx, row in rec_df.iterrows():
        plt.annotate(f"{', '.join(row['repos'][:2])}",
                    (row['effort_score'], row['impact_score']),
                    xytext=(5, 5), textcoords='offset points',
                    fontsize=8, alpha=0.7)
    
    # Add quadrant lines
    plt.axhline(y=2, color='gray', linestyle='--', alpha=0.3)
    plt.axvline(x=2.5, color='gray', linestyle='--', alpha=0.3)
    
    # Add quadrant labels
    plt.text(1.2, 2.7, 'Quick Wins', fontsize=10, alpha=0.5, weight='bold')
    plt.text(3.2, 2.7, 'Major Projects', fontsize=10, alpha=0.5, weight='bold')
    plt.text(1.2, 1.2, 'Fill-ins', fontsize=10, alpha=0.5, weight='bold')
    plt.text(3.2, 1.2, 'Questionable', fontsize=10, alpha=0.5, weight='bold')
    
    plt.xlabel('Effort Required')
    plt.ylabel('Expected Impact')
    plt.title('Consolidation Recommendations - Effort vs Impact Matrix')
    plt.xticks([1, 2, 3, 4], ['Minimal', 'Low', 'Medium', 'High'])
    plt.yticks([1, 2, 3], ['Low', 'Medium', 'High'])
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

## 7. Interactive Exploration {#interactive}

Let's create an interactive tool to explore repository relationships.

In [None]:
# Interactive repository explorer
from ipywidgets import interact, widgets

def explore_repository(repo_name):
    """Interactive function to explore a repository's relationships"""
    print(f"\nRepository: {repo_name}")
    print("=" * 50)
    
    # Find cluster membership
    for cluster in clusters:
        if repo_name in cluster['members']:
            print(f"\nCluster {cluster['cluster_id']}:")
            print(f"  Cluster members: {', '.join(cluster['members'])}")
            print(f"  Cluster similarity: {cluster['similarity']:.2f}")
            break
    
    # Find duplicates
    print("\nDuplicate Analysis:")
    found_duplicate = False
    for dup in duplicates:
        if repo_name in [dup['repo1'], dup['repo2']]:
            other = dup['repo2'] if dup['repo1'] == repo_name else dup['repo1']
            print(f"  Potential duplicate: {other} (similarity: {dup['similarity']:.1%})")
            found_duplicate = True
    if not found_duplicate:
        print("  No duplicates found")
    
    # Find recommendations
    print("\nRecommendations:")
    found_rec = False
    for rec in recommendations:
        if repo_name in rec['repos']:
            print(f"  {rec['type'].upper()}: {rec['action']}")
            found_rec = True
    if not found_rec:
        print("  No specific recommendations")
    
    # Show similarity scores
    if repo_name in all_repos:
        idx = all_repos.index(repo_name)
        similarities = [(all_repos[i], similarity_matrix[idx, i]) 
                       for i in range(len(all_repos)) if i != idx]
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        print("\nTop 3 Most Similar Repositories:")
        for similar_repo, score in similarities[:3]:
            print(f"  - {similar_repo}: {score:.2f}")

# Create interactive widget
if all_repos:
    interact(explore_repository, 
             repo_name=widgets.Dropdown(
                 options=sorted(all_repos),
                 description='Repository:',
                 style={'description_width': 'initial'}
             ))

## 8. Advanced Analysis {#advanced}

Let's perform some advanced clustering analysis combining multiple factors.
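Before combining features measured on very different scales (thousands of lines of code next to a handful of contributors), it helps to standardize each one. Otherwise the largest-valued feature dominates the distance calculations that K-means relies on. The sketch below does the same z-score scaling as `StandardScaler` by hand, using made-up feature rows. The repository names and values are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical feature rows: (lines_of_code, num_contributors)
rows = {
    "small-lib": (1000, 1),
    "web-app": (5000, 5),
    "api": (2000, 3),
}

def standardize(columns):
    """Z-score each column: subtract the column mean, divide by the population std."""
    scaled = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        scaled.append([(x - mu) / sigma if sigma else 0.0 for x in col])
    return scaled

names = list(rows)
columns = list(zip(*rows.values()))  # one tuple per feature
scaled_columns = standardize(columns)
scaled_rows = {
    name: tuple(col[i] for col in scaled_columns)
    for i, name in enumerate(names)
}

for name, values in scaled_rows.items():
    print(name, [round(v, 2) for v in values])
```

After scaling, every feature has mean 0 and unit variance, so a difference of one standard deviation counts the same whichever feature it comes from.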

In [None]:
# Multi-dimensional clustering analysis
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Create feature matrix for repositories
np.random.seed(42)  # For reproducibility

# Generate mock features for each repository
features = []
feature_names = ['lines_of_code', 'num_files', 'num_commits', 'num_contributors', 
                 'days_since_update', 'num_dependencies']

for repo in all_repos:
    # Generate mock features based on repository type
    if 'api' in repo or 'service' in repo:
        base_features = [2000, 15, 50, 3, 10, 8]
    elif 'app' in repo or 'dashboard' in repo:
        base_features = [5000, 30, 100, 5, 5, 12]
    elif 'ml' in repo or 'data' in repo:
        base_features = [1500, 10, 30, 2, 20, 6]
    else:
        base_features = [1000, 8, 20, 1, 30, 4]
    
    # Add noise
    noisy_features = [int(f * np.random.uniform(0.8, 1.2)) for f in base_features]
    features.append(noisy_features)

feature_matrix = np.array(features)

# Standardize features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(feature_matrix)

# Perform PCA for visualization
pca = PCA(n_components=2)
pca_features = pca.fit_transform(scaled_features)

# Perform K-means clustering (n_init set explicitly to avoid version-dependent defaults)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(scaled_features)

# Visualize results
plt.figure(figsize=(14, 6))

# PCA visualization
plt.subplot(1, 2, 1)
scatter = plt.scatter(pca_features[:, 0], pca_features[:, 1], 
                     c=cluster_labels, cmap='viridis', s=100)
for i, repo in enumerate(all_repos):
    plt.annotate(repo, (pca_features[i, 0], pca_features[i, 1]),
                xytext=(5, 5), textcoords='offset points',
                fontsize=8, alpha=0.7)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('Multi-dimensional Clustering (PCA Projection)')
plt.colorbar(scatter, label='Cluster')

# Feature loadings on the first principal component
# (shows which features drive PC1, not which features drive the K-means clusters)
plt.subplot(1, 2, 2)
feature_importance = np.abs(pca.components_[0])
sorted_idx = np.argsort(feature_importance)[::-1]
plt.barh(range(len(feature_names)), feature_importance[sorted_idx])
plt.yticks(range(len(feature_names)), [feature_names[i] for i in sorted_idx])
plt.xlabel('Absolute loading on PC1')
plt.title('Feature Contributions to PC1')

plt.tight_layout()
plt.show()

# Print cluster composition
print("\nCluster Composition:")
print("=" * 50)
for cluster_id in range(3):
    cluster_members = [all_repos[i] for i, label in enumerate(cluster_labels) if label == cluster_id]
    print(f"\nCluster {cluster_id}: {', '.join(cluster_members)}")

In [None]:
# Create detailed cluster profile
cluster_profiles = []

for cluster_id in range(3):
    mask = cluster_labels == cluster_id
    cluster_features = feature_matrix[mask]
    
    profile = {
        'cluster_id': cluster_id,
        'size': mask.sum(),
        'avg_lines': cluster_features[:, 0].mean(),
        'avg_files': cluster_features[:, 1].mean(),
        'avg_commits': cluster_features[:, 2].mean(),
        'avg_contributors': cluster_features[:, 3].mean(),
        'avg_days_since_update': cluster_features[:, 4].mean(),
        'avg_dependencies': cluster_features[:, 5].mean()
    }
    cluster_profiles.append(profile)

profile_df = pd.DataFrame(cluster_profiles)

# Display cluster profiles
print("Cluster Profiles:")
print("=" * 70)
print(profile_df.round(1).to_string(index=False))

# Radar chart for cluster comparison
from math import pi

fig, axes = plt.subplots(1, 3, figsize=(15, 5), subplot_kw=dict(projection='polar'))

# Prepare data for radar chart
categories = ['Lines', 'Files', 'Commits', 'Contributors', 'Freshness', 'Dependencies']
num_vars = len(categories)

# Compute angle for each axis
angles = [n / float(num_vars) * 2 * pi for n in range(num_vars)]
angles += angles[:1]

for idx, (ax, profile) in enumerate(zip(axes, cluster_profiles)):
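    # Normalize values for radar chart (divide by the largest base value of each feature)
    values = [
        profile['avg_lines'] / 5000,
        profile['avg_files'] / 30,
        profile['avg_commits'] / 100,
        profile['avg_contributors'] / 5,
        1 - (profile['avg_days_since_update'] / 30),  # Invert for freshness
        profile['avg_dependencies'] / 12
    ]
    # The added noise can push ratios outside [0, 1]; clip so they stay on the chart
    values = [min(max(v, 0.0), 1.0) for v in values]
    values += values[:1]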
    
    ax.plot(angles, values, 'o-', linewidth=2)
    ax.fill(angles, values, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(categories, size=8)
    ax.set_ylim(0, 1)
    ax.set_title(f'Cluster {idx}', size=11, y=1.1)
    ax.grid(True)

plt.suptitle('Cluster Characteristic Profiles', size=14, y=1.05)
plt.tight_layout()
plt.show()

## 9. Exercises {#exercises}

Practice your clustering analysis skills with these exercises:

### Exercise 1: Custom Similarity Metric
Create a custom similarity metric that considers both code similarity and metadata.

In [None]:
# TODO: Implement a custom similarity function
def custom_similarity(repo1_features, repo2_features):
    """
    Calculate custom similarity between two repositories.
    
    Consider:
    - File overlap
    - Language similarity
    - Size similarity
    - Update frequency similarity
    """
    # Your implementation here
    pass

# Test your function
# similarity = custom_similarity(repo1_data, repo2_data)

### Exercise 2: Cluster Quality Metrics
Evaluate the quality of clustering using silhouette score and other metrics.

In [None]:
# TODO: Calculate clustering quality metrics
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Your code here to:
# 1. Calculate silhouette score
# 2. Calculate Calinski-Harabasz index
# 3. Determine optimal number of clusters

### Exercise 3: Consolidation Plan
Create a detailed consolidation plan based on the analysis.

In [None]:
# TODO: Create a consolidation plan
# Your code here to:
# 1. Prioritize recommendations by effort/impact
# 2. Create a timeline for implementation
# 3. Estimate resource savings
# 4. Generate a report

## Cleanup

In [None]:
# Clean up temporary directory
import shutil
if 'temp_dir' in locals() and os.path.exists(temp_dir):
    shutil.rmtree(temp_dir)
    print(f"Cleaned up temporary directory: {temp_dir}")

## Summary

In this notebook, you learned:
- How to run clustering analysis on repositories
- Which clustering algorithms are available and when to use each
- How to visualize repository relationships with heatmaps and network graphs
- How to identify duplicate and similar repositories
- How to get actionable consolidation recommendations
- How to perform multi-dimensional analysis with PCA and K-means
- How to build interactive exploration tools

## Next Steps

- **Notebook 3**: Learn about Workflow Orchestration
- **Notebook 4**: Explore Advanced Integrations
- **Notebook 5**: Dive into Data Analysis and Visualization

## Key Takeaways

1. Clustering helps identify patterns in your repository portfolio
2. Different algorithms reveal different aspects of similarity
3. Visualization is crucial for understanding relationships
4. Consolidation can significantly reduce maintenance overhead
5. Multi-dimensional analysis provides deeper insights