# Nova Experiment Analysis Template

## Overview

This notebook provides automated analysis and visualization of nova simulation experiments. It analyzes the output from `scripts/analyze_vcf_results.py` to generate comprehensive performance metrics and publication-ready visualizations.

## Template Usage

1. **Configure Parameters**: Modify the experiment configuration in the cell below
2. **Run All Cells**: Execute the entire notebook for automated analysis
3. **Review Results**: Examine the generated plots and summary statistics

## Configuration

In [None]:
# EXPERIMENT CONFIGURATION - Modify these parameters for each experiment
EXPERIMENT_NAME = "Nova Simulation Analysis"
DATA_DIRECTORY = "../output"  # Path to experiment output directory
OUTPUT_PREFIX = "nova"  # Prefix used in analyze_vcf_results.py output files
ANALYSIS_TITLE = "Nova Variant Detection Performance Analysis"

# Optional: Override expected insertion counts if needed
# Leave as None to auto-detect from data
EXPECTED_INSERTIONS = None  # e.g., {'random': 400, 'simple': 200, 'AluYa5': 100}

# Display options
FIGURE_SIZE = (12, 8)
DPI = 100
SAVE_FIGURES = False  # Set to True to save plots as PNG files

## Setup and Imports

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from pathlib import Path
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = FIGURE_SIZE
plt.rcParams['figure.dpi'] = DPI
plt.rcParams['font.size'] = 12
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['legend.fontsize'] = 11

# Define consistent color scheme
COLORS = {
    'primary': '#2E8B57',     # Sea Green
    'secondary': '#4682B4',   # Steel Blue
    'accent': '#CD5C5C',      # Indian Red
    'success': '#2ECC71',     # Emerald
    'warning': '#F39C12',     # Orange
    'danger': '#E74C3C'       # Red
}

print(f"Analysis setup complete for: {EXPERIMENT_NAME}")
print(f"Timestamp: {pd.Timestamp.now()}")
print(f"Data directory: {DATA_DIRECTORY}")
print(f"Output prefix: {OUTPUT_PREFIX}")

## Data Loading and Validation
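
The cells that follow expect three names to exist: a DataFrame `df`, an optional `summary_data` dict, and a `data_path` directory used when saving figures. A minimal loading sketch is shown here. The helper name `load_experiment` and the file names `<prefix>_results.csv` and `<prefix>_summary.json` are assumptions, not names confirmed by `analyze_vcf_results.py`, so adjust them to match the files your run produced.

```python
import json
from pathlib import Path

import pandas as pd

# Columns that the later cells read unconditionally.
REQUIRED_COLUMNS = ["variant_index", "composition", "is_single_nova_only", "nova_fraction"]


def load_experiment(data_dir, prefix):
    """Load the per-variant CSV and, if present, the JSON summary.

    The file names below are hypothetical; change them to match the
    output of your analyze_vcf_results.py run.
    """
    data_path = Path(data_dir)
    csv_file = data_path / f"{prefix}_results.csv"    # assumed file name
    json_file = data_path / f"{prefix}_summary.json"  # assumed file name

    if not csv_file.exists():
        raise FileNotFoundError(f"Expected CSV not found: {csv_file}")

    df = pd.read_csv(csv_file)
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"CSV is missing required columns: {missing}")

    # The JSON summary is optional; downstream cells fall back to the CSV.
    summary_data = None
    if json_file.exists():
        with open(json_file) as handle:
            summary_data = json.load(handle)

    return data_path, df, summary_data
```

In the notebook this would be called as `data_path, df, summary_data = load_experiment(DATA_DIRECTORY, OUTPUT_PREFIX)`. Failing early on missing columns keeps later plotting cells from stopping halfway through a figure.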

In [ ]:
# Quick summary setup
print(f"\n{'='*60}")
print(f"EXPERIMENT: {EXPERIMENT_NAME}")
print(f"{'='*60}")

# Basic counts for context
total_variants = len(df)
unique_variants = df['variant_index'].nunique()
print(f"\nTotal variant records: {total_variants:,}")
print(f"Unique variants: {unique_variants:,}")

# Get key columns for plotting
print(f"\nAvailable data columns: {len(df.columns)}")
print(f"Key columns: variant_index, composition, is_single_nova_only, nova_fraction, insertion_type")

# Check for optional columns
optional_cols = ['has_genomic_clustering', 'exact_size_match', 'close_size_match']
available_optional = [col for col in optional_cols if col in df.columns]
print(f"\nOptional columns available: {available_optional}")

print(f"\n{'='*60}")

## Performance Analysis

### True Positive Analysis

This section analyzes detection performance for simulated insertions. A true positive is a single nova-only variant call (`is_single_nova_only == True`), the ideal outcome for de novo simulation.

In [ ]:
# True Positive Analysis by Insertion Type

# Get unique variants (avoid double-counting if multiple nova reads per variant).
# Only aggregate columns that exist, so the missing-column warning below is reachable.
agg_cols = [col for col in ('is_single_nova_only', 'insertion_type', 'composition')
            if col in df.columns]
unique_df = df.groupby('variant_index')[agg_cols].first().reset_index()

# Create true positive analysis visualization
if 'insertion_type' in df.columns and df['insertion_type'].notna().any():
    # Calculate metrics by insertion type
    tp_by_type = []
    insertion_types = unique_df['insertion_type'].dropna().unique()

    for ins_type in sorted(insertion_types):
        type_df = unique_df[unique_df['insertion_type'] == ins_type]
        total_detected = len(type_df)
        true_positives = len(type_df[type_df['is_single_nova_only'] == True])

        # Calculate rates
        tp_rate = (true_positives / total_detected * 100) if total_detected > 0 else 0

        tp_by_type.append({
            'insertion_type': ins_type,
            'total_detected': total_detected,
            'true_positives': true_positives,
            'tp_rate': tp_rate
        })

    tp_df = pd.DataFrame(tp_by_type)

    # Create 2x2 subplot figure
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))

    # Plot 1: True positive rates by insertion type
    if len(tp_df) > 0:
        bars1 = ax1.bar(tp_df['insertion_type'], tp_df['tp_rate'],
                        color=COLORS['primary'], alpha=0.8)
        ax1.set_title('True Positive Rates by Insertion Type', fontsize=14, fontweight='bold')
        ax1.set_ylabel('True Positive Rate (%)')
        ax1.set_xlabel('Insertion Type')
        ax1.tick_params(axis='x', rotation=45)
        ax1.grid(axis='y', alpha=0.3)
        ax1.set_ylim(0, 105)

        # Add value labels
        for bar in bars1:
            height = bar.get_height()
            ax1.text(bar.get_x() + bar.get_width()/2., height + 1,
                     f'{height:.1f}%', ha='center', va='bottom', fontweight='bold')

    # Plot 2: Overall composition breakdown
    composition_data = unique_df['composition'].value_counts()
    colors = [COLORS['success'], COLORS['warning'], COLORS['danger']]
    wedges, texts, autotexts = ax2.pie(composition_data.values, labels=composition_data.index,
                                      autopct='%1.1f%%', colors=colors[:len(composition_data)])
    ax2.set_title('Overall Read Composition', fontsize=14, fontweight='bold')

    # Make percentage text bold
    for autotext in autotexts:
        autotext.set_color('white')
        autotext.set_fontweight('bold')

    # Plot 3: Detection counts by type
    if len(tp_df) > 0:
        x = np.arange(len(tp_df))
        width = 0.35

        bars3a = ax3.bar(x - width/2, tp_df['true_positives'], width,
                         label='True Positives', color=COLORS['success'], alpha=0.8)
        bars3b = ax3.bar(x + width/2, tp_df['total_detected'] - tp_df['true_positives'], width,
                         label='False Positives', color=COLORS['danger'], alpha=0.8)

        ax3.set_title('Detection Counts by Insertion Type', fontsize=14, fontweight='bold')
        ax3.set_ylabel('Number of Variants')
        ax3.set_xlabel('Insertion Type')
        ax3.set_xticks(x)
        ax3.set_xticklabels(tp_df['insertion_type'], rotation=45)
        ax3.legend()
        ax3.grid(axis='y', alpha=0.3)

    # Plot 4: Performance metrics summary
    overall_tp_rate = (unique_df['is_single_nova_only'].sum() / len(unique_df) * 100) if len(unique_df) > 0 else 0
    overall_fp_rate = 100 - overall_tp_rate
    detection_quality = overall_tp_rate  # Same as TP rate when looking at unique variants

    metrics = ['True Positive Rate', 'False Positive Rate', 'Detection Quality']
    values = [overall_tp_rate, overall_fp_rate, detection_quality]
    colors_metrics = [COLORS['success'], COLORS['danger'], COLORS['primary']]

    bars4 = ax4.bar(metrics, values, color=colors_metrics, alpha=0.8)
    ax4.set_title('Overall Performance Metrics', fontsize=14, fontweight='bold')
    ax4.set_ylabel('Percentage (%)')
    ax4.set_ylim(0, 105)
    ax4.grid(axis='y', alpha=0.3)

    # Add value labels
    for bar in bars4:
        height = bar.get_height()
        ax4.text(bar.get_x() + bar.get_width()/2., height + 1,
                 f'{height:.1f}%', ha='center', va='bottom', fontweight='bold')

    plt.tight_layout()
    if SAVE_FIGURES:
        plt.savefig(data_path / f"{OUTPUT_PREFIX}_true_positive_analysis.png", dpi=300, bbox_inches='tight')
    plt.show()

    # Print brief summary
    print(f"\nTRUE POSITIVE ANALYSIS SUMMARY")
    print(f"Overall true positive rate: {overall_tp_rate:.1f}%")
    if len(tp_df) > 0:
        print(f"Best performing type: {tp_df.loc[tp_df['tp_rate'].idxmax(), 'insertion_type']} ({tp_df['tp_rate'].max():.1f}%)")
        print(f"Worst performing type: {tp_df.loc[tp_df['tp_rate'].idxmin(), 'insertion_type']} ({tp_df['tp_rate'].min():.1f}%)")

else:
    print("Warning: Insertion type data not available in CSV file")
    print("Cannot generate insertion type analysis")

### False Positive Analysis

This section analyzes variants that are **not** single nova-only calls. In a de novo simulation these count as false positives.

In [ ]:
# False Positive Analysis

# Get false positive variants
fp_variants = df[df['is_single_nova_only'] == False]
unique_fp = fp_variants.groupby('variant_index').first().reset_index()

print(f"False Positive Analysis: {len(unique_fp)} unique false positive variants")

if len(unique_fp) > 0:
    # Create false positive analysis visualization
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))

    # Plot 1: False positive composition breakdown
    fp_composition = unique_fp['composition'].value_counts()
    colors = [COLORS['warning'], COLORS['danger']]
    bars1 = ax1.bar(fp_composition.index, fp_composition.values,
                    color=colors[:len(fp_composition)], alpha=0.8)
    ax1.set_title('False Positive Composition', fontsize=14, fontweight='bold')
    ax1.set_ylabel('Number of Variants')
    ax1.set_xlabel('Composition Type')
    ax1.grid(axis='y', alpha=0.3)

    # Add value labels
    for bar in bars1:
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height + 1,
                 f'{int(height)}', ha='center', va='bottom', fontweight='bold')

    # Plot 2: Genomic clustering analysis (if available)
    if 'has_genomic_clustering' in unique_fp.columns:
        clustering_counts = unique_fp['has_genomic_clustering'].value_counts()
        labels = ['No Clustering', 'Genomic Clustering']
        sizes = [clustering_counts.get(False, 0), clustering_counts.get(True, 0)]
        colors_clustering = [COLORS['success'], COLORS['danger']]

        wedges, texts, autotexts = ax2.pie(sizes, labels=labels, autopct='%1.1f%%',
                                          colors=colors_clustering)
        ax2.set_title('Genomic Clustering in False Positives', fontsize=14, fontweight='bold')

        # Make percentage text bold
        for autotext in autotexts:
            autotext.set_color('white')
            autotext.set_fontweight('bold')
    else:
        ax2.text(0.5, 0.5, 'Genomic clustering\ndata not available',
                ha='center', va='center', transform=ax2.transAxes, fontsize=12)
        ax2.set_title('Genomic Clustering Analysis', fontsize=14, fontweight='bold')

    # Plot 3: Support read distribution for false positives
    if 'support_reads' in unique_fp.columns:
        # Limit to reasonable range for visualization
        support_reads = unique_fp['support_reads'].values
        support_reads = support_reads[support_reads <= 20]  # Cap at 20 for better visualization

        # Guard against an empty selection and non-integer dtypes before building bins
        max_support = int(support_reads.max()) if len(support_reads) > 0 else 1
        ax3.hist(support_reads, bins=range(1, max_support + 2),
                color=COLORS['secondary'], alpha=0.7, edgecolor='black')
        ax3.set_title('Support Read Distribution (False Positives)', fontsize=14, fontweight='bold')
        ax3.set_xlabel('Number of Supporting Reads')
        ax3.set_ylabel('Number of Variants')
        ax3.grid(axis='y', alpha=0.3)

        # Add mean line
        mean_support = np.mean(unique_fp['support_reads'])
        ax3.axvline(mean_support, color=COLORS['danger'], linestyle='--', linewidth=2,
                   label=f'Mean: {mean_support:.1f}')
        ax3.legend()

    # Plot 4: Nova fraction distribution for false positives
    if 'nova_fraction' in unique_fp.columns:
        ax4.hist(unique_fp['nova_fraction'], bins=20, color=COLORS['accent'], alpha=0.7, edgecolor='black')
        ax4.set_title('Nova Fraction Distribution (False Positives)', fontsize=14, fontweight='bold')
        ax4.set_xlabel('Nova Fraction')
        ax4.set_ylabel('Number of Variants')
        ax4.grid(axis='y', alpha=0.3)

        # Add mean line
        mean_nova_fraction = np.mean(unique_fp['nova_fraction'])
        ax4.axvline(mean_nova_fraction, color=COLORS['danger'], linestyle='--', linewidth=2,
                   label=f'Mean: {mean_nova_fraction:.2f}')
        ax4.legend()

    plt.tight_layout()
    if SAVE_FIGURES:
        plt.savefig(data_path / f"{OUTPUT_PREFIX}_false_positive_analysis.png", dpi=300, bbox_inches='tight')
    plt.show()

    # Print brief summary
    print(f"\nFALSE POSITIVE SUMMARY")
    print(f"Total false positives: {len(unique_fp):,}")
    print(f"Composition breakdown:")
    for comp_type, count in fp_composition.items():
        percentage = (count / len(unique_fp) * 100)
        print(f"  {comp_type}: {count:,} ({percentage:.1f}%)")

    if 'has_genomic_clustering' in unique_fp.columns:
        clustering_count = len(unique_fp[unique_fp['has_genomic_clustering'] == True])
        clustering_pct = (clustering_count / len(unique_fp) * 100)
        print(f"Genomic clustering: {clustering_count:,} ({clustering_pct:.1f}%)")

else:
    print("No false positive variants detected!")

    # Create a simple message plot
    fig, ax = plt.subplots(1, 1, figsize=(10, 6))
    ax.text(0.5, 0.5, 'No False Positives Detected!\n\nExcellent simulation quality',
            ha='center', va='center', transform=ax.transAxes,
            fontsize=20, fontweight='bold', color=COLORS['success'])
    ax.set_title('False Positive Analysis', fontsize=16, fontweight='bold')
    ax.axis('off')

    if SAVE_FIGURES:
        plt.savefig(data_path / f"{OUTPUT_PREFIX}_false_positive_analysis.png", dpi=300, bbox_inches='tight')
    plt.show()

## Detailed Analysis

### Read Composition and Size Accuracy

Detailed analysis of read support patterns and insertion size accuracy.

In [ ]:
# Read Composition and Size Accuracy Analysis

# Get unique variants for this analysis.
# Aggregate only the columns present in df; the size-match columns are optional.
wanted_cols = ['support_reads', 'nova_reads', 'nova_fraction', 'composition',
               'is_single_nova_only', 'exact_size_match', 'close_size_match',
               'reasonable_size_match', 'chrom', 'svtype']
present_cols = [col for col in wanted_cols if col in df.columns]
unique_analysis_df = df.groupby('variant_index')[present_cols].first().reset_index()


def count_true(column):
    """Count True values in an optional column; return 0 if it is absent."""
    if column not in unique_analysis_df.columns:
        return 0
    return int((unique_analysis_df[column] == True).sum())


# Create comprehensive analysis visualization
fig = plt.figure(figsize=(18, 12))
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)

# Plot 1: Support read distribution
ax1 = fig.add_subplot(gs[0, 0])
support_reads = unique_analysis_df['support_reads'].values
support_reads_capped = support_reads[support_reads <= 20]  # Cap for better visualization
# range(1, 22) gives the value 20 its own bin instead of merging it with 19
ax1.hist(support_reads_capped, bins=range(1, 22), color=COLORS['primary'], alpha=0.7, edgecolor='black')
ax1.set_title('Support Read Distribution', fontweight='bold')
ax1.set_xlabel('Number of Supporting Reads')
ax1.set_ylabel('Number of Variants')
ax1.grid(axis='y', alpha=0.3)

# Add statistics
mean_support = np.mean(support_reads)
median_support = np.median(support_reads)
ax1.axvline(mean_support, color=COLORS['danger'], linestyle='--', linewidth=2, label=f'Mean: {mean_support:.1f}')
ax1.axvline(median_support, color=COLORS['warning'], linestyle=':', linewidth=2, label=f'Median: {median_support:.1f}')
ax1.legend()

# Plot 2: Nova fraction distribution
ax2 = fig.add_subplot(gs[0, 1])
nova_fractions = unique_analysis_df['nova_fraction'].values
ax2.hist(nova_fractions, bins=20, color=COLORS['secondary'], alpha=0.7, edgecolor='black')
ax2.set_title('Nova Fraction Distribution', fontweight='bold')
ax2.set_xlabel('Nova Fraction')
ax2.set_ylabel('Number of Variants')
ax2.grid(axis='y', alpha=0.3)

# Add statistics
mean_nova_frac = np.mean(nova_fractions)
ax2.axvline(mean_nova_frac, color=COLORS['danger'], linestyle='--', linewidth=2, label=f'Mean: {mean_nova_frac:.2f}')
ax2.legend()

# Plot 3: Composition breakdown
ax3 = fig.add_subplot(gs[0, 2])
composition_counts = unique_analysis_df['composition'].value_counts()
colors_comp = [COLORS['success'], COLORS['warning'], COLORS['danger']]
wedges, texts, autotexts = ax3.pie(composition_counts.values, labels=composition_counts.index,
                                  autopct='%1.1f%%', colors=colors_comp[:len(composition_counts)])
ax3.set_title('Overall Composition', fontweight='bold')
for autotext in autotexts:
    autotext.set_color('white')
    autotext.set_fontweight('bold')

# Plot 4: Size accuracy analysis (if available)
if 'exact_size_match' in unique_analysis_df.columns and unique_analysis_df['exact_size_match'].notna().any():
    ax4 = fig.add_subplot(gs[1, 0])

    exact_count = count_true('exact_size_match')
    close_count = count_true('close_size_match')
    reasonable_count = count_true('reasonable_size_match')
    total_with_size = len(unique_analysis_df[unique_analysis_df['exact_size_match'].notna()])

    if total_with_size > 0:
        categories = ['Exact\n(0bp diff)', 'Close\n(≤10bp diff)', 'Reasonable\n(≥80% ratio)']
        counts = [exact_count, close_count, reasonable_count]
        percentages = [(count / total_with_size * 100) for count in counts]

        bars4 = ax4.bar(categories, percentages, color=[COLORS['success'], COLORS['warning'], COLORS['secondary']], alpha=0.8)
        ax4.set_title('Size Accuracy Rates', fontweight='bold')
        ax4.set_ylabel('Percentage of Variants (%)')
        ax4.grid(axis='y', alpha=0.3)

        # Add value labels
        for bar, pct in zip(bars4, percentages):
            ax4.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 1,
                     f'{pct:.1f}%', ha='center', va='bottom', fontweight='bold')
else:
    ax4 = fig.add_subplot(gs[1, 0])
    ax4.text(0.5, 0.5, 'Size accuracy\ndata not available', ha='center', va='center',
             transform=ax4.transAxes, fontsize=12)
    ax4.set_title('Size Accuracy Analysis', fontweight='bold')

# Plot 5: Chromosomal distribution
ax5 = fig.add_subplot(gs[1, 1])
chrom_counts = unique_analysis_df['chrom'].value_counts().head(10)  # Top 10 chromosomes
bars5 = ax5.bar(range(len(chrom_counts)), chrom_counts.values, color=COLORS['accent'], alpha=0.8)
ax5.set_title('Top 10 Chromosomes (Variant Count)', fontweight='bold')
ax5.set_ylabel('Number of Variants')
ax5.set_xlabel('Chromosome')
ax5.set_xticks(range(len(chrom_counts)))
ax5.set_xticklabels(chrom_counts.index, rotation=45)
ax5.grid(axis='y', alpha=0.3)

# Plot 6: SV type distribution
ax6 = fig.add_subplot(gs[1, 2])
svtype_counts = unique_analysis_df['svtype'].value_counts()
bars6 = ax6.bar(svtype_counts.index, svtype_counts.values, color=COLORS['primary'], alpha=0.8)
ax6.set_title('Structural Variant Types', fontweight='bold')
ax6.set_ylabel('Number of Variants')
ax6.set_xlabel('SV Type')
ax6.grid(axis='y', alpha=0.3)

# Add value labels
for bar in bars6:
    height = bar.get_height()
    ax6.text(bar.get_x() + bar.get_width()/2., height + 1,
             f'{int(height)}', ha='center', va='bottom', fontweight='bold')

# Plot 7: Support reads vs Nova fraction scatter
ax7 = fig.add_subplot(gs[2, :])
scatter = ax7.scatter(unique_analysis_df['support_reads'], unique_analysis_df['nova_fraction'],
                     c=unique_analysis_df['is_single_nova_only'].map({True: COLORS['success'], False: COLORS['danger']}),
                     alpha=0.6, s=50)
ax7.set_title('Support Reads vs Nova Fraction', fontweight='bold')
ax7.set_xlabel('Number of Supporting Reads')
ax7.set_ylabel('Nova Fraction')
ax7.grid(True, alpha=0.3)
ax7.set_xlim(0, min(20, unique_analysis_df['support_reads'].max() + 1))
ax7.set_ylim(-0.05, 1.05)

# Add horizontal reference lines first so they can be included in the legend
all_nova_line = ax7.axhline(y=1.0, color='black', linestyle='--', alpha=0.5, label='All Nova (y=1.0)')
half_nova_line = ax7.axhline(y=0.5, color='gray', linestyle=':', alpha=0.5, label='50% Nova (y=0.5)')

# Add legend for scatter plot and reference lines
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=COLORS['success'], label='True Positives'),
                  Patch(facecolor=COLORS['danger'], label='False Positives'),
                  all_nova_line, half_nova_line]
ax7.legend(handles=legend_elements)

plt.suptitle(f'{EXPERIMENT_NAME}: Detailed Analysis', fontsize=16, fontweight='bold')

if SAVE_FIGURES:
    plt.savefig(data_path / f"{OUTPUT_PREFIX}_detailed_analysis.png", dpi=300, bbox_inches='tight')
plt.show()

# Print detailed summary statistics
print(f"\nDETAILED ANALYSIS SUMMARY")
print(f"Total unique variants analyzed: {len(unique_analysis_df):,}")
print(f"\nSupport Read Statistics:")
print(f"  Mean: {mean_support:.1f}")
print(f"  Median: {median_support:.1f}")
print(f"  Max: {unique_analysis_df['support_reads'].max()}")
print(f"  Single-read variants: {len(unique_analysis_df[unique_analysis_df['support_reads'] == 1]):,}")

print(f"\nNova Fraction Statistics:")
print(f"  Mean: {mean_nova_frac:.3f}")
print(f"  All nova reads (fraction=1.0): {len(unique_analysis_df[unique_analysis_df['nova_fraction'] == 1.0]):,}")
print(f"  Majority nova (fraction≥0.5): {len(unique_analysis_df[unique_analysis_df['nova_fraction'] >= 0.5]):,}")

if 'exact_size_match' in unique_analysis_df.columns and unique_analysis_df['exact_size_match'].notna().any():
    print(f"\nSize Accuracy Statistics:")
    total_with_size = len(unique_analysis_df[unique_analysis_df['exact_size_match'].notna()])
    exact_count = len(unique_analysis_df[unique_analysis_df['exact_size_match'] == True])
    print(f"  Variants with size data: {total_with_size:,}")
    print(f"  Exact matches: {exact_count:,} ({exact_count/total_with_size*100:.1f}%)")

print(f"\nChromosomal Distribution:")
print(f"  Chromosomes represented: {unique_analysis_df['chrom'].nunique()}")
print(f"  Most frequent: {chrom_counts.index[0]} ({chrom_counts.iloc[0]} variants)")
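
The size-accuracy panel labels its three tiers as exact (0 bp difference), close (≤10 bp difference), and reasonable (≥80% ratio). The sketch below shows one way such flags could be derived for a single insertion. The function name `classify_size_match` is hypothetical, and defining the ratio as smaller size divided by larger size is an assumption; the notebook only consumes the precomputed `exact_size_match`, `close_size_match`, and `reasonable_size_match` columns.

```python
def classify_size_match(expected_size, detected_size):
    """Return (exact, close, reasonable) flags for one insertion.

    Thresholds follow the plot labels: 0 bp, <=10 bp, and a >=80% ratio.
    The ratio definition (smaller / larger) is an assumption.
    """
    diff = abs(expected_size - detected_size)
    larger = max(expected_size, detected_size)
    ratio = min(expected_size, detected_size) / larger if larger > 0 else 0.0
    return diff == 0, diff <= 10, ratio >= 0.8


print(classify_size_match(300, 300))  # (True, True, True)
print(classify_size_match(300, 306))  # (False, True, True)
print(classify_size_match(300, 250))  # (False, False, True)
```

Under this definition the tiers are nested for typical insertion sizes, so the three bars in the panel are cumulative rather than mutually exclusive.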

In [None]:
# Calculate key metrics for executive summary
print(f"\n{'='*60}")
print(f"EXECUTIVE SUMMARY: {EXPERIMENT_NAME}")
print(f"{'='*60}")

# Basic metrics
total_variants = len(df)
unique_variants = df['variant_index'].nunique()
single_nova_calls = len(df[df['is_single_nova_only'] == True])

# Expected insertion count: use the configured override when provided,
# otherwise fall back to a placeholder of 1000
default_expected = sum(EXPECTED_INSERTIONS.values()) if EXPECTED_INSERTIONS else 1000

# Calculate metrics from summary if available, otherwise from data
if summary_data and 'overall_metrics' in summary_data:
    overall_metrics = summary_data['overall_metrics']
    total_expected = overall_metrics.get('total_expected', default_expected)  # fallback
    true_positive_rate = overall_metrics.get('true_positive_rate', 0)
    detection_quality = overall_metrics.get('detection_quality', 0)
    false_positives = overall_metrics.get('false_positives', 0)
else:
    # Calculate from data
    total_expected = default_expected
    true_positive_rate = (single_nova_calls / total_expected) * 100 if total_expected > 0 else 0
    detection_quality = (single_nova_calls / total_variants) * 100 if total_variants > 0 else 0
    false_positives = total_variants - single_nova_calls

# Composition breakdown
composition_counts = df['composition'].value_counts()

# False positive analysis
fp_df = df[df['is_single_nova_only'] == False]
fp_with_clustering = len(fp_df[fp_df['has_genomic_clustering'] == True]) if 'has_genomic_clustering' in df.columns else 0
fp_clustering_rate = (fp_with_clustering / len(fp_df) * 100) if len(fp_df) > 0 else 0

# Size accuracy (if available)
size_metrics = {}
if 'exact_size_match' in df.columns:
    exact_matches = len(df[df['exact_size_match'] == True])
    # close_size_match is optional, so check for it separately
    close_matches = len(df[df['close_size_match'] == True]) if 'close_size_match' in df.columns else 0
    size_metrics = {
        'exact_rate': (exact_matches / total_variants * 100) if total_variants > 0 else 0,
        'close_rate': (close_matches / total_variants * 100) if total_variants > 0 else 0
    }

print(f"\n📊 OVERALL PERFORMANCE")
print(f"   Total variants detected: {total_variants:,}")
print(f"   Unique variants: {unique_variants:,}")
print(f"   True positives (single nova-only): {single_nova_calls:,}")
print(f"   True positive rate: {true_positive_rate:.1f}%")
print(f"   Detection quality: {detection_quality:.1f}%")
print(f"   False positives: {false_positives:,}")

print(f"\n🎯 READ COMPOSITION")
for comp_type, count in composition_counts.items():
    percentage = (count / total_variants * 100) if total_variants > 0 else 0
    print(f"   {comp_type}: {count:,} ({percentage:.1f}%)")

print(f"\n⚠️ FALSE POSITIVE ANALYSIS")
print(f"   Total false positives: {len(fp_df):,}")
if 'has_genomic_clustering' in df.columns:
    print(f"   With genomic clustering: {fp_with_clustering:,} ({fp_clustering_rate:.1f}%)")
else:
    print(f"   Genomic clustering data: Not available")

if size_metrics:
    print(f"\n📏 SIZE ACCURACY")
    print(f"   Exact size matches: {size_metrics['exact_rate']:.1f}%")
    print(f"   Close size matches: {size_metrics['close_rate']:.1f}%")

# Quality assessment
print(f"\n✨ QUALITY ASSESSMENT")
if true_positive_rate >= 95:
    print(f"   ✅ Excellent true positive rate ({true_positive_rate:.1f}%)")
elif true_positive_rate >= 90:
    print(f"   ⚠️ Good true positive rate ({true_positive_rate:.1f}%)")
else:
    print(f"   ❌ Low true positive rate ({true_positive_rate:.1f}%) - investigate issues")

if fp_clustering_rate > 10:
    print(f"   ❌ High genomic clustering rate ({fp_clustering_rate:.1f}%) - consider window limit adjustment")
elif fp_clustering_rate > 5:
    print(f"   ⚠️ Moderate genomic clustering rate ({fp_clustering_rate:.1f}%)")
else:
    print(f"   ✅ Low genomic clustering rate ({fp_clustering_rate:.1f}%)")

if detection_quality >= 90:
    print(f"   ✅ High detection quality ({detection_quality:.1f}%)")
elif detection_quality >= 80:
    print(f"   ⚠️ Moderate detection quality ({detection_quality:.1f}%)")
else:
    print(f"   ❌ Low detection quality ({detection_quality:.1f}%) - high false positive rate")

print(f"\n{'='*60}")
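
The quality assessment above applies the same two-threshold pattern three times: 95/90 for the true positive rate, 90/80 for detection quality, and 5/10 for the clustering rate, where lower is better. A small helper could express that pattern once. The name `grade` and its return labels are hypothetical and are not part of the notebook.

```python
def grade(value, good, fair, higher_is_better=True):
    """Map a percentage onto 'good', 'fair', or 'poor' using two cut-offs."""
    if not higher_is_better:
        # Negate everything so that smaller values rank better.
        value, good, fair = -value, -good, -fair
    if value >= good:
        return "good"
    if value >= fair:
        return "fair"
    return "poor"


print(grade(96.2, good=95, fair=90))                        # good
print(grade(7.5, good=5, fair=10, higher_is_better=False))  # fair
print(grade(72.0, good=90, fair=80))                        # poor
```

With `higher_is_better=False` the cut-offs are inclusive, so a clustering rate of exactly 5% grades as good and exactly 10% as fair, matching the strict `>` comparisons in the cell above.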

## Summary and Conclusions

This notebook template provides a foundation for nova experiment analysis. The cells above demonstrate the basic structure for:

- **Configuration**: Parameterized setup for any experiment
- **Data Loading**: Robust validation and loading of CSV/JSON output
- **Executive Summary**: Automated performance assessment with quality indicators

### Next Steps for Phase 1 Iteration:

1. **Test with real data** to validate the data loading and summary generation
2. **Add visualization cells** for true positive and false positive analysis
3. **Expand quality assessment** with more sophisticated recommendations
4. **Add detailed analysis sections** for read composition and size accuracy

### Template Features:

- ✅ **Parameterized configuration** for any experiment
- ✅ **Robust data validation** with error handling
- ✅ **Executive summary** with automated quality assessment
- ✅ **Consistent styling** and color scheme
- ✅ **Emoji indicators** for quick visual assessment
- ✅ **Flexible data sources** (works with or without JSON summary)

### Ready for Iteration!

This Phase 1 template provides a foundation for further work. Test it with existing nova experiment data, then decide which sections to expand or modify in the next iteration.