# Tool Performance Comparison on Simulated Data

This notebook analyzes tool-specific performance on simulated CRISPR spacer data using the results from the distance metric analysis sweep. We focus on comparing individual tool performance using Hamming distance with threshold ≤5, leveraging the performance results already generated by the `compare-results` command.
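For reference, Hamming distance counts the positions at which two equal-length sequences differ, with no insertions or deletions allowed. The sketch below illustrates the idea only; the function names `hamming_distance` and `within_threshold` and the example sequences are hypothetical and are not taken from the benchmark code.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def within_threshold(spacer: str, target: str, max_mismatches: int = 5) -> bool:
    """Return True if target matches spacer with at most max_mismatches substitutions."""
    return hamming_distance(spacer, target) <= max_mismatches


# Illustrative sequences (not from the simulated data)
spacer = "ACGTACGTAC"
target = "ACGTTCGTAA"  # differs from spacer at positions 4 and 9 (0-based)
print(hamming_distance(spacer, target))
print(within_threshold(spacer, target, max_mismatches=5))
```

A match at threshold N in this notebook is read as "at most N mismatches", which is why recall is evaluated for every threshold from 0 to 5.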

## Key Focus Areas:
1. **Per-tool performance metrics** (precision, recall, F1)
2. **Tool-vs-tool comparisons** at different mismatch thresholds
3. **Performance characteristics** across different dataset sizes
4. **False positive and false negative analysis** per tool

This differs from `distance_metric_analysis.ipynb`, which focuses on aggregate, tool-independent metrics to compare Hamming and edit distance.
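The per-tool metrics listed above follow the usual definitions from true positive, false positive, and false negative counts. The sketch below shows those standard formulas; the helper name `precision_recall_f1` and the counts are hypothetical, and the `compare-results` command may handle edge cases differently.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from raw counts, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```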

In [45]:
import os
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

os.chdir('/clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/')

import polars as pl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import glob
from bench.utils.functions import *
import altair as alt
import json

pl.Config.set_tbl_cols(n=-1)

# Configuration - using Hamming distance with threshold 5 as specified
DISTANCE_METRIC = 'hamming'
MAX_THRESHOLD = 5
THRESHOLDS = list(range(0, MAX_THRESHOLD + 1))

# Load tool styles for consistent colors and markers
with open('notebooks/tool_styles.json', 'r') as f:
    TOOL_STYLES = json.load(f)

print(f"Configuration: {DISTANCE_METRIC} distance, thresholds 0-{MAX_THRESHOLD}")
print(f"Loaded tool styles for {len(TOOL_STYLES)} tools")

Configuration: hamming distance, thresholds 0-5
Loaded tool styles for 11 tools


## Load Performance Results

We'll load the performance results generated by the `compare-results` command (from distance_metric_analysis sweep) for each simulation.

In [46]:
def load_performance_results(sim_dir, distance_metric='hamming', max_mismatches=3):
    """Load performance results TSV file for specific metric and threshold."""
    perf_file = Path(sim_dir) / f'performance_results_{distance_metric}_mm{max_mismatches}.tsv'
    if not perf_file.exists():
        print(f"Performance results not found: {perf_file}")
        return None
    
    return pl.read_csv(str(perf_file), separator='\t')

def format_num(n):
    """Format numbers for display (e.g., 100000 -> 100k)"""
    if n >= 1000000:
        return f"{n//1000000}M"
    elif n < 1000:
        return f"{n}"
    else:
        return f"{n//1000}k"

# Find all simulation directories
simulated_base_dir = "results/simulated"
simulation_dirs = sorted(glob.glob(os.path.join(simulated_base_dir, "ns_*")))

# Create simulation name mapping
SIMULATION_NAMES = {}
valid_sim_dirs = []

for sim_dir in simulation_dirs:
    sim_data_dir = Path(sim_dir) / "simulated_data"
    if not sim_data_dir.exists():
        continue
        
    prefix = os.path.basename(sim_dir)
    parts = prefix.split('_')
    
    if len(parts) >= 4:
        try:
            n_spacers = int(parts[1])
            n_contigs = int(parts[3])
            SIMULATION_NAMES[prefix] = f"{format_num(n_spacers)} spacers × {format_num(n_contigs)} contigs"
        except ValueError:
            # Fall back to the raw directory name if the counts can't be parsed
            SIMULATION_NAMES[prefix] = prefix
        valid_sim_dirs.append(sim_dir)

print(f"Found {len(SIMULATION_NAMES)} simulations:")
for prefix, desc in sorted(SIMULATION_NAMES.items()):
    print(f"  {prefix}: {desc}")

Found 8 simulations:
  ns_100000_nc_10000: 100k spacers × 10k contigs
  ns_100000_nc_20000: 100k spacers × 20k contigs
  ns_100_nc_50000: 100 spacers × 50k contigs
  ns_3826979_nc_421431_real_baseline: 3M spacers × 421k contigs
  ns_500000_nc_100000: 500k spacers × 100k contigs
  ns_50000_nc_5000: 50k spacers × 5k contigs
  ns_500_nc_5000_HIGH_INSERTION_RATE: 500 spacers × 5k contigs
  ns_75000_nc_5000: 75k spacers × 5k contigs


## Select Primary Simulation for Detailed Analysis

Let's focus on one simulation for detailed tool comparison analysis.

In [47]:
# Select primary simulation - using the 100k spacers × 20k contigs as the main analysis
PRIMARY_SIM = "ns_100000_nc_20000"
primary_sim_dir = f"results/simulated/{PRIMARY_SIM}"

# Also analyze the high insertion rate simulation
HIGH_INS_SIM = "ns_500_nc_5000_HIGH_INSERTION_RATE"
high_ins_sim_dir = f"results/simulated/{HIGH_INS_SIM}"

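# Load performance data for the primary threshold (3 mismatches)
primary_threshold = 3
perf_df = load_performance_results(primary_sim_dir, DISTANCE_METRIC, primary_threshold)
if perf_df is None:
    # Fail early with a clear message instead of an AttributeError below
    raise FileNotFoundError(f"No performance results for {PRIMARY_SIM} at threshold={primary_threshold}")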

print(f"Loaded performance results for {PRIMARY_SIM} at threshold={primary_threshold}")
print(f"Shape: {perf_df.shape}")
print(f"\nTools found: {perf_df['tool'].unique().to_list()}")

Loaded performance results for ns_100000_nc_20000 at threshold=3
Shape: (12, 18)

Tools found: ['mmseqs2', 'sassy', 'all_tools_combined', 'bowtie1', 'minimap2', 'mummer4', 'bowtie2', 'blastn', 'x_mapper', 'indelfree_indexed', 'indelfree_bruteforce', 'strobealign']


In [48]:
# View the recall metrics - focus on augmented recall
perf_df.select([
    'tool', 
    'recall_augmented',
    'recall_planned', 
    'planned_true_positives',
    'all_true_positives',
    'false_negatives_planned',
    'ground_truth_augmented'
]).sort('recall_augmented', descending=True)

| tool | recall_augmented | recall_planned | planned_true_positives | all_true_positives | false_negatives_planned | ground_truth_augmented |
|---|---|---|---|---|---|---|
| str | f64 | f64 | i64 | i64 | i64 | i64 |
| bowtie1 | 1.0 | 1.0 | 300084 | 300248 | 0 | 300248 |
| indelfree_indexed | 1.0 | 1.0 | 300084 | 300248 | 0 | 300248 |
| all_tools_combined | 1.0 | 1.0 | 300084 | 300248 | 0 | 300248 |
| indelfree_bruteforce | 1.0 | 1.0 | 300084 | 300248 | 0 | 300248 |
| sassy | 0.99996 | 1.0 | 300084 | 300236 | 0 | 300248 |
| bowtie2 | 0.976656 | 0.97714 | 293224 | 293239 | 6860 | 300248 |
| blastn | 0.779919 | 0.780128 | 234104 | 234169 | 65980 | 300248 |
| mmseqs2 | 0.720611 | 0.720998 | 216360 | 216362 | 83724 | 300248 |
| strobealign | 0.545409 | 0.545687 | 163752 | 163758 | 136332 | 300248 |
| mummer4 | 0.469898 | 0.470152 | 141085 | 141086 | 158999 | 300248 |

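The numbers in the table above are consistent with `recall_augmented = all_true_positives / ground_truth_augmented` and `recall_planned = planned_true_positives / (planned_true_positives + false_negatives_planned)`. These relationships are inferred from the reported values, not from the `compare-results` source, so treat them as an assumption. A quick arithmetic check on three rows:

```python
# Rows copied from the table above:
# (tool, planned_tp, all_tp, fn_planned, ground_truth_augmented,
#  reported recall_augmented, reported recall_planned)
rows = [
    ("sassy",   300084, 300236,     0, 300248, 0.99996,  1.0),
    ("bowtie2", 293224, 293239,  6860, 300248, 0.976656, 0.97714),
    ("blastn",  234104, 234169, 65980, 300248, 0.779919, 0.780128),
]

computed = {}
for tool, planned_tp, all_tp, fn_planned, gt_aug, rep_aug, rep_planned in rows:
    # Assumed definitions, inferred from the table values
    recall_aug = all_tp / gt_aug
    recall_planned = planned_tp / (planned_tp + fn_planned)
    computed[tool] = (recall_aug, recall_planned)
    print(f"{tool:8s} augmented={recall_aug:.6f} (reported {rep_aug}) "
          f"planned={recall_planned:.6f} (reported {rep_planned})")
```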

## Tool Recall at Different Hamming Distance Thresholds

Load all thresholds for the primary simulation to see how each tool performs when allowing up to N mismatches. This reveals which tools lose recall beyond a certain distance threshold.

In [49]:
# Load performance data for all thresholds
all_threshold_data = []

for threshold in THRESHOLDS:
    df = load_performance_results(primary_sim_dir, DISTANCE_METRIC, threshold)
    if df is not None:
        # Add threshold column
        df = df.with_columns(pl.lit(threshold).alias('threshold'))
        all_threshold_data.append(df)

# Combine all thresholds
combined_perf = pl.concat(all_threshold_data)

# Filter out the aggregate tool
tools_only = combined_perf.filter(pl.col('tool') != 'all_tools_combined')

print(f"Total rows: {tools_only.shape[0]}")
print(f"Unique tools: {tools_only['tool'].n_unique()}")
print(f"Thresholds: {sorted(tools_only['threshold'].unique().to_list())}")

Total rows: 66
Unique tools: 11
Thresholds: [0, 1, 2, 3, 4, 5]
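One way to make "loses recall beyond a certain threshold" concrete is to report, for each tool, the largest threshold up to which recall stays at or above a chosen cutoff. The sketch below uses plain Python and made-up recall values; the function name `max_threshold_meeting_cutoff` and the 0.95 cutoff are arbitrary choices for illustration.

```python
from collections import defaultdict


def max_threshold_meeting_cutoff(records, cutoff=0.95):
    """Map each tool to the largest threshold t such that recall >= cutoff
    at every threshold up to and including t (None if it fails at the first)."""
    by_tool = defaultdict(dict)
    for tool, threshold, recall in records:
        by_tool[tool][threshold] = recall

    result = {}
    for tool, recalls in by_tool.items():
        best = None
        for t in sorted(recalls):
            if recalls[t] < cutoff:
                break
            best = t
        result[tool] = best
    return result


# Made-up (tool, threshold, recall) records, for illustration only
records = [
    ("tool_a", 0, 1.00), ("tool_a", 1, 1.00), ("tool_a", 2, 0.99), ("tool_a", 3, 0.97),
    ("tool_b", 0, 1.00), ("tool_b", 1, 0.96), ("tool_b", 2, 0.80), ("tool_b", 3, 0.55),
    ("tool_c", 0, 0.40), ("tool_c", 1, 0.35),
]
limits = max_threshold_meeting_cutoff(records)
print(limits)
```

With the real data, the records could be built from the `tool`, `threshold`, and `recall_augmented` columns of the combined table.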


## Visualize Tool Recall by Hamming Distance Threshold

The chart uses the tool-specific colors and markers from `tool_styles.json`.

In [50]:
# Check the data before plotting
print(f"Data shape: {tools_only.shape}")
print(f"Sample data:")
print(tools_only.select(['tool', 'threshold', 'recall_augmented']).head(20))
print(f"\nTools: {tools_only['tool'].unique().to_list()}")
print(f"Recall range: {tools_only['recall_augmented'].min()} to {tools_only['recall_augmented'].max()}")

# Check if all tools in data are in TOOL_STYLES
data_tools = set(tools_only['tool'].unique().to_list())
style_tools = set(TOOL_STYLES.keys())
print(f"\nTools in data but not in styles: {data_tools - style_tools}")
print(f"Tools in styles but not in data: {style_tools - data_tools}")

# Convert to pandas for Altair (Altair works better with pandas)
tools_pandas = tools_only.select(['tool', 'threshold', 'recall_augmented', 'all_true_positives', 'false_negatives_augmented']).to_pandas()

# Create color and shape mappings from tool styles
color_scale = alt.Scale(
    domain=list(TOOL_STYLES.keys()),
    range=[TOOL_STYLES[tool]['color'] for tool in TOOL_STYLES.keys()]
)

shape_scale = alt.Scale(
    domain=list(TOOL_STYLES.keys()),
    range=[TOOL_STYLES[tool]['marker'] for tool in TOOL_STYLES.keys()]
)

# Create interactive plot of recall (augmented) at different hamming distance thresholds
chart = alt.Chart(tools_pandas).mark_line(point=True).encode(
    x=alt.X('threshold:Q', title='Hamming Distance Threshold (≤)'),
    y=alt.Y('recall_augmented:Q', title='Recall (Augmented)', scale=alt.Scale(domain=[0, 1.05])),
    color=alt.Color('tool:N', title='Tool', scale=color_scale),
    shape=alt.Shape('tool:N', title='Tool', scale=shape_scale),
    tooltip=['tool', 'threshold', 'recall_augmented', 'all_true_positives', 'false_negatives_augmented']
).properties(
    width=700,
    height=450,
    title=f'Tool Recall vs Hamming Distance Threshold ({PRIMARY_SIM})'
).interactive()

chart

Data shape: (66, 19)
Sample data:
shape: (20, 3)
┌──────────────────────┬───────────┬──────────────────┐
│ tool                 ┆ threshold ┆ recall_augmented │
│ ---                  ┆ ---       ┆ ---              │
│ str                  ┆ i32       ┆ f64              │
╞══════════════════════╪═══════════╪══════════════════╡
│ bowtie1              ┆ 0         ┆ 1.0              │
│ indelfree_indexed    ┆ 0         ┆ 1.0              │
│ indelfree_bruteforce ┆ 0         ┆ 1.0              │
│ x_mapper             ┆ 0         ┆ 0.989767         │
│ mummer4              ┆ 0         ┆ 1.0              │
│ strobealign          ┆ 0         ┆ 0.983039         │
│ minimap2             ┆ 0         ┆ 0.001231         │
│ sassy                ┆ 0         ┆ 1.0              │
│ blastn               ┆ 0         ┆ 1.0              │
│ mmseqs2              ┆ 0         ┆ 0.996803         │
│ bowtie2              ┆ 0         ┆ 1.0              │
│ minimap2             ┆ 1         ┆ 0.000613         │

In [51]:
# Save the visualization
output_dir = f"{primary_sim_dir}/plots"
os.makedirs(output_dir, exist_ok=True)

chart.save(f'{output_dir}/tool_recall_exact_hamming.html')
chart.save(f'{output_dir}/tool_recall_exact_hamming.json', format='json')
print(f"Saved visualization to {output_dir}/tool_recall_exact_hamming.html")

Saved visualization to results/simulated/ns_100000_nc_20000/plots/tool_recall_exact_hamming.html


## High Insertion Rate Simulation Analysis

Some tools may struggle with spacers that have high occurrence rates. The `HIGH_INSERTION_RATE` simulation is used to test this.

In [52]:
# Load high insertion rate simulation data for all thresholds
high_ins_data = []

for threshold in THRESHOLDS:
    df = load_performance_results(high_ins_sim_dir, DISTANCE_METRIC, threshold)
    if df is not None:
        df = df.with_columns(pl.lit(threshold).alias('threshold'))
        high_ins_data.append(df)

# Combine and filter
high_ins_combined = pl.concat(high_ins_data)
high_ins_tools = high_ins_combined.filter(pl.col('tool') != 'all_tools_combined')

print(f"High insertion rate simulation - Total rows: {high_ins_tools.shape[0]}")
print(f"Unique tools: {high_ins_tools['tool'].n_unique()}")
print(f"Thresholds: {sorted(high_ins_tools['threshold'].unique().to_list())}")

High insertion rate simulation - Total rows: 60
Unique tools: 10
Thresholds: [0, 1, 2, 3, 4, 5]


In [53]:
# Plot recall for high insertion rate simulation
high_ins_chart = alt.Chart(high_ins_tools).mark_line(point=True).encode(
    x=alt.X('threshold:Q', title='Hamming Distance Threshold (≤)'),
    y=alt.Y('recall_augmented:Q', title='Recall (Augmented)', scale=alt.Scale(domain=[0, 1.05])),
    color=alt.Color('tool:N', title='Tool', scale=color_scale),
    shape=alt.Shape('tool:N', title='Tool', scale=shape_scale),
    tooltip=['tool', 'threshold', 'recall_augmented', 'all_true_positives', 'false_negatives_augmented']
).properties(
    width=700,
    height=450,
    title=f'Tool Recall vs Hamming Distance Threshold ({HIGH_INS_SIM})'
).interactive()

high_ins_chart

In [54]:
# Save high insertion rate visualization
high_ins_output_dir = f"{high_ins_sim_dir}/plots"
os.makedirs(high_ins_output_dir, exist_ok=True)

high_ins_chart.save(f'{high_ins_output_dir}/tool_recall_exact_hamming.html')
high_ins_chart.save(f'{high_ins_output_dir}/tool_recall_exact_hamming.json', format='json')
print(f"Saved high insertion rate visualization to {high_ins_output_dir}/tool_recall_exact_hamming.html")

Saved high insertion rate visualization to results/simulated/ns_500_nc_5000_HIGH_INSERTION_RATE/plots/tool_recall_exact_hamming.html


In [55]:
# Side-by-side comparison: Primary vs High Insertion Rate at threshold=3
threshold_3_primary = tools_only.filter(pl.col('threshold') == 3).with_columns(
    pl.lit(PRIMARY_SIM).alias('simulation')
)

threshold_3_high_ins = high_ins_tools.filter(pl.col('threshold') == 3).with_columns(
    pl.lit(HIGH_INS_SIM).alias('simulation')
)

combined_compare = pl.concat([threshold_3_primary, threshold_3_high_ins])

comparison_chart = alt.Chart(combined_compare).mark_bar().encode(
    x=alt.X('tool:N', title='Tool', sort='-y'),
    y=alt.Y('recall_augmented:Q', title='Recall (Augmented)', scale=alt.Scale(domain=[0, 1.05])),
    color=alt.Color('simulation:N', title='Simulation'),
    column=alt.Column('simulation:N', title=''),
    tooltip=['tool', 'simulation', 'recall_augmented', 'all_true_positives', 'false_negatives_augmented']
).properties(
    width=350,
    height=400,
    title='Tool Recall Comparison: Primary vs High Insertion Rate (Hamming ≤ 3)'
)

comparison_chart
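To put a number on the side-by-side comparison, the per-tool recall in one simulation can be subtracted from the other. The sketch below uses plain dictionaries with made-up values; `recall_delta` is a hypothetical helper, and tools missing from either simulation are skipped, since the high insertion rate run reports fewer tools than the primary one.

```python
def recall_delta(primary, other):
    """Return other-minus-primary recall for tools present in both simulations."""
    shared = sorted(primary.keys() & other.keys())
    return {tool: other[tool] - primary[tool] for tool in shared}


# Made-up recall values, for illustration only
primary = {"tool_a": 1.00, "tool_b": 0.75, "tool_c": 0.50}
high_insertion = {"tool_a": 1.00, "tool_b": 0.50}  # tool_c has no results here

deltas = recall_delta(primary, high_insertion)
for tool, delta in deltas.items():
    print(f"{tool}: {delta:+.2f}")
```

A negative delta means the tool recovers a smaller fraction of matches in the second simulation.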

## Tool Recall Rankings Across Multiple Simulations

Compare how tools rank across different simulation sizes based on recall (augmented).

In [56]:
# Load threshold=3 results from all simulations
all_sims_data = []

# Exclude the real baseline
exclude_sims = ['ns_3826979_nc_421431_real_baseline']

for sim_prefix in SIMULATION_NAMES.keys():
    if sim_prefix in exclude_sims:
        continue
        
    sim_dir = os.path.join(simulated_base_dir, sim_prefix)
    df = load_performance_results(sim_dir, DISTANCE_METRIC, primary_threshold)
    
    if df is not None:
        # Filter out aggregate tool
        df = df.filter(pl.col('tool') != 'all_tools_combined')
        df = df.with_columns([
            pl.lit(sim_prefix).alias('simulation'),
            pl.lit(SIMULATION_NAMES[sim_prefix]).alias('sim_description')
        ])
        all_sims_data.append(df)

all_sims_combined = pl.concat(all_sims_data)

print(f"Loaded data from {len(all_sims_data)} simulations")
print(f"Total rows: {all_sims_combined.shape[0]}")
print(f"Simulations: {all_sims_combined['simulation'].unique().to_list()}")

Performance results not found: results/simulated/ns_100_nc_50000/performance_results_hamming_mm3.tsv
Performance results not found: results/simulated/ns_500000_nc_100000/performance_results_hamming_mm3.tsv
Loaded data from 5 simulations
Total rows: 54
Simulations: ['ns_100000_nc_20000', 'ns_50000_nc_5000', 'ns_100000_nc_10000', 'ns_500_nc_5000_HIGH_INSERTION_RATE', 'ns_75000_nc_5000']
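Two simulations have no results file at this threshold. A small availability check can be run before loading to see which thresholds exist for a given simulation directory. The sketch below reuses the file naming pattern from `load_performance_results`; the helper name `available_thresholds` is hypothetical, and the demonstration runs on a temporary directory rather than the real results.

```python
import tempfile
from pathlib import Path


def available_thresholds(sim_dir, distance_metric="hamming", thresholds=range(6)):
    """Return the thresholds for which a performance results TSV exists in sim_dir."""
    sim_dir = Path(sim_dir)
    return [
        t for t in thresholds
        if (sim_dir / f"performance_results_{distance_metric}_mm{t}.tsv").exists()
    ]


# Demonstration on a throwaway directory holding only the mm0 and mm3 files
with tempfile.TemporaryDirectory() as tmp:
    for t in (0, 3):
        (Path(tmp) / f"performance_results_hamming_mm{t}.tsv").touch()
    found = available_thresholds(tmp)

print(found)
```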


# Pivot table showing recall across simulations
recall_pivot = all_sims_combined.select(['tool', 'simulation', 'recall_augmented']).pivot(
    index='tool',
    on='simulation',
    values='recall_augmented'
).sort('tool')

recall_pivot
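
Since this section is about how tools rank, it can help to convert recall into a rank within each simulation. The sketch below applies dense ranking (tied tools share a rank) to made-up values in plain Python; `rank_tools` is a hypothetical helper, not part of the benchmark code.

```python
def rank_tools(recall_by_tool):
    """Dense-rank tools by recall (1 = best); tools with equal recall share a rank."""
    distinct = sorted(set(recall_by_tool.values()), reverse=True)
    rank_of = {value: i + 1 for i, value in enumerate(distinct)}
    return {tool: rank_of[recall] for tool, recall in recall_by_tool.items()}


# Made-up recall values per simulation, for illustration only
recall_by_sim = {
    "sim_small": {"tool_a": 1.0, "tool_b": 1.0, "tool_c": 0.8},
    "sim_large": {"tool_a": 1.0, "tool_b": 0.7, "tool_c": 0.9},
}
ranks = {sim: rank_tools(recalls) for sim, recalls in recall_by_sim.items()}
for sim, sim_ranks in ranks.items():
    print(sim, sim_ranks)
```

Comparing ranks rather than raw recall makes it easier to spot tools whose relative position changes between simulations.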

In [57]:
# Create recall heatmap across all simulations
recall_heatmap = alt.Chart(all_sims_combined).mark_rect().encode(
    x=alt.X('simulation:N', title='Simulation'),
    y=alt.Y('tool:N', title='Tool'),
    color=alt.Color('recall_augmented:Q', 
                    scale=alt.Scale(scheme='viridis', domain=[0, 1]),
                    title='Recall (Augmented)'),
    tooltip=['tool', 'simulation', 'recall_augmented', 'all_true_positives', 'false_negatives_augmented']
).properties(
    width=600,
    height=400,
    title='Tool Recall Across Simulations (Hamming Distance ≤ 3)'
)

recall_heatmap

## Tool Recall Summary Statistics

Calculate mean recall across all simulations to identify best performers.

In [58]:
# Calculate mean and std recall across simulations
tool_summary = all_sims_combined.group_by('tool').agg([
    pl.col('recall_augmented').mean().alias('mean_recall'),
    pl.col('recall_augmented').std().alias('std_recall'),
    pl.col('all_true_positives').sum().alias('total_tp'),
    pl.col('false_negatives_augmented').sum().alias('total_fn'),
]).sort('mean_recall', descending=True)

tool_summary

| tool | mean_recall | std_recall | total_tp | total_fn |
|---|---|---|---|---|
| str | f64 | f64 | i64 | i64 |
| indelfree_indexed | 1.0 | 0.0 | 1169650 | 0 |
| bowtie1 | 1.0 | 0.0 | 1169650 | 0 |
| indelfree_bruteforce | 1.0 | 0.0 | 1169650 | 0 |
| sassy | 0.999978 | 1.7e-05 | 1169625 | 25 |
| bowtie2 | 0.977158 | 0.000684 | 1143139 | 26511 |
| blastn | 0.780153 | 0.001185 | 912164 | 257486 |
| mmseqs2 | 0.683416 | 0.08627 | 763497 | 406153 |
| strobealign | 0.496743 | 0.17347 | 505144 | 664506 |
| mummer4 | 0.477306 | 0.012812 | 563393 | 606257 |
| x_mapper | 0.371624 | 0.178747 | 354550 | 815100 |

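`mean_recall` averages the per-simulation recall values, so a small simulation counts as much as a large one. Pooling the counts instead gives a recall weighted by the number of ground-truth matches. The sketch below computes that pooled value from `total_tp` and `total_fn` in the table above, assuming `total_tp + total_fn` is the pooled ground truth (the two columns sum to 1,169,650 for every tool, which is consistent with that assumption). With these numbers, mummer4's pooled recall (about 0.48) comes out above strobealign's (about 0.43), the reverse of their order by mean recall.

```python
# (total_tp, total_fn, mean_recall) copied from the summary table above
summary = {
    "bowtie2":     (1143139, 26511, 0.977158),
    "mmseqs2":     (763497, 406153, 0.683416),
    "strobealign": (505144, 664506, 0.496743),
    "mummer4":     (563393, 606257, 0.477306),
    "x_mapper":    (354550, 815100, 0.371624),
}

# Pooled recall: total true positives over total ground truth (assumed tp + fn)
pooled_recall = {tool: tp / (tp + fn) for tool, (tp, fn, _) in summary.items()}

for tool, (_, _, mean_recall) in summary.items():
    print(f"{tool:12s} mean={mean_recall:.4f} pooled={pooled_recall[tool]:.4f}")
```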

## Recall Differences Between Tools

Identify which tools consistently outperform or underperform others.

In [62]:
# Visualize mean recall with error bars
error_bar_data = all_sims_combined.group_by('tool').agg([
    pl.col('recall_augmented').mean().alias('mean_recall'),
    pl.col('recall_augmented').std().alias('std_recall'),
])

# Create error bar chart
base_chart = alt.Chart(error_bar_data).encode(
    x=alt.X('tool:N', title='Tool', sort='-y'),
)

# std_recall is precomputed, so pass it through yError instead of setting an extent
error_bars = base_chart.mark_errorbar().encode(
    y=alt.Y('mean_recall:Q', title='Recall (Augmented)', scale=alt.Scale(domain=[0, 1.05])),
    yError='std_recall:Q'
)

points = base_chart.mark_point(filled=True, size=100).encode(
    y='mean_recall:Q',
    color=alt.Color('tool:N', scale=color_scale, legend=None),
    tooltip=['tool', 'mean_recall', 'std_recall']
)

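mean_recall_chart = (error_bars + points).properties(
    width=700,
    height=400,
    title='Mean Recall ± Std Dev Across Simulations (Hamming Distance ≤ 3)'
)
# Not the last expression in the cell, so display it explicitly
mean_recall_chart.display()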

# Create pairwise recall differences at threshold=3 for primary simulation
threshold_3_primary = tools_only.filter(pl.col('threshold') == 3)

# Create difference matrix
tools_list = sorted(threshold_3_primary['tool'].unique().to_list())
recall_dict = {row['tool']: row['recall_augmented'] for row in threshold_3_primary.to_dicts()}

# Build difference matrix (row - column)
diff_matrix = []
for tool_row in tools_list:
    row_data = {'tool': tool_row}
    for tool_col in tools_list:
        if tool_row == tool_col:
            row_data[tool_col] = 0.0
        else:
            row_data[tool_col] = recall_dict[tool_row] - recall_dict[tool_col]
    diff_matrix.append(row_data)

diff_df = pl.DataFrame(diff_matrix)
diff_df
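
The difference matrix is antisymmetric: the entry for (row, column) is the negative of the entry for (column, row), and the diagonal is zero. The sketch below shows a compact way to build the same kind of matrix with a nested dictionary comprehension; the values are made up and `pairwise_differences` is a hypothetical helper.

```python
def pairwise_differences(recall):
    """Return a nested dict where result[row][col] = recall[row] - recall[col]."""
    tools = sorted(recall)
    return {row: {col: recall[row] - recall[col] for col in tools} for row in tools}


# Made-up recall values, for illustration only
recall = {"tool_a": 1.0, "tool_b": 0.75, "tool_c": 0.5}
diff = pairwise_differences(recall)

for row, cols in diff.items():
    print(row, {col: f"{value:+.2f}" for col, value in cols.items()})
```

A positive value means the row tool has higher recall than the column tool.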

In [None]:
# Visualize recall difference matrix
# Melt for heatmap
diff_long = diff_df.unpivot(
    index='tool',
    on=tools_list,
    variable_name='tool_compared',
    value_name='recall_diff'
)

diff_heatmap = alt.Chart(diff_long).mark_rect().encode(
    x=alt.X('tool_compared:N', title='Tool (compared to)'),
    y=alt.Y('tool:N', title='Tool'),
    color=alt.Color('recall_diff:Q', 
                    scale=alt.Scale(scheme='redblue', domain=[-1, 1], domainMid=0),
                    title='Recall Difference'),
    tooltip=['tool', 'tool_compared', 'recall_diff']
).properties(
    width=500,
    height=500,
    title=f'Recall Differences (row minus column) at Hamming ≤ 3 ({PRIMARY_SIM})'
)

diff_heatmap


## Summary

This notebook analyzed tool-specific recall performance on simulated CRISPR spacer data:

1. **Hamming Distance Threshold Analysis**: Evaluated recall when allowing up to N mismatches (Hamming distance thresholds 0-5), revealing which tools lose recall as the threshold increases
2. **Primary Simulation** (`ns_100000_nc_20000`): Standard simulation for baseline tool comparison
3. **High Insertion Rate Simulation** (`ns_500_nc_5000_HIGH_INSERTION_RATE`): Tests tool performance on spacers with high occurrence rates
4. **Cross-Simulation Comparison**: Identified consistent performers across different dataset sizes

Key findings focus on **recall (augmented)** differences between tools, using consistent styling from tool_styles.json.