# DALI Prefiltering with Conformal Guarantees

This notebook reproduces **Tables 4-6** from the paper:
"Functional protein mining with conformal guarantees" (Nature Communications 2025)

## Overview

We use conformal prediction to calibrate a score threshold that controls the
FNR (False Negative Rate), then use that threshold to prefilter candidates before
expensive DALI structural alignment.

**Key Results from Paper:**
- TPR (True Positive Rate): ~82.8%
- Database Reduction: ~31.5%

## Data Requirements

This analysis requires FoldSeek alignment scores between SCOPe test proteins and
the lookup database. The pre-computed results are available in:
`results/dali_thresholds.csv`

For regenerating the analysis, you need:
- `foldseek_near_ids_scope_test_v_lookup.npy` (FoldSeek scores)
- SCOPe classification labels

In [None]:
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Add project root to path
repo_root = Path.cwd().parent.parent
sys.path.insert(0, str(repo_root))

print(f"Repository root: {repo_root}")

## Load Pre-computed DALI Results

These results were computed using the Learn-then-Test (LTT) calibration procedure
across multiple trials.
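
As a rough illustration of what one LTT calibration trial involves, the sketch below tests candidate thresholds in order from most to least conservative and keeps the last one whose risk is certified to be below `alpha`. It is not the code behind `scripts/verify_dali.py`: the function names `hoeffding_pvalue` and `ltt_threshold`, the choice of a Hoeffding p-value, and the synthetic loss matrix are all assumptions made for illustration.

```python
import numpy as np


def hoeffding_pvalue(risk_hat: float, n: int, alpha: float) -> float:
    """P-value for H0: true risk > alpha, valid for losses bounded in [0, 1]."""
    gap = max(alpha - risk_hat, 0.0)
    return float(np.exp(-2.0 * n * gap**2))


def ltt_threshold(losses, lambdas, alpha, delta):
    """Fixed-sequence testing over thresholds (hypothetical helper).

    losses[i, j] is the loss (e.g. per-query FNR) of calibration example i
    at lambdas[j]. lambdas must be ordered from most conservative (lowest
    risk) to least conservative. Returns the last certified threshold, or
    None if even the first one cannot be certified.
    """
    losses = np.asarray(losses, dtype=float)
    n = losses.shape[0]
    chosen = None
    for j, lam in enumerate(lambdas):
        p = hoeffding_pvalue(losses[:, j].mean(), n, alpha)
        if p > delta:
            break  # stop at the first threshold we cannot certify
        chosen = float(lam)
    return chosen


# Synthetic calibration losses: risk grows as the threshold gets stricter.
lambdas = np.array([0.1, 0.2, 0.3, 0.4])
losses = np.tile([0.0, 0.02, 0.05, 0.2], (1000, 1))
lam_hat = ltt_threshold(losses, lambdas, alpha=0.1, delta=0.1)
print(f"Certified threshold: {lam_hat}")
```

Repeating this over many random calibration splits would give one threshold per trial, which is the kind of per-trial table loaded in the next cell.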

In [None]:
# Load pre-computed results
results_path = repo_root / "results" / "dali_thresholds.csv"

if results_path.exists():
    df = pd.read_csv(results_path)
    print(f"Loaded {len(df)} trials from {results_path.name}")
    print(f"\nColumns: {df.columns.tolist()}")
else:
    # Fail fast: later cells need `df`, so do not continue without it
    raise FileNotFoundError(
        f"Results not found at {results_path}. "
        "Run scripts/verify_dali.py to generate them."
    )

In [None]:
# Display first few rows
df.head()

## Analyze Key Metrics

The key metrics from the paper:
- **TPR (True Positive Rate)**: Fraction of true structural neighbors retained
- **Database Reduction**: Fraction of database filtered out (1 - frac_samples_above_lambda)
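
To make these definitions concrete, here is a minimal sketch that computes both metrics for a single threshold on toy data. The helper `prefilter_metrics`, the toy scores and labels, and the convention that a candidate is kept when its score is at least `lam` are assumptions for illustration; the repository's scripts may pool or average the metrics differently.

```python
import numpy as np


def prefilter_metrics(scores, is_neighbor, lam):
    """Compute prefiltering metrics at threshold lam (hypothetical helper).

    scores: similarity score per database candidate (higher = more similar).
    is_neighbor: True where the candidate is a true structural neighbor.
    """
    scores = np.asarray(scores, dtype=float)
    is_neighbor = np.asarray(is_neighbor, dtype=bool)
    kept = scores >= lam  # assumed convention: keep candidates at or above lam

    n_pos = is_neighbor.sum()
    tpr = (kept & is_neighbor).sum() / n_pos if n_pos else float("nan")
    frac_kept = kept.mean()
    return {
        "TPR": float(tpr),
        "FNR": float(1.0 - tpr),
        "frac_samples_above_lambda": float(frac_kept),
        "db_reduction": float(1.0 - frac_kept),
    }


# Toy example: 6 candidates, 3 of which are true neighbors.
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
is_neighbor = [True, True, True, False, False, False]
metrics = prefilter_metrics(scores, is_neighbor, lam=0.5)
print(metrics)
```

With `lam=0.5`, two of the three true neighbors are kept and four of the six candidates are filtered out, so a stricter threshold gives a larger database reduction and a lower TPR.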

In [None]:
# Compute key metrics
tpr_mean = df["TPR_elbow"].mean() * 100
tpr_std = df["TPR_elbow"].std() * 100

frac_kept = df["frac_samples_above_lambda"].mean()
db_reduction = (1 - frac_kept) * 100

fnr_mean = df["FNR_elbow"].mean() * 100
fdr_mean = df["FDR_elbow"].mean()
elbow_z_mean = df["elbow_z"].mean()
elbow_z_std = df["elbow_z"].std()

# Paper claims
paper_tpr = 82.8
paper_db_reduction = 31.5

print("="*60)
print("DALI Prefiltering Results")
print("="*60)
print(f"\nTPR (True Positive Rate): {tpr_mean:.1f}% +/- {tpr_std:.1f}%")
print(f"  Paper claims: {paper_tpr}%")
print(f"  Difference: {abs(tpr_mean - paper_tpr):.1f} percentage points")
print(f"\nDatabase Reduction: {db_reduction:.1f}%")
print(f"  Paper claims: {paper_db_reduction}%")
print(f"  Difference: {abs(db_reduction - paper_db_reduction):.1f} percentage points")
print(f"\nFNR (Miss Rate): {fnr_mean:.1f}%")
print(f"FDR at elbow: {fdr_mean:.6f}")
print(f"Elbow z-score: {elbow_z_mean:.1f} +/- {elbow_z_std:.1f}")
print("="*60)

In [None]:
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# TPR distribution
axes[0].hist(df["TPR_elbow"] * 100, bins=20, edgecolor='black', alpha=0.7)
axes[0].axvline(paper_tpr, color='r', linestyle='--', label=f'Paper: {paper_tpr}%')
axes[0].axvline(tpr_mean, color='g', linestyle='-', label=f'Mean: {tpr_mean:.1f}%')
axes[0].set_xlabel('TPR (%)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('True Positive Rate Distribution')
axes[0].legend()

# Database reduction
db_reductions = (1 - df["frac_samples_above_lambda"]) * 100
axes[1].hist(db_reductions, bins=20, edgecolor='black', alpha=0.7, color='orange')
axes[1].axvline(paper_db_reduction, color='r', linestyle='--', label=f'Paper: {paper_db_reduction}%')
axes[1].axvline(db_reduction, color='g', linestyle='-', label=f'Mean: {db_reduction:.1f}%')
axes[1].set_xlabel('Database Reduction (%)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Database Reduction Distribution')
axes[1].legend()

# Elbow z-score distribution
axes[2].hist(df["elbow_z"], bins=20, edgecolor='black', alpha=0.7, color='green')
axes[2].set_xlabel('Elbow z-score')
axes[2].set_ylabel('Frequency')
axes[2].set_title('Elbow Z-score Distribution')

plt.tight_layout()
plt.show()

## Verification Summary

Compare our results to the paper's claims, allowing a tolerance of 2 percentage points:

In [None]:
# Verification
tpr_ok = abs(tpr_mean - paper_tpr) < 2.0  # Within 2 percentage points
db_ok = abs(db_reduction - paper_db_reduction) < 2.0  # Within 2 percentage points

print("="*60)
if tpr_ok and db_ok:
    print("VERIFICATION PASSED")
    print(f"  TPR {tpr_mean:.1f}% matches paper ({paper_tpr}%)")
    print(f"  DB reduction {db_reduction:.1f}% matches paper ({paper_db_reduction}%)")
else:
    print("VERIFICATION WARNING")
    if not tpr_ok:
        print(f"  TPR {tpr_mean:.1f}% differs from paper ({paper_tpr}%)")
    if not db_ok:
        print(f"  DB reduction {db_reduction:.1f}% differs from paper ({paper_db_reduction}%)")
print("="*60)

## Summary

The conformal prefiltering approach:
- Retains ~82.8% of true positives (TPR) while reducing the database by ~31.5%
- Lets expensive DALI alignments run on a smaller candidate set
- Controls risk via the Learn-then-Test (LTT) calibration procedure

For the full analysis with raw FoldSeek data, see the original notebook in `notebooks/archive/`
or run `scripts/verify_dali.py`.