# JCVI Syn3.0 Unknown Gene Analysis

This notebook reproduces **Figure 2A** from:

> "Functional protein mining with conformal guarantees"  
> Nature Communications (2025) 16:85  
> https://doi.org/10.1038/s41467-024-55676-y

**Result**: 59/149 (39.6%) of JCVI Syn3.0 genes of unknown function can be confidently annotated at 10% FDR.

## Prerequisites

Download from [Zenodo](https://zenodo.org/records/14272215):
- `lookup_embeddings.npy` → `data/`
- `lookup_embeddings_meta_data.tsv` → `data/`
- `pfam_new_proteins.npy` → `data/`

The notebook also reads `unknown_aa_seqs.npy` and `unknown_aa_seqs.fasta` from `data/gene_unknown/`.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from pathlib import Path

# Add repo root to path (assumes this notebook sits two directories below the repo root)
import sys
repo_root = Path.cwd().parent.parent
sys.path.insert(0, str(repo_root))

In [None]:
from protein_conformal.util import (
    load_database, query, read_fasta,
    get_sims_labels, simplifed_venn_abers_prediction
)

## 1. Load Query Embeddings (JCVI Syn3.0)

In [None]:
# Load pre-computed embeddings for JCVI Syn3.0 unknown genes
data_dir = repo_root / 'data'
query_embeddings = np.load(data_dir / 'gene_unknown' / 'unknown_aa_seqs.npy')
print(f"Query embeddings shape: {query_embeddings.shape}")

# Load sequence metadata
query_fastas, query_metadata = read_fasta(data_dir / 'gene_unknown' / 'unknown_aa_seqs.fasta')
print(f"Number of sequences: {len(query_fastas)}")

## 2. Load Database (UniProt with Pfam annotations)

In [None]:
# Load UniProt embeddings and metadata
embeddings = np.load(data_dir / 'lookup_embeddings.npy')
lookup_proteins_meta = pd.read_csv(data_dir / 'lookup_embeddings_meta_data.tsv', sep="\t")
print(f"Database size: {len(embeddings)} proteins")

# Filter to proteins with Pfam annotations
column = 'Pfam'
col_lookup = lookup_proteins_meta[~lookup_proteins_meta[column].isnull()]
col_lookup_embeddings = embeddings[col_lookup.index]
col_meta_data = col_lookup[column].values
print(f"Proteins with Pfam: {len(col_lookup_embeddings)}")

# Build FAISS index
lookup_database = load_database(col_lookup_embeddings)

## 3. Search for Nearest Neighbors

In [None]:
# Query for the 1st nearest neighbor
k = 1
D, I = query(lookup_database, query_embeddings, k)
D_max = np.max(D, axis=1)
print(f"Similarity scores range: [{D_max.min():.6f}, {D_max.max():.6f}]")
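
To make the search step concrete, here is a minimal brute-force sketch of a top-k similarity search in plain NumPy. It assumes, hypothetically, that `query` performs an inner-product search over L2-normalized embeddings (which would be consistent with scores just below 1); the notebook itself does not confirm this. `top_k_inner_product` and the `_demo` names are illustrative only and are not part of `protein_conformal`.

```python
import numpy as np

def top_k_inner_product(database, queries, k):
    """Brute-force top-k by inner product; returns (scores, indices), best first."""
    scores = queries @ database.T                  # shape (n_queries, n_database)
    idx = np.argsort(-scores, axis=1)[:, :k]       # indices of the k largest scores
    top = np.take_along_axis(scores, idx, axis=1)  # matching scores
    return top, idx

# Toy database of unit-length vectors (made-up data)
rng = np.random.default_rng(0)
database_demo = rng.normal(size=(5, 4))
database_demo /= np.linalg.norm(database_demo, axis=1, keepdims=True)

# Queries that are exact copies of database rows 3 and 1
queries_demo = database_demo[[3, 1]].copy()

D_demo, I_demo = top_k_inner_product(database_demo, queries_demo, k=1)
print(I_demo[:, 0], D_demo[:, 0])
```

Because each toy query is an exact copy of a database row, its nearest neighbor is that row and the similarity is 1 up to floating-point error.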

## 4. Compute Venn-Abers Calibrated Probabilities

In [None]:
# Load calibration data
cal_data = np.load(data_dir / 'pfam_new_proteins.npy', allow_pickle=True)

# Prepare calibration set
n_calib = 100
np.random.seed(42)
np.random.shuffle(cal_data)
cal_subset = cal_data[:n_calib]
X_cal, y_cal = get_sims_labels(cal_subset, partial=False)
X_cal = X_cal.flatten()
y_cal = y_cal.flatten()

# Compute probabilities for each query
p_s = []
for d in D:
    p_0, p_1 = simplifed_venn_abers_prediction(X_cal, y_cal, d)
    p_s.append([p_0, p_1])
p_s = np.array(p_s)

# Check calibration quality: |p0 - p1| is the width of the Venn-Abers interval (narrower = less uncertainty)
abs_p = [np.abs(p[0] - p[1]) for p in p_s]
print(f"Max |p0 - p1|: {max(abs_p):.4f} (should be close to 0)")
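
For intuition about what the pair `(p_0, p_1)` represents, the sketch below implements the textbook inductive Venn-Abers construction: isotonic regression is fitted twice, once with the test point tentatively labeled 0 and once labeled 1. This is not necessarily what `simplifed_venn_abers_prediction` does internally. `pava` and `venn_abers_pair` are hypothetical helper names, the data is made up, and the sketch assumes all scores are distinct.

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: non-decreasing least-squares fit to the sequence y."""
    sums, counts = [], []
    for v in np.asarray(y, dtype=float):
        sums.append(v)
        counts.append(1)
        # Merge blocks while the previous block mean exceeds the current one
        while len(sums) > 1 and sums[-2] / counts[-2] > sums[-1] / counts[-1]:
            s, c = sums.pop(), counts.pop()
            sums[-1] += s
            counts[-1] += c
    return np.concatenate([np.full(c, s / c) for s, c in zip(sums, counts)])

def venn_abers_pair(cal_scores, cal_labels, test_score):
    """Return (p0, p1): isotonic fits with the test point labeled 0, then 1."""
    probs = []
    for hypothetical_label in (0, 1):
        scores = np.append(cal_scores, test_score)
        labels = np.append(cal_labels, hypothetical_label)
        order = np.argsort(scores, kind="stable")
        fitted = pava(labels[order])
        # Locate the test point (last appended element) in sorted order
        pos = int(np.where(order == len(cal_scores))[0][0])
        probs.append(float(fitted[pos]))
    return probs[0], probs[1]

# Toy calibration set (made-up values)
cal_scores_demo = np.array([0.1, 0.2, 0.3, 0.8, 0.9])
cal_labels_demo = np.array([0, 0, 1, 1, 1])

p0_demo, p1_demo = venn_abers_pair(cal_scores_demo, cal_labels_demo, 0.5)
print(p0_demo, p1_demo)
```

The gap between the two values reflects how much a single extra labeled point can move the calibrated probability, which is why a small `|p0 - p1|` is read as low uncertainty.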

## 5. Apply FDR Threshold

In [None]:
# Paper-verified FDR threshold at alpha=0.1
l_hat = 0.999980225003127

# Count hits above threshold
hits = (D_max > l_hat).sum()
total = len(D_max)
print("\n=== JCVI Syn3.0 Annotation Results ===")
print(f"Total queries: {total}")
print(f"Confident hits: {hits}")
print(f"Hit rate: {hits/total*100:.1f}%")
print(f"FDR threshold (alpha=0.1): {l_hat:.10f}")
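
The value of `l_hat` above is taken from the paper. As a rough illustration of how a similarity cutoff relates to FDR, the sketch below picks a threshold on a toy labeled calibration set by lowering the cutoff until the empirical false discovery proportion would exceed `alpha`. It is a simplification with made-up data and hypothetical function names, and it omits the finite-sample guarantees that a conformal procedure provides.

```python
import numpy as np

def empirical_fdr(scores, is_true_match, threshold):
    """Fraction of selected items (score > threshold) that are false matches."""
    selected = scores > threshold
    if not selected.any():
        return 0.0  # no discoveries, so no false discoveries
    return float(np.mean(~is_true_match[selected]))

def lowest_safe_threshold(scores, is_true_match, alpha):
    """Scan candidate cutoffs from high to low; stop before the first FDR violation."""
    best = float(np.max(scores))
    for t in np.sort(scores)[::-1]:
        if empirical_fdr(scores, is_true_match, t) > alpha:
            break
        best = float(t)
    return best

# Toy calibration pairs: similarity score and whether the match is correct
toy_scores = np.array([0.99, 0.98, 0.97, 0.96, 0.95, 0.90, 0.85, 0.80, 0.70, 0.60])
toy_labels = np.array([True, True, True, True, True, False, True, False, False, False])

toy_threshold = lowest_safe_threshold(toy_scores, toy_labels, alpha=0.2)
toy_hits = int(np.sum(toy_scores > toy_threshold))
toy_fdr = empirical_fdr(toy_scores, toy_labels, toy_threshold)
print(toy_threshold, toy_hits, round(toy_fdr, 4))
```

On this toy data, the scan stops at a cutoff of 0.80: seven items score above it and one of them is a false match, so the empirical FDR is 1/7, below the 0.2 target.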

## 6. Visualize Results

In [None]:
# Histogram of similarity scores
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Left: Similarity score distribution
ax = axes[0]
sns.histplot(D_max, bins=30, ax=ax)
# Six decimals are needed here: with four, l_hat would be displayed as 1.0000
ax.axvline(l_hat, color='r', linestyle='--', label=f'FDR threshold (lambda={l_hat:.6f})')
ax.set_xlabel('Similarity Score')
ax.set_ylabel('Count')
ax.set_title('Distribution of Nearest Neighbor Similarities')
ax.legend()

# Right: Pie chart
ax = axes[1]
# Use the same strict inequality as the hit count and hit mask elsewhere in the notebook
hits_count = np.sum(D_max > l_hat)
no_hits_count = np.sum(D_max <= l_hat)
sizes = [hits_count, no_hits_count]
labels = [f'Hits: {hits_count} ({hits_count/total*100:.1f}%)',
          f'No Hits: {no_hits_count} ({no_hits_count/total*100:.1f}%)']
colors = sns.color_palette()[0:2]
ax.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%',
       startangle=140, explode=(0.1, 0))
ax.set_title(f'JCVI Syn3.0 Annotation (n={total})')

plt.tight_layout()
plt.show()

## 7. Build Results DataFrame

In [None]:
# Filter to confident hits
hit_mask = D_max > l_hat
filtered_I = I[hit_mask]
first_entries = filtered_I[:, 0]

# Build results dataframe
df_hits = col_lookup.iloc[first_entries].reset_index(drop=True)
df_hits['query_sequence'] = np.array(query_fastas)[hit_mask]
df_hits['query_name'] = np.array(query_metadata)[hit_mask]
df_hits['similarity'] = D_max[hit_mask]
df_hits['probability'] = np.mean(p_s[hit_mask], axis=1)

# Reorder columns
first_cols = ['query_name', 'similarity', 'probability', 'Pfam', 'Protein names']
other_cols = [c for c in df_hits.columns if c not in first_cols]
df_hits = df_hits[[c for c in first_cols if c in df_hits.columns] + other_cols]

print("\nTop 10 hits:")
# Show only the preferred columns that exist in the metadata
df_hits[[c for c in first_cols if c in df_hits.columns]].head(10)
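
The indexing in this step relies on positional alignment: the neighbor indices returned by the search refer to row positions in the filtered database, which is why `.iloc` is used rather than label-based lookup. A toy illustration with made-up values, using NumPy arrays as stand-ins for the DataFrame (all names here are hypothetical):

```python
import numpy as np

# Toy stand-ins for D_max, I, the Pfam column, and the query names
sims_demo = np.array([0.99999, 0.5, 0.999995])
nn_idx_demo = np.array([[2], [0], [1]])            # row positions in the filtered database
annotations_demo = np.array(["PF00001", "PF00002", "PF00003"])
query_names_demo = np.array(["q0", "q1", "q2"])
threshold_demo = 0.99998

# The same boolean mask selects queries, scores, and neighbor indices consistently
mask_demo = sims_demo > threshold_demo
rows_demo = [
    (str(name), str(annotations_demo[j]), float(s))
    for name, j, s in zip(query_names_demo[mask_demo],
                          nn_idx_demo[mask_demo, 0],
                          sims_demo[mask_demo])
]
print(rows_demo)
```

Applying one mask to every per-query array keeps each hit's name, score, and annotation on the same row.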

In [None]:
# Save results
output_path = data_dir / 'gene_unknown' / 'unknown_aa_seqs_pfam_hits.csv'
df_hits.to_csv(output_path, index=False)
print(f"Saved {len(df_hits)} hits to {output_path}")

## Summary

This notebook demonstrates the conformal protein retrieval workflow:

1. **Embed** query proteins using Protein-Vec (pre-computed)
2. **Search** against UniProt database using FAISS
3. **Filter** using FDR-controlled threshold (alpha=0.1 → 10% expected FDR)
4. **Calibrate** probabilities using Venn-Abers

**Result**: 59/149 (39.6%) of JCVI Syn3.0 genes of unknown function can be confidently annotated at 10% FDR.

For command-line usage, see:
```bash
cpr search --input sequences.fasta --output results.csv --fdr 0.1
```