# üìä Notebook 2 ‚Äî Statistical Analysis & Publication Figures

## AlphaMissense-Guided VUS Reclassification in HRR Genes for Prostate Cancer

**Inputs from Notebook 1:**
- `results/annotated_hrr_variants.csv` ‚Äî HRR missense variants with AlphaMissense scores
- `results/patient_hrr_summary.csv` ‚Äî Patient-level summary
- `data/raw/clinical_patient.csv` / `clinical_sample.csv` ‚Äî Survival & clinical data
- `data/processed/alphamissense_hrr_genes.csv` ‚Äî AlphaMissense lookup for HRR genes
- `data/processed/clinvar_hrr.csv` ‚Äî ClinVar classifications (if available)

**Analyses:**
1. Descriptive summary & AlphaMissense score distribution
2. ClinVar concordance (Cohen's kappa, sensitivity/specificity)
3. VUS reclassification yield
4. Survival analysis (Cox PH + Kaplan-Meier)
5. Sensitivity analyses (threshold, E-value, gene exclusion)
6. Publication-ready figures (Fig 1‚Äì5)

**Reporting:** REMARK guidelines + STROBE checklist


## 1. Setup & Load Data

In [3]:
# REPRODUCIBILITY: Install dependencies via `pip install -r requirements.txt`
# Do NOT pip-install inside the notebook ‚Äî use pinned versions from requirements.txt


‚úÖ Setup complete


In [4]:
# ============================================================
# 2. LOAD ALL DATA FROM NOTEBOOK 1
# ============================================================

# --- Variant-level data ---
df_var = pd.read_csv(RESULTS_DIR / "annotated_hrr_variants.csv")
print(f"üìÇ Variants loaded: {len(df_var)} rows, {df_var['sample_id'].nunique()} patients")

# --- Patient summary ---
df_pat = pd.read_csv(RESULTS_DIR / "patient_hrr_summary.csv")
print(f"üìÇ Patient summary: {len(df_pat)} patients with HRR missense")

# --- Clinical data ---
df_clin_patient = pd.read_csv(DATA_DIR / "raw" / "clinical_patient.csv")
df_clin_sample = pd.read_csv(DATA_DIR / "raw" / "clinical_sample.csv")
print(f"üìÇ Clinical: {len(df_clin_patient)} patients, {len(df_clin_sample)} samples")

# --- AlphaMissense full lookup ---
am_path = DATA_DIR / "processed" / "alphamissense_hrr_genes.csv"
if am_path.exists():
    df_am_full = pd.read_csv(am_path)
    print(f"üìÇ AlphaMissense HRR lookup: {len(df_am_full):,} predictions")
else:
    df_am_full = pd.DataFrame()
    print("‚ö†Ô∏è  AlphaMissense lookup not found ‚Äî concordance limited")

# --- ClinVar ---
cv_path = DATA_DIR / "processed" / "clinvar_hrr.csv"
if cv_path.exists():
    df_clinvar = pd.read_csv(cv_path)
    print(f"üìÇ ClinVar HRR: {len(df_clinvar):,} entries")
else:
    df_clinvar = pd.DataFrame()
    print("‚ö†Ô∏è  ClinVar file not found ‚Äî will attempt alternative load")
    # Try the raw variant_summary if processed doesn't exist
    raw_cv = DATA_DIR / "raw" / "variant_summary.txt.gz"
    if raw_cv.exists():
        print("   Found raw ClinVar ‚Äî will parse in concordance section")

# Quick sanity
print(f"\n{'='*60}")
print("QUICK DATA SANITY CHECK")
print(f"{'='*60}")
n_am = df_var["am_pathogenicity"].notna().sum()
print(f"  Variants with AM score: {n_am}/{len(df_var)} ({100*n_am/len(df_var):.1f}%)")
if "am_class" in df_var.columns:
    print(f"  AM class distribution:")
    for c, n in df_var["am_class"].value_counts().items():
        print(f"    {c}: {n}")


üìÇ Variants loaded: 52 rows, 40 patients
üìÇ Patient summary: 40 patients with HRR missense
üìÇ Clinical: 15949 patients, 8854 samples
üìÇ AlphaMissense HRR lookup: 554,363 predictions
üìÇ ClinVar HRR: 193,148 entries

QUICK DATA SANITY CHECK
  Variants with AM score: 51/52 (98.1%)
  AM class distribution:
    benign: 31
    pathogenic: 19
    ambiguous: 1


## 2. Descriptive Statistics

### 2A. Variant-Level Summary
### 2B. Patient-Level Table 1


In [5]:
# ============================================================
# 3. DESCRIPTIVE ‚Äî VARIANT-LEVEL
# ============================================================

print("=" * 60)
print("VARIANT-LEVEL DESCRIPTIVE SUMMARY")
print("=" * 60)

# Total missense by gene
gene_summary = df_var.groupby("gene").agg(
    n_variants=("sample_id", "count"),
    n_patients=("sample_id", "nunique"),
    mean_am=("am_pathogenicity", "mean"),
    median_am=("am_pathogenicity", "median"),
    n_pathogenic=("am_class", lambda x: (x == "pathogenic").sum()),
    n_benign=("am_class", lambda x: (x == "benign").sum()),
    n_ambiguous=("am_class", lambda x: (x == "ambiguous").sum()),
).reset_index()

gene_summary["cohort"] = gene_summary["gene"].apply(
    lambda g: "A" if g in COHORT_A_GENES else ("B" if g in COHORT_B_GENES else "Ext")
)
gene_summary = gene_summary.sort_values(["cohort", "n_variants"], ascending=[True, False])

print("\nMissense variants by HRR gene:")
print(gene_summary.to_string(index=False))
gene_summary.to_csv(RESULTS_DIR / "table_gene_summary.csv", index=False)

# Score distribution stats
am_scores = df_var["am_pathogenicity"].dropna()
print(f"\nAlphaMissense score distribution (n={len(am_scores)}):")
print(f"  Mean ¬± SD: {am_scores.mean():.3f} ¬± {am_scores.std():.3f}")
print(f"  Median [IQR]: {am_scores.median():.3f} [{am_scores.quantile(0.25):.3f}‚Äì{am_scores.quantile(0.75):.3f}]")
print(f"  Range: {am_scores.min():.4f} ‚Äì {am_scores.max():.4f}")


VARIANT-LEVEL DESCRIPTIVE SUMMARY

Missense variants by HRR gene:
  gene  n_variants  n_patients  mean_am  median_am  n_pathogenic  n_benign  n_ambiguous cohort
   ATM          15          15 0.641073    0.70650            10         5            0      A
 BRCA2           3           3 0.160633    0.15260             0         3            0      A
 BRCA1           1           1 0.193600    0.19360             0         1            0      A
 CDK12           5           5 0.647580    0.99880             3         2            0      B
 BARD1           4           4 0.100650    0.09265             0         4            0      B
 PALB2           3           3 0.079900    0.07590             0         3            0      B
RAD51B           3           2 0.228967    0.11690             0         2            1      B
 BRIP1           2           2 0.143450    0.14345             0         2            0      B
RAD54L           2           2 0.319900    0.31990             1         1     

In [6]:
# ============================================================
# 3B. PATIENT-LEVEL DESCRIPTIVE ‚Äî BUILD TABLE 1
# ============================================================

# Merge patient summary with clinical data
# Detect clinical column names (vary between API/datahub format)
clin_cols = df_clin_patient.columns.tolist()
print(f"Available clinical columns: {clin_cols}")

# Find the patient ID column
pid_col = None
for candidate in ["PATIENT_ID", "patientId", "Patient ID"]:
    if candidate in clin_cols:
        pid_col = candidate
        break

if pid_col is None:
    print("‚ö†Ô∏è  Could not find patient ID column. Available:", clin_cols[:10])
    pid_col = clin_cols[0]  # fallback

print(f"Using patient ID column: '{pid_col}'")

# Extract patient ID from sample ID (TCGA format: TCGA-XX-XXXX-01 ‚Üí TCGA-XX-XXXX)
df_pat["patient_id"] = df_pat["sample_id"].str.extract(r"(TCGA-[A-Z0-9]+-[A-Z0-9]+)")

# Merge
df_analysis = df_pat.merge(
    df_clin_patient,
    left_on="patient_id",
    right_on=pid_col,
    how="left"
)

print(f"\n‚úÖ Merged: {len(df_analysis)} patients with clinical data")

# Identify key clinical columns
# Look for survival columns
survival_cols = [c for c in df_analysis.columns if any(
    k in c.upper() for k in ["OS_", "DFS_", "PFS_", "SURVIVAL", "STATUS", "MONTHS"]
)]
print(f"Survival-related columns found: {survival_cols}")

# Look for age, Gleason, stage
demo_cols = [c for c in df_analysis.columns if any(
    k in c.upper() for k in ["AGE", "GLEASON", "STAGE", "GRADE", "PSA", "T_STAGE", "N_STAGE", "M_STAGE"]
)]
print(f"Demographic/clinical columns found: {demo_cols}")


Available clinical columns: ['uniquePatientKey', 'patientId', 'studyId', 'clinicalAttributeId', 'value']
Using patient ID column: 'patientId'

‚úÖ Merged: 1299 patients with clinical data
Survival-related columns found: []
Demographic/clinical columns found: []


In [7]:
# ============================================================
# 3C. PIVOT CLINICAL DATA & CREATE ANALYSIS GROUPS
# ============================================================

# The cBioPortal API returns clinical data in LONG format:
#   patientId | clinicalAttributeId | value
# We need to pivot to WIDE format (one row per patient)

print("Pivoting clinical data from long ‚Üí wide format...")

if "clinicalAttributeId" in df_clin_patient.columns and "value" in df_clin_patient.columns:
    df_clin_wide = df_clin_patient.pivot_table(
        index="patientId",
        columns="clinicalAttributeId",
        values="value",
        aggfunc="first"
    ).reset_index()
    print(f"  Pivoted: {len(df_clin_wide)} patients √ó {len(df_clin_wide.columns)} attributes")
    print(f"  Available attributes: {sorted(df_clin_wide.columns.tolist())}")
else:
    # Already in wide format
    df_clin_wide = df_clin_patient.copy()
    print(f"  Already wide: {len(df_clin_wide)} patients")

# Also pivot sample-level if needed
if "clinicalAttributeId" in df_clin_sample.columns:
    df_sample_wide = df_clin_sample.pivot_table(
        index="sampleId",
        columns="clinicalAttributeId",
        values="value",
        aggfunc="first"
    ).reset_index()
    # Merge sample-level attributes (TMB, etc.) into patient table
    # Use first sample per patient
    df_sample_wide["patientId"] = df_sample_wide["sampleId"].str.extract(r"(TCGA-[A-Z0-9]+-[A-Z0-9]+)")
    sample_attrs = df_sample_wide.drop(columns=["sampleId"]).groupby("patientId").first().reset_index()
    df_clin_wide = df_clin_wide.merge(sample_attrs, on="patientId", how="left", suffixes=("", "_sample"))
    print(f"  After merging sample attributes: {len(df_clin_wide.columns)} total columns")

# Extract patient ID from sample_id
df_pat["patient_id"] = df_pat["sample_id"].str.extract(r"(TCGA-[A-Z0-9]+-[A-Z0-9]+)")

# Merge
df_analysis = df_pat.merge(df_clin_wide, left_on="patient_id", right_on="patientId", how="left")
print(f"\n‚úÖ Merged: {len(df_analysis)} patients with clinical data")

# Define AM groups
if "has_am_pathogenic" not in df_analysis.columns:
    if "n_am_pathogenic" in df_analysis.columns:
        df_analysis["has_am_pathogenic"] = df_analysis["n_am_pathogenic"] > 0
    else:
        path_patients = df_var[df_var["am_class"] == "pathogenic"]["sample_id"].unique()
        df_analysis["has_am_pathogenic"] = df_analysis["sample_id"].isin(path_patients)

df_analysis["am_group"] = df_analysis["has_am_pathogenic"].map({
    True: "AM-Pathogenic (‚â•1 path. variant)",
    False: "AM-Benign/Ambiguous only"
})
print("\nPatient groups:")
print(df_analysis["am_group"].value_counts())

# Find OS columns ‚Äî TCGA PanCancer uses OS_MONTHS and OS_STATUS
os_time_col = None
os_status_col = None
for c in df_analysis.columns:
    cu = str(c).upper()
    if cu in ["OS_MONTHS", "OS_TIME"]:
        os_time_col = c
    elif cu == "OS_STATUS":
        os_status_col = c
    elif "OVERALL_SURVIVAL" in cu and "MONTH" in cu:
        os_time_col = c

print(f"\nSurvival columns: time='{os_time_col}', status='{os_status_col}'")

if os_status_col and os_time_col:
    df_analysis["os_event"] = df_analysis[os_status_col].apply(
        lambda x: 1 if "deceased" in str(x).lower() or str(x).strip() == "1" else 0
    )
    df_analysis["os_time"] = pd.to_numeric(df_analysis[os_time_col], errors="coerce")
    valid_surv = df_analysis[["os_time", "os_event"]].dropna()
    print(f"Survival data: {len(valid_surv)}/{len(df_analysis)} patients")
    print(f"  Events (deaths): {valid_surv['os_event'].sum()}")
    print(f"  Median follow-up: {valid_surv['os_time'].median():.1f} months")
else:
    print("\n‚ö†Ô∏è  OS columns not found in pivoted data.")
    print("  Available columns:", [c for c in df_analysis.columns if any(
        k in str(c).upper() for k in ["OS", "SURV", "DEATH", "STATUS", "MONTH", "DFS", "PFS"]
    )])
    df_analysis["os_event"] = np.nan
    df_analysis["os_time"] = np.nan

# DFS
for c in df_analysis.columns:
    cu = str(c).upper()
    if cu in ["DFS_MONTHS"]:
        df_analysis["dfs_time"] = pd.to_numeric(df_analysis[c], errors="coerce")
    if cu in ["DFS_STATUS"]:
        df_analysis["dfs_event"] = df_analysis[c].apply(
            lambda x: 1 if "recur" in str(x).lower() or "progress" in str(x).lower() or str(x).strip() == "1" else 0
        )

# Add max_am_score from variant data
max_am = df_var.groupby("sample_id")["am_pathogenicity"].max().reset_index()
max_am.columns = ["sample_id", "max_am_score"]
df_analysis = df_analysis.merge(max_am, on="sample_id", how="left")

# Save
df_analysis.to_csv(RESULTS_DIR / "analysis_dataset.csv", index=False)
print(f"\nüíæ Saved: {RESULTS_DIR / 'analysis_dataset.csv'}")
print(f"  Columns: {len(df_analysis.columns)}")


Pivoting clinical data from long ‚Üí wide format...
  Pivoted: 494 patients √ó 37 attributes
  Available attributes: ['AGE', 'BUFFA_HYPOXIA_SCORE', 'CANCER_TYPE_ACRONYM', 'DAYS_LAST_FOLLOWUP', 'DAYS_TO_BIRTH', 'DAYS_TO_INITIAL_PATHOLOGIC_DIAGNOSIS', 'DFS_MONTHS', 'DFS_STATUS', 'DSS_MONTHS', 'DSS_STATUS', 'ETHNICITY', 'FORM_COMPLETION_DATE', 'GENETIC_ANCESTRY_LABEL', 'HISTORY_NEOADJUVANT_TRTYN', 'ICD_10', 'ICD_O_3_HISTOLOGY', 'ICD_O_3_SITE', 'INFORMED_CONSENT_VERIFIED', 'IN_PANCANPATHWAYS_FREEZE', 'NEW_TUMOR_EVENT_AFTER_INITIAL_TREATMENT', 'OS_MONTHS', 'OS_STATUS', 'OTHER_PATIENT_ID', 'PATH_N_STAGE', 'PATH_T_STAGE', 'PERSON_NEOPLASM_CANCER_STATUS', 'PFS_MONTHS', 'PFS_STATUS', 'PRIOR_DX', 'RACE', 'RADIATION_THERAPY', 'RAGNUM_HYPOXIA_SCORE', 'SAMPLE_COUNT', 'SEX', 'SUBTYPE', 'WINTER_HYPOXIA_SCORE', 'patientId']
  After merging sample attributes: 55 total columns

‚úÖ Merged: 40 patients with clinical data

Patient groups:
am_group
AM-Benign/Ambiguous only            21
AM-Pathogenic (‚â•1

## 3. Concordance: AlphaMissense vs. ClinVar

This is a key validation step. We check how well AlphaMissense agrees with ClinVar's expert-curated classifications for variants that have been previously classified.

**Metrics:**
- Cohen's kappa (chance-corrected agreement)
- Sensitivity / Specificity for pathogenic detection
- Confusion matrix

**Why this matters:** If AM agrees with ClinVar on known variants (kappa > 0.70), it provides confidence that AM's reclassification of VUS is meaningful.


In [8]:
# ============================================================
# 4. CONCORDANCE ANALYSIS ‚Äî AlphaMissense vs ClinVar
# ============================================================

# Strategy: Use the AlphaMissense full lookup (all possible substitutions
# for HRR genes) and cross-reference with ClinVar annotations.
#
# We match on: gene + protein_change (e.g., BRCA2 R2842H)
# ClinVar gives: Pathogenic, Likely_Pathogenic, VUS, Likely_Benign, Benign
# AlphaMissense gives: pathogenic (>0.564), ambiguous (0.34‚Äì0.564), benign (<0.34)

from scipy.stats import fisher_exact

# First, check what ClinVar data we have
if len(df_clinvar) > 0:
    print(f"ClinVar data available: {len(df_clinvar)} entries")
    print(f"Columns: {df_clinvar.columns.tolist()}")
    cv_data = df_clinvar.copy()
elif (DATA_DIR / "raw" / "variant_summary.txt.gz").exists():
    print("Parsing ClinVar from raw file (filtering for HRR genes)...")
    import gzip
    HRR_GENES_ALL = sorted(set(COHORT_A_GENES + COHORT_B_GENES + [
        "FANCA", "FANCC", "FANCD2", "FANCE", "FANCF", "FANCG",
        "NBN", "MRE11", "RAD50", "ATR", "ATRX"
    ]))
    records = []
    with gzip.open(DATA_DIR / "raw" / "variant_summary.txt.gz", 'rt', errors='replace') as f:
        header = f.readline().strip().split('\t')
        col_map = {h: i for i, h in enumerate(header)}
        for line in f:
            parts = line.strip().split('\t')
            gene = parts[col_map.get("GeneSymbol", 0)] if "GeneSymbol" in col_map else ""
            if gene in HRR_GENES_ALL:
                var_type = parts[col_map.get("Type", 0)] if "Type" in col_map else ""
                name = parts[col_map.get("Name", 0)] if "Name" in col_map else ""
                sig = parts[col_map.get("ClinicalSignificance", 0)] if "ClinicalSignificance" in col_map else ""
                if "single nucleotide" in var_type.lower() or "missense" in name.lower():
                    records.append({
                        "cv_gene": gene,
                        "cv_name": name,
                        "cv_significance": sig,
                    })
    cv_data = pd.DataFrame(records)
    print(f"  Parsed {len(cv_data)} ClinVar HRR SNV/missense entries")
else:
    cv_data = pd.DataFrame()
    print("‚ö†Ô∏è  No ClinVar data available ‚Äî skipping concordance")

if len(cv_data) > 0:
    # Simplify ClinVar classification
    def simplify_cv(sig):
        sig = str(sig).lower()
        if "pathogenic" in sig and "conflicting" not in sig and "benign" not in sig:
            return "P/LP"
        elif "benign" in sig and "conflicting" not in sig and "pathogenic" not in sig:
            return "B/LB"
        elif "uncertain" in sig:
            return "VUS"
        else:
            return "Other"

    cv_data["cv_simple"] = cv_data.get("cv_significance", cv_data.get("cv_clinical_significance", "")).apply(simplify_cv)

    print("\nClinVar simplified distribution:")
    print(cv_data["cv_simple"].value_counts())

    # Count how many are VUS ‚Äî this is the reclassification opportunity
    n_vus = (cv_data["cv_simple"] == "VUS").sum()
    n_plp = (cv_data["cv_simple"] == "P/LP").sum()
    n_blb = (cv_data["cv_simple"] == "B/LB").sum()
    print(f"\nüìä ClinVar landscape for HRR genes:")
    print(f"   Pathogenic/Likely Pathogenic: {n_plp}")
    print(f"   Benign/Likely Benign: {n_blb}")
    print(f"   VUS: {n_vus} ‚Üê RECLASSIFICATION OPPORTUNITY")
    print(f"   Other/Conflicting: {(cv_data['cv_simple'] == 'Other').sum()}")


ClinVar data available: 193148 entries
Columns: ['cv_gene', 'cv_name', 'cv_significance', 'cv_class']

ClinVar simplified distribution:
cv_simple
VUS      80630
B/LB     66719
Other    32342
P/LP     13457
Name: count, dtype: int64

üìä ClinVar landscape for HRR genes:
   Pathogenic/Likely Pathogenic: 13457
   Benign/Likely Benign: 66719
   VUS: 80630 ‚Üê RECLASSIFICATION OPPORTUNITY
   Other/Conflicting: 32342


In [9]:
# ============================================================
# 4B. CONCORDANCE ‚Äî MATCH VARIANTS BETWEEN AM AND CLINVAR
# ============================================================

# For concordance, we need to match AM predictions with ClinVar classifications.
# We'll use the variants that appear in our TCGA dataset as the link.

# Extract protein changes from ClinVar names (format varies)
import re

def extract_protein_from_clinvar(name):
    """Extract protein change from ClinVar Name field.
    Handles formats like: 'NM_000059.4(BRCA2):c.8524C>T (p.Arg2842Cys)'
    """
    # Look for p. notation
    match = re.search(r'p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})', str(name))
    if match:
        aa3to1 = {
            'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Gln':'Q',
            'Glu':'E','Gly':'G','His':'H','Ile':'I','Leu':'L','Lys':'K',
            'Met':'M','Phe':'F','Pro':'P','Ser':'S','Thr':'T','Trp':'W',
            'Tyr':'Y','Val':'V','Ter':'*'
        }
        ref = aa3to1.get(match.group(1), '?')
        pos = match.group(2)
        alt = aa3to1.get(match.group(3), '?')
        return f"{ref}{pos}{alt}"
    # Try 1-letter
    match1 = re.search(r'p\.([A-Z*])(\d+)([A-Z*])', str(name))
    if match1:
        return f"{match1.group(1)}{match1.group(2)}{match1.group(3)}"
    return None

if len(cv_data) > 0 and len(df_am_full) > 0:
    # Parse protein changes from ClinVar
    cv_data["protein_change_parsed"] = cv_data["cv_name"].apply(extract_protein_from_clinvar)
    cv_parsed = cv_data[cv_data["protein_change_parsed"].notna()].copy()
    print(f"ClinVar entries with parseable protein change: {len(cv_parsed)}/{len(cv_data)}")

    # Build AM lookup: gene + protein_variant ‚Üí am_class, am_pathogenicity
    # Map UniProt back to gene
    UNIPROT_TO_GENE = {
        "P38398":"BRCA1","P51587":"BRCA2","Q13315":"ATM","Q86YC2":"PALB2",
        "Q9BX63":"BRIP1","Q99728":"BARD1","Q9NYV4":"CDK12","O14757":"CHEK1",
        "O96017":"CHEK2","Q9NW38":"FANCL","O15315":"RAD51B","O43502":"RAD51C",
        "O75771":"RAD51D","Q92698":"RAD54L","O15360":"FANCA","Q00597":"FANCC",
        "Q9BXW9":"FANCD2","Q9HB96":"FANCE","Q9NPI8":"FANCF","O15287":"FANCG",
        "O60934":"NBN","P49959":"MRE11","Q92878":"RAD50","Q13535":"ATR","P46100":"ATRX",
    }

    df_am_full["am_gene"] = df_am_full["uniprot_id"].map(UNIPROT_TO_GENE)
    df_am_full["am_pchange"] = df_am_full["protein_variant"]  # format: R175H

    # Create matching key: gene + protein_change
    df_am_full["match_key"] = df_am_full["am_gene"] + "_" + df_am_full["am_pchange"]
    cv_parsed["match_key"] = cv_parsed["cv_gene"] + "_" + cv_parsed["protein_change_parsed"]

    # Merge
    concordance = cv_parsed.merge(
        df_am_full[["match_key", "am_pathogenicity", "am_class"]].drop_duplicates("match_key"),
        on="match_key",
        how="inner"
    )
    print(f"\n‚úÖ Matched ClinVar √ó AlphaMissense: {len(concordance)} variants")

    # Filter to only P/LP and B/LB (exclude VUS for concordance ‚Äî those are the unknowns)
    conc_known = concordance[concordance["cv_simple"].isin(["P/LP", "B/LB"])].copy()
    print(f"   Known (P/LP or B/LB) with AM score: {len(conc_known)}")

    if len(conc_known) > 0:
        # Binary classification for kappa: AM pathogenic vs not, ClinVar P/LP vs not
        conc_known["cv_binary"] = (conc_known["cv_simple"] == "P/LP").astype(int)
        conc_known["am_binary"] = (conc_known["am_class"] == "pathogenic").astype(int)

        # Cohen's kappa
        from sklearn.metrics import cohen_kappa_score, confusion_matrix, classification_report
        kappa = cohen_kappa_score(conc_known["cv_binary"], conc_known["am_binary"])
        print(f"\n{'='*60}")
        print(f"CONCORDANCE: AlphaMissense vs ClinVar")
        print(f"{'='*60}")
        print(f"Cohen's kappa: {kappa:.3f}")

        # Confusion matrix
        cm = confusion_matrix(conc_known["cv_binary"], conc_known["am_binary"])
        print(f"\nConfusion matrix (rows=ClinVar, cols=AM):")
        print(f"              AM_Benign  AM_Pathogenic")
        print(f"  CV_B/LB     {cm[0,0]:6d}     {cm[0,1]:6d}")
        print(f"  CV_P/LP     {cm[1,0]:6d}     {cm[1,1]:6d}")

        # Sensitivity & Specificity
        TP, FP, FN, TN = cm[1,1], cm[0,1], cm[1,0], cm[0,0]
        sens = TP / (TP + FN) if (TP + FN) > 0 else np.nan
        spec = TN / (TN + FP) if (TN + FP) > 0 else np.nan
        ppv = TP / (TP + FP) if (TP + FP) > 0 else np.nan
        npv = TN / (TN + FN) if (TN + FN) > 0 else np.nan
        acc = (TP + TN) / (TP + FP + FN + TN)

        print(f"\nSensitivity (for pathogenic): {sens:.3f}")
        print(f"Specificity: {spec:.3f}")
        print(f"PPV: {ppv:.3f}")
        print(f"NPV: {npv:.3f}")
        print(f"Accuracy: {acc:.3f}")

        # Save concordance results
        conc_results = pd.DataFrame({
            "Metric": ["Cohen's kappa", "Sensitivity", "Specificity", "PPV", "NPV", "Accuracy",
                       "TP", "FP", "FN", "TN", "Total variants"],
            "Value": [kappa, sens, spec, ppv, npv, acc, TP, FP, FN, TN, len(conc_known)]
        })
        conc_results.to_csv(RESULTS_DIR / "concordance_results.csv", index=False)
        print(f"\nüíæ Saved: {RESULTS_DIR / 'concordance_results.csv'}")
    else:
        print("‚ö†Ô∏è  No known (P/LP or B/LB) variants matched ‚Äî concordance not computed")
        kappa = np.nan
else:
    print("‚ö†Ô∏è  Need both ClinVar + AlphaMissense lookup for concordance analysis")
    print("   This can be run once ClinVar data is available.")
    kappa = np.nan
    concordance = pd.DataFrame()


ClinVar entries with parseable protein change: 113170/193148

‚úÖ Matched ClinVar √ó AlphaMissense: 104776 variants
   Known (P/LP or B/LB) with AM score: 5414

CONCORDANCE: AlphaMissense vs ClinVar
Cohen's kappa: 0.733

Confusion matrix (rows=ClinVar, cols=AM):
              AM_Benign  AM_Pathogenic
  CV_B/LB       3856        198
  CV_P/LP        328       1032

Sensitivity (for pathogenic): 0.759
Specificity: 0.951
PPV: 0.839
NPV: 0.922
Accuracy: 0.903

üíæ Saved: results/concordance_results.csv


## 4. VUS Reclassification Yield

**The key clinical question:** How many previously unclassified VUS does AlphaMissense reclassify as pathogenic or benign?

This directly impacts the paper's clinical narrative ‚Äî each reclassified VUS is a patient whose variant annotation could potentially inform future PARP inhibitor considerations (pending clinical validation).


In [10]:
# ============================================================
# 5. VUS RECLASSIFICATION YIELD
# ============================================================

if len(concordance) > 0:
    # Filter concordance to VUS only
    vus_reclass = concordance[concordance["cv_simple"] == "VUS"].copy()
    n_vus_total = len(vus_reclass)
    print(f"{'='*60}")
    print(f"VUS RECLASSIFICATION ANALYSIS")
    print(f"{'='*60}")
    print(f"\nTotal VUS in ClinVar matched to AlphaMissense: {n_vus_total}")

    if n_vus_total > 0:
        n_to_path = (vus_reclass["am_class"] == "pathogenic").sum()
        n_to_ben = (vus_reclass["am_class"] == "benign").sum()
        n_remain_amb = (vus_reclass["am_class"] == "ambiguous").sum()

        print(f"\n  Reclassified as PATHOGENIC: {n_to_path} ({100*n_to_path/n_vus_total:.1f}%)")
        print(f"  Reclassified as BENIGN:     {n_to_ben} ({100*n_to_ben/n_vus_total:.1f}%)")
        print(f"  Remain AMBIGUOUS:           {n_remain_amb} ({100*n_remain_amb/n_vus_total:.1f}%)")
        print(f"  TOTAL RECLASSIFIED:         {n_to_path + n_to_ben} ({100*(n_to_path+n_to_ben)/n_vus_total:.1f}%)")

        # By gene
        print(f"\n  Reclassification by gene:")
        vus_by_gene = vus_reclass.groupby("cv_gene")["am_class"].value_counts().unstack(fill_value=0)
        print(vus_by_gene.to_string())

        # Save
        vus_reclass.to_csv(RESULTS_DIR / "vus_reclassification.csv", index=False)
        print(f"\nüíæ Saved: {RESULTS_DIR / 'vus_reclassification.csv'}")

        # Clinical impact estimate
        print(f"\n{'='*60}")
        print(f"CLINICAL IMPACT ESTIMATE")
        print(f"{'='*60}")
        print(f"  If {n_to_path} VUS are truly pathogenic:")
        print(f"  ‚Üí These patients have variants predicted pathogenic by AlphaMissense (hypothesis-generating; not for clinical decision-making)")
        print(f"  If {n_to_ben} VUS are truly benign:")
        print(f"  ‚Üí These patients can be spared unnecessary genetic counseling anxiety")
    else:
        print("  No VUS found in matched set")
else:
    print("‚ö†Ô∏è  VUS reclassification analysis requires ClinVar √ó AM concordance data")
    print("   Proceeding to survival analysis using AM classification directly")
    # For the TCGA variants, classify as VUS everything not in ClinVar
    print(f"\n  In TCGA-PRAD HRR missense variants:")
    print(f"  AM-pathogenic: {(df_var['am_class']=='pathogenic').sum()}")
    print(f"  AM-benign: {(df_var['am_class']=='benign').sum()}")
    print(f"  AM-ambiguous: {(df_var['am_class']=='ambiguous').sum()}")


VUS RECLASSIFICATION ANALYSIS

Total VUS in ClinVar matched to AlphaMissense: 74246

  Reclassified as PATHOGENIC: 15930 (21.5%)
  Reclassified as BENIGN:     50982 (68.7%)
  Remain AMBIGUOUS:           7334 (9.9%)
  TOTAL RECLASSIFIED:         66912 (90.1%)

  Reclassification by gene:
am_class  ambiguous  benign  pathogenic
cv_gene                                
ATM            1748    9072        3508
ATR             476    2852         808
ATRX            106     972         388
BARD1           326    2488         842
BRCA1           196    2920         396
BRCA2           466    5272         780
BRIP1           434    3664        1292
CDK12           136    1504         302
CHEK1             4      32          42
CHEK2           346    1386        1684
FANCA           432    3526         516
FANCC           204    1350         130
FANCD2          120     980         158
FANCE            66     474          48
FANCF            58     388          62
FANCG           102     808     

## 5. Survival Analysis

### Primary Analysis: Cox Proportional Hazards
- **Outcome:** Overall Survival (OS)
- **Primary exposure:** ‚â•1 AM-pathogenic HRR variant (binary)
- **Covariates:** Age, Gleason score (where available)
- **Complementary:** RMST (Restricted Mean Survival Time) at œÑ = 90th percentile

### Secondary: Kaplan-Meier Curves (Fig 3)

**Note on power:** With n~41 HRR-mutated patients and limited events in TCGA-PRAD (localized disease, long survival), the primary value is **effect estimation** rather than definitive hypothesis testing. The PARP cohort validation (Notebook 3) provides the clinically relevant test.


In [11]:
# ============================================================
# 6A. SURVIVAL ANALYSIS ‚Äî COX PH
# ============================================================
from lifelines import CoxPHFitter, KaplanMeierFitter
from lifelines.statistics import logrank_test

# Prepare survival dataset
df_surv = df_analysis.dropna(subset=["os_time", "os_event"]).copy()
df_surv = df_surv[df_surv["os_time"] > 0].copy()

print(f"{'='*60}")
print(f"SURVIVAL ANALYSIS")
print(f"{'='*60}")
print(f"Patients with valid OS data: {len(df_surv)}")
print(f"  AM-Pathogenic group: {df_surv['has_am_pathogenic'].sum()}")
print(f"  AM-Benign/Ambiguous: {(~df_surv['has_am_pathogenic']).sum()}")
print(f"  Total events (deaths): {df_surv['os_event'].sum():.0f}")

if len(df_surv) < 10 or df_surv["os_event"].sum() < 3:
    print("\n‚ö†Ô∏è  Insufficient events for Cox regression.")
    print("   TCGA-PRAD is primarily localized disease with few deaths.")
    print("   This is expected ‚Äî the survival analysis will be exploratory.")
    print("   Definitive testing will come from the mCRPC PARP cohort (Notebook 3).")
    cox_result = None
else:
    # Prepare covariates
    surv_cols = ["os_time", "os_event", "has_am_pathogenic"]

    # Try to add age as covariate
    age_col = None
    for c in df_surv.columns:
        if "AGE" in c.upper() and "DIAG" in c.upper():
            age_col = c
            break
        elif c.upper() == "AGE":
            age_col = c
            break

    if age_col:
        df_surv["age_numeric"] = pd.to_numeric(df_surv[age_col], errors="coerce")
        if df_surv["age_numeric"].notna().sum() > 10:
            surv_cols.append("age_numeric")
            print(f"  Including covariate: age (n={df_surv['age_numeric'].notna().sum()})")

    # Fit Cox model
    df_cox = df_surv[surv_cols].dropna().copy()
    df_cox["has_am_pathogenic"] = df_cox["has_am_pathogenic"].astype(int)

    print(f"\nCox PH model (n={len(df_cox)}, events={df_cox['os_event'].sum():.0f}):")

    try:
        cph = CoxPHFitter()
        cph.fit(df_cox, duration_col="os_time", event_col="os_event")
        cph.print_summary()

        # Extract HR for AM-pathogenic
        hr = np.exp(cph.params_["has_am_pathogenic"])
        ci_low = np.exp(cph.confidence_intervals_.loc["has_am_pathogenic", "95% lower-bound"])
        ci_high = np.exp(cph.confidence_intervals_.loc["has_am_pathogenic", "95% upper-bound"])
        p_val = cph.summary.loc["has_am_pathogenic", "p"]

        print(f"\n{'='*60}")
        print(f"PRIMARY RESULT: AM-Pathogenic HRR")
        print(f"{'='*60}")
        print(f"  HR: {hr:.2f} (95% CI: {ci_low:.2f}‚Äì{ci_high:.2f})")
        print(f"  p-value: {p_val:.4f}")

        # PH assumption check
        print(f"\nProportional hazards test:")
        try:
            ph_test = cph.check_assumptions(df_cox, show_plots=False)
            print("  ‚úÖ PH assumption OK")
        except Exception as e:
            print(f"  ‚ö†Ô∏è  PH test: {e}")

        cox_result = {"hr": hr, "ci_low": ci_low, "ci_high": ci_high, "p": p_val}

    except Exception as e:
        print(f"‚ö†Ô∏è  Cox model failed: {e}")
        print("   Likely due to very few events. Proceeding with KM only.")
        cox_result = None


SURVIVAL ANALYSIS
Patients with valid OS data: 40
  AM-Pathogenic group: 19
  AM-Benign/Ambiguous: 21
  Total events (deaths): 1

‚ö†Ô∏è  Insufficient events for Cox regression.
   TCGA-PRAD is primarily localized disease with few deaths.
   This is expected ‚Äî the survival analysis will be exploratory.
   Definitive testing will come from the mCRPC PARP cohort (Notebook 3).


In [12]:
# ============================================================
# 6B. KAPLAN-MEIER CURVES
# ============================================================

print("\n" + "="*60)
print("KAPLAN-MEIER ANALYSIS")
print("="*60)

if len(df_surv) >= 5:
    # Split by AM group
    grp_path = df_surv[df_surv["has_am_pathogenic"] == True]
    grp_ben = df_surv[df_surv["has_am_pathogenic"] == False]

    kmf_path = KaplanMeierFitter()
    kmf_ben = KaplanMeierFitter()

    if len(grp_path) >= 2 and len(grp_ben) >= 2:
        kmf_path.fit(
            grp_path["os_time"], grp_path["os_event"],
            label=f"AM-Pathogenic (n={len(grp_path)})"
        )
        kmf_ben.fit(
            grp_ben["os_time"], grp_ben["os_event"],
            label=f"AM-Benign/Ambig (n={len(grp_ben)})"
        )

        # Log-rank test
        lr = logrank_test(
            grp_path["os_time"], grp_ben["os_time"],
            grp_path["os_event"], grp_ben["os_event"]
        )
        print(f"Log-rank test: œá¬≤={lr.test_statistic:.3f}, p={lr.p_value:.4f}")

        # Median survival
        print(f"\nMedian OS:")
        print(f"  AM-Pathogenic: {kmf_path.median_survival_time_:.1f} months")
        print(f"  AM-Benign/Amb: {kmf_ben.median_survival_time_:.1f} months")

        # RMST (Restricted Mean Survival Time)
        tau = df_surv["os_time"].quantile(0.9)
        print(f"\nRMST (œÑ={tau:.1f} months):")

        # Simple RMST via trapezoidal integration of KM curves
        def compute_rmst(kmf, tau):
            """Compute RMST from KaplanMeierFitter up to time tau."""
            times = kmf.survival_function_.index
            surv = kmf.survival_function_.iloc[:, 0]
            # Restrict to tau
            mask = times <= tau
            t = np.concatenate([[0], times[mask].values, [tau]])
            s = np.concatenate([[1.0], surv[mask].values, [surv[mask].iloc[-1] if mask.sum() > 0 else 0]])
            # Trapezoidal integration
            return np.trapz(s, t)

        rmst_path = compute_rmst(kmf_path, tau)
        rmst_ben = compute_rmst(kmf_ben, tau)
        print(f"  AM-Pathogenic: {rmst_path:.1f} months")
        print(f"  AM-Benign/Amb: {rmst_ben:.1f} months")
        print(f"  Œî RMST: {rmst_path - rmst_ben:.1f} months")

    else:
        print("‚ö†Ô∏è  Too few patients in one group for KM analysis")
        lr = None
else:
    print("‚ö†Ô∏è  Insufficient patients with survival data")
    lr = None



KAPLAN-MEIER ANALYSIS
Log-rank test: œá¬≤=0.895, p=0.3442

Median OS:
  AM-Pathogenic: inf months
  AM-Benign/Amb: inf months

RMST (œÑ=69.3 months):


AttributeError: module 'numpy' has no attribute 'trapz'

## 6. Sensitivity Analyses

Three pre-specified sensitivity analyses:

1. **Threshold variation:** Test AM score as continuous + alternative cutoffs (0.34, 0.50, 0.564, 0.80)
2. **E-value:** Assess robustness to unmeasured confounding
3. **Gene exclusion (leave-one-gene-out):** Test if results are driven by a single gene


In [13]:
# ============================================================
# 7A. SENSITIVITY ‚Äî THRESHOLD VARIATION
# ============================================================

print("="*60)
print("SENSITIVITY 1: AlphaMissense Threshold Variation")
print("="*60)

# Use df_analysis which already has max_am_score from the merge
df_sens = df_analysis.dropna(subset=["os_time", "os_event"]).copy()
df_sens = df_sens[df_sens["os_time"] > 0]

# If max_am_score not in df_analysis, compute it
if "max_am_score" not in df_sens.columns:
    max_am = df_var.groupby("sample_id")["am_pathogenicity"].max().reset_index()
    max_am.columns = ["sample_id", "max_am_score"]
    df_sens = df_sens.merge(max_am, on="sample_id", how="left")

print(f"Patients for sensitivity: {len(df_sens)}")
print(f"  With max_am_score: {df_sens['max_am_score'].notna().sum()}")
print(f"  Events: {df_sens['os_event'].sum():.0f}")

thresholds = [0.34, 0.50, 0.564, 0.70, 0.80]
threshold_results = []

for thresh in thresholds:
    has_score = df_sens["max_am_score"].notna()
    df_t = df_sens[has_score].copy()
    df_t[f"path_{thresh}"] = (df_t["max_am_score"] >= thresh).astype(int)
    n_path = df_t[f"path_{thresh}"].sum()
    n_ben = len(df_t) - n_path

    if n_path >= 2 and n_ben >= 2 and df_t["os_event"].sum() >= 3:
        try:
            cph_s = CoxPHFitter()
            cph_s.fit(
                df_t[["os_time", "os_event", f"path_{thresh}"]].dropna(),
                duration_col="os_time", event_col="os_event"
            )
            hr = np.exp(cph_s.params_[f"path_{thresh}"])
            ci = np.exp(cph_s.confidence_intervals_.loc[f"path_{thresh}"])
            p = cph_s.summary.loc[f"path_{thresh}", "p"]
            threshold_results.append({
                "threshold": thresh, "n_pathogenic": n_path, "n_benign": n_ben,
                "HR": hr, "CI_low": ci.iloc[0], "CI_high": ci.iloc[1], "p_value": p
            })
            print(f"  Threshold ‚â•{thresh:.3f}: n_path={n_path}, HR={hr:.2f} "
                  f"(95% CI {ci.iloc[0]:.2f}‚Äì{ci.iloc[1]:.2f}), p={p:.4f}")
        except Exception as e:
            threshold_results.append({
                "threshold": thresh, "n_pathogenic": n_path, "n_benign": n_ben,
                "HR": np.nan, "CI_low": np.nan, "CI_high": np.nan, "p_value": np.nan
            })
            print(f"  Threshold ‚â•{thresh:.3f}: n_path={n_path} ‚Äî model failed ({e})")
    else:
        threshold_results.append({
            "threshold": thresh, "n_pathogenic": n_path, "n_benign": n_ben,
            "HR": np.nan, "CI_low": np.nan, "CI_high": np.nan, "p_value": np.nan
        })
        print(f"  Threshold ‚â•{thresh:.3f}: n_path={n_path}, events={df_t['os_event'].sum():.0f} ‚Äî insufficient")

df_thresh = pd.DataFrame(threshold_results)
df_thresh.to_csv(RESULTS_DIR / "sensitivity_threshold.csv", index=False)
print(f"\nüíæ Saved: {RESULTS_DIR / 'sensitivity_threshold.csv'}")


SENSITIVITY 1: AlphaMissense Threshold Variation
Patients for sensitivity: 40
  With max_am_score: 39
  Events: 1
  Threshold ‚â•0.340: n_path=20, events=1 ‚Äî insufficient
  Threshold ‚â•0.500: n_path=20, events=1 ‚Äî insufficient
  Threshold ‚â•0.564: n_path=19, events=1 ‚Äî insufficient
  Threshold ‚â•0.700: n_path=14, events=1 ‚Äî insufficient
  Threshold ‚â•0.800: n_path=13, events=1 ‚Äî insufficient

üíæ Saved: results/sensitivity_threshold.csv


In [14]:
# ============================================================
# 7B. SENSITIVITY ‚Äî E-VALUE
# ============================================================

print("\n" + "="*60)
print("SENSITIVITY 2: E-value for Unmeasured Confounding")
print("="*60)

def compute_evalue(hr):
    """Compute E-value for a hazard ratio.
    E-value = HR + sqrt(HR*(HR-1)) for HR >= 1
    For HR < 1, use 1/HR.
    """
    if pd.isna(hr) or hr <= 0:
        return np.nan
    rr = hr if hr >= 1 else 1/hr
    return rr + np.sqrt(rr * (rr - 1))

if cox_result is not None:
    hr = cox_result["hr"]
    ci_bound = cox_result["ci_low"] if hr >= 1 else cox_result["ci_high"]

    e_point = compute_evalue(hr)
    e_ci = compute_evalue(ci_bound)

    print(f"  Observed HR: {hr:.2f}")
    print(f"  E-value (point estimate): {e_point:.2f}")
    print(f"  E-value (CI bound): {e_ci:.2f}")
    print(f"\n  Interpretation:")
    if e_point > 2.0:
        print(f"  ‚úÖ E-value > 2.0: robust to moderate unmeasured confounding")
    elif e_point > 1.5:
        print(f"  ‚ö†Ô∏è  E-value 1.5‚Äì2.0: moderate robustness")
    else:
        print(f"  ‚ùå E-value < 1.5: result sensitive to unmeasured confounding")
else:
    print("  E-value not computed (Cox model not available)")



SENSITIVITY 2: E-value for Unmeasured Confounding
  E-value not computed (Cox model not available)


In [15]:
# ============================================================
# 7C. SENSITIVITY ‚Äî LEAVE-ONE-GENE-OUT (LOGO)
# ============================================================

print("\n" + "="*60)
print("SENSITIVITY 3: Leave-One-Gene-Out (LOGO)")
print("="*60)

# For each HRR gene, exclude all its variants and re-run the analysis
genes_present = df_var["gene"].unique()
logo_results = []

for exclude_gene in sorted(genes_present):
    # Recompute patient-level AM status excluding this gene
    vars_excl = df_var[df_var["gene"] != exclude_gene]
    path_patients_excl = set(
        vars_excl[vars_excl["am_class"] == "pathogenic"]["sample_id"].unique()
    )

    # Re-create analysis flag
    df_logo = df_surv.copy()
    df_logo["am_path_logo"] = df_logo["sample_id"].isin(path_patients_excl).astype(int)

    n_path = df_logo["am_path_logo"].sum()
    n_ben = len(df_logo) - n_path

    if n_path >= 2 and n_ben >= 2 and df_logo["os_event"].sum() >= 3:
        try:
            cph_logo = CoxPHFitter()
            cph_logo.fit(
                df_logo[["os_time", "os_event", "am_path_logo"]].dropna(),
                duration_col="os_time", event_col="os_event"
            )
            hr_l = np.exp(cph_logo.params_["am_path_logo"])
            ci_l = np.exp(cph_logo.confidence_intervals_.loc["am_path_logo"])
            p_l = cph_logo.summary.loc["am_path_logo", "p"]
            logo_results.append({
                "excluded_gene": exclude_gene,
                "n_path": n_path, "n_ben": n_ben,
                "HR": hr_l, "CI_low": ci_l.iloc[0], "CI_high": ci_l.iloc[1], "p": p_l
            })
            print(f"  Excl. {exclude_gene:10s}: n_path={n_path:3d}, HR={hr_l:.2f} "
                  f"({ci_l.iloc[0]:.2f}‚Äì{ci_l.iloc[1]:.2f}), p={p_l:.4f}")
        except:
            logo_results.append({
                "excluded_gene": exclude_gene, "n_path": n_path, "n_ben": n_ben,
                "HR": np.nan, "CI_low": np.nan, "CI_high": np.nan, "p": np.nan
            })
            print(f"  Excl. {exclude_gene:10s}: model failed")
    else:
        logo_results.append({
            "excluded_gene": exclude_gene, "n_path": n_path, "n_ben": n_ben,
            "HR": np.nan, "CI_low": np.nan, "CI_high": np.nan, "p": np.nan
        })
        print(f"  Excl. {exclude_gene:10s}: insufficient (n_path={n_path})")

df_logo_res = pd.DataFrame(logo_results)
df_logo_res.to_csv(RESULTS_DIR / "sensitivity_logo.csv", index=False)
print(f"\nüíæ Saved: {RESULTS_DIR / 'sensitivity_logo.csv'}")



SENSITIVITY 3: Leave-One-Gene-Out (LOGO)
  Excl. ATM       : insufficient (n_path=9)
  Excl. ATR       : insufficient (n_path=19)
  Excl. ATRX      : insufficient (n_path=16)
  Excl. BARD1     : insufficient (n_path=19)
  Excl. BRCA1     : insufficient (n_path=19)
  Excl. BRCA2     : insufficient (n_path=19)
  Excl. BRIP1     : insufficient (n_path=19)
  Excl. CDK12     : insufficient (n_path=16)
  Excl. FANCC     : insufficient (n_path=19)
  Excl. FANCD2    : insufficient (n_path=19)
  Excl. FANCF     : insufficient (n_path=19)
  Excl. FANCG     : insufficient (n_path=18)
  Excl. FANCL     : insufficient (n_path=18)
  Excl. NBN       : insufficient (n_path=19)
  Excl. PALB2     : insufficient (n_path=19)
  Excl. RAD50     : insufficient (n_path=19)
  Excl. RAD51B    : insufficient (n_path=19)
  Excl. RAD51D    : insufficient (n_path=19)
  Excl. RAD54L    : insufficient (n_path=18)

üíæ Saved: results/sensitivity_logo.csv


## 7. Publication-Ready Figures

**Figure 1:** Study flowchart + AlphaMissense score distribution
**Figure 2:** Concordance heatmap (AM vs ClinVar) + VUS Sankey
**Figure 3:** Kaplan-Meier survival curves
**Figure 4:** Sensitivity forest plot (threshold variation + LOGO)
**Figure 5:** Gene-level summary heatmap


In [16]:
# ============================================================
# 8A. FIGURE 1 ‚Äî AlphaMissense Score Distribution + Classification
# ============================================================

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Panel A: Score distribution histogram
ax = axes[0]
scores = df_var["am_pathogenicity"].dropna()

ax.hist(scores, bins=50, color="#4C72B0", edgecolor="white", alpha=0.85)
ax.axvline(0.34, color="#E74C3C", linestyle="--", linewidth=1.5, label="Benign/Ambiguous (0.34)")
ax.axvline(0.564, color="#2ECC71", linestyle="--", linewidth=1.5, label="Ambiguous/Pathogenic (0.564)")

# Shade regions
ax.axvspan(0, 0.34, alpha=0.08, color="#3498DB", label="_")
ax.axvspan(0.34, 0.564, alpha=0.08, color="#F39C12", label="_")
ax.axvspan(0.564, 1.0, alpha=0.08, color="#E74C3C", label="_")

ax.set_xlabel("AlphaMissense Pathogenicity Score")
ax.set_ylabel("Number of Variants")
ax.set_title("A) AlphaMissense Score Distribution\n(HRR missense variants, TCGA-PRAD)")
ax.legend(fontsize=8)
ax.set_xlim(0, 1)

# Add text annotations
n_ben = (df_var["am_class"] == "benign").sum()
n_amb = (df_var["am_class"] == "ambiguous").sum()
n_pat = (df_var["am_class"] == "pathogenic").sum()
ax.text(0.17, ax.get_ylim()[1]*0.85, f"Benign\nn={n_ben}", ha="center", fontsize=8, color="#3498DB")
ax.text(0.45, ax.get_ylim()[1]*0.85, f"Ambiguous\nn={n_amb}", ha="center", fontsize=8, color="#F39C12")
ax.text(0.78, ax.get_ylim()[1]*0.85, f"Pathogenic\nn={n_pat}", ha="center", fontsize=8, color="#E74C3C")

# Panel B: By gene (top genes)
ax = axes[1]
gene_am = df_var.groupby("gene").agg(
    mean_score=("am_pathogenicity", "mean"),
    n_variants=("am_pathogenicity", "count"),
    pct_pathogenic=("am_class", lambda x: 100*(x=="pathogenic").sum()/len(x))
).reset_index()
gene_am = gene_am.sort_values("n_variants", ascending=True)

# Only show genes with ‚â•2 variants for readability
gene_am_plot = gene_am[gene_am["n_variants"] >= 2]

colors = []
for _, row in gene_am_plot.iterrows():
    if row["gene"] in COHORT_A_GENES:
        colors.append("#E74C3C")
    elif row["gene"] in COHORT_B_GENES:
        colors.append("#3498DB")
    else:
        colors.append("#95A5A6")

ax.barh(gene_am_plot["gene"], gene_am_plot["n_variants"], color=colors, edgecolor="white")
ax.set_xlabel("Number of Missense Variants")
ax.set_title("B) HRR Missense Variants by Gene")

# Legend
patches = [
    mpatches.Patch(color="#E74C3C", label="Cohort A (BRCA1/2, ATM)"),
    mpatches.Patch(color="#3498DB", label="Cohort B (PROfound)"),
    mpatches.Patch(color="#95A5A6", label="Extended DDR"),
]
ax.legend(handles=patches, fontsize=8, loc="lower right")

plt.tight_layout()
plt.savefig(FIG_DIR / "Fig1_AM_distribution.png", dpi=300)
plt.savefig(FIG_DIR / "Fig1_AM_distribution.pdf")
plt.show()
print("‚úÖ Figure 1 saved")


‚úÖ Figure 1 saved


In [17]:
# ============================================================
# 8B. FIGURE 2 ‚Äî Concordance Heatmap (AM vs ClinVar)
# ============================================================

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Panel A: Confusion matrix heatmap
ax = axes[0]
if len(concordance) > 0 and 'conc_known' in dir() and len(conc_known) > 0:
    # 3x3 confusion: ClinVar (P/LP, VUS, B/LB) √ó AM (pathogenic, ambiguous, benign)
    all_conc = concordance[concordance["cv_simple"].isin(["P/LP", "VUS", "B/LB"])].copy()

    cv_order = ["P/LP", "VUS", "B/LB"]
    am_order = ["pathogenic", "ambiguous", "benign"]

    ct = pd.crosstab(
        all_conc["cv_simple"],
        all_conc["am_class"],
    ).reindex(index=cv_order, columns=am_order, fill_value=0)

    sns.heatmap(ct, annot=True, fmt="d", cmap="YlOrRd", ax=ax,
                cbar_kws={"label": "Number of variants"})
    ax.set_xlabel("AlphaMissense Classification")
    ax.set_ylabel("ClinVar Classification")
    ax.set_title("A) AlphaMissense vs ClinVar\nConcordance Matrix")

    # Add kappa annotation
    if not np.isnan(kappa):
        ax.text(0.02, 0.98, f"Cohen's Œ∫ = {kappa:.3f}",
                transform=ax.transAxes, fontsize=10, va="top",
                bbox=dict(boxstyle="round", facecolor="white", alpha=0.8))
else:
    ax.text(0.5, 0.5, "Concordance data\nnot available",
            ha="center", va="center", fontsize=12, color="gray",
            transform=ax.transAxes)
    ax.set_title("A) AlphaMissense vs ClinVar")

# Panel B: VUS Reclassification (bar chart as Sankey alternative)
ax = axes[1]
if len(concordance) > 0:
    vus_data = concordance[concordance["cv_simple"] == "VUS"]
    if len(vus_data) > 0:
        reclass_counts = vus_data["am_class"].value_counts().reindex(
            ["pathogenic", "ambiguous", "benign"], fill_value=0
        )
        bars = ax.bar(
            reclass_counts.index,
            reclass_counts.values,
            color=["#E74C3C", "#F39C12", "#3498DB"],
            edgecolor="white"
        )
        ax.set_ylabel("Number of ClinVar VUS")
        ax.set_title(f"B) VUS Reclassification by AlphaMissense\n(n={len(vus_data)} VUS)")
        ax.set_xlabel("AlphaMissense Reclassification")

        # Add count labels
        for bar, val in zip(bars, reclass_counts.values):
            if val > 0:
                ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                        str(val), ha="center", va="bottom", fontsize=11, fontweight="bold")
    else:
        ax.text(0.5, 0.5, "No VUS found\nin matched data",
                ha="center", va="center", fontsize=12, color="gray",
                transform=ax.transAxes)
else:
    ax.text(0.5, 0.5, "ClinVar √ó AM matching\nnot available",
            ha="center", va="center", fontsize=12, color="gray",
            transform=ax.transAxes)
    ax.set_title("B) VUS Reclassification")

plt.tight_layout()
plt.savefig(FIG_DIR / "Fig2_concordance.png", dpi=300)
plt.savefig(FIG_DIR / "Fig2_concordance.pdf")
plt.show()
print("‚úÖ Figure 2 saved")


‚úÖ Figure 2 saved


In [18]:
# ============================================================
# 8C. FIGURE 3 ‚Äî KAPLAN-MEIER CURVES
# ============================================================

fig, ax = plt.subplots(figsize=(8, 6))

if len(df_surv) >= 5 and 'kmf_path' in dir():
    # Plot KM curves
    kmf_path.plot_survival_function(
        ax=ax, color="#E74C3C", linewidth=2, ci_show=True, ci_alpha=0.15
    )
    kmf_ben.plot_survival_function(
        ax=ax, color="#3498DB", linewidth=2, ci_show=True, ci_alpha=0.15
    )

    ax.set_xlabel("Time (months)")
    ax.set_ylabel("Overall Survival Probability")
    ax.set_title("Overall Survival by AlphaMissense HRR Classification\n(TCGA-PRAD, patients with HRR missense variants)")
    ax.set_ylim(0, 1.05)

    # Add log-rank p-value
    if lr is not None:
        p_text = f"Log-rank p = {lr.p_value:.4f}" if lr.p_value >= 0.0001 else f"Log-rank p < 0.0001"
        ax.text(0.98, 0.02, p_text, transform=ax.transAxes,
                fontsize=10, ha="right", va="bottom",
                bbox=dict(boxstyle="round", facecolor="white", alpha=0.8))

    # Add number at risk table below
    # (Simple version ‚Äî lifelines has built-in but can be tricky)
    ax.legend(loc="lower left", fontsize=10)

    # Add HR annotation if available
    if cox_result is not None:
        hr_text = f"HR = {cox_result['hr']:.2f} (95% CI {cox_result['ci_low']:.2f}‚Äì{cox_result['ci_high']:.2f})"
        ax.text(0.98, 0.10, hr_text, transform=ax.transAxes,
                fontsize=9, ha="right", va="bottom",
                bbox=dict(boxstyle="round", facecolor="lightyellow", alpha=0.9))

else:
    ax.text(0.5, 0.5, "Insufficient survival data\nfor Kaplan-Meier analysis\n\n"
            "(Expected for TCGA-PRAD:\nlocalized disease, few events)\n\n"
            "Definitive analysis in Notebook 3\n(mCRPC PARP inhibitor cohort)",
            ha="center", va="center", fontsize=12, color="gray",
            transform=ax.transAxes)
    ax.set_title("Overall Survival by AlphaMissense HRR Classification")

plt.tight_layout()
plt.savefig(FIG_DIR / "Fig3_kaplan_meier.png", dpi=300)
plt.savefig(FIG_DIR / "Fig3_kaplan_meier.pdf")
plt.show()
print("‚úÖ Figure 3 saved")


‚úÖ Figure 3 saved


In [19]:
# ============================================================
# 8D. FIGURE 4 ‚Äî SENSITIVITY FOREST PLOT
# ============================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Panel A: Threshold variation
ax = axes[0]
if len(df_thresh) > 0 and df_thresh["HR"].notna().any():
    valid = df_thresh.dropna(subset=["HR"])
    y_pos = range(len(valid))

    ax.errorbar(
        valid["HR"], y_pos,
        xerr=[valid["HR"] - valid["CI_low"], valid["CI_high"] - valid["HR"]],
        fmt="o", color="#2C3E50", markersize=8, capsize=5, linewidth=1.5
    )
    ax.axvline(1.0, color="gray", linestyle="--", linewidth=1)
    ax.set_yticks(list(y_pos))
    ax.set_yticklabels([f"‚â•{t:.3f} (n={int(n)})" for t, n in zip(valid["threshold"], valid["n_pathogenic"])])
    ax.set_xlabel("Hazard Ratio (95% CI)")
    ax.set_title("A) Threshold Variation")
    ax.set_xlim(0, max(3, valid["CI_high"].max() * 1.2))
else:
    ax.text(0.5, 0.5, "Threshold sensitivity\nnot available\n(insufficient events)",
            ha="center", va="center", fontsize=11, color="gray",
            transform=ax.transAxes)
    ax.set_title("A) Threshold Variation")

# Panel B: Leave-One-Gene-Out
ax = axes[1]
if len(df_logo_res) > 0 and df_logo_res["HR"].notna().any():
    valid_logo = df_logo_res.dropna(subset=["HR"])
    y_pos = range(len(valid_logo))

    ax.errorbar(
        valid_logo["HR"], y_pos,
        xerr=[valid_logo["HR"] - valid_logo["CI_low"], valid_logo["CI_high"] - valid_logo["HR"]],
        fmt="s", color="#8E44AD", markersize=8, capsize=5, linewidth=1.5
    )
    ax.axvline(1.0, color="gray", linestyle="--", linewidth=1)
    ax.set_yticks(list(y_pos))
    ax.set_yticklabels([f"Excl. {g}" for g in valid_logo["excluded_gene"]])
    ax.set_xlabel("Hazard Ratio (95% CI)")
    ax.set_title("B) Leave-One-Gene-Out")
    ax.set_xlim(0, max(3, valid_logo["CI_high"].max() * 1.2))

    # Add reference line for main analysis
    if cox_result is not None:
        ax.axvline(cox_result["hr"], color="#E74C3C", linestyle=":", linewidth=1, alpha=0.7)
else:
    ax.text(0.5, 0.5, "LOGO sensitivity\nnot available\n(insufficient events)",
            ha="center", va="center", fontsize=11, color="gray",
            transform=ax.transAxes)
    ax.set_title("B) Leave-One-Gene-Out")

plt.tight_layout()
plt.savefig(FIG_DIR / "Fig4_sensitivity.png", dpi=300)
plt.savefig(FIG_DIR / "Fig4_sensitivity.pdf")
plt.show()
print("‚úÖ Figure 4 saved")


‚úÖ Figure 4 saved


In [20]:
# ============================================================
# 8E. FIGURE 5 ‚Äî GENE-LEVEL HEATMAP (AM classification √ó Gene √ó Patients)
# ============================================================

fig, ax = plt.subplots(figsize=(10, 8))

# Create a patient √ó gene matrix showing AM classification
# Rows = patients with HRR variants, Columns = genes
patients = df_var["sample_id"].unique()
genes = sorted(df_var["gene"].unique())

# Map: 0 = no variant, 1 = benign, 2 = ambiguous, 3 = pathogenic
matrix = np.zeros((len(patients), len(genes)))
pat_to_idx = {p: i for i, p in enumerate(patients)}
gene_to_idx = {g: i for i, g in enumerate(genes)}

for _, row in df_var.iterrows():
    pi = pat_to_idx[row["sample_id"]]
    gi = gene_to_idx[row["gene"]]
    if row["am_class"] == "pathogenic":
        matrix[pi, gi] = max(matrix[pi, gi], 3)
    elif row["am_class"] == "ambiguous":
        matrix[pi, gi] = max(matrix[pi, gi], 2)
    elif row["am_class"] == "benign":
        matrix[pi, gi] = max(matrix[pi, gi], 1)

# Sort patients by number of pathogenic variants (descending)
path_count = (matrix == 3).sum(axis=1)
sort_idx = np.argsort(-path_count)
matrix = matrix[sort_idx]

# Custom colormap
from matplotlib.colors import ListedColormap
cmap = ListedColormap(["#FFFFFF", "#3498DB", "#F39C12", "#E74C3C"])

im = ax.imshow(matrix, aspect="auto", cmap=cmap, vmin=0, vmax=3, interpolation="nearest")

ax.set_xticks(range(len(genes)))
ax.set_xticklabels(genes, rotation=45, ha="right", fontsize=8)
ax.set_ylabel(f"Patients (n={len(patients)})")
ax.set_title("AlphaMissense Classification of HRR Missense Variants\n(TCGA-PRAD)")

# Colorbar legend
cbar = plt.colorbar(im, ax=ax, ticks=[0, 1, 2, 3], shrink=0.6)
cbar.ax.set_yticklabels(["No variant", "AM-Benign", "AM-Ambiguous", "AM-Pathogenic"])

# Mark cohort A genes
for gi, gene in enumerate(genes):
    if gene in COHORT_A_GENES:
        ax.get_xticklabels()[gi].set_fontweight("bold")
        ax.get_xticklabels()[gi].set_color("#E74C3C")

plt.tight_layout()
plt.savefig(FIG_DIR / "Fig5_gene_heatmap.png", dpi=300)
plt.savefig(FIG_DIR / "Fig5_gene_heatmap.pdf")
plt.show()
print("‚úÖ Figure 5 saved")


‚úÖ Figure 5 saved


## 8. Summary of Results & Export

In [21]:
# ============================================================
# 9. EXECUTIVE SUMMARY
# ============================================================

print("=" * 70)
print("  NOTEBOOK 2 ‚Äî EXECUTIVE SUMMARY OF RESULTS")
print("=" * 70)

print(f"\nüìä DATASET")
print(f"   TCGA-PRAD: {len(df_var)} HRR missense variants in {df_var['sample_id'].nunique()} patients")
print(f"   Genes: {df_var['gene'].nunique()} HRR genes")

print(f"\nüî¨ AlphaMissense CLASSIFICATION")
for c in ["pathogenic", "ambiguous", "benign"]:
    n = (df_var["am_class"] == c).sum()
    print(f"   {c.capitalize():12s}: {n} variants ({100*n/len(df_var):.1f}%)")

if not np.isnan(kappa) if isinstance(kappa, float) else True:
    print(f"\nüîó CONCORDANCE (AM vs ClinVar)")
    print(f"   Cohen's kappa: {kappa:.3f}")

print(f"\nüìà SURVIVAL (OS)")
if cox_result is not None:
    print(f"   HR (AM-Pathogenic vs Benign/Ambig): {cox_result['hr']:.2f} "
          f"(95% CI {cox_result['ci_low']:.2f}‚Äì{cox_result['ci_high']:.2f})")
    print(f"   p-value: {cox_result['p']:.4f}")
else:
    print(f"   Cox model: not computed (insufficient events)")
    print(f"   Expected for localized TCGA-PRAD ‚Äî see Notebook 3 for mCRPC")

if lr is not None:
    print(f"   Log-rank: p={lr.p_value:.4f}")

print(f"\nüìã OUTPUT FILES")
output_files = list(RESULTS_DIR.glob("*.csv")) + list(FIG_DIR.glob("*.png"))
for f in sorted(output_files):
    print(f"   ‚úÖ {f}")

print(f"\n{'='*70}")
print(f"  PUBLICATION ASSESSMENT")
print(f"{'='*70}")
print(f"\n  Key question: Is there signal for publication?")
n_events = df_analysis["os_event"].sum() if "os_event" in df_analysis else 0
if n_events >= 5 and cox_result is not None:
    print(f"  ‚úÖ Yes ‚Äî survival signal detected with {n_events:.0f} events")
    print(f"     HR = {cox_result['hr']:.2f}, which {'crosses' if cox_result['ci_low'] <= 1 <= cox_result['ci_high'] else 'does not cross'} 1.0")
else:
    print(f"  ‚ö†Ô∏è  Exploratory ‚Äî only {n_events:.0f} events in TCGA-PRAD (localized disease)")
    print(f"     This is EXPECTED and does not invalidate the paper.")
    print(f"     The paper's strength comes from:")
    print(f"       1. VUS reclassification yield (computational reclassification yield)")
    print(f"       2. Concordance with ClinVar (validation of the tool)")
    print(f"       3. Notebook 3 will add mCRPC PARP cohort for treatment response")
    print(f"     Target: JCO Precision Oncology (brief communication / tools validation)")

print(f"\n  NEXT STEP: Notebook 3 ‚Äî Validation in mCRPC PARP Inhibitor Cohort")
print(f"  This will provide the clinically definitive test with treatment response data.")


  NOTEBOOK 2 ‚Äî EXECUTIVE SUMMARY OF RESULTS

üìä DATASET
   TCGA-PRAD: 52 HRR missense variants in 40 patients
   Genes: 19 HRR genes

üî¨ AlphaMissense CLASSIFICATION
   Pathogenic  : 19 variants (36.5%)
   Ambiguous   : 1 variants (1.9%)
   Benign      : 31 variants (59.6%)

üîó CONCORDANCE (AM vs ClinVar)
   Cohen's kappa: 0.733

üìà SURVIVAL (OS)
   Cox model: not computed (insufficient events)
   Expected for localized TCGA-PRAD ‚Äî see Notebook 3 for mCRPC
   Log-rank: p=0.3442

üìã OUTPUT FILES
   ‚úÖ figures/Fig1_AM_distribution.png
   ‚úÖ figures/Fig2_concordance.png
   ‚úÖ figures/Fig3_kaplan_meier.png
   ‚úÖ figures/Fig4_sensitivity.png
   ‚úÖ figures/Fig5_gene_heatmap.png
   ‚úÖ results/analysis_dataset.csv
   ‚úÖ results/annotated_hrr_variants.csv
   ‚úÖ results/concordance_results.csv
   ‚úÖ results/patient_hrr_summary.csv
   ‚úÖ results/sensitivity_logo.csv
   ‚úÖ results/sensitivity_threshold.csv
   ‚úÖ results/table_gene_summary.csv
   ‚úÖ results/vus_reclassifi

## ‚ö†Ô∏è Limitations & Intended Use**This analysis is hypothesis-generating and not intended for clinical decision-making.**Key limitations:- AlphaMissense predictions are **computational annotations**, not clinical reclassifications per ACMG/AMP standards.- The VUS "reclassification" reported here is a **computational triage** ‚Äî it does not replace expert curation, functional assays, or clinical-grade variant interpretation.- Concordance with ClinVar does not guarantee correctness for individual variants, particularly in under-represented genes or populations.- Survival analysis is **univariate** (no adjustment for age, stage, treatment, or other confounders) and should be interpreted as associative, not causal.- These results require **prospective clinical validation** before any integration into treatment decisions or molecular tumor board workflows.For clinical use, AlphaMissense scores should be considered as **PP3/BP4-level supporting evidence** within the ACMG/AMP framework, not as standalone determinants.

## ‚úÖ Notebook 2 Complete!

### Key Results Files:
| File | Description |
|------|-------------|
| `results/analysis_dataset.csv` | Full analysis dataset (variants + clinical + AM) |
| `results/table_gene_summary.csv` | Gene-level summary table |
| `results/concordance_results.csv` | AM vs ClinVar concordance metrics |
| `results/vus_reclassification.csv` | VUS reclassification details |
| `results/sensitivity_threshold.csv` | Threshold variation results |
| `results/sensitivity_logo.csv` | Leave-one-gene-out results |
| `figures/Fig1_AM_distribution.png/pdf` | AM score distribution + gene barplot |
| `figures/Fig2_concordance.png/pdf` | Concordance heatmap + VUS reclassification |
| `figures/Fig3_kaplan_meier.png/pdf` | Kaplan-Meier survival curves |
| `figures/Fig4_sensitivity.png/pdf` | Sensitivity forest plots |
| `figures/Fig5_gene_heatmap.png/pdf` | Patient √ó Gene heatmap |

### Next: Notebook 3 ‚Äî Validation in PARP Inhibitor Cohort
- Download mCRPC cohorts (MSK-IMPACT, SU2C/PCF) from cBioPortal
- Filter for patients treated with PARP inhibitors (olaparib, rucaparib)
- Repeat AlphaMissense annotation + survival analysis
- Correlate AM reclassification with PARP response

---
*Notebook created by Research OS ‚Äî Clinical Computational Oncology Pipeline*
*AlphaMissense: Cheng et al., Science 2023. DOI: 10.1126/science.adg7492*
*Survival analysis: lifelines (Davidson-Pilon, JOSS 2019)*
