# 🧬 AlphaMissense VUS Reclassification: Prostate Cancer HRR Genes
## Notebook 1: Data Download, Filtering & Annotation
### v3: Optimized for GitHub Codespaces (uses cBioPortal API)

**Runtime: ~5 min** | No GPU needed | All public data


## 1. Setup

In [1]:
import subprocess, sys
for pkg in ["pandas", "requests", "tqdm", "openpyxl"]:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

import pandas as pd
import numpy as np
import requests
import os, io, re, gzip, json, warnings
from pathlib import Path
from tqdm.auto import tqdm

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 50)

DATA_DIR = Path("data")
RESULTS_DIR = Path("results")
for d in [DATA_DIR, RESULTS_DIR, DATA_DIR / "raw", DATA_DIR / "processed"]:
    d.mkdir(parents=True, exist_ok=True)

print("✅ Setup complete!")


✅ Setup complete!



## 2. HRR Gene Panel (PROfound / TRITON3 / TALAPRO-2)

In [2]:
COHORT_A = ["BRCA1", "BRCA2", "ATM"]
COHORT_B = ["PALB2","BRIP1","BARD1","CDK12","CHEK1","CHEK2","FANCL","RAD51B","RAD51C","RAD51D","RAD54L"]
EXPANDED = ["FANCA","FANCC","FANCD2","FANCE","FANCF","FANCG","NBN","MRE11","RAD50","ATR","ATRX"]

HRR_PRIMARY = sorted(set(COHORT_A + COHORT_B))
HRR_ALL = sorted(set(HRR_PRIMARY + EXPANDED))

GENE2UNI = {
    "BRCA1":"P38398","BRCA2":"P51587","ATM":"Q13315","PALB2":"Q86YC2",
    "BRIP1":"Q9BX63","BARD1":"Q99728","CDK12":"Q9NYV4","CHEK1":"O14757",
    "CHEK2":"O96017","FANCL":"Q9NW38","RAD51B":"O15315","RAD51C":"O43502",
    "RAD51D":"O75771","RAD54L":"Q92698","FANCA":"O15360","FANCC":"Q00597",
    "FANCD2":"Q9BXW9","FANCE":"Q9HB96","FANCF":"Q9NPI8","FANCG":"O15287",
    "NBN":"O60934","MRE11":"P49959","RAD50":"Q92878","ATR":"Q13535","ATRX":"P46100",
}
UNI2GENE = {v: k for k, v in GENE2UNI.items()}

print(f"Cohort A: {COHORT_A}")
print(f"Cohort B: {len(COHORT_B)} genes")
print(f"Extended panel: {len(HRR_ALL)} genes total")


Cohort A: ['BRCA1', 'BRCA2', 'ATM']
Cohort B: 11 genes
Extended panel: 25 genes total
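
Because `UNI2GENE` is built by inverting `GENE2UNI`, a gene without an accession or an accession shared by two genes would silently lose rows later. A minimal sanity check might look like the sketch below; `check_panel` is a hypothetical helper, and the toy inputs (including the made-up `GENE_X`) exist only for illustration.

```python
def check_panel(genes, gene_to_uniprot):
    """Return (genes lacking an accession, accessions shared by more than one gene)."""
    missing = sorted(g for g in genes if g not in gene_to_uniprot)
    by_accession = {}
    for gene, acc in gene_to_uniprot.items():
        by_accession.setdefault(acc, []).append(gene)
    shared = {acc: gs for acc, gs in by_accession.items() if len(gs) > 1}
    return missing, shared

# Toy inputs: two accessions copied from the mapping above, plus a made-up gene
toy_genes = ["ATM", "BRCA1", "GENE_X"]
toy_map = {"ATM": "Q13315", "BRCA1": "P38398"}

missing, shared = check_panel(toy_genes, toy_map)
print("Missing accession:", missing)
print("Shared accessions:", shared)
```

In the notebook itself the call would be `check_panel(HRR_ALL, GENE2UNI)`, and both return values should be empty.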


## 3. Download TCGA-PRAD Mutations
Uses the **cBioPortal REST API** (not the S3 datahub, which is blocked in Codespaces).

If you already ran the download script, this cell detects the cached file and skips the download.


In [3]:
STUDY = "prad_tcga_pan_can_atlas_2018"
CBIO = "https://www.cbioportal.org/api"
cache_file = DATA_DIR / "raw" / "tcga_prad_mutations_raw.csv"

if cache_file.exists() and os.path.getsize(cache_file) > 1000:
    print(f"📂 Found cached mutation data: {cache_file}")
    df_mut_raw = pd.read_csv(cache_file, low_memory=False)
    print(f"   {len(df_mut_raw):,} mutations, {df_mut_raw['sampleId'].nunique()} samples")
else:
    print("📥 Downloading mutations via cBioPortal API...")
    profile_id = f"{STUDY}_mutations"
    
    r = requests.post(
        f"{CBIO}/molecular-profiles/{profile_id}/mutations/fetch",
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        json={"sampleListId": f"{STUDY}_all"},
        params={"projection": "DETAILED"},
        timeout=300
    )
    r.raise_for_status()
    data = r.json()
    
    df_mut_raw = pd.json_normalize(data)
    df_mut_raw.to_csv(cache_file, index=False)
    print(f"✅ Downloaded {len(df_mut_raw):,} mutations, {df_mut_raw['sampleId'].nunique()} samples")
    print(f"   💾 Cached to {cache_file}")


📂 Found cached mutation data: data/raw/tcga_prad_mutations_raw.csv
   21,448 mutations, 491 samples
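
The dotted column name `gene.hugoGeneSymbol` used in the next section comes from `pd.json_normalize`, which flattens nested JSON objects into dot-separated columns. The sketch below shows this on one made-up record shaped like a `DETAILED` mutation; the field values are illustrative, not taken from the API.

```python
import pandas as pd

# Hypothetical record: only the fields this notebook uses later are included
records = [{
    "sampleId": "SAMPLE-1",
    "proteinChange": "G2695S",
    "mutationType": "Missense_Mutation",
    "gene": {"hugoGeneSymbol": "ATM", "type": "protein-coding"},
}]

flat = pd.json_normalize(records)
# Nested keys under "gene" become "gene.<key>" columns
print(sorted(flat.columns))
```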


## 4. Filter: Missense Mutations in HRR Genes

In [4]:
# Detect column names (API vs MAF format)
if "gene.hugoGeneSymbol" in df_mut_raw.columns:
    G = "gene.hugoGeneSymbol"
    CLS = "mutationType"
    SAM = "sampleId"
    HGV = "proteinChange"
elif "Hugo_Symbol" in df_mut_raw.columns:
    G = "Hugo_Symbol"
    CLS = "Variant_Classification"
    SAM = "Tumor_Sample_Barcode"
    HGV = "HGVSp_Short"
else:
    raise ValueError(f"Unknown columns: {list(df_mut_raw.columns[:10])}")

print(f"Column format: gene={G}, class={CLS}")
print(f"Total mutations: {len(df_mut_raw):,}\n")

# Filter HRR genes
df_hrr = df_mut_raw[df_mut_raw[G].isin(HRR_ALL)].copy()
print(f"In HRR genes: {len(df_hrr):,}")

# Filter missense
df_miss = df_hrr[df_hrr[CLS].str.contains("issense", case=False, na=False)].copy()
print(f"Missense only: {len(df_miss):,}")
print(f"Unique patients: {df_miss[SAM].nunique()}\n")

# Per-gene summary
print(f"{'Gene':>12s} {'Cohort':>6s} {'N':>5s}")
print("-" * 28)
for gene, n in df_miss[G].value_counts().items():
    c = "A" if gene in COHORT_A else ("B" if gene in COHORT_B else "Ext")
    print(f"{gene:>12s} {c:>6s} {n:>5d}")


Column format: gene=gene.hugoGeneSymbol, class=mutationType
Total mutations: 21,448

In HRR genes: 78
Missense only: 52
Unique patients: 40

        Gene Cohort     N
----------------------------
         ATM      A    15
       CDK12      B     5
       BARD1      B     4
      RAD51B      B     3
       BRCA2      A     3
        ATRX    Ext     3
       PALB2      B     3
       BRIP1      B     2
      RAD54L      B     2
         ATR    Ext     2
         NBN    Ext     2
       FANCG    Ext     1
      RAD51D      B     1
      FANCD2    Ext     1
       FANCL      B     1
       FANCC    Ext     1
       RAD50    Ext     1
       BRCA1      A     1
       FANCF    Ext     1


## 5. Parse Protein Changes

In [5]:
def parse_hgvsp(s):
    if pd.isna(s): return None, None, None
    s = str(s).strip()
    aa3 = {'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Gln':'Q','Glu':'E',
           'Gly':'G','His':'H','Ile':'I','Leu':'L','Lys':'K','Met':'M','Phe':'F',
           'Pro':'P','Ser':'S','Thr':'T','Trp':'W','Tyr':'Y','Val':'V','Ter':'*'}
    # 3-letter: p.Arg175His
    m = re.match(r'p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})', s)
    if m:
        r_, a_ = aa3.get(m.group(1)), aa3.get(m.group(3))
        if r_ and a_ and r_ != a_: return r_, int(m.group(2)), a_
    # 1-letter: p.R175H or R175H
    m = re.match(r'(?:p\.)?([A-Z*])(\d+)([A-Z*])', s)
    if m and m.group(1) != m.group(3):
        return m.group(1), int(m.group(2)), m.group(3)
    return None, None, None

parsed = df_miss[HGV].apply(parse_hgvsp)
df_miss = df_miss.copy()
df_miss["ref_aa"] = [p[0] for p in parsed]
df_miss["protein_pos"] = [p[1] for p in parsed]
df_miss["alt_aa"] = [p[2] for p in parsed]

# Keep only parsed
df_miss = df_miss.dropna(subset=["ref_aa","protein_pos","alt_aa"]).copy()
df_miss["protein_pos"] = df_miss["protein_pos"].astype(int)

# UniProt mapping + key
df_miss["uniprot_id"] = df_miss[G].map(GENE2UNI)
df_miss["am_key"] = (
    df_miss["uniprot_id"] + "_" +
    df_miss["protein_pos"].astype(str) + "_" +
    df_miss["ref_aa"] + "_" +
    df_miss["alt_aa"]
)

print(f"✅ Parsed: {len(df_miss)} variants")
print(f"   UniProt mapped: {df_miss['uniprot_id'].notna().sum()}")
print(f"\nExamples:")
for _, r in df_miss.head(5).iterrows():
    print(f"   {r[G]:>8s}  {str(r[HGV]):>15s}  →  {r['am_key']}")


✅ Parsed: 52 variants
   UniProt mapped: 52

Examples:
        ATM           G2695S  →  Q13315_2695_G_S
        ATM           G1672A  →  Q13315_1672_G_A
      BRCA2           N1435T  →  P51587_1435_N_T
      FANCG            R353S  →  O15287_353_R_S
        ATM           E2164K  →  Q13315_2164_E_K
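
The one-letter branch of `parse_hgvsp` uses `re.match` without an end anchor and allows `*`, so it would also accept a stop-gain such as `R175*` or the leading part of a frameshift string. That is harmless here because the rows were already filtered to missense, but a stricter variant is easy to sketch. `parse_missense_strict` below is a hypothetical alternative, not the function used above, and it handles only the one-letter form.

```python
import re

# Whole-string pattern: optional "p.", one-letter ref, position, one-letter alt
_ONE_LETTER = re.compile(r'(?:p\.)?([A-Z])(\d+)([A-Z])')

def parse_missense_strict(s):
    """Return (ref, pos, alt) only if the entire string is a one-letter missense change."""
    m = _ONE_LETTER.fullmatch(str(s).strip())
    if m and m.group(1) != m.group(3):
        return m.group(1), int(m.group(2)), m.group(3)
    return None

# Stop-gain, frameshift, and synonymous strings are rejected
for text in ["G2695S", "p.R175H", "R175*", "K100Nfs*5", "R175R"]:
    print(f"{text:>10s} -> {parse_missense_strict(text)}")
```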


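## 6. Download AlphaMissense Predictions (Lightweight ⚡)

The full substitutions file is about 450MB, and per-protein TSVs are not hosted individually, so the cell below:
1. **Streams** `AlphaMissense_aa_substitutions.tsv.gz` from Google Storage into `data/raw/`
2. **Filters on the fly**, keeping only rows for the HRR proteins in memory
3. **Manual fallback:** if the download fails, upload the same file from Zenodo and re-run the cell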
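
The filtering step can also be written with pandas in chunked mode, which keeps memory bounded without a hand-written line loop. This is a minimal sketch rather than the cell's actual code: `filter_am_tsv` is a hypothetical helper, it assumes the first four columns are accession, variant, score, and class (as the cell below reads them), and the toy rows and the accession `P00000` are made up.

```python
import gzip
import io
import pandas as pd

AM_COLS = ["uniprot_id", "protein_variant", "am_pathogenicity", "am_class"]

def filter_am_tsv(source, keep_ids, chunksize=100_000):
    """Read a gzipped TSV in chunks and keep only rows whose accession is in keep_ids."""
    kept = []
    with gzip.open(source, "rt") as fh:
        reader = pd.read_csv(
            fh, sep="\t", comment="#",        # skip leading comment lines
            header=0, names=AM_COLS,          # replace the file header with known names
            usecols=[0, 1, 2, 3], chunksize=chunksize,
        )
        for chunk in reader:
            kept.append(chunk[chunk["uniprot_id"].isin(keep_ids)])
    if not kept:
        return pd.DataFrame(columns=AM_COLS)
    return pd.concat(kept, ignore_index=True)

# Toy file held in memory (values are illustrative only)
raw = (
    "# comment line\n"
    "uniprot_id\tprotein_variant\tam_pathogenicity\tam_class\n"
    "Q13315\tM1A\t0.10\tbenign\n"
    "P00000\tM1A\t0.90\tpathogenic\n"
    "Q13315\tM1C\t0.60\tpathogenic\n"
)
demo = filter_am_tsv(io.BytesIO(gzip.compress(raw.encode())), {"Q13315"}, chunksize=2)
print(demo)
```

`gzip.open` accepts either a path or a binary file object, so the same function works on the downloaded file in `data/raw/`.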


In [6]:
am_cache = DATA_DIR / "processed" / "alphamissense_hrr_genes.csv"

if am_cache.exists() and os.path.getsize(am_cache) > 1000:
    print(f"📂 Found cached AlphaMissense data: {am_cache}")
    df_am = pd.read_csv(am_cache)
    print(f"   {len(df_am):,} predictions loaded")
else:
    print("📥 Preparing AlphaMissense download...")
    print(f"   {len(GENE2UNI)} proteins to extract\n")
    
    AM_BASE = "https://storage.googleapis.com/dm_alphamissense"
    all_am = []
    
    # Per-protein TSVs are not hosted individually, so there is nothing to
    # fetch gene by gene; the bulk aa_substitutions file is streamed and
    # filtered below instead.
    
    if len(all_am) == 0:
        print("\n⚠️  Per-protein download not available individually.")
        print("   Downloading full AlphaMissense file (streaming + filtering)...")
        print("   This downloads ~450MB but only keeps HRR genes in memory.\n")
        
        target_uniprots = set(GENE2UNI.values())
        gz_path = DATA_DIR / "raw" / "AlphaMissense_aa_substitutions.tsv.gz"
        
        def parse_am_file(path):
            """Stream the gzipped TSV and keep only rows for the HRR proteins."""
            rows, n_lines, header_seen = [], 0, False
            with gzip.open(path, 'rt') as f:
                for line in f:
                    if line.startswith('#'): continue
                    if not header_seen:
                        header_seen = True  # first non-comment line is the header
                        continue
                    n_lines += 1
                    parts = line.strip().split('\t')
                    if len(parts) >= 4 and parts[0] in target_uniprots:
                        try:
                            rows.append({
                                "uniprot_id": parts[0],
                                "gene": UNI2GENE.get(parts[0], parts[0]),
                                "protein_variant": parts[1],
                                "am_pathogenicity": float(parts[2]),
                                "am_class": parts[3].strip(),
                            })
                        except ValueError:
                            continue
                    if n_lines % 5_000_000 == 0:
                        print(f"   {n_lines/1e6:.0f}M lines | {len(rows)} HRR variants", end="\r")
            return rows, n_lines
        
        try:
            if gz_path.exists():
                # Covers both a finished earlier download and a manual upload
                print(f"   📂 Using existing file: {gz_path}")
            else:
                url = f"{AM_BASE}/AlphaMissense_aa_substitutions.tsv.gz"
                resp = requests.get(url, stream=True, timeout=30)
                resp.raise_for_status()
                total = int(resp.headers.get('content-length', 0))
                downloaded = 0
                # Write to a temporary name so an interrupted download is never
                # mistaken for a complete file on the next run
                tmp_path = gz_path.with_name(gz_path.name + ".part")
                with open(tmp_path, 'wb') as f:
                    for chunk in resp.iter_content(chunk_size=1024*1024):
                        f.write(chunk)
                        downloaded += len(chunk)
                        if total > 0:
                            print(f"   {downloaded/1e6:.0f}/{total/1e6:.0f} MB ({100*downloaded/total:.0f}%)", end="\r")
                tmp_path.replace(gz_path)
                print(f"\n   ✅ Downloaded: {gz_path}")
            
            print("   Parsing (filtering for HRR genes only)...\n")
            all_am, n_lines = parse_am_file(gz_path)
            print(f"\n   ✅ Extracted {len(all_am):,} from {n_lines:,} lines")
            
        except Exception as e:
            print(f"\n❌ Download or parsing failed: {e}")
            print("\n📝 MANUAL OPTION:")
            print("   1. Download from: https://zenodo.org/records/8208688")
            print("   2. Get: AlphaMissense_aa_substitutions.tsv.gz")
            print("   3. Upload to: data/raw/ in your Codespace")
            print("   4. Re-run this cell")
    
    # Process into DataFrame
    if len(all_am) > 0:
        df_am = pd.DataFrame(all_am)
        df_am["ref_aa_am"] = df_am["protein_variant"].str[0]
        df_am["alt_aa_am"] = df_am["protein_variant"].str[-1]
        df_am["pos_str"] = df_am["protein_variant"].str[1:-1]
        df_am = df_am[df_am["pos_str"].str.match(r'^\d+$', na=False)].copy()
        df_am["protein_pos_am"] = df_am["pos_str"].astype(int)
        df_am["am_key"] = (
            df_am["uniprot_id"] + "_" +
            df_am["protein_pos_am"].astype(str) + "_" +
            df_am["ref_aa_am"] + "_" +
            df_am["alt_aa_am"]
        )
        df_am = df_am.drop_duplicates("am_key")
        df_am.to_csv(am_cache, index=False)
        print(f"\n💾 Cached: {am_cache}")
    else:
        df_am = pd.DataFrame(columns=["am_key","am_pathogenicity","am_class"])
        print("\n⚠️  No AlphaMissense data available yet")

# Summary
if len(df_am) > 0:
    print(f"\n{'Gene':>12s} {'Total':>6s} {'Path':>6s} {'Benign':>6s} {'Ambig':>6s} {'%Path':>6s}")
    print("=" * 50)
    for gene in sorted(GENE2UNI.keys()):
        s = df_am[df_am["gene"]==gene]
        if len(s)==0: continue
        np_=((s["am_class"]=="pathogenic").sum())
        nb_=((s["am_class"]=="benign").sum())
        na_=((s["am_class"]=="ambiguous").sum())
        print(f"{gene:>12s} {len(s):>6d} {np_:>6d} {nb_:>6d} {na_:>6d} {100*np_/len(s):>5.1f}%")
    print(f"\n   Total: {len(df_am):,} variant predictions")


📂 Found cached AlphaMissense data: data/processed/alphamissense_hrr_genes.csv
   554,363 predictions loaded

        Gene  Total   Path Benign  Ambig  %Path
         ATM  58083  22037  26627   9419  37.9%
         ATR  50236  23767  19418   7051  47.3%
        ATRX  47367  24503  18114   4750  51.7%
       BARD1  14763   4385   8531   1847  29.7%
       BRCA1  35397   5616  24611   5170  15.9%
       BRCA2  64961  10158  45731   9072  15.6%
       BRIP1  23731   7485  13841   2405  31.5%
       CDK12  28310  11594  12888   3828  41.0%
       CHEK1   9044   5252   2764   1028  58.1%
       CHEK2  10317   5245   3895   1177  50.8%
       FANCA  27645   8142  14736   4767  29.5%
       FANCC  10602   2698   5596   2308  25.4%
      FANCD2  27569   8977  14428   4164  32.6%
       FANCE  10184   2348   6096   1740  23.1%
       FANCF   7106   2089   3732   1285  29.4%
       FANCG  11818   3029   6688   2101  25.6%
       FANCL   7125   2826   3001   1298  39.7%
       MRE11  13452   60

## 7. Download ClinVar Classifications

In [7]:
clinvar_cache = DATA_DIR / "processed" / "clinvar_hrr.csv"

if clinvar_cache.exists() and os.path.getsize(clinvar_cache) > 500:
    print(f"📂 Found cached ClinVar data")
    df_clinvar = pd.read_csv(clinvar_cache)
else:
    print("📥 Downloading ClinVar...")
    CLINVAR_URL = "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz"
    
    try:
        resp = requests.get(CLINVAR_URL, timeout=300, stream=True)
        resp.raise_for_status()
        
        cv_file = DATA_DIR / "raw" / "variant_summary.txt.gz"
        with open(cv_file, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=1024*1024):
                f.write(chunk)
        print(f"   Downloaded: {os.path.getsize(cv_file)/1e6:.1f} MB")
        
        records = []
        with gzip.open(cv_file, 'rt', errors='replace') as f:
            header = f.readline().strip().split('\t')
            idx = {c: i for i, c in enumerate(header)}
            for line in f:
                parts = line.strip().split('\t')
                gene = parts[idx.get('GeneSymbol',0)] if 'GeneSymbol' in idx else ""
                if gene not in HRR_ALL: continue
                vtype = parts[idx.get('Type',0)] if 'Type' in idx else ""
                name = parts[idx.get('Name',0)] if 'Name' in idx else ""
                if "single nucleotide" in vtype.lower() or "missense" in name.lower():
                    records.append({
                        'cv_gene': gene,
                        'cv_name': name,
                        'cv_significance': parts[idx.get('ClinicalSignificance',0)] if 'ClinicalSignificance' in idx else "",
                    })
        
        df_clinvar = pd.DataFrame(records)
        
        def simplify(sig):
            s = str(sig).lower()
            if "pathogenic" in s and "conflicting" not in s:
                return "LP/P" if "likely" in s else "Pathogenic"
            elif "benign" in s and "conflicting" not in s:
                return "LB/B" if "likely" in s else "Benign"
            elif "uncertain" in s: return "VUS"
            elif "conflicting" in s: return "Conflicting"
            return "Other"
        
        df_clinvar["cv_class"] = df_clinvar["cv_significance"].apply(simplify)
        df_clinvar.to_csv(clinvar_cache, index=False)
        print(f"\n✅ ClinVar HRR variants: {len(df_clinvar):,}")
        
    except Exception as e:
        print(f"⚠️  ClinVar failed: {e}")
        df_clinvar = pd.DataFrame()

if len(df_clinvar) > 0:
    print(f"\nDistribution:")
    for cls, n in df_clinvar["cv_class"].value_counts().items():
        print(f"   {cls:>15s}: {n:5d} ({100*n/len(df_clinvar):.1f}%)")


📂 Found cached ClinVar data

Distribution:
               VUS: 80630 (41.7%)
              LB/B: 60264 (31.2%)
       Conflicting: 27046 (14.0%)
        Pathogenic:  7449 (3.9%)
            Benign:  6455 (3.3%)
              LP/P:  6008 (3.1%)
             Other:  5296 (2.7%)
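
The ClinVar table keeps the variant as a free-text `cv_name`, so it cannot be joined to the mutation table until a key in the `am_key` format is derived from it. One possible approach, sketched below, pulls the three-letter protein change out of the name. `clinvar_name_to_key` is a hypothetical helper, the example name is an illustrative string in ClinVar's usual style, and the sketch assumes ClinVar's protein numbering agrees with the UniProt sequence, which should be verified per gene.

```python
import re

AA3 = {'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Gln':'Q','Glu':'E',
       'Gly':'G','His':'H','Ile':'I','Leu':'L','Lys':'K','Met':'M','Phe':'F',
       'Pro':'P','Ser':'S','Thr':'T','Trp':'W','Tyr':'Y','Val':'V'}

# Matches the trailing "(p.Arg337Cys)" part of a ClinVar name
_PCHANGE = re.compile(r'\(p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\)')

def clinvar_name_to_key(name, uniprot_id):
    """Build a '<uniprot>_<pos>_<ref>_<alt>' key, or None if the name is not a missense change."""
    m = _PCHANGE.search(str(name))
    if not m:
        return None
    ref, alt = AA3.get(m.group(1)), AA3.get(m.group(3))
    if ref is None or alt is None or ref == alt:
        return None  # stop codons ("Ter") and synonymous changes are skipped
    return f"{uniprot_id}_{int(m.group(2))}_{ref}_{alt}"

example = "NM_000051.4(ATM):c.1009C>T (p.Arg337Cys)"
print(clinvar_name_to_key(example, "Q13315"))
```

Applied row-wise with `GENE2UNI[cv_gene]` as the accession, this would give a column that can be merged on `am_key`.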


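## 8. Merge: Mutations × AlphaMissense

ClinVar labels from Section 7 are not joined in this cell; the AlphaMissense vs ClinVar concordance is left to Notebook 2.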

In [8]:
print("🔗 Merging datasets...\n")

if len(df_am) > 0:
    df_ann = df_miss.merge(
        df_am[["am_key","am_pathogenicity","am_class"]].drop_duplicates("am_key"),
        on="am_key", how="left"
    )
    matched = df_ann["am_pathogenicity"].notna().sum()
    print(f"✅ AlphaMissense match: {matched}/{len(df_ann)} ({100*matched/len(df_ann):.1f}%)")
else:
    df_ann = df_miss.copy()
    df_ann["am_pathogenicity"] = np.nan
    df_ann["am_class"] = "not_annotated"
    print("⚠️  AlphaMissense not loaded; placeholder columns added")

# Summary
print(f"\nAlphaMissense classification:")
for cls, n in df_ann["am_class"].value_counts().items():
    bar = "█" * int(50 * n / len(df_ann))
    print(f"   {cls:>15s}: {n:3d} ({100*n/len(df_ann):5.1f}%) {bar}")


🔗 Merging datasets...

✅ AlphaMissense match: 51/52 (98.1%)

AlphaMissense classification:
            benign:  31 ( 59.6%) █████████████████████████████
        pathogenic:  19 ( 36.5%) ██████████████████
         ambiguous:   1 (  1.9%) 
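
One variant did not match an AlphaMissense prediction. A left merge with `indicator=True` is a simple way to list such rows, and `validate="many_to_one"` guards against duplicate keys on the lookup side. The sketch below uses toy frames: the first key and its score come from the preview table in this notebook, while the second key is made up so that it fails to match.

```python
import pandas as pd

variants = pd.DataFrame({
    "gene": ["ATM", "ATM"],
    "am_key": ["Q13315_337_R_C", "Q13315_9999_A_V"],  # second key is made up
})
scores = pd.DataFrame({
    "am_key": ["Q13315_337_R_C"],
    "am_pathogenicity": [0.3232],
    "am_class": ["benign"],
})

merged = variants.merge(
    scores, on="am_key", how="left",
    validate="many_to_one",  # raises if the lookup has duplicate keys
    indicator=True,          # adds a _merge column: "both" or "left_only"
)
unmatched = merged.loc[merged["_merge"] == "left_only", ["gene", "am_key"]]
print(unmatched.to_string(index=False))
```

Typical causes of a miss are a reference residue that differs from the UniProt sequence or a position reported on a different isoform.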


## 9. Download Clinical Data

In [9]:
clin_cache = DATA_DIR / "raw" / "clinical_patient.csv"
sample_cache = DATA_DIR / "raw" / "clinical_sample.csv"

# Both files must be present, otherwise re-download the pair
if clin_cache.exists() and sample_cache.exists():
    print("📂 Clinical data already downloaded")
    df_clin_pat = pd.read_csv(clin_cache)
    df_clin_sam = pd.read_csv(sample_cache)
else:
    print("📥 Downloading clinical data via API...")
    for ctype, fname in [
        ("PATIENT", "clinical_patient.csv"),
        ("SAMPLE", "clinical_sample.csv"),
    ]:
        r = requests.get(
            f"{CBIO}/studies/{STUDY}/clinical-data?clinicalDataType={ctype}",
            headers={"Accept": "application/json"}, timeout=60
        )
        r.raise_for_status()  # fail loudly instead of caching an error payload
        df_tmp = pd.json_normalize(r.json())
        df_tmp.to_csv(DATA_DIR / "raw" / fname, index=False)
        print(f"   ✅ {fname}: {len(df_tmp)} rows")
        if ctype == "PATIENT": df_clin_pat = df_tmp
        else: df_clin_sam = df_tmp

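# The API returns long format: one row per (patient, clinicalAttributeId) pair.
# len() therefore counts attribute rows, and survival fields appear as values of
# clinicalAttributeId rather than as column names.
print(f"\nClinical rows (long format): {len(df_clin_pat)}")
surv = [c for c in df_clin_pat.columns if any(x in c.upper() for x in ["OS","DFS","PFS","SURV","STATUS"])]
if surv:
    print(f"Survival columns: {surv}")
else:
    print(f"Columns: {list(df_clin_pat.columns)}")


📂 Clinical data already downloaded

Clinical rows (long format): 15949
Columns: ['uniquePatientKey', 'patientId', 'studyId', 'clinicalAttributeId', 'value']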
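
The column list shows that the clinical table is in long format, with one row per patient and attribute, which is why no survival columns are detected by name. Pivoting on `clinicalAttributeId` gives one row per patient. The sketch below uses toy rows; the attribute names `OS_STATUS` and `OS_MONTHS` and their values are assumptions for illustration, so check the actual values of `clinicalAttributeId` before relying on them.

```python
import pandas as pd

# Toy long-format table with the same key columns as the API output
long_df = pd.DataFrame({
    "patientId": ["P1", "P1", "P2", "P2"],
    "clinicalAttributeId": ["OS_STATUS", "OS_MONTHS", "OS_STATUS", "OS_MONTHS"],
    "value": ["0:LIVING", "30.5", "1:DECEASED", "12.0"],
})

# One row per patient, one column per attribute
wide = long_df.pivot_table(
    index="patientId", columns="clinicalAttributeId",
    values="value", aggfunc="first",
).reset_index()
wide.columns.name = None

# Values arrive as strings, so numeric fields need an explicit conversion
wide["OS_MONTHS"] = pd.to_numeric(wide["OS_MONTHS"], errors="coerce")
print(wide)
```

After pivoting, `len(wide)` is the patient count and the survival-column search on column names works as intended.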


## 10. Final Annotated Table

In [10]:
# Build final table with clean column names
df_final = df_ann[[
    c for c in [SAM, G, HGV, CLS, "chr", "startPosition", "referenceAllele", "variantAllele",
                "ref_aa", "protein_pos", "alt_aa", "uniprot_id", "am_key",
                "am_pathogenicity", "am_class",
                "Chromosome", "Start_Position", "Reference_Allele", "Tumor_Seq_Allele2"]
    if c in df_ann.columns
]].copy()

# Standardize names
rename = {SAM: "sample_id", G: "gene", HGV: "protein_change", CLS: "variant_classification"}
df_final = df_final.rename(columns={k:v for k,v in rename.items() if k in df_final.columns})

# Add cohort
df_final["hrr_cohort"] = df_final["gene"].apply(
    lambda g: "A" if g in COHORT_A else ("B" if g in COHORT_B else "Ext"))

df_final = df_final.sort_values(["gene","protein_pos","sample_id"]).reset_index(drop=True)
df_final.to_csv(RESULTS_DIR / "annotated_hrr_variants.csv", index=False)

print(f"✅ SAVED: results/annotated_hrr_variants.csv")
print(f"   Rows: {len(df_final):,}")
print(f"   Unique variants: {df_final['am_key'].nunique():,}")
print(f"   Unique patients: {df_final['sample_id'].nunique():,}")
print(f"   Genes: {df_final['gene'].nunique()}")

print(f"\nBy cohort:")
for c in ["A","B","Ext"]:
    s = df_final[df_final["hrr_cohort"]==c]
    print(f"   {c}: {len(s)} variants, {s['gene'].nunique()} genes")

print(f"\nPreview:")
show = ["sample_id","gene","protein_change","am_pathogenicity","am_class","hrr_cohort"]
show = [c for c in show if c in df_final.columns]
print(df_final[show].head(10).to_string(index=False))


✅ SAVED: results/annotated_hrr_variants.csv
   Rows: 52
   Unique variants: 52
   Unique patients: 40
   Genes: 19

By cohort:
   A: 19 variants, 3 genes
   B: 21 variants, 8 genes
   Ext: 12 variants, 8 genes

Preview:
      sample_id gene protein_change  am_pathogenicity   am_class hrr_cohort
TCGA-KK-A8IF-01  ATM          R337C            0.3232     benign          A
TCGA-J4-A67Q-01  ATM         L1078V            0.2514     benign          A
TCGA-EJ-7784-01  ATM         H1568L            0.0822     benign          A
TCGA-CH-5762-01  ATM         G1672A            0.1741     benign          A
TCGA-M7-A725-01  ATM         I1846N            0.6880 pathogenic          A
TCGA-EJ-5518-01  ATM         L1936S            0.8693 pathogenic          A
TCGA-EJ-5511-01  ATM         E2164K            0.8689 pathogenic          A
TCGA-VN-A88K-01  ATM         R2453P            0.7065 pathogenic          A
TCGA-HC-8260-01  ATM         G2694R            0.9621 pathogenic          A
TCGA-CH-5737-01  A

## 11. Patient-Level Summary

In [11]:
if df_final["am_pathogenicity"].notna().any():
    pat = df_final.groupby("sample_id").agg(
        n_hrr_missense=("gene","count"),
        n_pathogenic=("am_class", lambda x: (x=="pathogenic").sum()),
        n_benign=("am_class", lambda x: (x=="benign").sum()),
        n_ambiguous=("am_class", lambda x: (x=="ambiguous").sum()),
        max_am_score=("am_pathogenicity","max"),
        hrr_genes=("gene", lambda x: ", ".join(sorted(x.unique()))),
        has_cohort_a=("hrr_cohort", lambda x: (x=="A").any()),
    ).reset_index()
    pat["has_am_pathogenic"] = pat["n_pathogenic"] > 0
    
    pat.to_csv(RESULTS_DIR / "patient_hrr_summary.csv", index=False)
    
    print(f"Patients with HRR missense: {len(pat)}")
    print(f"  ≥1 AM-pathogenic: {pat['has_am_pathogenic'].sum()}")
    # n_pathogenic == 0 also covers ambiguous and unmatched variants,
    # so this is "no pathogenic call" rather than "all benign"
    print(f"  No AM-pathogenic: {(pat['n_pathogenic']==0).sum()}")
    print(f"\n💾 Saved: results/patient_hrr_summary.csv")
else:
    print("⏭️  Pending AlphaMissense annotation")


Patients with HRR missense: 40
  ≥1 AM-pathogenic: 19
  No AM-pathogenic: 21

💾 Saved: results/patient_hrr_summary.csv


## ✅ Notebook 1 Complete!

### Files:
| File | Description |
|------|-------------|
| `results/annotated_hrr_variants.csv` | HRR missense variants + AlphaMissense scores |
| `results/patient_hrr_summary.csv` | Patient-level summary |
| `data/raw/clinical_patient.csv` | Survival & clinical data |
| `data/processed/alphamissense_hrr_genes.csv` | Reusable AM lookup for HRR genes |

### Git save:
```bash
git add -A && git commit -m "Notebook 1: annotated HRR variants" && git push
```

### Next → Notebook 2: Statistical Analysis
- Cox PH, Kaplan-Meier
- AlphaMissense vs ClinVar concordance  
- Sensitivity analyses
- Publication figures
