# R1: Genetic Validation - GWAS of Signature Exposure

## Reviewer Question

**Referee #1**: "The authors say in several places that the models describe clinically meaningful biological processes without giving any proof of the clinical and certainly not biological meaningfulness."

## Why This Matters

If signature exposure is associated with inherited genetic variants, this is evidence that the signatures have a genetic basis and capture biologically meaningful pathways.

## Our Approach

We performed **genome-wide association studies (GWAS)** using signature exposures as quantitative phenotypes:

1. **Calculate Average Signature Exposure (AEX)**: For each individual, we compute the average signature loading over time
2. **GWAS on AEX**: Test genome-wide SNPs for association with signature exposure
3. **Identify Signature-Specific Loci**: Find genetic variants associated with signatures but not with individual diseases
4. **Map to Nearest Genes**: Annotate significant hits with nearest genes
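
Steps 1 and 2 can be sketched as follows. This is a minimal illustration on toy data, not the pipeline used for the analysis. The function names `average_exposure` and `snp_association`, the toy loadings and dosages, and the unadjusted linear regression with a normal-approximation p-value are all assumptions; a real GWAS would use dedicated software and adjust for covariates such as ancestry principal components.

```python
import math
from statistics import fmean


def average_exposure(loadings_over_time):
    """Average signature exposure (AEX): mean loading across time points."""
    return fmean(loadings_over_time)


def snp_association(dosages, phenotype):
    """Regress a quantitative phenotype on genotype dosage (0, 1, 2).

    Returns (beta, se, log10p). The two-sided p-value uses a normal
    approximation to the test statistic (an assumption of this sketch).
    """
    n = len(dosages)
    mean_x, mean_y = fmean(dosages), fmean(phenotype)
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotype))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(dosages, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))
    log10p = -math.log10(p) if p > 0 else float("inf")
    return beta, se, log10p


# Hypothetical data: six individuals, three time points each, one signature
loadings = [
    [0.10, 0.12, 0.14],
    [0.08, 0.10, 0.12],
    [0.18, 0.20, 0.22],
    [0.20, 0.22, 0.24],
    [0.28, 0.30, 0.32],
    [0.30, 0.32, 0.34],
]
dosages = [0, 0, 1, 1, 2, 2]  # hypothetical allele counts at one SNP

aex = [average_exposure(person) for person in loadings]
beta, se, log10p = snp_association(dosages, aex)
print(f"beta={beta:.3f}  se={se:.4f}  LOG10P={log10p:.1f}")
```

Repeating the regression for every SNP and every signature gives one set of summary statistics per signature, which is the form the loci table loaded later in this notebook appears to summarize.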

**Key Innovation**: We test genetic loading of 0 (baseline genetic effects) to identify loci that influence signature trajectories independently.

## Key Findings

- ✅ **Multiple genome-wide significant loci** identified for each signature
- ✅ **Signature-specific loci** found that are not associated with individual diseases
- ✅ **Biologically plausible gene associations** (e.g., lipid genes for Signature 5)
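
One way to make "signature-specific" concrete is sketched below. The rule used here is an assumption, since the exact criterion is not stated in this notebook: a locus counts as signature-specific if it reaches the conventional genome-wide threshold (p < 5e-8) for the signature while no individual disease GWAS reaches that threshold at the same variant. The function name, SNP labels, and values are hypothetical.

```python
import math

# Conventional genome-wide significance threshold on the -log10(p) scale
GENOME_WIDE_LOG10P = -math.log10(5e-8)  # about 7.30


def signature_specific(signature_hits, disease_log10p, threshold=GENOME_WIDE_LOG10P):
    """Return SNPs significant for the signature but for no single disease.

    signature_hits: {snp: LOG10P in the signature GWAS}
    disease_log10p: {snp: {disease: LOG10P in that disease GWAS}}
    """
    specific = []
    for snp, sig_log10p in signature_hits.items():
        if sig_log10p < threshold:
            continue  # not genome-wide significant for the signature
        strongest_disease = max(disease_log10p.get(snp, {}).values(), default=0.0)
        if strongest_disease < threshold:
            specific.append(snp)
    return sorted(specific)


# Hypothetical values for illustration only
signature_hits = {"snp_a": 12.4, "snp_b": 9.1, "snp_c": 5.0}
disease_log10p = {
    "snp_a": {"disease_1": 15.2, "disease_2": 3.1},  # also a disease hit
    "snp_b": {"disease_1": 4.0, "disease_2": 6.5},   # signature only
}

specific = signature_specific(signature_hits, disease_log10p)
print(specific)
```

In this toy example only `snp_b` qualifies: `snp_a` is also significant for a single disease, and `snp_c` does not reach the threshold for the signature.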


In [None]:
import pandas as pd
from pathlib import Path
from IPython.display import display

# Load GWAS loci annotations
loci_file = Path("/Users/sarahurbut/Library/CloudStorage/Dropbox-Personal/all_loci_annotated.tsv")

if loci_file.exists():
    loci_df = pd.read_csv(loci_file, sep="\t")
    print("="*80)
    print("GWAS LOCI SUMMARY")
    print("="*80)
    print(f"\nTotal loci: {len(loci_df):,}")
    
    # Count loci per signature
    signature_cols = [col for col in loci_df.columns if col.startswith("locus_SIG")]
    print(f"\nSignatures with loci: {len(signature_cols)}")
    
    # Summary by signature
    sig_summary = []
    for sig_col in sorted(signature_cols):
        sig_num = sig_col.replace("locus_SIG", "")
        n_loci = loci_df[sig_col].sum()
        sig_summary.append({
            "Signature": f"SIG{sig_num}",
            "N_Loci": int(n_loci),
            "Percentage": f"{n_loci/len(loci_df)*100:.1f}%"
        })
    
    sig_summary_df = pd.DataFrame(sig_summary).sort_values("N_Loci", ascending=False)
    print("\nLoci per signature:")
    display(sig_summary_df)
    display(loci_df.head(10))
else:
    print(f"⚠️  GWAS loci file not found: {loci_file}")

## 2. Top Loci by Signature

For each signature, we identify the top genetic loci (smallest p-value, i.e. largest `LOG10P`) and their nearest genes.

In [None]:
if "loci_df" in locals():
    signature_names = {
        "locus_SIG5": "SIG5 - Cardiovascular/Lipid",
        "locus_SIG17": "SIG17 - GI/Colorectal",
        "locus_SIG7": "SIG7 - Hypertension/Vascular",
        "locus_SIG0": "SIG0 - Heart Failure/Arrhythmia",
        "locus_SIG16": "SIG16 - Neurodegeneration",
    }
    
    top_n = 10
    print("="*80)
    print(f"TOP {top_n} GENETIC LOCI PER SIGNATURE")
    print("="*80)
    
    all_top_loci = []
    for sig_col in sorted(signature_cols):
        # Only report signatures that have a descriptive label above
        if sig_col not in signature_names:
            continue
        sig_name = signature_names[sig_col]
        sig_num = sig_col.replace("locus_SIG", "")
        # Indicator column: 1 marks loci assigned to this signature
        sig_loci = loci_df[loci_df[sig_col] == 1].copy()
        if len(sig_loci) == 0:
            continue
        # Largest LOG10P (= -log10 p) corresponds to the smallest p-value
        sig_loci_sorted = sig_loci.nlargest(top_n, "LOG10P")
        print(f"\n{sig_name} ({len(sig_loci)} total loci)")
        for _, row in sig_loci_sorted.iterrows():
            rsid = str(row["rsid"])
            gene = str(row["nearestgene"])  # str() keeps missing values printable
            log10p = row["LOG10P"]
            p_value = f"{10**(-log10p):.2e}"
            print(f"  {rsid:15} {gene:20} p={p_value} (LOG10P={log10p:.2f})")
            all_top_loci.append({
                "Signature": f"SIG{sig_num}",
                "SNP": rsid,
                "Nearest_Gene": gene,
                "LOG10P": round(log10p, 2),
                "P_value": p_value
            })
    top_loci_df = pd.DataFrame(all_top_loci)
    display(top_loci_df)
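
The cell above converts `LOG10P` back to a p-value with `10**(-x)`. For very strong associations (`LOG10P` above roughly 323) that expression underflows to `0.0` in double precision and prints as `0.00e+00`. The helper below is a small sketch of one way to avoid this by building the scientific-notation string from the exponent directly; the function name `format_p_from_log10p` is hypothetical and not part of the original analysis.

```python
import math


def format_p_from_log10p(log10p, digits=2):
    """Format p = 10**(-log10p) in scientific notation without computing p."""
    exponent = math.floor(-log10p)
    mantissa = round(10 ** (-log10p - exponent), digits)  # nominally in [1, 10)
    if mantissa >= 10:  # rounding can push e.g. 9.999 up to 10.0
        mantissa /= 10
        exponent += 1
    return f"{mantissa:.{digits}f}e{exponent:+03d}"


print(format_p_from_log10p(8.0))
print(format_p_from_log10p(-math.log10(5e-8)))
print(format_p_from_log10p(400.0))
```

The returned string could replace the `P_value` entry in the table above when loci with extreme `LOG10P` values are present.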

## 3. Summary and Response

### Key Findings

1. **Genome-wide significant loci identified**: Multiple genetic loci are associated with signature exposure.
2. **Signature-specific loci**: Genetic variants associated with signatures but not with individual diseases.
3. **Biologically plausible gene associations**: Signature 5 is enriched for lipid metabolism genes.

### Response to Reviewer

We demonstrate biological meaningfulness through **genetic association analysis**. We performed GWAS using average signature exposures (AEX) as quantitative phenotypes and identified genetic variants associated with the disease signatures. Signature 5 (cardiovascular) is enriched for genes with known roles in lipid metabolism (e.g., LDLR, APOB, PCSK9, LPA), providing biological validation of the signatures.