# Toy GWAS in Google Colab (simulated but realistic)

## What is GWAS?

**Genome-Wide Association Studies (GWAS)** test hundreds of thousands to millions of genetic variants (typically SNPs) across the genome, one at a time, for association with a trait or disease. A significant association flags a genomic region of interest; it does not by itself show that the tested variant is causal.

## What you'll learn in this notebook:

1. **How to simulate realistic genetic data** with population structure
2. **Why population structure causes confounding** in GWAS
3. **How Principal Components (PCs) help control for confounding**
4. **How to interpret Manhattan plots and QQ plots**
5. **The difference between unadjusted and adjusted GWAS results**

This notebook simulates genotype data with mild population structure, generates a phenotype, and runs a GWAS twice:

1) **Unadjusted** (shows inflation due to population structure)
2) **Adjusted for PCs** (reduces confounding)

It then makes **Manhattan** and **QQ** plots to visualize the results.

> Designed to run in a few minutes on Colab.

In [None]:
#@title Setup
!pip -q install numpy pandas scipy statsmodels scikit-learn matplotlib

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

np.random.seed(42)

## 1) Simulate genotypes with mild population structure

### Why simulate population structure?

In real genetic data, individuals from different populations (e.g., different ancestries) have different allele frequencies. This creates **population structure** - a correlation between genetic variants and population membership.

**Why this matters for GWAS:** If your phenotype also differs between populations (e.g., due to environmental factors), you might find "significant" associations that are actually just due to population differences, not true genetic effects. This is called **confounding** or **population stratification**.

### What we're doing:

We simulate two subpopulations with slightly different allele frequencies. This creates a simple form of population structure that will cause confounding in our GWAS if we don't control for it.
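
The two-population setup described above can be sketched in vectorized form. This is only an illustration: the sizes, the baseline frequency range, and the deliberately exaggerated shift `delta` are assumed values chosen so the effect is easy to see, and the sketch uses NumPy's `Generator` API rather than a per-individual loop.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 200                      # illustrative sizes (assumed)

base = rng.uniform(0.2, 0.4, size=m)  # baseline allele frequencies (assumed range)
delta = rng.normal(0, 0.05, size=m)   # exaggerated shift so the effect is visible
pop = rng.binomial(1, 0.5, size=n)    # population label per individual

# Row i uses base + delta if pop[i] == 1, else base - delta.
sign = np.where(pop == 1, 1.0, -1.0)[:, None]
p = np.clip(base[None, :] + sign * delta[None, :], 0.01, 0.99)
G = rng.binomial(2, p).astype(np.int8)   # n x m genotypes coded 0/1/2

# The observed allele-frequency difference between populations tracks 2 * delta.
freq_diff = G[pop == 1].mean(axis=0) / 2 - G[pop == 0].mean(axis=0) / 2
corr = np.corrcoef(freq_diff, 2 * delta)[0, 1]
print(f"corr(observed frequency difference, 2*delta) = {corr:.2f}")
```

A high correlation here confirms that population membership predicts genotype, which is the first ingredient of confounding.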

In [None]:
#@title Simulate genotypes
n = 2000          # individuals
m = 12000         # SNPs (keep moderate for Colab speed)

# random minor allele frequencies
maf = np.random.uniform(0.05, 0.5, size=m)

# two populations (0/1)
pop = np.random.binomial(1, 0.5, size=n)

# allele-frequency shift between populations (small)
delta = np.random.normal(0, 0.03, size=m)  # mild structure
maf_pop1 = np.clip(maf + delta, 0.01, 0.99)
maf_pop0 = np.clip(maf - delta, 0.01, 0.99)

# genotype matrix G: n x m, coded 0/1/2
G = np.empty((n, m), dtype=np.int8)
for i in range(n):
    p = maf_pop1 if pop[i] == 1 else maf_pop0
    G[i, :] = np.random.binomial(2, p).astype(np.int8)

# quick QC: remove monomorphic SNPs (should be rare here)
var = G.var(axis=0)
keep = var > 0
G = G[:, keep]
maf = maf[keep]
m = G.shape[1]

print(f"Genotypes: n={n}, m={m}")

## 2) Create a phenotype

### How we simulate the phenotype:

1. **Genetic signal:** We pick a few "causal" SNPs that truly affect the phenotype
2. **Population effect (confounding):** We add an effect that differs between populations - this mimics real-world scenarios where environmental or cultural factors differ by population
3. **Noise:** We add random noise to make it realistic

**Key point:** The population effect creates confounding because:
- Population membership is correlated with genotype (due to different allele frequencies)
- Population membership also affects the phenotype
- This creates a spurious association between genotype and phenotype, even for non-causal SNPs!

This is why we need to control for population structure in GWAS.
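
The three bullet points above can be checked on a single SNP. The sketch below is a toy example with assumed numbers (a large frequency gap of 0.2 vs 0.6 and a population effect of 1.0): the SNP has no effect on the phenotype, yet the unadjusted regression slope is clearly non-zero, and it shrinks towards zero once population is included as a covariate. The helper `ols` is hypothetical and exists only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
pop = rng.binomial(1, 0.5, size=n)

# A non-causal SNP whose allele frequency differs strongly by population.
g = rng.binomial(2, np.where(pop == 1, 0.6, 0.2)).astype(float)

# The phenotype depends on population only, never on the SNP.
y = 1.0 * pop + rng.normal(0, 1.0, size=n)

def ols(columns, y):
    """Least-squares coefficients for y ~ intercept + columns."""
    X = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

slope_unadjusted = ols([g], y)[1]       # y ~ g
slope_adjusted = ols([g, pop], y)[1]    # y ~ g + pop
print(f"unadjusted slope: {slope_unadjusted:.3f}")
print(f"adjusted slope:   {slope_adjusted:.3f}")
```

In a real GWAS the population label is usually unknown, which is why the notebook adjusts for PCs instead.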

In [None]:
#@title Simulate phenotype
n_causal = 20
causal_idx = np.random.choice(m, size=n_causal, replace=False)
beta_causal = np.random.normal(0, 0.25, size=n_causal)

genetic_signal = G[:, causal_idx] @ beta_causal
pop_effect = 0.6 * (pop - pop.mean())          # confounding
noise = np.random.normal(0, 1.0, size=n)

y = (genetic_signal + pop_effect + noise)
y = (y - y.mean()) / y.std()

pheno = pd.DataFrame({"iid": np.arange(n), "pop": pop, "y": y})
pheno.head()

## 3) Compute Principal Components (PCs)

### What are Principal Components?

**Principal Component Analysis (PCA)** is a dimensionality reduction technique. When applied to genotype data, the first few PCs capture the major axes of genetic variation, which often correspond to population structure.

### Why use PCs in GWAS?

- **PCs capture population structure:** The first few PCs typically capture ancestry-related differences
- **They can be used as covariates:** By including PCs as covariates in the regression model, we can control for population structure
- **This reduces false positives:** SNPs that appear significant only due to population differences will no longer be significant after adjusting for PCs

### What to look for:

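The correlation between `pop` and `PC1` should be high in absolute value. The sign of a principal component is arbitrary, so a strongly negative correlation is just as good as a strongly positive one. Either way, it shows that PC1 is capturing the population structure we simulated!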
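
As a rough, self-contained check of this idea, the sketch below computes PC1 with a plain SVD of the standardized genotype matrix (equivalent to PCA up to sign) on a small simulated data set. The sizes and the exaggerated allele-frequency shift are assumptions made for the illustration, not the notebook's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 400, 500                         # illustrative sizes (assumed)
pop = rng.binomial(1, 0.5, size=n)
base = rng.uniform(0.1, 0.5, size=m)
delta = rng.normal(0, 0.1, size=m)      # exaggerated structure (assumed)
sign = np.where(pop == 1, 1.0, -1.0)[:, None]
p = np.clip(base + sign * delta, 0.01, 0.99)
G = rng.binomial(2, p).astype(float)

# Drop any monomorphic SNPs, then standardize each remaining SNP.
G = G[:, G.std(axis=0) > 0]
Z = (G - G.mean(axis=0)) / G.std(axis=0)

# PC scores are the left singular vectors scaled by the singular values.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
pc1 = U[:, 0] * S[0]

r = np.corrcoef(pop, pc1)[0, 1]
print(f"|corr(pop, PC1)| = {abs(r):.2f}")  # the sign of a PC is arbitrary
```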

In [None]:
#@title PCA
# standardize variants for PCA
G_std = (G - G.mean(axis=0)) / G.std(axis=0)
pca = PCA(n_components=5, random_state=42)
pcs = pca.fit_transform(G_std)

for k in range(5):
    pheno[f"PC{k+1}"] = pcs[:, k]

pheno[["pop","PC1","PC2","PC3"]].corr()

## 4) GWAS (fast linear regression)

### How GWAS works:

For each SNP, we test the association between genotype and phenotype using linear regression:

**Phenotype = intercept + (SNP effect × genotype) + covariates + error**

### Two approaches:

1. **Unadjusted GWAS:** No covariates included
   - Will show **inflation** (too many significant results) due to population structure
   - This demonstrates the problem of confounding

2. **Adjusted GWAS:** Includes PC1–PC5 as covariates
   - Controls for population structure
   - Should show fewer false positives

### Implementation details:

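We use a fast approach: residualize both the phenotype and each SNP on the covariates, then regress the phenotype residuals on the SNP residuals. By the Frisch-Waugh-Lovell theorem, this gives the same SNP effect estimate as fitting the full regression with covariates. The standard error also matches as long as the residual degrees of freedom account for the covariates. It is computationally faster because the covariate fit is shared across all SNPs.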

**Key question:** Will the adjusted GWAS show less inflation than the unadjusted one?
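
The equivalence between residualizing and fitting the full model can be verified numerically. The following is a minimal sketch on made-up data (two random covariates standing in for PCs, one SNP-like predictor), comparing the coefficient from the full regression with the one from the residual-on-residual regression.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
C = rng.normal(size=(n, 2))                      # two covariates (stand-ins for PCs)
g = 0.5 * C[:, 0] + rng.normal(size=n)           # predictor correlated with a covariate
y = 0.3 * g + C @ np.array([1.0, -0.5]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), C])             # intercept + covariates

# Full model: y ~ intercept + covariates + g
full_coef, *_ = np.linalg.lstsq(np.column_stack([X, g]), y, rcond=None)
beta_full = full_coef[-1]

def residualize(v, X):
    """Remove the part of v explained by the columns of X."""
    coef, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ coef

# Residualized model: regress residual(y) on residual(g)
y_r, g_r = residualize(y, X), residualize(g, X)
beta_resid = (g_r @ y_r) / (g_r @ g_r)

print(f"full model:   {beta_full:.6f}")
print(f"residualized: {beta_resid:.6f}")
```

The two estimates agree to floating-point precision, which is what lets the GWAS loop residualize once and then use simple sums per SNP.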

In [None]:
#@title Helper: residualize and run GWAS
def residualize(v, X):
    # v: (n,), X: (n,k) with intercept included
    beta, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ beta

def gwas_linear(G, y, cov=None):
    n, m = G.shape
    if cov is None:
        X = np.ones((n,1))
    else:
        X = np.column_stack([np.ones(n), cov])
    y_r = residualize(y, X)

    # residualize each SNP on covariates (loop over chunks for memory)
    betas = np.zeros(m, dtype=float)
    ses = np.zeros(m, dtype=float)
    pvals = np.zeros(m, dtype=float)

    chunk = 2000
    for start in range(0, m, chunk):
        end = min(m, start + chunk)
        Gc = G[:, start:end].astype(float)

        # residualize SNPs
        # Solve (X'X)^{-1}X'G for each SNP using lstsq
        B, *_ = np.linalg.lstsq(X, Gc, rcond=None)  # (k+1) x chunk
        Gc_r = Gc - X @ B

        # association using simple regression y_r ~ Gc_r
        # beta = cov / var
        cov_y = (Gc_r * y_r[:, None]).sum(axis=0)
        var_g = (Gc_r ** 2).sum(axis=0)
        beta = cov_y / var_g

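        # SE from residual variance.
        # Degrees of freedom: n minus the covariates (incl. intercept) minus 1 for the SNP.
        resid = y_r[:, None] - Gc_r * beta[None, :]
        sigma2 = (resid**2).sum(axis=0) / (n - X.shape[1] - 1)
        se = np.sqrt(sigma2 / var_g)

        # Wald test; the normal approximation to the t distribution is fine at this sample size
        z = beta / se
        p = 2 * stats.norm.sf(np.abs(z))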

        betas[start:end] = beta
        ses[start:end] = se
        pvals[start:end] = p

    return betas, ses, pvals

# run both
cov_pcs = pheno[[f"PC{k+1}" for k in range(5)]].to_numpy()
beta0, se0, p0 = gwas_linear(G, y, cov=None)
beta1, se1, p1 = gwas_linear(G, y, cov=cov_pcs)

print("Done.")

## 5) Summaries: inflation (λGC) and top hits

### Lambda GC (λGC) - Genomic Control

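**Lambda GC** measures test statistic inflation. It is the median observed χ² statistic divided by the median expected under the null (about 0.455 for 1 degree of freedom):
- **λGC ≈ 1.0:** No inflation (the bulk of the test statistics follows the null distribution)
- **λGC > 1.0:** Inflation (test statistics are larger than expected, so too many small p-values)
- **λGC < 1.0:** Deflation (test statistics are smaller than expected)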

**What causes inflation?**
- Population structure (confounding)
- Cryptic relatedness
- Other systematic biases

**What to expect:**
- **Unadjusted:** λGC should be > 1.0 (inflation due to population structure)
- **Adjusted:** λGC should be closer to 1.0 (inflation reduced by controlling for PCs)

**Note:** λGC is not perfect - it can miss some types of confounding, but it's a useful diagnostic tool.
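
To build intuition for the scale of λGC, here is a small sketch that works directly on z-scores rather than p-values. It assumes a hard-coded approximation of the χ²(1) median (0.4549364) in place of a SciPy call, and the function name `lambda_gc_from_z` is hypothetical. Scaling every χ² statistic by 1.5 should scale λGC by the same factor.

```python
import numpy as np

CHI2_1DF_MEDIAN = 0.4549364   # approximate median of a chi-square(1) distribution

def lambda_gc_from_z(z):
    """Median of z**2 divided by the null median of chi-square(1)."""
    return np.median(z ** 2) / CHI2_1DF_MEDIAN

rng = np.random.default_rng(4)
z_null = rng.normal(size=100_000)         # well-calibrated test statistics
z_inflated = np.sqrt(1.5) * z_null        # every chi-square statistic scaled by 1.5

lam_null = lambda_gc_from_z(z_null)
lam_inflated = lambda_gc_from_z(z_inflated)
print(f"null: {lam_null:.3f}, inflated: {lam_inflated:.3f}")
```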

### Top hits:

Let's look at the 20 most significant SNPs in the PC-adjusted analysis. Some should be the causal SNPs we simulated (marked as `is_causal=True`). Causal SNPs with small effects may not rank near the top, and any non-causal SNPs in the list are false positives.

In [None]:
#@title Summaries
def lambda_gc(pvals):
    chi2 = stats.chi2.isf(pvals, df=1)
    return np.median(chi2) / stats.chi2.ppf(0.5, df=1)

lam0 = lambda_gc(p0)
lam1 = lambda_gc(p1)

print(f"Lambda GC (unadjusted): {lam0:.3f}")
print(f"Lambda GC (PC-adjusted): {lam1:.3f}")

# top SNPs
top = np.argsort(p1)[:20]
top_df = pd.DataFrame({
    "snp_index": top,
    "beta_adj": beta1[top],
    "se_adj": se1[top],
    "p_adj": p1[top],
    "is_causal": np.isin(top, causal_idx)
}).sort_values("p_adj")

top_df

## 6) Manhattan + QQ plots

### What are these plots?

**Manhattan Plot:**
- Shows -log10(p-value) for each SNP across the genome
- Each point is a SNP, positioned by its genomic location
- Horizontal line shows the genome-wide significance threshold (typically p < 5×10⁻⁸)
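- **"Towers"** of significant SNPs mark associated regions: because of linkage disequilibrium (LD), SNPs near a causal variant are correlated with it and also show signal. In this notebook the SNPs are simulated independently, so the towers are added artificially by the `add_ld_effects` helper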

**QQ Plot (Quantile-Quantile):**
- Compares observed p-values to expected p-values under the null hypothesis
- Points on the diagonal = no inflation
- Points above the diagonal = inflation (too many small p-values)
- The "tail" of the plot (right side) shows the most significant results

### Synthetic coordinates:

We generate fake chromosome/position coordinates so the plots look like a real GWAS. In practice, you'd use the actual genomic positions from your data.
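
The cumulative-offset trick that places chromosomes side by side on the x-axis is easiest to follow on a tiny, hand-checkable example. The arrays below are made up for illustration: each chromosome is shifted by the sum of the maximum positions of all chromosomes before it.

```python
import numpy as np

chroms = np.array([1, 1, 2, 2, 3])      # chromosome of each SNP (toy values)
pos = np.array([10, 50, 5, 30, 7])      # position within its chromosome

uniq = np.unique(chroms)
chrom_max = np.array([pos[chroms == c].max() for c in uniq])    # [50, 30, 7]

# Offset of a chromosome = total length of all chromosomes before it.
offsets = np.concatenate([[0], np.cumsum(chrom_max)[:-1]])      # [0, 50, 80]
offset_by_chr = dict(zip(uniq.tolist(), offsets.tolist()))

x = pos + np.array([offset_by_chr[c] for c in chroms.tolist()])
print(x.tolist())   # [10, 50, 55, 80, 87]
```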

In [None]:
#@title Create synthetic genomic coordinates
# assign SNPs to chromosomes (autosomes 1-22)
chroms = np.repeat(np.arange(1, 23), np.ceil(m / 22).astype(int))[:m]
# positions within chromosome (random but sorted)
pos = np.zeros(m, dtype=int)
for c in range(1, 23):
    idx = np.where(chroms == c)[0]
    pos[idx] = np.sort(np.random.randint(1, 50_000_000, size=len(idx)))

# cumulative position for Manhattan plot (so chromosomes appear side-by-side)
chrom_max = pd.Series(pos).groupby(chroms).max()
offsets = chrom_max.cumsum().shift(fill_value=0).to_dict()
x = pos + np.array([offsets[c] for c in chroms])

coord = pd.DataFrame({"chr": chroms, "pos": pos, "x": x, "p_unadj": p0, "p_adj": p1})
print("Created synthetic genomic coordinates:")
coord.head()

In [None]:
#@title Helper function: Simulate LD effects to create realistic "towers"
def add_ld_effects(coord_df, pvals, causal_idx, window_size=500000):
    """
    Simulate linkage disequilibrium (LD) effects.
    When a causal SNP is significant, nearby SNPs also become more significant
    due to correlation (LD). This creates the "tower" effect in Manhattan plots.
    """
    pvals_ld = pvals.copy()
    coord_reset = coord_df.reset_index(drop=True)  # Ensure 0-based integer index
    
    # For each causal SNP that's significant, boost nearby SNPs
    for idx in causal_idx:
        if pvals[idx] < 0.01:  # if causal SNP is at least nominally significant
            # Find SNPs on same chromosome within window
            chr_num = coord_reset.iloc[idx]["chr"]
            pos_causal = coord_reset.iloc[idx]["pos"]
            
            same_chr = coord_reset["chr"] == chr_num
            within_window = np.abs(coord_reset["pos"] - pos_causal) < window_size
            
            nearby = same_chr & within_window
            nearby_positions = np.where(nearby)[0]  # Get positional indices
            
            # Boost p-values of nearby SNPs (make them more significant)
            # The boost decays with distance
            for nidx in nearby_positions:
                if nidx != idx:
                    dist = abs(coord_reset.iloc[nidx]["pos"] - pos_causal)
                    decay = np.exp(-dist / (window_size / 3))
                    # Make p-value smaller (more significant) by a factor
                    # Stronger effect if causal SNP is highly significant
                    boost_factor = 0.9 * decay if pvals[idx] < 0.001 else 0.7 * decay
                    pvals_ld[nidx] = pvals_ld[nidx] * (1 - boost_factor)
    
    return np.clip(pvals_ld, 1e-300, 1.0)

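# Add LD effects to make plots more realistic.
# This is cosmetic: only the Manhattan plots use these modified p-values;
# lambda GC and the QQ plots use the raw p0 / p1.
# Reset index to ensure proper alignment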
coord_reset = coord.reset_index(drop=True)
p0_ld = add_ld_effects(coord_reset, p0, causal_idx)
p1_ld = add_ld_effects(coord_reset, p1, causal_idx)

# Update coord with LD-adjusted p-values
coord["p_unadj_ld"] = p0_ld
coord["p_adj_ld"] = p1_ld
print("Added LD effects to create realistic 'towers' in Manhattan plots")

### Manhattan Plot: Unadjusted GWAS

This plot shows the results **without** controlling for population structure. Notice:
- More SNPs rise towards or above the threshold lines than in the PC-adjusted plot
- This is **inflation**: the non-causal SNPs among them are false positives due to population structure
- The QQ plot will show this as points above the diagonal

In [None]:
#@title Manhattan plot (Unadjusted - shows inflation)
def plot_manhattan(coord_df, pvals_col, title, figsize=(14, 5)):
    """Create a Manhattan plot with alternating chromosome colors"""
    df = coord_df.sort_values(["chr", "pos"]).copy()
    df["mlog10p"] = -np.log10(df[pvals_col].clip(1e-300, 1.0))
    
    # Two alternating colors so adjacent chromosomes are distinguishable
    colors = ['#2E86AB', '#A23B72']  # Blue and purple
    
    plt.figure(figsize=figsize)
    
    # Plot each chromosome separately for better color control
    for chr_num in sorted(df['chr'].unique()):
        chr_data = df[df['chr'] == chr_num]
        color = colors[chr_num % 2]
        plt.scatter(chr_data["x"], chr_data["mlog10p"], 
                   s=8, alpha=0.6, c=color, label=None)
    
    # Genome-wide significance threshold
    plt.axhline(-np.log10(5e-8), color='red', linestyle='--', 
               linewidth=1.5, label='Genome-wide significance (5×10⁻⁸)')
    
    # Suggestive threshold
    plt.axhline(-np.log10(1e-5), color='orange', linestyle='--', 
               linewidth=1, alpha=0.7, label='Suggestive (1×10⁻⁵)')
    
    # Chromosome boundaries
    chr_boundaries = []
    for c in sorted(df['chr'].unique()):
        chr_data = df[df['chr'] == c]
        chr_boundaries.append(chr_data['x'].min())
    chr_boundaries.append(df['x'].max())
    
    for boundary in chr_boundaries[1:-1]:
        plt.axvline(boundary, color='gray', linestyle=':', alpha=0.3, linewidth=0.5)
    
    plt.xlabel("Chromosome", fontsize=12)
    plt.ylabel("-log₁₀(P-value)", fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.legend(loc='upper right', fontsize=9)
    plt.grid(axis='y', alpha=0.3, linestyle='--')
    
    # Set x-axis to show chromosome numbers
    chr_centers = []
    chr_labels = []
    for c in sorted(df['chr'].unique()):
        chr_data = df[df['chr'] == c]
        chr_centers.append(chr_data['x'].mean())
        chr_labels.append(str(int(c)))
    plt.xticks(chr_centers, chr_labels, rotation=0)
    
    plt.tight_layout()
    plt.show()

plot_manhattan(coord, "p_unadj_ld", 
               "Manhattan Plot: Unadjusted GWAS (shows inflation due to population structure)")

### Manhattan Plot: PC-Adjusted GWAS

This plot shows the results **after** controlling for population structure using PCs. Notice:
- Fewer SNPs appear significant (inflation reduced)
- The "towers" of significant SNPs indicate regions with true genetic associations
- These towers occur because of **linkage disequilibrium (LD)** - nearby SNPs are correlated, so when a causal variant is significant, nearby SNPs also show significance

In [None]:
#@title Manhattan plot (PC-adjusted - inflation controlled)
plot_manhattan(coord, "p_adj_ld", 
               "Manhattan Plot: PC-Adjusted GWAS (population structure controlled)")

### QQ Plots: Comparing Unadjusted vs Adjusted

**What to look for:**
- **Diagonal line:** Expected under the null hypothesis (no associations)
- **Points above diagonal:** More significant results than expected (inflation)
- **Points below diagonal:** Fewer significant results than expected (deflation)

**Unadjusted plot:** Should show clear inflation (points above diagonal), especially in the tail
**Adjusted plot:** Should be closer to the diagonal, showing that controlling for PCs reduced inflation
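
The "expected" axis of a QQ plot comes from the fact that p-values are uniformly distributed under the null. As a minimal sketch with simulated null p-values (not the GWAS results), the code below builds the expected quantiles the same way the plotting function does and checks that observed and expected values agree in the bulk of the distribution.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
p_null = rng.uniform(size=n)                 # p-values under the null are uniform

p_sorted = np.clip(np.sort(p_null), 1e-300, 1.0)
expected = -np.log10((np.arange(1, n + 1) - 0.5) / n)   # expected quantiles
observed = -np.log10(p_sorted)

# Under the null, observed and expected quantiles agree closely in the bulk.
gap_at_median = abs(np.median(observed) - np.median(expected))
print(f"gap at the median: {gap_at_median:.4f}")
```

With inflated test statistics the observed values would sit above the expected ones across the whole range, not only in the tail.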

In [None]:
#@title QQ plots (unadjusted vs adjusted)
def qqplot(pvals, title, lambda_gc=None):
    """Create a QQ plot comparing observed vs expected p-values"""
    p = np.sort(pvals)
    n = len(p)
    exp = -np.log10((np.arange(1, n+1) - 0.5) / n)
    obs = -np.log10(p.clip(1e-300, 1.0))
    
    plt.figure(figsize=(5, 5))
    plt.scatter(exp, obs, s=8, alpha=0.6, edgecolors='none')
    
    # Diagonal line (expected under null)
    mx = max(exp.max(), obs.max())
    plt.plot([0, mx], [0, mx], 'r--', linewidth=1.5, label='Expected (null)')
    
    plt.xlabel("Expected -log₁₀(P-value)", fontsize=11)
    plt.ylabel("Observed -log₁₀(P-value)", fontsize=11)
    
    title_text = title
    if lambda_gc is not None:
        title_text += f"\nλGC = {lambda_gc:.3f}"
    plt.title(title_text, fontsize=12, fontweight='bold')
    plt.legend(loc='lower right', fontsize=9)
    plt.grid(alpha=0.3, linestyle='--')
    plt.tight_layout()
    plt.show()

qqplot(p0, "QQ Plot: Unadjusted GWAS", lambda_gc=lam0)
qqplot(p1, "QQ Plot: PC-Adjusted GWAS", lambda_gc=lam1)

## Key Takeaways and Teaching Points

### 1. Population Structure Causes Confounding

**The problem:**
- Different populations have different allele frequencies
- If the phenotype also differs between populations (for any reason), you get spurious associations
- This creates **false positives** - SNPs that appear significant but aren't truly causal

**The solution:**
- Control for population structure using Principal Components (PCs)
- Include PCs as covariates in the regression model
- This removes the spurious associations while keeping true genetic effects

### 2. Principal Components Capture Population Structure

- The first few PCs typically capture major axes of genetic variation
- These often correspond to population/ancestry differences
- By including PCs as covariates, we control for these differences

### 3. Manhattan Plots Show Genome-Wide Results

- Each point is a SNP, positioned by genomic location
- Height shows -log10(p-value) - higher = more significant
- **Towers** of significant SNPs indicate regions with true associations
- Towers occur due to **linkage disequilibrium (LD)** - nearby SNPs are correlated

### 4. QQ Plots Diagnose Inflation

- Compare observed p-values to expected p-values under the null
- Points above diagonal = inflation (too many significant results)
- Points on diagonal = well-controlled study
- The "tail" (right side) shows the most significant results

### 5. Lambda GC (λGC) Measures Inflation

- λGC = 1.0: No inflation (perfect)
- λGC > 1.0: Inflation (problem!)
- λGC < 1.0: Deflation (also a problem, but less common)

**Note:** λGC is useful but not perfect - it can miss some types of confounding, and a well-powered study with true associations will also show some inflation in the tail.

### 6. Real GWAS Best Practices

In real GWAS studies, you should:
1. **Quality control (QC):** Filter out low-quality SNPs and samples
2. **Control for covariates:** PCs, age, sex, batch effects, etc.
3. **Check for inflation:** Use QQ plots and λGC
4. **Replication:** Significant findings should be replicated in independent datasets
5. **Multiple testing correction:** Use appropriate thresholds (e.g., p < 5×10⁻⁸ for genome-wide significance)
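
The genome-wide threshold in point 5 is conventionally explained as a Bonferroni correction of α = 0.05 for roughly one million independent common-variant tests. The sketch below does that arithmetic, and also shows what a plain Bonferroni threshold would be for the 12,000 SNPs simulated in this notebook. The one-million figure is the conventional approximation, not something derived here.

```python
alpha = 0.05
n_independent_tests = 1_000_000      # conventional approximation (assumed)
genome_wide = alpha / n_independent_tests

m_toy = 12_000                       # SNP count used in this notebook's simulation
bonferroni_toy = alpha / m_toy

print(f"genome-wide threshold: {genome_wide:.1e}")
print(f"Bonferroni for {m_toy} SNPs: {bonferroni_toy:.1e}")
```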

### Questions for Discussion

1. Why does the unadjusted GWAS show more significant results?
2. How do PCs help reduce false positives?
3. What would happen if we didn't control for population structure in a real GWAS?
4. Why do we see "towers" of significant SNPs rather than isolated points?
5. What other factors (besides population structure) could cause inflation in a GWAS?
