# Pathway Subtyping — Try It in 60 Seconds

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/topmist-admin/pathway-subtyping-framework/blob/main/examples/notebooks/00_quick_demo.ipynb)

**What this does:** Takes a cohort of patients with rare genetic variants, groups the variants into biological pathways, and discovers molecular subtypes with Gaussian Mixture Model (GMM) clustering. Built-in validation gates guard against false discoveries.

**Think of it as:** "Patient segmentation, but at the genomic level."

Run the 5 cells below. That's it.

In [None]:
# Cell 1: Install (takes ~30 seconds in Colab)

# Quote the requirement: an unquoted ">=" is treated by the shell as an output redirect
!pip install -q "pathway-subtyping>=0.2.0"

# Verify
import pathway_subtyping
print(f"Installed pathway-subtyping v{pathway_subtyping.__version__}")
print("Ready — run cells 2-5 below.")

In [None]:
# Cell 2: Generate a synthetic cohort with planted subtypes
#
# This simulates 200 patients across 3 molecular subtypes.
# Each subtype has elevated burden in different biological pathways.
# In real use, you'd provide your own VCF + pathway definitions.

from pathway_subtyping import (
    SimulationConfig, generate_synthetic_data,
    run_clustering, ClusteringAlgorithm,
    ValidationGates
)
import pandas as pd
import numpy as np

sim = generate_synthetic_data(SimulationConfig(
    n_samples=200,
    n_pathways=12,
    n_genes_per_pathway=20,
    n_subtypes=3,
    effect_size=1.2,
    noise_level=1.0,
    seed=42
))

print(f"Cohort: {sim.pathway_scores.shape[0]} patients x {sim.pathway_scores.shape[1]} pathways")
print(f"Planted subtypes: {np.unique(sim.true_labels)}")
print("\nPathway score matrix (first 5 patients):")
sim.pathway_scores.head()

In [None]:
# Cell 3: Discover subtypes via GMM clustering

result = run_clustering(
    sim.pathway_scores.values,
    n_clusters=3,
    algorithm=ClusteringAlgorithm.GMM,
    seed=42
)

from sklearn.metrics import adjusted_rand_score

ari = adjusted_rand_score(sim.true_labels, result.labels)

print("Clustering quality")
print(f"  Silhouette score:      {result.silhouette:.3f}  (range -1 to 1, higher = better)")
print(f"  Calinski-Harabasz:     {result.calinski_harabasz:.1f}  (higher = better)")
print(f"  Davies-Bouldin:        {result.davies_bouldin:.3f}  (lower = better)")
print("\nRecovery of planted subtypes")
print(f"  Adjusted Rand Index:   {ari:.3f}  (1.0 = perfect recovery)")

In [None]:
# Cell 4: Validate — are these clusters real or noise?
#
# Three built-in gates:
#   Label Shuffle:     shuffled labels should NOT reproduce clusters (ARI < 0.15)
#   Random Gene Sets:  random pathways should NOT reproduce clusters (ARI < 0.15)
#   Bootstrap:         resampling SHOULD reproduce clusters (ARI >= 0.80)

gates = ValidationGates(seed=42, n_permutations=50, n_bootstrap=50)
val_result = gates.run_all(
    pathway_scores=sim.pathway_scores,
    cluster_labels=result.labels,
    pathways=sim.pathways,
    gene_burdens=sim.gene_burdens,
    n_clusters=3,
    gmm_seed=42
)

print("Validation Gates")
print("=" * 50)
for test in val_result.results:
    icon = "PASS" if test.passed else "FAIL"
    print(f"  [{icon}] {test.name}: {test.metric_name}={test.metric_value:.3f} (threshold: {test.threshold})")
print(f"\nOverall: {'ALL PASSED' if val_result.all_passed else 'SOME FAILED'}")

In [None]:
# Cell 5: Visualize — PCA projection + pathway heatmap

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import seaborn as sns

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

pca = PCA(n_components=2, random_state=42)
X = pca.fit_transform(sim.pathway_scores.values)

# Left: discovered clusters
for label in np.unique(result.labels):
    mask = result.labels == label
    axes[0].scatter(X[mask, 0], X[mask, 1], label=f"Cluster {label}", s=40, alpha=0.7)
axes[0].set_title("Discovered Subtypes")
axes[0].set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.0%})")
axes[0].set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.0%})")
axes[0].legend()

# Center: ground truth
for label in np.unique(sim.true_labels):
    mask = sim.true_labels == label
    axes[1].scatter(X[mask, 0], X[mask, 1], label=f"True {label}", s=40, alpha=0.7)
axes[1].set_title("Ground Truth Subtypes")
axes[1].set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.0%})")
axes[1].set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.0%})")
axes[1].legend()

# Right: mean pathway scores per cluster
scores_df = sim.pathway_scores.copy()
scores_df["cluster"] = result.labels
cluster_means = scores_df.groupby("cluster").mean()
sns.heatmap(cluster_means, cmap="RdBu_r", center=0, ax=axes[2],
            xticklabels=True, yticklabels=True, cbar_kws={"label": "Mean Z-score"})
axes[2].set_title("Pathway Profiles by Subtype")
axes[2].set_xlabel("Pathway")
axes[2].set_ylabel("Cluster")

plt.suptitle(f"Pathway Subtyping Results — ARI: {ari:.3f}", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()

---

## What just happened?

1. **Generated** a synthetic cohort of 200 patients with 3 planted molecular subtypes
2. **Scored** each patient across 12 biological pathways (aggregating gene-level burden)
3. **Clustered** patients using a Gaussian Mixture Model (GMM)
4. **Validated** the clusters against three gates: label shuffling and random gene sets should not reproduce them, and bootstrap resampling should
5. **Visualized** the subtypes in PCA space + their distinguishing pathway profiles
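
The scoring, clustering, and validation steps above can be sketched with NumPy and scikit-learn alone. This is a minimal illustration, not the framework's implementation: the toy cohort, the mean-then-z-score aggregation, and the column-permutation null are all assumptions chosen for brevity, and the `ValidationGates` internals may differ.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy cohort (all values synthetic): 180 patients, 3 pathways of 10 genes each.
n_per_subtype, n_pathways, genes_per_pathway = 60, 3, 10
true_labels = np.repeat(np.arange(n_pathways), n_per_subtype)
gene_burden = rng.normal(size=(true_labels.size, n_pathways * genes_per_pathway))
for k in range(n_pathways):
    cols = slice(k * genes_per_pathway, (k + 1) * genes_per_pathway)
    gene_burden[true_labels == k, cols] += 1.5  # subtype k is elevated in pathway k

# Scoring (assumed scheme): mean burden over member genes, then z-score per pathway.
scores = gene_burden.reshape(true_labels.size, n_pathways, genes_per_pathway).mean(axis=2)
scores = (scores - scores.mean(axis=0)) / scores.std(axis=0)


def fit_labels(X, seed):
    """Fit a GMM and return hard cluster assignments."""
    gmm = GaussianMixture(n_components=n_pathways, n_init=5, random_state=seed)
    return gmm.fit_predict(X)


# Clustering: reference labels and recovery of the planted subtypes.
labels = fit_labels(scores, seed=0)
recovery_ari = adjusted_rand_score(true_labels, labels)

# Bootstrap stability: refit on resampled patients and compare against the
# reference labels of those same patients.
boot_aris = []
for i in range(20):
    idx = rng.choice(len(scores), size=len(scores), replace=True)
    boot_aris.append(adjusted_rand_score(labels[idx], fit_labels(scores[idx], seed=i)))
stability = float(np.mean(boot_aris))

# Permutation null: shuffling each pathway column independently breaks the
# patient-level structure, so agreement with the reference labels should collapse.
null_aris = []
for i in range(20):
    shuffled = np.column_stack([rng.permutation(col) for col in scores.T])
    null_aris.append(adjusted_rand_score(labels, fit_labels(shuffled, seed=i)))
null_ari = float(np.mean(null_aris))

print(f"Recovery ARI:        {recovery_ari:.3f}")
print(f"Bootstrap stability: {stability:.3f}")
print(f"Permutation null:    {null_ari:.3f}")
```

The two checks point in opposite directions, which is what the gate thresholds in Cell 4 encode: a stable clustering should keep a high ARI under resampling (the notebook uses >= 0.80), while a negative control should stay near zero (the notebook uses < 0.15).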

## Use it on your own data

```bash
pip install pathway-subtyping
psf --config your_config.yaml
```

The framework accepts **annotated VCF files** (with GENE, CONSEQUENCE, CADD fields) and includes curated pathway sets for autism, schizophrenia, epilepsy, Parkinson's, bipolar disorder, and intellectual disability.
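
To check whether a VCF carries the required annotations before running the pipeline, a quick parse of the INFO column can help. The sketch below assumes the three annotations appear as plain `KEY=value` entries in INFO (`GENE=...;CONSEQUENCE=...;CADD=...`); that layout, the helper names, and the sample line are illustrative assumptions, so consult the framework's documentation for the exact format it expects.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style entries map to True."""
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


def read_annotations(vcf_line):
    """Extract position and the assumed GENE/CONSEQUENCE/CADD annotations from one data line."""
    cols = vcf_line.rstrip("\n").split("\t")
    info = parse_info(cols[7])  # INFO is the 8th tab-separated column in VCF
    return {
        "chrom": cols[0],
        "pos": int(cols[1]),
        "gene": info.get("GENE"),
        "consequence": info.get("CONSEQUENCE"),
        "cadd": float(info["CADD"]) if "CADD" in info else None,
    }


# Hypothetical data line with placeholder values, for illustration only.
line = "chr1\t12345\t.\tA\tG\t50\tPASS\tGENE=GENE_A;CONSEQUENCE=missense_variant;CADD=25.3"
record = read_annotations(line)
missing = [name for name in ("gene", "consequence", "cadd") if record[name] is None]
print(record)
print("Missing annotations:", missing or "none")
```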

## Links

- **Full tutorial:** [01_getting_started.ipynb](https://colab.research.google.com/github/topmist-admin/pathway-subtyping-framework/blob/main/examples/notebooks/01_getting_started.ipynb)
- **GitHub:** [topmist-admin/pathway-subtyping-framework](https://github.com/topmist-admin/pathway-subtyping-framework)
- **PyPI:** `pip install pathway-subtyping`
- **DOI:** [10.5281/zenodo.18442427](https://doi.org/10.5281/zenodo.18442427)
- **Methods:** [METHODS.md](https://github.com/topmist-admin/pathway-subtyping-framework/blob/main/docs/METHODS.md)

---
*Disease-agnostic. Open source. Built by a parent who needed it to exist.*