Pathway Subtyping Framework

A Disease-Agnostic Tool for Pathway-Based Molecular Subtype Discovery

Overview

The Pathway Subtyping Framework is an open-source computational tool for identifying molecular subtypes in genetically heterogeneous diseases. Instead of analyzing individual genes, it aggregates rare variant burden at the biological pathway level, enabling:

Better signal detection across genetically diverse cohorts
Identification of biologically coherent patient subgroups
Cross-cohort validation of discovered subtypes

Originally developed for autism research, this generalized version can be adapted for any disease with:

Genetic heterogeneity (many implicated genes)
Convergent pathway biology
Available exome/genome sequencing data

Supported Disease Areas

Disease	Status	Pathway File
Autism Spectrum Disorder	Validated	`autism_pathways.gmt`
Schizophrenia	Template	`schizophrenia_pathways.gmt`
Epilepsy	Template	`epilepsy_pathways.gmt`
Intellectual Disability	Template	`intellectual_disability_pathways.gmt`
Parkinson's Disease	Template	`parkinsons_pathways.gmt`
Bipolar Disorder	Template	`bipolar_pathways.gmt`
Your disease	Adapt it →	`your_pathways.gmt`

Key Features

Feature	Description
Pathway Scoring	Aggregate gene burdens across biological pathways
Expression Scoring	Bulk RNA-seq pathway scoring via ssGSEA, GSVA, or mean-Z methods
Single-Cell Scoring	Per-cell and pseudobulk pathway scoring from scRNA-seq (h5ad/CSV)
Multi-Omic Fusion	Fuse VCF + expression + single-cell scores (concatenate, weighted, intersection)
Bulk Deconvolution	Estimate cell-type proportions from bulk RNA-seq via NNLS; cell-type-aware subtypes
Multiple Clustering	GMM, K-means, Hierarchical, Spectral with cross-validation
Ancestry Correction	PCA-based population stratification correction with independence testing
Batch Correction	ComBat-style batch effect detection and correction
Sensitivity Analysis	Parameter robustness testing across algorithms, features, normalization
Threshold Calibration	Data-driven validation thresholds that adjust for sample size and cluster count
Variant QC	QUAL, call rate, HWE, MAF filters before burden computation
Validation Gates	5 gates: label-shuffle and random-gene negative controls, bootstrap stability, ancestry independence, cross-modal concordance
Statistical Rigor	FDR correction, effect sizes, confidence intervals
Power Analysis	Sample size recommendations, Type I error estimation
Simulation	Synthetic data generation with ground truth for validation
Cross-Cohort Validation	Transfer learning and projection-based replication testing
Visualization	Interactive Plotly HTML reports, UMAP/t-SNE scatter plots, radar charts, multi-format export
Performance	tqdm progress bars, chunked VCF processing, 10K+ sample support
Reproducibility	Deterministic execution, pinned dependencies, Docker
Config-Driven	YAML configuration for all parameters

Quick Start

Installation

pip install pathway-subtyping

Optional extras:

pip install "pathway-subtyping[vcf]"   # VCF file processing (pysam)
pip install "pathway-subtyping[viz]"   # Interactive visualizations (Plotly, UMAP)
pip install "pathway-subtyping[sc]"    # Single-cell support (AnnData)
pip install "pathway-subtyping[graph]" # Network analysis (NetworkX, py4cytoscape)

Network Visualization (Cytoscape)

The [graph] extra enables publication-ready network figure generation via Cytoscape, an open-source desktop application for network visualization. The py4cytoscape library communicates with Cytoscape's CyREST API on localhost:1234.

Setup:

Download and install Cytoscape desktop (v3.10+)
Launch Cytoscape and wait for it to fully load
Install the Python extra: pip install "pathway-subtyping[graph]"

Verify the connection:

python -c "import py4cytoscape as p4c; print(p4c.cytoscape_ping())"

Run the figure generator:

python scripts/generate_cytoscape_figures.py

Try in Browser (No Installation)

The 60-second demo generates a synthetic cohort, discovers subtypes, validates them, and visualizes the results. No input data is needed.

Full tutorial: 01_getting_started.ipynb

Notebooks

18 Jupyter notebooks covering tutorials through full manuscript reproduction. See docs/notebook-guide.md for execution order and dependencies.

Tutorials (00-09) -- Synthetic data, standalone, any order:

#	Topic
00	Quick demo (60 seconds)
01	Getting started (installation, pipeline, validation)
02	Expression scoring (ssGSEA, GSVA, mean-Z)
03	Multi-omic fusion
04	Cell-type deconvolution
05	Visualization (PCA, t-SNE, UMAP, Plotly)
06	Ancestry and batch correction
07	Sensitivity analysis
08	Subtype characterization
09	Signaling database integration

Real Data Validation (10-16) -- GEO datasets, run in order (later notebooks use earlier outputs):

#	Dataset	Tissue	N	Manuscript Section
10	GSE28521	Brain (frontal + temporal cortex)	79	Section 5
11	GSE64018	Brain (temporal cortex, RNA-seq)	24	Section 5.9
12	GSE80655	Brain (3 regions, multi-diagnosis)	281	Section 6
12b	GSE80655	Null ARI permutation test	141	Section 6.5
13	GSE111175	Blood + ADOS clinical scores	141	Section 6.10
14	GSE18123	Blood (largest cohort, 2 platforms)	285	Section 6.10
15	GSE53987	Brain (3 regions, Affymetrix, 4 diagnoses)	205	Section 6.11
16	Multi-dataset	Knowledge graph (STRING PPI + DGIdb)	1,075	Section 7

Run with Sample Data

# Clone for sample data and configs
git clone https://github.com/topmist-admin/pathway-subtyping-framework
cd pathway-subtyping-framework

# Run the pipeline with synthetic test data
psf --config configs/test_synthetic.yaml

# View results
cat outputs/synthetic_test/report.md

Run with Your Data

# Copy and customize a config
cp configs/example_autism.yaml configs/my_analysis.yaml

# Edit paths in my_analysis.yaml, then run
psf --config configs/my_analysis.yaml

Docker

# Run pipeline
docker-compose run pipeline

# Run tests
docker-compose run test

# Start Jupyter notebook
docker-compose up jupyter
# Open http://localhost:8888

Adapting for Your Disease

Create a pathway GMT file with disease-relevant gene sets
Copy an example config and point to your data
Run the pipeline — validation gates will tell you if subtypes are meaningful

See the full guide: Adapting for Your Disease
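As a starting point for the first step, the sketch below reads a GMT file into a dictionary. It relies only on the standard GMT layout (one gene set per line, tab-separated: set name, description, then gene symbols). The function name `parse_gmt` and the two example gene sets are hypothetical illustrations, not part of the framework's API or its shipped pathway files.

```python
import io


def parse_gmt(handle):
    """Parse GMT text into {pathway_name: set_of_gene_symbols}."""
    pathways = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines (need name, description, >= 1 gene)
        name, _description, *genes = fields
        pathways[name] = {g for g in genes if g}
    return pathways


# Hypothetical two-pathway file; the gene lists are illustrative only.
example = io.StringIO(
    "SYNAPTIC\tillustrative set\tSHANK3\tNRXN1\tSYNGAP1\n"
    "CHROMATIN\tillustrative set\tCHD8\tARID1B\n"
)
pathways = parse_gmt(example)
print(sorted(pathways["CHROMATIN"]))  # ['ARID1B', 'CHD8']
```

To check a real file the same way, pass an open file handle instead of the `io.StringIO` object.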

How It Works

VCF / Expression / scRNA-seq → Pathway Scoring → [Ancestry Correction] → [Batch Correction] → GMM Clustering → [Sensitivity Analysis] → Validation → Report

1. Pathway Scoring

Multiple input modalities are supported, each producing the same Z-normalized pathway score matrix:

VCF: Rare damaging variants aggregated with LoF/CADD weights
Expression: Bulk RNA-seq scored via ssGSEA, GSVA, or mean-Z
Single-cell: Pseudobulk or per-cell scoring from scRNA-seq
Multi-omic: Fuse scores from multiple modalities for unified subtype discovery
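The idea of turning per-gene burden into a Z-normalized pathway score matrix can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the framework's implementation: it assumes the per-gene burden values have already been weighted (for example by LoF/CADD), aggregates by simple summation, and uses hypothetical names (`pathway_scores`, the sample IDs, and the toy gene sets).

```python
from statistics import mean, pstdev


def pathway_scores(gene_burden, pathways):
    """Sum per-gene burden within each pathway, then Z-normalize across samples.

    gene_burden: {sample_id: {gene_symbol: burden}} (assumed already weighted)
    pathways:    {pathway_name: set of gene symbols}
    Returns      {sample_id: {pathway_name: z_score}}
    """
    samples = list(gene_burden)
    scores = {s: {} for s in samples}
    for name, genes in pathways.items():
        # Raw pathway burden per sample; genes without variants contribute 0.
        raw = [sum(gene_burden[s].get(g, 0.0) for g in genes) for s in samples]
        mu, sd = mean(raw), pstdev(raw)
        for s, value in zip(samples, raw):
            # A pathway with no variation across samples gets a score of 0.
            scores[s][name] = (value - mu) / sd if sd > 0 else 0.0
    return scores


# Toy cohort: three samples, two illustrative pathways.
burden = {"S1": {"SHANK3": 1.0}, "S2": {"CHD8": 1.0}, "S3": {}}
pathways = {"SYNAPTIC": {"SHANK3", "NRXN1"}, "CHROMATIN": {"CHD8"}}
scores = pathway_scores(burden, pathways)
print(round(scores["S1"]["SYNAPTIC"], 3))  # 1.414
```

Summation is only one possible aggregation; a mean or a size-adjusted score would fit the same structure.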

2. Subtype Discovery

Multiple clustering algorithms identify patient subgroups:

GMM (default): Soft assignments, automatic selection of the number of clusters via BIC
K-means: Fast, spherical clusters
Hierarchical: Dendrogram-based, no K required
Spectral: Nonlinear boundaries
Cross-validation for stability assessment
Algorithm comparison with pairwise ARI
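The default approach (GMM with BIC-based selection) can be illustrated directly with scikit-learn. The sketch below is an assumption-laden example rather than the framework's own code: the function name `select_gmm_by_bic`, the candidate range of K, the diagonal covariance choice, and the synthetic score matrix are all hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_gmm_by_bic(scores, k_range=range(1, 5), seed=0):
    """Fit one GMM per candidate K and keep the model with the lowest BIC."""
    best_model, best_bic = None, np.inf
    for k in k_range:
        model = GaussianMixture(
            n_components=k,
            covariance_type="diag",  # assumption: diagonal covariance for a small cohort
            n_init=3,
            random_state=seed,
        ).fit(scores)
        bic = model.bic(scores)
        if bic < best_bic:
            best_model, best_bic = model, bic
    return best_model


# Synthetic pathway-score matrix: two well-separated groups, 4 pathways each.
rng = np.random.default_rng(0)
scores = np.vstack([rng.normal(-3, 1, (60, 4)), rng.normal(3, 1, (60, 4))])

model = select_gmm_by_bic(scores)
hard = model.predict(scores)        # one subtype label per sample
soft = model.predict_proba(scores)  # soft assignments; each row sums to 1
print(model.n_components, soft.shape)
```

The soft assignments in `soft` are what distinguish a GMM from K-means here: a sample near a boundary receives split membership rather than a forced label.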

3. Validation Gates

Built-in tests prevent overfitting:

Label shuffle: Randomized labels should not reproduce the subtype structure (ARI < 0.15)
Random genes: Pathways built from random genes should not reproduce the subtype structure (ARI < 0.15)
Bootstrap: Clusters should be stable under resampling (ARI > 0.8)
Ancestry independence: Clusters should not correlate with ancestry PCs (when provided)
Cross-modal concordance: Subtypes should replicate across data modalities (when multi-omic)
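The ARI thresholds above can be made concrete with a small, dependency-free sketch. It shows only how the adjusted Rand index is computed and compared against the two cutoffs; how the framework produces the null and bootstrap clusterings is not covered here. The function names `adjusted_rand_index` and `check_gates` and the toy labels are hypothetical.

```python
import random
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings (needs at least two samples)."""
    n = len(labels_a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both labelings
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def check_gates(null_ari, bootstrap_ari, null_max=0.15, stability_min=0.8):
    """Compare ARI values against the thresholds quoted in the list above."""
    return {
        "negative_control": null_ari < null_max,
        "bootstrap_stability": bootstrap_ari > stability_min,
    }


labels = [0] * 50 + [1] * 50
shuffled = labels[:]
random.Random(0).shuffle(shuffled)   # stand-in for a label-shuffle null
relabeled = [1 - x for x in labels]  # same partition with the label names swapped

null_ari = adjusted_rand_index(labels, shuffled)
stable_ari = adjusted_rand_index(labels, relabeled)
gates = check_gates(null_ari, stable_ari)
print(gates)
```

Because ARI depends only on the partition, swapping label names leaves the score at 1.0, which is why it suits comparisons between independent clustering runs.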

4. Statistical Rigor

Publication-quality statistics:

FDR correction: Benjamini-Hochberg for multiple testing
Effect sizes: Cohen's d with 95% bootstrap confidence intervals
Power analysis: Sample size recommendations for target effect sizes
Type I error: Estimation via null simulations
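For readers who want to see the first two items spelled out, here is a minimal standard-library sketch of Benjamini-Hochberg adjustment and Cohen's d with a percentile bootstrap interval. It is illustrative only: the function names, the pooled-SD form of Cohen's d, the percentile method, and the simulated groups are assumptions, and docs/METHODS.md remains the reference for what the framework actually does.

```python
import random
from statistics import mean, variance


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def cohens_d(x, y):
    """Cohen's d using the pooled sample standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / pooled_var ** 0.5


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Cohen's d (resampling within each group)."""
    rng = random.Random(seed)
    stats = sorted(
        cohens_d(rng.choices(x, k=len(x)), rng.choices(y, k=len(y)))
        for _ in range(n_boot)
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(p, 3) for p in adjusted])  # [0.02, 0.04, 0.04, 0.02]

# Simulated pathway scores for two hypothetical subtypes.
rng = random.Random(1)
group_a = [rng.gauss(1.0, 1.0) for _ in range(30)]
group_b = [rng.gauss(0.0, 1.0) for _ in range(30)]
d = cohens_d(group_a, group_b)
lo, hi = bootstrap_ci(group_a, group_b)
print(round(d, 2), round(lo, 2), round(hi, 2))
```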

See docs/METHODS.md for full statistical methodology.

Data Requirements

Input	Format	Notes
Variants	VCF	Annotated with gene symbols, consequences
Bulk Expression	CSV/TSV	Gene expression matrix (samples x genes)
Single-Cell	h5ad/CSV	AnnData or cell-by-gene matrix with cell type annotations
Phenotypes	CSV	Sample IDs + clinical features
Pathways	GMT	Gene sets for your disease

Your data stays on your infrastructure. The framework runs locally or in your cloud environment.

Data Provenance and Integrity

This project contains zero proprietary, commercial, or third-party customer data.

Every data file in this repository was either:

Computationally generated — The synthetic VCF and phenotype files in data/sample/ were created by our SyntheticDataGenerator using random number generators with fixed seeds. They contain no real patient or clinical data whatsoever.
Curated from public scientific literature — The pathway GMT files in data/pathways/ contain gene symbol lists assembled exclusively from publicly available, peer-reviewed sources: SFARI Gene, KEGG, Reactome, MSigDB, and Gene Ontology. Gene symbols (e.g., SHANK3, CHD8) are standard scientific identifiers published in thousands of research papers.
Open-source code only — All algorithms are original implementations or standard open-source libraries (scikit-learn, scipy, numpy, pandas). No proprietary software, commercial code, or licensed algorithms were used.

No data from any employer, client, institution, or commercial entity was used at any stage of this project — not in development, testing, validation, or documentation. The framework is designed so that users supply their own data; it does not ship with, embed, or depend on any private or restricted datasets.

For full details, see DISCLAIMER.md and docs/contributor-kit/04-research-compliance.md.

Project Structure

pathway-subtyping-framework/
├── src/pathway_subtyping/       # Core Python package
│   ├── pipeline.py              # Main pipeline orchestrator
│   ├── clustering.py            # GMM, K-means, Hierarchical, Spectral
│   ├── validation.py            # Validation gates (5 gates)
│   ├── statistical_rigor.py     # FDR, effect sizes, burden weights
│   ├── simulation.py            # Synthetic data & power analysis
│   ├── expression.py            # Bulk RNA-seq pathway scoring (ssGSEA, GSVA, mean-Z)
│   ├── single_cell.py           # Single-cell scRNA-seq scoring (pseudobulk + per-cell)
│   ├── multi_omic.py            # Multi-omic pathway score fusion
│   ├── deconvolution.py         # Bulk deconvolution (NNLS cell-type proportions)
│   ├── cross_modal_validation.py # Cross-modal concordance gate (Gate 5)
│   ├── visualization.py         # Interactive Plotly reports, UMAP/t-SNE, export
│   ├── characterization.py      # Subtype profiling, heatmaps, gene contributions
│   ├── ancestry.py              # Population stratification correction
│   ├── batch_correction.py      # Batch effect detection & correction
│   ├── sensitivity.py           # Parameter sensitivity analysis
│   ├── benchmark.py             # Method comparison benchmarks
│   ├── cross_cohort.py          # Cross-cohort validation
│   ├── threshold_calibration.py # Data-driven threshold calibration
│   ├── variant_qc.py            # Variant QC (QUAL, HWE, MAF, call rate)
│   ├── validation_datasets.py   # ClinVar/Reactome integration
│   ├── data_quality.py          # VCF quality checks
│   └── utils/                   # Performance, seeding, progress tracking
├── configs/                     # Example YAML configurations
├── data/
│   ├── pathways/                # Pathway GMT files (6 diseases)
│   └── sample/                  # Synthetic test data
├── docs/
│   ├── METHODS.md               # Statistical methods documentation
│   ├── api/                     # API reference (13 modules)
│   └── guides/                  # User guides
├── scripts/                     # Utility scripts
│   ├── generate_cytoscape_figures.py  # Publication-ready network figures (requires [graph] + Cytoscape desktop)
│   ├── validate_with_public_data.py   # ClinVar/Reactome validation
│   └── benchmark_performance.py       # Performance benchmarks
├── examples/notebooks/          # Jupyter tutorials
├── tests/                       # Test suite (968+ tests)
├── Dockerfile                   # Container support
└── docker-compose.yml           # Easy orchestration

Development

# Install with dev dependencies (from cloned repo)
pip install -e ".[dev,vcf,viz,sc]"

# Run tests
pytest tests/ -v

# Run linting
black src/ tests/
isort src/ tests/
flake8 src/ tests/

# Set up pre-commit hooks
pre-commit install

Related Projects

Autism Pathway Framework — The original autism-focused implementation with SFARI cohort validation

Contributing

Contributions welcome! Areas where help is needed:

Additional disease pathway definitions
Multi-omic integration (spatial transcriptomics, proteomics)
Documentation and tutorials

See CONTRIBUTING.md for guidelines.

Citation

If you use this framework, please cite:

Chauhan R. Pathway Subtyping Framework. Zenodo. 2026.
DOI: 10.5281/zenodo.18442426
https://github.com/topmist-admin/pathway-subtyping-framework

For autism-specific work, also cite:

Chauhan R. Autism Pathway Framework. Zenodo. 2026.
DOI: 10.5281/zenodo.18403844

License

MIT License — see LICENSE for details.

Contact

Rohit Chauhan

RESEARCH USE ONLY — This framework is for hypothesis generation. Not for clinical diagnosis or treatment decisions.
