What makes a plasma protein genetically predictable — and why do some escape local genetic control?
PRISM is a nine-step R analysis pipeline that models which genomic, regulatory, and network features determine how well plasma protein abundance can be predicted from genetic variants (pQTL R²) across 1,359 human proteins. The pipeline covers feature engineering, regression and random forest modelling, STRING network analysis, mixed models, and pathway enrichment, with all downstream analyses performed on residualized outcomes that remove technical confounders.
PRISM/
├── R/
│ ├── run_all.R # Orchestrator — runs all steps in sequence
│ ├── step01_load_and_clean.R # Data loading, merging, transforms, residualization
│ ├── step02_eda.R # EDA and distribution plots
│ ├── step03_feature_matrix.R # Feature groups (5 categories, 19 features), VIF checks
│ ├── step04_regression.R # Incremental OLS + elastic net (cis / trans / genome-wide)
│ ├── step05_rf_ppi.R # Random Forest variable importance + PPI stratification
│ ├── step06_visualisation.R # PCA, observed vs predicted scatter plots
│ ├── step07_string_network.R # STRING network, Louvain modules, RWR, decile analysis
│ ├── step08_mixed_model.R # Mixed model, module permutation tests, ICC
│ ├── step09_enrichment.R # Protein classification, GSEA/ORA, module enrichment
│ ├── step_model_panel.R # Combined 3×3 model comparison figure (OLS/EN/RF)
│ ├── step_module_figure.R # Combined module heatmap + enrichment figure
│ └── step_enrichment_panel.R # Combined enrichment panel figure
├── data/ # Raw input files (committed) + generated intermediates (gitignored)
├── figs/ # Generated figures, organised by step (gitignored)
├── outputs/ # Generated result CSVs (gitignored)
├── viewer/ # FastAPI + Plotly interactive results viewer
├── logs/ # Pipeline run logs (gitignored)
└── prism_clean_column_dictionary.md
| Step | Script | Key outputs |
|---|---|---|
| 01 | step01_load_and_clean.R |
data/prism_clean.csv — merged, cleaned, INT-transformed, with residualized outcomes |
| 02 | step02_eda.R |
Distribution plots, cis vs trans scatter, correlation heatmap |
| 03 | step03_feature_matrix.R |
data/feature_matrix.csv — 19 features across 5 groups, VIF < 10 |
| 04 | step04_regression.R |
Incremental OLS ΔR² by feature group + elastic net coefficients |
| 05 | step05_rf_ppi.R |
RF variable importance (OOB R²: cis=0.71, trans=0.70, gw=0.65) + PPI stratification |
| 06 | step06_visualisation.R |
PCA of feature space, observed vs predicted plots |
| 07 | step07_string_network.R |
STRING network (threshold=700), 225 Louvain modules, RWR, decile subnetwork analysis |
| 08 | step08_mixed_model.R |
Mixed model ICC, permutation tests (n=10,000 per module) |
| 09 | step09_enrichment.R |
Protein classification, GSEA/ORA (4 residualized outcomes), per-module GO:BP enrichment |
| Variable | Description |
|---|---|
R2_cis |
Predictive R² from cis-only pQTL model (local genetic architecture) |
R2_trans |
Additional R² gained by adding trans SNPs beyond cis |
R2_genome_wide |
Predictive R² from full genome-wide SNP model |
R2_ratio |
R²_trans / R²_cis (proteins with R²_cis > 0.001 only) |
Residualized outcomes (computed in step01, used in steps 07–09):
Technical confounders (R2_cov, prot_expr, model_sparsity) are regressed out and residuals are z-scored. trans_resid additionally removes the contribution of R2_cis_int.
| Residual | Confounder model |
|---|---|
cis_resid |
R2_cis_int ~ R2_cov + prot_expr + model_sparsity |
trans_resid |
R2_trans_int ~ R2_cis_int + R2_cov + prot_expr + model_sparsity |
gw_resid |
R2_genome_wide_int ~ R2_cov + prot_expr + model_sparsity |
ratio_resid |
R2_ratio_int ~ R2_cov + prot_expr + model_sparsity |
| File | Size | Description |
|---|---|---|
data/Full_model.csv |
~650 KB | Feature table (genomic, regulatory, constraint, pQTL per protein) |
data/R2_data.csv |
~190 KB | Outcome table (R²_cis, R²_trans, R²_genome_wide per protein) |
data/9606.protein.links.v12.0.txt.gz |
80 MB | STRING v12 interaction network — download |
data/9606.protein.info.v12.0.txt.gz |
1.9 MB | STRING v12 protein metadata — download |
data/9606.protein.enrichment.terms.v12.0.txt.gz |
38 MB | STRING v12 functional annotations — download |
Note: Raw column naming in source CSVs is non-standard —
GeneSymbolcontains Ensembl gene IDs andgenecontains HGNC symbols. Step 01 renames these correctly.
# CRAN
install.packages(c(
"ggplot2", "ggrepel", "ggridges", # visualisation
"glmnet", # elastic net
"randomForest", # random forest
"lme4", # mixed models
"igraph", # network analysis
"pheatmap", # heatmaps
"jsonlite", # JSON export
"dplyr", "readr", "scales", # data manipulation
"gridExtra", "grid", # figure layout
"msigdbr" # MSigDB gene sets (step 09)
))
# Bioconductor (step 09)
if (!requireNamespace("BiocManager")) install.packages("BiocManager")
BiocManager::install(c("clusterProfiler", "org.Hs.eg.db", "enrichplot", "fgsea"))python3 -m venv .viewer-venv
source .viewer-venv/bin/activate
pip install -r viewer/requirements.txt# Full pipeline (~10–15 min including GSEA)
Rscript R/run_all.R 2>&1 | tee logs/pipeline_run_$(date +%Y%m%d).log
# Individual steps
Rscript R/step01_load_and_clean.R # must run first
Rscript R/step07_string_network.R # requires STRING files
Rscript R/step09_enrichment.R # requires steps 01 + 07 + 08
# Figure panels (run after full pipeline)
Rscript R/step_model_panel.R
Rscript R/step_module_figure.R
Rscript R/step_enrichment_panel.R| Variable | Default | Description |
|---|---|---|
PRISM_DATA_DIR |
./data |
Location of raw input CSVs and STRING files |
PRISM_OUTPUT_DIR |
./data |
Intermediate R outputs |
PRISM_RESULTS_DIR |
./outputs |
Model result CSVs |
PRISM_FIGS_DIR |
./figs |
Base figures directory |
A read-only FastAPI + Plotly dashboard for exploring pipeline outputs.
source .viewer-venv/bin/activate
uvicorn viewer.app:app --host 127.0.0.1 --port 8000Access via SSH tunnel from your local machine:
ssh -N -L 8000:127.0.0.1:8000 <cluster-login>
# then open http://127.0.0.1:8000/- pQTL architecture dominates R² prediction across all three outcomes (largest ΔR² of any feature group)
- Signalling and immune proteins have significantly higher trans_resid (β=+0.69, FDR=2.4×10⁻¹²), while intracellular/mitochondrial proteins have lower trans_resid
- GSEA reveals pathway-level cis escape: all 107 significant gene sets for cis_resid show negative NES — hormonal response, hemostasis, and platelet activation pathway proteins are uniformly less locally heritable than expected
- Trans-driven proteins are STRING network hubs: high R²_trans decile proteins form denser subnetworks (D10 mean degree=2.0 vs D1=0.5)
- Module ICC=0.011 after residualization (down from 0.022 on raw R²) — technical confounders explained ~half the apparent module clustering signal
| Step | Expected |
|---|---|
| 01 | 1,359 proteins; residualized outcomes: cis adj.R²=0.091, trans adj.R²=0.559, gw adj.R²=0.291 |
| 03 | All 19 features with VIF < 10 |
| 05 | RF OOB R²: cis=0.712, trans=0.702, gw=0.650 |
| 07 | 1,357/1,359 proteins matched to STRING; 225 Louvain modules; RWR converges at ~37 iterations |
| 08 | ICC=0.011; ≥1 significant module (q_BH<0.05) |
| 09 | 107 significant GSEA sets for cis_resid; 15 modules biologically labelled |