This repository runs downstream analyses linking:
- gene expression heritability (
h2_GREML) - evolutionary constraint (
s_het post_meanand deciles) - nearby regulatory burden (enhancer/open-chromatin counts)
- nearby repeat burden (total, class, family, and literature-filtered sets)
The main script is:
scripts/downstream_h2_regulatory_repeat_analysis.R
It builds a gene-level feature matrix and summary plots/tables for decile trends, association models, repeat subtype behavior, and random-genomic-background enrichment.
For upstream phenotype preparation (before heritability/eQTL mapping), use:
scripts/prepare_gtex_like_expression_phenotype.R
This script prepares GTEx-style expression phenotypes:
- mandatory TPM+count filtering (
TPM >= 0.1andcounts >= 6in>=20%samples) - TMM normalization of counts
- inverse-normal transformation per gene across samples
For end-to-end GREML execution through Nextflow + Slurm:
nf/main.nfnf/nextflow.confignf/bin/prepare_phenotypes.Rrun_greml.sh(submit withsbatch run_greml.sh)
scripts/downstream_h2_regulatory_repeat_analysis.Rscripts/downstream_h2_regulatory_repeat_helpers.Rscripts/prepare_gtex_like_expression_phenotype.Rnf/main.nfnf/bin/prepare_phenotypes.Rrun_greml.sh
docs/how_to_run_downstream_analysis.mddocs/downstream_analysis_methods.mddocs/how_to_prepare_gtex_like_phenotypes.mddocs/how_to_run_greml_nextflow.md
summary_tsv: GREML heritability summary (Gene,h2_GREML,SE_GREML,Pval_GREML)shet_xlsx: s_het table (ensg,post_mean; coordinates optional)tpm_tsv: TPM matrix from Salmon/GEUVADIS processinggene_gtf: GENCODE hg38 GTF
enhancer_bed: enhancer BEDopen_bed: open-chromatin BEDrepeat_rmsk: UCSCrmskfilechrom_info_tsv: chromosome size tableroadmap_links_dir: local Roadmap link files (optional)
- ENCODE SCREEN enhancer BED (
GRCh38-cCREs.ELS.bed) - ENCODE SCREEN open BED (
GRCh38-cCREs.bed) - UCSC
hg38/database/rmsk.txt.gz - UCSC
hg38/database/chromInfo.txt.gz - GEUVADIS SDRF URL
- Load SDRF and keep European GEUVADIS runs.
- Load TPM matrix and apply expression filter.
- Default is GTEx-style TPM filter:
TPM >= 0.1in>=20%of EU samples.
- Default is GTEx-style TPM filter:
- Merge heritability summary with s_het table and build
post_mean_bin(1-10). - Build gene coordinates (S_het coords if available, otherwise GTF), then construct windows:
- gene-centered:
100kb,250kb,500kb,1mb - TSS-centered:
100kb,250kb,500kb,1mb
- gene-centered:
- Count enhancer/open-chromatin features in windows.
- Load RepeatMasker annotations and compute:
- total repeat counts
- repeat class counts
- repeat family counts
- repeat subtype aliases (
repeat_<type>_count_<window>)
- Apply literature-style repeat filters (class/divergence/length).
- Simulate random genomic background (default 100 iterations) with fixed seed strategy.
- Generate summaries, models, and plots by
post_meandecile.
merged: h2 + s_het + expression-filtered gene tablegene_tbl: gene metadata and window widthsreg_features: enhancer/open-chromatin feature tablerepeat_features: total repeat burden featuresrepeat_class_dt: repeat class burden featuresrepeat_family_dt: repeat family burden featuresanalysis_dt: final gene-level merged feature matrixrepeat_filt_observed_summary: observed filtered-repeat decile summariesrepeat_background_iter: per-iteration random-background summariesrepeat_filtered_enrichment: observed vs background enrichment table
The script emits [Sanity] messages and stops early on failures. Checks include:
- input table row counts (SDRF/TPM/s_het/summary/rmsk)
- matched EU sample count and valid expression thresholds
- non-empty merged gene set and decile coverage
- non-empty enhancer/open/repeat annotations
- positive window widths
- final
analysis_dtrow count matching merged genes - uniqueness of background simulation seeds
Typical expectation:
- thousands of genes retained after expression filter and merge
- 10 s_het bins present
- non-zero counts in at least some enhancer/open/repeat columns
All outputs are written under:
results/downstream_h2_regulatory_repeat/
gene_level_features.tsvdecile_feature_summary.tsvspearman_correlations.tsvmodel_lm_h2.tsvmodel_glm_h2sig.tsvmodel_lm_repeat_classes.tsv(when predictors are valid)
expression_filter_counts.tsvexpression_filter_method.tsvnon_zero_eur_gene_ids.txt
repeat_filter_criteria.tsvrepeat_filter_set_counts.tsvchrom_sizes_used.tsvrepeat_background_seed_info.tsvrepeat_background_seed_map.tsvrepeat_filtered_background_iter.tsvpostmean_repeat_filtered_observed_summary.tsvpostmean_repeat_filtered_background_summary.tsvpostmean_repeat_filtered_enrichment.tsv
postmean_bin_feature_summary.tsvpostmean_bin_repeat_type_summary.tsvpostmean_bin_repeat_family_summary.tsv
- Global trends: h2, regulatory burden, repeat burden
- Repeat type trends per window (
100kb,250kb,500kb,1mb) - Repeat family heatmaps per window
- Filtered-repeat trends per window
- Observed-vs-background filtered-repeat plots per window
- Background enrichment-ratio plots per window
Set these in cfg:
repeat_bg_seed: base seed for background simulationsrepeat_bg_n_iter: number of random iterations
Exact seeds used are recorded in:
repeat_background_seed_map.tsv
Seed strategy:
seed_used = repeat_bg_seed + simulation_index_in_filter_window_loop
If config and inputs are unchanged, reruns produce identical background outputs.
From repository root:
Rscript scripts/downstream_h2_regulatory_repeat_analysis.R- Current expression filtering implements the TPM component of GTEx-style filtering.
(GTEx eQTL prep also uses a read-count threshold when counts are available.) - Windows are gene-centered for primary burden analyses; TSS-centered features are also exported for comparison.