GeneExpression: Downstream h2 / s_het Regulatory + Repeat Analysis

Purpose

This repository runs downstream analyses linking:

gene expression heritability (h2_GREML)
evolutionary constraint (s_het post_mean and deciles)
nearby regulatory burden (enhancer/open-chromatin counts)
nearby repeat burden (total, class, family, and literature-filtered sets)

The main script is:

scripts/downstream_h2_regulatory_repeat_analysis.R

It builds a gene-level feature matrix and summary plots/tables for decile trends, association models, repeat subtype behavior, and random-genomic-background enrichment.

For upstream phenotype preparation (before heritability/eQTL mapping), use:

scripts/prepare_gtex_like_expression_phenotype.R

This script prepares GTEx-style expression phenotypes:

mandatory TPM+count filtering (TPM >= 0.1 and counts >= 6 in >=20% samples)
TMM normalization of counts
inverse-normal transformation per gene across samples
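
The filtering and transformation steps above can be sketched as follows. This is an illustrative Python sketch, not the repository's R code: the function names are hypothetical, the two thresholds are assumed to be applied independently (each must hold in >=20% of samples), and the rank-to-quantile mapping rank / (n + 1) is an assumed convention. TMM normalization is not shown.

```python
from statistics import NormalDist


def passes_expression_filter(tpm, counts, min_tpm=0.1, min_count=6, min_frac=0.2):
    """Return True if one gene passes both thresholds.

    Assumption: each threshold must be met in at least `min_frac` of samples,
    evaluated independently (not necessarily in the same samples).
    """
    need = min_frac * len(tpm)
    n_tpm = sum(t >= min_tpm for t in tpm)
    n_cnt = sum(c >= min_count for c in counts)
    return n_tpm >= need and n_cnt >= need


def inverse_normal_transform(values):
    """Rank-based inverse-normal transform of one gene across samples.

    Ties receive their average rank; ranks map to quantiles rank / (n + 1).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf(r / (n + 1)) for r in ranks]


if __name__ == "__main__":
    print(passes_expression_filter([0.2, 0, 0, 0, 0], [0, 10, 0, 0, 0]))
    print(inverse_normal_transform([3.0, 1.0, 2.0]))
```

The transform depends only on ranks, so it is insensitive to the scale of the normalized input values.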

For end-to-end GREML execution through Nextflow + Slurm:

nf/main.nf
nf/nextflow.config
nf/bin/prepare_phenotypes.R
run_greml.sh (submit with sbatch run_greml.sh)

Repository Objects

Core code

scripts/downstream_h2_regulatory_repeat_analysis.R
scripts/downstream_h2_regulatory_repeat_helpers.R
scripts/prepare_gtex_like_expression_phenotype.R
nf/main.nf
nf/bin/prepare_phenotypes.R
run_greml.sh

Documentation

docs/how_to_run_downstream_analysis.md
docs/downstream_analysis_methods.md
docs/how_to_prepare_gtex_like_phenotypes.md
docs/how_to_run_greml_nextflow.md

Inputs

Required user inputs (set in `cfg`)

summary_tsv: GREML heritability summary (Gene, h2_GREML, SE_GREML, Pval_GREML)
shet_xlsx: s_het table (ensg, post_mean; coordinates optional)
tpm_tsv: TPM matrix from Salmon/GEUVADIS processing
gene_gtf: GENCODE hg38 GTF

Optional user overrides

enhancer_bed: enhancer BED
open_bed: open-chromatin BED
repeat_rmsk: UCSC rmsk file
chrom_info_tsv: chromosome size table
roadmap_links_dir: local Roadmap link files (optional)

External defaults used if optional paths are `NULL`

ENCODE SCREEN enhancer BED (GRCh38-cCREs.ELS.bed)
ENCODE SCREEN open BED (GRCh38-cCREs.bed)
UCSC hg38/database/rmsk.txt.gz
UCSC hg38/database/chromInfo.txt.gz
GEUVADIS SDRF URL

Pipeline Steps

1. Load the SDRF and keep European GEUVADIS runs.
2. Load the TPM matrix and apply the expression filter.
   - Default is the GTEx-style TPM filter: TPM >= 0.1 in >=20% of European samples.
3. Merge the heritability summary with the s_het table and build post_mean_bin (1-10).
4. Build gene coordinates (s_het coordinates if available, otherwise GTF), then construct windows:
   - gene-centered: 100kb, 250kb, 500kb, 1mb
   - TSS-centered: 100kb, 250kb, 500kb, 1mb
5. Count enhancer/open-chromatin features in windows.
6. Load RepeatMasker annotations and compute:
   - total repeat counts
   - repeat class counts
   - repeat family counts
   - repeat subtype aliases (repeat_<type>_count_<window>)
7. Apply literature-style repeat filters (class/divergence/length).
8. Simulate a random genomic background (default 100 iterations) using a fixed seed strategy.
9. Generate summaries, models, and plots by post_mean decile.
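
The window construction and feature counting steps above can be illustrated with a small Python sketch. It is not the repository's R implementation. It assumes half-open, BED-style coordinates, that "gene-centered" means the gene body extended by the window size on each side, and that "TSS-centered" means the TSS plus or minus the window size; the function names and the output key pattern are hypothetical.

```python
# Window labels follow the README; flank sizes are in base pairs.
WINDOWS = {"100kb": 100_000, "250kb": 250_000, "500kb": 500_000, "1mb": 1_000_000}


def gene_window(start, end, flank, chrom_len):
    """Gene body extended by `flank` on both sides, clipped to the chromosome."""
    return max(0, start - flank), min(chrom_len, end + flank)


def tss_window(start, end, strand, flank, chrom_len):
    """Window around the TSS (gene start on '+', gene end on '-')."""
    tss = start if strand == "+" else end
    return max(0, tss - flank), min(chrom_len, tss + flank)


def count_overlaps(window, features):
    """Count features (start, end) that overlap the half-open window."""
    w_start, w_end = window
    return sum(1 for f_start, f_end in features if f_start < w_end and f_end > w_start)


def window_counts(start, end, chrom_len, features, prefix):
    """Gene-centered counts for every window size, keyed as <prefix>_count_<window>."""
    return {
        f"{prefix}_count_{label}": count_overlaps(
            gene_window(start, end, flank, chrom_len), features
        )
        for label, flank in WINDOWS.items()
    }


if __name__ == "__main__":
    # Toy features on a single chromosome of length 5 Mb.
    enhancers = [(950_000, 950_500), (1_200_000, 1_200_300), (2_500_000, 2_500_200)]
    print(window_counts(1_000_000, 1_020_000, 5_000_000, enhancers, "enhancer"))
```

A linear scan keeps the sketch short; for genome-scale annotation sets an interval index or sorted-start binary search would be the usual choice.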

Key In-Memory Objects

merged: h2 + s_het + expression-filtered gene table
gene_tbl: gene metadata and window widths
reg_features: enhancer/open-chromatin feature table
repeat_features: total repeat burden features
repeat_class_dt: repeat class burden features
repeat_family_dt: repeat family burden features
analysis_dt: final gene-level merged feature matrix
repeat_filt_observed_summary: observed filtered-repeat decile summaries
repeat_background_iter: per-iteration random-background summaries
repeat_filtered_enrichment: observed vs background enrichment table
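
To make the relationship between the observed summaries, the per-iteration background summaries, and the enrichment table concrete, here is a hedged Python sketch. The README does not state how enrichment is computed, so the ratio (observed over mean background) and the empirical p-value below are assumptions, and the function name is hypothetical.

```python
from statistics import mean


def enrichment(observed, background_iter):
    """Compare one observed summary value against per-iteration background values.

    Assumed definitions (not confirmed by the repository):
      ratio       = observed / mean(background)
      p_empirical = (1 + #background >= observed) / (1 + n_iterations)
    """
    bg_mean = mean(background_iter)
    ratio = observed / bg_mean if bg_mean > 0 else float("nan")
    n_ge = sum(1 for b in background_iter if b >= observed)
    p_empirical = (1 + n_ge) / (1 + len(background_iter))
    return {"ratio": ratio, "bg_mean": bg_mean, "p_empirical": p_empirical}


if __name__ == "__main__":
    # Toy values: mean filtered-repeat count in one decile and window,
    # observed versus four background iterations.
    print(enrichment(12.0, [8.0, 10.0, 12.0, 10.0]))
```

Adding one to the numerator and denominator keeps the empirical p-value strictly positive, which matters when the number of iterations is small.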

Expectations and Sanity Checks

The script emits [Sanity] messages and stops early on failures. Checks include:

input table row counts (SDRF/TPM/s_het/summary/rmsk)
matched European sample count and valid expression thresholds
non-empty merged gene set and decile coverage
non-empty enhancer/open/repeat annotations
positive window widths
final analysis_dt row count matching merged genes
uniqueness of background simulation seeds
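
A fail-fast check of this kind can be sketched in Python as below. The helper names and the exact message format are hypothetical; only the [Sanity] prefix and the stop-on-failure behavior come from the description above.

```python
def sanity(condition, message):
    """Print a [Sanity] line and stop immediately if the check fails."""
    status = "OK" if condition else "FAIL"
    print(f"[Sanity] {status}: {message}")
    if not condition:
        raise RuntimeError(f"Sanity check failed: {message}")


def check_analysis(n_merged, n_analysis, window_widths, seeds):
    """Run a few of the checks listed above on already-computed values."""
    sanity(n_merged > 0, "merged gene set is non-empty")
    sanity(all(w > 0 for w in window_widths), "window widths are positive")
    sanity(n_analysis == n_merged, "analysis_dt row count matches merged genes")
    sanity(len(set(seeds)) == len(seeds), "background simulation seeds are unique")


if __name__ == "__main__":
    check_analysis(
        n_merged=5000,
        n_analysis=5000,
        window_widths=[100_000, 250_000, 500_000, 1_000_000],
        seeds=[101, 102, 103],
    )
```

Raising on the first failure means later checks never run against inputs that are already known to be invalid.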

Typical expectations:

thousands of genes retained after expression filter and merge
10 s_het bins present
non-zero counts in at least some enhancer/open/repeat columns
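
The "10 s_het bins present" expectation can be illustrated with a rank-based decile assignment. This Python sketch is an assumption about how post_mean_bin might be built (equal-sized bins by rank, ties broken by input order); the repository's R code may use a different binning rule, and the function name is hypothetical.

```python
def decile_bins(values, n_bins=10):
    """Assign each value a bin from 1 to n_bins by rank (equal-sized bins)."""
    n = len(values)
    # Stable sort: tied values keep their input order.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // n + 1
    return bins


if __name__ == "__main__":
    # Toy post_mean values for 20 genes.
    post_mean = [0.001 * (i + 1) for i in range(20)]
    bins = decile_bins(post_mean)
    print(bins)
    # Check that every decile from 1 to 10 is represented.
    print(sorted(set(bins)) == list(range(1, 11)))
```

With fewer genes than bins, some deciles are necessarily empty, which is one way the decile coverage check can fail.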

Outputs

All outputs are written under:

results/downstream_h2_regulatory_repeat/

Core tables

gene_level_features.tsv
decile_feature_summary.tsv
spearman_correlations.tsv
model_lm_h2.tsv
model_glm_h2sig.tsv
model_lm_repeat_classes.tsv (when predictors are valid)

Expression-filter metadata

expression_filter_counts.tsv
expression_filter_method.tsv
non_zero_eur_gene_ids.txt

Repeat filtering/background metadata

repeat_filter_criteria.tsv
repeat_filter_set_counts.tsv
chrom_sizes_used.tsv
repeat_background_seed_info.tsv
repeat_background_seed_map.tsv
repeat_filtered_background_iter.tsv
postmean_repeat_filtered_observed_summary.tsv
postmean_repeat_filtered_background_summary.tsv
postmean_repeat_filtered_enrichment.tsv

Decile feature summaries

postmean_bin_feature_summary.tsv
postmean_bin_repeat_type_summary.tsv
postmean_bin_repeat_family_summary.tsv

Plots (`results/downstream_h2_regulatory_repeat/plots/`)

Global trends: h2, regulatory burden, repeat burden
Repeat type trends per window (100kb, 250kb, 500kb, 1mb)
Repeat family heatmaps per window
Filtered-repeat trends per window
Observed-vs-background filtered-repeat plots per window
Background enrichment-ratio plots per window

Reproducibility

Set these in `cfg`:

repeat_bg_seed: base seed for background simulations
repeat_bg_n_iter: number of random iterations

Exact seeds used are recorded in:

repeat_background_seed_map.tsv

Seed strategy:

seed_used = repeat_bg_seed + simulation_index_in_filter_window_loop

If config and inputs are unchanged, reruns produce identical background outputs.
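
The seed strategy can be sketched in Python as follows. This is illustrative only: it assumes the simulation index is a running counter over the filter set, window, and iteration loops, and the filter set names and record fields are hypothetical rather than the actual columns of repeat_background_seed_map.tsv.

```python
import random


def build_seed_map(base_seed, filter_sets, windows, n_iter):
    """Derive one seed per simulation as base_seed + running simulation index."""
    seed_map = []
    index = 0
    for filt in filter_sets:
        for window in windows:
            for iteration in range(1, n_iter + 1):
                index += 1
                seed_map.append(
                    {
                        "filter_set": filt,
                        "window": window,
                        "iteration": iteration,
                        "seed_used": base_seed + index,
                    }
                )
    return seed_map


if __name__ == "__main__":
    seed_map = build_seed_map(
        base_seed=1000,
        filter_sets=["set_a", "set_b"],  # hypothetical filter set names
        windows=["100kb", "250kb"],
        n_iter=3,
    )
    seeds = [row["seed_used"] for row in seed_map]
    # Seeds must be unique across all simulations.
    print(len(seeds) == len(set(seeds)))
    # The same seed reproduces the same random draw.
    print(random.Random(seeds[0]).random() == random.Random(seeds[0]).random())
```

Because each seed is a deterministic function of the base seed and the loop position, changing the number of filter sets, windows, or iterations shifts the seeds of later simulations.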

Run

From repository root:

Rscript scripts/downstream_h2_regulatory_repeat_analysis.R

Notes

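The downstream script's expression filter currently implements only the TPM component of GTEx-style filtering.
(GTEx eQTL preparation also applies a read-count threshold when counts are available; scripts/prepare_gtex_like_expression_phenotype.R applies both.)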
Windows are gene-centered for primary burden analyses; TSS-centered features are also exported for comparison.
