GRIN2
GRIN2 (Genomic Random Interval, version 2) identifies genes that are recurrently hit by genomic lesions — copy-number gains/losses, SNV/indels, fusions, and structural variants — across a cohort, and tests whether that recurrence is greater than expected by chance. It produces a sortable Top Genes table and a genome-wide Manhattan plot.
It is based on the GRIN method: Pounds S, et al. A genomic random interval model for statistical analysis of genomic lesion data. Bioinformatics 2013;29(17):2088–95. doi:10.1093/bioinformatics/btt372.
- Open the GRIN2 chart for a dataset (e.g. from the chart menu in the Mass app).
- Apply a cohort filter to choose which samples to analyze (e.g. a diagnosis or subtype).
- In the controls, check the data types to include (SNV/indel, CNV, Fusion, SV — only those available for the dataset are shown) and set their options (below).
- Click Run GRIN2.
The analysis runs server-side and results are cached, so re-running the same filter + options returns instantly.
- Consequences — which mutation classes to include (missense, frameshift, nonsense, splice, etc.). Defaults to the protein-changing set (plus start-lost/stop-lost). Use Select All / Clear All / Default to adjust. If none are selected, all classes are included.
- MAF filter (if the dataset provides it) — keep only mutations whose variant allele fraction (VAF) passes a threshold (e.g. VAF > 0.1). The filter term can pool allele counts across assays — for the ASH dataset the Tumor DNA term sums WGS + WES read counts; you can also filter on a single assay.
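The pooled VAF filter described above can be sketched as follows. This is a minimal illustration, not the GRIN2 implementation: the `AssayCounts` structure, the function names, and the example read counts are all hypothetical, and the sketch assumes pooling means summing variant-supporting and total reads across assays before dividing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AssayCounts:
    """Read counts for one mutation in one assay (hypothetical structure)."""
    alt: int    # reads supporting the variant allele
    total: int  # all reads covering the site


def pooled_vaf(assays: List[AssayCounts]) -> Optional[float]:
    """Sum counts across assays, then compute a single VAF.

    Returns None when there is no coverage at all, so the caller decides
    how to treat mutations without usable counts.
    """
    alt = sum(a.alt for a in assays)
    total = sum(a.total for a in assays)
    if total == 0:
        return None
    return alt / total


def passes_vaf(assays: List[AssayCounts], threshold: float = 0.1) -> bool:
    """Keep a mutation only when its pooled VAF is strictly above the threshold."""
    vaf = pooled_vaf(assays)
    return vaf is not None and vaf > threshold


# Made-up counts: the variant fails on WGS alone but passes once WES is pooled in.
wgs = AssayCounts(alt=3, total=40)
wes = AssayCounts(alt=12, total=80)
print(passes_vaf([wgs]))        # False (3/40 = 0.075)
print(passes_vaf([wgs, wes]))   # True  (15/120 = 0.125)
```

Pooling before dividing weights each assay by its depth, which is why a single-assay filter and the pooled filter can disagree for the same mutation.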
CNV calls are segment log2 ratios. A segment counts as a lesion when its value crosses a threshold:
- Loss Threshold — a segment is a loss when its log2 ratio ≤ this value (default −0.4).
- Gain Threshold — a segment is a gain when its log2 ratio ≥ this value (default 0.4; a dataset may set its own default, e.g. 0.3).
- Max Segment Length — segments longer than this (in bp) are dropped before analysis (default 2,000,000 = 2 Mb; set to 0 to disable). This prevents a single very large passenger CNV from inflating significance for every gene it spans.
Tightening the thresholds (e.g. −0.1/0.1 → −0.4/0.3) removes low-amplitude calls; the Max Segment Length cap controls broad arm/chromosome-scale events.
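A minimal sketch of how the three CNV options interact, using the defaults listed above. The `Segment` type and `classify_segment` function are hypothetical names for illustration, and the sketch assumes half-open coordinates (length = end − start) and that the length cap is applied before the amplitude thresholds.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    chrom: str
    start: int         # 0-based, half-open coordinates (an assumption)
    end: int
    log2_ratio: float


def classify_segment(
    seg: Segment,
    loss_threshold: float = -0.4,
    gain_threshold: float = 0.4,
    max_segment_length: int = 2_000_000,
) -> Optional[str]:
    """Return "loss", "gain", or None when the segment is not a lesion."""
    length = seg.end - seg.start
    # A cap of 0 disables the length filter.
    if max_segment_length > 0 and length > max_segment_length:
        return None
    if seg.log2_ratio <= loss_threshold:
        return "loss"
    if seg.log2_ratio >= gain_threshold:
        return "gain"
    return None


focal_loss = Segment("chr9", 21_000_000, 21_500_000, -1.2)
arm_gain = Segment("chr1", 0, 5_000_000, 0.8)

print(classify_segment(focal_loss))                      # loss
print(classify_segment(arm_gain))                        # None (longer than 2 Mb)
print(classify_segment(arm_gain, max_segment_length=0))  # gain (cap disabled)
```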
Fusions and SVs are included as lesions when the corresponding data type is checked; they have no per-type cutoffs.
The artifact mask removes genes that sit in known artifact regions before the statistics run, so the table is not dominated by non-driver loci that recur for technical or germline reasons (olfactory-receptor clusters, HLA, FAM90, POTE, GOLGA8, APOBEC3, KANSL1, etc.).
- Exclude artifact genes (checkbox, default on) — toggles the mask.
- Min gene overlap (default 0.5) — a gene is excluded only when at least this fraction of its span lies inside a masked region. The 0.5 default removes genes that sit inside artifact regions while sparing real drivers that merely abut one (e.g. KRAS overlaps a segmental duplication by ~10% and is kept).
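The overlap rule can be illustrated with a short sketch. It assumes the gene and the masked regions are on the same chromosome, uses half-open coordinates, and merges masked regions first so overlapping layers are not counted twice. The function names and example coordinates are hypothetical, not taken from GRIN2.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open, single chromosome


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Union of possibly overlapping intervals, returned sorted and disjoint."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def masked_fraction(gene: Interval, mask: List[Interval]) -> float:
    """Fraction of the gene span that lies inside the merged mask."""
    gene_start, gene_end = gene
    gene_len = gene_end - gene_start
    if gene_len <= 0:
        raise ValueError("gene must have positive length")
    covered = 0
    for start, end in merge_intervals(mask):
        covered += max(0, min(end, gene_end) - max(start, gene_start))
    return covered / gene_len


def is_excluded(gene: Interval, mask: List[Interval], min_overlap: float = 0.5) -> bool:
    """A gene is excluded when at least min_overlap of its span is masked."""
    return masked_fraction(gene, mask) >= min_overlap


mask = [(900, 1100), (1050, 1200), (1900, 3000)]
print(masked_fraction((1000, 2000), mask))  # 0.3, so the gene is kept
print(is_excluded((2000, 3000), mask))      # True, fully inside a masked region
```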
The mask is the union of four hg38 region sets:
| Layer | Source | What it removes |
|---|---|---|
| Blacklist | ENCODE / Kundaje GRCh38 unified blacklist | anomalous-signal / low-mappability regions |
| Segmental duplications | UCSC genomicSuperDups | duplicated, recombination-prone, paralogous regions |
| Assembly gaps | UCSC gap + centromeres | centromeres, telomeres, gaps |
| Common germline CNVs | DGV Gold Standard, frequency ≥ 1% | loci that are copy-number-polymorphic in the normal population |
The germline-CNV layer is essential: loci like OR clusters, HLA class II, and KANSL1 are well-mapped but copy-number-variable in healthy people, so mappability/blacklist alone does not catch them.
When the mask runs, the results panel shows a Region Mask (excluded artifact genes) section reporting how many genes were excluded, a few examples, and the genome fraction masked (~14% with all four layers).
References: ENCODE blacklist — Amemiya et al., Sci Rep 2019, doi:10.1038/s41598-019-45839-z; segmental duplications — Sharp et al., AJHG 2005, doi:10.1086/431652; DGV — MacDonald et al., NAR 2014, doi:10.1093/nar/gkt958; region-exclusion practice — Ogata et al. (excluderanges), Bioinformatics 2023, doi:10.1093/bioinformatics/btad198.
- Max genes to show — number of rows in the Top Genes table (default 500).
- Significance (q-value) threshold — q-values below this (default 0.05) are flagged in the table/tooltips and shown as interactive points in the Manhattan plot.
For each gene and each included lesion type the table reports:
- P-value (Gain/Loss/Mutation/…) — probability of seeing at least the observed number of affected subjects under the random-interval null.
- Q-value (…) — the p-value adjusted for multiple testing (FDR). Use the q-value to judge significance (e.g. q < 0.05).
- Subject Count (…) — number of subjects with that lesion type overlapping the gene.
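For readers who want to see how p-values relate to q-values, here is a minimal sketch of the standard Benjamini–Hochberg step-up adjustment, the FDR correction this page names. It is a generic textbook version; how GRIN2 handles ties or missing values is not shown here, and the function name is hypothetical.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


pvals = [0.01, 0.04, 0.03, 0.005]
print([round(q, 4) for q in benjamini_hochberg(pvals)])  # [0.02, 0.04, 0.04, 0.02]
```

Each q-value depends on how many genes were tested, which is why a p-value alone is not enough to judge significance.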
When more than one lesion type is analyzed, additional N Lesion Types columns appear (1/2/3 …). These are constellation tests that combine evidence across lesion types — e.g. the "2 Lesion Types" q-value flags genes significant when gain+loss (or mutation+CNV) are considered jointly. A gene driven by a single type will be most significant in that type's column; a gene hit by several types will rise in the multi-type columns.
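This page describes the multi-type values as derived from the order statistics of the per-type p-values. One standard construction of that kind is sketched below. It is an assumption for illustration, not a confirmed copy of the GRIN2 code: it treats the per-type p-values as independent and uniform under the null, so the probability that the k-th smallest of m p-values is at most x equals the probability that at least k of m uniforms fall below x.

```python
from math import comb
from typing import List


def constellation_pvalue(pvalues: List[float], k: int) -> float:
    """P-value for "at least k lesion types hit", from the k-th smallest p-value.

    Assumes independent, uniform per-type p-values under the null.
    """
    m = len(pvalues)
    if not 1 <= k <= m:
        raise ValueError("k must be between 1 and the number of lesion types")
    x = sorted(pvalues)[k - 1]  # k-th order statistic
    # P(at least k of m uniforms <= x): a binomial upper tail.
    return sum(comb(m, j) * x**j * (1 - x) ** (m - j) for j in range(k, m + 1))


# Hypothetical gene with moderate evidence in two lesion types.
per_type = [0.01, 0.02]
print(constellation_pvalue(per_type, 1))  # about 0.0199
print(constellation_pvalue(per_type, 2))  # about 0.0004
```

In this example neither per-type p-value is extreme, yet the two-type value is much smaller, which is the behavior the paragraph above describes for genes hit by several lesion types.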
Sort any column by clicking its header. The table is sorted by overall significance by default.
In the Manhattan plot, each point is a gene at its genomic position, and the y-axis is −log₁₀(q-value), so more significant genes sit higher. Colors distinguish lesion types (mutation, gain, loss, fusion, SV). Points whose q-value passes the significance threshold are interactive (hover for gene, type, subject count, q-value). The y-axis auto-scales and caps for very significant peaks.
GRIN models each lesion as an interval placed at a random genomic location, and asks how often a gene would be overlapped by chance. For each gene it computes the probability that k or more subjects are hit (a Bernoulli convolution over per-subject hit probabilities), giving a p-value per lesion type. P-values are converted to q-values by Benjamini–Hochberg FDR correction. Multi-type constellation p-values are derived from the per-type p-value order statistics, so a gene recurrently hit by several lesion types can be significant even if no single type is.
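The "k or more subjects hit" probability is the upper tail of a Poisson-binomial distribution, which can be computed by convolving one Bernoulli variable per subject. The sketch below takes the per-subject hit probabilities as given; how GRIN2 derives them from lesion and gene lengths is not shown, and the function name and example probabilities are hypothetical.

```python
from typing import List


def prob_at_least_k(hit_probs: List[float], k: int) -> float:
    """P(at least k subjects hit) for independent, non-identical Bernoulli trials."""
    # dist[j] = P(exactly j subjects hit) among the subjects processed so far.
    dist = [1.0]
    for p in hit_probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1.0 - p)  # this subject is not hit
            nxt[j + 1] += mass * p      # this subject is hit
        dist = nxt
    k = max(k, 0)  # "at least 0 or fewer" is certain
    return sum(dist[k:])


# Three subjects with different chances of hitting the gene by chance.
print(prob_at_least_k([0.1, 0.2, 0.3], 2))  # about 0.098
```

The convolution costs O(n²) for n subjects per gene, which is manageable at cohort scale; an observed count far in the tail of this distribution gives a small p-value.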
Because the null assumes lesions land uniformly at random, artifact-dense regions (low-mappability, segmental-duplication, germline-CNV loci) accumulate spurious recurrence — which is why the Exclude artifact genes mask is on by default.
- Always read q-values, not p-values, for significance.
- Keep the artifact mask on unless you have a specific reason to inspect raw results; with it off, expect OR/HLA/segdup/germline-CNV genes near the top.
- CNV thresholds and Max Segment Length are the main knobs for controlling how many CNV lesions enter the analysis.
- Genuine drivers with at least 50% of their span inside an artifact region (at the default Min gene overlap of 0.5) are excluded by the mask; check the excluded-gene summary if a gene you expect is missing.
- Results are cached on the analysis inputs (filter + options); changing a display-only setting reuses the cached statistics.
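One way to get the caching behavior described in the last tip is to derive the cache key only from the inputs that affect the statistics. The sketch below is illustrative: the option names, the choice of which settings count as display-only, and the hashing scheme are all assumptions, not the actual GRIN2 server code.

```python
import hashlib
import json

# Hypothetical names for settings assumed not to change the statistics.
DISPLAY_ONLY = {"max_genes_to_show", "significance_threshold"}


def cache_key(cohort_filter: dict, options: dict) -> str:
    """Stable key built from the cohort filter and the analysis-affecting options."""
    analysis_options = {k: v for k, v in options.items() if k not in DISPLAY_ONLY}
    payload = json.dumps(
        {"filter": cohort_filter, "options": analysis_options},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


cohort = {"diagnosis": "B-ALL"}
base = {"loss_threshold": -0.4, "gain_threshold": 0.4, "max_genes_to_show": 500}

same_stats = dict(base, max_genes_to_show=100)  # display-only change
new_stats = dict(base, loss_threshold=-0.2)     # analysis change

print(cache_key(cohort, base) == cache_key(cohort, same_stats))  # True
print(cache_key(cohort, base) == cache_key(cohort, new_stats))   # False
```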