diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv
index 8e141d97..5adeb527 100644
--- a/assets/samplesheet.csv
+++ b/assets/samplesheet.csv
@@ -1 +1,4 @@
-id,subject_name,sample_name,sample_type,filetype,filepath
+id,subject_name,sample_name,filetype,filepath
+subject_a.example,subject_a,sample_germline,dragen_germline_dir,/path/to/dragen_germline/
+subject_a.example,subject_a,sample_somatic,dragen_somatic_dir,/path/to/dragen_somatic/
+subject_a.example,subject_a,sample_somatic,oncoanalyser_dir,/path/to/oncoanalyser/
diff --git a/docs/README.md b/docs/README.md
index 5ec0f466..c03d035a 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,8 +1,8 @@
-# umccr/sash: Documentation
-
-The umccr/sash documentation is split into the following pages:
-
+- [Details](details.md)
+  - An in-depth description of the pipeline steps
 - [Usage](usage.md)
   - An overview of how the pipeline works, how to run it and a description of all of the different command-line flags.
 - [Output](output.md)
   - An overview of the different results produced by the pipeline and how to interpret them.
+- [Architectural decision record (ADR)](adr.md)
+  - Describes significant decisions the team has made about the pipeline's software architecture
\ No newline at end of file
diff --git a/docs/adr.md b/docs/adr.md
new file mode 100644
index 00000000..b1a7ea88
--- /dev/null
+++ b/docs/adr.md
@@ -0,0 +1,44 @@
+# ADR #1: Implement VCF Chunking and Parallelization in Sash Workflow for PCGR Processing
+
+**Status**: In Progress
+**Date**: 2024-11-07
+**Deciders**: Oliver Hofmann, Stephen Watts, Quentin Clayssen
+**Technical Story**: Addresses the limitations of PCGR in handling large variant datasets within the sash workflow, which specifically impact hypermutated samples.
+
+## Context
+[PCGR](https://sigven.github.io/pcgr/) (Personal Cancer Genome Reporter) currently has a variant processing limit of 500,000 variants per run.
In the sash workflow, hypermutated samples often exceed this variant limit. PCGR has its own filtering steps, but an additional filtering step was also introduced in Bolt. By using VCF chunking and parallel processing, we can ensure that these large datasets are analyzed effectively without exceeding the PCGR variant limit, leading to more complete annotation and a more scalable pipeline.
+
+## Decision
+To address the limitations of PCGR when handling hypermutated samples, we WILL implement the following:
+
+1. **Split VCF Files into Chunks**: Input VCF files MUST be divided into chunks, each containing no more than 500,000 variants. This ensures that each chunk remains within PCGR’s processing capacity.
+
+2. **Parallelize Processing**: Each chunk MUST be processed concurrently through PCGR to optimize processing time. The annotated outputs from all chunks MUST be merged to create a unified dataset.
+
+3. **Integrate into Bolt Annotation**: The chunking and parallelization changes MUST be implemented in the Bolt annotation module to ensure seamless and scalable processing for large variant datasets.
+
+4. **Efficiency Consideration**: For now, there MAY be a loss of efficiency for larger variant sets due to the fixed resources allocated for annotation. Further resource adjustments SHOULD be evaluated in the future.
+
+## Consequences
+
+### Positive Consequences
+- **Improved Efficiency**: This approach allows large variant datasets to be processed within PCGR's constraints, enhancing efficiency and ensuring more comprehensive analysis.
+- **Scalability**: Chunking and parallel processing make the sash workflow more scalable for hypermutated samples, accommodating larger datasets.
+
+### Negative Consequences
+- **Complexity**: Adding chunking and merging processes WILL increase complexity in data handling and in ensuring integrity across the merged data.
+- **Resource Demand**: Parallel processing MAY increase resource consumption, affecting system performance and requiring further resource management.
+
+## Remaining Challenges
+While the proposed approach mitigates the current limitations of PCGR, it MAY not fully resolve the issues for hypermutated samples with exceptionally high variant counts. Additional solutions MUST be explored, such as:
+
+- **Additional Filtering Criteria**: Applying additional filters to reduce the variant count where applicable.
+- **Alternative Reporting Methods**: Exploring more scalable reporting approaches that COULD handle higher variant loads.
+
+## Status
+**Status**: In Progress
+
+## Links
+- [Related PR for VCF Chunking and Parallelization Implementation](https://github.com/scwatts/bolt/pull/2)
+- [PCGR Documentation on Variant Limit](https://sigven.github.io/pcgr/articles/running.html#large-input-sets-vcf)
+- Discussion on Hypermutated Samples Handling
diff --git a/docs/details.md b/docs/details.md
new file mode 100644
index 00000000..ea8e1457
--- /dev/null
+++ b/docs/details.md
@@ -0,0 +1,582 @@
+# sash workflow details
+
+## Table of Contents
+- [Overview](#overview)
+- [HMFtools](#hmftools)
+- [Other Tools](#other-tools)
+- [Pipeline Inputs](#pipeline-inputs)
+- [Workflows](#workflows)
+  - [Somatic Small Variants](#somatic-small-variants)
+  - [Somatic Structural Variants](#somatic-structural-variants)
+  - [Germline Small Variants](#germline-small-variants)
+- [Common Reports](#common-reports)
+- [Coverage](#coverage)
+- [Reference Data](#reference-data)
+- [sash Module Outputs](#sash-module-outputs)
+- [FAQ](#faq)
+
+## Overview
+
+![Summary](images/sash_overview_qc.png)
+
+The sash workflow is a genomic analysis framework comprising three primary pipelines:
+
+- Somatic Small Variants (SNV somatic): Detects single nucleotide variants (SNVs) and indels in tumor samples, emphasizing clinical relevance.
+- Somatic Structural Variants (SV somatic): Identifies large-scale genomic alterations (deletions, duplications, etc.) and integrates copy number data. +- Germline Variants (SNV germline): Focuses on inherited variants linked to cancer predisposition. + +These pipelines utilize Bolt (a Python package designed for modular processing) and leverage outputs from the [DRAGEN](https://sapac.illumina.com/products/by-type/informatics-products/dragen-secondary-analysis.html) Variant Caller alongside the [Hartwig Medical Foundation (HMF) tools](https://github.com/hartwigmedical/hmftools/tree/master) integrated via [Oncoanalyser](https://github.com/nf-core/oncoanalyser). Each pipeline is tailored to a specific type of genomic variant, incorporating filtering, annotation and HTML reports for research and curation. + +--- + +## HMFtools + +HMFtools is an open-source suite for cancer genomics developed by the Hartwig Medical Foundation. Key components used in sash include: + +- [SAGE (Somatic Alterations in Genome)](https://github.com/hartwigmedical/hmftools/blob/master/sage/README.md): + A tiered SNV/indel caller targeting cancer hotspots from databases including [Cancer Genome Interpreter](https://www.cancergenomeinterpreter.org/home), [CIViC](http://civic.genome.wustl.edu/), and [OncoKB](https://oncokb.org/) to recover low-frequency variants missed by DRAGEN. Outputs a VCF with confidence tiers (hotspot, panel, high/low confidence). + +- [PURPLE](https://github.com/hartwigmedical/hmftools/tree/master/purple): + Estimates tumor purity (tumor cell fraction) and ploidy (average copy number), integrates copy number data, and calculates TMB (tumor mutation burden) and MSI (microsatellite instability). + +- [Cobalt](https://github.com/hartwigmedical/hmftools/blob/master/cobalt/README.md): + Calculates read-depth ratios from sequencing data, providing essential input for copy number analysis. 
Its outputs are used by PURPLE to generate accurate copy number profiles across the genome. + +- [Amber](https://github.com/hartwigmedical/hmftools/blob/master/amber/README.md): + Computes B-allele frequencies, which are critical for estimating tumor purity and ploidy. The Amber directory contains these measurements, supporting PURPLE's analysis. + +--- + +## Other Tools + +### [SIGRAP](https://github.com/umccr/sigrap) +A framework for running PCGR and other genomic reporting tools. + +### [Personal Cancer Genome Reporter (PCGR)](https://github.com/sigven/pcgr/tree/v1.4.1) +Tool for comprehensive clinical interpretation of somatic variants, providing tiered classifications and extensive annotation. + +### [Cancer Predisposition Sequencing Report (CPSR)](https://github.com/sigven/cpsr) +Tool for predisposition variant analysis and reporting in germline samples. + +### [Genomics Platform Group Reporting (GPGR)](https://github.com/umccr/gpgr) +UMCCR-developed R package for generating cancer genomics reports. + +### [Linx](https://github.com/hartwigmedical/hmftools/tree/master/linx) +Tool for structural variant annotation and visualization to classify complex rearrangements. + +### [GRIDSS/GRIPSS](https://github.com/PapenfussLab/gridss) +Structural variant caller (GRIDSS) and accompanying filtering tool (GRIPSS) for high-confidence SV detection. + +### [VIRUSBreakend](https://github.com/PapenfussLab/gridss/blob/master/VIRUSBreakend_Readme.md) +Tool for detecting viral integration events in human genome sequencing data. + +--- + +## Pipeline Inputs + +### DRAGEN +- `{tumor_id}.hard-filtered.vcf.gz`: Somatic variant calls from DRAGEN pipeline. + +### Oncoanalyser + +#### [GRIDSS/GRIPSS](https://github.com/PapenfussLab/gridss) +- `{tumor_id}.gridss.vcf.gz`: VCF containing structural variant calls produced by GRIDSS2. + +#### [SAGE](https://github.com/hartwigmedical/hmftools/blob/master/sage/README.md) +- `{tumor_id}.sage.somatic.vcf.gz`: Somatic SNV/indel calls from SAGE. 
+ +#### [VIRUSBreakend](https://github.com/PapenfussLab/gridss/blob/master/VIRUSBreakend_Readme.md) +- Directory: `virusbreakend/`: Contains outputs from VIRUSBreakend, used for detecting viral integration events. + +#### [Cobalt](https://github.com/hartwigmedical/hmftools/blob/master/cobalt/README.md) +- Directory: `cobalt/`: Contains read-depth ratio data required for copy number analysis by PURPLE. + +#### [Amber](https://github.com/hartwigmedical/hmftools/blob/master/amber/README.md) +- Directory: `amber/`: Contains B-allele frequency measurements used by PURPLE to estimate tumor purity and ploidy. + +--- + +## Workflows + +### Somatic Small Variants + +#### General +In the Somatic Small Variants workflow, variant detection is performed using the DRAGEN Variant Caller and Oncoanalyser (relying on SAGE and PURPLE outputs). It's structured into four steps: Re-calling, Annotation, Filter, and Report. The final outputs include an HTML report summarizing the results. + +#### Summary +1. Re-calling SAGE variants to recover low-frequency mutations in hotspots. +2. Annotate variants with clinical and functional information using PCGR. +3. Filter variants based on quality and frequency criteria, while retaining those of potential clinical significance. +4. Generate comprehensive HTML reports (PCGR, Cancer Report, LINX, MultiQC). + +### Variant Calling Re-calling + +The variant calling re-calling step uses variants from [SAGE](https://github.com/hartwigmedical/hmftools/tree/master/sage), which is more sensitive than DRAGEN in detecting variants, particularly those with low allele frequency. SAGE focuses on cancer hotspots, prioritizing predefined genomic regions of high clinical or biological relevance with its [filtering system](https://github.com/hartwigmedical/hmftools/tree/master/sage#6-soft-filters). This enables the re-calling of biologically significant variants that may have been missed otherwise. 
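The first re-calling step below compares the input VCF against the SAGE VCF to identify overlapping and unique variants. As a minimal sketch of that comparison, the following uses a hypothetical dict-based record shape keyed by `(chrom, pos, ref, alt)`; a real implementation would parse VCF records with a library such as pysam rather than use dicts.

```python
# Sketch: partition SAGE calls into those shared with the input VCF and novel
# ones, using (chrom, pos, ref, alt) as the comparison key. The record shape
# here is illustrative, not the actual Bolt data model.

def variant_key(record):
    """Build a comparison key from a minimal variant record (hypothetical shape)."""
    return (record["chrom"], record["pos"], record["ref"], record["alt"])

def partition_variants(input_records, sage_records):
    """Split SAGE calls into overlapping (shared) and unique (novel) sets."""
    input_keys = {variant_key(r) for r in input_records}
    shared = [r for r in sage_records if variant_key(r) in input_keys]
    novel = [r for r in sage_records if variant_key(r) not in input_keys]
    return shared, novel
```

Shared calls feed the annotation of existing variants, while novel calls are merged in as new records.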
+
+#### Inputs
+- From DRAGEN: Somatic small variant caller VCF
+  - `${tumor_id}.main.dragen.vcf.gz`
+- From Oncoanalyser: SAGE VCF
+  - `${tumor_id}.main.sage.filtered.vcf.gz`
+
+  Filtered on chromosomes 1-22, X, Y, and M.
+
+#### Output
+- Re-calling: VCF
+  - `${tumor_id}.rescued.vcf.gz`
+
+#### Steps
+1. Select High-Confidence SAGE Calls in Hotspot Regions:
+   - Filter the SAGE output to retain only variants that pass quality filters and overlap with known hotspot regions.
+   - Compare the input VCF and the SAGE VCF to identify overlapping and unique variants.
+2. Annotate input VCF variants that are also present in the SAGE calls:
+   - For each variant in the input VCF, check whether it is also present in the SAGE calls.
+   - For each variant found in both call sets:
+     - If `SAGE FILTER=PASS` and input VCF `FILTER=PASS`:
+       - Set `INFO/SAGE_HOTSPOT` to indicate the variant is called by SAGE in a hotspot.
+     - If `SAGE FILTER=PASS` and input VCF `FILTER` is not `PASS`:
+       - Set `INFO/SAGE_HOTSPOT` and `INFO/SAGE_RESCUE` to indicate the variant is re-called from SAGE.
+       - Update `FILTER=PASS` to include the variant in the final analysis.
+     - If `SAGE FILTER` is not `PASS`:
+       - Append `SAGE_lowconf` to the `FILTER` field to flag low-confidence variants.
+   - Transfer SAGE `FORMAT` fields to the input VCF with a `SAGE_` prefix.
+3. Combine the annotated input VCF with novel SAGE calls:
+   - Prepare novel SAGE calls. For each variant in the SAGE VCF missing from the input VCF:
+     - Rename certain `FORMAT` fields in the novel SAGE VCF to avoid namespace collisions:
+       - For example, `FORMAT/SB` is renamed to `FORMAT/SAGE_SB`.
+   - Retain necessary `INFO` and `FORMAT` annotations while removing others to streamline the data.
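The decision table in step 2 can be sketched as follows. This is a simplified model, not the Bolt implementation: records are plain dicts standing in for VCF records, while the `SAGE_HOTSPOT`, `SAGE_RESCUE`, and `SAGE_lowconf` field names follow the step description above.

```python
# Sketch of the SAGE re-calling decision table for a variant present in both
# the input VCF and the SAGE calls. `record` is a simplified stand-in for a
# VCF record; `sage_filter` is the SAGE FILTER value for the same variant.

def annotate_rescue(record, sage_filter):
    """Apply SAGE rescue annotations to one input-VCF record (simplified model)."""
    info = set(record.get("info", []))
    filters = list(record.get("filter", []))
    if sage_filter == "PASS":
        info.add("SAGE_HOTSPOT")
        if filters != ["PASS"]:
            # Failed the input caller but passes SAGE in a hotspot:
            # rescue the variant into the final analysis.
            info.add("SAGE_RESCUE")
            filters = ["PASS"]
    else:
        # Low-confidence SAGE call: append a flag rather than rescue.
        filters = filters + ["SAGE_lowconf"]
    record["info"] = sorted(info)
    record["filter"] = filters
    return record
```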
+ +### Annotation + +The Annotation process employs Reference Sources (GA4GH/GIAB problem region stratifications, GIAB high confidence regions, gnomAD, Hartwig hotspots), UMCCR panel of normals (built from approximately 200 normal samples), and the PCGR tool to enrich variants with [classification](https://sigven.github.io/pcgr/articles/variant_classification.html) and clinical information. +**These annotations are used to decide which variants are retained or filtered in the next step.** + +#### Inputs +- Small variant VCF + - `${tumor_id}.rescued.vcf.gz` + +#### Output +- Annotated VCF + - `${tumor_id}.annotations.vcf.gz` + +#### Steps +1. Set FILTER to "PASS" for unfiltered variants: + - Iterate over the input VCF file and set the `FILTER` field to `PASS` for any variants that currently have no filter status (`FILTER` is `.` or `None`). +2. Annotate the VCF against reference sources: + - Use vcfanno to add annotations to the VCF file: + - gnomAD (version 2.1) + - Hartwig Hotspots + - ENCODE Blacklist + - Genome in a Bottle High-Confidence Regions (v4.2.1) + - Low and High GC Regions (< 30% or > 65% GC content, compiled by GA4GH) + - Bad Promoter Regions (compiled by GA4GH) +3. Annotate with UMCCR panel of normals counts: + - Use vcfanno and bcftools to annotate the VCF with counts from the UMCCR panel of normals. +4. Standardize the VCF fields: + - Add new `INFO` fields for use with PCGR: + - `TUMOR_AF`, `NORMAL_AF`: Tumor and normal allele frequencies. + - `TUMOR_DP`, `NORMAL_DP`: Tumor and normal read depths. + - Add the `AD` FORMAT field: + - `AD`: Allelic depths for the reference and alternate alleles. +5. Prepare VCF for PCGR annotation: + - Make minimal VCF header keeping only INFO AF/DP, and contigs size. + - Move tumor and normal `FORMAT/AF` and `FORMAT/DP` annotations to the `INFO` field as required by PCGR. + - Set `FILTER` to `PASS` and remove all `FORMAT` and sample columns. +6. 
Run PCGR (v1.4.1) to annotate VCF against external sources: + - Classify variants by tiers based on annotations and functional impact according to AMP/ASCO/CAP guidelines. + - Add `INFO` fields into the VCF: `TIER`, `SYMBOL`, `CONSEQUENCE`, `MUTATION_HOTSPOT`, `TCGA_PANCANCER_COUNT`, `CLINVAR_CLNSIG`, `ICGC_PCAWG_HITS`, `COSMIC_CNT`. + - External sources include VEP, ClinVar, COSMIC, TCGA, ICGC, Open Targets Platform, CancerMine, DoCM, CBMDB, DisGeNET, Cancer Hotspots, dbNSFP, UniProt/SwissProt, Pfam, DGIdb, and ChEMBL. +7. Transfer PCGR annotations to the full set of variants: + - Merge the PCGR annotations back into the original VCF file. + - Ensure that all variants, including those not selected for PCGR annotation, have relevant clinical annotations where available. + - Preserve the `FILTER` statuses and other annotations from the original VCF. + +### Filter + +The Filter step applies a series of stringent filters to somatic variant calls in the VCF file, ensuring the retention of high-confidence and biologically meaningful variants. 
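The interaction between the hard filters and the clinical-significance exceptions listed in the tables of this section can be sketched as below. Field names are illustrative (not Bolt's actual data model); thresholds follow the tables, and a variant tripping any hard filter is still retained if an exception applies.

```python
# Sketch: a variant is dropped when it trips any hard filter, unless a
# clinical-significance exception rescues it. Record fields are illustrative.

def has_clinical_exception(v):
    """Clinical significance exceptions that override the hard filters."""
    return (
        v.get("cosmic_count", 0) >= 10
        or v.get("tcga_pancancer_count", 0) >= 5
        or v.get("icgc_pcawg_count", 0) >= 5
        or v.get("clinvar") in {"conflicting_interpretations_of_pathogenicity",
                                "likely_pathogenic", "pathogenic",
                                "uncertain_significance"}
        or v.get("hotspot", False)          # HMF/PCGR/SAGE hotspot annotations
        or v.get("tier") in {"TIER_1", "TIER_2"}
    )

def passes_filters(v):
    """Hard filters; the AD threshold tightens in low-complexity regions."""
    min_ad = 6 if v.get("low_complexity", False) else 4
    failed = (
        v.get("tumor_af", 0.0) < 0.10       # allele frequency filter
        or v.get("tumor_ad", 0) < min_ad    # allele depth filter
        or v.get("gnomad_af", 0.0) >= 0.01  # population frequency filter
        or v.get("pon_count", 0) >= 5       # panel-of-normals germline filter
        or v.get("pon_af", 0.0) > 0.20
        or v.get("blacklist_region", False) # problematic genomic regions
    )
    return not failed or has_clinical_exception(v)
```

For example, a variant with tumor AF of 5% would normally be dropped, but a `TIER_1` classification rescues it.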
+
+#### Inputs
+- Annotated VCF
+  - `${tumor_id}.annotations.vcf.gz`
+
+#### Output
+- Filtered VCF
+  - `${tumor_id}*filters_set.vcf.gz`
+
+#### Filters
+
+Variants matching any of the following criteria are filtered out unless they qualify for [Clinical Significance Exceptions](#clinical-significance-exceptions):
+
+| **Filter Type**                           | **Threshold/Criteria**                         |
+|-------------------------------------------|------------------------------------------------|
+| **Allele Frequency (AF) Filter**          | Tumor AF < 10% (0.10)                          |
+| **Allele Depth (AD) Filter**              | Fewer than 4 supporting reads (6 in low-complexity regions) |
+| **Non-GIAB AD Filter**                    | Stricter thresholds outside GIAB high-confidence regions |
+| **Problematic Genomic Regions Filter**    | Overlap with ENCODE blacklist, bad promoter, or low-complexity regions |
+| **Population Frequency (gnomAD) Filter**  | gnomAD AF ≥ 1% (0.01)                          |
+| **Panel of Normals (PoN) Germline Filter**| Present in ≥ 5 normal samples or PoN AF > 20% (0.20) |
+
+#### Clinical Significance Exceptions
+
+| Exception Category                | Criteria                                                                 |
+|-----------------------------------|-------------------------------------------------------------------------|
+| **Reference Database Hit Count**  | COSMIC count ≥10 OR TCGA pan-cancer count ≥5 OR ICGC PCAWG count ≥5      |
+| **ClinVar Pathogenicity**         | ClinVar classification of `conflicting_interpretations_of_pathogenicity`, `likely_pathogenic`, `pathogenic`, or `uncertain_significance` |
+| **Mutation Hotspots**             | Annotated as `HMF_HOTSPOT`, `PCGR_MUTATION_HOTSPOT`, or SAGE Hotspots (CGI, CIViC, OncoKB) |
+| **PCGR Tier Exception**           | Classified as `TIER_1` OR `TIER_2`                                       |
+
+### Reports
+
+The Report step utilizes the Personal Cancer Genome Reporter (PCGR) and other tools to generate comprehensive reports.
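One of the report-preparation steps in this section compares variant counts between two variant sets (the DRAGEN VCF and the filtered BOLT VCF). A minimal sketch of that tally, using an illustrative dict-based record shape rather than real VCF records:

```python
# Sketch: count passing variants by type (SNP, Indel, Other), as in the
# DRAGEN vs. BOLT variant-count comparison. Record shape is illustrative.
from collections import Counter

def classify(ref, alt):
    """Classify a variant as SNP, Indel, or Other from its alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "Indel"
    return "Other"

def count_passing(records):
    """Tally variants with FILTER=PASS by type."""
    return Counter(
        classify(r["ref"], r["alt"])
        for r in records
        if r.get("filter") == "PASS"
    )
```

Running this tally on both call sets gives the per-type counts compared in the report.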
+
+#### Inputs
+- Purple purity data
+- Filtered VCF
+  - `${tumor_id}*filters_set.vcf.gz`
+- DRAGEN VCF
+  - `${tumor_id}.main.dragen.vcf.gz`
+
+#### Output
+- PCGR Cancer report
+  - `${tumor_id}.pcgr_acmg.grch38.html`
+
+#### Steps
+1. Generate BCFtools Statistics on the Input VCF:
+   - Run `bcftools stats` to gather statistics on variant quality and distribution.
+2. Calculate Allele Frequency Distributions:
+   - Filter and normalize variants according to high-confidence regions.
+   - Extract allele frequency data from tumor samples.
+   - Produce both a global allele frequency summary and a subset of allele frequencies restricted to key cancer genes.
+3. Compare Variant Counts From Two Variant Sets (DRAGEN vs. BOLT):
+   - Count the total number and types of variants (SNPs, Indels, Others) passing filters in both the DRAGEN VCF and the Filtered BOLT VCF.
+4. Count Variants by Processing Stage.
+5. Parse Purity and Ploidy Information (Purple Data).
+6. Run PCGR to generate the final report.
+
+### Somatic Structural Variants
+
+The Somatic Structural Variants (SVs) pipeline identifies and annotates large-scale genomic alterations, including deletions, duplications, inversions, insertions, and translocations in tumor samples. It takes outputs from the DRAGEN Variant Caller and GRIDSS2, applies filtering criteria via GRIPSS and PURPLE, and prioritizes clinically significant structural variants.
+
+#### Summary
+1. GRIPSS filtering:
+   - Refines the structural variant calls using read counts, panel-of-normals, known fusion hotspots, and repeat masker annotations.
+2. PURPLE:
+   - Combines the GRIPSS-filtered SV calls with copy number variation (CNV) data and tumor purity/ploidy estimates.
+3. Annotation:
+   - Combines SV calls with CNV data and annotates using [SnpEff](https://github.com/pcingola/SnpEff).
+4. Prioritization:
+   - Prioritizes SV annotations based on [AstraZeneca-NGS](https://github.com/AstraZeneca-NGS/simple_sv_annotation) using curated reference data.
+5. 
Report: + - Generates cancer report and MultiQC output. + +#### Inputs +- GRIDSS2 + - `${tumor_id}.gridss.vcf.gz` + +#### Steps +1. GRIPSS filtering: + - Evaluate split-read and paired-end support; discard variants with low support. + - Apply panel-of-normals filtering to remove artifacts observed in normal samples. + - Retain variants overlapping known oncogenic fusion hotspots (using UMCCR-curated lists). + - Exclude variants in repetitive regions based on Repeat Masker annotations. +2. PURPLE: + - Merge SV calls with CNV segmentation data. + - Estimate tumor purity and ploidy. + - Adjust SV breakpoints based on copy number transitions. + - Classify SVs as somatic or germline. +3. Annotation: + - Compile SV and CNV information into a unified VCF file. + - Extend the VCF header with PURPLE-related INFO fields (e.g., PURPLE_baf, PURPLE_copyNumber). + - Convert CNV records from TSV format into VCF records with appropriate SVTYPE tags (e.g., 'DUP' for duplications, 'DEL' for deletions). + - Run SnpEff to annotate the unified VCF with functional information such as gene names, transcript effects, and coding consequences. +4. Prioritization: + - Run the prioritization module (forked from the AstraZeneca simple_sv_annotation tool) using reference data files including known fusion pairs, known fusion 5′ and 3′ lists, key genes, and key tumor suppressor genes. + - Classify Variants: + - Structural Variants (SVs): Variants labeled with the source `sv_gridss`. + - Copy Number Variants (CNVs): Variants labeled with the source `cnv_purple`. +5. 
Prioritize variants on a 4-tier system using [prioritize_sv](https://github.com/umccr/vcf_stuff/blob/master/scripts/prioritize_sv.):
+   - **1 (high)**
+   - **2 (moderate)**
+   - **3 (low)**
+   - **4 (no interest)**
+   - Exon loss:
+     - On cancer gene list (1)
+     - Other (2)
+   - Gene fusion:
+     - Paired (hits two genes):
+       - On list of known pairs (1) (curated by [HMF](https://resources.hartwigmedicalfoundation.nl))
+       - One gene is a known promiscuous fusion gene (1) (curated by [HMF](https://resources.hartwigmedicalfoundation.nl))
+       - On list of [FusionCatcher](https://github.com/ndaniel/fusioncatcher/blob/master/bin/generate_known.py) known pairs (2)
+       - Other:
+         - One or two genes on cancer gene list (2)
+         - Neither gene on cancer gene list (3)
+     - Unpaired (hits one gene):
+       - On cancer gene list (2)
+       - Others (3)
+   - Upstream or downstream: A specific type of fusion where one gene comes under the control of another gene's promoter, potentially leading to overexpression (oncogene) or underexpression (tumor suppressor gene):
+     - On cancer gene list genes (2)
+   - LoF or HIGH impact in a tumor suppressor:
+     - On cancer gene list (2)
+     - Other TS gene (3)
+   - Other (4)
+6. Filter Low-Quality Calls:
+   - Apply Quality Filters:
+     - Keep variants with sufficient read support (e.g., split reads (SR) ≥ 5 and paired reads (PR) ≥ 5).
+     - Exclude Tier 3 and Tier 4 variants where `SR < 5` and `PR < 5`.
+     - Exclude Tier 3 and Tier 4 variants where `SR < 10`, `PR < 10`, and allele frequencies (`AF0` or `AF1`) are below 0.1.
+7. Report:
+   - Generate MultiQC and cancer report outputs.
+
+### Germline Small Variants
+
+Filtering selects passing variants within the given [gene panel transcript regions](https://github.com/umccr/gene_panels/tree/main/germline_panel) (built from the PMCC familial cancer clinic list), then a CPSR report is generated.
+
+#### Inputs
+- DRAGEN VCF
+  - `${normal_id}.hard-filtered.vcf.gz`
+
+#### Output
+- CPSR report
+  - `${normal_id}.cpsr.grch38.html`
+
+#### Steps
+1. 
Prepare: + - Selection of Passing Variants: + - Raw germline variant calls from DRAGEN are filtered to retain only those variants marked as PASS (or with no filter flag). + - Selection of Gene Panel Variants: + - The filtered variants are further restricted to regions defined by the [gene panel transcript regions file](https://github.com/umccr/gene_panels/tree/main/germline_panel), based on the PMCC familial cancer clinic list. +2. Report: + - Generate CPSR (Cancer Predisposition Sequencing Report) summarizing germline findings. + +--- + +## Common Reports + +### [Cancer Report](https://umccr.github.io/gpgr/) + +UMCCR cancer report containing: + +#### Tumor Mutation Burden (TMB) +- Data Source: filtered somatic VCF +- Tool: PURPLE + +#### Mutational Signatures +- Data Source: filtered SNV/CNV VCF +- Tool: MutationalPatterns R package (via PCGR) + +#### Contamination Score + +- Data Source: – +- Note: No dedicated contamination metric is currently generated + +#### Purity & Ploidy +- Data Source: COBALT (providing read-depth ratios) and AMBER (providing B-allele frequency measurements) +- Tool: PURPLE, which uses these inputs to compute sample purity (percentage of tumor cells) and overall ploidy (average copy number) + +#### HRD Score +- Data Source: HRD analysis output file (${tumor_id}.hrdscore.tsv) +- Tool: DRAGEN + +#### MSI (Microsatellite Instability) +- Data Source: Indels in microsatellite regions from SNV/CNV +- Tool: PURPLE + +#### Structural Variant Metrics +- Data Source: GRIDSS/GRIPSS SV VCF and PURPLE CNV segmentation +- Tools: GRIDSS/GRIPSS and PURPLE + +#### Copy Number Metrics (Segments, Deleted Genes, etc.) 
+- Data Source: PURPLE CNV outputs (segmentation files, gene-level CNV TSV)
+- Tool: PURPLE
+
+### LINX
+
+The LINX report includes the following:
+- Tables of Variants:
+  - Breakends
+  - Links
+  - Driver Catalog
+- Plots:
+  - Cluster-Level Plots
+
+### MultiQC
+
+General Stats: Overview of QC metrics aggregated from all tools, providing high-level sample quality information.
+
+DRAGEN: Mapping metrics (mapped reads, paired reads, duplicated alignments, secondary alignments), WGS coverage (average depth, cumulative coverage, per-contig coverage), fragment length distributions, trimming metrics, and time metrics for pipeline steps.
+
+PURPLE: Sample QC status (PASS/FAIL), ploidy, tumor purity, polyclonality percentage, tumor mutational burden (TMB), microsatellite instability (MSI) status, and variant metrics for somatic and germline SNPs/indels.
+
+BCFtools Stats: Variant substitution types, SNP and indel counts, quality scores, variant depth, and allele frequency metrics for both somatic and germline variants.
+
+DRAGEN-FastQC: Per-base sequence quality, per-sequence quality scores, GC content (per-sequence and per-position), HRD score, sequence length distributions, adapter contamination, and sequence duplication levels.
+
+### PCGR
+
+The Personal Cancer Genome Reporter (PCGR) generates a comprehensive, interactive HTML report that consolidates filtered and annotated variant data, providing detailed insights into the somatic variants identified.
+
+Key Metrics:
+
+- Variant Classification and Tier Distribution: PCGR categorizes variants into tiers based on their clinical and biological significance. The report details the proportion of variants across different tiers, indicating their potential clinical relevance.
+- Mutational Signatures: The report includes analysis of mutational signatures, offering insights into the mutational processes active in the tumor.
+- Copy Number Alterations (CNAs): Visual representations of CNAs are provided, highlighting significant gains and losses across the genome. Genome-wide plots display regions of copy number gains and losses. +- Tumor Mutational Burden (TMB): Calculations of TMB are included, which can have implications for immunotherapy eligibility. The report presents the TMB value, representing the number of mutations per megabase. +- Microsatellite Instability (MSI) Status: Assessment of MSI status is performed, relevant for certain cancer types and treatment decisions. +- Clinical Trials Information: Information on relevant clinical trials is incorporated, offering potential therapeutic options based on the identified variants. + +Note: The PCGR tool is designed to process a maximum of 500,000 variants. If the input VCF file contains more than this limit, variants exceeding 500,000 will be filtered out. + +### CPSR Report + +The CPSR (Cancer Predisposition Sequencing Report) includes the following: + +Settings: +- Sample metadata +- Report configuration +- Virtual gene panel + +Summary of Findings: +- Variant statistics + +Variant Classification: + +ClinVar and Non-ClinVar variants: +- Class 5 - Pathogenic variants +- Class 4 - Likely Pathogenic variants +- Class 3 - Variants of Uncertain Significance (VUS) +- Class 2 - Likely Benign variants +- Class 1 - Benign variants +- Biomarkers + +PCGR TIER according to [ACMG](https://www.ncbi.nlm.nih.gov/pubmed/27993330): +- Tier 1 (High): Highest priority variants with strong clinical relevance. +- Tier 2 (Moderate): Variants with potential clinical significance. +- Tier 3 (Low): Variants with uncertain significance. +- Tier 4 (No Interest): Variants unlikely to be clinically relevant. + +--- + +## Coverage + +The sash workflow utilizes coverage metrics from DRAGEN to evaluate the sequencing quality and depth across target regions. 
Coverage analysis includes: + +- Mean coverage across targeted genomic regions +- Percentage of target regions covered at various depth thresholds (10X, 20X, 50X, 100X) +- Coverage uniformity metrics +- Gap analysis for regions with insufficient coverage + +These metrics are integrated into the MultiQC report, providing a comprehensive overview of sequencing quality and coverage. + +--- + +## Reference Data + +### [UMCCR Gene Panels](https://github.com/umccr/gene_panels) +Curated gene panels for specific analyses, including the germline cancer predisposition gene panel used in the Germline Small Variants workflow. + +### Genome Annotations + +#### HMFtools Reference Data +- Ensembl reference data (GRCh38) +- Somatic driver catalogs +- Known fusion gene pairs +- Driver gene panels + +#### Annotation Databases: +- gnomAD (v2.1): Provides population allele frequencies to help distinguish common variants from rare ones. +- ClinVar (20220103): Offers clinically curated variant information, aiding in the interpretation of potential pathogenicity. +- COSMIC: Contains data on somatic mutations found in cancer, facilitating the identification of cancer-related variants. +- Gene Panels: Focuses analysis on specific sets of genes relevant to particular conditions or research interests. + +#### Structural Variant Data: +- SnpEff Databases: Used for predicting the effects of variants on genes and proteins. +- Panel of Normals (PON): Helps filter out technical artifacts by comparing against a set of normal samples. +- RepeatMasker: Identifies repetitive genomic regions to prevent false-positive variant calls. 
+ +Databases/datasets PCGR Reference Data: + +*Version: v20220203* + +- [GENCODE](https://www.gencodegenes.org/) - high quality reference gene annotation and experimental validation (release 39/19) +- [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (20210406 (v4.2)) +- [dbMTS](http://database.liulab.science/dbMTS) - Database of alterations in microRNA target sites (v1.0) +- [ncER](https://github.com/TelentiLab/ncER_datasets) - Non-coding essential regulation score (genome-wide percentile rank) (v2) +- [GERP](http://mendel.stanford.edu/SidowLab/downloads/gerp/) - Genomic Evolutionary Rate Profiling (GERP) - rejected substitutions (RS) score (v1) +- [Pfam](http://pfam.xfam.org) - Collection of protein families/domains (2021_11 (v35.0)) +- [UniProtKB](http://www.uniprot.org) - Comprehensive resource of protein sequence and functional information (2021_04) +- [gnomAD](http://gnomad.broadinstitute.org) - Germline variant frequencies exome-wide (r2.1 (October 2018)) +- [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (154) +- [DoCM](http://docm.genome.wustl.edu) - Database of curated mutations (release 3.2) +- [CancerHotspots](http://cancerhotspots.org) - A resource for statistically significant mutations in cancer (2017) +- [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar) - Database of genomic variants of clinical significance (20220103) +- [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - Literature-mined database of tumor suppressor genes/proto-oncogenes (20211106 (v42)) +- [OncoTree](http://oncotree.mskcc.org/) - Open-source ontology developed at MSK-CC for standardization of cancer type diagnosis (2021-11-02) +- [DiseaseOntology](http://disease-ontology.org) - Standardized ontology for human disease (20220131) +- [EFO](https://github.com/EBISPOT/efo) - Experimental Factor Ontology (v3.38.0) +- [GWAS_Catalog](https://www.ebi.ac.uk/gwas/) - The NHGRI-EBI Catalog of published 
genome-wide association studies (20211221)
+- [CGI](http://cancergenomeinterpreter.org/biomarkers) - Cancer Genome Interpreter Cancer Biomarkers Database (20180117)
+
+---
+
+## sash Module Outputs
+
+### Somatic SNVs
+- File: `smlv_somatic/filter/{tumor_id}.pass.vcf.gz`
+- Description: Contains somatic single nucleotide variants (SNVs) with filtering applied (VCF format).
+
+### Somatic SVs
+- File: `sv_somatic/prioritise/{tumor_id}.sv.prioritised.vcf.gz`
+- Description: Contains somatic structural variants (SVs) with prioritisation applied (VCF format).
+
+### Somatic CNVs
+- File: `cancer_report/cancer_report_tables/purple/{tumor_id}-purple_cnv_som.tsv.gz`
+- Description: Contains somatic copy number variation (CNV) data (TSV format).
+
+### Somatic Gene CNVs
+- File: `cancer_report/cancer_report_tables/purple/{tumor_id}-purple_cnv_som_gene.tsv.gz`
+- Description: Contains gene-level somatic copy number variation (CNV) data (TSV format).
+
+### Germline SNVs
+- File: `dragen_germline_output/{normal_id}.hard-filtered.vcf.gz`
+- Description: Contains germline single nucleotide variants (SNVs) with hard filtering applied (VCF format).
+
+### Purple Purity, Ploidy, MS Status
+- File: `purple/{tumor_id}.purple.purity.tsv`
+- Description: Contains estimated tumor purity, ploidy, and microsatellite status (TSV format).
+
+### PCGR JSON with TMB
+- File: `smlv_somatic/report/pcgr/{tumor_id}.pcgr_acmg.grch38.json.gz`
+- Description: Contains PCGR annotations, including tumor mutational burden (TMB) (JSON format).
+
+### DRAGEN HRD Score
+- File: `dragen_somatic_output/{tumor_id}.hrdscore.tsv`
+- Description: Contains the homologous recombination deficiency (HRD) score from DRAGEN analysis (TSV format).
+
+---
+
+## FAQ
+
+### Q: Do we use PCGR for the rescue of SAGE?
+A: In the Somatic Small Variants workflow, we use SAGE for variant calling, then annotate the variants using PCGR, followed by filtering.
Variants with high-tier ranks (TIER_1 or TIER_2) are not filtered out regardless of other criteria.
+
+### Q: How are hypermutated samples handled in the current version, and is there any impact on derived metrics such as TMB or MSI?
+A: In the current version of sash, a sample is flagged as hypermutated when it exceeds 500,000 total somatic variants. When this occurs, variants that have no clinical impact and do not fall in hotspot regions are progressively removed until the count drops below the threshold. Because PURPLE derives TMB and MSI from the filtered variant set, this extra filtering can bias those metrics; we currently still report the PURPLE TMB and MSI values in these edge cases, and a future release will correct the calculations.
+
+### Q: How are we handling non-standard chromosomes if present in the input VCFs (ALTs, chrM, etc)?
+A: We restrict analysis to chromosomes 1-22, X, Y, and M; all other chromosomes and contigs are filtered out.
+
+### Q: What inputs does the cancer reporter use? Have they changed, and what can we harmonize? For example, where does the Circos plot come from at this point?
+A: Circos plots are generated by PURPLE.
+
+### Q: We dropped the CACAO coverage reports. Can we discuss how to utilize DRAGEN or HMFtools coverage information instead?
+A: DRAGEN coverage metrics are now integrated into the MultiQC report, providing a comprehensive overview of sequencing quality and coverage across the genome. We are exploring further integration of HMFtools coverage analysis for future releases.
+
+### Q: What TMB score is displayed in the cancer reporter?
+A: The TMB displayed is calculated by PCGR.
+
+### Q: What filtered VCF is the source for the mutational signatures?
+A: We use the filtered VCF (after applying quality filters but retaining clinically significant variants) for mutational signatures analysis.
+
+### Q: Where is the contamination score coming from currently?
+A: Currently, sash does not calculate a dedicated contamination metric. Tumor purity estimation from PURPLE serves as the primary indicator of sample quality. + +### Q: Do the GRIPSS steps do something more than what's happening in Oncoanalyser? +A: No, the same GRIPSS parameters are applied, with the only difference being the reference files used. + +### Q: Does the data from Somatic Small Variants workflow get used for the SV analysis? +A: No, the somatic small variant workflow data is not used in the structural variant (SV) workflow. These are independent analyses that run in parallel. \ No newline at end of file diff --git a/docs/images/sash_overview_qc.png b/docs/images/sash_overview_qc.png new file mode 100644 index 00000000..d4fa1a0b Binary files /dev/null and b/docs/images/sash_overview_qc.png differ diff --git a/docs/images/sash_workflow_overview_diagram_Vqc.pptx b/docs/images/sash_workflow_overview_diagram_Vqc.pptx new file mode 100644 index 00000000..09a04b2a Binary files /dev/null and b/docs/images/sash_workflow_overview_diagram_Vqc.pptx differ diff --git a/docs/images/sash_workflow_overview_diagram_qc/Slide1.png b/docs/images/sash_workflow_overview_diagram_qc/Slide1.png new file mode 100644 index 00000000..7f1e4374 Binary files /dev/null and b/docs/images/sash_workflow_overview_diagram_qc/Slide1.png differ diff --git a/docs/output.md b/docs/output.md index 734dd68c..50203b2c 100644 --- a/docs/output.md +++ b/docs/output.md @@ -1,68 +1,398 @@ -# umccr/sash: Output +# Sash Output ## Introduction -This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline. +This document outlines the key results and files produced by the UMCCR SASH (post-processing WGS tumor/normal) pipeline. After a run, the pipeline organizes output files by analysis module under a directory for each tumor/normal pair (identified by run ID and sample names). 
The main outputs include annotated variant reports for somatic and germline variants, copy number and structural variant analyses, and a comprehensive MultiQC report for quality control. All paths below are relative to the top-level results directory of a given run. +## Pipeline overview -The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. +- [Sash Output](#sash-output) + - [Introduction](#introduction) + - [Pipeline overview](#pipeline-overview) + - [Directory Structure](#directory-structure) + - [Summary](#summary) + - [Workflows](#workflows) + - [Somatic Small Variants](#somatic-small-variants) + - [General](#general) + - [Summary](#summary-1) + - [Details](#details) + - [bolt smlv somatic rescue](#bolt-smlv-somatic-rescue) + - [BOLT\_SMLV\_SOMATIC\_ANNOTATE](#bolt_smlv_somatic_annotate) + - [BOLT\_SMLV\_SOMATIC\_FILTER](#bolt_smlv_somatic_filter) + - [SOMATIC\_SNV\_REPORTS](#somatic_snv_reports) + - [Somatic Structural Variants](#somatic-structural-variants) + - [General](#general-1) + - [Summary](#summary-2) + - [SV Annotation](#sv-annotation) + - [SV Prioritization](#sv-prioritization) + - [Germline Variants](#germline-variants) + - [General](#general-2) + - [Summary](#summary-3) + - [Germline Preparation](#germline-preparation) + - [Germline Reports](#germline-reports) + - [Reports](#reports) + - [Cancer Report](#cancer-report) + - [LINX Reports](#linx-reports) + - [PURPLE Reports](#purple-reports) + - [PCGR Reports](#pcgr-reports) + - [SIGRAP Reports](#sigrap-reports) + - [CPSR Reports](#cpsr-reports) + - [MultiQC Reports](#multiqc-reports) - +## Directory Structure -## Pipeline overview +```bash +[RUN_ID]/[sample]/ +├── .cancer_report.html +├── .cpsr.html +├── .pcgr.html +├── _linx.html +├── .multiqc.html +├── cancer_report/ +│ ├── img/ +│ └── cancer_report_tables/ +│ ├── hrd/ +│ ├── json/ +│ ├── purple/ +│ └── sigs/ +├── linx/ +│ ├── 
germline_annotations/
+│   ├── somatic_annotations/
+│   └── somatic_plots/
+├── multiqc_data/
+├── purple/
+├── smlv_germline/
+│   ├── prepare/
+│   └── report/
+├── smlv_somatic/
+│   ├── report/
+│   ├── annotate/
+│   ├── filter/
+│   └── rescue/
+└── sv_somatic/
+    ├── annotate/
+    └── prioritise/
+```
+
+## Summary
+
+The **Sash Workflow** comprises three primary pipelines: **Somatic Small Variants**, **Somatic Structural Variants**, and **Germline Variants**. These pipelines utilize **Bolt**, a Python package designed for modular processing, and leverage outputs from the **DRAGEN Variant Caller** alongside **HMFtools in Oncoanalyser**. Each pipeline is tailored to a specific type of genomic variant, incorporating filtering, annotation, and HTML reports for research and curation.
+
+## Workflows
+
+### Somatic Small Variants
+
+#### General
+
+In the **Somatic Small Variants** workflow, variant detection is performed using the **DRAGEN Variant Caller** and **Oncoanalyser (SAGE, PURPLE)** outputs. It is structured into four steps: **Rescue**, **Annotation**, **Filter**, and **Report**. The final outputs include an **HTML report** summarizing the results.
+
+#### Summary
+
+1. **Rescue** variants using SAGE to recover low-frequency alterations in clinically important hotspots.
+2. **Annotate** variants with clinical and functional information using PCGR.
+3. **Filter** variants based on allele frequency (AF), supporting reads (AD), and population frequency (gnomAD AF), removing low-confidence and common variants.
+4. **Report** final annotated variants in a comprehensive HTML format.
+
+### Details
+
+#### bolt smlv somatic rescue
+
+**Output files:**
+
+- `smlv_somatic/rescue/`
+  - `{tumor_id}.rescued.vcf.gz`: Rescued somatic VCF file containing previously filtered variants at known hotspots.
+
+
+The `BOLT_SMLV_SOMATIC_RESCUE` process rescues somatic variants using the BOLT tool. The output is a rescued VCF that reinstates potentially important variants filtered in earlier steps due to borderline quality metrics.
+
+#### BOLT_SMLV_SOMATIC_ANNOTATE
+
+**Output files:**
+
+- `smlv_somatic/annotate/`
+  - `{tumor_id}.annotations.vcf.gz`: Annotated somatic VCF file with functional and clinical annotations.
+
+
+The `BOLT_SMLV_SOMATIC_ANNOTATE` process annotates somatic variants using the BOLT tool. The output is an annotated VCF enriched with gene information, variant effect predictions, and other annotations that aid variant interpretation.
+
+#### BOLT_SMLV_SOMATIC_FILTER
+
+**Output files:**
+
+- `smlv_somatic/filter/`
+  - `{tumor_id}.filters_set.vcf.gz`: VCF file with filters set but all variants retained.
+  - `{tumor_id}.pass.vcf.gz`: Filtered somatic VCF file containing only PASS variants.
+  - `{tumor_id}.pass.vcf.gz.tbi`: Index file for the filtered VCF.
+
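The filter step described in this workflow combines several per-variant criteria (tumour AF, supporting reads, gnomAD population frequency) with hotspot rescue overriding them. The sketch below illustrates that kind of predicate only; the threshold values and parameter names are placeholders, not bolt's actual configuration.

```python
def passes_filters(af, alt_reads, gnomad_af,
                   min_af=0.05, min_alt_reads=4, max_gnomad_af=0.01,
                   is_hotspot=False):
    """Illustrative PASS/fail decision for a somatic small variant.

    Thresholds are hypothetical placeholders; bolt's real cutoffs live in
    its configuration.
    """
    if is_hotspot:                    # hotspot variants are rescued regardless
        return True
    if af < min_af:                   # low tumour allele frequency
        return False
    if alt_reads < min_alt_reads:     # insufficient supporting reads
        return False
    if gnomad_af > max_gnomad_af:     # likely a common germline variant
        return False
    return True

print(passes_filters(af=0.12, alt_reads=9, gnomad_af=0.0))
```

A hotspot variant with low AF would still return `True`, mirroring how the rescue step protects clinically important sites from quality-based filtering.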
+
+The `BOLT_SMLV_SOMATIC_FILTER` process filters somatic variants using the BOLT tool. It emits both a VCF retaining all variants with filter tags applied and a filtered VCF containing only variants that pass all quality filters.
+
+#### SOMATIC_SNV_REPORTS
+
+**Output files:**
+
+- `smlv_somatic/report/`
+  - `{tumor_id}.somatic.bcftools_stats.txt`: Statistical summary of somatic variants.
+  - `{tumor_id}.somatic.variant_counts_process.json`: Variant count metrics at each processing step.
+  - `{tumor_id}.somatic.variant_counts_type.yaml`: Variant counts by variant type.
+  - `af_tumor.txt`: Information about variant allele frequencies in the tumor.
+  - `af_tumor_keygenes.txt`: Variant allele frequencies in key cancer-related genes.
+  - `pcgr/`: Directory containing PCGR report outputs.
+
+
+The reporting process generates statistical summaries and specialized reports for somatic SNVs, including PCGR HTML reports for clinical interpretation.
+
+### Somatic Structural Variants
+
+#### General
+
+The **Somatic Structural Variants** workflow identifies and analyzes large genomic rearrangements such as deletions, duplications, inversions, and translocations. It processes outputs from GRIDSS, PURPLE, and LINX to provide comprehensive SV analysis.
+
+#### Summary
+
+1. **Annotate** SVs with gene context and potential functional impacts.
+2. **Prioritize** SVs based on cancer relevance and gene disruption potential.
+3. **Report** clinically relevant SVs with gene fusion predictions and visualization.
+
+#### SV Annotation
+
+**Output files:**
+
+- `sv_somatic/annotate/`
+  - `{tumor_id}.annotated.vcf.gz`: Annotated structural variant VCF file.
+
+
+This process adds gene annotations and functional impact predictions to structural variants. The annotated VCF contains information about genes affected by breakpoints, potential fusion events, and other biologically relevant details.
-The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
+#### SV Prioritization
-- [FastQC](#fastqc) - Raw read QC
-- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
-- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
+
+**Output files:**
+
+- `sv_somatic/prioritise/`
+  - `{tumor_id}.cnv.prioritised.tsv`: Prioritized copy number variations in tabular format.
+  - `{tumor_id}.sv.prioritised.tsv`: Prioritized structural variants in tabular format.
+  - `{tumor_id}.sv.prioritised.vcf.gz`: Prioritized structural variants in VCF format.
+
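The prioritised tables are intended to be sorted and filtered for curation review. A toy illustration of tier-based ordering is below; the column names (`sv_id`, `tier`, `genes`) and values are assumptions for the sketch, not the exact bolt output headers.

```python
import csv
import io

# Hypothetical excerpt of a prioritised SV table (tab-separated).
tsv = """sv_id\ttier\tgenes
sv1\t3\tgeneA
sv2\t1\tgeneB
sv3\t2\tgeneC
"""

rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
rows.sort(key=lambda r: int(r["tier"]))  # tier 1 = most clinically relevant

print([r["sv_id"] for r in rows])  # ['sv2', 'sv3', 'sv1']
```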
+
+This process ranks structural variants based on their potential clinical relevance, creating filterable lists for review. It separately handles copy number variations and other structural variants for easier interpretation.
+
+### Germline Variants
+
+#### General
+
+The **Germline Variants** workflow analyzes inherited variants from the normal sample to identify potential cancer predisposition genes and variants that may influence treatment decisions.
+
+#### Summary
+
+1. **Prepare** germline variants from DRAGEN normal sample outputs.
+2. **Report** potentially actionable germline variants through CPSR.
+
+#### Germline Preparation
+
+**Output files:**
+
+- `smlv_germline/prepare/`
+  - `{normal_id}.prepared.vcf.gz`: Prepared germline VCF file for annotation.
+
-### FastQC
+This process prepares germline variants for downstream annotation and reporting. It applies normalization, left-alignment, and other preprocessing steps to ensure consistent variant representation.
+
+#### Germline Reports
+
**Output files:**
-- `fastqc/`
-  - `*_fastqc.html`: FastQC report containing quality metrics.
-  - `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images.
+- `smlv_germline/report/`
+  - `{normal_id}.annotations.vcf.gz`: Annotated germline VCF.
+  - `{normal_id}.germline.bcftools_stats.txt`: Statistical summary of germline variants.
+  - `{normal_id}.germline.variant_counts_type.yaml`: Variant counts by type.
+  - `cpsr/`: Directory containing CPSR outputs.
+    - `{normal_id}.cpsr.grch38.json.gz`: Structured CPSR data.
+    - `{normal_id}.cpsr.grch38.pass.tsv.gz`: Filtered CPSR variants in tabular format.
+    - `{normal_id}.cpsr.grch38.snvs_indels.tiers.tsv`: Tiered variants by clinical significance.
+    - Other CPSR output files.
-[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
+The germline reporting process focuses on identifying variants in cancer predisposition genes and producing a comprehensive CPSR (Cancer Predisposition Sequencing Reporter) report.
+
+### Reports
+
+#### Cancer Report
-![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)
+
+**Output files:**
-![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png)
+
+- `{tumor_id}.cancer_report.html`: Main cancer report HTML file.
+- `cancer_report/`
+  - `{tumor_id}.snvs.normalised.vcf.gz`: Normalized SNVs used in the report.
+  - `img/`: Images used in the cancer report.
+  - `cancer_report_tables/`: Tabular data supporting the report.
+    - `_-qc_summary.tsv.gz`: Quality control summary.
+    - `_-report_inputs.tsv.gz`: Report configuration inputs.
+    - `hrd/`: Homologous Recombination Deficiency analysis from multiple methods.
+    - `json/`: JSON-formatted report data.
+    - `purple/`: Copy number information.
+    - `sigs/`: Mutational signature analysis (SBS, DBS, indels).
-![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png)
+
-> **NB:** The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality.
+The cancer report integrates findings from all analysis modules into a comprehensive HTML report for clinical interpretation. It includes tumor characteristics, key somatic alterations, mutational signatures, and therapy recommendations.
-### MultiQC
+#### LINX Reports
+
**Output files:**
-- `multiqc/`
-  - `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
-  - `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
-  - `multiqc_plots/`: directory containing static images from the report in various formats.
+- `{tumor_id}_linx.html`: LINX visualization report.
+- `linx/`
+  - `germline_annotations/`: Germline structural variant analysis.
+    - `{normal_id}.linx.germline.breakend.tsv`: Germline breakend annotations.
+    - `{normal_id}.linx.germline.clusters.tsv`: Germline SV clusters.
+    - `{normal_id}.linx.germline.disruption.tsv`: Gene disruptions by germline SVs.
+    - `{normal_id}.linx.germline.driver.catalog.tsv`: Potential driver germline SVs.
+    - `{normal_id}.linx.germline.links.tsv`: Links between germline SVs.
+    - `{normal_id}.linx.germline.svs.tsv`: Germline structural variants.
+    - `linx.version`: LINX version information.
+  - `somatic_annotations/`: Somatic structural variant analysis.
+    - `{tumor_id}.linx.breakend.tsv`: Somatic breakend annotations.
+    - `{tumor_id}.linx.clusters.tsv`: Somatic SV clusters.
+    - `{tumor_id}.linx.driver.catalog.tsv`: Potential driver somatic SVs.
+    - `{tumor_id}.linx.drivers.tsv`: Driver SV details.
+    - `{tumor_id}.linx.fusion.tsv`: Gene fusion predictions.
+    - `{tumor_id}.linx.links.tsv`: Links between somatic SVs.
+    - `{tumor_id}.linx.svs.tsv`: Somatic structural variants.
+    - Visualization data files (`vis_*`).
+    - `linx.version`: LINX version information.
+  - `somatic_plots/`: Visualizations of somatic structural variants.
-[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
+LINX reports provide detailed analysis of structural variants, including gene fusions, disruptions, and visualization of complex rearrangements. The HTML visualization report offers interactive exploration of structural variants and their potential functional impacts.
+
+#### PURPLE Reports
+
+**Output files:**
+
+- `purple/`
+  - `{tumor_id}.purple.cnv.gene.tsv`: Gene-level copy number variations.
+  - `{tumor_id}.purple.cnv.somatic.tsv`: Segment-level somatic copy number variations.
+  - `{tumor_id}.purple.driver.catalog.germline.tsv`: Potential germline driver variants.
+  - `{tumor_id}.purple.driver.catalog.somatic.tsv`: Potential somatic driver variants.
+  - `{tumor_id}.purple.germline.deletion.tsv`: Germline deletion information.
+  - `{tumor_id}.purple.purity.range.tsv`: Range of possible purity values.
+  - `{tumor_id}.purple.purity.tsv`: Tumor purity, ploidy, and microsatellite status.
+  - `{tumor_id}.purple.qc`: Quality control metrics.
+  - `{tumor_id}.purple.segment.tsv`: Genomic segmentation data.
+  - `{tumor_id}.purple.somatic.clonality.tsv`: Clonality analysis of somatic variants.
+  - `{tumor_id}.purple.somatic.hist.tsv`: Somatic variant histograms.
+  - `{tumor_id}.purple.somatic.vcf.gz`: Somatic variants VCF with copy number annotations.
+  - `{tumor_id}.purple.sv.germline.vcf.gz`: Germline structural variants.
+  - `{tumor_id}.purple.sv.vcf.gz`: Somatic structural variants.
+  - `circos/`: Circos visualization data files.
+    - `{normal_id}.ratio.circos`: Normal sample coverage ratio data.
+    - `{tumor_id}.baf.circos`: B-allele frequency data.
+    - `{tumor_id}.cnv.circos`: Copy number data.
+    - `{tumor_id}.indel.circos`: Indel visualization data.
+    - `{tumor_id}.link.circos`: SV links visualization data.
+    - `{tumor_id}.snp.circos`: SNP visualization data.
+    - Configuration and input files for Circos.
+  - `plot/`: Additional visualization data.
+  - `purple.version`: PURPLE version information.
+
+
+PURPLE reports provide copy number analysis, tumor purity estimation, and whole genome doubling assessment. The `circos/` directory contains data for generating circular genome plots that visualize genomic alterations across the entire genome.
+
+#### PCGR Reports
+
+**Output files:**
+
+- `{tumor_id}.pcgr.html`: PCGR HTML report.
+- `smlv_somatic/report/pcgr/`
+  - `{tumor_id}.pcgr_acmg.grch38.flexdb.html`: Flexible database PCGR report.
+  - `{tumor_id}.pcgr_acmg.grch38.json.gz`: Structured PCGR data in JSON format.
+  - `{tumor_id}.pcgr_acmg.grch38.mp_input.vcf.gz`: Input VCF for mutational pattern analysis.
+  - `{tumor_id}.pcgr_acmg.grch38.mutational_signatures.tsv`: Mutational signature analysis.
+  - `{tumor_id}.pcgr_acmg.grch38.pass.tsv.gz`: Filtered variants in tabular format.
+  - `{tumor_id}.pcgr_acmg.grch38.pass.vcf.gz`: Filtered variants in VCF format.
+  - `{tumor_id}.pcgr_acmg.grch38.snvs_indels.tiers.tsv`: Tiered variants by clinical significance.
+  - `{tumor_id}.pcgr_acmg.grch38.vcf.gz`: All variants in VCF format.
+  - `{tumor_id}.pcgr_config.rds`: PCGR configuration.
+
+
+PCGR (Personal Cancer Genome Reporter) reports provide clinical interpretation of somatic variants, including therapy matches, clinical trial eligibility, and tumor mutational burden assessment.
+
+#### SIGRAP Reports
+
+**Output files:**
+
+- `cancer_report/cancer_report_tables/sigs/`
+  - `_-dbs.tsv.gz`: Double base substitution signature analysis.
+  - `_-indel.tsv.gz`: Indel signature analysis.
+  - `_-snv_2015.tsv.gz`: SNV signature analysis using 2015 signatures.
+  - `_-snv_2020.tsv.gz`: SNV signature analysis using 2020 signatures.
+- `cancer_report/cancer_report_tables/json/sigs/`: JSON-formatted signature data.
+
+
+SIGRAP reports provide mutational signature analysis, identifying patterns associated with specific mutational processes or exposures. The pipeline analyzes single base substitutions (SBS), double base substitutions (DBS), and indel signatures using both the 2015 and 2020 reference signature sets.
+
+#### CPSR Reports
+
+**Output files:**
+
+- `{normal_id}.cpsr.html`: CPSR HTML report.
+- `smlv_germline/report/cpsr/`
+  - `{normal_id}.cpsr.grch38.custom_list.bed`: Custom gene list in BED format.
+  - `{normal_id}.cpsr.grch38.json.gz`: Structured CPSR data in JSON format.
+  - `{normal_id}.cpsr.grch38.pass.tsv.gz`: Filtered variants in tabular format.
+  - `{normal_id}.cpsr.grch38.pass.vcf.gz`: Filtered variants in VCF format.
+  - `{normal_id}.cpsr.grch38.snvs_indels.tiers.tsv`: Tiered variants by clinical significance.
+  - `{normal_id}.cpsr.grch38.vcf.gz`: All variants in VCF format.
+  - `{normal_id}.cpsr_config.rds`: CPSR configuration.
+
-Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.
+CPSR (Cancer Predisposition Sequencing Reporter) focuses on germline variants in known cancer predisposition genes, providing a comprehensive report of inherited cancer risk variants.
-### Pipeline information
+#### MultiQC Reports
+
**Output files:**
-- `pipeline_info/`
-  - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
-  - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameter's are used when running the pipeline.
-  - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
+- `{tumor_id}.multiqc.html`: Main MultiQC report.
+- `multiqc_data/`: Supporting data for the MultiQC report.
+  - `dragen_frag_len.txt`: Fragment length metrics.
+  - `dragen_map_metrics.txt`: Mapping metrics.
+  - `dragen_ploidy.txt`: Ploidy estimation metrics.
+  - `dragen_time_metrics.txt`: Processing time metrics.
+  - `dragen_trimmer_metrics.txt`: Read trimming metrics.
+  - `dragen_vc_metrics.txt`: Variant calling metrics.
+  - `dragen_wgs_cov_metrics.txt`: WGS coverage metrics.
+  - `multiqc.log`: MultiQC log file.
+  - `multiqc_bcftools_stats.txt`: BCFtools statistics.
+  - `multiqc_data.json`: MultiQC data in JSON format.
+  - `multiqc_general_stats.txt`: General statistics.
+  - `purple.txt`: PURPLE metrics.
-[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
+MultiQC aggregates quality metrics from all pipeline components into a single HTML report, providing an overview of sample quality and analysis performance.
diff --git a/docs/usage.md b/docs/usage.md
index 37f6aac6..0be3bb51 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -1,194 +1,176 @@
 # umccr/sash: Usage
 
-> _Documentation of pipeline parameters is generated automatically from the pipeline schema and can no longer be found in markdown files._
+> Parameter documentation is generated automatically from `nextflow_schema.json`. Run `nextflow run umccr/sash --help`
+> or use [nf-core/launch](https://nf-co.re/launch) for an interactive form.
 
 ## Introduction
 
-
+umccr/sash is UMCCR's post-processing pipeline for tumour/normal WGS analyses. It consumes DRAGEN secondary-analysis
+outputs together with nf-core/oncoanalyser WiGiTS artefacts to perform small-variant rescue, annotation, filtering,
+structural variant integration, PURPLE CNV calling, and reporting (PCGR, CPSR, GPGR cancer report, LINX, MultiQC).
+
+- Requires Nextflow ≥ 22.10.6 and a container engine (Docker, Singularity, Apptainer or Podman) or Conda.
+- Uses GRCh38 reference data resolved from `--ref_data_path` (see [Reference data](#reference-data)).
+- Expects inputs via a CSV samplesheet describing DRAGEN and Oncoanalyser directories; no FASTQ inputs are needed.
 
 ## Samplesheet input
 
-You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below.
+Pass a CSV with `--input`. Each row represents one directory staged by upstream pipelines for a given analysis `id`. +Rows sharing the same `id` are grouped into a single tumour/normal run. -```bash ---input '[path to samplesheet file]' -``` +### Column definitions -### Multiple runs of the same sample +| Column | Description | +| -------------- | ----------- | +| `id` | Unique analysis identifier grouping rows belonging to the same tumour/normal pair. | +| `subject_name` | Subject identifier; must be identical for all rows with the same `id`. | +| `sample_name` | DRAGEN sample label. Used to derive tumour (`dragen_somatic_dir`) and normal (`dragen_germline_dir`) identifiers. | +| `filetype` | One of the supported directory types below. | +| `filepath` | Absolute or relative path to the directory containing the expected files. | -The `sample` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. The pipeline will concatenate the raw reads before performing any downstream analysis. Below is an example for the same sample sequenced across 3 lanes: +Example row set: -```console -sample,fastq_1,fastq_2 -CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz -CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz -CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz +```csv +id,subject_name,sample_name,filetype,filepath +subject_a.example,subject_a,sample_germline,dragen_germline_dir,/path/to/dragen_germline/ +subject_a.example,subject_a,sample_somatic,dragen_somatic_dir,/path/to/dragen_somatic/ +subject_a.example,subject_a,sample_somatic,oncoanalyser_dir,/path/to/oncoanalyser/ ``` -### Full samplesheet +An example sheet is included at `assets/samplesheet.csv`. -The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. 
The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 3 columns to match those defined in the table below. +### Required directory contents -A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where `TREATMENT_REP3` has been sequenced twice. +Paths below are relative to the value of `filepath` for each row. The pipeline targets nf-core/oncoanalyser ≥ 2.2.0 +exports. -```console -sample,fastq_1,fastq_2 -CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz -CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz -CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz -TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz, -TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz, -TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz, -TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, -``` +- `dragen_somatic_dir` + - `.hard-filtered.vcf.gz` and `.hard-filtered.vcf.gz.tbi` + - Optional: `.hrdscore.csv` (ingested into the cancer report when present) +- `dragen_germline_dir` + - `.hard-filtered.vcf.gz` +- `oncoanalyser_dir` + - `amber/` and `cobalt/` directories (coverage inputs for PURPLE) + - `sage_calling/somatic/.sage.somatic.vcf.gz` (+ `.tbi`) + - `esvee/.esvee.ref_depth.vcf.gz` and accompanying directory (used to seed eSVee calling) + - `virusbreakend/` directory -| Column | Description | -| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). | -| `fastq_1` | Full path to FastQ file for Illumina short reads 1. 
File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | -| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | - -An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline. +> SASH runs PURPLE internally; precomputed PURPLE outputs are not required as inputs. ## Running the pipeline -The typical command for running the pipeline is as follows: +Quickstart command: ```bash -nextflow run umccr/sash --input samplesheet.csv --outdir --genome GRCh37 -profile docker +nextflow run umccr/sash \ + --input samplesheet.csv \ + --ref_data_path /path/to/reference_data_root \ + --outdir results/ \ + -profile docker ``` -This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles. - -Note that the pipeline will create the following files in your working directory: +This launches the pipeline with the `docker` configuration profile. The following appear in the working directory: -```bash -work # Directory containing the nextflow working files - # Finished results in specified location (defined with --outdir) -.nextflow_log # Log file from Nextflow -# Other nextflow hidden files, eg. history of pipeline runs and old logs. +``` +work/ # Nextflow working files +results/ # Pipeline outputs (as specified by --outdir) +.nextflow_log # Nextflow run log ``` -If you wish to repeatedly use the same parameters for multiple runs, rather than specifying each flag in the command, you can specify these in a params file. - -Pipeline settings can be provided in a `yaml` or `json` file via `-params-file `. +### Parameter files and profiles -> ⚠️ Do not use `-c ` to specify parameters as this will result in errors. 
-> Custom config files specified with `-c` must only be used for [tuning process resource specifications](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources), other infrastructural tweaks (such as output directories), or module arguments (args).
-> The above pipeline run specified with a params file in yaml format:
+Reuse parameter sets via `-params-file params.yaml`:
 ```bash
 nextflow run umccr/sash -profile docker -params-file params.yaml
 ```
-with `params.yaml` containing:
-
 ```yaml
-input: './samplesheet.csv'
-outdir: './results/'
-genome: 'GRCh37'
-input: 'data'
-<...>
+input: 'samplesheet.csv'
+ref_data_path: '/data/refdata'
+outdir: 'results/'
 ```
-You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-co.re/launch).
+> ⚠️ Avoid using `-c` to pass pipeline parameters. `-c` should only point to Nextflow config files for resource tuning, executor settings or module overrides (see below).
-### Updating the pipeline
+You can generate YAML/JSON parameter files through [nf-core/launch](https://nf-co.re/launch) or Nextflow Tower.
-When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:
-
-```bash
-nextflow pull umccr/sash
-```
+## Reference data
-### Reproducibility
+All resources are resolved relative to `--ref_data_path` using `conf/refdata.config`. Confirm the directory contains the expected subpaths (versions may change between releases):
-It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline.
-If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
+- `genomes/GRCh38_umccr/` – GRCh38 FASTA, FAI and dict files plus sequence metadata.
+- `hmf_reference_data/` – WiGiTS bundle with PURPLE GC profiles, eSVee panel-of-normals, SAGE hotspot resources, LINX transcripts and driver catalogues.
+- `databases/pcgr/` – PCGR/CPSR annotation bundle.
+- `umccr/` – bolt configuration files, driver panels, MultiQC templates, GPGR assets.
+- `misc/` – panel-of-normals, APPRIS annotations, snpEff cache and other supporting data.
-First, go to the [umccr/sash releases page](https://github.com/umccr/sash/releases) and find the latest pipeline version - numeric only (eg. `1.3.1`). Then specify this when running the pipeline with `-r` (one hyphen) - eg. `-r 1.3.1`. Of course, you can switch to another version by changing the number after the `-r` flag.
+Refer to `docs/details.md` for a deeper breakdown of required artefacts.
-This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future. For example, at the bottom of the MultiQC reports.
-
-To further assist in reproducbility, you can use share and re-use [parameter files](#running-the-pipeline) to repeat pipeline runs with the same settings without having to write out a command with every single parameter.
-
-> 💡 If you wish to share such profile (such as upload as supplementary material for academic publications), make sure to NOT include cluster specific paths to files, nor institutional specific profiles.
-
-## Core Nextflow arguments
-
-> **NB:** These options are part of Nextflow and use a _single_ hyphen (pipeline parameters use a double-hyphen).
+## Nextflow configuration
 ### `-profile`
-Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.
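As a pre-flight check, the reference data layout described above can be verified with a short script. This is a sketch only: `REF_DATA` and the subpath list simply mirror the directories named in this document, so adjust both to your installation.

```shell
# Pre-flight check for the reference data layout; adjust REF_DATA to the
# value you intend to pass as --ref_data_path.
REF_DATA="${REF_DATA:-/path/to/reference_data_root}"

missing=0
for subpath in \
    genomes/GRCh38_umccr \
    hmf_reference_data \
    databases/pcgr \
    umccr \
    misc
do
    if [ -d "${REF_DATA}/${subpath}" ]; then
        echo "ok:      ${subpath}"
    else
        echo "MISSING: ${subpath}"
        missing=$((missing + 1))
    fi
done
echo "${missing} missing subpath(s)"
```

Running this before `nextflow run` surfaces an incomplete bundle immediately, rather than partway through the workflow.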
-
-Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Apptainer, Conda) - see below.
-
-> We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported.
-
-Note that multiple profiles can be loaded, for example: `-profile test,docker` - the order of arguments is important!
-They are loaded in sequence, so later profiles can overwrite earlier profiles.
-
-If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. This is _not_ recommended, since it can lead to different results on different machines dependent on the computer enviroment.
-
-- `test`
-  - A profile with a complete configuration for automated testing
-  - Includes links to test data so needs no other parameters
-- `docker`
-  - A generic configuration profile to be used with [Docker](https://docker.com/)
-- `singularity`
-  - A generic configuration profile to be used with [Singularity](https://sylabs.io/docs/)
-- `podman`
-  - A generic configuration profile to be used with [Podman](https://podman.io/)
-- `shifter`
-  - A generic configuration profile to be used with [Shifter](https://nersc.gitlab.io/development/shifter/how-to-use/)
-- `charliecloud`
-  - A generic configuration profile to be used with [Charliecloud](https://hpc.github.io/charliecloud/)
-- `apptainer`
-  - A generic configuration profile to be used with [Apptainer](https://apptainer.org/)
-- `conda`
-  - A generic configuration profile to be used with [Conda](https://conda.io/docs/). Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter, Charliecloud, or Apptainer.
+Profiles configure software packaging and cluster backends.
+Bundled profiles include `test`, `docker`, `singularity`, `podman`, `shifter`, `charliecloud`, `apptainer` and `conda`. Combine multiple profiles with commas (later entries override earlier ones). If no profile is supplied, Nextflow expects all software on `$PATH`, which is discouraged.
 ### `-resume`
-Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. For input to be considered the same, not only the names must be identical but the files' contents as well. For more info about this parameter, see [this blog post](https://www.nextflow.io/blog/2019/demystifying-nextflow-resume.html).
-
-You can also supply a run name to resume a specific run: `-resume [run-name]`. Use the `nextflow log` command to show previous run names.
+Resume cached work by adding `-resume`. Nextflow reuses a cached task only when both the input file names and their contents are unchanged. Supply a run name to resume a specific execution: `-resume <run-name>`. Use `nextflow log` to list previous runs.
 ### `-c`
-Specify the path to a specific config file (this is a core Nextflow command). See the [nf-core website documentation](https://nf-co.re/usage/configuration) for more information.
+`-c custom.config` loads additional Nextflow configuration (eg. executor queues, resource overrides, institutional profiles). See the [nf-core configuration docs](https://nf-co.re/docs/usage/configuration) for examples.
 ## Custom configuration
 ### Resource requests
-Whilst the default requirements set within the pipeline will hopefully work for most people and with most input data, you may find that you want to customise the compute resources that the pipeline requests. Each step in the pipeline has a default set of requirements for number of CPUs, memory and time.
-For most of the steps in the pipeline, if the job exits with any of the error codes specified [here](https://github.com/nf-core/rnaseq/blob/4c27ef5610c87db00c3c5a3eed10b1d161abf575/conf/base.config#L18) it will automatically be resubmitted with higher requests (2 x original, then 3 x original). If it still fails after the third attempt then the pipeline execution is stopped.
+Default resources suit typical datasets, but you can override CPUs/memory/time through custom config files. Many modules honour nf-core’s automatic retry logic: certain exit codes trigger resubmission at 2× and 3× the original resources before failing the run. Refer to the nf-core guides on [max resources](https://nf-co.re/docs/usage/configuration#max-resources) and [tuning workflow resources](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources).
-To change the resource requests, please see the [max resources](https://nf-co.re/docs/usage/configuration#max-resources) and [tuning workflow resources](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources) section of the nf-core website.
+### Custom containers
-### Custom Containers
+nf-core pipelines default to Biocontainers/Bioconda images. You can override container or conda package selections in config to use patched or institutional builds. See the [updating tool versions](https://nf-co.re/docs/usage/configuration#updating-tool-versions) section for patterns.
-In some cases you may wish to change which container or conda environment a step of the pipeline uses for a particular tool. By default nf-core pipelines use containers and software from the [biocontainers](https://biocontainers.pro/) or [bioconda](https://bioconda.github.io/) projects. However in some cases the pipeline specified version maybe out of date.
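Both kinds of override (resources and containers) can live in one file passed with `-c`. The fragment below is a sketch only: the `PCGR` process selector, the resource values and the image URI are illustrative placeholders, so match them against the names in this pipeline's `conf/modules.config` before use.

```groovy
// custom.config -- illustrative resource and container overrides.
// 'PCGR' is a hypothetical selector; check the pipeline's module names.
process {
    withName: 'PCGR' {
        cpus   = 8
        memory = 64.GB
        time   = 12.h
        // Point at a patched or institutional build instead of the default.
        container = 'registry.example.com/pcgr:patched'
    }
}
```

Launch with `nextflow run umccr/sash -c custom.config ...`; configuration supplied via `-c` is merged on top of the pipeline's own settings.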
+### Custom tool arguments
-To use a different container from the default container or conda environment specified in a pipeline, please see the [updating tool versions](https://nf-co.re/docs/usage/configuration#updating-tool-versions) section of the nf-core website.
+If you need to provide additional tool parameters beyond those exposed by pipeline options, set `ext.args` in a process selector (per-module overrides) or use the module-specific hooks documented by nf-core. Review `conf/modules.config` for supported overrides in umccr/sash.
-### Custom Tool Arguments
+## Outputs
-A pipeline might not always support every possible argument or option of a particular tool used in pipeline. Fortunately, nf-core pipelines provide some freedom to users to insert additional parameters that the pipeline does not include by default.
+See `docs/output.md` for a full description of generated artefacts (PCGR/CPSR HTML, cancer report, LINX, PURPLE, MultiQC and supporting statistics).
 ## Running in the background
-Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.
-
-The Nextflow `-bg` flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.
+Nextflow supervises submitted jobs; keep the Nextflow process alive for the pipeline to finish. Options include:
-Alternatively, you can use `screen` / `tmux` or similar tool to create a detached session which you can log back into at a later time.
-Some HPC setups also allow you to run nextflow within a cluster job submitted your job scheduler (from where it submits more jobs).
+- `nextflow run ... -bg` to launch detached and log to `.nextflow.log`.
+- Using `screen`, `tmux` or similar to keep sessions alive.
+- Submitting Nextflow itself through your scheduler (eg. `sbatch`), where it will launch child jobs.
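For the scheduler-submitted option, a launch script might look like the sketch below. The job name, wall-time and resource values are illustrative and site-specific, as is the choice of the `singularity` profile.

```shell
#!/bin/bash
#SBATCH --job-name=sash
#SBATCH --time=48:00:00
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G

# The scheduler detaches this session, so -bg is unnecessary here;
# Nextflow submits its own child jobs from this process.
nextflow run umccr/sash \
    --input samplesheet.csv \
    --ref_data_path /path/to/reference_data_root \
    --outdir results/ \
    -profile singularity \
    -resume
```

Note that the Nextflow driver itself needs only modest resources; the heavy lifting happens in the child jobs it submits.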
 ## Nextflow memory requirements
-In some cases, the Nextflow Java virtual machines can start to request a large amount of memory.
-We recommend adding the following line to your environment to limit this (typically in `~/.bashrc` or `~./bash_profile`):
+The Nextflow JVM can request substantial RAM on large runs. Set an upper bound via environment variables, typically in `~/.bashrc` or `~/.bash_profile`:
 ```bash
-NXF_OPTS='-Xms1g -Xmx4g'
+export NXF_OPTS='-Xms1g -Xmx4g'
 ```
+
+Adjust limits to suit your environment.
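The export can also be applied to the current session without re-login; the heap sizes below are the same example values as above, not recommendations.

```shell
# Apply the JVM heap cap to the current shell session; append the same
# export line to ~/.bashrc to persist it. Sizes are examples only.
export NXF_OPTS='-Xms1g -Xmx4g'
echo "NXF_OPTS=${NXF_OPTS}"
```

Any `nextflow` command launched from this session will then inherit the capped heap settings.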