Draft
37 commits
dfcfc32
add document for https://trello.com/c/NEVSbxPK/1656-sash-planning-doc…
qclayssen Mar 13, 2025
800ffb8
add general doc workflow
qclayssen Mar 17, 2025
75433b8
Add info readme
qclayssen Mar 21, 2025
f2ee788
add output.md file
qclayssen Mar 21, 2025
b8227fb
linting
qclayssen Mar 24, 2025
d30363e
add doc
qclayssen Mar 24, 2025
f4ae804
uptdate usage
qclayssen Mar 24, 2025
9683ae9
add adr
qclayssen Mar 27, 2025
2d33534
reorganise doc and linting
qclayssen Mar 27, 2025
c768668
fix typo
qclayssen Mar 27, 2025
81c5919
linting add inputs
qclayssen Mar 27, 2025
7e55c71
linting
qclayssen Mar 27, 2025
d644b22
more linting
qclayssen Mar 27, 2025
41f68c4
separation
qclayssen Mar 27, 2025
437eeb3
remove useless sumaary
qclayssen Mar 27, 2025
7f86249
rescue -> integration
qclayssen Mar 27, 2025
4cdde39
rephrase linting
qclayssen Mar 28, 2025
eaa0206
Add tools and linting
qclayssen Mar 30, 2025
2841c64
add tables
qclayssen Apr 1, 2025
12a0ae8
remove redundancy
qclayssen Apr 2, 2025
eeafb17
remove redundancy
qclayssen Apr 6, 2025
b387700
missed changed in docs
qclayssen Apr 8, 2025
f1733cb
linting
qclayssen Apr 8, 2025
c721156
fix redundancy typo
qclayssen Apr 9, 2025
ebf2954
add table of content
qclayssen Apr 9, 2025
16ba1dc
fix table
qclayssen Apr 9, 2025
f491a4f
fix filrer and info
qclayssen Apr 15, 2025
e64df39
linting
qclayssen Apr 15, 2025
85e58f3
typo
qclayssen Apr 15, 2025
759e9f0
correct filter
qclayssen Apr 15, 2025
e607ea7
linting
qclayssen Apr 16, 2025
f4f6f25
linting and improve writing
qclayssen Apr 29, 2025
8baed92
fixing input doc + linting
qclayssen Sep 8, 2025
b9b8c9f
work usage
qclayssen Sep 8, 2025
14af5fe
Merge origin/main into Documentation
qclayssen Oct 2, 2025
475b6c7
add info usage
qclayssen Oct 2, 2025
89750ed
reshape usage
qclayssen Oct 2, 2025
5 changes: 4 additions & 1 deletion assets/samplesheet.csv
@@ -1 +1,4 @@
-id,subject_name,sample_name,sample_type,filetype,filepath
+id,subject_name,sample_name,filetype,filepath
+subject_a.example,subject_a,sample_germline,dragen_germline_dir,/path/to/dragen_germline/
+subject_a.example,subject_a,sample_somatic,dragen_somatic_dir,/path/to/dragen_somatic/
+subject_a.example,subject_a,sample_somatic,oncoanalyser_dir,/path/to/oncoanalyser/
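
The updated samplesheet layout can be read with a few lines of Python. This is only an illustrative sketch: the pipeline itself parses the samplesheet inside the workflow, and the function name `load_samplesheet` and the grouping by `id` are assumptions made for this example, not part of sash.

```python
import csv
import io
from collections import defaultdict

# Column order taken from the updated assets/samplesheet.csv header.
EXPECTED_COLUMNS = ["id", "subject_name", "sample_name", "filetype", "filepath"]

# Example rows copied from the updated samplesheet.
EXAMPLE = """\
id,subject_name,sample_name,filetype,filepath
subject_a.example,subject_a,sample_germline,dragen_germline_dir,/path/to/dragen_germline/
subject_a.example,subject_a,sample_somatic,dragen_somatic_dir,/path/to/dragen_somatic/
subject_a.example,subject_a,sample_somatic,oncoanalyser_dir,/path/to/oncoanalyser/
"""


def load_samplesheet(handle):
    """Hypothetical helper: map each analysis id to {filetype: filepath}."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected samplesheet header: {reader.fieldnames}")
    groups = defaultdict(dict)
    for row in reader:
        groups[row["id"]][row["filetype"]] = row["filepath"]
    return dict(groups)


if __name__ == "__main__":
    for analysis_id, files in load_samplesheet(io.StringIO(EXAMPLE)).items():
        print(analysis_id, sorted(files))
```

Rejecting an unexpected header early means a samplesheet that still carries the removed `sample_type` column fails immediately rather than being read with shifted fields.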
8 changes: 4 additions & 4 deletions docs/README.md
@@ -1,8 +1,8 @@
# umccr/sash: Documentation

The umccr/sash documentation is split into the following pages:

- [Details](details.md)
  - A detailed description of each pipeline step.
- [Usage](usage.md)
  - An overview of how the pipeline works, how to run it, and a description of all the command-line flags.
- [Output](output.md)
  - An overview of the different results produced by the pipeline and how to interpret them.
- [Architectural decision record (ADR)](adr.md)
  - A record of the choices the team has made about significant aspects of the software architecture.
44 changes: 44 additions & 0 deletions docs/adr.md
@@ -0,0 +1,44 @@
# ADR #1: Implement VCF Chunking and Parallelization in Sash Workflow for PCGR Processing

**Status**: In Progress
**Date**: 2024-11-07
**Deciders**: Oliver Hofmann, Stephen Watts, Quentin Clayssen
**Technical Story**: Based on the limitations of PCGR in handling large variant datasets within the sash workflow, specifically impacting hypermutated samples.

## Context
[PCGR](https://sigven.github.io/pcgr/) (Personal Cancer Genome Reporter) currently limits each run to 500,000 variants. In the sash workflow, hypermutated samples often exceed this limit. PCGR has its own filtering steps, and an additional filtering step was also introduced in Bolt. By splitting VCF files into chunks and processing the chunks in parallel, we can analyze these large datasets without exceeding the PCGR variant limit, which leads to more complete annotation and a more scalable pipeline.

## Decision
To address the limitations of PCGR when handling hypermutated samples, we WILL implement the following:

1. **Split VCF Files into Chunks**: Input VCF files MUST be divided into chunks, each containing no more than 500,000 variants. This ensures that each chunk remains within PCGR’s processing capacity.

2. **Parallelize Processing**: Each chunk MUST be processed concurrently through PCGR to optimize processing time. The annotated outputs from all chunks MUST be merged to create a unified dataset.

3. **Integrate into Bolt Annotation**: The chunking and parallelization changes MUST be implemented in the Bolt annotation module to ensure seamless and scalable processing for large variant datasets.

4. **Efficiency Consideration**: For now, there MAY be a loss of efficiency for larger variant sets due to the fixed resources allocated for annotation. Further resource adjustments SHOULD be evaluated in the future.
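
The split, parallel-process, and merge steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Bolt implementation: the function names are hypothetical, the VCF is held in memory as a list of text lines, and `annotate_chunk` is a placeholder that returns records unchanged where a real PCGR run would go.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_VARIANTS = 500_000  # PCGR per-run variant limit quoted in this ADR


def split_header_and_records(lines):
    """Separate VCF header lines (starting with '#') from variant records."""
    header = [ln for ln in lines if ln.startswith("#")]
    records = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    return header, records


def split_vcf(lines, max_variants=MAX_VARIANTS):
    """Return chunks of at most max_variants records, each with the full header."""
    header, records = split_header_and_records(lines)
    return [
        header + records[i:i + max_variants]
        for i in range(0, len(records), max_variants)
    ]


def annotate_chunk(chunk):
    """Placeholder (assumption): stands in for one PCGR run on one chunk."""
    _, records = split_header_and_records(chunk)
    return records


def annotate_in_chunks(lines, max_variants=MAX_VARIANTS, workers=4):
    """Annotate chunks concurrently, then merge them back in input order."""
    header, _ = split_header_and_records(lines)
    chunks = split_vcf(lines, max_variants)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order the chunks were submitted.
        results = list(pool.map(annotate_chunk, chunks))
    merged = list(header)
    for records in results:
        merged.extend(records)
    return merged
```

Repeating the header in every chunk keeps each chunk a valid standalone VCF, and merging in submission order preserves the original record order without a separate sort step.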

## Consequences

### Positive Consequences
- **Improved Efficiency**: This approach allows large variant datasets to be processed within PCGR's constraints, enhancing efficiency and ensuring more comprehensive analysis.
- **Scalability**: Chunking and parallel processing make the sash workflow more scalable for hypermutated samples, accommodating larger datasets.

### Negative Consequences
- **Complexity**: Adding chunking and merging steps WILL increase the complexity of data handling and of ensuring integrity across the merged data.
- **Resource Demand**: Parallel processing MAY increase resource consumption, affecting system performance and requiring further resource management.

## Remaining Challenges
While the proposed approach mitigates the current limitations of PCGR, it might not fully resolve the issues for hypermutated samples with exceptionally high variant counts. Additional solutions MUST be explored, such as:

- **Additional Filtering Criteria**: Applying additional filters to reduce the variant count where applicable.
- **Alternative Reporting Methods**: Exploring more scalable reporting approaches that can handle higher variant loads.

## Links
- [Related PR for VCF Chunking and Parallelization Implementation](https://github.com/scwatts/bolt/pull/2)
- [PCGR Documentation on Variant Limit](https://sigven.github.io/pcgr/articles/running.html#large-input-sets-vcf)
- Discussion on Hypermutated Samples Handling
582 changes: 582 additions & 0 deletions docs/details.md

Large diffs are not rendered by default.

Binary file added docs/images/sash_overview_qc.png
Binary file not shown.
392 changes: 361 additions & 31 deletions docs/output.md

Large diffs are not rendered by default.

224 changes: 103 additions & 121 deletions docs/usage.md

Large diffs are not rendered by default.