merge branch develop

suhrig · Feb 8, 2023 · 44e6acb · 44e6acb
2 parents 3a88da0 + 9bd0eba
commit 44e6acb
Show file tree

Hide file tree

Showing 15 changed files with 227 additions and 55 deletions.
diff --git a/README.md b/README.md
@@ -81,6 +81,7 @@ Please refer to the [user manual](http://arriba.readthedocs.io/en/latest/) for i
    - [Convert fusions.tsv to VCF](https://arriba.readthedocs.io/en/latest/utility-scripts/#convert-fusionstsv-to-vcf)
    - [Run Arriba on prealigned BAM file](https://arriba.readthedocs.io/en/latest/utility-scripts/#run-arriba-on-prealigned-bam-file)
    - [Quantify virus expression](https://arriba.readthedocs.io/en/latest/utility-scripts/#quantify-virus-expression)
+   - [Annotate exon numbers](https://arriba.readthedocs.io/en/latest/utility-scripts/#annotate-exon-numbers)
 
 9. [Current limitations](https://arriba.readthedocs.io/en/latest/current-limitations/)
 
@@ -91,6 +92,8 @@ Please refer to the [user manual](http://arriba.readthedocs.io/en/latest/) for i
    - [Viral detection](https://arriba.readthedocs.io/en/latest/current-limitations/#viral-detection)
    - [Targeted sequencing](https://arriba.readthedocs.io/en/latest/current-limitations/#targeted-sequencing)
    - [Supporting read count vs. coverage](https://arriba.readthedocs.io/en/latest/current-limitations/#supporting-read-count-vs-coverage)
+   - [Supported organisms](https://arriba.readthedocs.io/en/latest/current-limitations/#supported-organisms)
+   - [Supported aligners](https://arriba.readthedocs.io/en/latest/current-limitations/#supported-aligners)
 
 10. [Internal algorithm](https://arriba.readthedocs.io/en/latest/internal-algorithm/)
 

diff --git a/documentation/command-line-options.md b/documentation/command-line-options.md
@@ -219,7 +219,7 @@ draw_fusions.R --fusions=fusions.tsv --annotation=annotation.gtf --output=output
 : This option only applies to intergenic breakpoints. If it is set to a value greater than 0, then the script draws the genes which are no more than the given distance away from an intergenic breakpoint. The keywords `closestGene` and `closestProteinCodingGene` instruct the script to dynamically determine the distance to the next (protein-coding) gene for each breakpoint. Alternatively, instead of specifying a single distance that is applied upstream and downstream of both breakpoints alike, more fine-grained control over the region to be shown is possible by specifying four comma-separated values. The first two values determine the region to the left and to the right of breakpoint 1; the third and fourth values determine the region to the left and to the right of breakpoint 2. Note that this option is incompatible with `--squishIntrons`. Default: `0`
 
 `--transcriptSelection=coverage|provided|canonical`
-: By default, the transcript isoform with the highest coverage is drawn. Alternatively, the transcript isoform that is provided in the columns `transcript_id1` and `transcript_id2` in the given fusions file can be drawn. Selecting the isoform with the highest coverage usually produces nicer plots, in the sense that the coverage track is smooth and shows a visible increase in coverage after the fusion breakpoint. However, the isoform with the highest coverage may not be the one that is involved in the fusion. Often, genomic rearrangements lead to non-canonical isoforms being transcribed. For this reason, it can make sense to rely on the transcript selection provided by the columns `transcript_id1/2`, which reflect the actual isoforms involved in a fusion. As a third option, the transcripts that are annotated as canonical can be drawn. Transcript isoforms tagged with `appris_principal`, `appris_candidate`, or `CCDS` are considered canonical. Default: `coverage`
+: By default, the transcript isoform with the highest coverage is drawn. Alternatively, the transcript isoform that is provided in the columns `transcript_id1` and `transcript_id2` in the given fusions file can be drawn. Selecting the isoform with the highest coverage usually produces nicer plots, in the sense that the coverage track is smooth and shows a visible increase in coverage after the fusion breakpoint. However, the isoform with the highest coverage may not be the one that is involved in the fusion. Often, genomic rearrangements lead to non-canonical isoforms being transcribed. For this reason, it can make sense to rely on the transcript selection provided by the columns `transcript_id1/2`, which reflect the actual isoforms involved in a fusion. As a third option, the transcripts that are annotated as canonical can be drawn. Transcript isoforms tagged with `appris_principal`, `appris_candidate`, or `CCDS` are considered canonical. Default: `provided`
 
 `--fixedScale=BASES`
 : By default, transcripts are scaled automatically to fill the entire page. This parameter enforces a fixed scale to be applied to all fusions, which is useful when a collection of fusions should be visualized and the sizes of all transcripts should be comparable. A common use case is the visualization of a gene that is found to be fused to multiple partners. By forcing all fusion plots to use the same scale, the fusions can be summarized as a collage in a single plot one above the other with matching scales. Note: The scale must be bigger than the sum of the biggest pair of transcripts to be drawn, or else dynamic scaling is applied, because display errors would occur otherwise. The default value is `0`, which means that no fixed scale should be used and that the scale should be adapted dynamically for each fusion. Default: `0`

diff --git a/documentation/current-limitations.md b/documentation/current-limitations.md
@@ -18,7 +18,7 @@ However, excessive memory consumption can indicate a user error. So before reduc
 Adapter trimming
 ----------------
 
-In most cases, it is not necessary to trim adapters, because STAR is capable of performing clipped alignments, such that reads with adapters will simply be aligned partially. For this reason, the demo script `run_arriba.sh` does not perform adapter trimming. However, if many reads contain a substantial fraction of adapter sequence, it can be beneficial to remove adapters for improved sensitivity, because STAR dismisses chimeric alignments when a big fraction of a read cannot be aligned (as controlled by the parameter `--outFilterMatchNminOverLread`). To this end, the STAR parameter `--clip3pAdapterSeq` can be used or a specialized adapter trimming tool.
+In most cases, it is not necessary to trim adapters, because STAR is capable of performing clipped alignments, such that reads with adapters will simply be aligned partially. For this reason, the demo script `run_arriba.sh` does not perform adapter trimming. However, if many reads contain a substantial fraction of adapter sequence, then these reads will not be aligned. Reads required for fusion detection will be affected in particular, because they are difficult to align anyway. This problem usually manifests as a high value for `% of reads unmapped: too short` in the log file of STAR. In this situation, it can be beneficial to remove adapters for improved sensitivity, because STAR dismisses chimeric alignments when a big fraction of a read cannot be aligned (as controlled by the parameter `--outFilterMatchNminOverLread`). To this end, the STAR parameter `--clip3pAdapterSeq` can be used or a specialized adapter trimming tool.
 
 Viral detection
 ---------------
@@ -43,3 +43,27 @@ Supporting read count vs. coverage
 
 The number of reads (or fragments) supporting a fusion are given in the columns `split_reads1/2` and `discordant_mates`. These columns only report reads which passed all filters and can be thought of as high-quality supporting reads. Reads which failed one or more filters are reported in the column `filters`. In contrast, the columns `coverage1/2` report all reads covering the fusion breakpoints. No filters are applied to coverage calculation, such that these numbers are not afflicted with the negative bias of the supporting reads columns. Most notably, the coverage calculation includes duplicates, whereas the supporting reads lack duplicates. Moreover, Arriba by default ignores supporting reads in excess of 300 for performance reasons (see also parameter `-U`). Therefore, the coverage values and supporting read counts are only roughly comparable - especially when a high number of duplicates is expected, for example with targeted sequencing libraries or highly expressed genes. Nevertheless, the implications on fusion calling are negligible, because few filters make use of coverage information. But users who desire consistent counting of supporting reads and coverage should remove (not just mark) all duplicates from the BAM file prior to running Arriba. This is currently the only way to obtain comparable counts.
 
+Supported organisms
+-------------------
+
+Arriba officially supports only human (hg19/GRCh37/hs37d5 or hg38/GRCh38) and mouse (mm10/GRCm38 or mm39/GRCm39). Other organisms or genome assemblies can be used in principle, but the results will be less accurate and the annotation incomplete. This is because important reference files are not available, including:
+
+- protein domain annotation, which is required to populate the column `retained_protein_domains`
+
+- a list of known fusions, which (marginally) improves sensitivity for known oncogenic driver fusions
+
+- the blacklist, which is essential for removing benign transcripts and recurrent artifacts
+
+These reference files are optional. Arriba can be run without them simply by omitting the respective command-line arguments (`-p` for the protein domain annotation, `-k` for the known fusions, and `-b` for the blacklist and, in addition, the blacklist filter must be disabled using the parameter `-f blacklist`).
+
+In order to improve the fusion calls from unsupported organisms, users can build their own blacklist. To this end, training samples are required. The more samples are used for training the better. For example, the blacklists for the officially supported organisms were trained on several hundred samples. For a robust blacklist, it is advisable to use a wide range of tissue types and sequencing/library preparation protocols so that the whole spectrum of benign transcripts and recurrent artifacts is captured by the blacklist. Ideally, one should choose training samples which are expected to harbor no somatic fusions, i.e., normal samples. Malignant samples can theoretically be used, too, but great care must be taken not to accidentally blacklist recurrent driver alterations.
+
+A blacklist can be built by simply running Arriba on the set of training samples. The breakpoint pairs to be blacklisted can then be extracted from the columns `breakpoint1/2` from both the main output file (as specified by the parameter `-o`) and the discarded fusions file (as specified by the parameter `-O`). The extracted breakpoint pairs must be stored in a tab-separated file with two columns - one for each breakpoint. Depending on the type of training samples used (normal vs. malignant), the recurrence threshold should be adjusted accordingly. If normal samples were used for training, any breakpoint pair which is found in more than one sample can be blacklisted as a recurrent artifact; for malignant training samples, the threshold should be much higher - at least as high as the most prevalent oncogenic driver fusion in the given disease. After the recurrent breakpoint pairs have been added to the blacklist, the list can optionally be fine-tuned further by adding special keywords. For example, when a certain gene is involved in a lot of artifacts even after the newly built blacklist has been applied, the gene may be blacklisted completely by putting the gene name in the first column and the keyword `any` in the second column. All valid keywords are described in the [section about the blacklist](input-files.md#blacklist).
+
+Supported aligners
+------------------
+
+In principle, Arriba is compatible with any RNA-Seq aligner which reports split reads and discordant mates in a format that is compliant with the [SAM format specification](https://samtools.github.io/hts-specs/SAMv1.pdf). That is, paired-end discordant mates must be marked as such by means of having the `BAM_FPROPER_PAIR (0x2)` flag unset, and split reads must be represented as supplementary alignments with the `BAM_FSUPPLEMENTARY (0x800)` flag set for the supplementary and a `SA` tag for the anchor read. However, Arriba currently has the limitation that it can only utilize supplementary alignments if there is exactly one supplementary alignment per read. Reads which have multiple supplementary alignments are ignored. Multi-mapping chimeric reads are recognized by Arriba provided that all SAM records pertaining to the same alignment have a `HI` tag and the tag has the same value. In other words, when a read maps to multiple loci, all SAM records pertaining to the first alignment must have the tag `HI:i:1`, and all SAM records pertaining to the second alignment must have the tag `HI:i:2`, and so on.
+
+Alignment tools that have been tested successfully with Arriba are the [STAR aligner](https://github.com/alexdobin/STAR) and Illumina's Dragen aligner. (Note: The open-source implementation of Dragen, DRAGMAP, is not compatible with Arriba, since it is not suitable for RNA-Seq data at the time of this writing.) STAR is preferred over Dragen, because it is better at aligning split reads and because multi-mapping reads are stored in a way that Arriba can handle. Users who want to use an incompatible aligner in conjunction with Arriba can run the script [`run_arriba_on_prealigned_bam.sh`](../utility-scripts/#run-arriba-on-prealigned-bam-file), which takes a BAM file (aligned by any aligner) as input and uses STAR to realign only those reads which are relevant to fusion detection, namely, clipped and unmapped reads. The fusion calls from this workflow should be close to the recommended workflow based entirely on STAR, but avoids having to realign the entire BAM file just for the sake of fusion detection, thus saving CPU time.
+