getITD for FLT3-ITD-based MRD monitoring in AML
https://doi.org/10.1038/s41375-019-0483-z
T.J. Luck, née Blätte
getITD was developed for FLT3-ITD detection from Illumina sequencing data. Adaptations to data generated by similar assays, sequencing technologies or target regions should be possible. In fact, getITD was successfully used to analyze 454 data of different amplicons as presented at ASH 2018 (RĂĽcker et al. "Prognostic Impact of Insertion Site in Acute Myeloid Leukemia (AML) with FLT3 Internal Tandem Duplication: Results from the Ratify Study (Alliance 10603)".).
The core getITD analysis program is contained within getitd.py. In addition, two wrapper scripts, make_getitd_config.py and getitd_from_config_wrapper.py, which do not require use of the commandline are provided to facilitate getITD analysis.
Setup and usage of all programs is described in detail below.
python3
- pandas
- numpy
- biopython
- easygui
- matplotlib
The pandas, numpy and biopython modules are required for the core getITD program. The easygui and matplotlib modules are optional: easygui is only required by the make_getitd_config.py script, whereas matplotlib is required for coverage plotting (optionally activated using -plot_coverage True
).
In case you are working behind a firewall, set the appropriate proxy server (ask your IT administrator if you do not know which one this is) by opening the commandline and entering the following commands:
export https_proxy=your.proxy.server:your_port
export http_proxy=your.proxy.server:your_port
To install the necessary python3 modules, open the commandline and enter:
pip3 install --user numpy pandas biopython easygui matplotlib
Commands provided in this manual assume that you are working from within the folder that contains the getitd.py script. Your current folder is shown at the beginning of the commandline. To go up in the folder hierarchy, enter:
cd ..
To enter a certain subfolder, type:
cd EXAMPLE
(Replace EXAMPLE with the name of the folder you are trying to enter.)
Repeat these two steps to reach the desired destination.
Alternatively, provide full or relative paths to all files, including the getitd.py script, input FASTQ files and reference and annotation files, whenever you use getITD.
If necessary, download and install the latest python3 from https://www.python.org/downloads/windows/. Be sure to check the box to add it to your PATH. Check that python works by opening the Windows commandline (cmd), typing
py -3
and pressing ENTER. If python3 was correctly installed and added to your PATH, this will open the python interpreter. Exit by typing
exit()
and pressing ENTER.
Download and install Visual Studio C++ Build tools from https://visualstudio.microsoft.com/visual-cpp-build-tools (click Download Build Tools, scroll down to All Downloads and select Build Tools for Visual Studio under Tools for Visual Studio).
For installation, select the following options in addition to the defaults:
- "C++/CLI support"
- "VC++ 2015.3 v14.00 (v140) toolset for desktop"
In case you are working behind a firewall, set the appropriate proxy server (ask your IT administrator if you do not know which one this is) by entering the following commands in the Windws commandline (cmd):
set https_proxy=your.proxy.server:your_port
set http_proxy=your.proxy.server:your_port
Open the Windows commandline (cmd) and install the required python3 modules:
pip3 install --user pandas numpy biopython easygui matplotlib
If this command fails with a connection error such as the one shown below, you are working behind a firewall and first need to set the correct proxy server as described above under Set proxy):
Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x00000218168F2A90>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /simple/pandas/
Commands provided in this manual assume that you are working from within the folder that contains the getitd.py script. Your current folder is shown at the beginning of the commandline. To go up in the folder hierarchy, enter:
cd ..
To enter a certain subfolder, type:
cd EXAMPLE
(Replace EXAMPLE with the name of the folder you are trying to enter.)
Repeat these two steps to reach the desired destination.
Alternatively, provide full or relative paths to all files, including the getitd.py script, input FASTQ files and reference and annotation files, whenever you use getITD.
The only required inputs are a sample ID to name output files and folders and the actual FASTQ files containing the sequencing data to process. Supply these in the correct order (see also the examples below):
Position | Description |
---|---|
1st | sample ID which is used for naming of output files and folders. This parameter is required. |
2nd | Path and filename of the FASTQ file containing forward reads to process, by convention the R1 FASTQ file. When processing single-end sequencing data, supply the single FASTQ file that was generated. This parameter is required. |
3rd | Path and filename of the FASTQ file containing reverse reads to process, by convention the R2 FASTQ file generated by paired-end sequencing. This parameter is optional. |
Note that FASTQ files may be provided as both compressed (gzipped) *.fastq.gz
or uncompressed *.fastq
files.
getITD will automatically detect file type and appropriately read them in; commands do not need to be changed.
Many optional parameters are available to customize the analysis:
Parameter | Description |
---|---|
-reference X |
Path and filename of the reference to align to. This file should contain a single line of text, namely the WT sequence of the amplicon that was processed. getITD aligns all reads to this reference and detects insertions and ITDs relative to it. Default: ./anno/amplicon.txt. |
-anno X |
Path and filename of the annotation file used to retrieve chromosomal, transcriptomic and proteomic coordinates and domains for each insertion and ITD identified by getITD. Columns should be tab-separated with header amplicon_bp, region, chr_bp, transcript_bp, protein_as. Default: ./anno/amplicon_kayser.tsv. |
-forward_primer X |
Gene-specific sequence(s) of the forward primer(s) used to generate the sequenced amplicon. The sequence should be identical to the 5' end of supplied forward reads. When sequencing a pool of multiple amplicons, provide information on all primer pairs used. Separate individual sequences by space. If this is the last optional parameter used, pass also " -- " to signal that what follows are not more primer sequences. Default: GCAATTTAGGTATGAAAGCCAGCTAC. |
-reverse_primer X |
Gene-specific sequence(s) of the reverse primer(s) used. This sequence should be identical to the 5' end of supplied reverse reads. When sequencing a pool of multiple amplicons, provide information on all primer pairs used. Separate individual sequences by space. If this is the last optional parameter used, pass also " -- " to signal that what follows are not more primer sequences. Default: CTTTCAGCATTTTGACGGCAACC. |
-require_indel_free_primers X |
If True, discard reads containing insertions or deletions within the primer sequence, as these indicate low sequence fidelity. Note that, if set to True, this also filters reads not containing any primer sequence at all, so this must be set to False in case primers have been trimmed. Note that adapters but not primers should be trimmed. Default: True. |
-forward_adapter X |
Sequencing adapter of the forward reads' primer as (potentially) present at the 5' end of the supplied forward reads, 5' of the gene-specific primer sequence. Default: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGA. |
-reverse_adapter X |
Sequencing adapter of the reverse reads' primer as (potentially) present at the 5' end of the supplied reverse reads, 5' of the gene-specific primer sequence. Default: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGA. |
-plot_coverage X |
If True, plot coverage distribution across the reference to a png file. Default: False. |
-technology X |
Sequencing technology used, options are 454 and Illumina. Currently this has no effect for Illumina data. For 454 data, -infer_sense_from_alignment is set to True and -min_read_copies is set to 1, command line options passed to these parameters are then ignored. Default: Illumina. |
-infer_sense_from_alignment X |
If True, infer each read's sense from alignment by aligning both its actual sequence as well as its reverse-complement and keeping whichever achieves the higher mapping score to the reference. If False, infer each read's sense from the file of origin (R1 vs R2 FASTQ). Default: False. |
-nkern X |
The number of cores to use for parallel tasks. Set to 1 to disable parallelization. Default: 12. |
-gap_open X |
Alignment cost of a gap opening. Default: -36 |
-gap_extend X |
Alignment cost of a gap extension. Default: -0.5 |
-match X |
Alignment cost of a base match. Default: 5 |
-mismatch X |
Alignment cost of a base mismatch. Default: -15 |
-minscore_inserts X |
Fraction of the maximum possible alignment score required between inserts and respective WT tandems when deciding whether these are sufficiently similar to consider them ITDs and inserts supported by distinct reads when determining whether these are supporting the same mutation or not. Higher values will require sequences in both cases to be more similar. Default: 0.5. |
-minscore_alignments X |
Fraction of the maximum possible alignment score required between each read and the reference. Reads that do not pass this filter are discarded to eliminate low fidelity sequences. Default: 0.4. |
-min_bqs X |
Minimum average base quality score (BQS) required by each read. Reads that do not pass this filter are discarded to eliminate sequences likely to contain sequencing errors. Default: 30. |
-min_insert_seq_length X |
Minimum number of sequenced insert bases required for getITD to call an insertion / ITD. Default: 6. Increase to 9 or higher to avoid false positives in lower quality samples. |
-min_read_length X |
Minimum read length required by each read. This filter is applied after trailing N bases are trimmed at both the 5' and 3' end. Reads that do not pass this filter are discarded. Default: 100. |
-min_read_copies X |
Minimum number of copies of each read required for processing. When set to 2, all unique read sequences are discarded, based on the assumption that true and clinically relevant sequences will be present at least twice. Disable this filter, by setting it to 1, when not all reads have the same length as it will otherwise incorrectly discard the majority of input reads. Default: 2. |
-filter_ins_unique_reads X |
Minimum number of unique supporting reads required for an insertion / ITD to be considered high confidence. Disable this filter by setting it to 1. Default: 2. |
-filter_ins_total_reads X |
Minimum number of total supporting reads required for an insertion / ITD to be considered high confidence. Default: 1 (disabled). |
-filter_ins_vaf X |
Minimum variant allele frequency (VAF), in percent, required for an insertion / ITD to be considered high confidence. Default: 0.006. Set to 0.2 for data multiplexed using single-unique barcodes. |
For a complete list of available commandline arguments, a short description and respective default values, change to the getitd folder and, on Linux & MacOS, run
python3 getitd.py --help
On Windows, run
py -3 getitd.py --help
This will print
usage: getitd.py [-h] [-reference REFERENCE] [-anno ANNO]
[-forward_primer FORWARD_PRIMER [FORWARD_PRIMER ...]]
[-reverse_primer REVERSE_PRIMER [REVERSE_PRIMER ...]]
[-require_indel_free_primers REQUIRE_INDEL_FREE_PRIMERS]
[-forward_adapter FORWARD_ADAPTER]
[-reverse_adapter REVERSE_ADAPTER]
[-plot_coverage PLOT_COVERAGE] [-technology {Illumina,454}]
[-infer_sense_from_alignment INFER_SENSE_FROM_ALIGNMENT]
[-nkern NKERN] [-gap_open GAP_OPEN] [-gap_extend GAP_EXTEND]
[-match MATCH] [-mismatch MISMATCH]
[-max_trailing_bp MAX_TRAILING_BP]
[-minscore_inserts MINSCORE_INSERTS]
[-minscore_alignments MINSCORE_ALIGNMENTS] [-min_bqs MIN_BQS]
[-min_read_length MIN_READ_LENGTH]
[-min_read_copies MIN_READ_COPIES]
[-min_insert_seq_length MIN_INSERT_SEQ_LENGTH]
[-filter_ins_unique_reads FILTER_INS_UNIQUE_READS]
[-filter_ins_total_reads FILTER_INS_TOTAL_READS]
[-filter_ins_vaf FILTER_INS_VAF]
sampleID fastq1 [fastq2]
positional arguments:
sampleID sample ID used as output folder prefix (REQUIRED)
fastq1 FASTQ file (optionally gzipped) of forward reads (REQUIRED)
fastq2 FASTQ file (optionally gzipped) of reverse reads (optional)
optional arguments:
-h, --help show this help message and exit
-reference REFERENCE WT amplicon sequence as reference for read alignment
(default ./anno/amplicon.txt)
-anno ANNO WT amplicon sequence annotation (default
./anno/amplicon_kayser.tsv)
-forward_primer FORWARD_PRIMER [FORWARD_PRIMER ...]
Forward primer gene-specific sequence(s) as present at
the 5' end of supplied forward reads. Separate by
space when supplying more than one (default
GCAATTTAGGTATGAAAGCCAGCTAC)
-reverse_primer REVERSE_PRIMER [REVERSE_PRIMER ...]
Reverse primer gene-specific sequence(s) as present at
the 5' end of supplied reverse reads. Separate by
space when supplying more than one (default
CTTTCAGCATTTTGACGGCAACC)
-require_indel_free_primers REQUIRE_INDEL_FREE_PRIMERS
If True, discard i) reads containing insertions or
deletions within the primer sequence and ii) reads not
containing any primer sequence. Set to False if these
have been trimmed (default True)
-forward_adapter FORWARD_ADAPTER
Sequencing adapter of the forward reads' primer as
(potentially) present at the 5' end of the supplied
forward reads, 5' of the gene-specific primer sequence
(default TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGA)
-reverse_adapter REVERSE_ADAPTER
Sequencing adapter of the reverse reads' primer as
(potentially) present at the 5' end of the supplied
reverse reads, 5' of the gene-specific primer sequence
(default GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGA)
-plot_coverage PLOT_COVERAGE
If True, plot read coverage across the reference to
'coverage.png' in the respective output folder
(default False)
-technology {Illumina,454}
Sequencing technology used, options are '454' or
'Illumina' (default). '454' sets
-infer_sense_from_alignment to True and
-min_read_copies to 1, regardless of the respective
command line options used; 'Illumina' will instead use
these command line options or their respective
defaults.
-infer_sense_from_alignment INFER_SENSE_FROM_ALIGNMENT
If True, infer each read's sense by aligning it as a
forward and reverse read and keeping the better
alignment (default False).
-nkern NKERN number of cores to use for parallel tasks (default 12)
-gap_open GAP_OPEN alignment cost of gap opening (default -36)
-gap_extend GAP_EXTEND
alignment cost of gap extension (default -0.5)
-match MATCH alignment cost of base match (default 5)
-mismatch MISMATCH alignment cost of base mismatch (default -15)
-max_trailing_bp MAX_TRAILING_BP
maximum number of aligned bp between the start / end
of an insertion and the start / end of the read to
consider the insertion 'trailing'. Trailing insertions
are not required to be in-frame and will be considered
ITDs even if the matching WT tandem is not directly
adjacent. Set this to 0 to disable (default 0).
-minscore_inserts MINSCORE_INSERTS
fraction of max possible alignment score required for
ITD detection and insert collapsing (default 0.5)
-minscore_alignments MINSCORE_ALIGNMENTS
fraction of max possible alignment score required for
a read to pass when aligning reads to amplicon
reference (default 0.4)
-min_bqs MIN_BQS minimum average base quality score (BQS) required by
each read (default 30)
-min_read_length MIN_READ_LENGTH
minimum read length in bp required after N-trimming
(default 100)
-min_read_copies MIN_READ_COPIES
minimum number of copies of each read required for
processing (1 to turn filter off, 2 (default) to
discard unique reads)
-min_insert_seq_length MIN_INSERT_SEQ_LENGTH
minimum number of insert basepairs which must be
sequenced of each insert for it to be considered by
getITD. For non-trailing ITDs, this is the minimum
insert length; for trailing ITDs, it is the minimum
number of bp of a potentially longer ITD which have to
be sequenced (default 6).
-filter_ins_unique_reads FILTER_INS_UNIQUE_READS
minimum number of unique reads required to support an
insertion for it to be considered 'high confidence'
(default 2)
-filter_ins_total_reads FILTER_INS_TOTAL_READS
minimum number of total reads required to support an
insertion for it to be considered 'high confidence'
(default 1)
-filter_ins_vaf FILTER_INS_VAF
minimum variant allele frequency (VAF) required for an
insertion to be considered 'high confidence' (default
0.006)
For each sample, an output directory will be created in the current directory, named using the provided sample ID. Inside, all of the generated output files reside:
- config.txt contains parameters, date and time of the analysis, as well as the getITD version that was used
- stats.txt contains the number of reads and insertions / ITDs processed and filtered at each step of the analysis
- coverage.txt contains the read coverage across the reference sequence
- coverage.png contains a plot of that same coverage data (optionally created using
-plot_coverage True
) - out_needle/ contains individual alignment files, needle_*.txt, of all the different reads processed
- itds_collapsed-is-same_is-similar_is-close_is-same-trailing_hc.tsv contains filtered high-confidence (hc) ITDs, fully merged
- itds_collapsed-is-same_is-similar_is-close_is-same-trailing.tsv contains all ITDs, fully merged
- itds_collapsed-is-same_is-similar_is-close.tsv contains all ITDs, having merged those of the same insertion length, nearby insertion sites (within one tandem length) and similar insert sequences
- itds_collapsed-is-same_is-similar.tsv contains all ITDs, having merged those of the same insertion length and site and similar insert sequences
- itds_collapsed-is-same.tsv contains all ITDs, having merged those that share the same insertion length, site and sequence
- insertions*.tsv files are analogous to ITD files but list all insertions, regardless of whether these are also ITDs or not
Note that insertion*tsv
and itd*tsv
files are only created if at least one insertion or ITD is found, respectively. Config and stats files should exist for all samples.
At each filtering step, the number of reads and insertions passing the specified requirements are printed. This basically corresponds to the content of stats.txt. Currently, also the computation time of various steps is printed, but presumably this will change in the future.
To analyze the provided test data, change to the getitd folder and, on Linux / MacOS, run:
python3 getitd.py test test/test_R1.fastq test/test_R2.fastq
On Windows, run:
py -3 getitd.py test test\test_R1.fastq test\test_R2.fastq
Expected output:
==== PROCESSING SAMPLE test ====
-- Reading FASTQ files --
Reading FASTQ files took 0.1 s
Number of total reads: 5000
Number of total reads > 100bp remaining after N-trimming: 5000 (100.0 %)
Mean read length after N-trimming: 251.0
Number of total reads with mean BQS >= 30: 4544 (90.88 %)
Getting unique reads took 0.06 s
Number of unique reads with mean BQS >= 30: 987
Number of unique reads with at least 2 copies: 170
Total reads remaining for analysis: 3727 (74.54 %)
-- Aligning to Reference --
Alignment took 1.13 s
Filtering 0 / 170 low quality alignments with a score < 40.0 % of max
Filtering 20 / 170 alignments with indels in primer bases
Total reads remaining for analysis: 3669 (73.38 %)
Calculating coverage took 0.09 s
-- Looking for insertions & ITDs --
Collecting inserts took 0.04 s
44 inserts >= 6 bp were found
0/44 insertions were part of adapters and filtered
Filtering inserts for adapter sequences took 0.0 s
Annotating coverage took 0.0 s
Collecting ITDs took 0.35 s
44 ITDs were found
-- Merging insertions --
3 insertions remain after merging
2 insertions remain after merging
1 insertions remain after merging
1 insertions remain after merging
-- Merging itds --
3 itds remain after merging
2 itds remain after merging
1 itds remain after merging
1 itds remain after merging
Merging took 0.15 s
-- Filtering insertions --
Filtered 0 / 1 insertions based on the number of unique supporting reads
Filtered 0 / 1 insertions based on the number of total supporting reads
Filtered 0 / 1 insertions based on the vaf
1 insertions remain after filtering!
-- Filtering itds --
Filtered 0 / 1 itds based on the number of unique supporting reads
Filtered 0 / 1 itds based on the number of total supporting reads
Filtered 0 / 1 itds based on the vaf
1 itds remain after filtering!
Filtering took 0.03 s
To change, for example, the number of cores used for the analysis to 2, on Linux / MacOS instead run:
python3 getitd.py -nkern 2 test test/test_R1.fastq test/test_R2.fastq
On Windows, run:
py -3 getitd.py -nkern=2 test test\test_R1.fastq test\test_R2.fastq
Because an arbitrary number of primer sequences can be supplied following the -forward_primer
and -reverse_primer
parameters, an explicit signal is required to identify the end of the given list of primer sequences. This can be either any other optional argument which reads in only one word, such as -nkern
, or the 'empty' argument --
:
python3 getitd.py -forward_primer GCAATTTAGGTATGAAAGCCAGCTAC -nkern 2 test test/test_R1.fastq test/test_R2.fastq
or
python3 getitd.py -forward_primer GCAATTTAGGTATGAAAGCCAGCTAC -- test test/test_R1.fastq test/test_R2.fastq
- Read in FASTQ files
- Read in forward reads and BQS (R1)
- Read in and reverse complement reverse reads and BQS (R2)
- Trim trailing ambiguous N bases on all reads
- Filter reads for minimum average base quality score (BQS)
- Discard low quality reads
- Filter unique reads
- Assume that true & clinically relevant sequences will be present at least twice and discard unique reads
- Align each read to the WT amplicon reference sequence using Needlemann-Wunsch alignment algorithm
- Filter alignments, require
- at least 40% of the maximum possible alignment score
- gap-free alignment to the given primer sequence
- Collect insertions within passing alignments, require
- insert length of at least 6 bp
- absence of ambiguous "N" bases within the actual insert sequence
- in-frame insert (length divisible by 3) for inserts fully contained within a read
- 3' or 5' trailing inserts not fully spanned by the sequenced reads are not required to be in-frame, since their exact length is not actually known
- Collect ITDs from passing insertions, require
- insert sequence can be realigned to the WT amplicon sequence, with at least 50% of the maximum possible alignment score
- for 3' and 5' trailing inserts that insert sequence is not an adapter artefact
- Merge total insertions and ITDs independently (considering reads to describe the same event and adding up supporting counts), require
- that they are actually the same: insert length, site and sequence are identical
- that they are similar: insert length and site are the same, sequences are similar
- that they are close: insert length is the same, sequences are similar and sites are within one insert length of each other
- close trailing inserts with similar sequences are considered the same regardless of their (estimated) lengths
- Filter insertions and ITDs independently, require
- at least two different supporting reads
- a minimum variant allele frequency (VAF) of 0.006 %
Two additional scripts are provided to facilitate getITD analysis by alleviating the need to use the commandline. The first, make_getitd_config.py, when double-clicked, creates a simple graphical interface and asks the user to select desired analysis parameters and annotation / reference files. These parameters are then saved to a config.txt file. When this config.txt is placed in a folder containing both getitd_from_config_wrapper.py and an arbitrary number of FASTQ files to analyze, double-clicking the wrapper script will initiate sequential analysis of all of the provided FASTQ files using parameters stored in config.txt. Though these wrapper scripts were originally intended primarily to facilitate analysis on Windows, they run also on Linux and Mac.