Skip to content

Software for FLT3-ITD detection and MRD monitoring assay

License

Notifications You must be signed in to change notification settings

tjblaette/getitd

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

getITD v1.5.16

getITD for FLT3-ITD-based MRD monitoring in AML

https://doi.org/10.1038/s41375-019-0483-z

Author / Support

T.J. Luck, née Blätte

Overview

getITD was developed for FLT3-ITD detection from Illumina sequencing data. Adaptations to data generated by similar assays, sequencing technologies or target regions should be possible. In fact, getITD was successfully used to analyze 454 data of different amplicons as presented at ASH 2018 (RĂĽcker et al. "Prognostic Impact of Insertion Site in Acute Myeloid Leukemia (AML) with FLT3 Internal Tandem Duplication: Results from the Ratify Study (Alliance 10603)".).

The core getITD analysis program is contained within getitd.py. In addition, two wrapper scripts, make_getitd_config.py and getitd_from_config_wrapper.py, which do not require use of the commandline are provided to facilitate getITD analysis.

Setup and usage of all programs is described in detail below.

Requirements

python3

  • pandas
  • numpy
  • biopython
  • easygui
  • matplotlib

The pandas, numpy and biopython modules are required for the core getITD program. The easygui and matplotlib modules are optional: easygui is only required by the make_getitd_config.py script, whereas matplotlib is required for coverage plotting (optionally activated using -plot_coverage True).

Setup (Linux / Ubuntu & MacOS)

If applicable: Set proxy

In case you are working behind a firewall, set the appropriate proxy server (ask your IT administrator if you do not know which one this is) by opening the commandline and entering the following commands:

export https_proxy=your.proxy.server:your_port
export http_proxy=your.proxy.server:your_port

Python3 modules

To install the necessary python3 modules, open the commandline and enter:

pip3 install --user numpy pandas biopython easygui matplotlib
For convenience: Change to the folder containing the getitd.py program and anno subfolder

Commands provided in this manual assume that you are working from within the folder that contains the getitd.py script. Your current folder is shown at the beginning of the commandline. To go up in the folder hierarchy, enter:

cd ..

To enter a certain subfolder, type:

cd EXAMPLE

(Replace EXAMPLE with the name of the folder you are trying to enter.)

Repeat these two steps to reach the desired destination.

Alternatively, provide full or relative paths to all files, including the getitd.py script, input FASTQ files and reference and annotation files, whenever you use getITD.

Setup (Windows)

Python3

If necessary, download and install the latest python3 from https://www.python.org/downloads/windows/. Be sure to check the box to add it to your PATH. Check that python works by opening the Windows commandline (cmd), typing

py -3

and pressing ENTER. If python3 was correctly installed and added to your PATH, this will open the python interpreter. Exit by typing

exit()

and pressing ENTER.

Visual Studio C++ Build Tools

Download and install Visual Studio C++ Build tools from https://visualstudio.microsoft.com/visual-cpp-build-tools (click Download Build Tools, scroll down to All Downloads and select Build Tools for Visual Studio under Tools for Visual Studio).

For installation, select the following options in addition to the defaults:

  • "C++/CLI support"
  • "VC++ 2015.3 v14.00 (v140) toolset for desktop"

If applicable: Set proxy

In case you are working behind a firewall, set the appropriate proxy server (ask your IT administrator if you do not know which one this is) by entering the following commands in the Windws commandline (cmd):

set https_proxy=your.proxy.server:your_port
set http_proxy=your.proxy.server:your_port

Install python3 modules:

Open the Windows commandline (cmd) and install the required python3 modules:

pip3 install --user pandas numpy biopython easygui matplotlib

If this command fails with a connection error such as the one shown below, you are working behind a firewall and first need to set the correct proxy server as described above under Set proxy):

Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x00000218168F2A90>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /simple/pandas/

For convenience: Change to the folder containing the getitd.py program and anno subfolder

Commands provided in this manual assume that you are working from within the folder that contains the getitd.py script. Your current folder is shown at the beginning of the commandline. To go up in the folder hierarchy, enter:

cd ..

To enter a certain subfolder, type:

cd EXAMPLE

(Replace EXAMPLE with the name of the folder you are trying to enter.)

Repeat these two steps to reach the desired destination.

Alternatively, provide full or relative paths to all files, including the getitd.py script, input FASTQ files and reference and annotation files, whenever you use getITD.

getITD core program

Input

Required:

The only required inputs are a sample ID to name output files and folders and the actual FASTQ files containing the sequencing data to process. Supply these in the correct order (see also the examples below):

Position Description
1st sample ID which is used for naming of output files and folders. This parameter is required.
2nd Path and filename of the FASTQ file containing forward reads to process, by convention the R1 FASTQ file. When processing single-end sequencing data, supply the single FASTQ file that was generated. This parameter is required.
3rd Path and filename of the FASTQ file containing reverse reads to process, by convention the R2 FASTQ file generated by paired-end sequencing. This parameter is optional.

Note that FASTQ files may be provided as both compressed (gzipped) *.fastq.gz or uncompressed *.fastq files. getITD will automatically detect file type and appropriately read them in; commands do not need to be changed.

Optional:

Many optional parameters are available to customize the analysis:

Parameter Description
-reference X Path and filename of the reference to align to. This file should contain a single line of text, namely the WT sequence of the amplicon that was processed. getITD aligns all reads to this reference and detects insertions and ITDs relative to it. Default: ./anno/amplicon.txt.
-anno X Path and filename of the annotation file used to retrieve chromosomal, transcriptomic and proteomic coordinates and domains for each insertion and ITD identified by getITD. Columns should be tab-separated with header amplicon_bp, region, chr_bp, transcript_bp, protein_as. Default: ./anno/amplicon_kayser.tsv.
-forward_primer X Gene-specific sequence(s) of the forward primer(s) used to generate the sequenced amplicon. The sequence should be identical to the 5' end of supplied forward reads. When sequencing a pool of multiple amplicons, provide information on all primer pairs used. Separate individual sequences by space. If this is the last optional parameter used, pass also " -- " to signal that what follows are not more primer sequences. Default: GCAATTTAGGTATGAAAGCCAGCTAC.
-reverse_primer X Gene-specific sequence(s) of the reverse primer(s) used. This sequence should be identical to the 5' end of supplied reverse reads. When sequencing a pool of multiple amplicons, provide information on all primer pairs used. Separate individual sequences by space. If this is the last optional parameter used, pass also " -- " to signal that what follows are not more primer sequences. Default: CTTTCAGCATTTTGACGGCAACC.
-require_indel_free_primers X If True, discard reads containing insertions or deletions within the primer sequence, as these indicate low sequence fidelity. Note that, if set to True, this also filters reads not containing any primer sequence at all, so this must be set to False in case primers have been trimmed. Note that adapters but not primers should be trimmed. Default: True.
-forward_adapter X Sequencing adapter of the forward reads' primer as (potentially) present at the 5' end of the supplied forward reads, 5' of the gene-specific primer sequence. Default: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGA.
-reverse_adapter X Sequencing adapter of the reverse reads' primer as (potentially) present at the 5' end of the supplied reverse reads, 5' of the gene-specific primer sequence. Default: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGA.
-plot_coverage X If True, plot coverage distribution across the reference to a png file. Default: False.
-technology X Sequencing technology used, options are 454 and Illumina. Currently this has no effect for Illumina data. For 454 data, -infer_sense_from_alignment is set to True and -min_read_copies is set to 1, command line options passed to these parameters are then ignored. Default: Illumina.
-infer_sense_from_alignment X If True, infer each read's sense from alignment by aligning both its actual sequence as well as its reverse-complement and keeping whichever achieves the higher mapping score to the reference. If False, infer each read's sense from the file of origin (R1 vs R2 FASTQ). Default: False.
-nkern X The number of cores to use for parallel tasks. Set to 1 to disable parallelization. Default: 12.
-gap_open X Alignment cost of a gap opening. Default: -36
-gap_extend X Alignment cost of a gap extension. Default: -0.5
-match X Alignment cost of a base match. Default: 5
-mismatch X Alignment cost of a base mismatch. Default: -15
-minscore_inserts X Fraction of the maximum possible alignment score required between inserts and respective WT tandems when deciding whether these are sufficiently similar to consider them ITDs and inserts supported by distinct reads when determining whether these are supporting the same mutation or not. Higher values will require sequences in both cases to be more similar. Default: 0.5.
-minscore_alignments X Fraction of the maximum possible alignment score required between each read and the reference. Reads that do not pass this filter are discarded to eliminate low fidelity sequences. Default: 0.4.
-min_bqs X Minimum average base quality score (BQS) required by each read. Reads that do not pass this filter are discarded to eliminate sequences likely to contain sequencing errors. Default: 30.
-min_insert_seq_length X Minimum number of sequenced insert bases required for getITD to call an insertion / ITD. Default: 6. Increase to 9 or higher to avoid false positives in lower quality samples.
-min_read_length X Minimum read length required by each read. This filter is applied after trailing N bases are trimmed at both the 5' and 3' end. Reads that do not pass this filter are discarded. Default: 100.
-min_read_copies X Minimum number of copies of each read required for processing. When set to 2, all unique read sequences are discarded, based on the assumption that true and clinically relevant sequences will be present at least twice. Disable this filter, by setting it to 1, when not all reads have the same length as it will otherwise incorrectly discard the majority of input reads. Default: 2.
-filter_ins_unique_reads X Minimum number of unique supporting reads required for an insertion / ITD to be considered high confidence. Disable this filter by setting it to 1. Default: 2.
-filter_ins_total_reads X Minimum number of total supporting reads required for an insertion / ITD to be considered high confidence. Default: 1 (disabled).
-filter_ins_vaf X Minimum variant allele frequency (VAF), in percent, required for an insertion / ITD to be considered high confidence. Default: 0.006. Set to 0.2 for data multiplexed using single-unique barcodes.

For a complete list of available commandline arguments, a short description and respective default values, change to the getitd folder and, on Linux & MacOS, run

python3 getitd.py --help

On Windows, run

py -3 getitd.py --help

This will print

usage: getitd.py [-h] [-reference REFERENCE] [-anno ANNO]
                 [-forward_primer FORWARD_PRIMER [FORWARD_PRIMER ...]]
                 [-reverse_primer REVERSE_PRIMER [REVERSE_PRIMER ...]]
                 [-require_indel_free_primers REQUIRE_INDEL_FREE_PRIMERS]
                 [-forward_adapter FORWARD_ADAPTER]
                 [-reverse_adapter REVERSE_ADAPTER]
                 [-plot_coverage PLOT_COVERAGE] [-technology {Illumina,454}]
                 [-infer_sense_from_alignment INFER_SENSE_FROM_ALIGNMENT]
                 [-nkern NKERN] [-gap_open GAP_OPEN] [-gap_extend GAP_EXTEND]
                 [-match MATCH] [-mismatch MISMATCH]
                 [-max_trailing_bp MAX_TRAILING_BP]
                 [-minscore_inserts MINSCORE_INSERTS]
                 [-minscore_alignments MINSCORE_ALIGNMENTS] [-min_bqs MIN_BQS]
                 [-min_read_length MIN_READ_LENGTH]
                 [-min_read_copies MIN_READ_COPIES]
                 [-min_insert_seq_length MIN_INSERT_SEQ_LENGTH]
                 [-filter_ins_unique_reads FILTER_INS_UNIQUE_READS]
                 [-filter_ins_total_reads FILTER_INS_TOTAL_READS]
                 [-filter_ins_vaf FILTER_INS_VAF]
                 sampleID fastq1 [fastq2]

positional arguments:
  sampleID              sample ID used as output folder prefix (REQUIRED)
  fastq1                FASTQ file (optionally gzipped) of forward reads (REQUIRED)
  fastq2                FASTQ file (optionally gzipped) of reverse reads (optional)

optional arguments:
  -h, --help            show this help message and exit
  -reference REFERENCE  WT amplicon sequence as reference for read alignment
                        (default ./anno/amplicon.txt)
  -anno ANNO            WT amplicon sequence annotation (default
                        ./anno/amplicon_kayser.tsv)
  -forward_primer FORWARD_PRIMER [FORWARD_PRIMER ...]
                        Forward primer gene-specific sequence(s) as present at
                        the 5' end of supplied forward reads. Separate by
                        space when supplying more than one (default
                        GCAATTTAGGTATGAAAGCCAGCTAC)
  -reverse_primer REVERSE_PRIMER [REVERSE_PRIMER ...]
                        Reverse primer gene-specific sequence(s) as present at
                        the 5' end of supplied reverse reads. Separate by
                        space when supplying more than one (default
                        CTTTCAGCATTTTGACGGCAACC)
  -require_indel_free_primers REQUIRE_INDEL_FREE_PRIMERS
                        If True, discard i) reads containing insertions or
                        deletions within the primer sequence and ii) reads not
                        containing any primer sequence. Set to False if these
                        have been trimmed (default True)
  -forward_adapter FORWARD_ADAPTER
                        Sequencing adapter of the forward reads' primer as
                        (potentially) present at the 5' end of the supplied
                        forward reads, 5' of the gene-specific primer sequence
                        (default TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGA)
  -reverse_adapter REVERSE_ADAPTER
                        Sequencing adapter of the reverse reads' primer as
                        (potentially) present at the 5' end of the supplied
                        reverse reads, 5' of the gene-specific primer sequence
                        (default GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGA)
  -plot_coverage PLOT_COVERAGE
                        If True, plot read coverage across the reference to
                        'coverage.png' in the respective output folder
                        (default False)
  -technology {Illumina,454}
                        Sequencing technology used, options are '454' or
                        'Illumina' (default). '454' sets
                        -infer_sense_from_alignment to True and
                        -min_read_copies to 1, regardless of the respective
                        command line options used; 'Illumina' will instead use
                        these command line options or their respective
                        defaults.
  -infer_sense_from_alignment INFER_SENSE_FROM_ALIGNMENT
                        If True, infer each read's sense by aligning it as a
                        forward and reverse read and keeping the better
                        alignment (default False).
  -nkern NKERN          number of cores to use for parallel tasks (default 12)
  -gap_open GAP_OPEN    alignment cost of gap opening (default -36)
  -gap_extend GAP_EXTEND
                        alignment cost of gap extension (default -0.5)
  -match MATCH          alignment cost of base match (default 5)
  -mismatch MISMATCH    alignment cost of base mismatch (default -15)
  -max_trailing_bp MAX_TRAILING_BP
                        maximum number of aligned bp between the start / end
                        of an insertion and the start / end of the read to
                        consider the insertion 'trailing'. Trailing insertions
                        are not required to be in-frame and will be considered
                        ITDs even if the matching WT tandem is not directly
                        adjacent. Set this to 0 to disable (default 0).
  -minscore_inserts MINSCORE_INSERTS
                        fraction of max possible alignment score required for
                        ITD detection and insert collapsing (default 0.5)
  -minscore_alignments MINSCORE_ALIGNMENTS
                        fraction of max possible alignment score required for
                        a read to pass when aligning reads to amplicon
                        reference (default 0.4)
  -min_bqs MIN_BQS      minimum average base quality score (BQS) required by
                        each read (default 30)
  -min_read_length MIN_READ_LENGTH
                        minimum read length in bp required after N-trimming
                        (default 100)
  -min_read_copies MIN_READ_COPIES
                        minimum number of copies of each read required for
                        processing (1 to turn filter off, 2 (default) to
                        discard unique reads)
  -min_insert_seq_length MIN_INSERT_SEQ_LENGTH
                        minimum number of insert basepairs which must be
                        sequenced of each insert for it to be considered by
                        getITD. For non-trailing ITDs, this is the minimum
                        insert length; for trailing ITDs, it is the minimum
                        number of bp of a potentially longer ITD which have to
                        be sequenced (default 6).
  -filter_ins_unique_reads FILTER_INS_UNIQUE_READS
                        minimum number of unique reads required to support an
                        insertion for it to be considered 'high confidence'
                        (default 2)
  -filter_ins_total_reads FILTER_INS_TOTAL_READS
                        minimum number of total reads required to support an
                        insertion for it to be considered 'high confidence'
                        (default 1)
  -filter_ins_vaf FILTER_INS_VAF
                        minimum variant allele frequency (VAF) required for an
                        insertion to be considered 'high confidence' (default
                        0.006)

Output

Files

For each sample, an output directory will be created in the current directory, named using the provided sample ID. Inside, all of the generated output files reside:

  • config.txt contains parameters, date and time of the analysis, as well as the getITD version that was used
  • stats.txt contains the number of reads and insertions / ITDs processed and filtered at each step of the analysis
  • coverage.txt contains the read coverage across the reference sequence
  • coverage.png contains a plot of that same coverage data (optionally created using -plot_coverage True)
  • out_needle/ contains individual alignment files, needle_*.txt, of all the different reads processed
  • itds_collapsed-is-same_is-similar_is-close_is-same-trailing_hc.tsv contains filtered high-confidence (hc) ITDs, fully merged
  • itds_collapsed-is-same_is-similar_is-close_is-same-trailing.tsv contains all ITDs, fully merged
  • itds_collapsed-is-same_is-similar_is-close.tsv contains all ITDs, having merged those of the same insertion length, nearby insertion sites (within one tandem length) and similar insert sequences
  • itds_collapsed-is-same_is-similar.tsv contains all ITDs, having merged those of the same insertion length and site and similar insert sequences
  • itds_collapsed-is-same.tsv contains all ITDs, having merged those that share the same insertion length, site and sequence
  • insertions*.tsv files are analogous to ITD files but list all insertions, regardless of whether these are also ITDs or not

Note that insertion*tsv and itd*tsv files are only created if at least one insertion or ITD is found, respectively. Config and stats files should exist for all samples.

On stdout

At each filtering step, the number of reads and insertions passing the specified requirements are printed. This basically corresponds to the content of stats.txt. Currently, also the computation time of various steps is printed, but presumably this will change in the future.

Example

Using default parameters

To analyze the provided test data, change to the getitd folder and, on Linux / MacOS, run:

python3 getitd.py test test/test_R1.fastq test/test_R2.fastq

On Windows, run:

py -3 getitd.py test test\test_R1.fastq test\test_R2.fastq

Expected output:

==== PROCESSING SAMPLE test ====
-- Reading FASTQ files --
Reading FASTQ files took 0.1 s
Number of total reads: 5000
Number of total reads > 100bp remaining after N-trimming: 5000 (100.0 %)
Mean read length after N-trimming: 251.0
Number of total reads with mean BQS >= 30: 4544 (90.88 %)
Getting unique reads took 0.06 s

Number of unique reads with mean BQS >= 30: 987
Number of unique reads with at least 2 copies: 170
Total reads remaining for analysis: 3727 (74.54 %)

-- Aligning to Reference --
Alignment took 1.13 s
Filtering 0 / 170 low quality alignments with a score < 40.0 % of max
Filtering 20 / 170 alignments with indels in primer bases
Total reads remaining for analysis: 3669 (73.38 %)
Calculating coverage took 0.09 s

-- Looking for insertions & ITDs --
Collecting inserts took 0.04 s
44 inserts >= 6 bp were found
0/44 insertions were part of adapters and filtered
Filtering inserts for adapter sequences took 0.0 s
Annotating coverage took 0.0 s
Collecting ITDs took 0.35 s
44 ITDs were found

-- Merging insertions --
3 insertions remain after merging
2 insertions remain after merging
1 insertions remain after merging
1 insertions remain after merging

-- Merging itds --
3 itds remain after merging
2 itds remain after merging
1 itds remain after merging
1 itds remain after merging
Merging took 0.15 s

-- Filtering insertions --
Filtered 0 / 1 insertions based on the number of unique supporting reads
Filtered 0 / 1 insertions based on the number of total supporting reads
Filtered 0 / 1 insertions based on the vaf
1 insertions remain after filtering!

-- Filtering itds --
Filtered 0 / 1 itds based on the number of unique supporting reads
Filtered 0 / 1 itds based on the number of total supporting reads
Filtered 0 / 1 itds based on the vaf
1 itds remain after filtering!
Filtering took 0.03 s

Changing optional parameters

To change, for example, the number of cores used for the analysis to 2, on Linux / MacOS instead run:

python3 getitd.py -nkern 2 test test/test_R1.fastq test/test_R2.fastq

On Windows, run:

py -3 getitd.py -nkern=2 test test\test_R1.fastq test\test_R2.fastq

Specifying primer sequences

Because an arbitrary number of primer sequences can be supplied following the -forward_primer and -reverse_primer parameters, an explicit signal is required to identify the end of the given list of primer sequences. This can be either any other optional argument which reads in only one word, such as -nkern, or the 'empty' argument --:

python3 getitd.py -forward_primer GCAATTTAGGTATGAAAGCCAGCTAC -nkern 2 test test/test_R1.fastq test/test_R2.fastq

or

python3 getitd.py -forward_primer GCAATTTAGGTATGAAAGCCAGCTAC -- test test/test_R1.fastq test/test_R2.fastq

How it works

  1. Read in FASTQ files
    • Read in forward reads and BQS (R1)
    • Read in and reverse complement reverse reads and BQS (R2)
    • Trim trailing ambiguous N bases on all reads
  2. Filter reads for minimum average base quality score (BQS)
    • Discard low quality reads
  3. Filter unique reads
    • Assume that true & clinically relevant sequences will be present at least twice and discard unique reads
  4. Align each read to the WT amplicon reference sequence using Needlemann-Wunsch alignment algorithm
  5. Filter alignments, require
    • at least 40% of the maximum possible alignment score
    • gap-free alignment to the given primer sequence
  6. Collect insertions within passing alignments, require
    • insert length of at least 6 bp
    • absence of ambiguous "N" bases within the actual insert sequence
    • in-frame insert (length divisible by 3) for inserts fully contained within a read
    • 3' or 5' trailing inserts not fully spanned by the sequenced reads are not required to be in-frame, since their exact length is not actually known
  7. Collect ITDs from passing insertions, require
    • insert sequence can be realigned to the WT amplicon sequence, with at least 50% of the maximum possible alignment score
    • for 3' and 5' trailing inserts that insert sequence is not an adapter artefact
  8. Merge total insertions and ITDs independently (considering reads to describe the same event and adding up supporting counts), require
    • that they are actually the same: insert length, site and sequence are identical
    • that they are similar: insert length and site are the same, sequences are similar
    • that they are close: insert length is the same, sequences are similar and sites are within one insert length of each other
    • close trailing inserts with similar sequences are considered the same regardless of their (estimated) lengths
  9. Filter insertions and ITDs independently, require
    • at least two different supporting reads
    • a minimum variant allele frequency (VAF) of 0.006 %

getITD wrappers - running getITD without the commandline

Two additional scripts are provided to facilitate getITD analysis by alleviating the need to use the commandline. The first, make_getitd_config.py, when double-clicked, creates a simple graphical interface and asks the user to select desired analysis parameters and annotation / reference files. These parameters are then saved to a config.txt file. When this config.txt is placed in a folder containing both getitd_from_config_wrapper.py and an arbitrary number of FASTQ files to analyze, double-clicking the wrapper script will initiate sequential analysis of all of the provided FASTQ files using parameters stored in config.txt. Though these wrapper scripts were originally intended primarily to facilitate analysis on Windows, they run also on Linux and Mac.

About

Software for FLT3-ITD detection and MRD monitoring assay

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages