Amplicon Long-read Error Correction (ALEC) was developed to correct sequencing and alignment errors [substitutionsandinsertion/deletions (indels)] generated by targeted amplicon sequencing with the PacBio RS platform. This script has been developed usingPacBio single molecule real time (SMRT) full-gene sequencing of the CYP2D6 gene (5.0 kb)according to the P6-C4 Pacific Biosciences protocol. ALEC was further tested using a 9.2kb amplicon and long-read PacBio SMRT sequencing of the RYR2 gene. ALEC may have utility withother long-read sequencing platforms (e.g., Oxford Nanopore), and/or other sequencing chemistries; however, optimizing the error correction parameters may be necessary to achieve ideal sequence correction and output for genes and sequencing platforms other than those noted above.
Four types of sequencing errors can be corrected after applying ALEC: 1) random substitutions; 2) random indels; 3) indels within homopolymers; and 4) indels near sequence variants.
- Python 2.7.10 or above (does not support Python 3)
- pysam
- numpy
The ALEC script takes a fasta file as a reference file and a SAM/BAM file as the raw data alignment file, and automatically generates a corrected fasta file as output in the working directory. The usage of the script is as below:
Python ALEC.py -r reference.fasta -i input.bam/sam [arguments]
Argument | Type | Default | Description |
---|---|---|---|
--input -i |
string | NA | Required. Input file, SAM or BAM. |
-- reference -r |
string | NA | Required. Reference file, FASTA file, index it following Samtools manual |
--targetRegion -t |
string | NA | Required. Target region interval. Example: 1:300000-400000 |
--lengthFilter -lf |
float | 0.0 | Optional. Reads shorter than lengthFilter * length(targetRegion) will be excluded in this correction process. |
--downsample -ds |
float | 1.0 | Optional. Fraction of down sampling. |
--deletion -del |
float | 0.0 | Required. Deletion error frequency threshold (per base) to trigger correction. |
--insert -ins |
float | 0.0 | Required. Insert error frequency threshold (per base) to trigger correction. |
--mismatch -mis |
float | 0.0 | Required. Substitution error frequency threshold (per base) to trigger correction. |
--del_homo_p -del_hp |
float | 0.0 | Required. Deletion Homopolymer Penalty. |
--ins_homo_p -ins_hp |
float | 0.0 | Optional. Insert Homopolymer Penalty. |
--platform -x |
string | NA | Optional. pacbio_ccs, pacbio_sub or nanopore. If one of these options was chosen, correction related arguments( -del, -ins, -mis, -del_hp, -ins_hp) will be set as predefined.Customrized setting of these parameters will be ignored. |
--help -h |
show help message |
- The ALEC script only takes a single sequence as a reference per use. Please use the same reference file as used in alignment.
- Given that ALEC was developed for germline DNA sequencing, caution should be exercised when using ALECfor somatic mutation detection.
- ALEC does not have asequence alignment tool preference; however, alignment files generated by BWA-MEM (0.7.12) were used during the development of the ALEC script (see reference below for details). Files from other alignment tools could lead to unexpected performance.
- A manuscript detailing and evaluating the functionality of ALEC is currently under review.
- For an application example of ALEC, see reference: Qiao W and Yang Y, et al. Hum Mutat. 2016 Mar;37(3):315-23. doi: 10.1002/humu.22936. Epub 2015 Dec 18.