v0.6.0rc1
v0.6.0 Release Candidate 1
This release introduces the completely overhauled variant calling setup for Lorikeet. Lorikeet no longer relies on threshold-based variant calling, and instead takes a more sophisticated approach that uses local re-assembly of active regions. This release includes a reimplementation of the GATK HaplotypeCaller algorithm in Rust, so hopefully it is faster. At the very least, it will be easier to pass multiple genomes and samples into the algorithm at once to generate called variants.
Currently, the strain resolving part of Lorikeet is hidden and will be re-enabled ASAP.
The HaplotypeCaller algorithm involves breaking up genomes into potential active regions and then performing local re-assembly with the reads that mapped to those locations. The local assembly is then searched for potential haplotypes using a number of techniques and candidate haplotypes are assigned likelihoods using a pairwise HMM model to re-assign reads to the haplotypes. Ultimately, the HaplotypeCaller algorithm produces sets of high confidence variants with depths across samples.
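As a rough illustration of the first step described above, the sketch below groups consecutive positions into candidate active regions. It is a deliberately simplified stand-in and not the Lorikeet or GATK implementation: the `Region` type, the `active_regions` function, the per-position `activity` scores, and the fixed `threshold` are all assumptions made for this example.

```rust
/// Half-open interval [start, end) on a contig (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Region {
    pub start: usize,
    pub end: usize,
}

/// Groups consecutive positions whose activity score is at least `threshold`
/// into regions. The scores are assumed to be precomputed per position.
pub fn active_regions(activity: &[f64], threshold: f64) -> Vec<Region> {
    let mut regions = Vec::new();
    let mut start: Option<usize> = None;

    for (pos, &score) in activity.iter().enumerate() {
        match (score >= threshold, start) {
            // Entering an active stretch.
            (true, None) => start = Some(pos),
            // Leaving an active stretch: close the region at `pos`.
            (false, Some(s)) => {
                regions.push(Region { start: s, end: pos });
                start = None;
            }
            // Still inside or still outside a stretch: nothing to do.
            _ => {}
        }
    }

    // Close a region that runs to the end of the contig.
    if let Some(s) = start {
        regions.push(Region { start: s, end: activity.len() });
    }
    regions
}
```

Each region returned here would then be the unit handed to local re-assembly, with only the reads overlapping that interval taking part.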
The HaplotypeCaller code was re-implemented in Rust to potentially speed up the variant calling process, to make it easier to pass multiple genomes and samples into the algorithm, and hopefully to allow parts of the code base to be reused in future projects and in the strain resolving pipeline.
The code requires benchmarking, but early indications from tests and small datasets put the Lorikeet variant calling speed on par with the Java implementation. I believe the real speed-up will appear when multiple genomes are supplied to Lorikeet, as they will be run in parallel seamlessly.
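To make the idea of per-genome parallelism concrete, here is a minimal sketch using only the standard library's scoped threads. It does not reflect how Lorikeet actually schedules work: `process_genomes_in_parallel` is a hypothetical name, and `count_mismatches` is a trivial placeholder standing in for the real per-genome variant calling.

```rust
use std::thread;

/// Placeholder for per-genome work (an assumption for this sketch): counts
/// positions where the read base differs from the reference base.
pub fn count_mismatches(reference: &[u8], read: &[u8]) -> usize {
    reference
        .iter()
        .zip(read.iter())
        .filter(|(a, b)| a != b)
        .count()
}

/// Runs the placeholder work for every (reference, read) pair on its own
/// scoped thread and returns the results in input order.
pub fn process_genomes_in_parallel(genomes: &[(Vec<u8>, Vec<u8>)]) -> Vec<usize> {
    thread::scope(|s| {
        // Scoped threads may borrow from `genomes` without cloning it.
        let handles: Vec<_> = genomes
            .iter()
            .map(|(reference, read)| s.spawn(move || count_mismatches(reference, read)))
            .collect();

        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

One thread per genome is the simplest possible arrangement; a real run over many genomes would more likely use a bounded thread pool.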
Additionally, a number of code clean-ups should be implemented as soon as possible, primarily around the BirdToolRead, SequencesForKmers, and Kmers data structures. Currently, accessing the bytes within a read requires cloning the data, with no option to create a reference pointing to the data (without the added complexity of decoding every encoded base). This means SequencesForKmers and Kmers each hold a clone of the read bases, which is very costly. I believe that by adding a bases field to BirdToolRead that is updated whenever the underlying Read changes, we can replace those clones with references and work through the lifetimes to significantly speed up the graph-building stage of the algorithm.
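The proposed change could look roughly like the sketch below, in which k-mers borrow from a cached bases buffer instead of owning a copy. The types `CachedRead` and `KmerRef` are hypothetical stand-ins invented for this example; they are not the actual BirdToolRead or Kmers definitions, and the real lifetimes will likely be harder to thread through the graph code.

```rust
/// Hypothetical read wrapper that keeps a decoded copy of its bases so that
/// callers can borrow them instead of cloning.
pub struct CachedRead {
    bases: Vec<u8>,
}

impl CachedRead {
    pub fn new(bases: &[u8]) -> Self {
        CachedRead { bases: bases.to_vec() }
    }

    /// Borrow the cached bases without copying.
    pub fn bases(&self) -> &[u8] {
        &self.bases
    }

    /// Must be called whenever the underlying read changes, so that the
    /// cache stays in sync.
    pub fn set_bases(&mut self, bases: &[u8]) {
        self.bases.clear();
        self.bases.extend_from_slice(bases);
    }
}

/// A k-mer that borrows its bases from the read it came from.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct KmerRef<'a> {
    bases: &'a [u8],
}

impl<'a> KmerRef<'a> {
    pub fn bases(&self) -> &'a [u8] {
        self.bases
    }
}

/// Returns every k-mer of the read as a borrowed slice; no bases are cloned.
pub fn kmers<'a>(read: &'a CachedRead, k: usize) -> Vec<KmerRef<'a>> {
    if k == 0 {
        // `windows(0)` panics, so treat k = 0 as "no k-mers".
        return Vec::new();
    }
    read.bases()
        .windows(k)
        .map(|w| KmerRef { bases: w })
        .collect()
}
```

The trade-off is that the borrow checker will not allow `set_bases` to be called while any `KmerRef` into that read is still alive, which is where the lifetime wrangling mentioned above comes in.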
TODO:
Reimplement strain calling + abundance estimation
Reimplement consensus calling
Update README
Update Workflow image
Various code improvements