A small set of test cases for long-read assembly tools
A very recent long-read assembly benchmark evaluates how robust assemblers are to sequencing artifacts that happen in the reads, using a bacterial genome as a test case. Here we take a different but related standpoint. We provide a set of "supposedly easy" test cases for long-read genome assemblers. Can assemblers at least assemble them?
This page is intended 1) for users, to give a rough idea of the performance of each tool; and 2) for assembly tool developers, to highlight situations where their tools perform sub-optimally.
This project was motivated by this talk at CGSI 2019.
A bacterial genome, 3.5 Mbp, with a repeat of length around 16 kbp (according to my own genome-vs-genome alignment). In fact, it is the same genome as in https://github.com/rrwick/Long-read-assembler-comparison/. Quoting Ryan: "not a particularly repeat-rich genome, but it does have four tandem copies of its rRNA operon, which creates a 20 kbp repeat region".
Assemblers and commands
Exactly the same assemblers, versions and commands as https://github.com/rrwick/Long-read-assembler-comparison/#assemblers-and-commands were used. Command lines are in
Miniasm was subsequently added.
Reads were simulated differently, though. We used PaSS, a recently-published PacBio Sequel read simulator.
Four test datasets:
one with 50x coverage ("50x" column)
Here is a visualization of the read alignments to the reference genome (using the Tablet software).
The high-depth region corresponds to the 16 kbp repeat.
100x coverage ("100x" column)
50x coverage with a simulated coverage drop to 10x at a repeat-free location, position 420,000 bp in the reference ("50x-drop-10x" column)
50x coverage with a simulated coverage drop to 5x at the same position as previously ("50x-drop-5x" column)
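The exact commands used to create the coverage drops are not reproduced here. As a rough illustration only, one way to simulate such a drop is to randomly discard reads that overlap a window around the chosen position. The sketch below assumes that each read's position on the reference is known (as is typically the case for simulated reads); the function name `thin_reads`, the `(name, start, end)` tuple layout and the window bounds are all hypothetical.

```python
import random


def thin_reads(reads, window_start, window_end, keep_fraction, seed=0):
    """Randomly discard reads overlapping [window_start, window_end).

    reads: iterable of (name, start, end) tuples, half-open reference
           coordinates (hypothetical layout, chosen for this sketch).
    keep_fraction: probability of keeping an overlapping read,
           e.g. 10 / 50 = 0.2 to go from 50x to roughly 10x.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = []
    for name, start, end in reads:
        overlaps = start < window_end and end > window_start
        # Reads outside the window are always kept; overlapping reads
        # survive with probability keep_fraction.
        if overlaps and rng.random() >= keep_fraction:
            continue
        kept.append((name, start, end))
    return kept
```

With this approach a drop from 50x to 5x would use `keep_fraction=0.1`. Because whole reads are discarded, coverage also tapers off on both sides of the window rather than falling abruptly.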
Raw data is available in the data/ folder. Command lines used to generate the reads are in
The following table reports the number of contigs for each assembler, for each test case.
This is certainly not an extensive assembly benchmark. There is only a single (bacterial) dataset and a single metric (number of contigs). It should be seen as an incentive for tool developers to make sure simple examples are treated adequately.
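For readers who want to compute the same metric on their own assemblies, here is a minimal sketch that counts contigs in a FASTA file by counting header lines. It assumes the assembler writes a standard FASTA file with one `>` header per contig; the function name `count_contigs` is hypothetical and this is not necessarily how the numbers on this page were obtained.

```python
import gzip


def count_contigs(path):
    """Return the number of sequences (header lines) in a FASTA file.

    Plain-text and gzip-compressed (.gz) files are both accepted.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Each FASTA record starts with a line beginning with '>'.
        return sum(1 for line in handle if line.startswith(">"))
```

Some assemblers output a graph (for example GFA) rather than FASTA, in which case the sequences would need to be extracted first.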
I am unsure whether I'll keep the last test case (the drop to 5x coverage). Expecting assemblers to assemble regions that have only 5x coverage is perhaps a bit too much to ask.
Regarding updates to this benchmark: they are welcome. Please submit a pull request if you would like to add/modify an assembler in the table.
This project was largely inspired by R. Wick's and K. Holt's Long read assembler comparison repository: https://github.com/rrwick/Long-read-assembler-comparison/