A small set of test cases for long-read assembly tools


A very recent long-read assembly benchmark evaluates how robust assemblers are to sequencing artifacts in the reads, using a bacterial genome as a test case. Here we take a different but related standpoint: we provide a set of "supposedly easy" test cases for long-read genome assemblers. Can assemblers at least assemble those?

This page is intended 1) for users, to give a rough idea of the performance of each tool; and 2) for assembly tool developers, to highlight situations where their tools perform sub-optimally.

This project was motivated by this talk at CGSI 2019.


Test genome

A bacterial genome, 3.5 Mbp, with a repeat of length around 16 kbp (according to my own genome-vs-genome alignment). It is in fact the same genome as in R. Wick's and K. Holt's long-read assembler comparison. Quoting Ryan: "not a particularly repeat-rich genome, but it does have four tandem copies of its rRNA operon, which creates a 20 kbp repeat region".
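The ~16 kbp repeat was spotted by aligning the genome against itself. As a toy illustration of the idea (not the actual analysis, which used a genome-vs-genome alignment), an exact k-mer scan can locate perfect duplications; all sequences below are made up:

```python
from collections import defaultdict

def repeated_kmer_positions(genome, k):
    """Return, for each k-mer occurring more than once, its start positions.

    A real analysis would self-align with a dedicated aligner; this exact
    k-mer scan only finds perfect repeats, which is enough to illustrate
    how a large duplicated region shows up.
    """
    seen = defaultdict(list)
    for i in range(len(genome) - k + 1):
        seen[genome[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in seen.items() if len(pos) > 1}

# Toy example: a 16 bp "repeat unit" embedded twice in a short sequence.
unit = "ACGTACGTGGCCTTAA"
genome = "T" * 30 + unit + "C" * 30 + unit + "G" * 30
repeats = repeated_kmer_positions(genome, k=16)
print(repeats[unit])  # → [30, 76]: the two copies of the repeat unit
```

On a real genome one would merge overlapping repeated k-mers into intervals to recover the repeat's total length.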

Assemblers and commands

Exactly the same assemblers, versions and commands as in that comparison were used. Command lines are in scripts/

Miniasm was subsequently added.

Reads were simulated differently, though: we used PaSS, a recently published PacBio Sequel read simulator.

Test cases

Four test datasets:

  • one with 50x coverage ("50x" column)

    Here is a visualization of the read alignments to the reference genome (using the Tablet software).


    The high-depth region corresponds to the 16 kbp repeat.

  • 100x coverage ("100x" column)


  • 50x coverage with a simulated coverage drop to 10x at a repeat-free location (position 420,000 in the reference; "50x-drop-10x" column)





  • 50x coverage with a simulated coverage drop to 5x at the same position ("50x-drop-5x" column)


Raw data is available in the data/ folder. Command lines used to generate the reads are in scripts/
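The coverage drops were produced when generating the reads (see scripts/). A simple way to mimic such a drop after the fact is to down-sample reads overlapping the target window; this is a sketch on hypothetical read coordinates, not the actual simulation pipeline:

```python
import random

def downsample_region(reads, start, end, keep_fraction, seed=0):
    """Keep only `keep_fraction` of the reads overlapping [start, end);
    reads elsewhere are kept untouched.

    `reads` is a list of (read_start, read_end) intervals on the reference.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    kept = []
    for s, e in reads:
        if s < end and e > start:  # read overlaps the drop window
            if rng.random() < keep_fraction:
                kept.append((s, e))
        else:
            kept.append((s, e))
    return kept

# Toy example: 50 reads inside the window, 50 outside; going from 50x to
# 10x coverage inside the window means keeping roughly 10/50 of its reads.
inside = [(100, 200)] * 50
outside = [(1000, 1100)] * 50
kept = downsample_region(inside + outside, 100, 200, keep_fraction=10 / 50)
print(len([r for r in kept if r[0] < 1000]))  # roughly 10 reads remain inside
```

A per-read random draw gives an approximate drop; for an exact target depth one would instead sample a fixed number of reads from the window.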


The following table reports the number of contigs for each assembler, for each test case.

| Dataset          | 50x | 100x | 50x-drop-10x | 50x-drop-5x |
|------------------|-----|------|--------------|-------------|
| Canu v1.8        | 1   | 1    | 1            | 2           |
| Flye v2.4.2      | 1   | 1    | 1            | 1           |
| Ra 07364a1       | 2   | 2    | 2            | 3           |
| Unicycler v0.4.7 | 3   | 3    | 3            | 3           |
| Wtdbg2 v2.4      | 2   | 3    | 2            | 2           |
| Miniasm cdcb49d  | 1   | 1    | 1            | 1           |
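The only metric reported is the number of sequences in each assembler's output. Assuming FASTA output (miniasm emits GFA, where counting `S` lines plays the same role), a minimal contig counter could look like this; the file name is made up:

```python
def count_contigs(fasta_path):
    """Count sequences (contigs) in a FASTA file by counting header lines."""
    with open(fasta_path) as fh:
        return sum(1 for line in fh if line.startswith(">"))

# Toy check: write a small two-contig FASTA to a temporary file.
import os
import tempfile

fasta = ">contig_1\nACGT\nACGT\n>contig_2\nTTGG\n"
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(fasta)
print(count_contigs(tmp.name))  # → 2
os.remove(tmp.name)
```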


This is certainly not an extensive assembly benchmark: there is only a single (bacterial) dataset and a single metric (number of contigs). It should rather be seen as an incentive for tool developers to make sure simple examples are handled adequately.

I am unsure whether I'll keep the last test case (the drop to 5x coverage); it is perhaps too much to ask assemblers to reconstruct regions that have only 5x coverage.

Updates to this benchmark are welcome: please submit a pull request if you would like to add or modify an assembler in the table.


This project was largely inspired by R. Wick's and K. Holt's Long read assembler comparison repository.


GNU General Public License, version 3

