A small set of test cases for long-read assembly tools



A very recent long-read assembly benchmark evaluates how robust assemblers are to sequencing artifacts in the reads, using a bacterial genome as a test case. Here we take a different but related standpoint: we provide a set of "supposedly easy" test cases for long-read genome assemblers. Can assemblers at least assemble them?

This page is intended 1) for users, to give a rough idea of the performance of each tool; and 2) for assembly tool developers, to highlight situations where their tools perform sub-optimally.

This project was motivated by this talk at CGSI 2019.


Test genome

A bacterial genome, 3.5 Mbp, with a repeat of length around 16 kbp (according to my own genome-vs-genome alignment). In fact, it is the same genome as in the benchmark mentioned above. Quoting Ryan: "not a particularly repeat-rich genome, but it does have four tandem copies of its rRNA operon, which creates a 20 kbp repeat region".
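The repeat length can be estimated from a genome self-alignment by flagging matches that are not on the main diagonal. A minimal sketch, assuming PAF-format alignments (e.g. as produced by minimap2); the PAF lines below are toy data, not real alignments of this genome:

```python
# Sketch: estimate repeat spans from a genome-vs-genome self-alignment in PAF format.

def repeat_spans(paf_lines, min_len=1000):
    """Return (qstart, qend, tstart, tend, length) for non-diagonal self-hits."""
    spans = []
    for line in paf_lines:
        f = line.split("\t")
        # PAF columns: qname qlen qstart qend strand tname tlen tstart tend ...
        qname, qstart, qend = f[0], int(f[2]), int(f[3])
        tname, tstart, tend = f[5], int(f[7]), int(f[8])
        # Skip the trivial full-length match of the genome against itself.
        if qname == tname and qstart == tstart and qend == tend:
            continue
        length = qend - qstart
        if length >= min_len:
            spans.append((qstart, qend, tstart, tend, length))
    return spans

toy_paf = [
    "chr\t3500000\t0\t3500000\t+\tchr\t3500000\t0\t3500000",          # diagonal
    "chr\t3500000\t100000\t116000\t+\tchr\t3500000\t900000\t916000",  # 16 kbp repeat copy
]
print(repeat_spans(toy_paf))  # → [(100000, 116000, 900000, 916000, 16000)]
```

The coordinates and contig name are illustrative; on real data one would feed in the PAF output of a self-alignment and merge overlapping spans.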

Assemblers and commands

Exactly the same assemblers, versions, and commands as in the benchmark mentioned above. Command lines are in scripts/.

Miniasm was subsequently added.

Reads were simulated differently, though: we used PaSS, a recently published PacBio Sequel read simulator.

Test cases

Four test datasets:

  • one with 50x coverage ("50x" column)

Here is a visualization of the read alignments to the reference genome (using the Tablet software).


The high-depth region corresponds to the 16 kbp repeat.

  • 100x coverage ("100x" column)


  • 50x coverage with a simulated coverage drop to 10x at a repeat-free location (position 420,000 bp in the reference) ("50x-drop-10x" column)





  • 50x coverage with a simulated coverage drop to 5x at the same position as above ("50x-drop-5x" column)


Raw data is available in the data/ folder. Command lines used to generate the reads are in scripts/.
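The actual simulation commands are in scripts/; conceptually, a coverage drop can be produced by down-sampling reads whose positions overlap the target window. A minimal sketch with made-up read coordinates (the window around position 420,000 follows the text; everything else is illustrative):

```python
import random

# Sketch: down-sample reads overlapping a window to simulate a coverage drop.
# Reads are (start, end) positions on the reference.

def downsample_window(reads, win_start, win_end, keep_fraction, rng):
    """Keep all reads outside [win_start, win_end); keep only a fraction of those inside."""
    kept = []
    for start, end in reads:
        overlaps = start < win_end and end > win_start
        if not overlaps or rng.random() < keep_fraction:
            kept.append((start, end))
    return kept

rng = random.Random(0)
reads = [(i * 1000, i * 1000 + 10000) for i in range(500)]  # toy 10 kbp reads
# Drop coverage from 50x to 10x (keep 10/50 = 20%) around position 420,000.
kept = downsample_window(reads, 415000, 425000, 10 / 50, rng)
print(len(kept) < len(reads))  # True: reads in the window were thinned out
```

On real data the read positions would come from an alignment (or from the simulator's ground truth), but the thinning logic is the same.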


The following table reports the number of contigs for each assembler, for each test case.

Dataset             50x   100x   50x-drop-10x   50x-drop-5x
Canu v1.8            1      1         1             2
Flye v2.4.2          1      1         1             1
Ra 07364a1           2      2         2             3
Unicycler v0.4.7     3      3         3             3
Wtdbg2 v2.4          2      3         2             2
Miniasm cdcb49d      1      1         1             1
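The contig counts above can be obtained by counting header lines in each assembler's FASTA output. A minimal sketch (the FASTA content is illustrative):

```python
# Sketch: count contigs in FASTA text, one contig per '>' header line.

def count_contigs(fasta_text):
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))

toy_fasta = """>contig_1 length=3480000
ACGTACGT
>contig_2 length=16000
TTTTACGT
"""
print(count_contigs(toy_fasta))  # → 2
```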


This is certainly not an extensive assembly benchmark: there is only a single (bacterial) dataset and a single metric (the number of contigs). It should be seen as an incentive for tool developers to make sure simple examples are handled adequately.

I am unsure whether I'll keep the last test case (the drop to 5x coverage); it is perhaps too much to ask assemblers to reconstruct regions that have only 5x coverage.

Updates to this benchmark are welcome. Please submit a pull request if you would like to add or modify an assembler in the table.


This project was largely inspired by R. Wick's and K. Holt's Long read assembler comparison repository.


GNU General Public License, version 3
