Skip to content
master
Switch branches/tags
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 

A small set of test cases for long-read assembly tools

Introduction

A very recent long-read assembly benchmark evaluates how robust assemblers are to sequencing artifacts that happen in the reads, using a bacterial genome as a test case. Here we take a different but related standpoint. We provide a set of "supposedly easy" test cases for long-read genome assemblers. Can assemblers at least assemble them?

This page is intended 1) to users, in order to give some rough idea of the performance each tool; and 2) to assembly tool developers, in order to highlight situations where they perform sub-optimally.

This project was motivated by this talk at CGSI 2019.

Methods

Test genome

A bacterial genome, 3.5 Mbp, having repeat of length around 16 kbp (according to my own genome-vs-genome alignment). In fact, it is the same genome as in https://github.com/rrwick/Long-read-assembler-comparison/ Quoting Ryan: "not a particularly repeat-rich genome, but it does have four tandem copies of its rRNA operon, which creates a 20 kbp repeat region".

Assemblers and commands

Exactly the same assemblers, versions and commands as https://github.com/rrwick/Long-read-assembler-comparison/#assemblers-and-commands were used. Command lines are in scripts/run_assemblers.sh.

Miniasm was subsequently added.

Reads were simulated differently, though. We used PaSS, a recently-published PacBio Sequel read simulator.

Test cases

Four test datasets:

  • one with 50x coverage ("50x" column)

    Here is a visualization of the reads alignment to the reference genome (using the Tablet software).

    50x

    The high-depth region corresponds to the 16kbp repeat.

  • 100x coverage ("100x" column)

    100x

  • 50x coverage with a simulated coverage drop to 10x at a repeat-free location, position 420,000 bp in the reference, ("50x-drop-10x" column)

    from

    50x_no_drop.png

    to

    50x_with_10x_region

  • 50x coverage with a simulated coverage drop to 5x at the same position as previously ("50x-drop-5x" column)

    50x_drop_5x

Raw data is available in the data/ folder. Command lines used to generate the reads are in scripts/generate_reads.sh.

Results

The following table reports the number of contigs for each assembler, for each test case.

Dataset 50x 100x 50x-drop-10x  50x-drop-5x
Canu v1.8 1 #00DD00  1 #00DD00  1 #00DD00 2 FF8C00
Flye v2.4.2 1 #00DD00  1 #00DD00  1 #00DD00 1 #00DD00
Ra 07364a1 2 FF8C00  2 FF8C00  2 FF8C00 3 FF8C00
Unicycler v0.4.7 3 FF8C00  3 FF8C00 3 FF8C00 3 FF8C00
Wtdbg2 v2.4 2 FF8C00  3 FF8C00 2 FF8C00 2 FF8C00
Miniasm cdcb49d 1 #00DD00  1 #00DD00  1 #00DD00 1 #00DD00

Caveats

This is certainly not an extensive assembly benchmark. There is only a single (bacterial) dataset and a single metric (number of contigs). It should be seen as an incentive for tool developers to make sure simple examples are treated adequately.

I am unsure whether I'll keep the last test case (the drop to 5x coverage). It is perhaps a bit too much to ask, i.e. assemble regions that have 5x coverage.

Regarding updates to this benchmark: they are welcome. Please submit a pull request if you would like to add/modify an assembler in the table.

Acknowledgements

This project was largely inspired by R. Wick's and K. Holt's Long read assembler comparison repository: https://github.com/rrwick/Long-read-assembler-comparison/

License

GNU General Public License, version 3

About

A small set of test cases for long-read assembly tools

Resources

Releases

No releases published

Packages

No packages published

Languages