A fuzzy Bruijn graph approach to long noisy reads assembly

Updates

  • wtdbg 2.3 2018-12-23
    No limitation on read length and read count.

Getting Started

git clone https://github.com/ruanjue/wtdbg2
cd wtdbg2 && make
# assemble long reads
./wtdbg2 -x rs -g 4.6m -i reads.fa.gz -t16 -fo prefix
# derive consensus
./wtpoa-cns -t16 -i prefix.ctg.lay.gz -fo prefix.ctg.fa

# polish consensus, not necessary if you want to polish the assemblies using other tools
minimap2 -t 16 -x map-pb -a prefix.ctg.fa reads.fa.gz | samtools view -Sb - >prefix.ctg.map.bam
samtools sort prefix.ctg.map.bam prefix.ctg.map.srt
samtools view prefix.ctg.map.srt.bam | ./wtpoa-cns -t 16 -d prefix.ctg.fa -i - -fo prefix.ctg.2nd.fa
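
The samtools sort call above uses the legacy "in.bam out.prefix" form, which samtools 1.3 and later no longer accept. A minimal sketch of the same polishing round for newer samtools follows. The function name polish_once and the WTPOA_CNS variable are hypothetical conveniences, not part of wtdbg2; the minimap2 and wtpoa-cns options are the ones used above.

```shell
#!/bin/sh
# polish_once PREFIX READS [THREADS]
# One polishing round using the samtools >= 1.3 "sort -o" syntax.
# WTPOA_CNS is a hypothetical override for the wtpoa-cns location.
polish_once() {
    prefix=$1 reads=$2 threads=${3:-16}
    # map reads to the raw consensus and sort the alignments in one pipe
    minimap2 -t "$threads" -x map-pb -a "$prefix.ctg.fa" "$reads" |
        samtools sort -@ "$threads" -o "$prefix.ctg.map.srt.bam" - &&
    # feed the sorted alignments to wtpoa-cns as SAM on standard input
    samtools view "$prefix.ctg.map.srt.bam" |
        "${WTPOA_CNS:-./wtpoa-cns}" -t "$threads" -d "$prefix.ctg.fa" -i - -fo "$prefix.ctg.2nd.fa"
}
```

Calling polish_once prefix reads.fa.gz 16 would then produce prefix.ctg.2nd.fa, assuming minimap2 and samtools are on your PATH.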

Introduction

Wtdbg2 is a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore Technologies (ONT). It assembles raw reads without error correction and then builds the consensus from intermediate assembly output. Wtdbg2 is able to assemble the human and even the 32Gb Axolotl genome at a speed tens of times faster than CANU and FALCON while producing contigs of comparable base accuracy.

During assembly, wtdbg2 chops reads into 1024bp segments, merges similar segments into a vertex and connects vertices based on the segment adjacency on reads. The resulting graph is called a fuzzy Bruijn graph (FBG). It is akin to a De Bruijn graph but permits mismatches/gaps and keeps read paths when collapsing k-mers. The use of FBG distinguishes wtdbg2 from the majority of long-read assemblers.
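
As a rough illustration of the chopping and adjacency steps only, the sketch below splits a read into fixed-size segments and prints each pair of neighbouring segments as an edge. It uses a toy segment length of 4 instead of 1024, and it leaves out the step that merges similar segments into one vertex, which is what makes the graph fuzzy. The function name segment_edges is hypothetical and is not taken from the wtdbg2 source.

```shell
#!/bin/sh
# segment_edges READ SEGLEN
# Chop READ into SEGLEN-sized segments (the last one may be shorter) and
# print one "left -> right" line per pair of adjacent segments.
segment_edges() {
    printf '%s\n' "$1" | fold -w "$2" |
        awk 'NR > 1 { print prev " -> " $0 } { prev = $0 }'
}

segment_edges ACGTACGTAC 4
# prints:
# ACGT -> ACGT
# ACGT -> AC
```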

Installation

Wtdbg2 only works on 64-bit Linux. To compile, please type make in the source code directory. You can then copy wtdbg2 and wtpoa-cns to your PATH.

Wtdbg2 also comes with an approximate read mapper kbm, a faster but less accurate consensus tool wtdbg-cns and many auxiliary scripts in the scripts directory.

Usage

Wtdbg2 has two key components: an assembler wtdbg2 and a consenser wtpoa-cns. Executable wtdbg2 assembles raw reads and generates the contig layout and edge sequences in a file "prefix.ctg.lay.gz". Executable wtpoa-cns takes this file as input and produces the final consensus in FASTA. A typical workflow looks like this:

./wtdbg2 -x rs -g 4.6m -t 16 -i reads.fa.gz -fo prefix
./wtpoa-cns -t 16 -i prefix.ctg.lay.gz -fo prefix.ctg.fa

where -g is the estimated genome size and -x specifies the sequencing technology, which can be "rs" for PacBio RSII, "sq" for PacBio Sequel, "ccs" for PacBio CCS reads or "ont" for Oxford Nanopore. The -x option sets multiple parameters at once and should be given before other parameters. If you are unable to get a good assembly, you may need to tune other parameters as follows.

Wtdbg2 combines normal k-mers and homopolymer-compressed (HPC) k-mers to find read overlaps. Option -k specifies the length of normal k-mers, while -p specifies the length of HPC k-mers. By default, wtdbg2 samples a fourth of all k-mers by their hashcodes. For data of relatively low coverage, you may increase this sampling rate by reducing -S, at the cost of a much higher peak memory. Option -e, which defaults to 3, specifies the minimum read coverage of an edge in the assembly graph. You may adjust this option according to the overall sequencing depth, too. Option -A also helps relatively low coverage data at the cost of performance. For PacBio data, -L5000 often leads to better assemblies empirically, so it is recommended. Please run wtdbg2 --help for a complete list of available options or consult README-ori.md for more help.
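
To make the two ideas in this paragraph concrete, here is a small sketch of homopolymer compression and of sampling k-mers by hash. It is an illustration only: the function names hpc and keep_kmer are hypothetical, cksum stands in for whatever hash wtdbg2 uses internally, and the sampling factor is assumed to act as a simple integer divisor (keep about one k-mer in S), as the description of -S suggests.

```shell
#!/bin/sh
# Illustration only: neither function is taken from the wtdbg2 source.

# hpc SEQ: collapse each run of identical bases into a single base.
hpc() {
    printf '%s\n' "$1" | tr -s 'ACGTacgt'
}

# keep_kmer KMER S: succeed when the k-mer's checksum is divisible by S,
# so roughly one k-mer in S is kept. A smaller S keeps more k-mers.
keep_kmer() {
    h=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
    [ $((h % $2)) -eq 0 ]
}

hpc "AAACCGTTTTA"    # prints ACGTA
if keep_kmer "ACGTACGTACGTACG" 4; then echo "sampled"; else echo "skipped"; fi
```

Because the decision depends only on the k-mer itself, the same k-mer is kept or skipped in every read, which is what lets sampled k-mers still match between overlapping reads.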

The following table shows various command lines and their resource usage for the assembly step:

| Dataset | GSize | Cov | Asm options | CPU asm | CPU cns | Real tot | RAM |
|---|---|---|---|---|---|---|---|
| E. coli | 4.6Mb | PB x20 | -x rs -g4.6m -t16 | 53s | 8m54s | 42s | 1.0G |
| C. elegans | 100Mb | PB x80 | -x rs -g100m -t32 | 1h07m | 5h06m | 13m42s | 11.6G |
| D. melanogaster A4 | 144Mb | PB x120 | -x rs -g144m -t32 | 2h06m | 5h11m | 26m17s | 19.4G |
| D. melanogaster ISO1 | 144Mb | ONT x32 | -xont -g144m -t32 | 5h12m | 4h30m | 25m59s | 17.3G |
| A. thaliana | 125Mb | PB x75 | -x sq -g125m -t32 | 11h26m | 4h57m | 49m35s | 25.7G |
| Human NA12878 | 3Gb | ONT x36 | -x ont -g3g -t31 | 793h11m | 97h46m | 31h03m | 221.8G |
| Human NA19240 | 3Gb | ONT x35 | -x ont -g3g -t31 | 935h31m | 89h17m | 35h20m | 215.0G |
| Human HG00733 | 3Gb | PB x93 | -x sq -g3g -t47 | 2114h26m | 152h24m | 52h22m | 338.1G |
| Human NA24385 | 3Gb | CCS x28 | -x ccs -g3g -t31 | 231h25m | 58h48m | 10h14m | 112.9G |
| Human CHM1 | 3Gb | PB x60 | -x rs -g3g -t96 | 105h33m | 139h24m | 5h17m | 225.1G |
| Axolotl | 32Gb | PB x32 | -x rs -g32g -t96 | 2806h40m | | | 1788.1G |

The timing was obtained on three local servers with different hardware configurations. There are also run-to-run fluctuations. Exact timing on your machines may differ. The assembled contigs can be found at the following FTP:

ftp://ftp.dfci.harvard.edu/pub/hli/wtdbg/

Limitations

  • For Nanopore data, wtdbg2 may produce an assembly smaller than the true genome.

  • When inputting multiple files in both fasta and fastq formats, please put the fastq files first, then the fasta files. Otherwise, the program cannot find '>' in the fastq input and appends all fastq records into one read.
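
One way to respect the fastq-before-fasta ordering is to sort the file names before passing them on. The helper below is a sketch under two assumptions: that file type can be told from the file extension, and that the hypothetical function name order_inputs does not clash with anything in your environment. It is not part of wtdbg2.

```shell
#!/bin/sh
# order_inputs FILE...
# Print FASTQ file names first (judged by extension), then all other names,
# keeping the original order within each group.
order_inputs() {
    for f in "$@"; do
        case "$f" in
            *.fq|*.fastq|*.fq.gz|*.fastq.gz) printf '%s\n' "$f" ;;
        esac
    done
    for f in "$@"; do
        case "$f" in
            *.fq|*.fastq|*.fq.gz|*.fastq.gz) ;;
            *) printf '%s\n' "$f" ;;
        esac
    done
}

order_inputs a.fa.gz b.fq.gz c.fasta d.fastq
# prints b.fq.gz, d.fastq, a.fa.gz, c.fasta (one per line)
```

The ordered names can then be given to wtdbg2 in that order; check wtdbg2 --help for how multiple input files are specified in your version.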

Getting Help

Please use the GitHub Issues page if you have questions. You may also contact Jue Ruan directly at ruanjue@gmail.com.