August 2019 consensus accuracy update
Ryan R. Wick1, Louise M. Judd1 and Kathryn E. Holt1,2
1. Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Victoria 3004, Australia
2. London School of Hygiene & Tropical Medicine, London WC1E 7HT, UK
Table of contents
This repository contains an addendum to our previous paper on Oxford Nanopore basecalling:
Wick RR, Judd LM, Holt KE. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biology. 2019;20(1):129.
It was motivated by two main questions. First, there have been some advancements with Guppy since our last paper, including additional basecalling models, and I was curious to see how well they performed. Second, I wanted to compare Medaka and Nanopolish as assembly polishers, a topic that we didn't explore much in our paper.
Unlike that previous paper, this isn't a comparison between different basecallers or a look at historical basecalling performance. Rather, it just examines the current version of Guppy using a few different basecalling models and some alternative polishing approaches.
Genome and reads
I chose to not use the same test read set we used in our previous study as it was from an R9.4 flowcell and getting a bit out-of-date. I instead chose a different Klebsiella pneumoniae isolate that we recently sequenced as part of a barcoded run on an R9.4.1 flowcell.
This particular genome (named 67K) is not yet publicly available as it's part of a larger study that's in progress. It will be shared when that study is released. I chose this genome because it had a high yield of ONT reads and high-quality Illumina reads which enabled a clean hybrid assembly to use as a ground truth for sequence accuracy.
There were over 1 Gbp of total reads for this genome (an excessive amount) so I used Filtlong to choose a high-quality subset of 100× depth (~530 Mbp). This refined read set contained 14,979 reads with a nice N50 of ~38 kbp.
Here is a high-level overview of what I did to the read set with each basecalling model:
Consensus accuracy was assessed at each applicable stage (the coloured boxes).
I used Guppy v3.2.2 (the current version at the time of writing) with five different models (all of which use Guppy's flip-flop architecture):
- Fast: the 'r9.4.1_450bps_fast' basecalling model that comes with Guppy. This model has hidden layers of size 96 and a total of 285,160 parameters.
- Fast-Kp: this model has the same neural network architecture as the fast model but was trained on our Klebsiella pneumoniae-heavy dataset that we describe in this paper. This greatly improved its ability to call Kp-specific base modifications.
- HAC: the 'r9.4.1_450bps_hac' basecalling model that comes with Guppy ('hac' meaning high-accuracy). It differs from the fast model in two ways: it takes a smaller stride with its convolutional layer (2 vs 4) and its hidden layers have a larger size (256 vs 96). It has a total of 1,989,160 parameters.
- HAC-mod: a new model recently added to Guppy which allows for calling 5mC and 6mA modified bases. It was trained on human and E. coli reads (E. coli being a close relative of our Klebsiella pneumoniae test genome). While this basecalling model can produce a table containing per-base probabilities of modifications, we only used the standard FASTQ output with the four canonical bases.
- HAC-Kp: this model has the same neural network architecture as the HAC model but was trained on our Klebsiella pneumoniae-heavy dataset.
Here are the specific Guppy commands I used to basecall:
guppy_basecaller --input_path reads --save_path Fast --device auto --config dna_r9.4.1_450bps_fast.cfg --barcode_kits EXP-NBD114 --trim_barcodes guppy_basecaller --input_path reads --save_path Fast-Kp --device auto --config dna_r9.4.1_450bps_fast.cfg --model holtlab_kp_fast_flipflop_r9.4_r9.4.1_apr_2019.jsn --barcode_kits EXP-NBD114 --trim_barcodes guppy_basecaller --input_path reads --save_path HAC --device auto --config dna_r9.4.1_450bps_hac.cfg --barcode_kits EXP-NBD114 --trim_barcodes guppy_basecaller --input_path reads --save_path HAC-mod --device auto --config dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg --barcode_kits EXP-NBD114 --trim_barcodes guppy_basecaller --input_path reads --save_path HAC-Kp --device auto --config dna_r9.4.1_450bps_hac.cfg --model holtlab_kp_large_flipflop_r9.4_r9.4.1_apr_2019.jsn --barcode_kits EXP-NBD114 --trim_barcodes
flye --nano-raw basecalled_reads.fastq --genome-size 5.5m --out-dir flye --threads 12
Each of the five basecalled sets assembled cleanly and completely with Flye, generating a circular chromosome (~5.3 Mbp) and a large circular plasmid (~160 kbp).
rebaler --threads 8 flye/assembly.fasta basecalled_reads.fastq > rebaler.fasta
I then ran Medaka v0.8.1 on the Racon-polished assembly:
medaka_consensus -i basecalled_reads.fastq -d rebaler.fasta -o medaka -t 12 -m r941_min_fast # or -m r941_min_high, as appropriate
Medaka's models were trained on reads from particular basecalling models:
r941_min_fast goes with the fast model and
r941_min_high goes with the HAC model. This means that the other three models I ran (fast-Kp, HAC-mod and HAC-Kp) did not have a corresponding Medaka model. I therefore used what I thought would be the best match:
r941_min_fast for the fast-Kp model and
r941_min_high for the HAC-mod and HAC-Kp models.
I also ran Nanopolish v0.11.1 on the Racon-polished assembly:
nanopolish index -d fast5s basecalled_reads.fastq minimap2 -x map-ont -a -t 8 rebaler.fasta basecalled_reads.fastq | samtools sort > alignments.bam samtools index alignments.bam samtools faidx rebaler.fasta python nanopolish_makerange.py rebaler.fasta | parallel -P 8 nanopolish variants --consensus -o polished."$range".vcf -w "$range" -r basecalled_reads.fastq -b alignments.bam -g rebaler.fasta -t 6 --min-candidate-frequency 0.1 --methylation-aware=dcm,dam --fix-homopolymers nanopolish vcf2fasta -g rebaler.fasta polished.*.vcf > nanopolish.fasta
I used the
--fix-homopolymers options because in the past I found these to work well for K. pneumoniae genomes.
Additional rounds of Medaka
Finally, I tried running Medaka on both the Medaka and Nanopolish consensus assemblies. I used the same Medaka command shown above and called the results 'Medaka × 2' and 'Nanopolish + Medaka'.
Basecalling was pretty quick on the GTX 1080 GPU of our MinION desktop. The fast and fast-Kp models completed basecalling in ~8.5 minutes (~1,000,000 bases/sec). The HAC, HAC-mod and HAC-Kp models each took ~50 minutes (~180,000 bases/sec). We didn't try basecalling on the CPU but based on past experience it would be much slower than the GPU. The speed performance of Guppy is correlated with the number of parameters in its neural network model. The HAC models have ~7× more parameters than the fast models and are ~6× slower.
Racon is reasonably fast and can do one round of polishing in a few minutes on my laptop. Rebaler does many rounds of Racon and probably takes closer to an hour on a laptop. A recent Racon release added support for GPU-based polishing – haven't yet tried it myself but is presumably much faster.
It's also worth noting that Medaka is much faster than Nanopolish. I didn't time them in a systematic manner, but Medaka ran on my laptop in a few minutes while Nanopolish took about an hour using many nodes of a computing cluster.
Here are the accuracy results, plotted using the log-based Phred quality score:
First of all, there were some unsurprising results: high-accuracy models did better than fast models, and models which included Kp-specific methylation in their training did better than those which did not. This much I expected based on my previous studies.
I also expected to see that Racon improved upon the Flye assembly, and this was indeed usually the case. The one exception was for the fast-Kp model, where Flye did quite well and Racon actually made the assembly slightly worse – not sure what's going on there.
Medaka always improved the assembly and produced the highest accuracy results in the study with the HAC-mod and HAC-Kp models. Running Medaka a second time did not make much of a difference – sometimes it made the assembly a tiny bit better, sometimes a tiny bit worse.
Nanopolish did better than Medaka for the fast model, similarly to Medaka for the fast-Kp and HAC models, and worse than Medaka in the HAC-mod and HAC-Kp models. For the HAC-mod and HAC-Kp models, Nanopolish was actually worse than the Racon-polished assembly it took as input. Another way to look at this is that Nanopolish seemed to produce assemblies somewhere around Q28, and if the input assembly was already much higher than that, Nanopolish was likely to make things worse not better. Running Nanopolish before Medaka did not show any advantage over just running Medaka.
To summarise our results as a set of recommendations:
- Use high-accuracy models over fast models. This can be very slow on a CPU, so GPU basecalling is recommended.
- If you're sequencing native DNA, use a model that was trained on reads with the same bases modifications as your sample, ideally from the same species.
- Polishing with Medaka is fast, effective and usually preferable over Nanopolish. Don't bother with multiple rounds of Medaka polishing.
Medaka is trained on a per-model basis and the Medaka documentation states 'It is crucially important to specify the correct model'. The Holt lab mainly uses the HAC-Kp basecalling model, and the fact that there isn't a corresponding Medaka model has discouraged me from using it in the past. However, our results show that Medaka can work quite nicely even if you aren't using the exact same basecalling model it was trained on. This is encouraging and means I'll be using Medaka more in the future!
Our best assemblies were about 99.97% accurate, which equates to one error per ~3000 bases (~1600 errors in a 5 Mbp genome). Certainly an improvement over ONT results in the past but still too high of an error rate to replace Illumina sequencing in all contexts. Hopefully ONT's new R10 pore will deliver greater accuracies. The Holt lab will be trying them out in the near future!