Performance of neural network basecalling tools for Oxford Nanopore sequencing

Ryan R. Wick¹, Louise M. Judd¹ and Kathryn E. Holt¹,²
1. Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Victoria 3004, Australia
2. London School of Hygiene & Tropical Medicine, London WC1E 7HT, UK

This repository contains the scripts used in the preparation of our manuscript on basecalling performance:
Wick RR, Judd LM, Holt KE. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biology. 2019;20(1):129.

In August 2019, I put a small addendum to this paper on GitHub which looks at a more recent version of Guppy as well as some different polishing strategies:
github.com/rrwick/August-2019-consensus-accuracy-update

Previous versions of this repository contained the analysis results here in the README, but the current results are now in that manuscript and this repo just holds the scripts associated with the analysis. These scripts assume you're running on Ubuntu 16.04. They may work on other OSs, but no guarantees!

If you're still interested in the older results, here is a link to the earlier version of this repo: Comparison of Oxford Nanopore basecalling tools.

Basecalling

Before you analyse a read set, you must generate the read set! The basecalling_scripts directory contains Bash scripts with the loops/commands I used to run the various basecallers. You'll need to edit the paths at the top of these scripts before running them.

Custom training of basecallers

The sloika_training_scripts directory contains the commands we used to train the custom-Kp and custom-Kp-big-net models using our fork of Sloika.

We used many different isolates in our training set, so the per-isolate_commands.sh script contains the commands which must be run separately for each of them.

After the preparatory work is done, the model can be trained with the commands in training_commands.sh.

Read set analysis

The analysis_scripts directory contains the scripts for processing and generating accuracy measurements from read sets. Before the analysis, the reads must be given consistent names, as different basecallers have different conventions for the fastq headers. The fix_read_names.py script will convert a read fastq into a format suitable for the next step.
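To illustrate the kind of renaming involved, here is a minimal Python sketch. It is not the code in fix_read_names.py, which may handle each basecaller's header conventions differently. It assumes four-line FASTQ records and that the read ID is the first whitespace-delimited token of the header; the function name normalise_fastq_headers is hypothetical.

```python
import io


def normalise_fastq_headers(in_handle, out_handle):
    """Rewrite each FASTQ header as '@<read_id>' and return the record count.

    Assumption (not taken from fix_read_names.py): records span exactly four
    lines and the read ID is the first whitespace-delimited header token.
    """
    count = 0
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines between records
        seq = in_handle.readline()
        in_handle.readline()  # '+' separator line, contents discarded
        qual = in_handle.readline()
        tokens = header[1:].split()
        if not header.startswith('@') or not tokens:
            raise ValueError('malformed FASTQ header: ' + header.rstrip())
        # Keep only the read ID; basecaller-specific metadata is dropped.
        out_handle.write('@{}\n{}\n+\n{}\n'.format(tokens[0], seq.rstrip(), qual.rstrip()))
        count += 1
    return count


if __name__ == '__main__':
    example = io.StringIO('@read1 runid=abc ch=5\nACGT\n+read1\n!!!!\n')
    result = io.StringIO()
    normalise_fastq_headers(example, result)
    print(result.getvalue(), end='')
```

Dropping everything after the read ID means the same read carries the same name in every basecaller's output, so later steps can match reads across read sets.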

analysis.sh is the 'master script' that will run all analyses on a given read set: read-level accuracy, assembly, assembly-level accuracy, nanopolish and nanopolish-level accuracy. It will use the other scripts in its execution. You also might want to edit some of the variables at the start of the script to change things like the output directories and the number of CPU threads. You can also comment out parts of this script if you only want to run some of the analyses.
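As a rough illustration of what a read-level accuracy measurement can look like, the sketch below computes a read's identity from an alignment CIGAR string. This is an assumption for illustration, not necessarily how the scripts in analysis_scripts calculate it: identity is taken here as matching bases divided by the total alignment length (matches, mismatches, insertions and deletions), and the CIGAR is assumed to use the extended '=' and 'X' operations. The function name identity_from_cigar is hypothetical.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
_CIGAR_OP = re.compile(r'(\d+)([MIDNSHP=X])')


def identity_from_cigar(cigar):
    """Return matches / alignment length for an extended CIGAR string.

    Assumption: '=' is a match, 'X' a mismatch, 'I' an insertion and 'D' a
    deletion. Clipping (S, H), skips (N) and padding (P) are ignored.
    """
    matches = 0
    aligned = 0
    for length, op in _CIGAR_OP.findall(cigar):
        n = int(length)
        if op == '=':
            matches += n
            aligned += n
        elif op in ('X', 'I', 'D'):
            aligned += n
        elif op == 'M':
            # 'M' does not distinguish matches from mismatches.
            raise ValueError("ambiguous 'M' operation; extended CIGAR required")
    if aligned == 0:
        return 0.0
    return matches / aligned


if __name__ == '__main__':
    # 90 matches over 100 aligned columns (90 '=' + 5 'X' + 3 'I' + 2 'D').
    print(identity_from_cigar('5S90=5X3I2D'))
```

Counting insertions and deletions in the denominator penalises indel errors as well as substitutions, which matters for nanopore reads where indels are common.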

License

GNU General Public License, version 3
