A simple tool for extracting reads from Oxford Nanopore fast5 files

This tool is deprecated!

Albacore (Oxford Nanopore's basecaller) can basecall directly to FASTQ, which makes FAST5 to FASTQ conversion much less relevant. Also, I have since written Filtlong, which does more sophisticated long read filtering than these scripts.

The current recommendation is therefore to basecall directly to FASTQ (or use some other tool to extract FASTQ reads from FAST5 files), trim with Porechop and then filter with Filtlong. If you're still interested in using these scripts, the original README follows below:

FAST5 to FASTQ

This is a simple script to extract FASTQ files from FAST5 files.

There are a number of other tools which can do this, including Poretools, PoRe, nanopolish extract and more. I made this one for a couple of specific features:

  • If there are multiple FASTQ groups in a FAST5 file (i.e. basecalling has been performed more than once), it extracts the last group. It does this on a per-read basis, so it's okay if some reads have only one group and others have more than one.
  • Ability to filter using:
    • Read length (--min_length)
    • Mean Phred score (--min_mean_qual)
    • Minimum mean Phred score in a window, to exclude reads with low quality regions (--min_qual_window)
  • Ability to automatically set --min_qual_window to get a target number of bases (--target_bases)
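
The "last group" rule in the first point can be sketched in a few lines. This is only an illustration, not the script's actual code: it assumes basecall groups are named like Basecall_1D_000, Basecall_1D_001 and so on, and the function name last_basecall_group is hypothetical.

```python
def last_basecall_group(group_names, prefix='Basecall_1D_'):
    """Return the highest-numbered basecall group name, or None if there is none.

    Assumes (hypothetically) that groups are named <prefix><number>,
    e.g. Basecall_1D_000, Basecall_1D_001.
    """
    numbered = []
    for name in group_names:
        suffix = name[len(prefix):]
        if name.startswith(prefix) and suffix.isdigit():
            numbered.append((int(suffix), name))
    if not numbered:
        return None
    # The largest number corresponds to the most recent basecalling run.
    return max(numbered)[1]


# Each read is handled on its own, so reads with one group and reads
# with several groups can be mixed freely.
print(last_basecall_group(['Basecall_1D_000', 'Basecall_1D_001']))
```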

UPDATE (22 May 2017): Since Albacore v1.1, direct to FASTQ basecalling is possible (yay!). I therefore made a version of this script which takes a FASTQ input instead of a FAST5 directory so you can perform the length/quality filters if you did straight-to-FASTQ basecalling. More info here.

Requirements

  • Python 3.4 or later
  • h5py

Installation

No installation is required - it's all just in one Python script:

git clone https://github.com/rrwick/Fast5-to-Fastq
Fast5-to-Fastq/fast5_to_fastq.py --help

Usage

Extracting all reads from FAST5 to FASTQ:

  • fast5_to_fastq.py path/to/fast5_directory > output.fastq
  • This will search through the target directory recursively.

Gzip while you extract:

  • fast5_to_fastq.py path/to/fast5_directory | gzip > output.fastq.gz

Filter based on length:

  • fast5_to_fastq.py --min_length 10000 path/to/fast5_directory | gzip > output.fastq.gz
  • To be included in the output, reads must be 10 kbp or longer.

Filter based on mean Phred quality score:

  • fast5_to_fastq.py --min_mean_qual 11.5 path/to/fast5_directory | gzip > output.fastq.gz
  • To be included in the output, reads must have a mean Phred score of at least 11.5.
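
For reference, a mean Phred score can be computed from a FASTQ quality line roughly as follows. This is a minimal sketch that assumes Phred+33 encoding and a plain arithmetic mean of the per-base scores; the function names are hypothetical and the script itself may differ in detail.

```python
def mean_phred(quality_string, offset=33):
    """Arithmetic mean of the per-base Phred scores in a FASTQ quality line."""
    if not quality_string:
        return 0.0
    # Each character encodes one score as chr(score + offset).
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def passes_min_mean_qual(quality_string, min_mean_qual):
    """True if the read would survive a --min_mean_qual style filter."""
    return mean_phred(quality_string) >= min_mean_qual
```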

Filter based on minimum mean Phred score over a sliding window:

  • fast5_to_fastq.py --min_qual_window 10.0 path/to/fast5_directory | gzip > output.fastq.gz
  • To be included in the output, reads must have a mean Phred score over a sliding window that never drops below 10.0.
  • The default window size is 50 bp, but it's configurable with --window_size.
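
The sliding-window measure can be sketched like this. It is an illustration under stated assumptions rather than the script's own code: Phred+33 encoding, a window that moves one base at a time, and reads shorter than the window treated as a single window. The name min_window_qual is hypothetical.

```python
def min_window_qual(quality_string, window_size=50, offset=33):
    """Lowest mean Phred score seen in any window along the read."""
    scores = [ord(ch) - offset for ch in quality_string]
    if not scores:
        return 0.0
    if len(scores) <= window_size:
        # Assumption: a short read is scored as one window.
        return sum(scores) / len(scores)
    # Keep a running sum so each step costs O(1) instead of O(window_size).
    window_sum = sum(scores[:window_size])
    lowest = window_sum
    for i in range(window_size, len(scores)):
        window_sum += scores[i] - scores[i - window_size]
        lowest = min(lowest, window_sum)
    return lowest / window_size
```

A read passes a --min_qual_window style filter when this value is at least the threshold, so a single bad stretch is enough to exclude an otherwise good read.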

Aim for a target number of bases:

  • fast5_to_fastq.py --target_bases 100000000 path/to/fast5_directory | gzip > output.fastq.gz
  • Only outputs the best 100 Mbp of reads, as judged by their minimum mean Phred score over a sliding window.
  • Effectively sets --min_qual_window automatically to get the number of desired bases in the output.
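
One way to picture the target-bases selection: rank reads by their minimum window quality, then keep the best ones until the target is reached. The sketch below assumes each read has already been reduced to a (name, length, minimum window quality) tuple and that the read which crosses the target is still kept; both are assumptions, and select_best_reads is a hypothetical name.

```python
def select_best_reads(reads, target_bases):
    """Pick the highest-quality reads until target_bases is reached.

    reads: iterable of (name, length, min_window_quality) tuples.
    Returns (kept_names, total_bases, implied_threshold).
    """
    ranked = sorted(reads, key=lambda r: r[2], reverse=True)
    kept, total, threshold = [], 0, None
    for name, length, quality in ranked:
        if total >= target_bases:
            break
        kept.append(name)
        total += length
        # The quality of the last read kept is the implied cutoff.
        threshold = quality
    return kept, total, threshold


reads = [('a', 100, 12.0), ('b', 200, 8.0), ('c', 150, 10.0)]
print(select_best_reads(reads, 200))
```

The returned threshold is the value that --min_qual_window would have needed to be set to by hand to get the same reads.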

How I (Ryan) like to use it:

  • fast5_to_fastq.py --min_length 2000 --target_bases 500000000 path/to/fast5_directory | gzip > output.fastq.gz
  • I mainly use Nanopore reads for bacterial isolate assembly, and anything over 100x depth is probably overkill. So I use --target_bases to aim for about 500 Mbp of reads (adjust as necessary for the approximate genome size).
  • Repeat sequences of about 1 kbp are common in bacterial genomes (e.g. insertion sequences), so I use --min_length to exclude anything less than 2 kbp. That's large enough that there should be very few reads which are entirely contained within a repeat (which aren't useful for assembly), but small enough that I'm not excluding small plasmid sequences.
  • I'll then pass the output through Porechop to get rid of adapters and split/discard chimeric reads.
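
The 500 Mbp figure follows from depth times genome size. A small sketch of that arithmetic, assuming a genome of roughly 5 Mbp (which is what 500 Mbp at 100x implies); the function name is hypothetical:

```python
def target_bases_for_depth(genome_size_bp, depth):
    """Number of bases needed to cover a genome at the given mean depth."""
    return genome_size_bp * depth


genome_size_bp = 5000000  # assumed ~5 Mbp bacterial genome
print(target_bases_for_depth(genome_size_bp, 100))  # 500000000, i.e. 500 Mbp
```

For a different organism, substitute its approximate genome size and pass the result to --target_bases.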

FASTQ filtering

The fastq_to_fastq.py script has the same usage as fast5_to_fastq.py, just replace path/to/fast5_directory with path/to/reads.fastq. For example:

  • fastq_to_fastq.py --min_length 2000 --target_bases 500000000 path/to/reads.fastq | gzip > output.fastq.gz

Both *.fastq and *.fastq.gz should work as input formats.
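
Handling both plain and gzipped input can be done by checking the first two bytes of the file for the gzip magic number. The following is a sketch of that approach using only the standard library, not necessarily how fastq_to_fastq.py does it. It assumes four-line FASTQ records, and the names open_fastq and read_fastq are hypothetical.

```python
import gzip


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling gzip compression."""
    with open(path, 'rb') as f:
        magic = f.read(2)
    if magic == b'\x1f\x8b':  # gzip magic number
        return gzip.open(path, 'rt')
    return open(path, 'rt')


def read_fastq(path):
    """Yield (header, sequence, quality) for each four-line FASTQ record."""
    with open_fastq(path) as handle:
        while True:
            header = handle.readline().rstrip('\n')
            if not header:
                break
            sequence = handle.readline().rstrip('\n')
            handle.readline()  # the '+' separator line
            quality = handle.readline().rstrip('\n')
            yield header[1:], sequence, quality
```

Checking the content rather than the file extension means a gzipped file with an unexpected name is still read correctly.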

FAST5 integrity check

I ran into some annoying crashes caused by corrupt FAST5 files, so I made the fast5_integrity_check.py tool to find these.

It only takes one argument, the directory to check (searched recursively):

  • fast5_integrity_check.py path/to/fast5_directory

It prints the name and path for bad FAST5 files to stdout and some progress info to stderr.
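
As a rough illustration of the idea (and not the tool's actual method), the sketch below walks a directory and flags .fast5 files that do not start with the 8-byte HDF5 signature. This is a much weaker test than opening each file with h5py: it assumes the signature sits at offset 0 and will miss files that are damaged further in. The function name find_suspect_fast5 is hypothetical.

```python
import os
import sys

HDF5_SIGNATURE = b'\x89HDF\r\n\x1a\n'


def find_suspect_fast5(directory):
    """Return paths of .fast5 files that lack the HDF5 signature at offset 0."""
    suspect = []
    checked = 0
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            if not name.endswith('.fast5'):
                continue
            path = os.path.join(root, name)
            checked += 1
            try:
                with open(path, 'rb') as f:
                    ok = f.read(8) == HDF5_SIGNATURE
            except OSError:
                ok = False
            if not ok:
                suspect.append(path)
        # Progress goes to stderr so stdout stays a clean list of paths.
        print('checked {} files'.format(checked), file=sys.stderr)
    return suspect


if __name__ == '__main__' and len(sys.argv) > 1:
    for bad_path in find_suspect_fast5(sys.argv[1]):
        print(bad_path)
```

Keeping the file list on stdout and progress on stderr means the output can be redirected straight into a file of paths to delete or re-transfer.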

License

GNU General Public License, version 3