This tool is deprecated!
Albacore (Oxford Nanopore's basecaller) can basecall directly to FASTQ, which makes FAST5 to FASTQ conversion much less relevant. Also, I have since written Filtlong, which does more sophisticated long read filtering than these scripts.
The current recommendation is therefore to basecall directly to FASTQ (or use some other tool to extract FASTQ reads from FAST5 files), trim with Porechop and then filter with Filtlong. If you're still interested in using these scripts, the original README follows below:
FAST5 to FASTQ
This is a simple script to extract FASTQ files from FAST5 files.
- If there are multiple FASTQ groups in a FAST5 file (i.e. basecalling has been performed more than once), it extracts the last group. It does this on a per-read basis, so it's okay if some reads have only one group and others have more than one.
- Ability to filter using:
  - Read length (--min_length)
  - Mean Phred score (--min_mean_qual)
  - Minimum mean Phred score in a window, to exclude reads with low quality regions (--min_qual_window)
- Ability to automatically set --min_qual_window to get a target number of bases (--target_bases)
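To illustrate the "last group" rule from the list above, here is a minimal sketch that works on plain strings rather than real HDF5 handles. The function name last_fastq_group and the example dataset paths are hypothetical, and sorting the paths alphabetically is an assumption about what "last" means; the script itself may choose the group differently.

```python
def last_fastq_group(dataset_paths):
    """Return the last FASTQ dataset path in sorted order, or None if absent.

    dataset_paths is a list of dataset names from one FAST5 file. Sorting
    alphabetically and taking the final entry is an assumed interpretation
    of "the last group", not the script's confirmed behaviour.
    """
    fastq_paths = sorted(p for p in dataset_paths if p.upper().endswith("FASTQ"))
    return fastq_paths[-1] if fastq_paths else None


# Illustrative paths for a read that was basecalled twice.
example_paths = [
    "Analyses/Basecall_1D_000/BaseCalled_template/Fastq",
    "Analyses/Basecall_1D_000/BaseCalled_template/Events",
    "Analyses/Basecall_1D_001/BaseCalled_template/Fastq",
]
chosen = last_fastq_group(example_paths)
```

Because the choice is made per file, a read with a single group simply returns that group.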
UPDATE (22 May 2017): Since Albacore v1.1, direct-to-FASTQ basecalling is possible (yay!). I therefore made a version of this script which takes a FASTQ input instead of a FAST5 directory, so you can still apply the length/quality filters if you basecalled straight to FASTQ. More info in the fastq_to_fastq.py section below.
Requirements
- Python 3.4 or later
No installation is required - it's all just in one Python script:
git clone https://github.com/rrwick/Fast5-to-Fastq
Fast5-to-Fastq/fast5_to_fastq.py --help
Extracting all reads from FAST5 to FASTQ:
fast5_to_fastq.py path/to/fast5_directory > output.fastq
- This will search through the target directory recursively.
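As a rough illustration of a recursive search, the sketch below walks a directory tree and yields every file ending in .fast5. The function name find_fast5_files is hypothetical and this is not necessarily how the script does it.

```python
import os


def find_fast5_files(directory):
    """Yield the path of every *.fast5 file under directory, at any depth."""
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in sorted(filenames):
            if name.endswith(".fast5"):
                yield os.path.join(dirpath, name)
```

For example, list(find_fast5_files("path/to/fast5_directory")) would collect reads from all subdirectories, however they are nested.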
Gzip while you extract:
fast5_to_fastq.py path/to/fast5_directory | gzip > output.fastq.gz
Filter based on length:
fast5_to_fastq.py --min_length 10000 path/to/fast5_directory | gzip > output.fastq.gz
- To be included in the output, reads must be 10 kbp or longer.
Filter based on mean Phred quality score:
fast5_to_fastq.py --min_mean_qual 11.5 path/to/fast5_directory | gzip > output.fastq.gz
- To be included in the output, reads must have a mean Phred score of at least 11.5.
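To make the mean-quality filter concrete, here is a small sketch that decodes a FASTQ quality string and averages the per-base scores. It assumes the usual Phred+33 encoding and a plain arithmetic mean of the scores; the function names are hypothetical and the script's own calculation may differ.

```python
def mean_phred(quality_string, offset=33):
    """Arithmetic mean of the per-base Phred scores in a FASTQ quality string."""
    scores = [ord(c) - offset for c in quality_string]
    return sum(scores) / len(scores) if scores else 0.0


def passes_mean_qual(quality_string, min_mean_qual):
    """True if the read's mean Phred score is at least min_mean_qual."""
    return mean_phred(quality_string) >= min_mean_qual
```

With this encoding "," is Phred 11 and "-" is Phred 12, so a two-base read with quality ",-" has a mean of exactly 11.5 and would just pass a threshold of 11.5.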
Filter based on min Phred score over a sliding window:
fast5_to_fastq.py --min_qual_window 10.0 path/to/fast5_directory | gzip > output.fastq.gz
- To be included in the output, reads must have a mean Phred score over a sliding window that never drops below 10.0.
- The default window size is 50 bp, but it's configurable (see --help for the option name).
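The sliding-window idea can be sketched as follows: compute the mean Phred score of every window along the read and keep the lowest value. This is a sketch under assumptions: Phred+33 encoding, a default window of 50 bases as stated above, and reads shorter than one window scored by their overall mean (the script may treat short reads differently). The function name is hypothetical.

```python
def min_window_quality(quality_string, window_size=50, offset=33):
    """Lowest mean Phred score over any window of window_size bases."""
    scores = [ord(c) - offset for c in quality_string]
    if not scores:
        return 0.0
    if len(scores) <= window_size:
        # Assumption: a read shorter than one window is scored by its mean.
        return sum(scores) / len(scores)
    # Running sum, so each step adds one base and drops one base.
    window_sum = sum(scores[:window_size])
    min_sum = window_sum
    for i in range(window_size, len(scores)):
        window_sum += scores[i] - scores[i - window_size]
        min_sum = min(min_sum, window_sum)
    return min_sum / window_size
```

A read that is Phred 40 everywhere except for a 50-base stretch of Phred 10 has a healthy overall mean of 34, but its minimum window quality is 10, which is why this filter catches reads with bad regions that the mean-quality filter would let through.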
Aim for a target number of bases:
fast5_to_fastq.py --target_bases 100000000 path/to/fast5_directory | gzip > output.fastq.gz
- Only outputs the best 100 Mbp of reads, as judged by their minimum mean Phred score over a sliding window.
- Effectively sets --min_qual_window automatically to get the desired number of bases in the output.
How I (Ryan) like to use it:
fast5_to_fastq.py --min_length 2000 --target_bases 500000000 path/to/fast5_directory | gzip > output.fastq.gz
- I mainly use Nanopore reads for bacterial isolate assembly, and anything over 100x depth is probably overkill. So I use --target_bases to aim for about 500 Mbp of reads (adjust as necessary for the approximate genome size).
- Repeat sequences of about 1 kbp are common in bacterial genomes (e.g. insertion sequences), so I use --min_length to exclude anything less than 2 kbp. That's large enough that there should be very few reads which are entirely contained within a repeat (which aren't useful for assembly), but small enough that I'm not excluding small plasmid sequences.
- I'll then pass the output through Porechop to get rid of adapters and split/discard chimeric reads.
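The depth reasoning above is simple arithmetic: target bases = genome size x desired depth. A tiny sketch follows, with a hypothetical helper name and a 5 Mbp genome used purely as an illustrative size (it is the size for which 500 Mbp corresponds to 100x).

```python
def target_bases_for_depth(genome_size_bp, depth):
    """Number of bases needed to cover a genome of genome_size_bp at depth x."""
    return int(genome_size_bp * depth)


# Illustrative: a 5 Mbp genome at 100x depth needs 500 Mbp of reads.
example_target = target_bases_for_depth(5000000, 100)
```

For a smaller or larger genome, the same calculation gives the value to pass to --target_bases.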
The fastq_to_fastq.py script has the same usage as fast5_to_fastq.py; just replace path/to/fast5_directory with path/to/reads.fastq. For example:
fastq_to_fastq.py --min_length 2000 --target_bases 500000000 path/to/reads.fastq | gzip > output.fastq.gz
Both *.fastq and *.fastq.gz should work as input formats.
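Accepting both plain and gzipped FASTQ can be done by checking the first two bytes of the file for the gzip magic number. The sketch below shows one way to do it. The function names are hypothetical, the parser assumes simple four-line FASTQ records, and the script may detect the format in another way (for example by file extension).

```python
import gzip


def open_fastq(path):
    """Open a plain or gzipped FASTQ file for reading as text."""
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_fastq(path):
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    with open_fastq(path) as handle:
        while True:
            header = handle.readline().rstrip("\r\n")
            if not header:
                break
            sequence = handle.readline().rstrip("\r\n")
            handle.readline()  # the '+' separator line
            quality = handle.readline().rstrip("\r\n")
            yield header[1:], sequence, quality
```

Detecting by content rather than by extension means a gzipped file with an unusual name is still read correctly.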
FAST5 integrity check
I ran into some annoying crashes caused by corrupt FAST5 files, so I made the fast5_integrity_check.py tool to find these.
It only takes one argument: the directory to check (searched recursively):
fast5_integrity_check.py path/to/fast5_directory
It prints the name and path for bad FAST5 files to stdout and some progress info to stderr.
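For a feel of what such a check involves, here is a deliberately simplified stand-in using only the standard library. FAST5 files are HDF5 files, and an HDF5 file without a user block starts with a fixed 8-byte signature, so the sketch flags any .fast5 file that lacks it. This only catches empty, truncated-at-start, or non-HDF5 files; the real tool presumably opens and reads each file, which finds far more kinds of corruption. All names here are hypothetical.

```python
import os
import sys

# The 8-byte signature at the start of an HDF5 file with no user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """True if the file can be read and begins with the HDF5 signature."""
    try:
        with open(path, "rb") as f:
            return f.read(8) == HDF5_SIGNATURE
    except OSError:
        return False


def find_suspect_fast5_files(directory):
    """Print suspect *.fast5 paths to stdout, progress to stderr; return them."""
    suspect = []
    checked = 0
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in sorted(filenames):
            if not name.endswith(".fast5"):
                continue
            path = os.path.join(dirpath, name)
            checked += 1
            if not looks_like_hdf5(path):
                suspect.append(path)
                print(path)
            if checked % 1000 == 0:
                print("checked %d files" % checked, file=sys.stderr)
    print("done: %d checked, %d suspect" % (checked, len(suspect)), file=sys.stderr)
    return suspect
```

Keeping the file list on stdout and the progress messages on stderr means the output can be redirected to a file without the progress text mixed in.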