Chapter 2: File Formats

1.0 FASTA

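A simple text-based format for DNA or peptide sequences. Each record begins with a header line starting with '>', followed by one or more lines of sequence.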

1.1 Different Header Formats

https://en.wikipedia.org/wiki/FASTA_format

Database	Format
GenBank	gb|accession|locus
EMBL Data Library	emb|accession|locus
DDBJ, DNA Database of Japan	dbj|accession|locus
NBRF PIR	pir||entry
Protein Research Foundation	prf||name
SWISS-PROT	sp|accession|entry name
Brookhaven Protein Data Bank	pdb|entry|chain
Patents	pat|country|number
GenInfo Backbone Id	bbs|number
General database identifier	gnl|database|identifier
NCBI Reference Sequence	ref|accession|locus
Local Sequence identifier	lcl|identifier
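
The header layouts in the table can be split mechanically on the '|' character. The following is a minimal sketch, not a reference parser: the function name parse_fasta_id, the key "tag", and the example identifiers in the usage note are hypothetical, and the field names are simply taken from the table.

```python
# Field names per database tag, copied from the table above.
# An empty string marks a position that is left blank in the header (e.g. pir||entry).
HEADER_FIELDS = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": ("", "entry"),
    "prf": ("", "name"),
    "sp": ("accession", "entry_name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def parse_fasta_id(header):
    """Split the identifier part of a FASTA header line into named fields."""
    # Drop the leading '>' and any free-text description after the first space.
    identifier = header.lstrip(">").split(None, 1)[0]
    tag, *values = identifier.split("|")
    names = HEADER_FIELDS.get(tag)
    if names is None:
        raise ValueError(f"unknown database tag: {tag!r}")
    # Pair field names with values, skipping the blank placeholder positions.
    fields = {name: value for name, value in zip(names, values) if name}
    return {"tag": tag, **fields}
```

For example, parse_fasta_id(">gb|ABC123.1|LOCUS1 some description") yields the tag "gb" with accession "ABC123.1" and locus "LOCUS1" (a made-up identifier used only for illustration).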

1.2 Filename extensions

https://en.wikipedia.org/wiki/FASTA_format

Extension	Meaning	Notes
fasta	generic FASTA	Any generic FASTA file. Other extensions can be fas, fa, seq, fsa
fna	FASTA nucleic acid	Used generically to specify nucleic acids
ffn	FASTA nucleotide of gene regions	Contains coding regions for a genome
faa	FASTA amino acid	Contains amino acids. A multiple protein FASTA file can have the more specific extension mpfa
frn	FASTA non-coding RNA	Contains non-coding RNA regions for a genome, in DNA alphabet, e.g. tRNA, rRNA

2.0 FASTQ

A text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.

2.1 Phred Scores

Error probabilities (p) are very small decimal numbers, e.g. p = 10^-20.
It is common practice to express them on a logarithmic scale.

Phred scaling maps such a probability to an integer:

Phred Quality Score: Q = -10 * log10(p)

2.3 Base Quality

A base-quality value represents the probability that the corresponding base call is incorrect.
Base qualities are ASCII (American Standard Code for Information Interchange) encoded Phred scores.
Two variants have been in use:
- Sanger variant (ASCII offset 33), also used by Illumina Casava 1.8 and later
- Solexa pipeline variant (ASCII offset 64): software delivered with the Illumina Genome Analyzer (GA), Illumina Casava 1.3 up to, but not including, 1.8
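
Decoding a quality string amounts to subtracting a fixed offset from each character code. The sketch below assumes the widely documented offsets of 33 for the Sanger / Casava 1.8+ variant and 64 for the older Illumina pipeline variant; the function name decode_qualities is hypothetical.

```python
def decode_qualities(quality_string, offset=33):
    """Convert an ASCII-encoded quality string into a list of integer scores.

    offset=33 is assumed for the Sanger / Casava 1.8+ variant,
    offset=64 for the older Solexa / Illumina pipeline variant.
    """
    return [ord(char) - offset for char in quality_string]


# The same score of 40 is written as 'I' with offset 33 and as 'h' with offset 64.
print(decode_qualities("I", offset=33))
print(decode_qualities("h", offset=64))
```

Choosing the wrong offset shifts every score by 31, so a file whose decoded scores look implausibly high or negative is usually a sign that the other variant is in use.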

2.4 Phred Quality and Error

Phred Score	P_Error
3	~ 50 %
10	10%
15	3.16%
20	1%
30	0.1%

https://en.wikipedia.org/wiki/FASTQ_format
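
The relationship behind the table is the standard Phred definition Q = -10 * log10(p), or equivalently p = 10^(-Q/10). A minimal sketch of both directions follows; the function names are hypothetical.

```python
import math


def phred_to_error(q):
    """Error probability for a Phred quality score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality score for an error probability, rounded to an integer."""
    return round(-10 * math.log10(p))


# Recompute the error column for the scores listed in the table.
for q in (3, 10, 15, 20, 30):
    print(f"Q{q}: {phred_to_error(q):.2%}")
```

The printed percentages agree with the table up to rounding, which is why Q3 appears there as "~ 50 %".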

2.5 Paired-end vs. Single-end FASTQ

Paired-end: the file contains reads generated by paired-end DNA sequencing.

Paired-end data comes either as two separate files, one with forward reads and one with reverse reads, or as a single interleaved file.
Files generated by Illumina sequencers are named as follows:

10_S10_L001_R1_001.fastq.gz and 10_S10_L001_R2_001.fastq.gz or 10_S10_L001_R1_001.fastq.gz for interleaved files.
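
When forward and reverse reads are in separate files, the mate file name can be derived from the R1 name. This is a minimal sketch that assumes the _R1_/_R2_ naming shown above; the function name mate_filename is hypothetical.

```python
import re


def mate_filename(r1_name):
    """Return the R2 file name that pairs with an Illumina-style R1 file name."""
    # Only rewrite the read marker that directly precedes the trailing chunk number.
    r2_name, count = re.subn(r"_R1_(\d+\.fastq(?:\.gz)?)$", r"_R2_\1", r1_name)
    if count != 1:
        raise ValueError(f"not an R1 FASTQ file name: {r1_name!r}")
    return r2_name


print(mate_filename("10_S10_L001_R1_001.fastq.gz"))
```

Anchoring the pattern at the end of the name avoids touching a sample name that happens to contain "_R1_".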

Forward/reverse read indicators: ‘/1’ and ‘/2’, or ‘1:N:0:GCTCATGA+ACTGCATA’ and ‘2:N:0:GCTCATGA+ACTGCATA’, at the end of the header line.
The header line of an Illumina paired-end FASTQ file:

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>

Element	Requirements	Description
@	@	Each sequence identifier line starts with @
<instrument>	Characters allowed: a–z, A–Z, 0–9 and underscore	Instrument ID
<run number>	Numerical	Run number on instrument
<flowcell ID>	Characters allowed: a–z, A–Z, 0–9	Flowcell ID
<lane>	Numerical	Lane number
<tile>	Numerical	Tile number
<x-pos>	Numerical	X coordinate of cluster
<y-pos>	Numerical	Y coordinate of cluster
<read>	Numerical	Read number: 1 for a single read or read 1 of a pair, 2 for read 2 of a paired-end run
<is filtered>	Y or N	Y if the read is filtered (did not pass), N otherwise
<control number>	Numerical	0 when none of the control bits are on, otherwise it is an even number. On HiSeq X and NextSeq systems, control specification is not performed and this number is always 0
<sample number>	Numerical	Sample number from the sample sheet
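
The header layout above maps naturally onto a regular expression with one named group per element. The following is a sketch under assumptions: the pattern follows the character rules in the table, the last field also accepts an index sequence such as GCTCATGA+ACTGCATA, and the function name and the example header in the usage note are hypothetical.

```python
import re

# One named group per element of the header line described in the table.
ILLUMINA_HEADER = re.compile(
    r"^@(?P<instrument>[A-Za-z0-9_]+)"
    r":(?P<run_number>\d+)"
    r":(?P<flowcell_id>[A-Za-z0-9]+)"
    r":(?P<lane>\d+)"
    r":(?P<tile>\d+)"
    r":(?P<x_pos>\d+)"
    r":(?P<y_pos>\d+)"
    r" (?P<read>\d+)"
    r":(?P<is_filtered>[YN])"
    r":(?P<control_number>\d+)"
    r":(?P<sample>\S+)$"
)

NUMERIC_FIELDS = ("run_number", "lane", "tile", "x_pos", "y_pos", "read", "control_number")


def parse_illumina_header(line):
    """Parse an Illumina-style FASTQ header line into a dict of fields."""
    match = ILLUMINA_HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"not an Illumina-style header: {line!r}")
    fields = match.groupdict()
    # Convert the fields the table lists as numerical.
    for name in NUMERIC_FIELDS:
        fields[name] = int(fields[name])
    return fields
```

Calling parse_illumina_header("@INSTR_1:42:FLOWCELL1:1:1101:1201:1000 1:N:0:GCTCATGA+ACTGCATA") on this made-up header returns lane 1, tile 1101, read 1 and is_filtered "N". Real instrument IDs may contain characters outside the table's rules, in which case the first group would need to be relaxed.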

3.0 Platform specific

454 fastq

More information available here: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#454

Ion Torrent fastq

More information available here: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#iontorrent

Older Illumina fastq

Older Illumina sequencers generated FASTQ files with the following header format:

@<machine_id>:<lane>:<tile>:<x_coord>:<y_coord>#<index>/<read>

Ex:

@HWI-ST745_0097:7:1101:1201:1000#0/1
ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCT
+HWI-ST745_0097:7:1101:1201:1000#0/1
HHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHGFHFHHFHFHEHEEGBGGFHFHHGGBHHEEGGBGFGCDGGGDBGF
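
A record like the one above can be read four lines at a time and its header split with a regular expression. This is a minimal sketch that assumes strictly four lines per record; the names read_fastq and OLD_HEADER are hypothetical, and the demo record keeps the header from the example but truncates the sequence and quality strings for brevity.

```python
import io
import re

# @<machine_id>:<lane>:<tile>:<x_coord>:<y_coord>#<index>/<read>
OLD_HEADER = re.compile(
    r"^@(?P<machine_id>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x_coord>\d+):(?P<y_coord>\d+)#(?P<index>[^/]+)/(?P<read>[12])$"
)


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, optionally repeating the header
        quality = handle.readline().rstrip("\n")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header, sequence, quality


# Shortened, illustrative record (sequence and qualities truncated to 8 bases).
demo = io.StringIO(
    "@HWI-ST745_0097:7:1101:1201:1000#0/1\n"
    "ACCCTAAC\n"
    "+HWI-ST745_0097:7:1101:1201:1000#0/1\n"
    "HHHHHHHG\n"
)
for header, sequence, quality in read_fastq(demo):
    print(OLD_HEADER.match(header).groupdict(), sequence, quality)
```

For real data, an established parser such as the one in Biopython is usually a better choice, since FASTQ files in the wild may wrap sequence lines.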

PacBio fastq

More: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#pacbio-ccs-circular-consensus-se

Helicos fastq

More: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#helicos-fastq-with-a-fixed-ascii

4.0 HDF5 (Hierarchical Data Format)

A versatile data model that can represent very complex data and can include metadata. Read more: https://support.hdfgroup.org/HDF5/Tutor/HDF5Intro.pdf

4.1 PacBio

Example: download a sample PacBio bas.h5 file and list its contents recursively:

wget http://files.pacb.com/software/hgap/HGAp_BAS_H5_DATA/HGAp_BAS_H5_DATA/BAC/m120729_040044_42134_c100384402550000001523033010171256_s1_p0.bas.h5

h5ls -r  m120729_040044_42134_c100384402550000001523033010171256_s1_p0.bas.h5

4.2 MinION Oxford Nanopore

Files are stored in HDF5 format and have the extension .fast5.

https://porecamp.github.io/2016/tutorials/PoreCamp2016-02-MinIONData.pdf

5.0 Legacy Formats

SRF files: a generic file format for storing DNA sequence data.
Native Illumina

QSEQ: files carry the suffix _qseq.txt and are named according to the pattern:

s_<lane>_<read>_<tile>_qseq.txt

http://www.bordeaux.inra.fr/live/pgtb/lib/exe/fetch.php?media=casava1.7_user_guide_15011196_a.pdf
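
The QSEQ naming pattern s_<lane>_<read>_<tile>_qseq.txt can be taken apart with a small regular expression. This is a sketch only: it assumes all three fields are numeric, and the function name and the example file name in the usage note are hypothetical.

```python
import re

# s_<lane>_<read>_<tile>_qseq.txt, assuming numeric lane, read and tile fields.
QSEQ_NAME = re.compile(r"^s_(?P<lane>\d+)_(?P<read>\d+)_(?P<tile>\d+)_qseq\.txt$")


def parse_qseq_name(filename):
    """Return lane, read and tile numbers encoded in a QSEQ file name."""
    match = QSEQ_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a QSEQ file name: {filename!r}")
    return {name: int(value) for name, value in match.groupdict().items()}
```

For instance, parse_qseq_name("s_1_2_0003_qseq.txt") gives lane 1, read 2 and tile 3.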
