Chapter 2 : File Formats

Saranga Wijeratne edited this page Jun 6, 2018 · 1 revision

1.0 FASTA


Simple file format for DNA or peptide sequences.

1.1 Different Header Formats


https://en.wikipedia.org/wiki/FASTA_format

Database                        Format
GenBank                         gb|accession|locus
EMBL Data Library               emb|accession|locus
DDBJ, DNA Database of Japan     dbj|accession|locus
NBRF PIR                        pir||entry
Protein Research Foundation     prf||name
SWISS-PROT                      sp|accession|entry name
Brookhaven Protein Data Bank    pdb|entry|chain
Patents                         pat|country|number
GenInfo Backbone Id             bbs|number
General database identifier     gnl|database|identifier
NCBI Reference Sequence         ref|accession|locus
Local Sequence identifier       lcl|identifier
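
The identifier formats in the table above can be split into named fields with a few lines of Python. This is only a minimal sketch: the function name `parse_defline_id`, the `FIELDS` dictionary, and the example identifiers in the usage note are hypothetical and not part of any NCBI tool. It handles a single identifier at the start of a header line, not several chained identifiers.

```python
# Field names per database tag, copied from the table above.
# An empty name marks a field that is left blank in the format (e.g. pir||entry).
FIELDS = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": ("", "entry"),
    "prf": ("", "name"),
    "sp": ("accession", "entry name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def parse_defline_id(header):
    """Split one identifier such as 'gb|<accession>|<locus>' into named fields."""
    # Drop the leading '>' and anything after the first whitespace (the description).
    stripped = header.lstrip(">").strip()
    token = stripped.split()[0] if stripped else ""
    tag, _, rest = token.partition("|")
    if tag not in FIELDS:
        raise ValueError(f"unknown database tag: {tag!r}")
    names = FIELDS[tag]
    values = rest.split("|", len(names) - 1)
    # Pad with empty strings when trailing fields are missing.
    values += [""] * (len(names) - len(values))
    result = {"tag": tag}
    for name, value in zip(names, values):
        if name:  # skip the intentionally blank fields
            result[name] = value
    return result
```

For example, `parse_defline_id(">gb|ACC0001|LOC1 made-up description")` returns a dictionary with the keys `tag`, `accession` and `locus`.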

1.2 Filename extensions


https://en.wikipedia.org/wiki/FASTA_format

Extension   Meaning                            Notes
fasta       generic FASTA                      Any generic FASTA file. Other extensions can be fas, fa, seq, fsa
fna         FASTA nucleic acid                 Used generically to specify nucleic acids
ffn         FASTA nucleotide of gene regions   Contains coding regions for a genome
faa         FASTA amino acid                   Contains amino acids. A multiple protein FASTA file can have the more specific extension mpfa
frn         FASTA non-coding RNA               Contains non-coding RNA regions for a genome, in DNA alphabet, e.g. tRNA, rRNA
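
The extension table can be turned into a simple lookup, for instance to decide how to treat a file before opening it. The sketch below is illustrative only: the function name `fasta_content_type` and the category labels are hypothetical, and treating `.gz` and `.bz2` as compression suffixes to be skipped is an assumption, not something the table states.

```python
from pathlib import Path

# Extension-to-content mapping, following the table above.
FASTA_EXTENSIONS = {
    "fasta": "generic",
    "fas": "generic",
    "fa": "generic",
    "seq": "generic",
    "fsa": "generic",
    "fna": "nucleic acid",
    "ffn": "nucleotide coding regions",
    "faa": "amino acid",
    "mpfa": "amino acid",
    "frn": "non-coding RNA",
}

# Assumption: compressed files keep their FASTA extension underneath.
COMPRESSION_SUFFIXES = {".gz", ".bz2"}


def fasta_content_type(filename):
    """Return the content category for a FASTA file name, or None if unknown."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    # Ignore trailing compression suffixes such as '.gz'.
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes:
        return None
    return FASTA_EXTENSIONS.get(suffixes[-1].lstrip("."))
```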

2.0 FASTQ


A text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.

2.1 Phred Scores


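  • Likelihood probabilities (p) are very small decimal numbers, e.g. p = 10^-20
  • It is common practice to show them on a logarithmic scale

  • Phred scaling maps these probabilities to integers

  • Phred Quality Score: Q = -10 · log10(p), rounded to the nearest integer, so p = 10^-20 gives Q = 200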
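
The standard Phred definition is Q = -10 · log10(p), where p is the error probability. The conversion in both directions can be sketched in Python as follows. The function names `prob_to_phred` and `phred_to_prob` are hypothetical, chosen for this example.

```python
import math


def prob_to_phred(p):
    """Map an error probability p (0 < p <= 1) to an integer Phred score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    # Q = -10 * log10(p), rounded to the nearest integer.
    return round(-10 * math.log10(p))


def phred_to_prob(q):
    """Map a Phred score Q back to an error probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)
```

With these definitions, `prob_to_phred(0.01)` gives 20, and `phred_to_prob(30)` gives approximately 0.001, i.e. a 0.1% chance that the base call is wrong.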

2.3 Base Quality


  • A base-quality value represents the probability that the corresponding base call is incorrect
  • Base qualities are ASCII (American Standard Code for Information Interchange) encoded Phred scores
  • Two variations have been in use
    • Sanger and Illumina CASAVA 1.8 or later variant (ASCII offset 33)

    • Solexa pipeline variant (ASCII offset 64): software delivered with the Illumina Genome Analyzer (GA), Illumina CASAVA 1.3 up to, but not including, 1.8
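
Decoding an ASCII-encoded quality string comes down to subtracting an offset from each character code: 33 for the Sanger / Illumina 1.8+ variant and 64 for the Illumina 1.3 to 1.7 variant. A minimal sketch follows. The function name `decode_qualities` is hypothetical, and the caller is assumed to know which variant the file uses.

```python
def decode_qualities(quality_string, offset=33):
    """Convert an ASCII-encoded quality string into a list of Phred scores.

    offset=33 corresponds to the Sanger / Illumina 1.8+ encoding,
    offset=64 to the older Illumina 1.3 to 1.7 encoding.
    """
    scores = [ord(ch) - offset for ch in quality_string]
    if any(q < 0 for q in scores):
        # A negative score usually means the wrong offset was chosen.
        raise ValueError("character below the chosen offset; wrong encoding?")
    return scores
```

For example, `decode_qualities("II5+!")` yields `[40, 40, 20, 10, 0]` with the default offset of 33.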

2.4 Phred Quality and Error


Phred Score   P(error)
3             ~50%
10            10%
15            3.16%
20            1%
30            0.1%

https://en.wikipedia.org/wiki/FASTQ_format

2.5 Paired-end vs. Single-end FASTQ


Paired-end: the file contains reads generated using paired-end DNA sequencing.

  • Paired-end data comes either as separate files of forward and reverse reads or as a single interleaved file
  • Files generated by Illumina sequencers are named as follows:

10_S10_L001_R1_001.fastq.gz and 10_S10_L001_R2_001.fastq.gz for separate files, or 10_S10_L001_R1_001.fastq.gz for interleaved files.

  • Forward/reverse read indicators: ‘/1’ and ‘/2’ or ‘1:N:0:GCTCATGA+ACTGCATA’ and ‘2:N:0:GCTCATGA+ACTGCATA’ at the end of the header line

  • The header line of an Illumina paired-end FASTQ file:

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>

Element            Requirements                                       Description
@                  @                                                  Each sequence identifier line starts with @
<instrument>       Characters allowed: a–z, A–Z, 0–9 and underscore   Instrument ID
<run number>       Numerical                                          Run number on instrument
<flowcell ID>      Characters allowed: a–z, A–Z, 0–9                  Flowcell ID
<lane>             Numerical                                          Lane number
<tile>             Numerical                                          Tile number
<x-pos>            Numerical                                          X coordinate of cluster
<y-pos>            Numerical                                          Y coordinate of cluster
<read>             Numerical                                          Read number: 1 for a single read or read 1 of a pair, 2 for read 2 of a paired-end run
<is filtered>      Y or N                                             Y if the read is filtered (did not pass), N otherwise
<control number>   Numerical                                          0 when none of the control bits are on, otherwise it is an even number. On HiSeq X and NextSeq systems, control specification is not performed and this number is always 0
<sample number>    Numerical                                          Sample number from the sample sheet
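
The header layout described above can be split into its elements with plain string operations. The following Python sketch is illustrative only: the names `IlluminaHeader` and `parse_illumina_header` are hypothetical. The last field is kept as a string because, as in the read indicators shown earlier, it may hold an index sequence such as GCTCATGA+ACTGCATA rather than a number.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: bool
    control_number: int
    sample_number: str  # kept as text; may be a number or an index sequence


def parse_illumina_header(line):
    """Parse '@<instrument>:...:<y-pos> <read>:<is filtered>:<control number>:<sample number>'."""
    line = line.strip()
    if not line.startswith("@"):
        raise ValueError("header line must start with '@'")
    # The two halves of the header are separated by a single space.
    left, sep, right = line[1:].partition(" ")
    left_fields = left.split(":")
    right_fields = right.split(":")
    if not sep or len(left_fields) != 7 or len(right_fields) != 4:
        raise ValueError("unexpected number of fields in header")
    if right_fields[1] not in ("Y", "N"):
        raise ValueError("<is filtered> must be Y or N")
    instrument, run, flowcell, lane, tile, x_pos, y_pos = left_fields
    read, is_filtered, control, sample = right_fields
    return IlluminaHeader(
        instrument, int(run), flowcell, int(lane), int(tile),
        int(x_pos), int(y_pos), int(read), is_filtered == "Y",
        int(control), sample,
    )
```

A call such as `parse_illumina_header("@INSTR1:12:FLOWCELL1:1:1101:15589:1331 1:N:0:GCTCATGA+ACTGCATA")` (instrument, run and flowcell values made up for illustration) returns a record whose `lane` is 1, `read` is 1 and `is_filtered` is `False`.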

3.0 Platform specific


  • 454 fastq

More information available here: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#454

  • Ion Torrent fastq

More information available here: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#iontorrent

  • Older Illumina fastq

Older Illumina sequencers generated FASTQ files with headers in the following format:

@<machine_id>:<lane>:<tile>:<x_coord>:<y_coord>#<index>/<read>

Example:

@HWI-ST745_0097:7:1101:1201:1000#0/1
ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCT
+HWI-ST745_0097:7:1101:1201:1000#0/1
HHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHGFHFHHFHFHEHEEGBGGFHFHHGGBHHEEGGBGFGCDGGGDBGF
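
The older header layout, `@<machine_id>:<lane>:<tile>:<x_coord>:<y_coord>#<index>/<read>`, lends itself to a regular expression. This is a rough sketch: the pattern name `OLD_HEADER` and the function `parse_old_illumina_header` are hypothetical, and restricting `<read>` to 1 or 2 is an assumption based on the forward/reverse read indicators.

```python
import re

# One named group per element of the older Illumina header layout.
OLD_HEADER = re.compile(
    r"^@(?P<machine_id>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x_coord>-?\d+):(?P<y_coord>-?\d+)#(?P<index>[^/]+)/(?P<read>[12])$"
)


def parse_old_illumina_header(line):
    """Return the header elements as a dict, with numeric fields converted to int."""
    match = OLD_HEADER.match(line.strip())
    if match is None:
        raise ValueError("not an old-style Illumina header")
    fields = match.groupdict()
    for key in ("lane", "tile", "x_coord", "y_coord", "read"):
        fields[key] = int(fields[key])
    return fields
```

Applied to the example header above, `@HWI-ST745_0097:7:1101:1201:1000#0/1`, this gives machine ID `HWI-ST745_0097`, lane 7, tile 1101, coordinates (1201, 1000), index `0` and read 1.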

  • PacBio fastq

More: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#pacbio-ccs-circular-consensus-se

  • Helicos fastq

More: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#helicos-fastq-with-a-fixed-ascii

4.0 HDF5 (Hierarchical Data Format)


A versatile data model that can represent very complex data and can include metadata. Read more: https://support.hdfgroup.org/HDF5/Tutor/HDF5Intro.pdf

4.1 PacBio


Download an example PacBio bas.h5 file and list its contents recursively with h5ls:

wget http://files.pacb.com/software/hgap/HGAp_BAS_H5_DATA/HGAp_BAS_H5_DATA/BAC/m120729_040044_42134_c100384402550000001523033010171256_s1_p0.bas.h5

h5ls -r  m120729_040044_42134_c100384402550000001523033010171256_s1_p0.bas.h5

4.2 MinION Oxford Nanopore


Files are stored in the HDF5 file format and have the extension .fast5.

https://porecamp.github.io/2016/tutorials/PoreCamp2016-02-MinIONData.pdf
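
Because .fast5 and .bas.h5 files are HDF5 containers, a quick sanity check is to look for the 8-byte HDF5 format signature at the start of the file. The sketch below uses only the Python standard library. The function name `looks_like_hdf5` is hypothetical. It checks offset 0 only, although the HDF5 format also allows the signature at later offsets (512, 1024, ...) when a user block precedes it.

```python
# The fixed 8-byte signature that opens an HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the file at `path` begins with the HDF5 signature."""
    with open(path, "rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

This confirms only the container format. Inspecting the groups and datasets inside still requires an HDF5 tool such as h5ls.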

5.0 Legacy Formats


  • SRF files: a generic file format for storing DNA sequence data

  • Native Illumina

QSEQ: file names end in _qseq.txt and follow this pattern:

s_<lane>_<read>_<tile>_qseq.txt

http://www.bordeaux.inra.fr/live/pgtb/lib/exe/fetch.php?media=casava1.7_user_guide_15011196_a.pdf
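
The QSEQ naming pattern `s_<lane>_<read>_<tile>_qseq.txt` can be matched with a short regular expression, for example to group files by lane. In this sketch the names `QSEQ_NAME` and `parse_qseq_filename` are hypothetical, and the file name in the usage note is made up.

```python
import re

# s_<lane>_<read>_<tile>_qseq.txt, with each element captured as digits.
QSEQ_NAME = re.compile(r"^s_(?P<lane>\d+)_(?P<read>\d+)_(?P<tile>\d+)_qseq\.txt$")


def parse_qseq_filename(filename):
    """Return lane, read and tile as integers, or None if the name does not match."""
    match = QSEQ_NAME.match(filename)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}
```

For a file named `s_1_2_0034_qseq.txt`, the function returns lane 1, read 2 and tile 34.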