Chapter 2 : File Formats
FASTA is a simple text-based file format for DNA or peptide sequences.
https://en.wikipedia.org/wiki/FASTA_format
Database | Format |
---|---|
GenBank | gb|accession|locus |
EMBL Data Library | emb|accession|locus |
DDBJ, DNA Database of Japan | dbj|accession|locus |
NBRF PIR | pir||entry |
Protein Research Foundation | prf||name |
SWISS-PROT | sp|accession|entry name |
Brookhaven Protein Data Bank | pdb|entry|chain |
Patents | pat|country|number |
GenInfo Backbone Id | bbs|number |
General database identifier | gnl|database|identifier |
NCBI Reference Sequence | ref|accession|locus |
Local Sequence identifier | lcl|identifier |
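
The identifier layouts in the table above can be split into named fields. The following is a minimal sketch, not a reference implementation: the function name `parse_identifier`, the `DEFLINE_FIELDS` mapping, and the example header are illustrative assumptions, and only the tags listed in the table are handled.

```python
# Field names per database tag, copied from the table above.
# None marks a position that the format leaves empty (e.g. pir||entry).
DEFLINE_FIELDS = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "entry name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def parse_identifier(header):
    """Split the first word of a FASTA header line into named fields."""
    identifier = header.lstrip(">").split()[0]
    tag, *values = identifier.split("|")
    if tag not in DEFLINE_FIELDS:
        raise ValueError(f"unknown database tag: {tag!r}")
    fields = {"tag": tag}
    for name, value in zip(DEFLINE_FIELDS[tag], values):
        if name is not None:  # skip positions that are always empty
            fields[name] = value
    return fields


# Illustrative header; the description after the first space is ignored.
print(parse_identifier(">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha"))
```

Headers that do not start with one of the listed tags raise a `ValueError`, which keeps free-form headers from being silently misread.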
https://en.wikipedia.org/wiki/FASTA_format
Extension | Meaning | Notes |
---|---|---|
fasta | generic fasta | Any generic fasta file. Other extensions can be fas, fa, seq, fsa |
fna | fasta nucleic acid | Used generically to specify nucleic acids |
ffn | FASTA nucleotide of gene regions | Contains coding regions for a genome |
faa | fasta amino acid | Contains amino acids. A multiple protein fasta file can have the more specific extension mpfa |
frn | FASTA non-coding RNA | Contains non-coding RNA regions for a genome, in DNA alphabet e.g. tRNA, rRNA |
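
Whatever the extension, these files share the same layout: a header line starting with `>` followed by one or more sequence lines. A minimal reader can be sketched as below; the function name `read_fasta` and the toy records are assumptions made for illustration, and real projects would more likely use an established library.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


# Toy input: one wrapped nucleotide record and one peptide record.
example = io.StringIO(">seq1 toy record\nACGT\nACGT\n>seq2\nMKV\n")
records = list(read_fasta(example))
for name, sequence in records:
    print(name, len(sequence))
```

The generator yields one record at a time, so large files do not have to be loaded into memory at once.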
FASTQ is a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores.
- Likelihood probabilities (p) are very small real numbers, e.g. p = 10^-20
- It is common practice to express them on a logarithmic scale
- Phred scaling is an integer mapping
- Phred Quality Score (Q): Q = -10 log10(p)
- A base-quality value represents the probability that the corresponding base call is incorrect
- Base qualities are ASCII (American Standard Code for Information Interchange) encoded Phred scores
- Two variants have been in use:
  - Sanger, Illumina Casava >1.8 variant
  - Solexa pipeline variant: software delivered with the Illumina Genome Analyzer (GA), Illumina Casava > 1.3 and < 1.8
Phred Score (Q) | Probability of error |
---|---|
3 | ~ 50 % |
10 | 10% |
15 | 3.16% |
20 | 1% |
30 | 0.1% |
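
The relationship behind the table can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names are illustrative, and the ASCII offsets (33 for the Sanger / newer Illumina variant, 64 for the older Illumina variant) are the commonly documented values rather than something stated in the text above.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability p to a Phred score Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def decode_qualities(quality_string, offset=33):
    """Decode an ASCII quality string into integer Phred scores.

    offset=33 is assumed for the Sanger-style variant and offset=64
    for the older Illumina variant.
    """
    return [ord(char) - offset for char in quality_string]


# Reproduce the rows of the table above.
for q in (3, 10, 15, 20, 30):
    print(q, f"{phred_to_error_prob(q):.2%}")
```

Decoding the same character with the wrong offset shifts every score by 31, which is why the encoding variant has to be known before quality values are interpreted.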
https://en.wikipedia.org/wiki/FASTQ_format
Paired-end: This file contains reads generated using Paired-end DNA sequencing.
- Paired-end data comes either as separate files for forward and reverse reads or as a single interleaved file
- Files generated by Illumina sequencers are named as follows: `10_S10_L001_R1_001.fastq.gz` and `10_S10_L001_R2_001.fastq.gz` for separate files, or `10_S10_L001_R1_001.fastq.gz` for interleaved files.
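
When forward and reverse reads are in separate files, the mate file name can be derived from the `_R1_` / `_R2_` tag. The sketch below assumes the naming pattern shown in the example file names; the function name `mate_filename` is hypothetical.

```python
import re

# Matches the read tag and chunk number at the end of the file name,
# e.g. "_R1_001.fastq.gz"; the ".gz" suffix is treated as optional.
READ_TAG = re.compile(r"_R([12])_(\d{3})\.fastq(\.gz)?$")


def mate_filename(name):
    """Return the R2 file name for an R1 file, and vice versa."""
    match = READ_TAG.search(name)
    if match is None:
        raise ValueError(f"not an Illumina-style FASTQ name: {name!r}")
    other = "2" if match.group(1) == "1" else "1"
    # Swap only the read digit and keep everything else unchanged.
    return name[:match.start(1)] + other + name[match.end(1):]


print(mate_filename("10_S10_L001_R1_001.fastq.gz"))
```

Note that an interleaved file carries the same `R1` name as a forward-only file, so the name alone cannot distinguish the two cases.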
- Forward/reverse read indicators: ‘/1’ and ‘/2’, or ‘1:N:0:GCTCATGA+ACTGCATA’ and ‘2:N:0:GCTCATGA+ACTGCATA’, at the end of the header line
- The header line of an Illumina paired-end FASTQ file:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>
Element | Requirements | Description |
---|---|---|
@ | @ | Each sequence identifier line starts with @ |
<instrument> | Characters allowed: a–z, A–Z, 0–9 and underscore | Instrument ID |
<run number> | Numerical | Run number on instrument |
<flowcell ID> | Characters allowed: a–z, A–Z, 0–9 | Flowcell ID |
<lane> | Numerical | Lane number |
<tile> | Numerical | Tile number |
<x-pos> | Numerical | X coordinate of cluster |
<y-pos> | Numerical | Y coordinate of cluster |
<read> | Numerical | Read number: 1 for a single read or read 1 of a pair, 2 for read 2 of a paired-end run |
<is filtered> | Y or N | Y if the read is filtered (did not pass), N otherwise |
<control number> | Numerical | 0 when none of the control bits are on, otherwise it is an even number. On HiSeq X and NextSeq systems, control specification is not performed and this number is always 0 |
<sample number> | Numerical | Sample number from the sample sheet |
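
The header layout described in the table can be captured with a regular expression. This is a minimal sketch: the names `HEADER` and `parse_header` and the example header are assumptions for illustration. The last field is matched loosely because headers in practice may carry an index sequence there (as in ‘1:N:0:GCTCATGA+ACTGCATA’) instead of a sample number.

```python
import re

HEADER = re.compile(
    r"^@(?P<instrument>[A-Za-z0-9_]+)"
    r":(?P<run_number>\d+)"
    r":(?P<flowcell_id>[A-Za-z0-9]+)"
    r":(?P<lane>\d+)"
    r":(?P<tile>\d+)"
    r":(?P<x_pos>\d+)"
    r":(?P<y_pos>\d+)"
    r" (?P<read>\d+)"
    r":(?P<is_filtered>[YN])"
    r":(?P<control_number>\d+)"
    r":(?P<sample_number>\S+)$"  # sample number or index sequence
)

NUMERIC = ("run_number", "lane", "tile", "x_pos", "y_pos", "read", "control_number")


def parse_header(line):
    """Parse an Illumina FASTQ header line into a dict of fields."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {line!r}")
    fields = match.groupdict()
    for key in NUMERIC:
        fields[key] = int(fields[key])
    return fields


# Hypothetical header built to follow the layout above.
fields = parse_header("@M01234:25:H0ABCADXX:1:1101:15589:1332 1:N:0:GCTCATGA+ACTGCATA")
print(fields["instrument"], fields["read"], fields["is_filtered"])
```

The character classes follow the table strictly, so instrument or flowcell IDs containing other characters (such as hyphens) are rejected; loosen the patterns if your data needs it.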
- 454 fastq
More information available here: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#454
- Ion Torrent fastq
More information available here: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#iontorrent
- Older Illumina fastq
Older Illumina sequencers generated FASTQ files with the following header format:
@<machine_id>:<lane>:<tile>:<x_coord>:<y_coord>#<index>/<read>
Ex:
@HWI-ST745_0097:7:1101:1201:1000#0/1
ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCT
+HWI-ST745_0097:7:1101:1201:1000#0/1
HHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHGFHFHHFHFHEHEEGBGGFHFHHGGBHHEEGGBGFGCDGGGDBGF
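
A record like the one above can be read as four lines and its header split into the fields of the older layout. The sketch below assumes unwrapped four-line records; the names `read_fastq`, `OLD_HEADER`, and `parse_old_header` are illustrative, and the sequence and quality lines are truncated to the first 12 characters of the example for brevity.

```python
import io
import re

OLD_HEADER = re.compile(
    r"^@(?P<machine_id>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x_coord>\d+):(?P<y_coord>\d+)#(?P<index>[^/]+)/(?P<read>[12])$"
)


def read_fastq(handle):
    """Yield (header, sequence, quality) from unwrapped four-line records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header, sequence, quality


def parse_old_header(header):
    """Split an older Illumina header into its fields."""
    match = OLD_HEADER.match(header)
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    fields = match.groupdict()
    for key in ("lane", "tile", "x_coord", "y_coord", "read"):
        fields[key] = int(fields[key])
    return fields


text = (
    "@HWI-ST745_0097:7:1101:1201:1000#0/1\n"
    "ACCCTAACCCTA\n"  # truncated from the example above
    "+HWI-ST745_0097:7:1101:1201:1000#0/1\n"
    "HHHHHHHHHHHH\n"  # truncated to the same length
)
records = list(read_fastq(io.StringIO(text)))
print(parse_old_header(records[0][0]))
```

The trailing `/1` or `/2` in this layout plays the same role as the `<read>` field in the newer header format.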
- PacBio fastq
More: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#pacbio-ccs-circular-consensus-se
- Helicos fastq
More: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#helicos-fastq-with-a-fixed-ascii
A versatile data model that can represent very complex data and can include metadata. Read more: https://support.hdfgroup.org/HDF5/Tutor/HDF5Intro.pdf
More:
- http://files.pacb.com/software/instrument/2.0.0/bas.h5%20Reference%20Guide.pdf
- https://www.mathworks.com/help/matlab/import_export/importing-hierarchical-data-format-hdf5-files.html
wget http://files.pacb.com/software/hgap/HGAp_BAS_H5_DATA/HGAp_BAS_H5_DATA/BAC/m120729_040044_42134_c100384402550000001523033010171256_s1_p0.bas.h5
h5ls -r m120729_040044_42134_c100384402550000001523033010171256_s1_p0.bas.h5
Oxford Nanopore (MinION) data files are stored in the HDF5 file format and have the extension .fast5.
https://porecamp.github.io/2016/tutorials/PoreCamp2016-02-MinIONData.pdf
- SRF files: a generic file format for storing DNA sequence data
- Native Illumina QSEQ files: they are named `_qseq.txt`, following the pattern `s_<lane>_<read>_<tile>_qseq.txt`
http://www.bordeaux.inra.fr/live/pgtb/lib/exe/fetch.php?media=casava1.7_user_guide_15011196_a.pdf
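
The QSEQ naming pattern `s_<lane>_<read>_<tile>_qseq.txt` can be decoded with a short regular expression. This is only a sketch: the function name `parse_qseq_name` and the example file name are hypothetical, and all three fields are assumed to be plain digits.

```python
import re

QSEQ_NAME = re.compile(r"^s_(?P<lane>\d+)_(?P<read>\d+)_(?P<tile>\d+)_qseq\.txt$")


def parse_qseq_name(name):
    """Extract lane, read, and tile numbers from a QSEQ file name."""
    match = QSEQ_NAME.match(name)
    if match is None:
        raise ValueError(f"not a QSEQ file name: {name!r}")
    return {key: int(value) for key, value in match.groupdict().items()}


# Hypothetical file name following the pattern.
print(parse_qseq_name("s_1_2_0034_qseq.txt"))
```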