# Data formats for NGS data
When it comes to NGS data, there are many different types of data formats. Here we will have a closer look at some of the most common ones that you are likely to encounter.

## FASTQ
FASTQ is a simple format for raw unaligned sequencing reads. It is an extension of the FASTA file format in that it also includes a quality score for each base. Have a look at the example below:

```
@ERR007731.739 IL16_2979:6:1:9:1684/1
CTTGACGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAGAAATG
+
BBCBCBBBBBBBABBABBBBBBBABBBBBBBBBBBBBBABAAAABBBBB=@>B
@ERR007731.740 IL16_2979:6:1:9:1419/1
AAAAAAAAAGATGTCATCAGCACATCAGAAAAGAAGGCAACTTTAAAACTTTTC
+
BBABBBABABAABABABBABBBAAA>@B@BBAA@4AAA>.>BAA@779:AAA@A
```
The first line, starting with @, holds the read metadata, such as an ID. The second line holds the read itself, the third line starts with a + and can optionally contain the ID again, and the fourth line contains the per-base quality scores.
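
The four-line layout described above can be read with a few lines of code. The following is a minimal Python sketch, assuming that sequences are not wrapped over several lines and that the file contains no blank lines; the names `FastqRecord` and `read_fastq` and the two example reads are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # header line without the leading "@"
    sequence: str  # the read bases
    quality: str   # per-base quality string


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (a blank line also stops the sketch)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical two-read example, not taken from a real sequencing run.
example = io.StringIO(
    "@read1 lane1/1\nACGT\n+\nBBA@\n"
    "@read2 lane1/1\nTTGA\n+read2\nAB=>\n"
)
records = list(read_fastq(example))
for record in records:
    print(record.name, record.sequence, record.quality)
```

For real data, an established parser is usually a better choice than a hand-written one.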

The quality score is encoded in [ASCII characters](https://en.wikipedia.org/wiki/ASCII) with decimal codes 33-126. For example, the ASCII code of “A” is 65, so the corresponding Phred quality score is Q = 65 − 33 = 32. From this we can calculate the probability that the base call is wrong as P = 10^(−Q/10), which for Q = 32 is about 0.0006.

The following simple Perl command will print the quality score value for an ASCII character. Try changing the "A" for another character, for example one from the quality strings above (e.g. @, B or =).

In [None]:
perl -e 'printf "%d\n",ord("A")-33;'
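
The same calculation can be applied to a whole quality string at once. The following is a minimal Python sketch, assuming an ASCII offset of 33; the function names `phred_scores` and `error_probability` are illustrative only.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# A few characters that occur in the quality strings above.
scores = phred_scores("AB=@>")
for char, q in zip("AB=@>", scores):
    print(char, q, f"{error_probability(q):.5f}")
```

Passing a different `offset` lets the same function handle files that use another encoding.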

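Something to beware of is that several quality encodings have been in use, for example Sanger (ASCII offset 33), Solexa and Illumina 1.3+ (both ASCII offset 64), so the same character can mean a different score depending on where the file came from.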

For paired-end sequencing, two FASTQ files are typically produced, one for each read of the pair.

## SAM/BAM
SAM (Sequence Alignment/Map) format was developed by the 1000 Genomes Project group in 2009 and is a unified format for storing read alignments to a reference genome. SAM/BAM format is the accepted standard format for storing NGS sequencing reads, base qualities, associated meta-data and alignments of the data to a reference genome. If no reference genome is available, the data can also be stored unaligned.

A SAM file consists of one record (a single DNA fragment alignment) per line, describing the alignment between the fragment and the reference. Each record has 11 fixed columns followed by optional key:type:value tuples. We will now look more closely at the different parts of the SAM/BAM format.

### Header
Download the SAM/BAM file specification document from http://samtools.github.io/hts-specs. After reading page 4 of the SAM specification, look at the following line from the header of the BAM file:

```
@RG ID:ERR003612 PL:ILLUMINA LB:g1k-sc-NA20538-TOS-1 PI:2000 DS:SRP000540 SM:NA20538 CN:SC
```
__Q1: What does RG stand for?  
Q2: What is the sequencing platform?  
Q3: What is the sequencing centre?  
Q4: What is the lane ID?  
Q5: What is the expected fragment insert size?__  

### SAM fields
1. QNAME Query NAME of the read or the read pair
2. FLAG Bitwise FLAG (pairing, strand, mate strand, etc.)
3. RNAME Reference sequence NAME
4. POS 1-Based leftmost POSition of clipped alignment
5. MAPQ MAPping Quality (Phred-scaled)
6. CIGAR Extended CIGAR string (operations: MIDNSHPX=)
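7. MRNM Mate Reference NaMe (’=’ if same as RNAME); called RNEXT in current versions of the specification
8. MPOS 1-Based leftmost Mate POSition; called PNEXT in current versions of the specification
9. ISIZE Inferred Insert SIZE; called TLEN in current versions of the specification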
10. SEQ Query SEQuence on the same strand as the reference
11. QUAL Query QUALity (ASCII-33=Phred base quality)
12. OTHER Optional fields
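
Because the columns are tab-separated, a SAM alignment line can be split into the fields listed above quite directly. The following is a minimal Python sketch that uses the field names from this list; `SamRecord`, `parse_sam_line` and the example line are hypothetical and only meant to show the layout.

```python
from typing import Dict, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int    # 1-based leftmost position
    mapq: int
    cigar: str
    mrnm: str   # mate reference name
    mpos: int   # mate position
    isize: int  # inferred insert size
    seq: str
    qual: str
    tags: Dict[str, str]  # optional fields, keyed by tag name


def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-separated SAM alignment line into its columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    tags = {}
    for item in fields[11:]:
        tag, _type, value = item.split(":", 2)  # key:type:value
        tags[tag] = value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# Hypothetical alignment line, not taken from a real BAM file.
line = "read1\t99\t1\t1000\t60\t4M\t=\t1200\t204\tACGT\tBBBB\tRG:Z:lane1\tNM:i:0"
record = parse_sam_line(line)
print(record.qname, record.rname, record.pos, record.cigar, record.tags)
```

The sketch keeps all optional values as strings and ignores their declared type.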

![SAM](img/SAM_BAM.png)

### CIGAR string
The CIGAR string provides a compact representation of sequence alignment. Have a look at the table below. It contains the meaning of all possible symbols of a CIGAR string:

|Symbol|Meaning|
|------|-------|
|M|alignment match or mismatch|
|=|sequence match|
|X|sequence mismatch|
|I|insertion to the reference|
|D|deletion from the reference|
|S|soft clipping (clipped sequences present in SEQ)|
|H|hard clipping (clipped sequences NOT present in SEQ)|
|N|skipped region from the reference|
|P|padding (silent deletion from padded reference)|

Below are some examples describing the CIGAR string in more detail.
  
__Example 1:__  
Ref:   ACGTACGTACGTACGT  
Read:  ACGT----ACGTACGA  
Cigar: 4M 4D 8M  

The first four bases in the read are the same as in the reference, so we can represent these as 4M in the CIGAR string. Next come four deleted bases, represented by 4D, followed by seven alignment matches and one alignment mismatch, represented together by 8M.

__Example 2:__  
Ref:  ACGT----ACGTACGT  
Read: ACGTACGTACGTACGT  
Cigar: 4M 4I 8M  

This example is very similar to the previous one, except this time we have an insertion instead of a deletion, represented by 4I.

__Example 3:__  
Ref:  ACTCAGTG--GT  
Read: ACGCA-TGCAGTtagacgt  
Cigar: 5M 1D 2M 2I 2M 7S  

Here we start off with five alignment matches and mismatches, followed by one deletion. Then we have two more alignment matches, two insertions and two more matches. At the end, we have seven soft-clipped bases, 7S. These are clipped sequences that are present in the SEQ (Query SEQuence on the same strand as the reference).
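
One way to check a CIGAR string is to add up how many read bases and how many reference bases it accounts for. The following is a minimal Python sketch of that idea, assuming the usual convention that M, I, S, = and X consume read bases while M, D, N, = and X consume reference bases; the function names are illustrative, and the spaces used in the examples above are removed before parsing.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operations that use up read bases
CONSUMES_REFERENCE = set("MDN=X")  # operations that use up reference bases


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    cigar = cigar.replace(" ", "")
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches characters the pattern skipped over.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def query_length(ops: List[Tuple[int, str]]) -> int:
    """Number of read bases covered by the CIGAR operations."""
    return sum(length for length, op in ops if op in CONSUMES_QUERY)


def reference_length(ops: List[Tuple[int, str]]) -> int:
    """Number of reference bases covered by the CIGAR operations."""
    return sum(length for length, op in ops if op in CONSUMES_REFERENCE)


for cigar in ("4M 4D 8M", "4M 4I 8M", "5M 1D 2M 2I 2M 7S"):
    ops = parse_cigar(cigar)
    print(cigar, "read bases:", query_length(ops),
          "reference bases:", reference_length(ops))
```

For Example 3 this gives 18 read bases, which matches the length of the read including its soft-clipped tail, and 10 reference bases.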

### Flags
|Hex|Dec|Flag|Description|
|---|---|----|-----------|
|0x1|1|PAIRED|paired-end (or multiple-segment) sequencing technology|
|0x2|2|PROPER_PAIR|each segment properly aligned according to the aligner|
|0x4|4|UNMAP|segment unmapped|
|0x8|8|MUNMAP|next segment in the template unmapped|
|0x10|16|REVERSE|SEQ is reverse complemented|
|0x20|32|MREVERSE|SEQ of the next segment in the template is reverse complemented|
|0x40|64|READ1|the first segment in the template|
|0x80|128|READ2|the last segment in the template|
|0x100|256|SECONDARY|secondary alignment|
|0x200|512|QCFAIL|not passing quality controls|
|0x400|1024|DUP|PCR or optical duplicate|
|0x800|2048|SUPPLEMENTARY|supplementary alignment|
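
The FLAG column is the sum of the values of all flags that apply to a read, so it can be decoded by testing each bit in turn. The following is a minimal Python sketch using the values from the table above; the function name `decode_flag` is illustrative.

```python
from typing import List

# Bit values and names as listed in the table above.
FLAG_NAMES = {
    0x1: "PAIRED",
    0x2: "PROPER_PAIR",
    0x4: "UNMAP",
    0x8: "MUNMAP",
    0x10: "REVERSE",
    0x20: "MREVERSE",
    0x40: "READ1",
    0x80: "READ2",
    0x100: "SECONDARY",
    0x200: "QCFAIL",
    0x400: "DUP",
    0x800: "SUPPLEMENTARY",
}


def decode_flag(flag: int) -> List[str]:
    """Return the names of all flag bits that are set in a FLAG value."""
    return [name for bit, name in FLAG_NAMES.items() if flag & bit]


print(decode_flag(99))   # ['PAIRED', 'PROPER_PAIR', 'MREVERSE', 'READ1']
print(decode_flag(147))  # ['PAIRED', 'PROPER_PAIR', 'REVERSE', 'READ2']
```

The values 99 and 147 were chosen as an example of the two reads of a properly paired fragment: 99 = 1 + 2 + 32 + 64 and 147 = 1 + 2 + 16 + 128.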

### Optional tags
Each lane has a unique RG tag that contains meta-data for the lane. The RG tags are:

* ID: SRR/ERR number
* PL: Sequencing platform
* PU: Run name
* LB: Library name
* PI: Insert fragment size
* SM: Individual
* CN: Sequencing center
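
An @RG header line can be turned into a lookup table of these tags. The following is a minimal Python sketch, assuming the fields are separated by tabs as in a real SAM header; the function name `parse_read_group` is illustrative, and the example reuses the header line shown in the Header section.

```python
from typing import Dict


def parse_read_group(header_line: str) -> Dict[str, str]:
    """Turn a tab-separated @RG header line into a tag-to-value dictionary."""
    fields = header_line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    # Split on the first colon only, so values may contain colons themselves.
    return dict(field.split(":", 1) for field in fields[1:])


rg_line = "\t".join([
    "@RG", "ID:ERR003612", "PL:ILLUMINA", "LB:g1k-sc-NA20538-TOS-1",
    "PI:2000", "DS:SRP000540", "SM:NA20538", "CN:SC",
])
read_group = parse_read_group(rg_line)
for tag, value in read_group.items():
    print(tag, value)
```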

![metadata](img/metadata_sam.png)



### BAM
BAM (Binary Alignment/Map) format is a binary version of SAM, developed for fast processing and random access. To achieve this, BGZF (Block GZIP) compression is used, which allows the file to be indexed. The key features of BAM are:

* Can store alignments from most mappers
* Supports multiple sequencing technologies
* Supports indexing for quick retrieval/viewing
* Compact size (e.g. 112Gbp Illumina = 116GB disk space)
* Reads can be grouped into logical groups e.g. lanes, libraries, samples
* Widely supported by variant calling packages and viewers

Samtools comprises a set of programs for interacting with SAM/BAM files. Using the samtools view command, print the header of the BAM file:

In [None]:
samtools view -H NA20538.bam

__Q6: What version of the human assembly was used to perform the alignments?  
Q7: How many lanes are in this BAM file?  
Q8: What programs were used to create this BAM file?  
Q9: What version of bwa was used to align the reads?__  

Now print the first few alignments in the BAM file and have a look at the first read:

In [None]:
samtools view NA20538.bam | head

__Q10: What is the name of the first read?  
Q11: What position does the alignment of the read start at?  
Q12: What is the mapping quality of the first read?__

## CRAM
Even though BAM files are compressed, they are still very large. Typically they use 1.5-2 bytes per base pair, and while disk capacity keeps improving, it is being far outstripped by the growth in sequencing output.

![compression](img/compression_cram.png)

BAM stores all of the data, every read base and base quality, and uses a single conventional compression technique for all types of data. To further decrease the size, CRAM was introduced. It uses three important concepts:

* Reference based compression
* Controlled loss of quality information
* Different compression methods to suit the type of data, e.g. base qualities vs. metadata vs. extra tags

![CRAM](img/CRAM_format.png)
![CRAM2](img/CRAM_format2.png)
![CRAM_structure](img/CRAM_structure.png)

In lossless mode, a CRAM file is about 60% of the size of the corresponding BAM file. Archives and sequencing centers are now moving from BAM to CRAM. Support for CRAM was added to Samtools/HTSlib in 2014, and Picard/GATK have since added support as well.



## VCF/BCF
VCF is a file format for storing variation data. Its key features are:

* Tab-delimited text, parsable by standard UNIX commands
* Flexible and user-extensible
* Compressed with BGZF (bgzip), indexed with TBI or CSI (tabix)
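
Because VCF is tab-delimited text, a data line can be split with ordinary string operations. The following is a minimal Python sketch, assuming the standard order of the eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); the names `VcfRecord` and `parse_vcf_line` and the example record are hypothetical.

```python
from typing import Dict, List, NamedTuple, Union


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alt: List[str]                    # comma-separated alternate alleles
    qual: str
    filters: str
    info: Dict[str, Union[str, bool]]
    genotype_columns: List[str]       # FORMAT plus one column per sample


def parse_vcf_line(line: str) -> VcfRecord:
    """Split one tab-separated VCF data line into its columns."""
    if line.startswith("#"):
        raise ValueError("header lines are not variant records")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF record needs at least 8 columns")
    info = {}
    if fields[7] != ".":
        for item in fields[7].split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else True  # flags have no value
    return VcfRecord(fields[0], int(fields[1]), fields[2], fields[3],
                     fields[4].split(","), fields[5], fields[6],
                     info, fields[8:])


# Hypothetical record, not taken from a real call set.
line = "1\t10583\t.\tG\tA,T\t50\tPASS\tDP=14;DB\tGT\t0/1"
record = parse_vcf_line(line)
print(record.chrom, record.pos, record.ref, record.alt, record.info)
```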

![VCF_format](img/VCF1.png)

VCFs can be very big. For example, a compressed VCF with 3781 samples of human data takes:

* 54 GB for chromosome 1
* 680 GB for the whole genome

VCFs can also be slow to parse:

* text conversion is slow
* main bottleneck: FORMAT fields

![VCF_format2](img/VCF2.png)

BCF:

* binary representation of VCF
* fields rearranged for fast access

![VCF_format3](img/VCF3.png)

Often it is not enough to know the variant sites only:

* was a site dropped because of a reference call or because of missing data?
* we need evidence for both variant and non-variant positions in the genome

gVCF:

* blocks of reference-only sites can be represented in a single record using the INFO/END tag
* symbolic alleles <\*> for incremental calling
* raw, “callable” gVCF
* calculate genotype likelihoods only once (an expensive step)
* then call incrementally as more samples come in
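
To illustrate how the INFO/END tag lets one record stand for a whole block, the following minimal Python sketch computes how many reference positions a record covers. It assumes that a record spans POS to END inclusive when END is present, and it treats records without END as covering a single position, which ignores multi-base REF alleles. The function names and the example values are hypothetical.

```python
from typing import Dict


def info_to_dict(info: str) -> Dict[str, str]:
    """Split an INFO column into key/value pairs; flags map to an empty string."""
    if info == ".":
        return {}
    pairs = (item.partition("=") for item in info.split(";"))
    return {key: value for key, _sep, value in pairs}


def covered_positions(pos: int, info: str) -> int:
    """Number of reference positions a record spans, using INFO/END if present."""
    end = int(info_to_dict(info).get("END", pos))
    return end - pos + 1


# Hypothetical (POS, INFO) pairs.
print(covered_positions(10001, "END=10250"))  # a reference-only block
print(covered_positions(10251, "DP=30"))      # a single site
```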

![gVCF_format](img/gVCF.png)
