# Data formats for NGS data
When it comes to NGS data, there are many different types of data formats. Here we will have a closer look at some of the most common ones that you are likely to encounter.

## FASTQ
FASTQ is a simple format for raw unaligned sequencing reads. It extends the FASTA file format by adding a quality score for each base. For paired-end sequencing, two FASTQ files are typically produced, one for each read of the pair. Have a look at the example below, containing two reads:
  
```
@ERR007731.739 IL16_2979:6:1:9:1684/1
CTTGACGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAGAAATG
+
BBCBCBBBBBBBABBABBBBBBBABBBBBBBBBBBBBBABAAAABBBBB=@>B
@ERR007731.740 IL16_2979:6:1:9:1419/1
AAAAAAAAAGATGTCATCAGCACATCAGAAAAGAAGGCAACTTTAAAACTTTTC
+
BBABBBABABAABABABBABBBAAA>@B@BBAA@4AAA>.>BAA@779:AAA@A
```
   
The first line, starting with `@`, holds the read metadata, such as the read ID, and the second line holds the read sequence itself. The third line starts with a `+` and can optionally contain the ID again, and the fourth line contains the per-base quality scores.
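
The four-line layout described above can be checked with a short awk filter. This is a minimal sketch that assumes every record occupies exactly four lines (no line-wrapped sequences); the function name `fastq_summary` and the toy record are made up for illustration.

```shell
# Print the ID, read length and quality-string length of each FASTQ record.
# Assumes the simple four-line layout (no line-wrapped sequences).
fastq_summary() {
    awk '
        NR % 4 == 1 { id = substr($1, 2) }               # header line: drop the leading "@"
        NR % 4 == 2 { seq_len = length($0) }             # sequence line
        NR % 4 == 0 { print id, seq_len, length($0) }    # quality line completes the record
    '
}

# Hypothetical toy record, used only for illustration
printf '@read1 example/1\nACGTACGT\n+\nBBBBAAAA\n' | fastq_summary
# prints: read1 8 8
```

In a well-formed record the last two numbers are equal, since there is one quality character per base.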

Each quality score is encoded as a single [ASCII character](https://en.wikipedia.org/wiki/ASCII) with a decimal code in the range 33-126, and the score is obtained by subtracting 33 from that code. For example, the ASCII code of “A” is 65, so the corresponding quality is:
   
```   
Q = 65 − 33 = 32
```
   
The value Q is the [Phred quality score](https://en.wikipedia.org/wiki/Phred_quality_score). From it we can calculate the probability P that the base call is wrong:
   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; P = 10<sup>−Q/10</sup>   
   
The following simple Perl command will print the quality score value for an ASCII character. Try changing the "A" to another character, for example one from the quality strings above (e.g. @, = or B).

In [None]:
perl -e 'printf "%d\n",ord("A")-33;'
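
The same calculation can be extended to the error probability P = 10<sup>−Q/10</sup>. The sketch below assumes the offset of 33 used in this section; the function name `qual_to_prob` is made up for illustration.

```shell
# Convert one quality character to its Phred score Q and error probability P.
# Assumes the offset of 33 described above.
qual_to_prob() {
    q=$(( $(printf '%d' "'$1") - 33 ))     # ASCII code of the character minus 33
    awk -v q="$q" 'BEGIN { printf "Q=%d P=%.6f\n", q, 10 ^ (-q / 10) }'
}

qual_to_prob A
# prints: Q=32 P=0.000631
```

A score of 32 therefore corresponds to roughly one wrong base call in 1,600.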

Be aware that several quality score encodings have been in use over the years, for example Sanger (Phred+33), Solexa (Solexa+64) and Illumina 1.3+ (Phred+64), so the offset of 33 does not apply to every FASTQ file.
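
One rough way to tell the encodings apart is to look at the lowest ASCII code that occurs in the quality lines: the Solexa and Illumina 1.3+ encodings use an offset of 64 and never produce codes below 59, so any lower code points to an offset of 33. The sketch below is only a heuristic under that assumption, and the function name `guess_offset` is made up for illustration.

```shell
# Guess the quality offset of a four-line FASTQ stream read from standard input.
# Heuristic only: codes below 59 cannot occur in the +64 encodings.
guess_offset() {
    awk 'NR % 4 == 0' | tr -d '\n' | od -A n -t u1 -v | awk '
        BEGIN { min = 256 }
        { for (i = 1; i <= NF; i++) if ($i + 0 < min) min = $i + 0 }
        END { print (min < 59 ? "Phred+33" : "Phred+64 or ambiguous") }
    '
}

# Hypothetical toy record, used only for illustration
printf '@read1\nACGT\n+\nBB4.\n' | guess_offset
# prints: Phred+33
```

A file containing only high codes remains ambiguous, because very high-quality Phred+33 data can look the same.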

## SAM/BAM
[SAM (Sequence Alignment/Map)](https://samtools.github.io/hts-specs/SAMv1.pdf) format was developed by the 1000 Genomes Project group in 2009 and is a unified format for storing read alignments to a reference genome. SAM/BAM is the accepted standard format for storing NGS reads, base qualities, associated metadata and alignments of the data to a reference genome. If no reference genome is available, the data can also be stored unaligned.

The files consist of an optional header section and an alignment section. The alignment section contains one record (a single DNA fragment alignment) per line, describing the alignment between fragment and reference. Each record has 11 fixed columns followed by optional key:type:value tuples. Let us have a closer look at the different parts of the SAM/BAM format.

### Header
Each line in the SAM header begins with an `@`, followed by a two-letter header record type code defined in the [SAM/BAM file specification document](https://samtools.github.io/hts-specs/SAMv1.pdf). Have a look at this document to familiarise yourself with the SAM and BAM format.  
  
With the help of section 1.3 of the SAM specification, look at the following line from the header of a BAM file and answer the questions below:

```
@RG ID:ERR003612 PL:ILLUMINA LB:g1k-sc-NA20538-TOS-1 PI:2000 DS:SRP000540 SM:NA20538 CN:SC
```
   
__Q1: What does RG stand for?  
Q2: What is the sequencing platform?  
Q3: What is the sequencing centre?  
Q4: What is the lane ID?  
Q5: What is the expected fragment insert size?__  

### SAM fields
The alignment section of SAM files contains one line per read fragment alignment, which in turn contains the columns listed below. The first 11 of these are mandatory.

1. QNAME Query NAME of the read or the read pair
2. FLAG Bitwise FLAG (pairing, strand, mate strand, etc.)
3. RNAME Reference sequence NAME
4. POS 1-Based leftmost POSition of clipped alignment
5. MAPQ MAPping Quality (Phred-scaled)
6. CIGAR Extended CIGAR string (operations: MIDNSHPX=)
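7. MRNM Mate Reference NaMe (’=’ if same as RNAME); called RNEXT in the current specification
8. MPOS 1-Based leftmost Mate POSition; called PNEXT in the current specification
9. ISIZE Inferred Insert SIZE; called TLEN in the current specification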
10. SEQ Query SEQuence on the same strand as the reference
11. QUAL Query QUALity (ASCII-33=Phred base quality)
12. OTHER Optional fields
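
Because the columns are tab-separated, a record can be split and labelled with awk. The sketch below uses the column names from the list above; the function name `sam_fields` and the example record are made up for illustration and do not come from the tutorial data.

```shell
# Label the 11 mandatory columns of tab-separated SAM records read from standard input.
sam_fields() {
    awk -F '\t' '
        BEGIN { n = split("QNAME FLAG RNAME POS MAPQ CIGAR MRNM MPOS ISIZE SEQ QUAL", name, " ") }
        { for (i = 1; i <= n; i++) printf "%-6s %s\n", name[i], $i }
    '
}

# Hypothetical record, made up for illustration only
printf 'read1\t99\t20\t1000\t60\t8M\t=\t1200\t208\tACGTACGT\tBBBBBBBB\n' | sam_fields
```

Any optional fields after column 11 are ignored by this sketch.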

The image below provides a visual guide to some of the columns of the SAM format.

<img src="img/SAM_BAM.png" alt="SAM" style="width: 500px;"/>

### CIGAR string
The CIGAR string provides a compact representation of sequence alignment. Have a look at the table below. It contains the meaning of all different symbols of a CIGAR string:

|Symbol|Meaning|
|------|-------|
|M|alignment match or mismatch|
|=|sequence match|
|X|sequence mismatch|
|I|insertion to the reference|
|D|deletion from the reference|
|S|soft clipping (clipped sequences present in SEQ)|
|H|hard clipping (clipped sequences NOT present in SEQ)|
|N|skipped region from the reference|
|P|padding (silent deletion from padded reference)|

Below are some examples describing the CIGAR string in more detail.
  
__Example 1:__  
Ref:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ACGTACGTACGTACGT  
Read:&nbsp;&nbsp;ACGT-&nbsp;-&nbsp;-&nbsp;-&nbsp;ACGTACGA  
Cigar: 4M 4D 8M  

The first four bases in the read are the same as in the reference, so we can represent these as 4M in the CIGAR string. Next come four deleted bases, represented by 4D, followed by seven alignment matches and one alignment mismatch, represented by 8M. Note that the mismatch at position 16 is included in 8M. This is because the base still aligns to the reference, even though it differs from it.

__Example 2:__  
Ref:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ACGT-&nbsp;-&nbsp;-&nbsp;-&nbsp;ACGTACGT  
Read:&nbsp;&nbsp;ACGTACGTACGTACGT  
Cigar: 4M 4I 8M  

This example is very similar to the previous one, except this time we have an insertion instead of a deletion, represented by 4I.

__Example 3:__  
Ref:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ACTCAGTG-&nbsp;-&nbsp;GT  
Read:&nbsp;&nbsp;ACGCA-&nbsp;TGCAGTtagacgt  
Cigar: 5M 1D 2M 2I 2M 7S  

Here we start off with five alignment matches and mismatches, followed by one deletion. Then we have two more alignment matches, two insertions and two more matches. At the end, we have seven soft-clipped bases, 7S. These are clipped sequences that are still present in the SEQ (Query SEQuence on the same strand as the reference).
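
The examples above can be checked by adding up the operation lengths. In the sketch below, M, =, X, D and N are counted as reference bases, and M, =, X, I and S as bases present in the read, following the table of symbols. The function name `cigar_lengths` is made up for illustration, and the CIGAR string is written without spaces, as it appears in a SAM file.

```shell
# Sum the CIGAR operation lengths that cover the reference and the read.
cigar_lengths() {
    printf '%s\n' "$1" | awk '{
        ref = 0; seq = 0; cigar = $0
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            len = substr(cigar, 1, RLENGTH - 1) + 0    # numeric part of the operation
            op  = substr(cigar, RLENGTH, 1)            # operation symbol
            if (op ~ /[M=XDN]/) ref += len             # advances along the reference
            if (op ~ /[M=XIS]/) seq += len             # bases present in SEQ
            cigar = substr(cigar, RLENGTH + 1)
        }
        print "reference bases:", ref, "read bases:", seq
    }'
}

cigar_lengths 4M4D8M
# prints: reference bases: 16 read bases: 12
cigar_lengths 5M1D2M2I2M7S
# prints: reference bases: 10 read bases: 18
```

The totals agree with Example 1 (16 reference bases, 12 read bases) and Example 3 (10 reference bases, 18 read bases).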

### Flags
The second column for a read alignment in a SAM file contains a combination of bitwise FLAGs describing the alignment. The following table contains the information you can get from the bitwise FLAGs:
   
|Hex|Dec|Flag|Description|
|---|---|----|-----------|
|0x1|1|PAIRED|paired-end (or multiple-segment) sequencing technology|
|0x2|2|PROPER_PAIR|each segment properly aligned according to the aligner|
|0x4|4|UNMAP|segment unmapped|
|0x8|8|MUNMAP|next segment in the template unmapped|
|0x10|16|REVERSE|SEQ is reverse complemented|
|0x20|32|MREVERSE|SEQ of the next segment in the template is reverse complemented|
|0x40|64|READ1|the first segment in the template|
|0x80|128|READ2|the last segment in the template|
|0x100|256|SECONDARY|secondary alignment|
|0x200|512|QCFAIL|not passing quality controls|
|0x400|1024|DUP|PCR or optical duplicate|
|0x800|2048|SUPPLEMENTARY|supplementary alignment|

For example, if you have an alignment with FLAG set to 113, the only combination of decimal codes that adds up to this value is `64 + 32 + 16 + 1`, so we know that these four flags (READ1, MREVERSE, REVERSE and PAIRED) apply to the alignment.
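
Decoding a FLAG value amounts to testing each bit in turn. The following is a minimal sketch using shell arithmetic and the flag names from the table above; the function name `decode_flag` is made up for illustration.

```shell
# List the names of all flags set in a decimal FLAG value.
decode_flag() {
    flag=$1
    names="PAIRED PROPER_PAIR UNMAP MUNMAP REVERSE MREVERSE READ1 READ2 SECONDARY QCFAIL DUP SUPPLEMENTARY"
    bit=1
    out=""
    for name in $names; do
        # Test the current bit, then move on to the next power of two
        if [ $(( flag & bit )) -ne 0 ]; then
            out="$out $name"
        fi
        bit=$(( bit * 2 ))
    done
    echo "${out# }"
}

decode_flag 113
# prints: PAIRED REVERSE MREVERSE READ1
```

Samtools also provides a `flags` subcommand for this kind of conversion.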


### Optional tags
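Each alignment line can carry additional information in optional TAGs. For example, the RG tag assigns a read to a read group, typically one per lane, and the metadata for that read group is stored in the matching `@RG` header line. Some common fields of the `@RG` line are: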

* ID: SRR/ERR number
* PL: Sequencing platform
* PU: Run name
* LB: Library name
* PI: Predicted median insert size
* SM: Individual
* CN: Sequencing center

For further information about TAGs, see the [SAM/BAM file specification document](https://samtools.github.io/hts-specs/SAMv1.pdf).
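
On an alignment line, each optional field is written as a key:type:value tuple. The sketch below splits one such field into its three parts. The function name `parse_tag` is made up for illustration, and the example value simply reuses the read group ID from the header line shown earlier.

```shell
# Split one optional SAM field into its tag, type and value.
# The value is taken as everything after the second colon, since it may itself contain colons.
parse_tag() {
    printf '%s\n' "$1" | awk '{
        tag = substr($0, 1, 2); type = substr($0, 4, 1); value = substr($0, 6)
        print "tag=" tag, "type=" type, "value=" value
    }'
}

parse_tag RG:Z:ERR003612
# prints: tag=RG type=Z value=ERR003612
```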

### BAM
BAM (Binary Alignment/Map) format is a binary version of SAM, developed for fast processing and random access. To achieve this, BAM files are compressed with BGZF (Block GZIP), a block-based compression scheme that allows the file to be indexed. The key features of BAM are:

* Can store alignments from most mappers
* Supports multiple sequencing technologies
* Supports indexing for quick retrieval/viewing
* Compact size (e.g. 112Gbp Illumina = 116GB disk space)
* Reads can be grouped into logical groups e.g. lanes, libraries, samples
* Widely supported by variant calling packages and viewers
   
Samtools comprises a set of programs for interacting with SAM and BAM files. Using the samtools view command, print the header of the BAM file:

In [None]:
samtools view -H data/NA20538.bam

__Q6: What version of the human assembly was used to perform the alignments?  
Q7: How many lanes are in this BAM file?  
Q8: What programs were used to create this BAM file?  
Q9: What version of bwa was used to align the reads?__  

Now have a look at the first few reads of the BAM file (`head` prints the first ten lines):

In [None]:
samtools view data/NA20538.bam | head

__Q10: What is the name of the first read?  
Q11: What position does the alignment of the read start at?  
Q12: What is the mapping quality of the first read?__

## CRAM
Even though BAM files are compressed, they are still very large. Typically they use 1.5-2 bytes per base pair, and while disk capacity is ever improving, increases in disk capacity are being far outstripped by the growth in sequencing output.

<img src="img/compression_cram.png" alt="CRAM_Compr" style="width: 700px;"/>

The BAM format stores all of the data, every read base and base quality, and uses a single conventional compression technique for all types of data. To further decrease the size, CRAM was introduced. It uses three important concepts:

* Reference based compression
* Controlled loss of quality information
* Different compression methods to suit the type of data, e.g. base qualities vs. metadata vs. extra tags

The figure below displays how reference based compression works. Instead of saving all the bases of all the reads, only the nucleotides that differ from the reference, and their positions, are kept.

<img src="img/CRAM_format.png" alt="CRAM" style="width: 500px;"/>
<img src="img/CRAM_format2.png" alt="CRAM2" style="width: 500px;"/>
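
The idea behind reference based compression can be illustrated with a toy example that lists only the positions where an aligned read differs from the reference. This is a sketch of the concept, not of the actual CRAM encoding: it assumes a gapless alignment, and the function name `read_diffs` and its arguments (reference sequence, 1-based start position, read sequence) are made up for illustration.

```shell
# List the positions where a read differs from the reference it is aligned to.
# Toy illustration only: assumes a gapless alignment starting at the given 1-based position.
read_diffs() {
    awk -v ref="$1" -v pos="$2" -v seq="$3" 'BEGIN {
        for (i = 1; i <= length(seq); i++) {
            r = substr(ref, pos + i - 1, 1)    # reference base under this read base
            b = substr(seq, i, 1)              # read base
            if (b != r) printf "%d:%s>%s\n", pos + i - 1, r, b
        }
    }'
}

read_diffs ACGTACGTACGTACGT 9 ACGTACGA
# prints: 16:T>A
```

For this eight-base read, the start position and a single difference are enough to reconstruct the sequence from the reference.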

In lossless mode, a CRAM file is around 60% of the size of the equivalent BAM file. Archives and sequencing centers are therefore moving from BAM to CRAM, and support for CRAM was added to Samtools/HTSlib in 2014.

## VCF/BCF
The VCF file format and its binary version BCF were introduced to store variation data. VCF consists of tab-delimited text and can be parsed with standard UNIX commands, which makes it flexible and user-friendly. The figure below provides an overview of the different components of a VCF file:

<img src="img/VCF1.png" alt="VCF_format" style="width: 700px;"/>
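
As an example of parsing VCF with standard UNIX commands, the sketch below skips the header lines (which start with `#`) and prints the position and alleles of each record. The function name `vcf_sites` and the two-record VCF are made up for illustration.

```shell
# Print "CHROM:POS REF>ALT" for every record of a VCF stream, skipping header lines.
vcf_sites() {
    awk -F '\t' '!/^#/ { print $1 ":" $2, $4 ">" $5 }'
}

# Hypothetical two-record VCF, made up for illustration only
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n20\t100\t.\tA\tG\t.\t.\t.\n20\t250\t.\tC\tT\t.\t.\t.\n' | vcf_sites
# prints:
# 20:100 A>G
# 20:250 C>T
```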

VCF can be compressed with BGZF (bgzip) and indexed with TBI or CSI (tabix), but even when compressed it can still be very large. For example, a compressed VCF with 3781 samples of human data will be 54 GB for chromosome 1, and 680 GB for the whole genome.

VCFs can also be slow to parse, as text conversion is slow. The main bottleneck is the per-sample "FORMAT" fields. For this reason the BCF format, a binary representation of VCF, was developed. In BCF files the fields are rearranged for fast access. The following images show the process of converting a VCF file into a BCF file.

<img src="img/VCF2.png" alt="VCF_format" style="width: 800px;"/>
<img src="img/VCF3.png" alt="VCF_format" style="width: 800px;"/>

The official specification is available from http://samtools.github.io/hts-specs.

Bcftools comprises a set of programs for interacting with VCF/BCF files. You can use bcftools to convert between VCF and BCF and to view or extract records from a region.
   
Using the bcftools view command, print the header of the BCF file:

In [None]:
bcftools view -h data/1kg.bcf

__Q13: What version of the human assembly do the coordinates refer to?__  

Similarly to BAM, VCF/BCF supports random access, that is, fast retrieval from a given region. For this, the file must be indexed:

In [None]:
bcftools index data/1kg.bcf

Now extract all records from the region 20:24042765-24043073:

In [None]:
bcftools view -H -r 20:24042765-24043073 data/1kg.bcf

The versatile bcftools query command can be used to extract any VCF field. Combined with standard UNIX commands, this gives a powerful tool for quick querying of VCFs. Try to answer the following questions with the help of the [manual page](http://samtools.github.io/bcftools/bcftools.html). You can use the blank code cell below to try out your commands.

__Q14: How many samples are in the BCF? (Hint: use the -l option)   
Q15: What is the genotype of the sample HG00107 at the position 20:24019472? (Hint: use the combination of -r, -s, and -f '[ %TGT]\n' options)   
Q16: How many positions are there with more than 10 alternate alleles (see the INFO/AC tag)? (Hint: use the -i filtering option)   
Q17: List all positions where HG00107 has a non-reference genotype and the read depth is greater than 10.__   

### gVCF
Often it is not enough to know the variant sites only. For instance, you will not know whether a site is absent because of a reference call or because of missing data. We sometimes need evidence for both variant and non-variant positions in the genome. In gVCF format, blocks of reference-only sites can be represented in a single record using the "INFO/END" tag. Symbolic alleles (<\*>) are used for incremental calling:

<img src="img/gVCF.png" alt="gVCF_format" style="width: 700px;"/>
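
The number of sites covered by a reference block follows from its start position and its END value, as END − POS + 1. The sketch below applies this to records that carry an END tag in the INFO column. The function name `gvcf_block_lengths` and the example record are made up for illustration.

```shell
# Print CHROM, POS, END and the number of sites (END - POS + 1) for each gVCF reference block.
gvcf_block_lengths() {
    awk -F '\t' '
        /^#/ { next }                                # skip header lines
        match($8, /(^|;)END=[0-9]+/) {               # INFO column carries an END tag
            end = substr($8, RSTART, RLENGTH)
            sub(/.*END=/, "", end)                   # keep only the number
            print $1, $2, end, end - $2 + 1
        }
    '
}

# Hypothetical record, made up for illustration only
printf '20\t1000\t.\tA\t<*>\t.\t.\tEND=1099\tGT\t0/0\n' | gvcf_block_lengths
# prints: 20 1000 1099 100
```

Records without an END tag, such as ordinary variant sites, are skipped by this sketch.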

The answers to the questions on this page can be found [here](answers.ipynb).   

Now continue to the next section of the tutorial: [File conversion](conversion.ipynb).   
You can also return to the [index page](index.ipynb).