# QC assessment of NGS data

Biases in sequencing   
* Base calling accuracy
* Read cycle vs. base content
* GC vs. depth
* Indel ratio

Biases in mapping   

Genotype checking   
* Sample swaps
* Contamination

## Base quality
Sequencing by synthesis: dephasing
* growing sequences in a cluster gradually desynchronize
* the error rate therefore increases with cycle number, i.e. towards the end of the read

Calculate the average quality at each position across all reads
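
As a rough illustration of the per-position average, the sketch below computes the mean Phred quality for each cycle directly from a FASTQ file. It assumes 4-line FASTQ records and Phred+33 quality encoding; the function name `mean_qual_per_cycle` is made up for this example and is not part of any tool.

```shell
# mean_qual_per_cycle FASTQ
# Prints "cycle<TAB>mean quality" for every read position.
# Assumptions: 4 lines per record, Phred+33 encoded qualities.
mean_qual_per_cycle() {
    awk '
        BEGIN {
            # Lookup table: printable ASCII character -> character code
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
        }
        NR % 4 == 0 {                      # every 4th line is the quality string
            n = length($0)
            if (n > max) max = n
            for (c = 1; c <= n; c++) {
                sum[c] += ord[substr($0, c, 1)] - 33
                cnt[c]++
            }
        }
        END {
            for (c = 1; c <= max; c++) printf "%d\t%.2f\n", c, sum[c] / cnt[c]
        }
    ' "$1"
}
```

Calling `mean_qual_per_cycle reads.fastq` on a file of your own gives a two-column table that can be plotted to look for the expected drop in quality towards the end of the read.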

![Base Quality](img/base_qual.png)

## Base calling errors
![Base Calling Errors](img/base_calling_errors.png)
_Base-calling for next-generation sequencing platforms_, doi: 10.1093/bib/bbq077

## Base quality
![Base Quality](img/base_qual2.png)

## Mismatches per cycle
Mismatches in aligned reads (requires reference sequence)
* detect cycle-specific errors
* base qualities are informative!

![Mismatch](img/mismatches.png)

## GC bias
GC- and AT-rich regions are more difficult to amplify
* compare the GC content against the expected distribution (reference sequence)
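
One simple way to obtain the observed distribution is to tabulate the GC percentage of every read. The sketch below does this for a FASTQ file; it assumes 4-line records, counts every base (including N) towards the read length, and uses a hypothetical function name `gc_histogram`. The expected distribution from the reference would have to be computed separately.

```shell
# gc_histogram FASTQ
# Prints "GC percent<TAB>number of reads", one line per observed GC percentage.
# Assumptions: 4 lines per record; N bases count towards the read length.
gc_histogram() {
    awk '
        NR % 4 == 2 {                        # second line of each record is the sequence
            seq = toupper($0)
            len = length(seq)
            if (len == 0) next
            gc = gsub(/[GC]/, "", seq)       # gsub returns the number of replacements
            hist[int(100 * gc / len + 0.5)]++
        }
        END {
            for (p = 0; p <= 100; p++)
                if (p in hist) printf "%d\t%d\n", p, hist[p]
        }
    ' "$1"
}
```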

![GC_Bias](img/gc_bias.png)

## GC content by cycle
Was the adapter sequence trimmed?

![GC_cycle](img/gc_cycle.png)

## Fragment size
Paired-end sequencing: the size of DNA fragments matters
![Fragment size](img/fragment_size.png)

## Quiz
__Q1: This is 100bp paired-end sequencing. Can you spot any problems?__
![quiz](img/qc_quiz.png)

## Insertions / Deletions per cycle
False indels
* air bubbles in the flow cell can manifest as false indels

![Indels](img/indels.png)

## Auto QC tests
A suggestion for human data:  
Minimum percentage of mapped bases: 90%  
Maximum error rate: 0.02%  
Maximum percentage of duplicate reads: 5%  
Minimum percentage of mapped reads that are properly paired: 80%  
Maximum percentage of duplicated bases due to overlapping read pairs: 4%  
Maximum in/del ratio: 0.82  
Minimum in/del ratio: 0.68  
Maximum indels per cycle, factor above median: 8  
Minimum percentage of reads within 25% of the main insert-size peak: 80%  
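
A few of these thresholds can be checked mechanically from the summary-number (SN) lines of a samtools stats file. The sketch below covers only three of them and rests on several assumptions: the SN labels are the ones found in recent samtools versions and may differ in yours, the choice of numerator and denominator for each percentage is one interpretation of the list above, and the function names `sn_value`, `check` and `auto_qc` are invented for this example.

```shell
# sn_value FILE LABEL: print the value of one summary-number (SN) line
sn_value() {
    awk -F '\t' -v key="$2:" '$1 == "SN" && $2 == key { print $3 }' "$1"
}

# check NAME NUMERATOR DENOMINATOR min|max THRESHOLD_PERCENT
# Prints PASS or FAIL, the test name and the observed percentage.
check() {
    awk -v name="$1" -v n="$2" -v d="$3" -v kind="$4" -v t="$5" 'BEGIN {
        p = (d > 0) ? 100 * n / d : 0
        ok = (kind == "min") ? (p >= t) : (p <= t)
        printf "%s\t%s\t%.2f%%\n", (ok ? "PASS" : "FAIL"), name, p
    }'
}

# auto_qc STATS_FILE: apply three of the suggested thresholds
auto_qc() {
    f=$1
    check "mapped bases" \
        "$(sn_value "$f" "bases mapped (cigar)")" "$(sn_value "$f" "total length")" min 90
    check "duplicate reads" \
        "$(sn_value "$f" "reads duplicated")" "$(sn_value "$f" "raw total sequences")" max 5
    check "properly paired" \
        "$(sn_value "$f" "reads properly paired")" "$(sn_value "$f" "reads mapped")" min 80
}
```

The remaining tests (error rate, in/del ratio, indels per cycle, insert size peak) need other sections of the stats file and are left out here.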

![Auto QC](img/auto_qc.png)

## Detecting sample swaps
Check the identity against a known set of variants

![Swap](img/swap.png)


## Generate QC stats
Now let's try this out! We will generate QC stats for two lanes of Illumina paired-end sequencing data from yeast. We will use the bwa mapper to align the data to the Saccharomyces cerevisiae genome (ftp://ftp.ensembl.org/pub/current_fasta/saccharomyces_cerevisiae/dna) and samtools stats to generate the stats.

Read pairs are usually stored in two separate FASTQ files so that the n-th read in the first file and the n-th read in the second file constitute a read pair. Can you devise a quick sanity check that the reads in these two files indeed form pairs? The two files must have the same number of lines, and the read names usually indicate whether two reads belong to the same pair. The files are located at:  
  
data/lane1/s_7_1.fastq   
data/lane1/s_7_2.fastq  
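
One possible sanity check, sketched here under a few assumptions: records span exactly 4 lines, read names may carry a `/1` or `/2` mate suffix or a comment after the first space, and headers contain no tab characters. The function name `check_fastq_pairs` is made up for this example.

```shell
# check_fastq_pairs READ1.fastq READ2.fastq
# Succeeds (prints "OK") when both files have the same number of lines
# and the n-th read names agree; otherwise reports the problem and fails.
check_fastq_pairs() {
    n1=$(wc -l < "$1" | tr -d ' ')
    n2=$(wc -l < "$2" | tr -d ' ')
    if [ "$n1" -ne "$n2" ]; then
        echo "line counts differ: $n1 vs $n2"
        return 1
    fi
    # paste puts line n of both files side by side, separated by a tab
    paste "$1" "$2" | awk -F '\t' '
        NR % 4 == 1 {                                    # header line of each record
            a = $1; b = $2
            sub(/ .*/, "", a); sub(/ .*/, "", b)         # drop comments after the first space
            sub(/\/[12]$/, "", a); sub(/\/[12]$/, "", b) # drop /1 and /2 mate suffixes
            if (a != b) bad++
        }
        END {
            if (bad > 0) { printf "%d read name mismatches\n", bad; exit 1 }
            print "OK"
        }
    '
}
```

For the files above this would be run as `check_fastq_pairs data/lane1/s_7_1.fastq data/lane1/s_7_2.fastq`.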
  
Run the script below to create the mappings:

In [None]:
#!/bin/sh

# Create the reference genome index, necessary for bwa alignment
bwa index Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa


# Three commands are piped one into another:
#   - align the lane fastq files with bwa
#   - convert the sam output to bam
#   - sort the bam by coordinate
# The sorted bam is then indexed in a separate step.
#
bwa mem -M \
    -R '@RG\tID:lane1\tSM:60A_Sc_DBVPG6044' \
    Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa \
    data/lane1/s_7_1.fastq \
    data/lane1/s_7_2.fastq |
samtools view -b - |
samtools sort -T tmp.lane1 -o lane1.sorted.bam
samtools index lane1.sorted.bam

The script contains several commands, some of which are combined using pipes. (UNIX pipes are a very powerful and elegant concept: they let us feed the output of one command into the next and avoid writing intermediate files. If you are not comfortable with UNIX, consider having a go at the UNIX tutorial.)

The script will produce the BAM file lane1.sorted.bam. Generate the stats including only primary alignments using the command:

In [None]:
samtools stats -F SECONDARY lane1.sorted.bam > lane1.sorted.bam.bchk

In [None]:
cat lane1.sorted.bam.bchk
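
The full output is long. Each section is marked by a short prefix in the first column, and the summary numbers use the prefix SN; the header of the stats file itself suggests `grep ^SN | cut -f 2-` to extract them. A small wrapper, with the hypothetical name `stats_summary`, could look like this:

```shell
# stats_summary STATS_FILE
# Print only the summary-number lines, without the leading "SN" column.
stats_summary() {
    grep '^SN' "$1" | cut -f 2-
}
```

Running `stats_summary lane1.sorted.bam.bchk` then shows just the summary block, which is enough for most of the questions below.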

Look at the output and answer the following questions:

__Q2: What is the total number of reads?  
Q3: What proportion of the reads were mapped?  
Q4: How many read pairs had their two mates mapped to different chromosomes?  
Q5: What is the insert size mean and standard deviation?  
Q6: How many reads were paired properly? Challenge: can you verify that only mapped reads have the PROPER_PAIR bit set? (Skip the second part of this question if you don't know how to use awk and its bitwise and() operation.)__  

Next we will create some QC plots from the output of the stats command using plot-bamstats, which is part of the samtools package:

In [None]:
plot-bamstats -p lane1-plots/ lane1.sorted.bam.bchk

Now in your web browser open the file lane1-plots/index.html to view the QC information.

__Q7: How many reads have zero mapping quality?  
Q8: Do the first fragments or the second fragments have higher base quality on average?__  