## QC assessment of NGS data
As mentioned previously, QC is an important part of any analysis. In this section we are going to look at some of the most important aspects of QC to consider. 

## Biases in sequencing
A lot can happen during sequencing that affects the quality of the data. Here we cover the most common sources of error and bias.

### Base quality
Illumina sequencing technology relies on sequencing by synthesis. One of the most common problems with this is __dephasing__. For each sequencing cycle, there is a possibility that the replication machinery slips and either incorporates more than one nucleotide or fails to incorporate one at all. The more cycles that are run (i.e. the longer the read length gets), the more of these errors accumulate. This leads to a heterogeneous population in the cluster and decreased signal purity, which in turn reduces the precision of the base calling. The figure below shows an example of this.

<img src="img/base_qual.png" alt="Base Quality" style="width: 400px;"/>

Because of this, it is possible to have high quality data at the beginning of the read but really low quality data towards the end of the read. In those cases you can decide to trim off the low quality ends of the reads, for example using a tool called [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic).

The figure below shows an example of a high quality read on the left, and a poor quality read on the right.

<img src="img/base_qual2.png" alt="Base Quality" style="width: 700px;"/>
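The drop in quality along the read can also be checked directly from a FASTQ file. The following is a minimal sketch rather than part of the original tutorial: the function name `mean_qual_per_cycle` is made up for illustration, and it assumes four-line FASTQ records with Phred+33 encoded qualities.

```shell
# mean_qual_per_cycle FASTQ: print "cycle<TAB>mean Phred quality" for every
# read position. Assumes 4-line FASTQ records and Phred+33 encoding.
mean_qual_per_cycle() {
    awk 'BEGIN {
             # Build a character -> ASCII code lookup table
             for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
         }
         NR % 4 == 0 {                       # every 4th line is a quality string
             n = length($0)
             for (i = 1; i <= n; i++) {
                 sum[i] += ord[substr($0, i, 1)] - 33
                 cnt[i]++
             }
             if (n > max) max = n
         }
         END {
             for (i = 1; i <= max; i++) printf "%d\t%.1f\n", i, sum[i] / cnt[i]
         }' "$1"
}
```

Calling `mean_qual_per_cycle reads.fastq` (where `reads.fastq` is a placeholder name) prints one line per cycle. Mean values that fall steadily towards the last cycles are the pattern described above.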

### Other base calling errors
There are several different reasons for a base to be called incorrectly, as shown in the figure below. __Phasing noise__ and __signal decay__ are results of the dephasing issue described above. During library preparation, __mixed clusters__ can occur if multiple templates get colocated. These clusters should be removed from the downstream analysis. __Boundary effects__ occur due to optical effects when the intensity is uneven across each tile, resulting in lower intensity toward the edges. __Cross-talk__ occurs because the emission frequency spectra for each of the four dyes partly overlap. Finally, for early chemistries __T fluorophore accumulation__ was an issue, where incomplete cleavage of the dye coupled to thymine led to an accumulation of the thymine signal.

<img src="img/base_calling_errors.png" alt="Base Calling Errors" style="width: 800px;"/>

_Base-calling for next-generation sequencing platforms_, doi: [10.1093/bib/bbq077](https://academic.oup.com/bib/article/12/5/489/268399)

### Mismatches per cycle
Aligning reads to a high quality reference genome can provide insight into the quality of a sequencing run by showing you the mismatches to the reference sequence. This can help you detect cycle-specific errors. Mismatches have two main causes: sequencing errors and genuine differences between your sample and the reference genome. This is important to bear in mind when interpreting mismatch graphs.

<img src="img/mismatches.png" alt="Mismatch" style="width: 900px;"/>

### GC bias
Regions rich in GC or AT are more difficult to amplify and prone to sequencing errors like insertions and deletions. For this reason it is a good idea to compare the GC content of the reads against the expected distribution in a reference sequence. In the left image below, we can see that the GC content of the sample is about the same as for the reference, at ~38%. However, in the right image, the GC content of the sample is closer to 55%, indicating that there is an issue with this sample.

<img src="img/gc_bias.png" alt="GC Bias" style="width: 900px;"/>
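As a rough first check, the overall GC percentage of the reads can be computed straight from a FASTQ file. This is only a sketch under stated assumptions: `gc_percent` is a hypothetical helper name, it assumes four-line FASTQ records, and it reports a single number rather than the per-read distribution shown in the plots. Ambiguous bases such as N count towards the total length.

```shell
# gc_percent FASTQ: print the overall GC percentage of all bases in the file.
# Assumes 4-line FASTQ records; N bases are counted in the denominator.
gc_percent() {
    awk 'NR % 4 == 2 {                       # every record has its sequence on line 2
             seq = toupper($0)
             total += length(seq)
             gc += gsub(/[GC]/, "", seq)     # gsub returns the number of G/C removed
         }
         END { if (total > 0) printf "%.2f\n", 100 * gc / total }' "$1"
}
```

The printed value can then be compared against the GC content expected for the organism, for example the ~38% quoted above for the reference.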


### GC content by cycle
Looking at the GC content per cycle can help detect whether the adapter sequence was trimmed. For a random library, little to no difference is expected between the proportions of the different bases across the cycles of a sequencing run, so the lines in this plot should be parallel with each other, as in the figure on the left below. In the figure on the right, the initial spikes are likely due to adapter sequences that have not been removed.

<img src="img/gc_cycle.png" alt="GC cycle" style="width: 800px;"/>

### Insert size
For paired-end sequencing the size of the DNA fragments also matters. In the left-hand example below, the insert size peaks around 440 bp. On the right, however, there is an additional peak at around 200 bp. This could indicate that a lot of the DNA was fragmented more than intended.

<img src="img/fragment_size.png" alt="Fragment size" style="width: 800px;"/>
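The insert size of a pair is recorded in column 9 (TLEN) of each SAM record, so a quick summary can be computed from SAM text. The sketch below is an illustration, not the method used by any particular QC tool: `insert_size_stats` is a hypothetical helper name, and it uses every positive TLEN value without any outlier filtering, so its numbers may differ from those reported by dedicated tools.

```shell
# insert_size_stats: read SAM records on stdin and print the mean and the
# population standard deviation of the template length (column 9, TLEN).
# Only positive TLEN values are used, so each pair is counted once
# (the mate carries the same value with a negative sign).
insert_size_stats() {
    awk -F '\t' '
        /^@/ { next }                        # skip SAM header lines
        $9 > 0 { n++; sum += $9; sumsq += $9 * $9 }
        END {
            if (n == 0) { print "no pairs with positive TLEN"; exit 1 }
            mean = sum / n
            var = sumsq / n - mean * mean
            if (var < 0) var = 0             # guard against rounding error
            printf "%.1f\t%.1f\n", mean, sqrt(var)
        }'
}
```

Once a BAM file is available, something like `samtools view -f 2 lane1.sorted.bam | insert_size_stats` would restrict the calculation to properly paired reads (flag bit 2).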

__Q1: The figure below is from a 100 bp paired-end sequencing run. Can you spot any problems?__

<img src="img/qc_quiz.png" alt="quiz" style="width: 550px;"/>


### Insertions/Deletions per cycle
Sometimes, air bubbles occur in the flow cell, which can manifest as false indels. The spike in the image on the right provides an example of how this can look.

<img src="img/indels.png" alt="Indels" style="width: 800px;"/>

## Genotype checking
Looking closer at the genotype of your samples can help you detect sample swaps and contamination. Here we will cover how to detect sample swaps, and in the next section we will cover how to detect contamination.

### Detecting sample swaps
By comparing a suspicious sample against a known set of variants it is possible to detect if a sample is likely to have been swapped.

<img src="img/swap.png" alt="swap" style="width: 700px;"/>
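The idea behind such a comparison can be sketched with two simple genotype tables. Everything about the input format here is an assumption made for illustration: both files are tab-separated with the columns chromosome, position and genotype, and `genotype_concordance` is a hypothetical helper name. Real analyses would work on VCF files with a dedicated tool.

```shell
# genotype_concordance KNOWN SAMPLE: both files are tab-separated with the
# hypothetical columns chrom, pos, genotype (for example "I<TAB>1234<TAB>0/1").
# Prints the percentage of shared sites at which the genotypes agree.
genotype_concordance() {
    awk -F '\t' '
        NR == FNR { known[$1 FS $2] = $3; next }   # first file: store known genotypes
        ($1 FS $2) in known {                       # second file: sites present in both
            shared++
            if (known[$1 FS $2] == $3) match_n++
        }
        END {
            if (shared == 0) { print "no shared sites"; exit 1 }
            printf "%.1f\n", 100 * match_n / shared
        }' "$1" "$2"
}
```

A sample that matches its expected genotypes should score close to 100, while a swapped sample will typically agree at far fewer sites.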


## Generate QC stats
Now let's try this out! We will generate QC stats for two lanes of Illumina paired-end sequencing data from yeast. We will use the bwa mapper to align the data to the [Saccharomyces cerevisiae genome](ftp://ftp.ensembl.org/pub/current_fasta/saccharomyces_cerevisiae/dna), followed by samtools stats to generate the stats.

Read pairs are usually stored in two separate FASTQ files so that the n-th read in the first file and the n-th read in the second file constitute a read pair. Can you devise a quick sanity check that the reads in these two files indeed form pairs? The files must have the same number of lines, and the naming of the reads usually suggests whether they form a pair. The location of the files is:  
```  
data/lane1/s_7_1.fastq   
data/lane1/s_7_2.fastq  
```
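One possible way to carry out such a check is sketched below. It is only one of many approaches: `check_pairs` is a hypothetical helper name, and it assumes four-line FASTQ records whose read names either end in `/1` and `/2` or carry the mate number after a space.

```shell
# check_pairs R1 R2: sanity-check that two FASTQ files hold matching read pairs.
# Assumes 4-line records and read names that differ only by a /1 or /2 suffix
# or by a comment after the first space.
check_pairs() {
    r1=$1
    r2=$2

    # 1) Both files must have the same number of lines.
    n1=$(($(wc -l < "$r1")))
    n2=$(($(wc -l < "$r2")))
    if [ "$n1" -ne "$n2" ]; then
        echo "line counts differ: $n1 vs $n2"
        return 1
    fi

    # 2) Header lines (line 1 of every record) must agree once the
    #    mate-specific part of the name has been stripped.
    mismatches=$(paste "$r1" "$r2" | awk -F '\t' '
        NR % 4 == 1 {
            a = $1; b = $2
            sub(/ .*$/, "", a);    sub(/ .*$/, "", b)     # drop comment after a space
            sub(/\/[12]$/, "", a); sub(/\/[12]$/, "", b)  # drop /1 and /2
            if (a != b) n++
        }
        END { print n + 0 }')
    if [ "$mismatches" -ne 0 ]; then
        echo "$mismatches read name(s) do not match"
        return 1
    fi
    echo "OK: $((n1 / 4)) read pairs"
}
```

For the files above this would be called as `check_pairs data/lane1/s_7_1.fastq data/lane1/s_7_2.fastq`.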

Run the script below to create the mappings:

```
#!/bin/sh

# Create the reference genome index, necessary for bwa alignment
bwa index data/Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa

# The first three commands are piped one into another:
#   - align the lane fastq files with bwa
#   - convert the sam output to bam
#   - sort the bam
# The sorted bam is then indexed in a separate step.
bwa mem -M \
    -R '@RG\tID:lane1\tSM:60A_Sc_DBVPG6044' \
    data/Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa \
    data/lane1/s_7_1.fastq \
    data/lane1/s_7_2.fastq |
samtools view -b - |
samtools sort -T tmp.lane1 -o lane1.sorted.bam
samtools index lane1.sorted.bam
```

The script contains several commands, some of which are combined using pipes. (UNIX pipes are a very powerful and elegant concept which allows us to feed the output of one command into the next command and avoid writing intermediate files. If you are not comfortable with UNIX, consider having a go at the UNIX tutorial.)

The script will produce the BAM file lane1.sorted.bam. Generate the stats including only primary alignments using the command:

```
samtools stats -F SECONDARY lane1.sorted.bam > lane1.sorted.bam.bchk
```

```
head -n 41 lane1.sorted.bam.bchk
```
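The summary numbers in a samtools stats file are the lines that start with `SN`, and the header of the file itself suggests extracting them with `grep ^SN | cut -f 2-`. The sketch below wraps that in two small helpers. The function names `stats_summary` and `stats_value` are made up for illustration, and the label passed to `stats_value` must match the text in the file exactly, including the trailing colon.

```shell
# stats_summary FILE: print the summary-numbers section of a samtools stats file
stats_summary() {
    grep '^SN' "$1" | cut -f 2-
}

# stats_value FILE LABEL: print the value of a single summary line,
# for example LABEL="average length:".
stats_value() {
    awk -F '\t' -v label="$2" '$1 == "SN" && $2 == label { print $3 }' "$1"
}
```

For example, `stats_summary lane1.sorted.bam.bchk` lists all summary lines, and `stats_value lane1.sorted.bam.bchk "average length:"` prints just the average read length.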

Look at the output and answer the following questions:

__Q2: What is the total number of reads?  
Q3: What proportion of the reads were mapped?  
Q4: How many reads were mapped to a different chromosome?  
Q5: What is the insert size mean and standard deviation?  
Q6: How many reads were paired properly? Challenge: can you verify that only mapped reads have the PROPER_PAIR bit set? (Skip the second part of this question if you don't know how to use awk and its bitwise and() operation.)__  

Next we will create some QC plots from the output of the stats command using plot-bamstats, which is part of the samtools package:

```
plot-bamstats -p lane1-plots/ lane1.sorted.bam.bchk
```

Now in your web browser open the file lane1-plots/index.html to view the QC information.

__Q7: How many reads have zero mapping quality?  
Q8: Do the first fragments or the second fragments have higher base quality on average?__  