This lecture uses files from this tutorial:

https://datacarpentry.org/wrangling-genomics/03-trimming/index.html

# FastQC: A quality control tool for high throughput sequence data

FastQC is a GUI program for quality control of FASTQ files. Installation instructions are here:

https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc

Download these files and copy them inside the `Nov 2` directory:

These are gzip-compressed files. Most bioinformatics programs accept compressed .gz files as input, but you can decompress them by running `gunzip` (which is installed by default on Linux and Mac systems, but not on Windows):

Running `gunzip` decompresses the file in place and removes the original compressed file.
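Since most tools read `.gz` input directly, decompressing is often unnecessary. The same idea can be sketched in Python; the helper names `open_fastq` and `first_record` are hypothetical and only illustrate reading a FASTQ file whether or not it is compressed:

```python
import gzip


def open_fastq(path):
    """Open a FASTQ file for text reading, compressed or not.

    Hypothetical helper: it decides by file extension only.
    """
    if str(path).endswith(".gz"):
        # "rt" makes gzip.open return decoded text lines instead of bytes.
        return gzip.open(path, "rt")
    return open(path, "r")


def first_record(path):
    """Return the four lines of the first FASTQ record, without newlines."""
    with open_fastq(path) as handle:
        return [handle.readline().rstrip("\n") for _ in range(4)]
```

For example, `first_record("SRR2589044_1.fastq.gz")` would return the header, sequence, separator and quality lines of the first read, assuming the file has that name and sits in the current directory.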

These are Illumina paired-end reads for three samples; we will work with the first sample for now.

Paired-end reads are usually output in two files. Each file contains one member of each read pair, in the same order. Let's see the first read from SRR2589044_1:

Now let's see the first read from SRR2589044_2:

Note that the IDs are the same except for the `/1` and `/2` at the end, which specify which member of the pair each read is.
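The pairing convention can be checked programmatically. The following is a minimal sketch, assuming headers end in `/1` or `/2` as described; the function names and the example IDs are hypothetical:

```python
def split_pair_id(header):
    """Split a FASTQ header into (base_id, mate).

    Assumes the '/1' and '/2' suffix convention; returns mate=None otherwise.
    """
    text = header.strip()
    if text.startswith("@"):
        text = text[1:]
    base, sep, mate = text.rpartition("/")
    if sep and mate in ("1", "2"):
        return base, mate
    return text, None


def is_proper_pair(header_1, header_2):
    """True if the two headers share an ID and are mates 1 and 2, in that order."""
    base_1, mate_1 = split_pair_id(header_1)
    base_2, mate_2 = split_pair_id(header_2)
    return base_1 == base_2 and (mate_1, mate_2) == ("1", "2")
```

Running `is_proper_pair` over the headers of both files in parallel is one way to confirm that the two files list the pairs in the same order.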

Now let's run FastQC and open these two files in it. Processing will take some time:

![FastQC Output](images/fastqc_1.png)

Different tabs on the left show various output plots.  Most are self-explanatory; we'll go over some of the key ones.

The "per base sequence quality" plot shows the average quality at each position in the read, across all reads. We usually expect lower quality at the first and last bases and higher quality in the middle:

![Per Base Quality](images/per_base.png)
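To make the per-position average concrete, here is a small sketch that computes it from quality strings. It assumes Phred+33 encoding and reproduces only the mean at each position, not the full plot that FastQC draws; the function names are hypothetical:

```python
def phred33_scores(quality_string):
    """Convert a Phred+33 quality string to a list of integer scores."""
    return [ord(char) - 33 for char in quality_string]


def per_base_mean_quality(quality_strings):
    """Mean quality at each read position across reads.

    Reads may differ in length; each position averages only the reads
    that are long enough to cover it.
    """
    totals, counts = [], []
    for quality in quality_strings:
        for position, score in enumerate(phred33_scores(quality)):
            if position == len(totals):
                totals.append(0)
                counts.append(0)
            totals[position] += score
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]
```

In Phred+33, the character `I` encodes a score of 40 and `5` encodes 20, so two reads with quality strings `II` and `5I5` average to 30 at the first position.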

Next we move to the "per sequence quality" plot, which shows how many reads have each average quality score. The quality of a read is simply the average quality across all its bases.

![Per Sequence Quality](images/per_sequence.png)
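The per-read average can be sketched the same way. This example assumes Phred+33 encoding, and truncating each mean to an integer before counting is a choice made here for illustration, not necessarily what FastQC does internally:

```python
from collections import Counter


def mean_read_quality(quality_string, offset=33):
    """Average Phred score of one read (the string must not be empty)."""
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores)


def per_sequence_quality_counts(quality_strings):
    """Count reads by mean quality, truncated to an integer; empty reads are skipped."""
    return Counter(int(mean_read_quality(q)) for q in quality_strings if q)
```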

A more interesting plot is the "per base sequence content". There is a warning sign next to it in our output. Let's see why:

![Per Base Content](images/per_base_content.png)

We usually expect all four bases to have more or less equal representation at every position. Here we're not seeing that. The yellow warning means that the difference between A and T, or between G and C, exceeded 10% for at least one position. If the difference were more than 20% at any position, we would see a red cross, which indicates a FastQC failure. If you look at the output for the second file, you'll see that the difference there exceeds 20%:

![Per Base Content](images/per_base_content_2.png)

This is due to the presence of fixed adapter sequences at the beginning of reads.
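The warning and failure thresholds can be illustrated with a short sketch. It follows the rule as the FastQC documentation describes it, to the best of my understanding (A compared with T, and G compared with C, at each position); the function names are hypothetical and this is not FastQC's own code:

```python
def base_fractions(reads):
    """Fraction of A, C, G and T at each position across reads."""
    counts = []
    for read in reads:
        for position, base in enumerate(read.upper()):
            if position == len(counts):
                counts.append({"A": 0, "C": 0, "G": 0, "T": 0})
            if base in counts[position]:  # ignore N and other symbols
                counts[position][base] += 1
    fractions = []
    for column in counts:
        total = sum(column.values())
        fractions.append(
            {base: (n / total if total else 0.0) for base, n in column.items()}
        )
    return fractions


def content_status(reads, warn=0.10, fail=0.20):
    """Return 'pass', 'warn' or 'fail' from the largest A/T or G/C gap."""
    worst = 0.0
    for column in base_fractions(reads):
        worst = max(
            worst,
            abs(column["A"] - column["T"]),
            abs(column["G"] - column["C"]),
        )
    if worst > fail:
        return "fail"
    if worst > warn:
        return "warn"
    return "pass"
```

A fixed sequence at the start of every read, such as an adapter, puts a single base at 100% in each of the affected positions, which is why it pushes this check straight to failure.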

# Trimmomatic:  A flexible read trimming tool for Illumina NGS data

We will now use a tool called Trimmomatic to remove low-quality and otherwise problematic sequences from our data. Like FastQC, Trimmomatic requires Java to be installed. Download Trimmomatic here:

http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip

Once downloaded, unzip the archive and run Trimmomatic using `java -jar`:

We will get the following output:

We need to specify whether we have paired-end (PE) or single-end (SE) data. The last argument, `<trimmer1>`, stands for one or more trimming steps that tell Trimmomatic how to trim the input:

* `ILLUMINACLIP:<fastaWithAdaptersEtc>:<seedMismatches>:<palindromeClipThreshold>:<simpleClipThreshold>`
    * fastaWithAdaptersEtc: specifies the path to a FASTA file containing all the adapters, PCR sequences, etc. The naming of the various sequences within this file determines how they are used. See below.
    * seedMismatches: specifies the maximum mismatch count which will still allow a full match to be performed.
    * palindromeClipThreshold: specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment.
    * simpleClipThreshold: specifies how accurate the match between any adapter etc. sequence must be against a read.
* `SLIDINGWINDOW:<windowSize>:<requiredQuality>`
    * windowSize: specifies the number of bases to average across.
    * requiredQuality: specifies the average quality required.
* `LEADING:<quality>`
    * quality: specifies the minimum quality required to keep a base.
* `TRAILING:<quality>`
    * quality: specifies the minimum quality required to keep a base.
* `CROP:<length>`
    * length: the number of bases to keep, from the start of the read.
* `HEADCROP:<length>`
    * length: the number of bases to remove from the start of the read.
* `MINLEN:<length>`
    * length: specifies the minimum length of reads to be kept.
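
To give a feel for what the quality-based steps listed above do, here is a simplified Python sketch of `LEADING`/`TRAILING`, `SLIDINGWINDOW` and `MINLEN`. It assumes Phred+33 qualities and is not Trimmomatic's actual implementation, which handles more edge cases; the function names and default values are illustrative only:

```python
def trim_ends(sequence, quality, leading=0, trailing=0):
    """Drop bases below `leading` from the start and below `trailing` from the end."""
    scores = [ord(char) - 33 for char in quality]
    start = 0
    while start < len(scores) and scores[start] < leading:
        start += 1
    end = len(scores)
    while end > start and scores[end - 1] < trailing:
        end -= 1
    return sequence[start:end], quality[start:end]


def sliding_window_trim(sequence, quality, window_size=4, required_quality=20):
    """Cut the read at the first window whose mean quality is below the threshold.

    Reads shorter than the window are returned unchanged in this sketch.
    """
    scores = [ord(char) - 33 for char in quality]
    keep = len(scores)
    for start in range(0, len(scores) - window_size + 1):
        window = scores[start:start + window_size]
        if sum(window) / window_size < required_quality:
            keep = start
            break
    return sequence[:keep], quality[:keep]


def passes_minlen(sequence, min_length=25):
    """True if the trimmed read is long enough to keep."""
    return len(sequence) >= min_length
```

For a read with quality string `IIIIII####`, the first window of four whose mean drops below 20 starts at the sixth base, so the sketch keeps the first five bases. Such a short read would then be discarded by a `MINLEN` of 25.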

For instance, look at this sample invocation:

Here we're telling it to clip Illumina adapters using the file `SRR_adapters.fa`.

The downloaded zip file includes adapters for several different sequencers.

Let's clean the files we used with FastQC:

Above, we have identified the quality encoding as phred33, and we're using the Nextera adapters (Nextera is an Illumina library preparation kit).
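As a hedged illustration of how such a paired-end call is put together, the sketch below assembles the argument list in Python. The jar and adapter paths assume the zip was unpacked into `Trimmomatic-0.39/` in the working directory, the unpaired output names (`_1un`, `_2un`) and the trimmer values are assumptions borrowed from common tutorial settings, and the function name is hypothetical:

```python
def trimmomatic_pe_command(
    sample,
    jar="Trimmomatic-0.39/trimmomatic-0.39.jar",  # assumed unzip location
    adapters="Trimmomatic-0.39/adapters/NexteraPE-PE.fa",  # assumed adapter file
):
    """Build a paired-end Trimmomatic command as a list of arguments."""
    return [
        "java", "-jar", jar, "PE", "-phred33",
        # Two inputs: forward and reverse reads.
        f"{sample}_1.fastq.gz", f"{sample}_2.fastq.gz",
        # Four outputs: paired and unpaired reads for each input.
        f"{sample}_1.trim.fastq.gz", f"{sample}_1un.trim.fastq.gz",
        f"{sample}_2.trim.fastq.gz", f"{sample}_2un.trim.fastq.gz",
        # Trimming steps, applied in the order given (values are assumptions).
        f"ILLUMINACLIP:{adapters}:2:40:15",
        "SLIDINGWINDOW:4:20",
        "MINLEN:25",
    ]
```

The list can be passed to `subprocess.run(trimmomatic_pe_command("SRR2589044"), check=True)` once Java and the files are in place. Building a list rather than a single string avoids shell quoting problems.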

You can check that the output files are smaller than the input files, because some reads have been removed and others have been shortened.
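File size is only a rough indicator; counting reads is more direct. A minimal sketch, assuming standard four-line FASTQ records (the function name is hypothetical):

```python
import gzip


def count_reads(path):
    """Count reads in a FASTQ file, assuming exactly four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4
```

Comparing `count_reads` on an input file and on its trimmed counterpart shows how many reads were dropped or moved to the unpaired output.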

Now let's run FastQC for the output files `SRR2589044_1.trim.fastq.gz` and `SRR2589044_2.trim.fastq.gz`.

![Per Base Quality](images/per_base_quality_trim.png)

You'll see that the quality is now slightly higher than it was before trimming.