# Cleaning and trimming of FASTQ files

Work through the following steps in order to produce cleaned FASTQ files that pass all quality tests. If you have questions about the commands, please first have a look at the [Trimmomatic tutorial](https://drive.google.com/file/d/1WI_gGGIViibALnlY3UfYzR0z-TXChvC-/view) and see if you find the answer there.

Answer the questions in the yellow blocks, which look like this:
<div class="alert alert-block alert-warning">
**Exam question**
</div>
Write your answers in a separate Word document if you need or want a course completion certificate. ForBio requires this; without the answers we can only issue a participation certificate, which may still be enough for you to get credits, depending on your home university's policies. There are only a few questions and you can keep your answers brief. In any case, we recommend that you at least think about each question.

_______

**1)** Connect to the cluster, then load and activate the software environment with the commands below. (Reminder: if you have not yet installed the virtual environment, do so by running `sbatch /work/projects/forbio/bin/install_forbio_env.sh`, which takes approx. 15 min.)

In [None]:
%%bash
module load Anaconda3/5.1.0
source activate forbio_env

Copy the folder for this tutorial to your account and enter the folder:

In [None]:
%%bash
cp -r /work/projects/forbio/tutorials_tobi/cleaning_trimming .
cd cleaning_trimming


_____

**2)** Count the number of reads in the FASTQ files. We have two FASTQ files for our sample 1061. Think about what the command below does, and ask if anything is unclear.

<div class="alert alert-block alert-warning">
**Question:** Why do the two files have exactly the same number of reads?
</div>

In [None]:
%%bash
cat raw_fastq/1061_R1.fastq | wc -l | awk '{print $1/4}'
cat raw_fastq/1061_R2.fastq | wc -l | awk '{print $1/4}'

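To see why the line count is divided by four, here is a minimal sketch on a toy file. It assumes that every FASTQ record occupies exactly four lines (header, sequence, `+` separator, quality string), which holds for standard unwrapped FASTQ files. The helper name `count_reads` and the toy records are made up for illustration and are not part of the course data.

```bash
# Hypothetical helper: count reads in an unwrapped FASTQ file (4 lines per record).
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Build a toy FASTQ file with two records (illustration only).
toy=$(mktemp)
printf '%s\n' \
    '@read1' 'ACGT' '+' 'IIII' \
    '@read2' 'TTGA' '+' 'IIII' > "$toy"

n_reads=$(count_reads "$toy")
echo "$n_reads"
rm -f "$toy"
```

The toy file has eight lines, so the helper reports two reads. The same function could be pointed at `raw_fastq/1061_R1.fastq` and `raw_fastq/1061_R2.fastq`.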

____

**3)** Run a quality check on one of the files:

In [None]:
%%bash
fastqc -o quality_check raw_fastq/1061_R1.fastq


______

**4)** Check the results of the quality check. Let's look at the graphical output of the test. To do this, **copy the quality check output to your computer** and open it in a web browser (Internet Explorer, Safari, Firefox, ...).

In [None]:
%%bash
# open bash terminal on your computer and copy the quality test 
# output folder from the cluster to your computer
scp -r USERNAME@abel.uio.no:/PATH/TO/YOUR/QUALITY_TEST/RESULTS/ .

Now double-click on the `1061_R1_fastqc.html` file and look at the different tests. A green symbol in front of a test means that it passed, a yellow symbol means that it did not fail but produced a warning, and a red symbol means that the test failed.

Try to understand what each test tells you about your data and why some of the test results are problematic. Ask if some of the tests don't make sense to you!


______

**5)**  Prepare a file containing the adapter and barcode information. We have the following information about our sample:

- i5 adapter:

AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

- i7 adapter:

GATCGGAAGAGCACACGTCTGAACTCCAGTCAC**\***ATCTCGTATGCCGTCTTCTGCTTG

- barcode of sample 1061: AGTCAA

Before moving to the next step you need to **prepare a FASTA file** containing the full adapter sequences for our sample. 
In the folder `cleaned_reads/` you will find a template for the FASTA file named `1061_adapters.fasta`. Open this file in a command line text editor (e.g. `nano`) and insert the correct adapter sequences from above (for easier copying the sequences are also stored in a file `adapter_barcode_info.txt` in the same folder).
**The barcode has to be inserted at the position marked with an asterisk (\*) in the i7 adapter.**

<div class="alert alert-block alert-info">
**NOTE:** The adapter sequences and barcodes differ between different projects (depending on the kits used), so always make sure to keep track of which adapter sequences and which barcodes were used for each sample.
</div>

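If you prefer to build the full i7 sequence on the command line instead of editing it by hand, the substitution can be sketched as follows. The sequences and the barcode are the ones listed above; the variable names and the FASTA record names `i5` and `i7` are placeholders chosen for this example, so keep whatever record names the template `1061_adapters.fasta` already uses.

```bash
# Sequences copied from the sample information above; '*' marks the barcode position.
i5='AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
i7_template='GATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTTG'
barcode='AGTCAA'

# Bash pattern substitution: replace the literal '*' with the barcode.
i7="${i7_template/\*/$barcode}"

# Assemble two FASTA records (record names are placeholders, see above).
adapter_fasta=$(printf '>i5\n%s\n>i7\n%s\n' "$i5" "$i7")
echo "$adapter_fasta"
```

Compare the printed i7 sequence with the one you typed into the template to catch copy-and-paste mistakes.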



______

**6)** Clean your files. We use Trimmomatic, a tool that removes low-quality reads or parts of reads and can also remove adapter sequences.

The basic trimmomatic command syntax is shown below:


```
paired-end (PE) or single-end (SE) mode
<input read1>
<input read2>
<output file clean read1>
<output file clean read1 single (no read2 match)>
<output file clean read2>
<output file clean read2 single (no read1 match)>
ILLUMINACLIP:<fastaWithAdapterSeqs>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
LEADING:<quality>
TRAILING:<quality>
SLIDINGWINDOW:<windowSize>:<requiredQuality>
MINLEN:<length>
CROP:<length>
HEADCROP:<length>
```

Let's run trimmomatic for cleaning our reads and removing the adapter sequences specified in our adapter FASTA file `1061_adapters.fasta`, using the following command:

In [None]:
%%bash
trimmomatic \
        PE \
        raw_fastq/1061_R1.fastq \
        raw_fastq/1061_R2.fastq \
        ./cleaned_reads/1061_READ1_clean.fastq \
        ./cleaned_reads/1061_READ1-single_clean.fastq \
        ./cleaned_reads/1061_READ2_clean.fastq \
        ./cleaned_reads/1061_READ2-single_clean.fastq \
        ILLUMINACLIP:./cleaned_reads/1061_adapters.fasta:2:30:10 \
        LEADING:3 \
        TRAILING:3 \
        SLIDINGWINDOW:4:15 \
        MINLEN:10 \
        CROP:301 \
        HEADCROP:0


Take a moment to understand the output of trimmomatic (what information do you get on the screen and what do the output files contain?).

Also look up what the different trimming steps (e.g. LEADING, TRAILING) do and what the values passed to them mean in the [Trimmomatic tutorial](https://drive.google.com/file/d/1WI_gGGIViibALnlY3UfYzR0z-TXChvC-/view). It is worth spending some time here, as you will need that understanding for later steps.

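As an aid to intuition for `SLIDINGWINDOW:4:15`, the sketch below scans a read from the 5' end with a window of 4 bases and cuts at the first window whose mean quality falls below 15. This is a simplified illustration and not Trimmomatic's actual implementation, which may treat edge cases differently. The function name `sliding_window_keep` and the Phred scores are invented for the example.

```bash
# Hypothetical helper: number of bases kept by a simplified sliding-window trim.
# Arguments: window size, required mean quality, then one Phred score per base.
sliding_window_keep() {
    local window=$1 required=$2
    shift 2
    local quals=("$@")
    local n=${#quals[@]}
    local start i sum
    for (( start = 0; start + window <= n; start++ )); do
        sum=0
        for (( i = start; i < start + window; i++ )); do
            sum=$(( sum + quals[i] ))
        done
        # Integer comparison without division: mean < required <=> sum < required * window
        if (( sum < required * window )); then
            echo "$start"      # cut at the start of the first failing window
            return
        fi
    done
    echo "$n"                  # no window failed: keep the whole read
}

# Toy read of 10 bases whose quality decays toward the 3' end (made-up scores).
kept=$(sliding_window_keep 4 15 38 37 36 30 25 20 12 8 5 2)
echo "bases kept: $kept"
```

For the toy read, the first window with a mean below 15 starts at the sixth base, so five bases are kept. Raising the required quality or shrinking the window makes the trimming stricter.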

__________

**7)** Run quality check on one of the cleaned files: 

In [None]:
%%bash
fastqc -o quality_check cleaned_reads/1061_READ1_clean.fastq

Check how much the quality has improved and whether any tests are still problematic. To inspect the report, repeat **step 4)**. If your sample still fails some of the tests, investigate the cause (which tests are failing and why?).

Adjust and rerun the trimmomatic command from step **6)** to address those issues. This will most likely involve some trial and error and several rounds of refinement.
Find settings with which **all tests PASS** (a remaining WARN is fine for 'Per base sequence content' and 'Sequence Length Distribution').

<div class="alert alert-block alert-warning">
**Question:** Report your final settings that passed these criteria.
</div>

To speed up the process, you can skip downloading and opening the `.html` file after every round and instead use the code below to print an overview of the test results to the screen:

In [2]:
%%bash
# -o overwrites files extracted in a previous round without prompting
unzip -o -qq quality_check/1061_READ1_clean_fastqc.zip -d quality_check
cat quality_check/1061_READ1_clean_fastqc/summary.txt


PASS	Basic Statistics	1061_READ1_clean.fastq
PASS	Per base sequence quality	1061_READ1_clean.fastq
PASS	Per tile sequence quality	1061_READ1_clean.fastq
PASS	Per sequence quality scores	1061_READ1_clean.fastq
WARN	Per base sequence content	1061_READ1_clean.fastq
PASS	Per sequence GC content	1061_READ1_clean.fastq
PASS	Per base N content	1061_READ1_clean.fastq
WARN	Sequence Length Distribution	1061_READ1_clean.fastq
PASS	Sequence Duplication Levels	1061_READ1_clean.fastq
PASS	Overrepresented sequences	1061_READ1_clean.fastq
PASS	Adapter Content	1061_READ1_clean.fastq
PASS	Kmer Content	1061_READ1_clean.fastq

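When you go through many rounds, it can help to let the shell check the overview against the criteria of this step. The sketch below lists every FAIL, plus every WARN outside the two modules where a WARN is tolerated. It assumes the tab-separated layout of `summary.txt` shown above; the helper name `check_summary` and the toy summary are made up for illustration.

```bash
# Hypothetical helper: list FastQC modules that break the criteria of this step.
# A WARN is tolerated only for the two modules named in the instructions.
check_summary() {
    awk -F '\t' '
        $1 == "FAIL" { print; bad = 1; next }
        $1 == "WARN" && $2 != "Per base sequence content" &&
            $2 != "Sequence Length Distribution" { print; bad = 1 }
        END { exit bad }
    ' "$1"
}

# Toy summary with one tolerated WARN and one FAIL (illustration only).
summary=$(mktemp)
printf '%s\t%s\t%s\n' \
    PASS 'Basic Statistics'          toy.fastq \
    WARN 'Per base sequence content' toy.fastq \
    FAIL 'Adapter Content'           toy.fastq > "$summary"

if offending=$(check_summary "$summary"); then
    echo "all criteria met"
else
    echo "still failing:"
    echo "$offending"
fi
rm -f "$summary"
```

On the toy summary this reports only the 'Adapter Content' line. Pointed at `quality_check/1061_READ1_clean_fastqc/summary.txt`, it would print nothing but "all criteria met" once your settings are good enough.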

<div class="alert alert-block alert-success">

**Congratulations!** You should now be able to clean any FASTQ file you encounter in your research. Cleaning and trimming is an essential step, and it is crucial to ensure that none of the quality tests fail.

</div>