## Removing primers/adapters: Cutadapt
Cutadapt is a tool for removing adapter sequences, primers, and poly-A tails from DNA sequencing data. Sequencing data is often delivered with the adapters added by the sequencing center and the primers you used during PCR amplification still attached. Before you analyze the data, you must remove any Illumina adapters and primers present in your sequencing reads, and this is where cutadapt comes in.

### Using Cutadapt
To run cutadapt, you will first need to make sure that it is installed in your Miniforge/Python environment. To do this, run this line of code in your terminal after logging into the HPC:

In [None]:
conda install bioconda::cutadapt

From here, you will need to adjust your cutadapt SLURM script for your own requirements. This includes adjusting your SLURM directives to specify job requirements and resource allocation, **AND** your adapter/primer sequences (dependent on what kind of sequencing you use and the region you amplify). A cutadapt script for 16S reads tagged with V4 primers and sequenced with Illumina will look something like this:

In [None]:
#!/bin/bash
#SBATCH --job-name=cutadapt
#SBATCH --output=cutadapt_%j.log
#SBATCH --error=cutadapt_%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=<your_txstate_id>@txstate.edu
#SBATCH --partition=shared
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=60gb
#SBATCH --time=24:0:0


####### DO NOT USE CAPITAL Rs IN SAMPLE NAMES----RENAME WITH r (R1 & R2 exceptions)

########### Cutadapt: remove illumina adapters and primers

# see manual for options: http://cutadapt.readthedocs.io/en/stable/guide.html
# illumina adapters:
# -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT 
# V4 primers:
# -a GGACTACNVGGGTWTCTAAT -g GTGYCAGCMGCCGCGGTAA

# make directory for cutadapt output files
mkdir cutadapt

module load cutadapt
cutadapt --version

# create file names for cutadapt
files_cut=`ls | grep "R1_001.fastq.gz"`

# use loop to run cutadapt function on all original pairs; send output to cutadapt folder
for R1 in $files_cut
do
    R1_cut=`echo $R1 | cut -d R -f1`R1_cut.fastq.gz
    R2=`echo $R1 | cut -d R -f1`R2_001.fastq.gz
    R2_cut=`echo $R1 | cut -d R -f1`R2_cut.fastq.gz
    cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
        -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT \
        -a GGACTACNVGGGTWTCTAAT \
        -g GTGYCAGCMGCCGCGGTAA \
        -n 4 \
        -o cutadapt/$R1_cut \
        -p cutadapt/$R2_cut \
        $R1 $R2
done

#### Let's go over each section of our SLURM script. 
1. At the beginning, we have our SLURM directives, which we can adjust depending on the resources we need. For these samples, this is plenty of resources. **Make sure you fill in your own email for the *--mail-user* directive.**

2. You will create a cutadapt directory to store output files using the line: 

In [None]:
mkdir cutadapt

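3. You will load the cutadapt module so the program is available to the job (if you installed cutadapt with conda instead, activate that environment here):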

In [None]:
module load cutadapt

4. You will print the version of cutadapt you use to your log file. This is extremely important for reproducibility: 

In [None]:
cutadapt --version

5. You will then create a list of input files ending with R1_001.fastq.gz (forward reads). We don't need a separate list for our R2 (reverse) reads because the script builds each R2 filename from its matching R1 filename, relying on the paired-end naming convention where R1=forward and R2=reverse:

In [None]:
files_cut=`ls | grep "R1_001.fastq.gz"`

Sequence read files ending with ```"R1(R2)_001.fastq.gz"``` are standard for Illumina sequencing data, so this pattern will likely not need to change. However, it is ALWAYS important to double-check your own data.
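
One quick way to double-check your data is to count the forward and reverse files and confirm the numbers match. The sketch below is only an illustration: the function name `count_read_files` is hypothetical (it is not part of the cutadapt script), and it assumes your files use the same `R1_001.fastq.gz` / `R2_001.fastq.gz` suffixes as this tutorial.

```shell
#!/bin/bash
# Hypothetical helper: count R1 and R2 files in a directory (default: the
# current directory) and succeed only if the two counts are equal.
count_read_files() {
    local dir="${1:-.}"
    local r1=0 r2=0 f
    for f in "$dir"/*R1_001.fastq.gz; do
        # Skip the unexpanded pattern when no file matches
        if [ -e "$f" ]; then r1=$((r1 + 1)); fi
    done
    for f in "$dir"/*R2_001.fastq.gz; do
        if [ -e "$f" ]; then r2=$((r2 + 1)); fi
    done
    echo "R1 files: $r1, R2 files: $r2"
    [ "$r1" -eq "$r2" ]
}

# Run the check in the directory that holds your raw reads
count_read_files . || echo "Warning: R1 and R2 counts differ"
```

If the counts differ, a read file is probably missing or named differently, and it is worth fixing that before submitting the job.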

6. You will set up a ***for loop*** that iterates through each forward read file:

In [None]:
for R1 in $files_cut
do

What does this mean exactly?
* The loop starts by iterating over each file in the variable ```$files_cut```.
* ```$files_cut``` contains a list of filenames (e.g., ```sample1_R1_001.fastq.gz```, ```sample2_R1_001.fastq.gz```) that match the pattern ```"R1_001.fastq.gz"```.
* For each file, the variable ```R1``` temporarily holds the current filename.
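
To see what the loop will visit before running cutadapt, you can print each forward read next to the reverse read it is expected to pair with. This is a minimal sketch, not part of the original script: `list_pairs` is a hypothetical name, and it finds the partner by stripping the `R1_001.fastq.gz` suffix with shell parameter expansion rather than with `cut`.

```shell
#!/bin/bash
# Hypothetical helper: loop over forward reads in a directory and report
# whether the matching reverse read exists.
list_pairs() {
    local dir="${1:-.}"
    local R1 R2
    for R1 in "$dir"/*R1_001.fastq.gz; do
        [ -e "$R1" ] || continue    # the pattern matched no files
        # Swap the R1 suffix for the R2 suffix to get the partner's name
        R2="${R1%R1_001.fastq.gz}R2_001.fastq.gz"
        if [ -e "$R2" ]; then
            echo "pair: $(basename "$R1") $(basename "$R2")"
        else
            echo "missing R2 for: $(basename "$R1")"
        fi
    done
}

# Run in the directory that holds your raw reads
list_pairs .
```

Each line of output corresponds to one pass through the loop, with `R1` holding the current forward-read filename.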

7. You will set up your output filename construction:

In [None]:
R1_cut=`echo $R1 | cut -d R -f1`R1_cut.fastq.gz
R2=`echo $R1 | cut -d R -f1`R2_001.fastq.gz
R2_cut=`echo $R1 | cut -d R -f1`R2_cut.fastq.gz

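Here, we are telling LEAP2 to build the filenames the script needs by:
* Truncating each original filename at the first capital "R" (this is why sample names must not contain a capital R)
* Appending ```R1_cut.fastq.gz``` or ```R2_cut.fastq.gz``` to name the trimmed output files
* Appending ```R2_001.fastq.gz``` to find the matching reverse read (R2) file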
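
The effect of `cut -d R -f1` is easiest to see on a concrete filename. The sketch below reuses the same `cut` call as the script; the function names `derive_names` and `name_is_safe` and the example filenames are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: reproduce the script's naming logic for one R1 file.
derive_names() {
    local R1="$1"
    local prefix
    prefix=$(echo "$R1" | cut -d R -f1)   # everything before the first capital R
    echo "${prefix}R1_cut.fastq.gz ${prefix}R2_001.fastq.gz ${prefix}R2_cut.fastq.gz"
}

# Hypothetical helper: succeed only if the sample part of the name
# (everything before R1_001.fastq.gz) contains no capital R.
name_is_safe() {
    local sample="${1%R1_001.fastq.gz}"
    case "$sample" in
        *R*) return 1 ;;
        *)   return 0 ;;
    esac
}

derive_names "sample1_S1_L001_R1_001.fastq.gz"
# prints: sample1_S1_L001_R1_cut.fastq.gz sample1_S1_L001_R2_001.fastq.gz sample1_S1_L001_R2_cut.fastq.gz

derive_names "MyRun_S2_L001_R1_001.fastq.gz"
# prints: MyR1_cut.fastq.gz MyR2_001.fastq.gz MyR2_cut.fastq.gz

name_is_safe "MyRun_S2_L001_R1_001.fastq.gz" || echo "rename this sample first"
```

The second call shows the problem with a capital R in a sample name: the name is cut at "MyR", so the derived R2 file does not exist and the output names lose most of the sample ID.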

8. You will remove primers and adapters:

In [None]:
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
         -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT \
         -a GGACTACNVGGGTWTCTAAT \
         -g GTGYCAGCMGCCGCGGTAA \
         -n 4 \
         -o cutadapt/$R1_cut \
         -p cutadapt/$R2_cut \
         $R1 $R2

Let's take a closer look at the parameters we've specified here:
* **Adapters and primers removed** (in cutadapt, lowercase options act on the forward read and uppercase options act on the reverse read):
    * -a: Illumina adapter, removed from the 3' end of forward reads
    * -A: Illumina adapter, removed from the 3' end of reverse reads
    * -a: 3' primer sequence on forward reads (common 806R primer for 16S rRNA gene)
    * -g: 5' primer sequence on forward reads (common 515F primer for 16S rRNA gene)
* **Options:**
    * -n 4: Removes up to 4 adapters per read
    * -o/-p: Specifies output paths for the trimmed forward/reverse reads
    * Input files: Original R1 and R2 files
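
Before submitting a long job, it can help to print the exact command that would run for one read pair and read it over. The sketch below only echoes the command and never calls cutadapt; `print_cutadapt_cmd` is a hypothetical helper, the filenames are placeholders, and the options are copied unchanged from the script above.

```shell
#!/bin/bash
# Hypothetical dry-run helper: print the cutadapt command for one read pair.
# Arguments: R1 input, R2 input, R1 output name, R2 output name.
print_cutadapt_cmd() {
    local R1="$1" R2="$2" R1_cut="$3" R2_cut="$4"
    echo cutadapt \
        -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
        -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT \
        -a GGACTACNVGGGTWTCTAAT \
        -g GTGYCAGCMGCCGCGGTAA \
        -n 4 \
        -o "cutadapt/$R1_cut" \
        -p "cutadapt/$R2_cut" \
        "$R1" "$R2"
}

# Placeholder filenames; substitute one of your own read pairs
print_cutadapt_cmd s1_R1_001.fastq.gz s1_R2_001.fastq.gz \
    s1_R1_cut.fastq.gz s1_R2_cut.fastq.gz
```

If the printed line shows the adapters, primers, and file paths you expect, the real loop should behave the same way for every pair.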



You will need to place the .bash or .sh file containing this script in the LEAP2 directory where your reads are. I have named my script "cutadapt.bash", so I will put that file into the LEAP2 directory where my raw reads are located. Then, I can run the script from my command line using the ```sbatch``` command (for example, ```sbatch cutadapt.bash```).