![VWBPRGbanner.jpeg](VWBPRGbanner.jpeg)

# 16S amplicon NGS

# USEARCH Pipeline

***

**Input**

Raw `fastq` files from Illumina NGS platforms.

**Output**

Compiled OTU table (taxonomy not assigned) and final OTU sequences in `fasta` format.

**Assumptions**

- Working within a macOS environment. There are slight differences when working within a Linux environment. Because the Pawsey system (Nimbus) runs Linux, this pipeline also includes the variations needed for Linux; note, however, that it is primarily written for macOS. Please provide feedback if any functions do not work within a Linux environment.
- USEARCH is downloaded. This pipeline requires usearch 9.2 and usearch 10 (the primer-trimming step also calls usearch8). In most cases the 32-bit files are adequate; however, analysis of large datasets requires purchasing the 64-bit version. The Cryptick lab has the 64-bit versions of usearch8, 9.2 and 10, and you will need to use the iMac to access these. Unfortunately we do not have the 64-bit version for Linux and so cannot run it on the Nimbus cloud. You will have to weigh up your options between running the analysis on Nimbus and on the iMac. You might consider doing the first steps on the Nimbus cloud and the final steps on the iMac.

**Data type**

This analysis is for paired-end reads.

**Parameters**

- **Merge pairs overlap** - This initial step simply merges the R1 and R2 files. Setting this parameter too high can make it act as a 'quality control' step, which is not its purpose. Try this step with a few reads to check that the majority of reads pass through.
    - Command: `usearch9.2 -fastq_mergepairs` (overlap set with `-fastq_minovlen`)
    - Default: 50 bp


1. Unzip raw fastq.gz files -> fastq
2. Merge paired Illumina sequences
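
The unzip step, and the advice above to try merging on only a few reads, can be sketched as follows. The directory names `raw_data` and `raw_data_subset` follow the cells in this notebook; the function names `unzip_fastq` and `subset_fastq` and the subset size of 1000 reads are assumptions made for illustration only.

```shell
#!/bin/bash
# Sketch only: function names and the subset size are illustrative assumptions.

# Decompress every .fastq.gz in a directory, keeping the original .gz files.
unzip_fastq() {
    local dir="$1"
    for gz in "$dir"/*.fastq.gz; do
        [ -e "$gz" ] || continue            # skip when the glob matches nothing
        gunzip -c "$gz" > "${gz%.gz}"       # sample.fastq.gz -> sample.fastq
    done
}

# Copy the first N reads (4 lines per read) of each fastq into a subset directory.
subset_fastq() {
    local src="$1" dest="$2" n_reads="${3:-1000}"
    mkdir -p "$dest"
    for fq in "$src"/*.fastq; do
        [ -e "$fq" ] || continue
        head -n $((n_reads * 4)) "$fq" > "$dest/$(basename "$fq")"
    done
}

unzip_fastq raw_data
subset_fastq raw_data raw_data_subset 1000
```

Taking the first N reads from both the R1 and R2 files keeps the pairs in step, provided the two files list reads in the same order, as paired Illumina output normally does.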

***

## Fastq info

[fastx_info command](https://www.drive5.com/usearch/manual/cmd_fastx_info.html)

This command is available in both usearch9.2 and usearch10.

In [None]:
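raw_data="raw_data"
read_summary="1.read_summary"
usearch="usearch10"

# -p avoids an error if the directory already exists
mkdir -p "$read_summary"

# Forward reads: one info file per fastq so earlier results are not overwritten
for fq in "$raw_data"/*R1*.fastq
do
  "$usearch" -fastx_info "$fq" -output "$read_summary/1a_fwd_$(basename "$fq" .fastq)_info.txt"
done

# Reverse reads
for fq in "$raw_data"/*R2*.fastq
do
  "$usearch" -fastx_info "$fq" -output "$read_summary/1b_rev_$(basename "$fq" .fastq)_info.txt"
done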
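
As a quick cross-check that does not depend on USEARCH, the number of reads in each forward and reverse file can be counted directly. This is a minimal sketch that assumes standard 4-line FASTQ records and Illumina-style file names containing `_R1_` and `_R2_`; `count_reads` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: count_reads is a hypothetical helper, not a USEARCH command.
# Assumes 4-line FASTQ records (sequence and quality lines are not wrapped).
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

raw_data="raw_data"
for r1 in "$raw_data"/*_R1_*.fastq; do
    [ -e "$r1" ] || continue
    # Assumed naming: the reverse file differs only by _R2_ in place of _R1_
    r2="$raw_data/$(basename "$r1" | sed 's/_R1_/_R2_/')"
    [ -e "$r2" ] || { echo "missing reverse file for $r1"; continue; }
    n1="$(count_reads "$r1")"
    n2="$(count_reads "$r2")"
    printf '%s\t%s\t%s\n' "$(basename "$r1")" "$n1" "$n2"
    [ "$n1" -eq "$n2" ] || echo "  warning: forward and reverse read counts differ"
done
```

Paired files should report the same count; a mismatch usually means a truncated or incompletely copied file.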

***

## Merge pairs

The code below works on a subset of raw fastq files in the directory `raw_data_subset`. It writes the merged files to the directory `merging_test` and displays the % merged in the terminal.

In [None]:
#!/bin/bash

# This script works with usearch9.2 and usearch10
# This piece of code works on a subset of raw fastq reads to test the min merge overlap - default = 50 bp

mkdir merging_test

# Step1: merge data with usearch10 -fastq_mergepairs

for file1 in raw_data_subset/*R1_001.fastq
    do

        echo ""
        echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        echo Merging paired reads
        echo forward reads are:
        echo $(basename ${file1})
        echo reverse reads are:
        echo $(basename ${file1} R1_001.fastq)R2_001.fastq

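    # Write one report per sample so earlier reports are not overwritten
    usearch10 -fastq_mergepairs "${file1}" -reverse "raw_data_subset/$(basename -s R1_001.fastq "${file1}")R2_001.fastq" -fastqout "merging_test/$(basename "$file1")" -fastq_minovlen 50 -report "merging_test/$(basename -s R1_001.fastq "${file1}")report.txt"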
done

# Step 2: Remove "_L001_R1_001" from filenames

for file2 in merging_test/*.fastq
    do

        rename="$(basename ${file2} _L001_R1_001.fastq).fastq"

        mv ${file2} merging_test/${rename}
done

_**Note**_

There are a number of other options that you may find useful at this step, listed on the USEARCH documentation page [here](https://www.drive5.com/usearch/manual/merge_options.html).
After trials with tick NGS data I have found that in most cases altering these parameters makes little to no difference to the resulting sequences, and hence they are omitted from this pipeline. Again, it is important to remember the purpose of this line of code: it is simply to merge the forward and reverse sequences, _not_ to act as a quality control step.
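
To keep a record of the merge rate per sample, rather than only reading it off the terminal, the percentage can be recomputed from read counts. This is a minimal sketch assuming 4-line FASTQ records and the file naming used in the cell above (raw files ending in `_L001_R1_001.fastq`, merged files renamed to `<sample>.fastq`); `percent_merged` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: percent_merged is a hypothetical helper, not a USEARCH command.
# Both files have 4 lines per read, so the line-count ratio equals the read ratio.
percent_merged() {
    local raw_r1="$1" merged="$2"
    awk -v raw="$(wc -l < "$raw_r1")" -v mer="$(wc -l < "$merged")" \
        'BEGIN {
            raw += 0; mer += 0                 # force numeric values
            if (raw > 0) printf "%.1f\n", 100 * mer / raw
            else print "NA"
        }'
}

for r1 in raw_data_subset/*_L001_R1_001.fastq; do
    [ -e "$r1" ] || continue
    sample="$(basename "$r1" _L001_R1_001.fastq)"
    merged="merging_test/${sample}.fastq"
    [ -e "$merged" ] || continue
    printf '%s\t%s\n' "$sample" "$(percent_merged "$r1" "$merged")"
done
```

If most samples merge at a low rate, that suggests the minimum overlap is acting as a filter, which is the situation the note above warns against.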

***

## Quality control and removing dimer sequences

In [None]:
#!/bin/bash

# This script works with usearch9.2 and usearch10
# fastq_filter applies a quality filter: a max expected error (ee) rate and a minimum read length

mkdir quality_filered_test

for file3 in merging_test/*.fastq
    do
        echo ""
        echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        echo Quality control and removing dimer seqs
        echo input is:
        echo ${file3}

    usearch10 -fastq_filter ${file3} -fastaout "quality_filered_test/$(basename "$file3" .fastq).fasta" -fastq_maxee_rate 0.01 -fastq_minlen 150
done
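
To see what the `-fastq_maxee_rate 0.01` threshold means for individual reads, the expected error rate can be recomputed by hand. The expected number of errors in a read is the sum of the per-base error probabilities, 10^(-Q/10), and the rate is that sum divided by the read length. The sketch below assumes Phred+33 quality encoding and 4-line FASTQ records; `ee_rate` is a hypothetical helper name and is not a substitute for the USEARCH filter itself.

```shell
#!/bin/bash
# Sketch only: ee_rate is a hypothetical helper, not a USEARCH command.
# Prints: read id <TAB> read length <TAB> expected errors per base.
# Assumes Phred+33 quality encoding and 4-line FASTQ records.
ee_rate() {
    awk '
    BEGIN {
        # Lookup table from printable character to its ASCII code
        for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
    }
    NR % 4 == 1 { id = substr($0, 2) }          # header line without the leading @
    NR % 4 == 0 {                                # quality line
        n = length($0); ee = 0
        for (j = 1; j <= n; j++) {
            q = ord[substr($0, j, 1)] - 33       # Phred score of base j
            ee += 10 ^ (-q / 10)                 # error probability of base j
        }
        printf "%s\t%d\t%.4f\n", id, n, (n > 0 ? ee / n : 0)
    }' "$1"
}
```

For example, `ee_rate merging_test/sample.fastq | awk -F'\t' '$3 > 0.01'` (with `sample.fastq` standing in for one of your merged files) would list the reads whose rate exceeds the 0.01 threshold used in the cell above.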


## Trim primers

usearch8 [search_pcr command](https://www.drive5.com/usearch/manual8.1/cmd_search_pcr.html)
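
Because the primer searches in this section run with `-strand both`, a primer may be found in a read as its reverse complement. When inspecting reads by eye it can help to print that form, including IUPAC degenerate bases such as `Y`. This is a minimal sketch using the standard `rev` and `tr` utilities; `revcomp` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: revcomp is a hypothetical helper, not a USEARCH command.
# Reverse-complements a sequence, handling IUPAC codes:
# A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H (S, W and N are their own complements).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTRYKMBVDHacgtrykmbvdh' 'TGCAYRMKVBHDtgcayrmkvbhd'
}

revcomp "AGAGTTTGATCCTGGCTYAG"   # prints CTRAGCCAGGATCAAACTCT
revcomp "TGCTGCCTCCCGTAGGAGT"    # prints ACTCCTACGGGAGGCAGCA
```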

In [None]:
#!/bin/bash

# Enter FWD primer sequence 5'-3' (degenerate bases OK)
fwd_primer="AGAGTTTGATCCTGGCTYAG"
# Enter REV primer sequence 5'-3' (degenerate bases OK)
rev_primer="TGCTGCCTCCCGTAGGAGT"

echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo Trimming primers and distal bases
echo %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
echo ""

# At the moment this usearch command can only take .fasta as input so can only be done
# after QF

# Creating working directories

mkdir trimmed_data_test
mkdir seqs_w_fwd_primer
mkdir seqs_wo_fwd_primer
mkdir seqs_w_fwd_and_rev_primer
mkdir seqs_w_fwd_butnot_rev_primer

# Creating FWD primer db

echo ">fwd_primer" > fwd_primer_db.fasta
echo ${fwd_primer} >> fwd_primer_db.fasta

# Creating REV primer db

echo ">rev_primer" > rev_primer_db.fasta
echo ${rev_primer} >> rev_primer_db.fasta

# Creating FWD and REV primer db

echo ">fwd_primer" > both_primers_db.fasta
echo ${fwd_primer} >> both_primers_db.fasta
echo ">rev_primer" >> both_primers_db.fasta
echo ${rev_primer} >> both_primers_db.fasta

#*****************************************************************************************
# Step 1: Finding seqs with FWD primer

for file4 in quality_filered_test/*.fasta
    do

        echo ""
        echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        echo Trimming primers step 1: finding seqs with FWD primer
        echo input is:
        echo ${file4}

    usearch9.2 -search_oligodb ${file4} -db fwd_primer_db.fasta -strand both -matched "seqs_w_fwd_primer/$(basename ${file4})" -notmatched "seqs_wo_fwd_primer/$(basename ${file4})"
done
#*****************************************************************************************
# Step 2: Finding seqs with FWD and REV primers

for file5 in seqs_w_fwd_primer/*.fasta
    do

        echo ""
        echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        echo Trimming primers step 2: finding seqs with FWD and REV primer
        echo input is:
        echo ${file5}

    usearch9.2 -search_oligodb ${file5} -db rev_primer_db.fasta -strand both -matched "seqs_w_fwd_and_rev_primer/$(basename ${file5})" -notmatched "seqs_w_fwd_butnot_rev_primer/$(basename ${file5})"
done
#*****************************************************************************************
# Step 3: Trimming FWD and REV primers

for file6 in seqs_w_fwd_and_rev_primer/*.fasta
    do

        echo ""
        echo ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        echo Trimming primers step 3: removing FWD and REV primers
        echo input is:
        echo ${file6}

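    # NOTE: pcr_missmatches is never set in this notebook; the fallback of 2 below is an assumed placeholder, so set it to suit your primers
    usearch8 -search_pcr "${file6}" -db both_primers_db.fasta -strand both -maxdiffs "${pcr_missmatches:-2}" -pcr_strip_primers -ampout "trimmed_data_test/$(basename "${file6}" .fasta).fasta"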
done
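
After the three trimming steps it is useful to see how many sequences ended up in each directory, so that large losses at any one step stand out. This is a minimal sketch that counts FASTA header lines; the directory names are those created in the cell above, and `count_fasta` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: count_fasta is a hypothetical helper, not a USEARCH command.
# Counts FASTA records by counting header lines (those starting with ">").
count_fasta() {
    grep -c '^>' "$1" || true      # grep exits non-zero on a count of 0; still prints 0
}

for dir in quality_filered_test seqs_w_fwd_primer seqs_wo_fwd_primer \
           seqs_w_fwd_and_rev_primer seqs_w_fwd_butnot_rev_primer trimmed_data_test; do
    total=0
    for fa in "$dir"/*.fasta; do
        [ -e "$fa" ] || continue                    # skip empty or missing directories
        total=$((total + $(count_fasta "$fa")))
    done
    printf '%s\t%d\n' "$dir" "$total"
done
```

The counts for `seqs_w_fwd_primer` and `seqs_wo_fwd_primer` should add up to the count for `quality_filered_test`, and likewise the two step-2 directories should add up to `seqs_w_fwd_primer`.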
