![VWBPRGbanner.jpeg](VWBPRGbanner.jpeg)

# 16S amplicon NGS

# USEARCH Pipeline

***

**Input**

Raw `fastq` files from Illumina NGS platforms.

**Output**

Compiled OTU table (taxonomy not assigned) and final OTU sequences in `fasta` format.

**Assumptions**

- Working within a macOS environment. There are slight differences when working in a Linux environment. Because the Pawsey system (Nimbus) runs Linux, this pipeline also notes the variations needed for Linux; however, it is primarily written for macOS. Please provide feedback if functions do not work in a Linux environment.
- USEARCH is downloaded. This pipeline requires usearch 9.2 and usearch 10. In most cases the 32-bit files are fine; however, analysis of large datasets requires the purchased 64-bit version. The Cryptick lab has 64-bit versions of usearch8, 9.2 and 10, which must be accessed on the iMac. Unfortunately we do not have a 64-bit version for Linux, so it cannot be run on the Nimbus cloud. You will have to weigh up running the analysis on Nimbus versus the iMac; one option is to do the first steps on the Nimbus cloud and the final steps on the iMac.

**Data type**

This analysis is for paired-end reads.

**Commands used in this script**

- [fastx_info](https://www.drive5.com/usearch/manual/cmd_fastx_info.html) - gives a short summary report of the sequences. Importantly at this stage you can assess the EE or expected error rate which is described in detail [here](https://www.drive5.com/usearch/manual/exp_errs.html). In short the EE should be under 2.0, and is usually lower in the forward reads than reverse.

TODO: implement the [fastx_learn](https://www.drive5.com/usearch/manual/cmd_fastx_learn.html) command after the `fastx_uniques` step (line 345).

**Parameters**

- **Merge pairs overlap** - This initial step simply merges the R1 and R2 files. Setting this parameter too high turns it into a 'quality control' step, which is not what it is intended for. Try this step with a few reads to ensure that the majority of reads pass through.
    - Command: `usearch9.2 -fastq_mergepairs` (overlap set with `-fastq_minovlen`)
    - Default: 50 bp
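
One way to try the merge step on "a few reads" is to copy the first reads of every file into a small test directory. The following is a minimal sketch, not part of USEARCH: the function name `subset_fastq` and its arguments are hypothetical, and it assumes unwrapped 4-line FASTQ records with R1 and R2 files in the same read order (so taking the first N records from each keeps pairs in sync).

```shell
#!/bin/bash
# Hypothetical helper (not a USEARCH command): copy the first N reads of
# every FASTQ file in a directory into a small test directory.
# Assumes unwrapped 4-line FASTQ records.
subset_fastq() {
    local in_dir="$1"
    local out_dir="$2"
    local n_reads="${3:-1000}"

    mkdir -p "$out_dir"
    for fq in "$in_dir"/*.fastq; do
        [ -e "$fq" ] || continue
        # Each FASTQ record is 4 lines, so N reads = 4 * N lines
        head -n "$((n_reads * 4))" "$fq" > "$out_dir/$(basename "$fq")"
    done
}
```

For example, `subset_fastq raw_data raw_data_subset 1000` would write the first 1000 reads of each file to `raw_data_subset`, which can then be used as the input directory when testing different overlap values.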


1. Unzip raw fastq.gz files -> fastq
2. Merge paired Illumina sequences

***

## Fastq info

[fastx_info command](https://www.drive5.com/usearch/manual/cmd_fastx_info.html)

This command is available in both usearch9.2 and usearch10.

In [None]:
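#!/bin/bash

raw_data="raw_data"
read_summary="1.read_summary"
usearch="usearch10"

mkdir -p "$read_summary"

# Summarise forward reads; one report per input file so that
# later files do not overwrite earlier reports
for fq in "$raw_data"/*R1*.fastq
do
    "$usearch" -fastx_info "$fq" -output "$read_summary/1a_fwd_$(basename "$fq" .fastq)_info.txt"
done

# Summarise reverse reads; one report per input file
for fq in "$raw_data"/*R2*.fastq
do
    "$usearch" -fastx_info "$fq" -output "$read_summary/1b_rev_$(basename "$fq" .fastq)_info.txt"
done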

At this point it is important to look at the 'EE' (expected error) value.
Detailed information on this can be found [here](https://www.drive5.com/usearch/manual/exp_errs.html).
In short, you want the EE value to be under 2; it is usually lower for the forward reads than for the reverse reads.
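
To make the EE value concrete: the expected number of errors in a read is the sum of its per-base error probabilities, where a Phred score Q corresponds to an error probability of 10^(-Q/10). The sketch below recomputes this per read with `awk`. It is an illustration only, not a replacement for `-fastx_info`; the function name `fastq_ee` is hypothetical, and it assumes Phred+33 quality encoding and unwrapped 4-line FASTQ records.

```shell
#!/bin/bash
# Sketch: per-read expected errors (EE) from a FASTQ file.
# Assumes Phred+33 quality encoding and unwrapped 4-line records.
# Output columns (tab-separated): read name, read length, EE
fastq_ee() {
    awk '
        # Lookup table from quality character to ASCII code
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        # Header line: keep the read name without the leading @
        NR % 4 == 1 { id = substr($1, 2) }
        # Quality line: sum the per-base error probabilities
        NR % 4 == 0 {
            ee = 0
            n = length($0)
            for (j = 1; j <= n; j++) {
                q = ord[substr($0, j, 1)] - 33      # Phred score Q
                ee += exp(-q / 10 * log(10))        # 10^(-Q/10)
            }
            printf "%s\t%d\t%.4f\n", id, n, ee
        }
    ' "$1"
}
```

As a sanity check, a 4-base read with every base at Q20 (quality character `5`) has an error probability of 0.01 per base, giving an EE of 0.04.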

***

## Merge pairs

The code below merges each pair of raw fastq files in `raw_data` and writes the merged files to `2.merged_reads`; the percentage of pairs merged is displayed in the terminal. To test the minimum merge overlap (default = 50 bp) on a subset first, point `raw_data` at a directory containing a subset of the raw fastq files (e.g. `raw_data_subset`) and `merged_reads` at a test directory (e.g. `merging_test`). This script works in usearch9.2 or usearch10.

In [None]:
#!/bin/bash

raw_data="raw_data"
merged_reads="2.merged_reads"
usearch="usearch10"

mkdir -p "$merged_reads"

# Step 1: merge paired reads with -fastq_mergepairs

for file1 in "$raw_data"/*R1_001.fastq
    do
        # Derive the reverse-read file name from the forward-read file name
        file_rev="$raw_data/$(basename "$file1" R1_001.fastq)R2_001.fastq"

        echo ""
        echo "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
        echo "Merging paired reads"
        echo "forward reads are:"
        echo "$(basename "$file1")"
        echo "reverse reads are:"
        echo "$(basename "$file_rev")"

        # One report per sample so that later samples do not overwrite earlier reports
        "$usearch" -fastq_mergepairs "$file1" -reverse "$file_rev" -fastqout "$merged_reads/$(basename "$file1")" -fastq_minovlen 50 -report "$merged_reads/$(basename "$file1" .fastq)_report.txt"
done

# Step 2: Remove "_L001_R1_001" from filenames

for file2 in "$merged_reads"/*.fastq
    do

        rename="$(basename "$file2" _L001_R1_001.fastq).fastq"

        mv "$file2" "$merged_reads/$rename"
done

_**Note**_

There are a number of other options that you may find useful at this step, listed on the USEARCH documentation page [here](https://www.drive5.com/usearch/manual/merge_options.html).
After trials with tick NGS data, I have found that in most cases altering these parameters makes little to no difference to the outcome, so they are omitted from this pipeline. Again, it is important to remember the purpose of this step: it is simply to merge the forward and reverse sequences, _not_ to act as a quality control step.
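
Because the merge step matches forward and reverse files purely by file name, a quick pre-flight check can catch a missing R2 file before USEARCH is run. This is a minimal sketch under the naming assumed in the merge script (`*R1_001.fastq` paired with `*R2_001.fastq`); the function name `check_pairs` is hypothetical.

```shell
#!/bin/bash
# Hypothetical pre-flight check: confirm every R1 file has an R2 mate.
# Assumes Illumina-style names ending in R1_001.fastq / R2_001.fastq.
# Prints one line per missing mate and returns non-zero if any are missing.
check_pairs() {
    local dir="$1"
    local missing=0

    for r1 in "$dir"/*R1_001.fastq; do
        [ -e "$r1" ] || continue
        # Strip the R1 suffix and append the R2 suffix
        r2="${r1%R1_001.fastq}R2_001.fastq"
        if [ ! -e "$r2" ]; then
            echo "Missing reverse file for $(basename "$r1")"
            missing=$((missing + 1))
        fi
    done

    [ "$missing" -eq 0 ]
}
```

Running `check_pairs raw_data` before merging prints nothing and returns success when every forward file has a matching reverse file.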

***

## Quality control and removing dimer sequences

In [None]:
#!/bin/bash

# This script works with usearch9.2 and usearch10
# fastq_filter applies a quality filter based on the maximum expected error (EE) rate;
# -fastq_minlen discards sequences shorter than 150 bp (e.g. dimers)

merged_reads="2.merged_reads"
QF_reads="3.quality_filtered"
usearch="usearch10"

mkdir -p "$QF_reads"

for file3 in "$merged_reads"/*.fastq
    do
        echo ""
        echo "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
        echo "Quality control and removing dimer seqs"
        echo "input is:"
        echo "$file3"

        "$usearch" -fastq_filter "$file3" -fastaout "$QF_reads/$(basename "$file3" .fastq).fasta" -fastq_maxee_rate 0.01 -fastq_minlen 150
done
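
After filtering, it can be helpful to check how many merged reads survived, since an unexpectedly low percentage suggests the thresholds are too strict for the data. The sketch below compares the number of records in a merged `fastq` file with the number in the corresponding filtered `fasta` file. The function name `filter_retention` is hypothetical, and it assumes unwrapped 4-line FASTQ records.

```shell
#!/bin/bash
# Sketch: report how many merged reads survived quality filtering.
# Assumes unwrapped 4-line FASTQ records; the function name is hypothetical.
# Output columns (tab-separated): reads in, reads out, percent retained
filter_retention() {
    local fastq="$1"
    local fasta="$2"
    local n_in n_out

    n_in=$(( $(wc -l < "$fastq") / 4 ))            # 4 lines per FASTQ record
    n_out=$(grep -c '^>' "$fasta" || true)         # one header line per FASTA record

    awk -v i="$n_in" -v o="$n_out" 'BEGIN {
        pct = (i > 0) ? 100 * o / i : 0
        printf "%d\t%d\t%.1f\n", i, o, pct
    }'
}
```

For a hypothetical sample named `sample1`, the call would be `filter_retention 2.merged_reads/sample1.fastq 3.quality_filtered/sample1.fasta`.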


## Trim primers

This step uses the usearch8 [search_pcr command](https://www.drive5.com/usearch/manual8.1/cmd_search_pcr.html).