Amplicon next generation sequence analysis

Set of pipelines and scripts for analysis of sequence data using USEARCH, a program developed by Robert Edgar.

Download USEARCH Available here. Remeber if you have a large dataset you will need the 64 bit version.

Documentation and tutorials

I maintain a webpage on NGS sequence analysis (currently focused on amplicon illumia MiSeq data) that contains a variety of pipelines, code, examples and lots of links.

Notes on using these scripts

It is import that your sequence files are unzipped (i.e. in *.fastq format). If your sequences are zipped (i.e. in *.fastq.gz format) you will need to unzip them, you can do so using unix command

gunzip *.fastq.gz

Each script in the repo here is numbered 1-7 and written so that it can be run as a 'stand alone' piece however will require you have installed appropriate version USEARCH and adjust directories and file names as needed.

If you are new or starting out this step by step process will be easier to follow and importantly you will be able to decide which parts work for your data and which parameters might need optimising.

Once you are happy with the script you can run the complete pipeline in full by using (usearchv11_analysis.sh](https://github.com/siobhon-egan/usearch_analysis/blob/master/usearch11_analysis.sh).

To use the usearchv11_analysis.sh script make edits in the first part of the script by opening in a text editor.

You can rename directories as you like. The most important one is the raw_data directory, this MUST match the name of the folder where you raw data is. It assumes your raw data is in the following format

sample_id_SXXX_L001_R1_001.fastq (read 1, forward)
sample_id_SXXX_L001_R2_001.fastq (read 2, reverse)

Before use:

$chmod 775 this_script.sh

To run:

$./usearch_analysis.sh

Important variables to edit

Enter max diff for merging - default 5 but should increase for paired end

maxdiffs="15"

Enter minimum merge overlap - default is 16 bp

overlap="50"

Enter forward sequences (5'-3'). Wildcard letters indicating degenerate positions in the primer are supported. See IUPAC codes for details

fwd_primer="AGAGTTTGATCCTGGCTYAG" #primer name and reference
rev_primer="TGCTGCCTCCCGTAGGAGT" #primer name and reference

Enter max error rate. Natural choice is 1, however may want to decrease to 0.5 or 0.25 for more stringent filtering

max_ee="1"

Enter min length of sequence for trimming in bp (eg. to keep all seqs above 200 bp enter "200")

minlen="150"

Enter max size to discard i.e. to discard singletons = 1, duplicates = 2

maxsize="1"

Alternative simple script

I have created a simplified version of the usearch11 script here

This simplified pipeline does not search for any primers/distal sequences and does not differentiate sequences based on expected length. I Recommend this script if working with unknown amplicon illumina NGS data i.e. you are not sure what primers were used and/or the target length of amplicons.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.gitignore		.gitignore
1.fastx_info.sh		1.fastx_info.sh
2.fastq_mergepairs.sh		2.fastq_mergepairs.sh
3.search_pcr2.sh		3.search_pcr2.sh
4.fastq_filter.sh		4.fastq_filter.sh
5.renaming_fasta_sequences.sh		5.renaming_fasta_sequences.sh
6.fastx_uniques_and_search_exact.sh		6.fastx_uniques_and_search_exact.sh
7a.uparse_otus.sh		7a.uparse_otus.sh
7b.unoise_zotus.sh		7b.unoise_zotus.sh
README.md		README.md
usearch11_complete_analysis.sh		usearch11_complete_analysis.sh
usearchv11_simple.sh		usearchv11_simple.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

1.fastx_info.sh

1.fastx_info.sh

2.fastq_mergepairs.sh

2.fastq_mergepairs.sh

3.search_pcr2.sh

3.search_pcr2.sh

4.fastq_filter.sh

4.fastq_filter.sh

5.renaming_fasta_sequences.sh

5.renaming_fasta_sequences.sh

6.fastx_uniques_and_search_exact.sh

6.fastx_uniques_and_search_exact.sh

7a.uparse_otus.sh

7a.uparse_otus.sh

7b.unoise_zotus.sh

7b.unoise_zotus.sh

README.md

README.md

usearch11_complete_analysis.sh

usearch11_complete_analysis.sh

usearchv11_simple.sh

usearchv11_simple.sh

Repository files navigation

Amplicon next generation sequence analysis

Documentation and tutorials

Notes on using these scripts

Alternative simple script

About

Releases

Packages

Languages

siobhon-egan/usearch_analysis

Folders and files

Latest commit

History

Repository files navigation

Amplicon next generation sequence analysis

Documentation and tutorials

Notes on using these scripts

Alternative simple script

About

Topics

Resources

Stars

Watchers

Forks

Languages