## Overview

The 16S sequences were provided to me by Mr. DNA via a Dropbox download link. They are **demultiplexed** (aka **demuxed**) sequences that still contain the forward and reverse primers.

-   The Raw Data is **demultiplexed**

-   An R1 and an R2 fastq.gz file have been generated for each individual sample

-   All forward reads are binned into the R1 fastq.gz files

-   All reverse reads are binned into the R2 fastq.gz files

-   Other than demultiplexing, you can consider the Raw Data on BaseSpace as untouched (**the forward and reverse primer sequences have not been removed**)

Here I follow the QIIME 2 [Casava 1.8 paired-end demultiplexed fastq](https://docs.qiime2.org/2023.5/tutorials/importing/#:~:text=Casava%201.8%20paired%2Dend%20demultiplexed%20fastq) tutorial example on importing data, using the files provided to me by Mr. DNA (Molecular Research) via Dropbox.

## Data download

I got an email from Mr. DNA with a Dropbox link to the data files, where I downloaded two .zip folders; one had raw data files and the other had analysis pipeline files that Mr. DNA generated.

Here I am working with the raw data files located in `coral-pae-temp/analysis/microbiome/rawdata/demux`

The `demux` folder contains an R1 and an R2 `fastq.gz` file for each sample.

The file name includes the sample identifier and should look like `4.Ea_S1_L001_R1_001.fastq.gz`. 
The underscore-separated fields in this file name are:

1.  the sample identifier,

2.  the barcode sequence or a barcode identifier,

3.  the lane number,

4.  the direction of the read (i.e. R1 or R2, because these are paired-end reads), and

5.  the set number.
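The field layout above can be sketched as a small parser. This is only an illustration: the regular expression and the function name `parse_casava_name` are my own assumptions, not something QIIME 2 exposes, and the field names simply mirror the list above.

```python
import re

# Assumed pattern for Casava 1.8-style names:
# <sample>_<barcode>_L<lane>_R<direction>_<set>.fastq.gz
CASAVA_PATTERN = re.compile(
    r"^(?P<sample_id>.+)_(?P<barcode_id>[^_]+)_L(?P<lane>\d+)"
    r"_R(?P<direction>[12])_(?P<set_number>\d+)\.fastq\.gz$"
)


def parse_casava_name(filename):
    """Split a Casava-style file name into its underscore-separated fields."""
    match = CASAVA_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a Casava-style file name: {filename}")
    return match.groupdict()


fields = parse_casava_name("4.Ea_S1_L001_R1_001.fastq.gz")
for key, value in fields.items():
    print(f"{key}: {value}")
```

For the example file name, the sample identifier comes out as `4.Ea` and the read direction as `1` (forward).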


Make an output directory 

In [None]:
!mkdir -p ../output

Make a table of the metadata 

In [None]:
!qiime metadata tabulate \
  --m-input-file ../rawdata/sample-metadata.tsv \
  --o-visualization ../output/metadata.qzv

In [4]:
from qiime2 import Visualization
Visualization.load('../output/metadata.qzv')

Import sequences into QIIME 2

In [None]:
!qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path ../rawdata/demux \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path ../output/demux-paired-end.qza

The `demux-paired-end.qza` artifact contains raw, demultiplexed sequences that still have the forward and reverse primers.
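Since a paired-end import relies on every sample having both a forward and a reverse file, a quick check of the input folder can catch a missing mate. The sketch below is a hypothetical helper (`find_unpaired` is my own name, not part of QIIME 2), demonstrated on a throwaway directory with made-up, empty files.

```python
from pathlib import Path
import tempfile


def find_unpaired(demux_dir):
    """Return file-name prefixes that are missing an R1 or an R2 file."""
    seen = {}
    for path in Path(demux_dir).glob("*.fastq.gz"):
        # "4.Ea_S1_L001_R1_001.fastq.gz" -> ["4.Ea_S1_L001", "R1", "001.fastq.gz"]
        parts = path.name.rsplit("_", 2)
        if len(parts) != 3:
            continue  # not a Casava-style name; ignored in this sketch
        prefix, direction, _ = parts
        seen.setdefault(prefix, set()).add(direction)
    return sorted(p for p, dirs in seen.items() if dirs != {"R1", "R2"})


# Demonstration with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "4.Ea_S1_L001_R1_001.fastq.gz",
        "4.Ea_S1_L001_R2_001.fastq.gz",
        "demo_S2_L001_R1_001.fastq.gz",  # hypothetical sample with no R2 mate
    ):
        (Path(tmp) / name).touch()
    unpaired = find_unpaired(tmp)

print(unpaired)
```

On the real data the call would be `find_unpaired("../rawdata/demux")`, and an empty list would mean every sample has both reads.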

## Trim primers from paired-end sequences using `cutadapt`

> "The PCR primers (F515/R806) were developed against the V4 region of the 16S rRNA, which we determined would yield optimal community clustering with reads of this length using a procedure similar to that of ref. 15. [For reference, this primer pair amplifies the region 533–786 in the Escherichia coli strain 83972 sequence (greengenes accession no. prokMSA_id:470367).] The reverse PCR primer is barcoded with a 12-base error-correcting Golay code to facilitate multiplexing of up to ≈1,500 samples per lane, and both PCR primers contain sequencer adapter regions." - (Caporaso et al. 2011)

Caporaso, J. G., Lauber, C. L., Walters, W. A., Berg-Lyons, D., Lozupone, C. A., Turnbaugh, P. J., Fierer, N., & Knight, R. (2011). Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proceedings of the National Academy of Sciences, 108(supplement_1), 4516–4522. https://doi.org/10.1073/pnas.1000080107

> "The V4 variable region of the 16S rRNA gene was amplified using the 515F (5′-GTGCCAGCMGCCGCGGTAA-3′) and 806R (5′-GGACTACHVGGGTWTCTAAT-3′) primer set (Caporaso et al. 2011)." - (Brown et al. 2021)

Brown, Tanya, Dylan Sonett, Jesse R. Zaneveld, and Jacqueline L. Padilla-Gamiño. 2021. “Characterization of the Microbiome and Immune Response in Corals with Chronic Montipora White Syndrome.” Molecular Ecology 30 (11): 2591–2606. https://doi.org/10.1111/mec.15899.
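The primer sequences passed to `cutadapt` in the next cell differ from the quoted ones by one degenerate base each (`Y` in place of `C` in the forward primer, `N` in place of `H` in the reverse primer). A minimal sketch, using the standard IUPAC nucleotide codes, checks that each degenerate primer still covers the quoted sequence. The helper name `covers` is my own.

```python
# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def covers(degenerate, specific):
    """True if every base allowed by `specific` is also allowed by `degenerate`."""
    if len(degenerate) != len(specific):
        return False
    return all(
        set(IUPAC[s]) <= set(IUPAC[d]) for d, s in zip(degenerate, specific)
    )


forward_ok = covers("GTGYCAGCMGCCGCGGTAA", "GTGCCAGCMGCCGCGGTAA")
reverse_ok = covers("GGACTACNVGGGTWTCTAAT", "GGACTACHVGGGTWTCTAAT")
print(forward_ok, reverse_ok)
```

Both checks pass, so the sequences used for trimming match everything the quoted primers match, plus a few additional variants.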

In [6]:
!qiime cutadapt trim-paired \
  --i-demultiplexed-sequences ../output/demux-paired-end.qza \
  --p-cores 4 \
  --p-front-f GTGYCAGCMGCCGCGGTAA \
  --p-front-r GGACTACNVGGGTWTCTAAT \
  --o-trimmed-sequences ../output/demux-trimmed.qza

Saved SampleData[PairedEndSequencesWithQuality] to: ../output/demux-trimmed.qza

## Visualize trimmed & demultiplexed sequences

In [7]:
!qiime demux summarize \
  --i-data ../output/demux-trimmed.qza \
  --o-visualization ../output/demux-trimmed-summary.qzv

Saved Visualization to: ../output/demux-trimmed-summary.qzv

In [8]:
Visualization.load('../output/demux-trimmed-summary.qzv')

## [Denoise with DADA2](https://docs.qiime2.org/2023.5/tutorials/moving-pictures/#sequence-quality-control-and-feature-table-construction:~:text=with%20QIIME%201.-,Option%201%3A%20DADA2%C2%B6,-DADA2%20is%20a)

[DADA2](https://pubmed.ncbi.nlm.nih.gov/27214047/) is a pipeline for detecting and correcting (where possible) errors in Illumina amplicon sequence data.

As implemented in the q2-dada2 plugin, this quality control process will additionally filter any phiX reads (commonly present in marker gene Illumina sequence data) that are identified in the sequencing data, and will filter chimeric sequences.

The `dada2 denoise-paired` method requires four parameters that are used in quality filtering:

-   `--p-trim-left-f m`, which trims off the first `m` bases of each sequence in the forward reads

-   `--p-trim-left-r n`, which trims off the first `n` bases of each sequence in the reverse reads

-   `--p-trunc-len-f o`, which truncates each sequence at position `o` in the forward reads

-   `--p-trunc-len-r p`, which truncates each sequence at position `p` in the reverse reads

This allows the user to remove low quality regions of the sequences. 
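To make the effect of trimming and truncation concrete, here is a rough sketch on a plain string. It is not the plugin's code; it only mimics the behavior as I understand it from the DADA2 documentation, where reads shorter than the truncation length are discarded and a truncation length of 0 means no truncation. The function name is hypothetical.

```python
def trim_and_truncate(read, trim_left, trunc_len):
    """Mimic trim-left / trunc-len on a single read (illustration only)."""
    if trunc_len > 0:
        if len(read) < trunc_len:
            return None  # read is too short and would be discarded
        read = read[:trunc_len]  # keep bases up to the truncation position
    return read[trim_left:]  # drop the first `trim_left` bases


toy_read = "ACGTACGTAC"  # made-up 10-base read
kept = trim_and_truncate(toy_read, trim_left=2, trunc_len=8)
dropped = trim_and_truncate(toy_read, trim_left=0, trunc_len=12)
print(kept, dropped)
```

With these settings the kept read has length `trunc_len - trim_left`, and the second call returns `None` because the toy read is shorter than the requested truncation length.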

What is a 'good' quality score?

In QIIME 2's interactive quality plots, the quality scores typically range from 0 to 40. Quality scores reflect the accuracy of base calls in sequencing data, with higher scores indicating higher accuracy. The most common quality score scale used in modern sequencing technologies is the Phred scale.

In the Phred scale:

-   A quality score of 10 corresponds to a 1 in 10 chance of an incorrect base call (90% accuracy).

-   A quality score of 20 corresponds to a 1 in 100 chance of an incorrect base call (99% accuracy).

-   A quality score of 30 corresponds to a 1 in 1000 chance of an incorrect base call (99.9% accuracy).

A "good" quality score in this context depends on your specific analysis goals and the sequencing platform you're using. However, many researchers consider quality scores above 20 to be generally acceptable for downstream analysis. Scores above 30 are often seen as very high quality.

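The Phred values listed above follow from the relationship Q = -10 · log10(P), where P is the probability of an incorrect base call. A short sketch converts a few scores, including the 11 and 37 that come up in the quality plots for this dataset (the function name is my own):

```python
def phred_to_error_probability(q):
    """Probability of an incorrect base call for Phred score q."""
    return 10 ** (-q / 10)


for q in (10, 11, 20, 30, 37):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```

By this formula a score of 11 corresponds to roughly a 1 in 13 chance of an incorrect call.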
In the Interactive Quality Plot tab of the `demux-trimmed-summary.qzv` file that was generated by `qiime demux summarize`, we see that the quality scores of the bases are high, between a score of 11 in the lowest 2nd percentile and a score of 37 in the bottom 25th percentile and higher. So we won't trim any bases from the beginning of the sequences.

I somewhat arbitrarily picked 230 as the base position at which to truncate both the forward and reverse reads. I'm not sure this is even necessary, since a low score of 11 doesn't seem so bad, but I'm going to keep it in here until I can double-check with someone who knows better.

This next command may take up to 10 minutes to run, and is the slowest step.

In [None]:
!qiime dada2 denoise-paired \
  --i-demultiplexed-seqs ../output/demux-trimmed.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 230 \
  --p-trunc-len-r 230 \
  --p-n-threads 10 \
  --o-table ../output/dada2-table.qza \
  --o-representative-sequences ../output/dada2-rep-seqs.qza \
  --o-denoising-stats ../output/dada2-denoising-stats.qza
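One way to sanity-check the truncation lengths is to estimate how much the forward and reverse reads still overlap after truncation, since paired reads have to overlap to be merged. The sketch below is back-of-the-envelope arithmetic under loud assumptions: the amplicon length is taken from the 533–786 coordinates in the Caporaso quote above (real lengths vary by taxon, and the coordinates may or may not include the primers), and the minimum of 12 bases is DADA2's documented default for merging as I understand it.

```python
def expected_overlap(trunc_len_f, trunc_len_r, amplicon_len):
    """Bases shared by the forward and reverse reads after truncation."""
    return trunc_len_f + trunc_len_r - amplicon_len


# Assumption: amplicon spans positions 533-786 inclusive, per the quoted paper.
amplicon_len = 786 - 533 + 1
overlap = expected_overlap(230, 230, amplicon_len)

MIN_OVERLAP = 12  # assumed default minimum overlap for merging
print(f"amplicon length: {amplicon_len}")
print(f"expected overlap: {overlap}")
print(f"enough to merge: {overlap >= MIN_OVERLAP}")
```

Under these assumptions a truncation length of 230 leaves a large overlap, so shorter truncation lengths would also merge if the read tails turned out to be low quality.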

## Summarize & tabulate the feature table
After the quality filtering step completes, you’ll want to explore the resulting data. You can do this using the following two commands, which will create visual summaries of the data. The `feature-table summarize` command will give you information on how many sequences are associated with each sample and with each feature, histograms of those distributions, and some related summary statistics. The `feature-table tabulate-seqs` command will provide a mapping of feature IDs to sequences, and provide links to easily BLAST each sequence against the NCBI nt database.

The following step generates a summary visualization of a feature table. Feature tables in QIIME 2 represent the abundance of different biological features (such as ASVs, OTUs, or bacterial taxa) across samples. In this command:

-   `--i-table` specifies the input feature table in QIIME 2 artifact format (.qza file) that you want to summarize.

-   `--o-visualization` specifies the output visualization in QIIME 2 visualization format (.qzv file) that will contain the summary results.

-   `--m-sample-metadata-file` specifies the metadata file (usually in tab-separated values format) that contains additional information about the samples in your feature table. The metadata will be used to generate additional summary plots that allow you to explore the relationships between features and metadata.

In [None]:
!qiime feature-table summarize \
  --i-table ../output/dada2-table.qza \
  --o-visualization ../output/dada2-table.qzv \
  --m-sample-metadata-file ../rawdata/sample-metadata.tsv
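For intuition about what the summary reports, the per-sample and per-feature totals can be sketched on a toy table. The feature names, sample names, and counts below are made up for illustration and have nothing to do with the real `dada2-table.qza`.

```python
# Hypothetical feature table: feature -> {sample -> sequence count}.
toy_table = {
    "feature_a": {"sample_1": 10, "sample_2": 0},
    "feature_b": {"sample_1": 5, "sample_2": 7},
}


def per_feature_totals(table):
    """Total sequence count for each feature across all samples."""
    return {feature: sum(counts.values()) for feature, counts in table.items()}


def per_sample_totals(table):
    """Total sequence count for each sample across all features."""
    totals = {}
    for counts in table.values():
        for sample, count in counts.items():
            totals[sample] = totals.get(sample, 0) + count
    return totals


feature_totals = per_feature_totals(toy_table)
sample_totals = per_sample_totals(toy_table)
print(feature_totals)
print(sample_totals)
```

The real visualization adds histograms and summary statistics on top of totals like these.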