## Overview

The 16S sequences were provided to me by Mr. DNA via a Dropbox download link. They are **demultiplexed** (aka **demuxed**) sequences that still contain the forward and reverse primers.

-   The Raw Data is **demultiplexed**

-   An R1 and an R2 fastq.gz file has been generated for each individual sample

-   All forward reads are binned into the R1 fastq.gz files

-   All reverse reads are binned into the R2 fastq.gz files

-   Other than demultiplexing, you can consider the Raw Data on BaseSpace as untouched (**the forward and reverse primer sequences have not been removed**)

Here I follow the QIIME 2 [Casava 1.8 paired-end demultiplexed fastq](https://docs.qiime2.org/2023.5/tutorials/importing/#:~:text=Casava%201.8%20paired%2Dend%20demultiplexed%20fastq) tutorial example for importing data, using the files provided to me by Mr. DNA (Molecular Research) via Dropbox.

## Data download

I got an email from Mr. DNA with a Dropbox link to the data files, from which I downloaded two .zip archives: one had the raw data files and the other had analysis pipeline files that Mr. DNA generated.

Here I am working with the raw data files located in `coral-pae-temp/analysis/microbiome/rawdata/demux`

The `demux` folder contains an R1 and an R2 `fastq.gz` file for each sample.

The file name includes the sample identifier and should look like `4.Ea_S1_L001_R1_001.fastq.gz`. 
The underscore-separated fields in this file name are:

1.  the sample identifier,

2.  the barcode sequence or a barcode identifier,

3.  the lane number,

4.  the direction of the read (i.e. R1 or R2, because these are paired-end reads), and

5.  the set number.
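
As a quick sanity check before importing, the five underscore-separated fields can be pulled out of each file name with a regular expression. This is only a sketch: the pattern, the helper names `parse_casava_name` and `unpaired_samples`, and the second sample name `5.Eb` are my own assumptions for illustration and are not part of QIIME 2.

```python
import re
from collections import defaultdict

# Pattern for names such as 4.Ea_S1_L001_R1_001.fastq.gz
# (sample, barcode identifier, lane, read direction, set number).
CASAVA_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[^_]+)_L(?P<lane>\d+)"
    r"_R(?P<read>[12])_(?P<set>\d+)\.fastq\.gz$"
)


def parse_casava_name(filename):
    """Split a Casava-style file name into its five fields."""
    match = CASAVA_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return match.groupdict()


def unpaired_samples(filenames):
    """Return sample identifiers that lack either an R1 or an R2 file."""
    reads = defaultdict(set)
    for name in filenames:
        fields = parse_casava_name(name)
        reads[fields["sample"]].add(fields["read"])
    return sorted(s for s, seen in reads.items() if seen != {"1", "2"})


print(parse_casava_name("4.Ea_S1_L001_R1_001.fastq.gz"))
```

Running `unpaired_samples` over the names in the `demux` folder should return an empty list if every sample has both reads.
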


In [1]:
from qiime2 import Visualization
from qiime2 import Artifact

Make an output directory 

In [2]:
!mkdir -p ../output


Make a table of the metadata 

In [3]:
!qiime metadata tabulate \
  --m-input-file ../rawdata/sample-metadata-verbose.tsv \
  --o-visualization ../output/metadata-verbose.qzv

Saved Visualization to: ../output/metadata-verbose.qzv

In [4]:
Visualization.load('../output/metadata-verbose.qzv')

Import sequences into QIIME 2

In [5]:
!qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path ../rawdata/demux \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path ../output/demux-paired-end.qza

Imported ../rawdata/demux as CasavaOneEightSingleLanePerSampleDirFmt to ../output/demux-paired-end.qza

The `demux-paired-end.qza` artifact contains raw, demultiplexed sequences that still have the forward and reverse primers.

## Trim primers from paired-end sequences using `cutadapt`

> "The PCR primers (F515/R806) were developed against the V4 region of the 16S rRNA, which we determined would yield optimal community clustering with reads of this length using a procedure similar to that of ref. 15. [For reference, this primer pair amplifies the region 533–786 in the Escherichia coli strain 83972 sequence (greengenes accession no. prokMSA_id:470367).] The reverse PCR primer is barcoded with a 12-base error-correcting Golay code to facilitate multiplexing of up to ≈1,500 samples per lane, and both PCR primers contain sequencer adapter regions." - (Caporaso et al. 2011)

Caporaso, J. G., Lauber, C. L., Walters, W. A., Berg-Lyons, D., Lozupone, C. A., Turnbaugh, P. J., Fierer, N., & Knight, R. (2011). Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proceedings of the National Academy of Sciences, 108(supplement_1), 4516–4522. https://doi.org/10.1073/pnas.1000080107

> "The V4 variable region of the 16S rRNA gene was amplified using the 515F (5′-GTGCCAGCMGCCGCGGTAA-3′) and 806R (5′-GGACTACHVGGGTWTCTAAT-3′) primer set (Caporaso et al. 2011)." - (Brown et al. 2021)

Brown, Tanya, Dylan Sonett, Jesse R. Zaneveld, and Jacqueline L. Padilla-Gamiño. 2021. “Characterization of the Microbiome and Immune Response in Corals with Chronic Montipora White Syndrome.” Molecular Ecology 30 (11): 2591–2606. https://doi.org/10.1111/mec.15899.
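
The primer sequences passed to `cutadapt` in this section each differ from the quoted 515F/806R sequences at one degenerate position. The sketch below expands the IUPAC ambiguity codes to check that the sequences I use are more permissive versions of the quoted ones. The helper names `covers` and `differing_positions` are my own, and this is only a comparison of the strings, not a statement about which primers were used at the bench.

```python
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def differing_positions(a, b):
    """List (1-based position, code in a, code in b) where two primers differ."""
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def covers(broad, narrow):
    """True if every sequence matched by `narrow` is also matched by `broad`."""
    if len(broad) != len(narrow):
        return False
    return all(set(IUPAC[n]) <= set(IUPAC[b]) for b, n in zip(broad, narrow))


quoted = {"515F": "GTGCCAGCMGCCGCGGTAA", "806R": "GGACTACHVGGGTWTCTAAT"}
used = {"515F": "GTGYCAGCMGCCGCGGTAA", "806R": "GGACTACNVGGGTWTCTAAT"}

for name in quoted:
    print(name,
          differing_positions(quoted[name], used[name]),
          covers(used[name], quoted[name]))
```

Because `Y` stands for C or T and `N` for any base, the sequences given to `cutadapt` should still match reads that begin with the quoted primers.
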

In [6]:
!qiime cutadapt trim-paired \
  --i-demultiplexed-sequences ../output/demux-paired-end.qza \
  --p-cores 4 \
  --p-front-f GTGYCAGCMGCCGCGGTAA \
  --p-front-r GGACTACNVGGGTWTCTAAT \
  --o-trimmed-sequences ../output/demux-trimmed.qza

Saved SampleData[PairedEndSequencesWithQuality] to: ../output/demux-trimmed.qza

## Visualize trimmed & demultiplexed sequences

In [7]:
!qiime demux summarize \
  --i-data ../output/demux-trimmed.qza \
  --o-visualization ../output/demux-trimmed-summary.qzv

Saved Visualization to: ../output/demux-trimmed-summary.qzv

In [8]:
Visualization.load('../output/demux-trimmed-summary.qzv')

## [Denoise with DADA2](https://docs.qiime2.org/2023.5/tutorials/moving-pictures/#sequence-quality-control-and-feature-table-construction:~:text=with%20QIIME%201.-,Option%201%3A%20DADA2%C2%B6,-DADA2%20is%20a)

[DADA2](https://pubmed.ncbi.nlm.nih.gov/27214047/) is a pipeline for detecting and correcting (where possible) errors in Illumina amplicon sequence data.

As implemented in the q2-dada2 plugin, this quality control process will additionally filter any phiX reads (commonly present in marker gene Illumina sequence data) that are identified in the sequencing data, and will filter chimeric sequences.

The `dada2 denoise-paired` method requires four parameters that are used in quality filtering:

-   `--p-trim-left-f m`, which trims off the first `m` bases of each sequence in the forward reads

-   `--p-trim-left-r n`, which trims off the first `n` bases of each sequence in the reverse reads

-   `--p-trunc-len-f o`, which truncates each sequence at position `o` in the forward reads

-   `--p-trunc-len-r p`, which truncates each sequence at position `p` in the reverse reads

This allows the user to remove low quality regions of the sequences. 
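
To make the effect of these parameters concrete, here is a minimal sketch of trimming and truncating a single read. The function `trim_and_truncate` is my own illustration, not DADA2 code. It assumes the behaviour I understand from the plugin help: a truncation length of 0 turns truncation off, and reads shorter than the truncation length are discarded.

```python
def trim_and_truncate(read, trim_left=0, trunc_len=0):
    """Return the processed read, or None if the read would be discarded.

    trunc_len == 0 means "do not truncate" (assumed to mirror the plugin).
    """
    if trunc_len > 0:
        if len(read) < trunc_len:
            return None  # too short to reach the truncation position
        read = read[:trunc_len]  # cut everything after position trunc_len
    return read[trim_left:]  # drop the first trim_left bases


print(trim_and_truncate("ACGTACGTAC", trim_left=2, trunc_len=8))
print(trim_and_truncate("ACGT", trim_left=0, trunc_len=8))
print(trim_and_truncate("ACGT"))
```

With both parameters set, the surviving reads all have length `trunc_len - trim_left`.
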

What is a 'good' quality score?

In QIIME 2's interactive quality plots, the quality scores typically range from 0 to 40. Quality scores reflect the accuracy of base calls in sequencing data, with higher scores indicating higher accuracy. The most common quality score scale used in modern sequencing technologies is the Phred scale.

In the Phred scale:

-   A quality score of 10 corresponds to a 1 in 10 chance of an incorrect base call (90% accuracy).

-   A quality score of 20 corresponds to a 1 in 100 chance of an incorrect base call (99% accuracy).

-   A quality score of 30 corresponds to a 1 in 1000 chance of an incorrect base call (99.9% accuracy).

A "good" quality score in this context depends on your specific analysis goals and the sequencing platform you're using. However, many researchers consider quality scores above 20 to be generally acceptable for downstream analysis. Scores above 30 are often seen as very high quality.
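
The Phred relationship is P = 10^(-Q/10), where Q is the quality score and P is the probability of an incorrect base call. A small sketch (the function name is my own) that converts the scores discussed in this section:

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


for q in (10, 11, 20, 30, 37):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```

By this formula, a score of 11 corresponds to roughly a 1 in 13 chance of an incorrect call.
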

In the Interactive Quality Plot tab of the `demux-trimmed-summary.qzv` file that was generated by `qiime demux summarize`, we see that the quality scores of the bases are high, between a score of 11 at the 2nd percentile and a score of 37 at the 25th percentile and higher. So we won’t trim any bases from the beginning of the sequences. I sort of arbitrarily picked 230 as the base position at which to truncate both the forward and reverse reads. I'm not sure this is even necessary, since a low score of 11 doesn't seem so bad, but I'm going to keep it in here until I can double check this with someone who knows better. Note that the command below is run with both `--p-trunc-len` values set to 0, which disables truncation.

This next command may take up to 10 minutes to run, and is the slowest step.

In [9]:
!qiime dada2 denoise-paired \
  --i-demultiplexed-seqs ../output/demux-trimmed.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 0 \
  --p-n-threads 20 \
  --o-table ../output/dada2-table.qza \
  --o-representative-sequences ../output/dada2-rep-seqs.qza \
  --o-denoising-stats ../output/dada2-denoising-stats.qza

Saved FeatureTable[Frequency] to: ../output/dada2-table.qza
Saved FeatureData[Sequence] to: ../output/dada2-rep-seqs.qza
Saved SampleData[DADA2Stats] to: ../output/dada2-denoising-stats.qza

## Summarize & tabulate the feature table
After the quality filtering step completes, you’ll want to explore the resulting data. You can do this using the following two commands, which will create visual summaries of the data. The `feature-table summarize` command will give you information on how many sequences are associated with each sample and with each feature, histograms of those distributions, and some related summary statistics. The `feature-table tabulate-seqs` command will provide a mapping of feature IDs to sequences, and provide links to easily BLAST each sequence against the NCBI nt database.

### feature-table summarize
Feature tables in QIIME 2 represent the abundance of different biological features (such as bacterial taxa or OTUs) across samples.

In this command:

-   `--i-table table.qza` specifies the input feature table in QIIME 2 artifact format (.qza file) that you want to summarize.

-   `--o-visualization table.qzv` specifies the output visualization (.qzv file) that will contain the summaries.

-   `--m-sample-metadata-file sample-metadata.tsv` specifies the metadata file (usually in tab-separated values format) that contains additional information about the samples in your feature table.
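
The two kinds of totals that the summary reports can be illustrated with a toy table. Everything in this sketch is hypothetical (the feature IDs, sample IDs, and counts); it only shows what "sequences per sample" and "sequences per feature" mean for a feature table.

```python
# Hypothetical feature table: feature ID -> {sample ID: count}.
table = {
    "feature-a": {"sample-1": 10, "sample-2": 0, "sample-3": 5},
    "feature-b": {"sample-1": 3, "sample-2": 7, "sample-3": 0},
    "feature-c": {"sample-1": 0, "sample-2": 2, "sample-3": 1},
}


def per_sample_totals(table):
    """Total count in each sample, summed over all features."""
    totals = {}
    for counts in table.values():
        for sample, count in counts.items():
            totals[sample] = totals.get(sample, 0) + count
    return totals


def per_feature_totals(table):
    """Total count of each feature, summed over all samples."""
    return {feature: sum(counts.values()) for feature, counts in table.items()}


print(per_sample_totals(table))
print(per_feature_totals(table))
```
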

In [10]:
!qiime feature-table summarize \
  --i-table ../output/dada2-table.qza \
  --o-visualization ../output/dada2-table.qzv \
  --m-sample-metadata-file ../rawdata/sample-metadata-verbose.tsv

Saved Visualization to: ../output/dada2-table.qzv

In [11]:
Visualization.load('../output/dada2-table.qzv')

### feature-table tabulate-seqs

In [12]:
!qiime feature-table tabulate-seqs \
  --i-data ../output/dada2-rep-seqs.qza \
  --o-visualization ../output/dada2-rep-seqs.qzv

Saved Visualization to: ../output/dada2-rep-seqs.qzv

In [13]:
Visualization.load('../output/dada2-rep-seqs.qzv')

In [14]:
!qiime metadata tabulate \
  --m-input-file ../output/dada2-denoising-stats.qza \
  --o-visualization ../output/denoising-stats.qzv

Saved Visualization to: ../output/denoising-stats.qzv

In [15]:
Visualization.load('../output/denoising-stats.qzv')

## [Generate a tree for phylogenetic diversity analyses](https://docs.qiime2.org/2023.5/tutorials/moving-pictures-usage/#:~:text=Generate%20a%20tree%20for%20phylogenetic%20diversity%20analyses)

From the moving pictures tutorial:
> QIIME supports several phylogenetic diversity metrics, including Faith’s Phylogenetic Diversity and weighted and unweighted UniFrac. In addition to counts of features per sample (i.e., the data in the FeatureTable[Frequency] QIIME 2 artifact), these metrics require a rooted phylogenetic tree relating the features to one another. This information will be stored in a Phylogeny[Rooted] QIIME 2 artifact. To generate a phylogenetic tree we will use align-to-tree-mafft-fasttree pipeline from the q2-phylogeny plugin. 
First, the pipeline uses the mafft program to perform a multiple sequence alignment of the sequences in our FeatureData[Sequence] to create a FeatureData[AlignedSequence] QIIME 2 artifact. Next, the pipeline masks (or filters) the alignment to remove positions that are highly variable. These positions are generally considered to add noise to a resulting phylogenetic tree. Following that, the pipeline applies FastTree to generate a phylogenetic tree from the masked alignment. The FastTree program creates an unrooted tree, so in the final step in this section midpoint rooting is applied to place the root of the tree at the midpoint of the longest tip-to-tip distance in the unrooted tree.

In [25]:
!qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences ../output/dada2-rep-seqs.qza \
  --output-dir ../output/phylogeny-tree

Saved FeatureData[AlignedSequence] to: ../output/phylogeny-tree/alignment.qza
Saved FeatureData[AlignedSequence] to: ../output/phylogeny-tree/masked_alignment.qza
Saved Phylogeny[Unrooted] to: ../output/phylogeny-tree/tree.qza
Saved Phylogeny[Rooted] to: ../output/phylogeny-tree/rooted_tree.qza
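
The masking step in the quoted description can be pictured as dropping alignment columns that are too gappy or too variable. The sketch below is my own simplified illustration of that idea, not the plugin's code. The function name and both thresholds (`min_conservation`, `max_gap_frequency`) are assumptions chosen for the toy example.

```python
from collections import Counter


def mask_alignment(aligned, min_conservation=0.5, max_gap_frequency=0.4):
    """Keep only columns that are conserved enough and not too gappy.

    A column is kept when its gap fraction is at most max_gap_frequency and
    its most common non-gap character occurs in at least min_conservation
    of the sequences. Both thresholds are illustrative assumptions.
    """
    n = len(aligned)
    keep = []
    for col in range(len(aligned[0])):
        column = [seq[col] for seq in aligned]
        gap_fraction = column.count("-") / n
        residues = Counter(c for c in column if c != "-")
        top_fraction = max(residues.values(), default=0) / n
        if gap_fraction <= max_gap_frequency and top_fraction >= min_conservation:
            keep.append(col)
    return ["".join(seq[c] for c in keep) for seq in aligned]


# Toy alignment: column 3 is highly variable, column 5 is half gaps.
alignment = ["ACGT-A", "ACCTGA", "ACTTCA", "ACAT-A"]
print(mask_alignment(alignment))
```
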

## [Alpha & Beta diversity](https://docs.qiime2.org/2023.5/tutorials/moving-pictures-usage/#:~:text=Alpha%20and%20beta%20diversity%20analysis)

QIIME 2’s diversity analyses are available through the `q2-diversity` plugin, which supports computing alpha and beta diversity metrics, applying related statistical tests, and generating interactive visualizations. We’ll first apply the `core-metrics-phylogenetic` method, which rarefies a `FeatureTable[Frequency]` to a user-specified depth, computes several alpha and beta diversity metrics, and generates principal coordinates analysis (PCoA) plots using Emperor for each of the beta diversity metrics. The metrics computed by default are:

Alpha diversity

-   Shannon’s diversity index (a quantitative measure of community richness)

-   Observed Features (a qualitative measure of community richness)

-   Faith’s Phylogenetic Diversity (a qualitative measure of community richness that incorporates phylogenetic relationships between the features)

-   Evenness (or Pielou’s Evenness; a measure of community evenness)

Beta diversity

-   Jaccard distance (a qualitative measure of community dissimilarity)

-   Bray-Curtis distance (a quantitative measure of community dissimilarity)

-   unweighted UniFrac distance (a qualitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)

-   weighted UniFrac distance (a quantitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)
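
For intuition, the non-phylogenetic metrics in these lists can be computed by hand on small count vectors. This is a sketch with hypothetical counts and my own function names; the log base 2 used for Shannon's index is an assumption about the default. Faith's PD and the UniFrac distances need the rooted tree and are left out.

```python
import math


def observed_features(counts):
    """Number of features with a non-zero count."""
    return sum(1 for c in counts if c > 0)


def shannon(counts, base=2):
    """Shannon's diversity index (log base 2 assumed here)."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base) for c in counts if c > 0)


def pielou_evenness(counts):
    """Shannon's index divided by its maximum for the observed features."""
    s = observed_features(counts)
    if s < 2:
        return float("nan")  # undefined with fewer than two features
    return shannon(counts, base=math.e) / math.log(s)


def jaccard_distance(a, b):
    """1 minus shared features over all features present in either sample."""
    present_a = {i for i, c in enumerate(a) if c > 0}
    present_b = {i for i, c in enumerate(b) if c > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


def bray_curtis(a, b):
    """Sum of absolute count differences over the total count of both samples."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


# Hypothetical counts for two samples over the same four features.
sample_a = [10, 10, 0, 0]
sample_b = [5, 5, 5, 5]
print(observed_features(sample_a), shannon(sample_a), pielou_evenness(sample_a))
print(jaccard_distance(sample_a, sample_b), bray_curtis(sample_a, sample_b))
```

Jaccard only looks at presence or absence (qualitative), while Bray-Curtis uses the counts themselves (quantitative).
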

An important parameter that needs to be provided to this script is `--p-sampling-depth`, which is the even sampling (i.e. rarefaction) depth. Because most diversity metrics are sensitive to different sampling depths across different samples, this script will randomly subsample the counts from each sample to the value provided for this parameter. For example, if you provide `--p-sampling-depth 500`, this step will subsample the counts in each sample without replacement so that each sample in the resulting table has a total count of 500. If the total count for any sample(s) is smaller than this value, those samples will be dropped from the diversity analysis. Choosing this value is tricky. We recommend making your choice by reviewing the information presented in the `dada2-table.qzv` file that was created above. Choose a value that is as high as possible (so you retain more sequences per sample) while excluding as few samples as possible.
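
Subsampling without replacement to an even depth can be sketched with the standard library. The function `rarefy`, the fixed seed, and the toy counts are my own assumptions for illustration; the real step is done inside `core-metrics-phylogenetic`.

```python
import random


def rarefy(sample_counts, depth, seed=0):
    """Subsample one sample's counts to `depth` without replacement.

    Returns None when the sample has fewer than `depth` total counts,
    which corresponds to the sample being dropped.
    """
    if sum(sample_counts.values()) < depth:
        return None
    # One entry per observed sequence, labelled by its feature.
    pool = [feature for feature, count in sample_counts.items() for _ in range(count)]
    drawn = random.Random(seed).sample(pool, depth)
    rarefied = {}
    for feature in drawn:
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied


# Hypothetical sample with 600 total counts, rarefied to a depth of 500.
sample = {"feature-a": 300, "feature-b": 200, "feature-c": 100}
print(rarefy(sample, depth=500))
print(rarefy({"feature-a": 100}, depth=500))
```
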

In [26]:
## open interactive table visualization
Visualization.load('../output/dada2-table.qzv')

## navigate to the interactive sample detail tab
# move the sampling depth slider as high as you can before excluding any samples 
# we want the sampling depth to be high, while retaining all 22 samples
# this looks like a sampling depth of 107,656 (09AUG2023, SST)

What value would you choose to pass for --p-sampling-depth? 
- **107,656**
How many samples will be excluded from your analysis based on this choice? 
- **none, all 22 samples are retained**
How many total sequences will you be analyzing in the core-metrics-phylogenetic command?
- **2,368,432** (22 samples × 107,656 sequences per sample)

This represents **40.40%** of the features present across the 22 samples.
The mock community has the fewest features at **107,656** and is our 'limiting factor' for increasing the sampling depth.
Why does the blank have so many features? That is not good...
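
The trade-off behind these answers can be written down as a small calculation: at a given depth, every sample with at least that many counts is kept and contributes exactly `depth` sequences. In this sketch only the mock community total (107,656) comes from my data; the other sample names and totals are hypothetical placeholders, and `retained_at_depth` is my own helper.

```python
def retained_at_depth(sample_totals, depth):
    """Return (samples kept, sequences retained) for a sampling depth."""
    kept = [s for s, total in sample_totals.items() if total >= depth]
    return len(kept), len(kept) * depth


# Per-sample totals: only "mock" is a real value, the rest are made up.
totals = {"mock": 107_656, "blank": 150_000, "coral-1": 300_000, "coral-2": 250_000}

# The highest depth that keeps every sample is the smallest sample total.
depth = min(totals.values())
kept, retained = retained_at_depth(totals, depth)
print(depth, kept, retained)
print(f"{100 * retained / sum(totals.values()):.2f}% of all sequences retained")
```

Raising the depth above the smallest total keeps more sequences per sample but starts dropping samples.
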

In [27]:
!qiime diversity core-metrics-phylogenetic \
  --i-phylogeny ../output/phylogeny-tree/rooted_tree.qza \
  --i-table ../output/dada2-table.qza \
  --p-sampling-depth 107656 \
  --m-metadata-file ../rawdata/sample-metadata-verbose.tsv \
  --output-dir ../output/diversity-core

Saved FeatureTable[Frequency] to: ../output/diversity-core/rarefied_table.qza
Saved SampleData[AlphaDiversity] to: ../output/diversity-core/faith_pd_vector.qza
Saved SampleData[AlphaDiversity] to: ../output/diversity-core/observed_features_vector.qza
Saved SampleData[AlphaDiversity] to: ../output/diversity-core/shannon_vector.qza
Saved SampleData[AlphaDiversity] to: ../output/diversity-core/evenness_vector.qza
Saved DistanceMatrix to: ../output/diversity-core/unweighted_unifrac_distance_matrix.qza
Saved DistanceMatrix to: ../output/diversity-core/weighted_unifrac_distance_matrix.qza
Saved DistanceMatrix to: ../output/diversity-core/jaccard_distance_matrix.qza
Saved DistanceMatrix to: ../output/diversity-core/bray_curtis_distance_matrix.qza
Saved PCoAResults to: ../output/diversity-core/unweighted_unifrac_pcoa_results.qza
Saved PCoAResults to: ../output/diversity-core/weighted_unifrac_pcoa_res

In [28]:
Visualization.load('../output/diversity-core/unweighted_unifrac_emperor.qzv')

In [29]:
Visualization.load('../output/diversity-core/weighted_unifrac_emperor.qzv')

In [21]:
Visualization.load('../output/diversity-core/jaccard_emperor.qzv')

In [30]:
Visualization.load('../output/diversity-core/bray_curtis_emperor.qzv')

After computing diversity metrics, we can begin to explore the microbial composition of the samples in the context of the sample metadata. This information is present in the sample metadata file (`../rawdata/sample-metadata-verbose.tsv`) that was tabulated earlier.

We’ll first test for associations between categorical metadata columns and alpha diversity data. We’ll do that here for the Faith Phylogenetic Diversity (a measure of community richness) and evenness metrics.
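
As I understand it, `alpha-group-significance` compares the groups in each categorical column with a Kruskal-Wallis test. The sketch below computes the Kruskal-Wallis H statistic by hand for two groups, assuming no tied values. The function name and the Faith's PD values are hypothetical, and the p-value (from a chi-square distribution) is left out.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of groups (no ties assumed)."""
    pooled = sorted(value for group in groups for value in group)
    rank = {value: i + 1 for i, value in enumerate(pooled)}  # rank 1 = smallest
    n = len(pooled)
    weighted = 0.0
    for group in groups:
        rank_sum = sum(rank[value] for value in group)
        weighted += rank_sum ** 2 / len(group)
    return 12 / (n * (n + 1)) * weighted - 3 * (n + 1)


# Hypothetical Faith's PD values for two treatment groups.
ambient = [8.1, 9.4, 10.2]
heated = [11.5, 12.3, 13.0]
print(kruskal_wallis_h([ambient, heated]))
```

A larger H means the ranks of the groups are more separated than expected by chance.
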

In [32]:
!qiime diversity alpha-group-significance \
  --i-alpha-diversity ../output/diversity-core/faith_pd_vector.qza \
  --m-metadata-file ../rawdata/sample-metadata-verbose.tsv \
  --o-visualization ../output/faith-pd-group-significance.qzv

Saved Visualization to: ../output/faith-pd-group-significance.qzv

In [31]:
!qiime diversity alpha-group-significance \
  --i-alpha-diversity ../output/diversity-core/evenness_vector.qza \
  --m-metadata-file ../rawdata/sample-metadata-verbose.tsv \
  --o-visualization ../output/evenness-group-significance.qzv

Saved Visualization to: ../output/evenness-group-significance.qzv