## Overview
### Method

### Data download
I downloaded two .zip folders from a Dropbox link: one contained the raw data files and the other contained analysis pipeline files generated by [Mr. DNA Lab](https://www.mrdnalab.com/) Molecular Research. I unzipped the folders and uploaded the Mr. DNA analysis pipeline files into the `coral-pae-temp/analysis/microbiome/mrdna` directory, and the `sample-metadata.tsv` file and `demux` folder into the `coral-pae-temp/analysis/microbiome/rawdata` directory.  

### Data type

Here I am working with the raw data files located in `coral-pae-temp/analysis/microbiome/data/demux`. In the `demux` folder are two `fastq.gz` files for each of the 22 samples, one for the forward read and one for the reverse read. 

The `fastq.gz` file name includes the sample identifier and should look like `4.Ea_S1_L001_R1_001.fastq.gz`. 
The underscore-separated fields in this file name are:

1.  the sample identifier,

2.  the barcode sequence or a barcode identifier,

3.  the lane number,

4.  the direction of the read (i.e. R1 or R2, because these are paired-end reads), and

5.  the set number.
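
The field layout above can be checked with a few lines of Python. This is only a sketch: the function name `parse_casava_name` and the dictionary keys are made up for illustration, and it assumes the file name follows the five-field pattern described in the list.

```python
def parse_casava_name(filename):
    """Split a Casava 1.8 style fastq.gz file name into its five fields."""
    suffix = ".fastq.gz"
    if not filename.endswith(suffix):
        raise ValueError(f"not a fastq.gz file: {filename}")
    stem = filename[: -len(suffix)]
    # Split from the right so a sample identifier that itself
    # contains underscores would stay in one piece.
    parts = stem.rsplit("_", 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated fields: {filename}")
    sample_id, barcode_id, lane, read, set_number = parts
    return {
        "sample_id": sample_id,
        "barcode_id": barcode_id,
        "lane": lane,
        "read": read,
        "set_number": set_number,
    }


fields = parse_casava_name("4.Ea_S1_L001_R1_001.fastq.gz")
print(fields)
```

For the example file name, the sample identifier comes out as `4.Ea` and the read direction as `R1`.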
   

The `fastq.gz` files are **Demultiplexed** (aka **Demuxed**) sequences that still have the forward and reverse primers in the sequences.

-   The Raw Data is **demultiplexed**

-   An R1 and an R2 fastq.gz file have been generated for each individual sample

-   All forward reads are binned into the R1 fastq.gz files

-   All reverse reads are binned into the R2 fastq.gz files

-   Other than demultiplexing, you can consider the Raw Data on BaseSpace as untouched (**The Forward and Reverse Primer Sequences have not been removed**)
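
Since every sample should have exactly one R1 and one R2 file, a quick sanity check on the file names is possible before importing. The sketch below is an illustration only: `find_unpaired_samples` is a hypothetical helper, and the `5.Eb` file name is invented to show what a missing reverse read would look like.

```python
from collections import defaultdict


def find_unpaired_samples(filenames):
    """Return sample identifiers that lack either an R1 or an R2 file."""
    reads_seen = defaultdict(set)
    for name in filenames:
        if not name.endswith(".fastq.gz"):
            continue  # ignore anything that is not a read file
        stem = name[: -len(".fastq.gz")]
        # Fields: sample id, barcode id, lane, read direction, set number.
        sample_id, _barcode, _lane, read, _set = stem.rsplit("_", 4)
        reads_seen[sample_id].add(read)
    return sorted(s for s, reads in reads_seen.items() if reads != {"R1", "R2"})


# Hypothetical listing: the second sample is missing its R2 file.
example_files = [
    "4.Ea_S1_L001_R1_001.fastq.gz",
    "4.Ea_S1_L001_R2_001.fastq.gz",
    "5.Eb_S2_L001_R1_001.fastq.gz",
]
unpaired = find_unpaired_samples(example_files)
print(unpaired)
```

On the real folder, the list returned by `os.listdir` for the `demux` directory could be passed in instead; an empty result would mean all samples are paired.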

### Data process
Here I follow the QIIME 2 [Casava 1.8 paired-end demultiplexed fastq](https://docs.qiime2.org/2023.5/tutorials/importing/#:~:text=Casava%201.8%20paired%2Dend%20demultiplexed%20fastq) tutorial example on importing data.


## Import QIIME 2 classes with the Python 3 API

In [10]:
from qiime2 import Visualization
from qiime2 import Artifact
from qiime2 import Metadata

## Metadata
Make a table of the metadata.
Here I added columns 'Pae', 'Temp', 'PeaTemp', 'Colony', and 'Tank' to the original `sample-metadata.tsv` file provided to me by Mr. DNA and renamed it `sample-metadata-verbose.tsv`
This was a bit of a process. I had to:

1.  open `sample-metadata.tsv` in Excel,
2.  edit the metadata by adding the above columns and values,
3.  save it as a CSV file,
4.  open it in a text editor,
5.  search for all commas (',') and find-and-replace them with tab characters,
6.  save it as a tab-separated file (`.tsv`), and
7.  upload it back into the `coral-pae-temp/analysis/microbiome/rawdata` folder.

At first I had named the new columns 'pae', 'temp', etc. with lower case... for some reason this was a problem later on and the interactive emperor plots wouldn't recognize the new columns. When I changed the column names to CamelCase to match the others, it worked. The `qiime2` docs indicate that metadata formatted with an Identifier Column such as `#Sample ID` is [case-sensitive](https://docs.qiime2.org/2023.5/tutorials/metadata/#metadata-formatting-requirements:~:text=feature%2Did-,Case%2Dsensitive,-(these%20are%20mostly)

Later I learned you can edit .tsv files directly in Jupyter Notebooks using pandas
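
As a rough illustration of editing the metadata in code rather than through Excel and find-and-replace, here is a minimal sketch. pandas would work too; this version uses only the standard library `csv` module so it runs anywhere. The helper name `add_metadata_columns`, the `sample-id` header, the `barcode` column, and the values `high` and `ambient` are all assumptions made up for the example, not the contents of the real file.

```python
import csv
import io


def add_metadata_columns(tsv_text, id_column, new_values):
    """Return tsv_text with extra columns appended.

    new_values maps a sample ID to a {column name: value} dictionary.
    Samples without an entry get empty cells in the new columns.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    fieldnames = list(reader.fieldnames)
    # Collect new column names in first-seen order.
    for values in new_values.values():
        for column in values:
            if column not in fieldnames:
                fieldnames.append(column)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        row.update(new_values.get(row[id_column], {}))
        writer.writerow(row)
    return out.getvalue()


# Hypothetical two-sample metadata table (tab-separated).
original = "sample-id\tbarcode\n4.Ea\tACGT\n5.Eb\tTGCA\n"
updated = add_metadata_columns(
    original, "sample-id", {"4.Ea": {"Pae": "high", "Temp": "ambient"}}
)
print(updated)
```

Writing tabs directly avoids the CSV round trip, which would also break if any metadata value contained a comma.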


In [1]:
!qiime metadata tabulate \
  --m-input-file ../data/sample-metadata-verbose.tsv \
  --o-visualization ../output/sample-metadata-verbose.qzv

Saved Visualization to: ../output/sample-metadata-verbose.qzv

In [4]:
Visualization.load('../output/sample-metadata-verbose.qzv')

## Import sequences into QIIME 2

In [5]:
!qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path ../data/demux \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path ../output/demux-paired-end.qza

Imported ../data/demux as CasavaOneEightSingleLanePerSampleDirFmt to ../output/demux-paired-end.qza

The `demux-paired-end.qza` artifact contains raw, demultiplexed sequences that still have the forward and reverse primers.

## Trim primers from paired-end sequences using `cutadapt`

> "The PCR primers (F515/R806) were developed against the V4 region of the 16S rRNA, which we determined would yield optimal community clustering with reads of this length using a procedure similar to that of ref. 15. [For reference, this primer pair amplifies the region 533–786 in the Escherichia coli strain 83972 sequence (greengenes accession no. prokMSA_id:470367).] The reverse PCR primer is barcoded with a 12-base error correcting Golay code to facilitate multiplexing of up to ≈1,500 samples per lane, and both PCR primers contain sequencer adapter regions." - (Caporaso et al. 2011)

Caporaso, J. G., Lauber, C. L., Walters, W. A., Berg-Lyons, D., Lozupone, C. A., Turnbaugh, P. J., Fierer, N., & Knight, R. (2011). Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proceedings of the National Academy of Sciences, 108(supplement_1), 4516–4522. https://doi.org/10.1073/pnas.1000080107

> "The V4 variable region of the 16S rRNA gene was amplified using the 515F (5′-GTGCCAGCMGCCGCGGTAA-3′) and 806R (5′-GGACTACHVGGGTWTCTAAT-3′) primer set (Caporaso et al. 2011)." - (Brown et al. 2021)

Brown, Tanya, Dylan Sonett, Jesse R. Zaneveld, and Jacqueline L. Padilla-Gamiño. 2021. “Characterization of the Microbiome and Immune Response in Corals with Chronic Montipora White Syndrome.” Molecular Ecology 30 (11): 2591–2606. https://doi.org/10.1111/mec.15899.
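
The primer sequences passed to `--p-front-f` and `--p-front-r` in the command that follows differ from the ones quoted above at one position each (`Y` instead of `C` in the forward primer, `N` instead of `H` in the reverse primer). The sketch below uses the standard IUPAC nucleotide codes to check that the sequences used in the command are more degenerate versions that still match the quoted primers. The helper names `covers` and `differing_positions` are made up for illustration; the notebook does not state why the more degenerate versions were chosen.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def covers(general, specific):
    """True if every sequence matched by `specific` is also matched by `general`."""
    if len(general) != len(specific):
        return False
    return all(set(IUPAC[s]) <= set(IUPAC[g]) for g, s in zip(general, specific))


def differing_positions(a, b):
    """List (1-based position, base in a, base in b) where two primers differ."""
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


quoted_515f = "GTGCCAGCMGCCGCGGTAA"   # from the Brown et al. quote
quoted_806r = "GGACTACHVGGGTWTCTAAT"  # from the Brown et al. quote
used_515f = "GTGYCAGCMGCCGCGGTAA"     # value of --p-front-f
used_806r = "GGACTACNVGGGTWTCTAAT"    # value of --p-front-r

print(differing_positions(quoted_515f, used_515f))
print(differing_positions(quoted_806r, used_806r))
print(covers(used_515f, quoted_515f), covers(used_806r, quoted_806r))
```

Both checks come back true, so trimming with the more degenerate primers should also remove the primers as quoted.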

In [6]:
!qiime cutadapt trim-paired \
  --i-demultiplexed-sequences ../output/demux-paired-end.qza \
  --p-cores 4 \
  --p-front-f GTGYCAGCMGCCGCGGTAA \
  --p-front-r GGACTACNVGGGTWTCTAAT \
  --o-trimmed-sequences ../output/demux-trimmed.qza

Saved SampleData[PairedEndSequencesWithQuality] to: ../output/demux-trimmed.qza

## Visualize trimmed & demultiplexed sequences

In [7]:
!qiime demux summarize \
  --i-data ../output/demux-trimmed.qza \
  --o-visualization ../output/demux-trimmed-summary.qzv

Saved Visualization to: ../output/demux-trimmed-summary.qzv

In [8]:
Visualization.load('../output/demux-trimmed-summary.qzv')

## [Denoise with DADA2](https://docs.qiime2.org/2023.5/tutorials/moving-pictures/#sequence-quality-control-and-feature-table-construction:~:text=with%20QIIME%201.-,Option%201%3A%20DADA2%C2%B6,-DADA2%20is%20a)

[DADA2](https://pubmed.ncbi.nlm.nih.gov/27214047/) is a pipeline for detecting and correcting (where possible) errors in Illumina amplicon sequence data. 

As implemented in the q2-dada2 plugin, this quality control process will additionally filter any phiX reads (commonly present in marker gene Illumina sequence data) that are identified in the sequencing data, and will filter chimeric sequences.  
The dada2 denoise-paired method requires four parameters that are used in quality filtering:  

-   `--p-trim-left-f m`, which trims off the first m bases of each sequence in the forward reads
-   `--p-trim-left-r n`, which trims off the first n bases of each sequence in the reverse reads
-   `--p-trunc-len-f o`, which truncates each sequence at position o in the forward reads
-   `--p-trunc-len-r p`, which truncates each sequence at position p in the reverse reads  
    
This allows the user to remove low quality regions of the sequences.  

What is a 'good' quality score?  

In QIIME 2's interactive quality plots, the quality scores typically range from 0 to 40. Quality scores reflect the accuracy of base calls in sequencing data, with higher scores indicating higher accuracy. The most common quality score scale used in modern sequencing technologies is the Phred scale.

In the Phred scale:

-   A quality score of 10 corresponds to a 1 in 10 chance of an incorrect base call (90% accuracy).

-   A quality score of 20 corresponds to a 1 in 100 chance of an incorrect base call (99% accuracy).

-   A quality score of 30 corresponds to a 1 in 1000 chance of an incorrect base call (99.9% accuracy).

A "good" quality score in this context depends on your specific analysis goals and the sequencing platform you're using. However, many researchers consider quality scores above 20 to be generally acceptable for downstream analysis. Scores above 30 are often seen as very high quality. In the Interactive Quality Plot tab of the `demux-trimmed-summary.qzv` file that was generated by `qiime demux summarize`, we see that the quality scores of the bases are high, between a score of 11 in the lowest 2nd percentile and a score of 37 in the bottom 25th percentile and higher. So we won't trim any bases from the beginning of the sequences.
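
The Phred relationship behind those numbers is that a score Q corresponds to an error probability of 10^(-Q/10). A small sketch of the conversion in both directions follows; the function names are chosen here for illustration and are not part of QIIME 2.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)


# Scores discussed in the text, plus the two values read off the quality plot.
for q in (10, 11, 20, 30, 37):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.5f}, accuracy {100 * (1 - p):.2f}%")
```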

[Denoising tips from Greg Caporaso](https://docs.qiime2.org/jupyterbooks/cancer-microbiome-intervention-tutorial/020-tutorial-upstream/040-denoising.html) using `qiime dada2 denoise-paired`:

-   Forward: 230
-   Reverse: 231
   
*This next command may take up to 10 min to run and is the slowest step.*

In [9]:
!qiime dada2 denoise-paired \
  --i-demultiplexed-seqs ../output/demux-trimmed.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 230 \
  --p-trunc-len-r 231 \
  --p-n-threads 20 \
  --output-dir ../output/dada2 --verbose

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada.R --input_directory /tmp/tmp7xn9n2ih/forward --input_directory_reverse /tmp/tmp7xn9n2ih/reverse --output_path /tmp/tmp7xn9n2ih/output.tsv.biom --output_track /tmp/tmp7xn9n2ih/track.tsv --filtered_directory /tmp/tmp7xn9n2ih/filt_f --filtered_directory_reverse /tmp/tmp7xn9n2ih/filt_r --truncation_length 230 --truncation_length_reverse 231 --trim_left 0 --trim_left_reverse 0 --max_expected_errors 2.0 --max_expected_errors_reverse 2.0 --truncation_quality_score 2 --min_overlap 12 --pooling_method independent --chimera_method consensus --min_parental_fold 1.0 --allow_one_off False --num_threads 20 --learn_min_reads 1000000

R version 4.2.3 (2023-03-15) 
Loading required package: Rcpp
DADA2: 1.26.0 / Rcpp: 1.0.10 / RcppParallel: 5.1.6

In [13]:
!qiime metadata tabulate \
  --m-input-file ../output/dada2/denoising_stats.qza \
  --o-visualization ../output/dada2/denoising-stats.qzv

Saved Visualization to: ../output/dada2/denoising-stats.qzv

### Visualize Denoising Stats

In [14]:
Visualization.load('../output/dada2/denoising-stats.qzv')

## Summarize & tabulate the feature table
After the quality filtering step completes, you’ll want to explore the resulting data. You can do this using the following two commands, which will create visual summaries of the data. The `feature-table summarize` command will give you information on how many sequences are associated with each sample and with each feature, histograms of those distributions, and some related summary statistics. The `feature-table tabulate-seqs` command will provide a mapping of feature IDs to sequences, and provide links to easily BLAST each sequence against the NCBI nt database.

### feature-table summarize
Feature tables in QIIME 2 represent the abundance of different biological features (such as bacterial taxa or OTUs) across samples.
In this command:

-   `--i-table table.qza` specifies the input feature table in QIIME 2 artifact format (.qza file) that you want to summarize.

-   `--o-visualization table.qzv` specifies the output visualization (.qzv file) that will contain the summaries.

-   `--m-sample-metadata-file sample-metadata.tsv` specifies the metadata file (usually in tab-separated values format) that contains additional information about the samples in your feature table.
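
To make "sequences per sample" and "sequences per feature" concrete, here is the same arithmetic on a toy table. This is a sketch only: the feature and sample names and all counts are invented, and it is not how QIIME 2 computes the summary internally.

```python
from collections import Counter

# Hypothetical toy feature table: feature ID -> {sample ID: read count}.
table = {
    "feature-a": {"sample-1": 10, "sample-2": 0, "sample-3": 5},
    "feature-b": {"sample-1": 3, "sample-2": 7, "sample-3": 0},
}


def sequences_per_sample(table):
    """Total read count for each sample, summed over all features."""
    totals = Counter()
    for counts in table.values():
        totals.update(counts)
    return dict(totals)


def sequences_per_feature(table):
    """Total read count for each feature, summed over all samples."""
    return {feature: sum(counts.values()) for feature, counts in table.items()}


print(sequences_per_sample(table))
print(sequences_per_feature(table))
```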

In [15]:
!qiime feature-table summarize \
  --i-table ../output/dada2/table.qza \
  --o-visualization ../output/dada2/table.qzv \
  --m-sample-metadata-file ../data/sample-metadata-verbose.tsv

Saved Visualization to: ../output/dada2/table.qzv

In [16]:
Visualization.load('../output/dada2/table.qzv')

### feature-table tabulate-seqs

In [17]:
!qiime feature-table tabulate-seqs \
  --i-data ../output/dada2/representative_sequences.qza \
  --o-visualization ../output/dada2/representative-sequences.qzv

Saved Visualization to: ../output/dada2/representative-sequences.qzv

In [18]:
Visualization.load('../output/dada2/representative-sequences.qzv')