This script has three goals:
1. Check the results of the mock community extraction and sequencing for any bias in the extraction and sequencing methods that may influence the data.
2. Use the 'blank' sample to decontaminate the coral samples by removing background bacterial contamination from the lab.
3. Remove mitochondria, chloroplasts, and unknown ASVs from the feature table.

We're going to use the QIIME 2 plugin [quality-control](https://docs.qiime2.org/2023.9/plugins/available/quality-control/) to achieve these goals.

This [QIIME 2 tutorial](https://docs.qiime2.org/2023.9/tutorials/quality-control/) walks through the plugin and will help.

## Import QIIME 2 plugins (Python 3 API)

In [1]:
from qiime2 import Visualization
from qiime2 import Artifact
from qiime2.plugins import quality_control

# Evaluate mock community

[Zymo MIQ score](https://www.zymoresearch.com/blogs/blog/how-to-quantify-bias-with-mock-microbial-community-standards)

[ZymoBIOMICS Microbial Community Standard](https://www.zymoresearch.com/blogs/blog/zymobiomics-microbial-standards-optimize-your-microbiomics-workflow)

Analyzing 16S sequencing data of the [ZymoBIOMICS® Standard](https://www.zymoresearch.com/collections/zymobiomics-microbial-community-standards/products/zymobiomics-microbial-community-standard): When sequencing the ZymoBIOMICS® standards, analyze them using your regular 16S analysis pipelines, such as Qiime and Mothur. You can compare the measured composition with the theoretical composition of the standard. Questions to keep in mind during this comparison include:

1. whether your measurement covers all strains with the proper taxonomy assignment and with correct abundance;
2. whether your measurement indicates the presence of foreign taxa with significant abundance.

Taxonomy assignment might be incorrect or improper because of problems in the reference database. Abundance estimation might be off because of bias in DNA extraction, bias in library preparation, poor quality of MiSeq runs, etc. The presence of foreign taxa might indicate process contamination, poor sequencing quality, PCR chimeras in library preparation, defects in bioinformatics analysis, defects in the reference database, etc.

Both the ZymoBIOMICS® Microbial Community Standard and the DNA Standard are certified to have low impurity levels, < 0.01% by DNA abundance, so any foreign taxa with abundance higher than 0.01% are derived from artifacts in your workflow.
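To make the two questions above concrete, here is a minimal sketch of that comparison in plain Python. The function name `check_mock` and all the taxa and numbers are made up for illustration; only the 0.01% threshold comes from the Zymo text. It is not how QIIME 2 or Zymo score a mock community, just the bookkeeping behind the questions.

```python
# 0.01% impurity limit quoted in the Zymo text, expressed as a fraction.
FOREIGN_THRESHOLD = 0.0001


def check_mock(expected, observed, threshold=FOREIGN_THRESHOLD):
    """Compare observed relative abundances (fractions) with the expected ones.

    Returns (missing, foreign, deviation):
      missing   - expected taxa that were not observed at all
      foreign   - observed taxa that are not expected and exceed the threshold
      deviation - observed minus expected abundance for every expected taxon
    """
    missing = sorted(t for t in expected if observed.get(t, 0.0) == 0.0)
    foreign = sorted(
        t for t, abundance in observed.items()
        if t not in expected and abundance > threshold
    )
    deviation = {t: observed.get(t, 0.0) - e for t, e in expected.items()}
    return missing, foreign, deviation


# Toy, made-up data (NOT the real standard and NOT real results).
expected = {"Taxon A": 0.5, "Taxon B": 0.3, "Taxon C": 0.2}
observed = {
    "Taxon A": 0.55,
    "Taxon B": 0.4449,
    "Contaminant X": 0.005,   # above the 0.01% threshold
    "Contaminant Y": 0.0001,  # at the threshold, so not flagged
}

missing, foreign, deviation = check_mock(expected, observed)
print("missing:", missing)
print("foreign:", foreign)
print("deviation:", deviation)
```

In practice the taxon labels in `expected` and `observed` would have to use the same naming scheme (for example the same taxonomy level), otherwise everything shows up as missing or foreign.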

Check out this Zymo resource page on [How to quantify bias with mock microbial community standards](https://www.zymoresearch.com/blogs/blog/how-to-quantify-bias-with-mock-microbial-community-standards).

### Measurement Integrity Quotient (MIQ) score
The MIQ score simplifies bias assessment by assigning a score from 0 to 100 and can be interpreted like a grade on a high school exam, with >90 being excellent, 80-89 being good, and so on (Figure 1). This is achieved by measuring the fidelity of measured relative abundances compared to a known input. The MIQ score is presented in a user-friendly report which also includes different plots to visually represent the evaluation. These include radar plots, taxa bar plots, and read fate counts.

![example-shotgun-composition-MIQ-score](https://files.zymoresearch.com/images/blog_post-miq_score_figure-2b.png)

I called Zymo and asked them to send me the raw sequence data for the theoretical composition of the ZymoBIOMICS Microbial Community Standard [(Cat#D6300)](https://www.zymoresearch.com/collections/zymobiomics-microbial-community-standards/products/zymobiomics-microbial-community-standard).

Theoretical Composition Based on Genomic DNA:
- Listeria monocytogenes - 12% 
- Pseudomonas aeruginosa - 12% 
- Bacillus subtilis - 12% 
- Escherichia coli - 12% 
- Salmonella enterica - 12% 
- Lactobacillus fermentum - 12% 
- Enterococcus faecalis - 12% 
- Staphylococcus aureus - 12%
- Saccharomyces cerevisiae - 2%
- Cryptococcus neoformans - 2%

To compare the observed mock community against this theoretical composition, we need two artifacts:
- mock-expected.qza "FeatureTable[RelativeFrequency]"
- mock-observed.qza "FeatureTable[RelativeFrequency]"
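As a starting point for the expected table, the genomic DNA percentages listed above can be turned into relative frequencies. This is a rough sketch in plain Python: the sample ID `mock-expected` and the tab-separated layout are my own placeholders, and the taxon labels would still need to be rewritten to match the labels in the observed table. Note these are genomic DNA percentages; whether they match the expected 16S read proportions is a separate question this sketch does not address.

```python
import math

# Theoretical composition by genomic DNA (percent), copied from the list above.
expected_percent = {
    "Listeria monocytogenes": 12,
    "Pseudomonas aeruginosa": 12,
    "Bacillus subtilis": 12,
    "Escherichia coli": 12,
    "Salmonella enterica": 12,
    "Lactobacillus fermentum": 12,
    "Enterococcus faecalis": 12,
    "Staphylococcus aureus": 12,
    "Saccharomyces cerevisiae": 2,
    "Cryptococcus neoformans": 2,
}

# Sanity check: the listed percentages should account for the whole community.
total = sum(expected_percent.values())
if total != 100:
    raise ValueError(f"percentages sum to {total}, expected 100")

# Convert percent to relative frequency (fractions that sum to 1).
expected_relative = {taxon: pct / total for taxon, pct in expected_percent.items()}

# Tab-separated text, one taxon per row; 'mock-expected' is a hypothetical sample ID.
lines = ["taxon\tmock-expected"]
lines += [f"{taxon}\t{freq:.2f}" for taxon, freq in expected_relative.items()]
print("\n".join(lines))
```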

Exclude the mock community sample from the feature table for downstream analysis:

In [19]:
!qiime feature-table filter-samples \
  --i-table ../output/dada2/table.qza \
  --m-metadata-file ../data/sample-metadata-verbose.tsv \
  --p-where "BarcodeSequence='GATTAAGGTG'" \
  --p-exclude-ids \
  --o-filtered-table ../output/filtered/table-no-mock.qza

Saved FeatureTable[Frequency] to: ../output/filtered/table-no-mock.qza
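The `--p-where` value is written in SQLite WHERE-clause syntax, and the exclude flag inverts the selection. Here is a small sketch of that logic using Python's built-in `sqlite3`. It is only an illustration of the idea, not QIIME 2's actual code: the helper `select_ids`, the ID column name `SampleID`, the mock sample ID and the coral rows are all made up (only `blank.` and the two barcodes come from this notebook).

```python
import sqlite3

# (sample ID, barcode) rows. Only 'blank.' and the two barcodes below appear in
# this notebook; the other IDs and barcodes are hypothetical.
rows = [
    ("mock.", "GATTAAGGTG"),     # hypothetical ID for the mock community
    ("blank.", "TATGTGCAAT"),
    ("coral.1", "ACGTACGTAC"),   # made up
    ("coral.2", "TTGGCCAATT"),   # made up
]


def select_ids(rows, where, exclude_ids=False):
    """Return sample IDs matching a SQLite WHERE clause, or the rest if excluding."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE metadata (SampleID TEXT, BarcodeSequence TEXT)")
    con.executemany("INSERT INTO metadata VALUES (?, ?)", rows)
    # The clause is pasted into the query for illustration only.
    query = f"SELECT SampleID FROM metadata WHERE {where}"
    matched = {row[0] for row in con.execute(query)}
    con.close()
    all_ids = [row[0] for row in rows]
    if exclude_ids:
        return [sample_id for sample_id in all_ids if sample_id not in matched]
    return [sample_id for sample_id in all_ids if sample_id in matched]


# Same clause as the command above: drop the sample with the mock barcode.
kept = select_ids(rows, "BarcodeSequence='GATTAAGGTG'", exclude_ids=True)
print(kept)
```

Without the exclude flag the same clause would keep only the matching sample, which is how the blank-only table is selected by barcode further down.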

In [21]:
!qiime feature-table summarize \
  --i-table ../output/filtered/table-no-mock.qza \
  --o-visualization ../output/filtered/table-no-mock.qzv \
  --m-sample-metadata-file ../data/sample-metadata-verbose.tsv

Saved Visualization to: ../output/filtered/table-no-mock.qzv

In [23]:
Visualization.load('../output/filtered/table-no-mock.qzv')

# Decontaminate

Here I want to take all the features that were found in the 'blank' sample and remove them from the feature table, aka getting rid of contamination! Yay!

We first need to generate some files for input:
- query-seqs.qza "FeatureData[Sequence]": the sequences found in the blank, i.e. the ones we want to remove!
- reference-seqs.qza "FeatureData[Sequence]": the sequences found in the coral samples
- query-table.qza "FeatureTable[Frequency]": the feature table containing all the coral samples

We want to search the query sequences against the reference.
Any sequences from the blank that match a sequence found in the coral samples will be 'hits'; any sequences from the blank that have no match in the reference will be 'misses'.
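Here is a toy sketch of the hit/miss split, assuming (as I understand `exclude-seqs`) that both outputs are subsets of the query sequences. To keep it self-contained it swaps BLAST for a crude position-by-position identity on equal-length sequences, so it is not the real alignment method. The function names and the sequences are made up.

```python
def identity(a, b):
    """Fraction of matching positions; a crude stand-in for a BLAST alignment."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def split_hits_misses(query, reference, perc_identity=0.97):
    """Partition query sequences into those that match the reference and those that do not."""
    hits, misses = {}, {}
    for query_id, query_seq in query.items():
        matched = any(
            identity(query_seq, ref_seq) >= perc_identity
            for ref_seq in reference.values()
        )
        (hits if matched else misses)[query_id] = query_seq
    return hits, misses


# Made-up sequences: one blank sequence is also in the reference, one is not.
reference = {"asv1": "ACGTACGTAC", "asv2": "TTTTGGGGCC"}
query = {"blank_a": "ACGTACGTAC", "blank_b": "GGGGGGGGGG"}

hits, misses = split_hits_misses(query, reference)
print("hits:", list(hits))
print("misses:", list(misses))
```

The hits are the feature IDs that later get dropped from the feature table.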

[Metadata-based filtering](https://docs.qiime2.org/2019.4/tutorials/filtering/#:~:text=view%20%7C%20download-,Metadata%2Dbased%20filtering,-%C2%B6)

Modify metadata to keep only the blank sample!

In [11]:
import pandas as pd

In [12]:
# read in the tab-separated sample metadata
df = pd.read_csv('../data/sample-metadata-verbose.tsv', delimiter='\t')
# make a list of SampleID values to keep
keep = ['blank.']
# keep only those rows
df = df[df['#SampleID'].isin(keep)]
df

| Unnamed: 0 | #SampleID | BarcodeSequence | LinkerPrimerSequence | BarcodeName | ReversePrimer | ProjectName | Description | Pae | Temp | PaeTemp | Colony | Tank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | blank. | TATGTGCAAT | GTGYCAGCMGCCGCGGTAA | 60bp_UDPi5_0093 | GGACTACNVGGGTWTCTAAT | 060823STillcus515F | blank. | blank | blank | blank | blank | blank |


In [13]:
# save the new df as a tsv
df.to_csv('../data/sample-metadata-blank.tsv', sep='\t', index=False)

Create a feature table containing only the blank sample:

In [14]:
!qiime feature-table filter-samples \
  --i-table ../output/dada2/table.qza \
  --m-metadata-file ../data/sample-metadata-blank.tsv \
  --o-filtered-table ../output/filtered/table-blank.qza

Saved FeatureTable[Frequency] to: ../output/dada2/table-blank.qza

Alternatively, instead of building a blank-only metadata file with pandas, you can select the blank sample directly with the `--p-where` option of `qiime feature-table filter-samples`:

In [36]:
!qiime feature-table filter-samples \
  --i-table ../output/dada2/table.qza \
  --m-metadata-file ../data/sample-metadata-verbose.tsv \
  --p-where "BarcodeSequence ='TATGTGCAAT'" \
  --o-filtered-table ../output/filtered/table-blank.qza

Saved FeatureTable[Frequency] to: ../output/filtered/table-blank.qza

In [38]:
!qiime feature-table summarize \
  --i-table ../output/filtered/table-blank.qza \
  --o-visualization ../output/filtered/table-blank.qzv \
  --m-sample-metadata-file ../data/sample-metadata-blank.tsv

Saved Visualization to: ../output/filtered/table-blank.qzv

In [39]:
Visualization.load('../output/filtered/table-blank.qzv')

### Filter representative sequences 
Here I make an artifact that contains only the sequences found in the blank sample.
It is created from two inputs: the blank-only frequency table and the representative sequences `FeatureData[Sequence]`.

In [32]:
!qiime feature-table filter-seqs \
  --i-data ../output/dada2/representative_sequences.qza \
  --i-table ../output/filtered/table-blank.qza \
  --o-filtered-data ../output/filtered/rep-seqs-blank.qza

Saved FeatureData[Sequence] to: ../output/filtered/rep-seqs-blank.qza
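Conceptually, filtering sequences by a table just keeps the sequences whose feature IDs occur in that table. Here is a tiny sketch with plain dictionaries. The function name, feature IDs, sequences and counts are all made up, and I'm assuming the blank-only table lists only features that actually have counts in the blank.

```python
def filter_seqs_by_table(rep_seqs, table):
    """Keep only the sequences whose feature ID has a non-zero count in the table.

    rep_seqs: {feature_id: sequence}
    table:    {sample_id: {feature_id: count}}
    """
    observed = {
        feature_id
        for counts in table.values()
        for feature_id, count in counts.items()
        if count > 0
    }
    return {fid: seq for fid, seq in rep_seqs.items() if fid in observed}


# Made-up representative sequences and a made-up blank-only table.
rep_seqs = {"f1": "ACGT", "f2": "GGCC", "f3": "TTAA"}
blank_table = {"blank.": {"f2": 14, "f3": 0}}

blank_seqs = filter_seqs_by_table(rep_seqs, blank_table)
print(blank_seqs)
```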

In [5]:
!qiime quality-control exclude-seqs \
  --i-query-sequences ../output/filtered/rep-seqs-blank.qza \
  --i-reference-sequences ../output/dada2/representative_sequences.qza \
  --p-method blast \
  --p-perc-identity 0.97 \
  --p-perc-query-aligned 0.97 \
  --o-sequence-hits ../output/filtered/hits.qza \
  --o-sequence-misses ../output/filtered/misses.qza

Saved FeatureData[Sequence] to: ../output/filtered/hits.qza
Saved FeatureData[Sequence] to: ../output/filtered/misses.qza

In [6]:
!qiime tools peek ../output/filtered/hits.qza

[32mUUID[0m:        fcdeb722-c761-4ca4-92fc-bf8b947dcc96
[32mType[0m:        FeatureData[Sequence]
[32mData format[0m: DNASequencesDirectoryFormat


In [7]:
!qiime feature-table filter-features \
  --i-table ../output/dada2/table.qza \
  --m-metadata-file ../output/filtered/hits.qza \
  --o-filtered-table ../output/filtered/no-hits-filtered-table.qza \
  --p-exclude-ids

Saved FeatureTable[Frequency] to: ../output/filtered/no-hits-filtered-table.qza
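Dropping every feature seen in the blank can be fairly aggressive, so it seems worth checking how many reads each sample loses. Here is a rough sketch of that bookkeeping with plain dictionaries. The helper `drop_features` and all sample IDs, feature IDs and counts are made up for illustration.

```python
def drop_features(table, ids_to_drop):
    """Return a copy of the table without the given feature IDs.

    table: {sample_id: {feature_id: count}}
    """
    drop = set(ids_to_drop)
    return {
        sample_id: {fid: n for fid, n in counts.items() if fid not in drop}
        for sample_id, counts in table.items()
    }


# Made-up counts; 'f2' plays the role of a feature that was a hit against the blank.
table = {
    "coral.1": {"f1": 120, "f2": 3, "f3": 40},
    "coral.2": {"f1": 80, "f2": 0, "f3": 25},
}
hit_ids = ["f2"]

filtered = drop_features(table, hit_ids)

# Reads lost per sample by removing the hit features.
reads_removed = {
    sample_id: sum(table[sample_id].values()) - sum(filtered[sample_id].values())
    for sample_id in table
}
print(reads_removed)
```

The per-sample frequencies in the `feature-table summarize` visualizations before and after filtering give the same information for the real data.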

In [8]:
!qiime feature-table summarize \
  --i-table ../output/filtered/no-hits-filtered-table.qza \
  --o-visualization ../output/filtered/no-hits-filtered-table.qzv \
  --m-sample-metadata-file ../data/sample-metadata-verbose.tsv

Saved Visualization to: ../output/filtered/no-hits-filtered-table.qzv

In [12]:
Visualization.load("../output/filtered/no-hits-filtered-table.qzv")

# Remove Mitochondria, Chloroplasts & Unknowns with Taxonomy-based filtering
[qiime2 tutorial](https://docs.qiime2.org/2019.4/tutorials/filtering/)

In [13]:
!qiime taxa filter-table \
  --i-table ../output/filtered/no-hits-filtered-table.qza \
  --i-taxonomy ../output/taxonomy/classification.qza \
  --p-exclude mitochondria,chloroplast,unassigned \
  --o-filtered-table ../output/filtered/table-taxon-filtered.qza

Saved FeatureTable[Frequency] to: ../output/filtered/table-taxon-filtered.qza
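Here is a small sketch of what the exclude list does, assuming each comma-separated term is matched as a case-insensitive substring of the taxonomy annotation. That matching rule is my assumption about the default behaviour, and the helper name, feature IDs and SILVA-style taxonomy strings are made up.

```python
def keep_feature(taxon, exclude_terms):
    """True if none of the exclude terms occurs in the taxonomy string (case-insensitive)."""
    lowered = taxon.lower()
    return not any(term.lower() in lowered for term in exclude_terms)


# Made-up feature IDs with SILVA-style annotations.
taxonomy = {
    "f1": "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
    "f2": "d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast",
    "f3": "d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; "
          "o__Rickettsiales; f__Mitochondria",
    "f4": "Unassigned",
}

# Same comma-separated terms as in the command above.
exclude = "mitochondria,chloroplast,unassigned".split(",")

kept = [fid for fid, taxon in taxonomy.items() if keep_feature(taxon, exclude)]
print(kept)
```

Because the match is on substrings, it is worth checking which labels your classifier actually uses for these groups before relying on the term list.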

Now let's view the taxonomy bar plot to confirm that mitochondria, chloroplasts, and unknown features were removed from the filtered feature table:

In [14]:
!qiime taxa barplot \
    --i-table ../output/filtered/table-taxon-filtered.qza \
    --i-taxonomy ../output/taxonomy/classification.qza \
    --m-metadata-file ../data/sample-metadata-verbose.tsv \
    --o-visualization ../output/taxonomy/taxa_barplot-taxon-filtered.qzv

Saved Visualization to: ../output/taxonomy/taxa_barplot-taxon-filtered.qzv

In [15]:
Visualization.load("../output/taxonomy/taxa_barplot-taxon-filtered.qzv")