# Mitochondria removal tutorial (QIIME2 Artifact API)

This tutorial will cover how to use the preconstructed extended taxonomies from Sonett et al. [preprint](https://www.biorxiv.org/content/10.1101/2021.02.23.431501v2) to remove mitochondria from 16S rRNA gene sequence data in QIIME2 using the QIIME2 Artifact API.

If you need to build your own custom extended taxonomy, see the [extended taxonomy creation tutorial](extended_taxonomy_construction_tutorial.ipynb).

We will use as an example a subset of the 16S rRNA amplicon Global Coral Microbiome Project dataset. For the experimental design, see [Pollock *et al*., 2018](https://www.nature.com/articles/s41467-018-07275-x). In this case, we are just using these data to give an example of mitochondria removal, so you don't need to read the paper in order to do the tutorial.



## Requirements


It's assumed that QIIME2 is installed and its virtual environment is activated. This tutorial was tested with qiime2-amplicon-2024.2 running on Windows 10 via WSL.

Viewing QIIME2 Visualizations in Jupyter notebook requires installing the QIIME 2 Jupyter extension with:

`jupyter serverextension enable --py qiime2 --sys-prefix` 
  
then restarting your jupyter server.

## Activating your QIIME2 virtual environment
NOTE: you have to activate your QIIME2 virtual environment *before* starting this notebook. That's awkward since you're already in the notebook. You may need to exit the notebook, activate QIIME2, then restart it. I usually do this by first reminding myself of which virtual environments I have available:

`conda env list`

Here's an example of the output in my case:
```
# conda environments:

base                  *  /Users/zaneveld/opt/anaconda3
qiime2-2020.8            /Users/zaneveld/opt/anaconda3/envs/qiime2-2020.8
qiime2-2021.11           /Users/zaneveld/opt/anaconda3/envs/qiime2-2021.11
qiime2-amplicon-2023.9     /Users/zaneveld/opt/anaconda3/envs/qiime2-amplicon-2023.9
```

If I wanted to activate the `qiime2-amplicon-2023.9` virtual environment, I would then use: 

`conda activate qiime2-amplicon-2023.9`

You can run this tutorial from a Jupyter notebook launched from Terminal on Mac, or from any Bash command line interface on Linux or the Windows Subsystem for Linux, as long as QIIME2 is installed.
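
If you are unsure which environment the notebook kernel is actually running in, a quick check from Python can save a restart. The following is a minimal sketch, assuming the environment was activated with conda (which sets the `CONDA_DEFAULT_ENV` variable); `describe_environment` is a hypothetical helper name, not part of QIIME2.

```python
import importlib.util
import os


def describe_environment(module_name="qiime2"):
    """Return (environment name, whether module_name can be imported).

    Hypothetical helper: it only looks the module up, it does not import it.
    """
    # conda sets CONDA_DEFAULT_ENV when an environment is activated
    env_name = os.environ.get("CONDA_DEFAULT_ENV", "unknown")
    module_found = importlib.util.find_spec(module_name) is not None
    return env_name, module_found


env_name, qiime2_found = describe_environment()
print(f"Active conda environment: {env_name}")
print(f"qiime2 importable: {qiime2_found}")
```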

## Arrangement of tutorial files

It's assumed that you've downloaded the zipped tutorial with `input`, `output`, and `procedure` folders, and that this notebook is within the provided `procedure` folder (where it starts).

Given all that, this tutorial will discuss how to use the supplemented databases to remove mitochondria from your 16S datasets using QIIME2.


## Check that we can import QIIME2 objects

Before we go further, let's double check that we can import objects from QIIME2 by importing the core `Artifact` object.

In [None]:
try:
    from qiime2 import Artifact
    print('Good to go!')
except ModuleNotFoundError:
    raise ModuleNotFoundError('\nQIIME2 is not installed, or you are not in the proper environment.\nPlease stop the Jupyter server, install QIIME2 or activate the environment, and restart this notebook.') from None

# Classify taxonomy with VSEARCH
In order to remove mitochondria and chloroplast sequences, we must first classify the taxonomy of all 16S rRNA sequences in the library to figure out which derive from organelles. In this tutorial, we'll use VSEARCH to align our 16S sequences to the extended reference database. VSEARCH is included with QIIME2 and so you shouldn't need to install any additional software to run it.

### Set up the directories and import files for this part of the analysis

First we'll set up references to the directories that hold our raw data and results. The tutorial assumes that we have an `input`, `output`, and `procedure` folder within our analysis directory, and that we are running our jupyter notebook from the `procedure` folder. Therefore all paths to files in `input` will look like `../input/name_of_file.txt` and similarly files in the `output` folder will have paths like `../output/name_of_file.txt`.
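
As a sketch of how that layout maps onto paths, the hypothetical helper below derives the `input` and `output` folders from the location of the `procedure` folder using `pathlib`. It is an illustration under the folder names described above, not code the tutorial depends on.

```python
from pathlib import Path


def tutorial_dirs(procedure_dir="."):
    """Return the analysis, input and output directories for a procedure folder."""
    # The analysis directory encloses the procedure folder
    working_dir = Path(procedure_dir).resolve().parent
    return {
        "working": working_dir,
        "input": working_dir / "input",
        "output": working_dir / "output",
    }


dirs = tutorial_dirs(".")
for name, path in dirs.items():
    status = "found" if path.is_dir() else "missing"
    print(f"{name}: {path} ({status})")
```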

We will start by assigning some relevant paths to variables. If you are adapting this tutorial to your own data, replace the filenames for the metadata, sequences, and feature table with your own.

In [None]:
from os.path import join, abspath, exists

#Since we are in procedure, the main analysis directory ('working_dir')
#encloses our current directory
working_dir = abspath('..')

input_dir = join(working_dir,'input')
output_dir = join(working_dir, 'output')

#Define variables to hold the filepaths for metadata, sequences, and the extended taxonomy reference

#NOTE: these should be adjusted to point to your data
metadata_file_name = 'GCMP_sample_metadata.txt'
sequence_file_name = 'GCMP_tutorial_seqs.qza'
feature_table_name = 'GCMP_tutorial_ft.qza'

#Combine the filenames with paths for your input folders
#We use the join function to ensure cross-system compatibility (i.e. / vs. \ issues)
metadata_path = join(input_dir, metadata_file_name)
seqs_path = join(input_dir, sequence_file_name)
feature_table_path = join(input_dir, feature_table_name)

print(f"Metadata filepath:{metadata_path}\n")
print(f"Sequence filepath:{seqs_path}\n")
print(f"Feature table filepath:{feature_table_path}\n")

## Define variables holding the paths to our taxonomic references

In [None]:
#these will be automatically downloaded if not found
taxonomy_reference_dir = join(input_dir,'taxonomy_references')
base_silva_paths = [join(taxonomy_reference_dir, 'silva_sequences.qza'),
                    join(taxonomy_reference_dir, 'silva_taxonomy.qza')]
extended_silva_paths = [join(taxonomy_reference_dir, 'silva_extended_sequences.qza'),
                        join(taxonomy_reference_dir, 'silva_extended_taxonomy.qza')]

### Check for files and download key files if not present

In [None]:
import os
import shutil
import urllib.request

def download_file(url, local_filepath):
    with urllib.request.urlopen(url) as response, open(local_filepath, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)

In [None]:
print("Verifying that all needed starting data files exist.")

### Add file names for the actual files we need to the required files list

required_filepaths = ([taxonomy_reference_dir, metadata_path, seqs_path, feature_table_path]
                      + base_silva_paths + extended_silva_paths)

for required_filepath in required_filepaths:
    if not exists(required_filepath):
        if required_filepath == taxonomy_reference_dir:
            os.mkdir(taxonomy_reference_dir)
        elif required_filepath in base_silva_paths:
            download_file('https://data.qiime2.org/2021.4/common/silva-138-99-seqs-515-806.qza',
                          join(taxonomy_reference_dir, 'silva_sequences.qza'))
            download_file('https://data.qiime2.org/2021.4/common/silva-138-99-tax-515-806.qza', 
                          join(taxonomy_reference_dir, 'silva_taxonomy.qza'))
        elif required_filepath in extended_silva_paths:
            download_file('https://zenodo.org/records/10251912/files/silva_extended_sequences.qza?download=1',
                          join(taxonomy_reference_dir, 'silva_extended_sequences.qza'))
            download_file('https://zenodo.org/records/10251912/files/silva_extended_taxonomy.qza?download=1',
                          join(taxonomy_reference_dir, 'silva_extended_taxonomy.qza'))            
        else:
            raise IOError(f"Required file {required_filepath} not found. Please ensure it is in that directory.")
        
    print(f"{required_filepath}.....OK")
print("Done.")
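
One caveat with the check above: it only tests whether a file exists, so a download that was interrupted partway can leave a truncated file that passes on the next run. Below is a minimal sketch of one way to guard against that, assuming it is acceptable to write to a temporary `.part` file first. `download_if_missing` is a hypothetical helper, not part of the tutorial code.

```python
import os
import shutil
import urllib.request


def download_if_missing(url, local_filepath):
    """Download url to local_filepath unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(local_filepath):
        return False
    # Write to a temporary name first so a partial download is never
    # mistaken for a complete file
    partial_path = local_filepath + ".part"
    with urllib.request.urlopen(url) as response, open(partial_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    # Rename only after the whole file has been written
    os.replace(partial_path, local_filepath)
    return True
```

It could be called in place of `download_file`, with the same URL and destination path arguments.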

# Annotate sequences

We will use VSEARCH to annotate taxonomy. This will be done once for each of the reference taxonomies: SILVA, and the extended SILVA (SILVA + Metaxa2 + PhytoREF organelle reference sequences).

## Load metadata and sequence files as .qza artifacts

Next we'll load your study-specific metadata and sequence files as QIIME2 Artifacts. (If adapting to your own files later, replace metadata_path and seqs_path with paths to your metadata `.tsv` file and your sequence `.qza` file).

In [None]:
from qiime2 import Artifact, Metadata
metadata = Metadata.load(metadata_path)
seqs = Artifact.load(seqs_path)

## Use VSEARCH to classify your sequences according to base and extended SILVA taxonomies

**Note:** On a full dataset this step can take a while to run (even on the limited tutorial dataset it will take a few minutes). Adjusting the vsearch threads parameter can help speed up the process if you have enough memory to support multiple threads. About 8GB / thread has worked for us as a rule of thumb, but the scaling may not be totally linear - see https://forum.qiime2.org/t/vsearch-classifier-memory/8667/5 for discussion.
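
The rule of thumb above can be written down as a small calculation. This is a rough sketch only: `suggest_threads` is a hypothetical helper, the 8 GB per thread figure is the approximate value quoted above, and the available memory has to be supplied by you since the standard library has no portable way to query it.

```python
import os


def suggest_threads(available_memory_gb, gb_per_thread=8, cpu_count=None):
    """Suggest a VSEARCH thread count limited by both memory and CPU cores."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    threads_by_memory = int(available_memory_gb // gb_per_thread)
    # Always return at least one thread, even on a low-memory machine
    return max(1, min(cpu_count, threads_by_memory))


# Example: a machine with 32 GB of free memory and 16 cores
print(suggest_threads(32, cpu_count=16))  # prints 4
```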

In [None]:
from qiime2.plugins.feature_classifier.pipelines import classify_consensus_vsearch

threads = 4


vsearch_results = {}
references = ['silva','silva_extended']
for reference in references:
    
    #Note: The next two lines just set paths to the sequence and taxonomy qza files, respectively
    #If using a custom reference set, you could either name it with the same naming scheme 
    # my_reference_sequences.qza and my_reference_taxonomy.qza, and add 'my_reference' to the 
    #references list up above, or just manually set the file names using this section as a 
    #loose guide.
    
    reference_otu_path = join(taxonomy_reference_dir, f'{reference}_sequences.qza')
    reference_taxonomy_path = join(taxonomy_reference_dir, f'{reference}_taxonomy.qza')
    
    #Load .qza files as QIIME2 artifacts
    reads = Artifact.load(reference_otu_path)
    taxonomy = Artifact.load(reference_taxonomy_path)
    
    #Run VSEARCH, and store results in the vsearch_results dictionary
    vsearch_results[reference] = classify_consensus_vsearch(seqs, reads, taxonomy, threads = threads)
    
print("Done running VSEARCH")

## Save each of the taxonomy annotations for your sequences.

Finally, save the results of the vsearch taxonomy mapping into taxonomy .qza objects that you can use with downstream QIIME2 scripts. Note that these are *study-specific* classifications of your sequences, unlike the reference taxonomies, which are general.

In [None]:
for reference in vsearch_results:
    classification_taxonomy = vsearch_results[reference].classification
    
    #Create a .qza file to hold the results of applying a given reference taxonomy
    #to your specific sequences (as distinct from the *reference* taxonomy for all known species)
    output_filepath = join(working_dir,'output',f"{reference}_classification_taxonomy.qza")
    classification_taxonomy.save(output_filepath)

# Remove mitochondria from samples

Now that we have created study-specific taxonomic annotations in the output folder (e.g. `../output/silva_extended_classification_taxonomy.qza`), we can use them to filter our feature tables to remove mitochondria or chloroplast 16S sequences using QIIME2.

### Generate filtered feature tables with organelle sequences removed.

Next we'll use the QIIME2 filter_table command to remove features that are annotated as mitochondria or chloroplasts according to each taxonomy, and output filtered feature tables and feature table summaries for each to allow comparison and reporting.
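
To make the filtering rule concrete, here is a plain-Python sketch of how a comma-separated `exclude` string can be matched against taxonomy labels. It assumes case-insensitive substring matching (so `mitochondria` would match a label containing `f__Mitochondria`), uses made-up feature IDs and SILVA-style labels for illustration, and is not the implementation used inside `filter_table`. Consult the q2-taxa documentation for the exact matching behavior in your QIIME2 version.

```python
def is_excluded(taxon, exclude="mitochondria,chloroplast"):
    """Return True if any comma-separated search term occurs in the taxon label."""
    terms = [term.strip().lower() for term in exclude.split(",") if term.strip()]
    return any(term in taxon.lower() for term in terms)


# Made-up feature IDs with SILVA-style labels, for illustration only
example_taxonomy = {
    "feature_1": "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
    "feature_2": "d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria",
    "feature_3": "d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast",
    "feature_4": "Unassigned",
}

kept = [fid for fid, taxon in example_taxonomy.items() if not is_excluded(taxon)]
print(kept)  # ['feature_1', 'feature_4']
```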

In [None]:
from collections import defaultdict
from qiime2.plugins.feature_table.methods import filter_samples
from qiime2.plugins.feature_table.visualizers import summarize
from qiime2.plugins.taxa.methods import filter_table

filtered_feature_tables_by_taxonomy = defaultdict(dict)

feature_table = Artifact.load(feature_table_path)

for reference, classification_taxonomy in vsearch_results.items():
    print(f'Analyzing data using the {reference} classification taxonomy')
    print(f'Removing mitochondria from: {feature_table_name}')
    
    #Remove organelle sequences from the feature table
    #Note that the QIIME2 API does not return a single object,
    #but rather a named tuple structure with each output, which we save in filter_table_results
    filter_table_results = filter_table(table = feature_table, taxonomy = classification_taxonomy.classification,
                                        exclude = 'mitochondria,chloroplast')
    
    #Additionally remove any samples not in the metadata
    filter_table_results = filter_samples(filter_table_results.filtered_table,
                                          metadata = metadata)
    filtered_table = filter_table_results.filtered_table
    
    #Save the filtered feature table to a .qza file in the output directory
    filtered_table_path = join(output_dir, f'feature_table_filtered_{reference}.qza')
    print(f'Saving results to: {filtered_table_path}')
    filtered_table.save(filtered_table_path)
    
    #output a feature table summary visualization
    summary_visualization = summarize(filtered_table, sample_metadata = metadata)
    vis = summary_visualization.visualization
    visualization_path = join(output_dir, f'feature_table_filtered_{reference}.qzv')
    print(f'Saving file summary to: {visualization_path}\n')
    vis.save(visualization_path)
print('Done.')    

## Visualize the difference

Now that we've removed the mitochondria, let's see how our estimates of the relative abundance of microbial taxa vary based on whether we used the base SILVA or extended taxonomy.

If you don't see a difference, that's not unexpected - our work found that many studies were minimally affected, though some had vast changes in "Unassigned" annotations.
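
Alongside the barplots, it can help to list exactly which features were labeled differently by the two references. The sketch below works on plain dictionaries mapping feature IDs to taxonomy strings; the data shown are made up, and `changed_assignments` is a hypothetical helper. With real results, one way to obtain such dictionaries (an assumption to verify in your environment) is `artifact.view(pandas.DataFrame)['Taxon'].to_dict()` on each classification artifact.

```python
def changed_assignments(base, extended):
    """Return {feature_id: (base_label, extended_label)} for labels that differ."""
    shared_ids = sorted(set(base) & set(extended))
    return {
        fid: (base[fid], extended[fid])
        for fid in shared_ids
        if base[fid] != extended[fid]
    }


# Made-up classifications, for illustration only
base_silva = {
    "feature_1": "d__Bacteria; p__Proteobacteria",
    "feature_2": "Unassigned",
}
extended_silva = {
    "feature_1": "d__Bacteria; p__Proteobacteria",
    "feature_2": "d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria",
}

for fid, (old, new) in changed_assignments(base_silva, extended_silva).items():
    print(f"{fid}: {old} -> {new}")
```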

In [None]:
from qiime2 import Visualization
from qiime2.plugins.taxa.visualizers import barplot

#we'll do this in separate cells so the barplots are easier to compare

#base silva:

table_path = join(output_dir, 'feature_table_filtered_silva.qza')
feature_table = Artifact.load(table_path)

classification_taxonomy_path = join(working_dir, 'output', 'silva_classification_taxonomy.qza')
classification_taxonomy = Artifact.load(classification_taxonomy_path)

vis = barplot(table = feature_table, taxonomy = classification_taxonomy, metadata = metadata)
visualization_path = join(working_dir, 'output', 'silva_taxa_barplot.qzv')
vis.visualization.save(visualization_path)

vis.visualization

In [None]:
#extended silva:

table_path = join(output_dir, 'feature_table_filtered_silva_extended.qza')
feature_table = Artifact.load(table_path)

classification_taxonomy_path = join(working_dir, 'output', 'silva_extended_classification_taxonomy.qza')
classification_taxonomy = Artifact.load(classification_taxonomy_path)

vis = barplot(table = feature_table, taxonomy = classification_taxonomy, metadata = metadata)
visualization_path = join(working_dir, 'output', 'silva_extended_taxa_barplot.qzv')
vis.visualization.save(visualization_path)

vis.visualization

### References

#### Silva:
Pruesse, Elmar, Christian Quast, Katrin Knittel, Bernhard M. Fuchs, Wolfgang Ludwig, Jörg Peplies, and Frank Oliver Glöckner. 2007. “SILVA: A Comprehensive Online Resource for Quality Checked and Aligned Ribosomal RNA Sequence Data Compatible with ARB.” Nucleic Acids Research 35 (21): 7188–96. doi: 10.1093/nar/gkm864

Quast, Christian, Elmar Pruesse, Pelin Yilmaz, Jan Gerken, Timmy Schweer, Pablo Yarza, Jörg Peplies, and Frank Oliver Glöckner. 2013. “The SILVA Ribosomal RNA Gene Database Project: Improved Data Processing and Web-Based Tools.” Nucleic Acids Research 41: D590–96. doi: 10.1093/nar/gks1219

#### Greengenes:
DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL. 2006. Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Appl Environ Microbiol 72. https://doi.org/10.1128/AEM.03006-05

McDonald, D., Price, M., Goodrich, J. et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J 6, 610–618 (2012). https://doi.org/10.1038/ismej.2011.139

#### Metaxa2:
Bengtsson-Palme, J., Hartmann, M., Eriksson, K. M., Pal, C., Thorell, K., Larsson, D. G., & Nilsson, R. H. (2015). METAXA2: improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular ecology resources, 15(6), 1403–1414. https://doi.org/10.1111/1755-0998.12399

#### PhytoRef:
Decelle, J., Romac, S., Stern, R. F., Bendif, elM., Zingone, A., Audic, S., Guiry, M. D., Guillou, L., Tessier, D., Le Gall, F., Gourvil, P., Dos Santos, A. L., Probert, I., Vaulot, D., de Vargas, C., & Christen, R. (2015). PhytoREF: a reference database of the plastidial 16S rRNA gene of photosynthetic eukaryotes with curated taxonomy. Molecular ecology resources, 15(6), 1435–1445. https://doi.org/10.1111/1755-0998.12401

#### RESCRIPt:
Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581. doi: 10.1371/journal.pcbi.1009581

#### QIIME2:
Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo-Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, and Caporaso JG. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37: 852–857. https://doi.org/10.1038/s41587-019-0209-9

#### VSEARCH:
Torbjørn Rognes, Tomáš Flouri, Ben Nichols, Christopher Quince, and Frédéric Mahé. 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ 4: e2584. doi: 10.7717/peerj.2584

