# Mitochondria removal tutorial (QIIME2 Artifact API)

This tutorial will cover how to use the preconstructed extended taxonomies from Sonett et al. [preprint](https://www.biorxiv.org/content/10.1101/2021.02.23.431501v2) to remove mitochondria from 16S rRNA gene sequence data in QIIME2 using the QIIME2 Artifact API

If you need to build your own extended taxonomy, see the extended [taxonomy creation tutorial here](extended_taxonomy_construction_tutorial.ipynb)

## Requirements


It's assumed that QIIME2 is installed. You can run this tutorial from a Jupyter notebook in Terminal on Mac, any BASH command line interface in Linux or the Windows Subsystem for Linux, or PowerShell on Windows.

It's also assumed that you've downloaded the zipped tutorial with `input`, `output` and `procedure` folders, and that this script is within the provided procedure folder. 

Given all that, this tutorial will discuss how to use the the supplemented databases to remove mitochondria from your 16S datasets using Qiime2. 

In [1]:
try:
    from qiime2 import Artifact, Metadata
except ModuleNotFoundError:
    raise ModuleNotFoundError('\nQIIME2 is not installed, or you are not in the proper environment.\nPlease stop the Jupyter server, install QIIME2 or activate the environment, and restart this notebook.') from None

# Classify taxonomy with VSEARCH
In order to remove mitochondria and chloroplast sequences, we must first classify the taxonomy of all 16S rRNA sequences in the library to figure out which derive from organelles. In this tutorial, we'll use VSEARCH to align our 16S sequences to the extended reference database. VSEARCH is included with QIIME2 and so you shouldn't need to install any additional software to run it.

### Set up the directories and import files for this part of the analysis

First we'll set up references to the directories that hold our raw data and results. The tutorial assumes that we have an `input`, `output`, and `procedure` folder within our analysis directory, and that we are running our jupyter notebook from the `procedure` folder. Therefore all paths to files in `input` will look like `../input/name_of_file.txt` and similarly files in the `output` folder will have paths like `../output/name_of_file.txt`.

We will start by assigning some relevant paths to variables. If you were adapting this tutorial to your own data, you would need to replace the filenames for the metadata, sequences, and feature table to your own. 

In [2]:
from os.path import join, abspath, exists

#Since we are in procedure, the main analysis directory ('working_dir')
#encloses our current directory
working_dir = abspath('..')

input_dir = join(working_dir,'input')
output_dir = join(working_dir, 'output')

#Define variables to hold the filepaths for metadata, sequences, and the extended taxonomy reference

#NOTE: these should be adjusted to point to your data
metadata_file_name = 'sample_metadata_live_vs_dead_combo.tsv'
sequence_file_name = 'rep_seqs_merged.qza'
feature_table_name = 'feature_table_live_vs_dead.qza'

#Combine the filenames with paths for your input folders
#We use the join function to ensure cross-system compatibility (i.e. / vs. \ issues)
metadata_path = join(input_dir, metadata_file_name)
seqs_path = join(input_dir, sequence_file_name)
feature_table_path = join(input_dir, feature_table_name)

print(f"Metadata filepath:{metadata_path}\n")
print(f"Sequence filepath:{seqs_path}\n")
print(f"Feature table filepath:{feature_table_path}\n")

Metadata filepath:/mnt/c/Users/dsone/Documents/zaneveld/organelle_removal/organelle_removal/Tutorial/input/sample_metadata_live_vs_dead_combo.tsv

Sequence filepath:/mnt/c/Users/dsone/Documents/zaneveld/organelle_removal/organelle_removal/Tutorial/input/rep_seqs_merged.qza

Feature table filepath:/mnt/c/Users/dsone/Documents/zaneveld/organelle_removal/organelle_removal/Tutorial/input/feature_table_live_vs_dead.qza



## Define variables holding the paths to our taxonomic references

In [3]:
#these will be automaticallly downloaded if not found
taxonomy_reference_dir = join(input_dir,'taxonomy_references')
base_silva_paths = [join(taxonomy_reference_dir, 'silva_sequences.qza'),
                    join(taxonomy_reference_dir, 'silva_taxonomy.qza')]
extended_silva_paths = [join(taxonomy_reference_dir, 'silva_extended_sequences.qza'),
                        join(taxonomy_reference_dir, 'silva_extended_taxonomy.qza')]

### Check for files and download key files if not present

In [4]:
import os
import shutil
import urllib.request

def download_file(url, local_filepath):
    with urllib.request.urlopen(url) as response, open(local_filepath, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)

In [5]:
print("Verifying that all needed starting data files exist.")

### Add file names for the actual files we need to the required files list

required_filepaths = ([taxonomy_reference_dir, metadata_path, seqs_path, feature_table_path]
                      + base_silva_paths + extended_silva_paths)

for existing_file in required_filepaths:
    if not exists(existing_file):
        if existing_file == taxonomy_reference_dir:
            os.mkdir(taxonomy_reference_dir)
        elif existing_file in base_silva_paths:
            download_file('https://data.qiime2.org/2021.4/common/silva-138-99-seqs-515-806.qza',
                          join(taxonomy_reference_dir, 'silva_sequences.qza'))
            download_file('https://data.qiime2.org/2021.4/common/silva-138-99-tax-515-806.qza', 
                          join(taxonomy_reference_dir, 'silva_taxonomy.qza'))
        elif existing_file in extended_silva_paths:
            download_file('https://zenodo.org/records/10251912/files/silva_extended_sequences.qza?download=1',
                          join(taxonomy_reference_dir, 'silva_extended_sequences.qza'))
            download_file('https://zenodo.org/records/10251912/files/silva_extended_taxonomy.qza?download=1',
                          join(taxonomy_reference_dir, 'silva_extended_taxonomy.qza'))            
        else:
            raise IOError(f"Required file {existing_file} not found. Please ensure it is in that directory.")
        
    print(f"{existing_file}.....OK")
print("Done.")

Verifying that all needed starting data files exist.
/mnt/c/Users/dsone/Documents/zaneveld/organelle_removal/organelle_removal/Tutorial/input/taxonomy_references.....OK
/mnt/c/Users/dsone/Documents/zaneveld/organelle_removal/organelle_removal/Tutorial/input/sample_metadata_live_vs_dead_combo.tsv.....OK
/mnt/c/Users/dsone/Documents/zaneveld/organelle_removal/organelle_removal/Tutorial/input/rep_seqs_merged.qza.....OK
/mnt/c/Users/dsone/Documents/zaneveld/organelle_removal/organelle_removal/Tutorial/input/feature_table_live_vs_dead.qza.....OK
/mnt/c/Users/dsone/Documents/zaneveld/organelle_removal/organelle_removal/Tutorial/input/taxonomy_references/silva_sequences.qza.....OK
/mnt/c/Users/dsone/Documents/zaneveld/organelle_removal/organelle_removal/Tutorial/input/taxonomy_references/silva_taxonomy.qza.....OK
/mnt/c/Users/dsone/Documents/zaneveld/organelle_removal/organelle_removal/Tutorial/input/taxonomy_references/silva_extended_sequences.qza.....OK
/mnt/c/Users/dsone/Documents/zaneveld

# Annotate sequences

We will use vsearch to annotate taxonomy. This will be done once for each of the refernce taxonomies :  Silva, and Silva + MeTaxa2 + phytoref reference mitocondrial sequences. 

## Load metadata and sequence files as .qza artifacts

Next we'll load your study-specific metadata and sequence files as QIIME2 Artifacts

In [6]:
from qiime2 import Artifact, Metadata
metadata = Metadata.load(metadata_path)
seqs = Artifact.load(seqs_path)

## Use VSEARCH to classify your sequences according to base and extended SILVA taxonomies

**Note:** On a full dataset this step can take a while to run. Adjusting the vsearch threads parameter can help speed up the process if you have enough memory to support multiple threads (about 8GB / thread has worked for us)

see https://forum.qiime2.org/t/vsearch-classifier-memory/8667/5

In [9]:
from qiime2.plugins.feature_classifier.pipelines import classify_consensus_vsearch

threads = 1


vsearch_results = {}
references = ['silva','silva_extended']
for reference in references:
    
    #Note: The next two lines just set paths to the sequence and taxonomy qza files, respectively
    #If using a custom reference set, you could either name it with the same naming scheme 
    # my_reference_sequences.qza and my_reference_taxonomy.qza, and add 'my_reference' to the 
    #references list up above, or just manually set the file names using this section as a 
    #loose guide.
    
    reference_otu_path = join(taxonomy_reference_dir, f'{reference}_sequences.qza')
    reference_taxonomy_path = join(taxonomy_reference_dir, f'{reference}_taxonomy.qza')
    
    #Load .qza files as QIIME2 artifacts
    reads = Artifact.load(reference_otu_path)
    taxonomy = Artifact.load(reference_taxonomy_path)
    
    #Run VSEARCH, and store results in the vsearch_results dictionary
    vsearch_results[reference] = classify_consensus_vsearch(seqs, reads, taxonomy, threads = threads)
    

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --usearch_global /tmp/qiime2/dylan/data/d7bb1a41-d3e8-465f-be11-90b58e1cf211/data/dna-sequences.fasta --id 0.8 --query_cov 0.8 --strand both --maxaccepts 10 --maxrejects 0 --db /tmp/qiime2/dylan/data/b41681fb-a4e7-4ef8-a23a-a26f1bcfd272/data/dna-sequences.fasta --threads 1 --output_no_hits --blast6out /tmp/q2-BLAST6Format-vnb3nekd



vsearch v2.22.1_linux_x86_64, 7.7GB RAM, 12 cores
https://github.com/torognes/vsearch

Reading file /tmp/qiime2/dylan/data/b41681fb-a4e7-4ef8-a23a-a26f1bcfd272/data/dna-sequences.fasta 100%
86453445 nt in 313734 seqs, min 54, max 2366, avg 276
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching unique query sequences: 127 of 464 (27.37%)


Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --usearch_global /tmp/qiime2/dylan/data/d7bb1a41-d3e8-465f-be11-90b58e1cf211/data/dna-sequences.fasta --id 0.8 --query_cov 0.8 --strand both --maxaccepts 10 --maxrejects 0 --db /tmp/qiime2/dylan/data/f5bba9f9-ccf6-4f7f-9610-aa9dcdb47eca/data/dna-sequences.fasta --threads 1 --output_no_hits --blast6out /tmp/q2-BLAST6Format-jbz41lto



vsearch v2.22.1_linux_x86_64, 7.7GB RAM, 12 cores
https://github.com/torognes/vsearch

Reading file /tmp/qiime2/dylan/data/f5bba9f9-ccf6-4f7f-9610-aa9dcdb47eca/data/dna-sequences.fasta 100%
88535151 nt in 322903 seqs, min 54, max 2366, avg 274
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching unique query sequences: 127 of 464 (27.37%)


## Save each of the taxonomy annotations for your sequences.

Finally, save the results of the vsearch taxonomy mapping into taxonomy .qza objects that you can use with downstream QIIME2 scripts. Note that these are *study-specific* classifications of your sequences, unlike the reference taxonomies, which are general.

In [14]:
for reference in vsearch_results:
    classification_taxonomy = vsearch_results[reference].classification
    
    #Create a .qza file to hold the results of applying a given reference taxonomy
    #to your specific sequences (as distinct from the *reference* taxonomy for all known species)
    output_filepath = join(working_dir,'output',f"{reference}_classification_taxonomy.qza")
    classification_taxonomy.save(output_filepath)

# Remove mitochondria from samples

Now that we have created study-specific taxonomic annotations in the output folder (e.g. `../output/silva_extended_classification_taxonomy.qza`), we can use them to filter our feature tables to remove mitochondria or chloroplast 16S sequences.

## Set up filepaths

First we'll set up variables to hold the filepaths we'll need. 

In [61]:
#already done up top -- I think this is a duplicated cell; variables aren't called below

# output_dir = abspath("../output")
# input_filepath = abspath("../input")
#mapping_file = metadata_path
#sequence_file = seqs_path


#These files are specific to your study
feature_table_name = 'feature_table_live_vs_dead.qza'
feature_table_path = join(input_dir, feature_table_name)



#Load the taxonomy files created in the last section.
taxonomy_files = {"silva_base": "../output/silva_reference_taxonomy.qza",\
                  "silva_extended": "../output/silva_extended_reference_taxonomy.qza"}

required_files = [feature_table,mapping_file,sequence_file]

### Generate filtered feature tables tables with organelle sequences removed.

Next we'll use the QIIME2 filter_table command to remove features that are annotated as mitochondria or chloroplasts according to each taxonomy, and output filtered feature tables and feature table summaries for each to allow comparison and reporting.

In [26]:
#rewrote this cell, keeping for reference atm

In [27]:
# from collections import defaultdict

# filtered_feature_tables_by_taxonomy = defaultdict(dict)

# feature_table = Artifact.load(feature_table_path)

# for label, taxonomy_file in taxonomy_files.items():
#     print(f"Analyzing data using the {label} taxonomy ({taxonomy_file})")
#     taxonomy = Artifact.load(taxonomy_file)
#     print(f"Removing mitochondia from: {feature_table}")

#     #Remove organelle sequeces from the feature table
#     #Note that the Qiime2 API does not return a single object, 
#     #but rather a named Tuple struture with each output, which we save in filter_table_results
#     filter_table_results = filter_table(feature_table,taxonomy,exclude="mitochondria,chloroplast",mode="contains")
    
#     #Additionally remove any samples not in the metadata
#     filter_table_results = filter_samples(filter_table_results.filtered_table,metadata=metadata)
#     filtered_table = filter_table_results.filtered_table
    
#     #Save the filtered feature table to a .qza file in the output directory
#     output_filename = f"feature_table_filtered_{label}_mws.qza"
#     output_filepath = join(output_dir,output_filename)
#     print(f"Saving results to: {output_filepath}")
#     filtered_table.save(output_filepath)
    
#     #Output a feature table summary visualization
#     summary_visualization = summarize(filtered_table,sample_metadata=metadata)
#     vis = summary_visualization.visualization
#     output_filename = f"feature_table_filtered_{label}_mws.qzv"
#     output_filepath = join(output_dir,output_filename)
#     print(f"Saving file summary to: {output_filepath}")
    
#     filtered_feature_tables_by_taxonomy[label]=filtered_table
#     print(f"Done with processing {label} taxonomy annotations!\n\n")

In [25]:
from collections import defaultdict
from qiime2.plugins.feature_table.methods import filter_samples
from qiime2.plugins.feature_table.visualizers import summarize
from qiime2.plugins.taxa.methods import filter_table

filtered_feature_tables_by_taxonomy = defaultdict(dict)

feature_table = Artifact.load(feature_table_path)

for reference, classification_taxonomy in vsearch_results.items():
    print(f'Analyzing data using the {reference} classification taxonomy')
    print(f'Removing mitochondria from: {feature_table_name}')
    
    #Remove organelle sequeces from the feature table
    #Note that the Qiime2 API does not return a single object, 
    #but rather a named Tuple struture with each output, which we save in filter_table_results
    filter_table_results = filter_table(table = feature_table, taxonomy = classification_taxonomy.classification,
                                        exclude = 'mitochondria,chloroplast',
                                        mode = 'contains')
    #Additionally remove any samples not in the metadata
    filter_table_results = filter_samples(filter_table_results.filtered_table,
                                          metadata = metadata)
    filtered_table = filter_table_results.filtered_table
    
    #Save the filtered feature table to a .qza file in the output directory
    filtered_table_path = join(output_dir, f'feature_table_filtered_{reference}.qza')
    print(f'Saving results to: {filtered_table_path}')
    filtered_table.save(filtered_table_path)
    
    #output a feature table summary visualization
    summary_visualization = summarize(filtered_table, sample_metadata = metadata)
    vis = summary_visualization.visualization
    visualization_path = join(output_dir, f'feature_table_filtered_{reference}.qzv')
    print(f'Saving file summary to: {visualization_path}\n')
    vis.save(visualization_path)
print('Done.')    

Analyzing data using the silva classification taxonomy
Removing mitochondria from: feature_table_live_vs_dead.qza
Saving results to: /mnt/c/Users/dsone/Documents/zaneveld/organelle_removal/organelle_removal/Tutorial/output/feature_table_filtered_silva.qza
Saving file summary to: /mnt/c/Users/dsone/Documents/zaneveld/organelle_removal/organelle_removal/Tutorial/output/feature_table_filtered_silva.qzv

Analyzing data using the silva_extended classification taxonomy
Removing mitochondria from: feature_table_live_vs_dead.qza
Saving results to: /mnt/c/Users/dsone/Documents/zaneveld/organelle_removal/organelle_removal/Tutorial/output/feature_table_filtered_silva_extended.qza
Saving file summary to: /mnt/c/Users/dsone/Documents/zaneveld/organelle_removal/organelle_removal/Tutorial/output/feature_table_filtered_silva_extended.qzv

Done.


# Rarefy tables to an even depth

In [None]:
#do we want to include this rarefaction step? I assume we want to replace as little of their pipeline as possible

In [68]:
#choose a rarefaction depth appropriate for your study.
rarefaction_depth = 1000

rarefied_feature_tables_by_taxonomy = {}

for label, filtered_feature_tables in filtered_feature_tables_by_taxonomy.items():
    print(f'Rarefying feature table {filtered_feature_tables} to {rarefaction_depth} sequences/sample.')
    rarefy_results = rarefy(table=filtered_feature_tables,sampling_depth=rarefaction_depth)
    #get the rarefied table out of the NamedTuple of results
    rarefied_filtered_table = rarefy_results.rarefied_table
        
    #save the resulting feature table
    output_filename = f'feature_table_{label}_{rarefaction_depth}.qza'
    output_filepath = join(output_dir,output_filename)
    print(f'Saving results to:{output_filepath}')
    rarefied_filtered_table.save(output_filepath)
        
    #store the rarefied tables in a dict so they don't need to relode them
    rarefied_feature_tables_by_taxonomy[label] = rarefied_filtered_table

Rarefying feature table <artifact: FeatureTable[Frequency] uuid: 44d128c1-3928-45c1-812b-5b4951168af0> to 1000 sequences/sample.
Saving results to:{output_filepath}
Rarefying feature table <artifact: FeatureTable[Frequency] uuid: 198b8748-a810-412f-b31a-ec9bbdd5def8> to 1000 sequences/sample.
Saving results to:{output_filepath}
Rarefying feature table <artifact: FeatureTable[Frequency] uuid: c6323e2b-0a78-4e72-939e-08cc227a6085> to 1000 sequences/sample.
Saving results to:{output_filepath}
Rarefying feature table <artifact: FeatureTable[Frequency] uuid: 5bb410e1-75b4-41c1-8887-073b4db4ac1e> to 1000 sequences/sample.
Saving results to:{output_filepath}
