# Extended Taxonomy Construction Tutorial (QIIME2 Artifact API)

This tutorial will walk through the dowloading of the Silva, Greengenes, Metaxa2, and PhytoRef taxonomies, and create extended versions of the Silva and Greengenes databases by supplementing them with the Metaxa2 and PhytoRef taxonomy sequences. These supplemented databases can be used for more accurate annotation of mitochondria as described in Sonett et al. [preprint](https://www.biorxiv.org/content/10.1101/2021.02.23.431501v2) using the [mitochondria removal tutorial found here](mitochondria_removal_protocol.ipynb).

While we noticed a substantial difference in mitochondrial annotations when supplementing with Metaxa2 sequences, we saw little difference in chloroplast annotations with the addition of Phytoref sequences.

## Requirements


It's assumed that QIIME2 is installed, and it's virtual environment activated. This tutorial was tested with qiime2-amplicon-2024.2 running on Windows 10 via WSL.

## Activating your QIIME2 virtual environment
NOTE: you have to activate your QIIME2 virtual environment *before* starting this notebook. That's awkward since you're already in the notebook. You may need to exit the notebook, activate QIIME2, then restart it. I usually do this by first reminding myself of which virtual environments I have available:

`conda env list`

Here's an example of the output in my case:
`# conda environments:

base                  *  /Users/zaneveld/opt/anaconda3
qiime2-2020.8            /Users/zaneveld/opt/anaconda3/envs/qiime2-2020.8
qiime2-2021.11           /Users/zaneveld/opt/anaconda3/envs/qiime2-2021.11
qiime2-amplicon-2023.9     /Users/zaneveld/opt/anaconda3/envs/qiime2-amplicon-2023.9
`

If I wanted to activate the `qiime2-amplicon-2023.9` virtual environment, I would then use: 

`conda activate qiime2-amplicon-2023.9`

. You can run this tutorial from a Jupyter notebook in Terminal on Mac, any BASH command line interface in Linux or the Windows Subsystem for Linux as long as QIIME2 is installed.

## Arrangement of tutorial files

It's  assumed that you've downloaded the zipped tutorial with `input`, `output` and `procedure` folders, and that this script is within the provided procedure folder (where it starts). 

Given all that, this tutorial will discuss how to use the the supplemented databases to remove mitochondria from your 16S datasets using QIIME2. 


## Check that we can import QIIME2 objects

Before we go further, let's double check that we can import objects from QIIME2 by importing the core `Artifact` object.

In [2]:
try:
    from qiime2 import Artifact
    print('Good to go!')
except ModuleNotFoundError:
    raise ModuleNotFoundError('\nQIIME2 is not installed, or you are not in the proper environment.\nPlease stop the Jupyter server, install QIIME2 or activate the environment, and restart this notebook.') from None

Good to go!


### Download referance taxonomy files from the internet.

Import the gzip, os, shutil, subprocess, and tarfile libraries (for file system commands) and urllib (for the download function). Set working and reference directories, define a download function, and create the reference directory if it does not already exist:

In [37]:
import gzip
import os
import subprocess
import shutil
import tarfile
import urllib.request

In [23]:
working_dir = abspath('../output')
refs_dir = join(working_dir, 'taxonomy_references')

In [36]:
def download_file(url, local_filepath):
    with urllib.request.urlopen(url) as response, open(local_filepath, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)

In [8]:
if not exists(refs_dir):
    os.mkdir(refs_dir)

Download and unpack the Silva 138 SSU, Greengenes 13_8, Metaxa2, and Phytoref files.

In [17]:
download_file('https://www.arb-silva.de/fileadmin/silva_databases/release_138/Exports/SILVA_138_SSURef_NR99_tax_silva_trunc.fasta.gz',
              os.path.join(refs_dir, 'silva_sequences.fasta.gz'))
download_file('https://www.arb-silva.de/fileadmin/silva_databases/release_138/Exports/taxonomy/taxmap_slv_ssu_ref_138.txt.gz',
              os.path.join(refs_dir, 'silva_taxonomy.txt.gz'))

In [20]:
with gzip.open(os.path.join(refs_dir, 'silva_sequences.fasta.gz'), 'rb') as zipped:
    with open(os.path.join(refs_dir, 'silva_sequences.fasta'), 'wb') as unzipped:
        shutil.copyfileobj(zipped, unzipped)
with gzip.open(os.path.join(refs_dir, 'silva_taxonomy.txt.gz'), 'rb') as zipped:
    with open(os.path.join(refs_dir, 'silva_taxonomy.txt'), 'wb') as unzipped:
        shutil.copyfileobj(zipped, unzipped)

In [52]:
download_file('ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz',
              os.path.join(refs_dir, 'gg_13_8_otus.tar.gz'))
with tarfile.open(os.path.join(refs_dir, 'gg_13_8_otus.tar.gz'), 'r:gz') as tar:
    tar.extractall(refs_dir)
shutil.copyfile(os.path.join(refs_dir, 'gg_13_8_otus', 'rep_set', '99_otus.fasta'), os.path.join(refs_dir, 'greengenes_sequences.fasta'))

'/mnt/c/Users/Dylan/Documents/zaneveld/organelle_removal/Tutorial/qiime2_API_tutorial/output/taxonomy_references/greengenes_sequences.fasta'

In [27]:
download_file('https://microbiology.se/sw/Metaxa2_2.2.1.tar.gz',
              os.path.join(refs_dir, 'Metaxa2_2.2.1.tar.gz'))
with tarfile.open(os.path.join(refs_dir, 'Metaxa2_2.2.1.tar.gz'), 'r:gz') as tar:
    tar.extractall(refs_dir)

Metaxa files are contained within a BLAST database and need further extraction.

In [31]:
os.chdir(os.path.join(refs_dir, 'Metaxa2_2.2.1/metaxa2_db/SSU'))
subprocess.run(['blastdbcmd', '-entry', 'all', '-db', 'blast', '-out', 'metaxa2.fasta'])
shutil.copyfile(os.path.join(refs_dir, os.path.join('Metaxa2_2.2.1', 'metaxa2_db', 'SSU', 'metaxa2.fasta')),
                os.path.join(refs_dir, 'metaxa2.fasta'))
os.chdir(working_dir)

In [32]:
download_file('http://phytoref.sb-roscoff.fr/static/downloads/PhytoRef_with_taxonomy.fasta',
              os.path.join(refs_dir, 'PhytoRef_with_taxonomy.fasta'))

# Create supplemented Silva and Greengenes reference taxonomies

Using Biopython, grab the sequences and sequence information from the supplmetal databases and add them to new taxonomy files.

In [43]:
from Bio import SeqIO

Silva and Greengenes have slightly different taxonomy naming schemes. We'll place the mitochondria sequences in a made-up family called "Mitochondria" under the order Rickettsiales. Chloroplasts are also different between Silva and Greengenes, so in Silva they go in order "Chloroplast" under class Cyanobacteriia, and in Greengenes they'll be class "Chloroplast" under phylum Cyanobacteria. 

Any information from the fasta files will be the species annotation, with the hope that if something is a strong enough match that the species info will populate.

In [41]:
silva_mitochondria_prefix = 'd__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__'
silva_chloroplast_prefix = 'd__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast; f__Chloroplast; g__Chloroplast; s__'
greengenes_mitochondria_prefix = 'k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__mitochondria; g__Mitochondria; s__'
greengenes_chloroplast_prefix = 'k__Bacteria; p__Cyanobacteria; c__Chloroplast; o__Chloroplast; f__Chloroplast; g__Chloroplast; s__'

In [44]:
with open(os.path.join(refs_dir, 'silva_organelle_taxonomy.tsv'), 'w') as silva_taxonomy:
    with open(os.path.join(refs_dir, 'gg_organelle_taxonomy.tsv'), 'w') as gg_taxonomy:
        silva_taxonomy.write('Feature ID\tTaxon\n')
        gg_taxonomy.write('Feature ID\tTaxon\n')
        with open(os.path.join(refs_dir, 'organelle_sequences.fasta'), 'w') as organelle_seqs:
            for i, entry in enumerate(SeqIO.parse(os.path.join(refs_dir, 'metaxa2.fasta'), 'fasta')):
                if 'mitochondria' in entry.description or 'Mitochondria' in entry.description:
                    organelle_seqs.write(f'>metaxa2_mitochondria_{i}\n')
                    organelle_seqs.write(str(entry.seq + '\n'))
                    specific_info = str(entry.description).split(';')[-1]
                    silva_taxonomy.write(f'metaxa2_mitochondria_{i}\t{silva_mitochondria_prefix}{specific_info}\n')
                    gg_taxonomy.write(f'metaxa2_mitochondria_{i}\t{greengenes_mitochondria_prefix}{specific_info}\n')
            for i, entry in enumerate(SeqIO.parse(refs_dir + '/PhytoRef_with_taxonomy.fasta', 'fasta')):
                if not 'XXXXXXXXXX' in entry.seq:   #ditch the weird sequence
                    organelle_seqs.write(f'>phytoref_chloroplast_{i}\n')
                    organelle_seqs.write(str(entry.seq + '\n'))
                    specific_info = str(entry.description).split('|')[-1]
                    silva_taxonomy.write(f'phytoref_chloroplast_{i}\t{silva_chloroplast_prefix}{specific_info}\n')
                    gg_taxonomy.write(f'phytoref_chloroplast_{i}\t{greengenes_chloroplast_prefix}{specific_info}\n')

Now we will import the sequence files into QIIME2 in order to merge them. We'll also save the base versions as QZAs.

Silva publishes RNA sequences so we will convert that to DNA sequences as well.

In [57]:
from qiime2.plugins.rescript.methods import reverse_transcribe

In [61]:
organelle_seqs = Artifact.import_data('FeatureData[Sequence]', os.path.join(refs_dir, 'organelle_sequences.fasta'))
silva_seqs_rna = Artifact.import_data('FeatureData[RNASequence]', os.path.join(refs_dir, 'silva_sequences.fasta'))
silva_seqs = reverse_transcribe(silva_seqs_rna)
greengenes_seqs = Artifact.import_data('FeatureData[Sequence]', os.path.join(refs_dir, 'greengenes_sequences.fasta'))

silva_seqs.dna_sequences.save(os.path.join(refs_dir, 'silva_sequences_full.qza'))
greengenes_seqs.save(os.path.join(refs_dir, 'gg_sequences_full.qza'))

'/mnt/c/Users/Dylan/Documents/zaneveld/organelle_removal/Tutorial/qiime2_API_tutorial/output/taxonomy_references/gg_sequences.qza'

In [62]:
from qiime2.plugins.feature_table.methods import merge_seqs

In [66]:
silva_extended_seqs = merge_seqs([organelle_seqs, silva_seqs.dna_sequences])
greengenes_extended_seqs = merge_seqs([organelle_seqs, greengenes_seqs])

silva_extended_seqs.merged_data.save(os.path.join(refs_dir, 'silva_extended_sequences_full.qza'))
greengenes_extended_seqs.merged_data.save(os.path.join(refs_dir, 'greengenes_extended_sequences_full.qza'))

  for id_, seq in data.iteritems():
  for id_, seq in data.iteritems():


'/mnt/c/Users/Dylan/Documents/zaneveld/organelle_removal/Tutorial/qiime2_API_tutorial/output/taxonomy_references/greengenes_extended_sequences_full.qza'

Next, we'll select our region of interest. We use the V4 region in this example (and in the paper) based on the EMP protocol.

In [58]:
import Bio
from Bio import SeqIO
import glob
import tempfile
import pysam
from zipfile import ZipFile

from collections import defaultdict
from qiime2 import Artifact
from qiime2.plugins.feature_classifier.methods import classify_consensus_vsearch, extract_reads
from qiime2.metadata import Metadata
from qiime2.plugins.feature_table.methods import filter_samples, merge, merge_seqs, merge_taxa, rarefy
from qiime2.plugins.taxa.methods import filter_table
from qiime2.plugins.feature_table.visualizers import summarize

In [8]:
#import, select V4 region, merge, save
organelle_seqs = Artifact.import_data('FeatureData[Sequence]',
                                      refs_dir + '/organelle_sequences.fasta')
v4_organelle_seqs, = extract_reads(organelle_seqs, 'GTGYCAGCMGCCGCGGTAA',
                                   'GGACTACNVGGGTWTCTAAT', n_jobs = 24,
                                   read_orientation = 'forward')
silva_extended_seqs, = merge_seqs([v4_organelle_seqs,
                                   Artifact.load(refs_dir +
                                                 '/silva_sequences.qza')])
#save the sequence files for both extended files
silva_extended_seqs.save(refs_dir + '/silva_extended_sequences.qza')
gg_seqs = Artifact.import_data('FeatureData[Sequence]', refs_dir +
                               '/gg_13_8_otus/rep_set/99_otus.fasta')
v4_gg_seqs, = extract_reads(gg_seqs, 'GTGYCAGCMGCCGCGGTAA',
                            'GGACTACNVGGGTWTCTAAT', n_jobs = 24,
                            read_orientation = 'forward')
v4_gg_seqs.save(refs_dir + '/gg_sequences.qza')
gg_extended_seqs, = merge_seqs([organelle_seqs, gg_seqs])
gg_extended_seqs.save(refs_dir + '/gg_extended_sequences.qza')

'/home/tanya/Work_files/organelle_removal/output/taxonomy_references/gg_extended_sequences.qza'

In [9]:
#save the taxonomy files for both extended files
silva_organelle_taxonomy = Artifact.import_data('FeatureData[Taxonomy]',
                                                refs_dir +
                                                '/silva_organelle_taxonomy.tsv')
silva_extended_taxonomy, = merge_taxa([silva_organelle_taxonomy,
                                       Artifact.load(refs_dir +
                                                     '/silva_taxonomy.qza')])
silva_extended_taxonomy.save(refs_dir + '/silva_extended_taxonomy.qza')
gg_taxonomy = Artifact.import_data('FeatureData[Taxonomy]', refs_dir +
                                   '/gg_13_8_otus/taxonomy/99_otu_taxonomy.txt',
                                   'HeaderlessTSVTaxonomyFormat')
gg_taxonomy.save(refs_dir + '/gg_taxonomy.qza')
gg_organelle_taxonomy = Artifact.import_data('FeatureData[Taxonomy]',
                                             refs_dir +
                                             '/gg_organelle_taxonomy.tsv')
gg_extended_taxonomy, = merge_taxa([gg_organelle_taxonomy, gg_taxonomy])
gg_extended_taxonomy.save(refs_dir + '/gg_extended_taxonomy.qza')

'/home/tanya/Work_files/organelle_removal/output/taxonomy_references/gg_extended_taxonomy.qza'

# Classify taxonomy with vsearch 

### Set up the directories and import files for this part of the analysis

In [48]:
working_dir = abspath('..')

#import the metadata and sequence files from your analysis
metadata_path = working_dir + '/input/sample_metadata_live_vs_dead_combo.tsv'
seqs_path = working_dir + '/input/rep_seqs_merged.qza'

#import the taxonomy files created in the previously created step.
taxonomy_reference_dir = working_dir + '/output/taxonomy_references/'

### Verify that all files exist

In [49]:
print("Verifying that all needed starting data files exist.")
for existing_file in [working_dir,metadata_path,seqs_path,taxonomy_reference_dir]:
    if not os.path.exists(existing_file):
        raise IOError(f"Required file {existing_file} not found. Please ensure it is in that directory.")
print("Done.")

Verifying that all needed starting data files exist.
Done.


# Annotate sequences

We will use vsearch to annotate taxonomy. This will be done once for each of the refernce taxonomies created in the last section: Greengenes, Silva, Greengenes + MeTaxa2 + phytoref reference mitochondrial sequences, Silva + MeTaxa2 + phytoref reference mitocondrial sequences. 

Note: This step can take a wile to run. It is recommended to run this step overnight.

In [50]:
references = ['gg','silva','gg_extended','silva_extended']
metadata = Metadata.load(metadata_path)
seqs = Artifact.load(seqs_path)

In [51]:
vsearch_results = {}
for reference in references:
    reference_otu_path = taxonomy_reference_dir + f'{reference}_sequences.qza'
    reference_taxonomy_path = taxonomy_reference_dir + f'{reference}_taxonomy.qza'
    reads = Artifact.load(reference_otu_path)
    taxonomy = Artifact.load(reference_taxonomy_path)
    vsearch_results[reference] = classify_consensus_vsearch(seqs, reads, taxonomy, threads = 4)
    

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --usearch_global /tmp/qiime2-archive-6ri4ri5f/d7bb1a41-d3e8-465f-be11-90b58e1cf211/data/dna-sequences.fasta --id 0.8 --query_cov 0.8 --strand both --maxaccepts 10 --maxrejects 0 --db /tmp/qiime2-archive-5j06xdmi/9d299643-4922-44bc-8003-3d5185d76805/data/dna-sequences.fasta --threads 4 --output_no_hits --blast6out /tmp/tmpirqojtnn



vsearch v2.7.0_linux_x86_64, 15.5GB RAM, 8 cores
https://github.com/torognes/vsearch

Reading file /tmp/qiime2-archive-5j06xdmi/9d299643-4922-44bc-8003-3d5185d76805/data/dna-sequences.fasta 100%
51405917 nt in 202865 seqs, min 81, max 1003, avg 253
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching query sequences: 0 of 464 (0.00%)


Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --usearch_global /tmp/qiime2-archive-6ri4ri5f/d7bb1a41-d3e8-465f-be11-90b58e1cf211/data/dna-sequences.fasta --id 0.8 --query_cov 0.8 --strand both --maxaccepts 10 --maxrejects 0 --db /tmp/qiime2-archive-mjgxjjvw/b41681fb-a4e7-4ef8-a23a-a26f1bcfd272/data/dna-sequences.fasta --threads 4 --output_no_hits --blast6out /tmp/tmpqgopdrng



vsearch v2.7.0_linux_x86_64, 15.5GB RAM, 8 cores
https://github.com/torognes/vsearch

Reading file /tmp/qiime2-archive-mjgxjjvw/b41681fb-a4e7-4ef8-a23a-a26f1bcfd272/data/dna-sequences.fasta 100%
86453445 nt in 313734 seqs, min 54, max 2366, avg 276
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching query sequences: 127 of 464 (27.37%)


Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --usearch_global /tmp/qiime2-archive-6ri4ri5f/d7bb1a41-d3e8-465f-be11-90b58e1cf211/data/dna-sequences.fasta --id 0.8 --query_cov 0.8 --strand both --maxaccepts 10 --maxrejects 0 --db /tmp/qiime2-archive-ewumxl2w/327da406-e7db-4898-931f-df01072fcefd/data/dna-sequences.fasta --threads 4 --output_no_hits --blast6out /tmp/tmpr3s8ims5



vsearch v2.7.0_linux_x86_64, 15.5GB RAM, 8 cores
https://github.com/torognes/vsearch

Reading file /tmp/qiime2-archive-ewumxl2w/327da406-e7db-4898-931f-df01072fcefd/data/dna-sequences.fasta 100%
303450669 nt in 213319 seqs, min 46, max 5604, avg 1423
minseqlength 32: 1 sequence discarded.
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching query sequences: 373 of 464 (80.39%)


Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --usearch_global /tmp/qiime2-archive-6ri4ri5f/d7bb1a41-d3e8-465f-be11-90b58e1cf211/data/dna-sequences.fasta --id 0.8 --query_cov 0.8 --strand both --maxaccepts 10 --maxrejects 0 --db /tmp/qiime2-archive-4cipr0id/5a4b411f-26bc-4a88-9a05-7499d740c944/data/dna-sequences.fasta --threads 4 --output_no_hits --blast6out /tmp/tmpbso9jlee



vsearch v2.7.0_linux_x86_64, 15.5GB RAM, 8 cores
https://github.com/torognes/vsearch

Reading file /tmp/qiime2-archive-4cipr0id/5a4b411f-26bc-4a88-9a05-7499d740c944/data/dna-sequences.fasta 100%
88535911 nt in 322907 seqs, min 54, max 2366, avg 274
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching query sequences: 127 of 464 (27.37%)


### Save each of the taxonomy annotations for your sample sequences.

In [52]:
for reference in vsearch_results:
    classification_taxonomy, = vsearch_results[reference]
    classification_taxonomy.save(working_dir + '/output/' + str(reference) + '_reference_taxonomy.qza')

# Removal of mitochondria from samples

Set up needed files for removal of mitochondria from samples

In [61]:
output_filepath = abspath("../output")
input_filepath = abspath("../input")
output_dir = abspath("../output/filtered_tables")

#These 3 files are specific to your study
feature_table_dir = working_dir + '/input/feature_table_live_vs_dead.qza'
mapping_file = metadata_path
sequence_file = seqs_path

#Load the taxonomy files created in the last section.
taxonomy_files = {"greengenes_reference":"../output/gg_reference_taxonomy.qza",\
                  "greengenes_extended": "../output/gg_extended_reference_taxonomy.qza",\
                  "silva_reference": "../output/silva_reference_taxonomy.qza",\
                  "silva_extended": "../output/silva_extended_reference_taxonomy.qza"}

required_files = [feature_table,mapping_file,sequence_file]

Check that all files exist.

In [56]:
print("Verifying that all needed starting data files exist.")
for existing_files in required_files:
    if not exists(existing_file):
        raise IOError(f"Required file {existing_file} not found. Please ensure it is in the directory.")
        
print("Done.")


Verifying that all needed starting data files exist.
Done.


### Generate filtered tables using the default taxonomies for Greengenes and Silva as well as those supplemented with metaxa2 mitochondrial 16S rRNA and phytoref chloroplasts.

In [62]:
filtered_feature_tables_by_taxonomy = defaultdict(dict)

feature_table = Artifact.load(feature_table_dir)

for label, taxonomy_file in taxonomy_files.items():
    print(f"Analyzing data using the {label} taxonomy ({taxonomy_file})")
    taxonomy = Artifact.load(taxonomy_file)
    print(f"Removing mitochondia from: {feature_table}")
    #Qiime2 API does not return a single object, but a named Tuple struture with each output.
    filter_table_results = filter_table(feature_table,taxonomy,exclude="mitochondria,chloroplast",mode="contains")
    filter_table_results = filter_samples(filter_table_results.filtered_table,metadata=metadata)
    filtered_table = filter_table_results.filtered_table
    
    #save the file
    output_filename = f"feature_table_filtered_{label}_mws.qza"
    output_filepath = join(output_dir,output_filename)
    print(f"Saving results to: {output_filepath}")
    filtered_table.save(output_filepath)
    
    #output a file summary
    summary_visualization = summarize(filtered_table,sample_metadata=metadata)
    vis = summary_visualization.visualization
    output_filename = f"feature_table_filtered_{label}_mws.qzv"
    output_filepath = join(output_dir,output_filename)
    print(f"Saving file summary to: {output_filepath}")
    
    filtered_feature_tables_by_taxonomy[label]=filtered_table
    print(f"Done with processing {label} taxonomy annotations!\n\n")

Analyzing data using the greengenes_reference taxonomy (../output/gg_reference_taxonomy.qza)
Removing mitochondia from: <artifact: FeatureTable[Frequency] uuid: 97994d43-af84-4dcd-a4a5-316bcc00a472>
Saving results to: /home/tanya/Work_files/organelle_removal/output/filtered_tables/feature_table_filtered_greengenes_reference_mws.qza
Saving file summary to: /home/tanya/Work_files/organelle_removal/output/filtered_tables/feature_table_filtered_greengenes_reference_mws.qzv
Done with processing greengenes_reference taxonomy annotations!


Analyzing data using the greengenes_extended taxonomy (../output/gg_extended_reference_taxonomy.qza)
Removing mitochondia from: <artifact: FeatureTable[Frequency] uuid: 97994d43-af84-4dcd-a4a5-316bcc00a472>
Saving results to: /home/tanya/Work_files/organelle_removal/output/filtered_tables/feature_table_filtered_greengenes_extended_mws.qza
Saving file summary to: /home/tanya/Work_files/organelle_removal/output/filtered_tables/feature_table_filtered_greengen

# Rarefy tables to an even depth

In [68]:
#choose a rarefaction depth appropriate for your study.
rarefaction_depth = 1000

rarefied_feature_tables_by_taxonomy = defaultdict(dict)

for label,filtered_feature_tables in filtered_feature_tables_by_taxonomy.items():
    print(f"Rarefying feature table {filtered_feature_tables} to {rarefaction_depth} sequences/sample.")
    rarefy_results = rarefy(table=filtered_feature_tables,sampling_depth=rarefaction_depth)
    #get the rarefied table out of the NamedTuple of results
    rarefied_filtered_table = rarefy_results.rarefied_table
        
    #save the resulting feature table
    output_filename = f"feature_table_{label}_{rarefaction_depth}.qza"
    output_filepath = join(output_dir,output_filename)
    print("Saving results to:{output_filepath}")
    rarefied_filtered_table.save(output_filepath)
        
    #store the rarefied tables in a dict so they don't need to relode them
    rarefied_feature_tables_by_taxonomy[label] = rarefied_filtered_table

Rarefying feature table <artifact: FeatureTable[Frequency] uuid: 44d128c1-3928-45c1-812b-5b4951168af0> to 1000 sequences/sample.
Saving results to:{output_filepath}
Rarefying feature table <artifact: FeatureTable[Frequency] uuid: 198b8748-a810-412f-b31a-ec9bbdd5def8> to 1000 sequences/sample.
Saving results to:{output_filepath}
Rarefying feature table <artifact: FeatureTable[Frequency] uuid: c6323e2b-0a78-4e72-939e-08cc227a6085> to 1000 sequences/sample.
Saving results to:{output_filepath}
Rarefying feature table <artifact: FeatureTable[Frequency] uuid: 5bb410e1-75b4-41c1-8887-073b4db4ac1e> to 1000 sequences/sample.
Saving results to:{output_filepath}
