# Extended Taxonomy Construction Tutorial (QIIME2 Artifact API)

This tutorial walks through downloading the Silva, Greengenes, Metaxa2, and PhytoRef taxonomies, and then creates extended versions of the Silva and Greengenes databases by supplementing them with the Metaxa2 and PhytoRef sequences. These supplemented databases can be used for more accurate annotation of mitochondria, as described in the Sonett et al. [preprint](https://www.biorxiv.org/content/10.1101/2021.02.23.431501v2), using the [mitochondria removal tutorial found here](mitochondria_removal_protocol.ipynb).

While we noticed a substantial difference in mitochondrial annotations when supplementing with Metaxa2 sequences, we saw little difference in chloroplast annotations with the addition of PhytoRef sequences.

## Requirements


It's assumed that QIIME2 is installed and its virtual environment is activated. This tutorial was tested with qiime2-amplicon-2024.2 running on Windows 10 via WSL.

## Activating your QIIME2 virtual environment
NOTE: you have to activate your QIIME2 virtual environment *before* starting this notebook. That's awkward since you're already in the notebook. You may need to exit the notebook, activate QIIME2, then restart it. I usually do this by first reminding myself of which virtual environments I have available:

`conda env list`

Here's an example of the output in my case:
```
# conda environments:

base                  *  /Users/zaneveld/opt/anaconda3
qiime2-2020.8            /Users/zaneveld/opt/anaconda3/envs/qiime2-2020.8
qiime2-2021.11           /Users/zaneveld/opt/anaconda3/envs/qiime2-2021.11
qiime2-amplicon-2023.9     /Users/zaneveld/opt/anaconda3/envs/qiime2-amplicon-2023.9
```

If I wanted to activate the `qiime2-amplicon-2023.9` virtual environment, I would then use: 

`conda activate qiime2-amplicon-2023.9`

You can run this tutorial from a Jupyter notebook launched from Terminal on macOS, or from any Bash command line on Linux or the Windows Subsystem for Linux, as long as QIIME2 is installed.
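
If you are unsure which environment the notebook kernel is actually running in, a small check from inside Python can help. The sketch below assumes conda is managing the environment and relies on the `CONDA_DEFAULT_ENV` variable that conda sets on activation; the helper name `active_conda_env` is hypothetical and not part of QIIME2.

```python
import os
import sys


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None if unset."""
    # Assumption: conda exports CONDA_DEFAULT_ENV when an environment is active.
    environ = os.environ if environ is None else environ
    return environ.get('CONDA_DEFAULT_ENV')


print('Active conda environment:', active_conda_env())
print('Python executable:', sys.executable)
```

If the reported environment is not your QIIME2 environment, stop the notebook server, activate the right environment, and restart.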

## Arrangement of tutorial files

It's assumed that you've downloaded the zipped tutorial with `input`, `output` and `procedure` folders, and that this script is within the provided procedure folder (where it starts). 

Given all that, this tutorial builds the supplemented databases that the companion tutorial then uses to remove mitochondria from your 16S datasets using QIIME2. 


## Check that we can import QIIME2 objects

Before we go further, let's double check that we can import objects from QIIME2 by importing the core `Artifact` object.

In [2]:
try:
    from qiime2 import Artifact
    print('Good to go!')
except ModuleNotFoundError:
    raise ModuleNotFoundError('\nQIIME2 is not installed, or you are not in the proper environment.\nPlease stop the Jupyter server, install QIIME2 or activate the environment, and restart this notebook.') from None

Good to go!
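
The same idea can be extended to check several packages at once without importing them. This is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the package list simply reflects the two third-party packages imported in this tutorial (`qiime2` and `skbio`).

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# An empty list means both packages are visible from this kernel.
print(missing_modules(['qiime2', 'skbio']))
```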


### Download reference taxonomy files from the internet

Import the gzip, os, shutil, subprocess, and tarfile libraries (for file system commands) and urllib (for the download function).

In [3]:
import gzip
import os
import subprocess
import shutil
import tarfile
import urllib.request

Set working and reference directories and create the reference directory if it does not already exist:

In [4]:
working_dir = os.path.abspath(os.path.join('..'))
refs_dir = os.path.join(working_dir, 'input', 'taxonomy_references')

# makedirs also creates the parent 'input' folder if it is missing
os.makedirs(refs_dir, exist_ok=True)

Define a download function that we'll use to grab the Silva 138 SSU, Greengenes 13_8, Metaxa2, and PhytoRef data files.

In [5]:
def download_file(url, local_filepath):
    with urllib.request.urlopen(url) as response, open(local_filepath, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
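
Because some of these files are large, it may be convenient to skip downloads that have already completed when re-running the notebook. The variant below is one possible sketch, not the function used in the rest of this tutorial: `download_file_if_missing` is a hypothetical name, and writing to a temporary `.part` file before renaming is a design choice made here so that an interrupted download does not leave a truncated file at the final path.

```python
import os
import shutil
import tempfile
import urllib.request


def download_file_if_missing(url, local_filepath):
    """Download url to local_filepath unless that file already exists.

    Returns True if a download happened and False if it was skipped.
    """
    if os.path.exists(local_filepath):
        return False
    # Write next to the final location so the rename stays on one filesystem.
    target_dir = os.path.dirname(os.path.abspath(local_filepath))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix='.part')
    try:
        with os.fdopen(fd, 'wb') as out_file, \
                urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, out_file)
        os.replace(tmp_path, local_filepath)
    except BaseException:
        # Clean up the partial file, then let the original error propagate.
        os.remove(tmp_path)
        raise
    return True
```

It takes the same two arguments as `download_file`, so it could be swapped in for any of the download calls below.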

Download the Silva data files, kindly provided by the QIIME2 folks and pre-processed with [RESCRIPt](https://github.com/bokulich-lab/RESCRIPt).

In [6]:
download_file('https://data.qiime2.org/2023.5/common/silva-138-99-seqs.qza',
              os.path.join(refs_dir, 'silva_sequences_full.qza'))
download_file('https://data.qiime2.org/2023.5/common/silva-138-99-tax.qza',
              os.path.join(refs_dir, 'silva_taxonomy_full.qza'))

Download and unzip the Greengenes data files.

In [None]:
download_file('ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz',
              os.path.join(refs_dir, 'gg_13_8_otus.tar.gz'))

with tarfile.open(os.path.join(refs_dir, 'gg_13_8_otus.tar.gz'), 'r:gz') as tar:
    tar.extractall(refs_dir)
download_path = shutil.copyfile(os.path.join(refs_dir, 'gg_13_8_otus', 'taxonomy', '99_otu_taxonomy.txt'), os.path.join(refs_dir, 'greengenes_taxonomy.txt'))
download_path = shutil.copyfile(os.path.join(refs_dir, 'gg_13_8_otus', 'rep_set', '99_otus.fasta'), os.path.join(refs_dir, 'greengenes_sequences.fasta'))
print(f'Greengenes sequences can be found at {download_path}')
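
Only two files from the Greengenes archive are used in this tutorial, so extracting just those members is an option if disk space is tight. The following is a sketch under assumptions: `extract_selected` is a hypothetical helper, and the member names in the usage comment are inferred from the paths used in the cell above rather than checked against the archive listing.

```python
import os
import tarfile


def extract_selected(tar_path, member_names, dest_dir):
    """Extract only the named members of a .tar.gz archive into dest_dir."""
    wanted = set(member_names)
    with tarfile.open(tar_path, 'r:gz') as tar:
        members = [m for m in tar.getmembers() if m.name in wanted]
        missing = wanted - {m.name for m in members}
        if missing:
            raise KeyError(f'Not found in archive: {sorted(missing)}')
        tar.extractall(dest_dir, members=members)
    return [os.path.join(dest_dir, m.name) for m in members]


# Hypothetical usage (member names assumed from the paths in this tutorial):
# extract_selected(os.path.join(refs_dir, 'gg_13_8_otus.tar.gz'),
#                  ['gg_13_8_otus/taxonomy/99_otu_taxonomy.txt',
#                   'gg_13_8_otus/rep_set/99_otus.fasta'],
#                  refs_dir)
```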

Download and unzip the Metaxa2 data files.

In [None]:
download_file('https://microbiology.se/sw/Metaxa2_2.2.1.tar.gz',
              os.path.join(refs_dir, 'Metaxa2_2.2.1.tar.gz'))
with tarfile.open(os.path.join(refs_dir, 'Metaxa2_2.2.1.tar.gz'), 'r:gz') as tar:
    tar.extractall(refs_dir)

Metaxa files are contained within a BLAST database and need further extraction.

In [None]:
os.chdir(os.path.join(refs_dir, 'Metaxa2_2.2.1/metaxa2_db/SSU'))
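# check=True raises CalledProcessError if blastdbcmd fails, instead of continuing silently
subprocess.run(['blastdbcmd', '-entry', 'all', '-db', 'blast', '-out', 'metaxa2.fasta'], check=True)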
download_path = shutil.copyfile(os.path.join(refs_dir, os.path.join('Metaxa2_2.2.1', 'metaxa2_db', 'SSU', 'metaxa2.fasta')),
                os.path.join(refs_dir, 'metaxa2.fasta'))
os.chdir(working_dir)
print(f'Metaxa2 sequences can be found at {download_path}')
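
The extraction above shells out to `blastdbcmd`, which is part of the BLAST+ command line tools, so it has to be on your `PATH`. A quick way to confirm that before running the cell is sketched below; `require_tool` is a hypothetical helper name.

```python
import shutil


def require_tool(name):
    """Return the full path to an executable, or raise if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(
            f'{name} was not found on PATH; it is needed for this step.')
    return path


# Example: require_tool('blastdbcmd') returns its location or raises.
```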

Download and unzip the PhytoRef data files.

In [None]:
download_file('http://phytoref.sb-roscoff.fr/static/downloads/PhytoRef_with_taxonomy.fasta',
              os.path.join(refs_dir, 'PhytoRef_with_taxonomy.fasta'))
download_path = os.path.join(refs_dir, 'PhytoRef_with_taxonomy.fasta')
print(f'PhytoRef sequences can be found at {download_path}')
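
Downloaded FASTA files sometimes contain characters outside the IUPAC nucleotide alphabet, which can cause problems when the sequences are imported later. The sketch below is a plain-Python scan for such records; the function names are hypothetical and the parser is deliberately simple (it assumes a well-formed FASTA file whose first non-empty line is a header).

```python
IUPAC_DNA = set('ACGTURYSWKMBDHVN')


def iter_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith('>'):
                if header is not None:
                    yield header, ''.join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


def non_iupac_records(path):
    """Return headers of records containing non-IUPAC nucleotide characters."""
    return [header for header, seq in iter_fasta(path)
            if set(seq.upper()) - IUPAC_DNA]
```

Running `non_iupac_records` on a downloaded file lists the records that may need to be skipped.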

## Create supplemented Silva and Greengenes reference taxonomies

Silva and Greengenes have slightly different taxonomy naming schemes. We'll place the mitochondria sequences in a made-up family called "Mitochondria" under the order Rickettsiales. Chloroplasts are also different between Silva and Greengenes, so in Silva they go in order "Chloroplast" under class Cyanobacteriia, and in Greengenes they'll be class "Chloroplast" under phylum Cyanobacteria. 

Any additional information from the fasta headers is used as the species annotation, in the hope that the species information will populate when a query is a strong enough match.

First, we'll define variables to hold the taxonomy strings for mitochondria and chloroplasts in Silva and Greengenes, as well as the various ways organelles might be described.

In [54]:
silva_mitochondria_prefix = 'd__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__'
silva_chloroplast_prefix = 'd__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast; f__Chloroplast; g__Chloroplast; s__'
greengenes_mitochondria_prefix = 'k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__mitochondria; g__Mitochondria; s__'
greengenes_chloroplast_prefix = 'k__Bacteria; p__Cyanobacteria; c__Chloroplast; o__Chloroplast; f__Chloroplast; g__Chloroplast; s__'

mitochondria_descriptors = ['Mitochondria', 'mitochondria', 'Mitochondrion', 'mitochondrion']
chloroplast_descriptors = ['Chloroplast', 'chloroplast']
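
To make the naming scheme concrete, here is a small sketch of how a prefix and a header combine into one taxonomy string, and how that string breaks down into ranks. The helper names and the example header are hypothetical; real Metaxa2 and PhytoRef headers may be structured differently.

```python
def build_taxon(prefix, description, separator):
    """Append the last separator-delimited field of a header to a rank prefix."""
    return prefix + description.split(separator)[-1]


def split_ranks(taxon):
    """Split a 'x__Name; y__Name' taxonomy string into {rank_letter: name}."""
    ranks = {}
    for field in taxon.split(';'):
        rank, _, name = field.strip().partition('__')
        ranks[rank] = name
    return ranks


prefix = ('d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; '
          'o__Rickettsiales; f__Mitochondria; g__Mitochondria; s__')
# Made-up header, for illustration only.
taxon = build_taxon(prefix, 'Eukaryota;Mitochondria;Zea mays', ';')
print(taxon)
print(split_ranks(taxon))
```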

Next, we grab the fasta sequences and sequence information from the supplemental databases and add them to new taxonomy files.

In [55]:
from skbio import read as read_fasta

with open(os.path.join(refs_dir, 'silva_organelle_taxonomy.tsv'), 'w') as silva_taxonomy:
    with open(os.path.join(refs_dir, 'greengenes_organelle_taxonomy.tsv'), 'w') as gg_taxonomy:
        silva_taxonomy.write('Feature ID\tTaxon\n')
        gg_taxonomy.write('Feature ID\tTaxon\n')
        with open(os.path.join(refs_dir, 'organelle_sequences.fasta'), 'w') as organelle_seqs:
            metaxa2_fasta_path = os.path.join(refs_dir, 'metaxa2.fasta')
            for i, entry in enumerate(read_fasta(metaxa2_fasta_path, format = 'fasta')):
                description = ' '.join([entry.metadata['id'], entry.metadata['description']])
                for descriptor in mitochondria_descriptors:
                    if descriptor in description:
                        organelle_seqs.write(f'>metaxa2_mitochondria_{i}\n')
                        organelle_seqs.write(f'{entry}\n')
                        specific_info = entry.metadata['description'].split(';')[-1]
                        silva_taxonomy.write(f'metaxa2_mitochondria_{i}\t{silva_mitochondria_prefix}{specific_info}\n')
                        gg_taxonomy.write(f'metaxa2_mitochondria_{i}\t{greengenes_mitochondria_prefix}{specific_info}\n')
                        break
            phytoref_fasta_path = os.path.join(refs_dir, 'PhytoRef_with_taxonomy.fasta')
            for i, entry in enumerate(read_fasta(phytoref_fasta_path, format = 'fasta')):
                description = ' '.join([entry.metadata['id'], entry.metadata['description']])
                if 'XXXXXXXXXX' in entry:    #phytoref contains one sequence with non-IUPAC characters
                    continue
                for descriptor in chloroplast_descriptors:
                    if descriptor in description:
                        description = description.replace(';', '_')
                        organelle_seqs.write(f'>phytoref_chloroplast_{i}\n')
                        organelle_seqs.write(f'{entry}\n')
                        specific_info = entry.metadata['description'].split('|')[-1]
                        silva_taxonomy.write(f'phytoref_chloroplast_{i}\t{silva_chloroplast_prefix}{specific_info}\n')
                        gg_taxonomy.write(f'phytoref_chloroplast_{i}\t{greengenes_chloroplast_prefix}{specific_info}\n')
                        break    # write each entry once, even if several descriptors match
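
Since the sequence file and the two taxonomy files are written separately, it can be worth confirming that their feature IDs agree before importing them. The following is a minimal standard-library sketch; `check_ids` and its helpers are hypothetical names, and it assumes the taxonomy file has a single header row like the ones written above.

```python
from collections import Counter


def fasta_ids(fasta_path):
    """Return the ID (first word of each header line) for every record."""
    with open(fasta_path) as handle:
        return [line[1:].split()[0] for line in handle if line.startswith('>')]


def taxonomy_ids(tsv_path):
    """Return the first column of a taxonomy TSV, skipping its header row."""
    with open(tsv_path) as handle:
        handle.readline()  # header: 'Feature ID<TAB>Taxon'
        return [line.split('\t')[0] for line in handle if line.strip()]


def check_ids(fasta_path, tsv_path):
    """Report duplicated sequence IDs and IDs present in only one file."""
    seq_ids = fasta_ids(fasta_path)
    tax_ids = taxonomy_ids(tsv_path)
    return {
        'duplicates': sorted(i for i, n in Counter(seq_ids).items() if n > 1),
        'missing_taxonomy': sorted(set(seq_ids) - set(tax_ids)),
        'missing_sequence': sorted(set(tax_ids) - set(seq_ids)),
    }
```

All three lists should be empty for `organelle_sequences.fasta` checked against either organelle taxonomy file.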

Now we will import the Greengenes sequence files into QIIME2 (it's already been done for Silva). That will let us merge them into extended versions of the Silva and Greengenes databases. We'll save the organelle sequences and the sequences from the base taxonomy references as QIIME2 artifacts (sequence .qza files).

In [None]:
organelle_seqs = Artifact.import_data('FeatureData[Sequence]', os.path.join(refs_dir, 'organelle_sequences.fasta'))
greengenes_seqs = Artifact.import_data('FeatureData[Sequence]', os.path.join(refs_dir, 'greengenes_sequences.fasta'))
silva_seqs = Artifact.load(os.path.join(refs_dir, 'silva_sequences_full.qza'))

organelle_seqs.save(os.path.join(refs_dir, 'organelle_sequences_full.qza'))
greengenes_seqs.save(os.path.join(refs_dir, 'greengenes_sequences_full.qza'))
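
A `.qza` file is a zip archive, so you can peek at what was saved without loading it through QIIME2. The sketch below reads the top-level `metadata.yaml` and returns its key/value pairs; `peek_qza` is a hypothetical helper, and the layout it assumes (a single root folder containing `metadata.yaml` with `uuid`, `type`, and `format` entries) reflects my understanding of the archive format and may differ between versions.

```python
import zipfile


def peek_qza(qza_path):
    """Return key/value pairs from an artifact's top-level metadata.yaml."""
    with zipfile.ZipFile(qza_path) as archive:
        # Assumption: exactly one '<root>/metadata.yaml' sits one level deep.
        candidates = [name for name in archive.namelist()
                      if name.count('/') == 1
                      and name.endswith('/metadata.yaml')]
        if len(candidates) != 1:
            raise ValueError('Could not locate a unique top-level metadata.yaml')
        text = archive.read(candidates[0]).decode('utf-8')
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(':')
        if sep:
            info[key.strip()] = value.strip()
    return info


# Hypothetical usage:
# peek_qza(os.path.join(refs_dir, 'organelle_sequences_full.qza'))
```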

We'll merge the organelle sequences with the sequences from the base reference taxonomies to create what we call "extended" reference taxonomies.

In [None]:
from qiime2.plugins.feature_table.methods import merge_seqs

silva_extended_seqs = merge_seqs([organelle_seqs, silva_seqs])
greengenes_extended_seqs = merge_seqs([organelle_seqs, greengenes_seqs])

silva_extended_seqs.merged_data.save(os.path.join(refs_dir, 'silva_extended_sequences_full.qza'))
greengenes_extended_seqs.merged_data.save(os.path.join(refs_dir, 'greengenes_extended_sequences_full.qza'))
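
Conceptually, the merge combines two sets of ID-to-sequence records into one, which is why the organelle sequences were given distinctive IDs (`metaxa2_mitochondria_*`, `phytoref_chloroplast_*`) that should not collide with the base references. The plain-Python sketch below illustrates the idea with dictionaries. It is not how `merge_seqs` is implemented, and raising an error on shared IDs is a choice made for this illustration only.

```python
def merge_unique(*id_to_seq_maps):
    """Merge {feature_id: sequence} dicts, refusing IDs that appear twice."""
    merged = {}
    for mapping in id_to_seq_maps:
        overlap = merged.keys() & mapping.keys()
        if overlap:
            raise ValueError(
                f'Feature IDs present in more than one input: {sorted(overlap)}')
        merged.update(mapping)
    return merged


# Toy records, for illustration only.
organelles = {'metaxa2_mitochondria_0': 'ACGT'}
base = {'ref_1': 'GGCC', 'ref_2': 'TTAA'}
print(sorted(merge_unique(organelles, base)))
```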

Next, we'll select our region of interest. In this example (and in the paper), we chose the V4 primers used in the EMP protocol: 515F (Parada) and 806R (Apprill). The EMP lists the last name of the first author of the paper that reported each primer to avoid ambiguity.

In [None]:
from qiime2.plugins.feature_classifier.methods import extract_reads

forward_primer = 'GTGYCAGCMGCCGCGGTAA' #515F (Parada)
reverse_primer = 'GGACTACNVGGGTWTCTAAT' #806R (Apprill)

v4_silva_base_seqs = extract_reads(silva_seqs, forward_primer, reverse_primer, n_jobs = 4, read_orientation = 'forward')
v4_silva_extended_seqs = extract_reads(silva_extended_seqs.merged_data, forward_primer, reverse_primer, n_jobs = 4, read_orientation = 'forward')
v4_greengenes_base_seqs = extract_reads(greengenes_seqs, forward_primer, reverse_primer, n_jobs = 4, read_orientation = 'forward')
v4_greengenes_extended_seqs = extract_reads(greengenes_extended_seqs.merged_data, forward_primer, reverse_primer, n_jobs = 4, read_orientation = 'forward')

v4_silva_base_seqs.reads.save(os.path.join(refs_dir, 'silva_sequences.qza'))
v4_silva_extended_seqs.reads.save(os.path.join(refs_dir, 'silva_extended_sequences.qza'))
v4_greengenes_base_seqs.reads.save(os.path.join(refs_dir, 'greengenes_sequences.qza'))
v4_greengenes_extended_seqs.reads.save(os.path.join(refs_dir, 'greengenes_extended_sequences.qza'))
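
To see what region extraction means in practice, the sketch below performs a simplified in-silico PCR in plain Python: it expands the degenerate primer bases into regular expressions, finds the forward primer and the reverse complement of the reverse primer, and returns the sequence between them. This is an illustration only, with hypothetical function names. It requires exact matches to the degenerate primers and trims the primers from the result, and it does not reproduce the matching rules of `extract_reads`.

```python
import re

# IUPAC degenerate bases expanded to regex character classes.
IUPAC = {'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
         'R': '[AG]', 'Y': '[CT]', 'K': '[GT]', 'M': '[AC]',
         'S': '[CG]', 'W': '[AT]', 'B': '[CGT]', 'D': '[AGT]',
         'H': '[ACT]', 'V': '[ACG]', 'N': '[ACGT]'}
COMPLEMENT = str.maketrans('ACGTRYKMSWBDHVN', 'TGCAYRMKSWVHDBN')


def reverse_complement(seq):
    """Reverse-complement a DNA string that may contain IUPAC codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a degenerate primer into a regular expression."""
    return ''.join(IUPAC[base] for base in primer)


def extract_amplicon(sequence, forward_primer, reverse_primer):
    """Return the region between the primers, or None if either is absent."""
    fwd = re.search(primer_regex(forward_primer), sequence)
    if fwd is None:
        return None
    rev_pattern = re.compile(primer_regex(reverse_complement(reverse_primer)))
    rev = rev_pattern.search(sequence, fwd.end())
    if rev is None:
        return None
    return sequence[fwd.end():rev.start()]
```

For the primers used here, `reverse_complement('GGACTACNVGGGTWTCTAAT')` is the pattern searched for downstream of the 515F site.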

The taxonomy merge is a little more straightforward: we'll just merge the taxonomic annotations of our new sequences with the base taxonomy files to create our extended taxonomy files.

In [None]:
from qiime2.plugins.feature_table.methods import merge_taxa

silva_taxonomy = Artifact.load(os.path.join(refs_dir, 'silva_taxonomy_full.qza'))
silva_organelle_taxonomy = Artifact.import_data('FeatureData[Taxonomy]', os.path.join(refs_dir, 'silva_organelle_taxonomy.tsv'))

greengenes_taxonomy = Artifact.import_data('FeatureData[Taxonomy]', os.path.join(refs_dir, 'greengenes_taxonomy.txt'), 'HeaderlessTSVTaxonomyFormat')
greengenes_taxonomy.save(os.path.join(refs_dir, 'greengenes_taxonomy.qza'))
greengenes_organelle_taxonomy = Artifact.import_data('FeatureData[Taxonomy]', os.path.join(refs_dir, 'greengenes_organelle_taxonomy.tsv'))

silva_extended_taxonomy = merge_taxa([silva_taxonomy, silva_organelle_taxonomy])
greengenes_extended_taxonomy = merge_taxa([greengenes_taxonomy, greengenes_organelle_taxonomy])

silva_extended_taxonomy.merged_data.save(os.path.join(refs_dir, 'silva_extended_taxonomy.qza'))
greengenes_extended_taxonomy.merged_data.save(os.path.join(refs_dir, 'greengenes_extended_taxonomy.qza'))

print("Done!")
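
The Greengenes taxonomy file is imported with `HeaderlessTSVTaxonomyFormat` because, unlike the organelle taxonomy files written earlier, it has no `Feature ID`/`Taxon` header row. If you would rather work with a headered copy (for example, to inspect it alongside the organelle files), one option is sketched below; `add_taxonomy_header` is a hypothetical helper and the output path is up to you.

```python
def add_taxonomy_header(headerless_path, headered_path):
    """Copy a headerless two-column taxonomy TSV, adding a header row.

    Returns the number of data rows copied.
    """
    n_rows = 0
    with open(headerless_path) as src, open(headered_path, 'w') as dst:
        dst.write('Feature ID\tTaxon\n')
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(line if line.endswith('\n') else line + '\n')
                n_rows += 1
    return n_rows
```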

You're now ready to use the extended reference databases as you like. A quick demonstration of the effect of the extended Silva reference taxonomy is outlined in the [mitochondria removal tutorial found here](mitochondria_removal_protocol.ipynb).

### References

#### Silva:
Pruesse, Elmar, Christian Quast, Katrin Knittel, Bernhard M. Fuchs, Wolfgang Ludwig, Jörg Peplies, and Frank Oliver Glöckner. 2007. “SILVA: A Comprehensive Online Resource for Quality Checked and Aligned Ribosomal RNA Sequence Data Compatible with ARB.” Nucleic Acids Research 35 (21): 7188–96. doi: 10.1093/nar/gkm864

Quast, Christian, Elmar Pruesse, Pelin Yilmaz, Jan Gerken, Timmy Schweer, Pablo Yarza, Jörg Peplies, and Frank Oliver Glöckner. 2013. “The SILVA Ribosomal RNA Gene Database Project: Improved Data Processing and Web-Based Tools.” Nucleic Acids Research 41: D590–96. doi: 10.1093/nar/gks1219

#### Greengenes:
DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL. 2006. Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Appl Environ Microbiol 72. https://doi.org/10.1128/AEM.03006-05

McDonald, D., Price, M., Goodrich, J. et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J 6, 610–618 (2012). https://doi.org/10.1038/ismej.2011.139

#### Metaxa2:
Bengtsson-Palme, J., Hartmann, M., Eriksson, K. M., Pal, C., Thorell, K., Larsson, D. G., & Nilsson, R. H. (2015). METAXA2: improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular ecology resources, 15(6), 1403–1414. https://doi.org/10.1111/1755-0998.12399

#### PhytoRef:
Decelle, J., Romac, S., Stern, R. F., Bendif, elM., Zingone, A., Audic, S., Guiry, M. D., Guillou, L., Tessier, D., Le Gall, F., Gourvil, P., Dos Santos, A. L., Probert, I., Vaulot, D., de Vargas, C., & Christen, R. (2015). PhytoREF: a reference database of the plastidial 16S rRNA gene of photosynthetic eukaryotes with curated taxonomy. Molecular ecology resources, 15(6), 1435–1445. https://doi.org/10.1111/1755-0998.12401

#### RESCRIPt:
Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581. doi: 10.1371/journal.pcbi.1009581

#### QIIME2:
Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo-Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, and Caporaso JG. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37: 852–857. https://doi.org/10.1038/s41587-019-0209-9

