## Create QIIME2 Taxonomy Artifact Files

To annotate taxonomy, QIIME2 requires taxonomy .qza artifact files. In this notebook, we create such files by supplementing the Greengenes and SILVA databases with additional mitochondrial 12S rRNA gene sequences from the MeTaxa2 project.

The script assumes the following files are already together in one folder (by default, ../output/taxonomy_references/):

-- gg_13_8_otus/ (derived from gg_13_8_otus.tar.gz, downloaded from ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz)
-- SILVA_132_QIIME_release/ (derived from Silva_132_release.zip, downloaded from https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip)
-- metaxa2.fasta (constructed from the MeTaxa2 BLAST database, https://microbiology.se/sw/Metaxa2_2.2.1.tar.gz)

These are all downloaded and extracted by the download_starting_taxonomy_files.ipynb notebook.

The script further requires the following libraries:

-- The QIIME2 software and dependencies (we ran it from within the qiime2-2020.6 conda environment)
-- Biopython installed within the QIIME2 conda environment (we used conda install biopython)


#### Import python libraries

Note that this step will fail if Biopython or QIIME2 is not installed (specifically, Biopython must be installed within the qiime2 conda environment).

In [1]:
from Bio import SeqIO
from qiime2 import Artifact
import os

#### Define paths for new and existing files 

All paths are defined here to make the code below more modular.

For each resource (Greengenes or SILVA) we need a FASTA file with the actual sequences and a taxonomy file that maps those sequences to taxonomic categories. From these, we generate a new sequence file and a new taxonomy file that contain the original data plus the new sequences from MeTaxa2.

In [2]:
## Set up input and output directories (assumes starting from 'procedure' folder)
working_dir = os.path.abspath('../output/taxonomy_references/')

#Set up variables for relevant files
#Note in variable names we follow the convention m2 = MeTaxa2, gg = greengenes

#Existing files:
gg_fasta_path = working_dir + '/gg_13_8_otus/rep_set/99_otus.fasta'
gg_taxonomy_path = working_dir+'/gg_13_8_otus/taxonomy/99_otu_taxonomy.txt'

silva_fasta_path = working_dir + '/SILVA_132_QIIME_release/rep_set/rep_set_16S_only/99/silva_132_99_16S.fna'
silva_taxonomy_path = working_dir + '/SILVA_132_QIIME_release/taxonomy/16S_only/99/taxonomy_7_levels.txt'
m2_fasta_path = working_dir + '/metaxa2.fasta'


#New intermediate/raw data files:
gg_plus_m2_fasta_path = working_dir + '/m2+gg_otus.fasta'
gg_plus_m2_taxonomy_path = working_dir + '/m2+gg_taxonomy.txt'
silva_plus_m2_fasta_path = working_dir + '/m2+silva_otus.fasta'
silva_plus_m2_taxonomy_path = working_dir + '/m2+silva_taxonomy.txt'

#New QIIME2 .qza artifact files:
gg_otus_qza = working_dir + '/greengenes_otus.qza'
gg_taxonomy_qza = working_dir + '/greengenes_taxonomy.qza'
gg_m2_otus_qza = working_dir + '/greengenes_metaxa2_otus.qza'
gg_m2_taxonomy_qza = working_dir + '/greengenes_metaxa2_taxonomy.qza'
silva_otus_qza = working_dir + '/silva_otus.qza'
silva_taxonomy_qza = working_dir + '/silva_taxonomy.qza'
silva_m2_otus_qza = working_dir + '/silva_metaxa2_otus.qza'
silva_m2_taxonomy_qza = working_dir + '/silva_metaxa2_taxonomy.qza'
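
Because each resource pairs a FASTA file with a taxonomy file keyed on the same sequence IDs, it can be worth confirming that every FASTA record has a taxonomy entry before importing. The following is a minimal sketch using only the standard library. The helper names (`fasta_ids`, `taxonomy_ids`, `ids_missing_taxonomy`) are hypothetical and are not part of QIIME2 or Biopython, and the sketch assumes the headerless, tab-separated taxonomy layout (ID, then lineage) used by the files in this notebook.

```python
def fasta_ids(path):
    """Return the set of record IDs (first token after '>') in a FASTA file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                if header:
                    ids.add(header.split()[0])
    return ids


def taxonomy_ids(path):
    """Return the set of IDs in the first column of a headerless TSV taxonomy file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                ids.add(line.split("\t", 1)[0])
    return ids


def ids_missing_taxonomy(fasta_path, taxonomy_path):
    """Return a sorted list of FASTA IDs that have no line in the taxonomy file."""
    return sorted(fasta_ids(fasta_path) - taxonomy_ids(taxonomy_path))
```

Once the combined files exist, a call such as `ids_missing_taxonomy(gg_plus_m2_fasta_path, gg_plus_m2_taxonomy_path)` would be expected to return an empty list if the two files are consistent.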

#### Verify that all input files exist

Let's run a quick check to make sure all input files actually exist.

In [3]:
print("Verifying that all needed starting data files exist.")
for existing_file in [gg_fasta_path,gg_taxonomy_path,silva_fasta_path,silva_taxonomy_path,m2_fasta_path]:
    if not os.path.exists(existing_file):
        raise IOError(f"Required file {existing_file} not found. Please ensure it is in that directory.")
print("Done.")

Verifying that all needed starting data files exist.
Done.


#### Create a combined Greengenes plus MeTaxa2 fasta and taxonomy file

In [4]:
#extract mitochondria sequences from Metaxa2 and add them to new
#fasta and taxonomy files in the style of greengenes

#open in write mode so that re-running this cell does not append duplicate records
otu_file = open(gg_plus_m2_fasta_path, "w")
taxonomy_file = open(gg_plus_m2_taxonomy_path, "w")

#Add MeTaxa2 data to the gg_plus_m2 fasta and taxonomy files
for i, entry in enumerate(SeqIO.parse(m2_fasta_path, "fasta")):
    if 'mitochondria' in entry.description or 'Mitochondria' in entry.description:
        otu_file.write(">metaxa2_" + str(i) + "\n")
        otu_file.write(str(entry.seq) + "\n")
        taxonomy_file.write("metaxa2_" + str(i) + "\tk__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__mitochondria; g__; s__\n")

#copy greengenes otus into the gg_plus_m2 fasta file
for entry in SeqIO.parse(gg_fasta_path, "fasta"):
    otu_file.write(">" + str(entry.description) + "\n")
    otu_file.write(str(entry.seq) + "\n")

#close the combined fasta file so all records are flushed to disk
otu_file.close()

In [5]:
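#copy greengenes taxonomy into the combined taxonomy file
with open(gg_taxonomy_path) as greengenes_taxonomy_file:
    for line in greengenes_taxonomy_file:
        taxonomy_file.write(line)

#close the combined taxonomy file so all lines are flushed to disk before import
taxonomy_file.close()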

In [6]:
#Verify that we really created all files we were supposed to

print("Verifying that all gg_plus_m2 data files were created.")
for created_file in [gg_plus_m2_fasta_path,gg_plus_m2_taxonomy_path]:
    if not os.path.exists(created_file):
        raise IOError(f"Required file {created_file} not found. Please ensure it is in that directory.")
    print("Successfully created file:",created_file)
    
print("Done.")

Verifying that all gg_plus_m2 data files were created.
Successfully created file: /mnt/c/Users/Dylan/Documents/zaneveld/2_14_gcmp/GCMP_Global_Disease/analysis/organelle_removal/output/taxonomy_references/m2+gg_otus.fasta
Successfully created file: /mnt/c/Users/Dylan/Documents/zaneveld/2_14_gcmp/GCMP_Global_Disease/analysis/organelle_removal/output/taxonomy_references/m2+gg_taxonomy.txt
Done.
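
Checking that the files exist does not show that their contents are well formed. In particular, a combined FASTA file can end up with repeated IDs, for example if records are appended to an output file left over from an earlier run, and repeated feature IDs are likely to cause problems on import. The following is a minimal, standard-library sketch of a duplicate check. The function name `duplicate_fasta_ids` is hypothetical, and the ID is assumed to be the first whitespace-delimited token of each header line.

```python
from collections import Counter


def duplicate_fasta_ids(path):
    """Return a sorted list of record IDs that occur more than once in a FASTA file."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                if header:
                    counts[header.split()[0]] += 1
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)
```

Calling `duplicate_fasta_ids(gg_plus_m2_fasta_path)` on a cleanly generated file would be expected to return an empty list; a non-empty result suggests deleting the combined files and regenerating them.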


## Supplement the SILVA database with MeTaxa2 sequences

We'll now repeat the process to supplement the SILVA database with MeTaxa2 mitochondrial sequences.

In [7]:
#extract mitochondria sequences from Metaxa2 and create separate fasta and taxonomy files in the style of SILVA
#open in write mode so that re-running this cell does not append duplicate records
otu_file = open(silva_plus_m2_fasta_path, "w")
taxonomy_file = open(silva_plus_m2_taxonomy_path, "w")

for i, entry in enumerate(SeqIO.parse(m2_fasta_path, "fasta")):
    if 'mitochondria' in entry.description or 'Mitochondria' in entry.description:
        otu_file.write(">metaxa2_" + str(i) + "\n")
        otu_file.write(str(entry.seq) + "\n")
        taxonomy_file.write("metaxa2_" + str(i) + "\tD_0__Bacteria;D_1__Proteobacteria;D_2__Alphaproteobacteria;D_3__Rickettsiales;D_4__Mitochondria;D_5__uncultured bacterium;D_6__uncultured bacterium\n")

In [8]:
#copy silva otus into the combined fasta file
for entry in SeqIO.parse(silva_fasta_path, "fasta"):
    otu_file.write(">" + str(entry.description) + "\n")
    otu_file.write(str(entry.seq) + "\n")

#close the combined fasta file so all records are flushed to disk
otu_file.close()

In [9]:
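#copy silva taxonomy into the combined taxonomy file
with open(silva_taxonomy_path) as silva_taxonomy_file:
    for line in silva_taxonomy_file:
        taxonomy_file.write(line)

#close the combined taxonomy file so all lines are flushed to disk before import
taxonomy_file.close()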

In [10]:
#Verify that we really created all files we were supposed to

print("Verifying that all silva_plus_m2 data files were created.")
for created_file in [silva_plus_m2_fasta_path,silva_plus_m2_taxonomy_path]:
    if not os.path.exists(created_file):
        raise IOError(f"Required file {created_file} not found. Please ensure it is in that directory.")
    print("Successfully created file:",created_file)
    
print("Done.")

Verifying that all silva_plus_m2 data files were created.
Successfully created file: /mnt/c/Users/Dylan/Documents/zaneveld/2_14_gcmp/GCMP_Global_Disease/analysis/organelle_removal/output/taxonomy_references/m2+silva_otus.fasta
Successfully created file: /mnt/c/Users/Dylan/Documents/zaneveld/2_14_gcmp/GCMP_Global_Disease/analysis/organelle_removal/output/taxonomy_references/m2+silva_taxonomy.txt
Done.
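
The Greengenes and SILVA passes above differ only in the lineage string assigned to the mitochondrial records, so the shared logic could be factored into one helper. The following is a sketch under stated assumptions, not the notebook's actual code: the function name `write_mitochondrial_records` and the two constant names are hypothetical, records are passed as plain (description, sequence) string pairs so the example runs without Biopython, and the case-insensitive match is slightly broader than the two spellings tested above.

```python
# Lineage strings copied from the cells above (constant names are hypothetical)
GG_MITO_LINEAGE = (
    "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; "
    "o__Rickettsiales; f__mitochondria; g__; s__"
)
SILVA_MITO_LINEAGE = (
    "D_0__Bacteria;D_1__Proteobacteria;D_2__Alphaproteobacteria;"
    "D_3__Rickettsiales;D_4__Mitochondria;"
    "D_5__uncultured bacterium;D_6__uncultured bacterium"
)


def write_mitochondrial_records(records, lineage, fasta_out, taxonomy_out,
                                prefix="metaxa2_"):
    """Write mitochondrial records plus matching taxonomy lines.

    records: iterable of (description, sequence) string pairs.
    fasta_out, taxonomy_out: open, writable text file objects.
    Returns the number of records written.
    """
    written = 0
    for i, (description, sequence) in enumerate(records):
        # Assumption: any capitalization of "mitochondria" counts as a match
        if "mitochondria" not in description.lower():
            continue
        # As in the notebook, the index counts all records, not just matches
        seq_id = f"{prefix}{i}"
        fasta_out.write(f">{seq_id}\n{sequence}\n")
        taxonomy_out.write(f"{seq_id}\t{lineage}\n")
        written += 1
    return written
```

With Biopython available, the records could be supplied as `((rec.description, str(rec.seq)) for rec in SeqIO.parse(m2_fasta_path, "fasta"))`, with the output handles opened in `with` blocks so they are closed before import.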


## Import the Greengenes, SILVA, and MeTaxa2 data into QIIME2 and save .qza artifacts

We should now have everything we need to generate new QIIME2 .qza files that can be used for taxonomic annotation.

In [11]:
#create greengenes taxonomy and OTU artifacts
gg_otus = Artifact.import_data('FeatureData[Sequence]', gg_fasta_path)
gg_taxonomy = Artifact.import_data('FeatureData[Taxonomy]', gg_taxonomy_path, 'HeaderlessTSVTaxonomyFormat')

In [12]:
#create greengenes+m2 taxonomy and otu artifacts
gg_m2_otus = Artifact.import_data('FeatureData[Sequence]', gg_plus_m2_fasta_path)
gg_m2_taxonomy = Artifact.import_data('FeatureData[Taxonomy]', gg_plus_m2_taxonomy_path, 'HeaderlessTSVTaxonomyFormat')

In [13]:
#create SILVA taxonomy and OTU artifacts
silva_otus = Artifact.import_data('FeatureData[Sequence]', silva_fasta_path)
silva_taxonomy = Artifact.import_data('FeatureData[Taxonomy]', silva_taxonomy_path, 'HeaderlessTSVTaxonomyFormat')

In [14]:
#create SILVA+m2 taxonomy and otu artifacts
silva_m2_otus = Artifact.import_data('FeatureData[Sequence]', silva_plus_m2_fasta_path)
silva_m2_taxonomy = Artifact.import_data('FeatureData[Taxonomy]', silva_plus_m2_taxonomy_path, 'HeaderlessTSVTaxonomyFormat')

In [15]:
#export the artifacts to the output folder
gg_otus.save(gg_otus_qza)
gg_taxonomy.save(gg_taxonomy_qza)
gg_m2_otus.save(gg_m2_otus_qza)
gg_m2_taxonomy.save(gg_m2_taxonomy_qza)
silva_otus.save(silva_otus_qza)
silva_taxonomy.save(silva_taxonomy_qza)
silva_m2_otus.save(silva_m2_otus_qza)
silva_m2_taxonomy.save(silva_m2_taxonomy_qza)

'/mnt/c/Users/Dylan/Documents/zaneveld/2_14_gcmp/GCMP_Global_Disease/analysis/organelle_removal/output/taxonomy_references/silva_metaxa2_taxonomy.qza'