## Download Starting Taxonomy Files

To remove coral mitochondrial sequences, we first need to build supplemented versions of the Greengenes and SILVA reference databases. This notebook automates downloading and extracting those databases.

NOTE: These references are large, so they will take a while to download, and you should expect them to occupy several GB of hard drive space. (I'd hesitate before running this notebook if less than 15 GB are free on your hard drive.)
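
One way to check the free-space caveat above before starting is sketched below. The helper name `has_enough_free_space` and the 15 GB default are assumptions made for illustration; they are not part of the original notebook.

```python
import shutil

def has_enough_free_space(path=".", required_gb=15):
    """Return True if the drive containing `path` has at least `required_gb` GB free.

    Hypothetical helper: the 15 GB default mirrors the rule of thumb in the note above.
    """
    free_bytes = shutil.disk_usage(path).free
    free_gb = free_bytes / 1e9  # decimal gigabytes
    print(f"Free space at {path!r}: {free_gb:.1f} GB (want at least {required_gb} GB)")
    return free_gb >= required_gb
```

Calling `has_enough_free_space("../output")` before the download cells gives an early warning instead of a failed extraction partway through.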

gg_13_8_otus.tar.gz (from ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz)  
Silva_132_release.zip (from https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip)    
Metaxa2_2.2.1.tar.gz (from https://microbiology.se/sw/Metaxa2_2.2.1.tar.gz)


#### Define functions 

We will define a couple of utility functions: one for downloading a file from a given web address, and one for creating a directory.

In [1]:
#Set up utility functions for downloading data and organizing our folders

import urllib.request
import shutil
import os


def download_file(url, local_filepath):
    """Download a file from a remote url and save to a local filepath
    
    url - the web address of the file you want to download as a string
    local_filepath - the local filepath to which the file will be saved
    """

    print(f"Downloading file: {url}")
    # This is slightly convoluted-looking, but we are getting a response from the webpage and
    # then copying that to the file. 

    #Hat-tip to stack overflow: 
    #https://stackoverflow.com/questions/7243750/download-file-from-web-in-python-3

    with urllib.request.urlopen(url) as response, open(local_filepath, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    
    print(f"Saved to local filepath: {local_filepath}")
    
def make_directory(path):
    """Make a directory, but proceed without errors if it fails
    path -- the path to the directory (e.g. "../output/taxonomy_references")
    """
    try:
        os.mkdir(path)
    except OSError:
        print(f"Creation of directory {path} failed")
    else:
        print(f"Created the directory {path}")
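
Note that `os.mkdir` fails when the parent folder is missing. As an alternative sketch, `os.makedirs` with `exist_ok=True` creates any missing parent folders and is safe to call repeatedly. The name `ensure_directory` and the demo paths are hypothetical and do not appear in the original notebook.

```python
import os
import tempfile

def ensure_directory(path):
    """Create `path` and any missing parent folders; do nothing if it already exists.

    Hypothetical alternative to make_directory; returns True if the folder exists afterwards.
    """
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)

# Demo in a throwaway temporary folder (made-up path, not the real data folder)
demo_root = tempfile.mkdtemp()
nested = os.path.join(demo_root, "output", "taxonomy_references")
created = ensure_directory(nested)   # creates both "output" and "taxonomy_references"
repeated = ensure_directory(nested)  # second call is a harmless no-op
print(created, repeated)
```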


#### Set up filepaths

This notebook assumes that you launched Jupyter Notebook in the organelle_removal folder, then opened and ran the .ipynb file in the procedure folder. As such, it assumes that the output folder is at ../output/ relative to the starting working directory. If this is not correct (e.g. because your folders are organized differently), you can set the data_folder variable to a new absolute path.

In [2]:
# Filepaths and urls
data_folder = os.path.abspath("../output/taxonomy_references/")
#data_folder = "../output/taxonomy_references/"

gg_url = "ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz"
local_gg_filename = "gg_13_8_otus.tar.gz"

silva_url = "https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip"
local_silva_filename = "Silva_132_release.zip"
local_silva_filepath = os.path.join(data_folder,local_silva_filename)
local_gg_filepath = os.path.join(data_folder,local_gg_filename)

metaxa2_url = "https://microbiology.se/sw/Metaxa2_2.2.1.tar.gz"
local_metaxa2_filename = "Metaxa2_2.2.1.tar.gz"
local_metaxa2_filepath = os.path.join(data_folder,local_metaxa2_filename)
metaxa2_fasta_filepath = os.path.join(data_folder,'metaxa2.fasta')


In [3]:
#### Set up a folder to hold large taxonomy files
import os

# create the data folder if it doesn't already exist
if not os.path.exists(data_folder):
    print(f"Creating new output folder {data_folder}")
    make_directory(data_folder)
    
else:
    print(f"Results will be saved in existing output folder {data_folder}")



Results will be saved in existing output folder /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/taxonomy_references


#### Download the Greengenes Taxonomy

We'll now download the Greengenes 13_8 taxonomy reference.

In [4]:
download_file(url=gg_url,local_filepath = local_gg_filepath)

Downloading file: ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz
Saved to local filepath: /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/taxonomy_references/gg_13_8_otus.tar.gz
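
Because these downloads are slow, it can help to skip files that are already present when re-running the notebook. The sketch below is one possible approach and is not part of the original notebook: `download_if_missing` and `fake_downloader` are hypothetical names, and the example.org URL is a placeholder. The check only tests that a non-empty file exists; it does not verify that an earlier download completed.

```python
import os
import tempfile

def download_if_missing(url, local_filepath, downloader):
    """Call downloader(url, local_filepath) only when no non-empty local copy exists.

    Returns True if a download was triggered, False if it was skipped.
    """
    if os.path.exists(local_filepath) and os.path.getsize(local_filepath) > 0:
        print(f"Found existing file, skipping download: {local_filepath}")
        return False
    downloader(url, local_filepath)
    return True

# Demo with a stand-in downloader that writes a few bytes instead of using the network
def fake_downloader(url, local_filepath):
    with open(local_filepath, "wb") as out_file:
        out_file.write(b"placeholder")

demo_path = os.path.join(tempfile.mkdtemp(), "demo.tar.gz")
first_call = download_if_missing("https://example.org/demo.tar.gz", demo_path, fake_downloader)
second_call = download_if_missing("https://example.org/demo.tar.gz", demo_path, fake_downloader)
print(first_call, second_call)
```

In the notebook itself, the `download_file` function defined earlier could be passed as the `downloader` argument.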


We now want to expand the Greengenes .tar.gz file into our data folder so we can access the contents.

In [5]:
import tarfile

# The context manager closes the archive even if extraction raises an error
with tarfile.open(local_gg_filepath, "r:gz") as tar:
    tar.extractall(path=data_folder)
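
Before extracting a large archive, it can be useful to see which top-level folders it will create. The following is a minimal sketch, assuming a hypothetical helper `top_level_entries`; the tiny in-memory archive and its file names are made up so the example runs without downloading anything.

```python
import io
import tarfile

def top_level_entries(tar):
    """Return the sorted top-level names found in an open tar archive."""
    return sorted({member.name.split("/")[0] for member in tar.getmembers()})

# Build a tiny in-memory .tar.gz (made-up names) so the sketch runs without a download
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as demo_tar:
    payload = b"placeholder"
    info = tarfile.TarInfo(name="demo_otus/readme.txt")
    info.size = len(payload)
    demo_tar.addfile(info, io.BytesIO(payload))

# Rewind the buffer, then reopen the archive for reading
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:gz") as demo_tar:
    entries = top_level_entries(demo_tar)
print(entries)
```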

#### Download and Expand the SILVA 132 release

Now we'll download the SILVA 132 release and decompress it.
NOTE: this is a large (~2.47 GB) file, so it may take a while to download.

In [6]:
download_file(url=silva_url,local_filepath = local_silva_filepath)

Downloading file:  https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip
Saved to local filepath: /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/taxonomy_references/Silva_132_release.zip


In [7]:
from zipfile import is_zipfile, ZipFile

if not is_zipfile(local_silva_filepath):
    raise ValueError(f"The SILVA database zip file {local_silva_filepath} doesn't look like a zip file. Was it downloaded correctly?")

silva_zipfile = ZipFile(local_silva_filepath)

#Obnoxiously this file contains a _MACOSX subfolder. We don't want to unzip that...
files_to_extract = [m for m in silva_zipfile.namelist() if "_MACOSX" not in m]
print("Extracting SILVA database...")
silva_zipfile.extractall(path = data_folder,members = files_to_extract)
silva_zipfile.close()
print(f"Extracted the SILVA 132 database into: {data_folder}")

Extracting SILVA database...
Extracted the SILVA 132 database into: /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/taxonomy_references
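
The member-filtering step above can be tried on a small in-memory archive. This is only an illustrative sketch: the helper name `non_macosx_members` and the file names in the demo archive are made up, and the filter uses the same `"_MACOSX"` substring test as the cell above.

```python
import io
from zipfile import ZipFile

def non_macosx_members(zip_file):
    """List archive members, leaving out anything whose path contains "_MACOSX"."""
    return [name for name in zip_file.namelist() if "_MACOSX" not in name]

# Build a tiny in-memory archive (made-up file names) to exercise the filter
buffer = io.BytesIO()
with ZipFile(buffer, "w") as demo_zip:
    demo_zip.writestr("SILVA_demo/readme.txt", "placeholder")
    demo_zip.writestr("__MACOSX/SILVA_demo/._readme.txt", "resource fork")

# Reopen the same buffer for reading and keep only the wanted members
with ZipFile(buffer) as demo_zip:
    kept = non_macosx_members(demo_zip)
print(kept)
```

The resulting list is what gets passed as `members` to `extractall`, so only those entries are written to disk.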


#### Download and Expand MeTaxa2

We want to get sequences from a BLAST database generated for the MeTaxa2 project.

In [8]:
download_file(url=metaxa2_url,local_filepath=local_metaxa2_filepath)

Downloading file: https://microbiology.se/sw/Metaxa2_2.2.1.tar.gz
Saved to local filepath: /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/taxonomy_references/Metaxa2_2.2.1.tar.gz


We'll now extract the MeTaxa2 software into our data folder.

In [9]:
import tarfile
print(f"About to extract .tar.gz file: {local_metaxa2_filepath}")
# The context manager closes the archive even if extraction raises an error
with tarfile.open(local_metaxa2_filepath, "r:gz") as tar:
    tar.extractall(path=data_folder)
print("Done")

About to extract .tar.gz file: /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/taxonomy_references/Metaxa2_2.2.1.tar.gz
Done


#### Get a FASTA file out of the Metaxa2 BLAST database

The MeTaxa2 software supplies a BLAST database, but not a FASTA file for the underlying sequences. In this step we convert these files to the FASTA format using the blastdbcmd program from BLAST+. 

This step requires BLAST+ to be installed, with the blastdbcmd program available on your PATH.
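
A quick way to confirm that requirement before running the conversion is sketched below, using the standard library's `shutil.which`. The helper name `find_executable` is hypothetical and not part of the original notebook.

```python
import shutil

def find_executable(program="blastdbcmd"):
    """Return the full path to `program` if it is on the PATH, else None."""
    location = shutil.which(program)
    if location is None:
        print(f"{program} not found on PATH; install BLAST+ before continuing")
    else:
        print(f"Found {program} at {location}")
    return location
```

If `find_executable()` returns `None`, the `!blastdbcmd` line in the next cell will fail, so it is worth checking first.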

In [10]:
starting_folder = os.getcwd()
print(f"Started in working directory: {starting_folder}")
metaxa2_db_folder = os.path.join(data_folder,"Metaxa2_2.2.1","metaxa2_db","SSU")
os.chdir(metaxa2_db_folder)
print("Changed folder to ", os.getcwd())

print("Generating a FASTA file from the MeTaxa2 BLAST database")
#Step out of python for a moment to extract a FASTA file from the Metaxa2 BLAST db
!blastdbcmd -entry all -db blast -out metaxa2.fasta

print("Resulting FASTA file(s):")
!ls ./*.fasta
!mv metaxa2.fasta $metaxa2_fasta_filepath

print(f"Moved metaxa2.fasta to {metaxa2_fasta_filepath}")

os.chdir(starting_folder)
print("Changed working directory back to: ", os.getcwd())


Started in working directory: /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/procedure
Changed folder to  /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/taxonomy_references/Metaxa2_2.2.1/metaxa2_db/SSU
Generating a FASTA file from the MeTaxa2 BLAST database
Resulting FASTA file(s):
./metaxa2.fasta
Moved metaxa2.fasta to /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/taxonomy_references/metaxa2.fasta
Changed working directory back to:  /Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/procedure
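
The same conversion could also be run without shell magics or changing the working directory, by using `subprocess.run` with its `cwd` argument. This is a sketch under assumptions rather than the notebook's actual method: the function names are hypothetical, and the blastdbcmd options are copied from the cell above.

```python
import subprocess

def build_blastdbcmd_command(db_name="blast", out_fasta="metaxa2.fasta"):
    """Build the argument list for dumping every entry of a BLAST db to FASTA.

    The flags mirror the `!blastdbcmd` line in the cell above.
    """
    return ["blastdbcmd", "-entry", "all", "-db", db_name, "-out", out_fasta]

def dump_blast_db_to_fasta(db_folder, out_fasta, db_name="blast"):
    """Run blastdbcmd inside `db_folder` (hypothetical helper; requires BLAST+).

    Passing an absolute `out_fasta` writes the file straight to its final
    location, so no separate move step is needed.
    """
    command = build_blastdbcmd_command(db_name, out_fasta)
    # check=True raises CalledProcessError if blastdbcmd exits with an error
    subprocess.run(command, cwd=db_folder, check=True)
```

With the notebook's variables, this would be called as `dump_blast_db_to_fasta(metaxa2_db_folder, metaxa2_fasta_filepath)`. Because the working directory of the notebook itself never changes, there is nothing to restore if the command fails.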


#### Taxonomy reference files downloaded

The taxonomy reference files should now all be downloaded into the data directory. The next step is to supplement the Greengenes and SILVA taxonomies with the MeTaxa2 mitochondrial data to better annotate coral sequences.