## Download Starting Taxonomy Files

In order to remove coral mitochondria, we first need to build supplemented versions of the Greengenes and SILVA reference databases. This notebook automates downloading and extracting those databases.

NOTE: these references are large, so they will take a while to download, and you should expect them to occupy several GB of hard drive space. (I'd hesitate before running this notebook if less than 15 GB are free on your hard drive.)

gg_13_8_otus.tar.gz (from ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz)  
Silva_132_release.zip (from https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip)    
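
Since disk space is the main thing that can go wrong here, it may be worth checking it before starting. The following is a minimal sketch, not part of the original notebook: the helper name `has_free_space` is hypothetical, and the 15 GiB threshold is just the rule of thumb from the note above.

```python
import shutil

GIB = 1024 ** 3  # bytes per gibibyte


def has_free_space(path=".", required_gib=15):
    """Return True if the filesystem holding `path` has at least `required_gib` GiB free.

    Hypothetical helper for illustration; the 15 GiB default is a rule of thumb.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gib * GIB


if not has_free_space(".", required_gib=15):
    print("Warning: less than 15 GiB free; the reference databases may not fit.")
```
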

In [22]:
#Set up utility functions for downloading data and organizing our folders

import urllib.request
import shutil
import os


def download_file(url, local_filepath):
    """Download a file from a remote url and save to a local filepath
    
    url - the web address of the file you want to download as a string
    local_filepath - the local filepath to which the file will be saved
    """

    print(f"Downloading file: {url}")
    # This is slightly convoluted-looking, but we are getting a response from the webpage and
    # then copying that to the file. 

    #Hat-tip to stack overflow: 
    #https://stackoverflow.com/questions/7243750/download-file-from-web-in-python-3

    with urllib.request.urlopen(url) as response, open(local_filepath, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    
    print(f"Saved to local filepath: {local_filepath}")
    
def make_directory(path):
    """Make a directory, but proceed without errors if it fails

    path -- the path to the directory (e.g. "../output/taxonomy_references")
    """
    try:
        os.mkdir(path)
    except OSError:
        print(f"Creation of directory {path} failed")
    else:
        print(f"Created the directory {path}")
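
Note that `os.mkdir` reports a failure both when the directory already exists and when its parent folder is missing. As an alternative, here is a hedged sketch of a helper (the name `ensure_directory` is ours, not from the original notebook) that uses `os.makedirs` with `exist_ok=True` so that both cases succeed silently.

```python
import os


def ensure_directory(path):
    """Create `path` and any missing parent directories.

    Hypothetical alternative to make_directory: does nothing if the
    directory already exists, and still raises OSError for real
    problems such as missing permissions.
    """
    os.makedirs(path, exist_ok=True)
    return path
```

With this variant, calling `ensure_directory("../output/taxonomy_references/")` twice in a row should be harmless.
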


In [23]:
# Filepaths and urls

gg_url = "ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz"
data_folder = "../output/taxonomy_references/"
local_gg_filename = "gg_13_8_otus.tar.gz"

silva_url = "https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip"
local_silva_filename = "Silva_132_release.zip"
local_silva_filepath = os.path.join(data_folder, local_silva_filename)
local_gg_filepath = os.path.join(data_folder, local_gg_filename)



In [24]:
# Set up a folder to hold large taxonomy files

# create the data folder if it doesn't already exist
make_directory(data_folder)



Creation of directory ../output/taxonomy_references/ failed


#### Download the Greengenes Taxonomy

We'll now download the Greengenes 13_8 taxonomy reference.

In [25]:
download_file(url=gg_url, local_filepath=local_gg_filepath)

Downloading file: ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz
Saved to local filepath: ../output/taxonomy_references/gg_13_8_otus.tar.gz
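
Because these downloads are slow, re-running the notebook would fetch everything again. Below is a minimal sketch of a wrapper that skips the download when the file is already present. The function name `download_if_missing` and the `.part` suffix are our assumptions, not part of the original notebook, and the check only looks at whether a non-empty file exists; it does not verify that the file is complete or uncorrupted.

```python
import os
import shutil
import urllib.request


def download_if_missing(url, local_filepath):
    """Download `url` to `local_filepath` unless a non-empty file is already there.

    Hypothetical helper for illustration.
    Returns True if a download happened, False if it was skipped.
    """
    if os.path.isfile(local_filepath) and os.path.getsize(local_filepath) > 0:
        print(f"Already present, skipping download: {local_filepath}")
        return False

    # Write to a temporary name first, so an interrupted download
    # is not mistaken for a complete file on the next run.
    partial_filepath = local_filepath + ".part"
    with urllib.request.urlopen(url) as response, open(partial_filepath, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    os.replace(partial_filepath, local_filepath)

    print(f"Saved to local filepath: {local_filepath}")
    return True
```
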


We now want to expand the Greengenes .tar.gz file into our data folder so we can access the contents.

In [26]:
import tarfile

# The context manager closes the archive even if extraction fails partway
with tarfile.open(local_gg_filepath, "r:gz") as tar:
    tar.extractall(path=data_folder)
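
Before (or after) extracting, it can be handy to see which top-level folders an archive will create in the data folder. This is a small sketch under our own naming (`top_level_entries` is hypothetical); note that listing the members of a large .tar.gz requires reading through the whole archive, so it may take a little while.

```python
import posixpath
import tarfile


def top_level_entries(tar_path):
    """Return the sorted top-level names inside a .tar.gz archive.

    Hypothetical helper for illustration.
    """
    with tarfile.open(tar_path, "r:gz") as tar:
        # Normalize names so that "./folder/file" and "folder/file" agree
        names = (posixpath.normpath(name) for name in tar.getnames())
        return sorted({name.split("/")[0] for name in names if name != "."})
```

For example, `top_level_entries(local_gg_filepath)` would list what `extractall` places directly inside `data_folder`.
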

#### Download and Expand the SILVA 132 release

Now we'll download the SILVA 132 release and decompress it.
NOTE: this is a large (~2.47 GB) file, so it may take a while to download.

In [28]:
download_file(url=silva_url, local_filepath=local_silva_filepath)

Downloading file:  https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip
Saved to local filepath: ../output/taxonomy_references/Silva_132_release.zip


In [None]:
from zipfile import is_zipfile, ZipFile

if not is_zipfile(local_silva_filepath):
    raise ValueError(f"The SILVA database zip file {local_silva_filepath} doesn't look like a zip file. Was it downloaded correctly?")

print("Extracting SILVA database...")
with ZipFile(local_silva_filepath) as silva_zipfile:
    # Obnoxiously this file contains a _MACOSX subfolder. We don't want to unzip that...
    files_to_extract = [m for m in silva_zipfile.namelist() if "_MACOSX" not in m]
    silva_zipfile.extractall(path=data_folder, members=files_to_extract)
print(f"Extracted the SILVA 132 database into: {data_folder}")
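
To see the member filtering in isolation, here is a small self-contained sketch that builds a tiny in-memory zip and applies the same idea. The helper name and the file names are made up for illustration, and it assumes the metadata folder follows the usual macOS convention of being named `__MACOSX`; unlike the substring test above, it only skips members that have a path component exactly equal to that name.

```python
import io
from zipfile import ZipFile


def members_without_macos_metadata(zip_file):
    """List archive members, skipping anything under a __MACOSX folder.

    Hypothetical helper for illustration.
    """
    return [name for name in zip_file.namelist() if "__MACOSX" not in name.split("/")]


# Build a tiny in-memory archive to show the filter in action (file names are made up)
buffer = io.BytesIO()
with ZipFile(buffer, "w") as demo_zip:
    demo_zip.writestr("release/taxonomy.txt", "example")
    demo_zip.writestr("__MACOSX/release/._taxonomy.txt", "metadata")

with ZipFile(buffer) as demo_zip:
    kept = members_without_macos_metadata(demo_zip)

print(kept)  # ['release/taxonomy.txt']
```
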