# Download Gene Information
This notebook downloads and unzips a sample dataset.

Dataset description:
Gene names and annotations for many species, published by the [National Center for Biotechnology Information](https://www.ncbi.nlm.nih.gov/) (NCBI) in tab-separated (TSV) format.

Size: 5.4 GB unzipped (as of June 2022)

Source: https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz

Description: https://ftp.ncbi.nlm.nih.gov/gene/DATA/README

To ensure platform independence, this notebook uses Python libraries to download and unzip a compressed file.

In [1]:
import os
import shutil
import requests
import gzip
import pandas as pd

### Specify the number of copies that should be made of the original dataset to increase its size for benchmarking

A single copy of the unzipped file is about 5.4 GB as of June 2022, so the generated dataset is roughly `ncopies` times that size.

In [2]:
ncopies = 1
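
Before raising `ncopies`, it can help to estimate the resulting file size and compare it with the free disk space. The sketch below is only an illustration: the helper names `estimated_size_gb` and `has_room` are hypothetical and not part of this notebook, and the 5.4 GB figure is the approximate value quoted above. The estimate ignores the single header line and the space taken by the compressed download.

```python
import shutil

SINGLE_COPY_GB = 5.4  # approximate unzipped size quoted above (June 2022)


def estimated_size_gb(ncopies, single_copy_gb=SINGLE_COPY_GB):
    """Rough size of the generated dataset in GB (hypothetical helper)."""
    return ncopies * single_copy_gb


def has_room(path, ncopies):
    """Return True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= estimated_size_gb(ncopies)


estimate = estimated_size_gb(4)  # four copies of the 5.4 GB file
```

Calling `has_room(DATA_DIR, ncopies)` once `DATA_DIR` exists would give an early warning before a long-running copy step fills the disk.
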

If the `LOCAL_SCRATCH_DIR` environment variable is not set, this notebook stores its files in `../data`, creating the directory if needed.

In [3]:
DATA_DIR = os.getenv("LOCAL_SCRATCH_DIR", default="../data")
os.makedirs(DATA_DIR, exist_ok=True)
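
The fallback behaviour can be checked in isolation. The following sketch wraps the same `os.getenv` lookup in a hypothetical helper, `resolve_data_dir`, which is not part of this notebook, and exercises both cases with a temporary directory instead of the real `../data`.

```python
import os
import tempfile


def resolve_data_dir(default="../data"):
    """Return LOCAL_SCRATCH_DIR if it is set, else the default (hypothetical helper)."""
    return os.getenv("LOCAL_SCRATCH_DIR", default=default)


# Remember the caller's setting so it can be restored afterwards.
saved = os.environ.pop("LOCAL_SCRATCH_DIR", None)

unset_dir = resolve_data_dir()        # variable unset: falls back to the default

scratch = tempfile.mkdtemp()
os.environ["LOCAL_SCRATCH_DIR"] = scratch
set_dir = resolve_data_dir()          # variable set: the scratch path is used

# Restore the environment and remove the temporary directory.
if saved is None:
    del os.environ["LOCAL_SCRATCH_DIR"]
else:
    os.environ["LOCAL_SCRATCH_DIR"] = saved
os.rmdir(scratch)
```

Note that a variable set to an empty string counts as set, so `os.getenv` would return `""` rather than the default in that case.
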

In [4]:
url = "https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz"
filename_in = os.path.join(DATA_DIR, "gene_info.gz")
filename_out = os.path.join(DATA_DIR, "gene_info.tsv")

Download with streaming so that files larger than the available memory can be handled.

In [5]:
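def download_http(url, filename):
    # stream the response so the whole file is never held in memory
    with requests.get(url, stream=True) as r:
        r.raise_for_status()  # fail early on HTTP errors instead of saving an error page
        with open(filename, "wb") as f:
            shutil.copyfileobj(r.raw, f)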
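
To see why streaming keeps memory use bounded, here is a minimal sketch of the kind of chunked copy that `shutil.copyfileobj` performs, applied to an in-memory stream rather than a network response. The helper name `copy_in_chunks` and the chunk sizes are assumptions chosen for illustration, not the library's implementation.

```python
import io


def copy_in_chunks(src, dst, chunk_size=64 * 1024):
    """Copy src to dst one chunk at a time; return the number of bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty bytes object signals the end of the stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# Stand-in for a large download: 1000 bytes copied 256 bytes at a time.
src = io.BytesIO(b"x" * 1000)
dst = io.BytesIO()
copied = copy_in_chunks(src, dst, chunk_size=256)
```

At any moment only one chunk is held in memory, so the peak memory use depends on `chunk_size` rather than on the size of the file.
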

In [6]:
def unzip(filename_in, filename_out):
    with gzip.open(filename_in, "rb") as f_in:
        with open(filename_out, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
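
A quick round trip on a tiny file is one way to confirm that `unzip` restores the original bytes. The sketch below repeats the function so that it is self-contained. The file names and the two data rows are made up for the test and are not real gene data.

```python
import gzip
import os
import shutil
import tempfile


def unzip(filename_in, filename_out):
    with gzip.open(filename_in, "rb") as f_in:
        with open(filename_out, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)


tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "sample.gz")
tsv_path = os.path.join(tmp_dir, "sample.tsv")

# Write a small gzip-compressed TSV file (made-up content).
payload = b"#col_a\tcol_b\n1\tx\n2\ty\n"
with gzip.open(gz_path, "wb") as f:
    f.write(payload)

# Decompress it and read the result back.
unzip(gz_path, tsv_path)
with open(tsv_path, "rb") as f:
    restored = f.read()

shutil.rmtree(tmp_dir)  # clean up the temporary files
```
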

In [7]:
print(f"downloading {url} to {filename_in}")
download_http(url, filename_in)

downloading https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz to ../data/gene_info.gz


In [8]:
print(f"unzipping {filename_in} to {filename_out}")

unzipping ../data/gene_info.gz to ../data/gene_info.tsv


#### Create dataset by appending multiple copies to the original file

In [9]:
%%time
with open(filename_out, 'wb') as f_out:
    # make a single copy including the header
    with gzip.open(filename_in, "rb") as f_in:
        shutil.copyfileobj(f_in, f_out)
                               
    # append n-1 copies without the header
    for _ in range(1, ncopies):
        with gzip.open(filename_in, "rb") as f_in:
            f_in.readline()  # skip the header line
            shutil.copyfileobj(f_in, f_out)

CPU times: user 18.4 s, sys: 3.13 s, total: 21.6 s
Wall time: 27.6 s
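
The header handling can be verified on a small file before running it on the full dataset. The sketch below is a scaled-down version of the step above, under a few assumptions: the function name `write_copies` is hypothetical, the sample rows are made up, and the header is skipped with a single `readline()` call on every copy after the first.

```python
import gzip
import os
import shutil
import tempfile


def write_copies(gz_path, out_path, ncopies):
    """Write ncopies of the decompressed data to out_path, keeping one header line."""
    with open(out_path, "wb") as f_out:
        for i in range(ncopies):
            with gzip.open(gz_path, "rb") as f_in:
                if i > 0:
                    f_in.readline()  # drop the header on every copy after the first
                shutil.copyfileobj(f_in, f_out)


tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "sample.gz")
out_path = os.path.join(tmp_dir, "sample.tsv")

# One header line plus two data rows (made-up content).
with gzip.open(gz_path, "wb") as f:
    f.write(b"#col_a\tcol_b\n1\tx\n2\ty\n")

write_copies(gz_path, out_path, ncopies=3)
with open(out_path, "rb") as f:
    lines = f.read().splitlines()

shutil.rmtree(tmp_dir)  # clean up the temporary files
```

With three copies of a file holding two data rows, the output should contain one header line followed by six data rows.
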


In [10]:
file_size = os.path.getsize(filename_out)
print(f"File Size: {file_size/1E9:.1f} GB")

File Size: 5.4 GB
