# 1-DownloadData
This notebook downloads and unzips a sample dataset.

Dataset Description:
A list of gene names and annotations for species from the [National Center for Biotechnology Information](https://www.ncbi.nlm.nih.gov/) (NCBI) in tab-separated values (TSV) format.

Size: 6.4 GB unzipped (as of June 2023)

Source: https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz

Description: https://ftp.ncbi.nlm.nih.gov/gene/DATA/README

To ensure platform independence, this notebook uses Python libraries to download and unzip a compressed file.

Author: Peter W. Rose (pwrose@ucsd.edu)

In [1]:
import os
import shutil
import requests
import gzip
import pandas as pd

### Specify the number of copies that should be made of the original dataset to increase its size for benchmarking

A single copy of the file is about 6.4 GB as of June 2023.

The `n_copies` parameter controls the dataset size used for benchmarking.
The cell below has been [parameterized](https://papermill.readthedocs.io/en/latest/usage-parameterize.html) so that [papermill](https://papermill.readthedocs.io/en/latest/index.html) can set `n_copies` as an input parameter.

In [2]:
n_copies = 1
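
As a rough sketch of how this parameter might be overridden from Python, papermill's `execute_notebook` accepts a `parameters` dictionary. The notebook file names below are assumptions (the source only gives the title `1-DownloadData`), and the helper `run_with_copies` is hypothetical.

```python
def run_with_copies(n_copies, input_nb="1-DownloadData.ipynb", output_nb=None):
    """Execute the notebook with a different n_copies value (hypothetical helper)."""
    # imported lazily so that defining the helper does not require papermill
    import papermill as pm

    if output_nb is None:
        # assumed naming scheme for the executed copy of the notebook
        output_nb = f"1-DownloadData-{n_copies}x.ipynb"

    # papermill injects these values after the cell tagged "parameters"
    return pm.execute_notebook(input_nb, output_nb, parameters={"n_copies": n_copies})
```

Calling `run_with_copies(4)` would then run the whole notebook with `n_copies = 4`, provided papermill is installed and the parameter cell carries the `parameters` tag.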

If the `LOCAL_SCRATCH_DIR` environment variable is not set, this notebook stores data files in the `../data` directory.

In [3]:
DATA_DIR = os.getenv("LOCAL_SCRATCH_DIR", default="../data")
os.makedirs(DATA_DIR, exist_ok=True)
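
The fallback behavior can be checked in isolation. The sketch below mirrors the lookup with a plain dictionary standing in for the environment; `resolve_data_dir` and the path `/scratch/job` are made up for illustration.

```python
import os

def resolve_data_dir(env=None, fallback="../data"):
    """Return LOCAL_SCRATCH_DIR if it is set in env, else the fallback path."""
    env = os.environ if env is None else env
    return env.get("LOCAL_SCRATCH_DIR", fallback)

unset_result = resolve_data_dir({})  # variable not set
set_result = resolve_data_dir({"LOCAL_SCRATCH_DIR": "/scratch/job"})  # variable set
print(unset_result)
print(set_result)
```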

In [4]:
url = "https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz"
filename_in = os.path.join(DATA_DIR, "gene_info.gz")
filename_out = os.path.join(DATA_DIR, "gene_info.tsv")

Stream the download so that files larger than the available memory can be handled.

In [5]:
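def download_http(url, filename):
    # stream the response so the file is never held in memory as a whole
    with requests.get(url, stream=True) as r:
        r.raise_for_status()  # fail early instead of saving an HTTP error page
        with open(filename, "wb") as f:
            shutil.copyfileobj(r.raw, f)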
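
To illustrate why a streamed download keeps memory use bounded, here is a minimal sketch of a chunked copy loop, similar in spirit to what `shutil.copyfileobj` does. The function `copy_in_chunks` and the 64 KiB chunk size are illustrative choices, not part of the notebook, and in-memory buffers stand in for the HTTP response and the output file.

```python
import io

def copy_in_chunks(src, dst, chunk_size=64 * 1024):
    """Copy src to dst one chunk at a time; return the number of bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty bytes object signals the end of the stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total

# in-memory stand-ins for the HTTP response body and the output file
src = io.BytesIO(b"x" * 200_000)
dst = io.BytesIO()
copied = copy_in_chunks(src, dst)
```

Only one chunk is held in memory at any time, so peak memory depends on `chunk_size` rather than on the size of the file.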

In [6]:
def unzip(filename_in, filename_out):
    with gzip.open(filename_in, "rb") as f_in:
        with open(filename_out, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)

In [7]:
print(f"downloading {url} to {filename_in}")
download_http(url, filename_in)

downloading https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz to /scratch/pwrose/job_22855518/gene_info.gz


In [8]:
print(f"unzipping {filename_in} to {filename_out}")

unzipping /scratch/pwrose/job_22855518/gene_info.gz to /scratch/pwrose/job_22855518/gene_info.tsv


#### Create dataset by appending multiple copies to the original file

In [9]:
%%time
with open(filename_out, 'wb') as f_out:
    # make a single copy including the header
    with gzip.open(filename_in, "rb") as f_in:
        shutil.copyfileobj(f_in, f_out)

    # append n-1 copies without the header
    for _ in range(1, n_copies):
        with gzip.open(filename_in, "rb") as f_in:
            f_in.readline()  # skip the header line
            shutil.copyfileobj(f_in, f_out)

CPU times: user 15.8 s, sys: 3.32 s, total: 19.1 s
Wall time: 19.1 s
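
The header handling above can be verified on a toy file before running it on the full dataset. The following sketch is self-contained: `make_copies` and the three-line sample file are hypothetical, and it assumes the compressed file ends with a newline so that appended copies do not fuse lines.

```python
import gzip
import os
import shutil
import tempfile

def make_copies(gz_path, out_path, n_copies):
    """Write n_copies of the gzip file's rows to out_path with a single header."""
    with open(out_path, "wb") as f_out:
        for i in range(n_copies):
            with gzip.open(gz_path, "rb") as f_in:
                if i > 0:
                    f_in.readline()  # drop the header on every copy after the first
                shutil.copyfileobj(f_in, f_out)

with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "toy.gz")
    out_path = os.path.join(tmp, "toy.tsv")
    # toy input: one header line followed by two data rows
    with gzip.open(gz_path, "wb") as f:
        f.write(b"id\tname\n1\ta\n2\tb\n")
    make_copies(gz_path, out_path, n_copies=3)
    with open(out_path, "rb") as f:
        lines = f.read().splitlines()
```

With three copies of two data rows, `lines` should hold one header plus six data rows.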


In [10]:
file_size = os.path.getsize(filename_out)
print(f"File Size: {file_size/1E9:.1f} GB")

File Size: 6.4 GB
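
The file size confirms the volume of data but not its layout. As an optional follow-up, the sketch below reads only the header and the first few rows of a tab-separated file with the standard `csv` module, so the full file is never loaded. `peek_tsv` is a hypothetical helper, shown here on a tiny temporary file rather than on `gene_info.tsv`.

```python
import csv
import os
import tempfile

def peek_tsv(path, n_rows=3):
    """Return the header and the first n_rows data rows of a TSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        # zip stops after n_rows, so the rest of the file is never read
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows

with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.tsv")
    with open(sample, "w", encoding="utf-8", newline="") as f:
        f.write("id\tname\n1\ta\n2\tb\n")
    header, rows = peek_tsv(sample, n_rows=1)
```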
