# Download Gene Information
This notebook copies and unzips a sample dataset.

Dataset description:
A list of gene names and annotations for species from the [National Center for Biotechnology Information](https://www.ncbi.nlm.nih.gov/) (NCBI) in tab-separated (TSV) format.

Size: 5.4 GB unzipped (as of June 2022)

Source: https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz

Description: https://ftp.ncbi.nlm.nih.gov/gene/DATA/README

To ensure platform independence, this notebook uses Python libraries to unzip a compressed file.

In [None]:
import os
import shutil
import gzip

If the LOCAL_SCRATCH_DIR environment variable is not set, this notebook creates the ../data directory to store temporary files.

In [None]:
DATA_DIR = os.getenv("LOCAL_SCRATCH_DIR", default="../data")
os.makedirs(DATA_DIR, exist_ok=True)

In [None]:
# The input path is absolute, so os.path.join would discard DATA_DIR; use the path directly.
filename_in = "/cm/shared/examples/sdsc/ciml/2022/gene_info.gz"
filename_out = os.path.join(DATA_DIR, "gene_info.tsv")
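
The input file is read from a shared location given as an absolute path, while the output is written under DATA_DIR. The distinction matters because a path join discards earlier components when a later one is absolute. The following minimal sketch illustrates that behavior; it uses `posixpath` (the POSIX flavor of `os.path`) so the result is the same on any operating system, and `data_dir` is only a stand-in for DATA_DIR.

```python
import posixpath  # POSIX flavor of os.path, used so the result is the same on any OS

data_dir = "../data"  # stand-in for DATA_DIR
shared_path = "/cm/shared/examples/sdsc/ciml/2022/gene_info.gz"

# An absolute component resets the join, so data_dir is dropped.
joined_in = posixpath.join(data_dir, shared_path)
# A relative component is appended as expected.
joined_out = posixpath.join(data_dir, "gene_info.tsv")

print(joined_in)   # /cm/shared/examples/sdsc/ciml/2022/gene_info.gz
print(joined_out)  # ../data/gene_info.tsv
```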

### Specify the number of copies that should be made of the original dataset to increase its size.

A single copy of the file is about 5.4 GB unzipped. The cells below unzip a single copy. Note that making additional copies is very slow!
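
If more than one copy is needed, one possible approach is to stream the decompressed data into the output file several times. This is a sketch under assumptions, not the notebook's own method: `unzip_copies` and `n_copies` are hypothetical names introduced here for illustration. Any header line in the file would be repeated once per copy, so downstream readers may need to skip it.

```python
import gzip
import shutil


def unzip_copies(filename_in, filename_out, n_copies=1):
    """Decompress filename_in and write its contents n_copies times to filename_out.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if n_copies < 1:
        raise ValueError("n_copies must be at least 1")
    with open(filename_out, "wb") as f_out:
        for _ in range(n_copies):
            # Reopen the archive for each copy so the data is streamed in chunks
            # instead of being held in memory.
            with gzip.open(filename_in, "rb") as f_in:
                shutil.copyfileobj(f_in, f_out)
```

With `n_copies=1` this behaves like a plain unzip; each additional copy repeats the full decompression, which is why larger values are slow.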

In [None]:
def unzip(filename_in, filename_out):
    """Stream-decompress the gzip file filename_in into filename_out."""
    with gzip.open(filename_in, "rb") as f_in:
        with open(filename_out, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
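
Because the full archive takes a while to decompress, it can help to exercise the same streaming pattern on a tiny file first. The sketch below repeats the `unzip` function so that it is self-contained and adds a hypothetical helper, `roundtrip_check`, that compresses a made-up payload into a temporary directory, decompresses it, and compares the result.

```python
import gzip
import os
import shutil
import tempfile


def unzip(filename_in, filename_out):
    # Same pattern as above: copy the decompressed stream in chunks.
    with gzip.open(filename_in, "rb") as f_in:
        with open(filename_out, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)


def roundtrip_check(payload=b"col_a\tcol_b\n1\t2\n"):
    """Return True if payload survives a gzip write followed by unzip().

    Hypothetical helper; the default payload is made-up sample data.
    """
    with tempfile.TemporaryDirectory() as tmp:
        gz_path = os.path.join(tmp, "sample.gz")
        out_path = os.path.join(tmp, "sample.tsv")
        with gzip.open(gz_path, "wb") as f:
            f.write(payload)
        unzip(gz_path, out_path)
        with open(out_path, "rb") as f:
            return f.read() == payload
```

Calling `roundtrip_check()` should return `True` if the decompression logic is intact.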

In [None]:
print(f"unzipping {filename_in} to {filename_out}")
unzip(filename_in, filename_out)

In [None]:
file_size = os.path.getsize(filename_out)
print(f"File Size: {file_size/1E9:.1f} GB")
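
Beyond checking the size, a few rows can be read lazily to confirm that the output is readable tab-separated text without loading the whole file. This is a minimal sketch: `preview_tsv` is a hypothetical helper, and the UTF-8 encoding is an assumption about the file.

```python
import csv
from itertools import islice


def preview_tsv(path, n_rows=5):
    """Return the first n_rows rows of a tab-separated file as lists of strings.

    Hypothetical helper; assumes the file is UTF-8 encoded.
    """
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE treats quote characters as ordinary data.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        # islice stops after n_rows, so only the start of the file is read.
        return list(islice(reader, n_rows))
```

After the unzip step, `preview_tsv(filename_out)` would return the first five rows, each as a list of column values.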