# 1-FetchLocalData
This notebook copies and unzips a local copy of the dataset used for benchmarking.

Dataset description:
A list of gene names and annotations for species from [National Center for Biotechnology Information](https://www.ncbi.nlm.nih.gov/) in tsv format.

Size: 6.5 GB unzipped (as of June 2023)

Source: https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz

Description: https://ftp.ncbi.nlm.nih.gov/gene/DATA/README

To ensure platform independence, this notebook uses Python libraries to unzip a compressed file.

Author: Peter W. Rose (pwrose@ucsd.edu)

In [1]:
import os
import shutil
import gzip
from pathlib import Path

If the LOCAL_SCRATCH_DIR environment variable is not set, this notebook stores data files in the ../data directory.

In [2]:
DATA_DIR = os.getenv("LOCAL_SCRATCH_DIR", default="../data")
os.makedirs(DATA_DIR, exist_ok=True)
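
The fallback behavior can be checked in isolation. The sketch below uses a hypothetical helper, `resolve_data_dir`, which is not part of this notebook. It accepts the environment as a plain mapping so the lookup can be exercised without modifying the real environment; the path `/scratch/job` is a made-up example value.

```python
import os


def resolve_data_dir(env=None, default="../data"):
    """Hypothetical helper: return LOCAL_SCRATCH_DIR if set, else the default."""
    # Fall back to the real process environment when no mapping is given.
    env = os.environ if env is None else env
    return env.get("LOCAL_SCRATCH_DIR", default)


# Variable not set: the default directory is used.
print(resolve_data_dir({}))
# Variable set (made-up value): the scratch directory is used.
print(resolve_data_dir({"LOCAL_SCRATCH_DIR": "/scratch/job"}))
```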

## Copy tsv dataset

In [3]:
filename_in = os.path.join(Path.home(), "data/gene_info.gz")
filename_out = os.path.join(DATA_DIR, "gene_info.tsv")

In [4]:
def unzip(filename_in, filename_out):
    """Decompress a gzip file to filename_out, copying in chunks to limit memory use."""
    with gzip.open(filename_in, "rb") as f_in, open(filename_out, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
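
Before running the function on the multi-gigabyte file, it may be worth confirming it on a tiny input. The following is a minimal sketch, not part of the original notebook: `roundtrip_check` is a hypothetical helper, and the two-column payload is made-up sample data rather than real `gene_info` content. It compresses the payload into a temporary directory, decompresses it with the same `unzip` logic, and compares the bytes.

```python
import gzip
import os
import shutil
import tempfile


def unzip(filename_in, filename_out):
    """Decompress a gzip file, streaming the contents to filename_out."""
    with gzip.open(filename_in, "rb") as f_in, open(filename_out, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)


def roundtrip_check(payload: bytes) -> bool:
    """Hypothetical helper: gzip the payload, unzip it, and compare the bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        gz_path = os.path.join(tmp, "sample.gz")
        out_path = os.path.join(tmp, "sample.tsv")
        # Write a small compressed file to act as the input.
        with gzip.open(gz_path, "wb") as f:
            f.write(payload)
        unzip(gz_path, out_path)
        # The decompressed bytes should match the original payload exactly.
        with open(out_path, "rb") as f:
            return f.read() == payload


print(roundtrip_check(b"col_a\tcol_b\n1\t2\n"))
```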

In [5]:
print(f"unzipping {filename_in} to {filename_out}")
unzip(filename_in, filename_out)

unzipping /cm/shared/examples/sdsc/ciml/2023/gene_info.gz to /scratch/train139/job_23601009/gene_info.tsv


In [6]:
file_size = os.path.getsize(filename_out)
print(f"File Size: {file_size/1E9:.1f} GB")

File Size: 6.5 GB


## Copy parquet dataset
Note: a Parquet "file" is usually a directory of files.

In [7]:
filename_in = os.path.join(Path.home(), "data/gene_info.parquet")
filename_out = os.path.join(DATA_DIR, "gene_info.parquet")
shutil.copytree(filename_in, filename_out)

'/scratch/train139/job_23601009/gene_info.parquet'
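
By default, `shutil.copytree` raises `FileExistsError` when the destination directory already exists, so rerunning the cell above after a successful copy fails. One possible workaround, assuming Python 3.8 or later, is to pass `dirs_exist_ok=True`. The sketch below demonstrates this on temporary directories; `copy_dataset_dir` is a hypothetical wrapper and `part.0.parquet` is a made-up file name.

```python
import os
import shutil
import tempfile


def copy_dataset_dir(src, dst):
    """Hypothetical wrapper: copy a directory tree, tolerating an existing destination."""
    # dirs_exist_ok requires Python 3.8+; existing files are overwritten.
    return shutil.copytree(src, dst, dirs_exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.parquet")
    dst = os.path.join(tmp, "dst.parquet")
    os.makedirs(src)
    with open(os.path.join(src, "part.0.parquet"), "wb") as f:
        f.write(b"placeholder")
    copy_dataset_dir(src, dst)
    # The second call would raise FileExistsError without dirs_exist_ok=True.
    copy_dataset_dir(src, dst)
    copied = sorted(os.listdir(dst))

print(copied)
```

Note that this overwrites files with matching names but does not remove stale files already present in the destination; deleting the destination first with `shutil.rmtree` is an alternative when an exact copy is needed.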

In [8]:
# A parquet "file" is a directory of files. Each file corresponds to a partition in Dask.
file_size = 0
for path, dirs, files in os.walk(filename_out):
    for f in files:
        fp = os.path.join(path, f)
        file_size += os.path.getsize(fp)

In [9]:
print("Filename:", filename_out)
print(f"File Size: {file_size/1E9:.1f} GB")

Filename: /scratch/train139/job_23601009/gene_info.parquet
File Size: 1.7 GB
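
The directory-size loop above can also be written with `pathlib`. This is an alternative sketch, not the notebook's own code: `dir_size_bytes` is a hypothetical helper, and the file names and sizes are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def dir_size_bytes(directory):
    """Hypothetical helper: total size in bytes of all regular files under a directory."""
    return sum(p.stat().st_size for p in Path(directory).rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    # Build a small tree: one file at the top level, one in a subdirectory.
    (Path(tmp) / "a.bin").write_bytes(b"x" * 100)
    sub = Path(tmp) / "sub"
    sub.mkdir()
    (sub / "b.bin").write_bytes(b"y" * 50)
    total = dir_size_bytes(tmp)

print(f"Total: {total} bytes")
```

In the notebook, the same helper could be applied to the copied Parquet directory, for example `dir_size_bytes(filename_out) / 1E9` to report the size in GB.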
