# 1a-Csv2Parquet
This notebook converts a CSV or TSV file to a Parquet file. [Apache Parquet](https://parquet.apache.org/) is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides compression and encoding schemes that handle complex data in bulk efficiently.

Author: Peter W. Rose (pwrose@ucsd.edu)

In [1]:
import os
import dask.dataframe as dd
import time

If the `LOCAL_SCRATCH_DIR` environment variable is not set, this notebook reads and writes data files in the `../data` directory.

In [2]:
DATA_DIR = os.getenv("LOCAL_SCRATCH_DIR", default="../data")
filename = os.path.join(DATA_DIR, "gene_info.tsv")
print("Filename:", filename)
file_size = os.path.getsize(filename)
print(f"File Size: {file_size/1E9:.1f} GB")
output_filename = os.path.join(DATA_DIR, "gene_info.parquet")

Filename: /scratch/pwrose/job_22853611/gene_info.tsv
File Size: 6.4 GB


In [3]:
start = time.time()

In [4]:
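# Read every column as a string. dd.read_csv is lazy: the full file is not read until to_parquet runs.
genes = dd.read_csv(filename, dtype=str, sep="\t")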

In [5]:
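# Write one subdirectory per type_of_gene value. This call triggers the actual read and conversion.
genes.to_parquet(output_filename, partition_on=["type_of_gene"], write_index=False, write_metadata_file=True, engine="pyarrow")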
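
With `partition_on=["type_of_gene"]`, the output directory contains one Hive-style subdirectory per distinct value, named `type_of_gene=<value>`, alongside the metadata files. As a minimal sketch, the helper below lists those values from the directory names alone. The name `list_partition_values` is hypothetical and not part of Dask or pyarrow, and values that contain special characters may appear in escaped form in the directory names.

```python
import os


def list_partition_values(dataset_dir, column):
    """Return the sorted values of `column` found in Hive-style subdirectory names."""
    prefix = column + "="
    values = []
    for entry in os.listdir(dataset_dir):
        full_path = os.path.join(dataset_dir, entry)
        # Partition directories are named "<column>=<value>";
        # plain files such as _metadata are skipped.
        if os.path.isdir(full_path) and entry.startswith(prefix):
            values.append(entry[len(prefix):])
    return sorted(values)
```

In this notebook it could be called as `list_partition_values(output_filename, "type_of_gene")` once the conversion has finished.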

In [6]:
end = time.time()

In [7]:
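# A Parquet "file" is a directory of files. With partition_on, the part files are grouped into
# one subdirectory per type_of_gene value, so the sizes are summed over the whole directory tree.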
file_size = 0
for path, dirs, files in os.walk(output_filename):
    for f in files:
        fp = os.path.join(path, f)
        file_size += os.path.getsize(fp)
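
The same size calculation can be wrapped in a reusable function, which is convenient when several Parquet datasets need to be compared. This is a sketch only; the function name `directory_size` is an assumption and does not appear elsewhere in this notebook.

```python
import os


def directory_size(root):
    """Return the total size in bytes of all files below `root`."""
    total = 0
    # os.walk visits root and every nested subdirectory, so the
    # per-partition subdirectories are included in the sum.
    for path, _dirs, files in os.walk(root):
        for name in files:
            total += os.path.getsize(os.path.join(path, name))
    return total
```

For example, `directory_size(output_filename) / 1E9` would give the dataset size in GB, matching the unit used in the printout.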

In [8]:
print("Filename:", output_filename)
print(f"File Size: {file_size/1E9:.1f} GB")

Filename: /scratch/pwrose/job_22853611/gene_info.parquet
File Size: 1.7 GB


In [9]:
print(f"Csv2Parquet: {end - start:.1f} sec.")

Csv2Parquet: 161.5 sec.
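
The timing above relies on `start` and `end` being set in separate cells, so re-running a single cell can leave a stale value. One possible alternative, shown here as a sketch and not as the notebook's method, is a small context manager built on `time.perf_counter`. The name `timed` is hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so the elapsed time is always reported.
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.1f} sec.")
```

The read and write steps would then go into one cell inside `with timed("Csv2Parquet"):`, so the measured interval always covers exactly that work.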
