#### Rohan Bhatt, Shubhang Srikoti 
##### MSML605 - Investigating the Impact of Storage Formats
Problem statement: How does the choice of storage format (CSV, Parquet, HDF5) affect the overall performance of a machine learning pipeline and its stages (data ingestion, memory overhead, time-to-train, and more)?




In [11]:
import kagglehub
path = kagglehub.dataset_download("jtbontinck/amex-parquet-file")
print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/jtbontinck/amex-parquet-file?dataset_version_number=1...


100%|██████████| 8.99G/8.99G [06:21<00:00, 25.3MB/s]

Extracting files...





Path to dataset files: /Users/rohan/.cache/kagglehub/datasets/jtbontinck/amex-parquet-file/versions/1


In [12]:
# #THIS SCRIPT ACTUALLY ENDED UP INCREASING MY FILE SIZE to 12GB WHICH I USED GOING FORWARD

# #tried script to convert 10gb parquet to 2gb parquet
# import pyarrow.parquet as pq, pyarrow as pa, math
# from pathlib import Path

# SRC = Path("data.parquet")          # 16-GB file
# DST = Path("data_2gb.parquet")

# pq_src   = pq.ParquetFile(SRC)
# n_rg     = pq_src.num_row_groups

# # gather row-group sizes (total_byte_size reports uncompressed bytes, not compressed bytes on disk)
# rg_sizes = [pq_src.metadata.row_group(i).total_byte_size for i in range(n_rg)]

# target_bytes = 2 * 1024**3          # 2 GB
# keep_rg      = []
# cum          = 0
# for i, sz in enumerate(rg_sizes):
#     if cum + sz > target_bytes:
#         break
#     keep_rg.append(i)
#     cum += sz

# print(f"Keeping {len(keep_rg)} row-groups  →  ~{cum/1024**3:.2f} GB")

# # read & write subset
# tables = [pq_src.read_row_group(i) for i in keep_rg]
# subset = pa.concat_tables(tables)
# pq.write_table(subset, DST, compression="snappy")   # or "zstd"

# print(f"Subset rows: {subset.num_rows:,}")
# print("Wrote:", DST)

In [13]:
import pyarrow.parquet as pq
#sanity check of parquet file
pq_file = pq.ParquetFile("data.parquet")
print("Rows in file:", pq_file.metadata.num_rows)
print("Columns in file:", pq_file.metadata.num_columns)
print("Schema:", pq_file.schema)

Rows in file: 16895213
Columns in file: 193
Schema: <pyarrow._parquet.ParquetSchema object at 0x112aa7ac0>
required group field_id=-1 duckdb_schema {
  optional fixed_len_byte_array(16) field_id=-1 line_ID (UUID);
  optional binary field_id=-1 customer_ID (String);
  optional int64 field_id=-1 date (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional float field_id=-1 P_2;
  optional float field_id=-1 D_39;
  optional float field_id=-1 B_1;
  optional float field_id=-1 B_2;
  optional float field_id=-1 R_1;
  optional float field_id=-1 S_3;
  optional float field_id=-1 D_41;
  optional float field_id=-1 B_3;
  optional float field_id=-1 D_42;
  optional float field_id=-1 D_43;
  optional float field_id=-1 D_44;
  optional float field_id=-1 B_4;
  optional float field_id=-1 D_45;
  optional float field_id=-1 B_5;
  optional float field_id=-1 R_2;
  optional float field_id=-1 D_46;
  optional float field_id=-1

In [None]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pyarrow.parquet as pq
import pyarrow as pa
import time, datetime, os, psutil
import xgboost as xgb
from pathlib import Path, PureWindowsPath
import gc

The cell below converts the initial ~11 GB Parquet file to CSV. I set the source (the initial Parquet file) and destination paths, then open the Parquet file to find out how many row groups to iterate over. The loop reads one row group at a time so memory use stays low: it converts the columnar Arrow buffers into a pandas DataFrame, appends that chunk to the CSV (writing the header only once), and frees the RAM before loading the next chunk. I monitored the conversion throughout; although I forgot to time it in code, it took ~25 min and produced a ~32 GB CSV.

In [15]:
# in/out file paths
IN_FILE = Path("data.parquet")
OUT_CSV = Path("data.csv") 
#opening the parquet file
pq_file = pq.ParquetFile(IN_FILE, memory_map=True) #zero copy i/o data
num_row_groups = pq_file.num_row_groups #number of row groups, meaning the partitions on disk
print(f"Row groups in file: {num_row_groups}")

# write loop
first_chunk = True #controlling the header write
for row_group in range(num_row_groups):
    # load one row group into an arrow table to stay off heap but still columnar
    table = pq_file.read_row_group(row_group)
    # converting to pandas, the arrowDtype preserves the nullable types during the conversion
    df = table.to_pandas(types_mapper=pd.ArrowDtype)
    # writing / appending
    if first_chunk: #create/overwrite the file and write the header row
        df.to_csv(OUT_CSV, index=False, mode="w", header=True)
        first_chunk = False
    else: #append rows and skip header
        df.to_csv(OUT_CSV, index=False, mode="a", header=False)
    
    # free memory for chunk before next iteration
    del df, table
    gc.collect()
    print(f"row-group {row_group+1}/{num_row_groups} appended")

print("All done ->", OUT_CSV) 

Row groups in file: 169
row-group 1/169 appended
row-group 2/169 appended
row-group 3/169 appended
row-group 4/169 appended
row-group 5/169 appended
row-group 6/169 appended
row-group 7/169 appended
row-group 8/169 appended
row-group 9/169 appended
row-group 10/169 appended
row-group 11/169 appended
row-group 12/169 appended
row-group 13/169 appended
row-group 14/169 appended
row-group 15/169 appended
row-group 16/169 appended
row-group 17/169 appended
row-group 18/169 appended
row-group 19/169 appended
row-group 20/169 appended
row-group 21/169 appended
row-group 22/169 appended
row-group 23/169 appended
row-group 24/169 appended
row-group 25/169 appended
row-group 26/169 appended
row-group 27/169 appended
row-group 28/169 appended
row-group 29/169 appended
row-group 30/169 appended
row-group 31/169 appended
row-group 32/169 appended
row-group 33/169 appended
row-group 34/169 appended
row-group 35/169 appended
row-group 36/169 appended
row-group 37/169 appended
row-group 38/169 append
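
Since the conversion above was not timed in code, here is a minimal sketch of how the loop could be timed on a rerun. The `timed` helper and the `timings` dictionary are hypothetical names introduced for illustration, and the `time.sleep` call only stands in for the row-group loop.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent inside the block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
with timed("parquet_to_csv", timings):
    # placeholder: the row-group write loop would go here
    time.sleep(0.01)

print(f"parquet_to_csv took {timings['parquet_to_csv'] / 60:.2f} min")
```

Wrapping the whole loop in one `with` block records the elapsed time even if the loop raises partway through, because the timing is stored in the `finally` clause.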

Before converting to HDF5, the check below confirms that no data was lost or misrepresented between the two files. The only visible difference is how the date is displayed: Parquet keeps it as pandas datetime64[ns], while the CSV reader returns a string timestamp of object dtype. We can cast this later; the data itself is preserved.

In [16]:
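#verifying both files before hdf5 conversion
# read only the first 5 rows of each file instead of loading the whole parquet into memory
df_pq = next(pq.ParquetFile("data.parquet").iter_batches(batch_size=5)).to_pandas()
df_csv = pd.read_csv("data.csv", nrows=5)

print(df_pq.head())
print("-------\n-------\n-------\n-------\n-------")
print(df_csv.head())

# note: this compares the two 5-row samples only, not the full row counts
print("row lengths are the same:", len(df_pq) == len(df_csv)) 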

                                             line_ID  \
0  b'\xb6a\x82\x86f#F\x1d\x8c\x94\x7f\x8d\x944\xd...   
1        b'L\xa8+-\xa8\x8dM\xa9\x96g\xed0I\x95\x1e$'   
2        b']s_\x87\xaf B\xec\xbeEg\xb5\x1e\xb2\xaed'   
3     b'\xfb^\xd4{Q\xb5HO\xa8\xb6\xf6\xca\xb1]@\x99'   
4  b'`\xa5\x96\xf6\x1b\rG\x8d\xab\\\x16\x8d\xe1\x...   

                                         customer_ID       date       P_2  \
0  d00b98b2401d26197fa1d6102cdc1c9bbed7c066b8aaa9... 2018-03-06  0.366254   
1  d00bc5e66e3aac9eae7c9e94621b36d196566d61ef7a32... 2018-03-25  0.312623   
2  d00bd125cf6fa463a6c57b9959b8a4197f6f79fb154fee... 2018-03-28  0.395606   
3  d00bfbdee3081206258a4b4fb2ef2eb311697f37056bfb... 2018-03-01  0.977543   
4  d00c0dd295ada176c4e697d4cc1cd2f0d572870f770859... 2018-03-26  0.934237   

       D_39       B_1       B_2       R_1       S_3      D_41  ...     D_138  \
0  0.003860  0.009151  0.818901  0.008979  0.143153  0.005497  ...  0.500092   
1  0.179014  0.560108  0.029272  0.75639
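
The check above compares five-row samples only. To check the full row count without loading the CSV into memory, one option is to count newlines in fixed-size binary chunks. This is a sketch, not the method used in this notebook: `count_csv_rows` is a hypothetical helper, and it assumes no field contains an embedded newline inside quotes.

```python
import os
import tempfile

def count_csv_rows(path, chunk_bytes=1 << 20):
    """Count data rows in a CSV by counting newlines, excluding the header line."""
    newlines = 0
    last = b""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_bytes):
            newlines += chunk.count(b"\n")
            last = chunk[-1:]
    # a final line without a trailing newline still counts as a line
    if last and last != b"\n":
        newlines += 1
    return max(newlines - 1, 0)

# tiny demo file: one header line plus two data rows
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n1,2\n3,4\n")
demo_rows = count_csv_rows(tmp.name)
os.remove(tmp.name)
print("data rows in demo file:", demo_rows)
```

Applied to `data.csv`, the result could be compared with `pq_file.metadata.num_rows`, which the earlier sanity check reported as 16895213.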

In [None]:
import numpy
import tables
print("NumPy :", numpy.__version__)
print("PyTables :", tables.__version__)

NumPy : 1.24.4
PyTables : 3.9.2


In [None]:
import pyarrow.parquet as pq, pandas as pd, numpy as np, gc, time
from pathlib import Path
#source / destination files
IN_PARQUET = Path("data.parquet")
OUT_H5 = Path("data.h5")

#open parquet file and getting how many row groups we need to stream
pq_file = pq.ParquetFile(IN_PARQUET, memory_map=True) # zero copy i/o data
num_row_groups = pq_file.num_row_groups

#iterating through the row groups and writing to hdf5
t0 = time.time()
# "w" -> create / overwrite HDF5 file
# "zlib" -> compression algorithm
# "6" -> compression level (0-9), 1 being fastest but minimal compression, 9 being slowest and highest compression
with pd.HDFStore(OUT_H5, "w", complib="zlib", complevel=6) as s:
    for i in range(num_row_groups):
        #loading one row group into pandas df
        df = pq_file.read_row_group(i).to_pandas()
        # bytes → hex-strings
        for col in df.select_dtypes("object"):
            if isinstance(df[col].iloc[0], (bytes, bytearray)):
                df[col] = df[col].apply(lambda b: b.hex())

        # force consistent NumPy int8 for label columns
        for col in ["target", "test"]:
            if col in df.columns:
                df[col] = df[col].fillna(-1).astype(np.int8)

        s.append("train", df, data_columns=True, index=False)
        del df; gc.collect()

        elapsed = time.time() - t0
        print(f"✓ row-group {i+1}/{num_row_groups}  |  elapsed {elapsed/60:.1f} min")

total = time.time() - t0
print(f"\nParquet → HDF5 completed in {total/60:.1f} minutes")


✓ row-group 1/17  |  elapsed 0.4 min
✓ row-group 2/17  |  elapsed 2.7 min
✓ row-group 3/17  |  elapsed 5.4 min
✓ row-group 4/17  |  elapsed 7.6 min
✓ row-group 5/17  |  elapsed 13.1 min
✓ row-group 6/17  |  elapsed 71.3 min
✓ row-group 7/17  |  elapsed 73.4 min
✓ row-group 8/17  |  elapsed 75.6 min
✓ row-group 9/17  |  elapsed 77.8 min
✓ row-group 10/17  |  elapsed 230.3 min
✓ row-group 11/17  |  elapsed 335.3 min
✓ row-group 12/17  |  elapsed 361.5 min
✓ row-group 13/17  |  elapsed 363.7 min
✓ row-group 14/17  |  elapsed 366.0 min
✓ row-group 15/17  |  elapsed 368.3 min
✓ row-group 16/17  |  elapsed 370.8 min
✓ row-group 17/17  |  elapsed 373.0 min

Parquet → HDF5 completed in 373.0 minutes


In [8]:
#sanity check of hdf5 file
import pandas as pd, os, time

start = time.time()
df_head = pd.read_hdf("data_2gb.h5", key="train", stop=5)
print(df_head.head())
print("\nHDF5 quick read time:", time.time()-start, "sec")
print("HDF5 size on disk:", os.path.getsize("data_2gb.h5")/1024**3, "GB")

                            line_ID  \
0  b66182866623461d8c947f8d9434d7b6   
1  4ca82b2da88d4da99667ed3049951e24   
2  5d735f87af2042ecbe4567b51eb2ae64   
3  fb5ed47b51b5484fa8b6f6cab15d4099   
4  60a596f61b0d478dab5c168de1c6f6be   

                                         customer_ID                    date  \
0  d00b98b2401d26197fa1d6102cdc1c9bbed7c066b8aaa9... 1970-01-18 14:18:14.400   
1  d00bc5e66e3aac9eae7c9e94621b36d196566d61ef7a32... 1970-01-18 14:45:36.000   
2  d00bd125cf6fa463a6c57b9959b8a4197f6f79fb154fee... 1970-01-18 14:49:55.200   
3  d00bfbdee3081206258a4b4fb2ef2eb311697f37056bfb... 1970-01-18 14:11:02.400   
4  d00c0dd295ada176c4e697d4cc1cd2f0d572870f770859... 1970-01-18 14:47:02.400   

        P_2      D_39       B_1       B_2       R_1       S_3      D_41  ...  \
0  0.366254  0.003860  0.009151  0.818901  0.008979  0.143153  0.005497  ...   
1  0.312623  0.179014  0.560108  0.029272  0.756391  0.091940  0.005489  ...   
2  0.395606  1.066026  0.731072  0.019496  0
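
The `date` values in the HDF5 preview fall in January 1970, while the same rows show March 2018 in the Parquet preview. The Parquet schema stores `date` in microseconds, and the arithmetic below suggests those microsecond counts were interpreted as nanoseconds somewhere on the write or read path. That cause is an inference from the numbers, not something this notebook confirms. The sketch uses only the standard library and the first row's date.

```python
from datetime import datetime, timedelta, timezone

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
true_date = datetime(2018, 3, 6, tzinfo=timezone.utc)  # first row in the Parquet preview

# microseconds since the epoch, as the Parquet schema stores them
micros = (true_date - epoch) // timedelta(microseconds=1)

# read the same integer as nanoseconds: 1000 ns = 1 microsecond
misread = epoch + timedelta(microseconds=micros / 1000)

print("stored integer:", micros)
print("misread as ns :", misread)
```

The misread value is 1970-01-18 14:18:14.4, which matches the first row of the HDF5 preview. A possible fix, untested here, is to cast the column to `datetime64[ns]` before calling `s.append`.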

##### Actual benchmarks of each format
(See run_benchmark.py. Because each run is lengthy, I ran each format separately and asynchronously.)

| format  | rows     | load_sec | train_sec | peak_ram_gb |
|---------|----------|----------|-----------|-------------|
| Parquet | 1000000  | 22.19    | 7.83      | 0.92        |
| CSV     | 1000000  | 375.81   | 7.76      | 3.84        |
| HDF5    | 1000000  | 96.84    | 7.78      | 7.66        |
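
To make the table easier to compare, the sketch below expresses load time and peak RAM as multiples of the Parquet run. The numbers are copied from the table; the `results` and `ratios` names are introduced here for illustration only.

```python
# values copied from the benchmark table above
results = {
    "Parquet": {"load_sec": 22.19, "train_sec": 7.83, "peak_ram_gb": 0.92},
    "CSV": {"load_sec": 375.81, "train_sec": 7.76, "peak_ram_gb": 3.84},
    "HDF5": {"load_sec": 96.84, "train_sec": 7.78, "peak_ram_gb": 7.66},
}

baseline = results["Parquet"]
ratios = {
    fmt: {
        "load_x": r["load_sec"] / baseline["load_sec"],
        "ram_x": r["peak_ram_gb"] / baseline["peak_ram_gb"],
    }
    for fmt, r in results.items()
}

for fmt, r in ratios.items():
    print(f"{fmt:8s} load {r['load_x']:.1f}x  peak RAM {r['ram_x']:.1f}x  (vs Parquet)")
```

On these runs, CSV takes roughly 17 times as long to load as Parquet and HDF5 roughly 4 times as long, while training time is nearly identical across the three formats.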
