<a href="https://colab.research.google.com/github/satyakisen/pandas-ff-comparison/blob/main/Pandas_File_Format_Comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas File Format Comparison
## Overview
In this notebook we compare the following file formats that pandas can read and write.
1. csv - a common text file format with comma-separated values.
2. hdf5 - an open source file format that supports large, complex, heterogeneous data.
3. parquet - an open source, column-oriented data file format designed for efficient data storage and retrieval.
4. feather - a portable file format for storing Arrow tables or data frames (from languages like Python or R) that uses the Arrow IPC format internally.

The comparison parameters to consider are:
1. Time to write.
2. Time to read.
3. File size on disk.

## Let us begin
Let us begin by creating a dummy dataset containing only random floats.

In [3]:
import pandas as pd
import numpy as np

def make_data(row_n, col_n):
  arr = np.random.randn(row_n, col_n)
  df = pd.DataFrame(arr, columns=['col_{0}'.format(i) for i in range(col_n)])
  return df

df = make_data(100000, 10)

Let us check the dummy dataset we made.

In [4]:
df.head(5)

,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9
0,0.268581,0.528943,-1.447149,1.065151,1.990748,0.264007,0.474082,-0.274708,2.836983,-1.18829
1,0.245928,-0.75561,0.595658,-1.581838,1.518598,1.507621,1.053752,0.914109,0.997426,0.36312
2,0.462662,-0.299092,2.870275,-0.920234,0.384289,0.853863,2.750504,0.051945,-0.761517,1.201193
3,-0.214414,2.169669,-1.660826,1.246721,0.340115,-0.152463,-1.196192,1.817195,1.195896,-1.093683
4,0.19141,0.928366,0.227941,2.1637,-1.135679,0.149892,1.223956,-0.549677,-0.679656,0.166663


## Time to write

Let us now check how long it takes to write the dataframe we created. We time each write with the IPython `%time` magic, saving the dataframe once in each of the four file formats.
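Outside a notebook, where `%time` is not available, a small decorator can do the same timing. The following is a minimal sketch, not part of the original notebook: the names `timed` and `save` are hypothetical, and `save` simply assumes that the object it receives has a writer method named `to_<fmt>`, as a pandas DataFrame does for `to_csv`, `to_hdf`, `to_parquet` and `to_feather`.

```python
import functools
import time


def timed(func):
    """Wrap func so that each call returns (result, elapsed_seconds)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        return result, time.perf_counter() - start
    return wrapper


@timed
def save(df, path, fmt, **kwargs):
    """Call the writer named to_<fmt> on df, e.g. to_csv or to_feather."""
    writer = getattr(df, f"to_{fmt}")
    return writer(path, **kwargs)
```

With the dataframe above, `_, seconds = save(df, 'test.h5', 'hdf', key='root')` would return the wall-clock time of the HDF5 write; extra keyword arguments are passed straight through to the pandas writer.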

In [5]:
%time df.to_csv('test.csv')

CPU times: user 2.72 s, sys: 111 ms, total: 2.83 s
Wall time: 3.48 s


In [6]:
%time df.to_hdf('test.h5', key='root')

CPU times: user 60.3 ms, sys: 25 ms, total: 85.3 ms
Wall time: 273 ms


In [7]:
%time df.to_parquet('test.parquet')

CPU times: user 184 ms, sys: 37.3 ms, total: 221 ms
Wall time: 421 ms


In [8]:
%time df.to_feather('test.feather')

CPU times: user 29.6 ms, sys: 19.1 ms, total: 48.8 ms
Wall time: 59.3 ms


From the above we can see that **feather** is the fastest format to write (59.3 ms wall time), followed by **hdf** (273 ms) and **parquet** (421 ms), while csv is far slower (3.48 s). Now let us consider file size.

## File size on disk
Let us now check the file size on disk.

In [9]:
%%bash
du -sh test.*

20M	test.csv
7.7M	test.feather
8.5M	test.h5
9.7M	test.parquet


From the above result we can see that **feather** and **hdf** produce the smallest files, with parquet close behind; csv, at 20M, is more than twice the size of any of the binary formats. We will check this again with a larger dataset later on.
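The same check can be done from Python instead of `du`. This is a small sketch using only the standard library; the helper name `file_sizes_mib` is hypothetical. It reports the apparent file size in MiB, whereas `du` reports disk usage, so the numbers may differ slightly from the listing above.

```python
from pathlib import Path


def file_sizes_mib(directory=".", pattern="test.*"):
    """Map each matching file name to its size in MiB (bytes / 2**20)."""
    return {
        p.name: p.stat().st_size / 2**20
        for p in sorted(Path(directory).glob(pattern))
        if p.is_file()
    }


# Print one line per file, smallest first.
for name, size in sorted(file_sizes_mib().items(), key=lambda item: item[1]):
    print(f"{size:8.1f} MiB  {name}")
```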

## Time to read
Let us check how long a read operation takes.

In [10]:
%%time 
df=pd.read_csv('test.csv')

CPU times: user 219 ms, sys: 52 ms, total: 271 ms
Wall time: 273 ms


In [11]:
%%time
df=pd.read_hdf('test.h5')

CPU times: user 17.7 ms, sys: 7.06 ms, total: 24.8 ms
Wall time: 28.5 ms


In [12]:
%%time
df=pd.read_parquet('test.parquet')

CPU times: user 27.2 ms, sys: 26.3 ms, total: 53.5 ms
Wall time: 87.7 ms


In [13]:
%%time
df=pd.read_feather('test.feather')

CPU times: user 12.9 ms, sys: 8.23 ms, total: 21.1 ms
Wall time: 16.7 ms


In the above result we see that **feather** (16.7 ms) and **hdf** (28.5 ms) are the fastest to read, ahead of parquet (87.7 ms) and csv (273 ms).
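Each of these timings comes from a single run, so it may be affected by noise such as operating system file caching. One way to get a steadier number is to repeat the read several times and keep the fastest run. The helper below is a sketch built on the standard library `timeit` module; the name `best_of` is hypothetical.

```python
import timeit


def best_of(func, repeat=5):
    """Call func `repeat` times and return the fastest wall-clock time in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeat))
```

For example, `best_of(lambda: pd.read_feather('test.feather'))` would time five separate reads of the feather file and report the quickest one.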

## Testing with categorical data
Now let us create data with both categorical and numerical values and check the performance of the different file formats.

In [2]:
!pip install lorem

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Prepare Data

In [14]:
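from lorem import sentence

# Vocabulary for the text column: the words of one random lorem sentence.
words = np.array(sentence().strip().lower().replace(".", " ").split())

np.random.seed(0)
n = 1000000
# np.c_ stacks all blocks into a single array, so the strings in the last
# block turn every column into strings (object dtype in the DataFrame).
df = pd.DataFrame(
    np.c_[np.random.randn(n, 5),
          np.random.randint(0, 10, (n, 2)),
          np.random.randint(0, 1, (n, 2)),  # upper bound is exclusive: all zeros
          np.array([np.random.choice(words) for _ in range(n)])],
    columns=list('ABCDEFGHIJ'))

# Blank out every 10th value of A; .loc avoids chained assignment.
df.loc[::10, "A"] = np.nan
len(df)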

1000000
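Because the numeric and text blocks are stacked into a single array, the numbers are stored as strings and no column keeps a numeric dtype. The sketch below illustrates the difference on a much smaller frame. It is not the dataset benchmarked in this notebook: the word list, the size `n`, and the column selection are assumptions made for illustration, and it uses NumPy's `default_rng` generator rather than the legacy `np.random` functions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
words = np.array(["lorem", "ipsum", "dolor"])  # hypothetical stand-in vocabulary

# Stacking floats and strings into one array forces a single string dtype,
# so none of the resulting columns is numeric.
stacked = pd.DataFrame(
    np.c_[rng.standard_normal((n, 2)), rng.choice(words, n)],
    columns=list("ABJ"),
)

# Building the frame column by column lets each column keep its own dtype.
typed = pd.DataFrame({
    "A": rng.standard_normal(n),
    "F": rng.integers(0, 10, n),
    "J": rng.choice(words, n),
})

print(stacked.dtypes)
print(typed.dtypes)
```

A frame built like `typed` would give the binary formats real float and integer columns to store, so its timings and file sizes would likely differ from the ones reported here.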

### Time to write

In [25]:
%time df.to_csv('test_big.csv')

CPU times: user 4.89 s, sys: 1.02 s, total: 5.91 s
Wall time: 5.98 s


In [26]:
%time df.to_hdf('test_big.h5', key='root')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], dtype='object')]

  encoding=encoding,


CPU times: user 3.13 s, sys: 546 ms, total: 3.67 s
Wall time: 3.65 s


In [27]:
%time df.to_parquet('test_big.parquet')

CPU times: user 1.62 s, sys: 386 ms, total: 2.01 s
Wall time: 2.03 s


In [28]:
%time df.to_feather('test_big.feather')

CPU times: user 1.21 s, sys: 283 ms, total: 1.5 s
Wall time: 1.29 s


### File size on disk

In [29]:
%%bash
du -sh test_big.*

112M	test_big.csv
122M	test_big.feather
251M	test_big.h5
80M	test_big.parquet


### Time to read

In [31]:
%%time 
df=pd.read_csv('test_big.csv')

CPU times: user 1.38 s, sys: 133 ms, total: 1.51 s
Wall time: 1.52 s


In [33]:
%%time
df=pd.read_hdf('test_big.h5', key='root')

CPU times: user 2.04 s, sys: 308 ms, total: 2.35 s
Wall time: 2.88 s


In [34]:
%%time
df=pd.read_parquet('test_big.parquet')

CPU times: user 2.01 s, sys: 584 ms, total: 2.59 s
Wall time: 2.32 s


In [35]:
%%time
df=pd.read_feather('test_big.feather')

CPU times: user 1.51 s, sys: 289 ms, total: 1.8 s
Wall time: 1.73 s


The cells above show that when the text field is stored as raw strings, the feather and hdf files become large: 122M and 251M, compared with 112M for csv and 80M for parquet. Now let us convert the text column with `pd.Categorical` and check what happens.

In [36]:
df['J'] = pd.Categorical(df['J'])
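A categorical column stores each distinct value once and replaces the column entries with small integer codes. The sketch below shows the effect on in-memory size for a smaller, made-up column; the word list and the length of 100,000 are assumptions for illustration, not the data used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
words = np.array(["lorem", "ipsum", "dolor", "sit", "amet"])  # hypothetical values

as_text = pd.Series(rng.choice(words, 100_000))
as_category = pd.Series(pd.Categorical(as_text))

# deep=True also counts the memory held by the Python string objects.
text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"text: {text_bytes} bytes, categorical: {category_bytes} bytes")
print(as_category.cat.categories.tolist())
```

How much of this saving carries over to disk depends on the file format and on how its writer handles categorical columns, which is what the following cells measure.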

### Time to write

In [37]:
%time df.to_csv('test_big_pd_cat.csv')

CPU times: user 5.68 s, sys: 373 ms, total: 6.05 s
Wall time: 6.76 s


In [39]:
%time df.to_hdf('test_big_pd_cat.h5', key='root', format='table')

CPU times: user 8.03 s, sys: 723 ms, total: 8.76 s
Wall time: 8.89 s


In [40]:
%time df.to_parquet('test_big_pd_cat.parquet')

CPU times: user 1.5 s, sys: 458 ms, total: 1.96 s
Wall time: 2 s


In [41]:
%time df.to_feather('test_big_pd_cat.feather')

CPU times: user 1.07 s, sys: 501 ms, total: 1.57 s
Wall time: 1.36 s


### File size on disk

In [42]:
%%bash
du -sh test_big_pd_cat.*

112M	test_big_pd_cat.csv
117M	test_big_pd_cat.feather
316M	test_big_pd_cat.h5
80M	test_big_pd_cat.parquet


### Time to read

In [43]:
%%time
df = pd.read_csv('test_big_pd_cat.csv')

CPU times: user 1.58 s, sys: 192 ms, total: 1.78 s
Wall time: 1.83 s


In [44]:
%%time
df = pd.read_hdf('test_big_pd_cat.h5')

CPU times: user 6.8 s, sys: 1.42 s, total: 8.22 s
Wall time: 8.29 s


In [45]:
%%time
df = pd.read_parquet('test_big_pd_cat.parquet')

CPU times: user 2.01 s, sys: 719 ms, total: 2.73 s
Wall time: 2.54 s


In [46]:
%%time
df = pd.read_feather('test_big_pd_cat.feather')

CPU times: user 1.55 s, sys: 519 ms, total: 2.06 s
Wall time: 2.04 s


## Feather or Parquet
1. Parquet is designed for long-term storage, whereas Arrow (and therefore Feather) is intended more for short-term or ephemeral storage, because its files are larger.

2. Parquet is usually more expensive to write than Feather, as it features more layers of encoding and compression.

3. Feather was originally unmodified raw columnar Arrow memory; Feather V2 has since added optional LZ4 and ZSTD compression.

4. Due to dictionary encoding, RLE encoding, and data page compression, Parquet files will often be much smaller than Feather files.

5. Parquet is a standard storage format for analytics that is supported by Spark. So if you are doing analytics, Parquet is a good option as a reference storage format that multiple systems can query.

[Source StackOverflow](https://stackoverflow.com/questions/48083405/what-are-the-differences-between-feather-and-parquet)
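To make the dictionary and RLE encodings mentioned above more concrete, here is a toy sketch in plain Python. It only illustrates the two ideas on a list of strings; it is not Parquet's actual on-disk layout, and the function names are hypothetical.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an integer code into a list of distinct values."""
    dictionary = sorted(set(values))
    index = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in values]


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, len(list(run))) for code, run in groupby(codes)]


column = ["lorem", "lorem", "lorem", "ipsum", "ipsum", "lorem"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)

print(dictionary)  # ['ipsum', 'lorem']
print(codes)       # [1, 1, 1, 0, 0, 1]
print(runs)        # [(1, 3), (0, 2), (1, 1)]
```

A column with few distinct values and long runs shrinks a lot under this scheme, which is one reason the text column costs Parquet relatively little space.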

## Bibliography
This notebook is inspired by the following link.<br>

[File Format - Python tools for big data](https://pnavaro.github.io/big-data/14-FileFormats.html)