
# Pandas File Format Comparison
## Overview
In this notebook we compare the following file formats that pandas can read and write.
1. csv - a common plain-text format in which values are separated by commas.
2. hdf5 - an open source file format that supports large, complex, heterogeneous data.
3. parquet - an open source, column-oriented data file format designed for efficient data storage and retrieval.
4. feather - a portable file format for storing Arrow tables or data frames (from languages like Python or R) that uses the Arrow IPC format internally.

The comparison parameters to consider are:
1. Time to write.
2. Time to read.
3. File size on disk.
4. Memory Usage.


## Testing with Numerical Data
Let us begin by creating a dummy dataset containing only random floats.

In [2]:
import pandas as pd
import numpy as np

def make_data(row_n, col_n):
  arr = np.random.randn(row_n, col_n)
  df = pd.DataFrame(arr, columns=['col_{0}'.format(i) for i in range(col_n)])
  return df

df = make_data(100000, 10)

Let us check the dummy dataset we made.

In [3]:
df.head(5)

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9
0,0.419329,1.372807,0.805272,0.107557,-0.831611,1.614514,0.865308,-0.605107,0.234464,0.017989
1,1.061586,2.089864,-0.768182,-0.518389,1.308981,-1.2941,0.265615,-0.385778,-0.55821,2.054694
2,0.465276,0.721319,-0.411959,-2.379256,0.502872,-0.125381,0.284121,-0.031056,1.146651,-1.71993
3,0.112522,-0.056549,0.379488,-0.194104,0.441639,-1.146654,0.079624,-1.125895,0.518559,-1.486646
4,-0.814379,0.501988,0.571435,-0.751721,-0.189275,-0.482968,-0.109526,-0.295733,0.348015,2.176098


### Time to write

Let us now check how long it takes to write the dataframe created above. We use the IPython `%time` magic to time each write, one file format per cell.

In [4]:
%time df.to_csv('test.csv')

CPU times: user 1.59 s, sys: 68.3 ms, total: 1.66 s
Wall time: 1.68 s


In [5]:
%time df.to_hdf('test.h5', key='root')

CPU times: user 35.4 ms, sys: 22.3 ms, total: 57.7 ms
Wall time: 182 ms


In [6]:
%time df.to_parquet('test.parquet')

CPU times: user 101 ms, sys: 34.4 ms, total: 136 ms
Wall time: 217 ms


In [7]:
%time df.to_feather('test.feather')

CPU times: user 25.2 ms, sys: 15.3 ms, total: 40.5 ms
Wall time: 40.5 ms
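
Outside IPython, where the `%time` magic is not available, a similar measurement can be sketched with a small timing decorator. This is only an illustration: the names `timed` and `save` are hypothetical helpers, the frame is deliberately tiny, and formats whose optional engine (`tables` for HDF5, `pyarrow` for parquet and feather) is not installed are skipped.

```python
import functools
import importlib.util
import os
import tempfile
import time

import numpy as np
import pandas as pd


def timed(func):
    """Wrap func so that it returns (result, elapsed_seconds)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        return result, time.perf_counter() - start
    return wrapper


@timed
def save(frame, path, fmt):
    # Dispatch to the same pandas writers used in the cells above.
    if fmt == "csv":
        frame.to_csv(path)
    elif fmt == "hdf5":
        frame.to_hdf(path, key="root")
    elif fmt == "parquet":
        frame.to_parquet(path)
    elif fmt == "feather":
        frame.to_feather(path)
    else:
        raise ValueError(f"unknown format: {fmt}")


# Optional engines: skip a format when its backend is not installed.
required = {"csv": None, "hdf5": "tables", "parquet": "pyarrow", "feather": "pyarrow"}
available = [fmt for fmt, module in required.items()
             if module is None or importlib.util.find_spec(module) is not None]

# Small frame for illustration only; the notebook uses 100000 rows.
small = pd.DataFrame(np.random.randn(1000, 10),
                     columns=[f"col_{i}" for i in range(10)])

timings = {}
with tempfile.TemporaryDirectory() as tmp:
    for fmt in available:
        _, seconds = save(small, os.path.join(tmp, f"test.{fmt}"), fmt)
        timings[fmt] = seconds

for fmt, seconds in timings.items():
    print(f"{fmt:8s} {seconds * 1000:8.2f} ms")
```

A single run is noisy, so for a fairer comparison each write would normally be repeated several times and the best or median time reported.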


From the timings above, **feather** is the fastest to write (about 40 ms wall time), **hdf5** and **parquet** follow at roughly 0.2 s, and **csv** is by far the slowest at 1.68 s. Now let us look at the file size on disk.

### File size on Disk
Let us now check the file size on disk.

In [8]:
%%bash
du -sh test.*

20M	test.csv
7.7M	test.feather
8.5M	test.h5
9.7M	test.parquet


From the above result we can see that **feather** (7.7M) and **hdf5** (8.5M) produce the smallest files for this dataset, while **csv** is more than twice as large. We will check the file sizes again with a much larger dataset afterward.
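
The `du` call only works where a shell is available. As a rough portable sketch, the same listing can be produced from Python with `pathlib`. The helpers `human_size` and `list_sizes` are hypothetical names, and the demo writes throwaway files so that it runs on its own. Note that `st_size` is the apparent file size, whereas `du` reports allocated disk blocks, so the two can differ slightly.

```python
from pathlib import Path
import tempfile


def human_size(num_bytes):
    """Format a byte count with binary units, similar to `du -h`."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G"):
        if size < 1024 or unit == "G":
            return f"{size:.1f}{unit}"
        size /= 1024


def list_sizes(directory, pattern):
    """Map each matching file name to its size in bytes."""
    return {p.name: p.stat().st_size for p in sorted(Path(directory).glob(pattern))}


# Demo on throwaway files so the sketch runs without the notebook's outputs.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "test.csv").write_bytes(b"x" * 2048)
    (Path(tmp) / "test.feather").write_bytes(b"x" * 512)
    sizes = list_sizes(tmp, "test.*")

for name, num_bytes in sizes.items():
    print(f"{human_size(num_bytes)}\t{name}")
```

In the notebook itself, `list_sizes(".", "test.*")` would cover the four files written above.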

### Time to read
Let us check how long a read operation takes for each format.

In [9]:
%%time 
df_csv=pd.read_csv('test.csv')

CPU times: user 218 ms, sys: 41.7 ms, total: 260 ms
Wall time: 267 ms


In [10]:
%%time
df_hdf=pd.read_hdf('test.h5')

CPU times: user 17 ms, sys: 6.51 ms, total: 23.5 ms
Wall time: 83.6 ms


In [11]:
%%time
df_parquet=pd.read_parquet('test.parquet')

CPU times: user 23.6 ms, sys: 27.4 ms, total: 51 ms
Wall time: 88.7 ms


In [12]:
%%time
df_feather=pd.read_feather('test.feather')

CPU times: user 7.15 ms, sys: 11.1 ms, total: 18.2 ms
Wall time: 19.3 ms


### Memory Usage

In [13]:
df_csv.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 11 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Unnamed: 0  100000 non-null  int64  
 1   col_0       100000 non-null  float64
 2   col_1       100000 non-null  float64
 3   col_2       100000 non-null  float64
 4   col_3       100000 non-null  float64
 5   col_4       100000 non-null  float64
 6   col_5       100000 non-null  float64
 7   col_6       100000 non-null  float64
 8   col_7       100000 non-null  float64
 9   col_8       100000 non-null  float64
 10  col_9       100000 non-null  float64
dtypes: float64(10), int64(1)
memory usage: 8.4 MB


In [14]:
df_hdf.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 10 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   col_0   100000 non-null  float64
 1   col_1   100000 non-null  float64
 2   col_2   100000 non-null  float64
 3   col_3   100000 non-null  float64
 4   col_4   100000 non-null  float64
 5   col_5   100000 non-null  float64
 6   col_6   100000 non-null  float64
 7   col_7   100000 non-null  float64
 8   col_8   100000 non-null  float64
 9   col_9   100000 non-null  float64
dtypes: float64(10)
memory usage: 8.4 MB


In [15]:
df_parquet.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 10 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   col_0   100000 non-null  float64
 1   col_1   100000 non-null  float64
 2   col_2   100000 non-null  float64
 3   col_3   100000 non-null  float64
 4   col_4   100000 non-null  float64
 5   col_5   100000 non-null  float64
 6   col_6   100000 non-null  float64
 7   col_7   100000 non-null  float64
 8   col_8   100000 non-null  float64
 9   col_9   100000 non-null  float64
dtypes: float64(10)
memory usage: 7.6 MB


In [16]:
df_feather.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 10 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   col_0   100000 non-null  float64
 1   col_1   100000 non-null  float64
 2   col_2   100000 non-null  float64
 3   col_3   100000 non-null  float64
 4   col_4   100000 non-null  float64
 5   col_5   100000 non-null  float64
 6   col_6   100000 non-null  float64
 7   col_7   100000 non-null  float64
 8   col_8   100000 non-null  float64
 9   col_9   100000 non-null  float64
dtypes: float64(10)
memory usage: 7.6 MB
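
The gap between 8.4 MB and 7.6 MB appears to come from one extra 8-byte value per row: the CSV frame carries an `Unnamed: 0` column (the index that `to_csv` wrote and `read_csv` read back as data), and the HDF5 frame carries a materialized `Int64Index`. The sketch below reproduces that arithmetic on a freshly generated frame; it assumes, as pandas does in `info()`, that "MB" means 2**20 bytes.

```python
import io

import numpy as np
import pandas as pd

n_rows, n_cols = 100_000, 10
frame = pd.DataFrame(np.random.randn(n_rows, n_cols),
                     columns=[f"col_{i}" for i in range(n_cols)])

MIB = 2 ** 20  # DataFrame.info() reports binary megabytes

# Ten float64 columns: 100000 rows * 10 columns * 8 bytes.
data_bytes = int(frame.memory_usage(index=False).sum())
print(f"data only: {data_bytes / MIB:.1f} MB")                 # data only: 7.6 MB

# One extra int64 column (or a materialized integer index) adds 8 bytes per row.
with_extra = data_bytes + n_rows * 8
print(f"with extra int64 values: {with_extra / MIB:.1f} MB")   # ... 8.4 MB

# Round-tripping through CSV with the default index=True creates that column.
buffer = io.StringIO()
frame.head(100).to_csv(buffer)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns)[:2])

# Writing with index=False avoids it (reading with index_col=0 is another option).
buffer = io.StringIO()
frame.head(100).to_csv(buffer, index=False)
buffer.seek(0)
clean = pd.read_csv(buffer)
print(list(clean.columns)[:2])
```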


## Testing with Categorical Data
Now let us create data with both categorical and numerical values and check the performance of the different file formats.

In [17]:
!pip install lorem

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Prepare Data

In [41]:
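from lorem import sentence

# Vocabulary for the text column: the words of one random lorem-ipsum sentence.
words = np.array(sentence().strip().lower().replace(".", " ").split())

np.random.seed(0)
n = 5000000
# np.c_ stacks all columns into a single string array; astype below restores the dtypes.
df = pd.DataFrame(
    np.c_[np.random.randn(n, 5),
          np.random.randint(0, 10, (n, 2)),
          np.random.randint(0, 1, (n, 2)),  # upper bound is exclusive, so these are all 0
          np.array([np.random.choice(words) for i in range(n)])],
    columns=list('ABCDEFGHIJ'))

df = df.astype(dtype={'A': float, 'B': float, 'C': float, 'D': float, 'E': float,
                      'F': int, 'G': int, 'H': int, 'I': int, 'J': str}, copy=True)
df.loc[::10, 'A'] = np.nan  # every 10th value of A is missing (np.NaN was removed in NumPy 2.0)
len(df)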

5000000
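
The `np.c_` call above concatenates floats, integers and strings into one string array before `astype` restores the types. A possible alternative, sketched here under several assumptions, builds each column with its final dtype directly. The vocabulary is a fixed placeholder instead of a `lorem` sentence, the row count is reduced, a different random generator is used (so the values will not match the notebook's), and columns `H` and `I` are drawn from {0, 1}, which is a guess at the original intent.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000  # small size for illustration; the notebook uses 5,000,000
words = np.array(["lorem", "ipsum", "dolor", "sit", "amet"])  # placeholder vocabulary

columns = {name: rng.standard_normal(n) for name in "ABCDE"}      # floats
columns.update({name: rng.integers(0, 10, n) for name in "FG"})   # integers 0..9
columns.update({name: rng.integers(0, 2, n) for name in "HI"})    # 0 or 1 (assumed intent)
columns["J"] = rng.choice(words, size=n)  # vectorized draw, no Python-level loop

mixed = pd.DataFrame(columns)
mixed.loc[::10, "A"] = np.nan  # every 10th value of A is missing
print(mixed.dtypes)
```

Because no intermediate string array is created, this approach should need considerably less memory and time at the notebook's scale, although that has not been measured here.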

### Time to write

In [42]:
%time df.to_csv('test_big.csv', index=False)

CPU times: user 44.8 s, sys: 878 ms, total: 45.7 s
Wall time: 46.8 s


In [43]:
%time df.to_hdf('test_big.h5', key='root', index=False)

CPU times: user 2.18 s, sys: 974 ms, total: 3.16 s
Wall time: 3.64 s


In [44]:
%time df.to_parquet('test_big.parquet', index=False)

CPU times: user 1.35 s, sys: 328 ms, total: 1.68 s
Wall time: 1.7 s


In [45]:
%time df.to_feather('test_big.feather')

CPU times: user 940 ms, sys: 337 ms, total: 1.28 s
Wall time: 1.19 s


### File size on disk

In [46]:
%%bash
du -sh test_big.*

533M	test_big.csv
244M	test_big.feather
1.2G	test_big.h5
195M	test_big.parquet


### Time to read

In [47]:
%%time 
df_big_csv=pd.read_csv('test_big.csv')

CPU times: user 5.87 s, sys: 1.02 s, total: 6.89 s
Wall time: 6.9 s


In [48]:
%%time
df_big_hdf=pd.read_hdf('test_big.h5', key='root')

CPU times: user 937 ms, sys: 224 ms, total: 1.16 s
Wall time: 1.15 s


In [49]:
%%time
df_big_parquet=pd.read_parquet('test_big.parquet')

CPU times: user 696 ms, sys: 715 ms, total: 1.41 s
Wall time: 992 ms


In [50]:
%%time
df_big_feather=pd.read_feather('test_big.feather')

CPU times: user 483 ms, sys: 269 ms, total: 752 ms
Wall time: 515 ms


In [51]:
df_big_parquet.dtypes

A    float64
B    float64
C    float64
D    float64
E    float64
F      int64
G      int64
H      int64
I      int64
J     object
dtype: object

### Memory Usage

In [52]:
df_big_csv.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000000 entries, 0 to 4999999
Data columns (total 10 columns):
 #   Column  Dtype  
---  ------  -----  
 0   A       float64
 1   B       float64
 2   C       float64
 3   D       float64
 4   E       float64
 5   F       int64  
 6   G       int64  
 7   H       int64  
 8   I       int64  
 9   J       object 
dtypes: float64(5), int64(4), object(1)
memory usage: 645.6 MB


In [53]:
df_big_hdf.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000000 entries, 0 to 4999999
Data columns (total 10 columns):
 #   Column  Dtype  
---  ------  -----  
 0   A       float64
 1   B       float64
 2   C       float64
 3   D       float64
 4   E       float64
 5   F       int64  
 6   G       int64  
 7   H       int64  
 8   I       int64  
 9   J       object 
dtypes: float64(5), int64(4), object(1)
memory usage: 683.8 MB


In [54]:
df_big_parquet.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000000 entries, 0 to 4999999
Data columns (total 10 columns):
 #   Column  Dtype  
---  ------  -----  
 0   A       float64
 1   B       float64
 2   C       float64
 3   D       float64
 4   E       float64
 5   F       int64  
 6   G       int64  
 7   H       int64  
 8   I       int64  
 9   J       object 
dtypes: float64(5), int64(4), object(1)
memory usage: 645.6 MB


In [55]:
df_big_feather.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000000 entries, 0 to 4999999
Data columns (total 10 columns):
 #   Column  Dtype  
---  ------  -----  
 0   A       float64
 1   B       float64
 2   C       float64
 3   D       float64
 4   E       float64
 5   F       int64  
 6   G       int64  
 7   H       int64  
 8   I       int64  
 9   J       object 
dtypes: float64(5), int64(4), object(1)
memory usage: 645.6 MB


We can see in the cells above that when the text column is stored as raw strings, the HDF5 file becomes very large (1.2G, more than twice the size of the CSV) and the feather file (244M) is larger than the parquet file (195M). Now let us convert the column with `pd.Categorical` and check what happens.
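
To see why a categorical column can take less space, the following minimal sketch compares the in-memory size of a low-cardinality text column before and after conversion. The vocabulary and row count are placeholders chosen for illustration, not the notebook's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
words = np.array(["lorem", "ipsum", "dolor", "sit", "amet"])  # placeholder vocabulary
as_text = pd.Series(rng.choice(words, size=100_000))
as_category = pd.Series(pd.Categorical(as_text))

text_bytes = int(as_text.memory_usage(index=False, deep=True))
category_bytes = int(as_category.memory_usage(index=False, deep=True))
print(f"text:     {text_bytes:>10,d} bytes")
print(f"category: {category_bytes:>10,d} bytes")

# Each row is stored as a small integer code that points into the category list.
print(as_category.cat.categories.tolist())
print(as_category.cat.codes.dtype)
```

The saving depends on cardinality: with only a handful of distinct values, each row needs a single small integer code, whereas a column of mostly unique strings would gain little from the conversion.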

In [56]:
df['J'] = pd.Categorical(df['J'])

In [57]:
df.dtypes

A     float64
B     float64
C     float64
D     float64
E     float64
F       int64
G       int64
H       int64
I       int64
J    category
dtype: object

### Time to write

In [58]:
%time df.to_csv('test_big_pd_cat.csv')

CPU times: user 46.5 s, sys: 928 ms, total: 47.4 s
Wall time: 47.5 s


In [59]:
%time df.to_hdf('test_big_pd_cat.h5', key='root', format='table')

CPU times: user 3.24 s, sys: 527 ms, total: 3.77 s
Wall time: 5.31 s


In [60]:
%time df.to_parquet('test_big_pd_cat.parquet')

CPU times: user 1.01 s, sys: 341 ms, total: 1.35 s
Wall time: 1.36 s


In [61]:
%time df.to_feather('test_big_pd_cat.feather')

CPU times: user 695 ms, sys: 247 ms, total: 942 ms
Wall time: 787 ms


### File size on disk

In [62]:
%%bash
du -sh test_big_pd_cat.*

570M	test_big_pd_cat.csv
218M	test_big_pd_cat.feather
517M	test_big_pd_cat.h5
195M	test_big_pd_cat.parquet


### Time to read

In [63]:
%%time
df_big_pd_cat_csv = pd.read_csv('test_big_pd_cat.csv')

CPU times: user 6.72 s, sys: 889 ms, total: 7.61 s
Wall time: 8.51 s


In [64]:
%%time
df_big_pd_cat_hdf = pd.read_hdf('test_big_pd_cat.h5')

CPU times: user 820 ms, sys: 419 ms, total: 1.24 s
Wall time: 2.03 s


In [65]:
%%time
df_big_pd_cat_parquet = pd.read_parquet('test_big_pd_cat.parquet')

CPU times: user 393 ms, sys: 584 ms, total: 977 ms
Wall time: 670 ms


In [66]:
%%time
df_big_pd_cat_feather = pd.read_feather('test_big_pd_cat.feather')

CPU times: user 335 ms, sys: 392 ms, total: 727 ms
Wall time: 502 ms
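
Read time is not the only thing to check after converting the column: it also matters whether the `category` dtype survives the round trip. CSV stores plain text, so the dtype is lost unless it is declared again when reading. The sketch below shows this with an in-memory buffer and a tiny illustrative frame. Parquet and Feather record a schema alongside the data and are generally able to restore the dtype, but that part is not exercised here because it needs an optional engine.

```python
import io

import pandas as pd

frame = pd.DataFrame({"A": [0.1, 0.2, 0.3],
                      "J": pd.Categorical(["lorem", "ipsum", "lorem"])})

# CSV stores plain text, so the category dtype is lost on the way back in.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
plain = pd.read_csv(buffer)
print(plain["J"].dtype)

# Declaring the dtype at read time restores it.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"J": "category"})
print(restored["J"].dtype)
```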


## Feather or Parquet
1. The Parquet format is designed for long-term storage, whereas Arrow (and therefore Feather) is intended more for short-term or ephemeral storage, because its files tend to be larger.

2. Parquet is usually more expensive to write than Feather, as it features more layers of encoding and compression.

3. Feather is unmodified raw columnar Arrow memory. When the cited answer was written, simple compression for Feather was only planned; Feather V2 has since added optional LZ4 and ZSTD compression.

4. Due to dictionary encoding, RLE encoding, and data page compression, Parquet files will often be much smaller than Feather files.

5. Parquet is a standard storage format for analytics that is supported by Spark. So if you are doing analytics, Parquet is a good option as a reference storage format that multiple systems can query.

[Source StackOverflow](https://stackoverflow.com/questions/48083405/what-are-the-differences-between-feather-and-parquet)
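
The effect of the compression layers mentioned above can be explored with a small experiment. The sketch below is only illustrative: it assumes `pyarrow` is installed (and does nothing otherwise), uses a small made-up frame, and the resulting sizes will vary with the data and library versions.

```python
import importlib.util
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "value": rng.standard_normal(50_000),
    "label": rng.choice(["lorem", "ipsum", "dolor"], size=50_000),
})

sizes = {}
if importlib.util.find_spec("pyarrow") is not None:
    with tempfile.TemporaryDirectory() as tmp:
        # Parquet: compare codec choices (snappy is the pandas default).
        for codec in (None, "snappy", "gzip"):
            path = os.path.join(tmp, f"demo_{codec}.parquet")
            frame.to_parquet(path, engine="pyarrow", compression=codec)
            sizes[f"parquet/{codec}"] = os.path.getsize(path)
        # Feather: extra keyword arguments are forwarded to pyarrow's writer.
        path = os.path.join(tmp, "demo.feather")
        frame.to_feather(path, compression="uncompressed")
        sizes["feather/uncompressed"] = os.path.getsize(path)
else:
    print("pyarrow is not installed; skipping the comparison")

for name, num_bytes in sizes.items():
    print(f"{name:22s} {num_bytes:>10,d} bytes")
```

Random floats compress poorly, so most of any difference here is expected to come from the repetitive `label` column.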

## Bibliography
This notebook is inspired by the following resource.

[File Format - Python tools for big data](https://pnavaro.github.io/big-data/14-FileFormats.html)