<a href="https://colab.research.google.com/github/satyakisen/pandas-ff-comparison/blob/main/Pandas_File_Format_Comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas File Format Comparison
## Overview
In this notebook we compare the following file formats supported by pandas:
1. csv - a common plain-text format with comma-separated values.
2. hdf5 - an open-source file format that supports large, complex, heterogeneous data.
3. parquet - an open-source, column-oriented data file format designed for efficient data storage and retrieval.
4. feather - a portable file format for storing Arrow tables or data frames (from languages like Python or R) that uses the Arrow IPC format internally.

The comparison parameters are:
1. Time to write.
2. Time to read.
3. File size on disk.
4. Memory Usage.


## Testing with Numerical Data
Let us begin by creating a dummy dataset containing only random floats.

In [1]:
import pandas as pd
import numpy as np

def make_data(row_n, col_n):
  arr = np.random.randn(row_n, col_n)
  df = pd.DataFrame(arr, columns=['col_{0}'.format(i) for i in range(col_n)])
  return df

df = make_data(100000, 10)

Let us check the dummy dataset we made.

In [2]:
df.head(5)

| | col_0 | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.641336 | -1.621931 | 0.796686 | -1.679665 | 0.435253 | -0.197769 | -2.721646 | 0.343617 | 0.702246 | 0.433517 |
| 1 | 0.15334 | -0.069513 | -0.748609 | 1.237218 | 0.644955 | 0.323406 | 0.669177 | -0.611342 | -0.503564 | -0.778894 |
| 2 | 0.649113 | 1.348819 | 1.938892 | 0.614601 | 0.485035 | 0.352973 | -0.306332 | -0.079343 | -1.674584 | -1.566218 |
| 3 | 0.404498 | -1.440968 | -0.385486 | 0.441557 | -0.036578 | -0.293221 | 1.587384 | 0.709433 | -0.395009 | -1.840616 |
| 4 | 0.767174 | 1.118811 | 0.021381 | -0.216749 | 0.331357 | 0.790685 | -1.101796 | -0.090205 | 0.595651 | -0.359788 |


### Time to write

Let us now check how long it takes to write the dataframe we created. We time each write with the IPython `%time` magic, saving the dataframe once per file format.

In [3]:
%time df.to_csv('test.csv', index=False)

CPU times: user 1.59 s, sys: 69.5 ms, total: 1.66 s
Wall time: 1.74 s


In [4]:
%time df.to_hdf('test.h5', key='root')

CPU times: user 81.1 ms, sys: 19.9 ms, total: 101 ms
Wall time: 267 ms


In [5]:
%time df.to_parquet('test.parquet')

CPU times: user 208 ms, sys: 38.6 ms, total: 246 ms
Wall time: 510 ms


In [6]:
%time df.to_feather('test.feather')

CPU times: user 30.9 ms, sys: 21.1 ms, total: 51.9 ms
Wall time: 77.2 ms


From the timings above, **feather** and **hdf** are the fastest formats to write, while **csv** is by far the slowest. Next, let us look at the size of the files on disk.
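
The `%time` magic is only available inside IPython or Jupyter. In a plain Python script, a small timing decorator combined with a save function that dispatches on the file extension is one way to get comparable measurements. The following is a minimal sketch, not part of the original notebook: `timed`, `save`, `WRITERS`, and `df_demo` are names made up for illustration, and only the CSV writer is exercised because the other formats need optional dependencies.

```python
import functools
import tempfile
import time
from pathlib import Path

import numpy as np
import pandas as pd


def timed(func):
    """Wrap func so each call returns (result, elapsed_seconds)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        return result, time.perf_counter() - start
    return wrapper


# Hypothetical dispatch table: file extension -> pandas writer call.
WRITERS = {
    ".csv": lambda df, path: df.to_csv(path, index=False),
    ".h5": lambda df, path: df.to_hdf(path, key="root"),  # needs PyTables
    ".parquet": lambda df, path: df.to_parquet(path),     # needs pyarrow or fastparquet
    ".feather": lambda df, path: df.to_feather(path),     # needs pyarrow
}


@timed
def save(df, path):
    """Write df to path using the writer that matches the file extension."""
    WRITERS[Path(path).suffix](df, path)


# Small stand-in frame; the notebook uses 100000 rows.
df_demo = pd.DataFrame(np.random.randn(1000, 10),
                       columns=[f"col_{i}" for i in range(10)])

with tempfile.TemporaryDirectory() as tmp:
    # Only CSV is exercised here because it needs no optional dependency.
    _, seconds = save(df_demo, Path(tmp) / "test.csv")
    print(f"csv: {seconds * 1000:.1f} ms")
```

A single run is noisy, so in practice it may be worth calling `save` several times and keeping the smallest elapsed time.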

### File size on disk
Let us now check the file size on disk.

In [7]:
%%bash
du -sh test.*

19M	test.csv
7.7M	test.feather
8.5M	test.h5
9.7M	test.parquet


From the result above, **feather** and **hdf** produce the smallest files for this dataset, while **csv** is more than twice as large. We will check the performance again with a larger dataset afterward.
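
Where `du` is not available (for example on Windows), the same comparison can be approximated in Python. This is a rough sketch using only the standard library; `human_size` and `file_sizes` are hypothetical helper names. Note that `st_size` is the apparent file size, whereas `du` reports allocated disk blocks, so the numbers can differ slightly.

```python
import tempfile
from pathlib import Path


def human_size(num_bytes):
    """Format a byte count with binary units, similar in spirit to `du -h`."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G"):
        if size < 1024 or unit == "G":
            return f"{size:.1f}{unit}"
        size /= 1024


def file_sizes(directory, pattern):
    """Map file name -> apparent size in bytes for files matching pattern."""
    return {p.name: p.stat().st_size
            for p in sorted(Path(directory).glob(pattern))}


# Demo with two dummy files of known size.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "test.csv").write_bytes(b"x" * 2048)
    (Path(tmp) / "test.bin").write_bytes(b"x" * 512)
    sizes = file_sizes(tmp, "test.*")
    for name, size in sizes.items():
        print(f"{human_size(size)}\t{name}")
```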

### Time to read
Let us check how long a read operation takes.

In [8]:
%%time 
df_csv=pd.read_csv('test.csv')

CPU times: user 217 ms, sys: 36.9 ms, total: 254 ms
Wall time: 254 ms


In [9]:
%%time
df_hdf=pd.read_hdf('test.h5')

CPU times: user 17.1 ms, sys: 9.94 ms, total: 27 ms
Wall time: 27.4 ms


In [10]:
%%time
df_parquet=pd.read_parquet('test.parquet')

CPU times: user 31.3 ms, sys: 46.3 ms, total: 77.6 ms
Wall time: 99.4 ms


In [11]:
%%time
df_feather=pd.read_feather('test.feather')

CPU times: user 16.5 ms, sys: 7.1 ms, total: 23.6 ms
Wall time: 18 ms


### Memory Usage

In [12]:
df_csv.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 10 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   col_0   100000 non-null  float64
 1   col_1   100000 non-null  float64
 2   col_2   100000 non-null  float64
 3   col_3   100000 non-null  float64
 4   col_4   100000 non-null  float64
 5   col_5   100000 non-null  float64
 6   col_6   100000 non-null  float64
 7   col_7   100000 non-null  float64
 8   col_8   100000 non-null  float64
 9   col_9   100000 non-null  float64
dtypes: float64(10)
memory usage: 7.6 MB


In [13]:
df_hdf.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 10 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   col_0   100000 non-null  float64
 1   col_1   100000 non-null  float64
 2   col_2   100000 non-null  float64
 3   col_3   100000 non-null  float64
 4   col_4   100000 non-null  float64
 5   col_5   100000 non-null  float64
 6   col_6   100000 non-null  float64
 7   col_7   100000 non-null  float64
 8   col_8   100000 non-null  float64
 9   col_9   100000 non-null  float64
dtypes: float64(10)
memory usage: 8.4 MB


In [14]:
df_parquet.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 10 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   col_0   100000 non-null  float64
 1   col_1   100000 non-null  float64
 2   col_2   100000 non-null  float64
 3   col_3   100000 non-null  float64
 4   col_4   100000 non-null  float64
 5   col_5   100000 non-null  float64
 6   col_6   100000 non-null  float64
 7   col_7   100000 non-null  float64
 8   col_8   100000 non-null  float64
 9   col_9   100000 non-null  float64
dtypes: float64(10)
memory usage: 7.6 MB


In [15]:
df_feather.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 10 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   col_0   100000 non-null  float64
 1   col_1   100000 non-null  float64
 2   col_2   100000 non-null  float64
 3   col_3   100000 non-null  float64
 4   col_4   100000 non-null  float64
 5   col_5   100000 non-null  float64
 6   col_6   100000 non-null  float64
 7   col_7   100000 non-null  float64
 8   col_8   100000 non-null  float64
 9   col_9   100000 non-null  float64
dtypes: float64(10)
memory usage: 7.6 MB
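
The frame read from HDF5 reports 8.4 MB while the others report 7.6 MB. The outputs above suggest why: the HDF5 frame carries a materialized `Int64Index`, whereas the others use a `RangeIndex` that stores only start, stop, and step. The arithmetic can be checked with a small sketch (the frame below is a stand-in filled with zeros, not the notebook's data):

```python
import numpy as np
import pandas as pd

n_rows, n_cols = 100_000, 10
df_range = pd.DataFrame(np.zeros((n_rows, n_cols)))  # default RangeIndex

data_bytes = n_rows * n_cols * 8   # float64 takes 8 bytes per value
index_bytes = n_rows * 8           # one int64 label per row

# Same data, but with an explicit integer index like the HDF5 round trip.
df_int = df_range.copy()
df_int.index = np.arange(n_rows, dtype=np.int64)

range_total = df_range.memory_usage(index=True).sum()
int_total = df_int.memory_usage(index=True).sum()

print(f"RangeIndex frame:    {range_total / 2**20:.1f} MB")
print(f"integer-index frame: {int_total / 2**20:.1f} MB")
print(f"difference:          {int_total - range_total} bytes")
```

The difference should be close to `index_bytes` (800,000 bytes), which matches the gap between the two figures reported by `info()`.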


## Testing with Categorical Data
Now let us create data with both categorical and numerical values and check the performance of the different file formats.

In [16]:
!pip install lorem

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting lorem
  Downloading lorem-0.1.1-py3-none-any.whl (5.0 kB)
Installing collected packages: lorem
Successfully installed lorem-0.1.1


### Prepare Data

In [17]:
from lorem import sentence

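# Vocabulary for the string column: the words of one random lorem-ipsum sentence.
words = np.array(sentence().strip().lower().replace(".", " ").split())

np.random.seed(0)
n = 5000000
# np.c_ joins the blocks column-wise into one array with a common (string) dtype;
# the astype call below restores the intended dtypes.
df = pd.DataFrame(
    np.c_[
        np.random.randn(n, 5),
        np.random.randint(0, 10, (n, 2)),
        # The upper bound is exclusive, so columns H and I contain only zeros.
        np.random.randint(0, 1, (n, 2)),
        np.array([np.random.choice(words) for i in range(n)]),
    ],
    columns=list('ABCDEFGHIJ'),
)

df = df.astype(dtype={'A': float, 'B': float, 'C': float, 'D': float, 'E': float, 'F': int, 'G': int, 'H': int, 'I': int, 'J': str}, copy=True)
# Set every tenth value of column A to missing (np.NaN was removed in NumPy 2.0).
df.loc[::10, 'A'] = np.nan
len(df)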

5000000
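
Stacking all columns with `np.c_` forces every value through a string array before `astype` converts it back, which is costly at five million rows. One possible alternative, sketched here under assumptions, builds each column with its final dtype from the start. This is not the notebook's method: `df_alt` and the stand-in `words` list are made up for illustration, `n` is kept small, and columns H and I deliberately use an upper bound of 2 so that they contain both 0 and 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000  # small for illustration; the notebook uses 5000000
words = np.array(["lorem", "ipsum", "dolor", "sit", "amet"])  # stand-in vocabulary

df_alt = pd.DataFrame({
    # Five float columns drawn from a standard normal distribution.
    **{col: rng.standard_normal(n) for col in "ABCDE"},
    # Two integer columns with values 0..9.
    **{col: rng.integers(0, 10, n) for col in "FG"},
    # Two integer columns with values 0 or 1 (upper bound is exclusive).
    **{col: rng.integers(0, 2, n) for col in "HI"},
    # One string column sampled from the vocabulary in a single vectorized call.
    "J": rng.choice(words, n),
})

# Set every tenth value of column A to missing, as in the notebook.
df_alt.loc[::10, "A"] = np.nan

print(df_alt.dtypes)
```

Because it uses a different random generator, this sketch does not reproduce the exact values or timings shown in this notebook.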

### Time to write

In [18]:
%time df.to_csv('test_big.csv', index=False)

CPU times: user 44.9 s, sys: 941 ms, total: 45.9 s
Wall time: 46.1 s


In [19]:
%time df.to_hdf('test_big.h5', key='root', index=False)

CPU times: user 2.25 s, sys: 1.25 s, total: 3.5 s
Wall time: 3.87 s


In [20]:
%time df.to_parquet('test_big.parquet', index=False)

CPU times: user 1.43 s, sys: 364 ms, total: 1.79 s
Wall time: 1.77 s


In [21]:
%time df.to_feather('test_big.feather')

CPU times: user 965 ms, sys: 345 ms, total: 1.31 s
Wall time: 1.15 s


### File size on disk

In [22]:
%%bash
du -sh test_big.*

530M	test_big.csv
245M	test_big.feather
425M	test_big.h5
195M	test_big.parquet


### Time to read

In [23]:
%%time 
df_big_csv=pd.read_csv('test_big.csv')

CPU times: user 6.07 s, sys: 10.2 s, total: 16.3 s
Wall time: 16.3 s


In [24]:
%%time
df_big_hdf=pd.read_hdf('test_big.h5', key='root')

CPU times: user 874 ms, sys: 620 ms, total: 1.49 s
Wall time: 1.49 s


In [25]:
%%time
df_big_parquet=pd.read_parquet('test_big.parquet')

CPU times: user 542 ms, sys: 597 ms, total: 1.14 s
Wall time: 841 ms


In [26]:
%%time
df_big_feather=pd.read_feather('test_big.feather')

CPU times: user 432 ms, sys: 534 ms, total: 966 ms
Wall time: 648 ms


In [27]:
df_big_parquet.dtypes

A    float64
B    float64
C    float64
D    float64
E    float64
F      int64
G      int64
H      int64
I      int64
J     object
dtype: object

### Memory Usage

In [28]:
df_big_csv.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000000 entries, 0 to 4999999
Data columns (total 10 columns):
 #   Column  Dtype  
---  ------  -----  
 0   A       float64
 1   B       float64
 2   C       float64
 3   D       float64
 4   E       float64
 5   F       int64  
 6   G       int64  
 7   H       int64  
 8   I       int64  
 9   J       object 
dtypes: float64(5), int64(4), object(1)
memory usage: 642.5 MB


In [29]:
df_big_hdf.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000000 entries, 0 to 4999999
Data columns (total 10 columns):
 #   Column  Dtype  
---  ------  -----  
 0   A       float64
 1   B       float64
 2   C       float64
 3   D       float64
 4   E       float64
 5   F       int64  
 6   G       int64  
 7   H       int64  
 8   I       int64  
 9   J       object 
dtypes: float64(5), int64(4), object(1)
memory usage: 680.7 MB


In [30]:
df_big_parquet.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000000 entries, 0 to 4999999
Data columns (total 10 columns):
 #   Column  Dtype  
---  ------  -----  
 0   A       float64
 1   B       float64
 2   C       float64
 3   D       float64
 4   E       float64
 5   F       int64  
 6   G       int64  
 7   H       int64  
 8   I       int64  
 9   J       object 
dtypes: float64(5), int64(4), object(1)
memory usage: 642.5 MB


In [31]:
df_big_feather.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000000 entries, 0 to 4999999
Data columns (total 10 columns):
 #   Column  Dtype  
---  ------  -----  
 0   A       float64
 1   B       float64
 2   C       float64
 3   D       float64
 4   E       float64
 5   F       int64  
 6   G       int64  
 7   H       int64  
 8   I       int64  
 9   J       object 
dtypes: float64(5), int64(4), object(1)
memory usage: 642.5 MB


The cells above show that when the string column is stored as raw text (`object` dtype), the Feather (245M) and HDF5 (425M) files are considerably larger than the Parquet file (195M). Now let us convert the column with `pd.Categorical` and check what changes.

In [32]:
df['J'] = pd.Categorical(df['J'])

In [33]:
df.dtypes

A     float64
B     float64
C     float64
D     float64
E     float64
F       int64
G       int64
H       int64
I       int64
J    category
dtype: object
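
To see why the conversion matters, it helps to compare the in-memory footprint of the same values stored as Python strings and as a categorical. A categorical keeps each distinct value once and stores a small integer code per row. The sketch below uses a made-up five-word vocabulary and 100,000 rows rather than the notebook's data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
words = np.array(["lorem", "ipsum", "dolor", "sit", "amet"])  # stand-in vocabulary

# Same values, stored once as Python strings and once as a categorical.
s_obj = pd.Series(rng.choice(words, 100_000), dtype=object)
s_cat = s_obj.astype("category")

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)

print(f"object:   {obj_bytes / 2**20:.2f} MB")
print(f"category: {cat_bytes / 2**20:.2f} MB")
print("codes dtype:", s_cat.cat.codes.dtype)
```

With only a handful of distinct values, the codes fit in a single byte per row, which is where most of the saving comes from.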

### Time to write

In [34]:
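# Note: index=False is omitted here, so the index is written as an extra column
# (it shows up as 'Unnamed: 0' when the file is read back).
%time df.to_csv('test_big_pd_cat.csv')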

CPU times: user 47.3 s, sys: 910 ms, total: 48.3 s
Wall time: 48.4 s


In [35]:
%time df.to_hdf('test_big_pd_cat.h5', key='root', format='table')

CPU times: user 3.49 s, sys: 610 ms, total: 4.1 s
Wall time: 4.38 s


In [36]:
%time df.to_parquet('test_big_pd_cat.parquet')

CPU times: user 1e+03 ms, sys: 238 ms, total: 1.24 s
Wall time: 1.25 s


In [37]:
%time df.to_feather('test_big_pd_cat.feather')

CPU times: user 699 ms, sys: 250 ms, total: 950 ms
Wall time: 831 ms


### File size on disk

In [38]:
%%bash
du -sh test_big_pd_cat.*

567M	test_big_pd_cat.csv
219M	test_big_pd_cat.feather
389M	test_big_pd_cat.h5
195M	test_big_pd_cat.parquet


### Time to read

In [39]:
%%time
df_big_pd_cat_csv = pd.read_csv('test_big_pd_cat.csv')

CPU times: user 7.51 s, sys: 1.2 s, total: 8.71 s
Wall time: 8.78 s


In [40]:
%%time
df_big_pd_cat_hdf = pd.read_hdf('test_big_pd_cat.h5')

CPU times: user 824 ms, sys: 309 ms, total: 1.13 s
Wall time: 1.14 s


In [41]:
%%time
df_big_pd_cat_parquet = pd.read_parquet('test_big_pd_cat.parquet')

CPU times: user 405 ms, sys: 581 ms, total: 987 ms
Wall time: 704 ms


In [42]:
%%time
df_big_pd_cat_feather = pd.read_feather('test_big_pd_cat.feather')

CPU times: user 323 ms, sys: 431 ms, total: 753 ms
Wall time: 515 ms


### Memory Usage

In [43]:
df_big_pd_cat_csv.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000000 entries, 0 to 4999999
Data columns (total 11 columns):
 #   Column      Dtype  
---  ------      -----  
 0   Unnamed: 0  int64  
 1   A           float64
 2   B           float64
 3   C           float64
 4   D           float64
 5   E           float64
 6   F           int64  
 7   G           int64  
 8   H           int64  
 9   I           int64  
 10  J           object 
dtypes: float64(5), int64(5), object(1)
memory usage: 680.7 MB
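
Two things stand out in the CSV result: the extra `Unnamed: 0` column, which is the index written by `to_csv` without `index=False`, and the `object` dtype of `J`, because CSV carries no type information. Both can be handled at read time. The following is a small sketch on a three-row stand-in frame (`df_small` is a made-up name), using an in-memory buffer instead of a file:

```python
import io

import pandas as pd

df_small = pd.DataFrame({
    "A": [0.1, 0.2, 0.3],
    "J": pd.Categorical(["x", "y", "x"]),
})

buf = io.StringIO()
df_small.to_csv(buf)  # the index is written as a first, unnamed column

# Naive read: the index becomes a data column and J loses its category dtype.
buf.seek(0)
naive = pd.read_csv(buf)

# Read the first column back as the index and request category for J.
buf.seek(0)
restored = pd.read_csv(buf, index_col=0, dtype={"J": "category"})

print(list(naive.columns))
print(restored.dtypes)
```

The binary formats in this comparison store the dtype alongside the data, so no such hints were needed when reading them back above.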


In [45]:
df_big_pd_cat_hdf.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000000 entries, 0 to 4999999
Data columns (total 10 columns):
 #   Column  Dtype   
---  ------  -----   
 0   A       float64 
 1   B       float64 
 2   C       float64 
 3   D       float64 
 4   E       float64 
 5   F       int64   
 6   G       int64   
 7   H       int64   
 8   I       int64   
 9   J       category
dtypes: category(1), float64(5), int64(4)
memory usage: 386.2 MB


In [46]:
df_big_pd_cat_parquet.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000000 entries, 0 to 4999999
Data columns (total 10 columns):
 #   Column  Dtype   
---  ------  -----   
 0   A       float64 
 1   B       float64 
 2   C       float64 
 3   D       float64 
 4   E       float64 
 5   F       int64   
 6   G       int64   
 7   H       int64   
 8   I       int64   
 9   J       category
dtypes: category(1), float64(5), int64(4)
memory usage: 348.1 MB


In [47]:
df_big_pd_cat_feather.info(memory_usage='deep', verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000000 entries, 0 to 4999999
Data columns (total 10 columns):
 #   Column  Dtype   
---  ------  -----   
 0   A       float64 
 1   B       float64 
 2   C       float64 
 3   D       float64 
 4   E       float64 
 5   F       int64   
 6   G       int64   
 7   H       int64   
 8   I       int64   
 9   J       category
dtypes: category(1), float64(5), int64(4)
memory usage: 348.1 MB


## Feather or Parquet
1. Parquet is designed for long-term storage, whereas Arrow (and therefore Feather) is intended more for short-term or ephemeral storage, partly because its files tend to be larger.

2. Parquet is usually more expensive to write than Feather because it applies more layers of encoding and compression.

3. Feather is unmodified raw columnar Arrow memory. The author of the source answer expected simple compression to be added to Feather later; Feather V2 has since gained optional LZ4 and ZSTD compression.

4. Due to dictionary encoding, RLE encoding, and data page compression, Parquet files will often be much smaller than Feather files.

5. Parquet is a standard storage format for analytics that’s supported by Spark. So if you are doing analytics, Parquet is a good option as a reference storage format that multiple systems can query.

[Source: Stack Overflow](https://stackoverflow.com/questions/48083405/what-are-the-differences-between-feather-and-parquet)

## Bibliography
This notebook is inspired by the following page.

[File Format - Python tools for big data](https://pnavaro.github.io/big-data/14-FileFormats.html)