<a href="https://colab.research.google.com/github/satyakisen/pandas-ff-comparison/blob/main/Pandas_File_Format_Comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas File Format Comparison
## Overview
In this notebook let us compare the following file formats supported by pandas.
1. csv - a common plain-text format with comma-separated values.
2. hdf5 - an open source file format that supports large, complex, heterogeneous data.
3. parquet - an open source, column-oriented data file format designed for efficient data storage and retrieval.
4. feather - a portable file format for storing Arrow tables or data frames (from languages like Python or R) that uses the Arrow IPC format internally.

The comparison parameters to consider are:
1. Time to write.
2. Time to read.
3. File size on disk.

## Let us begin
Let us begin by creating a dummy dataset containing only random floats.

In [1]:
import pandas as pd
import numpy as np

def make_data(row_n, col_n):
  arr = np.random.randn(row_n, col_n)
  df = pd.DataFrame(arr, columns=['col_{0}'.format(i) for i in range(col_n)])
  return df

df = make_data(100000, 10)

Let us check the dummy dataset we made.

In [2]:
df.head(5)

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9
0,1.110389,0.142438,0.118264,0.742217,-0.924708,0.936969,-0.58467,0.370694,-0.238375,0.075002
1,2.802703,0.742789,-1.059473,-0.587136,0.291661,0.460604,0.041531,0.971437,0.73285,0.951265
2,-1.703729,-0.783873,-0.381655,-0.581264,0.137553,-0.868481,1.553746,-1.79422,-0.703021,1.270317
3,0.509564,-0.422051,-0.591012,-0.948524,-0.431694,-0.12261,0.007332,-1.65751,-0.842759,0.516597
4,0.294259,0.910114,0.306357,0.708615,0.972699,-1.420854,-0.373757,1.166513,0.203477,-0.910188


## Time to write

Let us now check how long it takes to write the dataframe we created above. We time each write with the IPython `%time` magic, saving the dataframe once in each file format.

In [3]:
%time df.to_csv('test.csv')

CPU times: user 1.59 s, sys: 65.8 ms, total: 1.66 s
Wall time: 1.66 s


In [5]:
%time df.to_hdf('test.h5', key='root')

CPU times: user 141 ms, sys: 21.2 ms, total: 162 ms
Wall time: 332 ms


In [6]:
%time df.to_parquet('test.parquet')

CPU times: user 129 ms, sys: 22.3 ms, total: 151 ms
Wall time: 165 ms


In [7]:
%time df.to_feather('test.feather')

CPU times: user 40.9 ms, sys: 27.6 ms, total: 68.5 ms
Wall time: 54.7 ms


From the timings above we can see that **feather** (54.7 ms) and **parquet** (165 ms) write fastest, ahead of **hdf5** (332 ms) and **csv** (1.66 s). Now let us look at the file size on disk.
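A single `%time` run can be noisy, so it may help to repeat each write and keep the fastest run. The sketch below is one way to do that outside the notebook magics; `best_time` and `time_writes` are hypothetical helper names, the frame is smaller than the one above so that it runs quickly, and formats whose optional engine (pyarrow or PyTables) is not installed are skipped.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def best_time(fn, repeats=3):
    """Run fn() `repeats` times and return the fastest wall time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def time_writes(frame, directory, repeats=3):
    """Time each writer; formats whose optional engine is missing map to None."""
    writers = {
        "csv": lambda: frame.to_csv(os.path.join(directory, "test.csv")),
        "hdf5": lambda: frame.to_hdf(os.path.join(directory, "test.h5"), key="root"),
        "parquet": lambda: frame.to_parquet(os.path.join(directory, "test.parquet")),
        "feather": lambda: frame.to_feather(os.path.join(directory, "test.feather")),
    }
    results = {}
    for name, write in writers.items():
        try:
            results[name] = best_time(write, repeats)
        except ImportError:
            # pyarrow (parquet, feather) or PyTables (hdf5) is not installed.
            results[name] = None
    return results


# Smaller frame than the notebook's 100000 x 10 so the sketch runs quickly.
sample = pd.DataFrame(
    np.random.randn(1000, 10), columns=[f"col_{i}" for i in range(10)]
)
with tempfile.TemporaryDirectory() as tmp:
    write_times = time_writes(sample, tmp)

for name, seconds in write_times.items():
    print(name, "skipped" if seconds is None else f"{seconds * 1000:.1f} ms")
```

Taking the minimum rather than the mean is a design choice: it reduces the influence of one-off delays such as disk cache misses, at the cost of hiding run-to-run variability.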

## File size on disk
Let us now check the file size on disk.

In [9]:
%%bash
du -sh test.*

20M	test.csv
7.7M	test.feather
8.5M	test.h5
9.7M	test.parquet


From the above result we can see that **feather** (7.7M) and **hdf5** (8.5M) produce the smallest files, followed by **parquet** (9.7M), while **csv** is by far the largest at 20M. We will check this again with a few gigabytes of data later.
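Where a shell is not available, the same comparison can be sketched in Python. The helper name `file_sizes_mb` is hypothetical, and the numbers may differ slightly from `du -sh`, which reports disk usage rather than byte length and uses binary multiples for its `M` suffix.

```python
import glob
import os


def file_sizes_mb(pattern):
    """Map each path matching `pattern` to its size in MB (10**6 bytes)."""
    return {
        path: os.path.getsize(path) / 1e6
        for path in sorted(glob.glob(pattern))
    }


# Lists the files written earlier in the notebook, if they exist.
for path, size in file_sizes_mb("test.*").items():
    print(f"{size:8.2f} MB  {path}")
```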

## Time to read
Let us check how much time a read operation takes.

In [11]:
%%time 
df=pd.read_csv('test.csv')

CPU times: user 226 ms, sys: 26.2 ms, total: 252 ms
Wall time: 255 ms


In [18]:
%%time
df=pd.read_hdf('test.h5')

CPU times: user 19.5 ms, sys: 3.07 ms, total: 22.6 ms
Wall time: 24.4 ms


In [17]:
%%time
df=pd.read_parquet('test.parquet')

CPU times: user 16.6 ms, sys: 27.4 ms, total: 44 ms
Wall time: 30 ms


In [16]:
%%time
df=pd.read_feather('test.feather')

CPU times: user 10.1 ms, sys: 20.2 ms, total: 30.3 ms
Wall time: 26.7 ms


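In the above results we see that **hdf5** (24.4 ms) and **feather** (26.7 ms) read fastest, with **parquet** (30 ms) close behind, while **csv** is about an order of magnitude slower at 255 ms.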
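One caveat about the CSV read: `to_csv` writes the index by default, so a plain `read_csv` returns it as an extra data column rather than as the index. The sketch below illustrates this on a small frame, using an in-memory buffer instead of a file on disk; the variable names are illustrative only.

```python
import io

import numpy as np
import pandas as pd

original = pd.DataFrame(
    np.random.randn(5, 3), columns=[f"col_{i}" for i in range(3)]
)

# Write to an in-memory buffer instead of disk to keep the sketch self-contained.
buffer = io.StringIO()
original.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as a data column
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # the index is restored

print(list(naive.columns))     # first entry is 'Unnamed: 0'
print(list(restored.columns))  # same columns as the original frame
```

Passing `index_col=0` when reading, or `index=False` when writing, keeps the CSV round trip comparable to the other formats.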

## Testing with big data
Now let us create a few gigabytes of data containing both categorical and numerical values and check the performance of the different file formats.