# CSV vs Parquet format


### CSV format
* CSVs are everywhere — from company reports to machine learning datasets. It’s a data format that’s simple and intuitive to work with.
* CSVs are row-oriented, which means they’re <b>slow to query and difficult to store efficiently</b>. For identical datasets, a CSV file is typically much larger than its Parquet equivalent.
* Anyone can open and modify a CSV file, which raises security concerns.


### Parquet format
* Parquet is an alternative format for storing data. It’s open source and licensed under the Apache License 2.0.
* Parquet is a column-oriented storage format.
* Parquet files take much less disk space than CSVs and are faster to scan. As a result, the identical dataset is 16 times cheaper to store in Parquet format on Amazon S3!
* Apache Parquet is designed for efficiency. Its columnar layout lets a reader quickly skip data that isn’t relevant to a query, so only the needed columns are read. This makes both queries and aggregations faster, resulting in hardware savings (read: it’s cheaper).
* Apache Parquet is a self-describing data format that embeds the schema or structure within the data itself. 
* Apache Parquet is built from the ground up using the record shredding and assembly algorithm described in Google’s Dremel paper.
* Apache Parquet is built to support very efficient compression and encoding schemes.



* CSVs use row storage, while Parquet files organize the data in columns.
* Column storage files are more lightweight because each column holds values of a single data type, so a compression scheme suited to that type can be applied to each column. That’s not the case with row storage, as one row usually contains multiple data types.
* In a nutshell, Parquet is a more efficient data format for bigger files. You will save both time and money by using Parquet over CSVs.

Let's check the NYSE stock prices dataset. The CSV file is around 50 MB in size, so let's see how much disk space Parquet will save.

In [1]:
import pandas as pd
df = pd.read_csv('Datasets/prices.csv')
df.head()

| | date | symbol | open | close | low | high | volume |
|---|---|---|---|---|---|---|---|
| 0 | 2016-01-05 00:00:00 | WLTW | 123.43 | 125.839996 | 122.309998 | 126.25 | 2163600.0 |
| 1 | 2016-01-06 00:00:00 | WLTW | 125.239998 | 119.980003 | 119.940002 | 125.540001 | 2386400.0 |
| 2 | 2016-01-07 00:00:00 | WLTW | 116.379997 | 114.949997 | 114.93 | 119.739998 | 2489500.0 |
| 3 | 2016-01-08 00:00:00 | WLTW | 115.480003 | 116.620003 | 113.5 | 117.440002 | 2006300.0 |
| 4 | 2016-01-11 00:00:00 | WLTW | 117.010002 | 114.970001 | 114.089996 | 117.330002 | 1408600.0 |


In [3]:
df.to_parquet('Datasets/prices.parquet')

In [4]:
# Load the Parquet file into a pandas DataFrame
df_parquet = pd.read_parquet('Datasets/prices.parquet')
df_parquet.head()

| | date | symbol | open | close | low | high | volume |
|---|---|---|---|---|---|---|---|
| 0 | 2016-01-05 00:00:00 | WLTW | 123.43 | 125.839996 | 122.309998 | 126.25 | 2163600.0 |
| 1 | 2016-01-06 00:00:00 | WLTW | 125.239998 | 119.980003 | 119.940002 | 125.540001 | 2386400.0 |
| 2 | 2016-01-07 00:00:00 | WLTW | 116.379997 | 114.949997 | 114.93 | 119.739998 | 2489500.0 |
| 3 | 2016-01-08 00:00:00 | WLTW | 115.480003 | 116.620003 | 113.5 | 117.440002 | 2006300.0 |
| 4 | 2016-01-11 00:00:00 | WLTW | 117.010002 | 114.970001 | 114.089996 | 117.330002 | 1408600.0 |


In [5]:
df.equals(df_parquet)

True

The 50 MB CSV file shrinks to a 12 MB Parquet file on disk. The savings scale with larger datasets, which is especially important if you’re storing data in the cloud and paying for the overall size.

# References

https://towardsdatascience.com/csv-files-for-storage-no-thanks-theres-a-better-option-72c78a414d1d

https://blog.openbridge.com/how-to-be-a-hero-with-powerful-parquet-google-and-amazon-f2ae0f35ee04
    
Dataset link: https://www.kaggle.com/dgawlik/nyse?select=prices.csv