# Working with Parquet files

Parquet is a popular columnar file format for storing and processing structured data efficiently. It is designed for large datasets, especially in big data and data warehousing scenarios, and is commonly used with distributed data processing frameworks such as Apache Hadoop, Apache Spark, and Apache Hive.

**Columnar Storage:** Unlike row-based formats such as CSV or JSON, where all the data for a row is stored together, Parquet stores data column by column. All the values of a given column sit together, so a query or aggregation that touches only a few columns reads just those columns, and runs of similar values compress well.

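**Predicate Pushdown:** Parquet stores statistics such as the minimum and maximum value for each column chunk, so a reader can apply a filter at the storage layer and skip whole row groups that cannot contain matching rows. This is separate from column pruning (also called projection pushdown), where the reader skips columns that the query does not reference. Parquet supports both, and both improve query performance.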

Predicate pushdown is an optimization technique used in database and query processing systems to improve query performance by reducing the amount of data that needs to be read and processed. This technique is particularly relevant in columnar storage formats and distributed processing environments.

In a query, a predicate is a condition that filters the data being retrieved. For example, in a SQL query like `SELECT * FROM employees WHERE department = 'Sales'`, the predicate is `department = 'Sales'`. Predicate pushdown means pushing this filtering condition down to the data source or storage layer, so that the data source can eliminate rows that do not satisfy the predicate before returning data to the query processor. This minimizes the amount of unnecessary data transferred between the storage layer and the query processor, resulting in faster query execution.


**Parallel Processing:** The columnar storage format aligns well with parallel processing frameworks, enabling better utilization of parallelism and distributed computing resources.

**Performance:** Due to its columnar structure and optimized encoding techniques, Parquet files are efficient for analytical workloads that involve reading and processing large volumes of data.

![Parquet file](Parquet.png)


In [1]:
# pandas needs a Parquet engine such as pyarrow to read Parquet files
!pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-12.0.1-cp39-cp39-win_amd64.whl (21.5 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-12.0.1


In [2]:
import pandas as pd

In [27]:
# Raw string so the backslash in the Windows path is not treated as an escape
file = pd.read_parquet(r'parquet\sample1.parquet', engine='auto')

In [28]:
file

Unnamed: 0,registration_dttm,id,first_name,last_name,email,gender,ip_address,cc,country,birthdate,salary,title,comments
0,2016-02-03 07:55:29,1,Amanda,Jordan,ajordan0@com.com,Female,1.197.201.2,6759521864920116,Indonesia,3/8/1971,49756.53,Internal Auditor,1E+02
1,2016-02-03 17:04:03,2,Albert,Freeman,afreeman1@is.gd,Male,218.111.175.34,,Canada,1/16/1968,150280.17,Accountant IV,
2,2016-02-03 01:09:31,3,Evelyn,Morgan,emorgan2@altervista.org,Female,7.161.136.94,6767119071901597,Russia,2/1/1960,144972.51,Structural Engineer,
3,2016-02-03 00:36:21,4,Denise,Riley,driley3@gmpg.org,Female,140.35.109.83,3576031598965625,China,4/8/1997,90263.05,Senior Cost Accountant,
4,2016-02-03 05:05:31,5,Carlos,Burns,cburns4@miitbeian.gov.cn,,169.113.235.40,5602256255204850,South Africa,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,2016-02-03 10:30:59,996,Dennis,Harris,dharrisrn@eepurl.com,Male,178.180.111.236,374288806662929,Greece,7/8/1965,263399.54,Editor,
996,2016-02-03 17:16:53,997,Gloria,Hamilton,ghamiltonro@rambler.ru,Female,71.50.39.137,,China,4/22/1975,83183.54,VP Product Management,
997,2016-02-03 05:02:20,998,Nancy,Morris,nmorrisrp@ask.com,,6.188.121.221,3553564071014997,Sweden,5/1/1979,,Junior Executive,
998,2016-02-03 02:41:32,999,Annie,Daniels,adanielsrq@squidoo.com,Female,97.221.132.35,30424803513734,China,10/9/1991,18433.85,Editor,​


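If you have multiple Parquet files in a folder, you can pass the folder path to `pd.read_parquet` to read them all at once. The rows from every file are combined into a single DataFrame, like a union.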

In [23]:
pwd

'C:\\Users\\KANGRSW\\Downloads'

In [31]:
files = r'C:\Users\KANGRSW\Downloads\parquet'

In [33]:
df = pd.read_parquet(files, engine='auto')

In [34]:
df

Unnamed: 0,registration_dttm,id,first_name,last_name,email,gender,ip_address,cc,country,birthdate,salary,title,comments
0,2016-02-03 07:55:29,1.0,Amanda,Jordan,ajordan0@com.com,Female,1.197.201.2,6759521864920116,Indonesia,3/8/1971,49756.53,Internal Auditor,1E+02
1,2016-02-03 17:04:03,2.0,Albert,Freeman,afreeman1@is.gd,Male,218.111.175.34,,Canada,1/16/1968,150280.17,Accountant IV,
2,2016-02-03 01:09:31,3.0,Evelyn,Morgan,emorgan2@altervista.org,Female,7.161.136.94,6767119071901597,Russia,2/1/1960,144972.51,Structural Engineer,
3,2016-02-03 00:36:21,4.0,Denise,Riley,driley3@gmpg.org,Female,140.35.109.83,3576031598965625,China,4/8/1997,90263.05,Senior Cost Accountant,
4,2016-02-03 05:05:31,5.0,Carlos,Burns,cburns4@miitbeian.gov.cn,,169.113.235.40,5602256255204850,South Africa,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,2016-02-03 13:36:49,996.0,Carol,Warren,cwarrenrn@geocities.jp,Female,71.7.191.213,,China,,185421.82,,""""
1996,2016-02-03 04:39:01,997.0,Helen,Fields,hfieldsro@comcast.net,Female,164.190.97.183,,Malaysia,,279671.68,,
1997,2016-02-03 00:33:54,998.0,Stephanie,Sims,ssimsrp@newyorker.com,Female,135.66.68.181,3548125808139842,Poland,,112275.78,,
1998,2016-02-03 00:15:08,999.0,Marie,Medina,mmedinarq@thetimes.co.uk,Female,223.83.175.211,,Kazakhstan,3/25/1969,53564.76,Speech Pathologist,


Notice that the first read returned 1000 rows from a single file, while reading the folder returns 2000 rows from both files.

References:

What is Parquet file:
https://www.youtube.com/watch?v=PaDUxrI6ThA

What is columnar database:
https://www.youtube.com/watch?v=8KGVFB3kVHQ

Parquet files downloaded from:
https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata2.parquet