# Parquet File Format (Apache Parquet)

Apache Parquet is an open-source, columnar storage file format optimized for analytical workloads (OLAP). It is widely used across big data and cloud ecosystems, including Spark, Hive, Presto/Trino, Athena, BigQuery, Snowflake, and data lakes on S3, ADLS, and GCS.

### Columnar Storage

Unlike row-based formats (CSV, JSON, Avro), Parquet stores the values of each column contiguously on disk within a row group.

**Implications**

1. Reads only the columns required by a query (I/O reduction).
2. Achieves higher compression due to homogeneous data per column.
3. Enables vectorized execution and predicate pushdown.

### Internal Structure

A Parquet file is organized hierarchically:

* File
    * Row Groups (horizontal partitions)
        * Column Chunks (one per column)
            * Pages
                * Dictionary Page (optional, at most one per column chunk)
                * Data Pages

**Metadata**

* Stored in the file footer.
* Contains schema, statistics (min/max/null count), encodings, and offsets.
* Enables predicate pushdown and column pruning without scanning data.

### Compression & Encoding

Parquet separates encoding from compression.

**Encodings**

* Plain
* Dictionary Encoding
* Run-Length Encoding (RLE)
* Delta Encoding (for sorted/numeric data)
* Bit Packing
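
To make the ideas behind these encodings concrete, here is a toy sketch in plain Python. It illustrates the concepts only and is not Parquet's actual byte-level layout; the function names are hypothetical and `delta_encode` assumes a non-empty list.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []
    index_of = {}
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


countries = ["IN", "IN", "IN", "US", "US", "IN"]
dictionary, indices = dictionary_encode(countries)
runs = run_length_encode(indices)
deltas = delta_encode([100, 101, 103, 106])

print(dictionary, indices)
print(runs)
print(deltas)
```

Dictionary indices and deltas are small integers, which is what makes bit packing and run-length encoding effective as a follow-up step.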

**Compression Codecs**

* Snappy (default in Spark; fast)
* Gzip (higher compression, slower)
* ZSTD (excellent balance; increasingly popular)
* Brotli, LZO (environment-dependent)

### Schema Support

* Strongly typed schema: physical types (BOOLEAN, INT32, INT64, FLOAT, DOUBLE, BYTE_ARRAY, etc.) with logical types such as STRING, DECIMAL, and TIMESTAMP layered on top
* Nested data support (structs, arrays, maps)
* Schema evolution (add columns safely; dropping requires care)


### Performance Characteristics

**Strengths**

* Excellent for aggregates, filters, scans, joins.
* Efficient for large-scale analytics.
* Works well with partitioning (e.g., date=YYYY-MM-DD).

**Limitations**

* Not suitable for frequent small updates or deletes.
* Row-level mutations are expensive (rewrite required).
* Poor fit for transactional OLTP without a table format layer.

| Aspect      | Parquet   | CSV  | JSON    | Avro      |
| ----------- | --------- | ---- | ------- | --------- |
| Storage     | Columnar  | Row  | Row     | Row       |
| Compression | Excellent | Poor | Poor    | Moderate  |
| Schema      | Yes       | No   | Weak    | Yes       |
| Analytics   | Excellent | Poor | Poor    | Good      |
| Streaming   | No        | No   | Limited | Excellent |


**Rule of thumb**

`Analytics / Data Lake`:  **Parquet**

`Streaming / Messaging`:  **Avro**

`Interchange / Debugging`:  **JSON**

`Human-readable export`:  **CSV**

In [1]:
import os

import pandas as pd

In [2]:
rows = [
    {"id": 1, "name": "Shravan", "age": 28},
    {"id": 2, "name": "Hanvika", "age": 25}
]

In [3]:
df = pd.DataFrame(rows)
df

|   | id | name    | age |
| - | -- | ------- | --- |
| 0 | 1  | Shravan | 28  |
| 1 | 2  | Hanvika | 25  |


In [4]:
df.to_parquet('mydata.parquet', compression='snappy')

In [6]:
# os was already imported in the first cell
os.listdir()

['avro_file.ipynb',
 'config_file.ipynb',
 'csv_file.ipynb',
 'json_file.ipynb',
 'mydata.parquet',
 'parquet_file.ipynb',
 'toml_file.ipynb']

In [8]:
df2 = pd.read_parquet('mydata.parquet')
df2

|   | id | name    | age |
| - | -- | ------- | --- |
| 0 | 1  | Shravan | 28  |
| 1 | 2  | Hanvika | 25  |


In [9]:
os.unlink('mydata.parquet')