# Convert the CSV file to Parquet format

Parquet is a columnar storage file format designed for efficient data storage and retrieval, especially for analytics workloads. It is widely used in modern data engineering pipelines and supported by tools like DuckDB, Spark, BigQuery, AWS Athena, and Snowflake.

## 📦 What is Parquet?

- **Columnar**: Stores data by column, not by row
- **Binary format**: Efficient and compact
- **Self-describing**: Stores schema and metadata with the data
- **Supports compression**: e.g., Snappy, Gzip
- **Splittable**: Data is stored in row groups that can be read and processed in parallel

This notebook converts a CSV file to Parquet using pandas, with PyArrow as the Parquet engine.


In [8]:
import pandas as pd

In [9]:
# Read the gzip-compressed CSV and parse the transaction date column as datetimes.
df = pd.read_csv(
    "data/state-of-delaware-pcard-transactions.csv.gz",
    compression="gzip",
    parse_dates=["TRANS_DT"],
)
# Rename the upper-case source columns to descriptive snake_case names.
df = df.rename(
    columns={
        "FISCAL_YEAR": "fiscal_year",
        "FISCAL_PERIOD": "fiscal_period",
        "DEPT_NAME": "department",
        "DIV_NAME": "division",
        "MERCHANT": "merchant",
        "CAT_DESCR": "category_description",
        "TRANS_DT": "transaction_date",
        "MERCHANDISE_AMT": "amount",
    }
)
df.head(2)

Unnamed: 0,fiscal_year,fiscal_period,department,division,merchant,category_description,transaction_date,amount
0,2017,1,LEGISLATIVE BRANCH,General Assembly House,VZWRLSS*APOCC VISB,Telecom Incl Prepaid-Recurring Phone Svcs,2016-07-14,11.65
1,2017,1,LEGISLATIVE BRANCH,General Assembly House,GAN*NEWSPAPER SUB1052,Direct Marketing-Continuity-Subscription Merch...,2016-07-05,15.0


In [10]:
# Write the DataFrame to Parquet; index=False leaves the pandas index out of the file.
df.to_parquet("data/state-of-delaware-pcard-transactions.parquet", index=False)