### 1 Introduction

**CSV vs Parquet (Arrow)**
- format scalability
- using a local DuckDB instance
- with MovieLens 100k and 33M

#### 2 Library imports, dataset paths, DuckDB connection, and file conversion

In [1]:
#   INITIAL BLOCK: IMPORTS, PATHS & SETUP

import duckdb
import pandas as pd
import polars as pl
import time
from pathlib import Path

# === PATHS FOR THE DATASETS ===

## MovieLens 100k
DATA_100k = Path("..") / "data" / "100K"
ratings_100k_csv = DATA_100k / "ratings.csv"
movies_100k_csv  = DATA_100k / "movies.csv"
tags_100k_csv    = DATA_100k / "tags.csv"
links_100k_csv   = DATA_100k / "links.csv"

## MovieLens 33M
DATA_33m = Path("..") / "data" / "Full33M"
ratings_33m_csv = DATA_33m / "ratings.csv"
movies_33m_csv  = DATA_33m / "movies.csv"
tags_33m_csv    = DATA_33m / "tags.csv"
links_33m_csv   = DATA_33m / "links.csv"

print("Paths defined successfully.")

# === Local DuckDB connection ===
con = duckdb.connect("movielens_local.duckdb")
print("DuckDB connection opened.")

Paths defined successfully.
DuckDB connection opened.


### 1.2 Create Parquet Tables in DuckDB (ratings_parquet, movies_parquet)

In [6]:
# === CSV -> PARQUET CONVERSION ===

# Derive each Parquet path from the matching CSV path defined in the setup cell
ratings_100k_parquet = ratings_100k_csv.with_suffix(".parquet")
ratings_33m_parquet  = ratings_33m_csv.with_suffix(".parquet")
movies_33m_parquet   = movies_33m_csv.with_suffix(".parquet")

print("Parquet paths:")
print(ratings_100k_parquet)
print(ratings_33m_parquet)
print(movies_33m_parquet)

# Convert each CSV file to Parquet
duckdb.sql(f"""
COPY (SELECT * FROM read_csv_auto('{ratings_100k_csv}'))
TO '{ratings_100k_parquet}'
(FORMAT PARQUET);
""")

duckdb.sql(f"""
COPY (SELECT * FROM read_csv_auto('{ratings_33m_csv}'))
TO '{ratings_33m_parquet}'
(FORMAT PARQUET);
""")

duckdb.sql(f"""
COPY (SELECT * FROM read_csv_auto('{movies_33m_csv}'))
TO '{movies_33m_parquet}'
(FORMAT PARQUET);
""")

print("CSV -> Parquet conversion finished.")


Parquet paths:
..\data\100K\ratings.parquet
..\data\Full33M\ratings.parquet
..\data\Full33M\movies.parquet


CSV -> Parquet conversion finished.
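The three `COPY` statements above differ only in their paths, so they could be generated from a list. The sketch below is an illustration rather than the notebook's method: `build_copy_sql` and `csv_files` are hypothetical names, and the code only builds the SQL strings (in the same form used above) without executing them.

```python
from pathlib import Path


def build_copy_sql(csv_path: Path, parquet_path: Path) -> str:
    """Return a DuckDB COPY statement that rewrites one CSV file as Parquet."""
    # Forward slashes keep the statement identical across operating systems;
    # doubling single quotes stops an apostrophe in a path from ending the literal.
    src = csv_path.as_posix().replace("'", "''")
    dst = parquet_path.as_posix().replace("'", "''")
    return (
        f"COPY (SELECT * FROM read_csv_auto('{src}')) "
        f"TO '{dst}' (FORMAT PARQUET);"
    )


# Hypothetical file list mirroring the directory layout used in this notebook.
csv_files = [
    Path("..") / "data" / "100K" / "ratings.csv",
    Path("..") / "data" / "Full33M" / "ratings.csv",
    Path("..") / "data" / "Full33M" / "movies.csv",
]

statements = [build_copy_sql(p, p.with_suffix(".parquet")) for p in csv_files]
for stmt in statements:
    print(stmt)  # each string could then be passed to duckdb.sql(...)
```

Building the path pairs in one place also removes the risk of a Parquet path being derived from the wrong variable.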


In [2]:
# List of all tables in the database
con.sql("""
SELECT table_name, table_type
FROM information_schema.tables
""").df()

| | table_name | table_type |
|---|---|---|
| 0 | ratings_parquet | BASE TABLE |
| 1 | tags_parquet | BASE TABLE |


In [3]:
print(ratings_100k_csv)
print(ratings_33m_csv)

..\data\100K\ratings.csv
..\data\Full33M\ratings.csv


#### 3 Performance tests and measurements
##### 3.1 Function to measure query time

In [9]:
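# medir_tempo ("measure time"): wall-clock time of one query run

def medir_tempo(query):
    """Run `query` in DuckDB, materialise it as a DataFrame, and return elapsed seconds."""
    t0 = time.perf_counter()  # monotonic, higher resolution than time.time()
    duckdb.sql(query).df()
    return round(time.perf_counter() - t0, 3)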
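A single wall-clock measurement, which is what `medir_tempo` takes, is sensitive to operating-system file caching and background load. As a hedged sketch (not part of the original notebook), the hypothetical helper `time_callable` below discards warm-up runs and reports the median of several repeats; the `repeats` and `warmup` defaults are arbitrary assumptions.

```python
import statistics
import time


def time_callable(fn, repeats=5, warmup=1):
    """Run fn several times and return (median, min, max) wall-clock seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples), min(samples), max(samples)


# Stand-in workload; in the notebook this would wrap duckdb.sql(query).df().
median_s, min_s, max_s = time_callable(lambda: sum(range(100_000)))
print(f"median={median_s:.4f}s  min={min_s:.4f}s  max={max_s:.4f}s")
```

Reporting the spread alongside the median makes it easier to judge whether a difference between two formats or engines is larger than run-to-run noise.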

##### 3.2 Simple query on movieId: CSV vs Parquet at 100k and 33M rows

### 1.3 Benchmark 1 - CSV vs Parquet Performance (AVG(rating) GROUP BY movieId)

We measure the execution time of a simple aggregation query (AVG(rating) GROUP BY movieId)
executed using DuckDB, running on both CSV and Parquet files for the MovieLens 100k
and 33M datasets. This benchmark isolates the impact of the storage format on query
performance within DuckDB.
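For reference, the aggregation being timed is small: it needs only the `movieId` and `rating` columns. The plain-Python sketch below shows what the query computes on a few made-up rows; `avg_rating_by_movie` and `sample` are illustrative names, not part of the notebook.

```python
from collections import defaultdict


def avg_rating_by_movie(rows):
    """Plain-Python equivalent of SELECT movieId, AVG(rating) ... GROUP BY movieId."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for movie_id, rating in rows:
        totals[movie_id] += rating
        counts[movie_id] += 1
    return {m: totals[m] / counts[m] for m in totals}


# Made-up (movieId, rating) pairs, for illustration only.
sample = [(1, 4.0), (1, 5.0), (2, 3.0), (2, 2.0), (2, 4.0)]
print(avg_rating_by_movie(sample))  # {1: 4.5, 2: 3.0}
```

Because only two columns are involved, a columnar format such as Parquet lets the engine skip the remaining columns, whereas a CSV reader has to parse every row in full.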

In [10]:
# CSV vs Parquet measurements

tempos = {
    "CSV_100k": medir_tempo(f"""
        SELECT movieId, AVG(rating)
        FROM '{ratings_100k_csv}'
        GROUP BY movieId
    """),

    "PARQUET_100k": medir_tempo(f"""
        SELECT movieId, AVG(rating)
        FROM '{ratings_100k_parquet}'
        GROUP BY movieId
    """),

    "CSV_33M": medir_tempo(f"""
        SELECT movieId, AVG(rating)
        FROM '{ratings_33m_csv}'
        GROUP BY movieId
    """),

    "PARQUET_33M": medir_tempo(f"""
        SELECT movieId, AVG(rating)
        FROM '{ratings_33m_parquet}'
        GROUP BY movieId
    """),
}

tempos


{'CSV_100k': 0.065,
 'PARQUET_100k': 0.014,
 'CSV_33M': 1.045,
 'PARQUET_33M': 0.184}

In [11]:
# Comparison table

df_tempo = pd.DataFrame([
    ["100k", tempos["CSV_100k"], tempos["PARQUET_100k"]],
    ["33M", tempos["CSV_33M"], tempos["PARQUET_33M"]],
], columns=["Dataset", "CSV time (s)", "Parquet time (s)"])

df_tempo


| | Dataset | CSV time (s) | Parquet time (s) |
|---|---|---|---|
| 0 | 100k | 0.065 | 0.014 |
| 1 | 33M | 1.045 | 0.184 |


__Conclusion:__

This benchmark measures the execution time of a simple aggregation query (`AVG(rating)` grouped by `movieId`) across four scenarios:

- CSV 100K
- Parquet 100K
- CSV 33M
- Parquet 33M

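The results show a strong impact of storage format on query speed: in this run Parquet was roughly 5x faster than CSV on the 100k dataset and close to 6x faster on the 33M dataset. Each figure comes from a single execution, so the ratios are indicative rather than precise.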
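The speedup implied by the timings recorded above can be worked out directly. This is a small sketch using the values printed by the `tempos` cell; the `speedup` helper is a hypothetical name introduced here.

```python
# Timings (seconds) copied from the output of the `tempos` cell above.
tempos = {
    "CSV_100k": 0.065,
    "PARQUET_100k": 0.014,
    "CSV_33M": 1.045,
    "PARQUET_33M": 0.184,
}


def speedup(times, size):
    """Return how many times faster Parquet was than CSV for one dataset size."""
    return times[f"CSV_{size}"] / times[f"PARQUET_{size}"]


for size in ("100k", "33M"):
    print(f"{size}: Parquet is {speedup(tempos, size):.1f}x faster than CSV")
# 100k: Parquet is 4.6x faster than CSV
# 33M: Parquet is 5.7x faster than CSV
```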


__Close the connection (when done)__

In [12]:
#con.close()
#print("Connection closed.")

## 2.0 DuckDB vs Polars

### 2.1 Benchmark 2 - Simple Aggregation Query (DuckDB vs Polars)

Here we run the same aggregation query (`AVG(rating)` grouped by `movieId`) on Parquet files using DuckDB and Polars, for 100k and 33M rows, in order to compare the execution engines under a simple workload.
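Timing two engines is only meaningful if both return the same answer. The sketch below is an assumption about how such a check could look, not something the notebook does: it compares two `movieId -> average` mappings with a floating-point tolerance, since engines may sum values in a different order. `same_aggregates` and the sample dictionaries are hypothetical.

```python
import math


def same_aggregates(a, b, rel_tol=1e-9):
    """Return True if both mappings have the same keys and near-equal values."""
    if a.keys() != b.keys():
        return False
    return all(math.isclose(a[k], b[k], rel_tol=rel_tol) for k in a)


# Made-up results standing in for the two engines' outputs.
duckdb_like = {1: 4.5, 2: 3.0}
polars_like = {2: 3.0000000000000004, 1: 4.5}  # tiny rounding difference

print(same_aggregates(duckdb_like, polars_like))  # True
```

In the notebook, the two mappings could be built from the DuckDB and Polars result frames before any timings are compared.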

In [None]:
# Parquet paths, derived from the CSV paths defined in the setup cell

p100k = ratings_100k_csv.with_suffix(".parquet")
p33m = ratings_33m_csv.with_suffix(".parquet")

In [None]:
# Benchmark functions

def run_duckdb(path):
    t0 = time.time()
    duckdb.sql(f"""
        SELECT movieId, AVG(rating)
        FROM '{path}'
        GROUP BY movieId
    """).df()
    return round(time.time() - t0, 3)

def run_polars(path):
    t0 = time.time()
    (
        pl.scan_parquet(str(path))     
          .group_by("movieId")         
          .agg(pl.col("rating").mean())
          .collect()
    )
    return round(time.time() - t0, 3)


In [None]:
# Comparison

df_duck_polars = pd.DataFrame([
    ["100k", run_duckdb(p100k), run_polars(p100k)],
    ["33m", run_duckdb(p33m), run_polars(p33m)],
], columns=["Dataset", "DuckDB (s)", "Polars (s)"])

df_duck_polars


| | Dataset | DuckDB (s) | Polars (s) |
|---|---|---|---|
| 0 | 100k | 0.006 | 0.026 |
| 1 | 33m | 0.151 | 0.716 |


### 2.2 Benchmark 3 - Aggregation + JOIN (AVG, COUNT, STDDEV): DuckDB vs Polars

This benchmark uses a more realistic analytical workload on the 33M dataset:
we compute average rating, number of ratings and standard deviation per movie, join with the movies table to retrieve titles, filter movies with at least 500 ratings, and sort by total_ratings (TOP 100). The goal is to compare DuckDB and Polars under a heavier analytical query.
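One detail matters when comparing the two implementations of this query: the DuckDB version uses `STDDEV_POP` (population standard deviation, dividing by n), while Polars' `std()` defaults to `ddof=1` (sample standard deviation, dividing by n - 1). The two only agree if the Polars call passes `ddof=0`. The sketch below illustrates the difference with the standard library on made-up ratings.

```python
import statistics

# Made-up ratings for a single movie, for illustration only.
ratings = [3.0, 4.0, 4.0, 5.0]

population_std = statistics.pstdev(ratings)  # divides by n, like STDDEV_POP
sample_std = statistics.stdev(ratings)       # divides by n - 1, like ddof=1

print(round(population_std, 4))  # 0.7071
print(round(sample_std, 4))      # 0.8165
```

With hundreds of ratings per movie the gap is small, but it is enough to make rounded outputs differ between the engines.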

**DuckDB**

In [None]:

MIN_RATINGS = 500

def run_duckdb_movie_stats(con):
    sql = f"""
    WITH movie_stats AS (
        SELECT
            m.movieId,
            m.title,
            AVG(r.rating) AS avg_rating,
            COUNT(*) AS total_ratings,
            STDDEV_POP(r.rating) AS std_rating
        -- read the Parquet files directly, as the Polars version does
        FROM '{ratings_33m_parquet}' r
        JOIN '{movies_33m_parquet}' m USING (movieId)
        GROUP BY m.movieId, m.title
        HAVING COUNT(*) >= {MIN_RATINGS}
    )
    SELECT
        movieId,
        title,
        ROUND(avg_rating, 3)  AS avg_rating,
        total_ratings,
        ROUND(std_rating, 3)  AS std_rating
    FROM movie_stats
    ORDER BY total_ratings DESC
    LIMIT 100
    """
    return con.sql(sql).df()


In [48]:
start = time.perf_counter()
df_duckdb_stats = run_duckdb_movie_stats(con)
t_duckdb_stats = time.perf_counter() - start

#df_duckdb_stats.head(), t_duckdb_stats



**Polars**

In [None]:
def run_polars_movie_stats_33m():
    start = time.perf_counter()

    # use the PARQUET files, not the CSVs
    ratings = (
        pl.scan_parquet(str(ratings_33m_parquet))
        .select(["movieId", "rating"])
    )

    movies  = (
        pl.scan_parquet(str(movies_33m_parquet))
        .select(["movieId", "title"])
    )

    MIN_RATINGS = 500

    result = (
        ratings
        .group_by("movieId")
        .agg([
            pl.col("rating").mean().alias("avg_rating"),
            pl.len().alias("total_ratings"),  # row count per group (pl.count() is deprecated)
            pl.col("rating").std(ddof=0).alias("std_rating"),  # population std, matching STDDEV_POP
        ])
        .filter(pl.col("total_ratings") >= MIN_RATINGS)
        .join(movies, on="movieId")
        .select([
            "movieId",
            "title",
            pl.col("avg_rating").round(3),
            "total_ratings",
            pl.col("std_rating").round(3),
        ])
        .sort("total_ratings", descending=True)
        .limit(100)
        .collect()
    )

    elapsed = time.perf_counter() - start
    return result, elapsed


In [56]:
df_polars_stats_33m, t_polars_stats_33m = run_polars_movie_stats_33m()
#df_polars_stats_33m.head(), t_polars_stats_33m


In [53]:
df_complex = pd.DataFrame([
    ["33M (complex)", t_duckdb_stats, t_polars_stats_33m],
], columns=["Dataset", "DuckDB (s)", "Polars (s)"])

df_complex


| | Dataset | DuckDB (s) | Polars (s) |
|---|---|---|---|
| 0 | 33M (complex) | 0.999457 | 0.676009 |


## 3.0 Summary

In [54]:
df_summary = pd.DataFrame([
    ["Benchmark 1 - 100k (CSV vs Parquet)", "DuckDB CSV", tempos["CSV_100k"]],
    ["Benchmark 1 - 100k (CSV vs Parquet)", "DuckDB Parquet", tempos["PARQUET_100k"]],
    ["Benchmark 1 - 33M (CSV vs Parquet)", "DuckDB CSV", tempos["CSV_33M"]],
    ["Benchmark 1 - 33M (CSV vs Parquet)", "DuckDB Parquet", tempos["PARQUET_33M"]],
    ["Benchmark 2 - 100k (simple)", "DuckDB", df_duck_polars.loc[0, "DuckDB (s)"]],
    ["Benchmark 2 - 100k (simple)", "Polars", df_duck_polars.loc[0, "Polars (s)"]],
    ["Benchmark 2 - 33M (simple)", "DuckDB", df_duck_polars.loc[1, "DuckDB (s)"]],
    ["Benchmark 2 - 33M (simple)", "Polars", df_duck_polars.loc[1, "Polars (s)"]],
    ["Benchmark 3 - 33M (complex)", "DuckDB", df_complex.loc[0, "DuckDB (s)"]],
    ["Benchmark 3 - 33M (complex)", "Polars", df_complex.loc[0, "Polars (s)"]],
], columns=["Scenario", "Engine", "Time (s)"])

df_summary


| | Scenario | Engine | Time (s) |
|---|---|---|---|
| 0 | Benchmark 1 - 100k (CSV vs Parquet) | DuckDB CSV | 0.066 |
| 1 | Benchmark 1 - 100k (CSV vs Parquet) | DuckDB Parquet | 0.015 |
| 2 | Benchmark 1 - 33M (CSV vs Parquet) | DuckDB CSV | 1.885 |
| 3 | Benchmark 1 - 33M (CSV vs Parquet) | DuckDB Parquet | 0.159 |
| 4 | Benchmark 2 - 100k (simple) | DuckDB | 0.006 |
| 5 | Benchmark 2 - 100k (simple) | Polars | 0.026 |
| 6 | Benchmark 2 - 33M (simple) | DuckDB | 0.151 |
| 7 | Benchmark 2 - 33M (simple) | Polars | 0.716 |
| 8 | Benchmark 3 - 33M (complex) | DuckDB | 0.999457 |
| 9 | Benchmark 3 - 33M (complex) | Polars | 0.676009 |


The results show that Parquet is substantially faster than CSV. On the 33M dataset the same query took 1.885 s on CSV and 0.159 s on Parquet in the summary run, roughly a 12x difference (an earlier run of the same query measured 1.045 s vs 0.184 s, about 6x).

In the engine comparison (Benchmark 2), DuckDB is faster on the simple aggregation at both sizes: 0.006 s vs 0.026 s on 100k and 0.151 s vs 0.716 s on 33M, a gap of roughly 4x to 5x.

In the complex analytical query (Benchmark 3), Polars outperforms DuckDB (0.68 s vs 1.00 s), which suggests it copes better with this heavier workload, although a single run is not enough to generalise.

Both engines achieve sub-second performance on 33M rows when using Parquet, highlighting the efficiency of columnar storage.

Overall, Parquet + Polars provides the strongest performance on the heavier analytical query tested here, while DuckDB leads on the simple aggregation.