## Introduction to Modin

In [1]:
!uv pip install "modin[all]"

Using Python 3.12.1 environment at: /workspaces/TechCatalyst_DE_2025/devpy
Audited 1 package in 270ms


In [2]:
!uv pip install "bokeh>=3.1.0"

Using Python 3.12.1 environment at: /workspaces/TechCatalyst_DE_2025/devpy
Audited 1 package in 7ms


## Using Pandas

In [3]:
s3file = 's3://techcatalyst-raw/yellow_tripdata_2024-01.parquet'

In [4]:
from dotenv import load_dotenv
load_dotenv()

True

In [5]:
%%time

import pandas as pd

# 1. Read Parquet
df = pd.read_parquet(s3file)

# 2. Calculate trip duration in minutes
df["trip_duration_min"] = (
    (df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]).dt.total_seconds() / 60
)

# 3. Filter trips longer than 25 minutes
df_filtered = df[df["trip_duration_min"] > 25]

# 4. Group by payment_type and get average fare
result = df_filtered.groupby("payment_type")["fare_amount"].mean()

result.head()



CPU times: user 1.26 s, sys: 512 ms, total: 1.77 s
Wall time: 4.61 s


payment_type
0    40.175634
1    49.081101
2    49.536832
3    23.084967
4     4.193730
Name: fare_amount, dtype: float64
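
To make the three steps above concrete, here is a minimal pure-Python sketch of the same logic (duration in minutes, filter at 25 minutes, mean fare per payment type). The four rows are hypothetical values made up for illustration and are not taken from the taxi dataset.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows: (pickup, dropoff, payment_type, fare_amount)
trips = [
    (datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 1, 8, 30), 1, 40.0),
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 10), 1, 12.0),
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 45), 2, 55.0),
    (datetime(2024, 1, 1, 11, 0), datetime(2024, 1, 1, 11, 40), 1, 50.0),
]

fares_by_type = defaultdict(list)
for pickup, dropoff, payment_type, fare in trips:
    # Same arithmetic as the DataFrame version: timedelta -> seconds -> minutes
    duration_min = (dropoff - pickup).total_seconds() / 60
    # Keep only trips longer than 25 minutes
    if duration_min > 25:
        fares_by_type[payment_type].append(fare)

# Mean fare per payment type, the equivalent of groupby(...).mean()
avg_fare = {ptype: sum(f) / len(f) for ptype, f in fares_by_type.items()}
print(avg_fare)  # {1: 45.0, 2: 55.0}
```

The 10-minute trip is dropped by the filter, so payment type 1 averages only the 40.0 and 50.0 fares.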

## Using Modin

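__Small to Medium Data Sizes__.
Modin adds distributed-execution overhead. For DataFrames that fit comfortably in memory (typical on laptops and workstations), this overhead can outweigh any parallel gains: each operation must be scheduled across multiple partitions, even when only a few cores are available or the data is “small.” In the runs recorded in this notebook, the pandas pipeline finished in 4.61 s of wall time, while the same pipeline under Modin took 13.5 s.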
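
The scheduling cost described above can be illustrated with a standard-library analogy. The sketch below is not how Modin is implemented; it only splits a small list into partitions, computes partial sums in a thread pool, and combines them, so the result can be compared with a plain serial mean. The helper names (`partition`, `partitioned_mean`, `timed`) and the `fares` list are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def partition(values, n_parts):
    """Split a list into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(values), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def serial_mean(values):
    """Plain single-pass mean, analogous to the in-memory pandas path."""
    return sum(values) / len(values)


def partitioned_mean(values, n_parts=4):
    """Compute (sum, count) per partition in a pool, then combine."""
    chunks = [c for c in partition(values, n_parts) if c]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(lambda c: (sum(c), len(c)), chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical "small" dataset: 1,000 fare values
fares = [float(i % 50) for i in range(1_000)]

serial_result, serial_secs = timed(serial_mean, fares)
parallel_result, parallel_secs = timed(partitioned_mean, fares)

print(f"serial:      {serial_result:.3f} in {serial_secs:.6f}s")
print(f"partitioned: {parallel_result:.3f} in {parallel_secs:.6f}s")
```

Both paths return the same mean. On an input this small, the time spent creating the pool and dispatching partitions will often exceed the time spent on the arithmetic itself, which is the same trade-off to watch for when a DataFrame already fits in memory.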

In [6]:
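import os

# Select the Dask engine before importing modin.pandas,
# so the setting is in place when Modin initializes.
os.environ["MODIN_ENGINE"] = "dask"

# From here on, `pd` refers to Modin rather than pandas.
import modin.pandas as pd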

In [7]:
%%time
# 1. Read Parquet
df = pd.read_parquet(s3file)

# 2. Calculate trip duration in minutes
df["trip_duration_min"] = (
    (df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]).dt.total_seconds() / 60
)

# 3. Filter trips longer than 25 minutes
df_filtered = df[df["trip_duration_min"] > 25]

# 4. Group by payment_type and get average fare
result = df_filtered.groupby("payment_type")["fare_amount"].mean()

result.head()

CPU times: user 3.29 s, sys: 292 ms, total: 3.58 s
Wall time: 13.5 s


payment_type
0    40.175634
1    49.081101
2    49.536832
3    23.084967
4     4.193730
Name: fare_amount, dtype: float64