# Benchmark for big data ingestion and querying

## Introduction

This notebook provides a quick comparison of ingestion and querying speed between Shapelets (shapelets-platform) and two well-known Python libraries for tabular data handling: Pandas and Polars. You need all three installed to run this benchmark.

In [None]:
! pip install pandas polars shapelets-platform

The benchmark is based on the NYC Taxi Dataset, which is ~37 GB in size. The dataset contains more than 1.5 billion records of taxi trips in the New York City area over 10 years (2009 to 2019). It consists of .parquet files, one for each month of data.

We compare two scenarios: one month of data (~430 MB) and one year of data (~4.6 GB). The benchmark evaluates two objectives:
- Data ingestion
- Data querying with aggregation, computing the average number of passengers for each day and hour.
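
To make the query objective concrete, here is a minimal pure-Python sketch of the same aggregation (mean `passenger_count` per pickup date and hour, as in the queries further down). The function name and the sample rows are hypothetical and not part of any of the three libraries. It is only meant as a reference for checking results on a small sample.

```python
from collections import defaultdict
from datetime import date, datetime


def mean_passengers_by_date_and_hour(trips):
    """Return {(date, hour): mean passenger_count} for (pickup_at, passenger_count) pairs."""
    totals = defaultdict(lambda: [0, 0])  # (date, hour) -> [sum, count]
    for pickup_at, passenger_count in trips:
        key = (pickup_at.date(), pickup_at.hour)
        totals[key][0] += passenger_count
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical sample rows, not taken from the dataset.
sample_trips = [
    (datetime(2009, 1, 1, 0, 15), 1),
    (datetime(2009, 1, 1, 0, 45), 3),
    (datetime(2009, 1, 1, 1, 5), 2),
    (datetime(2009, 1, 2, 0, 30), 4),
]

reference = mean_passengers_by_date_and_hour(sample_trips)
print(reference[(date(2009, 1, 1), 0)])  # 2.0
```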

When running the benchmark, note the execution time as well as the CPU and memory usage of each library. For instance, the Shapelets implementation is the fastest, uses all available CPU cores and does not require loading the data into memory.
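
One possible way to record execution time and CPU time for each cell is a small context manager built on the standard library. This is a sketch, not part of the original benchmark: the `measure` helper and the toy workload are hypothetical. It does not capture memory usage, which is easier to observe with a system monitor while the cell runs.

```python
import time
from contextlib import contextmanager


@contextmanager
def measure(label, results):
    """Record wall-clock and CPU seconds of the enclosed block in results[label]."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of the whole process, all threads
    try:
        yield
    finally:
        results[label] = {
            "wall_s": time.perf_counter() - wall_start,
            "cpu_s": time.process_time() - cpu_start,
        }


results = {}
with measure("toy workload", results):
    total = sum(i * i for i in range(200_000))  # stand-in for a benchmark cell

print(results["toy workload"])
```

A `cpu_s` value well above `wall_s` suggests that the work was spread over several cores, while a ratio close to 1 suggests mostly single-threaded execution.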

To handle data, Shapelets relies on very efficient data structures built on a technique called bitmap indexing, implemented in C++. This technique performs particularly well on very large databases, providing faster record retrieval and more efficient insert, delete and update operations.
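
The general idea of a bitmap index can be illustrated with a toy sketch in pure Python. This is not Shapelets' C++ implementation; the column contents and helper names are hypothetical. Each distinct value of a column gets one bitmap, where bit `i` is set when row `i` holds that value, so filters combine with bitwise operations.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)


def rows_of(bitmap):
    """Return the row numbers whose bits are set in bitmap."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]


# Hypothetical columns, six rows each.
passenger_count = [1, 2, 1, 3, 2, 1]
payment_type = ["cash", "card", "card", "cash", "card", "cash"]

by_passengers = build_bitmap_index(passenger_count)
by_payment = build_bitmap_index(payment_type)

# Rows with exactly one passenger AND paid in cash: a single bitwise AND.
match = by_passengers[1] & by_payment["cash"]
print(rows_of(match))  # [0, 5]
```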

When handling time series, temporal indices are discretized and encoded as bitmap indices. This speeds up operations and brings several advantages, such as the ability to store time series sampled at an arbitrarily high frequency.
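
As a rough illustration of discretizing a temporal index, the sketch below maps timestamps to integer buckets and keeps one bitmap per bucket. The hourly resolution, the `bucket_of` helper and the sample timestamps are assumptions made for this example and say nothing about how Shapelets encodes time internally.

```python
from datetime import datetime

BUCKET_SECONDS = 3600  # assumed resolution: one bucket per hour


def bucket_of(ts, origin):
    """Discretize a timestamp into an integer bucket number relative to origin."""
    return int((ts - origin).total_seconds()) // BUCKET_SECONDS


origin = datetime(2009, 1, 1)
pickups = [  # hypothetical pickup times
    datetime(2009, 1, 1, 0, 15),
    datetime(2009, 1, 1, 0, 45),
    datetime(2009, 1, 1, 2, 5),
    datetime(2009, 1, 1, 2, 50),
    datetime(2009, 1, 1, 3, 10),
]

# One bitmap per time bucket: bit i is set when row i falls in that bucket.
time_index = {}
for row, ts in enumerate(pickups):
    bucket = bucket_of(ts, origin)
    time_index[bucket] = time_index.get(bucket, 0) | (1 << row)

# Range query "pickups between 02:00 and 04:00": OR the bitmaps of buckets 2 and 3.
selected = time_index.get(2, 0) | time_index.get(3, 0)
print(bin(selected))  # 0b11100
```

In this toy version, a smaller `BUCKET_SECONDS` gives a finer time resolution at the cost of more buckets.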

**Note**: If you find a better implementation for Pandas or Polars, feel free to open an issue in this repo or e-mail us at hello@shapelets.io.

## One Month Scenario

### Pandas

In [None]:
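import pandas as pd

# Ingestion: read every parquet file in the January 2009 directory into memory
df = pd.read_parquet('../nyc-taxi/2009/01')

# Index by pickup timestamp so date and hour can be derived from the index
df['pickup_at'] = pd.to_datetime(df['pickup_at'])
df = df.set_index('pickup_at')

# Query: average passenger count for each (date, hour) pair
df.groupby([df.index.date, df.index.hour])['passenger_count'].mean()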

### Polars

In [None]:
import polars as pl

# Lazy scan: nothing is read until .collect() is called
data = pl.scan_parquet('../nyc-taxi/2009/01/*.parquet')

# `group_by` is the name in Polars >= 0.19; older releases spell it `groupby`
data.group_by(
    [
        pl.col("pickup_at").cast(pl.Date).alias("pickup_at_date"),
        pl.col("pickup_at").dt.hour().alias("pickup_at_hour"),
    ]
).agg(pl.mean("passenger_count")).collect()

### Shapelets

In [None]:
import shapelets as sh
from shapelets.functions import avg, getDate, hour

playground = sh.sandbox()

taxis = playground.from_parquet("../nyc-taxi/2009/01/*.parquet")

result = playground.map(
    (getDate(row.pickup_at), hour(row.pickup_at), avg(row.passenger_count)) 
    for row in taxis
)

result.to_pandas()

## One Year Scenario

### Pandas (Large memory consumption)

In [None]:
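import pandas as pd

# Ingestion: read every parquet file under the 2009 directory into memory
df = pd.read_parquet('../nyc-taxi/2009')

# Index by pickup timestamp so date and hour can be derived from the index
df['pickup_at'] = pd.to_datetime(df['pickup_at'])
df = df.set_index('pickup_at')

# Query: average passenger count for each (date, hour) pair
df.groupby([df.index.date, df.index.hour])['passenger_count'].mean()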

### Polars (Large memory consumption)

In [None]:
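import polars as pl

# Lazy scan over all monthly files of 2009: nothing is read until .collect()
data = pl.scan_parquet('../nyc-taxi/2009/**/*.parquet')

# `group_by` is the name in Polars >= 0.19; older releases spell it `groupby`
data.group_by(
    [
        pl.col("pickup_at").cast(pl.Date).alias("pickup_at_date"),
        pl.col("pickup_at").dt.hour().alias("pickup_at_hour"),
    ]
).agg(pl.mean("passenger_count")).collect()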

### Shapelets

In [None]:
import shapelets as sh
from shapelets.functions import avg, getDate, hour

playground = sh.sandbox()

taxis = playground.from_parquet("../nyc-taxi/2009/**/*.parquet")

result = playground.map(
    (getDate(row.pickup_at), hour(row.pickup_at), avg(row.passenger_count)) 
    for row in taxis
)

result.to_pandas()