# Benchmark for big data ingestion and querying

## Introduction

This notebook provides a quick comparison of ingestion and query speed between Shapelets (shapelets-platform) and two well-known Python libraries for tabular data handling: Pandas and Polars. All three must be installed to run this benchmark.

In [None]:
! pip install pandas polars shapelets-platform

The benchmark is based on the NYC Taxi Dataset, with a size of ~37 GB. The dataset contains more than 1.5 billion records of taxi trips in the New York City area over 10 years (2009 to 2019). It consists of .parquet files, one for each month of data.

We compare two scenarios: one month of data (~430 MB) and one year of data (~4.6 GB). The benchmark evaluates two objectives:
- Data ingestion
- Data querying with aggregation, computing the average number of passengers for each day and hour.
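To make the aggregation concrete, here is a minimal standard-library sketch of the same computation on a handful of made-up records. The sample trips and the function name `mean_passengers_by_date_hour` are illustrative assumptions, not part of the benchmark.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample records: (pickup timestamp, passenger count).
trips = [
    (datetime(2009, 1, 1, 0, 5), 1),
    (datetime(2009, 1, 1, 0, 40), 3),
    (datetime(2009, 1, 1, 1, 15), 2),
    (datetime(2009, 1, 2, 0, 10), 4),
]

def mean_passengers_by_date_hour(records):
    """Return {(date, hour): mean passenger count} for the given records."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum, count]
    for pickup_at, passengers in records:
        key = (pickup_at.date(), pickup_at.hour)
        totals[key][0] += passengers
        totals[key][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}

averages = mean_passengers_by_date_hour(trips)
for (day, hour), avg in sorted(averages.items()):
    print(day, hour, avg)
```

Each library in the benchmark expresses this same group-and-average step in its own API.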

When running the benchmark, note the execution time, CPU usage and memory used by each library. For instance, the Shapelets implementation is the fastest, uses all available CPU cores and does not require loading the data into memory.
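One possible way to record these numbers is a small helper built on the standard library. This is a sketch under assumptions: the `measure` helper is hypothetical, the `resource` module is only available on Unix-like systems, and `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS.

```python
import time

try:
    import resource  # Unix only; not available on Windows
except ImportError:
    resource = None

def measure(fn, *args, **kwargs):
    """Run fn once and report wall time, CPU time and peak RSS of the process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start  # summed over all threads of the process
    # Peak resident set size since process start (units depend on the platform).
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss if resource else None
    return result, {"wall_s": wall, "cpu_s": cpu, "peak_rss": peak}

# Stand-in workload; in the notebook this would wrap one library's query.
result, stats = measure(sum, range(1_000_000))
print(stats)
```

A CPU time noticeably larger than the wall time suggests that several cores were used. Because the peak RSS covers the whole lifetime of the process, each library is best measured in a fresh kernel.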

To handle data, Shapelets relies on very efficient data structures based on a technique named bitmap indexing, implemented in C++. This technique offers particularly good results in huge databases, providing faster retrieval of records and greater efficiency in insert, delete and update operations.
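As a rough illustration of the general idea (not of Shapelets' C++ implementation, which is not shown in this notebook), a bitmap index keeps one bitmap per distinct column value and answers filters with bitwise operations. The toy columns and helper names below are assumptions made for the example.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)

def rows_of(bitmap):
    """Decode a bitmap into the sorted list of row positions it marks."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

# Hypothetical columns of a tiny trips table.
passenger_count = [1, 2, 1, 3, 2, 1]
payment_type = ["cash", "card", "card", "cash", "card", "cash"]

by_passengers = build_bitmap_index(passenger_count)
by_payment = build_bitmap_index(payment_type)

# passenger_count == 1 AND payment_type == 'cash'
match = by_passengers[1] & by_payment["cash"]
print(rows_of(match))  # [0, 5]

# passenger_count == 2 OR passenger_count == 3
either = by_passengers[2] | by_passengers[3]
print(rows_of(either))  # [1, 3, 4]
```

Combining predicates reduces to AND/OR over bitmaps, which is why this layout tends to suit filter-heavy analytical queries.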

When handling time series, temporal indices are discretized and encoded as bitmap indices, which speeds up operations and provides several advantages, such as the ability to store time series sampled at an arbitrarily high frequency.
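The following sketch shows one way such a discretization could look. Timestamps are mapped to integer bucket numbers, and each bucket gets a bitmap of the rows that fall into it. The one-hour resolution, the `EPOCH` origin and all names are assumptions chosen for illustration and say nothing about how Shapelets does it internally.

```python
from datetime import datetime, timedelta

EPOCH = datetime(2009, 1, 1)   # assumed origin of the time axis
BUCKET = timedelta(hours=1)    # assumed discretization step

def bucket_of(ts):
    """Discretize a timestamp into an integer bucket number."""
    return (ts - EPOCH) // BUCKET

# Hypothetical sample data.
pickup_at = [
    datetime(2009, 1, 1, 0, 5),
    datetime(2009, 1, 1, 0, 40),
    datetime(2009, 1, 1, 1, 15),
    datetime(2009, 1, 1, 3, 30),
]
passenger_count = [1, 3, 2, 4]

# One bitmap per bucket: bit i is set when row i falls into that bucket.
index = {}
for row, ts in enumerate(pickup_at):
    b = bucket_of(ts)
    index[b] = index.get(b, 0) | (1 << row)

def rows_between(start, end):
    """Rows whose timestamp falls in buckets [bucket_of(start), bucket_of(end))."""
    bitmap = 0
    for b in range(bucket_of(start), bucket_of(end)):
        bitmap |= index.get(b, 0)
    return [row for row in range(len(pickup_at)) if (bitmap >> row) & 1]

rows = rows_between(datetime(2009, 1, 1, 0), datetime(2009, 1, 1, 2))
print(rows)                                               # [0, 1, 2]
print(sum(passenger_count[r] for r in rows) / len(rows))  # 2.0
```

A time-range filter becomes an OR over the bitmaps of the buckets it covers, and a finer `BUCKET` only changes how many buckets exist, not how they are queried.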

**Note**: If you find a better implementation for Pandas or Polars, feel free to raise an issue in this repo or e-mail us at hello@shapelets.io.

## One Month Scenario
### Pandas


In [1]:
import pandas as pd 

In [2]:
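# Load every parquet file in the January 2009 directory into memory
df = pd.read_parquet('../Benchmarks/nyc-taxi/2009/01')

# Index by pickup timestamp so date and hour can be derived from the index
df['pickup_at'] = pd.to_datetime(df['pickup_at'])
df = df.set_index('pickup_at')

# Average passenger count per calendar date and hour of day
df.groupby([df.index.date, df.index.hour])['passenger_count'].mean()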

            pickup_at
2009-01-01  0            1.714221
            1            1.723661
            2            1.679692
            3            1.644623
            4            1.566899
                           ...   
2009-01-31  19           1.834419
            20           1.841117
            21           1.874803
            22           1.901640
            23           1.937456
Name: passenger_count, Length: 744, dtype: float64

### Polars

In [3]:
import polars as pl

In [4]:
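# Lazily scan the January 2009 files; nothing is read until collect()
data = pl.scan_parquet('../Benchmarks/nyc-taxi/2009/01/*.parquet')

# Average passenger count per calendar date and hour of day
data.group_by(
    [
        pl.col("pickup_at").cast(pl.Date).alias("pickup_at_date"),
        pl.col("pickup_at").dt.hour().alias("pickup_at_hour"),
    ]
).agg(pl.mean("passenger_count")).collect()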

| pickup_at_date | pickup_at_hour | passenger_count |
| --- | --- | --- |
| date | i8 | f64 |
| 2009-01-12 | 21 | 1.677515 |
| 2009-01-23 | 13 | 1.615631 |
| 2009-01-30 | 23 | 1.878042 |
| 2009-01-18 | 19 | 1.800311 |
| 2009-01-20 | 15 | 1.647465 |
| … | … | … |
| 2009-01-17 | 20 | 1.862211 |
| 2009-01-05 | 0 | 1.757289 |
| 2009-01-13 | 7 | 1.538159 |
| 2009-01-25 | 18 | 1.762946 |


### Shapelets

In [6]:
from shapelets.data import sandbox

In [32]:
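playground = sandbox()

# Expose the January 2009 parquet files to the sandbox under the name "taxis"
taxis = playground.from_parquet("taxis", ["../Benchmarks/nyc-taxi/2009/01/*.parquet"])

# Average passenger count per day of month and hour of day.
# Note: this query keys on dropoff_at, whereas the Pandas and Polars cells key on pickup_at.
result = playground.from_sql("""
    SELECT
        AVG(passenger_count)
    FROM taxis
        GROUP BY extract('day' from dropoff_at), extract('hour' from dropoff_at)
""").execute()

result.to_pandas()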

|  | avg(passenger_count) |
| --- | --- |
| 0 | 1.781688 |
| 1 | 1.731868 |
| 2 | 1.690869 |
| 3 | 1.657708 |
| 4 | 1.586778 |
| ... | ... |
| 739 | 1.825804 |
| 740 | 1.837693 |
| 741 | 1.867162 |
| 742 | 1.903295 |


## One Year Scenario

### Pandas (Large memory consumption - Do not run)

In [None]:
import pandas as pd 

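# Load all of 2009 into memory at once (this is what makes the cell so memory-hungry)
df = pd.read_parquet('../Benchmarks/nyc-taxi/2009')

# Index by pickup timestamp so date and hour can be derived from the index
df['pickup_at'] = pd.to_datetime(df['pickup_at'])
df = df.set_index('pickup_at')

# Average passenger count per calendar date and hour of day
df.groupby([df.index.date, df.index.hour])['passenger_count'].mean()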
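If memory is the limiting factor, one possible workaround (not benchmarked here) is to aggregate one monthly file at a time and merge the partial results. The merge has to combine sums and counts rather than average the per-file means, as the sketch below shows. The helper `combine_partials` and the sample numbers are hypothetical.

```python
from collections import defaultdict

def combine_partials(partials):
    """Merge per-file {key: (sum, count)} dicts into {key: mean}."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, running count]
    for partial in partials:
        for key, (total, count) in partial.items():
            totals[key][0] += total
            totals[key][1] += count
    return {key: total / count for key, (total, count) in totals.items()}

# Hypothetical partial results from two monthly files that share one key.
january = {("2009-01-31", 23): (30.0, 20)}
february = {("2009-01-31", 23): (5.0, 2), ("2009-02-01", 0): (9.0, 6)}

means = combine_partials([january, february])
print(means[("2009-01-31", 23)])  # 35 / 22, not the mean of 1.5 and 2.5
```

With Pandas, each partial could come from calling `.agg(['sum', 'count'])` on the grouped `passenger_count` of a single monthly file, so that only one month is held in memory at a time.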

### Polars (Large memory consumption)

In [None]:
import polars as pl

In [None]:
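# Lazily scan every monthly file of 2009; nothing is read until collect()
data = pl.scan_parquet('../Benchmarks/nyc-taxi/2009/**/*.parquet')

# Average passenger count per calendar date and hour of day
data.group_by(
    [
        pl.col("pickup_at").cast(pl.Date).alias("pickup_at_date"),
        pl.col("pickup_at").dt.hour().alias("pickup_at_hour"),
    ]
).agg(pl.mean("passenger_count")).collect()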

### Shapelets

In [33]:
from shapelets.data import sandbox

In [34]:
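playground = sandbox()

# Expose every monthly parquet file of 2009 to the sandbox under the name "taxis"
taxis = playground.from_parquet("taxis", ["../Benchmarks/nyc-taxi/2009/**/*.parquet"])

# Average passenger count per day of month and hour of day.
# Note: extract('day') yields the day of the month, so over a full year this produces
# at most 31 x 24 groups, whereas the Pandas and Polars cells group by calendar date.
# The query also keys on dropoff_at rather than pickup_at.
result = playground.from_sql("""
    SELECT
        AVG(passenger_count)
    FROM taxis
        GROUP BY extract('day' from dropoff_at), extract('hour' from dropoff_at)
""").execute()

result.to_pandas()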

|  | avg(passenger_count) |
| --- | --- |
| 0 | 1.791567 |
| 1 | 1.804884 |
| 2 | 1.790727 |
| 3 | 1.781031 |
| 4 | 1.732054 |
| ... | ... |
| 739 | 1.769948 |
| 740 | 1.796585 |
| 741 | 1.841991 |
| 742 | 1.860486 |
