# Grouped Simple Moving Average

This notebook compares how Temporian and pandas compute a grouped simple moving average, evaluating both how fast each library runs and how easy the code is to write for this specific use case.

## Data

The data we'll be using comes from the `train.csv` file of [this Kaggle competition](https://www.kaggle.com/competitions/store-sales-time-series-forecasting/data?select=train.csv). It contains daily sales records for grocery stores in Ecuador. Each of the roughly 3 million records holds the daily sales of one product `family` in one store (`store_nbr`).

To run the notebook, download the CSV and place it at `data/train.csv` under the project's root. The cells below read it as `../data/train.csv`, relative to the notebook's own directory.

Our goal is to calculate the **weekly moving average** of the sales for **each family** in each **store** (i.e., we want the moving average grouped by those two features).
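To make the target quantity concrete, here is a minimal pure-Python sketch of a grouped trailing 7-day mean on a few made-up rows. The helper name `grouped_weekly_mean` and the sample values are illustrative assumptions, and the window is taken as (t - 7 days, t], which is the convention of pandas' time-based `rolling`. It is meant as a reference definition, not as how either library implements the operation.

```python
from collections import defaultdict
from datetime import date, timedelta


def grouped_weekly_mean(rows, window=timedelta(days=7)):
    """For each (day, store_nbr, family, sales) row, average the sales of the
    same (store_nbr, family) group whose day falls in (day - window, day]."""
    # Bucket rows by group key so that groups never mix.
    groups = defaultdict(list)
    for day, store_nbr, family, sales in rows:
        groups[(store_nbr, family)].append((day, sales))

    result = {}
    for key, events in groups.items():
        for day, _ in events:
            # Quadratic scan per group: fine for a definition, slow for real data.
            in_window = [s for d, s in events if day - window < d <= day]
            result[(*key, day)] = sum(in_window) / len(in_window)
    return result


# Hypothetical toy rows, not taken from the Kaggle data.
rows = [
    (date(2013, 1, 1), 1, "BOOKS", 2.0),
    (date(2013, 1, 2), 1, "BOOKS", 4.0),
    (date(2013, 1, 8), 1, "BOOKS", 6.0),
    (date(2013, 1, 1), 2, "BOOKS", 10.0),
    (date(2013, 1, 2), 2, "BOOKS", 20.0),
]
sma = grouped_weekly_mean(rows)

# Jan 1 is exactly 7 days before Jan 8, so it falls outside the window.
print(sma[(1, "BOOKS", date(2013, 1, 8))])  # 5.0
```

Each group is averaged independently: store 2's sales never enter store 1's windows, which is the behavior the grouped operations in the rest of the notebook are expected to reproduce.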

In [4]:
import numpy as np
import pandas as pd
# import temporian as tp

Load data to a pandas DataFrame.

In [5]:
sales_df = pd.read_csv('../data/train.csv', parse_dates=['date'])[['date', 'store_nbr', 'sales', 'family']]
sales_df

| | date | store_nbr | sales | family |
|---|---|---|---|---|
| 0 | 2013-01-01 | 1 | 0.000 | AUTOMOTIVE |
| 1 | 2013-01-01 | 1 | 0.000 | BABY CARE |
| 2 | 2013-01-01 | 1 | 0.000 | BEAUTY |
| 3 | 2013-01-01 | 1 | 0.000 | BEVERAGES |
| 4 | 2013-01-01 | 1 | 0.000 | BOOKS |
| ... | ... | ... | ... | ... |
| 3000883 | 2017-08-15 | 9 | 438.133 | POULTRY |
| 3000884 | 2017-08-15 | 9 | 154.553 | PREPARED FOODS |
| 3000885 | 2017-08-15 | 9 | 2419.729 | PRODUCE |
| 3000886 | 2017-08-15 | 9 | 121.000 | SCHOOL AND OFFICE SUPPLIES |


Load data to a Temporian EventSet.

In [6]:
# sales_evset = tp.from_csv('../data/train.csv', timestamps='date')[['sales', 'store_nbr', 'family']]
# sales_evset

## Grouped moving average in pandas

We use pandas' `rolling` method to calculate the moving average. We first group the DataFrame by `store_nbr` and `family`, then call `rolling` on the grouped object with a 7-day window over the `date` column and take the mean of `sales`.

We'll only measure the time it takes to compute the actual moving average, not the time it takes to group the data, since the grouping step is not directly comparable between the two libraries.

In [7]:
grouped_df = sales_df.groupby(['store_nbr', 'family'])

In [8]:
%%time

pd_result = grouped_df.rolling('7d', on='date').mean()

CPU times: user 1.04 s, sys: 316 ms, total: 1.35 s
Wall time: 1.4 s
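As a small illustration of what the call above computes, the sketch below applies the same `groupby(...).rolling('7d', on='date').mean()` pattern to a handful of made-up rows. The `toy` frame and its values are assumptions for illustration only. It shows that the time-based window covers (t - 7 days, t], so a row exactly 7 days old no longer contributes.

```python
import pandas as pd

# Hypothetical toy rows, sorted by date like the real dataset.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2013-01-01", "2013-01-01", "2013-01-02", "2013-01-02", "2013-01-08"]
    ),
    "store_nbr": [1, 2, 1, 2, 1],
    "family": ["BOOKS"] * 5,
    "sales": [2.0, 10.0, 4.0, 20.0, 6.0],
})

# Same pattern as the notebook cell, followed by selecting the `sales` column.
toy_sma = (
    toy.groupby(["store_nbr", "family"])
    .rolling("7d", on="date")
    .mean()["sales"]
)

# Values come back ordered by group, then by row order within each group:
# store 1 -> 2.0, 3.0, 5.0 (Jan 1 has dropped out by Jan 8); store 2 -> 10.0, 15.0.
print(toy_sma.tolist())
```

The result is indexed by the group keys rather than by the original row order, which is why the sanity checks later in the notebook sort both sides before comparing them.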


## Grouped moving average in Temporian

Temporian can handle grouped (or hierarchically structured) data natively, using [indexes](https://temporian.readthedocs.io/en/stable/user_guide/#indexes-horizontal-and-vertical-operators). Once our data has the correct index, applying a `simple_moving_average` to it is straightforward.

In [9]:
# grouped_evset = sales_evset.add_index(['store_nbr', 'family'])

In [10]:
# %%time

# tp_result = grouped_evset.simple_moving_average(tp.duration.weeks(1))

## Results

Computing the same grouped moving average in Temporian ran **28x faster** than in pandas on this dataset!

#### Sanity check

As a sanity check, let's make sure the results from both libraries are the same.

In [11]:
# tp_sma = tp.to_pandas(tp_result).sort_values(['store_nbr', 'family', 'timestamp'])['sales']
# pd_sma = pd_result.sort_values(['store_nbr', 'family', 'date'])['sales']
# np.allclose(pd_sma, tp_sma)
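`np.allclose` compares values position by position, which is why both results are sorted by the same keys before the check. The following sketch, on two made-up frames (the names `a`, `b`, `keys` and all values are assumptions for illustration), shows the comparison failing on shuffled rows and passing once both sides are sorted identically.

```python
import numpy as np
import pandas as pd

# Two hypothetical result frames with identical content but different row order.
a = pd.DataFrame({
    "store_nbr": [1, 1, 2],
    "family": ["BOOKS", "BEAUTY", "BOOKS"],
    "date": pd.to_datetime(["2013-01-01"] * 3),
    "sales": [2.0, 7.0, 10.0],
})
b = a.iloc[[2, 0, 1]].reset_index(drop=True)

keys = ["store_nbr", "family", "date"]

# Positional comparison on unsorted rows reports a mismatch.
unsorted_equal = np.allclose(a["sales"].to_numpy(), b["sales"].to_numpy())

# Sorting both sides by the same keys lines the rows up before comparing.
sorted_equal = np.allclose(
    a.sort_values(keys)["sales"].to_numpy(),
    b.sort_values(keys)["sales"].to_numpy(),
)

print(unsorted_equal, sorted_equal)  # False True
```

This approach assumes each (`store_nbr`, `family`, date) combination appears only once, so that sorting yields a unique row order on both sides.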

## But... what about Polars?

[Polars](https://www.pola.rs/) is a DataFrame library written in Rust, born as a performance-oriented alternative to pandas. Let's see how it fares on this same task!

In [22]:
import polars as pl

polars_df = pl.read_csv("../data/train.csv", try_parse_dates=True, columns=['date', 'store_nbr', 'sales', 'family'])
polars_df.head()

| date | store_nbr | family | sales |
|---|---|---|---|
| date | i64 | str | f64 |
| 2013-01-01 | 1 | "AUTOMOTIVE" | 0.0 |
| 2013-01-01 | 1 | "BABY CARE" | 0.0 |
| 2013-01-01 | 1 | "BEAUTY" | 0.0 |
| 2013-01-01 | 1 | "BEVERAGES" | 0.0 |
| 2013-01-01 | 1 | "BOOKS" | 0.0 |


In [18]:
%%time

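# Trailing 7-day window per (store_nbr, family). With period='7d' and offset='-7d',
# each window covers (t - 7d, t], matching pandas' rolling('7d').
# Note: newer Polars releases may expect `group_by=` instead of `by=` here.
pl_result = (
    polars_df
    .rolling(index_column='date', by=['store_nbr', 'family'], period='7d', offset='-7d')
    .agg(pl.col('sales').mean())
)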

CPU times: user 460 ms, sys: 553 ms, total: 1.01 s
Wall time: 665 ms


In [17]:
pl_sma = pl_result.to_pandas().sort_values(['store_nbr', 'family', 'date'])['sales']
pd_sma = pd_result.sort_values(['store_nbr', 'family', 'date'])['sales']
np.allclose(pd_sma, pl_sma)

True