## Rolling time series analysis
By the end of this lecture you will be able to:
- calculate rolling time series expressions
- do rolling groupbys
- identify the use cases of `rolling` and `group_by_dynamic`

In [None]:
from datetime import datetime,timedelta

import polars as pl

pl.Config.set_tbl_rows(8)

We first create a simple time series `DataFrame` with hourly data over one day. We add a `values` column that tracks the row number.

In [None]:
start_datetime = datetime(2020,1,1)
end_datetime = datetime(2020,1,2)
df = (
    pl.DataFrame(
        {
            "date":pl.datetime_range(
                start_datetime,
                end_datetime,
                interval="1h",
                eager=True
            )
        }
    )
    .with_row_index("values")
    # Re-order the columns to have date first
    .select("date","values")
)
df

### Built-in rolling aggregations
Polars has built-in expressions that do a rolling groupby and aggregation in one step for common aggregations such as mean, min, max and sum.

The syntax is as follows:
```python
pl.col("values").rolling_mean(
    by="date",
    window_size="3h",
    closed="right"
)
```
where:
- we first create the expression that will be aggregated in each rolling window with `pl.col("values")`
- we call `rolling_mean` to tell Polars we want to create rolling windows and take the mean of `pl.col("values")` in each window
- we specify how the windows are defined with the arguments to `rolling_mean`

The rolling expression groups rows into *windows*. We can specify the length of these windows as a time period, using either a Python `timedelta` or a Polars interval string.

In this example we calculate the rolling mean and sum, passing the window size as an interval string for the mean and as a `timedelta` for the sum.

In [None]:
(
    df
    .with_columns(
        roll_mean = pl.col("values").rolling_mean(
            by="date",
            window_size="3h",
            closed="right"
        ),
        roll_sum = pl.col("values").rolling_sum(
            by="date",
            window_size=timedelta(hours=3),
            closed="right"
        ),

    )
)

It is important to understand how rolling windows are defined. For the `rolling_*` expressions the windows are defined as:

(t_0 - window_size, t_0]

(t_1 - window_size, t_1]

…

(t_n - window_size, t_n]

where `t_0` is the first data point, `t_1` is the second data point, and so on. With `closed="right"` and a 3-hour window this means:
- the first window covers `21:00` on the previous day (not included) to `00:00` (included), so it contains just the `00:00` row
- the second window covers `22:00` on the previous day (not included) to `01:00` (included), so it contains the `00:00` and `01:00` rows
- the last window covers `21:00` (not included) to `00:00` on the following day (included), so it contains the `22:00`, `23:00` and `00:00` rows
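
To make the window definition concrete, here is a minimal pure-Python sketch of right-closed windows on the same hourly data. It does not use Polars. The helper name `right_closed_window` is hypothetical and only mirrors the `(t - window_size, t]` rule described above.

```python
from datetime import datetime, timedelta

# Hourly timestamps from 2020-01-01 00:00 to 2020-01-02 00:00 (25 rows)
start = datetime(2020, 1, 1)
dates = [start + timedelta(hours=i) for i in range(25)]
values = list(range(25))  # row numbers, as in the `values` column
window_size = timedelta(hours=3)


def right_closed_window(t):
    """Return the values whose timestamp lies in (t - window_size, t]."""
    return [v for d, v in zip(dates, values) if t - window_size < d <= t]


# One window per row, then the mean of each window
windows = [right_closed_window(t) for t in dates]
roll_mean = [sum(w) / len(w) for w in windows]

print(windows[0])   # first window: only the 00:00 row
print(windows[1])   # second window: the 00:00 and 01:00 rows
print(windows[-1])  # last window: the final three rows
```

The first two windows are shorter than three rows because there is no data before `00:00` on the first day.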

## Rolling expression
The `rolling_*` expressions above are only available for the most common aggregations. We can specify any expression to be evaluated in rolling windows using the `rolling` expression.

The syntax of a sample `rolling` expression is slightly different from the `rolling_*` expressions above:
```python
(
    pl.col("values")
    .mean()
    .rolling(
        index_column="date",
        period="3h"
    )
)    
```
This means:
- create rolling windows based on the `date` column with a window size of 3 hours and
- in each window take the mean of the `values` column

Here we use `index_column` instead of `by` and `period` instead of `window_size`.

In this example we use the `first` and `last` expressions to help us understand how the rolling windows are defined.

In [None]:
(
    df
    .with_columns(
        # Get the first row index in each window
        window_row_first = pl.col("values").first().rolling(
            index_column="date",
            period="3h"
        ),
        # Get the last row index in each window
        window_row_last = pl.col("values").last().rolling(
            index_column="date",
            period="3h"
        ),
    )
)

The windows are closed on the right by default and are defined as (t_0 - period, t_0] so:
- the first window is from `21:00:00` to `00:00:00` and so only includes `00:00:00`
- the second window is from `22:00:00` to `01:00:00` and so includes `00:00:00` and `01:00:00`
- the third window is from `23:00:00` to `02:00:00` so includes `00:00:00`, `01:00:00` and `02:00:00`
- the fourth window is from `00:00:00` to `03:00:00` but excludes `00:00:00` as it is open on the left

We can offset the start of the windows with `offset`. The default offset is `-period`, which is `-3h` here. In this example we set `offset="-1h"`, which shifts all windows two hours later than this default.

In [None]:
(
    df
    .with_columns(
        window_row_first = pl.col("values").first().rolling(
            index_column="date",
            period="3h",
            offset="-1h",
        ),
        window_row_last = pl.col("values").last().rolling(
            index_column="date",
            period="3h",
            offset="-1h",
        ),
        window_row_indexes = pl.col("values").agg_groups().rolling(
            index_column="date",
            period="3h",
            offset="-1h",
        ),

    )
    .head(4)
)

The first window now starts 2 hours later and runs from `23:00:00` to `02:00:00`.
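
The effect of `offset` can be sketched in plain Python. This assumes each right-closed window is `(t + offset, t + offset + period]`, which matches the default `(t - period, t]` when `offset = -period`. The helper `window_rows` is hypothetical and not part of Polars.

```python
from datetime import datetime, timedelta

start = datetime(2020, 1, 1)
dates = [start + timedelta(hours=i) for i in range(25)]
period = timedelta(hours=3)


def window_rows(t, offset):
    """Row numbers whose timestamp lies in (t + offset, t + offset + period]."""
    lower = t + offset
    upper = lower + period
    return [i for i, d in enumerate(dates) if lower < d <= upper]


# Default offset of -period: the window ends at the row's own timestamp
default_rows = window_rows(dates[0], offset=-period)
# Offset of -1h: the window starts at 23:00 and ends at 02:00
shifted_rows = window_rows(dates[0], offset=timedelta(hours=-1))

print(default_rows)
print(shifted_rows)
```

With the shifted offset the window for the `00:00:00` row looks forward in time and picks up the `01:00:00` and `02:00:00` rows as well.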

We can do aggregations on each rolling window by using an expression. In this example we calculate the rolling `mean` and `sum`.

In [None]:
(
    df
    .with_columns(
        roll_mean = pl.col("values").mean().rolling(
            index_column="date",
            period="3h",
        ),
        roll_sum = pl.col("values").sum().rolling(
            index_column="date",
            period="3h",
        ),
    )
    .head(4)
)

## Rolling on a `DataFrame`
The `rolling_*` or `rolling` expressions are useful if we want to add a new column or columns to a `DataFrame`.

If instead we want to create a new `DataFrame` with all rolling columns we can also call `rolling` on a `DataFrame`.

This approach is preferable if:
- we want to create a whole `DataFrame` of rolling values or
- we have numerous rolling calculations and we want to ensure we only calculate the rolling windows once

The syntax is the same as for the `rolling` expression

In [None]:
(
    df
    .rolling(
            index_column="date",
            period="3h",
        )
    .agg(
        window_row_first = pl.col("values").first(),
        window_row_last = pl.col("values").last()
        
    )
    .head(4)
)

### Rolling windows by group
We may want to do the rolling operation by group, such as getting the rolling returns on different stocks or products. To illustrate this we create a new `DataFrame` made of two copies of the original `DataFrame`, each with a different value in an `id` column.

In [None]:
df2 = (
    pl.concat(
        [
            # Copy with value A in the id column
            df.with_columns(id = pl.lit("A")),
            # Copy with value B in the id column
            df.with_columns(id = pl.lit("B")),
        ]
    )
)
df2

If we want to do the rolling operations by groups we pass the `by` argument to specify the groups. 

In this example Polars: 
- first does a `group_by` on the `id` column to make a `DataFrame` for each group
- does the `rolling-agg` on each group's `DataFrame`
- concatenates the results for each group into a single `DataFrame`

In [None]:
(
    df2
    .rolling(
            index_column="date",
            period="3h",
            by="id"
        )
    .agg(
        window_row_first = pl.col("values").first(),
        window_row_last = pl.col("values").last()
        
    )
)

Note that for the rolling calculations to be correct the data must be sorted by the `index_column` within each group.
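
The group-then-roll steps can be sketched in plain Python as follows. This is an illustration of the idea only, not how Polars implements it. The toy `rows` data is hypothetical (four hourly rows per `id`).

```python
from collections import defaultdict
from datetime import datetime, timedelta

start = datetime(2020, 1, 1)
period = timedelta(hours=3)

# Hypothetical toy data: (id, date, value) with four hourly rows per id
rows = [
    (id_, start + timedelta(hours=i), i)
    for id_ in ("A", "B")
    for i in range(4)
]

# Step 1: split the rows into one list per id
groups = defaultdict(list)
for id_, date, value in rows:
    groups[id_].append((date, value))

# Steps 2 and 3: roll within each group and collect the results together
result = []
for id_, group in groups.items():
    group.sort()  # the data must be sorted by date within each group
    for t, _ in group:
        # Right-closed window (t - period, t] restricted to this group
        window = [v for d, v in group if t - period < d <= t]
        result.append((id_, t, window[0], window[-1]))

for row in result:
    print(row)
```

Each window only ever sees rows from its own group, so the first window for `B` contains a single row even though `A` has earlier rows in the combined data.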

## Lazy mode and streaming?
We can use the `rolling` expressions and groupby on a `LazyFrame`.

Unfortunately rolling operations are not currently supported by the streaming engine. 

We can see that rolling operations are not streaming in the query plan below because the `WITH_COLUMNS` node sits above, and so outside, the `--- STREAMING` block of the query plan.

In [None]:
print(
    df
    .lazy()
    .with_columns(
        window_row_first = pl.col("values").first().rolling(
            index_column="date",
            period="3h"
        ),
        window_row_last = pl.col("values").last().rolling(
            index_column="date",
            period="3h"
        ),
    )
    .explain(streaming=True)
)

So why is rolling not supported by the streaming engine?

Recall that the streaming engine works in batches of rows and is best suited to operations that can be applied to each batch independently. However, some rolling windows start in one batch and end in another, so they cannot be computed from a single batch alone. Therefore rolling does not currently work in streaming mode.
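
As a rough illustration of the problem, this pure-Python sketch splits eight hourly rows into batches and finds the rows whose 3-hour window needs data from more than one batch. The `batch_size` of 4 is a hypothetical value chosen for the example and says nothing about the batch sizes the streaming engine actually uses.

```python
from datetime import datetime, timedelta

start = datetime(2020, 1, 1)
dates = [start + timedelta(hours=i) for i in range(8)]
period = timedelta(hours=3)
batch_size = 4  # hypothetical: rows 0-3 in batch 0, rows 4-7 in batch 1

straddling = []
for i, t in enumerate(dates):
    # Row numbers in the right-closed window (t - period, t]
    members = [j for j, d in enumerate(dates) if t - period < d <= t]
    # Batches that the window members come from
    batches = {j // batch_size for j in members}
    if len(batches) > 1:
        straddling.append(i)

print(straddling)  # rows whose window spans two batches
```

The windows for rows 4 and 5 reach back into the previous batch, so they could not be computed by looking at the second batch alone.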

## Rolling or `group_by_dynamic`?
The `rolling` and `group_by_dynamic` methods both do windowed time series aggregations but there is a difference between them:
- `group_by_dynamic` works with constant window intervals
- `rolling` works with windows that depend on the data

Essentially `group_by_dynamic` looks at your first and last time points and divides the time interval into fixed boxes, whereas `rolling` looks at each time point and creates a window ending at that point (by default).
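
The difference can be sketched in plain Python with six hourly rows and a 3-hour width. This is a conceptual sketch only. It assumes the fixed boxes are left-closed and aligned to the first timestamp, and that the rolling windows are right-closed.

```python
from datetime import datetime, timedelta

start = datetime(2020, 1, 1)
dates = [start + timedelta(hours=i) for i in range(6)]
width = timedelta(hours=3)

# Fixed boxes (the group_by_dynamic idea): every row lands in exactly one box
fixed = {}
for i, d in enumerate(dates):
    box_start = start + ((d - start) // width) * width
    fixed.setdefault(box_start, []).append(i)

# Rolling windows: one window per row, covering (t - width, t]
rolling = {
    t: [i for i, d in enumerate(dates) if t - width < d <= t]
    for t in dates
}

print(len(fixed), "fixed boxes:", list(fixed.values()))
print(len(rolling), "rolling windows:", list(rolling.values()))
```

Six rows give two non-overlapping boxes but six overlapping rolling windows, so the fixed-box output has one row per box while the rolling output has one row per input row.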


## Exercises
In the exercises you will learn how to:
- do rolling statistics
- manage rolling windows

### Exercise 1
First we set some helpful config settings 

In [None]:
pl.Config.set_fmt_str_lengths(100)
pl.Config.set_fmt_float("full")

We create a `DataFrame` from the Spotify data

In [None]:
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

For the track `Starboy` add a one-week `rolling_mean` and `rolling_max` of the streams column

In [None]:
(
    spotify_df
    .filter(pl.col("title")=="Starboy")
    .select("title","artist","date","trend","streams")
    <blank>
)

We are going to visualise the rolling mean number of streams for the most popular tracks

- Add a column called `title_artist` that is a string concatenation of the `title` and `artist` columns separated by `:` (see here if you are not familiar with the function to do this: https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.concat_str.html#polars-concat-str)
- Sort the `DataFrame` by the `title_artist` and `date` columns

In [None]:
roll_spotify_df = (
    spotify_df
    <blank>
)
roll_spotify_df.head()

Continue by:
- doing a weekly `rolling` groupby on the `date` column by `title_artist`
- creating an aggregated `roll_streams` column with the weekly mean of the streams for each track
- sorting the output by `roll_streams` with the largest values at the top

The next step is more advanced, check out the hints below or the solution if you are not sure where to start!

We want to continue by visualising the results only for the most popular tracks. However, we want to keep all dates for these tracks, not just the most streamed dates so:

- filter the `DataFrame` to keep **all** rows for any track that appears in the top 300 rows of `roll_streams`

The output should have 14,358 rows

In [None]:
#Hint1:
# Continue by using a pipe function on the output from the query above

In [None]:
#Hint2
# In the pipe function do a semi-join with the rows that have the most popular tracks

Visualise the results as a time series line chart with Plotly (or your preferred plotting library) with
- time on the x-axis
- `roll_streams` on the y-axis
- `title_artist` in color

In [None]:
import plotly.express as px
px.line(
    <blank>
)

## Solutions

### Solution to exercise 1

In [None]:
pl.Config.set_fmt_str_lengths(100)

We create a `DataFrame` from the Spotify data

In [None]:
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

For the track `Starboy` add a one-week `rolling_mean` and `rolling_max` of the streams column

In [None]:
(
    spotify_df
    .filter(pl.col("title")=="Starboy")
    .select("title","artist","date","trend","streams")
    .sort("date")
    .with_columns(
        roll_mean = pl.col("streams").rolling_mean(
            by="date",
            window_size="1w",
            closed="right"
        ),
        roll_max = pl.col("streams").rolling_max(
            by="date",
            window_size="1w",
            closed="right"
        )

    )
)

We are going to visualise the rolling mean number of streams for the most popular tracks

- Add a column called `title_artist` that is a string concatenation of the `title` and `artist` columns separated by `:` (see here if you are not familiar with the function to do this: https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.concat_str.html#polars-concat-str)
- Sort the `DataFrame` by the `title_artist` and `date` columns

In [None]:
roll_spotify_df = (
    spotify_df
    .with_columns(pl.concat_str(["title","artist"],separator=":").alias("title_artist"))
    .sort("title_artist","date")
)
roll_spotify_df.head()

Continue by:
- doing a weekly `rolling` groupby on the `date` column by `title_artist`
- creating an aggregated `roll_streams` column with the weekly mean of the streams for each track
- sorting the output by `roll_streams` with the largest values at the top

In [None]:
roll_spotify_df = (
    spotify_df
    .with_columns(pl.concat_str(["title","artist"],separator=":").alias("title_artist"))
    .sort("title_artist","date")
    .rolling(
        index_column="date",
        period="1w",
        by="title_artist"
        )
    .agg(
        roll_streams = pl.col("streams").mean()
    )
    .sort("roll_streams",descending=True)
)

The next step is more advanced, check out the solution if you are not sure where to start!

We want to visualise the results only for the most popular tracks. However, we want to keep all dates for these tracks, not just the most streamed dates so:

- filter the `DataFrame` to keep **all** rows for any track that appears in the top 300 rows of `roll_streams`

The output should have 14,358 rows

In [None]:
#Hint1:
# Use a pipe function on the output from the query

In [None]:
#Hint2
# In the pipe function do a semi-join with the output from the query

In [None]:
roll_spotify_df = (
    spotify_df
    .with_columns(pl.concat_str(["title","artist"],separator=":").alias("title_artist"))
    .sort("title_artist","date")
    .rolling(
        index_column="date",
        period="1w",
        by="title_artist"
        )
    .agg(
        roll_streams = pl.col("streams").mean()
    )
    .sort("roll_streams",descending=True)
    .pipe(
        lambda df:df.join(
            # Get the first 300 rows and filter the dataframe for these title-artist combinations
            df.head(300),on="title_artist",how="semi")
    )
    .sort("date")
)
roll_spotify_df

Visualise the results as a time series line chart with Plotly (or your preferred plotting library) with
- time on the x-axis
- `roll_streams` on the y-axis
- `title_artist` in color

In [None]:
import plotly.express as px
px.line(
    roll_spotify_df,
    x="date",
    y="roll_streams",
    color="title_artist",
    width=1000
)

See how the streams for some tracks grew slowly while others started at high values!

Vary the rolling period to 1 day (`"1d"`) and 1 month (`"1mo"`) to see the smoothing effect of the rolling analysis.