# The `group_by_dynamic` window
By the end of this lecture you will be able to:

- set the frequency, length and offset of windows
- control the closure of windows
- set the displayed datetime value for each window

In [None]:
from datetime import datetime

import polars as pl

We create a `DataFrame` with hourly data over one day. We add a row index column that we can aggregate.

In [None]:
start = datetime(2022,1,1)
stop = datetime(2022,1,2)

df = (
    pl.DataFrame(
        {
            'date':pl.datetime_range(start,stop,interval='1h',eager=True),
        }
    )
    .with_row_index("index")
)
df.head()

Note that when we create a column with `pl.datetime_range`, Polars automatically sets the sorted flag to `True`.

In [None]:
df["date"].flags

## Specifying the window with `group_by_dynamic`
A dynamic window in `group_by_dynamic` is defined by:

- `every`: how often a window starts
- `period`: how long a window lasts
- `offset`: when the first window starts

The role of the `offset` is easier to understand once we see how the default behaviour works later in this lecture. 

In this example
```python
(
    df
    .group_by_dynamic(
        "date",
        every="2h"
    )
)
```

we set `every = "2h"` and:
- a window starts every 2 hours 
- each window lasts 2 hours and 
- the first window starts at 0000 (midnight)

If `period` is not set then it defaults to the same value as `every`

In [None]:
(
    df
    .group_by_dynamic('date',every='2h')
    .agg(
        pl.col('index').count()
    )
    .head(3)
)

To help understand the window bounds we add the `include_boundaries=True` argument to see the upper and lower boundary for each window

In [None]:
(
    df
    .group_by_dynamic('date',every='2h',include_boundaries=True)
    .agg(
        pl.col('index').count()
    )
    .head(3)
)

- The `include_boundaries` argument does not affect whether boundary values are included in a window - see the section on closure for that below
- The `include_boundaries = True` argument affects performance because it makes parallelism more difficult. Only use it if you need to understand the windows

In this example
```python
(
    df
    .group_by_dynamic(
        "date",
        every = "2h",
        period = "4h"
    )
)
```


we set `every = "2h", period = "4h"` and:
- a window starts every 2 hours 
- each window lasts 4 hours and 
- the first window starts at 2200 (two hours before midnight)

Because `period` is longer than `every` the windows overlap, and a window that is fully covered by the data contains 4 records

In [None]:
(
    df
    .group_by_dynamic('date',every='2h', period='4h',include_boundaries=True)
    .agg(
        pl.col('index').count()
    )
    .head(3)
)

In this example
```python
(
    df
    .group_by_dynamic(
        "date",
        every="2h",
        period = "4h",
        offset = "6h"
    )
)
```


we set `every = "2h", period = "4h", offset = "6h"` and:
- a window starts every 2 hours 
- each window lasts 4 hours and 
- the first window starts at 0600 (6 AM)


In [None]:
(
    df
    .group_by_dynamic('date',every='2h', period='4h',offset = "6h",include_boundaries=True)
    .agg(
        pl.col('index').count()
    )
    .head(3)
)

So the `offset` applies an offset to the start of the windows. It can be positive or negative.

Sometimes it can be confusing to understand which rows end up in which window.

One way to clarify this is to do an `agg` with `pl.col("index")`. With this you can inspect which rows are in which window

In [None]:
(
    df
    .group_by_dynamic('date',every='2h', period='4h',offset = "6h",include_boundaries=True)
    .agg(
            pl.col('index'),
    )
    .head(3)
)

In this example we can see that rows 0 to 5 are excluded (because of the `offset`) and rows 8 and 9 are in both the first and second window.

## Closure and boundaries of windows
By default the windows are closed on the `left` - datetimes on the left boundary are included while datetimes on the right boundary are not included

In [None]:
(
    df
    .group_by_dynamic('date',every='2h',include_boundaries=True)
    .agg(
            pl.col('index'),
    )
    .head(3)
)

We can vary closure with the `closed` argument.

If we set `closed="both"` we get:
- an additional first window that contains only the first row, which sits on its right boundary
- each even-numbered row in two successive windows, because it lies on the boundary between them

In [None]:
(
    df
    .group_by_dynamic('date',every='2h',closed="both",include_boundaries=True)
    .agg(
        pl.col('index'),
    )
    .head(3)
)

## Setting the first window

So far we have had neat hourly data and windows that last a whole number of hours. To understand how the window intervals are set it is better to use a less neat example.

In this example we use a window length of 55 minutes, which does not fit evenly into the hourly data interval, to see the consequences.

In [None]:
(
    df
    .group_by_dynamic('date',every='55m')
    .agg(
            pl.col('index'),
    )
    .head(3)
)

The first window starts at 2021-12-31 23:45:00 whereas the first time point in `df` is 2022-01-01 00:00:00.

Why does the first window start at 2021-12-31 23:45:00?

Polars aligns windows to the Unix epoch (1970-01-01 00:00:00). If we step forward from the epoch in 55 minute intervals then 2021-12-31 23:45:00 is the last step at or before 2022-01-01 00:00:00. By default Polars starts the first window there.
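The arithmetic behind this can be checked with the standard library alone. This is a sketch of the reasoning rather than of how Polars implements it, and it assumes naive datetimes counted from 1970-01-01 00:00:00 with no time zone handling.

```python
from datetime import datetime, timedelta

epoch = datetime(1970, 1, 1)
first_point = datetime(2022, 1, 1)
every = timedelta(minutes=55)

# Part of the elapsed time that does not fit into a whole number of 55 minute steps
remainder = (first_point - epoch) % every

# Stepping back by the remainder gives the last 55 minute mark before the first time point
window_start = first_point - remainder

print(remainder)     # 0:15:00
print(window_start)  # 2021-12-31 23:45:00
```

The 15 minute remainder is also the `offset` needed to move the first window to midnight.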

If we want to move the start of the first window we can use the `offset` argument to `group_by_dynamic`.

In this example we move the start of the first window forward by 15 minutes to 00:00:00

In [None]:
(
    df
    .group_by_dynamic('date',every='55m',offset="15m",include_boundaries=True)
    .agg(
            pl.col('index'),
    )
    .head(5)
)

## Controlling the displayed datetime
In the output of `group_by_dynamic` there is a datetime on each row for each window.

By default Polars uses the lower bound of each window as the date for that window.

In this example the lower bound is shown in the `date` column

In [None]:
(
    df
    .group_by_dynamic('date',every='55m',include_boundaries=True)
    .agg(
            pl.col('index'),
    )
    .head(3)
)

We use the `label` argument to control what datetime value is used to label the window.

- `label = "left"` uses the lower bound of the window
- `label = "right"` uses the upper bound of the window
- `label = "datapoint"` uses the first datapoint in the window

In this example the `date` column equals the `upper_boundary` column as we set `label = "right"` 

In [None]:
(
    df
    .group_by_dynamic('date',every='55m',include_boundaries=True,label="right")
    .agg(
            pl.col('index'),
    )
    .head(2)
)

## Exercises
In the exercises you will develop your understanding of:
- setting the interval of the window
- setting the length of the window
- setting the offset of the window
- controlling closure of the window
- setting the displayed datetime for the window

### Exercise 1
Create a `DataFrame` that runs over 2020 at 2 minute intervals. Add a column for the row count

In [None]:
start = <blank>
stop = <blank>
(
    <blank>
    .head()
)

Do a dynamic groupby with windows that start every hour and last one hour. Aggregate the `index` column into the list of row indices for each window

Do a dynamic groupby again with windows that start every hour and last two hours. 

Offset the start of the first window to 30 minutes *before* midnight

Adapt the earlier steps to:
- create the `DataFrame` over 2020 again but this time at **7 minute intervals** (chosen as a number that doesn't divide 60)
- add a row count column
- do a groupby with one-hour windows
- set the displayed date for each window to be the first datapoint in the window

Set the windows of this `DataFrame` to be closed on the right

### Exercise 2
Create the query to generate the following optimised plan with a groupby window that is one week long

Note that the `group_by_dynamic` arguments do not appear in the optimised plan

```python
SORT BY [col("mean")]
  AGGREGATE
  	[col("trip_distance").count().alias("count"), col("trip_distance").mean().alias("mean"), col("trip_distance").max().alias("max")] BY [] FROM
     WITH_COLUMNS:
     [col("pickup").set_sorted()]

        Csv SCAN ../data/nyc_trip_data_1k.csv
        PROJECT 2/7 COLUMNS
```

In [None]:
csv_file = "../data/nyc_trip_data_1k.csv"
print(
    <blank>
    .explain()
)    

Evaluate the full query and inspect the data. Modify the query so the first date is 2022-01-01 00:00:00.

You will need to `collect()` the query to view the data for the second point.


## Solutions

### Solution to exercise 1
Create a `DataFrame` that runs over 2020 at 2 minute intervals. Add a column for the row count

In [None]:
start = datetime(2020,1,1)
stop = datetime(2021,1,1)
(
    pl.DataFrame(
        {
            "date":pl.datetime_range(start,stop,interval="2m",eager=True)
        }
    )
    .with_row_index()
    .head()
)

Do a dynamic groupby with windows that start every hour and last one hour. Aggregate the `index` column into the list of row indices for each window

In [None]:
start = datetime(2020,1,1)
stop = datetime(2021,1,1)
(
    pl.DataFrame(
        {
            "date":pl.datetime_range(start,stop,interval="2m",eager=True)
        }
    )
    .with_row_index()
    .group_by_dynamic(
        "date",
        every = "1h",
    )
    .agg(
        pl.col("index"),
    )
    .head()
)

Do a dynamic groupby again with windows that start every hour and last two hours. 

In [None]:
start = datetime(2020,1,1)
stop = datetime(2021,1,1)
(
    pl.DataFrame(
        {
            "date":pl.datetime_range(start,stop,interval="2m",eager=True)
        }
    )
    .with_row_index()
    .group_by_dynamic(
        "date",
        every = "1h",
        period = "2h",
    )
    .agg(
        pl.col("index"),
    )
    .head()
)

Offset the start of the first window to 30 minutes *before* midnight

In [None]:
start = datetime(2020,1,1)
stop = datetime(2021,1,1)
(
    pl.DataFrame(
        {
            "date":pl.datetime_range(start,stop,interval="2m",eager=True)
        }
    )
    .with_row_index()
    .group_by_dynamic(
        "date",
        every = "1h",
        period = "2h",
        offset = "-30m"
    )
    .agg(
        pl.col("index"),
    )
    .head()
)

Adapt the earlier steps to:
- create the `DataFrame` over 2020 again but this time at **7 minute intervals** (chosen as a number that doesn't divide 60)
- add a row count column
- do a groupby with one-hour windows
- set the displayed date for each window to be the first datapoint in the window

In [None]:
(
    pl.DataFrame(
        {
            "date":pl.datetime_range(start,stop,interval="7m",eager=True)
        }
    )
    .with_row_index()
    .group_by_dynamic(
        "date",
        every = "1h",
        label = "datapoint",
    )
    .agg(
        pl.col("index"),
    )
    .head()
)

Set the windows of this `DataFrame` to be closed on the right

In [None]:
(
    pl.DataFrame(
        {
            "date":pl.datetime_range(start,stop,interval="7m",eager=True)
        }
    )
    .with_row_index()
    .group_by_dynamic(
        "date",
        every = "1h",
        label = "datapoint",
        closed = "right"
    )
    .agg(
        pl.col("index"),
    )
    .head()
)

### Solution to exercise 2
Create the query to generate the following optimised plan with a groupby window that is one week long

Note that the `group_by_dynamic` arguments do not appear in the optimised plan

```python
SORT BY [col("mean")]
  AGGREGATE
  	[col("trip_distance").count().alias("count"), col("trip_distance").mean().alias("mean"), col("trip_distance").max().alias("max")] BY [] FROM
     WITH_COLUMNS:
     [col("pickup").set_sorted()]

        Csv SCAN ../data/nyc_trip_data_1k.csv
        PROJECT 2/7 COLUMNS
```

In [None]:
csv_file = "../data/nyc_trip_data_1k.csv"
print(
    pl.scan_csv(csv_file,try_parse_dates=True)
    .with_columns(
        pl.col("pickup").set_sorted()
    )
    .group_by_dynamic("pickup",every="1w")
    .agg(
        [
            pl.col("trip_distance").count().alias("count"),
            pl.col("trip_distance").mean().alias("mean"),
            pl.col("trip_distance").max().alias("max"),
        ]
    )
    .sort("mean",descending=True)
    .explain()
)
    

Evaluate the full query and inspect the data. Modify the query so the first date is 2022-01-01 00:00:00.

You will need to `collect()` the query to view the data for the second point.


In [None]:
csv_file = "../data/nyc_trip_data_1k.csv"
(
    pl.scan_csv(csv_file,try_parse_dates=True)
    .with_columns(
            pl.col("pickup").set_sorted()
    )
    .filter(pl.col("pickup") < datetime(2022,1,15))
    .group_by_dynamic("pickup",every="1w",offset = "5d")
    .agg(
        [
            pl.col("trip_distance").count().alias("count"),
            pl.col("trip_distance").mean().alias("mean"),
            pl.col("trip_distance").max().alias("max"),
        ]
    )
    .sort("mean",descending=True)
    .collect()
)