# Introduction to `group_by_dynamic`
By the end of this session you will be able to:

- do groupby and aggregations using `group_by_dynamic`
- use `group_by_dynamic` on multiple columns
- use `group_by_dynamic` in lazy mode


In [None]:
from datetime import datetime

import polars as pl

We use a time series of NYC taxi pickups.

In [None]:
csv_file = "../data/nyc_trip_data_1k.csv"

In [None]:
df = pl.read_csv(csv_file,try_parse_dates=True)
df.head()

## Temporal aggregation with datetime components and `group_by`
The simplest way to do a temporal aggregation on a time series is to:
- create the datetime components of interest
- do a `group_by` on these components

In this example we get the average trip distance by day-of-week

In [None]:
(
    df
    .group_by(
        pl.col("pickup").dt.weekday().alias("weekday")
    )
    .agg(
        pl.col("trip_distance").mean().round(1),
    )
    .sort("weekday")
)


## Groupby with `group_by_dynamic`
With `group_by_dynamic` we can work directly with the values in the date column.

**For `group_by_dynamic` the date column must be sorted in ascending order.** We need to either:
- use `set_sorted` on the date column if we know the column is already sorted
- sort the column with `sort`

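In some Polars versions no `Exception` is raised if the dates are not sorted and the answers may be wrong, while other versions check sortedness and raise an error. In either case, calling `set_sorted` on a column that is not actually sorted can silently give wrong answers.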

To check sortedness we use `is_sorted`

In [None]:
df["pickup"].is_sorted()

As the output is `True` we can use `set_sorted`


In its simplest form the arguments to `group_by_dynamic` are:
- the datetime column to group on and 
- the length of the grouping window with the `every` argument

In [None]:
(
    df
    .with_columns(
        pl.col("pickup").set_sorted()
    )
    .group_by_dynamic(
        "pickup", 
        every="1d"
    )
    .agg(
        pl.col("trip_distance").mean().round(1)
    )
    .head(5)
)

We look at how the windows are specified in more detail in the next lecture.

## `DynamicGroupBy` object

When we do `group_by_dynamic` we create a `DynamicGroupBy` object.

In [None]:
(
    df
    .with_columns(
        pl.col("pickup").set_sorted()
    )
    .group_by_dynamic(
        "pickup", 
        every="1d"
    )
)

To do aggregations on a `DynamicGroupBy` we call `agg`. We cannot call aggregation methods like `count` or `sum` on a `DynamicGroupBy` directly.

## Dynamic groupby on groups
We may want to divide the `DataFrame` into groups before doing `group_by_dynamic` on each group.

We can do this by passing the column names as the argument to `by` in `group_by_dynamic` (newer Polars releases rename this parameter to `group_by`).

To illustrate dynamic groupby on groups we group by `VendorID` and then take 3-hourly averages of the `tip_amount`.

In [None]:
(
    df
    .sort("VendorID","pickup")
    .group_by_dynamic("pickup",every="3h",by="VendorID")
    .agg(
        pl.col("tip_amount").mean().round(1)
    )
    .head()
)

Notice the order of the columns - Polars first groups by `VendorID` and then does `group_by_dynamic` on each of those groups.

We can also use expressions when grouping by another column - see the exercises.

## Dynamic groupby in lazy mode
When we do `group_by_dynamic` in lazy mode, the Polars query optimiser sees that only a subset of columns is required and reads only these columns from the CSV (`PROJECT 3/7 COLUMNS` in the plan below).

In [None]:
print(
    pl.scan_csv(csv_file,try_parse_dates=True)
    .with_columns(
        pl.col("pickup").set_sorted()
    )
    .group_by_dynamic("pickup",every="3h",by="passenger_count")
    .agg(
        pl.col("trip_distance").mean().round(1)
    )
    .explain()
)

## Exercises
In the exercises you will develop your understanding of:
- doing `group_by_dynamic` on a single column
- doing `group_by_dynamic` on groups
- the relative performance of `group_by_dynamic` and `group_by`

### Exercise 1
Groupby the `pickup` column on a 6-hourly basis.

Get the count, mean and max of the trip distance for each window.

Sort the output by the mean trip distance with the largest values first

Filter out all windows with fewer than 5 records.

### Exercise 2

Get the same statistics but also group by the Vendor ID

Get the same statistics (`count`,`max` and `mean`) but group by both:
- the Vendor ID and 
- the `trip_distance` where the `trip_distance` is cast to a 16-bit integer before grouping

## Solutions

### Solution to exercise 1
Groupby the `pickup` column on a 6-hourly basis.

Get the count, mean and max of the trip distance for each window.

Sort the output by the mean trip distance with the largest values first

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    .with_columns(
        pl.col("pickup").set_sorted()
    )
    .group_by_dynamic("pickup",every="6h")
    .agg(
        [
            pl.col("trip_distance").count().alias("count"),
            pl.col("trip_distance").mean().alias("mean"),
            pl.col("trip_distance").max().alias("max"),
        ]
    )
    .sort("mean",descending=True)
)

Filter out all windows with fewer than 5 records.

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    .with_columns(
        pl.col("pickup").set_sorted()
    )
    .group_by_dynamic("pickup",every="6h")
    .agg(
        [
            pl.col("trip_distance").count().alias("count"),
            pl.col("trip_distance").mean().alias("mean"),
            pl.col("trip_distance").max().alias("max"),
        ]
    )
    .filter(pl.col("count") >= 5)
    .sort("mean",descending=True)
    .head()
)

### Solution to exercise 2

Get the same statistics but also group by the Vendor ID

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    .with_columns(
        pl.col("pickup").set_sorted()
    )
    .group_by_dynamic("pickup",every="6h",by="VendorID")
    .agg(
        [
            pl.col("trip_distance").count().alias("count"),
            pl.col("trip_distance").mean().alias("mean"),
            pl.col("trip_distance").max().alias("max"),
        ]
    )
    .filter(pl.col("count") >= 5)
    .sort("mean",descending=True)
    .head(3)
)

Get the same statistics (`count`,`max` and `mean`) but group by both:
- the Vendor ID and 
- the `trip_distance` where the `trip_distance` is cast to a 16-bit integer before grouping

In [None]:
(
    pl.read_csv(csv_file,try_parse_dates=True)
    .with_columns(
        pl.col("pickup").set_sorted()
    )
    .group_by_dynamic(
        "pickup",
        every="6h",
        by=["VendorID", pl.col("trip_distance").cast(pl.Int16)],
    )
    .agg(
        [
            pl.col("passenger_count").count().alias("count"),
            pl.col("passenger_count").mean().alias("mean"),
            pl.col("passenger_count").max().alias("max"),
        ]
    )
    .sort("mean",descending=True)
    .head()
)