## Group operations
By the end of this lecture you will be able to:
- do window operations over a single column
- do window operations over multiple columns

Expressions typically work on a whole column at once. In some cases we want them to operate separately on groups of rows. For this we have the `over` expression.

In [None]:
import polars as pl
import polars.selectors as cs
pl.Config.set_tbl_rows(8)

We create a simple `DataFrame` with an `id` column, which defines the groups of rows, and a `value` column.

In [None]:
df = pl.DataFrame(
    {
        "id":["a","b","a","b"],
        "value":[0,1,2,3]
    }
)

We want to add a column with the maximum `value` in each group where the groups are defined by the `id` column.

We tell Polars to apply the `max` expression within each group by following it with the `over` expression.

In [None]:
(
    df
    .with_columns(
        group_max = pl.col("value").max().over("id")
    )
)

Let's break down the syntax here. We've got:
- `pl.col("value")` which gives us the **input column**
- `.max()` which **aggregates** the values in the input column
- `.over("id")` which **groups** the rows by `id` **before** we aggregate the input with `max`, so the maximum is computed within each group

> The equivalent operation in pandas is `groupby(...).transform(...)`

Using `over` is equivalent to:
- doing a `group_by` on the `over` column
- doing an `agg` with `pl.col("value").max()` to get a grouped `DataFrame` and
- left joining the grouped `DataFrame` back to the original `DataFrame`



Typically we use an aggregation, such as `sum`, to get a scalar value for each group. With `over`, that scalar is repeated on every row of the group.

But we can also use `over` with expressions that return a value for every row. For example, `cum_sum` on its own returns a column with the running total over the whole input column rather than a scalar.

If we use `cum_sum` with `over` we get the output we expect: the cumulative sum by group.

In [None]:
(
    df
    .with_columns(
        group_cum_sum = pl.col("value").cum_sum().over("id")
    )
)

## Multiple columns
We can also use `over` with multiple columns, just like doing a `group_by` with multiple columns.

We define a new `DataFrame` with two grouping columns, `id1` and `id2`. Only the first and third rows are in the same group.

In [None]:
df_mult = pl.DataFrame(
    {
        "id1":["a","b","a","b"],
        "id2":["x","x","x","y"],
        "value":[0,1,2,3]
    }
)

We now get the maximum `value` for each combination of `id1` and `id2`

In [None]:
(
    df_mult
    .with_columns(
        group_max = pl.col("value").max().over("id1","id2")
    )
)

## Filling missing values by group

We can use `over` to fill missing values by group.

Here we have a `DataFrame` where the second value in group `a` is missing

In [None]:
df_missing = pl.DataFrame(
    {
        "id":["a","b","a","b"],
        "value":[0,1,None,3]
    }
)

We can fill forward from the previous value in group `a` by following `fill_null` with `over`

In [None]:
(
    df_missing
    .with_columns(
        filled_value = pl.col("value").fill_null(strategy="forward").over("id")
    )
)

## Exercises
In the exercises you will develop your understanding of:
- doing arithmetic by group
- filling nulls by group
- doing multiple window expressions in a single `with_columns` statement

### Exercise 1
We want to calculate the *z-score* of the `Age` column normalised by passenger class.

Add a new column `Age_mean` with the mean of the `Age` column for passengers by class

In [None]:
csv_file = "../data/titanic.csv"

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .select(
        'Pclass',cs.starts_with("Age")
    )
    # Use head(6) to see the null on the sixth row
    .head(6)
)

Continue by replacing the `null` values in the `Age` column with the `median` age for passengers in that class

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .select(
        'Pclass',cs.starts_with("Age")
    )
    .head(6)
)

Replace `Age_mean` with a new column called `Age_delta` that is the difference between the age and the average age of all passengers in the same class

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        <blank>
    )
    .select(
        'Pclass',cs.starts_with("Age")
    )
    .head(10)
)

Continue by adding another column called `Age_z` with the z-score of `Age`. The z-score is (age - average age of the passengers in that class) divided by the standard deviation of the age of the passengers in that class

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        <blank>
    )
    .select(
        'Pclass',cs.starts_with("Age")
    )
    .head(10)
)

### Exercise 2

Count the number of passengers in each group of: passenger class and survival. Name the column of counts `counts`

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

Continue by calculating the percentage breakdown of passenger survival within each passenger class group. Call this column `percent`.

Sort the output by passenger class and survival

### Exercise 3
Window functions allow us to do multiple group-by operations in the same `select` or `with_columns`. Polars can cache the groups across window expressions in the same `with_columns` statement.

In this exercise we explore the effect of this caching on performance.

We begin by creating a `DataFrame` with groups and values

In [None]:
import numpy as np
np.random.seed(0)

N = 1_000_000
cardinality = N // 2
groups = np.random.randint(0,cardinality,N)
df = pl.DataFrame(
        {
            "groups":groups,
            "values":np.random.standard_normal(N)
        }
    )
df.head(3)

We want to add: 
- a `max` column with the maximum value per group and 
- a `min` column with the minimum value per group.


Time how long this takes with two `with_columns` statements

In [None]:
%%timeit -n1 -r3
(
    df
    <blank>
)

Time how long this takes in a single `with_columns` statement

In [None]:
%%timeit -n1 -r3
(
    df
    <blank>
)

Can Polars cache the window expressions across `with_columns` statements in lazy mode?

In [None]:
%%timeit -n1 -r3
(
    df
    .lazy()
    <blank>
)

## Solutions

### Solution to exercise 1
We want to calculate the *z-score* of the `Age` column for each passenger normalised by their passenger class.

Add a new column `Age_mean` with the mean of the `Age` column for passengers by class

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        Age_mean = pl.col('Age').mean().over('Pclass')
    )
    .select(
        'Pclass',cs.starts_with("Age")
    )
    .head(6)
)

Continue by replacing the `null` values in the `Age` column with the `median` age for passengers in that class

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        Age_mean = pl.col('Age').mean().over('Pclass')
    )
    .with_columns(
        Age = pl.col('Age').fill_null(pl.col('Age').median().over('Pclass'))
    )
    .select(
        'Pclass',cs.starts_with("Age")
    )
    .head(6)
)

Replace `Age_mean` with a new column called `Age_delta` that is the difference between the age and the average age of all passengers in the same class. Keep the `fill_null` step from above

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        Age = pl.col('Age').fill_null(pl.col('Age').median().over('Pclass'))
    )
    .with_columns(
        Age_delta = pl.col('Age') - pl.col('Age').mean().over('Pclass')
    )
    .select(
        'Pclass',cs.starts_with("Age")
    )
    .head(6)
)

Continue by adding another column called `Age_z` with the z-score of `Age`. The z-score is (age - average age of the passengers in that class) divided by the standard deviation of the age of the passengers in that class

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        Age = pl.col('Age').fill_null(pl.col('Age').median().over('Pclass'))
    )
    .with_columns(
        Age_delta = pl.col('Age') - pl.col('Age').mean().over('Pclass')
    )
    .with_columns(
        # Reuse Age_delta (age minus the class mean) as the numerator
        Age_z = pl.col('Age_delta') / pl.col('Age').std().over('Pclass')
    )
    .select(
        'Pclass',cs.starts_with("Age")
    )
    .head(6)
)

### Solution to exercise 2

Count the number of passengers in each group of passenger class and survival

In [None]:
(
    pl.read_csv(csv_file)
    .group_by(["Pclass","Survived"])
    .agg(
        pl.col("Name").count().alias("counts")
    )
)

Calculate the percentage breakdown of passenger survival within each passenger class group. Calculate the percentage as 0-100.

Sort the output by passenger class and survival

In [None]:
(
    pl.read_csv(csv_file)
    .group_by(["Pclass","Survived"])
    .agg(
        pl.col("Name").count().alias("counts")
    )
    .with_columns(
        # Parenthesise the whole calculation so that round and alias apply to the final percentage
        (100*pl.col("counts")/pl.col("counts").sum().over("Pclass")).round(1).alias("percent")
    )
    .sort(["Pclass","Survived"])
)

### Solution to exercise 3

Window functions allow us to do multiple group-by operations in the same `select` or `with_columns`. Polars can cache the groups across window expressions in the same `with_columns` statement.

In this exercise we explore the effect of this caching on performance.

We begin by creating a `DataFrame` with groups and values

In [None]:
import numpy as np
np.random.seed(0)

N = 1_000_000
cardinality = N // 2
groups = np.random.randint(0,cardinality,N)
df = pl.DataFrame(
        {
            "groups":groups,
            "values":np.random.standard_normal(N)
        }
    )
df.head(3)

We want to add a `max` column with the maximum value per group and a `min` column with the minimum value per group.


Do this with two `with_columns` statements

In [None]:
%%timeit -n1 -r3
(
    df
    .with_columns(
        pl.col("values").max().over("groups").alias("max")
    )
    .with_columns(
        pl.col("values").min().over("groups").alias("min")
    )
)

Do this in a single `with_columns` statement

In [None]:
%%timeit -n1 -r3
(
    df
    .with_columns(
        [
            pl.col("values").max().over("groups").alias("max"),
            pl.col("values").min().over("groups").alias("min")
        ]
    )
)

Can Polars cache the window expressions across `with_columns` statements in lazy mode?

In [None]:
%%timeit -n1 -r3
(
    df
    .lazy()
    .with_columns(
        pl.col("values").max().over("groups").alias("max")
    )
    .with_columns(
        pl.col("values").min().over("groups").alias("min")
    )
    .collect()
)

Not at this point, as there is no speed-up compared with the two separate `with_columns` statements in eager mode!