# Groupby-aggregations 1: Key concepts
By the end of this lecture you will be able to:
- do a group by-aggregation
- group by multiple columns
- group by expressions
- sort group by outputs
- use group by in lazy mode and
- do fast-track grouping on a sorted column

In [None]:
import polars as pl

In [None]:
csv_file = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csv_file)
df.head(3)

## Group-by and aggregation
In Polars we can group by a column and aggregate the data in other columns with the `group_by.agg` combination.

In this example we group by the passenger class and take the mean of the `Fare` column

In [None]:
(
    df
    .group_by("Pclass")
    .agg(
        pl.col("Fare").mean().round()
    )
)

> Why group_by and not groupby? The Polars API aims to be readable and one standard is to split words by `_`

Almost everything we do after this will be some variation on this basic pattern of `group_by` and `agg`.

Note that we passed an aggregation expression `pl.col("Fare").mean()` inside `agg` to get a single value for each group.

Let's see what happens if we don't pass an aggregation expression

In [None]:
(
    df
    .group_by("Pclass")
    .agg(
        pl.col("Fare").head(2)
    )
)

In this case the `Fare` column is a `pl.List` column. As we called `head(2)`, each row holds a list with the first two `Fare` values for that group


## What happens when we run `group_by.agg`?
While the full workings are more complicated than this, a basic description of the internal flow is that:
- when we call `.group_by` Polars creates a `GroupBy` object that captures the group-by parameters (e.g. the columns to group by) but **does not calculate the groups** until a further method (such as `agg`) is called on it
- when we call `agg` on the `GroupBy` object Polars:
    - calculates the groups by getting the row indexes for each group
    - applies the expressions in `agg` to each group
    - joins the outputs of the expressions back to each group to create the output `DataFrame`
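
The three steps above can be sketched in plain Python. This is only an illustration of the idea, not how Polars is implemented internally; the toy `pclass` and `fare` lists and the `mean` helper are assumptions made up for this example.

```python
from collections import defaultdict

# Toy stand-ins for two DataFrame columns (hypothetical values)
pclass = [3, 1, 3, 1, 2]
fare = [7.25, 71.28, 7.92, 53.10, 13.00]

# Step 1: calculate the groups as a mapping from group key to row indexes
groups = defaultdict(list)
for row_index, key in enumerate(pclass):
    groups[key].append(row_index)


# Step 2: apply the aggregation (here a mean) to the rows of each group
def mean(values):
    return sum(values) / len(values)


aggregated = {
    key: mean([fare[i] for i in row_indexes])
    for key, row_indexes in groups.items()
}

# Step 3: combine each group key with its aggregated value to build the output rows
output = [{"Pclass": key, "Fare": value} for key, value in aggregated.items()]
print(output)
```

Note that nothing is computed for a group until step 1 has found its row indexes, which mirrors the way the `GroupBy` object defers the work until `agg` is called.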

## Grouping by multiple columns
We can group by multiple columns by passing a `list` to `group_by` or a comma-separated list of columns

In [None]:
(
    df
    .group_by("Pclass","Survived")
    .agg(
        pl.col("Fare").mean()
    )
)

We can also use expressions inside `group_by` - in fact when we pass column names as strings (as above) Polars converts these to expressions internally.

As we can pass expressions to `group_by` we can also group by a transformed column. Here, for example, we group by the `Age` column with values cast to integer

In [None]:
(
    df
    .group_by(pl.col("Age").cast(pl.Int64))
    .agg(
        pl.col("Fare").mean()
    )
    .head()
)

## Ordering of the output
We have seen that the output `DataFrame` has a different order each time. This happens because Polars works out the row indexes for the group keys in parallel. This means that Polars:
- splits the group columns into chunks (e.g. first 10 rows in one chunk, second 10 rows in another chunk, etc)
- finds the row indexes within each chunk on a separate thread
- brings the results from different threads back together

As the order in which the results come back from the different threads is random, the order of the output `DataFrame` is also random
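
The effect of the chunked approach can be imitated in plain Python. This is a simulation under simplifying assumptions, not Polars' actual code: the `keys` list, the `chunk_size` and both helper functions are invented for the example, and a seeded shuffle stands in for threads finishing in an unpredictable order.

```python
import random
from collections import defaultdict

keys = [3, 1, 3, 1, 2, 3, 2, 1]  # hypothetical group column
chunk_size = 3


def row_indexes_for_chunk(start):
    """Map each key in one chunk to its row indexes in the full column."""
    found = defaultdict(list)
    for offset, key in enumerate(keys[start:start + chunk_size]):
        found[key].append(start + offset)
    return found


def grouped_row_indexes(seed):
    partial_results = [
        row_indexes_for_chunk(start) for start in range(0, len(keys), chunk_size)
    ]
    # Stand-in for threads returning their results in an unpredictable order
    random.Random(seed).shuffle(partial_results)
    merged = defaultdict(list)
    for partial in partial_results:
        for key, row_indexes in partial.items():
            merged[key].extend(row_indexes)
    # The order in which keys were first inserted depends on the shuffle
    return {key: sorted(row_indexes) for key, row_indexes in merged.items()}


for seed in range(3):
    print(list(grouped_row_indexes(seed)))
```

Each run finds the same row indexes for every group, but the order in which the group keys appear depends on which chunk's results arrive first.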

We can force the order of the output to match the order the group keys occur in the input with the `maintain_order` argument

In [None]:
(
    df
    .group_by("Pclass",maintain_order=True)
    .agg(
        pl.col("Fare").mean()
    )
)

The first row of the output is group `3` because the first row of `df` has `Pclass` equal to `3`, and so on.

Setting `maintain_order=True` will affect performance to some extent. We also cannot use the streaming engine for large datasets when `maintain_order=True`.

We need to use the `sort` method if we want to set a different sorting of the output groups.

I explored the reason for `group_by` (and related methods such as `unique`) not preserving order by default in this blog post: https://www.rhosignal.com/posts/polars-ordering

> If you are running unit tests of a `group_by` you generally want to set `maintain_order=True` to get the same output each time it is run. This is the reason why `maintain_order=True` is normally set in the Polars API docs as these examples are run in the Polars test suite.


## Group by in lazy mode
A `group_by.agg` works in a very similar way in lazy mode and eager mode. In fact when we do this in eager mode:

In [None]:
(
    df
    .group_by("Pclass")
    .agg(
        pl.col("Fare").mean()
    )
)

then Polars internally runs it (more-or-less) like this:

In [None]:
(
    df
    .lazy()
    .group_by("Pclass")
    .agg(
        pl.col("Fare").mean()
    )
    .collect()
)

So we should not expect an isolated `group_by.agg` to run any faster in lazy mode than eager mode.

With a query that starts from a file with `pl.scan_*` Polars can do projection pushdown by identifying which columns are needed for the query

In [None]:
print(
    pl.scan_csv(csv_file)
    .group_by("Pclass")
    .agg(
        pl.col("Fare").mean()
    )
    .explain()
)

And we see here in the last row of the optimised query plan that only 2 out of 12 columns are read from the CSV

### Streaming groupby on large datasets
We can run `group_by` on large datasets in streaming mode with the default argument of `maintain_order=False`. However, if we set `maintain_order=True` then `group_by` cannot be run in streaming mode.

To see this note how the `AGGREGATE` part of this query plan moves outside of the ` --- STREAMING` if you change `maintain_order` from `False` (the default) to `True`

In [None]:
print(
    pl.scan_csv(csv_file)
    .group_by("Pclass",maintain_order=False)
    .agg(
        pl.col("PassengerId").count()
    )
    .explain(streaming=True)
)

## Groupby on a sorted column
In the lecture "Sorting and Fast-track algorithms" in the Selecting columns and transforming dataframes section we saw how Polars can use fast-track algorithms on sorted columns - if it knows the column is sorted.

A fast-track algorithm can also be used if the groupby column is sorted. See Exercise 3 for an example of this (make sure you have done the Sorting and Fast-track algorithms lecture first).

## Exercises
In the exercises you will develop your understanding of:
- doing `group_by.agg` with one or more columns
- transforming columns before grouping
- aggregating each group
- the effect of the fast-track algorithm on a sorted column

### Exercise 1
Group by the `Pclass` and `Survived` columns and count the number of passengers in each group. Ensure the order is the same as the input order

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

Did people with longer names pay more for their ticket?

Group by the number of characters in the `Name` column and get the average `Fare` for each name length

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

Make a scatter plot of the output with `plot.scatter`

### Exercise 2
We create a `DataFrame` from the Spotify data

In [None]:
pl.Config.set_fmt_str_lengths(100)
pl.Config.set_tbl_rows(10)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

Format the floating point values so that large floating point numbers are separated by a comma (or your preferred thousand separator). If you have not encountered this before try tab-completing the following cell to find an appropriate method

In [None]:
pl.Config.set

Group by the `artist` and `title` columns and get the maximum of the other columns. Sort the output with the largest values of streams first

In [None]:
(
    spotify_df
    <blank>
    .head()
)

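It's easy to forget that the maximum of each column is calculated separately, so the values in an output row are not all taken from the row with the maximum `streams` and can come from different input rows. For example, on the day with the maximum streams the rank for each of these entries would have been 1, but instead the output shows the maximum rank value, which is the lowest chart position the track reached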

Now we ask if collaborations lead to more streams.

Group by the number of artists listed in `artist` column and then take the mean of the streams column. Sort by the number of artists

In [None]:
(
    spotify_df
    <blank>
)

Make a bar chart of the output

### Exercise 3
We look at the effect of sorting and the fast-track algorithm on a `group_by` operation.

We create a `DataFrame` with an `id` column of integers and a `values` column

- The `N` variable sets the number of rows in the `DataFrame`
- The `cardinality` sets the number of distinct group keys in the `id` column

We begin with a low cardinality and see the effect of increasing the cardinality later in the exercise.

We pre-sort the `id`s before creating the `DataFrame`

In [None]:
pl.Config.set_tbl_rows(4)
import numpy as np
np.random.seed(0)
N = 10_000_000
cardinality = 10
# Create a sorted array of id integers
sorted_array = np.sort(np.random.randint(0,cardinality,N))
df = (
    pl.DataFrame(
        {
            "id":[i for i in sorted_array],
            "values":np.random.standard_normal(N)
        }
    )
)
df.head(3)

Time how long it takes to groupby the `id` column and take the mean of the `values` column without any fast-track algorithm

In [None]:
%%timeit -n1 -r3
(
    df
    <blank>
)

Create a new `DataFrame` called `df_sorted` where we tell Polars the `id` column is sorted

In [None]:
df_sorted = (
    df
    <blank>
)
df_sorted["id"].flags

Time how long it takes to groupby the `id` column and take the mean of the `values` column **with** a fast-track algorithm

In [None]:
%%timeit -n1 -r3
(
    df_sorted
    <blank>
)

Compare the difference between the sorted and non-sorted algorithms when the cardinality of `id` is higher. Try:
- `cardinality = 1_000` and 
- `cardinality = 1_000_000`


## Solutions

### Solutions to Exercise 1
Group by the `Pclass` and `Survived` columns and count the number of passengers in each group. Ensure the order is the same as the input order

In [None]:
(
    pl.read_csv(csv_file)
    .group_by("Pclass","Survived",maintain_order=True)
    .agg(
        pl.col("Age").len().alias("len")
    )
)

Did people with longer names pay more for their ticket?

Group by the number of characters in the `Name` column and get the average `Fare` for each name length

In [None]:
(
    pl.read_csv(csv_file)
    .group_by(pl.col("Name").str.len_chars())
    .agg(pl.col("Fare").mean().round())
    .head()
)

Make a scatter plot of the output with `plot.scatter`

In [None]:
(
    pl.read_csv(csv_file)
    .group_by(pl.col("Name").str.len_chars())
    .agg(pl.col("Fare").mean())
    .plot
    .scatter(
        x="Name",
        y="Fare"
    )
)

Overall there is a loose positive relationship between name length and fare paid!

### Solutions to Exercise 2
We create a `DataFrame` from the Spotify data

In [None]:
pl.Config.set_fmt_str_lengths(100)
pl.Config.set_tbl_rows(10)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

Format the floating point values so that large floating point numbers are separated by a comma (or your preferred thousand separator). If you have not encountered this before try tab-completing the following cell to find an appropriate method

In [None]:
pl.Config.set_thousands_separator(",")

Group by the `artist` and `title` columns and get the maximum of the other columns. Sort the output with the largest values of streams first

In [None]:
(
    spotify_df
    .group_by("artist","title")
    .agg(
        pl.all().max()
    )
    .sort("streams",descending=True)
    .head()
)

Now we ask if collaborations lead to more streams.

Group by the number of artists listed in `artist` column and then take the mean of the streams column. Sort by the number of artists

In [None]:
(
    spotify_df
    .group_by(number_of_artists = pl.col("artist").str.split(",").list.len())
    .agg(
        pl.col("streams").mean()
    )
    .sort("number_of_artists")
)

Make a bar chart of the output

In [None]:
(
    spotify_df
    .group_by(number_of_artists = pl.col("artist").str.split(",").list.len())
    .agg(
        pl.col("streams").mean()
    )
    .sort("number_of_artists")
    .plot
    .bar(
        y="streams",
        x="number_of_artists"
    )
)

### Solutions to Exercise 3
We look at the effect of sorting on the performance of a `group_by` operation.

We create a `DataFrame` with an `id` column of integers and a `values` column

- The `N` variable sets the number of rows in the `DataFrame`
- The `cardinality` sets the number of distinct `id`s

We pre-sort the `id`s before creating the `DataFrame`.

In [None]:
pl.Config.set_tbl_rows(4)
import numpy as np
np.random.seed(0)
# Number of rows
N = 10_000_000
# Number of unique values in groupby column
cardinality = 10
# Create a sorted array of id integers
sorted_array = np.sort(np.random.randint(0,cardinality,N))
# Create a DataFrame from this data
df = (
    pl.DataFrame(
        {
            "id":[i for i in sorted_array],
            "values":np.random.standard_normal(N)
        }
    )
)
df.head(3)

At this point **we know** that `id` is sorted, but Polars does not

Time how long it takes to groupby the `id` column and take the mean of the `values` column without any fast-track algorithm

In [None]:
%%timeit -n1 -r3
(
    df
    .group_by("id")
    .agg(
        pl.col("values").mean()
    )
)

Create a new `DataFrame` called `df_sorted` where we tell Polars the `id` column is sorted. Check that Polars knows the `id` column is sorted

In [None]:
df_sorted = (
    df
    .with_columns(
        pl.col("id").set_sorted()
    )
)
df_sorted["id"].flags

Time how long it takes to groupby the `id` column and take the mean of the `values` column **with** a fast-track algorithm

In [None]:
%%timeit -n1 -r3
(
    df_sorted
    .group_by("id")
    .agg(
        pl.col("values").mean()
    )
)

Compare the difference in timings between the standard and fast-track algorithm when the cardinality of `id` is higher (e.g. equal to 100,000)


The difference is much smaller (and possibly negative) when the cardinality of `id` is high