# Groupby 2: Group iteration and aggregations
By the end of this lecture you will be able to:
- iterate over groups
- get group values
- do multiple aggregations
- apply user-defined functions on aggregations

In [None]:
import polars as pl
import polars.selectors as cs

In [None]:
csv_file = '../data/titanic.csv'

In [None]:
df = pl.read_csv(csv_file)
df.head(3)

## Iterating over groups
We can access the `DataFrame` for each group by looping over a `GroupBy` object.

When we iterate, Polars calculates the row indexes for each group on the first iteration so that they can be reused for the rest of the loop.

In this example we print the mean for each group. 

The group key is a `tuple` even when we group by only one column. For this reason we unpack the first iteration variable as a one-element tuple, `(pclass,)`, which gives us a variable that matches the column name.

In [None]:
for (pclass,),group_df in df.group_by(["Pclass"]):
    print(f"PClass:{pclass}")
    print(group_df.mean())

When we group by multiple columns we see how having the key as a `tuple` naturally extends to multiple group keys.

In [None]:
for (pclass,survived),group_df in df.group_by("Pclass","Survived"):
    print(f"PClass:{pclass},Survived:{survived}")
    print(group_df.mean())

## Group values
We use `head` to get the first rows in each group.

In this example we return a `DataFrame` with the first 2 rows from each group.

In [None]:
(
    df
    .group_by("Pclass")
    .head(2)
)

We can also use `tail` to get the last rows in each group.

## Calling aggregations directly on `group_by`
We can call aggregations on all columns directly on `group_by` without using `agg`.

In this example we count the number of rows per group and get a single `len` column of counts.

In [None]:
(
    df
    .group_by("Pclass")
    .len()
)

The methods we can call on `GroupBy` include:
 - `first` get the first element of each group
 - `last` get the last element of each group
 - `n_unique` get the number of unique elements in each group
 - `len` get the number of rows in each group (called `count` in older Polars versions)
 - `sum` sum the elements in each group
 - `min` get the smallest element in each group
 - `max` get the largest element in each group
 - `mean` get the average of elements in each group
 - `median` get the median in each group
 - `quantile` calculate quantiles in each group

We can also call aggregations directly on a lazy `group_by`, though not all of the above are supported.
 

## Multiple aggregations on the same columns
We can use the `name.prefix` or `name.suffix` expressions to give the outputs distinct names when we do different aggregations on the same columns.

In this example we get the `min` and `max` of the floating point columns grouped by passenger class. We then sort the column names inside a `pipe` function so that the aggregations on the same column sit next to each other.

In [None]:
group_column = "Pclass"
(
    df
    .group_by(group_column)
    .agg(
        pl.col(pl.Float64).min().name.suffix("_min"),
        pl.col(pl.Float64).max().name.suffix("_max"),
    )
    .pipe(
        lambda df: df.select([group_column]+sorted(df.columns[1:]))
    )
)

In this example we also see how we can apply the same aggregation to multiple columns by using `pl.col(pl.Float64)`. The approaches we have seen previously for selecting multiple columns all work here. For example, we can use selectors.

In [None]:
group_column = "Pclass"
(
    df
    .group_by(group_column)
    .agg(
        cs.float().min().name.suffix("_min"),
        cs.float().max().name.suffix("_max"),
    )
    .pipe(
        lambda df: df.select([group_column]+sorted(df.columns[1:]))
    )
)

## User-defined functions on groups
We can apply user-defined functions to groups with `map_groups`.

The input to `map_groups` is the sub-`DataFrame` for each group - similar to the `DataFrames` we got when we iterated over the groups above. 

The output of `map_groups` must also be a `DataFrame`. Polars then vertically concatenates the output `DataFrame` for each group back into a single `DataFrame`.

In this simple example we get one row for each group with the maximum value for each column in each group.

In [None]:
(
    df
    .group_by("Pclass")
    .map_groups(
        lambda group_df: group_df.max()
    )
)

Here we output a 2-row `DataFrame` for each group, so we get two rows per group in the output. We do this for the floating point columns only, which means we lose the grouping column `Pclass`.

In [None]:
(
    df
    .group_by("Pclass")
    .map_groups(
        lambda group_df: group_df.select(pl.col(pl.Float64)).head(2)
    )
)

## Exercises

In the exercises you will develop your understanding of
- doing multiple aggregations
- iterating over groups

### Exercise 1
Group by the `Pclass` column. Count the number of passengers in each group without using `agg`

In [None]:
(
    pl.read_csv(csv_file)
    .<blank>
)

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

Add a column called `percent` with the percentage of the total passengers in each group

Create a bar chart of the `percent` column with the title `% per class`

### Exercise 2
Create a `DataFrame` from the Spotify data

In [None]:
pl.Config.set_fmt_str_lengths(100)
pl.Config.set_tbl_rows(10)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

We want to inspect some data for the top-streaming artists by printing it out:
- filter `spotify_df` to include only rows that had more than 10 million streams
- `group_by` the `artist` column
- ensure the order of the output is the same each time
- print the `artist` key
- print the sub-`DataFrame`

Repeat this exercise, but this time group by the `artist` and `title` columns and print the artist and title for each group

Find the total number of streams by artist for tracks that are number 1 in the charts. Divide the number of streams by 1 million to make it easier to read and sort from high to low

Using one of the methods we can call directly on `group_by`, find out how many distinct tracks each artist has. Sort the values from high to low

## Solutions

### Solution to Exercise 1
Group by the `Pclass` column. Count the number of passengers in each group without using `agg`

In [None]:
(
    pl.read_csv(csv_file)
    .group_by("Pclass")
    .len()
)

Add a column called `percent` with the percentage of the total passengers in each group

In [None]:
(
    pl.read_csv(csv_file)
    .group_by("Pclass")
    .len()
    .with_columns(
        (100 * (pl.col("len") / pl.col("len").sum())).alias("percent")
    )
)

Create a bar chart of the `percent` column by calling `.plot.bar` on the output with the title `% per class`

In [None]:
(
    pl.read_csv(csv_file)
    .group_by("Pclass")
    .len()
    .with_columns(
        (100 * (pl.col("len") / pl.col("len").sum())).alias("percent")
    )
    .plot
    .bar(
        x="Pclass",
        y="percent",
        title="% per class"
    )
)

### Solution to Exercise 2
Create a `DataFrame` from the Spotify data

In [None]:
pl.Config.set_fmt_str_lengths(100)
pl.Config.set_tbl_rows(10)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

We want to inspect some data for the top-streaming artists by printing it out:
- filter `spotify_df` to include only rows that had more than 10 million streams
- `group_by` the `artist` column
- ensure the order of the output is the same each time
- print the `artist` key
- print the sub-`DataFrame`

In [None]:
for (artist,),artist_df in (
    spotify_df
    .filter(
        pl.col("streams") > 10_000_000,
    )
    .group_by(["artist"],maintain_order=True)
):
    print(artist)
    print(artist_df)


Repeat this exercise, but this time group by the `artist` and `title` columns and print the artist and title for each group

In [None]:
for (artist,title),artist_df in (
    spotify_df
    .filter(
        pl.col("streams") > 10_000_000,
    )
    .group_by("artist","title",maintain_order=True)
):
    print(artist,title)
    print(artist_df)


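Find the total number of streams by artist for tracks that are number 1 in the charts. Divide the number of streams by 1 million to make it easier to read and sort from high to low. Here we group by both `artist` and `title` so that we get a total for each number 1 track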

In [None]:
(
    spotify_df
    .filter(
        pl.col("rank")==1
    )
    .group_by(
        "artist","title"
    )
    .sum()
    .with_columns(
        (pl.col("streams")/1e6)
    )
    .select("artist","title","streams")
    .sort("streams",descending=True)
)

Using one of the methods we can call directly on `group_by`, find out how many distinct tracks each artist has. Sort the values from high to low

In [None]:
(
    spotify_df
    .group_by("artist")
    .n_unique()
    .select("artist","title")
    .sort("title",descending=True)
    .head()
)