# GroupBy

## Introducing our Dataset
- `retail_sales.csv` is a list of transactions for a fictional e-commerce store.
- Let's pass `try_parse_dates=True` to convert the `purchase_date` column to date values.
- Let's also sort by the `purchase_date` column.

- Several columns have a small number of unique values, so we can convert them to enums.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/api/polars.read_csv.html
- https://docs.pola.rs/api/python/stable/reference/api/polars.datatypes.Enum.html

## Intro to Grouping
- Grouping simply means "placing into groups".
- The value of grouping is performing aggregate calculations on subsets of data.
- Grouping consists of three steps (split, apply, combine):
    - **Splitting** the data into groups based on distinct column values
    - **Applying** an aggregation function to each group
    - **Combining** the results into a new `DataFrame`
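- The three steps can be illustrated without any library. This is only a conceptual sketch in plain Python with made-up rows; it is not how Polars implements grouping internally.

```python
from collections import defaultdict

# Hypothetical (category, quantity) pairs standing in for DataFrame rows.
rows = [("Toys", 2), ("Home", 1), ("Toys", 5), ("Home", 4)]

# Split: bucket the quantities by their category.
groups = defaultdict(list)
for category, quantity in rows:
    groups[category].append(quantity)

# Apply: run an aggregation (here, sum) on each bucket.
# Combine: collect one result per group into a single structure.
totals = {category: sum(quantities) for category, quantities in groups.items()}
print(totals)  # {'Toys': 7, 'Home': 5}
```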

## The group_by Method
- The `group_by` method accepts the column(s) whose unique values will determine the groups.
- The `group_by` method returns a `GroupBy` object.
- Polars will create a group for each distinct column value. This is the **splitting** step.
- The `GroupBy` object is a container for all the split `DataFrames` (one `DataFrame` per unique value).

- The next example creates groups using the values in the `product_category` column.
- The `GroupBy` object will store 4 groups, one for each product category (`Electronics`, `Clothing`, `Home`, and `Toys`).

- The `GroupBy` object is not particularly helpful by itself but supports many methods for aggregation.
- The `len` method returns a `DataFrame` with the number of rows per group.
- The value `Clothing` appears in 236 rows in `sales`, `Electronics` appears in 230 rows, and so on.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.group_by.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.len.html

## GroupBy Methods
- Invoking a method on the `GroupBy` object applies a calculation to _each_ group, then combines the results in a new `DataFrame`.

- The `first` method returns the first row from each group.
- Polars will place the group column (`product_category`) at the start of the `DataFrame`.
- Notice that invoking the method multiple times can return the rows in a different order each time.
- The `DataFrame` always has the same 4 rows (one for each product category), but they may appear in a different order.

- The first row in each group matches its appearance in the original `DataFrame`.

- Polars stores the _rows_ within each group in the order they appear in the original `DataFrame`.
- By default, Polars does not maintain the order of the _groups_ in the output.
- It is the _groups_ that appear in a seemingly random order each time we run the code.
- Behind the scenes, Polars parallelizes the extraction of one row from each group.
- The order in which the results are evaluated may vary from execution to execution.
- Polars thus does not guarantee a consistent _order_ of the rows. The rows themselves will be the same, however.

### The maintain_order Parameter
- By default, the `group_by` method does not guarantee that the order of groups matches the order of the values in the `DataFrame`.
- The `maintain_order` parameter forces the `GroupBy` object to store the groups in a consistent order.
- For aggregation operations, setting `maintain_order` to `True` can be slower than the default.
- Preserving a consistent order constrains how Polars schedules the per-group work, which limits effective parallelization.
- The group order is determined by the order in which the distinct values first appear in the `product_category` column.
- The order will be `"Electronics"`, `"Clothing"`, `"Toys"`, and `"Home"`.

- The order of the groups (and therefore of the output rows) is now consistent across runs.

- The `last` method returns the last row from each group.

- The `head` method returns a specified number of rows from the top of each group.

- The `tail` method returns a specified number of rows from the end of each group.

### Further Reading
- https://docs.pola.rs/user-guide/getting-started/#group_by
- https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#group_by-and-aggregations
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.first.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.last.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.head.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.tail.html

## Largest and Smallest Values per Group

- The `max` method returns the largest value _per column_ for each group.
- These are _not_ single rows from the original `sales` `DataFrame`.
- For the product category of `"Toys"`, `"T0998"` is the greatest `transaction_id`, 10 is the greatest `quantity`, and so on.

- Let's take a look at the row with a `transaction_id` of 999. Notice the values do not match the row above.

- The `min` method returns the smallest value per column for each group.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.max.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.min.html

## Aggregations with the agg Method
- The methods from the previous lesson performed aggregation calculations on _every_ column in each group.
- The `agg` method targets specific column(s) for group calculations.

- The `agg` method accepts one or more expressions.
- Let's calculate the largest value in the `quantity` column for each product category.
- The results are the same as invoking `groups.max` without all the extra columns.

- Multiple expressions are welcome! As always, all column names in the new `DataFrame` must be unique.

- The `sum` method adds the values in the specified column for each group.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/aggregation/#basic-aggregations
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.agg.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.max.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.min.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.mean.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.median.html

## Using Selectors in Aggregations
- Top-level methods on the `GroupBy` object aggregate every column, including columns that we do not care about.
- Using the `agg` method to target individual columns for aggregations is fine but can get verbose.
- The `selectors` submodule can be helpful in targeting all columns of a specific data type.

- Let's calculate the sum of the values in every numeric column for each group.
- We'll use `cs.numeric` to create a selector targeting all numeric columns (integers and floats).
- We'll then calculate the sum of the values per product category.

- Review: The `name.suffix` method concatenates a consistent string to the end of each column name.
- It offers a shortcut to avoid naming each column individually.

- Aggregation functions like `sum` or `mean` evaluate to Polars expressions.
- We can mix-and-match the expressions passed to the `agg` method.
- The next example calculates the sum and average of values in all numeric columns.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.numeric
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.name.suffix.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.mean.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.sum.html

## Grouping with Multiple Columns
- Pass a list of column names to `group_by` to group by each distinct combination of their values.
- Use the column with the smaller number of unique values as the outer/first group.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/aggregation/#nested-grouping
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.group_by.html

## Grouping Temporal Data
- The `group_by` method uses distinct values to define the groups.
- Grouping datetime values by exact time is usually not valuable because of the amount of variation.
- A time window is a span of time (e.g., 1 week).
- The `group_by_dynamic` method creates groups using a time window.
- The `group_by_dynamic` method looks for the inclusion of a datetime value within the time window/range.
- For example, a window of time from January 1st to January 7th will include a row whose date is January 5th.

- Make sure to _sort_ the date/datetime column whose values will be grouped.
- The `every` parameter accepts an interval of time.
- The next example provides `1w` for 1 week but we can use symbols to designate other durations.
- By default, Polars finds the earliest date, then buckets it into a weekly window starting on Monday.
- The earliest date in `sales` is 01/01/2025, a Wednesday.
- The first group thus starts from Monday that week, 12/30/2024.


- Let's calculate the sum of items sold for every 1 week grouping.

- Polars then creates the range of weeks until it reaches the last date in the dataset.
- The `start_by` argument accepts a string with a specified weekday.
- The next example asks Polars to start a week/range on Wednesdays.
- Each group/bucket now goes from Wednesday-Tuesday.

- Alternatively, we can pass the `start_by` parameter a value of `datapoint`.
- Polars will create 1-week buckets starting from the first datapoint, `2025-01-01`.

- The `mo` symbol stands for month.
- The next example creates monthly groupings.
- Polars is smart enough to account for differences in month lengths.

- The next example groups by quarter.

- We can mix and match different time specifications.
- The next example creates groups 1 month, 2 weeks, and 3 days apart.

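- The `every` parameter (how often a window starts) and the `period` parameter (how long a window lasts) together allow additional time window configurations:
    - non-overlapping (`period` equals `every`, so no window overlaps another)
    - overlapping (`period` is longer than `every`, so windows can overlap one another)
    - gapped (`period` is shorter than `every`, so there are gaps between subsequent windows)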

### Further Reading
- https://docs.pola.rs/user-guide/transformations/time-series/rolling/#parameters-for-group_by_dynamic
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.group_by_dynamic.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.agg.html

## Window Functions and the over Method

- The `rank` method assigns a rank to each column value.
- A descending rank assigns the highest column value a rank of 1, the second highest a rank of 2, and so on.
- The `method` parameter offers different strategies for resolving ties when multiple values are equal.
- The `rank` method doesn't account for groups. It's a ranking across the whole `DataFrame`.

- Say that we want to rank the transactions _within_ each region.
- We want the transaction with the highest `total_price` in each of the `"West"`, `"North"`, `"South"`, and `"East"` regions to have a rank of 1.
- The `over` method specifies the ranking should be done _separately_ for each group.
- We ask Polars to rank the numeric values in `total_price` over the groups defined by `region`.
- It's called a "window function" because each calculation operates over a window -- a subset of rows -- rather than the whole dataset.
- The window is effectively each "group". Each group has a subset of rows that we want to perform calculations on.

- We can pass multiple columns to the `over` method.
- The window here targets every combination of `sales_channel` and `region`.
- There are 8 possible combinations of `sales_channel` (2 options) and `region` (4 options).
- As a result, if every combination occurs in the data and there are no ties, 8 rows will have a rank of 1, 8 rows will have a rank of 2, and so on.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/window-functions/#operations-per-group
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.rank.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cast.html