# GroupBy

In [1]:
import polars as pl

## Introducing our Dataset
- `retail_sales.csv` is a list of transactions for a fictional e-commerce store.
- Let's pass `try_parse_dates=True` to convert the `purchase_date` column to date values.
- We'll sort by the `purchase_date` column later, when we group by time windows.

In [2]:
pl.read_csv("retail_sales.csv", try_parse_dates=True)

transaction_id,region,product_category,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
str,str,str,str,date,str,i64,f64,f64
"""T0000""","""North""","""Electronics""","""Headphones""",2025-02-05,"""Online""",3,126.22,378.66
"""T0001""","""West""","""Electronics""","""TV""",2025-05-10,"""Online""",1,113.23,113.23
"""T0002""","""West""","""Clothing""","""Jeans""",2025-01-02,"""Online""",5,142.7,713.5
"""T0003""","""North""","""Toys""","""Puzzle""",2025-04-04,"""Online""",4,319.66,1278.64
"""T0004""","""East""","""Clothing""","""Shirt""",2025-04-08,"""Online""",3,55.59,166.77
…,…,…,…,…,…,…,…,…
"""T0995""","""North""","""Electronics""","""Laptop""",2025-01-17,"""Online""",2,341.58,683.16
"""T0996""","""South""","""Clothing""","""Shirt""",2025-03-25,"""In-Store""",4,258.69,1034.76
"""T0997""","""South""","""Toys""","""Puzzle""",2025-01-28,"""In-Store""",1,120.04,120.04
"""T0998""","""North""","""Toys""","""RC Car""",2025-02-24,"""In-Store""",3,10.93,32.79


- Several columns have a small number of unique values, so we can convert them to enums.

In [3]:
pl.read_csv("retail_sales.csv", try_parse_dates=True).select(pl.all().n_unique())

transaction_id,region,product_category,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
u32,u32,u32,u32,u32,u32,u32,u32,u32
1000,4,4,12,181,2,8,989,997


In [4]:
pl.read_csv("retail_sales.csv", try_parse_dates=True).select(
    pl.col("sales_channel").unique()
)

sales_channel
str
"""Online"""
"""In-Store"""


In [5]:
region_enum = pl.Enum(["East", "West", "North", "South"])
product_category_enum = pl.Enum(["Home", "Electronics", "Toys", "Clothing"])
sales_channel_enum = pl.Enum(["In-Store", "Online"])

sales = pl.read_csv(
    "retail_sales.csv",
    try_parse_dates=True,
    schema_overrides={
        "region": region_enum,
        "product_category": product_category_enum,
        "sales_channel": sales_channel_enum,
    },
)

sales.head(3)

transaction_id,region,product_category,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
str,enum,enum,str,date,enum,i64,f64,f64
"""T0000""","""North""","""Electronics""","""Headphones""",2025-02-05,"""Online""",3,126.22,378.66
"""T0001""","""West""","""Electronics""","""TV""",2025-05-10,"""Online""",1,113.23,113.23
"""T0002""","""West""","""Clothing""","""Jeans""",2025-01-02,"""Online""",5,142.7,713.5


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/api/polars.read_csv.html
- https://docs.pola.rs/api/python/stable/reference/api/polars.datatypes.Enum.html

## Intro to Grouping
- Grouping simply means "placing into groups".
- The value of grouping is performing aggregate calculations on subsets of data.
- Grouping consists of three steps (split, apply, combine):
    - **Splitting** the data into groups based on distinct column values
    - **Applying** an aggregation function to each group
    - **Combining** the results into a new `DataFrame`
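- As a rough illustration of the three steps, here is a minimal plain-Python sketch. The sample rows and variable names are hypothetical and not part of the dataset; Polars handles these steps for us when we group.

```python
from collections import defaultdict

# Hypothetical sample rows: (product_category, quantity)
rows = [
    ("Toys", 4),
    ("Electronics", 3),
    ("Toys", 1),
    ("Clothing", 5),
    ("Electronics", 1),
]

# Split: collect the quantities for each distinct category
groups = defaultdict(list)
for category, quantity in rows:
    groups[category].append(quantity)

# Apply: run an aggregation (here, sum) on each group
totals = {category: sum(quantities) for category, quantities in groups.items()}

# Combine: assemble one result row per group
result = sorted(totals.items())
print(result)
```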

## The group_by Method
- The `group_by` method accepts the column(s) whose unique values will determine the groups.
- The `group_by` method returns a `GroupBy` object.
- Polars will create a group for each distinct column value. This is the **splitting** step.
- The `GroupBy` object is a container for all the split `DataFrames` (one `DataFrame` per unique value).

In [6]:
region_enum = pl.Enum(["East", "West", "North", "South"])
product_category_enum = pl.Enum(["Home", "Electronics", "Toys", "Clothing"])
sales_channel_enum = pl.Enum(["In-Store", "Online"])

sales = pl.read_csv(
    "retail_sales.csv",
    try_parse_dates=True,
    schema_overrides={
        "region": region_enum,
        "product_category": product_category_enum,
        "sales_channel": sales_channel_enum,
    },
)

sales.head(3)

transaction_id,region,product_category,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
str,enum,enum,str,date,enum,i64,f64,f64
"""T0000""","""North""","""Electronics""","""Headphones""",2025-02-05,"""Online""",3,126.22,378.66
"""T0001""","""West""","""Electronics""","""TV""",2025-05-10,"""Online""",1,113.23,113.23
"""T0002""","""West""","""Clothing""","""Jeans""",2025-01-02,"""Online""",5,142.7,713.5


- The next example creates groups using the values in the `product_category` column.
- The `GroupBy` object will store 4 groups, one for each product category (`Electronics`, `Clothing`, `Home`, and `Toys`).

In [7]:
groups = sales.group_by("product_category")
groups

<polars.dataframe.group_by.GroupBy at 0x107e55160>

- The `GroupBy` object is not particularly helpful by itself but supports many methods for aggregation.
- The `len` method returns a `DataFrame` with the number of rows per group.
- The value `Clothing` appears in 236 rows in `sales`, `Electronics` appears in 230 rows, and so on.

In [8]:
groups.len()

product_category,len
enum,u32
"""Toys""",268
"""Electronics""",230
"""Clothing""",236
"""Home""",266


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.group_by.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.len.html

## GroupBy Methods
- Invoking a method on the `GroupBy` object applies a calculation to _each_ group, then combines the results in a new `DataFrame`.

In [9]:
region_enum = pl.Enum(["East", "West", "North", "South"])
product_category_enum = pl.Enum(["Home", "Electronics", "Toys", "Clothing"])
sales_channel_enum = pl.Enum(["In-Store", "Online"])

sales = pl.read_csv(
    "retail_sales.csv",
    try_parse_dates=True,
    schema_overrides={
        "region": region_enum,
        "product_category": product_category_enum,
        "sales_channel": sales_channel_enum,
    },
)

groups = sales.group_by("product_category")

- The `first` method returns the first row from each group.
- Polars will place the group column (`product_category`) at the start of the `DataFrame`.
- Notice that invoking the method multiple times may return the rows in a different order each time.
- The `DataFrame` always has the same 4 rows (one for each product category), but their order can vary.

In [10]:
groups.first()

product_category,transaction_id,region,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
enum,str,enum,str,date,enum,i64,f64,f64
"""Clothing""","""T0002""","""West""","""Jeans""",2025-01-02,"""Online""",5,142.7,713.5
"""Home""","""T0005""","""South""","""Couch""",2025-02-13,"""Online""",1,306.52,306.52
"""Electronics""","""T0000""","""North""","""Headphones""",2025-02-05,"""Online""",3,126.22,378.66
"""Toys""","""T0003""","""North""","""Puzzle""",2025-04-04,"""Online""",4,319.66,1278.64


- The first row in each group matches its appearance in the original `DataFrame`.

In [11]:
sales.head(5)

transaction_id,region,product_category,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
str,enum,enum,str,date,enum,i64,f64,f64
"""T0000""","""North""","""Electronics""","""Headphones""",2025-02-05,"""Online""",3,126.22,378.66
"""T0001""","""West""","""Electronics""","""TV""",2025-05-10,"""Online""",1,113.23,113.23
"""T0002""","""West""","""Clothing""","""Jeans""",2025-01-02,"""Online""",5,142.7,713.5
"""T0003""","""North""","""Toys""","""Puzzle""",2025-04-04,"""Online""",4,319.66,1278.64
"""T0004""","""East""","""Clothing""","""Shirt""",2025-04-08,"""Online""",3,55.59,166.77


- Polars stores the _rows_ within each group in the order they appear in the original `DataFrame`.
- By default, Polars does not maintain the order of the _groups_ in the output.
- It is the _groups_ that appear in a seemingly random order each time we run the code.
- Behind the scenes, Polars parallelizes the extraction of one row from each group.
- The order in which the results evaluate may vary from execution to execution.
- Polars thus does not guarantee a consistent _order_ of the rows in the output. The rows themselves will be the same, however.

### The maintain_order Parameter
- By default, the `group_by` method does not guarantee that the order of groups matches the order of the values in the `DataFrame`.
- The `maintain_order` parameter forces the `GroupBy` object to store the groups in a consistent order.
- For aggregation operations, setting `maintain_order` to `True` has a cost.
- Keeping the groups in a consistent order limits parallelization, so it is slower than the default.
- With `maintain_order=True`, the group order follows the order in which each distinct value first appears in the `product_category` column.
- The order will be `"Electronics"`, `"Clothing"`, `"Toys"`, and `"Home"`.

In [12]:
groups = sales.group_by("product_category", maintain_order=True)

- The order of the groups is now consistent across runs.

In [13]:
groups.first()

product_category,transaction_id,region,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
enum,str,enum,str,date,enum,i64,f64,f64
"""Electronics""","""T0000""","""North""","""Headphones""",2025-02-05,"""Online""",3,126.22,378.66
"""Clothing""","""T0002""","""West""","""Jeans""",2025-01-02,"""Online""",5,142.7,713.5
"""Toys""","""T0003""","""North""","""Puzzle""",2025-04-04,"""Online""",4,319.66,1278.64
"""Home""","""T0005""","""South""","""Couch""",2025-02-13,"""Online""",1,306.52,306.52


- The `last` method returns the last row from each group.

In [14]:
groups.last()

product_category,transaction_id,region,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
enum,str,enum,str,date,enum,i64,f64,f64
"""Electronics""","""T0995""","""North""","""Laptop""",2025-01-17,"""Online""",2,341.58,683.16
"""Clothing""","""T0996""","""South""","""Shirt""",2025-03-25,"""In-Store""",4,258.69,1034.76
"""Toys""","""T0998""","""North""","""RC Car""",2025-02-24,"""In-Store""",3,10.93,32.79
"""Home""","""T0999""","""East""","""Vacuum""",2025-04-16,"""In-Store""",1,443.3,443.3


- The `head` method returns a specified number of rows from the top of each group.

In [15]:
groups.head(2)

product_category,transaction_id,region,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
enum,str,enum,str,date,enum,i64,f64,f64
"""Electronics""","""T0000""","""North""","""Headphones""",2025-02-05,"""Online""",3,126.22,378.66
"""Electronics""","""T0001""","""West""","""TV""",2025-05-10,"""Online""",1,113.23,113.23
"""Clothing""","""T0002""","""West""","""Jeans""",2025-01-02,"""Online""",5,142.7,713.5
"""Clothing""","""T0004""","""East""","""Shirt""",2025-04-08,"""Online""",3,55.59,166.77
"""Toys""","""T0003""","""North""","""Puzzle""",2025-04-04,"""Online""",4,319.66,1278.64
"""Toys""","""T0009""","""East""","""Puzzle""",2025-05-26,"""In-Store""",1,109.44,109.44
"""Home""","""T0005""","""South""","""Couch""",2025-02-13,"""Online""",1,306.52,306.52
"""Home""","""T0008""","""West""","""Couch""",2025-03-25,"""Online""",5,113.71,568.55


- The `tail` method returns a specified number of rows from the end of each group.

In [16]:
groups.tail(2)

product_category,transaction_id,region,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
enum,str,enum,str,date,enum,i64,f64,f64
"""Electronics""","""T0993""","""North""","""Laptop""",2025-06-03,"""Online""",3,224.77,674.31
"""Electronics""","""T0995""","""North""","""Laptop""",2025-01-17,"""Online""",2,341.58,683.16
"""Clothing""","""T0994""","""West""","""Shirt""",2025-02-21,"""Online""",3,454.83,1364.49
"""Clothing""","""T0996""","""South""","""Shirt""",2025-03-25,"""In-Store""",4,258.69,1034.76
"""Toys""","""T0997""","""South""","""Puzzle""",2025-01-28,"""In-Store""",1,120.04,120.04
"""Toys""","""T0998""","""North""","""RC Car""",2025-02-24,"""In-Store""",3,10.93,32.79
"""Home""","""T0992""","""North""","""Vacuum""",2025-02-25,"""Online""",4,182.25,729.0
"""Home""","""T0999""","""East""","""Vacuum""",2025-04-16,"""In-Store""",1,443.3,443.3


### Further Reading
- https://docs.pola.rs/user-guide/getting-started/#group_by
- https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#group_by-and-aggregations
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.first.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.last.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.head.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.tail.html

## Largest and Smallest Values per Group

In [17]:
region_enum = pl.Enum(["East", "West", "North", "South"])
product_category_enum = pl.Enum(["Home", "Electronics", "Toys", "Clothing"])
sales_channel_enum = pl.Enum(["In-Store", "Online"])

sales = pl.read_csv(
    "retail_sales.csv",
    try_parse_dates=True,
    schema_overrides={
        "region": region_enum,
        "product_category": product_category_enum,
        "sales_channel": sales_channel_enum,
    },
)

groups = sales.group_by("product_category", maintain_order=True)

sales.head(3)

transaction_id,region,product_category,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
str,enum,enum,str,date,enum,i64,f64,f64
"""T0000""","""North""","""Electronics""","""Headphones""",2025-02-05,"""Online""",3,126.22,378.66
"""T0001""","""West""","""Electronics""","""TV""",2025-05-10,"""Online""",1,113.23,113.23
"""T0002""","""West""","""Clothing""","""Jeans""",2025-01-02,"""Online""",5,142.7,713.5


- The `max` method returns the largest value _per column_ for each group.
- These are _not_ single rows from the original `sales` `DataFrame`.
- For the product category of `"Toys"`, `"T0998"` is the greatest `transaction_id`, 10 is the greatest `quantity`, and so on.

In [18]:
groups.max()

product_category,transaction_id,region,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
enum,str,enum,str,date,enum,i64,f64,f64
"""Electronics""","""T0995""","""South""","""TV""",2025-06-30,"""Online""",6,497.98,2487.85
"""Clothing""","""T0996""","""South""","""Shirt""",2025-06-30,"""Online""",5,498.0,2437.0
"""Toys""","""T0998""","""South""","""RC Car""",2025-06-30,"""Online""",10,498.68,2470.05
"""Home""","""T0999""","""South""","""Vacuum""",2025-06-30,"""Online""",8,496.38,2478.85


- Let's take a look at the row with a `transaction_id` of `"T0999"`. Notice that its values do not match the `"Home"` row above.

In [19]:
sales.filter(pl.col("transaction_id") == "T0999")

transaction_id,region,product_category,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
str,enum,enum,str,date,enum,i64,f64,f64
"""T0999""","""East""","""Home""","""Vacuum""",2025-04-16,"""In-Store""",1,443.3,443.3


- The `min` method returns the smallest value per column for each group.

In [20]:
groups.min()

product_category,transaction_id,region,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
enum,str,enum,str,date,enum,i64,f64,f64
"""Electronics""","""T0000""","""East""","""Headphones""",2025-01-02,"""In-Store""",1,6.75,11.59
"""Clothing""","""T0002""","""East""","""Jacket""",2025-01-02,"""In-Store""",1,8.89,19.06
"""Toys""","""T0003""","""East""","""Doll""",2025-01-01,"""In-Store""",1,6.61,15.12
"""Home""","""T0005""","""East""","""Couch""",2025-01-01,"""In-Store""",1,5.27,10.54


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.max.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.min.html

## Aggregations with the agg Method
- The methods from the previous lesson performed aggregation calculations on _every_ column in each group.
- The `agg` method targets specific column(s) for group calculations.

In [21]:
region_enum = pl.Enum(["East", "West", "North", "South"])
product_category_enum = pl.Enum(["Home", "Electronics", "Toys", "Clothing"])
sales_channel_enum = pl.Enum(["In-Store", "Online"])

sales = pl.read_csv(
    "retail_sales.csv",
    try_parse_dates=True,
    schema_overrides={
        "region": region_enum,
        "product_category": product_category_enum,
        "sales_channel": sales_channel_enum,
    },
)

groups = sales.group_by("product_category", maintain_order=True)

sales.head(3)

transaction_id,region,product_category,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
str,enum,enum,str,date,enum,i64,f64,f64
"""T0000""","""North""","""Electronics""","""Headphones""",2025-02-05,"""Online""",3,126.22,378.66
"""T0001""","""West""","""Electronics""","""TV""",2025-05-10,"""Online""",1,113.23,113.23
"""T0002""","""West""","""Clothing""","""Jeans""",2025-01-02,"""Online""",5,142.7,713.5


- The `agg` method accepts one or more expressions.
- Let's calculate the largest value in the `quantity` column for each product category.
- The results are the same as invoking `groups.max` without all the extra columns.

In [22]:
sales.head(2)

transaction_id,region,product_category,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
str,enum,enum,str,date,enum,i64,f64,f64
"""T0000""","""North""","""Electronics""","""Headphones""",2025-02-05,"""Online""",3,126.22,378.66
"""T0001""","""West""","""Electronics""","""TV""",2025-05-10,"""Online""",1,113.23,113.23


In [23]:
groups.agg(pl.col("quantity").max().alias("largest_quantity"))

product_category,largest_quantity
enum,i64
"""Electronics""",6
"""Clothing""",5
"""Toys""",10
"""Home""",8


- Multiple expressions are welcome! As always, all column names in the new `DataFrame` must be unique.

In [24]:
groups.agg(
    pl.col("quantity").max().alias("largest_quantity"),
    pl.col("quantity").min().alias("smallest_quantity"),
    pl.col("quantity").mean().alias("average_quantity"),
    pl.col("quantity").median().alias("median_quantity"),
)

product_category,largest_quantity,smallest_quantity,average_quantity,median_quantity
enum,i64,i64,f64,f64
"""Electronics""",6,1,2.943478,3.0
"""Clothing""",5,1,3.004237,3.0
"""Toys""",10,1,3.130597,3.0
"""Home""",8,1,2.988722,3.0


- The `sum` method adds the values in the specified column for each group.

In [25]:
groups.agg(pl.col("total_price").sum().alias("total_spend_per_category"))

product_category,total_spend_per_category
enum,f64
"""Electronics""",168243.25
"""Clothing""",172431.2
"""Toys""",212250.02
"""Home""",193942.34


In [26]:
sales.filter(pl.col("product_category") == "Toys").select(pl.col("total_price").sum())

total_price
f64
212250.02


### Further Reading
- https://docs.pola.rs/user-guide/expressions/aggregation/#basic-aggregations
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.agg.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.max.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.min.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.mean.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.median.html

## Using Selectors in Aggregations
- Top-level methods on the `GroupBy` object (such as `max` and `min`) aggregate every column, including columns that we do not care about.
- Using the `agg` method to target individual columns for aggregations is fine but can get verbose.
- The `selectors` submodule can be helpful in targeting all columns of a specific data type.

In [27]:
import polars.selectors as cs

In [28]:
region_enum = pl.Enum(["East", "West", "North", "South"])
product_category_enum = pl.Enum(["Home", "Electronics", "Toys", "Clothing"])
sales_channel_enum = pl.Enum(["In-Store", "Online"])

sales = pl.read_csv(
    "retail_sales.csv",
    try_parse_dates=True,
    schema_overrides={
        "region": region_enum,
        "product_category": product_category_enum,
        "sales_channel": sales_channel_enum,
    },
)

groups = sales.group_by("product_category", maintain_order=True)

sales.head(3)

transaction_id,region,product_category,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
str,enum,enum,str,date,enum,i64,f64,f64
"""T0000""","""North""","""Electronics""","""Headphones""",2025-02-05,"""Online""",3,126.22,378.66
"""T0001""","""West""","""Electronics""","""TV""",2025-05-10,"""Online""",1,113.23,113.23
"""T0002""","""West""","""Clothing""","""Jeans""",2025-01-02,"""Online""",5,142.7,713.5


- Let's calculate the sum of the values in every numeric column for each group.
- We'll use `cs.numeric` to create a selector targeting all numeric columns (integers and floats).
- We'll then calculate the sum of the values per product category.

In [29]:
groups.agg(cs.numeric().sum())

product_category,quantity,unit_price,total_price
enum,i64,f64,f64
"""Electronics""",677,56478.13,168243.25
"""Clothing""",709,58964.64,172431.2
"""Toys""",839,67275.65,212250.02
"""Home""",795,66723.32,193942.34


- Review: The `name.suffix` method concatenates a consistent string to the end of each column name.
- It offers a shortcut to avoid naming each column individually.

In [30]:
groups.agg(cs.numeric().sum().name.suffix("_sum"))

product_category,quantity_sum,unit_price_sum,total_price_sum
enum,i64,f64,f64
"""Electronics""",677,56478.13,168243.25
"""Clothing""",709,58964.64,172431.2
"""Toys""",839,67275.65,212250.02
"""Home""",795,66723.32,193942.34


- Aggregation functions like `sum` or `mean` evaluate to Polars expressions.
- We can mix-and-match the expressions passed to the `agg` method.
- The next example calculates the sum and average of values in all numeric columns.

In [31]:
groups.agg(
    cs.numeric().sum().name.suffix("_sum"), cs.numeric().mean().name.suffix("_average")
)

product_category,quantity_sum,unit_price_sum,total_price_sum,quantity_average,unit_price_average,total_price_average
enum,i64,f64,f64,f64,f64,f64
"""Electronics""",677,56478.13,168243.25,2.943478,245.557087,731.492391
"""Clothing""",709,58964.64,172431.2,3.004237,249.850169,730.640678
"""Toys""",839,67275.65,212250.02,3.130597,251.028545,791.977687
"""Home""",795,66723.32,193942.34,2.988722,250.839549,729.106541


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.numeric
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.name.suffix.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.mean.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.sum.html

## Grouping with Multiple Columns
- Pass a list of column names to `group_by` to group by multiple columns.
- Use the column with the smaller number of unique values as the outer/first group.

In [32]:
region_enum = pl.Enum(["East", "West", "North", "South"])
product_category_enum = pl.Enum(["Home", "Electronics", "Toys", "Clothing"])
sales_channel_enum = pl.Enum(["In-Store", "Online"])

sales = pl.read_csv(
    "retail_sales.csv",
    try_parse_dates=True,
    schema_overrides={
        "region": region_enum,
        "product_category": product_category_enum,
        "sales_channel": sales_channel_enum,
    },
)

groups = sales.group_by(["sales_channel", "region"], maintain_order=True)

sales.head(3)

transaction_id,region,product_category,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
str,enum,enum,str,date,enum,i64,f64,f64
"""T0000""","""North""","""Electronics""","""Headphones""",2025-02-05,"""Online""",3,126.22,378.66
"""T0001""","""West""","""Electronics""","""TV""",2025-05-10,"""Online""",1,113.23,113.23
"""T0002""","""West""","""Clothing""","""Jeans""",2025-01-02,"""Online""",5,142.7,713.5


In [33]:
groups.agg(
    pl.col("total_price").mean().alias("average_spend_per_region_and_sales_channel"),
    pl.col("quantity").sum().alias("total_sum_per_region_and_sales_channel"),
)

sales_channel,region,average_spend_per_region_and_sales_channel,total_sum_per_region_and_sales_channel
enum,enum,f64,i64
"""Online""","""North""",770.837029,402
"""Online""","""West""",717.891985,395
"""Online""","""East""",738.563772,363
"""Online""","""South""",764.381031,313
"""In-Store""","""East""",710.9988,346
"""In-Store""","""West""",754.072615,402
"""In-Store""","""South""",799.673917,365
"""In-Store""","""North""",725.800414,434


### Further Reading
- https://docs.pola.rs/user-guide/expressions/aggregation/#nested-grouping
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.group_by.html

## Grouping Temporal Data
- The `group_by` method uses distinct values to define the groups.
- Grouping datetime values by exact time is usually not valuable because of the amount of variation.
- A time window is a span of time (e.g., 1 week).
- The `group_by_dynamic` method creates groups using a time window.
- The `group_by_dynamic` method looks for the inclusion of a datetime value within the time window/range.
- For example, a window of time from January 1st to January 7th will include a row whose date is January 5th.

In [34]:
region_enum = pl.Enum(["East", "West", "North", "South"])
product_category_enum = pl.Enum(["Home", "Electronics", "Toys", "Clothing"])
sales_channel_enum = pl.Enum(["In-Store", "Online"])

sales = pl.read_csv(
    "retail_sales.csv",
    try_parse_dates=True,
    schema_overrides={
        "region": region_enum,
        "product_category": product_category_enum,
        "sales_channel": sales_channel_enum,
    },
).sort("purchase_date")

sales.head(3)

transaction_id,region,product_category,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
str,enum,enum,str,date,enum,i64,f64,f64
"""T0763""","""East""","""Toys""","""RC Car""",2025-01-01,"""Online""",4,116.14,464.56
"""T0931""","""North""","""Home""","""Couch""",2025-01-01,"""In-Store""",5,442.14,2210.7
"""T0002""","""West""","""Clothing""","""Jeans""",2025-01-02,"""Online""",5,142.7,713.5


- Make sure to _sort_ the date/datetime column whose values will be grouped.
- The `every` parameter accepts an interval of time.
- The next example provides `1w` for 1 week but we can use symbols to designate other durations.
- By default, Polars finds the earliest date, then buckets it into a weekly window starting on Monday.
- The earliest date in `sales` is 01/01/2025, a Wednesday.
- The first group thus starts from Monday that week, 12/30/2024.
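- The Monday alignment can be checked with the standard library. This is a minimal sketch; `week_start` is a hypothetical helper, not a Polars function.

```python
from datetime import date, timedelta


def week_start(day: date) -> date:
    """Return the Monday on or before `day` (hypothetical helper)."""
    # date.weekday() is 0 for Monday and 6 for Sunday
    return day - timedelta(days=day.weekday())


earliest = date(2025, 1, 1)
first_window = week_start(earliest)

print(earliest.strftime("%A"))
print(first_window)
```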


In [35]:
groups = sales.group_by_dynamic("purchase_date", every="1w")

- Let's calculate the total sales revenue (the sum of `total_price`) for each 1-week group.

In [36]:
groups.agg(pl.col("total_price").sum())

purchase_date,total_price
date,f64
2024-12-30,15719.96
2025-01-06,26694.13
2025-01-13,21710.04
2025-01-20,25602.3
2025-01-27,32929.04
…,…
2025-06-02,34910.79
2025-06-09,38389.9
2025-06-16,30332.83
2025-06-23,32020.89


- Polars then creates the range of weeks until it reaches the last date in the dataset.
- The `start_by` argument accepts a string with a specified weekday.
- The next example asks Polars to start a week/range on Wednesdays.
- Each group/bucket now goes from Wednesday-Tuesday.

In [37]:
sales.group_by_dynamic("purchase_date", every="1w", start_by="wednesday").agg(
    pl.col("total_price").sum()
)

purchase_date,total_price
date,f64
2025-01-01,21703.82
2025-01-08,27109.7
2025-01-15,22062.4
2025-01-22,30232.26
2025-01-29,33712.96
…,…
2025-05-28,29783.05
2025-06-04,41570.55
2025-06-11,28358.12
2025-06-18,34521.57


- Alternatively, we can pass the `start_by` parameter a value of `datapoint`.
- Polars will create 1-week buckets starting from the first datapoint, `2025-01-01`.

In [38]:
sales.group_by_dynamic("purchase_date", every="1w", start_by="datapoint").agg(
    pl.col("total_price").sum()
)

purchase_date,total_price
date,f64
2025-01-01,21703.82
2025-01-08,27109.7
2025-01-15,22062.4
2025-01-22,30232.26
2025-01-29,33712.96
…,…
2025-05-28,29783.05
2025-06-04,41570.55
2025-06-11,28358.12
2025-06-18,34521.57


- The `mo` symbol stands for month.
- The next example creates two-month groupings (`2mo`).
- Polars is smart enough to account for differences in month lengths.

In [39]:
sales.group_by_dynamic("purchase_date", every="2mo", start_by="datapoint").agg(
    pl.col("total_price").sum()
)

purchase_date,total_price
date,f64
2025-01-01,232185.19
2025-03-01,240708.59
2025-05-01,273973.03


- The next example groups by quarter.

In [40]:
sales.group_by_dynamic("purchase_date", every="3mo", start_by="datapoint").agg(
    pl.col("total_price").sum()
)

purchase_date,total_price
date,f64
2025-01-01,358463.96
2025-04-01,388402.85


- We can mix and match different time specifications.
- The next example creates groups 1 month, 2 weeks, and 3 days apart.

In [41]:
sales.group_by_dynamic("purchase_date", every="1mo2w3d", start_by="datapoint").agg(
    pl.col("total_price").sum()
)

purchase_date,total_price
date,f64
2025-01-01,193152.24
2025-02-18,175349.67
2025-04-04,184713.01
2025-05-21,193651.89


- There are additional time window configuration options:
    - non-overlapping (each window ends where the next one begins)
    - overlapping (windows can overlap one another)
    - gapped (no overlap, with gaps between subsequent windows)

### Further Reading
- https://docs.pola.rs/user-guide/transformations/time-series/rolling/#parameters-for-group_by_dynamic
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.group_by_dynamic.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.agg.html

## Window Functions and the over Method

In [42]:
region_enum = pl.Enum(["East", "West", "North", "South"])
product_category_enum = pl.Enum(["Home", "Electronics", "Toys", "Clothing"])
sales_channel_enum = pl.Enum(["In-Store", "Online"])

sales = pl.read_csv(
    "retail_sales.csv",
    try_parse_dates=True,
    schema_overrides={
        "region": region_enum,
        "product_category": product_category_enum,
        "sales_channel": sales_channel_enum,
    },
)

sales.head(3)

transaction_id,region,product_category,product_id,purchase_date,sales_channel,quantity,unit_price,total_price
str,enum,enum,str,date,enum,i64,f64,f64
"""T0000""","""North""","""Electronics""","""Headphones""",2025-02-05,"""Online""",3,126.22,378.66
"""T0001""","""West""","""Electronics""","""TV""",2025-05-10,"""Online""",1,113.23,113.23
"""T0002""","""West""","""Clothing""","""Jeans""",2025-01-02,"""Online""",5,142.7,713.5


- The `rank` method assigns a rank to each column value.
- A descending rank assigns the highest column value a rank of 1, the second highest a rank of 2, and so on.
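- The `method` parameter controls how ties (multiple equal values) are ranked. The default, `"average"`, assigns tied values the mean of the ranks they span.
- The default method returns floats, so the next example casts the rank to an unsigned integer.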
- The `rank` method doesn't account for groups. It's a ranking across the whole `DataFrame`.

In [43]:
sales.with_columns(
    pl.col("total_price").rank(descending=True).cast(pl.UInt32).alias("rank")
).sort("rank")

transaction_id,region,product_category,product_id,purchase_date,sales_channel,quantity,unit_price,total_price,rank
str,enum,enum,str,date,enum,i64,f64,f64,u32
"""T0189""","""West""","""Electronics""","""Laptop""",2025-02-16,"""In-Store""",5,497.57,2487.85,1
"""T0586""","""West""","""Home""","""Lamp""",2025-03-25,"""In-Store""",5,495.77,2478.85,2
"""T0055""","""South""","""Toys""","""RC Car""",2025-06-02,"""In-Store""",5,494.01,2470.05,3
"""T0142""","""East""","""Clothing""","""Jeans""",2025-06-09,"""Online""",5,487.4,2437.0,4
"""T0608""","""North""","""Clothing""","""Shirt""",2025-05-24,"""Online""",5,483.56,2417.8,5
…,…,…,…,…,…,…,…,…,…
"""T0734""","""East""","""Home""","""Vacuum""",2025-06-21,"""In-Store""",1,12.89,12.89,996
"""T0340""","""North""","""Electronics""","""Headphones""",2025-02-25,"""Online""",1,11.83,11.83,997
"""T0862""","""North""","""Electronics""","""Laptop""",2025-06-17,"""In-Store""",1,11.59,11.59,998
"""T0536""","""West""","""Home""","""Couch""",2025-02-23,"""In-Store""",1,10.85,10.85,999


- Say that we want to rank the transactions _within_ each region. 
- We want the highest-value transaction in each of the `"West"`, `"North"`, `"South"`, and `"East"` regions to have a rank of 1.
- The `over` method specifies the ranking should be done _separately_ for each group.
- We ask Polars to rank the numeric values in `total_price` over the groups defined by `region`.
- It's called a "window function" because each calculation operates over a window (a subset of rows) rather than the whole dataset.
- The window is effectively each "group". Each group has a subset of rows that we want to perform calculations on.

In [44]:
sales.with_columns(
    pl.col("total_price")
    .rank(descending=True)
    .over("region")
    .cast(pl.UInt32)
    .alias("rank")
).sort("rank")

transaction_id,region,product_category,product_id,purchase_date,sales_channel,quantity,unit_price,total_price,rank
str,enum,enum,str,date,enum,i64,f64,f64,u32
"""T0055""","""South""","""Toys""","""RC Car""",2025-06-02,"""In-Store""",5,494.01,2470.05,1
"""T0142""","""East""","""Clothing""","""Jeans""",2025-06-09,"""Online""",5,487.4,2437.0,1
"""T0189""","""West""","""Electronics""","""Laptop""",2025-02-16,"""In-Store""",5,497.57,2487.85,1
"""T0608""","""North""","""Clothing""","""Shirt""",2025-05-24,"""Online""",5,483.56,2417.8,1
"""T0457""","""North""","""Toys""","""Puzzle""",2025-02-04,"""In-Store""",5,478.31,2391.55,2
…,…,…,…,…,…,…,…,…,…
"""T0998""","""North""","""Toys""","""RC Car""",2025-02-24,"""In-Store""",3,10.93,32.79,279
"""T0719""","""North""","""Home""","""Lamp""",2025-03-02,"""Online""",4,7.26,29.04,280
"""T0304""","""North""","""Home""","""Vacuum""",2025-03-29,"""Online""",1,16.18,16.18,281
"""T0340""","""North""","""Electronics""","""Headphones""",2025-02-25,"""Online""",1,11.83,11.83,282


- We can pass multiple columns to the `over` method.
- The window here targets every combination of `sales_channel` and `region`.
- There are 8 possible combinations of `sales_channel` (2 options) and `region` (4 options).
- As a result, 8 rows will have a rank of 1, 8 rows will have a rank of 2, and so on.

In [45]:
sales.with_columns(
    pl.col("total_price")
    .rank(descending=True)
    .over("region", "sales_channel")
    .cast(pl.UInt32)
    .alias("rank")
).sort("rank", "sales_channel")

transaction_id,region,product_category,product_id,purchase_date,sales_channel,quantity,unit_price,total_price,rank
str,enum,enum,str,date,enum,i64,f64,f64,u32
"""T0055""","""South""","""Toys""","""RC Car""",2025-06-02,"""In-Store""",5,494.01,2470.05,1
"""T0457""","""North""","""Toys""","""Puzzle""",2025-02-04,"""In-Store""",5,478.31,2391.55,1
"""T0189""","""West""","""Electronics""","""Laptop""",2025-02-16,"""In-Store""",5,497.57,2487.85,1
"""T0624""","""East""","""Electronics""","""Headphones""",2025-04-29,"""In-Store""",5,452.32,2261.6,1
"""T0142""","""East""","""Clothing""","""Jeans""",2025-06-09,"""Online""",5,487.4,2437.0,1
…,…,…,…,…,…,…,…,…,…
"""T0583""","""North""","""Clothing""","""Jacket""",2025-03-01,"""In-Store""",1,55.11,55.11,141
"""T0821""","""North""","""Clothing""","""Jeans""",2025-04-11,"""In-Store""",4,8.89,35.56,142
"""T0133""","""North""","""Home""","""Lamp""",2025-05-03,"""In-Store""",1,35.11,35.11,143
"""T0998""","""North""","""Toys""","""RC Car""",2025-02-24,"""In-Store""",3,10.93,32.79,144


### Further Reading
- https://docs.pola.rs/user-guide/expressions/window-functions/#operations-per-group
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.rank.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cast.html