# `groupby_agg`: Shortcut for assigning a groupby-transform to a new column

## Background

This notebook serves to show how to use the `groupby_agg` method from pyjanitor's general functions submodule.

The `groupby_agg` method allows us to add the result of an aggregation from a grouping, as a new column, back to the dataframe.

In plain pandas, adding a group-level aggregate back to a dataframe as a new column takes three steps:
1. Group by one or more columns.
2. Apply the `transform` method with an aggregate function to the grouping.
3. Assign the result of the transform to a new column in the dataframe.

In pseudo-code, this might look something like:
```python
df = df.assign(
    new_column_name=df.groupby(...)[...].transform(...)
)
```
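
As a concrete instance of the pattern above, here is a minimal sketch in plain pandas. The `sales` dataframe and its column names are hypothetical and exist only for illustration; each of the three steps is written on its own line.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the pattern.
sales = pd.DataFrame(
    {
        "store": ["a", "a", "b", "b", "b"],
        "revenue": [10, 30, 5, 15, 40],
    }
)

# Step 1: group by a column.
grouped = sales.groupby("store")

# Step 2: transform with an aggregate function. The result is aligned
# to the original index, so it holds one value per input row.
store_total = grouped["revenue"].transform("sum")

# Step 3: assign the transformed values to a new column.
sales = sales.assign(store_total=store_total)
print(sales)
```

Every row of store `a` receives that store's total, and likewise for store `b`, so the row count is unchanged.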

The `groupby_agg` method achieves the same result in a single function call, with explicitly named arguments. The examples below illustrate its use.

In [1]:
# load modules
import pandas as pd
import numpy as np
import janitor

## Examples

### Basic example

We start off with a simple example.
Given a `df` as defined below, we wish to use `groupby_agg` to find the average price for each item, and join the results back to the original dataframe.

In [2]:
df = pd.DataFrame(
    {
        "item": ["shoe", "shoe", "bag", "shoe", "bag"],
        "MRP": [220, 450, 320, 200, 305],
        "number_sold": [100, 40, 56, 38, 25],
    }
)
df

|   | item | MRP | number_sold |
|---|------|-----|-------------|
| 0 | shoe | 220 | 100 |
| 1 | shoe | 450 | 40 |
| 2 | bag | 320 | 56 |
| 3 | shoe | 200 | 38 |
| 4 | bag | 305 | 25 |


Note that the output of `groupby_agg` contains the same number of rows as the input dataframe, i.e., the operation here is a groupby + transform.

Here, `by` is the name(s) of the column(s) to group over, and `agg` is the aggregate function (e.g. `sum`, `mean`, `count`) that is applied to the data in the column specified by `agg_column_name`.
Finally, `new_column_name` is the name of the newly added column containing the transformed values.

In [3]:
df = df.groupby_agg(
    by="item",
    agg="mean",
    agg_column_name="MRP",
    new_column_name="Avg_MRP",
)
df

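|   | item | MRP | number_sold | Avg_MRP |
|---|------|-----|-------------|---------|
| 0 | shoe | 220 | 100 | 290.0 |
| 1 | shoe | 450 | 40 | 290.0 |
| 2 | bag | 320 | 56 | 312.5 |
| 3 | shoe | 200 | 38 | 290.0 |
| 4 | bag | 305 | 25 | 312.5 |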
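
To make the role of each argument concrete, the sketch below mimics the call with plain pandas. The helper `groupby_agg_sketch` is a hypothetical stand-in written for this illustration, not pyjanitor's actual implementation; it only shows how the four arguments could map onto a groupby, a transform, and an assign.

```python
import pandas as pd


def groupby_agg_sketch(df, by, agg, agg_column_name, new_column_name):
    """Hypothetical stand-in for groupby_agg, built from plain pandas calls."""
    # Group over `by`, apply `agg` to `agg_column_name`, and broadcast the
    # per-group result back to every row of its group.
    transformed = df.groupby(by)[agg_column_name].transform(agg)
    # Return a new dataframe with the result stored under `new_column_name`.
    return df.assign(**{new_column_name: transformed})


items = pd.DataFrame(
    {
        "item": ["shoe", "shoe", "bag", "shoe", "bag"],
        "MRP": [220, 450, 320, 200, 305],
        "number_sold": [100, 40, 56, 38, 25],
    }
)

result = groupby_agg_sketch(
    items,
    by="item",
    agg="mean",
    agg_column_name="MRP",
    new_column_name="Avg_MRP",
)
print(result)
```

The sketch returns a new dataframe and leaves `items` unmodified, which is why the result has to be assigned to a name.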


### Specifying multiple columns to group over

The basic example above specified a single column in `by` to group over.
Grouping over multiple keys is also supported, since `groupby_agg` uses the standard pandas `DataFrame.groupby` method under the hood. As the example shows, a key can be a column name or a Series derived from the dataframe.

An example is shown below:

In [4]:
df = pd.DataFrame(
    {
        "date": pd.date_range("2021-01-12", periods=5, freq="W"),
        "item": ["sneaker", "boots", "sneaker", "bag", "bag"],
        "MRP": [230, 450, 300, 200, 305],
    }
)
df

|   | date | item | MRP |
|---|------|------|-----|
| 0 | 2021-01-17 | sneaker | 230 |
| 1 | 2021-01-24 | boots | 450 |
| 2 | 2021-01-31 | sneaker | 300 |
| 3 | 2021-02-07 | bag | 200 |
| 4 | 2021-02-14 | bag | 305 |


In [5]:
df = df.groupby_agg(
    by=["item", df["date"].dt.month],
    agg="mean",
    agg_column_name="MRP",
    new_column_name="Avg_MRP_by_item_month",
)
df

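|   | date | item | MRP | Avg_MRP_by_item_month |
|---|------|------|-----|-----------------------|
| 0 | 2021-01-17 | sneaker | 230 | 265.0 |
| 1 | 2021-01-24 | boots | 450 | 450.0 |
| 2 | 2021-01-31 | sneaker | 300 | 265.0 |
| 3 | 2021-02-07 | bag | 200 | 252.5 |
| 4 | 2021-02-14 | bag | 305 | 252.5 |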
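
For comparison, the same grouping can be written directly against `DataFrame.groupby`. This is a sketch in plain pandas; the names `weekly` and `month` are chosen here for illustration. It shows that a list of grouping keys may mix a column label with a Series aligned to the dataframe's index.

```python
import pandas as pd

weekly = pd.DataFrame(
    {
        "date": pd.date_range("2021-01-12", periods=5, freq="W"),
        "item": ["sneaker", "boots", "sneaker", "bag", "bag"],
        "MRP": [230, 450, 300, 200, 305],
    }
)

# A grouping key can be a column label or a Series aligned to the index.
month = weekly["date"].dt.month.rename("month")

# Group by item and calendar month, then broadcast the group mean to each row.
avg = weekly.groupby(["item", month])["MRP"].transform("mean")
weekly = weekly.assign(Avg_MRP_by_item_month=avg)
print(weekly)
```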


### The `dropna` parameter

If the column(s) being grouped over (`by`) contain null values, you can treat the nulls as a group of their own by passing `dropna=False`. With the default, `dropna=True`, rows with a null grouping key are excluded from the grouping, and the corresponding transformed values (in `new_column_name`) are left as NaN.
The `dropna` parameter of `groupby` was introduced in pandas 1.1.

You may read more about this parameter in the [pandas user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#id2).

In [6]:
df = pd.DataFrame(
    {
        "name": ("black", "black", "black", "red", "red"),
        "type": ("chair", "chair", "sofa", "sofa", "plate"),
        "num": (4, 5, 12, 4, 3),
        "nulls": (1, 1, np.nan, np.nan, 3),
    }
)
df

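|   | name | type | num | nulls |
|---|------|------|-----|-------|
| 0 | black | chair | 4 | 1.0 |
| 1 | black | chair | 5 | 1.0 |
| 2 | black | sofa | 12 | NaN |
| 3 | red | sofa | 4 | NaN |
| 4 | red | plate | 3 | 3.0 |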


Let's count how many rows share each value in the `nulls` column.
Compare the two outputs from the following cell, where `dropna` is set to `True` and `False` respectively:

In [7]:
print("With dropna=True (default)")
filtered_df = df.groupby_agg(
    by=["nulls"],
    agg="size",
    agg_column_name="type",
    new_column_name="counter",
    dropna=True,
)
display(filtered_df)

print("With dropna=False")
filtered_df = df.groupby_agg(
    by=["nulls"],
    agg="size",
    agg_column_name="type",
    new_column_name="counter",
    dropna=False,
)
display(filtered_df)

With dropna=True (default)


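|   | name | type | num | nulls | counter |
|---|------|------|-----|-------|---------|
| 0 | black | chair | 4 | 1.0 | 2.0 |
| 1 | black | chair | 5 | 1.0 | 2.0 |
| 2 | black | sofa | 12 | NaN | NaN |
| 3 | red | sofa | 4 | NaN | NaN |
| 4 | red | plate | 3 | 3.0 | 1.0 |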


With dropna=False


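|   | name | type | num | nulls | counter |
|---|------|------|-----|-------|---------|
| 0 | black | chair | 4 | 1.0 | 2 |
| 1 | black | chair | 5 | 1.0 | 2 |
| 2 | black | sofa | 12 | NaN | 2 |
| 3 | red | sofa | 4 | NaN | 2 |
| 4 | red | plate | 3 | 3.0 | 1 |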
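
The difference between the two settings is easiest to see at the group level. The sketch below uses plain pandas (1.1 or later is assumed, since it relies on the `dropna` argument of `groupby`) and a trimmed-down copy of the dataframe, named `furniture` here for illustration, to list the groups that each setting produces.

```python
import numpy as np
import pandas as pd

furniture = pd.DataFrame(
    {
        "type": ("chair", "chair", "sofa", "sofa", "plate"),
        "nulls": (1, 1, np.nan, np.nan, 3),
    }
)

# Default behaviour: rows whose key is NaN belong to no group.
sizes_dropped = furniture.groupby("nulls", dropna=True).size()

# dropna=False: NaN keys form a group of their own.
sizes_kept = furniture.groupby("nulls", dropna=False).size()

print(sizes_dropped)
print(sizes_kept)
```

With `dropna=True` only the groups for `1.0` and `3.0` exist, so the two rows with a null key have no group value to receive. With `dropna=False` a third group appears for the null key, and it contains those two rows.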


### Method chaining

Because `groupby_agg` returns a dataframe, it can be chained with other dataframe methods. One use of this is groupwise filtering, where only the groups that meet a condition are retained.
Let's explore this with an example, reusing one of the small dataframes from before:

In [8]:
df = pd.DataFrame(
    {
        "name": ("black", "black", "black", "red", "red"),
        "type": ("chair", "chair", "sofa", "sofa", "plate"),
        "num": (4, 5, 12, 4, 3),
        "nulls": (1, 1, np.nan, np.nan, 3),
    }
)

filtered_df = df.groupby_agg(
    by=["name", "type"],
    agg="size",
    agg_column_name="type",
    new_column_name="counter",
).query("counter > 1")
filtered_df

|   | name | type | num | nulls | counter |
|---|------|------|-----|-------|---------|
| 0 | black | chair | 4 | 1.0 | 2 |
| 1 | black | chair | 5 | 1.0 | 2 |
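
For reference, the same groupwise filter can be expressed in plain pandas in at least two ways. The sketch below is an illustration rather than a recommendation; the names `furniture`, `via_mask`, and `via_filter` are chosen here. Neither variant adds a `counter` column, so they return only the original columns.

```python
import pandas as pd

furniture = pd.DataFrame(
    {
        "name": ("black", "black", "black", "red", "red"),
        "type": ("chair", "chair", "sofa", "sofa", "plate"),
        "num": (4, 5, 12, 4, 3),
    }
)

by = ["name", "type"]

# Option 1: build a boolean mask from a groupwise transform.
mask = furniture.groupby(by)["type"].transform("size") > 1
via_mask = furniture[mask]

# Option 2: GroupBy.filter keeps every row of each group for which
# the function returns True.
via_filter = furniture.groupby(by).filter(lambda group: len(group) > 1)

print(via_mask)
print(via_filter)
```

Both variants keep only the two `black` / `chair` rows, matching the chained `groupby_agg(...).query(...)` result above.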
