In [1]:
import numpy as np
import pandas as pd
print(pd.__version__)

2.0.2


Grouping and Aggregation
==============

Grouping and aggregation are powerful techniques in data analysis that allow us to summarize and analyze data based on different categories or groups. By grouping data, we can apply various statistical functions to obtain meaningful insights and summaries.

Pandas provides efficient methods for grouping and aggregating data. We can group data based on one or more columns and perform operations such as counting, summing, averaging, and more on the grouped data.

# Grouping Steps and applications

## Grouping steps

The grouping function in pandas is `groupby`, which is essentially a split-apply-combine process, commonly abbreviated as **SAC**. Grouping refers to a process that involves one or more of the following steps.

* Splitting: Dividing the data into groups based on specific criteria or columns.
* Applying: Performing computations or transformations on each group independently.
* Combining: Combining the results of the computations into a consolidated output.
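
The three steps above can be sketched by hand and compared with a single `groupby` call. This is a minimal illustration on hypothetical toy data (the `team` and `score` columns are made up for this example), not a description of how pandas implements `groupby` internally.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
df = pd.DataFrame({"team": ["x", "y", "x", "y"], "score": [1, 2, 3, 4]})

# Split: one sub-DataFrame per distinct value of "team".
pieces = {key: part for key, part in df.groupby("team")}

# Apply: compute a statistic on each piece independently.
sums = {key: part["score"].sum() for key, part in pieces.items()}

# Combine: gather the per-group results into one Series.
manual = pd.Series(sums, name="score").rename_axis("team")

# groupby performs the same three steps in a single expression.
builtin = df.groupby("team")["score"].sum()
print(manual)
print(builtin)
```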

## Grouping applications

* **Aggregation**: Compute summary statistics for each group. For example:

    - Calculate the sum or mean within each group.

    - Determine the count or size of each group.

* **Transformation**: Perform calculations specific to each group and return an object with a similar index. For example:

    - Standardize data within each group using z-scores.

    - Fill missing values within each group using derived values from the group.

* **Filtration**: Drop certain groups based on an evaluation that results in True or False. For example:

    - Discard data belonging to groups with only a few members.

    - Filter data based on the sum or mean of each group.


A combination of the above: `GroupBy` examines the result of the apply step and attempts to return a sensible combination result when it doesn't fit into the aforementioned categories.

These various applications of `groupby` provide flexibility in data analysis, allowing us to compute group-specific statistics, apply transformations within groups, and filter data based on group characteristics. The `GroupBy` functionality in pandas offers a powerful toolset for handling and analyzing grouped data effectively.
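
As a compact sketch of the three applications, the following uses hypothetical toy data (the `team` and `score` columns are assumptions for illustration) to show how the shape of the result differs between aggregation, transformation, and filtration.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
df = pd.DataFrame({"team": ["x", "x", "y"], "score": [1.0, 3.0, 10.0]})
grouped = df.groupby("team")["score"]

# Aggregation: one value per group, indexed by the group key.
agg_result = grouped.mean()

# Transformation: one value per original row, aligned to the original index.
centered = grouped.transform(lambda s: s - s.mean())

# Filtration: keep only the rows of groups that pass a test.
big_groups = df.groupby("team").filter(lambda g: len(g) >= 2)

print(agg_result, centered, big_groups, sep="\n\n")
```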

# Grouping Functions
## `groupby`

```python

groupby(by = None, axis = 0, level = None, as_index = True, sort = True, group_keys = True, observed = False, dropna = True)
```

* `by`: Specifies the column name(s) or other criteria to group the DataFrame. It can be a single column name, a list of column names, a function applied to each index label, or a dictionary or Series mapping index labels to group names.

* `axis`: Specifies the axis along which the grouping is performed. axis=0 groups the DataFrame by rows (along the index), and axis=1 groups by columns.

* `level`: Specifies the level(s) (if the DataFrame has a hierarchical index) on which to group the DataFrame.

* `as_index`: Specifies whether to use the grouped columns as the index of the resulting DataFrame. The default value is True.

* `sort`: Specifies whether to sort the resulting groups by the group keys. The default value is True.

* `group_keys`: Specifies whether to add the group keys to the index of the result when calling `apply` and the function returns an object indexed like its input. The default value is True.

* `dropna`: Specifies whether rows whose group key is NaN are dropped. The default value is True; with `dropna = False`, NaN keys form a group of their own.

* `observed`: Specifies whether to include only observed values when dealing with categorical data. The default value is False.
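
A short sketch of how two of these parameters change the result. The DataFrame and its `key` and `val` columns are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing group key.
df = pd.DataFrame({"key": ["a", "b", np.nan, "a"], "val": [1, 2, 3, 4]})

# Default: group keys become the index and rows with a NaN key are dropped.
default = df.groupby("key")["val"].sum()

# dropna=False keeps the rows whose key is NaN as a group of their own.
with_nan = df.groupby("key", dropna=False)["val"].sum()

# as_index=False returns the keys as an ordinary column instead of the index.
flat = df.groupby("key", as_index=False)["val"].sum()

print(default, with_nan, flat, sep="\n\n")
```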

In [2]:
index = pd.Index(data = ["A", "B", "C", "D", "E", "F", "G",'H'], name="name")
data = {
    "age": [18, 30, 35, 18, np.nan, 30, 37, 25],
    "city": ["New York", "Los Angeles", "Huston", "Orlando", np.nan, " ", "Miami", "Chicago"],
    "gender": ["male", "male", "female", "male", "female", "female", "male", "male"],
    "income": [3000, 8000, 8000, 4000, 6000, 7000, 10000, 70000]
}
user_info = pd.DataFrame(data = data, index = index)
user_info

name,age,city,gender,income
A,18.0,New York,male,3000
B,30.0,Los Angeles,male,8000
C,35.0,Huston,female,8000
D,18.0,Orlando,male,4000
E,,,female,6000
F,30.0,,female,7000
G,37.0,Miami,male,10000
H,25.0,Chicago,male,70000


In [3]:
# Group by gender
user_info_gender_group = user_info.groupby("gender")
user_info_gender_group

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000019716EBB790>

In [4]:
# Look up the grouping information
user_info_gender_group.groups

{'female': ['C', 'E', 'F'], 'male': ['A', 'B', 'D', 'G', 'H']}

In [5]:
# Look up the male info
user_info_gender_group.get_group('male')

name,age,city,gender,income
A,18.0,New York,male,3000
B,30.0,Los Angeles,male,8000
D,18.0,Orlando,male,4000
G,37.0,Miami,male,10000
H,25.0,Chicago,male,70000


In [6]:
# Look up the female info
user_info_gender_group.get_group('female')

name,age,city,gender,income
C,35.0,Huston,female,8000
E,,,female,6000
F,30.0,,female,7000


In [7]:
# To group the user_info DataFrame by both gender and age
age_gender_group = user_info.groupby(['age', 'gender'])

In [8]:
age_gender_group.groups

{(18.0, 'male'): ['A', 'D'], (30.0, 'male'): ['B'], (35.0, 'female'): ['C'], (nan, 'female'): ['E'], (25.0, 'male'): ['H'], (30.0, 'female'): ['F'], (37.0, 'male'): ['G']}

In [9]:
age_gender_group.get_group((18,"male")) # the tuple element order should be consistent with the groupby keys

name,age,city,gender,income
A,18.0,New York,male,3000
D,18.0,Orlando,male,4000


##### By default, `groupby` sorts the group keys during the operation. To achieve better performance, we can set `sort = False`.

In [10]:
user_info_group = user_info.groupby(['gender', 'age'], sort = False)
user_info_group.groups

{('female', 35.0): ['C'], ('female', nan): ['E'], ('female', 30.0): ['F'], ('male', 18.0): ['A', 'D'], ('male', 25.0): ['H'], ('male', 30.0): ['B'], ('male', 37.0): ['G']}

## Object property
### `head()`
The returned result is the **first few rows of each group**, rather than the first few rows of the entire dataset.

In [11]:
user_info_gender_group.head(2)  

name,age,city,gender,income
A,18.0,New York,male,3000
B,30.0,Los Angeles,male,8000
C,35.0,Huston,female,8000
E,,,female,6000


### `first()`
`first()` returns the first non-null value of each column within each group. The result is indexed by the group keys.

In [12]:
user_info_gender_group.first()  # index is gender 

gender,age,city,income
female,35.0,Huston,8000
male,18.0,New York,3000


### To view specific columns in a DataFrame

After grouping with `groupby`, we can use dot notation or bracket indexing (`[...]`) to select a specific column from the grouped DataFrame.

In [13]:
user_info_gender_group.head()

name,age,city,gender,income
A,18.0,New York,male,3000
B,30.0,Los Angeles,male,8000
C,35.0,Huston,female,8000
D,18.0,Orlando,male,4000
E,,,female,6000
F,30.0,,female,7000
G,37.0,Miami,male,10000
H,25.0,Chicago,male,70000


In [14]:
user_info_gender_group.city.head() # dot notation: .column_name

name
A       New York
B    Los Angeles
C         Huston
D        Orlando
E            NaN
F               
G          Miami
H        Chicago
Name: city, dtype: object

In [15]:
user_info_gender_group['city'].head()    # column selection

name
A       New York
B    Los Angeles
C         Huston
D        Orlando
E            NaN
F               
G          Miami
H        Chicago
Name: city, dtype: object

##### To view simple statistical information for a specific column in a DataFrame

In [16]:
user_info_gender_group[['age']].mean()

gender,age
female,32.5
male,25.6


In [17]:
user_info_gender_group.age.sum()

gender
female     65.0
male      128.0
Name: age, dtype: float64

In [18]:
user_info_gender_group.ngroups  # count of groups

2

In [19]:
user_info_gender_group.size()  # size of the group

gender
female    3
male      5
dtype: int64

In [20]:
# Group indexing
user_info_gender_group.groups  # return a dict

{'female': ['C', 'E', 'F'], 'male': ['A', 'B', 'D', 'G', 'H']}

## Iterate over groups

In [21]:
for group_name, group_data in user_info_gender_group:
    print("Group name:", group_name)
    display(group_data)

Group name: female


name,age,city,gender,income
C,35.0,Huston,female,8000
E,,,female,6000
F,30.0,,female,7000


Group name: male


name,age,city,gender,income
A,18.0,New York,male,3000
B,30.0,Los Angeles,male,8000
D,18.0,Orlando,male,4000
G,37.0,Miami,male,10000
H,25.0,Chicago,male,70000


In [22]:
for group_name, group_data in user_info_group:
    print(group_name)
    display(group_data)

('male', 18.0)


name,age,city,gender,income
A,18.0,New York,male,3000
D,18.0,Orlando,male,4000


('male', 30.0)


name,age,city,gender,income
B,30.0,Los Angeles,male,8000


('female', 35.0)


name,age,city,gender,income
C,35.0,Huston,female,8000


('female', 30.0)


name,age,city,gender,income
F,30.0,,female,7000


('male', 37.0)


name,age,city,gender,income
G,37.0,Miami,male,10000


('male', 25.0)


name,age,city,gender,income
H,25.0,Chicago,male,70000


## Methods of the `GroupBy` object

In [23]:
print([attr for attr in dir(user_info_group) if not attr.startswith('_')])

['age', 'agg', 'aggregate', 'all', 'any', 'apply', 'bfill', 'boxplot', 'city', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes', 'ewm', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'gender', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'income', 'indices', 'last', 'max', 'mean', 'median', 'min', 'ndim', 'ngroup', 'ngroups', 'nth', 'nunique', 'ohlc', 'pct_change', 'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sample', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'value_counts', 'var']


## Grouping Continuous Variables 

When working with continuous variables, one common approach is to group the data into discrete categories or bins. This can be useful for analyzing the relationship between the continuous variable and other categorical variables or for creating summary statistics.


* **Define the bins**: Determine the ranges or intervals for grouping the continuous variable. We can use methods like `pd.cut()` or `pd.qcut()` to define the bins.

* **Assign bin labels**: Optionally, you can assign labels to the bins to provide meaningful names to the groups.

* **Group the data**: Use the `groupby()` function to group the data based on the defined bins.
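
The steps above can be sketched as follows. The ages are hypothetical toy values; the sketch shows that `pd.cut` uses fixed, right-closed intervals by default, while `pd.qcut` takes its bin edges from sample quantiles.

```python
import pandas as pd

# Hypothetical ages, used only for this illustration.
ages = pd.Series([18, 30, 35, 18, 30, 37, 25], name="age")

# pd.cut: fixed edges; intervals are right-closed by default,
# so the value 30 falls in (20, 30], not in (30, 40].
fixed = pd.cut(ages, bins=[0, 20, 30, 40, 100])

# pd.qcut: bin edges are taken from sample quantiles (here the median).
halves = pd.qcut(ages, q=2, labels=["lower half", "upper half"])

# Group by the bins; observed=True lists only the bins that actually occur.
counts = ages.groupby(fixed, observed=True).size()

print(counts)
print(halves.value_counts())
```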


In [24]:
bins = [0, 20, 30, 40, 100]
labels = ["Below 20", "Between 21 and 30", "Between 31 and 40", "Above 41"]
user_info["Age bins"] = pd.cut(user_info['age'], bins = bins, labels = labels)
user_info

name,age,city,gender,income,Age bins
A,18.0,New York,male,3000,Below 20
B,30.0,Los Angeles,male,8000,Between 21 and 30
C,35.0,Huston,female,8000,Between 31 and 40
D,18.0,Orlando,male,4000,Below 20
E,,,female,6000,
F,30.0,,female,7000,Between 21 and 30
G,37.0,Miami,male,10000,Between 31 and 40
H,25.0,Chicago,male,70000,Between 21 and 30


In [25]:
user_info.groupby('Age bins').count()

Age bins,age,city,gender,income
Below 20,2,2,2,2
Between 21 and 30,3,3,3,3
Between 31 and 40,2,2,2,2
Above 41,0,0,0,0


# Aggregation, Transformation and Filtration
## Aggregation 

### Commonly used aggregation functions
Grouping is usually a first step toward aggregation and statistical analysis. To aggregate after grouping, we can call an aggregation method directly or pass it to the `agg` method. Commonly used aggregation functions in pandas include:

* `mean()`: Compute the mean of the values in each group.
* `sum()`: Compute the sum of the values in each group.
* `size()`: Count the number of rows in each group, including null values.
* `count()`: Count the non-null values in each group.
* `std()`: Compute the standard deviation of the values in each group.
* `var()`: Compute the variance of the values in each group.
* `sem()`: Compute the standard error of the mean of the values in each group.
* `describe()`: Generate descriptive statistics of the values in each group (e.g., count, mean, min, max, etc.).
* `first()`: Return the first value in each group.
* `last()`: Return the last value in each group.
* `nth()`: Return the nth value in each group.
* `min()`: Find the minimum value in each group.
* `max()`: Find the maximum value in each group.
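
Several of these functions can be requested at once by passing their names as strings to `agg`. The sketch below uses hypothetical toy data (the `g` and `v` columns are assumptions for illustration) and also shows the difference between `size` and `count` when a value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in group "a".
df = pd.DataFrame({"g": ["a", "a", "b", "b"], "v": [1.0, np.nan, 3.0, 5.0]})

# One column per requested statistic, one row per group.
summary = df.groupby("g")["v"].agg(["size", "count", "mean", "min", "max"])
print(summary)
```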

In [26]:
user_info

name,age,city,gender,income,Age bins
A,18.0,New York,male,3000,Below 20
B,30.0,Los Angeles,male,8000,Between 21 and 30
C,35.0,Huston,female,8000,Between 31 and 40
D,18.0,Orlando,male,4000,Below 20
E,,,female,6000,
F,30.0,,female,7000,Between 21 and 30
G,37.0,Miami,male,10000,Between 31 and 40
H,25.0,Chicago,male,70000,Between 21 and 30


In [27]:
# group by gender
user_info_gender_group = user_info.groupby('gender')

In [28]:
user_info_gender_group.head()

name,age,city,gender,income,Age bins
A,18.0,New York,male,3000,Below 20
B,30.0,Los Angeles,male,8000,Between 21 and 30
C,35.0,Huston,female,8000,Between 31 and 40
D,18.0,Orlando,male,4000,Below 20
E,,,female,6000,
F,30.0,,female,7000,Between 21 and 30
G,37.0,Miami,male,10000,Between 31 and 40
H,25.0,Chicago,male,70000,Between 21 and 30


In [29]:
user_info_gender_group['age'].agg(len) # including null values

gender
female    3
male      5
Name: age, dtype: int64

In [30]:
user_info_gender_group.age.count() # not including null values

gender
female    2
male      5
Name: age, dtype: int64

In [31]:
user_info_gender_group.age.size() # including null values

gender
female    3
male      5
Name: age, dtype: int64

In [32]:
# Group by age and gender
user_info_age_gender_group = user_info.groupby(['gender','age'])
user_info_age_gender_group.agg(len)

gender,age,city,income,Age bins
female,30.0,1,1,1
female,35.0,1,1,1
male,18.0,2,2,2
male,25.0,1,1,1
male,30.0,1,1,1
male,37.0,1,1,1


In [33]:
user_info_age_gender_group.count()

gender,age,city,income,Age bins
female,30.0,1,1,1
female,35.0,1,1,1
male,18.0,2,2,2
male,25.0,1,1,1
male,30.0,1,1,1
male,37.0,1,1,1


In [34]:
user_info_age_gender_group.size()  # returns a Series of group sizes

gender  age 
female  30.0    1
        35.0    1
male    18.0    2
        25.0    1
        30.0    1
        37.0    1
dtype: int64

In [35]:
# Get the largest age value in each gender group (three equivalent spellings)
user_info_gender_group.age.agg(max)
user_info_gender_group.age.agg(np.max)
user_info_gender_group.age.max()

gender
female    35.0
male      37.0
Name: age, dtype: float64

Both Series and DataFrame objects in pandas have the describe method, which provides a summary of the data's distribution. Even after grouping the data, we can still use the describe method to examine the data within each group.

In [36]:
user_info_gender_group.describe()

,age,age,age,age,age,age,age,age,income,income,income,income,income,income,income,income
gender,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
female,2.0,32.5,3.535534,30.0,31.25,32.5,33.75,35.0,3.0,7000.0,1000.0,6000.0,6500.0,7000.0,7500.0,8000.0
male,5.0,25.6,8.142481,18.0,18.0,25.0,30.0,37.0,5.0,19000.0,28653.097564,3000.0,4000.0,8000.0,10000.0,70000.0


In [37]:
user_info_group.describe()

,,income,income,income,income,income,income,income,income
gender,age,count,mean,std,min,25%,50%,75%,max
male,18.0,2.0,3500.0,707.106781,3000.0,3250.0,3500.0,3750.0,4000.0
male,30.0,1.0,8000.0,,8000.0,8000.0,8000.0,8000.0,8000.0
male,37.0,1.0,10000.0,,10000.0,10000.0,10000.0,10000.0,10000.0
male,25.0,1.0,70000.0,,70000.0,70000.0,70000.0,70000.0,70000.0
female,30.0,1.0,7000.0,,7000.0,7000.0,7000.0,7000.0,7000.0
female,35.0,1.0,8000.0,,8000.0,8000.0,8000.0,8000.0,8000.0


### Avoid multi-indexing
If we are aggregating based on multiple keys, by default, the result will have a multi-level index structure. There are two ways to avoid having a multi-level index.
##### 1. `reset_index`

In [38]:
user_info_group.agg(len)

gender,age,city,income,Age bins
male,18.0,2,2,2
male,30.0,1,1,1
female,35.0,1,1,1
female,30.0,1,1,1
male,37.0,1,1,1
male,25.0,1,1,1


In [39]:
user_info_group.agg(len).reset_index()

,gender,age,city,income,Age bins
0,male,18.0,2,2,2
1,male,30.0,1,1,1
2,female,35.0,1,1,1
3,female,30.0,1,1,1
4,male,37.0,1,1,1
5,male,25.0,1,1,1


##### 2. `as_index = False` when grouping

In [40]:
user_info.groupby(["gender", "age"], as_index = False).agg(len)

,gender,age,city,income,Age bins
0,female,30.0,1,1,1
1,female,35.0,1,1,1
2,male,18.0,2,2,2
3,male,25.0,1,1,1
4,male,30.0,1,1,1
5,male,37.0,1,1,1


### Multiple aggregate results

In [41]:
for group_name, group_data in user_info_gender_group:
    print("Group:", group_name)
    display(group_data)
    print("---------------------")

Group: female


name,age,city,gender,income,Age bins
C,35.0,Huston,female,8000,Between 31 and 40
E,,,female,6000,
F,30.0,,female,7000,Between 21 and 30


---------------------
Group: male


name,age,city,gender,income,Age bins
A,18.0,New York,male,3000,Below 20
B,30.0,Los Angeles,male,8000,Between 21 and 30
D,18.0,Orlando,male,4000,Below 20
G,37.0,Miami,male,10000,Between 31 and 40
H,25.0,Chicago,male,70000,Between 21 and 30


---------------------


In [42]:
user_info_gender_group['income'].agg([np.sum, np.mean])

gender,sum,mean
female,21000,7000.0
male,95000,19000.0


### Use tuples to rename the results

In [43]:
user_info_gender_group['income'].agg([('Total', np.sum), ('Average', np.mean)])

gender,Total,Average
female,21000,7000.0
male,95000,19000.0


##### To apply different aggregation operations to different columns in a DataFrame, we can use the agg method with a dictionary to specify the aggregation functions for each column.

In [44]:
# Get the mean of age and total income by gender
user_info_gender_group.agg({'age': np.mean, "income": np.sum}).rename(columns = {"age": "age_mean", "income": "income_sum"})

gender,age_mean,income_sum
female,32.5,21000
male,25.6,95000


### Custom functions

In [45]:
# Calculate the mean age after one year by gender
user_info_gender_group.age.agg(lambda x: x.mean() + 1)

gender
female    33.5
male      26.6
Name: age, dtype: float64

In [46]:
# Calculate the difference between maximum age and minimum age
user_info_gender_group.age.agg(lambda x: x.max() - x.min())

gender
female     5.0
male      19.0
Name: age, dtype: float64

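### `NamedAgg`
The `NamedAgg` feature in pandas allows us to apply multiple aggregation operations to different columns in a DataFrame while providing custom names for the resulting aggregated columns.

The `aggfunc` argument of `NamedAgg` accepts a function name given as a string (such as `'min'`) or any callable, including a `lambda`. Functions defined with `def`, as used below, give each aggregation a reusable, descriptive name.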

In [47]:
def R1(x):
    return x.max() - x.min()
def R2(x):
    return x.max()- x.median()

In [48]:

# By gender, show the minimum income, the maximum income, the difference between
# max and min (R1), and the difference between max and median income (R2)
user_info_gender_group.agg(
    Min_income = pd.NamedAgg(column = 'income', aggfunc = 'min'),
    Max_income = pd.NamedAgg(column = 'income', aggfunc = 'max'),
    income_diff = pd.NamedAgg(column = 'income', aggfunc = R1),
    range_income = pd.NamedAgg(column = 'income', aggfunc = R2)
)

gender,Min_income,Max_income,income_diff,range_income
female,6000,8000,2000,1000.0
male,3000,70000,67000,62000.0
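
Named aggregation can also be written with plain `(column, aggfunc)` tuples, which pandas accepts as a shorthand for `pd.NamedAgg`. The sketch below uses hypothetical toy data and a hypothetical helper, `value_range`, introduced only for this illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
df = pd.DataFrame({"g": ["a", "a", "b"], "income": [1, 5, 7]})

def value_range(s):
    """Hypothetical helper: spread between the largest and smallest value."""
    return s.max() - s.min()

# Each keyword names an output column; each tuple is (input column, function).
result = df.groupby("g").agg(
    min_income=("income", "min"),
    max_income=("income", "max"),
    income_range=("income", value_range),
)
print(result)
```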


### Aggregate function with parameters 

For example:
* `quantile(q)`: Calculate the qth quantile of the data. The q parameter specifies the desired quantile value (e.g., 0.25 for the first quartile, 0.5 for the median).

* `agg(func, *args, **kwargs)`: Apply a custom aggregation function with additional arguments. You can pass any custom function as func along with any required positional (`*args`) or keyword (`**kwargs`) arguments.

* `apply(func, *args, **kwargs)`: Apply a custom function to each group. Similar to `agg()`, you can provide additional arguments to the func function as positional or keyword arguments.
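
A minimal sketch of passing parameters, using hypothetical toy data and a hypothetical helper, `share_above`, that is not part of pandas. Arguments given after the function in `agg` are forwarded to that function for every group.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
df = pd.DataFrame({"g": ["a", "a", "a", "b", "b"], "v": [1, 2, 3, 10, 20]})
grouped = df.groupby("g")["v"]

# Built-in aggregation with a parameter: the 0.5 quantile is the median.
medians = grouped.quantile(0.5)

def share_above(s, threshold):
    """Hypothetical helper: fraction of values strictly above threshold."""
    return (s > threshold).mean()

# The keyword argument is forwarded to share_above for each group.
shares = grouped.agg(share_above, threshold=2)

print(medians)
print(shares)
```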

In [49]:
# Determine whether each gender group has at least one record
# for an individual aged between 30 and 40.
def f(s, low, high):
    return s.between(low, high).max()
user_info_gender_group.age.agg(f,30,40)

gender
female    True
male      True
Name: age, dtype: bool

## Filtration

The `filter` method keeps or drops entire groups: every row of a group that passes the test is returned, and every row of a group that fails is discarded. The function passed to `filter` must therefore return a single boolean value for each group.

In [50]:
user_info_gender_group.filter(lambda x: (x['income'] > 3200).all()).head()

name,age,city,gender,income,Age bins
C,35.0,Huston,female,8000,Between 31 and 40
E,,,female,6000,
F,30.0,,female,7000,Between 21 and 30
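
Two other common filters, sketched on hypothetical toy data (the `g` and `v` columns are assumptions for illustration): dropping small groups, and keeping groups by a group-level statistic.

```python
import pandas as pd

# Hypothetical toy data: group "a" has three rows, group "b" has one.
df = pd.DataFrame({"g": ["a", "a", "a", "b"], "v": [1, 2, 3, 4]})

# Keep only the groups with at least two members.
kept = df.groupby("g").filter(lambda grp: len(grp) >= 2)

# Keep only the groups whose mean of "v" exceeds 3.
high = df.groupby("g").filter(lambda grp: grp["v"].mean() > 3)

print(kept)
print(high)
```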


## Transformation

When performing aggregation operations, the result is often an object with the group names as indices. Although you can specify `as_index = False` to disable this behavior, the resulting index may still not match the original index of the data. If we want to use the original index of the array, we can use the transform method, which simplifies this process. It applies the func parameter to each group and places the results back onto the original array's index (broadcasting if the result is a scalar).

### Group-wise element transformation
In the transform function, the object passed in is a column within each group, and the return value should either have exactly the same length as that column or be a scalar, which is broadcast to every row of the group.

When using `transform`, the function or operation is applied to each group individually, operating on a column within that group. The result should be a transformed version of that column, where each element is modified based on the specific group it belongs to.

In [51]:
# income by gender group
user_info_gender_group.income.agg(np.mean)

gender
female     7000.0
male      19000.0
Name: income, dtype: float64

In [52]:
# By gender group, show each member's group average income
user_info_gender_group.income.transform(np.mean)

name
A    19000.0
B    19000.0
C     7000.0
D    19000.0
E     7000.0
F     7000.0
G    19000.0
H    19000.0
Name: income, dtype: float64

In [53]:
user_info_gender_group[['age','income']].transform(np.mean)

name,age,income
A,25.6,19000.0
B,25.6,19000.0
C,32.5,7000.0
D,25.6,19000.0
E,32.5,7000.0
F,32.5,7000.0
G,25.6,19000.0
H,25.6,19000.0


###  Group-wise standardization 
Performing group-wise standardization using a transformation method is a common operation. Z-score standardization (also known as zero-mean normalization) rescales the data within each group to have a mean of zero and a standard deviation of one. It changes only the location and scale of the values; the shape of the distribution stays the same.

Equation: $z = \frac{x - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the mean and standard deviation of the group that $x$ belongs to.

In [54]:
user_info_gender_group.income.transform(lambda x: (x - np.mean(x))/np.std(x))

name
A   -0.624314
B   -0.429216
C    1.224745
D   -0.585295
E   -1.224745
F    0.000000
G   -0.351177
H    1.990002
Name: income, dtype: float64
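
The choice of $\sigma$ affects the z-scores. `np.std` defaults to the population standard deviation (`ddof=0`), whereas `Series.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below compares the two on hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a single group with three values.
df = pd.DataFrame({"g": ["a", "a", "a"], "v": [6000.0, 7000.0, 8000.0]})
grouped = df.groupby("g")["v"]

# Population standard deviation (ddof=0), matching np.std's default.
z_pop = grouped.transform(lambda x: (x - x.mean()) / x.std(ddof=0))

# Sample standard deviation (ddof=1), the default of Series.std().
z_sample = grouped.transform(lambda x: (x - x.mean()) / x.std())

print(z_pop)
print(z_sample)
```

For small groups the two conventions differ noticeably, so it helps to state which one is used.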

### Filling missing values with group-wise mean using a transformation method

In [55]:
user_info_gender_group.age.mean()

gender
female    32.5
male      25.6
Name: age, dtype: float64

In [56]:
user_info

name,age,city,gender,income,Age bins
A,18.0,New York,male,3000,Below 20
B,30.0,Los Angeles,male,8000,Between 21 and 30
C,35.0,Huston,female,8000,Between 31 and 40
D,18.0,Orlando,male,4000,Below 20
E,,,female,6000,
F,30.0,,female,7000,Between 21 and 30
G,37.0,Miami,male,10000,Between 31 and 40
H,25.0,Chicago,male,70000,Between 21 and 30


In [57]:
user_info_gender_group.age.transform(lambda x: x.fillna(x.mean()))

name
A    18.0
B    30.0
C    35.0
D    18.0
E    32.5
F    30.0
G    37.0
H    25.0
Name: age, dtype: float64
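
Because the result of `transform` is aligned with the original index, it can be assigned straight back as a new column. A minimal sketch on hypothetical toy data (the `g`, `age`, and `age_filled` names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age in group "a".
df = pd.DataFrame({"g": ["a", "a", "b", "b"], "age": [20.0, np.nan, 30.0, 40.0]})

# Fill each missing age with the mean age of its own group,
# then store the result next to the original column.
df["age_filled"] = df.groupby("g")["age"].transform(lambda s: s.fillna(s.mean()))
print(df)
```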

# Apply function
## Difference between `apply` and `transform`

**Similarities**:

Both `apply()` and `transform()` can be used to perform calculations on a DataFrame and are often used in conjunction with the `groupby()` method.

**Differences**:

* `apply()` can accept custom functions, including simple aggregation functions (e.g., sum) and complex feature interaction functions.

* `transform()` is designed to work on one column at a time, so it is not suited to custom functions that combine several columns. When using `transform()`, it's important to remember the following:

    1. Because the calculation is column-wise, select the column(s) to operate on from the grouped object before calling `transform()`. This is a significant difference compared to `apply()`.

    2. Due to the column-wise operation, `transform()` is less flexible than `apply()`. Typical uses include broadcasting a column-wise maximum, minimum, mean, or variance to every row of a group, or creating bins.

    3. One common use of `transform()` is to assign the results of a function back to the original DataFrame. This works because the result has the same number of rows, and the same index, as the original data. Note that a group-level statistic is repeated on every row of its group, so we may need to remove duplicate values to get one row per group.
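
The difference can be sketched on hypothetical toy data (the `g`, `income`, and `age` columns are assumptions for illustration): the function given to `apply` sees a whole group and may combine columns, while `transform` on a selected column sees only that column's values.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
df = pd.DataFrame(
    {"g": ["a", "a", "b"], "income": [10.0, 30.0, 50.0], "age": [20, 40, 25]}
)
grouped = df.groupby("g")

# apply: the function receives each group as a DataFrame and can mix columns.
income_per_year = grouped[["income", "age"]].apply(
    lambda part: part["income"].sum() / part["age"].sum()
)

# transform on one selected column: the function receives that column's
# values for each group and returns a result aligned with the original rows.
demeaned = grouped["income"].transform(lambda col: col - col.mean())

print(income_per_year)
print(demeaned)
```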

## The flexibility of the `apply()`

The flexibility of the `apply()` function lies in its ability to return diverse types of results, making it widely used among all group functions.

### Return scalar

In [58]:
# Group by gender, and show the average values of age and income
def custom_mean(x):
    return x.mean()
user_info_gender_group[['age','income']].apply(custom_mean)

gender,age,income
female,32.5,7000.0
male,25.6,19000.0


In [59]:
user_info_gender_group[['age','income']].apply(lambda x: x.mean())

gender,age,income
female,32.5,7000.0
male,25.6,19000.0


In [60]:
# Find the highest n values by gender group
def find_highest_n(data, num):
    """
    data: Series holding one group's values
    num: number of largest values to return
    """
    return data.nlargest(num)

In [61]:
user_info_gender_group.income.apply(find_highest_n, 3)

gender  name
female  C        8000
        F        7000
        E        6000
male    H       70000
        G       10000
        B        8000
Name: income, dtype: int64

In [62]:
# By gender group, display average age
user_info_gender_group.age.apply(np.mean)

gender
female    32.5
male      25.6
Name: age, dtype: float64

In [63]:
user_info_gender_group.income.apply(lambda x: x.min())

gender
female    6000
male      3000
Name: income, dtype: int64

### Return a list-like object (Series)

In [64]:
user_info_gender_group['income'].apply(lambda x: x - x.min())

gender  name
female  C        2000
        E           0
        F        1000
male    A           0
        B        5000
        D        1000
        G        7000
        H       67000
Name: income, dtype: int64

In [65]:
user_info_gender_group[['income','age']].apply(lambda x: x - x.min())

gender,name,income,age
female,C,2000.0,5.0
female,E,0.0,
female,F,1000.0,0.0
male,A,0.0,0.0
male,B,5000.0,12.0
male,D,1000.0,0.0
male,G,7000.0,19.0
male,H,67000.0,7.0


### Return a DataFrame

In [66]:
user_info_gender_group.apply(lambda x: pd.DataFrame({'age_diff_max':x['age']-x['age'].max(),
                                  'age_diff_min':x['age']-x['age'].min(),
                                  'income_diff_max':x['income']-x['income'].max(),
                                  'income_diff_min':x['income']-x['income'].min()})).head()

gender,name,age_diff_max,age_diff_min,income_diff_max,income_diff_min
female,C,0.0,5.0,0,2000
female,E,,,-2000,0
female,F,-5.0,0.0,-1000,1000
male,A,-19.0,0.0,-67000,0
male,B,-7.0,12.0,-62000,5000


### To calculate multiple statistics 

In [67]:
# Calculate the sum, variance, and mean of the 'income' column within each gender group
def f(df):
    data = {}
    data['income_sum'] = df['income'].sum()
    data['income_var'] = df['income'].var()
    data['income_mean'] = df['income'].mean()
    return pd.Series(data)

user_info_gender_group.apply(f)

gender,income_sum,income_var,income_mean
female,21000.0,1000000.0,7000.0
male,95000.0,821000000.0,19000.0


## `agg`, `transform` and `apply` comparison

`agg`, `transform`, and `apply` are three methods in pandas for performing calculations on grouped data. 
The comparison was conducted for the following combinations:

* `transform()` method with a custom function
* `transform()` method with a built-in Python function
* `apply()` method with a custom function
* `agg()` method with a custom function
* `agg()` method with a built-in Python function

Based on the results, the conclusions were as follows:

`agg()` with a built-in Python function > `transform()` with a built-in Python function > `agg()` with a custom function >= `apply()` with a custom function > `transform()` with a custom function.
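
The ranking above does not say what was measured. Assuming it refers to execution time, and assuming "built-in" means a function requested by its string name, a sketch like the following can be used to check the ordering on your own data. The data, the `custom_mean` helper, and the repetition count are all hypothetical, and timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 10,000 rows spread over up to 100 groups.
rng = np.random.default_rng(0)
df = pd.DataFrame({"g": rng.integers(0, 100, size=10_000), "v": rng.random(10_000)})
grouped = df.groupby("g")["v"]

def custom_mean(s):
    """Hypothetical custom function that computes the mean by hand."""
    return s.sum() / len(s)

# The five combinations listed above, each wrapped for timing.
candidates = {
    "agg, built-in": lambda: grouped.agg("mean"),
    "transform, built-in": lambda: grouped.transform("mean"),
    "agg, custom": lambda: grouped.agg(custom_mean),
    "apply, custom": lambda: grouped.apply(custom_mean),
    "transform, custom": lambda: grouped.transform(custom_mean),
}

# Time five runs of each candidate and print them from fastest to slowest.
timings = {name: timeit.timeit(fn, number=5) for name, fn in candidates.items()}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.4f} s")
```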