In [1]:
import numpy as np
import pandas as pd
print(pd.__version__)

2.0.2


Grouping and Aggregation
==============

Grouping and aggregation are powerful techniques in data analysis that allow us to summarize and analyze data based on different categories or groups. By grouping data, we can apply various statistical functions to obtain meaningful insights and summaries.

Pandas provides efficient methods for grouping and aggregating data. We can group data based on one or more columns and perform operations such as counting, summing, averaging, and more on the grouped data.

# Grouping Steps and Applications

## Grouping steps

The grouping function in pandas is `groupby`, which is essentially a split-apply-combine process, commonly abbreviated as SAC. Grouping refers to a process that involves one or more of the following steps.

* Splitting: Dividing the data into groups based on specific criteria or columns.
* Applying: Performing computations or transformations on each group independently.
* Combining: Combining the results of the computations into a consolidated output.
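
The three steps can be written out by hand to see what `groupby` does in one call. This is a minimal sketch on hypothetical toy data (the `team` and `score` columns are illustrative and not part of this notebook's dataset):

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three steps.
df = pd.DataFrame({
    "team": ["x", "y", "x", "y", "x"],
    "score": [1, 2, 3, 4, 8],
})

# Split: build one sub-DataFrame per distinct value of "team".
pieces = {key: df[df["team"] == key] for key in sorted(df["team"].unique())}

# Apply: compute the mean score of each piece independently.
means = {key: piece["score"].mean() for key, piece in pieces.items()}

# Combine: assemble the per-group results into a single Series.
manual = pd.Series(means, name="score").rename_axis("team")

# groupby performs the same split, apply, and combine steps in one expression.
grouped = df.groupby("team")["score"].mean()

print(manual.equals(grouped))
```

The manual version is only meant to make the steps visible; in practice the `groupby` call is shorter and faster.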

## Grouping applications

* **Aggregation**: Compute summary statistics for each group. For example:

    - Calculate the sum or mean within each group.

    - Determine the count or size of each group.

* **Transformation**: Perform calculations specific to each group and return an object with a similar index. For example:

    - Standardize data within each group using z-scores.

    - Fill missing values within each group using derived values from the group.

* **Filtration**: Drop certain groups based on an evaluation that results in True or False. For example:

    - Discard data belonging to groups with only a few members.

    - Filter data based on the sum or mean of each group.


* **Combination of the above**: `GroupBy` examines the result of the apply step and attempts to return a sensibly combined result when it does not fit into any of the categories above.

Together, these applications let us compute group-specific statistics, apply transformations within groups, and filter data based on group characteristics.
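
As a rough illustration of aggregation, transformation, and filtration, the sketch below uses hypothetical toy data (`team` and `score` are illustrative names, not columns from this notebook) with the `agg`, `transform`, and `filter` methods of a `GroupBy` object:

```python
import pandas as pd

# Hypothetical toy data: three groups of different sizes.
df = pd.DataFrame({
    "team": ["x", "x", "x", "y", "y", "z"],
    "score": [1.0, 2.0, 3.0, 10.0, 20.0, 5.0],
})
scores = df.groupby("team")["score"]

# Aggregation: one row per group, one column per statistic.
summary = scores.agg(["sum", "mean", "count"])

# Transformation: the result keeps the index of the original data.
# Here each score is standardized to a z-score within its own group
# (a group with a single member yields NaN because its std is undefined).
zscores = scores.transform(lambda s: (s - s.mean()) / s.std())

# Filtration: keep only the rows that belong to groups with at least two members.
large_groups = df.groupby("team").filter(lambda g: len(g) >= 2)

print(summary)
print(zscores)
print(large_groups)
```

Note the different output shapes: `agg` returns one row per group, `transform` returns one value per original row, and `filter` returns a subset of the original rows.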

# Grouping Functions
## `groupby`

```python
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)
```

* `by`: Specifies the criteria used to group the DataFrame. It can be a single column name, a list of column names, a function (called on each index label), or a dictionary or Series that maps index labels to group names.

* `axis`: Specifies the axis along which the grouping is performed. `axis=0` (the default) groups the rows, and `axis=1` groups the columns. Note that this parameter is deprecated as of pandas 2.1.

* `level`: Specifies the level(s) (if the DataFrame has a hierarchical index) on which to group the DataFrame.

* `as_index`: Specifies whether the group keys become the index of an aggregated result. With `as_index=False`, the keys are returned as ordinary columns instead. The default value is True.

* `sort`: Specifies whether to sort the resulting groups by the group keys. The default value is True.

* `group_keys`: Specifies whether to add the group keys to the index of the result when calling `apply`. The default value is True.

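* `squeeze`: Specified whether to reduce the dimensionality of the return type where possible. This parameter was deprecated in pandas 1.1 and removed in pandas 2.0, so it does not appear in the signature above and is not available in the version used here.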

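* `observed`: Specifies whether to include only observed values when grouping by categorical data. It has no effect on non-categorical group keys. The default value is False.

* `dropna`: Specifies whether rows whose group key is missing (NaN) are dropped. With `dropna=False`, missing keys form their own group. The default value is True.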
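
The following sketch illustrates several forms of `by` together with the `as_index`, `sort`, and `dropna` parameters. The data is hypothetical (`kind`, `region`, `value`, and the row labels `r1` to `r4` are illustrative names, not part of this notebook's dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing group key.
df = pd.DataFrame(
    {
        "kind": ["b", "a", np.nan, "a"],
        "region": ["s", "s", "l", "l"],
        "value": [1, 2, 3, 4],
    },
    index=["r1", "r2", "r3", "r4"],
)

# by: a single column name.
by_column = df.groupby("kind")["value"].sum()

# by: a list of column names gives a MultiIndex of group keys.
by_two = df.groupby(["kind", "region"])["value"].sum()

# by: a function is called on each index label.
by_func = df.groupby(lambda label: label in ("r1", "r2"))["value"].sum()

# by: a dict maps index labels to group names.
mapping = {"r1": "first", "r2": "first", "r3": "rest", "r4": "rest"}
by_dict = df.groupby(mapping)["value"].sum()

# as_index=False returns the group key as an ordinary column.
flat = df.groupby("kind", as_index=False)["value"].sum()

# sort=False keeps the groups in order of first appearance.
unsorted = df.groupby("kind", sort=False)["value"].sum()

# dropna=False keeps rows with a missing key as their own group.
with_nan = df.groupby("kind", dropna=False)["value"].sum()

for result in (by_column, by_two, by_func, by_dict, flat, unsorted, with_nan):
    print(result, end="\n\n")
```

With the default `dropna=True`, the row labelled `r3` is silently excluded from every result grouped by `kind`, which is worth keeping in mind when group totals do not add up to the overall total.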

In [2]:
index = pd.Index(data=["A", "B", "C", "D", "E", "F", "G", "H"], name="name")
data = {
    "age": [18, 30, 35, 18, np.nan, 30, 37, 25],
    "city": ["New York", "Los Angeles", "Huston", "Orlando", np.nan, " ", "Miami", "Chicago"],
    "sex": ["male", "male", "female", "male", "female", "female", "male", "male"],
    "income": [3000, 8000, 8000, 4000, 6000, 7000, 10000, 70000]
}
user_info = pd.DataFrame(data=data, index=index)
user_info

name,age,city,sex,income
A,18.0,New York,male,3000
B,30.0,Los Angeles,male,8000
C,35.0,Huston,female,8000
D,18.0,Orlando,male,4000
E,,,female,6000
F,30.0,,female,7000
G,37.0,Miami,male,10000
H,25.0,Chicago,male,70000


In [4]:
# Group by sex
user_info_sex_group = user_info.groupby("sex")
user_info_sex_group

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000028093C81540>

In [5]:
# Look up the grouping information
user_info_sex_group.groups

{'female': ['C', 'E', 'F'], 'male': ['A', 'B', 'D', 'G', 'H']}

In [6]:
# Look up the male info
user_info_sex_group.get_group('male')

name,age,city,sex,income
A,18.0,New York,male,3000
B,30.0,Los Angeles,male,8000
D,18.0,Orlando,male,4000
G,37.0,Miami,male,10000
H,25.0,Chicago,male,70000


In [7]:
# Look up the female info
user_info_sex_group.get_group('female')

name,age,city,sex,income
C,35.0,Huston,female,8000
E,,,female,6000
F,30.0,,female,7000
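
Besides `groups` and `get_group`, a `GroupBy` object can also be inspected with `ngroups` and `size()`, or iterated over directly. The sketch below uses a reduced stand-in for the `user_info` table (the name `users` and the shortened data are assumptions made to keep the example self-contained):

```python
import pandas as pd

# A reduced, hypothetical stand-in for the user_info table above.
users = pd.DataFrame(
    {
        "sex": ["male", "male", "female", "male", "female"],
        "income": [3000, 8000, 8000, 4000, 6000],
    },
    index=pd.Index(["A", "B", "C", "D", "E"], name="name"),
)
by_sex = users.groupby("sex")

# Number of groups, and number of rows in each group.
print(by_sex.ngroups)
print(by_sex.size())

# Iterating yields (group key, sub-DataFrame) pairs,
# in sorted key order because sort=True by default.
for key, frame in by_sex:
    print(key, list(frame.index))
```

Iteration is convenient for exploration, although built-in aggregations are usually preferable to explicit loops for actual computations.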
