Categorizing a dataset and applying a function to each group. 

After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes.

Use the pandas `groupby` interface to slice, dice, and summarize datasets:

- Split a pandas object into pieces using one or more keys
- Calculate group summary statistics like count, mean or standard deviation
- Apply within-group transformations or other manipulations, like normalization, linear regression, rank, or subset selection
- Compute pivot tables and cross-tabulations
- Perform quantile analysis and other statistical group analyses

In [1]:
import numpy as np

import pandas as pd



# 10.1 How to Think About Group Operations
Group operations follow the "split-apply-combine" pattern:
1. Data contained in a pandas object is split into groups based on one or more keys that you provide. The splitting is performed along a particular axis of the object.
2. A function is applied to each group, producing a new value.
3. Finally, the results of all those function applications are combined into a result object.

Each grouping key can take many forms, and the keys do not all have to be of the same type.
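
The three steps can be written out by hand to see what `groupby` does in a single call. This is a minimal sketch: the frame, its column names (`key`, `value`), and the numbers are made up for illustration and do not come from the notes.

```python
import pandas as pd

# Illustrative data (hypothetical values, not the df used elsewhere in these notes)
df = pd.DataFrame({"key": ["a", "b", "a", "b", "a"], "value": [1.0, 2.0, 3.0, 4.0, 8.0]})

# 1. Split: collect the values belonging to each distinct key
pieces = {k: df.loc[df["key"] == k, "value"] for k in df["key"].unique()}

# 2. Apply: reduce each piece to a single number
applied = {k: piece.mean() for k, piece in pieces.items()}

# 3. Combine: assemble the per-group results into one Series
manual = pd.Series(applied, name="value").sort_index()

# groupby performs the same split, apply, and combine steps internally
builtin = df.groupby("key")["value"].mean()

print(manual)
print(builtin)
```

Both results hold one mean per key, which is the shape of output that the `groupby` calls in the following cells produce.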



A `GroupBy` object may look like a DataFrame, but it is already grouped by the provided group key. Nothing is computed until you call a method such as `mean` on it.

In [2]:
df = pd.DataFrame(
    {
        "key1": ["a", "a", None, "b", "b", "a", None],
        "key2": pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"),
        "data1": np.random.standard_normal(7),
        "data2": np.random.standard_normal(7),
    }
)

# Compute the mean of the data1 column using the labels from key1.
# Rows sharing the same key1 value form one group; the result holds one mean per group.
grouped = df['data1'].groupby(df['key1'])
grouped.mean()




key1
a   -0.455194
b    0.115268
Name: data1, dtype: float64

In [None]:
df.groupby(df['key1']).head()

In [None]:
means = df['data1'].groupby( df['key1']).mean()
means

In [None]:
means = df['data1'].groupby( df['key2']).mean()
means

In [None]:
means = df['data1'].groupby([df['key1'], df['key2']])
means.head(999)


In [None]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means.unstack()

In [None]:
states = np.array(['OH', "CA", "CA", "OH", "OH", "CA", "OH"])

years = [2005, 2005, 2006, 2005, 2006, 2005, 2006]

# group keys can be any array of the right length.
df['data1'].groupby([states, years]).mean().unstack()

In [None]:
# Pass column names to use the column as the group keys

df.groupby('key1').mean()

df.groupby(['key1', 'key2']).mean().unstack()

In [None]:
df.groupby(['key1', 'key2']).mean()

Use the `GroupBy.size` method to return a Series containing group sizes. Any missing values in a group key are excluded from the result by default. This behavior can be disabled by passing `dropna=False` to `groupby`.

In [None]:
df.groupby(['key1', 'key2'], dropna=False).size().unstack()

In [None]:
df

In [None]:
df.groupby('key1').count()

In [None]:
df.groupby('key1', dropna=False).size()
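
The cells above use both `size` and `count`, which answer slightly different questions. A small sketch of the difference, using a hypothetical frame with one missing key and one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last row has a missing key, the second row a missing value
df = pd.DataFrame({
    "key": ["a", "a", "b", None],
    "value": [1.0, np.nan, 3.0, 4.0],
})

# size counts rows per group, regardless of NA values in the data columns
sizes = df.groupby("key").size()

# count counts only the non-NA values in the selected column
counts = df.groupby("key")["value"].count()

# dropna=False keeps rows whose key is missing as a group of their own
sizes_with_na = df.groupby("key", dropna=False).size()

print(sizes)
print(counts)
print(sizes_with_na)
```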

## Iterating over Groups

The object returned by `groupby` supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.

In [None]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

# In the case of multiple keys, the first element in the tuple will be a tuple of key values
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

In [None]:

# Computing a dictionary of data pieces as a one-liner
pieces = {name: group for name, group in df.groupby("key1")}

pieces['b']
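
If only one group is needed, building the whole dictionary may be unnecessary. The sketch below, on a small hypothetical frame, compares the dictionary approach with the `get_group` method of the GroupBy object:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"key1": ["a", "b", "a", "b"], "data1": [1, 2, 3, 4]})
grouped = df.groupby("key1")

# Dictionary of all pieces, as in the one-liner above
pieces = {name: group for name, group in grouped}

# Fetch a single group directly by its key
direct = grouped.get_group("b")

print(direct)
```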


By default, `groupby` groups on the rows, but you can group on any of the other axes.

For example, group the columns of `df` by whether they start with 'key' or 'data':



In [None]:
grouped = df.groupby(
    {"key1": "key", "key2": "key", "data1": "data", "data2": "data"}, axis="columns"
)

for group_key, group_val in grouped:
    print(group_key)
    print(group_val)
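
The `axis` argument of `DataFrame.groupby` is deprecated as of pandas 2.1, so the call above may emit a warning or fail on newer versions. One possible workaround, assuming the frame is small enough to transpose cheaply, is to group the transposed frame by its row labels. The frame below is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df, with the same column names
df = pd.DataFrame({"key1": ["a", "b"], "key2": [1, 2], "data1": [0.5, 1.5], "data2": [2.5, 3.5]})
mapping = {"key1": "key", "key2": "key", "data1": "data", "data2": "data"}

# After transposing, the column labels become the row index,
# so an ordinary row-wise groupby splits the columns
column_groups = {}
for group_key, block in df.T.groupby(mapping):
    column_groups[group_key] = block.T  # transpose back to the original orientation
    print(group_key)
    print(block.T)
```

Transposing a frame with mixed column types converts the values to `object` dtype, so this sketch suits inspection better than numeric work.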

In [None]:
df[['data1','data2']]

In [None]:
df[['key1', 'key2']]

## Selecting a Column or Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation. The next two cells are equivalent.

In [10]:
df.groupby('key1')['data1'].head()

0   -0.980510
1    0.753196
3   -0.031481
4    0.262018
5   -1.138268
Name: data1, dtype: float64

In [9]:
df['data1'].groupby(df['key1']).head()

0   -0.980510
1    0.753196
3   -0.031481
4    0.262018
5   -1.138268
Name: data1, dtype: float64

In [11]:
# Select columns before aggregating to limit the work to those columns.
# Here, compute the mean of only the data2 column, returned as a DataFrame.
df.groupby(['key1', 'key2'])[['data2']].mean()

key1,key2,data2
a,1,0.092148
a,2,-1.514385
b,1,0.534622
b,2,-0.924592


## Grouping with Dictionaries and Series
Grouping information may also be given as a dict or Series that maps labels to group names. Keys in the mapping that match no label, such as "f" below, are ignored.

In [16]:
people = pd.DataFrame(
    np.random.standard_normal((5, 5)),
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Steve", "Wanda", "Jill", "Trey"],
)

people.iloc[2:3, [1,2]] = np.nan


In [17]:
people

,a,b,c,d,e
Joe,-0.472598,3.187343,-0.502007,-1.884353,-1.207755
Steve,0.810863,-0.416002,1.704311,-1.977146,-1.335709
Wanda,-1.390674,,,-1.774004,0.70221
Jill,-1.308237,-0.967369,0.231921,0.355734,0.373386
Trey,0.236129,-0.02646,-0.260725,0.394199,-0.552764


In [23]:
mapping = {"a": "red", "b": "red", "c": "blue", "d": "blue", "e": "red", "f": "orange"}

by_column = people.groupby(mapping, axis="columns")

by_column.count()

,blue,red
Joe,2,3
Steve,2,3
Wanda,1,2
Jill,2,3
Trey,2,3


In [24]:
by_column.sum()

,blue,red
Joe,-2.38636,1.50699
Steve,-0.272835,-0.940848
Wanda,-1.774004,-0.688465
Jill,0.587655,-1.90222
Trey,0.133474,-0.343095


In [25]:
by_column.head(999)

,a,b,c,d,e
Joe,-0.472598,3.187343,-0.502007,-1.884353,-1.207755
Steve,0.810863,-0.416002,1.704311,-1.977146,-1.335709
Wanda,-1.390674,,,-1.774004,0.70221
Jill,-1.308237,-0.967369,0.231921,0.355734,0.373386
Trey,0.236129,-0.02646,-0.260725,0.394199,-0.552764


In [26]:
map_series = pd.Series(mapping)

people.groupby(map_series, axis="columns").count()

,blue,red
Joe,2,3
Steve,2,3
Wanda,1,2
Jill,2,3
Trey,2,3
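
A dict or Series works the same way for grouping rows, where the mapping is looked up against the index labels. The sketch below uses a hypothetical `scores` frame and `team` mapping. The extra key "Zed" matches no row and is not used.

```python
import pandas as pd

# Hypothetical data: one row per person
scores = pd.DataFrame({"points": [10, 20, 30, 40]}, index=["Joe", "Steve", "Wanda", "Jill"])

# Hypothetical mapping from index label to group name; "Zed" has no matching row
team = {"Joe": "red", "Steve": "blue", "Wanda": "red", "Jill": "blue", "Zed": "green"}

by_dict = scores.groupby(team).sum()

# A Series is aligned on its index before grouping, giving the same groups
by_series = scores.groupby(pd.Series(team)).sum()

print(by_dict)
print(by_series)
```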


## Grouping with Functions
Any function passed as a group key will be called once per index value, with the return values being used as the group names. For example, passing `len` groups the rows of `people` by the length of each name.

In [27]:
people.groupby(len).sum()

,a,b,c,d,e
3,-0.472598,3.187343,-0.502007,-1.884353,-1.207755
4,-1.072108,-0.993829,-0.028805,0.749933,-0.179378
5,-0.579812,-0.416002,1.704311,-3.751149,-0.633499


In [36]:
key_list = ['one', 'one', 'one', 'two', 'two']

# Functions can be mixed with arrays, dicts, or Series as group keys
people.groupby([len, key_list]).sum()

,,a,b,c,d,e
3,one,-0.472598,3.187343,-0.502007,-1.884353,-1.207755
4,two,-1.072108,-0.993829,-0.028805,0.749933,-0.179378
5,one,-0.579812,-0.416002,1.704311,-3.751149,-0.633499


In [37]:

people.groupby([len, key_list]).min()

,,a,b,c,d,e
3,one,-0.472598,3.187343,-0.502007,-1.884353,-1.207755
4,two,-1.308237,-0.967369,-0.260725,0.355734,-0.552764
5,one,-1.390674,-0.416002,1.704311,-1.977146,-1.335709


## Grouping by Index Levels
Hierarchically indexed datasets can be aggregated using one of the levels of an axis index.



In [40]:
columns = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]], names=["city", "tenor"]
)

hier_df = pd.DataFrame(np.random.standard_normal((4, 5)), columns=columns)

hier_df


city,US,US,US,JP,JP
tenor,1,3,5,1,3
0,0.06568,2.017971,0.631632,-2.054549,0.324105
1,-0.097418,-0.249019,-0.903861,0.965828,0.481348
2,0.072769,0.019059,-1.197369,0.216974,0.442749
3,-0.202678,-1.080596,1.796581,-0.040351,-1.846178


In [42]:
# To group by level, pass the level number or name using level keyword
hier_df.groupby(level="city", axis='columns').count()

city,JP,US
0,2,3
1,2,3
2,2,3
3,2,3


# 10.2 Data Aggregation

Aggregation refers to any data transformation that produces scalar values from arrays.

Optimized groupby methods

| Function name | Description |
| --- | --- |
| any, all | Return True if any (one or more) or all non-NA values are "truthy" |
| count | Number of non-NA values |
| cummin, cummax | Cumulative minimum and maximum of non-NA values |
| cumsum | Cumulative sum of non-NA values |
| cumprod | Cumulative product of non-NA values |
| first, last | First and last non-NA values |
| mean | Mean of non-NA values |
| median | Arithmetic median of non-NA values |
| min, max | Minimum and maximum of non-NA values |
| nth | Retrieve the value at position n within each group |
| ohlc | Compute four "open-high-low-close" statistics for time series-like data |
| prod | Product of non-NA values |
| quantile | Compute sample quantile |
| rank | Ordinal ranks of non-NA values, like calling Series.rank |
| size | Compute group sizes, returning the result as a Series |
| std, var | Sample standard deviation and variance |
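
Not every method in the table reduces a group to a single value. As a sketch on hypothetical data, `mean` returns one value per group, while `cumsum` returns one value per original row:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "value": [1, 2, 3, 4]})
grouped = df.groupby("key")["value"]

group_means = grouped.mean()      # aggregation: indexed by group key
running_sums = grouped.cumsum()   # cumulative: indexed like the original rows

print(group_means)
print(running_sums)
```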
 

To use your own aggregation functions, pass any function that aggregates an array to the `aggregate` method or its short alias `agg`:



In [45]:
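def peak_to_peak(arr):
    # Range of the values in one group
    return arr.max() - arr.min()

# Rebuild the Series GroupBy from In [2]; `grouped` was reassigned in a later cell
grouped = df['data1'].groupby(df['key1'])
grouped.agg(peak_to_peak)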

key1
a    1.891464
b    0.293499
Name: data1, dtype: float64
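
A custom function is called once per group in Python, so it is generally slower than the optimized methods. Where the same result can be built from built-in methods, that form may be preferable. The sketch below uses hypothetical data to check that both routes agree, and shows that `agg` also accepts a list mixing method names and functions.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"key1": ["a", "a", "b", "b"], "data1": [1.0, 4.0, 2.0, 2.5]})
grouped = df.groupby("key1")["data1"]

def peak_to_peak(arr):
    # Range of the values in one group
    return arr.max() - arr.min()

custom = grouped.agg(peak_to_peak)          # custom function, applied group by group
builtin = grouped.max() - grouped.min()     # same quantity from optimized methods

# A list of aggregations yields one column per entry, named after each function
several = grouped.agg(["mean", peak_to_peak])

print(custom)
print(builtin)
print(several)
```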