Categorizing a dataset and applying a function to each group. 

After loading, merging, and preparing a dataset, you may need to compute group statistics or pivot tables for reporting or visualization purposes.

Use the pandas `groupby` interface to slice, dice, and summarize datasets:

- Split a pandas object into pieces using one or more keys
- Calculate group summary statistics like count, mean or standard deviation
- Apply within-group transformations or other manipulations, such as normalization, linear regression, rank, or subset selection
- Compute pivot tables and cross-tabulations
- Perform quantile analysis and other statistical group analyses

In [None]:
import numpy as np

import pandas as pd



# 10.1 How to Think About Group Operations
Group operations follow the "split-apply-combine" pattern:
1. Data contained in a pandas object is split into groups based on one or more keys that you provide. The splitting is performed along a particular axis of the object.
2. A function is applied to each group, producing a new value.
3. Finally, the results of all those function applications are combined into a result object.
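
The three steps above can be spelled out by hand to see what `groupby` does in a single call. This is only a minimal sketch on a made-up frame (the column names `key` and `value` and the numbers are assumptions, not part of this notebook's data), and it is not how pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame(
    {"key": ["a", "b", "a", "b", "a"], "value": [1.0, 2.0, 3.0, 4.0, 5.0]}
)

# 1. Split: collect the values belonging to each distinct key
pieces = {k: toy.loc[toy["key"] == k, "value"] for k in toy["key"].unique()}

# 2. Apply: reduce each piece to a single number
applied = {k: piece.mean() for k, piece in pieces.items()}

# 3. Combine: assemble the per-group results into one Series
manual = pd.Series(applied, name="value").rename_axis("key").sort_index()

# groupby performs the same three steps in one call
result = toy.groupby("key")["value"].mean()

print(manual)
print(result)
```

Both printed Series should hold the same per-group means, which is the point of the comparison.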

Each grouping key can take many forms, and the keys do not all have to be of the same type.
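
As a rough illustration of the forms a key can take, the sketch below groups the same Series by a column name, an array, a dict, a function, and a mixed list. The frame, its index labels, and the names `labels` and `mapping` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; labels and values are made up for illustration
frame = pd.DataFrame(
    {"team": ["x", "y", "x", "y"], "score": [10, 20, 30, 40]},
    index=["ann", "bob", "cy", "dee"],
)

# 1. A column name
by_column = frame.groupby("team")["score"].sum()

# 2. An array with the same length as the axis being grouped
labels = np.array(["odd", "even", "odd", "even"])
by_array = frame["score"].groupby(labels).sum()

# 3. A dict mapping index labels to group names
mapping = {"ann": "g1", "bob": "g1", "cy": "g2", "dee": "g2"}
by_dict = frame["score"].groupby(mapping).sum()

# 4. A function called on each index label (here, the length of the label)
by_func = frame["score"].groupby(len).sum()

# Keys of different kinds can be mixed in one list
by_mixed = frame["score"].groupby([len, labels]).sum()

for res in (by_column, by_array, by_dict, by_func, by_mixed):
    print(res, end="\n\n")
```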



A `GroupBy` object may look like a DataFrame, but it is already grouped by the provided group key.
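
A quick way to see the difference is to inspect the object directly. The sketch below uses a small made-up frame (`df_demo` is an assumption, separate from the `df` built in the next cell) and only attributes that exist on pandas GroupBy objects: `ngroups` and `groups`.

```python
import pandas as pd

# Hypothetical three-row frame, used only to inspect the GroupBy object
df_demo = pd.DataFrame({"key1": ["a", "a", "b"], "data1": [1.0, 2.0, 3.0]})
grouped_demo = df_demo.groupby("key1")

# The result is a GroupBy object, not a DataFrame; nothing is computed yet
print(type(grouped_demo).__name__)
print(isinstance(grouped_demo, pd.DataFrame))

# It already knows how the rows are partitioned
print(grouped_demo.ngroups)  # number of distinct groups
print(grouped_demo.groups)   # mapping of group key -> row labels
```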

In [None]:
df = pd.DataFrame(
    {
        "key1": ["a", "a", None, "b", "b", "a", None],
        "key2": pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"),
        "data1": np.random.standard_normal(7),
        "data2": np.random.standard_normal(7),
    }
)

# Compute the mean of the data1 column using the labels from key1.
# Returns one mean per group in key1 (rows sharing a key1 value form one group).
grouped = df['data1'].groupby(df['key1'])
grouped.mean()




In [None]:
df.groupby(df['key1']).head()

In [None]:
means = df['data1'].groupby( df['key1']).mean()
means

In [None]:
means = df['data1'].groupby( df['key2']).mean()
means

In [None]:
means = df['data1'].groupby([df['key1'], df['key2']])
means.head(999)


In [None]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means.unstack()

In [None]:
states = np.array(['OH', "CA", "CA", "OH", "OH", "CA", "OH"])

years = [2005, 2005, 2006, 2005, 2006, 2005, 2006]

# group keys can be any array of the right length.
df['data1'].groupby([states, years]).mean().unstack()

In [None]:
# Pass column names to use the column as the group keys

df.groupby('key1').mean()

df.groupby(['key1', 'key2']).mean().unstack()

In [None]:
df.groupby(['key1', 'key2']).mean()

Use the `GroupBy.size` method to return a Series containing group sizes. Any missing values in a group key are excluded from the result by default. This behavior can be disabled by passing `dropna=False` to `groupby`.

In [None]:
df.groupby(['key1', 'key2'], dropna=False).size().unstack()

In [None]:
df

In [None]:
df.groupby('key1').count()

In [None]:
df.groupby('key1', dropna=False).size()
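
The cells above use both `size` and `count`, which answer slightly different questions. A minimal sketch of the difference on a made-up frame (`toy` and its values are assumptions) follows: `size` counts rows per group, while `count` counts non-null values per column within each group.

```python
import pandas as pd

# Hypothetical data with one missing key and one missing value
toy = pd.DataFrame(
    {"key": ["a", "a", "b", None], "value": [1.0, None, 3.0, 4.0]}
)

# Rows per group; the row whose key is missing is dropped by default
sizes = toy.groupby("key").size()

# Non-null entries of "value" per group; the NaN in group "a" is not counted
counts = toy.groupby("key")["value"].count()

# Keep the missing key as its own group
sizes_with_na = toy.groupby("key", dropna=False).size()

print(sizes)
print(counts)
print(sizes_with_na)
```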

## Iterating over Groups

The object returned by `groupby` supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.

In [None]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

# In the case of multiple keys, the first element in the tuple will be a tuple of key values
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

In [75]:

# Computing a dictionary of data pieces as a one-liner
pieces = {name: group for name, group in df.groupby("key1")}

pieces['b']

  key1  key2     data1     data2
3    b     2  0.742205 -0.399149
4    b     1 -0.241077  0.276053
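
When only one group is needed, building the whole dictionary is not required. As a hedged alternative, the sketch below uses `GroupBy.get_group` on a made-up frame (`toy` is an assumption) and checks that it returns the same rows as the dictionary lookup.

```python
import pandas as pd

# Hypothetical frame standing in for df
toy = pd.DataFrame({"key1": ["a", "b", "b", "a"], "data1": [1.0, 2.0, 3.0, 4.0]})
grouped_toy = toy.groupby("key1")

# Dictionary of all pieces, as in the one-liner above
pieces_toy = {name: group for name, group in grouped_toy}

# Fetch a single group directly by its key
direct = grouped_toy.get_group("b")

print(direct)
print(direct.equals(pieces_toy["b"]))
```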



By default `groupby` groups on the rows, but you can group on any of the other axes.

For example, group the columns of `df` by whether their names start with 'key' or 'data':



In [81]:
grouped = df.groupby(
    {"key1": "key", "key2": "key", "data1": "data", "data2": "data"}, axis="columns"
)

for group_key, group_val in grouped:
    print(group_key)
    print(group_val)

data
      data1     data2
0 -0.216346 -0.009212
1 -1.143640  0.686311
2 -0.369244 -0.933156
3  0.742205 -0.399149
4 -0.241077  0.276053
5  0.110157 -1.771335
6 -1.578311 -0.661585
key
   key1  key2
0     a     1
1     a     2
2  None     1
3     b     2
4     b     1
5     a  <NA>
6  None     1
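
The `axis` argument of `groupby` appears to be deprecated since pandas 2.1, with the documentation suggesting `frame.T.groupby(...)` for column-wise grouping. The sketch below shows that variant on a made-up two-row frame (`toy` is an assumption). One caveat: transposing a frame with mixed column types produces object-dtype data, so the regrouped blocks may need their dtypes restored.

```python
import pandas as pd

# Hypothetical frame with the same column names as df
toy = pd.DataFrame(
    {"key1": ["a", "b"], "key2": [1, 2], "data1": [0.1, 0.2], "data2": [0.3, 0.4]}
)
mapping = {"key1": "key", "key2": "key", "data1": "data", "data2": "data"}

# Transpose so that the columns become the index, then group the index labels
grouped_cols = toy.T.groupby(mapping)

# Transpose each block back so it again has the original rows
column_groups = {label: block.T for label, block in grouped_cols}

for label, block in column_groups.items():
    print(label)
    print(block)
```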


In [84]:
df[['data1','data2']]

      data1     data2
0 -0.216346 -0.009212
1 -1.143640  0.686311
2 -0.369244 -0.933156
3  0.742205 -0.399149
4 -0.241077  0.276053
5  0.110157 -1.771335
6 -1.578311 -0.661585


In [85]:
df[['key1', 'key2']]

   key1  key2
0     a     1
1     a     2
2  None     1
3     b     2
4     b     1
5     a  <NA>
6  None     1
