Categorizing a dataset and applying a function to each group. 

After loading, merging, and preparing a dataset, you may need to compute group statistics or pivot tables for reporting or visualization purposes.

Use the pandas `groupby` interface to slice, dice, and summarize datasets:

- Split a pandas object into pieces using one or more keys
- Calculate group summary statistics like count, mean or standard deviation
- Apply within-group transformations or other manipulations, like normalization, linear regression, rank, or subset selection
- Compute pivot tables and cross-tabulations
- Perform quantile analysis and other statistical group analyses

In [None]:
import numpy as np

import pandas as pd



# 10.1 How to think about Group Operations
Group operations follow a "split-apply-combine" pattern:
1. Data contained in a pandas object is split into groups based on one or more keys that you provide. The splitting is performed along a particular axis of the object.
2. A function is applied to each group, producing a new value.
3. Finally, the results of all those function applications are combined into a result object.

Each grouping key can take many forms, and the keys do not all have to be of the same type.
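
The three steps can be spelled out by hand to see what `groupby` does for you. This is only a sketch on a made-up frame (`toy`, `pieces`, and `applied` are illustrative names, not part of the notebook's data), not a description of how pandas implements it internally.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's df
toy = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]})

# 1. Split: collect the values that belong to each key
pieces = {}
for key, value in zip(toy["key"], toy["value"]):
    pieces.setdefault(key, []).append(value)

# 2. Apply: reduce each piece to a single number (here, the mean)
applied = {key: sum(values) / len(values) for key, values in pieces.items()}

# 3. Combine: assemble the per-group results into one object
manual = pd.Series(applied).sort_index()

# The same three steps performed by groupby
with_groupby = toy.groupby("key")["value"].mean()

print(manual)
print(with_groupby)
```

Both objects hold one mean per key; the `groupby` version additionally labels the index with the key name.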



A `GroupBy` object may look like a DataFrame, but it is already grouped by the provided group key.

In [None]:
df = pd.DataFrame(
    {
        "key1": ["a", "a", None, "b", "b", "a", None],
        "key2": pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"),
        "data1": np.random.standard_normal(7),
        "data2": np.random.standard_normal(7),
    }
)

# Compute the mean of the data1 column using the labels from key1.
# Returns the mean of each group in key1 (rows with the same key1 value form one group).
grouped = df['data1'].groupby(df['key1'])
grouped.mean()




In [None]:
df.groupby(df['key1']).head()

In [None]:
means = df['data1'].groupby( df['key1']).mean()
means

In [None]:
means = df['data1'].groupby( df['key2']).mean()
means

In [None]:
means = df['data1'].groupby([df['key1'], df['key2']])
means.head(999)


In [None]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means.unstack()

In [None]:
states = np.array(['OH', "CA", "CA", "OH", "OH", "CA", "OH"])

years = [2005, 2005, 2006, 2005, 2006, 2005, 2006]

# group keys can be any array of the right length.
df['data1'].groupby([states, years]).mean().unstack()

In [None]:
# Pass column names to use the column as the group keys

df.groupby('key1').mean()

df.groupby(['key1', 'key2']).mean().unstack()

In [None]:
df.groupby(['key1', 'key2']).mean()

Use the `GroupBy.size` method to return a Series containing group sizes. Any missing values in a group key are excluded from the result by default. This behavior can be disabled by passing `dropna=False` to `groupby`.

In [None]:
df.groupby(['key1', 'key2'], dropna=False).size().unstack()

In [None]:
df

In [None]:
df.groupby('key1').count()

In [None]:
df.groupby('key1', dropna=False).size()

## Iterating over Groups

The object returned by `groupby` supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.

In [None]:
for name, group in df.groupby('key1'):
	print(name)
	print(group)

# In the case of multiple keys, the first element in the tuple will be a tuple of key values
for (k1, k2), group in df.groupby(['key1', 'key2']):
	print((k1, k2))
	print(group)

In [None]:

# Computing a dictionary of data pieces as a one-liner
pieces = {name: group for name, group in df.groupby("key1")}

pieces['b']

By default, `groupby` groups on `axis="index"`, but you can also group on the other axis.

Here, the columns of `df` are grouped by whether they start with 'key' or 'data'.



In [None]:
grouped = df.groupby(
    {"key1": "key", "key2": "key", "data1": "data", "data2": "data"}, axis="columns"
)

for group_key, group_val in grouped:
	print(group_key)
	print(group_val)

In [None]:
df[['data1','data2']]

In [None]:
df[['key1', 'key2']]

## Selecting a Column or Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name or an array of column names has the effect of column subsetting for aggregation.

In [None]:
df.groupby('key1')['data1'].head()

In [None]:
df['data1'].groupby(df['key1']).head()

In [None]:
# To aggregate only a few columns, index the GroupBy object first.
# Here, compute means for only the data2 column.
df.groupby(['key1', 'key2'])[['data2']].mean()

## Grouping with Dictionaries and Series

In [None]:
people = pd.DataFrame(
    np.random.standard_normal((5, 5)),
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Steve", "Wanda", "Jill", "Trey"],
)

people.iloc[2:3, [1,2]] = np.nan


In [None]:
people

In [None]:
mapping = {"a": "red", "b": "red", "c": "blue", "d": "blue", "e": "red", "f": "orange"}

by_column = people.groupby(mapping, axis="columns")

by_column.count()

In [None]:
by_column.sum()

In [None]:
by_column.head(999)

In [None]:
map_series = pd.Series(mapping)

people.groupby(map_series, axis="columns").count()
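
The `axis` argument of `DataFrame.groupby` is deprecated as of pandas 2.1, so the column-wise grouping above may emit a warning or fail on newer versions. One alternative is to group the transposed frame and transpose the result back. The sketch below uses a made-up frame (`toy`) and mapping (`color_of`) rather than `people`, so the expected sums are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three rows, four columns holding 0..11
toy = pd.DataFrame(
    np.arange(12.0).reshape(3, 4),
    columns=["a", "b", "c", "d"],
)

# Mapping from column name to group name; unused keys such as "z" are ignored
color_of = {"a": "red", "b": "red", "c": "blue", "d": "blue", "z": "unused"}

# Transpose so the columns become the index, group the rows, then transpose back
by_color = toy.T.groupby(color_of).sum().T

print(by_color)
```

Each row of `by_color` holds the sum of the "blue" columns (`c + d`) and the "red" columns (`a + b`) for that row.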

## Grouping with Functions
Any function passed as a group key will be called once per index value, with the return values being used as the group names. 

In [None]:
people.groupby(len).sum()

In [None]:
key_list = ['one', 'one', 'one', 'two', 'two']

people.groupby([len, key_list]).sum()

In [None]:

people.groupby([len, key_list]).min()
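
Passing a function as the key gives the same grouping as mapping that function over the index yourself. A small sketch on a made-up frame (`toy` is hypothetical; the names are borrowed from the `people` index only for familiarity):

```python
import pandas as pd

# Hypothetical frame indexed by first names
toy = pd.DataFrame({"x": [1, 2, 3, 4]}, index=["Joe", "Steve", "Wanda", "Jill"])

# Group by the length of each index label
by_function = toy.groupby(len).sum()

# Equivalent: compute the lengths first, then group by that array of keys
by_mapped_index = toy.groupby(toy.index.map(len)).sum()

print(by_function)
```

"Steve" and "Wanda" both have five letters, so they land in the same group.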

## Grouping by Index Levels
Aggregate using one of the levels of an axis index.



In [None]:
columns = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]], names=["city", "tenor"]
)

hier_df = pd.DataFrame(np.random.standard_normal((4, 5)), columns=columns)

hier_df


In [None]:
# To group by level, pass the level number or name using the level keyword
hier_df.groupby(level="city", axis='columns').count()

# 10.2 Data Aggregation

Aggregation refers to any data transformation that produces scalar values from arrays.

Optimized groupby methods

| Function Name | Description |
| - | - |
| any, all | Return True if any (one or more values) or all non-NA values are "truthy" |
| count | Number of non-NA values |
| cummin, cummax | Cumulative minimum and maximum of non-NA values |
| cumsum | Cumulative sum of non-NA values |
| cumprod | Cumulative product of non-NA values |
| first, last | First and last non-NA values |
| mean | Mean of non-NA values |
| median | Arithmetic median of non-NA values |
| min, max | Minimum and maximum of non-NA values |
| nth | Retrieve value that would appear at position n with the data in sorted order |
| ohlc | Compute four "open-high-low-close" statistics for time series-like data |
| prod | Product of non-NA values |
| quantile | Compute sample quantile |
| rank | Ordinal ranks of non-NA values, like calling Series.rank |
| size | Compute group sizes, returning result as a Series |
| std, var | Sample standard deviation and variance |
 

To use your own aggregation functions, pass any function that aggregates an array to the `aggregate` method or its short alias `agg`:



In [None]:
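def peak_to_peak(arr):
	return arr.max() - arr.min()

# Regroup by key1; `grouped` was last bound to the column-wise grouping above
grouped = df.groupby('key1')
grouped.agg(peak_to_peak)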

## Column-wise and multiple function application 

In [None]:
tips = pd.read_csv('./datasets/tips.csv')

tips.head()


In [None]:
tips['tip_pct'] = tips['tip'] / tips['total_bill']

tips.head()

In [None]:
grouped = tips.groupby(['day', 'smoker'])

In [None]:
grouped_pct = grouped['tip_pct']

In [None]:
grouped_pct.agg('mean')


In [None]:
grouped_pct.agg(['mean', 'std', peak_to_peak])

In [None]:
# Pass a list of (name, function) tuples, the first element of each tuple will be used as DataFrame column name
grouped_pct.agg([('Average', 'mean'), ('stdev', np.std)])

In [None]:
# Specify a list of functions to apply to all of the columns or different functions per column in DataFrame
functions = ['count', 'mean', 'max']

result = grouped[['tip_pct', 'total_bill']].agg(functions)

result

In [None]:
# To apply different functions to one or more of the columns, pass a dictionary to agg that maps column names to any of the function specifications

grouped.agg({"tip": np.max, "size": "sum"})

# Pass multiple functions to one column by using a list
grouped.agg({"tip_pct": ["min", "max", "mean"]})


In [None]:
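# numeric_only=True skips the string column 'time', which recent pandas versions refuse to average
tips.groupby(['day', 'smoker'], as_index=False).mean(numeric_only=True)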
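
Passing `as_index=False` returns the group keys as ordinary columns instead of an index. The same layout can be obtained by calling `reset_index` on the default result. A minimal sketch on made-up data (`toy` is hypothetical, loosely shaped like the tips dataset):

```python
import pandas as pd

# Hypothetical data with two key columns and one numeric column
toy = pd.DataFrame(
    {
        "day": ["Fri", "Fri", "Sat", "Sat"],
        "smoker": ["No", "Yes", "No", "No"],
        "tip": [1.0, 2.0, 3.0, 5.0],
    }
)

# Keys stay as regular columns
flat = toy.groupby(["day", "smoker"], as_index=False).mean(numeric_only=True)

# Default result has a hierarchical index; reset_index flattens it afterwards
via_reset = toy.groupby(["day", "smoker"]).mean(numeric_only=True).reset_index()

print(flat)
```

Using `as_index=False` avoids building the hierarchical index in the first place, which can save some work on large results.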

# 10.3 Apply: General split-apply-combine

The most general-purpose GroupBy method is `apply`.
`apply` splits the object being manipulated into pieces, invokes the passed function on each piece and then attempts to concatenate the pieces. 
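
As a rough mental model (not the actual pandas implementation), `apply` behaves like iterating over the groups, calling the function on each piece, and gluing the results together with `pd.concat`. The frame `toy` and the helper `top_two` below are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two groups with a numeric score
toy = pd.DataFrame(
    {
        "group": ["x", "x", "x", "y", "y"],
        "score": [3, 9, 5, 2, 8],
    }
)


def top_two(piece):
    # Hypothetical helper: keep the two rows with the largest score
    return piece.sort_values("score", ascending=False).head(2)


# Split by hand, call the function on each piece, concatenate keyed by group name
manual = pd.concat(
    {name: top_two(piece[["score"]]) for name, piece in toy.groupby("group")}
)

# The same operation through apply, selecting the column to operate on first
applied = toy.groupby("group")[["score"]].apply(top_two)

print(manual)
print(applied)
```

Both results carry a two-level index: the group name on the outside and the original row label on the inside.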



In [None]:
# A function that selects the rows with the largest values in a particular column
def top(df, n=5, column="tip_pct"):
	return df.sort_values(column, ascending=False)[:n]

top(tips, n=6)

# The top function will be applied to each smoker group
# The result has a hierarchical index with an inner level that contains index values from the original DataFrame
tips.groupby('smoker').apply(top)

# Extra arguments for the function can be passed through apply
# The code below returns the row with the highest total bill for each smoker/day combination
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')

In [None]:
result = tips.groupby('smoker')['tip_pct'].describe()
result
result.unstack('smoker')
# Inside GroupBy, invoking a method like describe is actually a shortcut for:
def f(group):
	return group.describe()

grouped.apply(f)

## Suppressing the Group Keys


In [None]:
tips.groupby('smoker').apply(top)

In [None]:
tips.groupby('smoker', group_keys=False).apply(top)

## Quantile and Bucket Analysis
`pandas.cut` and `pandas.qcut` slice data up into buckets, with bins of your choosing or by sample quantiles.

In [None]:
# Sample random dataset and an equal-length bucket categorization using pandas.cut
frame = pd.DataFrame(
    {"data1": np.random.standard_normal(1000), "data2": np.random.standard_normal(1000)}
)

frame.head()

quartiles = pd.cut(frame['data1'], 4)

# The Categorical object returned by cut can be passed directly to groupby. 
# So we could compute a set of group statistics for the quartiles.

In [None]:
quartiles

In [None]:
def get_stats(group):
    return pd.DataFrame(
        {
            "min": group.min(),
            "max": group.max(),
            "count": group.count(),
            "mean": group.mean(),
        }
    )

grouped = frame.groupby(quartiles)

grouped.apply(get_stats)
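
Note that `pandas.cut` produces equal-length bins, so the four buckets above generally hold different numbers of points even though the variable is called `quartiles`. For equal-size buckets based on sample quantiles, `pandas.qcut` is the usual choice. A sketch on freshly generated data (the seed and the names `values`, `equal_width`, and `equal_count` are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 1000 draws from a standard normal, seeded for repeatability
rng = np.random.default_rng(0)
values = pd.Series(rng.standard_normal(1000))

equal_width = pd.cut(values, 4)   # four bins of equal length
equal_count = pd.qcut(values, 4)  # four bins holding equal numbers of points

# Both Categoricals can be passed directly to groupby
width_counts = values.groupby(equal_width, observed=True).count()
quartile_counts = values.groupby(equal_count, observed=True).count()

print(width_counts)
print(quartile_counts)
```

With `cut`, most points fall into the two middle bins; with `qcut`, each bin holds a quarter of the sample.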


## Example: Filling Missing Values with Group-Specific Values
Suppose you need the fill value to vary by group.
Use `apply` with a function that calls `fillna` on each data chunk.

In [None]:
states = ['Ohio', 'New York', 'Vermont', 'Florida', 'Oregon', 'Nevada', 'California', 'Idaho']

group_key = ['East', 'East', 'East', 'East', 'West', 'West', 'West', 'West']

data = pd.Series(np.random.standard_normal(8), index=states)

In [None]:
data

In [None]:
# Set some values in the data to be missing
data[['Vermont', 'Nevada', "Idaho"]] = np.nan

In [None]:
data.groupby(group_key).size()

In [None]:
data.groupby(group_key).count()

In [None]:
def fill_mean(group):
	return group.fillna(group.mean())

data.groupby(group_key).apply(fill_mean)

In [None]:
fill_values = {'East':0.5, 'West': -1}

def fill_func(group):
	return group.fillna(fill_values[group.name])

In [None]:
data.groupby(group_key).apply(fill_func)
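
Filling with the group mean can also be written without `apply`, by broadcasting each group's mean back to the original shape with `transform` and handing the result to `fillna`. This is an alternative sketch on made-up numbers (`values` and `region` are hypothetical), not the approach used in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value per region
values = pd.Series(
    [1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
    index=["s1", "s2", "s3", "s4", "s5", "s6"],
)
region = ["East", "East", "East", "West", "West", "West"]

# One mean per row, aligned with the original index
group_means = values.groupby(region).transform("mean")

# fillna only uses the entries of group_means where values is missing
filled = values.fillna(group_means)

print(filled)
```

Because the result keeps the original index, there is no extra group level to strip afterwards.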



## Example: Random Sampling and Permutation

Suppose you want to draw a random sample from a large dataset, for example for a Monte Carlo simulation.

In [None]:
suits = ["H", "S", "C", "D"]

card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ["A"] + list(range(2, 11)) + ["J", "K", "Q"]

cards = []

for suit in suits:
	cards.extend(str(num) + suit for num in base_names)

deck = pd.Series(card_val, index=cards)

In [None]:
deck.head(13)

In [None]:
def draw(deck, n=5):
	# Sample n cards rather than a hard-coded 5, so callers can pass n=2
	return deck.sample(n)

draw(deck)

In [None]:
# Get 2 random cards from each suit. 

def get_suit(card):
	# last letter is suit
	return card[-1]

deck.groupby(get_suit).apply(draw, n=2)

## Example: Group Weighted Average and Correlation

In [None]:
df = pd.DataFrame(
    {
        "category": ["a", "a", "a", "a", "b", "b", "b", "b"],
        "data": np.random.standard_normal(8),
        "weights": np.random.uniform(size=8),
    }
)
df


In [None]:
grouped = df.groupby("category")

# Weighted average by category
def get_wavg(group):
    return np.average(group["data"], weights=group["weights"])


grouped.apply(get_wavg)
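
The weighted average of a group is the sum of `data * weights` divided by the sum of `weights`, so it can also be computed from two plain group sums. The sketch below checks this on a small made-up frame (`toy`) where the answer is easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical data with simple weights
toy = pd.DataFrame(
    {
        "category": ["a", "a", "b", "b"],
        "data": [1.0, 3.0, 10.0, 20.0],
        "weights": [1.0, 3.0, 1.0, 1.0],
    }
)

# Numerator and denominator of the weighted average, one value per category
weighted_sum = (toy["data"] * toy["weights"]).groupby(toy["category"]).sum()
total_weight = toy["weights"].groupby(toy["category"]).sum()
wavg = weighted_sum / total_weight

# Cross-check category "a" against np.average
rows_a = toy[toy["category"] == "a"]
check_a = np.average(rows_a["data"], weights=rows_a["weights"])

print(wavg)
```

For category "a" this gives (1·1 + 3·3) / (1 + 3) = 2.5.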


In [None]:
close_px = pd.read_csv("./datasets/stock_px.csv", parse_dates=True, index_col=0)

close_px.info()


In [None]:
close_px.tail(4)

In [None]:
# Create a function that computes the pairwise correlation of each column with the SPX column

def spx_corr(group):
	return group.corrwith(group['SPX'])

# Compute percent change on the close_px using pct_change
rets = close_px.pct_change().dropna()

# Group these percent changes by year
def get_year(x):
	return x.year

by_year = rets.groupby(get_year)

In [None]:
by_year.apply(spx_corr)

In [None]:
# Compute intercolumn correlations

def corr_aapl_msft(group):
	return group['AAPL'].corr(group["MSFT"])

by_year.apply(corr_aapl_msft)

## Example: Group-Wise Linear Regression

Groupby can perform more complex group-wise statistical analysis, as long as the function returns a pandas object or scalar value. 


In [None]:
import statsmodels.api as sm

def regress(data, yvar=None, xvars=None):
	Y = data[yvar]
	# assign returns a new frame, so the group itself is not modified
	X = data[xvars].assign(intercept=1.0)
	result = sm.OLS(Y, X).fit()
	return result.params

In [None]:
by_year.apply(regress, yvar='AAPL', xvars=['SPX'])
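
If statsmodels is not installed, the same group-wise pattern can be tried with NumPy alone. The sketch below fits a straight line per group with `np.polyfit` on noise-free made-up data, so the recovered coefficients are known in advance. `toy` and `fit_line` are illustrative names, and `np.polyfit` is swapped in here in place of `sm.OLS`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y = 2x + 1 in 2020 and y = -x + 5 in 2021 (no noise)
toy = pd.DataFrame(
    {
        "year": [2020] * 4 + [2021] * 4,
        "x": [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0],
    }
)
toy["y"] = np.where(toy["year"] == 2020, 2 * toy["x"] + 1, -toy["x"] + 5)


def fit_line(piece):
    # Least-squares fit of a degree-1 polynomial; returns a labeled Series
    slope, intercept = np.polyfit(piece["x"], piece["y"], deg=1)
    return pd.Series({"slope": slope, "intercept": intercept})


# One row of fitted parameters per year
params = toy.groupby("year")[["x", "y"]].apply(fit_line)

print(params)
```

Because the function returns a Series with the same labels for every group, `apply` stacks the results into a DataFrame with one row per year.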


# 10.4 Group Transforms and "Unwrapped" GroupBys

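`transform` is another built-in method. It is similar to `apply` but imposes more constraints on the function you pass:
- It can produce a scalar value to be broadcast to the shape of the group
- It can produce an object of the same shape as the input group
- It must not mutate its input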

In [None]:
df = pd.DataFrame({'key':['a', 'b', 'c'] * 4, 'value':np.arange(12.)})
g = df.groupby('key')
g.mean()



In [None]:
# Produce an object of the same shape as df['value'] but with values replaced by the average grouped by 'key'

def get_mean(group):
	return group.mean()

g.transform(get_mean)

g.apply(get_mean)

def times_two(group):
	return group * 2

g.transform(times_two)

# Compute the ranks in descending order for each group 
def get_ranks(group):
	return group.rank(ascending=False)

g.transform(get_ranks)

g.rank(ascending=False)

In [None]:
# a group transformation function composed from simple aggregations
def normalize(x):
	return (x - x.mean()) / x.std()

g.transform(normalize)

# Use built-in aggregate functions
g.transform('mean')

# unwrapped group operation
# Doing arithmetic between the outputs of multiple GroupBy operations
# Instead of writing a function and passing it to groupby(...).apply
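# g is a DataFrameGroupBy, so select 'value' to get Series that align with df['value']
normalized = (df['value'] - g['value'].transform('mean')) / g['value'].transform('std')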
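
A quick way to convince yourself that the unwrapped form matches the function-based one is to compute both and compare. The sketch below rebuilds the same kind of frame under new names (`toy`, `values_by_key`) so it can run on its own.

```python
import numpy as np
import pandas as pd

# Same shape of data as above: three keys repeated four times
toy = pd.DataFrame({"key": ["a", "b", "c"] * 4, "value": np.arange(12.0)})
values_by_key = toy.groupby("key")["value"]


def normalize(x):
    return (x - x.mean()) / x.std()


# Function-based transform
wrapped = values_by_key.transform(normalize)

# Unwrapped: arithmetic between the outputs of two built-in transforms
unwrapped = (toy["value"] - values_by_key.transform("mean")) / values_by_key.transform("std")

# After normalization, each group should have mean 0 and standard deviation 1
group_means = unwrapped.groupby(toy["key"]).mean()
group_stds = unwrapped.groupby(toy["key"]).std()

print(np.allclose(wrapped, unwrapped))
```

The unwrapped version does more group aggregations overall, but each one uses a built-in method, which is often faster than calling a Python function per group.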

# 10.5 Pivot Tables and Cross-tabulation

A pivot table is a data summarization tool frequently found in spreadsheet programs and other data analysis software. 

It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns.
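
In pandas terms, a pivot table with one row key and one column key gives the same numbers as a `groupby` on both keys followed by `unstack`. A minimal sketch on made-up data (`toy` is hypothetical, loosely modeled on the tips columns):

```python
import pandas as pd

# Hypothetical data: two keys and one numeric column
toy = pd.DataFrame(
    {
        "day": ["Fri", "Fri", "Sat", "Sat", "Sat"],
        "smoker": ["No", "Yes", "No", "No", "Yes"],
        "tip": [1.0, 2.0, 3.0, 5.0, 4.0],
    }
)

# day along the rows, smoker along the columns, mean tip in the cells
pivoted = toy.pivot_table(index="day", columns="smoker", values="tip", aggfunc="mean")

# The same rectangle built from groupby and unstack
grouped_then_unstacked = toy.groupby(["day", "smoker"])["tip"].mean().unstack()

print(pivoted)
```

`pivot_table` adds conveniences on top of this, such as `margins` and `fill_value`.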



In [38]:
tips = pd.read_csv("./datasets/tips.csv")

tips["tip_pct"] =  tips["tip"] / tips['total_bill']

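# Restrict to numeric columns; recent pandas versions refuse to average the string column 'time'
tips.pivot_table(index=["day", "smoker"], values=["size", "tip", "tip_pct", "total_bill"])
# Same as
tips.groupby(["day", "smoker"])[["size", "tip", "tip_pct", "total_bill"]].mean()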

# Take the average of only tip_pct and size  and additionally group by time. 
# Put smoker in the table columns and time and day in the rows
tips.pivot_table(index=["time", "day"], columns="smoker", values=["tip_pct", "size"])

# Passing margins=True adds 'All' row and column labels;
# the 'All' values are means that do not take smoker versus nonsmoker into account
tips.pivot_table(index=["time", "day"], columns="smoker", values=["tip_pct", "size"], margins=True)

# To use an aggregation function other than mean, pass it to aggfunc keyword argument
tips.pivot_table(index=['time', 'smoker'], columns="day", values="tip_pct", aggfunc=len, margins=True)

# Use the fill_value=0 argument to fill in NA values



Output (columns are `day` values):

| time | smoker | Fri | Sat | Sun | Thur | All |
| - | - | - | - | - | - | - |
| Dinner | No | 3.0 | 45.0 | 57.0 | 1.0 | 106 |
| Dinner | Yes | 9.0 | 42.0 | 19.0 | | 70 |
| Lunch | No | 1.0 | | | 44.0 | 45 |
| Lunch | Yes | 6.0 | | | 17.0 | 23 |
| All | | 19.0 | 87.0 | 76.0 | 62.0 | 244 |


## Cross-Tabulations: Crosstab

A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies.
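
Since a crosstab only counts how often each combination of keys occurs, the same table can be built from `groupby(...).size()` followed by `unstack`. A sketch on made-up data (`toy` and its column names are hypothetical):

```python
import pandas as pd

# Hypothetical survey data
toy = pd.DataFrame(
    {
        "nationality": ["USA", "Japan", "USA", "Japan", "Japan"],
        "handedness": ["Right", "Left", "Right", "Right", "Left"],
    }
)

# Frequency table: nationality along the rows, handedness along the columns
counts = pd.crosstab(toy["nationality"], toy["handedness"])

# The same counts from group sizes; combinations that never occur become 0
same_counts = (
    toy.groupby(["nationality", "handedness"]).size().unstack(fill_value=0)
)

print(counts)
```

`crosstab` fills absent combinations with 0 on its own, which is why `fill_value=0` is needed on the `unstack` side to match.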

In [42]:
from io import StringIO

data = """Sample Nationality Handedness
1 USA Right-handed
2 Japan Left-handed
3 USA Right-handed
4 Japan Right-handed
5 Japan Left-handed
6 Japan Right-handed
7 USA Right-handed
"""

# Raw string avoids an invalid-escape warning for the regex separator
data = pd.read_table(StringIO(data), sep=r"\s+")

In [46]:
# The first two arguments to crosstab can each be an array, a Series, or a list of arrays
pd.crosstab(data['Nationality'], data['Handedness'], margins=True)

pd.crosstab([tips['time'], tips['day']], tips['smoker'], margins=True)

Output (columns are `smoker` values):

| time | day | No | Yes | All |
| - | - | - | - | - |
| Dinner | Fri | 3 | 9 | 12 |
| Dinner | Sat | 45 | 42 | 87 |
| Dinner | Sun | 57 | 19 | 76 |
| Dinner | Thur | 1 | 0 | 1 |
| Lunch | Fri | 1 | 6 | 7 |
| Lunch | Thur | 44 | 17 | 61 |
| All | | 151 | 93 | 244 |
