# Components of a groupby operation

In [1]:
from IPython.display import IFrame

In [2]:
IFrame('http://etc.ch/Qiup', 400, 300)

In [3]:
IFrame('https://directpoll.com/r?XDbzPBd3ixYqg8fScFM4ITaST2iNt4c8vN1EdJdK', 400, 300)

# The three components of every groupby aggregation
Every groupby operation has three components. Knowing them helps you understand the syntax:

* **Grouping columns** - The unique values of these columns form independent groups
* **Aggregating columns** - The values in these columns are aggregated into a single value
* **Aggregating functions** - These functions are applied independently to each aggregating column of each group

The syntax looks something like this:

```
>>> df.groupby(['grouping', 'columns'])[['aggregating', 'columns']].agg(['aggregating', 'functions'])
```

Other syntaxes exist, but every groupby aggregation has these three components.
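As a minimal sketch of the pattern above, the following uses a small made-up DataFrame. The column names `state`, `relaffil`, `math` and `verbal` are hypothetical stand-ins and are not taken from the college dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
df = pd.DataFrame({
    'state': ['TX', 'TX', 'CA', 'CA'],
    'relaffil': [0, 1, 0, 0],
    'math': [500, 600, 550, 650],
    'verbal': [480, 620, 530, 610],
})

result = (
    df.groupby(['state', 'relaffil'])   # grouping columns
      [['math', 'verbal']]              # aggregating columns (note the list)
      .agg(['mean', 'max'])             # aggregating functions
)
print(result)
```

The result has one row per unique (`state`, `relaffil`) pair and one column per (aggregating column, aggregating function) pair, so both the index and the columns are a MultiIndex.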

### A fairly simple groupby
Let's do an exercise to get us started.

In [None]:
import pandas as pd
import numpy as np

In [None]:
college = pd.read_csv('../data/college.csv')
college.head()

### Exercise 1
<span style="color:green; font-size:16px">Find the average and max SAT Math and Verbal scores by state and religious affiliation.</span>

In [None]:
# your code here

Copy and paste the solution from the Solutions notebook into the next cell.

Side note: It's possible to create 'exercise' cells with [nbextensions](https://github.com/ipython-contrib/jupyter_contrib_nbextensions).

In [None]:
# copy solution here

In [None]:
state_sat.head(10)

## Flattening a MultiIndex

### Many options available to go back to a single level index

* Rename manually with a list
* Concatenation of level values
* Swift `map` method

In [None]:
state_sat.columns.get_level_values(0) + '_' + state_sat.columns.get_level_values(1)

#### Swift Index `map` method

Let's see a simple example in pure Python first.

In [None]:
t = ('first', 'second')

In [None]:
'some phrase {0}'.format(t)

In [None]:
'some phrase {0[0]}'.format(t)

In [None]:
'some phrase {0[0]} - {0[1]}'.format(t)

Let's use this idea with the **`map`** Index method.

In [None]:
state_sat.columns.map('{0[0]}_{0[1]}'.format)

Or like this:

In [None]:
state_sat.columns.map('_'.join)
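The three options can be compared side by side on a stand-in for `state_sat`. This is only a sketch: the two-level columns below are made up, and the real `state_sat` comes from the Exercise 1 solution.

```python
import pandas as pd

# Hypothetical two-level columns, similar in shape to an agg result
columns = pd.MultiIndex.from_product([['math', 'verbal'], ['mean', 'max']])
toy = pd.DataFrame([[500.0, 600, 480.0, 620]], columns=columns)

# Option 1: concatenate the level values
by_concat = columns.get_level_values(0) + '_' + columns.get_level_values(1)

# Option 2: map a format method over each (level 0, level 1) tuple
by_format = columns.map('{0[0]}_{0[1]}'.format)

# Option 3: map str.join over each tuple
by_join = columns.map('_'.join)

# Assign any of the results back to get single-level columns
toy.columns = by_join
print(list(toy.columns))
```

All three produce the same labels here because every level value is a string.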

### Exercise 2
<span style="color:green; font-size:16px">Why would we ever use the method with **`map`** and **`format`** when **`join`** is more straightforward? Turn **`state_sat`** into a DataFrame with a single-level index and single-level columns.</span>

In [None]:
# your code here

# `agg` vs `apply` on a groupby object

In [4]:
IFrame('http://etc.ch/Xig7', 400, 300)

In [5]:
IFrame('https://directpoll.com/r?XDbzPBd3ixYqg8FSFbM44x1kSxKzcBd8htg8WwN5Vx', 400, 300) 

Each function passed to **`agg`** must return a single value. Each aggregating column of each group is passed into the function as a Series, so the function cannot 'see' any other column.

The groupby **`apply`** method can return a single value, a Series or a DataFrame. You must supply a custom function to **`apply`**. This custom function accepts the entire group as a **`DataFrame`**. 
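The difference can be sketched on toy data. The DataFrame and both helper functions below are hypothetical and only illustrate the idea: the function given to `agg` works on one column at a time, while the function given to `apply` can combine several columns of the group.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only
toy = pd.DataFrame({
    'state': ['TX', 'TX', 'CA', 'CA'],
    'math': [500, 600, 550, 650],
    'ugds': [1000, 3000, 2000, 2500],
})

def value_range(s):
    # Called once per column per group; only that column is visible
    return s.max() - s.min()

def math_of_largest_school(group):
    # Called once per group; needs two columns at the same time
    return group.loc[group['ugds'].idxmax(), 'math']

grouped = toy.groupby('state')[['math', 'ugds']]
ranges = grouped.agg(value_range)               # DataFrame: one value per column per group
largest = grouped.apply(math_of_largest_school) # Series: one value per group

print(ranges)
print(largest)
```

`math_of_largest_school` could not be written for `agg`, because it has to look up `math` using information from `ugds`.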

### Simple examples to see how the groupby `apply` works

In [None]:
def return_single(x):
    return 'a single value'

def return_series(x):
    return pd.Series(data=['value 1', 'value 2'], index=['col A', 'col B'])

def return_df(x):
    return pd.DataFrame(np.random.rand(3,2), 
                        index=['row one', 'row two', 'row three'], 
                        columns=['col A', 'col B'])

In [None]:
college.groupby(['STABBR', 'RELAFFIL']).apply(return_single).head(10)

In [None]:
college.groupby(['STABBR', 'RELAFFIL']).apply(return_series).head(10)

In [None]:
college.groupby(['STABBR', 'RELAFFIL']).apply(return_df).head(20)

### Exercise 3
<span style="color:green; font-size:16px">Verify that the object passed to the custom function in **`apply`** is a DataFrame.</span>

In [None]:
# your code here

### Exercise 4
<span style="color:green; font-size:16px">Calculate the average SAT Math score per state, weighted by undergraduate population.</span>

In [None]:
# your code here

## Can we calculate the weighted average without `apply`?

In [None]:
college_drop = college[['STABBR', 'SATMTMID', 'UGDS']].dropna()

In [None]:
college_drop['MATH_WT'] = college_drop['SATMTMID'] * college_drop['UGDS']
college_drop.head()

In [None]:
c1 = college_drop.groupby('STABBR')[['MATH_WT', 'UGDS']].agg('sum')
c1.head()

In [None]:
(c1['MATH_WT'] / c1['UGDS']).astype(int).head()

### Which way is faster?

In [None]:
%%timeit 
college_drop['MATH_WT'] = college_drop['SATMTMID'] * college_drop['UGDS']
c1 = college_drop.groupby('STABBR')[['MATH_WT', 'UGDS']].agg('sum')
(c1['MATH_WT'] / c1['UGDS']).astype(int).head()

In [None]:
def calc_wa(df):
    wa =  (df['SATMTMID'] * df['UGDS']).sum() / df['UGDS'].sum()
    return wa.astype(int)

In [None]:
%timeit college_drop.groupby('STABBR').apply(calc_wa).head(10)
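Before comparing timings, it helps to confirm that the two approaches give the same answer. The sketch below does this on a small made-up DataFrame that reuses the column names `STABBR`, `SATMTMID` and `UGDS`. The values are invented, and the `astype(int)` step is left out so the raw averages can be compared.

```python
import pandas as pd

# Hypothetical values; only the column names match the college data
toy = pd.DataFrame({
    'STABBR': ['TX', 'TX', 'CA', 'CA'],
    'SATMTMID': [500.0, 600.0, 550.0, 650.0],
    'UGDS': [1000.0, 3000.0, 2000.0, 2000.0],
})

# Approach 1: pre-calculate the weighted column, then use agg
toy['MATH_WT'] = toy['SATMTMID'] * toy['UGDS']
sums = toy.groupby('STABBR')[['MATH_WT', 'UGDS']].agg('sum')
wa_agg = sums['MATH_WT'] / sums['UGDS']

# Approach 2: custom function with apply
def calc_wa(df):
    return (df['SATMTMID'] * df['UGDS']).sum() / df['UGDS'].sum()

wa_apply = toy.groupby('STABBR')[['SATMTMID', 'UGDS']].apply(calc_wa)

print(wa_agg)
print(wa_apply)
```

In the first approach the multiplication runs once over the whole column, whereas `apply` calls the Python function once per group.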

# Keeping tab completion

Tab completion is an extremely useful feature. Depending on the completion backend (jedi, for example), it can disappear when you chain methods together.

In [None]:
college[['STABBR', 'SATMTMID', 'UGDS']].dropna().<press tab>

To work around this, save intermediate steps to a variable.
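A minimal sketch of the workaround, using a made-up stand-in for the college data (the name `toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the college DataFrame
toy = pd.DataFrame({
    'STABBR': ['TX', 'CA', 'NY'],
    'SATMTMID': [500.0, np.nan, 620.0],
    'UGDS': [1000.0, 2000.0, 1500.0],
})

# Save the intermediate result instead of chaining further
subset = toy[['STABBR', 'SATMTMID', 'UGDS']].dropna()

# In a notebook, typing `subset.` and pressing tab now completes
# against a concrete DataFrame object
print(subset)
```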

### Press shift + tab + tab for help

# Summary
* Know the three components of a groupby aggregation - grouping columns, aggregating columns, aggregating functions
* Flatten a MultiIndex with the **`map`** method
* The groupby **`agg`** functions implicitly get passed a Series and return a single value
* The groupby **`apply`** functions implicitly get passed a DataFrame and can return a single value, Series or DataFrame
* Pre-calculate a column to avoid **`apply`** and get better performance