In [3]:
import seaborn as sns
import pandas as pd
import numpy as np

# Pandas Group Operations

Let's next go over grouped operations with pandas. This part of the pandas library has less feature bloat than others, which is nice, and the community is converging on a few operations that are core to grouped work. We'll go over these operations, with particular emphasis on groupby and agg:

* groupby
* agg
* filter
* transform

Check out the full documentation [here](http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html), but be warned it is a bit long :)

Let's start with our good old tips dataset:

In [4]:
tips = sns.load_dataset('tips')
tips.head(3)

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


### Groupby

A grouped operation starts by specifying the groups of data we want to operate over. There are many ways to make groups, but the tool pandas provides for this is `groupby`:

In [5]:
tips_gb = tips.groupby(['sex', 'smoker'])
tips_gb

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11ff27d30>

You give `groupby` one or more columns. Pandas then looks through your data for every unique combination of values in those columns, and each unique combination becomes a group. So in this case we will have four groups: male smoker, female smoker, male non-smoker, female non-smoker.
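To make the idea of "one group per unique combination" concrete, here is a minimal sketch on a tiny hand-made frame. The `toy` DataFrame and its values are made up for illustration and are not part of the tips dataset; `ngroups`, `size` and `get_group` are standard groupby attributes and methods.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tips data (values are made up).
toy = pd.DataFrame({
    'sex':    ['Male', 'Male', 'Female', 'Female', 'Male'],
    'smoker': ['Yes', 'No', 'Yes', 'No', 'No'],
    'tip':    [1.0, 2.0, 3.0, 4.0, 6.0],
})

toy_gb = toy.groupby(['sex', 'smoker'])

# Number of unique (sex, smoker) combinations found in the data
n_groups = toy_gb.ngroups

# Number of rows in each group, indexed by the group key
group_sizes = toy_gb.size()

# Pull out the rows that belong to one specific group
male_non_smokers = toy_gb.get_group(('Male', 'No'))

print(n_groups)
print(group_sizes)
print(male_non_smokers)
```

Nothing is computed per group until you ask for it, which is why the groupby object on its own only prints as an opaque object.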

The groupby object by itself is not super important.

Once we have these groups (specified in the groupby object), we can do three types of operations on them: agg, filter, and transform, with agg being the most important.

### Agg

The aggregate operation reduces all the data in each group to a single value per aggregation. You use a dictionary that maps each column to the aggregation (or list of aggregations) you'd like. In the example below we ask for the mean and the min of the tip column, the first value of day, and the number of rows (via `size` on total_bill) for each group:

In [6]:
tips_agg = tips_gb.agg({
    'tip': ['mean', 'min'],
    'day': 'first',
    'total_bill': 'size'
})

tips_agg

,,tip,tip,day,total_bill
,,mean,min,first,size
sex,smoker,,,,
Male,Yes,3.051167,1.0,Sat,60
Male,No,3.113402,1.25,Sun,97
Female,Yes,2.931515,1.0,Sat,33
Female,No,2.773519,1.0,Sun,54


Notice that we get a multi-index for both the index and the columns. We can always get rid of the row multi-index with a `reset_index` (see [indexing and selecting](https://github.com/knathanieltucker/pandas-tutorial/blob/master/notebooks/Indexing%20and%20Selecting.ipynb) for more details):

In [7]:
tips_agg.reset_index()

,sex,smoker,tip,tip,day,total_bill
,,,mean,min,first,size
0,Male,Yes,3.051167,1.0,Sat,60
1,Male,No,3.113402,1.25,Sun,97
2,Female,Yes,2.931515,1.0,Sat,33
3,Female,No,2.773519,1.0,Sun,54
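As an alternative to calling `reset_index` afterwards, `groupby` accepts `as_index=False`, which keeps the group keys as ordinary columns. The sketch below uses a made-up `toy` frame (hypothetical values, not the tips data) and a single aggregation per column so the result has flat columns.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tips data (values are made up).
toy = pd.DataFrame({
    'sex':    ['Male', 'Male', 'Female', 'Female', 'Male'],
    'smoker': ['Yes', 'No', 'Yes', 'No', 'No'],
    'tip':    [1.0, 2.0, 3.0, 4.0, 6.0],
})

# as_index=False leaves 'sex' and 'smoker' as regular columns
# instead of moving them into a row multi-index.
flat = toy.groupby(['sex', 'smoker'], as_index=False).agg({'tip': 'mean'})

print(flat)
```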


We can use either stacking or our column trick to get rid of the multi-level columns:

In [8]:
# before
tips_agg.columns

MultiIndex(levels=[['tip', 'day', 'total_bill'], ['first', 'mean', 'min', 'size']],
           codes=[[0, 0, 1, 2], [1, 2, 0, 3]])

In [9]:
tips_agg.stack()

,,,tip,day,total_bill
sex,smoker,,,,
Male,Yes,first,,Sat,
Male,Yes,mean,3.051167,,
Male,Yes,min,1.0,,
Male,Yes,size,,,60.0
Male,No,first,,Sun,
Male,No,mean,3.113402,,
Male,No,min,1.25,,
Male,No,size,,,97.0
Female,Yes,first,,Sat,
Female,Yes,mean,2.931515,,


In [10]:
tips_agg.columns = ['__'.join(col).strip() for col in tips_agg.columns.values]
tips_agg.columns

Index(['tip__mean', 'tip__min', 'day__first', 'total_bill__size'], dtype='object')

In [11]:
tips_agg

,,tip__mean,tip__min,day__first,total_bill__size
sex,smoker,,,,
Male,Yes,3.051167,1.0,Sat,60
Male,No,3.113402,1.25,Sun,97
Female,Yes,2.931515,1.0,Sat,33
Female,No,2.773519,1.0,Sun,54
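If you would rather avoid multi-level columns in the first place, pandas 0.25 and later also supports named aggregation, where each keyword argument names an output column and maps it to a `(column, function)` pair. This is a sketch on a made-up `toy` frame; the output names `tip_mean`, `tip_min` and `tip_count` are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tips data (values are made up).
toy = pd.DataFrame({
    'sex':    ['Male', 'Male', 'Female', 'Female', 'Male'],
    'smoker': ['Yes', 'No', 'Yes', 'No', 'No'],
    'tip':    [1.0, 2.0, 3.0, 4.0, 6.0],
})

# Each keyword is the output column name; the tuple is (input column, function).
named = toy.groupby(['sex', 'smoker']).agg(
    tip_mean=('tip', 'mean'),
    tip_min=('tip', 'min'),
    tip_count=('tip', 'count'),
)

print(named)
```

The columns come out flat, so no joining of column levels is needed afterwards.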


That is about it for aggregation. You can find some common aggregation functions listed [here](http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#aggregation).

### Filter

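The next common group operation is a filter. This one is pretty simple: we drop every group that doesn't meet our criteria. The function you pass is called once per group and returns True or False, so each group is kept or dropped as a whole.

For example, let's look only at the least busy times the place is open. One way to do that is to exclude every day and time combination whose total party size is at or above the median: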

In [53]:
# we use the exact same groupby syntax
tips_gb = tips.groupby(['day', 'time'])

In [54]:
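# .iloc[0] selects by position; plain [0] on a label-indexed Series is deprecated in recent pandas
median_size = tips_gb.agg({'size': 'sum'}).median().iloc[0]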

In [56]:
# notice that we carved out quite a few rows
tips_gb.filter(lambda group: group['size'].sum() < median_size).head()

,total_bill,tip,sex,smoker,day,time,size
90,28.97,3.0,Male,Yes,Fri,Dinner,2
91,22.49,3.5,Male,No,Fri,Dinner,2
92,5.75,1.0,Female,Yes,Fri,Dinner,2
93,16.32,4.3,Female,Yes,Fri,Dinner,2
94,22.75,3.25,Female,No,Fri,Dinner,2


That's honestly about it. I don't use this functionality too much, but it's pretty simple and I don't think it complicates things too much, so may as well throw it in.
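To see that filter keeps or drops whole groups while leaving the surviving rows untouched, here is a small sketch on made-up data. The `toy` frame and the threshold of 6 are hypothetical and chosen only so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical data: total party size is 5 on Thur, 2 on Fri and 9 on Sat.
toy = pd.DataFrame({
    'day':  ['Thur', 'Thur', 'Fri', 'Sat', 'Sat', 'Sat'],
    'size': [2, 3, 2, 4, 2, 3],
})

# The lambda receives one group's DataFrame and returns a single boolean.
# Groups where it returns False are removed entirely.
kept = toy.groupby('day').filter(lambda group: group['size'].sum() < 6)

print(kept)
```

The result has the original columns and the original row labels; only the rows belonging to the Sat group are gone.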

### Transform

The final group operation is transform. This uses group information to apply transformations to individual data points. For example, below we divide each bill and tip by the average bill and tip for that day. That way we can see how much each bill differs from its day's average:

In [57]:
tips_gb = tips.groupby(['day'])

In [58]:
tips_gb[['total_bill', 'tip']].transform(lambda x: x / x.mean()).head()

,total_bill,tip
0,0.793554,0.310279
1,0.482952,0.509964
2,0.981317,1.075225
3,1.106025,1.016856
4,1.148529,1.109018


I think I have only ever used this function for normalization, but it is pretty straightforward and intuitive, so I'm fine with the added flexibility.
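The key property of transform is that its output has one value per original row, aligned with the original index, so it can be assigned straight back as a new column. A minimal sketch on made-up data (the `toy` frame and the column names `day_mean` and `relative_bill` are hypothetical):

```python
import pandas as pd

# Hypothetical data: the mean bill is 20.0 on both days.
toy = pd.DataFrame({
    'day':        ['Sat', 'Sat', 'Sun', 'Sun'],
    'total_bill': [10.0, 30.0, 20.0, 20.0],
})

# transform('mean') broadcasts each group's mean back to every row of that group
toy['day_mean'] = toy.groupby('day')['total_bill'].transform('mean')

# Same normalization as the lambda version, written as two explicit steps
toy['relative_bill'] = toy['total_bill'] / toy['day_mean']

print(toy)
```

Compare this with agg, which would return one row per day rather than one row per bill.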

## Conclusion

This is about it for understanding pandas group operations. As always, check out some of the [exercises on this topic](https://github.com/guipsamora/pandas_exercises#grouping); you should be able to do them with ease.

As a final note, understanding the groupby and agg functions is critical to using pandas effectively. The transform and filter are nice, but you could probably get by without them.