# Aggregation and grouping
For this exercise we'll use the Titanic survivor data.

Download the file from http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls and read it into a pandas dataframe.

In [9]:
import pandas as pd
import numpy as np

In [10]:
df = pd.read_excel('titanic3.xls', index_col=None)
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


Column description:
- survived: Survival (0 = no; 1 = yes)
- pclass: Passenger class (1 = first; 2 = second; 3 = third)
- name: Name
- sex: Sex
- age: Age
- sibsp: Number of siblings/spouses aboard
- parch: Number of parents/children aboard
- ticket: Ticket number
- fare: Passenger fare
- cabin: Cabin
- embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- boat: Lifeboat (if survived)
- body: Body number (if did not survive and body was recovered)

A large dataset is great, but to derive insights we need to summarize or reduce it in some way. We've already discussed various functions that do this: mean, sum, etc. Pandas has a convenience method, `describe`, that computes the most common summary statistics of every numeric column in one line of code. Try it now on the Titanic dataset.

In [11]:
df.describe()

Unnamed: 0,pclass,survived,age,sibsp,parch,fare,body
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.294882,0.381971,29.881135,0.498854,0.385027,33.295479,160.809917
std,0.837836,0.486055,14.4135,1.041658,0.86556,51.758668,97.696922
min,1.0,0.0,0.1667,0.0,0.0,0.0,1.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958,72.0
50%,3.0,0.0,28.0,0.0,0.0,14.4542,155.0
75%,3.0,1.0,39.0,1.0,0.0,31.275,256.0
max,3.0,1.0,80.0,8.0,9.0,512.3292,328.0


While useful, this summary hides what is happening inside the data. Let's dig into the differences based on cabin class (the pclass column). To do this we will use the `groupby` method, which splits the dataframe into groups of rows, one group per distinct value of a given criterion.

In [12]:
# run forest run
df_gby_class = df.groupby('pclass')
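To see what the split looks like, here is a minimal sketch on a tiny made-up frame (the `toy` data is hypothetical and not taken from the Titanic file). Iterating over a groupby object yields one `(key, sub-dataframe)` pair per distinct value of the grouping column.

```python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'pclass': [1, 1, 2, 3, 3, 3],
    'survived': [1, 0, 1, 0, 0, 1],
})

toy_gby = toy.groupby('pclass')

# number of rows in each group
group_sizes = toy_gby.size()
print(group_sizes)

# iterating yields (key, sub-dataframe) pairs
for key, group in toy_gby:
    print(key, group.shape)
```

Nothing is computed on the groups until you ask for it (a reduction, an iteration, `get_group`, and so on), so creating the groupby object itself is cheap.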

After grouping we can access an individual group with `get_group`. Compute the mean survival rate for classes 1 and 3.

In [17]:
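# numeric_only=True skips text columns such as name, which recent pandas versions refuse to average
print(df_gby_class.get_group(1).mean(numeric_only=True))
print(df_gby_class.get_group(3).mean(numeric_only=True))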

pclass        1.000000
survived      0.619195
age          39.159918
sibsp         0.436533
parch         0.365325
fare         87.508992
body        162.828571
dtype: float64
pclass        3.000000
survived      0.255289
age          24.816367
sibsp         0.568406
parch         0.400564
fare         13.302889
body        155.818182
dtype: float64


But that's not where groupby shines. It shines when we use it to
1. split the data into groups
2. perform an action on each group
3. recombine the results into a dataframe
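The three steps above can be sketched by hand on a small made-up frame (the `toy` data and the variable names are assumptions for illustration). The last line shows the equivalent single groupby expression.

```python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'pclass': [1, 1, 2, 3, 3, 3],
    'survived': [1, 0, 1, 0, 0, 1],
})

# 1. split: one sub-dataframe per class
pieces = {key: group for key, group in toy.groupby('pclass')}

# 2. apply: compute the survival rate of each piece
rates = {key: group['survived'].mean() for key, group in pieces.items()}

# 3. combine: gather the per-group results into one Series
by_hand = pd.Series(rates, name='survived').rename_axis('pclass')

# the same computation as a single groupby expression
one_liner = toy.groupby('pclass')['survived'].mean()

print(by_hand)
print(one_liner)
```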

## Aggregation
Aggregation lets us apply several summary functions at once and present the results in a single dataframe, with one row per group and one column per function. Check this out:

In [18]:
# run forest run
df_gby_class['survived'].aggregate([np.min, np.max, np.mean, np.median])

pclass,amin,amax,mean,median
1,0,1,0.619195,1
2,0,1,0.429603,0
3,0,1,0.255289,0
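As a hedged variant, reductions can also be requested by name as strings, and named aggregation lets you pick the output column names yourself. The sketch below uses a made-up `toy` frame, and the output names `survival_rate` and `mean_fare` are arbitrary choices for illustration.

```python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'pclass': [1, 1, 2, 3, 3, 3],
    'survived': [1, 0, 1, 0, 0, 1],
    'fare': [80.0, 100.0, 20.0, 8.0, 7.0, 9.0],
})

# string names select the built-in reductions
summary = toy.groupby('pclass')['survived'].agg(['min', 'max', 'mean'])

# named aggregation: output_name=(source_column, reduction)
named = toy.groupby('pclass').agg(
    survival_rate=('survived', 'mean'),
    mean_fare=('fare', 'mean'),
)

print(summary)
print(named)
```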


## Filtration
You can drop whole groups by calling `filter` with a function that takes a group's dataframe and returns a boolean. Groups for which the function returns `False` are removed. Behold:

In [43]:
# run forest run
print(df.groupby('pclass').std(numeric_only=True))  # let's compute the std of each numeric column per group

# now let's get rid of any group with a std of less than 0.7 in the number of parents and/or children on board
# this will remove pclass 2

def filt_parch_07(x):
    return x.parch.std() >= 0.7

# filtering
filt07 = df.groupby('pclass').filter(filt_parch_07)

# let's check that pclass 2 was indeed removed
filt07.pclass.unique()

        survived        age     sibsp     parch       fare        body
pclass                                                                
1       0.486338  14.548059  0.609064  0.715602  80.447178   82.652172
2       0.495915  13.638628  0.590100  0.692717  13.607122  107.077753
3       0.436331  11.958202  1.299681  0.981639  11.494358  102.403720


array([1, 3])
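Note that `filter` returns the original rows of the surviving groups, not one row per group. A minimal sketch on made-up data (the `toy` frame and the helper `has_several_passengers` are hypothetical names used only here):

```python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'pclass': [1, 1, 2, 3, 3, 3],
    'fare': [80.0, 100.0, 20.0, 8.0, 7.0, 9.0],
})

def has_several_passengers(group):
    # keep a group only if it contains at least 2 rows
    return len(group) >= 2

kept = toy.groupby('pclass').filter(has_several_passengers)
print(kept)
```

Class 2 has a single row, so it is dropped, while every row of classes 1 and 3 is kept unchanged.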

Now group by the number of parents and/or children on board and remove the groups whose average age is greater than 30.

In [51]:
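def filt_parch_08(x):
    # keep a group only if its mean age is at most 30
    return x.age.mean() <= 30

# filtering, then the per-group means of the numeric columns
filt08 = df.groupby('parch').filter(filt_parch_08).groupby('parch').mean(numeric_only=True)
print(filt08)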

         pclass  survived        age     sibsp       fare   body
parch                                                           
1      2.158824  0.588235  24.965625  1.029412  50.078358  161.9
2      2.300885  0.504425  18.975945  1.902655  61.346275  118.8


`groupby` also lets you:
1. transform - apply the same transformation to every column in each group
2. apply - apply an arbitrary function to each group

See the example:

In [49]:
# run forest run
# shift every numeric column down by 2, group by group;
# text columns are left out because x - 2 is undefined for strings
df.select_dtypes('number').groupby('pclass').transform(lambda x: x - 2).head()
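To contrast the two methods, here is a small sketch on made-up data (the `toy` frame and the helper `fare_range` are hypothetical). `transform` returns one value per original row, while the function passed to `apply` may return anything, here one scalar per group.

```python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'pclass': [1, 1, 2, 3, 3, 3],
    'fare': [80.0, 100.0, 20.0, 8.0, 7.0, 9.0],
})

# transform: same length as the input (fare minus the mean fare of its class)
centered = toy.groupby('pclass')['fare'].transform(lambda x: x - x.mean())

def fare_range(group):
    # hypothetical helper: gap between the highest and lowest fare in a group
    return group.max() - group.min()

# apply: one result per group
spread = toy.groupby('pclass')['fare'].apply(fare_range)

print(centered)
print(spread)
```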

Now group by the number of parents and/or children and, for each group, compute the mean age divided by the mean fare (in that group).

**NOTE**: don't use a lambda function.

In [57]:
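# group once, keep only the two columns we need, then divide the group means
parch_means = df.groupby('parch')[['age', 'fare']].mean()
parch_means['age'] / parch_means['fare']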

parch
0    1.214190
1    0.498531
2    0.309325
3    0.448445
4    0.454530
5    1.215633
6    0.884861
9         NaN
dtype: float64
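The cell above divides the group means. That is not the same quantity as averaging each passenger's own age/fare ratio, as this sketch on made-up numbers suggests (the `toy` frame is hypothetical):

```python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'parch': [0, 0, 1, 1],
    'age': [20.0, 40.0, 10.0, 30.0],
    'fare': [10.0, 40.0, 5.0, 5.0],
})

means = toy.groupby('parch')[['age', 'fare']].mean()

# ratio of the group means (what the cell above computes)
ratio_of_means = means['age'] / means['fare']

# mean of the per-passenger ratios (a different quantity in general)
mean_of_ratios = (toy['age'] / toy['fare']).groupby(toy['parch']).mean()

print(ratio_of_means)
print(mean_of_ratios)
```

The two agree for `parch == 1`, where every fare is identical, and differ for `parch == 0`.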