<div style="color:#006666; padding:0px 10px; border-radius:5px; font-size:18px;"><h1 style='margin:10px 5px'>Grouping and Aggregation</h1>
</div>

© Copyright Machine Learning Plus

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>1. GroupBy Mechanism: Split-Apply-Combine</h2>
</div>

Grouping and aggregation lets you summarize a numerical variable separately for each level of a categorical variable.

__When to use__

Suppose you have a categorical variable (`Pclass`) and a numerical variable (`Fare`), and you want to know the mean fare for each class.

The same approach works when you have more than one categorical (or numerical) variable.

__How it works:__ 

__Split -> Apply -> Combine__


![image.png height](attachment:image.png)

Source: Stackoverflow
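
To make the three steps concrete, here is a minimal sketch that performs split, apply and combine by hand and then compares the result with the `groupby` one-liner. The `toy` dataframe is made up for illustration and is not the Titanic file.

```python
import pandas as pd

# Hypothetical toy data (not the Titanic file): two classes, a few fares
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 2],
    "Fare": [80.0, 100.0, 20.0, 30.0, 10.0],
})

# Split: one sub-frame per distinct Pclass value
pieces = {key: part for key, part in toy.groupby("Pclass")}

# Apply: compute the mean fare inside each piece
means = {key: part["Fare"].mean() for key, part in pieces.items()}

# Combine: assemble the per-group results into a single Series
manual = pd.Series(means, name="Fare").rename_axis("Pclass")

# The one-liner that does all three steps at once
auto = toy.groupby("Pclass")["Fare"].mean()

print(manual)
print(auto)
```

Both Series hold one mean fare per class, which is what the aggregation cells in this section compute on the real data.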

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv('Datasets/Titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


__Task__

Compute the mean survival rate for each Class (`Pclass`)

In [None]:
df.groupby('Pclass').agg({'Survived': 'mean'})

Pclass,Survived
1,0.62963
2,0.472826
3,0.242363


First-class passengers had the highest survival rate (about 63%), followed by second class (47%) and third class (24%).

Compute the total number of survivors in each class as well.

In [None]:
df.groupby('Pclass').agg({'Survived': ['mean', 'sum']})

Pclass,Survived (mean),Survived (sum)
1,0.62963,136
2,0.472826,87
3,0.242363,119


First class also had the most survivors in absolute terms: 136, against 119 in third class and 87 in second.

Group by `Sex` as well.

In [None]:
df.groupby(['Sex', 'Pclass']).agg({'Survived': ['mean', 'sum']})

Sex,Pclass,Survived (mean),Survived (sum)
female,1,0.968085,91
female,2,0.921053,70
female,3,0.5,72
male,1,0.368852,45
male,2,0.157407,17
male,3,0.135447,47


Within every class, women survived at a much higher rate than men.

You are not limited to standard functions like 'mean' and 'sum': you can define your own function and pass it to `agg()`, as the next task shows.

__To turn the index levels into columns, use `reset_index()`__

In [None]:
df.groupby(['Sex', 'Pclass']).agg({'Survived': ['mean', 'sum']}).reset_index()

,Sex,Pclass,Survived (mean),Survived (sum)
0,female,1,0.968085,91
1,female,2,0.921053,70
2,female,3,0.5,72
3,male,1,0.368852,45
4,male,2,0.157407,17
5,male,3,0.135447,47
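
Passing a list of functions to `agg()` yields two-level column labels, which can be awkward after `reset_index()`. One possible alternative, sketched here on a made-up `toy` dataframe, is pandas named aggregation (available from pandas 0.25), where each output column gets a flat name of your choosing. The names `survival_rate` and `survivors` are illustrative, not from the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns used above
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male"],
    "Pclass": [1, 1, 1, 1],
    "Survived": [1, 1, 0, 1],
})

# Named aggregation: output_name=(input_column, function)
summary = (
    toy.groupby(["Sex", "Pclass"])
       .agg(survival_rate=("Survived", "mean"),
            survivors=("Survived", "sum"))
       .reset_index()
)

print(summary)
print(list(summary.columns))
```

The result has ordinary single-level columns, so later selections such as `summary["survivors"]` need no tuple keys.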


__Task__

Find the difference between the maximum and the minimum fare paid in each class.

In [None]:
df.groupby('Pclass').agg({'Fare': lambda x: np.max(x) - np.min(x)})

Pclass,Fare
1,512.3292
2,73.5
3,69.55


Show the min and max too, and group by `Sex` as well.

In [None]:
def minmax(x):
    # Range of the group: maximum fare minus minimum fare
    return np.max(x) - np.min(x)

df.groupby(['Pclass', 'Sex']).agg({'Fare': [np.min, np.max, minmax]})

Pclass,Sex,Fare (amin),Fare (amax),Fare (minmax)
1,female,25.9292,512.3292,486.4
1,male,0.0,512.3292,512.3292
2,female,10.5,65.0,54.5
2,male,0.0,73.5,73.5
3,female,6.75,69.55,62.8
3,male,0.0,69.55,69.55


So some passengers paid nothing to get on board, and all of them were men.

### Mini Challenge

Compute the correlation between `Survived` and `Fare` within each `Pclass` group.

```python
import pandas as pd
df = pd.read_csv('Datasets/Titanic.csv')
df.head()
```

__Solution:__

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Get Groupwise correlations
df = pd.read_csv('Datasets/Titanic.csv')
df.groupby('Pclass').apply(lambda x: print(x[['Fare', 'Survived']].corr()))

              Fare  Survived
Fare      1.000000  0.190966
Survived  0.190966  1.000000
              Fare  Survived
Fare      1.000000  0.098628
Survived  0.098628  1.000000
             Fare  Survived
Fare      1.00000   0.00093
Survived  0.00093   1.00000


In [None]:
# Get only the correlation values
df.groupby('Pclass').apply(lambda x: print(x[['Fare', 'Survived']].corr().iloc[0,1]))

0.19096640841564308
0.09862818081146572
0.0009295304523811009
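
Printing inside `apply` displays the numbers but leaves nothing you can reuse. A minimal sketch of an alternative, assuming a small made-up `toy` dataframe in place of the Titanic file: let the function return the correlation so that `apply` collects one value per group into a Series.

```python
import pandas as pd

# Hypothetical toy data (not the Titanic file)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2],
    "Fare": [10.0, 20.0, 30.0, 5.0, 10.0, 15.0],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Select the two columns first, then return (not print) the correlation
corr_by_class = (
    toy.groupby("Pclass")[["Fare", "Survived"]]
       .apply(lambda g: g["Fare"].corr(g["Survived"]))
)

print(corr_by_class)
```

Because `corr_by_class` is an ordinary Series indexed by `Pclass`, it can be sorted, plotted or merged back into other results.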


<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>2. Iterating over Groups</h2>
</div>

Calling `groupby` on a dataframe creates an iterable `DataFrameGroupBy` object.

If you want to apply further customized operations, you can iterate through the groups.

For example, you want to compute the mean fare for every class, but omit zero fares for male passengers in the 2nd class only.

For such customized logic, iterating through the groups makes it easy.
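
Before looping, it can help to inspect the grouped object directly. The sketch below uses a made-up `toy` dataframe and three standard GroupBy members: `ngroups`, `size()` and `get_group()`.

```python
import pandas as pd

# Hypothetical toy data (not the Titanic file)
toy = pd.DataFrame({
    "Pclass": [1, 2, 2, 3, 3, 3],
    "Fare": [80.0, 20.0, 25.0, 7.0, 8.0, 9.0],
})

groups = toy.groupby("Pclass")

# Number of distinct groups
print(groups.ngroups)

# Number of rows in each group
sizes = groups.size()
print(sizes)

# Pull out a single group as a dataframe without writing a loop
second_class = groups.get_group(2)
print(second_class)
```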


In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv('Datasets/Titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
df_groups = df.groupby('Pclass')
df_groups

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000017C0EC27A88>

Let's look at the data in each group.

In [None]:
# Group the dataframe by Pclass
for name, group in df.groupby('Pclass'): 
    print("Group name: ", name)
    print(group.head(2), "\n\n")


Group name:  1
   PassengerId  Survived  Pclass  \
1            2         1       1   
3            4         1       1   

                                                Name     Sex   Age  SibSp  \
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   

   Parch    Ticket     Fare Cabin Embarked  
1      0  PC 17599  71.2833   C85        C  
3      0    113803  53.1000  C123        S   


Group name:  2
    PassengerId  Survived  Pclass                                 Name  \
9            10         1       2  Nasser, Mrs. Nicholas (Adele Achem)   
15           16         1       2     Hewlett, Mrs. (Mary D Kingcome)    

       Sex   Age  SibSp  Parch  Ticket     Fare Cabin Embarked  
9   female  14.0      1      0  237736  30.0708   NaN        C  
15  female  55.0      0      0  248706  16.0000   NaN        S   


Group name:  3
   PassengerId  Survived  Pclass                     Na

Writing the logic

In [None]:
# Group the dataframe by Pclass
for name, group in df.groupby('Pclass'): 
    print("Group name: ", name)
    if name == 2:
        # Drop only the zero-fare male passengers; keep everyone else in the class
        zero_fare_male = (group.Sex == "male") & (group.Fare == 0)
        print(group.loc[~zero_fare_male, "Fare"].mean().round(2))
    else:
        print(group.Fare.mean().round(2))

Group name:  1
84.15
Group name:  2
20.66
Group name:  3
13.68
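
When the custom rule can be written as a single row filter, the same task can also be done without a loop. This is a sketch on a made-up `toy` dataframe, not the notebook's own solution: build a mask for the rows to exclude, drop them, then aggregate as usual.

```python
import pandas as pd

# Hypothetical toy data (not the Titanic file)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 2, 3],
    "Sex": ["male", "female", "male", "male", "female", "male"],
    "Fare": [0.0, 100.0, 0.0, 20.0, 40.0, 0.0],
})

# Rows to exclude: zero-fare male passengers in class 2 only
drop = (toy["Pclass"] == 2) & (toy["Sex"] == "male") & (toy["Fare"] == 0)

# Zero fares in classes 1 and 3 are kept, as the task requires
mean_fare = toy[~drop].groupby("Pclass")["Fare"].mean().round(2)

print(mean_fare)
```

Iterating remains the more flexible option when each group needs genuinely different processing.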


### Challenge

Do a groupby operation on the Titanic dataframe on `Pclass`. Iterate through each group and extract the names of the three female passengers who paid the highest fares.

In [None]:
# Solution ---
# Group the dataframe by Pclass
for name, group in df.groupby('Pclass'): 
    print("Group name: ", name)
    group = group.sort_values('Fare', ascending=False)
    print(group.loc[group.Sex=="female", ["Name", "Fare"]].head(3), "\n\n")

Group name:  1
                               Name      Fare
258                Ward, Miss. Anna  512.3292
88       Fortune, Miss. Mabel Helen  263.0000
341  Fortune, Miss. Alice Elizabeth  263.0000 


Group name:  2
                                                  Name     Fare
615                                Herman, Miss. Alice  65.0000
754                   Herman, Mrs. Samuel (Jane Laver)  65.0000
608  Laroche, Mrs. Joseph (Juliette Marie Louise La...  41.5792 


Group name:  3
                                  Name   Fare
792            Sage, Miss. Stella Anna  69.55
180       Sage, Miss. Constance Gladys  69.55
863  Sage, Miss. Dorothy Edith "Dolly"  69.55 
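
For comparison, a loop-free variant of the same idea, sketched on a made-up `toy` dataframe: filter to female passengers, sort by fare, and let `groupby(...).head(3)` keep the first three rows of every class.

```python
import pandas as pd

# Hypothetical toy data (not the Titanic file); names are placeholders
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 1, 2, 2],
    "Sex": ["female", "female", "female", "female", "male", "female", "female"],
    "Name": ["A", "B", "C", "D", "E", "F", "G"],
    "Fare": [50.0, 200.0, 100.0, 150.0, 500.0, 30.0, 60.0],
})

top3 = (
    toy[toy["Sex"] == "female"]              # keep female passengers only
    .sort_values("Fare", ascending=False)    # highest fares first
    .groupby("Pclass")
    .head(3)                                 # first three rows per class
    .sort_values(["Pclass", "Fare"], ascending=[True, False])
)

print(top3[["Pclass", "Name", "Fare"]])
```

A class with fewer than three female passengers simply contributes fewer rows.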




<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>3. Transform Method</h2>
</div>

__When to use__

Sometimes, instead of collapsing each group to a single row, you want a new column that holds the group-level value for every row of the original dataframe.

__Example Task__: 

In Titanic data, you want to create a new column that contains the mean fare for that class.

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('Datasets/Titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
df["Fare_Mean"] = df.groupby('Pclass')["Fare"].transform('mean')

In [None]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Fare_Mean
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,13.67555
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,84.154687
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,13.67555
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,84.154687
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,13.67555
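
The difference from a plain aggregation is the shape of the result. A minimal sketch on a made-up `toy` dataframe: the aggregation returns one value per class, while `transform` returns one value per original row, aligned to the original index.

```python
import pandas as pd

# Hypothetical toy data (not the Titanic file)
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Fare": [10.0, 80.0, 20.0, 100.0],
})

# Aggregation: one row per group
per_group = toy.groupby("Pclass")["Fare"].mean()

# Transform: one value per original row, so it can be assigned as a column
per_row = toy.groupby("Pclass")["Fare"].transform("mean")

print(len(per_group), len(per_row))

toy["Fare_Mean"] = per_row
print(toy)
```

Mapping the aggregated values back by hand, for example `toy["Pclass"].map(per_group)`, gives the same column; `transform` just does it in one step.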


### Challenge

What percentage of the total fare in each class did each individual contribute?

```python
df = pd.read_csv('Datasets/Titanic.csv')
```

In [None]:
# Solution
df = pd.read_csv('Datasets/Titanic.csv')


df["Fare_Class_Total"] = df.groupby('Pclass')["Fare"].transform('sum')
# Fare_Perc is a fraction of the class total; multiply by 100 for a percentage
df["Fare_Perc"] = df["Fare"] / df["Fare_Class_Total"]

df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Fare_Class_Total,Fare_Perc
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,6714.6951,0.00108
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,18177.4125,0.003922
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,6714.6951,0.00118
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,18177.4125,0.002921
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,6714.6951,0.001199
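
A quick way to sanity-check a within-group share is to confirm that it adds up to the whole in every group. The sketch below uses a made-up `toy` dataframe and expresses the share on a 0 to 100 scale.

```python
import pandas as pd

# Hypothetical toy data (not the Titanic file)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 2],
    "Fare": [25.0, 75.0, 10.0, 30.0, 60.0],
})

# Each fare as a percentage of its class total
class_total = toy.groupby("Pclass")["Fare"].transform("sum")
toy["Fare_Perc"] = 100 * toy["Fare"] / class_total

# Sanity check: the percentages within each class should sum to 100
check = toy.groupby("Pclass")["Fare_Perc"].sum()

print(toy)
print(check)
```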
