<p><font size="6"><b>06 - Pandas: "Group by" operations</b></font></p>

Adapted version from:
> *© 2021, Joris Van den Bossche and Stijn Van Hoey  (<mailto:jorisvandenbossche@gmail.com>, <mailto:stijnvanhoey@gmail.com>). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*

---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')  # this style was named 'seaborn-whitegrid' before matplotlib 3.6

# Some 'theory': the groupby operation (split-apply-combine)

In [None]:
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df

### Recap: aggregating functions

When analyzing data, you often calculate summary statistics (aggregations like the mean, max, ...). As we have seen before, we can easily calculate such a statistic for a Series or column using one of the many available methods. For example:

In [None]:
df['data'].sum()

However, in many cases your data has certain groups in it, and in that case, you may want to calculate this statistic for each of the groups.

For example, in the above dataframe `df`, there is a column 'key' which has three possible values: 'A', 'B' and 'C'. When we want to calculate the sum for each of those groups, we could do the following:

In [None]:
for key in ['A', 'B', 'C']:
    print(key, df[df['key'] == key]['data'].sum())

This becomes verbose quickly: you have to know the group values in advance, and the loop gets more complicated when you group on several columns. You could avoid hard-coding the values by looping over `df['key'].unique()`, but it is still not very convenient to work with.

What we did above, applying a function on different groups, is a "groupby operation", and pandas provides some convenient functionality for this.

### Groupby: applying functions per group

The "group by" concept: we want to **apply the same function to subsets of a dataframe, using some key to split the dataframe into those subsets**.

This operation is also referred to as the "split-apply-combine" operation, involving the following steps:

* **Splitting** the data into groups based on some criteria
* **Applying** a function to each group independently
* **Combining** the results into a data structure

<img src="../img/pandas/splitApplyCombine.png">
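To make the three steps concrete, here is a minimal sketch that performs them by hand on the toy dataframe from above. The intermediate names (`groups`, `sums`, `manual`) are only illustrative, and this is not how pandas implements the operation internally.

```python
import pandas as pd

# Same toy data as above
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# Split: one sub-dataframe per distinct key
groups = {key: df[df['key'] == key] for key in df['key'].unique()}

# Apply: sum the 'data' column of each group independently
sums = {key: group['data'].sum() for key, group in groups.items()}

# Combine: collect the per-group results in a single Series
manual = pd.Series(sums, name='data').rename_axis('key')

print(manual)
```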

Similar to SQL `GROUP BY`

Instead of doing the manual filtering as above:


    df[df['key'] == "A"].sum()
    df[df['key'] == "B"].sum()
    ...

pandas provides the `groupby` method to do exactly this:

In [None]:
df.groupby('key').sum()

In [None]:
df.groupby('key').aggregate('sum')  # same result as .sum(); passing np.sum is deprecated in recent pandas

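Many more aggregation methods are available (`mean`, `max`, `count`, ...). You can also select a column first and aggregate only that one: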

In [None]:
df.groupby('key')['data'].sum()
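As a small illustration of those other methods, the sketch below applies a few of them to the same toy data. It also uses `agg` with a list of function names to compute several statistics in one call. The variable names `grouped` and `summary` are just for this example.

```python
import pandas as pd

# Same toy data as above
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

grouped = df.groupby('key')['data']

print(grouped.mean())  # average per group
print(grouped.max())   # largest value per group

# Several aggregations at once: the result has one column per function name
summary = grouped.agg(['min', 'mean', 'max'])
print(summary)
```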

# Application of the groupby concept on the titanic data

We go back to the titanic passengers survival data:

In [None]:
df = pd.read_csv("data/titanic.csv")

In [None]:
df.head()

<div class="alert alert-success">

<b>EXERCISE 1</b>:

 <ul>
  <li>Using groupby(), calculate the average age for each sex.</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_05_groupby_operations1.py

<div class="alert alert-success">

<b>EXERCISE 2</b>:

 <ul>
  <li>Calculate the average survival ratio for all passengers.</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_05_groupby_operations2.py

<div class="alert alert-success">

<b>EXERCISE 3</b>:

 <ul>
  <li>Calculate this survival ratio for all passengers younger than 25 (remember: filtering/boolean indexing).</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_05_groupby_operations3.py

<div class="alert alert-success">

<b>EXERCISE 4</b>:

 <ul>
  <li>What is the difference in the survival ratio between the sexes?</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_05_groupby_operations4.py

<div class="alert alert-success">

<b>EXERCISE 5</b>:

 <ul>
  <li>Make a bar plot of the survival ratio for the different classes ('Pclass' column).</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_05_groupby_operations5.py
df.groupby('Pclass')['Survived'].mean().plot(kind='bar')  # and what if you compared the total number of survivors instead?

<div class="alert alert-success">

**EXERCISE 6**:

* Make a bar plot to visualize the average Fare paid by people depending on their age. The age column is divided into separate classes using the `pd.cut()` function as provided below.

</div>

In [None]:
df['AgeClass'] = pd.cut(df['Age'], bins=np.arange(0,90,10))
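To see what `pd.cut()` produces with these bins, here is a small sketch on a handful of made-up ages (the `ages` values are hypothetical, not taken from the titanic data). The intervals are closed on the right by default, so an age of exactly 10 falls in `(0, 10]`, and values outside the outermost edges become missing.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, chosen to hit the bin edges
ages = pd.Series([5, 10, 35, 80, 85])

# Bin edges 0, 10, ..., 80 give the intervals (0, 10], (10, 20], ..., (70, 80]
age_class = pd.cut(ages, bins=np.arange(0, 90, 10))

print(age_class)
```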

In [None]:
# %load _solutions/pandas_05_groupby_operations6.py
df.groupby('AgeClass')['Fare'].mean().plot(kind='bar', rot=0)

If you are ready, more groupby exercises can be found below.

# Some more theory

## Specifying the grouper

In the previous example and exercises, we always grouped by a single column by passing its name. However, a column name is not the only value you can pass as the grouper in `df.groupby(grouper)`. Other possibilities for `grouper` are:

- a list of strings (to group by multiple columns)
- a Series (similar to a string indicating a column in df) or array
- a function (to be applied on the index)
- `level=[...]`: names (or positions) of levels in a MultiIndex

In [None]:
df.groupby(df['Age'] < 18)['Survived'].mean()

In [None]:
df.groupby(['Pclass', 'Sex'])['Survived'].mean()
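The two cells above cover the Series and list-of-strings cases. The remaining two options from the list (a function applied to the index, and `level=`) could look as follows. The sketch uses small made-up objects (`fruit`, `toy`) rather than the titanic data, which has neither a meaningful index nor a MultiIndex.

```python
import pandas as pd

# Function as grouper: it is called on every index label,
# and its return value is used as the group key
fruit = pd.Series([1, 2, 3, 4],
                  index=['apple', 'avocado', 'banana', 'blueberry'])
by_first_letter = fruit.groupby(lambda label: label[0]).sum()
print(by_first_letter)

# level= as grouper: group by the name of a level in a MultiIndex
toy = pd.DataFrame(
    {'value': [1, 2, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [('a', 1), ('a', 2), ('b', 1), ('b', 2)],
        names=['letter', 'number']),
)
by_letter = toy.groupby(level='letter')['value'].sum()
by_number = toy.groupby(level='number')['value'].sum()
print(by_letter)
print(by_number)
```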

## The size of groups - value counts

Often you want to know how many elements there are in a certain group (or in other words: the number of occurrences of the different values from a column).

To get the size of the groups, we can use `size`:

In [None]:
df.groupby('Pclass').size()

In [None]:
df.groupby('Embarked').size()

Another way to obtain such counts is to use the Series `value_counts` method:

In [None]:
df['Embarked'].value_counts()
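The two approaches give the same counts but order them differently, which the following sketch shows on a small made-up column (`toy` is a hypothetical stand-in for the titanic data). By default both skip missing values. Passing `dropna=False` to `value_counts` counts them as well.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'Embarked' column, including one missing value
toy = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q', 'S', np.nan, 'C']})

sizes = toy.groupby('Embarked').size()    # ordered by group label: C, Q, S
counts = toy['Embarked'].value_counts()   # ordered by count, largest first: S, C, Q

# Keep the missing value as its own entry
counts_with_nan = toy['Embarked'].value_counts(dropna=False)

print(sizes)
print(counts)
print(counts_with_nan)
```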