<div class="licence">
<span>Licence CC BY-NC-ND</span>
<span>Valérie Roy</span>
<span><img src="media/ensmp-25-alpha.png" /></span>
</div>

# pivoting tables for grouping

we'll study 2 examples:

* category-based:
  * 'category' means that a value belongs to a (small) finite set of possible values
  * example: 'male' or 'female'; 'lower', 'middle', 'upper'

* bin-based:
  * numerical data can always be turned into a category
  * by defining 'bins', i.e. intervals, and replacing each value by the bin it falls in

## example 1: category-based (Titanic survival rate by class and sex)

we have a **dataset** about the sinking of the **Titanic**  
that gives for each **passenger**, among other things:
- their **survival** (yes or no), **sex**, **class** (first, second, third), and **age**

In [None]:
import numpy as np
import pandas as pd

# importing the whole dataset
df = pd.read_csv('titanic.csv')

# let's see what's in there for us
for column in df.columns:
    print(column, end=" ")

that's a little too much
- we keep the **interesting columns** 
- using *read_csv*'s `usecols` parameter 

In [None]:
df = pd.read_csv(
    'titanic.csv', 
    usecols=['Survived', 'Pclass', 'Sex', 'Age'])
# displayed on next slide

In [None]:
df.head(3)

In [None]:
# number of passengers in each class
df['Pclass'].value_counts()

suppose you want to know
   - the **survival rate** depending on the **sex** and the **class** 

- this can be done with the *pandas.DataFrame.pivot_table* **method**

we are going to

* group the rows that share the same **Sex** (used as index) and the same **Pclass** (used as columns)
* and on each group, compute the **mean** of the **survival status**, which gives the survival rate

in terms of the `pivot_table` parameters, this translates into
   - the **values** to be **aggregated**: here **Survived**
   - the **aggregation** function (`aggfunc`): here *numpy.mean*
   - the **key** to be used as **index**: here the column **Sex**
   - the **key** to be used as **columns**: here the column **Pclass**

In [None]:
# the result of pivot_table is a new dataframe

df.pivot_table('Survived', index='Sex', columns='Pclass', aggfunc=np.mean)
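
to make the grouping explicit, here is a minimal sketch on a small made-up frame (the `toy` data is an assumption for illustration, not the Titanic dataset); it uses the string `'mean'` as the aggregation and compares `pivot_table` with a `groupby` followed by `unstack`

```python
import pandas as pd

# toy data, made up for illustration (not the Titanic dataset)
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'male', 'male', 'male', 'female'],
    'Pclass':   [1, 3, 1, 3, 3, 1],
    'Survived': [1, 0, 0, 0, 1, 1],
})

# pivot_table: one row per Sex, one column per Pclass
pivot = toy.pivot_table('Survived', index='Sex', columns='Pclass', aggfunc='mean')

# the same computation spelled out:
# group by both keys, average, then move Pclass to the columns
grouped = toy.groupby(['Sex', 'Pclass'])['Survived'].mean().unstack()

print(pivot)
print(grouped)
```

both frames hold the same numbers; `pivot_table` is simply a more compact way of writing the group-then-aggregate step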

## example 2: bin-based (survival rate by age group)

   - you want to **compute** the **survival rate** by **age group**
   - but we do not have an **age group** column
   - so we must **pack** the ages into **bins** representing **age groups**

   - we create **bins** of ages and **names** for those **bins**

In [None]:
# here we define the thresholds of our age groups
age_groups = [0, 11, 17, 25, 35, 45, 55, 65, 100] 

# and for convenience we give each of them a handy label
age_group_names = ['<11', '11-17', '17-25', '25-35', '35-45', '45-55', '55-65', '>65'] 

- we create a new category **column** that gives each person's **age group**
- using the *pandas.cut* **function** with the `bins` and `labels` parameters
- we just **add** the column to our data frame

In [None]:
df['Age group'] = pd.cut(
    df['Age'], bins=age_groups, labels=age_group_names)

df.head()
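
a quick sketch of how *pandas.cut* treats values that sit on a bin edge, using a few made-up ages (the `ages` series is an assumption for illustration); by default each bin is closed on the right, so 11 lands in the first bin

```python
import pandas as pd

age_groups = [0, 11, 17, 25, 35, 45, 55, 65, 100]
age_group_names = ['<11', '11-17', '17-25', '25-35', '35-45', '45-55', '55-65', '>65']

# made-up ages, chosen to sit on or next to bin edges
ages = pd.Series([0.5, 11, 12, 65, 66])

# default behaviour: intervals are (0, 11], (11, 17], ... (65, 100]
groups = pd.cut(ages, bins=age_groups, labels=age_group_names)
print(groups.tolist())
# → ['<11', '<11', '11-17', '55-65', '>65']
```

so with these defaults the label `'<11'` really means "up to and including 11"; passing `right=False` to *pandas.cut* would close the bins on the left instead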

- so now we have a category column
- so we use *pivot_table*
- to compute a new **data frame** with the  **survival rate** by **age group**

In [None]:
# this time the index is the single column 'Sex'
df.pivot_table('Survived', index=['Sex'], columns='Age group', aggfunc=np.mean)

   - **women** had a **higher** survival rate in **all** age groups except **children under 11**
   - where $55 \%$ of the boys were saved against $54 \%$ of the girls

In [None]:
# if we're just interested in the global survival rate
# by age-group, then just don't specify an index parameter
# (or say index=[])
df.pivot_table('Survived', index=[], columns='Age group', aggfunc=np.mean)

   - we **do not need** to **add** a column to the **data frame**
   - here we **pass** the **number** of **bins** and their **names** directly to *pandas.cut*
   - pandas automatically splits the range of 'Age' values into 3 intervals of equal width

In [None]:
col = pd.cut(df['Age'], 3, labels=['child', 'adult', 'old'])
display(col.head(3))
display(col.tail(3))
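
to see which intervals pandas actually picks, here is a small sketch on made-up ages (the `ages` series is an assumption for illustration); `retbins=True` asks *pandas.cut* to also return the computed bin edges

```python
import pandas as pd

# made-up ages ranging from 1 to 80
ages = pd.Series([1, 20, 40, 79, 80])

# retbins=True also returns the bin edges that pandas computed
col, edges = pd.cut(ages, 3, labels=['child', 'adult', 'old'], retbins=True)

print(col.tolist())
# → ['child', 'child', 'adult', 'old', 'old']

# 4 edges for 3 bins; the lowest edge is pushed slightly below the minimum
# so that the smallest age is included in the first bin
print(edges)
```

note that the bins have equal width, not equal population, and that the labels are only names: with ages going up to about 80, the 'child' bin extends to roughly 27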

In [None]:
# same use of pivot_table as before, except that
# col is not a part of df
df.pivot_table('Survived', index=['Sex', 'Pclass'], columns=col, aggfunc=np.mean)

## digression: tweaking data

In [None]:
df = pd.read_csv('titanic.csv', usecols=('Pclass', 'Age', 'Sex'))

  - example of changing the **data**
  - for example, we want to **replace** the **number** code  
    of classes by a more user-friendly **name** ('first' instead of 1)
  

In [None]:
# we start with creating a mask that is
# True on all people in first class

mask = (df['Pclass'] == 1)

In [None]:
# the mask is a boolean Series aligned with the index of df
mask.head()

- we **locate** the rows where the mask is **True**
- and replace their **Pclass** column **value** by the string **'first'**
- always use **loc** or **iloc**  
  **never** use a classical **array**-style assignment such as `df['Pclass'][mask] = 'first'`

In [None]:
df.loc[mask, 'Pclass'] = 'first'

In [None]:
# and so on
df.loc[df['Pclass'] == 2, 'Pclass'] = 'second'
df.loc[df['Pclass'] == 3, 'Pclass'] = 'third'

In [None]:
df.head()
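
as an alternative sketch, the three replacements can be done in one pass with `Series.map` and a dictionary (the `toy` frame and the `class_names` dictionary are assumptions for illustration); this avoids writing strings into an integer column one mask at a time, which recent pandas versions may warn about

```python
import pandas as pd

# toy frame standing in for the Titanic data
toy = pd.DataFrame({'Pclass': [1, 3, 2, 3]})

# hypothetical lookup table from class code to class name
class_names = {1: 'first', 2: 'second', 3: 'third'}

# map replaces every code by its name and returns a new Series
toy['Pclass'] = toy['Pclass'].map(class_names)

print(toy['Pclass'].tolist())
# → ['first', 'third', 'second', 'third']
```

a code that is missing from the dictionary would become `NaN`, so the dictionary should cover every value present in the column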