**Before you start:** Click **File → Save a copy in Drive** so you have your own version of this notebook. If you skip this step, your work will not be saved.

# Load modules and settings

In [1]:
# first thing is to import pandas
import pandas as pd

pd.options.display.max_columns = None
pd.options.display.max_rows = 20

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Load Titanic Data

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/RMS_Titanic_3.jpg/1280px-RMS_Titanic_3.jpg"
  width="30%" align="right">

Dataset info: https://www.kaggle.com/c/titanic/data

## Columns 
This is the subset of columns that we'll care about:
 - pclass - "Passenger class". Has 3 values:
   - `1:` 1st, `2:` 2nd, `3:` 3rd
 - sex - The sex as recorded in this dataset (`male`, `female`, or `NA` if missing)
 - survived - An indicator of whether the passenger survived the sinking. 
   - `0` - did not survive
   - `1` - survived
 - age - Passenger age
 - fare - The fare the passenger paid
 


In [25]:
titanic = pd.read_csv('https://zjelveh.github.io/files/titanic.csv')
columns_to_keep = ['pclass', 'sex', 'survived', 'age', 'fare']
titanic = titanic[columns_to_keep]
titanic.head()

   pclass     sex  survived     age      fare
0     1.0  female       1.0    29.0  211.3375
1     1.0    male       1.0  0.9167    151.55
2     1.0  female       0.0     2.0    151.55
3     1.0    male       0.0    30.0    151.55
4     1.0  female       0.0    25.0    151.55


# Groupby
Let's compute the average age by sex

In [4]:
titanic.groupby(by=['sex']).age.mean()

sex
female    28.687071
male      30.585233
Name: age, dtype: float64

# Groupby with apply and lambda
Now let's do it again using apply and lambda

In [14]:
titanic.groupby(by=['sex']).apply(lambda x: x.age.mean())

sex
female    28.687071
male      30.585233
dtype: float64
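
To see why the two approaches agree, here is a minimal sketch on a small made-up DataFrame (the `toy` data and the variable names are assumptions for illustration, not the Titanic file). The lambda passed to `apply` receives each group's sub-DataFrame, so taking the mean of its `age` column reproduces the built-in aggregation.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the pattern
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male"],
    "age": [20.0, 30.0, 40.0, None],
})

# Built-in aggregation: select the column, then average within each group
direct = toy.groupby("sex")["age"].mean()

# Same computation with apply: the lambda sees one group's rows at a time
via_apply = toy.groupby("sex")[["age"]].apply(lambda g: g["age"].mean())

print(direct)
print(via_apply)
```

In both versions the missing age is skipped, because `mean()` ignores missing values by default. The built-in form is usually shorter and faster; `apply` is handy when the per-group computation has no ready-made method.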

# Computing conditional probabilities


<font size=4 color='blue'>$P(survived=1| pclass)$</font>


In [15]:
titanic.groupby(['pclass']).survived.mean()

pclass
1.0    0.619195
2.0    0.429603
3.0    0.255289
Name: survived, dtype: float64

Notice that the following fails. Why?

In [17]:
titanic.groupby(['pclass']).survived==1.mean()

SyntaxError: invalid syntax (2990600421.py, line 1)

Instead, we can put the logical check inside the lambda function:

In [22]:
titanic.groupby(['pclass']).apply(lambda x: (x.survived==1).mean())

pclass
1.0    0.619195
2.0    0.429603
3.0    0.255289
dtype: float64
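
One way to read the earlier failure: Python most likely treats `1.` as the start of a float literal, so the `mean` that follows cannot be parsed. Even without that, a method call binds more tightly than `==`, so the comparison has to be wrapped in parentheses before `.mean()` is called. The sketch below shows this on a small made-up DataFrame (the `toy` data is an assumption), along with one possible way to get the same per-class result without `apply`.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the pattern
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 2],
    "survived": [1, 0, 0, 0, 1],
})

# Parentheses force the comparison to run first, giving a boolean Series;
# the mean of booleans is the share of True values
overall = (toy["survived"] == 1).mean()

# Per-class share without apply: group the boolean Series by pclass
by_class = (toy["survived"] == 1).groupby(toy["pclass"]).mean()

print(overall)
print(by_class)
```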

# Conditioning on more than one variable

Compute <font size=4 color='blue'>$P(survived=1|sex, pclass)$</font>

In [26]:
titanic.groupby(['sex', 'pclass']).apply(lambda x: (x.survived==1).mean())

sex     pclass
female  1.0       0.965278
        2.0       0.886792
        3.0       0.490741
male    1.0       0.340782
        2.0       0.146199
        3.0       0.152130
dtype: float64
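
A result indexed by two variables can be easier to compare as a two-way table. As a minimal sketch on made-up data (the `toy` DataFrame is an assumption, not the Titanic file), `unstack` moves one index level into the columns:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the pattern
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "pclass": [1, 1, 2, 1, 2, 2],
    "survived": [1, 1, 0, 0, 0, 1],
})

# Survival rate for every (sex, pclass) combination
rates = toy.groupby(["sex", "pclass"])["survived"].mean()

# Move pclass into the columns: rows are sex, columns are pclass
table = rates.unstack("pclass")

print(table)
```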

# Combining value_counts and groupby to compute joint conditional distributions

Compute <font size=4 color='blue'>$P(sex, pclass|survived)$</font>


In [28]:
titanic.groupby(['survived'])[['sex', 'pclass']].value_counts(normalize=True)

survived  sex     pclass
0.0       male    3.0       0.516687
                  2.0       0.180470
                  1.0       0.145859
          female  3.0       0.135970
                  2.0       0.014833
                  1.0       0.006180
1.0       female  1.0       0.278000
                  3.0       0.212000
                  2.0       0.188000
          male    3.0       0.150000
                  1.0       0.122000
                  2.0       0.050000
Name: proportion, dtype: float64

Notice that the probabilities within `survived==0` sum to 1, and so do the probabilities within `survived==1`.
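
The sum-to-one property can be checked directly rather than by eye. The following is a sketch on made-up data (the `toy` DataFrame is an assumption), and it assumes a pandas version recent enough to support `value_counts` on a grouped DataFrame:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the pattern
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "pclass": [1, 1, 2, 1, 2, 2],
    "survived": [1, 1, 0, 0, 0, 1],
})

# Distribution of (sex, pclass) within each value of survived
dist = toy.groupby("survived")[["sex", "pclass"]].value_counts(normalize=True)

# Add the proportions back up within each survived group; each total should be 1
totals = dist.groupby(level="survived").sum()

print(totals)
```

Summing over a different index level, such as `sex`, would generally not give 1, because the normalization happens within the grouping variable only.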

# Lab Task
- Create a new column called `age_under_18` that is `True` if `age < 18` and `False` otherwise

- Compute <font size=4 color='blue'>$P(survived=1|sex, pclass, age\_under\_18)$</font>
  - Which group was most likely to survive?
  - Least likely?

- Compute <font size=4 color='blue'>$P(age>50 |survived)$</font>

- What is the average `fare` by sex? 

- Compute <font size=4 color='blue'>$P(survived, pclass|sex)$</font>
  - Which rows add up to 1?
  - Among men, which group was least likely to survive?

- Compute <font size=4 color='blue'>$P(age\_under\_18=0, sex=\text{male}|survived)$</font>

- Compute <font size=4 color='blue'>$P(age\_under\_18, survived=1|sex)$</font>
  - What adds up to one here?