**Before you start:** Click **File → Save a copy in Drive** so you have your own version of this notebook. If you skip this step, your work will not be saved.

# Load modules and settings

In [None]:
# first thing is to import pandas
import pandas as pd

pd.options.display.max_columns = None
pd.options.display.max_rows = 20

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Load Titanic Data

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/RMS_Titanic_3.jpg/1280px-RMS_Titanic_3.jpg"
  width="30%" align="right">

Dataset info: https://www.kaggle.com/c/titanic/data

## Columns 
This is the subset of columns that we'll care about:
 - pclass - "Passenger class". Has 3 values:
   - `1:` 1st, `2:` 2nd, `3:` 3rd
 - sex - The sex as recorded in this dataset (either `male` or `female`, or `NA` if missing)
 - survived - An indicator of whether the passenger survived the sinking. 
   - `0` - did not survive
   - `1` - survived
 - age - passenger age
 


In [None]:
titanic = pd.read_csv('https://zjelveh.github.io/files/titanic.csv')
titanic.head()

Each row is a passenger

Recall that `shape` tells us the number of rows and number of columns.

In [None]:
titanic.shape

We see that there are 1,310 passengers and 14 columns

We are going to limit our data to the four columns that we need for the lab

In [None]:
columns_to_keep = ['pclass', 'sex', 'survived', 'age']

In [None]:
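# .copy() gives us an independent dataframe, so adding columns later
# does not trigger a SettingWithCopyWarning in pandas versions that emit one
titanic = titanic[columns_to_keep].copy()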

In [None]:
titanic.shape

In [None]:
titanic.head()

# Using iloc to access rows and columns
The `.iloc` property of a dataframe is how we can flexibly access rows and columns.



In [None]:
# The following line of code returns the second row of titanic
titanic.iloc[1, :]


In [None]:
# The following line of code returns the first column of titanic
titanic.iloc[:, 0]

In [None]:
# The following line of code returns the value in the second row and first column of titanic
titanic.iloc[1, 0]

In [None]:
# Next we want the 3rd through 6th rows and the 2nd through 4th columns
# First, here is what the first seven rows look like
titanic.head(n=7)

In [None]:
titanic.iloc[2:6, 1:4]
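
To recap the slicing rules used above, here is a minimal sketch on a small made-up dataframe (the `toy` data is hypothetical, not the Titanic file). Positions start at 0 and the end of a slice is excluded, which is why `2:6` means the 3rd through 6th rows.

```python
import pandas as pd

# Hypothetical toy data with the same four column names as the lab
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "sex": ["female", "male", "male", "female", "male", "male"],
    "survived": [1, 0, 0, 1, 0, 1],
    "age": [29.0, 40.0, 31.0, 22.0, 19.0, 8.0],
})

# Rows at positions 2 and 3, columns at positions 1 and 2
# (position 4 and position 3 are excluded)
block = toy.iloc[2:4, 1:3]
print(block)
print(block.shape)
```

The result has two rows and two columns (`sex` and `survived`), even though each slice spans a difference of two positions starting from different offsets.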

# Creating a column
Let's create a column that tells us whether someone is older than 18

### Direct method


In [None]:
titanic['age_over_18'] = titanic.age > 18
titanic['age_over_18']

In [None]:
titanic.age_over_18.value_counts()
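
One thing to keep in mind when reading these counts: comparing a missing value with `>` gives `False`. So if the `age` column has missing values, those passengers are counted in the `False` group rather than being left out. A minimal sketch with made-up ages (the `ages` series is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical ages, one of them missing
ages = pd.Series([30.0, 10.0, np.nan, 45.0])

over_18 = ages > 18
print(over_18.tolist())   # [True, False, False, True]

# Count how many ages are missing, to see how many False values come from NaN
print(ages.isna().sum())
```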

### Assign method
Now let's create a column that tells us if someone is older than 18 and is male

Notice that `assign` returns a new dataframe, so we have to assign the result back to `titanic`

In [None]:
titanic = titanic.assign(male_over_18=(titanic.age_over_18) & (titanic.sex=='male'))
titanic
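
As a small illustration of the assign method, the sketch below uses a made-up dataframe (`toy` is hypothetical) to show that the original is left untouched until the result is stored. It also shows one optional pattern: passing a function (here a `lambda`) lets a new column refer to a column created earlier in the same `assign` call.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "age": [30.0, 10.0, 45.0],
    "sex": ["male", "male", "female"],
})

# assign returns a new dataframe; toy itself does not gain the column
result = toy.assign(age_over_18=toy.age > 18)
print("age_over_18" in toy.columns)      # False
print("age_over_18" in result.columns)   # True

# With a function, d is the dataframe being built, so male_over_18
# can use the age_over_18 column defined just before it
result = toy.assign(
    age_over_18=lambda d: d.age > 18,
    male_over_18=lambda d: d.age_over_18 & (d.sex == "male"),
)
print(result)
```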

# Compute some probabilities AND probability distributions

Check your understanding: What is the definition of a **probability distribution**?

<font size=4 color='blue'>$P(survived)$</font>

In [None]:
# I'll first show the counts
titanic.survived.value_counts()

In [None]:
titanic.survived.value_counts(normalize=True)

<font size=4 color='blue'>$P(sex)$</font>

In [None]:
titanic.sex.value_counts(normalize=True)

<font size=4 color='blue'>$P(pclass)$</font>

In [None]:
titanic.pclass.value_counts(normalize=True)
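
A quick sanity check on any output of `value_counts(normalize=True)`: the proportions are the counts divided by the total, so they should add up to 1. The sketch below checks this on a made-up series (the `pclass` values here are hypothetical, not the real data).

```python
import pandas as pd

# Hypothetical passenger classes
pclass = pd.Series([3, 3, 1, 2, 3, 1, 3, 2])

counts = pclass.value_counts()
dist = pclass.value_counts(normalize=True)

# Dividing the counts by their total gives the same proportions
manual = counts / counts.sum()

print(dist)
print(dist.sum())
```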

## What is the probability that a passenger's age is over 18?
<font size=4 color='blue'>$P(age\_over\_18=1)$?</font>

In [None]:
titanic.age_over_18.value_counts(normalize=True)

#### **Note**: This returns the whole distribution P(age_over_18). What we want is a single probability from that distribution: P(age_over_18=1)

So we need to use `iloc` to access it

In [None]:
p_age_over_18 = titanic.age_over_18.value_counts(normalize=True)
p_age_over_18

In [None]:
p_age_over_18.iloc[0]

#### We see that this probability is the first element. We can also look it up by its label, `True`

In [None]:
p_age_over_18[True]
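
The two ways of accessing the probability are not always interchangeable. `value_counts` sorts its output from most frequent to least frequent, so position 0 holds whichever value is most common. A minimal sketch with made-up data (the `flags` series is hypothetical) where `True` is the rarer value:

```python
import pandas as pd

# Hypothetical column where True is less common than False
flags = pd.Series([False, False, False, True])
dist = flags.value_counts(normalize=True)

print(dist.iloc[0])     # 0.75, the proportion of False (the most frequent value)
print(dist.loc[True])   # 0.25, the proportion of True, wherever it sits
```

Looking up by label gives the intended probability regardless of the ordering, while looking up by position requires checking the printed output first.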

## Computing joint distributions
What is <font size=4 color='blue'>$P(sex, survived)$?</font>


In [None]:
# here is how to access the two columns we need
titanic[['sex', 'survived']]

In [None]:
titanic[['sex', 'survived']].value_counts()

In [None]:
titanic[['sex', 'survived']].value_counts(normalize=True)
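
The result of `value_counts` on two columns is a series whose labels are pairs of values, one per column. The sketch below, on made-up data (`toy` is hypothetical), shows how a single entry of the joint distribution can be looked up with a `(sex, survived)` pair and confirms that the entries add up to 1.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female"],
    "survived": [0, 0, 1, 1, 1],
})

joint = toy[["sex", "survived"]].value_counts(normalize=True)
print(joint)

# One entry of the joint distribution: P(sex=female, survived=1)
print(joint.loc[("female", 1)])

# All entries together form a distribution
print(joint.sum())
```

Only combinations that appear in the data are listed; in this toy example there is no row for female passengers who did not survive.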

## Marginalizing example
We will use marginalization to figure out the share of passengers that survived.

So we want to know <font size=3 color='blue'>$P(survived=1)$</font>

We will use the marginalization formula:

<font size=4 color='blue'>$P(survived=1) = P(survived=1, sex=male) + P(survived=1, sex=female)$</font>

In [None]:
p__survived_sex = titanic[['sex', 'survived']].value_counts(normalize=True)
p__survived_sex

In [None]:
p__survived_1 = p__survived_sex.iloc[2] + p__survived_sex.iloc[1]
p__survived_1.round(3)


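We see that of all passengers, 52% were males who did not survive and 9.7% were females who did not survive, so the share that did not survive, $P(survived=0)$, was close to 62%.
The code above sums the other two entries of the joint distribution, the ones with `survived=1`, so $P(survived=1)$ is about 38%.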
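
Picking entries by position works, but it depends on the order in which `value_counts` happens to list them. As an alternative sketch (not the approach used in this lab), the joint distribution can be summed over `sex` by grouping on the `survived` level of its index. The `toy` data below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female"],
    "survived": [0, 0, 1, 1, 1],
})

joint = toy[["sex", "survived"]].value_counts(normalize=True)

# Marginalize: add up the joint probabilities over sex, keeping survived
marginal = joint.groupby(level="survived").sum()
print(marginal)

# This should match the distribution computed directly from the column
direct = toy.survived.value_counts(normalize=True)
print(direct)
```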



## Computing conditional probabilities

Let's compute 

<font size=4 color='blue'>$P(survived=1| pclass=1)$</font>


We will walk through three ways to compute it
1) Using the conditional probability formula as a guide

2) Using raw counts 

3) Using optimized pandas code

### First method


Apply the conditional probability formula to get the numerator and the denominator

<font size=4  color='blue'>$P(survived=1 | pclass=1) = \frac{P(survived=1,~pclass=1)}{P(pclass=1)}$ </font>


#### Let's work on the numerator.

We will use value_counts to get $P(survived=1, pclass=1)$

In [None]:
p__survived_pclass = titanic[['survived', 'pclass']].value_counts(normalize=True)

In [None]:
p__survived_pclass

In [None]:
p__survived_1__pclass_1 = p__survived_pclass.iloc[1]
p__survived_1__pclass_1

#### Now the denominator

In [None]:
p__pclass = titanic.pclass.value_counts(normalize=True)
p__pclass

In [None]:
p__pclass_1 = p__pclass.iloc[1]
p__pclass_1

Putting it all together

In [None]:
p__survived_1__pclass_1 / p__pclass_1


<font size=3 color='red'>So 62% of people in first class survived as compared with 38% of all passengers </font>

### Second method, using raw counts


**Numerator**: how many passengers survived and were in pclass 1

**Denominator**: how many passengers were in pclass 1

In [None]:
titanic[(titanic.pclass==1)]

In [None]:
denominator = titanic[(titanic.pclass==1)].shape[0]
denominator

In [None]:
numerator = titanic[(titanic.pclass==1) & (titanic.survived==1)].shape[0]
numerator

In [None]:
numerator / denominator

### Third method
We can combine logical filtering, accessing columns, and using the mean function to compute this in one line

In [None]:
(titanic[titanic.pclass==1].survived==1).mean()
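
The third method works because `True` counts as 1 and `False` as 0, so the mean of a boolean column is the share of `True` values. To see that the three methods agree, here is a minimal sketch on made-up data (`toy` is hypothetical, and the joint and marginal entries are looked up by label rather than by position).

```python
import pandas as pd

# Hypothetical toy data: 4 passengers in first class, 3 of whom survived
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 2, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 1, 0, 0],
})

# Method 1: joint probability divided by marginal probability
joint = toy[["survived", "pclass"]].value_counts(normalize=True)
marginal = toy.pclass.value_counts(normalize=True)
method_1 = joint.loc[(1, 1)] / marginal.loc[1]

# Method 2: raw counts
in_first = toy[toy.pclass == 1]
method_2 = in_first[in_first.survived == 1].shape[0] / in_first.shape[0]

# Method 3: mean of a boolean column
method_3 = (toy[toy.pclass == 1].survived == 1).mean()

print(method_1, method_2, method_3)
```

In the first method the total number of passengers appears in both the numerator and the denominator and cancels, which is why it reduces to the ratio of counts used in the second method.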

# Lab Task
- Use the assign function to create a column called `is_male` which is True if sex==`male` and False otherwise

- Use the direct method to create a column called `age_over_50` which is True if age > 50 and False otherwise

- Compute the joint distribution for `is_male` and `survived`, i.e. <font size=4 color='blue'>$P(is\_male, survived)$</font>

- Compute the joint distribution for `is_male` and `survived` given that `pclass` is 1, i.e. <font size=4 color='blue'>$P(is\_male, survived|pclass=1)$</font>

- Using the first method for computing conditional probability distributions, compute: <font size =4 color='blue'>$P(survived=1|is\_male=1)$</font>

Find the numerator (which is a probability)

Find the denominator (which is a probability)

- Using the second method for computing conditional probabilities, compute: <font size =4 color='blue'>$P(survived=1|pclass\ne1)$</font>

- Using the third method for computing conditional probability distributions, compute: <font size =4 color='blue'>$P(survived=1|is\_male=1, pclass=1)$</font>

