## <font color=red> Who survived when the Titanic sank? An example for exploratory data analysis</font>

More information on the sinking of the Titanic is available in [Encyclopedia Titanica](http://www.encyclopedia-titanica.org) or on Wikipedia.

In [14]:
# Import code libraries or "modules" in Python lingo

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import importlib
import sys
# from IPython.core.debugger import Tracer
from datascience import *
sns.set_style("whitegrid")
%matplotlib inline

In [15]:
# Import some new methods for class Table
# Add IDS directory to search path
course_dir = "/Users/wxs/Dropbox/Introduction-to-Data Science\
/Git-repositories/STAT180-Winter-2018"
computing_dir = course_dir + "/A0-Computing"

if computing_dir not in sys.path:
    sys.path.append(computing_dir)

from datascience_extensions import *

In [16]:
# Reload the extensions after we make a change
# Importing it again does not work - a module is imported only once
module_name = "datascience_extensions"
importlib.reload(sys.modules[module_name])

<module 'datascience_extensions' from '/Users/wxs/Dropbox/Introduction-to-Data Science/Git-repositories/STAT180-Winter-2018/A0-Computing/datascience_extensions.py'>
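
The difference between importing again and reloading can be seen in a minimal, self-contained sketch. The module name `demo_extensions` and its `VERSION` attribute are made up for illustration; they stand in for any module that is edited while the notebook is running.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Avoid cached bytecode so this tiny demo always reads the edited source
sys.dont_write_bytecode = True

# Write a throwaway module (hypothetical name) into a temporary directory
tmp_dir = tempfile.mkdtemp()
module_path = Path(tmp_dir) / "demo_extensions.py"
module_path.write_text("VERSION = 1\n")

sys.path.append(tmp_dir)
importlib.invalidate_caches()
import demo_extensions

# Simulate editing the module file after it has been imported
module_path.write_text("VERSION = 2  # edited\n")

# A second import is a no-op: the module is already in sys.modules
import demo_extensions
version_after_import = demo_extensions.VERSION

# reload re-executes the module source in place
importlib.reload(demo_extensions)
version_after_reload = demo_extensions.VERSION

print(version_after_import, version_after_reload)
```

Note that names pulled in earlier with `from module import *` keep pointing at the old objects until that import statement is run again.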

In [17]:
# Read the data into a table titanic and display the first 3 rows
# (row indices start at 0, so the first 3 rows are rows 0, 1, 2)
data_dir = "/Users/wxs/Dropbox/STAT180/Git-repositories/IDS/Data/"
titanic_filename = "titanic3.csv"
titanic_pathname = data_dir + titanic_filename
titanic = Table.read_table(titanic_pathname)
titanic.take([0, 1, 2])

FileNotFoundError: File b'/Users/wxs/Dropbox/STAT180/Git-repositories/IDS/Data/titanic3.csv' does not exist

#### Rows of the table correspond to passengers. Columns correspond to properties (or "features") of passengers:

    pclass     cabin class
    survived   0 = "no", 1 = "yes"
    name
    sex
    age
    sibsp      number of siblings or spouses aboard
    parch      number of parents/children aboard
    ticket     ticket number
    fare       passenger fare (in British pounds)
    cabin      cabin number
    embarked   port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
    boat       lifeboat number
    body       body identification number
    home.dest  home/destination

#### We often refer to the rows as "cases" and to the columns as "variables"
#### Some entries in the table are "nan", indicating that the value is missing

In [None]:
# Let's see how big the table is
titanic.shape()

In [None]:
# So the table "titanic" has 1,310 rows, numbered 0..1309
# The spreadsheet "titanic3.csv" also has 1,310 rows, but row 1 contains
# the column labels. So we should only have data on 1,309 passengers.
# Something is amiss. Let's look at the first row of the table

titanic.take(0)

In [None]:
# That's correct. Look at the last row

titanic.take(1309)

In [None]:
# So Table.read_table has read an empty row. Let's get rid of that

titanic = titanic.take(np.arange(1309))
titanic.take(1308)
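
For comparison, here is a minimal pandas sketch of the same cleanup. It assumes the stray row comes from a trailing line in the file that contains only separators; the three-column CSV text is made up for illustration and is not the real `titanic3.csv`.

```python
import io

import pandas as pd

# Made-up CSV with a trailing line that contains only commas
csv_text = "pclass,survived,sex\n1,1,female\n3,0,male\n,,\n"

toy = pd.read_csv(io.StringIO(csv_text))
n_before = len(toy)

# Drop rows in which every entry is missing
toy = toy.dropna(how="all")
n_after = len(toy)

print(n_before, n_after)
```

Dropping by condition rather than by row position has the advantage that the code keeps working if the file is later fixed or grows.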

In [None]:
# Complication:
# There are missing values (nan) in the table. 
# We have to keep that in mind when we do our analysis.
# 
# Let's count the missing values for each variable
nan_count = titanic.count_nan()
nan_count
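
`count_nan` comes from the course's extension module. If that module is not at hand, a rough pandas equivalent is sketched below; the small table is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up table with some missing entries
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "sex": ["female", "male", "female", "male"],
    "cabin": ["B5", None, None, None],
})

# isna() marks missing entries; summing the booleans counts them per column
toy_nan_count = toy.isna().sum()
print(toy_nan_count)
```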

### Let's first get a feeling for the composition of the passengers: age, sex, port of embarkation. We could also look at family relationships using the same tools, but we'll skip that.

In [None]:
bla = titanic.group("sex")
bla.barh("sex")
bla
# About 2/3 of the passengers were male
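
Statements like "about 2/3" are easier to check when the counts are turned into proportions. A minimal pandas sketch, using a made-up column rather than the Titanic data:

```python
import pandas as pd

# Made-up sample of six passengers
toy = pd.DataFrame({"sex": ["male", "male", "female", "male", "female", "male"]})

# Counts per category, then the same as shares of the total
counts = toy["sex"].value_counts()
shares = toy["sex"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```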

In [None]:
bla = titanic.group("pclass")
bla.barh("pclass")
bla
# About half the passengers were in 3rd class

In [None]:
bla = titanic.group("embarked")
bla.barh("embarked")
bla
# Most passengers embarked in Southampton
# Note that there are two missing values


In [None]:
# Let's see if there is a difference in the class and sex distribution
# between the ports of embarkation
titanic_sub = titanic.select("pclass", "embarked", "sex").take_complete_rows()
titanic_sub.pivot("pclass", "embarked").barh("embarked")
titanic_sub.pivot("sex", "embarked").barh("embarked")
# Drastic differences between embarkation ports
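
Because the ports differ a lot in passenger numbers, raw counts can be hard to compare across ports. One option is to normalize each row so that it shows the class distribution within a port. The sketch below uses pandas `crosstab` on a made-up table; it is not the course's `pivot` method.

```python
import pandas as pd

# Made-up table: port of embarkation and cabin class for eight passengers
toy = pd.DataFrame({
    "embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "pclass": [1, 2, 3, 3, 1, 1, 3, 3],
})

# Raw counts: one row per port, one column per class
counts = pd.crosstab(toy["embarked"], toy["pclass"])

# Row-normalized: each row sums to 1, so ports of different size are comparable
shares = pd.crosstab(toy["embarked"], toy["pclass"], normalize="index")

print(counts)
print(shares)
```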

In [None]:
age = titanic.select("age").take_complete_rows()
age.hist(bins=np.arange(0, 80, 5))
age.where("age", are.below(30)).num_rows
# Almost half the passengers are younger than 30

In [None]:
np.mean(age.column("age"))
# Mean age is just below 30

In [None]:
np.median(age.column("age"))
# Median age is 28
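
The cells above remove incomplete rows before computing the mean and median. An alternative is to keep the missing values and use NumPy's nan-aware functions. The ages below are made up for illustration; like the real ages, they have a mean above the median because a few old passengers pull the mean up.

```python
import numpy as np

# Made-up ages with one missing value
ages = np.array([2.0, 19.0, 24.0, 28.0, np.nan, 31.0, 45.0, 71.0])

# The plain mean is nan as soon as one entry is missing
plain_mean = np.mean(ages)

# The nan-aware versions skip missing entries
mean_age = np.nanmean(ages)
median_age = np.nanmedian(ages)

print(plain_mean, mean_age, median_age)
```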

In [None]:
# Maybe the young passengers are poor emigrants who are
# predominantly in 3rd class
age_pclass = titanic.select("age", "pclass").take_complete_rows()
# age_pclass.shape()
age_pclass.group("pclass", np.median).barh("pclass")
# Yes, the 3rd class passengers are on average much younger
# than the 1st class passengers

In [None]:
# Let's see if there is a difference in the class distribution
# between the ports of embarkation
titanic_sub = titanic.select("pclass", "embarked", "sex").take_complete_rows()
titanic_sub.pivot("pclass", "embarked").barh("embarked")
# Dramatic difference.

In [None]:
# How about gender?
titanic_sub.pivot("sex", "embarked").barh("embarked")
# Conjecture: The passengers embarking at Queenstown
# tended to be young emigrant families?

In [None]:
bla = titanic.group("survived")
bla.barh("survived")
bla
# About 40% of passengers survived

## Let's now get to the most interesting question: How did the likelihood of survival "depend" on sex and pclass?

In [None]:
# Let's now get to the most interesting question: How did the likelihood
# of survival "depend" on sex and pclass?
titanic.pivot("sex", "pclass").barh("pclass")
titanic.pivot("sex", "pclass")
# The ratio of females to males is a decreasing function of pclass

In [None]:
titanic.pivot("sex", "pclass", "survived", collect=np.mean).barh("pclass")
titanic.pivot("sex", "pclass", "survived", collect=np.mean)
# Males in 1st class were twice as likely to survive as males in 2nd
# and 3rd class
# Females were roughly equally likely to survive in 1st and 2nd class
# and twice as likely to survive as females in 3rd class
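
Using `np.mean` as the collecting function works because `survived` is coded 0/1, so the mean within a group is the fraction of survivors in that group. A rough pandas equivalent is sketched below on a made-up table of eight passengers.

```python
import pandas as pd

# Made-up table; survived is coded 0 = "no", 1 = "yes"
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "female",
            "male", "male", "male", "male"],
    "pclass": [1, 1, 3, 3, 1, 1, 3, 3],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column within each (pclass, sex) cell = survival rate
rates = toy.pivot_table(index="pclass", columns="sex",
                        values="survived", aggfunc="mean")
print(rates)
```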

In [None]:
# Of course we could also state the result in terms of death rather than 
# survival
died = 1 - titanic.column("survived")
titanic_extended = titanic.with_columns("died", died)
titanic_extended.pivot("sex", "pclass", "died", collect=np.mean)
titanic_extended.pivot("sex", "pclass", "died", collect=np.mean)\
.barh("pclass")
# Females in 2nd class were three times as likely to die as females in 1st
# class. Females in 3rd class were 15 times as likely to die as females
# in 1st class
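
Survival rates that look similar can correspond to death rates that differ by a large factor, because the ratio of two small numbers is sensitive to small changes. The sketch below shows the arithmetic with made-up rates; the helper `rate_ratio` is hypothetical and the numbers are not taken from the Titanic table.

```python
def rate_ratio(rate_a, rate_b):
    """Return how many times larger rate_a is than rate_b."""
    return rate_a / rate_b

# Made-up survival rates for two groups
surv_first, surv_second = 0.95, 0.85

# In terms of survival the groups look almost the same ...
survival_ratio = rate_ratio(surv_first, surv_second)

# ... but in terms of death the second group fares three times worse
death_ratio = rate_ratio(1 - surv_second, 1 - surv_first)

print(round(survival_ratio, 2), round(death_ratio, 2))
```

Which of the two ratios to report depends on the question being asked; both describe the same table.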