<a id='Top of document'></a>

# Titanic Disaster: Survivability Parameters

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os
from statsmodels.graphics.mosaicplot import mosaic
from sklearn.datasets import fetch_openml
import statsmodels.formula.api as sm

In [None]:
pd.set_option('display.max_columns', 700)
pd.set_option('display.max_rows', 400)
pd.set_option('display.min_rows', 20)
pd.set_option('display.expand_frame_repr', True)

In [None]:
titanic_df = sns.load_dataset('titanic')

# Capitalize the column names
titanic_df.columns = titanic_df.columns.str.capitalize()

# Select Specific Columns
titanic_df = titanic_df[['Survived', 'Pclass', 'Sex', 'Age', 'Parch', 'Fare', 'Embarked']]

## Problem Statement

* [Dataset Description](https://www.kaggle.com/c/titanic/data)
* Using data analysis methods, determine which metric or combination of metrics best predicts passenger survivability.
* A combination of data visualizations and statistics will be used to determine the most significant predictors of survivability.

[Back to top](#Top of document)
<a id='dataexp'></a>

## Dataset Exploration

In [None]:
# Head of the dataset
titanic_df.head()

In [None]:
# Tail of the dataset
titanic_df.tail()

In [None]:
# Determine which parameters have missing values
titanic_df.info()

* Name, SibSp, Parch, Ticket and Fare will not be used
* Cabin will not be used because fewer than 25% of passengers have cabin data
* Missing Age data will be filled in the [Age](#age) section
* Missing Embarked data will be ignored

In [None]:
# Give gender a numeric value; categories are sorted, so 0 = female, 1 = male
titanic_df['Sex_Numeric'] = (titanic_df['Sex'].astype('category')).cat.codes
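
The numeric codes produced by `cat.codes` follow the sorted order of the categories, not the order in which values appear. A minimal sketch, assuming a made-up four-row column standing in for `titanic_df['Sex']` (the names `sex`, `as_category`, `code_map` and `codes` are hypothetical):

```python
import pandas as pd

# Hypothetical toy column standing in for titanic_df['Sex']
sex = pd.Series(['male', 'female', 'female', 'male'])

as_category = sex.astype('category')

# Categories are stored in sorted order, so 'female' comes before 'male'
code_map = dict(enumerate(as_category.cat.categories))

# Each value is replaced by the position of its category
codes = as_category.cat.codes.tolist()

print(code_map)
print(codes)
```

Printing the mapping next to the codes is a cheap way to confirm which label each number represents before using the column in a model.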

In [None]:
grouped_survived = titanic_df.groupby(['Sex_Numeric', 'Pclass', 'Age', 'Embarked'], observed=False)

In [None]:
grouped_survived['Survived'].describe()

In [None]:
# Create Survival Label Column
titanic_df['Survival'] = titanic_df.Survived.map({0 : 'Died', 1 : 'Survived'})
titanic_df.Survival.head()

In [None]:
# Create Pclass Label Column
titanic_df['Class'] = titanic_df.Pclass.map({1 : '1st Class', 2 : '2nd Class', 3 : '3rd Class'})
titanic_df.Class.head()

In [None]:
# Create Sex Label Column
titanic_df['Gender'] = titanic_df.Sex.map({'female' : 'Female', 'male' : 'Male'})
titanic_df.Gender.head()

In [None]:
# Replace blank strings with NaN and keep the result
titanic_df['Embarked'] = titanic_df['Embarked'].replace(r'^\s*$', np.nan, regex=True)
titanic_df['Embarked'].head()

In [None]:
# Create Port Label Column
titanic_df['Ports'] = titanic_df.Embarked.map({'S' : 'Southampton', 'C' : 'Cherbourg', 'Q' : 'Queenstown', np.nan : 'unknown'})
titanic_df.Ports.head()

## Dataset Plots

In [None]:
# Mosaic Chart
plt.rc('figure', figsize=(17, 5))

mosaic(titanic_df, ['Survival', 'Class', 'Gender'], axes_label=False, title='Survival: Red=Died, Green=Survived')
plt.xlabel('Gender: Male & Female')
plt.ylabel('Passenger Class: 1st, 2nd & 3rd Class')
plt.show()

In [None]:
cols = ['Survival', 'Class', 'Gender', 'Ports']

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 15))

axes = axes.flat

for col, ax in zip(cols, axes):

    titanic_df[col].value_counts().plot(kind='bar', title=col, ax=ax, rot=0, ylabel='Count')

titanic_df['Age'].plot(kind='hist', ax=axes[4], ylabel='Count', xlabel='Age Categories by Decade (years)', ec='k', title='Age')

fig.delaxes(axes[5])

## Age

In [None]:
# Passengers with no age
ageisnull = titanic_df[titanic_df['Age'].isnull()]
ageisnull.head()

In [None]:
print('Total passengers with no age: ', len(ageisnull))

The [Dataset Exploration](#dataexp) section showed that only 714 of 891 records have a valid age, which matches the 177 NaN entries for Age counted here.

In [None]:
# Mean age
titanic_df['Age'].mean()

In [None]:
# Mean age by Sex
(titanic_df.groupby(['Gender']))['Age'].mean()

In [None]:
# Mean age by Pclass and Sex
(titanic_df.groupby(['Class', 'Gender']))['Age'].mean()

In [None]:
# Mean age by Pclass, Survived and Sex
(titanic_df.groupby(['Class', 'Survival', 'Gender']))['Age'].mean()

In [None]:
# General statistics of Age by Class, Survival and Gender
(titanic_df.groupby(['Class', 'Survival', 'Gender']))['Age'].describe()

In [None]:
# Survival count by Sex, Pclass and Age < 20
sex = titanic_df['Gender']
survived = titanic_df['Survival']
pclass = titanic_df['Class']
age_youth = titanic_df['Age'] < 20

pd.crosstab([sex, pclass, age_youth], survived)

A decision is required on how best to deal with the NaN values.
* The NaN values can be ignored
* NaN can be filled in with a value, typically a mean
    * Comparing the counts for the various groups shows that simply using the overall mean would heavily weight one specific age and skew any age-dependent results.
    * For the remainder of this analysis, the NaN values will be replaced with a mean age based upon Pclass, Survived and Sex.
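
The difference between the two fill strategies can be sketched on a toy frame. The data below is made up for illustration, and only two grouping keys (Pclass and Sex) are used to keep it short, whereas the analysis groups by Pclass, Survived and Sex:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male'],
    'Age': [40.0, 50.0, np.nan, 20.0, 30.0, np.nan],
})

# Option 1: one overall mean for every missing age
overall_fill = toy['Age'].fillna(toy['Age'].mean())

# Option 2: the mean age of the passenger's own group
group_fill = (
    toy.groupby(['Pclass', 'Sex'])['Age']
       .transform(lambda s: s.fillna(s.mean()))
)

print(overall_fill.tolist())
print(group_fill.tolist())
```

With the overall mean, both missing ages receive the same value; with the group mean, each missing age takes a value typical of its own group, so the filled values do not pile up at a single age.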

In [None]:
# Maintain Age and create Age_Fill (populate missing ages)
titanic_df['Age_Fill'] = titanic_df['Age']

In [None]:
# Fill each missing age with the mean age of its Pclass / Survived / Sex group
titanic_df['Age_Fill'] = titanic_df['Age_Fill'] \
    .groupby([titanic_df['Pclass'], titanic_df['Survived'], titanic_df['Sex']], observed=False) \
    .transform(lambda x: x.fillna(x.mean()))

The new Age_Fill column keeps the recorded ages and fills each NaN with the mean age of the matching Pclass, Survived and Sex group.

In [None]:
# Example of Age_Fill - #5, 17 & 19
print(titanic_df['Age'].head(20))
print(titanic_df['Age_Fill'].head(20))

## Age Histogram Comparison

In [None]:
# Set up a figure of plots
df1 = titanic_df[titanic_df['Survived'] == 0]['Age']
df2 = titanic_df[titanic_df['Survived'] == 1]['Age']
df3 = titanic_df[titanic_df['Survived'] == 0]['Age_Fill']
df4 = titanic_df[titanic_df['Survived'] == 1]['Age_Fill']

max_age = max(titanic_df['Age_Fill'])

fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(10, 10))

# Use 0 as the lower bound so passengers younger than 1 year are counted
ax1.hist([df1, df2],
         bins=8,
         range=(0, max_age),
         stacked=True)

ax1.legend(('Died', 'Survived'), loc='best')
ax1.set_title('Survivors by Age Group (not filled)')
ax1.set_ylabel('Count')


ax2.hist([df3, df4],
         bins=8,
         range=(0, max_age),
         stacked=True)

ax2.legend(('Died', 'Survived'), loc='best')
ax2.set_title('Survivors by Age Group (filled)')
ax2.set_xlabel('Age')
ax2.set_ylabel('Count')

plt.show()

In [None]:
# Maximum age
titanic_df['Age'].max()

In [None]:
# Create a new column that has all ages by bin category: 0-10:10, 10-20:20, 20-30:30, 30-40:40
# 40-50:50, 50-60:60, 60-70:70, 70-80:80
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
group_names = [10, 20, 30, 40, 50, 60, 70, 80]

titanic_df['Age_Categories'] = pd.cut(titanic_df['Age_Fill'], bins, labels=group_names)

titanic_df[['Age', 'Age_Fill', 'Age_Categories']].head()

In [None]:
titanic_df['Age_Categories'] = pd.to_numeric(titanic_df['Age_Categories'])

An Age_Categories column has been added to the dataframe to simplify certain visualizations and calculations, as there are too many individual ages to easily draw conclusions or see patterns.
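
How `pd.cut` assigns the decade labels can be checked on a handful of values. This is a minimal sketch using the same bin edges and labels as above; the `ages` values are hypothetical. By default the intervals are closed on the right, so an age of exactly 10 lands in the first bin, and an age of exactly 0 would fall outside every bin:

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([0.42, 10.0, 10.5, 35.0, 80.0])

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
group_names = [10, 20, 30, 40, 50, 60, 70, 80]

# Intervals are right-inclusive by default: (0, 10], (10, 20], ...
categories = pd.cut(ages, bins, labels=group_names)

print(categories.tolist())
```

Each label is the upper edge of its interval, which is why the column can later be converted to a numeric type.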

In [None]:
# Survival Count by Age_Categories
titanic_df.groupby('Survival')[['Age_Categories']].count()

[Back to top](#Top of document)
<a id='age_mosaic'></a>

## Age Mosaic

In [None]:
# Mosaic Plot
plt.rc('figure', figsize=(18, 6)) # figure size

mosaic(titanic_df,['Survival', 'Class', 'Age_Categories'], axes_label=False, title='Survival: Red=Died, Green=Survived')
plt.xlabel('Age Categories by Decades (years)')
plt.ylabel('Passenger Class: 1st, 2nd & 3rd Class')
plt.show()

In [None]:
# Mosaic Plot
mosaic(titanic_df,['Survival', 'Gender', 'Age_Categories'], axes_label=False, title='Survival: Red=Died, Green=Survived')
plt.xlabel('Age Categories by Decades (years)')
plt.ylabel('Gender: Male & Female')
plt.show()

[Back to top](#Top of document)
<a id='pclass'></a>

## Pclass

In [None]:
# Survival count by Pclass
pclass_ct = titanic_df.groupby('Class')['Survival'].value_counts().unstack()
pclass_ct

In [None]:
# Survival Rate
titanic_df.groupby('Class')['Survival'].value_counts(normalize=True).unstack()

In [None]:
# Setup a figure of plots
pclass_ct.plot(kind='bar', stacked=True, figsize=(10, 5))

plt.legend(('Died', 'Survived'), loc='best')
plt.title('Survivors by Pclass')
plt.xlabel('Pclass')
plt.ylabel('Count')
plt.xticks(rotation=0)

plt.show()

Pclass is not a strong indicator of survival; however, 3rd Class is a strong indicator of dying.

## Sex

In [None]:
# Survival count by sex
sex_ct = titanic_df.groupby('Gender')['Survival'].value_counts().unstack()
sex_ct

In [None]:
# Survival rate by sex
titanic_df.groupby('Gender')['Survival'].value_counts(normalize=True).unstack()

In [None]:
sex_ct.plot(kind='bar', stacked=True, figsize=(10, 5))

plt.legend(('Died', 'Survived'), loc='best')
plt.title('Survivors by Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.xticks(rotation=0)

plt.show()

Gender is a strong indicator of survivability, with 74% of females surviving and 81% of males dying.

## Embarked

In [None]:
# Survival count by Embarked

embarked_ct = titanic_df.groupby('Ports')['Survival'].value_counts().unstack()
embarked_ct

In [None]:
# Survival rate by embarked
titanic_df.groupby('Ports')['Survival'].value_counts(normalize=True).unstack()

In [None]:
plt.rc('figure', figsize=(10, 5))

embarked_ct.plot(kind='bar', stacked=True, figsize=(10, 5), rot=0)

plt.legend(('Died', 'Survived'), loc='best')
plt.title('Survivors by Embarked')
plt.xlabel('Port of Embarkation')
plt.ylabel('Count')

plt.show()

## Statistics

In [None]:
# Survival count by Gender, Ports, Class and Age_Categories
embarked = titanic_df['Ports']
sex = titanic_df['Gender']
survived = titanic_df['Survival']
pclass = titanic_df['Class']
age_cat = titanic_df['Age_Categories']
pd.crosstab([sex, embarked, pclass], [survived, age_cat])

### OLS Regression Models

In [None]:
# OLS modeling for Survived and Gender
result_1 = sm.ols(formula='Survived ~ Gender', data=titanic_df).fit()
result_1.summary()

In [None]:
# OLS modeling for Survived and Class
result_2 = sm.ols(formula='Survived ~ Class', data=titanic_df).fit()
result_2.summary()

In [None]:
# OLS modeling for Survived and Ports
result_3 = sm.ols(formula='Survived ~ Ports', data=titanic_df).fit()
result_3.summary()

In [None]:
# OLS modeling for Survived and Age_Fill
result_4 = sm.ols(formula='Survived ~ Age_Fill', data=titanic_df).fit()
result_4.summary()

In [None]:
# OLS modeling for Survived and Gender + Class + Age_Fill + Ports
result_5 = sm.ols(formula='Survived ~ Gender + Class + Age_Fill + Ports', data=titanic_df).fit()
result_5.summary()

In [None]:
# OLS modeling for Survived and Gender + Class + Age_Fill
result_6 = sm.ols(formula='Survived ~ Gender + Class + Age_Fill', data=titanic_df).fit()
result_6.summary()

In [None]:
# Dataframe for statistical data
comp_index_4 = 'Gender + Class + Age_Fill + Ports'
comp_index_3 = 'Gender + Class + Age_Fill'

results = [result_1, result_2, result_3, result_4, result_5, result_6]

# Adjusted R-squared penalizes extra predictors and can be negative, so the
# correlation column uses the square root of the plain R-squared instead
statistics_df = pd.DataFrame(
    data=[[r.rsquared_adj, np.sqrt(r.rsquared)] for r in results],
    index=['Gender', 'Class', 'Ports', 'Age_Fill', comp_index_4, comp_index_3],
    columns=['Adjusted R-squared', 'Correlation to Survival']
)

statistics_df

Ordinary least squares (OLS) regression modeling has been used to determine which metric or combination of metrics provides the best prediction of survival. As the coefficient of determination (R-squared) shows, the individual models for Ports and Age_Fill leave a large proportion of the variance in survival unexplained. Gender and the combinations of metrics are better models. The square root of R-squared equals the Pearson correlation coefficient between predicted and actual values; Gender is the single metric with the strongest correlation. However, the combination of metrics, Gender + Class + Age_Fill + Ports, shows the strongest correlation to survival among the models fitted.
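
The relationship between R-squared and the correlation of predicted to actual values can be verified numerically. The following is a sketch on made-up data, using NumPy's least-squares solver rather than statsmodels. It relies on the model having an intercept and on the plain (unadjusted) R-squared; the arrays `x` and `y` are hypothetical:

```python
import numpy as np

# Hypothetical toy data: one numeric predictor and a binary outcome
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

# Ordinary least squares fit with an intercept column
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation between fitted and actual values
corr = np.corrcoef(fitted, y)[0, 1]

print(np.sqrt(r_squared), corr)
```

The two printed numbers agree up to floating-point error. The same identity does not hold for the adjusted R-squared, which is scaled by the number of predictors.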