# Step 1: Define The Problem

Develop an algorithm to predict the survival outcome of passengers on the Titanic.

**Project Summary:** 
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this project, we are tasked to complete the analysis of what sorts of people were likely to survive. In particular, we will apply the tools of machine learning to predict which passengers survived the tragedy.

# Step 2: Gather the Data
The dataset can be found at [Kaggle's Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/data).

# Step 3: Prepare Data
Because we were served the dataset, we don't need to worry about data wrangling, architecture, governance, and extraction. We only need to clean the data.

#### 3.1 Import Standard Libraries

In [None]:
from subprocess import check_output, call
import warnings
import time
import random
import sklearn
from IPython import display
import IPython
import scipy as sp
import numpy as np
import matplotlib
import pandas as pd
import os
import sys
import string

print("Python version:", sys.version)
print("Pandas version:", pd.__version__)
print("Matplotlib version:", matplotlib.__version__)
print("NumPy version:", np.__version__)
print("SciPy version:", sp.__version__)
print("IPython version:", IPython.__version__)
print("scikit-learn version:", sklearn.__version__)

warnings.filterwarnings('ignore')

print('-'*25)


print(os.listdir('data'))


#### 3.2 Import Modeling Libraries

In [None]:
# common models
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier

# common model helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn import feature_selection, model_selection, metrics
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold

# visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix

# configure visualization defaults
%matplotlib inline
plt.style.use('ggplot')
sns.set_style('darkgrid')
mpl.rcParams['figure.figsize'] = 12, 8

SEED = 42

#### 3.3 Meet the Data
Get to know the data.
What are the feature names and datatypes?
What are the target names and datatypes?

Using the [Source Data Dictionary](https://www.kaggle.com/c/titanic/data), we can learn a bit about our Target and Features.

1. The **Survived** variable is our outcome or dependent variable (Target). It is a binary nominal datatype of 1 for survived and 0 for did not survive. 

2. The **PassengerId** and **Ticket** variables are assumed to be random unique identifiers that have no impact on the outcome variable. Thus, they will be excluded from analysis.

3. The **Pclass** variable is an ordinal datatype for the ticket class, a proxy for socio-economic status (SES), representing 1 = upper class, 2 = middle class, and 3 = lower class.

4. The **Name** variable is a nominal datatype. It could be used in feature engineering to derive the gender from title, family size from surname, and SES from titles like doctor or master. Since gender and family size are already available from other variables, we'll use it only to see if title, like master, makes a difference.

5. The **Sex** and **Embarked** variables are nominal datatypes. They will be converted to dummy variables for mathematical calculations.

6. The **Age** and **Fare** variables are continuous quantitative datatypes.

7. The **SibSp** variable represents the number of related siblings/spouses aboard and **Parch** represents the number of related parents/children aboard. Both are discrete quantitative datatypes. They can be used in feature engineering to create family size and is-alone variables.

8. The **Cabin** variable is a nominal datatype that can be used in feature engineering for approximate position on ship when the incident occurred and SES from deck levels. However, since there are many null values, it does not add value and thus is excluded from analysis.
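
Several points in the list above are assumptions (for example, that an identifier column is unique per row, or that a column has too many nulls to be useful), and they can be checked rather than taken on faith. The sketch below uses a small hypothetical frame named `toy`, with made-up values instead of the real Titanic file, to show two such checks.

```python
import numpy as np
import pandas as pd

# hypothetical miniature of the Titanic layout; values are made up for illustration
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Ticket': ['A1', 'A1', 'B7', 'C3', 'C3'],
    'Pclass': [3, 1, 3, 1, 2],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan],
})

# a column is a candidate unique identifier only if every row has a distinct value
is_unique_id = toy.nunique() == len(toy)

# fraction of missing values per column
null_fraction = toy.isnull().mean()

print(is_unique_id)
print(null_fraction)
```

Running the same two expressions on `data_raw` would show whether **Ticket** is in fact unique per passenger and how sparse **Cabin** is before either column is dropped.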

In [None]:
# the data will be broken into 3 parts: Train Set, Test Set, and Validation Set
# the train file will be loaded into the data_raw dataframe, which will later be split into the Train Set and Test Set
# the test file will be loaded into the data_val dataframe, which will be used as the Validation Set

data_raw = pd.read_csv('data/train.csv')

data_val = pd.read_csv('data/test.csv')

# create a copy of data_raw to be transformed
# deep=True by default. A deep copy creates a new dataframe with its own copy of the data and the indices.
# deep=False creates a shallow copy, which shares the data and index with the original instead of copying them
data_1 = data_raw.copy(deep=True)

datasets_to_clean = [data_1, data_val]

# preview the data
print(data_raw.info())
data_raw.sample(10)


In [None]:
def concat_df(train_data, test_data):
    """
    Concatenate the train and test dataframes into one dataframe
    """
    return pd.concat([train_data, test_data], sort=True).reset_index(drop=True)

def divide_df(all_data):
    """
    Divide the dataframe into the train and test dataframes
    """
    return all_data.loc[:890], all_data.loc[891:].drop(['Survived'], axis=1)

df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')
df_all = concat_df(df_train, df_test)

df_train.name = 'Training Set'
df_test.name = 'Test Set'
df_all.name = 'All Set'

dfs = [df_train, df_test]

print('Number of Training Examples: ', df_train.shape[0])
print('Number of Test Examples: ', df_test.shape[0])
print('Training X Shape: ', df_train.shape)
print('Training Y Shape: ', df_train['Survived'].shape[0])
print('Test X Shape: ', df_test.shape)
print('Test Y Shape: ', df_test.shape[0])
print(df_train.columns)
print(df_test.columns)

#### 3.4 Cleaning the Data
In this stage, we will clean our data by 1) correcting aberrant values and outliers, 2) completing missing information, 3) creating new features for analysis, and 4) converting fields to the correct format for calculations and presentation.

1. **Correcting:** Reviewing the data, there does not appear to be any aberrant or non-acceptable data inputs. In addition, we see we may have potential outliers in age and fare. However, since they are reasonable values, we will wait until after we complete our exploratory analysis to determine if we should include or exclude from the dataset. It should be noted, that if they were unreasonable values, for example age = 800 instead of 80, then it's probably a safe decision to fix now. However, we want to use caution when we modify data from its original value, because it may be necessary to create an accurate model.

2. **Completing:** There are null values or missing data in the age, cabin, and embarked fields. Missing values can be a problem because some algorithms cannot handle null values and will fail, while others, like decision trees, can handle them. Thus, it's important to fix them before we start modeling, because we will compare and contrast several models. There are two common methods: either delete the record or populate the missing value using a reasonable input. It is not recommended to delete records, especially a large percentage of them, unless they truly represent incomplete records. Instead, it's best to impute missing values. A basic methodology for qualitative data is to impute using the mode. A basic methodology for quantitative data is to impute using the mean, median, or mean + randomized standard deviation. An intermediate methodology is to apply the basic methodology within specific groups, like the average age by class or embark port by fare and SES. There are more complex methodologies; however, before deploying one, it should be compared to the base model to determine if the complexity truly adds value. For this dataset, age and fare will be imputed with the median, the cabin attribute will be dropped, and embark will be imputed with the mode. Subsequent model iterations may modify this decision to determine if it improves the model’s accuracy.

3. **Creating:**  Feature engineering is when we use existing features to create new features to determine if they provide new signals to predict our outcome. For this dataset, we will create a title feature to determine if it played a role in survival.

4. **Converting:** Last, but certainly not least, we'll deal with formatting. There are no date or currency formats to fix, only datatype formats. Our categorical data was imported as objects, which makes mathematical calculations difficult. For this dataset, we will convert object datatypes to categorical dummy variables.
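
The basic and intermediate imputation methods described under **Completing** can be sketched as follows. This is a minimal illustration on a hypothetical frame named `toy` with made-up values, not the imputation this notebook applies to the real data.

```python
import numpy as np
import pandas as pd

# hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age': [40.0, np.nan, 50.0, 20.0, 22.0, np.nan],
    'Embarked': ['S', 'S', None, 'C', 'S', 'S'],
})

# basic: one global median for a quantitative column
age_global = toy['Age'].fillna(toy['Age'].median())

# intermediate: the median within each ticket class
age_by_class = toy['Age'].fillna(toy.groupby('Pclass')['Age'].transform('median'))

# basic: the mode for a qualitative column
embarked_filled = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

print(age_global.tolist())
print(age_by_class.tolist())
print(embarked_filled.tolist())
```

The group-wise version fills each gap with a value typical of that passenger's class, which may be closer to the truth when age differs by class. Whether that helps the model is something to test against the simpler baseline.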

In [None]:
print('Train Set, columns with null values:\n', data_1.isnull().sum())
print('-'*25)


In [None]:
print('Validation Set, columns with null values:\n', data_val.isnull().sum())
print('-'*25)


In [None]:
data_raw.describe(include='all')


##### 3.4.1 Data Cleaning: Complete or Delete Missing Values

In [None]:
for dataset in datasets_to_clean:
    # complete missing age with median
    # assign the result back instead of using inplace=True on a column selection,
    # which may operate on a temporary copy in newer pandas versions
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())

    # complete embarked with mode
    dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])

    # complete missing fare with median
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())

# delete the following features: Cabin, PassengerID, Ticket
columns_to_drop = ['Cabin', 'PassengerId', 'Ticket']
data_1.drop(columns_to_drop, axis=1, inplace=True)

print(data_1.isnull().sum())
print('-'*25)
print(data_val.isnull().sum())


##### 3.4.2 Create: Feature Engineering
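
The feature-engineering cell in this subsection bins `Fare` with `pd.qcut` and `Age` with `pd.cut`. The two behave differently: `qcut` chooses edges so that each bin holds roughly the same number of rows, while `cut` splits the value range into intervals of equal width. A minimal sketch on made-up values (the series `values` is hypothetical):

```python
import pandas as pd

# hypothetical skewed values, loosely resembling fares with one large outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# equal-frequency bins: each quartile holds the same number of rows
by_quantile = pd.qcut(values, 4)

# equal-width bins: the range 1..100 is cut into 4 intervals of equal length
by_width = pd.cut(values, 4)

# .cat.codes gives the bin index assigned to each value
print(by_quantile.cat.codes.tolist())
print(by_width.cat.codes.tolist())
```

With skewed data, equal-width bins can leave most rows in a single bin, which is one reason to prefer quantile bins for a long-tailed variable such as fare.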

In [None]:
for dataset in datasets_to_clean:
    # discrete variables
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

    # initialize 'IsAlone' feature as 1
    dataset['IsAlone'] = 1

    # if FamilySize is greater than 1, then 'IsAlone' is 0
    # a single .loc call avoids chained assignment, which may silently modify a copy
    dataset.loc[dataset['FamilySize'] > 1, 'IsAlone'] = 0

    # split title from name
    dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

    # continuous variable bins using qcut
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)

    # create age bins
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)


In [None]:
data_1.Title.value_counts()


In [None]:
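# clean up rare titles
title_frequency_threshold = 10
title_counts = data_1['Title'].value_counts()
common_titles = set(title_counts[title_counts >= title_frequency_threshold].index)

# replace rare titles with 'Misc'
# the set of common titles is derived from the train file and applied to both dataframes,
# so data_val ends up with the same title categories as data_1
for dataset in datasets_to_clean:
    dataset['Title'] = dataset['Title'].where(dataset['Title'].isin(common_titles), 'Misc')

data_1['Title'].value_counts()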


In [None]:
# preview the data again
data_1.info()
data_1.sample(10)


In [None]:
data_val.info()
data_val.sample(10)


##### 3.4.3 Convert Formats
We will convert categorical data to integer codes with `LabelEncoder` and to dummy variables with `pd.get_dummies`.
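
To make the difference between the two encodings concrete, here is a minimal sketch on a hypothetical one-column frame named `toy` (made-up values, not the notebook's data).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical toy column; values are made up for illustration
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# label encoding: one integer per category, assigned in sorted order (C=0, Q=1, S=2)
encoder = LabelEncoder()
toy['Embarked_Code'] = encoder.fit_transform(toy['Embarked'])

# dummy (one-hot) encoding: one indicator column per category
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked')

print(toy)
print(dummies)
```

Label codes impose an artificial order on nominal categories, which tree-based models usually tolerate but linear models may misread as a magnitude. Note also that an encoder refit on a different dataframe can assign different integers to the same category if the two dataframes do not contain the same set of values.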

In [None]:
# encode categorical data
label_encoder = LabelEncoder()

for dataset in datasets_to_clean:
    dataset['Sex_Code'] = label_encoder.fit_transform(dataset['Sex'])
    dataset['Embarked_Code'] = label_encoder.fit_transform(dataset['Embarked'])
    dataset['Title_Code'] = label_encoder.fit_transform(dataset['Title'])
    dataset['AgeBin_Code'] = label_encoder.fit_transform(dataset['AgeBin'])
    dataset['FareBin_Code'] = label_encoder.fit_transform(dataset['FareBin'])

data_1.sample(5)


In [None]:
# define y variable AKA Target
target = ['Survived']

# define x variables AKA Features
# pretty name/values for charts
data_1_x = ['Sex', 'Pclass', 'Embarked', 'Title', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone']
# coded for algorithm calculation
data_1_x_calc = ['Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code', 'SibSp', 'Parch', 'Age', 'Fare']

data_1_xy = target + data_1_x
print('Original X Y: ', data_1_xy, '\n')

# define x variables for original w/bin features to remove continuous variables
data_1_x_bin = ['Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
data_1_xy_bin = target + data_1_x_bin
print('Bin X Y: ', data_1_xy_bin, '\n')

# define x and y variables for dummy features original
data_1_dummy = pd.get_dummies(data_1[data_1_x])
data_1_x_dummy = data_1_dummy.columns.tolist()
data_1_xy_dummy = target + data_1_x_dummy
print('Dummy X Y: ', data_1_xy_dummy, '\n')

data_1_dummy.head()


##### 3.4.4 Double Check Cleaned Data

In [None]:
print('Train Set, columns with null values:\n', data_1.isnull().sum())


In [None]:
print('Validation Set, columns with null values:\n', data_val.isnull().sum())


In [None]:
data_1.info()


In [None]:
data_val.info()


In [None]:
# describe the cleaned dataframe rather than the raw one
data_1.describe(include='all')


#### 3.5 Split Data_1 into Training and Testing Sets
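
The split in this subsection calls `train_test_split` without a `test_size`, so scikit-learn's default applies and 25% of the rows are held out for testing. Fixing `random_state` makes the split repeatable. A minimal sketch with hypothetical arrays `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical feature matrix and binary target, 100 rows each
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# no test_size given, so the default 75/25 split applies
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# the same random_state reproduces exactly the same split
X_train_again, _, _, _ = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_train, X_train_again))
```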

In [None]:
train_1_x, test_1_x, train_1_y, test_1_y = model_selection.train_test_split(
    data_1[data_1_x_calc], data_1[target], random_state=0)
train_1_x_bin, test_1_x_bin, train_1_y_bin, test_1_y_bin = model_selection.train_test_split(
    data_1[data_1_x_bin], data_1[target], random_state=0)
train_1_x_dummy, test_1_x_dummy, train_1_y_dummy, test_1_y_dummy = model_selection.train_test_split(
    data_1_dummy[data_1_x_dummy], data_1[target], random_state=0)

print('Data 1 Shape: ', data_1.shape)
print('Train 1 X Shape: ', train_1_x.shape)
print('Test 1 X Shape: ', test_1_x.shape)

train_1_x_bin.head()


# Step 4: Perform Exploratory Data Analysis

#### 4.1 Discrete Variable Correlation by Survival
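
The comparison in this subsection relies on a simple fact: because the target is coded 0/1, its mean within a group is that group's survival rate. Strictly speaking this is a group rate rather than a correlation coefficient. A minimal sketch on a hypothetical frame named `toy` with made-up values:

```python
import pandas as pd

# hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 1, 0],
})

# the mean of a 0/1 target within each group is that group's survival rate
rate_by_sex = toy.groupby('Sex')['Survived'].mean()

# a crosstab shows the underlying counts behind those rates
counts = pd.crosstab(toy['Sex'], toy['Survived'])

print(rate_by_sex)
print(counts)
```

Looking at the counts alongside the rates helps flag groups that are too small for their rate to be trusted.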

In [None]:
# using groupby
for x in data_1_x:
    if data_1[x].dtype != 'float64':
        print('Survival Correlation by: ', x)
        print(data_1[[x, target[0]]].groupby([x], as_index=False).mean())
        print('-'*25, '\n')

# using crosstab
print(pd.crosstab(data_1['Title'], data_1[target[0]]))


In [None]:
# graph distribution of quantitative data
plt.figure(figsize=[16, 12])

plt.subplot(231)
plt.boxplot(x=data_1['Fare'], showmeans=True, meanline=True)
plt.title('Fare Boxplot')
plt.ylabel('Fare ($)')

plt.subplot(232)
plt.boxplot(data_1['Age'], showmeans=True, meanline=True)
plt.title('Age Boxplot')
plt.ylabel('Age (Years)')

plt.subplot(233)
plt.boxplot(data_1['FamilySize'], showmeans=True, meanline=True)
plt.title('Family Size Boxplot')
plt.ylabel('Family Size (#)')

plt.subplot(234)
plt.hist(x=[data_1[data_1['Survived'] == 1]['Fare'], data_1[data_1['Survived'] == 0]['Fare']],
         stacked=True, color=['g', 'r'], label=['Survived', 'Dead'], bins=10)
plt.title('Fare Histogram by Survival')
plt.xlabel('Fare ($)')
plt.ylabel('# of Passengers')
plt.legend()

plt.subplot(235)
plt.hist(x=[data_1[data_1['Survived'] == 1]['Age'], data_1[data_1['Survived'] == 0]['Age']],
         stacked=True, color=['g', 'r'], label=['Survived', 'Dead'], bins=10)
plt.title('Age Histogram by Survival')
plt.xlabel('Age (Years)')
plt.ylabel('# of Passengers')
plt.legend()

plt.subplot(236)
plt.hist(x=[data_1[data_1['Survived'] == 1]['FamilySize'], data_1[data_1['Survived'] == 0]['FamilySize']],
         stacked=True, color=['g', 'r'], label=['Survived', 'Dead'], bins=10)
plt.title('Family Size Histogram by Survival')
plt.xlabel('Family Size (#)')
plt.ylabel('# of Passengers')
plt.legend()


In [None]:
# use seaborn graphics for multi-variable comparison:

# graph individual features by survival
fig, saxis = plt.subplots(2, 3, figsize=(16, 12))

sns.barplot(x='Embarked', y='Survived', data=data_1, ax=saxis[0, 0])
sns.barplot(x='Pclass', y='Survived', order=[1, 2, 3], data=data_1, ax=saxis[0, 1])
sns.barplot(x='IsAlone', y='Survived', order=[1, 0], data=data_1, ax=saxis[0, 2])

sns.pointplot(x='FareBin', y='Survived',  data=data_1, ax=saxis[1, 0])
sns.pointplot(x='AgeBin', y='Survived',  data=data_1, ax=saxis[1, 1])
sns.pointplot(x='FamilySize', y='Survived', data=data_1, ax=saxis[1, 2])


In [None]:
# graph distribution of qualitative data: Pclass
# we know class mattered in survival, so let's graph it against other features
fig, (axis1, axis2, axis3) = plt.subplots(1, 3, figsize=(14, 12))

sns.boxplot(x='Pclass', y='Fare', hue='Survived', data=data_1, ax=axis1)
axis1.set_title('Class vs Fare Survival Comparison')

sns.violinplot(x='Pclass', y='Age', hue='Survived', data=data_1, split=True, ax=axis2)
axis2.set_title('Class vs Age Survival Comparison')

sns.boxplot(x='Pclass', y='FamilySize', hue='Survived', data=data_1, ax=axis3)
axis3.set_title('Class vs Family Size Survival Comparison')


In [None]:
# graph distribution of qualitative data: Sex
# we know sex mattered in survival, so let's graph it against other features

fig, qaxis = plt.subplots(nrows=1, ncols=3, figsize=[14, 12])

# set each title on this figure's own axes (qaxis), not on the axes of the previous figure
sns.barplot(x='Sex', y='Survived', hue='Embarked', data=data_1, ax=qaxis[0])
qaxis[0].set_title('Sex vs Embarked Survival Comparison')

sns.barplot(x='Sex', y='Survived', hue='Pclass', data=data_1, ax=qaxis[1])
qaxis[1].set_title('Sex vs Pclass Survival Comparison')

sns.barplot(x='Sex', y='Survived', hue='IsAlone', data=data_1, ax=qaxis[2])
qaxis[2].set_title('Sex vs IsAlone Survival Comparison')


In [None]:
# more side-by-side comparisons
fig, (maxis1, maxis2) = plt.subplots(nrows=1, ncols=2, figsize=[14, 12])

# how does survival vary with family size and sex?
sns.pointplot(
    x='FamilySize',
    y='Survived',
    hue='Sex',
    data=data_1,
    palette={
        'male': 'blue',
        'female': 'pink'
    },
    markers=['*', 'o'],
    linestyles=['-', '--'],
    ax=maxis1
)

# how does survival vary with class and sex?
sns.pointplot(
    x='Pclass',
    y='Survived',
    hue='Sex',
    data=data_1,
    palette={
        'male': 'blue',
        'female': 'pink'
    },
    markers=['*', 'o'],
    linestyles=['-', '--'],
    ax=maxis2
)


In [None]:
# how does embark port factor with class, sex, and survival outcome?
embarked_facet_grid = sns.FacetGrid(data=data_1, col='Embarked')
embarked_facet_grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', ci=95.0, palette='deep')
embarked_facet_grid.add_legend()


In [None]:
# plot distributions of age of passengers who survived or did not survive
age_facet_grid = sns.FacetGrid(data=data_1, hue='Survived', aspect=4)
age_facet_grid.map(sns.kdeplot, 'Age', shade=True)
age_facet_grid.set(xlim=(0, data_1['Age'].max()))
age_facet_grid.add_legend()


In [None]:
# histogram comparison of sex, class, and age by survival
histogram_facet_grid = sns.FacetGrid(data=data_1, row='Sex', col='Pclass', hue='Survived')
histogram_facet_grid.map(plt.hist, 'Age', alpha=0.75)
histogram_facet_grid.add_legend()


In [None]:
# pair plots of the entire dataset
# 'height' replaces the older 'size' argument (renamed in seaborn 0.9)
pair_plot = sns.pairplot(data=data_1, hue='Survived', palette='deep', height=1.2,
                         diag_kind='kde', diag_kws=dict(shade=True), plot_kws=dict(s=10))
pair_plot.set(xticklabels=[])


In [None]:
# correlation heatmap of dataset
def correlation_heatmap(df):
    _, ax = plt.subplots(figsize=(10, 10))
    colormap = sns.diverging_palette(220, 10, as_cmap=True)

    _ = sns.heatmap(
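        # restrict to numeric columns; text columns such as Name cannot be correlated
        df.select_dtypes(include='number').corr(),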
        cmap=colormap,
        square=True,
        cbar_kws={'shrink': .9},
        ax=ax,
        annot=True,
        linewidths=0.1,
        vmax=1.0,
        linecolor='white',
        annot_kws={'fontsize': 12}
    )

    plt.title('Pearson Correlation of Features', y=1.05, size=15)


correlation_heatmap(data_1)


# Step 5: Model Data

#### 5.1 Machine Learning Algorithm (MLA) Selection and Initialization

In [None]:
MLAs = [
    # ensemble methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    # Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),

    # GLM
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),

    # Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),

    # Nearest Neighbor
    neighbors.KNeighborsClassifier(),

    # SVM
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),
    svm.LinearSVC(),

    # Trees
    tree.DecisionTreeClassifier(),
    tree.ExtraTreeClassifier(),

    # Discriminant Analysis
    discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(),

    # XGBoost
    XGBClassifier()
]


In [None]:
# split dataset in cross-validation
# run model 10 times with 60/30 split intentionally leaving out 10%
cross_validation_split = model_selection.ShuffleSplit(n_splits=10, test_size=0.3, train_size=0.6, random_state=0)

# create table to compare MLA metrics
MLA_columns = ['MLA Name', 'MLA Parameters', 'MLA Train Accuracy Mean',
               'MLA Test Accuracy Mean', 'MLA Test Accuracy 3*STD', 'MLA Time']
MLA_compare = pd.DataFrame(columns=MLA_columns)

# create table to compare MLA predictions
# copy so that adding prediction columns does not write into a slice of data_1
MLA_predict = data_1[target].copy()

# index through MLAs and save performance to table
row_index = 0
for alg in MLAs:
    # set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
    MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())

    # score model with cross validation
    cross_validation_results = model_selection.cross_validate(
        alg,
        data_1[data_1_x_bin],
        data_1[target],
        cv=cross_validation_split,
        return_train_score=True
    )

    MLA_compare.loc[row_index, 'MLA Time'] = cross_validation_results['fit_time'].mean()
    MLA_compare.loc[row_index, 'MLA Train Accuracy Mean'] = cross_validation_results['train_score'].mean()
    MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cross_validation_results['test_score'].mean()
    # if this is an unbiased random sample and the scores are roughly normally distributed,
    # then +/-3 standard deviations (std) from the mean should capture about 99.7% of the subsets,
    # so this tells us the worst that can happen if we're really unlucky
    MLA_compare.loc[row_index, 'MLA Test Accuracy 3*STD'] = cross_validation_results['test_score'].std() * 3

    # save MLA predictions
    alg.fit(data_1[data_1_x_bin], data_1[target])
    MLA_predict[MLA_name] = alg.predict(data_1[data_1_x_bin])

    row_index += 1

# print and sort table
MLA_compare.sort_values(by=['MLA Test Accuracy Mean'], ascending=False, inplace=True)
MLA_compare
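
The comments in the cell above describe a 60/30 split that leaves 10% of the rows unused in each iteration, and a band of three standard deviations around the mean test accuracy. A minimal sketch of both ideas on synthetic data follows. The arrays `X` and `y` and the choice of `DecisionTreeClassifier` are assumptions for illustration only.

```python
import numpy as np
from sklearn import model_selection, tree

# synthetic data: 100 rows, with a target that depends on the first feature
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] > 0.5).astype(int)

splitter = model_selection.ShuffleSplit(n_splits=10, test_size=0.3, train_size=0.6, random_state=0)

# each iteration uses 60 rows to train and 30 to test; the remaining 10 are left out
split_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in splitter.split(X)]

results = model_selection.cross_validate(
    tree.DecisionTreeClassifier(random_state=0), X, y, cv=splitter, return_train_score=True)

# mean test accuracy and the 3*std band used in the comparison table
mean_acc = results['test_score'].mean()
band = results['test_score'].std() * 3

print(split_sizes[0])
print('test accuracy: %.3f +/- %.3f' % (mean_acc, band))
```

A model with a slightly lower mean but a much narrower band may be the safer choice, since its worst-case accuracy across resamples is more predictable.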


In [None]:
sns.barplot(x='MLA Test Accuracy Mean', y='MLA Name', data=MLA_compare, color='m')
plt.title('Machine Learning Algorithm Accuracy Score \n')
plt.xlabel('Accuracy Score (fraction correct)')
plt.ylabel('Algorithm')
plt.tight_layout()


#### 5.2 Evaluate Model Performance