# SIT307 T1 2021
# Assignment 2 - Data Mining Project
***Group 5*** - Rhys McMillan (218335964), Brenton Fleming (217603898), Neb Miletic (218489118), Sean Pain (218137385), Oliver Bennett (218143462), Muhammad Sibtain (219345654), Asim Arshad (219337467)  
  
***Data*** - Titanic: Machine Learning From Disaster (https://www.kaggle.com/c/titanic/data)

## Table of Contents

* [1. Preparation](#1)
    * [1.1 Import Relevant Libraries](#1_1)
    * [1.2 Load Data from File](#1_2)
* [2. Data Overview](#2)
    * [2.1 Data Dictionary](#2_1)
    * [2.2 Properties](#2_2)
    * [2.3 Features](#2_3)
    * [2.4 Null Values](#2_4)
    * [2.5 Statistical Distribution](#2_5)
* [3. Feature Engineering](#3)
    * [3.1 Title](#3_1)
    * [3.2 Relatives](#3_2)
    * [3.3 Sex](#3_3)
    * [3.4 UniqueTicket](#3_4)
    * [3.5 Summary](#3_5)
* [4. Data Cleaning](#4)
    * [4.1 Discrete Data](#4_1)
        * [4.1.1 Survived](#4_1_1)
        * [4.1.2 Passenger Class (Pclass)](#4_1_2)
        * [4.1.3 Sex](#4_1_3)
        * [4.1.4 Siblings / Spouse (SibSp)](#4_1_4)
        * [4.1.5 Parents / Children (Parch)](#4_1_5)
        * [4.1.6 Relatives](#4_1_6)
        * [4.1.7 Alone](#4_1_7)
        * [4.1.8 UniqueTicket](#4_1_8)
    * [4.2 Continuous Data](#4_2)
        * [4.2.1 Age](#4_2_1)
        * [4.2.2 Fare](#4_2_2)
    * [4.3 Nominal Data](#4_3)
        * [4.3.1 Cabin](#4_3_1)
        * [4.3.2 Embarked](#4_3_2)
        * [4.3.3 Title](#4_3_3)
* [5. Feature Selection](#5)
    * [5.1 Numerical Features](#5_1)
    * [5.2 Categorical Features](#5_2)
        * [5.2.1 Embarked](#5_2_1)
        * [5.2.2 Title](#5_2_2)
* [6. Exploratory Data Analysis (EDA)](#6)
    * [6.1 Passenger Class (PClass)](#6_1)
    * [6.2 Sex](#6_2)
    * [6.3 Age](#6_3)
    * [6.4 Fare](#6_4)
    * [6.5 Alone](#6_5)
    * [6.6 UniqueTicket](#6_6)
    * [6.7 Embarked](#6_7)
    * [6.8 Title](#6_8)

# 1. Preparation <a class="anchor" id="1"></a>

## 1.1 Import Relevant Libraries <a class="anchor" id="1_1"></a>

In [None]:
# data analysis
import pandas as pd
import numpy as np
from scipy import stats

# visualisation
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## 1.2 Load Data from File <a class="anchor" id="1_2"></a>
The source data (https://www.kaggle.com/c/titanic/data) contains two datasets - train.csv and test.csv.  
Train.csv is intended for model training and contains the entire feature set.  
Test.csv is intended for testing a trained model and does not contain the outcome ('Survived').

For our analysis, we will be using train.csv only.

In [None]:
# load train.csv to pandas data frame, using 'PassengerId' as the index
master_df = pd.read_csv('../input/titanic/train.csv' , index_col='PassengerId')

# Create a working copy of the data frame for manipulation. The master will serve as the baseline.
working_df = master_df.copy()

# Preview the data
working_df.head()

# 2. Data Overview <a class="anchor" id="2"></a>
We begin our analysis by taking a cursory look at the structure and properties of our data set. This will give some context to the data and help guide our exploration.
## 2.1 Data Dictionary <a class="anchor" id="2_1"></a>
The following data dictionary was provided alongside the dataset:
<table>
    <tr>
        <th>Variable</th>
        <th>Definition</th>
        <th>Key</th>
    </tr>
    <tr>
        <td>pclass</td>
        <td>Ticket class</td>
        <td>1 = 1st, 2 = 2nd, 3 = 3rd</td>
    </tr>
    <tr>
        <td>sex</td>
        <td>Sex</td>
        <td></td>
    </tr>
    <tr>
        <td>Age</td>
        <td>Age in years</td>
        <td></td>
    </tr>
    <tr>
        <td>sibsp</td>
        <td># of siblings / spouses aboard the Titanic</td>
        <td></td>
    </tr>
    <tr>
        <td>parch</td>
        <td># of parents / children aboard the Titanic</td>
        <td></td>
    </tr>
    <tr>
        <td>ticket</td>
        <td>Ticket number</td>
        <td></td>
    </tr>
    <tr>
        <td>fare</td>
        <td>Passenger fare</td>
        <td></td>
    </tr>
    <tr>
        <td>cabin</td>
        <td>Cabin number</td>
        <td></td>
    </tr>
    <tr>
        <td>embarked</td>
        <td>Port of Embarkation</td>
        <td>C = Cherbourg, Q = Queenstown, S = Southampton</td>
    </tr>
</table>

## 2.2 Properties <a class="anchor" id="2_2"></a>
Examine the basic shape and properties of the dataset.

In [None]:
# print shape of the dataset
print("There are {} rows and {} columns in the dataset.".format(master_df.shape[0] , master_df.shape[1]))

In [None]:
# print basic summary of the dataset
print(master_df.info())

## 2.3 Features <a class="anchor" id="2_3"></a>
The dataset contains 3 data types - float64, int64 and object. We will initially assume float64 represents continuous data, int64 represents discrete data and object represents categorical data. The data frame data types do not differentiate between nominal and ordinal data. Manual inspection determined all categorical data to be nominal. Our feature set can therefore be classified as:

 - Discrete - Survived, Pclass, SibSp, Parch
 - Continuous - Age, Fare
 - Nominal - Name, Sex, Ticket, Cabin, Embarked
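
The dtype-based grouping described above can be reproduced programmatically. The following is a minimal sketch on a made-up frame (`toy_df` and its values are illustrative, not rows from train.csv), using the dtype checks in `pandas.api.types`:

```python
import pandas as pd
from pandas.api import types as ptypes

# hypothetical three-row sample with the same kinds of columns as train.csv
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Age": [22.0, 38.0, None],
    "Fare": [7.25, 71.28, 13.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# group column names by the dtype assumption used in this section
discrete = [c for c in toy_df.columns if ptypes.is_integer_dtype(toy_df[c])]
continuous = [c for c in toy_df.columns if ptypes.is_float_dtype(toy_df[c])]
categorical = [c for c in toy_df.columns if not ptypes.is_numeric_dtype(toy_df[c])]

print("Discrete:", discrete)
print("Continuous:", continuous)
print("Categorical:", categorical)
```

This only automates the first pass. Deciding whether a categorical column is nominal or ordinal still requires manual inspection.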

## 2.4 Null Values <a class="anchor" id="2_4"></a>

In [None]:
# get count of missing values
master_df.isnull().sum()

Only 3 features contain null values - Age, Cabin, and Embarked.
- Age null values constitute a moderate portion of the data (177 of 891). This is too many rows to discard, so the values will be imputed from related features.
- Cabin null values constitute a significant portion of the data (687 of 891). Any imputation would likely introduce significant bias. Consider dropping this feature.
- Embarked null values constitute only a very minor portion of the data (2 of 891). Imputation of this feature will have minimal impact on correlation. Any simple imputation method will suffice.
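
Comparing missing values as a share of all rows makes the "significant" versus "minor" distinction explicit. A minimal sketch, assuming a made-up frame (`toy_df` is illustrative only):

```python
import numpy as np
import pandas as pd

# hypothetical sample: Cabin mostly missing, Age partly missing, Embarked complete
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull().mean() gives the fraction of missing values per column
missing_share = toy_df.isnull().mean().mul(100).round(1)
print(missing_share.sort_values(ascending=False))
```

Applied to `master_df`, the same expression would report the percentages behind the counts listed above.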


## 2.5 Statistical Distribution <a class="anchor" id="2_5"></a>

Statistical distribution of numerical features:

In [None]:
# print statistical distribution of float and integer data types
master_df.describe()

Statistical distribution of categorical features:

In [None]:
# print statistical distribution of object types
master_df.describe(include=['O'])

# 3. Feature Engineering <a class="anchor" id="3"></a>

## 3.1 Title <a class="anchor" id="3_1"></a>
All name values are unique in the data, so the feature cannot be meaningfully correlated with anything as is. However, each name contains both the name and the title of the passenger, so we can extract the title as a new feature.

In [None]:
# create a new feature to extract title names from the Name column
working_df['Title'] = working_df.Name.apply(lambda name: name.split(',')[1].split('.')[0].strip())

# get unique titles
unique_titles = working_df['Title'].unique()
print("Unique Titles:", len(unique_titles))
print(unique_titles)
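
As an alternative to chained `split` calls, the title can be captured with a regular expression. This is a sketch rather than the method used in this notebook. It assumes names follow the "Surname, Title. Given names" pattern, and the `names` series below is a small illustrative sample:

```python
import pandas as pd

# sample names in the "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# capture the text between the first comma and the following full stop
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

A name that does not match the pattern yields NaN here, whereas the `split` version would raise an IndexError.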

We can further refine this list by grouping similar titles.

In [None]:
# normalize titles into dictionary
title_dictionary = {
    "Capt":       "Officer",
    "Col":        "Officer",
    "Major":      "Officer",
    "Jonkheer":   "Royalty",
    "Don":        "Royalty",
    "Sir" :       "Royalty",
    "Dr":         "Officer",
    "Rev":        "Officer",
    "the Countess":"Royalty",
    "Dona":       "Royalty",
    "Mme":        "Mrs",
    "Mlle":       "Miss",
    "Ms":         "Mrs",
    "Mr" :        "Mr",
    "Mrs" :       "Mrs",
    "Miss" :      "Miss",
    "Master" :    "Master",
    "Lady" :      "Royalty"
}

# map normalized title to Title feature vector
working_df.Title = working_df.Title.map(title_dictionary)

# print value counts
print(working_df.Title.value_counts())
# TODO: transform titles to numeric values
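
`Series.map` returns NaN for any value missing from the dictionary, so an unexpected title would silently become a null. A minimal sketch of a check for this, using a shortened, hypothetical mapping and a made-up "Sgt" title:

```python
import pandas as pd

# hypothetical titles; "Sgt" simulates a title missing from the dictionary
titles = pd.Series(["Mr", "Mlle", "Sgt", "Miss"])
mapping = {"Mr": "Mr", "Mlle": "Miss", "Miss": "Miss"}  # shortened for illustration

mapped = titles.map(mapping)

# collect the original titles that had no dictionary entry
unmapped = titles[mapped.isna()].unique().tolist()
print("Unmapped titles:", unmapped)
```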

The name feature no longer holds any relevance and can be dropped from the dataset.

In [None]:
working_df = working_df.drop(columns='Name')

## 3.2 Relatives <a class="anchor" id="3_2"></a>
SibSp (number of siblings or spouse) and Parch (number of parents or children) both relate to the number of relatives on board along with the passenger. These values can be combined as a single 'Relatives' feature.

In [None]:
# create a new feature to calculate number of relatives
working_df['Relatives'] = working_df['SibSp'] + working_df['Parch']

# print value counts
print(working_df.Relatives.value_counts())

As a large number of passengers were travelling alone (537 of 891), we can also represent this as a separate 'Alone' feature.

In [None]:
# create new feature to show if passenger was alone or with family
working_df['Alone'] = 0
working_df.loc[working_df['Relatives'] == 0, 'Alone'] = 1

# print value counts
print(working_df.Alone.value_counts())
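
The same flag can be derived in a single vectorised step. A minimal sketch on made-up values (`toy_df` is illustrative only):

```python
import pandas as pd

# hypothetical sibling/spouse and parent/child counts
toy_df = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# total relatives on board
toy_df["Relatives"] = toy_df["SibSp"] + toy_df["Parch"]

# the boolean comparison is cast to 1 (alone) or 0 (with family)
toy_df["Alone"] = (toy_df["Relatives"] == 0).astype(int)
print(toy_df)
```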

## 3.3 Sex <a class="anchor" id="3_3"></a>
Sex contains two possible values, male and female. We can more easily work with this information by converting it to numeric data where 0 = male and 1 = female.

In [None]:
# convert sex to numeric values
working_df['Sex'] = working_df['Sex'].map({"male": 0, "female": 1})

# print value counts
print(working_df.Sex.value_counts())

## 3.4 UniqueTicket <a class="anchor" id="3_4"></a>
Ticket contains 681 unique values. As is, strong correlation with any other feature is highly unlikely. We will add a new feature 'UniqueTicket' to specify if a ticket number is unique in the dataset or a duplicate. The assumption is that a duplicate ticket number permitted more than 1 person to board.

In [None]:
# flag tickets that appear exactly once in the dataset (1 = unique, 0 = duplicate)
# duplicated(keep=False) marks every occurrence of a repeated ticket number
working_df['UniqueTicket'] = (~working_df['Ticket'].duplicated(keep=False)).astype(int)

# print value counts
print(working_df.UniqueTicket.value_counts())
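
One way to sanity check the flag is to compute how many passengers share each ticket and compare against a size of 1. A minimal sketch with made-up ticket numbers (`toy_df` is illustrative only):

```python
import pandas as pd

# hypothetical ticket numbers: two shared tickets and two unique ones
toy_df = pd.DataFrame({
    "Ticket": ["A/5 21171", "113803", "113803", "PC 17599", "347742", "347742"]
})

# number of passengers sharing each row's ticket, aligned to the original rows
group_size = toy_df.groupby("Ticket")["Ticket"].transform("size")

# a group of size 1 means the ticket is unique
toy_df["UniqueTicket"] = (group_size == 1).astype(int)
print(toy_df)
```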

Ticket can be dropped in favour of UniqueTicket.

In [None]:
# drop ticket column
working_df = working_df.drop(columns='Ticket')

## 3.5 Summary <a class="anchor" id="3_5"></a>
After feature engineering, our feature set has expanded to:
- Discrete - Survived, Pclass, Sex, SibSp, Parch, Relatives, Alone, UniqueTicket
- Continuous - Age, Fare
- Nominal - Cabin, Embarked, Title

# 4. Data Cleaning <a class="anchor" id="4"></a>
Individually inspect each feature to determine unusual or missing values. Clean and impute values as required.

## 4.1 Discrete Data <a class="anchor" id="4_1"></a>
Data cleaning requirements for discrete data can be determined by:
- Checking for any missing values.
- Inspecting unique values to determine if any do not make sense.

### 4.1.1 Survived <a class="anchor" id="4_1_1"></a>

In [None]:
# print null value count
print("Null values:", working_df['Survived'].isnull().sum())

# print list of unique values to check for anything unusual
print("Unique values:", working_df['Survived'].unique())

Survived has no unusual or missing values. No data cleaning required.

### 4.1.2 Passenger Class (Pclass) <a class="anchor" id="4_1_2"></a>

In [None]:
# print null value count
print("Null values:", working_df['Pclass'].isnull().sum())

# print list of unique values to check for anything unusual
print("Unique values:", working_df['Pclass'].unique())

Pclass has no unusual or missing values. No data cleaning required.

### 4.1.3 Sex <a class="anchor" id="4_1_3"></a>

In [None]:
# print null value count
print("Null values:", working_df['Sex'].isnull().sum())

# print list of unique values to check for anything unusual
print("Unique values:", working_df['Sex'].unique())

Sex has no unusual or missing values. No data cleaning required.

### 4.1.4 Siblings / Spouse (SibSp) <a class="anchor" id="4_1_4"></a>

In [None]:
# print null value count
print("Null values:", working_df['SibSp'].isnull().sum())

# print list of unique values to check for anything unusual
print("Unique values:", working_df['SibSp'].unique())

SibSp has no unusual or missing values. No data cleaning required.

### 4.1.5 Parents / Children (Parch) <a class="anchor" id="4_1_5"></a>

In [None]:
# print null value count
print("Null values:", working_df['Parch'].isnull().sum())

# print list of unique values to check for anything unusual
print("Unique values:", working_df['Parch'].unique())

Parch has no unusual or missing values. No data cleaning required.

### 4.1.6 Relatives  <a class="anchor" id="4_1_6"></a>

In [None]:
# print null value count
print("Null values:", working_df['Relatives'].isnull().sum())

# print list of unique values to check for anything unusual
print("Unique values:", working_df['Relatives'].unique())

Relatives has no unusual or missing values. No data cleaning required.

### 4.1.7 Alone  <a class="anchor" id="4_1_7"></a>

In [None]:
# print null value count
print("Null values:", working_df['Alone'].isnull().sum())

# print list of unique values to check for anything unusual
print("Unique values:", working_df['Alone'].unique())

Alone has no unusual or missing values. No data cleaning required.

### 4.1.8 UniqueTicket  <a class="anchor" id="4_1_8"></a>

In [None]:
# print null value count
print("Null values:", working_df['UniqueTicket'].isnull().sum())

# print list of unique values to check for anything unusual
print("Unique values:", working_df['UniqueTicket'].unique())

UniqueTicket has no unusual or missing values. No data cleaning required.

## 4.2 Continuous Data <a class="anchor" id="4_2"></a>
Data cleaning requirements for continuous data can be determined by:
- Checking for any missing values.
- Identifying any outliers using the z-score (threshold = ±3).
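
The z-score of a value is its distance from the mean in units of standard deviation, z = (x - mean) / std. The check used in the following cells can be sketched as a small helper. The function name `zscore_outliers` and the sample `fares` series are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return the values whose absolute z-score exceeds the threshold."""
    # population standard deviation (ddof=0), as used in the cells of this section
    z = (values - values.mean()) / values.std(ddof=0)
    # comparisons against NaN are False, so missing values are never flagged
    return values[z.abs() > threshold]

# hypothetical fares: twenty typical values, one missing value, one extreme value
fares = pd.Series([8.0] * 20 + [np.nan, 500.0])
print(zscore_outliers(fares))
```

Because the mean and standard deviation are themselves pulled towards extreme values, this rule is a screening step. Flagged rows still need manual validation.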

### 4.2.1 Age  <a class="anchor" id="4_2_1"></a>

In [None]:
# print null value count
print("Null values:", working_df['Age'].isnull().sum())

# calculate zscore for each value
zscore = (working_df.Age - working_df.Age.mean()) / working_df.Age.std(ddof=0)

# calculate outliers using zscore
outliers = working_df.loc[abs(zscore) > 3]

# print outliers
print("Outlier count:", len(outliers))
    

We will first impute the missing values, then validate and handle outliers.

#### 4.2.1.1 Impute Missing Age Values  <a class="anchor" id="4_2_1_1"></a>

We must first identify features which have a strong correlation with age to use as the basis of our imputation.  
Start by correlating all numerical features against age:

In [None]:
# extract age column from data frame
age = working_df['Age']

# correlate with the other numerical columns (non-numeric columns are excluded)
corr = working_df.drop(columns='Age').select_dtypes(include='number').corrwith(age)

# display as bar graph
ax = corr.plot.bar(rot=0)

Categorical features can be checked by pivoting against age.  
Pivot Title vs Mean Age:

In [None]:
age_pivot = pd.pivot_table(working_df, index=['Title'], values=['Age'], aggfunc='mean')
age_pivot.plot(kind='bar')

Pivot Embarked vs Mean Age:

In [None]:
age_pivot = pd.pivot_table(working_df, index=['Embarked'], values=['Age'], aggfunc='mean')
age_pivot.plot(kind='bar')

We will impute the missing ages using linear regression imputation. The model is kept very simple for now, but more complexity can be added later to improve the imputation.

In [None]:
# prerequisites - title and passenger class both have a strong relationship with age,
# so both are used as predictors

# we first need to map Title to numerical values to allow the algorithm to run
working_df['TitleMapped'] = working_df['Title'].map({'Mr':0, 'Mrs':1, 'Miss':2, 'Master':3, 'Royalty':4, 'Officer':5})

# boolean mask of the rows with a missing age
age_missing = working_df['Age'].isna()

# training data: X = features (Title and Class), y = response variable (Age), known ages only
X_train = working_df.loc[~age_missing, ['TitleMapped', 'Pclass']]
y_train = working_df.loc[~age_missing, 'Age']

# prediction data: the same features for the rows where age is missing
X_missing = working_df.loc[age_missing, ['TitleMapped', 'Pclass']]

# imputing age using regression imputation
from sklearn.linear_model import LinearRegression
regression_model = LinearRegression()

# fit the model on every passenger with a known age
model = regression_model.fit(X_train, y_train)

# predict missing ages
age_result = model.predict(X_missing)

# fill the missing ages with the predictions (row order matches the mask)
working_df.loc[age_missing, 'Age'] = age_result

# check all age values have been filled
print("Null values:", working_df['Age'].isnull().sum())

# drop TitleMapped as it is no longer required
working_df = working_df.drop(columns="TitleMapped")
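
A simpler baseline to compare the regression against is group-median imputation, where each missing age takes the median age of passengers with the same title and class. This is a different technique from the regression above, shown only as a sketch on made-up values (`toy_df` is illustrative):

```python
import numpy as np
import pandas as pd

# hypothetical passengers with two missing ages
toy_df = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age": [20.0, 30.0, np.nan, 18.0, np.nan, 22.0],
})

# median age of each (Title, Pclass) group, broadcast back to every row
group_median = toy_df.groupby(["Title", "Pclass"])["Age"].transform("median")

# only rows with a missing age take the group median
toy_df["Age"] = toy_df["Age"].fillna(group_median)
print(toy_df)
```

Unlike the integer mapping of Title used for the regression, this approach does not impose an arbitrary numeric order on the titles.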

#### 4.2.1.2 Validate Age Outliers  <a class="anchor" id="4_2_1_2"></a>
Identify and validate outliers.

In [None]:
# calculate zscore for each value
zscore = (working_df.Age - working_df.Age.mean()) / working_df.Age.std(ddof=0)

# calculate outliers using zscore
outliers = working_df[abs(zscore) > 3]

# print outliers
print("Outlier count:", len(outliers))
print("Outliers:")
outliers

There are only seven outliers for age.  
All outliers are valid ages for a person and not erroneous values. They can be retained in the dataset.  

### 4.2.2 Fare  <a class="anchor" id="4_2_2"></a>

In [None]:
# print null value count
print("Null values:", working_df.Fare.isnull().sum())

zscore = (working_df.Fare - working_df.Fare.mean()) / working_df.Fare.std(ddof=0)

# calculate outliers using zscore
outliers = working_df.loc[abs(zscore) > 3]

# print outlier count
print("Outlier count:", len(outliers))

#### 4.2.2.1 Validate Fare Outliers  <a class="anchor" id="4_2_2_1"></a>
Identify and validate outliers.

In [None]:
print("Outliers:")
outliers

In [None]:
# group and count outlier values
outliers.Fare.value_counts()

There are multiple instances of the most extreme outliers. All outliers are also from first class, where higher ticket costs would be expected. We do not believe these are erroneous values, so they will be retained in the dataset.

## 4.3 Nominal Data  <a class="anchor" id="4_3"></a>
Data cleaning requirements for nominal data can be determined by:
- Checking for any missing values.
- Inspecting unique values to determine if any do not make sense.

### 4.3.1 Cabin  <a class="anchor" id="4_3_1"></a>

In [None]:
# print null value count
print("Null values:", working_df.Cabin.isnull().sum())

# print list of unique values to check for anything unusual
print("Unique count:", working_df.Cabin.nunique())

Cabin is missing a significant portion of data. No meaningful correlation will be possible from this feature and it will be dropped from the data set.

In [None]:
working_df = working_df.drop(columns='Cabin')

### 4.3.2 Embarked <a class="anchor" id="4_3_2"></a>

In [None]:
# print null value count
print("Null values:", working_df.Embarked.isnull().sum())

# print list of unique values to check for anything unusual
print("Unique values:", working_df.Embarked.dropna().unique())

Embarked contains 2 missing values and no unusual values. 

#### 4.3.2.1 Impute Missing Embarked Values  <a class="anchor" id="4_3_2_1"></a>
As only 2 of 891 values are missing, we can simply fill these with the most common embarked value.

In [None]:
# print embarked value counts
print(working_df.Embarked.value_counts())

Most common value is 'S'. Fill missing embarked values with 'S'.

In [None]:
# fill embarked na with 'S'
working_df.Embarked = working_df.Embarked.fillna('S')

# confirm there are no more nulls
print("Null values:", working_df.Embarked.isnull().sum())
print(working_df.Embarked.value_counts())
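
The most common value can also be looked up programmatically instead of being hard-coded. A minimal sketch on a made-up series (`embarked` is illustrative only):

```python
import numpy as np
import pandas as pd

# hypothetical ports of embarkation with two missing entries
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores missing values; take the first entry in case of ties
most_common = embarked.mode().iloc[0]

# fill the gaps with the most common value
filled = embarked.fillna(most_common)
print(most_common, filled.isnull().sum())
```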

### 4.3.3 Title  <a class="anchor" id="4_3_3"></a>

In [None]:
# print null value count
print("Null values:", working_df.Title.isnull().sum())

# print list of unique values to check for anything unusual
print("Unique values:", working_df.Title.unique())

Title has no unusual or missing values. No data cleaning required.

# 5. Feature Selection <a class="anchor" id="5"></a>
The primary goal of our analysis is to identify which features had the greatest impact on a passenger's chance of survival. Features will be selected based on this criterion.

## 5.1 Numerical Features <a class="anchor" id="5_1"></a>
Correlate all numerical features against survival.

In [None]:
# snapshot cleaned dataframe before selecting features
clean_df = working_df.copy()

# extract survived column from data frame
survived = working_df['Survived']

# correlate with the other numerical columns (non-numeric columns are excluded)
corr = working_df.drop(columns='Survived').select_dtypes(include='number').corrwith(survived)

# display as bar graph
ax = corr.plot.bar(rot=0)

Pclass, Sex, Age, Fare, Alone and UniqueTicket demonstrate low to moderate correlation with survival and require further investigation.  
SibSp, Parch and Relatives demonstrate minimal to no correlation and will be dropped.
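
Ranking features by the absolute value of their correlation makes this kind of selection easier to read than a bar chart alone, since strong negative correlations are as informative as strong positive ones. A minimal sketch on made-up values (`toy_df` is illustrative only):

```python
import pandas as pd

# hypothetical sample; Embarked is non-numeric and is excluded below
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex":      [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 2],
    "Parch":    [0, 1, 0, 1, 0, 1],
    "Embarked": ["S", "C", "S", "S", "C", "Q"],
})

# keep numeric features only, then correlate each with the outcome
features = toy_df.drop(columns="Survived").select_dtypes(include="number")
corr = features.corrwith(toy_df["Survived"])

# rank by strength of correlation regardless of sign
ranked = corr.abs().sort_values(ascending=False)
print(ranked)
```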

In [None]:
working_df = working_df.drop(columns=['SibSp', 'Parch', 'Relatives'])

## 5.2 Categorical Features <a class="anchor" id="5_2"></a>

Spearman / Pearson correlation is not possible for nominal categorical features, as their values have no numeric order.  
Correlation and selection of categorical features will be done by pivoting and visualising features against 'Survived'.
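
When comparing survival rates across categories, it helps to keep the group sizes in view, because a rate based on a handful of passengers is far less reliable. A minimal sketch on made-up values (`toy_df` is illustrative only):

```python
import pandas as pd

# hypothetical sample of embarkation port and outcome
toy_df = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q"],
    "Survived": [0, 0, 1, 0, 1, 1, 0],
})

# survival rate and number of passengers for each category
summary = toy_df.groupby("Embarked")["Survived"].agg(["mean", "count"])
print(summary)
```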

### 5.2.1 Embarked  <a class="anchor" id="5_2_1"></a>
Pivot and visualise embarked vs survival.

In [None]:
pivot = pd.pivot_table(working_df, index=['Embarked'], values=['Survived'], aggfunc='mean')
pivot.plot(kind="bar")

Embarked shows an uneven distribution of survival rate across the different values. This suggests there is some correlation between Embarked and Survived.

### 5.2.2 Title  <a class="anchor" id="5_2_2"></a>

In [None]:
pivot = pd.pivot_table(working_df, index=['Title'], values=['Survived'], aggfunc='mean')
pivot.plot(kind="bar")

Title shows an uneven distribution of survival rates across the different titles, especially for Mr and Officer. Females, children and royalty appear to have the highest chance of survival. This suggests there is some correlation between Title and Survived.

# 6. Exploratory Data Analysis (EDA) <a class="anchor" id="6"></a>

Finally we will take a closer look at the selected features to identify any interesting relationships or trends.

## 6.1 Passenger Class (PClass) <a class="anchor" id="6_1"></a>

***What impact does Passenger Class have on survival?***

In [None]:
# Visualise survival of each passenger class
sns.set_style('whitegrid')
sns.barplot(x='Pclass' , y='Survived' , data=working_df)
plt.show()

**Observation 1:** Survival of first class passengers was prioritised, followed by second class, then third.  
To confirm, we should investigate if there are any other relationships between Pclass and strong survival indicators which could account for the bias.

***What is the distribution of males and females for each class?***

In [None]:
# Visualise distribution of males and females for each class
ax = sns.barplot(x='Pclass' , y='Sex' , data=working_df)
ax.set(ylabel='Percentage of Females')
plt.show()

The distribution of males and females between each class does not account for the bias in Pclass survival. **Observation 1** still holds true.

## 6.2 Sex <a class="anchor" id="6_2"></a>

***What impact does sex have on survival?***

In [None]:
# Calculating the value counts for our attributes
male_total = working_df[working_df['Sex']==0].shape[0]
female_total = working_df[working_df['Sex']==1].shape[0]
print('Total male in our dataset:', male_total)
print('Total female in our dataset:', female_total)

# Calculating value counts for male and female who survived
male_surv = working_df.loc[ (working_df['Sex'] == 0) & (working_df['Survived']==1)].shape[0]
female_surv = working_df.loc[ (working_df['Sex'] == 1) & (working_df['Survived']==1)].shape[0]
print('\nTotal male survived: {} ({}%)'.format(male_surv, round((male_surv / male_total)*100)))
print('Total female survived: {} ({}%)'.format(female_surv, round((female_surv / female_total)*100)))

In [None]:
# Visualizing male and female survivors
sns.set_style('whitegrid')
ax = sns.barplot(x='Sex' , y='Survived' , data=working_df)
ax.set(xticklabels=["Male", "Female"])
plt.show()

Females were almost 4 times more likely than males to survive the sinking of the Titanic.  
**Observation 2:** Females were prioritised over males for survival.

## 6.3 Age <a class="anchor" id="6_3"></a>

***What impact did age have on survival?***

In [None]:
# Calculate average age of those who survived and those who died
pd.pivot_table(working_df, index=['Survived'], values=['Age'], aggfunc='mean')

As the average age of survivors is lower than the average age of those who died, we can assume younger passengers were prioritised over older passengers.
We can explore this further by creating age clusters.

In [None]:
# create AgeGroup feature
working_df["AgeGroup"] = 0
working_df.loc[ working_df['Age'] <= 10, 'AgeGroup'] = 10
working_df.loc[(working_df['Age'] > 10) & (working_df['Age'] <= 20), 'AgeGroup'] = 20
working_df.loc[(working_df['Age'] > 20) & (working_df['Age'] <= 30), 'AgeGroup'] = 30
working_df.loc[(working_df['Age'] > 30) & (working_df['Age'] <= 40), 'AgeGroup'] = 40
working_df.loc[(working_df['Age'] > 40) & (working_df['Age'] <= 50), 'AgeGroup'] = 50
working_df.loc[(working_df['Age'] > 50) & (working_df['Age'] <= 60), 'AgeGroup'] = 60
working_df.loc[(working_df['Age'] > 60) & (working_df['Age'] <= 70), 'AgeGroup'] = 70
working_df.loc[ working_df['Age'] > 70, 'AgeGroup'] = 80

# confirm no abnormal values for AgeGroup
working_df.AgeGroup.value_counts()
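
The same bins can be expressed more compactly with `pd.cut`. A minimal sketch on made-up ages, where the `bin_edges` and `bin_labels` names are illustrative and the labels mirror the upper bounds used in the cell above:

```python
import numpy as np
import pandas as pd

# hypothetical ages, including values that sit exactly on a bin edge
ages = pd.Series([0.42, 10.0, 10.5, 30.0, 70.0, 71.0])

# right=True makes each bin include its upper edge, e.g. (10, 20]
bin_edges = [-np.inf, 10, 20, 30, 40, 50, 60, 70, np.inf]
bin_labels = [10, 20, 30, 40, 50, 60, 70, 80]

age_group = pd.cut(ages, bins=bin_edges, labels=bin_labels, right=True).astype(int)
print(age_group.tolist())
```

One difference to be aware of is that `pd.cut` leaves a missing age as NaN, whereas the cell above would leave it in the default group 0.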

In [None]:
# visualise AgeGroup vs Survived
sns.barplot(x='AgeGroup' , y='Survived' , data=working_df)

From this graph we can see a general trend that younger people had a higher survival rate than older people, but there are spikes in the middle which prevent us using this as a definitive rule.  
We can identify there is a high survival rate of those in the 0 to 10 category, decreasing towards the 10 to 20 category.  
We will split between the two and categorise those aged 15 and below as children.

In [None]:
# create IsChild feature
working_df["IsChild"] = 0
working_df.loc[ working_df['Age'] <= 15, 'IsChild'] = 1

# visualise IsChild vs Survived
sns.barplot(x='IsChild' , y='Survived' , data=working_df)

**Observation 3:** Children were much more likely to have survived than adults.  
This is a much more convincing indicator than our AgeGroup, so we will drop AgeGroup in favour of IsChild.

In [None]:
working_df = working_df.drop(columns='AgeGroup')

## 6.4 Fare <a class="anchor" id="6_4"></a>

***What is the relationship between Fare and Survived?***

In [None]:
# Plot Fare vs Survived
sns.lineplot(x="Fare", y="Survived", data=working_df[working_df.UniqueTicket == 1])

Plotting Fare vs Survived does not reveal any direct relationships between the two variables.  
Is Fare just a rough indicator of Pclass?

In [None]:
# Plot Fare vs Pclass
sns.barplot(x="Pclass", y="Fare", data=working_df[working_df.UniqueTicket == 1])

As expected, first class passengers paid more, followed by second, then third.  
The correlation seen between Fare and Survived is likely a derivative of Pclass.

## 6.5 Alone <a class="anchor" id="6_5"></a>

***Are Alone and UniqueTicket representing the same group of passengers?***  
From our earlier correlation we know both Alone and UniqueTicket showed similar correlation with survived.  
Based on our assumption that a UniqueTicket allowed only one passenger to board, while a duplicate ticket allowed multiple passengers to board, it would stand to reason that most solo travellers would have a unique ticket.

In [None]:
# print percentage of alone passengers who have a unique ticket
alone_total = working_df.loc[ working_df['Alone'] == 1 ].shape[0]
alone_unique = working_df.loc[ (working_df['Alone'] == 1) & (working_df['UniqueTicket']==1) ].shape[0]
print("Percentage of Alone passengers who have unique tickets: {}%\n".format(round((alone_unique / alone_total)*100, 2)))

# Correlate Alone and UniqueTicket through visualisation
sns.barplot(x='Alone' , y='UniqueTicket' , data=working_df)
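
A row-normalised cross tabulation shows the overlap in both directions, including how many passengers travelling with family still held a unique ticket. A minimal sketch on made-up values (`toy_df` is illustrative only):

```python
import pandas as pd

# hypothetical flags for six passengers
toy_df = pd.DataFrame({
    "Alone":        [1, 1, 1, 1, 0, 0],
    "UniqueTicket": [1, 1, 1, 0, 0, 1],
})

# each row sums to 1: share of unique vs duplicate tickets within each Alone group
overlap = pd.crosstab(toy_df["Alone"], toy_df["UniqueTicket"], normalize="index")
print(overlap)
```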

86% of passengers travelling alone had a UniqueTicket. We can conclude that both features are representing essentially the same group of passengers. To reduce the number of dimensions in our dataset we can drop either Alone or UniqueTicket.  
We will drop whichever has the lower correlation with Survived:

In [None]:
# print correlation of Alone and Survived
print("Alone correlation:", round(working_df.Alone.corr(working_df.Survived), 2))

# print correlation of UniqueTicket and Survived
print("UniqueTicket correlation:", round(working_df.UniqueTicket.corr(working_df.Survived), 2))

We will drop Alone in favour of UniqueTicket.

In [None]:
working_df = working_df.drop(columns="Alone")

## 6.6 UniqueTicket <a class="anchor" id="6_6"></a>

***What type of passengers were travelling with a unique ticket?***

In [None]:
# Visualise UniqueTicket vs Sex / Pclass
ax = sns.countplot(x="Sex", hue="Pclass", data=working_df[working_df.UniqueTicket == 1])
ax.set(xticklabels=["Male", "Female"])
plt.show()

A passenger travelling with a unique ticket was most likely to be a third class male.

## 6.7 Embarked <a class="anchor" id="6_7"></a>

***Is Embarked actually a survival indicator?***  
We saw during feature selection that passengers who embarked in Cherbourg had a higher survival rate than those who embarked at Queenstown or Southampton.  
It doesn't seem logical that a person's port of embarkation would be a factor considered when prioritising survival during a time of crisis. We should cross reference Embarked with other strong survival indicators to account for the Cherbourg bias.

In [None]:
# Visualise Embarked vs Sex
sns.catplot(x="Embarked", y="Sex", kind="bar", data=working_df, ci=None)

Females accounted for less than 50% of passengers from Cherbourg. Sex does not account for the bias.  
We will check Pclass next.

In [None]:
# Visualise Embarked vs Pclass
sns.countplot(x="Embarked", hue="Pclass", data=working_df)

Cherbourg has a much higher ratio of 1st class passengers to 3rd class passengers. This suggests that the higher survival rate of Cherbourg is in fact just a coincidental indicator of Pclass. Embarked can therefore be discounted as a survival indicator.
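
This reasoning can be checked by comparing survival rates by port within each passenger class. If the gap between ports shrinks or disappears once class is held fixed, class is the likely driver. A minimal sketch on made-up values, constructed so that the effect is entirely explained by class (`toy_df` is illustrative, not real data):

```python
import pandas as pd

# hypothetical sample: Cherbourg is mostly 1st class, Southampton mostly 3rd class
toy_df = pd.DataFrame({
    "Embarked": ["C", "C", "C", "S", "S", "S", "S", "S"],
    "Pclass":   [1, 1, 3, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0],
})

# overall survival rate by port
overall = toy_df.groupby("Embarked")["Survived"].mean()

# survival rate by port within each class (rows = Pclass, columns = Embarked)
within_class = toy_df.groupby(["Pclass", "Embarked"])["Survived"].mean().unstack("Embarked")

print(overall)
print(within_class)
```

In this toy sample the ports differ overall but are identical within each class. The real data need not separate this cleanly.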

## 6.8 Title <a class="anchor" id="6_8"></a>

***What impact does Title have on Survived?***

In [None]:
# Visualizing survivors by title
sns.set_style('whitegrid')
sns.barplot(x='Title' , y='Survived' , data=working_df,ci=None)
plt.show()


# Passenger class & Title

In [None]:
# Visualizing Title and passenger class
# A quick sanity check confirms prior observations:
# Passenger class is a strong indicator of survival (i.e. first and second class passengers had a significantly higher rate of survival than third class passengers)
# Gender is a strong indicator of survival (i.e. female titles had a significantly higher rate of survival than male titles)
sns.catplot(
    x="Title", y="Survived", hue="Pclass", kind="bar",
    data=working_df, 
    ci=None,
    height=8.27, aspect=11.7/8.27
)

# Titles & Gender

In [None]:
# Passengers with female titles ('Mrs' or 'Miss') had a significantly higher rate of survival than male titles ('Mr', 'Master')
# All female royals and all female officers survived
# Strengthens the case that gender is a strong indicator of survival
g = sns.catplot(x="Title", y="Survived", hue="Sex", kind="bar", 
                data=working_df,
                ci=None,legend=False, height=8.27, aspect=11.7/8.27)
plt.legend(labels=['Male', 'Female'])
plt.show()

# First class passengers by title

In [None]:
# Interestingly, first class Mr's, Royals and Officers had a comparatively low rate of survival
# This subset weakens Pclass as an indicator of survival, though it strengthens the claim of gender
subset = working_df.loc[
    (working_df['Title'].isin(['Mr', 'Officer', 'Royalty'])) &
    (working_df['Pclass'] == 1) &
    (working_df['Sex'] == 0)
]
print("Subset size:", subset.shape[0])
g = sns.catplot(x="Title", y="Survived", kind="bar", 
                data=subset,
                ci=None,legend=False, height=8.27, aspect=11.7/8.27)
plt.show()
# In this sense, the subset is a counter-example to our earlier observation about passenger class, but it strengthens the case for gender as a strong indicator
g = sns.catplot(x="Title", y="Survived", hue="Sex", kind="bar", 
                data=working_df.loc[working_df['Pclass'] == 1],
                ci=None,legend=False, height=8.27, aspect=11.7/8.27)
plt.legend(labels=['Male', 'Female'])
plt.show()