# Titanic Survivors
**Tim Kroehler, Jan 2020**

# Summary
The Titanic Survivor competition is about predicting whether a given passenger survived, based on certain features.  We know something about the estimated 2222 people aboard, more than 1500 of whom died.  There are some holes in the dataset, which makes feature preprocessing and engineering where the game is won or lost.

# Features in the dataset
- PassengerId: Unique Id of a passenger
- Name: name and title of passenger
- survival: Whether a passenger survived or not; 1 if survived and 0 if not.
- pclass: Ticket class
- sex: Sex
- Age: Age in years
- sibsp: # of siblings / spouses aboard the Titanic
- parch: # of parents / children aboard the Titanic
- ticket: Ticket number
- fare: Passenger fare
- cabin: Cabin number
- embarked: Port of Embarkation

# Narrative Information
I would like to combine the dataset with some of the narrative information we can read about the disaster.  In particular, the rule of "women and children first" played a part in the Titanic disaster.  We could imagine heroic fathers putting their wives and children aboard the limited lifeboats and sending them off.  Some older people may also have heroically yielded their seats, or been physically unable to board the lifeboats.  The crew may have escorted the first class passengers to the boats, and may even have prevented some of the lower class passengers from boarding.  The collision with the iceberg happened late in the evening, when we would expect most people to be in their cabins, so passengers on the lower decks would have had a harder time getting to the lifeboats.  Roughly 15 minutes passed between the collision and the order to ready the lifeboats, and about an hour more before the front half of the ship went under.  We would expect Age and Sex to play major roles, Fare and PClass a lesser role, and Deck or Cabin a smaller role still, although the dataset is mostly missing data for these last two features.  Family attributes may play a role.  This is the "human intuition" part of the problem.
![Thayer sketch of the Titanic](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3e/Thayer-Sketch-of-Titanic.png/800px-Thayer-Sketch-of-Titanic.png)

# Exploratory Data Analysis
Part of this project is to do an exploratory data analysis.  Some of the features I would like to discard.  Some of the features will need their data filled in.  And we will also create some new features based on existing features.

* PassengerId: Drop
* Name: Use the title to fill in missing age (and, if needed, sex) data
* survival: Keep as Y
* pclass: Keep
* fare: Keep
* sex: Keep
* Age: Keep
* sibsp: Keep
* parch: Keep
* ticket: Drop
* cabin: Drop.  It would be interesting to extract the deck and use it, but the data is too sparse.  First class passengers had the upper decks, which were closer to the lifeboats.  With near-complete data, Cabin/Deck would probably be a good predictor.  We don't have that, so we'll have to rely on PClass.
* embarked: Drop, I think.  It might correlate with deck, but why would the port itself matter?  Did the last passengers to board all go to one deck?
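
The sparsity argument for Cabin can be sketched on a few rows.  This is only an illustration: the rows below are made up rather than read from `train.csv`, the `Deck` column name is hypothetical, and treating the first letter of the cabin string as the deck is an assumption, not something the dataset documents.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the competition files.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 2],
    "Cabin": ["C85", "B28", None, None, "F33"],
})

# Assumption: the deck is the leading letter of the cabin string.
toy["Deck"] = toy["Cabin"].str[0]

# Share of rows with no cabin recorded.
missing_share = toy["Cabin"].isna().mean()
print(toy)
print("missing share:", missing_share)
```

Running the same two lines on the real data would show how much of the Deck column stays empty, which is the reason for falling back on PClass.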

With the family values, we'll use the Wikipedia remark that "women and children" were given priority, and add a feature called "vulnerable" for passengers who qualify.

I wish there were a way to link the families, to see whether there was a heroic father who put his wife and children on a boat.  The dataset doesn't link family members, however.  They might share a last name, but linking them would take a lot of assumptions and text work.  I've read other data explorations showing that the family features' correlations with survival are really low, and suggesting that survival chances were better for family sizes of 1, 2, or 3 and dropped off for larger families.  What could this mean?  I don't know if I want to use much of the family data.
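
For reference, the family-size idea from those other explorations could be checked roughly like this.  It is a sketch on made-up rows, and `FamilySize` is a hypothetical column (the passenger plus `SibSp` plus `Parch`), not a feature this notebook ends up using.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the competition files.
toy = pd.DataFrame({
    "SibSp":    [0, 1, 1, 4, 0, 0],
    "Parch":    [0, 0, 2, 2, 0, 1],
    "Survived": [0, 1, 1, 0, 1, 1],
})

# Hypothetical feature: the passenger plus siblings/spouses plus parents/children.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# The mean of a 0/1 column is the survival rate within each family size.
rate_by_size = toy.groupby("FamilySize")["Survived"].mean()
print(rate_by_size)
```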

Title seems like a nice feature to engineer.  By pulling it out of a text field (which is hard to classify) and mapping it to a typical age, and we know age is important, it seems like we can fill in the missing ages.

I'm thinking a RandomForest will do the best.  We'll tune it up.

## 1. Load libraries and data

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV, cross_val_score
import matplotlib.pyplot as plt

train_df=pd.read_csv("train.csv")
test_df=pd.read_csv("test.csv")
train_len=len(train_df)
test_len=len(test_df)

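# combine train and test for feature engineering, tagging the test data with -1 in Survived (which is blank for test data)
test_df['Survived']=-1
# ignore_index gives the combined frame a unique 0..n-1 index instead of repeating each file's row labels
df=pd.concat([train_df,test_df],sort=False,ignore_index=True)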


## 2. Exploratory data analysis
- a. Look at missing values in age, gender, fare, and deck (cabin)
- b. Look at correlation between survivability and some factors
- c. Look at correlations between factors (port and deck, age and fare, gender and fare)

In [2]:
# explore data
explore_data=0
if (explore_data==1):
    total = df.isnull().sum().sort_values(ascending=False)
    percent_1 = df.isnull().sum()/df.isnull().count()*100
    percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
    missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
    missing_data.head(5)

In [3]:
if (explore_data):
    train_df.describe()

In [4]:
if (explore_data):
    train_df.head(5)

In [5]:
if (explore_data):
    train_df['Died']= 1 - train_df['Survived']
    # select the two numeric columns before summing so text columns are never aggregated
    train_df.groupby('Sex')[['Survived','Died']].sum().plot(kind='bar',stacked=True)
    # More men died than women.  The Wikipedia article says a "women and children first" sentiment prevailed.


## 3. Preprocessing of features

In [6]:
# pull the title out of each name; it is used below to impute missing ages (sex is not imputed from the title)
def get_title(x):
    return(x.split(',')[1].split('.')[0].strip())
titles = set()
for name in df['Name']:
    titles.add(get_title(name))
if (explore_data):
    titles
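
As a quick check of the parsing, here is the same split applied to three names written in the dataset's "Surname, Title. Given names" format.  The function is repeated so the snippet runs on its own, and `sample_names` and `sample_titles` are illustrative names only.

```python
def get_title(name):
    # Take the text between the first comma and the following period.
    return name.split(',')[1].split('.')[0].strip()

sample_names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]
sample_titles = [get_title(n) for n in sample_names]
print(sample_titles)
```

A name without a comma would raise an `IndexError` here, so this relies on every row following that format.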


In [7]:
# set the title field, and then we'll clean it up
df['Title']=df['Name'].map(get_title)
df['Title'] = df['Title'].replace(['Lady', 'Countess','Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona', 'the Countess'], 'Rare')
# fold the French and abbreviated forms into their English equivalents
df['Title'] = df['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})

# now we'll fill each missing age with the mean age of passengers sharing the same title
meanAges= df[['Title','Age']].groupby(['Title'],as_index = False).mean().sort_values(by='Age')
df['Age'] = df['Age'].fillna(-1)

for _, meanage in meanAges.iterrows():
    df.loc[(df['Age'] == -1) & (df['Title']==meanage['Title']), 'Age'] = meanage['Age']
# cast once, after every title has been filled (this truncates fractional ages)
df['Age']=df['Age'].astype(int)
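
The same per-title fill can probably be written without the -1 placeholder and the loop, using `groupby(...).transform`.  This is a sketch on made-up rows, not the code that produced the results below.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; not taken from the competition files.
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master"],
    "Age":   [30.0, 40.0, np.nan, 20.0, np.nan, 4.0],
})

# Per-row mean age of the passenger's title group (NaNs are skipped by mean).
title_mean = toy.groupby("Title")["Age"].transform("mean")

# Fill only the missing ages, then truncate to whole years as the notebook does.
toy["Age"] = toy["Age"].fillna(title_mean).astype(int)
print(toy)
```

If some title had no known ages at all, its group mean would itself be NaN and the integer cast would fail, so that case would need a fallback such as the overall mean.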
   

In [8]:
# set this integer value for very vulnerable(2), somewhat vulnerable(1), or not vulnerable (0)
df['Vulnerable']=0
df.loc[(df['Age']<16) ,'Vulnerable']=1
df.loc[(df['Age']<=9) | (df['Sex']=="female") ,'Vulnerable']= 2
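
To sanity-check the rule, the same three levels can be written with `np.select` and run on a few made-up passengers.  This is an equivalent sketch for illustration, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; not taken from the competition files.
toy = pd.DataFrame({
    "Age": [35, 12, 5, 40, 8],
    "Sex": ["male", "male", "male", "female", "female"],
})

# Conditions are checked in order, so the stronger rule (2) is listed first.
conditions = [
    (toy["Age"] <= 9) | (toy["Sex"] == "female"),  # very vulnerable
    toy["Age"] < 16,                                # somewhat vulnerable
]
toy["Vulnerable"] = np.select(conditions, [2, 1], default=0)
print(toy)
```

Note that every female passenger lands in level 2 regardless of age, so level 1 only ever applies to boys aged 10 to 15.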


## 4. Create and train the model

In [9]:
# final cleaning up of non-essential fields
df['Male']=(df['Sex']=="male")
df['Female']=(df['Sex']=="female")
df.drop(columns=['Title','Name','Sex','Ticket','Cabin','Embarked'], inplace=True)

# fill the missing fare with the mean fare of the passenger's ticket class;
# the means are computed before the -1 placeholder goes in so it cannot skew them
meanFares= df[['Pclass','Fare']].groupby(['Pclass'],as_index = False).mean().sort_values(by='Fare')
df['Fare'] = df['Fare'].fillna(-1)
for _, meanfare in meanFares.iterrows():
    df.loc[(df['Fare'] == -1) & (df['Pclass']==meanfare['Pclass']), 'Fare'] = meanfare['Fare']
df['Fare']=df['Fare'].astype(int)

In [10]:
# and the removal and saving off of passengerId and Survived and splitting of the dataset
survived=df['Survived']

# the first train_len rows are the training data; everything after is the test data
training = df[:train_len].copy()
testing = df[train_len:].copy()


training.drop(columns=['PassengerId','Survived'], inplace=True)
passengerId=testing['PassengerId']
testing.drop(columns=['PassengerId','Survived'], inplace=True)
if (explore_data):
    training.head()
training.describe()

| | Pclass | Age | SibSp | Parch | Fare | Vulnerable |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 2.308642 | 29.665544 | 0.523008 | 0.381594 | 31.785634 | 0.794613 |
| std | 0.836071 | 13.297222 | 1.102743 | 0.806057 | 49.70373 | 0.97463 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 21.0 | 0.0 | 0.0 | 7.0 | 0.0 |
| 50% | 3.0 | 30.0 | 0.0 | 0.0 | 14.0 | 0.0 |
| 75% | 3.0 | 36.0 | 1.0 | 0.0 | 31.0 | 2.0 |
| max | 3.0 | 80.0 | 8.0 | 6.0 | 512.0 | 2.0 |


In [11]:
testing.describe()

| | Pclass | Age | SibSp | Parch | Fare | Vulnerable |
|---|---|---|---|---|---|---|
| count | 418.0 | 418.0 | 418.0 | 418.0 | 418.0 | 418.0 |
| mean | 2.26555 | 30.045455 | 0.447368 | 0.392344 | 35.131579 | 0.818182 |
| std | 0.841838 | 13.056797 | 0.89676 | 0.981429 | 55.856783 | 0.974719 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 21.25 | 0.0 | 0.0 | 7.0 | 0.0 |
| 50% | 3.0 | 30.0 | 0.0 | 0.0 | 14.0 | 0.0 |
| 75% | 3.0 | 36.0 | 1.0 | 0.0 | 31.0 | 2.0 |
| max | 3.0 | 76.0 | 8.0 | 9.0 | 512.0 | 2.0 |


In [12]:
tuning=0
if (tuning==1):
    # tuning the hyperparameters
    param_grid = {
        "criterion": ["gini", "entropy"],
        "min_samples_leaf": [1, 5, 10, 25, 50, 70],
        "min_samples_split": [2, 4, 10, 12, 16, 18, 25, 35],
        "n_estimators": [100, 400, 700, 1000, 1500],
    }
    # max_features='sqrt' is what 'auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3
    rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', oob_score=True, random_state=1, n_jobs=-1)
    clf = GridSearchCV(estimator=rf, param_grid=param_grid, n_jobs=-1, cv=5)
    clf.fit(training, survived[:train_len])
    print(clf.best_params_)


#### Results of KFold CV (after a long wait)
{'criterion': 'gini',
 'min_samples_leaf': 1,
 'min_samples_split': 16,
 'n_estimators': 100}

In [13]:

forest = RandomForestClassifier(criterion='gini', 
                             n_estimators=100,
                             min_samples_split=16,
                             min_samples_leaf=1,
                             max_features='sqrt',  # what 'auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3
                             oob_score=True,
                             random_state=1,
                             n_jobs=-1)
forest.fit(training, survived[:train_len])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=12,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=True, random_state=1, verbose=0,
                       warm_start=False)

## 5. Test the model

In [14]:
predictions = forest.predict(testing)

## 6. Interpret results

In [15]:
# accuracy on the training rows themselves: an optimistic figure, not a held-out score
accuracy = round(forest.score(training, survived[:train_len]) * 100, 2)
print("Model Accuracy: ",accuracy)

Model Accuracy:  89.79
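
That figure is measured on the same rows the forest was fitted on, so it likely overstates how well the model generalizes.  Because the forest is built with `oob_score=True`, scikit-learn also exposes an out-of-bag estimate in `oob_score_`.  The sketch below shows the gap on synthetic data from `make_classification`; the data and the `demo_forest` name are stand-ins, not the Titanic features or the model above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Titanic features; not the competition data.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)

demo_forest = RandomForestClassifier(
    n_estimators=100,
    min_samples_split=16,
    oob_score=True,      # score each sample using only trees that did not see it
    random_state=1,
)
demo_forest.fit(X, y)

train_acc = demo_forest.score(X, y)   # accuracy on the rows used for fitting
oob_acc = demo_forest.oob_score_      # out-of-bag accuracy estimate
print("training accuracy:", round(train_acc, 3))
print("out-of-bag accuracy:", round(oob_acc, 3))
```

On the notebook's own model, the comparable number would be `forest.oob_score_`.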


In [16]:
# folds are not shuffled, so random_state is omitted (recent scikit-learn rejects it when shuffle=False)
kfold = model_selection.KFold(n_splits=10)
results = model_selection.cross_val_score(forest, training, survived[:train_len], cv=kfold, scoring='roc_auc')
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))


AUC: 0.858 (0.033)


## 7. Submit results

In [17]:
#Create a CSV with results
submission = pd.DataFrame({
    "PassengerId": passengerId,
    "Survived": predictions
})
submission.to_csv('submission.csv', index = False)