# Titanic Survivor Classifier

In this project, we'll use data on Titanic passengers and a random forest classifier to predict which passengers survived and which didn't.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score

## Basic Analysis

In [2]:
df = pd.read_csv('/Users/zacrossman/Downloads/Titanic.csv')
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB


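It looks like we have 327 null values in the 'Cabin' column, 86 in the 'Age' column, and one in the 'Fare' column. 327 is roughly 78% of the rows, so we'll drop the 'Cabin' column entirely. To deal with the null 'Age' values, we'll fill them in with the average age for the data set, and we'll fill the single missing 'Fare' value with the average fare.

In [4]:
# Fill each column with its own mean. A bare df.fillna(average_age)
# would also write the average age into 'Fare' and 'Cabin'.
average_age = df['Age'].mean()
average_fare = df['Fare'].mean()
df = df.fillna({'Age': average_age, 'Fare': average_fare})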
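The null counts quoted above can be checked directly instead of being read off `df.info()`. Here is a minimal sketch on a hypothetical toy frame (the values are made up, and the names `toy`, `null_counts`, `null_share`, and `filled` are not from the notebook). It also shows a per-column fill, which leaves the other columns untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (values are made up)
toy = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0, 27.0],
    "Fare": [7.8, 7.0, np.nan, 8.7],
    "Cabin": [np.nan, np.nan, np.nan, "B45"],
})

# Null count and null share per column
null_counts = toy.isnull().sum()
null_share = toy.isnull().mean()
print(pd.DataFrame({"nulls": null_counts, "share": null_share}))

# Fill only the numeric columns, each with its own mean; 'Cabin' is left alone
filled = toy.fillna({"Age": toy["Age"].mean(), "Fare": toy["Fare"].mean()})
print(filled)
```

On the real data, the same two lines (`df.isnull().sum()` and `df.isnull().mean()`) should reproduce the counts and the roughly 78% share for 'Cabin'.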

We won't use the 'PassengerId', 'Name', 'Embarked', and 'Ticket' columns for our prediction, so we'll drop them along with 'Cabin'.

In [5]:
df = df.drop(['Name', 'Embarked', 'PassengerId', 'Ticket', 'Cabin'], axis = 1)
print(df.iloc[0, :])

Survived         0
Pclass           3
Sex           male
Age           34.5
SibSp            0
Parch            0
Fare        7.8292
Name: 0, dtype: object


Now we have all of our relevant columns. Our algorithm can't process strings, so let's change all 'male' values to 1 and all 'female' values to 0.

In [6]:
# Assign the result back to the column; replace(..., inplace=True) on a
# selected column may not update df in newer pandas versions.
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
print(df.iloc[0, :])

Survived     0.0000
Pclass       3.0000
Sex          1.0000
Age         34.5000
SibSp        0.0000
Parch        0.0000
Fare         7.8292
Name: 0, dtype: float64


Now let's turn our feature columns and our label column into arrays.

In [7]:
X = np.array(df.drop(['Survived'], axis = 1))
y = np.array(df['Survived'])

## Training and Implementing our Model

Now that our data is all set and converted into feature and label arrays, we can split the data into train and test subsets.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 42)
print('X_train size:', len(X_train))
print('X_test size:', len(X_test))
print('y_train size:', len(y_train))
print('y_test size:', len(y_test))

X_train size: 334
X_test size: 84
y_train size: 334
y_test size: 84
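The sizes printed above follow from simple arithmetic. This sketch assumes that scikit-learn rounds the test share up and gives the remainder to the training set, which matches the output here. The variable names are illustrative only.

```python
import math

n_samples = 418   # rows in the data set, from df.info()
test_size = 0.2   # same value passed to train_test_split

# Assumption: the test count is the ceiling of test_size * n_samples,
# and the training set gets whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("test rows:", n_test)
print("train rows:", n_train)
```

Since 0.2 × 418 = 83.6, the test set ends up slightly above 20% of the rows.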


We wanted our test set to be around 20% of the data, which comes to 84 rows in this case. Next we'll use scikit-learn to build our model. We'll stick with the default number of trees, which is 100.

In [9]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

RandomForestClassifier()
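A random forest draws random samples while it trains, so two runs of the cell above can produce slightly different models. Here is a minimal sketch of making the tree count explicit and fixing the seed. It uses synthetic data from `make_classification` rather than the Titanic file, and `random_state=42` is an arbitrary choice, not a setting the notebook uses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Same settings and same seed for both forests
rfc_a = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)
rfc_b = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Number of fitted trees in the forest
print(len(rfc_a.estimators_))

# With a fixed seed, both forests make identical predictions
same = np.array_equal(rfc_a.predict(X_demo), rfc_b.predict(X_demo))
print(same)
```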

Now let's evaluate our model on the training set.

In [10]:
rfc_pred_train = rfc.predict(X_train)
print('Training Data Accuracy:', rfc.score(X_train, y_train))
print('Training Data F1 Score:', f1_score(y_train, rfc_pred_train))
train_cm = confusion_matrix(y_train, rfc_pred_train)
print('Training Data Confusion Matrix:')
print(train_cm)

Training Data Accuracy: 1.0
Training Data F1 Score: 1.0
Training Data Confusion Matrix:
[[216   0]
 [  0 118]]


Now let's evaluate our model on the test set.

In [11]:
rfc_pred_test = rfc.predict(X_test)
print('Test Data Accuracy:', rfc.score(X_test, y_test))
print('Test Data F1 Score:', f1_score(y_test, rfc_pred_test))
test_cm = confusion_matrix(y_test, rfc_pred_test)
print('Test Data Confusion Matrix:')
print(test_cm)

Test Data Accuracy: 1.0
Test Data F1 Score: 1.0
Test Data Confusion Matrix:
[[50  0]
 [ 0 34]]
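The accuracy and F1 score can be recomputed by hand from the confusion matrix, which shows how the three outputs relate. The sketch below copies the test matrix printed above and uses scikit-learn's layout, where rows are true labels and columns are predicted labels.

```python
import numpy as np

# Test confusion matrix copied from the output above
cm = np.array([[50, 0],
               [0, 34]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
print("f1:", f1)
```

With no off-diagonal entries, precision and recall are both 1.0, so the F1 score is 1.0 as well.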


## Conclusion

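Our random forest model scored 100% accuracy and an F1 score of 1.0 on both the training set and the test set. A perfect score on unseen data is unusual, so it deserves a second look before we trust it. In the five rows shown by df.head(), every female passenger survived and every male passenger did not, so it's worth checking whether 'Survived' is fully determined by 'Sex' in this file.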
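A perfect test score can be sanity-checked by cross-tabulating the label against each feature. The sketch below does this for 'Sex', using only the five rows printed by df.head() earlier. The names `sample`, `table`, and `match_rate` are illustrative, and five rows say nothing definite about the full 418, so the same check would need to be run on the whole frame.

```python
import pandas as pd

# The five rows shown by df.head(), reduced to the two columns of interest
sample = pd.DataFrame({
    "Sex": ["male", "female", "male", "male", "female"],
    "Survived": [0, 1, 0, 0, 1],
})

# Counts of each (Sex, Survived) combination
table = pd.crosstab(sample["Sex"], sample["Survived"])
print(table)

# Share of rows where Survived equals "passenger is female"
is_female = (sample["Sex"] == "female").astype(int)
match_rate = (sample["Survived"] == is_female).mean()
print("match rate:", match_rate)
```

If the match rate on the full data is also 1.0, the model only needs the 'Sex' column to reproduce the label. That would explain the perfect scores without saying much about the model's predictive ability.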