# Titanic ML Survivability Model

## The challenge
The challenge is simple: we want to use the Titanic passenger data (name, age, ticket price, etc.) to predict which passengers survived and which did not.

## The data
There are three files within the `input` directory: (1) **train.csv**, (2) **test.csv**, and (3) **gender_submission.csv**.

### (1) train.csv
**train.csv** contains the details of a subset of the passengers on board: 891 passengers, one per row.
The values in the second column ("Survived") indicate whether each passenger survived:
- if it's a "1", the passenger survived.
- if it's a "0", the passenger died.

### (2) test.csv
Using the patterns found in **train.csv**, we'll have to predict whether the other 418 passengers on board (in **test.csv**) survived.

## Load the Data

### Import Libraries

In [1]:
# Load libraries
import pandas as pd
import numpy as np
from pandas.plotting import scatter_matrix
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

import os

### Load Datasets

In [2]:
# Get directories and file paths
cwd = os.getcwd()
input_dir_path = os.path.join(cwd, "input")
output_dir_path = os.path.join(cwd, "output")

train_data_file = os.path.join(input_dir_path, "train.csv")
test_data_file = os.path.join(input_dir_path, "test.csv")

# Load training dataset
train_data = pd.read_csv(train_data_file)

# Load testing dataset
test_data = pd.read_csv(test_data_file)

## Summarize the Dataset

### Dimensions of Dataset

In [3]:
train_data.shape

(891, 12)

### Peek at the Data

In [4]:
train_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
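
The preview already shows missing `Cabin` entries, so a per-column count of missing values is a natural next check. The sketch below rebuilds only the five previewed rows by hand, with a subset of columns. The `preview` name is hypothetical. On the full data, the same call would presumably be made on `train_data`.

```python
import numpy as np
import pandas as pd

# Five-row preview rebuilt by hand from the head() output above (subset of columns).
preview = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# Count missing values in each column.
missing_counts = preview.isna().sum()
print(missing_counts)
```

In this five-row preview only `Cabin` has gaps. The full file may show missing values in other columns as well.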


## Explore a Pattern
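The sample submission file **gender_submission.csv** assumes that all female passengers survived (and all male passengers died). We can check how reasonable that assumption is against the training data.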

In [5]:
women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)

print("Fraction of women who survived:", rate_women)

Fraction of women who survived: 0.7420382165605095


In [6]:
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)

print("Fraction of men who survived:", rate_men)

Fraction of men who survived: 0.18890814558058924


From this we can see that almost 75% of the women in the training data survived, whereas only about 19% of the men did.

The gender-based approach bases its predictions on only a single column. By considering multiple columns, we can discover more complex patterns that can potentially yield better-informed predictions.
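
The two rates computed above can also be obtained in one step with a `groupby`. The sketch below uses a small made-up frame: the `toy` data is invented for illustration and does not reproduce the real rates.

```python
import pandas as pd

# Invented toy data: 4 women (3 survived) and 4 men (1 survived).
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column within each group is that group's survival rate.
rates = toy.groupby("Sex")["Survived"].mean()
print(rates)
```

Grouping by several columns at once (for example `["Sex", "Pclass"]`) is one way to start looking at the multi-column patterns mentioned above.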

## ML Model
We'll build a **random forest model**. This model is constructed of several "trees" that will individually consider each passenger's data and vote on whether the individual survived. Then, the random forest model makes a democratic decision: the outcome with the most votes wins!
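
To make the voting idea concrete, the sketch below fits a small forest on made-up numeric data and tallies each tree's vote by hand. The names `X_toy`, `y_toy`, and `forest` are hypothetical, and the data is invented. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so a manual tally like this one can occasionally disagree with `forest.predict`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented, already-numeric toy data: the columns stand in for Pclass and an encoded Sex flag.
X_toy = np.array([[1, 1], [1, 0], [2, 1], [2, 0], [3, 1], [3, 0], [3, 0], [1, 1]])
y_toy = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# An odd number of trees avoids tied votes in the manual tally below.
forest = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=1)
forest.fit(X_toy, y_toy)

# One row of 0/1 predictions per tree, one column per passenger.
tree_votes = np.array([tree.predict(X_toy) for tree in forest.estimators_])

# Share of trees voting "survived" for each passenger, then a hard majority decision.
vote_share = tree_votes.mean(axis=0)
majority = (vote_share > 0.5).astype(int)

print("Share of trees voting 'survived':", vote_share)
print("Majority decision:", majority)
```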

The code cell below looks for patterns in four different columns ("**Pclass**", "**Sex**", "**SibSp**", and "**Parch**") of the data. It constructs the trees in the random forest model based on patterns in the train data file, before generating predictions for the passengers in the test data. The code also saves these new predictions in a CSV file, **prediction.csv**.

In [7]:
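# Target column: 1 = survived, 0 = died
y = train_data["Survived"]

# One-hot encode the categorical "Sex" column; numeric columns pass through unchanged
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# Fit the forest on the training data, then predict for the test passengers
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

# Make sure the output directory exists before writing to it
os.makedirs(output_dir_path, exist_ok=True)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv(os.path.join(output_dir_path, "prediction.csv"), index=False)
print("Your prediction was successfully saved!")

Your prediction was successfully saved!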
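
The test file has no "Survived" column, so the cell above gives no indication of how accurate the predictions are. One common way to estimate this is cross-validation on the training data, using the `cross_val_score` and `StratifiedKFold` helpers that are already imported. The sketch below runs on a small invented frame: the `toy` data and the choice of 3 folds are assumptions for illustration, not results from the real dataset.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Invented toy data with the same four feature columns used above.
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2, 3, 1, 2, 3, 1, 3],
    "Sex": ["female", "male", "female", "male", "female", "male",
            "male", "male", "female", "female", "male", "male"],
    "SibSp": [1, 0, 0, 1, 0, 0, 2, 0, 1, 3, 0, 0],
    "Parch": [0, 0, 1, 0, 2, 0, 1, 0, 0, 2, 0, 0],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0],
})

features = ["Pclass", "Sex", "SibSp", "Parch"]
X_toy = pd.get_dummies(toy[features])
y_toy = toy["Survived"]

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)

# Stratified folds keep the survived/died ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="accuracy")

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

On the real data, `X` and `y` from the cell above would take the place of `X_toy` and `y_toy`, and a larger number of folds would likely be more appropriate.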
