# Neural Networks - Case Study II

## Predicting Chances of Surviving the Titanic Disaster

### Project Scope:

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

**Your Role:**

Build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).


**Specifics:** 

* Machine Learning task: Classification model 
* Target variable: **Survived** 
* Input variables: Refer to data dictionary below
* Success Criteria: Accuracy of 80% and above

### **Data Dictionary:**

The dataset records the following parameters for each passenger:

**PassengerId:** Passenger identifier \
**Survived:** Survival (0 = No, 1 = Yes) \
**Pclass:** Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd) \
**Name:** Name of the passenger \
**Sex:** Gender of the passenger \
**Age:** Age in years \
**SibSp:** No. of siblings / spouses aboard the Titanic \
**Parch:** No. of parents / children aboard the Titanic \
**Ticket:** Ticket number \
**Fare:** Passenger fare \
**Cabin:** Cabin ('U' is for Unknown) \
**Embarked:** Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

### **Loading the libraries and the dataset**

In [88]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


import warnings
warnings.filterwarnings("ignore")

from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

In [None]:
# load the Titanic_train.csv file using the pandas `read_csv()` function
df = pd.read_csv('Titanic_train.csv')
df.head()

#### What features do you think contribute to a high survival rate ?

In [90]:
# Drop the unnecessary ones
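
One possible way to fill in this cell is sketched below. Which columns count as "unnecessary" is a judgment call; the sketch assumes that identifier-like columns (`PassengerId`, `Name`, `Ticket`) carry little predictive signal. The small frame is a hypothetical stand-in for the real data, not rows from `Titanic_train.csv`.

```python
import pandas as pd

# Hypothetical stand-in with a few of the Titanic columns (values are made up).
df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Name": ["Passenger A", "Passenger B"],
    "Ticket": ["T-001", "T-002"],
    "Fare": [7.25, 71.28],
})

# Assumption: identifier-like columns are dropped; adjust the list as needed.
drop_cols = ["PassengerId", "Name", "Ticket"]
df = df.drop(columns=drop_cols)
print(df.columns.tolist())
```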


In [145]:
# explore the data quickly


#### The training set has 891 examples and 11 features + the target variable (`Survived`).

In [146]:
# look at some summary stats
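
A minimal sketch of how the summary statistics might be read, using a made-up frame rather than the real training set. Because `Survived` is coded 0/1, its mean is the share of survivors, and the `min` and `max` of `Age` give the age range.

```python
import pandas as pd

# Hypothetical stand-in data (not taken from Titanic_train.csv).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Age": [22.0, 38.0, 26.0, 54.0, None, 14.0],
})

summary = df.describe()                       # count, mean, std, min, quartiles, max
survival_pct = df["Survived"].mean() * 100    # mean of a 0/1 column = share of 1s
age_min, age_max = df["Age"].min(), df["Age"].max()  # missing values are skipped

print(summary)
print(f"{survival_pct:.1f}% survived; ages range from {age_min} to {age_max}")
```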


**Observation:**
* __ % of the passengers in the training set survived the Titanic
* The age of the passengers ranges between __ & __ 

In [147]:
# check if passenger class has anything to do with survival. Plot a bar plot of Pclass vs Survived
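
The numbers behind such a bar plot can be sketched as a group-wise mean: the survival rate per passenger class. The data below is a hypothetical stand-in, so the rates shown say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical stand-in data (values are made up).
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of the 0/1 target within each class = survival rate for that class.
rate_by_class = df.groupby("Pclass")["Survived"].mean()
print(rate_by_class)
```

In the notebook, `rate_by_class.plot(kind="bar")` or `sns.barplot(x="Pclass", y="Survived", data=df)` should draw the same rates as bars.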


In [148]:
# calculate the correlation of all features with target variable
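
A possible approach, sketched on made-up data: restrict the frame to numeric columns, compute the correlation matrix, and keep only the column for `Survived`. Text columns such as `Sex` are excluded here until they have been encoded.

```python
import pandas as pd

# Hypothetical stand-in data (values are made up).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 1, 3, 2, 3],
    "Fare":     [7.25, 71.28, 53.10, 8.05, 13.00, 7.90],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
})

corr_with_target = (
    df.select_dtypes(include="number")  # correlation is only defined for numeric columns
      .corr()["Survived"]               # Pearson correlation of each column with the target
      .drop("Survived")                 # the target's correlation with itself is always 1
      .sort_values()
)
print(corr_with_target)
```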


### Data Prep Required

1. Convert object type features into numeric ones.
2. Features have different ranges; convert them to roughly the same scale.
3. Some features contain missing values (NaN = not a number) that need to be replaced.

In [149]:
# check for missing data
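
Missing values can be counted per column with `isna()`; a short sketch on a made-up frame follows. Summing the boolean mask gives counts, and taking its mean gives the fraction missing.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with gaps (values are made up).
df = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan],
    "Cabin": ["C85", None, None, None],
    "Fare":  [7.25, 71.28, 7.92, 8.05],
})

missing_counts = df.isna().sum()   # number of missing entries per column
missing_share = df.isna().mean()   # fraction of missing entries per column
print(missing_counts)
print(missing_share)
```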


In [152]:
# check distribution of Age


In [153]:
# calculate mean 'age'


In [154]:
# replace NaN with 29 (not advised)


In [155]:
# convert it to int type
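
The last three cells (mean, fill, cast) could be combined as in the sketch below. Instead of typing a constant, it computes the fill value from the column itself; whether mean imputation suits the real data is left open. The ages are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data (values are made up).
df = pd.DataFrame({"Age": [20.0, np.nan, 30.0, 40.0, np.nan]})

mean_age = df["Age"].mean()   # NaN is skipped, so this is the mean of 20, 30, 40

# Fill the gaps, round to whole years, then cast to integer.
df["Age"] = df["Age"].fillna(mean_age).round().astype(int)
print(df["Age"].tolist())
```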


In [156]:
# Create dummy variables for all 'object' type variables 
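
One way to do this is `pd.get_dummies`, which by default encodes every text or categorical column. The sketch below uses made-up rows; `drop_first=True` is an optional choice (an assumption here) that removes one redundant column per feature.

```python
import pandas as pd

# Hypothetical stand-in data (values are made up).
df = pd.DataFrame({
    "Sex":      ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Fare":     [7.25, 71.28, 7.92],
})

# Text columns become 0/1 indicator columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, drop_first=True, dtype=int)
print(encoded)
```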


In [157]:
# saving this processed dataset. Use index=None


### Data Partition

In [158]:
# Separate the input features and target variable


In [164]:
# split the data into training and testing sets
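
The two partition steps might look like the sketch below. The names `xtrain` and `ytrain` match the ones used when fitting the model later; the 80/20 split, the fixed `random_state`, and the stratification on the target are assumptions, and the frame is a made-up stand-in.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the processed dataset (values are made up).
df = pd.DataFrame({
    "Survived": [0, 1] * 10,
    "Pclass":   [3, 1] * 10,
    "Age":      list(range(20, 40)),
})

x = df.drop(columns="Survived")   # input features
y = df["Survived"]                # target variable

# Assumption: hold out 20% for testing, keeping the class balance in both parts.
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)
print(xtrain.shape, xtest.shape)
```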


## **Models**

In [105]:
# Import RandomForestClassifier 
from sklearn.ensemble import RandomForestClassifier

In [167]:
# create an instance of the model and fit the model
rfmodel = RandomForestClassifier(n_estimators=100, min_samples_leaf=5)
rfmodel.fit(xtrain,ytrain)

In [165]:
# get feature importance values
rfmodel.feature_importances_

In [166]:
# get feature names
rfmodel.feature_names_in_

In [None]:
# plot a horizontal bar plot
plt.figure(figsize=(7,8))
plt.barh(rfmodel.feature_names_in_,rfmodel.feature_importances_)
plt.title('Feature Importance Plot')
plt.show()

## Save Model

In [117]:
# import pickle to save model
import pickle

In [118]:
# Save the trained model to disk
with open('Model', 'wb') as f:
    pickle.dump(rfmodel, f)

## Take away exercise

1. Train a Neural Network on this dataset
2. Load the saved model and use it on the test set to make predictions
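
As a starting point for the first exercise, here is a minimal sketch of an `MLPClassifier` wrapped in a pipeline with `StandardScaler`, since neural networks tend to train better on features of similar scale. The data is synthetic, and the layer size, iteration limit, and seeds are arbitrary assumptions rather than tuned values.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 4 numeric features, binary target.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
y = (x[:, 0] + x[:, 1] > 0).astype(int)

# Scale the inputs, then fit a small one-hidden-layer network.
nnmodel = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
)
nnmodel.fit(x[:150], y[:150])

# Accuracy on the 50 held-out rows.
accuracy = nnmodel.score(x[150:], y[150:])
print(f"accuracy: {accuracy:.2f}")
```

On the Titanic data, `xtrain`/`ytrain` and `xtest`/`ytest` would take the place of the synthetic slices, and the resulting accuracy can be compared against the 80% success criterion.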

In [144]:
# Load the pickled model
with open('Model', 'rb') as f:
    rfmodel = pickle.load(f)
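
For the second exercise, a self-contained sketch of the save, load, and predict round trip is shown below. It trains a small forest on synthetic data and writes to a temporary path, so the model, data, and file location are all stand-ins for the notebook's `rfmodel`, `xtest`, and `'Model'` file.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the Titanic features).
rng = np.random.default_rng(0)
xtrain = rng.normal(size=(60, 3))
ytrain = (xtrain[:, 0] > 0).astype(int)
xtest = rng.normal(size=(10, 3))

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(xtrain, ytrain)

# Save and reload; `with` closes the file handles automatically.
path = os.path.join(tempfile.mkdtemp(), "Model")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)

# The reloaded model predicts exactly like the original one.
predictions = loaded.predict(xtest)
print(predictions)
```

The test set must go through the same preprocessing (dropped columns, imputation, dummy variables) as the training set before calling `predict`.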