# Predicting survival on the Titanic using Machine Learning

In this part, we will build machine learning models to get a better idea of the various algorithms that we learned in previous parts.

Here, we will use the Titanic dataset from Kaggle and try to keep things as simple as possible. To download the dataset or learn more about it, <a href="https://www.kaggle.com/c/titanic/data">click here</a>. If you are a beginner in this field, check out my tutorials on machine learning: <a href="https://savan77.github.io/blog/machine-learning-part1.html">Part - 1</a>, <a href="https://savan77.github.io/blog/machine-learning-part2.html">Part - 2</a> and <a href="https://savan77.github.io/blog/machine-learning-part3.html">Part - 3</a>.

## 1- Exploring dataset

When building machine learning models, the first thing to do is explore the dataset. Here, we will look at how many samples and features (columns) we have and whether any sample contains null values. Some features may turn out to be useless, and you may need to perform preprocessing operations such as normalization.

In [1]:
#import dependencies
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
#read data set
df = pd.read_csv("data/Titanic/train.csv")

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
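
The `count` row above already hints at missing data: `Age` has 714 values while the other columns have 891. As a minimal sketch, here is one way to check the shape, column types and null counts directly. The `toy` frame below is a hypothetical stand-in with made-up rows, not the real `train.csv`.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for train.csv (hypothetical values, not the real data)
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Cabin": [None, "C85", None, "C123", None],
})

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # column types; object usually means text

missing = toy.isnull().sum()   # number of null values in each column
print(missing)
```

Running the same three calls on `df` should show which columns need attention before training.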


## 2- Preprocessing

For simplicity, let us use a few features and train a model on them. Later in this notebook, we will see how we can use feature engineering to create better features for our model.

In [5]:
#prepare data set
X = pd.DataFrame()
X['Pclass'] = df['Pclass']
X['Sex'] = df['Sex']
X['Age'] = df['Age']
X['Survived'] = df['Survived']
X = X.dropna(axis=0)

In [6]:
X.head()

Unnamed: 0,Pclass,Sex,Age,Survived
0,3,male,22.0,0
1,1,female,38.0,1
2,3,female,26.0,1
3,1,female,35.0,1
4,3,male,35.0,0


In [7]:
#separate data and target vars
y = X['Survived'] #save the target (dependent) var first; once we drop it from X we won't be able to get it back
X = X.drop(['Survived'],axis=1)

In [8]:
#let's make sure
X.head()

Unnamed: 0,Pclass,Sex,Age
0,3,male,22.0
1,1,female,38.0
2,3,female,26.0
3,1,female,35.0
4,3,male,35.0


In [9]:
X['Sex'] = pd.get_dummies(X['Sex'])['male'].astype(int) #1 for male, 0 for female
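
To see what the encoding step does, here is a small sketch on a hypothetical four-row `Sex` column. `pd.get_dummies` builds one column per category, and we keep only the `male` column. The direct comparison at the end is an alternative that should give the same result.

```python
import pandas as pd

# Hypothetical four-row column, only for illustration
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

dummies = pd.get_dummies(sex)           # one column per category: 'female' and 'male'
is_male = dummies["male"].astype(int)   # 1 for male, 0 for female

# Same encoding without building the full one-hot table
is_male_direct = (sex == "male").astype(int)

print(dummies)
print(is_male.tolist())
```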

In [10]:
scaler = StandardScaler()
X = scaler.fit_transform(X)  #why do I need to do this? see http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42)
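
The cells above scale the full dataset and then split it, so the scaler's mean and standard deviation also include the test rows. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; `X_toy` and `y_toy` are hypothetical stand-ins, not the Titanic features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the Titanic features)
rng = np.random.RandomState(42)
X_toy = rng.normal(loc=30.0, scale=14.0, size=(100, 2))
y_toy = rng.randint(0, 2, size=100)

# Split first...
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# ...then learn the scaling statistics from the training rows only
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # fit on train, then transform train
X_te_scaled = scaler.transform(X_te)       # reuse the train mean/std on the test rows
```

With this order, the test rows stay unseen until evaluation, which usually gives a more honest estimate of how the model will do on new data.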

## 3- Train a model

In [12]:
from sklearn.linear_model import LogisticRegression

In [13]:
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [14]:
#checking accuracy on training dataset
model.score(X_train, y_train)

0.7995991983967936

In [15]:
pred = model.predict(X_test)

In [16]:
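#a better metric for binary classification is the area under the ROC curve (AUC)
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_test, pred) #note: pred holds hard 0/1 labels; AUC is usually computed from scores such as model.predict_proba(X_test)[:, 1]
print(auc)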

0.7586944890315677
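
The AUC above is computed from hard 0/1 predictions, so the ROC curve has a single threshold and the value reduces to the average of the recall on each class. AUC is more commonly computed from predicted probabilities, which lets the curve sweep over every threshold. Here is a minimal sketch of both on synthetic data; `X_toy`, `y_toy` and `clf` are hypothetical names, not objects from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the Titanic features)
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(300, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
clf = LogisticRegression().fit(X_tr, y_tr)

hard = clf.predict(X_te)                 # hard 0/1 labels
scores = clf.predict_proba(X_te)[:, 1]   # predicted probability of class 1

auc_hard = roc_auc_score(y_te, hard)     # single-threshold AUC
auc_scores = roc_auc_score(y_te, scores) # AUC over all thresholds

# With hard labels, AUC equals the mean of the two per-class recalls
mean_recall = 0.5 * (recall_score(y_te, hard, pos_label=1)
                     + recall_score(y_te, hard, pos_label=0))

print(auc_hard, mean_recall, auc_scores)
```

In this notebook, the equivalent call would be `roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])`.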


In [17]:
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))
#Precision is the fraction of samples predicted as a class that truly belong to that class.
#Recall is the fraction of samples truly in a class that the classifier found.
#The f1-score is the harmonic mean of precision and recall.
#The support is the number of true samples in each class.


             precision    recall  f1-score   support

          0       0.80      0.81      0.80       126
          1       0.72      0.71      0.72        89

avg / total       0.77      0.77      0.77       215
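
To make the columns of the report concrete, here is a small worked example that computes precision, recall, f1-score and support for class 1 by hand from a confusion matrix. The labels below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical labels, not taken from the Titanic test split
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_1 = tp / (tp + fp)   # of the rows predicted 1, how many are truly 1
recall_1 = tp / (tp + fn)      # of the rows that are truly 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)   # harmonic mean
support_1 = tp + fn            # number of rows that are truly 1

# scikit-learn's per-class values, for comparison
p, r, f, s = precision_recall_fscore_support(y_true, y_pred)
print(precision_1, recall_1, f1_1, support_1)
print(p[1], r[1], f[1], s[1])
```

Here 3 of the 5 rows predicted as class 1 are correct (precision 0.6), and 3 of the 4 true class-1 rows are found (recall 0.75).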



## Exercise

Try using another model, such as Random Forest. For logistic regression, try changing the `penalty` parameter or the regularization strength `C`. You can also perform other preprocessing on the dataset.
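
If you are not sure where to start, here is one possible starting point, sketched on synthetic data rather than the Titanic features. `X_toy` and `y_toy` are hypothetical, and the values of `n_estimators` and `C` are arbitrary choices to experiment with, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the Titanic features)
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(300, 3))
y_toy = (X_toy[:, 0] - X_toy[:, 1] + 0.5 * rng.normal(size=300) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# A different model: random forest
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
forest_acc = forest.score(X_te, y_te)
print("random forest:", forest_acc)

# Logistic regression with different regularization strengths
# (a smaller C means stronger regularization)
logreg_acc = {}
for C in (0.01, 1.0, 100.0):
    logreg = LogisticRegression(C=C, penalty="l2").fit(X_tr, y_tr)
    logreg_acc[C] = logreg.score(X_te, y_te)
    print("logistic regression, C =", C, ":", logreg_acc[C])
```

Swap in the `X_train`, `X_test`, `y_train` and `y_test` from this notebook to compare the models on the Titanic data.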