# Titanic: Machine Learning from Disaster

![picture.jpg](image/picture.jpg)

Titanic was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912, after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking one of modern history's deadliest peacetime commercial marine disasters.

# contents

# 1. Import Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
sns.set() 

## Loading Dataset

In [None]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

## Looking into the train dataset

In [None]:
train.head()

## Total rows and columns (train)

In [None]:
train.shape

## Describing Dataset (train data)

In [None]:
train.describe()

In [None]:
train.describe(include=['O'])

In [None]:
train.info()

In [None]:
train.isnull().sum()

## Looking into the test dataset

In [None]:
test.head()

## Total rows and columns (test)

In [None]:
test.shape

## Describing Dataset (test data)

In [None]:
test.describe()

In [None]:
test.describe(include=['O'])

In [None]:
test.info()

In [None]:
test.isnull().sum()

# 2. Visualization

## Relationship between Features and Survival

In [None]:
survived = train[train['Survived']==1]
not_survived=train[train['Survived']==0]

print("Survived: %i (%.1f%%)" % (len(survived), float(len(survived)) / len(train) * 100.0))
print("Not Survived: %i (%.1f%%)" % (len(not_survived), float(len(not_survived)) / len(train) * 100.0))
print("Total: %i" % len(train))

## Pclass vs Survival

In [None]:
train.Pclass.value_counts()

In [None]:
train.groupby('Pclass').Survived.value_counts()

In [None]:
train[['Pclass','Survived']].groupby(['Pclass'],as_index=False).mean()

In [None]:
sns.barplot(x='Pclass',y='Survived',data=train)

## Sex vs Survival

In [None]:
train.Sex.value_counts()

In [None]:
train.groupby('Sex').Survived.value_counts()

In [None]:
train[['Sex','Survived']].groupby(['Sex'],as_index=False).mean()

In [None]:
sns.barplot(x='Sex',y='Survived',data=train)

## Pclass & Sex vs Survival

In [None]:
tab = pd.crosstab(train['Pclass'],train['Sex'])
print(tab)
tab.div(tab.sum(1).astype(float),axis=0).plot(kind="bar",stacked=True)
plt.xlabel('Pclass')
plt.ylabel('Percentage')

In [None]:
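# seaborn 0.9 renamed factorplot to catplot and `size` to `height`; kind='point' keeps the old default plot type
sns.catplot(x='Sex', y='Survived', hue='Pclass', kind='point', height=4, aspect=2, data=train)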

From the above plot, it can be seen that:

    * Women from the 1st and 2nd Pclass have an almost 100% chance of survival.
    * Men from the 2nd and 3rd Pclass have only around a 10% chance of survival.
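
The rates read off the plot can also be checked numerically. The following is a minimal sketch, assuming a DataFrame with the same `Sex`, `Pclass` and `Survived` columns as `train`; the `sample` frame and its values are made up for illustration and are not the Kaggle data.

```python
import pandas as pd

# Tiny hand-made sample (NOT the Kaggle data), used only to show the call.
sample = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Pclass":   [1, 2, 3, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the survival rate of each (Sex, Pclass) group.
rates = sample.pivot_table(index="Sex", columns="Pclass",
                           values="Survived", aggfunc="mean")
print(rates)
```

Replacing `sample` with `train` should give one survival rate per combination of sex and passenger class.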

### Pclass, Sex & Embarked vs Survival

In [None]:
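# seaborn 0.9 renamed factorplot to catplot; kind='point' keeps the old default plot type
sns.catplot(x='Pclass', y='Survived', hue='Sex', col='Embarked', kind='point', data=train)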

From the above plot, it can be seen that:

    1. Almost all females from Pclass 1 and 2 survived.
    2. Females who died were mostly from the 3rd Pclass.
    3. Males from Pclass 1 have only a slightly higher survival chance than those from Pclass 2 and 3.

## Embarked vs Survived

In [None]:
train.Embarked.value_counts()

In [None]:
train.groupby('Embarked').Survived.value_counts()

In [None]:
train[['Embarked','Survived']].groupby(['Embarked'],as_index=False).mean()

In [None]:
sns.barplot(x='Embarked',y='Survived',data=train)

## Parch vs Survival

In [None]:
train.Parch.value_counts()

In [None]:
train.groupby('Parch').Survived.value_counts()

In [None]:
train[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean()

In [None]:
sns.barplot(x='Parch',y='Survived',ci=None,data=train)
# ci=None will hide the error bar

## SibSp vs Survival

In [None]:
train.SibSp.value_counts()

In [None]:
train.groupby('SibSp').Survived.value_counts()

In [None]:
train[['SibSp','Survived']].groupby(['SibSp'],as_index=False).mean()

In [None]:
sns.barplot(x='SibSp',y='Survived',ci=None,data=train)
#ci=None will hide the error bar

## Age vs Survival

In [None]:
fig = plt.figure(figsize=(15,5))
ax1 = fig.add_subplot(131)
ax2 = fig.add_subplot(132)
ax3 = fig.add_subplot(133)

sns.violinplot(x="Embarked", y="Age", hue="Survived", data=train, split=True, ax=ax1)
sns.violinplot(x="Pclass", y="Age", hue="Survived", data=train, split=True, ax=ax2)
sns.violinplot(x="Sex", y="Age", hue="Survived", data=train, split=True, ax=ax3)



From the Pclass violinplot, we can see that:

    1. The 1st Pclass has very few children compared to the other two classes.
    2. The 1st Pclass has more old people than the other two classes.
    3. Almost all children (aged 0 to 10) of the 2nd Pclass survived.
    4. Most children of the 3rd Pclass survived.
    5. In the 1st Pclass, younger people survived more often than older people.

From the Sex violinplot, we can see that:

    1. Most male children (aged 0 to 14) survived.
    2. Females aged 18 to 40 have a better chance of survival.



In [None]:
total_survived = train[train['Survived']==1]
total_not_survived = train[train['Survived']==0]
male_survived = train[(train['Survived']==1) & (train['Sex']=="male")]
female_survived = train[(train['Survived']==1) & (train['Sex']=="female")]
male_not_survived = train[(train['Survived']==0) & (train['Sex']=="male")]
female_not_survived = train[(train['Survived']==0) & (train['Sex']=="female")]

plt.figure(figsize=[15,5])
plt.subplot(111)
sns.distplot(total_survived['Age'].dropna().values, bins=range(0, 81, 1), kde=False, color='blue')
sns.distplot(total_not_survived['Age'].dropna().values, bins=range(0, 81, 1), kde=False, color='red', axlabel='Age')

plt.figure(figsize=[15,5])

plt.subplot(121)
sns.distplot(female_survived['Age'].dropna().values, bins=range(0, 81, 1), kde=False, color='blue')
sns.distplot(female_not_survived['Age'].dropna().values, bins=range(0, 81, 1), kde=False, color='red', axlabel='Female Age')

plt.subplot(122)
sns.distplot(male_survived['Age'].dropna().values, bins=range(0, 81, 1), kde=False, color='blue')
sns.distplot(male_not_survived['Age'].dropna().values, bins=range(0, 81, 1), kde=False, color='red', axlabel='Male Age')

From the above figures, we can see that:

1. Combining both male and female, children aged 0 to 5 have a better chance of survival.
2. Females aged "18 to 40" and "50 and above" have a higher chance of survival.
3. Males aged 0 to 14 have a better chance of survival.

## Correlating Features

In [None]:
plt.figure(figsize=(15,6))
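# Keep only numeric columns: recent pandas versions raise an error when corr() meets text columns such as Name
sns.heatmap(train.drop('PassengerId', axis=1).select_dtypes(include='number').corr(), vmax=0.6, square=True, annot=True)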



Heatmap of Correlation between different features:

    Positive numbers = positive correlation, i.e. when one feature increases, the other tends to increase as well.

    Negative numbers = negative correlation, i.e. when one feature increases, the other tends to decrease.

In our case, we focus on which features have strong positive or negative correlation with the Survived feature.
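
To read the Survived column of the heatmap as a ranked list, the correlations can be sorted by magnitude. This is a minimal sketch on a made-up numeric frame; `sample`, `corr_with_target` and `ranked` are illustrative names, and the values are not the Kaggle data.

```python
import pandas as pd

# Made-up numeric sample (NOT the Kaggle data); column names mirror `train`.
sample = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0],
    "Pclass":   [1, 1, 3, 3, 2, 3],
    "Fare":     [80.0, 60.0, 8.0, 7.5, 20.0, 9.0],
    "SibSp":    [0, 1, 0, 3, 1, 0],
})

# Correlation of every feature with the target, without the target itself.
corr_with_target = sample.corr()["Survived"].drop("Survived")

# Order by absolute value so strong negative correlations rank as high as strong positive ones.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Sorting by absolute value matters here because a strongly negative feature is as informative as a strongly positive one.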


# 3. Cleaning Data

### Feature Extraction

In this section, we select the appropriate features to train our classifier. Here, we create new features based on existing features. We also convert categorical features into numeric form.

### Name Feature

Let's first extract titles from Name column.

In [None]:
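train_test_data = [train, test]  # list holding both DataFrames so each step is applied to both

for dataset in train_test_data:
    # Raw string avoids an invalid-escape warning; expand=False returns a Series
    dataset['Title'] = dataset.Name.str.extract(r'([A-Za-z]+)\.', expand=False)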

In [None]:
train.head()

As you can see above, the train dataset now has a new column named Title that holds the title found in each passenger's name.
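
To see what the regular expression captures, here is a small sketch on three names written in the same "Surname, Title. Given names" layout as the Name column. The `names` Series is a stand-in for `train['Name']`.

```python
import pandas as pd

# Names in the same "Surname, Title. Given names" layout as the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
])

# One capture group: a run of letters immediately followed by a literal dot.
# str.extract keeps the first match in each string.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

The surname is skipped because it is followed by a comma rather than a dot, so the first match is the title.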

In [None]:
pd.crosstab(train['Title'],train['Sex'])

In [None]:
for dataset in train_test_data:
    dataset['Title']= dataset['Title'].replace(['Lady','Countess','Capt','Col',\
                            'Don','Dr','Major','Rev','Sir','Jonkheer','Dona'],'Other')
    dataset['Title']=dataset['Title'].replace('Mlle','Miss')
    dataset['Title']=dataset['Title'].replace('Ms','Miss')
    dataset['Title']=dataset['Title'].replace('Mme','Mrs')
train[['Title','Survived']].groupby(['Title'],as_index=False).mean()

In [None]:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Other": 5}
for dataset in train_test_data:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

In [None]:
train.head()

### Sex Feature

We convert the categorical value of Sex into numeric. We represent female as 1 and male as 0.

In [None]:
for dataset in train_test_data:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

In [None]:
train.head()

### Embarked Feature

Some rows have no value in the Embarked column. The missing values show up as nan in the list below.

In [None]:
train.Embarked.unique()

In [None]:
train.Embarked.value_counts()

In [None]:
for dataset in train_test_data:
    dataset['Embarked']= dataset['Embarked'].fillna('S')

In [None]:
train.head()

We now convert the categorical value of Embarked into numeric. We represent 0 as S, 1 as C and 2 as Q.

In [None]:
for dataset in train_test_data:
    #print(dataset.Embarked.unique())
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)


In [None]:
train.head()

### Age Feature

We first fill the NULL values of Age with a random number between (mean_age - std_age) and (mean_age + std_age).

We then create a new column named AgeBand. This categorizes age into five equal-width age ranges.

In [None]:
for dataset in train_test_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    
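    # Cast the bounds to int explicitly; randint draws from [low, high)
    age_null_random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null_count)
    # Assign through .loc: chained assignment may fail to update the DataFrame
    dataset.loc[dataset['Age'].isnull(), 'Age'] = age_null_random_list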
    dataset['Age'] = dataset['Age'].astype(int)
    
train['AgeBand'] = pd.cut(train['Age'], 5)

print (train[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean())
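
Because the fill values are random, the AgeBand table can change from run to run. The sketch below shows one way to make the imputation reproducible, assuming a seeded NumPy generator is acceptable. The `ages` Series and the seed value are made up for illustration and do not come from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up ages with gaps (NOT the Kaggle data).
ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0, 54.0, np.nan])

rng = np.random.default_rng(seed=0)  # fixed seed so reruns give the same fill values
low = int(ages.mean() - ages.std())
high = int(ages.mean() + ages.std())

missing = ages.isnull()
filled = ages.copy()
# integers() draws from [low, high), so `high` itself is never produced.
filled[missing] = rng.integers(low, high, size=missing.sum())
filled = filled.astype(int)
print(filled.tolist())
```

The known ages are left untouched; only the missing positions receive a draw from within one standard deviation of the mean.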

In [None]:
train.head()

In [None]:
for dataset in train_test_data:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4

In [None]:
train.head()

### Fare Feature

Replace missing Fare values with the median Fare of the training set.

In [None]:
for dataset in train_test_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())

In [None]:
train['FareBand'] = pd.qcut(train['Fare'], 4)
print (train[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean())

In [None]:
train.head()

In [None]:
for dataset in train_test_data:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
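
The same banding can be written as a single `pd.cut` call with explicit edges. This is a sketch of an alternative, not the notebook's method; the `fares` values are made up, and the edges are the thresholds used in the cell above.

```python
import numpy as np
import pandas as pd

fares = pd.Series([7.25, 7.91, 8.05, 14.454, 26.0, 31.0, 71.28])

# Right-closed bins: (-inf, 7.91], (7.91, 14.454], (14.454, 31], (31, inf]
edges = [-np.inf, 7.91, 14.454, 31, np.inf]

# labels=False returns the integer bin index instead of an interval label.
fare_band = pd.cut(fares, bins=edges, labels=False).astype(int)
print(fare_band.tolist())
```

Keeping the edges in one list avoids the risk of a later `.loc` assignment matching values that an earlier one has already replaced with a band number.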

In [None]:
train.head()

### SibSp & Parch Feature

Combining SibSp & Parch feature, we create a new feature named FamilySize.

In [None]:
for dataset in train_test_data:
    dataset['FamilySize'] = dataset['SibSp'] +  dataset['Parch'] + 1

print (train[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean())

The above data shows that:

    1. A FamilySize of 2 to 4 has a better survival chance.
    2. FamilySize = 1, i.e. travelling alone, has a lower survival chance.
    3. A large FamilySize (5 and above) also has a lower survival chance.

Let's create a new feature named IsAlone. It is used to compare the survival chance of passengers travelling alone with that of passengers travelling with family.


In [None]:
for dataset in train_test_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
    
print (train[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean())

This shows that passengers travelling alone have only about a 30% chance of survival.

In [None]:
train.head(1)

### Feature Selection

We drop unnecessary columns/features and keep only the useful ones for our experiment. Column PassengerId is only dropped from Train set because we need PassengerId in Test set while creating Submission file to Kaggle.


In [None]:
features_drop = ['Name', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'FamilySize']
train = train.drop(features_drop, axis=1)
test = test.drop(features_drop, axis=1)
train = train.drop(['PassengerId', 'AgeBand', 'FareBand'], axis=1)


In [None]:
train.head()

In [None]:
test.head()

## Classification & Accuracy

In [None]:
X_train = train.drop('Survived', axis=1)
y_train = train['Survived']
X_test = test.drop("PassengerId", axis=1).copy()

X_train.shape, y_train.shape, X_test.shape

There are many classifying algorithms present. Among them, we choose the following Classification algorithms for our problem:

    1.Logistic Regression
    2.Support Vector Machines (SVC)
    3.Linear SVC
    4.k-Nearest Neighbor (KNN)
    5.Decision Tree
    6.Random Forest
    7.Naive Bayes (GaussianNB)
    8.Perceptron
    9.Stochastic Gradient Descent (SGD)

Here's the training and testing procedure:

    First, we train these classifiers with our training data.

    After that, using the trained classifier, we predict the Survival outcome of the test data.

    Finally, we calculate the accuracy score (in percent) of the trained classifier.

Please note that the accuracy score is computed on the training dataset itself, because the test set has no Survived labels. It therefore tends to overestimate how well a classifier will do on unseen passengers.
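
One common way to get a less optimistic estimate is k-fold cross-validation. The sketch below is not part of the original workflow: it uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, and the choice of 5 folds is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (NOT the Titanic features).
X, y = make_classification(n_samples=300, n_features=7, n_informative=4,
                           random_state=0)

clf = DecisionTreeClassifier(random_state=0)

# Accuracy on the same rows the tree was fit on.
train_acc = clf.fit(X, y).score(X, y)

# Mean accuracy over 5 folds, each scored on rows held out from fitting.
cv_acc = cross_val_score(clf, X, y, cv=5).mean()

print(f"training accuracy: {train_acc:.3f}")
print(f"5-fold CV accuracy: {cv_acc:.3f}")
```

Swapping in `X_train` and `y_train` would give a cross-validated score for each classifier that can be compared with its training score.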


# 4. Choosing Best Model

In [None]:
# Importing Classifier Modules
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier

## Logistic Regression

In [None]:
clf = LogisticRegression()
clf.fit(X_train,y_train)
y_pred_log_reg =clf.predict(X_test)
acc_log_reg = round( clf.score(X_train, y_train) * 100, 2)
print (str(acc_log_reg) + ' percent')

## Support Vector Machine (SVM)

In [None]:
clf = SVC()
clf.fit(X_train, y_train)
y_pred_svc = clf.predict(X_test)
acc_svc = round(clf.score(X_train, y_train) * 100, 2)
print (acc_svc)

## Linear SVM

In [None]:
clf = LinearSVC()
clf.fit(X_train, y_train)
y_pred_linear_svc = clf.predict(X_test)
acc_linear_svc = round(clf.score(X_train, y_train) * 100, 2)
print (acc_linear_svc)

## Nearest Neighbors

In [None]:
clf = KNeighborsClassifier(n_neighbors = 3)
clf.fit(X_train, y_train)
y_pred_knn = clf.predict(X_test)
acc_knn = round(clf.score(X_train, y_train) * 100, 2)
print (acc_knn)

## Decision Tree

In [None]:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred_decision_tree = clf.predict(X_test)
acc_decision_tree = round(clf.score(X_train, y_train) * 100, 2)
print (acc_decision_tree)

## Random Forest

In [None]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred_random_forest = clf.predict(X_test)
acc_random_forest = round(clf.score(X_train, y_train) * 100, 2)
print (acc_random_forest)

## Gaussian Naive Bayes

In [None]:
clf = GaussianNB()
clf.fit(X_train, y_train)
y_pred_gnb = clf.predict(X_test)
acc_gnb = round(clf.score(X_train, y_train) * 100, 2)
print (acc_gnb)

## Perceptron

In [None]:
clf = Perceptron(max_iter=5, tol=None)
clf.fit(X_train, y_train)
y_pred_perceptron = clf.predict(X_test)
acc_perceptron = round(clf.score(X_train, y_train) * 100, 2)
print (acc_perceptron)

## Stochastic Gradient Descent (SGD)

In [None]:
clf = SGDClassifier(max_iter=5, tol=None)
clf.fit(X_train, y_train)
y_pred_sgd = clf.predict(X_test)
acc_sgd = round(clf.score(X_train, y_train) * 100, 2)
print (acc_sgd)

## Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix
import itertools

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred_random_forest_training_set = clf.predict(X_train)
acc_random_forest = round(clf.score(X_train, y_train) * 100, 2)
print ("Accuracy: %i %% \n"%acc_random_forest)

class_names = ['Not Survived', 'Survived']  # same order as the labels 0, 1

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_train, y_pred_random_forest_training_set)
np.set_printoptions(precision=2)

print ('Confusion Matrix in Numbers')
print (cnf_matrix)
print ('')

cnf_matrix_percent = cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, np.newaxis]

print ('Confusion Matrix in Percentage')
print (cnf_matrix_percent)
print ('')

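# confusion_matrix orders rows and columns by sorted label: 0 (not survived) first, then 1 (survived)
true_class_names = ['True Not Survived', 'True Survived']
predicted_class_names = ['Predicted Not Survived', 'Predicted Survived']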

df_cnf_matrix = pd.DataFrame(cnf_matrix, 
                             index = true_class_names,
                             columns = predicted_class_names)

df_cnf_matrix_percent = pd.DataFrame(cnf_matrix_percent, 
                                     index = true_class_names,
                                     columns = predicted_class_names)

plt.figure(figsize = (15,5))

plt.subplot(121)
sns.heatmap(df_cnf_matrix, annot=True, fmt='d')

plt.subplot(122)
sns.heatmap(df_cnf_matrix_percent, annot=True)
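
scikit-learn orders the rows and columns of a confusion matrix by sorted label value unless `labels` is given, so with 0/1 targets the first row and column belong to class 0. A small sketch with made-up labels shows the layout:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1]   # 0 = not survived, 1 = survived
y_pred = [0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
# cm[0, 0]: true 0 predicted 0    cm[0, 1]: true 0 predicted 1
# cm[1, 0]: true 1 predicted 0    cm[1, 1]: true 1 predicted 1
```

Passing `labels` explicitly makes the order visible in the code, which helps when naming the rows and columns of a heatmap.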

## Comparing Models

Let's compare the accuracy score of all the classifier models used above.

In [None]:
models = pd.DataFrame({
    'Model': ['Logistic Regression', 'Support Vector Machines', 'Linear SVC', 
              'KNN', 'Decision Tree', 'Random Forest', 'Naive Bayes', 
              'Perceptron', 'Stochastic Gradient Descent'],
    
    'Score': [acc_log_reg, acc_svc, acc_linear_svc, 
              acc_knn,  acc_decision_tree, acc_random_forest, acc_gnb, 
              acc_perceptron, acc_sgd]
    })

models.sort_values(by='Score', ascending=False)

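From the above table, we can see that the Decision Tree and Random Forest classifiers have the highest accuracy score. Keep in mind that these are training-set scores, which favor flexible models such as trees that can fit the training data very closely.

Among these two, we choose the Random Forest classifier because averaging many trees limits overfitting compared to a single Decision Tree.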
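
A simple way to look at overfitting is to hold out part of the labelled data and compare training and validation accuracy for both models. This is a sketch under stated assumptions: the data is synthetic, and the 70/30 split, the `results` dictionary and the variable names are illustrative and not from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the Titanic features (NOT the real dataset).
X, y = make_classification(n_samples=500, n_features=7, n_informative=4,
                           random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3,
                                              random_state=0)

results = {}
for name, model in [
    ("Decision Tree", DecisionTreeClassifier(random_state=0)),
    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]:
    model.fit(X_fit, y_fit)
    # Store (training accuracy, validation accuracy) for each model.
    results[name] = (model.score(X_fit, y_fit), model.score(X_val, y_val))
    print(f"{name}: train={results[name][0]:.3f}  validation={results[name][1]:.3f}")
```

A large gap between the two numbers for a model suggests it has memorized the rows it was fit on. The same comparison on the Titanic features would show whether the Random Forest generalizes better here.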


# Create Submission File for Kaggle

In [None]:
test.head()

In [None]:
submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": y_pred_random_forest
    })


In [None]:
#submission.to_csv('Titanic_submission.csv', index=False)