# Kaggle - Titanic: Machine Learning from Disaster

The competition details are given below.

### The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

### Predicting the survival on the Titanic

**Prediction Results : 0.78947**

### Load Helpful Packages

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Load the Data 

In [None]:
train_data = pd.read_csv('../input/titanic/train.csv')
train_data.head(10)

In [None]:
test_data = pd.read_csv('../input/titanic/test.csv')
test_data.head(10)

In [None]:
train_data.shape

In [None]:
test_data.shape

In [None]:
train_data.info()

In [None]:
test_data.info()

### Check Missing Values 

In [None]:
# Checking Missing values in train_data
train_data.isnull().sum()

In [None]:
# Checking Missing values in test_data
test_data.isnull().sum()

In [None]:
train_data.columns

In [None]:
train_data.head()

In [None]:
test_data.head()

In [None]:
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].mean())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].mean())

In [None]:
train_data.isnull().sum()

In [None]:
test_data.isnull().sum()

In [None]:
sns.catplot(x = 'Embarked', kind = 'count', data = train_data)

In [None]:
# 'S' is the most frequent port in the count plot above, so use it for the missing Embarked values
train_data['Embarked'] = train_data['Embarked'].fillna("S")

In [None]:
train_data.isnull().sum()

In [None]:
train_data['Cabin'] = train_data['Cabin'].fillna("Missing")
test_data['Cabin'] = test_data['Cabin'].fillna("Missing")

In [None]:
train_data.isnull().sum()

In [None]:
test_data.isnull().sum()

In [None]:
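# Fill only the missing Fare values with the median; plain assignment would overwrite the whole column
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())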

In [None]:
train_data.isnull().sum()

In [None]:
test_data.isnull().sum()

#### No missing values left so we can proceed further
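
The cells above fill `Age` in the test set with the test set's own mean. A common alternative, shown here only as a sketch on made-up toy frames (`toy_train`, `toy_test` and their values are hypothetical, not the competition data), is to compute the fill values once on the training split and reuse them for the test split:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train_data / test_data (hypothetical values)
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 30.0], "Fare": [7.25, 71.28, 8.05]})
toy_test = pd.DataFrame({"Age": [np.nan, 40.0], "Fare": [np.nan, 13.0]})

# Compute the fill values on the training split only
fill_values = {"Age": toy_train["Age"].mean(), "Fare": toy_train["Fare"].median()}

# fillna with a dict fills each column with its own value and leaves observed values untouched
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)

print(toy_test_filled)
```

Whether this changes the leaderboard score for this notebook is not something the sketch can tell; it only illustrates the pattern.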

In [None]:
# Create dummy variables for Sex and Embarked, since they are categorical columns.
train_data = pd.get_dummies(train_data, columns=["Sex"], drop_first=True)
train_data = pd.get_dummies(train_data, columns=["Embarked"],drop_first=True)


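# Bin Fare and Age into ordinal bands.
# Note: astype(int) truncates the value first, so a fare such as 7.925 becomes 7 and falls into band 0.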
train_data['Fare'] = train_data['Fare'].astype(int)
train_data.loc[train_data.Fare<=7.91,'Fare']=0
train_data.loc[(train_data.Fare>7.91) &(train_data.Fare<=14.454),'Fare']=1
train_data.loc[(train_data.Fare>14.454)&(train_data.Fare<=31),'Fare']=2
train_data.loc[(train_data.Fare>31),'Fare']=3

train_data['Age']=train_data['Age'].astype(int)
train_data.loc[ train_data['Age'] <= 16, 'Age']= 0
train_data.loc[(train_data['Age'] > 16) & (train_data['Age'] <= 32), 'Age'] = 1
train_data.loc[(train_data['Age'] > 32) & (train_data['Age'] <= 48), 'Age'] = 2
train_data.loc[(train_data['Age'] > 48) & (train_data['Age'] <= 64), 'Age'] = 3
train_data.loc[train_data['Age'] > 64, 'Age'] = 4

In [None]:
# Create dummy variables for Sex and Embarked, since they are categorical columns.
test_data = pd.get_dummies(test_data, columns=["Sex"], drop_first=True)
test_data = pd.get_dummies(test_data, columns=["Embarked"],drop_first=True)


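# Bin Fare and Age into ordinal bands, using the same edges as for the training data.
# Note: astype(int) truncates the value first, so a fare such as 7.925 becomes 7 and falls into band 0.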
test_data['Fare'] = test_data['Fare'].astype(int)
test_data.loc[test_data.Fare<=7.91,'Fare']=0
test_data.loc[(test_data.Fare>7.91) &(test_data.Fare<=14.454),'Fare']=1
test_data.loc[(test_data.Fare>14.454)&(test_data.Fare<=31),'Fare']=2
test_data.loc[(test_data.Fare>31),'Fare']=3

test_data['Age']=test_data['Age'].astype(int)
test_data.loc[ test_data['Age'] <= 16, 'Age']= 0
test_data.loc[(test_data['Age'] > 16) & (test_data['Age'] <= 32), 'Age'] = 1
test_data.loc[(test_data['Age'] > 32) & (test_data['Age'] <= 48), 'Age'] = 2
test_data.loc[(test_data['Age'] > 48) & (test_data['Age'] <= 64), 'Age'] = 3
test_data.loc[test_data['Age'] > 64, 'Age'] = 4
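
The same band edges are repeated for the train and test frames in the two cells above. As a rough sketch, the bands could also be expressed once with `pd.cut`; the toy frame and the `AgeBand` / `FareBand` column names below are assumptions for illustration, not part of the notebook:

```python
import numpy as np
import pandas as pd

# Toy values chosen to sit on and around the band edges (hypothetical data)
toy = pd.DataFrame({
    "Age": [4.0, 16.0, 29.7, 48.0, 70.0],
    "Fare": [7.25, 7.925, 14.5, 31.0, 512.33],
})

# Same thresholds as the .loc assignments above, with open-ended outer bins
age_edges = [-np.inf, 16, 32, 48, 64, np.inf]
fare_edges = [-np.inf, 7.91, 14.454, 31, np.inf]

# right=True (the default) makes every bin (low, high], matching the <= comparisons;
# labels=False returns the integer band index instead of an Interval
toy["AgeBand"] = pd.cut(toy["Age"], bins=age_edges, labels=False)
toy["FareBand"] = pd.cut(toy["Fare"], bins=fare_edges, labels=False)

print(toy)
```

Note that this sketch bins the raw values, whereas the notebook truncates with `astype(int)` first, so rows close to an edge (such as a fare of 7.925) can land in a different band.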

In [None]:
# Ticket, Cabin and Name are free-text columns that are not used as features here and could mislead the models, so drop all three.
train_data.drop(['Ticket','Cabin','Name'],axis=1,inplace=True)
test_data.drop(['Ticket','Cabin','Name'],axis=1,inplace=True)

## Exploratory Data Analysis 

In [None]:
train_data.describe()

In [None]:
train_data.Survived.value_counts()/len(train_data)*100
# About 62% of the passengers in the training set died and about 38% survived.

In [None]:
train_data.groupby("Survived").mean()

In [None]:
train_data.groupby("Sex_male").mean()

#### Key points from the analysis
1. About 38% of passengers survived
2. About 74% of females and ~19% of males survived
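
The percentages above come from grouping on the dummy column `Sex_male` (0 = female, 1 = male after `drop_first=True`). A minimal sketch of reading the survival rate per group directly, using a small made-up frame rather than the real data:

```python
import pandas as pd

# Hypothetical toy data: four female rows (Sex_male == 0) and four male rows
toy = pd.DataFrame({
    "Sex_male": [0, 0, 0, 1, 1, 1, 1, 0],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 1],
})

# The mean of a 0/1 column is the share of ones, so this is the survival rate in percent
rates = toy.groupby("Sex_male")["Survived"].mean() * 100

print(rates)
```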

### Correlation between Variables

In [None]:
train_data.corr()

In [None]:
#Heatmap
plt.subplots(figsize=(10,8))
sns.heatmap(train_data.corr(),annot=True,cmap='Blues_r')
plt.title("Correlation Among Variables", fontsize = 20);

- Survived has a positive correlation of 0.3 with Fare
- Sex_male and Survived have a negative correlation of -0.54
- Pclass and Survived have a negative correlation of -0.34
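
Reading individual cells off the heatmap works, but the column for the target can also be pulled out and sorted. The sketch below does this on a tiny hypothetical frame (the values are invented purely to show the call pattern):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0],
    "Sex_male": [0, 0, 1, 1, 0, 1],
    "Pclass": [1, 2, 3, 3, 1, 2],
})

# Keep only the correlations with the target, drop its self-correlation, sort ascending
target_corr = toy.corr()["Survived"].drop("Survived").sort_values()

print(target_corr)
```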

In [None]:
sns.barplot(x="Sex_male",y="Survived",data=train_data)
plt.title("Gender Distribution - Survived", fontsize = 16)

##### Female passengers survived at a higher rate than male passengers, which is consistent with women and children being given priority

In [None]:
sns.barplot(x='Pclass',y='Survived',data=train_data)
plt.title("Passenger Class Distribution - Survived", fontsize = 16)

### Survival by passenger class
- About 63% of Passenger Class 1 survived
- About 47% of Passenger Class 2 survived
- Only about 24% of Passenger Class 3 survived

### Modeling Data 
###### I will model the data with the following approaches:
- Logistic Regression
- Support Vector Machine
- Decision Tree Classifier
- Random Forest Classifier
- K-Nearest Neighbour Classifier
- Gradient Boosting
- Random Forest tuned with GridSearchCV (a hyperparameter search rather than a separate model)

In [None]:
from sklearn.metrics import accuracy_score, log_loss
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [None]:
X = train_data.drop(['Survived'], axis=1)
y = train_data["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.22, random_state = 5)

In [None]:
print(len(X_train),len(X_test),len(y_train),len(y_test))
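
Since only about 38% of the passengers survived, a plain random split can leave the hold-out set with a slightly different class balance. One option, sketched here on synthetic arrays (`X_toy` and `y_toy` are placeholders, not the notebook's `X` and `y`), is the `stratify` argument of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 40% positive class
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 40 + [0] * 60)

# stratify=y_toy keeps the class proportions the same in both parts of the split
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=5, stratify=y_toy
)

print(len(y_tr), y_tr.mean(), len(y_va), y_va.mean())
```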

### Logistic Regression

In [None]:
#Logistic Regression
logReg = LogisticRegression()
logReg.fit(X_train,y_train)

In [None]:
logReg_predict = logReg.predict(X_test)
logReg_score = logReg.score(X_test,y_test)
# print("Logistic Regression Prediction :",logReg_predict)
print("Logistic Regression Score :",logReg_score)

In [None]:
print("Accuracy Score of Logistic Regression Model:")
print(metrics.accuracy_score(y_test,logReg_predict))
print("\n","Classification Report:")
print(metrics.classification_report(y_test,logReg_predict),'\n')
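
`X` still contains `PassengerId`, whose values are much larger than the banded features, and unscaled inputs can slow down the solver behind `LogisticRegression`. A possible variant, shown as a sketch on synthetic data (the data, the inflated first column and the name `scaled_logreg` are all assumptions), wraps the model in a pipeline with `StandardScaler`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; the first column is inflated to mimic an ID-like feature
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_syn[:, 0] = X_syn[:, 0] * 1000

X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.22, random_state=5)

# The scaler is fitted on the training part only and reused when scoring
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_a, y_a)
acc = scaled_logreg.score(X_b, y_b)

print("Scaled Logistic Regression Score :", acc)
```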

### Support Vector Machine

In [None]:
SVC_model = SVC(probability=True)
SVC_model.fit(X_train,y_train)

In [None]:
SVC_predict = SVC_model.predict(X_test)
SVC_score = SVC_model.score(X_test,y_test)
#print("Support Vector Classifier Prediction :",SVC_predict)
print("Support Vector Classifier Score :",SVC_score)

In [None]:
print("Accuracy Score of Support Vector Classifier SVC Model:")
print(metrics.accuracy_score(y_test,SVC_predict))
print("\n","Classification Report:")
print(metrics.classification_report(y_test,SVC_predict),'\n')

### Decision Tree Classifier

In [None]:
decisionTreeModel = DecisionTreeClassifier(max_leaf_nodes=17, random_state=0)
decisionTreeModel.fit(X_train, y_train)

In [None]:
decisionTree_predict = decisionTreeModel.predict(X_test)
decisionTree_score = decisionTreeModel.score(X_test,y_test)
#print("Decision Tree Classifier Prediction :",len(decisionTree_predict))
print("Decision Tree Classifier Score :",decisionTree_score)

In [None]:
print("Accuracy Score of Decision Tree Classifier Model:")
print(metrics.accuracy_score(y_test,decisionTree_predict))
print("\n","Classification Report:")
print(metrics.classification_report(y_test,decisionTree_predict),'\n')

### Random Forest Classifier

In [None]:
Random_forest = RandomForestClassifier(n_estimators=17)
Random_forest.fit(X_train,y_train)

In [None]:
randomForest_predict = Random_forest.predict(X_test)
randomForest_score = Random_forest.score(X_test,y_test)
# print("Random Forest Prediction :",randomForest_predict)
print("Random Forest Score :",randomForest_score)

In [None]:
print("Accuracy Score of Random Forest Classifier Model:")
print(metrics.accuracy_score(y_test,randomForest_predict))
print("\n","Classification Report:")
print(metrics.classification_report(y_test,randomForest_predict),'\n')

### K-Nearest Neighbours

In [None]:
KNN_model = KNeighborsClassifier(n_neighbors=37)
KNN_model.fit(X_train, y_train)

In [None]:
KNN_predict = KNN_model.predict(X_test)
KNN_score = KNN_model.score(X_test,y_test)
#print("KNN Classifier Prediction :",KNN_predict)
print("KNN Classifier Score :",KNN_score)

In [None]:
print("Accuracy Score of KNN Model:")
print(metrics.accuracy_score(y_test,KNN_predict))
print("\n","Classification Report:")
print(metrics.classification_report(y_test,KNN_predict),'\n')
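
The value `n_neighbors=37` is fixed by hand above. One way to compare a few candidates, sketched on synthetic data (the candidate list and data are assumptions for illustration), is cross-validated accuracy per value of k:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)

candidate_k = [3, 7, 15, 37]

# Mean 5-fold accuracy for each candidate number of neighbours
mean_scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X_syn, y_syn, cv=5).mean()
    for k in candidate_k
}
best_k = max(mean_scores, key=mean_scores.get)

print(mean_scores)
print("Best k on this toy data:", best_k)
```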

### Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gbk = GradientBoostingClassifier(random_state=101, n_estimators=150,min_samples_split=100, max_depth=6)
gbk.fit(X_train, y_train)

In [None]:
gbk_predict = gbk.predict(X_test)
gbk_score = gbk.score(X_test,y_test)
#print("Gradient Boosting Prediction :",gbk_predict)
print("Gradient Boosting Score :",gbk_score)

In [None]:
print("Accuracy Score of Gradient Boosting Model:")
print(metrics.accuracy_score(y_test,gbk_predict))

### GridSearchCV

In [None]:
from sklearn import ensemble
from sklearn.model_selection import GridSearchCV

In [None]:
GridList =[ {'n_estimators' : [10, 15, 20, 25, 30, 35, 40], 'max_depth' : [5,10,15, 20]},]
randomForest_ensemble = ensemble.RandomForestClassifier(random_state=31, max_features= 3)
gridSearchCV = GridSearchCV(randomForest_ensemble,GridList, cv = 5)

In [None]:
gridSearchCV.fit(X_train,y_train)

In [None]:
gridSearchCV_predict = gridSearchCV.predict(X_test)
gridSearchCV_score = gridSearchCV.score(X_test,y_test)
#print("Grid SearchCV Prediction :",gridSearchCV_predict)
print("Grid SearchCV Score :",gridSearchCV_score)
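
After fitting, a `GridSearchCV` object also exposes which parameter combination won and its cross-validated score. A small sketch on synthetic data with a reduced grid (the data and the smaller grid are assumptions made to keep the example quick):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

# Reduced version of the grid used above
param_grid = {"n_estimators": [10, 20], "max_depth": [5, 10]}
search = GridSearchCV(
    RandomForestClassifier(random_state=31, max_features=3), param_grid, cv=5
)
search.fit(X_syn, y_syn)

# best_params_ is the winning combination; best_score_ is its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)
```

`best_score_` is averaged over the cross-validation folds, so it is usually a steadier number than the score on a single hold-out split.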

In [None]:
from tabulate import tabulate

In [None]:
print(tabulate([['K-Nearest Neighbour', KNN_score],['Logistic Regression',logReg_score ],['Decision Tree',decisionTree_score ],['Random Forest',randomForest_score ],['SVC', SVC_score],['Gradient Boosting', gbk_score],['Grid SearchCV',gridSearchCV_score]], headers=['Model Algorithm', 'Score']))

### From the table above, the Random Forest tuned with GridSearchCV has the best accuracy on the held-out split
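
The scores in the table come from one train/validation split, so small differences between models may be down to the split itself. A hedged sketch of a comparison that averages over several folds, run on synthetic data (the data and the `candidates` dictionary are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_leaf_nodes=17, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=17, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=101),
}

# Mean and standard deviation of 5-fold accuracy per model
cv_summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_syn, y_syn, cv=5)
    cv_summary[name] = (scores.mean(), scores.std())

# Print the models from best to worst mean accuracy
for name, (mean_acc, std_acc) in sorted(
    cv_summary.items(), key=lambda kv: kv[1][0], reverse=True
):
    print(f"{name:<20} {mean_acc:.3f} +/- {std_acc:.3f}")
```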

#### Let's apply this model to our test data

## Prediction

#### Let's use the Random Forest tuned with GridSearchCV to predict on the test data

In [None]:
test_data.head()

In [None]:
#set ids as PassengerId and predict survival 
ids = test_data['PassengerId']
print(len(ids))
predictions = gridSearchCV.predict(test_data)
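
`predict` expects the test frame to have the same feature columns, in the same order, as the frame used for fitting. A minimal sketch of aligning the two before predicting, using tiny hypothetical frames (`train_features`, `test_features` and the placeholder predictions are not from the notebook):

```python
import pandas as pd

# Hypothetical frames: same columns, but the test frame has them in a different order
train_features = pd.DataFrame({"PassengerId": [1, 2], "Pclass": [3, 1], "Sex_male": [1, 0]})
test_features = pd.DataFrame({"Sex_male": [0, 1], "PassengerId": [892, 893], "Pclass": [2, 3]})

# Reorder the test columns to match the training columns
aligned_test = test_features.reindex(columns=train_features.columns)

# Placeholder predictions, standing in for model.predict(aligned_test)
placeholder_predictions = [0, 1]
submission = pd.DataFrame({
    "PassengerId": aligned_test["PassengerId"],
    "Survived": placeholder_predictions,
})

print(list(aligned_test.columns))
print(submission)
```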

In [None]:
# Collect the ids and predictions in a DataFrame; it is written to submission.csv in a later cell
output = pd.DataFrame({ 'PassengerId' : ids, 'Survived': predictions })

In [None]:
output.head(10) # Output preview

In [None]:
output.to_csv('submission.csv', index=False) # Submission csv file

I will keep updating this notebook.

**If you have any recommendations or suggestions, please share them in the comments below!**

Looking forward to hearing your views and suggestions :)

**If you feel the notebook is worth it, please upvote!**

**Thanks for reading :)**

**If you fork the notebook, don't forget to mention the author's name and the link below as well**
https://www.kaggle.com/samridhmathur/titanicdisaster-survivorprediction-78-94-score/