# Titanic Machine Learning - Comparison Between Logistic Regression, Random Forest and XGB

## Introduction & Problem Statement

As many of you probably know, the Titanic was an ocean liner that sank in the North Atlantic Ocean after striking an iceberg. The ship departed from Southampton and called at Cherbourg and Queenstown before heading for New York. The passengers were split into 3 different classes. A total of 2224 passengers and crew were on board but only 710 survived, a survival rate of about 32%. The goal of this project is to accurately predict which passengers survived based on the given data.


## Summary

In this kernel, three different machine learning techniques are used to predict whether a passenger survived the Titanic disaster. First, plots were created to better understand the data, and then the missing data was filled in. Various features were also engineered to improve the performance of the models. By validation score, the most accurate model was XGB, followed by the random forest and logistic regression.


# Contents

* [Preprocessing](#Preprocessing)
    - [Exploratory Data Analysis](#Plots)
    - [Filling in Missing Data](#Missing_data)
    - [Feature Engineering and Scaling](#feature_eng)
* [Machine Learning](#Ml_models)
    - [Logistic Regression](#log_reg)
    - [Random Forest](#Rand_Forest)
    - [XGB](#XGB)
* [Conclusion](#conclusion)

# Preprocessing <a id="Preprocessing"></a>

## Import Libraries and Data <a id="ImportLibrariesandData"></a>

In [None]:
import numpy as np 
import pandas as pd 
import os
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as mpatches
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from scipy import stats
from sklearn.model_selection import train_test_split

train_raw=pd.read_csv("/kaggle/input/titanic/train.csv")
y_train=train_raw['Survived']
test_raw=pd.read_csv("/kaggle/input/titanic/test.csv")
dfs=[train_raw, test_raw]


# Exploratory Data Analysis <a id="Plots"></a>

The data first needs to be understood better in order to:
1. Fill in missing data
1. Perform feature engineering

## **Preview Data**

In [None]:
train_raw.head()

## Understanding Missing Data

The counts below show which features are missing data:
- Age and Cabin have a large number of missing values ==> drop Cabin and fill in Age using information from other columns
- Embarked and Fare have only a few missing values ==> impute Embarked with the mode and Fare with the median
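As a rough illustration of the strategy above, the sketch below computes the share of missing entries per column and the two fill values on a small made-up frame. The `toy` frame and its values are assumptions for illustration only, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})

# Share of missing entries per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Mode for a categorical column, median for a numeric one.
embarked_mode = toy["Embarked"].mode()[0]
fare_median = toy["Fare"].median()
print(embarked_mode, fare_median)
```

Looking at shares rather than raw counts makes it easier to compare the train and test sets, which have different numbers of rows.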

In [None]:
print('Train Data',train_raw.isnull().sum(),' ',sep='\n\n')
print('Test Data',test_raw.isnull().sum(),sep='\n\n')

## **Correlation Plot**

- Overall, none of the pairs show a strong relationship. <br>
- The pairs with the highest correlations are Fare and Passenger Class (-0.55), # of Siblings/Spouses and # of Parents/Children (0.41), and Passenger Class and Age (-0.37). <br>
- The features most correlated with Survived are Passenger Class (-0.34) and then Fare (0.26).
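The numbers quoted above are pairwise Pearson correlations. A minimal sketch of how such a matrix, a lower-triangle mask and a ranking against the target can be produced is shown here on made-up values; the `toy` frame is an assumption and will not reproduce the figures above.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns; only the mechanics matter here.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3],
    "Fare":     [7.3, 71.3, 13.0, 8.1, 53.1, 8.5],
})

corr = toy.corr()  # Pearson correlation by default

# Boolean mask that hides the diagonal and upper triangle,
# so each pair appears only once in a heatmap.
mask = np.triu(np.ones_like(corr, dtype=bool))

# Rank the features by absolute correlation with the target.
ranked = corr["Survived"].drop("Survived").abs().sort_values(ascending=False)
print(ranked)
```

Sorting by absolute value keeps strongly negative relationships, such as the one between class and survival, near the top of the list.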

In [None]:
dftrain_cor = train_raw[['Survived', 'Pclass', 'Age','SibSp','Parch','Fare']].copy()
train_cor=dftrain_cor.corr()

mask = np.zeros_like(train_cor)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(7, 5))
    ax = sns.heatmap(train_cor, mask=mask, annot=True,linewidths=1, vmax=.3, square=True)

## **Bar Plots and Histograms**

The following charts visualize various statistics of the data. Some conclusions:
- More females survived than males
- Children (younger than 10) had a higher chance of survival
- Among passengers who boarded in Cherbourg, more survived than died
- A large number of males travelled in 3rd class
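The same kind of conclusion can also be read off as numbers instead of bars. The sketch below, on a made-up `toy` frame (an assumption, not the real data), computes a survival rate per group and the counts behind a count plot.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex":      ["female", "male", "female", "male", "male", "female"],
    "Pclass":   [1, 3, 3, 3, 1, 2],
    "Survived": [1, 0, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the survival rate within each group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)

# Counts per class and sex, i.e. the numbers behind a count plot.
counts = pd.crosstab(toy["Pclass"], toy["Sex"])
print(counts)
```

Rates are often easier to compare than raw counts when the groups differ a lot in size.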

In [None]:
plt.subplot2grid((1,2),(0,0))
sns.countplot(x='Survived',hue='Sex',data=train_raw)
plt.ylabel('Frequency')
plt.title("# of Survived")

plt.subplot2grid((1,2),(0,1))
sns.countplot(x='Embarked',hue='Survived',data=train_raw)
plt.ylabel(' ')
plt.title("# of Embarked")
plt.show()

plt.subplot2grid((1,2),(0,0))
sns.countplot(x='Pclass',hue='Sex',data=train_raw)
plt.ylabel('Frequency')
plt.title("# of Each Sex")

plt.subplot2grid((1,2),(0,1))
sns.countplot(x='Pclass',hue='Survived',data=train_raw)
plt.ylabel(' ')
plt.title("# of Survived")
plt.show()

# Age distribution split by survival (histplot replaces the deprecated distplot, seaborn >= 0.11)
fig, ax = plt.subplots()
sns.histplot(train_raw.Age[train_raw.Survived==1],ax=ax, color="#1f77b4", alpha=0.4)
sns.histplot(train_raw.Age[train_raw.Survived==0],ax=ax, color="#ff7f0e", alpha=0.4)
plt.title("Age Distribution")
plt.xlabel('Age')
survived_patch = mpatches.Patch(color='#1f77b4', label='Survived')
died_patch = mpatches.Patch(color='#ff7f0e', label='Died')
plt.legend(handles=[survived_patch, died_patch] ,loc='best')
plt.show()

# Fare distribution split by passenger class
fig, ax = plt.subplots()
sns.histplot(train_raw.Fare[train_raw.Pclass==1],color="#1f77b4", alpha=0.4, ax=ax)
sns.histplot(train_raw.Fare[train_raw.Pclass==2],color="#ff7f0e", alpha=0.4, ax=ax)
sns.histplot(train_raw.Fare[train_raw.Pclass==3],color="#2ca02c", alpha=0.4, ax=ax)
plt.title("Fare Distribution")
patch_1 = mpatches.Patch(color='#1f77b4', label='Class 1')
patch_2 = mpatches.Patch(color='#ff7f0e', label='Class 2')
patch_3 = mpatches.Patch(color='#2ca02c', label='Class 3')
plt.legend(handles=[patch_1, patch_2, patch_3])
plt.xlabel('Fare')
plt.show()

In [None]:
# Age distribution of survivors vs. non-survivors, one figure per passenger class
for pclass in (1, 2, 3):
    in_class = train_raw.Pclass == pclass
    fig, ax = plt.subplots()
    # Combine both conditions in one mask instead of chaining boolean indexers
    sns.histplot(train_raw.Age[in_class & (train_raw.Survived==1)],ax=ax, color="#FFA500", alpha=0.4)
    sns.histplot(train_raw.Age[in_class & (train_raw.Survived==0)],ax=ax, color="#00FFFF", alpha=0.4)
    survived_patch = mpatches.Patch(color='#FFA500', label=f'Class {pclass} - Survived')
    died_patch = mpatches.Patch(color='#00FFFF', label=f'Class {pclass} - Died')
    plt.legend(handles=[survived_patch, died_patch] ,loc='best')
    plt.xlabel('Age')
    plt.show()

# Filling in Missing Data <a id="Missing_data"></a>

Both the train and test datasets are missing values in the Age column. Since a lot of values are missing, filling them in with a general statistic such as the mean or mode would be inaccurate, so the age is instead predicted from other columns. User Allohvk <sup>1</sup> came up with the idea of using the title in the name to predict the age. Titles were grouped together since several of them only had a few entries. Note that the title 'Fchild' relies on the engineered family size feature being larger than 1. Since very few values were missing in the Embarked and Fare columns, the mode was used for Embarked and the median for Fare. After the missing values were filled in, one-hot encoding was performed on all non-numerical features.

<br> <sup>1</sup> The link to Allohvk's full notebook can be found here: https://www.kaggle.com/allohvk/titanic-missing-age-imputation-tutorial-advanced
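The title-based idea can be sketched as follows. This is a simplified illustration, not the notebook's exact procedure: the names and ages in `toy` are made up, the grouping uses only `Pclass` and `Title` (the notebook also groups by `Sex`), and `groupby(...).transform("mean")` is swapped in for the explicit loop.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; names and ages are placeholders.
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina",
             "Allen, Mr. William", "Moran, Mr. James", "Palsson, Master. Gosta"],
    "Pclass": [3, 1, 3, 3, 3, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan, 2.0],
})

# The title is the word that ends with a period, e.g. "Mr." -> "Mr".
toy["Title"] = toy["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Mean age per (Pclass, Title) group, broadcast back to every row.
group_mean = toy.groupby(["Pclass", "Title"])["Age"].transform("mean")

# Fill only the missing ages; observed ages are left untouched.
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy[["Name", "Title", "Age"]])
```

A group with no observed ages at all would still yield a missing value, so in practice a fallback (for example the overall mean) may be needed.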

# Feature Engineering and Scaling <a id="feature_eng"></a>

Various features were engineered. The following list outlines the features and how they were calculated:
- Fam_size - Family size, calculated as the number of siblings and spouses plus the number of parents and children plus one
- Alone - Binary feature indicating whether the family size is equal to one
- Ageclass - Product of age and class, an idea based on user Manav Sehgal's notebook <sup>2</sup>
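The three features listed above can be sketched on a small made-up frame. The values in `toy` are assumptions, and the raw age is used for `Ageclass` here to keep the example short.

```python
import pandas as pd

toy = pd.DataFrame({
    "SibSp": [1, 0, 3, 0],
    "Parch": [0, 0, 2, 1],
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, 4.0, 29.0],
})

# The passenger plus siblings/spouses plus parents/children.
toy["Fam_size"] = toy["SibSp"] + toy["Parch"] + 1

# 1 if the passenger travels alone, else 0.
toy["Alone"] = (toy["Fam_size"] == 1).astype(int)

# Interaction term: age multiplied by passenger class.
toy["Ageclass"] = toy["Age"] * toy["Pclass"]
print(toy)
```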

A Box-Cox transformation was also applied to the age and fare features since they were relatively skewed. Finally, a min-max scaler was applied to the Box-Cox features and the Ageclass feature.
 
<br> <sup>2</sup> The link to Manav Sehgal's full notebook can be found here: https://www.kaggle.com/startupsci/titanic-data-science-solutions
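To illustrate the Box-Cox and min-max steps, here is a minimal sketch on synthetic, right-skewed data. The arrays `train_fare` and `test_fare` are randomly generated stand-ins. The sketch follows one common convention, which is to estimate the Box-Cox lambda and the scaling range on the training data only and then reuse them for the test data.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic, right-skewed, strictly positive "fares".
train_fare = rng.lognormal(mean=3.0, sigma=1.0, size=200)
test_fare = rng.lognormal(mean=3.0, sigma=1.0, size=50)

# Estimate lambda on the training data only ...
train_box, lam = stats.boxcox(train_fare)
# ... and reuse it for the test data.
test_box = stats.boxcox(test_fare, lmbda=lam)

# Same idea for scaling: fit on train, apply to both.
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_box.reshape(-1, 1))
test_scaled = scaler.transform(test_box.reshape(-1, 1))

print(f"lambda = {lam:.3f}")
print("skew before:", stats.skew(train_fare), "after:", stats.skew(train_box))
```

With this convention the scaled test values may fall slightly outside the 0 to 1 range, which is expected and usually harmless.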

In [None]:
# Dictionary with all the titles
TitleDict = {"Capt": "Officer","Col": "Officer","Major": "Officer","Jonkheer": "Royalty", \
             "Don": "Royalty", "Sir" : "Royalty","Dr": "Royalty","Rev": "Royalty", \
             "Countess":"Royalty", "Mme": "Mrs", "Mlle": "Miss", "Ms": "Mrs","Mr" : "Mr", \
             "Mrs" : "Mrs","Miss" : "Miss","Master" : "Master","Lady" : "Royalty"}

for df in dfs:
    # Fill the few missing Embarked/Fare values with the training mode/median
    df['Embarked']=df['Embarked'].fillna(train_raw['Embarked'].mode()[0])
    df['Fare']=df['Fare'].fillna(train_raw['Fare'].median())
    # Extract the title (the word followed by a period) and group rare titles
    df['Title']=df.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
    df['Title']=df.Title.map(TitleDict)
    df['Fam_size']=df['SibSp']+df['Parch']+1
    # 'Miss' passengers travelling with parents/children are relabelled 'Fchild'
    df.loc[(df.Title=='Miss')&(df.Parch!=0)&(df.Fam_size>1),'Title']='Fchild'
    df['Alone']=(df['Fam_size']==1).astype(int)

# Mean training age per (Pclass, Sex, Title) group, used to fill missing ages
age_vals = dfs[0].groupby(['Pclass','Sex','Title'])['Age'].mean()
for df in dfs:
    vals=df[df["Age"].isnull()].index.tolist()
    for val in vals:
        df.loc[val,'Age']=age_vals[df.loc[val,'Pclass'],df.loc[val,'Sex'],df.loc[val,'Title']]

In [None]:
train=dfs[0]
test=dfs[1]
train=train.drop(['PassengerId','Name','Cabin','Survived','Ticket'],axis=1)
test=test.drop(['PassengerId','Name','Cabin','Ticket'],axis=1)
test['Title']=test[['Title']].fillna('Mrs')

In [None]:
train = pd.get_dummies(train)
test= pd.get_dummies(test)
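Because `pd.get_dummies` is called separately on the train and test frames, the resulting columns only match if both frames contain the same categories. A hedged sketch of one way to guard against a mismatch is shown below; the frames `train_toy` and `test_toy` are made up, and the `reindex` step is an optional safeguard rather than part of this notebook's pipeline.

```python
import pandas as pd

train_toy = pd.DataFrame({"Sex": ["male", "female"], "Title": ["Mr", "Royalty"]})
test_toy = pd.DataFrame({"Sex": ["female", "male"], "Title": ["Mrs", "Mr"]})

train_ohe = pd.get_dummies(train_toy)
test_ohe = pd.get_dummies(test_toy)

# Give the test frame exactly the training columns:
# missing dummies are filled with 0, unseen ones are dropped.
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)

print(list(train_ohe.columns))
print(list(test_ohe.columns))
```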

In [None]:
# Box-Cox needs strictly positive input, so zero fares are nudged to 0.01
trainfare=np.array(train['Fare'])
trainfare[trainfare<=0]=0.01
testfare=np.array(test['Fare'])
testfare[testfare<=0]=0.01

# Estimate lambda on the training data and reuse it for the test data
train['boxAge'],age_lambda=stats.boxcox(train['Age'])
train['boxFare'],fare_lambda=stats.boxcox(trainfare)
test['boxAge']=stats.boxcox(test['Age'], age_lambda)
test['boxFare']=stats.boxcox(testfare, fare_lambda)

train=train.drop(['Age','Fare'], axis=1)
test=test.drop(['Age','Fare'], axis=1)
train['Ageclass']=train['boxAge']*train['Pclass']
# Use the test set's own Pclass (the original multiplied by train['Pclass'])
test['Ageclass']=test['boxAge']*test['Pclass']
train.head()

In [None]:
# Fit the scaler on the training data only and apply the same scaling to the test data
scaled_cols=['boxAge','boxFare','Ageclass']
min_max_scaler = preprocessing.MinMaxScaler()
train[scaled_cols] = min_max_scaler.fit_transform(train[scaled_cols])
test[scaled_cols] = min_max_scaler.transform(test[scaled_cols])

# Machine Learning Models <a id="Ml_models"></a>

Three different machine learning models were used and compared, namely logistic regression, random forest and XGB. 20% of the training data was held out as a validation set. The number of estimators for the random forest and the learning rate, subsample and column sample by tree parameters for the XGB model were tuned through trial and error. A full summary of the results can be found in the conclusion. The predictions of each model were written to csv files.
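The hold-out evaluation described above can be sketched in isolation. The data here is synthetic (generated with `make_classification`), and names such as `X_val` and `val_acc` are assumptions chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data in place of the Titanic features.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Keep 20% of the rows aside for validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

# score() returns the mean accuracy on the held-out rows.
val_acc = clf.score(X_val, y_val)
print(f"validation accuracy: {val_acc:.3f}")
```

A single 80/20 split gives a fairly noisy estimate on a dataset of this size, so small differences between models should be read with caution.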

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train, y_train, test_size=0.2, random_state=42)

## Logistic Regression <a id="log_reg"></a> 

In [None]:
LogReg=LogisticRegression()
LogReg.fit(X_train,y_train)
log_reg_s=LogReg.score(X_test,y_test)
print(log_reg_s)
model_comp=pd.DataFrame({'ML Model':'Log_Reg','Score':[log_reg_s]})
predict=LogReg.predict(test)
results=pd.DataFrame({'PassengerId':test_raw['PassengerId'],'Survived':pd.Series(predict)})
results.to_csv('resultslogreg.csv',index=False)

## Random Forest <a id="Rand_Forest"></a> 

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100,max_depth=5)
model.fit(X_train, y_train)
Rand_For_S=model.score(X_test, y_test)
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
model_comp=pd.concat([model_comp,pd.DataFrame({'ML Model':['Rand_For'],'Score':[Rand_For_S]})],ignore_index=True)
print(Rand_For_S)

In [None]:
predict = model.predict(test)
results=pd.DataFrame({'PassengerId':test_raw['PassengerId'],'Survived':pd.Series(predict)})
results.to_csv('resultsrt.csv',index=False)

## XGB <a id="XGB"></a> 

In [None]:
from xgboost import XGBClassifier
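# early_stopping_rounds is passed to the constructor (xgboost >= 1.6); recent releases no longer accept it in fit()
my_modelxg = XGBClassifier(n_estimators=1000, learning_rate=0.1, subsample=0.9, colsample_bytree = 0.9,
                           early_stopping_rounds=5)
my_modelxg.fit(X_train, y_train,
               eval_set=[(X_test, y_test)],
               verbose=False)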

In [None]:
predict_cv = my_modelxg.predict(X_test)


from sklearn.metrics import accuracy_score
cv_score=accuracy_score(y_test,predict_cv)
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
model_comp=pd.concat([model_comp,pd.DataFrame({'ML Model':['XGB'],'Score':[cv_score]})],ignore_index=True)
print ("CV Score: ",cv_score)

In [None]:
predict = my_modelxg.predict(test)
results=pd.DataFrame({'PassengerId':test_raw['PassengerId'],'Survived':pd.Series(predict)})
results.to_csv('resultsXG.csv',index=False)

# Conclusion <a id="conclusion"></a> 

The model with the highest validation score was the XGB model with 85.47% accuracy. However, after submitting all three solutions to the competition, the random forest achieved the highest accuracy with 78.71%, followed by XGB with 78.47% and logistic regression with 77.75%.

Thank you for visiting my kernel. Any comments and suggestions are welcome!

In [None]:
print(model_comp.sort_values(by=['Score'],ascending=False))