In this tutorial, we predict the 'Survived' column of the Titanic dataset.


Table of Contents:
1. Data Preparation
     - [Data Exploration & Cleaning](#0)
2. Data Analysis
     - [Feature Construction](#1)
     - [Feature Selection](#2)
     - [Model Selection](#3)
      
  

In [None]:
import numpy as np 
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train= pd.read_csv('../input/titanic/train.csv',index_col='PassengerId')
test = pd.read_csv('../input/titanic/test.csv', index_col = "PassengerId")

In [None]:
train

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import randint
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import ExtraTreesClassifier


# Data Exploration and Cleaning<a id="0"></a>

First, we will explore the size of data that we have. 

In [None]:
test.head()

In [None]:
train.head()

In [None]:
print("The number of training examples (data points) = %i " % train.shape[0])
print("The number of columns (features plus the 'Survived' target) = %i " % train.shape[1])


Data cleaning is the process of ensuring that your data is correct, consistent and usable. This improves the quality of the training data for analytics and enables accurate decision-making.<br/>

For data cleaning, we will focus on three points: 
* Non-numerical data 
* Missing values
* Outliers 


First, let's count the null values in each column.

In [None]:
train.isnull().sum()

Now we will examine the data types of our features.

In [None]:
train.dtypes

From the results, we can see that we have 5 columns of type 'Object'. We need to decide on how we will use these features. The main purpose here is to prepare data that can be used in machine learning models and for that, we need our data to be numerical. 

In [None]:
train.columns[train.dtypes==object]
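
The null counts and data types inspected above can also be collected into a single overview table. The following is a minimal sketch on a small hypothetical frame; the `toy` data and the `summarize` helper are illustrative assumptions and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of a few Titanic columns, for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Sex": ["male", "female", "female", "male"],
    "Cabin": [np.nan, "C85", np.nan, "E46"],
})

def summarize(df):
    """Return one row per column with its dtype, null count and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a value
    })

summary = summarize(toy)
print(summary)
```

Columns with a text dtype and few unique values are candidates for a simple numeric mapping, while columns with many unique values usually need feature extraction first.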



In [None]:
train["Sex"].value_counts()

In [None]:
train["Embarked"].value_counts()

We can see here that Sex and Embarked are nominal features with a small number of unique values. Some ML libraries do not accept categorical variables as input, so we will convert them into numerical variables.

In [None]:
cleanup_nums = { "Embarked": {"S": 0, "C": 1, "Q": 2 },"Sex": {"male": 0, "female": 1}}

In [None]:
train.replace(cleanup_nums, inplace=True)
test.replace(cleanup_nums, inplace=True)
train.head()
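
One thing to keep in mind with dictionary-based encoding: `replace` leaves values that are missing from the dictionary untouched, while `Series.map` turns them into NaN. The sketch below uses `map` on a hypothetical sample (the `toy` frame and the unexpected code "X" are assumptions for illustration) to show how unmapped categories can be detected.

```python
import pandas as pd

# Hypothetical sample; "X" stands in for an unexpected port code.
toy = pd.DataFrame({"Sex": ["male", "female", "male"],
                    "Embarked": ["S", "C", "X"]})

sex_codes = {"male": 0, "female": 1}
port_codes = {"S": 0, "C": 1, "Q": 2}

encoded = toy.copy()
encoded["Sex"] = toy["Sex"].map(sex_codes)
# map() yields NaN for values that are missing from the dictionary ...
encoded["Embarked"] = toy["Embarked"].map(port_codes)

# ... which makes unmapped categories easy to count.
n_unmapped = int(encoded["Embarked"].isnull().sum())
print(encoded)
print("Unmapped Embarked values:", n_unmapped)
```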

Now let's take a look at the Cabin and Ticket features.

In [None]:
train['Cabin'].unique()

In [None]:
train['Ticket'].unique()

We can see that there are too many unique values for the Cabin and Ticket features. It might be reasonable to remove them, but we will first try to extract some information from them.

In [None]:
# let's make boxplots to visualise outliers in the continuous variables 
# Age and Fare
 
plt.figure(figsize=(15,6))
plt.subplot(1, 2, 1)
fig = train.boxplot(column='Age')
fig.set_title('')
fig.set_ylabel('Age')
 
plt.subplot(1, 2, 2)
fig = train.boxplot(column='Fare')
fig.set_title('')
fig.set_ylabel('Fare')

First we plot the distributions to find out whether they are Gaussian or skewed.
Depending on the distribution, we will use the normal assumption or the interquartile range to find outliers.

In [None]:
# First we plot the distributions to find out whether they are Gaussian or skewed.
# Depending on the distribution, we will use the normal assumption or the
# interquartile range to find outliers.
 
plt.figure(figsize=(15,6))
plt.subplot(1, 2, 1)
fig = train.Age.hist(bins=20)
fig.set_ylabel('Number of passengers')
fig.set_xlabel('Age')
 
plt.subplot(1, 2, 2)
fig = train.Fare.hist(bins=20)
fig.set_ylabel('Number of passengers')
fig.set_xlabel('Fare')

Age is approximately normally distributed, while Fare is skewed. For Age, we will use the Gaussian assumption (mean ± 3 standard deviations); for Fare, we will use the interquartile range.

In [None]:
# find outliers
# Age
Upper_boundary = train.Age.mean() + 3* train.Age.std()
Lower_boundary = train.Age.mean() - 3* train.Age.std()
print('Age outliers are values < {lowerboundary} or > {upperboundary}'.format(lowerboundary=Lower_boundary, upperboundary=Upper_boundary))
 
# Fare
IQR = train.Fare.quantile(0.75) - train.Fare.quantile(0.25)
Lower_fence = train.Fare.quantile(0.25) - (IQR * 3)
Upper_fence = train.Fare.quantile(0.75) + (IQR * 3)
print('Fare outliers are values < {lowerboundary} or > {upperboundary}'.format(lowerboundary=Lower_fence, upperboundary=Upper_fence))

Now we will replace outliers with reasonable values based on the above calculations.

In [None]:
train['Age'] = np.where(train['Age']>73, 73, train['Age'])
test['Age'] = np.where(test['Age']>73, 73, test['Age'])
train['Age'].max()

In [None]:
train['Fare'] = np.where(train['Fare']>100, 100, train['Fare'])
test['Fare'] = np.where(test['Fare']>100, 100, test['Fare'])
train['Fare'].max()
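
The caps of 73 and 100 used above are rounded versions of the computed upper bounds. As an alternative to hard-coding them, the bounds can be derived from the data and applied with `Series.clip`. This is a minimal sketch; the helper names `gaussian_upper` and `iqr_upper` and the sample fares are assumptions for illustration.

```python
import pandas as pd

def gaussian_upper(s, k=3.0):
    """Upper bound under a normal assumption: mean + k standard deviations."""
    return s.mean() + k * s.std()

def iqr_upper(s, k=3.0):
    """Upper fence from the interquartile range: Q3 + k * IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return q3 + k * (q3 - q1)

# Hypothetical fares with one extreme value.
fares = pd.Series([7.0, 8.0, 10.0, 13.0, 26.0, 30.0, 512.0])
cap = iqr_upper(fares)
capped = fares.clip(upper=cap)  # values above the fence are set to the fence
print("cap:", cap, "max after capping:", capped.max())
```

To keep both datasets consistent, the bound would typically be computed on the training data and then applied to the test data as well.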

I referred to this great kernel for outlier detection https://www.kaggle.com/anuragnegi/feature-engineering-outliers-handling-ensembling.

Now we will take a look at the correlation matrix to get a quick insight about the relationships between features. 

In [None]:
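train.corr(numeric_only=True)  # skip text columns such as Name, Ticket and Cabin

In [None]:

corr = train.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(20, 8))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, cmap=cmap, linewidths=.5, annot=True, ax=ax)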
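
When the main interest is the target, it can be easier to read a single column of the correlation matrix, sorted by absolute value, than the full heatmap. A minimal sketch on hypothetical data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical sample, for illustration only.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex":      [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3],
    "Name":     ["a", "b", "c", "d", "e", "f"],  # non-numeric, skipped below
})

# numeric_only=True restricts the computation to numeric columns.
target_corr = (toy.corr(numeric_only=True)["Survived"]
               .drop("Survived")
               .sort_values(key=abs, ascending=False))
print(target_corr)
```

Correlation only captures linear relationships, so a low value does not by itself prove that a feature is useless.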

# Feature Construction <a id="1"></a>
Feature construction is a process which builds intermediate features from the original descriptors in a dataset. The aim is to build more efficient features for a machine learning task.

From the 'Name' feature, we can extract other useful features, such as the family name, to identify members of the same family. It is likely that members of the same family will share a family name.

In [None]:
train['Family_name']=train['Name'].str.split(', ').str[0]
test['Family_name']=test['Name'].str.split(', ').str[0]
train

In addition, we can derive each passenger's title from the Name feature; the title hints at sex, age group, and marital status.

In [None]:
train['Title']=train['Name'].str.split(', ').str[1].str.split('.').str[0]
test['Title']=test['Name'].str.split(', ').str[1].str.split('.').str[0]
train['Title'].unique()
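
The two chained `split` calls above can also be expressed as one regular expression with named groups. The sketch below is an alternative formulation, not the notebook's method; the three sample names follow the dataset's "Family, Title. Given names" layout.

```python
import pandas as pd

# Sample names in the "Family, Title. Given names" layout.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Family name: everything before the first comma.
# Title: the text between the comma and the first period.
parts = names.str.extract(r"^(?P<Family_name>[^,]+),\s*(?P<Title>[^.]+)\.")
print(parts)
```

Names that do not match the pattern produce NaN in both columns, which makes malformed entries easy to spot.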

In [None]:
train['Title'] = train['Title'].replace(['Ms','Mlle'], 'Miss')
train['Title'] = train['Title'].replace(['Mme','Dona','the Countess','Lady'], 'Mrs')
# 'Mlle' was already mapped to 'Miss' above, so it is not repeated here
train['Title'] = train['Title'].replace(['Rev','Jonkheer','Dr','Capt','Don','Col','Major','Sir'], 'Mr')

test['Title'] = test['Title'].replace(['Ms','Mlle'], 'Miss')
test['Title'] = test['Title'].replace(['Mme','Dona','the Countess','Lady'], 'Mrs')
test['Title'] = test['Title'].replace(['Rev','Jonkheer','Dr','Capt','Don','Col','Major','Sir'], 'Mr')

In [None]:
train['Title']

In [None]:


cleanup_nums = { "Title": {"Mr": 0, "Mrs": 1, "Miss": 2, "Master": 3 } }
train.replace(cleanup_nums, inplace=True)
test.replace(cleanup_nums, inplace=True)

We can divide 'Age' into categories and see whether this will be better than the original 'Age' feature. 

In [None]:
# Fill missing ages with the mean age of each dataset
# (assigning the result avoids in-place modification of a column selection)
train['Age'] = train['Age'].fillna(train['Age'].mean())
test['Age'] = test['Age'].fillna(test['Age'].mean())

bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']

train['AgeRange'] = pd.cut(train['Age'], bins, labels=names)
test['AgeRange'] = pd.cut(test['Age'], bins, labels=names)

NumberedAgeCategories = {'<2':0 , '2-18':1, '18-35':2, '35-65':3, '65+':4}
train['AgeRange']=train['AgeRange'].map(NumberedAgeCategories)  
train['AgeRange']=pd.to_numeric(train['AgeRange'])
test['AgeRange']=test['AgeRange'].map(NumberedAgeCategories)  
test['AgeRange']=pd.to_numeric(test['AgeRange'])
train
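
Two details of `pd.cut` are worth seeing on a small example: bins are right-closed by default, and `labels=False` returns the integer bin index directly, which avoids a separate label-to-number mapping. A minimal sketch with hypothetical ages:

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([0.5, 2.0, 18.0, 18.5, 40.0, 70.0])
bins = [0, 2, 18, 35, 65, np.inf]

# labels=False returns the integer index of each bin instead of a label.
# Bins are right-closed, so 2.0 falls in (0, 2] and 18.0 falls in (2, 18].
codes = pd.cut(ages, bins, labels=False)
print(codes.tolist())
```

With these edges, a passenger aged exactly 18 lands in the second bin rather than the third.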

Since SibSp counts the siblings and spouses aboard and Parch counts the parents and children aboard, we can derive the family size from them (plus one for the passenger). Another option would have been to use the family name to identify members of the same family, but since a shared family name might be a coincidence, we avoid that feature here for more accurate results.

In [None]:
train['FamilySize']= train['SibSp']+train['Parch']+1
test['FamilySize']= test['SibSp']+test['Parch']+1
train

Next, we can work out some meaningful interpretation from the Cabin column.<br/>
Although this feature contains many unique values and null values, it still carries useful information such as the deck and the room number (one letter followed by numbers). <br/>
We can use this to check whether there is any relation between the deck and the fare amount or the Pclass of a passenger. 

In [None]:
cabin_only = train[["Cabin"]].copy()
cabin_only["Cabin_Data"] = cabin_only["Cabin"].notnull() # flag rows whose Cabin value is not null.

In [None]:
cabin_only["Deck"] = cabin_only["Cabin"].str.slice(0,1)
cabin_only["Room"] = cabin_only["Cabin"].str.slice(1,5).str.extract("([0-9]+)", expand=False).astype("float")
cabin_only[cabin_only["Cabin_Data"]]
cabin_only

Here we will deal with the null values in the Cabin column.

In [None]:
cabin_only.drop(["Cabin", "Cabin_Data"], axis=1, inplace=True, errors="ignore")
cabin_only["Deck"] = cabin_only["Deck"].fillna("N") # assign 'N' for the deck name of the null Cabin value. 
cabin_only["Room"] = cabin_only["Room"].fillna(cabin_only["Room"].mean()) # use mean to fill null Room values.

In [None]:
cabin_only

Again, we need to make sure our values are numerical so we will represent the column 'Deck' in a different way using pandas dummies. 

In [None]:
cabin_only=cabin_only.join(pd.get_dummies(cabin_only['Deck'], prefix='Deck'))
cabin_only=cabin_only.drop(['Deck'], axis=1)
cabin_only

In [None]:
train=pd.concat([train,cabin_only],axis=1)
train.shape


Now I will repeat the same process for the test data. You can skip this part.

In [None]:
cabin_only_test = test[["Cabin"]].copy()
cabin_only_test["Cabin_Data"] = cabin_only_test["Cabin"].notnull() # flag rows whose Cabin value is not null.
cabin_only_test["Deck"] = cabin_only_test["Cabin"].str.slice(0,1)
cabin_only_test["Room"] = cabin_only_test["Cabin"].str.slice(1,5).str.extract("([0-9]+)", expand=False).astype("float")
cabin_only_test[cabin_only_test["Cabin_Data"]]
cabin_only_test.drop(["Cabin", "Cabin_Data"], axis=1, inplace=True, errors="ignore")
cabin_only_test["Deck"] = cabin_only_test["Deck"].fillna("N") # assign 'N' for the deck name of the null Cabin value. 
cabin_only_test["Room"] = cabin_only_test["Room"].fillna(cabin_only_test["Room"].mean()) # use mean to fill null Room values.
cabin_only_test=cabin_only_test.join(pd.get_dummies(cabin_only_test['Deck'], prefix='Deck'))
cabin_only_test=cabin_only_test.drop(['Deck'], axis=1)
test=pd.concat([test,cabin_only_test],axis=1)
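
Because `pd.get_dummies` is applied to the train and test data separately, the two frames can end up with different `Deck_` columns when a deck letter occurs in only one of them. One possible way to align them is to reindex the test dummies to the training columns. This is a sketch with hypothetical deck letters, not part of the original notebook.

```python
import pandas as pd

# Hypothetical deck letters; "T" appears only in the training sample.
train_deck = pd.Series(["C", "N", "T", "E"], name="Deck")
test_deck = pd.Series(["N", "C", "E"], name="Deck")

train_dummies = pd.get_dummies(train_deck, prefix="Deck")
test_dummies = pd.get_dummies(test_deck, prefix="Deck")

# Reindex the test dummies to the training columns; missing decks become 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```

Reindexing also drops any dummy column that exists only in the test data, so the model always sees the feature set it was trained on.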

Similarly, for the Ticket feature, there is a pattern in the format: an optional letter prefix followed by a number. We will extract both parts to see whether they have an effect on the target (the 'Survived' column).

In [None]:

# extract numbers from the ticket
train['Ticket_numerical'] = train.Ticket.apply(lambda s: s.split()[-1])
train['Ticket_numerical'] = np.where(train.Ticket_numerical.str.isdigit(), train.Ticket_numerical, np.nan)
train['Ticket_numerical'] = train['Ticket_numerical'].astype('float')
train["Ticket_numerical"] = train["Ticket_numerical"].fillna(0) # some tickets have string values only, so we will assign a 0 for their ticket_numerical.


test['Ticket_numerical'] = test.Ticket.apply(lambda s: s.split()[-1])
test['Ticket_numerical'] = np.where(test.Ticket_numerical.str.isdigit(), test.Ticket_numerical, np.nan)
test['Ticket_numerical'] = test['Ticket_numerical'].astype('float')
test["Ticket_numerical"] = test["Ticket_numerical"].fillna(0) 

# extract the first part of ticket as category
train['Ticket_categorical'] = train.Ticket.apply(lambda s: s.split()[0])
train['Ticket_categorical'] = np.where(train.Ticket_categorical.str.isdigit(), np.nan, train.Ticket_categorical)
train["Ticket_categorical"] = train["Ticket_categorical"].fillna("NONE") # some tickets have digit values only, so we will assign 'NONE' for their ticket_categorical.
train['Ticket_numerical'].tolist()
 
test['Ticket_categorical'] = test.Ticket.apply(lambda s: s.split()[0])
test['Ticket_categorical'] = np.where(test.Ticket_categorical.str.isdigit(), np.nan, test.Ticket_categorical)
test["Ticket_categorical"] = test["Ticket_categorical"].fillna("NONE") # some tickets have digit values only, so we will assign 'NONE' for their ticket_categorical.
test['Ticket_numerical'].tolist()

train[['Ticket', 'Ticket_numerical', 'Ticket_categorical']].head()
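
The same parsing rules can be written as one small function, which makes the edge cases explicit and easy to test. `split_ticket` is a hypothetical helper name introduced here for illustration; it mirrors the rules used above (prefix "NONE" for purely numeric tickets, number 0 for tickets without a numeric last part).

```python
def split_ticket(ticket):
    """Split a ticket string into (prefix, number).

    The prefix is the first token when it is not purely numeric, else "NONE".
    The number is the last token when it is purely numeric, else 0.0.
    """
    tokens = ticket.split()
    prefix = tokens[0] if not tokens[0].isdigit() else "NONE"
    number = float(tokens[-1]) if tokens[-1].isdigit() else 0.0
    return prefix, number

# Three typical shapes: prefix plus number, number only, text only.
for t in ["A/5 21171", "113803", "LINE"]:
    print(t, "->", split_ticket(t))
```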

After adding all these new features, we need to check whether we have null values and deal with them. 

In [None]:
train.isna().sum()

It is reasonable to remove the Cabin column now that we have extracted two new features from it: 'Deck' and 'Room'. <br/>
For the remaining null values, we can drop the affected rows, since they are few and dropping them should have little effect on model performance.

In [None]:
# Fare in the test dataset contains one null value; replace it with the training median
train['Fare'] = train['Fare'].fillna(train['Fare'].median())
test['Fare'] = test['Fare'].fillna(train['Fare'].median())

In [None]:
train= train.drop(['Cabin'], axis=1)
test= test.drop(['Cabin'], axis=1)


# Feature Selection <a id="2"></a>

Feature Selection is the process of selecting a subset of relevant features for use in model construction.

For this dataset, we will examine the effect of features on 5 different models: <br/>

1. Decision Trees: A Decision Tree is a flow chart that helps make decisions based on previous experience.
2. Random Forests: A Random Forest is essentially a collection of Decision Trees.
3. Logistic Regression: Logistic Regression is a type of Generalized Linear Model. It models the probabilities for classification problems with two possible outcomes.
4. XGBoost: XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.
5. Extra Trees: Extra Trees is an ensemble machine learning algorithm that combines the predictions from many decision trees.

In [None]:
# LabelEncoder transforms non-numerical labels (as long as they are hashable and comparable) into numerical labels.
# Fit one encoder per column on the combined train and test values so that
# the same string receives the same code in both datasets.
for col in train.columns[train.dtypes == "object"]:
    label_encoder = LabelEncoder()
    label_encoder.fit(pd.concat([train[col], test[col]]).astype('str'))
    train[col] = label_encoder.transform(train[col].astype('str'))
    test[col] = label_encoder.transform(test[col].astype('str'))

# drop rows with null values    
train.dropna(inplace=True)

X = train.drop('Survived', axis=1)

# create our response variable
y = train['Survived']


#train_test_ split is used to split the dataset into two pieces, so that the model can be trained and tested on different data.
#This is a better method for evaluating the model performance rather than testing it on the training data only. 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)



In [None]:
def featureSelection(label):    
    clf = DecisionTreeClassifier(random_state=0)
    if(label=='Decision Tree'):
        clf = DecisionTreeClassifier(random_state=0)
    if(label=='Random Forest'):
        clf = RandomForestClassifier(random_state=0)
    if(label=='XGBoost'):
        clf = XGBClassifier(random_state=0)  
    if(label=='Extra Trees'):
        clf = ExtraTreesClassifier(random_state=0)  
        
    clf= clf.fit(X_train, y_train)
    
    arr= dict(zip(X_train.columns, clf.feature_importances_)) # pair each feature name with its importance value
    data= pd.DataFrame.from_dict(arr,orient='index', columns=['importance'])
    return data.sort_values(['importance'], ascending=False)

Here we will display the features with their corresponding importance values based on each model.

In [None]:
fig, ((ax1,ax2),(ax3,ax4)) = plt.subplots(2,2, figsize=(20,10)) # two rows, two columns
r=featureSelection("Decision Tree")
v=featureSelection("Random Forest")
s=featureSelection("XGBoost")
t=featureSelection("Extra Trees")
r.plot.bar(y="importance", rot=70, title="Decision Tree Features with their corresponding importance values",ax=ax1)
v.plot.bar(y="importance", rot=70, title="Random Forest Features with their corresponding importance values", ax=ax2)
s.plot.bar(y="importance", rot=70, title="XGBoost Features with their corresponding importance values", ax=ax3)
t.plot.bar(y="importance", rot=70, title="Extra Trees Features with their corresponding importance values", ax=ax4)
plt.tight_layout() 


In [None]:

logit_model = LogisticRegression(max_iter=10000)
logit_model.fit(X_train, y_train)
 
importance = pd.Series(np.abs(logit_model.coef_.ravel()))
importance.index = X_train.columns
importance.sort_values(inplace=True, ascending=False)
importance.plot.bar(figsize=(12,6))

The columns with the prefix 'Deck' and the AgeRange column seem to have little importance in all 4 classifiers. Under 'Model Selection' we will check whether removing them improves the score of the classifiers. (This is a simple with/without comparison, similar in spirit to A/B testing.)

# Model Selection<a id="3"></a>

Model selection is the process of selecting one final machine learning model from among a collection of machine learning models for a training dataset. 

Grid search is used to tune hyperparameters to improve model performance. You can read more about it here https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.

In [None]:
def get_best_model_and_accuracy(model, params, X, y):
    grid_clf_auc = GridSearchCV(model,param_grid=params,error_score=0.,scoring = 'roc_auc')
    grid_clf_auc.fit(X, y) # fit the model and parameters
    print('Grid best parameter (max. AUC): ', grid_clf_auc.best_params_)
    print('Grid best score (AUC): ', grid_clf_auc.best_score_)
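
Note that `best_score_` is the mean cross-validated score on whatever data is passed to `fit`. To estimate performance on data the search has never seen, one option is to tune on the training split only and then score the refit best estimator on a held-out split with `roc_auc_score`. The sketch below illustrates this under stated assumptions: the data comes from `make_classification` as a synthetic stand-in for the prepared features, and the small grid is chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared Titanic features (an assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [3, 5], "min_samples_leaf": [1, 5]},
                    scoring="roc_auc")
grid.fit(X_tr, y_tr)  # tuning uses the training split only

# With refit=True (the default), predict_proba uses the best estimator found.
test_auc = roc_auc_score(y_te, grid.predict_proba(X_te)[:, 1])
print("CV AUC:", grid.best_score_, "held-out AUC:", test_auc)
```

A large gap between the cross-validated score and the held-out score would be a sign of overfitting to the training split.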

Now we will score each model.

Before deleting the columns with the prefix 'Deck' and the AgeRange column, I will check the AUC score with and without these columns in 3 classifiers to make sure that removing them is beneficial.

In [None]:
dt=DecisionTreeClassifier(random_state=0)
param_grid = {"max_depth": [3,7,10,50,100],
              "min_samples_leaf": [1,2,3,4,5,6,7,8,9],
              "criterion": ["gini", "entropy"]}  
print("Decision Tree train:")
get_best_model_and_accuracy(dt, param_grid, X_train, y_train)
print("Decision Tree test:")
get_best_model_and_accuracy(dt, param_grid, X_test, y_test)

In [None]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }
LR=LogisticRegression(max_iter=100000,random_state=0)
print("Logistic Regression train:")
get_best_model_and_accuracy(LR, param_grid, X_train, y_train)
print("Logistic Regression test:")
get_best_model_and_accuracy(LR, param_grid, X_test, y_test)

In [None]:
param_grid = { 
    'n_estimators': [200, 500],
    'max_features': ['sqrt', 'log2'],  # 'auto' (an alias of 'sqrt' for classifiers) was removed in newer scikit-learn releases
    'max_depth' : [4,5,6,7,8],
}
rf=RandomForestClassifier(random_state=0)
print("Random Forest train:")
get_best_model_and_accuracy(rf, param_grid, X_train, y_train)
print("Random Forest test:")
get_best_model_and_accuracy(rf, param_grid, X_test, y_test)

Now I will repeat the same process with some columns removed (those with the prefix 'Deck' and the AgeRange column).

In [None]:
train=train[train.columns.drop(list(train.filter(regex='Deck')))]
train= train.drop(['AgeRange'], axis=1)
test=test[test.columns.drop(list(test.filter(regex='Deck')))]
test= test.drop(['AgeRange'], axis=1)
X = train.drop('Survived', axis=1)
y = train['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
train

In [None]:
dt=DecisionTreeClassifier(random_state=0)
param_grid = {"max_depth": [3,7,10,50,100],
              "min_samples_leaf": [1,2,3,4,5,6,7,8,9],
              "criterion": ["gini", "entropy"]}  
print("Decision Tree train:")
get_best_model_and_accuracy(dt, param_grid, X_train, y_train)
print("Decision Tree test:")
get_best_model_and_accuracy(dt, param_grid, X_test, y_test)

In [None]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }
LR=LogisticRegression(max_iter=100000,random_state=0)
print("Logistic Regression train:")
get_best_model_and_accuracy(LR, param_grid, X_train, y_train)
print("Logistic Regression test:")
get_best_model_and_accuracy(LR, param_grid, X_test, y_test)

In [None]:
param_grid = { 
    'n_estimators': [200, 500],
    'max_features': ['sqrt', 'log2'],  # 'auto' (an alias of 'sqrt' for classifiers) was removed in newer scikit-learn releases
    'max_depth' : [4,5,6,7,8],
}
rf=RandomForestClassifier(random_state=0)
print("Random Forest train:")
get_best_model_and_accuracy(rf, param_grid, X_train, y_train)
print("Random Forest test:")
get_best_model_and_accuracy(rf, param_grid, X_test, y_test)

Although the difference is very small, removing them still gives slightly better results.

Now I will compute the AUC score for the rest of the classifiers after removing these features.

In [None]:
param_grid = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'max_depth': [3, 4, 5]
        }
XGB= XGBClassifier(random_state=0)
print("XGBoost train:")
get_best_model_and_accuracy(XGB, param_grid, X_train, y_train)
print("XGBoost test:")
get_best_model_and_accuracy(XGB, param_grid, X_test, y_test)

In [None]:
param_grid = {
    'n_estimators': [100, 500, 1000],
    'max_depth': [5, 6, 9],
    'min_samples_split': [5, 6, 9],
    'min_samples_leaf': [4, 5, 6, 9],
}
ET= ExtraTreesClassifier(random_state=0) 
print("ExtraTrees train:")
get_best_model_and_accuracy(ET, param_grid, X_train, y_train)
print("ExtraTrees test:")
get_best_model_and_accuracy(ET, param_grid, X_test, y_test)

The differences between the train and test scores are small, which is a good indication that there is no overfitting. Finally, it is reasonable to choose the model with the highest score on the X_test dataset, which in this case is Random Forest.

We will use the best parameters for the random forest classifier that were identified by GridSearchCV in the code above.

In [None]:
rf= RandomForestClassifier(random_state=0,max_depth= 7, max_features= 'sqrt', n_estimators= 500)
rf.fit(X_train,y_train)
predictions = rf.predict(test[X_train.columns])  # use the same columns, in the same order, as in training

According to the feature importance plot for Random Forest (the selected model), Sex and Title had the greatest influence. To see how the outcome varies across the categories of each feature, I will use plots:

In [None]:
sns.barplot(x='Sex', y='Survived', data=train, estimator=lambda x: sum(x==0)*100.0/len(x))

As noted in the data analysis part, 0 refers to male. 

In [None]:
sns.barplot(x='Title', y='Survived', data=train, estimator=lambda x: sum(x==0)*100.0/len(x))

0 in this plot refers to men with the title "Mr". 
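
The estimator passed to `sns.barplot` above computes the percentage of passengers with `Survived == 0` in each group, so taller bars mean fewer survivors. The same numbers can be read from a `groupby`. A minimal sketch on a hypothetical sample (the `toy` frame is an assumption; it uses the encoding from above, Sex: 0 = male, 1 = female):

```python
import pandas as pd

# Hypothetical sample, for illustration only.
toy = pd.DataFrame({"Sex": [0, 0, 0, 1, 1],
                    "Survived": [0, 0, 1, 1, 1]})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
survival_rate = toy.groupby("Sex")["Survived"].mean()

# Percentage of non-survivors per group, which is what the bar plots display.
death_pct = (1 - survival_rate) * 100
print(survival_rate)
print(death_pct)
```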

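Since the bars show the percentage of passengers who did not survive, we can conclude that adult men (title "Mr") had the highest share of non-survivors, that is, the lowest chance of survival.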

In [None]:
results_df = pd.DataFrame()
results_df["PassengerId"] = test.index
results_df["Survived"] = predictions
results_df.to_csv("my_submissions.csv", index=False)
results_df.head(5)