## **Titanic Project - The ML Mine**

In this notebook, I hope to show how a data scientist would go about working through a problem. The goal is to correctly predict whether a passenger survived the Titanic shipwreck. I thought it would be fun to see which factors affected a passenger's chances of survival.

The accompanying video is located here: https://www.youtube.com/watch?v=NcbrbWOSvLA

## Outline:
1) Understand the data

2) Understand the distribution

3) Feature engineering

4) Data pre-processing

5) Building ML models

6) Model Hyperparameter tuning

7) Test data predictions and Submission

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

## Understand the data

In [None]:
df_train = pd.read_csv('/kaggle/input/titanic/train.csv')

In [None]:
df_train.info()

In [None]:
df_train.isna().sum()

In [None]:
df_train.head()

## Understand the distribution

In [None]:
df_train.describe()

In [None]:
num_col = ['PassengerId','Survived','Pclass','Age','SibSp','Parch','Fare']
df_num = df_train[num_col]
sns.heatmap(df_num.corr())

In [None]:
# sns.pairplot(df_train, hue='Survived')

In [None]:
women = df_train[df_train['Sex']=='female']['Survived']
# Survived is 0/1, so its mean is the survival rate; format it as a percentage
print(f"Percentage of women who survived: {women.mean():.1%}")

men = df_train[df_train['Sex']=='male']['Survived']
print(f"Percentage of men who survived: {men.mean():.1%}")
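
The two filters above can also be written as a single `groupby`. This is only a sketch on a small made-up frame (the `toy` values are hypothetical, not the Kaggle data); the same call should work on `df_train`, since both have `Sex` and `Survived` columns.

```python
import pandas as pd

# Toy stand-in for df_train (hypothetical values, not the Kaggle data)
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate per group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex.map("{:.1%}".format))
```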

In [None]:
pd.pivot_table(df_train, index='Survived',columns='Parch',values=['Ticket'],aggfunc='count')

In [None]:
pd.pivot_table(df_train, index='Survived',columns='Embarked',values=['Ticket'],aggfunc='count')

In [None]:
pd.pivot_table(df_train, index='Survived',values=['Age','Parch','SibSp'])

## Feature engineering

In [None]:
display(df_train)

In [None]:
df_train['title'] = df_train.Name.apply(lambda x: x.split(',')[1].split(' ')[0])
df_train['title'].value_counts()
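
One thing worth checking in the cell above: the text after the comma starts with a space, so splitting it on `' '` returns an empty string as the first piece. The sketch below compares that with splitting on `'.'` and stripping whitespace, using a few names formatted like the `Name` column. `extract_title` is a hypothetical helper, not part of the notebook. Switching to it would change how many one-hot columns the encoder produces later, so anything that depends on the encoded column count or order would need to be re-checked.

```python
import pandas as pd

# Names formatted like the dataset's Name column: "Surname, Title. Given names"
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

def extract_title(name: str) -> str:
    """Hypothetical helper: text between the comma and the first period."""
    return name.split(",")[1].split(".")[0].strip()

# Splitting on a space keeps the empty piece before the leading space
space_split = names.apply(lambda x: x.split(",")[1].split(" ")[0])
titles = names.apply(extract_title)

print(space_split.tolist())
print(titles.tolist())
```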

In [None]:
df_train['numeric_ticket'] = df_train.Ticket.apply(lambda x: 1 if x.isnumeric() else 0)
df_train['numeric_ticket'].value_counts()

In [None]:
df_train['ticket_letter'] = df_train.Ticket.apply(lambda x: ''.join(x.split(' ')[:-1]).replace('/','').replace('.','').lower() if len(x.split(' ')[:-1])>0 else 0)
df_train['ticket_letter'].value_counts()
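
The one-line lambda above is dense, so here is the same logic unpacked into a named function as a sketch. `ticket_prefix` is a hypothetical name; the sample tickets just follow the formats seen in the `Ticket` column.

```python
import pandas as pd

def ticket_prefix(ticket: str):
    """Return the cleaned, lower-cased text before the ticket number, or 0 if there is none."""
    parts = ticket.split(" ")[:-1]  # every token except the last one (the number)
    if not parts:
        return 0
    return "".join(parts).replace("/", "").replace(".", "").lower()

tickets = pd.Series(["A/5 21171", "PC 17599", "113803", "STON/O2. 3101282"])
prefixes = tickets.apply(ticket_prefix)
print(prefixes.tolist())
```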

In [None]:
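# Missing cabins are NaN, so str(x)[0] maps them to the letter 'n'
df_train['cabin_letter'] = df_train.Cabin.apply(lambda x: str(x)[0])
# Only the last expression in a cell is shown automatically, so display the counts explicitly
display(df_train['cabin_letter'].value_counts())
pd.pivot_table(df_train, index='Survived',columns='cabin_letter',values='Ticket',aggfunc='count')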

In [None]:
df_train = df_train.drop(['Name','Ticket','Cabin','ticket_letter'],axis=1)

In [None]:
#Implement all the above steps in a class
class feature_engg:
    def __init__(self,data):
        self.data=data
    def name_title(self):
        self.data['title'] = self.data.Name.apply(lambda x: x.split(',')[1].split(' ')[0])
    def numeric_ticket(self):
        self.data['numeric_ticket'] = self.data.Ticket.apply(lambda x: 1 if x.isnumeric() else 0)
    def cabin_letter(self):
        self.data['cabin_letter'] = self.data.Cabin.apply(lambda x: str(x)[0])
    def remove_cols(self):
        return self.data.drop(['Name','Ticket','Cabin'], axis=1)

## Data preprocessing

In [None]:
display(df_train)

In [None]:
df_train.info()

In [None]:
from sklearn.preprocessing import StandardScaler

class data_preprocess:
    def __init__(self,data):
        self.data=data
    def handle_null_values(self):
        for column in self.data.columns:
            if self.data[column].dtype == 'object':
                mode_val = self.data[column].mode()[0]
                self.data[column] = self.data[column].fillna(mode_val)
            else:
                mean_val = self.data[column].mean()
                self.data[column] = self.data[column].fillna(mean_val)
    def handle_duplicate_values(self):
        if self.data.duplicated().any():
            self.data.drop_duplicates(inplace=True)
    
    def scale(self):
        scale = StandardScaler()
        self.data[['Age','Fare']] = scale.fit_transform(self.data[['Age','Fare']])
        return self.data
        
    

In [None]:
## Preprocess training data
data_preprocessor= data_preprocess(df_train) 
data_preprocessor.handle_null_values()
data_preprocessor.handle_duplicate_values()
train = data_preprocessor.scale()

In [None]:
train.describe()

In [None]:
## Prepare dataset
X_train = train.drop(['Survived'],axis=1)
y_train = train['Survived']

## Encoder for categorical columns
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

col_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['title','Sex','Embarked','cabin_letter']),
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array so pd.DataFrame() below works
)
X_train = col_trans.fit_transform(X_train)
X_train = pd.DataFrame(X_train)
X_train.head(20)

## Building the models

In [None]:
from sklearn.model_selection import cross_val_score

# Models tested
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [None]:
#Using Naive Bayes as a baseline for my classification tasks 
gnb = GaussianNB()
cv = cross_val_score(gnb,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

In [None]:
lr = LogisticRegression(max_iter = 2000)
cv = cross_val_score(lr,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

In [None]:
dt = tree.DecisionTreeClassifier(random_state = 1)
cv = cross_val_score(dt,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

In [None]:
knn = KNeighborsClassifier()
cv = cross_val_score(knn,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

In [None]:
rf = RandomForestClassifier(random_state = 1)
cv = cross_val_score(rf,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

In [None]:
svc = SVC(probability = True)
cv = cross_val_score(svc,X_train,y_train,cv=5)
print(cv)
print(cv.mean())

In [None]:
from xgboost import XGBClassifier
xgb = XGBClassifier(random_state =1)
cv = cross_val_score(xgb,X_train,y_train,cv=5)
print(cv)
print(cv.mean())
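
The cells above repeat the same three lines for each model. A loop over a dictionary gives a side-by-side comparison instead. This sketch runs on synthetic data from `make_classification` (an assumption, to keep it self-contained) and covers only three of the models, so the scores say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=2000),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
}

# Mean 5-fold accuracy per model
results = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in models.items()
}

# Print models from best to worst mean score
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```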

## Model Hyperparameter tuning

In [None]:
from sklearn.model_selection import GridSearchCV 
from sklearn.model_selection import RandomizedSearchCV 

In [None]:
#simple performance reporting function
def clf_performance(classifier, model_name):
    print(model_name)
    print('Best Score: ' + str(classifier.best_score_))
    print('Best Parameters: ' + str(classifier.best_params_))

In [None]:
lr = LogisticRegression()
param_grid = {'max_iter' : [2000],
              'penalty' : ['l1', 'l2'],
              'C' : np.logspace(-4, 4, 20),
              'solver' : ['liblinear']}

clf_lr = GridSearchCV(lr, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_clf_lr = clf_lr.fit(X_train,y_train)
clf_performance(best_clf_lr,'Logistic Regression')

In [None]:
knn = KNeighborsClassifier()
param_grid = {'n_neighbors' : [3,5,7,9],
              'weights' : ['uniform', 'distance'],
              'algorithm' : ['auto', 'ball_tree','kd_tree'],
              'p' : [1,2]}
clf_knn = GridSearchCV(knn, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_clf_knn = clf_knn.fit(X_train,y_train)
clf_performance(best_clf_knn,'KNN')

In [None]:
# svc = SVC(probability = True)
# param_grid = tuned_parameters = [{'kernel': ['rbf'], 'gamma': [.1,.5,1,2,5,10],
#                                   'C': [.1, 1, 10, 100, 1000]},
#                                  {'kernel': ['linear'], 'C': [.1, 1, 10, 100, 1000]},
#                                  {'kernel': ['poly'], 'degree' : [2,3,4,5], 'C': [.1, 1, 10, 100, 1000]}]
# clf_svc = GridSearchCV(svc, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
# best_clf_svc = clf_svc.fit(X_train,y_train)
# clf_performance(best_clf_svc,'SVC')

In [None]:
rf = RandomForestClassifier(random_state = 1)
# 'auto' (an alias of 'sqrt' for classifiers) was removed in scikit-learn 1.3, so it is left out here
param_grid = {'n_estimators': [400,450,500,550],
              'criterion': ['gini','entropy'],
              'bootstrap': [True],
              'max_depth': [15, 20, 25],
              'max_features': ['sqrt', 10],
              'min_samples_leaf': [2,3],
              'min_samples_split': [2,3]}
                                  
clf_rf = GridSearchCV(rf, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_clf_rf = clf_rf.fit(X_train,y_train)
clf_performance(best_clf_rf,'Random Forest')
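
`RandomizedSearchCV` is imported earlier but never used. An exhaustive grid fits every combination, which gets slow for a random forest, while a randomized search samples a fixed number of candidates. The sketch below shows the call pattern on synthetic data; the parameter values and `n_iter=5` are illustrative assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=1)

param_distributions = {
    "n_estimators": [50, 100, 150],
    "max_depth": [5, 10, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2, 3],
}

# Sample 5 of the 54 possible combinations instead of trying them all
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=1,
)
search.fit(X_demo, y_demo)

print("Best Score:", search.best_score_)
print("Best Parameters:", search.best_params_)
```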

In [None]:
best_rf = best_clf_rf.best_estimator_  # already refit on the full training set (GridSearchCV uses refit=True by default)
col_names = col_trans.get_feature_names_out()
X_train = pd.DataFrame(X_train,columns=col_names)
feat_importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
feat_importances.nlargest(20).plot(kind='barh')

## Test data predictions and submission

In [None]:
df_test=pd.read_csv('/kaggle/input/titanic/test.csv')

In [None]:
df_test.info()

In [None]:
## Feature engineering
test_feature_eng = feature_engg(df_test)
test_feature_eng.name_title()
test_feature_eng.numeric_ticket()
test_feature_eng.cabin_letter()
df_test = test_feature_eng.remove_cols()

In [None]:
## Preprocess test data
data_preprocessor= data_preprocess(df_test) 
data_preprocessor.handle_null_values()
data_preprocessor.handle_duplicate_values()
test = data_preprocessor.scale()
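
Note that the cell above computes the fill values and the scaling statistics from the test set itself. A common alternative is to learn them on the training data and reuse them on the test data, so both sets go through the same transformation. Here is a minimal sketch of that idea on made-up frames (`toy_train`, `toy_test`); it covers only the numeric `Age` and `Fare` columns and is not how the `data_preprocess` class works.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames standing in for the train and test data
toy_train = pd.DataFrame({"Age": [20.0, 30.0, None, 40.0],
                          "Fare": [10.0, 20.0, 30.0, 40.0]})
toy_test = pd.DataFrame({"Age": [None, 50.0],
                         "Fare": [25.0, None]})

num_cols = ["Age", "Fare"]

# Learn fill values and scaling statistics from the training data only
train_means = toy_train[num_cols].mean()
scaler = StandardScaler().fit(toy_train[num_cols].fillna(train_means))

# Apply the same fill values and the same fitted scaler to both sets
toy_train[num_cols] = scaler.transform(toy_train[num_cols].fillna(train_means))
toy_test[num_cols] = scaler.transform(toy_test[num_cols].fillna(train_means))

print(toy_test)
```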

In [None]:
## Encode the categorical columns (Use same encoder used while training)
X_test=col_trans.transform(test)
X_test=pd.DataFrame(X_test)

In [None]:
X_test.head()

In [None]:
y_hat_rf = best_clf_rf.best_estimator_.predict(X_test)

In [None]:
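# Take PassengerId by name from the preprocessed test frame rather than by encoded column position
rf_submission = {'PassengerId': test['PassengerId'].astype(int), 'Survived': y_hat_rf}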
submission_rf = pd.DataFrame(data=rf_submission)
submission_rf.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")