**<center><font size=6>Titanic - Machine Learning from Disaster</font></center>**
***

**date**: 07.01.2021

**Table of Contents**
- <a href='#read'>1. Reading the data</a> 
- <a href='#understand'>2. Understanding and preparing the data</a>
    - <a href='#describe'>2.1. Describing and planning the data</a>
    - <a href='#group'>2.2. Grouping and transforming</a>
    - <a href='#norm'>2.3. Normalising</a>
- <a href='#split'>3. Splitting the data</a>
- <a href='#fit'>4. Fitting and validating the models</a>
    - <a href='#fitLinSVC'>4.1. Fitting the Linear SupportVectorClassifier (LinSVC) and GridSearchCV</a>
        - <a href='#fitLinSVCgscv'>4.1.1. LinSVC and GridSearchCV</a>
        - <a href='#fitLinSVCparam'>4.1.2. Fitting the LinSVC, GridSearchCV parameters</a>
        - <a href='#valiLinSVC'>4.1.3. Validating the LinSVC</a>
    - <a href='#fitRBFSVC'>4.2. Fitting the RBF SupportVectorClassifier (RBF SVC) and GridSearchCV</a>
        - <a href='#fitRBFSVCgscv'>4.2.1. RBF SVC and GridSearchCV</a>
        - <a href='#fitRBFSVCparam'>4.2.2. Fitting the RBF SVC, GridSearchCV parameters</a>
        - <a href='#valiRBFSVC'>4.2.3. Validating the RBF SVC</a>
    - <a href='#fitDT'>4.3. Fitting the DecisionTreeClassifier (DT) and GridSearchCV</a>
        - <a href='#fitDTgscv'>4.3.1. DT and GridSearchCV</a>
        - <a href='#fitDTparam'>4.3.2. Fitting the DT, GridSearchCV parameters</a>
        - <a href='#valiDT'>4.3.3. Validating the DT</a>
    - <a href='#fitRF'>4.4. Fitting the RandomForestClassifier (RF)</a>
        - <a href='#fitRFgscv'>4.4.1. RF and GridSearchCV</a>
        - <a href='#fitRFparam'>4.4.2. Fitting the RF, GridSearchCV parameters</a>
        - <a href='#valiRF'>4.4.3. Validating the RF</a>
    - <a href='#fitKNN'>4.5. Fitting the K-Nearest-Neighbours (KNN)</a>  
        - <a href='#fitKNNgscv'>4.5.1. KNN and GridSearchCV</a>
        - <a href='#fitKNNparam'>4.5.2. Fitting the KNN, GridSearchCV parameters</a>
        - <a href='#valiKNN'>4.5.3. Validating the KNN</a> 
    - <a href='#fitLgR'>4.6. Fitting the Logistic Regression (LgR)</a>
        - <a href='#valiLgR'>4.6.1. Validating the LgR</a>    
    - <a href='#fitPCA-LgR'>4.7. Fitting the Principal Component Analysis (PCA) and Logistic Regression (LgR)</a>
        - <a href='#valiPCA-LgR'>4.7.1. Validating the PCA LgR</a> 
    - <a href='#fitOLS'>4.8. Fitting the Ordinary Least Squares Linear Regression (OLS)</a>
        - <a href='#valiOLS'>4.8.1. Validating the OLS</a> 
    - <a href='#valiARP'>4.9. Accuracy, Recall, Precision: validation set truth vs. predicted values</a>
    - <a href='#fitsummary'>4.10. Fitting and validating summary</a>
- <a href='#perd'>5. Predicting</a>
    - <a href='#predlinsvc'>5.1. Predicting with LinSVC</a>
    - <a href='#predrbfsvc'>5.2. Predicting with RBF SVC</a>
    - <a href='#predRF'>5.3. Predicting with DT</a>
    - <a href='#predRF'>5.4. Predicting with RF</a>
    - <a href='#predKNN'>5.5. Predicting with KNN</a>
    - <a href='#predLgR'>5.6. Predicting with LgR</a>
    - <a href='#predPCA-LgR'>5.7. Predicting with PCA and LgR</a>
    - <a href='#predOLS'>5.8. Predicting with OLS</a>
    - <a href='#predARP'>5.9. Accuracy, Recall, Precision: example set truth vs. test set predictions</a>
    - <a href='#predsummary'>5.10. Prediction summary</a>
- <a href='#submit'>6. Submitting the data</a>

# <a id='read'>1. Reading the data</a>

In [None]:
# Matplotlib config
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']
%config InlineBackend.rc = {'figure.figsize': (5.0, 4.0)}

import pandas as pd
import numpy as np
import csv
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, plot_confusion_matrix
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split

from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
import statsmodels.api as sm

input_file = "../input/titanic/train.csv"
df = pd.read_csv(input_file, header = 0, sep = ',', quotechar='"')
df.head()

# <a id='understand'>2. Understanding and preparing the data</a>

In [None]:
df.columns

In [None]:
%matplotlib
df.info()
df.describe()

In [None]:
#Absolute correlation of each numeric variable with "Survived"

df.corr()["Survived"].abs().sort_values(ascending = False)

## <a id='describe'>2.1. Describing and planning the data</a>

| Original variable | Non-missing values | Type | Range/Values | Description | Preparation/Transformation | New variable | New range/New values |
| :- | --- | :- | :- | :- | :- | :- | :- |
| PassengerId | 891 | int64 | 1-891 | | | | |
| Survived | 891 | int64 | 0/1 | 0=died, 1=survived | | | |
| Pclass | 891 | int64 | 1/2/3 | 1=1st class, 2=2nd class, 3=3rd class | normalising | Pclass_norm | 0-1 |
| Name | 891 | object | text | | | | |
| Sex | 891 | object | female/male | female=0, male=1 | categories as numbers | Sex_ord | 0/1 |
| Age | 714 | float64 | 0.42-80 years | | grouping, then splitting the age groups into 5 variables | Age_Group; Age_Infant; Age_Kid; Age_Young; Age_Adult; Age_Elderly; Age_norm | Infant(0-5)/Kid(5-14)/Young(14-25)/Adult(25-55)/Elderly(55-80); 0/1; 0/1; 0/1; 0/1; 0/1; 0-1 |
| SibSp | 891 | int64 | 0-8 | Sibling = brother, sister, stepbrother, stepsister / Spouse = husband, wife | as 0=alone, 1=not alone | SibORParch | 0/1 |
| Parch | 891 | int64 | 0-6 | Parent = mother, father / Child = daughter, son, stepdaughter, stepson | as 0=alone, 1=not alone | SibORParch | 0/1 |
| Ticket | 891 | object | Letters and numbers | | | | |
| Fare | 891 | float64 | 0-512.3292 Dollars | | normalising | Fare_norm | 0-1 |
| Cabin | 204 | object | Letters and numbers | | | | |
| Embarked | 889 | object | C/Q/S | C = Cherbourg, Q = Queenstown, S = Southampton | categories as numbers (C=0, Q=1, S=2) and then also as 3 single variables 0/1 | Embarked_ord; Embk_Cherbourg; Embk_Queenstown; Embk_Southampton | 0/1/2; 0/1; 0/1; 0/1 |

## <a id='group'>2.2. Grouping and transforming: Age, SibSp, Parch, Sex, Embarked</a>

In [None]:
#Age: group Age into the categories Age_Group = ['Infant','Kid','Young','Adult','Elderly']. Then split the age groups into 5 single variables with values 0/1

bins_Age_Group= [0,4.9999,13.9999,24.9999,54.9999,100]
labels_Age_Group = ['Infant','Kid','Young','Adult','Elderly']
df['Age_Group'] = pd.cut(df['Age'], bins=bins_Age_Group, labels=labels_Age_Group, right=False)
df['Age_Group_ord'] = pd.Categorical(df.Age_Group).codes

bins_Age_Infant = [0,4.9999]
labels_Age_Infant = ['Infant']
df['Age_Infant'] = pd.cut(df['Age'], bins =bins_Age_Infant, labels =labels_Age_Infant, right=False)
df['Age_Infant'] = df['Age_Infant'].notna().astype('int')

bins_Age_Kid = [5,13.9999]
labels_Age_Kid = ['Kid']
df['Age_Kid'] = pd.cut(df['Age'], bins =bins_Age_Kid, labels =labels_Age_Kid, right=False)
df['Age_Kid'] = df['Age_Kid'].notna().astype('int')

bins_Age_Young = [14,24.9999]
labels_Age_Young = ['Young']
df['Age_Young'] = pd.cut(df['Age'], bins =bins_Age_Young, labels =labels_Age_Young, right=False)
df['Age_Young'] = df['Age_Young'].notna().astype('int')

bins_Age_Adult = [25,54.9999]
labels_Age_Adult = ['Adult']
df['Age_Adult'] = pd.cut(df['Age'], bins =bins_Age_Adult, labels =labels_Age_Adult, right=False)
df['Age_Adult'] = df['Age_Adult'].notna().astype('int')

bins_Age_Elderly = [55,100]
labels_Age_Elderly = ['Elderly']
df['Age_Elderly'] = pd.cut(df['Age'], bins =bins_Age_Elderly, labels =labels_Age_Elderly, right=False)
df['Age_Elderly'] = df['Age_Elderly'].notna().astype('int')
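The age grouping above can also be sketched more compactly with one `pd.cut` call followed by `pd.get_dummies`. This is only an illustration on hypothetical toy ages, not the Kaggle file, and it assumes integer bin edges (with `right=False` they give the intervals listed in the planning table).

```python
import pandas as pd

# Toy ages, including a missing value (hypothetical data, not the Kaggle file)
ages = pd.DataFrame({"Age": [0.42, 4.0, 5.0, 13.0, 14.0, 30.0, 55.0, 80.0, None]})

bins = [0, 5, 14, 25, 55, 100]
labels = ["Infant", "Kid", "Young", "Adult", "Elderly"]

# right=False makes each bin closed on the left: [0, 5), [5, 14), ...
ages["Age_Group"] = pd.cut(ages["Age"], bins=bins, labels=labels, right=False)

# One 0/1 column per group; a row with a missing age gets 0 in every column
age_dummies = pd.get_dummies(ages["Age_Group"], prefix="Age").astype(int)
ages = pd.concat([ages, age_dummies], axis=1)
print(ages)
```

Because every group comes from the same bin list, a copy-and-paste slip in a single group's bins cannot occur in this variant.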

In [None]:
#SibSp and Parch: aggregated into one single 0/1 variable SibORParch, which means alone/not alone on board.

bins_SibORParch = [1,20]
labels_SibORParch = ['notalone']
df['SibORParch'] = pd.cut(df['SibSp']+df['Parch'], bins =bins_SibORParch, labels =labels_SibORParch, right=False)
df['SibORParch'] = df['SibORParch'].notna().astype('int')

In [None]:
#Sex: categories as numbers, numerical variable 0/1 (0 = female, 1 = male)

df['Sex_ord'] = pd.Categorical(df.Sex).codes

In [None]:
#Embarked: Embarked [C = Cherbourg, Q = Queenstown, S = Southampton], categories as numbers in the numerical variable Embarked_ord 0/1/2, then Embarked is split into 3 single variables with values 0/1

#C=0, Q=1, S=2
df['Embarked_ord'] = pd.Categorical(df.Embarked).codes

#Cherbourg
bins_Embk_Cherbourg = [0,0.9999]
labels_Embk_Cherbourg = ['Embk_Cherbourg']
df['Embk_Cherbourg'] = pd.cut(df['Embarked_ord'], bins =bins_Embk_Cherbourg, labels =labels_Embk_Cherbourg, right=False)
df['Embk_Cherbourg'] = df['Embk_Cherbourg'].notna().astype('int')

#Queenstown
bins_Embk_Queenstown = [1,1.9999]
labels_Embk_Queenstown = ['Embk_Queenstown']
df['Embk_Queenstown'] = pd.cut(df['Embarked_ord'], bins =bins_Embk_Queenstown, labels =labels_Embk_Queenstown, right=False)
df['Embk_Queenstown'] = df['Embk_Queenstown'].notna().astype('int')

#Southampton
bins_Embk_Southampton = [2,2.9999]
labels_Embk_Southampton = ['Embk_Southampton']
df['Embk_Southampton'] = pd.cut(df['Embarked_ord'], bins =bins_Embk_Southampton, labels =labels_Embk_Southampton, right=False)
df['Embk_Southampton'] = df['Embk_Southampton'].notna().astype('int')

## <a id='norm'>2.3. Normalising: Fare, Age, Pclass</a>

In [None]:
#Normalise Fare as new variable 0-1
df['Fare_norm'] = df['Fare']/np.max(df['Fare'])

#Normalise Age as new variable 0-1

df['Age_norm'] = df['Age']/np.max(df['Age'])

#Normalise Pclass as new variable 0-1
df['Pclass_norm'] = df['Pclass']/np.max(df['Pclass'])
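Dividing by the maximum maps a column to 0-1 only if its minimum is 0. For Pclass (values 1/2/3) the result ranges from about 0.33 to 1. A minimal sketch of min-max scaling on hypothetical toy values shows the difference; the helper `min_max` and the column names ending in `_max` and `_minmax` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy columns standing in for Fare and Pclass (hypothetical values)
toy = pd.DataFrame({"Fare": [0.0, 7.25, 71.2833, 512.3292], "Pclass": [3, 3, 1, 1]})

def min_max(column):
    """Rescale a numeric column so its minimum maps to 0 and its maximum to 1."""
    return (column - column.min()) / (column.max() - column.min())

# Division by the maximum, as in the cell above
toy["Fare_max"] = toy["Fare"] / toy["Fare"].max()
toy["Pclass_max"] = toy["Pclass"] / toy["Pclass"].max()

# Min-max scaling covers the full 0-1 range even when the minimum is not 0
toy["Pclass_minmax"] = min_max(toy["Pclass"])

print(toy)
```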

In [None]:
#Correlation of all variables, including the new ones, with "Survived"

df.corr()["Survived"].abs().sort_values(ascending = False)

In [None]:
#Correlation Matrix

#sns.heatmap(df.corr());

# <a id='split'>3. Splitting the data</a>

Splitting the data into training and validation sets, written to new CSV files.

| Kaggle set | Share | Split set | Needed for |
| :- | --- | :- | :- |
| train set | 80% | training set | fitting the model |
| train set | 20% | validation set | validating the model |
| test set | 100% | | predicting the data |

In [None]:
#Write the current dataframe df to a new csv
df.to_csv(r'train_new.csv', index = False, header=True)

#Randomly split the new csv into a training set (about 80% of rows) and a validation set (about 20% of rows)
import random

with open('train_new.csv') as data:
    with open('train_training.csv', 'w') as training:
        with open('train_validation.csv', 'w') as validation:
            header = next(data)
            training.write(header)
            validation.write(header)
            
            for line in data:
                #About 20% of the rows go to the validation set, the rest to the training set
                if random.random() > 0.80:
                    validation.write(line)
                else:
                    training.write(line)
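As an alternative to the line-by-line split above, the same 80/20 idea can be sketched with `train_test_split`, which the notebook already imports. The toy frame, `random_state=42`, and the choice to stratify on `Survived` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the prepared train set (hypothetical data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Fare": rng.uniform(0, 100, size=100),
    "Survived": [0] * 60 + [1] * 40,
})

# 80/20 split; stratify keeps the share of survivors similar in both parts,
# random_state makes the split reproducible
toy_training, toy_validation = train_test_split(
    toy, test_size=0.2, stratify=toy["Survived"], random_state=42
)

print(len(toy_training), len(toy_validation))
print(toy_validation["Survived"].mean())
```

Unlike the random line-by-line split, this gives exactly 20% of the rows to the validation set and the same split on every run.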

# <a id='fit'>4. Fitting and validating the models</a>

In [None]:
#Read the training set (80% of the data)
input_training_file = "./train_training.csv"
df_training = pd.read_csv(input_training_file, header = 0, sep = ',', quotechar='"')
#df_training.head()

In [None]:
#Read the validation set
input_validation_file = "./train_validation.csv"
df_validation = pd.read_csv(input_validation_file, header = 0, sep = ',', quotechar='"')
#df_validation.head()

In [None]:
%matplotlib
df_training.info()
df_training.describe()

In [None]:
df_training.columns

## <a id='fitLinSVC'>4.1. Fitting the Linear SupportVectorClassifier (LinSVC) and GridSearchCV</a>

### <a id='fitLinSVCgscv'>4.1.1. LinSVC and GridSearchCV</a>

In [None]:
# Attention! This code takes a long time to run!

X_linsvcgscv_training = df_training[['Sex_ord', 'SibORParch', 'Fare', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
y_linsvcgscv_training = df_training["Survived"]

X_linsvcgscv_validation = df_validation[['Sex_ord', 'SibORParch', 'Fare', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
y_linsvcgscv_validation = df_validation["Survived"]

#Fit the scaler on the training set only and reuse it for the validation set
sc_linsvcgscv = StandardScaler()
sc_linsvcgscv_training = sc_linsvcgscv.fit(X_linsvcgscv_training)

X_linsvcgscv_training_scalar = sc_linsvcgscv_training.transform(X_linsvcgscv_training)
X_linsvcgscv_validation_scalar = sc_linsvcgscv_training.transform(X_linsvcgscv_validation)

model_linsvcgscv = GridSearchCV(LinearSVC(), param_grid = {
    "C": [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1], 
    "max_iter": [1000000]
}, cv = RepeatedKFold())

#Fit the grid search on the training set only; the validation set is only used for scoring
model_linsvcgscv_training = model_linsvcgscv.fit(X_linsvcgscv_training_scalar, y_linsvcgscv_training)

print('Best model parameter: ' + str(model_linsvcgscv_training.best_params_))
print('Best model score: ' + str(model_linsvcgscv_training.best_score_))

#Score on the scaled data, because the model was fitted on scaled data
print('Model score: ' + str(model_linsvcgscv_training.score(X_linsvcgscv_training_scalar, y_linsvcgscv_training)))
print('Model score testing new data: ' + str(model_linsvcgscv_training.score(X_linsvcgscv_validation_scalar, y_linsvcgscv_validation)))
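One way to keep scaling and grid search consistent is to put the scaler inside a pipeline, so that each cross-validation fold fits the scaler only on its own training part. The following is a minimal sketch on synthetic data. The data, the reduced `C` grid, and the `RepeatedKFold` settings are assumptions chosen to keep the example fast, not the notebook's values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the Titanic features (hypothetical data)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# The scaler is a pipeline step, so every CV fold refits it on that fold's training part only
pipe = make_pipeline(StandardScaler(), LinearSVC(max_iter=100000))

# make_pipeline names each step after its lowercased class name, hence "linearsvc__C"
search = GridSearchCV(
    pipe,
    param_grid={"linearsvc__C": [0.001, 0.01, 0.1, 1]},
    cv=RepeatedKFold(n_splits=5, n_repeats=2, random_state=0),
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_val, y_val))
```

Calling `search.score` on the raw validation features is safe here, because the pipeline applies the fitted scaler before predicting.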

### <a id='fitLinSVCparam'>4.1.2. Fitting the LinSVC, GridSearchCV parameters</a>

In [None]:
#from sklearn.svm import LinearSVC

X_linsvc_training = df_training[['Sex_ord', 'SibORParch', 'Fare', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
y_linsvc_training = df_training["Survived"]

sc_linsvc = StandardScaler()
sc_linsvc_training = sc_linsvc.fit(X_linsvc_training)

X_linsvc_training_scale = sc_linsvc_training.transform(X_linsvc_training)

model_linsvc = LinearSVC(max_iter = 1000000, C = 0.01)
model_linsvc.fit(X_linsvc_training_scale, y_linsvc_training)

#print(model_linsvc.score(X_linsvc_training_scale, y_linsvc_training))

In [None]:
#Use the scaled training data, because the model was fitted on scaled data
plot_confusion_matrix(model_linsvc, X_linsvc_training_scale, y_linsvc_training, normalize = "all")

### <a id='valiLinSVC'>4.1.3. Validating the LinSVC</a>

In [None]:
X_linsvc_validation = df_validation[['Sex_ord', 'SibORParch', 'Fare', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
#Scale the validation set with the scaler fitted on the training set
X_linsvc_validation_scale = sc_linsvc_training.transform(X_linsvc_validation)
y_linsvc_validation = model_linsvc.predict(X_linsvc_validation_scale)

#Add new column with the values predicted with the Linear SVC
df_validation['Survived_linsvc_validation']=y_linsvc_validation

In [None]:
#Compare the predictions with the true validation labels
plot_confusion_matrix(model_linsvc, X_linsvc_validation_scale, df_validation['Survived'], normalize = "all")

In [None]:
#Confusion matrix: the true values from the validation set vs. LinearSVC predicted values for the validation set

y_linsvc_true = df_validation['Survived']
y_linsvc_pred = df_validation['Survived_linsvc_validation']

linsvc_validation_cm = confusion_matrix(y_linsvc_true, y_linsvc_pred, normalize = "all")
linsvc_validation_score = accuracy_score(y_linsvc_true, y_linsvc_pred)

linsvc_validation_cm

In [None]:
print('LinSVC training score: ' + str(model_linsvc.score(X_linsvc_training_scale, y_linsvc_training)))
print('LinSVC validation score: ' + str(linsvc_validation_score))
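The relation between the normalised confusion matrix and the scores can be checked on a small example. The labels below are hypothetical. With `normalize="all"` the matrix entries are shares of all samples, so the diagonal sums to the accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (1 = survived)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes; entries are shares of all samples
cm = confusion_matrix(y_true, y_pred, normalize="all")
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / all
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)

print(cm)
print(accuracy, recall, precision)
```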

## <a id='fitRBFSVC'>4.2. Fitting the RBF SupportVectorClassifier (RBF SVC) and GridSearchCV</a>

### <a id='fitRBFSVCgscv'>4.2.1. RBF SVC and GridSearchCV</a>

In [None]:
#from sklearn.svm import SVC
#from sklearn.preprocessing import StandardScaler

X_rbfsvcgscv_training = df_training[['Sex_ord', 'SibORParch', 'Fare', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
y_rbfsvcgscv_training = df_training["Survived"]

X_rbfsvcgscv_validation = df_validation[['Sex_ord', 'SibORParch', 'Fare', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
y_rbfsvcgscv_validation = df_validation["Survived"]

#Use a scaler or PCA for optimizing; fit the scaler on the training set only
sc_rbfsvcgscv = StandardScaler()
sc_rbfsvcgscv_training = sc_rbfsvcgscv.fit(X_rbfsvcgscv_training)

X_rbfsvcgscv_training_scalar = sc_rbfsvcgscv_training.transform(X_rbfsvcgscv_training)
X_rbfsvcgscv_validation_scalar = sc_rbfsvcgscv_training.transform(X_rbfsvcgscv_validation)

model_rbfsvcgscv = GridSearchCV(SVC(), param_grid = {
    "kernel": ["rbf"], 
    "C": [20, 25, 30, 35, 40], 
    "gamma": [0.0005, 0.001, 0.005, 0.01, 0.05]
}, cv = RepeatedKFold(), n_jobs = 8)

#Fit the grid search on the training set only; the validation set is only used for scoring
model_rbfsvcgscv_training = model_rbfsvcgscv.fit(X_rbfsvcgscv_training_scalar, y_rbfsvcgscv_training)

print('Best model parameter: ' + str(model_rbfsvcgscv_training.best_params_))
print('Best model score: ' + str(model_rbfsvcgscv_training.best_score_))

#Score on the scaled data, because the model was fitted on scaled data
print('Model score: ' + str(model_rbfsvcgscv_training.score(X_rbfsvcgscv_training_scalar, y_rbfsvcgscv_training)))
print('Model score testing new data: ' + str(model_rbfsvcgscv_training.score(X_rbfsvcgscv_validation_scalar, y_rbfsvcgscv_validation)))

### <a id='fitRBFSVCparam'>4.2.2. Fitting the RBF SVC, GridSearchCV parameters</a>

In [None]:
#from sklearn.svm import SVC

X_rbfsvc_training = df_training[['Sex_ord', 'SibORParch', 'Fare', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
y_rbfsvc_training = df_training["Survived"]

sc_rbfsvc = StandardScaler()
sc_rbfsvc_training = sc_rbfsvc.fit(X_rbfsvc_training)

X_rbfsvc_training_scale = sc_rbfsvc_training.transform(X_rbfsvc_training)

model_rbfsvc = SVC(kernel = "rbf", C = 40, gamma = 0.01)
model_rbfsvc.fit(X_rbfsvc_training_scale, y_rbfsvc_training)

#print(model_rbfsvc.score(X_rbfsvc_training_scale, y_rbfsvc_training))

In [None]:
#Use the RBF SVC model and the scaled training data
plot_confusion_matrix(model_rbfsvc, X_rbfsvc_training_scale, y_rbfsvc_training, normalize = "all")

### <a id='valiRBFSVC'>4.2.3. Validating the RBF SVC</a>

In [None]:
X_rbfsvc_validation = df_validation[['Sex_ord', 'SibORParch', 'Fare', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
#Scale the validation set with the scaler fitted on the training set
X_rbfsvc_validation_scale = sc_rbfsvc_training.transform(X_rbfsvc_validation)
y_rbfsvc_validation = model_rbfsvc.predict(X_rbfsvc_validation_scale)

#Add new column with the values predicted with the RBF SVC
df_validation['Survived_rbfsvc_validation']=y_rbfsvc_validation

In [None]:
#Compare the RBF SVC predictions with the true validation labels
plot_confusion_matrix(model_rbfsvc, X_rbfsvc_validation_scale, df_validation['Survived'], normalize = "all")

In [None]:
#Confusion matrix: the true values from the validation set vs. rbfSVC predicted values for the validation set

y_rbfsvc_true = df_validation['Survived']
y_rbfsvc_pred = df_validation['Survived_rbfsvc_validation']

rbfsvc_validation_cm = confusion_matrix(y_rbfsvc_true, y_rbfsvc_pred, normalize = "all")
rbfsvc_validation_score = accuracy_score(y_rbfsvc_true, y_rbfsvc_pred)

rbfsvc_validation_cm

In [None]:
print('RBF SVC training score: ' + str(model_rbfsvc.score(X_rbfsvc_training_scale, y_rbfsvc_training)))
print('RBF SVC validation score: ' + str(rbfsvc_validation_score))

## <a id='fitDT'>4.3. Fitting the DecisionTreeClassifier (DT) and GridSearchCV</a>

### <a id='fitDTgscv'>4.3.1. DT and GridSearchCV</a>

In [None]:
#test some parameters with GridSearchCV for the DecisionTreeClassifier

#from sklearn.tree import DecisionTreeClassifier

features_dt = ['Fare', 'SibORParch', 'Sex_ord', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]
X_dtgscv_training = pd.get_dummies(df_training[features_dt])
y_dtgscv_training = df_training["Survived"]
X_dtgscv_validation = pd.get_dummies(df_validation[features_dt])
y_dtgscv_validation = df_validation["Survived"]

model_dtgscv = GridSearchCV(DecisionTreeClassifier(), param_grid = {
    'max_depth': [23, 24, 25, 26, 27, 28, 29, 30, 31, 32],
    'min_samples_leaf': [2, 3, 4, 5, 6, 7]
}, cv = RepeatedKFold())

model_dtgscv.fit(X_dtgscv_training, y_dtgscv_training)

print('Best model parameters: ' + str(model_dtgscv.best_params_))
print('Best model score: ' + str(model_dtgscv.best_score_))

print('Model score: ' + str(model_dtgscv.score(X_dtgscv_training, y_dtgscv_training)))
print('Model score testing new data: ' + str(model_dtgscv.score(X_dtgscv_validation, y_dtgscv_validation)))

In [None]:
#pd.DataFrame(model_dtgscv.cv_results_)
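The commented-out line above hints at inspecting `cv_results_`. A minimal sketch on synthetic data shows how the grid can be ranked by mean test score. The data, the small parameter grid, and `cv=5` are assumptions to keep the example short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (hypothetical data)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [2, 6]},
    cv=5,
)
search.fit(X, y)

# cv_results_ is a dict of arrays; a DataFrame makes it easy to sort and compare
results = pd.DataFrame(search.cv_results_)
columns = ["param_max_depth", "param_min_samples_leaf", "mean_test_score", "std_test_score", "rank_test_score"]
ranked = results[columns].sort_values("rank_test_score")
print(ranked)
```

Looking at `std_test_score` next to `mean_test_score` helps to judge whether the best combination is clearly better than its neighbours or only within noise.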

### <a id='fitDTparam'>4.3.2. Fitting the DT, GridSearchCV parameters</a>

In [None]:
#Model DecisionTreeClassifier: Train a DecisionTreeClassifier with the training set

#from sklearn.tree import DecisionTreeClassifier

#features_dt = ['Fare', 'SibORParch', 'Sex_ord', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]
X_dt_training = pd.get_dummies(df_training[features_dt])
y_dt_training = df_training["Survived"]

#set the best parameters for the decision tree
model_dt = DecisionTreeClassifier(max_depth=32, min_samples_leaf = 6)
model_dt.fit(X_dt_training, y_dt_training)

In [None]:
#Print the DecisionTree
#from sklearn.tree import plot_tree
#import matplotlib.pyplot as plt

plot_tree(model_dt, 
          feature_names = ['Fare', 'SibORParch', 'Sex_ord', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ], 
          class_names = ["not survived", "survived"],
          filled = True)
plt.show()

In [None]:
plot_confusion_matrix(model_dt, X_dt_training, y_dt_training, normalize = "all")

### <a id='valiDT'>4.3.3. Validating the DT</a>

In [None]:
#Model DecisionTreeClassifier: Validate the fitted DecisionTreeClassifier
#Predict the validation set with the fitted DecisionTreeClassifier

X_dt_validation = pd.get_dummies(df_validation[features_dt])
y_dt_validation = model_dt.predict(X_dt_validation)

#Add new column with the values predicted with the DecisionTreeClassifier
df_validation['Survived_dt_validation']=y_dt_validation
#df_validation.head()

In [None]:
#Compare the predictions with the true validation labels
plot_confusion_matrix(model_dt, X_dt_validation, df_validation['Survived'], normalize = "all")

In [None]:
#Confusion matrix: the true values from the validation set vs. DecisionTreeClassifier predicted values for the validation set

y_dt_true = df_validation['Survived']
y_dt_pred = df_validation['Survived_dt_validation']

dt_validation_cm = confusion_matrix(y_dt_true, y_dt_pred, normalize = "all")
dt_validation_score = accuracy_score(y_dt_true, y_dt_pred)

dt_validation_cm

In [None]:
print('DT training score: ' + str(model_dt.score(X_dt_training, y_dt_training)))
print('DT validation score: ' + str(dt_validation_score))

## <a id='fitRF'>4.4. Fitting the RandomForestClassifier (RF) and GridSearchCV</a>

### <a id='fitRFgscv'>4.4.1. RF and GridSearchCV</a>

In [None]:
#test some parameters with GridSearchCV for the RandomForestClassifier

#from sklearn.ensemble import RandomForestClassifier

features_rf = ['Fare', 'SibORParch', 'Sex_ord', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]
X_rfgscv_training = pd.get_dummies(df_training[features_rf])
y_rfgscv_training = df_training["Survived"]
X_rfgscv_validation = pd.get_dummies(df_validation[features_rf])
y_rfgscv_validation = df_validation["Survived"]

model_rfgscv = GridSearchCV(RandomForestClassifier(), param_grid = {
    'max_depth': [11, 12, 13, 14, 15],
    'min_samples_leaf': [1, 2, 3]
}, cv = RepeatedKFold())

model_rfgscv.fit(X_rfgscv_training, y_rfgscv_training)

print('Best model parameter: ' + str(model_rfgscv.best_params_))
print('Best model score: ' + str(model_rfgscv.best_score_))

print('Model score: ' + str(model_rfgscv.score(X_rfgscv_training, y_rfgscv_training)))
print('Model score testing new data: ' + str(model_rfgscv.score(X_rfgscv_validation, y_rfgscv_validation)))

### <a id='fitRFparam'>4.4.2. Fitting the RF, GridSearchCV parameters</a>

In [None]:
#Model RandomForestClassifier: Train a RandomForestClassifier model with the training set

#from sklearn.ensemble import RandomForestClassifier

#features_rf = ['Fare', 'SibORParch', 'Sex_ord', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]
X_rf_training = pd.get_dummies(df_training[features_rf])
y_rf_training = df_training["Survived"]

model_rf = RandomForestClassifier(n_estimators=100, max_depth=13, min_samples_leaf = 2)
model_rf.fit(X_rf_training, y_rf_training)

In [None]:
plot_confusion_matrix(model_rf, X_rf_training, y_rf_training, normalize = "all")

### <a id='valiRF'>4.4.3. Validating the RF</a>

In [None]:
#Model RandomForestClassifier: Validate the fitted RandomForestClassifier model
#Predict the validation set with the fitted RandomForestClassifier model

X_rf_validation = pd.get_dummies(df_validation[features_rf])
y_rf_validation = model_rf.predict(X_rf_validation)

#Add new column with the values predicted with the RandomForestClassifier model
df_validation['Survived_rf_validation']=y_rf_validation
#df_validation.head()

In [None]:
#Compare the predictions with the true validation labels
plot_confusion_matrix(model_rf, X_rf_validation, df_validation['Survived'], normalize = "all")

In [None]:
#Confusion matrix: the true values from the validation set vs. RandomForestClassifier model predicted values for the validation set

y_rf_true = df_validation['Survived']
y_rf_pred = df_validation['Survived_rf_validation']

rf_validation_cm = confusion_matrix(y_rf_true, y_rf_pred, normalize = "all")
rf_validation_score = accuracy_score(y_rf_true, y_rf_pred)

rf_validation_cm

In [None]:
print('RF training score: ' + str(model_rf.score(X_rf_training, y_rf_training)))
print('RF validation score: ' + str(rf_validation_score))

## <a id='fitKNN'>4.5. Fitting the K-Nearest-Neighbours (KNN)</a>

### <a id='fitKNNgscv'>4.5.1. KNN and GridSearchCV</a>

In [None]:
#from sklearn.neighbors import KNeighborsClassifier

X_knngscv_training = df_training[['Sex_ord', 'SibORParch', 'Fare', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
y_knngscv_training = df_training["Survived"]

X_knngscv_validation = df_validation[['Sex_ord', 'SibORParch', 'Fare', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
y_knngscv_validation = df_validation["Survived"]

#Fit the scaler on the training set only and reuse it for the validation set
sc_knngscv = StandardScaler()
sc_knngscv_training = sc_knngscv.fit(X_knngscv_training)

X_knngscv_training_scalar = sc_knngscv_training.transform(X_knngscv_training)
X_knngscv_validation_scalar = sc_knngscv_training.transform(X_knngscv_validation)

#manhattan_distance (p=1) and euclidean_distance (p=2)
model_knngscv = GridSearchCV(KNeighborsClassifier(), param_grid = {
    'n_neighbors': [5, 6, 7, 8, 9, 10, 15, 20, 25, 35, 50, 75],
    'p': [1, 2], 
    #'weights': ['uniform', 'distance']
}, cv = RepeatedKFold())

#Fit the grid search on the training set only; the validation set is only used for scoring
model_knngscv_training = model_knngscv.fit(X_knngscv_training_scalar, y_knngscv_training)

print('Best model parameter: ' + str(model_knngscv_training.best_params_))
print('Best model score: ' + str(model_knngscv_training.best_score_))

#Score on the scaled data, because the model was fitted on scaled data
print('Model score: ' + str(model_knngscv_training.score(X_knngscv_training_scalar, y_knngscv_training)))
print('Model score testing new data: ' + str(model_knngscv_training.score(X_knngscv_validation_scalar, y_knngscv_validation)))

### <a id='fitKNNparam'>4.5.2. Fitting the KNN, GridSearchCV parameters</a>

In [None]:
#from sklearn.neighbors import KNeighborsClassifier

X_knn_training = df_training[['Sex_ord', 'SibORParch', 'Fare', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
y_knn_training = df_training["Survived"]

sc_knn = StandardScaler()
sc_knn.fit(X_knn_training)

X_knn_training_scaled = sc_knn.transform(X_knn_training)

#euclidean distance (p=2); p=1 would be the manhattan distance
model_knn = KNeighborsClassifier(n_neighbors = 25, p = 2)
model_knn.fit(X_knn_training_scaled, y_knn_training)
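The parameter `p` selects the Minkowski distance used to find the neighbours. The small sketch below computes both variants for two hypothetical points; the helper `minkowski` is written here for illustration and is not a scikit-learn function.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance between two points; p=1 is Manhattan, p=2 is Euclidean."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Two hypothetical points in a two-dimensional feature space
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski(a, b, p=1)  # sum of absolute differences: 3 + 4
euclidean = minkowski(a, b, p=2)  # straight-line distance: sqrt(9 + 16)
print(manhattan, euclidean)
```

Because both distances add up differences across features, features with a larger scale dominate unless the data is standardised first, which is why the scaler is applied before KNN.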

In [None]:
#print(model_knn.predict_proba(X_knn_training_scaled))

In [None]:
plot_confusion_matrix(model_knn, X_knn_training_scaled, y_knn_training, normalize = "all")

### <a id='valiKNN'>4.5.3. Validating the KNN</a>

In [None]:
X_knn_validation = df_validation[['Sex_ord', 'SibORParch', 'Fare', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
X_knn_validation_scaled = sc_knn.transform(X_knn_validation)
y_knn_validation = model_knn.predict(X_knn_validation_scaled)

df_validation['Survived_knn_validation']=y_knn_validation

In [None]:
#Compare the predictions with the true validation labels
plot_confusion_matrix(model_knn, X_knn_validation_scaled, df_validation['Survived'], normalize = "all")

In [None]:
#Confusion matrix: the true values from the validation set vs. KNN model predicted values for the validation set

y_knn_true = df_validation['Survived']
y_knn_pred = df_validation['Survived_knn_validation']

knn_validation_cm = confusion_matrix(y_knn_true, y_knn_pred, normalize = "all")
knn_validation_score = accuracy_score(y_knn_true, y_knn_pred)

knn_validation_cm

In [None]:
print('KNN training score: ' + str(model_knn.score(X_knn_training_scaled, y_knn_training)))
print('KNN validation score: ' + str(knn_validation_score))

## <a id='fitLgR'>4.6. Fitting the Logistic Regression (LgR)</a>

In [None]:
#from sklearn.linear_model import LogisticRegression

X_lgr_training = df_training[['Fare', 'SibORParch', 'Sex_ord', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
y_lgr_training = df_training["Survived"]

model_lgr = LogisticRegression(class_weight = "balanced")
model_lgr.fit(X_lgr_training, y_lgr_training)

In [None]:
plot_confusion_matrix(model_lgr, X_lgr_training, y_lgr_training, normalize = "all")

### <a id='valiLgR'>4.6.1. Validating the LgR</a>

In [None]:
X_lgr_validation = df_validation[['Fare', 'SibORParch', 'Sex_ord', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]

y_lgr_validation = model_lgr.predict(X_lgr_validation)
df_validation['Survived_lgr_validation']=y_lgr_validation

In [None]:
#Compare the predictions with the true validation labels
plot_confusion_matrix(model_lgr, X_lgr_validation, df_validation['Survived'], normalize = "all")

In [None]:
#Confusion matrix: the true values from the validation set vs. Logistic Regression model predicted values for the validation set

y_lgr_true = df_validation['Survived']
y_lgr_pred = df_validation['Survived_lgr_validation']

lgr_validation_cm = confusion_matrix(y_lgr_true, y_lgr_pred, normalize = "all")
lgr_validation_score = accuracy_score(y_lgr_true, y_lgr_pred)

lgr_validation_cm

In [None]:
print('LgR training score: ' + str(model_lgr.score(X_lgr_training, y_lgr_training)))
print('LgR validation score: ' + str(lgr_validation_score))

## <a id='fitPCA-LgR'>4.7. Fitting the Principal Component Analysis and Logistic Regression (PCA LgR)</a>

In [None]:
#from sklearn.decomposition import PCA

X_pca_lgr_training = df_training[['Sex_ord', 'SibORParch', 'Fare', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
y_pca_lgr_training = df_training["Survived"]

pca = PCA(n_components = 2)

# pca.fit(X_pca_lgr_training)
# X_pca_lgr_training_transformed = pca.transform(X_pca_lgr_training)

X_pca_lgr_training_transformed = pca.fit_transform(X_pca_lgr_training)

model_pca_lgr = LogisticRegression(class_weight = "balanced")
model_pca_lgr.fit(X_pca_lgr_training_transformed, y_pca_lgr_training)

In [None]:
sns.scatterplot(x = X_pca_lgr_training_transformed[:, 0], y = X_pca_lgr_training_transformed[:, 1], hue = df_training["Survived"]);

In [None]:
plot_confusion_matrix(model_pca_lgr, X_pca_lgr_training_transformed, y_pca_lgr_training, normalize = "all")

### <a id='valiPCA-LgR'>4.7.1. Validating the PCA LgR</a>

In [None]:
#Dimension reduction and Logistic Regression

X_pca_lgr_validation = df_validation[['Fare', 'SibORParch', 'Sex_ord', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
X_pca_lgr_validation_transformed = pca.transform(X_pca_lgr_validation)

y_pca_lgr_validation = model_pca_lgr.predict(pca.transform(X_pca_lgr_validation))
df_validation['Survived_pca_lgr_validation'] = y_pca_lgr_validation

In [None]:
sns.scatterplot(x = X_pca_lgr_validation_transformed[:, 0], y = X_pca_lgr_validation_transformed[:, 1], hue = df_validation["Survived"]);

In [None]:
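#Compare the predictions with the true labels of the validation set, not with the predictions themselves
plot_confusion_matrix(model_pca_lgr, X_pca_lgr_validation_transformed, df_validation['Survived'], normalize = "all")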

In [None]:
#Confusion matrix: the true values from the validation set vs. PCA Logistic Regression model predicted values for the validation set

y_pca_lgr_true = df_validation['Survived']
y_pca_lgr_pred = df_validation['Survived_pca_lgr_validation']

pca_lgr_validation_cm = confusion_matrix(y_pca_lgr_true, y_pca_lgr_pred, normalize = "all")
pca_lgr_validation_score = accuracy_score(y_pca_lgr_true, y_pca_lgr_pred)

pca_lgr_validation_cm

In [None]:
print('PCA LgR training score: ' + str(model_pca_lgr.score(X_pca_lgr_training_transformed, y_pca_lgr_training)))
print('PCA LgR validation score: ' + str(pca_lgr_validation_score))
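
Because the PCA and the logistic regression always have to be applied together and in the same column order, the two steps can also be chained. The sketch below is one possible arrangement using scikit-learn's `Pipeline`; it runs on synthetic data, and the column names `f1` to `f4` are hypothetical placeholders rather than the notebook's features.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (not the Titanic set); column names are hypothetical
rng = np.random.default_rng(0)
cols = ["f1", "f2", "f3", "f4"]
X_train = pd.DataFrame(rng.normal(size=(200, 4)), columns=cols)
y_train = (X_train["f1"] + X_train["f2"] > 0).astype(int)
X_valid = pd.DataFrame(rng.normal(size=(50, 4)), columns=cols)

# PCA is fitted on the training data only and reused by predict()
pipe = Pipeline([
    ("pca", PCA(n_components=2)),
    ("lgr", LogisticRegression(class_weight="balanced")),
])
pipe.fit(X_train, y_train)

# Selecting columns through one shared list keeps the order identical across sets
y_valid_pred = pipe.predict(X_valid[cols])

# Share of the total variance kept by the two components
kept_variance = pipe.named_steps["pca"].explained_variance_ratio_.sum()
print(kept_variance)
```

The explained variance ratio gives a rough idea of how much information the two components retain before the classifier sees the data.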

## <a id='fitOLS'>4.8. Fitting the Ordinary Least Squares (OLS)</a>

In [None]:
#Model OLS: Train an OLS model with the training set

#import statsmodels.api as sm

X_ols_training = df_training[['Fare', 'SibORParch', 'Sex_ord', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
y_ols_training = df_training[['Survived']]

X1_ols_training = sm.add_constant(X_ols_training)
model_ols = sm.OLS(y_ols_training, X1_ols_training).fit()

#summary
model_ols.summary()

### <a id='valiOLS'>4.8.1. Validating the OLS</a>

In [None]:
#Model OLS: Validate the fitted OLS model
#Predict the validation set with the fitted OLS model

X_ols_validation = df_validation[['Fare', 'SibORParch', 'Sex_ord', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
X1_ols_validation = sm.add_constant(X_ols_validation)

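y_ols_validation = model_ols.predict(X1_ols_validation)
#Threshold at 0.5 so that the labels stay in {0, 1} even when OLS predicts values outside [0, 1]
y_ols_validation = (y_ols_validation >= 0.5).astype(int)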

#Add new column with the values predicted with the OLS model
df_validation['Survived_ols_validation']=y_ols_validation

In [None]:
#Confusion matrix: the true values from the validation set vs. OLS model predicted values for the validation set

y_ols_true = df_validation['Survived']
y_ols_pred = df_validation['Survived_ols_validation']

ols_validation_cm = confusion_matrix(y_ols_true, y_ols_pred, normalize = "all")
ols_validation_score = accuracy_score(y_ols_true, y_ols_pred)

ols_validation_cm

In [None]:
print('OLS validation score: ' + str(ols_validation_score))
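
OLS is a regression model, so its fitted values are not probabilities and can fall below 0 or above 1. Plain rounding would then turn a value of 1.5 or more into the label 2, which the binary metrics cannot handle. The sketch below shows the effect on synthetic data; `np.linalg.lstsq` is used here only as a stand-in for `sm.OLS`, and all names are illustrative.

```python
import numpy as np

# Synthetic data (hypothetical, not the Titanic set)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Least-squares fit with an intercept column; np.linalg.lstsq stands in for sm.OLS
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ coef

# Fitted values are not probabilities and may lie outside [0, 1]
print(y_hat.min(), y_hat.max())

# An explicit threshold always yields labels in {0, 1}
y_label = (y_hat >= 0.5).astype(int)
print(np.unique(y_label))
```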

## <a id='valiARP'>4.9. Accuracy, Recall, Precision: validation set truth vs. validation set predictions</a>

In [None]:
#Accuracy

A_linsvc = accuracy_score(y_linsvc_true, y_linsvc_pred)
A_rbfsvc = accuracy_score(y_rbfsvc_true, y_rbfsvc_pred)
A_dt = accuracy_score(y_dt_true, y_dt_pred)
A_rf = accuracy_score(y_rf_true, y_rf_pred)
A_knn = accuracy_score(y_knn_true, y_knn_pred)
A_lgr = accuracy_score(y_lgr_true, y_lgr_pred)
A_pca_lgr = accuracy_score(y_pca_lgr_true, y_pca_lgr_pred)
A_ols = accuracy_score(y_ols_true, y_ols_pred)

print("Accuracy LinSVC = " + str(A_linsvc))
print("Accuracy RBF SVC = " + str(A_rbfsvc))
print("Accuracy DT = " + str(A_dt))
print("Accuracy RF = " + str(A_rf))
print("Accuracy KNN = " + str(A_knn))
print("Accuracy LgR = " + str(A_lgr))
print("Accuracy PCA LgR = " + str(A_pca_lgr))
print("Accuracy OLS = " + str(A_ols))

In [None]:
#Recall

R_linsvc = recall_score(y_linsvc_true, y_linsvc_pred)
R_rbfsvc = recall_score(y_rbfsvc_true, y_rbfsvc_pred)
R_dt = recall_score(y_dt_true, y_dt_pred)
R_rf = recall_score(y_rf_true, y_rf_pred)
R_knn = recall_score(y_knn_true, y_knn_pred)
R_lgr = recall_score(y_lgr_true, y_lgr_pred)
R_pca_lgr = recall_score(y_pca_lgr_true, y_pca_lgr_pred)
R_ols = recall_score(y_ols_true, y_ols_pred)

print("Recall LinSVC = " + str(R_linsvc))
print("Recall RBF SVC = " + str(R_rbfsvc))
print("Recall DT = " + str(R_dt))
print("Recall RF = " + str(R_rf))
print("Recall KNN = " + str(R_knn))
print("Recall LgR = " + str(R_lgr))
print("Recall PCA LgR = " + str(R_pca_lgr))
print("Recall OLS = " + str(R_ols))

In [None]:
#Precision

P_linsvc = precision_score(y_linsvc_true, y_linsvc_pred)
P_rbfsvc = precision_score(y_rbfsvc_true, y_rbfsvc_pred)
P_dt = precision_score(y_dt_true, y_dt_pred)
P_rf = precision_score(y_rf_true, y_rf_pred)
P_knn = precision_score(y_knn_true, y_knn_pred)
P_lgr = precision_score(y_lgr_true, y_lgr_pred)
P_pca_lgr = precision_score(y_pca_lgr_true, y_pca_lgr_pred)
P_ols = precision_score(y_ols_true, y_ols_pred)

print("Precision LinSVC = " + str(P_linsvc))
print("Precision RBF SVC = " + str(P_rbfsvc))
print("Precision DT = " + str(P_dt))
print("Precision RF = " + str(P_rf))
print("Precision KNN = " + str(P_knn))
print("Precision LgR = " + str(P_lgr))
print("Precision PCA LgR = " + str(P_pca_lgr))
print("Precision OLS = " + str(P_ols))

## <a id='fitsummary'>4.10. Fitting and validating summary</a>

In [None]:
relevant_metrics = pd.DataFrame({
    'Model': ['LinearSVC', 'RBF SVC', 'Decision Tree', 'Random Forest', 'K-Nearest-Neighbours', 'Logistic Regression', 'PCA Logistic Regression', 'Ordinary Least Squares Linear Regression'],
    'Accuracy, A': [A_linsvc, A_rbfsvc, A_dt, A_rf, A_knn, A_lgr, A_pca_lgr, A_ols],
    'Recall, R': [R_linsvc, R_rbfsvc, R_dt, R_rf, R_knn, R_lgr, R_pca_lgr, R_ols],
    'Precision, P': [P_linsvc, P_rbfsvc, P_dt, P_rf, P_knn, P_lgr, P_pca_lgr, P_ols]})
best_model = relevant_metrics.sort_values(by='Accuracy, A', ascending=False)
best_model
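
The summary table can also be built in a loop over the models instead of one variable per model and metric. This is a minimal sketch on made-up predictions; "Model A" and "Model B" are placeholders, and in the notebook the dictionary would hold the `y_..._pred` series computed above.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels and predictions (made up for illustration)
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
predictions = {
    "Model A": [0, 1, 1, 0, 0, 0, 1, 1],
    "Model B": [1, 1, 1, 0, 1, 0, 0, 0],
}

# One row per model with the three metrics
rows = []
for name, y_pred in predictions.items():
    rows.append({
        "Model": name,
        "Accuracy, A": accuracy_score(y_true, y_pred),
        "Recall, R": recall_score(y_true, y_pred),
        "Precision, P": precision_score(y_true, y_pred),
    })

metrics_table = pd.DataFrame(rows).sort_values(by="Accuracy, A", ascending=False)
print(metrics_table)
```

Adding a further model then only requires one more dictionary entry.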

# <a id='pred'>5. Predicting</a>

In [None]:
#Read the test set
input_test_file = "../input/titanic/test.csv"
df_test = pd.read_csv(input_test_file, header = 0, sep = ',', quotechar='"')
df_test.head()

In [None]:
#Open the example set (with 100% accuracy) for comparison

input_example_file = "../input/titanic-leaked/titanic.csv"
df_example = pd.read_csv(input_example_file, header = 0, sep = ',', quotechar='"')
df_example.head()

In [None]:
df_test.info()

In [None]:
#Age
import numpy as np

bins_Age_Group= [0,4.9999,13.9999,24.9999,54.9999,100]
labels_Age_Group = ['Infant','Kid','Young','Adult','Elderly']
df_test['Age_Group'] = pd.cut(df_test['Age'], bins=bins_Age_Group, labels=labels_Age_Group, right=False)
#df['Age_Group_ord'] = pd.Categorical(df.Age_Group).codes

#Infant: first Age_Group bin (the bins must match those used for the training set)
bins_Age_Infant = [0,4.9999]
labels_Age_Infant = ['Infant']
df_test['Age_Infant'] = pd.cut(df_test['Age'], bins =bins_Age_Infant, labels =labels_Age_Infant, right=False)
df_test['Age_Infant'] = df_test['Age_Infant'].notna().astype('int')

bins_Age_Kid = [5,13.9999]
labels_Age_Kid = ['Kid']
df_test['Age_Kid'] = pd.cut(df_test['Age'], bins =bins_Age_Kid, labels =labels_Age_Kid, right=False)
df_test['Age_Kid'] = df_test['Age_Kid'].notna().astype('int')

bins_Age_Young = [14,24.9999]
labels_Age_Young = ['Young']
df_test['Age_Young'] = pd.cut(df_test['Age'], bins =bins_Age_Young, labels =labels_Age_Young, right=False)
df_test['Age_Young'] = df_test['Age_Young'].notna().astype('int')

bins_Age_Adult = [25,54.9999]
labels_Age_Adult = ['Adult']
df_test['Age_Adult'] = pd.cut(df_test['Age'], bins =bins_Age_Adult, labels =labels_Age_Adult, right=False)
df_test['Age_Adult'] = df_test['Age_Adult'].notna().astype('int')

bins_Age_Elderly = [55,100]
labels_Age_Elderly = ['Elderly']
df_test['Age_Elderly'] = pd.cut(df_test['Age'], bins =bins_Age_Elderly, labels =labels_Age_Elderly, right=False)
df_test['Age_Elderly'] = df_test['Age_Elderly'].notna().astype('int')
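
The five indicator columns above can also be derived from a single `pd.cut` call followed by `pd.get_dummies`. The sketch below uses made-up ages and whole-number bin edges, which is an assumption that simplifies the `4.9999`-style edges used in the notebook; it is meant as an illustration, not as a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy ages (made up), including a missing value
ages = pd.Series([0.5, 4, 5, 13, 14, 24, 25, 54, 55, 80, np.nan], name="Age")

# Left-closed bins: [0, 5), [5, 14), [14, 25), [25, 55), [55, 100)
edges = [0, 5, 14, 25, 55, 100]
labels = ["Infant", "Kid", "Young", "Adult", "Elderly"]
age_group = pd.cut(ages, bins=edges, labels=labels, right=False)

# One indicator column per label; a missing age gives zeros in every column
age_dummies = pd.get_dummies(age_group, prefix="Age").astype(int)
print(age_dummies.sum())
```

Since every group comes from the same list of edges, two groups cannot accidentally share a range.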

In [None]:
#SibSp and Parch

bins_SibORParch = [1,20]
labels_SibORParch = ['notalone']
df_test['SibORParch'] = pd.cut(df_test['SibSp']+df_test['Parch'], bins =bins_SibORParch, labels =labels_SibORParch, right=False)
df_test['SibORParch'] = df_test['SibORParch'].notna().astype('int')
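
The `SibORParch` flag marks passengers who travel with at least one relative. As long as the sum of `SibSp` and `Parch` stays below the upper bin edge of 20, a direct comparison gives the same result as the `pd.cut` route. A small sketch on made-up counts:

```python
import pandas as pd

# Toy family counts (made up for illustration)
toy = pd.DataFrame({"SibSp": [0, 1, 0, 3], "Parch": [0, 0, 2, 1]})

family = toy["SibSp"] + toy["Parch"]

# pd.cut route, as in the cell above: one left-closed bin [1, 20)
via_cut = pd.cut(family, bins=[1, 20], labels=["notalone"], right=False).notna().astype(int)

# Direct comparison: 1 if the passenger has any relative aboard
via_compare = (family > 0).astype(int)

print(via_cut.tolist())
print(via_compare.tolist())
```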

In [None]:
#Sex
df_test['Sex_ord'] = pd.Categorical(df_test.Sex).codes

In [None]:
#Embarked

#C=0, Q=1, S=2
df_test['Embarked_ord'] = pd.Categorical(df_test.Embarked).codes

#Cherbourg
bins_Embk_Cherbourg = [0,0.9999]
labels_Embk_Cherbourg = ['Embk_Cherbourg']
df_test['Embk_Cherbourg'] = pd.cut(df_test['Embarked_ord'], bins =bins_Embk_Cherbourg, labels =labels_Embk_Cherbourg, right=False)
df_test['Embk_Cherbourg'] = df_test['Embk_Cherbourg'].notna().astype('int')

#Queenstown
bins_Embk_Queenstown = [1,1.9999]
labels_Embk_Queenstown = ['Embk_Queenstown']
df_test['Embk_Queenstown'] = pd.cut(df_test['Embarked_ord'], bins =bins_Embk_Queenstown, labels =labels_Embk_Queenstown, right=False)
df_test['Embk_Queenstown'] = df_test['Embk_Queenstown'].notna().astype('int')

#Southampton
bins_Embk_Southampton = [2,2.9999]
labels_Embk_Southampton = ['Embk_Southampton']
df_test['Embk_Southampton'] = pd.cut(df_test['Embarked_ord'], bins =bins_Embk_Southampton, labels =labels_Embk_Southampton, right=False)
df_test['Embk_Southampton'] = df_test['Embk_Southampton'].notna().astype('int')
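
The coding C=0, Q=1, S=2 relies on all three ports being present in the column, because `pd.Categorical` numbers only the categories it finds. Passing the categories explicitly keeps the codes stable even if a port were missing. The following sketch on a made-up column illustrates this; the explicit category list is an assumption about the intended coding.

```python
import pandas as pd

# Toy column (made up) in which port "Q" never occurs and one value is missing
embarked = pd.Series(["S", "C", "S", None], name="Embarked")

# Without explicit categories the codes shift: here S would get code 1
inferred_codes = pd.Categorical(embarked).codes.tolist()

# Fixing the categories keeps the coding stable: C=0, Q=1, S=2, missing=-1
ports = pd.Categorical(embarked, categories=["C", "Q", "S"])
fixed_codes = ports.codes.tolist()

# One indicator column per port, including the port that does not occur
embk = pd.get_dummies(pd.Series(ports)).astype(int)
embk.columns = ["Embk_Cherbourg", "Embk_Queenstown", "Embk_Southampton"]

print(inferred_codes, fixed_codes)
print(embk)
```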

In [None]:
#Normalise Fare
df_test['Fare'] = df_test['Fare'].fillna(0.0)
df_test['Fare_norm'] = df_test['Fare']/np.max(df_test['Fare'])
#df_test['Fare_norm'] = df_test['Fare_norm'].fillna(0.0)

#Normalise Age
df_test['Age_norm'] = df_test['Age']/np.max(df_test['Age'])

#Normalise Pclass
df_test['Pclass_norm'] = df_test['Pclass']/np.max(df_test['Pclass'])

df_test.info()

## <a id='predLlinSVC'>5.1. Predicting with LinSVC</a>

In [None]:
X_linsvc_test = df_test[['Sex_ord', 'SibORParch', 'Fare', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]

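#Note: this scaler is fitted on the test set; reusing the scaler fitted on the training set would reproduce the scaling the model was trained with
sc_linsvc = StandardScaler()
sc_linsvc_test = sc_linsvc.fit(X_linsvc_test)
X_linsvc_test_scaled = sc_linsvc.transform(X_linsvc_test)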

y_linsvc_test = model_linsvc.predict(X_linsvc_test_scaled)

df_test['Survived_linsvc_test']=y_linsvc_test

#print(model_linsvc.score(X_linsvc_test_scaled, y_linsvc_test))

In [None]:
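#Use the example set as the truth; passing the predictions as truth would always give a diagonal matrix
plot_confusion_matrix(model_linsvc, X_linsvc_test_scaled, df_example['Survived'], normalize = "all")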

In [None]:
#Confusion Matrix: Compare the example as the truth vs. my predicted values with the LinSVC model

y_example_true = df_example['Survived']
y_example_vs_linsvc_pred = y_linsvc_test

example_vs_linsvc_test_cm = confusion_matrix(y_example_true, y_example_vs_linsvc_pred, normalize = "all")
example_vs_linsvc_test_score = accuracy_score(y_example_true, y_example_vs_linsvc_pred)

print(example_vs_linsvc_test_cm)
print('LinSVC test score: ' + str(example_vs_linsvc_test_score))
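
A model trained on scaled features expects the test features to be shifted and scaled with the same mean and standard deviation. The usual pattern is therefore to fit the scaler on the training set once and only call `transform` on later sets. The sketch below shows this on synthetic arrays; the variable names are illustrative and do not refer to the notebook's training cells.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins (not the Titanic data); the test set is shifted on purpose
rng = np.random.default_rng(2)
X_train = rng.normal(loc=10.0, scale=2.0, size=(300, 3))
X_test = rng.normal(loc=12.0, scale=2.0, size=(100, 3))

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_train)

# Apply the same shift and scale to both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training columns are centred at 0; test columns keep their offset
print(X_train_scaled.mean(axis=0).round(3))
print(X_test_scaled.mean(axis=0).round(3))
```

Refitting the scaler on the test set would centre the test columns at 0 as well and hide the offset from the model.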

## <a id='predRBFSVC'>5.2. Predicting with RBF SVC</a>

In [None]:
X_rbfsvc_test = df_test[['Sex_ord', 'SibORParch', 'Fare', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]

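#Note: this scaler is fitted on the test set; reusing the scaler fitted on the training set would reproduce the scaling the model was trained with
sc_rbfsvc = StandardScaler()
sc_rbfsvc_test = sc_rbfsvc.fit(X_rbfsvc_test)
X_rbfsvc_test_scaled = sc_rbfsvc.transform(X_rbfsvc_test)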

y_rbfsvc_test = model_rbfsvc.predict(X_rbfsvc_test_scaled)

df_test['Survived_rbfsvc_test']=y_rbfsvc_test

#print(model_rbfsvc.score(X_rbfsvc_test_scaled, y_rbfsvc_test))

In [None]:
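#Use the example set as the truth; passing the predictions as truth would always give a diagonal matrix
plot_confusion_matrix(model_rbfsvc, X_rbfsvc_test_scaled, df_example['Survived'], normalize = "all")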

In [None]:
#Confusion Matrix: Compare the example as the truth vs. my predicted values with the RBF SVC model

y_example_true = df_example['Survived']
y_example_vs_rbfsvc_pred = y_rbfsvc_test

example_vs_rbfsvc_test_cm = confusion_matrix(y_example_true, y_example_vs_rbfsvc_pred, normalize = "all")
example_vs_rbfsvc_test_score = accuracy_score(y_example_true, y_example_vs_rbfsvc_pred)

print(example_vs_rbfsvc_test_cm)
print('RBF SVC test score: ' + str(example_vs_rbfsvc_test_score))

## <a id='predDT'>5.3. Predicting with DT</a>

In [None]:
#Model DecisionTree: Predicting the test set with the fitted DecisionTree model

X_dt_test = pd.get_dummies(df_test[features_dt])

y_dt_test = model_dt.predict(X_dt_test)

#Add new column with the values predicted with the DecisionTreeClassifier model
df_test['Survived_dt_test']=y_dt_test

#print(model_dt.score(X_dt_test, y_dt_test))

In [None]:
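#Use the example set as the truth; passing the predictions as truth would always give a diagonal matrix
plot_confusion_matrix(model_dt, X_dt_test, df_example['Survived'], normalize = "all")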

In [None]:
#Confusion Matrix: Compare the example as the truth vs. my predicted values with the DecisionTreeClassifier model
#from sklearn.metrics import confusion_matrix

y_example_true = df_example['Survived']
y_example_vs_dt_pred = y_dt_test

example_vs_dt_test_cm = confusion_matrix(y_example_true, y_example_vs_dt_pred, normalize = "all")
example_vs_dt_test_score = accuracy_score(y_example_true, y_example_vs_dt_pred)

print(example_vs_dt_test_cm)
print('DT test score: ' + str(example_vs_dt_test_score))

## <a id='predRF'>5.4. Predicting with RF</a>

In [None]:
#Model RandomForestClassifier: Predicting the test set with the fitted RandomForestClassifier model

X_rf_test = pd.get_dummies(df_test[features_rf])

y_rf_test = model_rf.predict(X_rf_test)

#Add new column with the values predicted with the RandomForestClassifier model
df_test['Survived_rf_test']=y_rf_test

#print(model_rf.score(X_rf_test, y_rf_test))

In [None]:
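#Use the example set as the truth; passing the predictions as truth would always give a diagonal matrix
plot_confusion_matrix(model_rf, X_rf_test, df_example['Survived'], normalize = "all")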

In [None]:
#Confusion Matrix: Compare the example as the truth vs. my predicted values with the RandomForestClassifier model
#from sklearn.metrics import confusion_matrix

y_example_true = df_example['Survived']
y_example_vs_rf_pred = y_rf_test

example_vs_rf_test_cm = confusion_matrix(y_example_true, y_example_vs_rf_pred, normalize = "all")
example_vs_rf_test_score = accuracy_score(y_example_true, y_example_vs_rf_pred)

print(example_vs_rf_test_cm)
print('RF test score: ' + str(example_vs_rf_test_score))

## <a id='predKNN'>5.5. Predicting with KNN</a>

In [None]:
X_knn_test = df_test[['Sex_ord', 'SibORParch', 'Fare', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]

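#Note: this scaler is fitted on the test set; reusing the scaler fitted on the training set would reproduce the scaling the model was trained with
sc = StandardScaler()
sc.fit(X_knn_test)
X_knn_test_scaled = sc.transform(X_knn_test)

y_knn_test = model_knn.predict(X_knn_test_scaled)

df_test['Survived_knn_test']=y_knn_test

#print(model_knn.score(X_knn_test_scaled, y_knn_test))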

In [None]:
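#Use the example set as the truth; passing the predictions as truth would always give a diagonal matrix
plot_confusion_matrix(model_knn, X_knn_test_scaled, df_example['Survived'], normalize = "all")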

In [None]:
#Confusion Matrix: Compare the example as the truth vs. my predicted values with the KNN model

y_example_true = df_example['Survived']
y_example_vs_knn_pred = y_knn_test

example_vs_knn_test_cm = confusion_matrix(y_example_true, y_example_vs_knn_pred, normalize = "all")
example_vs_knn_test_score = accuracy_score(y_example_true, y_example_vs_knn_pred)

print(example_vs_knn_test_cm)
print('KNN test score: ' + str(example_vs_knn_test_score))

## <a id='predLgR'>5.6. Predicting with LgR</a>

In [None]:
X_lgr_test = df_test[['Fare', 'SibORParch', 'Sex_ord', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]

y_lgr_test = model_lgr.predict(X_lgr_test)
df_test['Survived_lgr_test']=y_lgr_test

#print(model_lgr.score(X_lgr_test, y_lgr_test))

In [None]:
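#Use the example set as the truth; passing the predictions as truth would always give a diagonal matrix
plot_confusion_matrix(model_lgr, X_lgr_test, df_example['Survived'], normalize = "all")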

In [None]:
#Confusion Matrix: Compare the example as the truth vs. my predicted values with the Logistic Regression model
#from sklearn.metrics import confusion_matrix

y_example_true = df_example['Survived']
y_example_vs_lgr_pred = y_lgr_test

example_vs_lgr_test_cm = confusion_matrix(y_example_true, y_example_vs_lgr_pred, normalize = "all")
example_vs_lgr_test_score = accuracy_score(y_example_true, y_example_vs_lgr_pred)

print(example_vs_lgr_test_cm)
print('LgR test score: ' + str(example_vs_lgr_test_score))

## <a id='predPCA-LgR'>5.7. Predicting with PCA - LgR</a>

In [None]:
X_pca_lgr_test = df_test[['Fare', 'SibORParch', 'Sex_ord', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]

X_pca_lgr_test_transformed = pca.transform(X_pca_lgr_test)

y_pca_lgr_test = model_pca_lgr.predict(X_pca_lgr_test_transformed)
df_test['Survived_pca_lgr_test']=y_pca_lgr_test

#print(model_pca_lgr.score(X_pca_lgr_test_transformed, y_pca_lgr_test))

In [None]:
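#Use the example set as the truth; passing the predictions as truth would always give a diagonal matrix
plot_confusion_matrix(model_pca_lgr, X_pca_lgr_test_transformed, df_example['Survived'], normalize = "all")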

In [None]:
#Confusion Matrix: Compare the example as the truth vs. my predicted values with the PCA Logistic Regression model
#from sklearn.metrics import confusion_matrix

y_example_true = df_example['Survived']
y_example_vs_pca_lgr_pred = y_pca_lgr_test

example_vs_pca_lgr_test_cm = confusion_matrix(y_example_true, y_example_vs_pca_lgr_pred, normalize = "all")
example_vs_pca_lgr_test_score = accuracy_score(y_example_true, y_example_vs_pca_lgr_pred)

print(example_vs_pca_lgr_test_cm)
print('PCA LgR test score: ' + str(example_vs_pca_lgr_test_score))

## <a id='predOLS'>5.8. Predicting with OLS</a>

In [None]:
#Model OLS: Predicting the test set with the fitted OLS model

#import statsmodels.api as sm

X_ols_test = df_test[['Fare', 'SibORParch', 'Sex_ord', 'Pclass', 'Embk_Cherbourg', 'Embk_Queenstown', 'Embk_Southampton', 'Age_Infant', 'Age_Kid', 'Age_Young', 'Age_Adult', 'Age_Elderly' ]]
X1_ols_test = sm.add_constant(X_ols_test)

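y_ols_test = model_ols.predict(X1_ols_test)
#Threshold at 0.5 so that the labels stay in {0, 1} even when OLS predicts values outside [0, 1]
y_ols_test = (y_ols_test >= 0.5).astype(int)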

#Add new column with the values predicted with the OLS model
df_test['Survived_ols_test']=y_ols_test

In [None]:
y_example_true = df_example['Survived']
y_example_vs_ols_pred = y_ols_test

example_vs_ols_test_cm = confusion_matrix(y_example_true, y_example_vs_ols_pred, normalize = "all")
example_vs_ols_test_score = accuracy_score(y_example_true, y_example_vs_ols_pred)

print(example_vs_ols_test_cm)
print('OLS test score: ' + str(example_vs_ols_test_score))

## <a id='predARP'>5.9. Accuracy, Recall, Precision: example set truth vs. test set predictions</a>

In [None]:
#Accuracy
#from sklearn.metrics import accuracy_score

A_example_vs_linsvc = accuracy_score(y_example_true, y_example_vs_linsvc_pred)
A_example_vs_rbfsvc = accuracy_score(y_example_true, y_example_vs_rbfsvc_pred)
A_example_vs_dt = accuracy_score(y_example_true, y_example_vs_dt_pred)
A_example_vs_rf = accuracy_score(y_example_true, y_example_vs_rf_pred)
A_example_vs_knn = accuracy_score(y_example_true, y_example_vs_knn_pred)
A_example_vs_lgr = accuracy_score(y_example_true, y_example_vs_lgr_pred)
A_example_vs_pca_lgr = accuracy_score(y_example_true, y_example_vs_pca_lgr_pred)
A_example_vs_ols = accuracy_score(y_example_true, y_example_vs_ols_pred)

print("Accuracy example vs. LinSVC = " + str(A_example_vs_linsvc))
print("Accuracy example vs. RBF SVC = " + str(A_example_vs_rbfsvc))
print("Accuracy example vs. DT = " + str(A_example_vs_dt))
print("Accuracy example vs. RF = " + str(A_example_vs_rf))
print("Accuracy example vs. KNN = " + str(A_example_vs_knn))
print("Accuracy example vs. LgR = " + str(A_example_vs_lgr))
print("Accuracy example vs. PCA LgR = " + str(A_example_vs_pca_lgr))
print("Accuracy example vs. OLS = " + str(A_example_vs_ols))

In [None]:
#Recall
#from sklearn.metrics import recall_score

R_example_vs_linsvc = recall_score(y_example_true, y_example_vs_linsvc_pred)
R_example_vs_rbfsvc = recall_score(y_example_true, y_example_vs_rbfsvc_pred)
R_example_vs_dt = recall_score(y_example_true, y_example_vs_dt_pred)
R_example_vs_rf = recall_score(y_example_true, y_example_vs_rf_pred)
R_example_vs_knn = recall_score(y_example_true, y_example_vs_knn_pred)
R_example_vs_lgr = recall_score(y_example_true, y_example_vs_lgr_pred)
R_example_vs_pca_lgr = recall_score(y_example_true, y_example_vs_pca_lgr_pred)
R_example_vs_ols = recall_score(y_example_true, y_example_vs_ols_pred)

print("Recall example vs. LinSVC = " + str(R_example_vs_linsvc))
print("Recall example vs. RBF SVC = " + str(R_example_vs_rbfsvc))
print("Recall example vs. DT = " + str(R_example_vs_dt))
print("Recall example vs. RF = " + str(R_example_vs_rf))
print("Recall example vs. KNN = " + str(R_example_vs_knn))
print("Recall example vs. LgR = " + str(R_example_vs_lgr))
print("Recall example vs. PCA LgR = " + str(R_example_vs_pca_lgr))
print("Recall example vs. OLS = " + str(R_example_vs_ols))

In [None]:
#Precision
#from sklearn.metrics import precision_score

P_example_vs_linsvc = precision_score(y_example_true, y_example_vs_linsvc_pred)
P_example_vs_rbfsvc = precision_score(y_example_true, y_example_vs_rbfsvc_pred)
P_example_vs_dt = precision_score(y_example_true, y_example_vs_dt_pred)
P_example_vs_rf = precision_score(y_example_true, y_example_vs_rf_pred)
P_example_vs_knn = precision_score(y_example_true, y_example_vs_knn_pred)
P_example_vs_lgr = precision_score(y_example_true, y_example_vs_lgr_pred)
P_example_vs_pca_lgr = precision_score(y_example_true, y_example_vs_pca_lgr_pred)
P_example_vs_ols = precision_score(y_example_true, y_example_vs_ols_pred)

print("Precision example vs. LinSVC = " + str(P_example_vs_linsvc))
print("Precision example vs. RBF SVC = " + str(P_example_vs_rbfsvc))
print("Precision example vs. DT = " + str(P_example_vs_dt))
print("Precision example vs. RF = " + str(P_example_vs_rf))
print("Precision example vs. KNN = " + str(P_example_vs_knn))
print("Precision example vs. LgR = " + str(P_example_vs_lgr))
print("Precision example vs. PCA LgR = " + str(P_example_vs_pca_lgr))
print("Precision example vs. OLS = " + str(P_example_vs_ols))

## <a id='predsummary'>5.10. Prediction summary</a>

In [None]:
relevant_metrics_pred = pd.DataFrame({
    'Model': ['LinSVC', 'RBFSVC', 'Decision Tree', 'Random Forest', 'K-Nearest-Neighbours', 'Logistic Regression', 'PCA Logistic Regression', 'OLS'],
    'Accuracy, A': [A_example_vs_linsvc, A_example_vs_rbfsvc, A_example_vs_dt, A_example_vs_rf, A_example_vs_knn, A_example_vs_lgr, A_example_vs_pca_lgr, A_example_vs_ols],
    'Recall, R': [R_example_vs_linsvc, R_example_vs_rbfsvc, R_example_vs_dt, R_example_vs_rf, R_example_vs_knn, R_example_vs_lgr, R_example_vs_pca_lgr, R_example_vs_ols],
    'Precision, P': [P_example_vs_linsvc, P_example_vs_rbfsvc, P_example_vs_dt, P_example_vs_rf, P_example_vs_knn, P_example_vs_lgr, P_example_vs_pca_lgr, P_example_vs_ols]})
best_model_pred = relevant_metrics_pred.sort_values(by='Accuracy, A', ascending=False)
best_model_pred

# <a id='submit'>6. Submitting the data</a>

In [None]:
d = {}
d['PassengerId']=df_test['PassengerId']
d['Survived']=df_test['Survived_rbfsvc_test']

df_rbfsvc_submission = pd.DataFrame(d)

df_rbfsvc_submission.to_csv('titanic_data_submission_RBFSVC_new.csv', index = False, header=True)
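
Before uploading, it can be worth checking that the submission table has the expected shape and labels. The helper `check_submission` below is hypothetical and not part of the notebook; it is a minimal sketch that assumes the submission consists of exactly the two columns `PassengerId` and `Survived` with integer labels 0 and 1.

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Hypothetical helper: basic sanity checks on a submission table."""
    assert list(df.columns) == ["PassengerId", "Survived"], "unexpected columns"
    assert len(df) == expected_rows, "unexpected number of rows"
    assert df["PassengerId"].is_unique, "duplicate passenger ids"
    assert df["Survived"].isin([0, 1]).all(), "labels must be 0 or 1"
    return True

# Toy table (made up); float predictions are cast to int before checking
toy = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0.0, 1.0, 0.0]})
toy["Survived"] = toy["Survived"].astype(int)
ok = check_submission(toy, expected_rows=3)
print(ok)
```

In the notebook the call would take `df_rbfsvc_submission` and the number of rows in `df_test`.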