<img src="https://datasciencecampus.ons.gov.uk/wp-content/uploads/sites/10/2017/03/data-science-campus-logo-new.svg"
             alt="ONS Data Science Campus Logo"
             width = "240"
             style="margin: 0px 60px"
             />

# 2.0 Model Selection

Purpose of this notebook: compare logistic regression (logreg) and k-nearest neighbours (KNN) classifiers on titanic_train.


In [None]:
# import libraries
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt


In [None]:
# import cached data from titanic_EDA.py
titanic_train = pd.read_pickle('../../cache/titanic_train.pkl')


## Preprocessing

In [None]:
# Define functions to preprocess target & features

def preprocess_target(df):
    # extract the target variable as a NumPy array
    target = df['Survived'].values
    return target

def preprocess_features(df):
    # extract the feature columns by dropping the target
    features = df.drop('Survived', axis=1)
    # remove features that cannot be converted to float: name, ticket & cabin
    features = features.drop(['Name', 'Ticket', 'Cabin'], axis=1)
    # dummy encoding of any remaining categorical data
    features = pd.get_dummies(features, drop_first=True)
    # ensure np.nan is used for any missing values stored as the string 'nan'
    features = features.replace('nan', np.nan)
    return features
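
To illustrate what the dummy encoding step does, here is a minimal sketch on a tiny hand-made frame. The values in `toy` are made up for illustration and are not taken from titanic_train.

```python
import pandas as pd

# Hypothetical toy frame, not the real titanic_train data
toy = pd.DataFrame({
    'Age': [22.0, None, 35.0],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# drop_first=True drops the first level of each categorical column,
# so a column with k categories becomes k-1 indicator columns
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
# numeric columns pass through untouched, so the missing Age is still NaN
print(encoded['Age'].isna().sum())
```

Dropping the first level avoids a redundant column: for `Sex`, a single indicator already carries all the information.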

A pipeline is needed so that the best model and its best parameters can be selected on a like-for-like basis. Each model follows this workflow:

* Train-Test-Split
* Instantiate
* Fit
* Predict
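
The four steps above can be sketched end to end as follows. This is an illustration only: it uses synthetic data from `make_classification` rather than titanic_train, and the `_demo` names are hypothetical, chosen so that they do not overwrite anything in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# 1. Train-test split: hold out 25% of the rows
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# 2. Instantiate the estimator
model_demo = LogisticRegression()

# 3. Fit on the training rows only
model_demo.fit(X_tr_demo, y_tr_demo)

# 4. Predict labels for the held-out rows
y_pred_demo = model_demo.predict(X_te_demo)
print(y_pred_demo.shape)
```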


Prepare the data for logistic regression.

In [None]:
# preprocess target from titanic_train
target = preprocess_target(titanic_train)
#preprocess features from titanic_train
features = preprocess_features(titanic_train)



## train_test_split

In [None]:
# X == features. y == target. Use 25% of data as 'hold out' data. Use a random state of 36.
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=36)




## Instantiate

In [None]:
# impute the column median for any NaNs (e.g. in the Age column)
imp = SimpleImputer(missing_values=np.nan, strategy='median')

# instantiate classifier
logreg = LogisticRegression()

steps = [
    # impute medians on NaNs
    ('imputation', imp),
    # scale features
    ('scaler', StandardScaler()),
    # fit logreg classifier
    ('logistic_regression', logreg)]

# establish pipeline
pipeline = Pipeline(steps)
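
One reason for putting the imputer inside the pipeline is that the median is learned from whatever data is passed to `fit`, and is then reused at predict time. A minimal sketch of this, on a hypothetical one-column toy array rather than the Titanic features:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature with one missing value
X_fit = np.array([[1.0], [2.0], [np.nan], [9.0]])
y_fit = np.array([0, 0, 1, 1])

demo_pipe = Pipeline([
    ('imputation', SimpleImputer(missing_values=np.nan, strategy='median')),
    ('scaler', StandardScaler()),
    ('logistic_regression', LogisticRegression()),
])
demo_pipe.fit(X_fit, y_fit)

# the median is computed from the non-missing training values 1, 2 and 9
learned_median = demo_pipe.named_steps['imputation'].statistics_[0]
print(learned_median)

# at predict time the stored median fills the gap; nothing is re-learned
demo_pred = demo_pipe.predict(np.array([[np.nan]]))
print(demo_pred)
```

Because the statistics come from the training rows alone, the held-out data cannot leak into the preprocessing.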



## Train model

In [None]:
pipeline.fit(X_train, y_train)


## Predict labels

In [None]:
y_pred = pipeline.predict(X_test)



## Review performance

In [None]:
pipeline.score(X_train, y_train)

In [None]:
pipeline.score(X_test, y_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
# print a classification report of the model's performance passing the true labels and the predicted labels
# as arguments in that order

print(classification_report(y_test, y_pred))


Precision is 9 percentage points lower in the survived (1) class than in the died (0) class. High precision means few false positives, so the model makes fewer false positive errors when it assigns class 0 than when it assigns class 1 (assigning survived when the passenger in fact died).

Recall (the true positive rate, i.e. 1 minus the false negative rate, where a false negative for class 1 is assigning died when in truth the passenger survived) is 5 percentage points higher in the 0 class.

f1, the harmonic mean of precision and recall, is 7 percentage points higher for the 0 (died) class.

The support column shows that 133 rows of the true response in the test sample fall within the 0 (died) category, versus 90 rows in the 1 (survived) category.

Overall, this model appears to be considerably better at predicting when
people died than when they survived.
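
As a reminder of how the figures in the classification report relate to the confusion matrix, here is a small worked sketch. The labels are hypothetical and chosen only to keep the arithmetic easy to follow; they are not the model's output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels for illustration only (1 = survived, 0 = died)
y_true_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

# for binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)   # share of predicted 1s that are truly 1
recall = tp / (tp + fn)      # share of true 1s that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, f1)

# the hand calculations agree with the scikit-learn helpers for class 1
print(precision_score(y_true_toy, y_pred_toy),
      recall_score(y_true_toy, y_pred_toy),
      f1_score(y_true_toy, y_pred_toy))
```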

## Receiver Operating Characteristic (ROC) Curve

In [None]:
# predict the probability of a sample being in a particular class
y_pred_prob = pipeline.predict_proba(X_test)[:,1]
# calculate the false positive rate, true positive rate and thresholds using roc_curve()
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# plot the diagonal reference line of a random classifier
plt.plot([0, 1], [0, 1], 'k--')
# plot the ROC curve for the model
plt.plot(fpr, tpr, label='Logistic Regression')
# x axis label
plt.xlabel('False Positive Rate')
# y axis label
plt.ylabel('True Positive Rate')
# main title
plt.title('Logistic Regression ROC Curve (titanic_train)')
# show plot
plt.show()


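To my eye, the optimum looks to lie at around 0.8. The ROC axes show rates rather than probability thresholds, so the threshold for that point has to be read from `thresholds`, not from the plot. This curve can then be compared visually against further ROC curves. What is the area under the curve?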
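
Rather than judging the optimum by eye, one common option is to pick the threshold that maximises Youden's J statistic (true positive rate minus false positive rate). This is a sketch of that criterion on hypothetical scores; it is one possible choice and not necessarily the rule intended above.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and predicted probabilities, for illustration only
y_true_roc = np.array([1, 1, 1, 0, 0, 1, 0, 0])
scores_roc = np.array([0.9, 0.8, 0.6, 0.55, 0.35, 0.3, 0.2, 0.1])

fpr_roc, tpr_roc, thr_roc = roc_curve(y_true_roc, scores_roc)

# Youden's J: vertical distance between the ROC curve and the diagonal
j_stat = tpr_roc - fpr_roc
best = np.argmax(j_stat)

print('threshold:', thr_roc[best])
print('TPR:', tpr_roc[best], 'FPR:', fpr_roc[best])
```

Other criteria (for example weighting false positives and false negatives by cost) would generally give a different threshold.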

In [None]:
roc_auc_score(y_test, y_pred_prob)

An AUC of 0.5 is what a model assigning classes at random would achieve. Here we have an untuned AUC of 0.879, which can be used as a baseline for comparing further models. A precision-recall curve was not pursued as the class imbalance is not high.
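
The 0.5 baseline can be checked with a quick simulation. The labels and scores below are randomly generated and purely illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(36)

# random binary labels and scores that carry no information about them
labels_sim = rng.integers(0, 2, size=10_000)
scores_sim = rng.random(10_000)

auc_random = roc_auc_score(labels_sim, scores_sim)
# scoring each row with its own label gives a perfect ranking
auc_perfect = roc_auc_score(labels_sim, labels_sim)

print(round(auc_random, 3), auc_perfect)
```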

In [None]:
# tidy up
del fpr, logreg, pipeline, steps, thresholds, tpr, y_pred, y_pred_prob

***

## K-Nearest Neighbours

In [None]:
steps = [
    # impute median for any NaNs
    ('imputation', imp),
    # scale features
    ('scaler', StandardScaler()),
    # specify the knn classifier for the 'knn' step, with k=5 neighbours
    ('knn', KNeighborsClassifier(n_neighbors=5))]

# establish a pipeline for the above steps
pipeline = Pipeline(steps)




## Train model

In [None]:
pipeline.fit(X_train, y_train)



## Predict labels

In [None]:
# Predict the labels for the test features using the KNN pipeline you have created
y_pred = pipeline.predict(X_test)




## Review performance

In [None]:
pipeline.score(X_train, y_train)

In [None]:
# As above, calculate accuracy of the classifier, but this time on the test data
pipeline.score(X_test, y_test)


In [None]:
print(confusion_matrix(y_test, y_pred))

The true positive count is marginally higher than in logreg and the true negative count is identical.

In [None]:
print(classification_report(y_test, y_pred))

Precision is still lower within the survived category; however, the gap between the classes has narrowed from 9 to 8 percentage points compared with the logreg model. Averages are up
marginally over logreg.

Recall (the true positive rate, i.e. 1 minus the false negative rate, where a false negative is assigning died when in truth the passenger survived) within the
survived class is 2 percentage points higher than in logreg.

The gap between classes in f1, the harmonic mean of precision and recall, is similarly reduced. f1 has improved marginally
over logreg in both classes and on average.

The support output is unaffected, as it depends only on the true labels in the test set.

KNN appears to marginally outperform logreg. The next step is to compare whether
titanic_engineered adds any value.
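
A marginal difference on a single train-test split can be down to the split itself. One way to check is k-fold cross-validation, sketched here on synthetic data from `make_classification`. The data, the `_cv` names and the choice of 5 folds are assumptions for illustration, not results from titanic_train.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic data)
X_cv, y_cv = make_classification(n_samples=300, n_features=6, random_state=36)

candidates = {
    'logreg': LogisticRegression(),
    'knn': KNeighborsClassifier(n_neighbors=5),
}

cv_scores = {}
for name, clf in candidates.items():
    # scaling sits inside the pipeline so it is refit on each training fold
    pipe_cv = Pipeline([('scaler', StandardScaler()), (name, clf)])
    cv_scores[name] = cross_val_score(pipe_cv, X_cv, y_cv, cv=5)
    print(name, round(cv_scores[name].mean(), 3), round(cv_scores[name].std(), 3))
```

If the difference in mean accuracy is small relative to the spread across folds, the two models are hard to separate on accuracy alone.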

## Receiver Operating Characteristic (ROC) Curve

In [None]:
# predict the probability of a sample being in a particular class
y_pred_prob = pipeline.predict_proba(X_test)[:,1]
# unpack roc curve objects
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# plot an ROC curve using matplotlib below:
# plot the diagonal reference line of a random classifier
plt.plot([0, 1], [0, 1], 'k--')
# plot the ROC curve for the KNN model
plt.plot(fpr, tpr, label='KNN')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('KNN ROC Curve (titanic_train)')
plt.show()


The curve looks very similar to that of the logreg model, so I expect the AUC to be very similar too.

In [None]:
roc_auc_score(y_test, y_pred_prob)


An AUC of 0.5 is what a model assigning classes at random would achieve. Here we have an untuned AUC of 0.872, which is down on the logreg AUC by 0.007.
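
Both AUC figures are untuned. A possible next step for KNN is a grid search over the number of neighbours, scored by AUC. This is a minimal sketch on synthetic data; the grid of k values, the fold count and the `_gs` names are assumptions, not settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic data)
X_gs, y_gs = make_classification(n_samples=300, n_features=6, random_state=36)

pipe_gs = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# pipeline parameters are addressed as <step name>__<parameter name>
k_values = [3, 5, 7, 9, 11]
grid = GridSearchCV(pipe_gs,
                    param_grid={'knn__n_neighbors': k_values},
                    scoring='roc_auc',
                    cv=5)
grid.fit(X_gs, y_gs)

print(grid.best_params_, round(grid.best_score_, 3))
```

The same pattern would apply to logreg, for example by searching over `logistic_regression__C`, so that both models are compared after tuning.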