<img src="https://datasciencecampus.ons.gov.uk/wp-content/uploads/sites/10/2017/03/data-science-campus-logo-new.svg"
             alt="ONS Data Science Campus Logo"
             width = "240"
             style="margin: 0px 60px"
             />

# 3.0 Source or Engineered Data?

Purpose of script: compare logistic regression (logreg) vs k-nearest neighbours (KNN) on titanic_engineered, then test whether dropping the Age column helps.



In [None]:
# import libraries
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

In [None]:
# import cached data from titanic_EDA.ipynb
titanic_engineered = pd.read_pickle('../../cache/titanic_engineered.pkl')


## 3.1 Preprocessing

In [None]:
# Define preprocessing functions
def preprocess_target(df):
    # extract the target variable as a NumPy array
    target = df['Survived'].values
    # ensure the required object is returned
    return target

def preprocess_features(df):
    # extract the features by dropping the target column
    features = df.drop('Survived', axis=1)
    # remove features that cannot be converted to float: name, ticket & cabin
    features = features.drop(['Name', 'Ticket', 'Cabin'], axis=1)
    # dummy encoding of any remaining categorical data
    features = pd.get_dummies(features, drop_first=True)
    # ensure np.nan is used in place of any 'nan' strings
    features = features.replace('nan', np.nan)
    # ensure the required object is returned
    return features


In [None]:
# preprocess target from titanic_engineered
target = preprocess_target(titanic_engineered)
# preprocess features from titanic_engineered
features = preprocess_features(titanic_engineered)

***

## 3.2 Train Test Split

In [None]:
# test set of 25 % as in previous models
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=36)

***

## 3.3 Instantiate

In [None]:
#impute median for NaNs in age column
imp = SimpleImputer(missing_values=np.nan, strategy='median')

# instantiate classifier
logreg = LogisticRegression()

steps = [('imputation', imp),
         ('scaler', StandardScaler()),
         ('logistic_regression', logreg)]

# establish pipeline
pipeline = Pipeline(steps)
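
Placing the imputer and scaler inside the pipeline means their statistics are learned from the training split when `fit` is called and then reused unchanged on the test split. A minimal plain-Python sketch of that idea for median imputation follows. The ages are made up, and `fit_median` and `apply_median` are hypothetical helpers for illustration, not scikit-learn functions.

```python
from statistics import median

def fit_median(values):
    """Learn the fill value from the non-missing training values only."""
    observed = [v for v in values if v is not None]
    return median(observed)

def apply_median(values, fill_value):
    """Replace missing entries (None) with the previously learned fill value."""
    return [fill_value if v is None else v for v in values]

# illustrative ages, not taken from the Titanic data
train_age = [22.0, None, 38.0, 26.0, None, 54.0]
test_age = [None, 30.0, 2.0]

fill = fit_median(train_age)                  # median of 22, 26, 38, 54
train_filled = apply_median(train_age, fill)
test_filled = apply_median(test_age, fill)    # test rows reuse the training median

print(fill)
print(test_filled)
```

The point of the design is that nothing about the test rows influences the fill value, which keeps the test score an honest estimate.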



## Train model

In [None]:
pipeline.fit(X_train, y_train)


## Predict labels

In [None]:
y_pred = pipeline.predict(X_test)
print(y_pred)


## Review

In [None]:
pipeline.score(X_train, y_train)

Down from 0.7934131736526946 in non-engineered df

In [None]:
pipeline.score(X_test, y_test)

Up from 0.8116591928251121 in non engineered df

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
# print a report of performance metrics, passing the true test labels and predicted labels as arguments in that order
print(classification_report(y_test, y_pred))


Precision is 10 % lower in the survived category. High precision means few
false positives among the rows assigned to a class, so this model makes
proportionally fewer false positive errors when the class assigned is 0 (died)
than when it is 1 (survived, i.e. assigning survived when in fact died).

Recall (the true positive rate, or 1 minus the false negative rate, where a
false negative for the survived class is assigning died but in truth survived)
is largely comparable across both classes.

The harmonic mean of precision and recall, F1, is 6 % higher for class 0
(died) than for class 1 (survived).

The support column shows that 133 rows of the true test labels fall within the
0 (died) category, versus 90 rows in survived.

Overall, it appears that this model is considerably better at predicting
deaths than survivals.

After comparison of the two datasets and logreg vs KNN, this model and dataset
combination yields the highest performance metrics across the board.

Ultimately, this is the model selected to take forward. Compared to the other
models it had the highest accuracy and F1 scores. I believe these to be the
more important metrics, given the minor class imbalance and no preference
between TP and FN rates. The next closest model was KNN on titanic_train,
which had equivalent accuracy and F1 scores, but both precision and recall
were more balanced in this model.
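
To make the metric definitions in this review concrete, here is a small sketch that derives precision, recall and F1 for one class from confusion-matrix counts. The counts are hypothetical, chosen for easy arithmetic, and are not this model's results. `precision_recall_f1` is an illustrative helper, not a library function.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for one class treated as the positive class."""
    precision = tp / (tp + fp)   # share of positive predictions that were correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# hypothetical counts for the survived class
tp, fp, fn = 60, 20, 30

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
false_negative_rate = fn / (fn + tp)   # recall and FNR always sum to 1

print(round(precision, 3), round(recall, 3), round(f1, 3))
print(round(recall + false_negative_rate, 3))
```

Note that recall is the complement of the false negative rate, while precision depends on false positives only.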


## Receiver Operating Characteristic (ROC) Curve

In [None]:
# predict the probability of a sample being in a particular class
y_pred_prob = pipeline.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve (titanic_engineered)')
plt.show()


I can see that the ROC curve has changed shape compared with the one plotted for titanic_train, particularly in the p=0.8 to p=0.6 range. What is the effect on AUC?

In [None]:
roc_auc_score(y_test, y_pred_prob)

Down from 0.8787802840434419. The AUC here has declined by 0.004 from the logreg model on titanic_train, indicating a slight decline in untuned model performance.
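
As a reminder of what the AUC figure measures: it equals the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. The sketch below computes it that way in plain Python on made-up scores. `auc_from_scores` is a hypothetical helper written for illustration; the notebook itself relies on `roc_auc_score`.

```python
def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# illustrative labels and predicted probabilities, not the notebook's output
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

auc = auc_from_scores(y_true, scores)
print(round(auc, 3))
```

Read this way, a change of 0.004 means only a very small share of positive/negative pairs swapped order.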

In [None]:
# tidy up
del fpr, logreg, pipeline, steps, thresholds, tpr, y_pred, y_pred_prob

***

## K-Nearest Neighbours

## Instantiate

In [None]:
steps = [('imputation', imp),
         ('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier(n_neighbors=5))]

# establish pipeline
pipeline = Pipeline(steps)

## Train

In [None]:
# Train the pipeline model

pipeline.fit(X_train, y_train)


## Predict labels 

In [None]:
y_pred = pipeline.predict(X_test)
print(y_pred)

## Review

In [None]:
pipeline.score(X_train, y_train)

Down from 0.8697604790419161 in non engineered

In [None]:
pipeline.score(X_test, y_test)

Up from 0.7937219730941704 in non engineered

In [None]:
print(confusion_matrix(y_test, y_pred))

True positive rate similar to logreg but true negative count is higher.
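
For binary labels, `confusion_matrix` puts the true classes on rows and the predicted classes on columns, giving the layout `[[TN, FP], [FN, TP]]`. The sketch below rebuilds that layout in plain Python on made-up labels and derives the true positive and true negative rates. `binary_confusion` is a hypothetical helper, not a library function.

```python
def binary_confusion(y_true, y_pred):
    """Return (tn, fp, fn, tp) for labels coded 0 (died) and 1 (survived)."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

# illustrative labels only, not the notebook's predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = binary_confusion(y_true, y_pred)
print([[tn, fp], [fn, tp]])

tpr = tp / (tp + fn)   # true positive rate (recall for survived)
tnr = tn / (tn + fp)   # true negative rate (recall for died)
print(round(tpr, 3), round(tnr, 3))
```

Comparing rates rather than raw counts keeps the comparison between models consistent.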


In [None]:
print(classification_report(y_test, y_pred))

Precision is still lower within the survived category; however, the gap
between classes has narrowed from 10 % in the logreg model to 7 % here.
Crucially though, precision is down overall, with a macro average reduction
of 2 %.

Recall (the true positive rate; a false negative here is assigning died but in
truth survived) within the survived class is 4 % lower than in logreg.

The KNN model appears to perform poorly against the logreg model in terms of
both precision and recall. The harmonic mean of these, F1, is similarly
reduced by 1 to 2 %, depending on whether the macro or weighted average is
favoured.

Support output is unaffected, as the test split is the same.
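
The macro and weighted averages in the classification report differ only in how they treat class size. The sketch below shows the arithmetic using the supports quoted earlier (133 died, 90 survived). The per-class F1 values are hypothetical placeholders, not this model's scores, and `macro_and_weighted` is an illustrative helper.

```python
def macro_and_weighted(values, supports):
    """Macro average ignores class size; weighted average weights by support."""
    macro = sum(values) / len(values)
    weighted = sum(v * s for v, s in zip(values, supports)) / sum(supports)
    return macro, weighted

# supports are the test-set class sizes quoted above;
# the per-class F1 values are hypothetical
f1_per_class = [0.80, 0.70]   # [died, survived]
supports = [133, 90]

macro, weighted = macro_and_weighted(f1_per_class, supports)
print(round(macro, 3), round(weighted, 3))
```

Because the died class is larger and (in this illustration) better predicted, the weighted average sits above the macro average.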

Interestingly, in terms of precision and recall, KNN appears to be
outperformed by logreg. This is also true of accuracy in training. But
(and possibly an important point) accuracy against the test set improved over
logreg by ~2.7 %. This could indicate a degree of overfitting within the
untuned logreg model.

The decision above stands: take titanic_engineered forward with the logreg
model. However, one question still persists: what effect (if any) did
imputing age with the median have on the performance metrics?

## Receiver Operating Characteristic (ROC) Curve

In [None]:
# predict the probability of a sample being in a particular class
y_pred_prob = pipeline.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='KNN')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('KNN ROC Curve (titanic_engineered)')
plt.show()

To my eye this looks identical to the knn on titanic_train

In [None]:
# quantify the area under the ROC curve
roc_auc_score(y_test, y_pred_prob)


Down from 0.8715538847117794. This is a marked decline in model performance. It is possible that the engineered features merely duplicate relationships already encoded within the titanic_train data, and I assume KNN would be affected by that duplication. AUC has declined by ~0.01, roughly 2.5 times the ~0.004 decline seen for logreg above.
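
The assumption that KNN is sensitive to duplicated information can be illustrated with a toy example: adding a feature that copies an existing one gives that feature extra weight in the Euclidean distance, which can change which neighbour is nearest. The points below are made up and the helper names are hypothetical. This is a sketch of the mechanism, not evidence that it happened here.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, candidates):
    """Name of the candidate closest to the query."""
    return min(candidates, key=lambda name: euclidean(query, candidates[name]))

def with_duplicate(point):
    """Append a copy of the first feature, mimicking a redundant engineered feature."""
    return (point[0], point[1], point[0])

# illustrative, already-scaled features
query = (0.0, 0.0)
candidates = {"A": (1.0, 0.0), "B": (0.0, 1.2)}
print(nearest(query, candidates))           # A (distance 1.0 vs 1.2)

query_dup = with_duplicate(query)
candidates_dup = {name: with_duplicate(p) for name, p in candidates.items()}
print(nearest(query_dup, candidates_dup))   # B (distance 1.2 vs sqrt(2))
```

Logistic regression can absorb such redundancy by splitting the coefficient between the copies, which may be one reason it was affected less.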

In [None]:
# clean up
del fpr, pipeline, steps, thresholds, tpr, y_pred, y_pred_prob

***

## titanic_no_age

Rather than imputing median values for the age column, would it be advantageous to simply remove the age column altogether?

In [None]:
# remove the Age column
titanic_no_age = titanic_engineered.drop('Age', axis=1)
titanic_no_age.head(3)


## Preprocess data

In [None]:
# preprocess target from titanic_no_age
target = preprocess_target(titanic_no_age)
# preprocess features from titanic_no_age
features = preprocess_features(titanic_no_age)

## Train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=36)

## Instantiate

In [None]:
# impute median for any remaining NaNs (the Age column has been dropped)
imp = SimpleImputer(missing_values=np.nan, strategy='median')

# instantiate classifier
logreg = LogisticRegression()

steps = [('imputation', imp),
         ('scaler', StandardScaler()),
         ('logistic_regression', logreg)]

# establish pipeline
pipeline = Pipeline(steps)

## Train model

In [None]:
pipeline.fit(X_train, y_train)

## Predict labels

In [None]:
y_pred = pipeline.predict(X_test)

## Review

In [None]:
pipeline.score(X_train, y_train)

Down from 0.7919161676646707 

In [None]:
pipeline.score(X_test, y_test)

Down from 0.8340807174887892

In [None]:
# print a confusion matrix, passing the true test labels and predicted labels as arguments in that order
print(confusion_matrix(y_test, y_pred))


TP count is the same but TN count is reduced.

In [None]:
print(classification_report(y_test, y_pred))

Precision, recall, F1 and accuracy are down across the board in each class and
on average. Deletion of the age column is clearly not recommended.

## Receiver Operating Characteristic (ROC) Curve

In [None]:
# predict the probability of a sample being in a particular class
y_pred_prob = pipeline.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve (titanic_no_age)')
plt.show()

To my eye this looks largely similar to the logreg on titanic_engineered ROC

In [None]:
roc_auc_score(y_test, y_pred_prob)

Up from 0.8740183792815372. AUC has improved by ~0.01, a marked improvement over logreg on titanic_engineered. How can this be when the classification report showed performance down across the board?
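
One possible explanation is that the two summaries measure different things. AUC scores how well the predicted probabilities rank survivors above non-survivors across all thresholds, whereas the classification report is computed from `predict`, which for this classifier corresponds to a single 0.5 probability cut-off. The sketch below uses made-up scores for two hypothetical models to show that a higher AUC can coexist with lower accuracy at 0.5. It illustrates the mechanism only and does not diagnose this notebook's models.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in pos for n in neg)
    return correct / (len(pos) * len(neg))

def accuracy_at(y_true, scores, threshold=0.5):
    """Accuracy after converting scores to labels at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

# illustrative data: model A ranks perfectly but all scores sit below 0.5
y_true = [0, 0, 0, 1, 1, 1]
scores_a = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45]
scores_b = [0.20, 0.30, 0.60, 0.40, 0.70, 0.80]

for name, scores in [("A", scores_a), ("B", scores_b)]:
    print(name,
          round(pairwise_auc(y_true, scores), 3),
          round(accuracy_at(y_true, scores), 3))

# moving the threshold recovers model A's ranking quality
print(accuracy_at(y_true, scores_a, threshold=0.325))
```

If this is what is happening, comparing the models at a tuned threshold, rather than the default, would be a fairer test of whether dropping Age hurts.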