<img src="https://datasciencecampus.ons.gov.uk/wp-content/uploads/sites/10/2017/03/data-science-campus-logo-new.svg"
             alt="ONS Data Science Campus Logo"
             width = "240"
             style="margin: 0px 60px"
             />

# 4.0 Tuning the Selected Model

Purpose of script: tune the logistic regression model (logreg) on the titanic_engineered dataset.



In [None]:
# import necessary libraries
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV


In [None]:
# import cached data from titanic_EDA.py
titanic_engineered = pd.read_pickle('../../cache/titanic_engineered.pkl')


## Preprocessing

In [None]:
# define processing functions
def preprocess_target(df):
    # create an array for the target variable
    target = df['Survived'].values
    return target

def preprocess_features(df):
    # extract the features DataFrame
    features = df.drop('Survived', axis=1)
    # remove features that cannot be converted to float: name, ticket & cabin
    features = features.drop(['Name', 'Ticket', 'Cabin'], axis=1)
    # dummy encoding of any remaining categorical data
    features = pd.get_dummies(features, drop_first=True)
    # ensure np.nan used to replace missing values
    features = features.replace('nan', np.nan)
    return features

In [None]:
# preprocess target from titanic_engineered
target = preprocess_target(titanic_engineered)
# preprocess features from titanic_engineered
features = preprocess_features(titanic_engineered)
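The preprocessing steps above can be tried on a small frame first. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and only mimic a few of the Titanic column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_engineered
toy = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Name': ['A', 'B', 'C'],
    'Age': [22.0, np.nan, 35.0],
    'Sex': ['male', 'female', 'female'],
})

# target as a NumPy array
toy_target = toy['Survived'].values

# drop the target and the free-text column, then dummy encode what is left
toy_features = toy.drop(['Survived', 'Name'], axis=1)
toy_features = pd.get_dummies(toy_features, drop_first=True)

print(toy_target)
print(toy_features.columns.tolist())
```

With `drop_first=True` the first level of each categorical column is dropped, so the two-level `Sex` column becomes a single `Sex_male` indicator. The missing `Age` value stays as NaN for the imputer to handle later.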


## Train test split

In [None]:
# unpack the necessary test and train sets using a test size of 25 % and a random state of 36
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=36)
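As a rough check of what `test_size` and `random_state` do, the sketch below splits a made-up array of 100 rows. The names `X_toy` and `y_toy` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features, alternating binary labels
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

# same arguments as the notebook: 25 % test size, random state of 36
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=36)

# repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=36)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` is what makes the scores in the Review sections comparable between runs.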


## Instantiate

In [None]:
# impute the column median for NaNs (applied to every feature column, not only age)
imp = SimpleImputer(missing_values=np.nan, strategy='median')

# instantiate classifier
logreg = LogisticRegression()

# create a list called steps, each step should be a tuple
# required steps are 'imputation', 'scaler', 'logistic_regression'
steps = [('imputation', imp),
         ('scaler', StandardScaler()),
         ('logistic_regression', logreg)]

# establish pipeline
pipeline = Pipeline(steps)
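The step names chosen here matter later on, because a pipeline exposes each step's parameters as `<step name>__<parameter name>`. A minimal sketch, using a separate `demo_pipeline` so the notebook's own `pipeline` is left untouched:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# same three step names as the notebook's pipeline
demo_pipeline = Pipeline([
    ('imputation', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('logistic_regression', LogisticRegression()),
])

# list the parameters that belong to the classifier step
logreg_params = sorted(
    name for name in demo_pipeline.get_params()
    if name.startswith('logistic_regression__')
)
print(logreg_params)

# the same double-underscore names can be used to set a parameter
demo_pipeline.set_params(logistic_regression__C=0.5)
print(demo_pipeline.named_steps['logistic_regression'].C)
```

These prefixed names are the keys a hyperparameter grid has to use when the whole pipeline is passed to a search object.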


## Train model

In [None]:
# How do you fit the model?
pipeline.fit(X_train, y_train)


## Predict labels

In [None]:
# Can you predict the labels of the test set?
y_pred = pipeline.predict(X_test)


## Review

In [None]:
pipeline.score(X_train, y_train)

Training accuracy is down from 0.7934131736526946 on the non-engineered DataFrame.

In [None]:
pipeline.score(X_test, y_test)

Test accuracy is up from 0.8116591928251121 on the non-engineered DataFrame.
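For a classifier pipeline, `score` reports mean accuracy. The sketch below checks this on synthetic data from `make_classification`; the data and the `demo` pipeline are stand-ins, not the Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=36)

demo = Pipeline([
    ('scaler', StandardScaler()),
    ('logistic_regression', LogisticRegression()),
])
demo.fit(X_tr, y_tr)

# score() on a classifier pipeline and accuracy_score on its predictions agree
score = demo.score(X_te, y_te)
manual = accuracy_score(y_te, demo.predict(X_te))
print(score, manual)
```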

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
print(classification_report(y_test, y_pred))

Precision is 10% lower in the survived category (class 1). High precision
means few false positives among the rows assigned to a class. This model
performs 10% better with respect to false positives when the class assigned
is 0 (died) than when it is 1 (assigning survived when the passenger in fact
died).

Recall (the complement of the false negative rate, where a false negative for
class 1 is assigning died when the passenger in truth survived) is largely
comparable across both classes.
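To make the precision and recall definitions concrete, here is a minimal sketch that derives both from the confusion matrix counts. The label arrays are made up for illustration and are not the Titanic predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 0 = died, 1 = survived
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

# rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# class 1 (survived): false positives are died passengers predicted as survived
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# class 0 (died): the roles of the off-diagonal counts swap
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(precision_1, recall_1)
print(precision_0, recall_0)
```

The per-class rows of `classification_report` are these same ratios, computed once with each class treated as the positive label.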

The harmonic mean of precision and recall, the f1 score, is 6 percent higher
for class 0 (died).
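The harmonic mean can be written out directly. In this sketch `harmonic_f1` is a hypothetical helper and the labels are illustrative; the result is compared with scikit-learn's `f1_score` for each class.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 0 = died, 1 = survived
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

def harmonic_f1(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

f1_by_class = {}
for label in (0, 1):
    p = precision_score(y_true, y_hat, pos_label=label)
    r = recall_score(y_true, y_hat, pos_label=label)
    f1_by_class[label] = harmonic_f1(p, r)

print(f1_by_class)
```

Because the harmonic mean is pulled towards the smaller of the two values, a class only gets a high f1 when both its precision and its recall are high.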

The true responses sampled in the test set are split unevenly: 133 rows fall
within the 0 (died) category versus 90 rows in the 1 (survived) category.

Overall, it appears that this model is considerably better at predicting when
people died than when they survived.

After comparing the two datasets and logreg against knn, this combination of
model and dataset yields the highest performance metrics across the board.

## Tuning

In [None]:
# specify the hyperparameter space
parameters = [
    {'logistic_regression__C': np.logspace(-1, 1, 20),
     'logistic_regression__penalty': ['l2'],
     'logistic_regression__solver': ['lbfgs'],
     'logistic_regression__max_iter': [50, 100, 150, 200]}
]

# instantiate the gridsearch object with 5 fold cross validation 
cv = GridSearchCV(pipeline, param_grid=parameters, cv=5)
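It can help to know how much work a grid like this asks for before fitting it. The sketch below rebuilds the same grid under the illustrative name `grid` and counts its candidates with `ParameterGrid`.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# same hyperparameter space as above, under an illustrative name
grid = [{
    'logistic_regression__C': np.logspace(-1, 1, 20),
    'logistic_regression__penalty': ['l2'],
    'logistic_regression__solver': ['lbfgs'],
    'logistic_regression__max_iter': [50, 100, 150, 200],
}]

# every combination of values is one candidate model
n_candidates = len(ParameterGrid(grid))
n_folds = 5

print(n_candidates)            # 20 values of C x 4 values of max_iter = 80
print(n_candidates * n_folds)  # 400 cross-validation fits
```

With the default `refit=True`, the search also fits the best candidate once more on the whole training set after cross validation.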

## Train model

In [None]:
# fit the cross validation model to the training data
cv.fit(X_train, y_train)


## Predict labels

In [None]:
# predict labels of test set
y_pred = cv.predict(X_test)


## Review

In [None]:
print("Accuracy: {}".format(cv.score(X_test, y_test)))

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
print("Tuned model parameters: {}".format(cv.best_params_))
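Beyond the best parameters, the fitted search object keeps the cross-validated score of every candidate in `cv_results_`. A minimal sketch on synthetic data follows; `demo`, `demo_grid` and `demo_cv` are illustrative names and the smaller grid is chosen only to keep the example quick.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo = Pipeline([
    ('scaler', StandardScaler()),
    ('logistic_regression', LogisticRegression()),
])
demo_grid = {'logistic_regression__C': np.logspace(-1, 1, 5)}

demo_cv = GridSearchCV(demo, param_grid=demo_grid, cv=5)
demo_cv.fit(X_demo, y_demo)

# one row per candidate, with mean and spread of the fold scores
results = pd.DataFrame(demo_cv.cv_results_)
cols = ['param_logistic_regression__C', 'mean_test_score',
        'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))

# mean cross-validated accuracy of the best candidate
print(demo_cv.best_score_)
```

Comparing `mean_test_score` across rows shows whether the best setting clearly beats its neighbours or whether the differences sit within the fold-to-fold spread.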