# Modeling Template Notebook - v1

This notebook serves as a template for the standard modelling we will all use as the base for version 1 development.

The goal is for all of us to work from the same predicted probabilities and classes when building the different functions required by the tasks assigned to us.

The notebook should be used as follows:
1. Import all needed packages
2. Load Data
3. Split Train and Test Sets
4. Build and Run model
5. Output predictions

### Import necessary packages

Note: If a `ModuleNotFoundError` is raised, the import cell automatically attempts to install all needed packages. You then have to rerun the import cell.

In [1]:
def install_packages():
    !pip install pandas
    !pip install numpy
    !pip install scikit-learn

In [2]:
try:
    import pandas as pd
    import numpy as np
    from sklearn import datasets
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, f1_score
    print('import successful')
except ModuleNotFoundError:
    print('forcing install of necessary packages. If you see this, rerun this cell to try again')
    install_packages()

import successful


### Load necessary data

The dataset we're starting off with is the standard breast cancer dataset that lends itself directly to binary classification modelling.

In [3]:
X, y = datasets.load_breast_cancer(return_X_y=True)

### Train Test Split

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### Run model

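We're using scikit-learn's default `LogisticRegression` in this case (L2-regularized and fit by minimizing the log loss, not by ordinary least squares), with no data transformation or scaling, to maintain simplicity. Because the features are unscaled, the solver hits its iteration limit and emits the convergence warning shown below the cell.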

In [5]:
clf = LogisticRegression(random_state=0).fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
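
The warning above names two remedies: raising `max_iter` or scaling the data. A minimal sketch of the scaling route follows, assuming we are willing to wrap the model in a pipeline. The names `scaled_clf` and `scaled_accuracy` are illustrative and not part of the template, which deliberately keeps the unscaled model.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same data and split as the template cells above.
X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Hypothetical variant: standardize the features before the regression,
# which typically lets the solver converge in fewer iterations.
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
scaled_clf.fit(X_train, y_train)

scaled_accuracy = scaled_clf.score(X_test, y_test)
print('accuracy with scaling:', scaled_accuracy)
```

Note that a scaled model produces different predicted probabilities, so it should not replace the template model unless everyone switches together.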


### Output Predictions

This section provides the basic outputs, stored as variables, that we will use for function building. They are as follows:
1. predicted_probabilities: the output of `predict_proba`, on which we will perform threshold optimizations. It has two columns: the values at index 0 are the probabilities for class 0 and those at index 1 are the probabilities for class 1
2. sklearn_class_labels: the class label prediction computed by sklearn's `predict` method, which picks the class with the highest predicted probability
3. naive_class_labels: the class label prediction computed by using a naïve 0.5 threshold on the class 1 probability
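
To illustrate how the second and third outputs relate, here is a small sketch on made-up probabilities (the array `toy_probabilities` is an assumption for illustration, not model output). With two classes, picking the most probable class and thresholding the class 1 column at 0.5 give the same labels, which is consistent with the identical scores reported further down.

```python
import numpy as np

# Toy probabilities for four samples; columns are P(class 0) and P(class 1).
toy_probabilities = np.array([[0.9, 0.1],
                              [0.4, 0.6],
                              [0.5, 0.5],
                              [0.2, 0.8]])

# Picking the most probable class (ties resolve to the first column, class 0).
argmax_labels = np.argmax(toy_probabilities, axis=1)

# Naive rule: label 1 only when P(class 1) is strictly above 0.5.
naive_labels = np.where(toy_probabilities[:, 1] > 0.5, 1, 0)

print(argmax_labels)  # [0 1 0 1]
print(naive_labels)   # [0 1 0 1]
```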

In [6]:
predicted_probabilities = clf.predict_proba(X_test)
print('shape of predicted probabilities:', predicted_probabilities.shape)

shape of predicted probabilities: (188, 2)


In [7]:
sklearn_class_labels = clf.predict(X_test)
print('shape of model default class predictions:', sklearn_class_labels.shape)

shape of model default class predictions: (188,)


In [8]:
naive_class_labels = np.where(predicted_probabilities[:,1] > 0.5, 1, 0)
print('shape of naïve class predictions:', naive_class_labels.shape)

shape of naïve class predictions: (188,)


In [9]:
print('Accuracy and F1 score of sklearn predicted classes')
accuracy_score(y_test, sklearn_class_labels), f1_score(y_test, sklearn_class_labels)  # sklearn metrics expect (y_true, y_pred)

Accuracy and F1 score of sklearn predicted classes


(0.9680851063829787, 0.9752066115702479)

In [10]:
print('Accuracy and F1 score of naive predicted classes')
accuracy_score(y_test, naive_class_labels), f1_score(y_test, naive_class_labels)  # sklearn metrics expect (y_true, y_pred)

Accuracy and F1 score of naive predicted classes


(0.9680851063829787, 0.9752066115702479)

## Build Functions

In [33]:
from sklearn.metrics import confusion_matrix, f1_score, recall_score

In [34]:
def generate_search_space(number_of_values: int = 100) -> np.ndarray:
    # Evenly spaced candidate thresholds over [0, 1], both endpoints included.
    return np.linspace(0, 1, number_of_values)
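
One detail of the search space worth seeing in numbers: `np.linspace` includes both endpoints, so the default of 100 values yields a step of 1/99 rather than 0.01. The sketch below compares the two grids; the variable names are illustrative only.

```python
import numpy as np

# 100 points over [0, 1] include both endpoints, so the step is 1/99.
grid_100 = np.linspace(0, 1, 100)
# 101 points give a step of 0.01 (up to floating point error).
grid_101 = np.linspace(0, 1, 101)

print(grid_100[1] - grid_100[0])
print(grid_101[1] - grid_101[0])
```

If round thresholds such as 0.25 or 0.5 matter for reporting, passing `number_of_values=101` is one option.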

In [35]:
def convert_classes(predicted_probabilities: np.ndarray,
                    threshold: float,
                    is_multidimensional: bool,
                    target_class_index: int = None) -> np.ndarray:
    # Label a sample 1 when its probability for the target class meets the threshold.
    if is_multidimensional:
        assert target_class_index is not None
        # A 2-D array (samples x classes) is required to select a class column.
        assert predicted_probabilities.ndim > 1
        return np.where(predicted_probabilities[:, target_class_index] >= threshold, 1, 0)
    return np.where(predicted_probabilities >= threshold, 1, 0)

In [36]:
classes = convert_classes(predicted_probabilities=predicted_probabilities,
                          threshold=0.5, 
                          is_multidimensional=True,
                          target_class_index=1)

In [37]:
def get_best_f1(predicted_probabilities: np.ndarray,
                is_multidimensional: bool,
                search_space: np.ndarray,  # TODO: this should be type unioned with lists
                y_test: np.ndarray,
                target_class_index: int = None):
    f1_scores = []
    for i in search_space:
        classes = convert_classes(predicted_probabilities=predicted_probabilities,
                                  threshold=i, 
                                  is_multidimensional=is_multidimensional,
                                  target_class_index=target_class_index)
        f1_scores.append(f1_score(y_test, classes))  # sklearn metrics expect (y_true, y_pred)
    best_f1_score = max(f1_scores)
    best_f1_index = f1_scores.index(best_f1_score)
    best_f1_threshold = search_space[best_f1_index]
    print(f'best f1 score: {best_f1_score} occurs at threshold {best_f1_threshold}')
    return best_f1_score, best_f1_threshold

In [41]:
def get_best_sensitivity(predicted_probabilities: np.ndarray,
                         is_multidimensional: bool,
                         search_space: np.ndarray,  # TODO: this should be type unioned with lists
                         y_test: np.ndarray,
                         target_class_index: int = None):
    sensitivity_scores = []
    for i in search_space:
        classes = convert_classes(predicted_probabilities=predicted_probabilities,
                                  threshold=i, 
                                  is_multidimensional=is_multidimensional,
                                  target_class_index=target_class_index)
        tn, fp, fn, tp = confusion_matrix(y_test, classes).ravel()
        sensitivity = tp / (tp+fn)
        sensitivity_scores.append(sensitivity)
    print(sensitivity_scores)
    best_sensitivity_score = max(sensitivity_scores)
    best_sensitivity_index = sensitivity_scores.index(best_sensitivity_score)
    best_sensitivity_threshold = search_space[best_sensitivity_index]
    print(f'best sensitivity score: {best_sensitivity_score} occurs at threshold {best_sensitivity_threshold}')
    return best_sensitivity_score, best_sensitivity_threshold

In [39]:
def get_best_specificity(predicted_probabilities: np.ndarray,
                         is_multidimensional: bool,
                         search_space: np.ndarray,  # TODO: this should be type unioned with lists
                         y_test: np.ndarray,
                         target_class_index: int = None):
    specificity_scores = []
    for i in search_space:
        classes = convert_classes(predicted_probabilities=predicted_probabilities,
                                  threshold=i, 
                                  is_multidimensional=is_multidimensional,
                                  target_class_index=target_class_index)
        tn, fp, fn, tp = confusion_matrix(y_test, classes).ravel()
        specificity = tn / (tn+fp)
        specificity_scores.append(specificity) 
    best_specificity_score = max(specificity_scores)
    best_specificity_index = specificity_scores.index(best_specificity_score)
    best_specificity_threshold = search_space[best_specificity_index]
    print(f'best specificity score: {best_specificity_score} occurs at threshold {best_specificity_threshold}')
    return best_specificity_score, best_specificity_threshold
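
The three `get_best_*` functions share the same loop and differ only in the metric. As a possible refactoring, here is a minimal sketch of a single sweep that takes the metric as an argument. The helper `sweep_thresholds`, the NumPy-only `sensitivity` and `specificity` functions, and the toy data are all hypothetical and not part of the notebook.

```python
import numpy as np

def sweep_thresholds(positive_probabilities, y_true, metric, search_space):
    """Return (best_score, best_threshold) for `metric` over `search_space`."""
    scores = []
    for threshold in search_space:
        predicted = np.where(positive_probabilities >= threshold, 1, 0)
        scores.append(metric(y_true, predicted))
    # First maximum wins, matching list.index(max(...)) in the functions above.
    best_index = int(np.argmax(scores))
    return scores[best_index], search_space[best_index]

def sensitivity(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn)

def specificity(y_true, y_pred):
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tn / (tn + fp)

# Made-up labels and class 1 probabilities, for illustration only.
toy_y = np.array([0, 0, 1, 1])
toy_p = np.array([0.1, 0.4, 0.35, 0.8])
grid = np.linspace(0, 1, 5)  # 0.0, 0.25, 0.5, 0.75, 1.0

best_sens = sweep_thresholds(toy_p, toy_y, sensitivity, grid)
best_spec = sweep_thresholds(toy_p, toy_y, specificity, grid)
print(best_sens)
print(best_spec)
```

Any callable with the `(y_true, y_pred)` signature, including sklearn's `f1_score`, could be passed as `metric` in the same way.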

In [42]:
get_best_sensitivity(predicted_probabilities=predicted_probabilities,
                     search_space=generate_search_space(),
                     y_test=y_test,
                     is_multidimensional=True,
                     target_class_index=1)

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9917355371900827, 0.9917355371900827, 0.9917355371900827, 0.9834710743801653, 0.9834710743801653, 0.9834710743801653, 0.9834710743801653, 0.9834710743801653, 0.9834710743801653, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9669421487603306, 0.9669421487603306, 0.9669421487603306, 0.9669421487603306, 0.9669421487603306, 0.9669421487603306, 0.9669421487603306, 0.9669421487603306,

(1.0, 0.0)
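
The result `(1.0, 0.0)` is what a threshold of 0.0 must produce: every probability is at least 0.0, so every sample is labelled positive and no false negatives remain. The sketch below shows this on made-up data (the arrays `toy_y` and `toy_p` are assumptions for illustration) and also shows the cost, which is a specificity of 0.

```python
import numpy as np

toy_y = np.array([0, 0, 1, 1, 1])
toy_p = np.array([0.05, 0.30, 0.45, 0.70, 0.95])  # made-up P(class 1)

# Every probability is >= 0.0, so a threshold of 0.0 labels all samples positive.
all_positive = np.where(toy_p >= 0.0, 1, 0)

tp = np.sum((toy_y == 1) & (all_positive == 1))
fn = np.sum((toy_y == 1) & (all_positive == 0))
tn = np.sum((toy_y == 0) & (all_positive == 0))
fp = np.sum((toy_y == 0) & (all_positive == 1))

sensitivity_at_zero = tp / (tp + fn)  # no false negatives are possible
specificity_at_zero = tn / (tn + fp)  # no true negatives are possible
print(sensitivity_at_zero, specificity_at_zero)  # 1.0 0.0
```

For this reason, maximizing sensitivity alone is rarely informative; it is usually read together with specificity or another counterbalancing metric.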