# Modeling Template Notebook - v1

This notebook serves as the template for the standard modelling we will all use as the base of version 1 development.

The goal is for all of us to use the same predicted probabilities and classes when building different functions according to the tasks assigned to us.

The notebook should be used as follows:
1. Import all needed packages
2. Load Data
3. Split Train and Test Sets
4. Build and Run model
5. Output predictions

### Import necessary packages

Note: If a `ModuleNotFoundError` is raised, the import cell automatically attempts to install all needed packages. You then have to rerun the import cell.

In [101]:
def install_packages():
    !pip install pandas
    !pip install numpy
    !pip install scikit-learn

In [102]:
try:
    import pandas as pd
    import numpy as np
    from sklearn import datasets
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, f1_score
    print('import successful')
except ModuleNotFoundError:
    print('forcing install of necessary packages. If you see this, rerun this cell to try again')
    install_packages()

import successful


### Load necessary data

The dataset we're starting off with is the standard breast cancer dataset that lends itself directly to binary classification modelling.

In [103]:
X, y = datasets.load_breast_cancer(return_X_y=True)

### Train Test Split

In [104]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### Run model

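We're using scikit-learn's default logistic regression (L2-regularized and fitted with the `lbfgs` solver, not ordinary least squares), with no data transformation or scaling, to keep things simple. Because the features are unscaled, the solver reaches its default limit of 100 iterations and emits the convergence warning shown in the cell output.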

In [105]:
clf = LogisticRegression(random_state=0).fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
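
One way to address the warning above, sketched here as an assumption rather than as part of the template, is to standardize the features before fitting. The name `scaled_clf` and the choice of `make_pipeline` with `StandardScaler` are illustrative only; raising `max_iter` is the other option the warning mentions.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same data and split as the template
X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Hypothetical alternative: standardize the features, then fit the same default model
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
scaled_clf.fit(X_train, y_train)

scaled_probabilities = scaled_clf.predict_proba(X_test)
print('shape of predicted probabilities:', scaled_probabilities.shape)
print('accuracy on the test set:', scaled_clf.score(X_test, y_test))
```

Scaling changes the predicted probabilities, so anything built on the shared template outputs would need to be rerun if the team adopted it.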


### Output Predictions

This section provides the basic outputs we will be using for function building as variables. They are as follows:
1. predicted_probabilities: this is the output of `predict_proba`, on which we will be performing optimizations. It has two 'columns', where the values in index 0 are the probabilities for class 0 and those in index 1 are the probabilities for class 1
2. sklearn_class_labels: this is the class label prediction computed by sklearn's `predict` method
3. naive_class_labels: this is the class label prediction computed by using a naïve 0.5 threshold
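
To make the relationship between these three outputs concrete, here is a minimal sketch on synthetic data (the `make_classification` dataset and all `demo_` names are assumptions made for illustration). For a two-class logistic regression, `predict` is expected to agree with thresholding the class 1 column at 0.5, which is consistent with the identical accuracy and F1 scores reported for both label sets in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used only to keep the example small
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_clf = LogisticRegression(random_state=0).fit(X_demo, y_demo)

demo_probabilities = demo_clf.predict_proba(X_demo)

# Each row holds P(class 0) and P(class 1), so every row sums to 1
rows_sum_to_one = np.allclose(demo_probabilities.sum(axis=1), 1.0)

# predict() returns the more probable class ...
demo_sklearn_labels = demo_clf.predict(X_demo)
# ... which for two classes matches thresholding the class 1 column at 0.5
demo_naive_labels = np.where(demo_probabilities[:, 1] > 0.5, 1, 0)
labels_agree = np.array_equal(demo_sklearn_labels, demo_naive_labels)

print(rows_sum_to_one, labels_agree)
```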

In [106]:
predicted_probabilities = clf.predict_proba(X_test)
print('shape of predicted probabilities:', predicted_probabilities.shape)

shape of predicted probabilities: (188, 2)


In [107]:
sklearn_class_labels = clf.predict(X_test)
print('shape of model default class predictions:', sklearn_class_labels.shape)

shape of model default class predictions: (188,)


In [108]:
naive_class_labels = np.where(predicted_probabilities[:,1] > 0.5, 1, 0)
print('shape of naïve class predictions:', naive_class_labels.shape)

shape of naïve class predictions: (188,)


In [109]:
print('Accuracy and F1 score of sklearn predicted classes')
accuracy_score(y_test, sklearn_class_labels), f1_score(y_test, sklearn_class_labels)  # sklearn expects (y_true, y_pred)

Accuracy and F1 score of sklearn predicted classes


(0.9680851063829787, 0.9752066115702479)

In [110]:
print('Accuracy and F1 score of naive predicted classes')
accuracy_score(y_test, naive_class_labels), f1_score(y_test, naive_class_labels)  # sklearn expects (y_true, y_pred)

Accuracy and F1 score of naive predicted classes


(0.9680851063829787, 0.9752066115702479)

## Build Functions

In [111]:
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score

In [112]:
def generate_search_space(number_of_values: int = 100) -> np.ndarray:
    # evenly spaced candidate thresholds from 0 to 1, both endpoints included
    return np.linspace(0, 1, number_of_values)
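
A short illustration of what this search space contains, since it explains threshold values such as 0.15151515151515152 that show up in the outputs of this notebook. The variable names are illustrative.

```python
import numpy as np

search_space = np.linspace(0, 1, 100)

# 100 points over [0, 1] leave 99 gaps, so the step is 1/99 rather than 0.01
step = search_space[1] - search_space[0]
print(step)

# The candidate at index 15 is therefore 15/99
print(search_space[15])

# 101 points would put the candidates on a 0.01 grid instead
rounded_space = np.linspace(0, 1, 101)
print(rounded_space[1] - rounded_space[0])
```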

In [113]:
def convert_classes(predicted_probabilities: np.ndarray,
                    threshold: float,
                    is_multidimensional: bool,
                    target_class_index: int = None) -> np.ndarray:
    if is_multidimensional:
        # a 2D input needs to know which class column to threshold
        assert target_class_index is not None
        assert predicted_probabilities.ndim > 1
        classes = np.where(predicted_probabilities[:, target_class_index] >= threshold, 1, 0)
    else:
        classes = np.where(predicted_probabilities >= threshold, 1, 0)
    return classes

In [114]:
classes = convert_classes(predicted_probabilities=predicted_probabilities,
                          threshold=0.5, 
                          is_multidimensional=True,
                          target_class_index=1)
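
Note that `convert_classes` uses `>=`, while `naive_class_labels` was built with a strict `>`. The two only differ for a probability exactly equal to the threshold, as this small sketch on made-up values shows:

```python
import numpy as np

probabilities = np.array([0.2, 0.5, 0.8])

# Strict comparison, as used for naive_class_labels
strict_labels = np.where(probabilities > 0.5, 1, 0)
# Inclusive comparison, as used in convert_classes
inclusive_labels = np.where(probabilities >= 0.5, 1, 0)

print(strict_labels)     # [0 0 1]
print(inclusive_labels)  # [0 1 1]
```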

In [115]:
def get_best_f1(predicted_probabilities: np.ndarray,
                is_multidimensional: bool,
                search_space: np.ndarray,  # this should be type unioned with lists
                y_test: np.ndarray,
                target_class_index: int = None):
    f1_scores = []
    for i in search_space:
        classes = convert_classes(predicted_probabilities=predicted_probabilities,
                                  threshold=i, 
                                  is_multidimensional=is_multidimensional,
                                  target_class_index=target_class_index)
        f1_scores.append(f1_score(y_test, classes))  # sklearn expects (y_true, y_pred)
    best_f1_score = max(f1_scores)
    best_f1_index = f1_scores.index(best_f1_score)
    best_f1_threshold = search_space[best_f1_index]
    print(f'best f1 score: {best_f1_score} occurs at threshold {best_f1_threshold}')
    return best_f1_score, best_f1_threshold
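
As a possible cross-check for the grid search above, scikit-learn's `precision_recall_curve` evaluates every distinct score as a threshold, so the best F1 it yields should never be lower than the grid result. This is a sketch on made-up labels and probabilities, not a replacement for `get_best_f1`; all names below are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

# Toy labels and class 1 probabilities (made up for illustration)
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.80, 0.90])

# Grid search in the style of get_best_f1
search_space = np.linspace(0, 1, 100)
grid_f1 = [
    f1_score(y_true, np.where(y_score >= t, 1, 0), zero_division=0)
    for t in search_space
]
best_grid_f1 = max(grid_f1)

# Exact search: one candidate threshold per distinct score
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# The final precision/recall pair has no threshold attached, so drop it
p, r = precision[:-1], recall[:-1]
curve_f1 = 2 * p * r / np.maximum(p + r, np.finfo(float).eps)
best_curve_f1 = curve_f1.max()
best_curve_threshold = thresholds[curve_f1.argmax()]

print(best_grid_f1, best_curve_f1, best_curve_threshold)
```

The curve-based search does not depend on the grid resolution, at the cost of returning thresholds tied to the scores seen in this particular test set.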

In [116]:
def get_best_sensitivity(predicted_probabilities: np.ndarray,
                         is_multidimensional: bool,
                         search_space: np.ndarray,  # this should be type unioned with lists
                         y_test: np.ndarray,
                         target_class_index: int = None):
    sensitivity_scores = []
    for i in search_space:
        classes = convert_classes(predicted_probabilities=predicted_probabilities,
                                  threshold=i, 
                                  is_multidimensional=is_multidimensional,
                                  target_class_index=target_class_index)
        tn, fp, fn, tp = confusion_matrix(y_test, classes).ravel()
        sensitivity = tp / (tp+fn)
        sensitivity_scores.append(sensitivity)
    print(sensitivity_scores)
    best_sensitivity_score = max(sensitivity_scores)
    best_sensitivity_index = sensitivity_scores.index(best_sensitivity_score)
    best_sensitivity_threshold = search_space[best_sensitivity_index]
    print(f'best sensitivity score: {best_sensitivity_score} occurs at threshold {best_sensitivity_threshold}')
    return best_sensitivity_score, best_sensitivity_threshold

In [117]:
def get_best_specificity(predicted_probabilities: np.ndarray,
                         is_multidimensional: bool,
                         search_space: np.ndarray,  # this should be type unioned with lists
                         y_test: np.ndarray,
                         target_class_index: int = None):
    specificity_scores = []
    for i in search_space:
        classes = convert_classes(predicted_probabilities=predicted_probabilities,
                                  threshold=i, 
                                  is_multidimensional=is_multidimensional,
                                  target_class_index=target_class_index)
        tn, fp, fn, tp = confusion_matrix(y_test, classes).ravel()
        specificity = tn / (tn+fp)
        specificity_scores.append(specificity) 
    best_specificity_score = max(specificity_scores)
    best_specificity_index = specificity_scores.index(best_specificity_score)
    best_specificity_threshold = search_space[best_specificity_index]
    print(f'best specificity score: {best_specificity_score} occurs at threshold {best_specificity_threshold}')
    return best_specificity_score, best_specificity_threshold
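
The two formulas used above, sensitivity = tp / (tp + fn) and specificity = tn / (tn + fp), can be checked on a small made-up example. The sketch also shows that they correspond to the recall of class 1 and of class 0 respectively.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = np.array([1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # share of actual positives that were found
specificity = tn / (tn + fp)  # share of actual negatives that were found

# Sensitivity is the recall of class 1; specificity is the recall of class 0
print(sensitivity, recall_score(y_true, y_pred))
print(specificity, recall_score(y_true, y_pred, pos_label=0))
```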

In [118]:
get_best_sensitivity(predicted_probabilities=predicted_probabilities,
                     search_space=generate_search_space(),
                     y_test=y_test,
                     is_multidimensional=True,
                     target_class_index=1)

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9917355371900827, 0.9917355371900827, 0.9917355371900827, 0.9834710743801653, 0.9834710743801653, 0.9834710743801653, 0.9834710743801653, 0.9834710743801653, 0.9834710743801653, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9752066115702479, 0.9669421487603306, 0.9669421487603306, 0.9669421487603306, 0.9669421487603306, 0.9669421487603306, 0.9669421487603306, 0.9669421487603306, 0.9669421487603306,

(1.0, 0.0)
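
The result `(1.0, 0.0)` is expected rather than informative: at threshold 0 every sample is labelled class 1, so no positive can be missed. A minimal sketch on made-up values shows what that costs in specificity:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 0, 1])
probabilities = np.array([0.9, 0.4, 0.6, 0.1, 0.3])

# Every probability is >= 0, so threshold 0 labels every sample as class 1
classes = np.where(probabilities >= 0.0, 1, 0)
tn, fp, fn, tp = confusion_matrix(y_true, classes).ravel()

sensitivity = tp / (tp + fn)  # no positives are missed
specificity = tn / (tn + fp)  # no negatives are recognised

print(sensitivity, specificity)
```

Optimizing sensitivity or specificity on its own will therefore tend to pick an extreme threshold; it is usually more meaningful when read alongside another metric.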

In [127]:
from typing import Union, Tuple

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, precision_score, recall_score


class ThresholdOptimizer:
    def __init__(self,
                 predicted_probabilities: Union[np.ndarray, pd.Series, list],
                 y_test: Union[np.ndarray, pd.Series, list],
                 search_space_size: int = 100):
        """

        Args:
            predicted_probabilities: output from the application of test/validation data from model/estimator.
                This should be a list, numpy array or pandas series containing probabilities
                that are to be converted into class predictions. If multidimensional input is given,
                it defaults to use predictions for class 1 during optimization.
            y_test: The true class values from the test/validation set passed into the model/estimator for predictions.
            search_space_size: The number of possible probability threshold values to optimize for
        """
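        # convert first so that lists and pandas objects are supported as documented
        self.predicted_probabilities = np.asarray(predicted_probabilities)
        if self.predicted_probabilities.ndim == 2:
            # multidimensional input: keep only the class 1 probabilities
            self.predicted_probabilities = self.predicted_probabilities[:, 1]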
        self.search_space = np.linspace(0, 1, search_space_size)
        self.y_test = np.array(y_test)
        self.optimized_metrics = dict()
        self._supported_metrics = [
            'f1', 'accuracy', 'sensitivity', 'specificity',
            'precision', 'recall',
        ]

    def set_search_space(self,
                         search_space_size: int):
        """Set the number of possible probability threshold values to optimize for

        This function is useful to reset the size of the search space after initializing the ThresholdOptimizer object.

        Args:
            search_space_size: The number of possible probability threshold values to optimize for
        """
        self.search_space = np.linspace(0, 1, search_space_size)

    def convert_classes(self,
                        threshold: float) -> np.ndarray:
        """Convert predicted probabilities into binary classes based on a threshold/cutoff value

        Args:
            threshold: The probability threshold value to determine predicted classes.
                        This follows a greater than or equal to format for determining class 1

        Returns: 1 dimensional numpy array of classes

        """
        classes = np.where(self.predicted_probabilities >= threshold, 1, 0)
        return classes

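    def _get_best_metrics(self,
                          metric_type: str,
                          scores: list,
                          optimization: str = 'max') -> Tuple[float, float]:
        """Computes the best score and threshold for the specified supported metric

        Args:
            optimization: 'max' or 'min', depending on whether the best score is the highest or the lowest
            metric_type: name of the metric, used as the key in self.optimized_metrics
            scores: one score per threshold in self.search_space, in the same order

        Returns: best score and best threshold for a specified metric

        """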
        if optimization.lower() == 'max':
            best_score = max(scores)
        elif optimization.lower() == 'min':
            best_score = min(scores)
        else:
            raise ValueError('Wrong value passed into optimization parameter. Should be max or min')
        best_index = scores.index(best_score)
        best_threshold = self.search_space[best_index]
        self.optimized_metrics.update(
            {
                metric_type: {
                    'best_score': best_score,
                    'best_threshold': best_threshold,
                    'all_scores': scores,
                },
            },
        )
        print(f'best {metric_type}: {best_score} occurs at threshold {best_threshold}')
        return best_score, best_threshold

    def get_best_f1_metrics(self) -> Tuple[float, float]:
        """Optimizes threshold for F1 score

        Returns: best F1 score and threshold at which best F1 score occurs

        """
        f1_scores = list()
        for i in self.search_space:
            classes = self.convert_classes(threshold=i)
            f1_scores.append(f1_score(self.y_test, classes))  # sklearn expects (y_true, y_pred)
        best_f1_score, best_f1_threshold = self._get_best_metrics(
            metric_type='f1_score',
            scores=f1_scores,
            optimization='max'
        )
        return best_f1_score, best_f1_threshold

    def get_best_sensitivity_metrics(self) -> Tuple[float, float]:
        """Optimizes threshold for sensitivity score

        Returns: best sensitivity score and threshold at which best sensitivity score occurs

        """
        sensitivity_scores = list()
        for i in self.search_space:
            classes = self.convert_classes(threshold=i)
            tn, fp, fn, tp = confusion_matrix(self.y_test, classes).ravel()
            sensitivity = tp / (tp + fn)
            sensitivity_scores.append(sensitivity)
        best_sensitivity_score, best_sensitivity_threshold = self._get_best_metrics(
            metric_type='sensitivity_score',
            scores=sensitivity_scores,
            optimization='max'
        )
        return best_sensitivity_score, best_sensitivity_threshold

    def get_best_specificity_metrics(self) -> Tuple[float, float]:
        """Optimizes threshold for specificity

        Returns: best specificity score and threshold at which best specificity score occurs

        """
        specificity_scores = list()
        for i in self.search_space:
            classes = self.convert_classes(threshold=i)
            tn, fp, fn, tp = confusion_matrix(self.y_test, classes).ravel()
            specificity = tn / (tn + fp)
            specificity_scores.append(specificity)
        best_specificity_score, best_specificity_threshold = self._get_best_metrics(
            metric_type='specificity_score',
            scores=specificity_scores,
            optimization='max'
        )
        return best_specificity_score, best_specificity_threshold

    def get_best_accuracy_metrics(self) -> Tuple[float, float]:
        """Optimizes threshold for accuracy

        Returns: best accuracy score and threshold at which best accuracy score occurs

        """
        accuracy_scores = list()
        for i in self.search_space:
            classes = self.convert_classes(threshold=i)
            accuracy_scores.append(accuracy_score(self.y_test, classes))  # sklearn expects (y_true, y_pred)
        best_accuracy_score, best_accuracy_threshold = self._get_best_metrics(
            metric_type='accuracy_score',
            scores=accuracy_scores,
            optimization='max'
        )
        return best_accuracy_score, best_accuracy_threshold

    def get_best_precision_metrics(self) -> Tuple[float, float]:
        """Optimizes threshold for precision

        Returns: best precision score and threshold at which best precision score occurs

        """
        precision_scores = list()
        for i in self.search_space:
            classes = self.convert_classes(threshold=i)
            # sklearn expects (y_true, y_pred); swapping them would compute recall instead
            precision_scores.append(precision_score(self.y_test, classes))
        best_precision_score, best_precision_threshold = self._get_best_metrics(
            metric_type='precision_score',
            scores=precision_scores,
            optimization='max'
        )
        return best_precision_score, best_precision_threshold

    def get_best_recall_metrics(self) -> Tuple[float, float]:
        """Optimizes threshold for recall

        Returns: best recall score and threshold at which best recall score occurs

        """
        recall_scores = list()
        for i in self.search_space:
            classes = self.convert_classes(threshold=i)
            # sklearn expects (y_true, y_pred); swapping them would compute precision instead
            recall_scores.append(recall_score(self.y_test, classes))
        best_recall_score, best_recall_threshold = self._get_best_metrics(
            metric_type='recall_score',
            scores=recall_scores,
            optimization='max'
        )
        return best_recall_score, best_recall_threshold

    def optimize_metrics(self,
                         metrics: list = None):
        """Function to optimize for supported metrics in a batch format

        Args:
            metrics: Optional. Should be specified if only specific supported metrics are
                    to be optimized. Input must be a subset of the supported metrics.
                    If no metrics are specified, all supported metrics will be optimized for.

        """
        if metrics is None:
            metrics = self._supported_metrics
        metrics = [metric.lower() for metric in metrics]
        assert all(metric in self._supported_metrics for metric in metrics)
        for i in metrics:
            # look up the matching get_best_<metric>_metrics method by name
            getattr(self, f'get_best_{i}_metrics')()


In [128]:
thresh = ThresholdOptimizer(predicted_probabilities=predicted_probabilities,
                            y_test=y_test)

In [129]:
thresh.optimize_metrics()

best f1_score: 0.9797570850202428 occurs at threshold 0.15151515151515152
best accuracy_score: 0.973404255319149 occurs at threshold 0.15151515151515152
best sensitivity_score: 1.0 occurs at threshold 0.0
best specificity_score: 1.0 occurs at threshold 0.8383838383838385
best precision_score: 1.0 occurs at threshold 0.8383838383838385
best recall_score: 1.0 occurs at threshold 0.0


  _warn_prf(average, modifier, msg_start, len(result))
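
The `_warn_prf` line above looks like the tail of scikit-learn's `UndefinedMetricWarning`, which is raised when a metric's denominator is zero, for example precision when a threshold leaves no predicted positives. A minimal sketch, on made-up labels, of how the `zero_division` argument makes that case explicit:

```python
import warnings

import numpy as np
from sklearn.metrics import precision_score

y_true = np.array([1, 0, 1, 0])
# A threshold above every probability predicts class 0 for all samples
all_negative = np.zeros_like(y_true)

with warnings.catch_warnings():
    # Turn warnings into errors to show that none is raised here
    warnings.simplefilter('error')
    score = precision_score(y_true, all_negative, zero_division=0)

print(score)  # 0.0
```

Whether 0 is the right stand-in for an undefined score depends on how the optimized threshold will be used, so treat the value passed to `zero_division` as a choice to be agreed on.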
