# Modeling Template Notebook - v1

This notebook serves as a template for the standard modelling we will all use as the base of version 1 development.

The goal is for all of us to use the same predicted probabilities and classes to build different functions according to the tasks assigned to us.

The notebook should be used as follows:
1. Import all needed packages
2. Load Data
3. Split Train and Test Sets
4. Build and Run model
5. Output predictions

### Import necessary packages

Note: If a `ModuleNotFoundError` is thrown, the import cell automatically attempts to install all needed packages. You then have to rerun the import cell.
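As an optional aid, the sketch below reports which required packages are missing before any import is attempted. It relies only on the standard library's `importlib.util.find_spec`; the helper name `missing_packages` and the module list are illustrative assumptions, not part of the notebook.

```python
from importlib.util import find_spec

# Import names used by this notebook (scikit-learn is imported as "sklearn")
REQUIRED_MODULES = ["pandas", "numpy", "sklearn"]

def missing_packages(modules=REQUIRED_MODULES):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in modules if find_spec(name) is None]

print("missing:", missing_packages())
```

An empty list means the import cell should succeed without triggering an install.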

In [1]:
def install_packages():
    !pip install pandas
    !pip install numpy
    !pip install scikit-learn

In [2]:
try:
    import pandas as pd
    import numpy as np
    from sklearn import datasets
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, f1_score
    from sklearn.metrics import precision_score, precision_recall_curve
    print('import successful')
except ModuleNotFoundError:
    print('forcing install of necessary packages. If you see this, rerun this cell to try again')
    install_packages()

import successful


### Load necessary data

The dataset we're starting off with is the standard breast cancer dataset that lends itself directly to binary classification modelling.

In [3]:
X, y = datasets.load_breast_cancer(return_X_y=True)
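Before modelling, it can help to check the shape of the data and the class balance. The following is a small sketch of such a check; the variable name `class_counts` is an assumption introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

X, y = datasets.load_breast_cancer(return_X_y=True)

# 569 samples with 30 numeric features each
print("X shape:", X.shape)

# Class counts: in this dataset 0 = malignant and 1 = benign
labels, counts = np.unique(y, return_counts=True)
class_counts = dict(zip(labels.tolist(), counts.tolist()))
print("class counts:", class_counts)
```

Note that class 1 is the benign class here, which matters when interpreting precision for the positive class later on.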

### Train Test Split

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### Run model

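We're using scikit-learn's default `LogisticRegression` in this case (L2-regularised and fitted by maximum likelihood with the `lbfgs` solver, not by ordinary least squares), with no data transformation or scaling, to keep things simple. Because the features are unscaled, the solver reaches its default iteration limit and emits the convergence warning shown below.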

In [5]:
clf = LogisticRegression(random_state=0).fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
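
One way to address the convergence warning above is to standardise the features before fitting, as the warning itself suggests. The following is a minimal sketch and not the template's model: the name `scaled_clf` is an assumption, and switching to it would change the predicted probabilities everyone is meant to share.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Standardise each feature to zero mean and unit variance, then fit the same model.
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
scaled_clf.fit(X_train, y_train)

print("test accuracy with scaling:", scaled_clf.score(X_test, y_test))
```

Wrapping the scaler and the model in one pipeline means the scaler is fitted on the training split only, so no information from the test set leaks into the fit.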


### Output Predictions

This section provides the basic outputs we will be using for function building as variables. They are as follows:
1. predicted_probabilities: this is the output of `predict_proba`, on which we will be performing optimizations. It has two 'columns': the values in index 0 are the probabilities for class 0 and those in index 1 are the probabilities for class 1
2. sklearn_class_labels: this is the class label prediction computed by sklearn's `predict` method
3. naive_class_labels: this is the class label prediction computed by using a naïve 0.5 threshold
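
For a binary problem, choosing the most probable class and thresholding the class 1 probability at 0.5 are expected to give the same labels, which is consistent with the identical scores reported for both label sets further down. The sketch below illustrates this on a hypothetical probability array (`toy_probabilities` is made up for the example and is not model output).

```python
import numpy as np

# Hypothetical predict_proba output: one row per sample, columns = [P(class 0), P(class 1)]
toy_probabilities = np.array([
    [0.90, 0.10],
    [0.35, 0.65],
    [0.50, 0.50],
    [0.20, 0.80],
])

# Picking the most probable class (ties go to the first column, i.e. class 0) ...
argmax_labels = np.argmax(toy_probabilities, axis=1)
# ... agrees with thresholding the class 1 column at a strict 0.5
threshold_labels = np.where(toy_probabilities[:, 1] > 0.5, 1, 0)

print(argmax_labels, threshold_labels)
```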

In [6]:
predicted_probabilities = clf.predict_proba(X_test)
print('shape of predicted probabilities:', predicted_probabilities.shape)

shape of predicted probabilities: (188, 2)


In [7]:
sklearn_class_labels = clf.predict(X_test)
print('shape of model default class predictions:', sklearn_class_labels.shape)

shape of model default class predictions: (188,)


In [8]:
naive_class_labels = np.where(predicted_probabilities[:,1] > 0.5, 1, 0)
print('shape of naïve class predictions:', naive_class_labels.shape)

shape of naïve class predictions: (188,)


In [9]:
print('Accuracy and F1 score of sklearn predicted classes')
# sklearn metrics expect (y_true, y_pred), in that order
accuracy_score(y_test, sklearn_class_labels), f1_score(y_test, sklearn_class_labels)

Accuracy and F1 score of sklearn predicted classes


(0.9680851063829787, 0.9752066115702479)

In [10]:
print('Accuracy and F1 score of naive predicted classes')
# sklearn metrics expect (y_true, y_pred), in that order
accuracy_score(y_test, naive_class_labels), f1_score(y_test, naive_class_labels)

Accuracy and F1 score of naive predicted classes


(0.9680851063829787, 0.9752066115702479)

## Build Functions

In [11]:
def getting_probs(model, X_test):
    # Probability of the positive class (column 1 of predict_proba)
    y_scores = model.predict_proba(X_test)[:, 1]
    return y_scores

def adjusted_classes(y_scores, t):
    # Works for binary classes only right now
    return [1 if y >= t else 0 for y in y_scores]

def find_best_scores(numtimes, y_scores):
    # Sweep numtimes evenly spaced thresholds in [0, 1) and keep the one with the
    # highest precision. Relies on the global y_test defined above.
    best = 0
    best_threshold = 0.0
    for threshold in np.linspace(0.0, 1.0, numtimes, endpoint=False):
        y_pred_adjusted = adjusted_classes(y_scores, threshold)
        # precision_score expects (y_true, y_pred), in that order
        score = precision_score(y_test, y_pred_adjusted)
        # >= means that, on ties, the highest threshold wins
        if score >= best:
            best = score
            best_threshold = threshold
    return best, best_threshold

# Best precision score and the threshold that achieves it
def find_best_threshold(model, X_test, numtimes):
    y_scores = getting_probs(model, X_test)
    return find_best_scores(numtimes, y_scores)

# Labels computed with the best threshold
def best_threshold_labels(model, X_test, numtimes):
    y_scores = getting_probs(model, X_test)
    res = find_best_scores(numtimes, y_scores)
    return adjusted_classes(y_scores, res[1])
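
The import cell brings in `precision_recall_curve`, which the functions above do not use. As a possible alternative to a fixed threshold grid, the sketch below lets sklearn evaluate precision at every distinct score value. The toy labels and scores are hypothetical, and this is one option under those assumptions rather than the approach the template prescribes.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels and positive-class scores
toy_y_true = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(toy_y_true, toy_scores)

# precision and recall carry one extra trailing entry with no matching threshold,
# so drop it before searching for the best precision.
best_idx = int(np.argmax(precision[:-1]))
best_threshold = float(thresholds[best_idx])
best_precision = float(precision[best_idx])
print(best_threshold, best_precision)
```

On the notebook's data, the same call would take `y_test` and the output of `getting_probs(clf, X_test)`. Maximising precision alone tends to favour very high thresholds, so it may be worth inspecting `recall` at the chosen index as well.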