## 5.3.1 Logistic regression: Performance metrics evaluation
After training the logistic regression model, we evaluate its performance with metrics suited to binary classification: accuracy, precision, recall, F1 score, and the confusion matrix.

### 1. Classification accuracy
We shall use the same model trained previously. The following code was executed to create and train the logistic regression model:

In [1]:
import pandas as pd
df = pd.read_csv('../datasets/clean_creditcard.csv')

from sklearn.linear_model import LogisticRegression

lr_Object  = LogisticRegression(C=1.0, class_weight=None, dual=False,
                                fit_intercept=True, intercept_scaling=1,
                                max_iter=500, multi_class='auto', n_jobs=None,
                                penalty='l2', random_state=None,
                                solver='liblinear', tol=0.0001,
                                verbose=0, warm_start=False)


X = df.drop(['Class_Category'], axis=1)
y = df[['Class_Category']]


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)


lr_Object.fit(X_train, y_train.values.ravel())

We employed the trained model to make predictions on the features of the samples from the held-out test set:



In [2]:
y_pred = lr_Object.predict(X_test)
y_pred

array([1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 0], dtype=int64)

The model-predicted labels for the test set are stored in the variable y_pred, and the true labels are in the variable y_test. We can now assess the model performance and the quality of the predictions.

Let us now calculate the model prediction accuracy; accuracy is defined as the proportion of samples that were correctly classified. 

We shall calculate accuracy in two ways: with plain Python and NumPy, and with scikit-learn's built-in functions.

**Calculate classification accuracy via Python**

We can calculate accuracy by creating a logical mask that is True whenever the predicted label y_pred is equal to the actual label in y_test, and False otherwise. 

In [4]:
is_correct = y_pred == y_test.values.ravel()
is_correct

array([ True,  True,  True,  True,  True, False, False,  True, False,
        True, False,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True, False, False,  True,  True,  True,
        True,  True,  True, False,  True, False,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True, False,  True,  True, False,  True,  True,  True,
        True,  True, False,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,

Next, we can take the average of this mask, which will interpret True as 1 and False as 0, giving us the proportion of correct classifications:

In [7]:
import numpy as np
print(np.mean(is_correct))

0.914179104477612


This indicates that the model is correct about 91% of the time on the test set. While this is a straightforward calculation, scikit-learn offers more convenient ways to calculate accuracy, which we shall see next.

**Calculate classification accuracy via scikit-learn**

One way is to use the trained model's .score method, passing the features of the test data to make predictions on, as well as the test labels. This method makes the predictions and then does the same calculation we performed previously, all in one step. 

In [8]:
print(lr_Object.score(X_test,y_test))

0.914179104477612


Another way is to import scikit-learn's metrics module, which includes many model performance metrics, such as accuracy_score. For this, we pass the true labels and the predicted labels:

In [9]:
from sklearn import metrics 
print(metrics.accuracy_score(y_test,y_pred))

0.914179104477612


All the previous ways of calculating accuracy give the same result. 

**How to interpret accuracy?**

An accuracy of 91% may sound good: we are getting most predictions correct. However, consider the original credit card data, which is unbalanced: most of the transactions are non-frauds (the class label is negative) and only a few are frauds (the class label is positive). Assume we train our model on this original unbalanced data.

> Remember that this is not our case: we balanced the data before training the model. We discuss the unbalanced case here only for the sake of explanation.

With unbalanced data, an important check on the accuracy of a binary classifier is to compare it with a very simple hypothetical model that always makes the same prediction:

- This hypothetical model predicts the majority class for every sample, no matter what the features are. 
- While in practice this model is useless, it provides an important extreme case with which to compare the accuracy of our trained model. Such extreme cases are sometimes referred to as null models.

Think about what the accuracy of such a null model would be. In the original unbalanced credit card dataset, very few of the samples are positive; the negative class is the majority class. Therefore, a null model that always predicts the negative class is right for every negative sample and wrong for every positive one, so its accuracy equals the proportion of negative samples, which is very high.

When we compare a model trained on this unbalanced dataset to such a null model, it becomes clear that a high accuracy by itself is not very informative.

- We can get a similarly high accuracy with a model that doesn't pay any attention to the features.

Accordingly, while we can interpret accuracy in terms of a majority-class null model, there are other binary classification metrics that delve a little deeper into how the model is performing for negative, as well as positive samples separately.
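The null-model comparison described above can be sketched with scikit-learn's DummyClassifier. This is a minimal illustration under stated assumptions: the labels are hypothetical (990 negatives and 10 positives, not the actual credit card data), and the feature matrix is a placeholder because a null model ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced labels (illustrative only): 990 negatives, 10 positives.
y = np.array([0] * 990 + [1] * 10)
# A null model ignores the features, so a placeholder column of zeros is enough.
X = np.zeros((len(y), 1))

# strategy="most_frequent" always predicts the majority class seen during fit.
null_model = DummyClassifier(strategy="most_frequent")
null_model.fit(X, y)

null_accuracy = null_model.score(X, y)
# Fraction of the positive samples that the null model flags as positive.
positives_found = np.mean(null_model.predict(X)[y == 1] == 1)

print(null_accuracy)    # proportion of the majority class
print(positives_found)  # the null model never predicts the positive class
```

On such data the null model reaches a very high accuracy while detecting none of the positive samples, which is why accuracy alone can be misleading for unbalanced classes.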

### 2. True and false positive and negative rates
We will use the test data and the predictions from the logistic regression model we created previously to calculate the following manually in Python:

- The numbers of true and false positives and negatives.  
- The true and false positive and negative rates.

Perform the following steps to complete the exercise:

> The previous code for creating, training and making predictions must be run before doing these calculations.  

1. Calculate the number of positive samples, as in this code:

In [10]:
P = sum(y_test.values.ravel())
print(P)

132


2. Calculate the number of true positives. These are samples where the true label is 1 and the prediction is also 1. 

- We can identify these with a logical mask for the samples that are positive (the class labels in y_test==1) AND (& is the logical AND operator in Python) have a positive prediction (y_pred==1).

Use this code to calculate the number of true positives:

In [11]:
TP = sum( (y_test.values.ravel()==1) & (y_pred==1) )
print(TP)

112


3. Calculate the true positive rate (TPR). This is the proportion of positive samples that are true positives. Run the following code to obtain the TPR:

In [12]:
TPR = TP/P
print(TPR)

0.8484848484848485


Similarly, we can calculate false negatives and false-negative rate.

4. Calculate the number of false negatives FN with this code:

In [13]:
FN = sum( (y_test.values.ravel()==1) & (y_pred==0) )
print(FN)

20


5. Calculate the false-negative rate (FNR) with this code:

In [14]:
FNR = FN/P
print(FNR)

0.15151515151515152


6. Calculate the TNR and FPR of our test data. This is very similar to what we did earlier.

In [15]:
N= sum(y_test.values.ravel()==0)
print(N)

136


In [16]:
TN= sum((y_test.values.ravel()==0) & (y_pred==0))
print(TN)
# that yields: 133

FP = sum((y_test.values.ravel()==0) & (y_pred==1))
print(FP)
# that yields: 3

TNR = TN/N
FPR = FP/N
print('the true negative rate is {} and the false positive rate is {}'.format(TNR, FPR))

133
3
the true negative rate is 0.9779411764705882 and the false positive rate is 0.022058823529411766


### How to interpret the true and false positive and negative rates?
Notice that FNR = 0.15151515151515152 and TPR = 0.8484848484848485 sum to 1. This is because every positive sample is either a true positive or a false negative, so TP + FN = P.

Likewise, FPR = 0.022058823529411766 and TNR = 0.9779411764705882 sum to 1, because every negative sample is either a true negative or a false positive, so TN + FP = N.

The trained model performs well on both classes, as the TPR and TNR are much larger than the FNR and FPR. It performs better on the negative class: it correctly identifies about 98% of the negative samples (TNR ≈ 0.978) but only about 85% of the positive samples (TPR ≈ 0.848).
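As a compact restatement of the steps above, the four rates can be computed from the four counts in one place. The helper name `rates_from_counts` is hypothetical and introduced only for illustration; the counts passed to it are the ones printed in the earlier cells.

```python
def rates_from_counts(tp, fn, tn, fp):
    """Return the four rates given the confusion-matrix counts."""
    p = tp + fn  # all positive samples
    n = tn + fp  # all negative samples
    return {
        "TPR": tp / p,
        "FNR": fn / p,
        "TNR": tn / n,
        "FPR": fp / n,
    }

# Counts reported earlier for the test set.
rates = rates_from_counts(tp=112, fn=20, tn=133, fp=3)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

Because TPR and FNR share the denominator P, and TNR and FPR share the denominator N, each pair necessarily sums to 1.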

### 3. The confusion matrix
While we have manually calculated all the entries of the confusion matrix earlier, we can do that quickly in scikit-learn. 

7. Create a confusion matrix in scikit-learn with this code:

In [17]:
from sklearn.metrics import confusion_matrix
print(f"Confusion Matrix : \n {confusion_matrix(y_test, y_pred)}")

Confusion Matrix : 
 [[133   3]
 [ 20 112]]


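All the information we need to calculate the TPR, FNR, TNR, and FPR is contained in the confusion matrix. In scikit-learn's layout, rows correspond to the true labels and columns to the predicted labels, so for binary labels 0 and 1 the matrix reads [[TN, FP], [FN, TP]]: here TN = 133, FP = 3, FN = 20, and TP = 112.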
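For a binary problem, the four counts can be unpacked directly from the matrix with `ravel()`. The sketch below is self-contained, so it uses synthetic label arrays built to reproduce the counts reported above; they are an assumption for illustration and not the actual test set, where `y_test` and `y_pred` would be used instead.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic labels reproducing the reported counts (not the real test data):
# 136 negatives (133 predicted 0, 3 predicted 1), then
# 132 positives (20 predicted 0, 112 predicted 1).
y_true = np.array([0] * 136 + [1] * 132)
y_hat = np.array([0] * 133 + [1] * 3 + [0] * 20 + [1] * 112)

cm = confusion_matrix(y_true, y_hat)
# Row-major flattening of [[TN, FP], [FN, TP]].
tn, fp, fn, tp = (int(v) for v in cm.ravel())

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"TPR={tp / (tp + fn):.4f}, TNR={tn / (tn + fp):.4f}")
```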

### Additional classification metrics
We also note that there are many more classification metrics that can be derived from the confusion matrix. In fact, some of these are actually synonyms for ones we've already examined here. For example, the TPR is also called recall and sensitivity. 


### 4. Precision
Another metric that is often used for binary classification is precision: this is the proportion of positive predictions that are correct. 

8. Run the following code to calculate the model precision:

In [18]:
Precision = TP/ (TP+FP)
print(Precision)

0.9739130434782609


### 5. Recall
Recall is the proportion of positive samples that the model correctly identifies, that is, TP / (TP + FN).

9. Run the following code to calculate the model recall:

In [19]:
Recall = TP/ (TP+FN)
print(Recall)

0.8484848484848485


As mentioned, this is the same value calculated earlier as the TPR.

### 6. F1 Score
The F1 score is the harmonic mean of precision and recall, so it is high only when both are high.

10. Run the following code to calculate the F1 score:

In [20]:
F1Score = 2*((Precision * Recall)/(Precision+Recall))
print(F1Score)

0.9068825910931174
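The manual precision, recall, and F1 calculations can be cross-checked against scikit-learn's `precision_score`, `recall_score`, and `f1_score`. The sketch below is self-contained and therefore uses synthetic label arrays constructed to match the counts reported earlier (TN = 133, FP = 3, FN = 20, TP = 112); with the actual data, `y_test` and `y_pred` would be passed instead.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Synthetic labels matching the reported counts (an assumption for illustration).
y_true = np.array([0] * 136 + [1] * 132)
y_hat = np.array([0] * 133 + [1] * 3 + [0] * 20 + [1] * 112)

# By default these functions treat label 1 as the positive class.
precision = precision_score(y_true, y_hat)
recall = recall_score(y_true, y_hat)
f1 = f1_score(y_true, y_hat)

# Harmonic mean of precision and recall, as in the manual calculation.
f1_manual = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```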


### Classification report
The previous metrics can also be obtained in a single report with scikit-learn's built-in classification_report function.

11. Import classification_report from sklearn.metrics and call it on y_test and y_pred:

In [21]:
from sklearn.metrics import classification_report
print(f"Classification Report : \n {classification_report(y_test, y_pred)}")

Classification Report : 
               precision    recall  f1-score   support

           0       0.87      0.98      0.92       136
           1       0.97      0.85      0.91       132

    accuracy                           0.91       268
   macro avg       0.92      0.91      0.91       268
weighted avg       0.92      0.91      0.91       268



The report shows the same values calculated earlier, rounded to two decimal places. The row for class 1 contains the precision, recall, and F1 score computed manually above.
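When the unrounded values are needed in code, `classification_report` can return a dictionary instead of a string through its `output_dict=True` argument. The sketch below again relies on synthetic label arrays built from the reported counts, which is an assumption made only so the example runs on its own.

```python
import numpy as np
from sklearn.metrics import classification_report

# Synthetic labels matching the reported counts (not the real test data).
y_true = np.array([0] * 136 + [1] * 132)
y_hat = np.array([0] * 133 + [1] * 3 + [0] * 20 + [1] * 112)

report = classification_report(y_true, y_hat, output_dict=True)

# Per-class entries are keyed by the label converted to a string.
positive_class = report["1"]
print(positive_class["precision"], positive_class["recall"], positive_class["f1-score"])
print(report["accuracy"])
```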