# Model Performance

Get ready to put on your mathematics, data science, and ML engineering hats! In the next few lessons, you are going to learn how to evaluate the performance of ML models. Recall that after you train an ML model on past data, you can use that model to make predictions on new or previously unseen data. But how do you know if that model is useful?  In ML, when you hear the phrase "model performance", I want you to think about evaluating the quality of model predictions, commonly referred to as its **forecast skill** or **prediction skill**. This first lesson will focus on evaluating predictive modeling in the context of supervised learning, including both classification and regression problems.

## Classifier Accuracy and Error

Classification is about predicting a label, typically a discrete value. For example, an image of an animal may be classified as being a picture of a "cat" or "dog". There are many ways to measure the prediction skill of a classification model, but **accuracy** and **error rate** are the de facto standard.

### Accuracy
Accuracy is the ratio of the correct predictions to the total number of predictions made.
* Accuracy = Correct Predictions / Total Predictions

An accuracy of 90% or above is often treated as a good result, and it is common practice to aim for that level, although what counts as good depends on the problem and on how balanced the classes are.

### Error Rate
You can also summarize model performance in terms of the error rate.
* Error Rate = Incorrect Predictions / Total Predictions

Accuracy and error rates are complements of each other and therefore you can calculate one from the other as follows:
* Accuracy = 1 - Error Rate
* Error Rate = 1 - Accuracy

Consider a classifier that labels pictures as either cats or dogs and that, when tested on 12 pictures (8 cats and 4 dogs), produces the following results:
* 9 Correct Predictions   = (9/12) = 0.75 
* 3 Incorrect Predictions = (3/12) = 0.25
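As a quick check on the arithmetic above, the following minimal sketch computes both metrics from the counts in the cat and dog example. The variable names are illustrative only.

```python
# Counts taken from the cat/dog example above.
correct_predictions = 9
incorrect_predictions = 3
total_predictions = correct_predictions + incorrect_predictions

accuracy = correct_predictions / total_predictions
error_rate = incorrect_predictions / total_predictions

print(f"Accuracy:   {accuracy:.2f}")    # 0.75
print(f"Error rate: {error_rate:.2f}")  # 0.25

# The two metrics are complements of each other.
assert abs(accuracy + error_rate - 1.0) < 1e-12
```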

Knowing that the classifier has an accuracy of 0.75, or 75%, does not provide any insight into where the classifier is performing poorly.

Is it more often mistaking cats for dogs, or dogs for cats? Or is it about the same?

This is where a **confusion matrix** may prove useful. 

### Confusion Matrix

<img style="float: right; margin: 15px 15px 15px 15px;" src="img/confusion_matrix.png" width="200">

A confusion matrix allows you to easily visualize classification performance.  

In this confusion matrix, of the 8 cat pictures, the model predicted that 2 were dogs, and of the 4 dog pictures, it predicted that 1 was a cat. All correct predictions are located on the diagonal of the table (highlighted in bold), so it is easy to visually inspect the table for prediction errors, as they are represented by values outside the diagonal. By examining the confusion matrix during development, you can see where the model may be confusing two or more classes.

## Hands-On Classification Metrics: First Look

Scikit-learn is a free ML library for the Python programming language. It has 3 different programming interfaces for evaluating the quality of a model’s predictions:

* Estimator Score Method
* Scoring Parameter
* Metrics Functions
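To make the three interfaces concrete, here is a minimal sketch that measures accuracy through each of them. The synthetic dataset, the choice of `LogisticRegression`, and all variable names are assumptions made purely for illustration and are not part of the cat and dog example.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data, used here only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# 1. Estimator score method: classifiers report mean accuracy by default.
score_from_estimator = model.score(X_test, y_test)

# 2. Scoring parameter: model-evaluation tools accept a metric name.
cv_scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")

# 3. Metrics functions: compare true labels with predictions directly.
score_from_metrics = metrics.accuracy_score(y_test, model.predict(X_test))

print(f"Estimator score:  {score_from_estimator:.3f}")
print(f"Cross-val scores: {cv_scores.round(3)}")
print(f"Metrics function: {score_from_metrics:.3f}")
```

The first and third values should agree, since both measure accuracy on the same held-out data.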

In this interactive demonstration, you'll get experience using the scikit-learn metrics functions to measure the prediction skill of a binary classifier that distinguishes cats and dogs. 

First start by importing the scikit-learn metrics module: 

In [1]:
from sklearn import metrics

Assume that the actual and predicted values from the example are defined as follows, where cats belong to the class 0 and dogs belong to the class 1.

In [2]:
actual_values = [0,0,0,0,0,0,0,0,1,1,1,1]
predictions =   [1,1,0,0,0,0,0,0,1,1,1,0]

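Now you can use the metrics functions to calculate the accuracy and print the confusion matrix. In scikit-learn's output, each row corresponds to an actual class and each column to a predicted class.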

In [9]:
print(f'Accuracy: {metrics.accuracy_score(actual_values, predictions) * 100} % ')

print(f'Confusion Matrix:')

print(metrics.confusion_matrix(actual_values, predictions))

Accuracy: 75.0 % 
Confusion Matrix:
[[6 2]
 [1 3]]


## Classifier Precision, Recall, and F-Measure

As a performance measure, classification accuracy has its limitations. One example where accuracy may be an inadequate performance measure is in the presence of class imbalance. For example, imagine a situation where a dataset of cat and dog images contains a large number of cat examples (majority class) and a small number of dog examples (minority class). On such a dataset, even an unskillful model may achieve high accuracy if the large number of examples from the majority class overwhelms those in the minority class.
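The effect of class imbalance can be sketched with a "model" that ignores its input and always predicts the majority class. The 95/5 split below is a made-up dataset chosen only to illustrate the point.

```python
# Hypothetical imbalanced dataset: 95 cats (class 0) and 5 dogs (class 1).
actual_values = [0] * 95 + [1] * 5

# An unskillful "model" that always predicts the majority class (cat).
predictions = [0] * len(actual_values)

correct = sum(a == p for a, p in zip(actual_values, predictions))
accuracy = correct / len(actual_values)

# Count how many of the dogs the model actually found.
dogs_found = sum(a == 1 and p == 1 for a, p in zip(actual_values, predictions))

print(f"Accuracy:   {accuracy:.2f}")     # 0.95
print(f"Dogs found: {dogs_found} of 5")  # 0 of 5
```

The accuracy looks strong even though the model never identifies a single dog.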

An alternative to using classification accuracy is to use precision and recall metrics. 

However, prior to getting into precision and recall, it is important to dive deeper into the confusion matrix as it provides insight into both the performance of the model and the types of errors being made.

### Confusion Matrix: Reloaded

<img style="float: right; margin: 15px 15px 15px 15px;" src="img/confusion_matrix_reloaded.png" width="200">

The results summary displayed in the confusion matrix consists of true predictions and false predictions.

True Predictions: 
  * TP: True Positives. 
    - Model predicted Yes, and actual value is Yes.
  * TN: True Negatives. 
    - Model predicted No, and actual value is No.
    
False Predictions: 
  * FP: False Positives. 
    - Model predicted Yes, but actual value is No.
  * FN: False Negatives. 
    - Model predicted No, but actual value is Yes.
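The four terms can be tallied directly from the actual and predicted labels. This sketch reuses the cat and dog data from earlier and assumes that dogs (class 1) are treated as the positive ("Yes") class; that choice is an assumption for illustration.

```python
# Cats are class 0 (negative) and dogs are class 1 (positive),
# matching the earlier cat/dog example.
actual_values = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predictions   = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

pairs = list(zip(actual_values, predictions))

tp = sum(a == 1 and p == 1 for a, p in pairs)  # predicted Yes, actual Yes
tn = sum(a == 0 and p == 0 for a, p in pairs)  # predicted No,  actual No
fp = sum(a == 0 and p == 1 for a, p in pairs)  # predicted Yes, actual No
fn = sum(a == 1 and p == 0 for a, p in pairs)  # predicted No,  actual Yes

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=6, FP=2, FN=1
```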


The **precision** and **recall** metrics are defined using the four terms (TP, TN, FP, and FN) in the confusion matrix.

### Precision

Precision quantifies the proportion of positive predictions that are correct. It answers the question: When the model predicts yes, how often is it right? It is calculated as the number of correctly predicted positive examples divided by the total number of positive examples that were predicted.

* Precision = True Positives / (True Positives + False Positives)

### Recall

Recall quantifies the number of correct positive predictions made out of all of the positive predictions that could have been made. It answers the question: What percentage of the actual positives were identified? Therefore, unlike precision, recall provides an indication of missed positive predictions. It is calculated as the number of true positives divided by the total number of true positives and false negatives.

* Recall = True Positives / (True Positives + False Negatives)

A predictive model with high recall and low precision returns many results, but most of its predicted labels are incorrect. On the other hand, a predictive model with high precision and low recall returns very few results, but most of its predicted labels are correct. The ideal predictive model has high precision and high recall, returning many results with most of its results labeled correctly.
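As a worked example of the two formulas, the sketch below plugs in the counts from the cat and dog classifier, assuming that dogs (class 1) are the positive class. The variable names are illustrative.

```python
# Counts for the cat/dog example, treating "dog" (class 1) as the positive class.
true_positives = 3
false_positives = 2
false_negatives = 1

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(f"Precision: {precision:.2f}")  # 0.60
print(f"Recall:    {recall:.2f}")     # 0.75
```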

### F-Measure

Precision and recall can be used to compute the **F-Measure**, a single metric that captures both properties. The traditional F-Measure is calculated as the harmonic mean of precision and recall.

* F-Measure = (2 * Precision * Recall) / (Precision + Recall)

It is sometimes called the **F-Score** or **F1-Score** and is perhaps the most commonly used metric for imbalanced classification problems.
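The following sketch shows the F-Measure formula as a small function. `f_measure` is a hypothetical helper written for this illustration, and returning 0.0 when both inputs are zero is a convention assumed here to avoid dividing by zero.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (hypothetical helper)."""
    if precision + recall == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return (2 * precision * recall) / (precision + recall)

# Values from the cat/dog example.
print(f"{f_measure(0.6, 0.75):.3f}")  # 0.667

# A lopsided model: the harmonic mean stays close to the weaker value,
# whereas a simple average would report 0.55.
print(f"{f_measure(1.0, 0.1):.3f}")   # 0.182
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot earn a high score by doing well on only one of them.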

## Hands-On Classification Metrics: Deep Dive

The following code example demonstrates how precision, recall, and the F1 score can be computed individually using scikit-learn.

In [4]:
#import libraries
from sklearn import metrics

#model prediction results
actual_values = [0,0,0,0,0,0,0,0,1,1,1,1]
predictions =   [1,1,0,0,0,0,0,0,1,1,1,0]

#precision
print(f'Precision Score is: {metrics.precision_score(actual_values, predictions)}')

#recall
print(f'Recall Score is: {metrics.recall_score(actual_values, predictions)}')

#f1 score
print('F1 Score:', metrics.f1_score(actual_values, predictions))

Precision Score is: 0.6
Recall Score is: 0.75
F1 Score: 0.6666666666666665


Alternatively, you can view a full classification report, which includes the accuracy, precision, recall, and F1 score.

In [5]:
#classification report
print(metrics.classification_report(actual_values, predictions))

              precision    recall  f1-score   support

           0       0.86      0.75      0.80         8
           1       0.60      0.75      0.67         4

    accuracy                           0.75        12
   macro avg       0.73      0.75      0.73        12
weighted avg       0.77      0.75      0.76        12



## Exercise 1: Evaluating a Classifier for Spam Detection

For this exercise, you'll put everything you've learned so far about modeling actual and predicted values, and measuring classification performance in scikit-learn to the test.

### Problem Description:
You are tasked with evaluating the performance of a predictive ML model for detecting email spam.

The model is a binary classifier that distinguishes between email messages that are either **_spam_** or **_not spam_**.

### Data:

The following table summarizes the performance data for this problem:

| | Email 1 | Email 2 | Email 3 | Email 4 | Email 5 | Email 6 | Email 7 | Email 8 | Email 9 | Email 10 | Email 11 | Email 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Actual Values** | Spam | Not Spam | Not Spam | Spam | Spam | Not Spam | Spam | Not Spam | Not Spam | Spam | Not Spam | Spam |
| **Predictions** | Spam | Spam | Not Spam | Spam | Spam | Not Spam | Not Spam | Not Spam | Not Spam | Spam | Spam | Spam |

### Instructions:
Use the Python programming language, including the scikit-learn metrics library, to compute the following performance metrics, visualizations, and reports.

* Accuracy, Error Rate, Confusion Matrix
* Precision, Recall, F1 Score
* Tabular Classification Performance Report

### Solution:

In [10]:
# Code your solution here...

## Regression Error, Not Accuracy

After learning about predictive classification models, one of the first questions you may have on your mind is: 

>How do I calculate the accuracy for a regression model? 
>>The answer here is simple: **you cannot**!

Accuracy is a measure of classification, not regression. Questions like the one above are typically a symptom of not understanding the difference between classification and regression, and what accuracy is trying to measure. 

### Regression vs. Classification
Recall that classification is about predicting a discrete class label: _cat_, _dog_, _spam_, _not spam_. In contrast, regression is about predicting a quantity, typically a continuous value such as an amount, a size, or a price. For example, a model may predict that a house will sell for a specific dollar value. When you are predicting a numeric value like this, it rarely makes sense to ask whether the model predicted the value exactly. Instead, you care about how close the prediction was to the expected value.

A way to describe the numerical difference between the predicted and expected values is **distance** or **error**, and the prediction skill of a regression model is reported as an error in those predictions as opposed to accuracy.
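A small sketch can show why counting exact matches says little about a regression model, while the error of each prediction is informative. The house prices below are invented for illustration.

```python
# Hypothetical sale prices in dollars, made up for illustration.
actual_prices    = [250_000, 310_000, 199_000]
predicted_prices = [245_500, 318_000, 199_900]

# Exact matches are rare for continuous targets, so "accuracy" says little.
exact_matches = sum(a == p for a, p in zip(actual_prices, predicted_prices))

# The error (predicted minus actual) shows how far off each prediction is.
errors = [p - a for a, p in zip(actual_prices, predicted_prices)]

print(f"Exact matches: {exact_matches} of {len(actual_prices)}")  # 0 of 3
print(f"Errors: {errors}")  # [-4500, 8000, 900]
```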

### Regression Error Metrics

<img style="float: right; margin: 15px 15px 15px 15px;" src="img/mse.png" width="200">

There are four metrics that are commonly used for evaluating and reporting on the quality of a regression model's predictions:

#### Mean Squared Error (MSE)
* Finds the average squared distance (error) between the predicted and actual values. 
* Tells you how close a regression line is to a set of points by taking the distances from the points to the line and squaring them.
* Squaring removes any (-) negative signs and magnifies large errors. 
* The lower the MSE, the better the prediction skill.
* Formula: 
  - **MSE** = (1 / N) * sum over i = 1..N of (y_i - yhat_i)^2
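The MSE formula can be worked through by hand in a few lines of plain Python. This sketch uses the same four values that appear in the hands-on regression demo later in this lesson; the variable names are illustrative.

```python
actual_values = [9, -3.3, 6, 11]
predictions = [8.5, -2.9, 6, 9.2]

# Square each difference so negative and positive errors cannot cancel out.
squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual_values, predictions)]

mse = sum(squared_errors) / len(squared_errors)
print(f"MSE: {mse:.4f}")  # 0.9125
```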

#### Root Mean Squared Error (RMSE)

<img style="float: right; margin: 15px 15px 15px 15px;" src="img/rmse.png" width="200">

* Variation of the MSE metric that shows the typical **deviation** of predictions from actual values.
* Commonly used under the assumption that errors are unbiased and follow a normal distribution.
* Just like MSE, RMSE is a non-negative value and the lower the RMSE, the better the prediction skill.
* RMSE punishes large errors and, because it is expressed in the same units as the predicted value, is easier to interpret than MSE.
* It is sensitive to outliers, so check for them in the dataset beforehand and decide whether removing them is appropriate.
* Formula: 
  - **RMSE** = sqrt((1 / N) * sum over i = 1..N of (y_i - yhat_i)^2)
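Since RMSE is simply the square root of MSE, it can be sketched by extending the hand calculation by one step. The values are the same four used in the hands-on regression demo later in this lesson.

```python
from math import sqrt

actual_values = [9, -3.3, 6, 11]
predictions = [8.5, -2.9, 6, 9.2]

n = len(actual_values)
mse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual_values, predictions)) / n

# Taking the square root returns the error to the units of the target.
rmse = sqrt(mse)
print(f"RMSE: {rmse:.4f}")  # 0.9552
```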


<img style="float: right; margin: 15px 15px 15px 15px;" src="img/mae.png" width="270">

#### Mean Absolute Error (MAE)

* Computes the average of the absolute error values by forcing the difference between predicted and actual values to be positive.
* Unlike the MSE and RMSE that punish larger errors more than smaller errors, the changes in MAE are linear and therefore more intuitive.
* MAE gives you information on the magnitude of the error, but no idea of its direction, i.e., whether the model is overestimating or underestimating.
* Like the others, an error value of 0.0 would be ideal, meaning that all predictions matched the expected values exactly.
* Formula: 
  - **MAE** = (1 / N) * sum over i = 1..N of abs(y_i - yhat_i)
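To illustrate how MAE grows linearly while RMSE punishes large errors more heavily, the sketch below compares both metrics with and without a single large error. The helper functions and the error values are hypothetical and exist only for this comparison.

```python
from math import sqrt

def mae(errors):
    """Mean absolute error from a list of signed errors (illustrative helper)."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error from a list of signed errors (illustrative helper)."""
    return sqrt(sum(e ** 2 for e in errors) / len(errors))

# Hypothetical signed errors (predicted minus actual).
typical = [1, -1, 1, -1]
with_outlier = [1, -1, 1, -10]

print(f"MAE:  {mae(typical):.2f} -> {mae(with_outlier):.2f}")    # 1.00 -> 3.25
print(f"RMSE: {rmse(typical):.2f} -> {rmse(with_outlier):.2f}")  # 1.00 -> 5.07
```

Both metrics start at 1.0, but the single large error moves RMSE much further than MAE.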



#### R-Squared (R<sup>2</sup>)

* Also referred to as the **coefficient of determination**.
* Provides an indication of the goodness of fit of a set of predictions to the actual values.
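* Typically yields a value between 0 and 1, for no fit and perfect fit respectively; it can be negative when a model fits worse than always predicting the mean of the actual values.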
* Formula: 
  - **R<sup>2</sup>** = 1 - Unexplained Variance / Total Variance
<img style="float: right; margin: 15px 15px 15px 15px;" src="img/rsquared.png" width="500">

>**Calculation**: The actual calculation of R<sup>2</sup> requires several steps, including taking data points (observations) of dependent and independent variables, and finding the line of best fit from a regression model. From there you would calculate predicted values, subtract actual values, and square the results. This yields a list of squared errors, which is then summed and equals the **unexplained variance**.
> To calculate the **total variance**, you would subtract the average actual value from each of the actual values, square the results, and sum them. From there, divide the first sum of errors (unexplained variance) by the second sum (total variance), subtract the result from one, and you now have the R-Squared measure.

>**Meaning**: R<sup>2</sup> gives you an idea of how much of the variation in the actual values is accounted for by the regression equation. The higher the coefficient, the closer the data points sit to the line when both are plotted. If the coefficient is 0.80, then 80% of the variance in the actual values is explained by the model. A value of 1 would indicate that the predictions match the actual values exactly, while a value of 0 would indicate that the model does no better than always predicting the average actual value. A higher coefficient is an indicator of a better goodness of fit for the observations.

>**Usefulness**: The usefulness of R<sup>2</sup> is that it summarizes goodness of fit in a single, unit-free number, which makes it easy to compare models evaluated on the same data. It is not a probability, so it does not tell you how likely a new point is to fall on the line. Even if there is a strong connection between the two variables, determination does not prove causality. For example, a study on birthdays may show that a large number of birthdays happen within a time frame of one or two months. This does not mean that the passage of time or the change of seasons causes pregnancy.
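The R<sup>2</sup> formula can be sketched by hand using the same four values as the hands-on regression demo later in this lesson. The sketch works with sums of squares rather than variances; since both would be divided by the same N, the ratio is unchanged. Variable names are illustrative.

```python
actual_values = [9, -3.3, 6, 11]
predictions = [8.5, -2.9, 6, 9.2]

# Unexplained variation: sum of squared prediction errors.
ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(actual_values, predictions))

# Total variation: sum of squared distances from the mean of the actual values.
mean_actual = sum(actual_values) / len(actual_values)
ss_total = sum((y - mean_actual) ** 2 for y in actual_values)

r_squared = 1 - ss_residual / ss_total
print(f"R^2: {r_squared:.4f}")  # 0.9696
```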

#### Need Help? 
* If these metrics seem complicated, don't worry... 
* The good news is that computing these metrics in Python with open-source libraries is easy.
* All you have to worry about is knowing:
  - The context(s) in which each metric is suitable.
  - Any influencing factors or limitations that may impact your evaluation or threaten its validity.
  - How to invoke the appropriate metrics functions from your code and interpret the results!


## Hands-On Regression Metrics: A Quick Demo

In [12]:
#Import Required Libraries
import numpy as np
from sklearn import metrics

#Load Prediction Results
actual_values = [9, -3.3, 6, 11]
predictions =   [8.5, -2.9, 6, 9.2]

In [17]:
# Calculate Mean Squared Error (MSE)
print(f'MSE:  {metrics.mean_squared_error(actual_values, predictions)}')

# Calculate Root Mean Squared Error (RMSE)
def rmse(actual_values, predictions):
    """Return the square root of the mean squared difference."""
    actual_values = np.asarray(actual_values)
    predictions = np.asarray(predictions)
    return np.sqrt(((predictions - actual_values) ** 2).mean())

print(f'RMSE: {rmse(actual_values, predictions)}')

# Calculate Mean Absolute Error (MAE)
print(f'MAE:  {metrics.mean_absolute_error(actual_values, predictions)}')

# Calculate R-Squared
print(f'R^2:  {metrics.r2_score(actual_values, predictions)}')

MSE:  0.9125000000000005
RMSE: 0.9552486587271403
MAE:  0.6750000000000002
R^2:  0.9696004330897203
