# Performance Metrics

## K-Nearest Neighbors

Let's look at a new model that can solve both regression and classification problems: K-Nearest Neighbors (KNN).

### Principle

<div>
<img src="files/knn_1.webp" width="65%" align='center' source="https://medium.com/swlh/k-nearest-neighbor-ca2593d7a3c4"><br></div>

When we use the model for **regression**, the prediction is the mean of the target values of the $k$ nearest neighbours. Note that $k$ is a hyperparameter: it is not learned from the data, so it's up to the human behind the screen to choose it.
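To make the principle concrete, here is a minimal sketch of KNN regression written by hand with NumPy. The helper name `knn_regress` and the toy data are assumptions made for illustration; in practice you would more likely use sklearn's `KNeighborsRegressor`.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict a value as the mean target of the k nearest training points."""
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Toy data (made up): one feature, one target
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_toy = np.array([10.0, 20.0, 30.0, 100.0, 110.0])

# The 3 nearest neighbours of 2.5 are the points at 2.0, 3.0 and 1.0
knn_pred = knn_regress(X_toy, y_toy, np.array([2.5]), k=3)
print(knn_pred)
```

With a larger `k`, the far-away points at 10 and 11 would start to pull the prediction upwards, which is why the choice of $k$ matters.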

### Prediction

<div>
<img src="files/knn_2.webp" width="65%" align='center' source="https://medium.com/swlh/k-nearest-neighbor-ca2593d7a3c4"><br></div>

### Distance

There are several ways to compute the distance between two points (Manhattan, Hamming, Jaccard, Cosine, etc.), but the default is simply the Euclidean distance.
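As a quick illustration, the sketch below computes the Euclidean and Manhattan distances between two made-up points with NumPy. The points `p` and `q` are arbitrary and only serve the example.

```python
import numpy as np

# Two arbitrary points in a 2D feature space
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Euclidean: straight-line distance (square root of the summed squared differences)
euclidean = np.sqrt(np.sum((p - q) ** 2))

# Manhattan: sum of the absolute differences along each axis
manhattan = np.sum(np.abs(p - q))

print("Euclidean:", euclidean)
print("Manhattan:", manhattan)
```

The two distances can rank neighbours differently, so the chosen distance is effectively another hyperparameter of the model.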

<div>
<img src="files/knn_3.webp" width="45%" align='center' source="https://medium.com/swlh/k-nearest-neighbor-ca2593d7a3c4"><br></div>


### $k$: the number of neighbors

<div>
<img src="files/knn_4.webp" width="45%" align='center' source="https://medium.com/swlh/k-nearest-neighbor-ca2593d7a3c4"><br></div>

## Baseline
To know whether our models are getting better, we need to compare their results to something. This "something" is called a baseline, and it is usually a simple model. Once we've chosen our baseline, we can start trying other models and see if they perform better or worse.

## Metrics
To assess the performance of a model, we need metrics, and there are many different types of metrics in Machine Learning. Each one has pros and cons.

### About the Data

Our **target**:

- **charges**: Individual medical costs billed by health insurance.

Our **features**:

- **age**: Age of primary beneficiary.
- **sex**: Insurance contractor gender: female, male.
- **bmi**: Body mass index, an objective index of body weight relative to height (kg/m²) that shows whether a weight is relatively high or low for a given height; ideally 18.5 to 24.9.
- **children**: Number of children covered by health insurance / Number of dependents.
- **smoker**: Whether the contractor is a smoker or not.
- **region**: The beneficiary's residential area in the US: northeast, southeast, southwest, northwest.


### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Import Data

In [None]:
df = pd.read_csv('data/insurance.csv')
df.head()

In [None]:
from sklearn.preprocessing import LabelEncoder

# Assumes 'smoker' is stored as text ('yes'/'no'): encode it as 1/0 so the models can use it
df['smoker'] = LabelEncoder().fit_transform(df['smoker'])
df.head()

### Holdout Method

In [None]:
from sklearn.model_selection import train_test_split

X = df[['age','bmi','children','smoker']]
y = df['charges']

# Before doing anything else, let's make sure we don't leak information,
# so we create a train/test split right away, before altering the data

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, 
                                                    random_state=42) # Holdout

### Dummy Regressor

A dummy regressor is a very simple model. This one predicts the y using the mean of our target.

[sklearn doc](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html)


In [None]:
from sklearn.dummy import DummyRegressor
# from sklearn.linear_model import LinearRegression

baseline_model = DummyRegressor(strategy="mean") # Baseline
baseline_model.fit(X_train, y_train) # Learns the mean of y_train
baseline_model.score(X_test, y_test) # R²

The score is negative, so the baseline does worse than predicting the mean of the test target would. Why? Because we computed the mean on our train dataset and applied it to our test dataset, where the mean is not exactly the same. This is the score we want to beat.
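The effect can be reproduced by hand. The sketch below uses made-up targets (not the insurance data) and the usual definition $R^2 = 1 - SS_{res}/SS_{tot}$ to show that predicting the *train* mean on a test set with a different mean gives a negative score.

```python
import numpy as np

# Made-up targets for illustration only
y_train_toy = np.array([10.0, 12.0, 14.0])  # mean = 12
y_test_toy = np.array([11.0, 15.0, 16.0])   # mean = 14

# A mean baseline always predicts the mean seen during training
baseline_pred = np.full_like(y_test_toy, y_train_toy.mean())

# R² compares our squared errors to those of the test mean
ss_res = np.sum((y_test_toy - baseline_pred) ** 2)
ss_tot = np.sum((y_test_toy - y_test_toy.mean()) ** 2)
r2_baseline = 1 - ss_res / ss_tot

print(r2_baseline)  # negative: the train mean is a worse guess than the test mean
```

The further the train mean is from the test mean, the more negative this score becomes.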

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression().fit(X_train, y_train)
lr_model.score(X_test, y_test)

## Regression Error Metrics

The goal of a regression model is to minimize the error. There are different ways to compute what we call "error".

### Mean Squared Error (MSE)

$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

The mean of the squared differences between the ground truth and the predicted values. It is not expressed in the same unit as the target, but in squared units.

**Use MSE when**:

- Larger errors have a disproportionately bigger impact, and hence should be penalized more
- Example: clinical trials, where an error of 4mg can be more than twice as bad as an error of 2mg
- The direction and unit of the error do not matter
- Comparing the sensitivity of different models/methods to large errors

### Root Mean Squared Error (RMSE)

$RMSE = \sqrt{MSE}$

Square root of the Mean Squared Error.

**Use RMSE when**:

- You want the MSE to be represented in the unit of the target, making it more interpretable.

### Mean Absolute Error (MAE)

$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$

The mean of the absolute differences between true values and predicted values.

**Use MAE when**:
- Errors can be penalized proportionally to their size

### Cheat sheet

| Metric    | Advantages                                                        | Disadvantages            | Example                                                                |
|-----------|-------------------------------------------------------------------|--------------------------|------------------------------------------------------------------------|
| MSE/RMSE  | Highlights large errors, smooth, optimizable                      | Sensitive to outliers    | Prefers 5 errors of 1°C over a single error of 5°C                     |
| MAE       | Homogeneous, interpretable                                        | Less sensitive to outliers | 5 errors of 1°C are equivalent to a single error of 5°C              |
| Max Error | Useful when limiting the magnitude of any single error is the priority | Case-specific usage | A piece of equipment overheats if the temperature is off by more than 2°C |
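The temperature examples in the table can be checked numerically. This is a small sketch with made-up predictions and hand-written NumPy helpers (`mae_np`, `rmse_np` and `max_err_np` are names chosen for this example, not sklearn functions).

```python
import numpy as np

y_true_temp = np.zeros(5)                          # true values (°C), made up
many_small = np.array([1.0, 1.0, 1.0, 1.0, 1.0])   # five errors of 1°C
one_large = np.array([5.0, 0.0, 0.0, 0.0, 0.0])    # a single error of 5°C

def mae_np(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def rmse_np(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def max_err_np(y, y_hat):
    return np.max(np.abs(y - y_hat))

for name, pred in [("five errors of 1°C", many_small),
                   ("one error of 5°C", one_large)]:
    print(name,
          "| MAE:", mae_np(y_true_temp, pred),
          "| RMSE:", round(rmse_np(y_true_temp, pred), 3),
          "| Max Error:", max_err_np(y_true_temp, pred))
```

The MAE is identical in both cases, while the RMSE and the max error are clearly higher for the single large error.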

### Coefficient of determination (R²) and error metrics

#### A quick reminder:

  - $SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
  - $SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
  - $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$

#### Should we also use the R²?

The R² has two notable characteristics:

- It facilitates the comparison between different models. Stating that a model has an MSE of 25 does not tell us whether the model is good, because it depends on the scale of the variable to be predicted.
    On the other hand, the normalization done in the R² allows us to say, as a rough rule of thumb, that a model with an R² below 20% is not performing well and, conversely, that a model with an R² above 80% is performing well.

- However, it is not very interpretable and does not provide information on the average error of the model. Indeed, while the R² allows us to compare the model's performance with a basic performance, it does not allow us to determine the average error made in predictions. It often needs to be combined with other metrics to better understand the model's performance, such as MSE or MAE.
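A short sketch of both points, using made-up numbers: rescaling the target changes the MSE dramatically but leaves the R² untouched, and the R² on its own says nothing about the size of the errors. The helpers `mse_np` and `r2_np` are written by hand for this example.

```python
import numpy as np

def mse_np(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def r2_np(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Made-up target and predictions
y_small = np.array([3.0, 5.0, 7.0, 9.0])
pred_small = np.array([2.5, 5.5, 6.5, 9.5])

# Same data expressed in a unit 1000 times smaller
y_big, pred_big = 1000 * y_small, 1000 * pred_small

print("MSE:", mse_np(y_small, pred_small), "vs", mse_np(y_big, pred_big))
print("R² :", r2_np(y_small, pred_small), "vs", r2_np(y_big, pred_big))
```

This is why the R² is convenient for comparisons, and why an error metric in the target's unit is still needed to know how far off the predictions are.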

In [None]:
from sklearn.metrics import mean_squared_error, root_mean_squared_error, mean_absolute_error, max_error, r2_score

# Fit and Predict
lr_model = LinearRegression().fit(X, y)
y_pred = lr_model.predict(X) # Predict on ALL our data, because we want to compare different models

# MSE
mse = mean_squared_error(y, y_pred)

# RMSE
rmse = root_mean_squared_error(y, y_pred)

# MAE
mae = mean_absolute_error(y, y_pred)

# Max Error (stored as max_err so we don't overwrite the imported max_error function)
max_err = max_error(y, y_pred)

# R²
rsquared = r2_score(y, y_pred)

print('MSE =', round(mse, 2))
print('RMSE =', round(rmse, 2))
print('MAE =', round(mae, 2))
print('Max Error =', round(max_err, 2))
print('R² =', round(rsquared, 2))

### How well did we perform?

Our model is trying to predict the medical costs billed by health insurance for each individual. This could be useful for an insurance company that would want to adjust the cost of the health insurance.

The RMSE is about 6000 US dollars, but the MAE, less sensitive to outliers, is about 1/3 less at around 4200 US dollars.

### One or several metrics?

Should we compute all of the metrics above each time we fit a new model, or should we just pick one and try to improve it? It depends on what we're trying to achieve. But, at the start at least, it's a good practice to take a look at several metrics to make sure everything is ok.

### Metrics and Cross Validation

When we cross-validate a model, we pass an argument to the ```scoring``` parameter to specify the metric. If it is left empty, the estimator's default score is used (R² for regressors). [You can find a list of the metrics here.](https://scikit-learn.org/stable/modules/model_evaluation.html)

Many metrics have a "neg_" prefix in their names. Because they are treated as scores, sklearn assumes that higher is better. But, as we've seen previously, all our error metrics should be as low as possible, so sklearn uses their negative value.
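Here is a minimal sketch of what this looks like in practice, on synthetic data generated just for the example (the names `X_demo` and `y_demo` and the coefficients are arbitrary). The scores come back negative, and flipping the sign gives the usual error values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, made up for this illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

# sklearn returns the NEGATIVE MAE so that "higher is better" still holds
neg_mae = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=5, scoring="neg_mean_absolute_error")

# Flip the sign to read the result as a regular MAE
mae_per_fold = -neg_mae

print(neg_mae.round(3))
print(mae_per_fold.round(3))
```

Keeping the sign convention in mind matters when comparing models: the best model has the neg_ score closest to zero.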

In [None]:
from sklearn.model_selection import cross_validate

model = LinearRegression()
cv_results = cross_validate(model, X, y, cv=5, 
                            scoring=['neg_mean_squared_error',
                                     'neg_root_mean_squared_error',
                                     'neg_mean_absolute_error',
                                     'max_error',
                                     'r2',])

pd.DataFrame(cv_results).round(2)

## Classification Metrics

### Correct and Incorrect Predictions

In classification, the model doesn't output a number but a class. Let's say we only have two classes to predict: **True** or **False**.

Therefore if we make a **correct** prediction we can have:

- **True Positive**: A positive sample which has been **correctly** predicted as positive. The value is **True** in both ```y``` and ```y_pred```.
- **True Negative**: A negative sample which has been **correctly** predicted as negative. The value is **False** in both ```y``` and ```y_pred```.

But if we make an **incorrect** prediction we can have:

- **False Positive**: A negative sample which has been **incorrectly** predicted as positive. The value is **False** in ```y``` but **True** in ```y_pred```.
- **False Negative**: A positive sample which has been **incorrectly** predicted as negative. The value is **True** in ```y``` but **False** in ```y_pred```.

In a multiclass problem, these counts are computed for each class in turn, counting every other class as negative.
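The four cases can be counted directly with boolean masks. The sketch below uses small toy label vectors (the same values as in this notebook's confusion-matrix cells); the variable names are chosen for the example.

```python
import numpy as np

y_true_demo = np.array([0, 1, 0, 0, 1, 0, 1, 1, 0, 1])  # actual truths
y_pred_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # predictions

# Each case is a combination of the actual value and the predicted value
tp = int(np.sum((y_true_demo == 1) & (y_pred_demo == 1)))  # True Positives
tn = int(np.sum((y_true_demo == 0) & (y_pred_demo == 0)))  # True Negatives
fp = int(np.sum((y_true_demo == 0) & (y_pred_demo == 1)))  # False Positives
fn = int(np.sum((y_true_demo == 1) & (y_pred_demo == 0)))  # False Negatives

print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```

The four counts always add up to the total number of samples.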

### Confusion Matrix
<div>
<img src="files/confusion_matrix.jpg" width="40%" align='center' source="https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5" /> </div>


### Confusion Matrix using Pandas

In [None]:
y_test = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1] # actual truths
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1] # predictions

results_df = pd.DataFrame({"actual": y_test,
                           "predicted": y_pred}) #Store results in a dataframe
results_df

In [None]:
pd.crosstab(index=results_df['actual'],
            columns=results_df['predicted']).reindex(index=[1, 0], columns=[1, 0])

### Confusion Matrix using Sklearn

In [None]:
from sklearn.metrics import confusion_matrix

y_test = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1] # actual truths
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1] # predictions

# Stored as cm so we don't overwrite the imported confusion_matrix function
cm = confusion_matrix(y_test, y_pred)[::-1, ::-1] # Reverse rows and columns so class 1 comes first
cm

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[1, 0]).plot();

### Accuracy

The number of correct predictions divided by the total number of predictions.

$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$ or $Accuracy = \frac{Number\ of\ Correct\ Predictions}{Total\ Number\ of\ Predictions}$

In [None]:
# Accuracy Score for our y_test and y_pred
# [4, 1] [TP, FN]
# [2, 3] [FP, TN]

(4 + 3) / (4 + 3 + 2 + 1)

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

### Accuracy limits

In [None]:
# Rows are actual classes, columns are predicted classes: [[TP, FN], [FP, TN]]
a = np.array([[1, 9],
              [0, 90]])

ConfusionMatrixDisplay(confusion_matrix=a, display_labels=[1, 0]).plot();

In [None]:
# Accuracy Score
(1+90) / (1+0+9+90)

Our model seems to predict very well! But actually our data is very imbalanced, and the model failed to detect most of the positive class.

So accuracy is good when:

- The target classes are balanced
- Predicting each class is equally important.
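To see how misleading accuracy can be, here is a sketch with a made-up imbalanced target (10 positives, 90 negatives) and a "model" that always predicts the negative class. It also computes the share of actual positives that were found, which is the recall.

```python
import numpy as np

# Made-up imbalanced target: 10 positives, 90 negatives
y_imbalanced = np.array([1] * 10 + [0] * 90)

# A useless "model" that always predicts the majority class
always_negative = np.zeros_like(y_imbalanced)

# Accuracy: share of correct predictions
accuracy_imb = np.mean(y_imbalanced == always_negative)

# Share of actual positives that were found
tp_imb = np.sum((y_imbalanced == 1) & (always_negative == 1))
fn_imb = np.sum((y_imbalanced == 1) & (always_negative == 0))
recall_imb = tp_imb / (tp_imb + fn_imb)

print("Accuracy:", accuracy_imb)
print("Positives found:", recall_imb)
```

An accuracy of 90% here only reflects the class imbalance, since not a single positive sample was detected.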

### Recall

The recall, also known as sensitivity or true positive rate (TPR), measures the proportion of actual positive cases that were correctly identified by the classifier.

$Recall = \frac{TP}{TP + FN}$ or $Recall = \frac{True\ Positives}{True\ Positives + False\ Negatives}$

In [None]:
# Recall Score for our y_test and y_pred
# [4, 1] [TP, FN]
# [2, 3] [FP, TN]

4 / (4+1)

In [None]:
from sklearn.metrics import recall_score

recall_score(y_test, y_pred)

#### Recall Advantages

In [None]:
# Previous example

# Rows are actual classes, columns are predicted classes: [[TP, FN], [FP, TN]]
a = np.array([[1, 9],
              [0, 90]])

ConfusionMatrixDisplay(confusion_matrix=a, display_labels=[1, 0]).plot();

In [None]:
# Recall for a

1 / (9+1)

Now, thanks to the recall metric, we can state that this model is very bad at identifying the positive class.

#### Recall Limits

In [None]:
# Rows are actual classes, columns are predicted classes: [[TP, FN], [FP, TN]]
b = np.array([[27, 3],
              [100, 10]])

ConfusionMatrixDisplay(confusion_matrix=b, display_labels=[1, 0]).plot();

In [None]:
# Recall for b

27 / (27 + 3)

To improve the recall score we need to lower the number of false negatives. No matter how many false positives or true negatives you have, the recall score doesn't change.

With the result for the array b, the recall is high, so the model did a good job at finding the positive samples. But we also got a lot of false positives (false alarms).

Recall is good when:

- It is important to identify as many occurrences of a class as possible, reducing false negatives but potentially increasing false positives
- You don't want to miss any positive samples (e.g. detecting fraudulent transactions, cases of a novel disease or potential sales leads).

### Precision

Measures the ability of a model to avoid false alarms for a class.

$Precision = \frac{TP}{TP + FP}$ or $Precision = \frac{True\ Positives}{True\ Positives + False\ Positives}$

We look at the column of samples predicted as positive and compare the number of true positives with the number of false positives.

Precision is good when:

- It is important to be correct when identifying a class, i.e. we don't want false positives. E.g. targeted advertising: we want to make sure that the customer is someone who potentially needs our product.
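As an illustration, take the counts from the recall-limits example, read here as 27 true positives, 3 false negatives and 100 false positives (that reading is an assumption based on the recall computation shown there). Recall looks great, but precision reveals the false alarms.

```python
# Counts assumed from the recall-limits example
tp_b, fn_b, fp_b = 27, 3, 100

# Recall: share of actual positives that were found
recall_b = tp_b / (tp_b + fn_b)

# Precision: share of predicted positives that were really positive
precision_b = tp_b / (tp_b + fp_b)

print("Recall:   ", round(recall_b, 3))
print("Precision:", round(precision_b, 3))
```

Only about one positive prediction in five is correct, which recall alone could not tell us.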

In [None]:
# Precision Score for our y_test and y_pred
# [4, 1] [TP, FN]
# [2, 3] [FP, TN]

4 / (4+2)

In [None]:
from sklearn.metrics import precision_score

precision_score(y_test, y_pred)

### F1 Score

The F1 score is the harmonic mean of precision and recall. It provides a balance between precision and recall.

$F1\ Score = 2 \times \frac{Precision \times Recall}{Precision + Recall}$

It is a special case of the F-beta score. With beta = 1, precision and recall are weighted the same, but we could choose another beta (F2 gives more weight to recall, F0.5 gives more weight to precision).
The F1 score is influenced more by the lower of the two values. Precision and recall are usually a trade-off: if you improve precision, recall tends to go down.

F1 Score is good when:

- You want a general metric to compare across models and datasets.
- You want to combine the Precision/Recall tradeoff in a single metric.
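The general F-beta formula is $F_\beta = (1 + \beta^2) \times \frac{Precision \times Recall}{\beta^2 \times Precision + Recall}$. Below is a small hand-written sketch of it (the helper `f_beta` is defined for this example; sklearn offers `fbeta_score` for real use), applied to a precision of 4/6 and a recall of 4/5.

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision_demo = 4 / (4 + 2)
recall_demo = 4 / (4 + 1)

f1_demo = f_beta(precision_demo, recall_demo, beta=1)     # balanced
f2_demo = f_beta(precision_demo, recall_demo, beta=2)     # recall counts more
f05_demo = f_beta(precision_demo, recall_demo, beta=0.5)  # precision counts more

print("F0.5:", round(f05_demo, 3))
print("F1:  ", round(f1_demo, 3))
print("F2:  ", round(f2_demo, 3))
```

Because recall is higher than precision in this example, the score goes up as beta gives more weight to recall.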


In [None]:
# F1 Score for our y_test and y_pred
# [4, 1] [TP, FN]
# [2, 3] [FP, TN]

recall = 4 / (4+1)
precision = 4 / (4+2)

2 * (precision * recall) / (precision + recall)

In [None]:
from sklearn.metrics import f1_score

f1_score(y_test, y_pred)

### Another way to visualize it

<div>
<img src="files/precision_recall_wiki.png" alt="precision_recall_accuracy" width="40%" align='center' source="wikipedia" /> </div>

<div>
<img src="files/precision_recall_accuracy.png" alt="precision_recall_accuracy" width="70%" align='center' source="https://medium.com/@shrutisaxena0617/precision-vs-recall-386cf9f89488" /> </div>

## Which metric?

We're building a model to check the safety of seat belts. Which metric should we optimize for?
- Seat belt safe = 1
- Seat belt faulty = 0

So we're trying to minimize the number of false positives (faulty seat belts predicted as safe), because a single one might kill someone.

Which metric should we look for?
(Answer inside the hidden cell below)

In [None]:
# Answer: Precision.