- Evaluation and metrics
- train test split
- metrics for classification
- metrics for regression

## Evaluation ✅

___

![](https://drive.google.com/uc?export=view&id=1bjrsYhVwc2HYB1FFJKUK6AoyQ27ujJRU)

___

# I. Training Set & Test Set

It is essential to separate the data that you use for **training** your model from the data that you use for **testing** it.

Otherwise it is too easy! 🙈

It would be like a student practicing on sample questions for an exam, and then being tested on the exact same questions.

You need to be tested on **new questions** in order to make sure you **learnt correctly**.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1x4j8bjbFbkSUKKsRPcPAGeJs8JF1OI9g" width="350">
</p>

For a given labeled dataset, we often split it into a **training set** and a **test set** with an **80%** - **20%** ratio, but you can choose a different ratio.

> 🔦 **Hint**: This convention keeps enough data to learn from, while setting aside a small subset to estimate the performance.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1I0Dn2d0RUM1BrOGG40VKwJHfKE-Lh6VE" width="500">
</p>

To recap, we have:

-  a **Training set**: a set of observations used for **learning**. It is used to **fit the parameters (or weights) of the model**.
   
    Commonly denoted `X_train` (for the features) and `y_train` (for the corresponding targets).


- a **Test set**: a set of observations (unseen during training) used only to **assess the performance** of the model.

    Commonly denoted `X_test` (for the features) and `y_test` (for the corresponding targets).

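```python
# Sklearn provides a function to split your data into a train set and a test set
from sklearn.model_selection import train_test_split

# X (features) and y (targets) are assumed to be already loaded.
# test_size=0.2 keeps 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```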
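
To see the split on concrete data, here is a minimal sketch. The dataset is a hypothetical random one (100 observations, 3 features), created only so that the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset: 100 observations, 3 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Hold out 20% of the observations for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```

With 100 observations and `test_size=0.2`, 80 rows go to the training set and 20 to the test set.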

---

# II. Evaluation for classification

## II.1. Accuracy

Let's consider a classification model that predicts if a mushroom is eatable or not (0 = Not Eatable, 1 = Eatable).

As seen before, we can evaluate our classification model by computing the **accuracy**: the number of correct predictions divided by the total number of predictions.

In other words, it's the **percentage of correct predictions** of the model (on the test set).

Accuracy is a **global** metric. It is always worth computing, but on its own it can be insufficient to evaluate your model properly, because it does not tell you which kinds of errors the model makes.
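
As a small illustration, accuracy can be computed by hand or with `accuracy_score` from scikit-learn. The labels below are hypothetical (10 made-up mushrooms), chosen only to keep the example short.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for 10 mushrooms (0 = Not Eatable, 1 = Eatable)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])

# Accuracy by hand: share of predictions equal to the true label
manual_accuracy = np.mean(y_true == y_pred)

# Same value with scikit-learn
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(f"{manual_accuracy:.2f} {sklearn_accuracy:.2f}")  # 0.80 0.80
```

Two of the ten predictions are wrong, so the accuracy is 8/10. Note that the score alone does not say whether the mistakes are eatable mushrooms rejected or non-eatable mushrooms accepted.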

## II.2. Confusion matrix

Now, let's consider that we obtain these results by testing the model on 200 mushrooms:

| 🍄  | **Is eatable** 😋 | **Is not eatable** ☠️ | 
|------|------|------|
| **Predicted eatable** 👍 | 100 | 6 |
| **Predicted not eatable** 👎 | 0 | 94 |

We call:
* True Positive (TP): the number of predicted positive that are indeed positive - here 100
* True Negative (TN): the number of predicted negative that are indeed negative - here 94
* False Positive (FP): the number of predicted positive that are in reality negative - here 6
* False Negative (FN): the number of predicted negative that are in reality positive - here 0
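
The four counts can be recovered with scikit-learn's `confusion_matrix`. The sketch below rebuilds label arrays that match the mushroom table (the arrays themselves are a reconstruction for illustration). Be aware that scikit-learn puts the true classes in rows and the predicted classes in columns, which is the transpose of the table above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Rebuild labels matching the table: 1 = Eatable, 0 = Not Eatable
# 100 eatable predicted eatable, 6 not eatable predicted eatable,
# 94 not eatable predicted not eatable, 0 eatable predicted not eatable
y_true = np.array([1] * 100 + [0] * 6 + [0] * 94)
y_pred = np.array([1] * 100 + [1] * 6 + [0] * 94)

# Rows = true class, columns = predicted class, classes sorted as [0, 1]
cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()
print(tp, tn, fp, fn)  # 100 94 6 0
```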

Accuracy as we know it can be defined as:
$$Accuracy = \frac{TP + TN}{TP + TN + FN + FP}$$

In our case, $Accuracy = \frac{100 + 94}{100 + 94 + 0 + 6} = 0.97$ 

Our model has an accuracy of 97%. Not that bad right?.. 

But wait.

What if we predict to someone that a mushroom is eatable when it's not?! ☠️

This can cause severe harm. That's why, instead of computing only the global accuracy, we also compute other scores to get a better understanding of the performance of our model.

In addition, we can compute new scores per class: 


<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1LM2d2k6lPt8g3skTW3MTOErXwORhndYB" width="500">
</p>

- **Precision**: How many selected items are relevant?

Here, how many of the mushrooms classified as eatable are indeed eatable? It's 100/106 ≈ 0.94 (94%).

If the goal is to eat the mushrooms afterwards... we might want to improve our model, because about 6% of the mushrooms classified as eatable are in fact not eatable.

- **Recall**: How many relevant items are selected?

Here, how many of the truly eatable mushrooms are classified as such? It's 100/100 = 1 (100%).

It's the ability of our model to classify all eatable mushrooms as such.

More precisely, we can write:

$${Precision = \frac{TP}{TP + FP}}$$

$${Recall = \frac{TP}{P} = \frac{TP}{TP + FN}}$$

Finally, we can define the **F1 score**, the harmonic mean of Precision and Recall, which combines both metrics into a single number:

$${F1\_score = \frac{2 * (Precision * Recall)}{Precision + Recall}}$$

Here, ${F1\_score = \frac{2 * (0.94 * 1)}{0.94 + 1} \approx 0.97}$

**F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0**
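
These three scores are available in scikit-learn as `precision_score`, `recall_score` and `f1_score`. The sketch below applies them to label arrays rebuilt from the mushroom table (a reconstruction for illustration), with `eatable` (1) as the positive class.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Labels rebuilt from the mushroom table: TP = 100, FP = 6, TN = 94, FN = 0
y_true = np.array([1] * 100 + [0] * 6 + [0] * 94)
y_pred = np.array([1] * 100 + [1] * 6 + [0] * 94)

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.943 recall=1.000 f1=0.971
```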

Note that we can define Precision and Recall per class (the same way we just defined them for the class `eatable`).

In particular, the Recall for the negative class (here `non-eatable`) is called **Specificity**:

$$Specificity = \frac{TN}{N} = \frac{TN}{TN + FP}$$
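
Specificity has no dedicated function in scikit-learn, as far as this sketch assumes, but it can be computed from the confusion-matrix counts. It also coincides with `recall_score` when the negative class is passed as `pos_label`. The label arrays are rebuilt from the mushroom table for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Labels rebuilt from the mushroom table: TP = 100, FP = 6, TN = 94, FN = 0
y_true = np.array([1] * 100 + [0] * 6 + [0] * 94)
y_pred = np.array([1] * 100 + [1] * 6 + [0] * 94)

# Specificity from the counts: TN / (TN + FP)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)

# Same value: recall computed with the negative class (0) as the class of interest
specificity_via_recall = recall_score(y_true, y_pred, pos_label=0)

print(f"{specificity:.2f} {specificity_via_recall:.2f}")  # 0.94 0.94
```

Out of the 100 mushrooms that are truly not eatable, 94 are correctly rejected.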

> 📚 **Resources**: More information about Precision and Recall: https://en.wikipedia.org/wiki/Precision_and_recall

## II.3. ROC curve

Most of the time, a classifier returns a probability for each class (the probability that the input belongs to that class), and we take the class with the highest probability as the prediction.

> 🔦 **Hint**: In scikit-learn, instead of predicting the class label (with `model.predict(X)`), we can do:
>
> `y_pred_proba = model.predict_proba(X)`

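For a binary classifier, taking the highest probability amounts to predicting the positive class when its probability exceeds a **threshold** of 0.5. We can move this threshold, and for each value of the threshold, we can compute the **Recall** and the **Specificity**. 

Then we can plot the ROC (Receiver Operating Characteristic) curve, with Recall (the True Positive Rate) on the y-axis and 1 - Specificity (the False Positive Rate) on the x-axis: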

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1II0OImj0Yx6fyDnSNSUH47mxdIAHo7Hq" width="400">
</p>

> 📚 **Books**: Cool visualization of the ROC curve and impact of data distribution/threshold: http://www.navan.name/roc/

**The Area Under the ROC Curve (AUC ROC) indicates how well the predicted probabilities separate the positive class from the negative class.** An AUC of 0.5 corresponds to random guessing, and an AUC of 1 to a perfect separation.
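
As a rough sketch, the ROC curve and its AUC can be obtained with `roc_curve` and `roc_auc_score` from scikit-learn. The dataset (`make_classification`) and the model (`LogisticRegression`) are assumptions picked only to make the example self-contained; any classifier exposing `predict_proba` would do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Hypothetical synthetic binary classification dataset
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is P(class = 1)
y_score = model.predict_proba(X_test)[:, 1]

# One (FPR, TPR) point per threshold, i.e. (1 - Specificity, Recall)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

print(f"AUC ROC = {auc:.3f}")
```

The arrays `fpr` and `tpr` can then be passed to a plotting library (for instance `plt.plot(fpr, tpr)` with matplotlib) to draw the curve.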

> 🔦 **Hint**: To understand the trade-off behind the choice of threshold, consider airport security. Since passengers can be potential threats to safety, scanners may be set to trigger alarms on low-risk items like belt buckles and keys (**low specificity**) in order to increase the probability of identifying dangerous objects and minimize the risk of missing objects that do pose a threat (**high sensitivity**, which is another name for recall). 


As an example, let's build a ROC curve. We continue with our airport security example. Suppose our classifier returned the following scores for 6 observations ("+" marks a true threat, and a high score, close to 1, means a confident prediction that the observation is a threat):

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1byt8YC6h9klPF2s0lUYZ4rDyo3gJ05A1" width="400">
</p>

Then we can compute the TPR (= TP/P) and FPR (= FP/N) for every threshold range: 

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1hJxvUQKKS5RZzv6FvbKKH7Ua-nsawyDa" width="500">
</p>

Which gives us the following ROC curve:
<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1dXCxWe5jiEHz_bsf5wXziK8mqJjxEzXX" width="300">
</p>
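
The same construction can be sketched by hand in a few lines. The scores and labels below are hypothetical (they are not the values shown in the figures); the point is only to show how each threshold produces one (FPR, TPR) point.

```python
import numpy as np

# Hypothetical scores and labels (1 = true threat), NOT the values from the figures
y_true = np.array([1, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.55, 0.4, 0.2])

P = np.sum(y_true == 1)  # number of real positives
N = np.sum(y_true == 0)  # number of real negatives

roc_points = []
for threshold in [1.0, 0.85, 0.75, 0.6, 0.5, 0.3, 0.0]:
    # Predict "threat" when the score reaches the threshold
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    roc_points.append((float(fp / N), float(tp / P)))  # (FPR, TPR)

# Area under the curve with the trapezoidal rule
fpr = np.array([p[0] for p in roc_points])
tpr = np.array([p[1] for p in roc_points])
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

print(f"AUC = {auc:.3f}")  # AUC = 0.889
```

Lowering the threshold moves along the curve from (0, 0), where nothing is flagged, to (1, 1), where everything is flagged.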

---

# III. Evaluation for regression

## III.1. Residual Sum of Squares 

Remember that we fitted a regression model by minimizing the distance between the model predictions and the data points.

<img src="https://drive.google.com/uc?export=view&id=1UmfQle9OImxlTgu3AEUsk9Yvo8RYaVsR" width="100%">


If we sum the squares of all these distances (squaring makes every term positive, so that errors do not cancel each other out), we obtain the **Residual Sum of Squares**:

$${RSS=\sum _{i=1}^{n}(y_{i}-f(x_{i}))^{2}}$$
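
A minimal sketch of this computation with NumPy, on hypothetical values (four made-up targets and predictions):

```python
import numpy as np

# Hypothetical true values y_i and model predictions f(x_i)
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

# Residuals: signed distances between the data points and the predictions
residuals = y_true - y_pred

# Residual Sum of Squares
rss = np.sum(residuals ** 2)
print(rss)  # 1.75
```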

## III.2. (Root) Mean Squared Error

In order to have a "magnitude of the error" for one single data point, we can:

- Divide the RSS by the number of points (get the mean). This gives us the **Mean Squared Error** (MSE):

$${MSE=\frac{1}{n}\sum _{i=1}^{n}(y_{i}-f(x_{i}))^{2}}$$

- Take the square root of the MSE: since we squared the residuals, the square root brings the error back to the same unit as the data points. This gives us the **Root Mean Squared Error** (RMSE):

$${RMSE=\sqrt{\frac{1}{n}\sum _{i=1}^{n}(y_{i}-f(x_{i}))^{2}}}$$

For example, the RMSE of a home price regression model is expressed in € and can be read as the typical size of the model's error (large errors weigh more than in a plain average of the absolute errors).
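
As a sketch, the MSE is available in scikit-learn as `mean_squared_error`, and the RMSE is simply its square root. The values are hypothetical (four made-up targets and predictions).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and model predictions
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

# MSE by hand: RSS divided by the number of points
mse_manual = np.sum((y_true - y_pred) ** 2) / len(y_true)

# Same value with scikit-learn, then the square root for the RMSE
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

print(f"MSE={mse:.4f} RMSE={rmse:.4f}")  # MSE=0.4375 RMSE=0.6614
```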

## III.3. R-squared

Finally, we can compute a new value very useful in regression: $R^2$

First, we define the **Relative Squared Error**: 

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1TWGTV-uYaoZoLs2-kmFVQqxmjjmDxvfY" width="30%">
</p>

Instead of dividing the RSS by the number of points in the dataset (which gives the MSE), we divide the RSS by a "reasonable" error: the sum of squared distances between the data points and their mean, that is $RSE = \frac{RSS}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}$.

$R^2$ is then defined as $R^2 = 1 - RSE$ and it gives an indication of **how strongly the true values and the predicted values are correlated**.

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1eAiEwTMjfzobHnW9RUB0YRIPGFbzGy7e" width="70%">
</p>

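Its value is at most 1, and usually between 0 and 1:
- $R^2$ is close to 0 when the model does no better than always predicting the mean of the targets (true and predicted values are not correlated at all)
- $R^2$ is close to 1 when the predictions are very close to the true values (highly correlated). This means that knowing one helps a lot in knowing the other. 

$R^2$ can even be negative when the model does worse than always predicting the mean.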
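
A short sketch of $R^2$, computed by hand as 1 minus the RSS divided by the sum of squared distances to the mean, and with `r2_score` from scikit-learn. The values are hypothetical (four made-up targets and predictions).

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and model predictions
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

# RSE = RSS / sum of squared distances between the data points and their mean
rss = np.sum((y_true - y_pred) ** 2)
tss = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - rss / tss

# Same value with scikit-learn
r2 = r2_score(y_true, y_pred)

# A "model" that always predicts the mean of the targets gets R^2 = 0
r2_mean_model = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(f"R2={r2:.3f} R2 (mean model)={r2_mean_model:.3f}")
# R2=0.937 R2 (mean model)=0.000
```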

> ⚠️ **Warning**: The previous charts are not regression charts (relation between output and input). They show the correlation between the true values and the predicted values.

Here is another way of seeing it:

<p align="center">
<img src="https://drive.google.com/uc?export=view&id=1ysIEDMeZ7G43T6_oQOqXYxEsE1-ElMoV" width="400">
</p>