# Lab instruction: Model evaluation
Model evaluation is an important part of developing machine learning models: it lets us estimate how well a model performs and what performance to expect after deployment. To carry out this evaluation we use *metrics*, which are measurements that indicate different aspects of the model's performance.

# Measuring performance

In order to evaluate the performance of a classifier, there are different metrics that can help us. Simple examples of them are the following:

* The recall (also called sensitivity or true positive rate) measures the proportion of the samples of the positive class that were correctly detected. It is defined by the expression

Recall = (True Positives) / (True Positives + False Negatives)

* The precision describes which of the samples predicted as positive class are actual positive class members. It is expressed by 

Precision = True Positives / (True Positives + False Positives)

* A major problem of the precision and recall is that they can be misleading under class imbalance. For example, if 99% of the samples belong to the positive class and 1% to the negative class, a classifier that labels every sample as positive reaches a precision of 99% even though it misclassifies all the negative class samples. Several other measurements exist to complement them. For example, the F1 score is the harmonic mean of the precision and the recall:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

* Besides the F1 score, we can also use the balanced accuracy

Balanced accuracy = (True positive rate + True negative rate)/2

where the true positive rate and the true negative rate are defined by

True positive rate = (True positives)/(True Positives + False negatives)

True negative rate = (True negatives)/(True negatives + False positives)
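
To make the formulas above concrete, here is a minimal sketch that evaluates them by hand. The counts are made up for illustration: they assume a classifier that labels every sample as positive on a test set with 99 positive samples and 1 negative sample.

```python
# Hypothetical confusion-matrix counts (assumed for illustration only):
# every sample is predicted as positive, 99 true positives and 1 negative.
tp, fn = 99, 0   # all positive samples are detected
fp, tn = 1, 0    # the single negative sample is wrongly labelled positive

recall = tp / (tp + fn)                       # true positive rate
precision = tp / (tp + fp)
f1 = 2 * (precision * recall) / (precision + recall)
tnr = tn / (tn + fp)                          # true negative rate
balanced_accuracy = (recall + tnr) / 2
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Recall: {recall:.3f}, Precision: {precision:.3f}, F1: {f1:.3f}")
print(f"Accuracy: {accuracy:.3f}, Balanced accuracy: {balanced_accuracy:.3f}")
```

In this made-up case the balanced accuracy is 0.5 because the true negative rate is 0, while the remaining scores all stay close to 1.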
 
### Exercise 1: Accuracy and balanced accuracy
In this exercise you can test the difference between the accuracy and the balanced accuracy. To do so, we will use the make_classification routine from sklearn.datasets. With this function we can generate an imbalanced dataset by changing the weights parameter. Each weight defines the fraction of data points generated for each class.

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report


# Generate synthetic data
X, y = make_classification(n_classes=2 ,n_samples=3000, n_features=30, weights=[0.9, 0.1], random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train MLP
model = MLPClassifier(random_state=42, validation_fraction=0.25, max_iter=2000)
model.fit(X_train, y_train)

# The standard performance measurements, such as the precision, recall and F1 score
# per class (one versus all), can be computed with Sklearn's classification_report function
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))


We have now computed the precision, recall and F1 scores with Sklearn's classification_report. This function reports only the regular accuracy; it does not compute the balanced accuracy score, which accounts for class imbalance. To obtain that score, we can use balanced_accuracy_score. Note that, on this imbalanced dataset, the balanced accuracy is lower than the regular accuracy.

What happens to the accuracy and the balanced accuracy if you generate a balanced dataset in the previous cell?

In [None]:
from sklearn.metrics import balanced_accuracy_score, accuracy_score

# Compute the regular and the balanced accuracy scores.
# Sklearn metrics expect the true labels first: metric(y_true, y_pred)
acc = accuracy_score(y_test, y_pred)
bacc = balanced_accuracy_score(y_test, y_pred)

print("Accuracy: %1.2f, Balanced accuracy: %1.2f"%(acc, bacc))

### Exercise 2: Receiver Operating Characteristic (ROC) curve 
A different way to assess the performance of a classifier is the ROC curve. This curve shows not only how good a classifier is, but also helps us, for example, to choose the classification threshold so that the classifier reaches specific true positive and false positive rates.
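
As a rough illustration of the threshold selection mentioned above, the sketch below uses roc_curve from sklearn.metrics to pick a threshold under a false positive rate budget. The labels, the scores and the 10% budget are all assumptions made up for this example; they are not part of the lab data.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic labels and scores (assumed for illustration only)
rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])
scores = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.5, 1.0, 500)])

# Each point of the ROC curve corresponds to one decision threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Keep the points within the FPR budget and take the one with the highest TPR
max_fpr = 0.10  # hypothetical requirement
within_budget = fpr <= max_fpr
best = np.argmax(np.where(within_budget, tpr, -1.0))
threshold = thresholds[best]

# Samples with a score at or above the threshold are labelled positive
y_hat = (scores >= threshold).astype(int)
print(f"threshold={threshold:.3f}, FPR={fpr[best]:.3f}, TPR={tpr[best]:.3f}")
```

The same idea should carry over to a trained classifier by replacing the synthetic scores with the positive-class column of predict_proba.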

In [None]:
from sklearn.model_selection import train_test_split
# Train the MLP (validation_fraction only takes effect when early_stopping=True)
model = MLPClassifier(max_iter=2000, early_stopping=False, validation_fraction=0.25, random_state=42)
model.fit(X_train, y_train)

# Evaluating
# model.predict() returns a hard class label; for the ROC curve we need the raw output of the classifier instead
# predict_proba() returns continuous class probabilities, which is what the ROC curve needs
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

# Here we generate the ROC curve!
display = RocCurveDisplay.from_predictions(
    y_test,
    y_score[:,1],
    name="ROC curve of MLP Classifier",
    curve_kwargs=dict(color="darkorange"),
    plot_chance_level=True,
    despine=True,
)
_ = display.ax_.set(
    xlabel="False Positive Rate",
    ylabel="True Positive Rate",
    title="ROC curve"
)

Besides the ROC curve itself, a common approach to validate a model is the *area* under the ROC curve. This metric is often referred to as the AUROC score.

In [None]:
from sklearn.metrics import roc_auc_score

print("AUROC for the MLP classifier: %1.2f"%(roc_auc_score(y_test, y_score[:,1])))
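
One way to read the AUROC is as the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score. The following sketch checks this numerically on synthetic scores; the data are an assumption made up for this example and are unrelated to the MLP above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic labels and scores (assumed for illustration only)
rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(200, dtype=int), np.ones(100, dtype=int)])
scores = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(1.0, 1.0, 100)])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
higher = (pos[:, None] > neg[None, :]).mean()
ties = (pos[:, None] == neg[None, :]).mean()
pairwise = higher + 0.5 * ties

auroc = roc_auc_score(y_true, scores)
print(f"Pairwise ranking estimate: {pairwise:.4f}, roc_auc_score: {auroc:.4f}")
```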

Finally, as we did before, we can also use the confusion matrix to evaluate the performance of our models.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
_ = ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

# Exercise 3: Multiclass examples

### Cross validation for model selection
Now that we have explored cross validation, we can use Sklearn's GridSearchCV to define a grid search that finds the best classifier model.
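
A minimal sketch of such a grid search on the digits dataset could look as follows. The parameter grid, the number of folds and the scoring choice are assumptions picked to keep the run short; they are not recommended settings.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical, deliberately small grid so that the search finishes quickly
param_grid = {
    "hidden_layer_sizes": [(32,), (64,)],
    "alpha": [1e-4, 1e-2],
}

# 3-fold cross validation on the training set for every parameter combination
search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=42),
    param_grid,
    cv=3,
    scoring="balanced_accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated balanced accuracy: %1.3f" % search.best_score_)
# search.score uses the same scoring as the search, here the balanced accuracy
print("Test balanced accuracy: %1.3f" % search.score(X_test, y_test))
```

By default GridSearchCV refits the best parameter combination on the whole training set, so search can afterwards be used like a regular fitted classifier.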


In [None]:
# Loading and splitting the data

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split


data, labels = load_digits(return_X_y=True)
(n_samples, n_features), n_digits = data.shape, np.unique(labels).size
X, y = data, labels

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Plot the first 25 digits
import matplotlib.pyplot as plt

fig, axs = plt.subplots(5, 5)
for i in range(5):
    for j in range(5):
        axs[i,j].imshow(X_train[i+j*5,:].reshape(8,8) )
plt.show()

In [None]:
# Train MLP
model = MLPClassifier(random_state=42, validation_fraction=0.25, max_iter=2000, early_stopping=True, hidden_layer_sizes=(100,))
model.fit(X_train, y_train)
_ = ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

Now it's your turn! Make the ROC curves for each class of your digit classifier, following a one-versus-rest (OvR) approach. To do so, you will need to one-hot encode the labels of the data with the LabelBinarizer from sklearn.preprocessing. Afterwards, compute the ROC curve for each class.
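
As a hint, the sketch below shows the one-versus-rest mechanics on a different problem. The iris dataset and the LogisticRegression classifier are stand-ins assumed for this example; adapting the idea to the digits data and the MLP is left as the exercise.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = clf.predict_proba(X_test)        # shape (n_samples, n_classes)

# One-hot encode the true labels: column k is 1 where the label is class k
binarizer = LabelBinarizer().fit(y_train)
y_onehot = binarizer.transform(y_test)     # shape (n_samples, n_classes)

roc_auc = {}
for k, class_label in enumerate(binarizer.classes_):
    # Class k against all remaining classes
    fpr, tpr, _ = roc_curve(y_onehot[:, k], y_score[:, k])
    roc_auc[class_label] = auc(fpr, tpr)
    print(f"Class {class_label}: AUROC = {roc_auc[class_label]:.3f}")
```

To draw the curves instead of only printing the areas, each pair of columns can be passed to RocCurveDisplay.from_predictions, as in Exercise 2.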

In [None]:
# Add your code here