# Evaluation Metrics Selection 

Each machine learning model tries to solve a problem with a different objective and a different dataset, so it is important to understand the context before choosing a metric. Usually, the answers to the following questions help us choose the appropriate metric:
* Type of task: Regression? Classification?
* Available data distribution
* Business goal

# Classification 

##  Confusion Matrix

To make the idea behind the confusion matrix easy to follow, let's suppose we have a binary classification task. We have some medical measurements from different patients, and the task is to classify each patient as healthy or sick. So our target is:
* 1: when a person is sick
* 0: when a person is healthy

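Alright! Now that we have defined the problem: the confusion matrix is a table with two dimensions (“Actual” and “Predicted”), with the same set of classes along both dimensions. In the figure below, the actual classes are the columns and the predicted classes are the rows. Note that scikit-learn's `confusion_matrix` uses the opposite layout: rows are actual classes and columns are predicted classes.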

![title](../Images/conf_matrix.png)


In [1]:
from sklearn.metrics import confusion_matrix
import numpy as np 
 

y = np.array([0,0,0,0,0,0,0,1,1,1])
y_pred = np.array([0,0,0,0,0,0,0,0,0,0])



conf_matrix_ = confusion_matrix(y,y_pred)
print(conf_matrix_)

[[7 0]
 [3 0]]
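As a minimal sketch of how the four cells of the table map to counts, the snippet below unpacks a binary confusion matrix into TN, FP, FN and TP. The predictions here are made up for illustration (they are not the all-zero predictions used above), and the label convention 1 = sick, 0 = healthy follows the text.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels (assumed for this sketch): 1 = sick, 0 = healthy
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# scikit-learn puts actual classes on rows and predicted classes on columns,
# so for labels [0, 1] the flattened order is TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Passing `labels=[0, 1]` explicitly keeps the matrix 2x2 even if one class never appears in the data.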


You can also plot the confusion matrix as a heatmap: https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py

### When to minimise what? 

#### 1. Minimising False Negatives
Let’s say that in a cancer detection problem, only 5 out of 100 people have cancer. Here accuracy alone is misleading: even a very bad model that predicts everyone as non-cancerous reaches 95% accuracy (we will come to what accuracy is), yet it misses every cancer patient. What we really want is to catch all the cancerous patients. To do so, we might end up classifying some people who do not have cancer as cancerous (false positives). This is acceptable because it is less dangerous than missing a cancerous patient: the flagged cases are sent for further examination and reports anyway. Missing a cancer patient (a false negative), however, is a huge mistake, because no further examination will be done on them.

#### 2. Minimising False Positives
To understand false positives better, let’s use a different example, where the model classifies whether an email is spam or not.
Let’s say that you are expecting an important email, such as a reply from a recruiter or an admission letter from a university. Let’s assign labels to the target variable: 1 means “email is spam” and 0 means “email is not spam”.
Suppose the model classifies the important email you are desperately waiting for as spam (a false positive). This is much worse than classifying a spam email as not spam (a false negative), because in that case we can still delete it manually, and it is not a pain if it happens once in a while. So in spam classification, minimising false positives is more important than minimising false negatives.

## Accuracy

![title](../Images/accuracy.png)
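Accuracy is the share of correct predictions, (TP + TN) / (TP + TN + FP + FN). The sketch below revisits the cancer example from the previous section (100 patients, 5 with cancer) under the assumption that 1 marks a cancer patient, and shows how a model that predicts nobody has cancer still scores 95% accuracy while its recall is zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 100 patients, 5 of whom have cancer (label 1 is an assumption of this sketch)
y_true = np.array([1] * 5 + [0] * 95)
# A "model" that predicts everyone as non-cancerous
y_pred = np.zeros_like(y_true)

# Accuracy by hand: fraction of predictions that match the true label
manual_accuracy = np.mean(y_true == y_pred)
sk_accuracy = accuracy_score(y_true, y_pred)
# Recall of the positive class: every cancer patient was missed
recall = recall_score(y_true, y_pred)

print(f"accuracy={sk_accuracy:.2f}, recall={recall:.2f}")
```

This is why accuracy on its own is a poor choice when the classes are heavily imbalanced.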


## Precision

![title](../Images/prec.png)


In [2]:
from sklearn.metrics import precision_score

y = np.array([0,0,0,0,0,0,0,1,1,1])
y_pred = np.array([0,0,0,0,0,0,0,0,0,0])

precision = precision_score(y, y_pred)

print('Precision score: {0:0.2f}'.format(
      precision))

Precision score: 0.00




Precision is a measure that tells us what proportion of the patients we diagnosed as having cancer actually had cancer. The predicted positives (people predicted as cancerous) are TP and FP, and those among them who actually have cancer are TP, so precision is TP / (TP + FP).
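To make the definition concrete, here is a small sketch that counts TP and FP by hand and compares the result with `precision_score`. The predictions are illustrative values chosen so that the model predicts at least one positive; they are not taken from the cell above.

```python
import numpy as np
from sklearn.metrics import precision_score

# Illustrative labels (assumed for this sketch): 1 = cancer, 0 = no cancer
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# Predicted positives split into true positives and false positives
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))

manual_precision = tp / (tp + fp)
sk_precision = precision_score(y_true, y_pred)
print(f"manual={manual_precision:.2f}, sklearn={sk_precision:.2f}")
```

When a model predicts no positives at all, TP + FP is zero and precision is undefined. scikit-learn then reports 0 and emits a warning; recent versions accept a `zero_division` argument to control the reported value.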

## Recall / True Positive Rate / Sensitivity
![rec](../Images/rec.png)


![rec](../Images/Precisionrecall.png)


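Recall is a measure that tells us what proportion of the patients who actually had cancer were diagnosed by the algorithm as having cancer. The actual positives are TP and FN, and those the algorithm caught are TP, so recall is TP / (TP + FN).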
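The same hand count works for recall. In this sketch the labels and predictions are again illustrative assumptions, with one of the three sick patients missed by the model.

```python
import numpy as np
from sklearn.metrics import recall_score

# Illustrative labels (assumed for this sketch): 1 = cancer, 0 = no cancer
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# Actual positives split into true positives and false negatives
tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

manual_recall = tp / (tp + fn)
sk_recall = recall_score(y_true, y_pred)
print(f"manual={manual_recall:.2f}, sklearn={sk_recall:.2f}")
```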

In [3]:
from sklearn.metrics import recall_score

y = np.array([0,0,0,0,0,0,0,1,1,1])
y_pred = np.array([0,0,0,0,0,0,0,0,0,0])

recall = recall_score(y, y_pred)

print('Recall score: {0:0.2f}'.format(
      recall))

Recall score: 0.00


## F1 score

We don’t really want to carry both Precision and Recall in our pockets every time we make a model for solving a classification problem. So it’s best if we can get a single score that kind of represents both Precision(P) and Recall(R). 

Suppose we have 100 credit card transactions, of which 97 are legit and 3 are fraud, and let’s say we came up with a model that predicts everything as fraud. (Horrendous, right!?) Its recall is 100%, because every fraud is caught, but its precision is only 3%.

![rec](../Images/fraud.png)


![rec](../Images/mean.png)


#### Proof of the inequality between the harmonic, geometric and arithmetic means: https://artofproblemsolving.com/wiki/index.php/Root-Mean_Square-Arithmetic_Mean-Geometric_Mean-Harmonic_mean_Inequality

![rec](../Images/f1_score.svg?sanitize=true)
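The fraud example can be checked numerically. The sketch below assumes 1 marks a fraudulent transaction and compares the arithmetic mean of precision and recall with their harmonic mean, which is what the F1 score computes.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# 100 transactions: 97 legitimate (0) and 3 fraudulent (1)
y_true = np.array([0] * 97 + [1] * 3)
# A model that flags every transaction as fraud
y_pred = np.ones_like(y_true)

precision = precision_score(y_true, y_pred)  # 3 correct out of 100 flagged
recall = recall_score(y_true, y_pred)        # 3 caught out of 3 frauds

arithmetic_mean = (precision + recall) / 2
harmonic_mean = 2 * precision * recall / (precision + recall)
f1 = f1_score(y_true, y_pred)

print(f"precision={precision:.2f}, recall={recall:.2f}")
print(f"arithmetic mean={arithmetic_mean:.3f}, F1 (harmonic mean)={f1:.3f}")
```

The arithmetic mean comes out above 0.5 and makes this useless model look acceptable, while the harmonic mean stays close to the smaller of the two values (about 0.06 here).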


In [4]:
from sklearn.metrics import classification_report
import numpy as np 

y = np.array([0,0,0,0,0,0,0,1,1,1])
y_pred = np.zeros_like(y)

print(classification_report(y,y_pred))

              precision    recall  f1-score   support

           0       0.70      1.00      0.82         7
           1       0.00      0.00      0.00         3

    accuracy                           0.70        10
   macro avg       0.35      0.50      0.41        10
weighted avg       0.49      0.70      0.58        10





In [5]:
from sklearn.metrics import classification_report
import numpy as np 

y = np.array([0,0,0,0,0,0,0,1,1,1])
y_pred = np.ones_like(y)

print(classification_report(y,y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         7
           1       0.30      1.00      0.46         3

    accuracy                           0.30        10
   macro avg       0.15      0.50      0.23        10
weighted avg       0.09      0.30      0.14        10



## ROC, AUC curves

#### If your model outputs class probabilities, you can move the decision threshold to reduce FP or FN.
A lower threshold reduces false negatives (at the cost of more false positives), and a higher threshold reduces false positives (at the cost of more false negatives).
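A minimal sketch of threshold moving is shown below. The probabilities in `y_proba` are hypothetical values made up for illustration, standing in for what a real classifier's `predict_proba` would return for class 1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
# Hypothetical predicted probabilities of class 1 (not from a real model)
y_proba = np.array([0.05, 0.10, 0.20, 0.25, 0.35, 0.45, 0.60, 0.40, 0.55, 0.90])

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict class 1 whenever the probability reaches the threshold
    y_pred = (y_proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    results[threshold] = (int(fp), int(fn))
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
```

On this toy data, raising the threshold from 0.3 to 0.7 takes the false positives from 3 down to 0 and the false negatives from 0 up to 2, which is the trade-off described above.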

What if we calculate the true positive rate and the false positive rate for many different thresholds? Plotting one against the other gives the ROC curve, and the area under that curve is the AUC.

![rec](../Images/tp_rate.png)


![rec](../Images/fp_rate.png)


![rec](../Images/roc_auc_1.png)


![rec](../Images/log_roc.png)
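The curve itself can be computed with `roc_curve`, and the area under it with `roc_auc_score`. The sketch below uses hypothetical probabilities made up for illustration rather than the output of a trained model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
# Hypothetical predicted probabilities of class 1 (not from a real model)
y_proba = np.array([0.05, 0.10, 0.20, 0.25, 0.35, 0.45, 0.60, 0.40, 0.55, 0.90])

# One (FPR, TPR) point per candidate threshold, from strictest to loosest
fpr, tpr, thresholds = roc_curve(y_true, y_proba)
roc_auc = roc_auc_score(y_true, y_proba)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC={roc_auc:.3f}")
```

The AUC can also be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. Here 18 of the 21 positive/negative pairs are ordered correctly, so the AUC is about 0.857.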


For a better understanding of ROC and AUC, see this video: https://www.youtube.com/watch?v=4jRBRDbJemM

A good scikit-learn tutorial: https://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#sphx-glr-auto-examples-ensemble-plot-feature-transformation-py

#### More literature on classification metrics

* https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/
* https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/

# Regression metrics

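Sometimes in regression tasks the value of the loss function is not informative on its own. Imagine you get Loss = 23.5: is that good enough? The answer depends on the scale of the target, so a metric that is normalised against a simple baseline is easier to interpret.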

## R Squared (R²)

![rec](../Images/r_2.png)


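##### R² always lies between -∞ and 1: 1 is a perfect fit, 0 is no better than always predicting the mean of the target, and negative values are worse than that.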

![rec](../Images/r2.png)
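As a rough illustration of that range, the sketch below computes R² by hand as 1 - SS_res / SS_tot and compares it with `r2_score` for three sets of predictions. The target values and predictions are invented for this example, and `r2_manual` is a helper defined here, not a library function.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented regression targets and three sets of predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_good = np.array([2.5, 5.5, 7.0, 9.0])          # close to the truth
y_mean = np.full_like(y_true, y_true.mean())     # always predicts the mean
y_bad = np.array([9.0, 7.0, 5.0, 3.0])           # worse than the mean


def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


for name, y_pred in [("good", y_good), ("mean", y_mean), ("bad", y_bad)]:
    print(f"{name}: manual={r2_manual(y_true, y_pred):.3f}, "
          f"sklearn={r2_score(y_true, y_pred):.3f}")
```

The three cases give 0.975, 0 and -3, so unlike a raw loss value the score says directly how the model compares with the mean baseline.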


## References
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

https://medium.com/@george.drakos62/how-to-select-the-right-evaluation-metric-for-machine-learning-models-part-1-regrression-metrics-3606e25beae0

https://medium.com/thalus-ai/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b