Different machine learning problems call for different ways of evaluating the results of our modelling efforts. Here we discuss some of the most commonly used classification metrics and how to interpret them.

# Accuracy Metric

scikit-learn's `accuracy_score` function computes the accuracy: either the fraction (default) or the count (`normalize=False`) of correct predictions.

![image.png](attachment:image.png)
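
As a rough sketch of what that computation amounts to, the plain-Python function below counts matching predictions. The function name `accuracy` and the labels are made up for illustration; this is not scikit-learn's implementation, only the same idea.

```python
def accuracy(y_true, y_pred, normalize=True):
    """Fraction (or count, if normalize=False) of predictions that match the labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true) if normalize else correct


# Hypothetical labels, made up for illustration.
y_true = [0, 1, 2, 3]
y_pred = [0, 2, 1, 3]

acc_fraction = accuracy(y_true, y_pred)                 # 2 of 4 predictions match
acc_count = accuracy(y_true, y_pred, normalize=False)   # number of matches
print(acc_fraction, acc_count)
```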

# Confusion Matrix

An accuracy score tells us how often the model was wrong, but it gives little insight into how it went wrong: which classes were mispredicted, how many mistakes were made in each class, and which class the model predicted instead of the correct one. A confusion matrix provides these details. It exposes the model's biases and edge cases and gives a more complete view of the results, which can be used to improve the model in specific areas.

![image.png](attachment:image.png)

The graphic above has the true label on the y axis and the predicted label on the x axis.
There are 6 cases where the true label was versicolor but the predicted label was virginica, and all of these are incorrect predictions. The confusion matrix therefore tells us that the model may be predicting virginica more often than it should, at the expense of versicolor.
It also tells us that the model catches the setosa class very well: it neither under-predicts nor over-predicts that class.
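
To make the row/column convention concrete, here is a minimal sketch that builds a confusion matrix by hand, with rows for the true label and columns for the predicted label. The tiny label lists are invented for illustration and are not the data behind the graphic.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


labels = ["setosa", "versicolor", "virginica"]
# Hypothetical predictions: two versicolor samples are mistaken for virginica.
y_true = ["setosa", "setosa", "versicolor", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica", "virginica"]

cm = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, cm):
    print(f"{label:<12}{row}")
```

Off-diagonal entries are the mistakes: the entry in the versicolor row and virginica column counts versicolor samples that were predicted as virginica.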

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Another useful aspect of this matrix is the set of notations above. Taking the simpler case of binary classification, there are 4 possible outcome types.
- True: a correct prediction
    -  True Positive: predicted to be the positive class, and the prediction was correct
    -  True Negative: predicted to be the negative class, and the prediction was correct

- False: an incorrect prediction
    -  False Positive: predicted to be the positive class, but the prediction was incorrect
    -  False Negative: predicted to be the negative class, but the prediction was incorrect



A few terms:
- **Precision** shows, out of all the data classified as positive, how many were classified correctly: TP / (TP + FP). It is the ability of the model to label only positive data as the positive class, i.e. how precise the model was when predicting the positive class. (The model should not go on predicting the positive class for every case, even when a case actually belongs to the negative class.)
- **Recall** shows, out of all the actual positive-class data, how many were classified correctly: TP / (TP + FN). It is the ability of the model to find and correctly label all the positive samples, i.e. how many positive data points the model was able to classify as positive. Recall is also known as sensitivity.
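
The two definitions can be sketched directly from the four outcome counts. The helper names and the labels below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def binary_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred, positive=1):
    tp, fp, fn, _ = binary_counts(y_true, y_pred, positive)
    # Return 0.0 when a denominator is empty (an assumption of this sketch).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: 3 actual positives, 4 predicted positives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(binary_counts(y_true, y_pred), precision, recall)
```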

Interpretations:
- If recall is high but precision is very low, the model is predicting the positive class more often than it is observed in the training data. It may be a biased model that has learned to predict the positive class in almost all cases.

    Here’s an example to illustrate this:

    - Imagine a model that predicts whether a patient has a certain disease.
    - If the model is very liberal and labels many patients as having the disease, it might catch most of the actual cases (high recall).
    - However, this approach might also label many healthy patients as having the disease, leading to a lot of false positives and thus low precision.

- On the other hand, if precision is high but recall is low, the model is being very conservative in assigning the positive class. It rarely mislabels a negative case as positive (for example, labeling an email as spam when it is not), but it misses many actual positive cases.


    Here’s a simple example to illustrate this:

    - Suppose you have a model that predicts whether an email is spam or not.
    - If the model is very conservative and only labels an email as spam when it is very sure, it might have high precision because most of the emails it labels as spam are indeed spam.
    - However, this conservative approach might miss many actual spam emails, leading to low recall.
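
Both situations can be reproduced with one set of scores and two decision thresholds. This is a sketch under assumptions: the scores, the labels, and the two threshold values are invented, and a "liberal" model is simulated simply by using a low threshold.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is labeled positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical model scores and true labels (4 positives out of 10).
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
y_true = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

liberal = precision_recall_at(y_true, scores, threshold=0.25)       # labels many cases positive
conservative = precision_recall_at(y_true, scores, threshold=0.90)  # labels only very sure cases
print("liberal:", liberal)
print("conservative:", conservative)
```

The liberal setting finds every positive but at the cost of false positives, while the conservative setting makes no false-positive mistakes but misses most positives.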




One more term:
- **F1/Fβ Score** - The F1 score is the harmonic mean of precision and recall. It thus represents both precision and recall symmetrically in one metric. The more generic Fβ score applies additional weights, valuing one of precision or recall more than the other.

![image.png](attachment:image.png)

Reference: https://en.wikipedia.org/wiki/F-score
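
A small sketch of the formula, assuming the standard definition Fβ = (1 + β²) · P · R / (β² · P + R) from the reference above. The function name `f_beta` and the precision/recall values are illustrative only.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more, beta < 1 weights precision more."""
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0  # convention chosen for this sketch when both inputs are 0
    return (1 + beta**2) * precision * recall / denominator


# Hypothetical model with low precision and perfect recall.
precision, recall = 0.5, 1.0

f1 = f_beta(precision, recall)              # harmonic mean of the two
f2 = f_beta(precision, recall, beta=2.0)    # rewards the high recall
f05 = f_beta(precision, recall, beta=0.5)   # penalises the low precision
print(round(f1, 4), round(f2, 4), round(f05, 4))
```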

Here's a useful visual that summarises the many ways to look at performance:

![image.png](attachment:image.png)

# Classification Report

Simply put, a classification report is a textual summary of some of the metrics that can be derived from the confusion matrix.

![image.png](attachment:image.png)

It focuses on the precision, recall and F1 scores for each class. The support is the number of observations that actually belong to the class in question, i.e. the number of samples each metric was calculated on. The report is useful when we want to see these numbers together without having to calculate them from the confusion matrix.
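
The sketch below shows how such a report could be assembled per class by treating each class in turn as the positive one. The function `per_class_report` and the labels are hypothetical; a library report may format or average things differently.

```python
def per_class_report(y_true, y_pred, labels):
    """Precision, recall, F1 and support for each class (one-vs-rest)."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}
    return report


labels = ["setosa", "versicolor", "virginica"]
# Hypothetical labels, made up for illustration.
y_true = ["setosa", "setosa", "versicolor", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica", "virginica"]

report = per_class_report(y_true, y_pred, labels)
print(f"{'class':<12}{'precision':>10}{'recall':>10}{'f1':>10}{'support':>10}")
for label, row in report.items():
    print(f"{label:<12}{row['precision']:>10.2f}{row['recall']:>10.2f}{row['f1']:>10.2f}{row['support']:>10}")
```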

# ROC Curve

Since the model outputs class probabilities, there are various thresholds that one can use to turn a predicted probability into a class. By default 0.5 is used as the threshold, which means that if the model predicts a probability > 0.5 the class is set to 1, and otherwise to 0.
But 0.5 is only the default, and many other thresholds can be used depending on the problem statement and on which type of prediction error is more critical to the use case.
With each threshold the performance of the model changes, and the change is captured precisely in the confusion matrix.

![image.png](attachment:image.png)
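
As an illustration of how the threshold moves counts around the confusion matrix, the sketch below applies the "probability > threshold" rule at three thresholds. The probabilities, labels, and chosen thresholds are invented for the example.

```python
def counts_at_threshold(y_true, probs, threshold=0.5):
    """Confusion-matrix counts when probability > threshold is mapped to class 1."""
    y_pred = [1 if p > threshold else 0 for p in probs]
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(t == 1 and p == 1 for t, p in pairs),
        "FP": sum(t == 0 and p == 1 for t, p in pairs),
        "FN": sum(t == 1 and p == 0 for t, p in pairs),
        "TN": sum(t == 0 and p == 0 for t, p in pairs),
    }


# Hypothetical predicted probabilities and true labels.
probs = [0.10, 0.35, 0.40, 0.55, 0.65, 0.80, 0.90]
y_true = [0, 0, 1, 0, 1, 1, 1]

by_threshold = {th: counts_at_threshold(y_true, probs, th) for th in (0.3, 0.5, 0.7)}
for th, counts in by_threshold.items():
    print(th, counts)
```

Lowering the threshold trades false negatives for false positives, and raising it does the opposite, which is why the right choice depends on which error is more costly.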

To compare model performance across thresholds we could look at the confusion matrix for each one, but many matrices are hard to compare side by side. Instead of going through numerous confusion matrices, the ROC curve provides a simple way to summarise all of that information. The following shows an ROC curve:

![image.png](attachment:image.png)

The x axis shows the False Positive Rate (FPR), which is 1 - specificity, and the y axis shows the True Positive Rate (TPR), which is sensitivity. The diagonal line shows the points where TPR = FPR, which is the performance expected from random guessing.

Each threshold gives a different point on the graph. The closer a point is to the top left, the better, since TPR increases towards the top and FPR decreases towards the left. The final ROC graph summarises all of the confusion matrices that the thresholds produced, which helps us decide which threshold to use for the classification.

![image.png](attachment:image.png)
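
A minimal sketch of how the points of such a curve can be computed: each distinct score is used as a threshold, and the resulting (FPR, TPR) pair becomes one point. The function `roc_points` and the data are hypothetical, and this sketch labels a sample positive when its score is >= the threshold.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs, one per threshold, from the strictest threshold to the loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical scores and true labels (4 positives, 3 negatives).
scores = [0.10, 0.35, 0.40, 0.55, 0.65, 0.80, 0.90]
y_true = [0, 0, 1, 0, 1, 1, 1]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```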

The AUC (area under the curve) helps compare multiple ROC curves: when several models are built and their ROC curves overlap or cross, it is hard to compare them visually, so a single number is useful. The higher the AUC, the better: a curve that hugs the top-left corner encloses more area, with 1.0 being a perfect classifier and 0.5 matching the diagonal.
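
One simple way to obtain the area, sketched here with the trapezoidal rule over (FPR, TPR) points. The function name `auc` and the example points are assumptions made for illustration.

```python
def auc(points):
    """Area under a curve given as (x, y) points, using the trapezoidal rule."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical ROC points as (FPR, TPR) pairs.
model_curve = [(0.0, 0.0), (0.0, 0.75), (1 / 3, 0.75), (1 / 3, 1.0), (1.0, 1.0)]
diagonal = [(0.0, 0.0), (1.0, 1.0)]  # the TPR = FPR line

model_auc = auc(model_curve)
diagonal_auc = auc(diagonal)
print(round(model_auc, 4), diagonal_auc)
```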

Variation:

Often, especially under class imbalance where the positive class is rare, precision is plotted instead of FPR (against recall), which gives a precision-recall curve rather than an ROC curve. The reason is shown below:


![image.png](attachment:image.png)
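
One commonly cited reason can be illustrated with a few numbers; this is offered as a sketch, not as a restatement of the figure above. FPR divides by all actual negatives, so with many negatives it stays small even when false positives outnumber true positives, whereas precision does not use true negatives at all. The counts below are invented.

```python
# Hypothetical confusion-matrix counts for an imbalanced problem:
# 10 actual positives and 990 actual negatives.
tp, fn, fp, tn = 8, 2, 20, 970

fpr = fp / (fp + tn)        # share of negatives wrongly flagged
recall = tp / (tp + fn)     # share of positives found (TPR)
precision = tp / (tp + fp)  # share of flagged cases that are truly positive

print(f"FPR={fpr:.3f}  recall={recall:.2f}  precision={precision:.3f}")
```

Here the ROC view (low FPR, high recall) looks strong, yet most of the cases flagged as positive are wrong, which the precision value makes visible.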