#### Q1. Explain any three new evaluation metrics (other than discussed in class) with proper explanation and analysis

##### F Beta Score

> The Fbeta-measure is calculated from precision and recall.<br/>
> **Precision** is the fraction of positive predictions that are actually positive.<br/>
  *Precision = TruePositives / (TruePositives + FalsePositives)*

> **Recall** is the fraction of actual positives that the model correctly predicts as positive. Maximizing precision minimizes false-positive errors, whereas maximizing recall minimizes false-negative errors.<br/>
*Recall = TruePositives / (TruePositives + FalseNegatives)*

> The **F-measure** is calculated as the harmonic mean of precision and recall, giving each the same weighting. It allows a model to be evaluated taking both the precision and recall into account using a single score, which is helpful when describing the performance of the model and in comparing models.

> The **Fbeta-measure** is a generalization of the F-measure that adds a configuration parameter called beta. The default beta value is 1.0, which gives the ordinary F-measure (F1). A smaller beta value, such as 0.5, gives more weight to precision and less to recall, whereas a larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score.

- Precision and recall provide two ways to summarize the errors made for the positive class in a binary classification problem.
- F-measure provides a single score that summarizes the precision and recall.
- Fbeta-measure provides a configurable version of the F-measure that gives more or less attention to precision or recall when calculating a single score.

> F-beta lets us choose how to balance false positives against false negatives, depending on which error is more costly.
  Formula:<br/>
  *Fbeta = ((1 + beta^2) * Precision * Recall) / (beta^2 * Precision + Recall)*

In the formula, beta controls the relative importance of recall versus precision: recall is treated as beta times as important as precision. Hence, when we want to put more weight on false positives (i.e. precision) than on false negatives, we use a smaller beta value (beta < 1), and when we want more weight on false negatives (i.e. recall), we use a larger value (beta > 1). If we want an equal balance between precision and recall, we use beta = 1.
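The effect of beta can also be seen directly in terms of error counts. The sketch below assumes a hypothetical helper `f_beta_from_counts` (not a library function) and uses the algebraically equivalent form *Fbeta = (1 + beta^2) TP / ((1 + beta^2) TP + beta^2 FN + FP)*, in which each false negative counts beta^2 times as much as a false positive. The two classifiers are made-up numbers for illustration.

```python
def f_beta_from_counts(tp, fp, fn, beta=1.0):
    """F-beta from raw counts (hypothetical helper, for illustration only)."""
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    # Convention assumed here: score 0 when there are no positives at all.
    return (1 + b2) * tp / denominator if denominator else 0.0


# Two made-up classifiers with the same number of errors (10 each).
fp_heavy = dict(tp=40, fp=10, fn=0)   # all errors are false positives
fn_heavy = dict(tp=40, fp=0, fn=10)   # all errors are false negatives

for beta in (0.5, 1.0, 2.0):
    a = f_beta_from_counts(**fp_heavy, beta=beta)
    b = f_beta_from_counts(**fn_heavy, beta=beta)
    print(f"beta={beta}: FP-heavy={a:.3f}  FN-heavy={b:.3f}")
```

With beta = 1 both classifiers score the same; with beta = 0.5 the classifier with false positives scores lower, and with beta = 2 the one with false negatives scores lower.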

>Rules and Examples:

>>F0.5-Measure (beta=0.5):
F0.5-measure puts more attention on minimizing false positives than on minimizing false negatives.
e.g. In a spam filter we want very few false positives, because if regular mail is flagged as spam the user might lose important information; hence we focus more on precision.

>>F2-Measure (beta=2.0): Less weight on precision, more weight on recall.
e.g. In airport security screening we don't want any dangerous item to be carried onto a flight. Hence we want to reduce false negatives, putting more weight on recall.


>Comparison of F0.5, F1 & F2
Consider a scenario where we have perfect precision and 50% recall.
Precision = 1.0
Recall = 0.5


>>If Beta = 1: 
F1-Measure = ((1 + 1^2) * Precision * Recall) / (1^2 * Precision + Recall)
F1-Measure = (2 * 1.0 * 0.5) / ( 1.0 + 0.5 )
F1-Measure = 1.0 / 1.5
F1-Measure = 0.667


>>If Beta = 0.5
F0.5-Measure = ((1 + 0.5^2) * Precision * Recall) / (0.5^2 * Precision + Recall)
F0.5-Measure = (1.25 * Precision * Recall) / (0.25 * Precision + Recall)
F0.5-Measure = (1.25 * 1.0 * 0.5) / (0.25 * 1.0 + 0.5)
F0.5-Measure = 0.625 /0.75
F0.5-Measure = 0.833

*Because beta < 1 gives more weight to precision, the F0.5 score of 0.833 is much higher than the F1 score: the low recall is penalized less.*


>>If Beta = 2.0
F2-Measure = ((1 + 2^2) * Precision * Recall) / (2^2 * Precision + Recall)
F2-Measure = (5 * Precision * Recall) / (4 * Precision + Recall)
F2-Measure = (5 * 1.0 * 0.5) / (4 * 1.0 + 0.5)
F2-Measure = 2.5 /4.5
F2-Measure = 0.556

*Because beta > 1 gives more weight to recall, the F2 score of 0.556 is much lower than the F1 score: the low recall is penalized more heavily.*
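The three hand calculations above can be checked with a few lines of Python. This is a minimal sketch; `f_beta` is a hypothetical helper written for this check, implementing the Fbeta formula from precision and recall exactly as stated earlier.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta from precision and recall (hypothetical helper)."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # convention assumed here when both inputs are 0
    return (1 + b2) * precision * recall / denominator


precision, recall = 1.0, 0.5
scores = {beta: f_beta(precision, recall, beta) for beta in (0.5, 1.0, 2.0)}
for beta, score in scores.items():
    print(f"F{beta:g} = {score:.3f}")
# F0.5 = 0.833
# F1 = 0.667
# F2 = 0.556
```

When working from label arrays rather than from precision and recall, scikit-learn's `fbeta_score(y_true, y_pred, beta=...)` should give the same kind of score.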


##### Cohen Kappa

Cohen’s kappa is a metric often used to assess the agreement between two raters. It can also be used to assess the performance of a classification model.

Like many other evaluation metrics, Cohen’s kappa is calculated based on the confusion matrix. However, in contrast to overall accuracy, Cohen’s kappa takes imbalance in the class distribution into account.

For example, if we had two bankers and we asked both to classify 100 customers in two classes for credit rating (i.e., good and bad) based on their creditworthiness, we could then measure the level of their agreement through Cohen’s kappa.

> Cohen's Kappa for binary distribution <br/>
Suppose customers with a good rating represent 90% of the data and customers with a bad rating represent 10%. A classification model that predicts a good rating for every customer would reach an accuracy as high as 90%. Cohen's Kappa tries to correct this bias by taking the prior class distribution into account.



*k = (Po - Pe) / (1 - Pe)*

- Po: the overall accuracy of the model, i.e. the proportion of observed agreement.
- Pe: the overall accuracy that can be reached with a random guess, i.e. the expected accuracy, the level of accuracy we expect to obtain by chance.
- 1 - Pe: the maximum value of the difference Po - Pe, as given by a perfect model with an accuracy of 100%.
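To make Po and Pe concrete, here is a minimal sketch that computes kappa from a confusion matrix. The helper `cohen_kappa` and both matrices are assumptions for illustration: the first matrix is the "predict everyone as good" model on a 90/10 split, the second is a made-up model that also finds some bad customers.

```python
def cohen_kappa(matrix):
    """Cohen's kappa from a square confusion matrix (list of rows)."""
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]        # actual class counts
    col_totals = [sum(col) for col in zip(*matrix)]  # predicted class counts
    # Po: observed agreement (overall accuracy).
    p_o = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Pe: agreement expected by chance from the marginal totals.
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if p_e == 1:
        return 0.0  # convention assumed here for the degenerate case
    return (p_o - p_e) / (1 - p_e)


# rows = actual (good, bad), columns = predicted (good, bad)
always_good = [[90, 0], [10, 0]]   # 90% accuracy by predicting "good" only
mixed_model = [[85, 5], [3, 7]]    # made-up model, 92% accuracy

print(cohen_kappa(always_good))
print(round(cohen_kappa(mixed_model), 3))
```

The always-good model has Po = Pe = 0.9, so its kappa is 0 despite the 90% accuracy, while the second model gets a clearly positive kappa.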


To calculate the overall accuracy of a random guess, Pe, we make the strong assumption that the model's predictions are independent of the prior distribution of the target class. This assumption is often violated on imbalanced data, since a classification model tends to predict the majority class when it is uncertain.

However, be careful when comparing Cohen's kappa values between models, because Cohen's kappa tends to be higher when the target classes are balanced.

Cohen's kappa is not informative about the expected prediction accuracy. Interpreting Cohen's kappa values with labels such as "moderate" is neither easy nor fixed.

##### Matthews Correlation Coefficient (MCC)

1. The Matthews correlation coefficient (MCC) or phi coefficient is used in machine learning as a measure of the quality of binary (two-class) classifications, introduced by biochemist Brian W. Matthews in 1975.

2. The coefficient takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes

>
![image-2.png](attachment:image-2.png)

3. MCC is a tool for model evaluation. It measures the correlation between actual values and predicted values and is closely related to the chi-square statistic for a 2 x 2 contingency table: |MCC| = sqrt(chi-square / n), where n is the total number of observations.
     
4. MCC is generally regarded as a balanced measure which can be used in binary classification even if the classes are very different in size. The coefficient takes into account true negatives, true positives, false negatives and false positives. This reliable measure produces high scores only if the prediction returns good rates for all four of these categories.
     
Like most correlation coefficients, MCC ranges between -1 and 1:

- 1 is perfect agreement between actuals and predictions,
- 0 is no agreement at all; in other words, the prediction is random with respect to the actuals,
- -1 is total disagreement between actuals and predictions.

>MCC for binary distribution
As MCC takes the full confusion matrix into account, its formula is:
*MCC = (TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))*
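A minimal sketch of the binary MCC, useful for checking the endpoints of its range. The helper name `mcc` and the three sets of counts are made up for illustration; returning 0 when the denominator is zero is a convention assumed here, not something stated in the text.

```python
import math


def mcc(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from the four counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: 0 when a whole row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0


print(mcc(tp=50, tn=50, fp=0, fn=0))     # perfect predictions
print(mcc(tp=0, tn=0, fp=50, fn=50))     # every prediction is wrong
print(mcc(tp=25, tn=25, fp=25, fn=25))   # no better than chance
```

The three calls cover the best case, the worst case, and the chance-level case of the range described above.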


Imagine we have a classifier that is trying to differentiate between cats and dogs based on their images. Below is the confusion matrix:
    
| Actual (rows) / Predicted (columns) | Cat | Dog | Total |
| :- | -: | -: | -: |
| **Cat** | 109 | 1 | 110 |
| **Dog** | 9 | 1 | 10 |
| **Total** | 118 | 2 | 120 |
    
We can see that only 1 out of 10 dogs was predicted correctly, while 109 cats were correctly identified.

Scenario 1: cats are positive, dogs are negative

| Evaluation Metric | Value |
| :- | -: |
| **Accuracy** | (109 + 1) / 120 = 0.92 |
| **Precision** | 109 / (109 + 9) = 0.92 |
| **Recall** | 109 / (109 + 1) = 0.99 |
| **F1 Score** | 0.96 |

Scenario 2: cats are negative, dogs are positive

| Evaluation Metric | Value |
| :- | -: |
| **Accuracy** | (109 + 1) / 120 = 0.92 |
| **Precision** | 1 / (1 + 1) = 0.50 |
| **Recall** | 1 / (9 + 1) = 0.10 |
| **F1 Score** | 0.17 |


The F1 score does not take TN into account at all, which is why it drops so sharply when the positive class is switched from cats to dogs.

Using the MCC formula we get **MCC = (109 * 1 - 9 * 1) / sqrt(118 * 110 * 10 * 2) ≈ 0.20**, the same in both scenarios. This low value shows that, even with very good results on the majority class, the overall score is poor because the other class was not well predicted (only 1 out of 10 dogs was predicted correctly).
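The cat/dog example can be recomputed for both scenarios to see which metrics depend on the choice of positive class. This is a minimal sketch; `summarize` is a hypothetical helper and the counts are read off the confusion matrix above (rows taken as actual classes).

```python
import math


def summarize(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1 and MCC from binary counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


cats_positive = summarize(tp=109, tn=1, fp=9, fn=1)   # Scenario 1
dogs_positive = summarize(tp=1, tn=109, fp=1, fn=9)   # Scenario 2

for name in cats_positive:
    print(f"{name:9s} {cats_positive[name]:.3f}  {dogs_positive[name]:.3f}")
```

Accuracy and MCC are unchanged when the positive class is swapped, whereas precision, recall and F1 change because they ignore TN.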

#### Q2. Evaluate the performance of the predictions using different evaluation metrics

In [1]:
#reading the Excel file
import pandas as pd
import numpy as np

df = pd.read_excel("dataset.xlsx")
print(df)

     Predicted Class  Actual Class
0                  0             0
1                  0             0
2                  1             0
3                  0             0
4                  0             0
..               ...           ...
155                3             3
156                3             3
157                3             3
158                2             3
159                0             3

[160 rows x 2 columns]


In [2]:
#renaming the columns
df.rename({'Predicted Class':'y_predicted','Actual Class':'y_actual'},axis=1,inplace=True)
print(df)

     y_predicted  y_actual
0              0         0
1              0         0
2              1         0
3              0         0
4              0         0
..           ...       ...
155            3         3
156            3         3
157            3         3
158            2         3
159            0         3

[160 rows x 2 columns]


**computing tp,fp,tn,fn from confusion matrix**
![image.png](attachment:image.png)

In [3]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
import math

print("\x1b[1;31m"+"classification report"+"\x1b[0m")
print(classification_report(df['y_actual'],df['y_predicted'],labels=[0,1,2,3],digits=3))

#print total number classes
total_classes=df["y_actual"].unique().size
print("\x1b[1;31m"+"total number of classes: "+"\x1b[0m"+str(total_classes))
#print sample size
sample_size=df["y_actual"].size
print("\x1b[1;31m"+"sample size: "+"\x1b[0m"+str(sample_size))

print("\n\x1b[1;31m"+"confusion matrix"+"\x1b[0m")
#collect the class labels into a sorted list
label=sorted(df["y_actual"].unique())
#preparing the confusion matrix with the built-in library function (labels must be passed by keyword)
cf_matrix=confusion_matrix(df["y_actual"],df["y_predicted"],labels=label)
#preparing the confusion matrix with the help of cross tab
cf_matrix_crosstab=pd.crosstab(df["y_actual"],df["y_predicted"])
print(cf_matrix_crosstab)
print(cf_matrix)

#tn, fp, fn, tp = confusion_matrix(df["y_actual"],df["y_predicted"]).ravel() works for binary classification
#computing the true positive
true_positive=cf_matrix.diagonal()
print("\n\x1b[1;31m"+"true positive: "+"\x1b[0m"+str(true_positive))
#computing the false positive
false_positive=cf_matrix.sum(axis=0)-cf_matrix.diagonal()
print("\x1b[1;31m"+"false positive: "+"\x1b[0m"+str(false_positive))
#computing the false negative
false_negative=cf_matrix.sum(axis=1)-cf_matrix.diagonal()
print("\x1b[1;31m"+"false negative: "+"\x1b[0m"+str(false_negative))
#computing the true negative
true_negative=sample_size-(cf_matrix.sum(axis=1)+cf_matrix.sum(axis=0)-cf_matrix.diagonal())
print("\x1b[1;31m"+"true negative: "+"\x1b[0m"+str(true_negative))

#computing the overall accuracy (sum of correct predictions of all classes/sample size)
overall_accuracy=str(cf_matrix.diagonal().sum()/cf_matrix.sum())
print("\n\x1b[1;31m"+"overall accuracy : "+"\x1b[0m"+overall_accuracy)

#class accuracy for balanced data
#class_accuracy=(true_positive+true_negative)/(true_positive+true_negative+false_positive+false_negative)
#print("\x1b[1;31m"+"class accuracy (tp+tn)/(tp+tn+fp+fn): "+"\x1b[0m"+str(class_accuracy))

#class accuracy for imbalanced data (correct predictions/total predictions)
class_accuracy=(true_positive)/cf_matrix.sum(axis=1)
print("\x1b[1;31m"+"class accuracy (correct predictions/total predictions): "+"\x1b[0m"+str(class_accuracy))
cm = cf_matrix.astype('float') / cf_matrix.sum(axis=1)[:, np.newaxis]
print("\x1b[1;31m"+"class accuracy from library: "+"\x1b[0m"+str(cm.diagonal()))

#computing the precision, reducing the false positive
precision=true_positive/(true_positive+false_positive)
print("\x1b[1;31m"+"class precision (tp/(tp+fp)): "+"\x1b[0m"+str(precision))
                         
#computing the recall, reducing the false negative
recall=true_positive/(true_positive+false_negative)
print("\x1b[1;31m"+"class recall (tp/(tp+fn)): "+"\x1b[0m"+str(recall))

#computing the macro average precision (unweighted mean of the per-class precision)
macro_average=precision.mean()
print("\n\x1b[1;31m"+"macro average: "+"\x1b[0m"+str(macro_average))
print("\x1b[1;31m"+"macro average: "+"\x1b[0m"+str(precision_score(df['y_actual'],df['y_predicted'], average='macro')))

#computing the weighted average precision
#each class is weighted by its support (number of actual samples, i.e. the row sum),
#which is what sklearn's average='weighted' does
support=cf_matrix.sum(axis=1)
weighted_average=(support*precision).sum()/support.sum()
print("\x1b[1;31m"+"weighted average: "+"\x1b[0m"+str(weighted_average))
print("\x1b[1;31m"+"weighted average: "+"\x1b[0m"+str(precision_score(df['y_actual'],df['y_predicted'], average='weighted')))

#computing the f1 score
f1score=2*((recall*precision)/(recall+precision))
print("\n\x1b[1;31m"+"f1 score: "+"\x1b[0m"+str(f1score))

# computing the false positive rate (fpr), i.e. the type 1 error
print("\x1b[1;31m"+"type1 error (Or) fpr fp/(tn+fp): "+"\x1b[0m"+str(false_positive/(true_negative+false_positive)))
# computing the false negative rate (fnr), i.e. the type 2 error
print("\x1b[1;31m"+"type2 error (Or) fnr fn/(fn+tp): "+"\x1b[0m"+str(false_negative/(false_negative+true_positive)))


classification report
              precision    recall  f1-score   support

           0      0.846     0.673     0.750        49
           1      0.711     0.889     0.790        36
           2      0.744     0.725     0.734        40
           3      0.730     0.771     0.750        35

    accuracy                          0.756       160
   macro avg      0.758     0.765     0.756       160
weighted avg      0.765     0.756     0.755       160

total number of classes: 4
sample size: 160

confusion matrix
y_predicted   0   1   2   3
y_actual                   
0            33   8   4   4
1             1  32   3   0
2             3   2  29   6
3             2   3   3  27
[[33  8  4  4]
 [ 1 32  3  0]
 [ 3  2 29  6]
 [ 2  3  3 27]]

true positive: [33 32 29 27]
false positive: [ 6 13 10 10]
false negative: [16  4 11  8]
true negative: [105 111 110 115]

overall accuracy : 0.75625



|       metric        |        manual evaluation     |      sklearn                 |
|:--------------------|:-----------------------------|:-----------------------------|
| overall accuracy    |          0.75625             |         0.75625              |
| class 0 accuracy    |          0.67346939          |         0.67346939           |
| class 1 accuracy    |          0.88888889          |         0.88888889           |
| class 2 accuracy    |          0.725               |         0.725                |
| class 3 accuracy    |          0.77142857          |         0.77142857           |
| confusion matrix    |                              |                              |
|                     |          33  8  4  4         |         33  8  4  4          |
|                     |          1 32  3  0          |         1 32  3  0           |
|                     |          3  2 29  6          |         3  2 29  6           |
|                     |          2  3  3 27          |         2  3  3 27           |
|                     |                              |                              |
|                     |                              |                              |
|                     |                              |                              |
| class 0 precision   |   0.8461538461538461         |   0.846                      |
| class 1 precision   |   0.7111111111111111         |   0.711                      |
| class 2 precision   |   0.7435897435897436         |   0.744                      |
| class 3 precision   |   0.7297297297297297         |   0.730                      |
|                     |                              |                              |
| class 0 recall      |   0.673469387755102          |   0.673                      |
| class 1 recall      |   0.8888888888888888         |   0.889                      |
| class 2 recall      |   0.725                      |   0.725                      |
| class 3 recall      |   0.7714285714285715         |   0.771                      |
|                     |                              |                              |
| macro average       |   0.7576461076461076         |   0.7576461076461076         |
| weighted average    |   0.7646604296604297         |   0.7646604296604297         |
|                     |                              |                              |
| class 0 f1 score    |   0.75                       |   0.750                      |
| class 1 f1 score    |   0.7901234567901234         |   0.790                      |
| class 2 f1 score    |   0.7341772151898733         |   0.734                      |
| class 3 f1 score    |   0.75                       |   0.750                      |
|                     |                              |                              |
| class 0 type1 error |   0.05405405405405406        |                              |
| class 1 type1 error |   0.10483870967741936        |                              |
| class 2 type1 error |   0.08333333333333333        |                              |
| class 3 type1 error |   0.08                       |                              |
|                     |                              |                              |
| class 0 type2 error |   0.32653061224489793        |   0.32653061224489793        |
| class 1 type2 error |   0.1111111111111111         |   0.1111111111111111         |
| class 2 type2 error |   0.275                      |   0.275                      |
| class 3 type2 error |   0.22857142857142856        |   0.22857142857142856        |
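The per-class counts and the two precision averages in the table can be re-derived from the confusion matrix alone. The sketch below is plain Python and copies the matrix from the output above (rows are actual classes, columns are predicted classes); all variable names are chosen here for illustration.

```python
cm = [[33, 8, 4, 4],
      [1, 32, 3, 0],
      [3, 2, 29, 6],
      [2, 3, 3, 27]]   # rows = actual, columns = predicted

n = sum(sum(row) for row in cm)
classes = range(len(cm))
support = [sum(row) for row in cm]          # actual samples per class
predicted = [sum(col) for col in zip(*cm)]  # predictions per class

# One-vs-rest counts for each class.
tp = [cm[k][k] for k in classes]
fp = [predicted[k] - tp[k] for k in classes]
fn = [support[k] - tp[k] for k in classes]
tn = [n - tp[k] - fp[k] - fn[k] for k in classes]

precision = [tp[k] / predicted[k] for k in classes]
macro_precision = sum(precision) / len(precision)
# Weighted average: each class is weighted by its support.
weighted_precision = sum(s * p for s, p in zip(support, precision)) / n

print(tp, fp, fn, tn)
print(round(macro_precision, 4), round(weighted_precision, 4))
```

Note that weighting each class precision by its number of predictions instead of its support cancels the denominators and simply returns the overall accuracy, 121 / 160 = 0.75625, so the choice of weights matters.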


In [4]:
#from sklearn import jaccard_similarity_score
#from pandas_ml import ConfusionMatrix
#cm = ConfusionMatrix(df["y_actual"], df["y_predicted"])
#cm.print_stats()

In [5]:
from sklearn.metrics import cohen_kappa_score

print(cohen_kappa_score(df['y_actual'], df['y_predicted']))
print(cohen_kappa_score(df['y_predicted'], df['y_actual']))

0.6752368064952639
0.6752368064952639
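The kappa value printed above can be reproduced by hand from the confusion matrix using k = (Po - Pe) / (1 - Pe). This is a minimal sketch in plain Python; the matrix is copied from the earlier output and the variable names are chosen here for illustration.

```python
cm = [[33, 8, 4, 4],
      [1, 32, 3, 0],
      [3, 2, 29, 6],
      [2, 3, 3, 27]]   # rows = actual, columns = predicted

n = sum(sum(row) for row in cm)
actual_totals = [sum(row) for row in cm]
predicted_totals = [sum(col) for col in zip(*cm)]

# Observed agreement: the overall accuracy.
p_o = sum(cm[k][k] for k in range(len(cm))) / n
# Chance agreement: product of the marginal proportions, summed over classes.
p_e = sum(a * p for a, p in zip(actual_totals, predicted_totals)) / n ** 2

kappa = (p_o - p_e) / (1 - p_e)
print(f"Po = {p_o:.5f}, Pe = {p_e:.5f}, kappa = {kappa:.4f}")
```

Swapping the two arguments of `cohen_kappa_score` only transposes the matrix, which leaves both Po and Pe unchanged; that is why the two printed values are identical.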


In [6]:
from sklearn.metrics import matthews_corrcoef

#one-vs-rest MCC for each class, from the counts computed earlier
for i in range(4):
    tp, tn = true_positive[i], true_negative[i]
    fp, fn = false_positive[i], false_negative[i]
    denominator = math.sqrt((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn))
    print(round((tp*tn-fp*fn)/denominator, 4))

#multiclass MCC over the full confusion matrix
matthews_corrcoef(df['y_actual'], df['y_predicted'])

0.665
0.7282
0.6472
0.6779


0.6785227135925035
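The last value is not an average of the four one-vs-rest values; it appears to come from the multiclass generalization of MCC, which works on the whole confusion matrix. The sketch below assumes that generalization: with c correct predictions, s samples, t_k actual and p_k predicted counts per class, *MCC = (c * s - sum(p_k * t_k)) / sqrt((s^2 - sum(p_k^2)) * (s^2 - sum(t_k^2)))*. The matrix is copied from the earlier output.

```python
import math

cm = [[33, 8, 4, 4],
      [1, 32, 3, 0],
      [3, 2, 29, 6],
      [2, 3, 3, 27]]   # rows = actual, columns = predicted

s = sum(sum(row) for row in cm)               # total samples
c = sum(cm[k][k] for k in range(len(cm)))     # correctly predicted samples
t = [sum(row) for row in cm]                  # times each class truly occurs
p = [sum(col) for col in zip(*cm)]            # times each class is predicted

numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
denominator = math.sqrt(
    (s * s - sum(pk * pk for pk in p)) * (s * s - sum(tk * tk for tk in t))
)
multiclass_mcc = numerator / denominator
print(round(multiclass_mcc, 4))
```

For two classes this expression reduces to the binary MCC formula, so the per-class values and the multiclass value measure related but different things.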

Reference links:<br/>
https://towardsdatascience.com/multi-class-metrics-made-simple-part-i-precision-and-recall-9250280bddc2 <br/>
https://www.analyticsvidhya.com/blog/2020/04/confusion-matrix-machine-learning/ <br/>
https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics <br/>
https://machinelearningmastery.com/fbeta-measure-for-machine-learning/ <br/>
https://thenewstack.io/cohens-kappa-what-it-is-when-to-use-it-and-how-to-avoid-its-pitfalls/ <br/>
https://analyticsindiamag.com/understanding-cohens-kappa-score-with-hands-on-implementation/ <br/>
https://towardsdatascience.com/metrics-and-python-850b60710e0c <br>
https://towardsdatascience.com/metrics-and-python-ii-2e49597964ff