For simplicity’s sake, let’s treat our multi-class problem as a 3-class classification problem. Say we have a dataset with three class labels: Apple, Orange and Mango. The following is a possible confusion matrix for these classes.

![https://miro.medium.com/max/720/1*yH2SM0DIUQlEiveK42NnBg.png](https://miro.medium.com/max/720/1*yH2SM0DIUQlEiveK42NnBg.png)

Unlike binary classification, a multi-class problem has no single positive or negative class. That can make TP, TN, FP and FN look hard to define at first, but it’s actually pretty easy: we find TP, TN, FP and FN for each individual class, treating that class as positive and all the other classes as negative.

For example, take class Apple. Reading the confusion matrix (rows are predicted classes, columns are true classes), the counts are:

- TP = 7
- TN = (2+3+2+1) = 8
- FP = (8+9) = 17
- FN = (1+3) = 4
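The same bookkeeping can be sketched in a few lines of plain Python. The matrix below is reconstructed from the counts quoted in the text (rows = predicted, columns = true), so treat it as an assumption if your copy of the figure differs; `one_vs_rest_counts` is a hypothetical helper name, not a library function.

```python
# Confusion matrix reconstructed from the counts in the text (an assumption):
# rows = predicted class, columns = true class.
labels = ["Apple", "Orange", "Mango"]
cm = [
    [7, 8, 9],  # predicted Apple
    [1, 2, 3],  # predicted Orange
    [3, 2, 1],  # predicted Mango
]

def one_vs_rest_counts(cm, k):
    """Return (TP, TN, FP, FN) for class index k, treating k as the positive class."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fp = sum(cm[k]) - tp                 # predicted k, actually another class
    fn = sum(row[k] for row in cm) - tp  # actually k, predicted as another class
    tn = total - tp - fp - fn            # everything that does not involve class k
    return tp, tn, fp, fn

for k, name in enumerate(labels):
    tp, tn, fp, fn = one_vs_rest_counts(cm, k)
    print(f"{name}: TP={tp} TN={tn} FP={fp} FN={fn}")
```

For class Apple this reproduces the four counts listed above.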

With these four counts from the confusion matrix, we can now calculate the performance measures for class Apple:

- Precision = 7/(7+17) = 0.29
- Recall = 7/(7+4) = 0.64
- F1-score = 2 × (0.29 × 0.64)/(0.29 + 0.64) = 0.40

Similarly, we can calculate the measures for the other classes. Here is a table that shows the values of each measure for each class.

![https://miro.medium.com/max/720/1*X1ghULso7P3AdMalomM6yg.png](https://miro.medium.com/max/720/1*X1ghULso7P3AdMalomM6yg.png)
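The per-class values can be recomputed with a short sketch like the one below. The (TP, FP, FN) triples are the counts quoted in the text; `precision_recall_f1` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumed convention, not something the example requires.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from one-vs-rest counts (0.0 if undefined; an assumption)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# (TP, FP, FN) per class, taken from the counts in the text
counts = {"Apple": (7, 17, 4), "Orange": (2, 4, 10), "Mango": (1, 5, 12)}

scores = {name: precision_recall_f1(*c) for name, c in counts.items()}
for name, (p, r, f1) in scores.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```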

Now we can do more with these measures. We can combine the F1-scores of the individual classes into a single measure for the whole model. There are a few ways to do that; let’s look at them now.

# Micro F1
This is the micro-averaged F1-score. It is calculated from the total TP, total FP and total FN of the model. It does not consider each class individually; it calculates the metrics globally. So for our example,

- Total TP = (7+2+1) = 10
- Total FP = (8+9)+(1+3)+(3+2) = 26
- Total FN = (1+3)+(8+2)+(9+3) = 26

Hence,

- Precision = 10/(10+26) = 0.28
- Recall = 10/(10+26) = 0.28

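Now we can plug this precision and recall into the regular F1-score formula, F1 = 2 × (Precision × Recall)/(Precision + Recall), to get the Micro F1-score.

- Micro F1 = 2 × (0.28 × 0.28)/(0.28 + 0.28) = 0.28

Because every sample has exactly one true label and one predicted label, each misclassification counts as one FP and one FN. Micro precision, micro recall and Micro F1 are therefore always equal in this setting, and they also equal the accuracy (here 10/36 = 0.28).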
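As a minimal sketch of the micro-averaging arithmetic, the snippet below pools the per-class counts from the text before computing any ratio. The variable names are illustrative only.

```python
# (TP, FP, FN) per class, taken from the counts in the text
counts = {"Apple": (7, 17, 4), "Orange": (2, 4, 10), "Mango": (1, 5, 12)}

# Pool the counts over all classes first, then compute the ratios once
total_tp = sum(tp for tp, fp, fn in counts.values())
total_fp = sum(fp for tp, fp, fn in counts.values())
total_fn = sum(fn for tp, fp, fn in counts.values())

micro_precision = total_tp / (total_tp + total_fp)
micro_recall = total_tp / (total_tp + total_fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print(f"Total TP={total_tp} FP={total_fp} FN={total_fn}")
print(f"Micro precision={micro_precision:.2f} recall={micro_recall:.2f} F1={micro_f1:.2f}")
```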

# Macro F1
This is the macro-averaged F1-score. It calculates the metric for each class individually and then takes the unweighted mean of the per-class values. As we have seen in the figure “Precision, Recall and F1-score for Each Class”,

- Class Apple F1-score = 0.40
- Class Orange F1-score = 0.22
- Class Mango F1-score = 0.11

Hence,

- Macro F1 = (0.40+0.22+0.11)/3 = 0.24
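The macro average can be sketched as follows, again starting from the per-class counts quoted in the text. The helper `f1_from_counts` is hypothetical; it uses the identity F1 = 2TP/(2TP + FP + FN), which is equivalent to the usual precision and recall formula.

```python
# (TP, FP, FN) per class, taken from the counts in the text
counts = {"Apple": (7, 17, 4), "Orange": (2, 4, 10), "Mango": (1, 5, 12)}

def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of counts; same value as 2PR/(P+R)."""
    return 2 * tp / (2 * tp + fp + fn)

per_class_f1 = {name: f1_from_counts(*c) for name, c in counts.items()}

# Unweighted mean: every class contributes equally, whatever its size
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"Macro F1 = {macro_f1:.2f}")
```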

# Weighted F1
The last one is the weighted-average F1-score. Unlike Macro F1, it takes a weighted mean of the per-class scores. The weight of each class is the number of true samples of that class (its support). Since we had 11 Apples, 12 Oranges and 13 Mangoes,

- Weighted F1 = ((0.40*11)+(0.22*12)+(0.11*13))/(11+12+13) = 0.24

This uses the per-class scores rounded to two decimals; with unrounded scores the result is about 0.234.
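Here is a small sketch of the weighted average, computed once from unrounded per-class scores and once from scores rounded to two decimals as in the hand calculation. The counts come from the text; the variable names are illustrative.

```python
# (TP, FP, FN) per class, taken from the counts in the text
counts = {"Apple": (7, 17, 4), "Orange": (2, 4, 10), "Mango": (1, 5, 12)}

f1 = {name: 2 * tp / (2 * tp + fp + fn) for name, (tp, fp, fn) in counts.items()}
# Support = number of true samples of the class = TP + FN
support = {name: tp + fn for name, (tp, fp, fn) in counts.items()}
total = sum(support.values())

# Weighted mean of the unrounded per-class F1-scores
weighted_f1 = sum(f1[n] * support[n] for n in counts) / total
# Same formula, but with per-class scores rounded first (as in the hand calculation)
weighted_f1_rounded_inputs = sum(round(f1[n], 2) * support[n] for n in counts) / total

print(f"Weighted F1 (unrounded inputs) = {weighted_f1:.3f}")
print(f"Weighted F1 (rounded inputs)   = {weighted_f1_rounded_inputs:.3f}")
```

The two results differ slightly, which is why it is safer to round only the final value.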

```python
# Import a 3-class dataset from sklearn's toy datasets
from sklearn.datasets import load_wine

dataset = load_wine()
X = dataset.data
y = dataset.target

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Split the data, fit an RBF-kernel SVM and predict on the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
svc = SVC(kernel='rbf', C=1).fit(X_train, y_train)
y_pred = svc.predict(X_test)

# Confusion matrix
from sklearn.metrics import confusion_matrix
confusion = confusion_matrix(y_test, y_pred)
print('Confusion Matrix\n')
print(confusion)

# Accuracy, precision, recall and F1-score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('\nAccuracy: {:.2f}\n'.format(accuracy_score(y_test, y_pred)))

print('Micro Precision: {:.2f}'.format(precision_score(y_test, y_pred, average='micro')))
print('Micro Recall: {:.2f}'.format(recall_score(y_test, y_pred, average='micro')))
print('Micro F1-score: {:.2f}\n'.format(f1_score(y_test, y_pred, average='micro')))

print('Macro Precision: {:.2f}'.format(precision_score(y_test, y_pred, average='macro')))
print('Macro Recall: {:.2f}'.format(recall_score(y_test, y_pred, average='macro')))
print('Macro F1-score: {:.2f}\n'.format(f1_score(y_test, y_pred, average='macro')))

print('Weighted Precision: {:.2f}'.format(precision_score(y_test, y_pred, average='weighted')))
print('Weighted Recall: {:.2f}'.format(recall_score(y_test, y_pred, average='weighted')))
print('Weighted F1-score: {:.2f}'.format(f1_score(y_test, y_pred, average='weighted')))

# Per-class summary
from sklearn.metrics import classification_report
print('\nClassification Report\n')
print(classification_report(y_test, y_pred, target_names=['Class 1', 'Class 2', 'Class 3']))
```

```
Confusion Matrix

[[15  0  1]
 [ 0 17  4]
 [ 0  3  5]]

Accuracy: 0.82

Micro Precision: 0.82
Micro Recall: 0.82
Micro F1-score: 0.82

Macro Precision: 0.78
Macro Recall: 0.79
Macro F1-score: 0.78

Weighted Precision: 0.84
Weighted Recall: 0.82
Weighted F1-score: 0.83

Classification Report

              precision    recall  f1-score   support

     Class 1       1.00      0.94      0.97        16
     Class 2       0.85      0.81      0.83        21
     Class 3       0.50      0.62      0.56         8

    accuracy                           0.82        45
   macro avg       0.78      0.79      0.78        45
weighted avg       0.84      0.82      0.83        45
```



**Note:** Scikit-Learn treats the rows of the confusion matrix as the “true class” and the columns as the “predicted class.” This is the opposite of the layout we used for the Apple, Orange and Mango example, but the logic is the same. You can lay out the true and predicted classes either way, but if you are using Scikit-Learn, you have to play by its rules.
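To see what this convention means in practice, the sketch below takes the confusion matrix printed by the Scikit-Learn run above and recomputes the per-class precision and recall by hand. With rows as true classes, recall divides by row sums and precision divides by column sums. The variable names are illustrative only.

```python
# Confusion matrix printed by the Scikit-Learn run above:
# rows = true class, columns = predicted class.
cm = [
    [15, 0, 1],
    [0, 17, 4],
    [0, 3, 5],
]
n = len(cm)

row_sums = [sum(cm[i]) for i in range(n)]                       # true samples per class (support)
col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # predictions per class

recall = [cm[k][k] / row_sums[k] for k in range(n)]     # TP / (TP + FN)
precision = [cm[k][k] / col_sums[k] for k in range(n)]  # TP / (TP + FP)
accuracy = sum(cm[k][k] for k in range(n)) / sum(row_sums)

for k in range(n):
    print(f"Class {k + 1}: precision={precision[k]:.2f} recall={recall[k]:.2f} support={row_sums[k]}")
print(f"Accuracy = {accuracy:.2f}")
```

The row sums match the support column of the classification report, which is a quick way to check which axis holds the true classes.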