In [None]:
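import warnings
import numpy as np
from sklearn.metrics import (ConfusionMatrixDisplay, classification_report,
                             cohen_kappa_score, confusion_matrix)
warnings.filterwarnings('ignore')  # keep the notebook output free of library warnings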

# Introduction and aim of the notebook

Quadratic Weighted Kappa measures the **agreement between two outcomes in a classification problem (or between two annotators' ratings), taking into account the possibility of agreement by chance.**<br>
A score of `1` means the predictions and actuals agree perfectly, whereas `0` means the agreement is no better than what chance would produce (for instance, when the actuals are all 4s and we predict all 0s). Negative values are possible when the agreement is worse than chance.
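As a quick sanity check of these reference points, the sketch below calls `cohen_kappa_score` on two toy rating vectors. The vectors are made up for illustration and are not competition data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Toy ratings on a 0-4 scale (illustrative data only)
ratings = np.array([0, 1, 2, 3, 4])
reversed_ratings = ratings[::-1]

# Identical ratings: perfect agreement
perfect = cohen_kappa_score(ratings, ratings, weights="quadratic")

# Fully reversed ratings: systematic disagreement, worse than chance
opposite = cohen_kappa_score(ratings, reversed_ratings, weights="quadratic")

print(perfect, opposite)
```

The first score should be 1 and the second should be negative, which shows that 0 is not the lower bound of the metric.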

In this notebook, we will **explore the metric and its calculation**, so that the [sklearn implementation](https://github.com/scikit-learn/scikit-learn/blob/7db5b6a98/sklearn/metrics/_classification.py#L598) of `sklearn.metrics.cohen_kappa_score` with `weights='quadratic'` hopefully makes sense to you. The numbers in parentheses on the right mark the steps we will refer to later.

```
    confusion = confusion_matrix(y1, y2, labels=labels, sample_weight=sample_weight)    (1)
    n_classes = confusion.shape[0]      
    sum0 = np.sum(confusion, axis=0)                                                    (2)
    sum1 = np.sum(confusion, axis=1) 
    expected = np.outer(sum0, sum1) / np.sum(sum0)                                      

    if weights is None:
        w_mat = np.ones([n_classes, n_classes], dtype=int)
        w_mat.flat[:: n_classes + 1] = 0
    elif weights == "linear" or weights == "quadratic":
        w_mat = np.zeros([n_classes, n_classes], dtype=int)                             (3)
        w_mat += np.arange(n_classes)
        if weights == "linear":
            w_mat = np.abs(w_mat - w_mat.T)
        else:
            w_mat = (w_mat - w_mat.T) ** 2                                              
    else:
        raise ValueError("Unknown kappa weighting type.")

    k = np.sum(w_mat * confusion) / np.sum(w_mat * expected)                            (4)
    return 1 - k

```

# Quadratic Weighted Kappa

According to the **competition evaluation section**:

<img src='https://i.imgur.com/Dk8lQi0.png' width=750>

Let's try to make sense of all that, **breaking the metric into smaller steps.**

## Calculation

Let's assume we have two vectors representing the ground truth (actuals) and our classifier's predictions (preds). The numbers can also be thought of as labels or ratings given by a human; it doesn't change how we are going to build the metric.

In [None]:
actuals = np.array([0, 0, 4, 3, 2, 4, 1, 1, 2, 1])
preds   = np.array([0, 2, 3, 0, 0, 4, 1, 1, 3, 1])

The **first step (according to the Kaggle docs)** is to create a **confusion matrix O between the actuals and the predictions**, with one row per actual label and one column per predicted label. We'll be using `confusion_matrix` from `sklearn` as follows:

In [None]:
O = confusion_matrix(actuals, preds)
O

Let's visualize it to get a better sense of what is happening:

In [None]:
ConfusionMatrixDisplay.from_predictions(actuals, preds)

From the plot above, for instance, we see that we are doing **quite a good job predicting the label 1, not quite the same for label 3.**<br>
This first step is also what `sklearn` does in **(1)** above.
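To make the layout of `O` concrete, here is a minimal sketch that builds the same matrix by hand with plain NumPy. It assumes the labels are the integers 0 to 4 and that the number of classes is known in advance; `O_manual` is a name introduced only for this example.

```python
import numpy as np

actuals = np.array([0, 0, 4, 3, 2, 4, 1, 1, 2, 1])
preds = np.array([0, 2, 3, 0, 0, 4, 1, 1, 3, 1])

n_classes = 5  # labels 0..4, assumed known in advance
O_manual = np.zeros((n_classes, n_classes), dtype=int)

# Row index = actual label, column index = predicted label
for a, p in zip(actuals, preds):
    O_manual[a, p] += 1

print(O_manual)
print(np.trace(O_manual))  # number of correct predictions (the diagonal)
```

The diagonal holds the correct predictions, and everything off the diagonal is a disagreement that the weight matrix will penalize.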

The **2nd step** is to construct a **weight matrix `w`** that assigns a penalty to every (actual, predicted) pair of labels: **predictions that are far away from the actuals are penalized more than those closer to the actuals.**

In [None]:
# Number of classes: the confusion matrix is n x n
n = O.shape[0]

# Init zero weight matrix
w = np.zeros((n,n))
w

We'll **apply the formula reported in the evaluation section** as follows:

In [None]:
for i in range(n):
    for j in range(n):
        w[i][j] = ((i-j)**2)/(n-1)**2

In [None]:
np.round(w,2)

All values on the **diagonal have penalty 0** since we have perfect agreement, whereas **predictions and actuals further away from each other are penalised the most.**
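A few individual entries show how quickly the penalty grows. The `penalty` function below is a hypothetical helper written for this example; it simply evaluates the formula from the evaluation section for one cell.

```python
n = 5  # number of classes in the running example

def penalty(i, j, n):
    """Quadratic weight for actual label i and predicted label j (illustrative helper)."""
    return (i - j) ** 2 / (n - 1) ** 2

print(penalty(2, 2, n))  # 0.0     exact match
print(penalty(2, 3, n))  # 0.0625  off by one
print(penalty(2, 4, n))  # 0.25    off by two
print(penalty(0, 4, n))  # 1.0     largest possible distance
```

Being off by two labels costs four times as much as being off by one, which is the "quadratic" part of the metric.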

`sklearn` comes up with that matrix in a different (and elegant) way using **(3)**, reported here:

```
w_mat = np.zeros([n_classes, n_classes], dtype=int)
w_mat += np.arange(n_classes)
if weights == "linear":
    w_mat = np.abs(w_mat - w_mat.T)
else:
    w_mat = (w_mat - w_mat.T) ** 2
```

Let's try to understand what it does. Let's recreate the empty matrix and create an increasing range of numbers:

In [None]:
w = np.zeros((n,n))
w += np.arange(n)

w

After this step every row of `w` is `[0, 1, ..., n-1]`, so `w[i][j]` holds the column index `j`, which plays the role of `j` in our double `for` loop.
Then we have:

In [None]:
w.T

This is just the transpose of `w`, where entry `[i][j]` holds the row index `i`, so that this calculation:

In [None]:
w = (w - w.T) ** 2
w

resembles our

```
for i in range(n):
    for j in range(n):
        w[i][j] = ((i-j)**2)
```
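The equivalence can be checked directly. This is a small sketch comparing the explicit loop with the `sklearn`-style construction; `w_loop`, `grid` and `w_vec` are names introduced only for this comparison.

```python
import numpy as np

n = 5

# Explicit double loop (unnormalized squared distance)
w_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        w_loop[i][j] = (i - j) ** 2

# sklearn-style: every row of grid is [0, 1, ..., n-1],
# so grid[i, j] = j and grid.T[i, j] = i
grid = np.zeros((n, n))
grid += np.arange(n)
w_vec = (grid - grid.T) ** 2

print(np.array_equal(w_loop, w_vec))
```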

We simply need to divide by $(N-1)^2$ to get the exact same result as before. However, this is not strictly necessary (which is why `sklearn` skips it): later we'll compute $\sum w \cdot O / \sum w \cdot E$, where the factor $1/(N-1)^2$ appears in both the numerator and the denominator and cancels out. Let's stick to the plan and do it anyway.

In [None]:
w = w / (n-1)**2
w
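Written out, the cancellation looks like this, using $O_{ij}$ and $E_{ij}$ for the entries of the observed and expected matrices that the metric compares:

$$
\frac{\sum_{i,j} \frac{(i-j)^2}{(N-1)^2}\, O_{ij}}{\sum_{i,j} \frac{(i-j)^2}{(N-1)^2}\, E_{ij}}
= \frac{\frac{1}{(N-1)^2}\sum_{i,j} (i-j)^2\, O_{ij}}{\frac{1}{(N-1)^2}\sum_{i,j} (i-j)^2\, E_{ij}}
= \frac{\sum_{i,j} (i-j)^2\, O_{ij}}{\sum_{i,j} (i-j)^2\, E_{ij}}
$$

Any constant multiplied into every weight leaves the final score unchanged, so the normalized and unnormalized weight matrices give the same Kappa.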

The **3rd step is to calculate E**, which is the **outer product between the vector of actual label counts and the vector of predicted label counts**, divided by the total number of instances. This is **(2)** in the sklearn code:

In [None]:
actual_sum = np.sum(O, axis=1)
predicted_sum = np.sum(O, axis=0)

In [None]:
print(f'Actuals value counts: {actual_sum}, Prediction value counts: {predicted_sum}')

In the actuals (rows of the confusion matrix), we can in fact see that we have:
- 2 values with label 0
- 3 values with label 1
- 2 values with label 2
- 1 value with label 3
- 2 values with label 4
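These totals are simply the label frequencies of each vector, so they can be cross-checked without the confusion matrix. A minimal sketch using `np.bincount`, assuming the labels are the non-negative integers 0 to 4:

```python
import numpy as np

actuals = np.array([0, 0, 4, 3, 2, 4, 1, 1, 2, 1])
preds = np.array([0, 2, 3, 0, 0, 4, 1, 1, 3, 1])

# Count how often each label 0..4 appears in each vector
actual_counts = np.bincount(actuals, minlength=5)
pred_counts = np.bincount(preds, minlength=5)

print(actual_counts)  # [2 3 2 1 2]  -> row sums of the confusion matrix
print(pred_counts)    # [3 3 1 2 1]  -> column sums of the confusion matrix
```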

In [None]:
E = np.outer(actual_sum, predicted_sum)/O.sum()
E

**But why do we need the outer product?**<br>
Let's try to understand first what we mean with **expected outcome.**

For each cell, this is the count we would **expect to see if the actuals and the predictions were independent of each other**, given only the row and column totals of the confusion matrix. Given the confusion matrix calculated before:

In [None]:
ConfusionMatrixDisplay.from_predictions(actuals, preds)

The expected outcome is calculated as follows, starting from the top-left cell:
- 2 (1 + 1 = 2) instances were labeled as 0 according to ground truth (first row), and 3 (1 + 1 + 1) instances were classified as 0 by the classifier (first column). Considering 10 total instances in the confusion matrix, this results in a value of 0.6 (2 * 3 / 10 = 0.6).

Moving to the right:
- 2 instances were labeled as 0 according to ground truth, and 3 instances were classified as 1 by the classifier. We divide by 10 again and still obtain 0.6 (2 * 3 / 10 = 0.6).

And so on.
**Multiplying every row total by every column total is exactly the outer product between two vectors!**<br>
If we have two vectors:

$$
\mathbf{v}=\left[\begin{array}{c}
v_{1} \\
v_{2} \\
\vdots \\
v_{n}
\end{array}\right], \mathbf{w}=\left[\begin{array}{c}
w_{1} \\
w_{2} \\
\vdots \\
w_{m}
\end{array}\right]
$$

the outer product is:

$$
\mathbf{v} \otimes \mathbf{w}=\left[\begin{array}{cccc}
v_{1} w_{1} & v_{1} w_{2} & \cdots & v_{1} w_{m} \\
v_{2} w_{1} & v_{2} w_{2} & \cdots & v_{2} w_{m} \\
\vdots & \vdots & \ddots & \vdots \\
v_{n} w_{1} & v_{n} w_{2} & \cdots & v_{n} w_{m}
\end{array}\right]
$$
If we then divide by the total number of instances (10), we **get the expected outcome for each cell.**
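The cell-by-cell arithmetic above can be written as a double loop and compared with `np.outer`. This is a sketch with the row and column totals of the running example typed in by hand; `E_loop` and `E_outer` are names used only here.

```python
import numpy as np

actual_sum = np.array([2, 3, 2, 1, 2])     # row totals of the confusion matrix
predicted_sum = np.array([3, 3, 1, 2, 1])  # column totals of the confusion matrix
total = actual_sum.sum()                   # 10 instances

# Cell by cell: row total * column total / total number of instances
E_loop = np.zeros((5, 5))
for i in range(5):
    for j in range(5):
        E_loop[i, j] = actual_sum[i] * predicted_sum[j] / total

# Same thing in one call
E_outer = np.outer(actual_sum, predicted_sum) / total

print(E_loop[0, 0], E_loop[0, 1])  # 0.6 0.6
print(np.allclose(E_loop, E_outer))
```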

Let's now **normalize E and O**:

In [None]:
O = O/O.sum()
E = E/E.sum()
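One detail worth checking: before normalization, `E` already sums to the same total as `O`, so dividing both by their sums rescales numerator and denominator by the same factor. The sketch below verifies this on the raw counts; `O_counts` and `E_counts` are names introduced here so that nothing from the cells above is overwritten.

```python
import numpy as np

# Raw confusion matrix of the running example (rows: actuals, columns: preds)
O_counts = np.array([[1, 0, 1, 0, 0],
                     [0, 3, 0, 0, 0],
                     [1, 0, 0, 1, 0],
                     [1, 0, 0, 0, 0],
                     [0, 0, 0, 1, 1]])
E_counts = np.outer(O_counts.sum(axis=1), O_counts.sum(axis=0)) / O_counts.sum()

print(O_counts.sum(), E_counts.sum())  # both totals are 10 (up to float rounding)

# Quadratic weights via broadcasting
n = O_counts.shape[0]
idx = np.arange(n)
w = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2

ratio_counts = np.sum(w * O_counts) / np.sum(w * E_counts)
ratio_normed = (np.sum(w * O_counts / O_counts.sum())
                / np.sum(w * E_counts / E_counts.sum()))
print(np.isclose(ratio_counts, ratio_normed))
```

This is consistent with the `sklearn` code shown at the top, which works on the raw counts and never normalizes the two matrices.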

And now we are ready to **calculate the Kappa as indicated by Kaggle!**

In [None]:
k = np.sum(w * O) / np.sum(w * E)
1-k

Let's compare this result with the `sklearn` implementation:

In [None]:
cohen_kappa_score(preds, actuals, weights='quadratic')
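Putting the steps together, the whole calculation fits in a short function. This is a minimal sketch and not the `sklearn` implementation: `quadratic_weighted_kappa` is a hypothetical helper that assumes integer labels in `0..n_classes-1`, and it skips the optional normalizations because they cancel out.

```python
import numpy as np

def quadratic_weighted_kappa(actuals, preds, n_classes):
    """Illustrative QWK for integer labels in 0..n_classes-1."""
    # Step 1: observed confusion matrix (rows: actuals, columns: preds)
    O = np.zeros((n_classes, n_classes))
    np.add.at(O, (actuals, preds), 1)

    # Step 2: quadratic weights (unnormalized, the constant cancels out)
    idx = np.arange(n_classes)
    w = (idx[:, None] - idx[None, :]) ** 2

    # Step 3: expected matrix from the row and column totals
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()

    # Step 4: one minus the ratio of weighted disagreements
    return 1 - np.sum(w * O) / np.sum(w * E)

actuals = np.array([0, 0, 4, 3, 2, 4, 1, 1, 2, 1])
preds = np.array([0, 2, 3, 0, 0, 4, 1, 1, 3, 1])

score = quadratic_weighted_kappa(actuals, preds, n_classes=5)
print(score)  # about 0.513
```

One difference to keep in mind: this helper takes the number of classes as an argument, so labels that never occur still get a row and a column in the matrices.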

# Intuitions and examples on imbalanced dataset

According to [Wikipedia](https://en.wikipedia.org/wiki/Cohen%27s_kappa#Interpreting-magnitude), Kappa values can be interpreted (according to Landis and Koch) as follows:

- ≤ 0 No agreement
- 0.01 – 0.20 Slight
- 0.21 – 0.40 Fair
- 0.41 – 0.60 Moderate
- 0.61 – 0.80 Substantial
- 0.81 – 1.00 Almost perfect
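If you want to attach these labels to scores programmatically, a small lookup is enough. `interpret_kappa` is a hypothetical helper, not part of `sklearn`; it assumes each band includes its upper bound, that anything at or below 0 counts as no agreement, and that anything above 0.80 counts as almost perfect.

```python
def interpret_kappa(kappa):
    """Map a Kappa value to a descriptive band (illustrative helper)."""
    if kappa <= 0:
        return "No agreement"
    # (upper bound, label) pairs, checked in increasing order
    bands = [(0.20, "Slight"), (0.40, "Fair"),
             (0.60, "Moderate"), (0.80, "Substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "Almost perfect"

print(interpret_kappa(0.18))  # Slight
print(interpret_kappa(0.51))  # Moderate
```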

What about **imbalanced datasets?** Let's try to create one:

In [None]:
actuals = np.concatenate([np.zeros(100000), np.ones(10)])
preds   = np.concatenate([np.zeros(100010)])

ConfusionMatrixDisplay.from_predictions(actuals, preds)

print(classification_report(actuals, preds))  # argument order is (y_true, y_pred)
print('Kappa score:', cohen_kappa_score(preds, actuals, weights='quadratic'))

Given that we are not able to predict any positive class, **Kappa = 0 is reassuring!** What about being able to predict correctly **1 instance of the positive class?**

In [None]:
actuals = np.concatenate([np.zeros(100000), np.ones(10)])
preds   = np.concatenate([np.zeros(100009), np.ones(1)])

ConfusionMatrixDisplay.from_predictions(actuals, preds)

print(classification_report(actuals, preds))  # argument order is (y_true, y_pred)
print('Kappa score:', cohen_kappa_score(preds, actuals, weights='quadratic'))

A Kappa of **about 0.18 indicates only slight agreement**, which is a good thing given that we are **still missing 9 of the 10 positive instances.**
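For contrast, it helps to put plain accuracy next to Kappa for the two scenarios. The sketch below recreates both prediction vectors; the scenario names and the `results` dictionary are just labels chosen for this example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

actuals = np.concatenate([np.zeros(100000), np.ones(10)])

scenarios = {
    "all negative": np.zeros(100010),
    "one positive found": np.concatenate([np.zeros(100009), np.ones(1)]),
}

results = {}
for name, preds in scenarios.items():
    acc = accuracy_score(actuals, preds)
    kappa = cohen_kappa_score(actuals, preds, weights="quadratic")
    results[name] = (acc, kappa)
    print(f"{name}: accuracy={acc:.5f}, kappa={kappa:.3f}")
```

Accuracy stays above 99.9% in both cases because the negative class dominates, while Kappa stays low because it discounts the agreement that the class imbalance produces by chance.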

# Conclusion

In conclusion, the Quadratic Weighted Kappa is a popular metric used to measure the inter-rater agreement between annotators (or labels, if we consider the truth and the predictions as annotators). It is a robust measure that can also take into account the severity of the disagreement thanks to the weighting part.

I hope this helped you better understand this competition's metric!

# References

- https://www.kaggle.com/code/aroraaman/quadratic-kappa-metric-explained-in-5-simple-steps/notebook
- https://datascience.stackexchange.com/questions/1108/kappa-near-to-60-in-unbalanced-110-data-set
- https://stats.stackexchange.com/questions/82162/cohens-kappa-in-plain-english