# Motivations Behind this Notebook
The metric of this competition is "quadratic weighted kappa". 

To solve the problem, it is necessary to understand how to improve the metric. 


To understand how to improve the metric, I first need to understand the metric itself.


So, I look into the metric in this notebook. 

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
MAX_RATING = 6 # the max score in this comp

# 1. Understand "kappa"
The quadratic weighted kappa consists of two parts: kappa and weights. I start with the kappa. 
Kappa is a metric for quantifying the agreement between two nominal categorical variables. 
Examples of nominal categorical variables are gender, color, and so on. Importantly, nominal variables have no order among their values.
Plain kappa is not suitable for this comp, because the essay scores in this comp are ordinal, not nominal. Therefore, "quadratic weighting" is necessary. I will explain it later!

The kappa is computed as follows:
$$\kappa = 1- \frac{\sum_{i \neq j} O_{i,j}}{\sum_{i \neq j} E_{i,j}} $$

To improve the kappa, we decrease $\frac{\sum_{i \neq j} O_{i,j}}{\sum_{i \neq j} E_{i,j}}$. The sums run over the off-diagonal cells only, so the term quantifies the disagreement between predicted scores and ground-truth scores, relative to the disagreement expected by chance.

- $O_{i,j}$ is the element at (i, j) of the observed agreement matrix.
- $E_{i,j}$ is the element at (i, j) of the expected agreement matrix.
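
As a sanity check on the formula, here is a minimal sketch with a hypothetical 2×2 table (the counts in `O` are made up for illustration and are not competition data). It assumes the sums run over the off-diagonal cells only, as the `compute_kappa` function later in this notebook does, and compares the result with the textbook form $(p_o - p_e)/(1 - p_e)$.

```python
import numpy as np

# Hypothetical 2x2 observed matrix (made-up counts for illustration).
O = np.array([[2.0, 1.0],
              [1.0, 2.0]])
n = O.sum()

# Expected matrix under independence: outer product of the marginals, rescaled to counts.
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / n

# Off-diagonal mask: 1 where i != j, 0 on the diagonal.
off_diag = 1 - np.eye(O.shape[0])

kappa = 1 - (off_diag * O).sum() / (off_diag * E).sum()

# Textbook form: (p_o - p_e) / (1 - p_e), using the diagonal (agreement) cells.
p_o = np.trace(O) / n
p_e = np.trace(E) / n
kappa_textbook = (p_o - p_e) / (1 - p_e)

print(kappa, kappa_textbook)
```

Both expressions give the same value here, which suggests that "1 minus the disagreement ratio" and "agreement above chance" are two views of the same quantity.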


## Observed Agreement Matrix O
The observed agreement matrix O is a confusion matrix. In this comp, the cell (i, j) contains the number of essays with ground-truth score i and predicted score j, which matches the indexing used in the code below.


In [None]:

def compute_observed_agreement_matrix(y_true, y_pred, max_rating):
    """
    Calculate the observed agreement matrix.

    Parameters:
    - y_true : array-like of shape (n_samples,)
               Actual ratings.
    - y_pred : array-like of shape (n_samples,)
               Predicted ratings.
    - max_rating: int
               Maximum possible rating.
    Returns:
    - confusion : array-like of shape (max_rating, max_rating)
               Observed agreement matrix.
    """
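    # Rows index the true rating, columns index the predicted rating.
    # Ratings start at 1, so we subtract 1 to get 0-based indices.
    confusion = np.zeros((max_rating, max_rating))
    for true_rating, pred_rating in zip(y_true, y_pred):
        confusion[true_rating - 1, pred_rating - 1] += 1
    return confusion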

In [None]:
example_y_true = np.array([1, 1, 1, 2, 2, 2])
example_y_pred = np.array([1, 1, 1, 2, 2, 2])
O = compute_observed_agreement_matrix(example_y_true, example_y_pred, max_rating=MAX_RATING)
print(O)

# Interpretation
# i (row) is a ground-truth score. j (column) is a predicted score.
# With perfect agreement, as in this example, all counts lie on the diagonal.
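
The same counting can be done without a Python loop. The sketch below is one possible vectorized variant, not the notebook's own implementation: the function name `observed_matrix_vectorized` is hypothetical, the example ratings are made up, and the code assumes integer ratings in 1..`max_rating`.

```python
import numpy as np

def observed_matrix_vectorized(y_true, y_pred, max_rating):
    """Count (true, predicted) pairs; ratings are assumed to be integers in 1..max_rating."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    confusion = np.zeros((max_rating, max_rating))
    # np.add.at accumulates correctly even when the same (row, col) index repeats.
    np.add.at(confusion, (y_true - 1, y_pred - 1), 1)
    return confusion

# Made-up ratings with a few disagreements.
O_vec = observed_matrix_vectorized([1, 1, 1, 2, 2, 2], [1, 1, 2, 2, 2, 3], max_rating=6)
print(O_vec)
```

`np.add.at` is used instead of `confusion[rows, cols] += 1` because plain fancy-index assignment would count a repeated index pair only once.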


## Expected Agreement Matrix E
Expected agreement matrix E represents what the agreement would be by chance, given the marginal probabilities of the scores.

In [None]:
def compute_expected_agreement_matrix(y_true, y_pred, max_rating):
    """
    Calculate the expected agreement matrix.

    Parameters:
    - y_true : array-like of shape (n_samples,)
               Actual ratings.
    - y_pred : array-like of shape (n_samples,)
               Predicted ratings.
    - max_rating: int
               Maximum possible rating.
    Returns:
    - expected : array-like of shape (max_rating, max_rating)
               Expected agreement matrix.
    """
   
    confusion = compute_observed_agreement_matrix(y_true, y_pred, max_rating)
    # Calculate the total number of samples
    num_ratings = np.sum(confusion)

    # Marginal distributions of the ratings
    # Row sums (summing over predicted values) give the histogram of actual ratings
    marginal_true = np.sum(confusion, axis=1) / num_ratings
    # Column sums (summing over actual values) give the histogram of predicted ratings
    marginal_pred = np.sum(confusion, axis=0) / num_ratings

    # Expected Agreement matrix E (rows: actual, columns: predicted),
    # multiplied by num_ratings to match the scale of the confusion matrix
    expected = np.outer(marginal_true, marginal_pred) * num_ratings
    return expected

In [None]:
example_y_true = np.array([1, 1, 1, 2, 2, 2])
example_y_pred = np.array([1, 1, 1, 2, 2, 2])
E = compute_expected_agreement_matrix(example_y_true, example_y_pred, max_rating=MAX_RATING)
print(E)
# The core component of kappa is the ratio of the off-diagonal total of O to the off-diagonal total of E.
# The matrix O records how the predicted scores and the ground-truth scores actually co-occur.
# On the other hand, E records how the two variables would co-occur "by chance."
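
One property that may help intuition: E has the same row sums, column sums, and total as O, and only the distribution of counts across cells differs. The snippet below is a small check of this under made-up ratings (the arrays are illustrative, not competition data).

```python
import numpy as np

# Made-up ratings for illustration (values in 1..6).
y_true = np.array([1, 1, 2, 2, 3, 3])
y_pred = np.array([1, 2, 2, 3, 3, 3])
k = 6

# Observed counts: rows are true scores, columns are predicted scores.
O = np.zeros((k, k))
for t, p in zip(y_true, y_pred):
    O[t - 1, p - 1] += 1

n = O.sum()
# Expected counts under independence, on the same scale as O.
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / n

# E keeps the same row sums and column sums as O ...
same_rows = np.allclose(E.sum(axis=1), O.sum(axis=1))
same_cols = np.allclose(E.sum(axis=0), O.sum(axis=0))
# ... but spreads the counts differently across the cells.
same_cells = np.allclose(E, O)
print(same_rows, same_cols, same_cells)
```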

In [None]:

## Why does E represent the extent of agreement by chance? 
# By marginalizing the confusion matrix, we get two categorical probability distributions.
# P_a(A) represents how likely each score (1~6) is to occur in the predictions.
# P_b(B) represents how likely each score (1~6) is to occur in the ground truth.
# E assumes that the two variables are statistically independent of each other. "By chance" refers to this independence assumption.
# Under this assumption, the joint probability P(A = 1 and B = 6) is P_a(1) * P_b(6).
# The computation is applied to all combinations of the scores, and this is what the outer product does.


# Appendix: naive implementation of outer product
def outer_product(scores1, scores2):
    # scores1: 1-D sequence of length n
    # scores2: 1-D sequence of length m
    result_matrix = np.zeros((len(scores1), len(scores2)))
    for i, score1 in enumerate(scores1):
        for j, score2 in enumerate(scores2):
            result_matrix[i][j] += score1 * score2

    return result_matrix

scores1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# comparison with the result of numpy's outer product
np.all(outer_product(scores1, scores2) == np.outer(scores1, scores2))
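
The independence argument can also be checked by simulation. The sketch below draws two score variables independently and compares their empirical joint distribution with the outer product of the marginals. The distributions `p_a` and `p_b`, the sample size, and the seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical marginal distributions over the scores 1..6 (made up for illustration).
p_a = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])  # predicted scores
p_b = np.array([0.2, 0.2, 0.2, 0.2, 0.1, 0.1])  # ground-truth scores

n = 200_000
# Draw the two variables independently of each other (indices 0..5 stand for scores 1..6).
a = rng.choice(6, size=n, p=p_a)
b = rng.choice(6, size=n, p=p_b)

# Empirical joint distribution of the independent draws.
joint = np.zeros((6, 6))
np.add.at(joint, (a, b), 1)
joint /= n

# Under independence, the joint should be close to the outer product of the marginals.
max_abs_diff = np.abs(joint - np.outer(p_a, p_b)).max()
print(max_abs_diff)
```

The largest cell-wise difference should be small and shrink as `n` grows, which is what "agreement by chance" means here.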

## Compute Kappa
Putting it all together, the kappa score is calculated as follows:

In [None]:

def compute_kappa(y_true, y_pred, max_rating):
    """
    Calculate the kappa score.

    Parameters:
    - y_true : array-like of shape (n_samples,)
               Actual ratings.
    - y_pred : array-like of shape (n_samples,)
               Predicted ratings.
    - max_rating: int
               Maximum possible rating.
    Returns:
    - result : dict
            Contains "kappa" plus the observed and expected weighted sums.
    """
    # =========== 1. Creating the Observed Agreement Matrix O ======================
    # i indicates the actual score (row), j indicates the predicted score (column).
    confusion = np.zeros((max_rating, max_rating))
    for i, j in zip(y_true, y_pred):
        confusion[i - 1, j - 1] += 1
        
    # =========== 2. Calculating the Expected Agreement Matrix E ==============
    # Calculate the total number of samples
    num_ratings = np.sum(confusion)

    # Marginal distributions of the ratings
    # Row sums (summing over predicted values) give the histogram of actual ratings
    marginal_true = np.sum(confusion, axis=1) / num_ratings
    # Column sums (summing over actual values) give the histogram of predicted ratings
    marginal_pred = np.sum(confusion, axis=0) / num_ratings

    # Expected ratings matrix E
    # to match the scale of the confusion matrix, we multiply by num_ratings
    expected = np.outer(marginal_true, marginal_pred) * num_ratings

    # ================ 3. Score calculation ===============================
    # weight 1 for every disagreement (off-diagonal) cell, 0 on the diagonal
    w_mat = np.ones([max_rating, max_rating], dtype=int)
    w_mat.flat[:: max_rating + 1] = 0
    observed_weighted_sum = np.sum(w_mat * confusion)
    expected_weighted_sum = np.sum(w_mat * expected)

    kappa = 1 - (observed_weighted_sum / expected_weighted_sum)

    return {
        "kappa": kappa,
        "observed_weighted_sum": observed_weighted_sum,
        "expected_weighted_sum": expected_weighted_sum,
    }

In [None]:
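## The example below reveals the limitation of this metric for this comp.
# Plain kappa ignores how far off a prediction is, so the closer predictions (+1) do not score higher than the farther ones (+2).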

y_true = np.array([1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 2, 2, 2]) + 2 # 2 is added to each element
print(compute_kappa(y_true, y_pred, max_rating=MAX_RATING))

y_true = np.array([1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 2, 2, 2]) + 1 # 1 is added to each element
print(compute_kappa(y_true, y_pred, max_rating=MAX_RATING))
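
Working the two cases by hand makes the limitation concrete (a sketch, assuming the off-diagonal sums used in `compute_kappa`). In both cases all six predictions are wrong, so the observed off-diagonal sum is 6. E puts 1.5 expected counts in each of four cells. With the shift of 2, none of those cells lies on the diagonal. With the shift of 1, the cell (2, 2) lies on the diagonal and is excluded, leaving 4.5.

$$\kappa_{+2} = 1 - \frac{6}{6} = 0, \qquad \kappa_{+1} = 1 - \frac{6}{4.5} = -\frac{1}{3}$$

The predictions that are closer to the truth get the lower score, because plain kappa counts every disagreement equally regardless of its size.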

# 2. "Quadratic Weighted" Kappa

Unlike the kappa, quadratic weighted kappa takes into account the magnitude of the disagreement between the predicted score and the actual score.
Let k be the max score (in this comp, k is 6). The weights are calculated by the following formula.
$$w_{i,j} = \frac{(i-j)^2}{(k-1)^2}$$
The larger the quadratic weighted kappa is, the better the predicted scores track the actual ordinal scores. This comp adopts this metric. In other words, this comp does not only reward predicting the exact score of an essay. More precisely, it requires the model to predict as close as possible to the actual score, and a wrong prediction is penalized more the farther it is from the actual score.
$$\kappa = 1- \frac{\sum_{i,j} w_{i,j} O_{i,j}}{\sum_{i,j} w_{i,j} E_{i,j}} $$


## Creating the weights matrix W

In [None]:
def compute_weights_matrix(max_rating):
    """
    Calculate the weights matrix.

    Parameters:
    - max_rating: int
               Maximum possible rating.
    Returns:
    - weights : array-like of shape (max_rating, max_rating)
               Weights matrix.
    """
    w_mat = np.zeros([max_rating, max_rating], dtype=int)
    w_mat += np.arange(max_rating)
    weights = (w_mat - w_mat.T) ** 2 / (max_rating - 1) ** 2
    return weights


W = compute_weights_matrix(max_rating=MAX_RATING)
print(W)
# W[i, j] is the normalized squared distance between score i (row) and score j (column).
# W is symmetric, so it does not matter which of the two is the prediction.
# For example, if i = 1 and j = 2, its weight is (1 - 2)**2/(6-1)**2 = 1/25 = 0.04. The computation is done for all combinations of i and j.
# This mechanism enables the kappa to reflect the ordinal difference.
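
The same matrix can be written directly from the formula with broadcasting. This is an equivalent sketch rather than the notebook's own code, and `W_alt` and `ratings` are names introduced here for illustration.

```python
import numpy as np

k = 6  # max score, as in MAX_RATING
ratings = np.arange(1, k + 1)

# Broadcasting a column vector against a row vector gives all pairwise differences i - j.
W_alt = (ratings[:, None] - ratings[None, :]) ** 2 / (k - 1) ** 2

print(W_alt[0, 1])  # weight between scores 1 and 2
print(W_alt[0, 5])  # weight between scores 1 and 6 (the largest possible disagreement)
```

The diagonal is zero (no penalty for exact agreement) and the weight reaches 1 only for the largest possible disagreement.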

## Compute quadratic weighted kappa
Incorporating the weight matrix, the quadratic weighted kappa score is calculated as follows:

In [None]:

def compute_quadratic_weighted_kappa(y_true, y_pred, max_rating):
    """
    Calculate the quadratic weighted kappa score.

    Parameters:
    - y_true : array-like of shape (n_samples,)
               Actual ratings.
    - y_pred : array-like of shape (n_samples,)
               Predicted ratings.
    - max_rating: int
               Maximum possible rating.
    Returns:
    - result : dict
            Contains "kappa" (the quadratic weighted kappa score)
            plus the observed and expected weighted sums.
    """
    
    # =========== 1. Creating the quadratic weights matrix W ======================
    w_mat = np.zeros([max_rating, max_rating], dtype=int)
    w_mat += np.arange(max_rating)
    weights = (w_mat - w_mat.T) ** 2 / (max_rating - 1) ** 2
    
    # =========== 2. Creating the observed agreement matrix O ======================
    # i indicates the actual score (row), j indicates the predicted score (column).
    confusion = np.zeros((max_rating, max_rating))
    for i, j in zip(y_true, y_pred):
        confusion[i - 1, j - 1] += 1
        
    # =========== 3. Calculating the expected agreement Matrix E ==============
    # Calculate the total number of samples
    num_ratings = np.sum(confusion)

    # Marginal distributions of the ratings
    # Row sums (summing over predicted values) give the histogram of actual ratings
    marginal_true = np.sum(confusion, axis=1) / num_ratings
    # Column sums (summing over actual values) give the histogram of predicted ratings
    marginal_pred = np.sum(confusion, axis=0) / num_ratings

    # Expected ratings matrix E
    # to match the scale of the confusion matrix, we multiply by num_ratings
    expected = np.outer(marginal_true, marginal_pred) * num_ratings

    # ================ 4. Score calculation ===============================
    observed_weighted_sum = np.sum(weights * confusion)
    expected_weighted_sum = np.sum(weights * expected)

    kappa = 1 - (observed_weighted_sum / expected_weighted_sum)

    return {
        "kappa": kappa,
        "observed_weighted_sum": observed_weighted_sum,
        "expected_weighted_sum": expected_weighted_sum,
    }
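
For reference, the three steps can be condensed into a few vectorized lines. The sketch below is an alternative formulation, not a replacement for the function above. The name `qwk_compact` is hypothetical, and the code assumes integer ratings in 1..`max_rating`.

```python
import numpy as np

def qwk_compact(y_true, y_pred, max_rating):
    """Quadratic weighted kappa; ratings are assumed to be integers in 1..max_rating."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    # Observed counts: rows are true scores, columns are predicted scores.
    observed = np.zeros((max_rating, max_rating))
    np.add.at(observed, (y_true - 1, y_pred - 1), 1)

    # Expected counts under independence of the two marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Quadratic weights (i - j)^2 / (k - 1)^2.
    idx = np.arange(max_rating)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (max_rating - 1) ** 2

    return 1 - (weights * observed).sum() / (weights * expected).sum()

y = np.array([1, 1, 1, 2, 2, 2])
kappa_perfect = qwk_compact(y, y, 6)    # perfect agreement
kappa_off_1 = qwk_compact(y, y + 1, 6)  # every prediction off by one
kappa_off_2 = qwk_compact(y, y + 2, 6)  # every prediction off by two
print(kappa_perfect, kappa_off_1, kappa_off_2)
```

The sketch does not guard against a zero denominator, which occurs when both vectors contain only one and the same rating.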


## Check change in quadratic weighted kappa based on the closeness of the predicted score to the actual score

In [None]:


y_true = np.array([1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 2, 2, 2]) + 2 # 2 is added to each element
print(compute_quadratic_weighted_kappa(y_true, y_pred, max_rating=MAX_RATING))
# output ("kappa" entry of the printed dict): 0.11111111111111116

y_true = np.array([1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 2, 2, 2]) + 1 # 1 is added to each element
print(compute_quadratic_weighted_kappa(y_true, y_pred, max_rating=MAX_RATING))
# output ("kappa" entry of the printed dict): 0.33333333333333337


# As a result, the quadratic weighted kappa is higher when the predicted score is closer to the actual score.
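
The two values can be reproduced by hand (a worked sketch, using $w_{i,j} = (i-j)^2/25$ for $k = 6$). In each example, all six samples sit at the same distance from the diagonal, and E places 1.5 expected counts in each of four cells.

$$\text{shift } +1:\quad \kappa = 1 - \frac{6 \cdot \frac{1}{25}}{1.5 \cdot \frac{1 + 4 + 0 + 1}{25}} = 1 - \frac{0.24}{0.36} = \frac{1}{3}$$

$$\text{shift } +2:\quad \kappa = 1 - \frac{6 \cdot \frac{4}{25}}{1.5 \cdot \frac{4 + 9 + 1 + 4}{25}} = 1 - \frac{0.96}{1.08} = \frac{1}{9}$$

The observed penalty grows by a factor of 4 when the shift doubles, while the chance-level penalty grows only by a factor of 3, so the score drops.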

# 3. Appendix

You don't need to implement the kappa score yourself. You can use the `sklearn.metrics.cohen_kappa_score` function with `weights='quadratic'` like this:

In [None]:
from sklearn.metrics import cohen_kappa_score

y_true = np.array([1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 2, 2, 2]) + 1 
print(cohen_kappa_score(y_true, y_pred, weights='quadratic', labels=[1, 2, 3, 4, 5, 6]))

y_true = np.array([1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 2, 2, 2]) + 2
print(cohen_kappa_score(y_true, y_pred, weights='quadratic', labels=[1, 2, 3, 4, 5, 6]))

# The results should be the same as the results of the function I implemented.