# Evaluating agreement using Cohen's $\kappa$

## Calculating Cohen's $\kappa$

In this exercise, we calculate Cohen's $\kappa$ by hand, in order to understand how the measure works.

To do so, you first need to classify each of the sentences below into one of three categories: **positive** (`POS`), **neutral** (`NEU`) or **negative** (`NEG`).

```
1. Congratulations on your promotion! You've worked hard and truly deserve it.
2. The test results will be available by the end of the week. Great.
3. The report provides a detailed analysis of the market trends over the past quarter.
4. The traffic was terrible this morning.
5. The cat is sleeping peacefully on the windowsill.
6. We have to cancel the event due to unforeseen circumstances.
7. Your kindness and generosity make the world a better place.
8. The meeting is scheduled for 2:00 PM in the conference room.
9. I didn't hear the news about your new job opportunity.
10. I can't believe I forgot my keys again; this is so frustrating.
11. I'm so grateful for your help with the project; we couldn't have done it without you.
12. The weather today is absolutely beautiful, perfect for a picnic in the park.
13. I'm disappointed that the project didn't meet the client's expectations.
14. The train is scheduled to arrive at the station on time.
15. The customer service I received was extremely poor, and I'm not satisfied with the product.
```

Write your answers into the dictionary below. Write the labels (`POS`, `NEU`, `NEG`) within the empty strings (e.g. `'NEU'`).

Do not discuss your decisions with your neighbour!

In [None]:
classifications = {
    1: '',
    2: '',
    3: '',
    4: '',
    5: '',
    6: '',
    7: '',
    8: '',
    9: '',
    10: '',
    11: '',
    12: '',
    13: '',
    14: '',
    15: ''
}
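
Before moving on, it can help to check that every label is filled in and spelled correctly, because an empty or misspelled string would later be counted as a category of its own. The following is a minimal sketch: the names `VALID_LABELS`, `find_invalid` and `example` are hypothetical, and the example labels are invented for illustration.

```python
# The three labels used in this exercise
VALID_LABELS = {'POS', 'NEU', 'NEG'}

def find_invalid(labels):
    """Return the sentence numbers whose label is missing or misspelled."""
    return [key for key, value in labels.items() if value not in VALID_LABELS]

# Invented example: sentence 2 uses lowercase and sentence 3 is still empty
example = {1: 'POS', 2: 'neu', 3: '', 4: 'NEG'}
print(find_invalid(example))  # [2, 3]
```

Calling `find_invalid(classifications)` on your own dictionary should return an empty list once all fifteen labels are in place.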

We start by calculating **observed agreement**, which is the proportion of sentences that you placed in the same category as the person next to you.

In [None]:
# Replace the number 0 with the number of items you agreed on
agreed = 0

# Divide the count by the number of items in the dictionary 'classifications'. The len() function counts the number of items.
observed_agreement = agreed / len(classifications)

# Print out the value for 'observed_agreement'
print(observed_agreement)
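
If both sets of labels are available in Python, the matches can also be counted in code rather than by hand. The sketch below assumes two short dictionaries, `mine` and `theirs`; these names and the five labels in each are invented for illustration and are not part of the exercise data.

```python
# Hypothetical labels for five sentences from two annotators (invented data)
mine = {1: 'POS', 2: 'NEU', 3: 'NEU', 4: 'NEG', 5: 'NEU'}
theirs = {1: 'POS', 2: 'POS', 3: 'NEU', 4: 'NEG', 5: 'NEG'}

# Count the sentences that received the same label from both annotators
agreed_example = sum(1 for key in mine if mine[key] == theirs[key])

# Divide by the number of sentences to get observed agreement
observed_example = agreed_example / len(mine)

print(agreed_example, observed_example)  # 3 0.6
```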

The problem with observed agreement is that it does not account for the possibility that you agreed by chance!

In other words, the observed agreement may have been pure luck!

This is why we need to estimate the probability of agreeing by chance.

To do so, count how many times you used each of the three categories (`POS`, `NEU`, `NEG`).

We can do this easily using the `Counter` class in Python's `collections` module.

In [None]:
# Import the Counter class from the collections module
from collections import Counter

# Count how many times each value occurs in the dictionary 'classifications'
categories = Counter(classifications.values())

# Print out the result
print(categories)

We can convert these counts into probabilities by dividing them by the total number of items in the dictionary.

In [None]:
# Retrieve the count for each category from the Counter object and divide by the total number of items
pos = categories['POS'] / len(classifications)
neu = categories['NEU'] / len(classifications)
neg = categories['NEG'] / len(classifications)

# Print out the probabilities for each category
print("Probability for POS:", pos)
print("Probability for NEU:", neu)
print("Probability for NEG:", neg)

Each probability estimates how likely you are to assign a sentence to that category.

Now ask the person next to you for their probabilities for each category.

Store the values in the variables below by replacing the example values.

In [None]:
# Store your neighbour's probabilities for choosing each category here
neighbour_pos = 0.33
neighbour_neu = 0.2
neighbour_neg = 0.47

Now that we know the probabilities for each category for both annotators, we can calculate the probability that both annotators chose the same category by chance.

This is easy: for each category, simply multiply your probability by the corresponding probability from the person next to you.

If one annotator never assigned a sentence to a category, e.g. negative, then the probability of both annotators choosing that category by chance is zero (multiplication by zero results in zero).

In [None]:
# Calculate probabilities for choosing the same category
both_pos = pos * neighbour_pos
both_neu = neu * neighbour_neu
both_neg = neg * neighbour_neg

# Print out the probabilities for chance agreement for each category
print("Probability for chance agreement for POS:", both_pos)
print("Probability for chance agreement for NEU:", both_neu)
print("Probability for chance agreement for NEG:", both_neg)

Next, we can calculate how likely you are to agree by chance.

This is known as **expected agreement**, which is calculated by summing up the probabilities for chance agreement for each category.

In [None]:
# Calculate expected agreement
expected_agreement = both_pos + both_neu + both_neg

# Print out expected agreement
print(expected_agreement)
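
The same steps can be collected into a single function that works directly on two lists of labels. This is a minimal sketch rather than a reference implementation: the function name `expected_agreement_sketch` and the two label lists are assumptions made for illustration.

```python
from collections import Counter

def expected_agreement_sketch(labels_a, labels_b):
    """Sum, over all categories, the product of each annotator's label proportions."""
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    n_a, n_b = len(labels_a), len(labels_b)
    # A Counter returns 0 for a category that an annotator never used,
    # so such a category contributes nothing to the sum
    categories = set(counts_a) | set(counts_b)
    return sum((counts_a[c] / n_a) * (counts_b[c] / n_b) for c in categories)

# Invented labels for five sentences from two annotators
labels_a = ['POS', 'NEU', 'NEU', 'NEG', 'NEU']
labels_b = ['POS', 'POS', 'NEU', 'NEG', 'NEG']

# 0.2 * 0.4 + 0.6 * 0.2 + 0.2 * 0.4, which is roughly 0.28
print(expected_agreement_sketch(labels_a, labels_b))
```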

Now that we know both observed agreement (stored under the variable `observed_agreement`) and the agreement expected by chance (`expected_agreement`), we can use this information to calculate Cohen's $\kappa$.

The formula for Cohen's $\kappa$ is as follows:

$\kappa = \frac{P_{observed} - P_{expected}}{1 - P_{expected}}$
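
As a worked example with invented numbers (not taken from the exercise), suppose that $P_{observed} = 0.6$ and $P_{expected} = 0.28$. Then:

$\kappa = \frac{0.6 - 0.28}{1 - 0.28} = \frac{0.32}{0.72} \approx 0.44$

Although the annotators agreed on 60% of the items in this example, the value of $\kappa$ is lower, because part of that agreement could have occurred by chance.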

As all of the information needed is stored in the variables `observed_agreement` and `expected_agreement`, we can easily calculate the value for $\kappa$.

In [None]:
# Calculate Cohen's kappa
kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)

# Print out the value for Cohen's kappa
print(kappa)

Which value did you get?

Remember that the values for Cohen's $\kappa$ range from $-1$ to $+1$, where $-1$ stands for perfect disagreement, while $+1$ indicates perfect agreement. A value of $0$ indicates that the agreement is no better than what would be expected by chance.
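
To make these three reference points concrete, the sketch below wraps the manual calculation into one function and applies it to three small, invented pairs of label lists. The function name `kappa_sketch` is an assumption made for illustration, and the code does not handle the case where expected agreement equals 1.

```python
from collections import Counter

def kappa_sketch(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equally long lists of labels (illustrative sketch)."""
    n = len(labels_a)
    # Observed agreement: proportion of positions with identical labels
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: sum of the products of the per-category proportions
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in counts_a)
    # Note: kappa is undefined (division by zero) when p_expected equals 1
    return (p_observed - p_expected) / (1 - p_expected)

# Identical labels: perfect agreement
print(kappa_sketch(['POS', 'POS', 'NEG', 'NEG'], ['POS', 'POS', 'NEG', 'NEG']))  # 1.0
# Agreement on half of the items, exactly as expected by chance
print(kappa_sketch(['POS', 'POS', 'NEG', 'NEG'], ['POS', 'NEG', 'POS', 'NEG']))  # 0.0
# Opposite labels on every item: perfect disagreement
print(kappa_sketch(['POS', 'NEG', 'POS', 'NEG'], ['NEG', 'POS', 'NEG', 'POS']))  # -1.0
```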

## Calculating weighted Cohen's $\kappa$

It should be noted that the original $\kappa$ score treats every disagreement as equally serious, regardless of how far apart the two chosen categories are.

Think, for example, of our three categories, which form an ordered scale from negative through neutral to positive. If one annotator chooses `POS` and the other chooses `NEG`, they disagree more strongly than if one of them had chosen `NEU`, yet the unweighted $\kappa$ counts both cases as a plain disagreement.

This issue can be mitigated by weighting each disagreement by the distance between the two categories, so that near misses are penalised less than opposite labels.

To get started, ask your neighbour for their classifications and enter them into the dictionary below.

In [None]:
neighbour_classifications = {
    1: '',
    2: '',
    3: '',
    4: '',
    5: '',
    6: '',
    7: '',
    8: '',
    9: '',
    10: '',
    11: '',
    12: '',
    13: '',
    14: '',
    15: ''
}

One rarely has to calculate agreement measures such as Cohen's $\kappa$ by hand, as they have been implemented in numerous Python libraries.

Let's import the `cohen_kappa_score` function from the `metrics` module of the *scikit-learn* library (`sklearn`).

In [None]:
# Import the cohen_kappa_score function from scikit-learn's metrics module
from sklearn.metrics import cohen_kappa_score

Next, let's calculate Cohen's $\kappa$ without weights – just as we did by hand above.

In [None]:
# Retrieve lists of categories from both dictionaries
cats_you = list(classifications.values())
cats_nb = list(neighbour_classifications.values())

# Input the two lists to the cohen_kappa_score function
cohen_kappa_score(cats_you, cats_nb)

This should give you the same result that you calculated manually, apart from small differences caused by rounding your neighbour's probabilities.

If we want to take into account how far apart the chosen categories are, we can call the function with the argument `weights`.

In this case, we use [quadratic weighting](https://datatab.net/tutorial/weighted-cohens-kappa), which penalises each disagreement by the square of the distance between the two categories on the ordered scale.

In [None]:
# Calculate Cohen's kappa with quadratic weights
cohen_kappa_score(cats_you, cats_nb, weights='quadratic')
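
To see what quadratic weighting does, the calculation can be sketched by hand. The code below follows the common definition of weighted $\kappa$ (one minus the ratio of weighted observed to weighted expected disagreement) and is an illustration, not scikit-learn's own code. The function name `weighted_kappa_sketch`, the `order` argument and the label lists are assumptions made for this example. The category order is given explicitly here; to our understanding, scikit-learn sorts string labels alphabetically unless the `labels` argument is given, which for `NEG`, `NEU`, `POS` happens to match the scale.

```python
from collections import Counter

def weighted_kappa_sketch(labels_a, labels_b, order):
    """Quadratically weighted kappa for labels on an ordered scale (illustrative sketch)."""
    n = len(labels_a)
    # Position of each category on the ordered scale
    index = {label: i for i, label in enumerate(order)}
    # How often each pair of labels (annotator A, annotator B) was observed
    observed = Counter(zip(labels_a, labels_b))
    # Each annotator's own label counts
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for a in order:
        for b in order:
            # Quadratic weight: squared distance between the two positions
            weight = (index[a] - index[b]) ** 2
            weighted_observed += weight * observed[(a, b)] / n
            weighted_expected += weight * (counts_a[a] / n) * (counts_b[b] / n)
    return 1 - weighted_observed / weighted_expected

# Invented labels: the two disagreements are both between neighbouring categories
labels_a = ['POS', 'NEU', 'NEU', 'NEG', 'NEU']
labels_b = ['POS', 'POS', 'NEU', 'NEG', 'NEG']

# Roughly 0.67, compared to roughly 0.44 for the unweighted score on the same labels
print(weighted_kappa_sketch(labels_a, labels_b, order=['NEG', 'NEU', 'POS']))
```

The weighted score is higher here because neither disagreement spans the full distance between `NEG` and `POS`.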

This brief tutorial should have given you an idea of how Cohen's $\kappa$ and other measures of agreement work.  