# Calculating Krippendorff's Alpha

This notebook demonstrates how we calculated Krippendorff's alpha for interannotator agreement for our study. 

Krippendorff's alpha is a good fit for our project because it handles missing data and is designed to measure agreement among multiple annotators, whereas some methods are limited to exactly two. In our case, items have nominal categories and varying numbers of annotators (ranging from one to four), and Krippendorff's alpha accommodates this without requiring every annotator to label every item. Unlike agreement metrics that require complete data from all annotators, it can be computed even when some annotations are missing, so our interannotator agreement assessment remains valid despite the variability in our dataset.
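For reference, the general form of the statistic can be written as below. The notation is our own shorthand rather than anything taken from the code in this notebook: $n_{uc}$ is the number of annotators who gave item $u$ the label $c$, $m_u$ is the number of labels item $u$ received (only items with $m_u \ge 2$ are counted), $n_c$ is the total count of label $c$ over those items, $n$ is the total number of labels over those items, and $\delta(c,k)$ is the distance function.

$$
\alpha = 1 - \frac{D_o}{D_e}, \qquad
D_o = \frac{1}{n}\sum_{u}\frac{1}{m_u - 1}\sum_{c,k} n_{uc}\,n_{uk}\,\delta(c,k), \qquad
D_e = \frac{1}{n(n-1)}\sum_{c,k} n_c\,n_k\,\delta(c,k)
$$

Here $D_o$ is the observed disagreement within items and $D_e$ is the disagreement expected if labels were paired at random, so $\alpha = 1$ means perfect agreement and $\alpha = 0$ means agreement no better than chance.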

In [26]:
# Imports

import pandas as pd
from nltk.metrics.agreement import AnnotationTask

In [3]:
import os
print("Current working directory:", os.getcwd())

Current working directory: /Users/timsmac/DSCI/COLX523_linkedin_corpus/src


In [35]:
# Reading in the data - just the individual annotations

# Change this directory as needed
annotation_dir = '../data/Annotation_instances/linkedin_combined_annotation.csv'

df = pd.read_csv(annotation_dir)[['label_1','label_2','label_3','label_4']].astype(float)

df.head(10)

Unnamed: 0,label_1,label_2,label_3,label_4
0,4.0,6.0,6.0,
1,,5.0,4.0,6.0
2,6.0,6.0,5.0,
3,,5.0,5.0,6.0
4,6.0,6.0,6.0,
5,6.0,5.0,,6.0
6,6.0,1.0,,1.0
7,,5.0,3.0,6.0
8,6.0,,,6.0
9,5.0,,6.0,6.0


### Distance function

For this task, there are several things to take into account when considering which distance function to use for Krippendorff's alpha. Namely, some of our labels/categories have significant ambiguity and/or overlap with each other. Thus, inter-annotator disagreements between these categories ought to be penalized less, given the natural ambiguity of our dataset. On the other hand, some categories are quite distant from each other, and disagreements between them should carry the full penalty. In our case, we also have an 'Others' category, which requires its own special treatment as a 'neutral/default' category. As a group, we discussed this and determined that the following categories required special treatment in the distance function:

### Categories 2 and 3:
Category 2 (Events) and Category 3 (Interactive Promotions) are different enough in theory to keep as separate categories; in practice, however, many LinkedIn posts in our dataset were quite ambiguous between the two. Consider the following example:

_'NEW LIVE CLASS: How to land a job you love. Join us on Thursday, June 4th at 11am ET.  Pay what you can and register at  https://bit.ly/2ZQSrlY \n \n \n …see more'_

This example demonstrates the overlap; it is largely talking about an event/gathering, but it is soliciting payment for a service and comes from an account that regularly offers paid seminars / is promoting its brand, making it closely related to category 3 as well. Ultimately the above example was labeled as 2, but if somebody thought this was a 3 we believe that sort of disagreement should not be heavily penalized.

### Categories 4 and 5:
Category 4 (Educational Resources) and Category 5 (Trends) also share some overlap. Many posts on LinkedIn are naturally ambiguous between these two categories since they often pair real-world trends with educational resources. Consider this example:

_'Voices of educators as to how to improve online learning.  A worthy read ..... myriad of suggestions.  https://lnkd.in/e2QMwrZ \n \n \n …see more'_

This was labeled a 4 by our group, since it does offer an external resource to learn from. However, that resource is clearly centered around an industry trend (the move toward online learning), offering a bit of ambiguity. We found that there were many such examples that created some ambiguity in labeling, so we also decided to be less punitive to errors between categories 4 and 5.

### Category 6:
Category 6 (Others) is a default, 'catch-all' category for any posts that are too difficult to place in the other categories. We discussed how this affects its role in the distance function and determined that disagreements between this label and any other label should be penalized slightly more than an ordinary disagreement. Category 6 has no content of its own and serves only to hold posts that don't suit another label, so it is a problem if some annotators routinely use it when another label truly does apply. If this happens frequently, it points to a problem with our schema or the task description, which should be reflected in a lower alpha. Thus, we opted to add a slight extra penalty to disagreements where either annotator assigns a 6.


In [117]:
def custom_distance(a, b):
    """Distance between two labels, used as the penalty in Krippendorff's alpha."""
    # Identical labels never count as a disagreement.
    if a == b:
        return 0
    # Missing annotations are encoded as None; don't penalize these against any label.
    if a is None or b is None:
        return 0
    # Small penalty between the ambiguous pairs 4/5 and 2/3:
    if (a, b) in [(4.0, 5.0), (5.0, 4.0), (2.0, 3.0), (3.0, 2.0)]:
        return 0.5
    # Larger penalty when exactly one of the two labels is 6 (a == b was handled above):
    if a == 6.0 or b == 6.0:
        return 1.5
    # Otherwise, use the default penalty.
    return 1
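
As a quick sanity check, the rules described above can be tabulated for every pair of labels. The sketch below is self-contained, so it restates the rules in a helper called `distance_rules`; that helper and `distance_table` are names made up for this illustration, and the labels are written as integers rather than the floats used in our data.

```python
from itertools import product

def distance_rules(a, b):
    """Restatement of the distance rules described above (for illustration only)."""
    if a == b or a is None or b is None:
        return 0.0
    if {a, b} in ({2, 3}, {4, 5}):  # ambiguous pairs get a reduced penalty
        return 0.5
    if 6 in (a, b):                 # exactly one label is 6 (Others)
        return 1.5
    return 1.0

def distance_table(labels, distance):
    """Return {(a, b): distance(a, b)} for every ordered pair of labels."""
    return {(a, b): distance(a, b) for a, b in product(labels, repeat=2)}

labels = [1, 2, 3, 4, 5, 6]
table = distance_table(labels, distance_rules)

# Two properties we expect of any distance passed to alpha:
assert all(table[(a, a)] == 0 for a in labels)                 # zero on the diagonal
assert all(table[(a, b)] == table[(b, a)] for a, b in table)   # symmetric

# Print the table as a grid, rows and columns ordered by label.
print("    " + " ".join(f"{b:>4}" for b in labels))
for a in labels:
    print(f"{a:>4}" + " ".join(f"{table[(a, b)]:>4}" for b in labels))
```

Printing the grid makes it easy to confirm at a glance that only the 2/3 and 4/5 cells are discounted and that the whole row and column for label 6 carry the heavier penalty.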

In [None]:
# Converting our data into (coder, item, label) triples, suited for use with NLTK's AnnotationTask class

def convert_to_triples(data):
    triple_list = []
    annotators = ['label_1','label_2','label_3','label_4']
    for annotator in annotators:
        for i in range(len(data)):
            value = data[annotator][i]
            # Encode missing annotations (NaN) as None
            if pd.isnull(value):
                value = None
            triple_list.append([annotator, i, value])
    return triple_list

triples = convert_to_triples(data=df)

triples[:10]

[['label_1', 0, np.float64(4.0)],
 ['label_1', 1, None],
 ['label_1', 2, np.float64(6.0)],
 ['label_1', 3, None],
 ['label_1', 4, np.float64(6.0)],
 ['label_1', 5, np.float64(6.0)],
 ['label_1', 6, np.float64(6.0)],
 ['label_1', 7, None],
 ['label_1', 8, np.float64(6.0)],
 ['label_1', 9, np.float64(5.0)]]
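
One thing to keep in mind with this encoding: as far as we understand NLTK's `AnnotationTask`, every label in the triples is treated as a category, so `None` is counted like any other label (with a distance of zero to everything under our distance function). A variant worth comparing against is to leave missing annotations out of the triples entirely. The sketch below shows that filtering step on made-up toy data; `drop_missing` is a hypothetical helper, not something used to produce the result in this notebook.

```python
def drop_missing(triples):
    """Keep only (coder, item, label) triples whose label is not None."""
    return [t for t in triples if t[2] is not None]

# Toy triples in the same shape as above (values are made up for illustration).
toy_triples = [
    ["label_1", 0, 4.0], ["label_2", 0, 6.0], ["label_3", 0, None],
    ["label_1", 1, None], ["label_2", 1, 5.0], ["label_3", 1, 5.0],
]

observed_only = drop_missing(toy_triples)
print(len(toy_triples), "->", len(observed_only))
```

Whether the two encodings give a noticeably different alpha on our data is something we have not checked here.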

In [119]:
agreement_task = AnnotationTask(triples, distance=custom_distance)

In [120]:
agreement_task.alpha()

0.4400713149516744
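
To make it clearer what goes into a number like this, here is a minimal from-scratch sketch of Krippendorff's alpha as observed disagreement over expected disagreement. It is not NLTK's code and it is not what produced the value above: the function names are our own, the default distance is a plain 0/1 stand-in, and `None` labels are skipped rather than counted, so it would not necessarily reproduce our result.

```python
from collections import Counter, defaultdict

def nominal_distance(a, b):
    """Plain 0/1 distance, used here only as a stand-in."""
    return 0.0 if a == b else 1.0

def disagreement(counts, distance):
    """Mean distance over all ordered pairs of distinct annotations in `counts`."""
    n = sum(counts.values())
    total = sum(na * nb * distance(a, b)
                for a, na in counts.items()
                for b, nb in counts.items())
    return total / (n * (n - 1))

def krippendorff_alpha(triples, distance=nominal_distance):
    """Alpha = 1 - D_o / D_e for (coder, item, label) triples; None labels are skipped.

    No guard for degenerate input (e.g. only one distinct label overall).
    """
    by_item = defaultdict(Counter)
    for _coder, item, label in triples:
        if label is not None:
            by_item[item][label] += 1

    pooled = Counter()
    observed = 0.0
    for counts in by_item.values():
        m = sum(counts.values())
        if m < 2:  # a single annotation carries no agreement information
            continue
        pooled += counts
        observed += m * disagreement(counts, distance)

    n = sum(pooled.values())
    d_o = observed / n                     # observed disagreement within items
    d_e = disagreement(pooled, distance)   # expected disagreement from pooled labels
    return 1.0 - d_o / d_e

# Two coders, two items: full agreement vs. disagreement on every item.
agree = [("a", 0, 1), ("b", 0, 1), ("a", 1, 2), ("b", 1, 2)]
differ = [("a", 0, 1), ("b", 0, 2), ("a", 1, 1), ("b", 1, 2)]
print(krippendorff_alpha(agree))   # perfect agreement
print(krippendorff_alpha(differ))  # systematic disagreement gives a negative alpha
```

A custom distance enters only through the `distance` argument, which is why lowering the penalty for ambiguous pairs raises alpha and raising the penalty for pairs involving category 6 lowers it.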

## Discussion


A Krippendorff’s alpha of `0.44` indicates a moderate level of agreement among annotators, which is understandable given the inherent complexity of the task. In challenging annotation tasks, even experienced annotators can differ substantially in interpretation. In our task in particular, the data consists of lengthy text and requires in-depth, nuanced reading of hundreds of examples; further, as already discussed, there is natural ambiguity in the dataset, which likely also contributes to the modest alpha score.

This level of reliability suggests that while there is some consensus, there remains considerable subjectivity and possibly ambiguity in the annotation guidelines or the task itself. The result provides a useful benchmark, indicating that further refinement in our schema or additional annotator training might improve consistency in future iterations of this task.