# Script for comparing annotations
Currently, we are looking to compare the annotations between annotators regarding metaphors and non-metaphors. Note that all words/phrases that are tagged are only tagged with the label [METAPHOR]. We want a script that:
1. Compares words/phrases that get tagged
2. Compares the tags associated with these words/phrases
3. Outputs some sort of agreement level between annotators

In the future, we also want some way to compare these annotations with the appraisal annotations for these articles to see what class of words tend to be classified as metaphors or non-metaphors.

In [2]:
# # run this block if connecting to google drive
# from google.colab import drive
# drive.mount('/content/drive')

In [3]:
import pandas as pd
from sklearn.metrics import cohen_kappa_score
from statistics import mean
from ast import literal_eval
import numpy as np

## Load annotations from LabelStudio output
LabelStudio can output the annotations in various formats, including json, csv, and tsv.

For simplicity we are using the csv option, which includes the following columns:
* **annotation_id** - unique id associated with each annotation
* **annotator** - id associated with a particular annotator
* **created_at** - time the annotation was created
* **id** - text id
* **label** - dictionary of labels associated with the text
  * "start": the index of the first labelled character within the selected text
  * "end": the index of the final labelled character within the selected text
  * "text": the text (word/phrase) that was selected to be labelled
  * "labels": a list of labels associated with the selected text
* **lead_time** - time taken to complete the annotation, presumably in seconds
* **text** - article text
* **updated_at** - time the annotation was last updated

The relevant columns here are `annotator`, `id` (renamed to `text_id` below), `label`, and `text`.
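
As a rough illustration of what one `label` cell holds, the sketch below parses a cell into `(start, end, text)` tuples. The sample string is made up in the same shape as the export (it is not copied from the data), and the helper name `parse_label_cell` is hypothetical. It uses `json.loads` on the assumption that the cell is valid JSON; the notebook itself uses `literal_eval` further down.

```python
import json
import math

def parse_label_cell(cell):
    """Return a list of (start, end, text) tuples for one `label` cell.

    Empty cells (None or NaN, which is how pandas reads a blank CSV field)
    give an empty list.
    """
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return []
    return [(int(d["start"]), int(d["end"]), d["text"]) for d in json.loads(cell)]

# made-up cell in the same shape as the export
sample = '[{"start":10,"end":14,"text":"line","labels":["METAPHOR"]}]'
spans = parse_label_cell(sample)
print(spans)
```

Returning an empty list for blank cells is a design choice for this sketch only; the notebook instead drops rows with missing labels before comparing.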

In [4]:
import os

In [5]:
os.getcwd()

'C:\\Users\\romha\\OneDrive - University of Waterloo\\Desktop\\Spr25\\metaphor_annotations'

In [6]:
input = 'mar4.csv'

In [7]:
pd.read_csv(input)

Unnamed: 0,annotation_id,annotator,created_at,id,label,lead_time,text,updated_at
0,509,4,2025-01-25T03:02:14.469802Z,1,"[{""start"":151,""end"":162,""text"":""blue collar"",""...",9.888,' The problem is bigger than Mr. Trump.'Extend...,2025-01-25T03:02:14.469842Z
1,510,4,2025-01-25T03:02:37.974849Z,2,"[{""start"":298,""end"":306,""text"":""genitals"",""lab...",28.824,"'If she was a man, she would be president-elec...",2025-02-06T17:15:28.502235Z
2,502,4,2025-01-25T02:48:07.557501Z,3,"[{""start"":140,""end"":146,""text"":""engage"",""label...",176.665,'What are you going to tell your daughters?'Te...,2025-02-06T17:16:36.503806Z
3,511,4,2025-01-25T03:04:01.361546Z,4,"[{""start"":313,""end"":316,""text"":""top"",""labels"":...",101.213,'What are you going to tell your daughters?'We...,2025-02-01T06:28:56.983552Z
4,503,4,2025-01-25T02:48:59.759517Z,5,,1.503,A PR piece designed to reverse all the damage ...,2025-01-25T02:48:59.759572Z
...,...,...,...,...,...,...,...,...
1010,1133,3,2025-02-05T01:02:28.018384Z,1142,,20.526,The license has nothing to do with almost any ...,2025-02-05T01:02:28.018427Z
1011,1132,3,2025-02-05T01:01:59.522189Z,1143,,71.664,"They dictate meter rates because, as the artic...",2025-02-05T01:01:59.522227Z
1012,1131,3,2025-02-05T00:06:40.287424Z,1144,"[{""start"":138,""end"":142,""text"":""pays"",""labels""...",1907.271,UBER customers just want a CHEAPER ride (no in...,2025-02-05T00:07:08.794118Z
1013,1130,3,2025-02-04T23:35:06.074462Z,1145,,236.592,Why? The taxi medallions allow a cartel to fun...,2025-02-04T23:35:06.074500Z


In [8]:
selected = [11, 72, 131, 146, 178, 185, 214, 235, 258, 280, 333, 
            339, 414, 420, 495, 504, 521, 532, 593, 1139, 
            606, 678, 683, 745, 751, 762, 853, 864, 907, 908]
len(selected)

30

In [9]:
annotations_df = pd.read_csv(input, usecols=["annotator", "id", "label", "text", "created_at", "updated_at"])
# annotations_df = annotations_df[annotations_df["updated_at"] > "2025-02-25"][["annotator", "id", "label", "text"]]
annotations_df=annotations_df[annotations_df['id'].isin(selected)]
annotations_df.head()

Unnamed: 0,annotator,created_at,id,label,text,updated_at
10,3,2025-03-02T22:59:31.345865Z,11,,Another load of tosh from a GTA Liberal,2025-03-03T01:39:11.709897Z
11,5,2025-03-02T22:28:24.168230Z,11,,Another load of tosh from a GTA Liberal,2025-03-02T23:01:27.466009Z
12,4,2025-01-25T03:05:57.967446Z,11,,Another load of tosh from a GTA Liberal,2025-01-25T03:05:57.967512Z
73,3,2025-03-03T01:27:45.721978Z,72,"[{""start"":38,""end"":43,""text"":""holds"",""labels"":...",Nobody really cares if a woman or man holds th...,2025-03-03T01:39:56.571135Z
74,5,2025-03-02T22:29:08.419293Z,72,"[{""start"":48,""end"":55,""text"":""highest"",""labels...",Nobody really cares if a woman or man holds th...,2025-03-02T23:26:05.507051Z


In [10]:
# rename 'id' to 'text_id' for clarification
annotations_df.rename(columns={"id": "text_id"}, inplace=True)

# reorder columns for clarification
order = ["text_id", "annotator", "text", "label"]
annotations_df = annotations_df.reindex(columns=order)

In [11]:
annotations_df.head()

Unnamed: 0,text_id,annotator,text,label
10,11,3,Another load of tosh from a GTA Liberal,
11,11,5,Another load of tosh from a GTA Liberal,
12,11,4,Another load of tosh from a GTA Liberal,
73,72,3,Nobody really cares if a woman or man holds th...,"[{""start"":38,""end"":43,""text"":""holds"",""labels"":..."
74,72,5,Nobody really cares if a woman or man holds th...,"[{""start"":48,""end"":55,""text"":""highest"",""labels..."


In [12]:
# find all the unique annotators
annotations_df['annotator'].unique().tolist()

[3, 5, 4]

## Comparing tagged words and labels


Tentative idea for how to proceed:
* With just two annotators, we could manually split the annotations into two dataframes, and then merge them again on text id, giving us a new dataframe containing the text id, the article text, and two label columns from the annotators
* From there, we can write a function to compare the lists of dictionaries in the label columns, and then create the following additional columns in this new merged dataframe:
  * overlap_agreement - if there is an overlap between non-identical tags (e.g. "line" vs "line in the sand"), how much do they agree?
  * textual_agreement - of words/phrases tagged, how many are identical?

To filter by annotators, use the numbers found from the previous code block and replace them below.

In [13]:
# filter the dataframe to contain just copies of annotations from each annotator
annotator1_df = annotations_df.loc[(annotations_df['annotator'] == 4)] # romina
annotator2_df = annotations_df.loc[(annotations_df['annotator'] == 5)] # amber
annotator3_df = annotations_df.loc[(annotations_df['annotator'] == 3)] # vanja

In [14]:
annotator1_df.head()

Unnamed: 0,text_id,annotator,text,label
12,11,4,Another load of tosh from a GTA Liberal,
75,72,4,Nobody really cares if a woman or man holds th...,"[{""start"":322,""end"":326,""text"":""bill"",""labels""..."
136,131,4,"Funnily enough, if the Ontario Liberals tried ...","[{""start"":48,""end"":58,""text"":""shell game"",""lab..."
153,146,4,Apple stuff consist of 95% marketing nonsense ...,"[{""start"":138,""end"":144,""text"":""thrive"",""label..."
187,178,4,This whole concept of 'national daycare' sound...,"[{""start"":72,""end"":76,""text"":""he##"",""labels"":[..."


In [15]:
annotator2_df.head()

Unnamed: 0,text_id,annotator,text,label
11,11,5,Another load of tosh from a GTA Liberal,
74,72,5,Nobody really cares if a woman or man holds th...,"[{""start"":48,""end"":55,""text"":""highest"",""labels..."
135,131,5,"Funnily enough, if the Ontario Liberals tried ...","[{""start"":48,""end"":58,""text"":""shell game"",""lab..."
152,146,5,Apple stuff consist of 95% marketing nonsense ...,"[{""start"":125,""end"":130,""text"":""knows"",""labels..."
186,178,5,This whole concept of 'national daycare' sound...,"[{""start"":294,""end"":302,""text"":""draining"",""lab..."


Here we create a few helper functions to turn the label columns into useable data.

First, `labels_to_list` takes in the label column and returns a list of lists containing just the start and end indices of tagged words/phrases. These indices uniquely identify a tagged item within the text and disambiguate between different instances of the same word (e.g. if "line" appears multiple times throughout the comment), so they are all the information we need to work with.

Each item in the returned list is a [start index, end index] pair, extracted from the list of label dictionaries.

In [16]:
# helper function to build a list of [start, end] index pairs
# where labels is the string from the label column for a specific text
def labels_to_list(labels):
  annotations = []

  labels = literal_eval(labels)

  for label in labels:
    tags = []
    tags.append(int(label['start']))
    tags.append(int(label['end']))

    annotations.append(tags)

  return annotations

This is the main helper function that will produce textual and overlap agreement. As a reminder, textual agreement refers to the similarity in items that are tagged between the annotators (we are not counting the number of tags, but rather, the specific words that are tagged - it would not be very useful to know that they share the same number of annotations if they are both tagging different words). Overlap agreement looks specifically at items that may overlap, for example, if one annotator only tagged "line" as metaphor but the other tagged the entire phrase "line in the sand".

To determine the textual agreement, we first turn the label dictionaries into usable start-end index pairs and sort them by start index (the CSV output seems to list the labels in the order they were created, not in the order they appear in the comment). We then iterate through the two lists to build two additional lists of equal length; these are later compared against each other to produce Cohen's kappa.

To produce the additional lists, we add in order the labels that appear in both lists; when a label only appears in one list, we append "0, 0" to the other list to denote that the item was not annotated by that annotator.
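
The alignment described above can be sketched on its own, using only plain lists. The function name `align_spans` and the toy spans are hypothetical. One assumption to flag: this variant also keeps spans that are left over once the shorter list runs out, pairing them with "0, 0", which may differ from what the helper further down does.

```python
def align_spans(spans1, spans2, missing="0, 0"):
    """Pair up two lists of [start, end] spans by start index.

    A span that appears in only one list is paired with `missing`.
    Leftover spans in the longer list are kept as well
    (an assumption about the intended behaviour).
    """
    a = sorted(spans1)
    b = sorted(spans2)
    tags1, tags2 = [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:        # span only in the first list
            tags1.append(str(a[i]))
            tags2.append(missing)
            i += 1
        elif a[i][0] == b[j][0]:     # both lists tag the same start index
            tags1.append(str(a[i]))
            tags2.append(str(b[j]))
            i += 1
            j += 1
        else:                        # span only in the second list
            tags1.append(missing)
            tags2.append(str(b[j]))
            j += 1
    # flush whatever remains in either list
    for span in a[i:]:
        tags1.append(str(span))
        tags2.append(missing)
    for span in b[j:]:
        tags1.append(missing)
        tags2.append(str(span))
    return tags1, tags2

t1, t2 = align_spans([[4, 10], [30, 35]], [[4, 10], [20, 24], [50, 58]])
print(t1)
print(t2)
```

Both returned lists always have the same length, which is what a kappa calculation over paired labels requires.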

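It is trickier to use Cohen's kappa to determine overlap agreement. As such, I used a rudimentary, brute-force method: for each pair of overlapping (but, crucially, not identical) labels, I take the ratio of the shorter span length to the longer one (span length being end index minus start index) and add it to a third list for easier calculation. Here, two labels count as overlapping when they start at the same index but end at different indices.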

As an example: let's say "line" has indices [0, 3] and "line in the sand" has indices [0, 15]. Taking the difference, "line" has a difference of `3` and "line in the sand" has a difference of `15`. We then divide `3/15` to get `0.2`, telling us that the shorter tag covers only `20%` of the longer one.
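
The arithmetic in this example can be checked with a few lines of Python. `length_ratio` is a hypothetical name for the ratio described above. `span_iou` is a different, swapped-in measure (intersection over union of two half-open spans), shown only for comparison because it also handles spans that start at different indices; it is not what the notebook computes.

```python
def length_ratio(span1, span2):
    """Shorter span length divided by longer span length."""
    len1 = span1[1] - span1[0]
    len2 = span2[1] - span2[0]
    return min(len1, len2) / max(len1, len2)

def span_iou(span1, span2):
    """Intersection over union of two [start, end) spans (alternative measure)."""
    inter = max(0, min(span1[1], span2[1]) - max(span1[0], span2[0]))
    union = (span1[1] - span1[0]) + (span2[1] - span2[0]) - inter
    return inter / union

print(length_ratio([0, 3], [0, 15]))   # 0.2
print(length_ratio([4, 10], [4, 16]))  # 0.5
print(span_iou([2, 6], [4, 10]))       # 0.25
```

When two spans share a start index, both measures give the same value; they only diverge once the starts differ.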

Once these ratios are calculated, we take their mean to determine how close the overlapping tags were to each other, with values near 0 indicating very little overlap and values near 1 indicating almost complete overlap. A value of exactly 1 should never appear, because identical tags are excluded from this calculation; for comments with no overlapping (non-identical) tags, `NaN` is recorded instead so that they do not distort the final averages.

Similarly, a 1 in textual agreement suggests 100% agreement, and a 0 indicates 0% agreement between the annotators.
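
To make the textual agreement score less of a black box, here is a minimal sketch of unweighted Cohen's kappa computed by hand on two aligned tag lists. The function name `kappa_by_hand` and the toy lists are assumptions for illustration; the notebook itself relies on `cohen_kappa_score` from scikit-learn. The sketch also shows why a comment with a single shared label has no defined kappa: chance agreement is 1, so the formula becomes 0/0.

```python
import math
from collections import Counter

def kappa_by_hand(tags1, tags2):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    n = len(tags1)
    # observed agreement: share of positions where the labels match
    p_o = sum(a == b for a, b in zip(tags1, tags2)) / n
    # chance agreement: product of each label's marginal frequencies
    c1, c2 = Counter(tags1), Counter(tags2)
    p_e = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys()) / (n * n)
    if p_e == 1:
        return math.nan  # 0/0: both lists contain only the same single label
    return (p_o - p_e) / (1 - p_e)

single = kappa_by_hand(["[4, 10]"], ["[4, 10]"])
mixed = kappa_by_hand(["[4, 10]", "0, 0", "[30, 35]"],
                      ["[4, 10]", "[20, 24]", "[30, 35]"])
print(single, round(mixed, 4))
```

In the second call the annotators agree on two of three positions, and the score lands at 4/7 once chance agreement is subtracted.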

In [17]:
# helper function to compare tags and labels for a single text
# where row contains the text and labels from both annotators associated with the text
# label1 and label2 are the column headers for the labels
# returns a list containing the agreement score
# of overlapping labels and overall tags
def compare_labels(row, label1, label2):

  # turn labels dictionaries into usable lists
  list1 = labels_to_list(row[label1])
  list2 = labels_to_list(row[label2])

  # sort the keys to be in index order
  # since LabelStudio orders it by time updated, not order of appearance in text
  list1 = sorted(list1, key=lambda x: x[0])
  list2 = sorted(list2, key=lambda x: x[0])

  tags1 = []
  tags2 = []
  overlap = []

  i = 0
  j = 0

  # cohen's kappa takes two lists of equal lengths to compare labels
  # so here we are building new lists from the annotations
  # appending '0, 0' where the label exists in one list but not the other
  # note: the loop stops as soon as either list is exhausted,
  # so labels left over in the longer list are not added to tags1/tags2

  while i < len(list1) and j < len(list2):
    # if the current label in list1 appears before the current label in list2
    # append list1's label to tags1
    # and append the '0, 0' label to tags2
    if list1[i][0] < list2[j][0]:
      tags1.append(str(list1[i]))
      tags2.append("0, 0")
      i += 1

    # if the current label in list 1 starts with the same word instance
    # as the label in list2
    # append both labels to their respective lists
    elif list1[i][0] == list2[j][0]:
      tags1.append(str(list1[i]))
      tags2.append(str(list2[j]))

      # and here we check for overlap with the difference in index
      # e.g. if one label is [4, 10] and the other [4, 16]
      # we compare the difference of 6 and 12
      # and append 6/12, or 0.5, to the overlap list
      if list1[i][1] != list2[j][1]:
        diff1 = list1[i][1] - list1[i][0]
        diff2 = list2[j][1] - list2[j][0]
        numerator = min(diff1, diff2)
        denominator = max(diff1, diff2)
        overlap.append(float(numerator/denominator))

      i += 1
      j += 1

    # if the current label in list2 appears before the current label in list1
    # append list2's label to tags2
    # and append the '0, 0' label to tags1
    else:
      tags1.append("0, 0")
      tags2.append(str(list2[j]))
      j += 1

  # find cohen's kappa using our two tags arrays
  # if the tags arrays only contain 1 element (i.e. only has one label)
  # cohen's kappa returns a NaN
  # so the condition checks and corrects for that
  textual_kappa = cohen_kappa_score(tags1, tags2)
  if tags1 == tags2 and len(tags1) == 1:
    textual_kappa = 1

  # find the mean of the overlap agreement in the text
  # or append NaN if no overlap
  if overlap != []:
    overlap_agreement = mean(overlap)
  else:
    overlap_agreement = np.nan

  return [textual_kappa, overlap_agreement]

Note: the next cell drops rows with missing labels, so inter-annotator agreement is only computed for texts that BOTH annotators annotated. Cases where one annotator found metaphors in the text and the other did not annotate anything are not counted.

In [18]:
compare_df = pd.merge(annotator1_df, annotator2_df, on=['text_id', 'text'], suffixes=('_1', '_2'))
# annotator columns are redundant now
compare_df = compare_df.drop(columns=['annotator_1', 'annotator_2'])
# drop any rows without annotations
compare_df = compare_df.dropna()

# compare_df.head()

In [19]:
# sanity check for how many rows are getting compared
len(compare_df)

25

In [20]:
textual_agreement = []
overlap_agreement = []

for index, row in compare_df.iterrows():
  agreement = compare_labels(row, 'label_1', 'label_2')
  textual_agreement.append(agreement[0])
  overlap_agreement.append(agreement[1])

compare_df['textual_agreement'] = textual_agreement
compare_df['overlap_agreement'] = overlap_agreement

As per [this](https://datatab.net/tutorial/cohens-kappa) explanation of Cohen's kappa, below is a guideline for how the result can be interpreted:
* \> 0.8 - almost perfect
* \> 0.6 - substantial
* \> 0.4 - moderate
* \> 0.2 - fair
* 0 - 0.2 - slight
* < 0 - poor
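
The guideline above can be written as a small lookup. This is only a sketch: the name `interpret_kappa` is hypothetical, and returning "undefined" for `NaN` is an added assumption, since the guideline does not cover that case. It is applied here to the averaged per-text values reported in this notebook.

```python
import math

def interpret_kappa(kappa):
    """Map a kappa value to the verbal labels from the guideline."""
    if math.isnan(kappa):
        return "undefined"  # assumption: NaN is not covered by the guideline
    if kappa > 0.8:
        return "almost perfect"
    if kappa > 0.6:
        return "substantial"
    if kappa > 0.4:
        return "moderate"
    if kappa > 0.2:
        return "fair"
    if kappa >= 0:
        return "slight"
    return "poor"

for value in (0.7757065536440007, 0.866639566587864, 0.8116449387955535):
    print(round(value, 3), interpret_kappa(value))
```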

In [21]:
print("Average textual agreement: ", compare_df['textual_agreement'].mean())
print("Average overlap agreement: ", compare_df['overlap_agreement'].mean())

Average textual agreement:  0.7757065536440007
Average overlap agreement:  0.6375


# PAIRWISE

In [22]:
compare_df_1vs2 = pd.merge(annotator1_df, annotator2_df, on=['text_id', 'text'], suffixes=('_1', '_2'))
# annotator columns are redundant now
compare_df_1vs2 = compare_df_1vs2.drop(columns=['annotator_1', 'annotator_2'])
# drop any rows without annotations
compare_df_1vs2 = compare_df_1vs2.dropna()

# compare_df_1vs2.head()

In [23]:
# sanity check for how many rows are getting compared
len(compare_df_1vs2)

25

In [24]:
textual_agreement_1vs2 = []
overlap_agreement_1vs2 = []

for index, row in compare_df_1vs2.iterrows():
  agreement_1vs2 = compare_labels(row, 'label_1', 'label_2')
  textual_agreement_1vs2.append(agreement_1vs2[0])
  overlap_agreement_1vs2.append(agreement_1vs2[1])

compare_df_1vs2['textual_agreement'] = textual_agreement_1vs2
compare_df_1vs2['overlap_agreement'] = overlap_agreement_1vs2


In [25]:
print("Average textual agreement between annotators 1 and 2: ", compare_df_1vs2['textual_agreement'].mean())
print("Average overlap agreement between annotators 1 and 2: ", compare_df_1vs2['overlap_agreement'].mean())

Average textual agreement between annotators 1 and 2:  0.7757065536440007
Average overlap agreement between annotators 1 and 2:  0.6375


## 1 vs 3

In [26]:
compare_df_1vs3 = pd.merge(annotator1_df, annotator3_df, on=['text_id', 'text'], suffixes=('_1', '_3'))
# annotator columns are redundant now
compare_df_1vs3 = compare_df_1vs3.drop(columns=['annotator_1', 'annotator_3'])
# drop any rows without annotations
compare_df_1vs3 = compare_df_1vs3.dropna()

# compare_df_1vs3.head()

In [27]:
# sanity check for how many rows are getting compared
len(compare_df_1vs3)

23

In [28]:
textual_agreement_1vs3 = []
overlap_agreement_1vs3 = []

for index, row in compare_df_1vs3.iterrows():
  agreement_1vs3 = compare_labels(row, 'label_1', 'label_3')
  textual_agreement_1vs3.append(agreement_1vs3[0])
  overlap_agreement_1vs3.append(agreement_1vs3[1])

compare_df_1vs3['textual_agreement'] = textual_agreement_1vs3
compare_df_1vs3['overlap_agreement'] = overlap_agreement_1vs3


In [29]:
print("Average textual agreement between annotators 1 and 3:", compare_df_1vs3['textual_agreement'].mean())
print("Average overlap agreement between annotators 1 and 3:", compare_df_1vs3['overlap_agreement'].mean())

Average textual agreement between annotators 1 and 3: 0.866639566587864
Average overlap agreement between annotators 1 and 3: 0.4857142857142857


## 2 vs 3

In [30]:
compare_df_2vs3 = pd.merge(annotator2_df, annotator3_df, on=['text_id', 'text'], suffixes=('_2', '_3'))
# annotator columns are redundant now
compare_df_2vs3 = compare_df_2vs3.drop(columns=['annotator_2', 'annotator_3'])
# drop any rows without annotations
compare_df_2vs3 = compare_df_2vs3.dropna()

# compare_df_2vs3.head()

In [31]:
# sanity check for how many rows are getting compared
len(compare_df_2vs3)

23

In [32]:
textual_agreement_2vs3 = []
overlap_agreement_2vs3 = []

for index, row in compare_df_2vs3.iterrows():
  agreement_2vs3 = compare_labels(row, 'label_2', 'label_3')
  textual_agreement_2vs3.append(agreement_2vs3[0])
  overlap_agreement_2vs3.append(agreement_2vs3[1])

compare_df_2vs3['textual_agreement'] = textual_agreement_2vs3
compare_df_2vs3['overlap_agreement'] = overlap_agreement_2vs3


In [33]:
print("Average textual agreement between annotators 2 and 3:", compare_df_2vs3['textual_agreement'].mean())
print("Average overlap agreement between annotators 2 and 3:", compare_df_2vs3['overlap_agreement'].mean())

Average textual agreement between annotators 2 and 3: 0.8116449387955535
Average overlap agreement between annotators 2 and 3: 0.875


In [34]:
print("Average textual agreement between annotators R and A:", compare_df_1vs2['textual_agreement'].mean())
print("Average overlap agreement between annotators R and A:", compare_df_1vs2['overlap_agreement'].mean(), "\n")
print("Average textual agreement between annotators R and V:", compare_df_1vs3['textual_agreement'].mean())
print("Average overlap agreement between annotators R and V:", compare_df_1vs3['overlap_agreement'].mean(), "\n")
print("Average textual agreement between annotators A and V:", compare_df_2vs3['textual_agreement'].mean())
print("Average overlap agreement between annotators A and V:", compare_df_2vs3['overlap_agreement'].mean())

Average textual agreement between annotators R and A: 0.7757065536440007
Average overlap agreement between annotators R and A: 0.6375 

Average textual agreement between annotators R and V: 0.866639566587864
Average overlap agreement between annotators R and V: 0.4857142857142857 

Average textual agreement between annotators A and V: 0.8116449387955535
Average overlap agreement between annotators A and V: 0.875


In [35]:
np.mean([compare_df_1vs2['textual_agreement'].mean(),
         compare_df_1vs3['textual_agreement'].mean(), 
         compare_df_2vs3['textual_agreement'].mean()])

0.8179970196758061

In [36]:
np.mean([compare_df_1vs2['overlap_agreement'].mean(),
         compare_df_1vs3['overlap_agreement'].mean(), 
         compare_df_2vs3['overlap_agreement'].mean()])

0.6660714285714285
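
The three pairwise comparisons above repeat the same steps, so they could be folded into a loop. The sketch below is one hypothetical way to do that with `itertools.combinations`. The function `pairwise_agreement`, the toy data, and the `jaccard` scorer (a simple stand-in for `compare_labels`) are all assumptions for illustration, not the notebook's method.

```python
from itertools import combinations
from statistics import mean

def jaccard(spans1, spans2):
    """Share of exactly matching spans among all spans tagged by either side."""
    s1, s2 = set(spans1), set(spans2)
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)

def pairwise_agreement(annotations, score):
    """Average `score` over shared texts for every pair of annotators.

    annotations: {annotator: {text_id: list of (start, end) spans}}
    score: callable taking two span lists and returning a float
    """
    results = {}
    for a, b in combinations(sorted(annotations), 2):
        shared = annotations[a].keys() & annotations[b].keys()
        values = [score(annotations[a][t], annotations[b][t]) for t in sorted(shared)]
        results[(a, b)] = mean(values) if values else float("nan")
    return results

# toy data, not taken from the real annotations
toy = {
    "R": {1: [(0, 4)], 2: [(5, 9), (12, 20)]},
    "A": {1: [(0, 4)], 2: [(5, 9)]},
    "V": {1: [(0, 16)], 2: [(5, 9), (12, 20)]},
}
results = pairwise_agreement(toy, jaccard)
for pair, value in results.items():
    print(pair, value)
```

Adding a fourth annotator then only means adding one more entry to the dictionary, rather than three more sets of cells.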