# Script for comparing annotation
Currently, we are looking to compare the annotations between annotators regarding metaphors and non-metaphors. Note that all words/phrases that are tagged are only tagged with the label [METAPHOR]. We want a script that:
1. Compares words/phrases that get tagged
2. Compares the tags associated with these words/phrases
3. Outputs some sort of agreement level between annotators

In the future, we also want some way to compare these annotations with the appraisal annotations for these articles to see what class of words tend to be classified as metaphors or non-metaphors.

In [None]:
# run this block if connecting to google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
from sklearn.metrics import cohen_kappa_score
from statistics import mean
from ast import literal_eval
import numpy as np

## Load annotations from LabelStudio output
LabelStudio can export the annotations in various formats, including JSON, CSV, and TSV.

For simplicity we are using the CSV option, which includes the following columns:
* **annotation_id** - unique id associated with each annotation
* **annotator** - id associated with a particular annotator
* **created_at** - time the annotation was created
* **id** - text id
* **label** - dictionary of labels associated with the text
  * "start": the index of the first labelled character within the selected text
  * "end": the index just past the final labelled character within the selected text (e.g. the 8-character "planting" spans 56 to 64)
  * "text": the text (word/phrase) that was selected to be labelled
  * "labels": a list of labels associated with the selected text
* **lead_time** - time spent creating the annotation, in seconds (not used here)
* **text** - article text
* **updated_at** - time the annotation was last updated

The relevant columns here are annotator, id (renamed to text_id), label, and text.
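As a minimal sketch of how one label cell is structured, the snippet below parses a label string and slices the selected phrase back out of the text. The comment text and label cell are invented for illustration, and the sketch assumes that "end" is exclusive, which is consistent with the sample rows (the 8-character "planting" spans 56 to 64). It uses `json.loads` for brevity; the notebook itself parses the cell with `literal_eval`.

```python
import json

# Hypothetical comment text and label cell, invented for illustration only.
text = "They drew a line in the sand."
label_cell = '[{"start":12,"end":28,"text":"line in the sand","labels":["METAPHOR"]}]'

spans = []
for label in json.loads(label_cell):
    # Slicing with start/end recovers the selected phrase,
    # assuming "end" points just past the last selected character.
    selected = text[label["start"]:label["end"]]
    spans.append((selected, label["labels"]))

print(spans)
```

If the slice did not match the "text" field, that would suggest the indices refer to a different version of the article text.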

In [None]:
# named input_path to avoid shadowing the built-in input()
input_path = '/content/drive/My Drive/discourse_lab/annotations_20.csv'

In [None]:
annotations_df = pd.read_csv(input_path, usecols=["annotator", "id", "label", "text"])

# rename 'id' to 'text_id' for clarification
annotations_df.rename(columns={"id": "text_id"}, inplace=True)

# reorder columns for clarification
order = ["text_id", "annotator", "text", "label"]
annotations_df = annotations_df.reindex(columns=order)

In [None]:
annotations_df.head()

Unnamed: 0,text_id,annotator,text,label
0,182,4,"Wow, what a great ideal. All we need to do now...","[{""start"":56,""end"":64,""text"":""planting"",""label..."
1,182,5,"Wow, what a great ideal. All we need to do now...","[{""start"":56,""end"":64,""text"":""planting"",""label..."
2,183,4,"""What, let Teddy Roosevelt's acquisition go ju...",
3,183,5,"""What, let Teddy Roosevelt's acquisition go ju...",
4,184,4,' So with Prime Minister Justin Trudeau's comi...,"[{""start"":769,""end"":776,""text"":""fragile"",""labe..."


In [None]:
# find all the unique annotators
annotations_df['annotator'].unique().tolist()

[4, 5]

## Comparing tagged words and labels


Tentative idea for how to proceed:
* With just two annotators, we could manually split the annotations into two dataframes, and then merge them again on text id, giving us a new dataframe containing the text id, the article text, and two label columns from the annotators
* From there, we can write a function to compare the lists of dictionaries in the label columns, and then create the following additional columns in this new merged dataframe:
  * overlap_agreement - if there is an overlap between non-identical tags (e.g. "line" vs "line in the sand"), how much do they agree?
  * textual_agreement - of words/phrases tagged, how many are identical?

To filter by annotators, use the numbers found from the previous code block and replace them below.

In [None]:
# filter the dataframe to contain just copies of annotations from each annotator
annotator1_df = annotations_df.loc[(annotations_df['annotator'] == 4)]
annotator2_df = annotations_df.loc[(annotations_df['annotator'] == 5)]

In [None]:
annotator1_df.head()

Unnamed: 0,text_id,annotator,text,label
0,182,4,"Wow, what a great ideal. All we need to do now...","[{""start"":56,""end"":64,""text"":""planting"",""label..."
2,183,4,"""What, let Teddy Roosevelt's acquisition go ju...",
4,184,4,' So with Prime Minister Justin Trudeau's comi...,"[{""start"":769,""end"":776,""text"":""fragile"",""labe..."
6,185,4,'... a larger question is how to deal with Chi...,"[{""start"":139,""end"":143,""text"":""sold"",""labels""..."
8,186,4,'Peaceful negotiations' with China means 'peac...,


In [None]:
annotator2_df.head()

Unnamed: 0,text_id,annotator,text,label
1,182,5,"Wow, what a great ideal. All we need to do now...","[{""start"":56,""end"":64,""text"":""planting"",""label..."
3,183,5,"""What, let Teddy Roosevelt's acquisition go ju...",
5,184,5,' So with Prime Minister Justin Trudeau's comi...,"[{""start"":122,""end"":127,""text"":""build"",""labels..."
7,185,5,'... a larger question is how to deal with Chi...,"[{""start"":7,""end"":13,""text"":""larger"",""labels"":..."
9,186,5,'Peaceful negotiations' with China means 'peac...,


Note: because rows with missing labels are dropped, this code measures inter-annotator agreement only for texts that BOTH annotators have annotated. It does not account for cases where one annotator found metaphors in a text and the other annotated nothing.
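To get a sense of how many texts fall outside the comparison, one option is to count texts by how many annotators tagged anything in them. The sketch below does this in plain Python on made-up rows; the ids, annotator numbers, and label strings are hypothetical, and `None` stands in for an empty label cell.

```python
from collections import defaultdict

# Hypothetical (text_id, annotator, label) rows; None means nothing was tagged.
rows = [
    (1, 4, '[{"start":56,"end":64}]'),
    (1, 5, '[{"start":56,"end":64}]'),
    (2, 4, None),
    (2, 5, None),
    (3, 4, '[{"start":3,"end":9}]'),
    (3, 5, None),
]

all_ids = set()
annotated_by = defaultdict(set)
for text_id, annotator, label in rows:
    all_ids.add(text_id)
    if label is not None:
        annotated_by[text_id].add(annotator)

# Texts tagged by both annotators are the only ones that survive the NaN drop.
both = sorted(t for t in all_ids if len(annotated_by[t]) == 2)
# Texts tagged by exactly one annotator are disagreements that get discarded.
one_sided = sorted(t for t in all_ids if len(annotated_by[t]) == 1)
neither = sorted(t for t in all_ids if len(annotated_by[t]) == 0)

print(both, one_sided, neither)
```

A large `one_sided` group would mean that the reported agreement is more optimistic than the annotations as a whole.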

In [None]:
compare_df = pd.merge(annotator1_df, annotator2_df, on=['text_id', 'text'], suffixes=('_1', '_2'))
# annotator columns are redundant now
compare_df = compare_df.drop(columns=['annotator_1', 'annotator_2'])
# drop any rows without annotations
compare_df = compare_df.dropna()

In [None]:
compare_df.head()

Unnamed: 0,text_id,text,label_1,label_2
0,182,"Wow, what a great ideal. All we need to do now...","[{""start"":56,""end"":64,""text"":""planting"",""label...","[{""start"":56,""end"":64,""text"":""planting"",""label..."
2,184,' So with Prime Minister Justin Trudeau's comi...,"[{""start"":769,""end"":776,""text"":""fragile"",""labe...","[{""start"":122,""end"":127,""text"":""build"",""labels..."
3,185,'... a larger question is how to deal with Chi...,"[{""start"":139,""end"":143,""text"":""sold"",""labels""...","[{""start"":7,""end"":13,""text"":""larger"",""labels"":..."
5,187,'The proposal has strong resonance in China do...,"[{""start"":25,""end"":34,""text"":""resonance"",""labe...","[{""start"":364,""end"":367,""text"":""hot"",""labels"":..."
7,189,A nice 50% tariff imposed on Chinese imports e...,"[{""start"":84,""end"":91,""text"":""wonders"",""labels...","[{""start"":2,""end"":6,""text"":""nice"",""labels"":[""M..."


In [None]:
# sanity check for how many rows are getting compared
len(compare_df)

13

Here we create a few helper functions to turn the label columns into usable data.

First, `labels_to_list` takes the contents of a label cell and returns a list of lists containing just the start and end indices of the tagged words/phrases. These indices uniquely identify a tagged item within the text and disambiguate between different instances of the same word (e.g. if "line" appears multiple times throughout the comment), so they are all the information we need to work with.

Each item in the returned list is a [start index, end index] pair, extracted from the list of label dictionaries.

In [None]:
# helper function to build a list of [start, end] index pairs
# where labels is the string of labels associated with a specific text
def labels_to_list(labels):
  annotations = []

  labels = literal_eval(labels)

  for label in labels:
    tags = []
    tags.append(int(label['start']))
    tags.append(int(label['end']))

    annotations.append(tags)

  return annotations

This is the main helper function, which produces the textual and overlap agreement scores. As a reminder, textual agreement measures how similar the tagged items are between the annotators. We compare the specific words that are tagged rather than the number of tags, since knowing that both annotators made the same number of annotations is not very useful if they tagged different words. Overlap agreement looks specifically at tagged items that overlap without being identical, for example when one annotator tagged only "line" as a metaphor but the other tagged the entire phrase "line in the sand".

To determine the textual agreement, we first turn the label dictionaries into usable start-end index pairs and sort them by start index (the CSV output seems to list the labels in the order they were created, not the order they appear in the comment). Then we iterate through the two lists to build two additional lists, which are later compared against each other to produce Cohen's kappa.

To build these additional lists, we add, in order, the labels that appear in both lists; when a label appears in only one list, we append "0, 0" to the other list to denote that that annotator did not annotate the item.
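The alignment described here can be sketched as a two-pointer walk over the sorted index pairs. The function name `align` and the example spans are hypothetical. Labels are matched by start index, and the sketch also drains whichever list is longer once the other runs out, which is one way to make sure every label ends up in both output lists.

```python
def align(list1, list2, missing="0, 0"):
    """Pair up two sorted lists of [start, end] spans by start index."""
    tags1, tags2 = [], []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i][0] < list2[j][0]:
            # Only annotator 1 tagged something starting here.
            tags1.append(str(list1[i]))
            tags2.append(missing)
            i += 1
        elif list1[i][0] == list2[j][0]:
            # Both annotators tagged something starting here.
            tags1.append(str(list1[i]))
            tags2.append(str(list2[j]))
            i += 1
            j += 1
        else:
            # Only annotator 2 tagged something starting here.
            tags1.append(missing)
            tags2.append(str(list2[j]))
            j += 1
    # Drain whatever is left in the longer list.
    for span in list1[i:]:
        tags1.append(str(span))
        tags2.append(missing)
    for span in list2[j:]:
        tags1.append(missing)
        tags2.append(str(span))
    return tags1, tags2


tags1, tags2 = align([[4, 10], [20, 25]], [[4, 16], [30, 34], [40, 44]])
print(tags1)
print(tags2)
```

Both returned lists always have the same length, which is what a kappa calculation over paired labels requires.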

It is trickier to use Cohen's kappa to determine overlap agreement. As such, I used a rudimentary, brute-force method: for each overlapping (but, crucially, not identical) pair of labels, I take the ratio between the two labels' lengths (end index minus start index) and add it to a third list for easier calculation. Note that the code only treats two labels as overlapping when they share the same start index.

As an example: let's say "line" has indices [0, 3] and "line in the sand" has indices [0, 15]. Taking the difference of each pair of indices, "line" has a difference of `3` and "line in the sand" has a difference of `15`. We then divide `3/15` to get `0.2`, telling us that the shorter tag covers only `20%` of the longer one.
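The same arithmetic as a small function, using the index pairs from the example. The name `length_ratio` is hypothetical. Because it compares span lengths only, it is a meaningful overlap measure only when both spans start at the same index.

```python
def length_ratio(span1, span2):
    """Ratio of the shorter span's length to the longer span's length."""
    len1 = span1[1] - span1[0]
    len2 = span2[1] - span2[0]
    return min(len1, len2) / max(len1, len2)


line = [0, 3]
line_in_the_sand = [0, 15]
ratio = length_ratio(line, line_in_the_sand)
print(ratio)
```

Taking `min` over `max` keeps the ratio between 0 and 1 regardless of which annotator tagged the longer phrase.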

Once these ratios are calculated, we take their mean to determine how close the overlapping tags were to each other, with values near 0 meaning very little overlap and values near 1 meaning almost complete overlap. Note that a 1 should never appear, since identical labels are excluded from this calculation; for comments with no overlapping labels, `NaN` is recorded instead to avoid skewing the final calculations.

Similarly, a 1 in textual agreement indicates perfect agreement between the annotators, while a 0 indicates agreement no better than chance. Negative values are possible and indicate agreement worse than chance.
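To make the scale concrete, here is a hand-rolled sketch of Cohen's kappa over two equal-length lists of labels. It is a stand-in for illustration, not scikit-learn's implementation, and the name `cohen_kappa` and the example tags are hypothetical. Kappa is (observed agreement - expected agreement) / (1 - expected agreement), so when expected agreement is exactly 1, as with a single shared label, the value is undefined.

```python
from collections import Counter


def cohen_kappa(tags1, tags2):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(tags1)
    # Fraction of positions where the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(tags1, tags2)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts1, counts2 = Counter(tags1), Counter(tags2)
    expected = sum(counts1[label] * counts2[label] for label in counts1) / (n * n)
    if expected == 1:
        return float("nan")  # 0/0: kappa is undefined here
    return (observed - expected) / (1 - expected)


partial = cohen_kappa(["[4, 10]", "[20, 25]", "0, 0"],
                      ["[4, 10]", "0, 0", "[30, 34]"])
perfect = cohen_kappa(["[4, 10]", "[20, 25]"], ["[4, 10]", "[20, 25]"])
single = cohen_kappa(["[4, 10]"], ["[4, 10]"])
print(partial, perfect, single)
```

The `single` case is the situation the notebook corrects for by setting the score to 1 when both lists contain the same single label.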

In [None]:
# helper function to compare tags and labels for a single text
# where row contains the text and labels from both annotators associated with the text
# label1 and label2 are the column headers for the labels
# returns a list containing the agreement score
# of overlapping labels and overall tags
def compare_labels(row, label1, label2):

  # turn labels dictionaries into usable lists
  list1 = labels_to_list(row[label1])
  list2 = labels_to_list(row[label2])

  # sort the keys to be in index order
  # since LabelStudio orders it by time updated, not order of appearance in text
  list1 = sorted(list1, key=lambda x: x[0])
  list2 = sorted(list2, key=lambda x: x[0])

  tags1 = []
  tags2 = []
  overlap = []

  i = 0
  j = 0

  # cohen's kappa takes two lists of equal lengths to compare labels
  # so here we are building new lists from the annotations
  # appending 0's where the label exists in one list but not the other
  # and iterating through both lists until we have read both of them in full

  while i < len(list1) and j < len(list2):
    # if the current label in list1 appears before the current label in list2
    # append list1's label to tags1
    # and append the '0, 0' label to tags2
    if list1[i][0] < list2[j][0]:
      tags1.append(str(list1[i]))
      tags2.append("0, 0")
      i += 1

    # if the current label in list 1 starts with the same word instance
    # as the label in list2
    # append both labels to their respective lists
    elif list1[i][0] == list2[j][0]:
      tags1.append(str(list1[i]))
      tags2.append(str(list2[j]))

      # and here we check for overlap with the difference in index
      # e.g. if one label is [4, 10] and the other [4, 16]
      # we compare the difference of 6 and 12
      # and append 6/12, or 0.5, to the overlap list
      if list1[i][1] != list2[j][1]:
        diff1 = list1[i][1] - list1[i][0]
        diff2 = list2[j][1] - list2[j][0]
        numerator = min(diff1, diff2)
        denominator = max(diff1, diff2)
        overlap.append(float(numerator/denominator))

      i += 1
      j += 1

    # if the current label in list2 appears before the current label in list1
    # append list2's label to tags2
    # and append the '0, 0' label to tags1
    else:
      tags1.append("0, 0")
      tags2.append(str(list2[j]))
      j += 1

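  # the loop above stops once either list is exhausted,
  # so pair any remaining labels with the '0, 0' label
  for label in list1[i:]:
    tags1.append(str(label))
    tags2.append("0, 0")
  for label in list2[j:]:
    tags1.append("0, 0")
    tags2.append(str(label))

  # find cohen's kappa using our two tags arrays
  # if the tags arrays only contain 1 element (i.e. only has one label)
  # cohen's kappa returns a NaN
  # so the condition checks and corrects for that
  textual_kappa = cohen_kappa_score(tags1, tags2)
  if tags1 == tags2 and len(tags1) == 1:
    textual_kappa = 1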

  # find the mean of the overlap agreement in the text
  # or append NaN if no overlap
  if overlap != []:
    overlap_agreement = mean(overlap)
  else:
    overlap_agreement = np.nan

  return [textual_kappa, overlap_agreement]

In [None]:
textual_agreement = []
overlap_agreement = []

for index, row in compare_df.iterrows():
  agreement = compare_labels(row, 'label_1', 'label_2')
  textual_agreement.append(agreement[0])
  overlap_agreement.append(agreement[1])

compare_df['textual_agreement'] = textual_agreement
compare_df['overlap_agreement'] = overlap_agreement

In [None]:
compare_df.head()

Unnamed: 0,text_id,text,label_1,label_2,textual_agreement,overlap_agreement
0,182,"Wow, what a great ideal. All we need to do now...","[{""start"":56,""end"":64,""text"":""planting"",""label...","[{""start"":56,""end"":64,""text"":""planting"",""label...",1.0,
2,184,' So with Prime Minister Justin Trudeau's comi...,"[{""start"":769,""end"":776,""text"":""fragile"",""labe...","[{""start"":122,""end"":127,""text"":""build"",""labels...",0.548387,0.416667
3,185,'... a larger question is how to deal with Chi...,"[{""start"":139,""end"":143,""text"":""sold"",""labels""...","[{""start"":7,""end"":13,""text"":""larger"",""labels"":...",0.347826,
5,187,'The proposal has strong resonance in China do...,"[{""start"":25,""end"":34,""text"":""resonance"",""labe...","[{""start"":364,""end"":367,""text"":""hot"",""labels"":...",0.347826,0.555556
7,189,A nice 50% tariff imposed on Chinese imports e...,"[{""start"":84,""end"":91,""text"":""wonders"",""labels...","[{""start"":2,""end"":6,""text"":""nice"",""labels"":[""M...",1.0,
8,190,Boy it's hard to begin a refutation on Charles...,"[{""start"":759,""end"":768,""text"":""adventure"",""la...","[{""start"":0,""end"":3,""text"":""Boy"",""labels"":[""ME...",0.295499,
10,192,China is an angry country with a big chip on i...,"[{""start"":37,""end"":41,""text"":""chip"",""labels"":[...","[{""start"":12,""end"":17,""text"":""angry"",""labels"":...",0.348837,
11,193,Everything they manufacture was invented and o...,"[{""start"":217,""end"":223,""text"":""driven"",""label...","[{""start"":217,""end"":223,""text"":""driven"",""label...",1.0,
13,195,Here's a good story about Russian disinformati...,"[{""start"":74,""end"":80,""text"":""trolls"",""labels""...","[{""start"":74,""end"":80,""text"":""trolls"",""labels""...",1.0,
14,196,I don't know: it seems people are equally skep...,"[{""start"":161,""end"":168,""text"":""sketchy"",""labe...","[{""start"":161,""end"":168,""text"":""sketchy"",""labe...",1.0,

As per [this explanation of Cohen's kappa](https://datatab.net/tutorial/cohens-kappa), the result can be interpreted using the following guideline:
* \> 0.8 - almost perfect
* \> 0.6 - substantial
* \> 0.4 - moderate
* \> 0.2 - fair
* 0 - 0.2 - slight
* < 0 - poor
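The guideline above can be turned into a small lookup for labelling scores. This is only a sketch: the function name `interpret_kappa` is hypothetical, and values that fall exactly on a boundary are assigned to the lower band, following the strict "greater than" in the list.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the descriptive band from the guideline."""
    if kappa < 0:
        return "poor"
    bands = [
        (0.8, "almost perfect"),
        (0.6, "substantial"),
        (0.4, "moderate"),
        (0.2, "fair"),
    ]
    for threshold, name in bands:
        if kappa > threshold:
            return name
    return "slight"


for value in (0.95, 0.63, 0.35, 0.1, -0.2):
    print(value, interpret_kappa(value))
```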

In [None]:
print("Average textual agreement: ", compare_df['textual_agreement'].mean())
print("Average overlap agreement: ", compare_df['overlap_agreement'].mean())

Average textual agreement:  0.6280268867300005
Average overlap agreement:  0.4887820512820513
