# Label Comparison

**Task:** Create an analysis program that compares the BEND labels assigned by two different annotators.

**Research Questions:**

1) Are humans good at detecting the maneuvers?

2) Are some maneuvers easier for humans to detect than others?

3) How much overlap is there between labelers? Do their labels largely match?

4) How does agreement between annotators vary by maneuver? Calculate the agreement between the two coders for each maneuver. This overlaps with question 2.

In [None]:
# setting up drive for data import

import os
import tarfile
import urllib

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd

DATASET_PATH = "/content/drive/MyDrive/Completed"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
from sklearn.metrics import cohen_kappa_score
import numpy as np

# processing data for calculation, including loading and binarizing

BEND_labels = ['Engage', 'Explain', 'Excite', 'Enhance', 'Dismiss', 'Distort', 'Dismay', 'Distract', 'Back', 'Build', 'Bridge', 'Boost', 'Neutralize', 'Nuke', 'Narrow', 'Neglect', 'NONE']
completed_datasets = ["CapnMarvel_1_aahmad1.csv", "CapnMarvel_1_yucongw.csv", "CapnMarvel_2_TPedireddi.csv", "CapnMarvel_2_yucongw.csv", "Election2020_20201102_4_aahmad1.csv", "Election2020_20201102_4_coco.csv"]

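def load_data(data_path):
    # read one annotator's CSV; rows with no maneuver are labeled "NONE"
    df = pd.read_csv(os.path.join(DATASET_PATH, data_path))
    df["maneuver"] = df["maneuver"].fillna("NONE")
    return df["maneuver"].tolist()

def binarize(maneuver, data):
  # 1 if the maneuver name occurs in the row's label string, else 0
  return [1 if maneuver in x else 0 for x in data]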

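# implementing Cohen's kappa

def calculate_kappa(annotator1, annotator2):
  # one kappa per maneuver, each computed on binarized (present/absent) labels
  kappas = []
  for x in BEND_labels:
    binarized1 = binarize(x, annotator1)
    binarized2 = binarize(x, annotator2)
    kappas.append(cohen_kappa_score(binarized1, binarized2))
  return kappas

def overall_agreement(scores):
  # unweighted mean of the per-maneuver kappas
  return sum(scores) / len(scores)

# calculating annotator agreement through multiple datasets
# completed_datasets lists the files in pairs: two annotators per dataset

arr_scores = np.zeros(len(BEND_labels))

for i in range(0, len(completed_datasets), 2):
  annotator1 = load_data(completed_datasets[i])
  annotator2 = load_data(completed_datasets[i+1])
  arr_scores = np.add(np.array(calculate_kappa(annotator1, annotator2)), arr_scores)

# average the summed kappas over the number of annotator pairs
arr_scores /= len(completed_datasets) / 2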

# printing results

print("Annotator agreement by maneuver:")
for i in range(0, 17):
  print('%-15s' '%s' % (BEND_labels[i], arr_scores[i]))
print(f'Overall agreement between both annotators: {overall_agreement(arr_scores)}')

Annotator agreement by maneuver:
Engage         0.1719889309739445
Explain        0.2427149255131402
Excite         0.36445016345346787
Enhance        0.048521264862550484
Dismiss        -0.0063405797101449375
Distort        0.10213817841038982
Dismay         0.3263528981968355
Distract       0.08281523991710504
Back           0.04380936799871941
Build          0.11491840719474185
Bridge         -0.021021821907249622
Boost          0.07654837656807412
Neutralize     0.1850906974837134
Nuke           0.27209916683600893
Narrow         0.1621152872485969
Neglect        -0.029075893616940036
NONE           0.2262449038073281
Overall agreement between both annotators: 0.1390217360723695
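
The per-maneuver scores above come from `cohen_kappa_score`. As a rough sketch of what that score measures for one binarized maneuver, the standard formula is kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The helper name `manual_kappa` and the toy label lists are illustrative assumptions, not part of the notebook.

```python
def manual_kappa(labels1, labels2):
    """Cohen's kappa for two equal-length lists of 0/1 labels (illustrative)."""
    if len(labels1) != len(labels2) or not labels1:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels1)

    # observed agreement: share of rows where both annotators gave the same label
    p_o = sum(a == b for a, b in zip(labels1, labels2)) / n

    # chance agreement, from each annotator's own rate of assigning the label
    p1 = sum(labels1) / n
    p2 = sum(labels2) / n
    p_e = p1 * p2 + (1 - p1) * (1 - p2)

    # kappa is undefined when chance agreement is already total
    if p_e == 1:
        return float("nan")
    return (p_o - p_e) / (1 - p_e)

# toy example (hypothetical data): the annotators agree on 3 of 4 rows
annotator_a = [1, 1, 0, 0]
annotator_b = [1, 0, 0, 0]
print(manual_kappa(annotator_a, annotator_b))  # 0.5
```

Because kappa discounts chance agreement, a rarely assigned maneuver can score low even when the two annotators give the same label on most rows.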


In [None]:
# calculating frequencies of maneuvers in the datasets

total_frequencies = np.zeros(17)

def frequency(annotations):
  maneuver_frequencies = []
  for x in BEND_labels:
    binarized = binarize(x, annotations)
    maneuver_frequencies.append(sum(binarized))
  return maneuver_frequencies

for i in range(0, 6):
  total_frequencies = np.add(total_frequencies, frequency(load_data(completed_datasets[i])))

print("Frequency of maneuvers:")
for i in range(0, 17):
  print('%-15s' '%s' % (BEND_labels[i], total_frequencies[i]))

# share of all label assignments, converted from a fraction to a percentage
percentages = np.divide(total_frequencies, sum(total_frequencies))
percentages = np.multiply(percentages, 100)

print("Percent frequency of maneuvers:")
for i in range(0, 17):
  print('%-15s' '%.2f' % (BEND_labels[i], percentages[i]))

Frequency of maneuvers:
Engage         75.0
Explain        167.0
Excite         159.0
Enhance        85.0
Dismiss        58.0
Distort        88.0
Dismay         206.0
Distract       13.0
Back           153.0
Build          96.0
Bridge         19.0
Boost          92.0
Neutralize     127.0
Nuke           39.0
Narrow         42.0
Neglect        53.0
NONE           59.0
Percent frequency of maneuvers:
Engage         4.90
Explain        10.91
Excite         10.39
Enhance        5.55
Dismiss        3.79
Distort        5.75
Dismay         13.46
Distract       0.85
Back           9.99
Build          6.27
Bridge         1.24
Boost          6.01
Neutralize     8.30
Nuke           2.55
Narrow         2.74
Neglect        3.46
NONE           3.85
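
One possible way to approach research questions 2 and 4 is to rank the maneuvers by kappa and to check whether agreement tracks how often a maneuver is used. The sketch below assumes a Pearson correlation is a reasonable first look; the dictionary names and the `pearson` helper are hypothetical, and the values are copied from the outputs above (kappas rounded to three decimals). The frequencies are summed over all six files, while the kappas are averaged over the three annotator pairs.

```python
from math import sqrt

# per-maneuver kappa, rounded from the agreement output above
kappa_by_maneuver = {
    "Engage": 0.172, "Explain": 0.243, "Excite": 0.364, "Enhance": 0.049,
    "Dismiss": -0.006, "Distort": 0.102, "Dismay": 0.326, "Distract": 0.083,
    "Back": 0.044, "Build": 0.115, "Bridge": -0.021, "Boost": 0.077,
    "Neutralize": 0.185, "Nuke": 0.272, "Narrow": 0.162, "Neglect": -0.029,
    "NONE": 0.226,
}

# raw label counts from the frequency output above
frequency_by_maneuver = {
    "Engage": 75, "Explain": 167, "Excite": 159, "Enhance": 85,
    "Dismiss": 58, "Distort": 88, "Dismay": 206, "Distract": 13,
    "Back": 153, "Build": 96, "Bridge": 19, "Boost": 92,
    "Neutralize": 127, "Nuke": 39, "Narrow": 42, "Neglect": 53,
    "NONE": 59,
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

labels = list(kappa_by_maneuver)

# question 2: maneuvers ordered from highest to lowest agreement
ranked = sorted(labels, key=kappa_by_maneuver.get, reverse=True)
print("Highest agreement:", ranked[:3])

# question 4: does agreement rise with how often a maneuver is labeled?
r = pearson(
    [frequency_by_maneuver[m] for m in labels],
    [kappa_by_maneuver[m] for m in labels],
)
print(f"Pearson r between frequency and kappa: {r:.2f}")
```

With only 17 maneuvers and three annotator pairs, any correlation found this way should be read as exploratory.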


# Interpreting Cohen's kappa

`<0`: no agreement

`0-0.20`: slight agreement

`0.21-0.40`: fair agreement

`0.41-0.60`: moderate agreement

`0.61-0.80`: substantial agreement

`0.81-1`: almost perfect agreement
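
The bands above can be applied to the computed scores with a small lookup. This is a minimal sketch: the function name `interpret_kappa` is an assumption, values that fall between two listed bands (for example 0.205) are assigned to the lower band, and the top band uses the label "almost perfect", the usual wording for this scale.

```python
def interpret_kappa(kappa):
    """Map a Cohen's kappa value to the agreement band listed above."""
    if kappa < 0:
        return "no agreement"
    if kappa <= 0.20:
        return "slight agreement"
    if kappa <= 0.40:
        return "fair agreement"
    if kappa <= 0.60:
        return "moderate agreement"
    if kappa <= 0.80:
        return "substantial agreement"
    return "almost perfect agreement"

# overall score and two per-maneuver scores reported earlier in this notebook
print(interpret_kappa(0.1390217360723695))   # slight agreement
print(interpret_kappa(0.36445016345346787))  # fair agreement (Excite)
print(interpret_kappa(-0.021021821907249622))  # no agreement (Bridge)
```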