# Interrater Reliability between two human raters and NLP algorithm.

You'll have to change the code at places marked #TOCHANGE

It feels like it makes most sense to compare each human rater pair-wise with the NLP algorithm. First of all, this seems to be the main question we are interested in: How does the NLP algorithm compare to humans? Secondly, if we calculated the overall agreement between the three raters (two human and one NLP), it could return an abnormally high/low value because the humans are agreeing/not agreeing with each other. I think the same is true for Fleiss' Kappa (Cohen's Kappa generalized to more than two raters).

However, it may be useful to compare the humans pair-wise as well, just to provide some context for how the NLP algorithm is performing. (ie. If humans agree really well with each other, but not with NLP, something is strange.)

Also, I think it makes sense to include a significance value for the following Null Hypothesis Significance Testing:
    H_0: kappa = 0 (ratings are random)
    H_A: kappa > 0 (ratings are better than random)
    
We can test this using a randomization test through the following algorithm: 1) Generate n fake datasets of random ratings of k categories between two raters. 2) For each fake dataset, calculate Cohen's kappa for a total of n values of Cohen's kappa. By definition, these kappas are generated under the null hypothesis. 3) Count the proportion of generated kappas that are greater than our observed kappas. By definition, this is the probability of obtaining the observed kappa, or more extreme, under the null hypothesis - also known as the p-value.

In [316]:
import pandas as pd
import nltk
import os
import numpy as np
import random

In [317]:
ratings = pd.read_csv("fake_data.csv") #TOCHANGE needs path to real data csv

#need to add "abstract id" to convert to tidy format
id = pd.DataFrame({"id": range(len(ratings))})
ratings_id = pd.concat([id, ratings], axis=1)

#create three pairwise comparisons: rater_1 (Rater 1 and NLP), rater_2 (Rater 2 and NLP), human (Rater 1 and Rater 2)
rater_1 = ratings_id.drop(columns = ['Rater 2'])
rater_2 = ratings_id.drop(columns = ['Rater 1'])
human = ratings_id.drop(columns = ['Actual'])

In [318]:
def convert_to_tidy(df):
    df_str = df.applymap(str) #cohen's kappa needs ratings as strings
    df_long = pd.melt(df_str, id_vars = "id") #creates tidy format
    df_reorder = df_long[['variable', 'id', 'value']] #required order for cohens kappa
    rating_list = [df_str.iloc[:, 1].values.tolist(), df_str.iloc[:, 2].values.tolist()] #needed for agreement
    return(rating_list, df_reorder)

def print_values(rater_list, rater_tidy, desc): #prints out Kappa and agreement
    agree_obj = agreement.AnnotationTask(data = rater_tidy.values.tolist())
    cm = ConfusionMatrix(rater_list[0], rater_list[1])
    
    print(desc + ":")
    print("Cohen's Kappa: " + str(agree_obj.kappa()))
    print("Agreement: " + str(agree_obj.avg_Ao()))
    
    return(agree_obj.kappa())

In [319]:
rater_1_list, rater_1_tidy = convert_to_tidy(rater_1)
rater_2_list, rater_2_tidy = convert_to_tidy(rater_2)
human_list, human_tidy = convert_to_tidy(human)
    
rater_1_kappa = print_values(rater_1_list, rater_1_tidy, "NLP vs Rater 1")
rater_2_kappa = print_values(rater_2_list, rater_2_tidy, "NLP vs Rater 2")
human_kappa = print_values(human_list, human_tidy, "Rater 1 vs Rater 2")

NLP vs Rater 1:
Cohen's Kappa: -0.01719104768083037
Agreement: 0.0
NLP vs Rater 2:
Cohen's Kappa: -0.015873015873015876
Agreement: 0.0
Rater 1 vs Rater 2:
Cohen's Kappa: -0.014230271668822774
Agreement: 0.0


Cohen's Kappa normal interpretation is as follows: <0.0 = Poor, 0.0-0.2 = Slight, 0.2-0.4 = Fair, 0.4-0.6 = Moderate, 0.6-0.8 = Substantial, 0.8-1.0 = Almost perfect.

Agreement is just # correct/total. Cohen's Kappa should be almost identical to Agreement for large k, where k is number of categories raters had to pick from.

For large k, I'm not sure Cohen's Kappa is a fair assessment, because it's much harder to pick the right one and Cohen's Kappa does not adequately correct for that. So if the normal interpretation is good for your NLP, go with that! Otherwise, you can qualify it, using the human:human pair for comparison - "Cohen's Kappa and agreement for Rater 1:NLP pairs and Rater 2:NLP pairs were low; however, given the large number of categories raters had to choose from, this was expected, as raters had to choose the exact category out of __ options. We see these values are comparable to those from the Rater 1:Rater 2 pair, highlighting the difficulty of the task."

To quantify if the pairs are doing better than expected from random chance, we can generate Kappas under the null.

Below, I generate a large number (n_iterations) of "rating datasets" where each rater randomly choose from k categories (k_categories) with equal probability (our assumption under H_0) and calculate Cohen's Kappa for each of these.

Then, I check what proportion of these null Cohen's Kappa are as extreme, or more extreme, than our observed Cohen's Kappa for each pair of raters. These have identical interpretation as p-values and can be used to argue the raters are doing better than random ratings. Unfortunately, this is not a super strong statement...

In [309]:
np.random.seed(30)
n_iterations = 100000
k_categories = 59 #TOCHANGE How many categories did Raters have access to?

def generate_ratings(n_ratings, k): 
    variable = ["rater_1"] * n_ratings + ["rater_2"] * n_ratings
    rating_num = list(range(n_ratings)) * 2
    value = np.random.randint(1, k, n_ratings * 2).tolist() #random ratings from k categories
    
    rating_num_str = list(map(str, rating_num)) #convert to string
    value_str = list(map(str, value))
    
    ratings_list = [None] * n_ratings * 2
    for i in range(n_ratings * 2):
        ratings_list[i] = [variable[i], rating_num_str[i], value_str[i]]
    
    agree_obj = agreement.AnnotationTask(data = ratings_list)
    return(agree_obj.kappa())

kappa = [None] * n_iterations
for i in range(n_iterations):
    kappa[i] = generate_ratings(n_ratings = 60, k = k_categories)

In [313]:
rater_1_pval = len([val for val in kappa if val >= rater_1_kappa])/n_iterations #proportion of values as extreme or more
rater_2_pval = len([val for val in kappa if val >= rater_2_kappa])/n_iterations
human_pval = len([val for val in kappa if val >= human_kappa])/n_iterations

print(rater_1_pval)
print(rater_2_pval)
print(human_pval)

0.81903
0.74814
0.67951
