# GIGO revisited
## Script 1: Inter-Rater Reliability Metrics

### Install dependencies
See `requirements.txt` for specific version numbers.

In [1]:
!pip install pandas seaborn openpyxl simpledorff

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/opt/conda/bin/python3.8 -m pip install --upgrade pip' command.


### Import libraries

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from matplotlib import ticker
import simpledorff

## Analysis: Krippendorff's alpha

### Load and explore data

In [3]:
df = pd.read_excel("../data/all_labels_hashed.xlsx", sheet_name=None, keep_default_na=False)

The Excel file has one sheet per question. Rows are items, with one column for each labeler's response. With `sheet_name=None`, `pd.read_excel()` returns a dictionary that maps each sheet name to its own dataframe.

In [4]:
df.keys()

dict_keys(['original_classification_task', 'classification_outcome', 'labels_from_human_annotation', 'human_annotation_for_training_', 'used_original_human_annotation', 'original_human_annotation_sour', 'prescreening_for_crowdwork', 'annotator_compensation', 'training_for_human_annotators', 'formal_instructions_', 'multiple_annotator_overlap', 'synthesis_of_annotator_overlap', 'reported_inter-annotator_agree', 'total_num_of_human_annotators', 'median_num_of_annotators_per_i', 'link_to_dataset_available'])

In [5]:
df['original_classification_task']

Unnamed: 0,URL hash,Annotator 1,Annotator 2,Annotator 3,Annotator 4,Annotator 5,Annotator 6,Final
0,4e6ec7fb47f277e38f25e35238ca9685ab97c6f0caae10...,yes,yes,yes,yes,yes,,
1,2421ef5ad32c44fd217fa11cae35504247a6e2b7a3f94e...,yes,,yes,yes,yes,,yes
2,1e0bdeb7f8a0fdd7bc3ee02ca01673f4a5433241095be9...,Unsure,yes,yes,yes,yes,,yes
3,3714e16720b27a5a102dde1dd7cfe0201d52a65e4f486f...,yes,,yes,yes,yes,,yes
4,29bab7d76596226416eead7306f540340a0cab99338ec1...,yes,yes,yes,yes,yes,,yes
...,...,...,...,...,...,...,...,...
195,929a8c77b1e75c26130344dca269759b44ed01ed53eba7...,yes,yes,yes,,yes,,yes
196,e6da1957824240f2b7de01f1d7b54def296924d357ff95...,yes,yes,yes,,no,,yes
197,a747196b27097e349348547272305007a128b49a159449...,yes,yes,yes,,no,,yes
198,d29948f93b01194845e2ebc5c5899c38e73cd3d5d2d843...,yes,yes,yes,,yes,,yes


### Clean data
First, some values aren't formatted the way they need to be for comparisons to work, so convert all fields to lowercase strings. Then:

1. Blanks that were np.nan became "nan" strings in the prior step, so convert those back to np.nan.
1. Replace "N/A" with "answered NA", because some libraries convert "N/A" to np.nan on their own.
1. Convert any blank strings into np.nan.
1. Replace "-" with "no information" (this is how the instructions said to report no information for the number-of-annotators questions).
1. Replace "unsure" with np.nan, which treats unsures the same as blanks for IRR purposes.
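
The reason for step 2 can be seen in a small sketch. It uses `pd.read_csv()` on an in-memory string rather than the Excel file, and the CSV contents are made up for illustration. By default pandas parses "N/A" and blank cells as missing values, while `keep_default_na=False` keeps them as literal strings:

```python
import io

import pandas as pd

# Hypothetical three-row file: one "N/A" answer, one "yes", one blank cell
csv_text = "item,label\na,N/A\nb,yes\nc,\n"

# Default parsing treats both "N/A" and the blank as missing
default_parse = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps the raw strings, so "N/A" survives as an answer
literal_parse = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_parse["label"].isna().tolist())  # [True, False, True]
print(literal_parse["label"].tolist())         # ['N/A', 'yes', '']
```

Renaming "N/A" to a string that no parser treats as missing keeps a deliberate "not applicable" answer distinct from a blank.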

In [6]:
def clean_df(df):
    df = df.apply(lambda x: x.astype(str).str.lower())
    df = df.replace(to_replace="nan",value=np.nan)    
    df = df.replace(to_replace="n/a",value="answered NA")  # because some code keeps wanting to turn N/A into np.nan
    df = df.replace(to_replace="",value=np.nan)
    df = df.replace(to_replace="-", value="no information") # for the number of annotators questions
    df = df.replace(to_replace="unsure",value=np.nan)
    
    return df

Clean each sheet:

In [7]:
for sheet in list(df.keys()):
    df[sheet] = clean_df(df[sheet])
    

In [8]:
df['total_num_of_human_annotators']

Unnamed: 0,URL hash,Annotator 1,Annotator 2,Annotator 3,Annotator 4,Annotator 5,Annotator 6,Final
0,4e6ec7fb47f277e38f25e35238ca9685ab97c6f0caae10...,answered NA,answered NA,no information,no information,,,answered NA
1,2421ef5ad32c44fd217fa11cae35504247a6e2b7a3f94e...,answered NA,,no information,no information,answered NA,,answered NA
2,1e0bdeb7f8a0fdd7bc3ee02ca01673f4a5433241095be9...,answered NA,no information,no information,no information,answered NA,,answered NA
3,3714e16720b27a5a102dde1dd7cfe0201d52a65e4f486f...,answered NA,,no information,no information,answered NA,,10
4,29bab7d76596226416eead7306f540340a0cab99338ec1...,answered NA,no information,no information,no information,answered NA,,answered NA
...,...,...,...,...,...,...,...,...
195,929a8c77b1e75c26130344dca269759b44ed01ed53eba7...,no information,no information,answered NA,,answered NA,,answered NA
196,e6da1957824240f2b7de01f1d7b54def296924d357ff95...,no information,2,2,,answered NA,,2
197,a747196b27097e349348547272305007a128b49a159449...,2,answered NA,answered NA,,answered NA,,answered NA
198,d29948f93b01194845e2ebc5c5899c38e73cd3d5d2d843...,answered NA,no information,answered NA,,answered NA,,no information


### Transform data

The `simpledorff` library expects a melted dataframe in the following format:

In [9]:
q1 = df['original_classification_task'].melt(id_vars="URL hash")
q1

Unnamed: 0,URL hash,variable,value
0,4e6ec7fb47f277e38f25e35238ca9685ab97c6f0caae10...,Annotator 1,yes
1,2421ef5ad32c44fd217fa11cae35504247a6e2b7a3f94e...,Annotator 1,yes
2,1e0bdeb7f8a0fdd7bc3ee02ca01673f4a5433241095be9...,Annotator 1,
3,3714e16720b27a5a102dde1dd7cfe0201d52a65e4f486f...,Annotator 1,yes
4,29bab7d76596226416eead7306f540340a0cab99338ec1...,Annotator 1,yes
...,...,...,...
1395,929a8c77b1e75c26130344dca269759b44ed01ed53eba7...,Final,yes
1396,e6da1957824240f2b7de01f1d7b54def296924d357ff95...,Final,yes
1397,a747196b27097e349348547272305007a128b49a159449...,Final,yes
1398,d29948f93b01194845e2ebc5c5899c38e73cd3d5d2d843...,Final,yes
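
To make the reshaping concrete, here is a minimal sketch of `melt()` on a toy wide table. The two-item frame and its values are invented for illustration. Each annotator column becomes a set of rows, with the column name in `variable` and the label in `value`:

```python
import pandas as pd

# Toy wide table: one row per item, one column per annotator (made-up values)
wide = pd.DataFrame({
    "URL hash": ["a", "b"],
    "Annotator 1": ["yes", "no"],
    "Annotator 2": ["yes", "yes"],
})

# Long format: one row per (item, annotator) pair
long = wide.melt(id_vars="URL hash")

print(long)
```

Every column not listed in `id_vars` is melted, so any non-annotator column left in the frame also ends up as a value of `variable`.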


Run a sample to test.

In [10]:
simpledorff.calculate_krippendorffs_alpha_for_df(q1,experiment_col='URL hash',
                                                 annotator_col='variable',
                                                 class_col='value')

0.6704685927345297
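
For intuition about what this number measures, the following is a minimal pure-Python sketch of Krippendorff's alpha for nominal labels, using the standard coincidence-matrix formulation (alpha = 1 - observed disagreement / expected disagreement). It is not `simpledorff`'s implementation, and the function name `nominal_alpha` and the toy inputs are assumptions made for illustration:

```python
from collections import Counter
from itertools import permutations


def nominal_alpha(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists, one inner list of labels per item; None marks a
    missing label. Assumes at least two distinct labels occur overall.
    """
    coincidences = Counter()
    for labels in units:
        present = [label for label in labels if label is not None]
        m = len(present)
        if m < 2:
            continue  # items with fewer than two labels carry no pairing information
        # Every ordered pair of labels from different annotators, weighted by 1/(m-1)
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per label and the overall number of pairable labels
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n - 1)
    return 1 - observed / expected


# Toy check: one disagreement among three doubly-labeled items
print(nominal_alpha([["yes", "no"], ["yes", "yes"], ["no", "no"]]))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so a question where almost everyone picks the same label can still score low.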

The loop below iterates through each sheet in the dictionary, melts it, calculates Krippendorff's alpha (Ka) for the melted sheet, and stores the result in a new dictionary, `ka_dict`:

In [11]:
ka_dict = {}
for sheet in list(df.keys()):
    sheet_df = df[sheet].melt(id_vars="URL hash")
    print(sheet)
    ka_dict[sheet] = simpledorff.calculate_krippendorffs_alpha_for_df(sheet_df,experiment_col='URL hash',
                                                 annotator_col='variable',
                                                 class_col='value')
    print(ka_dict[sheet])

original_classification_task
0.6704685927345297
classification_outcome
0.520310339963324
labels_from_human_annotation
0.517077633724152
human_annotation_for_training_
0.5168762194608635
used_original_human_annotation
0.49764243839882727
original_human_annotation_sour
0.3295543968704172
prescreening_for_crowdwork
0.09716769760445754
annotator_compensation
0.3427458775989042
training_for_human_annotators
0.3639740074970065
formal_instructions_
0.33656217062897187
multiple_annotator_overlap
0.3703134478611023
synthesis_of_annotator_overlap
0.1459893714190822
reported_inter-annotator_agree
0.1211357811559225
total_num_of_human_annotators
0.28141508841358065
median_num_of_annotators_per_i
0.26144640785109685
link_to_dataset_available
0.3217499138905462


Transform results into a dataframe:

In [12]:
ka_df = pd.DataFrame(ka_dict,index=[0]).T
ka_df.columns = ["ka_score"]
ka_df

Unnamed: 0,ka_score
original_classification_task,0.670469
classification_outcome,0.52031
labels_from_human_annotation,0.517078
human_annotation_for_training_,0.516876
used_original_human_annotation,0.497642
original_human_annotation_sour,0.329554
prescreening_for_crowdwork,0.097168
annotator_compensation,0.342746
training_for_human_annotators,0.363974
formal_instructions_,0.336562
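
As an aside, a dictionary of scalars can also be turned into a one-column dataframe through a named `pd.Series`, which avoids the `index=[0]` and transpose steps. A small sketch using the first two scores printed above (the variable names are illustrative):

```python
import pandas as pd

# Two of the scores from the loop output, keyed by sheet name
ka_scores = {
    "original_classification_task": 0.6704685927345297,
    "classification_outcome": 0.520310339963324,
}

# The Series name becomes the column name; the dict keys become the index
ka_frame = pd.Series(ka_scores, name="ka_score").to_frame()

print(ka_frame)
```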


## Custom agreement metrics


### Load and clean dataset
Using the same `clean_df()` function from earlier:

In [13]:
df = pd.read_excel("../data/all_labels_hashed.xlsx", sheet_name=None, keep_default_na=False)

In [14]:
for sheet in list(df.keys()):
    df[sheet] = clean_df(df[sheet])
    

### Functions for scoring a row

In [15]:
def calc_total_agree_row(row):
    """
    Score the total agreement of a row of labels, ignoring blank (np.nan) values
    
    Parameters:
        row (pd.Series): row name (URL hash, unused) followed by 6 labels (1 for each labeler)
    
    returns: 
        score (int): 1 if all non-np.nan/blank labels are the same, 0 if any difference
    """
   
    labels = row[1:7].str.lower()
    
    label_count = labels.value_counts(dropna=True)
    
    if len(label_count) <= 1:
        return 1
    else:
        return 0   
        

In [16]:
def calc_mean_correct_row(row):
    """
    Score the mean agreement of a row of labels.
    
    Parameters:
        row (pd.Series): row name (URL hash, unused) followed by 7 labels (6 labelers, 1 final)
    
    returns: 
        score (float): proportion of first 6 labels that are the same as the last
        label.
    """
    correct_count = 0
    total_count = 0
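    final_label = row.iloc[7]  # the "Final" column, selected by position
    for label in row.iloc[1:7]:
        
        if label == final_label:
            correct_count += 1
            total_count += 1
        elif pd.isna(label):
            pass  # blank labels are left out of the denominator
        else:
            total_count += 1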
    
    if total_count == 0:
        return np.nan
    
    return correct_count / total_count

### Testing with a few rows

In [17]:
df['original_classification_task'].iloc[[1,90,99,152]]

Unnamed: 0,URL hash,Annotator 1,Annotator 2,Annotator 3,Annotator 4,Annotator 5,Annotator 6,Final
1,2421ef5ad32c44fd217fa11cae35504247a6e2b7a3f94e...,yes,,yes,yes,yes,,yes
90,708fb39c97d35d9ddcf13c4cd9b9c45d810769e7f7c031...,yes,yes,yes,yes,no,yes,yes
99,94ad8c9b7ec2ae92d99fe6b4a659686da9220169e67d1e...,no,answered NA,yes,no,yes,,yes
152,64b61dee6d1e73dd8a98dbfdbd2a41c13e423a3197e553...,yes,no,yes,,yes,,no


These rows should score: 1, 0, 0, 0

In [18]:

df['original_classification_task'].iloc[[1,90,99,152]].apply(calc_total_agree_row,axis=1)

1      1
90     0
99     0
152    0
dtype: int64
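
As a cross-check, the same score can be sketched with a vectorized expression: a row is in total agreement when it has at most one distinct non-missing label. The toy frame below re-enters the annotator labels of the four test rows by hand; `nunique()` skips np.nan by default:

```python
import numpy as np
import pandas as pd

# Annotator labels for the four test rows (1, 90, 99, 152), typed in by hand
rows = [
    ["yes", np.nan, "yes", "yes", "yes", np.nan],
    ["yes", "yes", "yes", "yes", "no", "yes"],
    ["no", "answered na", "yes", "no", "yes", np.nan],
    ["yes", "no", "yes", np.nan, "yes", np.nan],
]
labels = pd.DataFrame(
    rows,
    columns=[f"Annotator {i}" for i in range(1, 7)],
    index=[1, 90, 99, 152],
)

# At most one distinct non-missing label per row means total agreement
total_agreement = (labels.nunique(axis=1) <= 1).astype(int)

print(total_agreement.tolist())  # [1, 0, 0, 0]
```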

These rows should score: 1 (4/4), 0.833 (5/6), 0.4 (2/5), 0.25 (1/4)

In [19]:
df['original_classification_task'].iloc[[1,90,99,152]].apply(calc_mean_correct_row,axis=1)

1      1.000000
90     0.833333
99     0.400000
152    0.250000
dtype: float64
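
The mean-correct score can be cross-checked the same way: count the labels that equal the final label and divide by the number of non-missing labels. This sketch again re-enters the four test rows by hand, and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Annotator labels and final labels for the four test rows (1, 90, 99, 152)
rows = [
    ["yes", np.nan, "yes", "yes", "yes", np.nan],
    ["yes", "yes", "yes", "yes", "no", "yes"],
    ["no", "answered na", "yes", "no", "yes", np.nan],
    ["yes", "no", "yes", np.nan, "yes", np.nan],
]
index = [1, 90, 99, 152]
labels = pd.DataFrame(rows, columns=[f"Annotator {i}" for i in range(1, 7)], index=index)
final = pd.Series(["yes", "yes", "yes", "no"], index=index)

# Compare every annotator column with the final label, row by row
matches = labels.eq(final, axis=0).sum(axis=1)

# Only non-missing labels count toward the denominator
answered = labels.notna().sum(axis=1)

mean_correct = matches / answered

print(mean_correct)
```

A row with no labels at all gives 0/0, which pandas reports as NaN, in line with the `np.nan` return in `calc_mean_correct_row()`.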

### Calculate and store scores
Iterate through each sheet. Use `apply` to calculate the score for each row, and store each row's score as a new column.

In [20]:
for q in list(df.keys()):
    print(q)
    df[q]['total_agreement'] = df[q].apply(calc_total_agree_row,axis=1)
    df[q]['mean_correct'] = df[q].apply(calc_mean_correct_row,axis=1)

original_classification_task
classification_outcome
labels_from_human_annotation
human_annotation_for_training_
used_original_human_annotation
original_human_annotation_sour
prescreening_for_crowdwork
annotator_compensation
training_for_human_annotators
formal_instructions_
multiple_annotator_overlap
synthesis_of_annotator_overlap
reported_inter-annotator_agree
total_num_of_human_annotators
median_num_of_annotators_per_i
link_to_dataset_available


Show results aggregated for each question:

In [21]:
for q in list(df.keys()):
    print(q)
    print("Mean total agreement:", df[q]['total_agreement'].mean())
    print("Mean mean correct:", df[q]['mean_correct'].mean())
    print()

original_classification_task
Mean total agreement: 0.66
Mean mean correct: 0.8479999999999996

classification_outcome
Mean total agreement: 0.345
Mean mean correct: 0.6542500000000001

labels_from_human_annotation
Mean total agreement: 0.375
Mean mean correct: 0.6823333333333332

human_annotation_for_training_
Mean total agreement: 0.465
Mean mean correct: 0.772583333333333

used_original_human_annotation
Mean total agreement: 0.435
Mean mean correct: 0.7105

original_human_annotation_sour
Mean total agreement: 0.435
Mean mean correct: 0.7106666666666667

prescreening_for_crowdwork
Mean total agreement: 0.585
Mean mean correct: 0.8419999999999994

annotator_compensation
Mean total agreement: 0.46
Mean mean correct: 0.6796666666666664

training_for_human_annotators
Mean total agreement: 0.48
Mean mean correct: 0.7

formal_instructions_
Mean total agreement: 0.475
Mean mean correct: 0.6679166666666666

multiple_annotator_overlap
Mean total agreement: 0.485
Mean mean correct: 0.6929166666

### Convert results into final dataframe

In [22]:
summary_df = pd.DataFrame(columns=("Question","Mean total agreement", "Mean mean correct"))
summary_df

Unnamed: 0,Question,Mean total agreement,Mean mean correct


In [23]:
list_of_dicts = []
for q in list(df.keys()):
    row = {"Question":q,
           "Mean total agreement":df[q]['total_agreement'].mean(),
           "Mean mean correct":df[q]['mean_correct'].mean()
          }
    
    list_of_dicts.append(row)
    

Build the summary dataframe and merge it with the Ka results:

In [24]:
summary_df = pd.DataFrame(list_of_dicts)[['Question','Mean total agreement','Mean mean correct']]
summary_df = summary_df.set_index('Question')
summary_df = pd.concat([summary_df,ka_df],axis=1)
summary_df

Unnamed: 0,Mean total agreement,Mean mean correct,ka_score
original_classification_task,0.66,0.848,0.670469
classification_outcome,0.345,0.65425,0.52031
labels_from_human_annotation,0.375,0.682333,0.517078
human_annotation_for_training_,0.465,0.772583,0.516876
used_original_human_annotation,0.435,0.7105,0.497642
original_human_annotation_sour,0.435,0.710667,0.329554
prescreening_for_crowdwork,0.585,0.842,0.097168
annotator_compensation,0.46,0.679667,0.342746
training_for_human_annotators,0.48,0.7,0.363974
formal_instructions_,0.475,0.667917,0.336562


Calculate means and medians across all questions:

In [25]:
summary_df.mean()

Mean total agreement    0.480313
Mean mean correct       0.731224
ka_score                0.355902
dtype: float64

In [26]:
summary_df.median()

Mean total agreement    0.477500
Mean mean correct       0.696500
ka_score                0.339654
dtype: float64

In [27]:
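# Compute both summary rows before adding either, so the median excludes the average row
mean_row = summary_df.mean().rename("Average across all questions")
median_row = summary_df.median().rename("Median across all questions")
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
summary_df = pd.concat([summary_df, mean_row.to_frame().T, median_row.to_frame().T])
summary_df.columns = ['Mean total agreement', 'Mean percent correct', "Krippendorf's alpha"]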
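
One thing to watch when adding summary rows: a statistic computed after another summary row is already in the frame will count that row as if it were a question. A toy sketch with invented scores shows the effect on the median:

```python
import pandas as pd

# Three made-up per-question scores
scores = pd.DataFrame({"score": [1.0, 2.0, 9.0]}, index=["q1", "q2", "q3"])

# Statistics computed from the question rows only
mean_value = scores["score"].mean()      # 4.0
median_value = scores["score"].median()  # 2.0

# Once the mean row has been added, a later median also counts that row
with_mean = pd.concat([scores, pd.DataFrame({"score": [mean_value]}, index=["average"])])
shifted_median = with_mean["score"].median()  # 3.0

print(median_value, shifted_median)
```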



Format the first two metrics as percentages with one decimal place, and round Krippendorff's alpha to 3 decimal places:

In [28]:
df_style = {
    'Mean total agreement': '{:,.1%}'.format,
    'Mean percent correct': '{:,.1%}'.format,
    "Krippendorf's alpha": '{:,.3f}'.format,
}

summary_df.style.format(df_style)


Unnamed: 0,Mean total agreement,Mean percent correct,Krippendorf's alpha
original_classification_task,66.0%,84.8%,0.67
classification_outcome,34.5%,65.4%,0.52
labels_from_human_annotation,37.5%,68.2%,0.517
human_annotation_for_training_,46.5%,77.3%,0.517
used_original_human_annotation,43.5%,71.0%,0.498
original_human_annotation_sour,43.5%,71.1%,0.33
prescreening_for_crowdwork,58.5%,84.2%,0.097
annotator_compensation,46.0%,68.0%,0.343
training_for_human_annotators,48.0%,70.0%,0.364
formal_instructions_,47.5%,66.8%,0.337
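
The values in `df_style` are ordinary bound `str.format` methods, so their effect can be checked on single numbers without a dataframe. A short sketch using a few scores from the tables above:

```python
# The same format specs used in df_style
as_percent = '{:,.1%}'.format  # multiply by 100, one decimal place, add "%"
as_alpha = '{:,.3f}'.format    # fixed-point, three decimal places

print(as_percent(0.345))               # 34.5%
print(as_percent(0.6542500000000001))  # 65.4%
print(as_alpha(0.3295543968704172))    # 0.330
```

Note that a `.3f` spec keeps trailing zeros, so every alpha is printed with exactly three decimals.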


### Export results with per-row scores to file

In [29]:
with pd.ExcelWriter('../data/all_labels_with_irr_hashed.xlsx') as writer:
    for sheet in list(df.keys()):
        df[sheet].to_excel(writer, sheet_name=sheet)