# Calculating inter-annotator agreement

This notebook computes inter-annotator agreement (IAA) on a subset of 50 items taken from the 1000-item random sample.

In [None]:
import numpy as np

# Define Fleiss' kappa calculation
# Taken from https://gist.github.com/LouisdeBruijn/9dced57c54e0029e29cdfcfb2e54a8c8#file-fleis_kappa-py
def fleiss_kappa(M):
    """Compute Fleiss' kappa for a group of annotators.
    :param M: a matrix of shape (N, k), where N is the number of subjects and k is the number of categories.
        M[i, j] is the number of raters who assigned the i-th subject to the j-th category.
    :type M: numpy.ndarray
    :rtype: float
    :return: Fleiss' kappa score, rounded to four decimal places
    """
    N, k = M.shape  # N is # of items, k is # of categories
    n_annotators = float(np.sum(M[0, :]))  # # of annotators
    tot_annotations = N * n_annotators  # the total # of annotations
    category_sum = np.sum(M, axis=0)  # the sum of each category over all items

    # chance agreement
    p = category_sum / tot_annotations  # the distribution of each category over all annotations
    PbarE = np.sum(p * p)  # average chance agreement over all categories

    # observed agreement
    P = (np.sum(M * M, axis=1) - n_annotators) / (n_annotators * (n_annotators - 1))
    Pbar = np.sum(P) / N  # add all observed agreement chances per item and divide by amount of items

    return round((Pbar - PbarE) / (1 - PbarE), 4)
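
As a quick sanity check on the formula, here is a minimal, self-contained sketch that applies the same computation to a small made-up count matrix (10 items, 14 raters, 5 categories). The function name `fleiss_kappa_sketch` and all numbers are illustrative assumptions, not part of the notebook's data, and the rounding step is left out.

```python
import numpy as np

def fleiss_kappa_sketch(M):
    """Fleiss' kappa for an (items x categories) count matrix (illustrative helper)."""
    M = np.asarray(M, dtype=float)
    n_items = M.shape[0]
    n_raters = M[0].sum()  # assumes every item was rated by the same number of raters
    p = M.sum(axis=0) / (n_items * n_raters)  # share of each category
    p_e = np.sum(p * p)                       # expected (chance) agreement
    P = (np.sum(M * M, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    return (P.mean() - p_e) / (1 - p_e)

# Made-up counts: each row is one item, each row sums to 14 raters.
counts = np.array([
    [0, 0, 0, 0, 14],
    [0, 2, 6, 4, 2],
    [0, 0, 3, 5, 6],
    [0, 3, 9, 2, 0],
    [2, 2, 8, 1, 1],
    [7, 7, 0, 0, 0],
    [3, 2, 6, 3, 0],
    [2, 5, 3, 2, 2],
    [6, 5, 2, 1, 0],
    [0, 2, 2, 3, 7],
])

kappa = fleiss_kappa_sketch(counts)
print(round(kappa, 4))  # roughly 0.21, i.e. only slight agreement beyond chance

# Three raters who always agree give kappa = 1.
perfect = fleiss_kappa_sketch(np.array([[3, 0], [0, 3], [3, 0]]))
print(perfect)
```

A value near 0 means agreement is about what chance would produce; a value of 1 means perfect agreement.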

First, we need to load the data in a usable form. For this, we read each CSV file into its own pandas dataframe.

In [None]:
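# Load the three annotation CSVs. keep_default_na=False stops pandas from parsing
# the literal string 'NA' (a QA category) as missing; only '_' is treated as NaN.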
import pandas as pd

df_sb = pd.read_csv('1000sample_SB_ForInterAnnotator.csv', keep_default_na=False, na_values=['_'])
df_nch = pd.read_csv('1000sample_NCH_ForInterAnnotator.csv', keep_default_na=False, na_values=['_'])
df_sd = pd.read_csv('1000sample_SD_ForInterAnnotator.csv', keep_default_na=False, na_values=['_'])

We're only going to use items 101-150 for calculating the IAA, and only a subset of columns, so let's chop the dataframes down a bit.

In [None]:
target_cols = ['software', 'MAIN', 'QA', 'QA_retrieval', 'preprint', 'software_paper']
df_sb = df_sb[target_cols]
df_sb = df_sb.iloc[100:150, :]
df_nch = df_nch[target_cols]
df_nch = df_nch.iloc[100:150, :]
df_sd = df_sd[target_cols]
df_sd = df_sd.iloc[100:150, :]

# Drop the line with a duplicate MAIN annotation
df_sb = df_sb.drop(df_sb.index[12])
df_nch = df_nch.drop(df_nch.index[12])
df_sd = df_sd.drop(df_sd.index[12])

# Also keep the names of the columns containing binary data
binary_cols = target_cols[3:]
binary_cols
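
To make the positional slicing concrete, here is a small sketch on a hypothetical table of 200 numbered items (the `toy` dataframe and its `item` column are made up for illustration). It shows that `iloc[100:150]` selects items 101-150, and that dropping `index[12]` removes the 13th row of that subset.

```python
import pandas as pd

# Hypothetical stand-in for an annotation table: 200 items numbered from 1.
toy = pd.DataFrame({"item": range(1, 201)})

# iloc is zero-based and end-exclusive: positions 100..149 hold items 101..150.
subset = toy.iloc[100:150, :]

# subset.index[12] is the index label of the 13th row in the subset.
trimmed = subset.drop(subset.index[12])

print(subset["item"].iloc[0], subset["item"].iloc[-1], len(subset))
print(len(trimmed), 113 in trimmed["item"].values)
```

Applying the same positional drop to every dataframe only keeps the annotators aligned if all files list the items in the same order.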

To calculate inter-annotator agreement, we need to transform the categorical data into numerical data.
Therefore, let's have a look at the annotation categories and how they transform into ints.

We are working on **nominal scales**, so the actual int values have no mathematical meaning.
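
Because the scale is nominal, relabeling the categories should leave the agreement score unchanged. The following minimal sketch illustrates this with made-up ratings and a hand-rolled Fleiss' kappa. The helper names (`count_table`, `fleiss_kappa_sketch`) and the data are assumptions for illustration, not the notebook's own code or data.

```python
import numpy as np

def count_table(item_by_rater):
    """Turn an (items x raters) label array into an (items x categories) count table."""
    cats = np.unique(item_by_rater)
    return np.stack([(item_by_rater == c).sum(axis=1) for c in cats], axis=1)

def fleiss_kappa_sketch(M):
    """Fleiss' kappa for an (items x categories) count table (illustrative helper)."""
    M = np.asarray(M, dtype=float)
    n_items = M.shape[0]
    n_raters = M[0].sum()
    p = M.sum(axis=0) / (n_items * n_raters)
    p_e = np.sum(p * p)
    P = (np.sum(M * M, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    return (P.mean() - p_e) / (1 - p_e)

# Made-up labels: 5 items (rows) rated by 3 raters (columns), codes 1..3.
labels = np.array([
    [1, 1, 1],
    [2, 2, 3],
    [3, 3, 3],
    [1, 2, 1],
    [2, 2, 2],
])

# Swap the integer codes arbitrarily: 1 -> 7, 2 -> 1, 3 -> 4.
lookup = np.array([0, 7, 1, 4])
relabeled = lookup[labels]

k_original = fleiss_kappa_sketch(count_table(labels))
k_relabeled = fleiss_kappa_sketch(count_table(relabeled))
print(np.isclose(k_original, k_relabeled))
```

Only the pattern of matches and mismatches between raters matters, so any one-to-one relabeling gives the same score.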

### MAIN category

- 1: PRO
- 2: PUB
- 3: MAN
- 4: URL
- 5: INS
- 6: NAM
- 7: NOT

### QA Layer

- 1: SC
- 2: SN
- 3: SF
- 4: NA
- 5: UN

### QA retrieval

- 1: Y (Yes)
- 2: N (No)

### Preprint

- 1: Y (Yes)
- 2: N (No)

### Software paper

- 1: Y (Yes)
- 2: N (No)

### Confidence

- 1: Y (Yes)
- 2: N (No)

#### Nominal vs. binary data

The last four categories are effectively binary data. Both Fleiss' kappa and Krippendorff's alpha support nominal data, and binary data is simply the two-category case.

Let's put these mappings in a dict of dicts for programmatic access.

In [None]:
str_num_map = {
    'MAIN': {'PRO': 1, 'PUB': 2, 'MAN': 3, 'URL': 4, 'INS': 5, 'NAM': 6, 'NOT': 7},
    'QA': {'SC': 1, 'SN': 2, 'SF': 3, 'NA': 4, 'UN': 5},
    'QA_retrieval': {'Y': 1, 'N': 2},
    'preprint': {'Y': 1, 'N': 2},
    'software_paper': {'Y': 1, 'N': 2},
    'confidence': {'Y': 1, 'N': 2}
}

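Before we can replace the string values with numeric codes, some normalization is needed: the binary columns contain inconsistent spellings (`YES`, `Yes`, `NO`, `No`), which we normalize to `Y` and `N`.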

In [None]:
for df in [df_sb, df_nch, df_sd]:
    for col_name in binary_cols:
        df[col_name] = df[col_name].replace(['YES', 'Yes'],'Y')
        df[col_name] = df[col_name].replace(['NO', 'No'],'N')

Now we can replace the string categories with their numeric codes. Empty cells are mapped to 0.

In [None]:
df_sb = df_sb.replace(str_num_map)
df_nch = df_nch.replace(str_num_map)
df_sd = df_sd.replace(str_num_map)

df_sb = df_sb.replace('', 0)
df_nch = df_nch.replace('', 0)
df_sd = df_sd.replace('', 0)
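
The normalize-then-encode steps can also be sketched on a single toy column. This is an alternative formulation using `Series.map` rather than the notebook's `DataFrame.replace`; the `toy` dataframe and the variable names are hypothetical.

```python
import pandas as pd

code_map = {"Y": 1, "N": 2}
spelling_fixes = {"YES": "Y", "Yes": "Y", "NO": "N", "No": "N"}

# Made-up column with inconsistent spellings and one empty cell.
toy = pd.DataFrame({"preprint": ["Yes", "N", "", "NO", "Y"]})

# Step 1: normalize spellings to 'Y' / 'N'.
normalized = toy["preprint"].replace(spelling_fixes)

# Step 2: map to integer codes; anything not in code_map (here the empty
# string) becomes NaN and is then filled with 0.
codes = normalized.map(code_map).fillna(0).astype(int)

print(codes.tolist())  # [1, 2, 0, 2, 1]
```

Unlike `replace`, `map` turns every unmapped value into 0 in this sketch, including typos, so checking the set of distinct values beforehand is a sensible precaution.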

Create one numpy array per annotation category by stacking the values from all three dataframes, with one row per annotator. Here we do this for the `MAIN` category.

In [None]:
main_sb = df_sb['MAIN'].to_numpy()
main_nch = df_nch['MAIN'].to_numpy()
main_sd = df_sd['MAIN'].to_numpy()
main = np.array([main_sb, main_nch, main_sd])

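# Compute Fleiss' kappa
# aggregate_raters expects items in rows and raters in columns,
# while `main` has one row per rater, so transpose it first.
from statsmodels.stats import inter_rater as irr
dats, cats = irr.aggregate_raters(main.T)
print(dats)
print(cats)
print(f"Fleiss' kappa: {irr.fleiss_kappa(dats, method='fleiss')}")

# Compute Krippendorff's alpha
# krippendorff.alpha expects raters in rows, so `main` is passed as is.
import krippendorff as kd
print(f"Krippendorff's alpha: {kd.alpha(main, level_of_measurement='nominal')}")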
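
The two libraries expect different array orientations: statsmodels' `aggregate_raters` takes one row per item and one column per rater, whereas `krippendorff.alpha` takes one row per rater. The sketch below uses a hand-rolled stand-in for the count-table step (the helper `count_table` and the ratings are made up) to show how the orientation changes what is being counted.

```python
import numpy as np

# Made-up ratings: 3 raters (rows) x 4 items (columns).
ratings = np.array([
    [1, 2, 3, 1],
    [1, 2, 3, 2],
    [1, 2, 1, 2],
])

def count_table(item_by_rater):
    """Count, per row, how often each category occurs (illustrative helper)."""
    cats = np.unique(item_by_rater)
    table = np.stack([(item_by_rater == c).sum(axis=1) for c in cats], axis=1)
    return table, cats

# Correct orientation: transpose so that each row is one item.
table, cats = count_table(ratings.T)
print(table)              # 4 rows (items), each summing to 3 raters
print(table.sum(axis=1))

# Without the transpose, each row is a rater and the table describes
# "3 items rated by 4 raters", which is not what was annotated.
wrong, _ = count_table(ratings)
print(wrong.sum(axis=1))
```

Checking that every row of the count table sums to the number of annotators is a cheap way to catch a swapped orientation.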

### Compute IAA scores

In [None]:
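# Compute Fleiss' kappa
# aggregate_raters expects items in rows and raters in columns, hence the transpose
dats, cats = irr.aggregate_raters(main.T)
print(dats)
print(cats)
print(f"Fleiss' kappa: {irr.fleiss_kappa(dats, method='fleiss')}")

# Compute Krippendorff's alpha (expects raters in rows, so no transpose)
import krippendorff as kd
print(f"Krippendorff's alpha: {kd.alpha(main, level_of_measurement='nominal')}")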