# Calculating inter-annotator agreement

This notebook computes inter-annotator agreement on a 50-item subsample of the 1000-item random sample.

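First, we load the data into a usable form by reading the four annotation CSVs into pandas DataFrames. Passing `keep_default_na=False` stops pandas from parsing the QA label `NA` as a missing value; only `_` is treated as missing.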

In [1]:
# Build matrices
import pandas as pd
import numpy as np

df_sb = pd.read_csv('iaa/IAA_SB.csv', keep_default_na=False, na_values=['_'])
df_nch = pd.read_csv('iaa/IAA_NCH.csv', keep_default_na=False, na_values=['_'])
df_sd = pd.read_csv('iaa/IAA_SD.csv', keep_default_na=False, na_values=['_'])
df_ok = pd.read_csv('iaa/IAA_OK.csv', keep_default_na=False, na_values=['_'])

The sample has been manually pre-cleaned: all empty cells in `preprint` and `software_paper` have been set to `N`, and all variants of "Yes/No" answers have been normalized to `Y` and `N`.
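The cleaning described above was done by hand. As a rough sketch of how the same normalization could be scripted, the helper below maps raw Yes/No cells to `Y` or `N`. The function name and the accepted spellings are assumptions for illustration, since the notebook does not list the variants that actually occurred, and treating an empty cell as `N` only matches the stated rule for `preprint` and `software_paper`.

```python
# Hypothetical spellings; the notebook does not say which variants were cleaned by hand.
YES_VARIANTS = {"y", "yes"}
NO_VARIANTS = {"n", "no", ""}  # empty cells count as "N", mirroring the manual cleaning


def normalize_yes_no(value):
    """Map a raw Yes/No cell to 'Y' or 'N'; raise on anything unrecognized."""
    key = str(value).strip().lower()
    if key in YES_VARIANTS:
        return "Y"
    if key in NO_VARIANTS:
        return "N"
    raise ValueError(f"Unrecognized Yes/No value: {value!r}")


raw = ["Yes", "no", "", " y ", "N"]
print([normalize_yes_no(v) for v in raw])  # ['Y', 'N', 'N', 'Y', 'N']
```

Raising on unknown values, rather than guessing, keeps unexpected spellings visible so they can be resolved manually.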

In [2]:
# Drop the line with a duplicate MAIN annotation
df_sb = df_sb.drop(df_sb.index[12])
df_nch = df_nch.drop(df_nch.index[12])
df_sd = df_sd.drop(df_sd.index[12])
df_ok = df_ok.drop(df_ok.index[12])

To calculate inter-annotator agreement, we need to transform the categorical data into numerical data.
Let's look at the annotation categories and the integers they map to.

We are working on **nominal scales**, so the actual int values have no mathematical meaning.
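To illustrate why the integer values carry no meaning on a nominal scale, the sketch below computes raw pairwise agreement (a simpler measure than Krippendorff's alpha, used here only for illustration) on toy ratings, then renames the codes one-to-one and computes it again. The data and the function name are made up for this example.

```python
from itertools import combinations


def pairwise_agreement(ratings):
    """Fraction of rater pairs, over all items, that chose the same code.

    `ratings` holds one sequence per rater, all of the same length.
    """
    agree = total = 0
    for item in zip(*ratings):
        for a, b in combinations(item, 2):
            agree += a == b
            total += 1
    return agree / total


# Toy data: three raters, four items (not taken from the IAA sample).
ratings = [
    [1, 2, 2, 6],
    [1, 2, 6, 6],
    [1, 4, 6, 6],
]
# Arbitrary one-to-one renaming of the codes.
relabel = {1: 70, 2: 10, 4: 30, 6: 50}
renamed = [[relabel[c] for c in row] for row in ratings]

print(pairwise_agreement(ratings) == pairwise_agreement(renamed))  # True
```

Only equality between codes is ever tested, so any one-to-one renaming leaves the result unchanged.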

### MAIN category

- 1: PRO
- 2: PUB
- 3: MAN
- 4: URL
- 5: INS
- 6: NAM
- 7: NOT

### QA Layer

- 1: SC
- 2: SN
- 3: SF
- 4: NA
- 5: UN

### QA retrieval

- 1: Y (Yes)
- 2: N (No)

### Preprint

- 1: Y (Yes)
- 2: N (No)

### Software paper

- 1: Y (Yes)
- 2: N (No)

### Confidence

- 1: Y (Yes)
- 2: N (No)

Let's put these in a dict of dicts for programmatic access.

#### Nominal vs. binary data

The last four categories are effectively binary. Both Fleiss' kappa and Krippendorff's alpha support nominal and binary data.

In [3]:
str_num_map = {
    'MAIN': {'PRO': 1, 'PUB': 2, 'MAN': 3, 'URL': 4, 'INS': 5, 'NAM': 6, 'NOT': 7},
    'QA': {'SC': 1, 'SN': 2, 'SF': 3, 'NA': 4, 'UN': 5},
    'QA_retrieval': {'Y': 1, 'N': 2},
    'preprint': {'Y': 1, 'N': 2},
    'software_paper': {'Y': 1, 'N': 2},
    'confidence': {'Y': 1, 'N': 2}
}

Now we can replace the string categories with their numeric codes.

In [4]:
df_sb = df_sb.replace(str_num_map)
df_nch = df_nch.replace(str_num_map)
df_sd = df_sd.replace(str_num_map)
df_ok = df_ok.replace(str_num_map)
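`DataFrame.replace` leaves any value it cannot match untouched, so a misspelled label would survive as a string. A stricter encoding step could look like the sketch below, which works on any iterable of labels (including a DataFrame column). `encode_column` is a hypothetical helper, not part of the notebook, and missing values would need to be filtered out or handled separately before calling it.

```python
def encode_column(labels, mapping):
    """Translate string labels to integer codes, failing loudly on unknown labels."""
    unknown = sorted({label for label in labels if label not in mapping}, key=repr)
    if unknown:
        raise ValueError(f"Unmapped labels: {unknown}")
    return [mapping[label] for label in labels]


# Same codes as the 'QA' entry of the mapping used in the notebook.
qa_map = {'SC': 1, 'SN': 2, 'SF': 3, 'NA': 4, 'UN': 5}
print(encode_column(['SN', 'NA', 'NA', 'SC'], qa_map))  # [2, 4, 4, 1]
```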

For each annotation category, stack the four annotators' columns into one NumPy array with one row per annotator and one column per item.

In [5]:
main_sb = df_sb['MAIN'].to_numpy()
main_nch = df_nch['MAIN'].to_numpy()
main_sd = df_sd['MAIN'].to_numpy()
main_ok = df_ok['MAIN'].to_numpy()
main = np.array([main_sb, main_nch, main_sd, main_ok])
print(main)

qa_sb = df_sb['QA'].to_numpy()
qa_nch = df_nch['QA'].to_numpy()
qa_sd = df_sd['QA'].to_numpy()
qa_ok = df_ok['QA'].to_numpy()
qa = np.array([qa_sb, qa_nch, qa_sd, qa_ok])
print(qa)

qa_ret_sb = df_sb['QA_retrieval'].to_numpy()
qa_ret_nch = df_nch['QA_retrieval'].to_numpy()
qa_ret_sd = df_sd['QA_retrieval'].to_numpy()
qa_ret_ok = df_ok['QA_retrieval'].to_numpy()
qa_ret = np.array([qa_ret_sb, qa_ret_nch, qa_ret_sd, qa_ret_ok])
print(qa_ret)

preprint_sb = df_sb['preprint'].to_numpy()
preprint_nch = df_nch['preprint'].to_numpy()
preprint_sd = df_sd['preprint'].to_numpy()
preprint_ok = df_ok['preprint'].to_numpy()
preprint = np.array([preprint_sb, preprint_nch, preprint_sd, preprint_ok])
print(preprint)

swpap_sb = df_sb['software_paper'].to_numpy()
swpap_nch = df_nch['software_paper'].to_numpy()
swpap_sd = df_sd['software_paper'].to_numpy()
swpap_ok = df_ok['software_paper'].to_numpy()
swpap = np.array([swpap_sb, swpap_nch, swpap_sd, swpap_ok])
print(swpap)

[[6 6 6 6 6 6 4 6 6 2 6 4 2 6 6 6 6 5 6 6 6 6 6 6 4 2 2 6 6 4 6 1 2 6 6 6
  6 1 6 2 6 2 6 6 2 6 6 2 2]
 [2 6 2 5 6 6 4 5 6 2 6 4 2 6 2 6 6 5 6 6 6 6 6 6 4 2 2 2 2 6 6 4 2 6 6 2
  2 1 6 2 6 2 6 6 2 6 6 2 2]
 [6 2 2 6 2 6 4 5 6 1 6 4 2 6 2 5 4 5 6 6 6 4 6 6 4 1 2 2 6 4 6 1 2 6 6 2
  2 1 6 2 6 6 6 6 2 6 6 2 2]
 [6 2 2 5 6 6 4 5 6 1 6 4 2 6 2 5 4 5 6 6 6 4 4 6 4 1 1 7 2 6 6 1 1 6 6 1
  2 6 6 6 6 2 6 6 7 6 6 2 2]]
[[2 4 4 4 4 4 1 4 3 2 4 1 2 4 4 2 4 2 4 4 4 2 4 4 2 2 2 4 5 4 4 1 2 4 2 2
  2 4 5 4 4 2 2 4 2 2 4 2 2]
 [2 4 4 4 4 4 1 4 3 2 4 2 2 4 4 2 4 2 4 4 4 1 4 4 2 2 2 2 2 4 4 2 2 4 2 2
  2 4 4 4 4 2 4 4 2 3 4 2 2]
 [2 4 4 4 4 4 2 4 3 2 4 2 2 4 4 2 2 2 4 4 4 1 2 4 2 2 2 4 4 4 4 2 2 4 2 2
  2 4 4 4 4 2 2 4 2 3 4 2 2]
 [2 4 4 4 4 4 2 4 4 2 4 1 2 4 4 2 2 2 4 4 4 1 2 4 2 2 2 2 2 4 4 2 2 4 2 2
  2 4 4 2 4 2 2 4 2 2 4 2 2]]
[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
  1 1 1 1 2 1 1 1 1 1 1 1 2]
 [1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 2 1 1 1 1 1 2 1 1 1 2 1 1 1 2

### Compute IAA scores

In [6]:
########### MAIN
import krippendorff as kd
main_alpha1 = kd.alpha(main, level_of_measurement='nominal')
print(f"MAIN - Krippendorff's alpha: {main_alpha1}")

########### QA
# Compute Krippendorff's alpha
qa_alpha1 = kd.alpha(qa, level_of_measurement='nominal')
print(f"QA - Krippendorff's alpha: {qa_alpha1}")

########### QA_retrieval
# Compute Krippendorff's alpha
qa_ret_alpha1 = kd.alpha(qa_ret, level_of_measurement='nominal')
print(f"QA_retrieval - Krippendorff's alpha: {qa_ret_alpha1}")

########### preprint
# Compute Krippendorff's alpha
preprint_alpha1 = kd.alpha(preprint, level_of_measurement='nominal')
print(f"preprint - Krippendorff's alpha: {preprint_alpha1}")

########### software_paper
# Compute Krippendorff's alpha
swpap_alpha1 = kd.alpha(swpap, level_of_measurement='nominal')
print(f"software_paper - Krippendorff's alpha: {swpap_alpha1}")

MAIN - Krippendorff's alpha: 0.5464277218301482
QA - Krippendorff's alpha: 0.721030042918455
QA_retrieval - Krippendorff's alpha: 0.6532012195121951
preprint - Krippendorff's alpha: 0.8023391812865497
software_paper - Krippendorff's alpha: 0.4921875
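For readers who want to see what these scores measure, here is a minimal pure-Python sketch of nominal Krippendorff's alpha in its standard coincidence-matrix form, $\alpha = 1 - D_o / D_e$. It is not the `krippendorff` package's implementation and has not been checked against it on this data. The function name, the use of `None` for missing values, and the toy input are assumptions for illustration.

```python
from collections import Counter
from itertools import permutations


def nominal_alpha(ratings):
    """Krippendorff's alpha for nominal data.

    `ratings` holds one sequence per rater; None marks a missing value.
    Raises ZeroDivisionError if only one distinct value occurs overall.
    """
    coincidences = Counter()  # (c, k) -> weighted count of ordered value pairs
    for item in zip(*ratings):
        values = [v for v in item if v is not None]
        if len(values) < 2:
            continue  # an item with fewer than two ratings yields no pairs
        for c, k in permutations(values, 2):
            coincidences[(c, k)] += 1 / (len(values) - 1)

    totals = Counter()  # n_c: marginal total of each value
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    # Observed and expected disagreement, both scaled by n so the factor cancels.
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k]
                   for c in totals for k in totals if c != k) / (n - 1)
    return 1 - observed / expected


# Toy data: two raters, four items, one disagreement.
toy = [[1, 1, 2, 2],
       [1, 1, 2, 1]]
print(round(nominal_alpha(toy), 4))  # 0.5333
```

In the toy example the raters disagree on one item out of four, so observed disagreement is 0.25, against an expected disagreement of 30/56, giving $\alpha = 1 - 0.25 \cdot 56/30 = 16/30$.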


Calculate the average Krippendorff's alpha across the five categories scored above.

In [7]:
avg = (main_alpha1 + qa_alpha1 + qa_ret_alpha1 + preprint_alpha1 + swpap_alpha1) / 5
print(f'Average alpha: {avg}.')

Average alpha: 0.6430371331094695.
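Fleiss' kappa, named earlier as the other applicable measure, is not computed in this notebook. A minimal sketch follows, assuming every item was rated by the same number of annotators and there are no missing values. The function name and the toy ratings are illustrative only.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings` holds one equally long sequence per rater."""
    items = list(zip(*ratings))
    n_items, n_raters = len(items), len(ratings)

    category_totals = Counter()  # how often each category was assigned overall
    per_item_agreement = []
    for item in items:
        counts = Counter(item)
        category_totals.update(counts)
        squares = sum(c * c for c in counts.values())
        # Share of agreeing rater pairs for this item.
        per_item_agreement.append(
            (squares - n_raters) / (n_raters * (n_raters - 1)))

    p_bar = sum(per_item_agreement) / n_items  # mean observed agreement
    p_e = sum((t / (n_items * n_raters)) ** 2
              for t in category_totals.values())  # chance agreement
    return (p_bar - p_e) / (1 - p_e)


# Toy data: three raters, four items (not taken from the IAA sample).
toy_ratings = [
    [1, 2, 2, 6],
    [1, 2, 6, 6],
    [1, 4, 6, 6],
]
print(round(fleiss_kappa(toy_ratings), 2))  # 0.52
```

Unlike Krippendorff's alpha, this formulation has no built-in handling of missing ratings, which is one reason alpha is the more convenient choice when annotations are incomplete.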
