In [1]:
%pip install -U pip >/dev/null 2>/dev/null
%pip install -r requirements.txt >/dev/null 2>/dev/null

Note: you may need to restart the kernel to use updated packages.


Import `pandas` and some helper functions.

In [2]:
import pandas as pd

from itertools import combinations
from krippendorff import alpha

Load the raw point ratings from the TSV file (you can download the latest from the [Google Sheet](https://docs.google.com/spreadsheets/d/1vc_LhpHgfO3w-UpSyGYtex0r6f1CrAdHQZFAHJRU_g4/edit?usp=sharing), but you may have to munge the columns/rows if anything has moved around).

In [3]:
raw_df = pd.read_table('EDH Point System - Sheet1.tsv')

We'll explicitly list the columns to extract from the raw `pd.DataFrame`:

In [4]:
raters = [
    'Joe',
    'Connor',
    'Ryan',
    'Artemiy',
    'RJ',
    'Albert',
    'Mike',
    'Zach'
]

We'll extract the specific columns/rows for the raters/point values and coerce the non-numeric data to `NaN` so we can ignore it:

In [5]:
header = slice(4, None) # skip first 4 rows in the DataFrame
df = raw_df[raters].loc[header].apply(pd.to_numeric, errors="coerce")
df

Unnamed: 0,Joe,Connor,Ryan,Artemiy,RJ,Albert,Mike,Zach
4,,10.0,,10.0,10.0,10.0,9.0,3.0
5,,10.0,,10.0,10.0,10.0,9.0,0.0
6,,10.0,,6.0,0.0,10.0,5.0,3.0
7,4.0,5.0,4.0,5.0,4.0,5.0,4.0,5.0
8,4.0,5.0,4.0,5.0,4.0,5.0,4.0,5.0
...,...,...,...,...,...,...,...,...
101,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0
102,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0
103,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
104,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0


Now we'll compute [Krippendorff's $\alpha$](https://en.wikipedia.org/wiki/Krippendorff%27s_alpha), a statistical measure of inter-rater reliability.  First we'll compute the $\alpha$ score over the whole dataset:

In [6]:
alpha(reliability_data=df.T, level_of_measurement='ordinal')

0.2933638414661267

$\sim0.3$ isn't great.  $\alpha$ scores range from $-1$ to $1$: a score of $1$ indicates perfect agreement, while a score of $0$ indicates the ratings agreed no better than random chance (as if raters were choosing ratings by rolling fair dice).  Scores below $\sim0.8$ are generally considered unreliable, i.e. the raters are not in high agreement.
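To make the scale concrete, here is a minimal pure-Python sketch of the definition $\alpha = 1 - D_o / D_e$, where $D_o$ is the observed disagreement and $D_e$ is the disagreement expected by chance. This is not the `krippendorff` package's implementation: `interval_alpha` is a hypothetical helper, and it assumes the simpler interval (squared-difference) metric rather than the ordinal metric used elsewhere in this notebook.

```python
from collections import defaultdict
from itertools import permutations


def interval_alpha(units):
    """Krippendorff's alpha with the interval (squared-difference) metric.

    units: an iterable of lists, one per rated item, holding the ratings
    that item received (missing ratings are simply left out).
    Assumes the ratings are not all identical (otherwise D_e is zero).
    """
    # Coincidence matrix: every ordered pair of ratings within a unit
    # contributes 1 / (m - 1), where m is that unit's number of ratings.
    coincidences = defaultdict(float)
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit rated only once carries no agreement information
        for c, k in permutations(ratings, 2):
            coincidences[(c, k)] += 1.0 / (m - 1)

    # Marginal totals n_c for each rating value, and the grand total n.
    totals = defaultdict(float)
    for (c, _), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    # Observed disagreement D_o and chance-expected disagreement D_e.
    observed = sum(w * (c - k) ** 2 for (c, k), w in coincidences.items()) / n
    expected = sum(
        totals[c] * totals[k] * (c - k) ** 2 for c in totals for k in totals
    ) / (n * (n - 1))
    return 1.0 - observed / expected


# Two raters who always agree, and two raters who always disagree.
perfect = interval_alpha([[1, 1], [2, 2], [3, 3]])  # 1.0
opposed = interval_alpha([[0, 1], [1, 0]])          # about -0.5
print(perfect, opposed)
```

The second toy case shows how a negative score arises: the two raters disagree on every item, which is more disagreement than chance alone would produce.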

Next we'll compute pairwise $\alpha$ scores to see which pairs of raters agreed best and worst:

In [7]:
def pairwise_alphas(df, raters=raters):
    """A function to compute pair-wise alpha scores for each pair of raters
    
    df: a pd.DataFrame whose columns correspond to the raters
    raters: an iterable of raters that correspond to columns in df
    """
    pairs = [list(pair) for pair in combinations(raters, 2)]
    alphas = {"rater1": [], "rater2": [], "alpha": []}
    for pair in pairs:
        rater1, rater2 = pair
        alphas["rater1"].append(rater1)
        alphas["rater2"].append(rater2)
        alphas["alpha"].append(alpha(reliability_data=df[pair].T, level_of_measurement='ordinal'))
    return pd.DataFrame(alphas)

In [8]:
pairwise_alphas(df).sort_values(by='alpha', ascending=False)

Unnamed: 0,rater1,rater2,alpha
8,Connor,Artemiy,0.667724
22,RJ,Albert,0.479231
2,Joe,Artemiy,0.468071
26,Albert,Zach,0.444865
20,Artemiy,Mike,0.440836
15,Ryan,Albert,0.421964
14,Ryan,RJ,0.387429
25,Albert,Mike,0.383111
13,Ryan,Artemiy,0.374988
16,Ryan,Mike,0.373422


As we can see, Connor & Artemiy agreed best ($\alpha \approx 0.67$), though even that pair falls short of the $\sim0.8$ threshold for high agreement, while Connor and Zach agreed worst.  (In fact, Connor and Zach had a negative score, indicating they _disagreed_ slightly more than chance alone would predict!)