# Second Review Analysis

In this notebook we calculate agreement and variation metrics between the first and second review of each paper.

In [14]:
# Load the data
import pandas as pd
reviews = pd.read_csv("reviews.csv")
# Drop theoretical papers
reviews = reviews[reviews["theoretical"] == False]
second_reviews = pd.read_csv("2nd_reviews.csv")

print(f"Number of papers with second review: {len(second_reviews)}")
print(f"Percentage of papers with second review: {(len(second_reviews) / len(reviews)) * 100}")

# Filter the reviews on those having a second
reviews = reviews[reviews["title"].isin(second_reviews["title"])]

Number of papers with second review: 46
Percentage of papers with second review: 5.010893246187364


### Intra-class Correlation Coefficient

Here we calculate the intraclass correlation coefficient (ICC) to determine the reliability of each dimension. We use ICC3 because a fixed set of k=2 raters reviews every target.
Specifically, we report ICC3k, as we are interested in the reliability of the ratings averaged over both raters rather than of an individual rater's ratings:

"ICC3: A fixed set of raters rate each target. There is no generalization to a larger population of raters.
ICC2 and ICC3 remove mean differences between raters, but are sensitive to interactions.
The difference between ICC2 and ICC3 is whether raters are seen as fixed or random effects."
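To make the distinction between the single-rater and average-rater variants concrete, the following is a minimal sketch of how ICC3 and ICC3k follow from the mean squares of a two-way ANOVA without replication. The function name `icc3` and the toy ratings are made up for illustration; the notebook itself relies on `pingouin.intraclass_corr`.

```python
import numpy as np

def icc3(ratings: np.ndarray) -> tuple[float, float]:
    """Return (ICC3, ICC3k) for a complete (n_targets, k_raters) rating matrix."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    # Sums of squares for targets (rows), raters (columns) and the residual
    ss_targets = k * ((ratings.mean(axis=1) - grand_mean) ** 2).sum()
    ss_raters = n * ((ratings.mean(axis=0) - grand_mean) ** 2).sum()
    ss_total = ((ratings - grand_mean) ** 2).sum()
    ss_error = ss_total - ss_targets - ss_raters
    ms_targets = ss_targets / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    # Single-rater and average-rater consistency
    single = (ms_targets - ms_error) / (ms_targets + (k - 1) * ms_error)
    average = (ms_targets - ms_error) / ms_targets
    return single, average

# Hypothetical ratings: 5 papers (rows) scored by 2 reviewers (columns)
toy = np.array([[1, 2], [3, 3], [5, 4], [2, 2], [4, 5]], dtype=float)
single, average = icc3(toy)
print(f"ICC3  = {single:.4f}")
print(f"ICC3k = {average:.4f}")

# A constant offset for one reviewer leaves both values unchanged,
# which is what "remove mean differences between raters" refers to
shifted = toy + np.array([0.0, 3.0])
print(icc3(shifted))
```

With n = 46 papers and k = 2 reviewers, both degrees of freedom equal 45, which matches the `df1` and `df2` columns in the output further down.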

In [15]:
# Prepare the data
from pingouin import intraclass_corr
# Categories
categories = ["Implementation", "Data", "Configuration", "Experimental Procedure", "Expertise"]
# Calculate the ordinal interrater agreement
source_labels = []
target_labels = []
for title in reviews["title"]:
    rev = reviews[reviews["title"] == title]
    sec_rev = second_reviews[second_reviews["title"] == title]
    source_labels.append(rev[categories].to_numpy()[0])
    target_labels.append(sec_rev[categories].to_numpy()[0])


In [16]:
# Now we evaluate it per dimension across papers

df_format = []
i = 0
for rev1, rev2 in zip(source_labels, target_labels):
    for value, cat in zip(rev1, categories):
        df_format.append([1, i, cat, value])
    for value, cat in zip(rev2, categories):
        df_format.append([2, i, cat, value])
    i += 1

# We evaluate each dimension separately; every paper has one rating per reviewer in each category
df = pd.DataFrame(df_format, columns=["Reviewer", "Paper", "Category", "Cost"])
icc_rows = []
for c in categories:
    icc = intraclass_corr(data=df[df["Category"] == c], targets='Paper', raters='Reviewer',
                          ratings='Cost')
    # Keep only the ICC3k row; copy so that adding a column does not write to a slice
    icc = icc[icc["Type"] == "ICC3k"].copy()
    icc["Category"] = c
    icc_rows.append(icc)
total_icc = pd.concat(icc_rows)

total_icc.set_index("Category", inplace=True)
total_icc = total_icc.round(4)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print(total_icc)

# Compile the data for a latex table
def koo_labels(icc_value: float):
    """Determine the ICC label based on Koo and Li (2016)."""
    assert icc_value >= 0.0 and icc_value <= 1.0
    if icc_value < 0.5:
        return "Poor"
    if icc_value < 0.75:
        return "Moderate"
    if icc_value < 0.9:
        return "Good"
    return "Excellent"

values = []
for cat in total_icc.index:
    # Assumes reviews and second_reviews list the papers in the same row order
    distance = abs(reviews[cat].to_numpy() - second_reviews[cat].to_numpy()).mean()
    ci = total_icc["CI95%"][cat]
    icc_val = total_icc["ICC"][cat]
    label = koo_labels(icc_val)
    # Single quotes inside the f-string: nested double quotes are a syntax error before Python 3.12
    values.append([f"{distance:.4f}", f"{icc_val}", f"[{ci[0]}, {ci[1]}]", f"{total_icc['pval'][cat]}", label])

import latex

ltable = latex.latex_table(["MAD", "ICC", r"95\% CI", "P-Value", "Label"], list(total_icc.index), values,
                           column_char="r", label="tab:second_review_icc",
                           caption=r"The intraclass correlation coefficient (ICC) over the first and second review with 95\% confidence interval, P-Value and their interpreted labels based on \citet{koo2016guideline}.")
latex.compile_latex(ltable, "tables/icc_2nd_review.tex")



                         Type           Description     ICC        F  df1  df2    pval         CI95%
Category                                                                                            
Implementation          ICC3k  Average fixed raters  0.9727  36.6044   45   45  0.0000  [0.95, 0.98]
Data                    ICC3k  Average fixed raters  0.9111  11.2496   45   45  0.0000  [0.84, 0.95]
Configuration           ICC3k  Average fixed raters  0.9348  15.3443   45   45  0.0000  [0.88, 0.96]
Experimental Procedure  ICC3k  Average fixed raters  0.8072   5.1858   45   45  0.0000  [0.65, 0.89]
Expertise               ICC3k  Average fixed raters  0.5445   2.1953   45   45  0.0048  [0.18, 0.75]
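As a sanity check on the table above, ICC3k can be recovered from the F statistic: since ICC3k = (MS_targets - MS_error) / MS_targets and F = MS_targets / MS_error, it follows that ICC3k = 1 - 1/F. The sketch below uses the values copied from the printed output; because these are rounded to four decimals, agreement is only expected up to rounding. The variable names are illustrative.

```python
# (F, ICC) pairs copied from the printed ICC3k table
printed = {
    "Implementation": (36.6044, 0.9727),
    "Data": (11.2496, 0.9111),
    "Configuration": (15.3443, 0.9348),
    "Experimental Procedure": (5.1858, 0.8072),
    "Expertise": (2.1953, 0.5445),
}

recovered = {}
for category, (f_stat, icc_printed) in printed.items():
    # Average-rater consistency ICC expressed through the F statistic
    recovered[category] = 1.0 - 1.0 / f_stat
    print(f"{category}: printed={icc_printed}, recovered={recovered[category]:.4f}")
```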


Now that we have determined the ICC per dimension, we have an idea of the rater reliability in this experiment. It is also useful to know how far the two reviews differ on average in each dimension. For this we calculate the mean absolute distance (MAD) between the first and second review.

In [17]:
d_distances = []
for c in categories:
    distance = abs(reviews[c].to_numpy() - second_reviews[c].to_numpy()).mean()
    d_distances.append(distance)
    print(f"{c}:", round(distance, 4))

Implementation: 0.7609
Data: 0.7826
Configuration: 0.913
Experimental Procedure: 0.7826
Expertise: 2.0652
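The MAD computation above subtracts the two rating columns position by position, so it relies on both files listing the papers in the same order. Whether that holds for `reviews.csv` and `2nd_reviews.csv` is not shown here. A minimal sketch of a title-based alignment, on made-up toy frames, that does not depend on row order:

```python
import pandas as pd

# Hypothetical toy data: same papers, different row order
first = pd.DataFrame({"title": ["A", "B", "C"], "Data": [1, 4, 2]})
second = pd.DataFrame({"title": ["C", "A", "B"], "Data": [2, 3, 4]})

# Pair the reviews by title; validate raises if a title occurs more than once
merged = first.merge(second, on="title", suffixes=("_first", "_second"),
                     validate="one_to_one")
mad_aligned = (merged["Data_first"] - merged["Data_second"]).abs().mean()

# Position-based difference, for comparison
mad_positional = abs(first["Data"].to_numpy() - second["Data"].to_numpy()).mean()

print("Aligned by title:", round(mad_aligned, 4))
print("By row position: ", round(mad_positional, 4))
```

If the two files are already in the same order, both approaches give the same result.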


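We see that Expertise is an outlier: its MAD is more than twice that of any other dimension. To express this in an easily interpretable way, we calculate the ratio of the Expertise MAD to the MAD of each other dimension, as well as the average of these ratios.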

In [18]:
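# Ratio of the Expertise MAD (last entry) to the MAD of each other dimension
expertise_mad = d_distances[-1]
factors = [expertise_mad / v for v in d_distances[:-1]]
for factor in factors:
    print(factor)
print("Average factor:", sum(factors) / len(factors))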

2.714285714285714
2.6388888888888884
2.261904761904762
2.6388888888888884
Average factor: 2.5634920634920633
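The average factor above is a mean of ratios. A closely related summary, not used in this notebook, divides the Expertise MAD by the mean MAD of the other dimensions; the two generally differ slightly. The sketch below compares them using the MAD values copied from the printed output (rounded to four decimals); the variable names are illustrative.

```python
import numpy as np

# MAD values copied from the printed output (rounded)
mads = {
    "Implementation": 0.7609,
    "Data": 0.7826,
    "Configuration": 0.913,
    "Experimental Procedure": 0.7826,
    "Expertise": 2.0652,
}
others = np.array([v for name, v in mads.items() if name != "Expertise"])

# Mean of the per-dimension ratios (the quantity reported above)
mean_of_ratios = (mads["Expertise"] / others).mean()
# Ratio to the mean MAD of the other dimensions (alternative summary)
ratio_of_means = mads["Expertise"] / others.mean()

print("Mean of ratios:", round(mean_of_ratios, 4))
print("Ratio of means:", round(ratio_of_means, 4))
```

Either way, the two reviewers differ on Expertise by roughly 2.5 times as much as on the other dimensions.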
