## Bug Analysis

In this notebook, we analyze the impact of a minor bug in model-vs-human: the calculation of Error Consistency (EC) does not handle edge cases properly. If one observer gets a perfect score, the EC is automatically set to 1, which is not necessarily correct. We think it is better to return NaN in this case.

Here, we load two dataframes of model-vs-human data: one created with the bug, the other without it. We switched between the two versions by changing a single line in utils.py (returning 1 instead of NaN) and then running `bootstrap_models.py -n 1`.

We find that this only affects two models, `efficientnet_l2_noisy_student_475` and `transformer_L16_IN21K`, because no other model achieved a perfect score in any condition.
Below, we plot how these models would have been ranked under our approach (returning NaN). Because EC values are distributed very differently across experiments, excluding the NaN experiments from the mean has a very large impact.
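
To illustrate why dropping NaN entries can shift a model's mean so strongly, here is a minimal sketch on made-up data. The experiment names, condition names, column names, and scores are all hypothetical; only the two-step aggregation mirrors what this notebook does below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores for a single model; names and values are made up.
toy = pd.DataFrame(
    {
        "experiment": ["exp_a", "exp_a", "exp_b"],
        "condition": ["c1", "c2", "c1"],
        "model": ["toy_model"] * 3,
        "ec_with": [0.25, 0.25, 1.0],        # bug: perfect condition scored as 1
        "ec_without": [0.25, 0.25, np.nan],  # fix: perfect condition scored as NaN
    }
)

# Step 1: average over conditions within each experiment.
per_experiment = (
    toy.groupby(["experiment", "model"]).mean(numeric_only=True).reset_index()
)
# Step 2: average over experiments. pandas skips NaN here, so exp_b
# silently drops out of the "ec_without" mean.
toy_final = per_experiment.groupby("model").mean(numeric_only=True)
print(toy_final)
```

In this toy case, the `ec_with` mean is taken over both experiments, while the `ec_without` mean is taken over `exp_a` alone, so the two numbers are not averages over the same set of experiments.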

It would probably be better to z-transform all error-consistency scores before averaging, although this would likely have a large impact on the ranking as well.
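
A rough sketch of what such a z-transform could look like is given below. It assumes that scores are standardized within each experiment across models, using the sample standard deviation; both choices are assumptions, not something this notebook implements. The data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical per-experiment EC scores (already averaged over conditions).
scores = pd.DataFrame(
    {
        "experiment": ["exp_a"] * 3 + ["exp_b"] * 3,
        "model": ["m1", "m2", "m3"] * 2,
        "ec": [0.1, 0.2, 0.3, 0.5, 0.7, 0.9],
    }
)

# Assumption: z-scores are computed per experiment, across models.
grouped = scores.groupby("experiment")["ec"]
scores["ec_z"] = (scores["ec"] - grouped.transform("mean")) / grouped.transform("std")

# Average the standardized scores per model and rank, best first.
ranking = scores.groupby("model")["ec_z"].mean().sort_values(ascending=False)
print(ranking)
```

After standardizing, every experiment contributes scores on a comparable scale, so a missing experiment should distort a model's mean less than it does with raw kappa values.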

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
df_with = pd.read_parquet("data/bug_analysis_with_1.parquet", engine="pyarrow")
df_without = pd.read_parquet("data/bug_analysis_without_1.parquet", engine="pyarrow")

merged = pd.merge(
    left=df_with,
    right=df_without,
    suffixes=["_with", "_without"],
    on=["experiment", "condition", "bootstrap_id", "model"],
)

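# Aggregate in two steps: first average within each experiment, then average
# over experiments, so that every experiment carries equal weight.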
agg = (
    merged.groupby(["experiment", "model"], observed=True)
    .mean(numeric_only=True)
    .reset_index()
)
final = agg.groupby("model", observed=True).mean(numeric_only=True)

# displaying where we get differences
display(final[final["model-human-ec_with"] != final["model-human-ec_without"]])

In [None]:
import matplotlib.pyplot as plt

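fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.grid(axis="y")
# Sort models by their mean EC in the "_with" dataframe, highest first.
order = (
    final.groupby("model")["model-human-ec_with"]
    .mean()
    .sort_values(ascending=False)
    .index.tolist()
)
# Red: scores from the "_without" dataframe.
sns.stripplot(
    data=final, x="model", y="model-human-ec_without", color="red", order=order, ax=ax
)
# Blue: scores from the "_with" dataframe, drawn on top of the red points.
sns.stripplot(
    data=final, x="model", y="model-human-ec_with", color="blue", order=order, ax=ax
)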
sns.despine()
ax.set_xlabel("Model")
ax.set_ylabel("Error Consistency [kappa]")
ax.set_ylim(0, 0.3)
ax.tick_params(axis="x", labelrotation=90)