# Hertziana IconClass Analytics: Expert study

This notebook performs quantitative analytics on a user study conducted with art history scholars.

The study used a *focused* iconographic corpus of 24 hand-picked images from the **Bibliotheca Hertziana Photographic Collection**, specifically those carrying Iconclass tags for human sacrifices in the Trojan War (Iphigenia, Polyxena) and for the cycle of Helen and Paris. Twelve of these images also appear in the larger corpus used in the non-expert study.

The outcomes were conducive to a later qualitative investigation with the scholars themselves.

In [1]:
%pip install --quiet pandas pingouin

Note: you may need to restart the kernel to use updated packages.


## Step 1: Prepare the data

Because the image corpus is very focused, we moved away from the previous category-based analysis, which treated the 24 **SenticNet** emotions only as distinct labels. Based on the updated Hourglass model described in [this paper](https://sentic.net/hourglass-model-revisited.pdf), we score each label on exactly one of four dimensions, from -1 to 1. Each term is considered non-neutral with respect to its dimension, so it never receives a zero there.

In [2]:
coordinates = ( 'attitude', 'introspection', 'sensitivity', 'temper' )
hourglass = {
    'enthusiasm'    : { 'sensitivity' : 1 },
    'eagerness'     : { 'sensitivity' : 0.66 },
    'responsiveness': { 'sensitivity' : 0.33 },
    'anxiety'       : { 'sensitivity' : -0.33 },
    'fear'          : { 'sensitivity' : -0.66 },
    'terror'        : { 'sensitivity' : -1 },
    'bliss'         : { 'temper' : 1 },
    'calmness'      : { 'temper' : 0.66 },
    'serenity'      : { 'temper' : 0.33 },
    'annoyance'     : { 'temper' : -0.33 },
    'anger'         : { 'temper' : -0.66 },
    'rage'          : { 'temper' : -1 },
    'delight'       : { 'attitude' : 1 },
    'pleasantness'  : { 'attitude' : 0.66 },
    'acceptance'    : { 'attitude' : 0.33 },
    'dislike'       : { 'attitude' : -0.33 },
    'disgust'       : { 'attitude' : -0.66 },
    'loathing'      : { 'attitude' : -1 },
    'ecstasy'       : { 'introspection' : 1 },
    'joy'           : { 'introspection' : 0.66 },
    'contentment'   : { 'introspection' : 0.33 },
    'melancholy'    : { 'introspection' : -0.33 },
    'sadness'       : { 'introspection' : -0.66 },
    'grief'         : { 'introspection' : -1 }
}
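The constraints stated above (one dimension per label, a non-zero score between -1 and 1) can be checked mechanically. The following is a minimal sketch, not part of the original notebook: `check_hourglass` and `hourglass_excerpt` are hypothetical names, and the excerpt repeats only a few entries of the table so that the block runs on its own.

```python
# Excerpt of the notebook's table, repeated here so the sketch is self-contained.
coordinates = ('attitude', 'introspection', 'sensitivity', 'temper')
hourglass_excerpt = {
    'enthusiasm': {'sensitivity': 1},
    'fear':       {'sensitivity': -0.66},
    'bliss':      {'temper': 1},
    'anger':      {'temper': -0.66},
    'delight':    {'attitude': 1},
    'disgust':    {'attitude': -0.66},
    'joy':        {'introspection': 0.66},
    'grief':      {'introspection': -1},
}

def check_hourglass(table, dims):
    """Return a list of problems found in a label -> {dimension: score} table."""
    problems = []
    for label, scores in table.items():
        # Each label must be scored on exactly one dimension.
        if len(scores) != 1:
            problems.append(f"{label}: expected exactly one dimension, got {len(scores)}")
        for dim, score in scores.items():
            if dim not in dims:
                problems.append(f"{label}: unknown dimension {dim!r}")
            # Scores are non-neutral and bounded by -1 and 1.
            if score == 0 or not -1 <= score <= 1:
                problems.append(f"{label}: score {score} is zero or outside [-1, 1]")
    return problems

problems = check_hourglass(hourglass_excerpt, coordinates)
```

In the notebook itself the same function could be called as `check_hourglass(hourglass, coordinates)`; an empty list means the table satisfies the constraints.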

## Step 2: Load and convert expert ratings

Load the (anonymized) JSON data file exported from the LabelStudio project. For each image annotation, compute one rating per dimension by summing the scores of all emotions annotated along that dimension, counting each emotion as many times as it appears. This way, the number of occurrences of emotions along the same dimension influences the rating.


In [3]:
import json
with open('experts.json') as f:
    annotations = json.load(f)

In [4]:
import pandas as pd
import statistics  # only needed for the commented-out median alternative

def calculate_rating(array):
    """
    Collapse the scores collected for one dimension into a single rating.
    Several measures are possible; we use the sum.
    """
    return sum(array)  # Linear combination (plain sum)
    # return sum(array) / len(array)  # Average
    # return statistics.median(array)  # Median

iconclasses = {}  # Track the Iconclass codes for each image
raters = set()    # The IDs of the annotators
ratings_dict = {} # Ratings organized by image, then by rater

for task in annotations:
    for ann in task['annotations']:
        item = f"{task['id']}"
        iconclasses[item] = task["iconclass"]
        rater = f"{ann['completed_by']}"
        raters.add(rater)
        for r in ann['result']:
            label = r['value']['polygonlabels'][0]
            rating = ratings_dict.setdefault(item, {}).setdefault(rater, {})
            if label in hourglass:
                for coord in hourglass[label]:
                    rating.setdefault(coord, []).append(hourglass[label][coord])
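To make the summing rule concrete, here is a small worked example on a made-up annotation. It is a sketch under stated assumptions: `rate_labels` is a hypothetical helper that folds the collection and summing steps into one function, and the three labels are invented for illustration, not taken from the expert data.

```python
from collections import defaultdict

# A few entries copied from the Hourglass table above.
hourglass_excerpt = {
    'sadness': {'introspection': -0.66},
    'grief':   {'introspection': -1},
    'fear':    {'sensitivity': -0.66},
}

def rate_labels(labels, table):
    """Sum the scores of the given labels, grouped by Hourglass dimension."""
    totals = defaultdict(float)
    for label in labels:
        # Labels that are not in the table are skipped, as in the loop above.
        for dim, score in table.get(label, {}).items():
            totals[dim] += score
    return dict(totals)

# Hypothetical annotation: one rater marked three regions on one image.
example = rate_labels(['sadness', 'grief', 'fear'], hourglass_excerpt)
# Round for display, since sums of floats may carry tiny representation errors.
rounded = {dim: round(total, 2) for dim, total in example.items()}
```

Here the two introspection labels add up to about -1.66, while the single sensitivity label contributes -0.66. Dimensions that never occur are simply absent at this stage.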

### Step 2.1: Partition the ratings

Rater agreement calculations require that the ratings be presented as lists of `{ item, rater, rating }` objects. We create one such list for each category (Iconclass group) that we are interested in.

In [5]:
r_iphigenia = []
r_polyxena = []
r_helenandparis = []

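for idd, item in ratings_dict.items():
    # Route the image to the list of its Iconclass group
    if "94D132" in iconclasses[idd]: r_secondary = r_iphigenia
    elif "94H243" in iconclasses[idd]: r_secondary = r_polyxena
    else: r_secondary = r_helenandparis
    for rater in sorted(raters):  # sorted for a reproducible row order
        # A rater with no recognized labels for this image gets an empty appraisal
        appraisal = item.get(rater, {})
        for coord in coordinates:
            # Dimensions the rater never used are rated 0.0
            rating = calculate_rating(appraisal[coord]) if coord in appraisal else 0.0
            o = { 'ID' : idd, 'Rater' : rater, coord: rating }
            r_secondary.append(o)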

r_iphigenia

[{'ID': '171682379', 'Rater': '64261', 'attitude': 0.0},
 {'ID': '171682379', 'Rater': '64261', 'introspection': 0.0},
 {'ID': '171682379', 'Rater': '64261', 'sensitivity': -0.6699999999999999},
 {'ID': '171682379', 'Rater': '64261', 'temper': 0.0},
 {'ID': '171682379', 'Rater': '70243', 'attitude': 0.0},
 {'ID': '171682379', 'Rater': '70243', 'introspection': -3},
 {'ID': '171682379', 'Rater': '70243', 'sensitivity': 0.0},
 {'ID': '171682379', 'Rater': '70243', 'temper': 0.0},
 {'ID': '171690536', 'Rater': '64261', 'attitude': 0.0},
 {'ID': '171690536', 'Rater': '64261', 'introspection': 0.0},
 {'ID': '171690536', 'Rater': '64261', 'sensitivity': -0.6699999999999999},
 {'ID': '171690536', 'Rater': '64261', 'temper': -1},
 {'ID': '171690536', 'Rater': '70243', 'attitude': 0.0},
 {'ID': '171690536', 'Rater': '70243', 'introspection': -2},
 {'ID': '171690536', 'Rater': '70243', 'sensitivity': -0.67},
 {'ID': '171690536', 'Rater': '70243', 'temper': 0.0},
 {'ID': '171691322', 'Rater': '64
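Each record above carries a single coordinate, so a DataFrame built from the list has one row per (ID, Rater, coordinate) and pandas fills the other three coordinate columns with NaN. If one row per (ID, Rater) pair is preferred, the records can be merged first. The sketch below is an optional alternative, not the notebook's method; `to_wide` is a hypothetical helper and the records are invented.

```python
def to_wide(records, dims):
    """Merge one-coordinate records into one row per (ID, Rater) pair."""
    wide = {}
    for rec in records:
        key = (rec['ID'], rec['Rater'])
        # Create the row on first sight of the pair, then add coordinates to it.
        row = wide.setdefault(key, {'ID': rec['ID'], 'Rater': rec['Rater']})
        for dim in dims:
            if dim in rec:
                row[dim] = rec[dim]
    return list(wide.values())

coordinates = ('attitude', 'introspection', 'sensitivity', 'temper')
# Made-up records in the same shape as the output above.
records = [
    {'ID': 'img1', 'Rater': 'A', 'attitude': 0.0},
    {'ID': 'img1', 'Rater': 'A', 'introspection': -1.0},
    {'ID': 'img1', 'Rater': 'A', 'sensitivity': -0.66},
    {'ID': 'img1', 'Rater': 'A', 'temper': 0.0},
    {'ID': 'img1', 'Rater': 'B', 'attitude': 0.33},
]
rows = to_wide(records, coordinates)
```

Rows come out in order of first appearance, because Python dictionaries preserve insertion order.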

## Step 3: Compute rater agreement

Because this time we have numeric data that are guaranteed to be given for every item and rater, instead of the sparse categorical data we had before, we do not use category-based rater reliability measures. Instead, we use the Intraclass Correlation Coefficient (ICC), which is implemented in Python by the `pingouin` package.

In [6]:
import pingouin as pg

### Sacrifices

In [7]:
df = pd.concat([pd.DataFrame(r_iphigenia), pd.DataFrame(r_polyxena)])
icc = pg.intraclass_corr(data=df, targets='ID', raters='Rater', ratings='sensitivity').round(3)
icc

| | Type | Description | ICC | F | df1 | df2 | pval | CI95% |
|---|---|---|---|---|---|---|---|---|
| 0 | ICC1 | Single raters absolute | -0.056 | 0.894 | 8 | 9 | 0.557 | [-0.64, 0.59] |
| 1 | ICC2 | Single random raters | 0.157 | 1.715 | 8 | 8 | 0.231 | [-0.19, 0.63] |
| 2 | ICC3 | Single fixed raters | 0.263 | 1.715 | 8 | 8 | 0.231 | [-0.44, 0.77] |
| 3 | ICC1k | Average raters absolute | -0.118 | 0.894 | 8 | 9 | 0.557 | [-3.59, 0.74] |
| 4 | ICC2k | Average random raters | 0.272 | 1.715 | 8 | 8 | 0.231 | [-0.47, 0.78] |
| 5 | ICC3k | Average fixed raters | 0.417 | 1.715 | 8 | 8 | 0.231 | [-1.58, 0.87] |


In [8]:
icc = pg.intraclass_corr(data=df, targets='ID', raters='Rater', ratings='temper').round(3)
icc

| | Type | Description | ICC | F | df1 | df2 | pval | CI95% |
|---|---|---|---|---|---|---|---|---|
| 0 | ICC1 | Single raters absolute | -0.27 | 0.574 | 8 | 9 | 0.777 | [-0.75, 0.43] |
| 1 | ICC2 | Single random raters | -0.215 | 0.619 | 8 | 8 | 0.744 | [-0.7, 0.47] |
| 2 | ICC3 | Single fixed raters | -0.236 | 0.619 | 8 | 8 | 0.744 | [-0.76, 0.47] |
| 3 | ICC1k | Average raters absolute | -0.741 | 0.574 | 8 | 9 | 0.777 | [-6.14, 0.6] |
| 4 | ICC2k | Average random raters | -0.548 | 0.619 | 8 | 8 | 0.744 | [-4.71, 0.63] |
| 5 | ICC3k | Average fixed raters | -0.616 | 0.619 | 8 | 8 | 0.744 | [-6.17, 0.64] |


In [9]:
icc = pg.intraclass_corr(data=df, targets='ID', raters='Rater', ratings='attitude').round(3)
icc

| | Type | Description | ICC | F | df1 | df2 | pval | CI95% |
|---|---|---|---|---|---|---|---|---|
| 0 | ICC1 | Single raters absolute | 0.529 | 3.25 | 8 | 9 | 0.049 | [-0.12, 0.87] |
| 1 | ICC2 | Single random raters | 0.533 | 3.37 | 8 | 8 | 0.053 | [-0.09, 0.87] |
| 2 | ICC3 | Single fixed raters | 0.542 | 3.37 | 8 | 8 | 0.053 | [-0.14, 0.87] |
| 3 | ICC1k | Average raters absolute | 0.692 | 3.25 | 8 | 9 | 0.049 | [-0.26, 0.93] |
| 4 | ICC2k | Average random raters | 0.696 | 3.37 | 8 | 8 | 0.053 | [-0.21, 0.93] |
| 5 | ICC3k | Average fixed raters | 0.703 | 3.37 | 8 | 8 | 0.053 | [-0.32, 0.93] |


In [10]:
icc = pg.intraclass_corr(data=df, targets='ID', raters='Rater', ratings='introspection').round(3)
icc

| | Type | Description | ICC | F | df1 | df2 | pval | CI95% |
|---|---|---|---|---|---|---|---|---|
| 0 | ICC1 | Single raters absolute | 0.579 | 3.746 | 8 | 9 | 0.033 | [-0.05, 0.88] |
| 1 | ICC2 | Single random raters | 0.59 | 4.287 | 8 | 8 | 0.027 | [0.01, 0.89] |
| 2 | ICC3 | Single fixed raters | 0.622 | 4.287 | 8 | 8 | 0.027 | [-0.02, 0.9] |
| 3 | ICC1k | Average raters absolute | 0.733 | 3.746 | 8 | 9 | 0.033 | [-0.09, 0.94] |
| 4 | ICC2k | Average random raters | 0.742 | 4.287 | 8 | 8 | 0.027 | [0.02, 0.94] |
| 5 | ICC3k | Average fixed raters | 0.767 | 4.287 | 8 | 8 | 0.027 | [-0.03, 0.95] |


### Helen and Paris

In [11]:
df = pd.DataFrame(r_helenandparis)
icc = pg.intraclass_corr(data=df, targets='ID', raters='Rater', ratings='sensitivity').round(3)
icc

| | Type | Description | ICC | F | df1 | df2 | pval | CI95% |
|---|---|---|---|---|---|---|---|---|
| 0 | ICC1 | Single raters absolute | 0.555 | 3.495 | 14 | 15 | 0.011 | [0.09, 0.82] |
| 1 | ICC2 | Single random raters | 0.553 | 3.431 | 14 | 14 | 0.014 | [0.08, 0.82] |
| 2 | ICC3 | Single fixed raters | 0.549 | 3.431 | 14 | 14 | 0.014 | [0.07, 0.82] |
| 3 | ICC1k | Average raters absolute | 0.714 | 3.495 | 14 | 15 | 0.011 | [0.17, 0.9] |
| 4 | ICC2k | Average random raters | 0.712 | 3.431 | 14 | 14 | 0.014 | [0.15, 0.9] |
| 5 | ICC3k | Average fixed raters | 0.709 | 3.431 | 14 | 14 | 0.014 | [0.13, 0.9] |


In [12]:
icc = pg.intraclass_corr(data=df, targets='ID', raters='Rater', ratings='temper').round(3)
icc

| | Type | Description | ICC | F | df1 | df2 | pval | CI95% |
|---|---|---|---|---|---|---|---|---|
| 0 | ICC1 | Single raters absolute | 0.642 | 4.581 | 14 | 15 | 0.003 | [0.23, 0.86] |
| 1 | ICC2 | Single random raters | 0.637 | 4.276 | 14 | 14 | 0.005 | [0.19, 0.86] |
| 2 | ICC3 | Single fixed raters | 0.621 | 4.276 | 14 | 14 | 0.005 | [0.18, 0.85] |
| 3 | ICC1k | Average raters absolute | 0.782 | 4.581 | 14 | 15 | 0.003 | [0.37, 0.93] |
| 4 | ICC2k | Average random raters | 0.778 | 4.276 | 14 | 14 | 0.005 | [0.32, 0.93] |
| 5 | ICC3k | Average fixed raters | 0.766 | 4.276 | 14 | 14 | 0.005 | [0.3, 0.92] |


In [13]:
icc = pg.intraclass_corr(data=df, targets='ID', raters='Rater', ratings='attitude').round(3)
icc

| | Type | Description | ICC | F | df1 | df2 | pval | CI95% |
|---|---|---|---|---|---|---|---|---|
| 0 | ICC1 | Single raters absolute | 0.097 | 1.216 | 14 | 15 | 0.355 | [-0.41, 0.56] |
| 1 | ICC2 | Single random raters | 0.083 | 1.175 | 14 | 14 | 0.383 | [-0.45, 0.56] |
| 2 | ICC3 | Single fixed raters | 0.081 | 1.175 | 14 | 14 | 0.383 | [-0.43, 0.56] |
| 3 | ICC1k | Average raters absolute | 0.177 | 1.216 | 14 | 15 | 0.355 | [-1.38, 0.72] |
| 4 | ICC2k | Average random raters | 0.153 | 1.175 | 14 | 14 | 0.383 | [-1.66, 0.72] |
| 5 | ICC3k | Average fixed raters | 0.149 | 1.175 | 14 | 14 | 0.383 | [-1.53, 0.71] |


In [14]:
icc = pg.intraclass_corr(data=df, targets='ID', raters='Rater', ratings='introspection').round(3)
icc

| | Type | Description | ICC | F | df1 | df2 | pval | CI95% |
|---|---|---|---|---|---|---|---|---|
| 0 | ICC1 | Single raters absolute | 0.639 | 4.547 | 14 | 15 | 0.003 | [0.22, 0.86] |
| 1 | ICC2 | Single random raters | 0.637 | 4.365 | 14 | 14 | 0.005 | [0.2, 0.86] |
| 2 | ICC3 | Single fixed raters | 0.627 | 4.365 | 14 | 14 | 0.005 | [0.19, 0.86] |
| 3 | ICC1k | Average raters absolute | 0.78 | 4.547 | 14 | 15 | 0.003 | [0.36, 0.93] |
| 4 | ICC2k | Average random raters | 0.778 | 4.365 | 14 | 14 | 0.005 | [0.34, 0.93] |
| 5 | ICC3k | Average fixed raters | 0.771 | 4.365 | 14 | 14 | 0.005 | [0.32, 0.92] |


As the tables show, `pingouin` computes several forms of ICC; which one to pick depends on the conditions of the experiment. We recommend ICC3 (Single fixed raters): it fits our case, where the same two raters each evaluated the entire corpus independently of each other, and there are too few of them to justify the average measures. See [this article on Medium](https://medium.com/@SalahAssana/a-beginners-guide-to-the-intraclass-correlation-coefficient-icc-288f7fe7bcfc) for further explanation.
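For readers who want to see what the ICC3 row measures, the sketch below computes the single-rater, fixed-raters coefficient from a complete targets-by-raters matrix using the textbook two-way ANOVA formula, ICC(3,1) = (MS_rows - MS_error) / (MS_rows + (k - 1) MS_error). This is an illustration in plain Python, not `pingouin`'s internal code; `icc3_single` and `icc3_from_f` are hypothetical names and the two small matrices are invented.

```python
def icc3_single(matrix):
    """ICC(3,1) for a complete targets-by-raters matrix (list of rows)."""
    n = len(matrix)        # number of targets (images)
    k = len(matrix[0])     # number of raters
    grand = sum(map(sum, matrix)) / (n * k)
    row_means = [sum(row) / k for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n for j in range(k)]

    # Two-way ANOVA without replication: targets, raters, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in matrix for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    # Undefined (division by zero) when every rating is identical.
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

def icc3_from_f(f_value, k):
    """Same coefficient, rewritten in terms of F = MS_rows / MS_error."""
    return (f_value - 1) / (f_value + k - 1)

shifted = [[1, 2], [2, 3], [3, 4]]   # rater 2 is always one point higher
opposed = [[1, 3], [2, 2], [3, 1]]   # raters order the images in reverse
```

A constant offset between raters does not lower ICC3, so `shifted` yields 1, while `opposed` yields -1. The second helper ties the formula back to the tables: with two raters and the reported F of 1.715, it gives roughly 0.263, the ICC3 value shown for sensitivity in the Sacrifices group.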