In [46]:
%pip install pygamma-agreement cylp pandas numpy tqdm scipy


## Computing inter-annotator agreement (IAA) with factgenie

This notebook shows how to compute inter-annotator agreement (IAA) between two annotator groups.

### Input data
To use this notebook, you need the CSV files that factgenie exports for computing inter-annotator agreement:
- `dataset_level_counts.csv`
- `example_level_counts.csv`
- `gamma_spans.csv`

You can generate these files on the `/analyze` page (on the Inter-annotator agreement tab). On that page, you need to select the campaign(s) with multiple annotators per example and select `Export data files`.

### Annotator groups
We will compute the agreement between two **annotator groups**. Each annotator group has an id in the format `{campaign_id}-anngroup-{group_idx}`, where `group_idx` is the ordinal number of the annotator within the given campaign.

#### Single campaign
You can compute IAA between annotators within a single campaign.

Example: in the campaign `llm-eval-1`, you used two annotators per example. Then you want to measure agreement between `llm-eval-1-anngroup-0` and `llm-eval-1-anngroup-1`.

#### Multiple campaigns
You can compute IAA between annotators in multiple campaigns **if these campaigns were annotating the same outputs**.

Example: you ran campaigns `llm-eval-1` and `llm-eval-2` over the same set of examples. Then you will measure agreement between `llm-eval-1-anngroup-0` and `llm-eval-2-anngroup-0`.
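
The group ids for both scenarios follow the same pattern, so they can be assembled with a small helper. This is only a sketch: the function name `make_group_id` is hypothetical and not part of factgenie; only the `{campaign_id}-anngroup-{group_idx}` format comes from the description above.

```python
def make_group_id(campaign_id: str, group_idx: int) -> str:
    """Build an annotator group id (hypothetical helper, not a factgenie function)."""
    return f"{campaign_id}-anngroup-{group_idx}"


# Single campaign: two annotators per example within `llm-eval-1`
single_campaign_groups = (make_group_id("llm-eval-1", 0), make_group_id("llm-eval-1", 1))

# Multiple campaigns: first annotator of `llm-eval-1` vs. first annotator of `llm-eval-2`
multi_campaign_groups = (make_group_id("llm-eval-1", 0), make_group_id("llm-eval-2", 0))

print(single_campaign_groups)
print(multi_campaign_groups)
```

Either tuple can then be used as the `groups` pair in the cells that compute the scores.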

In [47]:
import pandas as pd
import numpy as np
import logging
import pygamma_agreement as pa
import traceback
from pyannote.core import Segment
from tqdm.notebook import tqdm
from scipy.stats import pearsonr

# Set the directory where the csv files are located here
csv_path = "."

# Pearson r

First, we will use the **Pearson correlation coefficient** to measure the agreement between two annotators.

Specifically, we will measure how much the **error counts** agree. 

In the ideal case, both annotators annotated the **same number of errors of each category** for each example. The Pearson r coefficient quantifies to what extent this holds. A value of 1 signifies perfect *positive linear correlation*, 0 signifies no linear correlation, and -1 signifies perfect *negative linear correlation*.
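
To get a feeling for these values, here is a minimal sketch on made-up error counts (the numbers are purely illustrative and not taken from any campaign). It also shows that Pearson r captures a *linear relationship* rather than exact equality: an annotator who always marks one more error than the other still reaches a coefficient of 1.

```python
from scipy.stats import pearsonr

# Hypothetical per-example error counts for a single category
annotator_a = [0, 1, 2, 3, 4]
annotator_b_same = [0, 1, 2, 3, 4]      # identical counts
annotator_b_shifted = [1, 2, 3, 4, 5]   # always one more error than annotator A
annotator_b_reversed = [4, 3, 2, 1, 0]  # opposite trend

r_same = pearsonr(annotator_a, annotator_b_same)[0]
r_shifted = pearsonr(annotator_a, annotator_b_shifted)[0]
r_reversed = pearsonr(annotator_a, annotator_b_reversed)[0]

print(f"same: {r_same:.3f}, shifted: {r_shifted:.3f}, reversed: {r_reversed:.3f}")
```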

We will compare both the example-level correlation, which is stricter, and the dataset-level (or, more precisely, dataset-split-setup_id-level) correlation, which is more lenient.

## Levels

### Dataset-level
Pearson r between two annotators computed over a list of average error counts for each (dataset, split, setup_id) combination.

### Example-level
Pearson r between two annotators computed over a list of error counts for each (dataset, split, setup_id, example_idx) combination.
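
The relation between the two levels can be sketched as follows. The data frame is made up and the column names simply mirror the grouping keys named above; the exported `dataset_level_counts.csv` may differ in its exact columns, so treat this as an illustration of the aggregation, not of the file format.

```python
import pandas as pd

# Hypothetical example-level counts for one annotator group and one category
example_level = pd.DataFrame({
    "dataset": ["d1", "d1", "d1", "d1"],
    "split": ["test", "test", "test", "test"],
    "setup_id": ["model-a", "model-a", "model-b", "model-b"],
    "example_idx": [0, 1, 0, 1],
    "count": [2, 4, 0, 1],
})

# Dataset level: average the counts over all examples of each (dataset, split, setup_id)
dataset_level = (
    example_level
    .groupby(["dataset", "split", "setup_id"], as_index=False)["count"]
    .mean()
)

print(dataset_level)
```

Averaging removes the per-example variation, which is why the dataset-level correlation is the more lenient of the two.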

## Average type

### Micro-average
A coefficient computed over concatenated results from all the categories.

### Macro-average
An average of coefficients computed separately for each category.
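
The difference between the two averages can be illustrated on toy data. The category names and counts below are hypothetical; position `i` in each pair of lists is assumed to refer to the same example.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical counts per category for two annotator groups
group1 = {"incorrect": [0, 1, 2, 3], "misleading": [1, 0, 2, 1]}
group2 = {"incorrect": [0, 2, 2, 4], "misleading": [0, 1, 2, 0]}

# Micro-average: one coefficient over the counts of all categories concatenated
micro = pearsonr(
    group1["incorrect"] + group1["misleading"],
    group2["incorrect"] + group2["misleading"],
)[0]

# Macro-average: mean of the coefficients computed separately per category
per_category = {cat: pearsonr(group1[cat], group2[cat])[0] for cat in group1}
macro = float(np.mean(list(per_category.values())))

print(f"micro: {micro:.3f}, macro: {macro:.3f}")
```

With the micro-average, categories with more (or more varied) counts dominate the result, whereas the macro-average weights every category equally.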

In [48]:
def compute_pearson_r(csv_path, group1, group2):
    """Compute micro- and macro-averaged Pearson r between the counts of two annotator groups."""
    # Load the counts exported by factgenie
    df = pd.read_csv(csv_path)

    # Select the rows of each annotator group
    group1_data = df[df['annotator_group_id'] == group1]
    group2_data = df[df['annotator_group_id'] == group2]

    # Note: the two lists are paired by position, so both groups must cover
    # the same items and categories in the same row order
    group1_counts = list(group1_data["count"])
    group2_counts = list(group2_data["count"])

    # Micro correlation - a single coefficient over the counts of all categories
    micro_corr = pearsonr(group1_counts, group2_counts)[0]

    # Macro correlation - average of per-category correlations
    type_corrs = []
    for ann_type in df['annotation_type'].unique():
        g1_type = list(group1_data[group1_data['annotation_type'] == ann_type]["count"])
        g2_type = list(group2_data[group2_data['annotation_type'] == ann_type]["count"])

        type_corrs.append(pearsonr(g1_type, g2_type)[0])

    macro_corr = np.mean(type_corrs)

    return {'micro': micro_corr, 'macro': macro_corr, 'category_correlations': type_corrs}

In [49]:

for level in ["dataset", "example"]:
    csv_filename = f"{csv_path}/{level}_level_counts.csv"

    groups = ('quintd1-gpt-4-anngroup-0', 'quintd1-human-anngroup-0')
    correlations = compute_pearson_r(csv_filename, *groups)

    print(f"{level}-level correlations between {groups[0]} and {groups[1]}")
    print("==============================================")

    print(f"Micro Pearson-r: {correlations['micro']:.3f}")
    print("==============================================")

    for i, corr in enumerate(correlations['category_correlations']):
        print(f"Category {i}: {corr:.3f}")
    print("----------------------------------------------")

    print(f"Macro Pearson-r: {correlations['macro']:.3f}")
    print("==============================================")


# Gamma (γ) score
Second, we compute the gamma (γ) score between the two annotator groups.

This score is suitable for computing IAA in cases where annotators both (1) determine span positions and (2) categorize the spans.

The γ score considers the best alignment between the spans and computes the value based on the number of local dissimilarities. The score helps us quantify agreement not just on the error counts, but also on the exact **positions** of the spans in the output text.
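
To make the notion of a local dissimilarity more concrete, here is a simplified sketch of the dissimilarity between two aligned spans, following the combined positional and categorical form as we understand it from Mathet et al. (2015). All function names are hypothetical and this is not the pygamma-agreement implementation; it also assumes a binary category distance (0 for equal categories, 1 otherwise).

```python
def positional_dissimilarity(start_u, end_u, start_v, end_v, delta_empty=1.0):
    """Squared boundary distance, normalized by the summed span lengths.

    Assumes at least one span is non-empty (otherwise the denominator is zero).
    """
    boundary_distance = abs(start_u - start_v) + abs(end_u - end_v)
    total_length = (end_u - start_u) + (end_v - start_v)
    return (boundary_distance / total_length) ** 2 * delta_empty


def categorical_dissimilarity(cat_u, cat_v, delta_empty=1.0):
    """Binary category distance: 0 for equal categories, `delta_empty` otherwise."""
    return 0.0 if cat_u == cat_v else delta_empty


def combined_dissimilarity(span_u, span_v, alpha=1.0, beta=1.0, delta_empty=1.0):
    """Weighted sum of the positional and categorical dissimilarities."""
    (start_u, end_u, cat_u), (start_v, end_v, cat_v) = span_u, span_v
    d_pos = positional_dissimilarity(start_u, end_u, start_v, end_v, delta_empty)
    d_cat = categorical_dissimilarity(cat_u, cat_v, delta_empty)
    return alpha * d_pos + beta * d_cat


# Two spans of length 10, shifted by 2 characters (hypothetical values)
same_category = combined_dissimilarity((10, 20, "A"), (12, 22, "A"))
different_category = combined_dissimilarity((10, 20, "A"), (12, 22, "B"))

print(f"same category: {same_category:.2f}, different category: {different_category:.2f}")
```

The γ score itself then compares the observed disorder of the best alignment with the disorder expected by chance, so values close to 1 indicate high agreement.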

For a full description, please refer to the original paper [Mathet et al. (2015)](https://doi.org/10.1162/COLI_a_00227).

For Python, the score is implemented in the [pygamma-agreement](https://pygamma-agreement.readthedocs.io/en/latest/index.html) library.

**Note that computing the score is computationally intensive. Consider saving intermediate per-example scores in case you need to repeat the experiments.**
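
One possible way to save intermediate per-example scores is a small JSON cache, sketched below using only the standard library. The helper names, the cache file name, and the key format are all hypothetical; the expensive computation is replaced by a stand-in function so that the sketch runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_score_cache(cache_path):
    """Return previously saved per-example scores, or an empty dict if none exist."""
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text())
    return {}


def save_score_cache(cache_path, scores):
    """Write the per-example scores to disk as JSON."""
    Path(cache_path).write_text(json.dumps(scores, indent=2))


def cached_score(cache, key, compute_fn):
    """Return the cached score for `key`, computing and storing it only if missing."""
    if key not in cache:
        cache[key] = compute_fn()
    return cache[key]


# Demonstration with a stand-in for the expensive gamma computation
calls = []

def expensive_stand_in():
    calls.append(1)
    return 0.42

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "gamma_scores.json"
    cache = load_score_cache(cache_file)

    key = "d1|test|model-a|0"  # hypothetical key: dataset|split|setup_id|example_idx
    first = cached_score(cache, key, expensive_stand_in)
    second = cached_score(cache, key, expensive_stand_in)  # served from the cache

    save_score_cache(cache_file, cache)
    reloaded = load_score_cache(cache_file)

print(first, second, len(calls), reloaded)
```

In practice, the stand-in would be replaced by the per-example gamma computation, and the cache would be saved periodically so that an interrupted run can be resumed.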

In [43]:
def compute_gamma(span_index, dissim, precision_level="low"):
    gamma_scores = []
    running_avg = 0
    
    # Group examples
    groups = list(span_index.groupby(["dataset", "split", "setup_id", "example_idx"]))
    
    # Create progress bar
    pbar = tqdm(total=len(groups), desc='Computing gamma score')
    
    for idx, (i, group) in enumerate(groups, 1):
        try:
            # Add each annotation to continuum
            continuum = pa.Continuum()

            if group.annotator_group_id.unique().shape[0] < 2:
                print(f"Skipping example {idx} due to insufficient annotators")
                gamma_scores.append(0.0)
                running_avg = np.mean(gamma_scores)
                pbar.set_postfix({'avg_gamma': f'{running_avg:.3f}'})
                pbar.update(1)
                continue

            for j, row in group.iterrows():
                # make sure we do not add empty segments
                if row["annotation_start"] == row["annotation_end"]:
                    continue

                continuum.add(
                    str(row["annotator_group_id"]),
                    Segment(row["annotation_start"], row["annotation_end"]),
                    str(row["annotation_type"]),
                )

            # Temporarily raise the root logging level to suppress output, then restore it
            root_logger = logging.getLogger()
            previous_level = root_logger.level
            root_logger.setLevel(logging.WARNING)
            try:
                gamma_results = continuum.compute_gamma(dissim, soft=True, precision_level=precision_level)
            finally:
                root_logger.setLevel(previous_level)

            gamma_scores.append(gamma_results.gamma)
            running_avg = np.mean(gamma_scores)
            
            # Update progress bar with current average
            pbar.set_postfix({'avg_gamma': f'{running_avg:.3f}'})
            pbar.update(1)
        except Exception as e:
            traceback.print_exc()
            print(f"Error computing gamma for example {idx}")
            gamma_scores.append(0.0)
            running_avg = np.mean(gamma_scores)
            pbar.set_postfix({'avg_gamma': f'{running_avg:.3f}'})
            pbar.update(1)
    
    pbar.close()
    return float(np.mean(gamma_scores)) if gamma_scores else 0.0

In [44]:
gamma_spans = pd.read_csv(f"{csv_path}/gamma_spans.csv")

In [45]:
# Higher precision_level will result in more accurate gamma scores, but will take longer to compute
precision_level = "low"

# `alpha`: coefficient weighting the *positional* dissimilarity value, defaults to 1
alpha = 1
# `beta`: coefficient weighting the *categorical* dissimilarity value, defaults to 1
beta = 1
# `delta_empty`: empty dissimilarity value, defaults to 1
delta_empty = 1
dissim = pa.CombinedCategoricalDissimilarity(delta_empty=delta_empty, alpha=alpha, beta=beta)
gamma = compute_gamma(gamma_spans, dissim, precision_level=precision_level)

print(f"Gamma score: {gamma:.3f}")