# Week 4 | Assignment 2: Evaluating & Adjudicating Annotations - Part I


### Overview
* You start from the annotations and guidelines that you produced last week
* You will evaluate the annotations quantitatively ("researchers") and qualitatively ("annotators")
* Next week: make final, "gold" version of your annotations

## Stage 1.1: Setting things up

* If you were in the "researchers" team last week, you'll now be in the "annotators" team, and the other way around 
* In the rest of the notebook, sections marked "/R" are meant for the researchers, sections marked "/A" are for annotators

> **IMPORTANT**: 
> * **make a copy of this notebook (_File > Save a copy in Drive_) and share it with your groupmates. One team member should submit the notebook on Nestor before the start of the next lab session.**
> * **also include external documents / spreadsheets that you created as part of the assignment when you submit**

## Stage 1.2/R: Inter-Annotator Agreement

* How inter-annotator agreement ("IAA" for short) works was discussed in the last lecture. Review the [slides here](https://annotationfor-wmh5113.slack.com/files/U03BU7F0DH9/F03FGQU128G/week_04_-_iaa.pdf) and/or read chapter 6 (p. 126 onwards) from the [book](https://rug-on-worldcat-org.proxy-ub.rug.nl/search/detail/801812987?submitButton=&queryString=natural%20language%20annotation%20for%20machine%20learning&databaseList=638).
* In the assignments (as in the lecture), we'll work with two metrics: Cohen's $\kappa$ and Fleiss' $\kappa$.


### Preparing the data

* On Doccano, find the projects containing the annotations created in **week 3** (part II of Assignment 1)
* For each project, export the annotations (Datasets > Actions > Export Annotations > JSONL)
* Unzip the downloaded files, find the files containing the actual annotations (tip: this is the file that has the project code as filename, e.g. `rd-event-1.jsonl`) and rename them with the names of the annotators (e.g. `hermione.jsonl`, `harry.jsonl`, `ron.jsonl`)
* In the "Files" pane (left-hand menu in Colab), create a new folder `annotations/` and upload the annotation files to this folder

**Write the following three functions:**
1. a function that reads a single annotation file and returns the JSON lines as a list of dictionaries*
2. a function that groups annotations per annotator. This function should take as input a dictionary mapping each annotator to that annotator's list of annotations, and should return a dictionary mapping document IDs to a dictionary containing the annotations for that document ID per annotator. Example input:
```python
{
    "hermione": [annotation_1, annotation_2, annotation_3, ..., annotation_30],
    "harry": [annotation_1, annotation_2, ..., annotation_28],
    "ron": [annotation_1, annotation_2, ..., annotation_29] 
}
```
where each `annotation_i` is one of the JSONL records that you exported, e.g. `{"id": 1, "data": "an accident happened...", "label": "injury", ... }`. The
output should look like this:
```python
{
    1: {"hermione": annotation, "harry": annotation, "ron": annotation},
    ...,
    29: {"hermione": annotation, "ron": annotation},
    30: {"hermione": annotation}
}
```
3. A function that counts how many annotations were made by each annotator. The output should look like this:
```python
{
    "hermione": 30,
    "ron": 29,
    "harry": 28
}
```


*) N.B.: in a real-world situation, loading a complete JSONL file into memory at once is not always a good idea, depending on how large the annotation files are and how much memory you have available. Here the amount of data is small enough that this isn't a problem.
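
For larger files, one common alternative is to read the records lazily with a generator, so that only one record is in memory at a time. The sketch below is not needed for this assignment; `iter_annotations` is a hypothetical helper name, and the two demo records are made up.

```python
import json
import os
import tempfile

def iter_annotations(anno_file):
  """Yield one parsed JSONL record at a time instead of building a full list."""
  with open(anno_file, encoding="utf-8") as f:
    for line in f:
      line = line.strip()
      if line:  # skip blank lines
        yield json.loads(line)

# Demo on a throwaway file containing two made-up records and one blank line
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
  tmp.write('{"id": 1, "label": ["injury"]}\n\n{"id": 2, "label": []}\n')
  demo_path = tmp.name

# Count the records that have at least one label, without storing all records
num_labeled = sum(1 for rec in iter_annotations(demo_path) if rec["label"])
os.remove(demo_path)
print(num_labeled)
```

The trade-off is that a generator can only be consumed once, so it suits single-pass statistics better than the repeated lookups done in the rest of this notebook.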


> Write your functions in the cells below. 

In [None]:
# function 1
import json

def read_annotations(anno_file):
  records = []
  with open(anno_file, encoding="utf-8") as f:
    for line in f:
      record = json.loads(line)
      records.append(record)
  return records

In [None]:
# function 2
def group_annotations(annotator_to_records):
  grouped_records = {}
  for annotator, records in annotator_to_records.items():
    for rec in records:
      doc_id = rec["article_id"]
      if doc_id in grouped_records:
        grouped_records[doc_id][annotator] = rec
      else:
        grouped_records[doc_id] = {annotator: rec}
  return grouped_records

In [None]:
# function 3
def count_records(grouped_records):
  counts = {}
  for doc_id, group in grouped_records.items():
    for annotator, record in group.items():
      if annotator not in counts:
        counts[annotator] = 0
      counts[annotator] += 1
  return counts


> Use the functions that you wrote to calculate how many annotations every annotator made. In >this< cell, write a short report on your findings: are there large differences? (N.B.: again, this is not a competition of who annotates more! Faster is not always better, and also not vice versa. But it can be interesting to reflect on why: did some annotators need more time per item? For span-based tasks, did some annotators annotate more spans per document? Were there specific issues that only some annotators ran into?)

In [None]:
# calculations
annotations = {
    "dertje": read_annotations("annotations/dertje.jsonl"),
    "justin": read_annotations("annotations/justin.jsonl"),
    "tariq": read_annotations("annotations/tariq.jsonl")
    # "maurice": read_annotations("annotations/maurice.jsonl"),
    # "stijn": read_annotations("annotations/stijn.jsonl"),

}
grouped = group_annotations(annotations)
print(grouped.keys())
counts = count_records(grouped)
print(counts)

dict_keys([6046, 8199, 3409, 1851, 1055, 9629, 6434, 8616, 4457, 304, 3472, 5413, 4790, 4552, 11831, 9767, 8793, 4188, 471, 9676, 11593, 8253])
{'dertje': 22, 'justin': 2, 'tariq': 2}


> Finally, in the cell below, write another function that filters the grouped annotations (from function 2) so that you keep only the ones that were annotated by all annotators.

In [None]:
# function 4
def get_shared_annotations(grouped_annotations, num_annotators):
  shared_annotations = {}
  for doc_id, records in grouped_annotations.items():
    if len(records)  == num_annotators:
      shared_annotations[doc_id] = records
  return shared_annotations
  
shared = get_shared_annotations(grouped, 3)
shared

{6046: {'dertje': {'article_id': 6046,
   'crash_id': 5129,
   'data': 'Apeldoornse wielrenner (84) overleed vijf dagen na ongeluk op bedrijventerrein\n\nDe wielrenner die op 26 juli ernstig gewond raakte bij een ongeval op bedrijventerrein Ecofactorij in Apeldoorn, blijkt te zijn overleden. Dat heeft de politie vandaag bevestigd. Het gaat om een 84-jarige Apeldoorner.\n\n',
   'id': 771778,
   'label': [[12, 22, 'Cyclist'], [83, 93, 'Cyclist']],
   'sitename': 'destentor.nl',
   'url': 'https://www.destentor.nl/apeldoorn/apeldoornse-wielrenner-84-overleed-vijf-dagen-na-ongeluk-op-bedrijventerrein~abe70587/'},
  'justin': {'article_id': 6046,
   'crash_id': 5129,
   'data': 'Apeldoornse wielrenner (84) overleed vijf dagen na ongeluk op bedrijventerrein\n\nDe wielrenner die op 26 juli ernstig gewond raakte bij een ongeval op bedrijventerrein Ecofactorij in Apeldoorn, blijkt te zijn overleden. Dat heeft de politie vandaag bevestigd. Het gaat om een 84-jarige Apeldoorner.\n\n',
   'id': 7

### Finding (dis)agreements

* Now it's time to process your annotations in such a way that you can calculate agreement over them. How this should be done depends on the type of problem you have chosen and how you have approached the problem:

  * **Document-level tasks, single label** --- if each document can have only one label, then agreement is simple: two annotators agree on document $X$ if and only if they assigned the same label to $X$
  * **Document-level tasks, multiple labels** --- if two annotators assign exactly the same set of tags to document $X$, then there is _exact agreement_; if some but not all of the tags match, there is _partial agreement_; otherwise there is no agreement.
  * **Span-level tasks** --- if two annotators assign the same tag to exactly the same span of text (e.g. the span "ambulance" was tagged as `emergency_vehicle`), then there is _exact agreement_; if they assign the same tag to partially the same span (e.g. one annotator has "the ambulance" and the other has just "ambulance"), there is _partial agreement_, and otherwise there is no agreement. 

* Start from the output of function 4 that you wrote above. This should be a dictionary of the form `{doc_id: { "hermione": {...}, "harry": {...}, "ron": {...} }, ...}`
* From this, we want to create a dictionary like this:
```python
{
    "hermione": ["pedestrian", "cyclist", "emergency_vehicle", ... ],
    "harry": ["pedestrian", "pedestrian", "emergency_vehicle", ... ],
    "ron": ["cyclist", "cyclist", "emergency_vehicle", ... ],
}
```
i.e., a dictionary mapping annotator names to lists of string labels. (This means that we get rid of all other information such as document IDs; we want to get the data into a format over which we can directly compute an agreement metric.)
* You need to write the function to create such a dictionary by yourself, but we'll give you instructions for how to approach this. Below, follow **only** the instructions from the section that applies to your situation.
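
The three agreement levels described above for document-level tasks with multiple labels can be sketched as a comparison of two label sets. This is only an illustration: `agreement_type` is a hypothetical helper name, and treating two empty label sets as "exact" is an assumption that the assignment does not spell out.

```python
def agreement_type(labels_a, labels_b):
  """Classify the agreement between two annotators' label sets for one document."""
  set_a, set_b = set(labels_a), set(labels_b)
  if set_a == set_b:
    return "exact"    # same labels, order is irrelevant
  if set_a & set_b:
    return "partial"  # at least one label in common, but not all
  return "none"       # no label in common

print(agreement_type(["injured", "dead"], ["dead", "injured"]))
print(agreement_type(["injured", "dead"], ["dead"]))
print(agreement_type(["injured"], ["damage"]))
```

For single-label tasks the "partial" case cannot occur, so the comparison reduces to a plain equality check.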


#### Document-level, single label

Follow these instructions if you chose a document-level task and annotated in such a way that there is never more than one label per document

* Take the output of function 4
* Create a list for every annotator
* Loop over the documents, and for every document, add each annotator's label to that annotator's list

In [None]:
# write your code here
def process_doc_sing(shared_annotations, annotator_names):
  result = {}
  for an in annotator_names:
    result[an] = []
  
  for doc_id, records in shared_annotations.items():
    for an, rec in records.items():
      label = rec["label"][0]
      result[an].append(label)
  return result

#### Document-level, multiple labels

You will need to write two functions for preparing the data:

1. Preparing for exact matches
  * Take the output of function 4
  * Create a list for every annotator
  * Loop over the documents, and for every document, do the following:
    * For each annotator, sort the list of labels (the labels should match, the order doesn't matter), e.g. `["injured", "dead", "damage"]` --> `["damage", "dead", "injured"]`
    * Make a single string from the sorted labels, e.g. `"damage,dead,injured"` (the separator doesn't matter, as long as it is possible to separate the labels again later on, so don't choose a separator that can also occur inside a label, e.g. `_` if you have labels like `"emergency_vehicle"`)
    * Add the combined label to the annotator's list
    * Essentially, what we do here is that we treat multiple labels as if they were a single label
2. Preparing for partial matches
  * Take the output of function 4
  * Create a list for every annotator
  * Loop over the documents, and for every document, do the following:
    * Compare the label sets of the different annotators
    * If there is a label that is shared between all annotators, choose that label for all annotators. (E.g. if Hermione had labels `[A, B]`, Ron had labels `[B, C]`, and Harry had labels `[B, D]`, then we add `B` to each annotator's list)
    * If no label is shared between all annotators, add an empty label (e.g., `"NULL"`)


In [None]:
# write your exact match code here
def process_doc_multi_exact(shared_annotations, annotator_names):
  result = {}
  for an in annotator_names:
    result[an] = []
  
  for doc_id, records in shared_annotations.items():
    for an, rec in records.items():
      label = "++".join(sorted(rec["label"]))
      result[an].append(label)
  return result

In [None]:
# write your partial match code here
def process_doc_multi_partial(shared_annotations, annotator_names):
  result = {}
  for an in annotator_names:
    result[an] = []
  
  for doc_id, records in shared_annotations.items():
    shared_labels = None
    for an, rec in records.items():
      labels = set(rec["label"])
      if shared_labels is None:
        shared_labels = labels
      else:
        shared_labels = labels.intersection(shared_labels)

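    # if no label is shared between all annotators, fall back to the empty label "NULL"
    label = "++".join(sorted(shared_labels)) if shared_labels else "NULL"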
    for an in annotator_names:
      result[an].append(label)
      
  return result

#### Span-level, multiple labels

You will need to write two functions for preparing the data:

1. Preparing for exact matches
  * Take the output of function 4
  * Create a list for every annotator
  * Loop over the documents, and for every document, do the following:
    * Loop over annotators 
    * Loop over all tagged spans
    * For every tagged span, check whether the other annotators tagged exactly the same span. For every annotator who tagged the same span with the same label, add the label to that annotator's list; for every other annotator, add an empty label. (E.g.: if Hermione and Harry, but not Ron, applied the tag `"vehicle"` to span `[20, 25]`, add `"vehicle"` to Hermione and Harry's lists, and add `"NULL"` to Ron's list)

2. Preparing for partial matches
  * Take the output of function 4
  * Create a list for every annotator
  * Loop over the documents, and for every document, do the following:
    * Loop over annotators 
    * Loop over all tagged spans
    * For every tagged span, check whether the other annotators tagged an overlapping span. For every annotator who tagged an overlapping span with the same label, add the label to that annotator's list; for every other annotator, add an empty label. (E.g.: if Hermione applied the tag `"vehicle"` to span `[20, 25]`, while Harry added `"vehicle"` to span `[17, 25]` and Ron completely missed this span, add `"vehicle"` to Hermione and Harry's lists, and add `"NULL"` to Ron's list)
    * Count spans as "overlapping" only if one of the spans includes (is a superstring of) the other. I.e., if span A is `[X_i, X_j]` and span B is `[Y_i, Y_j]`, it should be the case that `(X_i <= Y_i and X_j >= Y_j) or (X_i >= Y_i and X_j <= Y_j)`. So, for example, A = "the red ambulance" and B = "ambulance" counts, but A = "the red" and B = "red ambulance" doesn't count.
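
The containment condition from the last bullet can be written as a small helper. This is a minimal sketch: the name `one_span_contains_other` is hypothetical, and the character offsets are worked out by hand for the example phrase, with end offsets exclusive as in Python slicing.

```python
def one_span_contains_other(span_a, span_b):
  """Return True if one (start, end) span fully includes the other."""
  (start_a, end_a), (start_b, end_b) = span_a, span_b
  a_contains_b = start_a <= start_b and end_a >= end_b
  b_contains_a = start_b <= start_a and end_b >= end_a
  return a_contains_b or b_contains_a

text = "the red ambulance"
full, ambulance = (0, 17), (8, 17)     # "the red ambulance", "ambulance"
the_red, red_ambulance = (0, 7), (4, 17)  # "the red", "red ambulance"

print(one_span_contains_other(full, ambulance))         # one span includes the other
print(one_span_contains_other(the_red, red_ambulance))  # the spans only partly overlap
```

Note that identical spans also satisfy the condition, so every exact match counts as a partial match too.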

In [None]:
# write your exact match code here
def process_span_exact(shared_annotations, annotator_names):
  result = {}
  for an in annotator_names:
    result[an] = []
  
  for doc_id, records in shared_annotations.items():
    # make sure we never match twice
    already_matched = {}
    for an in annotator_names:
      already_matched[an] = []

    for an_i in annotator_names:
      an_i_labels = [tuple(i) for i in records[an_i]["label"]]
      for lab in an_i_labels:
        if lab in already_matched[an_i]:
          continue
        span_start, span_end, tag_name = lab
        result[an_i].append(tag_name)
        for an_j in annotator_names:
          if an_i == an_j:
            continue
          an_j_labels = [tuple(i) for i in records[an_j]["label"]]
          if lab in an_j_labels:
            result[an_j].append(tag_name)
            # remember this label so that we won't start a new match from it later in the loop
            already_matched[an_j].append(lab)
          else:
            result[an_j].append("NULL")
           
  return result

In [None]:
# write your partial match code here
def process_span_partial(shared_annotations, annotator_names):
  result = {}
  for an in annotator_names:
    result[an] = []
  
  for doc_id, records in shared_annotations.items():
    # make sure we never match twice
    already_matched = {}
    for an in annotator_names:
      already_matched[an] = []

    for an_i in annotator_names:
      an_i_labels = [tuple(i) for i in records[an_i]["label"]]
      for lab in an_i_labels:
        if lab in already_matched[an_i]:
          continue
        span_start, span_end, tag_name = lab
        result[an_i].append(tag_name)
        for an_j in annotator_names:
          if an_i == an_j:
            continue
          an_j_labels = [tuple(i) for i in records[an_j]["label"]]
          partial_match_found = False
          for lab_j in an_j_labels:
            span_start_j, span_end_j, tag_name_j = lab_j
            if tag_name == tag_name_j and ((span_start >= span_start_j and span_end <= span_end_j) or (span_start <= span_start_j and span_end >= span_end_j)):
              # remember this label so that we won't start a new match from it later in the loop
              already_matched[an_j].append(lab_j)
              partial_match_found = True
          if partial_match_found:
            result[an_j].append(tag_name)
          else:
            result[an_j].append("NULL")  
          
  return result

In [None]:
span_exact = process_span_exact(shared, ["dertje", "justin", "tariq"])

In [None]:
span_partial = process_span_partial(shared, ["dertje", "justin", "tariq"])

### Calculating Cohen's $\kappa$

* Now it's finally time to actually calculate the agreement!
* We'll first do Cohen's $\kappa$. If your group had 2 annotators, you'll only have to do this once. If there were more than 2 annotators, repeat the procedure for all pairs (e.g. if the annotators are Hermione, Harry & Ron --> repeat for Hermione & Harry, Hermione & Ron, and Harry & Ron)
* Scikit-Learn makes it really easy to compute the metric. See the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html
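
To see what the metric computes, here is a hand-rolled sketch of $\kappa = (p_o - p_e) / (1 - p_e)$, applied to every pair of annotators. It is for illustration only and uses made-up toy labels; in the notebook itself, use `cohen_kappa_score` from Scikit-Learn. The function name `cohen_kappa` and the variable names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(labels_a, labels_b):
  """Cohen's kappa for two equally long lists of labels."""
  n = len(labels_a)
  if n == 0 or n != len(labels_b):
    raise ValueError("both annotators need the same, non-zero number of labels")
  # observed agreement: fraction of items with identical labels
  p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
  # chance agreement: product of the two annotators' label proportions, summed over labels
  freq_a, freq_b = Counter(labels_a), Counter(labels_b)
  p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
  if p_e == 1:
    return float("nan")  # both annotators used one and the same label throughout
  return (p_o - p_e) / (1 - p_e)

toy_labels = {
  "hermione": ["pedestrian", "cyclist", "cyclist", "pedestrian"],
  "harry": ["pedestrian", "cyclist", "pedestrian", "pedestrian"],
  "ron": ["cyclist", "cyclist", "cyclist", "pedestrian"],
}

# one kappa value per unordered pair of annotators
pairwise = {
  (a, b): cohen_kappa(toy_labels[a], toy_labels[b])
  for a, b in combinations(toy_labels, 2)
}
for pair, kappa in pairwise.items():
  print(pair, round(kappa, 3))
```

With three annotators this yields three pairs; `itertools.combinations` avoids writing the pairs out by hand when the group is larger.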

In [None]:
# complete the code in the cell and run it
from sklearn.metrics import cohen_kappa_score

print(cohen_kappa_score(span_exact["dertje"], span_exact["tariq"]))
print(cohen_kappa_score(span_partial["dertje"], span_partial["tariq"]))

0.12903225806451624
0.23699421965317913


> Report on your results: how much agreement do you get? Is it higher or lower than you expected? If applicable, how much difference is there between exact and partial matches? If there were multiple pairs, which ones had better agreement? Did you expect this?

### Calculating Fleiss' $\kappa$

* Now it's time for Fleiss! Unfortunately, this isn't included in Scikit-Learn, but there is an alternative package: `statsmodels`. You can find the documentation here: https://www.statsmodels.org/stable/generated/statsmodels.stats.inter_rater.fleiss_kappa.html
* The function below prepares your data: we need to make a table (matrix) where the rows are samples and the columns are category distributions
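
To make the shape of that table concrete, the sketch below builds it by hand for a toy example and computes Fleiss' $\kappa$ from it using the standard formula. It is a plain-Python illustration with made-up labels, not the `statsmodels` implementation; `build_count_table` and `fleiss_kappa_from_table` are hypothetical names, and the sketch assumes every item was labeled by the same number of annotators.

```python
from collections import Counter

def build_count_table(annotators_to_label_lists):
  """One row per item, one column per category, cells = number of annotators."""
  rows = list(zip(*annotators_to_label_lists.values()))
  categories = sorted({label for row in rows for label in row})
  table = [[Counter(row)[cat] for cat in categories] for row in rows]
  return table, categories

def fleiss_kappa_from_table(table):
  """Fleiss' kappa for a count table with a fixed number of raters per row."""
  n_items, n_raters = len(table), sum(table[0])
  # proportion of all assignments that went to each category
  p_cat = [sum(row[j] for row in table) / (n_items * n_raters)
           for j in range(len(table[0]))]
  # per-item agreement: fraction of rater pairs that agree on that item
  p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
            for row in table]
  p_bar = sum(p_item) / n_items
  p_e = sum(p * p for p in p_cat)
  return (p_bar - p_e) / (1 - p_e)

toy_labels = {
  "hermione": ["vehicle", "vehicle", "NULL"],
  "harry": ["vehicle", "NULL", "NULL"],
  "ron": ["NULL", "vehicle", "NULL"],
}
toy_table, toy_categories = build_count_table(toy_labels)
print(toy_categories)
print(toy_table)
print(round(fleiss_kappa_from_table(toy_table), 3))
```

Each row sums to the number of annotators, which is a quick sanity check to run on your own table before computing the metric.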

In [None]:
!pip install statsmodels
from statsmodels.stats import inter_rater as irr

def prepare_fleiss_table(annotators_to_label_lists):
  table = []
  for row in zip(*annotators_to_label_lists.values()):
    table.append(row)
  return irr.aggregate_raters(table)



In [None]:
# adapt / complete the code below
aggregate_exact = prepare_fleiss_table(span_exact)[0]
aggregate_partial = prepare_fleiss_table(span_partial)[0]

print(irr.fleiss_kappa(aggregate_exact))
print(irr.fleiss_kappa(aggregate_partial))

-0.10902636916835681
-0.03164179104477613


> Again, report on your results: how much agreement do you get? Is it higher or lower than you expected? If applicable, how much difference is there between exact and partial matches? Are the results for Fleiss' $\kappa$ very different from those for Cohen's $\kappa$?

## Stage 1.2/A: Qualitative Analysis

* Make a shared spreadsheet (Google Sheets or similar)
* Go through *at least 25%* of your labeled documents and identify all cases of disagreement
* You can either manually browse the Doccano projects and compare the differences, or, if you think this makes the task easier, you can write some code that automatically finds differences (if you do this, you could take inspiration from the researchers' work -- see above)
* For each disagreement that you find, add a row to the spreadsheet with the following columns:
  * **WHERE?**: in which document?
  * **WHO?**: which annotators disagree here?
  * **WHAT?**: what is the difference?
  * **TYPE**: what type of disagreement? 
    * for document-level annotations: "partial" (only some labels are different) or "full" (none of the labels match)
    * for span-level annotations: "same span, different label", "partially matching span, same label", "missing" (= neither span, nor label matches)
  * **REASON**: try to find a reason for the disagreement. You can label the reasons in any way that you want, but some possible labels are "guidelines" (something was ambiguous in the guidelines), "annotator error" (the annotator misinterpreted the guidelines), or "unknown" (if you can't find out what the cause was)
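
If you choose to find differences automatically, a starting point for span-level annotations could look like the sketch below. It assumes records in the grouped format from the researchers' part, where `"label"` is a list of `[start, end, tag]` triples; the function name `find_span_disagreements`, the extra column names, and the demo data are made up. It only flags spans that not every annotator tagged identically, so the TYPE and REASON columns still need to be filled in by hand.

```python
def find_span_disagreements(shared_annotations):
  """List every (start, end, tag) triple that not all annotators produced."""
  rows = []
  for doc_id, records in shared_annotations.items():
    spans_per_annotator = {
      an: {tuple(lab) for lab in rec["label"]} for an, rec in records.items()
    }
    all_spans = set().union(*spans_per_annotator.values())
    for span in sorted(all_spans):
      tagged_by = sorted(an for an, spans in spans_per_annotator.items() if span in spans)
      not_tagged_by = sorted(an for an in spans_per_annotator if an not in tagged_by)
      if not_tagged_by:  # at least one annotator lacks this exact span + tag
        rows.append({
          "WHERE": doc_id,
          "WHAT": span,
          "tagged_by": tagged_by,
          "not_tagged_by": not_tagged_by,
        })
  return rows

# made-up demo data: Harry missed the second span
shared_demo = {
  1: {
    "hermione": {"label": [[12, 22, "Cyclist"], [83, 93, "Cyclist"]]},
    "harry": {"label": [[12, 22, "Cyclist"]]},
  }
}
disagreements = find_span_disagreements(shared_demo)
for row in disagreements:
  print(row)
```

The resulting rows can be pasted into the shared spreadsheet, or written out with Python's `csv` module, as a first draft of the WHERE, WHO, and WHAT columns.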