# Round 2 Model evaluation

For Round 2, only the test-set results are evaluated against the original HITL gold-standard annotations, because the train articles were used to investigate and develop the improvements.

__NER:__

- spaCy en_core_web_trf
- + improvements

__CR:__

- fastcoref
- + improvements

__REX:__

- Babelscape/rebel-large
- + alternate_name from Flair

The code below reads in each model's outputs and the HITL-annotated gold-standard datasets for each task, then compares them. The main metric used for comparison is the ___Macro F1___ score (that is, the average of the per-article F1 scores across the corpus). __Precision__ and __Recall__ are also shown where available.
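To illustrate how a macro-averaged score behaves, the following minimal sketch computes precision, recall and F1 per article from true-positive, false-positive and false-negative counts, then averages each metric across articles. The function names and the counts are hypothetical and chosen for illustration only; this is not the `kg_builder` implementation, which may match and count items differently.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall and F1 for one article (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def macro_average(per_article_counts: list) -> tuple:
    """Average each per-article metric with equal weight per article."""
    if not per_article_counts:
        raise ValueError('At least one article is required')
    scores = [precision_recall_f1(*counts) for counts in per_article_counts]
    n_articles = len(scores)
    return tuple(sum(score[i] for score in scores) / n_articles for i in range(3))


# Hypothetical (tp, fp, fn) counts for two articles
counts = [(8, 2, 2), (1, 0, 3)]
macro_p, macro_r, macro_f1 = macro_average(counts)
print(round(macro_p, 3), round(macro_r, 3), round(macro_f1, 3))
```

Because the averaging happens per article, the macro F1 is generally not the harmonic mean of the macro precision and macro recall, so the three reported figures for a task need not satisfy that identity.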

## Import required libraries

In [1]:
import pickle
import json
import time
import copy
import pandas as pd
pd.set_option('display.max_rows', 100)
from kg_builder import kg
from kg_builder import ner
from kg_builder import cr
from kg_builder import rex
from kg_builder import get_wikidata_prepared_info
from kg_builder import make_lookup_dict_from_df

## Import required data

This includes a dataframe containing the sample data, as well as the model outputs and the annotations to compare against.

In [2]:
# Import sample data
df = pd.read_parquet('source_data/sample_text_30.pq')
# IDs of the train articles, which are excluded from the evaluation below
train_ids = df.loc[df['Split'] == 'train', 'Id'].tolist()

In [3]:
# In each case the output is a list of Article instances
ner_annotations = ner.load_ner_from_label_studio('model_outputs/annotations/sample_ner_30_annotated.json', df, True)
ner_annotations_test = [article for article in ner_annotations if article.article_id not in train_ids]
cr_annotations = cr.load_cr_from_label_studio('model_outputs/annotations/sample_cr_30_annotated.json', df, True)
cr_annotations_test = [article for article in cr_annotations if article.article_id not in train_ids]
rex_annotations = rex.load_rex_from_label_studio('model_outputs/annotations/sample_re_30_annotated.json', df, True)
rex_annotations_test = [article for article in rex_annotations if article.article_id not in train_ids]

In [4]:
with open('model_outputs/round2/results.pkl', 'rb') as file:
    articles = pickle.load(file)
articles_test = [article for article in articles if article.article_id not in train_ids]

## Evaluations

In [5]:
def evaluate_task(predictions: list, annotations: list, task: str, selected = None) -> tuple:
    '''
    Get precision, recall and F1 for the task (one of 'ner', 'cr', and 'rex')
    and print the results. Precision and recall are None for 'cr'.
    '''
    # Avoid a mutable default argument; an empty list is passed on when nothing is selected
    selected = [] if selected is None else selected
    comparisons = []
    if task == 'ner':
        for article in predictions:
            comparisons.append(ner.calc_article_ner_metrics(article, annotations))
        precision, recall, f1 = ner.calc_corpus_ner_metrics(comparisons)
    elif task == 'cr':
        for article in predictions:
            comparisons.append(cr.calc_article_cr_metrics(article, annotations))
        # Only a corpus-level F1 is returned for coreference resolution
        precision, recall, f1 = None, None, cr.calc_corpus_cr_metrics(comparisons)
    elif task == 'rex':
        for article in predictions:
            comparisons.append(rex.calc_article_rex_metrics(article, annotations, selected = selected))
        precision, recall, f1 = rex.calc_corpus_rex_metrics(comparisons)
    else:
        raise ValueError(f"Unknown task {task!r}: expected 'ner', 'cr' or 'rex'")
    print(f'''precision: {round(precision, 5) if precision is not None else None}
recall : {round(recall, 5) if recall is not None else None}
macro f1: {round(f1, 5)}''')
    return (precision, recall, f1)

### NER

In [6]:
# spaCy results
evaluate_task(articles_test, ner_annotations_test, task = 'ner')

precision: 0.93791
recall : 0.93795
macro f1: 0.93697


(0.9379090717443128, 0.9379529940502028, 0.9369658106230876)

### CR

In [7]:
# Fastcoref results
evaluate_task(articles_test, cr_annotations_test, task = 'cr')

precision: None
recall : None
macro f1: 0.87199


(None, None, 0.8719948523074941)

### REX

To evaluate relation extraction, two preparatory steps are necessary:

1) Only include those relations pre-identified for inclusion
2) Populate inverse relations (where possible)
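The two steps above can be sketched on toy data as follows. The `Relation` class, the `INVERSES` mapping, the helper functions and the example triples are all hypothetical stand-ins; the notebook itself relies on the `kg_builder` objects and on `rex.populate_inverse_relations`, whose internals are not shown here and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical stand-in for a (head, relation_type, tail) triple."""
    head: str
    relation_type: str
    tail: str


# Hypothetical relation -> inverse relation lookup (illustrative labels only)
INVERSES = {
    'parent organization': 'subsidiary',
    'subsidiary': 'parent organization',
}
INCLUDED = set(INVERSES)


def filter_relations(relations: list, included: set) -> list:
    """Step 1: keep only relations whose type was preselected."""
    return [r for r in relations if r.relation_type in included]


def add_inverse_relations(relations: list, inverses: dict) -> list:
    """Step 2: append the inverse triple wherever an inverse type is known."""
    result = list(relations)
    seen = set(relations)
    for relation in relations:
        inverse_type = inverses.get(relation.relation_type)
        if inverse_type is None:
            continue  # no known inverse for this relation type
        inverse = Relation(relation.tail, inverse_type, relation.head)
        if inverse not in seen:
            seen.add(inverse)
            result.append(inverse)
    return result


predicted = [
    Relation('Acme Ltd', 'parent organization', 'Acme Holdings'),
    Relation('Acme Ltd', 'founded by', 'J. Doe'),  # not preselected, so dropped
]
kept = filter_relations(predicted, INCLUDED)
expanded = add_inverse_relations(kept, INVERSES)
```

Applying the same two steps to both the predictions and the annotations means a relation stated in either direction on one side can still be matched against its counterpart on the other.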

In [8]:
# rebel_flair_overview contains a summary of relations to be included
rebel_flair_overview, _, _, _, _ = get_wikidata_prepared_info('reference_info/wikidata_references.pkl')

# We only want to evaluate relations which have been preselected for inclusion
selected_overview = rebel_flair_overview[rebel_flair_overview['rebel description'].notna()]
included_relations = list(selected_overview['rebel description'])
# Also include the inverse of each preselected relation
included_relations += list(make_lookup_dict_from_df(selected_overview, 'rebel description', 'inverse description').values())
included_relations += ['alternate_name'] # additional Flair relation not in REBEL
included_relations = list(set(included_relations))

In [9]:
# Only include pre-identified relations
for article in articles_test:
    article.relations = [relation for relation in article.relations if relation.relation_type in included_relations]
for article in rex_annotations_test:
    article.relations = [relation for relation in article.relations if relation.relation_type in included_relations]

In [10]:
# Populate inverse relations
for article in articles_test:
    rex.populate_inverse_relations(article)
for article in rex_annotations_test:
    rex.populate_inverse_relations(article)

In [11]:
# REBEL results overall
evaluate_task(articles_test, rex_annotations_test, task = 'rex')

precision: 0.65184
recall : 0.48132
macro f1: 0.53236


(0.6518435211463601, 0.48131530377899195, 0.5323606594181386)