# Round 1: Initial model evaluation

For Round 1, the outputs of each model are evaluated against the human-in-the-loop (HITL) annotated gold-standard dataset. The models compared for each task are:

__NER:__

- spaCy en_core_web_trf (used as basis for annotation)
- flair/ner-english-ontonotes-large

__CR:__

- fastcoref (used as basis for annotation)
- LingMess

__REX:__

- Babelscape/rebel-large (used as basis for annotation)
- Flair (only alternate_name included in annotations)

The code below reads in each model's outputs and the HITL-annotated gold-standard datasets for each task, then compares them. The main comparison metric is the ___Macro F1___ score (that is, the average F1 score across all articles in the corpus). __Precision__ and __Recall__ are also shown where available.
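As a rough illustration of the macro F1 definition above, the sketch below averages per-article F1 scores. The match counts are made up for the example, and the helper `f1_score` is hypothetical: it is not the `kg_builder` implementation, whose matching rules and handling of empty articles may differ.

```python
from statistics import mean


def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 for a single article from its match counts (0.0 if there is nothing to score)."""
    denominator = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denominator if denominator else 0.0


# Hypothetical per-article counts: (true positives, false positives, false negatives)
article_counts = [(8, 2, 2), (5, 0, 5), (0, 3, 4)]

# Macro F1: score each article separately, then take the unweighted mean
per_article_f1 = [f1_score(*counts) for counts in article_counts]
macro_f1 = mean(per_article_f1)
print(round(macro_f1, 5))  # 0.48889
```

Because every article carries equal weight, a short article with few entities influences the score as much as a long one.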

## Import required libraries

In [1]:
import json
import time
import copy
import pandas as pd
pd.set_option('display.max_rows', 100)
from kg_builder import kg
from kg_builder import ner
from kg_builder import cr
from kg_builder import rex
from kg_builder import get_wikidata_prepared_info
from kg_builder import make_lookup_dict_from_df

## Import required data

Includes dataframe containing sample data, as well as model outputs and annotations to compare against.

In [2]:
# Import original sample data and get a list of the 10 articles designated to 'train'
df = pd.read_parquet('source_data/sample_text_30.pq')
train_ids = df.loc[df['Split'] == 'train', 'Id'].tolist()

In [3]:
# Load the Label Studio JSON exports: in each case the output is a list of Article instances
ner_annotations = ner.load_ner_from_label_studio('outputs/annotations/sample_ner_30_annotated.json', df, True)
ner_spacy = ner.load_ner_from_label_studio('outputs/round1/sample_ner_30_spacy.json', df, False)
ner_flair = ner.load_ner_from_label_studio('outputs/round1/sample_ner_30_flair.json', df, False)

cr_annotations = cr.load_cr_from_label_studio('outputs/annotations/sample_cr_30_annotated.json', df, True)
cr_fastcoref = cr.load_cr_from_label_studio('outputs/round1/sample_cr_30_fastcoref.json', df, False)
cr_lingmess = cr.load_cr_from_label_studio('outputs/round1/sample_cr_30_lingmess.json', df, False)

rex_annotations = rex.load_rex_from_label_studio('outputs/annotations/sample_re_30_annotated.json', df, True)
rex_rebel = rex.load_rex_from_label_studio('outputs/round1/sample_re_30_rebel.json', df, False)
rex_flair = rex.load_rex_from_label_studio('outputs/round1/sample_re_30_flair.json', df, False)

## Evaluations

In [4]:
def evaluate_task(predictions: list, annotations: list, task: str, selected: list = None) -> tuple:
    '''
    Get precision, recall and macro F1 for the task (one of 'ner', 'cr' and 'rex')
    and print the results. Precision and recall are None for 'cr', where only
    F1 is calculated. `selected` is passed to the 'rex' metrics to limit the
    comparison to the given relations.
    '''
    # Avoid a mutable default argument; an empty list is passed on as before
    selected = [] if selected is None else selected
    comparisons = []
    if task == 'ner':
        for article in predictions:
            comparisons.append(ner.calc_article_ner_metrics(article, annotations))
        precision, recall, f1 = ner.calc_corpus_ner_metrics(comparisons)
    elif task == 'cr':
        for article in predictions:
            comparisons.append(cr.calc_article_cr_metrics(article, annotations))
        precision, recall, f1 = None, None, cr.calc_corpus_cr_metrics(comparisons)
    elif task == 'rex':
        for article in predictions:
            comparisons.append(rex.calc_article_rex_metrics(article, annotations, selected = selected))
        precision, recall, f1 = rex.calc_corpus_rex_metrics(comparisons)
    else:
        # Fail clearly instead of hitting a NameError in the print below
        raise ValueError(f"task must be one of 'ner', 'cr' or 'rex', got {task!r}")
    print(f'''precision: {round(precision, 5) if precision is not None else None}
recall : {round(recall, 5) if recall is not None else None}
macro f1: {round(f1, 5)}''')
    return (precision, recall, f1)

### NER

spaCy performs marginally better than Flair (macro F1 of 0.916 vs 0.911). The metrics for just the test set of 20 articles are also noted below, as these will be used to evaluate whether changes in future rounds improve performance.

#### spaCy full results

In [5]:
# spaCy results
evaluate_task(ner_spacy, ner_annotations, task = 'ner')

precision: 0.91757
recall : 0.91592
macro f1: 0.91592


(0.9175712334967081, 0.9159155020316009, 0.9159188595938839)

#### Flair full results

In [6]:
# Flair results
evaluate_task(ner_flair, ner_annotations, task = 'ner')

precision: 0.90782
recall : 0.91421
macro f1: 0.91078


(0.907819118970955, 0.914214816091885, 0.9107843402218926)

#### spaCy test set results only

In [7]:
ner_spacy_test = [article for article in ner_spacy if article.article_id not in train_ids]
ner_annotations_test = [article for article in ner_annotations if article.article_id not in train_ids]
evaluate_task(ner_spacy_test, ner_annotations_test, task = 'ner')

precision: 0.92177
recall : 0.92565
macro f1: 0.92278


(0.9217686585090373, 0.925649177622239, 0.9227833400350385)

### CR

fastcoref performs better than LingMess (macro F1 of 0.717 vs 0.687). Only F1 is reported for this task. The metrics for just the test set of 20 articles are also noted below, as these will be used to evaluate whether changes in future rounds improve performance.

#### fastcoref full results

In [8]:
# fastcoref results
evaluate_task(cr_fastcoref, cr_annotations, task = 'cr')

precision: None
recall : None
macro f1: 0.71688


(None, None, 0.7168828573284605)

#### LingMess full results

In [9]:
# LingMess results
evaluate_task(cr_lingmess, cr_annotations, task = 'cr')

precision: None
recall : None
macro f1: 0.68682


(None, None, 0.6868215051393629)

#### fastcoref test set results only

In [10]:
cr_fastcoref_test = [article for article in cr_fastcoref if article.article_id not in train_ids]
cr_annotations_test = [article for article in cr_annotations if article.article_id not in train_ids]
evaluate_task(cr_fastcoref_test, cr_annotations_test, task = 'cr')

precision: None
recall : None
macro f1: 0.73203


(None, None, 0.732025492823859)

### REX

To evaluate relation extraction a few preparatory steps are necessary:

1) Transform Flair relations to REBEL terminology (except for alternate_name which is unique to Flair)
2) Only include those relations pre-identified for inclusion (those that were deemed useful for the KG after review)
3) Populate inverse relations (where possible)
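
To make step 3 concrete, here is a minimal sketch of populating inverse relations over plain `(head, relation, tail)` triples. The `INVERSE` mapping, the entity names and the `with_inverses` helper are all hypothetical and chosen for illustration only; the actual `rex.populate_inverse_relations` works on Article instances and may behave differently.

```python
# Hypothetical relation -> inverse mapping (not taken from the project's reference data)
INVERSE = {
    "parent organization": "subsidiary",
    "subsidiary": "parent organization",
}


def with_inverses(triples: set) -> set:
    """Return the triples plus the inverse of every triple whose relation has one."""
    inverses = {
        (tail, INVERSE[relation], head)
        for head, relation, tail in triples
        if relation in INVERSE
    }
    return triples | inverses


# Made-up example triples
triples = {
    ("Acme Ltd", "parent organization", "Acme Holdings"),
    ("Acme Ltd", "founded by", "Jane Doe"),  # no inverse defined, so left as-is
}
expanded = with_inverses(triples)
print(len(expanded))  # 3
```

Applying the same expansion to predictions and annotations means a model is not penalised for stating a relation in the opposite direction to the annotator.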

In [11]:
# rebel_flair_overview contains a summary of relations to be included
rebel_flair_overview, _, _, _, _ = get_wikidata_prepared_info('reference_info/wikidata_references.pkl')

# We only want to evaluate relations which have been preselected for inclusion
included_relations = list(rebel_flair_overview.loc[rebel_flair_overview['rebel description'].notna(), 'rebel description'])
included_relations += list(make_lookup_dict_from_df(rebel_flair_overview[rebel_flair_overview['rebel description'].notna()], 'rebel description', 'inverse description').values())
included_relations += ['alternate_name'] # additional Flair relation not in REBEL
included_relations = list(set(included_relations))

# To make a like-for-like comparison we want to compare performance of just the 
# relations shared by the models
shared_relations = list(rebel_flair_overview.loc[rebel_flair_overview['wikidata description mapping'].notna(), 'wikidata description mapping'])
shared_relations += list(make_lookup_dict_from_df(rebel_flair_overview[rebel_flair_overview['wikidata description mapping'].notna()], 'wikidata description mapping', 'inverse description').values())
shared_relations = list(set(shared_relations))

In [12]:
# Transform flair relations to rebel terminology
for article in rex_flair:
    rex.flair_to_rebel(article)

In [13]:
# Only include pre-identified relations
for article in rex_flair:
    article.relations = [relation for relation in article.relations if relation.relation_type in included_relations]
for article in rex_rebel:
    article.relations = [relation for relation in article.relations if relation.relation_type in included_relations]
for article in rex_annotations:
    article.relations = [relation for relation in article.relations if relation.relation_type in included_relations]

In [14]:
# Populate inverse relations
for article in rex_flair:
    rex.populate_inverse_relations(article)
for article in rex_rebel:
    rex.populate_inverse_relations(article)
for article in rex_annotations:
    rex.populate_inverse_relations(article)

#### Flair vs REBEL shared results

Results for the relations shared by Flair and REBEL. This is the fairest test of performance, since REBEL inherently covers many more relations than Flair does:
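
The sketch below illustrates why restricting the comparison to shared relations changes the scores. It assumes that `selected` filters both predictions and annotations by relation type, and that an empty list means "keep everything"; that is an assumption about `rex.calc_article_rex_metrics`, not something confirmed here. The triples, relation names and helper functions are hypothetical.

```python
def restrict_to(triples: set, selected: list) -> set:
    """Keep only triples whose relation is in `selected`; keep all if `selected` is empty."""
    if not selected:
        return set(triples)
    return {triple for triple in triples if triple[1] in selected}


def precision_recall(predicted: set, gold: set) -> tuple:
    """Exact-match precision and recall over sets of (head, relation, tail) triples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up annotations and predictions for one article
gold = {("A", "employer", "B"), ("A", "place of birth", "C"), ("B", "headquarters location", "D")}
predicted = {("A", "employer", "B"), ("B", "headquarters location", "E")}
shared = ["employer"]  # hypothetical set of relations both models can produce

overall = precision_recall(predicted, gold)
shared_only = precision_recall(restrict_to(predicted, shared), restrict_to(gold, shared))
print(overall, shared_only)
```

In this toy case the model is no longer penalised for gold relations it could never have produced, so recall rises once the comparison is restricted.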

In [15]:
# Flair shared results
evaluate_task(rex_flair, rex_annotations, task = 'rex', selected = shared_relations)

precision: 0.37202
recall : 0.1825
macro f1: 0.22977


(0.37201890701890694, 0.18250212271763996, 0.2297716468646836)

In [16]:
# REBEL shared results
evaluate_task(rex_rebel, rex_annotations, task = 'rex', selected = shared_relations)

precision: 0.67765
recall : 0.41647
macro f1: 0.49206


(0.6776521735345263, 0.4164660132332546, 0.4920582906818482)

#### Flair vs REBEL overall results

Results when comparing all relations against the annotated dataset:

In [17]:
# Flair results overall
evaluate_task(rex_flair, rex_annotations, task = 'rex')

precision: 0.34059
recall : 0.14962
macro f1: 0.19685


(0.3405930893961585, 0.14962041768646236, 0.1968470159427254)

In [18]:
# REBEL results overall
evaluate_task(rex_rebel, rex_annotations, task = 'rex')

precision: 0.64396
recall : 0.49399
macro f1: 0.53658


(0.6439566173690335, 0.4939900539374412, 0.5365831581871705)

#### REBEL test results only

In [19]:
rex_rebel_test = [article for article in rex_rebel if article.article_id not in train_ids]
rex_annotations_test = [article for article in rex_annotations if article.article_id not in train_ids]
evaluate_task(rex_rebel_test, rex_annotations_test, task = 'rex')

precision: 0.65823
recall : 0.4664
macro f1: 0.5255


(0.658230319617822, 0.46639740004529884, 0.525495109705935)