## Evaluation of spaCy 2.2.4 models by Stefan

There are 7 trained models, 6 of which have evaluation data in their folder.

This evaluation data exists in these formats:
- txt (1x: 2019-12-03)
- pickle (4x: 2020-01-02 - 2020-04-16)
- json (1x: 2020-04-30)

For the `txt` and `pickle` files, parsing functions exist in several places in this repo. Importing
them from their modules, however, executes those modules' top-level code, which crashes because
much of the required context is missing. To avoid changing an undocumented and sprawling codebase
while still reusing its code and context, the relevant code was copied or imported here. For the
`json` file, no existing parsing function was found, so one was implemented.
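
As an alternative to copying, a single function can be pulled out of a module's source without running the rest of that module. The following is a minimal sketch and not what this notebook does; `load_function_only` is a hypothetical helper, and it only works for functions that do not depend on module-level imports or globals.

```python
import ast


def load_function_only(source, name):
    """Compile and return one top-level function from module source.

    No other top-level statement of the module is executed.
    """
    tree = ast.parse(source)
    # keep only the requested function definition, drop everything else
    tree.body = [
        node for node in tree.body
        if isinstance(node, ast.FunctionDef) and node.name == name
    ]
    if not tree.body:
        raise LookupError(f"no top-level function named {name!r}")
    namespace = {}
    exec(compile(tree, "<extracted>", "exec"), namespace)
    return namespace[name]


# hypothetical module source whose top-level code would crash on import
module_source = (
    "raise RuntimeError('module-level code ran')\n"
    "\n"
    "def double(x):\n"
    "    return 2 * x\n"
)
double = load_function_only(module_source, "double")
```

Copying the code, as done here, remains the simpler choice when the function relies on context defined elsewhere in its module.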

There were also existing evaluation functions compatible with the pickle files of the models
2020-01-02 to 2020-04-16; these were copied here and reused. For the other two models, custom
evaluation logic was implemented: spaCy's evaluation function on the `txt` file and a custom one on
the `json` file, as that data was in a shape incompatible with spaCy's function.

The commit of the remaining codebase from which code was copied or imported is
`8e75d3561e617f1bd135d4c06fbb982285f6f544`.

## Evaluation summary

- 2019-12-03
  - spaCy's evaluation: **p: 53.84, r: 51.28**
  - evaluation file: `ner_apis_2019-12-03_23:32:24/corpus/evalset.txt`
  - evaluation data size: 3042 sentences, 1657 assigned tags
  - NER tags: 'LOC', 'MISC', 'ORG', 'PER'
- 2020-01-02
  - EM (manual evaluation): **p: 76.86, r: 38.92**
  - E0 (spaCy's evaluation, with 'ner' pipeline only): **p: 61.78, r: 47.92**
  - E1 (spaCy's evaluation, with all pipelines): throws errors (the model has no 'tagger' pipeline)
  - evaluation file: `ner_apis_2020-01-02_12:34:48/corpus/evalset.pickle`
  - evaluation data size: 2474 sentences, 1647 assigned tags
  - NER tags: 'LOC', 'ORG'
- 2020-01-29
  - EM (manual evaluation):  **p: 72.04, r: 74.83**
  - E0 (spaCy's evaluation, with 'ner' pipeline only): **p: 76.41, r: 71.47**
  - E1 (spaCy's evaluation, with all pipelines): throws errors (the model has no 'tagger' pipeline)
  - evaluation file: `ner_apis_2020-01-29_13:19:53/corpus/evalset.pickle`
  - evaluation data size: 832 sentences, 1557 assigned tags
  - NER tags: 'LOC', 'MISC', 'ORG', 'PER'
- 2020-04-07
  - EM (manual evaluation): **p: 72.44, r: 82.33**
  - E0 (spaCy's evaluation, with 'ner' pipeline only): **p: 77.70, r: 81.18**
  - E1 (spaCy's evaluation, with all pipelines): **p: 75.62, r: 77.63**
  - evaluation file: `ner_apis_2020-04-07_15:00:35/corpus/evalset.pickle`
  - evaluation data size: 861 sentences, 1804 assigned tags
  - NER tags: 'LOC', 'MISC', 'ORG', 'PER'
- 2020-04-16
  - EM (manual evaluation): **p: 53.37, r: 49.37**
  - E0 (spaCy's evaluation, with 'ner' pipeline only): **p: 54.81, r: 43.24**
  - E1 (spaCy's evaluation, with all pipelines): **p: 54.18, r: 41.84**
  - evaluation file: `ner_apis_2020-04-16_14:21:46/corpus/evalset.pickle`
  - evaluation data size: 866 sentences, 1878 assigned tags
  - NER tags: 'LOC', 'MISC', 'ORG', 'PER'
- 2020-04-30
  - manual evaluation: **p: 81.76, r: 23.59**
  - evaluation file: `ner_apis_2020-04-30_11:24:09/corpus/evalset.json`
  - evaluation data size: 904 sentences, 3144 assigned tags (counted per token in BILOU tagging, so multi-token entities are counted several times)
  - NER tags: 'LOC', 'MISC', 'ORG', 'PER'
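
For reference, the precision (p) and recall (r) values above follow the usual definitions over true positives, false positives and false negatives. A minimal sketch, reporting percentages like spaCy's scorer; the helper name `precision_recall` and the choice to return 0.0 for an empty denominator are assumptions, not taken from the codebase:

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) in percent from raw counts."""
    p = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    r = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    return p, r


# illustrative counts only, not taken from any of the models above
p, r = precision_recall(tp=3, fp=1, fn=3)
print(f"p: {p:.2f}, r: {r:.2f}")
```

The guard on the denominators matters for models that predict no entities at all, where a plain division would raise `ZeroDivisionError`.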

## How to reproduce

I don't know the exact versions of the Python interpreter and packages that were used to build this
codebase, but I got the model loading and this evaluation running with:

- Python 3.6.9 (needed for unpickling some of the evaluation data sets; 3.8 crashed)
- poetry 1.3.2

together with the `pyproject.toml` and `poetry.lock` in this folder. To run this, install poetry,
change into this folder and run `poetry install`.
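
Because the interpreter version matters for unpickling, a small guard at the top of the notebook can flag a mismatch early. This is a sketch; `python_matches` and `REQUIRED_PYTHON` are hypothetical names, and the required version is simply the 3.6 series reported to work above.

```python
import sys

# 3.6.9 worked for this evaluation, 3.8 crashed while unpickling
REQUIRED_PYTHON = (3, 6)


def python_matches(version_info=None, required=REQUIRED_PYTHON):
    """Return True if the major.minor interpreter version equals `required`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(required)


if not python_matches():
    found = ".".join(str(part) for part in sys.version_info[:3])
    print(f"warning: running Python {found}; unpickling the evaluation data may fail")
```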

In [1]:
import spacy
import json
# __sresch__ these imports are required; without them the unpickling of metadata crashes in the
# function NERer.load_metadata. No idea why.
from ner.model_ner import ModelType, TrainingStyle
import ner.model_ner


# loading and discarding a stock model first is necessary to avoid a strange E050 error
# related: https://github.com/explosion/spaCy/issues/3552
nlp = spacy.load('de_core_news_md')
del nlp

ModuleNotFoundError: No module named 'ner.corpus_dummy'

In [2]:
# evaluate model 2019_12_03


# __sresch__ this function is copied without modifications from 'NER Place Institution.ipynb'.
# Since it lives in a Jupyter notebook, it could not easily be imported without third-party
# tooling, and even if the import were made possible, it would execute the whole other notebook.
def read_data_from_txt(mypath):
    mydata = []
    with open(mypath, "r") as f:
        lines = f.readlines()
        i = 0
        while i < len(lines):
            while lines[i].isspace():
                i += 1
            # we found a non-empty line to use for t
            t = lines[i].strip()
            i += 1
            while lines[i].isspace():
                i += 1
            # we found a non-empty line to use for e if possible, else for t
            e = None
            while e == None:
                try:
                    e = eval(lines[i])
                except SyntaxError:
                    t += lines[i].strip()
                    i += 1
                    while lines[i].isspace():
                        i += 1
                    # we found a non-empty line to try to use for e again
            i += 1
            mydata.append( (t, e, None, None) )
    return mydata

# __sresch__ custom code, utilizing the given txt reader function above made for that data shape
model_dir = "ner_apis_2019-12-03_23:32:24"
nlp = spacy.load(f"../{model_dir}/nlp")
print(f"NER tags: {nlp.get_pipe('ner').labels}")
eval_data = read_data_from_txt(f"../{model_dir}/corpus/evalset.txt")
print(f"number of tags in evaluation data: {sum([len(e[1]['entities']) for e in eval_data])}")
print(f"number of sentences in evaluation data: {len(eval_data)}")
eval_data_spacy = [(s,e) for s, e, _, _ in eval_data]
scorer = nlp.evaluate(eval_data_spacy, verbose=False)
print(f"p: {scorer.ents_p}, r: {scorer.ents_r}")

NER tags: ('LOC', 'MISC', 'ORG', 'PER')
number of tags in evaluation data: 1657
number of sentences in evaluation data: 3042
p: 53.84615384615385, r: 51.28712871287129
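
`read_data_from_txt` above relies on `eval`, which executes whatever the file contains, and it indexes `lines[i]` without bounds checks. A more defensive reader could look roughly like this. It is a sketch under the assumption that each record consists of one or more text lines followed by one line holding a Python dict literal (such as `{'entities': [...]}`); `parse_txt_records` is a hypothetical name, and it returns `(text, annotations)` pairs rather than the 4-tuples of the original.

```python
import ast


def parse_txt_records(lines):
    """Parse text lines followed by a dict-literal line into (text, dict) pairs."""
    records = []
    text_parts = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank lines only separate the parts of a record
        try:
            # literal_eval accepts Python literals only and never runs code
            value = ast.literal_eval(stripped)
        except (ValueError, SyntaxError, TypeError):
            value = None
        if isinstance(value, dict) and text_parts:
            # like the original, continuation lines are joined without a separator
            records.append(("".join(text_parts), value))
            text_parts = []
        else:
            text_parts.append(stripped)
    return records


# made-up sample in the assumed format, not taken from evalset.txt
sample_lines = [
    "Wien ist gross.\n",
    "\n",
    "{'entities': [(0, 4, 'LOC')]}\n",
    "\n",
    "Erste Zeile \n",
    "zweite Zeile\n",
    "{'entities': []}\n",
]
sample_records = parse_txt_records(sample_lines)
```

The resulting pairs have the same shape as `eval_data_spacy` in the cell above.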


In [3]:
# evaluate models 2020-01-02, 2020-01-29, 2020-04-07, 2020-04-16


# __sresch__ copied into here from evaluate_model.py as it could not be imported without causing
# execution of the imported modules.
def evaluate_without_tokenization_mismatches(nerer, pipes_to_disable):
    print("The abbreviations files during data extraction and model training did not match. " +
        "I will now remove the datapoints from the evaluation set whose tokenization differs.")
    with nerer.nlp.disable_pipes(*pipes_to_disable):
        sent_doc_gp = [ (d.sentence, nerer.nlp(d.sentence), d.goldparse) for d in nerer.evaluation_data ]
    num_prob = 0
    ok = []
    for s,d,g in sent_doc_gp:
        if len(d) != len(g):
            num_prob += 1
        else:
            ok.append( (s,g) )
    print(f"I remove {num_prob} datapoints from the evaluation data.")
    print(f"Trying the evaluation again ...")
    # copy here what nerer.evaluate() does:
    with nerer.nlp.disable_pipes(*pipes_to_disable):
        scorer = nerer.nlp.evaluate(ok, verbose=False)
        nerer.scorer = scorer
        nerer.scores = scorer.scores

# __sresch__ custom data structure to avoid the hardwired one in evaluate_models.py
evaluations = [
    {
        "model_dir": "ner_apis_2020-01-02_12:34:48",
        "run_eval_manual": True,
        "run_eval_0": True,
        # run_eval_1 is disabled: it would crash because the model has no tagger pipeline
        "run_eval_1": False,
    },
    {
        "model_dir": "ner_apis_2020-01-29_13:19:53",
        "run_eval_manual": True,
        "run_eval_0": True,
        # run_eval_1 is disabled: it would crash because the model has no tagger pipeline
        "run_eval_1": False,
    },
    {
        "model_dir": "ner_apis_2020-04-07_15:00:35",
        "run_eval_manual": True,
        "run_eval_0": True,
        "run_eval_1": True,
    },
    {
        "model_dir": "ner_apis_2020-04-16_14:21:46",
        "run_eval_manual": True,
        "run_eval_0": True,
        "run_eval_1": True,
    },
]
# __sresch__ taken from evaluate_model.py, adapted only in minor parts
for e in evaluations:
    # ============= LOAD MODEL

    print(f'Loading model at {e["model_dir"]} ...')
    nerer = ner.model_ner.NERer.from_saved("../" + e['model_dir'], load_training_data=False)
    print("Finished loading model.")
    print(f"NER tags: {nerer.nlp.get_pipe('ner').labels}")
    print(f"number of tags in evaluation data: {sum(len(d.entities) for d in nerer.evaluation_data)}")
    print(f"number of sentences in evaluation data: {len(nerer.evaluation_data)}")

    # ================= MANUAL EVALUATION

    if e["run_eval_manual"]:
        print("Starting EM (manual evaluation) ...")
        nerer.evaluate_manually()
        e['EM'] = nerer.scores_manual
        print(f"EM: ents_p={nerer.scores_manual.p()}, ents_r={nerer.scores_manual.r()}")

    # ================= SPACY'S EVALUATION

    if e["run_eval_0"]:
        print("Running spacy's evaluation (E0) with (string, GoldParse) as input over only the 'ner' pipe ...")
        assert nerer.nlp.has_pipe('ner')
        pipes_to_disable = []
        if nerer.nlp.has_pipe('tagger'):
            pipes_to_disable.append('tagger')
        if nerer.nlp.has_pipe('parser'):
            pipes_to_disable.append('parser')
        try:
            nerer.evaluate(pipes_to_disable=pipes_to_disable)
        except ValueError:
            evaluate_without_tokenization_mismatches(nerer, pipes_to_disable)
        scorer0 = nerer.scorer
        e['E0'] = scorer0
        print(f"E0: ents_p={scorer0.ents_p}, ents_r={scorer0.ents_r}")

    if e["run_eval_1"]:
        print("Running spacy's evaluation (E1) with (string, GoldParse) as input over the pipes 'tagger', 'parser', and 'ner' ...")
        assert nerer.nlp.has_pipe('tagger')
        assert nerer.nlp.has_pipe('parser')
        assert nerer.nlp.has_pipe('ner')
        pipes_to_disable = []
        try:
            nerer.evaluate(pipes_to_disable=pipes_to_disable)
        except ValueError:
            evaluate_without_tokenization_mismatches(nerer, pipes_to_disable)
        scorer1 = nerer.scorer
        e['E1'] = scorer1
        print(f"E1: ents_p={scorer1.ents_p}, ents_r={scorer1.ents_r}")

Loading model at ner_apis_2020-01-02_12:34:48 ...


  dict = pickle.load(open(path, "rb"))
  data_without_goldparse = pickle.load(open(path, "rb"))


Finished loading model.
NER tags: ('LOC', 'ORG')
number of tags in evaluation data: 1647
number of sentences in evaluation data: 2474
Starting EM (manual evaluation) ...
EM: ents_p=76.86472819216182, ents_r=38.92445582586428
Running spacy's evaluation (E0) with (string, GoldParse) as input over only the 'ner' pipe ...
The abbreviations files during data extraction and model training did not match. I will now remove the datapoints from the evaluation set whose tokenization differs.
I remove 1153 datapoints from the evaluation data.
Trying the evaluation again ...
E0: ents_p=61.78571428571429, ents_r=47.92243767313019
Loading model at ner_apis_2020-01-29_13:19:53 ...


  dict = pickle.load(open(path, "rb"))
  data_without_goldparse = pickle.load(open(path, "rb"))


Finished loading model.
NER tags: ('LOC', 'MISC', 'ORG', 'PER')
number of tags in evaluation data: 1557
number of sentences in evaluation data: 832
Starting EM (manual evaluation) ...
EM: ents_p=72.04152249134948, ents_r=74.83824586628324
Running spacy's evaluation (E0) with (string, GoldParse) as input over only the 'ner' pipe ...
The abbreviations files during data extraction and model training did not match. I will now remove the datapoints from the evaluation set whose tokenization differs.
I remove 515 datapoints from the evaluation data.
Trying the evaluation again ...
E0: ents_p=76.41509433962264, ents_r=71.47058823529412
Loading model at ner_apis_2020-04-07_15:00:35 ...


  dict = pickle.load(open(path, "rb"))
  data_without_goldparse = pickle.load(open(path, "rb"))


Finished loading model.
NER tags: ('LOC', 'MISC', 'ORG', 'PER')
number of tags in evaluation data: 1804
number of sentences in evaluation data: 861
Starting EM (manual evaluation) ...
EM: ents_p=72.44565217391305, ents_r=82.33477455219271
Running spacy's evaluation (E0) with (string, GoldParse) as input over only the 'ner' pipe ...
The abbreviations files during data extraction and model training did not match. I will now remove the datapoints from the evaluation set whose tokenization differs.
I remove 66 datapoints from the evaluation data.
Trying the evaluation again ...
E0: ents_p=77.705112960761, ents_r=81.18012422360248
Running spacy's evaluation (E1) with (string, GoldParse) as input over the pipes 'tagger', 'parser', and 'ner' ...
The abbreviations files during data extraction and model training did not match. I will now remove the datapoints from the evaluation set whose tokenization differs.
I remove 66 datapoints from the evaluation data.
Trying the evaluation again ...
E1: 

  dict = pickle.load(open(path, "rb"))
  data_without_goldparse = pickle.load(open(path, "rb"))


Finished loading model.
NER tags: ('LOC', 'MISC', 'ORG', 'PER')
number of tags in evaluation data: 1878
number of sentences in evaluation data: 866
Starting EM (manual evaluation) ...
EM: ents_p=53.37837837837838, ents_r=49.375
Running spacy's evaluation (E0) with (string, GoldParse) as input over only the 'ner' pipe ...
E0: ents_p=54.81283422459893, ents_r=43.24894514767932
Running spacy's evaluation (E1) with (string, GoldParse) as input over the pipes 'tagger', 'parser', and 'ner' ...
E1: ents_p=54.189435336976324, ents_r=41.842475386779185
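
Note that whenever the tokenization filter kicks in, E0 and E1 are computed only on the datapoints that survive it, so they presumably cover a different subset than EM. The share of removed datapoints can be derived from the counts printed above; `removed_share` is a hypothetical helper:

```python
def removed_share(removed, total):
    """Percentage of evaluation datapoints dropped by the tokenization filter."""
    return 100.0 * removed / total


# (removed, total) pairs as printed in the output above
runs = {
    "2020-01-02": (1153, 2474),
    "2020-01-29": (515, 832),
    "2020-04-07": (66, 861),
}
for model, (removed, total) in runs.items():
    print(f"{model}: {removed_share(removed, total):.1f}% of datapoints removed")
```

For 2020-04-16 the output shows no filtering step, so its scores refer to the full evaluation set.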


In [4]:
# evaluate model 2020_04_30


# __sresch__ custom code to handle the shape of this evaluation data. The data and its tags are
# persisted as lists of tokens, so gold tokens have to be compared against predicted tokens, which
# (to my knowledge) prevents the use of spaCy's evaluate function. Precision and recall are
# therefore calculated manually.
model_dir = "ner_apis_2020-04-30_11:24:09"
nlp = spacy.load(f"../{model_dir}/nlp")
ner_valid_list = nlp.get_pipe("ner").labels
print(f"NER tags: {ner_valid_list}")
count_tp = 0
count_fp = 0
count_fn = 0
count_total = 0
count_sentences = 0
count_tags = 0
with open(f"../{model_dir}/corpus/evalset.json", "r", encoding="utf-8") as f:
    eval_data = json.load(f)["paragraphs"]
    for p in eval_data:
        token_pred_list = nlp(p["raw"])
        token_real_list = []
        for s in p["sentences"]:
            for t in s["tokens"]:
                token_real_list.append(t)
                if t["ner"] != "O":
                    count_tags += 1
            count_sentences += 1
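        # tokens are paired by position; pairs whose text differs (tokenization mismatch)
        # are silently skipped
        for token_pred, token_real in zip(token_pred_list, token_real_list):
            if token_pred.orth_ == token_real["orth"]: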
                ner_pred = token_pred.ent_type_
                ner_real = token_real["ner"]
                if ner_pred != "" or ner_real != "O":
                    count_total += 1
                    pred_is_correct = False
                    for ner_valid in ner_valid_list:
                        if ner_valid in ner_pred and ner_valid in ner_real:
                            pred_is_correct = True
                            count_tp += 1
                            break
                    if not pred_is_correct:
                        if ner_pred != "":
                            count_fp += 1
                        elif ner_pred == "" and ner_real != "O":
                            count_fn += 1

print(f"number of tags in evaluation data: {count_tags}")
print(f"number of sentences in evaluation data: {count_sentences}")
if count_tp + count_fp + count_fn != count_total:
    raise Exception("sanity check failed: tp + fp + fn does not equal the number of counted tokens")

# note: p and r are fractions in [0, 1] here, whereas spaCy's scorer reports percentages
p = count_tp / (count_tp + count_fp)
r = count_tp / (count_tp + count_fn)
print(f"p: {p}, r: {r}")

NER tags: ('LOC', 'MISC', 'ORG', 'PER')
number of tags in evaluation data: 3144
number of sentences in evaluation data: 904
p: 0.8176733780760627, r: 0.23595868302130407
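
The counting rules of the cell above can be isolated into small functions, which makes them easier to check. This is a sketch under assumptions: gold tags are BILOU strings such as `U-LOC` or `O`, predictions are plain labels as returned by `token.ent_type_`, and labels are compared for equality rather than by substring. `strip_bilou` and `count_token_matches` are hypothetical names. As in the original cell, a token predicted with the wrong label counts as a false positive only.

```python
def strip_bilou(tag):
    """Return the entity label of a BILOU tag, or "" for tokens outside entities."""
    if tag in ("O", ""):
        return ""
    prefix, sep, label = tag.partition("-")
    if sep and prefix in ("B", "I", "L", "U"):
        return label
    return tag  # no recognizable BILOU prefix, keep the tag unchanged


def count_token_matches(pairs):
    """Count (tp, fp, fn) over (predicted_label, gold_tag) token pairs."""
    tp = fp = fn = 0
    for pred_label, gold_tag in pairs:
        gold_label = strip_bilou(gold_tag)
        if not pred_label and not gold_label:
            continue  # neither side marks an entity
        if pred_label == gold_label:
            tp += 1
        elif pred_label:
            fp += 1  # spurious or wrongly labeled prediction
        else:
            fn += 1  # gold entity token that was not predicted
    return tp, fp, fn


# made-up token pairs for illustration, not taken from evalset.json
example_pairs = [
    ("LOC", "U-LOC"),
    ("", "O"),
    ("PER", "B-ORG"),
    ("", "I-ORG"),
    ("MISC", "O"),
]
example_counts = count_token_matches(example_pairs)
```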
