# What is it with spaCy and biblical names?
This notebook highlights an issue with spaCy and other Named Entity Recognition (NER) models: they often fail to detect person names, especially biblical ones. The gap in detection rates between regular names and biblical names is striking.
I tried to get to the bottom of this and believe I have an answer. But first, let's do a short experiment with two spaCy models (using spaCy version 3.0.5).

In [None]:
from typing import List
import itertools
import pprint

import spacy
from spacy.lang.en import English
import pandas as pd


spacy.__version__

## Compare detection rates of biblical vs. other names

Why would there be a difference in the first place?
Different detection rates could arise from:
1. Biblical names are sometimes older and less common, and might therefore be less frequent in the dataset the model was trained on.
2. The surrounding sentence is less likely to co-occur with the specific name.
3. Issues with the dataset itself (over- or under-representation, labeling errors and more).

To (simplistically) test hypotheses 1 and 2, we compare biblical names with both old and new names, using three templates:
- "My name is X"
- "And X said, Why hast thou troubled us?"
- "And she conceived again, a bare a son; and she called his name X."

Let's start by creating name lists and templates:

In [None]:
biblical_names = ["David", "Moses", "Abraham", "Samuel", "Jacob",
                  "Isaac", "Jesus", "Matthew",
                  "John", "Judas", "Simon", "Mary"]  # Random biblical names

# The third group holds popular (non-biblical) names in 1900
# (https://www.ssa.gov/oact/babynames/decades/names1900s.html)
other_names = ["Beyonce", "Ariana", "Katy",  # Singers
               "Michael", "Lebron", "Coby",  # NBA players
               "William", "Charles", "Robert", "Margaret", "Frank", "Helen",
               "Ronald", "George", "Bill", "Barack", "Donald", "Joe"  # Presidents
               ]

template1 = "My name is {}"
template2 = "And {} said, Why hast thou troubled us?"
template3 = "And she conceived again, a bare a son; and she called his name {}."

name_sets = {"Biblical": biblical_names, "Other": other_names}
templates = (template1, template2, template3)

A function for running the spaCy model and checking whether each name was tagged as "PERSON":

In [None]:
def names_recall(nlp: English, names: List[str], template: str) -> dict:
    """
    Run the spaCy NLP model on the template filled with each name,
    print the recall for detecting the "PERSON" entity
    and return the detailed per-name detections.
    :param nlp: spaCy nlp model
    :param names: list of names to run the model on
    :param template: sentence with a placeholder for the name (e.g. "He calls himself {}")
    :return: dict mapping each name to True if it was tagged as "PERSON"
    """
    results = {}
    for name in names:
        doc = nlp(template.format(name))
        # Take the first token matching the name
        # (assumes the tokenizer keeps the name as a single token)
        name_token = [token for token in doc if token.text == name][0]
        results[name] = name_token.ent_type_ == "PERSON"
    recall = sum(results.values()) / len(results)
    print(f"Recall: {recall:.2f}\n")
    return results

#### Model 1: spaCy's `en_core_web_lg` model

- This model uses the original (non-transformer-based) spaCy architecture.
- It was trained on the [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) dataset and reports a [0.86 F-measure on named entities](https://spacy.io/models/en#en_core_web_lg).

Load the model:

In [None]:
en_core_web_lg = spacy.load("en_core_web_lg")

In [None]:
detailed_results = {}
nlp = en_core_web_lg

print("Model name: en_core_web_lg")
for template, name_set in itertools.product(templates, name_sets.items()):
    print(f"Name set: {name_set[0]}, Template: \"{template}\"")
    results = names_recall(nlp, name_set[1], template)
    detailed_results[template, name_set[0]] = results

print("\nDetailed results:")
pprint.pprint(detailed_results)

So there's a pretty big difference between the detection of biblical names and the detection of other names.
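
The nested dictionary printed above is hard to scan. Here is a minimal sketch for condensing it, assuming `detailed_results` keeps the shape built in the loop above: keys are `(template, name set)` tuples and values map each name to a boolean. The helper names `recall_table` and `missed_names` are hypothetical, and the toy booleans below are made up purely to show the shape; they are not model output.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (template, name set), as in `detailed_results`


def recall_table(detailed: Dict[Key, Dict[str, bool]]) -> Dict[Key, float]:
    """Map each (template, name set) key to the fraction of names tagged PERSON."""
    return {key: sum(hits.values()) / len(hits) for key, hits in detailed.items()}


def missed_names(detailed: Dict[Key, Dict[str, bool]]) -> Dict[Key, List[str]]:
    """Map each (template, name set) key to the sorted names that were NOT detected."""
    return {
        key: sorted(name for name, hit in hits.items() if not hit)
        for key, hits in detailed.items()
    }


# Toy input in the same shape as `detailed_results` (made-up values, not model output)
toy = {
    ("My name is {}", "Biblical"): {"David": True, "Moses": False},
    ("My name is {}", "Other"): {"Katy": True, "Frank": True},
}

for (template, name_set), recall in recall_table(toy).items():
    print(f"{name_set:<10}{recall:>6.2f}  {template}")
```

Calling `recall_table(detailed_results)` and `missed_names(detailed_results)` after the cell above would give one recall figure and one list of misses per template and name set.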

#### Model 2: spaCy's `en_core_web_trf` model

spaCy recently released a new model, `en_core_web_trf`, which is based on the Hugging Face Transformers library and is also trained on OntoNotes 5.

Let's try this model:

In [None]:
nlp = spacy.load("en_core_web_trf")

In [None]:
detailed_results = {}
print("Model name: en_core_web_trf")
for template, name_set in itertools.product(templates, name_sets.items()):
    print(f"Name set: {name_set[0]}, Template: \"{template}\"")
    results = names_recall(nlp, name_set[1], template)
    detailed_results[template, name_set[0]] = results

print("Detailed results:")
pprint.pprint(detailed_results)


Although the numbers are different, we still see a gap between the two sets. This time, however, the model also seems to struggle with older names (like Helen, William or Charles).
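
To look at old names separately from the rest, the "Other" results can be broken down by the groups noted in the comments of the `other_names` list. This is a rough sketch: `OTHER_NAME_GROUPS` and `recall_by_group` are hypothetical names introduced here, and the toy booleans are invented to show the input shape, not taken from either model.

```python
from typing import Dict, List

# Groups copied from the comments in the `other_names` list
OTHER_NAME_GROUPS: Dict[str, List[str]] = {
    "Singers": ["Beyonce", "Ariana", "Katy"],
    "NBA players": ["Michael", "Lebron", "Coby"],
    "Popular in 1900": ["William", "Charles", "Robert", "Margaret", "Frank", "Helen"],
    "Presidents": ["Ronald", "George", "Bill", "Barack", "Donald", "Joe"],
}


def recall_by_group(results: Dict[str, bool],
                    groups: Dict[str, List[str]]) -> Dict[str, float]:
    """Compute recall per group, skipping groups with no names in `results`."""
    recalls = {}
    for group, names in groups.items():
        hits = [results[name] for name in names if name in results]
        if hits:
            recalls[group] = sum(hits) / len(hits)
    return recalls


# Toy per-name results in the shape returned by `names_recall` (made-up values)
toy_results = {"Katy": True, "Helen": False, "William": False,
               "Frank": True, "Barack": True}
print(recall_by_group(toy_results, OTHER_NAME_GROUPS))
```

With the real results, the input would be something like `detailed_results[template1, "Other"]`. Each group holds only a handful of names, so the per-group numbers are indicative at best.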

Let's double check our results on a few samples:

In [35]:
samples = [
    ("Simon", "My name is {}"),
    ("Katy", "My name is {}"),
    ("Moses", "This is what God said to {}"),
    ("Ronald", "This is what God said to {}"),
]

for name, sample_template in samples:
    doc = nlp(sample_template.format(name))
    print(f"Name = {name}. Detected entities: {doc.ents}")


Name = Simon. Detected entities: ()
Name = Katy. Detected entities: (Katy,)
Name = Moses. Detected entities: ()
Name = Ronald. Detected entities: (Ronald,)


### So what's going on here?

As part of our work on [Presidio](https://aka.ms/presidio) (a tool for data de-identification), we develop models to detect PII entities. For that purpose, [we extract template sentences](https://aka.ms/presidio-research) out of existing NER datasets, including CoNLL-03 and OntoNotes 5. The idea is to augment these datasets with additional entity values, for better coverage of names, cultures and ethnicities. In other words, every time we see a sentence with a tagged person name in a dataset, we extract a template sentence (e.g. `The name is [LAST_NAME], [FIRST_NAME] [LAST_NAME]`) and later generate multiple samples from it, each containing different first and last names.
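
The templating idea can be sketched in a few lines. This is an illustration under stated assumptions, not Presidio's actual implementation: the functions `to_template` and `fill_template`, the character-span input format, and the example sentence and names are all hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start char, end char, placeholder name)


def to_template(text: str, spans: List[Span]) -> str:
    """Replace each labeled character span with a [PLACEHOLDER] marker."""
    parts, cursor = [], 0
    for start, end, placeholder in sorted(spans):
        parts.append(text[cursor:start])   # text before the entity
        parts.append(f"[{placeholder}]")   # the entity becomes a placeholder
        cursor = end
    parts.append(text[cursor:])            # text after the last entity
    return "".join(parts)


def fill_template(template: str, values: Dict[str, str]) -> str:
    """Fill every occurrence of each placeholder with a new value."""
    for placeholder, value in values.items():
        template = template.replace(f"[{placeholder}]", value)
    return template


# Illustrative sentence with hand-written spans (not taken from any dataset)
text = "The name is Bond, James Bond"
spans = [(12, 16, "LAST_NAME"), (18, 23, "FIRST_NAME"), (24, 28, "LAST_NAME")]

template = to_template(text, spans)
print(template)
print(fill_template(template, {"FIRST_NAME": "Moses", "LAST_NAME": "Cohen"}))
```

Note that a name the dataset never labeled produces no span, so it stays in the template as literal text.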

When we manually reviewed the templating results, we found that many names in our new templates dataset had not been replaced with placeholders. A majority of these names came from the biblical sentences that OntoNotes 5 contains. In other words, many of the samples in OntoNotes 5 did not contain any PERSON labels, even though they did contain names, an entity type the OntoNotes dataset claims to support. It seems that these models actually learn the errors in the dataset, in this case to ignore names if they are biblical.
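
A rough way to surface such cases automatically is to scan the labeled data for known names that carry no entity tag. The sketch below assumes the dataset has been loaded as sentences of `(token, BIO tag)` pairs; that format, the function name `unlabeled_name_mentions`, and the toy sentences are assumptions made for illustration, not how we processed OntoNotes.

```python
from typing import Iterable, List, Set, Tuple

Sentence = List[Tuple[str, str]]  # (token, BIO tag), e.g. ("Moses", "B-PERSON")


def unlabeled_name_mentions(sentences: Iterable[Sentence],
                            known_names: Set[str]) -> List[Tuple[int, str]]:
    """Return (sentence index, token) for known names tagged as non-entities ("O")."""
    found = []
    for index, sentence in enumerate(sentences):
        for token, tag in sentence:
            if token in known_names and tag == "O":
                found.append((index, token))
    return found


# Toy labeled sentences (hand-written, not real dataset rows)
toy_sentences = [
    [("And", "O"), ("Moses", "O"), ("said", "O")],
    [("My", "O"), ("name", "O"), ("is", "O"), ("Katy", "B-PERSON")],
    [("Jacob", "O"), ("and", "O"), ("Mary", "B-PERSON")],
]
print(unlabeled_name_mentions(toy_sentences, {"Moses", "Jacob", "Mary", "Katy"}))
```

Hits from a check like this still need manual review, since a token that matches a name is not always a person mention.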

Obviously, these errors are found in both the train and test sets, so a model that learns that biblical names are not really names would also score well on a similar test set. This is yet another example of why [SOTA](https://paperswithcode.com/sota/named-entity-recognition-ner-on-ontonotes-v5) results are not necessarily the best way to show progress in science.

A quick check shows that the Flair NER model (which reports a higher F1 score, 89.3%, on this dataset) [suffers from a similar problem](https://huggingface.co/flair/ner-english-ontonotes?text=This+is+what+God+said+to+Moses).

## Conclusion

First, I'd like to say that this is by no means a complaint against the developers and contributors of spaCy. spaCy is one of the most exciting things happening in NLP today and is considered one of the most mature, accurate, fast and well-documented NLP libraries in the world. As the Flair example shows, this is an inherent problem in ML models and especially in ML datasets.


Three relevant pointers to conclude:

1. Andrew Ng recently argued that [the ML community should be more data-centric and less model-centric](https://analyticsindiamag.com/big-data-to-good-data-andrew-ng-urges-ml-community-to-be-more-data-centric-and-less-model-centric/). This post is another example of why this is true.
2. This is another example of an [issue with a major ML dataset](https://www.csail.mit.edu/news/major-ml-datasets-have-tens-thousands-errors).
3. A tool like [CheckList](https://github.com/marcotcr/checklist) is really helpful for validating that your model or data don't suffer from similar issues. Make sure you check it out.
