## Good Data vs Bad Data
Here’s an excerpt from a training set that labels the entity type TOURIST_DESTINATION in traveler reviews.

In [None]:
TRAINING_DATA = [
    (
        "i went to amsterdem last year and the canals were beautiful",
        {"entities": [(10, 19, "TOURIST_DESTINATION")]},
    ),
    (
        "You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
        {"entities": [(17, 22, "TOURIST_DESTINATION")]},
    ),
    ("There's also a Paris in Arkansas, lol", {"entities": []}),
    (
        "Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!",
        {"entities": [(0, 6, "TOURIST_DESTINATION")]},
    ),
]

Whether a place is a tourist destination is a subjective judgement, not a definitive category, so the label will be very difficult for the entity recognizer to learn.

A much better approach would be to label only "GPE" (geopolitical entity) or "LOCATION" and then use a rule-based system to determine whether the entity is a tourist destination in this context. For example, you could resolve the entities back to a knowledge base or look them up in a travel wiki.
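The rule-based step might look like the following minimal sketch. It assumes the entity recognizer has already produced "GPE" spans, and it uses a tiny hand-written set as a stand-in for a real knowledge base or travel wiki lookup. The names `KNOWN_DESTINATIONS` and `is_tourist_destination` are hypothetical and exist only for illustration.

```python
# Hypothetical stand-in for a knowledge base or travel wiki lookup.
KNOWN_DESTINATIONS = {"amsterdam", "paris", "berlin"}


def is_tourist_destination(text, start, end):
    """Return True if the span text[start:end] names a known destination."""
    return text[start:end].lower() in KNOWN_DESTINATIONS


text = "You should visit Paris once in your life, but the Eiffel Tower is kinda boring"
gpe_spans = [(17, 22, "GPE")]  # assumed output of the entity recognizer

# Keep only the GPE spans that the lookup recognizes as destinations.
destinations = [
    text[start:end]
    for start, end, label in gpe_spans
    if label == "GPE" and is_tourist_destination(text, start, end)
]
print(destinations)
```

A plain name lookup like this cannot tell Paris, France from Paris, Arkansas. A realistic rule would also need to look at the surrounding context or resolve the entity to a specific knowledge base entry.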

Part 2
- Rewrite the TRAINING_DATA to only use the label "GPE" (cities, states, countries) instead of "TOURIST_DESTINATION".
- Don’t forget to add tuples for the "GPE" entities that weren’t labeled in the old data.

In [1]:
TRAINING_DATA = [
    (
        "i went to amsterdem last year and the canals were beautiful",
        {"entities": [(10, 19, "GPE")]},
    ),
    (
        "You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
        {"entities": [(17, 22, "GPE")]},
    ),
    (
        "There's also a Paris in Arkansas, lol",
        {"entities": [(15, 20, "GPE"), (24, 32, "GPE")]},
    ),
    (
        "Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!",
        {"entities": [(0, 6, "GPE")]},
    ),
]
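A quick way to sanity-check hand-written offsets is to slice each text with them and look at what comes back. The sketch below does this for two of the examples; the variable names `sample` and `covered` are illustrative only.

```python
sample = [
    (
        "There's also a Paris in Arkansas, lol",
        {"entities": [(15, 20, "GPE"), (24, 32, "GPE")]},
    ),
    (
        "Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!",
        {"entities": [(0, 6, "GPE")]},
    ),
]

# Collect the text that each (start, end, label) tuple actually covers.
covered = [
    (text[start:end], label)
    for text, annotations in sample
    for start, end, label in annotations["entities"]
]
for span_text, label in covered:
    print(repr(span_text), label)
```

If an offset is off by one, the printed span shows a clipped word or a stray space, which is much easier to spot than a wrong number.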

Here’s a small sample of a dataset created to train a new entity type "WEBSITE". 

The original dataset contains a few thousand sentences. In this exercise, you’ll be doing the labeling by hand.

In real life, you probably want to automate this and use an annotation tool, for example Brat, a popular open-source solution, or Prodigy, our own annotation tool that integrates with spaCy.

Part 1
- Complete the entity offsets for the "WEBSITE" entities in the data. Feel free to use len() if you don’t want to count the characters.

In [2]:
TRAINING_DATA = [
    (
        "Reddit partners with Patreon to help creators build communities",
        {"entities": [(0, 6, "WEBSITE"), (21, 28, "WEBSITE")]},
    ),
    ("PewDiePie smashes YouTube record", {"entities": [(18, 25, "WEBSITE")]}),
    (
        "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
        {"entities": [(0, 6, "WEBSITE")]},
    ),
    # And so on...
]
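Instead of counting characters, the offsets can be derived with `str.find()` and `len()`. The helper below, `find_offsets`, is a hypothetical name for this sketch, and it only handles the first occurrence of a phrase in the text.

```python
def find_offsets(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase in text."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    # The end offset is exclusive, so it is start plus the phrase length.
    return (start, start + len(phrase), label)


text = "Reddit partners with Patreon to help creators build communities"
entities = [find_offsets(text, name, "WEBSITE") for name in ("Reddit", "Patreon")]
print(entities)  # [(0, 6, 'WEBSITE'), (21, 28, 'WEBSITE')]
```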

Part 2

A model was trained with the data you just labelled, plus a few thousand similar examples. After training, it’s doing great on "WEBSITE", but doesn’t recognize "PERSON" anymore. Why could this be happening?

The training data included no examples of "PERSON", so the model learned that this label is incorrect.

If "PERSON" entities occur in the training data but aren’t labelled, the model will learn that they shouldn’t be predicted. Similarly, if an existing entity type isn’t present in the training data, the model may “forget” and stop predicting it.
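One way to catch this before training is to count the labels that actually appear in the annotations and compare them with the labels the model should keep predicting. This is a minimal sketch; `expected_labels` is an assumption you would set yourself, and the check only detects labels that are missing altogether, not individual unlabelled entities.

```python
from collections import Counter

sample = [
    ("PewDiePie smashes YouTube record", {"entities": [(18, 25, "WEBSITE")]}),
    (
        "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
        {"entities": [(0, 6, "WEBSITE")]},
    ),
]

# Labels the trained model is still expected to predict (an assumption here).
expected_labels = {"WEBSITE", "PERSON"}

# Count how often each label occurs across all annotated examples.
label_counts = Counter(
    label
    for _, annotations in sample
    for _, _, label in annotations["entities"]
)
missing = sorted(expected_labels - set(label_counts))
print(label_counts)
print("Labels with no examples:", missing)
```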

Part 3

Update the training data to include annotations for the "PERSON" entities “PewDiePie” and “Alexis Ohanian”.


In [3]:
TRAINING_DATA = [
    (
        "Reddit partners with Patreon to help creators build communities",
        {"entities": [(0, 6, "WEBSITE"), (21, 28, "WEBSITE")]},
    ),
    (
        "PewDiePie smashes YouTube record",
        {"entities": [(0, 9, "PERSON"), (18, 25, "WEBSITE")]},
    ),
    (
        "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
        {"entities": [(0, 6, "WEBSITE"), (15, 29, "PERSON")]},
    ),
    # And so on...
]