This notebook was created from Chapter 4 of the [Advanced NLP with spaCy](https://course.spacy.io/en/chapter4) course.

# Chapter 4: Training a neural network model

In this chapter, you'll learn how to update spaCy's statistical models to customize them for your use case – for example, to predict a new entity type in online comments. You'll train your own model from scratch, and understand the basics of how training works, along with tips and tricks that can make your custom NLP projects more successful.

## Training and updating models

Welcome to the final chapter, which is about one of the most exciting aspects of modern NLP: training your own models!

In this lesson, you'll learn about training and updating spaCy's pipeline components and their neural network models, and the data you need for it – focusing specifically on the named entity recognizer.

### Why update the model?

Before we get started with explaining how, it's worth taking a second to ask ourselves: Why would we want to update the model with our own examples? Why can't we just rely on pre-trained pipelines?

Statistical models make predictions based on the examples they were trained on.

You can usually make the model more accurate by showing it examples from your domain.

You often also want to predict categories specific to your problem, so the model needs to learn about them.

This is essential for text classification, very useful for entity recognition and a little less critical for tagging and parsing.

- Better results on your specific domain
- Learn classification schemes specifically for your problem
- Essential for text classification
- Very useful for named entity recognition
- Less critical for part-of-speech tagging and dependency parsing

### How training works (1)

spaCy supports updating existing models with more examples, and training new models. If we're not starting with a trained pipeline, we first initialize the weights randomly.

Next, spaCy calls `nlp.update`, which predicts a batch of examples with the current weights.

The model then checks the predictions against the correct answers, and decides how to change the weights to achieve better predictions next time.

Finally, we make a small correction to the current weights and move on to the next batch of examples.

spaCy then continues calling `nlp.update` for each batch of examples in the data. During training, you usually want to make multiple passes over the data and train until the model stops improving.

1. **Initialize** the model weights randomly
2. **Predict** a few examples with the current weights
3. **Compare** prediction with true labels
4. **Calculate** how to change weights to improve predictions
5. **Update** weights slightly
6. Go back to 2.
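
The six steps above can be sketched with a toy model in plain Python. This is not spaCy's implementation: the `train` and `predict` functions, the perceptron-style update rule and the toy data are all assumptions made for illustration. It only shows the predict, compare, calculate, update cycle over batches and multiple passes.

```python
import random


def predict(weights, features):
    """Return 1 if the weighted sum of the features is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0


def train(examples, epochs=100, learn_rate=0.1, batch_size=2, seed=0):
    """Toy training loop (hypothetical, not spaCy's internals)."""
    rng = random.Random(seed)
    examples = list(examples)  # copy so the caller's list is not shuffled
    n_features = len(examples[0][0])
    # 1. Initialize the model weights randomly
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]
    for _ in range(epochs):  # multiple passes over the data
        rng.shuffle(examples)
        for start in range(0, len(examples), batch_size):
            batch = examples[start:start + batch_size]
            correction = [0.0] * n_features
            for features, label in batch:
                # 2. Predict the example with the current weights
                predicted = predict(weights, features)
                # 3. Compare the prediction with the true label
                error = label - predicted
                # 4. Calculate how to change the weights
                for i, x in enumerate(features):
                    correction[i] += error * x
            # 5. Update the weights slightly
            weights = [w + learn_rate * c for w, c in zip(weights, correction)]
            # 6. Go back to step 2 with the next batch
    return weights


# Toy data: (features, label); the first feature is a constant bias term
EXAMPLES = [
    ([1.0, 0.0, 0.0], 0),
    ([1.0, 0.0, 1.0], 0),
    ([1.0, 1.0, 0.0], 1),
    ([1.0, 1.0, 1.0], 1),
]

weights = train(EXAMPLES)
predictions = [predict(weights, features) for features, _ in EXAMPLES]
print(predictions)
```

In spaCy, the per-batch part of this loop is what `nlp.update` takes care of.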

### How training works (2)

Here's an illustration showing the process.

The training data are the examples we want to update the model with.

The text should be a sentence, paragraph or longer document. For the best results, it should be similar to what the model will see at runtime.

The label is what we want the model to predict. This can be a text category, or an entity span and its type.

The gradient is how we should change the model to reduce the current error. It's computed when we compare the predicted label to the true label.

After training, we can then save out an updated model and use it in our application.

![training](./img/training.png)

- **Training data**: Examples and their annotations.
- **Text**: The input text the model should predict a label for.
- **Label**: The label the model should predict.
- **Gradient**: How to change the weights.

### Example: Training the entity recognizer

Let's look at an example for a specific component: the entity recognizer.

The entity recognizer takes a document and predicts phrases and their labels in context. This means that the training data needs to include texts, the entities they contain, and the entity labels.

Entities can't overlap, so each token can only be part of one entity.

The easiest way to do this is to show the model a text and entity spans. spaCy can be updated from regular `Doc` objects with entities annotated as the `doc.ents`. For example, "iPhone X" is a gadget, starts at token 0 and ends at token 1.

It's also very important for the model to learn words that *aren't* entities.

In this case, the list of span annotations will be empty.

Our goal is to teach the model to recognize new entities in similar contexts, even if they weren't in the training data.

- The entity recognizer tags words and phrases in context
- Each token can only be part of one entity
- Examples need to come with context

- Texts with no entities are also important

- **Goal**: teach the model to generalize
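
Because each token can only be part of one entity, it can help to check span annotations for overlaps before using them. The helper below is a minimal sketch in plain Python. The name `has_overlap` and the `(start, end, label)` tuple layout are assumptions for illustration, with token offsets where the end is exclusive.

```python
def has_overlap(spans):
    """Return True if any two (start, end, label) token spans share a token.

    Offsets are token indices: start is inclusive, end is exclusive.
    """
    ordered = sorted(spans, key=lambda span: (span[0], span[1]))
    # After sorting by start, any overlap shows up between neighbours
    for (_, prev_end, _), (next_start, _, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            return True
    return False


valid_spans = [(0, 2, "GADGET"), (5, 7, "GADGET")]
invalid_spans = [(0, 2, "GADGET"), (1, 3, "GADGET")]  # both claim token 1

print(has_overlap(valid_spans), has_overlap(invalid_spans))
```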

### The training data

The training data tells the model what we want it to predict. This could be texts and named entities we want to recognize, tokens and their correct part-of-speech tags or anything else the model should predict.

To update an existing model, we can start with a few hundred to a few thousand examples.

To train a new category we may need up to a million.

spaCy's trained English pipelines for instance were trained on 2 million words labelled with part-of-speech tags, dependencies and named entities.

Training data is usually created by humans who assign labels to texts.

This is a lot of work, but can be semi-automated – for example, using spaCy's `Matcher`.

- Examples of what we want the model to predict in context
- Update an **existing model**: a few hundred to a few thousand examples
- Train a **new category**: a few thousand to a million examples
  - spaCy's English models: 2 million words
- Usually created manually by human annotators
- Can be semi-automated – for example, using spaCy's `Matcher`!

### Training vs. evaluation data

When training your model, it's important to know how it's doing and whether it's learning the right thing. This is done by comparing the model's predictions on examples it hasn't seen during training to answers we already know. So in addition to the training data, you also need evaluation data, also called development data.

The evaluation data is used to calculate how accurate your model is. For example, an accuracy score of 90% means that the model predicted 90% of the evaluation examples correctly.

This also means that the evaluation data needs to be representative of the data your model will see at runtime. Otherwise, the accuracy score will be meaningless, because it won't tell you how good your model *really* is.

- **Training data**: used to update the model
- **Evaluation data**:
  - data the model hasn't seen during training
  - used to calculate how accurate the model is
  - should be representative of the data the model will see at runtime
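
As a rough sketch of what the accuracy score means, the snippet below compares predictions to known answers in plain Python. The `accuracy` function and the label lists are made up for illustration. spaCy computes its own scores during training.

```python
def accuracy(predicted, gold):
    """Fraction of evaluation examples whose prediction matches the answer."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical labels for ten evaluation examples the model has not seen
gold = ["GADGET", "O", "GADGET", "O", "O", "GADGET", "O", "O", "GADGET", "O"]
predicted = ["GADGET", "O", "GADGET", "O", "O", "O", "O", "O", "GADGET", "O"]

score = accuracy(predicted, gold)
print(f"{score:.0%}")
```

Here the model gets nine of ten examples right, which corresponds to the 90% accuracy mentioned above.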

### Generating a training corpus (1)

spaCy can be updated from data in the same format it creates: `Doc` objects. You already learned all about creating `Doc` and `Span` objects in chapter 2.

In this example, we're creating two `Doc` objects for our corpus: one that contains an entity and another one that doesn't contain any entities. To set the entities on the `Doc`, we can add a `Span` to the `doc.ents`.

Of course, you'll need a lot more examples to effectively train your model to generalize and predict similar entities in context. Depending on the task, you usually want at least a few hundred to a few thousand representative examples.
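
A minimal sketch of such a two-document corpus might look like this. The example texts are assumptions chosen for illustration. Note that the `Span` end index is exclusive, so tokens 0 and 1 ("iPhone X") are written as `Span(doc1, 0, 2, ...)`.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# Create a Doc with one entity span covering tokens 0 and 1
doc1 = nlp("iPhone X is coming")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]

# Create another Doc without any entity spans
doc2 = nlp("I need a new phone! Any tips?")

docs = [doc1, doc2]  # a real corpus needs many more examples

print([(ent.text, ent.label_) for ent in doc1.ents])
print(doc2.ents)
```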

### Generating a training corpus (2)

As I mentioned earlier, we don't just need data to train the model. We also want to evaluate its accuracy on examples it hasn't seen during training. This is usually done by shuffling and splitting your data in two: one portion for training and one for evaluation. Here, we're using a simple 50/50 split.

- split data into two portions:
  - **training data**: used to update the model
  - **development data**: used to evaluate the model
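
The shuffle-and-split step can be sketched as follows. The helper name `split_corpus`, the fixed seed and the placeholder strings are assumptions. In practice the items would be your annotated `Doc` objects.

```python
import random


def split_corpus(items, train_fraction=0.5, seed=0):
    """Shuffle a copy of the items and split it into train and dev portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


# Placeholder items standing in for annotated Doc objects
items = [f"example {i}" for i in range(10)]
train_items, dev_items = split_corpus(items)

print(len(train_items), len(dev_items))
```

Shuffling before splitting avoids an accidental ordering in the data (for example, all examples from one source first) ending up entirely in one portion.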

### Generating a training corpus (3)

You typically want to store your training and development data as files on disk so you can load them into spaCy's training process.

The `DocBin` is a container for efficiently storing and serializing `Doc` objects. You can instantiate it with a list of `Doc` objects and call its `to_disk` method to save it to a binary file. We typically use the file extension `.spacy` for these files.

Compared to other binary serialization protocols like `pickle`, the `DocBin` is faster and produces smaller file sizes because it only stores the shared vocabulary once. You can read more about how it works in the [documentation](https://spacy.io/api/docbin).

- `DocBin`: container to efficiently store and save `Doc` objects
- can be saved to a binary file
- binary files are used for training
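
Here is a small sketch of saving a `DocBin` and reading it back. The temporary directory and the example text are assumptions, used so the snippet does not leave files behind. In a real project you would write `train.spacy` and `dev.spacy` next to your config.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")
doc = nlp("iPhone X is coming")
doc.ents = [Span(doc, 0, 2, label="GADGET")]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    # Store the docs in a DocBin and save it to a binary file
    DocBin(docs=[doc]).to_disk(path)
    # Load the file again and recreate the Doc objects
    loaded_docs = list(DocBin().from_disk(path).get_docs(nlp.vocab))

loaded_ents = [(ent.text, ent.label_) for ent in loaded_docs[0].ents]
print(loaded_ents)
```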

### Tip: Converting your data

In some cases, you might already have training and development data in a common format – for example, CoNLL or IOB. spaCy's `convert` command automatically converts these files into spaCy's binary format. It also converts JSON files in the old format used by spaCy v2.

- `spacy convert` lets you convert corpora in common formats
- supports `.conll`, `.conllu`, `.iob` and spaCy's old JSON format

## Training and evaluation data

**NOTE:** The question below is not very coherent: it mentions training and development data, then uses "evaluation data" to mean the development data. In my opinion, the term "development data" should simply be replaced with "test data".

To train a model, you typically need training data and development data for evaluation. What is this evaluation data used for?

- Provide more training examples as a fallback if the training data isn't enough.

- Check predictions on unseen examples and calculate the accuracy score. (Correct)

- Define training examples without annotations.

## Creating training data (1)

spaCy’s rule-based `Matcher` is a great way to quickly create training data for named entity models. A list of sentences is available as the variable `TEXTS`. You can print it to inspect it. We want to find all mentions of different iPhone models, so we can create training data to teach a model to recognize them as `"GADGET"`.

- Write a pattern for two tokens whose lowercase forms match `"iphone"` and `"x"`.
- Write a pattern for two tokens: one token whose lowercase form matches `"iphone"` and a digit.

```
import json
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

with open("../../exercises/en/iphone.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match "iphone" and "x"
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Token whose lowercase form matches "iphone" and a digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]

# Add patterns to the matcher and create docs with matched entities
matcher.add("GADGET", [pattern1, pattern2])
docs = []
for doc in nlp.pipe(TEXTS):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
    print(spans)
    doc.ents = spans
    docs.append(doc)
```

```
[iPhone X]
[iPhone X]
[iPhone X]
[iPhone 8]
[iPhone 11, iPhone 8]
[]
```


## Creating training data (2)

After creating the data for our corpus, we need to save it out to a `.spacy` file. The code from the previous example is already available.

- Instantiate the `DocBin` with the list of `docs`.
- Save the `DocBin` to a file called `train.spacy`.

```
import json
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span, DocBin

with open("../../exercises/en/iphone.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Add patterns to the matcher
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
matcher.add("GADGET", [pattern1, pattern2])

# Create docs with the matched spans set as entities
docs = []
for doc in nlp.pipe(TEXTS):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
    doc.ents = spans
    docs.append(doc)

# Store the docs in a DocBin and save it to disk
doc_bin = DocBin(docs=docs)
doc_bin.to_disk("./train.spacy")
```

## Configuring and running training

Now that you've learned how to create training data, let's take a look at training your pipeline and configuring the training. In this lesson, you'll learn all about spaCy's training config system, how to generate your own training config, how to use the CLI to train a model and how to explore your trained pipeline afterwards.

### The training config (1)

spaCy uses a config file, usually called `config.cfg`, as the "single source of truth" for all settings. The config file defines how to initialize the `nlp` object, which pipeline components to add and how their internal model implementations should be configured. It also includes all settings for the training process and how to load the data, including hyperparameters.

Instead of providing lots of arguments on the command line or having to remember to define every single setting in code, you only need to pass your config file to spaCy's training command.

Config files also help with reproducibility: you'll have all settings in one place and always know how your pipeline was trained. You can even check your config file into a Git repo to version it and share it with others so they can train the same pipeline with the same settings.

- **single source of truth** for all settings
- typically called `config.cfg`
- defines how to initialize the `nlp` object
- includes all settings about the pipeline components and their model implementations
- configures the training process and hyperparameters
- makes your training more reproducible

### The training config (2)

Here's an excerpt from a config file used to train a pipeline with a named entity recognizer. The config is grouped into sections, and nested sections are defined using a dot. For example, `[components.ner.model]` defines the settings for the named entity recognizer's model implementation.

Config files can also reference Python functions using the `@` notation. For example, the tokenizer defines a registered tokenizer function. You can use this to customize different parts of the `nlp` object and training – from plugging in your own tokenizer, all the way to implementing your own model architectures. But let's not worry about this for now – what you'll learn in this chapter will simply use the defaults spaCy provides out-of-the-box!
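
To make the dot notation more concrete, the sketch below parses a short excerpt and splits the section names into their nested parts. The values are taken from the config generated later in this notebook. spaCy loads configs with its own parser. Python's standard-library `configparser` is used here only as a stand-in to show the section structure, so details such as value types differ from what spaCy does.

```python
import configparser

CONFIG_EXCERPT = """
[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000

[components]

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
hidden_width = 64
"""

# interpolation=None keeps values as plain strings (stand-in parser, not spaCy's)
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_EXCERPT)

for section in parser.sections():
    # "components.ner.model" -> ["components", "ner", "model"]
    print(section.split("."), dict(parser[section]))

# The "@" key names the registered function that builds the model
architecture = parser["components.ner.model"]["@architectures"].strip('"')
print(architecture)
```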

### Generating a config

Of course, you don't have to write the config files by hand, and in a lot of cases, you won't even need to customize it at all. spaCy can auto-generate a config file for you.

The quickstart widget in the documentation lets you generate a config interactively by selecting the language and pipeline components you need, as well as optional hardware and optimization settings.

Alternatively, you can also use spaCy's built-in `init config` command. It takes the output file as the first argument. We usually call this file `config.cfg`. The argument `--lang` defines the language class that should be used for the pipeline, for example, `en` for English. The `--pipeline` argument lets you specify one or more comma-separated pipeline components to include. In this example, we're creating a config with one pipeline component, the named entity recognizer.

- spaCy can auto-generate a default config file for you
- interactive [quickstart widget](https://spacy.io/usage/training#quickstart) in the docs
- [init config](https://spacy.io/api/cli#init-config) command on the CLI

- `init config`: the command to run
- `config.cfg`: output path for the generated config
- `--lang`: language class of the pipeline, e.g. `en` for English
- `--pipeline`: comma-separated names of components to include

### Training a pipeline (1)

To train a pipeline, all you need is the config file and the training and development data. These are the `.spacy` files you already worked with in the previous exercises.

The first argument of `spacy train` is the path to the config file. The `--output` argument lets you specify a directory for saving the final trained pipeline.

You can also override different config settings on the command line. In this case, we override `paths.train` using the path to the `train.spacy` file and `paths.dev` using the `dev.spacy` file.

- all you need is the `config.cfg` and the training and development data
- config settings can be overridden on the command line

`$ python -m spacy train ./config.cfg --output ./output --paths.train train.spacy --paths.dev dev.spacy`

- `train`: the command to run
- `config.cfg`: the path to the config file
- `--output`: the path to the output directory to save the trained pipeline
- `--paths.train`: override with path to the training data
- `--paths.dev`: override with path to the evaluation data

**NOTE**: Again, the names "dev" and "evaluation" data are mixed here; in my opinion, both should be called test data.

### Training a pipeline (2)

Here's an example of the output you'll see during and after training. You might remember from earlier in this chapter that you usually want to make several passes over the data during training. Each pass over the data is also called an "epoch". This is shown in the first column of the table.

Within each epoch, spaCy outputs the accuracy scores every 200 training steps. These are the steps shown in the second column. You can change the frequency in the config. Each line shows the loss and calculated accuracy score at this point during training.

The most interesting score to keep an eye on is the combined score in the last column. It reflects how accurately your model predicted the correct answers in the evaluation data.

The training runs until the model stops improving and exits automatically.



```
============================ Training pipeline ============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001

E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     26.50    0.73    0.39    5.43    0.01
  0     200         33.58    847.68   10.88   44.44    6.20    0.11
  1     400         70.88    267.65   33.50   45.95   26.36    0.33
  2     600         67.56    156.63   45.32   62.16   35.66    0.45
  3     800        138.28    134.12   48.17   74.19   35.66    0.48
  4    1000        177.95    109.77   51.43   66.67   41.86    0.51
  6    1200         94.95     52.13   54.63   67.82   45.74    0.55
  8    1400        126.85     66.19   56.00   65.62   48.84    0.56
 10    1600         38.34     24.16   51.96   70.67   41.09    0.52
 13    1800        105.14     23.23   56.88   69.66   48.06    0.57

✔ Saved pipeline to output directory
/path/to/output/model-last

```
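
The `ENTS_P`, `ENTS_R` and `ENTS_F` columns are the precision, recall and F-score for the predicted entities. As a small check, the sketch below recomputes the F-score of the last row from its precision and recall, assuming the usual harmonic-mean definition. The function name `f_score` is made up for this example.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# ENTS_P and ENTS_R from the last row of the table above (step 1800)
ents_p, ents_r = 69.66, 48.06
ents_f = f_score(ents_p, ents_r)

print(round(ents_f, 2))
```

Rounded to two decimals this gives 56.88, which agrees with the `ENTS_F` value in that row.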

**NOTE**: Again, the names "dev" and "evaluation" data are mixed here; in my opinion, both should be called test data.

### Loading a trained pipeline

The pipeline saved after training is a regular loadable spaCy pipeline – just like the trained pipelines provided by spaCy, for example `en_core_web_sm`. At the end, the last trained pipeline and the pipeline with the best score are saved to the output directory.

You can load your trained pipeline by passing the path to `spacy.load`. You can then use it to process and analyze text.

- output after training is a regular loadable spaCy pipeline
  - `model-last`: last trained pipeline
  - `model-best`: best trained pipeline
- load it with `spacy.load`

```
import spacy

nlp = spacy.load("/path/to/output/model-best")
doc = nlp("iPhone 11 vs iPhone 8: What's the difference?")
print(doc.ents)
```

### Tip: Packaging your pipeline

To make it easy to deploy your pipelines, spaCy provides a handy command to package them as Python packages. The `spacy package` command takes the path to your exported pipeline and an output directory. It then generates a Python package containing your pipeline. The Python package is a `.tar.gz` file and can be installed into your environment.

You can also provide an optional name and version on the command line. This lets you manage multiple different versions of a pipeline, for example, if you decide to customize your pipeline later or train it with more data.

The package behaves just like any other Python package. After installation, you can load your pipeline using its name. Note that spaCy will automatically add the language code to the name. So your pipeline `my_pipeline` will become `en_my_pipeline`.

- [spacy package](https://spacy.io/api/cli#package): create an installable Python package containing your pipeline
- easy to version and deploy

```
$ python -m spacy package /path/to/output/model-best ./packages --name my_pipeline --version 1.0.0
```

```
$ cd ./packages/en_my_pipeline-1.0.0
$ pip install dist/en_my_pipeline-1.0.0.tar.gz
```

Load and use the pipeline after installation:

```
nlp = spacy.load("en_my_pipeline")
```

## The training config

The `config.cfg` file is the “single source of truth” for training a pipeline with spaCy. Which of the following is not true about the config?

- It allows you to configure the training process and hyperparameters.

- It helps make your training more reproducible.

- It creates an installable Python package with your pipeline. (Correct)

- It defines the pipeline's components and their settings.

The config file includes all settings related to training and how to set up the pipeline, but it doesn’t package your pipeline. To create an installable Python package, you can use the `spacy package` command.

## Generating a config file

The [`init config` command](https://spacy.io/api/cli#init-config) auto-generates a config file for training with the default settings. We want to train a named entity recognizer, so we’ll generate a config file for one pipeline component, `ner`. Because we’re executing the command in a Jupyter environment in this course, we’re using the prefix `!`. If you’re running the command in your local terminal, you can leave this out.

Part 1

- Use spaCy’s `init config` command to auto-generate a config for an English pipeline.
- Save the config to a file `config.cfg`.
- Use the `--pipeline` argument to specify one pipeline component, `ner`.

```
!python -m spacy init config ./config.cfg --lang en --pipeline ner
```

```
⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
```

Part 2

- Let’s take a look at the config spaCy just generated! You can run the command below to print the config to the terminal and inspect it.

```
!cat ./config.cfg
```

```
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = 
```

## Using the training CLI

Let’s use the config file generated in the previous exercise and the training corpus we’ve created to train a named entity recognizer!

The `train` command lets you train a model from a training config file. A file `config_gadget.cfg` is already available in the directory `exercises/en`, as well as a file `train_gadget.spacy` containing the training examples, and a file `dev_gadget.spacy` containing the evaluation examples. Because we’re executing the command in a Jupyter environment in this course, we’re using the prefix `!`. If you’re running the command in your local terminal, you can leave this out.

- Call the `train` command with the file `exercises/en/config_gadget.cfg`.
- Save the trained pipeline to a directory `output`.
- Pass in the `exercises/en/train_gadget.spacy` and `exercises/en/dev_gadget.spacy` paths.

```
!python -m spacy train ../../exercises/en/config_gadget.cfg --output output --paths.train ../../exercises/en/train_gadget.spacy --paths.dev ../../exercises/en/dev_gadget.spacy
```

```
✔ Created output directory: output
ℹ Saving to output directory: output
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     20.33    1.69    1.04    4.44    0.02
  1     200         30.09    985.76   81.32   80.43   82.22    0.81
  2     400         71.52    221.11   84.92   85.39   84.44    0.85
  4     600         67.26    109.45   81.97   80.65   83.33    0.82
  6     800         97.36     96.94   86.34   84.95   87.78    0.86
  9    1000         70.54     55.10   83.16   79.00   87.78    0.83
 12    1200         46.49     27.38   84.92   85.39   84.44    0.85
 16    1400        109.35     30.16   79.79   74.76   85.56    0.80
 22 
```

## Exploring the model

Let’s see how the model performs on unseen data! To speed things up a little, we already ran a trained pipeline for the label `"GADGET"` over some text. Here are some of the results:

Text |	Entities
-----|----------
Apple is slowing down the iPhone 8 and iPhone X - how to stop it	| (iPhone 8, iPhone X)
I finally understand what the iPhone X ‘notch’ is for	| (iPhone X,)
Everything you need to know about the Samsung Galaxy S9	| (Samsung Galaxy,)
Looking to compare iPad models? Here’s how the 2018 lineup stacks up	| (iPad,)
The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple	| (iPhone 8, iPhone 8)
what is the cheapest ipad, especially ipad pro???	| (ipad, ipad)
Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics	| (Samsung Galaxy,)


Out of all the entities in the texts, **how many did the model get correct**? Keep in mind that incomplete entity spans count as mistakes, too! Tip: Count the number of entities that the model *should* have predicted. Then count the number of entities it *actually* predicted correctly and divide that by the number of entities it should have predicted.

- 45%

- 60%

- 70% (Correct)

- 90%

Text |	Entities | Mistake
-----|----------|---------
Apple is slowing down the iPhone 8 and iPhone X - how to stop it	| (iPhone 8, iPhone X) |
I finally understand what the iPhone X ‘notch’ is for	| (iPhone X,) |
Everything you need to know about the Samsung Galaxy S9	| (Samsung Galaxy,) | Samsung Galaxy S9
Looking to compare iPad models? Here’s how the 2018 lineup stacks up	| (iPad,) | 
The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple	| (iPhone 8, iPhone 8) | iPhone 8 Plus
what is the cheapest ipad, especially ipad pro???	| (ipad, ipad) | ipad pro
Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics	| (Samsung Galaxy,) |
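
The count behind the answer can be reproduced with a few lines of Python. The data below is transcribed from the table above. Comparing entity texts position by position is a simplification made for this sketch; a real evaluation would compare span offsets and labels.

```python
# (entities the model should have predicted, entities it actually predicted)
RESULTS = [
    (["iPhone 8", "iPhone X"], ["iPhone 8", "iPhone X"]),
    (["iPhone X"], ["iPhone X"]),
    (["Samsung Galaxy S9"], ["Samsung Galaxy"]),
    (["iPad"], ["iPad"]),
    (["iPhone 8", "iPhone 8 Plus"], ["iPhone 8", "iPhone 8"]),
    (["ipad", "ipad pro"], ["ipad", "ipad"]),
    (["Samsung Galaxy"], ["Samsung Galaxy"]),
]

# Total number of entities the model should have found
total = sum(len(expected) for expected, _ in RESULTS)

# An incomplete span such as "ipad" for "ipad pro" counts as a mistake
correct = sum(
    sum(e == p for e, p in zip(expected, predicted))
    for expected, predicted in RESULTS
)

print(f"{correct} of {total} correct: {correct / total:.0%}")
```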

## Training best practices

When you start running your own experiments, you might find that a lot of things just don't work the way you want them to. And that's okay.

Training models is an iterative process, and you have to try different things until you find out what works best.

In this lesson, I'll be sharing some best practices and things to keep in mind when training your own models.

Let's take a look at some of the problems you may come across.

### Problem 1: Models can "forget" things

Statistical models can learn lots of things – but they can also unlearn them.

If you're updating an existing model with new data, especially new labels, it can overfit and adjust too much to the new examples.

For instance, if you're only updating it with examples of `"WEBSITE"`, it may "forget" other labels it previously predicted correctly – like `"PERSON"`.

This is also known as the catastrophic forgetting problem.

- Existing model can overfit on new data
  - e.g.: if you only update it with `"WEBSITE"`, it can "unlearn" what a `"PERSON"` is
- Also known as "catastrophic forgetting" problem

### Solution 1: Mix in previously correct predictions

To prevent this, make sure to always mix in examples of what the model previously got correct.

If you're training a new category `"WEBSITE"`, also include examples of `"PERSON"`.

spaCy can help you with this. You can create those additional examples by running the existing model over data and extracting the entity spans you care about.

You can then mix those examples in with your existing data and update the model with annotations of all labels.

- For example, if you're training `"WEBSITE"`, also include examples of `"PERSON"`
- Run existing spaCy model over data and extract all other relevant entities
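
One way to picture the mixing step is shown below in plain Python. The function `mix_annotations`, the `(start, end, label)` tuples and the sample spans are hypothetical. In practice the predicted spans would come from running the existing pipeline over your texts and reading `doc.ents`.

```python
def mix_annotations(new_annotations, model_predictions, keep_labels):
    """Combine new gold spans with selected predictions of the existing model.

    Both inputs hold one list of (start, end, label) spans per text.
    """
    mixed = []
    for new_spans, predicted_spans in zip(new_annotations, model_predictions):
        # Keep only the previously learned labels we want to preserve
        kept = [span for span in predicted_spans if span[2] in keep_labels]
        mixed.append(sorted(new_spans + kept))
    return mixed


# Hypothetical spans for a single text
new_annotations = [[(4, 5, "WEBSITE")]]
model_predictions = [[(0, 2, "PERSON"), (7, 8, "DATE")]]

mixed = mix_annotations(new_annotations, model_predictions, keep_labels={"PERSON"})
print(mixed)
```

The resulting annotations contain both the new `"WEBSITE"` span and the `"PERSON"` span the model already predicted, so updating on them reminds the model of what it knew before.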

### Problem 2: Models can't learn everything

Another common problem is that your model just won't learn what you want it to.

spaCy's models make predictions based on the local context – for example, for named entities, the surrounding words are most important.

If the decision is difficult to make based on the context, the model can struggle to learn it.

The label scheme also needs to be consistent and not too specific.

For example, it may be very difficult to teach a model to predict whether something is adult clothing or children's clothing based on the context. However, just predicting the label "clothing" may work better.

- spaCy's models make predictions based on **local context**
- Model can struggle to learn if decision is difficult to make based on context
- Label scheme needs to be consistent and not too specific
  - For example: `"CLOTHING"` is better than `"ADULT_CLOTHING"` and `"CHILDRENS_CLOTHING"`

### Solution 2: Plan your label scheme carefully

Before you start training and updating models, it's worth taking a step back and planning your label scheme.

Try to pick categories that are reflected in the local context and make them more generic if possible.

You can always add a rule-based system later to go from generic to specific.

Generic categories like "clothing" or "band" are both easier to label and easier to learn.

- Pick categories that are reflected in local context
- More generic is better than too specific
- Use rules to go from generic labels to specific categories

**BAD:**

`LABELS = ["ADULT_SHOES", "CHILDRENS_SHOES", "BANDS_I_LIKE"]`

**GOOD:**

`LABELS = ["CLOTHING", "BAND"]`
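
Going from a generic label to a specific category with rules could look like the sketch below. The lookup table, the function `refine_label` and the example terms are made up for illustration. A real system might consult a product catalogue or knowledge base instead.

```python
# Hypothetical lookup from entity text to a more specific category
SPECIFIC_CATEGORIES = {
    "onesie": "CHILDRENS_CLOTHING",
    "romper": "CHILDRENS_CLOTHING",
    "tuxedo": "ADULT_CLOTHING",
}


def refine_label(entity_text, label):
    """Map a generic CLOTHING entity to a specific category if a rule applies."""
    if label != "CLOTHING":
        return label
    return SPECIFIC_CATEGORIES.get(entity_text.lower(), label)


print(refine_label("Onesie", "CLOTHING"))
print(refine_label("scarf", "CLOTHING"))
print(refine_label("Metallica", "BAND"))
```

The model only has to learn the generic `"CLOTHING"` label from context, and the rule layer adds the finer distinction afterwards where it can.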

## Good data vs. bad data

Here’s an excerpt from a training set that labels the entity type `TOURIST_DESTINATION` in traveler reviews.

```
doc1 = nlp("i went to amsterdem last year and the canals were beautiful")
doc1.ents = [Span(doc1, 3, 4, label="TOURIST_DESTINATION")]

doc2 = nlp("You should visit Paris once, but the Eiffel Tower is kinda boring")
doc2.ents = [Span(doc2, 3, 4, label="TOURIST_DESTINATION")]

doc3 = nlp("There's also a Paris in Arkansas, lol")
doc3.ents = []

doc4 = nlp("Berlin is perfect for summer holiday: great nightlife and cheap beer!")
doc4.ents = [Span(doc4, 0, 1, label="TOURIST_DESTINATION")]
```

Part 1

Why is this data and label scheme problematic?

- Whether a place is a tourist destination is a subjective judgement and not a definitive category. It will be very difficult for the entity recognizer to learn. (Correct)

- Paris should also be labelled as a tourist destination for consistency. Otherwise, the model will be confused.

- Rare out-of-vocabulary words like the misspelled 'amsterdem' shouldn't be labelled as entities.

Answer:
The first option is correct.

A much better approach would be to only label `"GPE"` (geopolitical entity) or `"LOCATION"` and then use a rule-based system to determine whether the entity is a tourist destination in this context. For example, you could resolve the entity types back to a knowledge base or look them up in a travel wiki.

Part 2

- Rewrite the `doc.ents` to only use spans of the label `"GPE"` (cities, states, countries) instead of `"TOURIST_DESTINATION"`.
- Don’t forget to add spans for the `"GPE"` entities that weren’t labeled in the old data.

Problem

```
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

doc1 = nlp("i went to amsterdem last year and the canals were beautiful")
doc1.ents = [Span(doc1, 3, 4, label="TOURIST_DESTINATION")]

doc2 = nlp("You should visit Paris once, but the Eiffel Tower is kinda boring")
doc2.ents = [Span(doc2, 3, 4, label="TOURIST_DESTINATION")]

doc3 = nlp("There's also a Paris in Arkansas, lol")
doc3.ents = []

doc4 = nlp("Berlin is perfect for summer holiday: great nightlife and cheap beer!")
doc4.ents = [Span(doc4, 0, 1, label="TOURIST_DESTINATION")]
```

Solution

```
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

doc1 = nlp("i went to amsterdem last year and the canals were beautiful")
doc1.ents = [Span(doc1, 3, 4, label="GPE")]

doc2 = nlp("You should visit Paris once, but the Eiffel Tower is kinda boring")
doc2.ents = [Span(doc2, 3, 4, label="GPE")]

# Both spans must be created from doc3, the Doc they are assigned to
doc3 = nlp("There's also a Paris in Arkansas, lol")
doc3.ents = [Span(doc3, 4, 5, label="GPE"), Span(doc3, 6, 7, label="GPE")]

doc4 = nlp("Berlin is perfect for summer holiday: great nightlife and cheap beer!")
doc4.ents = [Span(doc4, 0, 1, label="GPE")]
```

```
for idx, token in enumerate(doc3):
    print(idx, token)
```

```
0 There
1 's
2 also
3 a
4 Paris
5 in
6 Arkansas
7 ,
8 lol
```
