# Named Entity Recognition

Named entity recognition (NER) means detecting named entities in raw text: people, companies, locations and many other types. After finding entities in the text, we would like to display them nicely. Here is an example of our goal:

![ner example](https://files.ifi.uzh.ch/cl/archiv/2019/tamedia/ner.png)

In the example above, people, organizations and dates are highlighted. Our goal is to identify those entities _automatically_.

## 1 Web Demos as Warm Up

Before we analyze entities in our own data, let's briefly look at some demos of existing services that nicely showcase the potential and limits of NER.

**Dandelion API**: https://dandelion.eu/semantic-text/entity-extraction-demo

Notable features: supports many languages, users can choose between recognizing more entities or higher precision. Entities are not only recognized, but also _linked_ to more information, such as images or Wiki articles.

**Explosion displaCy**: https://demos.explosion.ai/displacy-ent/

Notable features: users can select which classes should be recognized. Also supports several languages and is transparent about which model is used in the background. Generates HTML and CSS at the bottom that can be copy-pasted into any web page. Entirely open-source, too!

**Tasks:**
- **Try those web demos with your own texts.**
- **Try several languages.**
- **How would you rate the quality of named entity recognition?**



## 2 Setup

In [5]:
! pip install requests spacy spacy-lookups-data



In [21]:
!python -m spacy download de_core_news_md
!python -m spacy download fr_core_news_md

! python -m spacy link --force de_core_news_md de_core_news_md
! python -m spacy link --force fr_core_news_md fr_core_news_md


In [0]:
# Alternatively, if the above does not work (if models cannot be found after installation was successful)

#! pip install https://github.com/explosion/spacy-models/releases/download/de_core_news_md-2.2.5/de_core_news_md-2.2.5.tar.gz
#! pip install https://github.com/explosion/spacy-models/releases/download/fr_core_news_md-2.2.5/fr_core_news_md-2.2.5.tar.gz

Download data (parliament transcriptions):

In [7]:
! wget https://files.ifi.uzh.ch/cl/siclemat/lehre/hs19/tm/parlament_transcriptions.jsonl.bz2
! bzip2 -d parlament_transcriptions.jsonl.bz2

--2019-12-03 14:26:45--  https://files.ifi.uzh.ch/cl/siclemat/lehre/hs19/tm/parlament_transcriptions.jsonl.bz2
Resolving files.ifi.uzh.ch (files.ifi.uzh.ch)... 130.60.155.125
Connecting to files.ifi.uzh.ch (files.ifi.uzh.ch)|130.60.155.125|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 54629228 (52M) [application/x-bzip2]
Saving to: ‘parlament_transcriptions.jsonl.bz2’


2019-12-03 14:26:49 (19.4 MB/s) - ‘parlament_transcriptions.jsonl.bz2’ saved [54629228/54629228]



In [8]:
! ls

IJxRfXT.png  parlament_transcriptions.jsonl  sample_data


In [12]:
! head parlament_transcriptions.jsonl

{"__metadata": {"id": "https://ws.parlament.ch/OData.svc/Transcript(ID=2L,Language='DE')", "uri": "https://ws.parlament.ch/OData.svc/Transcript(ID=2L,Language='DE')", "type": "itsystems.Pd.DataServices.DataModel.Transcript"}, "Subjects": {"__deferred": {"uri": "https://ws.parlament.ch/OData.svc/Transcript(ID=2L,Language='DE')/Subjects"}}, "MembersCouncil": {"__deferred": {"uri": "https://ws.parlament.ch/OData.svc/Transcript(ID=2L,Language='DE')/MembersCouncil"}}, "Businesses": {"__deferred": {"uri": "https://ws.parlament.ch/OData.svc/Transcript(ID=2L,Language='DE')/Businesses"}}, "ID": "2", "Language": "DE", "IdSubject": "1", "VoteId": null, "PersonNumber": null, "Type": 3, "Text": "<pd_text><p>[VS]</p>\n<p><i>Musiker des Schweizer Jugend-Sinfonie-Orchesters</i></p>\n<p><i>Musiciens de l'Orchestre Symphonique Suisse des Jeunes</i></p>\n<p>[VS]</p>\n<p><b>Antonio Vivaldi (1678-1741)</b></p>\n<p>[VS]</p>\n<p><i>Konzert C-Dur für zwei Trompeten und Streicher (Allegro)</i></p>\n<p><i>Conce

In [14]:
! grep "SVP" parlament_transcriptions.jsonl | head -n 50

{"__metadata": {"id": "https://ws.parlament.ch/OData.svc/Transcript(ID=8L,Language='DE')", "uri": "https://ws.parlament.ch/OData.svc/Transcript(ID=8L,Language='DE')", "type": "itsystems.Pd.DataServices.DataModel.Transcript"}, "Subjects": {"__deferred": {"uri": "https://ws.parlament.ch/OData.svc/Transcript(ID=8L,Language='DE')/Subjects"}}, "MembersCouncil": {"__deferred": {"uri": "https://ws.parlament.ch/OData.svc/Transcript(ID=8L,Language='DE')/MembersCouncil"}}, "Businesses": {"__deferred": {"uri": "https://ws.parlament.ch/OData.svc/Transcript(ID=8L,Language='DE')/Businesses"}}, "ID": "8", "Language": "DE", "IdSubject": "2", "VoteId": null, "PersonNumber": null, "Type": 3, "Text": "<pd_text><p>[VS]</p>\n<p><b>Wahl der zweiten Vizepräsidentin des Nationalrates für 1999/2000</b></p>\n<p><b>Election de la deuxième vice-présidente du Conseil national pour 1999/2000</b></p>\n<p>[VS]</p>\n<p><b>Präsident</b> (Seiler Hanspeter, Präsident): Die sozialdemokratische Fraktion, unterstützt von de

## 3 Pre-trained models with spaCy

Named entity recognition is a _supervised classification task_ that requires training data to learn from. In this section, instead of training an NER model, we will use a pre-trained model from the NLP library spaCy.

After importing spaCy, load a specific pre-trained model:

In [0]:
import spacy
nlp = spacy.load("de_core_news_md")

`spacy.load` returns a `Language` object, `nlp`, which can be called with a string to analyze:

In [8]:
doc = nlp("(Seiler Hanspeter, Präsident): Die sozialdemokratische Fraktion, unterstützt von der freisinnig-demokratischen, der SVP-, der christlichdemokratischen, der evangelischen und unabhängigen Fraktion, schlägt Ihnen Frau Maury Pasquier vor.")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Seiler Hanspeter 1 17 PER
sozialdemokratische 35 54 MISC
SVP- 116 120 ORG
evangelischen 156 169 MISC
Maury Pasquier 216 230 PER


In this text, spaCy has recognized PER, ORG and MISC entities. See https://spacy.io/api/annotation#named-entities for all kinds of entities spaCy knows about.

spaCy also includes very helpful code to visualize results, in the sub-package `displacy`:

In [9]:
from spacy import displacy

displacy.render(doc, style="ent", jupyter=True)

**Tasks:**

- **Recognize entities in parliament speeches with a pre-trained spaCy model.**
- **Use the German and/or French spaCy models to recognize named entities in parliament speeches, and display them with displacy.**
- **The data file should be on your machine already, `parlament_transcriptions.jsonl`.**

<details>
<summary>If you are stuck: click to see more specific instructions.</summary>

- Important: Start to experiment with a _small number of examples_, for instance the top 100 lines in the input file.
- Use Python's standard library package `json` to read the JSON lines from the file, for instance by calling `json.loads` on each line.
- Make sure you use the JSON key `Language` to decide whether lines are German or French. The JSON key `Text` holds the actual text content.
- The actual content contains XML or HTML tags. Remove those tags before analyzing the strings with spaCy. Some ways to remove the HTML: regexes, `lxml`, `BeautifulSoup`.
- To analyze documents in a loop, use spaCy's `nlp.pipe` method, which returns a generator to loop over:

```python
for doc in nlp.pipe(texts):
    # process individual Doc element
    print(doc.ents)
```
- When calling `nlp.pipe`, disable steps that are not needed for NER to make processing much faster:

```python
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
    # process individual Doc element
    print(doc.ents)
```
</details>
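
The loading and cleaning steps from the hints above could be sketched as follows, using only the standard library. The helper names `strip_tags` and `load_speeches` and the regex-based tag stripping are assumptions made for illustration; `lxml` or `BeautifulSoup` would be more robust for real HTML.

```python
import html
import json
import re
from collections import defaultdict

TAG_RE = re.compile(r"<[^>]+>")  # crude tag matcher, good enough for a first look


def strip_tags(markup):
    """Remove XML/HTML tags, unescape entities and collapse whitespace."""
    text = html.unescape(TAG_RE.sub(" ", markup))
    return " ".join(text.split())


def load_speeches(lines, limit=100):
    """Group cleaned texts by the `Language` key of each JSON line.

    `lines` is any iterable of JSON strings, e.g. an open file handle.
    Only the first `limit` lines are read, to keep experiments small.
    """
    texts_by_language = defaultdict(list)
    for line_number, line in enumerate(lines):
        if line_number >= limit:
            break
        record = json.loads(line)
        text = strip_tags(record.get("Text") or "")
        if text:
            texts_by_language[record["Language"]].append(text)
    return texts_by_language


# Tiny in-memory sample that mimics the structure of the data file
sample_lines = [
    '{"Language": "DE", "Text": "<pd_text><p><b>Antonio Vivaldi (1678-1741)</b></p></pd_text>"}',
    '{"Language": "FR", "Text": "<p>Election de la deuxième vice-présidente</p>"}',
    '{"Language": "DE", "Text": null}',
]
speeches = load_speeches(sample_lines)
print(speeches["DE"])
```

With the real file, an open handle on `parlament_transcriptions.jsonl` can be passed as `lines`. The texts are then grouped under their `Language` values (`"DE"` in the sample shown earlier), ready to be handed to the matching model.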

In [0]:
# your code here

## 4 Train from scratch or extend an NER model with spaCy

Instead of using a pre-trained model, we can of course train our own model. **This requires labelled training examples that contain the "right" answers**.

For the sake of this exercise, we will assume that there is a need to add a new class to our label set. To adapt to our target text domain, we will add the class label (political) `PARTY`.

### Training data format

NLP tools require training data to be in a specific format. In the case of NER, training examples must be structured as follows:

<table>
<tr>
<th>TEXT</th>
<th>ENTITY</th>
<th>START</th>
<th>END</th>
<th>LABEL</th>
</tr>
<tr>
<td>(Seiler Hanspeter, Präsident):</td>
<td>Seiler Hanspeter</td>
<td>1</td>
<td>17</td>
<td>PERSON</td>
</tr>
<tr>
<td>(Seiler Hanspeter, Präsident):</td>
<td>Präsident</td>
<td>19</td>
<td>28</td>
<td>TITLE</td>
</tr>
</table>

START and END are character offsets into the string; ENTITY is just the substring identified by those offsets, included for convenience. START is the index of the first character and END is the index one past the last character, as in Python slicing. The specific format can vary, but it must be clear 1) which exact span of text is being labelled and 2) which class label is assigned to this span.

In the case of spaCy, the exact format that training data needs to be in is:

In [0]:
TRAIN_DATA = [
    ("(Seiler Hanspeter, Präsident):", {"entities": [(1, 17, "PERSON")]}),
    ("(Seiler Hanspeter, Präsident):", {"entities": [(19, 28, "TITLE")]}),
]
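
Offsets are easy to get wrong by one, so it can help to compute them rather than count by hand. The following is a minimal sketch; the helper name `make_example` is hypothetical and not part of spaCy. It uses `str.find` and relies on Python slicing, where the end index is exclusive, so that `text[start:end]` is exactly the entity string.

```python
def make_example(text, entity, label):
    """Build one spaCy-style training tuple for the first occurrence of `entity`."""
    start = text.find(entity)
    if start == -1:
        raise ValueError("%r does not occur in %r" % (entity, text))
    end = start + len(entity)  # exclusive end offset
    assert text[start:end] == entity
    return (text, {"entities": [(start, end, label)]})


text = "(Seiler Hanspeter, Präsident):"
person = make_example(text, "Seiler Hanspeter", "PERSON")
title = make_example(text, "Präsident", "TITLE")
print(person[1])  # {'entities': [(1, 17, 'PERSON')]}
print(title[1])   # {'entities': [(19, 28, 'TITLE')]}
```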

**Tasks:**
- **We are adding to our model a new label, `PARTY`. Look at our collection of speeches to find some examples you would label as `PARTY`.**
- **In your opinion, can we create the training data automatically? After all, we know the names of most political parties involved.**
- **Create some new training data, automatically (100 examples) or manually (10 examples).**

<details>
<summary>If you are stuck: click to see more specific instructions.</summary>

- Define keywords that identify different parties, store them in a Python dict.
- Loop over all our training data examples, for instance by analyzing documents to segment them into sentences:
```python
doc = nlp("Nur ganz kurz: Bei diesem Artikel geht es um die Frage des Übergangsrechtes. Die SVP-Fraktion unterstützt die Minderheit Baumann Alexander einstimmig.")
for sent in doc.sents:
    print(sent.text)
```
- Go over all tokens in a sentence. If a token is one of your pre-defined keywords, save this sentence as a training example, together with START, END and LABEL of the token.
</details>
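
The keyword approach from the hints could be sketched like this. The keyword list and the helper name `label_parties` are assumptions for illustration, and the sketch matches keywords with a regular expression on already segmented sentences instead of looping over spaCy tokens.

```python
import re

# Hypothetical keyword list; extend it with the parties found in the speeches
PARTY_KEYWORDS = ["SVP", "SP", "FDP", "CVP"]

# \b prevents short keywords from matching inside longer words
PARTY_RE = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, PARTY_KEYWORDS)))


def label_parties(sentences):
    """Return spaCy-style training tuples for sentences that mention a party."""
    examples = []
    for sentence in sentences:
        entities = [(m.start(), m.end(), "PARTY") for m in PARTY_RE.finditer(sentence)]
        if entities:
            examples.append((sentence, {"entities": entities}))
    return examples


sentences = [
    "Nur ganz kurz: Bei diesem Artikel geht es um die Frage des Übergangsrechtes.",
    "Die SVP-Fraktion unterstützt die Minderheit Baumann Alexander einstimmig.",
]
party_examples = label_parties(sentences)
print(party_examples)
```

Two caveats apply. Keyword matching labels every occurrence regardless of context, so the resulting data is only as good as the keyword list. In addition, spaCy can only learn from spans that line up with its token boundaries: if the tokenizer keeps `SVP-Fraktion` as a single token, the span for `SVP` alone does not align. `doc.char_span(start, end)` returns `None` for such spans and can serve as a filter.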

In [0]:
# your code here

### Continue training with additional data

The following code assumes that you have compiled a list of additional training examples, in the variable `TRAIN_DATA`.

To do the continued training, also called _fine-tuning_, we will adapt code from the [superb spaCy documentation](https://spacy.io/usage/training#example-new-entity-type):

In [0]:
import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding

# new entity label
LABEL = "ANIMAL"

@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    new_model_name=("New model name for model meta.", "option", "nm", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    ner.add_label(LABEL)  # add new entity label to entity recognizer
    # Adding extraneous labels shouldn't mess anything up
    ner.add_label("VEGETABLE")
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        sizes = compounding(1.0, 4.0, 1.001)
        # batch up the examples using spaCy's minibatch
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            batches = minibatch(TRAIN_DATA, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    # test the trained model
    test_text = "Do you like horses?"
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently
        assert nlp2.get_pipe("ner").move_names == move_names
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)

**Tasks:**

- **Check out the link to spaCy documentation above for more explanations of all the steps above.**
- **Adapt the code to our needs, in order for the model to recognize the additional label `PARTY`.**

In [0]:
# Your code here

### Combat catastrophic forgetting (Bonus Section)

Continuing to train with this method is prone to _catastrophic forgetting_: if a model is only shown examples with the new label during training, it might forget how to recognize the original set of entities.

One way to avoid this problem is to also show examples with the labels from the first training phase during the second training. Since we do not have access to the original training set, we can use the model to predict the labels for our own data set.

**Additional Tasks:**
- **Run the trained model on some of our parliament data, to create some examples for known labels such as `ORG` and `PER`.**
- **Mix those training examples with the ones created for `PARTY`, then continue training the model on this combined data set.**

<details>
<summary>If you are stuck: click to see more specific instructions.</summary>

- Use a loaded spaCy model (that we usually call `nlp`) to analyze input texts. The resulting variable has an attribute `ents` that contains a list of all recognized entities, and each entity contains START, END and LABEL:
```python
doc = nlp("Example text with Obama.")
for ent in doc.ents:
  print(ent.start_char, ent.end_char, ent.label_)
```
- If you are _really_ stuck, read through https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting for a walk-through and explanations of _pseudo rehearsal_.
</details>
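
Mixing the two kinds of annotations could be sketched as below. Both helper names are hypothetical. `doc_to_example` relies only on the `text`, `ents`, `start_char`, `end_char` and `label_` attributes described above, and `merge_entities` assumes that a new `PARTY` span should win whenever it overlaps a span predicted by the model, because spaCy's entity annotations may not overlap.

```python
from collections import namedtuple


def doc_to_example(doc):
    """Turn the model's own predictions into a training tuple."""
    entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    return (doc.text, {"entities": entities})


def merge_entities(predicted, new):
    """Combine two entity lists, dropping predicted spans that overlap a new one."""
    merged = list(new)
    for start, end, label in predicted:
        overlaps = any(start < new_end and new_start < end for new_start, new_end, _ in new)
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)


# Stand-ins for spaCy's Doc and Span, so the sketch runs without a model
FakeSpan = namedtuple("FakeSpan", "start_char end_char label_")
FakeDoc = namedtuple("FakeDoc", "text ents")

text = "Die SVP unterstützt Frau Maury Pasquier."
doc = FakeDoc(text, [FakeSpan(4, 7, "ORG"), FakeSpan(25, 39, "PER")])

_, annotations = doc_to_example(doc)
party_entities = [(4, 7, "PARTY")]
combined = merge_entities(annotations["entities"], party_entities)
print((text, {"entities": combined}))
```

In the notebook, `doc` would be the result of calling `nlp` on a speech, and the merged tuples would be appended to `TRAIN_DATA` before training continues.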

In [0]:
# Your code here

## 5 Combine a pre-trained statistical model with rule-based extraction

Instead of creating more training data and fine-tuning an existing model, an alternative approach is to complement a trained statistical model with **hand-written rules**. The resulting model is called a **hybrid**, and the improvements over a statistical baseline can be substantial.

Here are some examples of rules:

<table>
<tr>
<th>IF</th>
<th>THEN</th>
</tr>
<tr>
<td>Token is 'SVP' or 'svp'</td>
<td>class is "PARTY"</td>
</tr>
<tr>
<td>Token contains 'Fraktion'</td>
<td>class is "PARTY"</td>
</tr>
</table>
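
To make the two rules in the table concrete, here is a plain-Python sketch that applies them to whitespace-separated tokens. The function name `apply_rules` and the simple tokenization are assumptions for illustration; in a real hybrid pipeline, spaCy's `EntityRuler` component would be the more natural place for such rules.

```python
import re

TOKEN_RE = re.compile(r"\S+")
PUNCTUATION = ".,;:()"


def apply_rules(text):
    """Return (start, end, label) spans for tokens matched by the hand-written rules."""
    spans = []
    for match in TOKEN_RE.finditer(text):
        raw = match.group()
        token = raw.strip(PUNCTUATION)
        if not token:
            continue  # the token consisted of punctuation only
        start = match.start() + raw.index(token)
        end = start + len(token)
        # Rule 1: token is 'SVP' or 'svp'; Rule 2: token contains 'Fraktion'
        if token.lower() == "svp" or "Fraktion" in token:
            spans.append((start, end, "PARTY"))
    return spans


text = "Die SVP-Fraktion und die SVP stimmen zu."
for start, end, label in apply_rules(text):
    print(text[start:end], start, end, label)
```

The second rule shows a typical limit of hand-written rules: in a phrase like "die sozialdemokratische Fraktion", only the word "Fraktion" itself would be labelled, not the full party name.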

## 6 Further Reading and Links

- spaCy documentation: https://spacy.io/
- Course with self-test exercises: https://course.spacy.io/
- In-depth explanations of spaCy's NER model: https://www.youtube.com/watch?v=sqDHBH9IjRU. In a nutshell, spaCy uses embeddings with subword features, and processes them with deep convolutional neural networks with residual connections.

Do you have feedback, corrections, suggestions to improve this notebook? Please write an email to mmueller@cl.uzh.ch. Thanks!