# Named Entity Recognition (NER) with spaCy

The purpose of this notebook is to demonstrate the entire process of named-entity recognition ([**NER**](https://nlp.stanford.edu/ner/)) from start to finish with [**spaCy**](https://spacy.io/), including both labelling and model training.
This notebook also explores **pattern matching** as an alternative to **NER** when there is a small, known set of fixed values.

In this notebook, we train a model to detect entities related to **oil/petrol** from this [public dataset](https://www.kaggle.com/mitusha/email-dataset), which contains a list of emails related to the oil industry. This is an oversimplification, because in practice we would want more generic entities, but it shows how pattern matching can be a better alternative to NER in a case like this. To summarise, we will extract oil-related entities from email messages.

The notebook performs the following steps:
- Read the email dataset, which has one email per line.
- Label the emails with the OIL entity using the **[Doccano](http://doccano.herokuapp.com/)** labeling tool. This is a manual process.
- Save the labels in a text file as **JSONL**.
- Use **spaCy**'s neural network model to train a new statistical model.
- Save the model.
- Create a **spaCy** NLP pipeline and use the new model to detect oil entities never seen before.
- Use pattern matching instead of a deep learning model to compare both methods.



## Label the Data

First step: label the data using the open-source platform **Doccano**.

Follow the **[Doccano](https://doccano.github.io/doccano/tutorial/)** instructions to install and open Doccano.

If you use **Linux/Mac**, I recommend using the Docker image:
- `docker pull doccano/doccano`
- `docker container create --name doccano -e "ADMIN_USERNAME=admin" -e "ADMIN_EMAIL=admin@example.com" -e "ADMIN_PASSWORD=password" -p 8000:8000 doccano/doccano`
- `docker container start doccano`
    
For **Windows**, just use **pip**: 
- `pip install doccano`
- `doccano`

Go to http://127.0.0.1:8000/.

Next, label the data using Doccano. Find entities that refer to oil, petrol, petroleum, etc. and label them with the tag **OIL**.

Export the result in the **JSONL (Text label)** format.

## Model Training

First, let's read the JSONL file, in which each line has the following format:

`{"id": 15, "text": "....", "meta": {}, "annotation_approver": null, "labels": [[226, 234, "OIL"]]}`
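Each entry in `labels` is a `[start, end, label]` triple, where `start` and `end` are character offsets into `text`. The sketch below decodes a single made-up record (not taken from the dataset) and slices the text to recover the labeled entity:

```python
import json

# Hypothetical record in the Doccano JSONL layout shown above
line = (
    '{"id": 1, "text": "crude oil prices fell", "meta": {}, '
    '"annotation_approver": null, "labels": [[6, 9, "OIL"]]}'
)

record = json.loads(line)

# Slice the text with each label's character offsets to recover the entity
spans = [
    (label, record["text"][start:end])
    for start, end, label in record["labels"]
]
print(spans)
```

Checking a few records this way is a quick sanity test that the exported offsets line up with the text you labeled.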

In [1]:
import json
labeled_data = []
with open(r"emails_labeled.jsonl", "r") as read_file:
    for line in read_file:
        data = json.loads(line)
        labeled_data.append(data)

#### Convert the data to the spaCy format

Next, let's convert the Doccano format to the spaCy format.

We will also drop the extra fields and rename `labels` to `entities`.

In [2]:
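TRAINING_DATA = []
for entry in labeled_data:
    # Each Doccano label is [start_char, end_char, label]; spaCy expects tuples
    entities = [(start, end, label) for start, end, label in entry['labels']]
    # Keep only the text and its entities; the other fields are dropped
    spacy_entry = (entry['text'], {"entities": entities})
    TRAINING_DATA.append(spacy_entry)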
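Manual labeling can leave spans that run past the end of the text or overlap each other, and overlapping entity spans are a common source of training errors. As a rough sketch, a check like the following could be run over each converted entry before training. `find_bad_spans` is a hypothetical helper, not part of spaCy or the original notebook, and the sample data is made up:

```python
def find_bad_spans(text, entities):
    """Return the entity spans that are out of bounds, empty, or overlapping."""
    bad = []
    previous_end = 0
    for start, end, label in sorted(entities):
        out_of_bounds = start < 0 or end > len(text)
        empty = start >= end
        overlaps = start < previous_end
        if out_of_bounds or empty or overlaps:
            bad.append((start, end, label))
        else:
            previous_end = end
    return bad


# Made-up entry in the (text, {"entities": [...]}) layout built above
sample_text = "crude oil prices fell"
sample_entities = [(6, 9, "OIL"), (8, 12, "OIL"), (30, 35, "OIL")]
print(find_bad_spans(sample_text, sample_entities))
```

In the notebook, the same function could be applied as `find_bad_spans(text, annotations["entities"])` for every `(text, annotations)` pair in `TRAINING_DATA`.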

### Train the model!

Use deep learning (a neural network) with a 0.3 dropout rate to avoid overfitting.

The idea is to use a neural network with several layers and a large number of neurons. We present it with text that has already been labeled, so the correct answer is known. We run many iterations, and on each one we measure the error with a loss function and use it to adjust the weights of the connections between neurons. Over time, these adjustments reduce the error, so the network learns to produce the right labels.

To avoid overfitting, which means the model "memorises" the training data and does not perform well on new data, we switch off a random subset of neurons on each iteration. This forces the model to generalise instead of relying on any single neuron.
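To make the dropout idea concrete, here is a minimal pure-Python sketch of the common "inverted dropout" variant, in which each activation is zeroed with probability `rate` and the survivors are rescaled. It is only an illustration with made-up numbers; the `dropout` function below is hypothetical, and spaCy's internal implementation may differ in its details:

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    keep = 1.0 - rate
    return [
        value / keep if rng.random() >= rate else 0.0
        for value in activations
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
layer_output = [0.5, 1.2, 0.8, 2.0, 0.1]

# A different random subset is silenced on every training iteration
for iteration in range(3):
    print(iteration, dropout(layer_output, rate=0.3, rng=rng))
```

Because a different subset is dropped each time, the network cannot rely on any one neuron to remember a specific training example.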

Remember to install spaCy first:
```
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
```

In [3]:
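# Import modules
import random

import spacy
from spacy.training import Example

# NOTE: this cell uses the spaCy v3 API, which is what `pip install -U spacy` installs.
# Start from a blank English pipeline and add an NER component with our label
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("OIL")

# Initialize the weights and get the optimizer
optimizer = nlp.initialize()

# Loop for 40 iterations
for itn in range(40):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        # Pair each raw text with its gold-standard annotations
        examples = [
            Example.from_dict(nlp.make_doc(text), annotations)
            for text, annotations in batch
        ]

        # Update the model with a 0.3 dropout rate to reduce overfitting
        nlp.update(examples, sgd=optimizer, losses=losses, drop=0.3)
    print(losses)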

{'ner': 1915.9444246831372}
{'ner': 107.82956493094102}


{'ner': 114.18649069852805}
{'ner': 30.69304090604681}
{'ner': 2.103905114710004}
{'ner': 3.85198211228822}
{'ner': 1.9485266517025366}
{'ner': 0.004584697754935514}
{'ner': 0.0006568707259644221}
{'ner': 0.25235692383820013}
{'ner': 1.399154501044839}
{'ner': 0.005002993828410104}
{'ner': 0.0007396650207534202}
{'ner': 3.399979473944603e-07}
{'ner': 8.389657480840494e-07}
{'ner': 1.4286800035740475e-09}
{'ner': 4.3158335529742466e-07}
{'ner': 1.0808071583551658e-06}
{'ner': 3.095908195536521e-08}
{'ner': 2.361087045811699e-06}
{'ner': 7.632661516871893e-05}
{'ner': 9.230668441407683e-10}
{'ner': 2.430684610377712e-09}
{'ner': 2.0937108988823943e-06}
{'ner': 3.121169423770219e-08}
{'ner': 0.0005652823139269539}
{'ner': 0.003151026469285797}
{'ner': 8.318149025712086e-08}
{'ner': 1.707827043826918e-05}
{'ner': 0.0013131838653112446}
{'ner': 0.0024066308325335723}
{'ner': 3.7774335260993454e-05}
{'ner': 0.4828146020814521}
{'ner': 0.004459282312749806}
{'ner': 4.888743675530538e-07}
...

You should see the loss decrease as the iterations go by. Note that it may occasionally increase from one iteration to the next because of the dropout setting.
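The loss printed above is measured on the training data only, so a value close to zero does not by itself show that the model generalises. One common check, which the original notebook does not perform, is to hold out part of the labeled emails before training and inspect the predictions on them afterwards. A minimal sketch, where `split_train_dev` is a hypothetical helper and the data is a made-up stand-in for `TRAINING_DATA`:

```python
import random


def split_train_dev(data, dev_fraction=0.2, seed=0):
    """Shuffle a copy of `data` and split it into training and held-out parts."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    dev_size = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[dev_size:], shuffled[:dev_size]


# Stand-in for TRAINING_DATA: ten made-up (text, annotations) pairs
toy_data = [(f"email {i}", {"entities": []}) for i in range(10)]
train_data, dev_data = split_train_dev(toy_data)
print(len(train_data), len(dev_data))
```

The training loop would then iterate over `train_data` only, leaving `dev_data` for a manual or automated comparison against the labels.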

#### Save the model to disk

In [4]:
nlp.to_disk("oil.model")

## Test the model

Let's test the model. For this we use displaCy, which renders the detected entities inline in the text.

In [5]:
from spacy import displacy
example = "service postings marathon petroleum co said it reduced the contract price it will pay for all grades of service oil one dlr a barrel effective today the decrease brings marathon s posted price for both west texas intermediate and west texas sour to dlrs a bbl the south louisiana sweet grade of service was reduced to dlrs a bbl the company last changed its service postings on jan reuter"

In [6]:
doc = nlp(example)
displacy.render(doc, style='ent')
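Outside a notebook, or when the results need to be processed further, the entities can also be read directly from `doc.ents`. The sketch below collects them as plain tuples. `entity_rows` is a hypothetical helper, and a stand-in object is used so the snippet runs without the trained model:

```python
from types import SimpleNamespace


def entity_rows(doc):
    """Collect (text, start_char, end_char, label) for each entity in a doc."""
    return [
        (ent.text, ent.start_char, ent.end_char, ent.label_)
        for ent in doc.ents
    ]


# Stand-in object so the sketch runs without a trained model;
# in the notebook you would call entity_rows(nlp(example)) instead.
fake_doc = SimpleNamespace(
    ents=[SimpleNamespace(text="oil", start_char=6, end_char=9, label_="OIL")]
)
print(entity_rows(fake_doc))
```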

### Conclusion

This section has shown how to label data with **Doccano** and create a custom model with **spaCy**. You can now customize your own model with spaCy. Happy exploring!

## Phrase Matching

The second approach is to use pattern matching to look for certain keywords and patterns in the text. 

**spaCy** provides matchers that make it easy to look for specific words, digits, etc. We can also write rules based on part-of-speech tags.

In [7]:
import spacy
# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(example)

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Add two single-token patterns; "LOWER" compares the lowercased token text.
# The patterns are passed together as one list (spaCy v3 signature).
matcher.add("OIL_PATTERN", [[{"LOWER": "oil"}], [{"LOWER": "petroleum"}]])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['petroleum', 'oil']


The matcher finds the same oil-related terms, but it only matches the exact words listed in the patterns, so typos and similar words are not detected.
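To illustrate that limitation, the sketch below uses Python's standard `difflib` module to tolerate small typos in a fixed keyword list. This is a swapped-in technique for illustration only, not a spaCy feature and not part of the original notebook. `fuzzy_oil_tokens`, the `0.8` cutoff, and the sample sentence are all assumptions:

```python
import difflib

KEYWORDS = ["oil", "petroleum"]


def fuzzy_oil_tokens(text, keywords=KEYWORDS, cutoff=0.8):
    """Return (token, keyword) pairs for tokens that closely resemble a keyword."""
    hits = []
    for token in text.lower().split():
        # Best-matching keyword whose similarity ratio is at least `cutoff`
        close = difflib.get_close_matches(token, keywords, n=1, cutoff=cutoff)
        if close:
            hits.append((token, close[0]))
    return hits


# "petrolium" is a deliberate typo that an exact pattern would miss
print(fuzzy_oil_tokens("marathon petrolium co cut its oil price"))
```

Lowering the cutoff catches more typos but also more unrelated words, so the threshold has to be tuned by hand. A trained model avoids this kind of manual tuning.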

## Conclusion

We have seen two different approaches to entity recognition.

**Pattern Matching** can be used in the following use cases:
- Low-cardinality attributes
- Common patterns such as dates, quantities, numbers, etc.
- Patterns tied to specific parts of speech
- When typos are not expected
- Structured data



**Statistical Models** are great at learning complex patterns in the data and can "guess" and categorize data never seen before. Use cases:
- High-cardinality attributes
- You need to deal with typos (fuzzy matches)
- You need to categorize new, never-seen data
- Unstructured data

These models are much more powerful, since they can make decisions about inputs they were never trained on and can detect new entities without any code change.
