
# Named Entity Recognition using spaCy

Consider a scenario where the user asks a search query—“Where was Albert Einstein born?”—using Google search.

<img src='https://github.com/practical-nlp/practical-nlp-figures/raw/master/figures/5-5.png?raw=1' width='800'/>

To be able to show “Ulm, Germany” for this query, the search engine needs to decipher that Albert Einstein is a person before going on to look for a place of birth. This is an example of NER in action in a real-world application.

**NER refers to the IE task of identifying the entities in a document. Entities are typically names of persons, locations, and organizations, and other specialized strings, such as money expressions, dates, products, names/numbers of laws or articles, and so on. NER is an important step in the pipeline of several NLP applications involving information extraction.**

<img src='https://github.com/practical-nlp/practical-nlp-figures/raw/master/figures/5-6.png?raw=1' width='800'/>

As seen in the figure, for a given text, NER is expected to identify person names, locations, dates, and other entities. The entity categories shown here are among those commonly used in NER system development.

**NER is a prerequisite for being able to do other IE tasks, such as relation extraction or event extraction**.

NER is also useful in applications such as machine translation, since names often do not need to be translated along with the rest of a sentence. There is clearly a range of scenarios where NER is a major component, and it’s one of the common tasks you’re likely to encounter in NLP projects in industry.

## Building an NER System

A simple approach to building an NER system is to maintain a large collection of person/organization/location names that are most relevant to our company (e.g., names of all clients, cities in their addresses, etc.); this is typically referred to as a gazetteer. To check whether a given word is a named entity, just look it up in the gazetteer. If a gazetteer covers a large share of the entities in our data, it’s a great way to start, especially when we don’t have an existing NER system available.
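
As a rough illustration of the lookup idea, here is a minimal sketch of a gazetteer-based tagger. The gazetteer entries, the label names, and the `gazetteer_ner` function are hypothetical and chosen only for this example; a real gazetteer would be far larger and would need more careful tokenization.

```python
# Hypothetical gazetteer: entity label -> known names (lowercased).
GAZETTEER = {
    "PERSON": {"albert einstein"},
    "LOCATION": {"ulm", "germany"},
    "ORGANIZATION": {"google"},
}

# Invert to a phrase -> label table keyed by token tuples for fast lookup.
PHRASES = {
    tuple(name.split()): label
    for label, names in GAZETTEER.items()
    for name in names
}
MAX_LEN = max(len(phrase) for phrase in PHRASES)


def gazetteer_ner(text):
    """Return (entity_text, label) pairs found by longest-match lookup."""
    # Very crude tokenization: drop a few punctuation marks, split on spaces.
    for mark in "?,.":
        text = text.replace(mark, " ")
    tokens = text.split()

    entities = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in PHRASES:
                entities.append((" ".join(tokens[i:i + n]), PHRASES[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return entities


print(gazetteer_ner("Where was Albert Einstein born? In Ulm, Germany."))
# [('Albert Einstein', 'PERSON'), ('Ulm', 'LOCATION'), ('Germany', 'LOCATION')]
```

A lookup like this cannot tag names that are missing from the list, and it cannot resolve ambiguous names, which is what motivates the approaches that follow.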

An approach that goes beyond a lookup table is rule-based NER, which relies on a compiled list of patterns over word tokens and POS tags.

For example, a pattern “NNP was born,” where “NNP” is the POS tag for a proper noun, indicates that the word that was tagged “NNP” refers to a person. Such rules can be programmed to cover as many cases as possible to build a rule-based NER system. 

Two tools that provide functionality for implementing your own rule-based NER are:

1. **[Stanford NLP’s RegexNER](https://nlp.stanford.edu/software/regexner.html)**
2. **[spaCy’s EntityRuler](https://spacy.io/usage/rule-based-matching#entityruler)**
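
The “NNP was born” pattern described above can be sketched in plain Python. This is not how RegexNER or EntityRuler work internally; it is only a toy rule applied to tokens whose POS tags are assumed to be given (here they are assigned by hand), and the function name `tag_persons_born` is made up for this example.

```python
def tag_persons_born(tagged_tokens):
    """Apply the rule '<NNP>+ was born' to a list of (word, pos) pairs.

    Returns one label per token: 'PERSON' for proper nouns that directly
    precede "was born", and 'O' for everything else.
    """
    words = [word.lower() for word, _ in tagged_tokens]
    labels = ["O"] * len(tagged_tokens)
    for i in range(len(tagged_tokens) - 1):
        if words[i] == "was" and words[i + 1] == "born":
            # Walk left over the run of proper nouns just before "was born".
            j = i - 1
            while j >= 0 and tagged_tokens[j][1] == "NNP":
                labels[j] = "PERSON"
                j -= 1
    return labels


# POS tags are written by hand here; normally a tagger would supply them.
sentence = [
    ("Albert", "NNP"), ("Einstein", "NNP"), ("was", "VBD"),
    ("born", "VBN"), ("in", "IN"), ("Ulm", "NNP"),
]
for (word, _), label in zip(sentence, tag_persons_born(sentence)):
    print(word, label)
```

Note that “Ulm” is also tagged NNP but stays unlabeled, because this single rule only fires on the proper nouns before “was born”. A usable rule-based system needs many such rules.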

A more practical approach to NER is to train an ML model that can predict the named entities in unseen text. For each word, a decision has to be made whether or not that word is an entity and, if it is, what type of entity it is. In many ways, this is very similar to a classification problem.

**The only difference here is that NER is a “sequence labeling” problem.**

The typical classifiers predict labels for texts independent of their surrounding context. Consider a classifier that classifies sentences in a movie review into positive/negative/neutral categories based on their sentiment. This classifier does not (usually) take into account the sentiment of previous (or subsequent) sentences when classifying the current sentence.

**In a sequence classifier, such context is important. A common use case for sequence labeling is POS tagging, where we need information about the parts of speech of surrounding words to estimate the part of speech of the current word. NER is traditionally modeled as a sequence classification problem, where the entity prediction for the current word also depends on the context.**

For example, if the previous word was a person name, there’s a higher probability that the current word is also a person name if it’s a noun (e.g., first and last names).

To illustrate the difference between a normal classifier and a sequence classifier, consider the following sentence: “Washington is a rainy state.” When a normal classifier sees this sentence and has to classify it word by word, it has to make a decision as to whether Washington refers to a person (e.g., George Washington) or the State of Washington without looking at the surrounding words. It’s possible to classify the word “Washington” in this particular sentence as a location only after looking at the context in which it’s being used. It’s for this reason that sequence classifiers are used for training NER models.
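
One way to see why context matters is to compare the features a model receives for “Washington” with and without its neighbors. The sketch below is only illustrative: the feature names and both functions are assumptions made for this example, and a real sequence classifier would also model dependencies between neighboring labels, not just neighboring words.

```python
def context_free_features(tokens, i):
    """Features that describe only the current word."""
    word = tokens[i]
    return {"word.lower": word.lower(), "word.istitle": word.istitle()}


def context_features(tokens, i):
    """Features for tokens[i] plus a one-word window on each side."""
    features = context_free_features(tokens, i)
    features["prev.lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    features["next.lower"] = (
        tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    )
    return features


s1 = "Washington is a rainy state .".split()
s2 = "Washington was elected president .".split()

# Without context, the two occurrences of "Washington" look identical,
# so any classifier must give them the same label.
print(context_free_features(s1, 0) == context_free_features(s2, 0))  # True

# With a context window, the two occurrences become distinguishable.
print(context_features(s1, 0) == context_features(s2, 0))  # False
```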

**Conditional random fields (CRFs) are among the most popular algorithms for training sequence classifiers.**

Recent advances in NER research either drop hand-crafted feature engineering or augment it with neural network models. [NCRF++](https://github.com/jiesutd/NCRFpp) is a library that can be used to train your own NER model using different neural network architectures.

In this notebook, we will use the spaCy command line to train and evaluate an NER model. We will also compare it with the pretrained NER model in spaCy.

Note: we will create several folders during this experiment: Data, spacyNER_data, model, result, and pretrained_result.


### Loading The Data

Loading the dataset is straightforward. The CoNLL-2003 dataset used here is already split into train, dev (valid.txt), and test sets, so we’ll train the model using the training set.

In [None]:
%%shell

mkdir Data
cd Data
mkdir conll2003

In [10]:
%%shell

wget https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch5/Data/conll2003/en/train.txt
wget https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch5/Data/conll2003/en/test.txt
wget https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch5/Data/conll2003/en/valid.txt

mv *.txt Data/conll2003

--2021-01-22 11:14:33--  https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch5/Data/conll2003/en/train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3283420 (3.1M) [text/plain]
Saving to: ‘train.txt’


2021-01-22 11:14:33 (77.0 MB/s) - ‘train.txt’ saved [3283420/3283420]

--2021-01-22 11:14:33--  https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch5/Data/conll2003/en/test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 748095 (731K) [text/plain]
Saving to: ‘test.txt’


2021-01-22 11:14:34 (48.



### Converting data to JSON structures so it can be used by spaCy

In [12]:
%%shell

# Read the CoNLL data from the Data/conll2003 folder and store the converted data in the spacyNER_data folder
mkdir spacyNER_data

# mkdir creates the folder if it doesn't exist. If it does, the output shows a message that it already exists and cannot be created again
python3 -m spacy convert "Data/conll2003/train.txt" spacyNER_data -c ner
python3 -m spacy convert "Data/conll2003/test.txt" spacyNER_data -c ner
python3 -m spacy convert "Data/conll2003/valid.txt" spacyNER_data -c ner

mkdir: cannot create directory ‘spacyNER_data’: File exists
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (14987 documents): spacyNER_data/train.json
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3684 documents): spacyNER_data/test.json
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3466 documents): spacyNER_data/valid.json




**For example, the data before and after running spaCy's convert program looks as follows.**

In [13]:
!echo "BEFORE : (Data/conll2003/train.txt)"
!head "Data/conll2003/train.txt" -n 11 | tail -n 9
!echo "AFTER : (spacyNER_data/train.json)"
!head "spacyNER_data/train.json" -n 64 | tail -n 49

BEFORE : (Data/conll2003/train.txt)
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
AFTER : (spacyNER_data/train.json)
        ]
      }
    ]
  },
  {
    "id":1,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"EU",
                "tag":"NNP",
                "ner":"U-ORG"
              },
              {
                "orth":"rejects",
                "tag":"VBZ",
                "ner":"O"
              },
              {
                "orth":"German",
                "tag":"JJ",
                "ner":"U-MISC"
              },
              {
                "orth":"call",
                "tag":"NN",
                "ner":"O"
              },
              {
                "orth":"to",
                "tag":"TO",
                "ner":"O"
              },
              {
                "orth":"boyc
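
The change from B-ORG in the CoNLL file to U-ORG in the JSON reflects a change of tagging scheme: the input uses B-/I-/O tags, while spaCy's training format uses BILUO (Begin, In, Last, Unit, Out), where a single-token entity gets a U- tag. The sketch below shows the idea on the sentence above. It is a standalone illustration, not spaCy's own converter; the function names are hypothetical, and it assumes every entity starts with a B- tag and every row has exactly four whitespace-separated columns.

```python
def read_conll(lines):
    """Group token-per-line rows into sentences of (word, pos, ner) triples."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # Blank lines and -DOCSTART- rows separate sentences/documents.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, _chunk, ner = line.split()
        current.append((word, pos, ner))
    if current:
        sentences.append(current)
    return sentences


def bio_to_biluo(tags):
    """Convert B-/I-/O tags to BILUO tags (assumes entities start with B-)."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append("O")
            continue
        label = tag[2:]
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        entity_continues = next_tag == "I-" + label
        if tag.startswith("B-"):
            converted.append(tag if entity_continues else "U-" + label)
        else:  # an I- tag inside an entity
            converted.append(tag if entity_continues else "L-" + label)
    return converted


rows = [
    "EU NNP B-NP B-ORG", "rejects VBZ B-VP O", "German JJ B-NP B-MISC",
    "call NN I-NP O", "to TO B-VP O", "boycott VB I-VP O",
    "British JJ B-NP B-MISC", "lamb NN I-NP O", ". . O O",
]
sentence = read_conll(rows)[0]
print(bio_to_biluo([ner for _, _, ner in sentence]))
# ['U-ORG', 'O', 'U-MISC', 'O', 'O', 'O', 'U-MISC', 'O', 'O']
```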

### Training the NER model with spaCy (CLI)

All the command-line options are listed at https://spacy.io/api/cli#train. We train using spaCy's train command for English (en), and the results are stored in a folder called "model", which is created during training.

Our training file is "spacyNER_data/train.json" and the validation file is "spacyNER_data/valid.json".

In the spaCy v2 CLI used here, -G is the short form of --gold-preproc, which uses gold preprocessing (the tokenization given in the data) during training. The GPU option is the lowercase -g, which is not set here, so training runs on the CPU. -p stands for pipeline and should be followed by a comma-separated list of components; in this case, a tagger and an NER component are trained together.

In [14]:
!python3 -m spacy train en model spacyNER_data/train.json spacyNER_data/valid.json -G -p tagger,ner

✔ Created output directory: model
Training pipeline: ['tagger', 'ner']
Starting with blank model 'en'
Counting training words (limit=0)

Itn  Tag Loss    Tag %    NER Loss   NER P   NER R   NER F   Token %  CPU WPS
---  ---------  --------  ---------  ------  ------  ------  -------  -------
  1  31189.805    94.120  16526.115  84.840  83.726  84.279  100.000    15578
  2  16787.411    94.860   7625.323  87.322  86.469  86.893  100.000    15437
  3  13609.065    95.196   5282.025  88.016  87.513  87.764  100.000    15512
  4  11813.764    95.254   3974.093  88.230  87.681  87.955  100.000    15581
  5  10422.525    95.345   3011.552  88.957  88.522  88.739  100.000    15152
  6   9633.024    95.479   2636.098  88.970  88.506  88.737  100.000    15658
  7   9058.265    95.545   2277.682  89.294  88.708  89.000  100.000    15380
  8   8254.984    95.543   2156.480  89.273  88.657  88.964  100.000    15318
  9   7913.531    95.566   1895.878  89.226  8

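Notice how both losses fall steadily and the NER F-score generally improves across iterations. The largest gains come in the first few iterations, after which the score levels off and occasionally dips slightly (for example, at iterations 6 and 8).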

### Evaluating the model on the test set

**On Trained model (model/model-best)**

In [15]:
%%shell

# create a folder to store the output and visualizations. 
mkdir result

python3 -m spacy evaluate model/model-best spacyNER_data/test.json -dp result
# python -m spacy evaluate model/model-final data/test.txt.json -dp result

Time      3.23 s
Words     46666 
Words/s   14432 
TOK       100.00
POS       95.21 
UAS       0.00  
LAS       0.00  
NER P     81.72 
NER R     82.26 
NER F     81.99 
Textcat   0.00  

✔ Generated 25 parses as HTML
result




A visualization of the entity-tagged test data can be found in the file result/entities.html.
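
The NER P, NER R, and NER F rows are precision, recall, and their harmonic mean (F-score). The sketch below shows how these three numbers relate, and how entity-level scores can be computed when an entity counts as correct only if its span and label both match exactly. That exact-match convention is an assumption about the scorer rather than something shown in the output above, and the function names and example spans are hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def entity_prf(gold_spans, pred_spans):
    """Exact-match entity scores; spans are (start, end, label) tuples."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Recompute NER F from the reported NER P and NER R for model-best.
print(round(f_score(81.72, 82.26), 2))  # 81.99

# Toy example: one of two predictions is right, one of three gold
# entities is found, so P = 0.5 and R = 1/3.
gold = [(0, 1, "ORG"), (2, 3, "MISC"), (6, 7, "MISC")]
pred = [(0, 1, "ORG"), (2, 3, "ORG")]
print(entity_prf(gold, pred))
```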

In [19]:
%%shell

# Evaluate the last-iteration model (model/model-final) for comparison. The result folder already exists from the previous cell.
mkdir result

python -m spacy evaluate model/model-final spacyNER_data/test.json -dp result

mkdir: cannot create directory ‘result’: File exists

Time      3.30 s
Words     46666 
Words/s   14157 
TOK       100.00
POS       95.21 
UAS       0.00  
LAS       0.00  
NER P     81.82 
NER R     81.76 
NER F     81.79 
Textcat   0.00  

✔ Generated 25 parses as HTML
result




**On spaCy's pretrained NER model (en)**

In [16]:
%%shell

mkdir pretrained_result

python3 -m spacy evaluate en spacyNER_data/test.json -dp pretrained_result

Time      5.73 s
Words     46666 
Words/s   8144  
TOK       100.00
POS       86.21 
UAS       0.00  
LAS       0.00  
NER P     6.51  
NER R     9.17  
NER F     7.62  
Textcat   0.00  

✔ Generated 25 parses as HTML
pretrained_result




A visualization of the entity-tagged test data can be found in the file pretrained_result/entities.html.
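
The very low NER scores for the pretrained model should be read with care. One likely contributor, which this notebook does not verify, is a label mismatch: spaCy's pretrained English models predict OntoNotes-style labels such as PERSON and GPE, while the CoNLL-2003 test data uses PER, LOC, ORG, and MISC, so a correctly found person still counts as an error. A fairer comparison would map the labels first. The mapping below is a hypothetical choice made for illustration, not an official correspondence.

```python
# Hypothetical mapping from OntoNotes-style labels to CoNLL-2003 labels.
# Types with no CoNLL counterpart (DATE, MONEY, CARDINAL, ...) are dropped.
ONTONOTES_TO_CONLL = {
    "PERSON": "PER",
    "GPE": "LOC",
    "LOC": "LOC",
    "FAC": "LOC",
    "ORG": "ORG",
    "NORP": "MISC",
    "EVENT": "MISC",
    "LANGUAGE": "MISC",
    "WORK_OF_ART": "MISC",
}


def remap_entities(entities):
    """Map (text, label) pairs to CoNLL labels, dropping unmapped types."""
    return [
        (text, ONTONOTES_TO_CONLL[label])
        for text, label in entities
        if label in ONTONOTES_TO_CONLL
    ]


# Hand-written example predictions, not actual model output.
predicted = [("Albert Einstein", "PERSON"), ("Ulm", "GPE"), ("1879", "DATE")]
print(remap_entities(predicted))
# [('Albert Einstein', 'PER'), ('Ulm', 'LOC')]
```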