<a href="https://colab.research.google.com/github/rahiakela/practical-natural-language-processing/blob/chapter-5-information-extraction/5_named_entity_recognition_issues.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition issues 

Consider a scenario where a user types the query “Where was Albert Einstein born?” into Google search.

<img src='https://github.com/practical-nlp/practical-nlp-figures/raw/master/figures/5-5.png?raw=1' width='800'/>

To be able to show “Ulm, Germany” for this query, the search engine needs to decipher that Albert Einstein is a person before going on to look for a place of birth. This is an example of NER in action in a real-world application.

**NER refers to the information extraction (IE) task of identifying the entities in a document. Entities are typically names of persons, locations, and organizations, as well as other specialized strings such as money expressions, dates, products, and names/numbers of laws or articles. NER is an important step in the pipeline of several NLP applications involving information extraction.**

<img src='https://github.com/practical-nlp/practical-nlp-figures/raw/master/figures/5-6.png?raw=1' width='800'/>

As seen in the figure, for a given text, NER is expected to identify person names, locations, dates, and other entities. Different categories of entities identified here are some of the ones commonly used in NER system development.

**NER is a prerequisite for being able to do other IE tasks, such as relation extraction or event extraction**.

NER is also useful in other applications like machine translation, since names often do not need to be translated when translating a sentence. So there’s a range of scenarios in NLP projects where NER is a major component, and it’s one of the common tasks you’re likely to encounter in NLP projects in industry.

## Building an NER System

A simple approach to building an NER system is to maintain a large collection of person/organization/location names that are the most relevant to our company (e.g., names of all clients, cities in their addresses, etc.); this is typically referred to as a gazetteer. To check whether a given word is a named entity, we just look it up in the gazetteer. If a large number of the entities present in our data are covered by a gazetteer, then it’s a great way to start, especially when we don’t have an existing NER system available.
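The lookup described above can be sketched in a few lines of plain Python. This is only a minimal illustration: the `GAZETTEER` entries, the `gazetteer_ner` helper, and the whitespace tokenization are all assumptions made for the example, not part of any library.

```python
# Hypothetical gazetteer: lowercased entity names mapped to entity types.
# In practice this would be loaded from our own client and city lists.
GAZETTEER = {
    "albert einstein": "PERSON",
    "apple": "ORG",
    "san francisco": "LOC",
    "ulm": "LOC",
}

# Longest entry length in tokens, so multi-word names are tried first.
MAX_LEN = max(len(name.split()) for name in GAZETTEER)

PUNCT = ".,?!\"'"


def gazetteer_ner(text):
    """Return (entity_text, entity_type) pairs found by dictionary lookup."""
    tokens = text.split()
    entities = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).strip(PUNCT)
            if candidate.lower() in GAZETTEER:
                match = (candidate, GAZETTEER[candidate.lower()], n)
                break
        if match:
            entities.append(match[:2])
            i += match[2]  # skip past the matched span
        else:
            i += 1
    return entities


print(gazetteer_ner("Where was Albert Einstein born? He was born in Ulm."))
```

Because the lookup ignores context, this sketch would also tag the fruit in “an apple a day” as an organization, which is one reason a gazetteer is usually a starting point rather than a complete solution.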

An approach that goes beyond a lookup table is rule-based NER, which relies on a compiled list of patterns over word tokens and POS tags.

For example, a pattern “NNP was born,” where “NNP” is the POS tag for a proper noun, indicates that the word that was tagged “NNP” refers to a person. Such rules can be programmed to cover as many cases as possible to build a rule-based NER system. 
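As a rough sketch, the “NNP was born” rule could be written over a list of (word, POS tag) pairs. The `born_rule` function and the hand-tagged sentence are hypothetical; in a real system the tags would come from a POS tagger and many such rules would be combined.

```python
def born_rule(tagged_tokens):
    """Label NNP tokens that are followed by "was born" as PERSON.

    tagged_tokens is a list of (word, pos_tag) pairs, assumed to come
    from an upstream POS tagger.
    """
    entities = []
    for i in range(len(tagged_tokens) - 2):
        word, tag = tagged_tokens[i]
        next_words = [w.lower() for w, _ in tagged_tokens[i + 1:i + 3]]
        if tag == "NNP" and next_words == ["was", "born"]:
            entities.append((word, "PERSON"))
    return entities


# Hand-tagged example sentence (tags are illustrative, not tagger output).
sentence = [
    ("Einstein", "NNP"), ("was", "VBD"), ("born", "VBN"),
    ("in", "IN"), ("Ulm", "NNP"), (".", "."),
]
print(born_rule(sentence))
```

Note that “Ulm” is also tagged NNP but is not labeled, since it is not followed by “was born”; a separate rule would be needed to catch locations.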

The following tools provide functionality for implementing your own rule-based NER:

1. **[Stanford NLP’s RegexNER](https://nlp.stanford.edu/software/regexner.html)**
2. **[spaCy’s EntityRuler](https://spacy.io/usage/rule-based-matching#entityruler)**

A more practical approach to NER is to train an ML model that can predict the named entities in unseen text. For each word, a decision has to be made as to whether or not that word is an entity and, if it is, what type of entity it is. In many ways, this is very similar to a classification problem.

**The only difference here is that NER is a “sequence labeling” problem.**

Typical classifiers predict a label for each text independently of its surrounding context. Consider a classifier that classifies sentences in a movie review into positive/negative/neutral categories based on their sentiment. This classifier does not (usually) take into account the sentiment of previous (or subsequent) sentences when classifying the current sentence.

**In a sequence classifier, such context is important. A common use case for sequence labeling is POS tagging, where we need information about the parts of speech of surrounding words to estimate the part of speech of the current word. NER is traditionally modeled as a sequence classification problem, where the entity prediction for the current word also depends on the context.**

For example, if the previous word was a person name, there’s a higher probability that the current word is also a person name if it’s a noun (e.g., first and last names).

To illustrate the difference between a normal classifier and a sequence classifier, consider the following sentence: “Washington is a rainy state.” When a normal classifier sees this sentence and has to classify it word by word, it has to decide whether Washington refers to a person (e.g., George Washington) or the State of Washington without looking at the surrounding words. It’s possible to classify the word “Washington” in this particular sentence as a location only after looking at the context in which it’s being used. It’s for this reason that sequence classifiers are used for training NER models.
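One common way to expose this context to a model is to describe each token with features drawn from its neighbours as well as from the token itself. The sketch below is an assumption-laden illustration: the `token_features` helper, the feature names, and the second example sentence are made up for this example and do not come from any particular library.

```python
def token_features(tokens, i):
    """Describe token i using the token itself and its neighbours."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        # Context features: exactly what a context-free classifier lacks.
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


state_sent = "Washington is a rainy state .".split()
# Hypothetical contrasting sentence where Washington is a person.
person_sent = "Washington crossed the river .".split()

print(token_features(state_sent, 0))
print(token_features(person_sent, 0))
```

The word-level features for “Washington” are identical in both sentences; only the neighbouring-word feature differs, and that is the signal a sequence model can use to tell the two readings apart.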

**Conditional random fields (CRFs) are one of the popular algorithms for training sequence classifiers.**

Recent advances in NER research either replace or augment hand-crafted feature engineering with neural network models. [NCRF++](https://github.com/jiesutd/NCRFpp) is another library that can be used to train your own NER model using different neural network architectures.

In this notebook, we will take a look at using the spaCy command line to train and evaluate an NER model. We will also compare it with the pretrained NER model in spaCy.

Note: we will create multiple folders during this experiment: spacyNER_data


## Practical Advice

Although state-of-the-art NER is highly accurate (with F1 scores over 90% using standard evaluation frameworks for NER in NLP research), there are several issues to keep in mind when using NER in our own software applications. Here are a couple of caveats based on our own experience with developing NER systems:

- NER is very sensitive to the format of its input. It’s more accurate with well-formatted plain text than with, say, a PDF document from which plain text needs to be extracted first. While it’s possible to build custom NER systems for specific domains or for data like tweets, the challenge with PDFs comes from the difficulty of extracting text from them with 100% accuracy while preserving the structure. Why do we need to be so accurate in extracting the structure from PDFs, though? In PDFs, partial sentences, headings, and formatting are common, and they can all hurt NER accuracy. There’s no single solution for this. One approach is to do custom post-processing of PDFs to extract blobs of text, then run NER on the blobs.

- NER is also very sensitive to the accuracy of the prior steps in its processing pipeline: sentence splitting, tokenization, and POS tagging. To understand how improper sentence splitting can affect NER results, look at the output from spaCy for the news excerpt later in this section. Some amount of pre-processing may therefore be necessary before passing a piece of text into an NER model to extract entities.
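The two caveats above both point toward cleaning up text before it reaches the NER model. Here is a minimal sketch of one such pre-processing step, assuming that blank lines separate meaningful blobs in the extracted text. The `text_blobs` helper and the sample `raw` string are hypothetical, and real PDF output will usually need more specific handling.

```python
import re


def text_blobs(raw_text):
    """Split raw text into paragraph-like blobs with normalized whitespace."""
    blobs = []
    # Treat one or more blank lines as a boundary between blobs.
    for chunk in re.split(r"\n\s*\n", raw_text):
        # Collapse line breaks and repeated spaces inside a blob.
        cleaned = re.sub(r"\s+", " ", chunk).strip()
        if cleaned:
            blobs.append(cleaned)
    return blobs


# Hypothetical text as it might come out of a PDF-to-text tool.
raw = """Quarterly   Report

Apple said it would buy back
$75 billion in stock.


Luca Maestri is Apple's finance chief.
"""

for blob in text_blobs(raw):
    print(repr(blob))
```

Each blob could then be passed to the NER pipeline separately. Whether this actually improves sentence boundaries or entity predictions depends on the data and the model, so it is worth checking on a sample of real documents.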

In [None]:
!python -m spacy download en_core_web_sm

In [4]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [5]:
mytext = """
SAN FRANCISCO — Shortly after Apple used a new tax law last year to bring back most of the $252 billion it had held abroad, the company said it would buy back $100 billion of its stock.

On Tuesday, Apple announced its plans for another major chunk of the money: It will buy back a further $75 billion in stock.

“Our first priority is always looking after the business and making sure we continue to grow and invest,” Luca Maestri, Apple’s finance chief, said in an interview. “If there is excess cash, then obviously we want to return it to investors.”

Apple’s record buybacks should be welcome news to shareholders, as the stock price is likely to climb. But the buybacks could also expose the company to more criticism that the tax cuts it received have mostly benefited investors and executives.
"""

doc = nlp(mytext)
for ent in doc.ents:
  print(ent.text, "\t", ent.label_)

SAN FRANCISCO 	 GPE
Apple 	 ORG
last year 	 DATE
$252 billion 	 MONEY
$100 billion 	 MONEY
Tuesday 	 DATE
Apple 	 ORG
$75 billion 	 MONEY
first 	 ORDINAL
Luca Maestri 	 PERSON
Apple 	 ORG
Apple 	 ORG


In [6]:
# A human reader sees 6 sentences in this text. How many does spaCy see?
count = 0

for sent in doc.sents:
  print(sent.text)
  print("***End of sent****")
  count = count + 1

print("Total sentences: ", count)


SAN FRANCISCO —
***End of sent****
Shortly after Apple used a new tax law last year to bring back most of the $252 billion it had held abroad, the company said it would buy back $100 billion of its stock.


***End of sent****
On Tuesday, Apple announced its plans for another major chunk of the money: It will buy back a further $75 billion in stock.


***End of sent****
“Our first priority is always looking after the business and making sure we continue to grow and invest,” Luca Maestri, Apple’s finance chief, said in an interview.
***End of sent****
“If there is excess cash, then obviously we want to return it to investors.”


***End of sent****
Apple’s record buybacks should be welcome news to shareholders, as the stock price is likely to climb.
***End of sent****
But the buybacks could also expose the company to more criticism that the tax cuts it received have mostly benefited investors and executives.

***End of sent****
Total sentences:  7


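Here spaCy reports seven sentences where a human reader counts six, because it treats the dateline “SAN FRANCISCO” as a sentence of its own. Despite such shortcomings, NER is immensely useful for many IE scenarios, such as content tagging, search, and mining social media to identify customer feedback about specific products, to name a few.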

While NER (and keyphrase extraction, KPE) serve the useful task of identifying important words, phrases, and entities in documents, some NLP applications require further analysis of language, which leads us to more advanced NLP tasks.