# <font color='green'> Named-Entity Recognition </font>

Named-entity recognition (NER) locates named entities in unstructured text and classifies them into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages. It can be viewed as a form of chunking.

## <font color='blue'> What is chunking? </font>
Chunking is the process of extracting phrases, such as noun phrases, from sentences.

### <font color='brown'> Example </font>

In [22]:
# Imports
import nltk

# Let's chunk this sentence
sentence = "By Election Day, more than 60% of U.S. K-12 public school students will be attending "\
           "schools that offer in-person learning at least a few days a week, an updated tracker finds."
# We will use a noun phrase (NP) chunker, i.e. a chunker that looks for noun phrases in a sentence.
# A chunk can span several words: 'public school' below is a single NP chunk.
# The rule states that whenever the chunker finds an optional determiner (DT) followed
# by any number of adjectives (JJ) and then a singular noun (NN), an NP chunk is formed.
# Plural nouns (NNS) and proper nouns (NNP) are not covered by this rule.
regex = ('''
    NP: {<DT>?<JJ>*<NN>} # NP
    ''')
# Create the parser
chunk_parser = nltk.RegexpParser(regex)

# Tokenize and tag the sentence
tagged_words = nltk.pos_tag(nltk.word_tokenize(sentence))

# Form the chunks and create a tree out of it
chunked_tree = chunk_parser.parse(tagged_words)

# Let's draw the tree
# chunked_tree.draw() # Uncomment this to draw the tree - it opens a separate window

<img src="chunked_sentence_tree.jpg">

In [23]:
# A more raw version of the tree
print(chunked_tree)

(S
  By/IN
  Election/NNP
  Day/NNP
  ,/,
  more/JJR
  than/IN
  60/CD
  (NP %/NN)
  of/IN
  U.S./NNP
  K-12/NNP
  (NP public/JJ school/NN)
  students/NNS
  will/MD
  be/VB
  attending/VBG
  schools/NNS
  that/WDT
  offer/VBP
  (NP in-person/JJ learning/NN)
  at/IN
  least/JJS
  a/DT
  few/JJ
  days/NNS
  (NP a/DT week/NN)
  ,/,
  (NP an/DT updated/JJ tracker/NN)
  finds/NNS
  ./.)
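
To make the rule `<DT>?<JJ>*<NN>` more concrete, here is a minimal pure-Python sketch of the same idea: encode the tag sequence as a string and scan it with an ordinary regular expression. This is a hand-rolled approximation for illustration only, not NLTK's implementation; the helper name `np_chunks` and the hand-tagged `sample` fragment are assumptions made for this example.

```python
import re

def np_chunks(tagged_words):
    """Return the phrases matched by the tag pattern <DT>?<JJ>*<NN>."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>",
    # so a plain regular expression can scan it.
    tag_string = "".join(f"<{tag}>" for _word, tag in tagged_words)
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

    chunks = []
    for match in pattern.finditer(tag_string):
        # Map character offsets back to token indices by counting "<".
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        chunks.append(" ".join(word for word, _tag in tagged_words[start:end]))
    return chunks

# Hand-tagged fragment of the sentence above (hypothetical input, copied from the output)
sample = [("60", "CD"), ("%", "NN"), ("of", "IN"), ("U.S.", "NNP"),
          ("K-12", "NNP"), ("public", "JJ"), ("school", "NN"), ("students", "NNS")]
print(np_chunks(sample))  # ['%', 'public school']
```

Because the pattern requires the exact tag `<NN>`, tokens tagged `NNS` or `NNP` are left outside the chunks, which matches what the tree above shows.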


## <font color='blue'> Time for named-entity chunking </font>
The previous chunker only matched a hand-written pattern in the sentence; nothing was learned from data. Let's try out a trained chunker now.

In [24]:
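# Download the pre-trained named-entity chunker
nltk.download('maxent_ne_chunker')
# Download the 'words' corpus, which the chunker relies on
nltk.download('words')

# Run the named-entity chunker on the POS-tagged sentence
ne_tree = nltk.ne_chunk(tagged_words)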

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\prgzz\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\prgzz\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


In [34]:
# ne_tree.draw() # Uncomment this to draw the tree - it opens a separate window

<img src="ner_chunked_sentence_tree.jpg">

### <font color='brown'> A Larger Example </font>

In [31]:
# Download corpus
nltk.download('treebank')

# Get a list of already tagged sentences
treebank_sentences = nltk.corpus.treebank.tagged_sents()
print(treebank_sentences)

[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\prgzz\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')], [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')], ...]


In [32]:
# Let's find named entities in the first 20 sentences
chunked_sents = []
for sent in treebank_sentences[:20]:
    chunked_sent = nltk.ne_chunk(sent, binary=False)
    chunked_sents.append(chunked_sent)

In [33]:
# This one found 4 named entities
print(chunked_sents[12])
# chunked_sents[12].draw() # Uncomment this to draw the tree - it opens a separate window

(S
  Dr./NNP
  (PERSON Talcott/NNP)
  led/VBD
  a/DT
  team/NN
  of/IN
  researchers/NNS
  from/IN
  the/DT
  (ORGANIZATION National/NNP Cancer/NNP Institute/NNP)
  and/CC
  the/DT
  medical/JJ
  schools/NNS
  of/IN
  (ORGANIZATION Harvard/NNP University/NNP)
  and/CC
  (ORGANIZATION Boston/NNP University/NNP)
  ./.)


<img src="ner_chunked_larger_sentence_tree.jpg">
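
Reading entities off a printed tree gets tedious for larger inputs. The sketch below shows one way to collect `(label, text)` pairs from a chunked tree by walking its direct children. The helper name `extract_entities` is an assumption for this example, and `sample_tree` is a hand-built stand-in for the beginning of the tree printed above, so the block runs without downloading any NLTK data.

```python
from nltk.tree import Tree

def extract_entities(tree):
    """Collect (label, text) pairs for every chunk directly under the root."""
    entities = []
    for child in tree:
        # Plain tokens are (word, tag) tuples; chunks are nested Tree objects.
        if isinstance(child, Tree):
            text = " ".join(word for word, _tag in child.leaves())
            entities.append((child.label(), text))
    return entities

# Hand-built stand-in for the first part of the tree shown above
sample_tree = Tree("S", [
    ("Dr.", "NNP"),
    Tree("PERSON", [("Talcott", "NNP")]),
    ("led", "VBD"),
    ("a", "DT"),
    ("team", "NN"),
    ("of", "IN"),
    ("researchers", "NNS"),
    ("from", "IN"),
    ("the", "DT"),
    Tree("ORGANIZATION", [("National", "NNP"), ("Cancer", "NNP"), ("Institute", "NNP")]),
])

print(extract_entities(sample_tree))
# [('PERSON', 'Talcott'), ('ORGANIZATION', 'National Cancer Institute')]
```

The same helper should work on any element of `chunked_sents`, since `nltk.ne_chunk` returns a `Tree` whose chunks sit directly under the root.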

# <font color='green'> How does NER work? </font>

* NLTK ships a pre-trained NER chunker.
* The chunker was trained on a corpus that is not freely available.
* What we know: the corpus was annotated manually, and a maximum entropy (MaxEnt) classifier was trained on it.
* The objective is to classify phrases into categories (locations, organizations, person names, etc.). A MaxEnt classifier is essentially multinomial logistic regression.
* **More info at**: https://mattshomepage.com/articles/2016/May/23/nltk_nec/
* **Making your own named-entity chunker**: https://nlpforhackers.io/named-entity-extraction/
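
To illustrate the classification step described above, here is a toy maximum entropy (softmax) classifier over a handful of token features. Everything in it is an assumption made for illustration: the feature names, the label set, and the hand-picked weights are hypothetical and are not NLTK's actual features or learned parameters. A real chunker learns its weights from annotated data.

```python
import math

LABELS = ["PERSON", "ORGANIZATION", "O"]  # "O" means "not an entity"

# Hypothetical, hand-picked weights: WEIGHTS[label][feature]
WEIGHTS = {
    "PERSON": {"bias": 0.0, "is_capitalized": 1.0, "prev_is_title": 2.5, "has_org_suffix": -1.0},
    "ORGANIZATION": {"bias": 0.0, "is_capitalized": 1.0, "prev_is_title": -0.5, "has_org_suffix": 3.0},
    "O": {"bias": 1.0, "is_capitalized": -1.5, "prev_is_title": -1.0, "has_org_suffix": -1.0},
}

def extract_features(tokens, i):
    """Binary features for the token at position i (illustrative only)."""
    word = tokens[i]
    prev = tokens[i - 1] if i > 0 else ""
    return {
        "bias": True,
        "is_capitalized": word[:1].isupper(),
        "prev_is_title": prev in {"Dr.", "Mr.", "Ms.", "Mrs."},
        "has_org_suffix": word in {"University", "Institute", "Inc."},
    }

def predict_proba(features):
    """Softmax over per-label scores, as in multinomial logistic regression."""
    scores = {
        label: sum(w for feat, w in WEIGHTS[label].items() if features.get(feat))
        for label in LABELS
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(score - max_score) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

def classify(tokens, i):
    """Return the most probable label for the token at position i."""
    probs = predict_proba(extract_features(tokens, i))
    return max(probs, key=probs.get)

tokens = ["Dr.", "Talcott", "led", "a", "team"]
print(classify(tokens, 1))  # PERSON
print(classify(tokens, 2))  # O
```

Each label's score is a weighted sum of the active features, and the softmax turns those scores into probabilities that sum to one. Training a MaxEnt model amounts to choosing the weights that make the annotated labels most probable.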

# <font color='green'> References </font>

* If you plan to read just one: https://www.nltk.org/book/ch07.html (Section 5)
* NER with NLTK and spaCy (another popular NLP library, often noted for its speed): https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
* Another useful guide: https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb
* NER workings: https://mattshomepage.com/articles/2016/May/23/nltk_nec/
* Training your own NER: https://nlpforhackers.io/named-entity-extraction/
* **A plug for all the NLTK corpora out there**: http://www.nltk.org/howto/corpus.html