# Lesson: Named Entity Recognition (NER)

by Devon Mordell and Jay Brodeur for [DMDS 2023-2024](https://scds.github.io/dmds23-24/textanalyses.html).

For more guidance on using this notebook, please refer to the workshop recording on the event [homepage](https://scds.github.io/dmds23-24/textanalyses.html). You may also wish to refer to the online workshop, [Identifying Proper Nouns with Named Entity Recognition](https://scds.github.io/text-analysis-2/), for a step-by-step explanation of the code.

In [None]:
# Use pip to install spaCy's transformer-based English model (en_core_web_trf); skip this step if you plan to use the small model (en_core_web_sm) instead
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.2/en_core_web_trf-3.7.2-py3-none-any.whl

In [None]:
# Import standard-library modules: glob for grabbing docs from directory
import glob

# Import standard-library modules: Counter to count named entities
from collections import Counter

# Import external libraries: spaCy, plus displaCy for visualizing entities
import spacy
from spacy import displacy

# Import external libraries: matplotlib.pyplot to create bar graph
import matplotlib.pyplot as plt

In [None]:
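# Read files from directory and create list from contents
# sorted() keeps the file order the same from run to run; glob alone returns files in arbitrary order
file_list = sorted(glob.glob('./dir/*.txt')) # replace './dir' with the directory containing your text (.txt) files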

texts = []

for filename in file_list:
    with open(filename, mode = 'r', encoding = 'utf-8') as f: # specify encoding as appropriate
        texts.append(f.read())

print(texts[0]) # print the first .txt file in the list to confirm
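
The file-reading step above can also be sketched with `pathlib`. This is a minimal alternative, not the workshop's own method: the helper name `read_text_files` is hypothetical, and it assumes every file shares one encoding. It keeps each filename alongside its text and raises a clear error when the directory contains no matching files, where `print(texts[0])` would otherwise fail with an `IndexError`.

```python
from pathlib import Path


def read_text_files(directory, pattern="*.txt", encoding="utf-8"):
    """Hypothetical helper: return (filename, text) pairs sorted by filename."""
    paths = sorted(Path(directory).glob(pattern))
    if not paths:
        # Fail early with a readable message instead of an IndexError later on
        raise FileNotFoundError(f"No files matching {pattern!r} in {directory!r}")
    return [(path.name, path.read_text(encoding=encoding)) for path in paths]
```

Calling `read_text_files('./dir')` returns a list of pairs; `[text for _, text in pairs]` recovers a plain list equivalent to `texts`.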

In [None]:
# Instantiate NLP pipeline - load transformer language model
nlp = spacy.load('en_core_web_trf')

# With the transformer language model, creating the Doc object can take 5-10 minutes for the Wollstonecraft document
# For faster but less accurate results, you can use nlp = spacy.load('en_core_web_sm')

# Combine all texts into a single string, in file order, separated by newlines
# so that the last word of one file does not run into the first word of the next
all_docs = "\n".join(texts)

# Create the Doc object by passing the combined text (all_docs) through the NLP pipeline (nlp)
doc = nlp(all_docs)
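
Passing one very long string through the pipeline is the slow step noted above. One possible workaround, sketched here under assumptions, is to split the text into smaller pieces first. The helper name `chunk_text` and the default limit of 100,000 characters are illustrative choices, not values from the workshop or from spaCy.

```python
def chunk_text(text, max_chars=100_000):
    """Hypothetical helper: split text into chunks of at most max_chars,
    breaking on blank lines (paragraph boundaries) where possible."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        # A single paragraph longer than the limit is cut by plain slicing
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        # Try to add the paragraph to the chunk being built
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

The chunks could then be processed with `docs = list(nlp.pipe(chunk_text(all_docs)))`. Note that this yields several Doc objects, so later cells would need to loop over `docs`, and character offsets such as `ent.start_char` would be relative to each chunk, not to the full text.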

In [None]:
# Print each named entity found by the pipeline: text, start and end character offsets, label, and an explanation of the label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_, spacy.explain(ent.label_))
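
The printed list can get long, so a per-label summary may be easier to scan. The sketch below groups entities by label using only the standard library. The helper name `count_by_label` and the sample pairs are made up for illustration; with a real Doc, the input could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter, defaultdict


def count_by_label(entity_pairs):
    """Hypothetical helper: map each label to a Counter of entity texts."""
    counts = defaultdict(Counter)
    for text, label in entity_pairs:
        counts[label][text] += 1
    return counts


# Made-up sample data standing in for (ent.text, ent.label_) pairs
sample_pairs = [("Mary", "PERSON"), ("London", "GPE"), ("Mary", "PERSON")]

for label, counter in count_by_label(sample_pairs).items():
    print(label, counter.most_common(5))
```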

In [None]:
# Serve in browser: displacy.serve(doc, style='ent')

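# Render displaCy visualization as an HTML string
# jupyter=False makes render() return the markup; inside a notebook it would otherwise display inline and return None
html = displacy.render(doc, style='ent', page=True, jupyter=False)

# Create a new file and write contents of html variable to it
with open('wollstonecraft.html', 'w', encoding='utf-8') as f:
    f.write(html)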

In [None]:
# Build a list of PERSON entities and print the 15 most commonly occurring values
persons = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ == 'PERSON']
print(Counter(persons).most_common(15))
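
Counts of persons can be split across surface variants of the same name, such as a possessive form or a name broken across a line. The sketch below shows one simple way to tidy names before counting. The helper name `normalize_name`, its rules, and the sample strings are assumptions for illustration; it does not merge different names for the same person.

```python
import re
from collections import Counter


def normalize_name(name):
    """Hypothetical helper: collapse whitespace and drop a trailing possessive."""
    name = re.sub(r"\s+", " ", name).strip()  # line breaks and repeated spaces become one space
    name = re.sub(r"['’]s$", "", name)        # remove a trailing 's or ’s
    return name


# Made-up sample strings standing in for ent.text values
sample_names = ["Mary Wollstonecraft", "Mary\nWollstonecraft", "Mary Wollstonecraft’s"]

print(Counter(normalize_name(n) for n in sample_names).most_common(15))
```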

In [None]:
# Assign 10 most common named entities to variables for plotting
entities = [ent.text for ent in doc.ents]
labels, values = zip(*(Counter(entities).most_common(10)))

# Plot the most common entities
plt.bar(labels, values)
plt.xticks(fontsize=7.5, rotation=45)
plt.show()
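
The `zip(*...)` unpacking above raises a `ValueError` when no entities were found, for example with an empty input file. A small guard, sketched here with the hypothetical helper name `top_entities`, returns empty lists in that case so the plotting step can be skipped.

```python
from collections import Counter


def top_entities(entity_texts, n=10):
    """Hypothetical helper: return (labels, values) for the n most common entities."""
    most_common = Counter(entity_texts).most_common(n)
    if not most_common:
        # Nothing to plot: return empty lists so the caller can check before plotting
        return [], []
    labels, values = zip(*most_common)
    return list(labels), list(values)
```

With a real Doc, `labels, values = top_entities([ent.text for ent in doc.ents])` could feed `plt.bar(labels, values)` after checking that `labels` is not empty.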
