# 3.3 Named Entity Recognition

In [1]:
import re

import spacy
from spacy import displacy

# Load the small English pipeline (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

In [2]:
google_text = "Google was founded on September 4, 1998, by computer scientists Larry Page and Sergey Brin while they were PhD students at Stanford University in California. Together they own about 14% of its publicly listed shares and control 56% of its stockholder voting power through super-voting stock. The company went public via an initial public offering (IPO) in 2004. In 2015, Google was reorganized as a wholly owned subsidiary of Alphabet Inc. Google is Alphabet's largest subsidiary and is a holding company for Alphabet's internet properties and interests. Sundar Pichai was appointed CEO of Google on October 24, 2015, replacing Larry Page, who became the CEO of Alphabet. On December 3, 2019, Pichai also became the CEO of Alphabet."
print(google_text)

Google was founded on September 4, 1998, by computer scientists Larry Page and Sergey Brin while they were PhD students at Stanford University in California. Together they own about 14% of its publicly listed shares and control 56% of its stockholder voting power through super-voting stock. The company went public via an initial public offering (IPO) in 2004. In 2015, Google was reorganized as a wholly owned subsidiary of Alphabet Inc. Google is Alphabet's largest subsidiary and is a holding company for Alphabet's internet properties and interests. Sundar Pichai was appointed CEO of Google on October 24, 2015, replacing Larry Page, who became the CEO of Alphabet. On December 3, 2019, Pichai also became the CEO of Alphabet.


In [3]:
spacy_doc = nlp(google_text)

In [None]:
# Print each named entity (a span, possibly several tokens) and its label
for ent in spacy_doc.ents:
    print(ent.text, ent.label_)

Google ORG
September 4, 1998 DATE
Larry Page PERSON
Sergey Brin PERSON
PhD WORK_OF_ART
Stanford University ORG
California GPE
about 14% PERCENT
56% PERCENT
IPO ORG
2004 DATE
2015 DATE
Google ORG
Alphabet Inc. ORG
Alphabet ORG
Alphabet ORG
Sundar Pichai PERSON
Google ORG
October 24, 2015 DATE
Larry Page PERSON
Alphabet GPE
December 3, 2019 DATE
Pichai PERSON
Alphabet GPE
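
A quick way to read the list above is to tally the labels and check whether the same text received more than one label. The sketch below works on (text, label) pairs copied from the printed output, so it runs without spaCy; the names `pairs`, `label_counts`, and `conflicts` are only illustrative.

```python
from collections import Counter, defaultdict

# (text, label) pairs copied from the printed output above.
# In the notebook, the same list could be built with:
#   pairs = [(ent.text, ent.label_) for ent in spacy_doc.ents]
pairs = [
    ("Google", "ORG"), ("September 4, 1998", "DATE"),
    ("Larry Page", "PERSON"), ("Sergey Brin", "PERSON"),
    ("PhD", "WORK_OF_ART"), ("Stanford University", "ORG"),
    ("California", "GPE"), ("about 14%", "PERCENT"),
    ("56%", "PERCENT"), ("IPO", "ORG"),
    ("2004", "DATE"), ("2015", "DATE"),
    ("Google", "ORG"), ("Alphabet Inc.", "ORG"),
    ("Alphabet", "ORG"), ("Alphabet", "ORG"),
    ("Sundar Pichai", "PERSON"), ("Google", "ORG"),
    ("October 24, 2015", "DATE"), ("Larry Page", "PERSON"),
    ("Alphabet", "GPE"), ("December 3, 2019", "DATE"),
    ("Pichai", "PERSON"), ("Alphabet", "GPE"),
]

# How many entities were found per label
label_counts = Counter(label for _, label in pairs)

# Collect every label assigned to each distinct entity text
labels_by_text = defaultdict(set)
for text, label in pairs:
    labels_by_text[text].add(label)

# Keep only the texts that were tagged with more than one label
conflicts = {
    text: sorted(labels)
    for text, labels in labels_by_text.items()
    if len(labels) > 1
}

print(label_counts.most_common())
print(conflicts)
```

On this output the only conflict is "Alphabet", which is tagged ORG in the middle of the text and GPE in the last two sentences. It is a reminder that even the uncleaned run is not perfect.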


In [None]:
# Visualize named entities
displacy.render(spacy_doc, style="ent", jupyter=True)

**Next, let's see whether cleaning the text (removing punctuation and lowercasing) improves the tagging.**

In [None]:
# Remove punctuation (anything that is not a word character or whitespace), then lowercase
google_text_clean = re.sub(r'[^\w\s]', '', google_text).lower()
print(google_text_clean)

google was founded on september 4 1998 by computer scientists larry page and sergey brin while they were phd students at stanford university in california together they own about 14 of its publicly listed shares and control 56 of its stockholder voting power through supervoting stock the company went public via an initial public offering ipo in 2004 in 2015 google was reorganized as a wholly owned subsidiary of alphabet inc google is alphabets largest subsidiary and is a holding company for alphabets internet properties and interests sundar pichai was appointed ceo of google on october 24 2015 replacing larry page who became the ceo of alphabet on december 3 2019 pichai also became the ceo of alphabet


In [7]:
spacy_doc_clean = nlp(google_text_clean)

In [8]:
# Print named entities found in the cleaned text
for ent in spacy_doc_clean.ents:
    print(ent.text, ent.label_)

google ORG
september 4 1998 DATE
stanford university ORG
california GPE
about 14 CARDINAL
56 CARDINAL
2004 DATE
2015 DATE
alphabet inc google ORG
google ORG
october 24 2015 DATE
larry PERSON
december 3 2019 DATE
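
To see exactly what the cleaning step cost, the two runs can be compared side by side. The sketch below is one possible approach, not part of spaCy: it normalizes the original entity texts with the same regex used for cleaning, then reports which entities disappeared and which changed label. The pairs are copied from the two printed outputs (duplicates removed), and `normalize` and `compare_runs` are hypothetical helper names.

```python
import re

def normalize(text):
    """Apply the notebook's cleaning step: strip punctuation, then lowercase."""
    return re.sub(r"[^\w\s]", "", text).lower()

def compare_runs(original, cleaned):
    """Return (lost, relabeled) for two lists of (text, label) pairs."""
    cleaned_labels = {}
    for text, label in cleaned:
        cleaned_labels.setdefault(text, set()).add(label)

    lost = set()       # entity texts that no longer appear after cleaning
    relabeled = {}     # entity texts that survived but with a different label
    for text, label in original:
        key = normalize(text)
        if key not in cleaned_labels:
            lost.add(key)
        elif label not in cleaned_labels[key]:
            relabeled[key] = (label, sorted(cleaned_labels[key]))
    return lost, relabeled

# Unique pairs from the run on the original text
original_pairs = [
    ("Google", "ORG"), ("September 4, 1998", "DATE"), ("Larry Page", "PERSON"),
    ("Sergey Brin", "PERSON"), ("PhD", "WORK_OF_ART"), ("Stanford University", "ORG"),
    ("California", "GPE"), ("about 14%", "PERCENT"), ("56%", "PERCENT"),
    ("IPO", "ORG"), ("2004", "DATE"), ("2015", "DATE"), ("Alphabet Inc.", "ORG"),
    ("Alphabet", "ORG"), ("Sundar Pichai", "PERSON"), ("October 24, 2015", "DATE"),
    ("Alphabet", "GPE"), ("December 3, 2019", "DATE"), ("Pichai", "PERSON"),
]

# Pairs from the run on the cleaned text
cleaned_pairs = [
    ("google", "ORG"), ("september 4 1998", "DATE"), ("stanford university", "ORG"),
    ("california", "GPE"), ("about 14", "CARDINAL"), ("56", "CARDINAL"),
    ("2004", "DATE"), ("2015", "DATE"), ("alphabet inc google", "ORG"),
    ("google", "ORG"), ("october 24 2015", "DATE"), ("larry", "PERSON"),
    ("december 3 2019", "DATE"),
]

lost, relabeled = compare_runs(original_pairs, cleaned_pairs)
print(sorted(lost))
print(relabeled)
```

The comparison matches on exact normalized text, so "larry" in the cleaned run does not count as a match for "Larry Page". A fuzzier match would be possible, but exact matching makes the loss easier to see.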


In [None]:
# Visualize again after cleaning
displacy.render(spacy_doc_clean, style="ent", jupyter=True)

## Another Example

In [6]:
finance_text = "On March 20, 2021, the Federal Reserve announced a new interest rate policy. Jerome Powell emphasized that inflation might rise above 2% temporarily. Goldman Sachs responded with a revised forecast."

finance_doc = nlp(finance_text)
for ent in finance_doc.ents:
    print(ent.text, ent.label_)
    
displacy.render(finance_doc, style="ent", jupyter=True)

March 20, 2021 DATE
the Federal Reserve ORG
Jerome Powell PERSON
2% PERCENT
Goldman Sachs ORG


## What I Learned

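- spaCy’s ents attribute on a processed document returns the named entities the pipeline found, such as people, organizations, and dates, each with a label in label_.

- Cleaning text can make NER worse. Lowercasing and stripping punctuation removed cues the model relies on: in the Google example, the full PERSON names were lost (only "larry" survived), the PERCENT values were tagged as CARDINAL, and "alphabet inc google" was merged into a single ORG.

- displacy.render is a great way to visualize entities in a Jupyter notebook.

- Run NER on the original text before any aggressive preprocessing such as lowercasing or punctuation removal.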
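
One way to act on the last point is to run NER first and then normalize only the text around the entities. The sketch below is a minimal illustration in plain Python, assuming the entity positions are available as (start, end) character offsets; in spaCy these would come from each entity's start_char and end_char attributes. The helper name `lowercase_outside_entities` and the sample sentence are made up for this example, and the offsets are located with str.index as a stand-in for real NER output.

```python
def lowercase_outside_entities(text, spans):
    """Lowercase text everywhere except inside the given (start, end) spans.

    Spans are character offsets with an exclusive end and must not overlap.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start].lower())  # normalize the gap before the entity
        pieces.append(text[start:end])             # keep the entity exactly as written
        cursor = end
    pieces.append(text[cursor:].lower())           # normalize whatever follows the last entity
    return "".join(pieces)

# Hypothetical sample; with spaCy the spans would be
#   [(ent.start_char, ent.end_char) for ent in doc.ents]
sample = "In 2015, Google Was Reorganized As A Subsidiary Of Alphabet Inc."
names = ["Google", "Alphabet Inc."]
spans = [(sample.index(name), sample.index(name) + len(name)) for name in names]

result = lowercase_outside_entities(sample, spans)
print(result)
```

This keeps the capitalization and punctuation inside each entity intact, so the entity text can still be matched back to the original NER output after the rest of the text has been normalized.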