## Named Entity Recognition (NER)

Named Entity Recognition (NER) is a natural language processing (NLP) technique that identifies named entities in a text and classifies them into predefined categories such as persons, organizations, locations, time expressions, quantities, monetary values, and percentages. The primary goal of NER is to extract and label these entities so that unstructured text data gains structure and meaning.

For instance, in the sentence "John works at Google in California", NER would identify "John" as a PERSON entity, "Google" as an ORGANIZATION entity, and "California" as a LOCATION entity.
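To make "identify and classify" concrete, the sketch below represents the entities of that sentence as character spans plus a label. The labels are written by hand here, not produced by a model, and `to_spans` is a hypothetical helper introduced only for illustration.

```python
# Hand-labeled entities for the example sentence (an assumption, not model output).
sentence = "John works at Google in California"
labeled = [
    ("John", "PERSON"),
    ("Google", "ORGANIZATION"),
    ("California", "LOCATION"),
]


def to_spans(text, pairs):
    """Return (start, end, label) character spans for each labeled entity."""
    spans = []
    for entity_text, label in pairs:
        # str.find returns the first occurrence, which is enough for this toy sentence.
        start = text.find(entity_text)
        if start == -1:
            raise ValueError(f"{entity_text!r} not found in text")
        spans.append((start, start + len(entity_text), label))
    return spans


spans = to_spans(sentence, labeled)
for start, end, label in spans:
    print(f"{sentence[start:end]!r} [{start}:{end}] -> {label}")
```

An NER system produces this kind of span-plus-label structure automatically, which is what the spaCy examples in the rest of this notebook do.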

In [1]:
#Example text
txt="""
John Smith works at XYZ Corporation located in New York City.
He is currently leading a project on artificial intelligence. 
Last week, he attended a conference on machine learning in San Francisco.
His colleague, Sarah Johnson, presented their team's research on natural language processing.
"""

## spaCy Library


spaCy is an open-source natural language processing (NLP) library designed to help developers build applications that process and understand large volumes of text data. It is written in Python and offers efficient implementations of various NLP tasks, including tokenization, part-of-speech tagging, named entity recognition (NER), dependency parsing, and more.

In [2]:
import spacy

In [3]:
nlp=spacy.load('en_core_web_sm')

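The model name follows the pattern `[language]_[type]_[genre]_[size]`:  
`en`: English  
`core`: general-purpose pipeline (contains general vocabulary and entities)  
`web`: genre, i.e. the kind of text the model was trained on (here, web text)  
`sm`: size (small)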
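The naming scheme can be made concrete with a small helper. This is only a sketch: `PipelineName` and `parse_pipeline_name` are hypothetical names introduced here, not part of spaCy, and the code assumes the name has exactly four underscore-separated parts.

```python
from typing import NamedTuple


class PipelineName(NamedTuple):
    language: str       # e.g. "en" for English
    pipeline_type: str  # e.g. "core" for a general-purpose pipeline
    genre: str          # e.g. "web" for web text
    size: str           # e.g. "sm" for small


def parse_pipeline_name(name: str) -> PipelineName:
    """Split a name such as 'en_core_web_sm' into its four components."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated parts, got {len(parts)}: {name!r}")
    return PipelineName(*parts)


parsed = parse_pipeline_name("en_core_web_sm")
print(parsed.language, parsed.pipeline_type, parsed.genre, parsed.size)
```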

In [4]:
doc=nlp(txt)

In [5]:
print(doc)


John Smith works at XYZ Corporation located in New York City.
He is currently leading a project on artificial intelligence. 
Last week, he attended a conference on machine learning in San Francisco.
His colleague, Sarah Johnson, presented their team's research on natural language processing.



As we can see, printing the doc shows only the original text, not the entities. To visualize the entities in the sentence, we use the `displacy.render` function from the spaCy library.

In [6]:
spacy.displacy.render(doc,style="ent")

In [7]:
#To understand what an entity label means, we can use the explain function
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

In [8]:
spacy.explain("GPE")

'Countries, cities, states'

In [9]:
#To get all the entities in the text, we use the ents attribute
doc.ents

(John Smith,
 XYZ Corporation,
 New York City,
 Last week,
 San Francisco,
 Sarah Johnson)

In [10]:
#Looping through all the entities
for entity in doc.ents:
    #label_ returns the type of entity and text returns the actual entity
    print(f"{entity.label_}:{entity.text}")

PERSON:John Smith
ORG:XYZ Corporation
GPE:New York City
DATE:Last week
GPE:San Francisco
PERSON:Sarah Johnson


In [11]:
#Storing all the organizations in a separate list
org_list=[]

for entity in doc.ents:
    if entity.label_=='ORG':
        org_list.append(entity.text)

In [12]:
org_list

['XYZ Corporation']
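The same filtering idea extends to collecting every entity type at once instead of only `ORG`. The sketch below groups entities by label with a `defaultdict`. To stay runnable without a model, it works on `(label, text)` pairs copied from the loop output above; the variable names `pairs` and `entities_by_label` are introduced here for illustration.

```python
from collections import defaultdict

# (label, text) pairs copied from the printed output of the entity loop above.
pairs = [
    ("PERSON", "John Smith"),
    ("ORG", "XYZ Corporation"),
    ("GPE", "New York City"),
    ("DATE", "Last week"),
    ("GPE", "San Francisco"),
    ("PERSON", "Sarah Johnson"),
]

# Map each label to the list of entity texts carrying that label.
entities_by_label = defaultdict(list)
for label, text in pairs:
    entities_by_label[label].append(text)

for label, texts in entities_by_label.items():
    print(f"{label}: {texts}")
```

With a real spaCy `doc`, the loop body would presumably become `entities_by_label[entity.label_].append(entity.text)` while iterating over `doc.ents`.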

## Assignment

Repeat the steps above for the given sentence.

In [13]:
txt="Apple reached an all-time high stock price of 143 dollars this january"

In [14]:
import spacy

In [15]:
nlp=spacy.load("en_core_web_sm")

In [16]:
doc=nlp(txt)

In [17]:
print(doc)

Apple reached an all-time high stock price of 143 dollars this january


In [18]:
spacy.displacy.render(doc,style="ent")

In [19]:
spacy.explain("MONEY")

'Monetary values, including unit'

In [20]:
doc.ents

(Apple, 143 dollars)

In [21]:
ents=[]
entities=doc.ents
for entity in entities:
    ents.append(entity.text)

In [22]:
ents

['Apple', '143 dollars']

In [23]:
#Practice on a longer text and count the entities
text="""
Tech giants like Google, Microsoft, and Amazon are constantly innovating in the field of artificial intelligence. Meanwhile, startups such as DeepMind, OpenAI, and Neuralink are pushing the boundaries of machine learning. The academic world also plays a significant role, with institutions like MIT, Stanford, and Oxford leading research in natural language processing.

In the business world, mergers and acquisitions are reshaping industries. Recently, IBM acquired Red Hat for a record-breaking sum. This move is expected to strengthen IBM's position in the cloud computing market. Meanwhile, Salesforce continues its expansion strategy, acquiring companies like Slack and Tableau to enhance its suite of products.

The financial sector is also witnessing significant activity. Banks like JPMorgan Chase, Goldman Sachs, and Citigroup are investing heavily in blockchain technology to improve security and efficiency. Fintech startups like Stripe and Robinhood are disrupting traditional banking with innovative payment solutions and commission-free trading platforms.

In healthcare, pharmaceutical companies like Pfizer, Johnson & Johnson, and Novartis are racing to develop vaccines and treatments for various diseases. Biotech startups such as Moderna and BioNTech have gained attention for their mRNA vaccine technology, which has shown promising results against COVID-19.

In the entertainment industry, studios like Disney, Warner Bros., and Netflix are investing billions in original content production to attract subscribers to their streaming platforms. Gaming companies like Electronic Arts, Activision Blizzard, and Epic Games are also capitalizing on the growing demand for interactive entertainment.

Overall, organizations across different sectors are leveraging technology and innovation to stay competitive in today's rapidly evolving landscape.

"""

In [24]:
model=spacy.load("en_core_web_sm")

In [25]:
doc=model(text)

In [26]:
#storing all the organizations in a list
organization_list=[]
entities=doc.ents
for entity in entities:
    if entity.label_=="ORG":
        organization_list.append(entity.text)

In [27]:
#Counting the number of occurrences of each organization
#We use the Counter class from the collections module for this
from collections import Counter

In [28]:
organization_count=Counter(organization_list)

In [29]:
organization_count.most_common()

[('IBM', 2),
 ('Google', 1),
 ('Microsoft', 1),
 ('Amazon', 1),
 ('MIT', 1),
 ('Stanford', 1),
 ('Oxford', 1),
 ('Red Hat', 1),
 ('Goldman Sachs', 1),
 ('Citigroup', 1),
 ('Pfizer', 1),
 ('Johnson & Johnson', 1),
 ('Novartis', 1),
 ('COVID-19', 1),
 ('Disney', 1),
 ('Warner Bros.', 1),
 ('Electronic Arts', 1),
 ('Epic Games', 1)]

In [30]:
organization_list

['Google',
 'Microsoft',
 'Amazon',
 'MIT',
 'Stanford',
 'Oxford',
 'IBM',
 'Red Hat',
 'IBM',
 'Goldman Sachs',
 'Citigroup',
 'Pfizer',
 'Johnson & Johnson',
 'Novartis',
 'COVID-19',
 'Disney',
 'Warner Bros.',
 'Electronic Arts',
 'Epic Games']
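Building on the counts above, `Counter` can also answer more targeted questions, such as which organization is mentioned most often or which names occur more than once. The sketch below works on the `organization_list` values copied from the output above, so it runs without loading a model; `top_one` and `repeated` are names introduced here for illustration.

```python
from collections import Counter

# Organization names copied from the organization_list output above.
organization_list = [
    "Google", "Microsoft", "Amazon", "MIT", "Stanford", "Oxford",
    "IBM", "Red Hat", "IBM", "Goldman Sachs", "Citigroup", "Pfizer",
    "Johnson & Johnson", "Novartis", "COVID-19", "Disney",
    "Warner Bros.", "Electronic Arts", "Epic Games",
]

organization_count = Counter(organization_list)

# most_common(n) returns only the n most frequent (name, count) pairs.
top_one = organization_count.most_common(1)

# Names that were detected more than once.
repeated = [name for name, count in organization_count.items() if count > 1]

print(top_one)
print(repeated)
print(len(organization_count), "distinct names in",
      sum(organization_count.values()), "mentions")
```

Passing a number to `most_common` is convenient on longer texts, where printing every count would be noisy.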