# Named Entity Recognition


For named entity recognition we will use [spaCy](https://spacy.io/) library.


In [1]:
%pip install spacy
%pip install spacy-transformers

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


- [List of spaCy english trained models](https://spacy.io/models/en)
- Due to state of art performance, we will use `en_core_web_trf` model.

In [2]:
!python3 -m spacy download en_core_web_trf

^C


- We need Entity Linker to get the entity wikidata infromation, especially for the `wikidata_id` of the entity to get data from [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page).
- We will use the [spaCy Entity Linker](https://github.com/egerber/spaCy-entity-linker) library to get the entity information and use the wikidata knowledge base.


In [None]:
%pip install spacy-entity-linker
!python3 -m spacy_entity_linker "download_knowledge_base"

In [2]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("entityLinker")

  from .autonotebook import tqdm as notebook_tqdm


<spacy_entity_linker.EntityLinker.EntityLinker at 0x7ff0a8ab4fd0>

We will take snippet from Guardian's `business` [article](https://www.theguardian.com/business/2024/mar/19/nvidia-tech-ai-superchip-artificial-intelligence-humanoid-robots)

In [3]:
#text = "John told me that Apple is the best company. On the other hand, Tom said that that Apple is not that good."
#text = "The technology, which remains at the cutting edge of research, has already been incorporated into offerings from Microsoft and Amazon, and now Nvidia’s getting into the game."
text = "The speakers are serviceable, being loud enough for general use, but they pale in comparison with the best you get from Apple, Dell or Razer"

doc = nlp(text)

displacy.render(doc, style="ent", jupyter=True)

# Get collection of entities
entities = doc.ents

# Get collection of linked entities
linked_entities = doc._.linkedEntities

display(linked_entities)

<EntityCollection (6 entities):
-https://www.wikidata.org/wiki/Q570 loudspeaker               transducer that converts electrical energy into sound energy; electroacoustic transducer that conver
-https://www.wikidata.org/wiki/Q7901733 Use                                                                         
-https://www.wikidata.org/wiki/Q577714 comparison                feature of grammar                                
-https://www.wikidata.org/wiki/Q312 Apple Inc.                American producer of hardware, software, and services, based in Cupertino, California
-https://www.wikidata.org/wiki/Q30873 Dell Inc.                 American multinational computer technology corporation
-https://www.wikidata.org/wiki/Q367412 Razer Inc.                US based company which specializes in products marketed to gamers>

In [4]:
print(f"Entities count: {len(entities)}")
print(f"Linked entities count: {len(linked_entities)}")

# Print all entities
print("\nAll entities:")
for entity in entities:
    print(f" {entity.text} ({entity.label_})")

# Print all linked entities
print("\nAll linked entities:")
for linked_entity in linked_entities:
    print(f" {linked_entity.span.text} -> {linked_entity.identifier}")

# Unique entities. We can not use set(entities) because entities are not hashable.

# Unique entities based on text
entities = list({entity.text: entity for entity in entities}.values())

print(f"\nUnique entities count based on text: {len(entities)}")

# Unique linked entities based on span text
linked_entities = list(
    {
        linked_entities.span.text: linked_entities for linked_entities in linked_entities
    }.values()
)

print(f"Unique linked entities based on span text: {len(linked_entities)}")

Entities count: 3
Linked entities count: 6

All entities:
 Apple (ORG)
 Dell (ORG)
 Razer (ORG)

All linked entities:
 speakers -> 570
 use -> 7901733
 comparison -> 577714
 Apple -> 312
 Dell -> 30873
 Razer -> 367412

Unique entities count based on text: 3
Unique linked entities based on span text: 6


- In Linked Entity object we do not have any information about that if entity is organization, person, location, etc.
- See the difference between entity and linked entity `The New York Times` and `New York Times`.
- We must use the `span` attribute of the linked entity and entity to get their match.


In [5]:
# Filter entities to only include ORG entities
org_entities = [entity for entity in entities if entity.label_ == "ORG"] 

# Map linked entities to original entities
org_entity_mapping = {}
for linked_entity in linked_entities:
    for org_entity in org_entities:
        if (
            linked_entity.span.start_char <= org_entity.end_char
            and linked_entity.span.end_char >= org_entity.start_char
        ):
            org_entity_mapping[org_entity] = "Q" + str(linked_entity.identifier)

# Display the mapping
for entity, wiki_id in org_entity_mapping.items():
    print(f"{entity.text} ({entity.label_}) -> {wiki_id}")

display(org_entity_mapping)

Apple (ORG) -> Q312
Dell (ORG) -> Q30873
Razer (ORG) -> Q367412


{Apple: 'Q312', Dell: 'Q30873', Razer: 'Q367412'}

- Now we have the entity id and we can get the entity information from the Wikidata.


In [None]:
%pip install pywikibot

- Let's create our custom entity class.

In [6]:
from spacy.tokens import Span

class CustomEntity:
    def __init__(self, entity: Span, ticker: str, sentiment: float=0.0):
        self.entity = entity
        self.ticker = ticker

- Now we can get the entity information from the Wikidata.

In [14]:
import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

# The Wikidata IDs for exchanges
NASDAQ_WIKI_ID = "Q82059"
NASDAQ_STOCKHOLM_AB_WIKI_ID = "Q1019992"
NYSE_WIKI_ID = "Q13677"

stock_exchanges = [NASDAQ_WIKI_ID, NASDAQ_STOCKHOLM_AB_WIKI_ID, NYSE_WIKI_ID, HONG_KONG_EXCHANGES_AND_CLEARING_WIKI_ID]

# The Wikidata properties for stock exchange and ticker symbol
STOCK_EXCHANGE_WIKI_PROPERTY = "P414"
TICKER_SYMBOL_WIKI_PROPERTY = "P249"

# Create list of custom entities
custom_entities = []

#for wiki_id in org_entity_mapping.values():
for entity, wiki_id in org_entity_mapping.items():
    try:
        # Create an ItemPage object for the entity with the Wikidata ID
        page = pywikibot.ItemPage(repo, wiki_id)

        # Retrieve the data of the entity
        item_dict = page.get()

        # Retrieve the claims of the entity
        claims = item_dict["claims"]

        # Check if the entity has a stock exchange property
        if STOCK_EXCHANGE_WIKI_PROPERTY in claims:
            for claim in claims[STOCK_EXCHANGE_WIKI_PROPERTY]:
                # Check if the stock exchange is NASDAQ or NYSE
                stock_exchange = claim.getTarget()
                print(stock_exchange)
                if stock_exchange.id in stock_exchanges:
                    qualifiers = claim.qualifiers
                    # Retrieve the ticker symbol
                    if TICKER_SYMBOL_WIKI_PROPERTY in qualifiers:
                        for qualifier in qualifiers[TICKER_SYMBOL_WIKI_PROPERTY]:
                            ticker_symbol = qualifier.getTarget()
                            print(ticker_symbol)
                            custom_entities.append(CustomEntity(entity=entity, ticker=ticker_symbol))
                            break
                        break
    except Exception as e:
        print(e)

[[wikidata:Q82059]]
AAPL
[[wikidata:Q13677]]
DELL
[[wikidata:Q496672]]
1337


In [33]:
for custom_entity in custom_entities:
    print(f"{custom_entity.entity.text} ({custom_entity.ticker}), {custom_entity.sentiment}")

AttributeError: 'CustomEntity' object has no attribute 'sentiment'

- We can get the sentiment of the entity by getting the sentiment of the entity in the context of the sentence.

In [None]:
from textblob import TextBlob

# For each entity, find sentences mentioning the entity and compute sentiment
for custom_entity in custom_entities:
    # Find sentences mentioning the entity
    entity_sentences = [sentence.text for sentence in doc.sents if custom_entity.entity.text in sentence.text]
    
    # Compute sentiment for each sentence mentioning the entity
    entity_sentiment = [TextBlob(sentence).sentiment.polarity for sentence in entity_sentences]
    
    # Compute average sentiment for the entity
    average_sentiment = sum(entity_sentiment) / len(entity_sentiment) if entity_sentiment else 0
    
    # Update the sentiment of the custom entity
    custom_entity.sentiment = average_sentiment
    
    # Print the entity, ticker, and average sentiment
    print(f"{custom_entity.entity.text} ({custom_entity.ticker}), Average Sentiment: {custom_entity.sentiment}")
    
    # Print each sentence mentioning the entity and its sentiment
    for sentence in entity_sentences:
        sentiment = TextBlob(sentence).sentiment.polarity
        print(f" \t{sentence} -> {sentiment}")


Apple (AAPL), Average Sentiment: 0.64375
 	John told me that Apple is the best company. -> 1.0
 	On the other hand, Tom said that that Apple is not that good. -> 0.2875


In [None]:
for custom_entity in custom_entities:
    print(f"{custom_entity.entity.text} ({custom_entity.ticker}), {custom_entity.sentiment}")

Apple (AAPL), 1.0
Microsoft (MSFT), 0.2875


- This was one of the naive approach to get sentiment analysis of the text.
- See, because if we in one sentence have multiple entities then we can not add the sentiment of the sentence to the each entity. That is not relevant to the entity in most cases.

In [None]:
# Display tree structure
displacy.render(doc, style="dep", jupyter=True)

In [None]:
# For each entity, find adjectives directly connected to the entity and compute sentiment
for custom_entity in custom_entities:
    entity_adjectives = [token.text for token in custom_entity.entity.subtree if token.dep_ in {"amod", "acomp"}]
    print
    entity_sentiment = [TextBlob(adj).sentiment.polarity for adj in entity_adjectives]
    average_sentiment = sum(entity_sentiment) / len(entity_sentiment) if entity_sentiment else 0
    print(f"Entity: {custom_entity.entity.text}, Average Sentiment: {average_sentiment}")

Entity: Apple, Average Sentiment: 0
Entity: Microsoft, Average Sentiment: 0


In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)