# Chapter 4. Company to Symbol Linking
This notebook is complementary material to Chapter 4 of the thesis.

## Entity Linking

- For named entity recognition we will use the [spaCy](https://spacy.io/) library.


In [None]:
%pip install spacy
%pip install spacy-transformers

- [List of spaCy's trained English models](https://spacy.io/models/en)
- Because of its state-of-the-art performance, we will use the `en_core_web_trf` model.

In [None]:
!python3 -m spacy download en_core_web_trf

- We need an entity linker to obtain each entity's Wikidata information, especially the `wikidata_id`, which lets us fetch data from [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page).
- We will use the [spaCy Entity Linker](https://github.com/egerber/spaCy-entity-linker) library, which links entities against a Wikidata knowledge base.

In [None]:
%pip install spacy-entity-linker
!python3 -m spacy_entity_linker "download_knowledge_base"

In [2]:
import spacy
from spacy import displacy

# Load the transformer-based model
nlp = spacy.load("en_core_web_trf")

# Add the entity linker to the pipeline
nlp.add_pipe("entityLinker")


<spacy_entity_linker.EntityLinker.EntityLinker at 0x7fd278e2e9d0>

- We will use an excerpt from the Guardian's technology [article](https://www.theguardian.com/uk-news/2024/feb/25/uks-enemies-could-use-ai-deepfakes-to-try-to-rig-election-says-james-cleverly).

In [3]:
text = "Executives from Adobe, Amazon, Google, IBM, Meta, Microsoft, OpenAI and TikTok gathered at the Munich Security Conference to announce a new framework for how they will respond to AI-generated deepfakes that deliberately trick voters."

# Extended text for testing duplicate entity mentions
text = "Executives from Adobe, Amazon, Google, IBM, Meta, Microsoft, OpenAI and TikTok gathered at the Munich Security Conference to announce a new framework for how they will respond to AI-generated deepfakes that deliberately trick voters. We want to mention Adobe and Microsoft again."

doc = nlp(text)

# Render the entities in the text
displacy.render(doc, style="ent", jupyter=True)

# Get collection of entities
entities = doc.ents

# Get collection of linked entities
linked_entities = doc._.linkedEntities

linked_entities.pretty_print()

<EntityElement: https://www.wikidata.org/wiki/Q20313043 Executive                 English language monthly business magazine published in Beirut, Lebanon>
<EntityElement: https://www.wikidata.org/wiki/Q11463 Adobe                     American multinational computer software company  >
<EntityElement: https://www.wikidata.org/wiki/Q3884 Amazon                    American electronic commerce and cloud computing company>
<EntityElement: https://www.wikidata.org/wiki/Q95 Google                    American multinational Internet and technology corporation>
<EntityElement: https://www.wikidata.org/wiki/Q37156 IBM                       American multinational technology and consulting corporation>
<EntityElement: https://www.wikidata.org/wiki/Q18811574 Meta                      Silicon Valley company known for making augmented reality products>
<EntityElement: https://www.wikidata.org/wiki/Q2283 Microsoft                 American multinational technology corporation     >
<EntityElement: https

- We want to keep only the `ORG` entities in `doc.ents`.

In [4]:
# Delete non-ORG entities from doc.ents
doc.ents = [ent for ent in doc.ents if ent.label_ == "ORG"]
print(doc.ents)

(Adobe, Amazon, Google, IBM, Meta, Microsoft, OpenAI, TikTok, Adobe, Microsoft)


In [4]:
# Informative fragment about linked entities
print("\nAll linked entities super classes:")
linked_entities.print_super_entities()


All linked entities super classes:
business (8) : Adobe,Amazon,Google,IBM,Microsoft,OpenAI,Adobe,Microsoft
enterprise (7) : Adobe,Amazon,Google,IBM,Microsoft,Adobe,Microsoft
software company (3) : IBM,Microsoft,Microsoft
type foundry (2) : Adobe,Adobe
data controller (2) : Adobe,Adobe
magazine (1) : Executive
website (1) : Amazon
IT consulting company (1) : IBM
privately held company (1) : Meta
research institute (1) : OpenAI


- The linker can report the super classes of each entity, but we still want to restrict ourselves to organisations.
- Relying on spaCy's `ORG` label is more reliable than trying to handle every Wikidata class that may appear in the text and be relevant to us.

In [8]:
# Informative fragment about entities and linked entities

# Print counts
print(f"Entities count: {len(entities)}")
print(f"Linked entities count: {len(linked_entities)}")

# Print all entities
print("\nAll entities:")
for entity in entities:
    print(f" {entity.text} ({entity.label_})")

# Print all linked entities
print("\nAll linked entities:")
for linked_entity in linked_entities:
    print(f" {linked_entity.span.text} -> Q{linked_entity.identifier}")
    print(f"  {linked_entity.get_instance_of_hierarchy()}")
    print(f"   {linked_entity.get_super_entities()}")

Entities count: 11
Linked entities count: 15

All entities:
 Adobe (ORG)
 Amazon (ORG)
 Google (ORG)
 IBM (ORG)
 Meta (ORG)
 Microsoft (ORG)
 OpenAI (ORG)
 TikTok (ORG)
 the Munich Security Conference (EVENT)
 Adobe (ORG)
 Microsoft (ORG)

All linked entities:
 Executives -> Q20313043
  ['magazine']
   <EntityCollection (1 entities):
-https://www.wikidata.org/wiki/Q41298 magazine                  publication type                                  >
 Adobe -> Q11463
  ['type foundry', 'business', 'economic unit', 'economic model', 'economics term', 'enterprise', 'data controller']
   <EntityCollection (4 entities):
-https://www.wikidata.org/wiki/Q377688 type foundry              company that designs or distributes typefaces     
-https://www.wikidata.org/wiki/Q4830453 business                  organization involved in commercial, industrial, or professional activity
-https://www.wikidata.org/wiki/Q6881511 enterprise                for-profit organizational unit producing goods or service

- A linked entity object carries no information about whether the entity is an organization, person, location, etc.
    - This information is available from Wikidata, but only as a broad class hierarchy, as in the previous output.
- To match entities with linked entities, we must use their spans.
    - Both an entity and the `span` of a linked entity expose `start_char` and `end_char` offsets into the text.
    - We use these offsets to map each entity to its linked entity and obtain its `qid`.

In [5]:
from spacy.tokens import Span

# Set the new qid extension to the Span
Span.set_extension("qid", default=None, force=True)

# Create a dictionary to store the unique entities
qids_ents_dict = {}

# Sort the entities by their start character
linked_entities = sorted(linked_entities, key=lambda e: e.span.start_char)
org_entities = sorted(list(doc.ents), key=lambda e: e.start_char)

# Initialize two pointers
i, j = 0, 0

# Loop while both pointers are within range
while i < len(linked_entities) and j < len(org_entities):
    linked_entity = linked_entities[i]
    org_entity = org_entities[j]

    # For visualization of the pointers
    #print(i, linked_entity)
    #print(j, org_entity)

    # If the spans overlap (end_char is exclusive, so touching spans do not count)
    if (
        linked_entity.span.start_char < org_entity.end_char
        and linked_entity.span.end_char > org_entity.start_char
    ):
        # Get linked entity qid
        qid = "Q" + str(linked_entity.identifier)

        qids_ents_dict[qid] = {"label": org_entity.text, "ticker":""}
        org_entity._.qid = qid
        
        i += 1
        j += 1
    # If the linked entity starts later, move the pointer for org_entities
    elif linked_entity.span.start_char > org_entity.start_char:
        j += 1
    # If the org entity starts later, move the pointer for linked_entities
    else:
        i += 1

# Display the dictionary
display(qids_ents_dict)

# Display the entities with their QIDs
for entity in doc.ents:
    print(f"{entity.text} -> {entity._.qid}")

{'Q11463': {'label': 'Adobe', 'ticker': ''},
 'Q3884': {'label': 'Amazon', 'ticker': ''},
 'Q95': {'label': 'Google', 'ticker': ''},
 'Q37156': {'label': 'IBM', 'ticker': ''},
 'Q18811574': {'label': 'Meta', 'ticker': ''},
 'Q2283': {'label': 'Microsoft', 'ticker': ''},
 'Q21708200': {'label': 'OpenAI', 'ticker': ''},
 'Q48938223': {'label': 'TikTok', 'ticker': ''}}

Adobe -> Q11463
Amazon -> Q3884
Google -> Q95
IBM -> Q37156
Meta -> Q18811574
Microsoft -> Q2283
OpenAI -> Q21708200
TikTok -> Q48938223
Adobe -> Q11463
Microsoft -> Q2283
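
The two-pointer matching above can be tried in isolation on plain character offsets, without loading the model. This is a minimal sketch, not the notebook's implementation: `match_by_offsets` and the `Mention` alias are hypothetical names, and the offsets were worked out by hand for the opening words of the example sentence.

```python
from typing import Dict, List, Tuple

# (start_char, end_char, value) triples; end_char is exclusive, as in spaCy spans
Mention = Tuple[int, int, str]


def match_by_offsets(orgs: List[Mention], linked: List[Mention]) -> Dict[Tuple[int, int], str]:
    """Map each ORG span (start, end) to the QID of the overlapping linked span."""
    orgs = sorted(orgs)
    linked = sorted(linked)
    matches: Dict[Tuple[int, int], str] = {}
    i = j = 0
    while i < len(orgs) and j < len(linked):
        o_start, o_end, _ = orgs[i]
        l_start, l_end, qid = linked[j]
        if l_start < o_end and o_start < l_end:  # half-open intervals overlap
            matches[(o_start, o_end)] = qid
            i += 1
            j += 1
        elif l_end <= o_start:  # linked span lies entirely before the ORG span
            j += 1
        else:  # ORG span lies entirely before the linked span
            i += 1
    return matches


# Hand-computed offsets for "Executives from Adobe, Amazon"
orgs = [(16, 21, "Adobe"), (23, 29, "Amazon")]
linked = [(0, 10, "Q20313043"), (16, 21, "Q11463"), (23, 29, "Q3884")]
print(match_by_offsets(orgs, linked))
```

The linked span for "Executives" has no ORG counterpart, so it is skipped and never reaches the result.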


### Now that we have the entity QIDs, we can get the entity information from Wikidata.


- To speed up the process, we query Wikidata once per key of `qids_ents_dict`.
    - This avoids querying Wikidata for every mention, since the same entity may appear several times in the text.
    - In the end, we want a set of unique entities for each article.
- We also register another new extension, `ticker`, on `Span`.
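
The deduplication idea can be illustrated without spaCy. The sketch below assumes mentions are available as `(qid, label)` pairs; `unique_entities` is a hypothetical helper that mirrors the shape of `qids_ents_dict`.

```python
from typing import Dict, Iterable, Tuple


def unique_entities(mentions: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, str]]:
    """Collapse (qid, label) mentions into one record per QID, keeping the first label."""
    unique: Dict[str, Dict[str, str]] = {}
    for qid, label in mentions:
        # setdefault leaves an existing record untouched, so repeats are ignored
        unique.setdefault(qid, {"label": label, "ticker": ""})
    return unique


mentions = [("Q11463", "Adobe"), ("Q2283", "Microsoft"), ("Q11463", "Adobe"), ("Q2283", "Microsoft")]
unique = unique_entities(mentions)
print(len(mentions), "mentions ->", len(unique), "lookups")
```

Each QID then needs a single Wikidata lookup, no matter how often the company is mentioned in the article.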

In [10]:
# Set the new ticker extension to the Span
Span.set_extension("ticker", default=None, force=True)

In [68]:
%pip install pywikibot

Note: you may need to restart the kernel to use updated packages.


- Now we can get more information about each entity from Wikidata.

In [11]:
import pywikibot

# Wikidata items for the stock exchanges, and the properties for stock exchange and ticker symbol
STOCK_EXCHANGES = ["Q82059", "Q1019992", "Q13677", "Q496672"] # NASDAQ, NASDAQ Stockholm AB, NYSE, Hong Kong Exchanges and Clearing
STOCK_EXCHANGE_PROPERTY = "P414"
TICKER_SYMBOL_PROPERTY = "P249"

# Initialize wikibot site and repository
site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

def process_entity(qid, entity):
    """
    Process the entity with the given QID and update the entity dictionary with the ticker symbol.
    """
    try:
        print(f"Processing {entity['label']} with QID {qid}")
        page = pywikibot.ItemPage(repo, qid)
        item_dict = page.get()
        claims = item_dict["claims"]

        if STOCK_EXCHANGE_PROPERTY in claims:
            for claim in claims[STOCK_EXCHANGE_PROPERTY]:
                stock_exchange = claim.getTarget()
                if stock_exchange.id in STOCK_EXCHANGES:
                    qualifiers = claim.qualifiers
                    if TICKER_SYMBOL_PROPERTY in qualifiers:
                        for qualifier in qualifiers[TICKER_SYMBOL_PROPERTY]:
                            ticker_symbol = qualifier.getTarget()
                            print(f"    {entity['label']} has ticker {ticker_symbol}")
                            entity['ticker'] = ticker_symbol
                            return
    except pywikibot.exceptions.Error as e:
        print(e)

for qid, entity in qids_ents_dict.items():
    process_entity(qid, entity)

Processing Adobe with QID Q11463
    Adobe has ticker ADBE
Processing Amazon with QID Q3884
    Amazon has ticker AMZN
Processing Google with QID Q95
    Google has ticker GOOG
Processing IBM with QID Q37156
    IBM has ticker IBM
Processing Meta with QID Q18811574
Processing Microsoft with QID Q2283
    Microsoft has ticker MSFT
Processing OpenAI with QID Q21708200
Processing TikTok with QID Q48938223


In [12]:
for qid, entity in qids_ents_dict.items():
    if entity['ticker'] == "":
        print(f"{entity['label']} -> Not found")
    else:
        print(f"{entity['label']} ({entity['ticker']})")

Adobe (ADBE)
Amazon (AMZN)
Google (GOOG)
IBM (IBM)
Meta -> Not found
Microsoft (MSFT)
OpenAI -> Not found
TikTok -> Not found


- Fetching the entity information from Wikidata item by item is too slow and relies on hard-coded logic.

- We will not filter on "subclass of organization" in Wikidata, because, for example, Meta is not an instance of the organization class but of some technology-related class.

In [None]:
%pip install sparqlwrapper

- Wikidata IDs of the stock exchanges:
    - NASDAQ: Q82059
    - NYSE: Q13677
    - AMEX: Q846626
    - Nasdaq Stockholm AB: Q1019992
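
The exchange identifiers listed above end up in a SPARQL `VALUES` clause. As a minimal sketch, the clause could be built by a small helper that also rejects malformed identifiers before they reach the query string. `values_clause` and `QID_PATTERN` are hypothetical names, not part of SPARQLWrapper.

```python
import re
from typing import Iterable

# A QID is the letter Q followed by a positive integer without leading zeros
QID_PATTERN = re.compile(r"Q[1-9]\d*")


def values_clause(variable: str, qids: Iterable[str]) -> str:
    """Build a SPARQL VALUES clause such as 'VALUES ?x { wd:Q1 wd:Q2 }'."""
    items = []
    for qid in qids:
        if not QID_PATTERN.fullmatch(qid):
            raise ValueError(f"Not a Wikidata QID: {qid!r}")
        items.append(f"wd:{qid}")
    return f"VALUES ?{variable} {{ {' '.join(items)} }}"


exchanges = ["Q82059", "Q13677", "Q846626", "Q1019992"]
print(values_clause("exchanges", exchanges))
```

Validating the identifiers is a design choice rather than a requirement: it keeps arbitrary text from being interpolated into the query.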

- We add one more organisation to the dictionary by hand to test the ticker lookup on an entity that does not appear in the text.
    - Facebook has QID Q355
    - Instagram has QID Q209330

In [13]:
# Add Instagram into the dict
qids_ents_dict["Q209330"] = {"label": "Instagram", "ticker": ""}

In [15]:
import sys
from spacy.tokens import Span
from SPARQLWrapper import SPARQLWrapper, JSON

from typing import Dict, List, Set, Any

# The Wikidata IDs for stock exchanges
# - NASDAQ, Q82059
# - NASDAQ Stockholm AB, Q1019992
# - NYSE, Q13677
# - NYSE American (AMEX), Q846626 
STOCK_EXCHANGES = ["Q82059", "Q13677", "Q846626", "Q1019992"]

class SPARQLWikidataConnector:
    def __init__(self):
        self.endpoint_url = "https://query.wikidata.org/sparql"
        self.user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])

    def run_query(self, query):
        sparql = SPARQLWrapper(self.endpoint_url, agent=self.user_agent)
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)
        return sparql.query().convert()

    def retrieve_entities_info(self, entities_identifiers: Set[str]) -> Dict[str, Any]:
        stock_exchanges = " ".join(f"wd:{exchange}" for exchange in STOCK_EXCHANGES)
        entities_id = " ".join(f"wd:{entity_id}" for entity_id in entities_identifiers)

        print(f"Entities IDs: {entities_identifiers}\n")

        # First query
        query1 = f"""
        SELECT DISTINCT ?id ?idLabel ?exchangesLabel ?ticker WHERE {{
            SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}
            VALUES ?id {{ {entities_id} }}
            VALUES ?exchanges {{ {stock_exchanges} }}
            ?id p:P414 ?exchange.
            ?exchange ps:P414 ?exchanges;
                      pq:P249 ?ticker.  
            FILTER NOT EXISTS {{
                ?exchange pq:P582 ?endTime.
            }}                                      
        }}
        """
        results1 = self.run_query(query1)
        matched_ids = {result['id']['value'].split('/')[-1] for result in results1['results']['bindings']}

        print(f"Matched IDs [after query1]: {matched_ids}")

        # Find the QIDs that did not match in the first query
        remaining_entities_id = entities_identifiers - matched_ids
        """
        if not remaining_entities_id:
            return results1
        """

        for result in results1['results']['bindings']:
            print(f" {result}")
            
        print(f"Remaining IDs [after query1]: {remaining_entities_id}\n")

        # Second query
        unmatched_entities_id = " ".join(f"wd:{entity_id}" for entity_id in remaining_entities_id)
        query2 = f"""
        SELECT DISTINCT ?id ?idLabel ?exchangesLabel ?ticker WHERE {{
            SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}
            VALUES ?id {{ {unmatched_entities_id} }}
            VALUES ?exchanges {{ {stock_exchanges} }}
            ?id wdt:P127 ?owner.
            ?owner p:P414 ?exchange.
            ?exchange ps:P414 ?exchanges;
                      pq:P249 ?ticker. 
            FILTER NOT EXISTS {{
                ?exchange pq:P582 ?endTime.
            }}                                       
        }}
        """
        results2 = self.run_query(query2)
        matched_ids = {result['id']['value'].split('/')[-1] for result in results2['results']['bindings']}

        print(f"Matched IDs [after query2]: {matched_ids}")

        for result in results2['results']['bindings']:
            print(f" {result}")

        # Find the QIDs that are still unmatched after the second query
        remaining_entities_id = remaining_entities_id - matched_ids
        """
        if not remaining_entities_id:
            return results
        """
            
        print(f"Remaining IDs [after query2]: {remaining_entities_id}\n")      

        # Third query
        unmatched_entities_id = " ".join(f"wd:{entity_id}" for entity_id in remaining_entities_id)
        query3 = f"""
        SELECT DISTINCT ?id ?idLabel ?exchangesLabel ?ticker WHERE {{
            SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}
            VALUES ?id {{ {unmatched_entities_id} }}
            VALUES ?exchanges {{ {stock_exchanges} }}
            ?id wdt:P1889 ?differs.
            ?differs p:P414 ?exchange.
            ?exchange ps:P414 ?exchanges;
                    pq:P249 ?ticker.  
            FILTER NOT EXISTS {{
                ?exchange pq:P582 ?endTime.
            }}                                      
        }}
        """
        results3 = self.run_query(query3)
        matched_ids = {result['id']['value'].split('/')[-1] for result in results3['results']['bindings']}

        print(f"Matched IDs [after query3]: {matched_ids}")

        for result in results3['results']['bindings']:
            print(f" {result}")

        # Find the QIDs that are still unmatched after the third query
        remaining_entities_id = remaining_entities_id - matched_ids
        """
        if not remaining_entities_id:
            return results
        """
            
        print(f"Remaining IDs [after query3]: {remaining_entities_id}\n")

        # Combine all results
        results1['results']['bindings'].extend(results2['results']['bindings'])
        results1['results']['bindings'].extend(results3['results']['bindings'])

        # Map for QID to entity info dict
        entities_identifiers_info = {}
        for result in results1['results']['bindings']:
            entity_id = result['id']['value'].split('/')[-1]
            ticker_info = {"ticker": result['ticker']['value'], "exchange": result['exchangesLabel']['value']}
            if entity_id not in entities_identifiers_info:
                entities_identifiers_info[entity_id] = {
                    "idLabel": result['idLabel']['value'],
                    "tickers": [ticker_info],
                }
            else:
                entities_identifiers_info[entity_id]["tickers"].append(ticker_info)

        for entity_id in entities_identifiers_info:
            print(f"{entity_id}:\n {entities_identifiers_info[entity_id]}")

        return entities_identifiers_info
    

sparql_connector = SPARQLWikidataConnector()
entities_info = sparql_connector.retrieve_entities_info(set(qids_ents_dict.keys()))

Entities IDs: {'Q48938223', 'Q2283', 'Q11463', 'Q3884', 'Q209330', 'Q95', 'Q37156', 'Q21708200', 'Q18811574'}

Matched IDs [after query1]: {'Q11463', 'Q3884', 'Q2283', 'Q37156', 'Q95'}
 {'id': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q95'}, 'ticker': {'type': 'literal', 'value': 'GOOG'}, 'idLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Google'}, 'exchangesLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Nasdaq'}}
 {'id': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q95'}, 'ticker': {'type': 'literal', 'value': 'GOOGL'}, 'idLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Google'}, 'exchangesLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Nasdaq'}}
 {'id': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q37156'}, 'ticker': {'type': 'literal', 'value': 'IBM'}, 'idLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'IBM'}, 'exchangesLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'New York Stock Exchange'}}
 {'
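
The bindings printed above show that one company can have several rows, for example Google with both GOOG and GOOGL. The grouping step can be checked offline on rows trimmed from that output. This is a sketch only: `group_tickers` is a hypothetical helper, and the sample keeps just the `id` and `ticker` fields.

```python
from typing import Any, Dict, List


def group_tickers(bindings: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group ticker symbols by QID, preserving order and dropping duplicates."""
    grouped: Dict[str, List[str]] = {}
    for row in bindings:
        # The entity URI ends with the QID, e.g. .../entity/Q95
        qid = row["id"]["value"].rsplit("/", 1)[-1]
        ticker = row["ticker"]["value"]
        tickers = grouped.setdefault(qid, [])
        if ticker not in tickers:
            tickers.append(ticker)
    return grouped


# Rows trimmed from the output above (only the fields used here are kept)
sample = [
    {"id": {"value": "http://www.wikidata.org/entity/Q95"}, "ticker": {"value": "GOOG"}},
    {"id": {"value": "http://www.wikidata.org/entity/Q95"}, "ticker": {"value": "GOOGL"}},
    {"id": {"value": "http://www.wikidata.org/entity/Q37156"}, "ticker": {"value": "IBM"}},
]
print(group_tickers(sample))
```

Because the order of the rows is preserved, taking the first ticker of each list yields GOOG for Google in this sample.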

In [27]:
# Map the tickers to the entities in the document
for entity in doc.ents:
    if entity._.qid in entities_info:
        entity._.ticker = entities_info[entity._.qid]["tickers"][0]["ticker"]

# Print the entities with their tickers
for entity in doc.ents:
    print(f"{entity.text} ({entity._.qid}) -> {entity._.ticker}")

# Keep only the entities with tickers
doc.ents = [ent for ent in doc.ents if ent._.ticker]

Adobe (Q11463) -> ADBE
Amazon (Q3884) -> AMZN
Google (Q95) -> GOOG
IBM (Q37156) -> IBM
Meta (Q18811574) -> META
Microsoft (Q2283) -> MSFT
Adobe (Q11463) -> ADBE
Microsoft (Q2283) -> MSFT


In [31]:
# Store the entities with tickers in a dictionary mapping entity.text to entity._.ticker
entities_tickers_dict = {ent.text: ent._.ticker for ent in doc.ents}
print(entities_tickers_dict)


{'Adobe': 'ADBE', 'Amazon': 'AMZN', 'Google': 'GOOG', 'IBM': 'IBM', 'Meta': 'META', 'Microsoft': 'MSFT'}
