Hello world!

In today's tutorial, we are going to add a new entity type to spaCy’s default NER (named entity recognition). We will explore why looking for specific strings is not the way to go and how to handle false positives and false negatives when doing NER with a custom entity. I’ll show you an example in a Jupyter Notebook.

Let’s start. For the NER, we will use spaCy and its pretrained English model, more precisely the large one (`en_core_web_lg`). So, make sure you install both.

In [None]:
!python -m spacy download en_core_web_lg

In [14]:
import spacy

# Load the spaCy pre-trained model
nlp = spacy.load("en_core_web_lg")

We will do NER on a `sample.txt` file located in the same directory as this Jupyter Notebook file. The text file is quite extensive and contains a collection of articles, one per line. To begin, let’s read the file and print the first five articles to make sure we can read it.

In [16]:
# Open the file named "sample.txt" in read mode with UTF-8 encoding
with open("sample.txt", "r", encoding="utf-8") as file:
    # Read all lines from the file and store them in the 'articles' list
    articles = file.readlines()

# Print the first 5 articles or all articles if there are fewer than 5
for i in range(min(5, len(articles))):
    # Print the article number and the content of the article
    print(f"Article {i + 1}:\n{articles[i]}")

    # Print a separator line between articles for better readability
    print("-----\n")


Article 1:
@fansoniclove Gold the Tenrec

-----

Article 2:
Tokyo-bound Sampson sets Aust rifle record. Shooter Dane Sampson has struck career-best form as he builds towards a third Olympics, setting a national record while winning the 50m rifle event at the South Australia championships. Sampson registered a score of 462 points to claim gold in the three positions event. The performance bettered Sampson's own national record of 460.7 points, which he achieved at last month's Wingfield grand prix. The score was also notably higher than what Italy's Niccolo Campriani (458.8) and Poland's Tomasz Bartnik (460.4) produced to win gold at the 2016 Olympics and 2018 world championships respectively. "It's good to be shooting PBs at this stage. It was a world-class finals score," Sampson said, having previously competed at the 2012 and 2016 Olympics. "You are unlikely to lose many competitions with that score. "I definitely feel that I am getting better and better and I am tracking well for To

If you run this block of code, you should get the output shown above, meaning that the file was read successfully. Note that we saved the articles in the variable `articles`, which we will use several times later.

Now we come to the interesting part. We will do NER on the words gold and silver. **The goal is to find all articles mentioning gold or silver in a financial context. This is very important! Only in a financial context!**

### Attempt #1: Looking for strings *gold* and *silver*

Our first attempt will be simply looking for the strings *gold* and *silver*. The loop in the code block below is the main thing here. We loop article by article using the variable `articles` and look for the strings *gold* and *silver*.

In [34]:
def entity_matcher(article):
    # Process the article using spaCy
    doc = nlp(article)

    # Check if "gold" or "silver" is present in the article
    gold_present = any(token.text.lower() == "gold" for token in doc)
    silver_present = any(token.text.lower() == "silver" for token in doc)

    return gold_present, silver_present

gold_count = 0
silver_count = 0

# Lists to store articles containing "gold" and "silver"
gold_articles = []
silver_articles = []

# Iterate through each article in the 'articles' list
for article in articles:
    gold_present, silver_present = entity_matcher(article)

    # Check if "gold" is present in the article
    if gold_present:
        gold_count += 1
        gold_articles.append(article)

    # Check if "silver" is present in the article
    if silver_present:
        silver_count += 1
        silver_articles.append(article)

# Print the number of articles containing strings 'gold' and 'silver'
print("Number of articles containing the string 'gold':", gold_count)
print("Number of articles containing the string 'silver':", silver_count)

# Print the first 4 articles containing the string "gold"
print("\nFirst 4 articles containing the string 'gold':")
for i, article in enumerate(gold_articles[:4], start=1):
    print(f"{i}. {article}")

# Print the first 4 articles containing the string "silver"
print("\nFirst 4 articles containing the string 'silver':")
for i, article in enumerate(silver_articles[:4], start=1):
    print(f"{i}. {article}")

Number of articles containing the string 'gold': 602
Number of articles containing the string 'silver': 99

First 4 articles containing the string 'gold':
1. @fansoniclove Gold the Tenrec

2. Tokyo-bound Sampson sets Aust rifle record. Shooter Dane Sampson has struck career-best form as he builds towards a third Olympics, setting a national record while winning the 50m rifle event at the South Australia championships. Sampson registered a score of 462 points to claim gold in the three positions event. The performance bettered Sampson's own national record of 460.7 points, which he achieved at last month's Wingfield grand prix. The score was also notably higher than what Italy's Niccolo Campriani (458.8) and Poland's Tomasz Bartnik (460.4) produced to win gold at the 2016 Olympics and 2018 world championships respectively. "It's good to be shooting PBs at this stage. It was a world-class finals score," Sampson said, having previously competed at the 2012 and 2016 Olympics. "You are unli

If you run the code block above, you should get the output shown above. As you can see, there were `602` articles mentioning gold and `99` articles mentioning silver. We printed only the first 4 articles for each string, since printing all of them would produce far too much output, and a few articles are enough to make the point.

Remember what we said our goal was? To find all articles mentioning gold and silver in a financial context. Look at the second article containing the string *gold* above (i.e., `Tokyo-bound Sampson sets Aust rifle record. Shooter Dane...`). It’s mentioning gold in a sports context. This article is talking about gold medals. Is this in a financial context? No.

As we can see, looking for strings is not the way to go.
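To see how mixed the senses behind a plain string match are, it helps to look at each hit together with its surrounding words. The following is a minimal sketch using only the standard library; `keyword_in_context` and the `width` parameter are hypothetical names introduced for illustration, not part of spaCy or of the notebook above.

```python
import re

def keyword_in_context(text, keyword, width=30):
    """Return every occurrence of `keyword` with `width` characters of context on each side."""
    snippets = []
    # \b keeps "gold" from matching inside longer words such as "golden"
    for match in re.finditer(rf"\b{re.escape(keyword)}\b", text, flags=re.IGNORECASE):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        snippets.append(text[start:end])
    return snippets

# A sentence taken from the second article above
sample = "Sampson registered a score of 462 points to claim gold in the three positions event."
for snippet in keyword_in_context(sample, "gold"):
    print(snippet)
```

Running this over a handful of the matched articles makes it quick to judge by eye how many hits are about medals, names, or colours rather than the metal as an asset.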

### Attempt #2: NER

Our second attempt will be doing NER using spaCy. Let's try to look for *COMMODITY* entities.

In [18]:
# Define a function for matching entities in an article
def entity_matcher(article):
    # Process the article using spaCy
    doc = nlp(article)

    # Check if "gold" or "silver" is present as a recognized entity with the label "COMMODITY"
    gold_present = any(token.ent_type_ == "COMMODITY" and token.text.lower() == "gold" for token in doc)
    silver_present = any(token.ent_type_ == "COMMODITY" and token.text.lower() == "silver" for token in doc)

    return gold_present, silver_present

# Initialize counters for the number of articles containing 'gold' and 'silver' as commodities
gold_count = 0
silver_count = 0

# Loop through each article in the list of articles
for article in articles:
    # Call the entity_matcher function to check for 'gold' and 'silver' in the current article
    gold_present, silver_present = entity_matcher(article)

    # Increment the counters if 'gold' or 'silver' is present in the current article
    if gold_present:
        gold_count += 1
    if silver_present:
        silver_count += 1

# Display the results
print("Number of articles containing 'gold' as a commodity:", gold_count)
print("Number of articles containing 'silver' as a commodity:", silver_count)

Number of articles containing 'gold' as a commodity: 0
Number of articles containing 'silver' as a commodity: 0


If you run the code block above, you should get the output as seen above. What a disappointment! spaCy has found no such articles.

Okay, let’s take a step back and print all the entities that spaCy finds in the articles. Again, we print only the first three articles to avoid having too much text. It’s enough to get the point.

In [20]:
def print_entities(article):
    # Process the article using spaCy
    doc = nlp(article)

    # Print all recognized entities
    entities = [f"{ent.text} ({ent.label_})" for ent in doc.ents]
    print(f"Entities in the article: {', '.join(entities)}")

# Iterate through the first 3 articles and print entities
for i, article in enumerate(articles[:3], start=1):
    # Print header for each article
    print(f"\nEntities in Article {i}:\n")
    
    # Call the function to print entities for the current article
    print_entities(article)
    
    # Print separator between articles
    print("-----\n")


Entities in Article 1:

Entities in the article: Tenrec (GPE)
-----


Entities in Article 2:

Entities in the article: Tokyo (GPE), Sampson (PERSON), Aust (GPE), Shooter Dane Sampson (ORG), third (ORDINAL), Olympics (EVENT), 50m (QUANTITY), South Australia (GPE), Sampson (PERSON), 462 (CARDINAL), three (CARDINAL), Sampson (ORG), 460.7 (CARDINAL), last month's (DATE), Wingfield (PERSON), Italy (GPE), Niccolo Campriani (PERSON), 458.8 (CARDINAL), Poland (GPE), Tomasz Bartnik (PERSON), 460.4 (CARDINAL), 2016 (DATE), Olympics (EVENT), 2018 (DATE), Sampson (PERSON), 2012 (DATE), 2016 (DATE), Olympics (EVENT), Tokyo (GPE), Sampson (PERSON), Australia (GPE), 2021 (DATE), Olympics (EVENT), pre-Games (EVENT), July 4-19 (DATE), Tokyo (GPE)
-----


Entities in Article 3:

Entities in the article: 33 (CARDINAL), 688 (CARDINAL), MCB (ORG), a National Park (FAC), One (CARDINAL)
-----



If you run the code block above, you should get the output shown above. As you can see, spaCy found a lot of entities: people, locations, dates, quantities, events, etc. But there are no commodities. It looks like spaCy doesn’t look for entities labeled as commodities or anything similar.

Let's run the following line to get all the labels spaCy has defined by default.

In [21]:
# Retrieve the list of named entity labels used by the NER pipeline in the spaCy processing pipeline
nlp.get_pipe('ner').labels

('CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART')

Looking at the output above, we can confirm that, by default, spaCy has no label for commodities or anything similar.
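If some of the abbreviations in that list are unfamiliar, `spacy.explain` returns a short description for labels it knows and `None` for labels it does not. A small sketch, which only assumes that spaCy itself is installed (no model is needed):

```python
import spacy

# A few labels from the list above, plus the label we are about to add ourselves
for label in ["GPE", "NORP", "FAC", "COMMODITY"]:
    # spacy.explain returns None (and may emit a warning) for unknown labels
    print(label, "->", spacy.explain(label))
```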

Let’s add a *COMMODITY* label with patterns for gold and silver. First, we will add a single pattern for gold, test it on the simple sentence `Good time to buy gold?`, and see if spaCy now recognizes gold as an entity.

In [23]:
# Import spaCy
import spacy

# Load spaCy model
nlp = spacy.load("en_core_web_lg")

# Define patterns for the EntityRuler
patterns = [
    {"label": "COMMODITY", "pattern": "gold"}
]

# Add the EntityRuler to the spaCy pipeline
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

# Test the sentence
text = "Good time to buy gold?"
doc = nlp(text)

# Print recognized entities
for entity in doc.ents:
    print(entity.text, entity.label_)

gold COMMODITY


Success! spaCy recognized "gold" as an entity with the label "COMMODITY". Now we want to extract full articles in which "gold" or "silver" is found as a commodity in any sentence.
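One caveat worth checking at this point: a plain string pattern in the entity ruler is matched on the exact token text by default, so the pattern `"gold"` does not cover a capitalised "Gold" at the start of a sentence or headline. The sketch below illustrates this on a blank English pipeline (tokenizer only, an assumption made so the example runs without the large model) and compares it with a token pattern on the `LOWER` attribute. The variable names are for illustration only.

```python
import spacy

text = "Gold is up, so is it a good time to buy gold?"

# String pattern: compared against the exact token text
nlp_exact = spacy.blank("en")
ruler_exact = nlp_exact.add_pipe("entity_ruler")
ruler_exact.add_patterns([{"label": "COMMODITY", "pattern": "gold"}])
matched_exact = [ent.text for ent in nlp_exact(text).ents]

# Token pattern on LOWER: compared against the lowercased token text
nlp_lower = spacy.blank("en")
ruler_lower = nlp_lower.add_pipe("entity_ruler")
ruler_lower.add_patterns([{"label": "COMMODITY", "pattern": [{"LOWER": "gold"}]}])
matched_lower = [ent.text for ent in nlp_lower(text).ents]

print(matched_exact)
print(matched_lower)
```

With the string pattern only the lowercase occurrence is tagged, while the `LOWER` token pattern tags both.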

In [42]:
import spacy

# Load spaCy model
nlp = spacy.load("en_core_web_lg")

# Define patterns for the new entities
patterns = [
    {"label": "COMMODITY", "pattern": "gold"},
    {"label": "COMMODITY", "pattern": "silver"}
]

# Add the EntityRuler to the spaCy pipeline
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

# Read articles from file
with open("sample.txt", "r", encoding="utf-8") as file:
    articles = file.readlines()

# Counter for the number of articles containing "COMMODITY"
commodity_article_count = 0

# List to store the content of articles containing "COMMODITY"
commodity_articles = []

# Count all articles containing "COMMODITY" entities
for article in articles:
    doc = nlp(article)
    commodity_entities = [entity.text for entity in doc.ents if entity.label_ == "COMMODITY"]
    
    if commodity_entities:
        commodity_article_count += 1
        commodity_articles.append(article.strip())

# Print the number of articles containing 'COMMODITY' entities
print(f"Number of articles containing 'COMMODITY' entities: {commodity_article_count}\n")

# Print the first 4 articles containing the entity "COMMODITY"
print("First 4 articles containing 'COMMODITY' entities:")
for i in range(min(4, len(commodity_articles))):
    print(f"{i+1}. {commodity_articles[i]}\n")

Number of articles containing 'COMMODITY' entities: 179

First 4 articles containing 'COMMODITY' entities:
1. Tokyo-bound Sampson sets Aust rifle record. Shooter Dane Sampson has struck career-best form as he builds towards a third Olympics, setting a national record while winning the 50m rifle event at the South Australia championships. Sampson registered a score of 462 points to claim gold in the three positions event. The performance bettered Sampson's own national record of 460.7 points, which he achieved at last month's Wingfield grand prix. The score was also notably higher than what Italy's Niccolo Campriani (458.8) and Poland's Tomasz Bartnik (460.4) produced to win gold at the 2016 Olympics and 2018 world championships respectively. "It's good to be shooting PBs at this stage. It was a world-class finals score," Sampson said, having previously competed at the 2012 and 2016 Olympics. "You are unlikely to lose many competitions with that score. "I definitely feel that I am getti

As we can see, there are a lot of false positives (e.g., "Sampson registered a score of 462 points to claim gold in the three positions event."). In this sentence, "gold" is not mentioned in a financial context, but our pattern labels it as a commodity anyway.
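The counting loop stays the same no matter which patterns are used, so it can be factored into a helper. The following is a minimal sketch rather than the notebook's own code: `find_commodity_articles` and the demo data are hypothetical, and a blank English pipeline with a single string pattern stands in for the full model so the example runs without downloads. It uses `nlp.pipe`, spaCy's method for processing many texts as a stream.

```python
import spacy

def find_commodity_articles(nlp, articles, label="COMMODITY"):
    """Return the stripped articles in which `nlp` tags at least one `label` entity."""
    matches = []
    # nlp.pipe yields one Doc per input text, in input order
    for article, doc in zip(articles, nlp.pipe(articles)):
        if any(ent.label_ == label for ent in doc.ents):
            matches.append(article.strip())
    return matches

# Demo pipeline and data (assumptions for illustration only)
demo_nlp = spacy.blank("en")
demo_ruler = demo_nlp.add_pipe("entity_ruler")
demo_ruler.add_patterns([{"label": "COMMODITY", "pattern": "gold"}])

demo_articles = ["Good time to buy gold?\n", "No metals are mentioned here.\n"]
found = find_commodity_articles(demo_nlp, demo_articles)
print(len(found), found)
```

With a helper like this, each experiment reduces to building a pipeline with new patterns and calling the function on `articles`.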

Let's refine the patterns for the "COMMODITY" entity. Each pattern is now a list of token descriptions, one per token: the first token's lowercase text must be "gold" (or "silver"), and the token that follows must have a lowercase text of "price", "market", or "commodity". The "IN" operator expresses that second condition. It matches when the token attribute (here "LOWER", the lowercased token text) is a member of the given list, which gives us the flexibility to accept any of the specified words.
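To get a feel for what such a two-token pattern does and does not match, it can be tried in isolation with spaCy's `Matcher`. This is a sketch on a blank English pipeline (an assumption, so no model download is needed); `matched_spans` is a hypothetical helper and the test sentences are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Two consecutive tokens: a metal, then one of the context words
pattern = [
    {"LOWER": {"IN": ["gold", "silver"]}},
    {"LOWER": {"IN": ["price", "market", "commodity"]}},
]
matcher.add("COMMODITY", [pattern])

def matched_spans(text):
    """Return the text of every span matched by the pattern."""
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]

print(matched_spans("Gold price slips as the silver market rallies"))
print(matched_spans("Sampson claimed gold in the three positions event"))
print(matched_spans("Domestic gold prices fell"))
```

The sports sentence is no longer matched, but neither is "gold prices": `IN` compares whole tokens, so the plural needs its own list entry.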

In [43]:
import spacy

# Load spaCy model
nlp = spacy.load("en_core_web_lg")

# Define patterns for the new entities
patterns = [
    {"label": "COMMODITY", "pattern": [{"LOWER": "gold"}, {"LOWER": {"IN": ["price", "market", "commodity"]}}]}, # Changed
    {"label": "COMMODITY", "pattern": [{"LOWER": "silver"}, {"LOWER": {"IN": ["price", "market", "commodity"]}}]}, # Changed
]

# Add the EntityRuler to the spaCy pipeline
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

# Read articles from file
with open("sample.txt", "r", encoding="utf-8") as file:
    articles = file.readlines()

# Counter for the number of articles containing "COMMODITY"
commodity_article_count = 0

# List to store the content of articles containing "COMMODITY"
commodity_articles = []

# Count all articles containing "COMMODITY" entities
for article in articles:
    doc = nlp(article)
    commodity_entities = [entity.text for entity in doc.ents if entity.label_ == "COMMODITY"]
    
    if commodity_entities:
        commodity_article_count += 1
        commodity_articles.append(article.strip())

# Print the number of articles containing 'COMMODITY' entities
print(f"Number of articles containing 'COMMODITY' entities: {commodity_article_count}\n")

# Print the first 4 articles containing the entity "COMMODITY"
print("First 4 articles containing 'COMMODITY' entities:")
for i in range(min(4, len(commodity_articles))):
    print(f"{i+1}. {commodity_articles[i]}\n")

Number of articles containing 'COMMODITY' entities: 15

First 4 articles containing 'COMMODITY' entities:
1. Gold price trading in neutral territory following rising inflation, U.S. personal income grows 21.1% in March… https://t.co/bvJEdowf6g

2. Gold Price Today April 30: Gold Rate Continues To Fall, Check City-Wise Price List https://t.co/HGRF8zgSS9

3. Advertorial (?) in the Australian from Anzac Day ..cheers https://www.theaustralian.com.au/business/dgo-gold-aiming-to-strike-a-balance-in-gold-exploration/news-story/231d9ce048bfd7d7f87545741008db15DGO Gold aiming to strike a balance in gold explorationDGO Gold is hoping it has created a strategic investment model that will reinvigorate gold exploration in Australia. Picture: BloombergDAMON KITNEYVICTORIAN BUSINESS EDITOR6:42PM APRIL 25, 2021Melbourne investment banker Bruce Parncutt calls his latest plaything a “gold discovery company”.Neither an explorer nor a miner, the ASX-listed DGO Gold — of which Mr Parncutt has been an execu

In [44]:
import spacy

# Load spaCy model
nlp = spacy.load("en_core_web_lg")

# Define patterns for the new entities
patterns = [
    # Each dict in a token pattern describes exactly one token, so the "IN" lists hold single-word terms only
    {"label": "COMMODITY", "pattern": [{"LOWER": {"IN": ["gold"]}}, {"LOWER": {"IN": ["price", "market", "commodity", "trading", "investment", "bullion", "futures", "yield", "contract"]}}]}, # Changed
    {"label": "COMMODITY", "pattern": [{"LOWER": {"IN": ["silver"]}}, {"LOWER": {"IN": ["price", "market", "commodity", "trading", "investment", "bullion", "futures", "yield", "contract"]}}]}, # Changed
]

# Add the EntityRuler to the spaCy pipeline
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

# Read articles from file
with open("sample.txt", "r", encoding="utf-8") as file:
    articles = file.readlines()

# Counter for the number of articles containing "COMMODITY"
commodity_article_count = 0

# List to store the content of articles containing "COMMODITY"
commodity_articles = []

# Count all articles containing "COMMODITY" entities
for article in articles:
    doc = nlp(article)
    commodity_entities = [entity.text for entity in doc.ents if entity.label_ == "COMMODITY"]
    
    if commodity_entities:
        commodity_article_count += 1
        commodity_articles.append(article.strip())

# Print the number of articles containing 'COMMODITY' entities
print(f"Number of articles containing 'COMMODITY' entities: {commodity_article_count}\n")

# Print the first 4 articles containing the entity "COMMODITY"
print("First 4 articles containing 'COMMODITY' entities:")
for i in range(min(4, len(commodity_articles))):
    print(f"{i+1}. {commodity_articles[i]}\n")

Number of articles containing 'COMMODITY' entities: 43

First 4 articles containing 'COMMODITY' entities:
1. Gold price trading in neutral territory following rising inflation, U.S. personal income grows 21.1% in March… https://t.co/bvJEdowf6g

2. Gold Price Today April 30: Gold Rate Continues To Fall, Check City-Wise Price List https://t.co/HGRF8zgSS9

3. Gold futures eke out gain on volatile day, silver near Rs 67,550/kg: What analysts say - Domestic gold prices seesa… https://t.co/w6qkrpNcOA

4. Silver Futures Discussions. why moment not saw from 26.000,any problem let me know anybody
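A note on multi-word terms in token patterns: every dict in a pattern is compared against a single token, so a string containing a space (for example "precious metal") inside an `IN` list can never match. A phrase needs one dict per token instead. The sketch below shows the difference on a blank English pipeline; the sentence and the matcher names are assumptions for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("Gold bullion and other precious metal holdings rose")

# One dict holding a two-word string: compared against single tokens, so it never matches
single_dict = Matcher(nlp.vocab)
single_dict.add("TERM", [[{"LOWER": {"IN": ["precious metal"]}}]])

# One dict per token: matches the consecutive tokens "precious" and "metal"
per_token = Matcher(nlp.vocab)
per_token.add("TERM", [[{"LOWER": "precious"}, {"LOWER": "metal"}]])

single_hits = [doc[start:end].text for _, start, end in single_dict(doc)]
token_hits = [doc[start:end].text for _, start, end in per_token(doc)]
print(single_hits)
print(token_hits)
```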



Until this point, we haven't handled false negatives. A false negative occurs when an article is finance-related but our patterns fail to flag it. One cause is that several other aliases can be used to refer to gold or silver. For example, the XAU/USD alias will almost certainly refer to the gold price per troy ounce in US dollars, and it is commonly used when talking about price events. So, let's add four more patterns to handle false negatives.

In [45]:
import spacy

# Load spaCy model
nlp = spacy.load("en_core_web_lg")

# Define patterns for the new entities
patterns = [
    # Each dict in a token pattern describes exactly one token, so the "IN" lists hold single-word terms only
    {"label": "COMMODITY", "pattern": [{"LOWER": {"IN": ["gold"]}}, {"LOWER": {"IN": ["price", "market", "commodity", "trading", "investment", "bullion", "futures", "yield", "contract"]}}]},
    {"label": "COMMODITY", "pattern": [{"LOWER": {"IN": ["silver"]}}, {"LOWER": {"IN": ["price", "market", "commodity", "trading", "investment", "bullion", "futures", "yield", "contract"]}}]},
    {"label": "COMMODITY", "pattern": "XAU/USD"}, # Added
    {"label": "COMMODITY", "pattern": "XAUUSD"}, # Added
    {"label": "COMMODITY", "pattern": "XAG/USD"}, # Added
    {"label": "COMMODITY", "pattern": "XAGUSD"}, # Added
]

# Add the EntityRuler to the spaCy pipeline
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

# Read articles from file
with open("sample.txt", "r", encoding="utf-8") as file:
    articles = file.readlines()

# Counter for the number of articles containing "COMMODITY"
commodity_article_count = 0

# List to store the content of articles containing "COMMODITY"
commodity_articles = []

# Count all articles containing "COMMODITY" entities
for article in articles:
    doc = nlp(article)
    commodity_entities = [entity.text for entity in doc.ents if entity.label_ == "COMMODITY"]
    
    if commodity_entities:
        commodity_article_count += 1
        commodity_articles.append(article.strip())

# Print the number of articles containing 'COMMODITY' entities
print(f"Number of articles containing 'COMMODITY' entities: {commodity_article_count}\n")

# Print the first 4 articles containing the entity "COMMODITY"
print("First 4 articles containing 'COMMODITY' entities:")
for i in range(min(4, len(commodity_articles))):
    print(f"{i+1}. {commodity_articles[i]}\n")

Number of articles containing 'COMMODITY' entities: 47

First 4 articles containing 'COMMODITY' entities:
1. Gold Price News and Forecast: XAU/USD is trapped in daily support and resistance. Posted by: EUR Editor  in EUR  1 min ago  XAU/USD struggles below $1,800 amid risk-off mood. Gold steps back from intraday high while flashing $1,772 as a quote amid Friday’s Asian session. In doing … Read Full Story at source (may require registration)     Latest posts by EUR Editor ( see all )

2. Gold price trading in neutral territory following rising inflation, U.S. personal income grows 21.1% in March… https://t.co/bvJEdowf6g

3. Gold Price Today April 30: Gold Rate Continues To Fall, Check City-Wise Price List https://t.co/HGRF8zgSS9

4. Gold futures eke out gain on volatile day, silver near Rs 67,550/kg: What analysts say - Domestic gold prices seesa… https://t.co/w6qkrpNcOA



Tweaking this approach further (i.e., adding more words to the patterns) would probably yield even better results, meaning even fewer false negatives ceteris paribus (i.e., for the same number of false positives), but I decided to stop here.
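To make "fewer false negatives for the same number of false positives" measurable, one option is to hand-label a small sample of articles and compute precision and recall for each set of patterns. The following is a minimal sketch under that assumption; `precision_recall` is a hypothetical helper, and the article indices are made-up placeholders, not results from `sample.txt`.

```python
def precision_recall(predicted, relevant):
    """Compute precision and recall from two collections of article indices."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    # Precision drops with false positives, recall drops with false negatives
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: articles flagged by the patterns vs. articles labelled financial by hand
predicted = {0, 1, 2, 3}
relevant = {1, 2, 3, 4, 5}
p, r = precision_recall(predicted, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here three of the four flagged articles are truly financial (precision 0.75), and three of the five financial articles were found (recall 0.60).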