# Homework 8: Build an EntityRuler in spaCy for Companies and Stocks
### Due Date: 04/07/25
- **Objective:** Students will use spaCy's EntityRuler to create and apply rules for extracting entities representing companies and stock symbols from a pandas DataFrame.

# Task 1: Setup Environment
- 1. Ensure you have spaCy and pandas installed in your Python environment.
- 2. Import the required libraries for working with spaCy and pandas.


In [2]:
#Import and install spaCy
!pip install spacy
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')

#Import Pandas
import pandas as pd


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


# Task 2: Load the Dataset
- 1. Download the provided dataset and load it into a pandas DataFrame.
    -  HINT: The file is a .tsv, so use pd.read_csv() with sep='\t' to read the file.
- 2. Examine the DataFrame to identify the columns containing company names and stock symbols.
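
As a minimal sketch of the hint above, the snippet reads tab-separated text with `pd.read_csv(..., sep="\t")`. The two-row `sample_tsv` string is a hypothetical stand-in for the homework file, held in memory so the example runs without the download; the column names mirror the ones in the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same tab-separated layout as the homework file.
sample_tsv = (
    "Symbol\tCompanyName\tIndustry\tMarketCap\n"
    "A\tAgilent Technologies\tLife Sciences Tools & Services\t53.65B\n"
    "AA\tAlcoa\tMetals & Mining\t9.25B\n"
)

# sep="\t" tells pandas to split columns on tabs instead of commas.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Inspect the column names to find the company-name and symbol columns.
print(sample.columns.tolist())
print(sample.shape)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.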





In [16]:
#Load Stocks File
stocks = pd.read_csv("stocks-1 (1).tsv", sep="\t")

#Preview Dataset
stocks.head()





| Unnamed: 0 | Symbol | CompanyName | Industry | MarketCap |
|---|---|---|---|---|
| 0 | A | Agilent Technologies | Life Sciences Tools & Services | 53.65B |
| 1 | AA | Alcoa | Metals & Mining | 9.25B |
| 2 | AAC | Ares Acquisition | Shell Companies | 1.22B |
| 3 | AACG | ATA Creativity Global | Diversified Consumer Services | 90.35M |
| 4 | AADI | Aadi Bioscience | Pharmaceuticals | 104.85M |


In [None]:
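#Investigate Dataset
#Note: a notebook cell only displays its last expression, so this summary is computed but not shown
stocks.describe()

#Check for any missing values
stocks.isna().sum()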

Symbol         0
CompanyName    0
Industry       9
MarketCap      0
dtype: int64

# Task 3: Extract Data for Patterns
- 1. Extract unique company names and stock symbols from the appropriate columns of the DataFrame.
- 2. Create patterns for each company and stock symbol, ensuring they are properly formatted to be recognized by spaCy's EntityRuler.
    - DO NOT manually input individual company stocks to create your patterns. Find an automated solution in case your DataFrame ever gets updated. HINT: Think for-loops.




In [30]:
# Extract unique company names and stock symbols
unique_companies = stocks['CompanyName'].dropna().unique()
unique_symbols = stocks['Symbol'].dropna().unique()

# Create patterns for company names
company_patterns = []
for name in unique_companies:
    company_patterns.append({"label": "COMPANY", "pattern": name})

# Create patterns for stock symbols
symbol_patterns = []
for symbol in unique_symbols:
    symbol_patterns.append({"label": "STOCK_SYMBOL", "pattern": symbol})

#Combine both patterns
combined_pattern = company_patterns + symbol_patterns
print(combined_pattern)


[{'label': 'COMPANY', 'pattern': 'Agilent Technologies'}, {'label': 'COMPANY', 'pattern': 'Alcoa'}, {'label': 'COMPANY', 'pattern': 'Ares Acquisition'}, {'label': 'COMPANY', 'pattern': 'ATA Creativity Global'}, {'label': 'COMPANY', 'pattern': 'Aadi Bioscience'}, {'label': 'COMPANY', 'pattern': 'Arlington Asset Investment'}, {'label': 'COMPANY', 'pattern': 'American Airlines'}, {'label': 'COMPANY', 'pattern': 'Altisource Asset Management'}, {'label': 'COMPANY', 'pattern': 'Atlantic American'}, {'label': 'COMPANY', 'pattern': "The Aaron's Company"}, {'label': 'COMPANY', 'pattern': 'Applied Optoelectronics'}, {'label': 'COMPANY', 'pattern': 'AAON, Inc.'}, {'label': 'COMPANY', 'pattern': 'Advance Auto Parts'}, {'label': 'COMPANY', 'pattern': 'Apple'}, {'label': 'COMPANY', 'pattern': 'Accelerate Acquisition'}, {'label': 'COMPANY', 'pattern': 'American Assets Trust'}, {'label': 'COMPANY', 'pattern': 'Autoscope Technologies'}, {'label': 'COMPANY', 'pattern': 'Almaden Minerals'}, {'label': 'CO
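
Printing every pattern produces a very long list, so a per-label count plus a short preview may be easier to check. The sketch below assumes patterns in the same `{"label": ..., "pattern": ...}` shape; `sample_patterns` is a hypothetical four-item stand-in for the full list.

```python
from collections import Counter

# Hypothetical stand-in for the full pattern list built from the DataFrame.
sample_patterns = [
    {"label": "COMPANY", "pattern": "Agilent Technologies"},
    {"label": "COMPANY", "pattern": "Alcoa"},
    {"label": "STOCK_SYMBOL", "pattern": "A"},
    {"label": "STOCK_SYMBOL", "pattern": "AA"},
]

# Count how many patterns were created for each entity label.
label_counts = Counter(p["label"] for p in sample_patterns)
print(label_counts)

# Preview only the first few patterns instead of the whole list.
print(sample_patterns[:2])
```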

# Task 4: Create an EntityRuler
- 1. Use a spaCy language model to create an EntityRuler.
- 2. Add the patterns for both companies and stock symbols to the EntityRuler pipeline.
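
A minimal sketch of these two steps, assuming a blank English pipeline (`spacy.blank("en")`) rather than `en_core_web_sm` so that it runs without a model download. The names `nlp_demo` and `ruler_demo` and the four patterns are illustrative, not part of the assignment.

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical components.
nlp_demo = spacy.blank("en")

# add_pipe returns the new EntityRuler, which then accepts patterns.
ruler_demo = nlp_demo.add_pipe("entity_ruler")
ruler_demo.add_patterns([
    {"label": "COMPANY", "pattern": "Apple"},
    {"label": "STOCK_SYMBOL", "pattern": "AAPL"},
    {"label": "COMPANY", "pattern": "Alcoa"},
    {"label": "STOCK_SYMBOL", "pattern": "AA"},
])

doc = nlp_demo("Apple (AAPL) and Alcoa (AA) both traded higher.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

In a pipeline that also contains a statistical `ner` component, adding the ruler with `before="ner"` lets the rule-based entities be set first; the `ner` component then keeps those spans and predicts entities around them.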

In [32]:
# Add the EntityRuler before the statistical "ner" component so its matches are set first.
if "entity_ruler" in nlp.pipe_names:
    # If entity_ruler already exists (e.g., the cell was re-run), reuse it.
    ruler = nlp.get_pipe("entity_ruler")
elif "ner" in nlp.pipe_names:
    ruler = nlp.add_pipe("entity_ruler", before="ner")
else:
    # No NER component in this pipeline: add the EntityRuler at the end.
    # (Adding a new "ner" pipe here would create an untrained component.)
    ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(combined_pattern)

# Check updated pipeline components
print("\nUpdated Pipeline Labels:")
nlp.pipeline


Updated Pipeline Labels:


[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1281d3c50>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1281d3710>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1283651c0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1285cc0d0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1285bacd0>),
 ('entity_ruler', <spacy.pipeline.entityruler.EntityRuler at 0x16b1571d0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1283652a0>)]

# Task 5: Test the EntityRuler
- 1. Use the sample texts below, which include references to companies and stock symbols.
- 2. Apply your EntityRuler to the text and check if it correctly identifies the entities.

    - **Paragraph 1:** Helmerich & Payne (HP) saw its stock rise by 1.5%, fueled by optimistic forecasts in the Energy Equipment & Services sector. In contrast, Check-Cap (CHEK) faced a decline of 2.3% following its announcement of increased costs related to supply chain disruptions.
    Meanwhile, Vallon Pharmaceuticals (VLON) gained 0.8% after strong quarterly earnings, outperforming its peers in the Biotechnology space. Sequans Communications (SQNS) also recorded a modest increase of 0.5%, reflecting investors' confidence in its ability to navigate challenges in the Semiconductors & Semiconductor Equipment industry.

    - **Paragraph 2:** Aemetis (AMTX) saw its stock rise by 1.5%, fueled by optimistic forecasts in the Oil, Gas & Consumable Fuels sector. In contrast, Ferro Corporation (FOE) faced a decline of 2.3% following its announcement of increased costs related to supply chain disruptions.
    Meanwhile, RingCentral (RNG) gained 0.8% after strong quarterly earnings, outperforming its peers in the Software space. ACI Worldwide (ACIW) also recorded a modest increase of 0.5%, reflecting investors' confidence in its ability to navigate challenges in the Software industry.
    
    - **Paragraph 3:** On a mixed trading day, Par Pacific Holdings (PARR) saw its stock rise by 1.5%, fueled by optimistic forecasts in the Oil, Gas & Consumable Fuels sector. In contrast, Nano Dimension (NNDM) faced a decline of 2.3% following its announcement of increased costs related to supply chain disruptions.
    Meanwhile, Beyond Meat (BYND) gained 0.8% after strong quarterly earnings, outperforming its peers in the Food Products space. Apollo Investment (AINV) also recorded a modest increase of 0.5%, reflecting investors' confidence in its ability to navigate challenges in the Capital Markets industry.
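
Alongside a visual check, the recognized entities can also be listed programmatically, which makes it easier to confirm that each company and symbol was found. This is a sketch under stated assumptions: it uses a blank English pipeline with two illustrative patterns, and `entities_by_label` is a hypothetical helper, not a spaCy function.

```python
import spacy

# Blank pipeline with two illustrative patterns (assumed, not the full dataset).
nlp_check = spacy.blank("en")
ruler_check = nlp_check.add_pipe("entity_ruler")
ruler_check.add_patterns([
    {"label": "COMPANY", "pattern": "Beyond Meat"},
    {"label": "STOCK_SYMBOL", "pattern": "BYND"},
])


def entities_by_label(doc, labels=("COMPANY", "STOCK_SYMBOL")):
    """Return (text, label) pairs for entities whose label is in `labels`."""
    return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in labels]


doc_check = nlp_check(
    "Meanwhile, Beyond Meat (BYND) gained 0.8% after strong quarterly earnings."
)
result = entities_by_label(doc_check)
print(result)
```

Filtering by label keeps the output focused on the two custom entity types when the pipeline also produces other labels such as PERCENT or ORG.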

In [None]:
#TODO: Should I add a pattern for Industry?
#Relevant Library
from spacy import displacy

#Test Paragraph 1
paragraph_1= nlp("Helmerich & Payne (HP) saw its stock rise by 1.5%, fueled by optimistic forecasts in the Energy Equipment & Services sector. In contrast, Check-Cap (CHEK) faced a decline of 2.3% following its announcement of increased costs related to supply chain disruptions. Meanwhile, Vallon Pharmaceuticals (VLON) gained 0.8% after strong quarterly earnings, outperforming its peers in the Biotechnology space. Sequans Communications (SQNS) also recorded a modest increase of 0.5%, reflecting investors' confidence in its ability to navigate challenges in the Semiconductors & Semiconductor Equipment industry.")
displacy.render(paragraph_1, style="ent", jupyter=True)

In [None]:
#Test Paragraph 2
paragraph_2= nlp("Aemetis (AMTX) saw its stock rise by 1.5%, fueled by optimistic forecasts in the Oil, Gas & Consumable Fuels sector. In contrast, Ferro Corporation (FOE) faced a decline of 2.3% following its announcement of increased costs related to supply chain disruptions. Meanwhile, RingCentral (RNG) gained 0.8% after strong quarterly earnings, outperforming its peers in the Software space. ACI Worldwide (ACIW) also recorded a modest increase of 0.5%, reflecting investors' confidence in its ability to navigate challenges in the Software industry.")
displacy.render(paragraph_2, style="ent", jupyter=True)

In [38]:
#Test Paragraph 3
paragraph_3 = nlp("On a mixed trading day, Par Pacific Holdings (PARR) saw its stock rise by 1.5%, fueled by optimistic forecasts in the Oil, Gas & Consumable Fuels sector. In contrast, Nano Dimension (NNDM) faced a decline of 2.3% following its announcement of increased costs related to supply chain disruptions. Meanwhile, Beyond Meat (BYND) gained 0.8% after strong quarterly earnings, outperforming its peers in the Food Products space. Apollo Investment (AINV) also recorded a modest increase of 0.5%, reflecting investors' confidence in its ability to navigate challenges in the Capital Markets industry.")
displacy.render(paragraph_3, style="ent", jupyter=True)