# spaCy Pipeline

- **Goal:** Prediction Recognition

- **Purpose:** Extract named entities (NER), part-of-speech (POS) tags, etc.
    1. Use these as additional features when training the model, since feature extraction alone (e.g., TF-IDF) isn't enough

- **Misc:**
    - `%store`: Line magic that stores a variable so it can be loaded in another notebook with `%store -r`
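
To illustrate why TF-IDF alone may fall short, here is a minimal sketch in plain Python. The two sentences are hypothetical (modeled on the templates used below), and the smoothed IDF formula is one common variant, not necessarily the one used elsewhere in this project.

```python
import math
from collections import Counter

# Two toy sentences (hypothetical, modeled on the financial templates in this notebook)
docs = [
    "Ethan Kim from Goldman Sachs forecasts that the stock price will fall",
    "Emily Chen from Cisco Systems forecasts that the net profit may increase",
]
tokenized = [doc.lower().split() for doc in docs]

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(tokenized)


def tf_idf(tokens):
    """Return {term: tf * idf} with a smoothed idf = log((1 + N) / (1 + df)) + 1."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens))
        * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }


weights = tf_idf(tokenized[0])
# "goldman" and "sachs" are scored as two independent tokens;
# nothing in the weights says they form a single organization name
print(weights["goldman"], weights["sachs"])
```

The weights only reflect how often a token occurs; whether a token is a person, an organization, or a date is the kind of information the NER step is meant to add.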

In [1]:
import os
import sys
import spacy

import pandas as pd
# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from data_processing import DataProcessing

In [2]:
pd.set_option('max_colwidth', 800)
nlp = spacy.load("en_core_web_sm")
%store -r predictions_df
%store -r non_predictions_df

In [3]:
predictions_df

Unnamed: 0,Base Sentence,Prediction Label,Model Name,Domain
0,"T1: On [2024-10-20], [Samantha Thompson, a financial analyst] predicts that the [operating cash flow] at [Johnson & Johnson] [will likely] [increase] by [10 percent to $25 billion] in [2026 Q2].",1,llama-3.3-70b-versatile,financial
1,"T2: In [Q4 of 2024], [Ethan Kim] from [Goldman Sachs], forecasts that the [stock price] [will] [fall] from [$120 to $90 per share] in [2028 Q1].",1,llama-3.3-70b-versatile,financial
2,"T3: [Alexander Brown, a financial expert] predicts on [15 November 2024] that the [research and development expenses] at [Intel] [may] [stay stable] at [$15 million] in [2027 Q3].",1,llama-3.3-70b-versatile,financial
3,"T4: According to a [senior executive] from [Coca-Cola], on [2024/08/25], the [net profit] [is expected to] [increase] beyond [$5 billion] in the timeframe of [2029 Q4].",1,llama-3.3-70b-versatile,financial
4,"T5: In [2025-02-15], the [revenue] at [Visa] [will] [rise] by [15 percent to $30 billion] [increase], as predicted by [David Lee, a financial reporter] on [2024-10-10].",1,llama-3.3-70b-versatile,financial
5,"T1: On [Wednesday, August 28, 2024], [Olivia Patel, an investor] predicts that the [gross profit] at [UnitedHealth Group] [should] [decrease] by [5 percent to $10 billion] in [2026 Q4].",1,llama-3.3-70b-versatile,financial
6,"T2: In [2027 Q1], [Jackson Davis] from [Morgan Stanley], envisions that the [operating income] [will] [rise] from [$10 billion to $15 billion] in [2028 Q2].",1,llama-3.3-70b-versatile,financial
7,"T3: [Ava Morales, a financial analyst] predicts on [2024-09-20] that the [net profit] at [Procter & Gamble] [may] [fall] under [10 percent to $5 billion] in [2027 Q1].",1,llama-3.3-70b-versatile,financial
8,"T4: According to [Emily Chen] from [Cisco Systems], on [2024/11/15], the [research and development expenses] [may] [increase] as much as [$500 million, reflecting a 20 percent increase] by [2026 Q3].",1,llama-3.3-70b-versatile,financial
9,"T5: In [2029 Q3], the [stock price] at [3M] [is expected to] [decrease] by [10 percent to $100 per share] [fall], as predicted by [Michael Kim, a financial expert] on [2024-08-10].",1,llama-3.3-70b-versatile,financial


In [4]:
data_processing = DataProcessing
updated_predictions_df = data_processing.reformat_df_with_template_number(predictions_df, col_name="Base Sentence")
updated_predictions_df

Unnamed: 0,Base Sentence,Prediction Label,Model Name,Domain,Template Number
0,"On [2024-10-20], [Samantha Thompson, a financial analyst] predicts that the [operating cash flow] at [Johnson & Johnson] [will likely] [increase] by [10 percent to $25 billion] in [2026 Q2].",1,llama-3.3-70b-versatile,financial,1
1,"In [Q4 of 2024], [Ethan Kim] from [Goldman Sachs], forecasts that the [stock price] [will] [fall] from [$120 to $90 per share] in [2028 Q1].",1,llama-3.3-70b-versatile,financial,2
2,"[Alexander Brown, a financial expert] predicts on [15 November 2024] that the [research and development expenses] at [Intel] [may] [stay stable] at [$15 million] in [2027 Q3].",1,llama-3.3-70b-versatile,financial,3
3,"According to a [senior executive] from [Coca-Cola], on [2024/08/25], the [net profit] [is expected to] [increase] beyond [$5 billion] in the timeframe of [2029 Q4].",1,llama-3.3-70b-versatile,financial,4
4,"In [2025-02-15], the [revenue] at [Visa] [will] [rise] by [15 percent to $30 billion] [increase], as predicted by [David Lee, a financial reporter] on [2024-10-10].",1,llama-3.3-70b-versatile,financial,5
5,"On [Wednesday, August 28, 2024], [Olivia Patel, an investor] predicts that the [gross profit] at [UnitedHealth Group] [should] [decrease] by [5 percent to $10 billion] in [2026 Q4].",1,llama-3.3-70b-versatile,financial,1
6,"In [2027 Q1], [Jackson Davis] from [Morgan Stanley], envisions that the [operating income] [will] [rise] from [$10 billion to $15 billion] in [2028 Q2].",1,llama-3.3-70b-versatile,financial,2
7,"[Ava Morales, a financial analyst] predicts on [2024-09-20] that the [net profit] at [Procter & Gamble] [may] [fall] under [10 percent to $5 billion] in [2027 Q1].",1,llama-3.3-70b-versatile,financial,3
8,"According to [Emily Chen] from [Cisco Systems], on [2024/11/15], the [research and development expenses] [may] [increase] as much as [$500 million, reflecting a 20 percent increase] by [2026 Q3].",1,llama-3.3-70b-versatile,financial,4
9,"In [2029 Q3], the [stock price] at [3M] [is expected to] [decrease] by [10 percent to $100 per share] [fall], as predicted by [Michael Kim, a financial expert] on [2024-08-10].",1,llama-3.3-70b-versatile,financial,5
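
Judging from the two outputs above, the `Template Number` column comes from the leading `T<n>:` marker, which is then removed from the sentence. The implementation of `reformat_df_with_template_number` is not shown in this notebook, so the following is only a sketch of that step for a single string; `split_template_prefix` is a hypothetical helper, not part of `DataProcessing`.

```python
import re

# Hypothetical pattern for the leading "T<n>: " marker seen in the raw sentences
TEMPLATE_PREFIX = re.compile(r"^T(\d+):\s*")


def split_template_prefix(sentence):
    """Return (sentence without the 'T<n>: ' prefix, template number or None)."""
    match = TEMPLATE_PREFIX.match(sentence)
    if match is None:
        return sentence, None
    return sentence[match.end():], int(match.group(1))


raw = (
    "T2: In [Q4 of 2024], [Ethan Kim] from [Goldman Sachs], forecasts that the "
    "[stock price] [will] [fall] from [$120 to $90 per share] in [2028 Q1]."
)
clean, template_number = split_template_prefix(raw)
print(template_number)
print(clean)
```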


In [5]:
only_predictions = DataProcessing.df_to_list(updated_predictions_df, 'Base Sentence')
only_predictions

['On [2024-10-20], [Samantha Thompson, a financial analyst] predicts that the [operating cash flow] at [Johnson & Johnson] [will likely] [increase] by [10 percent to $25 billion] in [2026 Q2].',
 'In [Q4 of 2024], [Ethan Kim] from [Goldman Sachs], forecasts that the [stock price] [will] [fall] from [$120 to $90 per share] in [2028 Q1].',
 '[Alexander Brown, a financial expert] predicts on [15 November 2024] that the [research and development expenses] at [Intel] [may] [stay stable] at [$15 million] in [2027 Q3].',
 'According to a [senior executive] from [Coca-Cola], on [2024/08/25], the [net profit] [is expected to] [increase] beyond [$5 billion] in the timeframe of [2029 Q4].',
 'In [2025-02-15], the [revenue] at [Visa] [will] [rise] by [15 percent to $30 billion] [increase], as predicted by [David Lee, a financial reporter] on [2024-10-10].',
 'On [Wednesday, August 28, 2024], [Olivia Patel, an investor] predicts that the [gross profit] at [UnitedHealth Group] [should] [decrease] by

In [6]:
ner_dfs = []
# Build one single-row DataFrame per sentence, with one column per entity found
for doc in nlp.pipe(only_predictions, disable=["tok2vec", "parser", "attribute_ruler", "lemmatizer"]):
    label_counts = {}
    columns = []
    values = []
    for ent in doc.ents:
        # Suffix repeated labels (DATE, DATE_2, DATE_3, ...) so column names stay unique
        label_counts[ent.label_] = label_counts.get(ent.label_, 0) + 1
        count = label_counts[ent.label_]
        columns.append(ent.label_ if count == 1 else f"{ent.label_}_{count}")
        values.append(ent.text)
    ner_df = pd.DataFrame([values], columns=columns)
    ner_dfs.append(ner_df)

In [8]:
ner_dfs[4]

Unnamed: 0,DATE,PERCENT,MONEY,PERSON,DATE_2
0,2025-02-15,15 percent,$30 billion,David Lee,2024-10-10


In [13]:
import spacy
import pandas as pd

# Load the small English pipeline (tokenizer, tagger, parser, NER, ...)
nlp = spacy.load("en_core_web_sm")

# Collect the (text, label) pairs found in each sentence
ner_results = []

# Process the prediction sentences and extract named entities
for doc in nlp.pipe(only_predictions, disable=["tok2vec", "tagger", "parser", "lemmatizer"]):
    doc_ents = [(ent.text, ent.label_) for ent in doc.ents]
    ner_results.append(doc_ents)

# Find all unique NER tags and convert the set to a list (column order is arbitrary)
all_tags = list(set([ent[1] for doc_ents in ner_results for ent in doc_ents]))

# Create a DataFrame with one column for each tag and rows corresponding to each document
df_ner = pd.DataFrame(columns=all_tags)

# Note: when a tag occurs more than once in a document, the last entity overwrites earlier ones
for i, doc_ents in enumerate(ner_results):
    for ent in doc_ents:
        df_ner.at[i, ent[1]] = ent[0]

# Display the DataFrame with NER results
print(df_ner)

      CARDINAL        DATE           WORK_OF_ART                GPE  \
0          NaN        2026                   NaN                NaN   
1          NaN        2028            Q4 of 2024                NaN   
2           15        2027                   NaN                NaN   
3   2024/08/25        2029                   NaN                NaN   
4          NaN  2024-10-10                   NaN                NaN   
5          NaN        2026                   NaN                NaN   
6          NaN        2028                   NaN                NaN   
7          NaN        2027                   NaN                NaN   
8   2024/11/15        2026                   NaN                NaN   
9          NaN  2024-08-10                   NaN                NaN   
10         NaN  2025-03-15                   NaN        Los Angeles   
11          10  2025-02-10            Q4 of 2024      New York City   
12  2024/08/25        2025             Quarter 2            Chicago   
13    

In [14]:
df_ner

Unnamed: 0,CARDINAL,DATE,WORK_OF_ART,GPE,PERSON,TIME,QUANTITY,MONEY,PERCENT,LOC,FAC,ORG
0,,2026,,,Samantha Thompson,,,$25 billion,10 percent,,,Johnson & Johnson
1,,2028,Q4 of 2024,,Ethan Kim,,,90,,,,Goldman Sachs
2,15,2027,,,Alexander Brown,,,$15 million,,,,Intel
3,2024/08/25,2029,,,,,,$5 billion,,,,Coca-Cola
4,,2024-10-10,,,David Lee,,,$30 billion,15 percent,,,
5,,2026,,,Olivia Patel,,,$10 billion,5 percent,,,UnitedHealth Group
6,,2028,,,Jackson Davis,,,$10 billion to $15 billion,,,,Morgan Stanley
7,,2027,,,Ava Morales,,,$5 billion,10 percent,,,Procter & Gamble
8,2024/11/15,2026,,,Emily Chen,,,$500 million,20 percent,,,Cisco Systems
9,,2024-08-10,,,Michael Kim,,,100,10 percent,,,
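
Row 4 of `df_ner` keeps only `2024-10-10` under `DATE`, while `ner_dfs[4]` shows that the sentence contains two dates: assigning by tag keeps only the last entity per tag. One possible way to keep every entity, sketched here in plain Python, is to join repeated tags into a single cell. `ents_to_row` and the `"; "` separator are illustrative choices, not part of the notebook's code.

```python
from collections import defaultdict


def ents_to_row(doc_ents, sep="; "):
    """Collapse (text, label) pairs into {label: joined texts}, keeping every entity."""
    grouped = defaultdict(list)
    for text, label in doc_ents:
        grouped[label].append(text)
    return {label: sep.join(texts) for label, texts in grouped.items()}


# Entity pairs matching the ner_dfs[4] output shown earlier
doc_ents = [
    ("2025-02-15", "DATE"),
    ("15 percent", "PERCENT"),
    ("$30 billion", "MONEY"),
    ("David Lee", "PERSON"),
    ("2024-10-10", "DATE"),
]
row = ents_to_row(doc_ents)
print(row["DATE"])  # 2025-02-15; 2024-10-10
```

A list of such dictionaries, one per document, can be passed directly to `pd.DataFrame(...)` to get the same one-column-per-tag layout without losing repeated entities.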


In [None]:
%store ner_dfs