# spaCy Pipeline

- **Goal:** Prediction Recognition

- **Purpose:** To extract linguistic features such as named entities (NER) and part-of-speech (POS) tags.
    1. Use these as additional inputs when training the model, since feature extraction with TF-IDF alone isn't enough

- **Misc:**
    - `%store`: IPython line magic that stores a variable of interest so it can be loaded in another notebook with `%store -r`
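
As a rough illustration of the purpose above, the sketch below turns a list of `(text, label)` entity pairs into a fixed-length vector of label counts that could sit alongside TF-IDF features. It is only a sketch under stated assumptions: `entity_count_features`, `ENTITY_LABELS`, and the hard-coded `sample_entities` are hypothetical names and data for illustration, not part of this notebook or of spaCy.

```python
from collections import Counter

# Entity labels to track; this fixed vocabulary is an assumption for illustration.
ENTITY_LABELS = ["DATE", "PERSON", "ORG", "PERCENT", "MONEY", "CARDINAL"]


def entity_count_features(entities, labels=ENTITY_LABELS):
    """Turn (text, label) entity pairs into a fixed-length count vector."""
    counts = Counter(label for _, label in entities)
    return [counts.get(label, 0) for label in labels]


# Hypothetical input, shaped like [(ent.text, ent.label_) for ent in doc.ents].
sample_entities = [
    ("2024-10-15", "DATE"),
    ("Olivia Brown", "PERSON"),
    ("General Motors", "ORG"),
    ("10 percent", "PERCENT"),
    ("$5 billion", "MONEY"),
    ("Q2 of 2026", "DATE"),
]

features = entity_count_features(sample_entities)
print(dict(zip(ENTITY_LABELS, features)))
```

A fixed label order keeps every sentence's vector the same length, which makes it straightforward to concatenate with other per-sentence features.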

In [1]:
import os
import sys
import spacy

import pandas as pd
# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from data_processing import DataProcessing

In [2]:
%store -r predictions_df
%store -r non_predictions_df

In [3]:
predictions_df

Unnamed: 0,Base Sentences,Prediction Label,Model Name,Domain
0,"On 2024-10-15, Olivia Brown, a financial analy...",1,llama-3.3-70b-versatile,financial
1,"In 2024/08/20, Ethan Kim from Goldman Sachs, f...",1,llama-3.3-70b-versatile,financial
2,"According to a senior executive from Boeing, o...",1,llama-3.3-70b-versatile,financial
3,"In 2025-02-18, the revenue at Walmart has a hi...",1,llama-3.3-70b-versatile,financial
4,"On Wednesday, November 20, 2024, Ava Morales, ...",1,llama-3.3-70b-versatile,financial
5,"In Q3 of 2029, the net profit at Intel is expe...",1,llama-3.3-70b-versatile,financial
6,"On 21 Aug 2024, Jackson Hall, a financial seni...",1,llama-3.3-70b-versatile,financial
7,"According to a financial expert from IBM, on 2...",1,llama-3.3-70b-versatile,financial
8,"In 2026-11-15, the revenue at Toyota has a low...",1,llama-3.3-70b-versatile,financial
9,"On 2024/11/12, Michael Davis, a financial exec...",1,llama-3.3-70b-versatile,financial


In [5]:
only_predictions = DataProcessing.df_to_list(predictions_df, 'Base Sentences')
only_predictions

['On 2024-10-15, Olivia Brown, a financial analyst, predicts that the operating income at General Motors will likely rise by 10 percent to $5 billion in Q2 of 2026.',
 'In 2024/08/20, Ethan Kim from Goldman Sachs, forecasts that the stock price will increase from $500 to $700 per share in 2028.',
 'According to a senior executive from Boeing, on August 22, 2024, the research and development expenses are expected to stay stable at $15 million in the timeframe of Q4 of 2027.',
 'In 2025-02-18, the revenue at Walmart has a high chance of reaching $600 billion, which is a 20 percent increase, as predicted by David Lee, a financial expert, on 2024/10/12.',
 'On Wednesday, November 20, 2024, Ava Morales, a financial reporter, predicts that the gross profit at Cisco Systems will decrease by 5 percent to $10 billion in Q1 of 2026.',
 'In Q3 of 2029, the net profit at Intel is expected to rise by 15 percent to $20 billion, as predicted by a financial top executive on 2024-08-25.',
 'On 21 Aug 2

In [7]:
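# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")
# Stream the sentences through the pipeline, disabling components that are not needed for entity extraction
for doc in nlp.pipe(only_predictions, disable=["tok2vec", "parser", "attribute_ruler", "lemmatizer"]):
    # Print the (text, label) pair of every named entity found in the doc
    print([(ent.text, ent.label_) for ent in doc.ents])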

[('2024-10-15', 'DATE'), ('Olivia Brown', 'PERSON'), ('General Motors', 'ORG'), ('10 percent', 'PERCENT'), ('$5 billion', 'MONEY'), ('Q2 of 2026', 'DATE')]
[('2024/08/20', 'CARDINAL'), ('Ethan Kim', 'PERSON'), ('Goldman Sachs', 'ORG'), ('500', 'MONEY'), ('700', 'MONEY'), ('2028', 'DATE')]
[('Boeing', 'ORG'), ('August 22, 2024', 'DATE'), ('$15 million', 'MONEY'), ('Q4', 'CARDINAL'), ('2027', 'DATE')]
[('2025-02-18', 'DATE'), ('Walmart', 'ORG'), ('$600 billion', 'MONEY'), ('20 percent', 'PERCENT'), ('David Lee', 'PERSON'), ('2024/10/12', 'CARDINAL')]
[('Wednesday, November 20, 2024', 'DATE'), ('Ava Morales', 'PERSON'), ('Cisco Systems', 'ORG'), ('5 percent', 'PERCENT'), ('$10 billion', 'MONEY'), ('Q1 of 2026', 'DATE')]
[('Q3 of 2029', 'DATE'), ('Intel', 'ORG'), ('15 percent', 'PERCENT'), ('$20 billion', 'MONEY'), ('2024-08-25', 'DATE')]
[('21 Aug 2024', 'DATE'), ('Jackson Hall', 'FAC'), ('Coca-Cola', 'ORG'), ('10 percent', 'PERCENT'), ('$15 billion', 'MONEY'), ('2027', 'DATE')]
[('IBM', 