# spaCy Pipeline

- **Goal:** Prediction Recognition

- **Purpose:** To extract named entities (NER), part-of-speech (POS) tags, etc.
    1. Use these as additional features when training the model, since feature extraction alone (e.g., TF-IDF) isn't enough

- **Misc:**
    - `%store`: IPython line magic that stores a variable of interest so it can be loaded in another notebook (with `%store -r`)

In [1]:
import os
import sys
import spacy

import pandas as pd

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from data_processing import DataProcessing

In [2]:
# Show long sentences in full instead of truncating them
pd.set_option('display.max_colwidth', 800)
# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")
%store -r predictions_df
%store -r non_predictions_df

In [3]:
predictions_df

Unnamed: 0,Base Sentence,Prediction Label,Model Name,Domain
0,"T1: On [2024-10-15], [Samantha Thompson, a financial analyst] predicts that the [operating cash flow] at [Johnson & Johnson] [will likely] [increase] by [10 percent to $20 billion] in [2026 Q2].",1,llama-3.3-70b-versatile,financial
1,"T2: In [Q4 of 2024], [Ethan Kim] from [Bloomberg], forecasts that the [stock price] [will] [fall] from [$500 to $300 per share] in [2028 Q1].",1,llama-3.3-70b-versatile,financial
2,"T3: [Ava Morales, a financial expert] predicts on [2024/08/20] that the [research and development expenses] at [Intel] [may] [stay stable] at [$15 million] in [2027 Q3].",1,llama-3.3-70b-versatile,financial
3,"T4: According to a [senior executive] from [Coca-Cola], on [21 Aug 2024], the [net profit] [is expected to] [rise] beyond [$5 billion] in the timeframe of [2029 Q4].",1,llama-3.3-70b-versatile,financial
4,"T5: In [2025-02-15], the [revenue] at [Visa] [is expected] to [increase] by [15 percent to $25 billion] [rise], as predicted by [Liam Chen, a financial reporter] on [2024-08-22].",1,llama-3.3-70b-versatile,financial
5,"T1: On [Wednesday, November 20, 2024], [Julian Lee, an investor] predicts that the [gross profit] at [UnitedHealth Group] [will likely] [decrease] by [5 percent to $10 billion] in [2026 Q4].",1,llama-3.3-70b-versatile,financial
6,"T2: In [2027 Q2], [Sophia Patel] from [Morgan Stanley], envisions that the [operating income] [will] [increase] from [$10 billion to $15 billion] in [2028 Q3].",1,llama-3.3-70b-versatile,financial
7,"T3: [Noah Brooks, a financial analyst] predicts on [2024/10/18] that the [net profit] at [Procter & Gamble] [may] [fall] by [10 percent to $5 billion] in [2027 Q2].",1,llama-3.3-70b-versatile,financial
8,"T4: According to a [top executive] from [AT&T], on [2024-08-25], the [revenue] [is expected to] [stay stable] at [$40 billion] in the timeframe of [2029 Q1].",1,llama-3.3-70b-versatile,financial
9,"T5: In [2026-08-20], the [stock price] at [McDonald's] [has a probability] of [20 percent to reach $250 per share] [rise], as predicted by [Jackson Brown, a financial expert] on [2024-10-12].",1,llama-3.3-70b-versatile,financial


In [4]:
only_predictions = DataProcessing.df_to_list(predictions_df, 'Base Sentence')
only_predictions

['T1: On [2024-10-15], [Samantha Thompson, a financial analyst] predicts that the [operating cash flow] at [Johnson & Johnson] [will likely] [increase] by [10 percent to $20 billion] in [2026 Q2].',
 'T2: In [Q4 of 2024], [Ethan Kim] from [Bloomberg], forecasts that the [stock price] [will] [fall] from [$500 to $300 per share] in [2028 Q1].',
 'T3: [Ava Morales, a financial expert] predicts on [2024/08/20] that the [research and development expenses] at [Intel] [may] [stay stable] at [$15 million] in [2027 Q3].',
 'T4: According to a [senior executive] from [Coca-Cola], on [21 Aug 2024], the [net profit] [is expected to] [rise] beyond [$5 billion] in the timeframe of [2029 Q4].',
 'T5: In [2025-02-15], the [revenue] at [Visa] [is expected] to [increase] by [15 percent to $25 billion] [rise], as predicted by [Liam Chen, a financial reporter] on [2024-08-22].',
 'T1: On [Wednesday, November 20, 2024], [Julian Lee, an investor] predicts that the [gross profit] at [UnitedHealth Group] [wil

In [11]:
ner_dfs = []
# Skip pipeline components that are not needed for entity extraction
for doc in nlp.pipe(only_predictions, disable=["tok2vec", "parser", "attribute_ruler", "lemmatizer"]):
    # One column per entity, named after its label (labels can repeat, e.g. two DATE columns)
    columns = [ent.label_ for ent in doc.ents]
    values = [ent.text for ent in doc.ents]
    # Single-row DataFrame holding this sentence's entities
    ner_df = pd.DataFrame([values], columns=columns)
    ner_dfs.append(ner_df)
ner_dfs

[  CARDINAL        DATE             PERSON                ORG     PERCENT  \
 0       T1  2024-10-15  Samantha Thompson  Johnson & Johnson  10 percent   
 
          MONEY  DATE  
 0  $20 billion  2026  ,
   CARDINAL WORK_OF_ART     PERSON     PERSON MONEY MONEY  DATE
 0       T2  Q4 of 2024  Ethan Kim  Bloomberg   500   300  2028,
         PERSON    CARDINAL    ORG        MONEY  DATE
 0  Ava Morales  2024/08/20  Intel  $15 million  2027,
   ORG        ORG CARDINAL  DATE       MONEY  DATE
 0  T4  Coca-Cola       21  2024  $5 billion  2029,
   CARDINAL        DATE     PERCENT        MONEY     PERSON        DATE
 0       T5  2025-02-15  15 percent  $25 billion  Liam Chen  2024-08-22,
   CARDINAL                          DATE      PERSON                 ORG  \
 0       T1  Wednesday, November 20, 2024  Julian Lee  UnitedHealth Group   
 
      PERCENT        MONEY  DATE  
 0  5 percent  $10 billion  2026  ,
   CARDINAL  DATE        PERSON             ORG                       MONEY  \
 0 
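
In the output above, the template tags are picked up as entities (`T1` as CARDINAL, `T4` as ORG) and `Q4 of 2024` is labelled WORK_OF_ART. The `T<n>:` prefix and the square brackets in the generated sentences may contribute to this. Below is a minimal sketch of stripping them before calling `nlp.pipe`. `clean_sentence` and `TEMPLATE_PREFIX` are hypothetical names, not part of `DataProcessing`, and whether the cleaned text yields better labels has not been checked here.

```python
import re

# Hypothetical pattern: a leading template tag such as "T1: " or "T12: "
TEMPLATE_PREFIX = re.compile(r"^T\d+:\s*")


def clean_sentence(sentence: str) -> str:
    """Drop the leading template tag and the square brackets around slots."""
    without_prefix = TEMPLATE_PREFIX.sub("", sentence)
    return without_prefix.replace("[", "").replace("]", "")


example = (
    "T2: In [Q4 of 2024], [Ethan Kim] from [Bloomberg], forecasts that the "
    "[stock price] [will] [fall] from [$500 to $300 per share] in [2028 Q1]."
)
cleaned = clean_sentence(example)
print(cleaned)
```

The cleaned list could then be passed to the pipeline in place of `only_predictions`, for example `[clean_sentence(s) for s in only_predictions]`.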

In [14]:
ner_dfs[3]

Unnamed: 0,ORG,ORG.1,CARDINAL,DATE,MONEY,DATE.1
0,T4,Coca-Cola,21,2024,$5 billion,2029
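
Because a sentence can contain several entities with the same label, the per-sentence frame ends up with repeated column names (shown here as `ORG`/`ORG.1` and `DATE`/`DATE.1`). One alternative, sketched below under the assumption that a label-to-list mapping is acceptable downstream, is to group entity texts by label. `group_entities` is a hypothetical helper, and the input pairs are copied from the `ner_dfs[3]` output above.

```python
from collections import defaultdict


def group_entities(pairs):
    """Group (label, text) pairs into {label: [text, ...]}, keeping first-seen label order."""
    grouped = defaultdict(list)
    for label, text in pairs:
        grouped[label].append(text)
    return dict(grouped)


# (label, text) pairs taken from the ner_dfs[3] output
t4_entities = [
    ("ORG", "T4"),
    ("ORG", "Coca-Cola"),
    ("CARDINAL", "21"),
    ("DATE", "2024"),
    ("MONEY", "$5 billion"),
    ("DATE", "2029"),
]
row = group_entities(t4_entities)
print(row)
```

With a spaCy `doc`, the pairs would come from `[(ent.label_, ent.text) for ent in doc.ents]`.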


In [None]:
%store ner_dfs