# spaCy Pipeline

- **Goal:** Prediction Recognition

- **Purpose:** To extract named entities (NER), part-of-speech (POS) tags, etc.
    1. Use these as features to train the model, since feature extraction (e.g., TF-IDF) alone isn't enough

- **Misc:**
    - `%store`: Line magic that stores the variable of interest so we can load it in another notebook (with `%store -r`)

In [1]:
import os
import sys
import spacy

import pandas as pd
# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from pipelines import BasePipeline
from data_processing import DataProcessing

In [2]:
pd.set_option('max_colwidth', 800)
nlp = spacy.load("en_core_web_sm")
%store -r predictions_df
%store -r non_predictions_df

python eval() and df.apply()

In [3]:
predictions_df.head(3)

Unnamed: 0,Base Sentence,Prediction Label,Model Name,Domain,Template Number
0,"On 2024-10-15, Rachel Patel, a financial analyst, predicts that the operating cash flow at General Motors will likely decrease by $5 billion to $10 billion in Q2 of 2026.",1,llama-3.3-70b-versatile,financial,1
1,"In 2024, Julian Sanchez from Bank of America, forecasts that the stock price will rise from $50 to $75 per share in 2028.",1,llama-3.3-70b-versatile,financial,2
2,"Emily Wilson, a financial expert, predicts on 20/08/2024 that the research and development expenses at Pfizer may stay stable at $15 million in 2029.",1,llama-3.3-70b-versatile,financial,3


In [4]:
base_pipeline = BasePipeline()

cleaned_predictions_df = base_pipeline.clean_predictions(predictions_df)
cleaned_predictions_df

Unnamed: 0,Base Sentence,Prediction Label,Model Name,Domain,Template Number
0,"on 2024-10-15, rachel patel, a financial analyst, predicts that the operating cash flow at general motors will likely decrease by $5 billion to $10 billion in q2 of 2026.",1,llama-3.3-70b-versatile,financial,1
1,"in 2024, julian sanchez from bank of america, forecasts that the stock price will rise from $50 to $75 per share in 2028.",1,llama-3.3-70b-versatile,financial,2
2,"emily wilson, a financial expert, predicts on 20/08/2024 that the research and development expenses at pfizer may stay stable at $15 million in 2029.",1,llama-3.3-70b-versatile,financial,3
3,"according to a senior executive from cisco, on 2024/08/20, the net profit is expected to increase beyond $8 billion in the timeframe of q4 of 2027.",1,llama-3.3-70b-versatile,financial,4
4,"in 2025-02-18, the revenue at visa has a probability of 20 percent to reach $25 billion, which is a 10% increase, as predicted by david lee, a financial reporter, on 15 oct 2024.",1,llama-3.3-70b-versatile,financial,5
5,"on wednesday, november 20, 2024, michael davis, a financial analyst, predicts that the gross profit at 3m will likely decrease by 15% to $12 billion in q1 of 2026.",1,llama-3.3-70b-versatile,financial,1
6,"in q3 of 2024, olivia brown from johnson & johnson, envisions that the operating income will rise from $10 billion to $15 billion in 2028.",1,llama-3.3-70b-versatile,financial,2
7,"kevin white, a financial expert, predicts on 10/10/2024 that the revenue at at&t may increase by $5 billion to $20 billion in 2027.",1,llama-3.3-70b-versatile,financial,3
8,"according to a top executive from intel, on 2024-07-25, the net profit is expected to increase beyond $12 billion in the timeframe of q2 of 2029.",1,llama-3.3-70b-versatile,financial,4
9,"in 2026-08-25, the stock price at mcdonald's is expected to be $200 per share, which is a 25% increase, as predicted by sophia rodriguez, a financial analyst, on 25 july 2024.",1,llama-3.3-70b-versatile,financial,5


In [5]:
# predictions_df

In [6]:
only_predictions = DataProcessing.df_to_list(cleaned_predictions_df, 'Base Sentence')
only_predictions

['on 2024-10-15, rachel patel, a financial analyst, predicts that the operating cash flow at general motors will likely decrease by $5 billion to $10 billion in q2 of 2026.',
 'in 2024, julian sanchez from bank of america, forecasts that the stock price will rise from $50 to $75 per share in 2028.',
 'emily wilson, a financial expert, predicts on 20/08/2024 that the research and development expenses at pfizer may stay stable at $15 million in 2029.',
 'according to a senior executive from cisco, on 2024/08/20, the net profit is expected to increase beyond $8 billion in the timeframe of q4 of 2027.',
 'in 2025-02-18, the revenue at visa has a probability of 20 percent to reach $25 billion, which is a 10% increase, as predicted by david lee, a financial reporter, on 15 oct 2024.',
 'on wednesday, november 20, 2024, michael davis, a financial analyst, predicts that the gross profit at 3m will likely decrease by 15% to $12 billion in q1 of 2026.',
 'in q3 of 2024, olivia brown from johnson

In [7]:
initialize_spacy = DataProcessing.setup_spacy()

disable_components = ["parser", "lemmatizer"]
tags, all_pos_tags, entities, all_ner_entities = DataProcessing.extract_entities(only_predictions, initialize_spacy, disable_components)


In [8]:
tags

[[('on', 'ADP'),
  ('2024', 'NUM'),
  ('-', 'SYM'),
  ('10', 'NUM'),
  ('-', 'SYM'),
  ('15', 'NUM'),
  (',', 'PUNCT'),
  ('rachel', 'PROPN'),
  ('patel', 'NOUN'),
  (',', 'PUNCT'),
  ('a', 'DET'),
  ('financial', 'ADJ'),
  ('analyst', 'NOUN'),
  (',', 'PUNCT'),
  ('predicts', 'VERB'),
  ('that', 'SCONJ'),
  ('the', 'DET'),
  ('operating', 'VERB'),
  ('cash', 'NOUN'),
  ('flow', 'NOUN'),
  ('at', 'ADP'),
  ('general', 'ADJ'),
  ('motors', 'NOUN'),
  ('will', 'AUX'),
  ('likely', 'ADV'),
  ('decrease', 'VERB'),
  ('by', 'ADP'),
  ('$', 'SYM'),
  ('5', 'NUM'),
  ('billion', 'NUM'),
  ('to', 'PART'),
  ('$', 'SYM'),
  ('10', 'NUM'),
  ('billion', 'NUM'),
  ('in', 'ADP'),
  ('q2', 'NOUN'),
  ('of', 'ADP'),
  ('2026', 'NUM'),
  ('.', 'PUNCT')],
 [('in', 'ADP'),
  ('2024', 'NUM'),
  (',', 'PUNCT'),
  ('julian', 'ADJ'),
  ('sanchez', 'PROPN'),
  ('from', 'ADP'),
  ('bank', 'PROPN'),
  ('of', 'ADP'),
  ('america', 'PROPN'),
  (',', 'PUNCT'),
  ('forecasts', 'VERB'),
  ('that', 'SCONJ'),
  ('th

In [9]:
all_pos_tags

{'ADJ',
 'ADP',
 'ADV',
 'AUX',
 'CCONJ',
 'DET',
 'NOUN',
 'NUM',
 'PART',
 'PRON',
 'PROPN',
 'PUNCT',
 'SCONJ',
 'SYM',
 'VERB'}

In [10]:
entities

[[('2024-10-15', 'DATE_1'),
  ('rachel patel', 'PERSON_1'),
  ('$5 billion to $10 billion', 'MONEY_1'),
  ('2026', 'DATE_2')],
 [('2024', 'DATE_1'),
  ('julian sanchez', 'PERSON_1'),
  ('bank of america', 'ORG_1'),
  ('$50 to $75', 'MONEY_1'),
  ('2028', 'DATE_2')],
 [('emily wilson', 'PERSON_1'),
  ('20/08/2024', 'DATE_1'),
  ('$15 million', 'MONEY_1'),
  ('2029', 'DATE_2')],
 [('cisco', 'GPE_1'),
  ('2024/08/20', 'CARDINAL_1'),
  ('$8 billion', 'MONEY_1'),
  ('2027', 'DATE_1')],
 [('2025-02-18', 'DATE_1'),
  ('20 percent', 'PERCENT_1'),
  ('$25 billion', 'MONEY_1'),
  ('10%', 'PERCENT_2'),
  ('david lee', 'PERSON_1'),
  ('15 oct 2024', 'DATE_2')],
 [('wednesday, november 20, 2024', 'DATE_1'),
  ('michael davis', 'PERSON_1'),
  ('3', 'CARDINAL_1'),
  ('15%', 'PERCENT_1'),
  ('$12 billion', 'MONEY_1'),
  ('2026', 'DATE_2')],
 [('q3 of 2024', 'DATE_1'),
  ('johnson & johnson', 'ORG_1'),
  ('$10 billion to $15 billion', 'MONEY_1'),
  ('2028', 'DATE_2')],
 [('kevin white', 'PERSON_1'),
  

In [11]:
all_ner_entities

{'CARDINAL_1',
 'DATE_1',
 'DATE_2',
 'DATE_3',
 'DATE_4',
 'GPE_1',
 'MONEY_1',
 'NORP_1',
 'ORG_1',
 'PERCENT_1',
 'PERCENT_2',
 'PERSON_1',
 'PERSON_2',
 'QUANTITY_1'}
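
The suffixed labels above (`DATE_1`, `DATE_2`, ...) look like a per-sentence occurrence count appended to the spaCy label. `DataProcessing.extract_entities` isn't shown in this notebook, so the following is only a sketch of how such numbering could be produced; `number_labels` is a hypothetical helper, not the project's function.

```python
from collections import Counter

def number_labels(ents):
    """Append a per-sentence running count to each label (hypothetical helper).

    ents: list of (text, label) pairs for ONE sentence, in document order.
    """
    counts = Counter()
    numbered = []
    for text, label in ents:
        counts[label] += 1  # how many times this label has appeared so far
        numbered.append((text, f"{label}_{counts[label]}"))
    return numbered

# Raw labels as spaCy would report them for the first sentence
raw = [
    ("2024-10-15", "DATE"),
    ("rachel patel", "PERSON"),
    ("$5 billion to $10 billion", "MONEY"),
    ("2026", "DATE"),
]
numbered = number_labels(raw)
print(numbered)
```

With this scheme the second `DATE` in a sentence becomes `DATE_2`, which matches the first row of the `entities` output.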

In [12]:
# ['on 2024-10-15, rachel patel, a financial analyst, predicts that the operating cash flow at general motors will likely decrease by $5 billion to $10 billion in q2 of 2026.',

# ['on 2024-10-15,  patel, a financial analyst, predicts that  operating cash flow at  motors    by 5 billion to $10 billion in   .',


"to" isn't in "on 2024-10-15, rachel patel, a financial analyst, predicts that the operating cash flow at general motors will likely decrease by $5 billion to $10 billion in q2 of 2026."

In [13]:
pos_df = DataProcessing.tags_to_dataframe(tags, all_pos_tags)
pos_df

Unnamed: 0,DET,NUM,ADV,SYM,ADJ,CCONJ,ADP,SCONJ,NOUN,PUNCT,AUX,PRON,VERB,PROPN,PART
0,the,2026,likely,$,general,,of,that,q2,.,will,,decrease,rachel,to
1,the,2028,,$,julian,,in,that,share,.,will,,rise,america,to
2,the,2029,,$,stable,and,in,that,pfizer,.,may,,stay,wilson,
3,the,2027,,$,net,,of,,timeframe,.,is,,increase,q4,to
4,a,2024,,$,financial,,on,,reporter,.,is,,predicted,oct,to
5,the,2026,likely,$,gross,,of,that,%,.,will,,decrease,q1,to
6,the,2028,,$,olivia,&,in,that,income,.,will,,rise,johnson,to
7,the,2027,,$,financial,,in,that,revenue,.,may,,increase,at&t,to
8,the,2029,,$,net,,of,,q2,.,is,,increase,intel,to
9,a,2024,,$,financial,,on,,analyst,.,is,,predicted,july,to


In [14]:
ner_df = DataProcessing.entities_to_dataframe(entities, all_ner_entities)
ner_df

Unnamed: 0,CARDINAL_1,NORP_1,DATE_1,DATE_3,DATE_2,PERSON_1,ORG_1,QUANTITY_1,DATE_4,PERCENT_1,GPE_1,PERCENT_2,PERSON_2,MONEY_1
0,,,2024-10-15,,2026,rachel patel,,,,,,,,$5 billion to $10 billion
1,,,2024,,2028,julian sanchez,bank of america,,,,,,,$50 to $75
2,,,20/08/2024,,2029,emily wilson,,,,,,,,$15 million
3,2024/08/20,,2027,,,,,,,,cisco,,,$8 billion
4,,,2025-02-18,,15 oct 2024,david lee,,,,20 percent,,10%,,$25 billion
5,3,,"wednesday, november 20, 2024",,2026,michael davis,,,,15%,,,,$12 billion
6,,,q3 of 2024,,2028,,johnson & johnson,,,,,,,$10 billion to $15 billion
7,,,10/10/2024,,2027,kevin white,at&t,,,,,,,$5 billion to $20 billion
8,,,2024-07-25,,2029,,intel,,,,,,,$12 billion
9,,,2026-08-25,,25 july 2024,,mcdonald's,,,25%,,,,200


- Need to clean. I don't think we're capturing all of the words for both POS and NER.
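
For POS, the `pos_df` output is consistent with only one token surviving per tag per sentence (row 0 shows `2026` under `NUM`, but not `5` or `10`). `DataProcessing.tags_to_dataframe` isn't shown here, so this is a guess at the cause. A minimal sketch of the difference between keeping the last token and keeping all tokens per tag, using hypothetical helper names:

```python
from collections import defaultdict

def last_token_per_tag(tagged):
    """One token per tag: later tokens overwrite earlier ones."""
    row = {}
    for token, tag in tagged:
        row[tag] = token
    return row

def all_tokens_per_tag(tagged):
    """Every token per tag, kept in sentence order."""
    row = defaultdict(list)
    for token, tag in tagged:
        row[tag].append(token)
    return dict(row)

# Tail of the first sentence, copied from the `tags` output
sample = [
    ("by", "ADP"), ("$", "SYM"), ("5", "NUM"), ("billion", "NUM"),
    ("to", "PART"), ("$", "SYM"), ("10", "NUM"), ("billion", "NUM"),
    ("in", "ADP"), ("q2", "NOUN"), ("of", "ADP"), ("2026", "NUM"),
]

print(last_token_per_tag(sample)["NUM"])   # only the final NUM token
print(all_tokens_per_tag(sample)["NUM"])   # every NUM token
```

Storing lists (or a joined string) per cell would keep every word, at the cost of cells that are no longer single tokens.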

In [15]:
sentence_level_tags_entities = [pos_df, ner_df]
sentence_level_tags_entities_df = DataProcessing.concat_dfs(sentence_level_tags_entities, axis=1, ignore_index=False)
sentence_level_tags_entities_df

Unnamed: 0,DET,NUM,ADV,SYM,ADJ,CCONJ,ADP,SCONJ,NOUN,PUNCT,...,DATE_2,PERSON_1,ORG_1,QUANTITY_1,DATE_4,PERCENT_1,GPE_1,PERCENT_2,PERSON_2,MONEY_1
0,the,2026,likely,$,general,,of,that,q2,.,...,2026,rachel patel,,,,,,,,$5 billion to $10 billion
1,the,2028,,$,julian,,in,that,share,.,...,2028,julian sanchez,bank of america,,,,,,,$50 to $75
2,the,2029,,$,stable,and,in,that,pfizer,.,...,2029,emily wilson,,,,,,,,$15 million
3,the,2027,,$,net,,of,,timeframe,.,...,,,,,,,cisco,,,$8 billion
4,a,2024,,$,financial,,on,,reporter,.,...,15 oct 2024,david lee,,,,20 percent,,10%,,$25 billion
5,the,2026,likely,$,gross,,of,that,%,.,...,2026,michael davis,,,,15%,,,,$12 billion
6,the,2028,,$,olivia,&,in,that,income,.,...,2028,,johnson & johnson,,,,,,,$10 billion to $15 billion
7,the,2027,,$,financial,,in,that,revenue,.,...,2027,kevin white,at&t,,,,,,,$5 billion to $20 billion
8,the,2029,,$,net,,of,,q2,.,...,2029,,intel,,,,,,,$12 billion
9,a,2024,,$,financial,,on,,analyst,.,...,25 july 2024,,mcdonald's,,,25%,,,,200


## Play

- Remove once finalized

In [None]:
pos_col_names = list(pos_df.columns)
for pos_col_name in pos_col_names:
    # Print the actual tag name, not the literal string "pos_col_name"
    print(f"{pos_col_name}: {spacy.explain(pos_col_name)}")

In [None]:
list(ner_df.columns)

In [None]:
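ner_col_names = list(ner_df.columns)
for ner_col_name in ner_col_names:
    # Strip the numeric suffix (e.g. DATE_1 -> DATE); spacy.explain returns None for unknown labels
    base_label = ner_col_name.rsplit('_', 1)[0]
    print(f"{ner_col_name}: {spacy.explain(base_label)}")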

- Patterns:
    - P1 goes to ?
        - $: SYM
        - 10: NUM
    - P2 goes to ?
        - 2024: NUM
        - -: SYM
        - 10: NUM
        - -: SYM
        - 20: NUM
- Write regex for `\$\d+` and `\d+-\d+-\d+`? -> Manually label?
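
The two regex patterns from the note above can be tried directly on one of the cleaned sentences. This is a minimal sketch using only the standard `re` module; the names `MONEY_RE` and `DATE_RE` are made up for illustration.

```python
import re

# Patterns taken verbatim from the note above
MONEY_RE = re.compile(r"\$\d+")        # a dollar sign followed by digits
DATE_RE = re.compile(r"\d+-\d+-\d+")   # digit groups joined by hyphens

text = (
    "on 2024-10-15, rachel patel, a financial analyst, predicts that the "
    "operating cash flow at general motors will likely decrease by "
    "$5 billion to $10 billion in q2 of 2026."
)

money = MONEY_RE.findall(text)
dates = DATE_RE.findall(text)
print(money)  # ['$5', '$10']
print(dates)  # ['2024-10-15']
```

Note that `\$\d+` stops at the digits, so the unit ("billion") is not part of the match, and `\d+-\d+-\d+` won't match slash-separated dates such as `20/08/2024`.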

- Create new function in clean_predictions.py called `remove_symbols`

In [None]:
texts = ["On [2024-10-20], [Samantha Thompson, a financial analyst] predicts that the [operating cash flow] at [Johnson & Johnson] [will likely] [increase] by [10 percent to $25 billion] in [2026 Q2]"]

texts_2 = "On [2024-10-20], [Samantha Thompson, a financial analyst] predicts that the [operating cash flow] at [Johnson & Johnson] [will likely] [increase] by [$10 percent to $25 billion] in [2026 Q2]"

texts_2 = "On [2024-10-20] $10 percent to $25 billion]"

texts_2_no_brackets = [texts_2.replace('[', '').replace(']', '')]
# ",".join(texts_2_no_brackets)
# print(texts_2_no_brackets)

# text_join = texts_2_no_brackets.split()
# print(text_join)

nlp = spacy.load("en_core_web_sm")
def extract_entities(data: pd.Series, nlp: spacy.Language):
    """
    Print each token's POS tag using the provided spaCy NLP model.

    NER is disabled in this exploratory version, so no entities are collected yet.

    Parameters:
    -----------
    data : `pd.Series`
        A Series (or any iterable of strings) containing textual data.
    nlp : `spacy.Language`
        A spaCy NLP model.

    Returns:
    --------
    tuple
        A tuple containing a list of entities and a set of unique NER tags (both currently empty).
    """
    entities = []
    all_ner_tags = set()
    label_counts = {}

    for doc in nlp.pipe(data, disable=["ner"]):
        # doc_entities = []
        # for ent in doc.ents:
        #     label = ent.label_
        #     text = ent.text
        #     print(label, text)
        print(doc)
        for token in doc:
            print(f"{token.text}: {token.pos_}")

        # entities.append(doc_entities)

    return entities, all_ner_tags

# extract_entities(texts, nlp)
# print()
print(texts_2_no_brackets)
extract_entities(texts_2_no_brackets, nlp)

2024: NUM
-: SYM
10: NUM
-: SYM
20: NUM


Create NER DATE from this
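
One way to build a `DATE` from the `NUM SYM NUM SYM NUM` run above is to merge the tokens after tagging. This is a plain-Python sketch over `(token, tag)` pairs in the same shape as `tags`; `merge_date_tokens` is a hypothetical helper, and it only covers hyphen-separated dates.

```python
DATE_TAGS = ["NUM", "SYM", "NUM", "SYM", "NUM"]

def merge_date_tokens(tagged):
    """Collapse NUM - NUM - NUM runs into a single (text, 'DATE') pair."""
    merged = []
    i = 0
    while i < len(tagged):
        window = tagged[i:i + 5]
        is_date = (
            len(window) == 5
            and [tag for _, tag in window] == DATE_TAGS
            and window[1][0] == "-"
            and window[3][0] == "-"
        )
        if is_date:
            # Join the five tokens back into one string, e.g. "2024-10-20"
            merged.append(("".join(token for token, _ in window), "DATE"))
            i += 5
        else:
            merged.append(tagged[i])
            i += 1
    return merged

sample = [
    ("on", "ADP"), ("2024", "NUM"), ("-", "SYM"), ("10", "NUM"),
    ("-", "SYM"), ("20", "NUM"), (",", "PUNCT"),
]
merged = merge_date_tokens(sample)
print(merged)  # [('on', 'ADP'), ('2024-10-20', 'DATE'), (',', 'PUNCT')]
```

Sequences like `$ 5 billion` are left untouched because their tags don't match the five-token date pattern.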

10: NUM
percent: NOUN

Johnson: PROPN
&: CCONJ
Johnson: PROPN