# spaCy Pipeline

- **Goal:** Prediction Recognition

- **Purpose:** To extract named entities (NER), part-of-speech (POS) tags, etc.
    1. Use these as model training features, since feature extraction (e.g., TF-IDF) alone isn't enough

- **Misc:**
    - `%store`: Line magic that stores a variable of interest so it can be loaded in another notebook with `%store -r`

In [2]:
import os
import sys
import spacy

import pandas as pd
# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from pipelines import BasePipeline
from data_processing import DataProcessing

In [3]:
pd.set_option('max_colwidth', 800)
nlp = spacy.load("en_core_web_sm")
%store -r predictions_df
%store -r non_predictions_df

python eval() and df.apply()

In [4]:
predictions_df.head(3)

Unnamed: 0,Base Sentence,Prediction Label,Model Name,Domain,Template Number
0,"On 2024-10-15, Rachel Patel, a financial analyst, predicts that the operating cash flow at General Motors will likely decrease by $5 billion to $10 billion in Q2 of 2026.",1,llama-3.3-70b-versatile,financial,1
1,"In 2024, Julian Sanchez from Bank of America, forecasts that the stock price will rise from $50 to $75 per share in 2028.",1,llama-3.3-70b-versatile,financial,2
2,"Emily Wilson, a financial expert, predicts on 20/08/2024 that the research and development expenses at Pfizer may stay stable at $15 million in 2029.",1,llama-3.3-70b-versatile,financial,3


In [5]:
base_pipeline = BasePipeline()

cleaned_predictions_df = base_pipeline.clean_predictions(predictions_df)
cleaned_predictions_df

Unnamed: 0,Base Sentence,Prediction Label,Model Name,Domain,Template Number
0,"on 2024-10-15, rachel patel, a financial analyst, predicts that the operating cash flow at general motors will likely decrease by $5 billion to $10 billion in q2 of 2026.",1,llama-3.3-70b-versatile,financial,1
1,"in 2024, julian sanchez from bank of america, forecasts that the stock price will rise from $50 to $75 per share in 2028.",1,llama-3.3-70b-versatile,financial,2
2,"emily wilson, a financial expert, predicts on 20/08/2024 that the research and development expenses at pfizer may stay stable at $15 million in 2029.",1,llama-3.3-70b-versatile,financial,3
3,"according to a senior executive from cisco, on 2024/08/20, the net profit is expected to increase beyond $8 billion in the timeframe of q4 of 2027.",1,llama-3.3-70b-versatile,financial,4
4,"in 2025-02-18, the revenue at visa has a probability of 20 percent to reach $25 billion, which is a 10% increase, as predicted by david lee, a financial reporter, on 15 oct 2024.",1,llama-3.3-70b-versatile,financial,5
5,"on wednesday, november 20, 2024, michael davis, a financial analyst, predicts that the gross profit at 3m will likely decrease by 15% to $12 billion in q1 of 2026.",1,llama-3.3-70b-versatile,financial,1
6,"in q3 of 2024, olivia brown from johnson & johnson, envisions that the operating income will rise from $10 billion to $15 billion in 2028.",1,llama-3.3-70b-versatile,financial,2
7,"kevin white, a financial expert, predicts on 10/10/2024 that the revenue at at&t may increase by $5 billion to $20 billion in 2027.",1,llama-3.3-70b-versatile,financial,3
8,"according to a top executive from intel, on 2024-07-25, the net profit is expected to increase beyond $12 billion in the timeframe of q2 of 2029.",1,llama-3.3-70b-versatile,financial,4
9,"in 2026-08-25, the stock price at mcdonald's is expected to be $200 per share, which is a 25% increase, as predicted by sophia rodriguez, a financial analyst, on 25 july 2024.",1,llama-3.3-70b-versatile,financial,5


In [6]:
# predictions_df

In [7]:
only_predictions = DataProcessing.df_to_list(cleaned_predictions_df, 'Base Sentence')
only_predictions

['on 2024-10-15, rachel patel, a financial analyst, predicts that the operating cash flow at general motors will likely decrease by $5 billion to $10 billion in q2 of 2026.',
 'in 2024, julian sanchez from bank of america, forecasts that the stock price will rise from $50 to $75 per share in 2028.',
 'emily wilson, a financial expert, predicts on 20/08/2024 that the research and development expenses at pfizer may stay stable at $15 million in 2029.',
 'according to a senior executive from cisco, on 2024/08/20, the net profit is expected to increase beyond $8 billion in the timeframe of q4 of 2027.',
 'in 2025-02-18, the revenue at visa has a probability of 20 percent to reach $25 billion, which is a 10% increase, as predicted by david lee, a financial reporter, on 15 oct 2024.',
 'on wednesday, november 20, 2024, michael davis, a financial analyst, predicts that the gross profit at 3m will likely decrease by 15% to $12 billion in q1 of 2026.',
 'in q3 of 2024, olivia brown from johnson

In [8]:
initialize_spacy = DataProcessing.setup_spacy()

### Word

In [9]:
word_level_disable_components = ["parser", "lemmatizer"]
word_level_pos_tags, word_level_pos_mappings, word_level_ner_entities, word_level_ner_mappings = DataProcessing.extract_entities(only_predictions, initialize_spacy, word_level_disable_components)

In [10]:
word_level_pos_tags

{'ADJ',
 'ADP',
 'ADV',
 'AUX',
 'CCONJ',
 'DET',
 'NOUN',
 'NUM',
 'PART',
 'PRON',
 'PROPN',
 'PUNCT',
 'SCONJ',
 'SYM',
 'VERB'}

In [11]:
word_level_pos_mappings

[[('on', 'ADP'),
  ('2024', 'NUM'),
  ('-', 'SYM'),
  ('10', 'NUM'),
  ('-', 'SYM'),
  ('15', 'NUM'),
  (',', 'PUNCT'),
  ('rachel', 'PROPN'),
  ('patel', 'NOUN'),
  (',', 'PUNCT'),
  ('a', 'DET'),
  ('financial', 'ADJ'),
  ('analyst', 'NOUN'),
  (',', 'PUNCT'),
  ('predicts', 'VERB'),
  ('that', 'SCONJ'),
  ('the', 'DET'),
  ('operating', 'VERB'),
  ('cash', 'NOUN'),
  ('flow', 'NOUN'),
  ('at', 'ADP'),
  ('general', 'ADJ'),
  ('motors', 'NOUN'),
  ('will', 'AUX'),
  ('likely', 'ADV'),
  ('decrease', 'VERB'),
  ('by', 'ADP'),
  ('$', 'SYM'),
  ('5', 'NUM'),
  ('billion', 'NUM'),
  ('to', 'PART'),
  ('$', 'SYM'),
  ('10', 'NUM'),
  ('billion', 'NUM'),
  ('in', 'ADP'),
  ('q2', 'NOUN'),
  ('of', 'ADP'),
  ('2026', 'NUM'),
  ('.', 'PUNCT')],
 [('in', 'ADP'),
  ('2024', 'NUM'),
  (',', 'PUNCT'),
  ('julian', 'ADJ'),
  ('sanchez', 'PROPN'),
  ('from', 'ADP'),
  ('bank', 'PROPN'),
  ('of', 'ADP'),
  ('america', 'PROPN'),
  (',', 'PUNCT'),
  ('forecasts', 'VERB'),
  ('that', 'SCONJ'),
  ('th

In [12]:
# ['on 2024-10-15, rachel patel, a financial analyst, predicts that the operating cash flow at general motors will likely decrease by $5 billion to $10 billion in q2 of 2026.',
# ['on 2024-10-15,  patel, a financial analyst, predicts that  operating cash flow at  motors    by 5 billion to $10 billion in   .',

"to" isn't in "on 2024-10-15, rachel patel, a financial analyst, predicts that the operating cash flow at general motors will likely decrease by $5 billion to $10 billion in q2 of 2026."

In [13]:
word_level_pos_df = DataProcessing.convert_tags_entities_to_dataframe(word_level_pos_tags, word_level_pos_mappings)
word_level_pos_df

Unnamed: 0,NUM,PRON,DET,ADV,PROPN,NOUN,VERB,ADP,SCONJ,AUX,CCONJ,ADJ,PUNCT,SYM,PART
0,2026,,the,likely,rachel,q2,decrease,of,that,will,,general,.,$,to
1,2028,,the,,america,share,rise,in,that,will,,julian,.,$,to
2,2029,,the,,wilson,pfizer,stay,in,that,may,and,stable,.,$,
3,2027,,the,,q4,timeframe,increase,of,,is,,net,.,$,to
4,2024,,a,,oct,reporter,predicted,on,,is,,financial,.,$,to
5,2026,,the,likely,q1,%,decrease,of,that,will,,gross,.,$,to
6,2028,,the,,johnson,income,rise,in,that,will,&,olivia,.,$,to
7,2027,,the,,at&t,revenue,increase,in,that,may,,financial,.,$,to
8,2029,,the,,intel,q2,increase,of,,is,,net,.,$,to
9,2024,,a,,july,analyst,predicted,on,,is,,financial,.,$,to


In [14]:
word_level_ner_df = DataProcessing.convert_tags_entities_to_dataframe(word_level_ner_entities, word_level_ner_mappings)
word_level_ner_df

Unnamed: 0,PERCENT_1,MONEY_1,ORG_1,QUANTITY_1,NORP_1,DATE_4,DATE_2,GPE_1,DATE_3,PERSON_2,CARDINAL_1,DATE_1,PERSON_1,PERCENT_2
0,,$5 billion to $10 billion,,,,,2026,,,,,2024-10-15,rachel patel,
1,,$50 to $75,bank of america,,,,2028,,,,,2024,julian sanchez,
2,,$15 million,,,,,2029,,,,,20/08/2024,emily wilson,
3,,$8 billion,,,,,,cisco,,,2024/08/20,2027,,
4,20 percent,$25 billion,,,,,15 oct 2024,,,,,2025-02-18,david lee,10%
5,15%,$12 billion,,,,,2026,,,,3,"wednesday, november 20, 2024",michael davis,
6,,$10 billion to $15 billion,johnson & johnson,,,,2028,,,,,q3 of 2024,,
7,,$5 billion to $20 billion,at&t,,,,2027,,,,,10/10/2024,kevin white,
8,,$12 billion,intel,,,,2029,,,,,2024-07-25,,
9,25%,200,mcdonald's,,,,25 july 2024,,,,,2026-08-25,,
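
The column names above (`DATE_1`, `DATE_2`, `PERSON_1`, ...) suggest that entities are numbered per label in order of appearance within each sentence. A minimal sketch of that numbering, assuming entities arrive as `(text, label)` pairs; `number_entities` is a hypothetical helper and may differ from what `DataProcessing.convert_tags_entities_to_dataframe` actually does.

```python
def number_entities(entities):
    """Map (text, label) pairs to keys such as DATE_1, DATE_2 in order of appearance."""
    label_counts = {}  # how many entities of each label have been seen so far
    numbered = {}
    for text, label in entities:
        label_counts[label] = label_counts.get(label, 0) + 1
        numbered[f"{label}_{label_counts[label]}"] = text
    return numbered


# Entities taken from row 0 of the table above, listed in sentence order
row = number_entities([
    ("2024-10-15", "DATE"),
    ("rachel patel", "PERSON"),
    ("$5 billion to $10 billion", "MONEY"),
    ("2026", "DATE"),
])
print(row)
```

Each such dictionary could become one dataframe row, with missing labels left empty.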


In [15]:
word_level_tags_entities = [word_level_pos_df, word_level_ner_df]
word_level_tags_entities_df = DataProcessing.concat_dfs(word_level_tags_entities, axis=1, ignore_index=False)
word_level_tags_entities_df

Unnamed: 0,NUM,PRON,DET,ADV,PROPN,NOUN,VERB,ADP,SCONJ,AUX,...,NORP_1,DATE_4,DATE_2,GPE_1,DATE_3,PERSON_2,CARDINAL_1,DATE_1,PERSON_1,PERCENT_2
0,2026,,the,likely,rachel,q2,decrease,of,that,will,...,,,2026,,,,,2024-10-15,rachel patel,
1,2028,,the,,america,share,rise,in,that,will,...,,,2028,,,,,2024,julian sanchez,
2,2029,,the,,wilson,pfizer,stay,in,that,may,...,,,2029,,,,,20/08/2024,emily wilson,
3,2027,,the,,q4,timeframe,increase,of,,is,...,,,,cisco,,,2024/08/20,2027,,
4,2024,,a,,oct,reporter,predicted,on,,is,...,,,15 oct 2024,,,,,2025-02-18,david lee,10%
5,2026,,the,likely,q1,%,decrease,of,that,will,...,,,2026,,,,3,"wednesday, november 20, 2024",michael davis,
6,2028,,the,,johnson,income,rise,in,that,will,...,,,2028,,,,,q3 of 2024,,
7,2027,,the,,at&t,revenue,increase,in,that,may,...,,,2027,,,,,10/10/2024,kevin white,
8,2029,,the,,intel,q2,increase,of,,is,...,,,2029,,,,,2024-07-25,,
9,2024,,a,,july,analyst,predicted,on,,is,...,,,25 july 2024,,,,,2026-08-25,,


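- Need to clean. We are not capturing all of the words for either POS or NER: each tag column holds a single token per sentence (row 0 shows only `2026` under `NUM`, although the sentence contains several numbers).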
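
One way to keep every token for a tag, rather than a single token per sentence, is to group tokens by tag. A minimal sketch, assuming the `(token, tag)` tuple format shown in `word_level_pos_mappings`; `group_tokens_by_tag` is a hypothetical helper, not part of `DataProcessing`.

```python
from collections import defaultdict


def group_tokens_by_tag(mapping):
    """Collect every token for each tag, preserving order of appearance."""
    grouped = defaultdict(list)
    for token, tag in mapping:
        grouped[tag].append(token)
    return dict(grouped)


# First tokens of sentence 0, copied from the word_level_pos_mappings output
sample = [
    ("on", "ADP"), ("2024", "NUM"), ("-", "SYM"), ("10", "NUM"),
    ("-", "SYM"), ("15", "NUM"), (",", "PUNCT"), ("rachel", "PROPN"),
]
grouped = group_tokens_by_tag(sample)
print(grouped["NUM"])
```

Storing lists (or joined strings) per tag would keep all three `NUM` tokens instead of only one of them.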

### Sentence

In [16]:
sentence_level_disable_components = ["tok2vec", "parser", "lemmatizer"]
sentence_level_pos_tags, sentence_level_pos_mappings, sentence_level_ner_entities, sentence_level_ner_mappings = DataProcessing.extract_entities(only_predictions, initialize_spacy, sentence_level_disable_components)

In [17]:
sentence_level_pos_tags

{'NOUN'}

In [18]:
sentence_level_pos_df = DataProcessing.convert_tags_entities_to_dataframe(sentence_level_pos_tags, sentence_level_pos_mappings)
sentence_level_pos_df

Unnamed: 0,NOUN
0,.
1,.
2,.
3,.
4,.
5,.
6,.
7,.
8,.
9,.

- With `tok2vec` disabled, every token is tagged `NOUN`, while the NER table below is unchanged. This is consistent with the tagger in `en_core_web_sm` relying on the shared `tok2vec` component and NER using its own, so `tok2vec` likely needs to stay enabled to get usable POS tags.


In [19]:
sentence_level_ner_df = DataProcessing.convert_tags_entities_to_dataframe(sentence_level_ner_entities, sentence_level_ner_mappings)
sentence_level_ner_df

Unnamed: 0,PERCENT_1,MONEY_1,ORG_1,QUANTITY_1,NORP_1,DATE_4,DATE_2,GPE_1,DATE_3,PERSON_2,CARDINAL_1,DATE_1,PERSON_1,PERCENT_2
0,,$5 billion to $10 billion,,,,,2026,,,,,2024-10-15,rachel patel,
1,,$50 to $75,bank of america,,,,2028,,,,,2024,julian sanchez,
2,,$15 million,,,,,2029,,,,,20/08/2024,emily wilson,
3,,$8 billion,,,,,,cisco,,,2024/08/20,2027,,
4,20 percent,$25 billion,,,,,15 oct 2024,,,,,2025-02-18,david lee,10%
5,15%,$12 billion,,,,,2026,,,,3,"wednesday, november 20, 2024",michael davis,
6,,$10 billion to $15 billion,johnson & johnson,,,,2028,,,,,q3 of 2024,,
7,,$5 billion to $20 billion,at&t,,,,2027,,,,,10/10/2024,kevin white,
8,,$12 billion,intel,,,,2029,,,,,2024-07-25,,
9,25%,200,mcdonald's,,,,25 july 2024,,,,,2026-08-25,,


In [20]:
sentence_level_tags_entities = [sentence_level_pos_df, sentence_level_ner_df]
sentence_level_tags_entities_df = DataProcessing.concat_dfs(sentence_level_tags_entities, axis=1, ignore_index=False)
sentence_level_tags_entities_df

Unnamed: 0,NOUN,PERCENT_1,MONEY_1,ORG_1,QUANTITY_1,NORP_1,DATE_4,DATE_2,GPE_1,DATE_3,PERSON_2,CARDINAL_1,DATE_1,PERSON_1,PERCENT_2
0,.,,$5 billion to $10 billion,,,,,2026,,,,,2024-10-15,rachel patel,
1,.,,$50 to $75,bank of america,,,,2028,,,,,2024,julian sanchez,
2,.,,$15 million,,,,,2029,,,,,20/08/2024,emily wilson,
3,.,,$8 billion,,,,,,cisco,,,2024/08/20,2027,,
4,.,20 percent,$25 billion,,,,,15 oct 2024,,,,,2025-02-18,david lee,10%
5,.,15%,$12 billion,,,,,2026,,,,3,"wednesday, november 20, 2024",michael davis,
6,.,,$10 billion to $15 billion,johnson & johnson,,,,2028,,,,,q3 of 2024,,
7,.,,$5 billion to $20 billion,at&t,,,,2027,,,,,10/10/2024,kevin white,
8,.,,$12 billion,intel,,,,2029,,,,,2024-07-25,,
9,.,25%,200,mcdonald's,,,,25 july 2024,,,,,2026-08-25,,


In [30]:
import spacy


nlp = spacy.load("en_core_web_sm")
doc = nlp(only_predictions[0])
word_embeddings = [token.vector for token in doc]
len(word_embeddings)

39

In [None]:
word_embeddings[0].shape  # dimensionality of a single token vector

In [23]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text_data_sentence)
word_embeddings = [token.vector for token in doc]
word_embeddings

ValueError: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'list'>

In [None]:
only_predictions

In [None]:
from spacy.vectors import Vectors
import numpy as np

# empty_vectors = Vectors(shape=(10000, 300))

data = np.zeros((len(only_predictions), 300), dtype='f')
vectors = Vectors(data=data, keys=only_predictions)
vectors

In [None]:
# Convert vectors to a numpy array
vector_data = vectors.data
vector_data

In [None]:

# Create a corresponding array of keys
vector_keys = np.array(only_predictions)
vector_keys

In [None]:
prediction_labels = predictions_df['Prediction Label']
prediction_labels

In [None]:
# Use the prediction labels as the target; vector_keys only holds the raw sentences
X_train, X_test, y_train, y_test = DataProcessing.split_data(vector_data, prediction_labels)
X_train

## Play

- Remove once finalized

In [None]:
pos_col_names = list(pos_df.columns)
for pos_col_name in pos_col_names:
    print(f"{pos_col_name}: {spacy.explain(pos_col_name)}")

In [None]:
list(ner_df.columns)

In [None]:
ner_col_names = list(ner_df.columns)
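for ner_col_name in ner_col_names:
    # Strip the numeric suffix (e.g. "DATE_1" -> "DATE") so spacy.explain can resolve the label
    ner_label = ner_col_name.rsplit('_', 1)[0]
    print(f"{ner_col_name}: {spacy.explain(ner_label)}")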

- Patterns:
    - P1 goes to ?
        - $: SYM
        - 10: NUM
    - P2 goes to ?
        - 2024: NUM
        - -: SYM
        - 10: NUM
        - -: SYM
        - 20: NUM
- Write regex for \$\d+ and \d+-\d+-\d+? -> Manually label?

- Create new function in clean_predictions.py called `remove_symbols`
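
The regex idea above can be tried on the raw text before any tokenization. A minimal sketch using exactly the two patterns from the note (`\$\d+` and `\d+-\d+-\d+`); `label_spans` and the label names are assumptions made for illustration, not an existing function in `clean_predictions.py`.

```python
import re

# Patterns copied from the note above
MONEY_RE = re.compile(r"\$\d+")
DATE_RE = re.compile(r"\d+-\d+-\d+")


def label_spans(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    spans = []
    for label, pattern in (("DATE", DATE_RE), ("MONEY", MONEY_RE)):
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label, match.group()))
    return sorted(spans)


text = ("on 2024-10-15, rachel patel predicts a decrease "
        "by $5 billion to $10 billion in q2 of 2026.")
spans = label_spans(text)
for start, end, label, matched in spans:
    print(label, matched, start, end)
```

Note that `\$\d+` stops at the digits, so the unit (`billion`) and the range (`$5 billion to $10 billion`) would still need manual labeling or a broader pattern.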

In [None]:
texts = ["On [2024-10-20], [Samantha Thompson, a financial analyst] predicts that the [operating cash flow] at [Johnson & Johnson] [will likely] [increase] by [10 percent to $25 billion] in [2026 Q2]"]

texts_2 = "On [2024-10-20], [Samantha Thompson, a financial analyst] predicts that the [operating cash flow] at [Johnson & Johnson] [will likely] [increase] by [$10 percent to $25 billion] in [2026 Q2]"

texts_2 = "On [2024-10-20] $10 percent to $25 billion]"

texts_2_no_brackets = [texts_2.replace('[', '').replace(']', '')]
# ",".join(texts_2_no_brackets)
# print(texts_2_no_brackets)

# text_join = texts_2_no_brackets.split()
# print(text_join)

nlp = spacy.load("en_core_web_sm")
def extract_entities(data: pd.Series, nlp: spacy.Language):
    """
    Print the POS tag of each token using the provided spaCy NLP model.

    Entity collection is still commented out, so both return values are
    currently empty.

    Parameters:
    -----------
    data : `pd.Series`
        A Series (or any iterable of strings) containing textual data.
    nlp : `spacy.Language`
        A spaCy NLP model.

    Returns:
    --------
    tuple
        A tuple containing a list of entities and a set of unique NER tags.
    """
    entities = []
    all_ner_tags = set()
    label_counts = {}

    for doc in nlp.pipe(data, disable=["ner"]):
        # doc_entities = []
        # for ent in doc.ents:
        #     label = ent.label_
        #     text = ent.text
        #     print(label, text)
        print(doc)
        for token in doc:
            print(f"{token.text}: {token.pos_}")

        # entities.append(doc_entities)

    return entities, all_ner_tags

# extract_entities(texts, nlp)
# print()
print(texts_2_no_brackets)
extract_entities(texts_2_no_brackets, nlp)

2024: NUM
-: SYM
10: NUM
-: SYM
20: NUM


Create NER DATE from this
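
A token-level alternative to a regex is to merge the `NUM SYM NUM SYM NUM` run shown above into a single `DATE` item. A minimal sketch over `(text, tag)` pairs; `merge_dash_dates` is a hypothetical helper and only handles dash-separated dates.

```python
def merge_dash_dates(tagged_tokens):
    """Merge NUM - NUM - NUM runs into a single (text, "DATE") pair."""
    merged = []
    i = 0
    while i < len(tagged_tokens):
        window = tagged_tokens[i:i + 5]
        texts = [text for text, _ in window]
        tags = [tag for _, tag in window]
        is_date = (
            tags == ["NUM", "SYM", "NUM", "SYM", "NUM"]
            and texts[1] == texts[3] == "-"
        )
        if is_date:
            merged.append(("".join(texts), "DATE"))
            i += 5  # skip the five tokens that were merged
        else:
            merged.append(tagged_tokens[i])
            i += 1
    return merged


# Tokens and tags copied from the printed output
tokens = [
    ("2024", "NUM"), ("-", "SYM"), ("10", "NUM"), ("-", "SYM"), ("20", "NUM"),
    ("10", "NUM"), ("percent", "NOUN"),
]
result = merge_dash_dates(tokens)
print(result)
```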

10: NUM
percent: NOUN

Johnson: PROPN
&: CCONJ
Johnson: PROPN

In [None]:
Zeke's advisor: Metrics, systems, predictive variables (dep var) and indep var, research questions, sentiment analysis

Zeke's research interests: information fusion