# spaCy Pipeline

- **Goal:** Prediction Recognition

- **Purpose:** To extract named entities (NER), part-of-speech (POS) tags, etc.
    1. Use these as additional features to train the model, since feature extraction (e.g., TF-IDF) alone isn't enough

- **Misc:**
    - `%store`: Line magic that stores the variable of interest so we can load it in another notebook (with `%store -r`)

In [1]:
import os
import sys
import spacy

import pandas as pd
# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from pipelines import BasePipeline
from data_processing import DataProcessing
from feature_extraction import SpacyFeatureExtraction

In [2]:
%store -r shuffled_base_df
pd.set_option('max_colwidth', 800)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

python eval() and df.apply()

In [3]:
shuffled_base_df

Unnamed: 0,Base Sentence,Prediction Label,Model Name,Domain,Template Number
0,The music echoed through the empty hall.,0,llama-3.3-70b-versatile,any,0
1,"According to a policy analyst, Emily Chen, from the Congressional Budget Office, on 2024-08-22, the federal budget deficit is expected to decrease beyond $1 trillion in the timeframe of Q4 of 2027.",1,llama-3.3-70b-versatile,policy,4
2,"On 2024-10-15, Dr. David Lee, a health expert, predicts that the obesity rate at the World Health Organization will likely decrease by 3% in Q2 of 2026.",1,llama-3.3-70b-versatile,health,1
3,"According to a senior level person from 3M, on 2024/08/22, the operating income is expected to increase as much as $500 million, reflecting a 20% increase, in the timeframe of Q2 of 2029.",1,llama-3.3-70b-versatile,financial,4
4,"On 2024-10-15, Rachel Patel, a financial analyst, predicts that the operating income at General Motors will likely increase by $5 billion in Q2 of 2026.",1,llama-3.3-70b-versatile,financial,1
5,She practiced yoga on the quiet morning.,0,llama-3.3-70b-versatile,any,0
6,He played with his dog in the backyard.,0,llama-3.3-70b-versatile,any,0
7,The kids played tag in the park playground.,0,llama-3.3-70b-versatile,any,0
8,They went to the movies on a Friday night.,0,llama-3.3-70b-versatile,any,0
9,He ate a healthy breakfast every morning.,0,llama-3.3-70b-versatile,any,0


In [4]:
# initialize the spacy model
spacy_feature_extractor = SpacyFeatureExtraction(shuffled_base_df, 'Base Sentence')
spacy_feature_extractor

<feature_extraction.SpacyFeatureExtraction at 0x138289a10>

## Extract Part-of-Speech (POS) Tags and Named Entity Recognition (NER) Entities at Word Level

In [5]:
only_predictions = DataProcessing.df_to_list(shuffled_base_df, 'Base Sentence')
only_predictions

['The music echoed through the empty hall.',
 'According to a policy analyst, Emily Chen, from the Congressional Budget Office, on 2024-08-22, the federal budget deficit is expected to decrease beyond $1 trillion in the timeframe of Q4 of 2027.',
 'On 2024-10-15, Dr. David Lee, a health expert, predicts that the obesity rate at the World Health Organization will likely decrease by 3% in Q2 of 2026.',
 'According to a senior level person from 3M, on 2024/08/22, the operating income is expected to increase as much as $500 million, reflecting a 20% increase, in the timeframe of Q2 of 2029.',
 'On 2024-10-15, Rachel Patel, a financial analyst, predicts that the operating income at General Motors will likely increase by $5 billion in Q2 of 2026.',
 'She practiced yoga on the quiet morning.',
 'He played with his dog in the backyard.',
 'The kids played tag in the park playground.',
 'They went to the movies on a Friday night.',
 'He ate a healthy breakfast every morning.',
 'According to a 

In [6]:
word_level_disable_components = ["lemmatizer"]
word_level_pos_tags, word_level_pos_mappings, word_level_ner_entities, word_level_ner_mappings = spacy_feature_extractor.extract_entities(only_predictions, word_level_disable_components)
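
The internals of `SpacyFeatureExtraction.extract_entities` are not shown in this notebook. Judging only from the output columns (one word per POS tag, and entity columns such as `CARDINAL_1`, `CARDINAL_2`), a word-level extraction might look roughly like the sketch below. The helper names and the hand-tagged sample are hypothetical; in a real run the `(text, tag)` pairs would come from spaCy's `token.pos_` and `ent.label_`.

```python
from collections import Counter

def pos_by_tag(tagged_tokens):
    """Map each POS tag to a word; a later word with the same tag overwrites an earlier one."""
    return {tag: word for word, tag in tagged_tokens}

def number_entities(entities):
    """Key each entity as LABEL_n, counting per label in order of appearance."""
    seen = Counter()
    numbered = {}
    for text, label in entities:
        seen[label] += 1
        numbered[f"{label}_{seen[label]}"] = text
    return numbered

# Hand-tagged stand-ins for what spaCy would return (assumed, not actual model output)
tokens = [("The", "DET"), ("music", "NOUN"), ("echoed", "VERB"), ("through", "ADP"),
          ("the", "DET"), ("empty", "ADJ"), ("hall", "NOUN"), (".", "PUNCT")]
entities = [("2024-08-22", "DATE"), ("Q4", "CARDINAL"), ("2027", "CARDINAL")]

pos_row = pos_by_tag(tokens)
ner_row = number_entities(entities)
```

Keeping only one word per tag loses information when a sentence has several nouns or verbs, which is a trade-off of this flat, one-column-per-tag layout.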

### Visualize as DF

In [7]:
all_word_level_pos_df = DataProcessing.convert_to_df(word_level_pos_tags, mapping=word_level_pos_mappings)
all_word_level_pos_df

Unnamed: 0,AUX,PUNCT,DET,NUM,VERB,ADP,SYM,PROPN,ADJ,NOUN,PRON,SCONJ,ADV,CCONJ,PART
0,,.,the,,echoed,through,,,empty,hall,,,,,
1,is,.,the,2027.0,decrease,of,$,Q4,federal,timeframe,,,,,to
2,will,.,the,2026.0,decrease,of,-,Q2,,%,,that,likely,,
3,is,.,the,2029.0,reflecting,of,$,Q2,much,timeframe,,,as,,to
4,will,.,the,2026.0,increase,of,$,Q2,financial,income,,that,likely,,
5,,.,the,,practiced,on,,,quiet,morning,She,,,,
6,,.,the,,played,in,,,,backyard,his,,,,
7,,.,the,,played,in,,,,playground,,,,,
8,,.,a,,went,on,,Friday,,night,They,,,,
9,,.,every,,ate,,,,healthy,morning,He,,,,


In [8]:
all_word_level_ner_df = DataProcessing.convert_to_df(word_level_ner_entities, word_level_ner_mappings)
all_word_level_ner_df

Unnamed: 0,PERSON_1,CARDINAL_2,PRODUCT_1,CARDINAL_3,DATE_2,CARDINAL_1,MONEY_2,TIME_1,DATE_3,PERCENT_2,QUANTITY_1,DATE_1,ORG_2,GPE_1,ORG_1,MONEY_1,PERCENT_1
1,Emily Chen,2027,,,,Q4,,,,,,2024-08-22,,,the Congressional Budget Office,$1 trillion,
2,David Lee,,Q2,,,2026,,,,,,2024-10-15,,,the World Health Organization,,3%
3,,,Q2 of 2029,,,2024/08/22,,,,,,,,,3M,as much as $500 million,20%
4,Rachel Patel,,Q2 of 2026,,,,,,,,,2024-10-15,,,General Motors,$5 billion,
5,,,,,,,,the quiet morning,,,,,,,,,
8,,,,,,,,night,,,,Friday,,,,,
10,,,,,2026-02-01,,,,,,20 inches,2024-11-25,,Toronto,the Meteorological Service of Canada,,
14,Daniel Hall,2027,,10 million,,Q4,,,,,,2024-08-24,,the United States,,,25%
16,Lisa Nguyen,,,,,2029,,,,,,2024-10-11,Q3,,the Department of Commerce,$20 billion,
18,Samantha Brown,,,,,2028,,,,,,2024-07-22,,Chicago,the National Weather Service,,10%


In [9]:
word_level_tags_entities = [all_word_level_pos_df, all_word_level_ner_df]
word_level_tags_entities_df = DataProcessing.concat_dfs(word_level_tags_entities, axis=1, ignore_index=False)
word_level_tags_entities_df

Unnamed: 0,AUX,PUNCT,DET,NUM,VERB,ADP,SYM,PROPN,ADJ,NOUN,PRON,SCONJ,ADV,CCONJ,PART,PERSON_1,CARDINAL_2,PRODUCT_1,CARDINAL_3,DATE_2,CARDINAL_1,MONEY_2,TIME_1,DATE_3,PERCENT_2,QUANTITY_1,DATE_1,ORG_2,GPE_1,ORG_1,MONEY_1,PERCENT_1
0,,.,the,,echoed,through,,,empty,hall,,,,,,,,,,,,,,,,,,,,,,
1,is,.,the,2027.0,decrease,of,$,Q4,federal,timeframe,,,,,to,Emily Chen,2027,,,,Q4,,,,,,2024-08-22,,,the Congressional Budget Office,$1 trillion,
2,will,.,the,2026.0,decrease,of,-,Q2,,%,,that,likely,,,David Lee,,Q2,,,2026,,,,,,2024-10-15,,,the World Health Organization,,3%
3,is,.,the,2029.0,reflecting,of,$,Q2,much,timeframe,,,as,,to,,,Q2 of 2029,,,2024/08/22,,,,,,,,,3M,as much as $500 million,20%
4,will,.,the,2026.0,increase,of,$,Q2,financial,income,,that,likely,,,Rachel Patel,,Q2 of 2026,,,,,,,,,2024-10-15,,,General Motors,$5 billion,
5,,.,the,,practiced,on,,,quiet,morning,She,,,,,,,,,,,,the quiet morning,,,,,,,,,
6,,.,the,,played,in,,,,backyard,his,,,,,,,,,,,,,,,,,,,,,
7,,.,the,,played,in,,,,playground,,,,,,,,,,,,,,,,,,,,,,
8,,.,a,,went,on,,Friday,,night,They,,,,,,,,,,,,night,,,,Friday,,,,,
9,,.,every,,ate,,,,healthy,morning,He,,,,,,,,,,,,,,,,,,,,,
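
`DataProcessing.concat_dfs` is a project helper whose code is not shown. With `axis=1` it presumably behaves like a column-wise `pd.concat`, which aligns rows on the index and leaves gaps where one side has no row (sentences 0, 6, 7 and 9 have no entities). A plain-Python sketch of that alignment, with a hypothetical helper name and dicts keyed by row index standing in for DataFrames:

```python
def concat_columns(left, right):
    """Outer-join two {row_index: {column: value}} tables on the row index."""
    left_cols = {col for row in left.values() for col in row}
    right_cols = {col for row in right.values() for col in row}
    merged = {}
    for idx in sorted(left.keys() | right.keys()):
        # Missing rows or columns become None, similar to NaN in pandas
        row = {col: left.get(idx, {}).get(col) for col in sorted(left_cols)}
        row.update({col: right.get(idx, {}).get(col) for col in sorted(right_cols)})
        merged[idx] = row
    return merged

pos_table = {0: {"VERB": "echoed", "NOUN": "hall"},
             1: {"VERB": "decrease", "NOUN": "timeframe"}}
ner_table = {1: {"PERSON_1": "Emily Chen", "DATE_1": "2024-08-22"}}

combined = concat_columns(pos_table, ner_table)
```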


### Encode

In [10]:
encoded_word_level_tags_entities_df = DataProcessing.encode_tags_entities_df(word_level_tags_entities_df, sentence_and_label_df=shuffled_base_df)
encoded_word_level_tags_entities_df

Unnamed: 0,Base Sentence,Prediction Label,AUX,PUNCT,DET,NUM,VERB,ADP,SYM,PROPN,ADJ,NOUN,PRON,SCONJ,ADV,CCONJ,PART,PERSON_1,CARDINAL_2,PRODUCT_1,CARDINAL_3,DATE_2,CARDINAL_1,MONEY_2,TIME_1,DATE_3,PERCENT_2,QUANTITY_1,DATE_1,ORG_2,GPE_1,ORG_1,MONEY_1,PERCENT_1
0,The music echoed through the empty hall.,0,0,1,1,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,"According to a policy analyst, Emily Chen, from the Congressional Budget Office, on 2024-08-22, the federal budget deficit is expected to decrease beyond $1 trillion in the timeframe of Q4 of 2027.",1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,0,0,0,1,0,0,0,0,0,1,0,0,1,1,0
2,"On 2024-10-15, Dr. David Lee, a health expert, predicts that the obesity rate at the World Health Organization will likely decrease by 3% in Q2 of 2026.",1,1,1,1,1,1,1,1,1,0,1,0,1,1,0,0,1,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,1
3,"According to a senior level person from 3M, on 2024/08/22, the operating income is expected to increase as much as $500 million, reflecting a 20% increase, in the timeframe of Q2 of 2029.",1,1,1,1,1,1,1,1,1,1,1,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,1,1
4,"On 2024-10-15, Rachel Patel, a financial analyst, predicts that the operating income at General Motors will likely increase by $5 billion in Q2 of 2026.",1,1,1,1,1,1,1,1,1,1,1,0,1,1,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,1,1,0
5,She practiced yoga on the quiet morning.,0,0,1,1,0,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
6,He played with his dog in the backyard.,0,0,1,1,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,The kids played tag in the park playground.,0,0,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,They went to the movies on a Friday night.,0,0,1,1,0,1,1,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0
9,He ate a healthy breakfast every morning.,0,0,1,1,0,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
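
Comparing the encoded table with the unencoded one, the encoding appears to be a presence flag: 1 where a tag or entity column holds a value for that sentence, 0 where it is empty. The actual `encode_tags_entities_df` implementation is not shown, so the following is only a sketch of that idea with a hypothetical helper name:

```python
def encode_presence(rows, columns):
    """Turn each row into 0/1 flags: 1 if the column has a non-empty value, else 0."""
    encoded = []
    for row in rows:
        encoded.append({col: int(row.get(col) not in (None, "")) for col in columns})
    return encoded

columns = ["AUX", "VERB", "NOUN", "PERSON_1"]
rows = [
    {"VERB": "echoed", "NOUN": "hall"},
    {"AUX": "is", "VERB": "decrease", "NOUN": "timeframe", "PERSON_1": "Emily Chen"},
]

flags = encode_presence(rows, columns)
```

A presence flag discards the word itself, so the model only learns whether a sentence contains, say, a `PERSON` or `MONEY` entity, not which one.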


In [11]:
%store encoded_word_level_tags_entities_df

Stored 'encoded_word_level_tags_entities_df' (DataFrame)
