# Extract Features

1. Read the CSV files and load them as DataFrames
2. Combine the DataFrames
3. Compute semantic cosine similarity
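Step 3 is not implemented in the cells below. As a minimal sketch of the similarity measure itself, assume each sentence has already been turned into a fixed-length embedding vector (the notebook does not say which model would produce it). The vectors here are made-up toy values, and `cosine_similarity` is a hypothetical helper, not part of the project code.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy, made-up embeddings (hypothetical values, not model output)
prediction_vec = [0.2, 0.7, 0.1]
observation_vec = [0.4, 1.4, 0.2]  # same direction, twice the length
unrelated_vec = [0.7, -0.2, 0.0]   # orthogonal to prediction_vec

print(cosine_similarity(prediction_vec, observation_vec))
print(cosine_similarity(prediction_vec, unrelated_vec))
```

Cosine similarity ignores vector length, so the first pair scores close to 1 even though one vector is twice as long; the orthogonal pair scores 0.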

In [1]:
import os
import sys

import pandas as pd

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

import log_files
from log_files import LogData
from data_processing import DataProcessing
from feature_extraction import SpacyFeatureExtraction

In [2]:
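# Show long sentences in full (up to 800 characters per cell)
pd.set_option('display.max_colwidth', 800)
# Do not truncate columns or rows when displaying a DataFrame
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)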

## Predictions

- Use the structure from `1-generate_predictions-all_domains.ipynb`

In [3]:
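# Prediction batches live under ../data/prediction_logs, relative to the notebook
log_file_path = "data/prediction_logs"
# Flag passed to read_data; with True, the log below shows batch_N-prediction folders being loaded
predictions = True
predictions_df = log_files.read_data(notebook_dir, log_file_path, predictions)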
predictions_df.head(7)

Start logging batch
log_directory: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/misc_experiments/../data/prediction_logs
save_batch_directory: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/misc_experiments/../data/prediction_logs/batch_1-prediction
CSV to DF
Load saved csv: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/misc_experiments/../data/prediction_logs/batch_1-prediction/batch_1-from_df.csv
save_batch_directory: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/misc_experiments/../data/prediction_logs/batch_2-prediction
CSV to DF
Load saved csv: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/misc_experiments/../data/prediction_logs/batch_2-prediction/batch_2-from_df.csv


| Unnamed: 0 | Base Sentence | Sentence Label | Domain | Model Name | API Name | Batch ID | Template Number |
|---|---|---|---|---|---|---|---|
| 0 | JPMorgan Chase forecasts that the net profit at Amazon potentially decrease in Q3 of 2027. | 1 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 1 |
| 1 | On August 21, 2024, Bank of America speculates the revenue at Microsoft will likely increase. | 1 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 2 |
| 2 | Citigroup predicts on 2024-08-21, the operating income at Alphabet may rise. | 1 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 3 |
| 3 | According to Goldman Sachs, the research and development expenses at Facebook would fall in 2025. | 1 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 4 |
| 4 | In 21 August 2024, Morgan Stanley envisions that the gross profit at Johnson & Johnson has some probability to remain stable. | 1 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 5 |
| 5 | The stock price at Visa should stay same in Q2 of 2026, according to Wells Fargo. | 1 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 6 |
| 6 | JPMorgan forecasts that the revenue at Microsoft potentially decrease in Q3 of 2027. | 1 | finance | llama-3.3-70b-instruct | NAVI_GATOR | 0 | 1 |
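
The log output above suggests that each batch sits in its own `batch_N-prediction` folder holding a `batch_N-from_df.csv` file. The internals of `log_files.read_data` are not shown here, so the following is only a sketch of one way such a loader could work under that assumption. `load_batches` and the temporary directory layout are hypothetical, and the two demo sentences are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_batches(log_directory, suffix):
    """Load every batch_*-<suffix>/batch_*-from_df.csv under log_directory."""
    pattern = f"batch_*-{suffix}/batch_*-from_df.csv"
    frames = [pd.read_csv(path) for path in sorted(Path(log_directory).glob(pattern))]
    if not frames:
        raise FileNotFoundError(f"no batch CSVs found under {log_directory}")
    return pd.concat(frames, ignore_index=True)

# Build a throwaway directory that mimics the layout seen in the log output.
with tempfile.TemporaryDirectory() as tmp:
    for batch_id, sentence in [(1, "Revenue may rise."), (2, "Profit may fall.")]:
        batch_dir = Path(tmp) / f"batch_{batch_id}-prediction"
        batch_dir.mkdir()
        pd.DataFrame({"Base Sentence": [sentence], "Sentence Label": [1]}).to_csv(
            batch_dir / f"batch_{batch_id}-from_df.csv", index=False
        )
    demo_df = load_batches(tmp, "prediction")

print(demo_df)
```

The `Unnamed: 0` column in the table above is what pandas produces when a CSV saved with its index is read back without `index_col=0`; the sketch writes with `index=False` to avoid it. Note that `sorted` orders paths as strings, so `batch_10` would come before `batch_2`.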


## Observations

In [4]:
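# Observation batches live under ../data/observation_logs, relative to the notebook
log_file_path = "data/observation_logs"
# Flag passed to read_data; with False, the log below shows batch_N-observation folders being loaded
predictions = False
observations_df = log_files.read_data(notebook_dir, log_file_path, predictions)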
observations_df.head(7)

Start logging batch
log_directory: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/misc_experiments/../data/observation_logs
save_batch_directory: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/misc_experiments/../data/observation_logs/batch_1-observation
CSV to DF
Load saved csv: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/misc_experiments/../data/observation_logs/batch_1-observation/batch_1-from_df.csv
save_batch_directory: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/misc_experiments/../data/observation_logs/batch_2-observation
CSV to DF
Load saved csv: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/misc_experiments/../data/observation_logs/batch_2-observation/batch_2-from_df.csv
save_batch_directory: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/m

| Unnamed: 0 | Base Sentence | Sentence Label | Domain | Model Name | API Name | Batch ID | Template Number |
|---|---|---|---|---|---|---|---|
| 0 | The financial analyst at Goldman Sachs observed that the operating income at Tesla had increased in the first quarter of 2024. | 0 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 1 |
| 1 | On 2024-08-20 to 2025-08-20, Morgan Stanley speculates the stock price at Amazon will likely rise. | 0 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 2 |
| 2 | A young investor predicts on 2025-03-15, the S&P 500 index may rise. | 0 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 3 |
| 3 | According to Bank of America, the net profit at Microsoft would fall in the second quarter of 2026. | 0 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 4 |
| 4 | In 2027-01-01 to 2027-12-31, Wells Fargo envisions that the interest rates at the Federal Reserve have some probability to remain stable. | 0 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 5 |
| 5 | The trading volume at Apple should stay same in the fourth quarter of 2025, according to a financial expert at JPMorgan Chase. | 0 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 6 |
| 6 | JPMorgan observed that the net profit at Microsoft had risen in September 2023. | 0 | finance | llama-3.3-70b-instruct | NAVI_GATOR | 0 | 1 |


## Both

- Create a knowledge graph
    - Nodes: words
    - Edges: connections to other words (in the same or a different sentence)
- Look at the code from the Graphbreeding project on the 2019 Mac
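
The graph idea above could be prototyped without any graph library. This is a minimal sketch under one assumed edge definition: two words are connected when they are adjacent in a sentence, and each edge remembers which sentences it came from, so words shared across sentences link them. `build_word_graph` is a hypothetical helper and the tokenization is a plain whitespace split. The two sentences are row 6 of the predictions and observations tables.

```python
from collections import defaultdict

def build_word_graph(sentences):
    """Return {word: {neighbor: set of sentence indices where the edge occurs}}."""
    graph = defaultdict(lambda: defaultdict(set))
    for idx, sentence in enumerate(sentences):
        words = [w.strip(".,").lower() for w in sentence.split()]
        words = [w for w in words if w]
        # Connect each word to the word that follows it, in both directions.
        for left, right in zip(words, words[1:]):
            graph[left][right].add(idx)
            graph[right][left].add(idx)
    return graph

sentences = [
    "JPMorgan forecasts that the revenue at Microsoft potentially decrease in Q3 of 2027.",
    "JPMorgan observed that the net profit at Microsoft had risen in September 2023.",
]
graph = build_word_graph(sentences)

# Edges seen in both sentences link the prediction to the observation.
shared = sorted(
    (a, b) for a in graph for b, seen in graph[a].items() if a < b and seen == {0, 1}
)
print(shared)  # [('at', 'microsoft'), ('that', 'the')]
```

Adjacency is only one possible edge definition; dependency arcs from the spaCy parse would be another.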

In [14]:
df = DataProcessing.concat_dfs([predictions_df, observations_df])
df.head(7)

| Unnamed: 0 | Base Sentence | Sentence Label | Domain | Model Name | API Name | Batch ID | Template Number |
|---|---|---|---|---|---|---|---|
| 0 | JPMorgan Chase forecasts that the net profit at Amazon potentially decrease in Q3 of 2027. | 1 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 1 |
| 1 | On August 21, 2024, Bank of America speculates the revenue at Microsoft will likely increase. | 1 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 2 |
| 2 | Citigroup predicts on 2024-08-21, the operating income at Alphabet may rise. | 1 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 3 |
| 3 | According to Goldman Sachs, the research and development expenses at Facebook would fall in 2025. | 1 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 4 |
| 4 | In 21 August 2024, Morgan Stanley envisions that the gross profit at Johnson & Johnson has some probability to remain stable. | 1 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 5 |
| 5 | The stock price at Visa should stay same in Q2 of 2026, according to Wells Fargo. | 1 | finance | llama-3.1-70b-instruct | NAVI_GATOR | 0 | 6 |
| 6 | JPMorgan forecasts that the revenue at Microsoft potentially decrease in Q3 of 2027. | 1 | finance | llama-3.3-70b-instruct | NAVI_GATOR | 0 | 1 |
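
`df.head(7)` shows only prediction rows (label 1) because the predictions come first in the list, so it does not confirm that the observations were appended. Here is a small sketch of a sanity check, assuming `DataProcessing.concat_dfs` behaves roughly like `pd.concat` (not confirmed by this notebook). The two demo frames are made-up toy data.

```python
import pandas as pd

# Toy stand-ins for predictions_df and observations_df (made-up sentences)
predictions_demo = pd.DataFrame(
    {"Base Sentence": ["Revenue will likely increase."], "Sentence Label": [1]}
)
observations_demo = pd.DataFrame(
    {"Base Sentence": ["Revenue had increased."], "Sentence Label": [0]}
)

# ignore_index=True renumbers rows 0..n-1 instead of repeating each frame's index.
combined = pd.concat([predictions_demo, observations_demo], ignore_index=True)

# Count rows per label to confirm both sources made it into the result.
label_counts = combined["Sentence Label"].value_counts().to_dict()
print(label_counts)
```

On the real data, the same `value_counts()` call on `df["Sentence Label"]` would show how many prediction and observation rows ended up in the combined frame.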


In [6]:
predictions = DataProcessing.df_to_list(predictions_df, "Base Sentence")
observations = DataProcessing.df_to_list(observations_df, "Base Sentence")

In [7]:
predictions

['JPMorgan Chase forecasts that the net profit at Amazon potentially decrease in Q3 of 2027.',
 'On August 21, 2024, Bank of America speculates the revenue at Microsoft will likely increase.',
 'Citigroup predicts on 2024-08-21, the operating income at Alphabet may rise.',
 'According to Goldman Sachs, the research and development expenses at Facebook would fall in 2025.',
 'In 21 August 2024, Morgan Stanley envisions that the gross profit at Johnson & Johnson has some probability to remain stable.',
 'The stock price at Visa should stay same in Q2 of 2026, according to Wells Fargo.',
 'JPMorgan forecasts that the revenue at Microsoft potentially decrease in Q3 of 2027.',
 'On August 25, 2024, to September 25, 2025, Citigroup speculates the net profit at Johnson & Johnson will likely increase.',
 'Bank of America predicts on 2024-08-21, the operating income at Visa may rise.',
 'According to Goldman Sachs, the research and development expenses at Alphabet would fall in 2029 Q2.',
 'In 

In [8]:
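# Placeholder list: the pipeline printed below still has every component enabled
disable_components = [""]
# Run spaCy over the "Base Sentence" column of the predictions
spacy_fe = SpacyFeatureExtraction(predictions_df, "Base Sentence")
all_pos_tags, tags, all_ner_tags, entities = spacy_fe.extract_features(disable_components)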

Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
Spacy Doc (0):  JPMorgan Chase forecasts that the net profit at Amazon potentially decrease in Q3 of 2027.
 POS: JPMorgan---PROPN---JPMorgan---compound---False
 POS: Chase---PROPN---Chase---nsubj---False
 POS: forecasts---VERB---forecast---ROOT---False
 POS: that---SCONJ---that---mark---True
 POS: the---DET---the---det---True
 POS: net---ADJ---net---amod---False
 POS: profit---NOUN---profit---nsubj---False
 POS: at---ADP---at---prep---True
 POS: Amazon---PROPN---Amazon---pobj---False
 POS: potentially---ADV---potentially---advmod---False
 POS: decrease---NOUN---decrease---ccomp---False
 POS: in---ADP---in---prep---True
 POS: Q3---PROPN---Q3---pobj---False
 POS: of---ADP---of---prep---True
 POS: 2027---NUM---2027---pobj---False
 POS: .---PUNCT---.---punct---False

 NER: JPMorgan Chase---ORG---0---14
 NER: Amazon---ORG---48---54
 NER: Q3---GPE---79---81
 NER: 2027---DATE---85---89

Spacy Doc (1):  On August
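
The `NER:` lines above appear to follow the pattern `text---label---start---end`, where the last two fields look like character offsets into the sentence. That reading is an inference from the printed output, not from the `SpacyFeatureExtraction` source. A quick sketch to check it by slicing the sentence with the parsed offsets (`parse_ner_line` is a hypothetical helper):

```python
sentence = (
    "JPMorgan Chase forecasts that the net profit at Amazon "
    "potentially decrease in Q3 of 2027."
)

# NER lines copied from the output for Spacy Doc (0)
ner_lines = [
    "JPMorgan Chase---ORG---0---14",
    "Amazon---ORG---48---54",
    "Q3---GPE---79---81",
    "2027---DATE---85---89",
]

def parse_ner_line(line):
    """Split 'text---label---start---end' into a typed tuple."""
    text, label, start, end = line.rsplit("---", 3)
    return text, label, int(start), int(end)

entities = [parse_ner_line(line) for line in ner_lines]

for text, label, start, end in entities:
    # If the offsets are character positions, slicing gives the entity text back.
    assert sentence[start:end] == text
    print(f"{label:5} {start:>3}-{end:<3} {text}")
```

Two things in the output for this sentence are worth keeping in mind for later steps: `Q3` is labeled `GPE` rather than as part of a date, and `decrease` is tagged `NOUN` although it is used as a verb.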

# Mapping variables: words in sentence(s)

- Sentence/Spacy Doc (0): JPMorgan Chase forecasts that the net profit at Amazon potentially decrease in Q3 of 2027.
    1. $ p_s $: JPMorgan Chase
    2. $ p_t $: Amazon
    3. $ p_d $: Q3 of 2027
    4. $ p_o $: net profit decrease or decrease in net profit

- Sentence/Spacy Doc (1):  On August 21, 2024, Bank of America speculates the revenue at Microsoft will likely increase.
    1. $ p_s $: Bank of America
    2. $ p_t $: Microsoft
    3. $ p_d $: August 21, 2024
    4. $ p_o $: revenue increase or increase in revenue
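
One naive way to fill these variables from the NER output is a position-based rule: first `ORG` is the source, second `ORG` is the target, and `DATE` spans form the date. This is only a sketch of that assumed heuristic (`map_roles` is hypothetical), and it does not attempt $ p_o $.

```python
def map_roles(entities):
    """Map (text, label) pairs, given in sentence order, to p_s / p_t / p_d."""
    orgs = [text for text, label in entities if label == "ORG"]
    dates = [text for text, label in entities if label == "DATE"]
    return {
        "p_s": orgs[0] if len(orgs) > 0 else None,  # source: first organization
        "p_t": orgs[1] if len(orgs) > 1 else None,  # target: second organization
        "p_d": " ".join(dates) if dates else None,  # date: all DATE spans joined
    }

# Entities reported for Spacy Doc (0)
doc0_entities = [
    ("JPMorgan Chase", "ORG"),
    ("Amazon", "ORG"),
    ("Q3", "GPE"),
    ("2027", "DATE"),
]
roles = map_roles(doc0_entities)
print(roles)  # {'p_s': 'JPMorgan Chase', 'p_t': 'Amazon', 'p_d': '2027'}
```

The result shows the limits of the rule: $ p_d $ comes out as `2027` instead of `Q3 of 2027` because `Q3` was labeled `GPE`, and the source/target order would be swapped for templates that put the source last (for example "..., according to Wells Fargo").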


- Would I want to provide mappings when generating the data?


- Fine-tune an event extraction model
- Entities model
- Event: SPO (Something happened to somebody)


 I (source) predict the Pacers (target) to win (attribute) the 2024-2025 (date) NBA Finals.

 Map my work to event extraction
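
As a sketch of how the inline annotations in the example sentence could be turned into an event record, the snippet below pulls out each token that is followed by a role marker. The regular expression and the single-token span are assumptions made for illustration; multi-word spans such as "JPMorgan Chase" would need a richer annotation format.

```python
import re

annotated = (
    "I (source) predict the Pacers (target) to win (attribute) "
    "the 2024-2025 (date) NBA Finals."
)

# One token followed by a parenthesized role name.
ROLE_PATTERN = re.compile(r"(\S+) \((source|target|attribute|date)\)")

# Build {role: token} from the (token, role) pairs found in the sentence.
event = {role: token for token, role in ROLE_PATTERN.findall(annotated)}

# Strip the markers to recover the plain sentence.
plain_sentence = re.sub(r" \((?:source|target|attribute|date)\)", "", annotated)

print(event)
print(plain_sentence)
```

The four roles line up with the mapping variables above: source with $ p_s $, target with $ p_t $, date with $ p_d $, and attribute with the outcome $ p_o $.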

In [None]:
pos_df = DataProcessing.convert_tags_entities_to_dataframe(all_pos_tags, tags)
pos_df.head(1)

Unnamed: 0,NUM_2,ADP_4,DET_4,ADJ_3,PROPN_1,ADP_5,NUM_3,PRON_2,DET_3,CCONJ_1,DET_2,VERB_1,PROPN_6,PRON_1,NOUN_9,PUNCT_1,NOUN_8,CCONJ_2,ADP_2,DET_5,AUX_1,NOUN_4,PUNCT_4,SCONJ_1,ADV_1,PROPN_4,VERB_3,ADP_6,PUNCT_5,PROPN_2,NOUN_6,DET_1,VERB_4,ADJ_1,ADP_3,PROPN_7,PROPN_3,NOUN_3,NOUN_5,NUM_1,PART_1,ADJ_4,PUNCT_2,ADP_1,NUM_4,NOUN_11,PUNCT_3,VERB_2,NOUN_2,NOUN_10,SYM_2,SYM_1,NOUN_1,NOUN_7,PROPN_5,ADJ_2
0,,,,,JPMorgan,,,,,,,forecasts,,,,.,,,in,,,,,that,potentially,Q3,,,,Chase,,the,,net,of,,Amazon,,,2027,,,,at,,,,,decrease,,,,profit,,,
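
The column names in `pos_df` look like each POS tag numbered by its order of occurrence within the sentence (`PROPN_1`, `PROPN_2`, ...). The sketch below reproduces that naming for Spacy Doc (0). It is an assumption about what `convert_tags_entities_to_dataframe` does, based only on the output above, and `number_tags` is a hypothetical helper.

```python
from collections import Counter

def number_tags(tagged_tokens):
    """Turn [(word, tag), ...] into {'TAG_k': word}, counting each tag from 1."""
    seen = Counter()
    row = {}
    for word, tag in tagged_tokens:
        seen[tag] += 1
        row[f"{tag}_{seen[tag]}"] = word
    return row

# (word, POS) pairs copied from the output for Spacy Doc (0)
doc0 = [
    ("JPMorgan", "PROPN"), ("Chase", "PROPN"), ("forecasts", "VERB"),
    ("that", "SCONJ"), ("the", "DET"), ("net", "ADJ"), ("profit", "NOUN"),
    ("at", "ADP"), ("Amazon", "PROPN"), ("potentially", "ADV"),
    ("decrease", "NOUN"), ("in", "ADP"), ("Q3", "PROPN"), ("of", "ADP"),
    ("2027", "NUM"), (".", "PUNCT"),
]
row = number_tags(doc0)
print(row["PROPN_3"], row["ADP_2"], row["NOUN_2"])  # Amazon in decrease
```

These values match row 0 of the table above. Passing a list of such dictionaries to `pd.DataFrame` would give one row per sentence, with missing values wherever a sentence has no token for a column.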


In [11]:
ner_df = DataProcessing.convert_tags_entities_to_dataframe(all_ner_tags, entities)
ner_df.head(1)

Unnamed: 0,DATE_2,PRODUCT_1,FAC_1,DATE_1,PERSON_2,EVENT_1,LOC_1,NORP_1,GPE_2,ORG_2,GPE_1,MONEY_1,ORG_1,CARDINAL_1,TIME_1,PERSON_1,ORG_3
0,,,,2027,,,,,,Amazon,Q3,,JPMorgan Chase,,,,
