# Linguistic Features of Prediction vs Non-Prediction

1. POS (part-of-speech tags)
2. NER (named entity recognition)
3. [spaCy shape](https://spacy.io/usage/linguistic-features/): would each shape variant need to be mapped to the same entity? All records below refer to the entity Detravious Jamari Brinkley. spaCy truncates runs of the same shape character after four, so `Brinkley` becomes `Xxxxx`.
    - D.J. Brinkley | X.X. Xxxxx
    - D.J Brinkley | X.X Xxxxx
    - DJ Brinkley | XX Xxxxx
4. [spaCy lemma](https://spacy.io/usage/linguistic-features/): would an abbreviation map to the proper lemma?
    - Apple | apple is correct
    - AAPL | apple?
5. [spaCy dep](https://spacy.io/usage/linguistic-features/): would this be consistent, given that a dependency label depends on how the sentence is constructed?
6. [spaCy morphology](https://spacy.io/usage/linguistic-features/): would these be consistent enough to define a prediction?
7. [spaCy children, ancestors](https://spacy.io/usage/linguistic-features/): wouldn't be consistent, because they depend on sentence construction.
8. Phonetics
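
The shape question in item 3 can be explored without loading a model. The sketch below is a pure-Python approximation of spaCy's documented word-shape rule (letters become `x` or `X`, digits become `d`, and runs of the same shape character are cut off after four). `word_shape` and `name_key` are hypothetical helpers, not part of spaCy or of this project's `feature_extraction` module, and the real `Token.shape_` should be treated as authoritative.

```python
def word_shape(text: str) -> str:
    """Approximate spaCy's word-shape rule in plain Python (an assumption, not the library call)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation is kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        if run < 4:  # keep at most four of the same shape character in a row
            shape.append(shape_char)
    return "".join(shape)


def name_key(name: str) -> str:
    """Hypothetical normalization: drop periods, collapse whitespace, lowercase."""
    return " ".join(name.replace(".", "").lower().split())


variants = ["D.J. Brinkley", "D.J Brinkley", "DJ Brinkley"]
shapes = [" ".join(word_shape(word) for word in v.split()) for v in variants]
keys = {name_key(v) for v in variants}

print(shapes)
print(keys)
```

The three shapes differ while the normalized key is identical, which suggests shape may work better as a classification feature than as a way to link name variants to one entity.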

In [1]:
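import os
import sys

# Add the parent directory to the import path so the project's local modules resolve
notebook_dir = os.getcwd()
sys.path.append(os.path.join(notebook_dir, '../'))

# Local project modules (not installed packages)
from data_processing import DataProcessing
from feature_extraction import SpacyFeatureExtraction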

In [2]:
predictions_df = DataProcessing.load_multiple_batches(notebook_dir, sep=',')
predictions_df

Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/notebook_experiments/../data/prediction_logs/batch_1-prediction/batch_1-from_df.csv
✓ Loaded batch 1
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/notebook_experiments/../data/prediction_logs/batch_2-prediction/batch_2-from_df.csv
✓ Loaded batch 2
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/notebook_experiments/../data/prediction_logs/batch_3-prediction/batch_3-from_df.csv
✓ Loaded batch 3
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/notebook_experiments/../data/prediction_logs/batch_4-prediction/batch_4-from_df.csv
✓ Loaded batch 4
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/notebook_experiments/../data/prediction_logs/batch_5-prediction/batch_5-from_df.csv
✓ Loaded batch 5
Loading: /Users/detr

Unnamed: 0,Base Sentence,Sentence Label,Domain,Model Name,API Name,Batch ID,Template Number
0,JPMorgan Chase forecasts that the net profit a...,1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,1
1,"On August 21, 2024, Bank of America speculates...",1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,2
2,"Citigroup predicts on 2024-08-21, the operatin...",1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,3
3,"According to Goldman Sachs, the research and d...",1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,4
4,"In 21 August 2024, Morgan Stanley envisions th...",1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,5
...,...,...,...,...,...,...,...
711,"On 2029-09-15, Coach Michael Brown speculates ...",1,sport,llama-3.3-70b-versatile,GROQ_CLOUD,0,2
712,"Coach Sofia Rodriguez predicts on 08/10/2028, ...",1,sport,llama-3.3-70b-versatile,GROQ_CLOUD,0,3
713,"According to Analyst David Lee, the scoring av...",1,sport,llama-3.3-70b-versatile,GROQ_CLOUD,0,4
714,"In 2025-10-20, Analyst Rachel Kim envisions th...",1,sport,llama-3.3-70b-versatile,GROQ_CLOUD,0,5


In [3]:
observations_df = DataProcessing.load_multiple_batches(notebook_dir, sep=',', data_type='observation')
observations_df

Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/notebook_experiments/../data/observation_logs/batch_1-observation/batch_1-from_df.csv
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/notebook_experiments/../data/observation_logs/batch_2-observation/batch_2-from_df.csv
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/notebook_experiments/../data/observation_logs/batch_3-observation/batch_3-from_df.csv
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/notebook_experiments/../data/observation_logs/batch_4-observation/batch_4-from_df.csv
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/notebook_experiments/../data/observation_logs/batch_5-observation/batch_5-from_df.csv
✓ Loaded batch 5
Loading: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/u

Unnamed: 0,Base Sentence,Sentence Label,Domain,Model Name,API Name,Batch ID,Template Number
0,JPMorgan Chase observed that the net profit at...,0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,1
1,"On 08/20/2024 to 08/20/2025, Bank of America s...",0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,2
2,"Citigroup noted on 2024-08-20, the research an...",0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,3
3,"According to a financial analyst, the gross pr...",0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,4
4,"In 2025-08-20, a college student envisioned th...",0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,5
...,...,...,...,...,...,...,...
1107,"On Q2 of 2025, Coach Ryan Thompson observed th...",0,sport,llama-3.1-8b-instant,GROQ_CLOUD,0,2
1108,"George noted that on August 28, 2024, the save...",0,sport,llama-3.1-8b-instant,GROQ_CLOUD,0,3
1109,According to the staff of the Los Angeles Lake...,0,sport,llama-3.1-8b-instant,GROQ_CLOUD,0,4
1110,"In Q4 of 2027, Analyst Daniel Kim recorded tha...",0,sport,llama-3.1-8b-instant,GROQ_CLOUD,0,5


In [4]:
predictions = DataProcessing.df_to_list(predictions_df, 'Base Sentence')
predictions = predictions[:3]

observations = DataProcessing.df_to_list(observations_df, 'Base Sentence')
observations = observations[:3]

sentences = predictions + observations
sentences

['JPMorgan Chase forecasts that the net profit at Amazon potentially decrease in Q3 of 2027.',
 'On August 21, 2024, Bank of America speculates the revenue at Microsoft will likely increase.',
 'Citigroup predicts on 2024-08-21, the operating income at Alphabet may rise.',
 'JPMorgan Chase observed that the net profit at Amazon had remained stable in Q2 2026.',
 'On 08/20/2024 to 08/20/2025, Bank of America speculated the operating income at Microsoft changed.',
 'Citigroup noted on 2024-08-20, the research and development expenses at Alphabet fell.']

In [5]:
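# Note: .loc slicing is label-inclusive, so .loc[:3, :] keeps rows 0-3 (four rows per class),
# one more per class than the [:3] list slices in the previous cell.
sentences_df = DataProcessing.concat_dfs([predictions_df.loc[:3, :], observations_df.loc[:3, :]])
sentences_df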

Unnamed: 0,Base Sentence,Sentence Label,Domain,Model Name,API Name,Batch ID,Template Number
0,JPMorgan Chase forecasts that the net profit a...,1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,1
1,"On August 21, 2024, Bank of America speculates...",1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,2
2,"Citigroup predicts on 2024-08-21, the operatin...",1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,3
3,"According to Goldman Sachs, the research and d...",1,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,4
4,JPMorgan Chase observed that the net profit at...,0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,1
5,"On 08/20/2024 to 08/20/2025, Bank of America s...",0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,2
6,"Citigroup noted on 2024-08-20, the research an...",0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,3
7,"According to a financial analyst, the gross pr...",0,finance,llama-3.1-70b-instruct,NAVI_GATOR,0,4
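
The frame above has eight rows (four per class), while the `sentences` list holds three per class. The difference comes from slicing semantics: `.loc` slices by label and includes the end label, whereas list slicing and `.iloc` exclude the end position. A small sketch with a toy frame (the `toy` data is made up for illustration and stands in for `predictions_df`):

```python
import pandas as pd

# Toy frame standing in for predictions_df; the values are placeholders.
toy = pd.DataFrame({"Base Sentence": [f"sentence {i}" for i in range(6)]})

by_label = toy.loc[:3, :]       # label-based: end label 3 is included
by_position = toy.iloc[:3, :]   # position-based: end position 3 is excluded
as_list = toy["Base Sentence"].tolist()[:3]  # plain list slice, also exclusive

print(len(by_label), len(by_position), len(as_list))
```

If the list and the frame are meant to cover the same sentences, using `.iloc[:3, :]` (or `[:4]` on the lists) would keep them aligned.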


In [6]:
spacy_fe = SpacyFeatureExtraction(sentences_df, 'Base Sentence')
spacy_fe

<feature_extraction.SpacyFeatureExtraction at 0x33649b510>

In [7]:
disable_components = []
spacy_fe.extract_ner_features(disable_components)

8it [00:00, 373.99it/s]

Spacy Doc (0):  JPMorgan Chase forecasts that the net profit at Amazon potentially decrease in Q3 of 2027.
Spacy Doc (1):  On August 21, 2024, Bank of America speculates the revenue at Microsoft will likely increase.
Spacy Doc (2):  Citigroup predicts on 2024-08-21, the operating income at Alphabet may rise.
Spacy Doc (3):  According to Goldman Sachs, the research and development expenses at Facebook would fall in 2025.





Unnamed: 0,Sentence,Term,NER Label,Unique NER Label,Start Char,End Char
0,"(JPMorgan, Chase, forecasts, that, the, net, p...",JPMorgan Chase,ORG,ORG_1,0.0,14.0
1,"(JPMorgan, Chase, forecasts, that, the, net, p...",Amazon,ORG,ORG_2,48.0,54.0
2,"(JPMorgan, Chase, forecasts, that, the, net, p...",Q3,GPE,GPE_1,79.0,81.0
3,"(JPMorgan, Chase, forecasts, that, the, net, p...",2027,DATE,DATE_1,85.0,89.0
4,,,,,,
5,"(On, August, 21, ,, 2024, ,, Bank, of, America...","August 21, 2024",DATE,DATE_2,3.0,18.0
6,"(On, August, 21, ,, 2024, ,, Bank, of, America...",Bank of America,ORG,ORG_3,20.0,35.0
7,"(On, August, 21, ,, 2024, ,, Bank, of, America...",Microsoft,ORG,ORG_4,62.0,71.0
8,,,,,,
9,"(Citigroup, predicts, on, 2024, -, 08, -, 21, ...",Citigroup,ORG,ORG_5,0.0,9.0
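
The `Unique NER Label` column appears to number entities per label with a counter that keeps running across sentences (`ORG_1`, `ORG_2`, then `ORG_3` in the next sentence). How `SpacyFeatureExtraction` builds it is not shown here; the sketch below is one way such IDs could be produced, using the (term, label) pairs from the output above. `unique_labels` is a hypothetical helper.

```python
from collections import defaultdict


def unique_labels(entities):
    """Append a running per-label counter to each (term, label) pair."""
    counts = defaultdict(int)
    rows = []
    for term, label in entities:
        counts[label] += 1
        rows.append((term, label, f"{label}_{counts[label]}"))
    return rows


# (term, label) pairs copied from the table above
entities = [
    ("JPMorgan Chase", "ORG"),
    ("Amazon", "ORG"),
    ("Q3", "GPE"),
    ("2027", "DATE"),
    ("August 21, 2024", "DATE"),
    ("Bank of America", "ORG"),
    ("Microsoft", "ORG"),
    ("Citigroup", "ORG"),
]

rows = unique_labels(entities)
for row in rows:
    print(row)
```

Note that `Q3` is tagged `GPE` in the output, so any feature built on these labels inherits the NER model's mistakes.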


In [8]:
spacy_fe.extract_pos_features(disable_components)

8it [00:00, 457.46it/s]

Spacy Doc (0):  JPMorgan Chase forecasts that the net profit at Amazon potentially decrease in Q3 of 2027.
Spacy Doc (1):  On August 21, 2024, Bank of America speculates the revenue at Microsoft will likely increase.
Spacy Doc (2):  Citigroup predicts on 2024-08-21, the operating income at Alphabet may rise.
Spacy Doc (3):  According to Goldman Sachs, the research and development expenses at Facebook would fall in 2025.





Unnamed: 0,Sentence,Term,POS Label,Unique POS Label,Lemmas,Dependencies,Stop Word
0,"(JPMorgan, Chase, forecasts, that, the, net, p...",JPMorgan,PROPN,PROPN_1,JPMorgan,compound,False
1,"(JPMorgan, Chase, forecasts, that, the, net, p...",Chase,PROPN,PROPN_2,Chase,nsubj,False
2,"(JPMorgan, Chase, forecasts, that, the, net, p...",forecasts,VERB,VERB_1,forecast,ROOT,False
3,"(JPMorgan, Chase, forecasts, that, the, net, p...",that,SCONJ,SCONJ_1,that,mark,True
4,"(JPMorgan, Chase, forecasts, that, the, net, p...",the,DET,DET_1,the,det,True
...,...,...,...,...,...,...,...
137,"(According, to, a, financial, analyst, ,, the,...",in,ADP,ADP_22,in,prep,True
138,"(According, to, a, financial, analyst, ,, the,...",Q3,PROPN,PROPN_24,Q3,pobj,False
139,"(According, to, a, financial, analyst, ,, the,...",2025,NUM,NUM_14,2025,nummod,False
140,"(According, to, a, financial, analyst, ,, the,...",.,PUNCT,PUNCT_15,.,punct,False
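
One way to compare predictions with observations is to look at how often each POS label occurs within each class. The sketch below uses only the rows visible in the table above, so it illustrates the bookkeeping rather than any finding; the pairing of rows with `Sentence Label` values is inferred from the sentences they belong to.

```python
from collections import Counter

# (Sentence Label, POS Label) pairs copied from the visible rows of the table above:
# rows 0-4 come from a prediction (label 1), rows 137-140 from an observation (label 0).
rows = [
    (1, "PROPN"), (1, "PROPN"), (1, "VERB"), (1, "SCONJ"), (1, "DET"),
    (0, "ADP"), (0, "PROPN"), (0, "NUM"), (0, "PUNCT"),
]

# Count POS labels separately for each class
pos_by_class = {1: Counter(), 0: Counter()}
for label, pos in rows:
    pos_by_class[label][pos] += 1

# Relative frequency of each POS label within its class
for label, counts in pos_by_class.items():
    total = sum(counts.values())
    print(label, {pos: round(n / total, 2) for pos, n in counts.items()})
```

Nine rows are far too few to draw conclusions; the same counting applied to the full feature table would be needed before judging whether POS distributions separate the two classes.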
