# Entity Resolution: Source and Target

- Once the source is identified, can we link mentions of the same source to one another, then group by source to see every prediction each source made?

> Builds on `notebook_experiments/sequence_labelling-prediction_sentence.ipynb`, because that is where we extract the prediction properties: source, target, date, and outcome.
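
The grouping idea in the question above can be sketched before any entity resolution is attempted. This is a minimal sketch, assuming that case-folding and dropping a leading article is enough to make two mentions comparable; `normalize_source` and `group_by_source` are hypothetical helpers, and the sample rows are illustrative only (the article-less "Brookings Institution" and the lowercase "morgan stanley" variants, and most source/target pairings, are made up for the example).

```python
from collections import defaultdict


def normalize_source(name: str) -> str:
    """Case-fold, collapse whitespace, and drop a leading article."""
    tokens = name.casefold().split()
    if tokens and tokens[0] in {"the", "a", "an"}:
        tokens = tokens[1:]
    return " ".join(tokens)


def group_by_source(rows):
    """Map each normalized source name to the list of its prediction rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[normalize_source(row["Source"])].append(row)
    return dict(groups)


# Illustrative rows only, not the notebook's actual data.
rows = [
    {"Source": "Morgan Stanley", "Target": "Amazon"},
    {"Source": "morgan stanley", "Target": "Microsoft"},
    {"Source": "The Brookings Institution", "Target": "the stock market volatility"},
    {"Source": "Brookings Institution", "Target": "health screening participation"},
    {"Source": "Professor Thompson", "Target": "Harvard University"},
]

grouped = group_by_source(rows)
for source, predictions in grouped.items():
    print(source, [p["Target"] for p in predictions])
```

Exact matching on a normalized string only catches trivial variants; titles such as "Economist Dr." still keep otherwise identical names apart, which is what the embedding, NER, and phonetic features in the rest of the notebook are meant to address.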

In [1]:
import os
import sys

import pandas as pd

from tqdm import tqdm

from pyjedai.datamodel import Data
from pyjedai.joins import EJoin, TopKJoin

notebook_dir = os.getcwd()

sys.path.append(os.path.join(notebook_dir, '../'))

from data_processing import DataProcessing
from feature_extraction import SpacyFeatureExtraction



In [2]:
base_data_path = DataProcessing.load_base_data_path(notebook_dir)

In [3]:
extract_prediction_properties_path = "extract_prediction_properties/"
extract_prediction_properties_full_path = os.path.join(base_data_path, extract_prediction_properties_path, 'extracted_pps-v1.csv')
df = DataProcessing.load_from_file(extract_prediction_properties_full_path, 'csv', sep=',')
df

Unnamed: 0,Prediction Sentence,Raw Response,Model Name,No Property,Source,Target,Date,Outcome
0,Professor Thompson forecasts that the graduati...,"{0: [""forecasts"", ""that"", ""the"", ""graduation"",...",openai/gpt-oss-120b,"forecasts, that, the, graduation, rate, at, wi...",Professor Thompson,Harvard University,2027,drop
1,Economist Dr. Sarah Lee predicts on 12/31/2027...,"{0: [""predicts"", ""on"", ""the""], 1: [""Economist ...",openai/gpt-oss-120b,"predicts, on, the",Economist Dr. Sarah Lee,consumer confidence index,12/31/2027,may rise
2,"According to a fitness expert, the nutritional...","{0: [""According"", ""to"", ""a"", ""the"", ""in""], 1: ...",openai/gpt-oss-120b,"According, to, a, the, in",fitness expert,nutritional intake at community centers,21 August 2024,would fall
3,The nutritional awareness in Europe should sta...,"{""0"": [""The"", ""in"", ""should"", ""the"", ""in"", ""ac...",openai/gpt-oss-120b,"The, in, should, the, in, according, to",a research report,nutritional awareness in Europe,2028,stay the same
4,"Coach Sofia Rodriguez predicts on 08/10/2028, ...","{0: [""predicts"", ""on"", ""the"", ""at""], 1: [""Coac...",openai/gpt-oss-120b,"predicts, on, the, at",Coach Sofia Rodriguez,Boston Celtics,08/10/2028,win ratio may rise
5,Analyst Kevin Jackson predicts on 21 August 2...,"{0: [""predicts"", ""on"", ""the"", ""at"", ""the""], 1:...",openai/gpt-oss-120b,"predicts, on, the, at, the",Analyst Kevin Jackson,New England Patriots,21 August 2024,score average may rise
6,The National Oceanic and Atmospheric Administr...,"{0: [""The"", ""forecasts"", ""that"", ""the"", ""may"",...",openai/gpt-oss-120b,"The, forecasts, that, the, may, in",National Oceanic and Atmospheric Administration,precipitation levels at New Orleans,2024-08-21,decrease
7,The sports analyst from ESPN anticipates that ...,"{0: [""The"", ""anticipates"", ""that"", ""the"", ""wil...",openai/gpt-oss-120b,"The, anticipates, that, the, will, in, of",sports analyst from ESPN,scoring average at the Los Angeles Lakers,2024 Q3,potentially decrease
8,The stock price at Amazon should stay same in ...,"{0: [""The"", ""at"", ""should"", ""in"", ""according"",...",openai/gpt-oss-120b,"The, at, should, in, according, to",Morgan Stanley,Amazon,2024/08/21,"stock price, stay same"
9,The transaction will have a positive impact of...,"{\n ""0"": [""The"", ""transaction"", ""will"", ""have...",openai/gpt-oss-120b,"The, transaction, will, have, a, on, ,, which,...",Ruukki,earnings,fourth quarter of this year,positive impact of around EUR2m


In [4]:
df['Source'].value_counts()

Source
Morgan Stanley                                     2
Professor Thompson                                 1
policy analyst David Chen                          1
Financial analyst Olivia Brown                     1
sports analyst Sarah Johnson                       1
The Centers for Disease Control and Prevention     1
Financial Analyst James Lee                        1
Coach Brad Stevens                                 1
college student Samantha Brown                     1
Ms. Sophia Chen                                    1
Policy analyst David Lee                           1
International Energy Agency                        1
Emily Chen                                         1
The Brookings Institution                          1
American Heart Association                         1
JP Morgan Chase                                    1
Marketing Director Sarah Lee                       1
Economist Dr. Sarah Lee                            1
sports expert Lisa Kim                 

In [5]:
df['Target'].value_counts()

Target
Amazon                                                              3
Boston Celtics                                                      2
Harvard University                                                  1
approval ratings for the new president                              1
Microsoft Corporation                                               1
Los Angeles Lakers                                                  1
health screening participation                                      1
the stock market volatility                                         1
the number of affordable housing units in urban areas               1
FC Barcelona                                                        1
renewable energy investments at Tesla                               1
investment in renewable energy projects at emerging markets         1
net profit at Amazon                                                1
Microsoft                                                           1
the consumer 

In [6]:
source_entity_sfe = SpacyFeatureExtraction(df, 'Source')
target_entity_sfe = SpacyFeatureExtraction(df, 'Target')

In [7]:
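# Both extractors were built from the same `df`. The first assignment is overwritten,
# and the frame returned by the second call carries both embedding columns (see output).
entity_spacy_embeddings_df = source_entity_sfe.sentence_embeddings_extraction(attach_to_df=True)
entity_spacy_embeddings_df = target_entity_sfe.sentence_embeddings_extraction(attach_to_df=True)
entity_spacy_embeddings_df.head(7)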

100%|██████████| 33/33 [00:00<00:00, 180.57it/s]
100%|██████████| 33/33 [00:00<00:00, 515.19it/s]


Unnamed: 0,Prediction Sentence,Raw Response,Model Name,No Property,Source,Target,Date,Outcome,Source Embedding,Target Embedding
0,Professor Thompson forecasts that the graduati...,"{0: [""forecasts"", ""that"", ""the"", ""graduation"",...",openai/gpt-oss-120b,"forecasts, that, the, graduation, rate, at, wi...",Professor Thompson,Harvard University,2027,drop,"[-0.188665, 0.1446445, 0.0038895002, 0.18792, ...","[0.099470004, -0.04451, 0.200515, -0.251205, 0..."
1,Economist Dr. Sarah Lee predicts on 12/31/2027...,"{0: [""predicts"", ""on"", ""the""], 1: [""Economist ...",openai/gpt-oss-120b,"predicts, on, the",Economist Dr. Sarah Lee,consumer confidence index,12/31/2027,may rise,"[-0.21245393, 0.39729, 0.13133174, 0.19134, -0...","[-0.49209335, 0.44875398, 0.04111433, 0.108046..."
2,"According to a fitness expert, the nutritional...","{0: [""According"", ""to"", ""a"", ""the"", ""in""], 1: ...",openai/gpt-oss-120b,"According, to, a, the, in",fitness expert,nutritional intake at community centers,21 August 2024,would fall,"[-0.124935, 0.424476, 0.0314905, 0.062526494, ...","[-0.007739997, 0.41389403, 0.084343396, 0.1394..."
3,The nutritional awareness in Europe should sta...,"{""0"": [""The"", ""in"", ""should"", ""the"", ""in"", ""ac...",openai/gpt-oss-120b,"The, in, should, the, in, according, to",a research report,nutritional awareness in Europe,2028,stay the same,"[-0.38877067, 0.07016266, -0.10960641, 0.20952...","[-0.21708825, 0.3115975, 0.14308275, 0.2055162..."
4,"Coach Sofia Rodriguez predicts on 08/10/2028, ...","{0: [""predicts"", ""on"", ""the"", ""at""], 1: [""Coac...",openai/gpt-oss-120b,"predicts, on, the, at",Coach Sofia Rodriguez,Boston Celtics,08/10/2028,win ratio may rise,"[0.30977035, 0.24381, 0.15909334, 0.39647332, ...","[-0.325455, -0.37922502, 0.137535, -0.18293, 0..."
5,Analyst Kevin Jackson predicts on 21 August 2...,"{0: [""predicts"", ""on"", ""the"", ""at"", ""the""], 1:...",openai/gpt-oss-120b,"predicts, on, the, at, the",Analyst Kevin Jackson,New England Patriots,21 August 2024,score average may rise,"[-0.47217667, 0.11852766, 0.057633, 0.216971, ...","[-0.03541, -0.16582334, -0.091784, -0.35260665..."
6,The National Oceanic and Atmospheric Administr...,"{0: [""The"", ""forecasts"", ""that"", ""the"", ""may"",...",openai/gpt-oss-120b,"The, forecasts, that, the, may, in",National Oceanic and Atmospheric Administration,precipitation levels at New Orleans,2024-08-21,decrease,"[0.06728621, -0.040862404, 0.22539398, 0.13525...","[-0.2182428, 0.28489, 0.079394594, -0.20747499..."
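
With one embedding per source attached, one way to look for mentions of the same source is to compare the vectors pairwise. The sketch below assumes cosine similarity as the measure and a threshold of 0.9; neither is taken from this notebook, and `candidate_pairs` is a hypothetical helper. Toy 3-dimensional vectors stand in for the spaCy vectors. The zero-norm guard is deliberate: the `Ruukki` row shown later in this notebook has a source embedding that begins with zeros, and dividing by a zero norm would fail.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_pairs(names, vectors, threshold=0.9):
    """Return (name_i, name_j, score) for every pair scoring at or above threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            score = cosine_similarity(vectors[i], vectors[j])
            if score >= threshold:
                pairs.append((names[i], names[j], score))
    return pairs


# Toy vectors for illustration only, not real embeddings.
names = ["source_a", "source_b", "source_zero"]
vectors = [
    [1.0, 0.0, 1.0],
    [1.0, 0.1, 0.9],
    [0.0, 0.0, 0.0],  # an out-of-vocabulary name may yield an all-zero vector
]

pairs = candidate_pairs(names, vectors)
for left, right, score in pairs:
    print(f"{left} ~ {right}: {score:.3f}")
```

The all-pairs loop is quadratic, which is fine for 33 rows; a larger table would need blocking or a similarity join.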


In [16]:
source_entity_sfe.extract_ner_features(disable_components=[])

33it [00:00, 1091.80it/s]

Spacy Doc (0):  Professor Thompson
Spacy Doc (1):  Economist Dr. Sarah Lee
Spacy Doc (2):  fitness expert
Spacy Doc (3):  a research report





Unnamed: 0,Sentence,Term,NER Label,Unique NER Label,Start Char,End Char
0,Professor Thompson,Thompson,PERSON,PERSON_1,10,18
1,,,,,,
2,Economist Dr. Sarah Lee,Sarah Lee,PERSON,PERSON_2,14,23
3,,,,,,
4,,,,,,
...,...,...,...,...,...,...
58,,,,,,
59,Financial analyst Olivia Brown,Olivia Brown,PERSON,PERSON_19,18,30
60,,,,,,
61,European Central Bank,European Central Bank,ORG,ORG_11,0,21


In [17]:
target_entity_sfe.extract_ner_features(disable_components=[])

33it [00:00, 1037.80it/s]

Spacy Doc (0):  Harvard University
Spacy Doc (1):  consumer confidence index
Spacy Doc (2):  nutritional intake at community centers
Spacy Doc (3):  nutritional awareness in Europe





Unnamed: 0,Sentence,Term,NER Label,Unique NER Label,Start Char,End Char
0,Harvard University,Harvard University,ORG,ORG_1,0.0,18.0
1,,,,,,
2,,,,,,
3,,,,,,
4,nutritional awareness in Europe,Europe,LOC,LOC_1,25.0,31.0
5,,,,,,
6,Boston Celtics,Boston Celtics,ORG,ORG_2,0.0,14.0
7,,,,,,
8,New England Patriots,New England Patriots,ORG,ORG_3,0.0,20.0
9,,,,,,


In [18]:
source_entity_sfe.extract_pos_features(disable_components=[])

33it [00:00, 1530.89it/s]

Spacy Doc (0):  Professor Thompson
Spacy Doc (1):  Economist Dr. Sarah Lee
Spacy Doc (2):  fitness expert
Spacy Doc (3):  a research report





[['Professor', 'Thompson'],
 ['PROPN', 'PROPN'],
 ['PROPN_1', 'PROPN_2'],
 ['Professor', 'Thompson'],
 ['compound', 'ROOT'],
 [False, False],
 ['Economist', 'Dr.', 'Sarah', 'Lee'],
 ['NOUN', 'PROPN', 'PROPN', 'PROPN'],
 ['NOUN_1', 'PROPN_1', 'PROPN_2', 'PROPN_3'],
 ['economist', 'Dr.', 'Sarah', 'Lee'],
 ['compound', 'compound', 'compound', 'ROOT'],
 [False, False, False, False],
 ['fitness', 'expert'],
 ['NOUN', 'NOUN'],
 ['NOUN_1', 'NOUN_2'],
 ['fitness', 'expert'],
 ['compound', 'ROOT'],
 [False, False],
 ['a', 'research', 'report'],
 ['DET', 'NOUN', 'NOUN'],
 ['DET_1', 'NOUN_1', 'NOUN_2'],
 ['a', 'research', 'report'],
 ['det', 'compound', 'ROOT'],
 [True, False, False],
 ['Coach', 'Sofia', 'Rodriguez'],
 ['PROPN', 'PROPN', 'PROPN'],
 ['PROPN_1', 'PROPN_2', 'PROPN_3'],
 ['Coach', 'Sofia', 'Rodriguez'],
 ['compound', 'compound', 'ROOT'],
 [False, False, False],
 ['Analyst', 'Kevin', 'Jackson'],
 ['PROPN', 'PROPN', 'PROPN'],
 ['PROPN_1', 'PROPN_2', 'PROPN_3'],
 ['Analyst', 'Kevin', 'J

In [19]:
target_entity_sfe.extract_pos_features(disable_components=[])

33it [00:00, 1105.64it/s]

Spacy Doc (0):  Harvard University
Spacy Doc (1):  consumer confidence index
Spacy Doc (2):  nutritional intake at community centers
Spacy Doc (3):  nutritional awareness in Europe





[['Harvard', 'University'],
 ['PROPN', 'PROPN'],
 ['PROPN_1', 'PROPN_2'],
 ['Harvard', 'University'],
 ['compound', 'ROOT'],
 [False, False],
 ['consumer', 'confidence', 'index'],
 ['NOUN', 'NOUN', 'NOUN'],
 ['NOUN_1', 'NOUN_2', 'NOUN_3'],
 ['consumer', 'confidence', 'index'],
 ['compound', 'compound', 'ROOT'],
 [False, False, False],
 ['nutritional', 'intake', 'at', 'community', 'centers'],
 ['ADJ', 'NOUN', 'ADP', 'NOUN', 'NOUN'],
 ['ADJ_1', 'NOUN_1', 'ADP_1', 'NOUN_2', 'NOUN_3'],
 ['nutritional', 'intake', 'at', 'community', 'center'],
 ['amod', 'ROOT', 'prep', 'compound', 'pobj'],
 [False, False, True, False, False],
 ['nutritional', 'awareness', 'in', 'Europe'],
 ['ADJ', 'NOUN', 'ADP', 'PROPN'],
 ['ADJ_1', 'NOUN_1', 'ADP_1', 'PROPN_1'],
 ['nutritional', 'awareness', 'in', 'Europe'],
 ['amod', 'ROOT', 'prep', 'pobj'],
 [False, False, True, False],
 ['Boston', 'Celtics'],
 ['PROPN', 'PROPN'],
 ['PROPN_1', 'PROPN_2'],
 ['Boston', 'Celtics'],
 ['compound', 'ROOT'],
 [False, False],
 ['

In [None]:
from pyphonetics import Soundex # https://github.com/Lilykos/pyphonetics
soundex = Soundex()

soundex.phonetics('Rupert')

'R163'

In [25]:
entity_spacy_embeddings_df.head(3)

Unnamed: 0,Prediction Sentence,Raw Response,Model Name,No Property,Source,Target,Date,Outcome,Source Embedding,Target Embedding
0,Professor Thompson forecasts that the graduati...,"{0: [""forecasts"", ""that"", ""the"", ""graduation"",...",openai/gpt-oss-120b,"forecasts, that, the, graduation, rate, at, wi...",Professor Thompson,Harvard University,2027,drop,"[-0.188665, 0.1446445, 0.0038895002, 0.18792, ...","[0.099470004, -0.04451, 0.200515, -0.251205, 0..."
1,Economist Dr. Sarah Lee predicts on 12/31/2027...,"{0: [""predicts"", ""on"", ""the""], 1: [""Economist ...",openai/gpt-oss-120b,"predicts, on, the",Economist Dr. Sarah Lee,consumer confidence index,12/31/2027,may rise,"[-0.21245393, 0.39729, 0.13133174, 0.19134, -0...","[-0.49209335, 0.44875398, 0.04111433, 0.108046..."
2,"According to a fitness expert, the nutritional...","{0: [""According"", ""to"", ""a"", ""the"", ""in""], 1: ...",openai/gpt-oss-120b,"According, to, a, the, in",fitness expert,nutritional intake at community centers,21 August 2024,would fall,"[-0.124935, 0.424476, 0.0314905, 0.062526494, ...","[-0.007739997, 0.41389403, 0.084343396, 0.1394..."


In [32]:
source_phonetics = []
target_phonetics = []

# Soundex is applied to the full entity string, so each code mostly reflects
# the first word of the name (for example a title such as "Professor").
for _, row in tqdm(entity_spacy_embeddings_df.iterrows()):
    source_phonetics.append(soundex.phonetics(row['Source']))
    target_phonetics.append(soundex.phonetics(row['Target']))

33it [00:00, 20548.10it/s]


In [33]:
entity_spacy_embeddings_df.loc[:, 'Source Phonetics'] = source_phonetics
entity_spacy_embeddings_df.loc[:, 'Target Phonetics'] = target_phonetics
entity_spacy_embeddings_df

Unnamed: 0,Prediction Sentence,Raw Response,Model Name,No Property,Source,Target,Date,Outcome,Source Embedding,Target Embedding,Source Phonetics,Target Phonetics
0,Professor Thompson forecasts that the graduati...,"{0: [""forecasts"", ""that"", ""the"", ""graduation"",...",openai/gpt-oss-120b,"forecasts, that, the, graduation, rate, at, wi...",Professor Thompson,Harvard University,2027,drop,"[-0.188665, 0.1446445, 0.0038895002, 0.18792, ...","[0.099470004, -0.04451, 0.200515, -0.251205, 0...",P612,H616
1,Economist Dr. Sarah Lee predicts on 12/31/2027...,"{0: [""predicts"", ""on"", ""the""], 1: [""Economist ...",openai/gpt-oss-120b,"predicts, on, the",Economist Dr. Sarah Lee,consumer confidence index,12/31/2027,may rise,"[-0.21245393, 0.39729, 0.13133174, 0.19134, -0...","[-0.49209335, 0.44875398, 0.04111433, 0.108046...",E255,C525
2,"According to a fitness expert, the nutritional...","{0: [""According"", ""to"", ""a"", ""the"", ""in""], 1: ...",openai/gpt-oss-120b,"According, to, a, the, in",fitness expert,nutritional intake at community centers,21 August 2024,would fall,"[-0.124935, 0.424476, 0.0314905, 0.062526494, ...","[-0.007739997, 0.41389403, 0.084343396, 0.1394...",F352,N363
3,The nutritional awareness in Europe should sta...,"{""0"": [""The"", ""in"", ""should"", ""the"", ""in"", ""ac...",openai/gpt-oss-120b,"The, in, should, the, in, according, to",a research report,nutritional awareness in Europe,2028,stay the same,"[-0.38877067, 0.07016266, -0.10960641, 0.20952...","[-0.21708825, 0.3115975, 0.14308275, 0.2055162...",A626,N363
4,"Coach Sofia Rodriguez predicts on 08/10/2028, ...","{0: [""predicts"", ""on"", ""the"", ""at""], 1: [""Coac...",openai/gpt-oss-120b,"predicts, on, the, at",Coach Sofia Rodriguez,Boston Celtics,08/10/2028,win ratio may rise,"[0.30977035, 0.24381, 0.15909334, 0.39647332, ...","[-0.325455, -0.37922502, 0.137535, -0.18293, 0...",C216,B235
5,Analyst Kevin Jackson predicts on 21 August 2...,"{0: [""predicts"", ""on"", ""the"", ""at"", ""the""], 1:...",openai/gpt-oss-120b,"predicts, on, the, at, the",Analyst Kevin Jackson,New England Patriots,21 August 2024,score average may rise,"[-0.47217667, 0.11852766, 0.057633, 0.216971, ...","[-0.03541, -0.16582334, -0.091784, -0.35260665...",A542,N524
6,The National Oceanic and Atmospheric Administr...,"{0: [""The"", ""forecasts"", ""that"", ""the"", ""may"",...",openai/gpt-oss-120b,"The, forecasts, that, the, may, in",National Oceanic and Atmospheric Administration,precipitation levels at New Orleans,2024-08-21,decrease,"[0.06728621, -0.040862404, 0.22539398, 0.13525...","[-0.2182428, 0.28489, 0.079394594, -0.20747499...",N354,P621
7,The sports analyst from ESPN anticipates that ...,"{0: [""The"", ""anticipates"", ""that"", ""the"", ""wil...",openai/gpt-oss-120b,"The, anticipates, that, the, will, in, of",sports analyst from ESPN,scoring average at the Los Angeles Lakers,2024 Q3,potentially decrease,"[-0.09651501, 0.205311, 0.084187, 0.18342498, ...","[-0.2211187, 0.13450356, 0.17586729, -0.070046...",S163,S265
8,The stock price at Amazon should stay same in ...,"{0: [""The"", ""at"", ""should"", ""in"", ""according"",...",openai/gpt-oss-120b,"The, at, should, in, according, to",Morgan Stanley,Amazon,2024/08/21,"stock price, stay same","[-0.53253496, 0.361565, -0.013345003, -0.19011...","[-0.73095, 0.45252, 0.1357, 0.25915, -0.14606,...",M625,A525
9,The transaction will have a positive impact of...,"{\n ""0"": [""The"", ""transaction"", ""will"", ""have...",openai/gpt-oss-120b,"The, transaction, will, have, a, on, ,, which,...",Ruukki,earnings,fourth quarter of this year,positive impact of around EUR2m,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.8994, 0.58613, -0.19851, 0.35195, 0.019843...",R200,E655
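
The codes in the table above are computed over the whole string, so the first word dominates: `nutritional intake at community centers` and `nutritional awareness in Europe` both map to `N363`, and `Economist Dr. Sarah Lee` gets `E255` from its title alone. One possible refinement is to encode each token separately and compare the sets of codes. The sketch below is an assumption-laden illustration, not the notebook's method: `soundex` is a small pure-Python re-implementation written to reproduce the codes shown above (ASCII letters only, no transliteration), not pyphonetics itself, and `token_codes` and `shared_codes` are hypothetical helpers.

```python
import re

# Consonant groups of the Soundex alphabet; vowels and Y get the placeholder "0".
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(text: str) -> str:
    """Four-character Soundex-style code computed over the letters A-Z of `text`."""
    letters = re.sub(r"[^A-Z]", "", text.upper())
    if not letters:
        raise ValueError("soundex() needs at least one ASCII letter")
    # H and W are skipped entirely, so they never separate equal codes.
    digits = [_CODES.get(ch, "0") for ch in letters if ch not in "HW"]
    # Drop the first letter's own code; the letter itself starts the result.
    first_code = None if letters[0] in "HW" else _CODES.get(letters[0], "0")
    if digits and digits[0] == first_code:
        digits = digits[1:]
    # Collapse adjacent repeats, then remove the vowel placeholders.
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    code = "".join(d for d in collapsed if d != "0")
    return (letters[0] + code + "000")[:4]


def token_codes(name: str) -> tuple:
    """Soundex code of every alphabetic token in `name`, in order."""
    return tuple(soundex(token) for token in re.findall(r"[A-Za-z]+", name))


def shared_codes(name_a: str, name_b: str) -> set:
    """Token-level codes that two names have in common."""
    return set(token_codes(name_a)) & set(token_codes(name_b))


print(soundex("Economist Dr. Sarah Lee"), soundex("Marketing Director Sarah Lee"))
print(shared_codes("Economist Dr. Sarah Lee", "Marketing Director Sarah Lee"))
```

Shared token codes only flag a candidate pair. Whether the two "Sarah Lee" mentions are really the same person still needs other evidence, such as the NER term or the embedding similarity.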
