# Sequence Labelling: Pediction Sentence

- Sequence: Pediction Sentence
- Labels: Prediction Properties

> `prediction_classification_experiments-v2/ml_classifiers.ipynb` and `prediction_classification_experiments-v2/llm_classifiers.ipynb` because classify sentences as prediction/non-prediction

In [1]:
import os
import sys
import pprint

import pandas as pd

from tqdm import tqdm

notebook_dir = os.getcwd()

sys.path.append(os.path.join(notebook_dir, '../'))

from data_processing import DataProcessing
from prediction_properties import PredictionProperties
from text_generation_models import TextGenerationModelFactory
from vector_stores import ChromaVectorStore, VectorStoreDirector

In [2]:
pd.set_option('max_colwidth', 800)
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_columns', 40)
# pd.set_option('display.max_rows', None)

## Load Data

In [3]:
base_data_path = os.path.join(notebook_dir, '../data')
combine_data_path = os.path.join(base_data_path, 'combined_datasets')
model_results_path = os.path.join(combine_data_path, 'ml_classifiers-v1.csv')
df = DataProcessing.load_from_file(model_results_path, 'csv', sep=',')
df.head(3)

Unnamed: 0,Base Sentence,Sentence Label,Author Type,Embedding,Normalized Embeddings,perceptron,sgd_classifier,logistic_regression,ridge_classifier,decision_tree_classifier,random_forest_classifier,gradient_boosting_classifier
0,"PARIS — President Emmanuel Macron declared his candidacy for a second five-year term in the presidential election next month, formalizing his decision with a low-key letter in several newspapers that exhorted the French to let him guide “this beautiful and collective adventure that is called France.”",1,1,[ 5.66046238e-02 1.04988851e-01 -3.34139578e-02 1.98167879e-02\n 1.52485088e-01 -6.53142408e-02 -1.45529425e-02 -6.41713664e-02\n -3.75037752e-02 2.14759326e+00 -2.14621708e-01 2.73683928e-02\n 2.88958307e-02 -1.24519587e-01 -6.89217523e-02 -4.13673334e-02\n -5.35880737e-02 7.76791871e-01 -4.43464518e-02 1.01134705e-03\n 1.09494086e-02 -4.75250259e-02 -2.06354335e-02 -6.48403764e-02\n 1.07812867e-01 9.46720596e-03 -1.09430917e-01 -4.65166196e-03\n -8.32740441e-02 5.66849113e-03 -2.98308004e-02 6.16043173e-02\n 2.14268006e-02 5.82768060e-02 6.74477890e-02 -1.06455544e-02\n -7.22951535e-03 -8.13032016e-02 -8.00553635e-02 -4.17012870e-02\n 1.55045995e-02 5.84505014e-02 2.86313556e-02 -1.45766467e-01\n 2.51089260e-02 -3.74014415e-02 -2.69140229e-02 3.26086022e-02\n 4....,[ 9.95389938e-01 -8.30533266e-01 -1.43044465e-03 8.73336315e-01\n 1.09862113e+00 -4.87613559e-01 -1.22331895e-01 -3.02105606e-01\n -6.34055793e-01 3.96464318e-01 1.20431840e-01 1.43495783e-01\n -6.45723879e-01 -1.23308265e+00 1.21950001e-01 9.57266837e-02\n -3.60498607e-01 -1.11962736e+00 1.05052781e+00 5.26727080e-01\n 1.28631011e-01 -6.31512940e-01 -1.30328745e-01 -2.20445901e-01\n 1.18003213e+00 -3.56904298e-01 -9.83796865e-02 -1.43321365e-01\n -1.27634919e+00 -1.40238732e-01 -2.59166718e-01 5.60881197e-01\n 5.56570828e-01 -5.19165695e-02 -8.78698155e-02 4.34541374e-01\n -1.03092425e-01 -1.60028136e+00 -8.79279971e-01 -1.84806854e-01\n 1.09652579e-01 -9.91694033e-02 -2.68077195e-01 -1.27014589e+00\n 7.04476759e-02 -8.28510642e-01 9.05415952e-01 7.40090787e-01\n 3....,0,0,1,0,0,1,1
1,"This time, the plot — about a ray gun that turns humans into monsters, and vice versa — seems to acknowledge the need to goose characters out of their inertia.",0,1,[-2.23119743e-02 6.97671250e-02 -9.84913930e-02 3.75865086e-04\n -3.14215869e-02 9.29637328e-02 -2.15256251e-02 1.64190568e-02\n 3.75561090e-03 2.09514165e+00 -9.81769562e-02 -2.71825790e-02\n 6.37708604e-02 2.52576079e-02 -1.64901182e-01 -1.31515667e-01\n -8.57630968e-02 1.00542879e+00 -1.91387057e-01 -1.72457062e-02\n -1.96998157e-02 1.58039983e-02 -8.47747996e-02 -5.93274459e-02\n -3.87448259e-02 2.83081476e-02 -6.39593303e-02 -3.22961658e-02\n -1.75291598e-02 -5.21216244e-02 -5.90514541e-02 8.25655013e-02\n -1.87435567e-01 1.76751390e-01 1.82304636e-01 -5.84071316e-02\n 6.34762719e-02 7.96191171e-02 -4.05625440e-02 -6.58209398e-02\n 5.19772992e-02 1.05586648e-02 -6.87460601e-02 -1.15110271e-01\n 8.34729597e-02 5.28117083e-03 -7.90077299e-02 3.89454179e-02\n -6....,[ 1.02042906e-01 -1.25779295e+00 -8.07256043e-01 5.91519654e-01\n -1.25950062e+00 1.64591324e+00 -2.32750431e-01 5.62637687e-01\n -5.82166910e-02 2.33458817e-01 1.34504688e+00 -5.37074924e-01\n -1.18046746e-01 9.42714691e-01 -8.83851290e-01 -1.40953076e+00\n -8.80104840e-01 -4.82799970e-02 -8.90786171e-01 2.49257118e-01\n -3.30653459e-01 1.78774640e-01 -1.01988363e+00 -1.42349273e-01\n -1.00427949e+00 -8.71060789e-02 4.76414889e-01 -5.45713067e-01\n -3.79090428e-01 -7.61182785e-01 -7.15413988e-01 8.47321272e-01\n -2.47394013e+00 1.62751734e+00 1.49586022e+00 -2.60521144e-01\n 8.91075373e-01 8.00682306e-01 -3.06060821e-01 -5.04338622e-01\n 6.78358138e-01 -8.19013655e-01 -1.57371426e+00 -8.22251201e-01\n 9.33124840e-01 -1.94198340e-01 1.62035823e-01 8.18059087e-01\n -1....,0,0,0,0,0,1,1
2,"In his first weeks as mayor, that challenge has risen to meet him.",0,1,[ 5.81765212e-02 2.07019195e-01 -7.69932643e-02 -6.81760013e-02\n 1.22693665e-01 -1.35694534e-01 -8.19419175e-02 -3.39005925e-02\n 5.66506805e-03 2.72540665e+00 -1.78715482e-01 2.04635300e-02\n 4.16096002e-02 -2.71960557e-03 -1.61515608e-01 1.82891320e-02\n 1.80162247e-02 7.59893358e-01 -1.08826131e-01 1.53591344e-02\n 6.34927768e-03 -7.43348673e-02 -4.02560048e-02 -7.23461360e-02\n 7.84059986e-02 9.49361399e-02 -9.24548283e-02 1.06267445e-02\n -1.70880035e-02 -8.75466838e-02 -6.63756654e-02 1.13582596e-01\n -7.43332654e-02 -5.63022820e-03 1.33150846e-01 -1.48050874e-01\n 4.73999567e-02 -3.72759961e-02 5.08240648e-02 -1.01549730e-01\n 2.34233841e-04 -1.08329961e-02 4.20196690e-02 -1.49904892e-01\n -5.08043952e-02 7.94086978e-02 -1.34930000e-01 -6.85446570e-03\n -6....,[ 1.0131841 0.4071531 -0.54105407 -0.4022119 0.7166242 -1.436313\n -1.1894954 0.02270282 -0.03156724 2.1921527 0.498047 0.05735163\n -0.45335802 0.5362927 -0.8483727 1.0918443 0.7958655 -1.1988105\n 0.19923028 0.7447842 0.05969716 -0.9745417 -0.4024483 -0.32677308\n 0.741749 0.8669924 0.11621065 0.07907018 -0.37306973 -1.1418184\n -0.8297732 1.271178 -0.83287007 -0.95782936 0.81809187 -1.5650845\n 0.66503227 -0.9433948 1.0203716 -0.9776657 -0.1284527 -1.1405437\n -0.08856661 -1.3306093 -1.051625 0.90742135 -0.63597816 0.2545358\n -0.35334235 -1.1255441 -0.09057999 0.48243684 0.25384033 0.21164928\n 0.3576934 1.7142646 -0.36263907 1.1403483 1.0692748 -0.89533913\n -0.03675979 0.46403384 0.37106022 0.396437 0.4069825 0...,0,0,0,0,1,0,0


## Majority Vote (Prediction)

### Majority Vote

- Be care with doing this on real data. Some examples in `data/combined_datasets/ml_classifiers-v1.csv` have true labels of 0 and MV is 1.

In [4]:
model_results_df = df.drop(columns=['Author Type', 'Embedding', 'Normalized Embeddings'])
model_results_df.head(3)

Unnamed: 0,Base Sentence,Sentence Label,perceptron,sgd_classifier,logistic_regression,ridge_classifier,decision_tree_classifier,random_forest_classifier,gradient_boosting_classifier
0,"PARIS — President Emmanuel Macron declared his candidacy for a second five-year term in the presidential election next month, formalizing his decision with a low-key letter in several newspapers that exhorted the French to let him guide “this beautiful and collective adventure that is called France.”",1,0,0,1,0,0,1,1
1,"This time, the plot — about a ray gun that turns humans into monsters, and vice versa — seems to acknowledge the need to goose characters out of their inertia.",0,0,0,0,0,0,1,1
2,"In his first weeks as mayor, that challenge has risen to meet him.",0,0,0,0,0,1,0,0


In [5]:
model_results_df['Majority Vote'] = model_results_df.iloc[:, 2:].mode(axis=1)
model_results_df.head(3)

Unnamed: 0,Base Sentence,Sentence Label,perceptron,sgd_classifier,logistic_regression,ridge_classifier,decision_tree_classifier,random_forest_classifier,gradient_boosting_classifier,Majority Vote
0,"PARIS — President Emmanuel Macron declared his candidacy for a second five-year term in the presidential election next month, formalizing his decision with a low-key letter in several newspapers that exhorted the French to let him guide “this beautiful and collective adventure that is called France.”",1,0,0,1,0,0,1,1,0
1,"This time, the plot — about a ray gun that turns humans into monsters, and vice versa — seems to acknowledge the need to goose characters out of their inertia.",0,0,0,0,0,0,1,1,0
2,"In his first weeks as mayor, that challenge has risen to meet him.",0,0,0,0,0,1,0,0,0


### Filter for Prediction (=1)

In [6]:
filt_prediction = (model_results_df['Sentence Label'] == 1) & (model_results_df['Majority Vote'] == 1)
predictions_df = model_results_df[filt_prediction]
predictions_df.shape

(667, 10)

In [7]:
predictions_df.head(7)

Unnamed: 0,Base Sentence,Sentence Label,perceptron,sgd_classifier,logistic_regression,ridge_classifier,decision_tree_classifier,random_forest_classifier,gradient_boosting_classifier,Majority Vote
16,"On 2025-06-01, Meteorologist Emily Chen speculates that the temperature at Dallas will likely increase.",1,1,1,1,1,1,1,1,1
17,"The dinners are planned for March 1, 2, 8, 9, 14 and 15 at 7:30 p.m., $95 plus tax and gratuities.",1,1,1,1,1,1,0,0,1
18,Apple will release an electric car within the next decade.,1,1,1,1,1,1,1,1,1
20,"On August 22, 2024, Research Advisor Michael Brown speculates that the graduation rates at Harvard University will likely increase.",1,1,1,1,1,1,1,1,1
22,"WASHINGTON — The Biden administration is quietly pressing the Taiwanese government to order American-made weapons that would help its small military repel a seaborne invasion by China rather than weapons designed for conventional set-piece warfare, current and former U.S. and Taiwanese officials say.",1,1,1,1,1,1,1,1,1
23,"In 2029, college student Alex Lee envisions that the average GPA at Stanford University has some probability to remain stable.",1,1,1,1,1,1,0,1,1
25,"Around 10 p.m., Lindsay Colford settles into bed with the dulcet drawl of Matthew McConaughey, who is about to take her on an audio journey through the cosmos until she falls asleep.",1,1,1,1,1,0,0,0,1


## LLM for Relation Extraction

### Prompts

In [8]:
prediction_base_prompt = DataProcessing.load_prediction_properties()
prediction_base_prompt

' A prediction <p> = (<p_s>, <p_t>, <p_d>, <p_o>), where it consists of the following four properties:\n\n            1. <p_s>\n                - Defined as: \n                    - Source entity that states the <p>\n                - Characteristics:\n                    - A person with either: a name only, profile name only, geneder only, domain specific title only or any combination of these.\n                    - An associated organization\n                    - Named entity: Person, organization\n                    - Part of speech: Noun\n\n            2. <p_t>\n                - Defined as: \n                    - Target entity that the <p> is about\n                - Characteristics:\n                    - Same and <p_s>\n                    \n            3. <p_d>\n                - Defined as: \n                    - Date when the <p> is made\n                    - Date when the <p> is expected to come to fruition\n                - Characteristics:\n                    - For

In [9]:
# role_prompt = "Role: You are a lingust that specializes in identifying properties within a prediction statement."
# task_prompt = "Your task is label the prediction properties within"

In [10]:
system_identity_prompt = "You are a lingustic expert that specializes in identifying properties within a prediction statement."
# prediction_requirements = PredictionProperties.get_requirements()
task = """For each word within the sentence "label" as either a "no_label": 0, "source": 1, "target": 2, "date": 3, "outcome": 4. IMPORTANT: Keep multi-word entities together as single items in the list."""
sentence_label_format_output = """

Respond ONLY with valid JSON in this exact format: {0: [word_from_sentence]}, {1: [word_from_sentence]}, {2: [word_from_sentence]}, {3: [word_from_sentence]}, {4: [word_from_sentence]}, where key is int ranging from 0 to 4 and the value is the words_from_sentence, split by a comma/all placed into a list, so {int: [word_from_sentence_1, word_from_sentence_2, ..., word_from_sentence_W]}. For 2 and 3, some words may be a prefix or a position or tile before/after 2 or 3. Be sure to take that into account.

Do NOT reason or provide anything other than the aforementioned. Also, stop responding in reverse format {word_from_sentence: 0}, {word_from_sentence: 1}, {word_from_sentence: 2}, {word_from_sentence: 3}, {word_from_sentence: 4} or in any other format.

Respond ONLY with valid JSON in this exact format: {0: [word_from_sentence]}, {1: [word_from_sentence]}, {2: [word_from_sentence]}, {3: [word_from_sentence]}, {4: [word_from_sentence]}, where key is int ranging from 0 to 4 and the value is the words_from_sentence, split by a comma/all placed into a list, so {int: [word_from_sentence_1, word_from_sentence_2, ..., word_from_sentence_W]}.
"""

In [11]:
prediction_properties = PredictionProperties.get_prediction_properties()
prediction_properties_base_prompt = f"""{system_identity_prompt} For each prediction, the format is based on: 
    
    {prediction_properties}

"""
prediction_properties_base_prompt

'You are a lingustic expert that specializes in identifying properties within a prediction statement. For each prediction, the format is based on: \n\n     A prediction <p> = (<p_s>, <p_t>, <p_d>, <p_o>), where it consists of the following four properties:\n\n            1. <p_s>\n                - Defined as: \n                    - Source entity that states the <p>\n                - Characteristics:\n                    - A person with either: a name only, profile name only, geneder only, domain specific title only or any combination of these.\n                    - An associated organization\n                    - Named entity: Person, organization\n                    - Part of speech: Noun\n\n            2. <p_t>\n                - Defined as: \n                    - Target entity that the <p> is about\n                - Characteristics:\n                    - Same and <p_s>\n                    \n            3. <p_d>\n                - Defined as: \n                    - Date wh

### Models

In [12]:
tgmf = TextGenerationModelFactory()

# Option 1: Specific models
# models = tgmf.create_instances(['llama-3.1-8b-instant', 'llama-3.3-70b-versatile', 'llama-3.3-70b-instruct', 'openai/gpt-oss-20b'])
models = tgmf.create_instances(['openai/gpt-oss-120b'])

# Option 2: All Groq models
# models = tgmf.create_instances(tgmf.get_groq_model_names())

# Option 3: All NaviGator models
# models = tgmf.create_instances(tgmf.get_navigator_model_names())

# Option 4: All available models
# models = tgmf.create_instances()

# Option 5: Mix and match
# custom_models = ['llama-3.1-70b-instruct', 'mistral-small-3.1', 'llama-3.1-8b-instant']
# models = tgmf.create_instances(custom_models)
models

[<text_generation_models.GptOss120bTextGenerationModel at 0x3218f9590>]

### Prompt Models

In [13]:
def llm_certifier(idx, sentence_to_classify: str, base_prompt: str, model, task, format_output: str):
    
      prompt = f"""{base_prompt}
      
      Sentence to extract the prediction properties: '{sentence_to_classify}'

      {task}
      
      {format_output}
      """
      if idx < 2:
            print(f"\tPrompt: {prompt}")
      input_prompt = model.user(prompt)
      raw_text_llm_generation = model.chat_completion([input_prompt])
      # print(f"Raw response: {raw_text_llm_generation}")      
      return raw_text_llm_generation

In [14]:
# subset_predictions_df = predictions_df.iloc[:33, :]
# subset_predictions_df.head(3)

In [15]:
results = []
# len(predictions_df)
for idx, row in tqdm(predictions_df.iterrows(), total=len(predictions_df), desc="Processing"):
    text = row['Base Sentence']
    # print(f"{idx} --- Sentence: {text}")
    for model in models:
        # print(model.__name__())
        raw_response = llm_certifier(idx, text, prediction_properties_base_prompt, model, task, sentence_label_format_output)
        result = (text, raw_response, model.__name__())
        results.append(result)

        if idx < 3:
            # print(f"{idx} --- Sentence: {text}")
            print(f"\n--- Result {idx} ---")
            pprint.pprint(result, width=120)
    
    # print()

Processing:  13%|█▎        | 88/667 [16:18<1:47:16, 11.12s/it]


RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for model `openai/gpt-oss-120b` in organization `org_01jf12p7h2f9d8jj9h5fxm2h5d` service tier `on_demand` on tokens per day (TPD): Limit 200000, Used 199569, Requested 858. Please try again in 3m4.464s. Need more tokens? Upgrade to Dev Tier today at https://console.groq.com/settings/billing', 'type': 'tokens', 'code': 'rate_limit_exceeded'}}

In [16]:
results

[('On 2025-06-01, Meteorologist Emily Chen speculates that the temperature at Dallas will likely increase.',
  '{0: ["On", "speculates", "that", "the", "will", "likely"], 1: ["Meteorologist Emily Chen"], 2: ["temperature at Dallas"], 3: ["2025-06-01"], 4: ["increase"]}',
  'openai/gpt-oss-120b'),
 ('The dinners are planned for March 1, 2, 8, 9, 14 and 15 at 7:30 p.m., $95 plus tax and gratuities.',
  '{0: ["are", "planned", "for", "at"], 1: [], 2: ["The dinners"], 3: ["March 1, 2, 8, 9, 14 and 15", "7:30 p.m."], 4: ["$95 plus tax and gratuities"]}',
  'openai/gpt-oss-120b'),
 ('Apple will release an electric car within the next decade.',
  '{0: ["will", "an", "within", "the"], 1: ["Apple"], 2: ["electric car"], 3: ["next decade"], 4: ["release"]}',
  'openai/gpt-oss-120b'),
 ('On August 22, 2024, Research Advisor Michael Brown speculates that the graduation rates at Harvard University will likely increase.',
  '{0: ["On", "speculates", "that", "the", "will"], 1: ["Research Advisor Mich

In [17]:
column_names = ["Prediction Sentence", "Raw Response", "Model Name"]
results_df = pd.DataFrame(results, columns=column_names)
results_df

Unnamed: 0,Prediction Sentence,Raw Response,Model Name
0,"On 2025-06-01, Meteorologist Emily Chen speculates that the temperature at Dallas will likely increase.","{0: [""On"", ""speculates"", ""that"", ""the"", ""will"", ""likely""], 1: [""Meteorologist Emily Chen""], 2: [""temperature at Dallas""], 3: [""2025-06-01""], 4: [""increase""]}",openai/gpt-oss-120b
1,"The dinners are planned for March 1, 2, 8, 9, 14 and 15 at 7:30 p.m., $95 plus tax and gratuities.","{0: [""are"", ""planned"", ""for"", ""at""], 1: [], 2: [""The dinners""], 3: [""March 1, 2, 8, 9, 14 and 15"", ""7:30 p.m.""], 4: [""$95 plus tax and gratuities""]}",openai/gpt-oss-120b
2,Apple will release an electric car within the next decade.,"{0: [""will"", ""an"", ""within"", ""the""], 1: [""Apple""], 2: [""electric car""], 3: [""next decade""], 4: [""release""]}",openai/gpt-oss-120b
3,"On August 22, 2024, Research Advisor Michael Brown speculates that the graduation rates at Harvard University will likely increase.","{0: [""On"", ""speculates"", ""that"", ""the"", ""will""], 1: [""Research Advisor Michael Brown""], 2: [""graduation rates at Harvard University""], 3: [""August 22, 2024""], 4: [""likely increase""]}",openai/gpt-oss-120b
4,"WASHINGTON — The Biden administration is quietly pressing the Taiwanese government to order American-made weapons that would help its small military repel a seaborne invasion by China rather than weapons designed for conventional set-piece warfare, current and former U.S. and Taiwanese officials say.","{0: [""WASHINGTON"", ""—"", ""is"", ""quietly"", ""pressing"", ""to"", ""that"", ""would"", ""rather"", ""than"", ""weapons"", ""designed"", ""for"", ""conventional"", ""set-piece"", ""warfare,"", ""say.""], 1: [""The Biden administration"", ""current and former U.S. and Taiwanese officials""], 2: [""the Taiwanese government""], 3: [], 4: [""order American-made weapons"", ""help its small military repel a seaborne invasion by China""]}",openai/gpt-oss-120b
...,...,...,...
83,"Minutes before the Super Bowl gets underway on Sunday, four high school students from Riverside will be standing on the field in the glare of the national spotlight.","{0: [""Minutes"", ""before"", ""the"", ""gets"", ""underway"", ""on"", ""will"", ""be""], 1: [], 2: [""four high school students from Riverside""], 3: [""Sunday""], 4: [""standing on the field in the glare of the national spotlight""]}",openai/gpt-oss-120b
84,With the new production plant the company would increase its capacity to meet the expected increase in demand and would improve the use of raw materials and therefore increase the production profitability .,"{0: [""With"", ""would"", ""to"", ""meet"", ""the"", ""expected"", ""increase"", ""in"", ""demand"", ""and"", ""would"", ""and"", ""therefore""], 1: [""the company""], 2: [""the new production plant""], 3: [], 4: [""increase its capacity"", ""improve the use of raw materials"", ""increase the production profitability""]}",openai/gpt-oss-120b
85,"Meteorologist Ethan Kim predicts on 08/15/2024, the wind speed at Los Angeles may rise.","{0: [""predicts"", ""on"", ""the""], 1: [""Meteorologist Ethan Kim""], 2: [""wind speed at Los Angeles""], 3: [""08/15/2024""], 4: [""may rise""]}",openai/gpt-oss-120b
86,"As the world reels from spikes in oil and gas prices, the fallout from Russia’s invasion of Ukraine has laid bare a dilemma: Nations remain extraordinarily dependent on fossil fuels and are struggling to shore up supplies precisely at a moment when scientists say the world must slash its use of oil, gas and coal to avert irrevocable damage to the planet.","{0: [""As"",""the"",""world"",""reels"",""from"",""spikes"",""in"",""oil and gas prices"",""the"",""fallout"",""from"",""Russia’s invasion of Ukraine"",""has"",""laid"",""bare"",""a dilemma:"",""Nations"",""remain"",""extraordinarily"",""dependent"",""on"",""fossil fuels"",""and"",""are"",""struggling"",""to"",""shore"",""up"",""supplies"",""precisely"",""at"",""a moment"",""when"",""scientists"",""say"",""the"",""world"",""must"",""slash"",""its"",""use"",""of"",""oil, gas and coal"",""to"",""avert"",""irrevocable"",""damage"",""to"",""the"",""planet.""],1: [],2: [],3: [],4: []}",openai/gpt-oss-120b


In [None]:
import ast, re
from typing import Any, Dict, List, Union

def parse_json_response_2(raw_response: Union[str, bytes, bytearray]) -> Dict[int, List[Any]]:
    """
    Parse a string that looks like a Python literal into a dictionary.
    Normalizes string keys to integer keys.
    Handles both quoted and unquoted list items.
    
    Args:
        raw_response: The raw string (or bytes/bytearray) containing Python-like literals.
    
    Returns:
        A dictionary with integer keys mapped to lists.
    
    Raises:
        ValueError: If the input cannot be parsed into the expected structure.
    """
    # Normalize input
    print(f"RAW: {raw_response}")
    if isinstance(raw_response, (bytes, bytearray)):
        raw_response = raw_response.decode("utf-8", errors="replace")
    if not isinstance(raw_response, str):
        raise ValueError(f"Expected str/bytes, got {type(raw_response).__name__}")
    
    text = raw_response.strip()
    
    # Fix unquoted list items: [word1, word2] -> ["word1", "word2"]
    def quote_unquoted_items(match):
        """Add quotes to unquoted words in lists."""
        content = match.group(1)
        # Split by comma, strip whitespace, and add quotes if not already quoted
        items = []
        for item in content.split(','):
            item = item.strip()
            if item and not (item.startswith('"') or item.startswith("'")):
                items.append(f'"{item}"')
            else:
                items.append(item)
        return '[' + ', '.join(items) + ']'
    
    # Apply the fix to all lists in the text
    text = re.sub(r'\[([^\[\]]+)\]', quote_unquoted_items, text)
    
    # Try to safely evaluate using ast.literal_eval
    try:
        obj = ast.literal_eval(text)
    except (SyntaxError, ValueError) as e:
        raise ValueError(f"Unable to parse input as Python literal: {e}\nProcessed text: {text}")
    
    # Ensure we have a dict
    if not isinstance(obj, dict):
        raise ValueError(f"Expected dict, got {type(obj).__name__}")
    
    # Normalize keys to integers
    normalized = {}
    for key, value in obj.items():
        try:
            int_key = int(key)
            normalized[int_key] = value
        except (ValueError, TypeError):
            # Keep non-convertible keys as-is (shouldn't happen with your format)
            normalized[key] = value
    
    return normalized

In [29]:
for idx, row in results_df.iterrows():
    raw_response = row['Raw Response']
    
    try:
        # Parse the raw JSON response
        cleaned_response = parse_json_response_2(raw_response)
        print(f"   CLEANED: {cleaned_response}")
        
        # Store as comma-separated strings
        results_df.at[idx, 'No Property'] = ', '.join(cleaned_response.get(0, []))
        results_df.at[idx, 'Source'] = ', '.join(cleaned_response.get(1, []))
        results_df.at[idx, 'Target'] = ', '.join(cleaned_response.get(2, []))
        results_df.at[idx, 'Date'] = ', '.join(cleaned_response.get(3, []))
        results_df.at[idx, 'Outcome'] = ', '.join(cleaned_response.get(4, []))
    except ValueError as e:
        print(f"   ERROR at index {idx}: {e}")
        # Set empty values for failed parses
        for col in ['No Property', 'Source', 'Target', 'Date', 'Outcome']:
            results_df.at[idx, col] = ''

results_df.head(3)

RAW: {0: ["On", "speculates", "that", "the", "will", "likely"], 1: ["Meteorologist Emily Chen"], 2: ["temperature at Dallas"], 3: ["2025-06-01"], 4: ["increase"]}
   CLEANED: {0: ['On', 'speculates', 'that', 'the', 'will', 'likely'], 1: ['Meteorologist Emily Chen'], 2: ['temperature at Dallas'], 3: ['2025-06-01'], 4: ['increase']}
RAW: {0: ["are", "planned", "for", "at"], 1: [], 2: ["The dinners"], 3: ["March 1, 2, 8, 9, 14 and 15", "7:30 p.m."], 4: ["$95 plus tax and gratuities"]}
   CLEANED: {0: ['are', 'planned', 'for', 'at'], 1: [], 2: ['The dinners'], 3: ['March 1, 2, 8, 9, 14 and 15', '7:30 p.m.'], 4: ['$95 plus tax and gratuities']}
RAW: {0: ["will", "an", "within", "the"], 1: ["Apple"], 2: ["electric car"], 3: ["next decade"], 4: ["release"]}
   CLEANED: {0: ['will', 'an', 'within', 'the'], 1: ['Apple'], 2: ['electric car'], 3: ['next decade'], 4: ['release']}
RAW: {0: ["On", "speculates", "that", "the", "will"], 1: ["Research Advisor Michael Brown"], 2: ["graduation rates at H

Unnamed: 0,Prediction Sentence,Raw Response,Model Name,No Property,Source,Target,Date,Outcome
0,"On 2025-06-01, Meteorologist Emily Chen speculates that the temperature at Dallas will likely increase.","{0: [""On"", ""speculates"", ""that"", ""the"", ""will"", ""likely""], 1: [""Meteorologist Emily Chen""], 2: [""temperature at Dallas""], 3: [""2025-06-01""], 4: [""increase""]}",openai/gpt-oss-120b,"On, speculates, that, the, will, likely",Meteorologist Emily Chen,temperature at Dallas,2025-06-01,increase
1,"The dinners are planned for March 1, 2, 8, 9, 14 and 15 at 7:30 p.m., $95 plus tax and gratuities.","{0: [""are"", ""planned"", ""for"", ""at""], 1: [], 2: [""The dinners""], 3: [""March 1, 2, 8, 9, 14 and 15"", ""7:30 p.m.""], 4: [""$95 plus tax and gratuities""]}",openai/gpt-oss-120b,"are, planned, for, at",,The dinners,"March 1, 2, 8, 9, 14 and 15, 7:30 p.m.",$95 plus tax and gratuities
2,Apple will release an electric car within the next decade.,"{0: [""will"", ""an"", ""within"", ""the""], 1: [""Apple""], 2: [""electric car""], 3: [""next decade""], 4: [""release""]}",openai/gpt-oss-120b,"will, an, within, the",Apple,electric car,next decade,release


In [None]:
extract_prediction_properties_path = "extract_prediction_properties/"
extract_prediction_properties_full_path = os.path.join(base_data_path, extract_prediction_properties_path)
DataProcessing.save_to_file(results_df, extract_prediction_properties_full_path, 'extracted_pps', 'csv')

Using file number: 2
Saving CSV file to: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/notebook_experiments/../data/extract_prediction_properties/extracted_pps-v2.csv


> `notebook_experiments/load_vector_store.ipynb` because we want to use the properties extracted to search the vector store.

or

> `notebook_experiments/entity_resolution-source_target.ipynb` because we want to see predictions by source.