# 1-Generate Observations using LangChain Templates

- **Goal:** Prediction Similarity

- **Purpose:** To implement step 1 with sub steps of prediction similarity pipeline. See steps
    1. Generate predictions
        1. Create several prediction prompts templates
        2. Utilize open-source LLMs to generate predictions
    2. Generate observations    

- **Misc:**
    - `%store`: Cell magic will store the variable of interest so we can load in another notebook

In [1]:
import os, sys

import pandas as pd

from tqdm import tqdm
from langchain_core.prompts import PipelinePromptTemplate, PromptTemplate

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from log_files import LogData
from data_processing import DataProcessing
from text_generation_models import TextGenerationModelFactory

In [2]:
tgmf = TextGenerationModelFactory()
print(tgmf)
llama_versatile_generation_model = tgmf.create_instance(model_name='llama-3.3-70b-versatile')
llama_instant_generation_model = tgmf.create_instance('llama-3.1-8b-instant')
llama_70b_8192_generation_model = tgmf.create_instance('llama3-70b-8192')
llama_8b_8192_generation_model = tgmf.create_instance('llama3-8b-8192')

gpt_35_turbo_generation_model = tgmf.create_instance('gpt-3.5-turbo')
gpt_4_o_generation_model = tgmf.create_instance('gpt-4o')
mixtral_87b_instruct_generation_model = tgmf.create_instance('mixtral-8x7b-instruct') 

<text_generation_models.TextGenerationModelFactory object at 0x11dd5c210>


## LangChain Templates for Any Domain Non-Predictions

In [3]:
observation_template = """{observation_properties}

{observation_requirements}
"""
observation_prompt = PromptTemplate.from_template(observation_template)

In [4]:
observation_properties_template = """An observation <o> = (<o_s>, <p_t>, <o_d>, <o_a>), where it consists of the following four properties:

    1. <o_s>, any source entity in the {observation_domain} domain.
        - Can be a person (with a name) or a {observation_domain} person such as a {observation_domain} reporter, {observation_domain} analyst, {observation_domain} expert, {observation_domain} top executive, {observation_domain} senior level person, etc), civilian.
        - Can only be an organization that is associated with the {observation_domain} obervation.
    2. <o_t>, any target entity in the {observation_domain} domain.
	    - Can be a person (with a name) or a {observation_domain} person such as a {observation_domain} reporter, {observation_domain} analyst, {observation_domain} expert, {observation_domain} top executive, {observation_domain} senior level person, etc).
        - Can only be an organization that is associated with the {observation_domain} obervation.
    3. <o_d>, date or time range when <p> is expected to come to fruition or when one should observe the <p>.
        - Forecast can range from a second to anytime in the future.
        - Answers the questions: "How far to go out from today?" or "Where to stop?".
    4. <o_a>, {observation_domain} obervation attribute.
        - Characteristics of a domain-specific attributes such as various quantifiable metrics relevant to the {observation_domain} domain.
        - Some examples are {observation_attributes}.
"""
observation_properties_prompt = PromptTemplate.from_template(observation_properties_template)

In [5]:
observation_requirements = """ requirements to use for each observation:

    - Should be based on real-world {observation_domain} data and not hallucinate.
    - Must be a simple sentence (observation) (and NOT compounding using "and" or "or").
    - Should diversify all four properties of the observation (<o>) as in change and not use same for <p_s>, <p_t>, <p_d>, <p_a>.
    - The observation should be unique and not repeated.
    - Do not number the observations.
    - Do not say, "Here are {observation_N} unique observation based on the provided templates and examples:" or anything similar in the prompt.
    - Change how the current date (<p_d>) written in the observation with examples of (1) Wednesday, August 21, 2024; (2) Wed, August 21, 2024; (3) 08/21/2024; (4) 08/21/2024; (5) 21/08/2024; (6) 21 August 2024; (7) 2024/08/21; (8) 2024-08-21; (9) August 21, 2024; (10) Aug 21, 2024; (11) 21 August 2024, (12) 21 Aug 2024, Q3 of 2027, 2029 of Q3, etc (with removing day of week).
    - Do not use any of the examples in the prompt.
    - In front of every observation, put the template number in the format of "T0:" and only use "T0:" as the template number.
    - Do not put template number on line by itself. Always pair with an observation.
    - Disregard brackets: "<>"
    - Do not use person name of entity name more than once as in don't use name Joe as both the <p_s> and <p_t>, unless like Mr. Sach and Goldman Sach or Mr. Sam Walton and Sam's Club, etc.
    - The source entity (<p_s>) is rarely the same as the target entity (<p_t>) and if same, the <p_s> is making a observation on itself in the <p_t>.
    - Should variate the slope of rise/increase/as much as, fall/decrease/as little as, change, stay stable, high/low chance/probability/degree of, etc.
    - Should variate the observation verbs such as will, would, be going to, should, etc.
    - Must be past tense as in already occurred and not future tense."""
observation_requirements_prompt = PromptTemplate.from_template(observation_requirements)

In [6]:
observation_input_prompts = [
    ("observation_properties", observation_properties_prompt),
    ("observation_requirements", observation_requirements_prompt),
]

observation_pipeline_prompt = PipelinePromptTemplate(
    final_prompt=observation_prompt, pipeline_prompts=observation_input_prompts
)

  observation_pipeline_prompt = PipelinePromptTemplate(


In [7]:
observation_N = 1

financial_attributes = """stock price, net profit, revenue, operating cash flow, research and development expenses, operating income, gross profit."""
health_attributes = """obesity rates, prevalence of chronic illnesses, average physical activity levels, nutritional intake, etc."""
policy_attributes = """election outcomes, economic reforms, legislative impacts."""
weather_attributes = """temperature, precipitation, wind speed, humidity, etc."""

observation_attributes = f"{financial_attributes} + {health_attributes} + {policy_attributes} + {weather_attributes}"

observation_input_dict = {
    "observation_domain": "finance, health, policy, weather, sports",
    "observation_attributes": observation_attributes,
    "observation_N": observation_N
}

observation_prompt_output = observation_pipeline_prompt.format(**observation_input_dict)
print(observation_prompt_output)

An observation <o> = (<o_s>, <p_t>, <o_d>, <o_a>), where it consists of the following four properties:

    1. <o_s>, any source entity in the finance, health, policy, weather, sports domain.
        - Can be a person (with a name) or a finance, health, policy, weather, sports person such as a finance, health, policy, weather, sports reporter, finance, health, policy, weather, sports analyst, finance, health, policy, weather, sports expert, finance, health, policy, weather, sports top executive, finance, health, policy, weather, sports senior level person, etc), civilian.
        - Can only be an organization that is associated with the finance, health, policy, weather, sports obervation.
    2. <o_t>, any target entity in the finance, health, policy, weather, sports domain.
	    - Can be a person (with a name) or a finance, health, policy, weather, sports person such as a finance, health, policy, weather, sports reporter, finance, health, policy, weather, sports analyst, finance, heal

## Batch Generation Data

In [None]:
tgmf = TextGenerationModelFactory()

N_batches = 2

# text_generation_models = [llama_versatile_generation_model, llama_instant_generation_model, llama_70b_8192_generation_model, 
#                           llama_8b_8192_generation_model, gpt_35_turbo_generation_model, gpt_4_o_generation_model, 
#                           mixtral_87b_instruct_generation_model]

text_generation_models = [gpt_4_o_generation_model, 
                          mixtral_87b_instruct_generation_model]

In [9]:
observation_domains = ["mixed"]
observation_prompt_outputs = {
    "mixed": observation_prompt_output,
}
non_prediction_label = 0

batched_non_predictions_df = tgmf.batch_generate_data(N_batches=N_batches,
                                text_generation_models=text_generation_models,
                                domains=observation_prompt_outputs,
                                prompt_outputs=observation_prompt_outputs,
                                sentence_label=non_prediction_label)


  0%|          | 0/1 [00:00<?, ?it/s]

mixed --- gpt-4-turbo --- NAVI_GATOR
mixed --- mixtral-8x7b-instruct --- NAVI_GATOR


100%|██████████| 1/1 [00:15<00:00, 15.85s/it]


Start logging batch
log_directory: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/data/observations_logs
Save CSV: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/data/observations_logs/batch_1-observationss/batch_1-from_df.csv

CSV to Log





In [10]:
pd.set_option('max_colwidth', 800)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
batched_non_predictions_df

[                                                                                                                                                    Base Sentence  \
 0                                     T0: According to health expert Dr. Smith, the obesity rates in the United States had increased significantly by 2023-10-15.   
 1                                                       T0: Finance analyst Jane Doe reported that Apple's stock price had risen to $150 per share by 2024/07/01.   
 2                                               T0: Weather reporter John Brown noted that the precipitation levels in Seattle were unusually high on 2024-03-15.   
 3                                     T0: Policy analyst Sarah Green observed that the economic reforms in Germany had led to a 2% increase in GDP by Q4 of 2025.   
 4                   T0: Sports expert Mike Johnson stated that the average physical activity levels among teenagers in the UK had decreased by 10% by 2025-06-30.   
 5  

In [11]:
non_predictions_df = DataProcessing.concat_dfs(batched_non_predictions_df)
non_predictions_df

Unnamed: 0,Base Sentence,Sentence Label,Domain,Model Name,API Name
0,"T0: According to health expert Dr. Smith, the obesity rates in the United States had increased significantly by 2023-10-15.",0,mixed,gpt-4o,NAVI_GATOR
1,T0: Finance analyst Jane Doe reported that Apple's stock price had risen to $150 per share by 2024/07/01.,0,mixed,gpt-4o,NAVI_GATOR
2,T0: Weather reporter John Brown noted that the precipitation levels in Seattle were unusually high on 2024-03-15.,0,mixed,gpt-4o,NAVI_GATOR
3,T0: Policy analyst Sarah Green observed that the economic reforms in Germany had led to a 2% increase in GDP by Q4 of 2025.,0,mixed,gpt-4o,NAVI_GATOR
4,T0: Sports expert Mike Johnson stated that the average physical activity levels among teenagers in the UK had decreased by 10% by 2025-06-30.,0,mixed,gpt-4o,NAVI_GATOR
5,T0: Health organization WHO reported that the prevalence of chronic illnesses in Asia had risen by 5% by 2023/12/31.,0,mixed,gpt-4o,NAVI_GATOR
6,T0: Finance top executive Mr. Lee from Goldman Sachs noted that the operating income of the company had increased by 15% by Q2 of 2024.,0,mixed,gpt-4o,NAVI_GATOR
7,T0: Weather analyst Emma White observed that the wind speed in Chicago had reached 50 mph on 2024-11-05.,0,mixed,gpt-4o,NAVI_GATOR
8,T0: Policy expert Dr. Thompson reported that the legislative impacts of the new tax law in Canada had resulted in a 3% decrease in unemployment by 2025-09-01.,0,mixed,gpt-4o,NAVI_GATOR
9,T0: Sports analyst David Black stated that the net profit of Manchester United had doubled by Q1 of 2026.,0,mixed,gpt-4o,NAVI_GATOR
