# Generate Observations from Predictions using LangChain Templates

- **Goal:** Prediction Similarity

- **Purpose:** Implement step 2 (generate observations) of the prediction similarity pipeline. The pipeline steps are:
    1. Generate predictions
        1. Create several prediction prompts templates
        2. Utilize open-source LLMs to generate predictions
    2. Generate observations    

- **Misc:**
    - `%store`: IPython line magic that stores a variable so it can be loaded in another notebook.
    - See `read_log_file.ipynb`

In [1]:
import os, sys

import pandas as pd

from tqdm import tqdm
from langchain_core.prompts import PipelinePromptTemplate, PromptTemplate

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from log_files import LogData
from data_processing import DataProcessing
from text_generation_models import TextGenerationModelFactory

In [2]:
tgmf = TextGenerationModelFactory()
print(tgmf)
llama_versatile_generation_model = tgmf.create_instance(model_name='llama-3.3-70b-versatile')
llama_instant_generation_model = tgmf.create_instance('llama-3.1-8b-instant')
llama_70b_8192_generation_model = tgmf.create_instance('llama3-70b-8192')
llama_8b_8192_generation_model = tgmf.create_instance('llama3-8b-8192')

gpt_35_turbo_generation_model = tgmf.create_instance('gpt-3.5-turbo')
gpt_4_o_generation_model = tgmf.create_instance('gpt-4o')
mixtral_87b_instruct_generation_model = tgmf.create_instance('mixtral-8x7b-instruct') 

<text_generation_models.TextGenerationModelFactory object at 0x36215db20>


## LangChain Templates for Observations (Non-Predictions) in Any Domain

In [3]:
observation_template = """{observation_properties}

{observation_requirements}

Given the above and a prediction: {prediction}, generate {observation_N} observations
"""
observation_prompt = PromptTemplate.from_template(observation_template)

In [4]:
observation_properties_template = """An observation <o> = (<o_s>, <o_t>, <o_d>, <o_a>), where it consists of the following four properties:

    1. <o_s>, any source entity in the {observation_domain} domain.
        - Can be a person (with a name) or a {observation_domain} person such as a {observation_domain} reporter, {observation_domain} analyst, {observation_domain} expert, {observation_domain} top executive, {observation_domain} senior level person, civilian, etc.
        - Can only be an organization that is associated with the {observation_domain} observation.
    2. <o_t>, any target entity in the {observation_domain} domain.
        - Can be a person (with a name) or a {observation_domain} person such as a {observation_domain} reporter, {observation_domain} analyst, {observation_domain} expert, {observation_domain} top executive, {observation_domain} senior level person, etc.
        - Can only be an organization that is associated with the {observation_domain} observation.
    3. <o_d>, date or time range when <p> is expected to come to fruition or when one should observe the <p>.
        - Forecast can range from a second to anytime in the future.
        - Answers the questions: "How far to go out from today?" or "Where to stop?".
    4. <o_a>, {observation_domain} observation attribute.
        - Characteristics of domain-specific attributes such as various quantifiable metrics relevant to the {observation_domain} domain.
        - Some examples are {observation_attributes}.
"""
observation_properties_prompt = PromptTemplate.from_template(observation_properties_template)
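
The four-part observation tuple described in the template can be modelled as a small record type when parsing or validating model output. The sketch below is illustrative only: the `Observation` class and its field names are hypothetical and are not part of the notebook's modules.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Observation:
    """Hypothetical container for the four observation properties."""
    source: str     # <o_s>: the entity making the observation
    target: str     # <o_t>: the entity the observation is about
    date: str       # <o_d>: date or time range, kept as free text
    attribute: str  # <o_a>: domain attribute, e.g. "stock price"

    def as_tuple(self) -> Tuple[str, str, str, str]:
        """Return the properties in the template's order."""
        return (self.source, self.target, self.date, self.attribute)


example = Observation(
    source="Detravious, an investor",
    target="Apple",
    date="2025 Q1",
    attribute="stock price",
)
print(example.as_tuple())
```

Keeping the date as free text is a deliberate simplification here, since the requirements ask the model to vary date formats.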

In [5]:
observation_requirements = """ requirements to use for each observation:

    - Should be based on real-world {observation_domain} data and not hallucinate.
    - Must be a simple sentence (observation) (and NOT compounding using "and" or "or").
    - Should diversify all four properties of the observation (<o>) as in change and not use same for <o_s>, <o_t>, <o_d>, <o_a>.
    - The observation should be unique and not repeated.
    - Do not number the observations.
    - Do not say, "Here are {observation_N} unique observations based on the provided templates and examples:" or anything similar in the prompt.
    - Change how the current date (<o_d>) is written in the observation, with examples of (1) Wednesday, August 21, 2024; (2) Wed, August 21, 2024; (3) 08/21/2024; (4) 08/21/2024; (5) 21/08/2024; (6) 21 August 2024; (7) 2024/08/21; (8) 2024-08-21; (9) August 21, 2024; (10) Aug 21, 2024; (11) 21 August 2024, (12) 21 Aug 2024, Q3 of 2027, 2029 of Q3, etc (with removing day of week).
    - Do not use any of the examples in the prompt.
    - In front of every observation, put the template number in the format of "T0:" and only use "T0:" as the template number.
    - Do not put the template number on a line by itself. Always pair it with an observation.
    - Disregard brackets: "<>"
    - Do not use a person name or entity name more than once, as in don't use the name Joe as both the <o_s> and <o_t>, unless like Mr. Sachs and Goldman Sachs or Mr. Sam Walton and Sam's Club, etc.
    - The source entity (<o_s>) is rarely the same as the target entity (<o_t>) and if same, the <o_s> is making an observation on itself in the <o_t>.
    - Should vary the slope of rise/increase/as much as, fall/decrease/as little as, change, stay stable, high/low chance/probability/degree of, etc.
    - Should vary the observation verbs such as will, would, be going to, should, etc.
    - Must be past tense as in already occurred and not future tense."""
observation_requirements_prompt = PromptTemplate.from_template(observation_requirements)
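
Some of the requirements above are mechanical enough to check after generation. The sketch below covers only a few of them (the `T0:` prefix, no template number on a line by itself, no leftover angle brackets, no numbering). `check_observation` is a hypothetical helper, not part of the notebook's modules, and requirements such as real-world grounding, tense, and diversity would need more than string checks.

```python
import re
from typing import List


def check_observation(line: str) -> List[str]:
    """Return the names of the simple format rules that `line` violates."""
    problems = []
    text = line.strip()
    if not text.startswith("T0:"):
        problems.append("missing T0: prefix")
    if text == "T0:":
        problems.append("template number on a line by itself")
    if "<" in text or ">" in text:
        problems.append("contains angle brackets")
    # Matches leading list numbering such as "1. " or "2) ".
    if re.match(r"^\d+[.)]\s", text):
        problems.append("numbered")
    return problems


samples = [
    "T0: Detravious, an investor, observed that the stock price at Apple decreased in Q1 of 2025.",
    "Here is an observation based on the provided template and examples:",
]
for sample in samples:
    print(check_observation(sample))
```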

In [6]:
prediction_sentence = "Detravious, an investor forecasts that the stock price at Apple will likely decrease in 2025 Q1."
prediction_sentence_prompt = PromptTemplate.from_template(prediction_sentence)

In [7]:
observation_input_prompts = [
    ("observation_properties", observation_properties_prompt),
    ("observation_requirements", observation_requirements_prompt),
    ("prediction", prediction_sentence_prompt)
]

observation_pipeline_prompt = PipelinePromptTemplate(
    final_prompt=observation_prompt, pipeline_prompts=observation_input_prompts
)
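
Conceptually, the pipeline formats each named sub-prompt with the input variables and then substitutes the results into the final template. The sketch below mimics that two-stage substitution with plain `str.format` and shortened stand-in templates. It is a simplified illustration, not LangChain's implementation, and `compose` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def compose(final_template: str,
            pipeline: List[Tuple[str, str]],
            inputs: Dict[str, object]) -> str:
    """Format each named sub-template, then fill the final template."""
    values = dict(inputs)
    for name, template in pipeline:
        # str.format ignores unused keyword arguments, so each sub-template
        # simply picks out the variables it references.
        values[name] = template.format(**values)
    return final_template.format(**values)


final_template = (
    "{observation_properties}\n\n"
    "{observation_requirements}\n\n"
    "Given the above and a prediction: {prediction}, "
    "generate {observation_N} observations\n"
)
# Shortened stand-ins for the notebook's full sub-templates.
pipeline = [
    ("observation_properties",
     "An observation has four properties in the {observation_domain} domain."),
    ("observation_requirements",
     "Should be based on real-world {observation_domain} data."),
    ("prediction",
     "Detravious, an investor forecasts that the stock price at Apple "
     "will likely decrease in 2025 Q1."),
]
inputs = {"observation_domain": "finance", "observation_N": 1}

prompt = compose(final_template, pipeline, inputs)
print(prompt)
```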

In [8]:
observation_N = 1

financial_attributes = """stock price, net profit, revenue, operating cash flow, research and development expenses, operating income, gross profit."""
# health_attributes = """obesity rates, prevalence of chronic illnesses, average physical activity levels, nutritional intake, etc."""
# policy_attributes = """election outcomes, economic reforms, legislative impacts."""
# weather_attributes = """temperature, precipitation, wind speed, humidity, etc."""

# observation_attributes = f"{financial_attributes} + {health_attributes} + {policy_attributes} + {weather_attributes}"

observation_attributes = f"{financial_attributes}"

observation_input_dict = {
    "observation_domain": "finance",
    "observation_attributes": observation_attributes,
    "observation_N": observation_N
}

observation_prompt_output = observation_pipeline_prompt.format(**observation_input_dict)
print(observation_prompt_output)

An observation <o> = (<o_s>, <o_t>, <o_d>, <o_a>), where it consists of the following four properties:

    1. <o_s>, any source entity in the finance domain.
        - Can be a person (with a name) or a finance person such as a finance reporter, finance analyst, finance expert, finance top executive, finance senior level person, civilian, etc.
        - Can only be an organization that is associated with the finance observation.
    2. <o_t>, any target entity in the finance domain.
        - Can be a person (with a name) or a finance person such as a finance reporter, finance analyst, finance expert, finance top executive, finance senior level person, etc.
        - Can only be an organization that is associated with the finance observation.
    3. <o_d>, date or time range when <p> is expected to come to fruition or when one should observe the <p>.
        - Forecast can range from a second to anytime in the future.
        - Answers the questions: "How far to go out from today?" or "W

## Batch Generation Data

In [None]:
tgmf = TextGenerationModelFactory()

N_batches = 1

# text_generation_models = [llama_versatile_generation_model, llama_instant_generation_model, llama_70b_8192_generation_model, 
#                           llama_8b_8192_generation_model, gpt_35_turbo_generation_model, gpt_4_o_generation_model, 
#                           mixtral_87b_instruct_generation_model]

# llama_instant_generation_model is generating too many

text_generation_models = [llama_versatile_generation_model, llama_70b_8192_generation_model, 
                          llama_8b_8192_generation_model, gpt_35_turbo_generation_model, gpt_4_o_generation_model, 
                          mixtral_87b_instruct_generation_model]

# text_generation_models = [gpt_4_o_generation_model, 
#                           mixtral_87b_instruct_generation_model]

In [10]:
observation_domains = ["mixed"]
observation_prompt_outputs = {
    "mixed": observation_prompt_output,
}
observation_label = 0

batched_observations_df = tgmf.batch_generate_predictions(N_batches=N_batches,
                                text_generation_models=text_generation_models,
                                domains=observation_prompt_outputs,
                                prompt_outputs=observation_prompt_outputs,
                                sentence_label=observation_label)


  0%|          | 0/1 [00:00<?, ?it/s]

mixed --- llama-3.3-70b-versatile --- GROQ_CLOUD
mixed --- llama3-70b-8192 --- GROQ_CLOUD
mixed --- llama3-8b-8192 --- GROQ_CLOUD
mixed --- gpt-3.5-turbo --- NAVI_GATOR
mixed --- gpt-4-turbo --- NAVI_GATOR
mixed --- mixtral-8x7b-instruct --- NAVI_GATOR


100%|██████████| 1/1 [00:03<00:00,  3.91s/it]


Start logging batch
log_directory: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/data/prediction_logs
Save CSV: /Users/detraviousjamaribrinkley/Documents/Development/research_labs/uf_ds/predictions/data/prediction_logs/batch_22-predictions/batch_22-from_df.csv

CSV to Log
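
The output above suggests the batch call produces one table per model, with one row per line of the model's response. The loop below sketches that shape with stub models in place of real API clients. `batch_generate`, the stub callables, and the row layout are assumptions for illustration and do not reproduce `TextGenerationModelFactory.batch_generate_predictions`.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def batch_generate(n_batches: int,
                   models: Dict[str, Callable[[str], str]],
                   prompts: Dict[str, str],
                   label: int) -> List[List[Row]]:
    """Call every model on every domain prompt and collect rows per call."""
    tables = []
    for _ in range(n_batches):
        for domain, prompt in prompts.items():
            for model_name, generate in models.items():
                text = generate(prompt)
                # One row per non-empty line of the model's response.
                rows = [
                    {"Base Sentence": line.strip(), "Sentence Label": label,
                     "Domain": domain, "Model Name": model_name}
                    for line in text.splitlines() if line.strip()
                ]
                tables.append(rows)
    return tables


# Stub "models" that return canned text instead of calling an API.
stub_models = {
    "stub-a": lambda prompt: "T0: first observation.",
    "stub-b": lambda prompt: "Here is one observation:\nT0: second observation.",
}
tables = batch_generate(1, stub_models, {"mixed": "prompt text"}, label=0)
print([len(rows) for rows in tables])
```

Splitting on lines is what lets preamble text from a model end up as its own row, which is why the results need filtering afterwards.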





In [11]:
pd.set_option('max_colwidth', 800)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
batched_observations_df

[                                                                                             Base Sentence  \
 0  T0: Detravious, an investor, forecasted that the stock price at Apple would likely decrease in 2025 Q1.   
 
    Sentence Label Domain               Model Name    API Name  
 0               0  mixed  llama-3.3-70b-versatile  GROQ_CLOUD  ,
                                                                            Base Sentence  \
 0                    Here is an observation based on the provided template and examples:   
 1  T0: Detravious, an investor, observed that the revenue at Tesla decreased in 2022 Q3.   
 
    Sentence Label Domain       Model Name    API Name  
 0               0  mixed  llama3-70b-8192  GROQ_CLOUD  
 1               0  mixed  llama3-70b-8192  GROQ_CLOUD  ,
                                                                                                                                                                                                

In [12]:
# prediction_sentence ="Detravious, an investor forecasts that the stock price at Apple will likely decrease in 2025 Q1."


In [13]:
observations_df = DataProcessing.concat_dfs(batched_observations_df)
observations_df

|  | Base Sentence | Sentence Label | Domain | Model Name | API Name |
|---|---|---|---|---|---|
| 0 | T0: Detravious, an investor, forecasted that the stock price at Apple would likely decrease in 2025 Q1. | 0 | mixed | llama-3.3-70b-versatile | GROQ_CLOUD |
| 1 | Here is an observation based on the provided template and examples: | 0 | mixed | llama3-70b-8192 | GROQ_CLOUD |
| 2 | T0: Detravious, an investor, observed that the revenue at Tesla decreased in 2022 Q3. | 0 | mixed | llama3-70b-8192 | GROQ_CLOUD |
| 3 | Here is one observation based on the provided templates and examples: | 0 | mixed | llama3-8b-8192 | GROQ_CLOUD |
| 4 | T0: The finance expert, Warren Buffett, observes that the net profit of Coca-Cola will rise as much as 15% by August 2027. | 0 | mixed | llama3-8b-8192 | GROQ_CLOUD |
| 5 | Note: I've followed the requirements to generate this observation. The source entity is Warren Buffett, a finance expert, and the target entity is Coca-Cola, an organization. The date is August 2027, and the observation attribute is net profit, which is expected to rise by 15%. The observation is unique and not repeated, and it's in the past tense. | 0 | mixed | llama3-8b-8192 | GROQ_CLOUD |
| 6 | T0: A finance expert, Detravious, predicted that the stock price at Apple decreased in the first quarter of 2025. | 0 | mixed | gpt-3.5-turbo | NAVI_GATOR |
| 7 | T0: Detravious, an investor, observed that the stock price at Apple decreased in Q1 of 2025. | 0 | mixed | gpt-4o | NAVI_GATOR |
| 8 | T0: Observation = (Apple, Detravious, investor, Q1 2025, stock price will likely decrease) | 0 | mixed | mixtral-8x7b-instruct | NAVI_GATOR |
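
Rows 1, 3, and 5 above are model commentary rather than observations. One way to drop them is to keep only sentences carrying the required `T0:` prefix. The sketch below uses plain Python on a few rows copied from the results; `keep_observations` is a hypothetical helper. On the DataFrame itself, the analogous filter would be `observations_df[observations_df["Base Sentence"].str.startswith("T0:")]`.

```python
from typing import List


def keep_observations(sentences: List[str], prefix: str = "T0:") -> List[str]:
    """Keep prefixed sentences and strip the prefix from each."""
    kept = []
    for sentence in sentences:
        text = sentence.strip()
        if text.startswith(prefix):
            kept.append(text[len(prefix):].strip())
    return kept


rows = [
    "Here is an observation based on the provided template and examples:",
    "T0: Detravious, an investor, observed that the revenue at Tesla decreased in 2022 Q3.",
    "T0: Detravious, an investor, observed that the stock price at Apple decreased in Q1 of 2025.",
]
print(keep_observations(rows))
```

A prefix filter alone would not catch row 8, which carries the prefix but is a tuple rather than a sentence, so further checks may still be needed.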
