# 1-Generate Observations using LangChain

- **Goal:** Observation recognition

- **Purpose:** Implement step 1 of the observation recognition pipeline, including its sub-steps:
    1. Generate observations
        1. Create several observation prompt templates
        2. Use open-source LLMs to generate observations

- **Misc:**
    - `%store`: IPython line magic that stores a variable of interest so it can be loaded in another notebook

In [None]:
import os, sys

import pandas as pd

from tqdm import tqdm
from langchain_core.prompts import PipelinePromptTemplate, PromptTemplate

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from log_files import LogData
from data_processing import DataProcessing
from text_generation_models import TextGenerationModelFactory

## Text Generation Models

In [None]:
tgmf = TextGenerationModelFactory()

llama_versatile_generation_model = tgmf.create_instance(model_name='llama-3.3-70b-versatile')
llama_instant_generation_model = tgmf.create_instance('llama-3.1-8b-instant')
llama_70b_8192_generation_model = tgmf.create_instance('llama3-70b-8192')
llama_8b_8192_generation_model = tgmf.create_instance('llama3-8b-8192')

gpt_35_turbo_generation_model = tgmf.create_instance('gpt-3.5-turbo')
gpt_4_o_generation_model = tgmf.create_instance('gpt-4o')
mixtral_87b_instruct_generation_model = tgmf.create_instance('mixtral-8x7b-instruct') 
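
The implementation of `TextGenerationModelFactory` lives in `text_generation_models` and is not shown in this notebook. As a rough illustration of the pattern only, here is a minimal registry-based factory sketch. Every name in it (`EchoModel`, `SketchModelFactory`, `register`) is hypothetical, and the stand-in model just echoes its prompt instead of calling an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class EchoModel:
    """Hypothetical stand-in for a hosted LLM client; it only echoes the prompt."""
    model_name: str

    def generate(self, prompt: str) -> str:
        return f"[{self.model_name}] {prompt}"


class SketchModelFactory:
    """Hypothetical registry-based factory (not the notebook's real class)."""

    def __init__(self) -> None:
        self._builders: Dict[str, Callable[[str], EchoModel]] = {}

    def register(self, model_name: str, builder: Callable[[str], EchoModel]) -> None:
        # Map a model name to a callable that builds a client for it.
        self._builders[model_name] = builder

    def create_instance(self, model_name: str) -> EchoModel:
        # Fail loudly on unknown names instead of returning None.
        if model_name not in self._builders:
            raise ValueError(f"Unknown model name: {model_name!r}")
        return self._builders[model_name](model_name)


factory = SketchModelFactory()
for name in ("llama-3.1-8b-instant", "gpt-4o"):
    factory.register(name, EchoModel)

model = factory.create_instance("gpt-4o")
reply = model.generate("hello")
print(reply)
```

Keeping model construction behind one `create_instance` call is what lets the later cells treat every model uniformly in a single list.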

## Base Templates for Domain Observations

In [None]:
full_observation_template = """{observation_properties}

{observation_requirements}

{observation_templates}

{observation_examples}
"""

full_observation_prompt = PromptTemplate.from_template(full_observation_template)

In [None]:
observation_properties_template = """An observation <o> = (<o_s>, <o_t>, <o_d>, <o_a>) consists of the following four properties:

    1. <o_s>, any source entity in the {observation_domain} domain.
        - Can be a person (with a name), a {observation_domain} person (such as a {observation_domain} reporter, {observation_domain} analyst, {observation_domain} expert, {observation_domain} top executive, {observation_domain} senior level person, etc.), or a civilian.
        - Can be an organization only if it is associated with the {observation_domain} observation.
    2. <o_t>, any target entity in the {observation_domain} domain.
        - Can be a person (with a name) or a {observation_domain} person (such as a {observation_domain} reporter, {observation_domain} analyst, {observation_domain} expert, {observation_domain} top executive, {observation_domain} senior level person, etc.).
        - Can be an organization only if it is associated with the {observation_domain} observation.
    3. <o_d>, date or time range when <o> is expected to come to fruition or when one should observe the <o>.
        - Forecast can range from a second to anytime in the future.
        - Answers the questions: "How far to go out from today?" or "Where to stop?".
    4. <o_a>, {observation_domain} observation attribute.
        - Characteristics of domain-specific attributes, such as various quantifiable metrics relevant to the {observation_domain} domain.
        - Some examples are {observation_domain_attribute}
"""
observation_properties_prompt = PromptTemplate.from_template(observation_properties_template)

In [None]:
observation_requirements_template = """Requirements to use for each observation:

    {domain_requirements}
    - Should be based on real-world {observation_domain} data and must not be hallucinated.
    - Must be a simple sentence, NOT a compound sentence joined with "and" or "or".
    - Should diversify all four properties of the observation (<o>): vary <o_s>, <o_t>, <o_d>, and <o_a> rather than reusing the same values.
    - The observation should be unique and not repeated.
    - Do not number the observations.
    - Do not say, "Here are {observations_N} unique observations based on the provided templates and examples:" or anything similar in the output.
    - Vary how the date (<o_d>) is written in the observation, for example: (1) Wednesday, August 21, 2024; (2) Wed, August 21, 2024; (3) 08/21/2024; (4) 21/08/2024; (5) 21 August 2024; (6) 2024/08/21; (7) 2024-08-21; (8) August 21, 2024; (9) Aug 21, 2024; (10) 21 Aug 2024; (11) Q3 of 2027; (12) 2029 of Q3; etc. (the day of the week may be removed).
    - Do not use any of the examples in the prompt.
    - In front of every observation, put the template number in the format of "T0:" and only use "T0:" as the template number.
    - Do not put template number on line by itself. Always pair with an observation.
    - Disregard brackets: "<>"
    - Do not use the same person name or entity name more than once, e.g., don't use the name Joe as both the <o_s> and <o_t>, unless the names are legitimately related, such as Mr. Sachs and Goldman Sachs or Mr. Sam Walton and Sam's Club.
    - The source entity (<o_s>) is rarely the same as the target entity (<o_t>); if they are the same, the <o_s> is making an observation about itself as the <o_t>.
    - Should vary the direction and degree of change, such as rise/increase/as much as, fall/decrease/as little as, change, stay stable, high/low chance/probability/degree of, etc.
    - Should vary the observation verbs, such as observed, noticed, noted, saw, identified, remarked, etc.
    - Must be in the past tense (already occurred), not the future tense."""
observation_requirements_prompt = PromptTemplate.from_template(observation_requirements_template)
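
The requirements ask the model to vary how the date is written. When building or spot-checking examples by hand, the same variants can be produced with the standard library. This is a small sketch, not part of the generation pipeline; the names `date_variants` and `FORMAT_CODES` are illustrative, and the month and weekday names assume Python's default (English) locale.

```python
from datetime import date

# strftime codes for the date styles listed in the requirements.
FORMAT_CODES = [
    "%A, %B %d, %Y",  # Wednesday, August 21, 2024
    "%a, %B %d, %Y",  # Wed, August 21, 2024
    "%m/%d/%Y",       # 08/21/2024
    "%d/%m/%Y",       # 21/08/2024
    "%d %B %Y",       # 21 August 2024
    "%Y/%m/%d",       # 2024/08/21
    "%Y-%m-%d",       # 2024-08-21
    "%b %d, %Y",      # Aug 21, 2024
    "%d %b %Y",       # 21 Aug 2024
]


def date_variants(d: date) -> list:
    """Render one date in several written styles, plus a quarter-based form."""
    variants = [d.strftime(code) for code in FORMAT_CODES]
    quarter = (d.month - 1) // 3 + 1
    variants.append(f"Q{quarter} of {d.year}")
    return variants


for text in date_variants(date(2024, 8, 21)):
    print(text)
```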

In [None]:
observation_templates_template = """Here are some {observation_domain} templates:

    - {observation_domain} template 1: <o_s> observed that the <o_a> at <o_t> had decreased in <o_d>.

"""
observation_templates_prompt = PromptTemplate.from_template(observation_templates_template)

In [None]:
observation_examples_template = """Here are some examples of {observation_domain} observations:
{domain_examples}

With the above, generate a unique set of {observations_N} observations per template following the examples. Think from the perspective of a {observation_domain} analyst, expert, top executive, or senior level person and even a college student, professional, research advisor, etc."""
observation_examples_prompt = PromptTemplate.from_template(observation_examples_template)

In [None]:
observation_input_prompts = [
    ("observation_properties", observation_properties_prompt),
    ("observation_requirements", observation_requirements_prompt),
    ("observation_templates", observation_templates_prompt),
    ("observation_examples", observation_examples_prompt),
]

pipeline_prompt = PipelinePromptTemplate(
    final_prompt=full_observation_prompt, pipeline_prompts=observation_input_prompts
)
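
Conceptually, the pipeline formats each named sub-prompt with the variables it needs and then substitutes the rendered text into the matching slot of the final template. Below is a minimal plain-Python sketch of that idea using `str.format` instead of LangChain. The helper names `placeholders` and `compose` are illustrative, not LangChain APIs, and the templates are abbreviated stand-ins.

```python
from string import Formatter


def placeholders(template: str) -> set:
    """Return the set of {named} fields used in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def compose(final_template: str, sub_templates: dict, inputs: dict) -> str:
    """Format each sub-template with the inputs it asks for, then fill the final template."""
    rendered = {}
    for slot, template in sub_templates.items():
        # Pass only the keys this sub-template references.
        needed = {key: inputs[key] for key in placeholders(template)}
        rendered[slot] = template.format(**needed)
    return final_template.format(**rendered)


final_template = "{properties}\n\n{examples}\n"
sub_templates = {
    "properties": "An observation in the {observation_domain} domain has four properties.",
    "examples": "Number of {observation_domain} observations to generate: {observations_N}",
}
inputs = {"observation_domain": "weather", "observations_N": 1, "unused_key": "ignored"}

prompt_text = compose(final_template, sub_templates, inputs)
print(prompt_text)
```

In this sketch a key that no sub-template references (`unused_key`) is dropped silently, so it is worth confirming that every key in an input dictionary actually appears in some template.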

## Specific Templates for Domain observations

- For now, generating 1 observation per template. From here, I'll try 3 and then increase in multiples of 3.

- With 1 observation per template and the configuration used in the code cells (1 template per domain, 4 domains, 7 models, 1 batch), and assuming each model returns exactly the requested number:
    - 1 observation per template x 1 template per domain = 1 observation per domain
    - 1 observation per domain x 4 domains = 4 observations per model
    - 4 observations per model x 7 models = 28 observations across all models
    - 28 observations across all models x 1 batch = 28 observations across all batches
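
The same multiplication can be written as a small function so the total is easy to recompute whenever the configuration changes. The call below uses the values set in this notebook's code cells (one template, four domains, seven entries in `text_generation_models`, `N_batches = 1`). It assumes every model call returns exactly the requested number of observations, which an LLM may not honor; the function name is illustrative.

```python
def expected_observation_count(per_template: int, templates_per_domain: int,
                               n_domains: int, n_models: int, n_batches: int) -> int:
    """Multiply out the expected number of generated observations."""
    per_domain = per_template * templates_per_domain
    per_model = per_domain * n_domains
    across_models = per_model * n_models
    return across_models * n_batches


total = expected_observation_count(
    per_template=1,          # generate_N_observations_per_template
    templates_per_domain=1,  # only "template 1" is defined
    n_domains=4,             # finance, health, policy, weather
    n_models=7,              # entries in text_generation_models
    n_batches=1,             # N_batches
)
print(total)
```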

In [None]:
examples_per_template = 1
generate_N_observations_per_template = 1 * examples_per_template

### Template for Financial observations

In [None]:
financial_attributes = """stock price, net profit, revenue, operating cash flow, research and development expenses, operating income, gross profit."""
financial_requirements = """- Should be based on real-world financial earnings reports.
    - Suppose the time when <o> was made is during any earning season.
    - Include stocks from all sectors such as consumer staples, energy, finance, health care, industrials, materials, media, real estate, retail, technology, utilities, defense, etc.
    - Include the US Dollar sign ($) before or USD after the amount of the financial attribute."""

financial_examples = """
    - financial examples for template 1:
        1. Joseph, the young entrepreneur, observed that the revenue at FUBU (his parents' clothing line) had increased for Q3 2028.
        2. BJ saw that the operating cash flow at UF's school of Engineering had decreased in 05/2025.
        3. A new investor noticed that the ETFs in his portfolio grew exponentially from Apr 7, 1997 to Apr 7, 2025.
"""

financial_input_dict = {
    "observation_domain": "financial",
    "observation_domain_attribute": financial_attributes,
    "domain_requirements": financial_requirements,
    "domain_examples": financial_examples,
    "observations_N": generate_N_observations_per_template
}
financial_prompt_output = pipeline_prompt.format(**financial_input_dict)
print(financial_prompt_output)


### Template for Health observations

In [None]:
health_attributes = """obesity rates, prevalence of chronic illnesses, average physical activity levels, nutritional intake, etc."""
health_requirements = """- Should be based on real-world health reports.
    - Suppose the time when <o> was made is during any season such as flu season, allergy season, pandemic, epidemic, etc.
    - Include reports from all health organizations, researchers, doctors, physical therapists, physician assistants, nurse practitioners, fitness experts, etc."""

health_examples = """
- health examples for template 1:
    1. WellFlorida caught that the patients' blood glucose at all hospitals in Florida improved from Q1 2021 to Q3 2021.
    2. Nurse John observed that the heart rate in patients at Alaska's General Hospital had stabilized from 2023 January to 2023 Dec.
    3. I noted that the number of visits from my patients in Piscataway, NJ had decreased from start of week to end of week.
"""

health_input_dict = {
    "observation_domain": "health",
    "observation_domain_attribute": health_attributes,
    "domain_requirements": health_requirements,
    "domain_examples": health_examples,
    "observations_N": generate_N_observations_per_template
}

health_prompt_output = pipeline_prompt.format(**health_input_dict)


### Template for Policy observations

In [None]:
policy_attributes = """election outcomes, economic reforms, legislative impacts."""
policy_requirements = """- Should be based on real-world policy reports.
    - Suppose the time when <o> was made is during an election cycle or non-election cycles.
    - Include policies & laws, from all sectors such as consumer staples, energy, finance, health care, industrials, materials, media, real estate, retail, technology, utilities, defense, etc."""

policy_examples = """
   - policy examples for template 1:
      1. Local journalist, Aaron, identified that economic reforms in Thomson, GA rose in Jan 2033.
      2. Policy analyst, Michael (Ph.D), remarked that the home tax in Austin, TX had increased on 7/9/28.
      3. Policy maker Sarah noted that company employment rates in her city San Francisco had risen from Q1 2025 to Q3 2025.
"""

policy_input_dict = {
    "observation_domain": "policy",
    "observation_domain_attribute": policy_attributes,
    "domain_requirements": policy_requirements,
    "domain_examples": policy_examples,
    "observations_N": generate_N_observations_per_template
}

### Template for Weather observations

In [None]:
weather_attributes = """temperature, precipitation, wind speed, humidity, etc."""
weather_requirements = """- Should be based on real-world weather reports.
    - Suppose the time when <o> was made is during any season and at any location (e.g., Florida is known for hurricanes, California is known for wildfires, etc.).
    - Include reports from all meteorologists, weather organizations, or any type of weather entity."""

weather_examples = """
    - weather examples for template 1:
        1. The street cleaner watched the snow in Minnesota increase from 12/8/9 to 2/8/10.
        2. Jade, a farmer, caught that the rainfall in Kansas had decreased at midnight.
        3. I felt the wind speed in North Dakota (city of Fargo) picked up drastically today.

"""

weather_input_dict = {
    "observation_domain": "weather",
    "observation_domain_attribute": weather_attributes,
    "domain_requirements": weather_requirements,
    "domain_examples": weather_examples,
    "observations_N": generate_N_observations_per_template
}
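
All four domain dictionaries are expected to supply the same keys, and every key should correspond to a placeholder in some template. A quick audit can flag keys that are missing or never referenced. The sketch below is plain Python; `audit_inputs` is an illustrative helper, and the templates and dictionaries are abbreviated toy stand-ins rather than the notebook's real ones.

```python
from string import Formatter


def placeholders(template: str) -> set:
    """Return the set of {named} fields used in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def audit_inputs(templates: list, input_dicts: dict) -> dict:
    """Per domain, list keys the templates need but the dict lacks, and keys no template uses."""
    needed = set().union(*(placeholders(t) for t in templates))
    report = {}
    for domain, inputs in input_dicts.items():
        supplied = set(inputs)
        report[domain] = {
            "missing": sorted(needed - supplied),
            "unused": sorted(supplied - needed),
        }
    return report


# Toy templates and inputs, for illustration only.
toy_templates = [
    "Domain: {observation_domain}. Attributes: {observation_domain_attribute}",
    "{domain_examples}\nGenerate {observations_N} observations.",
]
toy_inputs = {
    "weather": {
        "observation_domain": "weather",
        "observation_domain_attribute": "temperature, humidity",
        "domain_requirements": "- Should be based on real-world weather reports.",
        "domain_examples": "- example",
        "observations_N": 1,
    },
    "policy": {
        "observation_domain": "policy",
        "observation_domain_attribute": "election outcomes",
        "domain_examples": "- example",
    },
}

report = audit_inputs(toy_templates, toy_inputs)
for domain, result in report.items():
    print(domain, result)
```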

## Generate observations

In [None]:
N_batches = 1
# text_generation_models = [gpt_4_o_generation_model]
# text_generation_models = [llama_versatile_generation_model]
text_generation_models = [llama_versatile_generation_model, llama_instant_generation_model, llama_70b_8192_generation_model, 
                          llama_8b_8192_generation_model, gpt_35_turbo_generation_model, gpt_4_o_generation_model, 
                          mixtral_87b_instruct_generation_model]

# text_generation_models = [gpt_4_o_generation_model]

# text_generation_models = [llama_versatile_generation_model, llama_instant_generation_model, llama_70b_8192_generation_model, 
#                           llama_8b_8192_generation_model, gpt_35_turbo_generation_model]

In [None]:
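observation_domains = ["finance", "health", "policy", "weather"]
# Each domain gets its own formatted prompt. Policy and weather are formatted here
# because no earlier cell builds prompt outputs for them.
observation_prompt_outputs = {
    "finance": financial_prompt_output,
    "health": health_prompt_output,
    "policy": pipeline_prompt.format(**policy_input_dict),
    "weather": pipeline_prompt.format(**weather_input_dict),
}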
observation_label = 1

batched_observations_df = tgmf.batch_generate_data(N_batches=N_batches, 
                                text_generation_models=text_generation_models, 
                                domains=observation_domains,
                                prompt_outputs=observation_prompt_outputs,
                                sentence_label=observation_label)
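
`batch_generate_data` is defined in `text_generation_models` and its internals are not shown here. As a rough mental model only, a batch run can be pictured as a nested loop over batches, models, and domains, keeping the lines that carry the `T0:` prefix requested in the prompt. Everything in this sketch (`parse_observations`, `batch_generate`, `fake_model`, the row keys) is hypothetical, and the model is a stub that returns fixed text.

```python
from typing import Callable, Dict, List


def parse_observations(raw_output: str, prefix: str = "T0:") -> List[str]:
    """Keep lines that start with the template prefix and strip the prefix."""
    observations = []
    for line in raw_output.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            text = line[len(prefix):].strip()
            if text:  # skip a prefix left on a line by itself
                observations.append(text)
    return observations


def batch_generate(n_batches: int, models: Dict[str, Callable[[str], str]],
                   domains: List[str], prompt_outputs: Dict[str, str],
                   sentence_label: int) -> List[dict]:
    """Call every model on every domain prompt, once per batch, and collect rows."""
    rows = []
    for batch_idx in range(n_batches):
        for model_name, generate in models.items():
            for domain in domains:
                raw = generate(prompt_outputs[domain])
                for sentence in parse_observations(raw):
                    rows.append({
                        "batch": batch_idx,
                        "model": model_name,
                        "domain": domain,
                        "sentence": sentence,
                        "label": sentence_label,
                    })
    return rows


def fake_model(prompt: str) -> str:
    """Stub model: returns one valid line, one bare prefix, and some chatter."""
    return "Here you go:\nT0: Jade observed that rainfall in Kansas had decreased in 2023.\nT0:\n"


rows = batch_generate(
    n_batches=2,
    models={"fake-model": fake_model},
    domains=["weather", "health"],
    prompt_outputs={"weather": "weather prompt", "health": "health prompt"},
    sentence_label=1,
)
print(len(rows))
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)` to get one row per generated observation.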

In [None]:
# Show long generated sentences and every row/column when displaying the DataFrame
pd.set_option('display.max_colwidth', 800)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
batched_observations_df