# 1-Generate Predictions using LangChain

- **Goal:** Prediction Recognition

- **Purpose:** Implement step 1 of the prediction recognition pipeline, including its sub-steps:
    1. Generate predictions
        1. Create several prediction prompts templates
        2. Utilize open-source LLMs to generate predictions

- **Misc:**
    - `%store`: IPython line magic that stores a variable of interest so it can be loaded in another notebook

Entities: You already have Representative Angela, Economist Dr. Henry, Meteorologist Nina, etc. Can we have at least one example per template? Other examples are analyst, expert, top executive, senior level person, college student, research advisor, professional, military member, etc.
- Alex, a health expert
- Angela Cortez, a member of Congress
- Dr. Anna Lee --> Anna Lee, Ph.D.

Future verb tense: A more balanced distribution/usage of the below
- Update "will likely" to will (without likely), may, would, be going to, should, etc. (can google more)
- Can use rise/increase/as much as, fall/decrease/as little as, change, stay stable, high/low chance/probability/degree of, etc. (can google more)
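One way to check whether the future-tense wording is actually balanced is to count which marker each generated prediction uses. The sketch below is only an illustration: the `FUTURE_MARKERS` list and the `count_future_markers` helper are hypothetical names, not part of the notebook's modules, and the sample sentences are adapted from the examples in this notebook.

```python
import re
from collections import Counter

# Hypothetical list of future-tense markers to track; extend as needed.
# Longer phrases come first so "will likely" is not counted as plain "will".
FUTURE_MARKERS = ["will likely", "going to", "will", "may", "would", "should"]

def count_future_markers(predictions):
    """Count the first matching future-tense marker in each prediction."""
    counts = Counter()
    for text in predictions:
        for marker in FUTURE_MARKERS:
            # Matching is case-sensitive on purpose so the month "May"
            # is not mistaken for the modal verb "may".
            if re.search(rf"\b{re.escape(marker)}\b", text):
                counts[marker] += 1
                break
    return counts

sample = [
    "On April 2, 2025, Morgan Stanley forecasts that the stock value at Tesla will likely increase.",
    "According to Chase Bank, the expected returns at emerging market equities would fall in May 2025.",
    "The NIH predicts that on July 5, 2025, public engagement in preventative health screenings may rise.",
]
marker_counts = count_future_markers(sample)
print(marker_counts)
```

A heavily skewed count (for example, almost everything under "will likely") would suggest the prompt examples need more varied verbs.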

Date: We write the date differently in many cases below, and this should be represented in the dataset. Try to spread each format across other templates and other domains. For example, T2 for Health uses [Month, date, year] throughout, and T5 in Health uses [Month Year] throughout, so each format is grouped within one template. See example dates below.
- (1) Wednesday, August 21, 2024; (2) Wed, August 21, 2024; (3) 08/21/2024; (4) 08/21/2024; (5) 21/08/2024; (6) 21 August 2024; (7) 2024/08/21; (8) 2024-08-21; (9) August 21, 2024; (10) Aug 21, 2024; (11) 21 August 2024, (12) 21 Aug 2024, Q3 of 2027, 2029 of Q3, etc (with removing day of week), first week of Fall, Q2 of 2033, 2045 Q3, etc.
- Differentiate the month, date, and year. 
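The date variants listed above can be produced from a single date with `strftime`. This is a minimal sketch, assuming the default (English) locale; `DATE_FORMATS` and `quarter_label` are illustrative names and are not defined anywhere else in this notebook.

```python
from datetime import date

d = date(2024, 8, 21)

# strftime patterns covering the formats listed in the notes above.
DATE_FORMATS = [
    "%A, %B %d, %Y",  # weekday name, full month
    "%a, %B %d, %Y",  # abbreviated weekday
    "%m/%d/%Y",       # month/day/year
    "%d/%m/%Y",       # day/month/year
    "%d %B %Y",       # day, full month, year
    "%Y/%m/%d",       # year/month/day with slashes
    "%Y-%m-%d",       # ISO 8601 style
    "%B %d, %Y",      # full month without weekday
    "%b %d, %Y",      # abbreviated month
    "%d %b %Y",       # day, abbreviated month, year
]

def quarter_label(d):
    """Return a quarter-style label such as 'Q3 of 2024'."""
    quarter = (d.month - 1) // 3 + 1
    return f"Q{quarter} of {d.year}"

rendered = [d.strftime(fmt) for fmt in DATE_FORMATS] + [quarter_label(d)]
for text in rendered:
    print(text)
```

Sampling from a list like `rendered` when writing examples is one way to keep any single format from clustering inside one template or domain.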

In [1]:
import os, sys

import pandas as pd

from tqdm import tqdm
from langchain_core.prompts import PipelinePromptTemplate, PromptTemplate

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from log_files import LogData
from data_processing import DataProcessing
from text_generation_models import TextGenerationModelFactory

## Text Generation Models

In [2]:
tgmf = TextGenerationModelFactory()

llama_versatile_generation_model = tgmf.create_instance(model_name='llama-3.3-70b-versatile')
llama_instant_generation_model = tgmf.create_instance('llama-3.1-8b-instant')
llama_70b_8192_generation_model = tgmf.create_instance('llama3-70b-8192')
llama_8b_8192_generation_model = tgmf.create_instance('llama3-8b-8192')

gpt_35_turbo_generation_model = tgmf.create_instance('gpt-3.5-turbo')
gpt_4_o_generation_model = tgmf.create_instance('gpt-4o')
mixtral_87b_instruct_generation_model = tgmf.create_instance('mixtral-8x7b-instruct') 

## Base Templates for Domain Predictions

In [3]:
full_prediction_template = """{prediction_properties}

{prediction_requirements}

{prediction_templates}

{prediction_examples}
"""

full_prediction_prompt = PromptTemplate.from_template(full_prediction_template)

In [4]:
prediction_properties_template = """A prediction <p> = (<p_s>, <p_t>, <p_d>, <p_a>), where it consists of the following four properties:

    1. <p_s>, any source entity in the {prediction_domain} domain.
        - Can be a person (with a name) or a {prediction_domain} person (such as a {prediction_domain} reporter, {prediction_domain} analyst, {prediction_domain} expert, {prediction_domain} top executive, {prediction_domain} senior level person, etc.), or a civilian.
        - Can only be an organization that is associated with the {prediction_domain} prediction.
    2. <p_t>, any target entity in the {prediction_domain} domain.
        - Can be a person (with a name) or a {prediction_domain} person (such as a {prediction_domain} reporter, {prediction_domain} analyst, {prediction_domain} expert, {prediction_domain} top executive, {prediction_domain} senior level person, etc.).
        - Can only be an organization that is associated with the {prediction_domain} prediction.
    3. <p_d>, date or time range when <p> is expected to come to fruition or when one should observe the <p>.
        - Forecast can range from a second to anytime in the future.
        - Answers the questions: "How far to go out from today?" or "Where to stop?".
    4. <p_a>, {prediction_domain} prediction attribute.
        - Characteristics of domain-specific attributes, such as various quantifiable metrics relevant to the {prediction_domain} domain.
        - Some examples are {prediction_domain_attribute}.  
"""
prediction_properties_prompt = PromptTemplate.from_template(prediction_properties_template)

In [5]:
prediction_requirements_template = """{prediction_domain} requirements to use for each prediction:

    - Should be based on real-world {prediction_domain} data and not hallucinate.
    - Only a simple sentence (prediction) (and NOT compounding using "and" or "or").
    - Should diversify all four properties of the prediction (<p>) as in change and not use same for <p_s>, <p_t>, <p_d>, <p_a>.
    - Should use synonyms to predict such as forecasts, speculates, foresee, envision, etc., and not use any of them more than ten times.
    - The prediction should be unique and not repeated.
    - Do not number the predictions.
    - Do not say, "As the {prediction_domain}, I will generate company-based {prediction_domain} predictions using the provided templates." or anything similar.
    - Use the six different templates and examples provided.
    - Change how the current date (<p_d>) is written in the prediction with examples of (1) Wednesday, August 21, 2024; (2) Wed, August 21, 2024; (3) 08/21/2024; (4) 08/21/2024; (5) 21/08/2024; (6) 21 August 2024; (7) 2024/08/21; (8) 2024-08-21; (9) August 21, 2024; (10) Aug 21, 2024; (11) 21 August 2024, (12) 21 Aug 2024, Q3 of 2027, 2029 of Q3, etc (with removing day of week).
    {domain_requirements}
    - Do not say, "Here are {predictions_N} unique {prediction_domain} predictions based on the provided templates and examples:" in the prompt.
    - Do not use any of the examples in the prompt.
    - In front of every prediction, put the template number in the format of "T1:", "T2:", etc. and do not number them like "1.", "2.", etc. The template number and the generated prediction should match.
    - Do not put template number on line by itself. Always pair with a prediction.
    - Disregard brackets: "<>"
    - Should never say "Here are {predictions_N} unique {prediction_domain} predictions based on the provided templates and examples:" or "Note: I've made sure to follow the guidelines and templates provided, and generated unique predictions that meet the requirements."
    - Do not use a person name or entity name more than once, as in don't use the name Joe as both the <p_s> and <p_t>, unless like Mr. Sachs and Goldman Sachs or Mr. Sam Walton and Sam's Club, etc.
    - The source entity (<p_s>) is rarely the same as the target entity (<p_t>) and if same, the <p_s> is making a prediction on itself in the <p_t>. Thus, can be first person and use first person pronouns.
    - Should vary the slope of rise/increase/as much as, fall/decrease/as little as, change, stay stable, high/low chance/probability/degree of, etc., and can be dramatic like drastic, exponential, etc.
    - Should vary the prediction verbs such as will, would, be going to, should, has potential to, etc.
"""
prediction_requirements_prompt = PromptTemplate.from_template(prediction_requirements_template)
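Because the requirements ask the model to prefix every prediction with "T1:", "T2:", etc. and never leave a template number on a line by itself, the raw model output can be split into (template number, prediction) pairs with a small parser. The sketch below is an assumption about how one might post-process the text; `parse_labelled_predictions` is a hypothetical helper and is not part of `text_generation_models` or `data_processing`.

```python
import re

# Matches lines such as "T3: <prediction text>"; a bare "T3:" is rejected
# because the prediction part must start with a non-whitespace character.
LINE_PATTERN = re.compile(r"^\s*T(\d+):\s*(\S.*?)\s*$")

def parse_labelled_predictions(raw_output):
    """Return a list of (template_number, prediction) tuples."""
    parsed = []
    for line in raw_output.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            parsed.append((int(match.group(1)), match.group(2)))
    return parsed

raw = """Here are some predictions:
T1: Ava Lee forecasts that the revenue at ExxonMobil could decrease in Q3 of 2027.
T2:
T2: On 2025-04-02, Morgan Stanley speculates the stock price at Tesla may increase.
"""
parsed = parse_labelled_predictions(raw)
for template_number, prediction in parsed:
    print(template_number, prediction)
```

Lines that do not follow the "T<number>: prediction" shape (preambles, notes, or a template number on its own line) are skipped, which makes violations of the formatting requirements easy to spot by comparing counts.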

In [6]:
prediction_templates_template = """Here are some {prediction_domain} templates:

    - {prediction_domain} template 1: <p_s> forecasts that the <p_a> at <p_t> will potentially decrease in <p_d>.
    - {prediction_domain} template 2: On <p_d>, <p_s> speculates the <p_a> at <p_t> will likely increase.
    - {prediction_domain} template 3: <p_s> predicts on <p_d>, the <p_t> <p_a> may rise.
    - {prediction_domain} template 4: According to <p_s>, the <p_a> at <p_t> would fall in <p_d>.
    - {prediction_domain} template 5: In <p_d>, <p_s> envisions that <p_t> <p_a> has some probability to remain stable.
    - {prediction_domain} template 6: <p_t> <p_a> should stay the same in <p_d>, according to <p_s>.

"""
prediction_templates_prompt = PromptTemplate.from_template(prediction_templates_template)

In [7]:
prediction_examples_template = """Here are some examples of {prediction_domain} predictions:

{domain_examples}

With the above, generate a unique set of {predictions_N} predictions per template following the examples. Think from the perspective of a {prediction_domain} analyst, expert, top executive, or senior level person and even a college student, professional, research advisor, etc."""
prediction_examples_prompt = PromptTemplate.from_template(prediction_examples_template)

In [8]:
prediction_input_prompts = [
    ("prediction_properties", prediction_properties_prompt),
    ("prediction_requirements", prediction_requirements_prompt),
    ("prediction_templates", prediction_templates_prompt),
    ("prediction_examples", prediction_examples_prompt),
]

pipeline_prompt = PipelinePromptTemplate(
    final_prompt=full_prediction_prompt, pipeline_prompts=prediction_input_prompts
)
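Conceptually, the pipeline formats each sub-prompt with the input variables and then substitutes the rendered text into the final template. The sketch below reproduces that idea with plain `str.format`, which may be useful if `PipelinePromptTemplate` is unavailable or deprecated in your installed version. It is not LangChain's implementation; `compose_prompt` and the shortened stand-in templates are hypothetical.

```python
# Shortened stand-ins for the sub-templates defined in the cells above.
sub_templates = {
    "prediction_properties": "Properties for the {prediction_domain} domain.",
    "prediction_requirements": "{prediction_domain} requirements:\n{domain_requirements}",
}
final_template = "{prediction_properties}\n\n{prediction_requirements}\n"

def compose_prompt(final_template, sub_templates, **inputs):
    """Format each sub-template, then fill the final template with the results."""
    # str.format ignores keyword arguments a template does not use,
    # so every sub-template can receive the full set of inputs.
    rendered = {name: tpl.format(**inputs) for name, tpl in sub_templates.items()}
    return final_template.format(**rendered)

prompt = compose_prompt(
    final_template,
    sub_templates,
    prediction_domain="weather",
    domain_requirements="- Should be based on real-world weather reports.",
)
print(prompt)
```

Note that this only works as long as the substituted values contain no literal curly braces that `str.format` would try to interpret.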



## Specific Templates for Domain Predictions

- For now, generating 1 prediction per template. From here, I'll try 3 and increase by increments/multiples of 3.

- With 1 prediction per template,
    - 1 prediction per template x 6 templates per domain = 6 predictions per domain
    - 6 predictions per domain x 4 domains = 24 predictions per model
    - 24 predictions per model x 2 models = 48 predictions across all models
    - 48 predictions across all models x 2 batches = 96 across all batches
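The arithmetic above can be captured in a small helper so the expected total is easy to recompute when the number of predictions per template grows to 3, 6, and so on. `expected_prediction_count` is a hypothetical name used only for this illustration; the model and batch counts are the ones stated in the list above.

```python
def expected_prediction_count(per_template, templates, domains, models, batches):
    """Multiply out the number of predictions expected from a full run."""
    return per_template * templates * domains * models * batches

# Values from the list above: 1 per template, 6 templates, 4 domains, 2 models, 2 batches.
total = expected_prediction_count(per_template=1, templates=6, domains=4, models=2, batches=2)
print(total)

# Same setup with 3 predictions per template.
total_with_three = expected_prediction_count(per_template=3, templates=6, domains=4, models=2, batches=2)
print(total_with_three)
```

Comparing this number with the row count of the generated DataFrame is a quick check that no model silently returned fewer predictions than requested.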

In [9]:
examples_per_template = 1
generate_N_predictions_per_template = 1 * examples_per_template

### Template for Financial Predictions

In [10]:
financial_attributes = """stock price, net profit, revenue, operating cash flow, research and development expenses, operating income, gross profit."""
financial_requirements = """- Should be based on real-world financial earnings reports.
    - Suppose the time when $p$ was made is during any earning season.
    - Include stocks from all sectors such as consumer staples, energy, finance, health care, industrials, materials, media, real estate, retail, technology, utilities, defense, etc.
    - Include the US Dollar sign ($) before or USD after the amount of the financial attribute."""

financial_examples = """
   - financial examples for template 1:
      1. Detravious, an investor, forecasts that the stock price at Apple will likely decrease in 2025 Q1.
      2. Ava Lee predicts that the operating cash flow at ExxonMobil should decrease in 03/21/2025 to 08/21/2025.
      3. Joe predicts that the stocks he has will likely increase in 2024/08/21.
   - financial examples for template 2:
      1. On March 15, 2025, Goldman Sachs speculates that the interest rates at the Federal Reserve will likely increase.
      2. On April 2, 2025, Morgan Stanley forecasts that the stock value at Tesla will likely increase.
      3. On January 28, 2025, Chase analysts foresee that their stock prices will likely increase.
   - financial examples for template 3:
      1. Morgan Stanley anticipates that on May 3, 2025, the NASDAQ composite index could climb moderately.
      2. BlackRock foresees that on April 22, 2025, the value of Bitcoin has a high probability of rising sharply.
      3. Morgan Stanley predicts that on May 3, 2025, their stock price will likely rise.
   - financial examples for template 4:
      1. According to Chase Bank, the expected returns at emerging market equities will likely fall in May 2025.
      2. According to Ryan, the projected revenue at Meta Platforms will likely fall in Q2 2025.
      3. According to Apple, the trading volume it has will likely increase in Q1 2025.
   - financial examples for template 5:
      1. In April 2025, Wells Fargo expects that U.S. Treasury yields will likely stay stable.
      2. In May 2025, Bob envisions that the inflation rate in Wells Fargo will likely stay stable.
      3. In August 2025, Bob predicts that the stocks he has will likely stay stable.
   - financial examples for template 6:
      1. Apple stock price will decrease in February 2025, according to Roger.
      2. The NASDAQ index is expected to rise in June 2025, according to Bank of America.
      3. Roger foresees the stock price increasing in July 2025, according to his projections.
 """

financial_input_dict = {
    "prediction_domain": "financial",
    "prediction_domain_attribute": financial_attributes,
    "domain_requirements": financial_requirements,
    "domain_examples": financial_examples,
    "predictions_N": generate_N_predictions_per_template
}
financial_prompt_output = pipeline_prompt.format(**financial_input_dict)
print(financial_prompt_output)


A prediction <p> = (<p_s>, <p_t>, <p_d>, <p_a>), where it consists of the following four properties:

    1. <p_s>, any source entity in the financial domain.
        - Can be a person (with a name) or a financial person such as a financial reporter, financial analyst, financial expert, financial top executive, financial senior level person, etc), civilian.
        - Can only be an organization that is associated with the financial prediction.
    2. <p_t>, any target entity in the financial domain.
	    - Can be a person (with a name) or a financial person such as a financial reporter, financial analyst, financial expert, financial top executive, financial senior level person, etc).
        - Can only be an organization that is associated with the financial prediction.
    3. <p_d>, date or time range when <p> is expected to come to fruition or when one should observe the <p>.
        - Forecast can range from a second to anytime in the future.
        - Answers the questions: "How fa

###  Template for Health Predictions


Want these tied to specific campaigns, regions, or health studies? Yeah, that's a good idea. Regions could include national, state, and local. For the health study examples, make sure each one is still a prediction rather than data that has already been collected and analyzed.

- Template: According to <p_s>, the <p_a> at <p_t> would fall in <p_d>.
- Mappings: According to a <study by UF>, the <obesity rate> in <Gainesville, FL> should rapidly decrease between <9.8.2026 to 12.19.2026> (don't include in code; only showing relation to template).
- Example: According to a study by UF, the obesity rate in Gainesville would fall between 9.8.2026 to 12.19.2026.

I'm cool with some examples not exactly matching the template. The only exact matches should be the variables. Every other word can be a synonym.
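The relation between a template and its mappings can be shown by substituting each placeholder literally. This is only a sketch to illustrate the mapping above and, as the note says, it is not meant to go into the prompt code; `fill_template` is a hypothetical helper.

```python
template_4 = "According to <p_s>, the <p_a> at <p_t> would fall in <p_d>."

# Mappings taken from the UF example above.
mappings = {
    "<p_s>": "a study by UF",
    "<p_a>": "obesity rate",
    "<p_t>": "Gainesville, FL",
    "<p_d>": "9.8.2026 to 12.19.2026",
}

def fill_template(template, mappings):
    """Replace every placeholder in the template with its mapped value."""
    for placeholder, value in mappings.items():
        template = template.replace(placeholder, value)
    return template

filled = fill_template(template_4, mappings)
print(filled)
```

The literal fill keeps the template's connecting words ("at", "in"), whereas the hand-written example swaps in synonyms ("in", "between"); only the four variables need to match exactly.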

In [11]:
health_attributes = """obesity rates, prevalence of chronic illnesses, average physical activity levels, nutritional intake, etc."""
health_requirements = """- Should be based on real-world health reports.
    - Suppose the time when $p$ was made is during any season such as flu season, allergy season, pandemic, epidemic, etc.
    - Include reports from all health organizations, researchers, doctors, physical therapists, physician assistants, nurse practitioners, fitness experts, etc."""

health_examples = """
- health examples for template 1:
    1. CDC predicts that the obesity rates at the national level will likely decrease in late 2025.
    2. CDC forecasts that the prevalence of chronic illnesses at urban health centers will likely decrease in Q3 2025.
    3. Chase thinks that the average physical activity levels he does will likely increase in 2025.
- health examples for template 2:
    1. On May 15, 2025, the CDC speculates that the average physical activity levels at U.S. high schools will likely increase.
    2. On June 1, 2025, Sam speculates that the nutritional intake at rural clinics in America will likely increase.
    3. On July 1, 2025, Sam predicts that the obesity rates at the national level will likely increase.
- health examples for template 3:
    1. The NIH predicts that on July 5, 2025, public engagement in preventative health screenings will likely rise.
    2. Alex suspects that on June 15, 2025, the obesity rates at the national level will likely decrease.
    3. Talon envisions that on January 3, 2025, the average physical activity levels for him will likely rise.
- health examples for template 4:
    1. According to the CDC, the obesity rates at U.S. elementary schools will likely fall in Fall 2025.
    2. According to the NIH, the average sugar consumption at public school cafeterias will likely fall in September 2025.
    3. According to James, the average physical activity levels for him will likely fall in 2025.
- health examples for template 5:
    1. In June 2025, Dr. Maria Thompson envisions that national obesity rates will likely decrease.
    2. In August 2025, Professor James Liu envisions that average physical activity levels among teenagers will likely increase.
    3. In July 2025, Dr. Aisha Reynolds envisions that her nutritional intake will likely stay stable.
- health examples for template 6:
    1. Physical activity levels among seniors will likely rise in July 2025, according to Dr. Elena Morales.
    2. Nutritional awareness in public schools will likely rise in September 2025, according to Professor Daniel Kim.
    3. Sarah's health screening participation will likely rise in late 2025, according to Sarah.
"""

health_input_dict = {
    "prediction_domain": "health",
    "prediction_domain_attribute": health_attributes,
    "domain_requirements": health_requirements,
    "domain_examples": health_examples,
    "predictions_N": generate_N_predictions_per_template
}
health_prompt_output = pipeline_prompt.format(**health_input_dict)


###  Template for Policy Predictions

Change <p_s> to be first person in some cases

- T1. 3. Angela Cortez forecasts that data privacy laws in the technology sector will likely stay stable in the public, in late 2025, according to expert Angela Cortez.
    - I forecast that data privacy laws in the technology sector will likely stay stable in the public, in late 2025, according to expert Angela Cortez.

- T6. 3. Rachel Alvarez's voter participation in local elections will likely stay the same in November 2025, according to Rachel Alvarez.
    - My voter participation in local elections will likely stay the same in November 2025, according to Rachel Alvarez.

In [12]:
policy_attributes = """election outcomes, economic reforms, legislative impacts."""
policy_requirements = """- Should be based on real-world policy reports.
    - Suppose the time when $p$ was made is during an election cycle or non-election cycles.
    - Include policies & laws, from all sectors such as consumer staples, energy, finance, health care, industrials, materials, media, real estate, retail, technology, utilities, defense, etc."""

policy_examples = """
   - policy examples for template 1:
      1. Election outcomes in key swing states will likely rise in national importance in November 2025, according to analyst Rachel Lin.
      2. Economic reforms in the energy sector will likely increase in visibility in Q3 2025, according to Dr. Marcus Grant from the Brookings Institution.
      3. I forecast that data privacy laws in the technology sector will likely stay stable in the public, in late 2025, according to expert Angela Cortez.
   - policy examples for template 2:
      1. On April 5, 2025, the Brookings Institution speculates that voter turnout at battleground states will likely stay stable.
      2. On May 12, 2025, the International Monetary Fund speculates that investment activity at emerging markets will likely increase.
      3. On June 1, 2025, policy analyst Rachel Kim speculates that regulatory scrutiny in Rachel's large tech firms will likely decrease.
   - policy examples for template 3:
      1. Representative Angela Brooks predicts that on October 15, 2025, the defense budget allocation will likely stay stable.
      2. Economist Dr. Henry Zhao predicts that on July 1, 2025, the corporate tax rate in the finance sector will likely rise.
      3. Senator Michael Greene predicts that on November 3, 2025, his voter engagement in suburban districts will likely decrease.
   - policy examples for template 4:
      1. According to Senator Alicia Ramirez, the public trust at federal institutions will likely increase in late 2025.
      2. According to Thomas Nguyen, the investment confidence at the real estate sector will likely fall in Q3 2025.
      3. According to policy advisor Natalie Chen, the employment rate at her company will likely stay stable in September 2025.
   - policy examples for template 5:
      1. In August 2025, Senator Jordan Ellis envisions that healthcare spending will likely stay stable.
      2. In June 2025, economist Dr. Priya Nandakumar envisions that inflation rates in the consumer staples sector will likely increase.
      3. In October 2025, policy strategist Kevin Adler envisions that his defense contract approvals will likely decrease.
   - policy examples for template 6:
      1. Renewable energy investments are expected to rise in Q3 2025, according to Dr. Elena Foster.
      2. Healthcare subsidies will likely decrease in September 2025, according to Senator Marcus Lee.
      3. My voter participation in local elections will likely stay the same in November 2025, according to Rachel Alvarez.
"""

policy_input_dict = {
    "prediction_domain": "policy",
    "prediction_domain_attribute": policy_attributes,
    "domain_requirements": policy_requirements,
    "domain_examples": policy_examples,
    "predictions_N": generate_N_predictions_per_template
}
policy_prompt_output = pipeline_prompt.format(**policy_input_dict)

###  Template for Weather Predictions

I updated <p_d> for: 
- T2 1.
- T2 3.

In [13]:
weather_attributes = """temperature, precipitation, wind speed, humidity, etc."""
weather_requirements = """- Should be based on real-world weather reports.
    - Suppose the time when $p$ was made is during any season and any location (e.g., Florida known for hurricanes, California known for wildfires, etc.).
    - Include reports from all meteorologists, weather organizations, or any type of weather entity."""

weather_examples = """
    - weather examples for template 1:
        1. The National Weather Service forecasts that the precipitation levels at Miami will likely increase in September 2025.
        2. AccuWeather forecasts that the humidity at Phoenix will likely stay stable in early fall 2025.
        3. Sam forecasts that the wind speed at his house will likely decrease in November 2025.
    - weather examples for template 2:
        1. On 08/21/2024, Meteorologist Lisa Park speculates that the temperature at Los Angeles will likely increase.
        2. On June 15, 2025, Dr. Mark Williams speculates that the humidity at Houston will likely decrease.
        3. Third week in January, San Francisco's meteorological team speculates that the wind speed in San Francisco will likely stay stable.
    - weather examples for template 3:
        1. Dr. Anna Lee predicts that on May 20, 2025, the temperature at Denver will likely decrease.
        2. Meteorologist John Roberts predicts that on July 1, 2025, the wind speed at New York will likely rise.
        3. The Miami Weather Bureau predicts that on August 10, 2025, the humidity at Miami will likely stay stable.
    - weather examples for template 4:
        1. According to Dr. Linda Garcia, the temperature at Boston will likely increase in November 2025.
        2. According to Meteorologist Jake Wilson, the precipitation levels at Seattle will likely stay stable in January 2025.
        3. According to Dylan, the wind speed at Dylan's home will likely fall in October 2025.
    - weather examples for template 5:
        1. In December 2025, Meteorologist Claire Thompson envisions that the temperature at Chicago will likely increase.
        2. In May 2025, Dr. Robert Harris envisions that the humidity at Phoenix will likely stay stable.
        3. In July 2025, the San Francisco Weather Bureau envisions that the precipitation levels in San Francisco will likely decrease.
    - weather examples for template 6:
        1. Temperature in Las Vegas will likely rise in July 2025, according to Meteorologist Nina Patel.
        2. Humidity in Houston will likely decrease in August 2025, according to Dr. Kevin Morales.
        3. Wind speed in Miami will likely stay stable in October 2025, according to the Miami Weather Bureau.
"""

weather_input_dict = {
    "prediction_domain": "weather",
    "prediction_domain_attribute": weather_attributes,
    "domain_requirements": weather_requirements,
    "domain_examples": weather_examples,
    "predictions_N": generate_N_predictions_per_template
}
weather_prompt_output = pipeline_prompt.format(**weather_input_dict)


## Generate Predictions

I updated `batch_generate_predictions` to `batch_generate_data`. Same function, different name. Use the one that goes with the code you have. 

In [14]:
N_batches = 1
# text_generation_models = [llama_versatile_generation_model, llama_instant_generation_model, llama_70b_8192_generation_model, 
#                           llama_8b_8192_generation_model, gpt_35_turbo_generation_model, gpt_4_o_generation_model, 
#                           mixtral_87b_instruct_generation_model]
text_generation_models_navigator = [gpt_35_turbo_generation_model, gpt_4_o_generation_model, 
                          mixtral_87b_instruct_generation_model]
prediction_domains = ["finance", "health", "policy", "weather"]
prediction_prompt_outputs = {
    "finance": financial_prompt_output,
    "health": health_prompt_output,
    "policy": policy_prompt_output,
    "weather": weather_prompt_output,
}
prediction_label = 1

batched_predictions_df = tgmf.batch_generate_data(N_batches=N_batches, 
                                text_generation_models=text_generation_models_navigator, 
                                domains=prediction_domains,
                                prompt_outputs=prediction_prompt_outputs,
                                sentence_label=prediction_label)

  0%|          | 0/1 [00:00<?, ?it/s]


finance --- gpt-3.5-turbo --- NAVI_GATOR


BadRequestError: Error code: 400 - {'error': {'message': 'ExceededBudget: User=dj.brinkley@ufl.edu over budget. Spend=5.007422400000013, Budget=5.0', 'type': 'budget_exceeded', 'param': None, 'code': '400'}}

In [None]:
pd.set_option('display.max_colwidth', 800)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
batched_predictions_df