# 1-Generate Predictions using LangChain

- **Goal:** Prediction Recognition

- **Purpose:** Implement step 1 of the prediction recognition pipeline, including its sub-steps:
    1. Generate predictions
        1. Create several prediction prompt templates
        2. Use open-source LLMs to generate predictions

- **Misc:**
    - `%store`: IPython line magic that stores a variable of interest so it can be loaded in another notebook (with `%store -r`)

In [1]:
# !pip3 install pandas langchain spacy numpy Groq

In [2]:
!pip3 install -U scikit-learn pandas tqdm langchain-core spacy groq python-dotenv

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.


In [3]:
# !pip3 install python-dotenv

In [4]:
import os
import sys

import pandas as pd

from tqdm import tqdm
from langchain_core.prompts import PipelinePromptTemplate, PromptTemplate

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from log_files import DataFrameLogger
from data_processing import DataProcessing
from text_generation_models import (
    TextGenerationModelFactory,
    LlamaVersatileTextGenerationModel,
    LlamaInstantTextGenerationModel,
    Llama70B8192TextGenerationModel,
    Llama8B8192TextGenerationModel,
    MixtralTextGenerationModel,
)



In [5]:
# pd.set_option('max_colwidth', 800)

llama_versatile_generation_model = LlamaVersatileTextGenerationModel()
llama_instant_generation_model = LlamaInstantTextGenerationModel()
llama_70b_8192_generation_model = Llama70B8192TextGenerationModel()
llama_8b_8192_generation_model = Llama8B8192TextGenerationModel()
mixtral_generation_model = MixtralTextGenerationModel()

## LangChain Templates for Domain Predictions

In [6]:
full_prediction_template = """{prediction_properties}

{prediction_requirements}

{prediction_templates}

{prediction_examples}
"""

full_prediction_prompt = PromptTemplate.from_template(full_prediction_template)

Google predictive spelling/autocomplete 

In [7]:
prediction_properties_template = """A prediction ($p$) = ($p_s$, $p_t$, $p_d$, $p_a$), where it consists of the following four properties:

    1. $p_s$, any source entity in the {prediction_domain} domain.
        - Can be a person (with a name) or a {prediction_domain} person such as a {prediction_domain} reporter, {prediction_domain} analyst, {prediction_domain} expert, {prediction_domain} top executive, {prediction_domain} senior level person, etc.
        - Can only be an organization that is associated with the {prediction_domain} prediction.
    2. $p_t$, any target entity in the {prediction_domain} domain.
        - Can be a person (with a name) or a {prediction_domain} person such as a {prediction_domain} reporter, {prediction_domain} analyst, {prediction_domain} expert, {prediction_domain} top executive, {prediction_domain} senior level person, etc.
        - Can only be an organization that is associated with the {prediction_domain} prediction.
    3. $p_d$, date range when $p$ is expected to come to fruition.
        - Forecast can range from a second to anytime in the future.
        - Answers the questions: "How far to go out from today?" or "Where to stop?".
    4. $p_a$, {prediction_domain} prediction attribute.
        - Characteristics of domain-specific attributes, such as various quantifiable metrics relevant to the {prediction_domain} domain.
        - Some examples are {prediction_domain_attribute}.
"""
prediction_properties_prompt = PromptTemplate.from_template(prediction_properties_template)
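The four-property definition above can also be mirrored in code, which may help when validating or parsing generated sentences later. The following is a minimal sketch only: the `Prediction` dataclass and its field names are hypothetical and not part of this project's modules, and the sentence pattern is hard-coded for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """Hypothetical container for p = (p_s, p_t, p_d, p_a)."""
    source: str      # p_s: entity making the prediction
    target: str      # p_t: entity the prediction is about
    date_range: str  # p_d: when the prediction should come to fruition
    attribute: str   # p_a: domain-specific metric

    def to_sentence(self) -> str:
        # Render the four slots in brackets; the verb phrase is fixed for brevity.
        return (f"[{self.source}] forecasts that the [{self.attribute}] "
                f"at [{self.target}] will likely decrease in [{self.date_range}].")


example = Prediction(
    source="Ava Lee",
    target="ExxonMobil",
    date_range="03/21/2025 to 08/21/2025",
    attribute="operating cash flow",
)
print(example.to_sentence())
```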

    - Keep the brackets around the prediction properties when generating predictions and be sure to include brackets around dates such as "2024-10-15", "2024/08/20", "Q4 of 2024", "2025", "2027 Q1", "Q3 2027", "On 21 Aug 2024".

In [8]:
prediction_requirements_template = """{prediction_domain} requirements to use for each prediction:

    - Should be based on real-world {prediction_domain} data and not hallucinate.
    - Each prediction should be a single simple sentence (NOT a compound sentence joined with "and" or "or").
    - Should diversify all nine properties of the prediction ($p$), i.e., vary (p_p, p_o, p_t, p_f, p_a, p_s, p_m, p_v, p_l) rather than reuse the same values.
    - Should use synonyms of $p_w$ such as forecasts, speculates, foresees, envisions, etc., and not use any of them more than ten times.
    - The prediction should be unique and not repeated.
    - The forecast time ($p_f$) should always be after current time ($p_t$) of when forecast ($p$) was made.
    - Do not number the predictions.
    - Do not say, "As the {prediction_domain} at organization ($p_o$), I will generate company-based {prediction_domain} predictions using the provided templates." or anything similar.
    - Should have a forecast time ($p_f$) when $p$ is expected to come to fruition between 2025 to 2050.
    - Use the five different templates and examples provided.
    - Change how the current time ($p_t$) and forecast time ($p_f$) are written in the prediction with examples of (1) Wednesday, August 21, 2024; (2) Wed, August 21, 2024; (3) 08/21/2024; (4) 08/21/2024; (5) 21/08/2024; (6) 21 August 2024; (7) 2024/08/21; (8) 2024-08-21; (9) August 21, 2024; (10) Aug 21, 2024; (11) 21 August 2024, (12) 21 Aug 2024, Q3 of 2027, 2029 of Q3, etc (with removing day of week).
    {domain_requirements}
    - Stop saying, "Here are {predictions_N} unique {prediction_domain} predictions based on the provided templates and examples:" in the prompt.
    - Do not use any of the examples in the prompt.
    - In front of every prediction, put the template number in the format of "T1:", "T2:", etc. and do not number them like "1.", "2.", etc.
    - Disregard brackets: "[]"
    - Should never say "Here are {predictions_N} unique {prediction_domain} predictions based on the provided templates and examples:" 
    - Be sure to space words when generating the prediction metric ($p_m$) like "from _ to _".
"""
prediction_requirements_prompt = PromptTemplate.from_template(prediction_requirements_template)
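Several of these requirements (template tags such as "T1:", no list-style numbering, no repeated predictions) are mechanical enough to check on the model output. A minimal sketch under that assumption follows; `check_predictions` is a hypothetical helper, and the sample sentences are made up for illustration.

```python
import re

# Matches the "T1:", "T2:", ... prefix requested in the requirements.
TEMPLATE_TAG = re.compile(r"^T\d+:\s")
# Matches list-style numbering such as "1." or "2)" that the prompt forbids.
LIST_NUMBER = re.compile(r"^\d+[.)]\s")


def check_predictions(lines):
    """Return (line, problem) pairs; an empty list means every check passed."""
    problems = []
    seen = set()
    for line in lines:
        text = line.strip()
        if LIST_NUMBER.match(text):
            problems.append((text, "numbered like a list"))
        elif not TEMPLATE_TAG.match(text):
            problems.append((text, "missing template tag"))
        if text in seen:
            problems.append((text, "duplicate"))
        seen.add(text)
    return problems


# Made-up sample output: the second line violates the numbering rule.
sample = [
    "T1: On 2024-08-21, Sarah Kim forecasts that the revenue at Acme will likely rise in 2028.",
    "1. On 2024-08-21, Sarah Kim forecasts that the revenue at Acme will likely rise in 2028.",
]
print(check_predictions(sample))
```

Semantic requirements, such as the forecast time falling after the current time, would need date parsing and are outside this sketch.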

In [9]:
prediction_templates_template = """Here are some {prediction_domain} templates:

- {prediction_domain} template 1: On [ $p_t$ ], [ $p_p$ ] [ $p_w$ ] that the [ $p_a$ ] at [ $p_o$ ] [ $p_v$ ] [ $p_s$ ] by [ $p_m$ ] in [ $p_f$ ].

"""
prediction_templates_prompt = PromptTemplate.from_template(prediction_templates_template)

In [10]:
prediction_examples_template = """Here are some examples of {prediction_domain} predictions:

{domain_examples}

With the above, generate a unique set of {predictions_N} predictions. Think from the perspective of a {prediction_domain} analyst, expert, top executive, or senior level person."""
prediction_examples_prompt = PromptTemplate.from_template(prediction_examples_template)

In [11]:
prediction_input_prompts = [
    ("prediction_properties", prediction_properties_prompt),
    ("prediction_requirements", prediction_requirements_prompt),
    ("prediction_templates", prediction_templates_prompt),
    ("prediction_examples", prediction_examples_prompt),
]

pipeline_prompt = PipelinePromptTemplate(
    final_prompt=full_prediction_prompt, pipeline_prompts=prediction_input_prompts
)
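Conceptually, the pipeline formats each named sub-prompt with the shared inputs and then substitutes the rendered text into the final template. The sketch below reproduces that idea with plain `str.format`. It is an illustration only, not LangChain's implementation; the `compose` helper and the shortened templates are hypothetical.

```python
def compose(final_template, named_templates, **inputs):
    """Format each sub-template, then fill the final template with the results."""
    rendered = {}
    for name, template in named_templates:
        # str.format ignores unused keyword arguments, so every sub-template
        # can be given the full set of inputs.
        rendered[name] = template.format(**inputs)
    return final_template.format(**rendered)


final = "{properties}\n\n{examples}"
parts = [
    ("properties", "A prediction in the {prediction_domain} domain."),
    ("examples", "Generate {predictions_N} {prediction_domain} predictions."),
]
prompt = compose(final, parts, prediction_domain="financial", predictions_N=10)
print(prompt)
```

Seen this way, each input such as `prediction_domain` only needs to be supplied once, even though several sub-prompts use it.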


## Generate Domain Predictions

In [12]:
predictions_N = 10

### Generate Financial Predictions

In [13]:
financial_attributes = """stock price, net profit, revenue, operating cash flow, research and development expenses, operating income, gross profit"""
financial_requirements = """- Should be based on real-world financial earnings reports.
    - Suppose the time when $p$ was made is during any earning season.
    - Include stocks from all sectors such as consumer staples, energy, finance, health care, industrials, materials, media, real estate, retail, technology, utilities, defense, etc.
    - Include the US Dollar sign ($) before or USD after the amount of the financial attribute."""

financial_examples = """
- financial examples for template 1:
    - financial template 1: [ $p_s$ ] forecasts that the [ $p_a$ ] at [ $p_t$ ] will likely decrease in [ $p_d$ ].

    1. [Detravious, an investor] forecasts that the [stock price] at [Apple] will likely decrease in [2025 Q1 to 2025 Q3].
    2. [Ava Lee] predicts that the [operating cash flow] at [ExxonMobil] should decrease in [03/21/2025 to 08/21/2025].
    
 """

In [14]:
financial_input_dict = {
    "prediction_domain": "financial",
    "prediction_domain_attribute": financial_attributes,
    "domain_requirements": financial_requirements,
    "domain_examples": financial_examples,
    "predictions_N": predictions_N
}
financial_prompt_output = pipeline_prompt.format(**financial_input_dict)
print(financial_prompt_output)
# prompt_template = "Your prompt here"
# label = 1  # or "0" for non-prediction
# domain = "finance" 


# pd.set_option('max_colwidth', 800)

# llama_versatile_generation_model = LlamaVersatileTextGenerationModel()
# llama_instant_generation_model = LlamaInstantTextGenerationModel()
# llama_70b_8192_generation_model = Llama70B8192TextGenerationModel()
# llama_8b_8192_generation_model = Llama8B8192TextGenerationModel()
# mixtral_generation_model = MixtralTextGenerationModel()


# versatile_financial_df = llama_versatile_generation_model.generate_predictions(financial_prompt_output, label, domain)
# instant_financial_df = llama_instant_generation_model.generate_predictions(financial_prompt_output, label, domain)
# seventy_financial_df = llama_70b_8192_generation_model.generate_predictions(financial_prompt_output, label, domain)
# eight_financial_df = llama_8b_8192_generation_model.generate_predictions(financial_prompt_output, label, domain)
# mixtral_financial_df = mixtral_generation_model.generate_predictions(financial_prompt_output, label, domain)

# financial_df = [versatile_financial_df, instant_financial_df, seventy_financial_df, eight_financial_df, mixtral_financial_df]
# DataProcessing.concat_dfs(financial_df)

A prediction ($p$) = ($p_s$, $p_t$, $p_d$, $p_a$), where it consists of the following four properties:

    1. $p_s$, any source entity in the financial domain.
        - Can be a person (with a name) or a financial person such as a financial reporter, financial analyst, financial expert, financial top executive, financial senior level person, etc.
        - Can only be an organization that is associated with the financial prediction.
    2. $p_t$, any target entity in the financial domain.
        - Can be a person (with a name) or a financial person such as a financial reporter, financial analyst, financial expert, financial top executive, financial senior level person, etc.
        - Can only be an organization that is associated with the financial prediction.
    3. $p_d$, date range when $p$ is expected to come to fruition.
        - Forecast can range from a second to anytime in the future.
        - Answers the questions: "How far to go out from today?" or "Where to stop?".
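Values passed into a template are inserted verbatim, so a brace placeholder that appears inside a value is not substituted and ends up in the final prompt as literal text. A small check on the formatted prompt can catch this before it is sent to a model. This is a minimal sketch; `find_unfilled_placeholders` is a hypothetical helper, and the short template stands in for the full prompt.

```python
import re

# A brace-delimited identifier such as {prediction_domain}.
PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def find_unfilled_placeholders(prompt_text):
    """Return the sorted, de-duplicated placeholder names still present in the text."""
    return sorted(set(PLACEHOLDER.findall(prompt_text)))


template = "Examples for the {prediction_domain} domain:\n{domain_examples}"
# The value below itself contains a placeholder, which format() does not expand.
filled = template.format(
    prediction_domain="financial",
    domain_examples="- {prediction_domain} template 1: ...",
)
print(find_unfilled_placeholders(filled))
```

Math-style symbols such as `$p_s$` are not reported because they contain no braces.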
   

In [15]:
tgmf = TextGenerationModelFactory()

N_batches = 1
text_generation_models = [llama_instant_generation_model]
# text_generation_models = [llama_versatile_generation_model, llama_instant_generation_model, llama_70b_8192_generation_model, llama_8b_8192_generation_model, mixtral_generation_model]
# text_generation_models = [llama_instant_generation_model, llama_8b_8192_generation_model]

# text_generation_models = [llama_versatile_generation_model, llama_70b_8192_generation_model, mixtral_generation_model]

In [17]:
prediction_domains = ["finance"]
prediction_prompt_outputs = {
    "finance": financial_prompt_output,
}
prediction_label = 1

batched_predictions_df = tgmf.batch_generate_predictions(N_batches=N_batches, 
                                text_generation_models=text_generation_models, 
                                domains=prediction_domains,
                                prompt_outputs=prediction_prompt_outputs,
                                sentence_label=prediction_label)

  0%|          | 0/1 [00:00<?, ?it/s]

finance --- <text_generation_models.LlamaInstantTextGenerationModel object at 0x1051c1610>


100%|██████████| 1/1 [00:01<00:00,  1.03s/it]
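The implementation of `batch_generate_predictions` lives in `text_generation_models` and is not shown in this notebook. Judging only from its arguments and the printed `domain --- model` line, it plausibly loops over batches, domains, and models. The sketch below is a hypothetical reconstruction of that loop, using a stub model and plain dictionaries instead of DataFrames; `StubModel` and `batch_generate` are illustrative names, not the project's API.

```python
class StubModel:
    """Stand-in for a text generation model; returns canned sentences."""
    model_name = "stub-model"

    def generate(self, prompt):
        return ["T1: first canned sentence.", "T1: second canned sentence."]


def batch_generate(n_batches, models, domains, prompt_outputs, sentence_label):
    """Collect one record per generated sentence for every batch, domain, and model."""
    records = []
    for batch_index in range(n_batches):
        for domain in domains:
            for model in models:
                for sentence in model.generate(prompt_outputs[domain]):
                    records.append({
                        "Base Sentence": sentence,
                        "Sentence Label": sentence_label,
                        "Model Name": model.model_name,
                        "Domain": domain,
                        "Batch Index": batch_index,
                    })
    return records


rows = batch_generate(1, [StubModel()], ["finance"], {"finance": "prompt text"}, 1)
print(len(rows), rows[0]["Domain"])
```

The record keys mirror the column names that appear in the output of this notebook.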







In [18]:
batched_predictions_df

[                                       Base Sentence  Sentence Label  \
 0  T1: On Wednesday, August 21, 2024, Detravious,...               1   
 1  T1: On Wed, August 21, 2024, Ava Lee predicts ...               1   
 2  T1: On 08/21/2024, Ethan Kim forecasts that th...               1   
 3  T1: On 08/21/2024, Rachel Chen speculates that...               1   
 4  T1: On 21/08/2024, David Lee envisions that th...               1   
 5  T1: On 21 August 2024, Emily Wong foresees tha...               1   
 6  T1: On 2024/08/21, James Chen predicts that th...               1   
 7  T1: On 2024-08-21, Sarah Kim forecasts that th...               1   
 8  T1: On August 21, 2024, Kevin Brown speculates...               1   
 9  T1: On Aug 21, 2024, Olivia Lee envisions that...               1   
 
              Model Name   Domain  Batch Index  
 0  llama-3.1-8b-instant  finance            0  
 1  llama-3.1-8b-instant  finance            0  
 2  llama-3.1-8b-instant  finance            0  

### Generate Health Predictions

In [19]:
pd.set_option('max_colwidth', 800)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
batched_predictions_df

[                                                                                                                                                      Base Sentence  \
 0         T1: On Wednesday, August 21, 2024, Detravious, an investor, forecasts that the stock price at Apple will likely decrease from $150 to $120 in Q3 of 2027.   
 1           T1: On Wed, August 21, 2024, Ava Lee predicts that the operating cash flow at ExxonMobil should decrease from $20 billion to $15 billion in 2029 of Q3.   
 2                                 T1: On 08/21/2024, Ethan Kim forecasts that the revenue at Amazon will likely increase from $500 billion to $600 billion in 2028.   
 3                    T1: On 08/21/2024, Rachel Chen speculates that the net profit at Microsoft will likely decrease from $20 billion to $15 billion in Q2 of 2026.   
 4                          T1: On 21/08/2024, David Lee envisions that the gross profit at Alphabet will likely increase from $100 billion to $120 billion in 2

In [20]:
predictions_df = DataProcessing.concat_dfs(batched_predictions_df)
predictions_df

| | Base Sentence | Sentence Label | Model Name | Domain | Batch Index |
|---|---|---|---|---|---|
| 0 | T1: On Wednesday, August 21, 2024, Detravious, an investor, forecasts that the stock price at Apple will likely decrease from $150 to $120 in Q3 of 2027. | 1 | llama-3.1-8b-instant | finance | 0 |
| 1 | T1: On Wed, August 21, 2024, Ava Lee predicts that the operating cash flow at ExxonMobil should decrease from $20 billion to $15 billion in 2029 of Q3. | 1 | llama-3.1-8b-instant | finance | 0 |
| 2 | T1: On 08/21/2024, Ethan Kim forecasts that the revenue at Amazon will likely increase from $500 billion to $600 billion in 2028. | 1 | llama-3.1-8b-instant | finance | 0 |
| 3 | T1: On 08/21/2024, Rachel Chen speculates that the net profit at Microsoft will likely decrease from $20 billion to $15 billion in Q2 of 2026. | 1 | llama-3.1-8b-instant | finance | 0 |
| 4 | T1: On 21/08/2024, David Lee envisions that the gross profit at Alphabet will likely increase from $100 billion to $120 billion in 2027. | 1 | llama-3.1-8b-instant | finance | 0 |
| 5 | T1: On 21 August 2024, Emily Wong foresees that the research and development expenses at Tesla will likely decrease from $5 billion to $3 billion in Q4 of 2025. | 1 | llama-3.1-8b-instant | finance | 0 |
| 6 | T1: On 2024/08/21, James Chen predicts that the operating income at Johnson & Johnson will likely increase from $20 billion to $25 billion in 2028. | 1 | llama-3.1-8b-instant | finance | 0 |
| 7 | T1: On 2024-08-21, Sarah Kim forecasts that the cash flow from operations at Procter & Gamble will likely decrease from $10 billion to $8 billion in Q1 of 2026. | 1 | llama-3.1-8b-instant | finance | 0 |
| 8 | T1: On August 21, 2024, Kevin Brown speculates that the earnings per share at Coca-Cola will likely increase from $2.50 to $3.00 in 2029. | 1 | llama-3.1-8b-instant | finance | 0 |
| 9 | T1: On Aug 21, 2024, Olivia Lee envisions that the return on equity at 3M will likely decrease from 20% to 15% in Q3 of 2027. | 1 | llama-3.1-8b-instant | finance | 0 |
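Each generated sentence carries the "T1:" template tag requested in the prompt. If later steps need the bare sentence, the tag can be split off with a regular expression. A minimal sketch, assuming the tag always has the form `T<number>:` at the start of the string; `split_template_tag` is a hypothetical helper.

```python
import re

# "T1: sentence" -> ("T1", "sentence"); untagged text gets a None tag.
TAGGED = re.compile(r"^(T\d+):\s*(.*)$", re.DOTALL)


def split_template_tag(base_sentence):
    """Separate the leading template tag from the sentence text."""
    text = base_sentence.strip()
    match = TAGGED.match(text)
    if match is None:
        return None, text
    return match.group(1), match.group(2)


tag, sentence = split_template_tag(
    "T1: On 2024/08/21, James Chen predicts that the operating income at "
    "Johnson & Johnson will likely increase from $20 billion to $25 billion in 2028."
)
print(tag)
print(sentence)
```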


In [None]:
# logger = DataFrameLogger()
# logger.log_df(predictions_df)

In [None]:
# logged_data = logger.load_log()
# logged_data

In [None]:
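# Persist the generated predictions so another notebook can load them with `%store -r`.
# `updated_predictions_df` and `updated_non_predictions_df` are not defined in this
# notebook; only `predictions_df` is created above, so that is what is stored here.
%store predictions_df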