# 1-Generate Predictions using LangChain

- **Goal:** Prediction Recognition

- **Purpose:** Implement step 1 of the prediction recognition pipeline, including its sub-steps:
    1. Generate predictions
        1. Create several prediction prompt templates
        2. Use open-source LLMs to generate predictions

- **Misc:**
    - `%store`: IPython line magic that stores a variable so it can be loaded in another notebook

In [1]:
# !pip3 install pandas langchain spacy numpy Groq

In [2]:
# !pip3 install -U scikit-learn pandas tqdm langchain-core spacy groq python-dotenv

In [3]:
# !pip3 install python-dotenv

In [4]:
import os, sys

import pandas as pd

from tqdm import tqdm
from langchain_core.prompts import PipelinePromptTemplate, PromptTemplate

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from log_files import LogData
from data_processing import DataProcessing
from text_generation_models import (
    TextGenerationModelFactory,
    LlamaVersatileTextGenerationModel,
    LlamaInstantTextGenerationModel,
    Llama70B8192TextGenerationModel,
    Llama8B8192TextGenerationModel,
    MixtralTextGenerationModel,
)



In [5]:
# pd.set_option('max_colwidth', 800)

llama_versatile_generation_model = LlamaVersatileTextGenerationModel()
llama_instant_generation_model = LlamaInstantTextGenerationModel()
llama_70b_8192_generation_model = Llama70B8192TextGenerationModel()
llama_8b_8192_generation_model = Llama8B8192TextGenerationModel()
mixtral_generation_model = MixtralTextGenerationModel()

## LangChain Templates for Domain Predictions

In [6]:
full_prediction_template = """{prediction_properties}

{prediction_requirements}

{prediction_templates}

{prediction_examples}
"""

full_prediction_prompt = PromptTemplate.from_template(full_prediction_template)

Google predictive spelling/autocomplete 

In [7]:
prediction_properties_template = """A prediction ($p$) = ($p_s$, $p_t$, $p_d$, $p_a$), where it consists of the following four properties:

    1. $p_s$, any source entity in the {prediction_domain} domain.
        - Can be a person (with a name) or a {prediction_domain} person such as a {prediction_domain} reporter, {prediction_domain} analyst, {prediction_domain} expert, {prediction_domain} top executive, {prediction_domain} senior level person, etc).
        - Can only be an organization that is associated with the {prediction_domain} prediction.
    2. $p_t$, any target entity in the {prediction_domain} domain.
	      - Can be a person (with a name) or a {prediction_domain} person such as a {prediction_domain} reporter, {prediction_domain} analyst, {prediction_domain} expert, {prediction_domain} top executive, {prediction_domain} senior level person, etc).
        - Can only be an organization that is associated with the {prediction_domain} prediction.
    3. $p_d$, date range when $p$ is expected to come to fruition.
        - Forecast can range from a second to anytime in the future.
        - Answers the questions: "How far to go out from today?" or "Where to stop?".
    4. $p_a$, {prediction_domain} prediction attribute.
        - Characteristics of domain-specific attributes, such as various quantifiable metrics relevant to the {prediction_domain} domain.
        - Some examples are {prediction_domain_attribute}.
"""
prediction_properties_prompt = PromptTemplate.from_template(prediction_properties_template)
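
The four-property definition above maps naturally onto a small record type. The sketch below is illustrative only: the `Prediction` class and its field names are assumptions introduced here, not part of the notebook's own modules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """Hypothetical container for p = (p_s, p_t, p_d, p_a)."""
    source: str      # p_s: entity making the prediction
    target: str      # p_t: entity the prediction is about
    date_range: str  # p_d: when the prediction should come to fruition
    attribute: str   # p_a: domain-specific metric being predicted

    def as_tuple(self) -> tuple[str, str, str, str]:
        """Return the properties in the order used by the definition."""
        return (self.source, self.target, self.date_range, self.attribute)


example = Prediction(
    source="Ava Lee",
    target="ExxonMobil",
    date_range="03/21/2025 to 08/21/2025",
    attribute="operating cash flow",
)
print(example.as_tuple())
```

Keeping the record frozen makes each prediction hashable, which can be convenient when checking for duplicates.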

    - Keep the brackets around the prediction properties when generating predictions and be sure to include brackets around dates such as "2024-10-15", "2024/08/20", "Q4 of 2024", "2025", "2027 Q1", "Q3 2027", "On 21 Aug 2024".

In [8]:
prediction_requirements_template = """{prediction_domain} requirements to use for each prediction:

    - Should be based on real-world {prediction_domain} data and not hallucinate.
    - Only a simple sentence (prediction) (and NOT compounding using "and" or "or").
    - Should diversify all four properties of the prediction ($p$) as in change and not use same for $p_s$, $p_t$, $p_d$, $p_a$.
    - Should use synonyms to predict such as forecasts, speculates, foresee, envision, etc., and not use any of them more than ten times.
    - The prediction should be unique and not repeated.
    - Do not number the predictions.
    - Do not say, "As the {prediction_domain}, I will generate company-based {prediction_domain} predictions using the provided templates." or anything similar.
    - Use the five different templates and examples provided.
    - Change how the current date ($p_d$) is written in the prediction, with examples such as (1) Wednesday, August 21, 2024; (2) Wed, August 21, 2024; (3) 08/21/2024; (4) 08/21/2024; (5) 21/08/2024; (6) 21 August 2024; (7) 2024/08/21; (8) 2024-08-21; (9) August 21, 2024; (10) Aug 21, 2024; (11) 21 August 2024, (12) 21 Aug 2024, Q3 of 2027, 2029 of Q3, etc. (with removing day of week).
    {domain_requirements}
    - Stop saying, "Here are {predictions_N} unique {prediction_domain} predictions based on the provided templates and examples:" in the prompt.
    - Do not use any of the examples in the prompt.
    - In front of every prediction, put the template number in the format of "T1:", "T2:", etc. and do not number them like "1.", "2.", etc.
    - Disregard brackets: "[]"
    - Should never say "Here are {predictions_N} unique {prediction_domain} predictions based on the provided templates and examples:"
    - Do not use a person name or entity name more than once, as in don't use the name Joe as both the $p_s$ and $p_t$, unless like Mr. Sachs and Goldman Sachs or Mr. Sam Walton and Sam's Club, etc.
    - Should vary the slope of rise/increase/as much as, fall/decrease/as little as, change, stay stable, high/low chance/probability/degree of, etc.
    - Should vary the prediction verbs such as will, would, be going to, should, etc.
"""
prediction_requirements_prompt = PromptTemplate.from_template(prediction_requirements_template)
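
The date-format requirement lists many renderings of a single date. As a rough illustration, most of them can be produced with `datetime.strftime`. The `DATE_FORMATS` list and both helper functions are assumptions made for this sketch; the notebook itself leaves the date variation to the model.

```python
from datetime import date

# Hypothetical subset of the renderings listed in the requirements.
DATE_FORMATS = [
    "%A, %B %d, %Y",  # Wednesday, August 21, 2024
    "%a, %B %d, %Y",  # Wed, August 21, 2024
    "%m/%d/%Y",       # 08/21/2024
    "%d/%m/%Y",       # 21/08/2024
    "%d %B %Y",       # 21 August 2024
    "%Y/%m/%d",       # 2024/08/21
    "%Y-%m-%d",       # 2024-08-21
    "%B %d, %Y",      # August 21, 2024
    "%b %d, %Y",      # Aug 21, 2024
    "%d %b %Y",       # 21 Aug 2024
]


def render_date_variants(d: date) -> list[str]:
    """Render one date in every format from DATE_FORMATS."""
    return [d.strftime(fmt) for fmt in DATE_FORMATS]


def quarter_label(d: date) -> str:
    """Render the quarter form, e.g. 'Q3 of 2024'."""
    return f"Q{(d.month - 1) // 3 + 1} of {d.year}"


variants = render_date_variants(date(2024, 8, 21))
for text in variants:
    print(text)
print(quarter_label(date(2024, 8, 21)))
```

Day and month names from `strftime` follow the active locale, so the English names shown in the comments assume the default C locale.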

In [9]:
prediction_templates_template = """Here are some {prediction_domain} templates:

- {prediction_domain} template 1: On [ $p_t$ ], [ $p_p$ ] [ $p_w$ ] that the [ $p_a$ ] at [ $p_o$ ] [ $p_v$ ] [ $p_s$ ] by [ $p_m$ ] in [ $p_f$ ].

"""
prediction_templates_prompt = PromptTemplate.from_template(prediction_templates_template)

In [10]:
prediction_examples_template = """Here are some examples of {prediction_domain} predictions:

{domain_examples}

With the above, generate a unique set of {predictions_N} predictions. Think from the perspective of a {prediction_domain} analyst, expert, top executive, or senior level person."""
prediction_examples_prompt = PromptTemplate.from_template(prediction_examples_template)
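
The example predictions passed in through `{domain_examples}` mark each property with square brackets, so the four fields can be extracted mechanically. A minimal sketch, assuming the bracket order (source, attribute, target, date) that the financial examples in this notebook follow; `extract_properties` is a hypothetical helper.

```python
import re

# Matches the text inside one pair of square brackets.
BRACKETED = re.compile(r"\[([^\[\]]+)\]")


def extract_properties(sentence: str) -> dict[str, str]:
    """Map the four bracketed spans to p_s, p_a, p_t, p_d (in that order)."""
    spans = [span.strip() for span in BRACKETED.findall(sentence)]
    if len(spans) != 4:
        raise ValueError(f"expected 4 bracketed spans, found {len(spans)}")
    return dict(zip(("p_s", "p_a", "p_t", "p_d"), spans))


sentence = (
    "[Ava Lee] predicts that the [operating cash flow] at [ExxonMobil] "
    "should decrease in [03/21/2025 to 08/21/2025]."
)
properties = extract_properties(sentence)
print(properties)
```

Raising on an unexpected span count makes malformed examples visible early instead of silently mislabeling fields.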

In [11]:
prediction_input_prompts = [
    ("prediction_properties", prediction_properties_prompt),
    ("prediction_requirements", prediction_requirements_prompt),
    ("prediction_templates", prediction_templates_prompt),
    ("prediction_examples", prediction_examples_prompt),
]

pipeline_prompt = PipelinePromptTemplate(
    final_prompt=full_prediction_prompt, pipeline_prompts=prediction_input_prompts
)
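
Conceptually, the pipeline formats each named sub-prompt with the input variables and then substitutes the results into the final template. The sketch below mimics that flow with plain `str.format`. It is a simplified stand-in, not LangChain's implementation, and `compose_prompt` and the toy templates are hypothetical.

```python
def compose_prompt(final_template: str, parts: list[tuple[str, str]], inputs: dict) -> str:
    """Format each named part, then substitute the results into the final template."""
    values = dict(inputs)
    for name, template in parts:
        # str.format ignores unused keyword arguments, so every part can
        # receive the full set of values collected so far.
        values[name] = template.format(**values)
    return final_template.format(**values)


parts = [
    ("properties", "Properties for the {domain} domain."),
    ("examples", "Generate {n} {domain} predictions."),
]
prompt = compose_prompt(
    "{properties}\n\n{examples}",
    parts,
    {"domain": "financial", "n": 10},
)
print(prompt)
```

Because parts are formatted in order, a later part could also reference the output of an earlier one by name.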



## Generate Domain Predictions

In [12]:
predictions_N = 10

### Generate Financial Predictions

In [13]:
financial_attributes = """stock price, net profit, revenue, operating cash flow, research and development expenses, operating income, gross profit."""
financial_requirements = """- Should be based on real-world financial earnings reports.
    - Suppose the time when $p$ was made is during any earning season.
    - Include stocks from all sectors such as consumer staples, energy, finance, health care, industrials, materials, media, real estate, retail, technology, utilities, defense, etc.
    - Include the US Dollar sign ($) before or USD after the amount of the financial attribute."""

financial_examples = """
- financial examples for template 1:
    - financial template 1: [$p_s$] forecasts that the [$p_a$] at [$p_t$] will increase in [$p_d$].

    1. [Detravious, an investor] forecasts that the [stock price] at [Apple] will likely decrease in [2025 Q1 to 2025 Q3].
    2. [Ava Lee] predicts that the [operating cash flow] at [ExxonMobil] should decrease in [03/21/2025 to 08/21/2025].
    
 """

In [14]:
financial_input_dict = {
    "prediction_domain": "financial",
    "prediction_domain_attribute": financial_attributes,
    "domain_requirements": financial_requirements,
    "domain_examples": financial_examples,
    "predictions_N": predictions_N
}
financial_prompt_output = pipeline_prompt.format(**financial_input_dict)
print(financial_prompt_output)


A prediction ($p$) = ($p_s$, $p_t$, $p_d$, $p_a$), where it consists of the following four properties:

    1. $p_s$, any source entity in the financial domain.
        - Can be a person (with a name) or a financial person such as a financial reporter, financial analyst, financial expert, financial top executive, financial senior level person, etc).
        - Can only be an organization that is associated with the financial prediction.
    2. $p_t$, any target entity in the financial domain.
	      - Can be a person (with a name) or a financial person such as a financial reporter, financial analyst, financial expert, financial top executive, financial senior level person, etc).
        - Can only be an organization that is associated with the financial prediction.
    3. $p_d$, date range when $p$ is expected to come to fruition.
        - Forecast can range from a second to anytime in the future.
        - Answers the questions: "How far to go out from today?" or "Where to stop?".
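
Before sending a rendered prompt to a model, it can help to confirm that no `{placeholder}` survived formatting. Text passed in as an input value is inserted verbatim, so any braces inside it are not expanded. A rough check, where `find_unfilled_placeholders` is a hypothetical helper:

```python
import re

# Identifier-like names wrapped in single braces, e.g. {prediction_domain}.
PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def find_unfilled_placeholders(rendered_prompt: str) -> list[str]:
    """Return identifier-like {names} still present in a rendered prompt."""
    return sorted(set(PLACEHOLDER.findall(rendered_prompt)))


# Values are substituted verbatim; braces inside them are left untouched.
examples = "- {prediction_domain} template 1: ..."
rendered = "Examples for the {domain} domain:\n{examples}".format(
    domain="financial", examples=examples
)
leftover = find_unfilled_placeholders(rendered)
print(leftover)  # ['prediction_domain']
```

An empty list means every identifier-like placeholder was filled.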
   

In [15]:
tgmf = TextGenerationModelFactory()

N_batches = 1
# text_generation_models = [llama_instant_generation_model]
# text_generation_models = [llama_versatile_generation_model, llama_instant_generation_model, llama_70b_8192_generation_model, llama_8b_8192_generation_model]
text_generation_models = [llama_instant_generation_model, llama_8b_8192_generation_model]

# text_generation_models = [llama_versatile_generation_model, llama_70b_8192_generation_model, mixtral_generation_model]

In [16]:
prediction_domains = ["finance"]
prediction_prompt_outputs = {
    "finance": financial_prompt_output,
}
prediction_label = 1

batched_predictions_df = tgmf.batch_generate_predictions(
    N_batches=N_batches,
    text_generation_models=text_generation_models,
    domains=prediction_domains,
    prompt_outputs=prediction_prompt_outputs,
    sentence_label=prediction_label,
)

  0%|          | 0/1 [00:00<?, ?it/s]

finance --- <text_generation_models.LlamaInstantTextGenerationModel object at 0x1039c76a0>
finance --- <text_generation_models.Llama8B8192TextGenerationModel object at 0x30a95b6d0>


100%|██████████| 1/1 [00:13<00:00, 13.56s/it]
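
How `batch_generate_predictions` works internally is not shown in this notebook. The sketch below is only a guess at the general shape, looping over batches, domains, and models and collecting one row per generated line. `StubModel`, `batch_generate`, and the canned text are all assumptions. The column names are copied from the output shown in the following cells.

```python
from dataclasses import dataclass


@dataclass
class StubModel:
    """Stand-in for a text generation model; returns canned text."""
    model_name: str

    def generate(self, prompt: str) -> str:
        return "T1: Example prediction one.\nT1: Example prediction two."


def batch_generate(n_batches, models, domains, prompt_outputs, sentence_label):
    """Collect one row per generated line, per batch, domain, and model."""
    rows = []
    for batch_idx in range(n_batches):
        for domain in domains:
            for model in models:
                text = model.generate(prompt_outputs[domain])
                for line in text.splitlines():
                    if line.strip():
                        rows.append({
                            "Base Sentence": line.strip(),
                            "Sentence Label": sentence_label,
                            "Model Name": model.model_name,
                            "Domain": domain,
                            "Batch Index": batch_idx,
                        })
    return rows


rows = batch_generate(
    n_batches=1,
    models=[StubModel("stub-a"), StubModel("stub-b")],
    domains=["finance"],
    prompt_outputs={"finance": "prompt text"},
    sentence_label=1,
)
print(len(rows))
```

Plain dictionaries keep the sketch dependency-free; a list of such rows can be handed to `pd.DataFrame` when a table is needed.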







In [17]:
batched_predictions_df

[                                         Base Sentence  Sentence Label  \
 0    T1: Mr. Johnson, a financial analyst, forecast...               1   
 1    T1: Goldman Sachs speculates that the net prof...               1   
 2    T1: Detravious, an investor, predicts that the...               1   
 3    T1: Ava Lee, a financial expert, forecasts tha...               1   
 4    T1: Mr. Smith, a senior level person, speculat...               1   
 ..                                                 ...             ...   
 224  T1: Mr. Smith, a senior level person, speculat...               1   
 225  T1: Ava Lee, a financial expert, forecasts tha...               1   
 226  T1: Detravious, an investor, predicts that the...               1   
 227  T1: Mr. Johnson, a financial analyst, speculat...               1   
 228  T1: Goldman Sachs predicts that the net profit...               1   
 
                Model Name   Domain  Batch Index  
 0    llama-3.1-8b-instant  finance            

### Inspect and Log Financial Predictions

In [18]:
pd.set_option('max_colwidth', 800)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
batched_predictions_df

[                                                                                                                                                  Base Sentence  \
 0                                     T1: Mr. Johnson, a financial analyst, forecasts that the stock price at Tesla will likely decrease in 2025 Q1 to 2025 Q3.   
 1                                                           T1: Goldman Sachs speculates that the net profit at Amazon will be going to increase in Q3 of 2027.   
 2                   T1: Detravious, an investor, predicts that the research and development expenses at Microsoft will likely fall in 2029 of Q3 to 2030 of Q3.   
 3                        T1: Ava Lee, a financial expert, forecasts that the operating income at Johnson & Johnson will be going to rise in 2025 Q2 to 2026 Q2.   
 4                          T1: Mr. Smith, a senior level person, speculates that the revenue at Procter & Gamble will likely stay stable in 2025 Q1 to 2025 Q2.   
 5              

In [19]:
predictions_df = DataProcessing.concat_dfs(batched_predictions_df)
predictions_df

Unnamed: 0,Base Sentence,Sentence Label,Model Name,Domain,Batch Index
0,"T1: Mr. Johnson, a financial analyst, forecasts that the stock price at Tesla will likely decrease in 2025 Q1 to 2025 Q3.",1,llama-3.1-8b-instant,finance,0
1,T1: Goldman Sachs speculates that the net profit at Amazon will be going to increase in Q3 of 2027.,1,llama-3.1-8b-instant,finance,0
2,"T1: Detravious, an investor, predicts that the research and development expenses at Microsoft will likely fall in 2029 of Q3 to 2030 of Q3.",1,llama-3.1-8b-instant,finance,0
3,"T1: Ava Lee, a financial expert, forecasts that the operating income at Johnson & Johnson will be going to rise in 2025 Q2 to 2026 Q2.",1,llama-3.1-8b-instant,finance,0
4,"T1: Mr. Smith, a senior level person, speculates that the revenue at Procter & Gamble will likely stay stable in 2025 Q1 to 2025 Q2.",1,llama-3.1-8b-instant,finance,0
5,"T1: Detravious, an investor, predicts that the gross profit at Coca-Cola will likely increase in 2025 Q3 to 2026 Q3.",1,llama-3.1-8b-instant,finance,0
6,"T1: Ava Lee, a financial expert, forecasts that the net profit at ExxonMobil will be going to decrease in 2025 Q1 to 2025 Q3.",1,llama-3.1-8b-instant,finance,0
7,"T1: Mr. Johnson, a financial analyst, speculates that the operating cash flow at Intel will likely fall in 2025 Q2 to 2025 Q3.",1,llama-3.1-8b-instant,finance,0
8,T1: Goldman Sachs predicts that the stock price at Apple will be going to rise in 2025 Q2 to 2026 Q2.,1,llama-3.1-8b-instant,finance,0
9,"T1: Mr. Smith, a senior level person, forecasts that the revenue at McDonald's will likely stay stable in 2025 Q1 to 2025 Q2.",1,llama-3.1-8b-instant,finance,0
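
Several prompt requirements can be spot-checked on the generated sentences, for example the `T<n>:` prefix, the ban on list numbering, and uniqueness. A minimal sketch: `check_predictions` is a hypothetical helper, it covers only a subset of the requirements, and the numbered sentence in the sample is made up to trigger a violation.

```python
import re

TEMPLATE_PREFIX = re.compile(r"^T\d+:\s")  # e.g. "T1: "
LIST_NUMBER = re.compile(r"^\d+\.\s")      # e.g. "1. "


def check_predictions(sentences: list[str]) -> dict[str, list[str]]:
    """Group sentences that violate a few of the prompt requirements."""
    problems = {"missing_prefix": [], "numbered": [], "duplicate": []}
    seen = set()
    for sentence in sentences:
        text = sentence.strip()
        if LIST_NUMBER.match(text):
            problems["numbered"].append(text)
        elif not TEMPLATE_PREFIX.match(text):
            problems["missing_prefix"].append(text)
        key = text.lower()
        if key in seen:
            problems["duplicate"].append(text)
        seen.add(key)
    return problems


sample = [
    "T1: Goldman Sachs predicts that the stock price at Apple will be going to rise in 2025 Q2 to 2026 Q2.",
    "1. Ava Lee forecasts that the revenue at Intel will fall in 2025.",
    "T1: Goldman Sachs predicts that the stock price at Apple will be going to rise in 2025 Q2 to 2026 Q2.",
]
report = check_predictions(sample)
print(report)
```

The same function could be applied to the `Base Sentence` column, for instance via `predictions_df["Base Sentence"].tolist()`.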


In [20]:
log_directory = '../data/prediction_logs/'
file_name = 'test.log'

In [21]:
logger = LogData(log_directory, file_name)
logger

<log_files.LogData at 0x17538be80>

In [22]:
csv_output_path = os.path.join(log_directory, 'from_dataframe.csv')
logger.dataframe_to_csv(predictions_df, csv_output_path)

True

In [23]:
csv_input_path = os.path.join(log_directory, 'from_dataframe.csv')
file_name = 'from_csv.log'
logger.csv_to_log(csv_input_path, file_name)

False

In [24]:
csv_output_from_log_path = os.path.join(log_directory, 'from_log.csv')
ignore_patterns = ['INFO', 'DEBUG', 'ERROR', 'WARNING', 'CSV Row: '] # Example patterns to ignore

logger.log_to_csv(file_name, csv_output_from_log_path, ignore_patterns)

False
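
Both log conversions above returned `False`, and the notebook does not show why. For reference, a CSV to log to CSV round trip can be sketched with the standard library alone. This is not the `LogData` implementation: the `CSV Row: ` prefix is borrowed from `ignore_patterns`, and every function name below is hypothetical.

```python
import csv
import io
import os
import tempfile

PREFIX = "CSV Row: "


def csv_to_log(csv_path: str, log_path: str) -> int:
    """Write each CSV record as one prefixed log line; return the row count."""
    count = 0
    with open(csv_path, newline="") as src, open(log_path, "w", newline="") as dst:
        for row in csv.reader(src):
            buffer = io.StringIO()
            # Re-serialize the row so commas inside fields stay quoted.
            csv.writer(buffer, lineterminator="").writerow(row)
            dst.write(f"{PREFIX}{buffer.getvalue()}\n")
            count += 1
    return count


def log_to_rows(log_path: str) -> list[list[str]]:
    """Strip the prefix from matching log lines and parse them back into fields."""
    with open(log_path, newline="") as src:
        lines = [line.rstrip("\n")[len(PREFIX):] for line in src if line.startswith(PREFIX)]
    return list(csv.reader(lines))


original = [
    ["Base Sentence", "Domain"],
    ["T1: Mr. Smith, a senior level person, forecasts that the revenue at "
     "McDonald's will likely stay stable in 2025 Q1 to 2025 Q2.", "finance"],
]
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "in.csv")
    log_path = os.path.join(tmp, "out.log")
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(original)
    written = csv_to_log(csv_path, log_path)
    restored = log_to_rows(log_path)

print(restored == original)
```

The sketch assumes one record per line, so fields containing embedded newlines would need extra handling.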

In [25]:
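# Note: this reads 'from_dataframe.csv' (written directly from the DataFrame),
# not 'from_log.csv', so the log round trip above is not exercised here.
csv_input_from_log_path = os.path.join(log_directory, 'from_dataframe.csv')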
df_from_log_csv = logger.csv_to_dataframe(csv_input_from_log_path)

if df_from_log_csv is not None:
    print("\nDataFrame created from log CSV:")
    print(df_from_log_csv)
else:
    print("\nFailed to create DataFrame from log CSV. Check the log file for errors.")


DataFrame created from log CSV:
                                                                                                                                                                                                                      Base Sentence  \
0                                                                                                         T1: Mr. Johnson, a financial analyst, forecasts that the stock price at Tesla will likely decrease in 2025 Q1 to 2025 Q3.   
1                                                                                                                               T1: Goldman Sachs speculates that the net profit at Amazon will be going to increase in Q3 of 2027.   
2                                                                                       T1: Detravious, an investor, predicts that the research and development expenses at Microsoft will likely fall in 2029 of Q3 to 2030 of Q3.   
3                                          

In [26]:
df_from_log_csv

Unnamed: 0,Base Sentence,Sentence Label,Model Name,Domain,Batch Index
0,"T1: Mr. Johnson, a financial analyst, forecasts that the stock price at Tesla will likely decrease in 2025 Q1 to 2025 Q3.",1,llama-3.1-8b-instant,finance,0
1,T1: Goldman Sachs speculates that the net profit at Amazon will be going to increase in Q3 of 2027.,1,llama-3.1-8b-instant,finance,0
2,"T1: Detravious, an investor, predicts that the research and development expenses at Microsoft will likely fall in 2029 of Q3 to 2030 of Q3.",1,llama-3.1-8b-instant,finance,0
3,"T1: Ava Lee, a financial expert, forecasts that the operating income at Johnson & Johnson will be going to rise in 2025 Q2 to 2026 Q2.",1,llama-3.1-8b-instant,finance,0
4,"T1: Mr. Smith, a senior level person, speculates that the revenue at Procter & Gamble will likely stay stable in 2025 Q1 to 2025 Q2.",1,llama-3.1-8b-instant,finance,0
5,"T1: Detravious, an investor, predicts that the gross profit at Coca-Cola will likely increase in 2025 Q3 to 2026 Q3.",1,llama-3.1-8b-instant,finance,0
6,"T1: Ava Lee, a financial expert, forecasts that the net profit at ExxonMobil will be going to decrease in 2025 Q1 to 2025 Q3.",1,llama-3.1-8b-instant,finance,0
7,"T1: Mr. Johnson, a financial analyst, speculates that the operating cash flow at Intel will likely fall in 2025 Q2 to 2025 Q3.",1,llama-3.1-8b-instant,finance,0
8,T1: Goldman Sachs predicts that the stock price at Apple will be going to rise in 2025 Q2 to 2026 Q2.,1,llama-3.1-8b-instant,finance,0
9,"T1: Mr. Smith, a senior level person, forecasts that the revenue at McDonald's will likely stay stable in 2025 Q1 to 2025 Q2.",1,llama-3.1-8b-instant,finance,0
