# 1-Generate Predictions using LangChain

- **Goal:** Prediction Recognition

- **Purpose:** Implement step 1 of the prediction recognition pipeline, including its sub-steps:
    1. Generate predictions
        1. Create several prediction prompt templates
        2. Use open-source LLMs to generate predictions

- **Misc:**
    - `%store`: IPython line magic that persists a variable of interest so it can be loaded in another notebook (with `%store -r`)

In [None]:
# !pip3 install pandas langchain spacy numpy Groq

In [None]:
!pip3 install -U scikit-learn pandas tqdm langchain-core spacy groq python-dotenv

In [None]:
# !pip3 install python-dotenv

In [None]:
import os, sys

import pandas as pd

from tqdm import tqdm
from langchain_core.prompts import PipelinePromptTemplate, PromptTemplate

# Get the current working directory of the notebook
notebook_dir = os.getcwd()
# Add the parent directory to the system path
sys.path.append(os.path.join(notebook_dir, '../'))

from log_files import DataFrameLogger
from data_processing import DataProcessing
from text_generation_models import (
    TextGenerationModelFactory,
    LlamaVersatileTextGenerationModel,
    LlamaInstantTextGenerationModel,
    Llama70B8192TextGenerationModel,
    Llama8B8192TextGenerationModel,
    MixtralTextGenerationModel,
)

In [None]:
# pd.set_option('max_colwidth', 800)

llama_versatile_generation_model = LlamaVersatileTextGenerationModel()
llama_instant_generation_model = LlamaInstantTextGenerationModel()
llama_70b_8192_generation_model = Llama70B8192TextGenerationModel()
llama_8b_8192_generation_model = Llama8B8192TextGenerationModel()
mixtral_generation_model = MixtralTextGenerationModel()

## LangChain Templates for Domain Predictions

In [None]:
full_prediction_template = """{prediction_properties}

{prediction_requirements}

{prediction_templates}

{prediction_examples}
"""

full_prediction_prompt = PromptTemplate.from_template(full_prediction_template)

Google predictive spelling/autocomplete 

In [None]:
prediction_properties_template = """A prediction $p$ = ($p_s$, $p_t$, $p_d$, $p_a$) consists of the following four properties:

    1. $p_s$, any source entity in the {prediction_domain} domain.
        - Can be a person (with a name) or a {prediction_domain} person (such as a {prediction_domain} reporter, {prediction_domain} analyst, {prediction_domain} expert, {prediction_domain} top executive, {prediction_domain} senior-level person, etc.).
        - Can be an organization, but only one that is associated with the {prediction_domain} prediction.
    2. $p_t$, any target entity in the {prediction_domain} domain.
        - Can be a person (with a name) or a {prediction_domain} person (such as a {prediction_domain} reporter, {prediction_domain} analyst, {prediction_domain} expert, {prediction_domain} top executive, {prediction_domain} senior-level person, etc.).
        - Can be an organization, but only one that is associated with the {prediction_domain} prediction.
    3. $p_d$, date range when $p$ is expected to come to fruition.
        - Forecast can range from a second to any time in the future.
        - Answers the questions: "How far to go out from today?" or "Where to stop?".
    4. $p_a$, {prediction_domain} prediction attribute.
        - Characteristics of domain-specific attributes, such as various quantifiable metrics relevant to the {prediction_domain} domain.
        - Some examples are {prediction_domain_attribute}.
"""
prediction_properties_prompt = PromptTemplate.from_template(prediction_properties_template)

    - Keep the brackets around the prediction properties when generating predictions and be sure to include brackets around dates such as "2024-10-15", "2024/08/20", "Q4 of 2024", "2025", "2027 Q1", "Q3 2027", "On 21 Aug 2024".
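The bracketed date styles mentioned above can all be derived from a single `datetime.date`. The sketch below is only an illustration of that idea; the helper name `bracketed_date_variants` is hypothetical and is not part of the notebook's modules.

```python
from datetime import date


def bracketed_date_variants(d: date) -> list[str]:
    """Return one date rendered in several bracketed styles (hypothetical helper)."""
    quarter = (d.month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
    variants = [
        d.strftime("%Y-%m-%d"),     # e.g. 2024-08-21
        d.strftime("%Y/%m/%d"),     # e.g. 2024/08/21
        d.strftime("%d %b %Y"),     # e.g. 21 Aug 2024
        f"Q{quarter} of {d.year}",  # e.g. Q3 of 2024
        f"{d.year} Q{quarter}",     # e.g. 2024 Q3
        str(d.year),                # e.g. 2024
    ]
    # Wrap every variant in square brackets, as the note above asks.
    return [f"[{v}]" for v in variants]


print(bracketed_date_variants(date(2024, 8, 21)))
```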

In [None]:
prediction_requirements_template = """{prediction_domain} requirements to use for each prediction:

    - Should be based on real-world {prediction_domain} data and not hallucinate.
    - Each prediction must be a single simple sentence (NOT a compound joined with "and" or "or").
    - Should diversify all four properties of the prediction ($p$), i.e., vary $p_s$, $p_t$, $p_d$, and $p_a$ rather than reusing the same values.
    - Should use synonyms of "predict" such as forecasts, speculates, foresees, envisions, etc., and not use any one of them more than ten times.
    - The prediction should be unique and not repeated.
    - Do not number the predictions.
    - Do not say, "As the {prediction_domain}, I will generate company-based {prediction_domain} predictions using the provided templates." or anything similar.
    - Use the five different templates and examples provided.
    - Vary how the date ($p_d$) is written in the prediction, for example: (1) Wednesday, August 21, 2024; (2) Wed, August 21, 2024; (3) 08/21/2024; (4) 21/08/2024; (5) 21 August 2024; (6) 2024/08/21; (7) 2024-08-21; (8) August 21, 2024; (9) Aug 21, 2024; (10) 21 Aug 2024; (11) Q3 of 2027; (12) 2029 Q3; etc. (the day of the week may be removed).
    {domain_requirements}
    - Do not say "Here are {predictions_N} unique {prediction_domain} predictions based on the provided templates and examples:" or anything similar.
    - Do not use any of the examples in the prompt.
    - In front of every prediction, put the template number in the format "T1:", "T2:", etc., and do not number them like "1.", "2.", etc.
    - Disregard brackets: "[]"
    - Do not use a person name or entity name more than once, i.e., do not use the name Joe as both the $p_s$ and $p_t$, unless the names are related, like Mr. Sachs and Goldman Sachs or Mr. Sam Walton and Sam's Club.
    - Should vary the slope, e.g., rise/increase/as much as, fall/decrease/as little as, change, stay stable, high/low chance/probability/degree of, etc.
    - Should vary the prediction verbs such as will, would, be going to, should, etc.
"""
prediction_requirements_prompt = PromptTemplate.from_template(prediction_requirements_template)
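The requirements ask the model to prefix every prediction with its template number ("T1:", "T2:", ...) and to omit any preamble or plain numbering. A downstream step could rely on that prefix to separate predictions from other text. The following is a minimal sketch of such a step; `parse_template_lines` is a hypothetical helper and the sample output is made up for illustration, not real model output.

```python
import re

# Matches lines such as "T1: <prediction text>"; anything else is ignored.
TEMPLATE_LINE = re.compile(r"^\s*T(\d+):\s*(.+?)\s*$")


def parse_template_lines(raw_output: str) -> list[tuple[int, str]]:
    """Return (template_number, prediction) pairs from raw model output."""
    pairs = []
    for line in raw_output.splitlines():
        match = TEMPLATE_LINE.match(line)
        if match:
            pairs.append((int(match.group(1)), match.group(2)))
    return pairs


# Made-up sample output, for illustration only.
sample = """Here are some predictions:
T1: Ava Lee predicts that the revenue at ExxonMobil should fall in Q3 2025.
1. This numbered line is skipped.
T2: An analyst expects the stock price at Apple to rise in 2026.
"""

for number, text in parse_template_lines(sample):
    print(number, text)
```

Lines without the prefix are dropped silently here; a stricter variant could collect them to measure how often a model ignores the formatting requirements.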

In [None]:
prediction_templates_template = """Here are some {prediction_domain} templates:

- {prediction_domain} template 1: On [ $p_t$ ], [ $p_p$ ] [ $p_w$ ] that the [ $p_a$ ] at [ $p_o$ ] [ $p_v$ ] [ $p_s$ ] by [ $p_m$ ] in [ $p_f$ ].

"""
prediction_templates_prompt = PromptTemplate.from_template(prediction_templates_template)

In [None]:
prediction_examples_template = """Here are some examples of {prediction_domain} predictions:

{domain_examples}

With the above, generate a unique set of {predictions_N} predictions. Think from the perspective of a {prediction_domain} analyst, expert, top executive, or senior-level person."""
prediction_examples_prompt = PromptTemplate.from_template(prediction_examples_template)

In [None]:
prediction_input_prompts = [
    ("prediction_properties", prediction_properties_prompt),
    ("prediction_requirements", prediction_requirements_prompt),
    ("prediction_templates", prediction_templates_prompt),
    ("prediction_examples", prediction_examples_prompt),
]

pipeline_prompt = PipelinePromptTemplate(
    final_prompt=full_prediction_prompt, pipeline_prompts=prediction_input_prompts
)
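Conceptually, the pipeline formats each named sub-prompt with the shared inputs and then substitutes the results into the final template. The sketch below reproduces that idea with plain `str.format` so it runs without LangChain. It is not LangChain's implementation; `compose_prompt`, `template_fields`, and the shortened templates are illustrative assumptions.

```python
from string import Formatter


def template_fields(template: str) -> set[str]:
    """Return the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def compose_prompt(final_template: str, parts: list[tuple[str, str]], inputs: dict) -> str:
    """Format each named part, then feed the results into the final template."""
    values = dict(inputs)
    for name, template in parts:
        # Pass each part only the variables it actually references.
        needed = {field: values[field] for field in template_fields(template)}
        values[name] = template.format(**needed)
    final_needed = {field: values[field] for field in template_fields(final_template)}
    return final_template.format(**final_needed)


# Shortened stand-ins for the notebook's templates.
final = "{prediction_properties}\n\n{prediction_examples}"
parts = [
    ("prediction_properties", "A prediction in the {prediction_domain} domain."),
    ("prediction_examples", "Generate {predictions_N} {prediction_domain} predictions."),
]

print(compose_prompt(final, parts, {"prediction_domain": "financial", "predictions_N": 10}))
```

A missing input raises a `KeyError` that names the absent variable, which is a quick way to check that an input dictionary covers every placeholder.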

## Generate Domain Predictions

In [None]:
predictions_N = 10

### Generate Financial Predictions

In [None]:
financial_attributes = """stock price, net profit, revenue, operating cash flow, research and development expenses, operating income, gross profit."""
financial_requirements = """- Should be based on real-world financial earnings reports.
    - Suppose the time when $p$ was made is during any earning season.
    - Include stocks from all sectors such as consumer staples, energy, finance, health care, industrials, materials, media, real estate, retail, technology, utilities, defense, etc.
    - Include the US Dollar sign ($) before or USD after the amount of the financial attribute."""

# This string is passed as a value, so placeholders inside it would not be expanded;
# the domain name is therefore written out literally.
financial_examples = """
- financial examples for template 1:
    - financial template 1: [$p_s$] forecasts that the [$p_a$] at [$p_t$] will increase in [$p_d$].

    1. [Detravious, an investor] forecasts that the [stock price] at [Apple] will likely decrease in [2025 Q1 to 2025 Q3].
    2. [Ava Lee] predicts that the [operating cash flow] at [ExxonMobil] should decrease in [03/21/2025 to 08/21/2025].

"""

In [None]:
financial_input_dict = {
    "prediction_domain": "financial",
    "prediction_domain_attribute": financial_attributes,
    "domain_requirements": financial_requirements,
    "domain_examples": financial_examples,
    "predictions_N": predictions_N
}
financial_prompt_output = pipeline_prompt.format(**financial_input_dict)
print(financial_prompt_output)
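Values passed into a template are inserted verbatim, so any `{placeholder}` written inside a value (for example inside an examples string) is not expanded a second time. The sketch below shows this with plain `str.format`, on the assumption that LangChain's default f-string template format follows the same rule; the template and value are shortened stand-ins.

```python
template = "Examples for the {prediction_domain} domain:\n{domain_examples}"

# The value itself contains a placeholder-looking token.
examples_value = "- {prediction_domain} template 1: ..."

# The inner token survives untouched because values are not re-formatted.
rendered = template.format(prediction_domain="financial", domain_examples=examples_value)
print(rendered)

# Formatting the value first resolves the inner placeholder.
resolved = template.format(
    prediction_domain="financial",
    domain_examples=examples_value.format(prediction_domain="financial"),
)
print(resolved)
```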


In [None]:
tgmf = TextGenerationModelFactory()

N_batches = 1
# text_generation_models = [llama_instant_generation_model]
# text_generation_models = [llama_versatile_generation_model, llama_instant_generation_model, llama_70b_8192_generation_model, llama_8b_8192_generation_model]
text_generation_models = [llama_instant_generation_model, llama_8b_8192_generation_model]

# text_generation_models = [llama_versatile_generation_model, llama_70b_8192_generation_model, mixtral_generation_model]

In [None]:
prediction_domains = ["finance"]
prediction_prompt_outputs = {
    "finance": financial_prompt_output,
}
prediction_label = 1

batched_predictions_df = tgmf.batch_generate_predictions(N_batches=N_batches, 
                                text_generation_models=text_generation_models, 
                                domains=prediction_domains,
                                prompt_outputs=prediction_prompt_outputs,
                                sentence_label=prediction_label)
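The implementation of `batch_generate_predictions` lives in `text_generation_models` and is not shown in this notebook. The sketch below shows one way such a loop could be organised, iterating over batches, models, and domains. `EchoModel`, its `generate` method, and `batch_generate` are hypothetical stand-ins, not the project's API, and no LLM is called.

```python
from dataclasses import dataclass


@dataclass
class EchoModel:
    """Hypothetical stand-in for a text generation model (no API calls)."""
    name: str

    def generate(self, prompt: str) -> list[str]:
        # A real model would call an LLM here; this stub returns fixed text.
        return [f"{self.name} output for a prompt of {len(prompt)} characters"]


def batch_generate(n_batches, models, domains, prompt_outputs, sentence_label):
    """Collect one record per generated sentence across batches, models, and domains."""
    records = []
    for batch_idx in range(n_batches):
        for model in models:
            for domain in domains:
                for sentence in model.generate(prompt_outputs[domain]):
                    records.append({
                        "batch": batch_idx,
                        "model": model.name,
                        "domain": domain,
                        "sentence": sentence,
                        "sentence_label": sentence_label,
                    })
    return records


rows = batch_generate(
    n_batches=2,
    models=[EchoModel("model-a"), EchoModel("model-b")],
    domains=["finance"],
    prompt_outputs={"finance": "example prompt"},
    sentence_label=1,
)
print(len(rows))
```

A list of flat records like this can be handed to `pd.DataFrame(rows)` if a tabular view is wanted.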

In [None]:
batched_predictions_df

### Generate Health Predictions

In [None]:
pd.set_option('max_colwidth', 800)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
batched_predictions_df

In [None]:
predictions_df = DataProcessing.concat_dfs(batched_predictions_df)
predictions_df

In [None]:
# csv_log_data = predictions_df.to_csv()
# csv_log_data

In [None]:
# # 2025-04-10 14:21:09,464 - INFO - Logging setup complete - logging to: ../data/prediction_logs/test.log

# import logging
# import os

# def setup_logging():
#     # Define the directory where you want to save the logs
#     log_directory = '../data/prediction_logs/'  # Adjust according to your notebook's path

#     # Create the directory if it doesn't exist
#     os.makedirs(log_directory, exist_ok=True)

#     # Define the full path to the log file
#     log_file_path = os.path.join(log_directory, 'test.log')

#     # Clear previous logging configuration
#     logging.getLogger().handlers = []

#     # Configure the logging module to write to the specified file
#     logging.basicConfig(level=logging.INFO,
#                         format='%(asctime)s - %(levelname)s - %(message)s',
#                         handlers=[
#                             logging.FileHandler(log_file_path),
#                             logging.StreamHandler()
#                         ])
#     logging.info("Logging setup complete - logging to: {}".format(log_file_path))
#     return log_file_path

# def log_data():
#     # my_list = [1, 2, 3, "apple", "banana"]
#     logging.info(f"Contents of my_list: {csv_log_data}")
#     print("Logging data... check the log file for the output.")

In [None]:
# log_file_path = setup_logging()
# print(f"Logging has been set up. Logs will be saved to: {log_file_path}")
# log_data()

In [None]:
# import logging

# def read_log_file(log_file_path):
#     """Opens and reads the content of a log file.

#     Args:
#         log_file_path (str): The path to the log file.
#     """
#     try:
#         with open(log_file_path, 'r') as log_file:
#             for line in log_file:
#                 print(line.strip())  # Print each line, removing leading/trailing whitespace
#     except FileNotFoundError:
#         print(f"Error: Log file not found at {log_file_path}")
#     except Exception as e:
#         print(f"An error occurred while reading the log file: {e}")

# log_file = "../data/prediction_logs/test.log"
# # Now, you can use the read_log_file function to open and read the content
# print("\n--- Reading the log file ---")
# read_log_file(log_file)

In [None]:
import csv
import logging


def convert_list_to_csv(data_list, csv_file_path):
    """
    Converts a list to a CSV file. Each element of the list becomes a row
    with a single column.

    Args:
        data_list (list): The list to convert.
        csv_file_path (str): The path to save the CSV file.
    """
    try:
        with open(csv_file_path, 'w', newline='') as csvfile:
            writer = csv.writer(csvfile)
            for item in data_list:
                writer.writerow([item])
        logging.info(f"List successfully converted to CSV: {csv_file_path}")
        return csv_file_path
    except Exception as e:
        logging.error(f"Error converting list to CSV: {e}")
        return None

def convert_csv_to_dataframe(csv_file_path):
    """
    Reads a CSV file (with a single column) and converts it to a Pandas DataFrame.

    Args:
        csv_file_path (str): The path to the CSV file.

    Returns:
        pandas.DataFrame or None: The DataFrame if successful, None otherwise.
    """
    try:
        df = pd.read_csv(csv_file_path, header=None, names=['data'])
        logging.info("CSV successfully converted to DataFrame.")
        return df
    except FileNotFoundError:
        logging.error(f"CSV file not found: {csv_file_path}")
        return None
    except Exception as e:
        logging.error(f"Error converting CSV to DataFrame: {e}")
        return None

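# Define the output file path (inside the log directory, not the log file itself)
log_directory = '../data/prediction_logs/'
os.makedirs(log_directory, exist_ok=True)
csv_output_path = os.path.join(log_directory, 'my_list.csv')

# Example list to convert (assumed here; `my_list` is not defined elsewhere in the notebook)
my_list = [1, 2, 3, "apple", "banana"]

# Convert the list to CSV
csv_file = convert_list_to_csv(my_list, csv_output_path)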

if csv_file:
    # Convert the CSV to a Pandas DataFrame
    df_from_csv = convert_csv_to_dataframe(csv_file)

    if df_from_csv is not None:
        logging.info(f"Generated DataFrame:\n{df_from_csv}")
        print("DataFrame created successfully. Check the log file for details.")
    else:
        print("Failed to create DataFrame. Check the log file for errors.")
else:
    print("Failed to convert list to CSV. Check the log file for errors.")

In [None]:
# logger = DataFrameLogger()
# logger.log_df(predictions_df)

In [None]:
# logged_data = logger.load_log()
# logged_data

In [None]:
# %store updated_predictions_df
# %store updated_non_predictions_df

In [None]:
import pandas as pd
import os, csv
import logging

# Configure logging (if not already configured)
log_directory = '../data/prediction_logs/'
os.makedirs(log_directory, exist_ok=True)
log_file_path = os.path.join(log_directory, 'data_processing.log')

if not logging.root.handlers:
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s - %(levelname)s - %(message)s',
                        filename=log_file_path)
    logging.info(f"Logging configured to save to: {log_file_path}")
else:
    logging.info("Logging already configured.")

def dataframe_to_csv(df, csv_file_path):
    """Writes a Pandas DataFrame to a CSV file."""
    try:
        df.to_csv(csv_file_path, index=False)  # index=False to avoid writing DataFrame index
        logging.info(f"DataFrame successfully written to CSV: {csv_file_path}")
        return True
    except Exception as e:
        logging.error(f"Error writing DataFrame to CSV: {e}")
        return False

# Example DataFrame
data = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
df = pd.DataFrame(data)

csv_output_path = os.path.join(log_directory, 'from_dataframe.csv')
dataframe_to_csv(df, csv_output_path)
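`logging.basicConfig` does nothing when the root logger already has handlers, so the guard above leaves file logging unconfigured if the environment attached handlers first. One alternative is a dedicated named logger with its own file handler that is added only once, which also keeps re-running a cell from duplicating log lines. The sketch below assumes a hypothetical helper `get_file_logger` and writes to a temporary directory rather than the project's log folder.

```python
import logging
import os
import tempfile


def get_file_logger(name: str, log_path: str) -> logging.Logger:
    """Return a named logger that writes to log_path, adding the handler only once."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    already_attached = any(
        isinstance(h, logging.FileHandler)
        and h.baseFilename == os.path.abspath(log_path)
        for h in logger.handlers
    )
    if not already_attached:
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
        logger.addHandler(handler)
    return logger


# Temporary directory used here so the example does not touch project files.
demo_dir = tempfile.mkdtemp()
demo_log = os.path.join(demo_dir, "data_processing.log")

log = get_file_logger("prediction_demo", demo_log)
log = get_file_logger("prediction_demo", demo_log)  # second call adds no new handler
log.info("DataFrame written")
print(len(log.handlers))
```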

In [None]:
def csv_to_log(csv_file_path, log_file_path):
    """Reads a CSV file and writes its content to a log file."""
    try:
        with open(csv_file_path, 'r') as csvfile, open(log_file_path, 'a') as logfile:
            reader = csv.reader(csvfile)
            for row in reader:
                log_message = f"CSV Row: {', '.join(map(str, row))}"
                logfile.write(log_message + '\n')
        logging.info(f"CSV content successfully written to log file: {log_file_path}")
        return True
    except FileNotFoundError:
        logging.error(f"CSV file not found: {csv_file_path}")
        return False
    except Exception as e:
        logging.error(f"Error writing CSV to log file: {e}")
        return False

csv_input_path = os.path.join(log_directory, 'from_dataframe.csv')
log_output_path = os.path.join(log_directory, 'from_csv.log')
csv_to_log(csv_input_path, log_output_path)

In [None]:
def log_to_csv(log_file_path, csv_file_path, lines_to_ignore=None, delimiter=','):
    """
    Reads a log file, extracts relevant lines, and writes them to a CSV file.

    Args:
        log_file_path (str): Path to the log file.
        csv_file_path (str): Path to save the CSV file.
        lines_to_ignore (list, optional): List of strings or patterns to identify lines to skip. Defaults to None.
        delimiter (str, optional): Delimiter for the CSV file. Defaults to ','.
    """
    if lines_to_ignore is None:
        lines_to_ignore = []

    try:
        extracted_data = []
        with open(log_file_path, 'r') as logfile:
            for line in logfile:
                skip_line = False
                for ignore_pattern in lines_to_ignore:
                    if ignore_pattern in line:
                        skip_line = True
                        break
                if not skip_line:
                    # Assuming the relevant data in the log file is comma-separated
                    # You might need more sophisticated parsing based on your log format
                    parts = line.strip().split(delimiter)
                    extracted_data.append(parts)

        with open(csv_file_path, 'w', newline='') as csvfile:
            writer = csv.writer(csvfile, delimiter=delimiter)
            writer.writerows(extracted_data)

        logging.info(f"Successfully extracted data from log file to CSV: {csv_file_path}, ignoring lines containing: {lines_to_ignore}")
        return True
    except FileNotFoundError:
        logging.error(f"Log file not found: {log_file_path}")
        return False
    except Exception as e:
        logging.error(f"Error processing log file to CSV: {e}")
        return False

log_input_path = os.path.join(log_directory, 'from_csv.log')
csv_output_from_log_path = os.path.join(log_directory, 'from_log.csv')
# Example patterns to ignore. Note: csv_to_log prefixes every row with 'CSV Row: ',
# so including that pattern here also skips all of the rows it wrote.
ignore_patterns = ['INFO', 'DEBUG', 'ERROR', 'WARNING', 'CSV Row: ']

log_to_csv(log_input_path, csv_output_from_log_path, ignore_patterns)
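Because `csv_to_log` writes each row as `CSV Row: a, b`, recovering the data from the log means stripping that prefix rather than skipping the line. The sketch below shows one way to do this. `rows_from_log` is a hypothetical helper, and it assumes no field contains the `", "` separator.

```python
PREFIX = "CSV Row: "


def rows_from_log(lines):
    """Recover row fields from lines written as 'CSV Row: a, b, c'."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line.startswith(PREFIX):
            continue  # skip unrelated log lines
        payload = line[len(PREFIX):]
        # Assumes no field contains ", "; quoted fields would need real CSV parsing.
        rows.append(payload.split(", "))
    return rows


log_lines = [
    "2025-04-10 14:21:09,464 - INFO - Logging setup complete",
    "CSV Row: col1, col2",
    "CSV Row: 1, a",
    "CSV Row: 2, b",
]
print(rows_from_log(log_lines))
```

All recovered values are strings, so numeric columns would need converting after loading.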

In [None]:
def csv_to_dataframe(csv_file_path):
    """Reads a CSV file into a Pandas DataFrame."""
    try:
        df = pd.read_csv(csv_file_path)
        logging.info(f"CSV file successfully read into DataFrame: {csv_file_path}")
        return df
    except FileNotFoundError:
        logging.error(f"CSV file not found: {csv_file_path}")
        return None
    except pd.errors.EmptyDataError:
        logging.warning(f"CSV file is empty: {csv_file_path}")
        return pd.DataFrame() # Return an empty DataFrame
    except Exception as e:
        logging.error(f"Error reading CSV file into DataFrame: {e}")
        return None

# Note: this reads 'from_dataframe.csv' (written by dataframe_to_csv), not 'from_log.csv'
csv_input_from_log_path = os.path.join(log_directory, 'from_dataframe.csv')
df_from_log_csv = csv_to_dataframe(csv_input_from_log_path)

if df_from_log_csv is not None:
    print("\nDataFrame created from log CSV:")
    print(df_from_log_csv)
else:
    print("\nFailed to create DataFrame from log CSV. Check the log file for errors.")

In [None]:
df_from_log_csv

In [None]:
log_directory