This notebook shows how to answer each question with a pure prompt.
- Steps:
  - Inject the data and the question into the prompt.
  - Call the LLM.
  - Read the answer from the LLM response.
- Input data: [`air_passengers.csv`](../../data/air_passengers.csv), [`melbourne_temp.csv`](../../data/melbourne_temp.csv)
  - Note: [`nyc_taxi.csv`](../../data/nyc_taxi.csv) does not fit into the prompt because of the token limit.
- Questions: [`easy_precise_questions.csv`](../../data/easy_precise_questions.csv)
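
The steps above can be sketched without any network call. The helper name `build_messages` and the toy inputs are hypothetical; the message layout (one system message, then the dataset and the question as separate user messages) follows the one used in the code cells of this notebook.

```python
def build_messages(instruction: str, table_markdown: str, question: str) -> list:
    """Assemble the chat messages: instructions, then the data, then the question."""
    return [
        {"role": "system", "content": instruction},
        {
            "role": "user",
            "content": f"Here is the dataset in the markdown format. {table_markdown}",
        },
        {"role": "user", "content": question},
    ]


# Hypothetical toy inputs, for illustration only.
messages = build_messages(
    instruction='Answer with JSON of the form {"output": ...}.',
    table_markdown="| date | value |\n|---|---|\n| 2020-01 | 1 |",
    question="What is the mean of the target variable?",
)
```

Because the whole table is serialized into one message, the prompt size grows with the number of rows, which is why the largest file is excluded.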

In [1]:
from openai import AzureOpenAI
from dotenv import load_dotenv
import pandas as pd
import os
from jinja2 import Environment, FileSystemLoader
from pathlib import Path
from tqdm.notebook import tqdm
import time
import sys
import json

module_path = os.path.abspath(os.path.join(".."))
if module_path not in sys.path:
    sys.path.append(module_path)

from utils.utils import convert_types, eval
from utils.vars import DATA_DIR, EXCEPT_FILES, QUESTION_PATH

load_dotenv()

True

In [2]:
# nyc_taxi.csv is too large to fit into the prompt, so it is skipped
# TODO: find another way to pass the data to the model
# This approach gives the worst results, but it is pretty fast.

In [3]:
# read questions
df_questions = pd.read_csv(QUESTION_PATH)

# get the client object
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-10-01-preview",  # different from assistant
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

In [4]:
prompt_path = "prompts/prompt.jinja2"

In [5]:
# read the prompt
instruction = (
    Environment(loader=FileSystemLoader(".")).get_template(prompt_path).render()
)

df_result = []

for file_path in Path(DATA_DIR).glob("*.csv"):
    if file_path.name in EXCEPT_FILES + ["nyc_taxi.csv"]:
        continue
    print(f"file: {file_path.name}")
    # read the data
    df = pd.read_csv(file_path)

    # call openai
    for _, row in tqdm(df_questions.iterrows(), total=len(df_questions)):
        question = row["question"]
        answer_true = row[file_path.name]

        start_time = time.time()

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": instruction},
                {
                    "role": "user",
                    "content": f"Here is the dataset in the markdown format. {df.to_markdown()}",
                },
                {"role": "user", "content": question},
            ],
            temperature=0,
            top_p=1,
            seed=42,
        )

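        # format the output
        # NOTE: only JSONDecodeError is handled here. When parsing fails (for
        # example because the model wraps the JSON in a Markdown code fence),
        # answer_pred keeps its value from the previous iteration, and that
        # stale value is what is printed here and recorded below.
        try:
            answer_pred = json.loads(response.choices[0].message.content)["output"]
        except json.decoder.JSONDecodeError:
            print(
                f"Original output: {response.choices[0].message.content}; JSONDecodeError: {answer_pred}"
            )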

        df_result.append(
            {
                **response.usage.to_dict(),
                "question": question,
                "execution_time_s": round(time.time() - start_time, 2),
                "file": file_path.name,
                "answer_pred": convert_types(answer_pred),
                "answer_true": convert_types(answer_true),
            }
        )

file: air_passengers.csv



Original output: ```json
{"output": "0"}
```; JSONDecodeError: 0
file: melbourne_temp.csv



Original output: ```json
{"output": "8"}
```; JSONDecodeError: 17.3
Original output: ```json
{"output": "0"}
```; JSONDecodeError: 17.3
Original output: ```json
{"output": "2"}
```; JSONDecodeError: 17.3
Original output: ```{"output": "3652"}```; JSONDecodeError: 17.3
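
The failures logged above all come from replies wrapped in a Markdown code fence. A minimal sketch of a more tolerant parser follows; `parse_output` and `FENCE_RE` are hypothetical names, not part of this notebook or of `utils`, and returning `None` on failure is an assumption about how a failed parse could be recorded.

```python
import json
import re

# Matches a reply that is entirely wrapped in a code fence, with an optional "json" tag.
FENCE_RE = re.compile(r"^```(?:json)?\s*(.*?)\s*```$", re.DOTALL)


def parse_output(raw: str):
    """Return the "output" field of a JSON reply, or None if it cannot be parsed."""
    text = raw.strip()
    match = FENCE_RE.match(text)
    if match:
        # Keep only the text between the opening and closing fence.
        text = match.group(1)
    try:
        return json.loads(text)["output"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Assumption: a failed parse is recorded as None instead of a stale value.
        return None


fenced = parse_output('```json\n{"output": "0"}\n```')
inline = parse_output('```{"output": "3652"}```')
plain = parse_output('{"output": "8"}')
broken = parse_output("not json at all")
```

Using a parser like this in the loop would change which answers are recorded, so the accuracy figures reported in this notebook would no longer apply as-is.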


In [6]:
# eval
df_result = pd.DataFrame(df_result)

# loop through each file
eval(df=df_result, details=False)

File: air_passengers.csv; Accuracy: 0.875
question: What is the Q3 of the target variable?
answer_pred: 318.0
answer_true: 360.5
**************************************************
question: What is the frequency of the given time series data?
answer_pred: Monthly
answer_true: MS
**************************************************
File: melbourne_temp.csv; Accuracy: 0.4375
question: What is the mean of the target variable?
answer_pred: 12.46
answer_true: 11.18
**************************************************
question: What is the medium of the target variable?
answer_pred: 15.1
answer_true: 11.0
**************************************************
question: What is the standard deviation of the target variable?
answer_pred: 4.34
answer_true: 4.07
**************************************************
question: What is the Q1 of the target variable?
answer_pred: 10.1
answer_true: 8.3
**************************************************
question: What is the Q3 of the target variable?
answer_pre
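
How `eval` from `utils` scores the answers is not shown in this notebook. A minimal sketch, assuming accuracy is the fraction of questions per file where `answer_pred` equals `answer_true` exactly, could look like this; `accuracy_by_file` and the sample records are hypothetical.

```python
from collections import defaultdict


def accuracy_by_file(records):
    """Return {file name: share of records whose prediction equals the true answer}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        totals[rec["file"]] += 1
        # A bool adds 1 for a match and 0 for a miss.
        hits[rec["file"]] += rec["answer_pred"] == rec["answer_true"]
    return {name: hits[name] / totals[name] for name in totals}


# Hypothetical records, for illustration only.
records = [
    {"file": "a.csv", "answer_pred": 1.0, "answer_true": 1.0},
    {"file": "a.csv", "answer_pred": "Monthly", "answer_true": "MS"},
    {"file": "b.csv", "answer_pred": 3, "answer_true": 3},
]
scores = accuracy_by_file(records)
```

Under this exact-match assumption, an answer such as `Monthly` counts as a miss against `MS` even though both describe a monthly frequency.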

In [7]:
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    display(df_result.groupby(["file"]).describe())

| file | metric | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| air_passengers.csv | completion_tokens | 16.0 | 7.375 | 1.857418 | 6.0 | 6.0 | 6.0 | 8.0 | 11.0 |
| air_passengers.csv | prompt_tokens | 16.0 | 2463.25 | 1.732051 | 2460.0 | 2462.75 | 2463.0 | 2464.0 | 2466.0 |
| air_passengers.csv | total_tokens | 16.0 | 2470.625 | 3.201562 | 2466.0 | 2468.75 | 2470.5 | 2472.0 | 2477.0 |
| air_passengers.csv | execution_time_s | 16.0 | 0.585 | 0.2611 | 0.3 | 0.425 | 0.535 | 0.6175 | 1.33 |
| melbourne_temp.csv | completion_tokens | 16.0 | 8.625 | 1.543805 | 6.0 | 8.0 | 8.0 | 10.0 | 11.0 |
| melbourne_temp.csv | prompt_tokens | 16.0 | 69100.25 | 1.732051 | 69097.0 | 69099.75 | 69100.0 | 69101.0 | 69103.0 |
| melbourne_temp.csv | total_tokens | 16.0 | 69108.875 | 2.729469 | 69103.0 | 69108.0 | 69109.0 | 69109.25 | 69114.0 |
| melbourne_temp.csv | execution_time_s | 16.0 | 6.94375 | 5.799137 | 0.94 | 5.49 | 6.485 | 6.8825 | 27.25 |
