# Interactive Notebook for Understanding StreamBench
This is a step-by-step walkthrough of how to run StreamBench.


## Import Required Packages

In [6]:
import os
from pathlib import Path
from stream_bench.agents import load_agent
from stream_bench.benchmarks import load_benchmark

# Set your API keys here
os.environ.update({
    "OAI_KEY": Path("../apikeys/openai_bench.txt").read_text().strip()
})

## Setup Agent and Benchmark Configurations
The agent and benchmark configurations are normally specified in `./configs/agent/<agent_name>.yml` and `./configs/bench/<dataset_name>.yml`. This notebook uses Python dictionaries instead.

In [3]:
# Zero-shot agent
agent_cfg = {
    "agent_name": "zeroshot",
    "llm": {
        "series": "openai",  # Remember to set the environment variable "OAI_KEY"
        "model_name": "gpt-4o-mini-2024-07-18",
        "temperature": 0.0,
        "max_tokens": 512
    }
}
# BIRD benchmark
bench_cfg = {
    "bench_name": "bird",
    "split": "test",
    "feedback": "correctness",
    "seed": 42,
    "db_path": "./data/bird/dev_databases"
}
# Let the agent hold the benchmark information
agent_cfg["bench_name"] = bench_cfg["bench_name"]
agent_cfg["split"] = bench_cfg["split"]

## Load the Agent and Benchmark Dataset

In [8]:
agent = load_agent(agent_cfg["agent_name"])(agent_cfg)
bench = load_benchmark(bench_cfg["bench_name"])(**bench_cfg)

Initializing DB schema prompts...


100%|██████████| 1534/1534 [00:00<00:00, 25510.94it/s]

11 found!
Building few-shot examples...





## What Happens at Each Time Step in the Input-Feedback Sequence?
The steps below show the intermediate outputs of a single iteration of the loop in the main script `./stream_bench/pipelines/run_bench.py` (lines 83-114).
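
Before stepping through one iteration, it may help to see the overall shape of the loop. The following is a minimal sketch, not a copy of `run_bench.py`: the function name `run_stream` and the `max_steps` parameter are hypothetical, and the only calls it makes on `agent` and `bench` are the ones used in the cells of this notebook.

```python
def run_stream(agent, bench, max_steps=None):
    """Run the input-feedback loop and collect the per-step prediction results.

    Sketch only: `run_stream` and `max_steps` are hypothetical names.
    """
    results = []
    for time_step, row in enumerate(bench.get_dataset()):
        if max_steps is not None and time_step >= max_steps:
            break
        # 1. Inference: build the model input and post-process the raw output
        x = bench.get_input(row)
        model_raw_output = agent(**x)
        prediction = bench.postprocess_generation(model_raw_output, time_step)
        # 2. Feedback: compare the prediction with the label
        label = bench.get_output(row)
        pred_res = bench.process_results(
            prediction, label, return_details=True, time_step=time_step
        )
        has_feedback, feedback = bench.give_feedback(model_raw_output, row, pred_res)
        # 3. Update: let the agent use the feedback (a no-op for some agents)
        agent.update(has_feedback, **feedback)
        results.append(pred_res)
    return results
```

Calling `run_stream(agent, bench, max_steps=1)` would cover the same single time step that the cells below walk through one call at a time.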

In [9]:
# Each row is the data instance for one time step; stop after the first row
for time_step, row in enumerate(bench.get_dataset()):
    break

print(f"Time step: {time_step}")
print(f"Row: {row}")

Time step: 0
Row: {'db_id': 'card_games', 'question': 'What is the language of the card with the multiverse number 149934?', 'evidence': 'multiverse number 149934 refers to multiverseid = 149934;', 'SQL': 'SELECT language FROM foreign_data WHERE multiverseid = 149934', 'question_id': 422, 'difficulty': 'simple'}


In [14]:
# Inference
x = bench.get_input(row)
model_raw_output = agent(**x)
prediction = bench.postprocess_generation(model_raw_output, time_step)
print(f"Model input (only shows the keys): {x.keys()}")
print(f"Model raw output: {model_raw_output}")
print(f"Post-processed model output: {prediction}")

Model input (only shows the keys): dict_keys(['question', 'fewshot_template', 'prompt_zeroshot', 'prompt_fewshot', 'prompt_cot', 'feedback_template', 'refine_template'])
Model raw output: ```sql
SELECT foreign_data.language
FROM foreign_data
JOIN cards ON foreign_data.uuid = cards.uuid
WHERE cards.multiverseId = 149934;
```
Post-processed model output: SELECT foreign_data.language
FROM foreign_data
JOIN cards ON foreign_data.uuid = cards.uuid
WHERE cards.multiverseId = 149934;
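
In the output above, post-processing removed the Markdown code fence around the SQL. One way such a step could work is sketched below. This is an assumption for illustration, not the actual body of `bench.postprocess_generation`; the names `FENCE`, `FENCE_RE`, and `strip_sql_fence` are hypothetical.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
FENCE_RE = re.compile(
    FENCE + r"(?:sql)?\s*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)

def strip_sql_fence(raw_output: str) -> str:
    """Return the body of the first fenced block, or the stripped text if none."""
    match = FENCE_RE.search(raw_output)
    if match:
        return match.group(1).strip()
    return raw_output.strip()

raw = FENCE + "sql\nSELECT 1;\n" + FENCE
print(strip_sql_fence(raw))
```

Falling back to the stripped raw text keeps the function usable when the model answers without a fence.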


In [17]:
# Get feedback
label = bench.get_output(row)
pred_res = bench.process_results(  # prediction result: correctness of the model output
    prediction,
    label,  # In Text-to-SQL tasks, the SQL execution results of "prediction" and "label" are compared
    return_details=True,
    time_step=time_step
)
has_feedback, feedback = bench.give_feedback(model_raw_output, row, pred_res)
print(f"Label: {label}")
print(f"Prediction result: {pred_res}")
print(f"Feedback: {feedback}")

Label: {'SQL': 'SELECT language FROM foreign_data WHERE multiverseid = 149934', 'db_id': 'card_games', 'label': 'SELECT language FROM foreign_data WHERE multiverseid = 149934'}
Prediction result: {'result': 'Answer is NOT Correct', 'correct': 0, 'n_correct': 0, 'rolling_acc': 0.0}
Feedback: {'question': 'What is the language of the card with the multiverse number 149934?', 'self_output': 'SELECT foreign_data.language\nFROM foreign_data\nJOIN cards ON foreign_data.uuid = cards.uuid\nWHERE cards.multiverseId = 149934;', 'is_correct': 0, 'ground_truth': 'SELECT language FROM foreign_data WHERE multiverseid = 149934', 'shot_template': 'Question: {question}\n{answer}', 'memprompt_template': 'Question: {question}\nYour SQL code: {answer}\nUser Feedback: {correctness}'}
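
The prediction is judged by comparing execution results rather than SQL text, so two differently written queries can still both be correct. The sketch below illustrates the idea on a toy in-memory SQLite database. It is not StreamBench's implementation: `same_execution_result` is a hypothetical helper, the table and rows are made up, and comparing the rows as sets is an assumption.

```python
import sqlite3

def same_execution_result(conn, predicted_sql, gold_sql):
    """Return True if both queries yield the same set of rows on `conn`."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as incorrect
    gold_rows = conn.execute(gold_sql).fetchall()
    return set(predicted_rows) == set(gold_rows)

# Toy database; the schema and values are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy (id INTEGER, language TEXT)")
conn.executemany("INSERT INTO toy VALUES (?, ?)", [(1, "German"), (2, "French")])

gold = "SELECT language FROM toy WHERE id = 1"
print(same_execution_result(conn, "SELECT t.language FROM toy AS t WHERE t.id = 1", gold))
print(same_execution_result(conn, "SELECT language FROM toy", gold))
```

The first call compares two queries that are written differently but return the same row; the second returns an extra row and is therefore treated as a mismatch.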


In [19]:
# The agent updates itself based on the feedback
has_update = agent.update(has_feedback, **feedback)
print(f"Has the agent updated itself? {has_update}")  # Zero-shot agent: no updates

Has the agent updated itself? False
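
The zero-shot agent ignores the feedback, which is why `update` returns `False`. For contrast, here is a minimal sketch of an agent whose `update` does change its state by storing each feedback item as text. The class `MemoryAgentSketch` is hypothetical and is not one of StreamBench's agents; it assumes the feedback keys printed earlier in this notebook, and the "correct"/"incorrect" wording is an assumption.

```python
class MemoryAgentSketch:
    """Hypothetical agent that keeps past feedback as formatted text entries."""

    def __init__(self):
        self.memory = []

    def update(self, has_feedback, **feedback):
        """Store one memory entry; return True only if the memory changed."""
        if not has_feedback:
            return False
        entry = feedback["memprompt_template"].format(
            question=feedback["question"],
            answer=feedback["self_output"],
            correctness="correct" if feedback["is_correct"] else "incorrect",
        )
        self.memory.append(entry)
        return True
```

A later call to such an agent could prepend entries from `memory` to its prompt, which is one way feedback from earlier time steps can influence later predictions.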
