# Explingo Experiment Runner

This notebook:
1. Loads the gold-standard dataset, prepares the metric functions, and verifies that they give the maximum score on the gold-standard dataset and lower scores on less aligned datasets
2. Runs the prompt-design, few-shot, and bootstrap-few-shot experiments on a testing dataset

## Setup
Import necessary libraries and prepare the LLM

**Note: To run these cells, you need a `keys.yaml` file in the top-level Explingo directory with the following line:**
```yaml
openai_api_key: <your_openai_api_key>
```

In [22]:
import pandas as pd

from experiment_runner import ExplingoExperimentRunner
import os
import yaml
import dspy
import metrics
import random

In [23]:
with open(os.path.join("..", "keys.yaml"), "r") as file:
    config = yaml.safe_load(file)
    openai_api_key = config["openai_api_key"]

llm = dspy.OpenAI(model='gpt-4o', api_key=openai_api_key, max_tokens=2000)

Now, we create the main experiment runner object. This object takes in a dataset, and then
1. Splits the dataset into a training dataset and a testing dataset (see notes below)
2. Sets up the evaluation metrics (see notes below). The fluency metric is set up to use samples from the dataset as references
3. Runs the experiments on the testing dataset

Some examples in the datasets include gold-standard narratives; others include only a sample explanation.
- The former makes up the gold-standard dataset used for tuning the evaluation metrics and providing few-shot examples.
- The latter makes up the testing dataset used for evaluation and for bootstrapping few-shot examples
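
The split described above can be sketched as follows. This is a minimal illustration, not the `ExplingoExperimentRunner` implementation: the `explanation` and `narrative` field names, the `split_examples` helper, and the sample data are all assumptions made for the example.

```python
def split_examples(examples):
    """Split examples by whether they carry a gold-standard narrative.

    `examples` is a list of dicts. The "explanation" and "narrative" keys are
    hypothetical field names, not necessarily those used by Explingo.
    """
    gold_standard, testing = [], []
    for example in examples:
        if example.get("narrative"):
            gold_standard.append(example)  # has a hand-written narrative
        else:
            testing.append(example)  # explanation only
    return gold_standard, testing


# Made-up sample data for illustration only
examples = [
    {"explanation": "(size, 1200, +5000)", "narrative": "The house's size (1200) raised the price by 5000."},
    {"explanation": "(age, 30, -2000)"},
]
gold_standard, testing = split_examples(examples)
print(len(gold_standard), len(testing))
```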

We use the following metrics, all scored on a scale from 0-4:
- Accuracy: the narrative accurately describes the information in the explanation
- Fluency: the narrative is coherent and natural, as compared to the gold-standard narratives. We pass in a small list of sample narratives from the gold-standard dataset to compare against
- Conciseness: the narrative is not too long, as compared to the gold-standard narratives. For now, any narrative that is no longer than the longest gold-standard narrative scores 4
- Completeness: the narrative includes all relevant information from the original explanation 
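
As a rough illustration of the conciseness rule and of how the four metric scores combine into a total, consider the sketch below. Only the rule that a narrative no longer than the longest gold-standard narrative scores 4 comes from the description above. Measuring length in words, the linear falloff to 0 at twice that length, and the function names are assumptions.

```python
MAX_SCORE = 4


def conciseness_score(narrative, gold_narratives):
    """Score 4 if the narrative is no longer than the longest gold narrative.

    Assumptions (not from the notebook): length is counted in words, and the
    score falls linearly to 0 once the narrative is twice the longest length.
    """
    longest = max(len(g.split()) for g in gold_narratives)
    length = len(narrative.split())
    if length <= longest:
        return float(MAX_SCORE)
    overshoot = (length - longest) / longest  # 0.0 at the limit, 1.0 at double
    return max(0.0, MAX_SCORE * (1.0 - overshoot))


def total_score(scores):
    """Sum the per-metric scores; with four metrics the maximum is 16."""
    return sum(scores.values())


# Made-up narratives for illustration only
gold = ["The large size of the house increased its predicted price."]
short_score = conciseness_score("The size increased the price.", gold)
long_score = conciseness_score(" ".join(["word"] * 15), gold)
total = total_score(
    {"accuracy": 4.0, "completeness": 4.0, "fluency": 3.6, "conciseness": short_score}
)
print(short_score, long_score, total)
```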

**Note: You can set `verbose=1` to see the narratives generated, or `verbose=2` to see the explanations, narratives, and rationalizations**

In [24]:
# Create one experiment runner per dataset file in the eval_data folder
runners = {}
for dataset in os.listdir("eval_data"):
    runners[dataset] = ExplingoExperimentRunner(
        llm=llm,
        openai_api_key=openai_api_key,
        dataset_filepath=os.path.join("eval_data", dataset),
        verbose=0,
    )

# Accumulates one row per experiment run
results = []

## Basic prompt design experiment

We begin with basic prompts. With 4 metrics, each scored from 0 to 4, the maximum total score is 16.

We generate narratives/rationalizations on `max_iters=5` sample explanations, and return the average total score.
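
The averaging step can be sketched as follows. The per-explanation scores are made-up placeholders, and `average_scores` is a hypothetical helper rather than part of `ExplingoExperimentRunner`. It returns a `(total, per_metric)` pair, which is the shape that `pretty_print` in this notebook reads.

```python
def average_scores(per_example_scores):
    """Average per-metric scores over examples and return (total, per_metric)."""
    n = len(per_example_scores)
    metric_names = per_example_scores[0].keys()
    per_metric = {
        name: sum(scores[name] for scores in per_example_scores) / n
        for name in metric_names
    }
    return sum(per_metric.values()), per_metric


# Placeholder scores for 5 explanations, each metric on a 0-4 scale
per_example_scores = [
    {"accuracy": 4, "completeness": 4, "fluency": 4, "conciseness": 4},
    {"accuracy": 4, "completeness": 4, "fluency": 3, "conciseness": 4},
    {"accuracy": 4, "completeness": 4, "fluency": 4, "conciseness": 4},
    {"accuracy": 4, "completeness": 4, "fluency": 3, "conciseness": 4},
    {"accuracy": 4, "completeness": 4, "fluency": 4, "conciseness": 4},
]
total, per_metric = average_scores(per_example_scores)
print(total, per_metric)
```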

In [25]:
# Utilities for cleaner results

def pretty_print(result):
    s = f"Total score: {result[0]}"
    s2 = ", ".join([f"{k}: {v}" for k, v in result[1].items()])
    print(f"{s} ({s2})")
    
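def update_results(method, dataset, scores, kwargs):
    # Read the prompt from kwargs instead of relying on a global `prompt` variable.
    # Note: `method` is accepted but not currently recorded in the results.
    result = {"dataset": dataset, "prompt": kwargs.get("prompt"), "total score": scores[0]}
    result.update(scores[1])
    result.update(kwargs)
    results.append(result)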

In [26]:
prompts = ["You are helping users understand an ML model's prediction. Given an explanation and information about the model, "
           "convert the explanation into a human-readable narrative.",
           "You are helping users who do not have experience working with ML understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Make your answers sound as natural as possible.",
           "You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Be sure to explicitly mention all values from the explanation in your response.",
]

for dataset in runners:
    runner = runners[dataset]
    print(f"Dataset: {dataset}")
    for prompt in prompts:
        print(f"Prompt: {prompt}")
        scores = runner.run_basic_prompting_experiment(prompt=prompt, max_iters=5)
        update_results("basic_prompting", dataset, scores, {"prompt": prompt})
        pretty_print(scores)
        print("--")
    print("=====")

Dataset: gold_standards.json
Prompt: You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative.
Total score: 15.6 (accuracy: 4.0, completeness: 4.0, fluency: 3.6, conciseness: 4.0)
--
Prompt: You are helping users who do not have experience working with ML understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Make your answers sound as natural as possible.
Total score: 14.8 (accuracy: 4.0, completeness: 3.6, fluency: 3.2, conciseness: 4.0)
--
Prompt: You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Be sure to explicitly mention all values from the explanation in your response.
Total score: 15.6 (accuracy: 4.0, completeness: 4.0, fluency: 3.6, conciseness: 4.0)
--
==

## Few-shot experiment

Next, we repeat the experiment with the addition of N few-shot examples from the gold-standard dataset.
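
Conceptually, few-shot prompting prepends N worked examples to the instruction. The sketch below illustrates this with plain string formatting. It is not how `run_few_shot_experiment` or DSPy builds prompts internally, and the `build_few_shot_prompt` helper, the field names, and the sample data are all hypothetical.

```python
import random


def build_few_shot_prompt(instruction, gold_examples, n_few_shot, explanation, seed=0):
    """Prepend n_few_shot randomly chosen gold examples to the instruction."""
    rng = random.Random(seed)  # fixed seed so the chosen examples are reproducible
    shots = rng.sample(gold_examples, k=min(n_few_shot, len(gold_examples)))
    parts = [instruction]
    for shot in shots:
        parts.append(f"Explanation: {shot['explanation']}\nNarrative: {shot['narrative']}")
    # The final block leaves the narrative empty for the model to complete
    parts.append(f"Explanation: {explanation}\nNarrative:")
    return "\n\n".join(parts)


# Made-up gold-standard examples for illustration only
gold_examples = [
    {"explanation": "(size, 1200, +5000)", "narrative": "The size (1200) raised the price by 5000."},
    {"explanation": "(age, 30, -2000)", "narrative": "The age (30) lowered the price by 2000."},
]
prompt_text = build_few_shot_prompt(
    "Convert the explanation into a human-readable narrative.",
    gold_examples,
    n_few_shot=1,
    explanation="(rooms, 3, +1000)",
)
print(prompt_text)
```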

In [27]:
for dataset in runners:
    runner = runners[dataset]
    print(f"Dataset: {dataset}")
    for i in [1, 3, 5]:
        print(f"Few-shot n: {i}")
        scores = runner.run_few_shot_experiment(max_iters=5, n_few_shot=i, prompt=prompts[0])
        update_results("few_shot", dataset, scores, {"n_few_shot": i, "prompt": prompts[0]})
        pretty_print(scores)
        print("--")
    print("=====")

Dataset: gold_standards.json
Few-shot n: 1
Total score: 16.0 (accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4.0)
--
Few-shot n: 3
Total score: 15.8 (accuracy: 4.0, completeness: 4.0, fluency: 3.8, conciseness: 4.0)
--
Few-shot n: 5
Total score: 16.0 (accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4.0)
--
=====
Dataset: unaligned_examples_1.json
Few-shot n: 1
Total score: 15.8 (accuracy: 4.0, completeness: 4.0, fluency: 3.8, conciseness: 4.0)
--
Few-shot n: 3
Total score: 15.6 (accuracy: 4.0, completeness: 4.0, fluency: 3.6, conciseness: 4.0)
--
Few-shot n: 5
Total score: 15.8 (accuracy: 4.0, completeness: 4.0, fluency: 3.8, conciseness: 4.0)
--
=====
Dataset: unaligned_examples_2.json
Few-shot n: 1
Total score: 15.6 (accuracy: 4.0, completeness: 3.6, fluency: 4.0, conciseness: 4.0)
--
Few-shot n: 3
Total score: 15.8 (accuracy: 4.0, completeness: 4.0, fluency: 3.8, conciseness: 4.0)
--
Few-shot n: 5
Total score: 15.4 (accuracy: 4.0, completeness: 3.6, fluenc

## Bootstrapped few-shot
Next, we repeat the experiment with few-shot examples bootstrapped by DSPy to optimize the evaluation metrics. We try 3 and 5 bootstrapped examples, each with 0 and with 3 labeled few-shot examples.
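
The idea behind bootstrapping can be sketched without DSPy: run the generator on training examples, score each output with the metric, and keep the outputs that pass as demonstrations until enough have been collected. The code below is a simplified stand-in under those assumptions, not DSPy's actual optimizer. `bootstrap_demos`, `generate`, `metric`, and the threshold are hypothetical placeholders.

```python
def bootstrap_demos(train_examples, generate, metric, n_bootstrapped, threshold):
    """Collect up to n_bootstrapped generated demos whose metric passes threshold.

    Returns the accepted demos and the number of training examples tried.
    """
    demos, tried = [], 0
    for example in train_examples:
        if len(demos) >= n_bootstrapped:
            break  # enough demonstrations collected
        tried += 1
        narrative = generate(example)
        if metric(example, narrative) >= threshold:
            demos.append({"explanation": example, "narrative": narrative})
    return demos, tried


# Toy stand-ins: the "generator" upper-cases text, the "metric" rewards short inputs
train_examples = ["a", "bbbbbb", "cc", "ddd", "e"]
demos, tried = bootstrap_demos(
    train_examples,
    generate=lambda ex: ex.upper(),
    metric=lambda ex, out: 4 if len(ex) <= 3 else 0,
    n_bootstrapped=3,
    threshold=4,
)
print(f"Bootstrapped {len(demos)} demos after {tried} examples")
```

Because the loop stops as soon as enough demonstrations pass the metric, the number of examples tried can be smaller than the size of the training set.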

In [28]:
for dataset in runners:
    runner = runners[dataset]
    print(f"Dataset: {dataset}")
    for i, j in [[0, 3], [0, 5], [3, 3], [3, 5]]:
        print(f"Few-shot n: {i}, Bootstrapped n: {j}")
        scores = runner.run_bootstrap_few_shot_experiment(max_iters=5, n_labeled_few_shot=i, n_bootstrapped_few_shot=j)
        update_results("bootstrap_few_shot", dataset, scores, {"n_few_shot": i, "n_bootstrapped_few_shot": j, "prompt": prompts[0]})
        pretty_print(scores)
        print("--")
    print("=====")

Dataset: gold_standards.json
Few-shot n: 0, Bootstrapped n: 3


 14%|█▍        | 3/21 [00:01<00:11,  1.61it/s]


Bootstrapped 3 full traces after 4 examples in round 0.
Total score: 15.8 (accuracy: 4.0, completeness: 4.0, fluency: 3.8, conciseness: 4.0)
--
Few-shot n: 0, Bootstrapped n: 5


 33%|███▎      | 7/21 [00:03<00:06,  2.29it/s]


Bootstrapped 5 full traces after 8 examples in round 0.
Total score: 16.0 (accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4.0)
--
Few-shot n: 3, Bootstrapped n: 3


 14%|█▍        | 3/21 [00:00<00:00, 157.08it/s]


Bootstrapped 3 full traces after 4 examples in round 0.
Total score: 15.8 (accuracy: 4.0, completeness: 4.0, fluency: 3.8, conciseness: 4.0)
--
Few-shot n: 3, Bootstrapped n: 5


 33%|███▎      | 7/21 [00:00<00:00, 146.50it/s]


Bootstrapped 5 full traces after 8 examples in round 0.
Total score: 16.0 (accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4.0)
--
=====
Dataset: unaligned_examples_1.json
Few-shot n: 0, Bootstrapped n: 3


 57%|█████▋    | 4/7 [00:02<00:01,  1.79it/s]


Bootstrapped 3 full traces after 5 examples in round 0.
Total score: 15.2 (accuracy: 4.0, completeness: 4.0, fluency: 3.2, conciseness: 4.0)
--
Few-shot n: 0, Bootstrapped n: 5


100%|██████████| 7/7 [00:02<00:00,  3.31it/s]


Bootstrapped 5 full traces after 7 examples in round 0.
Total score: 15.2 (accuracy: 4.0, completeness: 4.0, fluency: 3.2, conciseness: 4.0)
--
Few-shot n: 3, Bootstrapped n: 3


 57%|█████▋    | 4/7 [00:00<00:00, 156.64it/s]


Bootstrapped 3 full traces after 5 examples in round 0.
Total score: 15.2 (accuracy: 4.0, completeness: 4.0, fluency: 3.2, conciseness: 4.0)
--
Few-shot n: 3, Bootstrapped n: 5


100%|██████████| 7/7 [00:00<00:00, 158.63it/s]


Bootstrapped 5 full traces after 7 examples in round 0.
Total score: 15.2 (accuracy: 4.0, completeness: 4.0, fluency: 3.2, conciseness: 4.0)
--
=====
Dataset: unaligned_examples_2.json
Few-shot n: 0, Bootstrapped n: 3


 86%|████████▌ | 6/7 [00:03<00:00,  1.63it/s]


Bootstrapped 3 full traces after 7 examples in round 0.
Total score: 14.6 (accuracy: 4.0, completeness: 4.0, fluency: 2.6, conciseness: 4.0)
--
Few-shot n: 0, Bootstrapped n: 5


100%|██████████| 7/7 [00:00<00:00,  9.64it/s]


Bootstrapped 3 full traces after 7 examples in round 0.
Total score: 14.6 (accuracy: 4.0, completeness: 4.0, fluency: 2.6, conciseness: 4.0)
--
Few-shot n: 3, Bootstrapped n: 3


 86%|████████▌ | 6/7 [00:00<00:00, 49.50it/s]


Bootstrapped 3 full traces after 7 examples in round 0.
Total score: 14.6 (accuracy: 4.0, completeness: 4.0, fluency: 2.6, conciseness: 4.0)
--
Few-shot n: 3, Bootstrapped n: 5


100%|██████████| 7/7 [00:00<00:00, 47.45it/s]


Bootstrapped 3 full traces after 7 examples in round 0.
Total score: 14.6 (accuracy: 4.0, completeness: 4.0, fluency: 2.6, conciseness: 4.0)
--
=====


In [29]:
result_df = pd.DataFrame(results)
result_df.to_csv("results.csv")
result_df

| | dataset | prompt | total score | accuracy | completeness | fluency | conciseness | n_few_shot | n_bootstrapped_few_shot |
|---|---|---|---|---|---|---|---|---|---|
| 0 | gold_standards.json | You are helping users understand an ML model's... | 15.6 | 4.0 | 4.0 | 3.6 | 4.0 | | |
| 1 | gold_standards.json | You are helping users who do not have experien... | 14.8 | 4.0 | 3.6 | 3.2 | 4.0 | | |
| 2 | gold_standards.json | You are helping users understand an ML model's... | 15.6 | 4.0 | 4.0 | 3.6 | 4.0 | | |
| 3 | unaligned_examples_1.json | You are helping users understand an ML model's... | 15.0 | 4.0 | 4.0 | 3.0 | 4.0 | | |
| 4 | unaligned_examples_1.json | You are helping users who do not have experien... | 15.2 | 4.0 | 4.0 | 3.2 | 4.0 | | |
| 5 | unaligned_examples_1.json | You are helping users understand an ML model's... | 15.6 | 4.0 | 4.0 | 3.6 | 4.0 | | |
| 6 | unaligned_examples_2.json | You are helping users understand an ML model's... | 15.0 | 4.0 | 4.0 | 3.0 | 4.0 | | |
| 7 | unaligned_examples_2.json | You are helping users who do not have experien... | 15.0 | 4.0 | 4.0 | 3.0 | 4.0 | | |
| 8 | unaligned_examples_2.json | You are helping users understand an ML model's... | 14.6 | 4.0 | 4.0 | 2.6 | 4.0 | | |
| 9 | gold_standards.json | You are helping users understand an ML model's... | 16.0 | 4.0 | 4.0 | 4.0 | 4.0 | 1.0 | |
