# Explingo Experiment Runner

This notebook:
1. Loads the gold-standard dataset, prepares the metrics functions, and verifies that the metric functions give the maximum score on the gold-standard dataset
TODO: ensure that narratives very unlike gold-standard score lower
2. Runs the prompt-design, few-shot, and bootstrap-few-shot experiments on a testing dataset

## Setup
Import necessary libraries, prepare the LLM, and load the gold-standard dataset

**Note: To run these cells, you need a `keys.yaml` file in the top-level Explingo directory with the following line:**
```yaml
openai_api_key: <your_openai_api_key>
```

In [1]:
import examples, core 
import os
import yaml
import dspy
import metrics
import random

In [2]:
with open(os.path.join("..", "keys.yaml"), "r") as file:
    config = yaml.safe_load(file)
    openai_api_key = config["openai_api_key"]

llm = dspy.OpenAI(model='gpt-4o', api_key=openai_api_key, max_tokens=2000)

data = examples.load_examples("examples.json")

Some examples include gold-standard narratives; others include only a sample explanation.
- The former makes up the gold-standard dataset used for tuning the evaluation metrics and providing few-shot examples.
- The latter makes up the testing dataset used for evaluation and for bootstrapping few-shot examples

TODO: Right now, we few-shot prompt and evaluation using the same dataset which may lead to biased results, we should separate into a test and train dataset

In [3]:
gold_standards = [d for d in data if hasattr(d, "narrative")]
max_optimal_length = max([len(d.narrative) for d in gold_standards])
max_optimal_length

474

Next, we set up the evaluation metrics. We use the following metrics, all scored on a scale from 0-2:
- Accuracy: the narrative accurately describes the information in the explanation
- Fluency: the narrative is coherent and natural, as compared to the gold-standard explanations. We pass in a small list of sample narratives from the gold-standard dataset to compare against
- Conciseness: the narrative is not too long, as compared to the gold-standard explanations. For now, any narrative that is no longer than the longest gold-standard narrative will score 2
- Context Awareness: the rationalization given alongside the explanation is relevant
- Completeness: the narrative includes all relevant information from the original explanation 

TODO: Currently, we have removed the completeness metric as the accuracy metric ends up encompassing it (and the definition of complete depends on the gold-standard). We need to decide if we want to add it back in

**Note: You can set `verbose=1` to see the narratives generated, or `verbose=2` to see the explanations, narratives, and rationalizations**

In [4]:
example_good_narratives = random.sample([d.narrative for d in gold_standards], 5)
example_bad_narratives = random.sample([d.bad_narrative for d in gold_standards], 5)

exp_metrics = metrics.Metrics(
            [
                metrics.accuracy,
                metrics.fluency,
                metrics.conciseness,
                metrics.context_awareness,
            ], verbose=0, 
        metric_kwargs={"conciseness": {"max_optimal_length": max_optimal_length},
                       "fluency": {"good_narratives": example_good_narratives, "bad_narratives": example_bad_narratives}}
        )

Finally, we set up the main experiment runner object

In [5]:
explingo = core.Explingo(llm=llm, context="The model predicts house prices", 
                         examples=data, metric=exp_metrics)

## Verifying the Evaluation Metrics

In the following code block, we verify that all the evaluation metrics give the maximum score on the gold-standard dataset. We are defining a successful narrative as one that matches the standards of the gold-standard dataset; this allows us to ensure that the metrics are aligned to this gold-standard.

**Note: occasionally, we do see a few failures, but for now most runs I am seeing 100% success.**

TODO: To complete this, we also need to ensure that narratives that are unlike the gold-standard score lower, using a separate dataset

TODO: We do not currently verify the context awareness metric, as the gold-standard dataset does not include a rationalization

In [6]:
VERIFY = False

if VERIFY:
    # Gold standard dataset does not include a rationalization, so we skip context awareness 
    ver_metrics = metrics.Metrics(
        [
            metrics.accuracy,
            metrics.fluency,
            metrics.conciseness,
        ], verbose=0, 
        metric_kwargs={"conciseness": {"max_optimal_length": max_optimal_length},
                       "fluency": {"good_narratives": example_good_narratives, "bad_narratives": example_bad_narratives}}
    )
    total_passed, total_failed = 0, 0
    for gold_standard in gold_standards:
        result = ver_metrics(gold_standard, gold_standard)[0]
        if result != 2*len(ver_metrics.metric_funcs):
            print(f"Failed {gold_standard}")
            total_failed += 1
        else:
            total_passed += 1
    print(f"Total passed: {total_passed}, total failed: {total_failed}")

# BASIC PROMPTING

We begin with basic prompts. With 4 metrics (without completeness), each with a score of 0-2, the maximum score is 8. 

We generate narratives/rationalizations on `max_iters=5` sample explanations, and return the average total score.

In [7]:
# Utility for cleaner results

def pretty_print(result):
    s = f"Total score: {result[0]}"
    s2 = ", ".join([f"{k}: {v}" for k, v in result[1].items()])
    print(f"{s} ({s2})")

In [8]:
prompts = [
    "You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative.",
    "You are helping users who do not have experience working with ML understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Make your answers sound as natural as possible.",
    "You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Be sure to explicitly mention all values from the explanation in your response.",
]

for prompt in prompts:
    print(f"Prompt: {prompt}")
    pretty_print(explingo.run_basic_prompting_experiment(data, prompt=prompt, max_iters=5))
    

Prompt: You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative.
Total score: 7.6 (accuracy: 2.0, fluency: 1.6, conciseness: 2.0, context_awareness: 2.0)
Prompt: You are helping users who do not have experience working with ML understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Make your answers sound as natural as possible.
Total score: 7.6 (accuracy: 2.0, fluency: 1.6, conciseness: 2.0, context_awareness: 2.0)
Prompt: You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Be sure to explicitly mention all values from the explanation in your response.
Total score: 7.2 (accuracy: 2.0, fluency: 1.2, conciseness: 2.0, context_awareness: 2.0)


Next, we repeat the experiment with the addition of N few-shot examples from the gold-standard dataset.

In [9]:
for i in [1, 3, 5]:
    print(f"Few-shot n: {i}")
    pretty_print(explingo.run_few_shot_experiment(data, max_iters=5, n_few_shot=i))

Few-shot n: 1
Total score: 8.0 (accuracy: 2.0, fluency: 2.0, conciseness: 2.0, context_awareness: 2.0)
Few-shot n: 3
Total score: 8.0 (accuracy: 2.0, fluency: 2.0, conciseness: 2.0, context_awareness: 2.0)
Few-shot n: 5
Total score: 8.0 (accuracy: 2.0, fluency: 2.0, conciseness: 2.0, context_awareness: 2.0)


Next, we repeat the experiment with the addition of 3 examples bootstrapped by DSPy to optimize the evaluation metrics.

TODO: We should experiment with different numbers of labeled few-shot and bootstrapped few-shot examples

In [10]:
for i, j in [[0, 3], [0, 5], [3, 3], [3, 5]]:
    print(f"Few-shot n: {i}, Bootstrapped n: {j}")
    pretty_print(explingo.run_bootstrap_few_shot_experiment(data, max_iters=5, n_labeled_few_shot=3, n_bootstrapped_few_shot=3))

Few-shot n: 0, Bootstrapped n: 3


 24%|██▍       | 6/25 [00:06<00:20,  1.08s/it]


Bootstrapped 3 full traces after 7 examples in round 0.


 24%|██▍       | 6/25 [00:05<00:17,  1.09it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 20%|██        | 5/25 [00:01<00:06,  2.97it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 28%|██▊       | 7/25 [00:05<00:13,  1.29it/s]


Bootstrapped 3 full traces after 8 examples in round 0.


 20%|██        | 5/25 [00:02<00:11,  1.68it/s]


Bootstrapped 3 full traces after 6 examples in round 0.
Total score: 7.4 (accuracy: 2.0, fluency: 1.4, conciseness: 2.0, context_awareness: 2.0)
Few-shot n: 0, Bootstrapped n: 5


 32%|███▏      | 8/25 [00:05<00:12,  1.40it/s]


Bootstrapped 3 full traces after 9 examples in round 0.


 32%|███▏      | 8/25 [00:06<00:14,  1.20it/s]


Bootstrapped 3 full traces after 9 examples in round 0.


 24%|██▍       | 6/25 [00:04<00:15,  1.24it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 20%|██        | 5/25 [00:01<00:07,  2.57it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 24%|██▍       | 6/25 [00:02<00:08,  2.16it/s]


Bootstrapped 3 full traces after 7 examples in round 0.
Total score: 7.4 (accuracy: 2.0, fluency: 1.4, conciseness: 2.0, context_awareness: 2.0)
Few-shot n: 3, Bootstrapped n: 3


 28%|██▊       | 7/25 [00:04<00:10,  1.70it/s]


Bootstrapped 3 full traces after 8 examples in round 0.


 24%|██▍       | 6/25 [00:02<00:09,  2.07it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 28%|██▊       | 7/25 [00:03<00:09,  1.83it/s]


Bootstrapped 3 full traces after 8 examples in round 0.


 24%|██▍       | 6/25 [00:02<00:06,  2.89it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 20%|██        | 5/25 [00:02<00:09,  2.22it/s]


Bootstrapped 3 full traces after 6 examples in round 0.
Total score: 7.2 (accuracy: 1.8, fluency: 1.4, conciseness: 2.0, context_awareness: 2.0)
Few-shot n: 3, Bootstrapped n: 5


 24%|██▍       | 6/25 [00:02<00:07,  2.58it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 24%|██▍       | 6/25 [00:01<00:03,  5.95it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 28%|██▊       | 7/25 [00:01<00:04,  3.88it/s]


Bootstrapped 3 full traces after 8 examples in round 0.


 24%|██▍       | 6/25 [00:05<00:16,  1.15it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 24%|██▍       | 6/25 [00:00<00:01, 10.27it/s]


Bootstrapped 3 full traces after 7 examples in round 0.
Total score: 7.4 (accuracy: 2.0, fluency: 1.4, conciseness: 2.0, context_awareness: 2.0)
