# Explingo Experiment Runner

This notebook:
1. Loads the gold-standard dataset, prepares the metric functions, and verifies that they give the maximum score on the gold-standard dataset and lower scores on less aligned datasets.
2. Runs the prompt-design, few-shot, and bootstrap-few-shot experiments on a testing dataset.

## Setup
Import the necessary libraries, prepare the LLM, and load the gold-standard dataset.

**Note: To run these cells, you need a `keys.yaml` file in the top-level Explingo directory with the following line:**
```yaml
openai_api_key: <your_openai_api_key>
```

In [2]:
import os
import random

import dspy
import yaml

import core
import examples
import metrics

In [3]:
# Read the OpenAI API key from keys.yaml in the top-level Explingo directory
with open(os.path.join("..", "keys.yaml"), "r") as file:
    config = yaml.safe_load(file)
    openai_api_key = config["openai_api_key"]

llm = dspy.OpenAI(model='gpt-4o', api_key=openai_api_key, max_tokens=2000)

# Labeled examples include a gold-standard narrative; unlabeled ones do not
labeled_train, labeled_eval, unlabeled_train, unlabeled_eval = examples.get_data("gold_standards.json")
train_data = labeled_train + unlabeled_train
eval_data = labeled_eval + unlabeled_eval

Some examples include gold-standard narratives; others include only a sample explanation.
- The former make up the gold-standard dataset, used for tuning the evaluation metrics and providing few-shot examples.
- The latter make up the testing dataset, used for evaluation and for bootstrapping few-shot examples.
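
The distinction between the two kinds of examples can be sketched as follows. This is a minimal illustration and not the behavior of `examples.get_data`: the dictionary layout, the sample records, and the `split_by_label` helper are all assumptions made for the example.

```python
def split_by_label(records):
    """Partition records into (labeled, unlabeled) by presence of a narrative."""
    labeled = [r for r in records if r.get("narrative")]
    unlabeled = [r for r in records if not r.get("narrative")]
    return labeled, unlabeled


# Hypothetical records: both carry an explanation, only the first has a narrative.
records = [
    {
        "explanation": "(size, 1500, +20000)",
        "narrative": "The house's size of 1500 raised the predicted price by about $20,000.",
    },
    {"explanation": "(age, 40, -5000)"},
]
labeled, unlabeled = split_by_label(records)
print(len(labeled), len(unlabeled))
```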

In [4]:
# Length of the longest gold-standard narrative, used as the conciseness threshold
max_optimal_length = max(len(d.narrative) for d in labeled_train)
max_optimal_length

474

Next, we set up the evaluation metrics. Each metric is scored on a scale from 0 to 2:
- Accuracy: the narrative accurately describes the information in the explanation.
- Fluency: the narrative is coherent and natural compared to the gold-standard narratives. We pass in a small sample of good and bad narratives from the gold-standard dataset to compare against.
- Conciseness: the narrative is not too long compared to the gold-standard narratives. For now, any narrative that is no longer than the longest gold-standard narrative scores 2.
- Context Awareness: the rationalization given alongside the explanation is relevant.
- Completeness: the narrative includes all relevant information from the original explanation (currently not used; see the TODO below).

TODO: The completeness metric is currently removed, because the accuracy metric ends up encompassing it (and the definition of "complete" depends on the gold standard). We need to decide whether to add it back in.
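
The conciseness rule described above (full marks for any narrative no longer than the longest gold-standard narrative) can be sketched as below. This is not the code in `metrics.conciseness`: the function name is hypothetical, and the linear decay applied to over-long narratives is an assumption, since the text only specifies the full-score case.

```python
def conciseness_score(narrative, max_optimal_length, max_score=2.0):
    """Score length on a 0-2 scale; full marks at or under max_optimal_length."""
    length = len(narrative)
    if length <= max_optimal_length:
        return max_score
    # ASSUMPTION: the score decays linearly and reaches 0 at twice the optimal length.
    overshoot = (length - max_optimal_length) / max_optimal_length
    return max(0.0, max_score * (1.0 - overshoot))


# 474 is the longest gold-standard narrative length computed earlier in this notebook.
at_limit = conciseness_score("a" * 474, 474)     # exactly at the limit
half_over = conciseness_score("a" * 711, 474)    # 50% over the limit
far_over = conciseness_score("a" * 1000, 474)    # more than twice the limit
print(at_limit, half_over, far_over)
```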

**Note: You can set `verbose=1` to see the narratives generated, or `verbose=2` to see the explanations, narratives, and rationalizations**

In [5]:
example_good_narratives = random.sample([d.narrative for d in labeled_train], 5)
example_bad_narratives = random.sample([d.bad_narrative for d in labeled_train], 5)

exp_metrics = metrics.Metrics(
    [
        metrics.accuracy,
        metrics.fluency,
        metrics.conciseness,
        metrics.context_awareness,
    ],
    verbose=0,
    metric_kwargs={
        "conciseness": {"max_optimal_length": max_optimal_length},
        "fluency": {"good_narratives": example_good_narratives, "bad_narratives": example_bad_narratives},
    },
)

Finally, we set up the main experiment runner object.

In [6]:
explingo = core.Explingo(llm=llm, context="The model predicts house prices", metric=exp_metrics,
                         labeled_train_data=labeled_train, unlabeled_train_data=unlabeled_train)

## Verifying the Evaluation Metrics

In the following code block, we verify the metrics by comparing the average score on our gold-standard dataset (used to tune the metrics) against other datasets that use different explanation styles. We expect the gold-standard average to be very close to `2 * len(metrics)` (since each metric is scored on a scale of 0 to 2), and the averages on the other datasets to be lower.

TODO: We do not currently verify the context awareness metric, as the gold-standard dataset does not include a rationalization.

In [7]:
metric_verification_datasets = ["gold_standards.json", "unaligned_examples_1.json", "unaligned_examples_2.json"]

# Example datasets do not include a rationalization, so we skip context awareness 
ver_metrics = metrics.Metrics(
    [
        metrics.accuracy,
        metrics.fluency,
        metrics.conciseness,
    ], verbose=0, 
    metric_kwargs={"conciseness": {"max_optimal_length": max_optimal_length},
                   "fluency": {"good_narratives": example_good_narratives, "bad_narratives": example_bad_narratives}}
)

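for dataset in metric_verification_datasets:
    score = 0
    # Use a separate name so the gold-standard labeled_train loaded above is not overwritten
    dataset_examples, _, _, _ = examples.get_data(dataset, split=1)
    for example in dataset_examples:
        # The example is passed as both the reference and the prediction
        score += ver_metrics(example, example)[0]
    print(f"Dataset: {dataset}, Average score: {score/len(dataset_examples)}/{len(ver_metrics.metric_funcs)*2.0}")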

Dataset: gold_standards.json, Average score: 5.9/6.0
Dataset: unaligned_examples_1.json, Average score: 5.9/6.0
Dataset: unaligned_examples_2.json, Average score: 5.444444444444445/6.0
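
The expectation stated in this section (other datasets should score lower than the gold standard) can be checked programmatically. The sketch below uses a hypothetical helper, `datasets_not_below_gold`, and the averages copied from the output above.

```python
def datasets_not_below_gold(avg_scores, gold_name):
    """Return the non-gold datasets whose average is not strictly below gold's."""
    gold = avg_scores[gold_name]
    return [
        name
        for name, score in avg_scores.items()
        if name != gold_name and score >= gold
    ]


# Averages copied from the output above (the maximum is 6.0 for three metrics).
avg_scores = {
    "gold_standards.json": 5.9,
    "unaligned_examples_1.json": 5.9,
    "unaligned_examples_2.json": 5.444444444444445,
}
flagged = datasets_not_below_gold(avg_scores, "gold_standards.json")
print(flagged)
```

On the scores shown in this run, `unaligned_examples_1.json` ties the gold standard, so only `unaligned_examples_2.json` is clearly separated by the three verified metrics.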


## Basic prompt design experiment

We begin with basic prompts. With 4 metrics (completeness excluded), each scored from 0 to 2, the maximum total score is 8.

We generate narratives/rationalizations on `max_iters=5` sample explanations, and return the average total score.
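
As a rough illustration of how an average total score and per-metric averages relate, the sketch below aggregates per-example scores into a `(total, per_metric)` pair, the same shape that the results printed in this notebook take. The `average_scores` helper and the sample scores are hypothetical, and the aggregation inside `core.Explingo` may differ.

```python
def average_scores(per_example_scores):
    """Average each metric across examples; return (total, per-metric averages)."""
    n = len(per_example_scores)
    metric_names = per_example_scores[0].keys()
    per_metric = {
        m: sum(s[m] for s in per_example_scores) / n for m in metric_names
    }
    # The total is the sum of the per-metric averages, so its maximum is 2 * number of metrics.
    return sum(per_metric.values()), per_metric


# Hypothetical scores for two examples (each metric is in the range 0 to 2).
scores = [
    {"accuracy": 2, "fluency": 1, "conciseness": 2, "context_awareness": 2},
    {"accuracy": 1, "fluency": 2, "conciseness": 2, "context_awareness": 1},
]
total, per_metric = average_scores(scores)
print(total, per_metric)
```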

In [8]:
# Utility for cleaner results

def pretty_print(result):
    s = f"Total score: {result[0]}"
    s2 = ", ".join([f"{k}: {v}" for k, v in result[1].items()])
    print(f"{s} ({s2})")

In [9]:
prompts = [
    "You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative.",
    "You are helping users who do not have experience working with ML understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Make your answers sound as natural as possible.",
    "You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Be sure to explicitly mention all values from the explanation in your response.",
]

for prompt in prompts:
    print(f"Prompt: {prompt}")
    pretty_print(explingo.run_basic_prompting_experiment(eval_data, prompt=prompt, max_iters=5))
    

Prompt: You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative.
Total score: 6.8 (accuracy: 1.6, fluency: 1.4, conciseness: 2.0, context_awareness: 1.8)
Prompt: You are helping users who do not have experience working with ML understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Make your answers sound as natural as possible.
Total score: 7.0 (accuracy: 1.6, fluency: 1.6, conciseness: 2.0, context_awareness: 1.8)
Prompt: You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Be sure to explicitly mention all values from the explanation in your response.
Total score: 7.0 (accuracy: 2.0, fluency: 1.2, conciseness: 2.0, context_awareness: 1.8)


## Few-shot experiment

Next, we repeat the experiment with the addition of N few-shot examples from the gold-standard dataset.
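
The idea of adding N few-shot examples can be sketched as prepending N gold-standard (explanation, narrative) pairs to the prompt. The helper name, the record layout, and the text format below are assumptions for illustration; the notebook's `run_few_shot_experiment` handles this internally and may do it differently.

```python
import random


def build_few_shot_prompt(instruction, gold_examples, n_few_shot, seed=0):
    """Append n_few_shot randomly chosen (explanation, narrative) pairs to an instruction."""
    rng = random.Random(seed)  # seeded so the selection is reproducible
    chosen = rng.sample(gold_examples, n_few_shot)
    demos = "\n\n".join(
        f"Explanation: {ex['explanation']}\nNarrative: {ex['narrative']}"
        for ex in chosen
    )
    return f"{instruction}\n\n{demos}" if demos else instruction


# Hypothetical gold-standard examples.
gold_examples = [
    {"explanation": "(size, 1500, +20000)", "narrative": "The size raised the price."},
    {"explanation": "(age, 40, -5000)", "narrative": "The age lowered the price."},
    {"explanation": "(garage, yes, +8000)", "narrative": "The garage raised the price."},
]
prompt = build_few_shot_prompt("Convert the explanation into a narrative.", gold_examples, 2)
print(prompt)
```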

In [10]:
for i in [1, 3, 5]:
    print(f"Few-shot n: {i}")
    pretty_print(explingo.run_few_shot_experiment(eval_data, max_iters=5, n_few_shot=i))

Few-shot n: 1
Total score: 7.4 (accuracy: 2.0, fluency: 1.6, conciseness: 2.0, context_awareness: 1.8)
Few-shot n: 3
Total score: 7.2 (accuracy: 1.8, fluency: 1.6, conciseness: 2.0, context_awareness: 1.8)
Few-shot n: 5
Total score: 7.4 (accuracy: 1.8, fluency: 1.8, conciseness: 2.0, context_awareness: 1.8)


## Bootstrapped few-shot
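Next, we repeat the experiment with few-shot examples bootstrapped by DSPy to optimize the evaluation metrics. We request 3 or 5 bootstrapped examples (`n_bootstrapped_few_shot`), each combined with 0 or 3 labeled few-shot examples (`n_labeled_few_shot`).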
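
The general idea behind bootstrapping few-shot examples is to generate outputs for unlabeled inputs and keep only those that score well on the metric, stopping once enough have been collected. The sketch below is a simplified, library-free illustration of that idea and not DSPy's implementation; `bootstrap_demos`, its parameters, and the stub `generate` and `metric` functions are all hypothetical.

```python
def bootstrap_demos(candidates, generate, metric, n_bootstrapped, threshold):
    """Collect generated outputs that pass the metric; stop once enough are found.

    Returns (demos, number of candidates examined).
    """
    demos = []
    if n_bootstrapped <= 0:
        return demos, 0
    for seen, example in enumerate(candidates, start=1):
        output = generate(example)
        if metric(example, output) >= threshold:
            demos.append((example, output))
        if len(demos) == n_bootstrapped:
            return demos, seen
    return demos, len(candidates)


# Stub generator and metric, standing in for the LLM and the evaluation metrics.
def generate(example):
    return example.upper()


def metric(example, output):
    return 2 if len(output) % 2 == 1 else 0


demos, seen = bootstrap_demos(
    ["a", "bb", "ccc", "dddd", "eeeee"], generate, metric, n_bootstrapped=2, threshold=2
)
print(f"Bootstrapped {len(demos)} demos after {seen} examples")
```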

In [11]:
for i, j in [[0, 3], [0, 5], [3, 3], [3, 5]]:
    print(f"Few-shot n: {i}, Bootstrapped n: {j}")
    pretty_print(explingo.run_bootstrap_few_shot_experiment(eval_data, max_iters=5, n_labeled_few_shot=i, n_bootstrapped_few_shot=j))

Few-shot n: 0, Bootstrapped n: 3
Bootstrapped 3 full traces after 7 examples in round 0.
Bootstrapped 3 full traces after 6 examples in round 0.
Bootstrapped 3 full traces after 10 examples in round 0.
Bootstrapped 3 full traces after 7 examples in round 0.
Bootstrapped 3 full traces after 6 examples in round 0.
Total score: 6.8 (accuracy: 1.8, fluency: 1.2, conciseness: 2.0, context_awareness: 1.8)
Few-shot n: 0, Bootstrapped n: 5
Bootstrapped 3 full traces after 6 examples in round 0.
Bootstrapped 3 full traces after 7 examples in round 0.
Bootstrapped 3 full traces after 7 examples in round 0.
Bootstrapped 3 full traces after 8 examples in round 0.
Bootstrapped 3 full traces after 6 examples in round 0.
Total score: 7.0 (accuracy: 1.8, fluency: 1.4, conciseness: 2.0, context_awareness: 1.8)
Few-shot n: 3, Bootstrapped n: 3
Bootstrapped 3 full traces after 7 examples in round 0.
Bootstrapped 3 full traces after 7 examples in round 0.
Bootstrapped 3 full traces after 10 examples in round 0.
Bootstrapped 3 full traces after 7 examples in round 0.
Bootstrapped 3 full traces after 10 examples in round 0.
Total score: 7.2 (accuracy: 2.0, fluency: 1.4, conciseness: 2.0, context_awareness: 1.8)
Few-shot n: 3, Bootstrapped n: 5
Bootstrapped 3 full traces after 7 examples in round 0.
Bootstrapped 3 full traces after 7 examples in round 0.
Bootstrapped 3 full traces after 7 examples in round 0.
Bootstrapped 3 full traces after 10 examples in round 0.
Bootstrapped 3 full traces after 10 examples in round 0.
Total score: 7.0 (accuracy: 2.0, fluency: 1.2, conciseness: 2.0, context_awareness: 1.8)
