# Explingo Experiment Runner

This notebook:
1. Loads the gold-standard dataset, prepares the metric functions, and verifies that they give the maximum score on the gold-standard dataset
   TODO: ensure that narratives very unlike the gold standard score lower
2. Runs the prompt-design, few-shot, and bootstrap-few-shot experiments on a testing dataset

## Setup
Import necessary libraries, prepare the LLM, and load the gold-standard dataset

**Note: To run these cells, you need a `keys.yaml` file in the top-level Explingo directory with the following line:**
```yaml
openai_api_key: <your_openai_api_key>
```

In [None]:
import os
import random

import dspy
import yaml

import core
import examples
import metrics

In [None]:
with open(os.path.join("..", "keys.yaml"), "r") as file:
    config = yaml.safe_load(file)
    openai_api_key = config["openai_api_key"]

llm = dspy.OpenAI(model='gpt-4o', api_key=openai_api_key, max_tokens=2000)

data = examples.load_examples("examples.json")

Some examples include gold-standard narratives; others include only a sample explanation.
- The former makes up the gold-standard dataset used for tuning the evaluation metrics and providing few-shot examples.
- The latter makes up the testing dataset used for evaluation and for bootstrapping few-shot examples

TODO: Right now, we draw few-shot examples and evaluate on the same dataset, which may bias the results. We should split the data into separate train and test datasets.
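
The partition described above, together with the train/test separation proposed in the TODO, could look like the following minimal sketch. It uses `SimpleNamespace` objects as stand-ins for the loaded examples. The `partition_examples` and `split_gold_standards` helpers, the `train_fraction` parameter, and the fixed seed are illustrative assumptions and are not part of Explingo.

```python
import random
from types import SimpleNamespace


def partition_examples(data):
    """Split examples into gold-standard (has a narrative) and testing (explanation only)."""
    gold = [d for d in data if hasattr(d, "narrative")]
    testing = [d for d in data if not hasattr(d, "narrative")]
    return gold, testing


def split_gold_standards(gold, train_fraction=0.5, seed=0):
    """Hypothetical helper: hold out part of the gold-standard set for evaluation."""
    shuffled = list(gold)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in data; the real examples come from examples.load_examples
demo_data = [
    SimpleNamespace(explanation="explanation A", narrative="narrative A"),
    SimpleNamespace(explanation="explanation B", narrative="narrative B"),
    SimpleNamespace(explanation="explanation C"),
    SimpleNamespace(explanation="explanation D", narrative="narrative D"),
]
demo_gold, demo_testing = partition_examples(demo_data)
few_shot_pool, held_out = split_gold_standards(demo_gold, train_fraction=0.5)
```

With a split like this, few-shot examples would be drawn only from `few_shot_pool`, while `held_out` would be reserved for scoring.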

In [None]:
gold_standards = [d for d in data if hasattr(d, "narrative")]
max_optimal_length = max([len(d.narrative) for d in gold_standards])
max_optimal_length

Next, we set up the evaluation metrics. We use the following metrics, all scored on a scale from 0-2:
- Accuracy: the narrative accurately describes the information in the explanation
- Fluency: the narrative is coherent and natural, as compared to the gold-standard explanations. We pass in a small list of sample narratives from the gold-standard dataset to compare against
- Conciseness: the narrative is not too long, as compared to the gold-standard explanations. For now, any narrative that is no longer than the longest gold-standard narrative will score 2
- Context Awareness: the rationalization given alongside the explanation is relevant
- Completeness: the narrative includes all relevant information from the original explanation (currently disabled; see the TODO below)

TODO: We have currently removed the completeness metric, because the accuracy metric ends up encompassing it (and the definition of "complete" depends on the gold standard). We need to decide whether to add it back in.
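
To make the conciseness rule concrete, here is a minimal sketch of a length-based score on the 0-2 scale. Only the first branch (full score at or below the longest gold-standard length) comes from the description above. The linear fall-off to 0 at twice that length and the name `conciseness_score` are assumptions made for illustration, and may not match what `metrics.conciseness` does.

```python
def conciseness_score(narrative, max_optimal_length, max_score=2.0):
    """Score narrative length on a 0-2 scale (illustrative only)."""
    if max_optimal_length <= 0:
        raise ValueError("max_optimal_length must be positive")
    length = len(narrative)
    # From the text: anything no longer than the longest gold standard scores 2
    if length <= max_optimal_length:
        return max_score
    # Assumption: the score decays linearly and reaches 0 at twice the optimal length
    overshoot = (length - max_optimal_length) / max_optimal_length
    return max(0.0, max_score * (1.0 - overshoot))
```

For example, under these assumptions a narrative 50% longer than the longest gold-standard narrative would score 1.0.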

In [None]:
example_good_narratives = random.sample([d.narrative for d in gold_standards], 5)
example_bad_narratives = random.sample([d.bad_narrative for d in gold_standards], 5)

exp_metrics = metrics.Metrics(
    [
        metrics.accuracy,
        metrics.fluency,
        metrics.conciseness,
        metrics.context_awareness,
    ],
    verbose=0,
    metric_kwargs={
        "conciseness": {"max_optimal_length": max_optimal_length},
        "fluency": {
            "good_narratives": example_good_narratives,
            "bad_narratives": example_bad_narratives,
        },
    },
)

Finally, we set up the main experiment runner object

In [None]:
explingo = core.Explingo(llm=llm, context="The model predicts house prices", 
                         examples=data, metric=exp_metrics)

## Verifying the Evaluation Metrics

In the following code block, we verify that all the evaluation metrics give the maximum score on the gold-standard dataset. We are defining a successful narrative as one that matches the standards of the gold-standard dataset; this allows us to ensure that the metrics are aligned to this gold-standard.

**Note: a few gold-standard examples occasionally fail this check, but most runs currently show 100% success.**

TODO: To complete this, we also need to ensure that narratives that are unlike the gold-standard score lower, using a separate dataset

TODO: We do not currently verify the context awareness metric, as the gold-standard dataset does not include a rationalization
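
The pass criterion used in this section, and the complementary check from the first TODO above, can be sketched with a stand-in scorer. `verify_alignment` and `toy_total_score` are hypothetical names. The toy scorer simply rewards short narratives and is not how `metrics.Metrics` scores anything; it only exists so the sketch runs without an LLM.

```python
MAX_PER_METRIC = 2  # each metric is scored from 0 to 2


def verify_alignment(score_fn, good_narratives, bad_narratives, n_metrics):
    """Return gold-standard narratives below the maximum and bad narratives that reach it."""
    top = MAX_PER_METRIC * n_metrics
    good_failed = [n for n in good_narratives if score_fn(n) != top]
    bad_failed = [n for n in bad_narratives if score_fn(n) >= top]
    return good_failed, bad_failed


def toy_total_score(narrative, max_len=60):
    """Hypothetical stand-in for a 3-metric total score."""
    return 6 if len(narrative) <= max_len else 3


good = ["The large living area increased the predicted price."]
bad = [
    "The living area, which is large, and which matters a lot, "
    "increased the price a great deal indeed."
]
good_failed, bad_failed = verify_alignment(toy_total_score, good, bad, n_metrics=3)
print(f"Gold-standard failures: {len(good_failed)}, undetected bad narratives: {len(bad_failed)}")
```

The metrics would count as aligned only when both returned lists are empty.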

In [None]:
VERIFY = True

if VERIFY:
    # Gold standard dataset does not include a rationalization, so we skip context awareness 
    ver_metrics = metrics.Metrics(
        [
            metrics.accuracy,
            metrics.fluency,
            metrics.conciseness,
        ],
        verbose=0,
        metric_kwargs={
            "conciseness": {"max_optimal_length": max_optimal_length},
            "fluency": {
                "good_narratives": example_good_narratives,
                "bad_narratives": example_bad_narratives,
            },
        },
    )
    total_passed, total_failed = 0, 0
    for gold_standard in gold_standards:
        result = ver_metrics(gold_standard, gold_standard)[0]
        if result != 2*len(ver_metrics.metric_funcs):
            print(f"Failed {gold_standard}")
            total_failed += 1
        else:
            total_passed += 1
    print(f"Total passed: {total_passed}, total failed: {total_failed}")

## Basic Prompting

We begin with basic prompts. With 4 metrics (without completeness), each with a score of 0-2, the maximum score is 8. 

We generate narratives/rationalizations on `max_iters=5` sample explanations, and return the average total score.
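
As a worked example of the arithmetic above, the following sketch sums the per-metric scores of each example and averages the totals. How `run_experiment` aggregates scores internally is not shown in this notebook, so `average_total_score` and the score dictionaries are assumptions for illustration only.

```python
def average_total_score(per_example_scores):
    """Sum the metric scores for each example, then average the totals."""
    if not per_example_scores:
        raise ValueError("need at least one scored example")
    totals = [sum(scores.values()) for scores in per_example_scores]
    return sum(totals) / len(totals)


# Made-up scores for two examples, four metrics each
demo_scores = [
    {"accuracy": 2, "fluency": 2, "conciseness": 2, "context_awareness": 2},
    {"accuracy": 2, "fluency": 1, "conciseness": 2, "context_awareness": 1},
]
max_total = 2 * len(demo_scores[0])  # 4 metrics x 2 points each = 8
demo_average = average_total_score(demo_scores)
```

Here the two totals are 8 and 6, so the average total score is 7 out of a possible 8.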

TODO: We need to repeat this for each of our 5 test prompts

In [None]:
explingo.run_experiment(data, prompt_type="basic", max_iters=5)

Next, we repeat the experiment with 1, 3, and 5 few-shot examples from the gold-standard dataset.

In [None]:
for i in [1, 3, 5]:
    print(f"Few-shot n: {i}")
    print(explingo.run_experiment(data, prompt_type="few-shot", max_iters=5, few_shot_n=i))

Next, we repeat the experiment with 3 few-shot examples bootstrapped by DSPy to optimize the evaluation metrics.

TODO: We should experiment with different numbers of labeled few-shot and bootstrapped few-shot examples

In [None]:
explingo.run_experiment(data, prompt_type="bootstrap-few-shot", max_iters=5, few_shot_n=3)