# Explingo Experiment Runner

This notebook:
1. Loads the gold-standard dataset, prepares the metrics functions, and verifies that the metric functions give the maximum score on the gold-standard dataset, and lower scores on less aligned datasets
2. Runs the prompt-design, few-shot, and bootstrap-few-shot experiments on a testing dataset

## Setup
Import necessary libraries and prepare the LLM

**Note: To run these cells, you need a `keys.yaml` file in the top-level Explingo directory with the following line:**
```yaml
openai_api_key: <your_openai_api_key>
```

In [20]:
import pandas as pd

from experiment_runner import ExplingoExperimentRunner
import os
import yaml
import dspy
import metrics
import random

In [21]:
with open(os.path.join("..", "keys.yaml"), "r") as file:
    config = yaml.safe_load(file)
    openai_api_key = config["openai_api_key"]

llm = dspy.OpenAI(model='gpt-4o', api_key=openai_api_key, max_tokens=1000)

Now, we create the main experiment runner object. This object takes in a dataset, and then
1. Splits the dataset into a training dataset and a testing dataset (see notes below)
2. Sets up the evaluation metrics (see notes below). The fluency metric is set up to use sample from the dataset as reference
3. Runs the experiments on the testing dataset

Some examples in the testing datasets include gold-standard narratives; others include only a sample explanation.
- The former makes up the gold-standard dataset used for tuning the evaluation metrics and providing few-shot examples.
- The latter makes up the testing dataset used for evaluation and for bootstrapping few-shot examples

We use the following metrics, all scored on a scale from 0-4:
- Accuracy: the narrative accurately describes the information in the explanation
- Fluency: the narrative is coherent and natural, as compared to the gold-standard explanations. We pass in a small list of sample narratives from the gold-standard dataset to compare against
- Conciseness: the narrative is not too long, as compared to the gold-standard explanations. For now, any narrative that is no longer than the longest gold-standard narrative will score 4
- Completeness: the narrative includes all relevant information from the original explanation 

**Note: You can set `verbose=1` to see the narratives generated, or `verbose=2` to see the explanations, narratives, and rationalizations**

In [22]:
# iterate all datasets in the eval_data folder
runners = {}
total_eval = 0
for dataset in os.listdir(os.path.join("eval_data")):
    runners[dataset] = ExplingoExperimentRunner(llm=llm, openai_api_key=openai_api_key, dataset_filepath = os.path.join("eval_data", dataset), verbose=1)
    total_eval += len(runners[dataset].eval_data)
    
print("Total eval examples:", total_eval)
results = []

eval_data\housing_1.json
Total number of examples: 35
Labeled training examples: 5
Labeled evaluation examples: 15
Unlabeled training examples: 5
Unlabeled evaluation examples: 10
---
eval_data\housing_2.json
Total number of examples: 22
Labeled training examples: 5
Labeled evaluation examples: 7
Unlabeled training examples: 5
Unlabeled evaluation examples: 5
---
eval_data\housing_3.json
Total number of examples: 22
Labeled training examples: 5
Labeled evaluation examples: 8
Unlabeled training examples: 5
Unlabeled evaluation examples: 4
---
eval_data\mushroom_1.json
Total number of examples: 30
Labeled training examples: 5
Labeled evaluation examples: 6
Unlabeled training examples: 5
Unlabeled evaluation examples: 14
---
eval_data\mushroom_2.json
Total number of examples: 30
Labeled training examples: 5
Labeled evaluation examples: 6
Unlabeled training examples: 5
Unlabeled evaluation examples: 14
---
eval_data\pdf_1.json
Total number of examples: 30
Labeled training examples: 5
Label

## Basic prompt design experiment

We begin with basic prompts. With 4 metrics (without completeness), each with a score of 0-2, the maximum score is 8. 

We generate narratives/rationalizations on `max_iters=5` sample explanations, and return the average total score.

In [23]:
# Utilities for cleaner results

def pretty_print(result):
    s = f"Total score: {result[0]}"
    s2 = ", ".join([f"{k}: {v}" for k, v in result[1].items()])
    print(f"{s} ({s2})")
    
def update_results(method, dataset, scores, kwargs):
    result = {"dataset": dataset, "prompt": prompt, "total score": scores[0]}
    result.update(scores[1])
    result.update(kwargs)
    results.append(result)

In [24]:
prompts = ["You are helping users understand an ML model's prediction. Given an explanation and information about the model, "
           "convert the explanation into a human-readable narrative.",
           "You are helping users who do not have experience working with ML understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Make your answers sound as natural as possible.",
           "You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Be sure to explicitly mention all values from the explanation in your response.",
]

for dataset in runners:
    runner = runners[dataset]
    print(f"Dataset: {dataset}")
    for prompt in prompts:
        print(f"Prompt: {prompt}")
        scores = runner.run_basic_prompting_experiment(prompt=prompt, max_iters=2)
        update_results("basic_prompting", dataset, scores, {"prompt": prompt})
        pretty_print(scores)
        print("--")
    print("=====")

Dataset: housing_1.json
Prompt: You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative.
Narrative: The model predicts the house price based on several key factors. The house is located within the Ames city limits, specifically in the NoRidge neighborhood, which contributes significantly to the price with a value of 23069.89. The above ground living area is 2198 square feet, adding 20125.75 to the price. The second floor has 1053 square feet, contributing 18094.05. The overall material and finish of the house are rated at 8, which adds 9655.79 to the price. Lastly, the house was originally constructed in the year 2000, contributing 8192.46 to the final predicted price.
Total Score: 15.0
accuracy: 4.0, completeness: 4.0, fluency: 3.0, conciseness: 4, 
--
Narrative: The model predicts the house price based on several factors. The type of foundation being wood decreases the pri

In [25]:
llm.inspect_history(n=1)




You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Be sure to explicitly mention all values from the explanation in your response.
---
Follow the following format
Context: what the model predicts
Explanation: explanation of the model's prediction
Explanation Format: format the explanation is given in
Narrative: human-readable narrative version of the explanation
---
Context: The model predicts whether a student will pass their class
Explanation: (In a romantic relationship, yes, -1.86), (Student's guardian, father, 1.12), (Family eductional support, no, -1.02)
Explanation Format: SHAP feature contribution in (feature_name, feature_value, contribution) format
Please provide the output field Narrative. Do so immediately, without additional content before or after, and precisely as the format above shows.

Narrative: The model predicts that the student will pass their 

"\n\n\nYou are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Be sure to explicitly mention all values from the explanation in your response.\n---\nFollow the following format\nContext: what the model predicts\nExplanation: explanation of the model's prediction\nExplanation Format: format the explanation is given in\nNarrative: human-readable narrative version of the explanation\n---\nContext: The model predicts whether a student will pass their class\nExplanation: (In a romantic relationship, yes, -1.86), (Student's guardian, father, 1.12), (Family eductional support, no, -1.02)\nExplanation Format: SHAP feature contribution in (feature_name, feature_value, contribution) format\nPlease provide the output field Narrative. Do so immediately, without additional content before or after, and precisely as the format above shows.\n\n\x1b[32mNarrative: The model predicts that the

## Few-shot experiment

Next, we repeat the experiment with the addition of N few-shot examples from the gold-standard dataset.

In [26]:
for dataset in runners:
    runner = runners[dataset]
    print(f"Dataset: {dataset}")
    for i in [1, 3, 5]:
        print(f"Few-shot n: {i}")
        scores = runner.run_few_shot_experiment(n_few_shot=i, prompt=prompts[0], max_iters=2)
        update_results("few_shot", dataset, scores, {"n_few_shot": i, "prompt": prompts[0]})
        pretty_print(scores)
        print("--")
    print("=====")

Dataset: housing_1.json
Few-shot n: 1
Narrative: The house's location in NoRidge increased the predicted price by about $23,000. The relatively larger above ground living space increased the price by about $20,000. The house's larger than average second floor increased the price by about $18,000. The good overall material and finish of the house (rated 8/10) increased the price by about $9,600. The house is newer than average, with a construction year of 2000, which increased the price by about $8,200.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--
Narrative: The house's wood foundation reduced the predicted price by about $18,000. The house's location in Mitchel reduced the price by about $13,500. The lower-than-average material rating also reduced the house price by about $10,000. The relatively larger three season porch area increased the price by about $10,000. Having only one bedroom above ground increased the price by about $9,000.
Total Sco

In [27]:
llm.inspect_history(n=1)




You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative.
---
Follow the following format
Context: what the model predicts
Explanation: explanation of the model's prediction
Explanation Format: format the explanation is given in
Narrative: human-readable narrative version of the explanation
---
Example 1
Context: The model predicts whether a student will pass their class
Explanation: (Family eductional support, no, -1.37), (School, MS, -0.59)
Explanation Format: SHAP feature contribution in (feature_name, feature_value, contribution) format
Narrative: The lack of family support and attending the MS school indicate a lower probability of passing.
Example 2
Context: The model predicts whether a student will pass their class
Explanation: (Family eductional support, no, -2.26), (In a romantic relationship, no, 1.11), (Sex, M, -0.60)
Explanation Format: SHAP feature contributio

"\n\n\nYou are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative.\n---\nFollow the following format\nContext: what the model predicts\nExplanation: explanation of the model's prediction\nExplanation Format: format the explanation is given in\nNarrative: human-readable narrative version of the explanation\n---\nExample 1\nContext: The model predicts whether a student will pass their class\nExplanation: (Family eductional support, no, -1.37), (School, MS, -0.59)\nExplanation Format: SHAP feature contribution in (feature_name, feature_value, contribution) format\nNarrative: The lack of family support and attending the MS school indicate a lower probability of passing.\nExample 2\nContext: The model predicts whether a student will pass their class\nExplanation: (Family eductional support, no, -2.26), (In a romantic relationship, no, 1.11), (Sex, M, -0.60)\nExplanation Format: SHAP

## Bootstrapped few-shot
Next, we repeat the experiment with the addition of 3 examples bootstrapped by DSPy to optimize the evaluation metrics.

In [28]:
for dataset in runners:
    runner = runners[dataset]
    print(f"Dataset: {dataset}")
    #for i, j in [[0, 1], [0, 3], [3, 3]]:
    for i, j in [[3, 3]]:
        print(f"Few-shot n: {i}, Bootstrapped n: {j}")
        scores = runner.run_bootstrap_few_shot_experiment(n_labeled_few_shot=i, n_bootstrapped_few_shot=j, max_iters=2)
        update_results("bootstrap_few_shot", dataset, scores, {"n_few_shot": i, "n_bootstrapped_few_shot": j, "prompt": prompts[0]})
        pretty_print(scores)
        print("--")
    print("=====")

Dataset: housing_1.json
Few-shot n: 3, Bootstrapped n: 3


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The model predicts the house price based on several key features. The above ground living area, which is 1256 square feet, decreases the predicted price by $12,527.46. The overall material and finish rating of the house, which is rated as 5, reduces the price by $10,743.76. The absence of a second floor (0 square feet) further decreases the price by $10,142.29. The house's location within the Edwards neighborhood in Ames city limits lowers the price by $9,913.81. However, the presence of a wood deck area measuring 736 square feet increases the predicted price by $9,846.38.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 10%|█         | 1/10 [00:00<00:03,  2.77it/s]

Narrative: The presence of a second floor with 854 square feet increases the predicted house price by about $12,000. The house being relatively new, built in 2003, adds approximately $9,000 to its price. However, the total basement area of 856 square feet reduces the price by about $6,000. Similarly, the first floor area of 856 square feet decreases the price by about $5,000. Lastly, the house's location in the CollgCr neighborhood reduces the price by about $5,000.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 20%|██        | 2/10 [00:00<00:02,  3.22it/s]

Narrative: The presence of walkout or garden level walls increases the predicted price by about $17,000. The house's overall good condition rating also adds about $13,000 to the price. However, the relatively smaller above ground living space reduces the price by about $12,000. The absence of a second floor further reduces the price by about $10,000. Additionally, the house's proximity to certain conditions reduces the price by about $8,000.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 30%|███       | 3/10 [00:00<00:02,  3.20it/s]


Bootstrapped 3 full traces after 4 examples in round 0.
Narrative: The house's location in the NoRidge neighborhood increases the predicted price by about $23,000. The above ground living area of 2,198 square feet adds approximately $20,000 to the price. The presence of a second floor with 1,053 square feet further increases the price by about $18,000. The overall material and finish rating of the house, which is rated as 8, adds about $9,000 to the price. Lastly, the house being relatively new, built in the year 2000, contributes an additional $8,000 to the predicted price.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The model predicts the house price based on several key features. The above ground living area, which is 1256 square feet, decreases the predicted price by $12,527.46. The overall material and finish rating of the house, which is rated as 5, reduces the price by $10,743.76. The absence of a second floor (0 square feet) further decreases the price by $10,142.29. The house's location within the Edwards neighborhood in Ames city limits lowers the price by $9,913.81. However, the presence of a wood deck area measuring 736 square feet increases the predicted price by $9,846.38.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--
Narrative: The presence of a second floor with 854 square feet increases the predicted house price by about $12,000. The house being relatively new, built in 2003, adds approximately $9,000 to its price. However, the total basement area of 856 square feet reduces the price by about $6,000. Similarly, the first floor area 

 30%|███       | 3/10 [00:00<00:00, 112.92it/s]


Bootstrapped 3 full traces after 4 examples in round 0.
Narrative: The type of foundation being wood decreases the predicted house price by about $18,000. The house's location within the Mitchel neighborhood reduces the price by about $13,000. The overall material and finish rating of the house, which is rated as 5, further decreases the price by about $10,000. However, the presence of a three-season porch area measuring 320 square feet increases the predicted price by about $10,000. Additionally, having one bedroom above ground adds about $9,000 to the price.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--
Total score: 16.0 (accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4.0)
--
=====
Dataset: housing_2.json
Few-shot n: 3, Bootstrapped n: 3


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The model predicts the house price based on several key features. The above ground living area, which is 1256 square feet, decreases the predicted price by $12,527.46. The overall material and finish rating of the house, which is rated as 5, reduces the price by $10,743.76. The absence of a second floor (0 square feet) further decreases the price by $10,142.29. The house's location within the Edwards neighborhood in Ames city limits lowers the price by $9,913.81. However, the presence of a wood deck area measuring 736 square feet increases the predicted price by $9,846.38.
Total Score: 15.0
accuracy: 4.0, completeness: 4.0, fluency: 3.0, conciseness: 4, 
--


 10%|█         | 1/10 [00:00<00:05,  1.56it/s]

Narrative: The SHAP value indicates that the house price increases because the second floor is 854 sq ft and the house was originally constructed in 2003. However, the price decreases due to the basement area being 856 sq ft, the first floor being 856 sq ft, and the location being CollgCr.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 20%|██        | 2/10 [00:04<00:18,  2.30s/it]

Narrative: The SHAP value indicates that the house price increases because it has walkout or garden level walls rated as 'Gd' and the overall condition of the house is rated 8. However, the price decreases due to the above ground living area being 1262 sq ft, the absence of a second floor, and its proximity to feeder roads.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 30%|███       | 3/10 [00:08<00:23,  3.33s/it]

Narrative: The SHAP value indicates that the house price increases because the second floor is 866 sq ft, the house was originally constructed in 2001, and the above ground living area is 1786 sq ft. However, the price decreases due to the location being CollgCr and the basement area being 920 sq ft.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 40%|████      | 4/10 [00:14<00:21,  3.59s/it]


Bootstrapped 3 full traces after 5 examples in round 0.
Narrative: The SHAP value indicates that the house price increases because the location is NoRidge, the above ground living area is 2198 sq ft, the second floor is 1053 sq ft, the overall material and finish of the house is rated 8, and the house was originally constructed in 2000.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The model predicts the house price based on several key features. The above ground living area, which is 1256 square feet, decreases the predicted price by $12,527.46. The overall material and finish rating of the house, which is rated as 5, reduces the price by $10,743.76. The absence of a second floor (0 square feet) further decreases the price by $10,142.29. The house's location within the Edwards neighborhood in Ames city limits lowers the price by $9,913.81. However, the presence of a wood deck area measuring 736 square feet increases the predicted price by $9,846.38.
Total Score: 15.0
accuracy: 4.0, completeness: 4.0, fluency: 3.0, conciseness: 4, 
--
Narrative: The SHAP value indicates that the house price increases because the second floor is 854 sq ft and the house was originally constructed in 2003. However, the price decreases due to the basement area being 856 sq ft, the first floor being 856 sq ft, and the location being CollgCr.
Total Score: 16.0
accuracy: 4.0,

 40%|████      | 4/10 [00:00<00:00, 110.90it/s]


Bootstrapped 3 full traces after 5 examples in round 0.
Narrative: The SHAP value indicates that the house price decreases because the foundation is made of wood, the location is Mitchel, and the overall material and finish of the house is rated 5. However, the price increases due to having a three season porch area of 320 sq ft and having 1 bedroom above ground.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--
Total score: 16.0 (accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4.0)
--
=====
Dataset: housing_3.json
Few-shot n: 3, Bootstrapped n: 3


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The model predicts the house price based on several key features. The above ground living area, which is 1256 square feet, decreases the predicted price by $12,527.46. The overall material and finish rating of the house, which is rated as 5, reduces the price by $10,743.76. The absence of a second floor (0 square feet) further decreases the price by $10,142.29. The house's location within the Edwards neighborhood in Ames city limits lowers the price by $9,913.81. However, the presence of a wood deck area measuring 736 square feet increases the predicted price by $9,846.38.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 10%|█         | 1/10 [00:00<00:02,  3.07it/s]

Narrative: This house is more expensive because it has a second floor (size=854) and was built more recently (year=2003). However, it is cheaper due to a smaller basement (size=856), smaller first floor (size=856), and its location in CollgCr.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 20%|██        | 2/10 [00:04<00:18,  2.36s/it]

Narrative: This house is more expensive because it has garden level walls (type=Gd) and is in good overall condition (rating=8). However, it is cheaper due to having less above ground living space (size=1262), no second floor (size=0), and being close to feeder roads.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 30%|███       | 3/10 [00:09<00:22,  3.23s/it]


Bootstrapped 3 full traces after 4 examples in round 0.
Narrative: This house is more expensive because it is located in the NoRidge neighborhood, has a large above ground living area (size=2198), and includes a second floor (size=1053). Additionally, the house has a high overall material and finish rating (rating=8) and was built relatively recently (year=2000).
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The model predicts the house price based on several key features. The above ground living area, which is 1256 square feet, decreases the predicted price by $12,527.46. The overall material and finish rating of the house, which is rated as 5, reduces the price by $10,743.76. The absence of a second floor (0 square feet) further decreases the price by $10,142.29. The house's location within the Edwards neighborhood in Ames city limits lowers the price by $9,913.81. However, the presence of a wood deck area measuring 736 square feet increases the predicted price by $9,846.38.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--
Narrative: This house is more expensive because it has a second floor (size=854) and was built more recently (year=2003). However, it is cheaper due to a smaller basement (size=856), smaller first floor (size=856), and its location in CollgCr.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 

 30%|███       | 3/10 [00:00<00:00, 114.37it/s]


Bootstrapped 3 full traces after 4 examples in round 0.
Narrative: This house is cheaper because it has a wood foundation, is located in the Mitchel neighborhood, and has an overall material and finish rating of 5. However, it is more expensive due to having a three-season porch (size=320) and one bedroom above ground.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--
Total score: 15.0 (accuracy: 4.0, completeness: 3.0, fluency: 4.0, conciseness: 4.0)
--
=====
Dataset: mushroom_1.json
Few-shot n: 3, Bootstrapped n: 3


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The model predicts that the mushroom is poisonous primarily because it has a foul odor, which contributes significantly to the prediction. Additionally, the silky texture of the stalk surface above the ring and the chocolate color of the spore print also play smaller roles in the model's decision.
Total Score: 15.0
accuracy: 4.0, completeness: 4.0, fluency: 3.0, conciseness: 4, 
--


 10%|█         | 1/10 [00:04<00:36,  4.01s/it]

Narrative: This mushroom is more likely to be poisonous because its foul odor, silky stalk surface, and chocolate spore print color. Be careful!
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 20%|██        | 2/10 [00:07<00:28,  3.54s/it]

Narrative: This mushroom is less likely to be poisonous because it has no odor and broad gills, although its brown spore print color slightly increases the risk. Stay cautious!
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 30%|███       | 3/10 [00:10<00:24,  3.54s/it]

Narrative: This mushroom is less likely to be poisonous because it has no odor, broad gills, and a black spore print color. Enjoy!
Total Score: 9.0
accuracy: 0.0, completeness: 2.0, fluency: 3.0, conciseness: 4, 
--


 40%|████      | 4/10 [00:13<00:20,  3.39s/it]

Narrative: This mushroom is less likely to be poisonous because it has no odor, broad gill size, and black spore print color. Enjoy!
Total Score: 13.0
accuracy: 4.0, completeness: 2.0, fluency: 3.0, conciseness: 4, 
--


 50%|█████     | 5/10 [00:16<00:15,  3.03s/it]

Narrative: This mushroom is more likely to be poisonous because its foul odor, silky stalk surface, and chocolate spore print color. Be careful!
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 60%|██████    | 6/10 [00:18<00:11,  2.83s/it]

Narrative: This mushroom is more likely to be poisonous because its foul odor, buff gill color, and narrow gill size. Be careful!
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 70%|███████   | 7/10 [00:21<00:09,  3.10s/it]


Bootstrapped 3 full traces after 8 examples in round 0.
Narrative: This mushroom is more likely to be poisonous because its foul odor, silky stalk surface, and chocolate spore print color. Be careful!
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The model predicts that the mushroom is poisonous primarily because it has a foul odor, which contributes significantly to the prediction. Additionally, the silky texture of the stalk surface above the ring and the chocolate color of the spore print also play smaller roles in the model's decision.
Total Score: 15.0
accuracy: 4.0, completeness: 4.0, fluency: 3.0, conciseness: 4, 
--
Narrative: This mushroom is more likely to be poisonous because its foul odor, silky stalk surface, and chocolate spore print color. Be careful!
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--
Narrative: This mushroom is less likely to be poisonous because it has no odor and broad gills, although its brown spore print color slightly increases the risk. Stay cautious!
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--
Narrative: This mushroom is less likely to be poisonous because it has no odor, broad gills, and a black spore

 70%|███████   | 7/10 [00:00<00:00, 109.17it/s]


Bootstrapped 3 full traces after 8 examples in round 0.
Narrative: This mushroom is less likely to be poisonous because it has no odor and broad gills, although its crowded gill spacing slightly increases the risk. Stay cautious!
Total Score: 15.0
accuracy: 4.0, completeness: 4.0, fluency: 3.0, conciseness: 4, 
--
Total score: 14.5 (accuracy: 4.0, completeness: 3.0, fluency: 3.5, conciseness: 4.0)
--
=====
Dataset: mushroom_2.json
Few-shot n: 3, Bootstrapped n: 3


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The model predicts that the mushroom is poisonous primarily because it has a foul odor, which contributes significantly to the prediction. Additionally, the silky texture of the stalk surface above the ring and the chocolate color of the spore print also play smaller roles in the model's decision.
Total Score: 15.0
accuracy: 4.0, completeness: 4.0, fluency: 3.0, conciseness: 4, 
--


 10%|█         | 1/10 [00:00<00:02,  3.10it/s]

Narrative: The foul odor, silky stalk surface, and chocolate spore print color suggest the mushroom more likely to be poisonous.
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 20%|██        | 2/10 [00:03<00:16,  2.04s/it]

Narrative: The lack of odor, broad gill size, and brown spore print color suggest the mushroom is less likely to be poisonous.
Total Score: 10.0
accuracy: 0.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 30%|███       | 3/10 [00:07<00:21,  3.00s/it]

Narrative: The absence of odor, broad gill size, and black spore print color suggest the mushroom is less likely to be poisonous.
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 40%|████      | 4/10 [00:11<00:19,  3.32s/it]

Narrative: The lack of odor, broad gill size, and black spore print color suggest the mushroom is less likely to be poisonous.
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 50%|█████     | 5/10 [00:15<00:17,  3.59s/it]

Narrative: The foul odor, silky stalk surface, and chocolate spore print color suggest the mushroom more likely to be poisonous.
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 60%|██████    | 6/10 [00:18<00:13,  3.40s/it]

Narrative: The foul odor, buff gill color, and narrow gill size suggest the mushroom more likely to be poisonous.
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 70%|███████   | 7/10 [00:21<00:09,  3.19s/it]

Narrative: The foul odor, narrow gill size, and buff gill color suggest the mushroom more likely to be poisonous.
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 80%|████████  | 8/10 [00:23<00:05,  2.95s/it]

Narrative: The narrow gill size, fishy odor, and buff gill color suggest the mushroom more likely to be poisonous.
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 90%|█████████ | 9/10 [00:26<00:02,  2.83s/it]

Narrative: The lack of odor, crowded gill spacing, and broad gill size suggest the mushroom is more likely to be poisonous.
Total Score: 11.0
accuracy: 0.0, completeness: 4.0, fluency: 3.0, conciseness: 4, 
--


100%|██████████| 10/10 [00:29<00:00,  2.96s/it]


Bootstrapped 0 full traces after 10 examples in round 0.
Narrative: The foul odor and silky stalk surface above the ring suggest the mushroom is more likely to be poisonous, and the chocolate spore print further increases the risk of toxicity.
Total Score: 12.0
accuracy: 0.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The model predicts that the mushroom is poisonous primarily because it has a foul odor, which contributes significantly to the prediction. Additionally, the silky texture of the stalk surface above the ring and the chocolate color of the spore print also play smaller roles in the model's decision.
Total Score: 15.0
accuracy: 4.0, completeness: 4.0, fluency: 3.0, conciseness: 4, 
--
Narrative: The foul odor, silky stalk surface, and chocolate spore print color suggest the mushroom more likely to be poisonous.
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--
Narrative: The lack of odor, broad gill size, and brown spore print color suggest the mushroom is less likely to be poisonous.
Total Score: 10.0
accuracy: 0.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--
Narrative: The absence of odor, broad gill size, and black spore print color suggest the mushroom is less likely to be poisonous.
Total Score: 14.0
accuracy: 4.0, completeness:

100%|██████████| 10/10 [00:00<00:00, 115.95it/s]


Bootstrapped 0 full traces after 10 examples in round 0.
Narrative: The absence of odor suggests the mushroom is less likely to be poisonous, but the crowded gill spacing and broad gill size indicate a higher risk of toxicity.
Total Score: 9.0
accuracy: 0.0, completeness: 2.0, fluency: 3.0, conciseness: 4, 
--
Total score: 10.5 (accuracy: 0.0, completeness: 3.0, fluency: 3.5, conciseness: 4.0)
--
=====
Dataset: pdf_1.json
Few-shot n: 3, Bootstrapped n: 3


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The model predicts that the PDF file contains malware based on several key factors. Firstly, the size of the metadata is 262.0 KB, which significantly contributes to the prediction. Secondly, the total size of the file is 74.0 KB, which also plays an important role. Lastly, the absence of Javascript keywords, with a count of 0.0, further supports the model's prediction.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 10%|█         | 1/10 [00:03<00:28,  3.21s/it]

Narrative: The PDF file is less likely to contain malware because it has fewer objects (-1), fewer keywords that denote the end of streams (-1), and fewer streams (sequences of binary data) (-1).
Total Score: 11.0
accuracy: 0.0, completeness: 4.0, fluency: 3.0, conciseness: 4, 
--


 20%|██        | 2/10 [00:07<00:32,  4.05s/it]

Narrative: The PDF file is more likely to contain malware because it has a larger metadata size (272 KB), no Javascript keywords, and a larger total size (90 KB).
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 30%|███       | 3/10 [00:11<00:28,  4.01s/it]

Narrative: The PDF file is more likely to contain malware because it has a larger metadata size (180 KB), a larger total size (7 KB), and a negative number of objects (-1).
Total Score: 10.0
accuracy: 0.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 40%|████      | 4/10 [00:15<00:23,  3.96s/it]

Narrative: The PDF file is more likely to contain malware because it has a larger metadata size (262 KB), no Javascript keywords, and a larger total size (91 KB).
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 50%|█████     | 5/10 [00:19<00:19,  3.81s/it]

Narrative: The PDF file is more likely to contain malware because it has one Javascript keyword, a negative number of images, and one JS keyword.
Total Score: 9.0
accuracy: 0.0, completeness: 2.0, fluency: 3.0, conciseness: 4, 
--


 60%|██████    | 6/10 [00:22<00:14,  3.57s/it]

Narrative: The PDF file is more likely to contain malware because it has a larger metadata size (336 KB), a larger total size (58 KB), and a higher number of objects (121).
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 70%|███████   | 7/10 [00:26<00:11,  3.73s/it]

Narrative: The PDF file is more likely to contain malware because it has a higher number of Javascript keywords (3), a higher number of JS keywords (2), and a higher number of keywords that denote the end of streams (2).
Total Score: 7.0
accuracy: 0.0, completeness: 2.0, fluency: 1.0, conciseness: 4, 
--


 80%|████████  | 8/10 [00:29<00:07,  3.62s/it]

Narrative: The PDF file is more likely to contain malware because it has a larger metadata size (289 KB), no Javascript keywords, and a total size of 27 KB.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 90%|█████████ | 9/10 [00:32<00:03,  3.65s/it]


Bootstrapped 3 full traces after 10 examples in round 0.
Narrative: The PDF file is more likely to contain malware because it has a larger metadata size (180 KB), a small total size (3 KB), and contains one stream (sequence of binary data).
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The model predicts that the PDF file contains malware based on several key factors. Firstly, the size of the metadata is 262.0 KB, which significantly contributes to the prediction. Secondly, the total size of the file is 74.0 KB, which also plays an important role. Lastly, the absence of Javascript keywords, with a count of 0.0, further supports the model's prediction.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--
Narrative: The PDF file is less likely to contain malware because it has fewer objects (-1), fewer keywords that denote the end of streams (-1), and fewer streams (sequences of binary data) (-1).
Total Score: 11.0
accuracy: 0.0, completeness: 4.0, fluency: 3.0, conciseness: 4, 
--
Narrative: The PDF file is more likely to contain malware because it has a larger metadata size (272 KB), no Javascript keywords, and a larger total size (90 KB).
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--

 90%|█████████ | 9/10 [00:00<00:00, 116.55it/s]


Bootstrapped 3 full traces after 10 examples in round 0.
Narrative: The PDF file is more likely to contain malware because it has a larger metadata size (358 KB), no Javascript keywords, and a total size of 63 KB.
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--
Total score: 15.0 (accuracy: 4.0, completeness: 3.0, fluency: 4.0, conciseness: 4.0)
--
=====
Dataset: pdf_2.json
Few-shot n: 3, Bootstrapped n: 3


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The model predicts that the PDF file contains malware based on several key factors. Firstly, the size of the metadata is 262.0 KB, which significantly contributes to the prediction. Secondly, the total size of the file is 74.0 KB, which also plays an important role. Lastly, the absence of Javascript keywords, with a count of 0.0, further supports the model's prediction.
Total Score: 15.0
accuracy: 4.0, completeness: 4.0, fluency: 3.0, conciseness: 4, 
--


 10%|█         | 1/10 [00:00<00:02,  3.22it/s]

Narrative: A lower number of objects (-1), fewer keywords that denote the end of streams (-1), and fewer streams (-1) suggest that the PDF contains malware.
Total Score: 12.0
accuracy: 0.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 20%|██        | 2/10 [00:04<00:20,  2.59s/it]

Narrative: The larger metadata size (272 KB), no Javascript keywords, and a larger total size (90 KB) suggest that the PDF contains malware.
Total Score: 10.0
accuracy: 0.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 30%|███       | 3/10 [00:10<00:28,  4.06s/it]

Narrative: The larger metadata size (180 KB), a smaller total size (7 KB), and a negative number of objects (-1) suggest that the PDF contains malware.
Total Score: 10.0
accuracy: 0.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 40%|████      | 4/10 [00:14<00:25,  4.30s/it]

Narrative: The larger metadata size (262 KB), no Javascript keywords, and a larger total size (91 KB) suggest that the PDF contains malware.
Total Score: 10.0
accuracy: 0.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 50%|█████     | 5/10 [00:18<00:19,  3.86s/it]

Narrative: The presence of Javascript keywords (1.0) and the absence of images (-1.0) suggest that the PDF contains malware.
Total Score: 7.0
accuracy: 0.0, completeness: 0.0, fluency: 3.0, conciseness: 4, 
--


 60%|██████    | 6/10 [00:21<00:15,  3.78s/it]

Narrative: The larger metadata size (336 KB), a moderate total size (58 KB), and a higher number of objects (121) suggest that the PDF contains malware.
Total Score: 10.0
accuracy: 0.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 70%|███████   | 7/10 [00:24<00:10,  3.40s/it]

Narrative: The presence of 3 Javascript keywords, 2 JS keywords, and 2 keywords that denote the end of streams suggests that the PDF contains malware.
Total Score: 9.0
accuracy: 0.0, completeness: 2.0, fluency: 3.0, conciseness: 4, 
--


 80%|████████  | 8/10 [00:27<00:06,  3.18s/it]

Narrative: The larger metadata size (289 KB), no Javascript keywords, and a smaller total size (27 KB) suggest that the PDF contains malware.
Total Score: 10.0
accuracy: 0.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 90%|█████████ | 9/10 [00:30<00:03,  3.16s/it]

Narrative: The presence of 2 Javascript keywords, a smaller total size (4 KB), and 10 entries in Xref tables suggest that the PDF contains malware.
Total Score: 11.0
accuracy: 0.0, completeness: 4.0, fluency: 3.0, conciseness: 4, 
--


100%|██████████| 10/10 [00:33<00:00,  3.34s/it]


Bootstrapped 0 full traces after 10 examples in round 0.
Narrative: A moderate metadata size (180 KB), a small total size (3 KB), and a single stream of binary data suggest that the PDF contains malware.
Total Score: 9.0
accuracy: 0.0, completeness: 2.0, fluency: 3.0, conciseness: 4, 
--


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The model predicts that the PDF file contains malware based on several key factors. Firstly, the size of the metadata is 262.0 KB, which significantly contributes to the prediction. Secondly, the total size of the file is 74.0 KB, which also plays an important role. Lastly, the absence of Javascript keywords, with a count of 0.0, further supports the model's prediction.
Total Score: 15.0
accuracy: 4.0, completeness: 4.0, fluency: 3.0, conciseness: 4, 
--
Narrative: A lower number of objects (-1), fewer keywords that denote the end of streams (-1), and fewer streams (-1) suggest that the PDF contains malware.
Total Score: 12.0
accuracy: 0.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--
Narrative: The larger metadata size (272 KB), no Javascript keywords, and a larger total size (90 KB) suggest that the PDF contains malware.
Total Score: 10.0
accuracy: 0.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--
Narrative: The larger metadata size (180 KB), a smaller tota

100%|██████████| 10/10 [00:00<00:00, 115.15it/s]


Bootstrapped 0 full traces after 10 examples in round 0.
Narrative: A larger metadata size (358 KB), no Javascript keywords, and a larger total size (63 KB) suggest that the PDF contains malware.
Total Score: 10.0
accuracy: 0.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--
Total score: 9.5 (accuracy: 0.0, completeness: 2.0, fluency: 3.5, conciseness: 4.0)
--
=====
Dataset: student_1.json
Few-shot n: 3, Bootstrapped n: 3


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The lack of family support significantly reduces this student's chances of passing. While not being in a romantic relationship provides some positive impact, being male also slightly decreases the likelihood of passing.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 10%|█         | 1/10 [00:03<00:28,  3.18s/it]

Narrative: The presence of family support and not being in a romantic relationship are positive factors that increase this student's likelihood of passing. However, being 17 years old slightly decreases their chances.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 20%|██        | 2/10 [00:06<00:27,  3.41s/it]

Narrative: The student's chances of passing are negatively impacted by being in a romantic relationship, the lack of family educational support, and being male.
Total Score: 13.0
accuracy: 4.0, completeness: 2.0, fluency: 3.0, conciseness: 4, 
--


 30%|███       | 3/10 [00:10<00:25,  3.61s/it]

Narrative: The student's likelihood of passing is reduced due to the absence of family support and attending the MS school.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 40%|████      | 4/10 [00:13<00:20,  3.43s/it]


Bootstrapped 3 full traces after 5 examples in round 0.
Narrative: The lack of family support significantly reduces this student's chances of passing. While not being in a romantic relationship provides some positive impact, having their mother as their guardian slightly decreases the likelihood of passing.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The lack of family support significantly reduces this student's chances of passing. While not being in a romantic relationship provides some positive impact, being male also slightly decreases the likelihood of passing.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--
Narrative: The presence of family support and not being in a romantic relationship are positive factors that increase this student's likelihood of passing. However, being 17 years old slightly decreases their chances.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--
Narrative: The student's chances of passing are negatively impacted by being in a romantic relationship, the lack of family educational support, and being male.
Total Score: 13.0
accuracy: 4.0, completeness: 2.0, fluency: 3.0, conciseness: 4, 
--
Narrative: The student's likelihood of passing is reduced due to the absence of family support and attending the MS school.
Total Sc

 40%|████      | 4/10 [00:00<00:00, 118.33it/s]


Bootstrapped 3 full traces after 5 examples in round 0.
Narrative: The presence of family support and not being in a romantic relationship are positive factors that increase this student's likelihood of passing. However, frequently going out with friends slightly decreases their chances.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--
Total score: 16.0 (accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4.0)
--
=====
Dataset: student_2.json
Few-shot n: 3, Bootstrapped n: 3


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The model predicts that the student is less likely to pass their class because they do not have family educational support, which negatively impacts the prediction by 2.26 units. Additionally, the fact that the student is not in a romantic relationship positively influences the prediction by 1.11 units. Lastly, the student's gender being male slightly decreases the likelihood of passing by 0.60 units.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 10%|█         | 1/10 [00:06<00:59,  6.64s/it]

Narrative: The presence of family educational support and the lack of a romantic relationship suggest the student is more likely to pass the class. However, being 17 years old slightly decreases the probability of passing.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 20%|██        | 2/10 [00:10<00:39,  4.91s/it]

Narrative: Being in a romantic relationship, the lack of family support, and the sex (male) suggest the student is less likely to pass the class.
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 30%|███       | 3/10 [00:12<00:26,  3.77s/it]

Narrative: The lack of family support and attending the school MS suggest the student is less likely to pass the class.
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 40%|████      | 4/10 [00:16<00:22,  3.72s/it]

Narrative: Having the father as the student's guardian, not being in a romantic relationship, and receiving family educational support all suggest a higher likelihood of the student passing the class.
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 50%|█████     | 5/10 [00:19<00:17,  3.46s/it]

Narrative: The lack of a romantic relationship and the presence of family educational support suggest the student is more likely to pass the class. However, the lower quality of family relationships indicates a lower probability of passing.
Total Score: 14.0
accuracy: 4.0, completeness: 2.0, fluency: 4.0, conciseness: 4, 
--


 60%|██████    | 6/10 [00:22<00:13,  3.35s/it]

Narrative: The lack of a romantic relationship, the presence of family educational support, and a high quality of family relationships suggest the student is more likely to pass the class.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


 70%|███████   | 7/10 [00:26<00:11,  3.75s/it]


Bootstrapped 3 full traces after 8 examples in round 0.
Narrative: The fact that the student is in a romantic relationship and does not have family educational support suggests they are less likely to pass the class. Additionally, having their mother as their guardian slightly decreases the likelihood of passing.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--


  0%|          | 0/10 [00:00<?, ?it/s]

Narrative: The model predicts that the student is less likely to pass their class because they do not have family educational support, which negatively impacts the prediction by 2.26 units. Additionally, the fact that the student is not in a romantic relationship positively influences the prediction by 1.11 units. Lastly, the student's gender being male slightly decreases the likelihood of passing by 0.60 units.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--
Narrative: The presence of family educational support and the lack of a romantic relationship suggest the student is more likely to pass the class. However, being 17 years old slightly decreases the probability of passing.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--
Narrative: Being in a romantic relationship, the lack of family support, and the sex (male) suggest the student is less likely to pass the class.
Total Score: 14.0
accuracy: 4.0, completene

 70%|███████   | 7/10 [00:00<00:00, 114.25it/s]


Bootstrapped 3 full traces after 8 examples in round 0.
Narrative: Being in a romantic relationship and lacking family educational support suggest the student is less likely to pass the class. However, having their father as their guardian positively influences the prediction.
Total Score: 16.0
accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4, 
--
Total score: 16.0 (accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4.0)
--
=====


In [29]:
llm.inspect_history(n=1)




You are helping users understand an ML model's prediction. Given an explanation and information about the model,
convert the explanation into a human-readable narrative.

---

Follow the following format.

Context: what the ML model predicts

Explanation: explanation of an ML model's prediction

Explanation Format: format the explanation is given in

Narrative: human-readable narrative version of the explanation

---

Context: The model predicts whether a student will pass their class

Explanation: (Family eductional support, no, -2.26), (In a romantic relationship, no, 1.11), (Sex, M, -0.60)

Explanation Format: SHAP feature contribution in (feature_name, feature_value, contribution) format

Narrative: The model predicts that the student is less likely to pass their class because they do not have family educational support, which negatively impacts the prediction by 2.26 units. Additionally, the fact that the student is not in a romantic relationship positively influences the predi

"\n\n\nYou are helping users understand an ML model's prediction. Given an explanation and information about the model,\nconvert the explanation into a human-readable narrative.\n\n---\n\nFollow the following format.\n\nContext: what the ML model predicts\n\nExplanation: explanation of an ML model's prediction\n\nExplanation Format: format the explanation is given in\n\nNarrative: human-readable narrative version of the explanation\n\n---\n\nContext: The model predicts whether a student will pass their class\n\nExplanation: (Family eductional support, no, -2.26), (In a romantic relationship, no, 1.11), (Sex, M, -0.60)\n\nExplanation Format: SHAP feature contribution in (feature_name, feature_value, contribution) format\n\nNarrative: The model predicts that the student is less likely to pass their class because they do not have family educational support, which negatively impacts the prediction by 2.26 units. Additionally, the fact that the student is not in a romantic relationship posi

In [30]:
result_df = pd.DataFrame(results)
result_df.to_csv("results.csv")
result_df

Unnamed: 0,dataset,prompt,total score,accuracy,completeness,fluency,conciseness,n_few_shot,n_bootstrapped_few_shot
0,housing_1.json,You are helping users understand an ML model's...,15.5,4.0,4.0,3.5,4.0,,
1,housing_1.json,You are helping users who do not have experien...,14.5,4.0,4.0,2.5,4.0,,
2,housing_1.json,You are helping users understand an ML model's...,15.5,4.0,4.0,3.5,4.0,,
3,housing_2.json,You are helping users understand an ML model's...,14.5,4.0,4.0,2.5,4.0,,
4,housing_2.json,You are helping users who do not have experien...,14.0,4.0,4.0,2.0,4.0,,
...,...,...,...,...,...,...,...,...,...
58,mushroom_2.json,You are helping users understand an ML model's...,10.5,0.0,3.0,3.5,4.0,3.0,3.0
59,pdf_1.json,You are helping users understand an ML model's...,15.0,4.0,3.0,4.0,4.0,3.0,3.0
60,pdf_2.json,You are helping users understand an ML model's...,9.5,0.0,2.0,3.5,4.0,3.0,3.0
61,student_1.json,You are helping users understand an ML model's...,16.0,4.0,4.0,4.0,4.0,3.0,3.0
