# 📓 Groundedness Evaluations for Abstention Handling

In many ways, feedback functions can be thought of as LLM apps themselves: given text, they return some result. Viewed this way, we can use TruLens to evaluate and track the quality of our feedback functions. We can even do this for different models or prompting schemes (such as chain-of-thought reasoning).

This notebook walks through an evaluation on a set of test cases generated from human-annotated datasets. In particular, we generate test cases from [SummEval](https://arxiv.org/abs/2007.12626).

SummEval is one of the datasets dedicated to automated evaluation of summarization tasks. These tasks closely resemble groundedness evaluation in RAG: the retrieved context corresponds to the source and the response to the summary. The dataset contains human annotations of numerical scores (**1** to **5**) from 3 expert annotators and 5 crowd-sourced annotators. In total, 16 models generate summaries for the 100 paragraphs in the test set, giving 1,600 machine-generated summaries. Each paragraph also has several human-written summaries for comparative analysis.

To evaluate groundedness feedback functions, we use the annotated "consistency" scores, which measure whether the summarized response is factually consistent with the source text and can therefore serve as a proxy for groundedness in our RAG triad. We normalize these scores to the range **0** to **1** to obtain our **expected_score**, matching the output range of the feedback functions.
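As a rough illustration of the normalization step, the sketch below min-max scales a 1-to-5 rating onto 0-to-1. This is an assumption made for illustration: the helper in `test_cases` may use a different mapping, and `normalize_consistency` and the sample ratings are hypothetical.

```python
def normalize_consistency(score: float, low: float = 1.0, high: float = 5.0) -> float:
    """Min-max scale a consistency rating from [low, high] onto [0, 1].

    Assumed mapping for illustration only; the notebook's own helper may differ.
    """
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return (score - low) / (high - low)


# Hypothetical annotator ratings for a single machine-generated summary
ratings = [5, 5, 4]
mean_rating = sum(ratings) / len(ratings)
expected_score = normalize_consistency(mean_rating)
```

Under this mapping a rating of 1 becomes 0, a rating of 5 becomes 1, and the midpoint 3 becomes 0.5.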

## Abstention Background

In this particular set of evaluations, we are focused on the handling of abstentions. Uncertainty-based abstention in LLMs has been shown to improve safety and reduce hallucination ([Tomani](https://arxiv.org/abs/2404.10960)). For groundedness evaluations, we want to ensure these are handled in a manner that is consistent with human preferences; in other words, calibrated.

Abstentions fall into two distinct groups, distinguished by whether the question is answerable given the context, in other words, whether the abstention is **justified**. We take the opinionated stance that abstentions for unanswerable questions are justified and therefore **grounded**. Conversely, abstentions for questions that the context can answer are not grounded.

## Experimental Setup

For this set of experiments, we take the same test cases used for groundedness evaluations with a few key changes:
1. We randomly sample approximately 50% of the test cases and replace the response with an abstention. We'll refer to this as the abstention set. The rest will be held as control.
2. In the abstention set, we take a random sample of approximately 50% of cases and truncate the query to its first sentence, removing the majority of the context. This splits the abstention set into **answerable** and **unanswerable** abstentions.

From here, we have two test sets to test against, which differ only in how abstentions are labeled:
1. The expected score for all abstentions is 1.
2. The expected score for answerable abstentions will be set to 0, and then the expected score for unanswerable abstentions will be set to 1.
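
The two labeling schemes above can be sketched as a single function. This is a minimal illustration that assumes each case carries a `group` field with the values used later in this notebook; `label_case` and the sample cases are hypothetical.

```python
def label_case(case: dict, consider_answerability: bool) -> float:
    """Return the expected score for one test case under either labeling scheme."""
    if case["group"] == "Control":
        # Control cases keep their normalized human score
        return case["expected_score"]
    if consider_answerability and case["group"] == "Answerable Abstention":
        # Abstaining when the context holds the answer is treated as ungrounded
        return 0.0
    # All other abstentions are treated as grounded
    return 1.0


# Hypothetical cases, one per group
cases = [
    {"group": "Control", "expected_score": 0.75},
    {"group": "Answerable Abstention", "expected_score": 1.0},
    {"group": "Unanswerable Abstention", "expected_score": 1.0},
]

all_high = [label_case(c, consider_answerability=False) for c in cases]
by_answerability = [label_case(c, consider_answerability=True) for c in cases]
```

Only the answerable abstentions change between the two schemes, so any difference in error between the two test sets comes from that subgroup.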

We will then compute the mean absolute error (MAE) of our groundedness evaluator against the expected score for each test set. Results will be displayed for the following subgroups:
- Abstentions v. Control
- Answerable Abstentions v. Unanswerable Abstentions

We will also show results for the test cases with high and low human-annotated consistency to ensure that this treatment is consistent across expected scores.
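
As a small sketch of the per-subgroup metric, the snippet below computes MAE by group with plain Python. The rows, the `predicted_score` field, and the `mae_by_group` helper are hypothetical; in this notebook the aggregation is done by TruLens rather than by hand.

```python
from collections import defaultdict
from statistics import mean


def mae_by_group(rows: list, key: str = "group") -> dict:
    """Mean absolute error between predicted and expected scores, per subgroup."""
    errors = defaultdict(list)
    for row in rows:
        errors[row[key]].append(abs(row["predicted_score"] - row["expected_score"]))
    return {group: mean(errs) for group, errs in errors.items()}


# Hypothetical evaluator outputs
rows = [
    {"group": "Control", "expected_score": 1.0, "predicted_score": 0.75},
    {"group": "Control", "expected_score": 0.5, "predicted_score": 0.5},
    {"group": "Answerable Abstention", "expected_score": 0.0, "predicted_score": 1.0},
    {"group": "Unanswerable Abstention", "expected_score": 1.0, "predicted_score": 1.0},
]

mae = mae_by_group(rows)
```

Passing a different `key` (for example, a high/low consistency bucket) gives the same breakdown along another dimension.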


## Improving Groundedness

To improve the groundedness feedback function against these tests, we consider the following changes:
- Abstentions are excluded from statements to evaluate; empty strings are considered grounded.
- The answerability in abstention cases is assessed; unanswerable abstentions score high, answerable abstentions score low.
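
The two changes above can be sketched as follows. This is an illustration under stated assumptions, not the TruLens implementation: the keyword-based `is_abstention` check stands in for whatever detection the feedback function actually uses, and `score_statement`, `score_response`, and `ABSTENTION_MARKERS` are hypothetical names.

```python
from statistics import mean
from typing import Callable, List, Optional

# Hypothetical phrases used to flag abstentions; a real evaluator would likely use an LLM
ABSTENTION_MARKERS = ("i don't know", "i'm not certain", "i can't say", "i'm unsure", "unable to provide")


def is_abstention(statement: str) -> bool:
    text = statement.lower()
    return any(marker in text for marker in ABSTENTION_MARKERS)


def score_response(
    statements: List[str],
    score_statement: Callable[[str], float],
    answerable: Optional[bool] = None,
) -> float:
    """Score a response, treating abstentions separately from factual statements.

    answerable=None applies the first change only (abstentions are dropped and an
    empty remainder counts as grounded); True/False also applies the second change.
    """
    remaining = [s for s in statements if s.strip() and not is_abstention(s)]
    if remaining:
        return mean([score_statement(s) for s in remaining])
    abstained = any(is_abstention(s) for s in statements)
    if abstained and answerable is not None:
        # Justified (unanswerable) abstentions score high, unjustified ones score low
        return 0.0 if answerable else 1.0
    return 1.0  # nothing left to evaluate: considered grounded
```

For example, `score_response(["I don't know"], scorer, answerable=True)` returns 0.0 because the abstention is unjustified, while omitting `answerable` returns 1.0.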

In [1]:
# Import groundedness feedback function
from trulens_eval.feedback import GroundTruthAgreement
from trulens_eval import TruBasicApp, Feedback, Tru, Select
from test_cases import generate_summeval_groundedness_golden_set
import copy

tru = Tru()
tru.reset_database()

# generator for groundedness golden set
test_cases_gen = generate_summeval_groundedness_golden_set("./datasets/summeval/summeval_test_100.json")



🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


In [2]:
# Specify the number of test cases we want to run the smoke test on
groundedness_golden_set_abstentions = []
for i in range(500):
    groundedness_golden_set_abstentions.append(next(test_cases_gen))

In [3]:
import random

# Randomly assign ~50% of the test cases to the abstention set and hold the rest as control;
# responses in the abstention set are replaced with a random abstention phrase below
abstention_set = []
control_set = []
for item in groundedness_golden_set_abstentions:
    if random.random() < 0.5:
        abstention_set.append(item)
    else:
        control_set.append(item)

for item in abstention_set:
    item['response'] = random.choice([
        "I don't know",
        "I don't have the information available to answer this question",
        "I'm unsure of the answer given the provided context",
        "I'm not certain",
        "I can't say for sure",
        "I'm not familiar with that",
        "I'm not knowledgeable about this topic",
        "I'm not the right person to ask",
        "I'm unable to provide a definitive answer",
        "I'm sorry, I can't help with that"
    ])
    item['group'] = 'Abstention'

# Set the group for the remaining cases as 'Control'
for item in groundedness_golden_set_abstentions:
    if 'group' not in item:
        item['group'] = 'Control'

# Split into answerable and unanswerable set
unanswerable_abstentions = []
answerable_abstentions = []

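# Work on a copy so the items in abstention_set are left untouched
abstention_set_copy = copy.deepcopy(abstention_set)
for item in abstention_set_copy:
    item['expected_score'] = 1.0
    if random.random() < 0.5:
        # Keep only the first sentence of the source so the question becomes unanswerable
        split_result = item['query'].split('.', 1)
        if len(split_result) > 1:
            item['group'] = 'Unanswerable Abstention'
            item['query'] = split_result[0]
            unanswerable_abstentions.append(item)
        # Sources with no sentence break cannot be truncated, so those cases are dropped
    else:
        item['group'] = 'Answerable Abstention'
        answerable_abstentions.append(item)
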
# create a set that includes all abstentions and the control set
groundedness_golden_set_abstensions_score_high = control_set + answerable_abstentions + unanswerable_abstentions


In [4]:
answerable_abstentions_ca = copy.deepcopy(answerable_abstentions)
for item in answerable_abstentions_ca :
    item['expected_score'] = 0.0

groundedness_golden_set_abstensions_consider_answerability = control_set + answerable_abstentions_ca + unanswerable_abstentions

In [5]:
groundedness_golden_set_abstensions_consider_answerability_copy = copy.deepcopy(groundedness_golden_set_abstensions_consider_answerability)

test_set = [{'query': item['query'], 'response': item['response'], 'group': item['group']} for item in groundedness_golden_set_abstensions_consider_answerability_copy]

In [6]:
import pandas as pd

# Consider answerability of abstentions
print("\n Consider answerability of abstentions")

## Count the data by group (Control, Answerable Abstention, Unanswerable Abstention)
ca_df = pd.DataFrame(groundedness_golden_set_abstensions_consider_answerability)
ca_group_counts = ca_df['group'].value_counts()
print(ca_group_counts)

## Calculate average values for expected_score and human_score by group
ca_group_avg = ca_df.groupby('group').agg({'expected_score': 'mean', 'human_score': 'mean'})
ca_group_avg = ca_group_avg.reindex(['Control', 'Answerable Abstention', 'Unanswerable Abstention'])
print(ca_group_avg)

# Reward all abstentions equally
print("\n Reward all abstentions equally")

sh_df = pd.DataFrame(groundedness_golden_set_abstensions_score_high)
sh_group_counts = sh_df['group'].value_counts()
print(sh_group_counts)

## Calculate average values for expected_score and human_score by group
sh_group_avg = sh_df.groupby('group').agg({'expected_score': 'mean', 'human_score': 'mean'})
sh_group_avg = sh_group_avg.reindex(['Control', 'Answerable Abstention', 'Unanswerable Abstention'])
print(sh_group_avg)


 Consider answerability of abstentions
group
Control                    256
Unanswerable Abstention    129
Answerable Abstention      115
Name: count, dtype: int64
                         expected_score  human_score
group                                               
Control                        0.924297     4.696615
Answerable Abstention          0.000000     4.669565
Unanswerable Abstention        1.000000     4.715762

 Reward all abstentions equally
group
Control                    256
Unanswerable Abstention    129
Answerable Abstention      115
Name: count, dtype: int64
                         expected_score  human_score
group                                               
Control                        0.924297     4.696615
Answerable Abstention          1.000000     4.669565
Unanswerable Abstention        1.000000     4.715762


### Benchmarking GPT-4o

In [7]:
from dotenv import load_dotenv

load_dotenv()

True

In [8]:
from trulens_eval.feedback.provider import OpenAI

openai_provider = OpenAI(model_engine="gpt-4o")
f_groundedness_openai_gpt4o = Feedback(openai_provider.groundedness_measure_with_cot_reasons, name = "Groundedness OpenAI GPT-4o").on_input_output()

# Wrap the feedback function so it can be tracked as an app: takes (source, statement) and returns only the score
def wrapped_groundedness_openai_gpt4o(input: str, output: str) -> float:
    return f_groundedness_openai_gpt4o(input, output)[0]

# Ground truth for the test set that considers the answerability of abstentions
ground_truth_consider_answerability = GroundTruthAgreement(groundedness_golden_set_abstensions_consider_answerability)
# Use the mae method to compare the groundedness score (record output) against the expected score
f_mae_consider_answerability = Feedback(ground_truth_consider_answerability.mae, name = "Mean Absolute Error (consider answerability)", higher_is_better=False).on(Select.Record.app._call.args.args[0]).on(Select.Record.app._call.args.args[1]).on(Select.RecordOutput)

# Ground truth for the test set in which all abstentions are expected to score high
ground_truth_abstensions_score_high = GroundTruthAgreement(groundedness_golden_set_abstensions_score_high)
# Use the mae method to compare the groundedness score (record output) against the expected score
f_mae_abstensions_score_high = Feedback(ground_truth_abstensions_score_high.mae, name = "Mean Absolute Error (all abstensions score high)", higher_is_better=False).on(Select.Record.app._call.args.args[0]).on(Select.Record.app._call.args.args[1]).on(Select.RecordOutput)

✅ In Groundedness OpenAI GPT-4o, input source will be set to __record__.main_input or `Select.RecordInput` .
✅ In Groundedness OpenAI GPT-4o, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Mean Absolute Error (consider answerability), input prompt will be set to __record__.app._call.args.args[0] .
✅ In Mean Absolute Error (consider answerability), input response will be set to __record__.app._call.args.args[1] .
✅ In Mean Absolute Error (consider answerability), input score will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Mean Absolute Error (all abstensions score high), input prompt will be set to __record__.app._call.args.args[0] .
✅ In Mean Absolute Error (all abstensions score high), input response will be set to __record__.app._call.args.args[1] .
✅ In Mean Absolute Error (all abstensions score high), input score will be set to __record__.main_output or `Select.RecordOutput` .


In [9]:
tru_wrapped_groundedness_gpt4o = TruBasicApp(wrapped_groundedness_openai_gpt4o, app_id="groundedness GPT-4o-instruct",
                                             feedbacks=[f_mae_consider_answerability, f_mae_abstensions_score_high])
# Both golden sets contain the same queries and responses; only expected_score differs
for item in groundedness_golden_set_abstensions_consider_answerability:
    source = item["query"]
    response = item["response"]
    group = item["group"]

    with tru_wrapped_groundedness_gpt4o as recording:
        try:
            # Tag each record with its subgroup so the leaderboard can group by it
            recording.record_metadata = dict(group = group)
            tru_wrapped_groundedness_gpt4o.app(source, response)
        except Exception as e:
            print(e)


In [10]:
tru.get_leaderboard(group_by_metadata_key = "group")

| app_id | group | Mean Absolute Error (all abstensions score high) | Mean Absolute Error (consider answerability) | latency | total_cost |
|---|---|---|---|---|---|
| groundedness GPT-4o-instruct | Unanswerable Abstention | 1.0 | 1.0 | 0.294574 | 0.001298 |
| groundedness GPT-4o-instruct | Answerable Abstention | 1.0 | 0.0 | 0.26087 | 0.003262 |
| groundedness GPT-4o-instruct | Control | 0.07612 | 0.07612 | 2.535156 | 0.027785 |


In [11]:
tru.run_dashboard()

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.


Dashboard started at http://192.168.4.206:56106 .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>