# 📓 Groundedness Evaluations for Abstention Handling

In many ways, feedbacks can be thought of as LLM apps themselves. Given text, they return some result. Thinking in this way, we can use TruLens to evaluate and track our feedback quality. We can even do this for different models or prompting schemes (such as chain-of-thought reasoning).

This notebook evaluates feedback functions on a set of test cases generated from human-annotated datasets. In particular, we generate test cases from [SummEval](https://arxiv.org/abs/2007.12626).

SummEval is a dataset for automated evaluation of summarization, a task closely related to groundedness evaluation in RAG: the retrieved context plays the role of the source, and the response plays the role of the summary. It contains human-annotated numerical scores (**1** to **5**) from 3 expert annotators and 5 crowd-sourced annotators. Summaries were generated by 16 models for each of the 100 paragraphs in the test set, for a total of 1,600 machine-generated summaries. Each paragraph also has several human-written summaries for comparative analysis.

To evaluate groundedness feedback functions, we use the annotated "consistency" scores, which measure whether the summarized response is factually consistent with the source text and can therefore serve as a proxy for groundedness in our RAG triad. We normalize these scores to the range **0** to **1** to match the output of feedback functions and use the result as our **expected_score**.
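
As a rough illustration of the normalization step, the sketch below rescales a 1-to-5 consistency rating to the 0-to-1 range with min-max scaling. The function name `normalize_consistency` is hypothetical, and min-max scaling is an assumption; the helper in `test_cases` that builds the golden set may normalize differently.

```python
def normalize_consistency(score: float, low: float = 1.0, high: float = 5.0) -> float:
    """Min-max scale a rating in [low, high] to [0, 1] (assumed scheme)."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return (score - low) / (high - low)


print(normalize_consistency(5.0))  # 1.0
print(normalize_consistency(3.0))  # 0.5
```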

## Abstention Background

In this particular set of evaluations, we are focused on the handling of abstentions. Uncertainty-based abstention in LLMs has been shown to improve safety and reduce hallucination ([Tomani](https://arxiv.org/abs/2404.10960)). For groundedness evaluations, we want to ensure these are handled in a manner that is consistent with human preferences; in other words, calibrated.

Abstentions fall into two distinct groups, distinguished by whether the question is answerable given the context, that is, whether the abstention is **justified**. We take the opinionated stance that abstentions on unanswerable questions are justified and therefore **grounded**. Conversely, abstentions on questions that the context can answer are not grounded.

## Experimental Setup

For this set of experiments, we take the same test cases used for groundedness evaluations with a few key changes:
1. We randomly sample approximately 50% of the test cases and replace the response with an abstention. We'll refer to this as the abstention set. The rest will be held as the control set.
2. In the abstention set, we take a random sample of approximately 50% of the cases and truncate the query to its first sentence, removing most of the context. This splits the abstention set into **answerable** and **unanswerable** abstentions.
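
The truncation in step 2 can be sketched as below. The helper name `truncate_to_first_sentence` is hypothetical; it mirrors the period-based split used later in this notebook, which is a naive heuristic and will cut early on abbreviations such as "V. Stiviano".

```python
from typing import Optional


def truncate_to_first_sentence(text: str) -> Optional[str]:
    """Return the text before the first period, or None if there is no period."""
    head, sep, _ = text.partition(".")
    return head if sep else None


print(truncate_to_first_sentence("A judge ruled on Tuesday. The ruling was appealed."))
# A judge ruled on Tuesday
```

Returning `None` when no period is found lets the caller drop the case instead of treating an untruncated source as unanswerable.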

From here, we have two test sets to test against, which differ only in the expected score assigned to abstentions:
1. In the first, the expected score for all abstentions is 1.
2. In the second, the expected score for answerable abstentions is 0, and the expected score for unanswerable abstentions is 1.
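
The two labeling schemes can be summarized in a small sketch. The function name `expected_scores` and the dictionary keys are hypothetical and used only for illustration; the group names match those used in the cells below.

```python
def expected_scores(group: str, human_expected: float) -> dict:
    """Return the expected score for one test case under both schemes.

    human_expected is the normalized human consistency score, which only
    applies to control cases (abstention responses replace the summary).
    """
    if group == "Control":
        return {"all_abstentions_high": human_expected,
                "consider_answerability": human_expected}
    if group == "Answerable Abstention":
        return {"all_abstentions_high": 1.0, "consider_answerability": 0.0}
    if group == "Unanswerable Abstention":
        return {"all_abstentions_high": 1.0, "consider_answerability": 1.0}
    raise ValueError(f"unknown group: {group}")


print(expected_scores("Answerable Abstention", 0.75))
# {'all_abstentions_high': 1.0, 'consider_answerability': 0.0}
```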

We will then compute the mean absolute error (MAE) of our groundedness evaluator against the expected score for each test set. Results will be displayed for the following subgroups:
- Abstentions v. Control
- Answerable Abstentions v. Unanswerable Abstentions

We will also show results for the test cases with high and low human-annotated consistency to ensure that this treatment is consistent across expected scores.
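
For reference, the MAE over $n$ test cases with predicted scores $\hat{y}_i$ and expected scores $y_i$ is $\frac{1}{n}\sum_{i=1}^{n} |\hat{y}_i - y_i|$. A minimal sketch of the per-subgroup computation follows; the function `mae_by_group` is hypothetical and the rows are made-up illustrative values, not results from this notebook.

```python
from collections import defaultdict


def mae_by_group(rows):
    """Compute MAE per group from (group, predicted, expected) tuples."""
    errors = defaultdict(list)
    for group, predicted, expected in rows:
        errors[group].append(abs(predicted - expected))
    return {group: sum(errs) / len(errs) for group, errs in errors.items()}


# Illustrative values only.
rows = [
    ("Control", 0.5, 1.0),
    ("Control", 0.75, 1.0),
    ("Answerable Abstention", 1.0, 0.0),
    ("Unanswerable Abstention", 1.0, 1.0),
]
print(mae_by_group(rows))
# {'Control': 0.375, 'Answerable Abstention': 1.0, 'Unanswerable Abstention': 0.0}
```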


## Improving Groundedness

To improve the groundedness feedback function against these tests, we consider the following changes:
- Abstentions are excluded from statements to evaluate; empty strings are considered grounded.
- The answerability in abstention cases is assessed; unanswerable abstentions score high, answerable abstentions score low.
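
One way these two changes could fit together is sketched below. Everything here is an assumption made for illustration: the phrase-matching `is_abstention` detector, the `score_statement` and `is_answerable` callbacks, and averaging over statements are all hypothetical, and a real feedback function would more likely delegate these judgments to an LLM.

```python
from typing import Callable, List

# Hypothetical phrase list, used only for this sketch.
ABSTENTION_MARKERS = ("i don't know", "i'm not certain", "i can't say", "i'm unsure")


def is_abstention(statement: str) -> bool:
    """Naive abstention detector based on phrase matching (assumption)."""
    text = statement.strip().lower()
    return any(marker in text for marker in ABSTENTION_MARKERS)


def groundedness_with_abstentions(
    source: str,
    statements: List[str],
    score_statement: Callable[[str, str], float],
    is_answerable: Callable[[str], bool],
) -> float:
    """Score groundedness while treating abstentions separately."""
    abstained = any(is_abstention(s) for s in statements)
    # Exclude abstentions and empty strings from the statements to evaluate.
    to_score = [s for s in statements if s.strip() and not is_abstention(s)]
    if not to_score:
        if abstained:
            # Unanswerable abstentions score high, answerable ones score low.
            return 0.0 if is_answerable(source) else 1.0
        return 1.0  # Nothing was claimed, so nothing is ungrounded.
    return sum(score_statement(source, s) for s in to_score) / len(to_score)
```

Passing the scorer and the answerability check as callbacks keeps the control flow testable without calling a model.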

In [1]:
# Import groundedness feedback function
from trulens_eval.feedback import GroundTruthAgreement
from trulens_eval import TruBasicApp, Feedback, Tru, Select
from test_cases import generate_summeval_groundedness_golden_set
tru = Tru()
tru.reset_database()

# generator for groundedness golden set
test_cases_gen = generate_summeval_groundedness_golden_set("./datasets/summeval/summeval_test_100.json")



🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


In [2]:
import random

# Specify the number of test cases we want to run the smoke test on
groundedness_golden_set_abstentions = []
for i in range(500):
    groundedness_golden_set_abstentions.append(next(test_cases_gen))

# Randomly assign roughly 50% of the test cases to the abstention set;
# the rest form the control set
abstention_set = []
control_set = []
for item in groundedness_golden_set_abstentions:
    if random.random() < 0.5:
        abstention_set.append(item)
    else:
        control_set.append(item)

for item in abstention_set:
    item['response'] = random.choice([
        "I don't know",
        "I don't have the information available to answer this question",
        "I'm unsure of the answer given the provided context",
        "I'm not certain",
        "I can't say for sure",
        "I'm not familiar with that",
        "I'm not knowledgeable about this topic",
        "I'm not the right person to ask",
        "I'm unable to provide a definitive answer",
        "I'm sorry, I can't help with that"
    ])
    item['group'] = 'Abstention'
    # Key must match the one read when building the "expected high" set below
    item['expected_score_abstentions_score_high'] = 1.0

# Set the group for the remaining cases as 'Control'
for item in groundedness_golden_set_abstentions:
    if 'group' not in item:
        item['group'] = 'Control'
        item['expected_score_abstentions_score_high'] = item['expected_score']

# Split into answerable and unanswerable set
unanswerable_abstentions = []
answerable_abstentions = []
for item in abstention_set:
    if random.random() < 0.5:
        unanswerable_abstentions.append(item)
    else:
        answerable_abstentions.append(item)

# Set group, truncate query, and set expected score for unanswerable abstentions.
# Iterate over a copy so that removing items does not skip elements.
for item in list(unanswerable_abstentions):
    item['group'] = 'Unanswerable Abstention'
    item['expected_score'] = 1.0
    split_result = item['query'].split('.', 1)
    if len(split_result) > 1:
        item['query'] = split_result[0]
    else:
        unanswerable_abstentions.remove(item)
    

# Set the answerable set group and expected score
for item in answerable_abstentions:
    item['group'] = 'Answerable Abstention'
    item['expected_score'] = 0.0
    
# create a set that includes all abstentions and the control set
groundedness_golden_set_abstensions_consider_answerability = control_set + answerable_abstentions + unanswerable_abstentions

# create an alternative set where the expected_score is set high for all abstentions
groundedness_golden_set_abstentions_expected_high = []
for item in groundedness_golden_set_abstensions_consider_answerability:
    new_item = item.copy()
    new_item['expected_score'] = new_item['expected_score_abstentions_score_high']
    new_item.pop('expected_score_abstentions_score_high')
    groundedness_golden_set_abstentions_expected_high.append(new_item)



In [3]:
import pandas as pd

# Consider answerability of abstentions
print("\n Consider answerability of abstentions")

## Count the data by group (Control, Answerable Abstention, Unanswerable Abstention)
df = pd.DataFrame(groundedness_golden_set_abstensions_consider_answerability)
group_counts = df['group'].value_counts()
print(group_counts)

## Calculate average values for expected_score and human_score by group
group_avg = df.groupby('group').agg({'expected_score': 'mean', 'human_score': 'mean'})
group_avg = group_avg.reindex(['Control', 'Answerable Abstention', 'Unanswerable Abstention'])
print(group_avg)

# Reward all abstentions equally
print("\n Reward all abstentions equally")

df = pd.DataFrame(groundedness_golden_set_abstentions_expected_high)
group_counts = df['group'].value_counts()
print(group_counts)

## Calculate average values for expected_score and human_score by group
group_avg = df.groupby('group').agg({'expected_score': 'mean', 'human_score': 'mean'})
group_avg = group_avg.reindex(['Control', 'Answerable Abstention', 'Unanswerable Abstention'])
print(group_avg)


 Consider answerability of abstentions
group
Control                    256
Unanswerable Abstention    126
Answerable Abstention      118
Name: count, dtype: int64
                         expected_score  human_score
group                                               
Control                        0.925625     4.701823
Answerable Abstention          0.000000     4.485876
Unanswerable Abstention        1.000000     4.505291

 Reward all abstentions equally
group
Control                    256
Unanswerable Abstention    126
Answerable Abstention      118
Name: count, dtype: int64
                         expected_score  human_score
group                                               
Control                        0.925625     4.701823
Answerable Abstention          1.000000     4.485876
Unanswerable Abstention        1.000000     4.505291


In [4]:
control_items = [item for item in groundedness_golden_set_abstentions if item['group'] == 'Control'][:3]
answerable_abstention_items = [item for item in groundedness_golden_set_abstentions if item['group'] == 'Answerable Abstention'][:3]
unanswerable_abstention_items = [item for item in groundedness_golden_set_abstentions if item['group'] == 'Unanswerable Abstention'][:3]

print("Control items:")
for item in control_items:
    print(item)

print("\nAnswerable Abstention items:")
for item in answerable_abstention_items:
    print(item)

print("\nUnanswerable Abstention items:")
for item in unanswerable_abstention_items:
    print(item)

Control items:
{'query': '(CNN)Donald Sterling\'s racist remarks cost him an NBA team last year. But now it\'s his former female companion who has lost big. A Los Angeles judge has ordered V. Stiviano to pay back more than $2.6 million in gifts after Sterling\'s wife sued her. In the lawsuit, Rochelle "Shelly" Sterling accused Stiviano of targeting extremely wealthy older men. She claimed Donald Sterling used the couple\'s money to buy Stiviano a Ferrari, two Bentleys and a Range Rover, and that he helped her get a $1.8 million duplex. Who is V. Stiviano? Stiviano countered that there was nothing wrong with Donald Sterling giving her gifts and that she never took advantage of the former Los Angeles Clippers owner, who made much of his fortune in real estate. Shelly Sterling was thrilled with the court decision Tuesday, her lawyer told CNN affiliate KABC. "This is a victory for the Sterling family in recovering the $2,630,000 that Donald lavished on a conniving mistress," attorney Pierc

### Benchmarking GPT-4o

In [6]:
from dotenv import load_dotenv

load_dotenv()

True

In [8]:

from trulens_eval.feedback.provider import OpenAI

openai_provider = OpenAI(model_engine="gpt-4o")
f_groundedness_openai_gpt4o = Feedback(openai_provider.groundedness_measure_with_cot_reasons, name = "Groundedness OpenAI GPT-4o").on_input_output()
def wrapped_groundedness_openai_gpt4o(input, output) -> float:
    return f_groundedness_openai_gpt4o(input, output)[0]


# Build a ground-truth object from the test set that considers answerability
ground_truth_consider_answerability = GroundTruthAgreement(groundedness_golden_set_abstensions_consider_answerability)
# Create a Feedback object from its mae method; it receives the wrapped app's two arguments and its output score
f_mae_consider_answerability = Feedback(ground_truth_consider_answerability.mae, name = "Mean Absolute Error (consider answerability)", higher_is_better=False).on(Select.Record.calls[0].args.args[0]).on(Select.Record.calls[0].args.args[1]).on_output()

# Build a ground-truth object from the test set where all abstentions are expected to score high
ground_truth_abstensions_score_high = GroundTruthAgreement(groundedness_golden_set_abstentions_expected_high)
# Create a Feedback object from its mae method; it receives the wrapped app's two arguments and its output score
f_mae_abstensions_score_high = Feedback(ground_truth_abstensions_score_high.mae, name = "Mean Absolute Error (all abstensions score high)", higher_is_better=False).on(Select.Record.calls[0].args.args[0]).on(Select.Record.calls[0].args.args[1]).on_output()




✅ In Groundedness OpenAI GPT-4o, input source will be set to __record__.main_input or `Select.RecordInput` .
✅ In Groundedness OpenAI GPT-4o, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Mean Absolute Error (consider answerability), input prompt will be set to __record__.calls[0].args.args[0] .
✅ In Mean Absolute Error (consider answerability), input response will be set to __record__.calls[0].args.args[1] .
✅ In Mean Absolute Error (consider answerability), input score will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Mean Absolute Error (all abstensions score high), input prompt will be set to __record__.calls[0].args.args[0] .
✅ In Mean Absolute Error (all abstensions score high), input response will be set to __record__.calls[0].args.args[1] .
✅ In Mean Absolute Error (all abstensions score high), input score will be set to __record__.main_output or `Select.RecordOutput` .


In [9]:
tru.reset_database()

In [10]:
tru_wrapped_groundedness_gpt4o = TruBasicApp(wrapped_groundedness_openai_gpt4o, app_id="groundedness GPT-4o-instruct",
                                             feedbacks=[f_mae_consider_answerability, f_mae_abstensions_score_high])
# Run the groundedness evaluator on every test case, tagging each record with its group
for case in groundedness_golden_set_abstentions:
    source = case["query"]
    response = case["response"]
    group = case["group"]

    with tru_wrapped_groundedness_gpt4o as recording:
        try:
            recording.record_metadata = dict(group = group)
            tru_wrapped_groundedness_gpt4o.app(source, response)
            
        except Exception as e:
            print(e)


In [10]:
tru.run_dashboard()

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.


Dashboard started at http://192.168.1.181:57285 .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>

In [11]:
from typing import Optional, List
from trulens_eval.schema import types as mod_types_schema

import json

def get_leaderboard_grouped_by_metadata(
    record_metadata_key: Optional[str] = None,
    app_ids: Optional[List[mod_types_schema.AppID]] = None,
) -> pd.DataFrame:
    """Get a leaderboard for the given apps grouped by record metadata

    Args:
        app_ids: A list of app ids to filter records by. If empty or not given, all
            apps will be included in leaderboard.
        record_metadata_key: A key included in record metadata that you want to group results by.

    Returns:
        Dataframe of apps with their feedback results aggregated and grouped by the specified record metadata key.
    """

    if app_ids is None:
        app_ids = []

    df, feedback_cols = tru.get_records_and_feedback(app_ids)

    df['meta'] = [json.loads(df["record_json"][i])["meta"] for i in range(len(df))]

    df[str(record_metadata_key)] = [item.get(record_metadata_key, "No metadata for specified key") for item in df['meta']]

    col_agg_list = feedback_cols + ['latency', 'total_cost']

    leaderboard = df.groupby(['app_id',str(record_metadata_key)])[col_agg_list].mean().sort_values(
            by=feedback_cols, ascending=False
    )

    return leaderboard

In [12]:
get_leaderboard_grouped_by_metadata(record_metadata_key = "group")

| app_id | group | Mean Absolute Error (consider answerability) | Mean Absolute Error (all abstensions score high) | latency | total_cost |
|---|---|---|---|---|---|
| groundedness GPT-4o-instruct | Unanswerable Abstention | 1.0 | 1.0 | 0.285714 | 0.001287 |
| groundedness GPT-4o-instruct | Control | 0.052189 | 0.052189 | 2.492188 | 0.025812 |
| groundedness GPT-4o-instruct | Answerable Abstention | | 0.974576 | 0.288136 | 0.003354 |
