# 📓 Groundedness Evaluations

In many ways, feedback functions can be thought of as LLM apps themselves: given text, they return a result. Viewed this way, we can use TruLens to evaluate and track the quality of our feedback functions, and we can do so across different models and prompting schemes (such as chain-of-thought reasoning). In particular, here we evaluate [`arctic-instruct`](https://huggingface.co/Snowflake/snowflake-arctic-instruct), an open-source LLM published by Snowflake with a 480B-parameter dense-MoE hybrid transformer architecture.

This notebook evaluates groundedness feedback functions on a set of test cases generated from a human-annotated dataset, [SummEval](https://arxiv.org/abs/2007.12626).

SummEval is a dataset dedicated to automated evaluation of summarization, a task closely related to groundedness evaluation in RAG: the retrieved context plays the role of the source, and the response plays the role of the summary. Each summary carries human-annotated numerical scores (**1** to **5**) from 3 expert annotators and 5 crowd-sourced annotators. The test set contains 100 paragraphs, each summarized by 16 models, for a total of 1,600 machine-generated summaries. Each paragraph also has several human-written summaries for comparative analysis.


To evaluate groundedness feedback functions, we use the annotated "consistency" scores, which measure whether the summarized response is factually consistent with the source text and can therefore serve as a proxy for groundedness in our RAG triad. We normalize these scores to the range **0** to **1** and use them as our **expected_score**, so that they match the output range of the feedback functions.
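The mapping from a 1 to 5 annotation to a 0 to 1 expected score can be sketched as follows. This is only an illustration under stated assumptions: the helper names `normalize_score` and `expected_score_from_annotations` are hypothetical, min-max scaling is one plausible choice, and the notebook's own `generate_summeval_groundedness_golden_set` helper (not shown here) may use a different mapping, such as dividing by the maximum score.

```python
from typing import Sequence


def normalize_score(raw: float, low: float = 1.0, high: float = 5.0) -> float:
    """Min-max normalize a raw annotation score from [low, high] to [0, 1].

    Assumption: min-max scaling. The notebook's helper may normalize differently.
    """
    if not low <= raw <= high:
        raise ValueError(f"score {raw} is outside the range [{low}, {high}]")
    return (raw - low) / (high - low)


def expected_score_from_annotations(annotations: Sequence[float]) -> float:
    """Average several annotators' consistency scores, then normalize the mean."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    mean_raw = sum(annotations) / len(annotations)
    return normalize_score(mean_raw)


if __name__ == "__main__":
    # Hypothetical consistency scores from three annotators for one summary
    print(expected_score_from_annotations([5, 5, 4]))
```

Under this mapping, a summary rated 1 by every annotator gets an expected score of 0, and one rated 5 by every annotator gets 1.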

In [2]:
# Import groundedness feedback function
from trulens_eval.feedback import GroundTruthAgreement
from trulens_eval import TruBasicApp, Feedback, Tru, Select
from test_cases import generate_summeval_groundedness_golden_set
tru = Tru()
tru.reset_database()

# generator for groundedness golden set
test_cases_gen = generate_summeval_groundedness_golden_set("./datasets/summeval/summeval_test_100.json")

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of Tru` to prevent this.


In [3]:
from itertools import islice

# specify the number of test cases we want to run the smoke test on
num_test_cases = 500
# islice stops early instead of raising StopIteration if the generator runs out
groundedness_golden_set = list(islice(test_cases_gen, num_test_cases))

In [4]:
len(groundedness_golden_set)

500

In [5]:
groundedness_golden_set[:5]


[{'query': '(CNN)Donald Sterling\'s racist remarks cost him an NBA team last year. But now it\'s his former female companion who has lost big. A Los Angeles judge has ordered V. Stiviano to pay back more than $2.6 million in gifts after Sterling\'s wife sued her. In the lawsuit, Rochelle "Shelly" Sterling accused Stiviano of targeting extremely wealthy older men. She claimed Donald Sterling used the couple\'s money to buy Stiviano a Ferrari, two Bentleys and a Range Rover, and that he helped her get a $1.8 million duplex. Who is V. Stiviano? Stiviano countered that there was nothing wrong with Donald Sterling giving her gifts and that she never took advantage of the former Los Angeles Clippers owner, who made much of his fortune in real estate. Shelly Sterling was thrilled with the court decision Tuesday, her lawyer told CNN affiliate KABC. "This is a victory for the Sterling family in recovering the $2,630,000 that Donald lavished on a conniving mistress," attorney Pierce O\'Donnell s

In [6]:
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["REPLICATE_API_TOKEN"] = "r8_..."

### Benchmarking various Groundedness feedback function providers (OpenAI GPT-4o vs Snowflake's Arctic Instruct)

In [7]:
from trulens_eval.feedback import LiteLLM
from trulens_eval.feedback.provider import OpenAI

replicate_provider_arctic = LiteLLM(model_engine="replicate/snowflake/snowflake-arctic-instruct")

f_groundedness_arctic = Feedback(replicate_provider_arctic.groundedness_measure_with_cot_reasons, name = "Groundedness arctic-instruct").on_input_output()
def wrapped_groundedness_arctic(input, output) -> float:
    score = f_groundedness_arctic(input, output)[0]
    print(score)
    return score

replicate_provider_llama3 = LiteLLM(model_engine="replicate/meta/meta-llama-3-70b-instruct")

f_groundedness_llama3 = Feedback(replicate_provider_llama3.groundedness_measure_with_cot_reasons, name = "Groundedness Llama-3-70b-instruct").on_input_output()
def wrapped_groundedness_llama3(input, output) -> float:
    score = f_groundedness_llama3(input, output)[0]   
    print(score)
    return score

replicate_provider_mixtral = LiteLLM(model_engine="replicate/mistralai/mixtral-8x7b-instruct-v0.1")
f_groundedness_mixtral = Feedback(replicate_provider_mixtral.groundedness_measure_with_cot_reasons, name = "Groundedness Mixtral-8x7b-instruct-v0.1").on_input_output()
def wrapped_groundedness_mixtral(input, output) -> float:
    score = f_groundedness_mixtral(input, output)[0]   
    print(score)
    return score


openai_provider_turbo = OpenAI(model_engine="gpt-4-turbo")
f_groundedness_openai_gpt4_turbo = Feedback(openai_provider_turbo.groundedness_measure_with_cot_reasons, name = "Groundedness OpenAI GPT-4-turbo").on_input_output()
def wrapped_groundedness_openai_gpt4_turbo(input, output) -> float:
    return f_groundedness_openai_gpt4_turbo(input, output)[0]

openai_provider = OpenAI(model_engine="gpt-4o")
f_groundedness_openai_gpt4 = Feedback(openai_provider.groundedness_measure_with_cot_reasons, name = "Groundedness OpenAI GPT-4o").on_input_output()
def wrapped_groundedness_openai_gpt4(input, output) -> float:
    return f_groundedness_openai_gpt4(input, output)[0]


# Wrap the golden set so that feedback scores can be compared against each expected_score
ground_truth = GroundTruthAgreement(groundedness_golden_set)
# Create a Feedback object using the mae method of the ground_truth object: it takes the source and
# response (the first two call arguments) plus the score (the app output), and aggregates to the mean absolute error
f_mae = Feedback(ground_truth.mae, name = "Mean Absolute Error").on(Select.Record.calls[0].args.args[0]).on(Select.Record.calls[0].args.args[1]).on_output()


✅ In Groundedness arctic-instruct, input source will be set to __record__.main_input or `Select.RecordInput` .
✅ In Groundedness arctic-instruct, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Groundedness Llama-3-70b-instruct, input source will be set to __record__.main_input or `Select.RecordInput` .
✅ In Groundedness Llama-3-70b-instruct, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Groundedness Mixtral-8x7b-instruct-v0.1, input source will be set to __record__.main_input or `Select.RecordInput` .
✅ In Groundedness Mixtral-8x7b-instruct-v0.1, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Groundedness OpenAI GPT-4-turbo, input source will be set to __record__.main_input or `Select.RecordInput` .
✅ In Groundedness OpenAI GPT-4-turbo, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Groundedness OpenAI GPT-4o, input source will be 
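
The "Mean Absolute Error" feedback defined above compares each provider's groundedness score with the human-derived expected score. As a minimal sketch of the metric itself (not TruLens's implementation; the function name `mean_absolute_error` and the sample values are hypothetical):

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], expected: Sequence[float]) -> float:
    """Average of |predicted - expected| over paired scores in [0, 1]."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not predicted:
        raise ValueError("at least one score pair is required")
    return sum(abs(p - e) for p, e in zip(predicted, expected)) / len(predicted)


if __name__ == "__main__":
    # Hypothetical feedback scores and expected scores for three test cases
    feedback_scores = [1.0, 0.5, 0.0]
    expected_scores = [1.0, 0.75, 0.25]
    print(mean_absolute_error(feedback_scores, expected_scores))
```

A lower value means the feedback function agrees more closely with the human annotations; 0 would mean perfect agreement on every test case.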

In [None]:
tru_wrapped_groundedness_llama3 = TruBasicApp(wrapped_groundedness_llama3, app_id = "groundedness Llama3-70b-instruct", feedbacks=[f_mae])
for test_case in groundedness_golden_set:
    source = test_case["query"]
    response = test_case["response"]

    with tru_wrapped_groundedness_llama3 as recording:
        try:
            tru_wrapped_groundedness_llama3.app(source, response)
        except Exception as e:
            # print the error and continue so one failed call does not abort the run
            print(e)



In [None]:
tru_wrapped_groundedness_mixtral = TruBasicApp(wrapped_groundedness_mixtral, app_id = "groundedness mixtral-8x7b-instruct-v0.1", feedbacks=[f_mae])
for test_case in groundedness_golden_set:
    source = test_case["query"]
    response = test_case["response"]

    with tru_wrapped_groundedness_mixtral as recording:
        try:
            tru_wrapped_groundedness_mixtral.app(source, response)
        except Exception as e:
            # print the error and continue so one failed call does not abort the run
            print(e)

In [11]:
tru.get_leaderboard(app_ids=[])

| app_id | Mean Absolute Error | latency | total_cost |
|---|---|---|---|
| groundedness mixtral-8x7b-instruct-v0.1 | 0.340668 | 4.89267 | 0.000264 |


In [79]:
tru.get_leaderboard(app_ids=[])

| app_id | Mean Absolute Error | latency | total_cost |
|---|---|---|---|
| groundedness Llama3-70b-instruct | 0.054653 | 12.184049 | 5e-06 |


In [19]:
tru.get_leaderboard(app_ids=[])

| app_id | Mean Absolute Error | latency | total_cost |
|---|---|---|---|
| groundedness arctic instruct | 0.076393 | 6.446394 | 3e-06 |
| groundedness openai gpt-4o | 0.057695 | 6.440239 | 0.012691 |
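
Because the leaderboard outputs above come from separate runs, it can help to put them side by side. The following sketch simply copies the reported Mean Absolute Error values into a dictionary and sorts them; the variable names are illustrative, and no new measurements are introduced.

```python
# Mean Absolute Error per app_id, copied from the leaderboard outputs above
mae_by_app = {
    "groundedness mixtral-8x7b-instruct-v0.1": 0.340668,
    "groundedness Llama3-70b-instruct": 0.054653,
    "groundedness arctic instruct": 0.076393,
    "groundedness openai gpt-4o": 0.057695,
}

# Sort ascending: a lower MAE means closer agreement with the human annotations
ranked = sorted(mae_by_app.items(), key=lambda item: item[1])

if __name__ == "__main__":
    for rank, (app_id, mae) in enumerate(ranked, start=1):
        print(f"{rank}. {app_id}: {mae:.6f}")
```

The ranking reflects only the test cases and runs recorded in this notebook, so small differences between providers should be read with caution.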
