## Context Relevance Evaluations

Feedback functions can be thought of as LLM apps in their own right: given text, they return a result. Viewed this way, we can use TruLens to evaluate and track the quality of the feedback functions themselves, and to compare different models (e.g. gpt-3.5 and gpt-4) or prompting schemes (such as chain-of-thought reasoning).

This notebook walks through an evaluation on a set of test cases. You are encouraged to run it yourself and to expand the test cases to cover your own scenario or domain.

In [1]:
# Import relevance feedback function
from trulens_eval.feedback import GroundTruthAgreement, OpenAI, LiteLLM
from trulens_eval import TruBasicApp, Feedback, Tru, Select
from test_cases import context_relevance_golden_set

import openai

Tru().reset_database()

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.
Deleted 17 rows.


In [2]:
import os
os.environ["OPENAI_API_KEY"] = "..."
os.environ["COHERE_API_KEY"] = "..."
os.environ["HUGGINGFACE_API_KEY"] = "..."
os.environ["ANTHROPIC_API_KEY"] = "..."
os.environ["TOGETHERAI_API_KEY"] = "..."

In [3]:
# GPT 3.5
turbo = OpenAI(model_engine="gpt-3.5-turbo")

def wrapped_relevance_turbo(input, output):
    return turbo.qs_relevance(input, output)

# GPT 4
gpt4 = OpenAI(model_engine="gpt-4")

def wrapped_relevance_gpt4(input, output):
    return gpt4.qs_relevance(input, output)

# Cohere
command_nightly = LiteLLM(model_engine="command-nightly")
def wrapped_relevance_command_nightly(input, output):
    return command_nightly.qs_relevance(input, output)

# Anthropic
claude_1 = LiteLLM(model_engine="claude-instant-1")
def wrapped_relevance_claude1(input, output):
    return claude_1.qs_relevance(input, output)

claude_2 = LiteLLM(model_engine="claude-2")
def wrapped_relevance_claude2(input, output):
    return claude_2.qs_relevance(input, output)

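# Meta
# Note: the variable name and the app_id used later say "13b", but the model
# string below requests the Llama-2-7B-32K-Instruct variant.
llama_2_13b = LiteLLM(model_engine="together_ai/togethercomputer/Llama-2-7B-32K-Instruct")
def wrapped_relevance_llama2(input, output):
    return llama_2_13b.qs_relevance(input, output)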

Here we'll set up our golden set as a set of prompts, responses and expected scores stored in `test_cases.py`. The `mae` method of `GroundTruthAgreement` used below then looks up the expected score for each prompt/response pair by **exact match** and takes the L1 difference (absolute error) between the actual score and the expected score.
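
The exact-match lookup and L1 difference described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the TruLens implementation: the helper names `lookup_expected_score` and `absolute_error`, the `expected_score` key, and the sample cases are hypothetical; only the `query` and `response` keys appear in this notebook.

```python
# Hypothetical golden set; the real one lives in test_cases.py.
# The "expected_score" key name is an assumption made for this sketch.
golden_set = [
    {
        "query": "Who wrote Hamlet?",
        "response": "Hamlet was written by William Shakespeare.",
        "expected_score": 1.0,
    },
    {
        "query": "Who wrote Hamlet?",
        "response": "Bananas are rich in potassium.",
        "expected_score": 0.0,
    },
]


def lookup_expected_score(golden_set, query, response):
    """Return the expected score for an exact (query, response) match, else None."""
    for case in golden_set:
        if case["query"] == query and case["response"] == response:
            return case["expected_score"]
    return None


def absolute_error(golden_set, query, response, actual_score):
    """L1 difference between the actual and expected score, or None if no match."""
    expected = lookup_expected_score(golden_set, query, response)
    if expected is None:
        return None
    return abs(actual_score - expected)


# A model that scores the irrelevant context at 0.25 is off by 0.25.
error = absolute_error(
    golden_set, "Who wrote Hamlet?", "Bananas are rich in potassium.", 0.25
)
```

Because the lookup is by exact match, any change to the prompt or response text (even whitespace) means no expected score is found, so the test loop should pass the golden-set strings through unmodified.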

In [4]:
# Create the ground-truth object from the golden set
ground_truth = GroundTruthAgreement(context_relevance_golden_set)
# Build a Feedback from its mae method: the first two call arguments supply the prompt and response, and the record output supplies the score
f_mae = Feedback(ground_truth.mae, name="Mean Absolute Error").on(Select.Record.calls[0].args.args[0]).on(Select.Record.calls[0].args.args[1]).on_output()

✅ In Mean Absolute Error, input prompt will be set to __record__.calls[0].args.args[0] .
✅ In Mean Absolute Error, input response will be set to __record__.calls[0].args.args[1] .
✅ In Mean Absolute Error, input score will be set to __record__.main_output or `Select.RecordOutput` .


In [5]:
tru_wrapped_relevance_turbo = TruBasicApp(wrapped_relevance_turbo, app_id = "context relevance gpt-3.5-turbo", feedbacks=[f_mae])

tru_wrapped_relevance_gpt4 = TruBasicApp(wrapped_relevance_gpt4, app_id = "context relevance gpt-4", feedbacks=[f_mae])

tru_wrapped_relevance_commandnightly = TruBasicApp(wrapped_relevance_command_nightly, app_id = "context relevance Command-Nightly", feedbacks=[f_mae])

tru_wrapped_relevance_claude1 = TruBasicApp(wrapped_relevance_claude1, app_id = "context relevance Claude 1", feedbacks=[f_mae])

tru_wrapped_relevance_claude2 = TruBasicApp(wrapped_relevance_claude2, app_id = "context relevance Claude 2", feedbacks=[f_mae])

tru_wrapped_relevance_llama2 = TruBasicApp(wrapped_relevance_llama2, app_id = "context relevance Llama-2-13b", feedbacks=[f_mae])

✅ added app context relevance gpt-3.5-turbo
✅ added feedback definition feedback_definition_hash_ac1d5b3a2009be5efdb59a1f22e23053
✅ added app context relevance gpt-4
✅ added feedback definition feedback_definition_hash_ac1d5b3a2009be5efdb59a1f22e23053
✅ added app context relevance Command-Nightly
✅ added feedback definition feedback_definition_hash_ac1d5b3a2009be5efdb59a1f22e23053
✅ added app context relevance Claude 1
✅ added feedback definition feedback_definition_hash_ac1d5b3a2009be5efdb59a1f22e23053
✅ added app context relevance Claude 2
✅ added feedback definition feedback_definition_hash_ac1d5b3a2009be5efdb59a1f22e23053
✅ added app context relevance Llama-2-13b
✅ added feedback definition feedback_definition_hash_ac1d5b3a2009be5efdb59a1f22e23053


In [None]:
for i in range(len(context_relevance_golden_set)):
    prompt = context_relevance_golden_set[i]["query"]
    response = context_relevance_golden_set[i]["response"]
    with tru_wrapped_relevance_turbo as recording:
        tru_wrapped_relevance_turbo.app(prompt, response)
    
    with tru_wrapped_relevance_gpt4 as recording:
        tru_wrapped_relevance_gpt4.app(prompt, response)
    
    with tru_wrapped_relevance_commandnightly as recording:
        tru_wrapped_relevance_commandnightly.app(prompt, response)
    
    with tru_wrapped_relevance_claude1 as recording:
        tru_wrapped_relevance_claude1.app(prompt, response)

    with tru_wrapped_relevance_claude2 as recording:
        tru_wrapped_relevance_claude2.app(prompt, response)

    with tru_wrapped_relevance_llama2 as recording:
        tru_wrapped_relevance_llama2.app(prompt, response)

In [7]:
Tru().get_leaderboard(app_ids=[]).sort_values(by="Mean Absolute Error")

✅ feedback result Mean Absolute Error DONE feedback_result_hash_086ffca9b39fe36e86797171e56e3f50


| app_id | Mean Absolute Error | latency | total_cost |
|---|---|---|---|
| context relevance Claude 1 | 0.186667 | 0.066667 | 0.0 |
| context relevance gpt-3.5-turbo | 0.206667 | 0.066667 | 0.000762 |
| context relevance gpt-4 | 0.253333 | 0.066667 | 0.015268 |
| context relevance Command-Nightly | 0.313333 | 0.066667 | 0.0 |
| context relevance Claude 2 | 0.366667 | 0.066667 | 0.0 |
| context relevance Llama-2-13b | 0.586667 | 0.066667 | 0.0 |
