## Meta Evaluation - evaluating your LLM-as-judge with TruLens

Meta evaluation is the process of evaluating evaluation methods themselves. Here we measure and benchmark the performance of LLM-based evaluators (also known as LLM-as-judge), where the main measure of performance is human alignment: how closely the generated scores agree with human judgments.


In TruLens, we implement this as a special case of GroundTruth evaluation, since we canonically regard human preferences as the ground truth in most LLM tasks.

For experiment tracking, we provide a suite of automatic metric computations via `GroundTruthAggregator`, such as the mean absolute error (MAE) and expected calibration error (ECE) used below.

In [None]:
import pandas as pd
from trulens.benchmark.benchmark_frameworks.tru_benchmark_experiment import (
    BenchmarkParams,
    TruBenchmarkExperiment,
    create_benchmark_experiment_app,
)
from trulens.core import TruSession
from trulens.feedback import GroundTruthAggregator

session = TruSession()
session.reset_database()

golden_set = [
    {
        "query": "who are Apple's competitors?",
        "expected_response": "Apple competitors include Samsung, Google, and Microsoft.",
        "expected_score": 1.0,  # groundtruth score annotated by human
    },
    {
        "query": "what is the capital of France?",
        "expected_response": "Paris is the capital of France.",
        "expected_score": 1.0,
    },
    {
        "query": "what is the capital of Spain?",
        "expected_response": "I love going to Spain.",
        "expected_score": 0,
    },
]


true_labels = [entry["expected_score"] for entry in golden_set]


gt_df = pd.DataFrame(golden_set)
gt_df
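
Before running a benchmark, it can help to check that every golden-set entry has the fields used above. The following is a minimal sketch in plain Python, not part of TruLens: the helper name `validate_golden_set` is hypothetical, and the `[0, 1]` score range is an assumption based on the labels used in this notebook.

```python
from typing import Any, Dict, List

# Keys used by the golden set in this notebook.
REQUIRED_KEYS = {"query", "expected_response", "expected_score"}


def validate_golden_set(entries: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
            continue
        score = entry["expected_score"]
        # Assumption: human scores are normalized to the range [0, 1].
        if not isinstance(score, (int, float)) or not 0 <= score <= 1:
            problems.append(
                f"entry {i}: expected_score {score!r} is outside [0, 1]"
            )
    return problems


# Illustrative entries only; the second one is deliberately out of range.
sample_entries = [
    {"query": "q1", "expected_response": "r1", "expected_score": 1.0},
    {"query": "q2", "expected_response": "r2", "expected_score": 2},
]
sample_problems = validate_golden_set(sample_entries)
print(sample_problems)
```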

In [None]:
import os

import snowflake.connector
from trulens.providers.cortex import Cortex

snowflake_connection_parameters = {
    "account": os.environ["SNOWFLAKE_ACCOUNT"],
    "user": os.environ["SNOWFLAKE_USER"],
    "password": os.environ["SNOWFLAKE_USER_PASSWORD"],
}
snowflake_connection = snowflake.connector.connect(
    **snowflake_connection_parameters
)
provider = Cortex(
    snowflake_connection,
    model_engine="mistral-large",
)

In [None]:
from typing import Tuple


# output is feedback_score
def context_relevance_ff(input, output, benchmark_params) -> float:
    return provider.context_relevance(
        question=input,
        context=output,
        temperature=benchmark_params["temperature"],
    )


# output is (feedback_score, confidence_score)
def context_relevance_ff_with_confidence(
    input, output, benchmark_params
) -> Tuple[float, float]:
    return provider.context_relevance_verb_confidence(
        question=input,
        context=output,
        temperature=benchmark_params["temperature"],
    )

### Collect the expected scores from the golden set and pass them to GroundTruthAggregator as ground truth labels

In [None]:
mae_agg_func = GroundTruthAggregator(true_labels=true_labels).mae


benchmark_experiment = TruBenchmarkExperiment(
    feedback_fn=context_relevance_ff,
    agg_funcs=[mae_agg_func],
    benchmark_params=BenchmarkParams(temperature=0.5),
)

tru_benchmark_arctic = create_benchmark_experiment_app(
    app_name="MAE", app_version="1", benchmark_experiment=benchmark_experiment
)

In [None]:
with tru_benchmark_arctic as recording:
    feedback_res = tru_benchmark_arctic.app(gt_df)

### Sanity check: compare the generated feedback scores with the passed-in ground truth labels [1, 1, 0]

In [None]:
feedback_res  # generated feedback scores from our context relevance feedback function

In [None]:
session.get_leaderboard()
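
To make the MAE aggregate concrete, here is a minimal sketch of the textbook definition in plain Python: the mean of the absolute differences between judge scores and human labels. It is independent of TruLens and is not the library's implementation; `mean_absolute_error` and the example scores are hypothetical, not real model output.

```python
from typing import Sequence


def mean_absolute_error(
    scores: Sequence[float], labels: Sequence[float]
) -> float:
    """Average absolute gap between judge scores and human labels."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one score is required")
    return sum(abs(s - y) for s, y in zip(scores, labels)) / len(scores)


# Hypothetical judge scores for three rows, compared with labels [1, 1, 0].
example_scores = [0.9, 1.0, 0.2]
human_labels = [1.0, 1.0, 0.0]
example_mae = mean_absolute_error(example_scores, human_labels)
print(round(example_mae, 3))
```

A lower value means the judge's scores sit closer to the human annotations; 0 would indicate exact agreement on every row.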

In [None]:
ece_agg_func = GroundTruthAggregator(true_labels=true_labels).ece

benchmark_experiment = TruBenchmarkExperiment(
    feedback_fn=context_relevance_ff_with_confidence,
    agg_funcs=[ece_agg_func],
    benchmark_params=BenchmarkParams(temperature=0.5),
)

tru_benchmark_arctic_calibration = create_benchmark_experiment_app(
    app_name="Expected Calibration Error (ECE)",
    app_version="1",
    benchmark_experiment=benchmark_experiment,
)

In [None]:
with tru_benchmark_arctic_calibration as recording:
    feedback_results = tru_benchmark_arctic_calibration.app(gt_df)

In [None]:
feedback_results

In [None]:
session.get_leaderboard()
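
For intuition about what a calibration metric measures, the sketch below implements the common equal-width-bin definition of expected calibration error: group predictions by confidence, then take the weighted average of the gap between accuracy and mean confidence in each bin. The binning and correctness rule used by `GroundTruthAggregator.ece` are not shown in this notebook, so this is an assumption-laden illustration rather than the library's method; the function name, the `n_bins` parameter, and the example values are hypothetical.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE over confidence values in [0, 1]."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Assign each prediction index to a confidence bin.
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append(i)

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        # Weight each bin's gap by the share of predictions it holds.
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical confidences and correctness flags (not real model output).
example_confidences = [0.95, 0.95, 0.55, 0.55]
example_correct = [True, True, True, False]
example_ece = expected_calibration_error(example_confidences, example_correct)
print(round(example_ece, 3))
```

In this example each bin is off by 0.05 (accuracy 1.0 at confidence 0.95, accuracy 0.5 at confidence 0.55), so the weighted gap is 0.05.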

### Users can also define custom aggregator functions and register them easily

In [None]:
# Example usage of custom aggregation function
from typing import List


def custom_aggr_function(
    scores: List[float], aggregator: GroundTruthAggregator
) -> float:
    # Example: Calculate the average of top k scores
    if aggregator.k is None:
        raise ValueError("k must be set for custom aggregation.")
    top_k_scores = sorted(scores, reverse=True)[: aggregator.k]
    return sum(top_k_scores) / len(top_k_scores) if top_k_scores else 0


gt_aggregator = GroundTruthAggregator(true_labels=true_labels, k=3)

# Register a custom aggregation function
gt_aggregator.register_custom_agg_func("mean_top_k", custom_aggr_function)

### The `my_custom_aggr_fnc` below can be passed in the `agg_funcs` parameter of `TruBenchmarkExperiment`

In [None]:
my_custom_aggr_fnc = gt_aggregator.mean_top_k

In [None]:
my_custom_aggr_fnc([
    5,
    5,
    1,
    2,
])  # top 3 scores are [5, 5, 2], so the average is 4
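
A custom aggregation function can also compare scores against the ground truth labels. The sketch below follows the same `(scores, aggregator)` signature as `custom_aggr_function` above and computes a root mean squared error. It assumes the aggregator exposes the labels as a `true_labels` attribute, which this notebook does not confirm; `StandInAggregator` is a hypothetical stand-in used so the example runs without TruLens.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class StandInAggregator:
    """Hypothetical stand-in for an aggregator object holding the labels."""

    true_labels: List[float]


def rmse_aggr_function(
    scores: List[float], aggregator: StandInAggregator
) -> float:
    # Assumption: the aggregator exposes the labels as `true_labels`.
    labels = aggregator.true_labels
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one score is required")
    squared_errors = [(s - y) ** 2 for s, y in zip(scores, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical judge scores compared with labels [1, 1, 0].
stand_in = StandInAggregator(true_labels=[1.0, 1.0, 0.0])
example_rmse = rmse_aggr_function([1.0, 0.5, 0.0], stand_in)
print(round(example_rmse, 4))
```

If the assumption holds, a function like this could be registered in the same way as `mean_top_k` above.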