### Ground truth dataset persistence and evaluation in TruLens

In this notebook, we walk through how to prepare and persist your own ground truth dataset, and how to use the `TruBEIRDataLoader` utility to load preprocessed BEIR (Benchmarking IR) datasets, which share a unified format.


In [None]:
# !pip install trulens trulens-provider-openai openai

In [None]:
from trulens.core import TruSession

session = TruSession()
session.reset_database()

#### Add custom ground truth dataset to TruLens 

In [None]:
import pandas as pd

data = {
    "query": ["hello world", "who is the president?", "what is AI?"],
    "query_id": ["1", "2", "3"],
    "expected_response": ["greeting", "Joe Biden", "Artificial Intelligence"],
    "expected_chunks": [
        [
            {
                "text": "All CS major students must know the term 'Hello World'",
                "title": "CS 101",
            }
        ],
        [
            {
                "text": "Barack Obama was the president of the US (POTUS) from 2009 to 2017.",
                "title": "US Presidents",
            }
        ],
        [
            {
                "text": "AI is the simulation of human intelligence processes by machines, especially computer systems.",
                "title": "AI is not a bubble :(",
            }
        ],
    ],
}

df = pd.DataFrame(data)
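
Before persisting a hand-built dataset, it can help to check that every row has the expected shape. The sketch below is a hypothetical helper, not part of TruLens: the column names mirror the example dataframe above, and treating all four as required (and requiring a `text` field per chunk) is an assumption made for illustration.

```python
from typing import Any, Dict, List

# Column names mirror the example dataframe; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_COLUMNS = ("query", "query_id", "expected_response", "expected_chunks")


def find_schema_problems(rows: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems found in ground truth rows."""
    problems = []
    for index, row in enumerate(rows):
        for column in REQUIRED_COLUMNS:
            if column not in row:
                problems.append(f"row {index}: missing column '{column}'")
        for chunk in row.get("expected_chunks", []):
            # Each chunk is expected to be a dict carrying at least a "text" key.
            if not isinstance(chunk, dict) or "text" not in chunk:
                problems.append(f"row {index}: chunk without a 'text' field")
    return problems


good_row = {
    "query": "hello world",
    "query_id": "1",
    "expected_response": "greeting",
    "expected_chunks": [{"text": "Hello World", "title": "CS 101"}],
}
bad_row = {"query": "what is AI?", "expected_chunks": ["raw string chunk"]}

problems = find_schema_problems([good_row, bad_row])
for problem in problems:
    print(problem)
```

With the dataframe above, the rows could be obtained with `df.to_dict("records")` and passed to the helper.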

### Idempotency in TruLens datasets
IDs for both datasets and ground truth entries are derived from their content and metadata, so `add_ground_truth_to_dataset` is idempotent: adding the same data again should not create duplicate rows in the DB.
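
To see why content-derived IDs make repeated inserts harmless, here is a minimal sketch. The hashing scheme, the `content_id` helper, and the `InMemoryStore` class are all hypothetical and only illustrate the idea; they are not how TruLens computes its IDs.

```python
import hashlib
import json


def content_id(record: dict, prefix: str = "gt") -> str:
    """Derive a deterministic ID from a record's content (hypothetical scheme)."""
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{prefix}_{digest[:16]}"


class InMemoryStore:
    """Toy stand-in for a database table keyed by a content-derived ID."""

    def __init__(self) -> None:
        self.rows = {}

    def upsert(self, record: dict) -> str:
        row_id = content_id(record)
        # Identical content maps to the same key, so no duplicate row appears.
        self.rows[row_id] = record
        return row_id


store = InMemoryStore()
entry = {"query": "what is AI?", "expected_response": "Artificial Intelligence"}
first_id = store.upsert(entry)
second_id = store.upsert(dict(entry))  # re-adding identical content
print(first_id == second_id, len(store.rows))
```

Because the key depends only on the content, changing any field of the entry would produce a new ID and therefore a new row.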

In [None]:
session.add_ground_truth_to_dataset(
    dataset_name="test_dataset_new",
    ground_truth_df=df,
    dataset_metadata={"domain": "Random QA"},
)

### Retrieving the ground truth dataset from the DB for ground truth evaluation (semantic similarity)

Below we retrieve the ground truth dataset (or a subset of it) that we just persisted, and use it as the golden set in the `GroundTruthAgreement` feedback function to perform ground truth lookup and evaluation.
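
Conceptually, the lookup step finds the expected response for an incoming query in the golden set, and that response is then compared with the app's actual output. The sketch below illustrates only the lookup. The `lookup_expected_response` helper is hypothetical, and exact string matching on the query is an assumption, not a statement about `GroundTruthAgreement` internals.

```python
from typing import Dict, List, Optional

# A tiny golden set with the same column names as the dataset above.
golden_set = [
    {"query": "hello world", "expected_response": "greeting"},
    {"query": "what is AI?", "expected_response": "Artificial Intelligence"},
]


def lookup_expected_response(
    query: str, rows: List[Dict[str, str]]
) -> Optional[str]:
    """Return the expected response for an exactly matching query, else None."""
    for row in rows:
        if row["query"] == query:
            return row["expected_response"]
    return None


print(lookup_expected_response("what is AI?", golden_set))
print(lookup_expected_response("unknown question", golden_set))
```

Under this assumption, a query that is not in the golden set has no reference answer, so it cannot be scored against ground truth.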

In [None]:
ground_truth_df = session.get_ground_truth("test_dataset_new")
ground_truth_df

In [None]:
import os

from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI as fOpenAI

os.environ["OPENAI_API_KEY"] = "sk-..."

f_groundtruth = Feedback(
    GroundTruthAgreement(ground_truth_df, provider=fOpenAI()).agreement_measure,
    name="Ground Truth (semantic similarity measurement)",
).on_input_output()

### Create Simple LLM Application

In [None]:
from openai import OpenAI
from trulens.apps.custom import instrument

oai_client = OpenAI()


class APP:
    @instrument
    def completion(self, prompt):
        completion = (
            oai_client.chat.completions.create(
                model="gpt-4o-mini",
                temperature=0,
                messages=[
                    {
                        "role": "user",
                        "content": f"Please answer the question: {prompt}",
                    }
                ],
            )
            .choices[0]
            .message.content
        )
        return completion


llm_app = APP()

### Instrument the app for logging with TruLens

In [None]:
# wrap llm_app in a TruLens recorder that can be used as a context manager
from trulens.apps.custom import TruCustomApp

tru_app = TruCustomApp(
    llm_app, app_name="LLM App v1", feedbacks=[f_groundtruth]
)

In [None]:
# The instrumented app can operate as a context manager:
with tru_app as recording:
    llm_app.completion("what is AI?")

In [None]:
session.get_leaderboard(app_ids=[tru_app.app_id])

In [None]:
session.reset_database()

### Loading a dataset to a dataframe
This is helpful when we want to inspect the ground truth dataset after transformation. The example below loads a preprocessed dataset from the BEIR (Benchmarking Information Retrieval) collection.

In [None]:
from trulens.benchmark.benchmark_frameworks.dataset.beir_loader import (
    TruBEIRDataLoader,
)

beir_data_loader = TruBEIRDataLoader(data_folder="./", dataset_name="scifact")

gt_df = beir_data_loader.load_dataset_to_df(download=True)

In [None]:
gt_df.expected_chunks[0]

In [None]:
# then we can save the ground truth to the dataset
session.add_ground_truth_to_dataset(
    dataset_name="my_beir_scifact",
    ground_truth_df=gt_df,
    dataset_metadata={"domain": "Information Retrieval"},
)

### Single method to save to the database
We also make it easy to persist a dataset directly to the DB. This is particularly useful for larger datasets such as MSMARCO, where there are over 8 million documents in the corpus.
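
For corpora of that size, one general way to keep memory bounded is to write rows in fixed-size batches instead of building the full dataframe first. The sketch below shows only the batching idea. The `batched` helper and the batch size are hypothetical, and nothing here describes how `persist_dataset` is actually implemented.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def batched(rows: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield consecutive lists of at most `batch_size` rows from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# A generator stands in for a corpus that is too large to hold in memory.
corpus = ({"query_id": str(i)} for i in range(7))
batch_sizes = [len(batch) for batch in batched(corpus, batch_size=3)]
print(batch_sizes)
```

Each yielded batch could then be written to storage before the next one is read, so only one batch is held in memory at a time.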

In [None]:
beir_data_loader.persist_dataset(
    session=session,
    dataset_name="my_beir_scifact",
    dataset_metadata={"domain": "Information Retrieval"},
)

### Benchmarking feedback functions / evaluators as a special case of ground truth evaluation

In [None]:
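from typing import Tuple

# Alias the TruLens provider so it does not shadow the `openai.OpenAI` client
# class imported earlier in this notebook.
from trulens.providers.openai import OpenAI as fOpenAI

provider = fOpenAI(model_engine="gpt-4o")


def context_relevance_ff(
    input, output, benchmark_params
) -> Tuple[float, float]:
    # Score how relevant `output` (a context chunk) is to `input` (the query),
    # using the temperature supplied through the benchmark parameters.
    return provider.context_relevance(
        question=input,
        context=output,
        temperature=benchmark_params["temperature"],
    )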

In [None]:
gt_df = gt_df.head(10)
gt_df

In [None]:
from trulens.feedback import GroundTruthAggregator

true_labels = []

for chunks in gt_df.expected_chunks:
    for chunk in chunks:
        true_labels.append(chunk["expected_score"])
ndcg_agg_func = GroundTruthAggregator(true_labels=true_labels, k=10).ndcg_at_k
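
For intuition about what an NDCG@k aggregation measures, here is a minimal pure-Python sketch. It assumes the linear-gain form of DCG (relevance divided by a log2 rank discount); the source does not state which variant `GroundTruthAggregator.ndcg_at_k` uses, and the helper names below are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k items (linear gain)."""
    # Rank r (1-based) is discounted by log2(r + 1); enumerate is 0-based.
    return sum(
        rel / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(
    true_labels: Sequence[float], scores: Sequence[float], k: int
) -> float:
    """NDCG@k: DCG of the predicted ranking divided by DCG of the ideal one."""
    # Order the true labels by descending predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [true_labels[i] for i in order]
    ideal = sorted(true_labels, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


labels = [1, 0, 1]
perfect = ndcg_at_k(labels, scores=[0.9, 0.1, 0.8], k=3)
imperfect = ndcg_at_k(labels, scores=[0.1, 0.9, 0.8], k=3)
print(perfect, imperfect)
```

A ranking that places all relevant chunks first scores 1.0, and pushing an irrelevant chunk to the top lowers the score.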

In [None]:
len(true_labels)

In [None]:
from trulens.benchmark.benchmark_frameworks.tru_benchmark_experiment import (
    BenchmarkParams,
    TruBenchmarkExperiment,
    create_benchmark_experiment_app,
)

benchmark_experiment = TruBenchmarkExperiment(
    feedback_fn=context_relevance_ff,
    agg_funcs=[ndcg_agg_func],
    benchmark_params=BenchmarkParams(temperature=0.5),
)

In [None]:
tru_benchmark = create_benchmark_experiment_app(
    app_id="Cortex benchmark demo", benchmark_experiment=benchmark_experiment
)

with tru_benchmark as recording:
    feedback_res = tru_benchmark.app(gt_df)

In [None]:
feedback_res

In [None]:
session.get_leaderboard(app_ids=[])