### Ground truth dataset persistence and evaluation in TruLens

In this notebook, we give a quick walkthrough of how to prepare your own ground truth dataset and how to use our utility function to load preprocessed BEIR (Benchmarking IR) datasets, which share a unified format.


In [None]:
# !pip install trulens trulens-provider-openai openai

In [None]:
import pandas as pd
from trulens.core import TruSession

tru = TruSession()
tru.reset_database()

#### Add custom ground truth dataset to TruLens 

In [None]:
data = {
    "query": ["hello world", "who is the president?", "what is AI?"],
    "query_id": ["1", "2", "3"],
    "expected_response": ["greeting", "Joe Biden", "Artificial Intelligence"],
    "expected_chunks": [
        None,
        None,
        None,
    ],  # This can also be lists if applicable
    "meta": [None, None, None],  # Metadata can also be a list of dictionaries
}

df = pd.DataFrame(data)
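
As the comments above note, `expected_chunks` and `meta` do not have to be `None`. The sketch below shows one way to assemble such a dictionary from row-oriented records using only plain Python. The chunk strings, the `source` metadata key, and the name `data_with_chunks` are made up for illustration, and the exact structure TruLens expects for each chunk is an assumption here.

```python
# Row-oriented records are often easier to write by hand than parallel lists.
# The chunk strings and metadata keys below are hypothetical.
rows = [
    {
        "query": "what is AI?",
        "query_id": "3",
        "expected_response": "Artificial Intelligence",
        "expected_chunks": ["AI stands for Artificial Intelligence."],
        "meta": {"source": "handwritten"},
    },
    {
        "query": "hello world",
        "query_id": "1",
        "expected_response": "greeting",
        "expected_chunks": None,  # no chunks known for this query
        "meta": None,
    },
]

columns = ["query", "query_id", "expected_response", "expected_chunks", "meta"]

# Pivot to the column-oriented layout used by `data` above.
data_with_chunks = {col: [row.get(col) for row in rows] for col in columns}

# Each column needs exactly one entry per query.
assert all(len(values) == len(rows) for values in data_with_chunks.values())
```

The resulting dictionary can be passed to `pd.DataFrame` in the same way as `data`.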

### Idempotency in TruLens datasets:
IDs for both datasets and ground truth data entries are derived from their content and metadata, so `add_ground_truth_to_dataset` is idempotent and should not create duplicate rows in the DB.

In [None]:
tru.add_ground_truth_to_dataset(
    dataset_name="test_dataset_new",
    ground_truth_df=df,
    dataset_metadata={"domain": "Random QA"},
)
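
To illustrate why content-based IDs make repeated inserts harmless, here is a minimal sketch in plain Python. It is not the TruLens implementation: the helpers `content_id` and `upsert`, the SHA-256 hash, and the in-memory `store` dictionary are all assumptions chosen for the example.

```python
import hashlib
import json


def content_id(record: dict, prefix: str = "gt") -> str:
    """Derive a deterministic ID from a record's content (hypothetical helper)."""
    # Sort keys so that dictionary ordering does not change the hash.
    payload = json.dumps(record, sort_keys=True, default=str)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{prefix}_{digest[:16]}"


def upsert(store: dict, record: dict) -> str:
    """Store a record under its content ID; re-adding the same content overwrites itself."""
    record_id = content_id(record)
    store[record_id] = record
    return record_id


store = {}  # stands in for a database table
row = {"query": "what is AI?", "expected_response": "Artificial Intelligence"}

first = upsert(store, row)
second = upsert(store, dict(row))  # same content, added a second time

print(first == second, len(store))  # prints: True 1
```

Because the ID depends only on the record's content, adding the same row twice maps to the same key and leaves a single entry.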

### Loading dataset to a dataframe:
Load a preprocessed dataset from the BEIR (Benchmarking Information Retrieval) collection into a dataframe so that we can inspect the ground truth data after transformation.

In [None]:
from trulens.core.dataset.beir_loader import TruBEIRDataLoader

beir_data_loader = TruBEIRDataLoader(
    data_folder="/Users/dhuang/Documents", dataset_name="quora"
)

df = beir_data_loader.load_dataset_to_df(download=True)

In [None]:
df

In [None]:
# then we can save the ground truth to the dataset
tru.add_ground_truth_to_dataset(
    dataset_name="my_beir_quora",
    ground_truth_df=df,
    dataset_metadata={"domain": "Information Retrieval"},
)

### Single method to save to the database
We also make persisting directly to the DB easy. This is particularly useful for larger datasets such as MSMARCO, where there are over 8 million documents in the corpus.

In [None]:
beir_data_loader.persist_dataset(
    tru=tru,
    dataset_name="my_beir_quora",
    dataset_metadata={"domain": "Information Retrieval"},
)

### Retrieving groundtruth dataset from the DB for evaluation

Below we show how to retrieve the ground truth dataset (or a subset of it) that we just persisted and use it as the golden set in the `GroundTruthAgreement` feedback function to perform ground truth lookup and evaluation.

In [None]:
ground_truth_df = tru.get_ground_truth("test_dataset_new")
ground_truth_df

In [None]:
import os

from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI as fOpenAI

os.environ["OPENAI_API_KEY"] = "sk-..."

f_groundtruth = Feedback(
    GroundTruthAgreement(ground_truth_df, provider=fOpenAI()).agreement_measure,
    name="Ground Truth (semantic similarity measurement)",
).on_input_output()
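
Conceptually, the ground truth lookup step maps an incoming prompt to its expected response in the golden set before the two answers are compared. The sketch below shows that idea with plain Python. It assumes exact string matching on `query`, and `lookup_expected_response` is a hypothetical helper, not part of the TruLens API.

```python
from typing import Optional

# A small golden set mirroring the custom dataset defined earlier.
golden_set = [
    {"query": "hello world", "expected_response": "greeting"},
    {"query": "who is the president?", "expected_response": "Joe Biden"},
    {"query": "what is AI?", "expected_response": "Artificial Intelligence"},
]


def lookup_expected_response(golden_set: list, prompt: str) -> Optional[str]:
    """Return the expected response for a prompt, or None if it is not in the golden set."""
    for row in golden_set:
        # Assumption: prompts are matched by exact string equality.
        if row["query"] == prompt:
            return row["expected_response"]
    return None


expected = lookup_expected_response(golden_set, "what is AI?")
missing = lookup_expected_response(golden_set, "what is ML?")
```

Under this assumption, a prompt that does not appear verbatim in the golden set has no reference answer to compare against, so it is worth querying the app with the same strings that were persisted.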

### Create Simple LLM Application

In [None]:
from openai import OpenAI
from trulens.core.app.custom import instrument

oai_client = OpenAI()


class APP:
    @instrument
    def completion(self, prompt):
        completion = (
            oai_client.chat.completions.create(
                model="gpt-4o-mini",
                temperature=0,
                messages=[
                    {
                        "role": "user",
                        "content": f"Please answer the question: {prompt}",
                    }
                ],
            )
            .choices[0]
            .message.content
        )
        return completion


llm_app = APP()

### Instrument the app for logging with TruLens

In [None]:
# wrap llm_app with TruCustomApp so that its calls are recorded and evaluated
from trulens.core import TruCustomApp

tru_app = TruCustomApp(llm_app, app_id="LLM App v1", feedbacks=[f_groundtruth])

In [None]:
# The instrumented app can operate as a context manager:
with tru_app as recording:
    llm_app.completion("what is AI?")

In [None]:
tru.get_leaderboard(app_ids=[tru_app.app_id])