# Random Evaluation of Records

This notebook walks through the random evaluation of records with TruLens.

This is useful when we want to log every application run but running evaluations on each one is too expensive. To gauge the performance of the app we still need *some* evaluations, so we evaluate a representative sample of records. We do this by deciding after each record, according to a randomization scheme, whether to run and log feedback for it.
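
As a minimal sketch of such a randomization scheme (independent of TruLens), the snippet below decides per record whether to evaluate, using a seeded `random.Random` and a configurable sampling rate. The names `should_evaluate` and `sample_rate` are hypothetical and only illustrate the idea.

```python
import random


def should_evaluate(rng: random.Random, sample_rate: float) -> bool:
    """Return True with probability `sample_rate` (hypothetical helper)."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    return rng.random() < sample_rate


# Seed the generator so the sketch is reproducible.
rng = random.Random(0)
decisions = [should_evaluate(rng, sample_rate=0.5) for _ in range(1000)]

# Roughly half of the 1,000 simulated records are selected for evaluation.
print(f"Evaluated {sum(decisions)} of {len(decisions)} records")
```

Lowering `sample_rate` reduces evaluation cost at the price of a noisier estimate of app performance.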

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truera/trulens/blob/main/examples/experimental/random_evaluation.ipynb)

In [None]:
# !pip install --pre trulens chromadb==0.4.18 openai==1.3.7

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "sk-..."

## Get Data

In this case, we'll just initialize some simple text in the notebook.

In [None]:
university_info = """
The University of Washington, founded in 1861 in Seattle, is a public research university
with over 45,000 students across three campuses in Seattle, Tacoma, and Bothell.
As the flagship institution of the six public universities in Washington state,
UW encompasses over 500 buildings and 20 million square feet of space,
including one of the largest library systems in the world.
"""

## Create Vector Store

Create a chromadb vector store in memory.

In [None]:
from openai import OpenAI

oai_client = OpenAI()

oai_client.embeddings.create(
    model="text-embedding-ada-002", input=university_info
)

In [None]:
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

embedding_function = OpenAIEmbeddingFunction(
    api_key=os.environ.get("OPENAI_API_KEY"),
    model_name="text-embedding-ada-002",
)

chroma_client = chromadb.Client()
vector_store = chroma_client.get_or_create_collection(
    name="Universities", embedding_function=embedding_function
)

Add `university_info` to the vector store.

In [None]:
vector_store.add("uni_info", documents=university_info)

## Build RAG from scratch

Build a custom RAG from scratch, and add TruLens custom instrumentation.

In [None]:
from trulens.core import TruSession
from trulens.core.app.custom import instrument

tru = TruSession()
tru.reset_database()

In [None]:
class RAG_from_scratch:
    @instrument
    def retrieve(self, query: str) -> list:
        """
        Retrieve relevant text from vector store.
        """
        results = vector_store.query(query_texts=query, n_results=2)
        return results["documents"][0]

    @instrument
    def generate_completion(self, query: str, context_str: list) -> str:
        """
        Generate answer from context.
        """
        completion = (
            oai_client.chat.completions.create(
                model="gpt-3.5-turbo",
                temperature=0,
                messages=[
                    {
                        "role": "user",
                        "content": f"We have provided context information below. \n"
                        f"---------------------\n"
                        f"{context_str}"
                        f"\n---------------------\n"
                        f"Given this information, please answer the question: {query}",
                    }
                ],
            )
            .choices[0]
            .message.content
        )
        return completion

    @instrument
    def query(self, query: str) -> str:
        context_str = self.retrieve(query)
        completion = self.generate_completion(query, context_str)
        return completion


rag = RAG_from_scratch()

## Set up feedback functions

Here we'll use groundedness, answer relevance and context relevance to detect hallucination.

In [None]:
import numpy as np
from trulens.core import Feedback
from trulens.core import Select
from trulens.providers.openai import OpenAI as fOpenAI

# Initialize provider class
fopenai = fOpenAI()

# Define a groundedness feedback function
f_groundedness = (
    Feedback(fopenai.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(Select.RecordCalls.retrieve.rets.collect())
    .on_output()
)

# Question/answer relevance between overall question and answer.
f_qa_relevance = (
    Feedback(fopenai.relevance_with_cot_reasons, name="Answer Relevance")
    .on(Select.RecordCalls.retrieve.args.query)
    .on_output()
)

# Question/statement relevance between question and each context chunk.
f_context_relevance = (
    Feedback(
        fopenai.context_relevance_with_cot_reasons, name="Context Relevance"
    )
    .on(Select.RecordCalls.retrieve.args.query)
    .on(Select.RecordCalls.retrieve.rets.collect())
    .aggregate(np.mean)
)
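
When a feedback function produces several scores for one record (for example, one per retrieved chunk), `.aggregate(np.mean)` reduces them to a single value. A minimal sketch of that reduction, using made-up scores and the standard library's `statistics.mean` as a stand-in for `np.mean`:

```python
from statistics import mean

# Hypothetical context relevance scores, one per retrieved chunk.
chunk_scores = [0.5, 1.0]

# Aggregating with the mean gives one context relevance value for the record.
record_score = mean(chunk_scores)
print(record_score)
```

Other reducers, such as `min` or `max`, could be swapped in if the worst or best chunk matters more than the average.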

## Construct the app
Wrap the custom RAG with `TruCustomApp`. No feedback functions are attached here, because they are run selectively on a random subset of records later in the notebook.

In [None]:
from trulens.core import TruCustomApp

tru_rag = TruCustomApp(rag, app_name="RAG", app_version="v1")

## Eval Randomization

Create a function that randomly decides, with a 50% chance, whether to run the feedback functions on a given record.

In [None]:
import random
from typing import Sequence

from trulens.core import Feedback
from trulens.core.schema import Record


def random_run_feedback_functions(
    record: Record, feedback_functions: Sequence[Feedback]
) -> None:
    """
    Given the record, randomly decide whether to run feedback functions.

    args:
    record (Record): The record on which to evaluate the feedback functions

    feedback_functions (Sequence[Feedback]): A collection of feedback functions to evaluate.

    returns:
    None. If the record is selected, the feedback results are logged with `tru.add_feedbacks`; otherwise "Feedback skipped for this record" is printed.

    """
    # randomly decide to run feedback (50% chance)
    decision = random.choice([True, False])
    # run feedback if decided
    if decision:
        print("Feedback run for this record")
        # evaluate the feedback functions passed by the caller and log the results
        tru.add_feedbacks(
            tru.run_feedback_functions(
                record,
                feedback_functions=feedback_functions,
            )
        )
    else:
        print("Feedback skipped for this record")
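
An alternative to `random.choice` is to derive the decision from a hash of the record ID, so the same record always receives the same decision across reruns. The sketch below is one way to do that with the standard library; `hash_based_decision` and the example IDs are hypothetical, and in this notebook the ID would presumably come from `record.record_id`.

```python
import hashlib


def hash_based_decision(record_id: str, sample_rate: float = 0.5) -> bool:
    """Deterministically map a record ID to an evaluate/skip decision."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    value = int.from_bytes(digest[:8], "big")
    # Select the record if its value falls in the lowest `sample_rate` fraction.
    return value < sample_rate * 2**64


# The record IDs below are made up for illustration.
for record_id in ["record_a", "record_b", "record_c"]:
    print(record_id, hash_based_decision(record_id))
```

Because the decision depends only on the ID, it can be recomputed later to tell which records were sampled without storing the outcome.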

## Generate a test set

In [None]:
from trulens.test.generate_test_set import GenerateTestSet

test = GenerateTestSet(app_callable=rag.query)
test_set = test.generate_test_set(test_breadth=4, test_depth=1)
test_set

## Run the app
Run and log the RAG application for each prompt in the test set. For a random subset of cases, also run evaluations.

In [None]:
# run feedback across test set
for category in test_set:
    # run prompts in each category
    test_prompts = test_set[category]
    for test_prompt in test_prompts:
        # query the app with the current test prompt and capture its record
        result, record = tru_rag.with_record(rag.query, test_prompt)
        # randomly decide whether to run feedback for this record
        random_run_feedback_functions(
            record,
            feedback_functions=[
                f_context_relevance,
                f_groundedness,
                f_qa_relevance,
            ],
        )

In [None]:
# look up the app by the ID TruLens assigned to it rather than a hand-written string
tru.get_leaderboard(app_ids=[tru_rag.app_id])
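
Because only a random subset of records is evaluated, the averages reported for the app are estimates based on that sample. As a rough sketch with made-up scores (not output from this notebook), the standard error of the mean gives a sense of how much such an estimate might vary:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical groundedness scores for the records that happened to be evaluated.
sampled_scores = [1.0, 0.5, 1.0, 0.5]

estimate = mean(sampled_scores)
# Standard error of the mean: sample standard deviation divided by sqrt(n).
standard_error = stdev(sampled_scores) / sqrt(len(sampled_scores))

print(f"mean={estimate:.3f}, standard error={standard_error:.3f}")
```

Evaluating more records shrinks the standard error, which is the trade-off against evaluation cost.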

In [None]:
from trulens.dashboard import run_dashboard

run_dashboard(tru)