# Generating a Test Set with TruLens

In the early stages of developing an LLM app, it is often challenging to generate a comprehensive test set on which to evaluate your app.

This notebook demonstrates test set generation with TruLens, aimed particularly at applications that rely on private data or context, such as RAG apps.

By providing your LLM app callable, we can leverage your app to generate its own test set according to your specifications for `test_breadth` and `test_depth`. The resulting test set contains both question categories tailored to your data and a list of test prompts for each category. `test_breadth` sets the number of categories, and `test_depth` sets the number of prompts per category.
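
As a rough illustration of the shape described above, the sketch below assumes the test set behaves like a dictionary mapping each category name to a list of prompts, which matches how the evaluation loop at the end of this notebook iterates over it. The category names, the placeholder prompts, and the `check_shape` helper are hypothetical and are not output from TruLens.

```python
# Hypothetical placeholder test set: 2 categories (breadth) x 2 prompts (depth).
# Real categories and prompts are generated by your app and will differ.
example_test_set = {
    "Memory": [
        "What is sensory memory?",
        "How much information can be stored in short term memory?",
    ],
    "Planning": [
        "Placeholder planning question 1?",
        "Placeholder planning question 2?",
    ],
}


def check_shape(test_set, test_breadth, test_depth):
    """Return True if test_set has test_breadth categories with test_depth prompts each."""
    if len(test_set) != test_breadth:
        return False
    return all(len(prompts) == test_depth for prompts in test_set.values())


print(check_shape(example_test_set, test_breadth=2, test_depth=2))
```

Since the prompts are produced by an LLM, the generated set may not always match the requested counts exactly, so a check like this can be a cheap sanity step before evaluation.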

In [None]:
from trulens.benchmark.generate.generate_test_set import GenerateTestSet

## Set key

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "sk-..."

## Build application

In [None]:
# Imports from LangChain to build app
import bs4
from langchain import hub
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import StrOutputParser
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain_core.runnables import RunnablePassthrough

In [None]:
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
)
splits = text_splitter.split_documents(docs)

vectorstore = Chroma.from_documents(
    documents=splits, embedding=OpenAIEmbeddings()
)

In [None]:
retriever = vectorstore.as_retriever()

prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

## Generate a test set using the RAG

Now that we've set up the application, we can instantiate the `GenerateTestSet` class with the application. This way the test set generation will be tailored to your app and data.

After instantiating the `GenerateTestSet` class, generate your test set by specifying `test_breadth` and `test_depth`.

In [None]:
test = GenerateTestSet(app_callable=rag_chain.invoke)
test_set = test.generate_test_set(test_breadth=3, test_depth=2)
test_set

We can also provide a list of examples to help guide our app to the types of questions we want to test.

In [None]:
examples = [
    "What is sensory memory?",
    "How much information can be stored in short term memory?",
]

fewshot_test_set = test.generate_test_set(
    test_breadth=3, test_depth=2, examples=examples
)
fewshot_test_set

## Evaluate your application

Now that we have our test set, we can use it to test our app. Importantly, we'll attach each category as metadata to its test prompts, so that the performance of our RAG can be evaluated per question category.
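
The tagging idea can be sketched without TruLens. Assuming the test set is a dictionary of category to prompts (as the recording loop later in this notebook treats it), the hypothetical helper `iter_tagged_prompts` pairs each prompt with the metadata it would be recorded under. The sample data is a placeholder, not real generated output.

```python
def iter_tagged_prompts(test_set):
    """Yield (metadata, prompt) pairs, one per test prompt."""
    for category, prompts in test_set.items():
        for prompt in prompts:
            # Same metadata key the recording loop uses: prompt_category
            yield {"prompt_category": category}, prompt


# Hypothetical placeholder data for illustration only.
sample = {
    "Category A": ["Question 1?", "Question 2?"],
    "Category B": ["Question 3?"],
}
tagged = list(iter_tagged_prompts(sample))
print(len(tagged))
```

Every prompt carries exactly one category, which is what makes a per-category comparison possible afterwards.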

### Set up feedback functions

In [None]:
import numpy as np
from trulens.apps.langchain import TruChain
from trulens.core import Feedback
from trulens.providers.openai import OpenAI

# Initialize provider class
openai = OpenAI()

# Select the context to be used in feedback. The location of context is app-specific.
context = TruChain.select_context(rag_chain)

# Define a groundedness feedback function using the provider's built-in method
f_groundedness = (
    Feedback(openai.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())  # collect context chunks into a list
    .on_output()
)

# Question/answer relevance between overall question and answer.
f_qa_relevance = Feedback(openai.relevance).on_input_output()
# Question/statement relevance between question and each context chunk.
f_context_relevance = (
    Feedback(openai.context_relevance).on_input().on(context).aggregate(np.mean)
)

### Instrument app for logging with TruLens

In [None]:
from trulens.apps.langchain import TruChain

tru_recorder = TruChain(
    rag_chain,
    app_name="ChatApplication",
    app_version="chain_1",
    feedbacks=[f_qa_relevance, f_context_relevance, f_groundedness],
)

In [None]:
from trulens.core import TruSession
from trulens.dashboard import run_dashboard

tru = TruSession()
run_dashboard(tru, force=True)

### Evaluate the application with our generated test set

In [None]:
with tru_recorder as recording:
    for category in test_set:
        recording.record_metadata = dict(prompt_category=category)
        test_prompts = test_set[category]
        for test_prompt in test_prompts:
            llm_response = rag_chain.invoke(test_prompt)
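
Once records carry a `prompt_category`, results can be compared by category. The sketch below uses only the standard library and made-up placeholder scores. It does not read from TruLens, and the record layout (`prompt_category`, `score`) and the helper `mean_score_by_category` are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_category(records):
    """Group scores by prompt_category and average each group."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["prompt_category"]].append(record["score"])
    return {category: mean(scores) for category, scores in grouped.items()}


# Made-up placeholder scores for illustration only.
placeholder_records = [
    {"prompt_category": "Category A", "score": 1.0},
    {"prompt_category": "Category A", "score": 0.5},
    {"prompt_category": "Category B", "score": 0.0},
]
summary = mean_score_by_category(placeholder_records)
print(summary)
```

A category with a noticeably lower average than the others is a reasonable place to start inspecting individual records in the dashboard.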