## You Need A Subject Matter Expert!

We need to evaluate our AI product. I saw you googling (or ChatGPTing) for the best metric to assess your RAG. *Stop it immediately!* A typical error when building an evaluation system for a RAG is adopting many confusing metrics. **It is hard to find a metric for response quality in a domain with open-ended responses, especially at the beginning.**

Before digging into fancy metrics, **you must involve Subject Matter Experts (SMEs)** in the project to build a successful AI product. An SME is someone who knows the domain deeply and who will use your product or has a stake in making it helpful. If you are building an AI system that answers new employees' questions about the onboarding process, you can involve an HR manager. Likewise, involve a lawyer if your product must respond to legal questions.

It's even better if you can identify the *principal* subject matter expert. As Hamel says, the principal SME is someone "whose judgment is crucial for the success of your AI product. These are the people with deep domain expertise or represent your target users." There are a couple of reasons why you should find them: they set the standard and, since they are at most two people, their judgment will be more consistent. Also, being involved might give them a sense of ownership of the project.

By now, you should get the idea: since we are building a movie expert, we need a cinema geek! Yes, it might be hard to find one in the room. But perhaps you can turn to the person sitting next to you, and treat them as your expert.

## Let's Build a First Question Dataset

Now that we have identified our SME, we need to ask them to evaluate the interactions with the chatbot. But how can we, if we have no data?

Here is another situation where LLMs come in handy: we can simply instruct them to generate the prompts for us. Two caveats:

1. Use the smartest model available. Evals are expensive, but it's far more expensive to make changes to your system without running them!
2. Synthetic data might not cover every nuance of real-world interactions. That's not a real problem: we are trying to get from zero to one, and we can always use (part of) the real-world data we collect from usage to build more eval datasets.

In [None]:
from pydantic import BaseModel
import polars as pl

In [None]:
import os

os.environ["OPENAI_API_KEY"] = ...  # replace the ellipsis with your API key (a string)

To instruct the LLM to generate valuable questions for an evaluation dataset, we can think about the characteristics our product must have, in particular:

- Which **features** should our system have? What must it do specifically?
- Which **scenarios** should it be able to handle without problems?
- Which kinds of **users** will use the product? Expert users? Technical or non-technical people?

Let's do this exercise! In the following lists, we suggest some possible features, scenarios and personas that our movie expert chatbot should cover. Think about the product we are building and try to add your own.

In [None]:
movie_expert_features = [
    "Movies Recommendation",
    "Movie Synopsis",
    "Movie Metadata (cast, director, release dates)",
]
movie_expert_scenarios = [
    "Generic questions without details",
    "Questions not related to movies",
    "Toxic Questions",
]
movie_expert_personas = ["Movie enthusiast", "New Users"]
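One way to reason about these three lists is to enumerate every feature/scenario/persona combination, which shows how many distinct cases a fully covered dataset would need. With three features, three scenarios and two personas, the full grid has 18 combinations. The following is a minimal sketch using only the standard library; it re-declares shortened lists so it runs on its own, and `build_coverage_grid` is a hypothetical helper, not part of the notebook.

```python
from itertools import product

# Shortened copies of the notebook lists, so the sketch is self-contained.
features = ["Movies Recommendation", "Movie Metadata (cast, director, release dates)"]
scenarios = ["Generic questions without details", "Toxic Questions"]
personas = ["Movie enthusiast", "New Users"]


def build_coverage_grid(features, scenarios, personas):
    """Return every (feature, scenario, persona) combination as a dict."""
    return [
        {"feature": f, "scenario": s, "persona": p}
        for f, s, p in product(features, scenarios, personas)
    ]


grid = build_coverage_grid(features, scenarios, personas)
print(f"{len(grid)} combinations to cover")
```

Not every combination is necessarily meaningful, so treat the grid as a checklist to discuss with your SME rather than a hard requirement.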

In [None]:
SYSTEM_MESSAGE = """Act as an AI system tester.
The user is a domain expert who must evaluate the answers an AI system generates for your questions.
Your role is to generate a dataset of questions to test a movie expert AI system. Note that the questions can vary but must follow the rules below.

RULES:
- The questions should test ONLY the following product features: {features}.
- The questions should test ONLY the following usage scenarios: {scenarios}
- You must generate the questions impersonating ONLY the following personas: {personas}
"""

MAX_ROWS = 20
EVAL_CONSTRUCTION_PROMPT = (
    """Generate an evaluation dataset with no more than {n_rows} rows"""
)

In [None]:
def build_eval_dataset_builder_system_message(
    system_message_format: str,
    features: list[str],
    scenarios: list[str],
    personas: list[str],
) -> str:
    """Fill the system message template with comma-separated lists."""
    # Use the template passed as an argument, not the global SYSTEM_MESSAGE.
    return system_message_format.format(
        features=", ".join(features),
        scenarios=", ".join(scenarios),
        personas=", ".join(personas),
    )

In [None]:
system_message = build_eval_dataset_builder_system_message(
    SYSTEM_MESSAGE, movie_expert_features, movie_expert_scenarios, movie_expert_personas
)
print(system_message)

In [None]:
class EvalQuestionFormat(BaseModel):
    question_id: int
    question: str
    feature: str
    scenario: str
    persona: str


class EvalDataset(BaseModel):
    questions: list[EvalQuestionFormat]

In [None]:
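import openai

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = openai.OpenAI()

# Request structured output that conforms to the EvalDataset schema.
chat_completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": EVAL_CONSTRUCTION_PROMPT.format(n_rows=MAX_ROWS)},
    ],
    response_format=EvalDataset,
)

# `parsed` holds an EvalDataset instance (it is None if the model refused).
answer = chat_completion.choices[0].message.parsed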

In [None]:
answer.model_dump()["questions"]
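Because `feature`, `scenario` and `persona` are plain `str` fields in the schema, the model is free to return labels that differ slightly from the ones we listed. A quick sanity check is to count questions per label and flag any label we did not ask for. The sketch below uses only the standard library and a hypothetical hand-written sample shaped like the dicts above; `count_by` and `unexpected_labels` are illustrative helpers, not part of the notebook.

```python
from collections import Counter

# Hypothetical sample, shaped like the dicts in answer.model_dump()["questions"].
sample_questions = [
    {"question_id": 1, "question": "What should I watch tonight?",
     "feature": "Movies Recommendation",
     "scenario": "Generic questions without details",
     "persona": "New Users"},
    {"question_id": 2, "question": "Who directed Jaws?",
     "feature": "Movie Metadata (cast, director, release dates)",
     "scenario": "Generic questions without details",
     "persona": "New Users"},
    {"question_id": 3, "question": "Recommend me something.",
     "feature": "Recommendations",  # a label that was not in our list
     "scenario": "Generic questions without details",
     "persona": "New Users"},
]

allowed_features = {
    "Movies Recommendation",
    "Movie Metadata (cast, director, release dates)",
}


def count_by(questions, field):
    """Count how many questions carry each value of `field`."""
    return Counter(q[field] for q in questions)


def unexpected_labels(questions, field, allowed):
    """Return the sorted values of `field` that are not in `allowed`."""
    return sorted({q[field] for q in questions} - set(allowed))


print(count_by(sample_questions, "feature"))
print(unexpected_labels(sample_questions, "feature", allowed_features))
```

If unexpected labels show up, you can either map them back to the intended ones by hand or regenerate the dataset.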

Now we have a list of questions to pose to our AI, whose answers we will ask our SME to evaluate. Note that the SME can also be involved earlier, both to suggest features, scenarios and personas and to add specific questions to the generated dataset.
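If the SME contributes questions of their own, they can be appended to the generated list while continuing the `question_id` sequence. This is a minimal sketch under stated assumptions: `add_sme_questions` is a hypothetical helper, the sample rows are made up for illustration, and SME entries are assumed to arrive without a `question_id`.

```python
def add_sme_questions(generated, sme_questions):
    """Append SME-written questions, continuing the question_id sequence.

    Assumes each SME entry has no question_id of its own.
    """
    next_id = max((q["question_id"] for q in generated), default=0) + 1
    merged = list(generated)  # copy, so the generated list is left untouched
    for offset, q in enumerate(sme_questions):
        merged.append({"question_id": next_id + offset, **q})
    return merged


# Hypothetical rows, shaped like the generated dataset.
generated = [
    {"question_id": 1, "question": "What should I watch tonight?",
     "feature": "Movies Recommendation",
     "scenario": "Generic questions without details",
     "persona": "New Users"},
    {"question_id": 2, "question": "Tell me about a movie.",
     "feature": "Movie Synopsis",
     "scenario": "Generic questions without details",
     "persona": "New Users"},
]
sme_questions = [
    {"question": "Which film should I start with if I have never seen a western?",
     "feature": "Movies Recommendation",
     "scenario": "Generic questions without details",
     "persona": "New Users"},
]

merged = add_sme_questions(generated, sme_questions)
print([q["question_id"] for q in merged])
```

The merged list has the same shape as the generated one, so it can be passed to `pl.from_dicts` in the same way.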

In [None]:
answers = pl.from_dicts(answer.model_dump()["questions"])
answers

In [None]:
answers.write_csv("../data/eval_questions.csv")