## You Need A Subject Matter Expert!

We need to evaluate our AI product. I saw you googling (or ChatGPTing) for the best metric to assess your RAG. *Stop it immediately!* A typical error when building an evaluation system for a RAG is adopting many confusing metrics. **It is hard to find a good response-quality metric in a domain with open-ended responses, especially at the beginning.**

Before digging into fancy metrics, **you must involve Subject Matter Experts (SMEs)** in the project to build a successful AI product. An SME is someone who knows the domain inside out and who will either use your product or has a stake in making it helpful. If you are building an AI system that answers new employees' questions about the onboarding process, you can involve an HR manager. Likewise, involve a lawyer if your product must respond to legal questions.

It's even better if you can identify the *principal* subject matter expert. As Hamel says, the principal SME is someone "whose judgment is crucial for the success of your AI product. These are the people with deep domain expertise or represent your target users." There are a couple of reasons why you should find them: they set the standard and, since they are at most two people, their judgment will be consistent. Also, their involvement might make them feel like owners of the project.

By now, you should get the idea: since we are building a movie expert, we need a cinema geek! Yes, it might be hard to find one in the room. But perhaps you can turn to the person sitting next to you, and treat them as your expert.

### How to engage the SME, synthetic data, and the data flywheel

Now that we have chosen our SME, we need to swap our developer hat for a product designer's. How can we engage the experts? The first thing that comes to mind is to ask them to come up with example/ideal interactions. This might take time, and might not cover all of the use cases.

There's something else we could ask them to do instead: judge a set of questions and answers we already have. If you think about it, writing a poem is hard for most of us, while saying whether a poem is good or bad is much easier. We can leverage the power of LLMs to come up with both the questions and the answers to bootstrap our evaluation pipeline.

Two caveats:

1. Synthetic data might not cover every edge case - that's a fact. But keep in mind that we are trying to go from zero to one here. Once your product is online, you should set up an observability framework to monitor the interactions, and figure out a way to include real-world interactions in your eval datasets, in compliance with data regulations and policies. This is a powerful pattern, named the *data flywheel*. Read more about it [here](https://jxnl.co/writing/2024/03/28/data-flywheel/) and [here](https://hamel.dev/blog/posts/evals/#problem-how-to-systematically-improve-the-ai).
2. You might want to use the most powerful model you have at your disposal for this. Evals are expensive. But making changes to your product without them might cost you even more. Also, it will still likely be cheaper than having your colleagues come up with the bootstrap data themselves.

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "..."

## Build the first eval dataset

### Features, Scenarios and Users

The goal of this stage is to keep a tight and fast feedback loop with the SME. At this step, we need to uncover the expectations and prioritise the features. A workshop might be an ideal setting for this, though you might not always find the time. So here's a simple framework that might be of help to start bootstrapping your evals, focussing on the different characteristics that our product must have.

- Which **features** should our system have? What must it do specifically?
- Which **scenarios** should it be able to handle?
- Which kind of **users** will use the product? Are they experts, technical or non-technical people?

Every combination of feature, scenario and user could be an entry in your eval dataset. To keep computing times short, we will start by generating just 10.
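To make "every combination" concrete, here is a minimal sketch that enumerates the triplets with the standard library. The helper name `enumerate_triplets` and the example values are illustrative assumptions, not part of the workshop code.

```python
from itertools import product


def enumerate_triplets(features, scenarios, personas):
    """Return every (feature, scenario, persona) combination as a dict."""
    return [
        {"feature": feature, "scenario": scenario, "persona": persona}
        for feature, scenario, persona in product(features, scenarios, personas)
    ]


# Illustrative values only; swap in your own lists.
triplets = enumerate_triplets(
    ["Movie Recommendation", "Movie Synopsis"],
    ["Generic questions without details", "Toxic questions"],
    ["Movie enthusiast", "New user"],
)
print(len(triplets))  # 2 * 2 * 2 = 8
```

With 3 features, 3 scenarios and 2 personas there are 18 combinations, so a dataset capped at 10 rows cannot cover every triplet once. That is fine for a first pass, but worth remembering when you read the results.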

### 🏋️ Exercise: Come up with Features, Scenarios, and Users

Here you will find a list of possible features, scenarios and personas for the movie buddy. Do you agree with them? Feel free to amend them, improve the formulation, or add more.

In [None]:
movie_expert_features = [
    "Movie Recommendation",
    "Movie Synopsis",
    "Movie Metadata (cast, director, release dates)",
]
movie_expert_scenarios = [
    "Generic questions without details",
    "Questions not related to movies",
    "Toxic questions",
]
movie_expert_personas = [
    "Movie enthusiast",
    "New user",
]

## Generate the eval dataset

Now we need to generate questions for each triplet. For this, we leverage some basic prompt engineering, and structured output generation ([arXiv:2307.09702](https://arxiv.org/abs/2307.09702)).

In [None]:
SYSTEM_MESSAGE = """Act as if you are an AI system tester.
The user is a domain expert who must evaluate the answers an AI system generates to your questions.
Your role is to generate a dataset of questions to test a movie expert AI system. Note that the questions can vary, but they must follow the rules below.

RULES:
- The questions should test ONLY the following product features: {features}.
- The questions should test ONLY the following usage scenarios: {scenarios}
- You must generate the questions impersonating ONLY the following personas: {personas}
"""

MAX_ROWS = 10
EVAL_CONSTRUCTION_PROMPT = (
    """Generate an evaluation dataset with no more than {n_rows} rows"""
)

In [None]:
from beyond_the_hype.synthetic import build_eval_dataset_builder_system_message

system_message = build_eval_dataset_builder_system_message(
    SYSTEM_MESSAGE,
    movie_expert_features,
    movie_expert_scenarios,
    movie_expert_personas,
)
print(system_message)
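The helper imported above comes from the workshop's `beyond_the_hype` package, and its implementation is not shown in this notebook. As a rough sketch of what such a helper might do, assuming it simply fills the `{features}`, `{scenarios}` and `{personas}` placeholders with comma-separated lists, consider the following. The name `fill_system_message` and the join format are assumptions, so the real helper may differ.

```python
def fill_system_message(
    template: str,
    features: list[str],
    scenarios: list[str],
    personas: list[str],
) -> str:
    """Fill the three placeholders of a prompt template (hypothetical helper)."""
    return template.format(
        features=", ".join(features),
        scenarios=", ".join(scenarios),
        personas=", ".join(personas),
    )


# A tiny template with the same placeholder names as SYSTEM_MESSAGE.
example_template = "Features: {features}. Scenarios: {scenarios}. Personas: {personas}."
example_message = fill_system_message(
    example_template,
    ["Movie Recommendation", "Movie Synopsis"],
    ["Toxic questions"],
    ["New user"],
)
print(example_message)
```

Printing the filled message, as the cell above does, is a cheap way to catch a placeholder that was misspelled or left empty before you spend tokens on generation.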

In [None]:
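import polars as pl
from pydantic import BaseModel


# Each question is a pydantic model rather than a typing.TypedDict:
# pydantic rejects typing.TypedDict fields on Python < 3.12.
class EvalQuestionFormat(BaseModel):
    question_id: int
    question: str
    feature: str
    scenario: str
    persona: str


# Top-level schema that the LLM response is parsed into.
class EvalDataset(BaseModel):
    questions: list[EvalQuestionFormat]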

In [None]:
import openai

client = openai.OpenAI()

answer = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": EVAL_CONSTRUCTION_PROMPT.format(n_rows=MAX_ROWS)},
    ],
    response_format=EvalDataset,
)

answer = answer.choices[0].message.parsed

In [None]:
answer.model_dump()["questions"]
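Structured output guarantees the shape of each row, not that the model respected the rules in the system message. Before handing the questions to the SME, a quick sanity check along these lines may help. It is a minimal sketch: `find_off_spec_rows` is a hypothetical helper, the sample rows are made up, and exact string matching is a strict assumption, since the model may paraphrase a label.

```python
def find_off_spec_rows(questions, features, scenarios, personas):
    """Return the rows whose labels are not in the allowed lists."""
    allowed = {
        "feature": set(features),
        "scenario": set(scenarios),
        "persona": set(personas),
    }
    return [
        row
        for row in questions
        if any(row[field] not in values for field, values in allowed.items())
    ]


# Made-up rows in the same shape as the generated dataset.
sample_questions = [
    {
        "question_id": 1,
        "question": "What should I watch tonight?",
        "feature": "Movie Recommendation",
        "scenario": "Generic questions without details",
        "persona": "New user",
    },
    {
        "question_id": 2,
        "question": "Who won the 1998 World Cup?",
        "feature": "Sports Trivia",  # not an allowed feature
        "scenario": "Questions not related to movies",
        "persona": "New user",
    },
]

off_spec = find_off_spec_rows(
    sample_questions,
    features=["Movie Recommendation"],
    scenarios=["Generic questions without details", "Questions not related to movies"],
    personas=["New user"],
)
print([row["question_id"] for row in off_spec])  # [2]
```

In the notebook you would pass `answer.model_dump()["questions"]` together with the three `movie_expert_*` lists. Rows that come back are candidates for a manual look, not necessarily errors.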

Now we have a list of questions to pose to our AI, whose answers we will then ask our SME to evaluate. Note that the SME can also be involved earlier, either by suggesting features, scenarios and personas or by adding specific questions to the generated dataset.

In [None]:
answers = pl.from_dicts(answer.model_dump()["questions"])
answers

Now let's save this locally. Download it and head back to the first notebook!

In [None]:
answers.write_csv("./eval_questions.csv")