# Let's Build an LLM-as-a-Judge

In the previous sections, we argued that having a domain expert is fundamental to the success of your AI product. However, involving a domain expert every time you need an evaluation is costly in both money and time.

AI can help us again!

We can ask an LLM to judge how our movie expert answers a set of questions. It mimics the domain expert by giving us an outcome and a textual critique for each answer. Note that a domain expert remains fundamental to understanding what is expected from the product, and their considerations should guide us in building a strong LLM-as-a-judge tool.

In [None]:
import polars as pl
from pydantic import BaseModel

Let's build everything we need to ask an LLM to judge our questions. We start from the system message, which contains two elements:
- **guidelines** that describe how we expect the evaluations to be made, and
- **examples** that improve the quality of the response.

In [None]:
class EvaluationExample(BaseModel):
    question: str
    response: str
    outcome: str
    critique: str

In [None]:
JUDGE_SYSTEM_MESSAGE = """You are an evaluator of a movie expert chatbot. Your task is to decide whether each response to a user question is good or not.
Follow these guidelines when evaluating the system:
{guidelines}

The examples below show what a good evaluation looks like.
Evaluation examples:
{examples}
"""

EXAMPLE_STRUCTURE = """<user_question>{question}</user_question>
<response>{response}</response>
<outcome>{outcome}</outcome>
<critique>{critique}</critique>"""

In [None]:
# Each guideline is a separate list item; note the comma between the two strings.
evaluation_guidelines = [
    'Provide a "pass" or "fail" outcome. No other category is admitted.',
    'Provide a comprehensive and clear explanation of why you gave that outcome, in particular for a "fail" outcome.',
]

examples = [
    EvaluationExample(
        question="Could you suggest an action movie?",
        response='I recommend "The Thousand Faces of Dunjia" (2017), directed by Yuen Woo-ping. The film follows a group of swordsmen\'s adventures as they secretly protect humanity by hunting down mysterious creatures from outer space. It combines action and fantasy elements, making it an engaging watch for fans of the genre. Enjoy!',
        outcome="pass",
        critique="The system replied with an action movie correctly. The response is brief but well explained.",
    ),
    EvaluationExample(
        question="Give me a summary of DUNE II",
        response="I'm sorry, but I don't have information about \"Dune II.\" My expertise covers movie plots, metadata, and summaries of films, but it seems that title isn't available in my current context. If you have any questions about other movies, feel free to ask!",
        outcome="fail",
        critique="The system should reply to questions regarding movie summaries. It seems that it can't find the movie in the context.",
    ),
]
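
A pitfall worth knowing when writing lists of guideline strings like the one above: Python joins adjacent string literals when no comma separates them, so a forgotten comma silently merges two guidelines into one. The short sketch below uses made-up guideline text purely to illustrate the behaviour.

```python
# Adjacent string literals are concatenated when no comma separates them.
without_comma = [
    "Provide a pass or fail outcome."
    "Explain the outcome."
]
with_comma = [
    "Provide a pass or fail outcome.",
    "Explain the outcome.",
]

print(len(without_comma))  # 1
print(len(with_comma))  # 2
```

Printing the length of the guideline list before formatting the system message is a cheap way to catch this.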

We put everything together into a single string to obtain the final system message (`formatted_system_message`).

In [None]:
def build_evaluation_examples(example: EvaluationExample) -> str:
    return EXAMPLE_STRUCTURE.format(
        question=example.question,
        response=example.response,
        outcome=example.outcome,
        critique=example.critique,
    )


def build_evaluation_system_message(
    system_message: str,
    guidelines: list[str],
    examples: list[EvaluationExample],
) -> str:
    formatted_guidelines = "\n".join(guidelines)
    formatted_examples = "\n\n".join([build_evaluation_examples(ex) for ex in examples])
    return system_message.format(
        guidelines=formatted_guidelines, examples=formatted_examples
    )

In [None]:
formatted_system_message = build_evaluation_system_message(
    JUDGE_SYSTEM_MESSAGE,
    evaluation_guidelines,
    examples,
)
print(formatted_system_message)

Now we need to ask the judge to evaluate a list of question-response pairs. We can start with the questions already evaluated by our domain expert.

In [None]:
domain_expert_critiques = pl.read_csv("eval_comments.csv")

In [None]:
domain_expert_critiques

In [None]:
formatted_questions = (
    domain_expert_critiques.select(
        pl.format(
            "Question: {}\nAnswer: {}\n\n", pl.col("question"), pl.col("rag_answer")
        )
    )
    .to_series(0)
    .str.join("\n")
    .item()
)

In [None]:
prompt = f"""Evaluate how our AI system answered the given questions.
Here is the list of question-answer pairs:
{formatted_questions}"""

print(prompt)

In [None]:
class LLMJudgeDataset(BaseModel):
    critiques: list[EvaluationExample]

In [None]:
import os

import openai

os.environ["OPENAI_API_KEY"] = "..."

client = openai.OpenAI()

chat_completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": formatted_system_message},
        {"role": "user", "content": prompt},
    ],
    response_format=LLMJudgeDataset,
)

llm_judge_outcome = chat_completion.choices[0].message.parsed

In [None]:
# configure polars to display more text in each row
PL_STR_LEN = 1000
_ = pl.Config.set_fmt_str_lengths(PL_STR_LEN)
_ = pl.Config.set_tbl_width_chars(PL_STR_LEN)

In [None]:
llm_judge_df = pl.from_dicts(llm_judge_outcome.model_dump()["critiques"])
llm_judge_df = llm_judge_df.rename(
    {"outcome": "Judgement_llm_judge", "critique": "Critique_llm_judge"}
)
llm_judge_df
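
Because `outcome` is a free-form `str` and the judge answers all questions in a single call, nothing guarantees that it returns exactly one critique per question, in the same order, with only "pass" or "fail" labels. The sketch below is one possible sanity check in plain Python. The helper name `check_judge_output` and the toy data are hypothetical and only serve as an illustration.

```python
ALLOWED_OUTCOMES = {"pass", "fail"}


def check_judge_output(expected_questions: list[str], critiques: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the output looks usable."""
    problems = []
    # The judge should return exactly one critique per input question.
    if len(critiques) != len(expected_questions):
        problems.append(
            f"expected {len(expected_questions)} critiques, got {len(critiques)}"
        )
    for position, critique in enumerate(critiques):
        # Only the two labels requested in the guidelines are accepted.
        if critique["outcome"].strip().lower() not in ALLOWED_OUTCOMES:
            problems.append(
                f"row {position}: unexpected outcome {critique['outcome']!r}"
            )
        # The question echoed by the judge should match the input at that position.
        if (
            position < len(expected_questions)
            and critique["question"].strip() != expected_questions[position].strip()
        ):
            problems.append(f"row {position}: question does not match the input")
    return problems


# Toy data for illustration only.
toy_questions = ["Could you suggest an action movie?", "Give me a summary of DUNE II"]
toy_critiques = [
    {"question": "Could you suggest an action movie?", "outcome": "pass"},
    {"question": "Give me a summary of DUNE II", "outcome": "partial"},
]
problems = check_judge_output(toy_questions, toy_critiques)
print(problems)
```

In this notebook, the inputs could be `domain_expert_critiques["question"].to_list()` and `llm_judge_outcome.model_dump()["critiques"]`. The exact string comparison may be too strict if the judge paraphrases the questions, so treat mismatches as a prompt to inspect the rows rather than as hard errors.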

## How does our evaluator perform compared to the domain expert?

Domain expert outcomes and critiques are still essential! To measure the performance of the LLM-as-a-judge tool we built, we compare its outcomes and critiques with those of our domain expert. **Domain expert responses are our ground truth.** Ideally, the automatic evaluator aligns with the human feedback. Note that the horizontal concatenation below pairs rows by position, so it assumes the judge returned one critique per question, in the same order as the input.

In [None]:
full_judges = pl.concat(
    [
        domain_expert_critiques,
        llm_judge_df.select(["Judgement_llm_judge", "Critique_llm_judge"]),
    ],
    how="horizontal",
)

In [None]:
full_judges
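
Reading the table row by row works for a handful of questions, but a number makes it easier to see whether a prompt change helped. The sketch below computes the raw agreement rate and Cohen's kappa (agreement corrected for chance) between two lists of labels, and returns the positions where the judge and the expert disagree. The function name `agreement_report` and the toy labels are hypothetical and not part of the original notebook.

```python
def agreement_report(expert: list[str], judge: list[str]) -> dict:
    """Compare two equally long lists of labels such as "pass" and "fail"."""
    if len(expert) != len(judge) or not expert:
        raise ValueError("both label lists must be non-empty and of equal length")
    # Normalise case and whitespace so "Pass" and "pass " count as the same label.
    expert = [label.strip().lower() for label in expert]
    judge = [label.strip().lower() for label in judge]
    total = len(expert)
    disagreements = [i for i, (e, j) in enumerate(zip(expert, judge)) if e != j]
    observed = 1 - len(disagreements) / total
    # Agreement expected by chance, from each rater's label frequencies.
    labels = set(expert) | set(judge)
    expected = sum(
        (expert.count(label) / total) * (judge.count(label) / total)
        for label in labels
    )
    # Kappa is undefined when chance agreement is already 1 (a single label overall).
    kappa = (observed - expected) / (1 - expected) if expected < 1 else None
    return {"agreement": observed, "kappa": kappa, "disagreements": disagreements}


# Toy labels for illustration only.
toy_expert = ["pass", "pass", "fail", "fail"]
toy_judge = ["pass", "fail", "fail", "fail"]
report = agreement_report(toy_expert, toy_judge)
print(report)
```

In this notebook, the judge's labels are in the `Judgement_llm_judge` column of `full_judges`, while the name of the expert's label column depends on the contents of `eval_comments.csv`. The `disagreements` positions point to the rows worth discussing with the domain expert.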

## 🏋🏻 Exercise: Iterate until LLM-as-a-Judge aligns with the expert

Go back to the prompt and the system message, edit them, and rerun the code until the judge's results align with those of the domain expert!

You could also involve the domain expert again (who should be beside you) to check whether the LLM-as-a-judge critiques change their mind. The results might surprise you!