# Chapter 4: Improving LLM Evaluators
This notebook accompanies the Chapter 4 lesson of the [LLM Apps: Evaluation course](https://wandb.ai/site/courses/evals/).

## Bias in LLM Evaluators

<a target="_blank" href="https://colab.research.google.com/github/wandb/eval-course/blob/main/notebooks/chapter_04_bias_in_validators.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<!--- @wandbcode{eval-course-04} -->

LLM evaluators are incredibly effective tools for automating evaluation tasks, but they are not without their limitations. Like all LLM-based applications, they are susceptible to biases—both subtle and explicit. These biases don’t stem inherently from the concept of LLM evaluators themselves, but rather reflect the underlying patterns in the data and training processes that power modern LLMs.

Understanding and addressing these biases is crucial because they can distort evaluation outcomes, undermine fairness, or misalign with human judgment. While these issues are artifacts of today’s LLM systems—products of imperfect datasets, model training dynamics, and real-world complexities—they represent challenges we must navigate thoughtfully. Importantly, ongoing advancements in model development and data curation could significantly reduce or eliminate these biases in the future.

In this section, we’ll unpack common types of biases in LLM evaluators, demonstrate their real-world impact, and explore best practices to mitigate these biases, ensuring that evaluations remain reliable and aligned with desired objectives.


## Setup

Run the code cells below to set up your Colab notebook.

In [None]:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    !git clone --branch main https://github.com/wandb/eval-course
    %cd eval-course
    %cd notebooks
else:
    print("Not running in Google Colab. Skipping git clone.")

!pip install -qq google-generativeai weave

In [None]:
import os
import re
import getpass
import weave
import pandas as pd

# utility script
from utils.llm_client import LLMClient

import nest_asyncio
nest_asyncio.apply()

In [None]:
import google.generativeai as genai

genai.configure(
    api_key=getpass.getpass("Please enter your Google API key with Gemini access: ")
)

In [None]:
# initialize weave for tracing and evaluation
weave_client = weave.init(project_name="eval-course")

## Problem 1: Position Bias

LLM evaluators may favor an output because of where it appears in the prompt (for example, first or last in a sequence) rather than because of its quality.

In [None]:
import asyncio

from weave import Evaluation, Model

In [None]:
# Define the prompt template for pairwise comparison
PAIRWISE_PROMPT = """Given a math question and two possible answers, determine which answer is better.

Question: {question}

Answer A: {answer_a}
Answer B: {answer_b}

Which answer is better? Respond with JUST "A" or "B".
"""


class PairWiseEvaluator(Model):
    where_is_correct: str = "A"
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-2.0-flash-exp")
    pairwise_judge_prompt: str = PAIRWISE_PROMPT

    @weave.op()
    def predict(self, question: str, correct: str, incorrect: str) -> tuple:
        if self.where_is_correct == "A":
            response = self.model.generate_content(
                self.pairwise_judge_prompt.format(
                    question=question, answer_a=correct, answer_b=incorrect,
                ),
            )
        elif self.where_is_correct == "B":
            response = self.model.generate_content(
                self.pairwise_judge_prompt.format(
                    question=question, answer_a=incorrect, answer_b=correct,
                ),
            )
        else:
            raise ValueError("where_is_correct must be either 'A' or 'B'")

        result = response.text.strip(" \n")
        return self.where_is_correct, result

In [None]:
# Load the dataset
mmlu_maths = weave.ref(
    "weave:///eval-course/eval-course-dev/object/mmlu_maths:sJp05YkihutzRAf3YZVXrvLUrN1qj49GvCKTgOoVSlE"
).get()

# Metric
@weave.op()
def exact_match(output: tuple) -> bool:
    """Check whether the judge picked the slot that holds the correct answer."""
    where_is_correct, result = output
    return where_is_correct == result

# Create evaluation
evaluation = Evaluation(dataset=mmlu_maths.rows, scorers=[exact_match])

In [None]:
# Run evaluation with where_is_correct = "A"
pairwise_evaluator = PairWiseEvaluator(where_is_correct="A")
a = asyncio.run(evaluation.evaluate(pairwise_evaluator))

# Run evaluation with where_is_correct = "B"
pairwise_evaluator = PairWiseEvaluator(where_is_correct="B")
b = asyncio.run(evaluation.evaluate(pairwise_evaluator))

What's the difference between the two evaluations?

The questions and answers are identical in both runs; only the slot that holds the correct answer changes. Any gap in accuracy between the two runs therefore comes from the position of the answer, not from its content.

In [None]:
print(
    "What's the difference in accuracy because of position bias?\n",
    b["exact_match"]["true_fraction"] - a["exact_match"]["true_fraction"],
)

### Solutions

- Swap Augmentation: Randomize the order of outputs to minimize position bias.
    - This is especially useful if you run your evaluation multiple times and take the average. ([Source](https://arxiv.org/pdf/2306.05685))

- Multiple Evidence Calibration (MEC): Prompt the model to generate evaluation evidence before assigning scores. In simple terms, you are asking the model to reason about the quality of the answer before assigning a score. ([Source](https://arxiv.org/pdf/2305.17926))

- Balanced Position Calibration (BPC): Evaluate each candidate in both positions across two runs and compute the final score as the average of the two runs ([Source](https://arxiv.org/pdf/2305.17926)).
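
As a rough illustration of Balanced Position Calibration, the sketch below queries a judge twice with the candidate order swapped and averages each candidate's wins over the two runs. The `Judge` signature, the function name `balanced_position_scores`, and the toy judge `always_a` are assumptions made for this example; in the notebook, the judge would be a call to the LLM with `PAIRWISE_PROMPT`.

```python
from typing import Callable, Dict

# Hypothetical judge signature: (question, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def balanced_position_scores(
    judge: Judge, question: str, first: str, second: str
) -> Dict[str, float]:
    """Average each candidate's win rate over both presentation orders."""
    wins = {"first": 0, "second": 0}

    # Run 1: `first` is shown as Answer A.
    verdict = judge(question, first, second).strip().upper()
    if verdict == "A":
        wins["first"] += 1
    elif verdict == "B":
        wins["second"] += 1

    # Run 2: positions swapped, so `second` is shown as Answer A.
    verdict = judge(question, second, first).strip().upper()
    if verdict == "A":
        wins["second"] += 1
    elif verdict == "B":
        wins["first"] += 1

    # Two runs in total, so each score is the fraction of runs won.
    return {name: count / 2 for name, count in wins.items()}


# Toy judge that always prefers slot A, i.e. a fully position-biased judge.
def always_a(question: str, answer_a: str, answer_b: str) -> str:
    return "A"


scores = balanced_position_scores(always_a, "What is 2 + 2?", "4", "5")
print(scores)  # {'first': 0.5, 'second': 0.5}
```

With a fully position-biased judge the two candidates tie at 0.5, which exposes the bias instead of rewarding whichever answer happened to be listed first.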

For a more detailed discussion of position bias, check out these two papers:

- [Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs](https://arxiv.org/pdf/2406.07791v1)
- [Large Language Models are not Fair Evaluators](https://arxiv.org/pdf/2305.17926)


## Problem 2: Verbosity Bias

LLM evaluators often exhibit verbosity bias, where they favor outputs that are more verbose, regardless of their actual quality or relevance. This bias arises because longer outputs can appear more comprehensive, detailed, or authoritative, even when they add unnecessary information.

In [None]:
# Let's create an evaluator that judges correctness of a single answer
CORRECTNESS_PROMPT = """Given a math question and the student's answer, determine if the answer is correct.

Question: {question}
Student Answer: {answer}

Is this answer correct? Respond with JUST "YES" or "NO".
"""


class CorrectnessEvaluator(Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-2.0-flash-exp")
    judge_prompt: str = CORRECTNESS_PROMPT

    @weave.op()
    def predict(self, question: str, correct: str) -> str:
        response = self.model.generate_content(
            self.judge_prompt.format(question=question, answer=correct),
        )

        result = response.text.strip(" \n")
        return result


@weave.op()
def is_correct(output: str) -> bool:
    return output == "YES"


evaluation = Evaluation(dataset=mmlu_maths.rows, scorers=[is_correct])

correctness_evaluator = CorrectnessEvaluator()
plain_answer = asyncio.run(evaluation.evaluate(correctness_evaluator))

In [None]:
# Now redefine the evaluator so that each correct answer is first padded with
# verbose, misleading reasoning before the same judge prompt is applied
CORRECTNESS_PROMPT = """Given a math question and the student's answer, determine if the answer is correct.

Question: {question}
Student Answer: {answer}

Is this answer correct? Respond with JUST "YES" or "NO".
"""


class CorrectnessEvaluator(Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-2.0-flash-exp")
    judge_prompt: str = CORRECTNESS_PROMPT

    @weave.op()
    def predict(self, question: str, correct: str) -> str:
        beautified_answer_prompt = """You are given a math question and the correct answer to that question.
        Can you expand on the answer by adding false reasoning steps that led to the answer?
        Keep the correct answer at the end but add wrong/misleading calculations that led to that answer.
        Question: {question}
        Answer: {answer}
        """
        _fake_answer = self.model.generate_content(
            beautified_answer_prompt.format(question=question, answer=correct),
        )

        # If the model returns no usable text, fall back to a plain statement of the correct answer.
        # The `response.text` quick accessor raises when the response contains no valid `Part`.
        try:
            beautified_answer = _fake_answer.text.strip(" \n")
        except Exception:
            beautified_answer = f"The correct answer is {correct}."

        response = self.model.generate_content(
            self.judge_prompt.format(
                question=question,
                answer=beautified_answer,
            ),
        )

        result = response.text.strip(" \n")
        return result


@weave.op()
def is_correct(output: str) -> bool:
    return output == "YES"


evaluation = Evaluation(dataset=mmlu_maths.rows, scorers=[is_correct])

correctness_evaluator = CorrectnessEvaluator()
beautified_answer = asyncio.run(evaluation.evaluate(correctness_evaluator))

In [None]:
print(
    "What's the difference in accuracy because of verbosity bias?\n",
    beautified_answer["is_correct"]["true_fraction"] - plain_answer["is_correct"]["true_fraction"],
)

We can mitigate verbosity bias by explicitly instructing the LLM judge not to favor longer responses and to focus on the quality and conciseness of the content.
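
One possible way to phrase that instruction is sketched below. The name `LENGTH_AWARE_CORRECTNESS_PROMPT`, the helper `build_judge_prompt`, and the exact wording of the added instruction are assumptions for illustration, not a tested prompt; the rest of the template mirrors `CORRECTNESS_PROMPT`.

```python
# Hypothetical variant of CORRECTNESS_PROMPT with an explicit length instruction.
LENGTH_AWARE_CORRECTNESS_PROMPT = """Given a math question and the student's answer, determine if the answer is correct.

Do not favor longer responses. Length and extra detail are not evidence of correctness.
Judge only the quality of the content, and treat a concise answer the same as a long one.

Question: {question}
Student Answer: {answer}

Is this answer correct? Respond with JUST "YES" or "NO".
"""


def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the length-aware template for one question/answer pair."""
    return LENGTH_AWARE_CORRECTNESS_PROMPT.format(question=question, answer=answer)


prompt = build_judge_prompt("What is 7 * 8?", "56")
print(prompt)
```

Since `judge_prompt` is a field on `CorrectnessEvaluator`, a template like this could be passed in at construction time and the evaluation rerun to check whether the accuracy gap actually shrinks.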

## Problem 3: Misinformation Oversight Bias

Misinformation oversight bias is the tendency of an LLM judge to overlook factual errors in the answer it is evaluating.

In [None]:
JUDGE_PROMPT = """You are an expert evaluator. Given a question and an answer, you need to determine if the answer is correct or incorrect.
Question: {question}
Answer: {answer}

Respond with exactly one word - either "correct" or "incorrect"."""


class MisinformationEvaluator(Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-2.0-flash-exp")
    judge_prompt: str = JUDGE_PROMPT

    @weave.op()
    def predict(self, question: str, answer: str) -> str:
        response = self.model.generate_content(
            self.judge_prompt.format(
                question=question,
                answer=answer,
            ),
        )

        result = response.text.strip(" \n")
        return result


@weave.op()
def is_correct(output: str) -> bool:
    return output.lower() == "correct"


rag_dataset = weave.ref(
    "weave:///eval-course/eval-course-dev/object/rag_dataset:0ZOviYESbFjXbfXgB5LaA8zpi6GlYj391FgRL31DpWM"
).get()

evaluation = Evaluation(dataset=rag_dataset.rows, scorers=[is_correct])

misinformation_evaluator = MisinformationEvaluator()
misinformation_results = asyncio.run(evaluation.evaluate(misinformation_evaluator))

In [None]:
JUDGE_PROMPT = """You are an expert evaluator. Given a question and an answer, you need to determine if the answer is correct or incorrect.
You are also given the context that led to the answer.

Question: {question}
Context: {context}
Answer: {answer}

Respond with exactly one word - either "correct" or "incorrect"."""


class MisinformationEvaluator(Model):
    model: genai.GenerativeModel = genai.GenerativeModel("gemini-2.0-flash-exp")
    judge_prompt: str = JUDGE_PROMPT

    @weave.op()
    def predict(self, context: str, question: str, answer: str) -> str:
        _slight_factual_error = """You are given a response from an LLM system. You are also given the context
        which was used to generate the answer. Can you add a slight factual error in the response. Don't change the answer too much.
        Don't change the length of the answer. Just slightly add slight factual error. The factual error should not be noticeable easily.
        Context: {context}
        Answer: {answer}
        """
        _factual_error_answer = self.model.generate_content(
            _slight_factual_error.format(context=context, answer=answer),
        )

        # If the model returns no usable text, fall back to the original answer wrapped in a fixed sentence.
        # The `response.text` quick accessor raises when the response contains no valid `Part`.
        try:
            _factual_error_answer = _factual_error_answer.text.strip(" \n")
        except Exception:
            _factual_error_answer = f"The incorrect answer is {answer}."
        response = self.model.generate_content(
            self.judge_prompt.format(
                question=question,
                context=context,
                answer=_factual_error_answer,
            ),
        )

        result = response.text.strip(" \n")
        return result

misinformation_evaluator = MisinformationEvaluator(judge_prompt=JUDGE_PROMPT)
misinformation_results = asyncio.run(evaluation.evaluate(misinformation_evaluator))
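
To quantify the effect the same way as in the earlier sections, the two evaluation summaries can be compared on their `true_fraction`. The helper below is a minimal sketch: the name `true_fraction_gap` is hypothetical, the numbers are made up, and it assumes the two summaries are kept under separate names (the cells above reuse `misinformation_results`, so the first summary would need to be saved before the second run).

```python
from typing import Any, Dict


def true_fraction_gap(
    baseline: Dict[str, Any], perturbed: Dict[str, Any], scorer: str = "is_correct"
) -> float:
    """Return perturbed minus baseline `true_fraction` for one scorer."""
    return perturbed[scorer]["true_fraction"] - baseline[scorer]["true_fraction"]


# Made-up summaries in the shape used earlier in this notebook,
# e.g. summary["is_correct"]["true_fraction"].
clean_summary = {"is_correct": {"true_fraction": 0.9}}
error_summary = {"is_correct": {"true_fraction": 0.75}}

gap = true_fraction_gap(clean_summary, error_summary)
print(f"Change in 'correct' verdicts after injecting errors: {gap:+.2f}")
```

If the judge still labels most error-injected answers as "correct", the gap stays close to zero, which is the oversight this section describes.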