<img src="https://raw.githubusercontent.com/comet-ml/opik/main/apps/opik-documentation/documentation/static/img/opik-logo.svg" width="250"/>

# 🛠 An LLM Jury as a Custom Metric in Opik

An LLM jury consists of multiple independent LLM evaluators that each assess the same input; their outputs are then aggregated with an ensembling technique such as voting, averaging, or max selection. Compared to a single large judge model, a jury of smaller, diverse models can reduce intra-model bias, perform better, and cost much less to run.
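
As a rough illustration of the aggregation step, the sketch below applies the three ensembling rules mentioned above to a list of per-judge scores. The function names, the sample scores, and the 0.5 voting threshold are illustrative assumptions; they are not part of Opik or of the paper.

```python
from statistics import mean


def average_score(scores: list[float]) -> float:
    """Averaging: the jury verdict is the mean of the judges' scores."""
    return mean(scores)


def majority_vote(scores: list[float], threshold: float = 0.5) -> int:
    """Voting: return 1 if more than half of the judges score above the threshold, else 0."""
    votes = [score > threshold for score in scores]
    return int(sum(votes) > len(votes) / 2)


def max_score(scores: list[float]) -> float:
    """Max selection: the jury verdict is the highest individual score."""
    return max(scores)


# Hypothetical scores from three judges for one answer
jury_scores = [1.0, 0.5, 1.0]

print(average_score(jury_scores))
print(majority_vote(jury_scores))
print(max_score(jury_scores))
```

The metric built later in this notebook uses the averaging rule.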

For more about LLM juries, read [the original ArXiv paper here](https://arxiv.org/abs/2404.18796).

### ⚙ Set up the environment

In [None]:
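# transformers, torch, and accelerate are needed to load the Qwen model below;
# pandas and numpy are used when preparing the dataset
%pip install opik datasets openai transformers torch accelerate pandas numpy --quiet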

You'll need a [free Opik account](https://www.comet.com/signup?utm_campaign=opik&utm_medium=colab&utm_source=llm_jury_blog) to start running this code (if you already have a Comet account, that works too!). Next, [grab your API key](https://www.comet.com/account-settings/apiKeys?utm_campaign=opik&utm_medium=colab&utm_source=llm_jury_blog) from your `Account Settings` and run the following code:

In [None]:
import os

# Set the project name for Opik
os.environ["OPIK_PROJECT_NAME"] = "llm-juries-project"

import opik
opik.configure()

In [None]:
import getpass

# Set OpenAI API key: https://openai.com/
if "OPENAI_API_KEY" not in os.environ:
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

In [None]:
# Set OpenRouter API key: https://openrouter.ai/
if "OPENROUTER_API_KEY" not in os.environ:
  os.environ["OPENROUTER_API_KEY"] = getpass.getpass("Enter your OpenRouter API key: ")

### ⚙ Define the model

For more information on `Qwen2.5-3B-Instruct`, see the [Hugging Face model card here](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct).

We'll be using `Qwen2.5-3B-Instruct` to generate answers to questions in the Natural Questions (NQ) dataset from Google Research ([see below](https://colab.research.google.com/drive/1Lt-4rvNIYPhgCMpaTd2N6GxJu9LkfcE5#scrollTo=XCsp2QnMvNyb)).

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

### ⚙ Define custom functions

By adding tracking to our LLM application, we'll have full visibility into each evaluation run. In the example below we use the `@track` decorator, but there are other ways of adding tracking to your code outlined in the [Opik documentation](https://www.comet.com/docs/opik/tracing/log_traces?utm_campaign=opik&utm_medium=colab&utm_source=LLM_Jury_blog).

Here we define a function to generate responses to the input questions from the dataset we'll define in the next few steps.

In [None]:
from opik import track

@track
def generate_answer(input_question: str) -> str:
  """Generates an answer based on the input question using the loaded LLM."""
  messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": input_question}
  ]
  text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
  )
  model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

  generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
  )
  generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
  ]

  response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  return response
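
The list comprehension in `generate_answer` is there because a decoder-only model returns the prompt tokens followed by the newly generated ones, so the prompt has to be sliced off before decoding. Here is a minimal sketch of that slicing step using plain Python lists; the helper name and the token IDs are made up for illustration.

```python
def strip_prompt_tokens(input_ids: list[int], output_ids: list[int]) -> list[int]:
    """Return only the tokens that were generated after the prompt."""
    return output_ids[len(input_ids):]


# Hypothetical token IDs: the output repeats the prompt, then adds two new tokens
prompt_ids = [101, 7592, 2088]
full_output = [101, 7592, 2088, 42, 43]

new_tokens = strip_prompt_tokens(prompt_ids, full_output)
print(new_tokens)  # [42, 43]
```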

In [None]:
@track
def evaluation_task(data):
    """Evaluates the LLM output given a dataset sample."""
    llm_output = generate_answer(data['question'])
    return {"output": llm_output}

### ⚙ Define our metric

Opik has several [built-in evaluation metrics](https://www.comet.com/docs/opik/evaluation/metrics/overview), but also supports [custom metric definitions](https://www.comet.com/docs/opik/evaluation/metrics/custom_metric) using [Opik's BaseMetric class](https://github.com/comet-ml/opik/blob/main/sdks/python/src/opik/evaluation/metrics/base_metric.py). Here, we build a custom metric that calls each of three jury models through OpenRouter and averages their scores.

For our particular use case, we want the models to return structured outputs in the form of valid JSON objects. For this, we'll define the exact structure we're looking for in a variable called `RESPONSE_FORMAT`.

For more information on what a JSON schema is, [see here](https://json-schema.org/overview/what-is-jsonschema). For more information on how to use JSON schemas as structured outputs with OpenAI, [see here](https://platform.openai.com/docs/guides/structured-outputs).
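
The schema used in this notebook mentions the 0 to 1 range only in a description, so it may still be worth checking the parsed reply on the client side. The sketch below shows one way to do that with the standard library; `parse_verdict` is a hypothetical helper and is not used by the metric defined later.

```python
import json
from typing import Any


def parse_verdict(raw: str) -> dict[str, Any]:
    """Parse a judge's JSON reply and check it has a score in [0, 1] and a string reason."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    score = data.get("score")
    reason = data.get("reason")

    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= 1:
        raise ValueError("score must be between 0 and 1")
    if not isinstance(reason, str):
        raise ValueError("reason must be a string")

    return {"score": float(score), "reason": reason}


# Hypothetical judge reply
print(parse_verdict('{"score": 0.5, "reason": "Partly supported."}'))
```

Because every failure surfaces as a `ValueError`, a caller can skip a misbehaving judge with a single `except ValueError` clause.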

In [None]:
# JSON schema for hallucination scoring response_format
RESPONSE_FORMAT = {
      "type": "json_schema",
      "json_schema": {
        "name": "hallucination_score",
        "strict": True,
        "schema": {
          "type": "object",
          "properties": {
            "score": {
              "type": "number",
              "description": "A hallucination score between 0 and 1"
            },
            "reason": {
              "type": "string",
              "description": "The reasoning for the assessed hallucination score"
            }
          },
          "required": ["score", "reason"],
          "additionalProperties": False
        }
      }
    }

Next, we define our LLM Jury metric.

In [None]:
from opik.evaluation.metrics import base_metric, score_result
from opik.evaluation import models
import json
from typing import Any
from openai import OpenAI
from opik.integrations.openai import track_openai
import numpy as np


class LLMJuryMetric(base_metric.BaseMetric):
    """Metric to evaluate LLM outputs for factual accuracy using multiple models and an average voting function."""
    def __init__(self, name: str = "LLM Jury"):
        self.name = name
        self.llm_client = track_openai(OpenAI(base_url="https://openrouter.ai/api/v1",
                                              api_key=os.getenv("OPENROUTER_API_KEY"),)
        )
        self.prompt_template = """
        You are an impartial judge evaluating the following claim for factual accuracy. Analyze it carefully
        and respond with a number between 0 and 1: 1 if completely accurate, 0.5 if mixed accuracy, or 0 if inaccurate.
        Then provide one brief sentence explaining your ruling.

        The format of your response should be a JSON object with no additional text or backticks that follows the format:
        {{
            "score": <score between 0 and 1>,
            "reason": "<reason for the score>"
        }}

        Claim to evaluate: {output}

        Response:
        """
        self.model_names = ["openai/gpt-4o-mini", "mistralai/mistral-small-24b-instruct-2501", "cohere/command-r-08-2024"]
    def score(self, output: str, **ignored_kwargs: Any):
        """
        Score the output of an LLM.

        Args:
            output: The output of an LLM to score.
            **ignored_kwargs: Any additional keyword arguments. This is important so that the metric can be used in the `evaluate` function.
        """

        # Construct the prompt based on the output of the LLM
        prompt = self.prompt_template.format(output=output)

        completions = []

        for model in self.model_names:
          try:
              completion = self.llm_client.chat.completions.create(
                  model=model,
                  messages=[
                      {
                          "role": "user",
                          "content": prompt
                          }
                      ],
                  response_format=RESPONSE_FORMAT
                  )

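              response_data = json.loads(completion.choices[0].message.content)
              # Validate the score here so a malformed reply is skipped rather than averaged
              response_data["score"] = float(response_data["score"])
              # Keep the model name with its reply so reasons stay correctly attributed
              # even when an earlier model was skipped
              completions.append((model, response_data))
          except (json.JSONDecodeError, AttributeError, IndexError, KeyError, TypeError, ValueError) as error:
              print(f"Error parsing response from model {model}: {error}")
              continue  # Skip this model if an error occurs

        if completions:
              avg_score = float(np.mean([response["score"] for _, response in completions]))
              reasons = {model: response.get("reason", "") for model, response in completions}

        else:
              avg_score = 0.0
              reasons = "No valid responses received."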

        return score_result.ScoreResult(
            name=self.name,
            value=avg_score,
            reason=str(reasons)
        )
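
Since the metric stores the per-model reasons as `str(reasons)`, they reach Opik as a single string. If you later want them back as a dictionary, something along these lines should work; `parse_reasons` is a hypothetical helper and the example reason is made up.

```python
import ast


def parse_reasons(reason: str) -> dict[str, str]:
    """Recover the per-model reasons from the string stored on the score result."""
    try:
        parsed = ast.literal_eval(reason)
    except (ValueError, SyntaxError):
        # The fallback message ("No valid responses received.") is not a dict literal
        return {}
    return parsed if isinstance(parsed, dict) else {}


# Hypothetical reason string, built the same way the metric builds it
example = str({"openai/gpt-4o-mini": "The claim matches the record."})

print(parse_reasons(example))
print(parse_reasons("No valid responses received."))
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text that originated from a model.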

### ⚙ Create the Opik Dataset

For this experiment, we'll be using the questions contained in the [Natural Questions (NQ) dataset, created by Google Research and hosted by Hugging Face](https://huggingface.co/datasets/google-research-datasets/nq_open?library=datasets).

In [None]:
from datasets import load_dataset

# Load dataset
ds = load_dataset("google-research-datasets/nq_open")['train']

In [None]:
import pandas as pd
import numpy as np

# Preprocess dataset
# Take only first 100 rows
df = ds.to_pandas().iloc[:100,:]
# Convert any NumPy arrays to plain Python lists (DataFrame.map requires pandas >= 2.1)
df = df.map(lambda x: x.tolist() if isinstance(x, np.ndarray) else x)
# Rename column to align with variables in our custom functions above
df.rename(columns={"answer":"reference"}, inplace=True)

In [None]:
from opik import Opik

# Log dataset to Opik
client = Opik()
dataset = client.get_or_create_dataset(name="NQ-subset")
dataset.insert_from_pandas(df)

### ⚙ Evaluate

In [None]:
# Instantiate our custom LLM Jury metric
# (use a separate variable name so the LLMJuryMetric class is not overwritten)
llm_jury_metric = LLMJuryMetric()

In [None]:
from opik.evaluation import evaluate

# Perform the evaluation
evaluation = evaluate(
    experiment_name="My LLM Jury Experiment",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[llm_jury_metric],
    task_threads=1
)

### Have any additional questions?
- Check out [Opik's full documentation here](https://www.comet.com/docs/opik/)
- [Connect with us on Slack!](https://chat.comet.com)