# Tutorial 3: Using QSTN for LLM-as-a-Judge

QSTN can be used directly as an LLM-as-a-judge. In this tutorial, we use the [**HelpSteer2 dataset**](https://huggingface.co/datasets/nvidia/HelpSteer2) by NVIDIA, in which human annotators rated model responses on **five distinct 0–4 Likert scales**:

*   **Helpfulness:** Overall utility and goal-fulfillment of the response.
*   **Correctness:** Factual accuracy and adherence to logical constraints.
*   **Coherence:** Structure, flow, and the natural clarity of the language used.
*   **Complexity:** The level of complexity in the response.
*   **Verbosity:** How verbose the response is.

In [7]:
#!pip install datasets
#!pip install pandas

from datasets import load_dataset
import pandas as pd

SEED = 1

dataset = load_dataset("nvidia/HelpSteer2", split="validation")


# Accessing a sample
df = dataset.to_pandas()
print("Size:", len(df))
df.sample(5, random_state=SEED)

Size: 1038


| | prompt | response | helpfulness | correctness | coherence | complexity | verbosity |
|---|---|---|---|---|---|---|---|
| 589 | Can you generate an extended strategic vision ... | Sure, here's an extended strategic vision for ... | 4 | 4 | 4 | 1 | 2 |
| 718 | write outline of report about air-prunning pots. | Sure! Here's an outline for a report about air... | 2 | 3 | 3 | 2 | 3 |
| 677 | how to plot a mean variance frontier for three... | To plot a mean-variance frontier for three ass... | 2 | 2 | 3 | 2 | 2 |
| 957 | `do you know the english word drama\n<extra_id_...` | I'm sorry, I'm not sure what you mean. Could y... | 0 | 0 | 4 | 1 | 0 |
| 314 | We're working with the Welsh govt on the issue... | Sure, here are three suggestions for answers t... | 1 | 1 | 4 | 1 | 2 |


## Setting up the prompts

We first need to create our prompts. Let's start with a simple system and user prompt.

In [None]:
system_prompt = (
    "You are an expert annotator that judges responses by Large Language Models "
    "on helpfulness, correctness, coherence, complexity and verbosity."
)
prompt = """You will be given a prompt by a user and a response by a Large Language Model. {question}\n
---- USER PROMPT: ----\n 
{prompt}\n
---- LLM RESPONSE: ----\n 
{response}\n
---- LLM RESPONSE END ----
{automatic_output}"""

The template contains four placeholders. ``question`` is where the task description goes, ``prompt`` holds the user prompt from the dataset, ``response`` holds the LLM response from the dataset, and ``automatic_output`` holds the instructions that tell the model how to format its answer.
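As a minimal sketch of how these named placeholders get filled, the snippet below applies plain `str.format` to a shortened version of the template. `QUESTION_TOKEN` and `OUTPUT_TOKEN` are hypothetical stand-ins for the QSTN placeholder constants that are passed for ``question`` and ``automatic_output`` later in this tutorial; their real values may look different.

```python
# Shortened stand-in for the tutorial's template (same four placeholder names).
template = (
    "{question}\n"
    "---- USER PROMPT: ----\n{prompt}\n"
    "---- LLM RESPONSE: ----\n{response}\n"
    "{automatic_output}"
)

# Hypothetical tokens; QSTN's actual placeholder constants may differ.
QUESTION_TOKEN = "<QUESTION>"
OUTPUT_TOKEN = "<OUTPUT_INSTRUCTIONS>"

filled = template.format(
    question=QUESTION_TOKEN,
    prompt="What is 2 + 2?",
    # Braces inside a substituted value are inserted verbatim, not re-parsed.
    response='The answer is {"value": 4}.',
    automatic_output=OUTPUT_TOKEN,
)
print(filled)
```

Because `str.format` only interprets braces in the template itself, dataset text that happens to contain `{` or `}` does not break this step.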

Next, we define one question that covers all categories, plus one question per category in case we want to ask about each one separately.

In [None]:
question_all = "Your task is to rate these responses on a Likert Scale from 0 - 4 in regards to helpfulness, correctness, coherence, complexity and verbosity."

question_helpfulness = "Your task is to rate these responses on a Likert Scale from 0 - 4 in regards to helpfulness."

question_correctness = "Your task is to rate these responses on a Likert Scale from 0 - 4 in regards to correctness."

question_coherence = "Your task is to rate these responses on a Likert Scale from 0 - 4 in regards to coherence."

question_complexity = "Your task is to rate these responses on a Likert Scale from 0 - 4 in regards to complexity."

question_verbosity = "Your task is to rate these responses on a Likert Scale from 0 - 4 in regards to verbosity."
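Since the per-category questions differ only in the category name, they could also be generated from a single template. This is a small sketch of that idea; the names `QUESTION_TEMPLATE` and `questions` are illustrative and not part of QSTN.

```python
# The five HelpSteer2 rating categories used throughout this tutorial.
CATEGORIES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

# Same wording as the hand-written questions, with the category left open.
QUESTION_TEMPLATE = (
    "Your task is to rate these responses on a Likert Scale from 0 - 4 "
    "in regards to {category}."
)

# Map each category to its question string.
questions = {cat: QUESTION_TEMPLATE.format(category=cat) for cat in CATEGORIES}

for cat, text in questions.items():
    print(f"{cat}: {text}")
```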

## Using a Response Generation Method

We want our LLM judge to respond only with a Likert score from 0 to 4 for each category. To enforce this, we restrict the model output to a JSON object with one field per category.

In [None]:
from qstn.inference import JSONResponseGenerationMethod
from qstn.utilities import constants

categories = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

json_prompt = """Respond only in the following JSON format and only respond with a single number from 0 to 4 for each category!\n\
{
    "helpfulness": "Overall utility and goal-fulfillment of the response on a scale from 0 to 4.", 
    "correctness":  "Factual accuracy and adherence to logical constraints on a scale from 0 to 4.",
    "coherence": "Structure, flow, and the natural clarity of the language used on a scale from 0 to 4.",
    "complexity": "The level of complexity in the response on a scale from 0 to 4.",
    "verbosity": "How verbose is the response on a scale from 0 to 4."
}
"""

json_constraints = {cat: constants.OPTIONS_ADJUST for cat in categories}

json_rgm = JSONResponseGenerationMethod(
    json_fields=categories,
    output_template=json_prompt,
    constraints=json_constraints,
    output_index_only=True,
)

In [None]:
from qstn.prompt_builder import generate_likert_options

answer_options = generate_likert_options(
    5, ["", "", "", "", ""], start_idx=0, response_generation_method=json_rgm
)

QSTN always needs a questionnaire structure. Since we ask only one question, a list containing a single dict is enough; we then convert it to a DataFrame.
Another option is to ask about each category (helpfulness, correctness, coherence, complexity and verbosity) individually. For that we would create a questionnaire with five different questions.

In [None]:
questionnaire_structure = [
    {"questionnaire_item_id": 1, "question_content": question_all}
]
df_questionnaire = pd.DataFrame(questionnaire_structure)
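The per-category alternative could be sketched as follows. It reuses the two column names from the structure above; the item ids 1 to 5 and the variable names are assumptions made for illustration. The resulting list can be passed to `pd.DataFrame(...)` in the same way as the single-question version.

```python
categories = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

question_template = (
    "Your task is to rate these responses on a Likert Scale from 0 - 4 "
    "in regards to {category}."
)

# One questionnaire item per category, numbered from 1 (assumed id scheme).
questionnaire_structure_per_category = [
    {
        "questionnaire_item_id": item_id,
        "question_content": question_template.format(category=cat),
    }
    for item_id, cat in enumerate(categories, start=1)
]

for item in questionnaire_structure_per_category:
    print(item)
```

With one question per item, the JSON output format defined earlier would presumably also need to be reduced to a single field per question.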

To make use of efficient batching, we create one `LLMPrompt` for each entry in our dataset.

In [None]:
from qstn.prompt_builder import LLMPrompt
from qstn.utilities import placeholder


def create_prompts(row):
    llm_prompt = LLMPrompt(
        questionnaire_source=df_questionnaire,
        questionnaire_name=str(row.name),
        system_prompt=system_prompt,
        prompt=prompt.format(
            prompt=row["prompt"],
            response=row["response"],
            question=placeholder.PROMPT_QUESTIONS,
            automatic_output=placeholder.PROMPT_AUTOMATIC_OUTPUT_INSTRUCTIONS,
        ),
        seed=SEED,
    )

    llm_prompt.prepare_prompt(
        question_stem=placeholder.QUESTION_CONTENT, answer_options=answer_options
    )

    return llm_prompt


# This creates the list directly from the dataframe results
llm_prompts = df.apply(create_prompts, axis=1).tolist()

In [14]:
example_system_prompt, example_prompt = llm_prompts[
    0
].get_prompt_for_questionnaire_type()

print("EXAMPLE SYSTEM PROMPT:", example_system_prompt)
print("EXAMPLE USER PROMPT:", example_prompt)

EXAMPLE SYSTEM PROMPT: You are an expert annotator that judges responses by Large Language Models on helpfulness, correctness, coherence, complexity and verbosity.
EXAMPLE USER PROMPT: You will be given a prompt by a user and a response by a Large Language Model. Your task is to rate these responses on a Likert Scale from 0 - 4 in regards to helpfulness, correctness, coherence, complexity and verbosity.

---- USER PROMPT: ----
 
explain master slave replication nsql

---- LLM RESPONSE: ----
 
In the context of NoSQL databases, master-slave replication refers to a configuration where a single master node writes data, and one or more slave nodes read data from the master and replicate it to provide read scalability. The master node is responsible for accepting write requests and updating its own data, while the slave nodes are responsible for replicating the data from the master and serving read requests.

In this configuration, the master node is the only node that can make changes to

## Basic Inference

This is already enough for a simple LLM-as-a-judge setup. We now generate the ratings through an OpenAI-compatible API; the commented-out lines show the alternative for local inference with vLLM.

In [None]:
model_id = "Qwen/Qwen3-VL-4B-Instruct"

# Local Inference
# from vllm import LLM
# generator = LLM(model_id, max_model_len=10000, seed=SEED)

# Remote Inference
from openai import AsyncOpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

generator = AsyncOpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

In [16]:
from qstn.survey_manager import conduct_survey_single_item

results = conduct_survey_single_item(
    generator, llm_prompts=llm_prompts, client_model_name=model_id, seed=SEED
)

Processing questionnaires:   0%|          | 0/1 [00:00<?, ?it/s]



Processing Prompts: 100%|██████████| 1038/1038 [03:34<00:00,  4.85it/s]


In [17]:
from qstn.parser import parse_json
from qstn.utilities import create_one_dataframe

parsed_results = parse_json(results)

full_results = create_one_dataframe(parsed_results)

In [18]:
full_results.head(5)

| | questionnaire_name | questionnaire_item_id | question | helpfulness | correctness | coherence | complexity | verbosity |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Your task is to rate these responses on a Like... | 4 | 4 | 4 | 3 | 3 |
| 1 | 1 | 1 | Your task is to rate these responses on a Like... | 4 | 4 | 4 | 3 | 3 |
| 2 | 2 | 1 | Your task is to rate these responses on a Like... | 4 | 4 | 4 | 1 | 1 |
| 3 | 3 | 1 | Your task is to rate these responses on a Like... | 4 | 3 | 4 | 3 | 3 |
| 4 | 4 | 1 | Your task is to rate these responses on a Like... | 2 | 2 | 3 | 3 | 3 |
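Before comparing with the human ratings, it can be worth checking that every parsed score is an integer in the expected range. The helper below is a hypothetical sketch, not part of QSTN, and it assumes that parsed values may arrive as strings or as out-of-range numbers; if the parser already returns clean integers, this step is unnecessary.

```python
def to_score(value, low=0, high=4):
    """Return value as an int within [low, high], or None if it is unusable."""
    try:
        score = int(str(value).strip())
    except ValueError:
        # Not an integer literal (e.g. "five", "None", or an empty string).
        return None
    return score if low <= score <= high else None


# Toy inputs standing in for one column of parsed judge output.
raw_scores = ["4", 3, " 2 ", "five", 7, None]
clean_scores = [to_score(v) for v in raw_scores]
print(clean_scores)
```

Rows that come back as `None` can then be inspected or dropped before computing any agreement metrics.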


## Compare Results with Human Rating

In [None]:
#!pip install scipy
from scipy.stats import pearsonr

# Define the 5 HelpSteer2 categories
categories = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

performance_metrics = []

for cat in categories:
    human_scores = df[cat]
    judge_scores = full_results[cat]

    correlation, _ = pearsonr(human_scores, judge_scores)
    accuracy = (human_scores == judge_scores).mean()

    performance_metrics.append(
        {
            "Category": cat.capitalize(),
            "Pearson Corr": round(correlation, 3),
            "Exact Match": f"{accuracy:.1%}",
        }
    )

df_summary = pd.DataFrame(performance_metrics)
print(df_summary)

      Category  Pearson Corr Exact Match
0  Helpfulness         0.397       41.2%
1  Correctness         0.403       44.9%
2    Coherence         0.343       64.6%
3   Complexity         0.383       21.2%
4    Verbosity         0.568       23.7%
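The two metrics capture different things, which helps explain rows such as Verbosity, where the correlation is the highest but exact agreement is low. The toy example below, with made-up ratings and a hand-written Pearson function using only the standard library, shows a judge that tracks the human ranking closely while being shifted by one point on most items.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def exact_match(xs, ys):
    """Fraction of positions where both raters gave the same score."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)


# Made-up ratings: the judge is one point higher than the human except at the top.
human = [0, 1, 2, 3, 4]
judge = [1, 2, 3, 4, 4]

print("Pearson:", round(pearson(human, judge), 3))
print("Exact match:", exact_match(human, judge))
```

Here the correlation is close to 0.97 even though only one of the five ratings matches exactly, so a systematic offset in the judge's scale lowers exact match without lowering correlation much.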
