This repository contains the code for evaluating model responses on the TruthfulVQA dataset. It supports two main evaluation modes: multiple-choice question evaluation and open-ended question evaluation with pairwise comparison.
- Multiple evaluation modes:
  - Multiple-choice question evaluation with comprehensive metrics
  - Open-ended question evaluation using a judge model for pairwise comparison
- Rich terminal output with formatted tables and panels
- Flexible file saving options
- Batch processing support for the judge model
- Clone the repository
- Install required packages:
```
pip install rich numpy transformers vllm pillow
```

Multiple-choice evaluation scores model responses on multiple-choice questions, computing various accuracy metrics. There are two sub-modes:
The first sub-mode, `extract_and_eval`, extracts answers from raw model responses and then computes evaluation metrics.
```
python main.py extract_and_eval input.json [options]
```

Options:
- `--save-extracted PATH`: Save extracted results to the specified path
- `--save-metrics PATH`: Save evaluation metrics to the specified path
- `--output-dir DIR`: Base directory for output files
Example:
```
# Extract and evaluate without saving intermediate results
python main.py extract_and_eval example.json

# Extract and evaluate, saving both extracted results and metrics
python main.py extract_and_eval example.json --save-extracted results/extracted.json --save-metrics results/metrics.json
```

The second sub-mode, `eval`, computes evaluation metrics directly from pre-processed results (where answers have already been extracted).
```
python main.py eval input.json [options]
```

Options:
- `--save-metrics PATH`: Save evaluation metrics to the specified path
- `--output-dir DIR`: Base directory for output files
Example:
```
# Evaluate without saving metrics
python main.py eval example.json

# Evaluate and save metrics
python main.py eval example.json --save-metrics results/metrics.json
```

Input format for `extract_and_eval` mode:
```json
[
  {
    "category": "factual",
    "subcategory": "science",
    "level": 1,
    "ground_truth": "A",
    "response": "Let me think about this.\n◁think▷\nThis is a basic science question. The answer is A.\n◁/think▷\n(A)"
  }
]
```

For `eval` mode (pre-processed):
```json
[
  {
    "category": "factual",
    "subcategory": "science",
    "level": 1,
    "ground_truth": "A",
    "extracted_answer": "A",
    "is_correct": true,
    "response": "(A)"
  }
]
```

Required fields:
- `category`: Question category
- `subcategory`: Question subcategory
- `level`: Difficulty level (1, 2, or 3)
- `ground_truth`: Correct answer (A, B, C, or D)
- `response`: Model's response text (for `extract_and_eval`)
- `extracted_answer`: Extracted answer (for `eval`)
- `is_correct`: Whether the answer is correct (for `eval`)
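
For illustration, here is a minimal sketch of the extraction step: pulling the final parenthesized option letter out of a raw response such as the one above. The function name and regex are assumptions for this example; the actual logic in `main.py` may differ.

```python
import re

def extract_answer(response):
    """Return the last parenthesized option letter in a response, e.g. '... (A)' -> 'A'.

    Illustrative sketch only, not the repository's own extraction code.
    """
    # Drop any chain-of-thought wrapped in ◁think▷ ... ◁/think▷ markers
    cleaned = re.sub(r"◁think▷.*?◁/think▷", "", response, flags=re.DOTALL)
    # Take the last (A)-(D) occurrence, since the prompt asks for the letter at the end
    matches = re.findall(r"\(([A-D])\)", cleaned)
    return matches[-1] if matches else None

extracted = extract_answer("◁think▷\nThe answer is A.\n◁/think▷\n(A)")
is_correct = extracted == "A"  # compare against ground_truth to fill is_correct
```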
The tool computes:
- Overall Accuracy
- Category-wise Accuracies
- Subcategory-wise Accuracies
- Level-wise Accuracies
- Level Variance
- CAI (Correctness Attenuation Index)
The CAI is calculated from the level-wise accuracies (`level1`, `level2`, `level3`):

```
CAI = (level1 - level2) / level1 + (level3 - level2) / level2
```
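
As a worked example, here is a minimal sketch that computes the index exactly as stated above (the function name and sample accuracies are illustrative, not from the repository):

```python
def cai(level1, level2, level3):
    """Correctness Attenuation Index over the three level-wise accuracies."""
    return (level1 - level2) / level1 + (level3 - level2) / level2

# E.g., accuracies of 0.9, 0.8, 0.6 at levels 1-3:
# (0.9 - 0.8) / 0.9 + (0.6 - 0.8) / 0.8 = 0.111 - 0.250 ≈ -0.139
print(cai(0.9, 0.8, 0.6))
```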
Open-ended evaluation uses a judge model to compare pairs of model responses for open-ended questions, determining which response is better and providing detailed critiques.
```
python evaluation/judge.py --pairwise_test_input_path input.json --output_dir results --output_suffix _judged --batch_size 128
```

Options:
- `--pairwise_test_input_path`: Path to the input JSON file containing pairwise comparisons
- `--output_dir`: Directory to save the output results
- `--output_suffix`: Suffix to add to the output filename
- `--batch_size`: Batch size for processing (default: 128)

Input format:
```json
[
  {
    "question": "How many people are in the picture?",
    "image_path": "path/to/image",
    "response_A": {
      "text": "2",
      "source": "model-a"
    },
    "response_B": {
      "text": "There are two people in the picture: a young girl and a man who appears to be much smaller due to the perspective and distance.",
      "source": "model-b"
    }
  }
]
```

Required fields:
- `question`: The question being asked
- `image_path`: Path to the image file
- `response_A`: First model's response
- `response_B`: Second model's response
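
A minimal sketch of assembling such an input file (the file name, paths, and model identifiers are placeholders):

```python
import json

pairs = [
    {
        "question": "How many people are in the picture?",
        "image_path": "path/to/image",  # placeholder path
        "response_A": {"text": "2", "source": "model-a"},
        "response_B": {"text": "There are two people in the picture.", "source": "model-b"},
    }
]

with open("input.json", "w") as f:
    json.dump(pairs, f, indent=2, ensure_ascii=False)
```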
The output includes the original data plus judge model evaluations:
```json
[
  {
    "question": "How many people are in the picture?",
    "image_path": "path/to/image",
    "response_A": {
      "text": "2",
      "source": "model-a"
    },
    "response_B": {
      "text": "There are two people in the picture: a young girl and a man who appears to be much smaller due to the perspective and distance.",
      "source": "model-b"
    },
    "judge": "<critique>Response B is better because it provides more detailed information about the people in the image, including their relative sizes and positions.</critique>\n<label>B</label>\n<confidence>0.95</confidence>",
    "label": "B",
    "confidence": 0.95
  }
]
```

Additional fields:
- `judge`: The full judge model output, including critique, label, and confidence
- `label`: The selected better response (A or B)
- `confidence`: A score between 0 and 1 indicating the judge's confidence in the decision
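
The `label` and `confidence` fields are parsed out of the tagged `judge` string. A sketch of how that parsing could work (the regexes and function name are assumptions, not necessarily the repository's code):

```python
import re

def parse_judge(judge_output):
    """Extract the <label> and <confidence> tags from a judge string like the example above."""
    label_m = re.search(r"<label>([AB])</label>", judge_output)
    conf_m = re.search(r"<confidence>([\d.]+)</confidence>", judge_output)
    label = label_m.group(1) if label_m else None
    confidence = float(conf_m.group(1)) if conf_m else None
    return label, confidence

judge = "<critique>...</critique>\n<label>B</label>\n<confidence>0.95</confidence>"
assert parse_judge(judge) == ("B", 0.95)
```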
The repository includes recommended prompts for both multiple-choice and open-ended questions in `prompt.py`. These prompts are designed to help models provide consistent and well-formatted responses.
Two prompt templates are provided for multiple-choice questions:
- Standard Multiple-Choice Prompt:
```python
MULTIPLE_CHOICE_PROMPT = """
<image>
{question}
{options}
Answer with the option's letter enclosed in () at the end of your response.
"""
```

- Multiple-Choice with Confidence Score:
```python
MULTIPLE_CHOICE_ECE_PROMPT = """
<image>
{question}
{options}
Answer with the option's letter enclosed in () at the end of your response. Give your confidence score of your answer (a fractional number in the range of 0-1) enclosed in [] at the end of your response.
Example Output: (A)[0.9]
"""
```
The options are automatically formatted as:

```
(A) First option
(B) Second option
(C) Third option
(D) Fourth option
```
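
Putting it together, filling the template might look like this (`build_prompt` is a hypothetical helper, not code from `prompt.py`):

```python
from prompt import MULTIPLE_CHOICE_PROMPT  # the template shown above

def build_prompt(question, options):
    """Hypothetical helper: label options (A)-(D) and fill the template."""
    options_str = "\n".join(f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options))
    return MULTIPLE_CHOICE_PROMPT.format(question=question, options=options_str)

print(build_prompt("What color is the sky?", ["Blue", "Green", "Red", "Yellow"]))
```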
For open-ended questions, a simple prompt template is provided:
```python
OPEN_ENDED_PROMPT = """
<image>
{question}
"""
```