This repository contains the code for evaluating model responses on the TruthfulVQA dataset. It supports two main evaluation modes: multiple-choice question evaluation and open-ended question evaluation with pairwise comparison.
- Multiple evaluation modes:
  - Multiple-choice question evaluation with comprehensive metrics
  - Open-ended question evaluation using a judge model for pairwise comparison
- Rich terminal output with formatted tables and panels
- Flexible file saving options
- Batch processing support for the judge model
- Clone the repository
- Install required packages:
```
pip install rich numpy transformers vllm pillow
```

Multiple-choice evaluation scores model responses on multiple-choice questions, computing various accuracy metrics. There are two sub-modes:
The first sub-mode, `extract_and_eval`, extracts answers from raw model responses and then computes evaluation metrics.
```
python main.py extract_and_eval input.json [options]
```

Options:
- `--save-extracted PATH`: Save extracted results to the specified path
- `--save-metrics PATH`: Save evaluation metrics to the specified path
- `--output-dir DIR`: Base directory for output files
Example:
```
# Extract and evaluate without saving intermediate results
python main.py extract_and_eval example.json

# Extract and evaluate, saving both extracted results and metrics
python main.py extract_and_eval example.json --save-extracted results/extracted.json --save-metrics results/metrics.json
```

The second sub-mode, `eval`, computes evaluation metrics directly from pre-processed results (where answers have already been extracted).
```
python main.py eval input.json [options]
```

Options:
- `--save-metrics PATH`: Save evaluation metrics to the specified path
- `--output-dir DIR`: Base directory for output files
Example:
```
# Evaluate without saving metrics
python main.py eval example.json

# Evaluate and save metrics
python main.py eval example.json --save-metrics results/metrics.json
```

Input format for `extract_and_eval` mode:
```json
[
  {
    "category": "factual",
    "subcategory": "science",
    "level": 1,
    "ground_truth": "A",
    "response": "Let me think about this.\n◁think▷\nThis is a basic science question. The answer is A.\n◁/think▷\n(A)"
  }
]
```

For `eval` mode (pre-processed):
```json
[
  {
    "category": "factual",
    "subcategory": "science",
    "level": 1,
    "ground_truth": "A",
    "extracted_answer": "A",
    "is_correct": true,
    "response": "(A)"
  }
]
```

Required fields:
- `category`: Question category
- `subcategory`: Question subcategory
- `level`: Difficulty level (1, 2, or 3)
- `ground_truth`: Correct answer (A, B, C, or D)
- `response`: Model's response text (for `extract_and_eval`)
- `extracted_answer`: Extracted answer (for `eval`)
- `is_correct`: Whether the answer is correct (for `eval`)
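
For illustration, here is a minimal sketch of the extraction step: pulling the final parenthesized option letter out of a raw response such as the one above. The function name and regex are assumptions for this example; the actual logic in `main.py` may differ.

```python
import re

def extract_answer(response):
    """Return the last parenthesized option letter in a response, e.g. '... (A)' -> 'A'.

    Illustrative sketch only, not the repository's own extraction code.
    """
    # Drop any chain-of-thought wrapped in ◁think▷ ... ◁/think▷ markers
    cleaned = re.sub(r"◁think▷.*?◁/think▷", "", response, flags=re.DOTALL)
    # Take the last (A)-(D) occurrence, since the prompt asks for the letter at the end
    matches = re.findall(r"\(([A-D])\)", cleaned)
    return matches[-1] if matches else None

extracted = extract_answer("◁think▷\nThe answer is A.\n◁/think▷\n(A)")
is_correct = extracted == "A"  # compare against ground_truth to fill is_correct
```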
The tool computes:
- Overall Accuracy
- Category-wise Accuracies
- Subcategory-wise Accuracies
- Level-wise Accuracies
- Level Variance
- CAI (Correctness Attenuation Index)
The CAI is calculated from the level-wise accuracies (`level1`, `level2`, `level3`):

```
CAI = (level1 - level2) / level1 + (level3 - level2) / level2
```
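
As a worked example, here is a minimal sketch that computes the index exactly as stated above (the function name and sample accuracies are illustrative, not from the repository):

```python
def cai(level1, level2, level3):
    """Correctness Attenuation Index over the three level-wise accuracies."""
    return (level1 - level2) / level1 + (level3 - level2) / level2

# E.g., accuracies of 0.9, 0.8, 0.6 at levels 1-3:
# (0.9 - 0.8) / 0.9 + (0.6 - 0.8) / 0.8 = 0.111 - 0.250 ≈ -0.139
print(cai(0.9, 0.8, 0.6))
```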
Open-ended evaluation uses a judge model to compare pairs of model responses for open-ended questions, determining which response is better and providing detailed critiques.
```
python evaluation/judge.py --pairwise_test_input_path input.json --output_dir results --output_suffix _judged --batch_size 128
```

Options:
- `--pairwise_test_input_path`: Path to the input JSON file containing pairwise comparisons
- `--output_dir`: Directory to save the output results
- `--output_suffix`: Suffix to add to the output filename
- `--batch_size`: Batch size for processing (default: 128)

Input format:
```json
[
  {
    "question": "How many people are in the picture?",
    "image_path": "path/to/image",
    "response_A": {
      "text": "2",
      "source": "model-a"
    },
    "response_B": {
      "text": "There are two people in the picture: a young girl and a man who appears to be much smaller due to the perspective and distance.",
      "source": "model-b"
    }
  }
]
```

Required fields:
- `question`: The question being asked
- `image_path`: Path to the image file
- `response_A`: First model's response
- `response_B`: Second model's response
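
A minimal sketch of assembling such an input file (the file name, paths, and model identifiers are placeholders):

```python
import json

pairs = [
    {
        "question": "How many people are in the picture?",
        "image_path": "path/to/image",  # placeholder path
        "response_A": {"text": "2", "source": "model-a"},
        "response_B": {"text": "There are two people in the picture.", "source": "model-b"},
    }
]

with open("input.json", "w") as f:
    json.dump(pairs, f, indent=2, ensure_ascii=False)
```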
The output includes the original data plus judge model evaluations:
```json
[
  {
    "question": "How many people are in the picture?",
    "image_path": "path/to/image",
    "response_A": {
      "text": "2",
      "source": "model-a"
    },
    "response_B": {
      "text": "There are two people in the picture: a young girl and a man who appears to be much smaller due to the perspective and distance.",
      "source": "model-b"
    },
    "judge": "<critique>Response B is better because it provides more detailed information about the people in the image, including their relative sizes and positions.</critique>\n<label>B</label>\n<confidence>0.95</confidence>",
    "label": "B",
    "confidence": 0.95
  }
]
```

Additional fields:
- `judge`: The full judge model output, including critique, label, and confidence
- `label`: The selected better response (A or B)
- `confidence`: A score between 0 and 1 indicating the judge's confidence in the decision
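
The `label` and `confidence` fields are parsed out of the tagged `judge` string. A sketch of how that parsing could work (the regexes and function name are assumptions, not necessarily the repository's code):

```python
import re

def parse_judge(judge_output):
    """Extract the <label> and <confidence> tags from a judge string like the example above."""
    label_m = re.search(r"<label>([AB])</label>", judge_output)
    conf_m = re.search(r"<confidence>([\d.]+)</confidence>", judge_output)
    label = label_m.group(1) if label_m else None
    confidence = float(conf_m.group(1)) if conf_m else None
    return label, confidence

judge = "<critique>...</critique>\n<label>B</label>\n<confidence>0.95</confidence>"
assert parse_judge(judge) == ("B", 0.95)
```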
The repository includes recommended prompts for both multiple-choice and open-ended questions in `prompt.py`. These prompts are designed to help models provide consistent and well-formatted responses.
Two prompt templates are provided for multiple-choice questions:
- Standard Multiple-Choice Prompt:
```python
MULTIPLE_CHOICE_PROMPT = """
<image>
{question}
{options}
Answer with the option's letter enclosed in () at the end of your response.
"""
```

- Multiple-Choice with Confidence Score:
```python
MULTIPLE_CHOICE_ECE_PROMPT = """
<image>
{question}
{options}
Answer with the option's letter enclosed in () at the end of your response. Give your confidence score of your answer (a fractional number in the range of 0-1) enclosed in [] at the end of your response.
Example Output: (A)[0.9]
"""
```
The options are automatically formatted as:

```
(A) First option
(B) Second option
(C) Third option
(D) Fourth option
```
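
Putting it together, filling the template might look like this (`build_prompt` is a hypothetical helper, not code from `prompt.py`):

```python
from prompt import MULTIPLE_CHOICE_PROMPT  # the template shown above

def build_prompt(question, options):
    """Hypothetical helper: label options (A)-(D) and fill the template."""
    options_str = "\n".join(f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options))
    return MULTIPLE_CHOICE_PROMPT.format(question=question, options=options_str)

print(build_prompt("What color is the sky?", ["Blue", "Green", "Red", "Yellow"]))
```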
For open-ended questions, a simple prompt template is provided:
```python
OPEN_ENDED_PROMPT = """
<image>
{question}
"""
```