A fast and lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models.
```bash
pip install qa-metrics
```
is all you need!
Version 0.2.42 Released! (06/20/2025)
- RewardBert (ModernBERT base) supports batch score prediction to speed up evaluation for RL training.
Version 0.2.35 Released! (06/18/2025)
- RewardBert (ModernBERT base) is trained to evaluate both short-form and long-form generations.
- RewardBert outputs a Likert score from 1 to 5, or a normalized score between 0 and 1.
- Suppressed verbose nltk download logs.
Version 0.2.30 Released!
- Enhanced PEDANTS with multi-pipeline support and improved edge-case handling
- Introduced a trained TinyBERT model for QA evaluation (18 MB model size)
- Added direct Hugging Face model download support for TransformerMatcher
Requirements:
- Python >= 3.6
- openai >= 1.0

Install with pip:
```bash
pip install qa-metrics
```
Our package offers six QA evaluation methods with varying strengths:
Method | Best For | Cost | Correlation with Human Judgment |
---|---|---|---|
RewardBert | General Text Generations | Free | Very High |
Normalized Exact Match | Short-form QA (NQ-OPEN, HotpotQA, etc.) | Free | Good |
PEDANTS | Both short & medium-form QA | Free | Very High |
Neural Evaluation | Both short & long-form QA | Free | High |
Open Source LLM Evaluation | All QA types | Free | High |
Black-box LLM Evaluation | All QA types | Paid | Highest |
Parameters
- `reference_answer` (str): The gold (correct) answer to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

Returns
- `tuple`: A tuple of the normalized score and the raw score.
```python
from qa_metrics.RewardBert import RewardBert

rb = RewardBert(device='cuda')
reference_answer = "The Frog Prince"
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
rb.compute_score(reference_answer, candidate_answer)
# (0.29113227128982544, 2.1645290851593018)
```
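The normalized score in the example above is consistent with a simple linear rescaling of the raw 1-5 Likert score onto the 0-1 range. This is an observation from the example output, not a documented guarantee of RewardBert's behavior:

```python
def normalize_likert(raw: float) -> float:
    """Rescale a 1-5 Likert score to 0-1.

    Assumption: this matches RewardBert's normalized output only empirically,
    based on the example pair (0.29113..., 2.16452...).
    """
    return (raw - 1.0) / 4.0

print(normalize_likert(2.1645290851593018))  # ~0.2911322712898254
```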
Parameters
- `reference_answers` (list of str): A list of gold (correct) answers to the question
- `candidate_answer` (list of str): A list of answers provided by a candidate that need to be evaluated
- `batch_size` (int): Batch size for prediction (default: 1)

Returns
- `tuple`: A tuple of a list of normalized scores and a list of raw scores.
```python
from qa_metrics.RewardBert import RewardBert

rb = RewardBert(device='cuda')
reference_answer = ["The Frog Prince"]
candidate_answer = ["The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""]
rb.compute_batch_scores(reference_answer, candidate_answer, batch_size=1)
# ([0.29113227128982544], [2.1645290851593018])
```
Parameters
- `reference_answer` (list of str): A list of gold (correct) answers to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

Returns
- `boolean`: True if there is any exact normalized match between the gold and candidate answers
```python
from qa_metrics.em import em_match

reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
match_result = em_match(reference_answer, candidate_answer)
```
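To illustrate what "normalized" exact match typically involves, here is a sketch of the common SQuAD-style normalization (lowercasing, stripping punctuation and articles, collapsing whitespace). This is illustrative and not necessarily the exact rules qa-metrics applies:

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style answer normalization (illustrative, not qa-metrics' exact rules)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
    return " ".join(text.split())                # collapse whitespace

def sketch_em_match(references, candidate) -> bool:
    # True if any normalized gold answer equals the normalized candidate.
    cand = normalize(candidate)
    return any(normalize(ref) == cand for ref in references)

print(sketch_em_match(["The Frog Prince"], "the frog prince!"))  # True
```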
Parameters
- `reference_answer` (str): A gold (correct) answer to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

Returns
- `dictionary`: Contains the F1 score, precision, and recall between the gold and candidate answers
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `threshold` (float): F1 score threshold for considering a match (default: 0.5)

Returns
- `boolean`: True if the F1 score exceeds the threshold for any gold answer
```python
from qa_metrics.f1 import f1_match, f1_score_with_precision_recall

f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
```
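Token-level F1 is typically computed from the token overlap between the two answers. A minimal sketch follows; the exact tokenization and normalization qa-metrics uses may differ:

```python
from collections import Counter

def f1_sketch(reference: str, candidate: str) -> dict:
    """Token-overlap F1 (illustrative; qa-metrics may tokenize/normalize differently)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Multiset intersection counts shared tokens (with multiplicity).
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return {"f1": 0.0, "precision": 0.0, "recall": 0.0}
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return {"f1": 2 * precision * recall / (precision + recall),
            "precision": precision, "recall": recall}

print(f1_sketch("the frog prince", "frog prince story"))
```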
Parameters
- `reference_answer` (str): A gold answer
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `float`: The similarity score between the two strings (0 to 1)
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `dictionary`: Contains the gold answer and candidate answer pair with the highest matching score
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `boolean`: True if the candidate answer matches any gold answer
Parameters
- `reference_answer` (list of str): List of gold answers
- `question` (str): The question being evaluated

Returns
- `list`: The type of the question (what, who, when, how, why, which, where)
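Question typing of this kind is often done by matching the leading wh-word. The sketch below is illustrative only and is not PEDANTS' actual classifier (which, per the signature above, also receives the gold answers):

```python
# Illustrative wh-word classifier; PEDANTS' real logic may differ.
QUESTION_TYPES = ("what", "who", "when", "how", "why", "which", "where")

def sketch_question_type(question: str) -> list:
    # Look for a wh-word among the first two tokens; fall back to "what".
    tokens = question.lower().split()
    return [t for t in QUESTION_TYPES if t in tokens[:2]] or ["what"]

print(sketch_question_type("Who wrote The Frog Prince?"))  # ['who']
```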
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `list`: A list of revised rules applicable for judging answer correctness
```python
from qa_metrics.pedant import PEDANT

pedant = PEDANT()
scores = pedant.get_scores(reference_answer, candidate_answer, question)
match_result = pedant.evaluate(reference_answer, candidate_answer, question)
```
Parameters
- `reference_answer` (str): A gold answer
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `float`: The similarity score between the two strings (0 to 1)
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `dictionary`: Contains the gold answer and candidate answer pair with the highest matching score
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `boolean`: True if the transformer model considers the candidate answer equivalent to any gold answer
```python
from qa_metrics.transformerMatcher import TransformerMatcher

# Supported models: zli12321/roberta-large-qa-evaluator, zli12321/answer_equivalence_bert,
# zli12321/answer_equivalence_distilbert, zli12321/answer_equivalence_roberta,
# zli12321/answer_equivalence_distilroberta
tm = TransformerMatcher("zli12321/answer_equivalence_tiny_bert")
match_result = tm.transformer_match(reference_answer, candidate_answer, question)
```
Parameters
- `prompt` (str): The input prompt text
- `model_engine` (str): OpenAI model to use (e.g., 'gpt-3.5-turbo')
- `temperature` (float): Controls randomness (0-1)
- `max_tokens` (int): Maximum tokens in the response
```python
from qa_metrics.prompt_llm import CloseLLM

model = CloseLLM()
model.set_openai_api_key(YOUR_OPENAI_KEY)
result = model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo')
```
Parameters
- `prompt` (str): The input prompt text
- `model_engine` (str): Claude model to use
- `anthropic_version` (str): API version
- `max_tokens_to_sample` (int): Maximum tokens in the response
- `temperature` (float): Controls randomness (0-1)
```python
model = CloseLLM()
model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)
result = model.prompt_claude(prompt=prompt, model_engine='claude-v1')
```
Parameters
- `message` (str): The input message text
- `model_engine` (str): Model to use
- `temperature` (float): Controls randomness (0-1)
- `max_tokens` (int): Maximum tokens in the response
```python
from qa_metrics.prompt_open_llm import OpenLLM

model = OpenLLM()
model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
result = model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1')
```
Our fine-tuned models are available on Hugging Face.
If you find this package useful, please cite our paper:

```bibtex
@inproceedings{li-etal-2024-pedants,
    title = "{PEDANTS}: Cheap but Effective and Interpretable Answer Equivalence",
    author = "Li, Zongxia and
      Mondal, Ishani and
      Nghiem, Huy and
      Liang, Yijun and
      Boyd-Graber, Jordan Lee",
    editor = "Al-Onaizan, Yaser and
      Bansal, Mohit and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.548/",
    doi = "10.18653/v1/2024.findings-emnlp.548",
    pages = "9373--9398",
    abstract = "Question answering (QA) can only make progress if we know if an answer is correct, but current answer correctness (AC) metrics struggle with verbose, free-form answers from large language models (LLMs). There are two challenges with current short-form QA evaluations: a lack of diverse styles of evaluation data and an over-reliance on expensive and slow LLMs. LLM-based scorers correlate better with humans, but this expensive task has only been tested on limited QA datasets. We rectify these issues by providing rubrics and datasets for evaluating machine QA adopted from the Trivia community. We also propose an efficient, and interpretable QA evaluation that is more stable than an exact match and neural methods (BERTScore)."
}
```
This project is licensed under the MIT License.
For questions or comments, please contact: zli12321@umd.edu