We provide a principled, practical, and unified assessment framework for uncertainty/confidence measures of language models. Our assessment is compatible with diverse uncertainty ranges and does not require binarization of correctness scores.
Language Models (LMs) have shown promising performance in natural language generation. However, as LMs often generate incorrect or hallucinated responses, it is crucial to correctly quantify their uncertainty in responding to given inputs. In addition to verbalized confidence elicited via prompting, many uncertainty measures (e.g., semantic entropy and affinity-graph-based measures) have been proposed. However, these measures can differ greatly, and it is unclear how to compare them, partly because they take values over different ranges (e.g., [0, ∞) vs. [0, 1]).
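To illustrate the rank-calibration idea (lower uncertainty should, on average, indicate higher correctness), the sketch below bins responses by uncertainty, estimates mean correctness per bin, and measures how far the correctness ranks deviate from the reversed uncertainty ranks. The function name, binning scheme, and rank comparison are simplifications for illustration, not the repo's exact RCE estimator:

```python
import numpy as np

def empirical_rce(uncertainty, correctness, num_bins=10):
    """Illustrative empirical Rank-Calibration Error (RCE) sketch.

    Bins samples by uncertainty, estimates mean correctness per bin,
    and compares the rank of each bin's correctness with the ideal
    (reversed) rank of its uncertainty. A simplified sketch, not the
    repo's exact implementation.
    """
    u = np.asarray(uncertainty, dtype=float)
    a = np.asarray(correctness, dtype=float)
    order = np.argsort(u)                   # sort by ascending uncertainty
    bins = np.array_split(order, num_bins)  # roughly equal-mass bins
    mean_correct = np.array([a[b].mean() for b in bins])

    # Normalized rank of each bin's mean correctness (0 = worst, 1 = best).
    corr_rank = np.argsort(np.argsort(mean_correct)) / (num_bins - 1)
    # Ideal: the lowest-uncertainty bin attains the highest correctness rank.
    ideal_rank = 1.0 - np.arange(num_bins) / (num_bins - 1)
    return float(np.mean(np.abs(corr_rank - ideal_rank)))
```

When correctness decreases perfectly with uncertainty, this sketch returns 0; when the relationship is reversed, the deviation is maximal.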
*(Figure: indication diagrams via rank-calibration, comparing two uncertainty measures.)*
- Create a virtual environment, activate it, and install the dependencies:

```shell
python -m venv rce
source rce/bin/activate
pip install -r requirements.txt
```
- Before using OpenAI APIs, make sure your API key `OPENAI_API_KEY` is set in `./run/.env`.
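For reference, a minimal `./run/.env` might look like the following; the variable name comes from the note above, and the key value is a placeholder to replace with your own:

```shell
# ./run/.env -- OpenAI credentials loaded before API calls
OPENAI_API_KEY=your-api-key-here
```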
- `figures`: images used for the GitHub repo
- `indicators`: uncertainty measure implementations
- `metrics`: correctness and calibration metrics, e.g., rank-calibration, ECE, etc.
- `models`: OpenAI and open-source model implementations
- `run`: functions exposed to the user to generate responses, calibrate uncertainty/confidence, and compute evaluation stats
- `submission`: scripts and files to reproduce the results reported in the submission
- `tasks`: loading implementations for the different datasets
- `utils`: implementations of miscellaneous functions
- To plot indication diagrams and uncertainty/correctness distributions for all experiment configurations:

```shell
cd submission
./bash/make_plots.sh
```
- To plot RCE boxplots and critical difference diagrams for all experiment configurations:

```shell
cd submission
python make_tables.py
```
In both cases, plots will be saved under the `calibration_results` and `evaluation_stats` folders, with folder names indicating the corresponding experiment configuration.
Please feel free to email us: Xinmeng Huang and Shuo Li. If you find this work useful in your own research, please consider citing our work:
@misc{huang2024uncertainty,
title={Uncertainty in Language Models: Assessment through Rank-Calibration},
author={Xinmeng Huang and Shuo Li and Mengxin Yu and Matteo Sesia and Hamed Hassani and Insup Lee and Osbert Bastani and Edgar Dobriban},
year={2024},
eprint={2404.03163},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
This codebase is released under the MIT License.