LLM-Factuality-Survey

The repository for the survey paper "Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity"

Cunxiang Wang^1,7*, Xiaoze Liu²*, Yuanhao Yue³*, Qipeng Guo⁴, Xiangkun Hu⁴, Xiangru Tang⁵, Tianhang Zhang⁶, Cheng Jiayang⁷, Yunzhi Yao⁸, Wenyang Gao^1,8, Xuming Hu⁹, Zehan Qi⁹, Yidong Wang¹, Linyi Yang¹, Jindong Wang¹⁰, Xing Xie¹⁰, Zheng Zhang^4,11 and Yue Zhang¹.

1. School of Engineering, Westlake University; 2. Purdue University; 3. Fudan University; 4. Amazon AWS AI Lab; 5. Yale University; 6. Shanghai Jiao Tong University; 7. HKUST; 8. Zhejiang University; 9. Tsinghua University; 10. Microsoft Research; 11. NYU Shanghai;
(*: Equal Contribution; Correspondence to: Yue Zhang)

NOTE: Because real-time updates to the arXiv paper may not be feasible, please consult this repository for the most recent developments and modifications. We greatly appreciate and welcome pull requests or issues to enhance the quality of this survey. All contributions will be listed in the acknowledgements section.

Paper List

Analysis of Factuality

Knowledge Storage

Language Models as Knowledge Bases?. Petroni et al. 2019. [Paper]
Locating and Editing Factual Associations in GPT. Meng et al. 2022. [Paper]
Transformer Feed-Forward Layers Are Key-Value Memories. Geva et al. 2021. [Paper]
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Geva et al. 2022. [Paper]
Dissecting Recall of Factual Associations in Auto-Regressive Language Models. Geva et al. 2023. [Paper]
Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons. Chen et al. 2023. [Paper]
A rigorous study of integrated gradients method and extensions to internal neuron attributions. Lundstrom et al. 2022. [Paper]

Knowledge Awareness

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing. Gou et al. 2023. [Paper]
Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation. Ren et al. 2023. [Paper]
Do Large Language Models Know What They Don't Know?. Yin et al. 2023. [Paper]
A Survey on In-context Learning. Dong et al. 2023. [Paper]
Language Models (Mostly) Know What They Know. Kadavath et al. 2022. [Paper]
The Internal State of an LLM Knows When It's Lying. Azaria et al. 2023. [Paper]

Parametric Knowledge vs Retrieved Knowledge

Generate rather than retrieve: Large language models are strong context generators. Yu et al. 2023. [Paper]
Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators. Chen et al. 2023. [Paper]
Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. Izacard et al. 2021. [Paper]
Large language models struggle to learn long-tail knowledge. Kandpal et al. 2023. [Paper]
Head-to-Tail: How Knowledgeable are Large Language Models (LLM)? AKA Will LLMs Replace Knowledge Graphs?. Sun et al. 2023. [Paper]

Contextual Influence

Large Language Models with Controllable Working Memory. Li et al. 2023. [Paper]
Context-faithful Prompting for Large Language Models. Zhou et al. 2023. [Paper]
Benchmarking Large Language Models in Retrieval-Augmented Generation. Chen et al. 2023. [Paper]
Automatic Evaluation of Attribution by Large Language Models. Yue et al. 2023. [Paper]

Knowledge Conflicts

Entity-Based Knowledge Conflicts in Question Answering. Longpre et al. 2021. [Paper]
Rich Knowledge Sources Bring Complex Knowledge Conflicts: Recalibrating Models to Reflect Conflicting Evidence. Chen et al. 2022. [Paper]
Adaptive Chameleon or Stubborn Sloth: Unraveling the Behavior of Large Language Models in Knowledge Clashes. Xie et al. 2023. [Paper]
Large Language Models with Controllable Working Memory. Li et al. 2023. [Paper]

Causes of Factual Errors

Model-level Causes

Forgetting

An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks. Goodfellow et al. 2015. [Paper]
Preserving In-Context Learning ability in Large Language Model Fine-tuning. Wang et al. 2022. [Paper]
Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting. Chen et al. 2020. [Paper]
Investigating the Catastrophic Forgetting in Multimodal Large Language Models. Zhai et al. 2023. [Paper]
An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning. Luo et al. 2023. [Paper]

Reasoning Failure

We're Afraid Language Models Aren't Modeling Ambiguity. Liu et al. 2023. [Paper]
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A". Berglund et al. 2023. [Paper]
Understanding Catastrophic Forgetting in Language Models via Implicit Inference. Kotha et al. 2023. [Paper]
Can ChatGPT Replace Traditional KBQA Models? An In-depth Analysis of the Question Answering Performance of the GPT LLM Family. Tan et al. 2023. [Paper]

Retrieval-level Causes

Misinformation Not Recognized by LLMs

Entity-Based Knowledge Conflicts in Question Answering. Longpre et al. 2021. [Paper]
On the Risk of Misinformation Pollution with Large Language Models. Pan et al. 2023. [Paper]
A Survey on Truth Discovery. Han et al. 2015. [Paper]

Distracting Information

SAIL: Search-Augmented Instruction Learning. Luo et al. 2023. [Paper]
Lost in the middle: How language models use long contexts. Liu et al. 2023. [Paper]

Misinterpretation of Related Information

ReAct: Synergizing Reasoning and Acting in Language Models. Yao et al. 2023. [Paper]

Inference-level Causes

Snowballing

How language model hallucinations can snowball. Zhang et al. 2023. [Paper]
A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation. Varshney et al. 2023. [Paper]

Erroneous Decoding

DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. Chuang et al. 2023. [Paper]
How Decoding Strategies Affect the Verifiability of Generated Text. Massarelli et al. 2020. [Paper]

Exposure Bias

WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias in Large Language Models. Felkner et al. 2023. [Paper]
Bias and Fairness in Large Language Models: A Survey. Gallegos et al. 2023. [Paper]
MISGENDERED: Limits of Large Language Models in Understanding Pronouns. Hossain et al. 2023. [Paper]

Evaluation of Factuality

Benchmarks

Measuring Massive Multitask Language Understanding. Hendrycks et al. 2021. [Paper]
TruthfulQA: Measuring How Models Mimic Human Falsehoods. Lin et al. 2022. [Paper]
HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. Li et al. 2023. [Paper]
C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. Huang et al. 2023. [Paper]
Do Large Language Models Know What They Don't Know?. Yin et al. 2023. [Paper]
Do Large Language Models Know about Facts?. Hu et al. 2023. [Paper]
RealTime QA: What's the Answer Right Now?. Kasai et al. 2022. [Paper]
FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation. Vu et al. 2023. [Paper]
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Srivastava et al. 2023. [Paper]
Natural Questions: a Benchmark for Question Answering Research. Kwiatkowski et al. 2019. [Paper]
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Joshi et al. 2017. [Paper]
Semantic Parsing on Freebase from Question-Answer Pairs. Berant et al. 2013. [Paper]
Open Question Answering over Tables and Text. Chen et al. 2021. [Paper]
AmbigQA: Answering Ambiguous Open-domain Questions. Min et al. 2020. [Paper]
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Yang et al. 2018. [Paper]
Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. Ho et al. 2020. [Paper]
IIRC: A Dataset of Incomplete Information Reading Comprehension Questions. Ferguson et al. 2020. [Paper]
MuSiQue: Multihop Questions via Single-hop Question Composition. Trivedi et al. 2022. [Paper]
ELI5: Long Form Question Answering. Fan et al. 2019. [Paper]
FEVER: a large-scale dataset for Fact Extraction and VERification. Thorne et al. 2018. [Paper]
Fool Me Twice: Entailment from Wikipedia Gamification. Eisenschlos et al. 2021. [Paper]
HoVer: A Dataset for Many-Hop Fact Extraction And Claim Verification. Jiang et al. 2020. [Paper]
The Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS) Shared Task. Aly et al. 2021. [Paper]
T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples. Elsahar et al. 2018. [Paper]
Zero-Shot Relation Extraction via Reading Comprehension. Levy et al. 2017. [Paper]
Language Models as Knowledge Bases?. Petroni et al. 2019. [Paper]
Neural Text Generation from Structured Data with Application to the Biography Domain. Lebret et al. 2016. [Paper]
WikiAsp: A Dataset for Multi-domain Aspect-based Summarization. Hayashi et al. 2021. [Paper]
KILT: a Benchmark for Knowledge Intensive Language Tasks. Petroni et al. 2021. [Paper]
Scaling Language Models: Methods, Analysis & Insights from Training Gopher. Rae et al. 2022. [Paper]
Curation Corpus Base. Curation et al. 2020. [Paper]
Pointer sentinel mixture models. Merity et al. 2016. [Paper]
The LAMBADA dataset: Word prediction requiring a broad discourse context. Paperno et al. 2016. [Paper]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al. 2020. [Paper]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling. Gao et al. 2020. [Paper]
Wizard of Wikipedia: Knowledge-Powered Conversational agents. Dinan et al. 2019. [Paper]
Grounded response generation task at dstc7. Galley et al. 2019. [Paper]
"What do others think?": Task-Oriented Conversational Modeling with Subjective Knowledge. Zhao et al. 2023. [Paper]
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. Gehman et al. 2020. [Paper]
Hey AI, Can You Solve Complex Tasks by Talking to Agents?. Khot et al. 2022. [Paper]
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Geva et al. 2021. [Paper]
TempQuestions: A Benchmark for Temporal Question Answering. Jia et al. 2018. [Paper]
INFOTABS: Inference on Tables as Semi-structured Data. Gupta et al. 2020. [Paper]

Studies

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. Manakul et al. 2023. [Paper]
Evaluating Open Question Answering Evaluation. Wang et al. 2023. [Paper]
Measuring and Modifying Factual Knowledge in Large Language Models. Pezeshkpour et al. 2023. [Paper]
A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation. Varshney et al. 2023. [Paper]
FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios. Chern et al. 2023. [Paper]
Language Models (Mostly) Know What They Know. Kadavath et al. 2022. [Paper]
Generate rather than retrieve: Large language models are strong context generators. Yu et al. 2023. [Paper]
Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators. Chen et al. 2023. [Paper]
Teaching language models to support answers with verified quotes. Menick et al. 2022. [Paper]

Evaluating Domain-specific Factuality

PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance. Xie et al. 2023. [Paper]
When flue meets flang: Benchmarks and large pre-trained language model for financial domain. Shah et al. 2022. [Paper]
EcomGPT: Instruction-tuning Large Language Model with Chain-of-Task Tasks for E-commerce. Li et al. 2023. [Paper]
CMB: A Comprehensive Medical Benchmark in Chinese. Wang et al. 2023. [Paper]
Genegpt: Augmenting large language models with domain tools for improved access to biomedical information. Jin et al. 2023. [Paper]
Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models. Guha et al. 2023. [Paper]
LawBench: Benchmarking Legal Knowledge of Large Language Models. Fei et al. 2023. [Paper]

Factuality Enhancement

On Standalone LLM Generation

Pretraining-based

Initial Pretraining

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. Yejin Bang et al. arXiv 2023. [Paper]
Deduplicating Training Data Makes Language Models Better. Lee, Katherine et al. ACL 2022. [Paper]
Unsupervised Improvement of Factual Knowledge in Language Models. Sadeq, Nafis et al. EACL 2023. [Paper]

Continual Pretraining

Factuality Enhanced Language Models for Open-Ended Text Generation. Lee, Nayeon et al. NeurIPS 2022. [Paper]

Supervised Finetuning

Continual SFT

SKILL: Structured Knowledge Infusion for Large Language Models. Moiseev, Fedor et al. NAACL 2022. [Paper]
Contrastive Learning Reduces Hallucination in Conversations. Sun, Weiwei et al. AAAI 2023. [Paper]
ChatGPT is not Enough: Enhancing Large Language Models with Knowledge Graphs for Fact-aware Language Modeling. Linyao Yang et al. arXiv 2023. [Paper]

Model Editing

Editing Large Language Models: Problems, Methods, and Opportunities. Yunzhi Yao et al. arXiv 2023. [Paper]
Knowledge Neurons in Pretrained Transformers. Dai, Damai et al. ACL 2022. [Paper]
Locating and Editing Factual Associations in GPT. Kevin Meng et al. NeurIPS 2022. [Paper]
Editing Factual Knowledge in Language Models. De Cao, Nicola et al. EMNLP 2021. [Paper]
Fast Model Editing at Scale. Eric Mitchell et al. ICLR 2022. [Paper]
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. Kenneth Li et al. arXiv 2023. [Paper]

Multi-Agent

Improving Factuality and Reasoning in Language Models through Multiagent Debate. Yilun Du et al. arXiv 2023. [Paper]
LM vs LM: Detecting Factual Errors via Cross Examination. Roi Cohen et al. arXiv 2023. [Paper]

Novel Prompt

Generate Rather than Retrieve: Large Language Models are Strong Context Generators. Yu, Wenhao et al. ICLR 2023. [Paper]
"According to ..." Prompting Language Models Improves Quoting from Pre-Training Data. Orion Weller et al. arXiv 2023. [Paper]
Decomposed Prompting: A Modular Approach for Solving Complex Tasks. Tushar Khot et al. arXiv 2023. [Paper]
Chain-of-Verification Reduces Hallucination in Large Language Models. Dhuliawala et al. arXiv 2023. [Paper]

Decoding

Factuality Enhanced Language Models for Open-Ended Text Generation. Lee, Nayeon et al. NeurIPS 2022. [Paper]
DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. Chuang, Yung-Sung et al. arXiv 2023. [Paper]

On Retrieval-Augmented Generation

Normal RAG Setting

Improving Language Models by Retrieving From Trillions of Tokens. Sebastian Borgeaud et al. arXiv 2021. [Paper]
Internet-Augmented Language Models through Few-Shot Prompting for Open-Domain Question Answering. Angeliki Lazaridou et al. arXiv 2022. [Paper]

Interactive Retrieval

CoT-based Retrieval

Rethinking with Retrieval: Faithful Large Language Model Inference. Hangfeng He et al. arXiv 2023. [Paper]
Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. Trivedi, Harsh et al. ACL 2023. [Paper]
Active Retrieval Augmented Generation. Zhengbao Jiang et al. arXiv 2023. [Paper]

Agent-based Retrieval

ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao et al. arXiv 2023. [Paper]
Reflexion: Language Agents with Verbal Reinforcement Learning. Noah Shinn et al. arXiv 2023. [Paper]
A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation. Neeraj Varshney et al. arXiv 2023. [Paper]

Retrieval Adaptation

Prompt-based

Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback. Baolin Peng et al. arXiv 2023. [Paper]
Knowledge-Augmented Language Model Verification. Jinheon Baek et al. EMNLP 2023. [Paper]
WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia. Semnani et al. EMNLP 2023 findings. [Paper] [GitHub] [Demo]

SFT-based

Atlas: Few-shot Learning with Retrieval Augmented Language Models. Gautier Izacard et al. arXiv 2022. [Paper]
REPLUG: Retrieval-Augmented Black-Box Language Models. Weijia Shi et al. arXiv 2023. [Paper]
SAIL: Search-Augmented Instruction Learning. Luo, Hongyin et al. arXiv 2023. [Paper]

RLHF-based

Teaching Language Models to Support Answers with Verified Quotes. Jacob Menick et al. arXiv 2022. [Paper]

Retrieval on External Memory

Decoupled Context Processing for Context Augmented Language Modeling. Zonglin Li et al. NeurIPS 2022. [Paper]
G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks. Zhongwei Wan et al. EMNLP 2022. [Paper]
Parameter-Efficient Transfer Learning for NLP. Neil Houlsby et al. ICML 2019. [Paper]
KALA: Knowledge-Augmented Language Model Adaptation. Kang, Minki et al. NAACL 2022. [Paper]
Entities as Experts: Sparse Memory Access with Entity Supervision. Thibault Févry et al. EMNLP 2020. [Paper]
Mention Memory: Incorporating Textual Knowledge into Transformers through Entity Mention Attention. Michiel de Jong et al. ICLR 2022. [Paper]
Plug-and-Play Knowledge Injection for Pre-trained Language Models. Zhang, Zhengyan et al. ACL 2023. [Paper]
Evidence-based Factual Error Correction. Thorne, James et al. ACL 2021. [Paper]
Rarr: Researching and revising what language models say, using language models. Gao, Luyu et al. ACL 2023. [Paper]
PURR: Efficiently Editing Language Model Hallucinations by Denoising Language Model Corruptions. Chen, Anthony et al. arXiv 2023. [Paper]

Retrieval on Structured Knowledge Source

Mitigating Language Model Hallucination with Interactive Question-Knowledge Alignment. Shuo Zhang et al. arXiv 2023. [Paper]
StructGPT: A general framework for Large Language Model to Reason on Structured Data. Jinhao Jiang et al. arXiv 2023. [Paper]
Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering. Jinheon Baek et al. arXiv 2023. [Paper]

Domain Factuality Enhanced LLMs

Healthcare Domain-enhanced LLMs

CohortGPT: An Enhanced GPT for Participant Recruitment in Clinical Study. Guan, Zihan et al. arXiv 2023. [paper]
ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Li, Yunxiang et al. Cureus 2023. [paper]
Deid-GPT: Zero-Shot Medical Text De-Identification By Gpt-4. Liu, Zhengliang et al. arXiv 2023. [paper]
Biomedlm: A Domain-Specific Large Language Model for Biomedical Text. Venigalla, A et al. [blog] [model]
MedChatZH: A Better Medical Adviser Learns from Better Instructions. Tan, Yang et al. arXiv 2023. [paper]
BioGPT: Generative Pre-Trained Transformer for Biomedical Text Generation and Mining. Luo, Renqian et al. Briefings in Bioinformatics 2022. [paper]
Genegpt: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information. Jin, Qiao et al. arXiv 2023. [paper]
Almanac: Retrieval-Augmented Language Models for Clinical Medicine. Hiesinger, William et al. arXiv 2023. [paper]
MolXPT: Wrapping Molecules with Text for Generative Pre-training. Liu, Zequn et al. arXiv 2023. [paper]
HuatuoGPT, Towards Taming Language Model to Be a Doctor. Zhang, Hongbo et al. arXiv 2023. [paper]
Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue. Yang, Songhua et al. arXiv 2023. [paper]
Augmenting Black-box LLMs with Medical Textbooks for Clinical Question Answering. Wang, Yubo et al. arXiv 2023. [paper]
DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation. Bao, Zhijie et al. arXiv 2023. [paper]

Legal Domain-enhanced LLMs

Brief Report on LawGPT 1.0: A Virtual Legal Assistant Based on GPT-3. Nguyen, Ha-Thanh et al. arXiv 2023. [paper]
Chatlaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases. Cui, Jiaxi et al. arXiv 2023. [paper]
Explaining Legal Concepts with Augmented Large Language Models (GPT-4). Savelka, Jaromir et al. arXiv 2023. [paper]
Lawyer LLaMA Technical Report. Huang, Quzhe et al. arXiv 2023. [paper]

Finance Domain-enhanced LLMs

EcomGPT: Instruction-tuning Large Language Model with Chain-of-Task Tasks for E-commerce. Li, Yangning et al. arXiv 2023. [paper]
BloombergGPT: A Large Language Model for Finance. Shijie Wu et al. arXiv 2023. [paper]

Other Domain-Enhanced LLMs

Geoscience and Environment domain-enhanced LLMs

Learning A Foundation Language Model for Geoscience Knowledge Understanding and Utilization. Deng, Cheng et al. arXiv 2023. [paper]
HouYi: An Open-Source Large Language Model Specially Designed for Renewable Energy and Carbon Neutrality Field. Bai, Mingliang et al. arXiv 2023. [paper]

Education Domain-enhanced LLMs

GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning. Fan, Yaxin et al. arXiv 2023. [paper]

Food Domain-enhanced LLMs

FoodGPT: A Large Language Model in Food Testing Domain with Incremental Pre-training and Knowledge Graph Prompt. Qi, Zhixiao et al. arXiv 2023. [paper]

Home Renovation Domain-enhanced LLMs

ChatHome: Development and Evaluation of a Domain-Specific Language Model for Home Renovation. Wen, Cheng et al. arXiv 2023. [paper]

Tables

Table: Comparison between the factuality issue and the hallucination issue.


Factual and Non-Hallucinated	Factually correct outputs.
Non-Factual and Hallucinated	Entirely fabricated outputs.
Hallucinated but Factual	1. Outputs that are unfaithful to the prompt but remain factually correct (Cao et al., 2022). 2. Outputs that deviate from the prompt's specifics but don't touch on factuality, e.g., a prompt asking for a story about a rabbit and wolf becoming friends, but the LLM produces a tale about a rabbit and a dog befriending each other. 3. Outputs that provide additional factual details not specified in the prompt, e.g., a prompt asking about the capital of France, and the LLM responds with "Paris, which is known for the Eiffel Tower."
Non-Factual but Non-Hallucinated	1. Outputs where the LLM states, "I don't know," or avoids a direct answer. 2. Outputs that are partially correct, e.g., for the question, "Who landed on the moon with Apollo 11?" If the LLM responds with just "Neil Armstrong," the answer is incomplete but not hallucinated. 3. Outputs that provide a generalized or vague response without specific details, e.g., for a question about the causes of World War II, the LLM might respond with "It was due to various political and economic factors."
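
The four rows above can be read as the combinations of two independent judgments: whether an output is factual, and whether it is hallucinated. The snippet below is only an illustrative sketch of that reading; the function name `categorize` and its boolean arguments are our own assumptions, and deciding the two judgments for a real output is the hard part that this sketch does not address.

```python
def categorize(is_factual: bool, is_hallucinated: bool) -> str:
    """Map two independent judgments onto the four rows of the table.

    Both arguments are hypothetical labels, e.g. assigned by a human annotator.
    """
    if is_factual and not is_hallucinated:
        return "Factual and Non-Hallucinated"
    if not is_factual and is_hallucinated:
        return "Non-Factual and Hallucinated"
    if is_factual and is_hallucinated:
        return "Hallucinated but Factual"
    return "Non-Factual but Non-Hallucinated"


# The "Paris, which is known for the Eiffel Tower" example: correct facts,
# but with details the prompt did not ask for.
print(categorize(is_factual=True, is_hallucinated=True))
```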

Causes of Factual Errors

Category	Cause	Example Dialog	Notes and references
Model-level causes	Domain knowledge deficit	Q: CEO of Assicurazioni Generali? BloombergGPT: Philippe Donnet GPT-NeoX: Antonio De Lorenzo, Simone Gambarini, Enrico Zanetti FLAN-T5-XXL: John M Forsyth, Christopher K Peters, {empty string}	BloombergGPT is a finance domain-specific language model. Wu et al, 2023
	Outdated information	Q: When was Kyiv attacked by Russia? ChatGPT: As of my last knowledge update in September 2021, Russia had not launched an attack on Kyiv.	Kyiv was attacked by Russia on 25 February 2022.
	Reasoning error	Q: Who is Tom Cruise’s mother? A: Mary Lee Pfeiffer Q: Who is Mary Lee Pfeiffer’s son? A: There is no widely known information about...	From Berglund et al, 2023. It is clear that the model knows Tom Cruise’s mother is Mary Lee Pfeiffer, but it fails to reason that Mary Lee Pfeiffer has a son named Tom Cruise.
Retrieval-level causes	Distracted by the retrieval information	Retrieved context: "Alan Melikdjanian - Age, Bio, Faces and Birthday" Alan Melikdjanian is the Director, Writer (with his father), Editor, and Composer of the 2006 direct-to-DVD Russian-language film Citizen Mavzik... Q: What is the nationality of the director of Citizen Mavzik? A: (1) The director of the film Citizen Mavzik is Alan Melikdjanian. (2) Alan Melikdjanian is from Russia. So, the answer is Russia.	Alan Melikdjanian, known by the alias Captain Disillusion, is a Soviet-born American independent filmmaker and YouTuber. Melikdjanian et al, 2018
	Misunderstood the retrieval information	Retrieved context: "Gerard III, Count of Guelders - Wikipedia" Gerard married Margaretha of Brabant... Q: Who is Gerard I, Count Of Guelders's wife? A: ... the name of Gerard I, Count of Guelders's wife, which is Margaretha of Brabant.	The retrieved context accurately provided the name of Gerard III, Count of Guelders' wife, which is Margaretha of Brabant. However, the model erroneously referred to him as Gerard I, resulting in an incorrect answer.
	Failed to address misinformation in retrieved documents	Retrieved context: [0] "James Vincent - Death Notice and Service Information - Legacy.com" James Vincent died on August 9, 2023... [1] "UPDATE: Jury convicts James Vincent of murder after deliberating..." A jury convicted James Vincent of murder in the 2018 Shooting death of Luka Grabb... Q: When did James Vincent die? A: James Vincent died on August 9, 2023, in Long Beach, California.	When retrieved documents contain contradictory information, the model should present both accounts in its response and note that clarification is needed. For instance, the model could state: "According to the first document, James Vincent passed away on August 9, 2023, in Long Beach. However, the second document suggests that James Vincent was murdered in 2018."
Inference-level causes	Snowballing	Q: Was there ever a US senator who represented the state of New Hampshire and whose alma mater was the University of Pennsylvania? A: Yes... His name was John P. Hale	John P. Hale graduated from Bowdoin College. Zhang et al, 2023
	Exposure bias	Q: Aamari was very stoic. [PRONOUN] rarely showed any emotion. A: He.	The correct answer was Xe according to Hossain et al, 2023.

Evaluations

Reference	Task	Dataset	Metrics	Human Eval	Evaluated LLMs	Granularity
FActScore Min et al, 2023	Biography Generation	183 people entities	F1	✓	GPT-3.5, ChatGPT...	T
SelfCheckGPT Manakul et al, 2023	Bio Generation	WikiBio	AUC-PR, Human Score	✓	GPT-3, LLaMA, OPT, GPT-J...	S
Wang et al, 2023	Open QA	NQ, TQ	ACC, EM	✓	GPT-3.5, ChatGPT, GPT-4, Bing Chat	S
Pezeshkpour et al, 2023	Knowledge Probing	T-REx, LAMA	ACC		GPT3.5	T
De Cao et al, 2021	QA, Fact Checking	KILT, FEVER, zsRE	ACC		GPT-3, FLAN-T5	S/T
Varshney et al, 2023	Article Generation	Unnamed Dataset	ACC, AUC		GPT3.5, Vicuna	S
FacTool Chern et al, 2023	KB-based QA	RoSE	ACC, F1...		GPT-4, ChatGPT, FLAN-T5	S
Kadavath et al, 2022	Self-evaluation	BIG Bench, MMLU, LogiQA, TruthfulQA, QuALITY, TriviaQA, Lambada	ACC, Brier Score, RMS Calibration Error...		Claude	T

Reference	Task	Dataset	Metrics	Human Eval	Evaluated LLMs	Granularity
Retro Borgeaud et al, 2022	QA, Language Modeling	MassiveText, Curation Corpus, Wikitext103, Lambada, C4, Pile, NQ	PPL, ACC, Exact Match	✓	Retro	T
GenRead Yu et al, 2023	QA, Dialogue, Fact Checking	NQ, TQ, WebQ, FEVER, FM2, WoW	EM, ACC, F1, Rouge-L	-	GPT3.5, Codex GPT-3, Gopher FLAN, GLaM PaLM	S
GopherCite Menick et al, 2022	Self-supported QA	NQ, ELI5, TruthfulQA (Health, Law, Fiction, Conspiracies)	Human Score	✓	GopherCite	S
Trivedi et al. Trivedi et al, 2023	QA	HotpotQA, IIRC 2WikiMultihopQA, MuSiQue(music)	Retrieval recall, Answer F1	-	GPT-3 FLAN-T5	S/T
Peng et al. Peng et al, 2023	QA, Dialogue	DSTC7 track2 DSTC11 track5, OTT-QA	ROUGE, chrF, BERTScore, Usefulness, Humanness...	✓	ChatGPT	S/T
CRITIC Gou et al, 2023	QA Toxicity Reduction	AmbigNQ, TriviaQA, HotpotQA, RealToxicityPrompts	Exact Match, maximum toxicity, perplexity, n-gram diversity, AUROC...,	-	GPT-3.5 ChatGPT	T
Khot et al. Khot et al, 2023	QA, long-context QA	CommaQA-E, 2WikiMultihopQA, MuSiQue, HotpotQA	Exact Match, Answer F1	-	GPT-3 FLAN-T5	T
ReAct Yao et al, 2023	QA Fact Verification	HotpotQA, FEVER	Exact Match, ACC	-	PaLM GPT-3	S/T
Jiang et al. Jiang et al, 2023	QA, Commonsense Reasoning, long-form QA...	2WikiMultihopQA, StrategyQA, ASQA, WikiAsp	Exact Match, Disambig-F1, ROUGE, entity F1...	-	GPT-3.5	T
Lee et al. Lee et al, 2022	Open-ended Generation	FEVER	Entity score, EntailmentRatio, ppl...	-	Megatron-LM	T
SAIL Luo et al, 2023	QA Fact Checking	UniLC	ACC F1	-	LLaMA Vicuna SAIL	T
He et al. He et al, 2022	Commonsense Reasoning, Temporal Reasoning, Tabular Reasoning	StrategyQA, TempQuestions, INFOTABS	ACC	-	GPT-3	T
Pan et al. Pan et al, 2023	Fact Checking	HOVER FEVEROUS-S	Macro-F1	-	Codex FLAN-T5	S
Multiagent Debate Du et al, 2023	Biography MMLU	Unnamed Biography Dataset, MMLU	ChatGPT Evaluator, ACC	-	Bard ChatGPT	S
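
Exact Match and Answer F1 appear repeatedly in the Metrics column above. As a rough illustration only, the sketch below follows the widely used SQuAD-style convention (lowercasing, stripping punctuation and English articles, then comparing tokens). This is an assumption on our part: the individual papers may normalize answers differently, and the helper names `normalize`, `exact_match`, and `token_f1` are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))
print(token_f1("Neil Armstrong", "Neil Armstrong and Buzz Aldrin"))
```

A partially correct answer such as "Neil Armstrong" fails Exact Match against the longer reference but still earns partial credit under token-level F1, which is one reason many of the works above report both.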

Benchmarks

Reference	Task Type	Dataset	Metrics	Performance of Representative LLMs
MMLU Hendrycks et al, 2021	Multi-Choice QA	Humanities, Social, Sciences, STEM...	ACC	(ACC, 5-shot) GPT-4: 86.4 GPT-3.5: 70 LLaMA2-70B: 68.9
TruthfulQA Lin et al, 2022	QA	Health, Law, Conspiracies, Fiction...	Human Score, GPT-judge, ROUGE, BLEU, MC1,MC2...	(zero-shot) GPT-4: ~29 (MC1) GPT-3.5: ~28 (MC1), 79.92(%true) LLaMA2-70B: 53.37 (%true)
C-Eval Huang et al, 2023	Multi-Choice QA	STEM, Social Science, Humanities...	ACC	(zero-shot, average ACC) GPT-4: 68.7 GPT-3.5: 54.4 LLaMA2-70B: 50.13
AGIEval Zhong et al, 2023	Multi-Choice QA	Gaokao, (geometry, Bio, history...),SAT, Law...	ACC	(zero-shot, average ACC) GPT-4: 56.4 GPT-3.5: 42.9 LLaMA2-70B: 40.02
HaluEval Li et al, 2023	Hallucination Evaluation	HaluEval	ACC	(general ACC) GPT-3.5: 86.22
BigBench Srivastava et al, 2023	Multi-tasks(QA, NLI, Fact Checking, Reasoning...)	BigBench	Metric to each type of task	(Big-Bench Hard) GPT-3.5: 49.6 LLaMA-65B: 42.6
ALCE Gao et al, 2023	Citation Generation	ASQA, ELI5, QAMPARI	MAUVE, Exact Match, ROUGE-L...	(ASQA, 3-psg, citation prec) GPT-3.5: 73.9 LLaMA-33B: 23.0
QUIP Weller et al, 2023	Generative QA	TriviaQA, NQ, ELI5, HotpotQA	QUIP-Score, Exact match	(ELI5, QUIP, null prompt) GPT-4: 21.0 GPT-3.5: 27.7
PopQA Mallen et al, 2023	Multi-Choice QA	PopQA, EntityQuestions	ACC	(overall ACC) GPT-3.5: ~37.0
UniLC Zhang et al, 2023	Fact Checking	Climate, Health, MGFN	ACC, F1	(zero-shot, fact tasks, average F1) GPT-3.5: 51.62
Pinocchio Hu et al, 2023	Fact Checking, QA, Reasoning	Pinocchio	ACC, F1	GPT-3.5: (Zero-shot ACC: 46.8, F1:44.4) GPT-3.5: (Few-shot ACC: 47.1, F1:45.7)
SelfAware Yin et al, 2023	Self-evaluation	SelfAware	ACC	(instruction input, F1) GPT-4: 75.47 GPT-3.5: 51.43 LLaMA-65B: 46.89
RealTimeQA Kasai et al, 2022	Multi-Choice QA, Generative QA	RealTimeQA	ACC, F1	(original setting, GCS retrieval) GPT-3: 69.3 (ACC for MC) GPT-3: 39.4 (F1 for generation)
FreshQA Vu et al, 2023	Generative QA	FRESHQA	ACC (Human)	(strict ACC, null prompt) GPT-4: 28.6 GPT-3.5: 26.0

Domain evaluation

Reference	Domain	Task	Datasets	Metrics	Evaluated LLMs
Xie et al, 2023	Finance	Sentiment analysis, News headline classification, Named entity recognition, Question answering, Stock movement prediction	FLARE	F1, Acc, Avg F1, Entity F1, EM, MCC	GPT-4, BloombergGPT, FinMA-(7B, 30B, 7B-full), Vicuna-7B
Li et al, 2023	Finance	134 E-com tasks	EcomInstruct	Micro-F1, Macro-F1, ROUGE	BLOOM, BLOOMZ, ChatGPT, EcomGPT
Wang et al, 2023	Medicine	Multi-Choice QA	CMB	Acc	GPT-4, ChatGLM2-6B, ChatGPT, DoctorGLM, Baichuan-13B-chat, HuatuoGPT, MedicalGPT, ChatMed-Consult, ChatGLM-Med, Bentsao, BianQue-2
Li et al, 2023	Medicine	Generative-QA	Huatuo-26M	BLEU, ROUGE, GLEU	T5, GPT2
Jin et al, 2023	Medicine	Nomenclature, Genomic location, Functional analysis, Sequence alignment	GeneTuring	Acc	GPT-2, BioGPT, BioMedLM, GPT-3, ChatGPT, New Bing
Guha et al, 2023	Law	Issue-spotting, Rule-recall, Rule-application, Rule-conclusion, Interpretation, Rhetorical-understanding	LegalBench	Acc, EM	GPT-4, GPT-3.5, Claude-1, Incite, OPT Falcon, LLaMA-2, FLAN-T5...
Fei et al, 2023	Law	Legal QA, NER, Sentiment Analysis, Reading Comprehension	LawBench	F1, Acc, ROUGE-L, Normalized log-distance...	GPT-4, ChatGPT, InternLM-Chat, StableBeluga2...

Enhancement

Enhancement methods

Reference	Dataset	Metrics	Baselines ➝ Theirs	Dataset	Metrics	Baselines ➝ Theirs
Li et al, 2022	NQ	EM	34.5 ➝ 44.35 (T5 11B)	GSM8K	ACC	77.0 ➝ 85.0 (ChatGPT)
Yu et al, 2023	NQ	EM	20.9 ➝ 28.0 (InstructGPT)	TriviaQA	EM	57.5 ➝ 59.0 (InstructGPT)
-	-	-	-	WebQA	EM	18.6 ➝ 24.6 (InstructGPT)
Chuang et al, 2023	FACTOR News	ACC	58.3 ➝ 62.0 (LLaMA-7B)	FACTOR News	ACC	61.1 ➝ 62.5 (LLaMA-13B)
-	FACTOR News	ACC	63.8 ➝ 65.4 (LLaMA-33B)	FACTOR News	ACC	63.6 ➝ 66.2 (LLaMA-65B)
-	FACTOR Wiki	ACC	58.6 ➝ 62.2 (LLaMA-7B)	FACTOR Wiki	ACC	62.6 ➝ 66.2 (LLaMA-13B)
-	FACTOR Wiki	ACC	69.5 ➝ 70.3 (LLaMA-33B)	FACTOR Wiki	ACC	72.2 ➝ 72.4 (LLaMA-65B)
-	TruthfulQA	%Truth * Info	32.4 ➝ 44.6 (LLaMA-13B)	TruthfulQA	%Truth * Info	34.8 ➝ 49.2 (LLaMA-65B)
Li et al, 2022	TruthfulQA	%Truth * Info	32.4 ➝ 44.4 (LLaMA-13B)	TruthfulQA	%Truth * Info	31.7 ➝ 36.7 (LLaMA-33B)
-	TruthfulQA	%Truth * Info	34.8 ➝ 43.4 (LLaMA-65B)	-	-	-
Li et al, 2023	NQ	ACC	46.6 ➝ 51.3 (LLaMA-7B)	TriviaQA	ACC	89.6 ➝ 91.1 (LLaMA-7B)
-	MMLU	ACC	35.7 ➝ 40.1 (LLaMA-7B)	TruthfulQA	%Truth * Info	32.5 ➝ 65.1 (Alpaca)
-	TruthfulQA	%Truth * Info	26.9 ➝ 43.5 (LLaMA-7B)	TruthfulQA	%Truth * Info	51.5 ➝ 74.0 (Vicuna)
Cohen et al, 2023	LAMA	F1	50.7 ➝ 80.8 (ChatGPT)	TriviaQA	F1	56.2 ➝ 82.6 (ChatGPT)
-	NQ	F1	60.6 ➝ 79.1 (ChatGPT)	PopQA	F1	65.2 ➝ 85.4 (ChatGPT)
-	LAMA	F1	42.5 ➝ 79.3 (GPT-3)	TriviaQA	F1	46.7 ➝ 77.2 (GPT-3)
-	NQ	F1	52.0 ➝ 78.0 (GPT-3)	PopQA	F1	43.7 ➝ 77.4 (GPT-3)
...

Domain-enhanced LLMs

Reference	Domain	Model	Eval Task	Eval Dataset	Continual Pretrained?	Continual SFT?	Train From Scratch?	External Knowledge
Zhang et al, 2023	Healthcare	Baichuan-7B, Ziya-LLaMA-13B	QA	cMedQA2, WebMedQA, Huatuo-26M	✔️
Yang et al, 2023	Healthcare	Ziya-LLaMA-13B	QA	CMtMedQA, Huatuo-26M	✔️	✔️
Wang et al, 2023	Healthcare	GPT-3.5-Turbo, LLaMA-2-13B	QA	MedQAUSMLE, MedQAMCMLE, MedMCQA				✔️
Ross et al, 2022	Healthcare	MOLFORMER	Molecule properties prediction				✔️
Bao et al, 2023	Healthcare	Baichuan-13B	QA	CMB-Clin, CMD, CMID		✔️
Guan et al, 2023	Healthcare	ChatGPT	IU-RR, MIMIC-CXR					✔️
Liu et al, 2023	Healthcare	GPT-4	Medical Text De-Identification					✔️
Li et al, 2023	Healthcare	LLaMA	QA			✔️
Venigalla et al, 2022	Healthcare	GPT (2.7b)	QA				✔️
Xiong et al, 2023	Healthcare	ChatGLM-6B	QA			✔️
Tan et al, 2023	Healthcare	Baichuan-7B	QA	C-Eval, MMLU		✔️
Luo et al, 2022	Healthcare	GPT-2	QA, DC, RE				✔️
Jin et al, 2023	Healthcare	Codex	QA	GeneTuring				✔️
Zakka et al, 2023	Healthcare	text-davinci-003	QA	ClinicalQA				✔️
Liu et al, 2023	Healthcare	GPT-2medium	Molecular Property Prediction, Molecule-text translation			✔️	✔️
Nguyen et al, 2023	Law	GPT-3				✔️
Savelka et al, 2023	Law	GPT-4						✔️
Huang et al, 2023	Law	LLaMA	CN Legal Tasks		✔️	✔️
Cui et al, 2023	Law	Ziya-LLaMA-13B	QA	national judicial examination question	✔️			✔️
Li et al, 2023	Finance	BLOOMZ	4 major tasks 12 subtasks	EcomInstruct		✔️
Wu et al, 2023	Finance	BLOOM	Financial NLP (SA, BC, NER, NER+NED, QA)	Financial Datasets			✔️
Deng et al, 2023	Geoscience	LLaMA-7B		GeoBench	✔️
Bai et al, 2023	Geoscience	ChatGLM-6B			✔️
Fan et al, 2023	Education	phoenix-inst-chat-7b	Chinese Grammatical Error Correction	ChatGPT-generated, Human-annotated		✔️
Qi et al, 2023	Food	Chinese-LLaMA2-13B	QA		✔️			✔️
Wen et al, 2023	Home Renovation	Baichuan-13B		C-Eval, CMMLU, EvalHome		✔️

Reference

If you find this project useful in your research or work, please consider citing it:

@misc{wang2023survey,
      title={Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity}, 
      author={Cunxiang Wang and Xiaoze Liu and Yuanhao Yue and Xiangru Tang and Tianhang Zhang and Cheng Jiayang and Yunzhi Yao and Wenyang Gao and Xuming Hu and Zehan Qi and Yidong Wang and Linyi Yang and Jindong Wang and Xing Xie and Zheng Zhang and Yue Zhang},
      year={2023},
      eprint={2310.07521},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Acknowledgements

CHEN Liang (ChanLiang) for PR#1.
JinheonBaek (JinheonBaek) for PR#2 and PR#3.
