If you like our project, please give us a star ⭐ on GitHub for the latest update.

medrxiv Website

Medical Hallucination in Foundation Models and Their Impact on Healthcare (2025)

Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using a medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need for not only technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety.

📣 News

[2025-03-03] 🎉🎉🎉 Our preprint paper has been submitted to medRxiv.

⚡ Contact

If you want to add your work or model to this list, please do not hesitate to email ybkim95@mit.edu.

Contents

What makes Medical Hallucination Special?

Taxonomy

Figure 1. A visual taxonomy of medical hallucinations in LLMs, organized into five main clusters.



LLM hallucinations refer to outputs that are factually incorrect, logically inconsistent, or inadequately grounded in reliable sources. In general domains, these hallucinations may take the form of factual errors or non-sequiturs. In medicine, they can be more challenging to detect because the language used often appears clinically valid while containing critical inaccuracies. Medical hallucinations exhibit two distinct features compared to their general-purpose counterparts. First, they arise within specialized tasks such as diagnostic reasoning, therapeutic planning, or interpretation of laboratory findings, where inaccuracies have immediate implications for patient care. Second, these hallucinations frequently use domain-specific terms and appear to present coherent logic, which can make them difficult to recognize without expert scrutiny. In settings where clinicians or patients rely on AI recommendations, a reliance potentially heightened in a domain like medicine, unrecognized errors risk delaying proper interventions or redirecting care pathways. Moreover, the impact of medical hallucinations is far more severe: errors in clinical reasoning or misleading treatment recommendations can directly harm patients by delaying proper care or leading to inappropriate interventions. Furthermore, the detectability of such hallucinations depends on the domain expertise of the audience and the quality of the prompting provided to the model. Domain experts are more likely to identify subtle inaccuracies in clinical terminology and reasoning, whereas non-experts may struggle to discern these errors, increasing the risk of misinterpretation. These distinctions are crucial: whereas general hallucinations might lead to relatively benign mistakes, medical hallucinations can undermine patient safety and erode trust in AI-assisted clinical systems.

Our primary contributions include:

  1. A taxonomy for medical hallucination in Large Language Models
  2. Benchmarking models using a medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations.
  3. A multi-national clinician survey on their experiences with medical hallucinations.

These contributions collectively advance our understanding of medical hallucinations and their mitigation strategies, with implications extending to regulatory frameworks and best practices for the deployment of AI in clinical settings.


Hallucinations in Medical LLMs

Please refer to Table 1 of our paper for examples of medical hallucinations in clinical tasks, and Table 2 for an organized taxonomy of medical hallucination.

| Title | Institute | Date | Code |
| --- | --- | --- | --- |
| MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models | University of Surrey<br>Georgia Institute of Technology | 2024-09 | N/A |
| Creating Trustworthy LLMs: Dealing with Hallucinations in Healthcare AI | University of Washington Bothell<br>Kaiser Permanente | 2023-09 | N/A |
| Knowledge Overshadowing Causes Amalgamated Hallucination in Large Language Models | University of Illinois Urbana-Champaign<br>The Hong Kong Polytechnic University<br>Stanford University | 2024-07 | N/A |
| Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations | University of Toronto<br>McGill University<br>Mila – Québec AI Institute<br>University of California, Riverside | 2024-06 | N/A |
| Language Models Are Susceptible to Incorrect Patient Self-Diagnosis in Medical Applications | University of Maryland, College Park<br>Johns Hopkins University | 2023-09 | N/A |
| The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models | Renmin University of China<br>Université de Montréal | 2024-01 | https://github.com/RUCAIBox/HaluEval-2.0 |

Survey on Medical Hallucination among Healthcare Professionals

To investigate the perceptions and experiences of healthcare professionals and researchers regarding the use of AI/LLM tools, particularly with respect to medical hallucinations, we conducted a survey aimed at individuals in the medical, research, and analytical fields (Figure 9). A total of 75 professionals participated, primarily holding MD and/or PhD degrees and representing a diverse range of disciplines. The survey was conducted over a 94-day period, from September 15, 2024, to December 18, 2024, and its responses confirm the significant adoption of AI/LLM tools across these fields. Respondents indicated varied levels of trust in these tools, and notably, a substantial proportion reported encountering medical hallucinations—factually incorrect yet plausible outputs with medical relevance—in tasks critical to their work, such as literature reviews and clinical decision-making. Participants described employing verification strategies such as cross-referencing and colleague consultation to manage these inaccuracies.


Figure 9. Key insights from a multi-national clinician survey on medical hallucinations in clinical practice.



Medical Hallucination Benchmarks

Please refer to Table 3 of our paper for details on medical hallucination benchmarks.

| Title | Institute | Date | Code |
| --- | --- | --- | --- |
| Med-HALT: Medical Domain Hallucination Test for Large Language Models | Saama AI Research | 2023-10 | https://medhalt.github.io/ |
| Hallucination Benchmark in Medical Visual Question Answering | University College London | 2024-04 | https://github.com/knowlab/halt-medvqa |
| Detection, Diagnosis, and Explanation: A Benchmark for Chinese Medical Hallucination Evaluation | Peking University<br>Ministry of Education<br>Beijing Jiaotong University | 2024-05 | link to drive |
| MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context | The Ohio State University<br>University of Oxford | 2024-07 | https://github.com/dongzizhu/MedVH |
| Detecting and Evaluating Medical Hallucinations in Large Vision Language Models | Fudan University<br>Tencent Youtu Lab<br>Cognition and Intelligent Technology Laboratory<br>Ministry of Education<br>AI and Unmanned Systems Engineering Research Center of Jilin Province | 2024-06 | https://github.com/ydk122024/Med-HallMark |
| K-QA: A Real-World Medical Q&A Benchmark | K Health Inc<br>The Hebrew University of Jerusalem | 2024-01 | https://github.com/Itaymanes/K-QA |
| Grounding LLMs to In-prompt Instructions: Reducing Hallucinations Caused by Static Pre-training Knowledge | Heriot-Watt University | 2024-05 | https://github.com/AddleseeHQ/in-prompt-grounding |
| Medical Expert Annotations of Unsupported Facts in Doctor-Written and LLM-Generated Patient Summaries | University of Münster<br>MIT<br>Duke University | 2024-04 | https://github.com/stefanhgm/patient_summaries_with_llms |
| MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models | University of Warwick<br>Cranfield University<br>University of Oxford | 2025-01 | N/A |

LLM Experiments on Medical Hallucination Benchmark


Figure 5. Hallucination Pointwise Score vs. Similarity Score of LLMs on the Med-HALT hallucination benchmark.

These results reveal that recent models (e.g., o3-mini, deepseek-r1, and gemini-2.0-flash) typically start with high baseline hallucination resistance and tend to see moderate but consistent gains from simple CoT prompting, while older models, including medical-purpose LLMs, often begin with low hallucination resistance yet can benefit from several approaches (e.g., Search, CoT, and System Prompt). Moreover, retrieval-augmented generation can be less effective if the model struggles to reconcile retrieved information with its internal knowledge.
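The prompting strategies compared above can be sketched as plain message construction. This is a minimal illustration, not the paper's exact prompts: the instruction wording and the example question are hypothetical.

```python
# Minimal sketch of three prompting strategies discussed above: a plain
# baseline, chain-of-thought (CoT), and a safety-oriented system prompt.
# The instruction wording here is illustrative, not taken from the paper.

def build_prompt(question: str, strategy: str = "baseline") -> dict:
    """Return a {system, user} message pair for the given strategy."""
    system = "You are a careful medical assistant."
    user = question
    if strategy == "cot":
        # CoT: ask the model to reason explicitly before answering.
        user += "\n\nThink step by step before giving a final answer."
    elif strategy == "system":
        # System prompt: instruct the model to flag unsupported claims.
        system += (" If you are not certain a claim is supported by medical"
                   " evidence, say so explicitly rather than guessing.")
    return {"system": system, "user": user}

question = "A patient on warfarin starts amiodarone. What should be monitored?"
variants = {s: build_prompt(question, s) for s in ("baseline", "cot", "system")}
```

In a benchmark run, each variant would be sent to the model under test and the responses scored for hallucination, as in the Med-HALT experiments above.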


Detection Methods for Medical Hallucination

Type 1: Factual Verification

| Title | Institute | Date | Code |
| --- | --- | --- | --- |
| Complex Claim Verification with Evidence Retrieved in the Wild | The University of Texas at Austin | 2025-01 | https://github.com/jifan-chen/Fact-checking-via-Raw-Evidence |
| FACTSCORE: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation | University of Washington<br>University of Massachusetts Amherst<br>Allen Institute for AI<br>Meta AI | 2023-12 | https://github.com/shmsw25/FActScore |
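Factual-verification methods of this kind decompose a generation into atomic facts and score the fraction supported by a trusted source. A toy sketch of that scoring step, assuming the decomposition has already been done: the in-memory "knowledge base" and the example facts are hypothetical stand-ins for retrieval over a medical corpus.

```python
# Toy illustration of atomic-fact precision scoring in the spirit of
# factual verification (e.g., FACTSCORE). A real system would use an LLM
# to decompose text into atomic facts and retrieval to check support;
# here a hypothetical in-memory set stands in for the knowledge source.

def fact_precision(atomic_facts: list[str], knowledge: set[str]) -> float:
    """Fraction of atomic facts supported by the knowledge source."""
    if not atomic_facts:
        return 1.0  # vacuously precise: nothing was asserted
    supported = sum(1 for fact in atomic_facts if fact in knowledge)
    return supported / len(atomic_facts)

kb = {
    "metformin is first-line therapy for type 2 diabetes",
    "metformin is contraindicated in severe renal impairment",
}
facts = [
    "metformin is first-line therapy for type 2 diabetes",
    "metformin cures type 1 diabetes",  # fabricated claim, unsupported
]
score = fact_precision(facts, kb)  # 0.5: one of two facts supported
```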

Type 2: Summary Consistency Verification

| Title | Institute | Date | Code |
| --- | --- | --- | --- |
| Answers Unite! Unsupervised Metrics for Reinforced Summarization Models | CNRS<br>Sorbonne Université<br>LIP6<br>reciTAL | 2019-11 | https://github.com/ThomasScialom/summa-qa |
| Asking and Answering Questions to Evaluate the Factual Consistency of Summaries | New York University<br>Facebook AI<br>CIFAR Associate Fellow | 2020-07 | https://github.com/W4ngatang/qags |
| QuestEval: Summarization Asks for Fact-based Evaluation | CNRS<br>Sorbonne Université<br>LIP6<br>reciTAL<br>New York University | 2021-11 | https://github.com/ThomasScialom/QuestEval |
| Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference | Amazon<br>Technische Universität Darmstadt<br>Bar-Ilan University, Ramat-Gan | 2019-08 | https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2002 |

Type 3: Uncertainty-Based Hallucination Detection

| Title | Institute | Date | Code |
| --- | --- | --- | --- |
| Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation | Instituto de Telecomunicações<br>Instituto Superior Técnico & LUMLIS<br>Unbabel<br>University of Edinburgh | 2023-05 | https://github.com/deep-spin/hallucinations-in-nmt |
| Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus | Shanghai Jiaotong University<br>Amazon AWS AI<br>Westlake University<br>IGSNRR, Chinese Academy of Sciences | 2023-11 | https://github.com/zthang/focus |
| Detecting Hallucinations in Large Language Models Using Semantic Entropy | University of Oxford | 2024-06 | https://github.com/jlko/semantic_uncertainty<br>https://github.com/jlko/long_hallucinations |
| Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling | UC Santa Barbara<br>MIT-IBM Watson AI Lab, IBM Research<br>MIT CSAIL | 2024-06 | https://github.com/UCSB-NLP-Chang/llm_uncertainty |
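The core idea behind uncertainty-based detection can be sketched in a few lines: sample several answers to the same question and measure the entropy of the answer distribution. This is a toy stand-in: real semantic-entropy methods cluster samples by bidirectional entailment using an NLI model, whereas here normalized string equality substitutes for semantic clustering, and the example answers are hypothetical.

```python
import math
from collections import Counter

# Toy sketch of uncertainty-based hallucination detection: sample several
# answers to one question and compute the Shannon entropy over clusters of
# equivalent answers. Real semantic-entropy methods cluster samples by
# bidirectional entailment with an NLI model; normalized string equality
# stands in for semantic clustering here.

def answer_entropy(samples: list[str]) -> float:
    """Shannon entropy (bits) of the distribution over answer clusters."""
    clusters = Counter(s.strip().lower() for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in clusters.values())

consistent = ["Lisinopril", "lisinopril", "Lisinopril"]   # stable answer
scattered = ["Lisinopril", "Metoprolol", "Amlodipine"]    # model is guessing
```

Low entropy suggests a stable answer; high entropy flags a response worth routing to verification, since the model answers differently each time it is asked.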

Mitigation Methods for Medical Hallucination

| Title | Institute | Date | Code |
| --- | --- | --- | --- |
| SYNFAC-EDIT: Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization | University of Massachusetts, Amherst<br>Fudan University<br>University of Massachusetts, Lowell | 2024-10 | N/A |
| Benchmarking Retrieval-Augmented Generation for Medicine | University of Virginia<br>National Library of Medicine<br>National Institutes of Health | 2024-08 | https://teddy-xionggz.github.io/benchmark-medical-rag/ |
| Enhancement of the Performance of Large Language Models in Diabetes Education through Retrieval-Augmented Generation: Comparative Study | State Key Laboratory of Ophthalmology<br>Zhongshan Ophthalmic Center<br>Sun Yat-sen University<br>Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science<br>Guangdong Provincial Clinical Research Center for Ocular Diseases<br>Research Centre for SHARP Vision<br>The Hong Kong Polytechnic University<br>Peking University Third Hospital | 2024-08 | N/A |
| Improving Retrieval-Augmented Generation in Medicine with Iterative Follow-up Questions | University of Virginia<br>National Institutes of Health<br>University of Illinois Urbana-Champaign | 2024-10 | N/A |
| Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective | Aalborg University<br>TU Wien<br>Institute of Logic and Computation | 2024-11 | N/A |
| CoMT: Chain-of-Medical-Thought Reduces Hallucination in Medical Report Generation | Fudan University<br>Tencent YouTu Lab<br>Xiamen University<br>Cognition and Intelligent Technology Laboratory<br>Institute of Meta-Medical<br>Ministry of Education<br>Jilin Provincial Key Laboratory of Intelligence Science and Engineering | 2025-02 | https://github.com/FRENKIE-CHIANG/CoMT |
| Towards Mitigating Hallucination in Large Language Models via Self-Reflection | Center for Artificial Intelligence Research (CAiRE)<br>Hong Kong University of Science and Technology | 2023-10 | https://github.com/ziweiji/Self_Reflection_Medical |
| Mitigating Hallucinations in Large Language Models via Semantic Enrichment of Prompts: Insights from BioBERT and Ontological Integration | Sofia University | 2024-09 | N/A |
| Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting over Heterogeneous Sources | DAMO Academy, Alibaba Group<br>Nanyang Technological University<br>Singapore University of Technology and Design<br>Salesforce Research<br>Hupan Lab | 2024-02 | https://github.com/DAMO-NLP-SG/chain-of-knowledge |

Human Physicians' Medical Hallucination Annotation

To rigorously evaluate the presence and nature of hallucinations produced by LLMs in the clinical domain, we employed a structured annotation process. We built upon established frameworks for hallucination and risk assessment, drawing specifically from the hallucination typology proposed by Hegselmann et al. (2024b) and the risk-level framework from Asgari et al. (2024) (Figure 6), and used the New England Journal of Medicine (NEJM) Case Reports for LLM inference.


Figure 6. An annotation process of medical hallucinations in LLMs.



To qualitatively assess the LLM’s clinical reasoning abilities, we designed three targeted tasks, each focusing on a crucial aspect of medical problem-solving: 1) chronological ordering of events, 2) lab data interpretation, and 3) differential diagnosis generation. These tasks were designed to mimic essential steps in clinical practice, from understanding the patient’s history to formulating a diagnosis.


Figure 1. Overview of medical hallucinations generated by state-of-the-art LLMs.



📑 Citation

If our repository is helpful to your work, please consider citing 📑 our paper. Thank you!

@article{Kim2025.02.28.25323115,
	author = {Kim, Yubin and Jeong, Hyewon and Chen, Shen and Li, Shuyue Stella and Lu, Mingyu and Alhamoud, Kumail and Mun, Jimin and Grau, Cristina and Jung, Minseok and Gameiro, Rodrigo R and Fan, Lizhou and Park, Eugene and Lin, Tristan and Yoon, Joonsik and Yoon, Wonjin and Sap, Maarten and Tsvetkov, Yulia and Liang, Paul Pu and Xu, Xuhai and Liu, Xin and McDuff, Daniel and Lee, Hyeonhoon and Park, Hae Won and Tulebaev, Samir R and Breazeal, Cynthia},
	title = {Medical Hallucination in Foundation Models and Their Impact on Healthcare},
	elocation-id = {2025.02.28.25323115},
	year = {2025},
	doi = {10.1101/2025.02.28.25323115},
	publisher = {Cold Spring Harbor Laboratory Press},
	abstract = {Foundation Models that are capable of processing and generating multi-modal data have transformed AI{\textquoteright}s role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need for not only technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety. A repository organizing the paper resources, summaries, and additional information is available at https://github.com/mitmedialab/medical_hallucination.},
	URL = {https://www.medrxiv.org/content/early/2025/03/03/2025.02.28.25323115},
	eprint = {https://www.medrxiv.org/content/early/2025/03/03/2025.02.28.25323115.full.pdf},
	journal = {medRxiv}
}
