Foundation Models that are capable of processing and generating multi-modal data have transformed AI’s role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using a medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need not only for technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety.
[2025-03-03] 🎉🎉🎉 Our preprint is now available on medRxiv.
If you want to add your work or model to this list, please do not hesitate to email ybkim95@mit.edu.
- What makes Medical Hallucination Special?
- Hallucinations in Medical LLMs
- Survey on Medical Hallucination among Healthcare Professionals
- Medical Hallucination Benchmarks
- LLM Experiments on Medical Hallucination Benchmark
- Detection Methods for Medical Hallucination
- Mitigation Methods for Medical Hallucination
- Human Physicians' Medical Hallucination Annotation
LLM hallucinations are outputs that are factually incorrect, logically inconsistent, or inadequately grounded in reliable sources. In general domains, these hallucinations typically take the form of factual errors or non-sequiturs. In medicine, they can be harder to detect because the language often appears clinically valid while containing critical inaccuracies.

Medical hallucinations exhibit two features that set them apart from their general-purpose counterparts. First, they arise within specialized tasks such as diagnostic reasoning, therapeutic planning, or interpretation of laboratory findings, where inaccuracies have immediate implications for patient care. Second, they frequently use domain-specific terminology and present seemingly coherent logic, which makes them difficult to recognize without expert scrutiny. In settings where clinicians or patients rely on AI recommendations, a tendency that may be heightened in high-stakes domains like medicine, unrecognized errors can delay proper interventions or redirect care pathways.

The consequences are correspondingly more severe: errors in clinical reasoning or misleading treatment recommendations can directly harm patients through delayed care or inappropriate interventions. Detectability also depends on the domain expertise of the audience and the quality of the prompt given to the model. Domain experts are more likely to spot subtle inaccuracies in clinical terminology and reasoning, whereas non-experts may struggle to discern them, increasing the risk of misinterpretation. These distinctions matter: whereas general hallucinations might lead to relatively benign mistakes, medical hallucinations can undermine patient safety and erode trust in AI-assisted clinical systems.
Our primary contributions include:
- A taxonomy for medical hallucination in Large Language Models
- Benchmarking models using a medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations.
- A multi-national clinician survey on their experiences with medical hallucinations.
These contributions collectively advance our understanding of medical hallucinations and their mitigation strategies, with implications extending to regulatory frameworks and best practices for the deployment of AI in clinical settings.
Please refer to Table 1 of our paper for examples of medical hallucination in clinical tasks, and Table 2 for an organized taxonomy of medical hallucination.
To investigate how healthcare professionals and researchers perceive and experience AI/LLM tools, particularly medical hallucinations, we conducted a survey of individuals in the medical, research, and analytical fields (Figure 9). A total of 75 professionals participated, primarily holding MD and/or PhD degrees and representing a diverse range of disciplines. The survey ran for 94 days, from September 15, 2024 to December 18, 2024, and confirmed substantial adoption of AI/LLM tools across these fields. Respondents indicated varied levels of trust in these tools, and a substantial proportion reported encountering medical hallucinations (factually incorrect yet plausible outputs with medical relevance) in tasks critical to their work, such as literature reviews and clinical decision-making. Participants described verification strategies such as cross-referencing and consulting colleagues to manage these inaccuracies.
Figure 9. Key insights from a multi-national clinician survey on medical hallucinations in clinical practice.

Please refer to Table 3 of our paper for details on medical hallucination benchmarks.
Figure 5. Hallucination Pointwise Score vs. Similarity Score of LLMs on the Med-HALT hallucination benchmark.
These results show that recent models (e.g., o3-mini, deepseek-r1, and gemini-2.0-flash) typically start from a high baseline of hallucination resistance and see moderate but consistent gains from simple CoT prompting, while older models, including medical-purpose LLMs, often start from low hallucination resistance yet can benefit from several approaches (e.g., Search, CoT, and System Prompt). Moreover, retrieval-augmented generation can be less effective when the model struggles to reconcile retrieved information with its internal knowledge.
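As a rough illustration of how such prompting conditions can be compared, here is a minimal Python sketch. The prompt wording, the `query_model` stub, and the demo question are illustrative assumptions, not the exact prompts or evaluation harness used in the paper; hallucination scoring against benchmark ground truth would then be applied uniformly to each strategy's outputs.

```python
# Minimal sketch of three prompting conditions (baseline, CoT, cautious system prompt).
# All names below are placeholders for illustration.

def query_model(messages: list[dict]) -> str:
    """Stub for an LLM call; replace with a real chat-completion API of your choice."""
    return "stubbed model response"

def build_messages(question: str, strategy: str) -> list[dict]:
    """Wrap a benchmark question according to one of the prompting strategies."""
    if strategy == "baseline":
        return [{"role": "user", "content": question}]
    if strategy == "cot":
        return [{"role": "user",
                 "content": question + "\nLet's think step by step before giving a final answer."}]
    if strategy == "system":
        return [{"role": "system",
                 "content": "You are a careful clinical assistant. If you are unsure, say so rather than guessing."},
                {"role": "user", "content": question}]
    raise ValueError(f"unknown strategy: {strategy}")

def run_strategy(items: list[dict], strategy: str) -> list[str]:
    """Collect one response per benchmark item ({'question': ...}) for a given strategy."""
    return [query_model(build_messages(item["question"], strategy)) for item in items]

if __name__ == "__main__":
    demo_items = [{"question": "Which electrolyte abnormality most commonly causes peaked T waves?"}]
    for strategy in ("baseline", "cot", "system"):
        print(strategy, run_strategy(demo_items, strategy))
```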
Title | Institute | Date | Code |
---|---|---|---|
Complex Claim Verification with Evidence Retrieved in the Wild | The University of Texas at Austin | 2025-01 | https://github.com/jifan-chen/Fact-checking-via-Raw-Evidence |
FACTSCORE: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation | University of Washington, University of Massachusetts Amherst, Allen Institute for AI, Meta AI | 2023-12 | https://github.com/shmsw25/FActScore |
Title | Institute | Date | Code |
---|---|---|---|
Answers Unite! Unsupervised Metrics for Reinforced Summarization Models | CNRS, Sorbonne Université, LIP6, reciTAL | 2019-11 | https://github.com/ThomasScialom/summa-qa |
Asking and Answering Questions to Evaluate the Factual Consistency of Summaries | New York University, Facebook AI, CIFAR Associate Fellow | 2020-07 | https://github.com/W4ngatang/qags |
QuestEval: Summarization Asks for Fact-based Evaluation | CNRS, Sorbonne Université, LIP6, reciTAL, New York University | 2021-11 | https://github.com/ThomasScialom/QuestEval |
Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference | Amazon, Technische Universität Darmstadt, Bar-Ilan University, Ramat-Gan | 2019-08 | https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2002 |
Title | Institute | Date | Code |
---|---|---|---|
Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation | Instituto de Telecomunicações, Instituto Superior Técnico & LUMLIS, Unbabel, University of Edinburgh | 2023-05 | https://github.com/deep-spin/hallucinations-in-nmt |
Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus | Shanghai Jiaotong University, Amazon AWS AI, Westlake University, IGSNRR, Chinese Academy of Sciences | 2023-11 | https://github.com/zthang/focus |
Detecting hallucinations in large language models using semantic entropy | University of Oxford | 2024-06 | https://github.com/jlko/semantic_uncertainty https://github.com/jlko/long_hallucinations |
Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling | UC Santa Barbara, MIT-IBM Watson AI Lab, IBM Research, MIT CSAIL | 2024-06 | https://github.com/UCSB-NLP-Chang/llm_uncertainty |
Title | Institute | Date | Code |
---|---|---|---|
SYNFAC-EDIT: Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization | University of Massachusetts Amherst, Fudan University, University of Massachusetts Lowell | 2024-10 | N/A |
Benchmarking Retrieval-Augmented Generation for Medicine | University of Virginia, National Library of Medicine, National Institutes of Health | 2024-08 | https://teddy-xionggz.github.io/benchmark-medical-rag/ |
Enhancement of the Performance of Large Language Models in Diabetes Education through Retrieval-Augmented Generation: Comparative Study | State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Research Centre for SHARP Vision, The Hong Kong Polytechnic University, Peking University Third Hospital | 2024-08 | N/A |
Improving Retrieval-Augmented Generation in Medicine with Iterative Follow-up Questions | University of Virginia, National Institutes of Health, University of Illinois Urbana-Champaign | 2024-10 | N/A |
Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective | Aalborg University, TU Wien Institute of Logic and Computation | 2024-11 | N/A |
CoMT: Chain-of-Medical-Thought Reduces Hallucination in Medical Report Generation | Fudan University, Tencent YouTu Lab, Xiamen University, Cognition and Intelligent Technology Laboratory, Institute of Meta-Medical, Ministry of Education, Jilin Provincial Key Laboratory of Intelligence Science and Engineering | 2025-02 | https://github.com/FRENKIE-CHIANG/CoMT |
Towards Mitigating Hallucination in Large Language Models via Self-Reflection | Center for Artificial Intelligence Research (CAiRE), Hong Kong University of Science and Technology | 2023-10 | https://github.com/ziweiji/Self_Reflection_Medical |
Mitigating Hallucinations in Large Language Models via Semantic Enrichment of Prompts: Insights from BioBERT and Ontological Integration | Sofia University | 2024-09 | N/A |
Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting over Heterogeneous Sources | DAMO Academy, Alibaba Group, Nanyang Technological University, Singapore University of Technology and Design, Salesforce Research, Hupan Lab | 2024-02 | https://github.com/DAMO-NLP-SG/chain-of-knowledge |
To rigorously evaluate the presence and nature of hallucinations in LLMs within the clinical domain, we employed a structured annotation process. We built on established frameworks for hallucination and risk assessment, drawing specifically on the hallucination typology proposed by Hegselmann et al. (2024b) and the risk-level framework from Asgari et al. (2024) (Figure 6), and used New England Journal of Medicine (NEJM) Case Reports as the inputs for LLM inference.
Figure 6. The annotation process for medical hallucinations in LLM responses.

To qualitatively assess the LLM’s clinical reasoning abilities, we designed three targeted tasks, each focusing on a crucial aspect of medical problem-solving: 1) chronological ordering of events, 2) lab data interpretation, and 3) differential diagnosis generation. These tasks were designed to mimic essential steps in clinical practice, from understanding the patient’s history to formulating a diagnosis.
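For readers who want to set up a similar annotation workflow, the following is a minimal sketch of how one physician annotation could be recorded. The enum labels, field names, and example values are placeholders for illustration, not the exact categories of the Hegselmann et al. (2024b) typology or the Asgari et al. (2024) risk framework.

```python
# Illustrative annotation schema; all labels and values below are placeholders.
from dataclasses import dataclass
from enum import Enum

class Task(Enum):
    CHRONOLOGICAL_ORDERING = "chronological ordering of events"
    LAB_INTERPRETATION = "lab data interpretation"
    DIFFERENTIAL_DIAGNOSIS = "differential diagnosis generation"

class HallucinationType(Enum):  # placeholder categories, not the paper's exact typology
    NONE = "no hallucination"
    FACTUAL_ERROR = "factual error"
    UNSUPPORTED_CLAIM = "claim not grounded in the case report"
    FAULTY_REASONING = "incorrect clinical reasoning"

class RiskLevel(Enum):  # placeholder ordinal risk scale
    LOW = 1
    MEDIUM = 2
    HIGH = 3

@dataclass
class Annotation:
    case_id: str                    # NEJM case report identifier
    task: Task
    model: str                      # LLM that produced the response
    span: str                       # annotated excerpt of the response
    hallucination_type: HallucinationType
    risk_level: RiskLevel
    annotator_note: str = ""

# Hypothetical example record
example = Annotation(
    case_id="NEJM-example-001",
    task=Task.LAB_INTERPRETATION,
    model="example-llm",
    span="Serum potassium of 5.9 mmol/L is within the normal range.",
    hallucination_type=HallucinationType.FACTUAL_ERROR,
    risk_level=RiskLevel.HIGH,
    annotator_note="Value is above the reference range; misinterpretation could delay treatment.",
)
```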
Figure 1. Overview of medical hallucinations generated by state-of-the-art LLMs.

Please consider citing 📑 our paper if this repository is helpful to your work. Thank you!
@article {Kim2025.02.28.25323115,
author = {Kim, Yubin and Jeong, Hyewon and Chen, Shen and Li, Shuyue Stella and Lu, Mingyu and Alhamoud, Kumail and Mun, Jimin and Grau, Cristina and Jung, Minseok and Gameiro, Rodrigo R and Fan, Lizhou and Park, Eugene and Lin, Tristan and Yoon, Joonsik and Yoon, Wonjin and Sap, Maarten and Tsvetkov, Yulia and Liang, Paul Pu and Xu, Xuhai and Liu, Xin and McDuff, Daniel and Lee, Hyeonhoon and Park, Hae Won and Tulebaev, Samir R and Breazeal, Cynthia},
title = {Medical Hallucination in Foundation Models and Their Impact on Healthcare},
elocation-id = {2025.02.28.25323115},
year = {2025},
doi = {10.1101/2025.02.28.25323115},
publisher = {Cold Spring Harbor Laboratory Press},
abstract = {Foundation Models that are capable of processing and generating multi-modal data have transformed AI{\textquoteright}s role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need for not only technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety. A repository organizing the paper resources, summaries, and additional information is available at https://github.com/mitmedialab/medical_hallucination.},
URL = {https://www.medrxiv.org/content/early/2025/03/03/2025.02.28.25323115},
eprint = {https://www.medrxiv.org/content/early/2025/03/03/2025.02.28.25323115.full.pdf},
journal = {medRxiv}
}