Foundation Models that are capable of processing and generating multi-modal data have transformed AI’s role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using a medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need not only for technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety.
[2025-03-03] 🎉🎉🎉 Our preprint is now available on medRxiv.
If you want to add your work or model to this list, please do not hesitate to email ybkim95@mit.edu.
- What makes Medical Hallucination Special?
- Hallucinations in Medical LLMs
- Survey on Medical Hallucination among Healthcare Professionals
- Medical Hallucination Benchmarks
- LLM Experiments on Medical Hallucination Benchmark
- Detection Methods for Medical Hallucination
- Mitigation Methods for Medical Hallucination
- Human Physicians' Medical Hallucination Annotation
LLM hallucinations are outputs that are factually incorrect, logically inconsistent, or inadequately grounded in reliable sources. In general domains, these hallucinations typically take the form of factual errors or non-sequiturs. In medicine, they can be harder to detect because the language often appears clinically valid while containing critical inaccuracies.

Medical hallucinations exhibit two features that set them apart from their general-purpose counterparts. First, they arise within specialized tasks such as diagnostic reasoning, therapeutic planning, or interpretation of laboratory findings, where inaccuracies have immediate implications for patient care. Second, they frequently use domain-specific terminology and present seemingly coherent logic, which makes them difficult to recognize without expert scrutiny. In settings where clinicians or patients rely on AI recommendations, a tendency that may be heightened in high-stakes domains like medicine, unrecognized errors can delay proper interventions or redirect care pathways.

The consequences are correspondingly more severe: errors in clinical reasoning or misleading treatment recommendations can directly harm patients through delayed care or inappropriate interventions. Detectability also depends on the domain expertise of the audience and the quality of the prompt given to the model. Domain experts are more likely to spot subtle inaccuracies in clinical terminology and reasoning, whereas non-experts may struggle to discern them, increasing the risk of misinterpretation. These distinctions matter: whereas general hallucinations might lead to relatively benign mistakes, medical hallucinations can undermine patient safety and erode trust in AI-assisted clinical systems.
Our primary contributions include:
- A taxonomy for medical hallucination in Large Language Models
- Benchmarking models using a medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations.
- A multi-national clinician survey on their experiences with medical hallucinations.
These contributions collectively advance our understanding of medical hallucinations and their mitigation strategies, with implications extending to regulatory frameworks and best practices for the deployment of AI in clinical settings.
Please refer to Table 1 of our paper for examples of medical hallucination in clinical tasks, and Table 2 for an organized taxonomy of medical hallucination.
To investigate how healthcare professionals and researchers perceive and experience AI/LLM tools, particularly medical hallucinations, we conducted a survey of individuals in the medical, research, and analytical fields (Figure 9). A total of 75 professionals participated, primarily holding MD and/or PhD degrees and representing a diverse range of disciplines. The survey ran for 94 days, from September 15, 2024 to December 18, 2024, and confirmed substantial adoption of AI/LLM tools across these fields. Respondents indicated varied levels of trust in these tools, and a substantial proportion reported encountering medical hallucinations (factually incorrect yet plausible outputs with medical relevance) in tasks critical to their work, such as literature reviews and clinical decision-making. Participants described verification strategies such as cross-referencing and consulting colleagues to manage these inaccuracies.
Figure 9. Key insights from a multi-national clinician survey on medical hallucinations in clinical practice.

Please refer to Table 3 of our paper for details on medical hallucination benchmarks.
Figure 5. Hallucination Pointwise Score vs. Similarity Score of LLMs on the Med-HALT hallucination benchmark.
These results show that recent models (e.g., o3-mini, deepseek-r1, and gemini-2.0-flash) typically start from a high baseline of hallucination resistance and see moderate but consistent gains from simple CoT prompting, while older models, including medical-purpose LLMs, often start from low hallucination resistance yet can benefit from several approaches (e.g., Search, CoT, and System Prompt). Moreover, retrieval-augmented generation can be less effective when the model struggles to reconcile retrieved information with its internal knowledge.
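As a rough illustration of how such prompting conditions can be compared, here is a minimal Python sketch. The prompt wording, the `query_model` stub, and the demo question are illustrative assumptions, not the exact prompts or evaluation harness used in the paper; hallucination scoring against benchmark ground truth would then be applied uniformly to each strategy's outputs.

```python
# Minimal sketch of three prompting conditions (baseline, CoT, cautious system prompt).
# All names below are placeholders for illustration.

def query_model(messages: list[dict]) -> str:
    """Stub for an LLM call; replace with a real chat-completion API of your choice."""
    return "stubbed model response"

def build_messages(question: str, strategy: str) -> list[dict]:
    """Wrap a benchmark question according to one of the prompting strategies."""
    if strategy == "baseline":
        return [{"role": "user", "content": question}]
    if strategy == "cot":
        return [{"role": "user",
                 "content": question + "\nLet's think step by step before giving a final answer."}]
    if strategy == "system":
        return [{"role": "system",
                 "content": "You are a careful clinical assistant. If you are unsure, say so rather than guessing."},
                {"role": "user", "content": question}]
    raise ValueError(f"unknown strategy: {strategy}")

def run_strategy(items: list[dict], strategy: str) -> list[str]:
    """Collect one response per benchmark item ({'question': ...}) for a given strategy."""
    return [query_model(build_messages(item["question"], strategy)) for item in items]

if __name__ == "__main__":
    demo_items = [{"question": "Which electrolyte abnormality most commonly causes peaked T waves?"}]
    for strategy in ("baseline", "cot", "system"):
        print(strategy, run_strategy(demo_items, strategy))
```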
Title | Institute | Date | Code |
---|---|---|---|
Complex Claim Verification with Evidence Retrieved in the Wild | The University of Texas at Austin | 2025-01 | https://github.com/jifan-chen/Fact-checking-via-Raw-Evidence |
FACTSCORE: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation | University of Washington, University of Massachusetts Amherst, Allen Institute for AI, Meta AI | 2023-12 | https://github.com/shmsw25/FActScore |
Title | Institute | Date | Code |
---|---|---|---|
Answers Unite! Unsupervised Metrics for Reinforced Summarization Models | CNRS, Sorbonne Université, LIP6, reciTAL | 2019-11 | https://github.com/ThomasScialom/summa-qa |
Asking and Answering Questions to Evaluate the Factual Consistency of Summaries | New York University, Facebook AI, CIFAR Associate Fellow | 2020-07 | https://github.com/W4ngatang/qags |
QuestEval: Summarization Asks for Fact-based Evaluation | CNRS, Sorbonne Université, LIP6, reciTAL, New York University | 2021-11 | https://github.com/ThomasScialom/QuestEval |
Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference | Amazon, Technische Universität Darmstadt, Bar-Ilan University, Ramat-Gan | 2019-08 | https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2002 |
Title | Institute | Date | Code |
---|---|---|---|
Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation | Instituto de Telecomunicações, Instituto Superior Técnico & LUMLIS, Unbabel, University of Edinburgh | 2023-05 | https://github.com/deep-spin/hallucinations-in-nmt |
Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus | Shanghai Jiaotong University, Amazon AWS AI, Westlake University, IGSNRR, Chinese Academy of Sciences | 2023-11 | https://github.com/zthang/focus |
Detecting hallucinations in large language models using semantic entropy | University of Oxford | 2024-06 | https://github.com/jlko/semantic_uncertainty https://github.com/jlko/long_hallucinations |
Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling | UC Santa Barbara, MIT-IBM Watson AI Lab, IBM Research, MIT CSAIL | 2024-06 | https://github.com/UCSB-NLP-Chang/llm_uncertainty |
Title | Institute | Date | Code |
---|---|---|---|
SYNFAC-EDIT: Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization | University of Massachusetts Amherst, Fudan University, University of Massachusetts Lowell | 2024-10 | N/A |
Benchmarking Retrieval-Augmented Generation for Medicine | University of Virginia, National Library of Medicine, National Institutes of Health | 2024-08 | https://teddy-xionggz.github.io/benchmark-medical-rag/ |
Enhancement of the Performance of Large Language Models in Diabetes Education through Retrieval-Augmented Generation: Comparative Study | State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Research Centre for SHARP Vision, The Hong Kong Polytechnic University, Peking University Third Hospital | 2024-08 | N/A |
Improving Retrieval-Augmented Generation in Medicine with Iterative Follow-up Questions | University of Virginia, National Institutes of Health, University of Illinois Urbana-Champaign | 2024-10 | N/A |
Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective | Aalborg University, TU Wien Institute of Logic and Computation | 2024-11 | N/A |
CoMT: Chain-of-Medical-Thought Reduces Hallucination in Medical Report Generation | Fudan University, Tencent YouTu Lab, Xiamen University, Cognition and Intelligent Technology Laboratory, Institute of Meta-Medical, Ministry of Education, Jilin Provincial Key Laboratory of Intelligence Science and Engineering | 2025-02 | https://github.com/FRENKIE-CHIANG/CoMT |
Towards Mitigating Hallucination in Large Language Models via Self-Reflection | Center for Artificial Intelligence Research (CAiRE), Hong Kong University of Science and Technology | 2023-10 | https://github.com/ziweiji/Self_Reflection_Medical |
Mitigating Hallucinations in Large Language Models via Semantic Enrichment of Prompts: Insights from BioBERT and Ontological Integration | Sofia University | 2024-09 | N/A |
Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting over Heterogeneous Sources | DAMO Academy, Alibaba Group, Nanyang Technological University, Singapore University of Technology and Design, Salesforce Research, Hupan Lab | 2024-02 | https://github.com/DAMO-NLP-SG/chain-of-knowledge |
To rigorously evaluate the presence and nature of hallucinations in LLMs within the clinical domain, we employed a structured annotation process. We built on established frameworks for hallucination and risk assessment, drawing specifically on the hallucination typology proposed by Hegselmann et al. (2024b) and the risk-level framework from Asgari et al. (2024) (Figure 6), and used New England Journal of Medicine (NEJM) Case Reports as the inputs for LLM inference.
Figure 6. The annotation process for medical hallucinations in LLM responses.

To qualitatively assess the LLM’s clinical reasoning abilities, we designed three targeted tasks, each focusing on a crucial aspect of medical problem-solving: 1) chronological ordering of events, 2) lab data interpretation, and 3) differential diagnosis generation. These tasks were designed to mimic essential steps in clinical practice, from understanding the patient’s history to formulating a diagnosis.
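For readers who want to set up a similar annotation workflow, the following is a minimal sketch of how one physician annotation could be recorded. The enum labels, field names, and example values are placeholders for illustration, not the exact categories of the Hegselmann et al. (2024b) typology or the Asgari et al. (2024) risk framework.

```python
# Illustrative annotation schema; all labels and values below are placeholders.
from dataclasses import dataclass
from enum import Enum

class Task(Enum):
    CHRONOLOGICAL_ORDERING = "chronological ordering of events"
    LAB_INTERPRETATION = "lab data interpretation"
    DIFFERENTIAL_DIAGNOSIS = "differential diagnosis generation"

class HallucinationType(Enum):  # placeholder categories, not the paper's exact typology
    NONE = "no hallucination"
    FACTUAL_ERROR = "factual error"
    UNSUPPORTED_CLAIM = "claim not grounded in the case report"
    FAULTY_REASONING = "incorrect clinical reasoning"

class RiskLevel(Enum):  # placeholder ordinal risk scale
    LOW = 1
    MEDIUM = 2
    HIGH = 3

@dataclass
class Annotation:
    case_id: str                    # NEJM case report identifier
    task: Task
    model: str                      # LLM that produced the response
    span: str                       # annotated excerpt of the response
    hallucination_type: HallucinationType
    risk_level: RiskLevel
    annotator_note: str = ""

# Hypothetical example record
example = Annotation(
    case_id="NEJM-example-001",
    task=Task.LAB_INTERPRETATION,
    model="example-llm",
    span="Serum potassium of 5.9 mmol/L is within the normal range.",
    hallucination_type=HallucinationType.FACTUAL_ERROR,
    risk_level=RiskLevel.HIGH,
    annotator_note="Value is above the reference range; misinterpretation could delay treatment.",
)
```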
Figure 1. Overview of medical hallucinations generated by state-of-the-art LLMs.

Please consider citing 📑 our paper if this repository is helpful to your work. Thank you!
@article {Kim2025.02.28.25323115,
author = {Kim, Yubin and Jeong, Hyewon and Chen, Shen and Li, Shuyue Stella and Lu, Mingyu and Alhamoud, Kumail and Mun, Jimin and Grau, Cristina and Jung, Minseok and Gameiro, Rodrigo R and Fan, Lizhou and Park, Eugene and Lin, Tristan and Yoon, Joonsik and Yoon, Wonjin and Sap, Maarten and Tsvetkov, Yulia and Liang, Paul Pu and Xu, Xuhai and Liu, Xin and McDuff, Daniel and Lee, Hyeonhoon and Park, Hae Won and Tulebaev, Samir R and Breazeal, Cynthia},
title = {Medical Hallucination in Foundation Models and Their Impact on Healthcare},
elocation-id = {2025.02.28.25323115},
year = {2025},
doi = {10.1101/2025.02.28.25323115},
publisher = {Cold Spring Harbor Laboratory Press},
abstract = {Foundation Models that are capable of processing and generating multi-modal data have transformed AI{\textquoteright}s role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need for not only technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety. A repository organizing the paper resources, summaries, and additional information is available at https://github.com/mitmedialab/medical_hallucination.},
URL = {https://www.medrxiv.org/content/early/2025/03/03/2025.02.28.25323115},
eprint = {https://www.medrxiv.org/content/early/2025/03/03/2025.02.28.25323115.full.pdf},
journal = {medRxiv}
}