CRAFT-MD: Conversational Reasoning Assessment Framework for Testing in Medicine

This repository contains code and data for Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD). CRAFT-MD is a novel framework designed to assess the conversational reasoning abilities of LLMs through simulated doctor-patient conversations that mimic clinical encounters. It evaluates clinical LLMs using a multi-agent framework comprising four components: a clinical LLM agent, a patient AI agent, a grader AI agent, and medical experts. The clinical LLM agent can be instantiated with any new LLM, allowing seamless evaluation of its clinical capabilities, such as history taking and information synthesis. The other agents play supportive roles in evaluating the clinical LLM agent.

Paper Link:

Dataset

The dataset consists of 2000 questions, each structured as a case vignette followed by four answer choices (see Data Availability). Of these, 1800 were sourced from MedQA-USMLE, encompassing medical conditions commonly encountered in primary and specialist care settings. These questions span 12 medical specialties: Dermatology, Hematology and Oncology, Neurology, Gastroenterology, Pediatrics and Neonatology, Cardiology, Infectious Disease, Obstetrics and Gynecology, Urology and Nephrology, Endocrinology, Rheumatology, and Others. Additional 100 vignettes from an online question back and 100 newly generated vignettes are also included. The case vignettes were written by medical experts and contained details of the age and sex of the patient, current symptoms, medical history of illness and medications, and relevant family history. In some cases, physical exam and laboratory results were also included in the vignette.

Replicating results

Note: You will need to setup an OpenAI API setup, and replace the OpenAI key information in the below script. Jupyter notebooks can be run to replicate the CRAFT-MD framework and other results presented in the manuscript.

System Requirements

All code contained in the repository was tested on python v3.9.17. The conda environment used can be re-created using environment.yml.

All open-source models were tested on Quadro RTX 8000 48gb GPU. The GPT-4 and GPT-3.5 models can be tested on a normal computer.

Citation

Johri, S., Jeong, J., Tran, B. A., Schlessinger, D. I., Wongvibulsin, S., Cai, Z. R., ... & Rajpurkar, P. (2023). Guidelines For Rigorous Evaluation of Clinical LLMs For Conversational Reasoning. medRxiv, 2023-09.

Johri, Shreya, et al. "CRAFT-MD: A Conversational Evaluation Framework for Comprehensive Assessment of Clinical LLMs." AAAI 2024 Spring Symposium on Clinical Foundation Models. 2024.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
data		data
results		results
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
Reproduce CRAFT-MD results.ipynb		Reproduce CRAFT-MD results.ipynb
environment.yml		environment.yml
parallel_craftmd_gpt.py		parallel_craftmd_gpt.py
parallel_craftmd_multimodal.py		parallel_craftmd_multimodal.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

results

results

src

src

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

Reproduce CRAFT-MD results.ipynb

Reproduce CRAFT-MD results.ipynb

environment.yml

environment.yml

parallel_craftmd_gpt.py

parallel_craftmd_gpt.py

parallel_craftmd_multimodal.py

parallel_craftmd_multimodal.py

Repository files navigation

CRAFT-MD: Conversational Reasoning Assessment Framework for Testing in Medicine

Dataset

Replicating results

System Requirements

Citation

About

Releases

Packages

Languages

License

rajpurkarlab/craft-md

Folders and files

Latest commit

History

Repository files navigation

CRAFT-MD: Conversational Reasoning Assessment Framework for Testing in Medicine

Dataset

Replicating results

System Requirements

Citation

About

Resources

License

Stars

Watchers

Forks

Languages