# MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning

---

#### Members
1. Aadarsha Chapagain aadarsha.chapagain@torontomu.ca
2. Colin Lacey aadarsha.chapagain@torontomu.ca



# Introduction:

#### Problem Description:

MediQ studies how to improve LLMs by giving them the ability to ask follow-up questions. LLMs typically answer based on the information they are given in one go (called single-turn Q&A). To better align with real-world conversations, this project explores whether LLMs (GPT-4 and LLaMA-3) could be trained to ask follow-up questions when they do not have enough information.


#### Context of the Problem:
![](https://github.com/aadarshachapagain/Meqiq_bench/blob/main/media/problem_context.png?raw=true)

Vanilla LLMs are trained to answer any question, even with incomplete context or insufficient knowledge. This approach raises concerns about the reliability of the system, especially in domains like medicine and healthcare, where reliability is critically important. Gathering more contextual information before answering questions can significantly improve the system's reliability.

#### Limitations of Other Approaches:

![](https://github.com/aadarshachapagain/Meqiq_bench/blob/main/media/limitation_of_other_approaches.png?raw=true)

The primary limitation across prior approaches is their insufficient support for proactive information-seeking behavior. While some enhance domain-specific accuracy (e.g., via PubMed finetuning) or improve interactive performance using conversational data, most systems still fail to dynamically elicit missing or critical patient information. Additionally, several methods focus on synthetic or simplified datasets, limiting their applicability to real-world, nuanced medical scenarios. Few address abstention, uncertainty modeling, or dynamic questioning, which are crucial for safe and effective medical decision-making.

#### Solution:

![](https://github.com/aadarshachapagain/Meqiq_bench/blob/main/media/solution.png?raw=true)

Asking questions when information is insufficient for diagnosis helps gather vital details for decision-making. The use of two modules—an expert system and a patient system—enables the collection of sufficient information before making decisions. Decisions made with adequate information are inherently more reliable.



# Background

| Reference | Explanation | Dataset/Input | Weakness |
|-----------|-------------|----------------|-----------|
| [1] Singhal et al. (2023a) | Introduced MultiMedQA, a benchmark with multiple-choice and open-ended medical questions from diverse sources. | MultiMedQA | Limited focus on interactive or information-seeking capabilities. |
| [2,3,4,5] Bolton et al. (2022); Yasunaga et al. (2022); Wu et al. (2023a); Singhal et al. (2023a,b) | Finetuned general LLMs on medical knowledge like PubMed to improve domain-specific performance. | PubMed, medical knowledge corpora | Improves accuracy but lacks proactive information-seeking. |
| [6,7] Yunxiang et al. (2023); Han et al. (2023) | Trained models on conversational medical datasets to improve interactive performance. | Conversational medical datasets | Still not inherently proactive in gathering information. |
| [8] Kim et al. (2024) | Improved complex medical QA by enabling dynamic multi-agent collaboration. | Complex medical questions | Still limited in information-seeking design. |
| [9,10] Li et al. (2023); Andukuri et al. (2024) | Used LLMs to elicit information-rich preferences in everyday tasks. | Human preference tasks | Applications to medical domain are limited. |
| [11] Wu et al. (2023b) | Evaluated LLMs using DDXPlus, a synthetic patient dataset. | DDXPlus | Focused on rule-based data; limited real-world complexity. |
| [12] Hu et al. (2024) | Formulated information-seeking as a search problem with reward modeling based on uncertainty. | Medical diagnosis with binary questions | Limited to binary question formats. |
| [13] Johri et al. (2023) | Observed that LLMs fail to elicit complete patient information. | LLM interactions with synthetic patients | Did not address or improve information-seeking ability. |
| [14] Tu et al. (2024) | Proposed a multi-task medical system. | Multiple medical tasks | Does not address the issue of abstention or proactive questioning. |
| [15,16,17,18] Zhou et al. (2024); Wu et al. (2024, 2023c); Deng et al. (2024); Lin et al. (2024) | Explored multi-agent and human-AI collaboration for interactive performance. | Collaborative frameworks | Could benefit from abstention-based info-seeking enhancements. |
| [19] Stella et al. (2024) | Novel question-asking LLMs with Expert and Patient systems. | MedQA and Craft-MD | Datasets are scarce; the Patient system relies on a paid API rather than an open-source model; the evaluation framework uses a multiple-choice format. |


# Methodology

### Patient system
The Patient system simulates a patient in clinical conversations and has access to the entire medical record, including symptoms, history, and relevant details.

Its primary role is to respond factually and relevantly to the Expert’s follow-up questions.

The system is evaluated on:

1. Factuality: Are the responses consistent with the record?

2. Relevance: Do the responses directly answer the Expert’s questions?

Multiple variants were explored (e.g., Direct, Instruct, Fact-Select), with Fact-Select achieving the best performance due to structured, atomic fact usage.
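
As a rough illustration of the Fact-Select idea, the sketch below answers a question by returning only the atomic facts that overlap with it. The keyword-overlap rule, the stopword list, and the function names are assumptions made for this example; they stand in for the LLM-based selection and are not the benchmark's implementation.

```python
import re

# Hypothetical stopword list, used only for this illustration.
STOPWORDS = {"the", "a", "an", "do", "does", "you", "have", "has",
             "any", "is", "are", "of", "for", "patient"}

def tokenize(text):
    """Lowercase a string and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def fact_select_answer(question, atomic_facts):
    """Return the atomic facts that share a content word with the question.

    Keyword overlap is a stand-in for the LLM call that would select facts.
    """
    question_words = tokenize(question) - STOPWORDS
    selected = [fact for fact in atomic_facts
                if question_words & (tokenize(fact) - STOPWORDS)]
    if not selected:
        # Refuse rather than invent information that is not in the record.
        return "The patient cannot answer this question."
    return " ".join(selected)

facts = [
    "The patient is a 45-year-old woman.",
    "The patient has had a fever for three days.",
    "The patient has no history of smoking.",
]
print(fact_select_answer("Do you have a fever?", facts))
print(fact_select_answer("Any allergies?", facts))
```

Restricting answers to facts taken verbatim from the record is what keeps the responses consistent with it, which is the factuality criterion listed above.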

### Expert System
The Expert system simulates a clinical decision-maker who begins with partial information and iteratively asks follow-up questions to make accurate diagnoses.

It decides whether to answer or ask based on its confidence, using various abstention strategies (e.g., binary, scale-based, rationale generation).

Its effectiveness is judged by:

1. Accuracy of the final diagnosis.

2. Efficiency, measured by the number of questions asked.
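
A minimal sketch of how these two measures could be computed from per-patient results. The record keys (`predicted`, `correct`, `num_questions`) are hypothetical names chosen for this example, not the benchmark's output schema.

```python
def summarize(results):
    """Compute accuracy and the average number of questions asked.

    `results` is a list of dicts with the hypothetical keys
    'predicted', 'correct', and 'num_questions'.
    """
    n = len(results)
    if n == 0:
        return {"accuracy": 0.0, "avg_questions": 0.0}
    hits = sum(r["predicted"] == r["correct"] for r in results)
    total_questions = sum(r["num_questions"] for r in results)
    return {"accuracy": hits / n, "avg_questions": total_questions / n}

results = [
    {"predicted": "A", "correct": "A", "num_questions": 3},
    {"predicted": "B", "correct": "C", "num_questions": 5},
    {"predicted": "D", "correct": "D", "num_questions": 1},
    {"predicted": "A", "correct": "A", "num_questions": 3},
]
summary = summarize(results)
print(summary)
```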

A full reasoning pipeline includes initial assessment, abstention, question generation, information integration, and final decision-making.




### MediQ Benchmark Architecture


![Mediq Benchmark](https://github.com/aadarshachapagain/Meqiq_bench/blob/main/media/mediq_Benchmark.jpg?raw=true)


### Abstention Mechanism

The abstention mechanism in the MEDIQ framework is a critical component of the Expert System, designed to decide when the model should withhold a final answer and instead ask a follow-up question to gather more information. This approach helps avoid premature or incorrect diagnoses in situations of uncertainty—particularly vital in high-stakes clinical settings.

### What is Abstention
Abstention allows the Expert system to pause decision-making when it lacks confidence and proactively seek additional information from the Patient system. This helps reduce hallucinated or overconfident answers.

### How Abstention Works (5-Step Expert Pipeline Context)
Abstention is the second step of the 5-step Expert pipeline:

1. Initial Assessment: Understand symptoms/options and identify missing info.
2. Abstention Module:  Estimate confidence; decide to answer or ask.
3. Question Generation: Generate a targeted follow-up question if abstaining.
4. Information Integration:  Update internal knowledge with new responses.
5. Decision Making:  Make a final diagnosis when confident.
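
The five steps can be sketched as a simple loop. Every function name and the toy callbacks below are assumptions made for illustration; in the benchmark each step would be backed by an LLM call.

```python
def run_interaction(initial_info, ask_patient, is_confident,
                    make_question, decide, max_questions=10):
    """Run the ask-or-answer loop and return (final_choice, questions_asked)."""
    known = [initial_info]                 # Step 1: start from partial information
    questions = []
    while len(questions) < max_questions:
        if is_confident(known):            # Step 2: abstention decision
            break
        question = make_question(known)    # Step 3: generate a follow-up question
        answer = ask_patient(question)
        questions.append(question)
        known.append(answer)               # Step 4: integrate the new information
    return decide(known), questions        # Step 5: final decision

# Toy stand-ins for the Patient and Expert components (hypothetical).
record = {"fever": "Yes, for three days.", "cough": "No cough."}
topics = ["fever", "cough", "rash"]

def ask_patient(topic):
    return record.get(topic, "The patient cannot answer this question.")

def is_confident(known):
    return len(known) >= 3                 # confident after two answers

def make_question(known):
    return topics[len(known) - 1]          # ask about the next topic in order

def decide(known):
    return "A" if "Yes, for three days." in known else "B"

choice, asked = run_interaction("45-year-old woman", ask_patient,
                                is_confident, make_question, decide)
print(choice, asked)
```

The `max_questions` cap forces a final decision even if the confidence check never passes, so the loop always terminates.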

### Variants of Abstention Strategies
1. BASIC (Implicit)<br>
  a. The model decides implicitly: if it is confident, it answers; if not, it asks a question.<br>
  b. No explicit confidence signal is produced.
2. Numerical<br>
  a. The model generates a confidence score (0.0 to 1.0).<br>
  b. A fixed threshold (e.g., 0.7) determines whether to proceed or abstain.<br>
  Limitation: LLMs struggle with calibrated numerical confidence.
3. Binary<br>
  a. The model is asked directly: "Are you confident?"<br>
  b. It answers YES (proceed) or NO (ask a question).<br>
  c. Simpler and more robust than numerical.
4. Scale<br>
  Confidence is expressed on a 5-point Likert scale:<br>
  a. Very Confident<br>
  b. Somewhat Confident<br>
  c. Neutral<br>
  d. Somewhat Unconfident<br>
  e. Very Unconfident<br>
  A threshold (e.g., ≥ “Somewhat Confident”) is used to proceed.

5. Rationale Generation (RG)<br>
  a. The model generates reasoning to support its confidence judgment.<br>
  b. Helps make more informed and calibrated decisions.

6. Self-Consistency (SC)<br>
  The model is prompted multiple times (e.g., 3 or 5) and the outputs are:<br>
  a. Averaged (Numerical/Scale)<br>
  b. Majority-voted (Binary)<br>
  This reduces variance and improves the reliability of abstention.
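
The decision rules above can be sketched as small functions that take a confidence signal the model has already produced and return `True` to proceed or `False` to abstain and ask. The function names and the numeric Likert mapping are assumptions for this illustration; the thresholds are the example values given above.

```python
from statistics import mean

# Hypothetical numeric mapping of the 5-point Likert scale.
LIKERT = {
    "Very Unconfident": 1,
    "Somewhat Unconfident": 2,
    "Neutral": 3,
    "Somewhat Confident": 4,
    "Very Confident": 5,
}

def numerical_proceed(score, threshold=0.7):
    """Numerical: proceed if the confidence score reaches the threshold."""
    return score >= threshold

def binary_proceed(reply):
    """Binary: proceed only if the model answered YES."""
    return reply.strip().upper() == "YES"

def scale_proceed(label, threshold="Somewhat Confident"):
    """Scale: proceed if the Likert label is at or above the threshold label."""
    return LIKERT[label] >= LIKERT[threshold]

def sc_numerical_proceed(scores, threshold=0.7):
    """Self-consistency for numerical scores: average, then threshold."""
    return mean(scores) >= threshold

def sc_binary_proceed(replies):
    """Self-consistency for binary replies: strict majority vote."""
    votes = [binary_proceed(reply) for reply in replies]
    return sum(votes) > len(votes) / 2

print(numerical_proceed(0.8), numerical_proceed(0.5))
print(scale_proceed("Neutral"), scale_proceed("Very Confident"))
print(sc_numerical_proceed([0.9, 0.6, 0.7]))
print(sc_binary_proceed(["YES", "NO", "yes"]))
```

Sampling an odd number of times, as in the examples of 3 or 5 above, avoids ties in the majority vote.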







# Implementation

To run the MediQ benchmark, save the following package of Python files in the same directory as this notebook. The package includes:

### Required Files:
* requirements.txt : lists all the Python packages required to run the MediQ benchmark locally.
* keys.py : stores API keys.
* args.py : defines configuration options such as model names, number of patients, maximum number of questions, logging, and data paths. Used by mediQ_benchmark.py to control benchmark behavior.
* mediQ_benchmark.py : loads data, initializes the expert and patient classes, iterates through interactions, records results, and calculates metrics such as accuracy and average turns.
* expert.py : contains the base Expert class and subclasses such as FixedExpert, RandomExpert, and NumericalExpert, which implement different decision-making strategies for the benchmark.
* expert_functions.py : implements the decision-making logic for experts. Contains functions such as fixed_abstention_decision and binary_abstention_decision, which experts use to determine whether to ask a question or make a decision.
* expert_basics.py : provides basic building blocks for expert decision-making and handles lower-level logic such as model interaction and message formatting. Used by expert_functions.py.
* patient.py : simulates different patient types such as RandomPatient and returns answers to the questions asked by the expert. Used in each interaction loop.
* helper.py : contains utility functions for interacting with LLMs and handling their responses.
* prompts.py : structures the format of questions, messages, and instructions used in interactions with models.
* evaluate.py : computes performance metrics and evaluation scores.
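
The run command passes a module name and a class name separately (for example `--expert_module expert --expert_class FixedExpert`), which suggests the classes are resolved by name at runtime. The sketch below shows one common way to do that with `importlib`; `load_class` is a hypothetical helper, and this is an assumption about the mechanism rather than the code in mediQ_benchmark.py.

```python
import importlib

def load_class(module_name, class_name):
    """Import a module by name and return the named class from it."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# Demonstrated with a standard-library class so the snippet runs
# without the benchmark files present.
ordered_dict_cls = load_class("collections", "OrderedDict")
print(ordered_dict_cls.__name__)
```

With the benchmark files on the import path, the analogous call would be `load_class("expert", "FixedExpert")`.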

### Data Directory
* all_dev_good.jsonl : the main dev set used for running evaluation. Each line contains a single case (e.g., a simulated patient interaction) with a medical question, context, multiple-choice options, and correct answers.
* all_craft_md.jsonl : an alternate test set that includes similarly structured medical QA cases, potentially from a different domain.
* Both all_dev_good.jsonl and all_craft_md.jsonl are read line by line during benchmarking and passed into the Patient and Expert classes for simulation.
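
A minimal sketch of reading a JSONL file line by line with an optional cap on the number of cases. The function name `load_cases` and the sample field names (`id`, `question`, `options`, `answer`) are hypothetical; the actual schema of the data files may differ.

```python
import json
import os
import tempfile

def load_cases(path, num_patients=None):
    """Read one JSON object per line, stopping after num_patients cases."""
    cases = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue                      # skip blank lines
            cases.append(json.loads(line))
            if num_patients is not None and len(cases) >= num_patients:
                break
    return cases

# Write a tiny sample file with made-up cases so the snippet is runnable.
sample = [
    {"id": 0, "question": "Most likely diagnosis?", "options": {"A": "x", "B": "y"}, "answer": "A"},
    {"id": 1, "question": "Next best step?", "options": {"A": "x", "B": "y"}, "answer": "B"},
    {"id": 2, "question": "Best treatment?", "options": {"A": "x", "B": "y"}, "answer": "A"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for case in sample:
            f.write(json.dumps(case) + "\n")
    cases = load_cases(path, num_patients=2)

print(len(cases), cases[0]["answer"])
```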

### Generated Output Files
When the MediQ benchmark has run successfully, the following output files will be created:
* out_put.jsonl : logs the complete interaction data for each evaluated patient, including the patient ID, the list of questions asked by the expert model, the answers returned by the patient, and the intermediate and final choices made by the expert.
* log.log : a summary log that includes timestamps, cumulative output, and the running accuracy, timeout rate, and average number of turns.
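
The progress lines in the log follow the pattern `[timestamp] Processed i/N patients | Accuracy: … | Timeout Rate: … | Avg. Turns: …`. Assuming that pattern holds throughout the file, a small parser like the one below could extract the running metrics; the function and field names are chosen for this example.

```python
import re

# Pattern for one progress line; anything that does not match is ignored.
LOG_PATTERN = re.compile(
    r"\[(?P<timestamp>[^\]]+)\] Processed (?P<done>\d+)/(?P<total>\d+) patients"
    r" \| Accuracy: (?P<accuracy>[\d.]+)"
    r" \| Timeout Rate: (?P<timeout_rate>[\d.]+)"
    r" \| Avg\. Turns: (?P<avg_turns>[\d.]+)"
)

def parse_progress_line(line):
    """Return the metrics in a progress line as a dict, or None if no match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "timestamp": match["timestamp"],
        "done": int(match["done"]),
        "total": int(match["total"]),
        "accuracy": float(match["accuracy"]),
        "timeout_rate": float(match["timeout_rate"]),
        "avg_turns": float(match["avg_turns"]),
    }

line = ("[2025-04-13 22:47:21] Processed 4/1272 patients | Accuracy: 0.75 "
        "| Timeout Rate: 1.0 | Avg. Turns: 21.0")
parsed = parse_progress_line(line)
print(parsed)
```

Lines that are incomplete or in a different format, such as the `[LOG]` debug lines, return `None` and can be skipped.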

In [None]:
# Setting up the environment
!pip install -r requirements.txt

Collecting accelerate (from -r requirements.txt (line 1))
  Using cached accelerate-1.6.0-py3-none-any.whl.metadata (19 kB)
Collecting Pyarrow (from -r requirements.txt (line 79))
  Using cached pyarrow-19.0.1-cp310-cp310-macosx_12_0_arm64.whl.metadata (3.3 kB)
Collecting vllm (from -r requirements.txt (line 111))
  Using cached vllm-0.8.3-cp310-cp310-macosx_15_0_arm64.whl
Collecting xxhash (from -r requirements.txt (line 118))
  Using cached xxhash-3.5.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (12 kB)
Collecting transformers (from -r requirements.txt (line 105))
  Using cached transformers-4.51.2-py3-none-any.whl.metadata (38 kB)
Collecting huggingface-hub (from -r requirements.txt (line 35))
  Using cached huggingface_hub-0.30.2-py3-none-any.whl.metadata (13 kB)
Collecting llguidance<0.8.0,>=0.7.9 (from vllm->-r requirements.txt (line 111))
  Using cached llguidance-0.7.14-cp39-abi3-macosx_11_0_arm64.whl.metadata (9.2 kB)
Collecting outlines (from -r requirements.txt (line 56))
  

In [None]:
# Run the benchmark script. Before running, set the number of questions to ask (maximum of 30) and the number of patients to include (maximum of 1272).
# Update output_filename as needed.
# Note: this project uses a dedicated endpoint through Together.ai to access the Llama 3.1 model.
# If you change the model, review keys.py, args.py and helper.py for any required changes.

!python mediQ_benchmark.py \
  --expert_module expert \
  --expert_class FixedExpert \
  --patient_module patient \
  --patient_class RandomPatient \
  --data_dir ../data \
  --dev_filename all_dev_good.jsonl \
  --output_filename out_20q_20p.jsonl \
  --max_questions 20 \
  --num_patients 20 \
  --use_api openai \
  --api_account mediQ \
  --expert_model cdlacey/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo-5307b20c \
  --patient_model cdlacey/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo-5307b20c


[LOG] ++++++++++++++++++++ Start of Fixed Abstention [expert_functions.py:fixed_abstention_decision()] ++++++++++++++++++++
[LOG] [ABSTENTION RESPONSE]: True

[2025-04-13 22:46:54] Processed 1/1272 patients | Accuracy: 1.0 | Timeout Rate: 1.0 | Avg. Turns: 21.0
[2025-04-13 22:47:04] Processed 2/1272 patients | Accuracy: 1.0 | Timeout Rate: 1.0 | Avg. Turns: 21.0
[2025-04-13 22:47:12] Processed 3/1272 patients | Accuracy: 0.6666666666666666 | Timeout Rate: 1.0 | Avg. Turns: 21.0
[2025-04-13 22:47:21] Processed 4/1272 patients | Accuracy: 0.75 | Timeout Rate: 1.0 | Avg. Turns: 21.0
[2025-04-13 22:47:29] Processed 5/1272 patients | Accuracy: 0.8 | Timeout Rate: 1.0 | Avg. Turns: 21.0
[2025-04-13 22:47:37] Processed 6/1272 patients | Accuracy: 0.8333333333333334 | Timeout Rate: 1.0 | Avg. Turns: 21.0
[2025-04-13 22:47:47] Processed 7/1272 patients | Accuracy: 0.8571428571428571 | Timeout Rate: 1.0 | Avg. Turns: 21.0
[2025-04-13 22:47:55] Processed 8/1272 patients | Accuracy: 0.75 | Timeout

# Conclusion and Future Direction

### Conclusion
The model struggled with sensitive and restrictive topics, such as those involving absorption-related queries, indicating a gap in nuanced medical understanding.

Using multiple-choice answers limited the model's ability to engage in open-ended conversations with patients, which is crucial for effective information gathering.

The training datasets were limited in scope, making it challenging to generalize across the vast and diverse medical diagnosis domain.

### Future Direction
Future work should explore richer, open-ended data formats and incorporate abstention-based reasoning to better handle uncertain or incomplete patient information.

Expanding the dataset and leveraging instruction-tuned, medically-aligned models (e.g., Meta-Llama-3.1) could significantly improve adaptability and reliability in real-world medical settings.

# References:


[1]: Singhal, K., et al., "Large language models encode clinical knowledge," Nature, 2023, pp. 172–180.

[2]: Bolton, E., et al., "BioMedLM: A Domain-Specific Large Language Model for Biomedical Text," arXiv preprint arXiv:2212.09395, 2022.

[3]: Yasunaga, M., et al., "LinkBERT: Pretraining Language Models with Document Links," Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022, pp. 8003–8016.

[4]: Wu, S., et al., "MedAlpaca: An Open-Source Collection of Medical Conversational AI Models and Training Data," arXiv preprint arXiv:2304.08247, 2023.

[5]: Singhal, K., et al., "Towards expert-level medical question answering with large language models," Nature Medicine, 2023, pp. 1–6.

[6]: Yunxiang, L., et al., "ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI LLaMA Using Medical Domain Knowledge," arXiv preprint arXiv:2303.14070, 2023.

[7]: Han, T., et al., "MedAlpaca: An Open-Source Collection of Medical Conversational AI Models and Training Data," arXiv preprint arXiv:2304.08247, 2023.

[8]: Kim, J., et al., "Multi-Agent Collaboration for Complex Medical Question Answering," arXiv preprint arXiv:2402.13470, 2024.

[9]: Li, Y., et al., "Eliciting Human Preferences with Language Models," arXiv preprint arXiv:2310.11589, 2023.

[10]: Andukuri, S., et al., "STaR-GATE: Teaching Language Models to Ask Clarifying Questions," arXiv preprint arXiv:2401.12345, 2024.

[11]: Wu, S., et al., "Evaluating Large Language Models and Chain-of-Thought Reasoning on DDXPlus," arXiv preprint arXiv:2305.10688, 2023.

[12]: Hu, X., et al., "Uncertainty-aware Reward Model for Medical Diagnosis," arXiv preprint arXiv:2410.00847, 2024.

[13]: Johri, P., et al., "Assessing the Information-Seeking Ability of Large Language Models in Medical Contexts," arXiv preprint arXiv:2312.04567, 2023.

[14]: Tu, J., et al., "Uni-Med: A Unified Medical Generalist Foundation Model for Multi-Task Medical Applications," NeurIPS, 2024, pp. 1–12.

[15]: Zhou, J., et al., "Understanding Nonlinear Collaboration between Human and AI Agents: A Co-design Framework for Creative Design," CHI Conference on Human Factors in Computing Systems, 2024, pp. 1–14.

[16]: Wu, S., et al., "PMC-LLaMA: A Language Model for Medical Research and Knowledge Extraction," arXiv preprint arXiv:2306.12345, 2024.

[17]: Deng, L., et al., "Multi-Agent Reinforcement Learning for Real-World Applications," AAMAS, 2024, pp. 1–10.

[18]: Lin, Y., et al., "BiMediX: Bilingual Medical Mixture of Experts LLM," arXiv preprint arXiv:2402.13253, 2024.