# Feedback Generator 

- Draft: Feb 28, 2025; Update Mar 3, 2025.
- Brian Locke MD MSCI

### Framing of Errors

Learners can make two types of errors:

	1.	Failing to gather key information.
	2.	Misinterpreting the meaning of collected data.

In practice these processes —information gathering and reasoning— are interconnected, as reasoning guides the questions asked during an encounter.  However, separating the processes for feedback may help learners pinpoint areas for improvement. For now, feedback is only provided at the end, so we are limited to evaluating both processes based on the completed interview.

### Framing of Feedback 

For information gathering: 

- We want to give positive feedback: which pieces of information did the learner gather correctly?
- We want to highlight areas for improvement
	- For information gathering, this might involve the pieces of information from *this case* that could have 
	- For reasoning, this might involve our algorithm giving an estimate of how the probabilities ought to have been combined, given the information that was collected

### Goals

To guide positive feedback we want: 
- What are the strongest features that increase the odds of CREST?
- What are the strongest features that increase the odds of CREST that are present in this particular case? 

To guide areas for improvement we want: 
- What are the strongest features that increase the odds of CREST (vs. each entry that the learner puts in their differential diagnosis)?
- What are the strongest features that increase the odds of CREST (vs. each entry that the learner puts in their differential diagnosis) that are present in this particular case? 
- Given the information that was gathered what should the ranked differential diagnosis have been? 
- Ideally, we would also want to correct the reasoning that the learner gives (which would require them to explain their reasoning)

In [1]:
# Needed packages
import pandas as pd
import numpy as np
import openai
import os
from dotenv import load_dotenv
import json
from datetime import datetime
from typing import Literal, Dict, List

load_dotenv()  # looks for a .env file in the current dir by default- should contain a line "OPENAI_API_KEY=yourkey"
#print(os.getenv("OPENAI_API_KEY"))

ModelRunType = Literal["4o-mini", "4o", "o3-mini"]
model_run: ModelRunType = "o3-mini"  # Allowed 1 of: '4o-mini', '4o', or 'o3-mini'

only_overall = True # If True, only runs the script for the Subjective/Historical data (not objective, testing, and not subdivided) 

current_date = datetime.today().strftime('%Y-%m-%d')
output_directory = f"{current_date}_{model_run}_feedback_sheets"
if not os.path.exists(output_directory):
    os.makedirs(output_directory)
    print(f"Directory '{output_directory}' created.")

Directory '2025-03-11_o3-mini_feedback_sheets' created.


In [2]:
def clean_response(text):
    """Remove leading/trailing whitespace and code fences if present"""
    text = text.strip()
    if text.startswith("```json"):
        text = text[len("```json"):].strip()
    if text.endswith("```"):
        text = text[:-len("```")].strip()
    return text

def call_openai_api(prompt, model_type):
    """
    Function to call OpenAI API
    Assume calling a model that supports structured response
    """
    client = openai.OpenAI()  # Create a client instance
    system_prompt = "You are a knowledgeable medical reasoning AI- an expert diagnostician. \
                You must follow these rules: \
                1. You identify the strongest clinical findings for or against a given diagnosis. \
                2. Focus on only one category of evidence at a time. \
                3. Provide output in valid JSON with no extra commentary. \
                4. Comply with the user instructions below."

    if model_type == "o3-mini": 
        response = client.chat.completions.create(
        model="o3-mini-2025-01-31",
        messages=[
        {
            "role": "system",
            "content": system_prompt
        }, 
        {
            "role": "user",
            "content": prompt
        }], 
        reasoning_effort="high"
    )
    elif model_type == "4o":
        response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[
        {
            "role": "system",
            "content": system_prompt
        }, 
        {
            "role": "user",
            "content": prompt
        }],
        temperature=0
    )
    else: 
        response = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[
        {
            "role": "system",
            "content": system_prompt
        }, 
        {
            "role": "user",
            "content": prompt
        }],
        temperature=0
    )

    # Log the entire raw API response for debugging.
    print("Raw API Response:", response)

    raw_content = response.choices[0].message.content
    content = clean_response(raw_content or "")
    if not content:
        # Fail loudly instead of passing an empty string to json.loads below.
        raise ValueError("Received an empty response")
    try:
        return json.loads(content)
    except json.JSONDecodeError as e:
        print("Error parsing JSON:", e)
        print("Response content:", content)
        raise  # Callers wrap this call in try/except and report the failure.
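
A model may wrap its JSON in a fence with or without a language tag, whereas `clean_response` above only strips an opening fence tagged `json`. The following is a minimal sketch of a more tolerant variant; the name `strip_code_fences` is hypothetical and not part of the notebook.

```python
import json

# Three backticks, built indirectly so this snippet nests safely inside markdown.
FENCE = "`" * 3


def strip_code_fences(text: str) -> str:
    """Hypothetical variant of clean_response: drops an opening fence line
    (with or without a language tag) and a closing fence, if present."""
    text = text.strip()
    if text.startswith(FENCE):
        # Discard everything up to and including the first newline (the fence line).
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else ""
    if text.endswith(FENCE):
        text = text[: -len(FENCE)]
    return text.strip()


fenced_reply = f'{FENCE}json\n{{"summary": "ok"}}\n{FENCE}'
bare_reply = f'{FENCE}\n{{"summary": "ok"}}\n{FENCE}'
plain_reply = '{"summary": "ok"}'

for reply in (fenced_reply, bare_reply, plain_reply):
    print(json.loads(strip_code_fences(reply)))
```

All three replies should parse to the same dictionary, so the JSON parsing step does not depend on how the model chose to fence its output.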

## Information Gathering Feedback

#### Two different types of likelihood ratios

The likelihood ratio represents the relative information in favor of one hypothesis and against another. 

In the usual positive and negative likelihood ratios, the two hypotheses are: 
- H0 = Disease A is present
- H1 = Disease A is not present

Therefore, the LR summarizes the evidence for the disease and against *everything else*.

However, if you are reasoning about a constrained set of diagnoses (say, Disease A and Disease B) and you know that exactly one of the two explains the presentation, then the information can be summarized by what has been called a "differential" likelihood ratio: 
- H0 = Disease A is present (and Disease B is not present) 
- H1 = Disease B is present (and Disease A is not present) 

The LR_differential then summarizes how strongly a piece of evidence supports Disease A over Disease B 
(from https://academic.oup.com/book/31795/chapter/266181309?login=false).

Ample data exists on overall likelihood ratios (however, it should be noted that for an estimate of an overall LR to apply, both the spectrum of Disease A and the spectrum of not disease A cases must be similar - see https://pmc.ncbi.nlm.nih.gov/articles/PMC4916916/ )

Thus, 
- LR_overall might be most important in summarizing which pieces of information are helpful in supporting a particular diagnosis
- LR_differential might be particularly relevant when 
	- either deliberating between a few remaining entries on the differential (e.g. reasoning by elimination)
	- understanding why a particular proposed diagnosis is incorrect. 
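
To make the distinction concrete, here is a small worked sketch. The frequencies are hypothetical numbers chosen for illustration (they are not taken from the literature), and the helper names are assumptions rather than part of this notebook. It computes an overall LR (A vs. not A) and a differential LR (A vs. B) for the same finding, then applies the differential LR to prior odds.

```python
def likelihood_ratio(p_finding_given_h0: float, p_finding_given_h1: float) -> float:
    """LR = P(finding | H0) / P(finding | H1)."""
    return p_finding_given_h0 / p_finding_given_h1


def update_probability(prior: float, lr: float) -> float:
    """Convert a prior probability to odds, multiply by the LR, convert back."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * lr
    return posterior_odds / (1 + posterior_odds)


# Hypothetical frequencies of one finding (illustrative only).
p_given_a = 0.80      # P(finding | Disease A)
p_given_not_a = 0.10  # P(finding | not Disease A), averaged over all alternatives
p_given_b = 0.40      # P(finding | Disease B)

lr_overall = likelihood_ratio(p_given_a, p_given_not_a)
lr_differential = likelihood_ratio(p_given_a, p_given_b)

# If only A and B remain and are judged equally likely beforehand (prior 0.5 for A),
# the differential LR is the one that applies.
posterior_a_vs_b = update_probability(0.5, lr_differential)

print(f"LR_overall = {lr_overall:.1f}")
print(f"LR_differential = {lr_differential:.1f}")
print(f"P(A | finding, A or B only) = {posterior_a_vs_b:.2f}")
```

In this made-up example the finding is strong evidence for A against everything else, but only modest evidence for A over B, because B also produces the finding fairly often.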

In [3]:
# Globals
# TODO: for production, extract these automatically from transcripts/directory
correct_diagnosis = "CREST syndrome with Type 2 Achalasia"

# The list of alternative diagnoses for the LLM to evaluate evidence against
differential_diagnoses = [
    "CREST syndrome with Type 2 Achalasia",
    "Scleroderma",
    "Esophageal stricture",
    "Food impaction",
    "Mixed Connective Tissue Disease",
    "Achalasia (type 1)",
    "Eosinophilic esophagitis",
    "Acute Coronary Syndrome",
    "Stable Angina",
    "Esophageal spasm a.k.a Jackhammer Esophagus",
    "Esophageal adenocarcinoma",
    "Esophageal squamous cell carcinoma",
    "Musculoskeletal chest pain",
    "GERD",
    "Zenker's diverticulum",
    "Polymyositis",
    "Dermatomyositis",
    "Chagas Disease",
    "Anxiety",
    "Arrhythmia",
    "Aortic dissection",
    "Pericarditis",
    "Extrinsic compressing mass on esophagus",
    "Rheumatoid Arthritis",
    "Sarcoidosis",
    "Pill esophagitis",
    "Myasthenia gravis",
    "Gastroparesis" 
]

CategoryKey = Literal["hpi", "hist", "soc", "obj", "test", "subjective-and-historical"]

categories_of_info_dict: Dict[CategoryKey, str] = {
    "hpi": "History of Present Illness (the description from the onset of symptoms up to and including the patient's present experience)",
    "hist": "Past Medical History, current medications, and Past Surgical History (a description of previously established diagnoses, as well as their treatments like medications and therapeutic procedures)",
    "soc": "Social History, Health Behaviors, and Family History (a description of the social - e.g. does the patient participate in activities that increase risk? - and genetic - e.g. does it run in the family? - context of disease)",
    "obj": "Vitals and Physical Exam (a description of the objective measurements and findings a doctor would expect to see with a thorough physical exam)",
    "test": "Test Results (labs, imaging, procedures, etc.)", 
    "subjective-and-historical": """Any subjective or historical information about the patient’s condition, including:
	1.	History of Present Illness (HPI): The narrative describing the onset of symptoms, their progression, and the patient’s current experience.
	2.	Past Medical History (PMH): Previously diagnosed conditions and treatments (including medications).
	3.	Current Medications: All medications the patient is currently taking.
	4.	Past Surgical History (PSH): Previous surgical interventions and therapeutic procedures.
	5.	Social History & Health Behaviors: Lifestyle factors, social context, and risk-related activities.
	6.	Family History: Genetic predispositions and conditions that run in the family.
	Do not consider objective clinical data such as vital signs, physical exam findings, and test results (e.g., lab values, imaging reports, procedure outcomes).
    """
    }

# The particulars of this case
# Note: a couple of these are relevant to multiple categories, and there are sometimes subtle differences between what each means (e.g. abdominal pain reported vs. abdominal pain on exam)
hpi_details = [
  {"Pain relieved with regurgitation": "present"},
  {"Raynauds phenomenon reported": "present"},
  {"Telangiectasias reported": "present"},
  {"Hand pain out of proportion to other joints": "present"},
  {"Current heartburn": "present"},
  {"Current reflux": "present"},
  {"Long-standing heartburn (duration of years)": "present"},
  {"Long-standing reflux (duration of years)": "present"},
  {"Pain previously better with antacids": "absent"},
  {"Antacids no longer providing relief": "present"},
  {"Difficulty swallowing liquids": "present"},
  {"Difficulty swallowing solids": "present"},
  {"Non-progressive dysphagia: liquids throughout difficulty swallowing": "present"},
  {"Weight loss reported": "present"},
  {"Hoarse voice reported": "absent"},
  {"Cough reported": "absent"},
  {"Globus sensation": "absent"},
  {"Epigastric pain or dyspepsia reported": "absent"},
  {"Shortness of breath": "absent"},
  {"Hand thickness reported": "absent"},
  {"Finger ulcers reported": "absent"},
  {"Weakness reported": "absent"},
  {"Intermittent temporal pattern (not constant) of symptoms": "absent"},
  {"Tightness (character of pain)": "present"},
  {"duration of 3 months of increased frequency of chest pain": "absent"},
  {"duration of 3 months of food getting stuck": "present"},
  {"Onset of chest pain associated with eating food": "present"},
  {"Exertion makes it worse (without clarifying within an hour of eating)": "absent"},
  {"Exertion makes it worse for more than an hour after eating": "absent"},
  {"Pain worse when lying down (positional)": "present"},
  {"Pain when swallowing (aka odynophagia)": "present"},
  {"Bloating with intermittent upper abdominal pain reported": "present"},
  {"Reports pain location is behind sternum, middle of chest": "present"},
  {"diaphoresis": "absent"},
  {"decreased exercise over the last 3 months": "absent"},
  {"onset of symptoms in the last 24 hours (not acute or hyperacute)": "absent"},
  {"Radiation of pain to the back": "absent"},
  {"Nausea and/or vomiting": "absent"},
  {"Early satiety": "absent"},
  {"Dry eye reported": "absent"},
  {"Red eye reported": "absent"},
  {"Neck masses or fullness reported": "absent"},    
  {"Pleuritic character of the pain": "absent"},
  {"Sharp character of the pain": "absent"},
  {"Stabbing character of the pain": "absent"},
  {"Pain is reproducible with arm movements": "absent"},
  {"Spasmodic character of pain": "absent"},
  {"Palpitations": "absent"},
  {"Halitosis reported": "absent"},
  {"Recent injuries reported": "absent"},
  {"Vision changes reported": "absent"},
  {"Multiple symmetric joints hurt": "present"},
  {"Morning stiffness": "absent"},
  {"Joint swelling reported": "absent"},
  {"Enlargement of knuckles, finger deformities, or deviation of fingers reported": "absent"}
]

hist_details = [
    {"Alcohol use disorder": "absent"},
    {"Nicotine dependence": "absent"},
    {"Prior treatment with radiation to the neck, arm, or jaw": "absent"},
    {"Previously diagnosed Coronary Artery Disease": "absent"},
    {"Previously diagnosed Peripheral Artery Disease": "absent"},
    {"Previously diagnosed Hyperlipidemia": "absent"},
    {"prior myocardial infarction": "absent"},
    {"type 2 diabetes": "absent"},
    {"obesity": "absent"},
    {"prior stroke": "absent"},
    {"diagnosed hypertension": "present"},
    {"recent medication changes": "absent"},
    {"takes amlodipine": "present"},
    {"Female": "present"},
    {"middle age": "present"},
    {"Environmental allergies": "absent"},  
    {"Asthma": "absent"},
    {"Eczema": "absent"}
]

soc_details = [
    {"Family history of Rheumatoid Arthritis": "absent"},
    {"Alcohol use": "absent"},
    {"Current tobacco use": "absent"},
    {"Prior tobacco use": "present"},
    {"family history of myocardial infarction in father": "present"},
    {"Recent social stress": "present"},
    {"Recent Travel": "absent"},
    {"Family history of cancer": "absent"},
    {"Recent medical procedure": "absent"},
    {"Gestational complications with prior pregnancy": "absent"}    
]

obj_details = [
  {"Raynauds phenomenon on exam": "absent"},
  {"Telangiectasias on exam": "present"},
  {"Weight loss on vitals": "present"},
  {"Hoarse voice observed": "absent"},
  {"Cough observed": "absent"},
  {"Epigastric pain on palpation": "absent"},
  {"Hand thickening observed": "absent"},
  {"Finger ulcers observed": "absent"},
  {"Weakness on exam": "absent"},
  {"obesity by vital signs": "absent"},
  {"high blood pressure when checked": "absent"},
  {"Red eye observed": "absent"},
  {"Neck masses or fullness observed": "absent"},
  {"Halitosis observed": "absent"},
  {"Joint swelling observed": "absent"},
  {"Enlargement of knuckles, finger deformities, or deviation of fingers": "absent"},
  {"Rheumatoid nodules": "absent"}
]

# Haven't gotten far enough to know which testing results we'd have - made up 5 so that it doesn't error
testing_details = [ 
    {"Hyperlipidemia on lab testing": "present"}, 
    {"ANA strong positive": "absent"},
    {"MBS shows aspiration": "absent"},
    {"CT shows ILD": "absent"}, 
    {"CXR shows widened mediastinum": "absent"}
]

details_dict: Dict[CategoryKey, List[dict]] = {
    "hpi": hpi_details,
    "hist": hist_details,
    "soc": soc_details,
    "obj": obj_details,
    "test": testing_details,
    "subjective-and-historical": hpi_details + hist_details + soc_details
}

In [4]:
def generate_clinical_context_string(cat_key): 
    """Takes the category of info and returns a sentence summarizing which clinical features are present,
    with each feature on its own line preceded by a count-number."""
    lines = [f"Here are the particular findings of the {cat_key} in this case:"]
    count = 1
    for detail in details_dict[cat_key]:
        # Each detail is a dictionary with one key-value pair
        for feature, status in detail.items():
            lines.append(f"{count}. {feature} is {status}.")
            count += 1
    return "\n".join(lines)
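
For reference, the sketch below shows the shape of the string this produces on a tiny input. It is a standalone variant that takes the details list as an argument instead of reading the global `details_dict`; the name `format_findings` and the two-item example list are assumptions for illustration.

```python
from typing import Dict, List


def format_findings(label: str, details: List[Dict[str, str]]) -> str:
    """Standalone variant of generate_clinical_context_string:
    one numbered line per finding, taken from single-pair dictionaries."""
    lines = [f"Here are the particular findings of the {label} in this case:"]
    findings = [
        (feature, status)
        for detail in details
        for feature, status in detail.items()
    ]
    for count, (feature, status) in enumerate(findings, start=1):
        lines.append(f"{count}. {feature} is {status}.")
    return "\n".join(lines)


example_details = [
    {"Difficulty swallowing liquids": "present"},
    {"Hoarse voice reported": "absent"},
]
example_context = format_findings("hpi", example_details)
print(example_context)
```

Passing the list explicitly makes the function easier to test in isolation, at the cost of one more argument at each call site.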

In [5]:
#print(generate_clinical_context_string("subjective-and-historical"))

## LLM Derived Overall Likelihood ratios

for positive feedback

In [6]:
# Set up Output Subdirectories
overall_gen_output_dir = os.path.join(output_directory, "overall_gen")
if not os.path.exists(overall_gen_output_dir):
    os.makedirs(overall_gen_output_dir)
    print(f"Directory '{overall_gen_output_dir}' created.")

overall_spec_output_dir = os.path.join(output_directory, "overall_spec")
if not os.path.exists(overall_spec_output_dir):
    os.makedirs(overall_spec_output_dir)
    print(f"Directory '{overall_spec_output_dir}' created.")

Directory '2025-03-11_o3-mini_feedback_sheets/overall_gen' created.
Directory '2025-03-11_o3-mini_feedback_sheets/overall_spec' created.


#### General Overall LR Estimator

This answers the question: in general, what are the most helpful pieces of evidence in support of a diagnosis? 


Inputs: Each entry from the learner’s differential

Outputs: A list of key evidence that would support each entry in the differential. 

Challenges:
- Learners may list many possible diagnoses, requiring feedback across a broad range.


Approach:
- For each of the provided diagnoses, ask the LLM to come up with the 5 strongest pieces of evidence in support of that diagnosis in each of the following categories:
	- HPI, Context (Medical, Surgical, Medications, Social, Family), Vitals/Exam, and Testing.

In [7]:
# Function to generate structured prompt for general, overall likelihoods
def generate_overall_gen_prompt(diagnosis, cat_key):
  category_of_info = categories_of_info_dict[cat_key]
  return f"""You are given:
- {diagnosis}: the diagnosis in question.
- {cat_key}: the single category of information to consider. 
  Valid categories: [hpi, hist, soc, obj, test, subjective-and-historical].

Definition of {cat_key}:
{category_of_info}

#### Task
1. List the top 5 pieces of information from {cat_key} that most strongly support having {diagnosis}.
2. List the top 5 pieces of information from {cat_key} that most strongly support not having {diagnosis}.

#### Constraints
- Base your reasoning on the likelihood ratio (the likelihood of the finding in patients with {diagnosis} divided by the likelihood of the finding in patients without {diagnosis}): 
  - Pieces of evidence with higher likelihood ratios (occur with greater frequency in patients with {diagnosis} than in patients without {diagnosis}) are stronger evidence in favor of {diagnosis} than pieces of evidence with lower likelihood ratios.
  - Pieces of evidence with -in particular- a higher specificity have higher likelihood ratios. A higher sensitivity also helps, but less so than specificity. 
  - Pieces of evidence with lower likelihood ratios (meaning, they much more often occur in patients without {diagnosis} than with {diagnosis}) are stronger evidence against {diagnosis} being present. 
  - In particular, a negative result for a test with a higher sensitivity will translate to a lower likelihood ratio, and stronger evidence against {diagnosis}
- Reason using all available sources of information (epidemiology, physiology, trials, etc.) to give your best guess. 
- Provide no numeric LRs, only a relative ranking.
- Return only JSON in the following structure:
    {{
      "for_diagnosis_strongest_evidence": [
        {{
          "finding": "A finding relevant to {diagnosis} from {category_of_info}",
          "explanation": "Why this finding favors {diagnosis}",
          "abbreviation_expansion": {{
            "abbreviation": "Expanded term if an abbreviation is used"
          }}
        }},
        "... (4 more items) ..."
      ],
      "against_diagnosis_strongest_evidence": [
        {{
          "finding": "A finding relevant to not having {diagnosis} from {category_of_info}",
          "explanation": "Why this finding favors that {diagnosis} is not present",
          "abbreviation_expansion": {{}}
        }},
        "... (4 more items) ..."
      ],
      "summary": "A short paragraph describing the overall rationale and key differences."
    }}
- Exactly 5 items under each list, no more, no fewer.
- Define abbreviations in 'abbreviation_expansion' if used; otherwise leave it empty.
- Do not add text outside the JSON. 
- Output must be syntactically valid JSON, with no trailing commas.
"""

In [None]:
# Assume these functions and variables are defined elsewhere:
# generate_overall_gen_prompt(diagnosis, cat_key)
# call_openai_api(prompt, model_run)
# overall_gen_output_dir is a directory where you want to save output
# model_run is defined (e.g., "4o-mini", "4o", or "o3-mini")
# diagnosis is the diagnosis you're generating evidence for
# categories_of_info_dict is a dictionary like:
# {
#    "hpi": "History of Present Illness (the description from the onset of symptoms up to and including the patient's present experience)",
#    "hist": "Medical and Surgical History, including Medications",
#    "soc": "Social History, Health Behaviors, and Family History",
#    "obj": "Vitals and Physical Exam",
#    "test": "Test Results (labs, imaging, procedures, etc.)"
# }

if only_overall:
    details_dict = {"subjective-and-historical": details_dict["subjective-and-historical"]}
    categories_of_info_dict = {
        "subjective-and-historical": categories_of_info_dict["subjective-and-historical"]
    }
    # drops the other categories so that only one ordered list is requested from the LLMs

for diagnosis in differential_diagnoses:
    # Define output file name
    excel_filename = os.path.join(overall_gen_output_dir, f"{diagnosis}_gen_overall.xlsx")
    writer = pd.ExcelWriter(excel_filename, engine="openpyxl")

    # Iterate over each category in the dictionary
    for cat_key, category_description in categories_of_info_dict.items():
        print(f"Overall/Gen Processing {diagnosis} overall ({cat_key})...")
        
        # Generate the prompt for the current category using the overall prompt function
        prompt = generate_overall_gen_prompt(diagnosis, cat_key)
        
        try:
            # Call the LLM API; this function should return a parsed JSON matching the expected schema.
            parsed_response = call_openai_api(prompt, model_run)
            
            # Extract the two lists of evidence from the response
            evidence_for = parsed_response["for_diagnosis_strongest_evidence"]
            evidence_against = parsed_response["against_diagnosis_strongest_evidence"]
            
            # Create rows for a DataFrame: one row per evidence item (5 total)
            rows = []
            for i in range(5):
                finding_for = evidence_for[i]["finding"]
                rationale_for = evidence_for[i]["explanation"]
                finding_against = evidence_against[i]["finding"]
                rationale_against = evidence_against[i]["explanation"]
                rows.append([finding_for, rationale_for, finding_against, rationale_against])
            
            # Create a DataFrame with the four desired columns
            df = pd.DataFrame(rows, columns=[
                f"Supports {diagnosis}",
                f"Rationale for {diagnosis}",
                f"Against {diagnosis}",
                f"Rationale Against {diagnosis}"
            ])
            
            # Write the DataFrame to an Excel sheet named after the category key (limited to 31 chars)
            df.to_excel(writer, sheet_name=cat_key[:31], index=False)
        
        except Exception as e:
            print(f"Error processing {diagnosis} - {category_description}: {e}")

    # Save the Excel file after processing all categories
    writer.close()
    print(f"Saved results to {excel_filename}")

Overall/Gen Processing CREST syndrome with Type 2 Achalasia overall (subjective-and-historical)...
Raw API Response: ChatCompletion(id='chatcmpl-B9v3QTdb0PQHaTYgaeolwzJ4snLsi', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "for_diagnosis_strongest_evidence": [\n    {\n      "finding": "The patient reports recurrent episodes of finger color changes (from white to blue to red) when exposed to cold temperatures, indicating Raynaud phenomenon.",\n      "explanation": "Raynaud phenomenon is one of the most specific and frequently reported features in CREST syndrome, strongly supporting its diagnosis when present.",\n      "abbreviation_expansion": {}\n    },\n    {\n      "finding": "The patient describes a sensation of skin tightness and thickening specifically involving the fingers, suggestive of sclerodactyly.",\n      "explanation": "Sclerodactyly is a hallmark cutaneous manifestation of CREST syndrome and its presence is highl
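
The loop above indexes exactly five items from each evidence list, so a reply with fewer items (or with the `"... (4 more items) ..."` placeholder copied literally) would raise inside the `try` block. A minimal sketch of a check that could run on the parsed reply first is shown below; `validate_evidence_response` is a hypothetical helper, not something defined in this notebook.

```python
from typing import Any, Dict, List

REQUIRED_LISTS = (
    "for_diagnosis_strongest_evidence",
    "against_diagnosis_strongest_evidence",
)
REQUIRED_ITEM_KEYS = ("finding", "explanation")


def validate_evidence_response(parsed: Dict[str, Any], expected_items: int = 5) -> List[str]:
    """Return human-readable problems; an empty list means the reply looks usable."""
    problems: List[str] = []
    for list_key in REQUIRED_LISTS:
        items = parsed.get(list_key)
        if not isinstance(items, list):
            problems.append(f"'{list_key}' is missing or not a list")
            continue
        if len(items) != expected_items:
            problems.append(f"'{list_key}' has {len(items)} items, expected {expected_items}")
        for position, item in enumerate(items, start=1):
            if not isinstance(item, dict):
                problems.append(f"'{list_key}' item {position} is not an object")
                continue
            for item_key in REQUIRED_ITEM_KEYS:
                if not item.get(item_key):
                    problems.append(f"'{list_key}' item {position} lacks '{item_key}'")
    return problems


# Made-up replies for illustration.
good_item = {"finding": "example finding", "explanation": "example rationale"}
good_reply = {key: [dict(good_item) for _ in range(5)] for key in REQUIRED_LISTS}
short_reply = {
    "for_diagnosis_strongest_evidence": [dict(good_item)] * 4,
    "against_diagnosis_strongest_evidence": [dict(good_item)] * 4 + ["... (1 more item) ..."],
}

print(validate_evidence_response(good_reply))
print(validate_evidence_response(short_reply))
```

Reporting every problem at once, rather than stopping at the first, gives a clearer log when deciding whether to retry the request.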

#### Case-Specific, Overall Likelihood Ratios

Answers the question: given the details of this case, which pieces of evidence most strongly support diagnosis A?

In [None]:
# Function to generate structured prompt for case-specific, overall likelihoods
def generate_overall_spec_prompt(diagnosis, cat_key):
  """
    Takes the diagnosis under consideration, and a cat_key and generates a prompt for estimating which likelihood ratios are most important in this particular case.
    It generates the context from the cat_key and the global detail_lists
  """
  category_of_info = categories_of_info_dict[cat_key]
  clinical_context = generate_clinical_context_string(cat_key)
  return f"""You are given:
- {diagnosis}: the diagnosis in question.
- and a list of the key clinical findings from the {cat_key} ({category_of_info}) and whether they were present or not 

#### Task
1. List the top 5 pieces of information from this particular case that most strongly support having {diagnosis}.
2. List the top 5 pieces of information from this particular case that most strongly support not having {diagnosis}.

#### Clinical Context
Here is the clinical context: 
{clinical_context}

#### Constraints
- Only consider the findings mentioned in the clinical context. Assume all other pieces of information are unknown (and thus do not change the likelihood of disease)
- Base your reasoning on the likelihood ratio (the likelihood of the finding in patients with {diagnosis} divided by the likelihood of the finding in patients without {diagnosis}): 
  - Pieces of evidence with higher likelihood ratios (occur with greater frequency in patients with {diagnosis} than in patients without {diagnosis}) are stronger evidence in favor of {diagnosis} than pieces of evidence with lower likelihood ratios.
  - Pieces of evidence with -in particular- a higher specificity have higher likelihood ratios. A higher sensitivity also helps, but less so than specificity. 
  - Pieces of evidence with lower likelihood ratios (meaning, they much more often occur in patients without {diagnosis} than with {diagnosis}) are stronger evidence against {diagnosis} being present. 
  - In particular, a negative result for a test with a higher sensitivity will translate to a lower likelihood ratio, and stronger evidence against {diagnosis}
- Reason using all available sources of information (epidemiology, physiology, trials, etc.) to give your best guess. 
- Provide no numeric LRs, only a relative ranking. Give the strongest piece of evidence (highest LR) first

- Return only JSON in the following structure:
    {{
      "for_diagnosis_strongest_evidence": [
        {{
          "finding": "A finding relevant to {diagnosis} from {category_of_info}",
          "explanation": "Why this finding favors {diagnosis}",
          "abbreviation_expansion": {{
            "abbreviation": "Expanded term if an abbreviation is used"
          }}
        }},
        "... (4 more items) ..."
      ],
      "against_diagnosis_strongest_evidence": [
        {{
          "finding": "A finding relevant to not having {diagnosis} from {category_of_info}",
          "explanation": "Why this finding favors that {diagnosis} is not present",
          "abbreviation_expansion": {{}}
        }},
        "... (4 more items) ..."
      ],
      "summary": "A short paragraph describing the overall rationale and key differences."
    }}

- Exactly 5 items under each list, no more, no fewer.
- If a finding is not mentioned in the clinical context, it should not be given as an answer. 
- Define abbreviations in 'abbreviation_expansion' if used; otherwise leave it empty.
- Do not add text outside the JSON. 
- Output must be syntactically valid JSON, with no trailing commas.
"""

In [None]:
if only_overall:
    details_dict = {"subjective-and-historical": details_dict["subjective-and-historical"]}
    categories_of_info_dict = {
        "subjective-and-historical": categories_of_info_dict["subjective-and-historical"]
    }
    # drops the other categories so that only one ordered list is requested from the LLMs
    
for diagnosis in differential_diagnoses:
    # Define output file name
    excel_filename = os.path.join(overall_spec_output_dir, f"{diagnosis}_spec_overall.xlsx")
    writer = pd.ExcelWriter(excel_filename, engine="openpyxl")

    # Iterate over each category in the dictionary
    for cat_key, category_description in categories_of_info_dict.items():
        print(f"Overall/Spec Processing {diagnosis} overall ({cat_key})...")
        
        # Generate the prompt for the current category using the overall prompt function
        prompt = generate_overall_spec_prompt(diagnosis, cat_key)
        
        try:
            # Call the LLM API; this function should return a parsed JSON matching the expected schema.
            parsed_response = call_openai_api(prompt, model_run)
            
            # Extract the two lists of evidence from the response
            evidence_for = parsed_response["for_diagnosis_strongest_evidence"]
            evidence_against = parsed_response["against_diagnosis_strongest_evidence"]
            
            # Create rows for a DataFrame: one row per evidence item (5 total)
            rows = []
            for i in range(5):
                finding_for = evidence_for[i]["finding"]
                rationale_for = evidence_for[i]["explanation"]
                finding_against = evidence_against[i]["finding"]
                rationale_against = evidence_against[i]["explanation"]
                rows.append([finding_for, rationale_for, finding_against, rationale_against])
            
            # Create a DataFrame with the four desired columns
            df = pd.DataFrame(rows, columns=[
                f"Supports {diagnosis}",
                f"Rationale for {diagnosis}",
                f"Against {diagnosis}",
                f"Rationale Against {diagnosis}"
            ])
            
            # Write the DataFrame to an Excel sheet named after the category key (limited to 31 chars)
            df.to_excel(writer, sheet_name=cat_key[:31], index=False)
        
        except Exception as e:
            print(f"Error processing {diagnosis} - {category_description}: {e}")

    # Save the Excel file after processing all categories
    writer.close()
    print(f"Saved results to {excel_filename}")

## Differential Likelihood Ratios

for corrective feedback on information gathering

In [None]:
# Set up Output Subdirectories
diff_gen_output_dir = os.path.join(output_directory, "diff_gen")
if not os.path.exists(diff_gen_output_dir):
    os.makedirs(diff_gen_output_dir)
    print(f"Directory '{diff_gen_output_dir}' created.")

diff_spec_output_dir = os.path.join(output_directory, "diff_spec")
if not os.path.exists(diff_spec_output_dir):
    os.makedirs(diff_spec_output_dir)
    print(f"Directory '{diff_spec_output_dir}' created.")

#### General, differential likelihood ratio estimators

This answers the question: in general, what are the features that differentiate between Disease A and Disease B? (Not specific to a particular case.)

Inputs: Correct diagnosis, learner’s differential, and transcript.

Outputs: A list of key evidence that differentiates cases where the learner’s differential diagnosis was correct vs. the actual correct diagnosis.

Challenges:
- Learners may list many possible diagnoses, requiring feedback across a broad range.
- Usefulness is measured by differential LR (A vs. B) rather than the usual (overall) LR (A vs. not A).
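
To make the distinction between the two ratios concrete, here is a minimal sketch. The function names and all probabilities are hypothetical placeholders chosen for illustration; they are not clinical estimates and are not used elsewhere in this notebook.

```python
# Illustrative only: the probabilities below are made-up placeholders, not clinical estimates.

def overall_lr(p_finding_given_a: float, p_finding_given_not_a: float) -> float:
    """Usual (overall) likelihood ratio: P(finding | A) / P(finding | not A)."""
    return p_finding_given_a / p_finding_given_not_a


def differential_lr(p_finding_given_a: float, p_finding_given_b: float) -> float:
    """Differential likelihood ratio: P(finding | A) / P(finding | B)."""
    return p_finding_given_a / p_finding_given_b


# Hypothetical finding: present in 60% of A, 50% of B, and 10% of all non-A patients.
p_a, p_b, p_not_a = 0.60, 0.50, 0.10

lr_overall = overall_lr(p_a, p_not_a)  # A vs. "anything that is not A"
lr_diff = differential_lr(p_a, p_b)    # A vs. B specifically

print(f"Overall LR (A vs. not A): {lr_overall:.1f}")
print(f"Differential LR (A vs. B): {lr_diff:.1f}")
```

With these placeholder numbers, the same finding looks strong against "not A" but only weakly separates A from B, which is why the feedback here ranks findings by the differential ratio.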

Approach:
- Collected all DDx from Cory's list; in production, this can be auto-extracted from the transcript.
- Used GPT-4o to identify key discriminating factors in:
	- HPI, Context (Medical, Surgical, Medications, Social, Family), Vitals/Exam, and Testing.
- In production, we’d automate detecting whether the learner asked about these factors and generate feedback:
	- “X was important, and you asked it.”
	- “Y was important, but you didn’t ask it.”
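
The feedback step described above could look roughly like the following sketch. It assumes that a separate (not yet built) detection step supplies the set of findings the learner asked about; `gathering_feedback` and the placeholder finding names are hypothetical.

```python
def gathering_feedback(important_findings, asked_findings):
    """Return one feedback sentence per important finding.

    important_findings: ordered list of finding names (most discriminating first).
    asked_findings: finding names the learner asked about; assumed to come from
        a separate, hypothetical transcript-detection step.
    """
    # Normalize case and whitespace so trivially different spellings still match.
    asked = {f.strip().lower() for f in asked_findings}
    messages = []
    for finding in important_findings:
        if finding.strip().lower() in asked:
            messages.append(f"{finding} was important, and you asked it.")
        else:
            messages.append(f"{finding} was important, but you didn't ask it.")
    return messages


# Placeholder names only; real inputs would come from the Excel outputs and the transcript.
example_messages = gathering_feedback(
    important_findings=["Finding X", "Finding Y"],
    asked_findings=["finding x"],
)
for message in example_messages:
    print(message)
```

Exact string matching is a simplification; in practice the detection step would likely need fuzzy or LLM-based matching between transcript questions and findings.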

In [None]:
# Function to generate structured prompt for general differential likelihoods
def generate_diff_gen_prompt(correct_diagnosis, differential_diagnosis, cat_key):
  category_of_info = categories_of_info_dict[cat_key]
  return f"""
You are asked to identify the clinical findings that most strongly discriminate between cases of the following two diagnoses:
{correct_diagnosis} and {differential_diagnosis}.

{cat_key}: the single category of information to consider. 
Valid categories: [hpi, hist, soc, obj, test].

Definition of {cat_key}:
{category_of_info}

Your responses must be:
- Accurate and valid for research-level work.
- Relevant to each diagnosis's typical presentation, focusing on the **differential** likelihood ratio
  (i.e., how well a finding discriminates {correct_diagnosis} from {differential_diagnosis}).
  - Pieces of evidence with higher differential likelihood ratio (occur with greater frequency in patients with {correct_diagnosis} than in patients with {differential_diagnosis}) are stronger evidence in favor of {correct_diagnosis} than pieces of evidence with lower differential likelihood ratios.
  - Pieces of evidence with lower differential likelihood ratios (meaning they occur much more often in patients with {differential_diagnosis} than in patients with {correct_diagnosis}) are stronger evidence against {correct_diagnosis} being present.
- Strictly formatted in JSON to facilitate downstream parsing.
- Explicit about any abbreviations (with definitions), if used.

### Context and Focus:
- Only consider clinical information from {category_of_info}.
- Emphasize which pieces of information best distinguish {correct_diagnosis} from {differential_diagnosis}.
- You do not need numeric likelihood ratios. Just rank the findings in order of their discriminative power.

### Task:
1. List the top 5 pieces of information (within {cat_key}) that most strongly support {correct_diagnosis} over {differential_diagnosis}.
2. List the top 5 pieces of information (within {cat_key}) that most strongly support {differential_diagnosis} over {correct_diagnosis}.

### Output Format (Strict JSON):
{{
  "diagnosisA_strongest_evidence": [
    {{
      "finding": "Relevant finding favoring {correct_diagnosis}",
      "explanation": "Short reason this finding favors {correct_diagnosis}",
      "abbreviation_expansion": {{
        "abbreviation": "Expanded term if abbreviation is used"
      }}
    }},
    "... (total of 5 items) ..."
  ],
  "diagnosisB_strongest_evidence": [
    {{
      "finding": "Relevant finding favoring {differential_diagnosis}",
      "explanation": "Short reason this finding favors {differential_diagnosis}",
      "abbreviation_expansion": {{}}
    }},
    "... (total of 5 items) ..."
  ],
  "summary": "Short paragraph describing overall rationale."
}}

### Additional Constraints:
- Each list (diagnosisA_strongest_evidence, diagnosisB_strongest_evidence) must contain exactly 5 items.
- If abbreviations are used (e.g., ACS, GERD), define them in 'abbreviation_expansion'. Otherwise, use an empty object.
- Provide no extra commentary outside of the JSON.
- Return **only** the JSON in your final answer.
""".strip()

In [None]:
if only_overall:
    details_dict = {"subjective-and-historical": details_dict["subjective-and-historical"]}
    categories_of_info_dict = {
        "subjective-and-historical": categories_of_info_dict["subjective-and-historical"]
    }
    # Drops the other categories so that only one ordered list is requested from the LLM
    
# Iterate through differential diagnoses
for differential_diagnosis in differential_diagnoses:    
    if differential_diagnosis != correct_diagnosis: 
        # One workbook per comparison, with one sheet per category of information
        excel_filename = os.path.join(diff_gen_output_dir, f"{correct_diagnosis}_vs_{differential_diagnosis}.xlsx")
        writer = pd.ExcelWriter(excel_filename, engine="openpyxl")
        for cat_key, category_description in categories_of_info_dict.items():
            print(f"Diff/Gen Processing {correct_diagnosis} vs {differential_diagnosis} ({cat_key})...")
            
            # Generate prompt for this category
            prompt = generate_diff_gen_prompt(correct_diagnosis, differential_diagnosis, cat_key)

            # Call the API
            try:
                parsed_response = call_openai_api(prompt, model_run)

                # Extract relevant data
                diagnosisA_data = parsed_response["diagnosisA_strongest_evidence"]
                diagnosisB_data = parsed_response["diagnosisB_strongest_evidence"]

                # Create DataFrame
                rows = []
                for i in range(5):
                    findingA = diagnosisA_data[i]["finding"]
                    rationaleA = diagnosisA_data[i]["explanation"]
                    findingB = diagnosisB_data[i]["finding"]
                    rationaleB = diagnosisB_data[i]["explanation"]
                    rows.append([findingA, rationaleA, findingB, rationaleB])

                df = pd.DataFrame(rows, columns=[
                    f"Supports {correct_diagnosis}",
                    f"Rationale {correct_diagnosis}",
                    f"Supports {differential_diagnosis}",
                    f"Rationale {differential_diagnosis}"
                ])

                # Write to corresponding sheet
                df.to_excel(writer, sheet_name=cat_key[:31], index=False)  # Excel sheet names are limited to 31 chars

            except Exception as e:
                print(f"Error processing {differential_diagnosis} - {cat_key}: {e}")

        # Save Excel file after all category sheets are added
        writer.close()
        print(f"Saved results to {excel_filename}")
print("Diff/Gen processing finished; check the log above for any errors.")
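
The loop above assumes each list in the response contains exactly five well-formed items; anything else raises inside the `try` block and the whole sheet is skipped. A minimal sketch of a more tolerant row builder follows, assuming the response has the JSON shape requested in the prompt. `evidence_rows` is a hypothetical helper that is not part of this notebook.

```python
from itertools import zip_longest


def evidence_rows(parsed_response):
    """Pair up evidence for diagnosis A and B, tolerating short or malformed lists."""

    def clean(items):
        # Keep only dict entries; this drops stray strings such as a copied
        # "... (total of 5 items) ..." placeholder from the prompt template.
        return [item for item in (items or []) if isinstance(item, dict)]

    list_a = clean(parsed_response.get("diagnosisA_strongest_evidence"))
    list_b = clean(parsed_response.get("diagnosisB_strongest_evidence"))

    rows = []
    # zip_longest pads the shorter list with empty dicts instead of raising IndexError.
    for a, b in zip_longest(list_a, list_b, fillvalue={}):
        rows.append([
            a.get("finding", ""), a.get("explanation", ""),
            b.get("finding", ""), b.get("explanation", ""),
        ])
    return rows


# Placeholder response with a short, partly malformed second list.
demo_response = {
    "diagnosisA_strongest_evidence": [
        {"finding": "A1", "explanation": "why A1"},
        {"finding": "A2", "explanation": "why A2"},
    ],
    "diagnosisB_strongest_evidence": [
        {"finding": "B1", "explanation": "why B1"},
        "... (total of 5 items) ...",
    ],
}
demo_rows = evidence_rows(demo_response)
for row in demo_rows:
    print(row)
```

The returned rows have the same four-column layout as the `rows` list above, so they could be passed to `pd.DataFrame` with the same column names.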

#### Case-specific, differential likelihood ratio estimators

This answers the question: in this particular case (with features x, y, z), which features differentiate this case of Disease A from a hypothetical case of Disease B?

In [None]:
# Function to generate structured prompt for case-specific differential likelihoods
def generate_diff_spec_prompt(correct_diagnosis, differential_diagnosis, cat_key):
  """
    - Takes the correct diagnosis, another diagnosis that a learner might have suggested, and a cat_key.
    - Generates a prompt for estimating which differential likelihood ratios are most important in this particular case.
    - The prompt asks only for findings that appear in the case.
    - The clinical context is generated from the cat_key and the global detail lists.
  """
  category_of_info = categories_of_info_dict[cat_key]
  clinical_context = generate_clinical_context_string(cat_key)
  return f"""
You are asked to identify the clinical findings from a particular case that most strongly discriminate between cases of the following two diagnoses:
{correct_diagnosis} and {differential_diagnosis}.

You are given:
- the two diagnoses under consideration: {correct_diagnosis} and {differential_diagnosis}
- and a list of the key clinical findings from the {cat_key} ({category_of_info}) and whether they were present or not 
- only findings given in the clinical context are under consideration

#### Task
Your task is to identify the pieces of evidence from the clinical context that argue most strongly for one of the two diagnoses: 
1. List the top 5 pieces of information (from the clinical context) that most strongly support {correct_diagnosis} over {differential_diagnosis}.
2. List the top 5 pieces of information (from the clinical context) that most strongly support {differential_diagnosis} over {correct_diagnosis}.

#### Clinical Context
Here is the clinical context: 
{clinical_context}

#### Constraints
- Only consider the findings mentioned in the clinical context. Assume all other pieces of information are unknown (and thus do not change the likelihood of disease)
- Base your reasoning on each diagnosis' usual presentation, focusing on the **differential** likelihood ratio
  (i.e., how well a finding discriminates {correct_diagnosis} from {differential_diagnosis}).
  - Pieces of evidence with higher differential likelihood ratio (occur with greater frequency in patients with {correct_diagnosis} than in patients with {differential_diagnosis}) are stronger evidence in favor of {correct_diagnosis} than pieces of evidence with lower differential likelihood ratios.
  - Pieces of evidence with lower differential likelihood ratios (meaning they occur much more often in patients with {differential_diagnosis} than in patients with {correct_diagnosis}) are stronger evidence against {correct_diagnosis} being present.
- Reason using all available sources of information (epidemiology, physiology, trials, etc.) to give your best guess. 
- Provide no numeric LRs, only a relative ranking. Give the strongest piece of evidence (highest LR) first
- Emphasize which pieces of information best distinguish {correct_diagnosis} from {differential_diagnosis}.

### Output Format (Strict JSON):
{{
  "diagnosisA_strongest_evidence": [
    {{
      "finding": "Relevant finding favoring {correct_diagnosis}",
      "explanation": "Short reason this finding favors {correct_diagnosis}",
      "abbreviation_expansion": {{
        "abbreviation": "Expanded term if abbreviation is used"
      }}
    }},
    "... (total of 5 items) ..."
  ],
  "diagnosisB_strongest_evidence": [
    {{
      "finding": "Relevant finding favoring {differential_diagnosis}",
      "explanation": "Short reason this finding favors {differential_diagnosis}",
      "abbreviation_expansion": {{}}
    }},
    "... (total of 5 items) ..."
  ],
  "summary": "Short paragraph describing overall rationale."
}}

- Exactly 5 items under each list, no more, no fewer.
- If a finding is not mentioned in the clinical context, it should not be given as an answer. 
- Define abbreviations in 'abbreviation_expansion' if used; otherwise leave it empty.
- Do not add text outside the JSON. 
- Output must be syntactically valid JSON, with no trailing commas.
""".strip()

In [None]:
if only_overall:
    details_dict = {"subjective-and-historical": details_dict["subjective-and-historical"]}
    categories_of_info_dict = {
        "subjective-and-historical": categories_of_info_dict["subjective-and-historical"]
    }
    # Drops the other categories so that only one ordered list is requested from the LLM

# Iterate through differential diagnoses
for differential_diagnosis in differential_diagnoses:
    if differential_diagnosis != correct_diagnosis: 
        # One workbook per comparison, with one sheet per category of information
        excel_filename = os.path.join(diff_spec_output_dir, f"{correct_diagnosis}_vs_{differential_diagnosis}.xlsx")
        writer = pd.ExcelWriter(excel_filename, engine="openpyxl")
        for cat_key, category_description in categories_of_info_dict.items():
            print(f"Diff/Spec: Processing {correct_diagnosis} vs {differential_diagnosis} ({cat_key})...")
            
            # Generate prompt for this category
            prompt = generate_diff_spec_prompt(correct_diagnosis, differential_diagnosis, cat_key)

            # Call the API
            try:
                parsed_response = call_openai_api(prompt, model_run)

                # Extract relevant data
                diagnosisA_data = parsed_response["diagnosisA_strongest_evidence"]
                diagnosisB_data = parsed_response["diagnosisB_strongest_evidence"]

                # Create DataFrame
                rows = []
                for i in range(5):
                    findingA = diagnosisA_data[i]["finding"]
                    rationaleA = diagnosisA_data[i]["explanation"]
                    findingB = diagnosisB_data[i]["finding"]
                    rationaleB = diagnosisB_data[i]["explanation"]
                    rows.append([findingA, rationaleA, findingB, rationaleB])

                df = pd.DataFrame(rows, columns=[
                    f"Supports {correct_diagnosis}",
                    f"Rationale {correct_diagnosis}",
                    f"Supports {differential_diagnosis}",
                    f"Rationale {differential_diagnosis}"
                ])

                # Write to corresponding sheet
                df.to_excel(writer, sheet_name=cat_key[:31], index=False)  # Excel sheet names are limited to 31 chars

            except Exception as e:
                print(f"Error processing {differential_diagnosis} - {cat_key}: {e}")

        # Save Excel file after all category sheets are added
        writer.close()
        print(f"Saved results to {excel_filename}")
print("Diff/Spec processing finished; check the log above for any errors.")

The cell below measures how large each prompt is, to help decide what max_tokens should be set to.

The largest prompt is about 1500 tokens, so max_tokens = 1600-1800 should suffice.

In [None]:
import tiktoken

# Choose an encoding for your model (e.g., for GPT-4 or ChatGPT models)
encoding = tiktoken.encoding_for_model("gpt-4")
#prompt = generate_diff_spec_prompt("CREST syndrome with Type 2 Achalasia", "Esophageal stricture", "hpi")
#prompt = generate_diff_gen_prompt("CREST syndrome with Type 2 Achalasia", "Esophageal stricture", "hpi")
#prompt = generate_overall_gen_prompt("CREST syndrome with Type 2 Achalasia", "hpi")
prompt = generate_overall_spec_prompt("CREST syndrome with Type 2 Achalasia", "hpi")
token_count = len(encoding.encode(prompt))
print("Token count:", token_count)

## Reasoning Feedback

Clinical Reasoning Feedback

Although we know the correct diagnosis for certain (because we made up the case), in reality clinicians never know anything with absolute certainty. What we can know is:

1. A rank-ordering of what diagnoses are most likely, given the information at hand
2. Estimates of how likely each data point is, assuming a given pre-test probability

To provide normative feedback on these (i.e., how the learner *should* reason), we must create a statistical model that will (hopefully) correspond closely to reality.

- Inputs: Differential diagnosis + encounter transcript.
- Output: Likelihood estimates for each diagnosis based on discussed information.
- Challenges:	
	- Limited “Does this patient have X?” data in many contexts
	- Bayesian reasoning depends on assumptions that may not hold (e.g., independence of findings; a patient spectrum similar to the one from which the data were derived).

- Approach: 
	- Extract key information with known likelihood ratios and estimate a few additional important features.
	- Apply multi-class, qualitative Bayesian reasoning to assess likelihood of each diagnosis (https://mybinder.org/v2/gh/reblocke/notebooks_dx_reasoning/HEAD?urlpath=voila/render/multi_class.ipynb) based on learner-gathered data.
	- In production, compare calculated estimates to actual outcomes.

- Next Steps:
	- Not yet done, but if this approach seems valid, I can apply it to available transcripts.
	- Clinician input needed to assess whether qualitative Bayesian estimates align with clinical judgment (since no reference standard exists).
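
As a rough illustration of the multi-class Bayesian step in the approach above, the sketch below applies a numeric naive-Bayes update over mutually exclusive diagnoses. It is not the method in the linked notebook: it assumes conditional independence of findings (one of the assumptions flagged under Challenges), and `posterior_over_diagnoses`, the diagnosis labels, and every number are hypothetical placeholders.

```python
def posterior_over_diagnoses(priors, likelihoods, observed_findings):
    """Naive-Bayes update over mutually exclusive diagnoses.

    priors: {diagnosis: prior probability}
    likelihoods: {finding: {diagnosis: P(finding present | diagnosis)}}
    observed_findings: {finding: True if present, False if absent}
    Assumes findings are conditionally independent given the diagnosis.
    """
    unnormalized = {}
    for dx, prior in priors.items():
        weight = prior
        for finding, present in observed_findings.items():
            p = likelihoods[finding][dx]
            # Multiply by P(finding | dx) if present, else by P(no finding | dx).
            weight *= p if present else (1.0 - p)
        unnormalized[dx] = weight

    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("All diagnoses have zero probability under the given inputs.")
    return {dx: w / total for dx, w in unnormalized.items()}


# Placeholder numbers for illustration only; not clinical estimates.
priors = {"Diagnosis A": 0.2, "Diagnosis B": 0.5, "Diagnosis C": 0.3}
likelihoods = {
    "finding_1": {"Diagnosis A": 0.8, "Diagnosis B": 0.2, "Diagnosis C": 0.1},
    "finding_2": {"Diagnosis A": 0.5, "Diagnosis B": 0.5, "Diagnosis C": 0.9},
}
observed = {"finding_1": True, "finding_2": False}

posterior = posterior_over_diagnoses(priors, likelihoods, observed)
ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
for dx, p in ranked:
    print(f"{dx}: {p:.2f}")
```

Findings the learner never asked about are simply left out of `observed`, so they do not change the ranking; this matches the idea of scoring the differential only on the information that was gathered.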

In [None]:
# TODO: implement this from other worksheet - will need info from the interviews for this to work (though can test-run on all information)