# Notebook for extracting whether features are present from chatbot transcripts


### How it works: 

This takes: 
- a specification workbook (`ASSESSMENT_TEMPLATE`, e.g. 'assessment_template.xlsx') with an `Information` column listing each piece of information you'd like to assess in the transcript and, optionally, an `LR` column. This file must be in ASSESSMENT_DIR, which is also where the output is written. Each sheet in the template should hold the features pertinent to a single diagnosis (e.g. cardiac chest pain, esophageal dysphagia).
- a folder that contains all the transcripts (TRANSCRIPT_DIR)

The script then extracts the text from each transcript and feeds it to an LLM, asking it to evaluate whether each piece of information was assessed by the doctor and, if so, whether the patient responded that it was present or not.

A second LLM call then reformats that response text into structured data, which is written to a new Excel spreadsheet.

In [83]:
import pandas as pd
import numpy as np
from markitdown import MarkItDown
import llm
from openai import OpenAI
from pydantic import BaseModel
import os
from typing import List, Optional, Literal
from tabulate import tabulate
from IPython.display import display
from dotenv import load_dotenv

load_dotenv()  # looks for a .env file in the current dir by default
#print(os.getenv("OPENAI_API_KEY"))

True

### Setup

In [84]:
# This script will read in all transcripts in this directory to be analyzed. 
TRANSCRIPT_DIR = r'/Users/reblocke/Research/dx_chat_entropy/Chatbot Transcripts/'

# The script will pull in the features from this template (which should be in the assessment dir), 
# and output the resulting assessments here. 
ASSESSMENT_DIR = r'/Users/reblocke/Research/dx_chat_entropy/Assessments/'
ASSESSMENT_TEMPLATE = os.path.join(ASSESSMENT_DIR, r'asssessment_template_new.xlsx')

# Note: you also need an OpenAI API key, which is read from the environment.
# Create a file ".env" in the working folder that contains OPENAI_API_KEY=your-secret-key

In [85]:
# Make array of all transcript file names to be ingested
pdf_filepaths = [
    os.path.join(TRANSCRIPT_DIR, f)
    for f in os.listdir(TRANSCRIPT_DIR)
    if f.endswith(".pdf")
]
print("PDF Filepaths:")
for filepath in pdf_filepaths:
    print(filepath)

PDF Filepaths:
/Users/reblocke/Research/dx_chat_entropy/Chatbot Transcripts/Intermtn MS4 2 Transcript.pdf
/Users/reblocke/Research/dx_chat_entropy/Chatbot Transcripts/transcript FM PGY1.pdf
/Users/reblocke/Research/dx_chat_entropy/Chatbot Transcripts/Intermtn MS4 1 Transcript.pdf
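The listing above matches only a lowercase `.pdf` suffix and returns files in whatever order `os.listdir` yields. A minimal alternative sketch, assuming case-insensitive matching and a stable order are wanted; `list_pdfs` is a hypothetical helper, not part of the notebook:

```python
from pathlib import Path


def list_pdfs(directory):
    """Return PDF paths in `directory`, sorted by file name.

    The suffix check is case-insensitive, so 'scan.PDF' is included.
    """
    return sorted(
        (p for p in Path(directory).iterdir()
         if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name,
    )
```

`[str(p) for p in list_pdfs(TRANSCRIPT_DIR)]` would then give plain string paths for `MarkItDown.convert`.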


In [86]:
# Pipeline for ingesting transcript PDFs
md = MarkItDown()
transcripts = []

for filepath in pdf_filepaths:
    # Convert the PDF to text using MarkItDown
    result = md.convert(filepath)
    extracted_text = result.text_content
    transcripts.append({
        'filename': os.path.basename(filepath),
        'text_content': extracted_text
    })
transcripts_df = pd.DataFrame(transcripts)

display(transcripts_df)
#print(tabulate(transcripts_df, headers = 'keys', tablefmt = 'fancy_grid'))
# If desired, you can save it to a CSV
# transcripts_df.to_csv(os.path.join(TRANSCRIPT_DIR, "transcripts.csv"), index=False)

Unnamed: 0,filename,text_content
0,Intermtn MS4 2 Transcript.pdf,PATIENT DOOR CHART and Learner Instructions\n\...
1,transcript FM PGY1.pdf,Patient Case\n\nPATIENT DOOR CHART and Learner...
2,Intermtn MS4 1 Transcript.pdf,PATIENT DOOR CHART and Learner Instructions\n\...


In [87]:
# Create instruction part of the prompt
instruction_prompt = """You are a research assistant who is meticulously reviewing transcripts of interviews for a research project. 

Task: You will be given a list of pieces of information that a doctor might ask a patient about. 
Your goal is to read the transcript and score it by whether the doctor collected each piece of information. 
To do this, you must read the transcript carefully, understanding what each question and response meant. 
If a question is asked obliquely - but a reasonable person would understand what was intended - this should count.

Return format: for each piece of information, you should answer in the following way - 

a. If the doctor asked about the piece of information, and the patient responded that the feature is present, answer "<Information>, YES". 
For example, if the information is: Chest Pain? and the doctor asked "Do you have chest pain?" and the patient answered "I do", you should respond 'Chest Pain?, YES'
If the information is: Nausea? doctor asked "Did you have nausea?" and the patient answers "I did", you should respond 'Nausea?, YES'.

b. If the doctor asked about the piece of information, and the patient responded that the feature was not present, answer '<Information>, NO' 
For example, if the question is Shortness of Breath? and the doctor asked "Are you dyspneic?" and the patient answers they are not, the answer should be 'Shortness of Breath?,NO'

c. If the doctor did not ask about the piece of information, you should return '<Information>, MISSING'

Warning: There are ONLY three ways you should ever answer for each piece of information: '<Information>, YES', '<Information>, NO', and '<Information>, MISSING'. 
Never answer in other ways. 

Here are some examples:
1.
Information: 'Pain not worse with exertion (requires they clarify exercise 1hr after meal)'
Doctor at some point asks: Does the pain worsen after a meal? 
Patient: yes, it's worse
Response: "'Pain not worse with exertion (requires they clarify exercise 1hr after meal)', YES", because this is close enough for a reasonable person. 

2. 
Information: 'no prior CAD'
Doctor at some point asks: Have you ever had a heart attack? 
Patient: never
Response: "'no prior CAD', NO", because CAD stands for coronary artery disease and a heart attack is the most common manifestation.

3. 
Information: 'no diaphoresis'
Doctor never asks anything that clarifies if the patient was sweaty and never assessed it on examination
Response: "'no diaphoresis', MISSING" , because exam findings that are discussed should also count.  

Putting it all together, the response should follow this format: 
1. Pain not worse with exertion (requires they clarify exercise 1hr after meal), YES
2. "Do you have any PMHx?" (counts as 2 independent minor features), MISSING
3. no tobacco, NO
4. no associated shortness of breath, YES
5. no radiation to the neck, arm, or jaw?, MISSING
... and so on, through the entire list.

Remember, NOTHING ELSE should be in the final output. Just the information, and YES/NO/MISSING 

Here is the list of pieces of information I would like you to look for: """
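Because the prompt asks for a fixed `N. <Information>, YES|NO|MISSING` line format, the raw response can also be parsed deterministically, for instance as a cross-check on the LLM-based reformatting step. A minimal sketch, assuming each answer sits on its own line and the verdict is the last comma-separated token; `parse_response` is a hypothetical helper:

```python
import re

# Optional "12. " prefix, then the information text, then a trailing verdict.
LINE_RE = re.compile(
    r"^\s*(?:\d+\.\s*)?(?P<info>.+?)\s*,\s*(?P<verdict>YES|NO|MISSING)\s*$"
)
VERDICTS = {"YES": True, "NO": False, "MISSING": None}


def parse_response(text):
    """Return a list of (info, answer) pairs; lines that do not match are skipped."""
    pairs = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            pairs.append((match.group("info"), VERDICTS[match.group("verdict")]))
    return pairs
```

Commas inside the information text are tolerated because only the final `, VERDICT` is anchored to the end of the line. Lines the model formats differently are silently dropped, so comparing the number of parsed pairs with the number of template rows is a useful sanity check.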

In [88]:
# Processing of the specification for what assessments we want the LLM to look for

def process_sheet(sheet_data):
    """
    Processes a sheet to extract the 'Information' and associated 'LR' values.
    Ignores the 'Y/N' column.
    Returns a list of tuples (information_str, lr_value).
    """
    # Drop Y/N column if it exists
    sheet_data = sheet_data.drop(columns=["Y/N"], errors="ignore")

    info_list = []

    # Iterate through each row and capture the single 'Information' + 'LR' from that row
    for _, row in sheet_data.iterrows():
        info_val = row.get("Information", None)  # Safely get 'Information' column
        lr_val = row.get("LR", None)             # Safely get 'LR' column

        # If the information cell is not empty/NaN, we record it.
        # If LR is NaN or missing, we'll store it as None.
        if pd.notnull(info_val):
            # Normalize LR to None if it's NaN
            if pd.isnull(lr_val):
                lr_val = None

            info_list.append((info_val, lr_val))

    return info_list

diagnosis_info = {}
with pd.ExcelFile(ASSESSMENT_TEMPLATE) as spreadsheet_data:
    for sheet_name in spreadsheet_data.sheet_names:
        try:
            # Reuse the already-open workbook rather than re-reading the file from disk
            sheet_data = spreadsheet_data.parse(sheet_name)

            if sheet_data.empty:
                print(f"Skipping empty sheet: {sheet_name}")
                continue

            # Process the sheet to get [(info, LR), ...]
            diagnosis_info[sheet_name] = process_sheet(sheet_data)
        except Exception as e:
            print(f"Error processing sheet '{sheet_name}': {e}")

# Print out the collected data
for diagnosis, info_pairs in diagnosis_info.items():
    print(f"Diagnosis: {diagnosis}")
    for info_val, lr_val in info_pairs:
        print(f"  Information: {info_val}, LR: {lr_val}")

# NOTE: may not actually need to bother with LRs at this point? 

Diagnosis: Cardiac
  Information: Do you have any PMHx? (counts as 2 independent minor features), LR: None
  Information: Pain not worse with exertion (requires they clarify exercise 1hr after meal), LR: 0.8
  Information: no tobacco, LR: 0.96
  Information: no associated shortness of breath, LR: 0.89
  Information: no radiation to the neck, arm, or jaw? , LR: 0.9
  Information: positional chest pain (worse when laying down), LR: 3.3333333333333335
  Information: What were you doing when the chest pain started? (eating), LR: None
  Information: Alternative cause of esoph dysphagia becomes obvious(food gets stuck or relieved by regurgitation of food), LR: 0.75
  Information: no prior CAD, LR: 0.75
  Information: no PAD, LR: 0.96
  Information: no HLD, LR: 0.85
  Information: no prior MI, LR: 0.88
  Information: no DM2, LR: 0.9
  Information: no obesity, LR: 0.99
  Information: no history of stroke, LR: 0.97
  Information: no diaphoresis, LR: 0.91
  Information: Pain worse with exertion 

In [80]:
%%time
# ESTIMATE LRS FOR ALL THAT HAVE UNKNOWN LRS
# TODO: Note, in the real workflow - should do this using o1 and only do it once, rather than over and over.

class LRResponse(BaseModel):
    """
    A structured schema ensuring the model returns exactly one of the five LR labels.
    """
    label: Literal["STRONG NEGATIVE", 
                   "WEAK NEGATIVE", 
                   "NEUTRAL", 
                   "WEAK POSITIVE", 
                   "STRONG POSITIVE"]


def estimate_lr(diagnosis, info_val, client):
    """
    Returns one of the five LR categories (STRONG NEGATIVE, WEAK NEGATIVE,
    NEUTRAL, WEAK POSITIVE, STRONG POSITIVE) for a given diagnosis and info_val.
    Uses OpenAI's structured output parsing to ensure the response is valid.
    """

    lr_prompt = """You are an expert diagnostician who is explaining to a trainee which pieces of information they should pay attention to during the diagnostic process. Your task is to summarize how strong the evidence provided by the presence or absence of a particular new finding is for whether a patient has a disease. For example, if a patient has chest pain and the EKG shows ST segment elevations, this is STRONG evidence that the chest pain is due to a heart attack. If the patient has t-wave inversions, this is WEAKER evidence in favor - because t-wave changes are not as specific for cardiac causes of chest pain. If they have known heartburn, this is WEAK evidence against (because it’s an alternative explanation, but it IS possible to have a history of heartburn and still have a heart attack). If they are a young female without an inherited condition, this is STRONG evidence against a cardiac cause because that demographic almost never has heart attacks. Lastly, if the piece of information is unhelpful, it would be called neutral. For example, if the patient has blue eyes, this is irrelevant to the cause of chest pain, thus it would be NEUTRAL. 

    I’d like you to follow the following steps:
        1.	Consider, what does the finding mean about what is going on with the patient?
        2.	does the presence of the new information make the disease more or less likely? Or no difference?
        3.	Does the finding make another cause of the same symptom more common? If so, then by definition it makes the target condition a less likely explanation.
        4.	Once you’ve decided whether the finding makes the diagnosis more or less likely, use the following scale to come up with a response:

        •	If knowing the piece of information makes the odds of the diagnosis more than 1.95x higher than it was before, it is a STRONG POSITIVE finding
        •	If knowing the piece of information makes the odds of the diagnosis 1.18x to 1.95x higher than it was before, it is a WEAK POSITIVE finding
        •	If knowing the piece of information changes the odds of the diagnosis to only 0.92x to 1.18x what it was before, then it is a NEUTRAL finding
        •	If knowing the piece of information makes the odds of the diagnosis 0.72x to 0.92x what it was before, then it is a WEAK NEGATIVE finding
        •	If knowing the piece of information makes the odds of the diagnosis less than 0.72x what it was before, it is a STRONG NEGATIVE finding

    As another example, say I’m wondering whether a patient with GI bleeding has a lower GI bleed (below the ligament of Treitz) or an upper GI bleed. The presence of clots in the blood is a very strong predictor of lower GI bleeding, because bleeding from the stomach cannot form clots due to the stomach acid. You should use all physiologic clues to whether a piece of information is a STRONG, WEAK, or NEUTRAL predictor. 

    You will receive inputs in the following format; Target condition: <Condition, e.g. Cardiac chest pain>. Finding: <piece of information, e.g. ‘No radiation to the neck, arm, or jaw’>.

    You must respond with EXACTLY ONE of the following categories (no extra text):
    STRONG POSITIVE, WEAK POSITIVE, NEUTRAL, WEAK NEGATIVE, STRONG NEGATIVE.

    You must respond with EXACTLY one of these categories in valid JSON
    Your output must match the Pydantic schema: { 'label': '<one of the five strings>' }

    Here are some examples:
    Prompt = Target condition: Cardiac Chest Pain. Finding: Pain not worse with exertion (requires they clarify exercise 1hr after meal).
    You would reason that because cardiac chest pain is usually worse with exertion because exertion worsens cardiac demand for oxygen, and thus worsens ischemia.
    Response = {
        "label": "STRONG NEGATIVE"
    }

    Prompt =  Target condition: Cardiac Chest Pain. Finding: No tobacco.
    You would reason that because being someone who smokes increases your risk of coronary artery disease, and thus being a never smoker means you’re at less risk… but many people who have heart attacks still smoke, so it’s only a weak predictor. 
    Response = {
        "label": "WEAK NEGATIVE"
    }

    Prompt = Target condition: Cardiac Chest Pain. Finding: enjoys playing chess.
    You would reason that because enjoying chess has no relationship to having a heart attack.
    Response = {
        "label": "NEUTRAL"
    }

    Prompt = Target condition: Cardiac Chest Pain. Finding = pain located behind the sternum
    You would reason that because cardiac chest pain is often experienced behind the sternum (thus, more likely), but so are many other causes of chest pain - like GERD.
    Response = {
        "label": "WEAK POSITIVE"
    }

    Prompt = Target condition: Cardiac Chest Pain. Finding: pain worse with exertion.
    You would reason that because the increased myocardial oxygen consumption worsens the pain if oxygen delivery to the myocardium is the cause, as it is in heart attacks.
    Response = {
        "label": "STRONG POSITIVE"
    }

    OK: here’s the prompt…. """
        
    # Create your conversation messages
    messages = [
        {"role": "system", "content": lr_prompt},
        {
            "role": "user",
            # Match the input format described in the system prompt
            "content": f"Target condition: {diagnosis}. Finding: {info_val}"
        }
    ]
    
    # Make the structured call to the model
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=messages,
        response_format=LRResponse,  # Our Pydantic model
    )
    
    # Extract the parsed LRResponse from the completion
    lr_response = completion.choices[0].message.parsed  # This will be an LRResponse instance
    # The label is guaranteed to be one of the enumerated strings by Pydantic
    return lr_response.label


client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
for diagnosis, info_pairs in diagnosis_info.items():
    # info_pairs is a list of (info_val, lr_val) tuples
    for idx, (info_val, lr_val) in enumerate(info_pairs):
        if lr_val is None:  # Missing LR
            estimated_label = estimate_lr(diagnosis, info_val, client)
            # Update the tuple
            info_pairs[idx] = (info_val, estimated_label)

CPU times: user 258 ms, sys: 16.3 ms, total: 275 ms
Wall time: 35.5 s
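The template's known LRs are numeric, while the estimated ones come back as category labels. One way to put both on the same scale is to bin the numeric values with the thresholds quoted in the prompt (0.72, 0.92, 1.18, 1.95). A minimal sketch; how values exactly on a boundary are assigned is an assumption here, since the prompt does not say, and `lr_to_label` is a hypothetical helper:

```python
def lr_to_label(lr):
    """Map a numeric likelihood ratio to one of the five categories used in the prompt."""
    if lr < 0:
        raise ValueError("a likelihood ratio cannot be negative")
    if lr < 0.72:
        return "STRONG NEGATIVE"
    if lr < 0.92:
        return "WEAK NEGATIVE"
    if lr <= 1.18:  # assumption: both neutral boundaries count as NEUTRAL
        return "NEUTRAL"
    if lr <= 1.95:
        return "WEAK POSITIVE"
    return "STRONG POSITIVE"
```

With the template values printed earlier, `lr_to_label(0.8)` falls in the weak-negative band and `lr_to_label(3.3333333333333335)` in the strong-positive band.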



In [89]:
# Iterate through all the pieces of info to make prompts for each disease (LLM called separately)
info_prompts = {}
for diagnosis, info_list in diagnosis_info.items():
    # Create the long string with the specified format
    info_prompt = "\n".join([f"Information: {info[0]}" for info in info_list]) # info[0] = info, info[1] = lr
    info_prompts[diagnosis] = info_prompt

for key, prompt in info_prompts.items():
    print(f"Diagnosis: {key}. Prompt:\n{prompt}")

Diagnosis: Cardiac. Prompt:
Information: Do you have any PMHx? (counts as 2 independent minor features)
Information: Pain not worse with exertion (requires they clarify exercise 1hr after meal)
Information: no tobacco
Information: no associated shortness of breath
Information: no radiation to the neck, arm, or jaw? 
Information: positional chest pain (worse when laying down)
Information: What were you doing when the chest pain started? (eating)
Information: Alternative cause of esoph dysphagia becomes obvious(food gets stuck or relieved by regurgitation of food)
Information: no prior CAD
Information: no PAD
Information: no HLD
Information: no prior MI
Information: no DM2
Information: no obesity
Information: no history of stroke
Information: no diaphoresis
Information: Pain worse with exertion (without clarifying that it only occurs soley within an hour of eating)
Information: Decreased exercise x 3 months without clarifying post-prandial food fear
Information: How would you describe th

In [18]:
# Create the full prompts for each disease for each transcript

# Initialize a new column in transcripts_df to store the prompts
transcripts_df["full_prompts"] = None

# Iterate through each transcript in transcripts_df
for i, transcript in enumerate(transcripts_df['text_content']):
    # Create a dictionary for the current transcript's prompts
    transcript_prompts = {}
    
    # Iterate through each diagnosis and its associated info list
    for diagnosis, info_list in diagnosis_info.items():
        # Create the long string with the specified format
        # Each entry is an (info, lr) tuple; only the information text belongs in the prompt
        info_prompt = "\n".join([f"Information: {info[0]}" for info in info_list])
        
        # Prefix the instruction_prompt to the disease-specific prompt, and add the transcript to the end.
        full_prompt = (
            f"{instruction_prompt}\n{info_prompt}\n\n and here is the transcript to assess:\n{transcript}"
        )
        
        # Store the prompt for the current diagnosis
        transcript_prompts[diagnosis] = full_prompt
    
    # Assign the dictionary of prompts to the new column for this transcript
    transcripts_df.at[i, "full_prompts"] = transcript_prompts

display(transcripts_df)
#print(tabulate(transcripts_df, headers = 'keys', tablefmt = 'fancy_grid'))

Unnamed: 0,filename,text_content,full_prompts
0,Intermtn MS4 2 Transcript.pdf,PATIENT DOOR CHART and Learner Instructions\n\...,{'Cardiac': 'You are a research assistant who ...
1,transcript FM PGY1.pdf,Patient Case\n\nPATIENT DOOR CHART and Learner...,{'Cardiac': 'You are a research assistant who ...
2,Intermtn MS4 1 Transcript.pdf,PATIENT DOOR CHART and Learner Instructions\n\...,{'Cardiac': 'You are a research assistant who ...


### Available models to use

In [19]:
for model in llm.get_models():
    print(model.model_id)

gpt-4o
gpt-4o-mini
gpt-4o-audio-preview
gpt-3.5-turbo
gpt-3.5-turbo-16k
gpt-4
gpt-4-32k
gpt-4-1106-preview
gpt-4-0125-preview
gpt-4-turbo-2024-04-09
gpt-4-turbo
o1-preview
o1-mini
gpt-3.5-turbo-instruct
hf.co/unsloth/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q8_0
hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0


In [20]:
%%time
# Ollama version - requires Ollama to be installed, roughly 16 GB of RAM, and 8 GB of disk space.
# The local model is too verbose and doesn't quite follow the instructions (the OpenAI models follow the prompt better)
"""
model = llm.get_model("hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0") 

# Create a new column to store the responses
transcripts_df["responses"] = None

# Iterate through each transcript in the DataFrame
for i, row in transcripts_df.iterrows():
    full_prompts = row["full_prompts"]  # Get the full_prompts dictionary for this transcript
    transcript_responses = {}  # Dictionary to store responses for this transcript
    
    # Iterate through each diagnosis and its associated prompt
    for diagnosis, prompt in full_prompts.items():
        response = model.prompt(prompt)  # Get the model's response
        transcript_responses[diagnosis] = response.text()  # Store the response text
    
    # Save the responses back into the DataFrame
    transcripts_df.at[i, "responses"] = transcript_responses

display(transcripts_df)
#print(tabulate(transcripts_df, headers = 'keys', tablefmt = 'fancy_grid'))
"""

CPU times: user 3 μs, sys: 1e+03 ns, total: 4 μs
Wall time: 6.2 μs



In [21]:
%%time
# Initialize the model
#model = llm.get_model("gpt-4o") # costs a bit more - 
model = llm.get_model("gpt-4o-mini")
model.key = os.environ["OPENAI_API_KEY"]

# Create a new column to store the responses
transcripts_df["responses"] = None

# Iterate through each transcript in the DataFrame
for i, row in transcripts_df.iterrows():
    full_prompts = row["full_prompts"]  # Get the full_prompts dictionary for this transcript
    transcript_responses = {}  # Dictionary to store responses for this transcript
    
    # Iterate through each diagnosis and its associated prompt
    for diagnosis, prompt in full_prompts.items():
        response = model.prompt(prompt)  # Get the model's response
        transcript_responses[diagnosis] = response.text()  # Store the response text
    
    # Save the responses back into the DataFrame
    transcripts_df.at[i, "responses"] = transcript_responses

display(transcripts_df)
#print(tabulate(transcripts_df, headers = 'keys', tablefmt = 'fancy_grid'))



Unnamed: 0,filename,text_content,full_prompts,responses
0,Intermtn MS4 2 Transcript.pdf,PATIENT DOOR CHART and Learner Instructions\n\...,{'Cardiac': 'You are a research assistant who ...,{'Cardiac': '1. Do you have any PMHx? (counts ...
1,transcript FM PGY1.pdf,Patient Case\n\nPATIENT DOOR CHART and Learner...,{'Cardiac': 'You are a research assistant who ...,{'Cardiac': '1. Do you have any PMHx? (counts ...
2,Intermtn MS4 1 Transcript.pdf,PATIENT DOOR CHART and Learner Instructions\n\...,{'Cardiac': 'You are a research assistant who ...,{'Cardiac': '1. Do you have any PMHx? (counts ...


CPU times: user 1.62 s, sys: 146 ms, total: 1.77 s
Wall time: 34.7 s
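Each run of the cell above re-sends every prompt to the API. A small on-disk cache keyed by a hash of the model name and prompt avoids paying for identical calls twice. A minimal sketch using only the standard library; `cached_call`, the default cache directory name, and the `call_fn` callback (any function that takes a prompt string and returns the response text) are all hypothetical and not part of the `llm` package:

```python
import hashlib
import json
from pathlib import Path


def cached_call(prompt, call_fn, model_id, cache_dir="llm_cache"):
    """Return call_fn(prompt), reusing a saved response for the same model and prompt."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)

    # The key covers both model and prompt, so switching models never reuses old text.
    key = hashlib.sha256(f"{model_id}\n{prompt}".encode("utf-8")).hexdigest()
    entry = cache_path / f"{key}.json"

    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))["response"]

    response = call_fn(prompt)
    entry.write_text(
        json.dumps({"model": model_id, "response": response}), encoding="utf-8"
    )
    return response
```

In the loop above this might be called as `cached_call(prompt, lambda p: model.prompt(p).text(), "gpt-4o-mini")`. Note that caching freezes whichever response came back first, which is convenient for reproducibility but hides run-to-run variation in the model's answers.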


In [22]:
%%time
# Separate Call to OpenAI with structured outputs to parse into JSON format

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define a class for a single question-answer pair
class InfoAnswer(BaseModel):
    info: str
    answer: Optional[bool]

# Define a container class that holds a list of question-answer pairs
class InfoAnswerList(BaseModel):
    pairs: List[InfoAnswer]

# Add a new column for storing info_answers
transcripts_df["info_answers"] = None

# Iterate through each transcript's responses
for i, row in transcripts_df.iterrows():
    # Get the responses for this transcript
    transcript_responses = row["responses"]  # Assumes "responses" is already populated as a dictionary
    
    # Create a dictionary to hold the parsed info answers for all diagnoses
    transcript_info_answers = {}
    
    # Iterate through each diagnosis and its response
    for diagnosis, response_text in transcript_responses.items():
        # Generate messages for each diagnosis
        completion = client.beta.chat.completions.parse(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Extract a list of pieces of information (info) and whether or not that piece of information was present (answer) as a boolean (YES = True, NO = False, MISSING = None).",
                },
                {"role": "user", "content": response_text},
            ],
            response_format=InfoAnswerList,
        )
        
        # Parse the response into the structured InfoAnswerList
        info_answer_list = completion.choices[0].message.parsed
        
        # Store the parsed result for this diagnosis
        transcript_info_answers[diagnosis] = info_answer_list
    
    # Store the parsed info answers for this transcript in the DataFrame
    transcripts_df.at[i, "info_answers"] = transcript_info_answers

display(transcripts_df)
#print(tabulate(transcripts_df, headers = 'keys', tablefmt = 'fancy_grid'))

Unnamed: 0,filename,text_content,full_prompts,responses,info_answers
0,Intermtn MS4 2 Transcript.pdf,PATIENT DOOR CHART and Learner Instructions\n\...,{'Cardiac': 'You are a research assistant who ...,{'Cardiac': '1. Do you have any PMHx? (counts ...,{'Cardiac': pairs=[InfoAnswer(info='Do you hav...
1,transcript FM PGY1.pdf,Patient Case\n\nPATIENT DOOR CHART and Learner...,{'Cardiac': 'You are a research assistant who ...,{'Cardiac': '1. Do you have any PMHx? (counts ...,{'Cardiac': pairs=[InfoAnswer(info='Do you hav...
2,Intermtn MS4 1 Transcript.pdf,PATIENT DOOR CHART and Learner Instructions\n\...,{'Cardiac': 'You are a research assistant who ...,{'Cardiac': '1. Do you have any PMHx? (counts ...,{'Cardiac': pairs=[InfoAnswer(info='Do you hav...


CPU times: user 150 ms, sys: 17.2 ms, total: 167 ms
Wall time: 29.3 s


In [23]:
# Iterate through each row in transcripts_df
for i, row in transcripts_df.iterrows():
    # Get the info_answers dictionary for this transcript
    info_answers = row["info_answers"]
    transcript_filename = row["filename"]  # Get the original filename
    
    # Construct the output file name
    output_file = os.path.join(
        ASSESSMENT_DIR,
        f"answers_{transcript_filename.replace('.pdf', '.xlsx')}"
    )
    
    # Create a writer object to handle multiple sheets
    with pd.ExcelWriter(output_file, engine="openpyxl") as writer:
        # Iterate through each diagnosis in info_answers
        for diagnosis, info_answer_list in info_answers.items():
            # Create a DataFrame for the current diagnosis
            data = [{"answer": pair.answer, "info": pair.info} for pair in info_answer_list.pairs]
            df = pd.DataFrame(data)
            display(df)
            # Write the DataFrame to a sheet named after the diagnosis
            df.to_excel(writer, sheet_name=diagnosis[:31], index=False)  # Sheet name max length is 31 characters
    
    print(f"Info-Answer pairs have been written to {output_file}")

Unnamed: 0,answer,info
0,,Do you have any PMHx? (counts as 2 independent...
1,True,Pain not worse with exertion (requires they cl...
2,False,no tobacco
3,True,no associated shortness of breath
4,,"no radiation to the neck, arm, or jaw?"
5,True,positional chest pain (worse when laying down)
6,,What were you doing when the chest pain starte...
7,True,Alternative cause of esoph dysphagia becomes o...
8,,no prior CAD
9,,no PAD


Unnamed: 0,answer,info
0,True,Heartburn (Postprandial burning or pain)
1,True,Reflux / regurgitation
2,,Pain location behind sternum
3,True,Positional (worse when laying down)
4,True,Alternative cause becomes obvious: esoph dysph...
5,,How would you describe the pain? (tightness… n...
6,,Do antacids help with your chest pain?
7,,No hoarse voice
8,,No dry cough
9,,No globus


Unnamed: 0,answer,info
0,True,Food gets stuck
1,True,Regurgitation provides relief
2,,Pain location behind sternum
3,True,Positional chest pain (worse when laying down)
4,,How would you describe the pain? (tightness… n...
5,True,Difficulty swallowing liquids
6,,Weight loss
7,,No FHx of cancer
8,True,Does not use alcohol


Unnamed: 0,answer,info
0,True,Pattern of hand pain: multiple symmetric joint...
1,,Hand predominance disproportionate to other jo...
2,True,FHx of RA
3,,No morning stiffness
4,,Lack of joint swelling
5,,"No enlargement of knuckles, finger deformities..."
6,,No rheumatoid nodules


Unnamed: 0,answer,info
0,True,Alternative cause becomes obvious: esoph dysph...
1,,Raynauds phenomenon
2,,Rash (telangiectasias)
3,,Hand pain out of proportion to other joints (m...
4,True,Current heartburn or reflux
5,,Long-standing heartburn and reflux (duration o...
6,True,Difficulty swallowing liquids
7,,Weight loss
8,True,FHx of RA
9,True,no associated shortness of breath


Info-Answer pairs have been written to /Users/reblocke/Research/dx_chat_entropy/Assessments/answers_Intermtn MS4 2 Transcript.xlsx


Unnamed: 0,answer,info
0,True,Do you have any PMHx? (counts as 2 independent...
1,True,Pain not worse with exertion (requires they cl...
2,False,no tobacco
3,,no associated shortness of breath
4,True,"no radiation to the neck, arm, or jaw?"
5,,positional chest pain (worse when laying down)
6,,What were you doing when the chest pain starte...
7,True,Alternative cause of esoph dysphagia becomes o...
8,,no prior CAD
9,,no PAD


Unnamed: 0,answer,info
0,,Heartburn (Postprandial burning or pain)
1,True,Reflux / regurgitation
2,True,Pain location behind sternum
3,,Positional (worse when laying down)
4,True,Alternative cause becomes obvious: esoph dysph...
5,True,How would you describe the pain? (tightness… n...
6,,Do antacids help with your chest pain?
7,,No hoarse voice
8,True,No dry cough
9,,No globus


Unnamed: 0,answer,info
0,True,Food gets stuck
1,True,Regurgitation provides relief
2,True,Pain location behind sternum
3,,Positional chest pain (worse when laying down)
4,True,How would you describe the pain? (tightness… n...
5,True,Difficulty swallowing liquids
6,,Weight loss
7,,No FHx of cancer
8,True,Does not use alcohol


Unnamed: 0,answer,info
0,,Pattern of hand pain: multiple symmetric joint...
1,,Hand predominance disproportionate to other jo...
2,,FHx of RA
3,,No morning stiffness
4,,Lack of joint swelling
5,,"No enlargement of knuckles, finger deformities..."
6,,No rheumatoid nodules


Unnamed: 0,answer,info
0,True,Alternative cause becomes obvious: esoph dysph...
1,,Raynauds phenomenon
2,,Rash (telangiectasias)
3,,Hand pain out of proportion to other joints (m...
4,True,Current heartburn or reflux
5,,Long-standing heartburn and reflux (duration o...
6,True,Difficulty swallowing liquids
7,,Weight loss
8,,FHx of RA
9,True,no associated shortness of breath


Info-Answer pairs have been written to /Users/reblocke/Research/dx_chat_entropy/Assessments/answers_transcript FM PGY1.xlsx


Unnamed: 0,answer,info
0,,Do you have any PMHx? (counts as 2 independent...
1,True,Pain not worse with exertion (requires they cl...
2,,no tobacco
3,True,no associated shortness of breath
4,True,"no radiation to the neck, arm, or jaw?"
5,True,positional chest pain (worse when laying down)
6,,What were you doing when the chest pain starte...
7,True,Alternative cause of esoph dysphagia becomes o...
8,True,no prior CAD
9,,no PAD


Unnamed: 0,answer,info
0,True,Heartburn (Postprandial burning or pain)
1,True,Reflux / regurgitation
2,True,Pain location behind sternum
3,True,Positional (worse when laying down)
4,,Alternative cause becomes obvious: esoph dysph...
5,True,How would you describe the pain? (tightness… n...
6,,Do antacids help with your chest pain?
7,,No hoarse voice
8,,No dry cough
9,,No globus


Unnamed: 0,answer,info
0,True,Food gets stuck
1,True,Regurgitation provides relief
2,True,Pain location behind sternum
3,True,Positional chest pain (worse when laying down)
4,True,How would you describe the pain? (tightness… n...
5,,Difficulty swallowing liquids
6,,Weight loss
7,,No FHx of cancer
8,,Does not use alcohol


Unnamed: 0,answer,info
0,,Pattern of hand pain: multiple symmetric joint...
1,,Hand predominance disproportionate to other jo...
2,,FHx of RA
3,,No morning stiffness
4,,Lack of joint swelling
5,,"No enlargement of knuckles, finger deformities..."
6,,No rheumatoid nodules


Unnamed: 0,answer,info
0,True,Alternative cause becomes obvious: esoph dysph...
1,,Raynauds phenomenon
2,,Rash (telangiectasias)
3,,Hand pain out of proportion to other joints (m...
4,True,Current heartburn or reflux
5,True,Long-standing heartburn and reflux (duration o...
6,True,Difficulty swallowing liquids
7,,Weight loss
8,,FHx of RA
9,True,no associated shortness of breath


Info-Answer pairs have been written to /Users/reblocke/Research/dx_chat_entropy/Assessments/answers_Intermtn MS4 1 Transcript.xlsx


In [None]:
# Need to go back and match the assessments (from GPT) to the data-frame with the LRs - might be able to specify this in the call. 
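One way to do the matching described above is to key both sides on a normalised copy of the information string, since the model may change case or whitespace when echoing it back. A minimal sketch, assuming exact matches after normalisation are sufficient; `normalise` and `attach_lrs` are hypothetical helpers, and unmatched items are returned separately rather than guessed:

```python
def normalise(text):
    """Lower-case and collapse whitespace so minor echo differences still match."""
    return " ".join(str(text).lower().split())


def attach_lrs(template_pairs, answer_pairs):
    """Join (info, lr) template rows with (info, answer) model rows.

    Returns (matched, unmatched). `matched` is a list of (info, lr, answer) in
    template order; answer is None when the model returned nothing for that row
    (or returned MISSING). `unmatched` lists model rows not found in the template.
    """
    answers = {normalise(info): answer for info, answer in answer_pairs}
    template_keys = {normalise(info) for info, _ in template_pairs}

    matched = [
        (info, lr, answers.get(normalise(info))) for info, lr in template_pairs
    ]
    unmatched = [
        (info, answer)
        for info, answer in answer_pairs
        if normalise(info) not in template_keys
    ]
    return matched, unmatched
```

Here `template_pairs` would be one entry of `diagnosis_info` and `answer_pairs` could be built as `[(p.info, p.answer) for p in info_answer_list.pairs]`. A non-empty `unmatched` list signals that the model rewrote an item, which is worth inspecting by hand.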

In [None]:
# End