# Experiment 1.0

In this first experiment I simply want to try generating some synthetic patient notes. The goals are:

* Demonstrate an example of calling the ChatGPT API 
* Solve some of the problems in formatting output
* Use ChatGPT to (a) pick a medical condition and (b) generate an admission note given some simple patient details
* Evaluate the output for (a) successful generation of text and (b) plausibility of the admission note

In order to seed the LLM with some patient information, I start with some patient parameters provided by synthetic data. NHS England prepared a synthetic dataset of A&E presentations. A blog post about it is [here](https://open-innovations.org/blog/2019-01-24-exploring-methods-for-creating-synthetic-a-e-data). The dataset can be accessed from [this website](https://data.england.nhs.uk/dataset/a-e-synthetic-data/resource/81b068e5-6501-4840-a880-a8e7aa56890e#). 

Unfortunately, there is no data dictionary. Therefore, I only use data that can be intuitively understood, i.e.:

* Age_Band
* AE_Arrive_HourOfDay
* AE_Time_Mins
* Admitted_Flag - assuming 1 means yes
* ICD10_Chapter_Code - for more see [Wikipedia](https://en.wikipedia.org/wiki/ICD-10)
* Length_Of_Stay_Days

A notebook exploring these columns is [here](explore-nhse-ae-data.ipynb).

This dataset provides, among other things, the hospital length of stay and whether the patient was admitted. As my interest is in clinical notes that are generated during hospital visits, I'm more interested in inpatients. I will use the length of stay information to ask ChatGPT to pick a medical condition that would merit inpatient admission. I also investigate whether, given the length of stay information, the admission note indicates a severity that could plausibly (at least barring any unexpected developments in the patient's trajectory) result in such a stay.

Therefore, I propose to evaluate the admission notes: 
* on their readability as admission notes (although as I am not medical, I cannot do a full evaluation)
* on the plausibility of the medical condition chosen by ChatGPT leading to a hospital visit of this length


## Set up

In [46]:
# Reload functions every time
%load_ext autoreload 
%autoreload 2

In [47]:
# Load libraries
import sys
import os
import pandas as pd
from pathlib import Path

import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)


# Import the variables that have been set in the init.py file in the root directory
# These include a constant called PROJECT_ROOT which stores the absolute path to this folder
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
import init
PROJECT_ROOT = os.getenv("PROJECT_ROOT")

# Add the src folder to sys path, so that the application knows to look there for libraries
sys.path.append(str(Path(PROJECT_ROOT) / 'src'))

# Import function to load data
from data_ingestion.load_data import load_nhse_data

from utils.write_to_json import write_to_json
from utils.load_from_json import load_from_json



## Load data

Here the function load_nhse_data(), called below, does the following:

* checks whether the NHS England dataset has already been saved in parquet format in a local folder called data_store; if so, the function returns it
* if not, it checks for the zip file from the NHSE website and, if it is not there, downloads it, unzips it and saves it
* the file is unzipped to csv, read as a pandas dataframe and saved to parquet format
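
The cache-first logic described in the list above can be sketched roughly as follows. This is a minimal illustration and not the code inside load_nhse_data(): the function name `next_load_step`, the file stem `ae_synthetic` and the exact layout of the data_store folder are all assumptions.

```python
from pathlib import Path


def next_load_step(data_store: Path, stem: str = "ae_synthetic") -> str:
    """Decide which step of the loading pipeline is needed next.

    `stem` is a hypothetical file stem; the real file names may differ.
    """
    if (data_store / f"{stem}.parquet").exists():
        return "read_parquet"       # cached copy exists: just read it
    if (data_store / f"{stem}.zip").exists():
        return "unzip_and_convert"  # zip already present: unzip, read csv, save parquet
    return "download"               # nothing local yet: fetch the zip first
```

With pandas, the `read_parquet` step would correspond to `pd.read_parquet(...)`, and the conversion step to `pd.read_csv(...)` followed by `DataFrame.to_parquet(...)`.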

In [48]:
ed = load_nhse_data()
# ed.head()

In [49]:
print(min(ed.AE_Arrive_Date))
print(max(ed.AE_Arrive_Date))

2014-03-23 00:00:00
2018-03-22 00:00:00


As noted above, we will only use the following columns initially:

In [4]:
# ed[['Age_Band', 'AE_Arrive_HourOfDay', 'AE_Time_Mins', 'Admitted_Flag', 'Length_Of_Stay_Days', 'ICD10_Chapter_Code', 'Title']].head()

## Generate an instance of a patient using the dataset

Here we call a script that contains a class definition called Patient. To see the script, go to [../src/functions/patient_class.py](../src/functions/patient_class.py). An instance of this class, here referred to as a persona, is a single Patient. The instance is populated with all the variables retrieved from a single row of the ED dataset loaded above.

The steps are the following:
* select a row from the A&E data
* pass the row information to the class definition

In addition, as part of creating the persona, a call to ChatGPT is made using the OpenAI API. Certain details about the patient (listed below) are embedded into the prompt to be passed in. ChatGPT is asked to generate a medical condition, and an admission note. 
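
To make the idea of embedding patient details into a prompt concrete, here is a minimal sketch using the standard library's `string.Template`. The template wording is hypothetical (the real prompt is in pick_medical_condition.txt), and `build_prompt` is an illustrative helper rather than a function from this repository. The three JSON keys are the ones the debugging code in this notebook looks for.

```python
from string import Template

# Hypothetical template text; the real prompt lives in pick_medical_condition.txt
PROMPT_TEMPLATE = Template(
    "A patient aged $age_band arrived at A&E between $arrive_hour hours, "
    "spent $ae_mins minutes in the department and was admitted for "
    "$los_days days. Their diagnosis falls under the ICD-10 chapter "
    "'$icd10_title'. Return a JSON object with the keys "
    "'possible_conditions', 'most_likely_condition' and 'admission_note'."
)


def build_prompt(patient: dict) -> str:
    """Fill the template from a dict keyed by the dataset's column names."""
    return PROMPT_TEMPLATE.substitute(
        age_band=patient["Age_Band"],
        arrive_hour=patient["AE_Arrive_HourOfDay"],
        ae_mins=patient["AE_Time_Mins"],
        los_days=patient["Length_Of_Stay_Days"],
        icd10_title=patient["Title"],
    )
```

`Template.substitute` raises a `KeyError` if a placeholder is left unfilled, which makes a missing patient detail fail loudly rather than silently producing a malformed prompt.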

A sub-function [pick_medical_condition()](../src/functions/pick_medical_condition.py) is called to populate three additional attributes of the persona: the medical condition, their admission note, and their most recent note (which in this case is the admission note, but later this attribute could be something else like a progress note). 

The script [pick_medical_condition()](../src/functions/pick_medical_condition.py) does the following:

* calls a function [generate_prompt_presenting_condition()](../src/functions/pick_medical_condition.py) (scroll down the file) which populates a ChatGPT prompt with details about the patient. The prompt contains ChatGPT's instructions and requests the output in a json format. To see the text of the prompt, go to [pick_medical_condition.txt](../templates/prompt_templates/pick_medical_condition.txt)
* calls ChatGPT with the prompt. Functions used to call ChatGPT are in [prompt_functions.py](../src/functions/prompt_functions.py)
* attempts to parse the json output
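
The last step, parsing the json output, is where most failures tend to occur, so a defensive parser is worth sketching. The version below is an assumption about how this could be done, not the code in pick_medical_condition.py; `parse_condition_response` is a hypothetical name. It first tries the reply as-is, then falls back to the text between the first `{` and the last `}` in case the model wrapped the JSON in extra prose.

```python
import json
import re


def parse_condition_response(content: str):
    """Parse a reply into (possible_conditions, most_likely_condition, admission_note).

    Returns None if no JSON object with those three keys can be recovered.
    """
    candidates = [content]
    # The model may wrap the JSON in prose or a code fence, so also try
    # the text between the first "{" and the last "}".
    match = re.search(r"\{.*\}", content, flags=re.DOTALL)
    if match:
        candidates.append(match.group(0))

    for text in candidates:
        try:
            # strict=False allows literal newlines inside JSON strings
            data = json.loads(text, strict=False)
        except json.JSONDecodeError:
            continue
        if not isinstance(data, dict):
            continue
        try:
            return (
                data["possible_conditions"],
                data["most_likely_condition"],
                data["admission_note"],
            )
        except KeyError:
            continue
    return None
```

Passing `strict=False` avoids stripping newlines from the whole reply before parsing, which would also flatten the paragraph breaks inside the admission note.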




In [50]:
ed[['id','Age_Band', 'AE_Arrive_HourOfDay', 'AE_Time_Mins',  'Length_Of_Stay_Days', 'ICD10_Chapter_Code', 'Title']].iloc[10000]

id                                                  10000
Age_Band                                            65-84
AE_Arrive_HourOfDay                                 09-12
AE_Time_Mins                                          370
Length_Of_Stay_Days                                  12.0
ICD10_Chapter_Code                                     IX
Title                  Diseases of the circulatory system
Name: 10000, dtype: object

In [202]:
from functions.patient_class import Patient

def row_to_patient(row):
    return Patient(*row)
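
Note that `Patient(*row)` relies on the dataframe's column order matching the order of the constructor's parameters. A keyword-based alternative is sketched below. The `PatientSketch` dataclass and its reduced field list are hypothetical stand-ins for the real Patient class, and a plain dict stands in for what `row.to_dict()` would return.

```python
from dataclasses import dataclass, fields


@dataclass
class PatientSketch:
    # Hypothetical subset of the dataset's columns
    id: int
    Age_Band: str
    Length_Of_Stay_Days: float
    Title: str


def row_to_patient_by_name(row: dict) -> PatientSketch:
    """Build a patient from a mapping, ignoring columns the class does not know."""
    known = {f.name for f in fields(PatientSketch)}
    return PatientSketch(**{k: v for k, v in row.items() if k in known})
```

Matching by name means that reordering or adding columns in the source data would not silently assign values to the wrong attributes.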

In [203]:
persona = ed[ed.Title == "Diseases of the circulatory system"].iloc[10000]
print(persona)
Pat = row_to_patient(persona)


id                                                              10000
IMD_Decile_From_LSOA                                              4.0
Age_Band                                                        65-84
Sex                                                               1.0
AE_Arrive_Date                                    2016-11-24 00:00:00
AE_Arrive_HourOfDay                                             09-12
AE_Time_Mins                                                      370
AE_HRG                                                           High
AE_Num_Diagnoses                                                    1
AE_Num_Investigations                                               6
AE_Num_Treatments                                                   6
AE_Arrival_Mode                                                     1
Provider_Patient_Distance_Miles                                   6.0
ProvID                                                          15179
Admitted_Flag       

The medical condition was generated by ChatGPT from the prompt described above:

In [205]:
Pat.Medical_Condition


'Coronary artery disease'

In [206]:
Pat.Admission_Note

"Chief Complaint:\nThe patient presented to the Accident & Emergency Department with complaints of chest pain and shortness of breath.\n\nHistory of Present Illness:\nThe patient, a 65-84-year-old male, reported experiencing intermittent episodes of chest pain for the past week. The pain typically occurs during physical exertion and is relieved by rest. Additionally, the patient has been experiencing episodes of shortness of breath, especially during exertion or lying flat. There is no associated cough or wheezing.\n\nPast Medical History:\nThe patient has a known history of hypertension and dyslipidemia. He is a current smoker with a 30 pack-year smoking history. There is no past history of coronary artery disease, congestive heart failure, or stroke.\n\nPhysical Examination:\nUpon examination, the patient appears in mild distress due to chest pain. Vital signs are stable with a blood pressure of 130/80 mmHg, heart rate of 80 bpm, respiratory rate of 18 breaths/min, and oxygen saturat

If the admission note is not generated, the following code may help for debugging. Note that you need to 
* uncomment the lines in patient_class.py to return the whole content string
* uncomment the line return(content) in pick_medical_condition.py

In [179]:
# import ast
# import json

# def process_response(content):
#     print("trying string replace")
#     try: 
#         content = content.replace("\n", "")
#         print(type(content))
#     except:
#         print("failed on string replace")
#     print("trying json load")
#     try: 
#         content_json = json.loads(content)
#         print(type(content_json))
#     except:
#         print("failed on json.loads")

#     print("trying ast.literal_eval")
#     try: 
#         content_json = ast.literal_eval(content)
#         print(type(content_json))
#     except:
#         print("failed on ast.literal_eval")

#     # print("trying json dump and load")
#     # try: 
#     #     content_json = json.loads(json.dumps(content))
#     #     print("succeeded on json dump and load")
#     #     print(type(content_json))
#     #     print(content_json)
#     # except:
#     #     print("failed on json. dump and loads")
        
#     print("trying most likely conditions")
#     try: 
#         # med_cond = content_json['most_likely_condition']
#         return( (content_json['possible_conditions'],
#                  content_json['most_likely_condition'],
#                  content_json['admission_note']), )
#     except:
#         print("failed on most_likely_condition argument")

#     # try: 
#     #     med_cond = content_json['most_likely_condition']
#     #     return(med_cond)
#     # except:
#     #     print("failed on most_likely_condition argument")

#     print("trying possible_conditions")

#     try: 
#         poss_cond = content_json['possible_conditions']
#         return(poss_cond)
#     except:
#         print("failed on possible_conditions argument")

#     try: 
#         content = content.replace("\n", "")
#         content_json = json.loads(content)
#         print(content_json)
#         poss_cond = content_json["possible_conditions"]
#         return(poss_cond)
#     except:
#         print("failed on possible_conditions argument after string replace")
        
# process_response(Pat.Medical_Condition)



## Generate 10 instances of a patient using the dataset and save to json

Here I pick 10 patients at random, and generate admission notes for them. I chose to save these to a json file, rather than to individual text files or to SQL, as this is human readable. You can view the full output [here](../src/data_exports/note_dict_20231002_2120.json)
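
For reference, writing such a dictionary to a timestamped json file can be sketched as below. This is an assumption about what a helper like write_to_json might do, based only on the file name linked above; `save_notes` and its parameters are hypothetical.

```python
import json
from datetime import datetime
from pathlib import Path


def save_notes(note_dict: dict, out_dir: Path, stem: str = "note_dict") -> Path:
    """Write note_dict to <out_dir>/<stem>_<YYYYMMDD_HHMM>.json and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M")
    path = out_dir / f"{stem}_{timestamp}.json"
    with path.open("w", encoding="utf-8") as f:
        # default=str converts values json cannot serialise natively (e.g. timestamps)
        json.dump(note_dict, f, indent=2, default=str)
    return path
```

One thing to keep in mind is that JSON object keys are always stored as strings, so integer patient ids come back as strings when the file is loaded again.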

In [262]:
attributes = ['Age_Band', 'AE_Arrive_HourOfDay', 'AE_Time_Mins',  'Length_Of_Stay_Days', 'ICD10_Chapter_Code', 'Title', 'Medical_Condition', 'Admission_Note']
note_dict = {}


In [263]:
for index, persona in ed.sample(10).iterrows():

    print("Getting admission note for patient with index " + str(index))
    Pat = row_to_patient(persona)

    note_dict[Pat.id] = {}
    note_dict[Pat.id] = {attr: getattr(Pat, attr) for attr in attributes if hasattr(Pat, attr)}
    # note_dict[Pat.id]['day_' + str(0)] = {}
    # note_dict[Pat.id]['day_' + str(0)]['admission'] = Pat.Admission_Note


write_to_json(note_dict, 'experiment_1.0')

Getting admission note for patient with index 2827279
Getting admission note for patient with index 2931824
Getting admission note for patient with index 2242585
Getting admission note for patient with index 3206525
Getting admission note for patient with index 2217701
Getting admission note for patient with index 5162913
Getting admission note for patient with index 1194824
Getting admission note for patient with index 2987383
Getting admission note for patient with index 3325602
Getting admission note for patient with index 4864380


## Evaluate output

In [59]:
note_dict = load_from_json('note_dict')

### Quantitative evaluation

First, check how many records are complete. Ideally, both medical condition and admission note would be populated for all patients. Here 4 out of 10 meet that criterion. You can view the full output [here](../src/data_exports/note_dict_20231002_2120.json)

In [60]:
admission_note_count = 0
medical_condition_count = 0
failed_count_total = 0

for patient_id, persona in note_dict.items():

    # A record has failed completely if any field holds the json parsing error message
    failed = any('failed on json' in str(value_) for value_ in persona.values())
    if failed:
        failed_count_total += 1
        continue

    if persona['Admission_Note'] != '':
        admission_note_count += 1
    if persona['Medical_Condition'] != '':
        medical_condition_count += 1
        # Mark records that have both a medical condition and an admission note
        if persona['Admission_Note'] != '':
            persona['complete'] = True
    
print("Number of records that failed completely: " + str(failed_count_total))
print("Number of records with medical condition: " + str(medical_condition_count))
print("Number of records with admission note: " + str(admission_note_count))


Number of records that failed completely: 1
Number of records with medical condition: 6
Number of records with admission note: 4


### Qualitative evaluation

Look at the kind of medical conditions noted, for those with complete records, and consider whether this is reasonable given their length of stay. You can view the full output [here](../src/data_exports/note_dict_20231002_2120.json)

In [61]:
for id, persona in note_dict.items():
    if 'complete' in persona.keys():
        print(f"\n{id}")
        print(persona['Medical_Condition'])
        print(f"Length of stay: {persona['Length_Of_Stay_Days']} days")
        # print(persona['Admission_Note'])


2242585
Urinary tract infection
Length of stay: 4 days

2987383
Intracranial injury with loss of consciousness of less than 30 minutes duration
Length of stay: 2 days

3206525
Fracture
Length of stay: 8 days

4864380
Dehydration
Length of stay: 16 days


Viewing the output, I note: 
* length of stay is included in the admission note; this needs to be corrected (and probably could be with a change of prompt)
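
Until the prompt is changed, notes that leak the length of stay could be flagged automatically. The check below is a rough heuristic of my own, not part of the repository: it looks for the phrase "length of stay", or for the word "stay" in the same sentence as a number of days, so that symptom durations such as "pain for the past 3 days" are not flagged.

```python
import re

# A number of days, e.g. "12 days", "12.0 days" or "8-day"
_DAYS = r"\b\d+(?:\.\d+)?[\s-]*days?\b"

# Either the literal phrase, or "stay" and a day count within one sentence
LOS_PATTERN = re.compile(
    rf"length of stay|\bstay\b[^.]*?{_DAYS}|{_DAYS}[^.]*?\bstay\b",
    flags=re.IGNORECASE,
)


def mentions_length_of_stay(note: str) -> bool:
    """Return True if the note appears to state how long the stay will be."""
    return bool(LOS_PATTERN.search(note))
```

A regular expression like this will miss durations written in words ("two weeks") and is only meant as a first filter before reading the flagged notes by hand.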

About the specific patients:
* 4 days with a UTI might be plausible if (a) it led to delirium in a frail or confused patient or (b) it masked other conditions
* 2 days for a concussion might be plausible
* a fracture has been deemed a likely admission, which is itself unlikely and especially unlikely to lead to 8 days as an inpatient. The [prompt](../templates/prompt_templates/pick_medical_condition.txt) explicitly asked the agent to "Consider whether their condition is serious enough to warrant the expected duration"
* a 16-day length of stay for dehydration seems highly unlikely