### Predict which patients will need a ventilator
This notebook predicts whether a COVID-19 patient needs a ventilator from a few clinical observations (oxygen saturation, respiratory rate, and ferritin). The dataset is from https://mitre.box.com/shared/static/9iglv8kbs1pfi7z8phjl9sbpjk08spze.zip

In [39]:
import pandas as pd
import numpy as np


- Read the files

In [40]:
patients = pd.read_csv('./10k_synthea_covid19_csv/patients.csv')
conditions = pd.read_csv('./10k_synthea_covid19_csv/conditions.csv')
observations = pd.read_csv('./10k_synthea_covid19_csv/observations.csv')
procedures = pd.read_csv('./10k_synthea_covid19_csv/procedures.csv')
patients.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12352 entries, 0 to 12351
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Id                   12352 non-null  object 
 1   BIRTHDATE            12352 non-null  object 
 2   DEATHDATE            2352 non-null   object 
 3   SSN                  12352 non-null  object 
 4   DRIVERS              10399 non-null  object 
 5   PASSPORT             9845 non-null   object 
 6   PREFIX               10110 non-null  object 
 7   FIRST                12352 non-null  object 
 8   LAST                 12352 non-null  object 
 9   SUFFIX               124 non-null    object 
 10  MAIDEN               3540 non-null   object 
 11  MARITAL              8833 non-null   object 
 12  RACE                 12352 non-null  object 
 13  ETHNICITY            12352 non-null  object 
 14  GENDER               12352 non-null  object 
 15  BIRTHPLACE           12352 non-null 

- Find patients with COVID

In [41]:
# Keep only the Id and GENDER columns
patients_subset = patients[['Id', 'GENDER']]
patients_subset.head()

Unnamed: 0,Id,GENDER
0,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,M
1,067318a4-db8f-447f-8b6e-f2f61e9baaa5,F
2,ae9efba3-ddc4-43f9-a781-f72019388548,M
3,199c586f-af16-4091-9998-ee4cfc02ee7a,F
4,353016ea-a0ff-4154-85bb-1cf8b6cedf20,M


In [42]:
# Select condition rows with the COVID-19 code (840539006) and attach each patient's gender
covid_patients = conditions[conditions['CODE'] == 840539006]
covid_patients = covid_patients[['PATIENT', 'CODE']]
covid_patients = covid_patients.merge(patients_subset, left_on='PATIENT', right_on='Id')
covid_patients.head()



Unnamed: 0,PATIENT,CODE,Id,GENDER
0,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,840539006,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,M
1,067318a4-db8f-447f-8b6e-f2f61e9baaa5,840539006,067318a4-db8f-447f-8b6e-f2f61e9baaa5,F
2,ae9efba3-ddc4-43f9-a781-f72019388548,840539006,ae9efba3-ddc4-43f9-a781-f72019388548,M
3,199c586f-af16-4091-9998-ee4cfc02ee7a,840539006,199c586f-af16-4091-9998-ee4cfc02ee7a,F
4,353016ea-a0ff-4154-85bb-1cf8b6cedf20,840539006,353016ea-a0ff-4154-85bb-1cf8b6cedf20,M


- Combine observations with the covid patients

In [43]:
# Merge observations with the covid patients
merged_data = pd.merge(covid_patients, observations, on='PATIENT')
# drop columns that are not needed downstream (Id, DATE, ENCOUNTER, TYPE, CODE_x)
merged_data = merged_data.drop(columns=['Id', 'DATE', 'ENCOUNTER', 'TYPE', 'CODE_x'])

# filter for specific observation codes (O2 saturation, respiratory rate, ferritin)
merged_data = merged_data[merged_data['CODE_y'].isin(['2708-6', '9279-1', '2276-4'])]
merged_data = merged_data.rename(columns={'CODE_y': 'OBSERVATION_CODE'})
merged_data['PATIENT'].nunique()
# for each patient, get the max value of each observation
merged_data = merged_data.groupby(['PATIENT', 'OBSERVATION_CODE']).agg({'VALUE': 'max'}).reset_index()
merged_data = merged_data.merge(patients_subset, left_on='PATIENT', right_on='Id')
merged_data = merged_data.drop(columns=['Id'])
merged_data.info()
merged_data.head()
# Pivot the data to have observations as columns
pivot_data = merged_data.pivot(index='PATIENT', columns='OBSERVATION_CODE', values='VALUE')
pivot_data.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19507 entries, 0 to 19506
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   PATIENT           19507 non-null  object
 1   OBSERVATION_CODE  19507 non-null  object
 2   VALUE             19507 non-null  object
 3   GENDER            19507 non-null  object
dtypes: object(4)
memory usage: 609.7+ KB


PATIENT,2276-4,2708-6,9279-1
0000b247-1def-417a-a783-41c8682be022,,81.5,33.7
00049ee8-5953-4edd-a277-b9c1b1a7f16b,,78.8,30.2
00079a57-24a8-430f-b4f8-a1cf34f90060,625.1,94.5,37.2
0008a63c-c95c-46c2-9ef3-831d68892019,949.6,88.8,36.5
00093cdd-a9f0-4ad8-87e9-53534501f008,593.2,86.1,37.6
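
The `info()` output above reports `VALUE` with dtype `object`, which suggests the values are held as strings. If so, `'max'` compares them lexicographically rather than numerically. The following is a minimal sketch on made-up data (the toy frame and the `VALUE_NUM` column are assumptions for illustration, not part of the notebook) showing the difference and one way to convert with `pd.to_numeric` before aggregating:

```python
import pandas as pd

# Toy stand-in for the filtered observations; VALUE is stored as text.
obs = pd.DataFrame({
    "PATIENT": ["p1", "p1", "p2"],
    "OBSERVATION_CODE": ["2708-6", "2708-6", "2708-6"],
    "VALUE": ["94.5", "100.0", "88.8"],
})

# String comparison is lexicographic, so "94.5" ranks above "100.0".
string_max = obs.groupby("PATIENT")["VALUE"].max()

# Convert to numbers first; errors="coerce" turns non-numeric entries into NaN.
obs["VALUE_NUM"] = pd.to_numeric(obs["VALUE"], errors="coerce")
numeric_max = obs.groupby("PATIENT")["VALUE_NUM"].max()

print(string_max["p1"], numeric_max["p1"])
```

Whether this changes the real results depends on the values in the file, for example whether any reading reaches 100 or more.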


- Prepare the dataset

In [44]:
# Find patients that needed ventilation from procedures
ventilation_procedures = procedures[procedures['CODE'].isin([26763009])]
# Keep only the patient id and add a flag column for patients needing ventilation
ventilation_procedures = ventilation_procedures[['PATIENT']].copy()
ventilation_procedures['VENTILATOR'] = True
ventilation_procedures.head()
# Merge the ventilation data with the pivot_data
df = pivot_data.merge(ventilation_procedures, left_index=True, right_on='PATIENT', how='left')
df['VENTILATOR'] = df['VENTILATOR'].fillna(False)
# reset index
df = df.reset_index(drop=True)
# rename the observation codes to readable names and reorder the columns
df = df.rename(columns={'2708-6': 'O2 Saturation',
                        '9279-1': 'RR', '2276-4': 'Ferritin'})
df = df[['PATIENT', 'O2 Saturation', 'RR', 'Ferritin', 'VENTILATOR']]
df.dropna(inplace=True)
df = df.reset_index(drop=True)  # reset index after dropping NaN values
df['VENTILATOR'].value_counts()
# # Get min max values for each observation
# min_max_values = df[['O2 Saturation', 'RR', 'Ferritin']].agg(['min', 'max'])
# min_max_values
df[df['VENTILATOR'] == True].head()
#df[df['VENTILATOR'] == False].head()





Unnamed: 0,PATIENT,O2 Saturation,RR,Ferritin,VENTILATOR
11,0100f99a-1b5d-4a5b-a73f-559a920412e5,88.8,39.5,982.2,True
12,0100f99a-1b5d-4a5b-a73f-559a920412e5,88.8,39.5,982.2,True
13,0100f99a-1b5d-4a5b-a73f-559a920412e5,88.8,39.5,982.2,True
14,0100f99a-1b5d-4a5b-a73f-559a920412e5,88.8,39.5,982.2,True
15,0100f99a-1b5d-4a5b-a73f-559a920412e5,88.8,39.5,982.2,True
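
The rows above repeat the same patient several times, which is what a left merge produces when a patient has more than one ventilation procedure. Because the later split is done by row index, such duplicates could place the same patient in both the training and the test set. A minimal sketch of deduplicating before the merge, on made-up data and assuming `PATIENT` is a regular column (for example after `pivot_data.reset_index()`):

```python
import pandas as pd

# Toy stand-ins: one feature row per patient, three ventilation procedures for p1.
features = pd.DataFrame({
    "PATIENT": ["p1", "p2"],
    "O2 Saturation": [88.8, 95.0],
})
procedures = pd.DataFrame({"PATIENT": ["p1", "p1", "p1"]})

# Keep one flag per patient so the left merge cannot multiply feature rows.
vent = procedures[["PATIENT"]].drop_duplicates().assign(VENTILATOR=True)

merged = features.merge(vent, on="PATIENT", how="left")
# Patients without a match get NaN; comparing with True maps them to False.
merged["VENTILATOR"] = merged["VENTILATOR"].eq(True)

print(merged)
```

With one row per patient, a random split by row index is also a split by patient.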


### Using an OpenAI model for classification

In [45]:
!pip install --upgrade openai


In [46]:
from openai import OpenAI

In [47]:
client = OpenAI()  # reads the key from the OPENAI_API_KEY environment variable; do not hard-code keys in the notebook

- Zero-shot request

In [48]:
response = client.chat.completions.create(
  # gpt-4 is used here; a cheaper model can be substituted while testing
  model="gpt-4",
  messages=[
        {"role": "user", "content": "Decide in a single word if the patient needs ventilation based on the following data: oxygen saturation is at 95%, Ferritin levels are 800, respiratory rate is 25. Respond with 'Yes' or 'No'."},
    ]
)

print(response.choices[0].message.content)

No


- Chat-based approach (adds a system message)

In [49]:
response = client.chat.completions.create(
  model="gpt-4",
  messages=[
        {"role": "system", "content": "You are an expert on Covid diagnosis."},
        {"role": "user", "content": "Decide in a single word if the patient needs ventilation based on the following data: oxygen saturation is at 95%, Ferritin levels are 800, respiratory rate is 25. Respond with 'Yes' or 'No'."},
    ]
)

print(response.choices[0].message.content)

Yes


- Continue the conversation and ask why it responded 'Yes'

In [50]:
messages = [
    {"role": "system", "content": "You are an expert on Covid diagnosis."},
    {"role": "user", "content": "Decide in a single word if the patient needs ventilation based on the following data: oxygen saturation is at 95%, Ferritin levels are 800, respiratory rate is 25. Respond with 'Yes' or 'No'."},
    {"role": "assistant", "content": response.choices[0].message.content},
    {"role": "user", "content": "Can you provide details why you made that decision?"},
]

response = client.chat.completions.create(
  model="gpt-4",
  messages=messages
)

print(response.choices[0].message.content)

The decision is based on the information given; although the oxygen saturation (95%) is within the normal range (95% to 100%), the patient's Ferritin levels are significantly elevated (normal female range is 12-150 ng/mL and for males it's 12-300 ng/mL), indicating a possible severe inflammation or infection. Moreover, the respiratory rate is higher than normal (12-20 breaths per minute for a healthy adult). These combined signs suggests a severe response possibly due to Covid-19, which might require ventilatory support.


- Getting prompts for training and test data

In [51]:
from sklearn.model_selection import train_test_split

df_index = list(df.index)
train_index, test_index = train_test_split(df_index, test_size=0.2, random_state=42)

In [52]:
print("Train Index:", train_index)
print("Test Index:", test_index)
# train_index = train_index[0:50]
# test_index = test_index[0:10]
print(len(train_index))

Train Index: [1313, 2476, 2892, 790, 2283, 288, 1656, 3054, 2907, 2664, 1134, 177, 2589, 3083, 1422, 1624, 370, 1501, 729, 598, 2453, 2459, 2328, 1697, 1427, 1941, 1795, 141, 1916, 665, 817, 3481, 654, 471, 1510, 3430, 2170, 839, 2673, 926, 2785, 2460, 195, 532, 3404, 3208, 2043, 2110, 1074, 227, 555, 807, 2726, 3093, 2179, 2817, 3017, 631, 1670, 1362, 162, 219, 3586, 3438, 3287, 1091, 1511, 547, 874, 2972, 1760, 568, 3349, 1288, 1037, 2341, 611, 2965, 2979, 3374, 1703, 742, 637, 3405, 1236, 581, 2474, 969, 1937, 73, 48, 3556, 2916, 2087, 3213, 2298, 756, 1080, 838, 1206, 1584, 2023, 1462, 2766, 1392, 1075, 572, 3162, 2884, 1178, 1491, 2771, 3072, 1041, 2526, 1783, 3056, 3010, 2211, 3175, 3324, 259, 1608, 3432, 3250, 812, 2596, 2422, 1554, 3139, 2795, 3318, 3603, 3271, 2367, 1414, 2077, 1642, 1777, 1044, 2763, 2836, 2314, 2749, 1370, 1538, 867, 1873, 940, 2739, 2962, 108, 2016, 3259, 2827, 318, 816, 59, 903, 1426, 3182, 1565, 3512, 2993, 765, 1211, 1067, 1417, 1420, 1644, 3095, 727, 18

- Create the Dataset class

In [53]:
from torch.utils.data import Dataset

class VentilatorDataset(Dataset):
    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        # (column name, text that introduces the column's value in the prompt)
        column_names = [
            ("O2 Saturation", "The first observation is oxygen saturation at "),
            ("RR", ". The second observation is respiratory rate at "),
            ("Ferritin", ". The third observation is ferritin level at "),
        ]

        # iloc raises IndexError past the last row, which ends `for prompt in dataset` loops
        row = self.df.iloc[index]
        x_strs = [f"{col_desc}{row[col]}" for col, col_desc in column_names]
        x_str = ''.join(x_strs)
        x_str = x_str.replace('\n', '')
        # Prepend the instruction so the model answers with a single word
        x_str = 'Decide in a single word if the patient needs ventilation: True or False ' + x_str

        return x_str

In [54]:
test_ds = VentilatorDataset(df.iloc[test_index])
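
To make the prompt format concrete, here is a small standalone sketch that assembles the same text without pandas or torch. `build_prompt` is a hypothetical helper written for illustration; the wording mirrors `VentilatorDataset.__getitem__`, and the sample values are taken from one of the rows shown earlier.

```python
def build_prompt(o2_saturation, respiratory_rate, ferritin):
    """Assemble the prompt text in the same order as VentilatorDataset.__getitem__."""
    observations = (
        f"The first observation is oxygen saturation at {o2_saturation}"
        f". The second observation is respiratory rate at {respiratory_rate}"
        f". The third observation is ferritin level at {ferritin}"
    )
    instruction = "Decide in a single word if the patient needs ventilation: True or False "
    return instruction + observations


prompt = build_prompt(88.8, 39.5, 982.2)
print(prompt)
```

Note that the prompt carries no units and no label, so the model sees only three bare numbers.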


#### Use ChatGPT to get the response to the prompts from the dataset

In [55]:
from tqdm import tqdm
import time

results = []
for prompt in tqdm(test_ds):
  response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
          {"role": "user", "content": prompt},
      ]
  )
  results.append(response.choices[0].message.content)
  time.sleep(3)

100%|██████████| 722/722 [41:13<00:00,  3.43s/it]
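
The loop above paces requests with a fixed `time.sleep(3)`. One common alternative is to retry failed calls with exponentially growing waits. The sketch below is an assumption-laden illustration, not the notebook's method: `call_with_backoff` and `flaky_request` are hypothetical names, and the stand-in request replaces the real API call so the example runs offline.

```python
import time


def call_with_backoff(make_request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call make_request(), retrying with exponentially growing waits on failure."""
    for attempt in range(max_attempts):
        try:
            return make_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits of 1s, 2s, 4s, ...


# Stand-in request that fails twice before succeeding.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "True"


waits = []  # record the requested waits instead of actually sleeping
answer = call_with_backoff(flaky_request, sleep=waits.append)
print(answer, waits)
```

In real use the `except` clause would be narrowed to the client's transient errors (such as rate-limit errors) rather than catching every exception.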


In [56]:
results

['True',
 'False',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'False',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'False',
 'True',
 'True',
 'True',
 'True',
 'True',
 'False',
 'False',
 'True',
 'False',
 'True',
 'True',
 'False',
 'True',
 'False',
 'True',
 'True',
 'False',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'False',
 'True',
 'True',
 'True',
 'False',
 'True',
 'True',
 'False',
 'True',
 'True',
 'True',
 'True',
 'True',
 'False',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'False',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'False',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'False',
 'True',
 'True',
 'True',
 'True',
 'True',
 'True',
 'False',
 'True',
 'False',
 

- Compute AUROC and AUPRC

In [57]:
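from sklearn.metrics import roc_auc_score, average_precision_score
# Map the text replies to 0/1; any reply other than 'true' counts as 0.
# A separate name keeps `results` intact so the cell can be re-run.
pred_labels = [1 if r.strip().lower() == 'true' else 0 for r in results]
test_labels = df.iloc[test_index]['VENTILATOR'].tolist()
# The scores are computed from hard 0/1 predictions, not probabilities
auroc = roc_auc_score(test_labels, pred_labels)
auprc = average_precision_score(test_labels, pred_labels)
print("AUROC:", auroc)
print("AUPRC:", auprc)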

AUROC: 0.4935389648323402
AUPRC: 0.55778213315265
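
The mapping used for scoring treats every reply that is not exactly `true` (after trimming and lowercasing) as a negative, so a reply such as `True.` or `Yes` would silently count as 0. A minimal sketch of a more explicit parser follows; `parse_answer` is a hypothetical helper, and the accepted words are an assumption about what the model might return.

```python
def parse_answer(text):
    """Map a free-text model reply to 1 (True), 0 (False), or None if unrecognised."""
    word = text.strip().strip(".!'\"").lower()
    if word in ("true", "yes"):
        return 1
    if word in ("false", "no"):
        return 0
    return None  # leave unparseable replies visible instead of guessing


replies = ["True", " false.", "Yes", "It depends"]
parsed = [parse_answer(r) for r in replies]
print(parsed)
```

Counting the `None` values before scoring shows how many replies did not follow the requested format.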


### Use OpenAI embeddings for ventilator prediction

- Write function to get the embeddings

In [58]:
def generate_embeddings(texts, model="text-embedding-ada-002"):
    embeddings = []
    for text in tqdm(texts):
        text = text.replace("\n", " ")
        response = client.embeddings.create(
            input=text,
            model=model
        )
        
        # Access embedding with the proper attribute notation for new OpenAI client
        embeddings.append(response.data[0].embedding)
        
    return np.array(embeddings)



- Get the embeddings of the training dataset

In [59]:
train_ds = VentilatorDataset(df.iloc[train_index])
embeddings = generate_embeddings(train_ds)

100%|██████████| 2884/2884 [18:58<00:00,  2.53it/s] 


In [60]:
np.shape(embeddings)

(2884, 1536)

- Get the labels from the training dataset

In [61]:
labels = df.iloc[train_index]['VENTILATOR'].tolist()

- Train a simple classifier using the embeddings

In [62]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(embeddings, labels)

- Test the performance

In [63]:
test_embeddings = generate_embeddings(test_ds)
test_labels = list(df.iloc[test_index]['VENTILATOR'])

test_pred = model.predict_proba(test_embeddings)[:,1]
auroc = roc_auc_score(test_labels, test_pred)
auprc = average_precision_score(test_labels, test_pred)
print('\nAUROC:', auroc, '\nAUPRC', auprc)

100%|██████████| 722/722 [04:40<00:00,  2.57it/s]


AUROC: 0.9218210850177202 
AUPRC 0.9341088693309437





### AutoGen

- Use AutoGen to create an agent that uses the embeddings generated from the training set.

In [64]:
!pip install "autogen-agentchat"
!pip install "autogen-ext[openai]"


In [65]:
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.ui import Console
from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_agentchat.conditions import TextMentionTermination
from autogen_agentchat.teams import RoundRobinGroupChat

In [66]:
import os
API_KEY = os.environ["OPENAI_API_KEY"]  # read the key from the environment rather than hard-coding it

In [67]:
# Install necessary packages
%pip install faiss-cpu openai autogen


- Generate the embeddings and store in the Vector DB

In [68]:

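import faiss
import numpy as np
import openai
# Note: this rebinds AssistantAgent, replacing the autogen_agentchat import above
from autogen import AssistantAgent

# Recomputes the training embeddings; the array from the earlier cell could be reused instead
embeddings = generate_embeddings(train_ds).astype('float32')  # FAISS works with float32 vectors

# Store embeddings in FAISS
dimension = embeddings.shape[1]  # length of embedding vector
index = faiss.IndexFlatL2(dimension)  # exact L2 (Euclidean) search
index.add(embeddings)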


100%|██████████| 2884/2884 [18:55<00:00,  2.54it/s]  


In [73]:
print("Number of vectors in the index:", index.ntotal)

Number of vectors in the index: 2884


- Define the retriever

In [88]:

def retrieve(question, top_k=200):
    query_vector = generate_embeddings([question]).astype('float32')  # pass the question as a list; FAISS works with float32
    D, I = index.search(query_vector, top_k)  # D: distances, I: indices of the nearest training prompts
    return [train_ds[i] for i in I[0]]  # Use the VentilatorDataset to get the original text
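
Conceptually, a flat L2 index compares the query with every stored vector and keeps the closest ones. The pure-Python sketch below illustrates that idea on made-up two-dimensional vectors; `l2_search` is a hypothetical helper, not part of FAISS, and it ranks by squared Euclidean distance, which gives the same ordering as the true distance.

```python
import heapq


def l2_search(vectors, query, top_k):
    """Return (squared_distance, index) pairs for the top_k vectors closest to query."""
    scored = (
        (sum((a - b) ** 2 for a, b in zip(vec, query)), i)
        for i, vec in enumerate(vectors)
    )
    return heapq.nsmallest(top_k, scored)


# Toy "embeddings" and the texts they belong to.
vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
texts = ["prompt A", "prompt B", "prompt C"]

hits = l2_search(vectors, query=[0.9, 1.2], top_k=2)
retrieved = [texts[i] for _, i in hits]
print(retrieved)
```

As in `retrieve`, the returned indices are mapped back to the stored texts. Those texts are the prompts only, so the retrieved context carries no ventilator labels.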


- Define the retrieval agent

In [89]:

# Define the AutoGen agent
class RetrievalAgent(AssistantAgent):
    def __init__(self, name, retriever, llm, *args, **kwargs):
        super().__init__(name=name, *args, **kwargs)
        self.retriever = retriever
        self.llm = llm

    def respond(self, user_query):
        context = self.retriever(user_query)
        context_str = "\n\n".join(context)
        prompt = f"Use the following context to answer:\n\n{context_str}\n\nQuestion: {user_query}"
        return self.llm(prompt)

# Define the LLM call (OpenAI)
def simple_llm(prompt):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Create the agent
agent = RetrievalAgent(
    name="retrieval_agent",
    retriever=retrieve,
    llm=simple_llm
)


- A simple question-answering interface with the virtual agent

In [90]:

# Ask questions to the agent
question = "How many patients needed ventilation with O2 saturation below 90%?"
answer = agent.respond(question)
print(answer)

question = "How to prevent COVID-19 patients from needing ventilation? Use bullet points"
answer = agent.respond(question)
print(answer)

question = "What is the average O2 saturation of patients who needed ventilation? Show me the calculation."
answer = agent.respond(question)
print(answer)

question = "how many patients are there in the dataset?"
answer = agent.respond(question)
print(answer)


100%|██████████| 1/1 [00:00<00:00,  2.91it/s]


4 patients needed ventilation with oxygen saturation below 90%


100%|██████████| 1/1 [00:00<00:00,  1.04it/s]


- Encourage vaccination to reduce the risk of severe illness
- Practice good hygiene, such as washing hands frequently and wearing masks in public spaces
- Maintain physical distancing from others, especially in crowded or indoor settings
- Follow public health guidelines and recommendations
- Monitor symptoms closely and seek medical attention if symptoms worsen
- Stay informed about the latest updates and guidance from healthcare authorities


100%|██████████| 1/1 [00:00<00:00,  2.27it/s]


The average O2 saturation of patients who needed ventilation is 88.1.

Calculation: (87.0 + 88.7 + 88.5 + 88.7 + 88.1 + 88.7 + 88.7 + 88.5 + 88.5 + 88.5) / 10 = 881 / 10 = 88.1


100%|██████████| 1/1 [00:00<00:00,  4.24it/s]


Based on the context provided, there are 90 patients in the dataset.
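
Aggregate questions like the ones above can also be computed directly from the prepared DataFrame, which gives reference values to compare the agent's answers against. The sketch below uses a made-up four-row frame with the same column names as `df`; the numbers are illustrative only, and it assumes the observation column holds numeric values.

```python
import pandas as pd

# Toy stand-in for the prepared `df`; values are invented for illustration.
toy = pd.DataFrame({
    "PATIENT": ["p1", "p2", "p3", "p4"],
    "O2 Saturation": [88.0, 92.0, 86.0, 95.0],
    "VENTILATOR": [True, True, True, False],
})

ventilated = toy[toy["VENTILATOR"]]

n_patients = toy["PATIENT"].nunique()  # number of distinct patients
n_vent_below_90 = int((ventilated["O2 Saturation"] < 90).sum())  # ventilated and below 90%
mean_o2_vent = ventilated["O2 Saturation"].mean()  # average among ventilated patients

print(n_patients, n_vent_below_90, round(mean_o2_vent, 2))
```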
