# CS-598 DL4H Reproducibility Project: "Improving clinical outcome predictions using convolution over medical entities with multimodal learning"

Team 34: Kristine Cheng (cycheng4), Vanessa Chen (zhenc5), Sophia Yu (sophiay3)

## Project Location
* Google Drive Link: https://drive.google.com/drive/folders/1nlDMRbCBY27ygu5EKwvnyUR3SDempmbJ?usp=drive_link

* Github Link: https://github.com/zhenc5/CS598-Group-Project

* Video Presentation Link: https://drive.google.com/file/d/1ME7jcAtdW8Zc9a09tRz-feuGr8HITqM5/view?usp=sharing

## Mount Notebook to Google Drive


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## Introduction

*   **Background of the problem**

  It is crucial to assess a patient’s health by looking at their medical tests and predicting how they might fare during their stay in the ICU. The type of problem addressed in the paper is clinical outcome prediction, specifically predicting mortality (in-hospital & in-ICU) and length of ICU stay (LOS) (>3 days & >7 days).

  ![task_prediction.png](https://drive.google.com/uc?export=view&id=1Mfvjp_7vDHGOAF5SqUyOnfMb2tQQuhUM)
  
  Figure 1. Definitions of the clinical prediction tasks (as shown in the paper [1])

  Predicting clinical outcomes is an important problem in healthcare, because it can help hospitals and healthcare providers reduce healthcare costs, improve patient outcomes by determining treatment methods, and optimize healthcare resource utilization. By accurately predicting LOS and mortality, healthcare providers can provide targeted interventions to those at high risk, thereby leading to better patient outcomes and more efficient use of hospital resources.

  One of the major difficulties in predicting these clinical outcomes from electronic health records (EHR) is standardizing the preprocessing steps, such as handling missing data and outliers, converting units, and transforming raw data into features usable by deep learning algorithms [1]. Additionally, previous studies that predict these clinical outcomes have used only structured patient data, such as historical patient diagnoses (ICD codes) [5, 6] and lab results and other measurements taken in the ICU [7-9]. To improve the accuracy of the predictions, unstructured clinical notes can be added to the deep learning model. However, extracting medical entities from unstructured clinical notes presents its own challenges, because the notes are free text that usually contains grammatical errors, shorthand, medical jargon, and redundant information [1].

  State-of-the-art deep learning methods for predicting clinical outcomes from EHR data include Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, because they are effective at learning from sequence data. Lipton et al. demonstrated the effectiveness of an LSTM in modeling clinical data, specifically in classifying 128 diagnoses using 13 clinical measurements [10]. Choi et al. showed promising results in predicting multi-label diagnoses for a patient's next visit using a GRU-based model called DoctorAI [5].

*  **Paper explanation**

  Electronic Health Record (EHR) data is commonly used in deep learning applications for clinical outcome prediction. However, traditional approaches often overlook the unstructured data within the EHR, such as clinical notes and radiology reports. Bardak and Tan [1] address this issue by exploring methods to improve two common risk prediction tasks: mortality and length of ICU stay (LOS). The paper proposes a deep learning method that extracts medical entities from clinical notes and integrates them into prediction models using a convolution-based multimodal architecture. Additionally, the authors evaluate different embedding techniques, such as Word2Vec and FastText, on the medical entities.

  The innovative feature of the proposed method is the use of a CNN architecture to capture local patterns in the EHR data and in the medical entity embeddings of the clinical notes. The features learned by the CNN are then combined with features extracted from the time-series data to make the predictions.

  The results show that the proposed method outperforms the baseline models on all 4 clinical outcome predictions in terms of AUC-ROC, AUPRC, and F1 score, with the exception of LOS >7 days, where the F1 score was greater for the baseline model.

  Overall, the paper makes an important contribution to research on clinical outcome prediction by introducing a novel approach that not only enhances the accuracy of these predictions but can also be adapted to other clinical outcome prediction tasks. Applying convolution to medical entities extracted from EHR clinical notes, in conjunction with multimodal learning, is an important step forward in the development of predictive models for clinical outcomes.


## Scope of Reproducibility:

The hypotheses to be tested and the corresponding experiments to be run are listed below:
  1. The proposed convolution-based multimodal architecture outperforms the baseline multimodal architecture for each of the 4 clinical outcome prediction tasks.
    * Various embedding techniques, such as Word2Vec, FastText, and the concatenation of Word2Vec and FastText embeddings on the EHR clinical notes, will also be compared between the baseline multimodal architecture and the proposed convolution-based architecture.

  2. The baseline multimodal model shows an improved prediction performance compared to the baseline time-series GRU on each of the 4 clinical outcome tasks.
    * Various embedding techniques, such as Word2Vec, FastText, and the concatenation of Word2Vec and FastText embeddings on the EHR clinical notes, will also be compared among the baseline models.



## Methodology

### Environment Setup
- Python version 3.10.12
- Dependent packages needed:
  - numpy
  - pandas
  - tables
  - nltk
  - spacy
  - gensim
  - keras==2.10.0
  - scipy==1.10.1
  - tensorflow
  - scikit-learn
  - (med7 pre-trained model) https://huggingface.co/kormilitzin/en_core_med7_lg/resolve/main/en_core_med7_lg-any-py3-none-any.whl

To install the dependencies, use the following command:


In [None]:
# pip install -r requirements.txt

Then import the necessary packages:

In [None]:
# Import necessary packages

from google.colab import drive
import os
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')

import re
import spacy
from nltk import sent_tokenize, word_tokenize, punkt

from gensim.models import Word2Vec, FastText

import collections
import gc

import keras
from keras import backend as K
from keras import regularizers
from keras.models import Sequential, Model
from keras.layers import Flatten, Dense, Dropout, Input, concatenate, Activation, Concatenate, LSTM, GRU
from keras.layers import Conv2D, MaxPooling2D, UpSampling2D, Conv1D, BatchNormalization, Convolution1D
from keras.layers import UpSampling1D, MaxPooling1D, GlobalMaxPooling1D, GlobalAveragePooling1D, MaxPool1D
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint, History, ReduceLROnPlateau
from tensorflow.python.keras.utils import np_utils
from tensorflow.python.keras.backend import set_session, clear_session, get_session

import tensorflow as tf

from sklearn.utils import class_weight
from sklearn.metrics import average_precision_score, roc_auc_score, accuracy_score, f1_score

from logging import NullHandler

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Data

**Data download instructions**

In order to implement the paper's code, 3 folders must first be created in the directory and named "data", "embeddings", and "results". Within the "results" folder, 2 new folders called "cnn" and "multimodal" must also be created.
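The folder layout described above can also be created programmatically. The following is a minimal sketch, not part of the original pipeline; the helper name `create_project_dirs` is hypothetical, and the example runs in a temporary directory that stands in for the real project root.

```python
import os
import tempfile

def create_project_dirs(root):
    """Create the data/embeddings/results layout described above (hypothetical helper)."""
    subdirs = [
        "data",
        "embeddings",
        os.path.join("results", "cnn"),
        os.path.join("results", "multimodal"),
    ]
    for sub in subdirs:
        # exist_ok=True makes the call safe to repeat on an existing layout
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    return [os.path.join(root, sub) for sub in subdirs]

# Demonstrated in a throwaway directory; in practice, pass the project root instead
project_root = tempfile.mkdtemp()
created_dirs = create_project_dirs(project_root)
```

Creating the nested "results" paths with `os.makedirs` also creates the parent "results" folder, so no separate step is needed for it.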

* Download the MIMIC-III dataset (specifically ADMISSIONS.csv, NOTEEVENTS.csv, ICUSTAYS.csv) via https://mimic.physionet.org/ and place in the "data" folder
* Download the MIMIC-Extract implementation (called "all_hourly_data.h5") via https://github.com/MLforHealth/MIMIC_Extract and place in the "data" folder
* The med7 implementation should already be installed via requirements.txt. Source: https://github.com/kormilitzin/med7
* Download the pre-trained Word2Vec and FastText embeddings via https://github.com/kexinhuang12345/clinicalBERT and place in the "embeddings" folder

**Source of the data**

The data comes from running the MIMIC-III data [2-4] through the MIMIC-Extract pipeline. Since we only needed the output of this pipeline, we directly downloaded a version preprocessed with default parameters from their Github page [4, 11]. The dataset is stored in the "data" folder as all_hourly_data.h5. We also use ADMISSIONS.csv, NOTEEVENTS.csv, and ICUSTAYS.csv from the MIMIC-III dataset.


**Statistics**


**For the time series data:**

The MIMIC-III dataset contains EHR data of 58,976 unique hospital admissions and 61,532 ICU admissions from 46,520 patients.

The MIMIC-Extract dataset contains a patient's first ICU visit and already eliminates patients with ages < 15 years and where the LOS is not between 12 hours and 10 days [1]. It contains 34,472 patients and 104 time-series variables.

Then, we drop any patients who do not have at least 30 hours of data. We also drop any clinical notes that do not contain chart time information and any patients that do not have any clinical notes in 24 hours. This leads to a final cohort, after clinical note elimination, of 23,944 records of patients, hospital admissions, and ICU admissions.

**For the medical entities data:**

The paper reported the final unique counts of the 7 medical entities (Drug, Strength, Form, Route, Dosage, Frequency, Duration) as 18268, 10749, 597, 1193, 7239, 3344, and 1185, respectively.

**Data process**

The time-series features are taken from the first 24 hours of data. The three MIMIC-III csv files (ADMISSIONS, NOTEEVENTS, and ICUSTAYS) should be placed in the "data" folder.

The medical entities from the clinical notes are used to enhance the prediction performance. To extract the medical entities, we used a pre-trained clinical named-entity recognition (NER) model, med7 [12], which extracts 7 different entities (Drug, Strength, Duration, Route, Form, Dosage, Frequency). Then, we used the pre-trained Word2Vec and FastText embeddings [13] (stored in the "embeddings" folder) to convert the medical entities into word representations.

The train/valid/test split, for all clinical tasks, is based on class distribution with a 70%/10%/20% ratio.
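To illustrate what a split "based on class distribution" means, here is a minimal, dependency-free sketch of a stratified 70%/10%/20% split on a single binary label. It is not the code used in the original pipeline; the function name `stratified_split` and the toy cohort are assumptions made for illustration, and how the original work handles stratification across the four tasks is not shown here.

```python
import random
from collections import defaultdict

def stratified_split(ids, labels, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split ids into train/dev/test so each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample_id, label in zip(ids, labels):
        by_class[label].append(sample_id)

    train, dev, test = [], [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_train = round(ratios[0] * len(members))
        n_dev = round(ratios[1] * len(members))
        train.extend(members[:n_train])
        dev.extend(members[n_train:n_train + n_dev])
        # Whatever remains of this class goes to the test set
        test.extend(members[n_train + n_dev:])
    return train, dev, test

# Toy cohort (hypothetical): 1,000 stays with a 10% positive rate
toy_ids = list(range(1000))
toy_labels = [1 if i < 100 else 0 for i in toy_ids]
toy_train, toy_dev, toy_test = stratified_split(toy_ids, toy_labels)
```

Because each class is split separately, the positive rate in every partition stays close to the rate in the full cohort, which matters for imbalanced outcomes such as mortality.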


### Data Preprocessing Steps

In [None]:
DATAPATH = '/content/drive/MyDrive/CS598_Project/data'

#### 1. Extracting Time-Series Features and Preprocessing Clinical Notes

This step was executed locally on a PC and the output "preprocessed_notes.p" was uploaded to the data folder. The code is provided in "01-Extract-Timeseries-Features.ipynb" and "02+03-Preprocessing-Clinical-Notes.ipynb".


In [None]:
MIMIC_EXTRACT_DATA = os.path.join(DATAPATH, 'all_hourly_data.h5')
statistic = pd.read_hdf(MIMIC_EXTRACT_DATA, 'patients')
print(f"MIMIC-EXTRACT DATA (# of Patients & Hospital Admission & ICU Admission): {len(statistic)}")

# Check Time Series Data
lvl2_train_imputer = pd.read_pickle(os.path.join(DATAPATH, "lvl2_imputer_train.pkl"))
lvl2_dev_imputer = pd.read_pickle(os.path.join(DATAPATH, "lvl2_imputer_dev.pkl"))
lvl2_test_imputer = pd.read_pickle(os.path.join(DATAPATH,"lvl2_imputer_test.pkl"))
Ys = pd.read_pickle(os.path.join(DATAPATH, "Ys.pkl"))

patients_ids = []
for entry in Ys.index:
  patients_ids.append(entry[0])

print(f"MIMIC-EXTRACT DATA after preprocessing dataset (# of Patients & Hospital Admission & ICU Admission): {len(patients_ids)}")

print("Shape of train, dev, test datasets: {}, {}, {}.".format((lvl2_train_imputer.shape), (lvl2_dev_imputer.shape), (lvl2_test_imputer.shape)))
print("After applying time series feature (24 hours), train, dev, and test statistic: {}, {}, {}".format((lvl2_train_imputer.shape[0] / 24), (lvl2_dev_imputer.shape[0] / 24), (lvl2_test_imputer.shape[0] / 24)))


MIMIC-EXTRACT DATA (# of Patients & Hospital Admission & ICU Admission): 34472
MIMIC-EXTRACT DATA after preprocessing dataset (# of Patients & Hospital Admission & ICU Admission): 23944
Shape of train, dev, test datasets: (402240, 312), (57456, 312), (114960, 312).
After applying time series feature (24 hours), train, dev, and test statistic: 16760.0, 2394.0, 4790.0


#### 2. Extract Medical Entities in Clinical Notes
This step was executed locally on a PC, and the resulting output file "ner_df.p" was uploaded to the data folder. The code for this process is provided in the notebook titled "04 - Apply-med7-on-Clinical-Notes.ipynb."

In [None]:
med7_ner_data = pd.read_pickle(os.path.join(DATAPATH, 'new_ner_word_dict.pkl'))

# Check that med7 has 7 different entities
unique_categories = set()

for values in med7_ner_data.values():
    for item in values:
        category = item[1]
        unique_categories.add(category)

print(f"Entities in med7 NER model: {unique_categories}\n")

# Print the unique counts of each entity
unique_values_per_category = {category: set() for category in unique_categories}

for values in med7_ner_data.values():
    for item in values:
        value, category = item
        unique_values_per_category[category].add(value)

print("Unique count for each med7 entity:")
for category, unique_values in unique_values_per_category.items():
    print(f"{category}: {len(unique_values)}")

Entities in med7 NER model: {'DURATION', 'DRUG', 'ROUTE', 'FORM', 'FREQUENCY', 'DOSAGE', 'STRENGTH'}

Unique count for each med7 entity:
DURATION: 1678
DRUG: 17502
ROUTE: 1149
FORM: 885
FREQUENCY: 5444
DOSAGE: 4191
STRENGTH: 10067


#### 3. Represent Entities with Different Embeddings

This step was executed locally on a PC and the outputs "new_ner_word_dict.pkl", "new_ner_word2vec_dict.pkl", "new_ner_fasttext_dict.pkl", and "new_ner_combined_dict.pkl" were uploaded to the data folder. The code is provided in "05-Represent-Entities-With-Different-Embeddings.ipynb."

In [None]:
w2v = pd.read_pickle(os.path.join(DATAPATH, "new_ner_word2vec_dict.pkl"))
ft = pd.read_pickle(os.path.join(DATAPATH, "new_ner_fasttext_dict.pkl"))
combined = pd.read_pickle(os.path.join(DATAPATH, "new_ner_combined_dict.pkl"))

print(f"Number of Word2Vec embeddings: {len(w2v)}")
print(f"Number of Fast Text embeddings: {len(ft)}")
print(f"Number of concatenated embeddings (Word2Vec + FastText): {len(combined)}")

Number of Word2Vec embeddings: 31732
Number of Fast Text embeddings: 31461
Number of concatenated embeddings (Word2Vec + FastText): 32108


#### 4. Create Timeseries Data

This step was executed locally on a PC and the output files were uploaded to the data folder. The code is provided in "06-Create-Timeseries-Data.ipynb."

In [None]:
#Output files created in "06-Create-Timeseries-Data.ipynb."

new_train_ids = pd.read_pickle(os.path.join(DATAPATH, "new_train_ids.pkl"))
new_dev_ids = pd.read_pickle(os.path.join(DATAPATH, "new_dev_ids.pkl"))
new_test_ids = pd.read_pickle(os.path.join(DATAPATH, "new_test_ids.pkl"))

x_train = pd.read_pickle(os.path.join(DATAPATH, "new_x_train.pkl"))
x_dev = pd.read_pickle(os.path.join(DATAPATH, "new_x_dev.pkl"))
x_test = pd.read_pickle(os.path.join(DATAPATH, "new_x_test.pkl"))

y_train = pd.read_pickle(os.path.join(DATAPATH, "new_y_train.pkl"))
y_dev = pd.read_pickle(os.path.join(DATAPATH, "new_y_dev.pkl"))
y_test = pd.read_pickle(os.path.join(DATAPATH, "new_y_test.pkl"))

print("new_train_ids size:", len(new_train_ids))
print("new_dev_ids size:", len(new_dev_ids))
print("new_test_ids size:", len(new_test_ids))

print("x_train shape:", x_train.shape)
print("x_dev shape:", x_dev.shape)
print("x_test shape:", x_test.shape)

print("y_train shape:", y_train.shape)
print("y_dev shape:", y_dev.shape)
print("y_test shape:", y_test.shape)


new_train_ids size: 15567
new_dev_ids size: 2216
new_test_ids size: 4420
x_train shape: (15567, 24, 104)
x_dev shape: (2216, 24, 104)
x_test shape: (4420, 24, 104)
y_train shape: (15567, 4)
y_dev shape: (2216, 4)
y_test shape: (4420, 4)


## Model

The models below were all run locally on a PC, and the output is saved in the results folder. The full code of the models can be found in the notebooks "07-Timeseries-Baseline.ipynb", "08-Multimodal-Baseline.ipynb", and "09-Proposed-Model.ipynb".

To demonstrate the implementation, we have reduced the number of epochs from 100, as stated in the original paper, to 3 in this notebook, and the iteration setting `iter_num` from 11 to 2. Because the training loops use `range(1, iter_num)`, this corresponds to a single run here instead of 10. We also commented out the FastText and combined embeddings to reduce runtime. The full code contains all 3 embeddings.

The figures explaining the model architectures are all adapted from the original paper.

 *Citation of original paper:*

*Bardak B, Tan M, "Improving clinical outcome predictions using convolution over medical entities with multimodal learning", Artificial Intelligence in Medicine, 2021, 117:0933-3657, doi:https://doi.org/10.1016/j.artmed.2021.102112.*

*Link to original paper's repository:*

https://github.com/tanlab/ConvolutionMedicalNer/tree/master

In [None]:
# Import datasets
type_of_ner = "new"
x_train_lstm = pd.read_pickle(os.path.join(DATAPATH, type_of_ner+"_x_train.pkl"))
x_dev_lstm = pd.read_pickle(os.path.join(DATAPATH, type_of_ner+"_x_dev.pkl"))
x_test_lstm = pd.read_pickle(os.path.join(DATAPATH, type_of_ner+"_x_test.pkl"))

y_train = pd.read_pickle(os.path.join(DATAPATH, type_of_ner+"_y_train.pkl"))
y_dev = pd.read_pickle(os.path.join(DATAPATH, type_of_ner+"_y_dev.pkl"))
y_test = pd.read_pickle(os.path.join(DATAPATH, type_of_ner+"_y_test.pkl"))

ner_word2vec = pd.read_pickle(os.path.join(DATAPATH, type_of_ner+"_ner_word2vec_limited_dict.pkl"))
ner_fasttext = pd.read_pickle(os.path.join(DATAPATH, type_of_ner+"_ner_fasttext_limited_dict.pkl"))
ner_concat = pd.read_pickle(os.path.join(DATAPATH, type_of_ner+"_ner_combined_limited_dict.pkl"))

train_ids = pd.read_pickle(os.path.join(DATAPATH, type_of_ner+"_train_ids.pkl"))
dev_ids = pd.read_pickle(os.path.join(DATAPATH, type_of_ner+"_dev_ids.pkl"))
test_ids = pd.read_pickle(os.path.join(DATAPATH, type_of_ner+"_test_ids.pkl"))


### 1. Time-series Baseline

We use a Gated Recurrent Unit (GRU) architecture to capture the temporal information between patient features. The GRU has 2 gates: a reset gate and an update gate. Predictions on mortality and LOS are made with a sigmoid classifier on top of 1 GRU layer with 256 hidden units.

<img src="https://drive.google.com/uc?export=view&id=11JItGwXIa0Iy2AUElXfhrcRX8P3KBgrx"
     align="center"
     width="500" />

Figure 2. Overview of time-series model architecture for predicting the In-Hospital Mortality, In-ICU Mortality, LOS >3, and LOS >7. The MIMIC-Extract pipeline extracts the time series features for the GRU model.
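For reference, the two gates mentioned above can be written out as follows. This is the standard textbook GRU formulation, given here as a sketch; the symbols ($W$, $U$, $b$, $w$) are our own notation rather than the paper's, and Keras's default GRU implementation differs slightly in where the reset gate is applied.

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big) && \text{(candidate state)} \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t && \text{(new hidden state)} \\
\hat{y} &= \sigma(w^\top h_T), \quad T = 24 && \text{(sigmoid classifier on the last state)}
\end{aligned}
$$

Here $x_t \in \mathbb{R}^{104}$ is the feature vector for hour $t$, $h_t \in \mathbb{R}^{256}$ is the hidden state, and $\odot$ denotes element-wise multiplication. The classifier has no bias term, matching `use_bias=False` in the code below.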

In [None]:
class TimeSeriesModel:
    def __init__(self, type_of_ner):
        self.type_of_ner = type_of_ner

    def reset_keras(self, model):
        sess = get_session()
        clear_session()
        sess.close()
        sess = get_session()

        try:
            del model # this is from global space - change this as you need
        except:
            pass
        gc.collect() # if it's done something you should see a number being outputted

    def make_prediction_timeseries(self, model, test_data):
        probs = model.predict(test_data)
        y_pred = [1 if i>=0.5 else 0 for i in probs]
        return probs, y_pred

    def save_scores_timeseries(self, predictions, probs, ground_truth, model_name,
                               problem_type, iteration, hidden_unit_size, type_of_ner):

        auc = roc_auc_score(ground_truth, probs)
        auprc = average_precision_score(ground_truth, probs)
        acc   = accuracy_score(ground_truth, predictions)
        F1    = f1_score(ground_truth, predictions)

        result_dict = {}
        result_dict['auc'] = auc
        result_dict['auprc'] = auprc
        result_dict['acc'] = acc
        result_dict['F1'] = F1

        file_name = str(hidden_unit_size)+"-"+model_name+"-"+problem_type+"-"+str(iteration)+"-"+type_of_ner+".p"

        result_path = "/content/drive/MyDrive/CS598_Project/results/"
        pd.to_pickle(result_dict, os.path.join(result_path, file_name))

        print("AUC: {}, AUPRC: {}, Accuracy: {}, F1 Score: {}".format(auc, auprc, acc, F1))

    def timeseries_model(self, layer_name, number_of_unit):
        K.clear_session()
        sequence_input = Input(shape=(24,104),  name = "timeseries_input")

        if layer_name == "LSTM":
            x = LSTM(number_of_unit)(sequence_input)
        else:
            x = GRU(number_of_unit)(sequence_input)

        #logits_regularizer = tf.keras.regularizers.l2(0.01)
        logits_regularizer = keras.regularizers.l2(0.01)
        sigmoid_pred = Dense(1, activation='sigmoid',use_bias=False,
                             kernel_initializer=tf.keras.initializers.GlorotUniform(),
                             kernel_regularizer=logits_regularizer)(x)

        model = Model(inputs=sequence_input, outputs=sigmoid_pred)
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
        return model

    def train_models(self):
        epoch_num = 3
        model_patience = 3
        monitor_criteria = 'val_loss'
        batch_size = 128

        unit_sizes = [256]
        iter_num = 2
        target_problems = ['mort_hosp', 'mort_icu', 'los_3', 'los_7']
        layers = ["GRU"]

        for each_layer in layers:
            print("Layer: ", each_layer)
            for each_unit_size in unit_sizes:
                print("Hidden unit: ", each_unit_size)
                for iteration in range(1, iter_num):
                    print("Iteration number: ", iteration)
                    print("=============================")

                    for each_problem in target_problems:
                        print ("Problem type: ", each_problem)
                        print ("__________________")


                        early_stopping_monitor = EarlyStopping(monitor=monitor_criteria, patience=model_patience)
                        best_model_name = str(each_layer)+"-"+str(each_unit_size)+"-"+str(each_problem)+"-"+"best_model.keras"
                        checkpoint = ModelCheckpoint(best_model_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

                        callbacks = [early_stopping_monitor, checkpoint]

                        model = self.timeseries_model(each_layer, each_unit_size)
                        model.fit(x_train_lstm, y_train[each_problem], epochs=epoch_num, verbose=1,
                                  validation_data=(x_dev_lstm, y_dev[each_problem]), callbacks=callbacks, batch_size= batch_size)

                        model.load_weights(best_model_name)

                        probs, predictions = self.make_prediction_timeseries(model, x_test_lstm)
                        #save_scores_timeseries(predictions, probs, y_test[each_problem].values,str(each_layer),
                        #                       each_problem, iteration, each_unit_size,type_of_ner)
                        self.reset_keras(model)
                        #del model
                        clear_session()
                        gc.collect()

model = TimeSeriesModel(type_of_ner)
model.train_models()


Layer:  GRU
Hidden unit:  256
Iteration number:  1
Problem type:  mort_hosp
__________________
Epoch 1/3
Epoch 1: val_loss improved from inf to 0.24983, saving model to GRU-256-mort_hosp-best_model.keras
Epoch 2/3
Epoch 2: val_loss improved from 0.24983 to 0.24191, saving model to GRU-256-mort_hosp-best_model.keras
Epoch 3/3
Epoch 3: val_loss improved from 0.24191 to 0.23405, saving model to GRU-256-mort_hosp-best_model.keras
Problem type:  mort_icu
__________________
Epoch 1/3
Epoch 1: val_loss improved from inf to 0.18931, saving model to GRU-256-mort_icu-best_model.keras
Epoch 2/3
Epoch 2: val_loss improved from 0.18931 to 0.17323, saving model to GRU-256-mort_icu-best_model.keras
Epoch 3/3
Epoch 3: val_loss improved from 0.17323 to 0.16872, saving model to GRU-256-mort_icu-best_model.keras
Problem type:  los_3
__________________
Epoch 1/3
Epoch 1: val_loss improved from inf to 0.63724, saving model to GRU-256-los_3-best_model.keras
Epoch 2/3
Epoch 2: val_loss improved from 0.63724 

### 2. Multimodal Baseline

The multimodal approach tries to improve upon the prediction performance by using 2 inputs. One is the time-series representation produced by the GRU (explained in the time-series baseline section). The other is the averaged representation of the medical entities derived from the patients' clinical notes. The 2 inputs are merged and passed into a fully connected layer with 256 hidden units, and predictions are made with a sigmoid classifier.

We use the pre-trained NER model, med7, to extract different named medical entities and represent them with 3 different embedding methods (a pretrained Word2Vec model, a pretrained FastText model, and the combination of the two).

<img src="https://drive.google.com/uc?export=view&id=1Roik2z3jOon_BXq-AOdr6z2ddytSIIPO"
     align="center"
     width="700" />

Figure 3. Overview of multimodal architecture for predicting the In-Hospital Mortality, In-ICU Mortality, LOS >3, and LOS >7. The MIMIC-Extract pipeline extracts the time series features for the GRU model. We also pass the preprocessed clinical notes through med7 to get the NER entities, which in turn are passed through different word embeddings to get the medical entity representations. An averaging of these representations gives a low-dimensional representation. Finally, we combine the time-series features with the low-dimensional medical entities to pass through a Dense layer (256 hidden units) in order to make a binary prediction on the 4 clinical tasks.
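The averaging step described above reduces a variable number of entity vectors per patient to one fixed-size vector. The following is a minimal, dependency-free sketch of that idea; the function name `average_embeddings` and the 3-dimensional toy vectors are hypothetical (the model code in this notebook uses 100-dimensional inputs and `np.mean`).

```python
def average_embeddings(entity_vectors):
    """Element-wise mean of a patient's entity vectors, giving one fixed-size vector."""
    if not entity_vectors:
        raise ValueError("patient has no entity vectors to average")
    dim = len(entity_vectors[0])
    count = len(entity_vectors)
    return [sum(vec[d] for vec in entity_vectors) / count for d in range(dim)]

# Hypothetical patient with two extracted entities, each a 3-dimensional vector
toy_patient_entities = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
toy_patient_vector = average_embeddings(toy_patient_entities)
```

Whatever the number of entities found in a patient's notes, the result has the same length as a single embedding, which is what allows it to be concatenated with the GRU output.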

In [None]:
class MultimodalModel:
    def __init__(self, type_of_ner):
        self.type_of_ner = type_of_ner

    def reset_keras(self, model):
        sess = get_session()
        clear_session()
        sess.close()
        sess = get_session()

        try:
            del model # this is from global space - change this as you need
        except:
            pass
        gc.collect() # if it's done something you should see a number being outputted

    def create_dataset(self, dict_of_ner):
        """create the dataset"""
        temp_data = []
        for k, v in sorted(dict_of_ner.items()):
            temp = []
            for embed in v:
                temp.append(embed)
            temp_data.append(np.mean(temp, axis = 0))
        return np.asarray(temp_data)

    def make_prediction_multi_avg(self, model, test_data):
        probs = model.predict(test_data)
        y_pred = [1 if i>=0.5 else 0 for i in probs]
        return probs, y_pred

    def save_scores_multi_avg(self, predictions, probs, ground_truth, embed_name, problem_type, iteration, hidden_unit_size,
                                  sequence_name, type_of_ner):
        """save metrics of model"""
        auc = roc_auc_score(ground_truth, probs)
        auprc = average_precision_score(ground_truth, probs)
        acc   = accuracy_score(ground_truth, predictions)
        F1    = f1_score(ground_truth, predictions)

        result_dict = {}
        result_dict['auc'] = auc
        result_dict['auprc'] = auprc
        result_dict['acc'] = acc
        result_dict['F1'] = F1

        result_path = "results/multimodal"
        file_name = str(sequence_name)+"-"+str(hidden_unit_size)+"-"+embed_name
        file_name = file_name +"-"+problem_type+"-"+str(iteration)+"-"+type_of_ner+"-avg-.p"
        pd.to_pickle(result_dict, os.path.join(result_path, file_name))

        print(auc, auprc, acc, F1)

    def avg_ner_model(self, layer_name, number_of_unit, embedding_name):
        """define the model specifications"""

        #if embedding_name == "concat":
        #    input_dimension = 200
        #else:
        #    input_dimension = 100
        input_dimension = 100

        sequence_input = Input(shape=(24,104))
        input_avg = Input(shape=(input_dimension, ), name = "avg")

        if layer_name == "GRU":
            x = GRU(number_of_unit)(sequence_input)
        elif layer_name == "LSTM":
            x = LSTM(number_of_unit)(sequence_input)

        x = keras.layers.Concatenate()([x, input_avg])
        x = Dense(256, activation='relu')(x)
        x = Dropout(0.2)(x)

        #logits_regularizer = tf.contrib.layers.l2_regularizer(scale=0.01)
        logits_regularizer = keras.regularizers.l2(0.01)

        preds = Dense(1, activation='sigmoid',use_bias=False,
                      kernel_initializer=tf.keras.initializers.GlorotUniform(),
                      kernel_regularizer=logits_regularizer)(x)

        #opt = Adam(lr=0.001, decay = 0.01)
        # `learning_rate` replaces the deprecated `lr` keyword
        opt = tf.keras.optimizers.legacy.Adam(learning_rate=0.001, decay=0.01)
        model = Model(inputs=[sequence_input, input_avg], outputs=preds)
        model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['acc'])
        return model

    def train_models(self):
        embedding_types = ['word2vec']
        embedding_dict = [ner_word2vec]
        #embedding_types = ['word2vec', 'fasttext', 'concat']
        #embedding_dict = [ner_word2vec, ner_fasttext, ner_concat]
        target_problems = ['mort_hosp', 'mort_icu', 'los_3', 'los_7']

        num_epoch = 3
        model_patience = 5
        monitor_criteria = 'val_loss'
        batch_size = 64
        iter_num = 2
        unit_sizes = [256]

        layers = ["GRU"]
        for each_layer in layers:
            print ("Layer: ", each_layer)
            for each_unit_size in unit_sizes:
                print ("Hidden unit: ", each_unit_size)

                for embed_dict, embed_name in zip(embedding_dict, embedding_types):
                    print ("Embedding: ", embed_name)
                    print("=============================")

                    # Use the embedding selected by the loop rather than always ner_word2vec
                    temp_train_ner = dict((k, embed_dict[k]) for k in train_ids)
                    temp_dev_ner = dict((k, embed_dict[k]) for k in dev_ids)
                    temp_test_ner = dict((k, embed_dict[k]) for k in test_ids)

                    x_train_ner = self.create_dataset(temp_train_ner)
                    x_dev_ner = self.create_dataset(temp_dev_ner)
                    x_test_ner = self.create_dataset(temp_test_ner)

                    for iteration in range(1, iter_num):
                        print ("Iteration number: ", iteration)

                        for each_problem in target_problems:
                            print ("Problem type: ", each_problem)
                            print ("__________________")

                            early_stopping_monitor = EarlyStopping(monitor=monitor_criteria, patience=model_patience)
                            best_model_name = "avg-"+str(embed_name)+"-"+str(each_problem)+"-"+"best_model.keras"
                            checkpoint = ModelCheckpoint(best_model_name, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

                            callbacks = [early_stopping_monitor, checkpoint]

                            model = self.avg_ner_model(each_layer, each_unit_size, embed_name)
                            model.fit([x_train_lstm, x_train_ner], y_train[each_problem], epochs=num_epoch, verbose=1,
                                      validation_data=([x_dev_lstm, x_dev_ner], y_dev[each_problem]), callbacks=callbacks,
                                      batch_size=batch_size )

                            model.load_weights(best_model_name)

                            probs, predictions = self.make_prediction_multi_avg(model, [x_test_lstm, x_test_ner])

                            #save_scores_multi_avg(predictions, probs, y_test[each_problem],
                            #                      embed_name, each_problem, iteration, each_unit_size,
                            #                      each_layer, type_of_ner)

                            self.reset_keras(model)
                            #del model
                            clear_session()
                            gc.collect()

model = MultimodalModel(type_of_ner)
model.train_models()

Layer:  GRU
Hidden unit:  256
Embedding:  word2vec
Iteration number:  1
Problem type:  mort_hosp
__________________
Epoch 1/3
Epoch 1: val_loss improved from inf to 0.24225, saving model to avg-word2vec-mort_hosp-best_model.keras
Epoch 2/3
Epoch 2: val_loss improved from 0.24225 to 0.23569, saving model to avg-word2vec-mort_hosp-best_model.keras
Epoch 3/3
Epoch 3: val_loss did not improve from 0.23569
Problem type:  mort_icu
__________________
Epoch 1/3
Epoch 1: val_loss improved from inf to 0.17867, saving model to avg-word2vec-mort_icu-best_model.keras
Epoch 2/3
Epoch 2: val_loss did not improve from 0.17867
Epoch 3/3
Epoch 3: val_loss improved from 0.17867 to 0.17230, saving model to avg-word2vec-mort_icu-best_model.keras
Problem type:  los_3
__________________


KeyboardInterrupt: 

### 3. Proposed Model

The proposed model aims to further improve prediction performance on the 4 clinical tasks. We use a 1D Convolutional Neural Network (CNN) to extract features from the medical entities and combine them with recurrent and fully-connected layers to form a comprehensive patient representation. The CNN consists of 3 consecutive 1D convolutional layers with 32, 64, and 96 filters (kernel size 3), followed by a global max-pooling layer.

<img src="https://drive.google.com/uc?export=view&id=1B4oDgE3Zf8FQFQ68fngFSDEAloNsXSTy"
     align="center"
     width="700" />

Figure 4. Overview of the proposed CNN model architecture for predicting the In-Hospital Mortality, In-ICU Mortality, LOS >3, and LOS >7. The MIMIC-Extract pipeline extracts the time series features for the GRU model. We also pass the preprocessed clinical notes through med7 and different word embeddings to get the medical entity representations. A 1D CNN is then applied to get the final medical entity features. Finally, we combine the time-series features with the medical entity features to pass through a Dense layer (512 hidden units) in order to make a binary prediction on the 4 clinical tasks.
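
To make the tensor shapes in Figure 4 concrete, the following sketch traces the sequence length and channel count after each layer. It assumes the settings that appear in the notebook code (64 entity vectors of dimension 100, 'valid' padding, stride 1, kernel size 3, 32 base filters, 256 GRU units). The helpers `conv1d_out_len` and `trace_shapes` are hypothetical names introduced only for this illustration.

```python
def conv1d_out_len(length, kernel_size, stride=1, dilation=1):
    """Output length of a 1D convolution with 'valid' (no) padding."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (length - effective_kernel) // stride + 1


def trace_shapes(ner_limit=64, embed_dim=100, num_filter=32, gru_units=256):
    """Return (layer name, output shape without batch axis) for the CNN branch."""
    shapes = [("cnn_input", (ner_limit, embed_dim))]
    length = ner_limit
    # Three stacked Conv1D layers with num_filter, 2*num_filter, 3*num_filter filters
    for index, multiplier in enumerate((1, 2, 3), start=1):
        length = conv1d_out_len(length, kernel_size=3)
        shapes.append((f"conv1d_{index}", (length, num_filter * multiplier)))
    pooled = num_filter * 3  # global max pooling keeps one value per filter
    shapes.append(("global_max_pool", (pooled,)))
    # The pooled entity features are concatenated with the final GRU state
    shapes.append(("concat_with_gru", (gru_units + pooled,)))
    return shapes


for name, shape in trace_shapes():
    print(name, shape)
```

Each 'valid' convolution with kernel size 3 shortens the sequence by 2, so the entity axis shrinks from 64 to 58 before pooling, and the Dense layer receives a 352-dimensional vector under these assumptions.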

In [None]:
class ProposedModel:
    def __init__(self, type_of_ner):
        self.type_of_ner = type_of_ner

    def make_prediction_cnn(self, model, test_data):
        """make model predictions"""
        probs = model.predict(test_data)
        y_pred = [1 if i>=0.5 else 0 for i in probs]
        return probs, y_pred

    def save_scores_cnn(self, predictions, probs, ground_truth,
                        embed_name, problem_type, iteration, hidden_unit_size,
                        sequence_name, type_of_ner):
        """save metrics from model predictions"""
        auc = roc_auc_score(ground_truth, probs)
        auprc = average_precision_score(ground_truth, probs)
        acc   = accuracy_score(ground_truth, predictions)
        F1    = f1_score(ground_truth, predictions)

        result_dict = {}
        result_dict['auc'] = auc
        result_dict['auprc'] = auprc
        result_dict['acc'] = acc
        result_dict['F1'] = F1

        result_path = "results/cnn/"
        file_name = str(sequence_name)+"-"+str(hidden_unit_size)+"-"+embed_name
        file_name = file_name +"-"+problem_type+"-"+str(iteration)+"-"+type_of_ner+"-cnn-.p"
        pd.to_pickle(result_dict, os.path.join(result_path, file_name))

        print(auc, auprc, acc, F1)

    def print_scores_cnn(self, predictions, probs, ground_truth, model_name, problem_type, iteration, hidden_unit_size):
        """print metric scores"""
        auc = roc_auc_score(ground_truth, probs)
        auprc = average_precision_score(ground_truth, probs)
        acc   = accuracy_score(ground_truth, predictions)
        F1    = f1_score(ground_truth, predictions)

        print ("AUC: ", auc, "AUPRC: ", auprc, "F1: ", F1)

    def get_subvector_data(self, size, embed_name, data):
        """get subvector data"""
        #if embed_name == "concat":
        #    vector_size = 200
        #else:
        #   vector_size = 100
        vector_size = 100

        x_data = {}

        for k, v in data.items():
            number_of_additional_vector = len(v) - size
            vector = []
            for i in v:
                vector.append(i)
            if number_of_additional_vector < 0:
                number_of_additional_vector = np.abs(number_of_additional_vector)

                temp = vector[:size]
                for i in range(0, number_of_additional_vector):
                    temp.append(np.zeros(vector_size))
                x_data[k] = np.asarray(temp)
            else:
                x_data[k] = np.asarray(vector[:size])
        return x_data

    def proposedmodel(self, layer_name, number_of_unit, embedding_name, ner_limit, num_filter):
        """define model specifications"""
        #if embedding_name == "concat":
        #    input_dimension = 200
        #else:
        #    input_dimension = 100
        input_dimension = 100

        sequence_input = Input(shape=(24,104))
        input_img = Input(shape=(ner_limit, input_dimension), name = "cnn_input")

        # Three stacked Conv1D layers with num_filter, 2*num_filter, and 3*num_filter filters

        text_conv1d = Conv1D(filters=num_filter, kernel_size=3,
                             padding = 'valid', strides = 1, dilation_rate=1, activation='relu',
                             kernel_initializer=tf.keras.initializers.GlorotUniform())(input_img)

        text_conv1d = Conv1D(filters=num_filter*2, kernel_size=3,
                             padding = 'valid', strides = 1, dilation_rate=1, activation='relu',
                             kernel_initializer=tf.keras.initializers.GlorotUniform())(text_conv1d)

        text_conv1d = Conv1D(filters=num_filter*3, kernel_size=3,
                             padding = 'valid', strides = 1, dilation_rate=1, activation='relu',
                             kernel_initializer=tf.keras.initializers.GlorotUniform())(text_conv1d)

        text_embeddings = GlobalMaxPooling1D()(text_conv1d)

        if layer_name == "GRU":
            x = GRU(number_of_unit)(sequence_input)
        elif layer_name == "LSTM":
            x = LSTM(number_of_unit)(sequence_input)

        concatenated = keras.layers.Concatenate()([x, text_embeddings])
        concatenated = Dense(512, activation='relu')(concatenated)
        concatenated = Dropout(0.2)(concatenated)

        logits_regularizer = keras.regularizers.l2(0.01)
        preds = Dense(1, activation='sigmoid',use_bias=False,
                      kernel_initializer=tf.keras.initializers.GlorotUniform(),
                      kernel_regularizer=logits_regularizer)(concatenated)

        # `learning_rate` replaces the deprecated `lr` argument name
        opt = tf.keras.optimizers.legacy.Adam(learning_rate=1e-3, decay=0.01)

        model = Model(inputs=[sequence_input, input_img], outputs=preds)
        model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['acc'])
        return model

    def train_models(self):
        embedding_types = ['word2vec']
        embedding_dict = [ner_word2vec]
        #embedding_types = ['word2vec', 'fasttext', 'concat']
        #embedding_dict = [ner_word2vec, ner_fasttext, ner_concat]
        target_problems = ['mort_hosp', 'mort_icu', 'los_3', 'los_7']

        num_epoch = 3
        model_patience = 5
        monitor_criteria = 'val_loss'
        batch_size = 64

        filter_number = 32
        ner_representation_limit = 64
        activation_func = "relu"

        sequence_model = "GRU"
        sequence_hidden_unit = 256

        maxiter = 2
        for embed_dict, embed_name in zip(embedding_dict, embedding_types):
            print ("Embedding: ", embed_name)
            print("=============================")

            temp_train_ner = {k: embed_dict[k] for k in train_ids if k in embed_dict}
            temp_dev_ner = {k: embed_dict[k] for k in dev_ids if k in embed_dict}
            temp_test_ner = {k: embed_dict[k] for k in test_ids if k in embed_dict}

            x_train_dict = {}
            x_dev_dict = {}
            x_test_dict = {}

            x_train_dict = self.get_subvector_data(ner_representation_limit, embed_name, temp_train_ner)
            x_dev_dict = self.get_subvector_data(ner_representation_limit, embed_name, temp_dev_ner)
            x_test_dict = self.get_subvector_data(ner_representation_limit, embed_name, temp_test_ner)

            # Sort dictionaries and convert values to NumPy arrays
            x_train_dict_sorted = {k: v for k, v in sorted(x_train_dict.items())}
            x_dev_dict_sorted = {k: v for k, v in sorted(x_dev_dict.items())}
            x_test_dict_sorted = {k: v for k, v in sorted(x_test_dict.items())}

            x_train_ner = np.asarray(list(x_train_dict_sorted.values()))
            x_dev_ner = np.asarray(list(x_dev_dict_sorted.values()))
            x_test_ner = np.asarray(list(x_test_dict_sorted.values()))

            for iteration in range(1,maxiter):
                print ("Iteration number: ", iteration)

                for each_problem in target_problems:
                    print ("Problem type: ", each_problem)
                    print ("__________________")

                    early_stopping_monitor = EarlyStopping(monitor=monitor_criteria, patience=model_patience)
                    best_model_name = str(ner_representation_limit)+"-basiccnn1d-"+str(embed_name)+"-"+str(each_problem)+"-"+"best_model.keras"
                    checkpoint = ModelCheckpoint(best_model_name, monitor=monitor_criteria, verbose=1, save_best_only=True, mode='min')
                    reduce_lr = ReduceLROnPlateau(monitor=monitor_criteria, factor=0.2, patience=2, min_lr=0.00001, min_delta=1e-4, mode='min')
                    callbacks = [early_stopping_monitor, checkpoint, reduce_lr]

                    model = self.proposedmodel(sequence_model, sequence_hidden_unit, embed_name, ner_representation_limit,filter_number)
                    model.fit([x_train_lstm, x_train_ner], y_train[each_problem], epochs=num_epoch, verbose=1,
                              validation_data=([x_dev_lstm, x_dev_ner], y_dev[each_problem]), callbacks=callbacks, batch_size=batch_size)

                    # Restore the checkpoint with the lowest validation loss before predicting on the test set
                    model.load_weights(best_model_name)

                    probs, predictions = self.make_prediction_cnn(model, [x_test_lstm, x_test_ner])
                    #self.save_scores_cnn(predictions, probs, y_test[each_problem], embed_name, each_problem, iteration,
                    #                sequence_hidden_unit, sequence_model, type_of_ner)
                    del model
                    clear_session()
                    gc.collect()

model = ProposedModel(type_of_ner)
model.train_models()


Embedding:  word2vec
Iteration number:  1
Problem type:  mort_hosp
__________________
Epoch 1/3
Epoch 1: val_loss improved from inf to 0.24132, saving model to 64-basiccnn1d-word2vec-mort_hosp-best_model.keras
Epoch 2/3
Epoch 2: val_loss improved from 0.24132 to 0.23845, saving model to 64-basiccnn1d-word2vec-mort_hosp-best_model.keras
Epoch 3/3
Epoch 3: val_loss did not improve from 0.23845
Problem type:  mort_icu
__________________
Epoch 1/3
Epoch 1: val_loss improved from inf to 0.17983, saving model to 64-basiccnn1d-word2vec-mort_icu-best_model.keras
Epoch 2/3
Epoch 2: val_loss improved from 0.17983 to 0.17660, saving model to 64-basiccnn1d-word2vec-mort_icu-best_model.keras
Epoch 3/3
Epoch 3: val_loss improved from 0.17660 to 0.17497, saving model to 64-basiccnn1d-word2vec-mort_icu-best_model.keras
Problem type:  los_3
__________________
Epoch 1/3
Epoch 1: val_loss improved from inf to 0.63443, saving model to 64-basiccnn1d-word2vec-los_3-best_model.keras
Epoch 2/3
Epoch 2: val_lo

## Training

**Setting**

Each model was trained for 100 epochs (subject to early stopping on the validation loss), and training was repeated for 10 iterations. Each epoch took approximately 15 seconds on average. In total, the project took approximately 25 hours of compute time: approximately 21 hours to apply Med7 to the clinical notes, with the remaining hours divided among training and testing the time-series models, the multimodal models, and the proposed model.

To compare the effect of the proposed model enhancements, we trained the time-series baseline model and the multimodal baseline model exclusively with a single GRU layer of 256 units.

The hyperparameters used in the proposed model include:
- a stack of 3 1D convolution layers with ReLU activation, number of filters=[32, 64, 96], and kernel size=3, followed by a global max-pooling layer
- the max-pooled features are concatenated with the output of a 1-layer GRU with 256 hidden units, then passed through a Dense layer (units=512, ReLU activation) and a Dropout layer with p=0.2
- training used a batch size of 64, the Adam optimizer (learning rate=0.001, decay=0.01), and the binary cross-entropy loss function
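
As a rough illustration of what the `decay` argument does, the sketch below assumes the inverse-time schedule used by the legacy Keras optimizers, where the effective learning rate after `step` optimizer updates is `lr / (1 + decay * step)`. It uses the values passed to Adam in the model code (learning rate 1e-3, decay 0.01). The function name `decayed_lr` is hypothetical, and the formula should be checked against the installed TensorFlow version.

```python
def decayed_lr(initial_lr, decay, step):
    """Inverse-time decay: effective learning rate after `step` optimizer updates."""
    return initial_lr / (1.0 + decay * step)


# `step` counts batches (optimizer updates), not epochs
for step in (0, 100, 1000):
    print(step, decayed_lr(initial_lr=1e-3, decay=0.01, step=step))
```

Under this assumption the learning rate halves after the first 100 batches, so most of the training runs at a much smaller step size than the nominal 0.001.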

The training code is included in each model's class, in the 'Model' section of this notebook. The training results are saved in the results folder.
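
The result files follow a hyphen-separated naming convention that the evaluation cells later split apart. The sketch below mirrors the pattern built in `save_scores_cnn` and shows how the fields are recovered. The helper names `build_result_name` and `parse_result_name` are hypothetical, and the value `"new"` for `type_of_ner` is an assumption inferred from the `-new-cnn-.p` suffix that the evaluation code filters on.

```python
def build_result_name(sequence_name, hidden_units, embed_name,
                      problem_type, iteration, type_of_ner):
    """Recreate the file name pattern used when saving CNN results."""
    return (f"{sequence_name}-{hidden_units}-{embed_name}-"
            f"{problem_type}-{iteration}-{type_of_ner}-cnn-.p")


def parse_result_name(file_name):
    """Split a result file name back into its fields."""
    parts = file_name.split("-")
    return {
        "category": parts[0] + "-" + parts[1],  # e.g. "GRU-256"
        "embedding": parts[2],
        "task": parts[3],
        "iteration": int(parts[4]),
    }


name = build_result_name("GRU", 256, "word2vec", "mort_hosp", 1, "new")
print(name)
print(parse_result_name(name))
```

Splitting on hyphens works here because the task names use underscores (`mort_hosp`, `los_3`); a task or embedding name containing a hyphen would shift the fields.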

**Computation Details**

The paper reports using an NVIDIA Tesla K80 GPU with 24 GB of VRAM, 378 GB of RAM, and an Intel Xeon E5 2683 processor [1]. Because the MIMIC-III dataset is so large, we ran the data preprocessing and the training/testing of each model in separate Jupyter notebooks stored locally on a PC with an Nvidia GeForce RTX 3080 Ti GPU (16 GB of GDDR6 VRAM), 32 GB of DDR5 RAM, and an Intel i9-12900H processor.

## Evaluation

We use 3 evaluation metrics to compare our models:
* Area Under the Receiver Operating Characteristic curve (AUROC), which is the area under the curve of the true positive rate plotted against the false positive rate.
* Area Under the Precision-Recall Curve (AUPRC), which is the area under the curve of precision plotted against recall.
* F1 score, which is the harmonic mean of precision and recall and therefore balances false positives and false negatives.
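
The threshold-based metrics can be illustrated with a small worked example. The confusion counts below are hypothetical (they are not results from this project) and were chosen to show how an imbalanced task can combine high accuracy with a very low F1 score. The function name `threshold_metrics` is introduced only for this sketch.

```python
def threshold_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical: 1000 patients, 80 positives, and a model that rarely predicts positive
scores = threshold_metrics(tp=2, fp=1, fn=78, tn=919)
for metric_name, value in scores.items():
    print(f"{metric_name}: {value:.4f}")
```

Because almost all negatives are classified correctly, accuracy stays above 0.9 while recall, and therefore F1, remains close to zero. This is why AUROC, AUPRC, and F1 are reported alongside accuracy.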

The evaluation code is provided as a method in each model's class, with a name similar to save_scores().

The results are then averaged across the 10 iterations for each performance metric and clinical task. The average and standard deviation are reported below.
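
One detail worth noting when reading the standard deviations: the evaluation cells call `np.std` with its default `ddof=0`, which gives the population standard deviation. With 10 runs, the sample standard deviation (`ddof=1`) is larger by a factor of sqrt(10/9). The sketch below shows the difference using Python's `statistics` module; the AUROC values are made up for illustration and are not results from this project.

```python
import math
import statistics

# Hypothetical AUROC values from 10 repeated runs (illustrative only)
auroc_runs = [0.875, 0.878, 0.872, 0.880, 0.876, 0.874, 0.879, 0.873, 0.877, 0.876]

mean = statistics.fmean(auroc_runs)
population_std = statistics.pstdev(auroc_runs)  # same definition as np.std(values)
sample_std = statistics.stdev(auroc_runs)       # same definition as np.std(values, ddof=1)

print(f"{mean:.4f} \u00B1 {population_std:.4f} (population)")
print(f"{mean:.4f} \u00B1 {sample_std:.4f} (sample)")
print(f"ratio sample/population: {sample_std / population_std:.4f}")
print(f"sqrt(10/9): {math.sqrt(10 / 9):.4f}")
```

Either convention is acceptable as long as it is stated; the tables in this notebook use the population form.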

In [None]:
# For the Time Series Baseline model
print("Time Series Baseline Model: Average +- Std Dev of Performance Metrics for Predicting Clinical Tasks")

# Define categories and metrics
categories = ["256-GRU"]
metrics = {"auc":"AUROC", "auprc":"AUPRC", "acc":"Accuracy", "F1":"F1"}
tasks = ["mort_hosp", "mort_icu", "los_3", "los_7"]

# Initialize dictionaries to store results
results = {category: {task: {metric_name: [] for metric, metric_name in metrics.items()} for task in tasks} for category in categories}

# Directory where pickle files are stored
directory = "/content/drive/MyDrive/CS598_Project/results/"

# Loop through each file
for filename in os.listdir(directory):
    if filename.endswith(".p"):
        parts = filename.split("-")
        category = parts[0] + "-" + parts[1]
        task = parts[2]
        if category in categories and task in tasks:
            result_dict = pd.read_pickle(os.path.join(directory, filename))
            for metric, metric_name in metrics.items():
                results[category][task][metric_name].append(result_dict[metric])

# Calculate average and standard deviation
for category in categories:
    print(f"Model: {category}")
    df_data = {task: {} for task in tasks}
    for task in tasks:
        task_data = {}
        for metric, metric_name in metrics.items():
            values = results[category][task][metric_name]
            mean = np.mean(values)
            std = np.std(values)
            task_data[metric_name] = f"{mean:.4f} \u00B1 {std:.4f}"
        df_data[task] = task_data
    df = pd.DataFrame(df_data).transpose()
    print(df)
    print()


Time Series Baseline Model: Average +- Std Dev of Performance Metrics for Predicting Clinical Tasks
Model: 256-GRU
                     AUROC            AUPRC         Accuracy               F1
mort_hosp  0.8757 ± 0.0031  0.5574 ± 0.0054  0.9139 ± 0.0011  0.4394 ± 0.0178
mort_icu   0.8827 ± 0.0031  0.5045 ± 0.0098  0.9404 ± 0.0015  0.4111 ± 0.0181
los_3      0.6958 ± 0.0026  0.6364 ± 0.0062  0.6610 ± 0.0035  0.5495 ± 0.0119
los_7      0.7304 ± 0.0054  0.2136 ± 0.0103  0.9187 ± 0.0005  0.0414 ± 0.0171



In [None]:
# For the Multimodal Baseline model
print("Multimodal Baseline Model: Average +- Std Dev of Performance Metrics for Predicting Clinical Tasks Using Different Embeddings")

# Define categories and metrics
categories = ["GRU-256"]
metrics = {"auc":"AUROC", "auprc":"AUPRC", "acc":"Accuracy", "F1":"F1"}
tasks = ["mort_hosp", "mort_icu", "los_3", "los_7"]
embeddings = ["word2vec", "fasttext", "concat"]

# Initialize dictionaries to store results
results = {category: {embedding: {task: {metric_name: [] for metric, metric_name in metrics.items()} for task in tasks} for embedding in embeddings} for category in categories}

# Directory where pickle files are stored
directory = "/content/drive/MyDrive/CS598_Project/results/multimodal/"

# Loop through each file
for filename in os.listdir(directory):
    if filename.endswith("-new-avg-.p"):
        parts = filename.split("-")
        category = parts[0] + "-" + parts[1]
        embedding = parts[2]
        task = parts[3]
        if category in categories and task in tasks and embedding in embeddings:
            result_dict = pd.read_pickle(os.path.join(directory, filename))
            for metric, metric_name in metrics.items():
                results[category][embedding][task][metric_name].append(result_dict[metric])

# Calculate average and standard deviation
for category in categories:
    print(f"Model: {category}")
    for embedding in embeddings:
        print(f"Embedding: {embedding}")
        df_data = {task: {} for task in tasks}
        for task in tasks:
            task_data = {}
            for metric, metric_name in metrics.items():
                values = results[category][embedding][task][metric_name]
                mean = np.mean(values)
                std = np.std(values)
                task_data[metric_name] = f"{mean:.4f} \u00B1 {std:.4f}"
            df_data[task] = task_data
        df = pd.DataFrame(df_data).transpose()
        print(df)
        print()

Multimodal Baseline Model: Average +- Std Dev of Performance Metrics for Predicting Clinical Tasks Using Different Embeddings
Model: GRU-256
Embedding: word2vec
                     AUROC            AUPRC         Accuracy               F1
mort_hosp  0.8833 ± 0.0018  0.5901 ± 0.0064  0.9183 ± 0.0015  0.4736 ± 0.0159
mort_icu   0.8900 ± 0.0018  0.5346 ± 0.0061  0.9440 ± 0.0010  0.4594 ± 0.0147
los_3      0.7075 ± 0.0018  0.6459 ± 0.0018  0.6679 ± 0.0047  0.5585 ± 0.0102
los_7      0.7357 ± 0.0052  0.2249 ± 0.0082  0.9199 ± 0.0008  0.0431 ± 0.0151

Embedding: fasttext
                     AUROC            AUPRC         Accuracy               F1
mort_hosp  0.8842 ± 0.0019  0.5933 ± 0.0047  0.9190 ± 0.0009  0.4807 ± 0.0107
mort_icu   0.8912 ± 0.0020  0.5360 ± 0.0042  0.9438 ± 0.0011  0.4519 ± 0.0153
los_3      0.7070 ± 0.0023  0.6446 ± 0.0035  0.6659 ± 0.0039  0.5618 ± 0.0076
los_7      0.7376 ± 0.0031  0.2299 ± 0.0054  0.9201 ± 0.0004  0.0417 ± 0.0125

Embedding: concat
                   

In [None]:
# For the Proposed CNN model
print("Proposed CNN Model: Average +- Std Dev of Performance Metrics for Predicting Clinical Tasks Using Different Embeddings")

# Define categories and metrics
categories = ["GRU-256"]
metrics = {"auc":"AUROC", "auprc":"AUPRC", "acc":"Accuracy", "F1":"F1"}
tasks = ["mort_hosp", "mort_icu", "los_3", "los_7"]
embeddings = ["word2vec", "fasttext", "concat"]

# Initialize dictionaries to store results
results = {category: {embedding: {task: {metric_name: [] for metric, metric_name in metrics.items()} for task in tasks} for embedding in embeddings} for category in categories}

# Directory where pickle files are stored
directory = "/content/drive/MyDrive/CS598_Project/results/cnn/"

# Loop through each file
for filename in os.listdir(directory):
    if filename.endswith("-new-cnn-.p"):
        parts = filename.split("-")
        category = parts[0] + "-" + parts[1]
        embedding = parts[2]
        task = parts[3]
        if category in categories and task in tasks and embedding in embeddings:
            result_dict = pd.read_pickle(os.path.join(directory, filename))
            for metric, metric_name in metrics.items():
                results[category][embedding][task][metric_name].append(result_dict[metric])

# Calculate average and standard deviation
for category in categories:
    print(f"Model: {category} + CNN")
    for embedding in embeddings:
        print(f"Embedding: {embedding}")
        df_data = {task: {} for task in tasks}
        for task in tasks:
            task_data = {}
            for metric, metric_name in metrics.items():
                values = results[category][embedding][task][metric_name]
                mean = np.mean(values)
                std = np.std(values)
                task_data[metric_name] = f"{mean:.4f} \u00B1 {std:.4f}"
            df_data[task] = task_data
        df = pd.DataFrame(df_data).transpose()
        print(df)
        print()

Proposed CNN Model: Average +- Std Dev of Performance Metrics for Predicting Clinical Tasks Using Different Embeddings
Model: GRU-256 + CNN
Embedding: word2vec
                     AUROC            AUPRC         Accuracy               F1
mort_hosp  0.8816 ± 0.0014  0.5802 ± 0.0044  0.9167 ± 0.0007  0.4543 ± 0.0095
mort_icu   0.8870 ± 0.0029  0.5231 ± 0.0063  0.9431 ± 0.0009  0.4443 ± 0.0195
los_3      0.7017 ± 0.0031  0.6428 ± 0.0039  0.6637 ± 0.0030  0.5570 ± 0.0107
los_7      0.7383 ± 0.0036  0.2355 ± 0.0067  0.9195 ± 0.0004  0.0192 ± 0.0070

Embedding: fasttext
                     AUROC            AUPRC         Accuracy               F1
mort_hosp  0.8792 ± 0.0025  0.5752 ± 0.0051  0.9165 ± 0.0013  0.4552 ± 0.0112
mort_icu   0.8845 ± 0.0013  0.5170 ± 0.0045  0.9428 ± 0.0010  0.4430 ± 0.0180
los_3      0.6977 ± 0.0033  0.6373 ± 0.0023  0.6610 ± 0.0032  0.5423 ± 0.0139
los_7      0.7291 ± 0.0068  0.2168 ± 0.0098  0.9196 ± 0.0005  0.0199 ± 0.0116

Embedding: concat
                    

## Results

We organized our result comparison to mirror the format of the original paper. First, we assessed baseline models using essential metrics: AUROC, AUPRC, and F1 scores. We highlighted the best-performing metrics for each task in the initial table. Then, we compared the top scores of the baseline models with those of our proposed model, showcasing the superior score for each task in the subsequent table.

<img src="https://drive.google.com/uc?export=view&id=1-78qDJq4KP0gZ153nguy5dRz14-EdXW3"
     align="center"
     width="700" />

Table 1. Statistical summary of prediction results using baseline model and baseline multimodal architecture

**Baseline Model Results**

We predict four clinical tasks using measurements from the patient's first 24 hours in the ICU together with medical entities. Table 1 summarizes the overall performance of the baseline models.

* Across all four clinical task predictions, the multimodal baseline model consistently outperformed the GRU baseline model. This aligns with our hypothesis as well as with the original paper. We observed the largest improvements in AUPRC and F1 for predicting In-Hospital Mortality and In-ICU Mortality.
* Comparing the reproduced baseline model results with those of the original paper, we observed improvements in all baseline model metrics. We suspect that differences in library versions, embeddings, and dataset configurations contributed to these improvements.
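
The size of the multimodal gain can be checked directly from the mean scores printed in the Evaluation section. The sketch below copies the time-series baseline and the multimodal (fasttext) means from that output and computes the per-task differences. It only covers the fasttext rows, and the function name `metric_deltas` is a hypothetical helper.

```python
# Mean scores copied from the evaluation output printed in this notebook
time_series = {
    "mort_hosp": {"AUROC": 0.8757, "AUPRC": 0.5574, "F1": 0.4394},
    "mort_icu": {"AUROC": 0.8827, "AUPRC": 0.5045, "F1": 0.4111},
    "los_3": {"AUROC": 0.6958, "AUPRC": 0.6364, "F1": 0.5495},
    "los_7": {"AUROC": 0.7304, "AUPRC": 0.2136, "F1": 0.0414},
}
multimodal_fasttext = {
    "mort_hosp": {"AUROC": 0.8842, "AUPRC": 0.5933, "F1": 0.4807},
    "mort_icu": {"AUROC": 0.8912, "AUPRC": 0.5360, "F1": 0.4519},
    "los_3": {"AUROC": 0.7070, "AUPRC": 0.6446, "F1": 0.5618},
    "los_7": {"AUROC": 0.7376, "AUPRC": 0.2299, "F1": 0.0417},
}


def metric_deltas(baseline, candidate):
    """Per-task, per-metric difference (candidate minus baseline)."""
    return {
        task: {m: candidate[task][m] - baseline[task][m] for m in baseline[task]}
        for task in baseline
    }


deltas = metric_deltas(time_series, multimodal_fasttext)
for task, row in deltas.items():
    print(task, {m: f"{d:+.4f}" for m, d in row.items()})
```

Every difference is positive, and the AUPRC and F1 gains are largest for the two mortality tasks, which matches the observations above.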


<img src="https://drive.google.com/uc?export=view&id=1yFt5YU-BXQdZKAN5EWHwACYeZIZzilsL"
     align="center"
     width="700" />

Table 2. Statistical summary of prediction results using the best baseline model obtained from Table 1 and the proposed model

**Proposed Model Results**

We compare the result of the proposed model against the best scores taken from the baseline models. Table 2 presents all outcomes from the proposed model in contrast to the best baseline scores.

* The proposed model yielded results similar to the best baseline metrics. For predicting Length of Stay (LOS) exceeding 7 days, we observed improvements in both AUROC and AUPRC when using Word2Vec embeddings.
* Comparing the reproduced results of the proposed model with those reported in the original paper, we once again observed improvements across all proposed model metrics.
* In the reproduced results, the proposed model performed comparably to the best baseline model, but we did not observe the anticipated significant improvements over the best baseline scores, which contradicts our initial hypothesis. We suspect several reasons for this disparity:
  1. Both the reproduced baseline and the reproduced proposed model improved on the results reported in the original paper across all four tasks, and the best baseline scores are significantly higher than those from the paper. This raises the possibility that the benefit of the proposed model plateaus once the baseline reaches a certain level of performance.
  2. Differences in library versions, embeddings, and dataset configurations may have hindered our ability to replicate similar results.


**Ablation Study**

In our ablation study, we compared the performance of the proposed convolution-based GRU multimodal architecture with the baseline GRU architecture. We used three different embedding techniques (word2vec, fasttext, and a combination of the two) to predict four distinct clinical tasks. The baseline models comprised both the GRU architecture and the GRU multimodal architecture, using the above-mentioned embeddings.

Our findings revealed that the addition of a CNN layer led to performance enhancements in one out of the four prediction tasks. Specifically, when predicting Length of Stay (LOS) exceeding 7 days, we observed improvements in both AUROC and AUPRC with the incorporation of word2vec embeddings. This suggests that the CNN layer may be advantageous for improving the prediction of LOS > 7 Days, although its benefits may not extend uniformly across all tasks, especially when baseline scores already exhibit high performance levels.



## Discussion

**Implications of the Result**

When comparing the results of our baseline and proposed models with those reported in the original paper, we noticed similar performance metrics, with our reproduced results showing slight improvements.

When comparing across prediction tasks, we observed superior performance in predicting mortality, especially when representing medical entities using the averaging method. Regarding the three different embeddings, in Table 1, fasttext consistently yielded the best baseline scores. However, in Table 2, we noted that word2vec embeddings achieved higher scores than fasttext and the combination, which is consistent with the observations reported in the paper.

When comparing our best baseline scores with those of the proposed model, we observed improvement in only 1 out of 4 prediction tasks, which deviates from the findings of the original paper. We attribute this deviation to changes we made in the code, including the removal of deprecated methods from the Keras and TensorFlow libraries, differences in the fasttext embeddings (the original embeddings were missing from the GitHub repository), and differences in library versions.

**Reproducibility**

In general, we were able to reproduce comparable results, largely thanks to the availability of the original code and comprehensive documentation outlining preprocessing steps and environment setup.

However, several factors posed challenges to reproduction:
* Downloading large datasets like MIMIC-III and obtaining pre-trained models from various sources proved to be time-consuming.
* Data preprocessing took considerably longer than anticipated. For instance, extracting medical entities from the clinical notes using Med7 required over 20 hours.
* Accessing the Fasttext embedding posed difficulties as it was not readily available in the original repository download link provided by the author. Eventually, we located it in the comments section of the Issues tab in the paper's GitHub repository.
* The original code contained deprecated methods from Keras and Tensorflow, as well as unused or outdated imported libraries (e.g., Glove), leading to compatibility and functionality issues.

To enhance reproducibility, we recommend including estimates of the time required for each preprocessing step, embedding technique implementation, model training, etc. Additionally, maintaining up-to-date dependencies and promptly removing deprecated methods would streamline the reproduction process.
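
One lightweight way to collect such time estimates is to wrap each pipeline stage in a timing context manager and save the measurements alongside the results. The sketch below is only an illustration: the `timed` helper and the step name are hypothetical, and the summation stands in for a real preprocessing step.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(step_name, log):
    """Record the wall-clock duration of a code block in `log[step_name]` (seconds)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log[step_name] = time.perf_counter() - start


timings = {}
with timed("toy_preprocessing", timings):
    total = sum(i * i for i in range(100_000))  # placeholder for a real step

for step_name, seconds in timings.items():
    print(f"{step_name}: {seconds:.3f} s")
```

Reporting these numbers with the hardware description would let readers budget compute time before starting a reproduction.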



# References

1. Bardak B, Tan M, "Improving clinical outcome predictions using convolution over medical entities with multimodal learning", Artificial Intelligence in Medicine, 2021, 117:102112, doi:https://doi.org/10.1016/j.artmed.2021.102112.
2. Johnson A, Pollard T, Mark R, "MIMIC-III Clinical Database (version 1.4)", PhysioNet, 2016, doi:https://doi.org/10.13026/C2XW26.
3. Johnson AEW, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, Moody B, Szolovits P, Celi L A, Mark RG, "MIMIC-III, a freely accessible critical care database", Scientific Data, 2016, 3:160035.
4. Goldberger A, Amaral L, Glass L, Hausdorff J, Ivanov PC, Mark R, ... & Stanley HE, "PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals", Circulation [Online], 2000, 101:23, pp. e215–e220.
5. Choi E, Bahadori MT, Schuetz A, Stewart WF, Sun J. Doctor AI: predicting clinical events via recurrent neural networks. Machine learning for healthcare conference 2016:301-18.
6. Choi E, Bahadori MT, Sun J, Kulas J, Schuetz A, Stewart W. Retain: an interpretable predictive model for healthcare using reverse time attention mechanism. Advances in neural information processing systems. 2016. p. 3504-12.
7. Caballero Barajas KL, Akella R. Dynamically modeling patient’s health state from electronic medical records: a time series approach. Proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining 2015:69–78.
8. Song H, Rajan D, Thiagarajan JJ, Spanias A. Attend and diagnose: clinical time series analysis using attention models. Thirty-second AAAI conference on artificial intelligence 2018.
9. Suresh H, Gong JJ, Guttag JV. Learning tasks for multitask learning: heterogenous patient populations in the ICU. Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining 2018:802–10.
10. Lipton ZC, Kale DC, Elkan C, Wetzel R. Learning to diagnose with LSTM recurrent neural networks. 2015 (arXiv preprint), arXiv:1511.03677.
11. Wang S, McDermott MBA, Chauhan G, Hughes MC, Naumann T, Ghassemi M. MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III. arXiv:1907.08322.
12. Kormilitzin A, Vaci N, Liu Q, Nevado-Holgado A. Med7: A Transferable Clinical Natural Language Processing Model for Electronic Health Records. 2020. arXiv:2003.01271.
13. Huang K, Altosaar J, Ranganath R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. 2019. arXiv:1904.05342.





