<a href="https://colab.research.google.com/github/srnanda2/DataViz/blob/main/DL4H_Team_138.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Draft: Improving clinical outcome predictions using convolution over medical entities with multimodal learning

## Team 138 - Soumya Nanda (srnanda2), Ayush Ghosh (ayushg7), Ray Ko (wk021)

- Original code link: https://github.com/tanlab/ConvolutionMedicalNer
- Team Project code link: https://github.com/ghosh-ayush/DLH_Medical_Convolution

---



In [None]:
from google.colab import drive
# Mount at /content/gdrive so that the data paths used in later cells resolve
drive.mount('/content/gdrive')

Mounted at /content/gdrive


# Introduction
In recent years, the healthcare industry has seen a surge in data generated from electronic health records (EHRs), which provide detailed patient medical histories, treatments, and outcomes. This data presents an unprecedented opportunity for healthcare professionals and researchers to enhance patient care and outcomes through machine learning techniques, for example by predicting and diagnosing diseases, customizing treatments, and identifying patients at risk of adverse events.

Our project aims to replicate the findings of Bardak and Tan's paper, "Improving clinical outcome predictions using convolution over medical entities with multimodal learning" [2]. The paper introduces a new approach to predicting two crucial clinical outcomes: patient length of stay (LOS) and mortality. By combining time-series data and clinical notes extracted from EHRs, the authors utilize a gated recurrent unit (GRU) network, named entity recognition (NER), and convolutional neural networks (CNNs). According to the paper, their method surpasses all tested baseline models, including multimodal architectures.

# Scope of Reproducibility:

The main assertion of the referenced paper, which our study aims to replicate and validate, is that integrating CNN models to extract features from medical entities in clinical notes yields superior results compared to baseline models. This improvement in length of stay (LOS) and mortality metrics suggests that the proposed model could enhance clinical decision-making and patient outcomes.

In the paper, time-series data features are generated using a gated recurrent unit (GRU) network, while a pretrained clinical named entity recognition (NER) model identifies seven different medical entities from clinical notes. The NER task categorizes words into predefined groups [3] and assigns labels accordingly. These extracted medical entities are then represented using embeddings and passed through a 1D convolutional neural network (CNN). Finally, the time-series features are combined with medical entity features and processed through a fully connected layer for prediction.
In this study, we replicate this architecture and compare it against several baseline models proposed in the reference paper, including GRU on time-series data, Word2Vec [11], FastText [8], and the combination of both methods on clinical notes.


#### Claims from the original paper.
The assertion is that incorporating CNN models for feature extraction from medical entities yields higher scores (AUC, AUPRC, and F1) compared to models using time-series data and/or clinical notes. This improvement is observed across four tasks:
- In-Hospital mortality prediction
- In-ICU mortality prediction
- Prediction of length of ICU stay (LOS) > 3 days
- Prediction of length of ICU stay (LOS) > 7 days

These tasks are selected because they address common risk prediction scenarios: mortality and length of ICU stay.


<b>Both are critical clinical outcomes influencing treatment decisions, resource allocation, and ultimately patient outcomes.</b>



# Methodology

To aid in the reproduction of the results, we forked the code base available from the authors' repository [[1]](https://github.com/tanlab/ConvolutionMedicalNer).


A basic [README](https://github.com/ghosh-ayush/DLH_Medical_Convolution/blob/master/README.md) file is available within the repository and documents the steps required to run the code. <b>For clarification, we encourage the evaluator to look at the GitHub repository.</b>


The code is structured into nine (9) Jupyter Notebooks, each dedicated to a specific task:

- Notebooks 1-3 handle the data preprocessing steps.
- Notebooks 4-6 focus on creating embeddings and processing time-series data.
- Notebooks 7-9 execute both the baseline and the newly proposed models.

Originally, the code was designed for Python 2; we refactored it to be compatible with Python 3 and TensorFlow 2 and addressed several bugs in the process. The modifications made to each notebook are documented at the top of the respective notebook.

<b>In this project draft, we link to the respective notebooks in our GitHub repository and present the key results we have found so far.</b>



###  Data descriptions
The data utilized in the paper is sourced from the MIMIC-III database [6, 7], hosted on PhysioNet. Access to PhysioNet is granted upon the completion of required training modules. MIMIC-III is a comprehensive and freely-available database containing de-identified health-related data from over forty thousand patients who were treated in critical care units at the Beth Israel Deaconess Medical Center between 2001 and 2012.

The time-series data covers the initial 24 hours following admission to the ICU, with inclusion criteria requiring patients to have at least 30 hours of electronic health record (EHR) data available. We ran the ipynb files below in order. Please refer to the [README](https://github.com/ghosh-ayush/DLH_Medical_Convolution/blob/master/README.md) for more details.

1. [01-Extract-Timeseries-Features.ipynb](https://github.com/ghosh-ayush/DLH_Medical_Convolution/blob/master/01-Extract-Timeseries-Features.ipynb)
2. [02-Select-SubClinicalNotes.ipynb](https://github.com/ghosh-ayush/DLH_Medical_Convolution/blob/master/02-Select-SubClinicalNotes.ipynb)
3. [03-Preprocess-Clinical-Notes.ipynb](https://github.com/ghosh-ayush/DLH_Medical_Convolution/blob/master/03-Preprocess-Clinical-Notes.ipynb)

Table 1 presents the cohort used for analysis. It's important to note that the values presented in our table differ from those in Table 1 of the original paper.

![Table_1.png](https://drive.google.com/uc?export=view&id=1xFta23tpSEoKWZnqdKGEjmJl8O4cgumf)

After extracting patients with a minimum of 30 hours of data, the available cohort is larger by seven (7) patients than in the original paper, and the final cohort is larger by 945 patients (22,025 versus 21,080).
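
The inclusion rule described in this section can be sketched as follows. This is a minimal illustration, assuming one feature row per hour for each ICU stay. The record layout and the names `select_cohort`, `stay_id`, and `hourly_rows` are hypothetical and are not taken from the project notebooks, where the actual extraction is implemented.

```python
def select_cohort(stays, min_hours=30, window_hours=24):
    """Keep stays with at least `min_hours` of data, cut to the first `window_hours`."""
    cohort = []
    for stay in stays:
        if len(stay["hourly_rows"]) >= min_hours:
            cohort.append({
                "stay_id": stay["stay_id"],
                "hourly_rows": stay["hourly_rows"][:window_hours],
            })
    return cohort


# Made-up stays with 48, 20 and 30 hours of data (one row per hour).
stays = [
    {"stay_id": "a", "hourly_rows": [[0.0]] * 48},
    {"stay_id": "b", "hourly_rows": [[0.0]] * 20},
    {"stay_id": "c", "hourly_rows": [[0.0]] * 30},
]
cohort = select_cohort(stays)
kept_ids = [stay["stay_id"] for stay in cohort]
window_lengths = [len(stay["hourly_rows"]) for stay in cohort]
print(kept_ids)        # ['a', 'c']
print(window_lengths)  # [24, 24]
```

Stay "b" is dropped because it has fewer than 30 hours of data, and the remaining stays are truncated to their first 24 hourly rows.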


In [None]:
# Load a small subset of the preprocessed clinical notes (pickled DataFrame)
import pandas as pd

raw_data_dir = '/content/gdrive/My Drive/Colab Notebooks/small_preprocessed_notes.p'
sub_notes = pd.read_pickle(raw_data_dir)

print(sub_notes.shape)
print("First few rows of the smaller version of the preprocessed data:")
print(sub_notes.head())



(18148, 5)
First few rows of the smaller version of the preprocessed data:
        SUBJECT_ID  HADM_ID_y           CHARTTIME  \
396836       76021     157169 2135-10-03 12:50:00   
531850         870     109361 2127-03-07 06:26:00   
607382        8134     112290 2141-08-08 04:01:00   
130548       57283     181734 2114-11-24 06:17:00   
588002        6957     102224 2178-02-20 17:17:00   

                                                     TEXT  \
396836  [**2135-10-3**] 12:50 PM\n CHEST (PORTABLE AP)...   
531850  MICU Nursing Admit Note 0600\n\nCode: Full\nAl...   
607382  RESP CARE\nPT REMAINS [**Name (NI) 136**] AND ...   
130548  SICU\n   HPI:\n   88F w/ abdominal pain and em...   
588002  NPN\nSee careview for details:\n\nNuero: Sedat...   

                                        preprocessed_text  
396836  [12:50 pm chest ( portable ap ) clip # reason ...  
531850  [micu nursing admit note 0600 code : full alle...  
607382  [resp care pt remains and vented on a/c 800 x ...  


##   Model
The proposed multimodal approach employs multiple models to derive its outcomes. In this approach, a Gated Recurrent Unit (GRU) is utilized to extract features from the available time series data.

* <b>GRU</b>: A gated recurrent unit (GRU) network is used to capture the temporal information available within the time-series data. A sigmoid classifier is stacked on top of a one-layer GRU with 256 hidden units. The GRU on the time-series data alone is also one of the baseline models tested, and it is used in conjunction with the other models based on the clinical note data.

For the clinical notes, the process involves three main stages:

* <b>Extraction of Medical Entities:</b> The code uses med7 [10], a pretrained clinical Named Entity Recognition (NER) model, to extract medical entities from the notes. med7 was trained on the MIMIC-III dataset, which is also used in this project, and extracts seven distinct named entities from clinical notes sourced from Electronic Health Records (EHRs): 'Drug', 'Strength', 'Duration', 'Route', 'Form', 'Dosage', and 'Frequency'. These extracted entities then serve as inputs to the text embedding models. Because it is pretrained, the med7 model does not require any parameter tuning. Please see the detailed code implementation in [04-Apply-med7-on-Clinical-Notes.ipynb](https://github.com/ghosh-ayush/DLH_Medical_Convolution/blob/master/04-Apply-med7-on-Clinical-Notes.ipynb).

* <b>Text Embedding Models</b>: A word embedding model is then applied to generate embeddings from these medical entities. Two different pretrained embedding models were tested:
  * Word2Vec [11]: A two-layer neural network that learns word representations in two ways: as a continuous bag-of-words and as a skip-gram.
  * FastText [8]: An extension of the skip-gram model from Facebook's AI Research (FAIR) lab, capable of handling out-of-vocabulary words and learning representations for rare words.
  * Additionally, a third version was tested which concatenates the results from both the Word2Vec and FastText embedding models.

  No parameter tuning was required for these pretrained models. Please see the detailed code implementation in [05-Represent-Entities-With-Different-Embeddings.ipynb](https://github.com/ghosh-ayush/DLH_Medical_Convolution/blob/master/05-Represent-Entities-With-Different-Embeddings.ipynb). For the code that prepares the time-series data for the GRU models, see [06-Create-Timeseries-Data.ipynb](https://github.com/ghosh-ayush/DLH_Medical_Convolution/blob/master/06-Create-Timeseries-Data.ipynb).

* <b>Feature Extraction with CNNs:</b> Three consecutive 1D convolutional layers, followed by a global max pooling layer, extract features from the text embeddings. The number of parameters for each convolutional layer is calculated using the formula: <i>(kernel size * input dimension + 1) * number of filters</i>. For this model, the input dimension size is 64. The three convolutional layers have 32, 64, and 96 filters, respectively, each with a kernel size of 3.

* <b>Output Layers</b>: The output layers consist first of a dense (fully connected) layer, which operates on the concatenation of the outputs of the two components above (the output of the CNN and the output of the GRU layer). It has an output dimension size of 512, with dropout regularization at a rate of 0.2. A second dense layer with a sigmoid activation function then performs the binary classification for the tasks mentioned earlier. This output layer has <i>(output dim + 1) ∗ num classes</i> parameters, where the output dimension size is the number of neurons in the previous layer (in this case, 512), and the number of classes is the number of output units (in this case, 1 for binary classification).
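
As a worked check of the two parameter formulas above, the following minimal sketch computes the counts for the three convolutional layers and the final sigmoid layer. It assumes that each convolution after the first takes the previous layer's filter count as its input dimension, which is the usual behaviour of stacked 1D convolutions but is not stated explicitly in the description. The helper names `conv1d_params` and `dense_params` are illustrative only.

```python
def conv1d_params(kernel_size: int, input_dim: int, num_filters: int) -> int:
    """(kernel size * input dimension + 1) * number of filters."""
    return (kernel_size * input_dim + 1) * num_filters


def dense_params(input_dim: int, num_units: int) -> int:
    """(input dimension + 1) * number of units, i.e. weights plus biases."""
    return (input_dim + 1) * num_units


kernel_size = 3
input_dim = 64  # input dimension stated for the first convolutional layer
conv_counts = []
for num_filters in [32, 64, 96]:
    conv_counts.append(conv1d_params(kernel_size, input_dim, num_filters))
    input_dim = num_filters  # assumption: the next layer consumes these channels

output_count = dense_params(512, 1)  # sigmoid layer on top of the 512-unit layer

print(conv_counts)   # [6176, 6208, 18528]
print(output_count)  # 513
```

For example, the first layer has (3 * 64 + 1) * 32 = 6176 parameters, and the output layer has (512 + 1) * 1 = 513.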

Finally, the features obtained from both the time series data and the clinical notes are concatenated and passed through a fully connected layer.  
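
The data flow on the clinical-notes side (convolution, global max pooling, then concatenation with the time-series features) can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: it uses a single convolutional layer with two hand-written filters instead of three stacked layers, it applies ReLU directly after the convolution, and all numbers, including the placeholder `gru_features`, are made up for illustration.

```python
def conv1d_relu(sequence, kernel, bias):
    """Slide one filter over a sequence of embedding vectors (no padding), then apply ReLU."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = bias + sum(
            weight * value
            for kernel_row, vector in zip(kernel, window)
            for weight, value in zip(kernel_row, vector)
        )
        outputs.append(max(0.0, total))
    return outputs


def global_max_pool(feature_maps):
    """Keep the largest activation of each filter across all positions."""
    return [max(feature_map) for feature_map in feature_maps]


# Toy input: five entity embeddings of dimension 2.
entity_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0]]
filters = [
    ([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]], 0.0),    # (kernel of size 3, bias)
    ([[0.0, 1.0], [0.0, -1.0], [0.0, 1.0]], -1.0),
]
feature_maps = [conv1d_relu(entity_embeddings, k, b) for k, b in filters]
note_features = global_max_pool(feature_maps)

gru_features = [0.25, -0.5]  # placeholder for the time-series representation
combined = gru_features + note_features  # concatenation passed to the dense layers

print(note_features)  # [3.0, 1.0]
print(combined)       # [0.25, -0.5, 3.0, 1.0]
```

Global max pooling yields one value per filter regardless of how many entities a note contains, which is what allows notes of different lengths to be concatenated with the fixed-size GRU output.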

# Training and Evaluation

* <b>Hyperparameters:</b> Some of the model hyperparameters are set to the values described in the original implementation of the paper, while others are tuned so as to minimise the binary cross-entropy loss.

* <b>Set parameters:</b> A dropout rate of 0.2 is set at the end of the fully connected layer. A ReLU activation function is used for non-linearity, and L2 norm regularization is selected with the scale factor set to 0.01. For optimization, we use the Adam [9] algorithm with a learning rate of 0.001 and a decay value of 0.01.

* <b>Tuned parameters:</b> The proposed model is trained so as to minimise the binary cross-entropy loss. The following parameters are tuned on the validation set: number of hidden layers, hidden units, convolutional filters, filter size, learning rate, dropout rates and regularization parameters. Each model is trained for up to 100 epochs, with early stopping on the validation loss.
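
The early-stopping rule mentioned above can be sketched in a few lines. This is a simplified illustration of the patience logic, not the Keras implementation; the function name `early_stopping_epoch` and the validation losses are made up, and the patience of 3 mirrors the `model_patience` value used in the notebook code.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs pass without the
    validation loss improving on the best value seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Made-up validation losses: the loss stops improving after epoch 3.
losses = [0.60, 0.52, 0.49, 0.50, 0.51, 0.53, 0.48]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

In this example training halts at epoch 6, so the later improvement at epoch 7 is never reached; saving the best checkpoint (epoch 3 here) is what makes it safe to stop early.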



We used the authors' code with some modifications and adaptations to make it run using <b>Python 3 and TensorFlow 2</b>. The updated and documented code can be viewed on [GitHub](https://github.com/ghosh-ayush/DLH_Medical_Convolution). The modifications made to each notebook are documented at the top of each individual notebook.


In [12]:
!pip install mittens

Collecting mittens
  Downloading mittens-0.2-py3-none-any.whl (15 kB)
Installing collected packages: mittens
Successfully installed mittens-0.2


In [13]:
import pandas as pd
import os
import numpy as np
from gensim.models import Word2Vec, FastText
from mittens import GloVe as glove

import collections
import gc

import tensorflow as tf
tf.compat.v1.enable_eager_execution()
from tensorflow.keras import backend as K
from tensorflow.keras import regularizers
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Flatten, Dense, Dropout, Input, concatenate, Activation, Concatenate, GRU, Conv2D, MaxPooling2D, UpSampling2D, Conv1D, BatchNormalization, Convolution1D, UpSampling1D, MaxPooling1D, GlobalMaxPooling1D, GlobalAveragePooling1D, MaxPool1D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, History, ReduceLROnPlateau
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.backend import clear_session

from sklearn.utils import class_weight
from sklearn.metrics import average_precision_score, roc_auc_score, accuracy_score, f1_score

import warnings
warnings.filterwarnings('ignore')

* <b>Computational requirements: </b>

[ AYUSH to FILL HERE ]

In [14]:
import tensorflow as tf

# List all available devices
gpus = tf.config.list_physical_devices('GPU')
cpus = tf.config.list_physical_devices('CPU')

if not gpus:
    print("GPU not available. Training on CPU.")
else:
    # Set TensorFlow to only use the first GPU
    tf.config.set_visible_devices(gpus[0], 'GPU')
    print("Training on GPU:", gpus[0])

GPU not available. Training on CPU.


* <b>Running the baseline model, which predicts the 4 different clinical tasks.</b> Please see the detailed implementation in [07-TimeseriesBaseline.ipynb](https://github.com/ghosh-ayush/DLH_Medical_Convolution/blob/master/07-TimeseriesBaseline.ipynb).

* We have created a mini version of this implementation using the processed time-series data set for the purposes of evaluation. The other models are relatively more complicated, so we use this step as an illustration, running 2 epochs and 1 iteration in under 8 minutes.

In [22]:
# Helper functions used by the training loop below
def reset_keras(model):
    """Release the Keras session and this function's reference to `model`."""
    tf.keras.backend.clear_session()  # Clear the global Keras state
    del model  # Drops only the local reference; the caller's variable is unaffected
    gc.collect()  # Ask the garbage collector to free unreferenced memory

def make_prediction_timeseries(model, test_data):
    probs = model.predict(test_data)
    y_pred = [1 if i>=0.5 else 0 for i in probs]
    return probs, y_pred

def save_scores_timeseries(predictions, probs, ground_truth, model_name,
                problem_type, iteration, hidden_unit_size, type_of_ner):

    auc = roc_auc_score(ground_truth, probs)
    auprc = average_precision_score(ground_truth, probs)
    acc   = accuracy_score(ground_truth, predictions)
    F1    = f1_score(ground_truth, predictions)


    result_dict = {}
    result_dict['auc'] = auc
    result_dict['auprc'] = auprc
    result_dict['acc'] = acc
    result_dict['F1'] = F1


    file_name = str(hidden_unit_size)+"-"+model_name+"-"+problem_type+"-"+str(iteration)+"-"+type_of_ner+".p"

    result_path = "/content/gdrive/My Drive/Colab Notebooks/Results"
    os.makedirs(result_path, exist_ok=True)  # create the results folder if it is missing
    pd.to_pickle(result_dict, os.path.join(result_path, file_name))

    print("auc:", auc, ", auprc:", auprc, ", acc:", acc, ", F1:", F1)

def timeseries_model(layer_name, number_of_unit):
    clear_session()

    sequence_input = Input(shape=(24,104),  name = "timeseries_input")

    x = GRU(number_of_unit)(sequence_input)

    logits_regularizer = tf.keras.regularizers.l2(0.01)
    sigmoid_pred = Dense(1, activation='sigmoid',use_bias=False,
                         kernel_initializer=tf.keras.initializers.GlorotUniform(),
                  kernel_regularizer=logits_regularizer)(x)


    model = Model(inputs=sequence_input, outputs=sigmoid_pred)

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
    return model
#LOAD THE TRAINING,VALIDATION AND TEST DATA SETS
type_of_ner='new'
x_train = pd.read_pickle("/content/gdrive/My Drive/Colab Notebooks/Data/new_x_train.pkl")
x_dev   = pd.read_pickle("/content/gdrive/My Drive/Colab Notebooks/Data/new_x_dev.pkl")
x_test  = pd.read_pickle("/content/gdrive/My Drive/Colab Notebooks/Data/new_x_test.pkl")

y_train = pd.read_pickle("/content/gdrive/My Drive/Colab Notebooks/Data/new_y_train.pkl")
y_dev   = pd.read_pickle("/content/gdrive/My Drive/Colab Notebooks/Data/new_y_dev.pkl")
y_test  = pd.read_pickle("/content/gdrive/My Drive/Colab Notebooks/Data/new_y_test.pkl")

# Run a shortened training (2 epochs, 1 iteration) for the purposes of evaluation
epoch_num        = 2  # reduced number of epochs for this illustration
model_patience   = 3
monitor_criteria = 'val_loss'
batch_size       = 128

unit_sizes       = [256]
iter_num         = 2  # range(1, iter_num) below yields a single iteration
target_problems  = ['mort_hosp', 'mort_icu', 'los_3', 'los_7']
layers           = ["GRU"]

for each_layer in layers:
    for each_unit_size in unit_sizes:
        for iteration in range(1, iter_num):
            for each_problem in target_problems:

                print("Layer: ", each_layer)
                print("Hidden unit: ", each_unit_size)
                print ("Problem type: ", each_problem)
                print("Iteration number: ", iteration)
                print ("__________________")


                early_stopping_monitor = EarlyStopping(monitor=monitor_criteria, patience=model_patience)

                #best_model_name = str(each_layer)+"-"+str(each_unit_size)+"-"+str(each_problem)+"-"+"best_model.hdf5"
                best_model_name = str(each_layer) + "-" + str(each_unit_size) + "-" + str(each_problem) + "-" + "best_model.keras"

                checkpoint = ModelCheckpoint(best_model_name,
                                             monitor='val_loss',
                                             verbose=0,
                                             save_best_only=True,
                                             mode='min',
                                             #period=1
                                            )

                callbacks = [early_stopping_monitor, checkpoint]

                model = timeseries_model(each_layer, each_unit_size)
                model.fit(x_train,
                          y_train[each_problem],
                          epochs=epoch_num,
                          verbose=0,
                          validation_data=(x_dev, y_dev[each_problem]),
                          callbacks=callbacks,
                          batch_size= batch_size)

                model.load_weights(best_model_name)

                probs, predictions = make_prediction_timeseries(model, x_test)
                save_scores_timeseries(predictions, probs, y_test[each_problem].values,str(each_layer),
                                       each_problem, iteration, each_unit_size,type_of_ner)
                reset_keras(model)
                clear_session()
                gc.collect()


Layer:  GRU
Hidden unit:  256
Problem type:  mort_hosp
Iteration number:  1
__________________
auc: 0.8770703755556065 , auprc: 0.5562853684378408 , acc: 0.9132711131345322 , F1: 0.39427662957074716
Layer:  GRU
Hidden unit:  256
Problem type:  mort_icu
Iteration number:  1
__________________
auc: 0.8873756056733026 , auprc: 0.4670652481400007 , acc: 0.9358069656271341 , F1: 0.3788546255506608
Layer:  GRU
Hidden unit:  256
Problem type:  los_3
Iteration number:  1
__________________
auc: 0.6895790327259023 , auprc: 0.622007404261435 , acc: 0.6562713407694059 , F1: 0.5545722713864307
Layer:  GRU
Hidden unit:  256
Problem type:  los_7
Iteration number:  1
__________________
auc: 0.7214657093339937 , auprc: 0.19683184045311303 , acc: 0.919872524470749 , F1: 0.043478260869565216


* Furthermore, we run the multimodal baseline model, which uses the word representations obtained by applying the embedding models. Please see [08-Multimodal-Baseline.ipynb](https://github.com/ghosh-ayush/DLH_Medical_Convolution/blob/master/08-Multimodal-Baseline.ipynb).

* Finally, we run the proposed model from the original paper, which uses 1D convolutional layers as a feature extractor on medical entities obtained through the embedding techniques, and we train and evaluate its performance on the 4 tasks. Please see [09-Proposed-Model.ipynb](https://github.com/ghosh-ayush/DLH_Medical_Convolution/blob/master/09-Proposed-Model.ipynb).

# Results

### Comparing results from baseline model replication
Please use [load_print_results.ipynb](https://github.com/ghosh-ayush/DLH_Medical_Convolution/blob/master/load_print_results.ipynb) for this section to get details of the runs that we achieved.


In order to assess the validity of the claims outlined in the section <b>Claims from the original paper</b> across the four tasks, we first computed the mean AUC, AUPRC, and F1 score for the baseline models. Four baseline models were executed: the GRU model utilizing solely time-series data, and three averaged multimodal models that combine the GRU output with "Word2Vec" embeddings, "FastText" embeddings, or the concatenation of the two. Each model underwent ten (10) runs, and the average results were recorded. The table below shows the performance comparison of these baseline models, mirroring the presentation in the original paper.
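
The averaging over runs can be sketched as below. This is a minimal illustration and not the code in the linked results notebook: it assumes uncompressed pickled score dictionaries whose file names follow the `<units>-<model>-<problem>-<iteration>-<ner>.p` pattern written by `save_scores_timeseries`, and that none of these fields contains a hyphen. The function name `average_scores` and the demo scores are made up.

```python
import glob
import os
import pickle
import tempfile
from collections import defaultdict
from statistics import mean


def average_scores(result_dir):
    """Average each metric over runs, grouped by (model, problem)."""
    grouped = defaultdict(list)
    for path in glob.glob(os.path.join(result_dir, "*.p")):
        stem = os.path.splitext(os.path.basename(path))[0]
        # Assumed pattern: <units>-<model>-<problem>-<iteration>-<ner>
        _units, model, problem, _iteration, _ner = stem.split("-")
        with open(path, "rb") as handle:
            grouped[(model, problem)].append(pickle.load(handle))
    return {
        key: {metric: mean(run[metric] for run in runs) for metric in runs[0]}
        for key, runs in grouped.items()
    }


# Demo with made-up scores (not results from this project).
with tempfile.TemporaryDirectory() as tmp:
    for iteration, auc in enumerate([0.75, 0.875], start=1):
        scores = {"auc": auc, "auprc": 0.5, "acc": 0.9, "F1": 0.4}
        name = f"256-GRU-mort_hosp-{iteration}-new.p"
        with open(os.path.join(tmp, name), "wb") as handle:
            pickle.dump(scores, handle)
    averages = average_scores(tmp)

print(averages[("GRU", "mort_hosp")])
```

Grouping by model and task before averaging keeps the ten runs of each configuration together, so each table cell corresponds to one `(model, problem)` key.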

<b>Results from our replication for baseline models</b>
![Baseline](https://drive.google.com/uc?export=view&id=1sxVf0NtjZnpO0LGnELAUS5tWrHut1XQI)

<b>Original Results for baseline models</b>
![Baseline_Original](https://drive.google.com/uc?export=view&id=18kgVa9MGqkwDuImQxApjvIUIXNVIP7QF)

Comparing the results in the paper with our own, we could not match them exactly. However, they are broadly similar for the averaged multimodal models. Please note that we managed to run the averaged multimodal embedding models for comparison with the GRU, but we see differences in the results.
* One thing to notice is that the GRU achieves better F1 scores for the LOS > 3 days and LOS > 7 days predictions than the averaged multimodal embedding models.

### Comparing results from proposed model replication

<b>Results from our replication for proposed models</b>
![Proposed](https://drive.google.com/uc?export=view&id=1t2gxlyt3TVDxQ5appNXGYM5-dHyaa4Tu)

<b>Original Results for proposed models</b>
![Proposed_Original](https://drive.google.com/uc?export=view&id=1Y3HeH7jsEsxo9XUxgmxxN7vNCMDc1vRk)


Again, our results do not exactly replicate those in the paper. The best baseline model seems to perform better than the proposed models in most cases. Furthermore, we do not see the proposed model consistently outperforming the best baseline model, as claimed by the paper.


## Model comparison

#### In-Hospital Mortality
In the original paper, the use of the "Word2Vec" embedding model in the proposed model architecture provided the highest scores across all of the AUC, AUPRC and F1 metrics.
However, in our reproduced results, the best baseline model outperforms the proposed models on all metrics, as shown in the results tables in the previous section.

#### In-ICU Mortality

Originally, the proposed model with "Word2Vec" embeddings provided the top results for AUC and AUPRC, with "FastText" providing the top result for F1. However, similar to the in-hospital mortality task, the reproduced results show the baseline models again outperforming the proposed models for AUC and AUPRC.

#### LOS > 3

The task of determining length of ICU stay greater than 3 days is the only task in which the reproduced results show the proposed model outperforming the baseline model in two out of the three metrics, namely AUC and AUPRC. In this case, the "Word2Vec" embedding technique is shown to have the best results. However, in the original paper, it was the "Concat" model which showed the best results for AUC and AUPRC and the "FastText" model which was optimal for the F1 metric.

#### LOS > 7

For the task of determining length of ICU stay greater than seven days, the results again show that the baseline model outperforms the best results from the proposed models across 2 scoring metrics.

Overall, we could not replicate the results reported in the original paper.



# Discussion


Our reproduction's results diverge from those of the original paper despite utilizing the same dataset and the authors' code. The baseline model surpassed the proposed models in 10 out of 12 evaluation metrics across the tasks. Surprisingly, the proposed model's results did not merely differ from the original ones: nearly every result we obtained exceeded those reported in the paper. The mean discrepancy observed among the trained models is detailed in the Results section. The relative underperformance of the proposed models can be attributed to the improved performance of the baseline models in our runs.

The discrepancies between our reproduction and the original paper likely stem from differences in the analysis cohort size (22,025 versus 21,080 patients) between the two implementations. This discrepancy was attributed to two issues. Firstly, variations in the code versions used to extract test data from the MIMIC-III dataset resulted in minor differences in the number of subjects used. This issue was tested and verified. Secondly, differences in versions of the med7 model were hypothesized to contribute to the discrepancy, although this could not be confirmed. Our findings suggest that while the results presented in the original paper may have been accurate at the time, updates and improvements to the models utilized in this analysis have minimized the impact of incorporating the proposed CNN layer into the model architecture.


Several factors made it feasible to re-run the experiments of the selected paper. These factors included:

* The accessibility of the code in a GitHub repository ensured that we executed identical code and constructed models consistent with those of the original authors.
* The presentation of findings in tables within the original paper facilitated straightforward comparison between our results and theirs.
* The dataset was relatively small in size, enabling us to store it locally without the need for external storage services.


Despite the mentioned advantages, we faced several hurdles during our replication efforts. These challenges included:

* The code released with the paper was written using outdated dependencies, resulting in deprecated functions and variables. As a result, we had to identify and replace these deprecated components with equivalents from the latest dependency versions.
* The original code lacked adequate comments or documentation, making it challenging to understand and follow.
* While the paper did not specify a requirement for a GPU, the computations were resource-intensive, often requiring overnight runs. However, we encountered errors after a few hours, leading to delays.
* Ensuring accuracy and adherence to the paper proved challenging due to discrepancies with the original results. Speculating on the reasons behind these inconsistencies was also difficult.

To facilitate the reproduction of results, we suggest that the original authors implement the following recommendations:


* <b>Set a random seed value</b> to ensure that the results are reproducible, even when the code is run multiple times.

* <b>Update for Compatibility:</b> Update the code for compatibility with current versions of Python and TensorFlow so that it runs smoothly with recent software releases.

* <b>Model Version Documentation:</b> Document the versions of all models and tools used, including MIMIC Extract and med7, so that experiments can be replicated with the same versions.

* <b>Enhanced Documentation with Comments:</b> Add explanatory comments to the code to make its functionality and workflow easier to understand.
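
As an illustration of the first recommendation, the sketch below seeds the common random number generators in one place. It is a minimal example under stated assumptions: the helper name `set_global_seed` and the seed value are arbitrary, NumPy and TensorFlow are seeded only if they are installed, and seeding alone may not make GPU runs bit-for-bit identical.

```python
import random


def set_global_seed(seed: int) -> None:
    """Seed Python's generator and, when available, NumPy and TensorFlow."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass  # TensorFlow not installed; nothing to seed


set_global_seed(138)  # arbitrary example value
first_draw = [random.random() for _ in range(3)]
set_global_seed(138)
second_draw = [random.random() for _ in range(3)]
print(first_draw == second_draw)  # True
```

Calling such a helper once at the top of each notebook, and again before each repeated run if identical runs are desired, would make the reported averages easier to reproduce.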


<b>We have shared our GitHub repository in case the authors find it helpful for implementing the above suggestions.</b>


# Further to-dos
* We will reach out to the authors to understand the discrepancies further.
* We will spend more time comparing our results with those of the paper to pin down any particular cause.
* We will work on adding a couple of ablations if possible.
* We will add some more visualisations to the Results section.

# References

1. B. Bardak and M. Tan. ConvolutionMedicalNer. https://github.com/tanlab/ConvolutionMedicalNer, 2020.
2. Batuhan Bardak and Mehmet Tan. Improving clinical outcome predictions using convolution over medical entities with multimodal learning, 2020.
3. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. ISBN: 0521865719.
4. A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. Ch. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000 (June 13). Circulation Electronic Pages: http://circ.ahajournals.org/content/101/23/e215.full PMID:1085218; doi: 10.1161/01.CIR.101.23.e215.
5. Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. ClinicalBERT: Modeling clinical notes and predicting hospital readmission, 2020.
6. Alistair E. W. Johnson, Tom J. Pollard, and Roger G. Mark. MIMIC-III clinical database (version 1.4). PhysioNet, 2016.
7. Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Liwei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):160035, 2016.
8. Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification, 2016.
9. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
10. Andrey Kormilitzin, Nemanja Vaci, Qiang Liu, and Alejo Nevado-Holgado. Med7: A transferable clinical natural language processing model for electronic health records. Artificial Intelligence in Medicine, 118:102086, 2021.
11. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013.
