<a href="https://colab.research.google.com/github/revatishelat/DST_A2/blob/main/report/05_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### 1. Import libraries

If any of the libraries are not installed, install them with `!pip install [name]`.

In [1]:
#preprocessing
import pandas as pd

import re #for regular expression
import nltk

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

#for rnn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### 2. Load datasets

The training and test datasets that are used in this script are available [here](https://colab.research.google.com/drive/1tr7CoaE6InDuI445Lj1StgUWEB90DQOt#scrollTo=aP-ua3PCjnzd&line=1&uniqifier=1).

In [2]:
# Load the data from CSV
train_df = pd.read_csv("https://raw.githubusercontent.com/sebischair/Medical-Abstracts-TC-Corpus/main/medical_tc_train.csv")
test_df = pd.read_csv("https://raw.githubusercontent.com/sebischair/Medical-Abstracts-TC-Corpus/main/medical_tc_test.csv")
labels = pd.read_csv("https://raw.githubusercontent.com/sebischair/Medical-Abstracts-TC-Corpus/main/medical_tc_labels.csv")


In [3]:
print("The first few rows of the training dataset: ", train_df.head(10))

The first few rows of the training dataset:     condition_label                                   medical_abstract
0                5  Tissue changes around loose prostheses. A cani...
1                1  Neuropeptide Y and neuron-specific enolase lev...
2                2  Sexually transmitted diseases of the colon, re...
3                1  Lipolytic factors associated with murine and h...
4                3  Does carotid restenosis predict an increased r...
5                3  The shoulder in multiple epiphyseal dysplasia....
6                2  The management of postoperative chylous ascite...
7                4  Pharmacomechanical thrombolysis and angioplast...
8                5  Color Doppler diagnosis of mechanical prosthet...
9                5  Noninvasive diagnosis of right-sided extracard...


In [4]:
print("The first few rows of the test dataset: ", test_df.head(10))

The first few rows of the test dataset:     condition_label                                   medical_abstract
0                3  Obstructive sleep apnea following topical orop...
1                5  Neutrophil function and pyogenic infections in...
2                5  A phase II study of combined methotrexate and ...
3                1  Flow cytometric DNA analysis of parathyroid tu...
4                4  Paraneoplastic vasculitic neuropathy: a treata...
5                1  Treatment of childhood angiomatous diseases wi...
6                1  Expression of major histocompatibility complex...
7                1  Questionable role of CNS radioprophylaxis in t...
8                5  Reversibility of hepatic fibrosis in experimen...
9                2  Current status of duplex Doppler ultrasound in...


In [5]:
# train_df["medical_abstract"]

#### 3. Data Preprocessing

The text preprocessing is similar to that in 01_Introduction_EDA_and_preprocessing.ipynb.

In [6]:
#preprocessing function
def preprocess(data):
    #create the lemmatizer once instead of once per token
    lemmatizer = WordNetLemmatizer()
    processed_data = []
    for doc in data['medical_abstract']:
        #lowercase the document
        doc = doc.lower()
        #replace runs of non-letter characters between words (digits, punctuation) with a space
        doc = re.sub(r'\b[^a-zA-Z]+\b', ' ', doc)
        #tokenize
        toks = nltk.word_tokenize(doc)
        #remove tokens of length <= 1 (can be varied)
        toks = [tok for tok in toks if len(tok) > 1]
        #remove stopwords
        toks = [tok for tok in toks if tok not in en_stop]
        #lemmatize
        toks = [lemmatizer.lemmatize(tok) for tok in toks]
        processed_data.append(toks)
    return processed_data
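
To see what the regular expression in the cleaning step does, the short sketch below applies it to a few made-up strings. The sample strings and the helper name `strip_non_letters` are illustrative assumptions and are not part of the notebook. Because the pattern is anchored on word boundaries, it removes digits and punctuation that sit between words but leaves punctuation at the very end of a string untouched.

```python
import re

# Same pattern as in preprocess(): a run of non-letters enclosed by word boundaries
NON_LETTERS = re.compile(r'\b[^a-zA-Z]+\b')

def strip_non_letters(text):
    """Hypothetical helper: lowercase the text, then apply the notebook's regex."""
    return NON_LETTERS.sub(' ', text.lower())

# Made-up sample strings, for illustration only
samples = ["Stage 2 tumor", "IL-2 levels", "Loose prostheses."]
cleaned = [strip_non_letters(s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

In the full pipeline, a single trailing character such as the final full stop becomes its own token after `nltk.word_tokenize` and is then dropped by the token-length filter.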

In [7]:
# Split the data into text and labels
train_texts, y_train_labels = train_df['medical_abstract'], train_df['condition_label']
test_texts, y_test_labels = test_df['medical_abstract'], test_df['condition_label']



In [8]:
def preprocess(texts):
    #redefines preprocess to take the raw abstracts directly;
    #returns one list of tokens per abstract
    lemmatizer = WordNetLemmatizer()
    processed = []
    for x in texts:
        #lowercase the document
        x = x.lower()
        #replace runs of non-letter characters between words with a space
        x = re.sub(r'\b[^a-zA-Z]+\b', ' ', x)
        #tokenize
        toks = nltk.word_tokenize(x)
        #remove tokens of length <= 1 (can be varied)
        toks = [tok for tok in toks if len(tok) > 1]
        #remove stopwords
        toks = [tok for tok in toks if tok not in en_stop]
        #lemmatize
        toks = [lemmatizer.lemmatize(tok) for tok in toks]
        processed.append(toks)
    return processed

In [9]:
X_train_texts = preprocess(train_texts)
X_test_texts = preprocess(test_texts)

In [10]:
# len(train_texts)
# type(train_texts)

In [11]:
# Tokenize the text data
max_words = 10000
tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(train_texts)


In [12]:
# Convert text data to sequences
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

In [13]:
# train_sequences

In [14]:
# test_sequences

In [15]:
# Pad sequences to ensure consistent length
max_length = 100
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding='post', truncating='post')
test_padded = pad_sequences(test_sequences, maxlen=max_length, padding='post', truncating='post')
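
As a rough illustration of what `padding='post'` and `truncating='post'` mean, the plain-Python sketch below mimics the behaviour on two toy sequences. The helper `pad_post` and the toy index values are assumptions made for this example; the notebook itself relies on Keras' `pad_sequences`.

```python
def pad_post(sequence, maxlen, value=0):
    """Hypothetical helper mimicking padding='post' and truncating='post'."""
    # Keep the first maxlen tokens (anything after them is cut off)
    truncated = sequence[:maxlen]
    # Fill the remaining positions at the end with the padding value
    return truncated + [value] * (maxlen - len(truncated))

short_seq = [12, 7, 45]                 # shorter than maxlen: zeros are appended
long_seq = [3, 9, 27, 81, 243, 729]     # longer than maxlen: the tail is dropped

padded_short = pad_post(short_seq, maxlen=5)
padded_long = pad_post(long_seq, maxlen=5)
print(padded_short)
print(padded_long)
```

With `max_length = 100`, this means only the first 100 tokens of each abstract reach the model.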



#### 4. LSTM (RNN) model implementation

In this section, we tokenize the training text, pad the sequences, and build an RNN model with the Keras library. We first fit the tokenizer on the training data, so its internal vocabulary is built from the training text only and is capped by `max_words = 10000`. Words that are not in the tokenizer's word index (for example, words that appear only in the test set) are represented by the out-of-vocabulary token `<OOV>`.
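
The idea of a training-only vocabulary with an out-of-vocabulary token can be sketched in plain Python as follows. This is a simplified stand-in under our own assumptions (whitespace splitting, the helper names `fit_vocab` and `to_sequence`, and the toy sentences), not the Keras `Tokenizer` implementation. It follows the same convention of reserving index 1 for the OOV token and ranking words by frequency.

```python
from collections import Counter

OOV_TOKEN = "<OOV>"

def fit_vocab(texts):
    """Hypothetical helper: index words by frequency; index 1 is reserved for OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {OOV_TOKEN: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, num_words):
    """Map words to indices; unseen or too-rare words fall back to the OOV index."""
    oov = word_index[OOV_TOKEN]
    sequence = []
    for word in text.lower().split():
        idx = word_index.get(word, oov)
        sequence.append(idx if idx < num_words else oov)
    return sequence

# Toy training texts: "pain" occurs 3 times, "chest" twice, "fever" once
train_toy = ["chest pain", "pain pain chest fever"]
word_index = fit_vocab(train_toy)

# "cough" was never seen; "fever" is cut off by the num_words limit
test_sequence = to_sequence("chest pain fever cough", word_index, num_words=4)
print(word_index)
print(test_sequence)
```

Fitting on the training text only keeps information from the test set out of the vocabulary.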



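We build a Sequential model, i.e. a linear stack of layers in which each layer has exactly one input tensor and one output tensor. The model consists of an embedding layer, an LSTM layer that captures long-term dependencies in the sequences, and a softmax output layer over the five classes.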

In [16]:
# import keras
# def create_lstm_model(embedding_layer = None):
#     # create input layer
#     inputs = keras.Input(shape=(None,), dtype="int64")
#     # add word embedding layer
#     if embedding_layer is not None:
#         embedded = embedding_layer(inputs)
#     else:
#         embedded = layers.Embedding(input_dim=MAX_TOKENS, output_dim=256, mask_zero=True)(inputs)
#     # add LSTM layer
#     x = layers.Bidirectional(layers.LSTM(32))(embedded)
#     # add dropout layer
#     x = layers.Dropout(0.5)(x)
#     # add output layer
#     outputs = layers.Dense(9, activation="softmax")(x)
#     # combine all layers into one model
#     lstm_model = keras.Model(inputs, outputs)
#     # specify optimizer, loss, and metrics for the model
#     lstm_model.compile(optimizer="rmsprop",
#                   loss="sparse_categorical_crossentropy",
#                   metrics=["accuracy"])
#     # print the summary of the model architecture
#     lstm_model.summary()

#     return lstm_model
# if SEQUENCE_MODEL:
#     if USE_GROVE:
#         lstm_model = create_lstm_model(embedding_layer)
#     else:
#         lstm_model = create_lstm_model()

#     # define callback function
#     callbacks = [
#         keras.callbacks.ModelCheckpoint("embeddings_bidir_gru_with_masking.keras",
#                                         save_best_only=False)
#     ]
#     # train model
#     lstm_model.fit(X_train, y_train, epochs=10, callbacks=callbacks)

In [17]:
max_words = 10000
tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(train_texts)

In [18]:
# Pad sequences to ensure consistent length
max_length = 100
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding='post', truncating='post')
test_padded = pad_sequences(test_sequences, maxlen=max_length, padding='post', truncating='post')

In [19]:
#check for unique words in dataframe

In [20]:
# Build the RNN model
model = Sequential()
model.add(Embedding(input_dim=max_words, input_length=max_length, output_dim=64))
model.add(LSTM(128))
model.add(Dense(5, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])


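The `sparse_categorical_crossentropy` loss expects integer class labels from 0 to 4 for a five-unit output layer, but the dataset labels run from 1 to 5. To correct the resulting out-of-bounds error, we subtract 1 from each label. The classes then correspond to:\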
0 : Neoplasms\
1 : Digestive system diseases\
2 : Nervous system diseases\
3 : Cardiovascular diseases\
4 : General pathological conditions
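
The shift can be checked on a handful of labels. The sketch below uses the first five `condition_label` values from the training preview above; the names `CLASS_NAMES`, `raw_labels`, and `shifted` are introduced here for illustration only.

```python
# Class names in the order given in the list above (index = shifted label)
CLASS_NAMES = [
    "Neoplasms",
    "Digestive system diseases",
    "Nervous system diseases",
    "Cardiovascular diseases",
    "General pathological conditions",
]

# First five condition_label values from the training preview (1-based)
raw_labels = [5, 1, 2, 1, 3]

# Shift to 0-based labels so they are valid indices for a 5-unit softmax layer
shifted = [label - 1 for label in raw_labels]
in_range = all(0 <= label < len(CLASS_NAMES) for label in shifted)

names = [CLASS_NAMES[label] for label in shifted]
print(shifted, in_range)
print(names)
```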

In [21]:
# Report whether a GPU is visible; TensorFlow uses it automatically when present
# (tf.keras.utils.multi_gpu_model has been removed from recent TensorFlow releases)
if tf.config.list_physical_devices('GPU'):
    print('GPU is available')


In [22]:
# y_train_labels

In [23]:

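# Shift labels from 1..5 to 0..4; recomputed from the DataFrame so re-running the cell is safe
y_train_labels = train_df['condition_label'] - 1
y_test_labels = test_df['condition_label'] - 1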

In [24]:
#

In [27]:
len(train_padded)


11550

In [28]:
# Train the model
# time taken for 10 epochs: ~11 minutes
model.fit(train_padded, y_train_labels, epochs=10)

# model.compile(loss = CategoricalCrossentropy(), optimizer = Adam(), metrics=['accuracy'])


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7f3bbc42df00>

It takes ~11 minutes to fit the training data for 10 epochs, and the training accuracy reaches ~70.59%. This is quite low compared to other LSTM models reported in the literature, which have accuracies ranging from 80% to 97% (see [here](https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-022-01665-y) and [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8486521/)).

In [29]:
# Evaluate the model on test data
predictions = model.predict(test_padded)




In [31]:
# Convert predictions to labels
predicted_labels = predictions.argmax(axis=1)



In [32]:
# Compute confusion matrix
conf_matrix = confusion_matrix(y_test_labels, predicted_labels)
print("Confusion Matrix:")
print(conf_matrix)


Confusion Matrix:
[[437  30  21  20 125]
 [ 45 103   3  14 134]
 [ 40   9 128  48 160]
 [ 10   9  17 410 164]
 [163 119 127 249 303]]


In [33]:
# classification report
class_report = classification_report(y_test_labels, predicted_labels)
print("\nClassification Report:")
print(class_report)



Classification Report:
              precision    recall  f1-score   support

           0       0.63      0.69      0.66       633
           1       0.38      0.34      0.36       299
           2       0.43      0.33      0.38       385
           3       0.55      0.67      0.61       610
           4       0.34      0.32      0.33       961

    accuracy                           0.48      2888
   macro avg       0.47      0.47      0.47      2888
weighted avg       0.47      0.48      0.47      2888
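
The figures in the classification report can be recomputed by hand from the confusion matrix printed above, where rows are the true classes and columns are the predicted classes. The sketch below does this in plain Python; the variable names are our own.

```python
# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
conf_matrix = [
    [437,  30,  21,  20, 125],
    [ 45, 103,   3,  14, 134],
    [ 40,   9, 128,  48, 160],
    [ 10,   9,  17, 410, 164],
    [163, 119, 127, 249, 303],
]
n_classes = len(conf_matrix)

# Accuracy: correctly classified abstracts (diagonal) over all test abstracts
total = sum(sum(row) for row in conf_matrix)
correct = sum(conf_matrix[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall: diagonal entry over its row sum (all abstracts that truly belong to the class)
recall = [conf_matrix[i][i] / sum(conf_matrix[i]) for i in range(n_classes)]

# Precision: diagonal entry over its column sum (all abstracts predicted as the class)
precision = [
    conf_matrix[i][i] / sum(row[i] for row in conf_matrix)
    for i in range(n_classes)
]

print(f"accuracy: {accuracy:.4f}")
for i in range(n_classes):
    print(f"class {i}: precision={precision[i]:.2f}, recall={recall[i]:.2f}")
```

Class 4 (General pathological conditions) has the lowest precision and recall, and its row and column hold the largest off-diagonal counts, so most of the errors involve confusing this class with the other four.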



In [34]:
# Evaluate the model on the test set
loss, accuracy = model.evaluate(test_padded, y_test_labels)
print(f'Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}')


Test Loss: 1.3325, Test Accuracy: 0.4782


On the test data, the LSTM model gives an accuracy of ~47.82%. The higher accuracy on the training set (~70.59%) than on the test set suggests that the model is overfitting. There are a few possible ways to improve accuracy. First, initialising the embedding layer with BioWordVec may help. Another option is to include more variables from the medical data, although this may increase computation time. See [this](https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-022-01665-y) paper.


References:

https://coderzcolumn.com/tutorials/artificial-intelligence/pytorch-rnn-for-text-classification-tasks

https://www.analyticsvidhya.com/blog/2021/06/lstm-for-text-classification/#h-lstm-python-for-text-classification

Papers with potentially more insight into LSTMs for medical text and diagnoses: [here](https://arxiv.org/abs/1511.03677) and [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8486521/)