# Classifying Medical Texts
This notebook uses several vectorizers and deep learning models to classify medical transcriptions by medical specialty. The text has already been cleaned and preprocessed.

## Setup

In [1]:
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import nltk
import spacy
import random
import gensim

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.model_selection import train_test_split

In [3]:
os.getcwd()

'C:\\Users\\Shru\\Documents\\Springboard\\Capstone 3'

In [4]:
path = 'C:\\Users\\Shru\\Documents\\Springboard\\Capstone 3/data'

data = pd.read_csv(path+'/datafull.tsv', delimiter='\t')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4966 entries, 0 to 4965
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   medical_specialty  4966 non-null   object
 1   text               4966 non-null   object
 2   class_label        4966 non-null   int64 
 3   tokens             4966 non-null   object
dtypes: int64(1), object(3)
memory usage: 155.3+ KB


In [46]:
data['text'][4634]

'subjective patient admit shortness breath continue fairly well patient chronic atrial fibrillation anticoagulation inr 172 patient undergo echocardiogram show aortic stenosis severe patient outside cardiologist understand schedule undergo workup regard physical examination vital signs pulse 78 blood pressure 13060 lungs clear heart soft systolic murmur aortic area abdomen soft nontender extremities edema impression 1 status shortness breath respond well medical management 2 atrial fibrillation chronic anticoagulation 3 aortic stenosis recommendations 1 continue medication 2 patient would like follow cardiologist regard aortic stenosis may need surgical intervention regard explain patient discharge home medical management appointment see cardiologist next day interim change mind concern request call back'

In [47]:
data_og = pd.read_csv('medical_transcriptions/mtsamples.csv',index_col=0)

In [55]:
# Drop rows with empty transcriptions and keep only the columns needed for modeling
data_og = data_og.drop(data_og[data_og['transcription'].isna()].index).reset_index(drop=True)
data_og = data_og[['medical_specialty','transcription']]
data_og.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4966 entries, 0 to 4965
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   medical_specialty  4966 non-null   object
 1   transcription      4966 non-null   object
dtypes: object(2)
memory usage: 77.7+ KB


In [5]:
word2vec = gensim.models.KeyedVectors.load_word2vec_format('https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [6]:
def get_average_word2vec(tokens_list, vector, generate_missing=False, k=300):
    if len(tokens_list)<1:
        return np.zeros(k)
    if generate_missing:
        vectorized = [vector[word] if word in vector else np.random.rand(k) for word in tokens_list]
    else:
        vectorized = [vector[word] if word in vector else np.zeros(k) for word in tokens_list]
    length = len(vectorized)
    summed = np.sum(vectorized, axis=0)
    averaged = np.divide(summed, length)
    return averaged

def get_word2vec_embeddings(vectors, clean_questions, generate_missing=False):
    embeddings = clean_questions['tokens'].apply(lambda x: get_average_word2vec(x, vectors, 
                                                                                generate_missing=generate_missing))
    return list(embeddings)

def w2v(data):

    embeddings = get_word2vec_embeddings(word2vec, data)
    list_labels = data["class_label"].tolist()
    
    return embeddings, list_labels
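
To make the averaging step concrete, here is a minimal sketch that applies the same idea to a toy lookup table. The three-dimensional `toy_vectors` dictionary and the `average_vectors` helper are hypothetical stand-ins for the 300-dimensional Google News vectors and `get_average_word2vec`; out-of-vocabulary tokens contribute a zero vector, as in the `generate_missing=False` branch.

```python
import numpy as np

# Hypothetical 3-dimensional lookup standing in for the word2vec KeyedVectors.
toy_vectors = {
    "aortic": np.array([1.0, 0.0, 2.0]),
    "stenosis": np.array([3.0, 2.0, 0.0]),
}

def average_vectors(tokens, vectors, k=3):
    """Average token vectors; unknown tokens count as zero vectors."""
    if len(tokens) < 1:
        return np.zeros(k)
    rows = [vectors[t] if t in vectors else np.zeros(k) for t in tokens]
    return np.mean(rows, axis=0)

# Two known tokens and one out-of-vocabulary token.
doc_vector = average_vectors(["aortic", "stenosis", "unknownword"], toy_vectors)
print(doc_vector)
```

Because unknown tokens still count toward the denominator, a document with many out-of-vocabulary words ends up with a vector shrunk toward zero.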

In [7]:
def tfidf_v2(data, ngrams_l = 1, ngrams_u = 1):
    
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(ngrams_l, ngrams_u))
    tfidf_vectorizer.fit(data['text'])

    list_corpus = data["text"].tolist()
    # Note: expects a "labels" column; the frame loaded above stores its labels in "class_label".
    list_labels = data["labels"].tolist()

    X = tfidf_vectorizer.transform(list_corpus)
    
    return X, list_labels

def w2v_v2(data):

    embeddings = get_word2vec_embeddings(word2vec, data)
    # Note: expects a "labels" column; the frame loaded above stores its labels in "class_label".
    list_labels = data["labels"].tolist()
    
    return embeddings, list_labels

In [12]:
counts = data['medical_specialty'].value_counts()
data_adj = data.copy(deep=True)
data_adj.loc[data_adj['medical_specialty'].isin(counts[counts<100].index), 'medical_specialty'] = ' Other Specialties'

In [24]:
len(data_adj['medical_specialty'].unique())
data_adj['medical_specialty'].value_counts()

 Surgery                          1088
 Other Specialties                1072
 Consult - History and Phy.        516
 Cardiovascular / Pulmonary        371
 Orthopedic                        355
 Radiology                         273
 General Medicine                  259
 Gastroenterology                  224
 Neurology                         223
 SOAP / Chart / Progress Notes     166
 Urology                           156
 Obstetrics / Gynecology           155
 Discharge Summary                 108
Name: medical_specialty, dtype: int64

In [56]:
counts = data_og['medical_specialty'].value_counts()
df = data_og.copy(deep=True)
df.loc[df['medical_specialty'].isin(counts[counts<100].index), 'medical_specialty'] = ' Other Specialties'

In [57]:
len(df['medical_specialty'].unique())
df['medical_specialty'].value_counts()

 Surgery                          1088
 Other Specialties                1072
 Consult - History and Phy.        516
 Cardiovascular / Pulmonary        371
 Orthopedic                        355
 Radiology                         273
 General Medicine                  259
 Gastroenterology                  224
 Neurology                         223
 SOAP / Chart / Progress Notes     166
 Urology                           156
 Obstetrics / Gynecology           155
 Discharge Summary                 108
Name: medical_specialty, dtype: int64

## Deep Learning with Keras

In [8]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.callbacks import ModelCheckpoint
import keras

Using TensorFlow backend.


In [25]:
def build_sequential(input_size, output_size):
    model=Sequential()
    model.add(Dense(64, activation = 'relu', input_shape=(input_size,)))
    model.add(Dropout(0.5))
    model.add(Dense(output_size, activation='softmax'))
    return model
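
As a rough sanity check on model size, the number of trainable parameters in this two-layer network can be worked out by hand. The sketch below assumes the 5,000 tf-idf features and 13 classes used later in the notebook; `dense_params` is a hypothetical helper, and the Dropout layer adds no parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

input_size, hidden_units, output_size = 5000, 64, 13

# Hidden Dense(64) layer followed by the softmax output layer.
total_params = dense_params(input_size, hidden_units) + dense_params(hidden_units, output_size)
print(total_params)
```

Nearly all of the parameters sit in the first layer, which is why the size of the tf-idf vocabulary dominates the size of this model.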

In [13]:
from sklearn.preprocessing import LabelEncoder

In [17]:
X = data_adj['text']
le = LabelEncoder()
le.fit(data_adj['medical_specialty'])
y = le.transform(data_adj['medical_specialty'])
output_size = len(le.classes_)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 

tfidf_vectorizer = TfidfVectorizer(max_features=5000)
x_train_vec = tfidf_vectorizer.fit_transform(X_train).toarray()
x_test_vec = tfidf_vectorizer.transform(X_test).toarray()
y_train_vec=keras.utils.to_categorical(y_train, data_adj['medical_specialty'].nunique())
y_test_vec=keras.utils.to_categorical(y_test, data_adj['medical_specialty'].nunique())

n_cols=x_train_vec.shape[1]

In [23]:
output_size, n_cols

(13, 5000)

In [26]:
model = build_sequential(n_cols, output_size)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

batch_size=100
epochs=30

# checkpoint=ModelCheckpoint('model-{epoch:03d}.model', monitor='val_loss', verbose=0, save_best_only=False, mode='auto')
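# Note: the test split is also passed as validation data below, so early stopping is tuned on the test set.
early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, verbose=0, mode='auto')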
model.fit(x_train_vec, y_train_vec, 
          batch_size=batch_size, 
          epochs=epochs, verbose=1, 
          validation_data = (x_test_vec, y_test_vec), callbacks=[early_stopping])


Train on 3972 samples, validate on 994 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30


<keras.callbacks.callbacks.History at 0x1db80bf6e48>

### LSTM models

In [42]:
from keras.layers import Dense, SimpleRNN
from keras.layers.embeddings import Embedding
from keras.layers import LSTM
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import EarlyStopping
from keras.constraints import maxnorm
import warnings
warnings.filterwarnings("ignore")

In [28]:
# Maximum vocabulary size (the most frequent words are kept).
MAX_NB_WORDS = 50000
# Maximum number of tokens kept per transcription.
MAX_SEQUENCE_LENGTH = 250
# Dimensionality of the learned embedding vectors.
EMBEDDING_DIM = 100
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(data_adj['text'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 22129 unique tokens.


In [30]:
X = tokenizer.texts_to_sequences(data_adj['text'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)

Shape of data tensor: (4966, 250)
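
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence, so a transcription longer than `MAX_SEQUENCE_LENGTH` keeps only its last 250 tokens. The following sketch reproduces that default behavior in plain NumPy for illustration; `pad_pre` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the tail of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            continue                   # an empty sequence stays all padding
        tail = seq[-maxlen:]           # like truncating='pre': drop leading tokens
        out[i, -len(tail):] = tail     # like padding='pre': zeros go in front
    return out

padded = pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

If the opening of a note carries more signal than its ending, passing `truncating='post'` to `pad_sequences` would keep the first tokens instead.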


In [31]:
Y = pd.get_dummies(data_adj['medical_specialty']).values
print('Shape of label tensor:', Y.shape)

Shape of label tensor: (4966, 13)


In [32]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.2, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(3972, 250) (3972, 13)
(994, 250) (994, 13)


In [36]:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(Dropout(0.5))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(13, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 5
batch_size = 64

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])


Train on 3574 samples, validate on 398 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [37]:
accr = model.evaluate(X_test,Y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

Test set
  Loss: 1.984
  Accuracy: 0.318


In [43]:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(Dropout(0.5))
model.add(LSTM(100, return_sequences=True, kernel_constraint=maxnorm(3)))
model.add(Dropout(0.5))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(13, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 30
batch_size = 64

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])

Train on 3574 samples, validate on 398 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30


In [44]:
accr = model.evaluate(X_test,Y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

Test set
  Loss: 2.111
  Accuracy: 0.309


In [61]:
# Maximum vocabulary size (the most frequent words are kept).
MAX_NB_WORDS = 50000
# Maximum number of tokens kept per transcription.
MAX_SEQUENCE_LENGTH = 250
# Dimensionality of the learned embedding vectors.
EMBEDDING_DIM = 100
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(df['transcription'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 22780 unique tokens.


In [62]:
X = tokenizer.texts_to_sequences(df['transcription'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)

Shape of data tensor: (4966, 250)


In [63]:
Y = pd.get_dummies(df['medical_specialty']).values
print('Shape of label tensor:', Y.shape)

Shape of label tensor: (4966, 13)


In [64]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.2, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(3972, 250) (3972, 13)
(994, 250) (994, 13)


In [65]:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(Dropout(0.5))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(13, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 5
batch_size = 64

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])


Train on 3574 samples, validate on 398 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [66]:
accr = model.evaluate(X_test,Y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

Test set
  Loss: 1.922
  Accuracy: 0.361


In [69]:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(Dropout(0.5))
model.add(LSTM(64, return_sequences=True))
model.add(Dropout(0.5))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(13, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 30
batch_size = 64

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])

Train on 3574 samples, validate on 398 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30


In [70]:
accr = model.evaluate(X_test,Y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

Test set
  Loss: 1.991
  Accuracy: 0.312


In [72]:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(Dropout(0.2))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(13, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 30
batch_size = 64

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])


Train on 3574 samples, validate on 398 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30


In [85]:
import tensorflow as tf
print("You are using TensorFlow version", tf.__version__)


You are using TensorFlow version 1.14.0


In [75]:
hello=tf.constant('Hello,TensorFlow!')

In [76]:
sess=tf.Session()

In [77]:
print(sess.run(hello))

b'Hello,TensorFlow!'


In [82]:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(Dropout(0.2))
model.add(LSTM(100, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(64, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(13, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 30
batch_size = 64

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])


Train on 3574 samples, validate on 398 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30


In [83]:
accr = model.evaluate(X_test,Y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

Test set
  Loss: 2.073
  Accuracy: 0.315


## Conclusion

After all of this testing, the LSTM models did not outperform logistic regression on PCA-reduced tf-idf vectors. The best LSTM model reached 36% accuracy on the test data, while the Lasso (L1-regularized) logistic regression obtained 38%. These experiments suggest that preprocessing choices heavily influence what the model learns: the same single-layer LSTM architecture scored 0.361 on the raw transcriptions versus 0.318 on the preprocessed text. Further optimization remains to be explored.
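
For context on these accuracy figures, it may help to compare them with a majority-class baseline. The sketch below uses the class counts printed earlier in the notebook (with the leading spaces in the labels stripped); it is computed on the full data set rather than the held-out split, so the exact test-set figure may differ slightly.

```python
# Class counts copied from the value_counts() output earlier in the notebook.
class_counts = {
    "Surgery": 1088,
    "Other Specialties": 1072,
    "Consult - History and Phy.": 516,
    "Cardiovascular / Pulmonary": 371,
    "Orthopedic": 355,
    "Radiology": 273,
    "General Medicine": 259,
    "Gastroenterology": 224,
    "Neurology": 223,
    "SOAP / Chart / Progress Notes": 166,
    "Urology": 156,
    "Obstetrics / Gynecology": 155,
    "Discharge Summary": 108,
}

total = sum(class_counts.values())
majority_label, majority_count = max(class_counts.items(), key=lambda kv: kv[1])

# Accuracy of a classifier that always predicts the largest class.
baseline_accuracy = majority_count / total
print(f"{majority_label}: {baseline_accuracy:.3f}")
```

Always predicting the largest class would score roughly 22% on the full data, so the 36% and 38% figures sit above this trivial baseline.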