## CNN with Attention, run on the full dataset (52.7K records)

The MaxPooling layer of the CNN is replaced by an attention layer, so that the pooled vector represents
all relevant information rather than only the maximum values.
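To make the difference concrete, here is a minimal NumPy sketch (not the code in `icd9_cnn_att`, which is not shown here). It assumes a simple dot-product attention with a single context vector; `attention_pool`, `max_pool`, and `context` are hypothetical names, and in a real model the context vector would be learned.

```python
import numpy as np

def max_pool(features):
    """Max over the time axis: keeps only the largest activation per filter."""
    return features.max(axis=0)

def attention_pool(features, context):
    """Softmax-weighted sum over the time axis.

    features: (time_steps, num_filters) convolution outputs
    context:  (num_filters,) scoring vector (hypothetical; learned in practice)
    """
    scores = np.dot(features, context)       # one score per time step
    scores = scores - scores.max()           # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    pooled = np.dot(weights, features)       # weighted average of all time steps
    return pooled, weights

rng = np.random.RandomState(0)
features = rng.randn(6, 4)    # 6 time steps, 4 filters (toy sizes)
context = rng.randn(4)

pooled_max = max_pool(features)
pooled_att, weights = attention_pool(features, context)
```

Every time step contributes to `pooled_att` in proportion to its weight, whereas `pooled_max` discards everything except one value per filter.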


In [1]:
%matplotlib inline
# General imports
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
import random
from collections import Counter, defaultdict
from operator import itemgetter
import matplotlib.pyplot as plt


#keras
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Flatten, Input, MaxPooling1D, Convolution1D, Embedding
from keras.layers.merge import Concatenate
from keras.models import load_model
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

# Custom functions
%load_ext autoreload
%autoreload 2
import database_selection
import vectorization
import helpers
import icd9_cnn_model
import lstm_model
import icd9_cnn_att


Using TensorFlow backend.


## Reading Input File

In [3]:
#reading file
df = pd.read_csv('../data/disch_notes_all_icd9.csv',
                 names = ['HADM_ID', 'SUBJECT_ID', 'DATE', 'ICD9','TEXT'])
print df.shape

(52696, 5)


## Preprocessing ICD-9 codes

In [4]:
#Source: https://github.com/sirrice/icd9 plus doing queries with it
ICD9_FIRST_LEVEL = [
    '001-139','140-239','240-279','290-319', '320-389', '390-459','460-519', '520-579', '580-629', 
    '630-679', '680-709','710-739', '760-779', '780-789', '790-796', '797', '798', '799', '800-999' ]
N_TOP = len(ICD9_FIRST_LEVEL)
# replacing leaf ICD9 codes with their grandparent (first-level) codes
df['ICD9'] = df['ICD9'].apply(lambda x: helpers.replace_with_grandparent_codes(x,ICD9_FIRST_LEVEL))
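The body of `helpers.replace_with_grandparent_codes` is not shown in this notebook. The sketch below illustrates one way such a mapping could work, assuming each record holds space-separated ICD-9 codes without a decimal point (for example `4019`) and that the first three digits identify the category. The function names `first_level_range` and `replace_codes` are hypothetical.

```python
ICD9_FIRST_LEVEL = [
    '001-139', '140-239', '240-279', '290-319', '320-389', '390-459', '460-519',
    '520-579', '580-629', '630-679', '680-709', '710-739', '760-779', '780-789',
    '790-796', '797', '798', '799', '800-999']

def first_level_range(code, ranges=ICD9_FIRST_LEVEL):
    """Return the first-level range containing `code`, or None.

    Assumption: the first three characters of a numeric code are its category.
    V and E codes, and categories outside the listed ranges, map to None.
    """
    prefix = code.strip()[:3]
    if len(prefix) < 3 or not prefix.isdigit():
        return None
    category = int(prefix)
    for r in ranges:
        parts = r.split('-')
        low, high = int(parts[0]), int(parts[-1])   # '797' gives low == high
        if low <= category <= high:
            return r
    return None

def replace_codes(codes_field, ranges=ICD9_FIRST_LEVEL):
    """Map space-separated leaf codes to unique first-level ranges (first-seen order)."""
    seen = []
    for code in codes_field.split():
        r = first_level_range(code, ranges)
        if r is not None and r not in seen:
            seen.append(r)
    return ' '.join(seen)
```

Under these assumptions, `replace_codes('4019 4280 25000 V1582')` yields `'390-459 240-279'`: the two circulatory codes collapse into one range and the V code is dropped.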


In [6]:
#preprocess icd9 codes to vectors 
top_codes = ICD9_FIRST_LEVEL
labels = vectorization.vectorize_icd_column(df, 'ICD9', top_codes)
print 'sample of vectorized icd9 labels: ', labels[0]


sample of vectorized icd9 labels:  [0 0 1 0 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0]
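For reference, a multi-hot encoding like the sample above can be sketched as follows. This is not the code of `vectorization.vectorize_icd_column`; it assumes each record is a space-separated string of first-level ranges and that vector positions follow the order of `top_codes`. The name `multi_hot` is hypothetical.

```python
import numpy as np

TOP_CODES = [
    '001-139', '140-239', '240-279', '290-319', '320-389', '390-459', '460-519',
    '520-579', '580-629', '630-679', '680-709', '710-739', '760-779', '780-789',
    '790-796', '797', '798', '799', '800-999']

def multi_hot(codes_field, top_codes):
    """Return a 0/1 vector with a 1 at the position of every code present."""
    index = {code: i for i, code in enumerate(top_codes)}
    vec = np.zeros(len(top_codes), dtype=int)
    for code in codes_field.split():
        if code in index:          # codes outside top_codes are ignored
            vec[index[code]] = 1
    return vec

# Ranges chosen so the result has the same bit pattern as the sample output above.
example = multi_hot('240-279 320-389 390-459 520-579 580-629 680-709', TOP_CODES)
```

If the column order assumption holds, the ones in the sample vector correspond to those six ranges.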


## Preprocessing notes

In [7]:
#preprocess notes
MAX_VOCAB = None # to limit original number of words (None if no limit)
MAX_SEQ_LENGTH = 5000 # to limit length of word sequence (None if no limit)
df.TEXT = vectorization.clean_notes(df, 'TEXT')
data_vectorized, dictionary, MAX_VOCAB = vectorization.vectorize_notes(df.TEXT, MAX_VOCAB, verbose = True)
data, MAX_SEQ_LENGTH = vectorization.pad_notes(data_vectorized, MAX_SEQ_LENGTH)

print("Final Vocabulary: %s" % MAX_VOCAB)
print("Final Max Sequence Length: %s" % MAX_SEQ_LENGTH)

Vocabulary size: 139074
Average note length: 1634.982845
Max note length: 10924
Final Vocabulary: 139074
Final Max Sequence Length: 5000
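Since the longest note (10924 tokens) exceeds `MAX_SEQ_LENGTH`, some notes are cut and the rest are padded. A minimal sketch of that step is shown below. It assumes truncation and padding both happen at the end of the sequence and that id 0 is the padding token; `vectorization.pad_notes` may behave differently, and `pad_to_length` is a hypothetical name.

```python
import numpy as np

def pad_to_length(sequences, max_len, pad_value=0):
    """Truncate or right-pad each list of word ids to exactly max_len entries."""
    out = np.full((len(sequences), max_len), pad_value, dtype=np.int64)
    for i, seq in enumerate(sequences):
        trimmed = seq[:max_len]              # drop tokens beyond max_len
        out[i, :len(trimmed)] = trimmed      # remaining positions keep pad_value
    return out

padded = pad_to_length([[1, 2, 3], [4, 5, 6, 7, 8, 9]], max_len=4)
```

Here the first toy sequence is padded with one zero and the second loses its last two ids.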


In [9]:
#creating glove embeddings
EMBEDDING_DIM = 100 # given the glove that we chose
EMBEDDING_MATRIX= []
EMBEDDING_LOC = '../data/notes.100.txt' # location of embedding
EMBEDDING_MATRIX, embedding_dict = vectorization.embedding_matrix(EMBEDDING_LOC,
                                                                  dictionary, EMBEDDING_DIM, verbose = True, sigma=True)


('Vocabulary in notes:', 139074)
('Vocabulary in original embedding:', 21056)
('Vocabulary intersection:', 20640)
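Only 20640 of the 139074 note words have a pre-trained vector, so the remaining rows need some other initialization. The sketch below shows one common approach, assuming word ids start at 1 with row 0 reserved for padding (which would match the 139075 x 100 embedding parameters in the model summary) and that missing words get small random vectors. How `vectorization.embedding_matrix` and its `sigma` flag actually handle this is not shown; `build_embedding_matrix` is a hypothetical name.

```python
import numpy as np

def build_embedding_matrix(word_index, pretrained, dim, seed=0):
    """Build a (vocab + 1, dim) matrix from a word -> vector dictionary.

    word_index: word -> integer id, ids assumed to start at 1
    pretrained: word -> vector of length dim
    Words without a pre-trained vector keep a small random vector (assumption).
    """
    rng = np.random.RandomState(seed)
    matrix = rng.normal(scale=0.1, size=(len(word_index) + 1, dim))
    matrix[0] = 0.0                           # padding row
    hits = 0
    for word, idx in word_index.items():
        vec = pretrained.get(word)
        if vec is not None:
            matrix[idx] = vec
            hits += 1
    return matrix, hits

word_index = {'pain': 1, 'fever': 2, 'xyzzy': 3}
pretrained = {'pain': np.ones(4), 'fever': np.full(4, 2.0), 'cough': np.zeros(4)}
matrix, hits = build_embedding_matrix(word_index, pretrained, dim=4)
```

In this toy case `hits` plays the role of the "Vocabulary intersection" count printed above.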


## Split Files

In [10]:
#split sets
X_train, X_val, X_test, y_train, y_val, y_test = helpers.train_val_test_split(
    data, labels, val_size=0.2, test_size=0.1, random_state=101)
print("Train: ", X_train.shape, y_train.shape)
print("Validation: ", X_val.shape, y_val.shape)
print("Test: ", X_test.shape, y_test.shape)

('Train: ', (36887, 5000), (36887, 19))
('Validation: ', (10539, 5000), (10539, 19))
('Test: ', (5270, 5000), (5270, 19))
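A three-way split of this kind can be sketched with a single shuffled index. This is an illustration only, not the body of `helpers.train_val_test_split`; the name `split_three_way` and the rounding rule are assumptions.

```python
import numpy as np

def split_three_way(X, y, val_size=0.2, test_size=0.1, random_state=None):
    """Shuffle once, then cut the index into test, validation, and train parts."""
    n = len(X)
    order = np.random.RandomState(random_state).permutation(n)
    n_test = int(round(n * test_size))
    n_val = int(round(n * val_size))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return (X[train_idx], X[val_idx], X[test_idx],
            y[train_idx], y[val_idx], y[test_idx])

# Toy arrays with the same number of rows as the dataset.
X_demo = np.arange(52696).reshape(-1, 1)
y_demo = np.arange(52696)
parts = split_three_way(X_demo, y_demo, val_size=0.2, test_size=0.1, random_state=101)
```

With this rounding, 52696 rows split into 36887 / 10539 / 5270, which agrees with the shapes printed above, although the actual helper may compute the sizes differently.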


In [11]:
# Delete temporary variables to free some memory
del df, data, labels

## CNN and attention, running with the full data set

The model overfits: validation metrics improve over the first 7 epochs, then decline in later epochs. When this model is trained on 5k records for 5 epochs, it achieves the highest F1 score among the models compared (plain CNN, LSTM, LSTM-ATT, Hierarchical-LSTM-ATT), but not when trained on the full dataset.

Note: this run already includes several improvements; the first run overfit far more severely.
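When validation metrics peak and then decline, a common remedy is to stop training once the validation score has not improved for a set number of epochs. The sketch below shows that rule in plain Python. The function name `best_epoch_with_patience` is hypothetical and the scores are made-up placeholder values, not results from this notebook.

```python
def best_epoch_with_patience(val_scores, patience=2):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops at the first epoch that is `patience` epochs past the
    best validation score seen so far.
    """
    best_score = float('-inf')
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_scores)

# Placeholder scores that rise for 7 epochs and then fall.
placeholder_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.6, 0.5, 0.4]
best, stopped = best_epoch_with_patience(placeholder_scores, patience=2)
```

Keras offers the same idea through the `keras.callbacks.EarlyStopping` callback, which takes `monitor` and `patience` arguments and can be passed to `fit` via `callbacks`.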

In [48]:
reload(icd9_cnn_att)
#### build model
cnn_att_model = icd9_cnn_att.build_icd9_cnn_model (input_seq_length=MAX_SEQ_LENGTH, max_vocab = MAX_VOCAB,
                             external_embeddings = True,
                             embedding_dim=EMBEDDING_DIM,embedding_matrix=EMBEDDING_MATRIX,
                             num_filters = 100, filter_sizes=[2,3,4,5],
                             training_dropout=0.5,
                             num_classes=N_TOP )

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
input_2 (InputLayer)             (None, 5000)          0                                            
____________________________________________________________________________________________________
embedding_2 (Embedding)          (None, 5000, 100)     13907500    input_2[0][0]                    
____________________________________________________________________________________________________
conv1d_5 (Conv1D)                (None, 4999, 100)     20100       embedding_2[0][0]                
____________________________________________________________________________________________________
conv1d_6 (Conv1D)                (None, 4998, 100)     30100       embedding_2[0][0]                
___________________________________________________________________________________________

### Running after tuning parameters
Results are better, but validation performance still starts dropping after 7 epochs. Changes tried:
* dropout in the output layer
* two dropouts in the attention layer
* dropout value = 0.5 (higher values made runs take longer and still did not improve the F1 score)
* L2 regularization
* default learning rate (we tried a smaller one, but it didn't work)


In [14]:
cnn_att_model.fit(X_train, y_train, batch_size=50, epochs=5, validation_data=(X_val, y_val), verbose=1)

Train on 36887 samples, validate on 10539 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f6c2da355d0>

In [15]:
cnn_att_model.save('models/cnn_att_5_epochs_50k.h5')

In [47]:
pred_train = cnn_att_model.predict(X_train, batch_size=100)
pred_dev = cnn_att_model.predict(X_val, batch_size=100)
# perform evaluation
helpers.show_f1_score(y_train, pred_train, y_val, pred_dev)

F1 scores
threshold | training | dev  
0.020:      0.551      0.546
0.030:      0.568      0.563
0.040:      0.583      0.576
0.050:      0.597      0.589
0.055:      0.604      0.595
0.058:      0.607      0.598
0.060:      0.610      0.601
0.080:      0.634      0.622
0.100:      0.656      0.641
0.200:      0.736      0.711
0.300:      0.788      0.751
0.400:      0.817      0.771
0.500:      0.826      0.775
0.600:      0.816      0.766
0.700:      0.789      0.742


In [25]:
cnn_att_model = load_model('models/cnn_att_5_epochs_50k.h5')



In [26]:
cnn_att_model.fit(X_train, y_train, batch_size=50, epochs=5, validation_data=(X_val, y_val), verbose=1)

Train on 36887 samples, validate on 10539 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f6b2c106f90>

In [27]:
pred_train = cnn_att_model.predict(X_train, batch_size=100)
pred_dev = cnn_att_model.predict(X_val, batch_size=100)
# perform evaluation
helpers.show_f1_score(y_train, pred_train, y_val, pred_dev)

F1 scores
threshold | training | dev  
0.020:      0.539      0.538
0.030:      0.559      0.557
0.040:      0.577      0.574
0.050:      0.592      0.589
0.055:      0.599      0.596
0.058:      0.603      0.600
0.060:      0.606      0.602
0.080:      0.630      0.624
0.100:      0.650      0.643
0.200:      0.719      0.707
0.300:      0.766      0.748
0.400:      0.797      0.774
0.500:      0.807      0.782
0.600:      0.795      0.769
0.700:      0.758      0.735
