# Training the LSTM on your Data

This notebook will take you through the steps necessary to train an LSTM to recognize ICD-9 codes, or items from similar dictionaries, from free text.

## Setup

### Imports

Make sure that the packages imported below are installed on the machine on which this notebook will run.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append('src/taggerSystem/')
from data_util import load_and_preprocess_data, load_embeddings, ModelHelper, lastTrueWordIdxs
import tensorflow as tf
from simpleLSTMWithNNetModel import Model, reloadModel
from trainModel import trainModel
import os
import pprint
import numpy as np
import pickle
pp = pprint.PrettyPrinter(indent=4)

### Directories

Executing the following cell will prompt the user to type in the paths to the training and validation sets, the vocabulary, and the word vector mappings. Defaults are given in the comments below.

Note that the training and validation data need to have one column dedicated to free text (`textIdx`) and another dedicated to the top-level ICD-9 codes (`codeIdx`) associated with each patient. Preferably, the latter should be strung together using '-' as the delimiter (e.g. for patient 1, 1-2-6-4).
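As a rough illustration of the '-'-delimited code column described above, the sketch below splits such a string and turns it into a 0/1 label vector. The function names and the `code_to_column` mapping are hypothetical; how `load_and_preprocess_data` actually parses the column is not shown in this notebook.

```python
def parse_codes(code_field, delimiter='-'):
    """Split a delimited code string such as '1-2-6-4' into a list of ints."""
    return [int(part) for part in code_field.split(delimiter) if part.strip()]


def to_multi_hot(codes, code_to_column):
    """Build a 0/1 vector with a 1 in the column of every code present."""
    vector = [0] * len(code_to_column)
    for code in codes:
        vector[code_to_column[code]] = 1
    return vector


# Hypothetical mapping from code to output column (illustration only).
code_to_column = {1: 0, 2: 1, 4: 2, 6: 3, 9: 4}

codes = parse_codes('1-2-6-4')
label_vector = to_multi_hot(codes, code_to_column)
print(codes)         # [1, 2, 6, 4]
print(label_vector)  # [1, 1, 1, 1, 0]
```

Each patient can carry several codes at once, which is why the label is a vector of independent 0/1 entries rather than a single class index.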

Please make sure that parameters such as the embedding size, maximum note length, learning rate, maximum number of training epochs, batch size, hidden layer size, number of neural net layers, and keep probabilities are set to your specification.

### TODO: provide a sample file with the required file headers

Adjust the training data headers to match the small ICD-9 training file.

In [3]:
data_train = raw_input('What is the path to the training data? ') #default: data/icd9NotesDataTable_train.csv
data_valid = raw_input('What is the path to the validation data? ') #default: data/icd9NotesDataTable_valid.csv
vocab = raw_input('What is the path to the vocabulary? ') #default: data/icd9Vocab.txt
wordVecs = raw_input('What is the path to the word vectors? ') #default: data/newgloveicd9.txt. These are length 300 word vectors from GloVe

NUM = "NNNUMMM"
UNK = "UUUNKKK"
EMBED_SIZE = 300 # this should correspond to the length of the word vectors
maxAllowedNoteLength = 1000
max_grad_norm = 5
codeIdx = raw_input('Which column contains top-level ICD-9 codes (outputs) in the training and test data? Default is 9. ')
textIdx = raw_input('Which column contains notes/text (inputs) in the training and test data? Default is 6. ')
learning_rate = 0.001
training_epochs = 100
batch_size = 256
n_hidden = 200
output_keep_prob = 0.5
input_keep_prob = 1
numLayers = 1

What is the path to the training data? data/0815/csu_nodiag_no17_0815_train.csv
What is the path to the validation data? data/0815/csu_nodiag_no17_0815_valid.csv
What is the path to the vocabulary? data/icd9Vocab.txt
What is the path to the word vectors? data/newgloveicd9.txt
Which column contains top-level ICD-9 codes (outputs) in the training and test data? Default is 9. 9
Which column contains notes/text (inputs) in the training and test data? Default is 6. 6


In [4]:
# NOTE:
# raw_input returns strings, so the column indices are cast to int here.
# This fixes an issue Ashley had.

textIdx = int(textIdx)
codeIdx = int(codeIdx) # I'm not sure how models were trained before if this part was broken
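The prompts above mention defaults, but an empty answer is not replaced by the default automatically, and the answers arrive as strings. A minimal sketch of a helper that handles both is shown here. `prompt_with_default` and its `reader` parameter are hypothetical names introduced for illustration; they are not part of `data_util`.

```python
def prompt_with_default(question, default, cast=str, reader=None):
    """Ask a question; fall back to `default` when the answer is empty."""
    if reader is None:
        try:
            reader = raw_input  # Python 2
        except NameError:
            reader = input      # Python 3
    answer = reader('{} [default: {}] '.format(question, default)).strip()
    return cast(answer) if answer else cast(default)


# The reader argument lets us simulate keyboard input for demonstration.
code_idx = prompt_with_default('Which column contains the ICD-9 codes?', 9,
                               cast=int, reader=lambda prompt: '')
text_idx = prompt_with_default('Which column contains the notes?', 6,
                               cast=int, reader=lambda prompt: '7')
print(code_idx)  # 9
print(text_idx)  # 7
```

Casting at the point of input keeps later cells free of repeated `int()` calls.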

Here, make sure that the output path is specified as you would like. By default, the program saves the output in a folder with the name of your choice within the folder `results`.

If there exists a folder with results that you would like to load again, use that here.

In [5]:
output_path = raw_input('Where are models and performances (to be) saved? ')
output_path = os.path.join('results', output_path)
if output_path == 'results/':
    output_path = 'results/temp'
if not os.path.exists(output_path):
    os.makedirs(output_path)

Where are models and performances (to be) saved? 0815_csu
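The comparison against `'results/'` in the cell above relies on the POSIX path separator. One possible alternative, sketched here under that observation, checks for an empty name before joining. `resolve_output_path` is a hypothetical helper, and a temporary directory stands in for `results` so the example does not touch real output folders.

```python
import os
import tempfile


def resolve_output_path(name, root='results', fallback='temp'):
    """Return root/name, using root/fallback when no name is given."""
    name = name.strip()
    path = os.path.join(root, name if name else fallback)
    if not os.path.isdir(path):
        os.makedirs(path)
    return path


base = tempfile.mkdtemp()  # stand-in for the 'results' folder
default_path = resolve_output_path('', root=base)
named_path = resolve_output_path('0815_csu', root=base)
print(os.path.basename(default_path))  # temp
print(os.path.basename(named_path))    # 0815_csu
```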


## Initialization

Executing the following cell will ask whether or not there is a previously saved model; if not, the model will train features from scratch, and if so, the features will be loaded.

Note that AZ added `int()` casts to `codeIdx` and `textIdx` to resolve some errors that were preventing the model from initializing.

In [6]:
sizeList = [n_hidden, 150, 75] # layer sizes we will be using; the number of labels is appended below

def query_yes_no(question, default="yes"):
    """Ask a yes/no question via raw_input() and return their answer.

    "question" is a string that is presented to the user.
    "default" is the presumed answer if the user just hits <Enter>.
        It must be "yes" (the default), "no" or None (meaning
        an answer is required of the user).

    The "answer" return value is True for "yes" or False for "no".
    """
    valid = {"yes": True, "y": True, "ye": True,
             "no": False, "n": False}
    if default is None:
        prompt = " [y/n] "
    elif default == "yes":
        prompt = " [Y/n] "
    elif default == "no":
        prompt = " [y/N] "
    else:
        raise ValueError("invalid default answer: '%s'" % default)

    while True:
        sys.stdout.write(question + prompt)
        choice = raw_input().lower()
        if default is not None and choice == '':
            return valid[default]
        elif choice in valid:
            return valid[choice]
        else:
            sys.stdout.write("Please respond with 'yes' or 'no' "
                             "(or 'y' or 'n').\n")
            
prev_model = query_yes_no("Is there a previously trained model?")

if prev_model:
    helper, train, dev, train_raw, dev_raw, xTrain, yTrain, xDev, yDev = load_and_preprocess_data(
        data_train = data_train, data_valid = data_valid,
        maxAllowedNoteLength = maxAllowedNoteLength,
        codeIdx = int(codeIdx), textIdx = int(textIdx),
        helperLoadPath = output_path)
else:
    #print codeIdx
    #print textIdx
    helper, train, dev, train_raw, dev_raw, xTrain, yTrain, xDev, yDev = load_and_preprocess_data(
        data_train = data_train, data_valid = data_valid,
        maxAllowedNoteLength = maxAllowedNoteLength,
        codeIdx = int(codeIdx), textIdx = int(textIdx))
    
embeddings = load_embeddings(vocab, wordVecs, helper, embeddingSize = EMBED_SIZE)
lastTrueWordIdx_train = lastTrueWordIdxs(train)
lastTrueWordIdx_dev = lastTrueWordIdxs(dev)
helper.save(output_path) # token2id and max length saved to output_path
sizeList.append(helper.n_labels)

total_batches = (xTrain.shape[0]//batch_size)
print('Total number of batches per epoch: %d'%(total_batches))
print('Maximum note length: %d'%(helper.max_length))
print('Number of ICD-9 codes: %d'%(helper.n_labels))
print('There are a total of: {} ICD-9 codes'.format(len(helper.icdDict.keys())))
pp.pprint(helper.icdDict)
print('xDev shape: nObs = %d, nWords = %d'%(xDev.shape))
print('yDev shape: nObs = %d, nClasses = %d'%(yDev.shape))
print('xTrain shape: nObs = %d, nWords = %d'%(xTrain.shape))
print('yTrain shape: nObs = %d, nClasses = %d'%(yTrain.shape))
print('Embeddings shape: nWords = %d, wordVec length = %d'%(embeddings.shape))

Is there a previously trained model? [Y/n] n
Total number of batches per epoch: 245
Maximum note length: 1000
Number of ICD-9 codes: 17
There are a total of: 17 ICD-9 codes
{   'cat:1': 8,
    'cat:10': 15,
    'cat:11': 14,
    'cat:12': 13,
    'cat:13': 12,
    'cat:14': 11,
    'cat:15': 10,
    'cat:16': 9,
    'cat:18': 16,
    'cat:2': 7,
    'cat:3': 6,
    'cat:4': 5,
    'cat:5': 4,
    'cat:6': 3,
    'cat:7': 2,
    'cat:8': 1,
    'cat:9': 0}
xDev shape: nObs = 26733, nWords = 1000
yDev shape: nObs = 26733, nClasses = 17
xTrain shape: nObs = 62858, nWords = 1000
yTrain shape: nObs = 62858, nClasses = 17
Embeddings shape: nWords = 10008, wordVec length = 300
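The batch count printed above follows from integer division of the training set size by the batch size. The short calculation below reproduces it from the printed shapes. Whether `trainModel` also uses the leftover rows is not shown in this notebook, so that part is left open.

```python
n_train = 62858   # rows in xTrain, taken from the printed shape
batch_size = 256  # value set in the setup cell

full_batches, leftover = divmod(n_train, batch_size)
print(full_batches)  # 245
print(leftover)      # 138
```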


The following cell initializes the dictionary of hyperparameters that fully describes the model.

In [7]:
hyperParamDict = {'EMBED_SIZE': EMBED_SIZE,
                  'maxNoteLength': maxAllowedNoteLength,
                  'maxGradNorm': max_grad_norm,
                  'outputKeepProb': output_keep_prob,
                  'inputKeepProb': input_keep_prob,
                  'learningRate': learning_rate,
                  'trainingEpochsMax': training_epochs,
                  'batchSize': batch_size,
                  'n_hidden': n_hidden,
                  'numLayers': numLayers,
                  'sizeList': sizeList}
pp.pprint(hyperParamDict)
with open(os.path.join(output_path, 'hyperParamDict.pickle'), 'wb') as handle:
    # Dump with protocol 2 because ALTUD uses Python 2.7 right now.
    pickle.dump(hyperParamDict, handle, protocol = 2)

{   'EMBED_SIZE': 300,
    'batchSize': 256,
    'inputKeepProb': 1,
    'learningRate': 0.001,
    'maxGradNorm': 5,
    'maxNoteLength': 1000,
    'n_hidden': 200,
    'numLayers': 1,
    'outputKeepProb': 0.5,
    'sizeList': [200, 150, 75, 17],
    'trainingEpochsMax': 100}
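To reuse the saved hyperparameters later, the pickle can be read back with `pickle.load`. The sketch below shows a round trip with a shortened dictionary; the temporary directory is a stand-in for `output_path`, and the variable names are illustrative only.

```python
import os
import pickle
import tempfile

# Shortened copy of the dictionary printed above, for illustration.
hyper_params = {'EMBED_SIZE': 300, 'batchSize': 256,
                'sizeList': [200, 150, 75, 17]}

out_dir = tempfile.mkdtemp()  # stand-in for output_path
pickle_path = os.path.join(out_dir, 'hyperParamDict.pickle')

# Protocol 2 is the highest protocol that Python 2.7 can read.
with open(pickle_path, 'wb') as handle:
    pickle.dump(hyper_params, handle, protocol=2)

with open(pickle_path, 'rb') as handle:
    restored = pickle.load(handle)

print(restored == hyper_params)  # True
```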


## Training

Finally, the model is trained (be wary - it will take some time; on an Amazon Deep Learning AMI, it took around an hour to train)...

In [8]:
from trainModel import trainModel
xDev[xDev == -1] = 0
xTrain[xTrain == -1] = 0
trainModel(helperObj = helper, embeddings = embeddings, hyperParamDict = hyperParamDict,
           xDev = xDev, xTrain = xTrain, yDev = yDev, yTrain = yTrain,
           lastTrueWordIdx_dev = lastTrueWordIdx_dev,
           lastTrueWordIdx_train = lastTrueWordIdx_train,
           training_epochs = training_epochs,
           output_path = output_path, batchSizeTrain = batch_size,
           sizeList = sizeList,
           maxIncreasingLossCount = 100, batchSizeDev = 1500, chatty = True)

shape of embeddings
(?, 1000, 300)
<class 'tensorflow.python.ops.rnn_cell_impl.MultiRNNCell'>
<tensorflow.python.ops.rnn_cell_impl.MultiRNNCell object at 0x7f4e01245990>
<class 'tensorflow.python.framework.ops.Tensor'>
(?, 1000, 300)
cell output size
200
cell state size
(LSTMStateTuple(c=200, h=200),)
output shape
(?, 1000, 200)
offset shape
(?, 1)
output shape new shape
(?, 200)
flattened indices shape
(?, 1)
output flattened shape
(?, 200)
(200, 150)
W_1 shape
(150,)
b_1 shape
(150, 75)
W_2 shape
(75,)
bias shape
(17,)
U shape
(75, 17)
bias shape
(17,)
output wx + b
(?, 17)
(?, 17)


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


***************************
***************************
Running on epoch 0
***************************
***************************
running iteration 0 with loss 0.692575 at time 2.115283
running iteration 25 with loss 0.346115 at time 33.993040
running iteration 50 with loss 0.297302 at time 65.938087
running iteration 75 with loss 0.327600 at time 97.870244
running iteration 100 with loss 0.323432 at time 129.841913
running iteration 125 with loss 0.313994 at time 161.733622
running iteration 150 with loss 0.310221 at time 193.522832
running iteration 175 with loss 0.325047 at time 225.321969
running iteration 200 with loss 0.327015 at time 257.115479
running iteration 225 with loss 0.336798 at time 288.898460
average training loss 0.334706
test loss 0.324836
Total run time was 345.876261
New best model found. Saving
results/0815_csu
***************************
***************************
Running on epoch 1
***************************
***************************
running iteration 0 wi

running iteration 200 with loss 0.158728 at time 256.466041
running iteration 225 with loss 0.173702 at time 288.358574
average training loss 0.147600
test loss 0.187914
Total run time was 345.088169
validation Loss Increase
***************************
***************************
Running on epoch 10
***************************
***************************
running iteration 0 with loss 0.126595 at time 1.274227
running iteration 25 with loss 0.135810 at time 33.024267
running iteration 50 with loss 0.105294 at time 64.782590
running iteration 75 with loss 0.134040 at time 96.541443
running iteration 100 with loss 0.127179 at time 128.302791
running iteration 125 with loss 0.138033 at time 160.074024
running iteration 150 with loss 0.137685 at time 191.965664
running iteration 175 with loss 0.133304 at time 223.777667
running iteration 200 with loss 0.149457 at time 255.705356
running iteration 225 with loss 0.170732 at time 287.640996
average training loss 0.139852
test loss 0.193268
Tot

...and the session is closed. You should be able to see your results in the `output_path` directory you specified earlier.
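The training log reports "New best model found. Saving" when the validation loss improves and "validation Loss Increase" when it does not, and `trainModel` takes a `maxIncreasingLossCount` argument. One plausible reading is a patience-style early stopping rule, sketched below. This is an assumption about the behavior, not the code in `trainModel`, and the loss values are made up for illustration.

```python
def run_with_early_stopping(valid_losses, max_increasing_loss_count):
    """Return (best_epoch, best_loss, epochs_run) for a list of validation losses."""
    best_loss = float('inf')
    best_epoch = None
    increases = 0
    epochs_run = 0
    for epoch, loss in enumerate(valid_losses):
        epochs_run += 1
        if loss < best_loss:
            # A real training loop would save a checkpoint here.
            best_loss, best_epoch = loss, epoch
            increases = 0
        else:
            increases += 1
            if increases >= max_increasing_loss_count:
                break
    return best_epoch, best_loss, epochs_run


# Illustrative validation losses, not taken from the log.
result = run_with_early_stopping([0.32, 0.25, 0.19, 0.20, 0.21, 0.22], 2)
print(result)  # (2, 0.19, 5)
```

Under this reading, a limit of 100 combined with at most 100 epochs would never stop training early, so lowering the limit may be worth considering if training time matters.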

To evaluate the results and generate plots, see `predictionEvaluation.ipynb` in the same repository.
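The shape trace printed when the graph was built shows the LSTM output going from `(?, 1000, 200)` to `(?, 200)`, with "offset" and "flattened indices" steps in between. That pattern suggests the model keeps only the output at each note's last real word. The plain-Python sketch below illustrates the idea on a tiny example; it is inferred from the trace and is not the code in `simpleLSTMWithNNetModel`.

```python
def gather_last_outputs(outputs, last_true_idxs):
    """outputs: nested list [batch][time][hidden]; last_true_idxs: one int per row."""
    max_length = len(outputs[0])
    # Flatten (batch, time, hidden) into (batch * time, hidden).
    flat = [step for sequence in outputs for step in sequence]
    # Row i of the batch starts at offset i * max_length in the flat list.
    flat_indices = [i * max_length + idx
                    for i, idx in enumerate(last_true_idxs)]
    return [flat[j] for j in flat_indices]


outputs = [
    [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]],  # 2 real words, 1 padding step
    [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]],  # 3 real words
]
last = gather_last_outputs(outputs, [1, 2])
print(last)  # [[0.3, 0.4], [0.9, 1.0]]
```

Selecting the last real word rather than the last time step keeps padding from influencing the vector passed to the dense layers.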

In [9]:
# xDev[xDev == -1] = 0
# xTrain[xTrain == -1] = 0
# trueWordIdxs = tf.placeholder(tf.int32, shape = (None,1))
# with tf.Session() as session:
#     tf.global_variables_initializer().run()
#     modelDict = reloadModel(session = session,
#                             saverCheckPointPath = output_path,
#                             saverMetaPath = output_path + '/bestModel.meta')
#     print('here we go')
#     for i in range(3):
#         pred_y = session.run(modelDict['y_last'],feed_dict={modelDict['xPlaceHolder']: xDev,
#                                       modelDict['trueWordIdxs']:lastTrueWordIdx_dev,
#                                       modelDict['outputKeepProb']: 1.0,
#                                       modelDict['inputKeepProb']: 1.0}, ) 
#         validLoss = tf.nn.sigmoid_cross_entropy_with_logits(logits = pred_y, 
#                                              labels = tf.cast(yDev, tf.float32))
#         validLoss = tf.reduce_mean(validLoss)
#         validLoss = validLoss.eval()
#         print('test loss %f'%(validLoss))
#         print('***********************************************')
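The commented-out cell above computes the validation loss with `tf.nn.sigmoid_cross_entropy_with_logits`. For reference, the sketch below evaluates the same loss in plain Python using the numerically stable form given in the TensorFlow documentation, `max(x, 0) - x * z + log(1 + exp(-|x|))`. The helper names are illustrative.

```python
import math


def sigmoid_cross_entropy(logit, label):
    """Stable sigmoid cross-entropy for one logit x and one 0/1 label z."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def mean_loss(logits, labels):
    """Average the element-wise loss over all rows and classes."""
    losses = [sigmoid_cross_entropy(x, z)
              for row_x, row_z in zip(logits, labels)
              for x, z in zip(row_x, row_z)]
    return sum(losses) / len(losses)


# With a logit of 0 the loss is ln(2), whatever the label.
print(round(sigmoid_cross_entropy(0.0, 1.0), 6))  # 0.693147
print(round(mean_loss([[0.0, 0.0]], [[1.0, 0.0]]), 6))  # 0.693147
```

The loss of 0.692575 logged at iteration 0 of the first epoch is close to ln(2), which is what one would expect if the initial logits are near zero.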

In [10]:
if 'session' in locals() and session is not None:
    print('Close interactive session')
    session.close()