# Training the LSTM on your Data

This notebook will take you through the steps necessary to train an LSTM to recognize ICD-9 codes, or items from similar dictionaries, from free text.

## Setup

### Imports

Make sure that the below packages are installed on the server on which this program will run.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append('src/taggerSystem/')
from data_util import load_and_preprocess_data, load_embeddings, ModelHelper, lastTrueWordIdxs
import tensorflow as tf
from simpleLSTMWithNNetModel import Model, reloadModel
from trainModel import trainModel
import os
import pprint
import numpy as np
import pickle
pp = pprint.PrettyPrinter(indent=4)
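
Before running the rest of the notebook, it can help to confirm that the third-party packages imported above are actually available. The following is a minimal sketch, not part of the original code: `missing_packages` and `REQUIRED` are hypothetical names, and the package list is simply read off the import cells in this notebook.

```python
import importlib


def missing_packages(names):
    """Return the entries of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Third-party packages imported by the cells in this notebook.
REQUIRED = ['tensorflow', 'numpy', 'pandas', 'IPython']

# Uncomment to check the current environment (importing tensorflow can be slow):
# print(missing_packages(REQUIRED))
```

An empty list means every package imported cleanly; any names returned need to be installed on the server first.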

### A bit of data clean-up on the CSU data

In [3]:
import pandas as pd
from IPython.display import display, HTML

In [83]:
csu = pd.read_csv('data/csu_snomed_test', sep='\t')
display(csu)

Unnamed: 0,id,case_number,invoice_number,combined,all_groups,top_levels
0,909695,296423,3393678,"Gracie, a 2 year old female spayed Bernese Mou...",232234000|3135009|246075003|18097004|275441009...,3-7-17-13-7-17-13-17-1-13-11-17-13
1,909696,296422,3393679,Truman was presented to the CSU Orthopedic Sur...,450876003|272741003|7771000,
2,909698,291144,3393683,"Twila is an 11 month old, female spayed, DSH f...",102499006,17
3,909699,295061,3393684,Rufus is a 9 month old male intact Newfoundlan...,183644000|R44462005|363698007|R23416004|272741...,17-14-17-14
4,909706,296425,3393700,This patient was a stray geriatric female cat ...,88610006|S42399005|371597004,17-11-17-17
5,909713,290876,3393712,"Lincoln, a 5 year old male castrated Bouvier d...",R24079001|RS119292006|R363732003,3-3-17-13-17-10-17-3
6,909725,295452,3393727,"Ruddr, a 5mo MI Labrador, was represented to t...",R298005009|363698007|R116013008|272741003|7771000,17
7,909739,295253,3393748,"Sophie has been sneezing more this week, the o...",R118600007|21911005|69960004|363701004|387227009,17-4-2
8,909740,291538,3393749,"Penny, an 11 year old female spayed Miniature ...",R128091003|R25121006|R75331009|R235856003|2460...,3-17-3-4-4-3-17-3-4-4-4-17-10
9,909742,267273,3393752,"24-hour history of collapse, lethargy. Abdomin...",373945007|389026000|271737000|371019009|234120...,17-8-17-17-3-17-17-8-17-8-17-8-2-17-2-10-17-2


In [84]:
csu['CHARTDATE'] = np.nan
csu['DESCRIPTION'] = np.nan
csu['Level2ICD'] = np.nan
csu['TopLevelICD'] = np.nan

display(csu[:5])

Unnamed: 0,id,case_number,invoice_number,combined,all_groups,top_levels,CHARTDATE,DESCRIPTION,Level2ICD,TopLevelICD
0,909695,296423,3393678,"Gracie, a 2 year old female spayed Bernese Mou...",232234000|3135009|246075003|18097004|275441009...,3-7-17-13-7-17-13-17-1-13-11-17-13,,,,
1,909696,296422,3393679,Truman was presented to the CSU Orthopedic Sur...,450876003|272741003|7771000,,,,,
2,909698,291144,3393683,"Twila is an 11 month old, female spayed, DSH f...",102499006,17,,,,
3,909699,295061,3393684,Rufus is a 9 month old male intact Newfoundlan...,183644000|R44462005|363698007|R23416004|272741...,17-14-17-14,,,,
4,909706,296425,3393700,This patient was a stray geriatric female cat ...,88610006|S42399005|371597004,17-11-17-17,,,,


In [85]:
csu.rename(columns={'id': 'HADM_ID', 'case_number':'SUBJECT_ID', 'combined':'TEXT', 'all_groups':'ICD9_CODE', 'top_levels':'V9'}, inplace=True)

In [86]:
# Drop the invoice_number column, which is not needed below.

csu.drop(['invoice_number'], axis=1, inplace=True)
display(csu[:5])

Unnamed: 0,HADM_ID,SUBJECT_ID,TEXT,ICD9_CODE,V9,CHARTDATE,DESCRIPTION,Level2ICD,TopLevelICD
0,909695,296423,"Gracie, a 2 year old female spayed Bernese Mou...",232234000|3135009|246075003|18097004|275441009...,3-7-17-13-7-17-13-17-1-13-11-17-13,,,,
1,909696,296422,Truman was presented to the CSU Orthopedic Sur...,450876003|272741003|7771000,,,,,
2,909698,291144,"Twila is an 11 month old, female spayed, DSH f...",102499006,17,,,,
3,909699,295061,Rufus is a 9 month old male intact Newfoundlan...,183644000|R44462005|363698007|R23416004|272741...,17-14-17-14,,,,
4,909706,296425,This patient was a stray geriatric female cat ...,88610006|S42399005|371597004,17-11-17-17,,,,


In [87]:
# Reorder the columns to match the layout of the ICD-9 training file.

csu = csu[['HADM_ID', 'SUBJECT_ID','ICD9_CODE','CHARTDATE','DESCRIPTION','TEXT','Level2ICD','TopLevelICD','V9']]
display(csu[:5])

Unnamed: 0,HADM_ID,SUBJECT_ID,ICD9_CODE,CHARTDATE,DESCRIPTION,TEXT,Level2ICD,TopLevelICD,V9
0,909695,296423,232234000|3135009|246075003|18097004|275441009...,,,"Gracie, a 2 year old female spayed Bernese Mou...",,,3-7-17-13-7-17-13-17-1-13-11-17-13
1,909696,296422,450876003|272741003|7771000,,,Truman was presented to the CSU Orthopedic Sur...,,,
2,909698,291144,102499006,,,"Twila is an 11 month old, female spayed, DSH f...",,,17
3,909699,295061,183644000|R44462005|363698007|R23416004|272741...,,,Rufus is a 9 month old male intact Newfoundlan...,,,17-14-17-14
4,909706,296425,88610006|S42399005|371597004,,,This patient was a stray geriatric female cat ...,,,17-11-17-17


In [88]:
csu.to_csv('data/csu_test_revise2', sep=',')

In [5]:
csu = pd.read_csv('data/csu_test_revise2', sep=',')

In [16]:
csu.iloc[:,6:7]



Unnamed: 0,TEXT
0,"Gracie, a 2 year old female spayed Bernese Mou..."
1,Truman was presented to the CSU Orthopedic Sur...
2,"Twila is an 11 month old, female spayed, DSH f..."
3,Rufus is a 9 month old male intact Newfoundlan...
4,This patient was a stray geriatric female cat ...
5,"Lincoln, a 5 year old male castrated Bouvier d..."
6,"Ruddr, a 5mo MI Labrador, was represented to t..."
7,"Sophie has been sneezing more this week, the o..."
8,"Penny, an 11 year old female spayed Miniature ..."
9,"24-hour history of collapse, lethargy. Abdomin..."
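
Because `to_csv` was called without `index=False`, the saved file has an extra, unnamed first column, which is why `TEXT` sits at position 6 and `V9` at position 9 when the file is read back. A small sketch of how to look up these positions with the standard `csv` module follows; the header line is the one printed later in this notebook, the data row is taken from the table above, and the variable names are only illustrative.

```python
import csv
import io

# Header as written by to_csv (note the empty name of the index column).
header_line = u',HADM_ID,SUBJECT_ID,ICD9_CODE,CHARTDATE,DESCRIPTION,TEXT,Level2ICD,TopLevelICD,V9\n'
# One data row in the same layout.
row_line = u'2,909698,291144,102499006,,,"Twila is an 11 month old, female spayed, DSH f...",,,17\n'

reader = csv.reader(io.StringIO(header_line + row_line))
header = next(reader)
row = next(reader)

text_idx = header.index('TEXT')   # column holding the free text
code_idx = header.index('V9')     # column holding the top-level codes

print(text_idx)
print(code_idx)
print(row[code_idx])
```

These are the values to enter when the notebook later asks for the text and code columns.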


### Directories

Executing the following cell will prompt the user to type in the directories corresponding to the training and validation sets, the vocabulary, and the word vector mappings. Defaults are given in the comments below.

Note that the training and test data need one column dedicated to free text (`textIdx`) and another dedicated to the top-level ICD-9 codes (`codeIdx`) associated with each patient. Preferably, the codes in the latter should be joined with '-' as the delimiter (e.g., 1-2-6-4 for patient 1).
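
To make the '-'-delimited format concrete, here is a rough sketch of how such a string could be turned into a multi-hot label vector. This is an assumption about what the preprocessing does, not the actual code in `load_and_preprocess_data`; `encode_codes` is a hypothetical helper, and `label_to_index` copies the label dictionary printed later in this notebook.

```python
# Label dictionary as printed by the initialization cell (helper.icdDict).
label_to_index = {
    '': 0, '10': 1, '13': 2, '12': 3, '15': 4, '14': 5, '17': 6, '16': 7,
    '11': 8, '18': 9, '1': 10, '3': 11, '2': 12, '5': 13, '4': 14, '7': 15,
    '6': 16, '9': 17, '8': 18,
}


def encode_codes(code_string, mapping):
    """Turn a '-'-delimited code string into a 0/1 vector over all labels."""
    vector = [0] * len(mapping)
    # Repeated codes (e.g. '17-14-17-14') set the same position more than once.
    for code in code_string.split('-'):
        vector[mapping[code]] = 1
    return vector


labels = encode_codes('17-14-17-14', label_to_index)
print(labels)
```

With this mapping, an empty code string maps to the `''` entry at index 0, so notes without any code still get a well-defined vector.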

Please make sure that the parameters (embedding size, maximum note length, learning rate, maximum number of training epochs, batch size, hidden layer size, number of neural net layers, and dropout keep probabilities) are set to your specification.

### Required file headers

Adjust the training data headers to match those of the small ICD-9 training file.

In [3]:
data_train = raw_input('What is the path to the training data? ') #default: data/icd9NotesDataTable_train.csv
data_valid = raw_input('What is the path to the validation data? ') #default: data/icd9NotesDataTable_valid.csv
vocab = raw_input('What is the path to the vocabulary? ') #default: data/icd9Vocab.txt
wordVecs = raw_input('What is the path to the word vectors? ') #default: data/newgloveicd9.txt. These are length-300 word vectors from GloVe

NUM = "NNNUMMM"
UNK = "UUUNKKK"
EMBED_SIZE = 300 # this should correspond to the length of the word vectors
maxAllowedNoteLength = 1000
max_grad_norm = 5
codeIdx = raw_input('Which column contains top-level ICD-9 codes (outputs) in the training and test data? Default is 9. ')
print codeIdx
textIdx = raw_input('Which column contains notes/text (inputs) in the training and test data? Default is 6. ')
print textIdx
learning_rate = 0.001
training_epochs = 100
batch_size = 256
n_hidden = 200
output_keep_prob = 0.5
input_keep_prob = 1
numLayers = 1

What is the path to the training data? data/csu_train_revise2
What is the path to the validation data? data/csu_test_revise2
What is the path to the vocabulary? data/icd9Vocab.txt
What is the path to the word vectors? data/newgloveicd9.txt
Which column contains top-level ICD-9 codes (outputs) in the training and test data? Default is 9. 9
9
Which column contains notes/text (inputs) in the training and test data? Default is 6. 6
6


Here, make sure that the output path is specified as you would like. By default, the program saves the output in a folder with the name of your choice within the folder `results`.

If there exists a folder with results that you would like to load again, use that here.

In [4]:
output_path = raw_input('Where are models and performances (to be) saved? ')
output_path = os.path.join('results', output_path)
if output_path == 'results/':
    output_path = 'results/temp'
if not os.path.exists(output_path):
    os.makedirs(output_path)

Where are models and performances (to be) saved? az_csu
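
The path logic in the cell above can be wrapped in a small function so that it is easy to reuse and check. This is only a sketch of the same behavior; `resolve_output_path` and `ensure_dir` are hypothetical names, and the `root` and `fallback` defaults mirror the `results` and `temp` values used above.

```python
import os


def resolve_output_path(name, root='results', fallback='temp'):
    """Join `name` onto `root`, falling back to root/fallback for an empty name."""
    path = os.path.join(root, name)
    # os.path.join(root, '') yields root plus a trailing separator.
    if path == os.path.join(root, ''):
        path = os.path.join(root, fallback)
    return path


def ensure_dir(path):
    """Create `path` if it does not exist yet and return it."""
    if not os.path.exists(path):
        os.makedirs(path)
    return path


print(resolve_output_path('az_csu'))
print(resolve_output_path(''))
```

Calling `ensure_dir(resolve_output_path(name))` then reproduces what the cell does in one step.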


## Initialization

Executing the following cell will ask whether or not there is a previously saved model; if not, the model will train features from scratch, and if so, the features will be loaded.

Note: AZ wrapped `codeIdx` and `textIdx` in `int()` to resolve errors that were preventing initialization (`raw_input` returns strings, not integers).
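
One way to avoid this class of error is to convert the answers as soon as they are read and to fall back to the stated defaults on empty input. The helper below is a hypothetical sketch, not something the repository provides; the defaults 9 and 6 are the ones mentioned in the prompts above.

```python
def parse_column_index(answer, default):
    """Convert a typed answer to a non-negative column index.

    An empty answer returns `default`; anything that is not an
    integer raises ValueError from int().
    """
    answer = answer.strip()
    if answer == '':
        return default
    index = int(answer)
    if index < 0:
        raise ValueError('column index must be non-negative: %d' % index)
    return index


print(parse_column_index('9', 9))
print(parse_column_index('', 6))
```

With something like this in place, `codeIdx` and `textIdx` would already be integers by the time they reach `load_and_preprocess_data`.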

In [5]:
sizeList = [n_hidden, 150, 75] # layer sizes for the feed-forward network; the number of labels is appended below

def query_yes_no(question, default="yes"):
    """Ask a yes/no question via raw_input() and return their answer.

    "question" is a string that is presented to the user.
    "default" is the presumed answer if the user just hits <Enter>.
        It must be "yes" (the default), "no" or None (meaning
        an answer is required of the user).

    The "answer" return value is True for "yes" or False for "no".
    """
    valid = {"yes": True, "y": True, "ye": True,
             "no": False, "n": False}
    if default is None:
        prompt = " [y/n] "
    elif default == "yes":
        prompt = " [Y/n] "
    elif default == "no":
        prompt = " [y/N] "
    else:
        raise ValueError("invalid default answer: '%s'" % default)

    while True:
        sys.stdout.write(question + prompt)
        choice = raw_input().lower()
        if default is not None and choice == '':
            return valid[default]
        elif choice in valid:
            return valid[choice]
        else:
            sys.stdout.write("Please respond with 'yes' or 'no' "
                             "(or 'y' or 'n').\n")
            
prev_model = query_yes_no("Is there a previously trained model?")

if prev_model:
    helper, train, dev, train_raw, dev_raw, xTrain, yTrain, xDev, yDev = load_and_preprocess_data(
    data_train = data_train, data_valid = data_valid, 
    maxAllowedNoteLength = maxAllowedNoteLength, 
    codeIdx = int(codeIdx), textIdx = int(textIdx),
    helperLoadPath = output_path)
else:
    #print codeIdx
    #print textIdx
    helper, train, dev, train_raw, dev_raw, xTrain, yTrain, xDev, yDev = load_and_preprocess_data(
    data_train = data_train, data_valid = data_valid, 
    maxAllowedNoteLength = maxAllowedNoteLength, 
    codeIdx = int(codeIdx), textIdx = int(textIdx))
    
embeddings = load_embeddings(vocab, wordVecs, helper, embeddingSize = EMBED_SIZE)
lastTrueWordIdx_train = lastTrueWordIdxs(train)
lastTrueWordIdx_dev = lastTrueWordIdxs(dev)
helper.save(output_path) # token2id and max length saved to output_path
sizeList.append(helper.n_labels)

total_batches = (xTrain.shape[0]//batch_size)
print('Total number of batches per epoch: %d'%(total_batches))
print('Maximum note length: %d'%(helper.max_length))
print('Number of ICD-9 codes: %d'%(helper.n_labels))
print('There are a total of: {} ICD-9 codes'.format(len(helper.icdDict.keys())))
pp.pprint(helper.icdDict)
print('xDev shape: nObs = %d, nWords = %d'%(xDev.shape))
print('yDev shape: nObs = %d, nClasses = %d'%(yDev.shape))
print('xTrain shape: nObs = %d, nWords = %d'%(xTrain.shape))
print('yTrain shape: nObs = %d, nClasses = %d'%(yTrain.shape))
print('Embeddings shape: nWords = %d, wordVec length = %d'%(embeddings.shape))

Is there a previously trained model? [Y/n] n
9
6
<type 'list'>
<type '_csv.reader'>
['', 'HADM_ID', 'SUBJECT_ID', 'ICD9_CODE', 'CHARTDATE', 'DESCRIPTION', 'TEXT', 'Level2ICD', 'TopLevelICD', 'V9']
<type 'list'>
<type 'list'>
<type '_csv.reader'>
['', 'HADM_ID', 'SUBJECT_ID', 'ICD9_CODE', 'CHARTDATE', 'DESCRIPTION', 'TEXT', 'Level2ICD', 'TopLevelICD', 'V9']
<type 'list'>
Total number of batches per epoch: 307
Maximum note length: 1000
Number of ICD-9 codes: 19
There are a total of: 19 ICD-9 codes
{   '': 0,
    '1': 10,
    '10': 1,
    '11': 8,
    '12': 3,
    '13': 2,
    '14': 5,
    '15': 4,
    '16': 7,
    '17': 6,
    '18': 9,
    '2': 12,
    '3': 11,
    '4': 14,
    '5': 13,
    '6': 16,
    '7': 15,
    '8': 18,
    '9': 17}
xDev shape: nObs = 33741, nWords = 1000
yDev shape: nObs = 33741, nClasses = 19
xTrain shape: nObs = 78816, nWords = 1000
yTrain shape: nObs = 78816, nClasses = 19
Embeddings shape: nWords = 10008, wordVec length = 300
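
The batch count printed above comes from integer division, so any examples beyond the last full batch are not counted. The sketch below redoes that arithmetic with the numbers from this run; `batches_per_epoch` is a hypothetical helper, and whether `trainModel` also processes the leftover partial batch is not shown in this notebook.

```python
def batches_per_epoch(n_examples, batch_size):
    """Return (number of full batches, number of leftover examples)."""
    return divmod(n_examples, batch_size)


# 78816 training notes and a batch size of 256, as reported above.
full, leftover = batches_per_epoch(78816, 256)
print(full)
print(leftover)
```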


The following cell initializes the dictionary of hyperparameters that fully describes the model and saves it to `output_path`.

In [6]:
hyperParamDict = {'EMBED_SIZE': EMBED_SIZE,
                  'maxNoteLength': maxAllowedNoteLength,
                  'maxGradNorm': max_grad_norm,
                  'outputKeepProb': output_keep_prob,
                  'inputKeepProb': input_keep_prob,
                  'learningRate': learning_rate,
                  'trainingEpochsMax': training_epochs,
                  'batchSize': batch_size,
                  'n_hidden': n_hidden,
                  'numLayers': numLayers,
                  'sizeList': sizeList}
pp.pprint(hyperParamDict)
with open(os.path.join(output_path, 'hyperParamDict.pickle'), 'wb') as handle:
    pickle.dump(hyperParamDict, handle, protocol = 2)
    #dumping with 2 because ALTUD uses python 2.7 right now.

{   'EMBED_SIZE': 300,
    'batchSize': 256,
    'inputKeepProb': 1,
    'learningRate': 0.001,
    'maxGradNorm': 5,
    'maxNoteLength': 1000,
    'n_hidden': 200,
    'numLayers': 1,
    'outputKeepProb': 0.5,
    'sizeList': [200, 150, 75, 19],
    'trainingEpochsMax': 100}
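
The final `sizeList` determines the shapes of the dense layers stacked on top of the LSTM output: each consecutive pair of sizes gives one weight matrix, with a bias of the output size. The following sketch derives those shapes from the list; `layer_shapes` is a hypothetical helper, and the parameter count assumes one weight matrix and one bias vector per layer.

```python
def layer_shapes(size_list):
    """Pair consecutive sizes into (inputs, outputs) weight shapes."""
    return list(zip(size_list[:-1], size_list[1:]))


size_list = [200, 150, 75, 19]  # n_hidden, two hidden layers, number of labels
shapes = layer_shapes(size_list)

# Weights plus biases for each dense layer.
n_params = sum(n_in * n_out + n_out for n_in, n_out in shapes)

print(shapes)
print(n_params)
```

The three weight shapes match the `(200, 150)`, `(150, 75)`, and `(75, 19)` shapes printed in the training output.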


## Training

Finally, the model is trained (be warned: this takes some time; on an Amazon Deep Learning AMI, training took around an hour)...

In [7]:
from trainModel import trainModel
xDev[xDev == -1] = 0
xTrain[xTrain == -1] = 0
trainModel(helperObj = helper, embeddings = embeddings, hyperParamDict = hyperParamDict, 
          xDev = xDev, xTrain = xTrain, yDev = yDev, yTrain = yTrain, 
           lastTrueWordIdx_dev = lastTrueWordIdx_dev, 
           lastTrueWordIdx_train = lastTrueWordIdx_train,
           training_epochs = training_epochs, 
           output_path = output_path, batchSizeTrain = batch_size,
           sizeList = sizeList,
           maxIncreasingLossCount = 100, batchSizeDev = 1500, chatty = True)

shape of embeddings
(?, 1000, 300)
<class 'tensorflow.python.ops.rnn_cell_impl.MultiRNNCell'>
<tensorflow.python.ops.rnn_cell_impl.MultiRNNCell object at 0x7f359cbb97d0>
<class 'tensorflow.python.framework.ops.Tensor'>
(?, 1000, 300)
cell output size
200
cell state size
(LSTMStateTuple(c=200, h=200),)
output shape
(?, 1000, 200)
offset shape
(?, 1)
output shape new shape
(?, 200)
flattened indices shape
(?, 1)
output flattened shape
(?, 200)
(200, 150)
W_1 shape
(150,)
b_1 shape
(150, 75)
W_2 shape
(75,)
bias shape
(19,)
U shape
(75, 19)
bias shape
(19,)
output wx + b
(?, 19)
(?, 19)


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


***************************
***************************
Running on epoch 0
***************************
***************************
running iteration 0 with loss 0.671181 at time 2.124994
running iteration 25 with loss 0.283091 at time 33.695199
running iteration 50 with loss 0.249370 at time 65.307023
running iteration 75 with loss 0.224971 at time 96.903477
running iteration 100 with loss 0.247238 at time 128.519557
running iteration 125 with loss 0.244419 at time 160.152017
running iteration 150 with loss 0.256121 at time 191.788166
running iteration 175 with loss 0.244464 at time 223.416894
running iteration 200 with loss 0.241822 at time 255.068012
running iteration 225 with loss 0.234088 at time 286.710529
running iteration 250 with loss 0.245816 at time 318.351123
running iteration 275 with loss 0.248200 at time 349.999108
running iteration 300 with loss 0.275373 at time 381.648924
average training loss 0.260528
test loss 0.227216
Total run time was 429.598122
New best model found

[... output truncated ...]

running iteration 0 with loss 0.079211 at time 1.266474
running iteration 25 with loss 0.074628 at time 32.916376
running iteration 50 with loss 0.066893 at time 64.576158
running iteration 75 with loss 0.051411 at time 96.230649
running iteration 100 with loss 0.068470 at time 127.888229
running iteration 125 with loss 0.077177 at time 159.549488
running iteration 150 with loss 0.075836 at time 191.198557
running iteration 175 with loss 0.070868 at time 222.852068
running iteration 200 with loss 0.073820 at time 254.504941
running iteration 225 with loss 0.061037 at time 286.156615
running iteration 250 with loss 0.071956 at time 317.802695
running iteration 275 with loss 0.095531 at time 349.454704
running iteration 300 with loss 0.087856 at time 381.116599
average training loss 0.076485
test loss 0.097558
Total run time was 429.098515
validation Loss Increase
***************************
***************************
Running on epoch 9
***************************
**********************

...and the session is closed (see the final cell below). You should find your results in the `output_path` directory you specified earlier.

To evaluate the results and generate plots, see `predictionEvaluation.ipynb` in the same repository.

In [8]:
if 'session' in locals() and session is not None:
    print('Close interactive session')
    session.close()
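
The training log reports either a new best model or a validation loss increase after each epoch, and `trainModel` takes a `maxIncreasingLossCount` argument. A common way to use such a counter is patience-based early stopping, sketched below. This is an assumption about the general technique, not a description of what `trainModel` does: `stopping_epoch` is a hypothetical function, it counts consecutive epochs without improvement, and the loss values are made-up toy numbers.

```python
def stopping_epoch(val_losses, max_increasing_loss_count):
    """Return the epoch at which training would stop, or None if it never does.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `max_increasing_loss_count` epochs in a row.
    """
    best = float('inf')
    worse_count = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss          # new best model
            worse_count = 0
        else:
            worse_count += 1     # validation loss did not improve
            if worse_count >= max_increasing_loss_count:
                return epoch
    return None


toy_losses = [0.5, 0.4, 0.45, 0.46, 0.3]  # illustrative values only
print(stopping_epoch(toy_losses, 2))
print(stopping_epoch(toy_losses, 100))
```

With a limit as large as the 100 used in the training call above and at most 100 epochs, a rule of this kind would never trigger, so training would run for the full number of epochs.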