# Phenotyping pattern classification
Yizhen Zhong
03282018

### Background

Phenotyping algorithms are designed to enable robust selection of patients who meet specific research criteria from an Electronic Health Record (EHR) system. Developing a phenotyping algorithm is time- and resource-intensive: it requires manual review by experts and multiple rounds of validation. Automating this development with machine learning techniques is therefore of great interest. Recently, “phenotype design patterns” were proposed to capture the common, recurring concepts in phenotyping algorithms. Understanding how phenotyping algorithms are constructed from design patterns may make it possible to automate the algorithm development process. Here, I explore the use of a Convolutional Neural Network (CNN) to classify short sentence segments from phenotyping algorithms into design patterns.


I obtained 160 sentences extracted from phenotyping algorithms. Of these, 53 were classified as “Rule of N” and 107 as other concepts. I first split the data into training and testing sets, then train a Convolutional Neural Network on the training set and evaluate the best model on the testing set for its ability to classify “Rule of N” sentences.
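The split described above can be sketched in plain Python. This is a minimal illustration, not the procedure used to produce the data files: the function name `stratified_split`, the 25% test fraction, the choice to stratify by label, and the placeholder sentences are all assumptions, and the seed value only echoes the `seed100` in the file names.

```python
import random

def stratified_split(sentences, labels, test_fraction=0.25, seed=100):
    """Split (sentence, label) pairs so each label keeps a similar share in both parts.

    Hypothetical helper: the test fraction and the stratification are assumptions.
    """
    rng = random.Random(seed)
    # Group sentences by label so each class is split separately
    by_label = {}
    for sentence, label in zip(sentences, labels):
        by_label.setdefault(label, []).append(sentence)

    train, test = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        n_test = int(round(test_fraction * len(group)))
        test += [(s, label) for s in group[:n_test]]
        train += [(s, label) for s in group[n_test:]]

    # Shuffle again so the classes are interleaved
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Placeholder data with the class sizes reported above: 53 "Rule of N", 107 other
sentences = ["sentence %d" % i for i in range(160)]
labels = ["Rule of N"] * 53 + ["other"] * 107
train, test = stratified_split(sentences, labels)
print(len(train), len(test))
```

Splitting each class separately keeps the minority class represented in the test set, which matters when only 53 positive examples are available.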


### Model architecture

The code is modified from <https://github.com/dennybritz/cnn-text-classification-tf>.

* <span style="color:#a50e3e;">Number of classes: 2</span>
* <span style="color:#a50e3e;">Word embedding dimension: [100, 200, 300, 400, </span>
* <span style="color:#a50e3e;">Filter size: [3,4,5]</span>
* <span style="color:#a50e3e;">Filter number: 50</span>
* <span style="color:#a50e3e;">Dropout rate: 0.5</span>
* <span style="color:#a50e3e;">Batch size: 15</span>
* <span style="color:#a50e3e;">Epoch number: 200</span>
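To make the role of the filter sizes and filter count concrete, here is a minimal NumPy sketch of the feature-extraction part of a text CNN of this kind: convolution over word windows, ReLU, and max-over-time pooling. It is not the `TextCNN` class used below. The function name `text_cnn_features`, the random weights, and the 20-token, 100-dimensional input are assumptions for illustration only.

```python
import numpy as np

def text_cnn_features(embedded, filter_sizes=(3, 4, 5), num_filters=50, seed=0):
    """Forward pass of the convolution + max-pooling stage for one sentence.

    embedded: array of shape (sequence_length, embedding_dim).
    Weights are random here; in a trained model they would be learned.
    """
    rng = np.random.default_rng(seed)
    seq_len, dim = embedded.shape
    pooled = []
    for size in filter_sizes:
        # One weight tensor per filter size: (window, embedding_dim, num_filters)
        W = rng.normal(scale=0.1, size=(size, dim, num_filters))
        b = np.zeros(num_filters)
        # Every window of `size` consecutive words (no padding)
        windows = np.stack([embedded[i:i + size] for i in range(seq_len - size + 1)])
        # Contract the window and embedding axes -> (positions, num_filters)
        conv = np.tensordot(windows, W, axes=([1, 2], [0, 1])) + b
        relu = np.maximum(conv, 0.0)
        # Max over time: keep the strongest response of each filter
        pooled.append(relu.max(axis=0))
    # Concatenate the pooled outputs of all filter sizes
    return np.concatenate(pooled)

sentence = np.random.default_rng(1).normal(size=(20, 100))  # 20 tokens, 100-dim vectors
features = text_cnn_features(sentence)
print(features.shape)  # (150,)
```

With three filter sizes and 50 filters each, every sentence is reduced to a 150-dimensional feature vector regardless of its length. Dropout and a softmax layer over the two classes would follow in the full model.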

In [2]:
import tensorflow as tf
import numpy as np
import os
import time
import datetime
import _pickle as cPickle
from text_cnn import TextCNN
from tensorflow.contrib import learn
import data_helpers
import matplotlib.pyplot as plt
%matplotlib inline
tf.logging.set_verbosity(tf.logging.ERROR)

In [169]:
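def count_f1(count_site):
    """Return (macro F1, micro F1) from a list of per-class [TP, FP, TN, FN] counts."""
    TP = 0
    FP = 0
    FN = 0
    pre = []
    recal = []
    for lis in count_site:
        # Pool the counts across classes for the micro average
        TP = TP + lis[0]
        FP = FP + lis[1]
        FN = FN + lis[3]
        # Per-class precision and recall for the macro average (0 when undefined)
        if lis[0] + lis[1] == 0:
            pre.append(0)
        else:
            pre.append(lis[0]/float((lis[0]+lis[1])))
        if lis[0] + lis[3] == 0:
            recal.append(0)
        else:
            recal.append(lis[0]/float((lis[0]+lis[3])))

    # Micro average: precision and recall of the pooled counts
    micro_pres = float(TP)/(TP + FP) if TP + FP else 0
    micro_recall = float(TP)/(TP + FN) if TP + FN else 0
    if micro_pres + micro_recall == 0:
        f1 = 0
    else:
        f1 = 2*micro_pres*micro_recall/(micro_pres + micro_recall)

    # Macro average: harmonic mean of the mean per-class precision and recall
    macro_pre = sum(pre)/len(count_site)
    macro_recall = sum(recal)/len(count_site)
    if macro_pre + macro_recall == 0:
        macro_f1 = 0
    else:
        macro_f1 = 2*macro_pre*macro_recall/(macro_pre + macro_recall)
    return macro_f1,f1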

def perf_measure(y_actual, y_hat):
    """Return ([TP, FP, TN, FN], F1) for binary labels, with 1 as the positive class."""
    TP = 0
    FP = 0
    TN = 0
    FN = 0

    # Single pass over the paired true and predicted labels
    for actual, predicted in zip(y_actual, y_hat):
        if actual == 1 and predicted == 1:
            TP += 1
        elif actual == 0 and predicted == 1:
            FP += 1
        elif actual == 0 and predicted == 0:
            TN += 1
        elif actual == 1 and predicted == 0:
            FN += 1

    if TP == 0:
        # Without true positives, precision and recall are 0, so F1 is 0
        f1 = 0
    else:
        precision = TP / (TP + FP)
        recall = TP / (TP + FN)
        f1 = 2 * precision * recall / (precision + recall)

    return [TP,FP,TN,FN], f1
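As a small worked example of the F1 computation in `perf_measure`, the sketch below counts the confusion-matrix cells for a toy set of labels and derives precision, recall, and F1 from them. The helper name `f1_from_counts` and the label vectors are made up for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from confusion counts, returning 0.0 when precision or recall is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy labels: 1 marks the positive class, 0 everything else
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 3
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 1

print(tp, fp, fn)                  # 3 1 1
print(f1_from_counts(tp, fp, fn))  # 0.75
```

Here precision and recall are both 3/4, so their harmonic mean is also 0.75. True negatives do not enter the F1 score at all, which is why it is more informative than accuracy when most sentences belong to the "other" class.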

                        

### Data processing

In [172]:
# Choose an embedding size (here the 600-dimensional vectors)
word2vec = "embedding_mimic3_pp600_0829.pik"
with open(word2vec,"rb") as fh:
    f = cPickle.load(fh,encoding='bytes')
em,vocab_out,vocab_in = f[0],f[1],f[2]
count_all = [] 
f1_all = []
em_dim = 600

In [173]:
name = ["Confirm Disease Was Checked","Rule of N","Use Distinct Dates","Credentials of the Actor","Where Did It Happen?","Check For Negation"]

print("Loading data...")

#label_class = name[i]

Loading data...


In [174]:
# Loop over each design-pattern class and train one binary classifier per class
for i in range(len(name)):
    label_class = name[i]

    print("processing: ", label_class)
    label = "_".join(label_class.split())
    x_text, y = data_helpers.load_data_and_labels("./data/" + label + "_seed100replace_0403_train.txt", 
                                                  "./data/" + label + "_seed100replace_0403_train_rest.txt")
    x_test, y_test = data_helpers.load_data_and_labels("./data/" + label + "_seed100replace_0403_test.txt",
                                                      "./data/" + label + "_seed100replace_0403_test_rest.txt")


    max_document_length = max([len(x.split(" ")) for x in x_text])
    vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
    x = np.array(list(vocab_processor.fit_transform(x_text)))
    x_test = np.array(list(vocab_processor.transform(x_test)))

    dev_sample_percentage = 0.2
    dev_sample_index = -1 * int(dev_sample_percentage * float(len(y)))
    x_train, x_dev = x[:dev_sample_index], x[dev_sample_index:]
    y_train, y_dev = y[:dev_sample_index], y[dev_sample_index:]


    print("Vocabulary Size: {:d}".format(len(vocab_processor.vocabulary_)))
    print("Train/Test split: {:d}/{:d}".format(len(y), len(y_test)))
    print("Train/Val split: {:d}/{:d}".format(len(y_train), len(y_dev)))

    with tf.Graph().as_default():
        session_conf = tf.ConfigProto(
          allow_soft_placement=True,
          log_device_placement=False)
        sess = tf.Session(config=session_conf)
        with sess.as_default():
            cnn = TextCNN(
                sequence_length=x_train.shape[1],
                num_classes=y_train.shape[1],
                vocab_size=len(vocab_processor.vocabulary_),
                embedding_size=em_dim,
                filter_sizes=list(map(int, "3,4,5".split(","))),
                num_filters=50,
                l2_reg_lambda=0)
            global_step = tf.Variable(0, name="global_step", trainable=False)
            optimizer = tf.train.AdamOptimizer(1e-3)
            grads_and_vars = optimizer.compute_gradients(cnn.loss)
            train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

            # Keep track of gradient values and sparsity (optional)
            grad_summaries = []
            for g, v in grads_and_vars:
                if g is not None:
                    grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g)
                    sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))
                    grad_summaries.append(grad_hist_summary)
                    grad_summaries.append(sparsity_summary)
            grad_summaries_merged = tf.summary.merge(grad_summaries)


             # Output directory for models and summaries
            timestamp = str(int(time.time()))
            out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
            print("Writing to {}\n".format(out_dir))

            # Summaries for loss and accuracy
            loss_summary = tf.summary.scalar("loss", cnn.loss)
            acc_summary = tf.summary.scalar("accuracy", cnn.accuracy)


            # Train Summaries
            train_summary_op = tf.summary.merge([loss_summary, acc_summary, grad_summaries_merged])
            train_summary_dir = os.path.join(out_dir, "summaries", "train")
            train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)


            # Dev summaries
            dev_summary_op = tf.summary.merge([loss_summary, acc_summary])
            dev_summary_dir = os.path.join(out_dir, "summaries", "dev")
            dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)


            # Checkpoint directory. Tensorflow assumes this directory already exists so we need to create it
            checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
            checkpoint_prefix = os.path.join(checkpoint_dir, "model")
            if not os.path.exists(checkpoint_dir):
                os.makedirs(checkpoint_dir)
            saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)


            # Write vocabulary
            vocab_processor.save(os.path.join(out_dir, "vocab"))

            # Initialize all variables
            sess.run(tf.global_variables_initializer())

            # Random initialisation in [-0.25, 0.25] for words without a pre-trained vector
            initW = np.random.uniform(-0.25,0.25,(len(vocab_processor.vocabulary_),em_dim))

            for word in vocab_in:
                idx = vocab_processor.vocabulary_.get(word.decode("utf-8"))
                # Index 0 is the out-of-vocabulary id, so skip words missing from this vocabulary
                if idx != 0:
                    initW[idx] = em[vocab_in[word],]


            sess.run(cnn.W.assign(initW))
            print("Using pre-trained word2vec")
            #print(sess.run(cnn.W[6,]))
            def train_step(x_batch, y_batch):
                """
                A single training step
                """
                feed_dict = {
                  cnn.input_x: x_batch,
                  cnn.input_y: y_batch,
                  cnn.dropout_keep_prob: 0.5
                }
                _, step, summaries, loss, accuracy = sess.run(
                    [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy],
                    feed_dict)
                time_str = datetime.datetime.now().isoformat()
                #print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
                train_summary_writer.add_summary(summaries, step)
                return [time_str, step, loss, accuracy]
             # Generate batches

            def dev_step(x_batch, y_batch, writer=None):
                """
                Evaluates model on a dev set
                """
                feed_dict = {
                  cnn.input_x: x_batch,
                  cnn.input_y: y_batch,
                  cnn.dropout_keep_prob: 1.0
                }
                step, summaries, loss, accuracy = sess.run(
                    [global_step, dev_summary_op, cnn.loss, cnn.accuracy],
                    feed_dict)
                time_str = datetime.datetime.now().isoformat()
                #print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
                if writer:
                    writer.add_summary(summaries, step)
                return [time_str, step, loss, accuracy]    

            batches = data_helpers.batch_iter(list(zip(x, y)), 10, 200)   
            # Training loop. For each batch...
            #best_perf = [0,0,0,0]
            loss_history = []
            for batch in batches:
                x_batch, y_batch = zip(*batch)
                train_step(x_batch, y_batch)
                current_step = tf.train.global_step(sess, global_step)
                if current_step % 100 == 0:
                    print("\nEvaluation:")
                    dev_step(x_dev, y_dev, writer=dev_summary_writer)
                    print("")
                if current_step % 100 == 0:
                    path = saver.save(sess, checkpoint_prefix, global_step=current_step)
                    print("Saved model checkpoint to {}\n".format(path))


            #test = dev_step(x_test_transform, y_test,writer=dev_summary_writer)
            #print("{}: step {}, loss {:g}, acc {:g}".format(test[0],test[1],test[2],test[3]))


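    # Relative path to this run's checkpoints (avoids a hard-coded prefix length)
    checkpoint_dir = os.path.relpath(os.path.join(out_dir, "checkpoints"))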
    print(checkpoint_dir)

    print("\nEvaluating...\n")
    # Evaluation
    # ==================================================
    print()
    checkpoint_file = tf.train.latest_checkpoint(checkpoint_dir)
    graph = tf.Graph()
    with graph.as_default():
        session_conf = tf.ConfigProto(
          allow_soft_placement=True,
          log_device_placement=True)
        sess = tf.Session(config=session_conf)
        with sess.as_default():
            # Load the saved meta graph and restore variables
            saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
            saver.restore(sess, checkpoint_file)

            # Get the placeholders from the graph by name
            input_x = graph.get_operation_by_name("input_x").outputs[0]
            # input_y = graph.get_operation_by_name("input_y").outputs[0]
            dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0]

            # Tensors we want to evaluate
            predictions = graph.get_operation_by_name("output/predictions").outputs[0]

            # Generate batches for one epoch
            batches = data_helpers.batch_iter(list(x_test), 20, 1, shuffle=False)

            # Collect the predictions here
            all_predictions = []

            for x_test_batch in batches:
                batch_predictions = sess.run(predictions, {input_x: x_test_batch, dropout_keep_prob: 1.0})
                all_predictions = np.concatenate([all_predictions, batch_predictions])

    print(name[i])
    # Convert the one-hot test labels to class indices before comparing with the predictions
    y_test_label = np.argmax(y_test, axis=1)
    print("Total number of test examples: {}".format(len(y_test_label)))
    count, f1 = perf_measure(y_test_label, all_predictions)
    print("F1: {:g}".format(f1))
    print(count)
    print(count_all)

    count_all.append(count)
    f1_all.append(f1)
    print(f1_all)

processing:  Confirm Disease Was Checked
Vocabulary Size: 414
Train/Test split: 91/40
Train/Val split: 73/18
Writing to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522877611

Using pre-trained word2vec

Evaluation:

Saved model checkpoint to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522877611/checkpoints/model-100


Evaluation:

Saved model checkpoint to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522877611/checkpoints/model-200


Evaluation:

Saved model checkpoint to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522877611/checkpoints/model-300


Evaluation:

Saved model checkpoint to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522877611/checkpoints/model-400


Evaluation:

Saved model checkpoint to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522877611/checkpoints/model-500


Evaluation:

Saved model checkpoint to /User


Evaluation:

Saved model checkpoint to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522877966/checkpoints/model-1100


Evaluation:

Saved model checkpoint to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522877966/checkpoints/model-1200


Evaluation:

Saved model checkpoint to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522877966/checkpoints/model-1300


Evaluation:

Saved model checkpoint to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522877966/checkpoints/model-1400


Evaluation:

Saved model checkpoint to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522877966/checkpoints/model-1500


Evaluation:

Saved model checkpoint to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522877966/checkpoints/model-1600


Evaluation:

Saved model checkpoint to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522877966/ch

Where Did It Happen?
Total number of test examples: 40
F1: 0.444444
[2, 0, 33, 5]
[[1, 0, 34, 5], [5, 0, 24, 11], [4, 0, 30, 6], [4, 1, 35, 0]]
[0.2857142857142857, 0.47619047619047616, 0.5714285714285715, 0.888888888888889, 0.4444444444444445]
processing:  Check For Negation
Vocabulary Size: 414
Train/Test split: 91/40
Train/Val split: 73/18
Writing to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522878489

Using pre-trained word2vec

Evaluation:

Saved model checkpoint to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522878489/checkpoints/model-100


Evaluation:

Saved model checkpoint to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522878489/checkpoints/model-200


Evaluation:

Saved model checkpoint to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPattern/venv/runs/1522878489/checkpoints/model-300


Evaluation:

Saved model checkpoint to /Users/zhongyi/Box Sync/phenotype_patterns_data/PhenoPatter

#### <span style="color:#a50e3e;">Model </span> Use pre-trained word vector representation

We trained embedding vectors for all words in these sentences on the MIMIC-III corpus.
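The idea of seeding the embedding layer with pre-trained vectors can be sketched with NumPy alone. This is an illustrative sketch rather than the notebook's loading code: `build_embedding_matrix`, the toy vocabulary, and the 4-dimensional vectors are hypothetical, and the real vectors come from the pickled MIMIC-III embedding file.

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Build an embedding matrix with one row per vocabulary index.

    vocab: dict mapping word -> row index.
    pretrained: dict mapping word -> vector of length `dim`.
    Words without a pre-trained vector keep a small random initialisation.
    """
    rng = np.random.default_rng(seed)
    matrix = rng.uniform(-0.25, 0.25, size=(len(vocab), dim))
    hits = 0
    for word, idx in vocab.items():
        vec = pretrained.get(word)
        if vec is not None:
            matrix[idx] = vec  # overwrite the random row with the pre-trained vector
            hits += 1
    return matrix, hits

# Toy vocabulary and vectors, for illustration only
vocab = {"<UNK>": 0, "at": 1, "least": 2, "two": 3, "diagnoses": 4}
pretrained = {"two": np.ones(4), "diagnoses": np.full(4, 2.0)}
matrix, hits = build_embedding_matrix(vocab, pretrained, dim=4)
print(matrix.shape, hits)  # (5, 4) 2
```

The resulting matrix is what gets assigned to the model's embedding variable before training, so the network starts from word representations learned on a much larger clinical corpus than the 160 sentences.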

In [177]:
print("all TP, FP, TN, FN: " ,count_all)
print("all f1: ", f1_all)
print("macro f1: ", count_f1(count_all)[0])
print("micro f1: ", count_f1(count_all)[1])

#count_f1(count_all)

all TP, FP, TN, FN:  [[1, 0, 34, 5], [5, 0, 24, 11], [4, 0, 30, 6], [4, 1, 35, 0], [2, 0, 33, 5], [3, 1, 33, 3]]
all f1:  [0.2857142857142857, 0.47619047619047616, 0.5714285714285715, 0.888888888888889, 0.4444444444444445, 0.6]
macro f1:  0.6001340482573727
micro f1:  0.5428571428571428
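The two aggregate scores can be reproduced from the per-class counts printed above. The sketch below copies those counts and recomputes both averages. The helper name `harmonic_mean` is illustrative, and the final line shows a different but common macro definition (the mean of the per-class F1 scores) for comparison only.

```python
# Per-class [TP, FP, TN, FN] counts copied from the output above
counts = [[1, 0, 34, 5], [5, 0, 24, 11], [4, 0, 30, 6],
          [4, 1, 35, 0], [2, 0, 33, 5], [3, 1, 33, 3]]

def harmonic_mean(p, r):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Micro average: pool the counts over all classes first
tp = sum(c[0] for c in counts)
fp = sum(c[1] for c in counts)
fn = sum(c[3] for c in counts)
micro_f1 = harmonic_mean(tp / (tp + fp), tp / (tp + fn))

# Macro average as computed by count_f1: average precision and recall, then combine
precisions = [c[0] / (c[0] + c[1]) if c[0] + c[1] else 0.0 for c in counts]
recalls = [c[0] / (c[0] + c[3]) if c[0] + c[3] else 0.0 for c in counts]
macro_f1 = harmonic_mean(sum(precisions) / len(counts), sum(recalls) / len(counts))

# Alternative macro definition: mean of the per-class F1 scores
mean_f1 = sum(harmonic_mean(p, r) for p, r in zip(precisions, recalls)) / len(counts)

print(round(micro_f1, 4), round(macro_f1, 4), round(mean_f1, 4))  # 0.5429 0.6001 0.5444
```

The pooled counts are 19 true positives, 2 false positives, and 30 false negatives, so precision is high (19/21) while recall is low (19/49). The micro score weights every test sentence equally, whereas the macro score weights every class equally.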
