<h1> HIMSS Demo - HealtheDataLab </h1>

<h2> Structured Machine Learning using TensorFlow </h2>
<hr />
This notebook demonstrates how to train, evaluate, and deploy a machine learning (ML) model to CloudML. It uses a pre-built model to predict Length of Stay in emergency department (ED) and inpatient care settings.
<h3>
<br />
<ol>
<li> Access, Analyze & Visualize Data using HealtheDataLab </li> <br />
<li> Label generation - Generate Labels in TFRecord format </li> <br />
<li> Generate TFSequenceExamples </li> <br />
<li> Train and Evaluate Machine Learning Model </li> <br />
<li> Deploy ML Model to CloudML </li>
</ol></h3>
<hr />

<h2> 1. Access, Analyze & Visualize Data using HealtheDataLab </h2>
<ul>
    <li>Import FHIR bundles (Patient's longitudinal records) into Spark Dataframes</li>
    <li>Extract patient records into Spark Dataframes</li>
    <li>Query and visualize patient records using Spark SQL </li>
</ul>

In [1]:
from pyspark.sql import SparkSession
from bunsen.stu3.bundles import load_from_directory, extract_entry
from demo_utils import age

# Enable Hive support for our session so we can save resources as Hive tables
spark = SparkSession.builder \
                    .config('hive.exec.dynamic.partition.mode', 'nonstrict') \
                    .enableHiveSupport() \
                    .getOrCreate()

# Load and cache the bundles so we don't reload them every time.
bundles = load_from_directory(spark, 'gs://cluster-data/demo/data/synthea/fhir/').cache()

# Extract patients from bundles
patients = extract_entry(spark, bundles, 'patient')

pats = patients.select('id','gender', 'birthDate', 'address.city', 'address.state', 'address.country') 

#pats['birthDate'] = pats['birthDate'].apply(age)
patsDF = pats.limit(10).toPandas()
patsDF['age'] = patsDF['birthDate'].apply(age)
display(patsDF)

Unnamed: 0,id,gender,birthDate,city,state,country,age
0,urn:uuid:c127185e-9f14-462a-9817-c90963fb7354,male,1980-11-07,[Pittsfield],[Massachusetts],[US],38
1,urn:uuid:345efce8-d11d-429d-9984-6b67e31a7269,male,1952-06-04,[Harwich],[Massachusetts],[US],66
2,urn:uuid:44810270-bafe-42a4-8fc8-c229368b0058,male,1966-02-17,[Hubbardston],[Massachusetts],[US],52
3,urn:uuid:d6be5e17-7733-4096-b3a7-32c2a80582af,female,2018-12-29,[Worcester],[Massachusetts],[US],0
4,urn:uuid:5c6ad3ff-99b1-47b3-92c1-a37d82a5a559,male,1961-03-13,[Methuen Town],[Massachusetts],[US],57
5,urn:uuid:e3952c11-3fa2-4492-899c-bbbb8c7b6db0,male,1956-07-01,[Wareham],[Massachusetts],[US],62
6,urn:uuid:665b7d87-1e8a-46f5-a2fb-6e8200f6662e,male,1952-08-05,[Hudson],[Massachusetts],[US],66
7,urn:uuid:08e56bf9-7034-4b6e-8345-c61a0d910c6e,female,1963-03-29,[Brockton],[Massachusetts],[US],55
8,urn:uuid:e272d8a3-73c9-4887-a457-f0d1d7cc1e44,female,2003-11-19,[Weymouth Town],[Massachusetts],[US],15
9,urn:uuid:1d9e528b-18b4-4cfa-bfd4-d2eb85e9ce1b,female,1984-11-14,[Lowell],[Massachusetts],[US],34
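
The `age` helper imported from `demo_utils` is not shown in this notebook. A minimal sketch of what such a helper might do, assuming `birthDate` is an ISO `YYYY-MM-DD` string and that age is counted in completed years as of a reference date. The name `age_in_years` and the `as_of` parameter are hypothetical.

```python
from datetime import date


def age_in_years(birth_date, as_of=None):
    """Return completed years between an ISO birth date string and as_of."""
    born = date.fromisoformat(birth_date)
    today = as_of or date.today()
    # Subtract one year if this year's birthday has not happened yet.
    before_birthday = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - int(before_birthday)


# Reference date chosen for illustration only (an assumption).
print(age_in_years('1980-11-07', as_of=date(2019, 1, 23)))  # 38
```

A function of this shape could be passed to `Series.apply`, as the cell above does with `age`.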


<ul>
    <li>Extract Patient Encounters into Spark Dataframes</li>
    <li>Query and visualize Encounter records using Spark SQL </li>
    <li>Compute Length of Stay from Encounter start and end dates. </li>
    <li>We will use Length of Stay and other features from Patient, Observation and other records to train our linear model.</li>
    <li>Our linear model is a classifier that predicts the label: "Length of Stay" range</li>
</ul>

In [2]:
from pyspark.sql.functions import col
from demo_utils import los

# Extract encounters from bundles
encounters = extract_entry(spark, bundles, 'encounter') 

encs=encounters.select('subject.reference', 
                  'class.code', 
                  'period.start', 
                  'period.end') \
          .where(col('class.code').isin("inpatient", "emergency"))


encsDF = encs.limit(10).toPandas()
encsDF['los'] = encsDF.apply(los, axis=1)
display(encsDF)

Unnamed: 0,reference,code,start,end,los
0,urn:uuid:c127185e-9f14-462a-9817-c90963fb7354,inpatient,1994-12-11T11:05:54-08:00,1994-12-12T11:20:54-08:00,"1 day, 0:15:00"
1,urn:uuid:c127185e-9f14-462a-9817-c90963fb7354,inpatient,1995-04-06T12:05:54-07:00,1995-04-07T12:20:54-07:00,"1 day, 0:15:00"
2,urn:uuid:c127185e-9f14-462a-9817-c90963fb7354,inpatient,1995-06-19T12:05:54-07:00,1995-06-20T12:05:54-07:00,"1 day, 0:00:00"
3,urn:uuid:c127185e-9f14-462a-9817-c90963fb7354,inpatient,1995-08-25T12:05:54-07:00,1995-08-26T12:05:54-07:00,"1 day, 0:00:00"
4,urn:uuid:c127185e-9f14-462a-9817-c90963fb7354,inpatient,1995-11-28T11:05:54-08:00,1995-11-29T11:05:54-08:00,"1 day, 0:00:00"
5,urn:uuid:c127185e-9f14-462a-9817-c90963fb7354,inpatient,1996-01-18T11:05:54-08:00,1996-01-19T11:05:54-08:00,"1 day, 0:00:00"
6,urn:uuid:c127185e-9f14-462a-9817-c90963fb7354,inpatient,1996-03-02T11:05:54-08:00,1996-03-03T11:20:54-08:00,"1 day, 0:15:00"
7,urn:uuid:c127185e-9f14-462a-9817-c90963fb7354,inpatient,1996-04-22T12:05:54-07:00,1996-04-23T12:20:54-07:00,"1 day, 0:15:00"
8,urn:uuid:c127185e-9f14-462a-9817-c90963fb7354,inpatient,1996-10-01T12:05:54-07:00,1996-10-02T12:05:54-07:00,"1 day, 0:00:00"
9,urn:uuid:c127185e-9f14-462a-9817-c90963fb7354,inpatient,1996-12-17T11:05:54-08:00,1996-12-18T11:20:54-08:00,"1 day, 0:15:00"
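
The `los` helper imported from `demo_utils` is not shown in this notebook. A minimal sketch of a row-wise length-of-stay computation, assuming `start` and `end` are ISO-8601 strings with UTC offsets as in the table above. The name `length_of_stay` is hypothetical.

```python
from datetime import datetime


def length_of_stay(row):
    """Return end - start as a timedelta for a mapping with ISO-8601 strings."""
    start = datetime.fromisoformat(row['start'])
    end = datetime.fromisoformat(row['end'])
    return end - start


row = {'start': '1994-12-11T11:05:54-08:00',
       'end': '1994-12-12T11:20:54-08:00'}
print(length_of_stay(row))  # 1 day, 0:15:00
```

Because the function only indexes `row` by column name, it would also accept a pandas row when used with `DataFrame.apply(..., axis=1)`.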


<h2> 2. Label generation - Generate Labels in TFRecord format </h2>
<ul>
    <li>The next few cells generate labels from bundles in TFRecord format</li>
    <li>Bundles in TFRecord format have already been generated from Synthetic FHIR data</li>
    <li>Bundles will be used as inputs and are stored in Google Cloud Storage</li>
    <li>Output labels will also be stored in Google Cloud Storage </li>
</ul>

In [5]:
input_bundles = 'gs://cluster-data/demo/data/bundles/bundles*'
labels_path = 'gs://cluster-data/demo/data/output/labels'

Let's examine the GCS bucket that holds the bundles in TFRecord format

In [6]:
%bash
gsutil ls -l gs://cluster-data/demo/data/bundles/bundles*

  32501370  2019-01-08T23:10:08Z  gs://cluster-data/demo/data/bundles/bundles-00000-of-00010.tfrecords
  39740597  2019-01-08T23:10:08Z  gs://cluster-data/demo/data/bundles/bundles-00001-of-00010.tfrecords
  32894855  2019-01-08T23:10:09Z  gs://cluster-data/demo/data/bundles/bundles-00002-of-00010.tfrecords
  30817812  2019-01-08T23:10:10Z  gs://cluster-data/demo/data/bundles/bundles-00003-of-00010.tfrecords
  33319395  2019-01-08T23:10:11Z  gs://cluster-data/demo/data/bundles/bundles-00004-of-00010.tfrecords
  48719477  2019-01-08T23:10:11Z  gs://cluster-data/demo/data/bundles/bundles-00005-of-00010.tfrecords
  42681976  2019-01-08T23:10:12Z  gs://cluster-data/demo/data/bundles/bundles-00006-of-00010.tfrecords
  32319546  2019-01-08T23:10:13Z  gs://cluster-data/demo/data/bundles/bundles-00007-of-00010.tfrecords
  49995527  2019-01-08T23:10:14Z  gs://cluster-data/demo/data/bundles/bundles-00008-of-00010.tfrecords
  36610205  2019-01-08T23:10:14Z  gs://cluster-data/demo/data/bundles/bun

Delete labels generated from previous runs

In [7]:
%bash
gsutil rm gs://cluster-data/demo/data/output/labels*

Removing gs://cluster-data/demo/data/output/labels-00000-of-00001.tfrecords...
/ [1 objects]                                                                   
Operation completed over 1 objects.                                              


In [8]:
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions
import apache_beam as beam

import tensorflow as tf
from tensorflow.core.example import example_pb2

from proto.stu3 import google_extensions_pb2
from proto.stu3 import resources_pb2
from proto.stu3 import version_config_pb2

from google.protobuf import text_format
from py.google.fhir.labels import label
from py.google.fhir.labels import bundle_to_label
from py.google.fhir.seqex import bundle_to_seqex

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'dp-workspace'
google_cloud_options.job_name = 'bundlesTolabels'
google_cloud_options.staging_location = 'gs://healthedatalab/staging'
google_cloud_options.temp_location = 'gs://healthedatalab/temp'
options.view_as(StandardOptions).runner = 'DirectRunner'

p = beam.Pipeline(options=options)



In [9]:
bundles = p | 'read' >> beam.io.ReadFromTFRecord(
    input_bundles, coder=beam.coders.ProtoCoder(resources_pb2.Bundle))

labels = bundles | 'BundleToLabel' >> beam.ParDo(
    bundle_to_label.LengthOfStayRangeLabelAt24HoursFn(for_synthea=True))

_ = labels | beam.io.WriteToTFRecord(
    labels_path,
    coder=beam.coders.ProtoCoder(google_extensions_pb2.EventLabel),
    file_name_suffix='.tfrecords')

p.run().wait_until_finish()

Using this argument will have no effect on the actual scopes for tokens
requested. These scopes are set at VM instance creation time and
can't be overridden in the request.

W0123 19:41:30.783166 140558414481152 tfrecordio.py:49] Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.


'DONE'

Let's examine the output location in GCS where the labels have been created

In [10]:
%bash
gsutil ls -l gs://cluster-data/demo/data/output

     97285  2019-01-15T10:25:54Z  gs://cluster-data/demo/data/output/label-00000-of-00001.tfrecords
     97285  2019-01-23T19:41:56Z  gs://cluster-data/demo/data/output/labels-00000-of-00001.tfrecords
TOTAL: 2 objects, 194570 bytes (190.01 KiB)
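
The generated labels are length-of-stay range classes. The class names used by the model later in this notebook are `less_or_equal_3`, `3_7`, `7_14` and `above_14`. A minimal sketch of how a stay measured in days could map to those classes, assuming the upper bound of each range is inclusive. The exact boundary handling inside `LengthOfStayRangeLabelAt24HoursFn` is not shown here, and `los_range_class` is a hypothetical name.

```python
from bisect import bisect_left

LABEL_VALUES = ['less_or_equal_3', '3_7', '7_14', 'above_14']
# Assumed upper bounds (inclusive) of the first three ranges, in days.
UPPER_BOUNDS = [3, 7, 14]


def los_range_class(days):
    """Map a length of stay in days to one of the four range classes."""
    return LABEL_VALUES[bisect_left(UPPER_BOUNDS, days)]


for days in (1, 3, 5, 10, 20):
    print(days, los_range_class(days))
```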


<h2> 3. Generate TFSequenceExamples</h2>
<ul>
    <li>The next few cells generate TensorFlow sequence examples</li>
    <li>Bundles in TFRecord format have already been generated from Synthetic FHIR data</li>
    <li>Bundles will be used as inputs and are stored in Google Cloud Storage</li>
    <li>Output sequence examples will also be stored in Google Cloud Storage </li>
</ul>

In [11]:
#input_bundles = 'gs://cluster-data/demo/data/bundles/bundles*'
#labels = 'gs://cluster-data/demo/data/labels/train-00000-of-00001.tfrecords'
labels = 'gs://cluster-data/demo/data/output/labels*'
seqex_path = 'gs://cluster-data/demo/data/output/seqex'
seqex_for_training = 'gs://cluster-data/demo/data/seqex/train*'
#seqex_for_training = 'gs://cluster-data/demo/data/output/seqex*'
seqex_for_eval = 'gs://cluster-data/demo/data/seqex/validation*'

In [12]:
%bash
gsutil ls -l gs://cluster-data/demo/data/output/labels*

     97285  2019-01-23T19:41:56Z  gs://cluster-data/demo/data/output/labels-00000-of-00001.tfrecords
TOTAL: 1 objects, 97285 bytes (95 KiB)


In [13]:
def _get_version_config(version_config_path):
  with open(version_config_path) as f:
    return text_format.Parse(f.read(), version_config_pb2.VersionConfig())

p1 = beam.Pipeline(options=options)

version_config = _get_version_config("/usr/local/fhir/proto/stu3/version_config.textproto")

keyed_bundles = ( 
    p1 
    | 'readBundles' >> beam.io.ReadFromTFRecord(
        input_bundles, coder=beam.coders.ProtoCoder(resources_pb2.Bundle))
    | 'KeyBundlesByPatientId' >> beam.ParDo(
        bundle_to_seqex.KeyBundleByPatientIdFn()))

event_labels = ( 
    p1 | 'readEventLabels' >> beam.io.ReadFromTFRecord(
        labels,
        coder=beam.coders.ProtoCoder(google_extensions_pb2.EventLabel)))

keyed_event_labels = bundle_to_seqex.CreateTriggerLabelsPairLists(
    event_labels)

bundles_and_labels = bundle_to_seqex.CreateBundleAndLabels(
    keyed_bundles, keyed_event_labels)

_ = ( 
    bundles_and_labels
    | 'Reshuffle1' >> beam.Reshuffle()
    | 'GenerateSeqex' >> beam.ParDo(
        bundle_to_seqex.BundleAndLabelsToSeqexDoFn(
            version_config=version_config,
            enable_attribution=False,
            generate_sequence_label=False))
    | 'Reshuffle2' >> beam.Reshuffle()
    | 'WriteSeqex' >> beam.io.WriteToTFRecord(
        seqex_path,
        coder=beam.coders.ProtoCoder(example_pb2.SequenceExample),
        file_name_suffix='.tfrecords',
        num_shards=2))

In [None]:
p1.run().wait_until_finish()

In [1]:
%bash
gsutil ls -l gs://cluster-data/demo/data/output/seqex

CommandException: One or more URLs matched no objects.


<h2> 4. Train and Evaluate ML Model</h2>
<ul>
    <li>The next few cells demonstrate how to train an ML model using the training data set created in Step 3</li>
    <li>Training requires sequence examples in TFRecord format</li>
    <li>Trained ML model will be stored in Google Cloud Storage </li>
    <li>Model will be evaluated and the evaluation output will be printed</li>
</ul>

In [5]:
import tensorflow as tf
model_path = 'gs://cluster-data/demo/data/output/model'
train_file = 'gs://cluster-data/demo/data/seqex/train-00000-of-00010.tfrecords'
validation_file = 'gs://cluster-data/demo/data/seqex/validation-00000-of-00010.tfrecords'



In [6]:
def create_hparams(hparams_overrides=None):
  """Creates default HParams with the option of overrides.

  Args:
    hparams_overrides: HParams overriding the otherwise provided defaults.
      Defaults to None (meaning no overrides take place). The HParams specified
      must reference a subset of the defaults.

  Returns:
    The default HParams with any overrides applied.
  """
  hparams = tf.contrib.training.HParams(
      # Sequence features are bucketed by their age at time of prediction in:
      # (time_windows[1], time_windows[0]],
      # (time_windows[2], time_windows[1]],
      # ...
      time_windows=[
          5 * 365 * 24 * 60 * 60,  # 5 years
          365 * 24 * 60 * 60,  # 1 year
          30 * 24 * 60 * 60,  # 1 month
          7 * 24 * 60 * 60,  # 1 week
          1 * 24 * 60 * 60,  # 1 day
          0,  # now
      ],
      batch_size=64,
      learning_rate=0.003,
      dedup=True,
      l1_regularization_strength=0.0,
      l2_regularization_strength=0.0,
      include_age=True,
      age_boundaries=[1, 5, 18, 30, 50, 70, 90],
      categorical_context_features=['Patient.gender'],
      sequence_features=[
          'Composition.section.text.div.tokenized',
          'Composition.type',
          'Condition.code',
          'Encounter.hospitalization.admitSource',
          'Encounter.reason.hcc',
          'MedicationRequest.contained.medication.code.gsn',
          'Procedure.code.cpt',
      ],
      # Number of hash buckets to map the tokens of the sequence_features into.
      sequence_bucket_sizes=[
          17000,
          16,
          3052,
          10,
          62,
          1600,
          732,
      ],
      # List of strings, each of which is a ':'-separated list of features that
      # we want to concatenate over the time dimension
      time_crossed_features=[
          '%s:%s:%s:%s' % ('Observation.code',
                           'Observation.value.quantity.value',
                           'Observation.value.quantity.unit',
                           'Observation.value.string')
      ],
      time_concat_bucket_sizes=[39571],
      context_bucket_sizes=[4])
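  # Other overrides (possibly coming from vizier) are applied.
  # tf.training.merge_hparam is not part of the public TensorFlow API, so the
  # override values are copied onto the defaults instead.
  if hparams_overrides:
    hparams = hparams.override_from_dict(hparams_overrides.values())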
  return hparams
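
The `time_windows` hyperparameter determines how each sequence feature is later split into per-window copies. A minimal pure-Python sketch of that bucketing, using the same comparison as the input function in this notebook (`min_age < event_age <= max_age`) and the same `-til-` suffix. `window_suffix` is a hypothetical helper name.

```python
TIME_WINDOWS = [
    5 * 365 * 24 * 60 * 60,  # 5 years
    365 * 24 * 60 * 60,      # 1 year
    30 * 24 * 60 * 60,       # 1 month
    7 * 24 * 60 * 60,        # 1 week
    1 * 24 * 60 * 60,        # 1 day
    0,                       # now
]


def window_suffix(event_age_seconds, time_windows=TIME_WINDOWS):
    """Return '-til-<min_age>' for the window holding the event, else None."""
    for max_age, min_age in zip(time_windows, time_windows[1:]):
        if min_age < event_age_seconds <= max_age:
            return '-til-%d' % min_age
    return None


# An event that happened three days before prediction time.
print(window_suffix(3 * 24 * 60 * 60))  # -til-86400
```

Under this comparison an event with age exactly 0, or one older than five years, falls into no window.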

In [7]:
CONTEXT_KEY_PREFIX = 'c-'
SEQUENCE_KEY_PREFIX = 's-'
AGE_KEY = 'Patient.ageInYears'

LABEL_VALUES = ['less_or_equal_3', '3_7', '7_14', 'above_14']


def _example_index_to_sparse_index(example_indices, batch_size):
  """Creates a sparse index tensor from a list of example indices.

  For example, this would do the transformation:
  [0, 0, 0, 1, 3, 3] -> [[0,0], [0,1], [0,2], [1,0], [3,0], [3,1]]

  The second column of the output tensor is the running count of the occurrences
  of that example index.

  Args:
    example_indices: A sorted 1D Tensor with example indices.
    batch_size: The batch_size. Could be larger than max(example_indices) if the
      last examples of the batch do not have the feature present.
  Returns:
    The sparse index tensor.
    The maximum length of a row in this tensor.
  """
  binned_counts = tf.bincount(example_indices, minlength=batch_size)
  max_len = tf.to_int64(tf.reduce_max(binned_counts))
  return tf.where(tf.sequence_mask(binned_counts)), max_len

def _dedup_tensor(sp_tensor):
  """Dedup values of a SparseTensor along each row.

  Args:
    sp_tensor: A 2D SparseTensor to be deduped.
  Returns:
    A deduped SparseTensor of shape [batch_size, max_len], where max_len is
    the maximum number of unique values for a row in the Tensor.
  """
  string_batch_index = tf.as_string(sp_tensor.indices[:, 0])

  # tf.unique only works on 1D tensors. To avoid deduping across examples,
  # prepend each feature value with the example index. This requires casting
  # to and from strings for non-string features.
  original_dtype = sp_tensor.values.dtype
  string_values = (
      sp_tensor.values
      if original_dtype == tf.string else tf.as_string(sp_tensor.values))
  index_and_value = tf.string_join([string_batch_index, string_values],
                                   separator='|')
  unique_index_and_value, _ = tf.unique(index_and_value)

  # split is a shape [tf.size(values), 2] tensor. The first column contains
  # indices and the second column contains the feature value (we assume no
  # feature contains | so we get exactly 2 values from the string split).
  split = tf.string_split(unique_index_and_value, delimiter='|')
  split = tf.reshape(split.values, [-1, 2])
  string_indices = split[:, 0]
  values = split[:, 1]

  indices = tf.reshape(
      tf.string_to_number(string_indices, out_type=tf.int32), [-1])
  if original_dtype != tf.string:
    values = tf.string_to_number(values, out_type=original_dtype)
  values = tf.reshape(values, [-1])
  # Convert example indices into SparseTensor indices, e.g.
  # [0, 0, 0, 1, 3, 3] -> [[0,0], [0,1], [0,2], [1,0], [3,0], [3,1]]
  batch_size = tf.to_int32(sp_tensor.dense_shape[0])
  new_indices, max_len = _example_index_to_sparse_index(indices, batch_size)
  return tf.SparseTensor(
      indices=tf.to_int64(new_indices),
      values=values,
      dense_shape=[tf.to_int64(batch_size), max_len])

def get_input_fn(mode,
                 input_pattern,
                 dedup,
                 time_windows,
                 include_age,
                 categorical_context_features,
                 sequence_features,
                 time_crossed_features,
                 batch_size,
                 shuffle=True):
  """Creates an input function to an estimator.

  Args:
    mode: The execution mode, as defined in tf.estimator.ModeKeys.
    input_pattern: Input data pattern in TFRecord format containing
      tf.SequenceExamples.
    dedup: Whether to remove duplicate values.
    time_windows: List of time windows - we bucket all sequence features by
      their age into buckets (time_windows[i+1], time_windows[i]].
    include_age: Whether to include the age_in_years as a feature.
    categorical_context_features: List of string context features that are valid
      keys in the tf.SequenceExample.
    sequence_features: List of sequence features (strings) that are valid keys
      in the tf.SequenceExample.
    time_crossed_features: List of lists of sequence features (strings) that
      should be crossed at each step along the time dimension.
    batch_size: The size of the batch when reading in data.
    shuffle: Whether to shuffle the examples.

  Returns:
    A function that returns a dictionary of features and the target labels.
  """

  def input_fn():
    """Supplies input to our model.

    This function supplies input to our model, where this input is a
    function of the mode. For example, we supply different data if
    we're performing training versus evaluation.

    Returns:
      A tuple consisting of 1) a dictionary of tensors whose keys are
      the feature names, and 2) a tensor of target labels if the mode
      is not INFER (and None, otherwise).
    """

    sequence_features_config = dict()
    for feature in sequence_features:
      dtype = tf.string
      if feature == 'Observation.value.quantity.value':
        dtype = tf.float32
      sequence_features_config[feature] = tf.VarLenFeature(dtype)

    sequence_features_config['eventId'] = tf.FixedLenSequenceFeature(
        [], tf.int64, allow_missing=False)
    for cross in time_crossed_features:
      for feature in cross:
        dtype = tf.string
        if feature == 'Observation.value.quantity.value':
          dtype = tf.float32
        sequence_features_config[feature] = tf.VarLenFeature(dtype)
    context_features_config = dict()
    if include_age:
      context_features_config['timestamp'] = tf.FixedLenFeature(
          [], tf.int64, default_value=-1)
      context_features_config['Patient.birthDate'] = tf.FixedLenFeature(
          [], tf.int64, default_value=-1)
    context_features_config['sequenceLength'] = tf.FixedLenFeature(
        [], tf.int64, default_value=-1)

    for context_feature in categorical_context_features:
      context_features_config[context_feature] = tf.VarLenFeature(tf.string)
    if mode != tf.estimator.ModeKeys.PREDICT:
      context_features_config['label.length_of_stay_range.class'] = (
          tf.FixedLenFeature([], tf.string, default_value='MISSING'))

    is_training = mode == tf.estimator.ModeKeys.TRAIN
    num_epochs = None if is_training else 1

    with tf.name_scope('read_batch'):
      file_names = [input_pattern]
      files = tf.data.Dataset.list_files(file_names)
      if shuffle:
        files = files.shuffle(buffer_size=len(file_names))
      dataset = (files
                 .apply(tf.contrib.data.parallel_interleave(
                     tf.data.TFRecordDataset, cycle_length=10))
                 .repeat(num_epochs))
      if shuffle:
        dataset = dataset.shuffle(buffer_size=100)
      dataset = dataset.batch(batch_size)

      def _parse_fn(serialized_examples):
        context, sequence, _ = tf.io.parse_sequence_example(
            serialized_examples,
            context_features=context_features_config,
            sequence_features=sequence_features_config,
            name='parse_sequence_example')
        return context, sequence

      dataset = dataset.map(_parse_fn, num_parallel_calls=8)

      def _process(context, sequence):
        """Builds the model's feature map from the parsed examples.

        This function crosses the requested features, buckets the sequence
        features by time window, and merges them with the context features.

        Args:
          context: The dictionary from key to (Sparse)Tensors with context
            features
          sequence: The dictionary from key to (Sparse)Tensors with sequence
            features

        Returns:
          A dictionary of tensors whose keys are the prefixed feature names.
          Outside of PREDICT mode it still contains the label under its
          context key; the caller pops it.
        """
        # Combine into a single dictionary.
        feature_map = {}
        # Add age if requested.
        if include_age:
          age_in_seconds = (
              context['timestamp'] -
              context.pop('Patient.birthDate'))
          age_in_years = tf.to_float(age_in_seconds) / (60 * 60 * 24 * 365.0)
          feature_map[CONTEXT_KEY_PREFIX + AGE_KEY] = age_in_years

        sequence_length = context.pop('sequenceLength')
        # Cross the requested features.
        for cross in time_crossed_features:
          # The features may be missing at different rates - we take the union
          # of the indices supplying defaults.
          extended_features = dict()
          dense_shape = tf.concat(
              [[tf.shape(sequence_length)[0]], [tf.reduce_max(sequence_length)],
               tf.constant([1], dtype=tf.int64)],
              axis=0)
          for i, feature in enumerate(cross):
            sp_tensor = sequence[feature]
            additional_indices = []
            covered_indices = sp_tensor.indices
            for j, other_feature in enumerate(cross):
              if i != j:
                additional_indices.append(
                    tf.sets.set_difference(
                        tf.sparse_reorder(
                            tf.SparseTensor(
                                indices=sequence[other_feature].indices,
                                values=tf.zeros([
                                    tf.shape(sequence[other_feature].indices)[0]
                                ],
                                                dtype=tf.int32),
                                dense_shape=dense_shape)),
                        tf.sparse_reorder(
                            tf.SparseTensor(
                                indices=covered_indices,
                                values=tf.zeros([tf.shape(covered_indices)[0]],
                                                dtype=tf.int32),
                                dense_shape=dense_shape))).indices)
                covered_indices = tf.concat(
                    [sp_tensor.indices] + additional_indices, axis=0)

            additional_indices = tf.concat(additional_indices, axis=0)

            # Supply defaults for all other indices.
            default = tf.tile(
                tf.constant(['n/a']),
                multiples=[tf.shape(additional_indices)[0]])

            string_value = (
                tf.as_string(sp_tensor.values)
                if sp_tensor.values.dtype != tf.string else sp_tensor.values)

            extended_features[feature] = tf.sparse_reorder(
                tf.SparseTensor(
                    indices=tf.concat([sp_tensor.indices, additional_indices],
                                      axis=0),
                    values=tf.concat([string_value, default], axis=0),
                    dense_shape=dense_shape))

          new_values = tf.string_join(
              [extended_features[f].values for f in cross], separator='-')
          crossed_sp_tensor = tf.sparse_reorder(
              tf.SparseTensor(
                  indices=extended_features[cross[0]].indices,
                  values=new_values,
                  dense_shape=extended_features[cross[0]].dense_shape))
          sequence['_'.join(cross)] = crossed_sp_tensor
        # Remove unwanted features that are used in the cross but should not be
        # considered outside the cross.
        for cross in time_crossed_features:
          for feature in cross:
            if feature not in sequence_features and feature in sequence:
              del sequence[feature]

        # Flatten sparse tensor to compute event age. This dense tensor also
        # contains padded values. These will not be used when gathering elements
        # from the dense tensor since each sparse feature won't have a value
        # defined for the padding.
        padded_event_age = (
            # Broadcast current time along sequence dimension.
            tf.expand_dims(context.pop('timestamp'), 1)
            # Subtract time of events.
            - sequence.pop('eventId'))

        for i in range(len(time_windows) - 1):
          max_age = time_windows[i]
          min_age = time_windows[i+1]
          padded_in_time_window = tf.logical_and(padded_event_age <= max_age,
                                                 padded_event_age > min_age)

          for k, v in sequence.items():
            # For each sparse feature entry, look up whether it is in the time
            # window or not.
            in_time_window = tf.gather_nd(padded_in_time_window,
                                          v.indices[:, 0:2])
            v = tf.sparse_retain(v, in_time_window)
            sp_tensor = tf.sparse_reshape(v, [v.dense_shape[0], -1])
            if dedup:
              sp_tensor = _dedup_tensor(sp_tensor)

            feature_map[SEQUENCE_KEY_PREFIX + k +
                        '-til-%d' %min_age] = sp_tensor

        for k, v in context.items():
          feature_map[CONTEXT_KEY_PREFIX + k] = v
        return feature_map

      feature_map = (dataset
                     # Parallelize the input processing and put it behind a
                     # queue to increase performance by removing it from the
                     # critical path of per-step-computation.
                     .map(_process, num_parallel_calls=8)
                     .prefetch(buffer_size=1)
                     .make_one_shot_iterator()
                     .get_next())
      label = None
      if mode != tf.estimator.ModeKeys.PREDICT:
        label = feature_map.pop(CONTEXT_KEY_PREFIX +
                                'label.length_of_stay_range.class')
      return feature_map, label
  return input_fn
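
The combined effect of `_dedup_tensor` and `_example_index_to_sparse_index` can be illustrated without TensorFlow. A minimal sketch, assuming the sparse input is given as parallel lists of row indices and values, sorted by row. `dedup_rows` is a hypothetical name and not part of the notebook.

```python
def dedup_rows(row_indices, values, batch_size):
    """Drop repeated values within each row and rebuild [row, position] indices."""
    seen = set()
    counts = [0] * batch_size  # Running count of kept values per row.
    new_indices, new_values = [], []
    for row, value in zip(row_indices, values):
        if (row, value) in seen:
            continue  # Duplicate within the same row: skip it.
        seen.add((row, value))
        new_indices.append([row, counts[row]])
        new_values.append(value)
        counts[row] += 1
    max_len = max(counts) if counts else 0
    return new_indices, new_values, [batch_size, max_len]


indices, values, shape = dedup_rows(
    [0, 0, 0, 0, 1, 3, 3, 3],
    ['a', 'b', 'a', 'c', 'a', 'x', 'x', 'y'],
    batch_size=4)
print(indices)  # [[0, 0], [0, 1], [0, 2], [1, 0], [3, 0], [3, 1]]
print(values)   # ['a', 'b', 'c', 'a', 'x', 'y']
print(shape)    # [4, 3]
```

The same value may appear in different rows (`'a'` in rows 0 and 1 here), which is why the TensorFlow version prepends the example index before calling `tf.unique`.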

In [8]:
tf.reset_default_graph()
hparams = create_hparams()

time_crossed_features = [
        cross.split(':') for cross in hparams.time_crossed_features if cross
    ]

map_, label_ = get_input_fn(tf.estimator.ModeKeys.TRAIN, train_file, True, hparams.time_windows,
                            hparams.include_age, hparams.categorical_context_features,
                            hparams.sequence_features, time_crossed_features, batch_size=2)()
with tf.train.MonitoredSession() as sess:
  map_['label'] = label_
  print(sess.run(map_))

Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
{'s-Observation.code_Observation.value.quantity.value_Observation.value.quantity.unit_Observation.value.string-til-0': SparseTensorValue(indices=array([[ 1,  0],
       [ 1,  1],
       [ 1,  2],
       [ 1,  3],
       [ 1,  4],
       [ 1,  5],
       [ 1,  6],
       [ 1,  7],
       [ 1,  8],
       [ 1,  9],
       [ 1, 10],
       [ 1, 11],
       [ 1, 12],
       [ 1, 13],
       [ 1, 14],
       [ 1, 15],
       [ 1, 16],
       [ 1, 17],
       [ 1, 18],
       [ 1, 19],
       [ 1, 20],
       [ 1, 21],
       [ 1, 22],
       [ 1, 23],
       [ 1, 24],
       [ 1, 25],
       [ 1, 26],
       [ 1, 27],
       [ 1, 28],
       [ 1, 29],
       [ 1, 30],
       [ 1, 31],
       [ 1, 32],
       [ 1, 33],
       [ 1, 34],
       [ 1, 35],
       [ 1, 36],
       [ 1, 37],
       [ 
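
The long feature key in the output above comes from crossing the four `Observation` features at each time step. A minimal pure-Python sketch of that crossing, assuming each feature is represented as a dictionary from an index tuple to a string value. `cross_features` and the short feature names are hypothetical; the `'n/a'` default and the `'-'` and `'_'` separators follow the notebook code.

```python
def cross_features(features, cross):
    """Join the values of the crossed features at every index any of them has.

    features: dict mapping feature name -> {index tuple: string value}.
    cross: list of feature names to cross.
    """
    all_indices = sorted(set().union(*(features[name].keys() for name in cross)))
    return {
        idx: '-'.join(features[name].get(idx, 'n/a') for name in cross)
        for idx in all_indices
    }


features = {
    'code': {(0, 0): 'hr', (0, 1): 'bp'},
    'value': {(0, 0): '72'},  # No value recorded at step (0, 1).
    'unit': {(0, 0): 'bpm', (0, 1): 'mmHg'},
}
cross = ['code', 'value', 'unit']
crossed = {'_'.join(cross): cross_features(features, cross)}
print(crossed)
```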

In [9]:
seq_features = []
seq_features_sizes = []
hparams = create_hparams()

for k, bucket_size in zip(
    hparams.sequence_features,
    hparams.sequence_bucket_sizes):
  for max_age in hparams.time_windows[1:]:
    seq_features.append(
        tf.feature_column.categorical_column_with_hash_bucket(
            SEQUENCE_KEY_PREFIX + k + '-til-' +
            str(max_age), bucket_size))
    seq_features_sizes.append(bucket_size)

categorical_context_features = [
    tf.feature_column.categorical_column_with_hash_bucket(
        CONTEXT_KEY_PREFIX + k, bucket_size)
    for k, bucket_size in zip(hparams.categorical_context_features,
                              hparams.context_bucket_sizes)
]
discretized_context_features = []
if hparams.include_age:
  discretized_context_features.append(
      tf.feature_column.bucketized_column(
          tf.feature_column.numeric_column(CONTEXT_KEY_PREFIX + AGE_KEY),
          boundaries=hparams.age_boundaries))

optimizer = tf.train.FtrlOptimizer(
      learning_rate=hparams.learning_rate,
      l1_regularization_strength=hparams.l1_regularization_strength,
      l2_regularization_strength=hparams.l2_regularization_strength)

estimator = tf.estimator.LinearClassifier(
    feature_columns=seq_features + categorical_context_features +
    discretized_context_features,
    n_classes=len(LABEL_VALUES),
    label_vocabulary=LABEL_VALUES,
    model_dir=model_path,
    optimizer=optimizer,
    loss_reduction=tf.losses.Reduction.SUM_OVER_BATCH_SIZE)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb9f5edd890>, '_model_dir': 'gs://cluster-data/demo/data/output/model', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_experimental_distribute': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_evaluation_master': '', '_eval_distribute': None, '_train_distribute': None, '_master': ''}
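
To make the two column types used above more concrete, here is a minimal plain-Python sketch of what hashing into buckets and bucketizing a numeric value do conceptually. It is not TensorFlow's implementation: `toy_hash_bucket` uses MD5 as a stand-in for TensorFlow's internal hash, so its bucket ids will differ from the real ones, and the key string, bucket count and age boundaries are made-up values rather than the ones returned by `create_hparams()`.

```python
import bisect
import hashlib


def toy_hash_bucket(value, num_buckets):
  """Maps a string to a bucket id in [0, num_buckets).

  Illustrative stand-in for categorical_column_with_hash_bucket. MD5 is an
  assumption made for this sketch, not the hash TensorFlow uses.
  """
  digest = hashlib.md5(value.encode('utf-8')).hexdigest()
  return int(digest, 16) % num_buckets


def toy_bucketize(value, boundaries):
  """Returns the index of the half-open bucket that contains value.

  For boundaries [b0, b1, ...] the buckets are (-inf, b0), [b0, b1), ...,
  which follows the documented behaviour of bucketized_column.
  """
  return bisect.bisect_right(boundaries, value)


# Hypothetical values chosen only for illustration.
example_boundaries = [18, 40, 65]
example_bucket_id = toy_hash_bucket('some-code-til-24', 100)
example_age_buckets = [
    toy_bucketize(age, example_boundaries) for age in (5, 18, 39, 70)]
print(example_bucket_id, example_age_buckets)
```

Distinct strings can collide in the same hash bucket, which is why the bucket sizes in the hyperparameters trade memory against collision rate.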


In [10]:
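def multiclass_metrics_fn(labels, predictions):
  """Computes precision/recall@k metrics, micro-averaged and for each class.

  Also computes binary metrics for the task "stay of at most 7 days".

  Args:
    labels: A string Tensor of shape [batch_size] with the true labels.
    predictions: A dict of prediction Tensors. predictions['probabilities']
      is a float Tensor of shape [batch_size, num_classes].

  Returns:
    A dictionary with precision/recall @1 and @2, AUC and precision/recall
    for the binary at-most-7-days task, and precision/recall per class.
  """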

  label_ids = tf.contrib.lookup.index_table_from_tensor(
      tuple(LABEL_VALUES),
      name='class_id_lookup').lookup(labels)
  dense_labels = tf.one_hot(label_ids, len(LABEL_VALUES))

  # Collapse the task to a binary one: stay of at most 7 days.
  # Expected LABEL_VALUES order: 'less_or_equal_3', '3_7', '7_14', 'above_14',
  # so label ids 0 and 1 form the positive class.
  binary_labels = label_ids < 2
  binary_probs = tf.reduce_sum(predictions['probabilities'][:, 0:2], axis=1)

  metrics_dict = {
      'precision_at_1':
          tf.metrics.precision_at_k(
              labels=label_ids,
              predictions=predictions['probabilities'], k=1),
      'precision_at_2':
          tf.metrics.precision_at_k(
              labels=label_ids,
              predictions=predictions['probabilities'], k=2),
      'recall_at_1':
          tf.metrics.recall_at_k(
              labels=label_ids,
              predictions=predictions['probabilities'], k=1),
      'recall_at_2':
          tf.metrics.recall_at_k(
              labels=label_ids,
              predictions=predictions['probabilities'], k=2),
      'auc_roc_at_most_7d':
          tf.metrics.auc(
              labels=binary_labels,
              predictions=binary_probs,
              curve='ROC',
              summation_method='careful_interpolation'),
      'auc_pr_at_most_7d':
          tf.metrics.auc(
              labels=binary_labels,
              predictions=binary_probs,
              curve='PR',
              summation_method='careful_interpolation'),
      'precision_at_most_7d':
          tf.metrics.precision(
              labels=binary_labels,
              predictions=binary_probs >= 0.5),
      'recall_at_most_7d':
          tf.metrics.recall(
              labels=binary_labels,
              predictions=binary_probs >= 0.5),
  }
  for i, label in enumerate(LABEL_VALUES):
    metrics_dict['precision_%s' % label] = tf.metrics.precision_at_k(
        labels=label_ids,
        predictions=predictions['probabilities'],
        k=1,
        class_id=i)
    metrics_dict['recall_%s' % label] = tf.metrics.recall_at_k(
        labels=label_ids,
        predictions=predictions['probabilities'],
        k=1,
        class_id=i)

  return metrics_dict


estimator = tf.contrib.estimator.add_metrics(estimator, multiclass_metrics_fn)

INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb9e80bc410>, '_model_dir': 'gs://cluster-data/demo/data/output/model', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_experimental_distribute': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_evaluation_master': '', '_eval_distribute': None, '_train_distribute': None, '_master': ''}
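
The binary "at most 7 days" metrics in `multiclass_metrics_fn` are derived by summing the predicted probabilities of the first two classes and treating label ids below 2 as positive. The following plain-Python sketch walks through that step on made-up probabilities. It assumes the class order given in the code comment and is only an illustration of the arithmetic, not a replacement for the TensorFlow ops.

```python
# Class order assumed from the comment in multiclass_metrics_fn.
LABELS = ['less_or_equal_3', '3_7', '7_14', 'above_14']

# Hypothetical per-class probabilities for three stays (each row sums to 1).
probabilities = [
    [0.50, 0.25, 0.125, 0.125],
    [0.125, 0.125, 0.25, 0.50],
    [0.25, 0.125, 0.50, 0.125],
]
true_labels = ['3_7', 'above_14', 'less_or_equal_3']

# String label -> integer id, mirroring the lookup table in the metrics fn.
label_ids = [LABELS.index(label) for label in true_labels]

# Ids 0 and 1 are the classes covering stays of at most 7 days.
binary_labels = [label_id < 2 for label_id in label_ids]

# Probability of "at most 7 days" is the sum of the first two class columns.
binary_probs = [row[0] + row[1] for row in probabilities]

# Thresholding at 0.5 gives the predictions used for precision and recall.
binary_predictions = [prob >= 0.5 for prob in binary_probs]
print(binary_labels, binary_probs, binary_predictions)
```

In this toy data the third stay is a false negative: its true label is within 7 days, but the summed probability stays below the 0.5 threshold.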


In [11]:
train_input_fn = get_input_fn(
    tf.estimator.ModeKeys.TRAIN, train_file, True, hparams.time_windows,
    hparams.include_age, hparams.categorical_context_features,
    hparams.sequence_features, time_crossed_features, batch_size=24)

In [12]:
estimator.train(input_fn=train_input_fn, steps=100)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into gs://cluster-data/demo/data/output/model/model.ckpt.
INFO:tensorflow:loss = 1.3862944, step = 1
INFO:tensorflow:Saving checkpoints for 100 into gs://cluster-data/demo/data/output/model/model.ckpt.
INFO:tensorflow:Loss for final step: 0.98193234.


<tensorflow.python.estimator.estimator.Estimator at 0x7fb9e80bc3d0>

In [13]:
validation_input_fn = get_input_fn(
    tf.estimator.ModeKeys.EVAL, validation_file, True, hparams.time_windows,
    hparams.include_age, hparams.categorical_context_features,
    hparams.sequence_features, time_crossed_features, batch_size=24)

In [14]:
estimator.evaluate(input_fn=validation_input_fn, steps=40)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-01-23-20:28:06
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://cluster-data/demo/data/output/model/model.ckpt-100
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-01-23-20:28:13
INFO:tensorflow:Saving dict for global step 100: accuracy = 0.5, auc_pr_at_most_7d = 1.0, auc_roc_at_most_7d = 0.0, average_loss = 1.2924907, global_step = 100, loss = 1.2924907, precision_3_7 = nan, precision_7_14 = nan, precision_above_14 = nan, precision_at_1 = 0.5, precision_at_2 = 0.5, precision_at_most_7d = 1.0, precision_less_or_equal_3 = 0.5, recall_3_7 = 0.0, recall_7_14 = nan, recall_above_14 = nan, recall_at_1 = 0.5, recall_at_2 = 1.0, recall_at_most_7d = 1.0, recall_less_or_equal_3 = 1.0

{'accuracy': 0.5,
 'auc_pr_at_most_7d': 1.0,
 'auc_roc_at_most_7d': 0.0,
 'average_loss': 1.2924907,
 'global_step': 100,
 'loss': 1.2924907,
 'precision_3_7': nan,
 'precision_7_14': nan,
 'precision_above_14': nan,
 'precision_at_1': 0.5,
 'precision_at_2': 0.5,
 'precision_at_most_7d': 1.0,
 'precision_less_or_equal_3': 0.5,
 'recall_3_7': 0.0,
 'recall_7_14': nan,
 'recall_above_14': nan,
 'recall_at_1': 0.5,
 'recall_at_2': 1.0,
 'recall_at_most_7d': 1.0,
 'recall_less_or_equal_3': 1.0}
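
Several of the per-class metrics above are `nan`. One plausible reading, not confirmed by the notebook itself, is that the model predicts `less_or_equal_3` for every example while the evaluated labels contain only `less_or_equal_3` and `3_7`, so some ratios become 0/0. The sketch below reproduces that arithmetic in plain Python on made-up labels and predictions. It illustrates the metric definitions only and is not the TensorFlow implementation.

```python
LABELS = ['less_or_equal_3', '3_7', '7_14', 'above_14']

# Made-up data: the model always predicts class 0; labels are half 0, half 1.
true_ids = [0, 1, 0, 1]
predicted_ids = [0, 0, 0, 0]


def safe_ratio(numerator, denominator):
  """Returns numerator / denominator, or nan when the denominator is zero."""
  if denominator == 0:
    return float('nan')
  return float(numerator) / denominator


def per_class_precision_recall(class_id):
  """Computes top-1 precision and recall for a single class."""
  true_positives = sum(
      t == class_id and p == class_id
      for t, p in zip(true_ids, predicted_ids))
  num_predicted = sum(p == class_id for p in predicted_ids)
  num_actual = sum(t == class_id for t in true_ids)
  return (safe_ratio(true_positives, num_predicted),
          safe_ratio(true_positives, num_actual))


results = {
    label: per_class_precision_recall(i) for i, label in enumerate(LABELS)}
for label in LABELS:
  print(label, results[label])
```

Under the same reading, every example is positive for the binary at-most-7-days task, so ROC AUC has no negative examples to rank against and the reported value should not be interpreted as model quality.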

<h2> 5. Deploy ML Model to Cloud ML</h2>
<ul>
    <li>The trained ML model will be deployed to Cloud ML for serving.</li>
</ul>