<h1> HIMSS Demo - HealtheDatalab </h1>

<h2> Structured Machine Learning using TensorFlow </h2>
<hr />
This notebook demonstrates how to train, evaluate, and deploy an ML model to CloudML. It uses a pre-built machine learning model to predict length of stay in emergency department (ED) and inpatient care settings.
<h3>
<br />
<ol>
<li> Access, Analyze & Visualize Data using HealtheDataLab </li> <br />
<li> Label generation - Generate Labels in TFRecord format </li> <br />
<li> Generate TFSequenceExamples </li> <br />
<li> Train and Evaluate Machine Learning Model </li> <br />
<li> Deploy ML Model to CloudML </li>
</ol></h3>
<hr />

<h2> 1. Access, Analyze & Visualize Data using HealtheDataLab </h2>
<ul>
    <li>Import FHIR bundles (patients' longitudinal records) into Spark Dataframes</li>
    <li>Extract patient records into Spark Dataframes</li>
    <li>Query and visualize patient records using Spark SQL </li>
</ul>

In [None]:
from pyspark.sql import SparkSession
from bunsen.stu3.bundles import load_from_directory, extract_entry
from demo_utils import age

# Enable Hive support for our session so we can save resources as Hive tables
spark = SparkSession.builder \
                    .config('hive.exec.dynamic.partition.mode', 'nonstrict') \
                    .enableHiveSupport() \
                    .getOrCreate()

# Load and cache the bundles so we don't reload them every time.
bundles = load_from_directory(spark, 'gs://cluster-data/demo/data/synthea/fhir/').cache()

# Extract patients from bundles
patients = extract_entry(spark, bundles, 'patient')

pats = patients.select('id','gender', 'birthDate', 'address.city', 'address.state', 'address.country') 

# Convert a small sample to pandas and derive each patient's age from birthDate
patsDF = pats.limit(10).toPandas()
patsDF['age'] = patsDF['birthDate'].apply(age)
display(patsDF)
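
The `age` helper is imported from `demo_utils`, which is not shown in this notebook. A minimal sketch of what such a helper might look like follows, assuming `birthDate` is an ISO-8601 date string (`YYYY-MM-DD`). The name `age_in_years` and the optional `today` parameter are illustrative and are not the actual `demo_utils` API.

```python
from datetime import date, datetime


def age_in_years(birth_date, today=None):
    """Hypothetical stand-in for demo_utils.age: whole years since birth_date."""
    today = today or date.today()
    # Keep only the date part in case a time component is present.
    born = datetime.strptime(birth_date[:10], '%Y-%m-%d').date()
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)
```

Under these assumptions it could be applied the same way as `age` above, for example `patsDF['birthDate'].apply(age_in_years)`.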

<ul>
    <li>Extract Patient Encounters into Spark Dataframes</li>
    <li>Query and visualize Encounter records using Spark SQL </li>
    <li>Compute Length of Stay from Encounter start and end dates. </li>
    <li>We will use Length of Stay and other features from Patient, Observation and other records to train our linear regression model.</li>
    <li>Our linear regression model will predict label: "Length of Stay"</li>
</ul>

In [None]:
from pyspark.sql.functions import col
from demo_utils import los

# Extract encounters from bundles
encounters = extract_entry(spark, bundles, 'encounter') 

encs = encounters.select('subject.reference',
                         'class.code',
                         'period.start',
                         'period.end') \
                 .where(col('class.code').isin('inpatient', 'emergency'))


encsDF = encs.limit(10).toPandas()
encsDF['los'] = encsDF.apply(los, axis=1)
display(encsDF)
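
The `los` helper also comes from `demo_utils` and is not shown here. The sketch below illustrates how length of stay could be derived from an encounter's start and end timestamps. It assumes the selected columns are named `start` and `end`, hold ISO-8601 timestamps, and that the result is wanted in days; the function name and the unit are assumptions, and `datetime.fromisoformat` requires Python 3.7 or later.

```python
from datetime import datetime


def _parse_timestamp(ts):
    # fromisoformat does not accept a trailing 'Z' before Python 3.11.
    return datetime.fromisoformat(ts.replace('Z', '+00:00'))


def length_of_stay_days(row):
    """Hypothetical stand-in for demo_utils.los: stay length in fractional days."""
    delta = _parse_timestamp(row['end']) - _parse_timestamp(row['start'])
    return delta.total_seconds() / 86400.0
```

Because it takes a row-like object, it fits the same call pattern as above: `encsDF.apply(length_of_stay_days, axis=1)`.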

<h2> 2. Label generation - Generate Labels in TFRecord format </h2>
<ul>
    <li>The next few cells generate labels from bundles in TFRecord format</li>
    <li>Bundles in TFRecord format have already been generated from Synthetic FHIR data</li>
    <li>Bundles will be used as inputs and are stored in Google Cloud Storage</li>
    <li>Output labels will also be stored in Google Cloud Storage </li>
</ul>

In [None]:
input_bundles = 'gs://cluster-data/demo/data/bundles/bundles*'
labels_path = 'gs://cluster-data/demo/data/output/labels'
labels = 'gs://cluster-data/demo/data/labels/train-00000-of-00001.tfrecords'
#labels = 'gs://cluster-data/demo/data/output/labels*'
seqex_path = 'gs://cluster-data/demo/data/output/seqex'
seqex_for_training = 'gs://cluster-data/demo/data/seqex/train*'
#seqex_for_training = 'gs://cluster-data/demo/data/output/seqex*'
seqex_for_eval = 'gs://cluster-data/demo/data/seqex/validation*'

Let's examine the GCS bucket that holds the bundles in TFRecord format.

In [None]:
%bash
gsutil ls -l gs://cluster-data/demo/data/bundles/bundles*

Delete labels generated from previous runs

In [None]:
%bash
gsutil rm gs://cluster-data/demo/data/output/labels*

In [None]:
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions
import apache_beam as beam

import tensorflow as tf
from tensorflow.core.example import example_pb2

from proto.stu3 import google_extensions_pb2
from proto.stu3 import resources_pb2
from proto.stu3 import version_config_pb2

from google.protobuf import text_format
from py.google.fhir.labels import label
from py.google.fhir.labels import bundle_to_label
from py.google.fhir.seqex import bundle_to_seqex

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'dp-workspace'
google_cloud_options.job_name = 'bundlesTolabels'
google_cloud_options.staging_location = 'gs://healthedatalab/staging'
google_cloud_options.temp_location = 'gs://healthedatalab/temp'
options.view_as(StandardOptions).runner = 'DirectRunner'

p = beam.Pipeline(options=options)

In [None]:
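# Use distinct names for the PCollections so they do not overwrite the Spark
# `bundles` DataFrame or the `labels` path string defined earlier
# (`labels` is read as a file pattern in Step 3).
bundle_pcoll = p | 'read' >> beam.io.ReadFromTFRecord(
    input_bundles, coder=beam.coders.ProtoCoder(resources_pb2.Bundle))

label_pcoll = bundle_pcoll | 'BundleToLabel' >> beam.ParDo(
    bundle_to_label.LengthOfStayRangeLabelAt24HoursFn(for_synthea=True))

_ = label_pcoll | beam.io.WriteToTFRecord(
    labels_path,
    coder=beam.coders.ProtoCoder(google_extensions_pb2.EventLabel),
    file_name_suffix='.tfrecords')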

p.run().wait_until_finish()
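
As its name suggests, `LengthOfStayRangeLabelAt24HoursFn` produces a range (class) label rather than a raw duration. The bucket boundaries it uses are not shown in this notebook, so the sketch below only illustrates the general idea of mapping a length of stay to a range. The boundaries (3, 7 and 14 days), the class names, and the function name are all hypothetical.

```python
def los_range_class(los_days, boundaries=(3, 7, 14)):
    """Map a length of stay in days to a coarse range label (illustrative only)."""
    lower = 0
    for upper in boundaries:
        # A stay exactly on a boundary falls into the lower bucket.
        if los_days <= upper:
            return '{}-{}'.format(lower, upper)
        lower = upper
    return 'above_{}'.format(boundaries[-1])
```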

Let's examine the output location in GCS where the labels have been created.

In [None]:
%bash
gsutil ls -l gs://cluster-data/demo/data/output

<h2> 3. Generate TFSequenceExamples</h2>
<ul>
    <li>The next few cells generate TensorFlow sequence examples</li>
    <li>Bundles in TFRecord format have already been generated from Synthetic FHIR data</li>
    <li>Bundles will be used as inputs and are stored in Google Cloud Storage</li>
    <li>Output sequence examples will also be stored in Google Cloud Storage </li>
</ul>

In [None]:
%bash
gsutil ls -l gs://cluster-data/demo/data/labels/train*

In [None]:
def _get_version_config(version_config_path):
  with open(version_config_path) as f:
    return text_format.Parse(f.read(), version_config_pb2.VersionConfig())

p1 = beam.Pipeline(options=options)

version_config = _get_version_config("/usr/local/fhir/proto/stu3/version_config.textproto")

keyed_bundles = ( 
    p1 
    | 'readBundles' >> beam.io.ReadFromTFRecord(
        input_bundles, coder=beam.coders.ProtoCoder(resources_pb2.Bundle))
    | 'KeyBundlesByPatientId' >> beam.ParDo(
        bundle_to_seqex.KeyBundleByPatientIdFn()))

event_labels = ( 
    p1 | 'readEventLabels' >> beam.io.ReadFromTFRecord(
        labels,
        coder=beam.coders.ProtoCoder(google_extensions_pb2.EventLabel)))

keyed_event_labels = bundle_to_seqex.CreateTriggerLabelsPairLists(
    event_labels)

bundles_and_labels = bundle_to_seqex.CreateBundleAndLabels(
    keyed_bundles, keyed_event_labels)

_ = ( 
    bundles_and_labels
    | 'Reshuffle1' >> beam.Reshuffle()
    | 'GenerateSeqex' >> beam.ParDo(
        bundle_to_seqex.BundleAndLabelsToSeqexDoFn(
            version_config=version_config,
            enable_attribution=False,
            generate_sequence_label=False))
    | 'Reshuffle2' >> beam.Reshuffle()
    | 'WriteSeqex' >> beam.io.WriteToTFRecord(
        seqex_path,
        coder=beam.coders.ProtoCoder(example_pb2.SequenceExample),
        file_name_suffix='.tfrecords',
        num_shards=2))

In [None]:
p1.run().wait_until_finish()
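
With Beam's default shard name template (`-SSSSS-of-NNNNN`), `WriteToTFRecord` called with `num_shards=2` and `file_name_suffix='.tfrecords'` writes files named like `seqex-00000-of-00002.tfrecords`. The helper below, whose name is illustrative, lists the output paths to expect for a given prefix. It only formats strings and does not access GCS.

```python
def expected_shard_paths(prefix, num_shards, suffix='.tfrecords'):
    """List the file names Beam's default shard template would produce."""
    return [
        '{}-{:05d}-of-{:05d}{}'.format(prefix, shard, num_shards, suffix)
        for shard in range(num_shards)
    ]


# Paths to look for after the pipeline above finishes.
for path in expected_shard_paths('gs://cluster-data/demo/data/output/seqex', 2):
    print(path)
```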

<h2> 4. Train and Evaluate ML Model</h2>
<ul>
    <li>The next few cells demonstrate how to train an ML model using the training data set created in Step 3</li>
    <li>Training requires sequence examples in TFRecord format</li>
    <li>Trained ML model will be stored in Google Cloud Storage </li>
    <li>Model will be evaluated and the evaluation output will be printed</li>
</ul>

In [None]:
model_path = 'gs://healthedatalab/synthea/model/'
train_file = 'gs://healthedatalab/synthea/seqex/seqex-00000-of-00002.tfrecords'
validation_file = 'gs://healthedatalab/synthea/seqex/seqex-00001-of-00002.tfrecords'

<h2> 5. Deploy ML Model to Cloud ML</h2>
<ul>
    <li>The trained ML model will be deployed to CloudML for serving </li>
</ul>