<h1> HIMSS Demo - HealtheDatalab </h1>

<h2> Structured Machine Learning using TensorFlow </h2>
<hr />
This notebook demonstrates how to train, evaluate, and deploy an ML model to CloudML. It uses a pre-built machine learning model to predict length of stay in emergency department (ED) and inpatient care settings.
<h3>
<br />
<ol>
<li> Access, Analyze & Visualize Data using HealtheDataLab </li> <br />
<li> Preparation of data - Create input data bundles in TFRecord format </li> <br />
<li> Label generation - Generate Labels in TFRecord format </li> <br />
<li> Generate TFSequenceExamples with context = patient + time series data = encounters </li> <br />
<li> Train and Evaluate Machine Learning Model </li> <br />
<li> Deploy ML Model to CloudML </li> <br />
</ol>
</h3>
<hr />


In [17]:
from dummy import foo
foo()

 test code 


In [19]:
%bash
ls -al

total 280
drwxr-xr-x 6 root root  4096 Jan 21 19:49 .
drwxr-xr-x 6 root root  4096 Jan 15 09:54 ..
-rw-r--r-- 1 root root 21308 Jan 20 03:53 AnalyzeVisits.ipynb
-rw-r--r-- 1 root root 19866 Jan 15 10:26 bundles_to_labels_dr.ipynb
-rw-r--r-- 1 root root  5577 Jan 15 17:17 bundles_to_seqex_dr.ipynb
-rw-r--r-- 1 root root 95166 Jan 15 11:16 bunsen_de_tutorial.ipynb
-rw-r--r-- 1 root root 53071 Jan 20 03:55 bunsen_getting_started.ipynb
-rw-r--r-- 1 root root  7544 Jan 15 09:54 demo.ipynb
drwxr-xr-x 7 root root  4096 Jan 21 19:43 .git
-rw-r--r-- 1 root root 12748 Jan 21 19:48 hdl_demo.ipynb
drwxr-xr-x 2 root root  4096 Jan 20 03:53 .ipynb_checkpoints
-rw-r--r-- 1 root root 28820 Jan 17 02:56 linear-model-notebook.ipynb
drwxr-xr-x 2 root root  4096 Jan 15 09:54 misc
drwxr-xr-x 2 root root  4096 Jan 20 04:16 __pycache__


<h2> 1. Access, Analyze & Visualize Data using HealtheDataLab </h2>

In [None]:
from pyspark.sql import SparkSession
from datetime import datetime, date
from bunsen.stu3.bundles import load_from_directory, extract_entry

# Enable Hive support for our session so we can save resources as Hive tables
spark = SparkSession.builder \
                    .config('hive.exec.dynamic.partition.mode', 'nonstrict') \
                    .enableHiveSupport() \
                    .getOrCreate()



# Load and cache the bundles so we don't reload them every time.
bundles = load_from_directory(spark, 'gs://cluster-data/demo/data/synthea/fhir/').cache()

# Get patients and encounters data from bundles
patients = extract_entry(spark, bundles, 'patient')
encounters = extract_entry(spark, bundles, 'encounter')

def age(birthdate):
    born = datetime.strptime(birthdate, "%Y-%m-%d")
    today = date.today()
    return str(today.year - born.year - ((today.month, today.day) < (born.month, born.day)))

pats = patients.select('id','gender', 'birthDate', 'address.city', 'address.state', 'address.country') 

#pats['birthDate'] = pats['birthDate'].apply(age)
patsDF = pats.limit(10).toPandas()
patsDF['age'] = patsDF['birthDate'].apply(age)
display(patsDF)
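
The `age` helper above relies on a tuple comparison that evaluates to `True` (counted as 1) when this year's birthday has not happened yet. A minimal sketch of that trick, assuming a hypothetical `age_on` variant that takes the reference date as a parameter so the result is repeatable:

```python
from datetime import date, datetime


def age_on(birthdate, today):
    """Return age in whole years on `today` for a 'YYYY-MM-DD' birth date."""
    born = datetime.strptime(birthdate, "%Y-%m-%d").date()
    # True (counts as 1) when this year's birthday is still ahead,
    # so one year is subtracted from the plain year difference.
    birthday_still_ahead = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - birthday_still_ahead


# Hypothetical fixed reference date, used only to keep the example repeatable.
reference = date(2019, 1, 21)
print(age_on("1980-01-21", reference))  # birthday is today -> 39
print(age_on("1980-01-22", reference))  # birthday is tomorrow -> 38
```

Unlike the notebook's `age`, this variant returns an integer rather than a string, which is easier to sort or bucket later.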

In [None]:
from pyspark.sql.functions import col
from pyspark.sql import functions as F
from dateutil.parser import parse

def los(row):
    # Parse the ISO 8601 start and end timestamps of the encounter.
    start = parse(row['start'])
    end = parse(row['end'])
    # Length of stay is the elapsed time, rendered as text for display.
    return str(end - start)
  
encs=encounters.select('subject.reference', 
                  'class.code', 
                  'period.start', 
                  'period.end') \
          .where(col('class.code').isin("inpatient", "emergency"))


encsDF = encs.limit(10).toPandas()
encsDF['los'] = encsDF.apply(los, axis=1)
display(encsDF)
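
The `los` column above holds the elapsed time as text (for example `2 days, 6:30:00`). For modeling, a numeric value is usually easier to work with. A minimal sketch, assuming a hypothetical `los_hours` helper and timestamps with an explicit UTC offset such as `-05:00`; it uses the standard library's `datetime.fromisoformat` (Python 3.7+) in place of `dateutil`:

```python
from datetime import datetime


def los_hours(start, end):
    """Return the length of stay in hours between two ISO 8601 timestamps."""
    # fromisoformat accepts offsets such as '-05:00'; a trailing 'Z'
    # is only accepted from Python 3.11 onward.
    elapsed = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return elapsed.total_seconds() / 3600.0


# Made-up timestamps for illustration, not taken from the Synthea data.
hours = los_hours("2019-01-15T08:00:00-05:00", "2019-01-17T14:30:00-05:00")
print(hours)
```

With pandas, the same helper could be applied row by row, for instance `encsDF.apply(lambda r: los_hours(r['start'], r['end']), axis=1)`, assuming the column values match the format above.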


In [None]:
#CODE(WIP)
PROJECT = 'dp-workspace'
REGION = 'us-west1'
BUCKET = 'cluster-data'

import os
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
os.environ['BUCKET'] = BUCKET

In [None]:
%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

<h2> 2. Preparation of data - Create input data bundles in TFRecord format</h2>
This cell loads the raw synthetic FHIR bundles. Converting them to TFRecord format is still a TODO, so a pre-generated sample TFRecord is used for now.

In [None]:
from pyspark.sql import SparkSession

# Enable Hive support for our session so we can save resources as Hive tables
spark = SparkSession.builder \
                    .config('hive.exec.dynamic.partition.mode', 'nonstrict') \
                    .enableHiveSupport() \
                    .getOrCreate()

from bunsen.stu3.bundles import load_from_directory, extract_entry, write_to_database

# Load and cache the raw data (FHIR bundles) from Google Cloud Storage bucket so we don't reload them every time.
bundles = load_from_directory(spark, 'gs://bunsen/data/bundles').cache()

# Create TFRecords from the raw FHIR bundles (one line to create TFRecords)
# TODO ........
# For now, a sample TFRecord has been generated and stored in Cloud Storage at: gs://cluster-data/demo/data/test_bundle.tfrecord-00000-of-00001
# A text version of test_bundle.tfrecord-00000-of-00001 is in the file bundle_1.pbtxt

In [None]:
%bash
gsutil ls -l gs://${BUCKET}/demo/data/

<h2> 3. Label generation - Generate Labels in TFRecord format</h2>
Input: FHIR bundles
Output: Labels

In [None]:
from absl import app
from absl import flags
import apache_beam as beam
from proto.stu3 import google_extensions_pb2
from proto.stu3 import resources_pb2
from py.google.fhir.labels import encounter
from py.google.fhir.labels import label

@beam.typehints.with_input_types(resources_pb2.Bundle)
@beam.typehints.with_output_types(google_extensions_pb2.EventLabel)
class LengthOfStayRangeLabelAt24HoursFn(beam.DoFn):
  """Converts Bundle into length of stay range at 24 hours label.

    Cohort: inpatient encounter that is longer than 24 hours
    Trigger point: 24 hours after admission
    Label: multi-label for length of stay ranges, see label.py for detail
  """

  def process(self, bundle):
    """Iterate through bundle and yield label.

    Args:
      bundle: input stu3.Bundle proto
    Yields:
      stu3.EventLabel proto.
    """
    patient = encounter.GetPatient(bundle)
    if patient is not None:
      # Cohort: inpatient encounter > 24 hours.
      for enc in encounter.Inpatient24HrEncounters(bundle):
        for one_label in label.LengthOfStayRangeAt24Hours(patient, enc):
          yield one_label
          
          
          
# Pipeline option classes used below (apache_beam is already imported as beam above).
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import StandardOptions


options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
#google_cloud_options.project = 'dp-workspace'
google_cloud_options.project = PROJECT
google_cloud_options.job_name = 'job1'
google_cloud_options.staging_location = 'gs://cluster-data/demo/staging'
google_cloud_options.temp_location = 'gs://cluster-data/demo/temp'
options.view_as(StandardOptions).runner = 'DirectRunner'

p = beam.Pipeline(options=options)
input_bundle = 'gs://cluster-data/demo/data/test_bundle.tfrecord-00000-of-00001'
output_file_prefix = 'gs://cluster-data/demo/data/output/label'

bundles = p | 'read' >> beam.io.ReadFromTFRecord(input_bundle, coder=beam.coders.ProtoCoder(resources_pb2.Bundle))
    
labels = bundles | 'BundleToLabel' >> beam.ParDo(
    LengthOfStayRangeLabelAt24HoursFn())
_ = labels | beam.io.WriteToTFRecord(output_file_prefix,
    coder=beam.coders.ProtoCoder(google_extensions_pb2.EventLabel))


p.run().wait_until_finish()
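
The cohort and label logic described in the `LengthOfStayRangeLabelAt24HoursFn` docstring can be illustrated without Beam or the FHIR protos. The following is only a sketch: the range boundaries and names are hypothetical (the real ones are defined in `label.py`), and `los_range_label` is not part of the library.

```python
from datetime import datetime, timedelta

# Hypothetical upper bounds in days and range names; the actual ranges
# are defined in label.py and may differ.
HYPOTHETICAL_RANGES = [
    (3, "up_to_3_days"),
    (7, "3_to_7_days"),
    (14, "7_to_14_days"),
]
OPEN_ENDED_RANGE = "above_14_days"


def los_range_label(admit, discharge):
    """Return a range name, or None when the stay is outside the cohort."""
    stay = discharge - admit
    # Cohort: only encounters longer than 24 hours receive a label.
    if stay <= timedelta(hours=24):
        return None
    days = stay.total_seconds() / 86400.0
    for upper_bound, name in HYPOTHETICAL_RANGES:
        if days <= upper_bound:
            return name
    return OPEN_ENDED_RANGE


admit = datetime(2019, 1, 1, 8, 0)
for stay_length in (timedelta(hours=20), timedelta(days=2), timedelta(days=30)):
    print(stay_length, los_range_label(admit, admit + stay_length))
```

The pipeline above does the equivalent work per bundle: it selects the inpatient encounters longer than 24 hours and emits the labels for each one as `EventLabel` protos.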

In [None]:
%bash
# The cell above generates a label TFRecord and stores it in a Cloud Storage bucket
gsutil ls -l gs://cluster-data/demo/data/output

<h2> 4. Generate TFSequenceExamples with context = patient + time series data = encounters</h2>
Input: FHIR bundles
Output: Features

In [None]:
#CODE(WIP)
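
Until this step is implemented, the intended layout can be sketched with plain dictionaries. A `tf.train.SequenceExample` has a `context` part for static values and a `feature_lists` part for time series; the sketch below mirrors that split, with the patient as context and the encounters as the time series. The function name and all feature names are hypothetical, and no TensorFlow or FHIR library call is used.

```python
def to_sequence_layout(patient, encounters):
    """Arrange one patient's records as context plus time-ordered feature lists."""
    # Sort encounters by start time so every feature list shares one order.
    # ISO 8601 strings with the same UTC offset sort chronologically.
    ordered = sorted(encounters, key=lambda enc: enc["start"])
    return {
        # Static, per-patient values: one entry each.
        "context": {
            "patient_id": patient["id"],
            "gender": patient["gender"],
            "birth_date": patient["birthDate"],
        },
        # Time series: one list per feature, one element per encounter.
        "feature_lists": {
            "encounter_class": [enc["class"] for enc in ordered],
            "encounter_start": [enc["start"] for enc in ordered],
            "encounter_end": [enc["end"] for enc in ordered],
        },
    }


# Made-up records for illustration only.
patient = {"id": "p1", "gender": "female", "birthDate": "1980-01-21"}
encounters = [
    {"class": "inpatient", "start": "2019-01-10T09:00:00-05:00", "end": "2019-01-13T11:00:00-05:00"},
    {"class": "emergency", "start": "2018-06-02T22:15:00-05:00", "end": "2018-06-03T01:40:00-05:00"},
]
layout = to_sequence_layout(patient, encounters)
print(layout["feature_lists"]["encounter_class"])
```

Keeping the patient fields out of the per-encounter lists avoids repeating static values at every time step, which is the main reason for the context and time-series split.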

<h2> 5. Train and Evaluate Machine Learning Model </h2>
Input: Training and Evaluation Dataset
Output: Model

In [None]:
#CODE(WIP)

<h2> 6. Deploy ML Model to ??? </h2>
Input: New Data set
Output: Length Of Stay Prediction

In [None]:
#CODE(WIP)