## Overview
This notebook loads a collection of synthetic FHIR bundles and value sets and shows some simple queries. Running it first sets up the environment for the other notebooks in the tutorial.

## Setup Tasks
Some setup before the real show begins...

In [None]:
from pyspark.sql import SparkSession
GCS_BUCKET = 'gs://cluster12-bkt'
FHIR_BUNDLES = GCS_BUCKET+'/synthea/fhir'

# Enable Hive support for our session so we can save resources as Hive tables
spark = SparkSession.builder \
                    .config('hive.exec.dynamic.partition.mode', 'nonstrict') \
                    .enableHiveSupport() \
                    .getOrCreate()

## Import Synthetic Data
This tutorial uses data generated by Synthea. It is simply a directory of STU3 bundles included in the tutorial; you can see it in the bundles directory.

Let's load the bundles and examine a couple of data types in them.

In [None]:
from bunsen.stu3.bundles import load_from_directory, extract_entry, write_to_database

# Load and cache the bundles so we don't reload them every time.
bundles = load_from_directory(spark, FHIR_BUNDLES).cache()

# Get the observations and encounters
observations = extract_entry(spark, bundles, 'observation')
encounters = extract_entry(spark, bundles, 'encounter')

## Bunsen documentation
To get help using functions like *load_from_directory* or *extract_entry*, you can see the documentation at https://engineering.cerner.com/bunsen or via Python's help system, like this:

In [None]:
help(extract_entry)

## Generated from FHIR Resource Definitions
The Apache Spark datasets used here are fully generated from the FHIR resource definitions, with every field mapped one-to-one. For instance, here is the full Spark schema of the Observation resource:

In [None]:
observations.printSchema()

## Load some data
The next step will load some data and inspect it. Since Spark delays execution until output is needed, all of the work defined so far will be done here. This can take several seconds or longer depending on the machine, but users can check its status by looking at the [Spark application page](http://localhost:4040).
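
As a rough analogy for this delayed execution, the sketch below uses plain Python generators (not Spark itself) to build a pipeline that does no work until a result is requested. The names `load_records` and `log` are hypothetical and exist only for illustration.

```python
log = []  # records when work actually happens

def load_records():
    """Pretend loader: yields records one at a time and notes each read."""
    for i in range(5):
        log.append(f"read {i}")
        yield {"id": i, "year": 2011 + i}

# Building the pipeline performs no reads, much like defining Spark transformations.
pipeline = (r for r in load_records() if r["year"] >= 2013)
work_before = len(log)

# Asking for output (similar in spirit to a Spark action) triggers the reads.
result = list(pipeline)
work_after = len(log)

print(work_before, work_after, [r["id"] for r in result])
```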

For now, let's just turn our encounter resources into a simple table of all encounters since 2013:

In [None]:
from pyspark.sql.functions import col

encounters.select('subject.reference', 
                  'class.code', 
                  'period.start', 
                  'period.end') \
          .where(col('start') > '2013') \
          .limit(10) \
          .toPandas()
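
The filter above compares `start` against the string `'2013'`. Here is a minimal sketch of why that can work, assuming the `period.start` values are stored as ISO 8601 strings in a consistent format (an assumption about the encoding, not something this notebook verifies). Such strings sort lexicographically in chronological order, so any timestamp from 2013 onward compares greater than `'2013'`. The sample values are made up.

```python
# Made-up ISO 8601 timestamps, standing in for period.start values.
starts = [
    "2012-12-31T23:59:59-05:00",
    "2013-01-01T00:00:00-05:00",
    "2016-07-04T09:30:00-05:00",
]

# Plain string comparison: a longer string with the prefix '2013' sorts after '2013'.
since_2013 = [s for s in starts if s > "2013"]
print(since_2013)
```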

## Exploding nested lists
FHIR's nested structures group related data, making many workloads simpler. We can reference such nested structures directly, and "explode" nested lists when needed to analyze them. Let's build a table of all observation codes in our data:

In [None]:
from pyspark.sql.functions import explode

codes = observations.select('subject',
                            explode('code.coding').alias('coding')) \
                    .select('subject.reference', 
                            'coding.system', 
                            'coding.code',
                            'coding.display')
                    
codes.limit(10).toPandas()
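
To make the effect of `explode` concrete, here is a plain-Python sketch (not Spark) of the same reshaping: each element of a nested `coding` list becomes its own row, paired with the parent's subject reference. The sample observation, its identifiers, and the `http://example.org/local` system are made up for illustration.

```python
# One made-up observation with two codings, shaped loosely like the FHIR resource.
observations_sample = [
    {
        "subject": {"reference": "urn:uuid:patient-1"},
        "code": {
            "coding": [
                {"system": "http://loinc.org", "code": "2085-9", "display": "HDL"},
                {"system": "http://example.org/local", "code": "X1", "display": "HDL (local)"},
            ]
        },
    }
]

# "Explode": emit one flat row per element of the nested coding list.
rows = [
    (obs["subject"]["reference"], c["system"], c["code"], c["display"])
    for obs in observations_sample
    for c in obs["code"]["coding"]
]

for row in rows:
    print(row)
```

One observation with two codings yields two rows, which is why the exploded table can have more rows than there are observations.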

## Analyzing data
Our datasets become much easier to analyze once they've been projected onto a simpler model that suits the problem at hand. The code below simply shows the most frequent observation codes in our synthetic data.

In [None]:
codes.groupBy('system', 'code', 'display') \
     .count() \
     .orderBy('count', ascending=False) \
     .limit(10) \
     .toPandas()
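
The same group, count, and sort pattern can be sketched in plain Python with `collections.Counter`, which may help show what the Spark query computes. The rows below are made up and the display names are abbreviated for illustration.

```python
from collections import Counter

# Made-up (system, code, display) rows, standing in for the `codes` dataset.
code_rows = [
    ("http://loinc.org", "2085-9", "HDL"),
    ("http://loinc.org", "18262-6", "LDL"),
    ("http://loinc.org", "2085-9", "HDL"),
]

# Counter groups identical tuples and counts them; most_common sorts by count, descending.
top_codes = Counter(code_rows).most_common(10)
print(top_codes)
```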

## Writing resources to a database
Directly loading JSON or XML FHIR bundles is useful for ingesting and early exploration of data, but a more efficient format works better for repeated use. Since Bunsen encodes resources natively in Apache Spark dataframes, we can take advantage of Spark's ability to write them to a Hive database. Bunsen offers the *write_to_database* function as a convenient way to write resources from bundles to a database, with a table for each resource.

Note that each table preserves the original, nested structure definition of the FHIR resource, and is field-for-field equivalent. 

The cell below will save our test data to tables in the "tutorial_small" database. When running it, you can see progress in the Spark UI at http://localhost:4040.


In [None]:
resources = ['allergyintolerance',
             'careplan',
             'claim',
             'condition',
             'encounter',
             'immunization',
             'medication',
             'medicationrequest',
             'observation',
             'organization',
             'patient',
             'procedure']

write_to_database(spark, 
                  bundles, 
                  'tutorial_small',
                  resources)

## Reading from a Hive database
Now that we've saved our data to a Hive database, we can easily view and query the tables with Spark SQL:

In [None]:
spark.sql('use tutorial_small')
spark.sql('show tables').toPandas()

In [None]:
spark.sql("""
select subject.reference, 
       count(*) cnt
from encounter
where class.code != 'WELLNESS' and
      period.start > '2013'
group by subject.reference
order by cnt desc
limit 10
""").toPandas()

## Loading Valuesets
Bunsen has built-in support for working with FHIR valuesets. As a convenience, the APIs in the bunsen.stu3.codes package offer ways to save valuesets to Hive tables that are more easily used.

In [None]:
from bunsen.stu3.codes import create_value_sets

# Load the valuesets from bundles
valueset_bundles = load_from_directory(spark, 'gs://bunsen/data/valuesets')
valueset_data = extract_entry(spark, valueset_bundles, 'valueset')

# Import the value sets and save them to an ontologies database for easy future use
spark.sql('create database tutorial_ontologies')

create_value_sets(spark).with_value_sets(valueset_data) \
                        .write_to_database('tutorial_ontologies')

This creates a valuesets table that uses the FHIR ValueSet schema and that we can easily explore:

In [None]:
spark.table('tutorial_ontologies.valuesets').select('url', 'version', 'description').toPandas()

This also creates a "values" table that makes it easier to look at the values in our valuesets. It is especially helpful for valuesets that may contain many thousands of values:

In [None]:
spark.table('tutorial_ontologies.values').toPandas()

## Using Valuesets in Queries
Finally, we illustrate how we can easily use FHIR valuesets within Spark SQL. Bunsen provides an *in_valueset* user-defined function that can be invoked directly from SQL, so users can easily work with valuesets without needing complex joins to separate ontology tables.

First, we will push some interesting valuesets to the cluster with the *push_valuesets* function seen below. This uses Apache Spark's broadcast variables to make the reference data available on each node, so it can be easily used. Details are in that function's documentation, but typically users work with valuesets in one of three ways:

* From a FHIR ValueSet resource, as illustrated here
* As a collection of values in a Python structure
* As an is-a relationship in some ontology, like LOINC or SNOMED.

Further details are available in the function's documentation, which can be viewed via help(push_valuesets).

Let's take a look at an example:

In [None]:
from bunsen.stu3.valuesets import push_valuesets, valueset

# Push multiple valuesets for this example, even though we use only one.
push_valuesets(spark, 
               {'ldl'               : [('http://loinc.org', '18262-6')],                
                'hdl'               : [('http://loinc.org', '2085-9')],
                'cholesterol'       : valueset('http://hl7.org/fhir/ValueSet/example-extensional', '20150622')},
               database='tutorial_ontologies'); 
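
Conceptually, the reference data pushed above maps a valueset name to a collection of (system, code) pairs, and a membership test checks whether any coding in a concept matches one of those pairs. The sketch below is a plain-Python illustration of that idea, not Bunsen's implementation; `VALUESETS` and `in_valueset_sketch` are hypothetical names.

```python
# Hypothetical reference data: valueset name -> set of (system, code) pairs.
VALUESETS = {
    "ldl": {("http://loinc.org", "18262-6")},
    "hdl": {("http://loinc.org", "2085-9")},
}

def in_valueset_sketch(concept, name):
    """Return True if any coding in the concept belongs to the named valueset."""
    members = VALUESETS.get(name, set())
    return any((c["system"], c["code"]) in members for c in concept.get("coding", []))

# A made-up concept carrying a single LOINC coding.
hdl_concept = {"coding": [{"system": "http://loinc.org", "code": "2085-9"}]}
is_hdl = in_valueset_sketch(hdl_concept, "hdl")
is_ldl = in_valueset_sketch(hdl_concept, "ldl")
print(is_hdl, is_ldl)
```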

Now that the above valuesets have been broadcast across our processing cluster, we can easily query them with the *in_valueset* user-defined function inline with our SQL:

In [None]:
spark.sql("""
select subject.reference, 
       valueQuantity.value,
       valueQuantity.unit
from tutorial_small.observation
where in_valueset(code, 'cholesterol')
limit 10
""").toPandas()