## Overview
This notebook loads a collection of synthetic FHIR bundles and value sets and shows some simple queries. Running it first sets up the environment for the other notebooks in the tutorial.

## Setup Tasks
Some setup before the real show begins...

In [1]:
from pyspark.sql import SparkSession

# Enable Hive support for our session so we can save resources as Hive tables
spark = SparkSession.builder \
                    .config('hive.exec.dynamic.partition.mode', 'nonstrict') \
                    .enableHiveSupport() \
                    .getOrCreate()

## Import Synthetic Data
This tutorial uses data generated by Synthea. It is simply a directory of STU3 bundles included in the tutorial; you can see it in the bundles directory.

Let's load the bundles and examine a couple of data types in them.

In [2]:
from bunsen.stu3.bundles import load_from_directory, extract_entry, write_to_database

# Load and cache the bundles so we don't reload them every time.
bundles = load_from_directory(spark, 'gs://bunsen/data/bundles').cache()

# Get the observations and encounters
observations = extract_entry(spark, bundles, 'observation')
encounters = extract_entry(spark, bundles, 'encounter')
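
To make the extraction step more concrete, here is a plain-Python sketch of the idea behind *extract_entry*: pick out every entry of one resource type from a bundle. This is only an illustration and not Bunsen's implementation. The `entries_of_type` helper, its case-insensitive matching, and the tiny hand-written bundle are all hypothetical; the `entry`, `resource`, and `resourceType` field names follow the FHIR Bundle JSON layout.

```python
def entries_of_type(bundle, resource_name):
    """Return the resources in a FHIR bundle dict whose type matches resource_name.

    Matching is case-insensitive here (an assumption of this sketch), so that
    'observation' selects resources whose resourceType is 'Observation'.
    """
    wanted = resource_name.lower()
    return [entry['resource']
            for entry in bundle.get('entry', [])
            if entry.get('resource', {}).get('resourceType', '').lower() == wanted]


# Hypothetical, heavily trimmed bundle used only for illustration.
sample_bundle = {
    'resourceType': 'Bundle',
    'type': 'collection',
    'entry': [
        {'resource': {'resourceType': 'Patient', 'id': 'p1'}},
        {'resource': {'resourceType': 'Encounter', 'id': 'e1'}},
        {'resource': {'resourceType': 'Observation', 'id': 'o1'}},
        {'resource': {'resourceType': 'Observation', 'id': 'o2'}},
    ],
}

observation_ids = [r['id'] for r in entries_of_type(sample_bundle, 'observation')]
print(observation_ids)
```

Bunsen does the equivalent work across the whole cluster and returns a Spark DataFrame rather than a Python list.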

## Bunsen documentation
To get help using functions like *load_from_directory* or *extract_entry*, you can see the documentation at https://engineering.cerner.com/bunsen or via Python's help system, like this:

In [3]:
help(extract_entry)

Help on function extract_entry in module bunsen.stu3.bundles:

extract_entry(sparkSession, javaRDD, resourceName)
    Returns a dataset for the given entry type from the bundles.
    
    :param sparkSession: the SparkSession instance
    :param javaRDD: the RDD produced by :func:`load_from_directory` or other methods
        in this package
    :param resourceName: the name of the FHIR resource to extract
        (condition, observation, etc)
    :return: a DataFrame containing the given resource encoded into Spark columns



In [4]:
observations.printSchema()

root
 |-- id: string (nullable = true)
 |-- meta: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- versionId: string (nullable = true)
 |    |-- lastUpdated: timestamp (nullable = true)
 |    |-- profile: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- security: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- system: string (nullable = true)
 |    |    |    |-- version: string (nullable = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- display: string (nullable = true)
 |    |    |    |-- userSelected: boolean (nullable = true)
 |    |-- tag: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- system: string (nullable = true)
 |    |    |    |-- version: string (nullable = true)
 |    |    |    |-- code: s

## Load some data
The next step will load some data and inspect it. Since Spark delays execution until output is needed, all of the work will be done here. This can take several seconds or longer depending on the machine, but users can check its status by looking at the [Spark application page](http://localhost:4040).

For now, let's just turn our encounter resources into a simple table of all encounters since 2013:

In [4]:
from pyspark.sql.functions import col

encounters.select('subject.reference', 
                  'class.code', 
                  'period.start', 
                  'period.end') \
          .where(col('start') > '2013') \
          .limit(10) \
          .toPandas()

Unnamed: 0,reference,code,start,end
0,urn:uuid:214ff775-6924-4ecf-aedd-9847146fe66b,WELLNESS,2013-01-31T00:59:02-06:00,2013-01-31T00:59:02-06:00
1,urn:uuid:214ff775-6924-4ecf-aedd-9847146fe66b,ambulatory,2015-09-15T01:59:02-05:00,2015-09-15T01:59:02-05:00
2,urn:uuid:214ff775-6924-4ecf-aedd-9847146fe66b,WELLNESS,2016-02-04T00:59:02-06:00,2016-02-04T00:59:02-06:00
3,urn:uuid:214ff775-6924-4ecf-aedd-9847146fe66b,ambulatory,2016-04-19T01:59:02-05:00,2016-04-19T01:59:02-05:00
4,urn:uuid:214ff775-6924-4ecf-aedd-9847146fe66b,emergency,2018-01-04T00:59:02-06:00,2018-01-04T00:59:02-06:00
5,urn:uuid:e206880c-7762-4aee-a3e2-5a8c89512c18,ambulatory,2013-01-04T21:59:31-06:00,2013-01-04T21:59:31-06:00
6,urn:uuid:e206880c-7762-4aee-a3e2-5a8c89512c18,ambulatory,2013-01-11T21:59:31-06:00,2013-01-11T22:14:31-06:00
7,urn:uuid:e206880c-7762-4aee-a3e2-5a8c89512c18,ambulatory,2013-03-22T22:59:31-05:00,2013-03-22T23:14:31-05:00
8,urn:uuid:e206880c-7762-4aee-a3e2-5a8c89512c18,WELLNESS,2013-09-27T22:59:31-05:00,2013-09-27T22:59:31-05:00
9,urn:uuid:e206880c-7762-4aee-a3e2-5a8c89512c18,WELLNESS,2014-10-03T22:59:31-05:00,2014-10-03T22:59:31-05:00
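
The filter `col('start') > '2013'` deserves a short note. Assuming `period.start` is stored as an ISO 8601 string (as the output above suggests), the comparison is a plain string comparison. The sketch below shows why that keeps everything from 2013 onward; the first timestamp in the list is a made-up value added for contrast.

```python
# Timestamps as ISO 8601 strings. The 2012 value is hypothetical; the other
# two are taken from the table above.
starts = [
    '2012-11-30T21:59:31-06:00',
    '2013-01-04T21:59:31-06:00',
    '2016-02-04T00:59:02-06:00',
]

# Strings compare character by character, and a longer string that begins
# with '2013' sorts after '2013' itself, so every 2013 timestamp passes.
since_2013 = [s for s in starts if s > '2013']
print(since_2013)
```

String order only matches chronological order when the timestamps share a UTC offset, so values close to a year boundary with differing offsets could land on the wrong side of the filter.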


## Exploding nested lists
FHIR's nested structures group related data, making many workloads simpler. We can reference such nested structures directly, and "explode" nested lists when needed to analyze them. Let's build a table of all observation codes in our data:

In [5]:
from pyspark.sql.functions import explode

codes = observations.select('subject',
                            explode('code.coding').alias('coding')) \
                    .select('subject.reference', 
                            'coding.system', 
                            'coding.code',
                            'coding.display')
                    
codes.limit(10).toPandas()

Unnamed: 0,reference,system,code,display
0,urn:uuid:214ff775-6924-4ecf-aedd-9847146fe66b,http://loinc.org,8331-1,Oral temperature
1,urn:uuid:214ff775-6924-4ecf-aedd-9847146fe66b,http://loinc.org,8302-2,Body Height
2,urn:uuid:214ff775-6924-4ecf-aedd-9847146fe66b,http://loinc.org,29463-7,Body Weight
3,urn:uuid:214ff775-6924-4ecf-aedd-9847146fe66b,http://loinc.org,39156-5,Body Mass Index
4,urn:uuid:214ff775-6924-4ecf-aedd-9847146fe66b,http://loinc.org,55284-4,Blood Pressure
5,urn:uuid:214ff775-6924-4ecf-aedd-9847146fe66b,http://loinc.org,8302-2,Body Height
6,urn:uuid:214ff775-6924-4ecf-aedd-9847146fe66b,http://loinc.org,29463-7,Body Weight
7,urn:uuid:214ff775-6924-4ecf-aedd-9847146fe66b,http://loinc.org,39156-5,Body Mass Index
8,urn:uuid:214ff775-6924-4ecf-aedd-9847146fe66b,http://loinc.org,55284-4,Blood Pressure
9,urn:uuid:214ff775-6924-4ecf-aedd-9847146fe66b,http://loinc.org,4548-4,Hemoglobin A1c/Hemoglobin.total in Blood
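
For readers new to `explode`, the following plain-Python sketch mimics what it does to a nested list: each element of `code.coding` becomes its own output row, paired with the fields of the parent row. The sample rows, the `Patient/example-*` references, and the second coding with its `http://example.org/local-codes` system are hypothetical; only the LOINC codes come from the output above.

```python
# Hypothetical observation rows shaped like the nested FHIR structure.
observation_rows = [
    {'subject': {'reference': 'Patient/example-1'},
     'code': {'coding': [
         {'system': 'http://loinc.org', 'code': '8331-1', 'display': 'Oral temperature'},
         # A second, made-up coding to show one row producing two outputs.
         {'system': 'http://example.org/local-codes', 'code': 'TEMP', 'display': 'Temperature'},
     ]}},
    {'subject': {'reference': 'Patient/example-2'},
     'code': {'coding': [
         {'system': 'http://loinc.org', 'code': '8302-2', 'display': 'Body Height'},
     ]}},
]


def explode_codings(rows):
    """Yield one flat dict per coding, similar in spirit to explode('code.coding')."""
    for row in rows:
        for coding in row['code']['coding']:
            yield {'reference': row['subject']['reference'],
                   'system': coding['system'],
                   'code': coding['code'],
                   'display': coding['display']}


flat_codes = list(explode_codings(observation_rows))
for flat in flat_codes:
    print(flat)
```

As with Spark's `explode`, a row whose list is empty contributes no output rows in this sketch.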


## Analyzing data
Our datasets become much easier to analyze once they've been projected onto a simpler model that suits the problem at hand. The code below simply shows the most frequent observation codes in our synthetic data.

In [7]:
codes.groupBy('system', 'code', 'display') \
     .count() \
     .orderBy('count', ascending=False) \
     .limit(10) \
     .toPandas()

Unnamed: 0,system,code,display,count
0,http://loinc.org,4548-4,Hemoglobin A1c/Hemoglobin.total in Blood,1753
1,http://loinc.org,8302-2,Body Height,1377
2,http://loinc.org,55284-4,Blood Pressure,1377
3,http://loinc.org,29463-7,Body Weight,1377
4,http://loinc.org,39156-5,Body Mass Index,1350
5,http://loinc.org,6299-2,Urea Nitrogen,871
6,http://loinc.org,2339-0,Glucose,871
7,http://loinc.org,6298-4,Potassium,871
8,http://loinc.org,2947-0,Sodium,871
9,http://loinc.org,2069-3,Chloride,871
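
The grouping above amounts to counting identical (system, code, display) tuples and sorting by frequency. A small standard-library sketch of the same idea, using a handful of made-up rows rather than the tutorial data:

```python
from collections import Counter

# Hypothetical flattened codings; the repetition counts are invented.
codings = (
    [('http://loinc.org', '8302-2', 'Body Height')] * 3
    + [('http://loinc.org', '29463-7', 'Body Weight')] * 2
    + [('http://loinc.org', '8331-1', 'Oral temperature')]
)

# Counter plays the role of groupBy(...).count(); most_common(n) plays the
# role of orderBy('count', ascending=False).limit(n).
top_codes = Counter(codings).most_common(2)
for (system, code, display), count in top_codes:
    print(system, code, display, count)
```

Spark performs the same aggregation in a distributed fashion, which matters once the data no longer fits on one machine.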


## Writing resources to a database
Directly loading JSON or XML FHIR bundles is useful for ingesting and early exploration of data, but a more efficient format works better for repeated use. Since Bunsen encodes resources natively in Apache Spark dataframes, we can take advantage of Spark's ability to write them to a Hive database. Bunsen offers the *write_to_database* function as a convenient way to write resources from bundles to a database, with a table for each resource.

Note that each table preserves the original, nested structure definition of the FHIR resource, and is field-for-field equivalent. 

The cell below will save our test data to tables in the "tutorial_small" database. When running it, you can see progress in the Spark UI at http://localhost:4040.


In [8]:
resources = ['allergyintolerance',
             'careplan',
             'claim',
             'condition',
             'encounter',
             'immunization',
             'medication',
             'medicationrequest',
             'observation',
             'organization',
             'patient',
             'procedure']

write_to_database(spark, 
                  bundles, 
                  'tutorial_small',
                  resources)

## Reading from a Hive database
Now that we've saved our data to a Hive database, we can easily view and query the tables with Spark SQL:

In [11]:
spark.sql('use tutorial_small')
spark.sql('show tables').toPandas()

Unnamed: 0,database,tableName,isTemporary
0,tutorial_small,allergyintolerance,False
1,tutorial_small,careplan,False
2,tutorial_small,claim,False
3,tutorial_small,condition,False
4,tutorial_small,encounter,False
5,tutorial_small,immunization,False
6,tutorial_small,medication,False
7,tutorial_small,medicationrequest,False
8,tutorial_small,observation,False
9,tutorial_small,organization,False


In [12]:
spark.sql("""
select subject.reference, 
       count(*) cnt
from encounter
where class.code != 'WELLNESS' and
      period.start > '2013'
group by subject.reference
order by cnt desc
limit 10
""").toPandas()

Unnamed: 0,reference,cnt
0,urn:uuid:e206880c-7762-4aee-a3e2-5a8c89512c18,53
1,urn:uuid:e538491e-cf8e-4a3f-97a5-45811e066f27,44
2,urn:uuid:dcad3c44-64de-43b6-b24c-989f8f27c71d,33
3,urn:uuid:5804a9d3-3518-4862-a1e4-a61b0f1a4be4,31
4,urn:uuid:2bf9eab0-fec0-41b2-9f91-3369e38b98f6,19
5,urn:uuid:90a7ded5-a5ce-43df-b973-7bc7ce7a3011,18
6,urn:uuid:8f538e46-a1d1-4c75-beb7-e3946124e730,16
7,urn:uuid:6f58dbea-7532-4090-97a8-79982bab98f5,12
8,urn:uuid:aa251e83-9a9b-446f-ba2f-87e2da7c4d34,8
9,urn:uuid:73bbd5a3-00b5-4216-bd5d-601359ca9e42,6


## Loading Valuesets
Bunsen has built-in support for working with FHIR valuesets. As a convenience, the APIs in the bunsen.stu3.codes package offer ways to save valuesets to Hive tables that are easier to use.

In [13]:
# Create the ontologies database (if needed) and list its tables before importing the value sets
spark.sql('create database IF NOT EXISTS tutorial_ontologies')
spark.sql('use tutorial_ontologies')
spark.sql('show tables').toPandas()

Unnamed: 0,database,tableName,isTemporary


In [16]:
from bunsen.stu3.codes import create_value_sets
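# Make sure the database exists, then drop tables left over from earlier runs.
# Use the `spark` session created above; `sqlContext` is not defined in this notebook.
spark.sql('create database IF NOT EXISTS tutorial_ontologies')
# Drop the valuesets table if it exists
spark.sql('DROP TABLE IF EXISTS tutorial_ontologies.valuesets')
# Drop the values table if it exists
spark.sql('DROP TABLE IF EXISTS tutorial_ontologies.values')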
# Load the valuesets from bundles
valueset_bundles = load_from_directory(spark, 'gs://bunsen/data/valuesets')
valueset_data = extract_entry(spark, valueset_bundles, 'valueset')



In [20]:
create_value_sets(spark).with_value_sets(valueset_data).write_to_database('tutorial_ontologies')

Py4JJavaError: An error occurred while calling o276.writeToDatabase.
: org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
	at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:322)
	at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:308)
	at com.cerner.bunsen.codes.base.AbstractValueSets.writeToTables(AbstractValueSets.java:523)
	at com.cerner.bunsen.codes.base.AbstractValueSets.writeToDatabase(AbstractValueSets.java:451)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 69.0 failed 4 times, most recent failure: Lost task 1.3 in stage 69.0 (TID 3667, hive-cluster1-w-1.c.grand-magpie-222719.internal, executor 1): org.apache.spark.SparkException: Task failed while writing rows.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Invalid bucket name (g-warehouse) or object name (hadoop/tutorial_ontologies.db/valuesets/_temporary/0/_temporary/attempt_20181126203459_0069_m_000001_3/timestamp=2018-11-26 20%3A34%3A30.458/part-00001-cff111e0-d909-4b9a-9fa6-d23384be381b.c000.snappy.parquet)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.LegacyPathCodec.getPath(LegacyPathCodec.java:96)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getGcsPath(GoogleHadoopFileSystem.java:172)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:762)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1067)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1048)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:937)
	at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:241)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.org$apache$spark$sql$execution$datasources$FileFormatWriter$DynamicPartitionWriteTask$$newOutputWriter(FileFormatWriter.scala:511)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$5.apply(FileFormatWriter.scala:546)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$5.apply(FileFormatWriter.scala:527)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:527)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
	... 8 more
Caused by: java.net.URISyntaxException: Illegal character in path at index 140: gs://g-warehouse/hadoop/tutorial_ontologies.db/valuesets/_temporary/0/_temporary/attempt_20181126203459_0069_m_000001_3/timestamp=2018-11-26 20%3A34%3A30.458/part-00001-cff111e0-d909-4b9a-9fa6-d23384be381b.c000.snappy.parquet
	at java.net.URI$Parser.fail(URI.java:2848)
	at java.net.URI$Parser.checkChars(URI.java:3021)
	at java.net.URI$Parser.parseHierarchical(URI.java:3105)
	at java.net.URI$Parser.parse(URI.java:3053)
	at java.net.URI.<init>(URI.java:588)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.LegacyPathCodec.getPath(LegacyPathCodec.java:91)
	... 28 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
	... 31 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
Caused by: java.lang.IllegalArgumentException: Invalid bucket name (g-warehouse) or object name (hadoop/tutorial_ontologies.db/valuesets/_temporary/0/_temporary/attempt_20181126203459_0069_m_000001_3/timestamp=2018-11-26 20%3A34%3A30.458/part-00001-cff111e0-d909-4b9a-9fa6-d23384be381b.c000.snappy.parquet)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.LegacyPathCodec.getPath(LegacyPathCodec.java:96)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getGcsPath(GoogleHadoopFileSystem.java:172)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:762)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1067)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1048)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:937)
	at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:241)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.org$apache$spark$sql$execution$datasources$FileFormatWriter$DynamicPartitionWriteTask$$newOutputWriter(FileFormatWriter.scala:511)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$5.apply(FileFormatWriter.scala:546)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$5.apply(FileFormatWriter.scala:527)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:527)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
	... 8 more
Caused by: java.net.URISyntaxException: Illegal character in path at index 140: gs://g-warehouse/hadoop/tutorial_ontologies.db/valuesets/_temporary/0/_temporary/attempt_20181126203459_0069_m_000001_3/timestamp=2018-11-26 20%3A34%3A30.458/part-00001-cff111e0-d909-4b9a-9fa6-d23384be381b.c000.snappy.parquet
	at java.net.URI$Parser.fail(URI.java:2848)
	at java.net.URI$Parser.checkChars(URI.java:3021)
	at java.net.URI$Parser.parseHierarchical(URI.java:3105)
	at java.net.URI$Parser.parse(URI.java:3053)
	at java.net.URI.<init>(URI.java:588)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.LegacyPathCodec.getPath(LegacyPathCodec.java:91)
	... 28 more
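
The trace above is long, but the innermost "Caused by" is the useful part: a `java.net.URISyntaxException` reporting an illegal character at index 140 of the output path. A quick way to see which character that is, using the URI copied from the trace (the variable names are just for this sketch):

```python
# URI copied verbatim from the URISyntaxException message in the trace above.
failing_uri = ('gs://g-warehouse/hadoop/tutorial_ontologies.db/valuesets/'
               '_temporary/0/_temporary/attempt_20181126203459_0069_m_000001_3/'
               'timestamp=2018-11-26 20%3A34%3A30.458/'
               'part-00001-cff111e0-d909-4b9a-9fa6-d23384be381b.c000.snappy.parquet')

bad_index = 140  # index reported by the exception
bad_char = failing_uri[bad_index]

# Show the offending character and the path segment leading up to it.
print(repr(bad_char))
print(failing_uri[bad_index - 20:bad_index + 1])
```

The character is the space inside the `timestamp=...` partition directory name: the colons in that value were percent-encoded as `%3A`, but the space was not, and the `LegacyPathCodec.getPath` call shown in the trace rejects the resulting path.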


Now we can more easily look at the values in our valuesets:

In [19]:
spark.table('tutorial_ontologies.values').toPandas()

Unnamed: 0,system,version,value,valueseturi,valuesetversion
0,http://snomed.info/sct,20180301.0,15777000,http://engineering.cerner.com/bunsen/example,201806001
1,http://snomed.info/sct,20180301.0,44054006,http://engineering.cerner.com/bunsen/example,201806001
2,http://snomed.info/sct,20180301.0,15777000,http://engineering.cerner.com/bunsen/example,201806001
3,http://snomed.info/sct,20180301.0,44054006,http://engineering.cerner.com/bunsen/example,201806001
4,http://loinc.org,2.36,14647-2,http://hl7.org/fhir/ValueSet/example-extensional,20150622
5,http://loinc.org,2.36,2093-3,http://hl7.org/fhir/ValueSet/example-extensional,20150622
6,http://loinc.org,2.36,35200-5,http://hl7.org/fhir/ValueSet/example-extensional,20150622
7,http://loinc.org,2.36,9342-7,http://hl7.org/fhir/ValueSet/example-extensional,20150622
8,http://loinc.org,2.36,14647-2,http://hl7.org/fhir/ValueSet/example-extensional,20150622
9,http://loinc.org,2.36,2093-3,http://hl7.org/fhir/ValueSet/example-extensional,20150622


In [31]:
spark.table('tutorial_ontologies.valuesets').toPandas()

Unnamed: 0,id,meta,implicitRules,language,text,url,identifier,version,name,title,...,description,useContext,jurisdiction,immutable,purpose,copyright,extensible,compose,expansion,timestamp


## Using Valuesets
Finally, we illustrate how we can easily use FHIR valuesets within Spark SQL. Bunsen provides an *in_valueset* user-defined function that can be invoked directly from SQL, so users can easily work with valuesets without needing complex joins to separate ontology tables.

First, we will push some interesting valuesets to the cluster with the *push_valuesets* function seen below. This uses Apache Spark's broadcast variables to make the reference data available on each node, so it can be easily used. Details are in that function's documentation, but typically users work with valuesets in one of three ways:

* From a FHIR ValueSet resource, as illustrated here
* As a collection of values in a Python structure
* As an is-a relationship in some ontology, like LOINC or SNOMED.

Further details are available via help(push_valuesets).

Let's take a look at an example:

In [22]:
from bunsen.stu3.valuesets import push_valuesets, valueset

# Push multiple valuesets for this example, even though we use only one.
push_valuesets(spark, 
               {'ldl'               : [('http://loinc.org', '18262-6')],                
                'hdl'               : [('http://loinc.org', '2085-9')],
                'cholesterol'       : valueset('http://hl7.org/fhir/ValueSet/example-extensional', '20150622')},
               database='tutorial_ontologies'); 

In [24]:
spark.sql("""
select subject.reference, 
       valueQuantity.value,
       valueQuantity.unit
from tutorial_small.observation
where in_valueset(code, 'cholesterol')
limit 10
""").toPandas()

Unnamed: 0,reference,value,unit
0,urn:uuid:710a77bd-57f8-401e-a149-a686ce193e45,242.6102,mg/dL
1,urn:uuid:710a77bd-57f8-401e-a149-a686ce193e45,246.6088,mg/dL
2,urn:uuid:710a77bd-57f8-401e-a149-a686ce193e45,252.2379,mg/dL
3,urn:uuid:710a77bd-57f8-401e-a149-a686ce193e45,257.2211,mg/dL
4,urn:uuid:710a77bd-57f8-401e-a149-a686ce193e45,251.0915,mg/dL
5,urn:uuid:710a77bd-57f8-401e-a149-a686ce193e45,252.0646,mg/dL
6,urn:uuid:710a77bd-57f8-401e-a149-a686ce193e45,256.5613,mg/dL
7,urn:uuid:710a77bd-57f8-401e-a149-a686ce193e45,251.2045,mg/dL
8,urn:uuid:710a77bd-57f8-401e-a149-a686ce193e45,246.4793,mg/dL
9,urn:uuid:710a77bd-57f8-401e-a149-a686ce193e45,244.7487,mg/dL
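
Conceptually, a check like *in_valueset* boils down to asking whether any coding on a row belongs to a named set of (system, code) pairs. The sketch below illustrates that idea in plain Python. It is an assumption about the semantics rather than Bunsen's actual implementation, and `in_valueset_sketch`, `pushed`, and the sample concept are hypothetical names and data; the two LOINC pairs are the ones pushed in the cell above.

```python
# Named value sets as sets of (system, code) pairs, mirroring the dictionary
# passed to push_valuesets above.
pushed = {
    'ldl': {('http://loinc.org', '18262-6')},
    'hdl': {('http://loinc.org', '2085-9')},
}


def in_valueset_sketch(codeable_concept, name, valuesets=pushed):
    """Return True if any coding in the concept is a member of the named value set."""
    codings = (codeable_concept or {}).get('coding') or []
    members = valuesets[name]
    return any((c.get('system'), c.get('code')) in members for c in codings)


# Hypothetical CodeableConcept shaped like an observation's `code` field.
hdl_concept = {'coding': [{'system': 'http://loinc.org', 'code': '2085-9'}]}

print(in_valueset_sketch(hdl_concept, 'hdl'))
print(in_valueset_sketch(hdl_concept, 'ldl'))
```

Keeping the members in a set makes each lookup cheap, which is why having the reference data available on every node avoids a join against separate ontology tables.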
