## FHIR to OMOP

This notebook attempts to map FHIR resources to the OMOP Common Data Model, following the HL7 CDMH implementation guide: https://build.fhir.org/ig/HL7/cdmh/profiles.html#omop-to-fhir-mappings

### Connect to Spark cluster

Follow the AWS Glue development endpoint instructions: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint.html

In [132]:
spark.sparkContext.getConf().get('spark.driver.memory')

An error was encountered:
'SparkSession' object has no attribute 'getConf'
Traceback (most recent call last):
AttributeError: 'SparkSession' object has no attribute 'getConf'



### Load DynamicFrame from Glue Catalog

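A DynamicFrame is similar to a Spark DataFrame, but each record is self-describing, so no fixed schema is required up front. That suits FHIR data, where the available fields vary by resource type.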

In [160]:
from awsglue.context import GlueContext
from pyspark.sql import SparkSession
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkSession.builder.enableHiveSupport().getOrCreate())
spark = glueContext.spark_session


In [161]:
df = glueContext.create_dynamic_frame.from_catalog(
    database="fhir-catalog", table_name="resource_db_dev")

Next, inspect the schema to see which properties are available.

In [162]:
#df.printSchema()

### Start with Patient

In [163]:
# Convert the DynamicFrame to a Spark DataFrame to use the DataFrame API
df = df.toDF()

List the distinct resource types present in the dataset

In [164]:
df.select('resourceType').distinct().show()

+--------------------+
|        resourceType|
+--------------------+
|   DocumentReference|
|    DiagnosticReport|
|   MedicationRequest|
|         Observation|
|              Device|
|            CarePlan|
|ExplanationOfBenefit|
|          Provenance|
|               Claim|
|        Immunization|
|           Procedure|
|             Patient|
|        Organization|
|            Location|
|              Binary|
|           Condition|
|            CareTeam|
|           Encounter|
|        Practitioner|
+--------------------+

Keep only the Patient resources

In [165]:
patients = df.filter(df['resourceType'] == 'Patient')

In [177]:
#patients.printSchema()

Keep only the columns needed for the mapping and drop the rest (see https://www.hl7.org/fhir/patient.html)

In [178]:
persons = patients.select(['id','gender','birthDate'])

Split the date of birth into separate year, month, and day columns

In [179]:
from pyspark.sql.functions import dayofmonth, month, year, to_date

In [180]:
# Derive the birth date parts from the FHIR birthDate string
stage_persons = persons\
    .withColumn("year_of_birth", year(persons['birthDate']))\
    .withColumn("month_of_birth", month(persons['birthDate']))\
    .withColumn("day_of_birth", dayofmonth(persons['birthDate']))\
    .withColumn("birth_datetime", to_date(persons['birthDate']))  # yields a date, not a timestamp

In [181]:
#stage_persons.select([
#    "year_of_birth","month_of_birth","day_of_birth","birth_datetime"
#]).show(5)
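
The FHIR `date` type also allows partial values such as `1991` or `1991-10`, while the Spark functions above are applied here to full dates only. The following is a minimal plain-Python sketch of the same split that tolerates partial dates. The function name `split_birth_date` is hypothetical and not part of this notebook, and how the Spark functions treat partial values is not checked here.

```python
from datetime import date
from typing import Optional


def split_birth_date(birth_date: Optional[str]) -> dict:
    """Split a FHIR date (YYYY, YYYY-MM, or YYYY-MM-DD) into OMOP-style parts."""
    parts = {
        "year_of_birth": None,
        "month_of_birth": None,
        "day_of_birth": None,
        "birth_datetime": None,
    }
    if not birth_date:
        return parts  # no birthDate recorded on the Patient resource

    fields = [int(piece) for piece in birth_date.split("-")]
    parts["year_of_birth"] = fields[0]
    if len(fields) >= 2:
        parts["month_of_birth"] = fields[1]
    if len(fields) == 3:
        parts["day_of_birth"] = fields[2]
        # Only a complete date can be turned into a calendar date.
        parts["birth_datetime"] = date(fields[0], fields[1], fields[2])
    return parts


full = split_birth_date("1991-10-11")   # complete date, as in the sample data
partial = split_birth_date("1991-10")   # partial date: the day is unknown
print(full)
print(partial)
```

Leaving the unknown parts as `None` keeps a partial date distinguishable from a real first-of-month birthday.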

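Rename columns to their OMOP names

In [182]:
# Note: "identifier" was not selected above, so that rename is a no-op and "id" is kept.
# "gender_concept_id" still holds the FHIR code string, not an OMOP concept ID.
patient_dataframe = stage_persons.withColumnRenamed("identifier","person_id")\
        .withColumnRenamed("gender","gender_concept_id")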
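
Renaming `gender` only changes the column name; the values remain FHIR codes such as `male` and `female`. As a sketch of the missing step, the plain-Python example below maps those codes to OMOP concept IDs. It assumes the standard OMOP gender concepts 8507 (male) and 8532 (female) and uses 0 for anything without a match. The helper names `gender_to_concept_id` and `to_person_fields` are hypothetical, and keeping the FHIR `id` in `person_source_value` is one possible choice, since the FHIR `id` is a UUID string while OMOP expects an integer `person_id`.

```python
from typing import Optional

# Assumed OMOP standard gender concepts; verify against your vocabulary tables.
GENDER_CONCEPT_IDS = {"male": 8507, "female": 8532}
NO_MATCHING_CONCEPT = 0  # OMOP convention for "no matching concept"


def gender_to_concept_id(fhir_gender: Optional[str]) -> int:
    """Map a FHIR administrative gender code to an OMOP concept ID."""
    if fhir_gender is None:
        return NO_MATCHING_CONCEPT
    return GENDER_CONCEPT_IDS.get(fhir_gender.lower(), NO_MATCHING_CONCEPT)


def to_person_fields(patient: dict) -> dict:
    """Build the gender and source-value fields of an OMOP person row."""
    gender = patient.get("gender")
    return {
        # The FHIR id is a string, so it is kept as a source value
        # rather than used directly as the integer person_id.
        "person_source_value": patient.get("id"),
        "gender_source_value": gender,
        "gender_concept_id": gender_to_concept_id(gender),
    }


# "example-patient-1" is a made-up id for illustration only.
row = to_person_fields({"id": "example-patient-1", "gender": "female"})
print(row)
```

The same dictionary could be applied in Spark through a join against a small lookup table, which keeps the mapping data-driven rather than hard-coded.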

Show the mapped output table.

TODO: The output is still missing "provider_id", "care_site_id", "race_concept_id", "ethnicity_concept_id" and "location_id".

In [183]:
patient_dataframe.show(5) 

+--------------------+-----------------+----------+-------------+--------------+------------+--------------+
|                  id|gender_concept_id| birthDate|year_of_birth|month_of_birth|day_of_birth|birth_datetime|
+--------------------+-----------------+----------+-------------+--------------+------------+--------------+
|b1a91dd8-27d9-439...|             male|1991-10-11|         1991|            10|          11|    1991-10-11|
|5697c724-a5cd-479...|           female|2001-08-03|         2001|             8|           3|    2001-08-03|
|81bfb1ae-323f-43a...|           female|2018-05-11|         2018|             5|          11|    2018-05-11|
|e3b2af8e-24ce-493...|           female|1980-06-16|         1980|             6|          16|    1980-06-16|
|1eb90da7-fff7-46a...|             male|1988-05-06|         1988|             5|           6|    1988-05-06|
+--------------------+-----------------+----------+-------------+--------------+------------+--------------+
only showing top 5 rows

Convert back to a DynamicFrame so that a Glue ETL job can write it out

In [184]:
patient_dynamicframe = DynamicFrame.fromDF(patient_dataframe,glueContext,"patient_dynamicframe")

In [185]:
patient_dynamicframe.printSchema()

root
|-- id: string
|-- gender_concept_id: string
|-- birthDate: string
|-- year_of_birth: int
|-- month_of_birth: int
|-- day_of_birth: int
|-- birth_datetime: date
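
The printed schema can be compared against the target table to see what remains to be mapped. The sketch below is plain Python and assumes that the OMOP person table requires `person_id`, `gender_concept_id`, `year_of_birth`, `race_concept_id` and `ethnicity_concept_id` as integer columns; that list should be checked against the CDM version in use. The function name `check_person_schema` is hypothetical.

```python
# Column types copied from the printSchema() output above.
produced_schema = {
    "id": "string",
    "gender_concept_id": "string",
    "birthDate": "string",
    "year_of_birth": "int",
    "month_of_birth": "int",
    "day_of_birth": "int",
    "birth_datetime": "date",
}

# Assumption: required integer columns of the OMOP person table.
REQUIRED_INTEGER_COLUMNS = [
    "person_id",
    "gender_concept_id",
    "year_of_birth",
    "race_concept_id",
    "ethnicity_concept_id",
]
INTEGER_TYPES = {"int", "long"}


def check_person_schema(schema: dict) -> tuple:
    """Return (missing columns, columns present with a non-integer type)."""
    missing = [c for c in REQUIRED_INTEGER_COLUMNS if c not in schema]
    wrong_type = [
        c for c in REQUIRED_INTEGER_COLUMNS
        if c in schema and schema[c] not in INTEGER_TYPES
    ]
    return missing, wrong_type


missing, wrong_type = check_person_schema(produced_schema)
print("missing:", missing)
print("wrong type:", wrong_type)
```

Run against the schema above, this reports `person_id` as missing, because the `identifier` rename had no effect, and flags `gender_concept_id` as a string column.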