## FHIR to OMOP

This notebook is a first attempt at mapping FHIR resources to the OMOP Common Data Model, following the HL7 CDMH mapping guide: https://build.fhir.org/ig/HL7/cdmh/profiles.html#omop-to-fhir-mappings

### Connect to Spark cluster

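Instructions for setting up an AWS Glue development endpoint: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint.html

The first cell reads the driver memory setting, which also starts the Spark session.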

In [1]:
spark.sparkContext.getConf().get('spark.driver.memory')

Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | User | Current session? |
|---|---|---|---|---|---|---|---|
| 3 | application_1615075593408_0004 | pyspark | idle | Link | Link | | ✔ |



SparkSession available as 'spark'.



'6000M'

### Load DynamicFrame from Glue Catalog

A DynamicFrame is similar to a Spark DataFrame, but it tolerates schema changes and inconsistencies in the data, which is what we want here.

In [2]:
from awsglue.context import GlueContext
from pyspark.sql import SparkSession

glueContext = GlueContext(SparkSession.builder.enableHiveSupport().getOrCreate())
spark = glueContext.spark_session



In [3]:
df = glueContext.create_dynamic_frame.from_catalog(
    database="fhir-catalog", table_name="resource_db_dev")


Next, inspect the schema to see which properties are available (the call is commented out in this run).

In [69]:
#df.printSchema()


### Start with Patient

In [14]:
# Convert the Glue DynamicFrame to a Spark DataFrame so the DataFrame API can be used.
df = df.toDF()


In [23]:
patients = df.filter(df['resourceType'] == 'Patient')


In [70]:
#patients.printSchema()


Keep only the Patient columns needed for the OMOP person mapping and drop the rest (see https://www.hl7.org/fhir/patient.html for the full list of Patient fields)

In [52]:
persons = patients.select(['identifier','gender','birthDate'])

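
The `identifier` and `gender` columns selected above still hold FHIR values, while OMOP expects a single source identifier and a numeric gender concept. The following is a minimal pure-Python sketch of that step, not part of the original notebook. It assumes the standard OMOP gender concepts (8507 for male, 8532 for female, 0 when there is no match) and an illustrative rule of taking the first identifier that has a value. The helper names `gender_concept_id` and `person_source_value` are hypothetical; the logic could be wrapped in a Spark UDF or rewritten with column expressions.

```python
from typing import Any, Dict, Optional

# Standard OMOP gender concepts. 0 is the OMOP convention for "no matching concept".
GENDER_CONCEPT_IDS = {"male": 8507, "female": 8532}


def gender_concept_id(fhir_gender: Optional[str]) -> int:
    """Map a FHIR administrative gender code to an OMOP gender_concept_id."""
    if fhir_gender is None:
        return 0
    # FHIR also allows "other" and "unknown"; both fall through to 0 here.
    return GENDER_CONCEPT_IDS.get(fhir_gender.strip().lower(), 0)


def person_source_value(patient: Dict[str, Any]) -> Optional[str]:
    """Pick one source identifier for a Patient resource.

    Assumption: the first identifier with a non-empty value is good enough;
    a real pipeline would probably filter on identifier.system instead.
    """
    for identifier in patient.get("identifier") or []:
        value = identifier.get("value")
        if value:
            return value
    # Fall back to the resource id when no identifier carries a value.
    return patient.get("id")
```

For example, `gender_concept_id("female")` gives 8532, and a Patient with no usable identifier falls back to its resource `id`.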

Split the date of birth into the separate year, month and day columns used by OMOP

In [65]:
from pyspark.sql.functions import dayofmonth, month, year, to_date


In [66]:
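# Derive the OMOP birth fields from the FHIR birthDate string.
stage_persons = (
    persons
    .withColumn("year_of_birth", year(persons["birthDate"]))
    .withColumn("month_of_birth", month(persons["birthDate"]))
    .withColumn("day_of_birth", dayofmonth(persons["birthDate"]))
    # Note: to_date yields a date, not a timestamp, despite the column name.
    .withColumn("birth_datetime", to_date(persons["birthDate"]))
)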


In [68]:
stage_persons.select([
    "year_of_birth","month_of_birth","day_of_birth","birth_datetime"
]).show(5)


+-------------+--------------+------------+--------------+
|year_of_birth|month_of_birth|day_of_birth|birth_datetime|
+-------------+--------------+------------+--------------+
|         1983|            12|          17|    1983-12-17|
|         1996|             7|          23|    1996-07-23|
|         2008|             1|          15|    2008-01-15|
|         1971|             8|          24|    1971-08-24|
|         1968|             2|           3|    1968-02-03|
+-------------+--------------+------------+--------------+
only showing top 5 rows
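
The sample rows above all carry full dates, but the FHIR `date` type also allows partial values such as `1983` or `1983-12`. Depending on the Spark version, casting such a string to a date may fill in the first day of the period, which would silently invent a month or day of birth; this is worth checking on the cluster. The sketch below is a pure-Python illustration of splitting a `birthDate` while leaving missing parts empty. The helper name `split_birth_date` is hypothetical and not part of the original notebook, and it does not validate calendar ranges.

```python
import re
from typing import Optional, Tuple

# FHIR date: YYYY, YYYY-MM, or YYYY-MM-DD.
_FHIR_DATE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")

BirthParts = Tuple[Optional[int], Optional[int], Optional[int]]


def split_birth_date(birth_date: Optional[str]) -> BirthParts:
    """Return (year, month, day), with None for any part the source omits."""
    if not birth_date:
        return (None, None, None)
    match = _FHIR_DATE.fullmatch(birth_date.strip())
    if match is None:
        # Unparseable input: leave all parts empty rather than guess.
        return (None, None, None)
    year_part, month_part, day_part = match.groups()
    return (
        int(year_part),
        int(month_part) if month_part else None,
        int(day_part) if day_part else None,
    )
```

With this approach a year-only `birthDate` fills `year_of_birth` and leaves `month_of_birth` and `day_of_birth` null, rather than defaulting them to 1.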