## FHIR to OMOP

This notebook attempts to map FHIR resources to the OMOP Common Data Model, following this guide: https://build.fhir.org/ig/HL7/cdmh/profiles.html#omop-to-fhir-mappings

### Connect to Spark cluster

Instructions here: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint.html

In [1]:
#spark.sparkContext.getConf().get('spark.driver.memory')

### Load Data Frame from Parquet Catalog File

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import dayofmonth,month,year,to_date,trunc,split,explode,array

# Create a Spark session (or reuse the active one)
spark = SparkSession.builder.appName('etl').getOrCreate()


In [3]:
# Read in our data
df = spark.read.parquet('data/catalog.parquet')

Next, inspect the schema to see which properties are available.

In [4]:
#df.printSchema()

List the distinct resource types:

In [5]:
df.select('resourceType').distinct().show()

+--------------------+
|        resourceType|
+--------------------+
|   DocumentReference|
|    DiagnosticReport|
|   MedicationRequest|
|         Observation|
|              Device|
|            CarePlan|
|ExplanationOfBenefit|
|          Provenance|
|               Claim|
|        Immunization|
|           Procedure|
|             Patient|
|        Organization|
|            Location|
|              Binary|
|           Condition|
|            CareTeam|
|           Encounter|
|        Practitioner|
+--------------------+



### Patient Mapping

Filter by patient resource type

In [6]:
patients = df.filter(df['resourceType'] == 'Patient')

In [7]:
#patients.printSchema()

Keep only the columns needed for the OMOP person table (FHIR Patient reference: https://www.hl7.org/fhir/patient.html)

In [8]:
persons = patients.select(['id','gender','birthDate'])
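OMOP's `person_id` is an integer, while the FHIR `id` selected above is a string. One possible way to bridge that gap, shown here as a minimal sketch only, is to derive a stable integer from a hash of the FHIR id. The helper name `surrogate_person_id` is hypothetical and not part of the original notebook. Hash collisions are unlikely but possible, so a persisted lookup table may be a safer choice for production loads.

```python
import hashlib


def surrogate_person_id(fhir_id: str) -> int:
    """Derive a deterministic, non-negative 63-bit integer from a FHIR id.

    Hypothetical helper: this is one option for producing an integer key,
    not the method prescribed by the mapping guide.
    """
    digest = hashlib.sha256(fhir_id.encode("utf-8")).digest()
    # Take the first 8 bytes and clear the top bit so the value
    # fits in a signed 64-bit integer column.
    return int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF


if __name__ == "__main__":
    # The same input always yields the same key.
    print(surrogate_person_id("example-patient-id"))
```

Because the function is deterministic, rerunning the ETL over the same FHIR ids would produce the same keys.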

Split the date of birth into separate year, month, and day columns

In [9]:
from pyspark.sql.functions import dayofmonth,month,year,to_date

In [10]:
stage_persons = persons\
    .withColumn("year_of_birth",year(persons['birthDate']))\
    .withColumn("month_of_birth",month(persons['birthDate']))\
    .withColumn("day_of_birth",dayofmonth(persons['birthDate']))\
    .withColumn("birth_datetime",to_date(persons['birthDate']))
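The FHIR `date` type allows partial values (`YYYY` or `YYYY-MM`) as well as full dates, and the sample rows in this notebook all appear to be full dates. If partial birth dates were present, the missing parts would ideally stay empty rather than be defaulted. The following plain-Python sketch shows that splitting logic; the helper name `split_birth_date` is hypothetical, and it does not check that the month or day falls in a valid range.

```python
import re
from typing import Optional, Tuple

# FHIR `date` values may be YYYY, YYYY-MM, or YYYY-MM-DD.
_FHIR_DATE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")

DateParts = Tuple[Optional[int], Optional[int], Optional[int]]


def split_birth_date(value: Optional[str]) -> DateParts:
    """Return (year, month, day), using None for any part that is absent.

    Hypothetical helper; unparseable input yields (None, None, None).
    """
    if not value:
        return (None, None, None)
    match = _FHIR_DATE.match(value.strip())
    if match is None:
        return (None, None, None)
    year, month, day = (
        int(part) if part is not None else None for part in match.groups()
    )
    return (year, month, day)


if __name__ == "__main__":
    for sample in ["2008-04-18", "1943-08", "1991", None]:
        print(sample, split_birth_date(sample))
```

A function like this could be applied per row (for example through a Spark UDF) if the built-in date functions turn out not to handle partial dates the way the mapping requires.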

In [11]:
#stage_persons.select([
#    "year_of_birth","month_of_birth","day_of_birth","birth_datetime"
#]).show(5)

Rename Columns

In [12]:
# The selected FHIR column is named "id" (not "identifier"), so rename that one
patient_dataframe = stage_persons.withColumnRenamed("id","person_id")\
        .withColumnRenamed("gender","gender_concept_id")

Shows mapped output table<br>
TODO: Missing "provider_id", "care_site_id", "race_concept_id", "ethnicity_concept_id" and "location_id"; "gender_concept_id" still holds the FHIR gender code rather than an OMOP concept ID

In [13]:
patient_dataframe.show(5)

+--------------------+-----------------+----------+-------------+--------------+------------+--------------+
|           person_id|gender_concept_id| birthDate|year_of_birth|month_of_birth|day_of_birth|birth_datetime|
+--------------------+-----------------+----------+-------------+--------------+------------+--------------+
|892799c4-760c-445...|           female|2008-04-18|         2008|             4|          18|    2008-04-18|
|394cbec0-93ce-4a9...|           female|1943-08-30|         1943|             8|          30|    1943-08-30|
|96f58e83-0237-4a8...|           female|2009-01-01|         2009|             1|           1|    2009-01-01|
|b1a91dd8-27d9-439...|             male|1991-10-11|         1991|            10|          11|    1991-10-11|
|b9d2e182-6859-402...|             male|2005-11-17|         2005|            11|          17|    2005-11-17|
+--------------------+-----------------+----------+-------------+--------------+------------+--------------+
only showing top 5 rows
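In the output above, `gender_concept_id` contains the FHIR codes `female` and `male`, whereas OMOP expects a numeric concept ID. A minimal sketch of that lookup follows, using the OMOP standard gender concepts 8507 (MALE) and 8532 (FEMALE) and 0 for "no matching concept". The names `GENDER_CONCEPT_IDS`, `gender_to_concept_id`, and `with_gender_concept` are hypothetical, and mapping the FHIR codes `other` and `unknown` to 0 is an assumption rather than something the guide is known to require.

```python
# OMOP standard concept IDs for gender; 0 means "no matching concept".
GENDER_CONCEPT_IDS = {"male": 8507, "female": 8532}


def gender_to_concept_id(fhir_gender):
    """Map a FHIR administrative gender code to an OMOP concept ID.

    Hypothetical helper: anything other than male/female maps to 0.
    """
    if fhir_gender is None:
        return 0
    return GENDER_CONCEPT_IDS.get(fhir_gender.strip().lower(), 0)


def with_gender_concept(df, source_col="gender"):
    """Add gender_source_value and gender_concept_id to a Spark DataFrame.

    Assumes `df` still has the original FHIR gender column (before renaming).
    """
    # Imported lazily so the plain-Python helper above works without Spark.
    from pyspark.sql import functions as F

    code = F.lower(F.trim(F.col(source_col)))
    mapped = F.lit(0)  # default when no code matches (including nulls)
    for fhir_code, concept_id in GENDER_CONCEPT_IDS.items():
        mapped = F.when(code == fhir_code, F.lit(concept_id)).otherwise(mapped)
    return (
        df.withColumn("gender_source_value", F.col(source_col))
        .withColumn("gender_concept_id", mapped)
    )
```

Applied to `stage_persons` before the rename step, `with_gender_concept` would keep the original code in `gender_source_value` and put the numeric ID in `gender_concept_id`.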

Convert to a DynamicFrame so it can be written out by a Glue ETL job

In [14]:
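#from awsglue.dynamicframe import DynamicFrame  # needs the AWS Glue libraries and an existing glueContext
#patient_dynamicframe = DynamicFrame.fromDF(patient_dataframe,glueContext,"patient_dynamicframe")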

In [15]:
#patient_dynamicframe.printSchema()