spe-uob/2020-HealthcareLakeETL

HealthcareLakeETL

This repository contains the Spark ETL jobs for the AWS Glue pipeline used by the HealthcareLake project.

FHIR → OMOP

We transform a single FHIR DataFrame into several DataFrames that correspond to the OMOP Common Data Model (CDM). The exact mapping can be found here.
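To make the shape of this transformation concrete, here is a minimal plain-Python sketch that maps a single FHIR Patient resource to one row of the OMOP `person` table. It is not this project's mapping: the helper `patient_to_person`, the choice of fields, and the use of a dict in place of a Spark row are assumptions for illustration. The gender concept IDs (8507 for male, 8532 for female) are the standard OMOP ones, and 0 is the OMOP convention for "no matching concept".

```python
from datetime import date

# Standard OMOP gender concepts; 0 means "no matching concept".
GENDER_CONCEPTS = {"male": 8507, "female": 8532}


def patient_to_person(patient: dict, person_id: int) -> dict:
    """Map one FHIR Patient resource (as a dict) to an OMOP person row.

    Hypothetical helper for illustration; the project's real mapping may differ.
    """
    # Assumes a full YYYY-MM-DD date; FHIR also allows partial dates,
    # which this sketch does not handle.
    birth = date.fromisoformat(patient["birthDate"])
    gender = patient.get("gender")
    return {
        "person_id": person_id,
        "gender_concept_id": GENDER_CONCEPTS.get(gender, 0),
        "year_of_birth": birth.year,
        "month_of_birth": birth.month,
        "day_of_birth": birth.day,
        "person_source_value": patient["id"],
        "gender_source_value": gender,
    }


example_patient = {
    "resourceType": "Patient",
    "id": "example-1",
    "gender": "female",
    "birthDate": "1980-04-12",
}
person_row = patient_to_person(example_patient, person_id=1)
print(person_row)
```

In a Spark job, the same per-field logic would typically be written as column expressions over the FHIR DataFrame rather than as a Python function called row by row.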

Once the patient-level data model (FHIR) has been transformed into the population-level data model (OMOP CDM), we can use the Observational Health Data Sciences and Informatics (OHDSI) tools and packages for data aggregation, cohort creation, and other population-level analytics.

Local development

These instructions are for working with the data offline rather than connecting to AWS EMR. Working offline is recommended because it involves less setup.

To set up the Jupyter Notebook environment, follow these steps:

  1. Install Anaconda

  2. Create a Virtual Environment with Anaconda

conda create --name etl python=3.7

  3. Switch to this virtual environment

conda activate etl

  4. Add the environment to jupyter kernels

pip install --user ipykernel

And then link it

python -m ipykernel install --user --name=etl

You should now be able to run jupyter notebook in your browser:

jupyter notebook

Select Kernel→Change kernel→etl

  5. Install PySpark

Open a new terminal. (Remember to activate the environment with conda activate etl)

pip install pyspark

  6. Start developing

In your notebook:

from pyspark.sql import SparkSession

# Create a local Spark session
spark = SparkSession.builder.appName('etl').getOrCreate()

# Read in our data
df = spark.read.parquet('data/catalog.parquet')

That's it. You now have a DataFrame to work with.
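If the read fails, two common causes are a missing PySpark install in the active environment and a wrong data path. The following sketch, which uses only the standard library, checks both before Spark is started. The helper name `check_local_setup` is hypothetical and not part of this repository; `data/catalog.parquet` is simply the path used in the snippet above.

```python
import importlib.util
from pathlib import Path


def check_local_setup(data_path: str = "data/catalog.parquet") -> list:
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper, not part of this repository.
    """
    problems = []
    # find_spec returns None when the package cannot be imported
    # from the currently active environment.
    if importlib.util.find_spec("pyspark") is None:
        problems.append("pyspark is not installed in this environment")
    # Parquet data may be a single file or a directory of part files,
    # so only existence is checked here.
    if not Path(data_path).exists():
        problems.append(f"data path not found: {data_path}")
    return problems


for problem in check_local_setup():
    print("Setup problem:", problem)
```

Running this in the notebook with the `etl` kernel selected also confirms that the kernel, and not just the terminal, sees the PySpark install.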