This repository contains the Spark ETL jobs for our AWS Glue pipeline, used by the HealthcareLake project.
We are transforming one dataframe (FHIR) into several dataframes that correspond to the OMOP Common Data Model (CDM). The exact mapping can be found here.
Once the patient-level data model (FHIR) has been transformed to the population-level data model (OMOP CDM), we can use the Observational Health Data Sciences and Informatics (OHDSI) resources, which provide data aggregation and packages for cohort creation and various population-level analytics. More info
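To make this concrete, here is a minimal, hypothetical sketch of what one such mapping could look like in PySpark: pulling FHIR `Patient` resources into an OMOP-style `person` dataframe. The toy input and all column names below are illustrative only; the authoritative mapping is the document linked above.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName('etl').getOrCreate()

# Toy FHIR-shaped input; in the real pipeline this comes from the catalog.
fhir_df = spark.createDataFrame(
    [('Patient', 'p-001', 'female', '1984-03-02')],
    ['resourceType', 'id', 'gender', 'birthDate'],
)

# Map Patient resources onto an OMOP-style `person` dataframe.
# Column names here are illustrative, not the project's actual mapping.
person_df = (
    fhir_df
    .filter(F.col('resourceType') == 'Patient')
    .select(
        F.col('id').alias('person_id'),
        F.col('gender').alias('gender_source_value'),
        F.year(F.to_date('birthDate')).alias('year_of_birth'),
    )
)
```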
These instructions are for working with the data offline, as opposed to connecting to AWS EMR. This is the recommended approach, as it involves less setup.
To set up the Jupyter Notebook environment, follow these steps:
- Install Anaconda
- Create a virtual environment with Anaconda:

  ```sh
  conda create --name etl python=3.7
  ```
- Switch to this virtual environment:

  ```sh
  conda activate etl
  ```
- Add the environment to the Jupyter kernels:

  ```sh
  pip install --user ipykernel
  ```

  And then link it:

  ```sh
  python -m ipykernel install --user --name=etl
  ```

  You should now be able to run Jupyter Notebook in your browser:

  ```sh
  jupyter notebook
  ```

  Select Kernel → Change kernel → etl.
- Install PySpark. Open a new terminal (remember to activate the environment with `conda activate etl`) and run:

  ```sh
  pip install pyspark
  ```
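  Before moving on, you can sanity-check that PySpark installed into the active environment (this assumes `python` resolves to the `etl` environment):

  ```sh
  python -c "import pyspark; print(pyspark.__version__)"
  ```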
- Start developing. In your notebook:

  ```python
  from pyspark.sql import SparkSession

  # Create a local Spark session
  spark = SparkSession.builder.appName('etl').getOrCreate()

  # Read in our data
  df = spark.read.parquet('data/catalog.parquet')
  ```
That's it: you now have the DataFrame to work with.
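From here, a few standard PySpark calls are handy for getting oriented with the loaded data:

```python
# Inspect the inferred schema of the catalog
df.printSchema()

# Peek at a few rows without collecting the full dataset
df.show(5, truncate=False)

# Row count (triggers a full scan, so it may take a while)
print(df.count())
```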