This repository contains the Spark ETL jobs for our AWS Glue pipeline, used by the HealthcareLake project.
We are transforming one dataframe (FHIR) into several dataframes that correspond to the OMOP Common Data Model (CDM). The exact mapping can be found here.
Once the patient-level data model (FHIR) has been transformed to the population-level data model (OMOP CDM), we can use the Observational Health Data Sciences and Informatics (OHDSI) resources, which provide data aggregation and packages for cohort creation and various population-level analytics. More info
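To make this concrete, here is a minimal, hypothetical sketch of what one such mapping could look like in PySpark: pulling FHIR `Patient` resources into an OMOP-style `person` dataframe. The toy input and all column names below are illustrative only; the authoritative mapping is the document linked above.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName('etl').getOrCreate()

# Toy FHIR-shaped input; in the real pipeline this comes from the catalog.
fhir_df = spark.createDataFrame(
    [('Patient', 'p-001', 'female', '1984-03-02')],
    ['resourceType', 'id', 'gender', 'birthDate'],
)

# Map Patient resources onto an OMOP-style `person` dataframe.
# Column names here are illustrative, not the project's actual mapping.
person_df = (
    fhir_df
    .filter(F.col('resourceType') == 'Patient')
    .select(
        F.col('id').alias('person_id'),
        F.col('gender').alias('gender_source_value'),
        F.year(F.to_date('birthDate')).alias('year_of_birth'),
    )
)
```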
These instructions are for working with the data offline, as opposed to connecting to AWS EMR. This is the recommended approach, as it involves less setup.
To set up the Jupyter Notebook environment, follow these steps:
- Install Anaconda
- Create a virtual environment with Anaconda:

  ```sh
  conda create --name etl python=3.7
  ```
- Switch to this virtual environment:

  ```sh
  conda activate etl
  ```
- Add the environment to the Jupyter kernels:

  ```sh
  pip install --user ipykernel
  ```

  And then link it:

  ```sh
  python -m ipykernel install --user --name=etl
  ```

  You should now be able to run Jupyter Notebook in your browser:

  ```sh
  jupyter notebook
  ```

  Select Kernel → Change kernel → etl.
- Install PySpark. Open a new terminal (remember to activate the environment with `conda activate etl`) and run:

  ```sh
  pip install pyspark
  ```
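  Before moving on, you can sanity-check that PySpark installed into the active environment (this assumes `python` resolves to the `etl` environment):

  ```sh
  python -c "import pyspark; print(pyspark.__version__)"
  ```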
- Start developing. In your notebook:

  ```python
  from pyspark.sql import SparkSession

  # Create a local Spark session
  spark = SparkSession.builder.appName('etl').getOrCreate()

  # Read in our data
  df = spark.read.parquet('data/catalog.parquet')
  ```
That's it: you now have the DataFrame to work with.
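From here, a few standard PySpark calls are handy for getting oriented with the loaded data:

```python
# Inspect the inferred schema of the catalog
df.printSchema()

# Peek at a few rows without collecting the full dataset
df.show(5, truncate=False)

# Row count (triggers a full scan, so it may take a while)
print(df.count())
```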