Group Members: Teresa Lee, Tino Trangia, Micah Hunter
https://drive.google.com/file/d/15b9CFIhlCw2wgkapxL_2Qp4AD6Ne-dhQ/view?usp=sharing
- Synthetic patient medical records (i.e., Electronic Health Records, or EHRs) from Synthea. We used both the 1000 Sample and 100 Sample CSV files. The data dictionary and in-depth information about the various CSV files can be found on the FHIR website. For example, the attributes for each patient are described here. We have included CSVs for patients and conditions, as well as their SQL schemas, in the data directory.
- Social Determinants of Health (SDOH) data from the Agency for Healthcare Research and Quality (AHRQ), found here. The data was originally available as .xlsx, but we converted it to CSV for our project. We used the 2020 ZIP code based data. The codebook is included in our data directory. Preprocessing steps, such as filtering relevant columns and records and importing into Neo4j, are included in the data_preprocessing directory.
You can use Neo4j Sandbox, a cloud-based, short-term Neo4j database instance that is good for quick projects and for exploring a graph database without local setup. You can also use a desktop installation of Neo4j and modify the driver details accordingly. See the documentation on the Neo4j site for using Neo4j with Python. We also used PostgreSQL with DataGrip for the relational data.
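As a rough illustration of what connecting from Python and merging CSV rows can look like, here is a minimal sketch. It is not the code in main.ipynb. The URI, user, and password are placeholders to replace with your own driver details, and the `Patient` label, the property names, and the choice of the `Id`, `GENDER`, and `ZIP` columns are assumptions made for the example. The `neo4j` package is imported inside `load_patients` so the CSV-reading part runs even without a database.

```python
import csv
import io

# Hypothetical placeholders: copy the real values from the sandbox's
# "connect via drivers" tab or from your desktop installation.
NEO4J_URI = "bolt://<host>:7687"
NEO4J_USER = "neo4j"
NEO4J_PASSWORD = "<password>"

# MERGE keeps the import idempotent: re-running it does not duplicate nodes.
# The Patient label and property names are assumptions for this sketch.
MERGE_PATIENTS = """
UNWIND $rows AS row
MERGE (p:Patient {id: row.Id})
SET p.gender = row.GENDER, p.zip = row.ZIP
"""


def read_rows(csv_file, columns):
    """Return the requested columns of each CSV row as a list of dicts."""
    reader = csv.DictReader(csv_file)
    return [{col: rec.get(col, "") for col in columns} for rec in reader]


def load_patients(rows):
    """Send all rows to Neo4j in one parameterized query."""
    # Imported lazily; requires `pip install neo4j` and a reachable database.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD)) as driver:
        with driver.session() as session:
            session.run(MERGE_PATIENTS, rows=rows).consume()


# Tiny in-memory stand-in for a patients CSV file (made-up values).
sample = io.StringIO("Id,GENDER,ZIP\nabc-1,F,02139\nabc-2,M,01002\n")
patient_rows = read_rows(sample, ["Id", "GENDER", "ZIP"])
```

Once the placeholders are filled in, calling `load_patients(patient_rows)` would send the rows to the database. Passing the rows as a query parameter, rather than formatting values into the Cypher string, avoids quoting problems in the data.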
Steps for replicating this project:
- Clone the repo
- Download the Synthea data. We used the 100 Sample Synthetic Patient Records, CSV.
- Create a directory called 'csv' in your local repo and unzip the Synthea data (all .csv files) into it.
- Edit main.ipynb with the correct driver information (see sandbox tab -> connect via drivers -> Python).
- Run the cells in main.ipynb, which will read the CSV data and merge it into the sandbox database.
- Download the SDOH data (ZIP code) from AHRQ. Place SDOH_2020_ZIPCODE_1_0.xlsx into your local repo.
- Run sdoh.ipynb (in the data_preprocessing directory). This reads the Excel data into a Pandas dataframe, then filters for rows corresponding to Massachusetts ZIP codes (as all the synthetic patient records are from Massachusetts). It then writes the data to a CSV file in the main directory and imports the nodes into the Neo4j instance. The CSV can be further processed and used with PostgreSQL. For example, you can move sdoh_ma.csv into the data_preprocessing directory and run filter_zipcode_citizenship_data.ipynb to filter for the desired columns, then import the result into PostgreSQL using our provided schema.
- Now you can start querying the data. See the query_examples directory for the queries we used in the presentation. You can simply copy and paste the SQL queries into the DataGrip query console and the Cypher queries into the Neo4j browser.
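The Massachusetts filtering step described above can be sketched as follows. This is an illustration, not the contents of sdoh.ipynb: the notebook uses Pandas, while this version uses only the standard library so it runs anywhere. The column names `STATE` and `ZIPCODE`, the `VALUE` column, and the sample rows are assumptions. One thing worth watching for is that Massachusetts ZIP codes start with 0, so a ZIP code parsed as a number loses its leading zero; the sketch pads it back to five digits.

```python
import csv
import io


def normalize_zip(value):
    """Restore a 5-digit ZIP code that lost its leading zeros (e.g. 2139 -> 02139)."""
    text = str(value).strip().split(".")[0]  # drop a trailing ".0" from float-typed cells
    return text.zfill(5)


def filter_state(rows, state, state_col="STATE", zip_col="ZIPCODE"):
    """Keep the rows for one state and normalize their ZIP codes."""
    kept = []
    for row in rows:
        if row[state_col].strip().lower() == state.lower():
            kept.append({**row, zip_col: normalize_zip(row[zip_col])})
    return kept


# Made-up rows for illustration only; the column names are assumptions.
sample = io.StringIO(
    "ZIPCODE,STATE,VALUE\n"
    "2139,Massachusetts,1\n"
    "1002.0,Massachusetts,2\n"
    "10001,New York,3\n"
)
ma_rows = filter_state(csv.DictReader(sample), "Massachusetts")

# Write the filtered rows back out as CSV (to a string here, to a file in practice).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["ZIPCODE", "STATE", "VALUE"], lineterminator="\n")
writer.writeheader()
writer.writerows(ma_rows)
```

Keeping ZIP codes as five-character strings also matters when joining the SDOH rows to patient records, whether in PostgreSQL or Neo4j, since `2139` and `02139` would not match.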