Skip to content

Generating Cohorts

Robin van de Water edited this page Jun 6, 2023 · 1 revision

General Instructions

Please clone the YAIB-cohorts repository.

Replicate the cohorts within R

The following instructions assume that the YAIB-cohorts/R is the current working directory in your R session.

Step 0: Environment and Prerequisites

All data extractions were run using R 4.2.2 on an Apple M1 Max with Ventura 13.2.1. An renv lock file was created to install all necessary package dependencies. To recreate the environment, start a Terminal and run:

Rscript setup_env.R

Step 1: Load the datasets into ricu

In order to access the full datasets through R, you need to download them and make them available to ricu. Please follow the instructions given by the ricu package: ?ricu::import_src.

For quick experimentation, ricu comes with two demo datasets: mimic.demo and eicu.demo. These are small, openly available subsets of mimic and eicu that allow for easy prototyping. They should have been installed by renv. If they aren't, please see the respective Github pages here and here.

Once you have imported the datasets, make sure to set the right data path in .Rprofile file in this directory:

Sys.setenv(RICU_DATA_PATH = "/path/to/your/ricu/data/folder")

Step 2: Generate the Cohorts

This repository currently allows for the extraction of five prediction tasks:

Classification:

  1. ICU mortality after 24 hours: mortality.R
  2. Acute Kidney Injury within the next 6 hours: aki.R
  3. Sepsis within the next 6 hours: sepsis.R

Regression:

  1. Kidney function on the second day of ICU admission: kidney_function.R
  2. Remaining length of stay: los.R

All five tasks rely on a shared data cleaning provided in base_cohort.R, which defines and stores a subset of patients in each dataset with sufficient data quality. base_cohort.R therefore needs to be called before any of the task-specific cohorts can be generated.

Rscript "base_cohort.R" --src mimic_demo

Once base_cohort.R was run, data for a single task like mortality from a single dataset can be extracted via:

Rscript "mortality.R" --src mimic_demo

where mortality.R should be replaced with the task file of interest and mimic_demo with the database of interest (one of mimic_demo, eicu_demo, aumc, hirid, eicu, mimic, miiv). Data can be extracted from all datasets simultaneously via bash gen_cohort.sh mortality.

The output directory for the extracted data can be set in ../config.yaml

Replicate the cohorts from Python

The following instructions assume that the YAIB-cohorts/Python folder is the current working directory in your Python session.

Step 0: Environment and Prerequisites

An renv lock file was created to install all necessary package dependencies. To recreate the environment, start a Terminal and run:

python setup_env.py

Step 1: Load the datasets into ricu

In order to access the full datasets through R, you need to download them and make them available to ricu. Please follow the instructions given by the ricu package: ?ricu::import_src. Currently, Python code to perform this step is in development and will be uploaded soon.

For quick experimentation, ricu comes with two demo datasets: mimic.demo and eicu.demo. These are small, openly available subsets of mimic and eicu that allow for easy prototyping. They should have been installed by renv. If they aren't, please see the respective Github pages here and here.

Once you have imported the datasets, make sure to set the right data path in .Rprofile file in this directory:

Sys.setenv(RICU_DATA_PATH = "/path/to/your/ricu/data/folder")

Step 2: Generate the Cohorts

This repository currently allows for the extraction of five prediction tasks:

Classification:

  1. ICU mortality after 24 hours: mortality.py
  2. Acute Kidney Injury within the next 6 hours: aki.py
  3. Sepsis within the next 6 hours: sepsis.py

Regression:

  1. Kidney function on the second day of ICU admission: kidney_function.py
  2. Remaining length of stay: los.py

Data for a single task like mortality from a single dataset can be extracted via:

python mortality.py --src mimic_demo

where mortality.py should be replaced with the task file of interest and mimic_demo with the database of interest (one of mimic_demo, eicu_demo, aumc, hirid, eicu, mimic, miiv).

The output directory for the extracted data can be set in ../config.yaml