
Adding datasets

Robin van de Water edited this page Nov 14, 2023 · 1 revision

Add your own dataset

Users can supply their own datasets in a specific format. Datasets that do not follow this format require some minor modifications to the codebase.

Datasets formatted in the default way can be used by YAIB without any code changes. A new dataset type can be added by describing it in a .gin task definition file; see the task definition for binary classification for an example.

By default, YAIB works with the Apache Parquet file format, a modern, open-source, column-oriented format that keeps storage requirements low through efficient data compression. The data is split into three files: DYNAMIC, STATIC, and OUTCOME, which hold the dynamic variables (those that change during the stay), the constant parameters, and the prediction task label, respectively. Our cohort definition code produces files in exactly this format.

The task definition also introduces the concept of roles through the vars dictionary. Roles are assigned as defined in ReciPys, the preprocessing package developed alongside YAIB. The GROUP variable defines which internal dataset variable is used to "group by", e.g., when aggregating patient vital signs. The SEQUENCE variable defines the sequential dimension of the dataset (in the common case, time). The other keys in this dictionary define the feature columns and outcome variables to be used for prediction.
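As a rough illustration of the layout described above, the sketch below pairs a vars dictionary with the column names of the three files and checks that every referenced column is present. Only GROUP and SEQUENCE are named in the text; the LABEL, DYNAMIC, and STATIC keys, all column names, and the check_columns helper are assumptions made for this example and are not part of YAIB's API.

```python
# Hypothetical role assignment; all column names are invented for illustration.
vars_dict = {
    "GROUP": "stay_id",        # column used to group rows (e.g. one ICU stay)
    "SEQUENCE": "time",        # sequential dimension of the dataset
    "LABEL": "label",          # assumed key for the outcome variable
    "DYNAMIC": ["hr", "resp"], # assumed key for time-varying features
    "STATIC": ["age", "sex"],  # assumed key for constant features
}

# Column names that each of the three files is assumed to contain.
file_columns = {
    "DYNAMIC": ["stay_id", "time", "hr", "resp"],
    "STATIC": ["stay_id", "age", "sex"],
    "OUTCOME": ["stay_id", "label"],
}


def check_columns(vars_dict, file_columns):
    """Return {file name: sorted list of missing columns}; empty if all are present."""
    group, seq = vars_dict["GROUP"], vars_dict["SEQUENCE"]
    required = {
        "DYNAMIC": {group, seq, *vars_dict["DYNAMIC"]},
        "STATIC": {group, *vars_dict["STATIC"]},
        "OUTCOME": {group, vars_dict["LABEL"]},
    }
    missing = {}
    for name, cols in required.items():
        absent = sorted(cols - set(file_columns.get(name, [])))
        if absent:
            missing[name] = absent
    return missing


print(check_columns(vars_dict, file_columns))
```

A check of this kind can be run on the column names of the Parquet files before training, so that a mismatch between the role definition and the data surfaces early.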

Defining new Harmonized Datasets

We use the ricu package for harmonizing our datasets. You can find additional instructions in our repository https://github.com/rvandewater/YAIB-cohorts for generating cohorts from new datasets.

Define ID types and table structure

First, ricu needs to know what tables and columns exist within the data source. This is specified via the JSON configuration file partially shown below.

{
    "name": "sic",
    "id_cfg": {
      "patient": {
        "id": "patientid",
        "position": 1,
        "start": "firstadmission",
        "end": "offsetofdeath",
        "table": "cases"
      },
      "icustay": {
        "id": "caseid",
        "position": 2,
        "start": "offsetafterfirstadmission",
        "end": "timeofstay",
        "table": "cases"
      }
    },
    "tables": {
      "cases": {
        "files": "cases.csv.gz",
        "defaults": {
          "index_var": "offsetafterfirstadmission",
          "time_vars": ["offsetafterfirstadmission", "offsetofdeath"]
        },
        "cols": {
          "caseid": {
            "name": "CaseID",
            "spec": "col_integer"
          },
          "patientid": {
            "name": "PatientID",
            "spec": "col_integer"
          },
          "admissionyear": {
            "name": "AdmissionYear",
            "spec": "col_integer"
          },
          ...
        }
      },
      "d_references": {
        ...
      },
      ...
    }
}

Tables are defined under the tables element. The definition of each table contains the name of its source file, usually provided in .csv or .csv.gz format, and a list of all columns and their data types. In addition, default roles can be defined for certain columns in the table. These usually include the time index (if present), all other time columns, and a column that is considered to contain the value of interest.
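To make the table structure more concrete, here is a minimal Python sketch that reads a trimmed copy of the cases entry from the listing above and summarizes it. It assumes that each key under cols is the name used inside ricu and that the name field is the column name in the source file. The summarize_tables helper is hypothetical and only inspects the JSON; it is not part of ricu.

```python
import json

# Trimmed, complete copy of the "cases" entry from the listing above.
# Elided columns and tables are left out rather than guessed.
config_text = """
{
  "name": "sic",
  "tables": {
    "cases": {
      "files": "cases.csv.gz",
      "defaults": {
        "index_var": "offsetafterfirstadmission",
        "time_vars": ["offsetafterfirstadmission", "offsetofdeath"]
      },
      "cols": {
        "caseid": {"name": "CaseID", "spec": "col_integer"},
        "patientid": {"name": "PatientID", "spec": "col_integer"},
        "admissionyear": {"name": "AdmissionYear", "spec": "col_integer"}
      }
    }
  }
}
"""


def summarize_tables(cfg):
    """Map each table to its source file and {source column: internal name}."""
    summary = {}
    for table, spec in cfg["tables"].items():
        rename = {col["name"]: key for key, col in spec["cols"].items()}
        summary[table] = {"file": spec["files"], "rename": rename}
    return summary


cfg = json.loads(config_text)
summary = summarize_tables(cfg)
print(summary["cases"]["file"])
print(summary["cases"]["rename"])
```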

In addition to the available tables, ricu also expects information about the main ID types used in the dataset. Each piece of information in ICU datasets is usually linked to a certain unit of observation, most commonly the patient (patient), the hospital admission (hospadm), or the specific ICU stay (icustay). By knowing which ID a piece of information is measured for, ricu is able to relate all information within the dataset temporally. For example, labevents in MIMIC IV are recorded for hospital admissions, whereas chartevents are recorded for ICU stays. Defining how these two ID systems relate to each other allows them to be mapped to a common time scale (e.g., time since ICU admission). At the same time, knowing that an ICU stay is detailed in icustays in MIMIC and in cases in SICdb makes it possible to define the same semantic reference point in both databases.

The two ID types available in SICdb are the patient and the icustay. There is no separate demarcation of the hospadm. The observation time for a patient ranges from their first observed admission to their death (if it occurred). An icustay ranges from the current admission to the ICU until the end of the stay.
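The id_cfg block from the listing above can be read as a small hierarchy of ID types. The sketch below assumes that position orders the ID types from the coarsest unit (patient) to the finest (icustay). The id_hierarchy helper is hypothetical and only illustrates how the configuration can be inspected.

```python
# ID configuration copied from the listing above.
id_cfg = {
    "patient": {
        "id": "patientid",
        "position": 1,
        "start": "firstadmission",
        "end": "offsetofdeath",
        "table": "cases",
    },
    "icustay": {
        "id": "caseid",
        "position": 2,
        "start": "offsetafterfirstadmission",
        "end": "timeofstay",
        "table": "cases",
    },
}


def id_hierarchy(id_cfg):
    """Return (ID type, ID column) pairs ordered by their position value."""
    ordered = sorted(id_cfg.items(), key=lambda item: item[1]["position"])
    return [(name, spec["id"]) for name, spec in ordered]


print(id_hierarchy(id_cfg))
```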

Calculate origin times for each ID type

After ricu has been told which ID systems exist in the dataset, it also needs to know when they start and end. Much of this process is automated. For SICdb, all origin times are already provided in a format suitable for use with ricu, and thus the default behavior is appropriate for them. However, minor adjustments are often necessary. For example, discharge times in SICdb are not provided as absolute times but in seconds since the start of admission. Such adjustments can be made on a case-by-case basis by subtyping the respective functions (in this case id_win_helper) and overriding the default behavior (Code Listing 6). Since SICdb provides time in seconds since admission, ricu must further be told to work with relative times in seconds. This can be conveniently achieved through existing helper functions (Code Listing 6).
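The idea of working with relative times in seconds can be illustrated independently of ricu. The following Python sketch is not the id_win_helper implementation. It only shows, under the assumption that a stay is described by a start offset and an end given in seconds after that start, how such values translate into a time window and how an event can be re-expressed relative to ICU admission. All function names and numbers are hypothetical.

```python
from datetime import timedelta


def stay_window(start_offset_s, end_after_start_s):
    """Return (start, end) of a stay as timedeltas relative to a common origin."""
    start = timedelta(seconds=start_offset_s)
    return start, start + timedelta(seconds=end_after_start_s)


def since_icu_admission(event_offset_s, start_offset_s):
    """Re-express an event offset relative to the start of the ICU stay."""
    return timedelta(seconds=event_offset_s - start_offset_s)


# Hypothetical values in seconds, not taken from SICdb.
start, end = stay_window(start_offset_s=86_400, end_after_start_s=172_800)
print(start, end)
print(since_icu_admission(event_offset_s=90_000, start_offset_s=86_400))
```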

Following the steps above makes SICdb available and fully usable within ricu. While some further helper functions may be necessary to enable optional functionality such as automatic determination of measurement units (SICdb stores units in a separate reference table that needs to be merged at runtime), these are not essential to the main functionality of ricu. Note that the above only interfaces SICdb. It does not automatically map all existing clinical concepts for SICdb. Defining a clinical concept for SICdb still requires manual mapping of the concept to SICdb data items, for example via the appropriate measurement IDs (see also the next section on adding a clinical concept). We do not expect that this process can ever be fully avoided (unless it has already been performed, for example by mapping to and providing the data in OMOP format). However, we found that the framework and helper functions that ricu provides greatly simplify this process. Additional information on ricu and its design principles can be found in Bennett et al. (2023).
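What manual mapping of a concept to data items amounts to can be sketched in a few lines of Python. This is a conceptual illustration only and not ricu's concept dictionary format. The table name, column names, item IDs, and the select_concept_rows helper are all hypothetical placeholders.

```python
# Hypothetical mapping from a clinical concept to item IDs in one data source.
concepts = {
    "heart_rate": {
        "sources": {
            "example_db": {
                "table": "measurements",  # placeholder table name
                "id_column": "item_id",   # placeholder column holding item IDs
                "ids": [101, 102],        # placeholder item IDs
            },
        },
    },
}


def select_concept_rows(rows, concept, source, concepts=concepts):
    """Keep the rows whose item ID is mapped to the given concept."""
    mapping = concepts[concept]["sources"][source]
    wanted = set(mapping["ids"])
    return [row for row in rows if row[mapping["id_column"]] in wanted]


# Invented example rows standing in for a raw measurement table.
rows = [
    {"item_id": 101, "value": 72.0},
    {"item_id": 999, "value": 36.6},
    {"item_id": 102, "value": 75.0},
]
print(select_concept_rows(rows, "heart_rate", "example_db"))
```

The mapping itself (which item IDs belong to which concept) is the part that has to be curated by hand for every new data source.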