# Example ETL via MEDS-Extract

In this example, we'll use MEDS-Extract to convert some synthetic raw data into the MEDS format. To start, let's inspect the raw input files:

In [10]:
from pathlib import Path

from pretty_print_directory import PrintConfig, print_directory

DATA_ROOT = Path("raw_data")

In [5]:
print_directory(DATA_ROOT)

├── diagnoses.csv
├── labs_vitals.csv
├── medications.csv
└── patients.csv


We can see there are four `csv` files; let's look at what they contain, using [`polars`](https://docs.pola.rs/) as our dataframe engine.

In [6]:
import polars as pl
from IPython.display import display  # This is used for nice displays later

dfs = {}
for fn in ("diagnoses", "labs_vitals", "medications", "patients"):
    fp = DATA_ROOT / f"{fn}.csv"
    dfs[fn] = pl.read_csv(fp)
    print(f"{fn}: ")
    display(dfs[fn].head(2))

diagnoses: 


patient_id,diagnosis_code,timestamp
i64,str,str
1,"""E11.9""","""2024-04-03T00:00:00"""
1,"""I10""","""2024-08-30T00:00:00"""


labs_vitals: 


test_name,patient_id,timestamp,result
str,i64,str,f64
"""Systolic BP (mmHg)""",1,"""2024-08-15T01:25:00""",153.04
"""Systolic BP (mmHg)""",1,"""2025-06-10T16:08:00""",221.74


medications: 


medication_name,dose,patient_id,timestamp
str,str,i64,str
"""Metformin""","""500 mg""",1,"""2025-03-16T00:00:00"""
"""Lisinopril""","""10 mg""",1,"""2025-03-30T00:00:00"""


patients: 


patient_id,eye_color,hair_color,dob,dod
i64,str,str,str,str
1,"""brown""","""blond""","""1954-01-24T00:00:00""","""2018-11-01T00:00:00"""
2,"""blue""","""blond""","""2009-02-19T00:00:00""",


Following the conventions of MEDS-Extract, we'll use the event conversion configuration file stored in `event_cfg.yaml` in this directory to parse these data:

In [7]:
print(Path("event_cfg.yaml").read_text())

subject_id_col: patient_id

patients:
  eye_color:
    code:
      - EYE_COLOR
      - col(eye_color)
    time: null
  hair_color:
    code:
      - HAIR_COLOR
      - col(hair_color)
    time: null
  dob:
    code: MEDS_BIRTH
    time: col(dob)
    time_format: "%Y-%m-%dT%H:%M:%S"
  dod:
    code: MEDS_DEATH
    time: col(dod)
    time_format: "%Y-%m-%dT%H:%M:%S"

labs_vitals:
  lab:
    code: col(test_name)
    time: col(timestamp)
    time_format: "%Y-%m-%dT%H:%M:%S"
    numeric_value: col(result)

medications:
  med:
    code:
      - col(medication_name)
      - col(dose)
    time: col(timestamp)
    time_format: "%Y-%m-%dT%H:%M:%S"
    numeric_value: col(dose)

diagnoses:
  dx:
    code: col(diagnosis_code)
    time: col(timestamp)
    time_format: "%Y-%m-%dT%H:%M:%S"
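
To make the configuration semantics more concrete, here is a minimal plain-Python sketch of how one raw row could be turned into events under a block like `patients` above. This is not MEDS-Extract's implementation. The helper names `resolve` and `row_to_events` are hypothetical, the `//` separator for list-valued codes is inferred from the extracted output shown later in this example, and dropping an event whose time column is null is an assumption. `numeric_value` handling is omitted for brevity.

```python
import re
from datetime import datetime

# Matches references of the form `col(name)` used in the event config.
COL_REF = re.compile(r"^col\((.+)\)$")


def resolve(spec, row):
    """Return the row value for `col(name)` specs, else the literal string."""
    match = COL_REF.match(spec)
    return row[match.group(1)] if match else spec


def row_to_events(row, subject_id_col, event_cfgs):
    """Hypothetical helper: build event dicts for one raw row."""
    events = []
    for cfg in event_cfgs.values():
        code_spec = cfg["code"]
        parts = code_spec if isinstance(code_spec, list) else [code_spec]
        values = [resolve(part, row) for part in parts]
        if any(value is None for value in values):
            continue  # Assumption: a missing code component drops the event.
        time = None
        if cfg.get("time") is not None:
            raw_time = resolve(cfg["time"], row)
            if raw_time is None:
                continue  # Assumption: a null time column drops the event.
            time = datetime.strptime(raw_time, cfg["time_format"])
        events.append(
            {
                "subject_id": row[subject_id_col],
                "time": time,  # None marks a static (untimed) event.
                "code": "//".join(str(value) for value in values),
            }
        )
    return events


TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"
patients_cfg = {
    "eye_color": {"code": ["EYE_COLOR", "col(eye_color)"], "time": None},
    "hair_color": {"code": ["HAIR_COLOR", "col(hair_color)"], "time": None},
    "dob": {"code": "MEDS_BIRTH", "time": "col(dob)", "time_format": TIME_FORMAT},
    "dod": {"code": "MEDS_DEATH", "time": "col(dod)", "time_format": TIME_FORMAT},
}

# The second row of patients.csv shown above; the empty `dod` is read as null.
row = {
    "patient_id": 2,
    "eye_color": "blue",
    "hair_color": "blond",
    "dob": "2009-02-19T00:00:00",
    "dod": None,
}
events = row_to_events(row, "patient_id", patients_cfg)
for event in events:
    print(event)
```

Under these assumptions, the row yields two static events and one `MEDS_BIRTH` event, and no `MEDS_DEATH` event because `dod` is missing.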



Now we can run the pipeline with the standard MEDS-Transforms command-line syntax, adding the extra properties that MEDS-Extract needs. These properties include:

1. Standard MEDS-Transforms properties:
  - The `input_dir` (which can also be specified via `dataset.root_dir`)
  - The `output_dir`
2. MEDS-Extract-specific properties:
  - The `event_conversion_config_fp`
  - The dataset's name (nested within `dataset.name` or `etl_metadata.dataset_name`)
  - The dataset's version (nested within `dataset.version` or `etl_metadata.dataset_version`)

```bash
MEDS_transform-pipeline \
    pkg://MEDS_extract.configs._extract.yaml \
    --overrides \
    input_dir=raw_data \
    output_dir=MEDS_output \
    event_conversion_config_fp=event_cfg.yaml \
    dataset.name=EXAMPLE \
    dataset.version=1.0
```

In [8]:
%%bash
MEDS_transform-pipeline \
    pkg://MEDS_extract.configs._extract.yaml \
    --overrides \
    input_dir=raw_data \
    output_dir=MEDS_output \
    event_conversion_config_fp=event_cfg.yaml \
    dataset.name=EXAMPLE \
    dataset.version=1.0

The command exits silently, which is a good sign, but let's see what is now in the output directory. We'll start with the final data and metadata directories, omitting logs to keep the output small:

In [13]:
output_data_root = Path("MEDS_output/data")
print_directory(output_data_root, PrintConfig(ignore_regex=r"\.logs"))

├── held_out
│   └── 0.parquet
├── train
│   └── 0.parquet
└── tuning
    └── 0.parquet


In [14]:
output_metadata_root = Path("MEDS_output/metadata")
print_directory(output_metadata_root, PrintConfig(ignore_regex=r"\.logs"))

├── .shards.json
├── codes.parquet
├── dataset.json
└── subject_splits.parquet


Let's look at the first few rows of each data shard:

In [16]:
for fp in output_data_root.rglob("*.parquet"):
    print(fp.relative_to(output_data_root))
    display(pl.read_parquet(fp).head(6))

held_out/0.parquet


subject_id,time,code,numeric_value
i64,datetime[μs],str,f32
7,,"""EYE_COLOR//blue""",
7,,"""HAIR_COLOR//red""",
7,1973-01-24 00:00:00,"""MEDS_BIRTH""",
7,2023-07-18 00:00:00,"""B20""",
7,2024-07-20 19:26:00,"""Hemoglobin A1c (%)""",11.59
7,2024-09-13 06:51:00,"""Glucose (mg/dL)""",178.5


train/0.parquet


subject_id,time,code,numeric_value
i64,datetime[μs],str,f32
1,,"""EYE_COLOR//brown""",
1,,"""HAIR_COLOR//blond""",
1,1954-01-24 00:00:00,"""MEDS_BIRTH""",
1,2018-11-01 00:00:00,"""MEDS_DEATH""",
1,2024-04-03 00:00:00,"""E11.9""",
1,2024-07-30 03:37:00,"""Systolic BP (mmHg)""",169.039993


tuning/0.parquet


subject_id,time,code,numeric_value
i64,datetime[μs],str,f32
3,,"""EYE_COLOR//gray""",
3,,"""HAIR_COLOR//brown""",
3,1967-04-17 00:00:00,"""MEDS_BIRTH""",
3,2023-04-14 00:00:00,"""J45.909""",
3,2024-07-17 00:10:00,"""Creatinine (mg/dL)""",0.86
3,2024-10-10 18:57:00,"""Creatinine (mg/dL)""",1.05
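
In the shards displayed above, each subject's rows appear with the static (null-time) events first, followed by timed events in ascending order. A minimal sketch of a check for that ordering, in plain Python, is below. The function name `is_sorted_meds` is hypothetical, and the sample rows are copied from the `train/0.parquet` output above.

```python
from datetime import datetime


def is_sorted_meds(rows):
    """Return True if rows are ordered by subject_id, then time, nulls first."""

    def key(row):
        time = row["time"]
        # The middle element places null-time rows ahead of timed rows.
        if time is None:
            return (row["subject_id"], 0, datetime.min)
        return (row["subject_id"], 1, time)

    keys = [key(row) for row in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


# First rows of train/0.parquet as displayed above.
train_rows = [
    {"subject_id": 1, "time": None, "code": "EYE_COLOR//brown"},
    {"subject_id": 1, "time": None, "code": "HAIR_COLOR//blond"},
    {"subject_id": 1, "time": datetime(1954, 1, 24), "code": "MEDS_BIRTH"},
    {"subject_id": 1, "time": datetime(2018, 11, 1), "code": "MEDS_DEATH"},
    {"subject_id": 1, "time": datetime(2024, 4, 3), "code": "E11.9"},
]

print(is_sorted_meds(train_rows))
print(is_sorted_meds(train_rows[::-1]))
```

To run the same check on a real shard, the rows could be supplied with `pl.read_parquet(fp).iter_rows(named=True)`.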


In [18]:
print((output_metadata_root / "dataset.json").read_text())

{"dataset_name": "EXAMPLE", "dataset_version": "1.0", "etl_name": "MEDS_transforms", "etl_version": "0.6.0", "meds_version": "0.4.0", "created_at": "2025-07-10T15:23:15.287066+00:00"}
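
Since `dataset.json` is ordinary JSON, its fields can be read programmatically. The sketch below parses the exact string printed above with the standard library; in practice the text would come from `(output_metadata_root / "dataset.json").read_text()`.

```python
import json
from datetime import datetime

# The contents of dataset.json as printed above.
raw = (
    '{"dataset_name": "EXAMPLE", "dataset_version": "1.0", '
    '"etl_name": "MEDS_transforms", "etl_version": "0.6.0", '
    '"meds_version": "0.4.0", "created_at": "2025-07-10T15:23:15.287066+00:00"}'
)

meta = json.loads(raw)
# `created_at` is an ISO 8601 timestamp with an explicit UTC offset.
created_at = datetime.fromisoformat(meta["created_at"])

print(meta["dataset_name"], meta["dataset_version"])
print(meta["etl_name"], meta["etl_version"], meta["meds_version"])
print(created_at.date())
```

The `dataset_name` and `dataset_version` fields match the `dataset.name` and `dataset.version` overrides passed on the command line.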


In [19]:
display(pl.read_parquet(output_metadata_root / "codes.parquet"))

code,description,parent_codes
str,str,list[str]


We can see that by default, the codes file has the right schema but is empty, as we extracted no metadata in this pipeline.

In [20]:
display(pl.read_parquet(output_metadata_root / "subject_splits.parquet"))

subject_id,split
i64,str
9,"""train"""
4,"""train"""
5,"""train"""
8,"""train"""
6,"""train"""
1,"""train"""
2,"""train"""
10,"""train"""
3,"""tuning"""
7,"""held_out"""
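
As a quick sanity check on the split assignment, the sketch below tallies the table above in plain Python and compares it against the subjects seen in the shard previews earlier. The values are copied from the displayed outputs; the variable names are only illustrative.

```python
from collections import Counter

# subject_id -> split, copied from subject_splits.parquet as displayed above.
splits = {
    9: "train", 4: "train", 5: "train", 8: "train", 6: "train",
    1: "train", 2: "train", 10: "train", 3: "tuning", 7: "held_out",
}

counts = Counter(splits.values())
fractions = {split: n / len(splits) for split, n in counts.items()}
print(counts)
print(fractions)

# First subject seen in each data shard preview earlier in this example.
shard_first_subject = {"held_out": 7, "train": 1, "tuning": 3}
consistent = all(
    splits[subject] == split for split, subject in shard_first_subject.items()
)
print(consistent)
```

For this run the ten subjects are divided 8/1/1 across `train`, `tuning`, and `held_out`, and the subjects previewed in each shard agree with the table. With polars, `df.group_by("split").len()` on the loaded table should give the same counts.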
