# Loading Data from OMOP Common Data Model into `EHRData`

This tutorial demonstrates how to load electronic health record data from the **OMOP Common Data Model** (CDM) with `ehrdata`.

## What is OMOP CDM?

The [OMOP CDM](https://ohdsi.github.io/CommonDataModel/) standardizes observational healthcare data across different sources to enable reproducible, multi-site analyses.

**Key benefits:**
- **Standardization**: Different hospitals' data formats are harmonized
- **Vocabularies**: Clinical concepts are mapped to standard terminologies (SNOMED, ICD, LOINC)
- **Privacy**: Analysis code can be sent to the data, so patient data does not have to leave its source site
- **Reproducibility**: The same analysis can run across multiple sites

We'll use the [MIMIC-IV demo dataset](https://physionet.org/content/mimic-iv-demo-omop/0.9/) (100 patients) in OMOP format as our example.


## Setup and Data Download


```{note}
**ehrdata provides ready-to-use OMOP datasets!**

The `ed.dt.mimic_iv_omop()` function automatically:
1. Downloads real healthcare data in OMOP table format
2. Loads the tables into a database connection
3. Makes them ready for you to construct EHRData objects

This workflow mirrors what you'd do with your own OMOP database:
**Download OMOP tables** → **Connect to database** → **Create EHRData structure** (using arguments explained below)
```



In [56]:
import ehrdata as ed
import duckdb

We'll use ehrdata's built-in `mimic_iv_omop()` function, which downloads the data as OMOP tables and loads them into a database connection:


In [57]:
# Create an in-memory database
con = duckdb.connect(":memory:")

# Download the data and load the OMOP tables into the database.
# This creates tables such as person, measurement, and observation.
ed.dt.mimic_iv_omop(backend_handle=con)


## Quick Look at OMOP Tables

The OMOP CDM organizes data into standardized tables. Let's explore the key tables:


In [58]:
# See what tables we have
tables = con.execute("SHOW TABLES").df()
print(f"Number of tables: {len(tables)}")
print("\nAvailable tables:")
print(tables)

Number of tables: 30

Available tables:
                    name
0              care_site
1             cdm_source
2                 cohort
3      cohort_definition
4                concept
5   concept_relationship
6          condition_era
7   condition_occurrence
8                   cost
9                  death
10       device_exposure
11              dose_era
12              drug_era
13         drug_exposure
14     fact_relationship
15              location
16           measurement
17              metadata
18                  note
19              note_nlp
20           observation
21    observation_period
22     payer_plan_period
23                person
24  procedure_occurrence
25              provider
26              specimen
27          visit_detail
28      visit_occurrence
29            vocabulary


### The `person` table

The `person` table contains demographic information about each patient:


In [59]:
person_df = con.execute("SELECT * FROM person LIMIT 5").df()
person_df.head()

Unnamed: 0,person_id,gender_concept_id,year_of_birth,month_of_birth,day_of_birth,birth_datetime,race_concept_id,ethnicity_concept_id,location_id,provider_id,care_site_id,person_source_value,gender_source_value,gender_source_concept_id,race_source_value,race_source_concept_id,ethnicity_source_value,ethnicity_source_concept_id
0,3589912774911670296,8507,2095,,,NaT,0,38003563,,,,10009628,M,0,,0,HISPANIC/LATINO,2000001408
1,-3210373572193940939,8507,2079,,,NaT,0,38003563,,,,10011398,M,0,,0,HISPANIC/LATINO,2000001408
2,-775517641933593374,8507,2149,,,NaT,8516,0,,,,10004235,M,0,BLACK/AFRICAN AMERICAN,2000001406,,0
3,-2575767131279873665,8507,2050,,,NaT,8516,0,,,,10024043,M,0,BLACK/AFRICAN AMERICAN,2000001406,,0
4,-8970844422700220177,8507,2114,,,NaT,8527,0,,,,10038933,M,0,WHITE,2000001404,,0


### The `measurement` table

The `measurement` table contains lab values, vital signs, and other measurements:


In [60]:
measurement_df = con.execute("SELECT * FROM measurement LIMIT 5").df()
measurement_df.head()

Unnamed: 0,measurement_id,person_id,measurement_concept_id,measurement_date,measurement_datetime,measurement_time,measurement_type_concept_id,operator_concept_id,value_as_number,value_as_concept_id,unit_concept_id,range_low,range_high,provider_id,visit_occurrence_id,visit_detail_id,measurement_source_value,measurement_source_concept_id,unit_source_value,value_source_value
0,7620661609057829801,-7437341330444582833,3007913,2113-09-14,2113-09-14 10:41:00,,32856,,586.0,,8876,,,,3697313480337443666,,50801,2000001001,mm Hg,586
1,-6166868866082303206,-2312013739856114142,3012501,2116-07-05,2116-07-05 05:51:00,,32856,,-4.0,,9557,,,,-5005846256467230136,,50802,2000001002,mEq/L,-4
2,-5240588523649662838,-4234372750442829205,3012501,2154-01-02,2154-01-02 20:18:00,,32856,,-2.0,,9557,,,,-5317261811030552609,,50802,2000001002,mEq/L,-2
3,6455538495061268502,8805478484003283429,3012501,2114-06-20,2114-06-20 09:59:00,,32856,,3.0,,9557,,,,4821424463988433938,,50802,2000001002,mEq/L,3
4,8972684215776493449,6339505631013617478,3012501,2111-11-15,2111-11-15 03:09:00,,32856,,-2.0,,9557,,,,-6354047184485090226,,50802,2000001002,mEq/L,-2


## Building an EHRData Object from OMOP

Now let's construct an `EHRData` object from the OMOP database. This happens in three steps:

1. **Setup observations** (`.obs`) - Define what each row represents (patients, visits, etc.)
2. **Setup variables** (`.var` and `.layers`) - Extract clinical measurements as time series
3. **Explore the result** - See how OMOP data maps to EHRData structure


### Step 1: Setup Observations (`.obs`)

The first step is to define what each observation (row) in our dataset represents. We use `ed.io.omop.setup_obs()` for this.

In OMOP, we can choose different observation units:
- `"person"` - Each row is one patient
- `"person_visit_occurrence"` - Each row is one hospital visit
- `"person_observation_period"` - Each row is one observation period
- `"person_cohort"` - Each row is one person in a cohort

```{important}
**The choice of observation table determines the reference timepoint (t=0) for time series data:**

| Observation Table | Timepoint 0 Reference |
|-------------------|----------------------|
| `"person"` | Patient's birth date |
| `"person_visit_occurrence"` | Start of the hospital visit |
| `"person_observation_period"` | Start of the observation period |
| `"person_cohort"` | Start of the cohort |

This allows you to leverage the temporal information already defined in the OMOP database to align your time series data to clinically meaningful reference points.
```
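To make the effect of the reference timepoint concrete, here is a minimal sketch that expresses one measurement timestamp relative to three different choices of t=0. The visit start and measurement time are taken from the example visit used later in this tutorial; the birth date (only the year 2111 is known) and the observation period start are hypothetical values chosen for illustration.

```python
from datetime import datetime

# Candidate t=0 reference points for one patient.
# The birth month/day and the observation period start are hypothetical.
reference_points = {
    "person": datetime(2111, 1, 1),                            # birth date (assumed Jan 1)
    "person_observation_period": datetime(2177, 3, 1),         # assumed period start
    "person_visit_occurrence": datetime(2177, 3, 12, 7, 15),   # visit start
}

measurement_time = datetime(2177, 3, 12, 12, 47)

# Hours elapsed between each reference point and the measurement
hours_since_t0 = {
    name: (measurement_time - t0).total_seconds() / 3600
    for name, t0 in reference_points.items()
}

for name, hours in hours_since_t0.items():
    print(f"{name}: {hours:.2f} hours after t=0")
```

The same measurement lands a few hours after t=0 for a visit-level observation, but hundreds of thousands of hours after t=0 for a person-level observation, so the interval length and number of intervals you choose later should match the observation unit.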

**For this tutorial, we'll use `"person_visit_occurrence"`** because we care about monitoring and modeling individual hospital visits. This is particularly useful when:
- Patients have multiple hospital encounters
- Each visit has distinct clinical characteristics
- You want to predict outcomes at the visit level (e.g., in-hospital mortality, length of stay)

In [61]:
# Step 1: Set up observations, one row per visit (person joined with visit_occurrence)
edata = ed.io.omop.setup_obs(
    backend_handle=con,
    observation_table="person_visit_occurrence",
    death_table=True,  # Include death information
)

print(f"Created EHRData with {edata.n_obs} visits")
edata

Created EHRData with 852 visits


EHRData object with n_obs × n_vars × n_t = 852 × 0 × 1
    obs: 'person_id', 'gender_concept_id', 'year_of_birth', 'month_of_birth', 'day_of_birth', 'birth_datetime', 'race_concept_id', 'ethnicity_concept_id', 'location_id', 'provider_id', 'care_site_id', 'person_source_value', 'gender_source_value', 'gender_source_concept_id', 'race_source_value', 'race_source_concept_id', 'ethnicity_source_value', 'ethnicity_source_concept_id', 'visit_occurrence_id', 'person_id_1', 'visit_concept_id', 'visit_start_date', 'visit_start_datetime', 'visit_end_date', 'visit_end_datetime', 'visit_type_concept_id', 'provider_id_1', 'care_site_id_1', 'visit_source_value', 'visit_source_concept_id', 'admitting_source_concept_id', 'admitting_source_value', 'discharge_to_concept_id', 'discharge_to_source_value', 'preceding_visit_occurrence_id', 'death_date', 'death_datetime', 'death_type_concept_id', 'cause_concept_id', 'cause_source_value', 'cause_source_concept_id'
    uns: 'omop_io_observation_table'

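Let's examine the `.obs` table. Because each row is a visit, it combines patient demographics from the OMOP `person` table with the columns of `visit_occurrence` and, since we set `death_table=True`, of `death`: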


In [62]:
edata.obs.head()

Unnamed: 0,person_id,gender_concept_id,year_of_birth,month_of_birth,day_of_birth,birth_datetime,race_concept_id,ethnicity_concept_id,location_id,provider_id,...,admitting_source_value,discharge_to_concept_id,discharge_to_source_value,preceding_visit_occurrence_id,death_date,death_datetime,death_type_concept_id,cause_concept_id,cause_source_value,cause_source_concept_id
0,-4502092208250381979,8532,2071,,,NaT,8527,0,,,...,TRANSFER FROM HOSPITAL,581476.0,HOME HEALTH CARE,,NaT,NaT,,,,
1,4239478333578644568,8507,2111,,,NaT,8527,0,,,...,PHYSICIAN REFERRAL,581476.0,HOME,,NaT,NaT,,,,
2,-8090189584974691216,8507,2118,,,NaT,8527,0,,,...,EMERGENCY ROOM,581476.0,HOME,,NaT,NaT,,,,
3,2188642953583197091,8532,2102,,,NaT,8527,0,,,...,,,,-6.726268352887624e+18,NaT,NaT,,,,
4,3129727379702505063,8532,2145,,,NaT,8516,0,,,...,EMERGENCY ROOM,,,,NaT,NaT,,,,


### Step 2: Setup Variables (`.var` and `.layers`)

Now we extract clinical measurements from the OMOP `measurement` table and organize them as time series data.

We need to specify:
- **`data_tables`**: Which OMOP tables to use (e.g., `measurement`, `observation`). Here, we are interested in the `measurement` table.
- **`data_field_to_keep`**: Which field contains the values we want to extract (e.g., `"value_as_number"`).
- **`time_precision`**: Whether to look for `date` or `datetime` fields in the OMOP tables. Since we care about hourly intervals, `datetime` is more suitable.
- **`interval_length_number`**, **`interval_length_unit`**, and **`num_intervals`**: How to discretize time. Here, we create 24 intervals of 1 hour each, which covers the first 24 hours after the start of each visit.
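The discretization described above can be sketched in plain Python. This is an illustration of the idea, not ehrdata's implementation: the helper `bin_index` is hypothetical, and the sketch assumes left-closed, right-open bins that start at t=0 and that values outside the window are dropped.

```python
from datetime import timedelta

interval_length_number = 1
interval_length_unit = "h"
num_intervals = 24

# Only the hour unit is handled in this sketch
unit_to_timedelta = {"h": timedelta(hours=1)}
interval = interval_length_number * unit_to_timedelta[interval_length_unit]

# (start, end) offsets from t=0 for each bin
bins = [(i * interval, (i + 1) * interval) for i in range(num_intervals)]


def bin_index(offset):
    """Return the bin an offset from t=0 falls into, or None if outside the window."""
    if timedelta(0) <= offset < num_intervals * interval:
        return offset // interval  # timedelta // timedelta yields an int
    return None


print(len(bins))                                   # 24
print(bin_index(timedelta(hours=5, minutes=32)))   # 5
print(bin_index(timedelta(hours=30)))              # None
```

With these settings the window ends 24 hours after t=0, so anything measured later in a long hospital stay would not appear in the resulting time series.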


In [76]:
edata = ed.io.omop.setup_variables(
    edata=edata,
    layer="tem_data",
    backend_handle=con,
    data_tables=["measurement"],
    data_field_to_keep=["value_as_number"],
    time_precision="datetime",
    interval_length_number=1,
    interval_length_unit="h",
    num_intervals=24,
)



### Step 3: Explore the Result

Let's examine what was constructed from the OMOP database:


First, let's look at a few columns of `.obs`. Each row is one visit, with the patient's demographics from the OMOP `person` table attached:


In [77]:
edata.obs[["person_id", "visit_occurrence_id", "visit_start_datetime", "gender_source_value", "year_of_birth"]]

Unnamed: 0,person_id,visit_occurrence_id,visit_start_datetime,gender_source_value,year_of_birth
0,-4502092208250381979,-9176297757944464068,2154-02-05 17:09:00,F,2071
1,4239478333578644568,-9149771978458038515,2177-03-12 07:15:00,M,2111
2,-8090189584974691216,-9133360720296560252,2174-05-26 04:20:00,M,2118
3,2188642953583197091,-9128519808785176541,2146-08-26 15:36:00,F,2102
4,3129727379702505063,-9127810274408915712,2197-04-16 22:57:00,F,2145
...,...,...,...,...,...
847,7918537411740862407,9159184765222535078,2132-02-12 09:43:00,F,2055
848,-626229666378242477,9165125063680661115,2174-11-28 10:30:00,M,2117
849,4985579811051920670,9197703010583516730,2112-02-05 14:48:00,F,2064
850,7131048714591189903,9197704996243617072,2191-11-18 15:00:00,M,2133


In [78]:
print("Patients and visits:")

print(f"Total patients: {edata.obs['person_id'].nunique()}")
print(f"Total visits: {len(edata)}")


Patients and visits:
Total patients: 100
Total visits: 852
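Since `.obs` has one row per visit, a patient with several visits appears in several rows. A minimal sketch of counting visits per patient follows; the list of ids is a toy stand-in for `edata.obs["person_id"]`, not data from the demo dataset.

```python
from collections import Counter

# Toy stand-in for edata.obs["person_id"]: one entry per visit row
person_ids = [101, 102, 101, 103, 101, 102]

visits_per_person = Counter(person_ids)

# Keep only patients that occur in more than one visit row
multiple_visits = {pid: n for pid, n in visits_per_person.items() if n > 1}

print(f"Patients: {len(visits_per_person)}")              # Patients: 3
print(f"Patients with multiple visits: {multiple_visits}")  # {101: 3, 102: 2}
```

On the real object, calling the pandas method `value_counts()` on `edata.obs["person_id"]` yields the same per-patient counts.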


**The `.var` table** contains information about each clinical variable (mapped from OMOP concepts):


In [79]:
edata.var.head(10)

Unnamed: 0,data_table_concept_id
0,0
1,1175625
2,3000067
3,3000068
4,3000099
5,3000285
6,3000330
7,3000348
8,3000456
9,3000461


#### Understanding `.layers` - Time Series Data

**The `.layers` tensor** contains the time series data with shape (visits × variables × time).


In [80]:
edata.layers

Layers3D with keys: tem_data

In [82]:
print(f"edata.layers['tem_data'] shape: {edata.layers['tem_data'].shape}")
print(f"  → {edata.layers['tem_data'].shape[0]} visits")
print(f"  → {edata.layers['tem_data'].shape[1]} variables")
print(f"  → {edata.layers['tem_data'].shape[2]} time intervals")

edata.layers['tem_data'] shape: (852, 450, 24)
  → 852 visits
  → 450 variables
  → 24 time intervals


Let's explore this for a specific visit and the measurement concept id `3016723`, which represents Creatinine [Mass/volume] in Serum or Plasma.

```{note}
You can look up concept ids such as `3016723` using OHDSI's [Athena](https://athena.ohdsi.org/search-terms/start).
```
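If the vocabulary tables are populated, a concept id can also be resolved with a query against the OMOP `concept` table. The sketch below uses Python's built-in `sqlite3` with a one-row toy `concept` table as a stand-in, so that it runs without the tutorial's database; whether the demo dataset's `concept` table contains this concept is an assumption.

```python
import sqlite3

# sqlite3 stands in for the database connection used in this tutorial.
# The toy table holds only the two OMOP concept columns needed here.
con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE concept (concept_id INTEGER PRIMARY KEY, concept_name TEXT)"
)
con_demo.execute(
    "INSERT INTO concept VALUES (?, ?)",
    (3016723, "Creatinine [Mass/volume] in Serum or Plasma"),
)

# Look up the human-readable name for a concept id
row = con_demo.execute(
    "SELECT concept_name FROM concept WHERE concept_id = ?",
    (3016723,),
).fetchone()
concept_name = row[0] if row is not None else None

print(concept_name)
```

The same `SELECT` statement should carry over to the DuckDB connection `con`, which also accepts `?` placeholders with a list of parameters.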

In [86]:
visit_occurrence_id = -9149771978458038515
variable_concept_id = 3016723

visit_index = edata.obs["visit_occurrence_id"] == visit_occurrence_id
variable_index = edata.var["data_table_concept_id"] == variable_concept_id

# Extract scalar values so they print cleanly
patient_id = edata.obs.loc[visit_index, "person_id"].iloc[0]
visit_start = edata.obs.loc[visit_index, "visit_start_datetime"].iloc[0]


print(f"Example: Visit {visit_occurrence_id} (Patient {patient_id})")
print(f"Visit start: {visit_start}")

# Show the first matching measurement for this visit from the database
print("\nOriginal measurements from OMOP for this visit:")
measurements = con.execute(f"""
    SELECT measurement_datetime, measurement_concept_id, value_as_number
    FROM measurement
    WHERE visit_occurrence_id = {visit_occurrence_id}
    AND measurement_concept_id = {variable_concept_id}
    ORDER BY measurement_datetime
    LIMIT 1
""").df()

measurements

Example: Visit -9149771978458038515 (Patient 4239478333578644568)
Visit start: 2177-03-12 07:15:00

Original measurements from OMOP for this visit:


Unnamed: 0,measurement_datetime,measurement_concept_id,value_as_number
0,2177-03-12 12:47:00,3016723,1.1


Now let's see how to **access this data in the `.layers` tensor**:

Looking at the measurement above:
- **Visit start**: 2177-03-12 07:15:00
- **Measurement time**: 2177-03-12 12:47:00
- **Time elapsed**: ~5.5 hours (5 hours and 32 minutes)

This measurement falls into **interval 5**, the time bin from 5 to 6 hours after visit start. Because the 1-hour intervals start at the visit start time and indexing starts at 0, index 5 is the 6th hourly bin.
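The arithmetic can be checked with the standard library. This sketch assumes that a measurement is assigned to a bin by integer-dividing its elapsed time by the interval length.

```python
from datetime import datetime, timedelta

visit_start = datetime(2177, 3, 12, 7, 15)
measurement_time = datetime(2177, 3, 12, 12, 47)
interval_length = timedelta(hours=1)

elapsed = measurement_time - visit_start
interval_index = elapsed // interval_length  # timedelta // timedelta yields an int

print(elapsed)         # 5:32:00
print(interval_index)  # 5
```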


In [91]:
print(edata[visit_index, variable_index, :].layers["tem_data"])

[[[nan nan nan nan nan 1.1 nan nan nan nan nan nan nan nan nan nan nan
   nan nan nan nan nan nan nan]]]


`ehrapy` provides functions for exploration, visualization, and analysis of data in the `EHRData` structure. Check out its tutorials for analyses that go beyond the basic print statements we used here for simplicity.

## How OMOP Maps to EHRData

Here's how the OMOP CDM tables map to the EHRData structure:

| OMOP CDM | EHRData | Description |
|----------|---------|-------------|
| `person` table (optionally joined with `visit_occurrence`, `observation_period`, or `cohort`) | `.obs` rows | Each patient, visit, observation period, or cohort entry becomes an observation |
| `person` columns | `.obs` columns | Demographics (year of birth, gender, etc.) |
| `measurement` concepts | `.var` rows | Each unique concept_id becomes a variable |
| `measurement` values | `.layers` tensor | Time series data (observations × variables × time) |
| `concept` table | `.var` enrichment | Concept names and metadata |
| Time intervals | `.tem` | Discretized time points |

**Key transformations:**
- OMOP's long format (many rows per patient) → EHRData's tensor format (3D array)
- OMOP's timestamp-based data → Discretized time intervals
- OMOP concept IDs → Human-readable variable names (via enrichment)
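The first two transformations can be sketched on toy data. This is an illustration rather than ehrdata's implementation: the rows are made up, the window is shortened to 8 hourly intervals for readability, and keeping the last value when several fall into the same bin is an assumption.

```python
import math

# Toy long-format rows: (visit_id, concept_id, hours_since_visit_start, value)
rows = [
    ("visit_a", 3016723, 5.53, 1.1),
    ("visit_a", 3012501, 0.40, -2.0),
    ("visit_b", 3016723, 2.10, 0.9),
    ("visit_b", 3016723, 2.80, 1.0),   # same bin as the previous row
    ("visit_b", 3012501, 30.0, 3.0),   # outside the window, dropped
]

interval_hours = 1.0
num_intervals = 8

visits = sorted({r[0] for r in rows})      # becomes the .obs axis
concepts = sorted({r[1] for r in rows})    # becomes the .var axis

# 3D structure (observations x variables x time), filled with NaN for "no value"
tensor = [
    [[math.nan] * num_intervals for _ in concepts]
    for _ in visits
]

for visit, concept, hours, value in rows:
    t = int(hours // interval_hours)
    if 0 <= t < num_intervals:
        # Later rows overwrite earlier ones in the same bin (assumed "last" aggregation)
        tensor[visits.index(visit)][concepts.index(concept)][t] = value

print(len(tensor), len(tensor[0]), len(tensor[0][0]))  # 2 2 8
print(tensor[0][1])  # creatinine series for visit_a
```

Most cells stay NaN because clinical measurements are sparse in time, which matches the printed slice of `tem_data` shown earlier.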


## Summary

In this tutorial, we learned:

- ✅ What the OMOP Common Data Model is and why it's useful
- ✅ How to download and load MIMIC-IV data in OMOP format  
- ✅ How to build an EHRData object step-by-step using:
  - `ed.io.omop.setup_obs()` - Define observation units (.obs)
  - `ed.io.omop.setup_variables()` - Extract time series data (.var, .layers, .tem)
- ✅ How OMOP CDM tables map to the EHRData structure
- ✅ How OMOP's long format transforms into EHRData's 3D tensor

## Next Tutorial

Continue with **[OMOP Machine Learning](omop_ml)** to learn how ehrdata helps you get started with ML workflows on an OMOP dataset.

## Further Resources

- **[The Book of OHDSI](https://ohdsi.github.io/TheBookOfOhdsi/)** - Comprehensive guide to OHDSI and the OMOP Common Data Model
- **[OMOP CDM Website](https://www.ohdsi.org/data-standardization/)** - Official OHDSI data standardization resources