# Loading Data from OMOP Common Data Model into `EHRData`

This tutorial demonstrates how to load electronic health record data from the **OMOP Common Data Model** (CDM) with `ehrdata`.

## What is OMOP CDM?

The [OMOP CDM](https://ohdsi.github.io/CommonDataModel/) standardizes observational healthcare data across different sources to enable reproducible, multi-site analyses.

**Key benefits:**
- **Standardization**: Different hospitals' data formats are harmonized
- **Vocabularies**: Clinical concepts are mapped to standard terminologies (SNOMED, ICD, LOINC)
- **Privacy**: Analysis code can be sent to the data, so patient data does not have to be moved
- **Reproducibility**: The same analysis can run across multiple sites

We'll use the [MIMIC-IV demo dataset](https://physionet.org/content/mimic-iv-demo-omop/0.9/) (100 patients) in OMOP format as our example. 

## Setup and Data Download


```{note}
**ehrdata provides ready-to-use OMOP datasets!**

The `ed.dt.mimic_iv_omop()` function automatically:
1. Downloads real healthcare data in OMOP table format
2. Loads the tables into a database connection
3. Makes them ready for you to construct EHRData objects

This workflow mirrors what you'd do with your own OMOP database:
**Download OMOP tables** → **Connect to database** → **Create EHRData structure** (using arguments explained below)
```


In [2]:
import ehrdata as ed
import duckdb

We'll use ehrdata's built-in `mimic_iv_omop()` function, which downloads the data as OMOP tables and loads them into a database connection:


In [3]:
# Create an in-memory database
con = duckdb.connect(":memory:")

# Download data and load OMOP tables into the database
# This creates tables like: person, measurement, observation, etc.
# The function loads into `con` and returns None, so there is no path to print
ed.dt.mimic_iv_omop(backend_handle=con)


## Quick Look at OMOP Tables

The OMOP CDM organizes data into standardized tables. Let's explore the key tables:


In [4]:
# See what tables we have
tables = con.execute("SHOW TABLES").df()
print(f"Number of tables: {len(tables)}")
print("\nAvailable tables:")
print(tables)

Number of tables: 30

Available tables:
                    name
0              care_site
1             cdm_source
2                 cohort
3      cohort_definition
4                concept
5   concept_relationship
6          condition_era
7   condition_occurrence
8                   cost
9                  death
10       device_exposure
11              dose_era
12              drug_era
13         drug_exposure
14     fact_relationship
15              location
16           measurement
17              metadata
18                  note
19              note_nlp
20           observation
21    observation_period
22     payer_plan_period
23                person
24  procedure_occurrence
25              provider
26              specimen
27          visit_detail
28      visit_occurrence
29            vocabulary


### The `person` table

The `person` table contains demographic information about each patient:


In [5]:
person_df = con.execute("SELECT * FROM person LIMIT 5").df()
person_df.head()

Unnamed: 0,person_id,gender_concept_id,year_of_birth,month_of_birth,day_of_birth,birth_datetime,race_concept_id,ethnicity_concept_id,location_id,provider_id,care_site_id,person_source_value,gender_source_value,gender_source_concept_id,race_source_value,race_source_concept_id,ethnicity_source_value,ethnicity_source_concept_id
0,3589912774911670296,8507,2095,,,,0,38003563,,,,10009628,M,0,,0,HISPANIC/LATINO,2000001408
1,-3210373572193940939,8507,2079,,,,0,38003563,,,,10011398,M,0,,0,HISPANIC/LATINO,2000001408
2,-775517641933593374,8507,2149,,,,8516,0,,,,10004235,M,0,BLACK/AFRICAN AMERICAN,2000001406,,0
3,-2575767131279873665,8507,2050,,,,8516,0,,,,10024043,M,0,BLACK/AFRICAN AMERICAN,2000001406,,0
4,-8970844422700220177,8507,2114,,,,8527,0,,,,10038933,M,0,WHITE,2000001404,,0
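
Columns ending in `_concept_id` are integer keys into the OMOP `concept` table, which holds the human-readable names. The sketch below illustrates that lookup on a hand-written toy database using Python's built-in `sqlite3`, not the DuckDB connection from this tutorial. The two `person` rows are made up for illustration; 8507 and 8532 are the standard OMOP gender concepts for male and female.

```python
import sqlite3

# Toy stand-in for an OMOP database (the tutorial itself uses DuckDB).
con_toy = sqlite3.connect(":memory:")
con_toy.executescript(
    """
    CREATE TABLE person (person_id INTEGER, gender_concept_id INTEGER, year_of_birth INTEGER);
    CREATE TABLE concept (concept_id INTEGER, concept_name TEXT);
    INSERT INTO person VALUES (1, 8507, 2095), (2, 8532, 2083);
    INSERT INTO concept VALUES (8507, 'MALE'), (8532, 'FEMALE');
    """
)

# Resolve each gender_concept_id to its name via the concept table.
rows = con_toy.execute(
    """
    SELECT p.person_id, p.gender_concept_id, c.concept_name
    FROM person AS p
    LEFT JOIN concept AS c ON p.gender_concept_id = c.concept_id
    ORDER BY p.person_id
    """
).fetchall()

for person_id, concept_id, name in rows:
    print(person_id, concept_id, name)
```

The same join should also work against the DuckDB connection `con`, since the demo database contains both a `person` and a `concept` table.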


### The `measurement` table

The `measurement` table contains lab values, vital signs, and other measurements:


In [6]:
measurement_df = con.execute("SELECT * FROM measurement LIMIT 5").df()
measurement_df.head()

Unnamed: 0,measurement_id,person_id,measurement_concept_id,measurement_date,measurement_datetime,measurement_time,measurement_type_concept_id,operator_concept_id,value_as_number,value_as_concept_id,unit_concept_id,range_low,range_high,provider_id,visit_occurrence_id,visit_detail_id,measurement_source_value,measurement_source_concept_id,unit_source_value,value_source_value
0,7620661609057829801,-7437341330444582833,3007913,2113-09-14,2113-09-14 10:41:00,,32856,,586.0,,8876,,,,3697313480337443666,,50801,2000001001,mm Hg,586
1,-6166868866082303206,-2312013739856114142,3012501,2116-07-05,2116-07-05 05:51:00,,32856,,-4.0,,9557,,,,-5005846256467230136,,50802,2000001002,mEq/L,-4
2,-5240588523649662838,-4234372750442829205,3012501,2154-01-02,2154-01-02 20:18:00,,32856,,-2.0,,9557,,,,-5317261811030552609,,50802,2000001002,mEq/L,-2
3,6455538495061268502,8805478484003283429,3012501,2114-06-20,2114-06-20 09:59:00,,32856,,3.0,,9557,,,,4821424463988433938,,50802,2000001002,mEq/L,3
4,8972684215776493449,6339505631013617478,3012501,2111-11-15,2111-11-15 03:09:00,,32856,,-2.0,,9557,,,,-6354047184485090226,,50802,2000001002,mEq/L,-2


## Building an EHRData Object from OMOP

Now let's construct an `EHRData` object from the OMOP database. This happens in three steps:

1. **Setup observations** (`.obs`) - Define what each row represents (patients, visits, etc.)
2. **Setup variables** (`.var` and `.layers`) - Extract clinical measurements as time series
3. **Explore the result** - See how OMOP data maps to EHRData structure


### Step 1: Setup Observations (`.obs`)

The first step is to define what each observation (row) in our dataset represents. We use `ed.io.omop.setup_obs()` for this.

In OMOP, we can choose different observation units:
- `"person"` - Each row is one patient
- `"person_visit_occurrence"` - Each row is one hospital visit
- `"person_observation_period"` - Each row is one observation period

```{important}
**The choice of observation table determines the reference timepoint (t=0) for time series data:**

| Observation Table | Timepoint 0 Reference |
|-------------------|----------------------|
| `"person"` | Patient's birth date |
| `"person_visit_occurrence"` | Start of the hospital visit |
| `"person_observation_period"` | Start of the observation period |

This allows you to leverage the temporal information already defined in the OMOP database to align your time series data to clinically meaningful reference points.
```
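
To make the effect of the reference timepoint concrete, here is a minimal sketch in plain Python. The timestamps and the helper `hours_since` are hypothetical and only illustrate the idea; they are not part of the ehrdata API, and this is not how ehrdata computes offsets internally.

```python
from datetime import datetime

# Hypothetical timestamps for a single patient (not taken from the dataset).
birth = datetime(2083, 1, 1)
visit_start = datetime(2116, 7, 5, 2, 0)
measurement_time = datetime(2116, 7, 5, 5, 51)


def hours_since(reference: datetime, event: datetime) -> float:
    """Elapsed time from the reference timepoint (t=0) to the event, in hours."""
    return (event - reference).total_seconds() / 3600


# With "person_visit_occurrence", t=0 is the visit start:
# the measurement falls a few hours into the visit.
hours_into_visit = hours_since(visit_start, measurement_time)
print(hours_into_visit)

# With "person", t=0 is the birth date:
# the same measurement lies decades after t=0.
years_since_birth = hours_since(birth, measurement_time) / (24 * 365.25)
print(years_since_birth)
```

The same event therefore lands in a very different time interval depending on the observation table, which is why the choice should match the clinical question.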

Let's use `"person_visit_occurrence"` to have one row per hospital visit:

In [7]:
# Step 1: Setup observations from the person and visit_occurrence tables
edata = ed.io.omop.setup_obs(
    backend_handle=con,
    observation_table="person_visit_occurrence",
    death_table=True,  # Include death information
)

print(f"Created EHRData with {edata.n_obs} visits")
edata

Created EHRData with 852 visits


EHRData object with n_obs × n_vars × n_t = 852 × 0 × 1
    obs: 'person_id', 'gender_concept_id', 'year_of_birth', 'month_of_birth', 'day_of_birth', 'birth_datetime', 'race_concept_id', 'ethnicity_concept_id', 'location_id', 'provider_id', 'care_site_id', 'person_source_value', 'gender_source_value', 'gender_source_concept_id', 'race_source_value', 'race_source_concept_id', 'ethnicity_source_value', 'ethnicity_source_concept_id', 'visit_occurrence_id', 'person_id_1', 'visit_concept_id', 'visit_start_date', 'visit_start_datetime', 'visit_end_date', 'visit_end_datetime', 'visit_type_concept_id', 'provider_id_1', 'care_site_id_1', 'visit_source_value', 'visit_source_concept_id', 'admitting_source_concept_id', 'admitting_source_value', 'discharge_to_concept_id', 'discharge_to_source_value', 'preceding_visit_occurrence_id', 'death_date', 'death_datetime', 'death_type_concept_id', 'cause_concept_id', 'cause_source_value', 'cause_source_concept_id'
    uns: 'omop_io_observation_table'

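Let's examine the `.obs` table. With `"person_visit_occurrence"`, each row is one visit: the demographics from the OMOP `person` table are combined with the columns of `visit_occurrence` and, because `death_table=True`, the columns of `death`. A patient with several visits therefore appears in several rows (rows 1 and 3 below share the same `person_id`):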


In [8]:
edata.obs.head()

Unnamed: 0,person_id,gender_concept_id,year_of_birth,month_of_birth,day_of_birth,birth_datetime,race_concept_id,ethnicity_concept_id,location_id,provider_id,...,admitting_source_value,discharge_to_concept_id,discharge_to_source_value,preceding_visit_occurrence_id,death_date,death_datetime,death_type_concept_id,cause_concept_id,cause_source_value,cause_source_concept_id
0,4783904755296699562,8507,2049,,,,2000001401,0,,,...,,,,-3100295671072277114,2116-03-12,2116-03-12 07:45:00,32817.0,0.0,,0.0
1,-6225647829918357531,8532,2083,,,,8527,0,,,...,,,,-2238366349102619807,NaT,NaT,,,,
2,7918537411740862407,8532,2055,,,,8516,0,,,...,,,,-8600570148620615801,NaT,NaT,,,,
3,-6225647829918357531,8532,2083,,,,8527,0,,,...,,,,4782862144633917111,NaT,NaT,,,,
4,7155255168997124770,8507,2086,,,,8527,0,,,...,,,,-3275690209157700579,NaT,NaT,,,,


### Step 2: Setup Variables (`.var` and `.layers`)

Now we extract clinical measurements from the OMOP `measurement` table and organize them as time series data.

We need to specify:
- **`data_tables`**: Which OMOP tables to use (e.g., `measurement`, `observation`)
- **`data_field_to_keep`**: Which field contains the values (e.g., `"value_as_number"`)
- **`interval_length_number`**, **`interval_length_unit`**, and **`num_intervals`**: How to discretize time
- **`aggregation_strategy`**: How to handle multiple measurements in one interval
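
The discretization and aggregation parameters can be pictured with a small, dependency-free sketch. The function `discretize` and the sample events are hypothetical, and the handling of empty intervals (filled with NaN) and of events outside the window (dropped) is an assumption for illustration, not a description of ehrdata's implementation.

```python
from statistics import mean

# Hypothetical (hours since t=0, value) pairs for one observation and one variable.
events = [(0.2, 7.0), (0.8, 9.0), (2.5, 4.0)]

interval_length = 1.0  # plays the role of interval_length_number=1, interval_length_unit="h"
num_intervals = 4      # plays the role of num_intervals


def discretize(events, interval_length, num_intervals, strategy):
    """Bucket events into fixed-width intervals and aggregate each bucket."""
    buckets = [[] for _ in range(num_intervals)]
    for t, value in sorted(events):
        idx = int(t // interval_length)
        if 0 <= idx < num_intervals:  # events outside the window are dropped
            buckets[idx].append(value)
    aggregate = {"last": lambda values: values[-1], "mean": mean}[strategy]
    # Intervals without any event are marked as missing.
    return [aggregate(b) if b else float("nan") for b in buckets]


series_last = discretize(events, interval_length, num_intervals, "last")
series_mean = discretize(events, interval_length, num_intervals, "mean")
print(series_last)
print(series_mean)
```

The first interval contains two events, so `"last"` keeps 9.0 while `"mean"` yields 8.0; the intervals without events stay missing in both cases.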


In [9]:
# Hourly resolution: 24 one-hour intervals, keeping the last value per interval.
# Known issues with this call:
# - ValueError: cannot convert a DataFrame with a non-unique MultiIndex into xarray
# - observation_table="person": no key found (may need the birth date as reference)
# - Casting error
edata = ed.io.omop.setup_variables(
    edata=edata,
    backend_handle=con,
    data_tables=["measurement"],
    data_field_to_keep=["value_as_number"],
    interval_length_number=1,
    interval_length_unit="h",
    num_intervals=24,
    concept_ids="all",
    aggregation_strategy="last",
)


ValueError: cannot convert a DataFrame with a non-unique MultiIndex into xarray

In [27]:
# Step 2: Setup variables from measurements
edata = ed.io.omop.setup_variables(
    edata=edata,
    backend_handle=con,
    data_tables=["measurement"],
    data_field_to_keep={"measurement": "value_as_number"},
    interval_length_number=1,
    interval_length_unit="day",
    num_intervals=30,  # Track 30 days
    aggregation_strategy="mean",  # Average values within each day
    enrich_var_with_feature_info=True,  # Add concept names
)

print(f"\nFinal shape: {edata.n_obs} observations × {edata.n_vars} variables × {edata.n_t} time points")
edata

BinderException: Binder Error: No function matches the given name and argument types 'mean(VARCHAR)'. You might need to add explicit type casts.
	Candidate functions:
	mean(DECIMAL) -> DECIMAL
	mean(SMALLINT) -> DOUBLE
	mean(INTEGER) -> DOUBLE
	mean(BIGINT) -> DOUBLE
	mean(HUGEINT) -> DOUBLE
	mean(INTERVAL) -> INTERVAL
	mean(DOUBLE) -> DOUBLE
	mean(TIMESTAMP) -> TIMESTAMP
	mean(TIMESTAMP WITH TIME ZONE) -> TIMESTAMP WITH TIME ZONE
	mean(TIME) -> TIME
	mean(TIME WITH TIME ZONE) -> TIME WITH TIME ZONE


LINE 6: ... AS value_as_number, MEAN(unit_concept_id) AS unit_concept_id, MEAN(unit_source_value) AS unit_source_value         FROM...
                                                                          ^

The generated query applies `MEAN` to `unit_source_value`, a text (`VARCHAR`) column, and DuckDB provides no `mean` for text, which is why the call fails with `aggregation_strategy="mean"`.

### Step 3: Explore the Result

Let's examine what was constructed from the OMOP database:


First, let's compare the original OMOP `person` table with the `.obs` attribute of our EHRData:


In [None]:
# Original OMOP person table
print("OMOP person table:")
con.execute("SELECT * FROM person LIMIT 5").df()

In [None]:
# EHRData .obs (person columns joined with visit and death columns)
print("EHRData .obs:")
edata.obs.head()

**The `.var` table** contains information about each clinical variable (mapped from OMOP concepts):


In [None]:
edata.var.head(10)

**The `.layers` tensor** contains the time series data (patients × variables × time):


In [None]:
print(f"edata.layers['tem_data'] shape: {edata.layers['tem_data'].shape}")
print(f"Data type: {edata.layers['tem_data'].dtype}")
print("\nExample: Patient 0, Variable 0 over 30 days:")
print(edata.layers["tem_data"][0, 0, :])

## How OMOP Maps to EHRData

Here's how the OMOP CDM tables map to the EHRData structure:

| OMOP CDM | EHRData | Description |
|----------|---------|-------------|
| `person` table | `.obs` rows | Each patient becomes an observation |
| `person` columns | `.obs` columns | Demographics (year of birth, gender, etc.) |
| `measurement` concepts | `.var` rows | Each unique concept_id becomes a variable |
| `measurement` values | `.layers` tensor | Time series data (patients × variables × time) |
| `concept` table | `.var` enrichment | Concept names and metadata |
| Time intervals | `.tem` | Discretized time points |

**Key transformations:**
- OMOP's long format (many rows per patient) → EHRData's tensor format (3D array)
- OMOP's timestamp-based data → Discretized time intervals
- OMOP concept IDs → Human-readable variable names (via enrichment)
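
The long-to-tensor transformation can be sketched without any dependencies. The records below are hypothetical (the concept IDs are borrowed from the `measurement` preview above, the person IDs are made up), and nested lists stand in for the array that ehrdata stores; this illustrates the reshaping idea, not ehrdata's internal code.

```python
# Hypothetical long-format records: (person_id, concept_id, interval index, value).
records = [
    (101, 3012501, 0, -4.0),
    (101, 3012501, 2, -2.0),
    (101, 3007913, 1, 586.0),
    (202, 3012501, 1, 3.0),
]

# Stable axis orderings: one position per person and per concept.
obs_ids = sorted({r[0] for r in records})
var_ids = sorted({r[1] for r in records})
n_t = 3  # number of time intervals

# Start with an all-missing tensor of shape n_obs × n_vars × n_t.
nan = float("nan")
tensor = [[[nan] * n_t for _ in var_ids] for _ in obs_ids]

# Place every long-format value at its (observation, variable, time) position.
for person_id, concept_id, t_idx, value in records:
    i = obs_ids.index(person_id)
    j = var_ids.index(concept_id)
    tensor[i][j][t_idx] = value

print(len(tensor), len(tensor[0]), len(tensor[0][0]))
print(tensor[0][1])  # person 101, concept 3012501 across the three intervals
```

Combinations that never occur in the long table remain NaN, which is why tensors built from sparse clinical data typically contain many missing entries.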


## Summary

In this tutorial, we learned:

- ✅ What the OMOP Common Data Model is and why it's useful
- ✅ How to download and load MIMIC-IV data in OMOP format  
- ✅ How to build an EHRData object step-by-step using:
  - `ed.io.omop.setup_obs()` - Define observation units (.obs)
  - `ed.io.omop.setup_variables()` - Extract time series data (.var, .layers, .tem)
- ✅ How OMOP CDM tables map to the EHRData structure
- ✅ How OMOP's long format transforms into EHRData's 3D tensor

## Next Tutorial

Continue with **[OMOP Machine Learning](omop_ml)** to learn how ehrdata helps you quickly start ML workflows on an OMOP dataset.

## Further Resources

- **[The Book of OHDSI](https://ohdsi.github.io/TheBookOfOhdsi/)** - Comprehensive guide to OHDSI and the OMOP Common Data Model
- **[OMOP CDM Website](https://www.ohdsi.org/data-standardization/)** - Official OHDSI data standardization resources