# Loading Data from OMOP Common Data Model into `EHRData`

This tutorial demonstrates how to load electronic health record data from the **OMOP Common Data Model** (CDM) into `ehrdata`.

## What is OMOP CDM?

The [OMOP CDM](https://ohdsi.github.io/CommonDataModel/) standardizes observational healthcare data across different sources to enable reproducible, multi-site analyses.

**Key benefits:**
- **Standardization**: Different hospitals' data formats are harmonized
- **Vocabularies**: Clinical concepts are mapped to standard vocabularies (e.g., SNOMED, LOINC, RxNorm), with source codes such as ICD mapped onto them
- **Privacy**: Analysis code can be sent to the data instead of moving patient data between sites
- **Reproducibility**: The same analysis can run across multiple sites

We'll use the [MIMIC-IV demo dataset](https://physionet.org/content/mimic-iv-demo-omop/0.9/) (100 patients) in OMOP format as our example.


## Setup and Data Download


```{note}
**ehrdata provides ready-to-use OMOP datasets!**

The `ed.dt.mimic_iv_omop()` function automatically:
1. Downloads real healthcare data in OMOP table format
2. Loads the tables into a database connection
3. Makes them ready for you to construct EHRData objects

This workflow mirrors what you'd do with your own OMOP database:
**Download OMOP tables** → **Connect to database** → **Create EHRData structure** (using arguments explained below)
```


In [None]:
import ehrdata as ed
import duckdb

We'll use ehrdata's built-in `ed.dt.mimic_iv_omop()` function, which downloads the data as OMOP tables and loads them into the database connection passed as `backend_handle`:


In [None]:
# Create an in-memory database
con = duckdb.connect(":memory:")

# Download data and load OMOP tables into the database
# This creates tables like: person, measurement, observation, etc.
data_path = ed.dt.mimic_iv_omop(backend_handle=con)
print(f"Data downloaded to: {data_path}")

## Quick Look at OMOP Tables

The OMOP CDM organizes data into standardized tables. Let's explore the key tables:


In [None]:
# See what tables we have
tables = con.execute("SHOW TABLES").df()
print(f"Number of tables: {len(tables)}")
print("\nAvailable tables:")
print(tables)

### The `person` table

The `person` table contains demographic information about each patient:


In [None]:
# Fetch the first five rows of the person table as a pandas DataFrame
person_df = con.execute("SELECT * FROM person LIMIT 5").df()
person_df

### The `measurement` table

The `measurement` table contains lab values, vital signs, and other measurements:


In [None]:
# Fetch the first five rows of the measurement table as a pandas DataFrame
measurement_df = con.execute("SELECT * FROM measurement LIMIT 5").df()
measurement_df
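The `measurement` rows identify what was measured only through an integer `measurement_concept_id`; the readable name lives in the `concept` table. The sketch below shows the join that resolves one to the other. It uses Python's built-in `sqlite3` and invented toy rows (the concept IDs 1001 and 1002 are hypothetical) so that it runs without the demo download.

```python
import sqlite3

# Toy stand-ins for two OMOP tables; all rows are invented for illustration.
toy = sqlite3.connect(":memory:")
toy.executescript("""
    CREATE TABLE concept (concept_id INTEGER, concept_name TEXT);
    CREATE TABLE measurement (
        person_id INTEGER,
        measurement_concept_id INTEGER,
        value_as_number REAL
    );
    INSERT INTO concept VALUES (1001, 'Heart rate'), (1002, 'Body temperature');
    INSERT INTO measurement VALUES (10, 1001, 72.0), (10, 1002, 36.8), (11, 1001, 88.0);
""")

# Resolve each measurement's concept ID to its name.
query = """
    SELECT m.person_id, c.concept_name, m.value_as_number
    FROM measurement AS m
    JOIN concept AS c ON c.concept_id = m.measurement_concept_id
    ORDER BY m.person_id, c.concept_name
"""
named_rows = toy.execute(query).fetchall()
for row in named_rows:
    print(row)
```

The same query text should also work on the tutorial's DuckDB connection via `con.execute(query).df()`, assuming the `concept` table was loaded alongside `measurement`.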

## Building an EHRData Object from OMOP

Now let's construct an `EHRData` object from the OMOP database. This happens in three steps:

1. **Setup observations** (`.obs`) - Define what each row represents (patients, visits, etc.)
2. **Setup variables** (`.var` and `.X`) - Extract clinical measurements as time series
3. **Explore the result** - See how OMOP data maps to EHRData structure


### Step 1: Setup Observations (`.obs`)

The first step is to define what each observation (row) in our dataset represents. We use `ed.io.omop.setup_obs()` for this.

In OMOP, we can choose different observation units:
- `"person"` - Each row is one patient
- `"person_visit_occurrence"` - Each row is one hospital visit
- `"person_observation_period"` - Each row is one observation period
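To make the difference between these units concrete, the following toy sketch counts the rows each choice would produce for two invented patients with three visits between them. It illustrates the idea only and is not how ehrdata builds `.obs` internally; the tables are created with Python's built-in `sqlite3` and contain made-up rows.

```python
import sqlite3

# Two invented patients; patient 1 has two visits, patient 2 has one.
toy = sqlite3.connect(":memory:")
toy.executescript("""
    CREATE TABLE person (person_id INTEGER, year_of_birth INTEGER);
    CREATE TABLE visit_occurrence (visit_occurrence_id INTEGER, person_id INTEGER);
    INSERT INTO person VALUES (1, 1950), (2, 1980);
    INSERT INTO visit_occurrence VALUES (100, 1), (101, 1), (102, 2);
""")

# One row per patient.
n_person_rows = toy.execute("SELECT COUNT(*) FROM person").fetchone()[0]

# One row per (patient, visit) pair.
n_visit_rows = toy.execute("""
    SELECT COUNT(*)
    FROM person AS p
    JOIN visit_occurrence AS v ON v.person_id = p.person_id
""").fetchone()[0]

print(f"person: {n_person_rows} rows, person x visit: {n_visit_rows} rows")
```

With a visit-level unit, a patient with several visits contributes several rows, which matters when later analyses assume rows are independent.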

Let's use `"person"` to have one row per patient:


In [None]:
# Step 1: Setup observations from the person table
edata = ed.io.omop.setup_obs(
    backend_handle=con,
    observation_table="person",
    death_table=True,  # Include death information
)

print(f"Created EHRData with {edata.n_obs} patients")
edata

Let's examine the `.obs` table - it contains patient demographics from the OMOP `person` table:


In [None]:
edata.obs.head()

### Step 2: Setup Variables (`.var` and `.X`)

Now we extract clinical measurements from the OMOP `measurement` table and organize them as time series data.

We need to specify:
- **`data_tables`**: Which OMOP tables to use (e.g., `measurement`, `observation`)
- **`data_field_to_keep`**: Which field contains the values (e.g., `"value_as_number"`)
- **`interval_length_number`**, **`interval_length_unit`**, and **`num_intervals`**: How to discretize time (e.g., 30 intervals of 1 day each)
- **`aggregation_strategy`**: How to handle multiple measurements in one interval
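The interplay of the interval and aggregation parameters can be sketched for a single patient and a single variable. This is a plain-Python illustration of the idea, not ehrdata's implementation: the start time `t0`, the readings, and the choice of NaN for empty intervals are assumptions made for the example.

```python
from datetime import datetime, timedelta
from statistics import mean

t0 = datetime(2020, 1, 1)     # hypothetical start of the time axis
interval = timedelta(days=1)  # interval_length_number=1, interval_length_unit="day"
num_intervals = 3             # kept small for readability

# Invented (timestamp, value) readings for one patient and one variable.
readings = [
    (datetime(2020, 1, 1, 8), 70.0),
    (datetime(2020, 1, 1, 20), 80.0),
    (datetime(2020, 1, 3, 9), 90.0),
]

# Assign each reading to the interval its timestamp falls into.
bins = [[] for _ in range(num_intervals)]
for ts, value in readings:
    idx = (ts - t0) // interval  # timedelta // timedelta yields an int
    if 0 <= idx < num_intervals:
        bins[idx].append(value)

# "mean" aggregation; intervals without readings are marked missing (NaN here).
series = [mean(b) if b else float("nan") for b in bins]
print(series)
```

The two readings on the first day collapse into their average, the second day has no reading and stays missing, and the third day keeps its single value.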


In [None]:
# Step 2: Setup variables from measurements
edata = ed.io.omop.setup_variables(
    edata=edata,
    backend_handle=con,
    data_tables=["measurement"],
    data_field_to_keep={"measurement": "value_as_number"},
    interval_length_number=1,
    interval_length_unit="day",
    num_intervals=30,  # Track 30 days
    aggregation_strategy="mean",  # Average values within each day
    enrich_var_with_feature_info=True,  # Add concept names
)

print(f"\nFinal shape: {edata.n_obs} patients × {edata.n_vars} variables × {edata.n_tem} time points")
edata

### Step 3: Explore the Result

Let's examine what was constructed from the OMOP database:


**The `.var` table** contains information about each clinical variable (mapped from OMOP concepts):


In [None]:
edata.var.head(10)

**The `.X` tensor** contains the time series data (patients × variables × time):


In [None]:
print(f"X shape: {edata.X.shape}")
print(f"Data type: {edata.X.dtype}")
print("\nExample: Patient 0, Variable 0 over 30 days:")
print(edata.X[0, 0, :])
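Most patients have only some measurements on some days, so it is worth checking how much of the tensor is actually filled. The sketch below does this on a small invented array; it assumes, as the cell above suggests, that the data is a NumPy float array, and additionally that missing entries are stored as NaN. The name `toy_X` and its values are made up.

```python
import numpy as np

# Invented tensor: 2 patients x 2 variables x 3 time points, mostly missing.
toy_X = np.full((2, 2, 3), np.nan)
toy_X[0, 0, 0] = 75.0
toy_X[0, 0, 2] = 90.0
toy_X[1, 1, 1] = 36.8

# Share of missing cells over the whole tensor.
missing_fraction = np.isnan(toy_X).mean()

# Number of observed cells per variable, summed over patients and time.
observed_per_variable = (~np.isnan(toy_X)).sum(axis=(0, 2))

print(f"Missing: {missing_fraction:.0%}")
print(f"Observed values per variable: {observed_per_variable}")
```

Under the same assumptions, replacing `toy_X` with `edata.X` would give the corresponding numbers for the real data.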

## How OMOP Maps to EHRData

Here's how the OMOP CDM tables map to the EHRData structure:

| OMOP CDM | EHRData | Description |
|----------|---------|-------------|
| `person` table | `.obs` rows | Each patient becomes an observation |
| `person` columns | `.obs` columns | Demographics (year of birth, gender concept, etc.) |
| `measurement` concepts | `.var` rows | Each unique concept_id becomes a variable |
| `measurement` values | `.X` tensor | Time series data (patients × variables × time) |
| `concept` table | `.var` enrichment | Concept names and metadata |
| Time intervals | `.tem` | Discretized time points |

**Key transformations:**
- OMOP's long format (many rows per patient) → EHRData's tensor format (3D array)
- OMOP's timestamp-based data → Discretized time intervals
- OMOP concept IDs → Human-readable variable names (via enrichment)
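The first two transformations can be sketched end to end on a few invented rows. This is an illustration of the long-to-tensor idea under stated assumptions, not ehrdata's implementation: the rows are made up, the time stamps are assumed to be already converted to an interval index, and mean aggregation with NaN for empty cells is chosen for the example.

```python
import numpy as np

# Invented long-format rows: (person_id, concept_id, day_index, value).
long_rows = [
    (10, 1001, 0, 70.0),
    (10, 1001, 0, 80.0),
    (10, 1002, 1, 36.8),
    (11, 1001, 2, 88.0),
]
num_intervals = 3

# Map person and concept IDs to positions along the first two axes.
persons = sorted({r[0] for r in long_rows})
concepts = sorted({r[1] for r in long_rows})
p_idx = {p: i for i, p in enumerate(persons)}
c_idx = {c: j for j, c in enumerate(concepts)}

# Accumulate sums and counts per (patient, variable, interval) cell.
sums = np.zeros((len(persons), len(concepts), num_intervals))
counts = np.zeros_like(sums)
for person, concept, day, value in long_rows:
    sums[p_idx[person], c_idx[concept], day] += value
    counts[p_idx[person], c_idx[concept], day] += 1

# Mean per cell; cells that received no rows keep their NaN fill.
tensor = np.full(sums.shape, np.nan)
np.divide(sums, counts, out=tensor, where=counts > 0)

print(tensor.shape)
print(tensor[0, 0, :])
```

Four long-format rows become a 2 × 2 × 3 array in which most cells are empty, which is the typical shape of EHR time series after this kind of pivot.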


## Summary

In this tutorial, we learned:

- ✅ What the OMOP Common Data Model is and why it's useful
- ✅ How to download and load MIMIC-IV data in OMOP format
- ✅ How to build an EHRData object step-by-step using:
  - `ed.io.omop.setup_obs()` - Define observation units (.obs)
  - `ed.io.omop.setup_variables()` - Extract time series data (.var, .X, .tem)
- ✅ How OMOP CDM tables map to the EHRData structure
- ✅ How OMOP's long format transforms into EHRData's 3D tensor

## Where to go next

- **[Machine Learning on OMOP Data](omop_ml)** - Apply PyPOTS models to predict ICU mortality from OMOP data with cohort building and time series analysis.

## Further resources

- See [The Book of OHDSI](https://ohdsi.github.io/TheBookOfOhdsi/) for more on the OMOP CDM
