# Extracting, Representing, Validating and Visualizing Data from an OMOP CDM Database with ehrdata, lamin and Vitessce

## Background

In a nutshell, this tutorial shows how to:
1. Extract data from a database of the [OMOP Common Data Model](https://ohdsi.github.io/CommonDataModel/index.html)
2. Represent this data in an [ehrdata](https://ehrdata.readthedocs.io/en/latest/#) object
3. Validate this ehrdata object using [lamin](https://lamin.ai/) functionality (optional but recommended)
4. Visualize this data with [Vitessce](https://vitessce.io/), either in a notebook or on cloud storage via lamin hub.

### OMOP
[OMOP](https://ohdsi.github.io/CommonDataModel/index.html) is a common data model for observational health data, maintained by [OHDSI](https://www.ohdsi.org/).

#### The Example Dataset used: MIMIC IV OMOP Demo Dataset
Dataset available on [Physionet](https://physionet.org/content/mimic-iv-demo-omop/0.9/).

Dataset:<br>
Kallfelz, M., Tsvetkova, A., Pollard, T., Kwong, M., Lipori, G., Huser, V., Osborn, J., Hao, S., & Williams, A. (2021). MIMIC-IV demo data in the OMOP Common Data Model (version 0.9). PhysioNet. https://doi.org/10.13026/p1f5-7x35.

Physionet:<br>
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

### Extract
This notebook guides you through extracting data from a database that follows the OMOP CDM.

### Represent
See [ehrdata](https://ehrdata.readthedocs.io/en/latest/#) for more information on ehrdata.

### Validate
See [lamin](https://lamin.ai/) for more information on lamin.

### Visualize
See [Vitessce](https://vitessce.io/) for more information on Vitessce.

## The extraction workflow

Here, we use [duckdb](https://duckdb.org/)'s Python API to load the CSV tables as they are available from the link above. A database is not strictly needed for this small demo, but it illustrates the workflow.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import anndata as ad
import duckdb
import ehrapy as ep
import ehrdata as ed
import ehrdata
import numpy as np

In [4]:
from widgets import create_single_option_widget, create_multiple_options_widget

In [5]:
options = ["person", "person_cohort", "person_observation_period", "person_visit_occurrence"]


ui_obs, selected_obs = create_single_option_widget(
    title_text="Please select which table should be used for .obs indexing in ehrdata:",
    options=options,
)

In [6]:
options = ["measurement", "observation", "drug_occurrence"]
ui_var, selected_vars = create_multiple_options_widget(title_text="Options to use as the variables", options=options)

In [7]:
options = [
    "pca",
    "umap",
]
ui_emb, selected_emb = create_single_option_widget(
    title_text="Please select which embedding do you want to use:",
    options=options,
)

### Set up a local database connection

In [8]:
con = duckdb.connect()

Load the data into your database

In [9]:
ehrdata.dt.mimic_iv_omop(backend_handle=con)

[93m![0m File ehrapy_data/mimic-iv-demo-data-in-the-omop-common-data-model-0.9/mimic-iv-demo-data-in-the-omop-common-data-model-0.9 already exists! Using already downloaded dataset...


See what tables there are

In [10]:
tables = con.execute("SHOW TABLES;").df()
tables

| | name |
|---|---|
| 0 | care_site |
| 1 | cdm_source |
| 2 | cohort |
| 3 | cohort_definition |
| 4 | concept |
| 5 | concept_relationship |
| 6 | condition_era |
| 7 | condition_occurrence |
| 8 | cost |
| 9 | death |


In [28]:
display(ui_obs)


In [34]:
# Build .obs from the table selected in the widget above; every option
# is passed to setup_obs unchanged.
# TODO: check whether "person_cohort" yields an empty (0 × 0) object here.
edata = ed.io.omop.setup_obs(con, selected_obs.value)

edata

EHRData object with n_obs × n_vars = 100 × 0
    obs: 'person_id', 'gender_concept_id', 'year_of_birth', 'month_of_birth', 'day_of_birth', 'birth_datetime', 'race_concept_id', 'ethnicity_concept_id', 'location_id', 'provider_id', 'care_site_id', 'person_source_value', 'gender_source_value', 'gender_source_concept_id', 'race_source_value', 'race_source_concept_id', 'ethnicity_source_value', 'ethnicity_source_concept_id', 'observation_period_id', 'person_id_1', 'observation_period_start_date', 'observation_period_end_date', 'period_type_concept_id'
    uns: 'omop_io_observation_table'

#### Interlude - Irregularly sampled time series data
Electronic health records can be regarded as irregularly sampled time series; that is, they model a person through observations taken at irregular points in time.

Following notation and explanation from [Horn et al.](https://proceedings.mlr.press/v119/horn20a.html), a time series of a patient can be described as a set of tuples (t, z, m), where t denotes the time, z the observed value, and m a modality description of the measurement.

The time series can have different lengths, and a "typical" number of observed values might not exist.

Generally, an irregularly-sampled time series can be converted into a missing data problem by discretizing the time axis into non-overlapping intervals, and declaring intervals in which no data was sampled as missing (Bahadori & Lipton, 2019, as cited in [Horn et al.](https://proceedings.mlr.press/v119/horn20a.html)).
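
The discretization described above can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not how ehrdata implements it: the helper `discretize`, the toy patient, and the choice to keep the last value per interval are all hypothetical.

```python
import math


def discretize(series, interval_length, num_intervals):
    """Bin (t, z, m) tuples into fixed intervals, keeping the last value per bin.

    Returns a dict mapping each modality m to a list with one entry per
    interval; intervals without any observation hold NaN.
    """
    result = {}
    # Sort by time so that later observations overwrite earlier ones.
    for t, z, m in sorted(series, key=lambda obs: obs[0]):
        idx = int(t // interval_length)
        if not 0 <= idx < num_intervals:
            continue  # observation lies outside the covered time window
        row = result.setdefault(m, [math.nan] * num_intervals)
        row[idx] = z
    return result


# Toy patient: times in days since the start of observation (made-up values).
patient = [
    (1.0, 80.0, "heart_rate"),
    (3.5, 92.0, "heart_rate"),
    (45.0, 7.4, "ph"),
    (70.0, 75.0, "heart_rate"),
]

grid = discretize(patient, interval_length=20, num_intervals=4)
print(grid)
```

Here the two heart-rate values in the first 20 days collapse into one entry, and every interval without a measurement stays NaN, which is exactly the missing data problem mentioned above.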

In [35]:
display(ui_var)


In [36]:
edata = ed.io.omop.setup_variables(
    edata=edata,
    backend_handle=con,
    data_tables=list(selected_vars.value),
    data_field_to_keep=["value_as_number"],
    interval_length_number=20,
    interval_length_unit="day",
    num_intervals=10,
    concept_ids="all",
    aggregation_strategy="last",
    enrich_var_with_feature_info=True,
    enrich_var_with_unit_info=False,
)
edata.uns["unit_report_measurement"]


| | concept_id | unit_concept_id | no_units | multiple_units |
|---|---|---|---|---|
| 0 | 3007733 | 9557 | False | False |
| 1 | 3006175 | | False | False |
| 2 | 3009201 | 9093 | False | False |
| 3 | 3014037 | 8784 | False | False |
| 4 | 3004295 | 8840 | False | False |
| ... | ... | ... | ... | ... |
| 436 | 3022094 | | False | False |
| 437 | 3024463 | 9461 | False | False |
| 438 | 4046245 | | False | False |
| 439 | 3002091 | 9550 | False | False |


In [98]:
edata

EHRData object with n_obs × n_vars × n_t = 100 × 450 × 10
    obs: 'person_id', 'gender_concept_id', 'year_of_birth', 'month_of_birth', 'day_of_birth', 'birth_datetime', 'race_concept_id', 'ethnicity_concept_id', 'location_id', 'provider_id', 'care_site_id', 'person_source_value', 'gender_source_value', 'gender_source_concept_id', 'race_source_value', 'race_source_concept_id', 'ethnicity_source_value', 'ethnicity_source_concept_id', 'observation_period_id', 'person_id_1', 'observation_period_start_date', 'observation_period_end_date', 'period_type_concept_id'
    var: 'data_table_concept_id', 'concept_id', 'concept_name', 'domain_id', 'vocabulary_id', 'concept_class_id', 'standard_concept', 'concept_code', 'valid_start_date', 'valid_end_date', 'invalid_reason'
    tem: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
    uns: 'omop_io_observation_table', 'unit_report_measurement'
    shape of .X: (100, 450)
    shape of .R: (100, 450, 10)

In [99]:
edata.X = np.nanmean(edata.R, 2)
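
The cell above collapses the time axis of the three-dimensional `.R` array into the two-dimensional `.X` matrix. A small sketch on a toy array (made-up values; the names `R` and `X` here are local stand-ins, not the `edata` attributes) shows what `np.nanmean` does along axis 2:

```python
import numpy as np

# Toy tensor with shape (n_obs=2, n_vars=2, n_t=3); NaN marks an empty interval.
R = np.array(
    [
        [[1.0, np.nan, 3.0], [np.nan, np.nan, 4.0]],
        [[2.0, 2.0, 2.0], [5.0, np.nan, 7.0]],
    ]
)

# Average over the time axis (axis 2), ignoring NaNs.
X = np.nanmean(R, axis=2)

# Fraction of intervals without any value, per (observation, variable) pair.
missing_fraction = np.isnan(R).mean(axis=2)

print(X)
print(missing_fraction)
```

Note that when all intervals of an (observation, variable) pair are NaN, `np.nanmean` returns NaN for that entry and emits a `RuntimeWarning`, so missing values can still be present in `.X` afterwards.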

In [None]:
ep.pl.missing_values_matrix(edata)


In [None]:
ep.pp.explicit_impute(edata, replacement=0)

### Lamin Validation



In [None]:
!lamin connect theislab/ehr

In [None]:
import omop as op
import pandas as pd

#### Prepare inputs

In [None]:
edata.var.rename(columns={0: "concept_id"}, inplace=True)

In [None]:
# Concept vocabulary from OMOP, restricted to the concepts that intersect with edata.var
omop_concepts = pd.read_csv("./metadata/omop_validation_slice.csv")

In [None]:
edata.var.concept_id.isin(omop_concepts.concept_id)

In [None]:
pd.merge(edata.var, omop_concepts, on="concept_id", how="inner")
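
The two cells above check which variables are covered by the concept vocabulary. The following sketch shows the same idea on toy data; the frames `var` and `vocabulary` and the concept names are hypothetical stand-ins for `edata.var` and `omop_concepts`. A left merge with `indicator=True` is used here as one possible way to keep unmatched concepts visible, whereas the inner merge above drops them silently.

```python
import pandas as pd

# Toy stand-ins (hypothetical rows) for `edata.var` and the concept vocabulary.
var = pd.DataFrame({"concept_id": [3007733, 3006175, 999]})
vocabulary = pd.DataFrame(
    {
        "concept_id": [3007733, 3006175, 3009201],
        "concept_name": ["concept A", "concept B", "concept C"],
    }
)

# Boolean mask: which variables have an entry in the vocabulary?
known = var["concept_id"].isin(vocabulary["concept_id"])

# A left merge with indicator=True marks variables without a vocabulary entry.
merged = var.merge(vocabulary, on="concept_id", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "concept_id"].tolist()

print(known.tolist())
print(unmatched)
```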

In [None]:
# change the type to match `omop.Concept` typing
omop_concepts = omop_concepts.astype(
    {
        "standard_concept": "str",
        "invalid_reason": "str",
    }
)

In [None]:
DEFAULTS_VALUES_VAR = {
    "concept_id": int,
    "concept_name": str,
    "domain_id": str,
    "vocabulary_id": str,
    "concept_class": str,
    "standard_concept": (str, type(None)),
    "concept_code": str,
    "valid_start_date": str,
    "valid_end_date": str,
    "invalid_reason": (str, type(None)),
}

for column, expected_type in DEFAULTS_VALUES_VAR.items():
    # Check if the column exists in the DataFrame
    if column not in omop_concepts.columns:
        msg = f"Required column '{column}' is missing from the DataFrame."
        raise ValueError(msg)

    # Adjust type check for string columns (object is the pandas dtype for strings)
    if expected_type is str:
        if omop_concepts[column].dtype != "object":
            msg = f"Column '{column}' has incorrect data type. Expected string (object in pandas)."
            raise TypeError(msg)
    elif isinstance(expected_type, tuple):  # For optional fields (e.g., str or None)
        if not omop_concepts[column].map(lambda x, expected_type=expected_type: isinstance(x, expected_type)).all():
            msg = f"Column '{column}' has incorrect data type. Expected one of {expected_type}."
            raise TypeError(msg)
    elif not omop_concepts[column].map(lambda x, expected_type=expected_type: isinstance(x, expected_type)).all():
        msg = f"Column '{column}' has incorrect data type. Expected {expected_type.__name__}."
        raise TypeError(msg)

#### Push to lamin

In [None]:
# Skip this cell if the concepts have already been pushed to lamin
concepts = [op.Concept(**row.to_dict()) for _, row in omop_concepts.iterrows()]
for concept in concepts:
    concept.save()

#### EHR curator

In [None]:
curator = ehrdata.tl.EHRCurator(
    edata=edata,
    concepts_var_column="concept_id",
)

In [None]:
edata = curator.validate_adata(op)

In [None]:
edata.var.valid_concept_id.value_counts().plot(kind="bar")

#### Visualization

In [None]:
display(ui_emb)

In [None]:
ep.pp.pca(edata)
if selected_emb.value == "umap":
    ep.pp.neighbors(edata)
    ep.tl.umap(edata)

In [None]:
adata = ad.AnnData(X=edata.X, obs=edata.obs, var=edata.var)

#### Q: Why is any of this interesting?
#### A: Because ehrapy, and future tools in its ecosystem (much like scanpy's), can now access this data directly!

In [None]:
ct = ep.tl.CohortTracker(
    edata,
    columns=[
        "gender_concept_id",
        "year_of_birth",
        "race_concept_id",
        "period_type_concept_id",
    ],
    categorical=["gender_concept_id", "race_concept_id", "period_type_concept_id"],
)

ct(edata)

ct.plot_cohort_barplot(
    legend_labels={
        # 0: "Unknown",
        # 8516: "Black or African American",
        # "year_of_birth": "Birthyear (artificial)",
        # 8507: "Male",
        # 8532: "Female",
    },
    legend_subtitles_names={"gender_concept_id": "Gender"},
)

### Visualization with Vitessce in notebook

1. Import dependencies

In [None]:
from pathlib import Path

from vitessce.data_utils import optimize_adata, VAR_CHUNK_SIZE

2. Save the AnnData object to Zarr

In [None]:
zarr_filepath = Path("data", "processed_ehrdata.zarr")

In [None]:
if not zarr_filepath.is_dir():
    edata = optimize_adata(
        edata,
        obs_cols=["gender_concept_id", "race_concept_id"],
        obsm_keys=["X_pca", "X_umap"],
        optimize_X=True,
    )
    edata.write_zarr(zarr_filepath, chunks=[edata.shape[0], VAR_CHUNK_SIZE])
else:
    print(f"path exists, did not write new file: {zarr_filepath}")

3. Create a Vitessce view config

In [None]:
import ehrdata.pl.vitessce

vc = ehrdata.pl.vitessce.gen_config(zarr_filepath)

4. Create the Vitessce widget

In [None]:
import lamindb as ln

In [None]:
ln.connect("theislab/ehr")

In [None]:
vw = vc.widget()
vw

It should look like this:

![](../_static/tutorial_images/vitessce_screenshot.png)

### Visualization with Vitessce on lamin
Uploading the dataset to lamin makes it even easier to share it and to explore it together. Combined with lamin's dedicated validation functionality, which we might extend, this makes lamin + ehrdata a powerful combination.

**This requires connecting to lamindb from the terminal first:**
```
lamin login <credentials>
```

In [None]:
import lamindb as ln

In [None]:
zarr_artifact = ln.Artifact(
    zarr_filepath,
    description="Dummy EHRDataset",
)
zarr_artifact.save()

In [None]:
vc = ehrdata.pl.vitessce.gen_config(artifact=zarr_artifact, url=zarr_artifact.path.to_url())
vc.widget()

In [None]:
from lamindb.integrations import save_vitessce_config

In [None]:
vc_artifact = save_vitessce_config(vc, description="Dummy OMOP prepared dataset")

Now our data is stored on the cloud, managed by lamin:

![](../_static/tutorial_images/laminhub_screenshot.png)

Now we can easily share the data with others and let them explore it: all they need is access to our lamin storage and a click on the Vitessce button next to the dataset in their browser!

### Bonus: Lamin utility
Lamin also tracks our data and the operations we performed on it, for example through the lineage view:

In [None]:
vc_artifact.view_lineage()