# Getting started with EHRData

![](../_static/tutorial_images/logo.svg)


In this tutorial, we introduce basic properties of `EHRData`.

`EHRData` builds upon and extends [AnnData](https://anndata.readthedocs.io/en/stable/), a powerful Python package for handling annotated data.
`EHRData` is intended as the data structure on which [ehrapy](https://ehrapy.readthedocs.io/en/stable/) operates.

We recommend that readers familiarize themselves with the basics of `AnnData`, starting for example with the brief tutorial [Getting started with AnnData](https://anndata.readthedocs.io/en/stable/tutorials/notebooks/getting-started.html).

`EHRData` is designed to represent and work with data of $n$ observations of $d$ variables of $t$ repeats.

For instance, in a clinical study, each enrolled subject corresponds to an observation, each registered clinical parameter corresponds to a variable, and each visit corresponds to a repeat.
Furthermore, we might have metadata for each of these axes. For example, for each subject, we might have additional static metadata, such as birthdate or sex. For each registered clinical parameter, we might have metadata such as a concept identifier, a descriptive name, or the unit it was measured in. For the repeated measurements, we might have a descriptive name per measurement, or the number of weeks after study entry.
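
To make the three axes concrete, here is a minimal sketch that uses plain NumPy and pandas rather than `EHRData` itself. The names `values`, `subjects`, `parameters`, and `visits`, as well as all numbers, are made up for illustration.

```python
import numpy as np
import pandas as pd

n, d, t = 3, 2, 4  # subjects, clinical parameters, visits

# One value per (subject, parameter, visit); NaN marks a missing measurement.
values = np.full((n, d, t), np.nan)
values[0, 0, 0] = 120.0  # subject 0, parameter 0, visit 0

# One metadata table per axis; each has as many rows as its axis is long.
subjects = pd.DataFrame({"sex": ["M", "F", "F"]}, index=["S1", "S2", "S3"])
parameters = pd.DataFrame({"unit": ["mmHg", "mmHg"]}, index=["BP_Systolic", "BP_Diastolic"])
visits = pd.DataFrame({"weeks_after_entry": [0, 4, 8, 12]}, index=["V1", "V2", "V3", "V4"])

axis_lengths = (len(subjects), len(parameters), len(visits))
print(values.shape == axis_lengths)
```

The point of the sketch is only the alignment: the data array and the three metadata tables must agree on the length of each axis.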

To our knowledge, no alternative exists today that:

* Handles structured data at this expressiveness and simplicity
* Is user-friendly



In [94]:
import numpy as np
import pandas as pd
import ehrdata as ed

## Initializing EHRData

Let's start by building a basic EHRData object with some measurements, e.g. blood pressure of two individuals.

In [95]:
patients = pd.DataFrame(
    {"patient_id": ["P001", "P002"], "birthdate": ["1980-01-01", "1975-05-15"], "gender": ["M", "F"]}
).set_index("patient_id")

clinical_parameters = pd.DataFrame(
    {
        "parameter_id": ["BP_Systolic", "BP_Diastolic"],
        "name": ["Systolic Blood Pressure", "Diastolic Blood Pressure"],
        "unit": ["mmHg", "mmHg"],
    }
).set_index("parameter_id")

measurements = np.array(
    [[120, 121], [81, 81]],
)

edata = ed.EHRData(
    X=measurements,
)
edata

EHRData object with n_obs × n_vars = 2 × 2
    shape of .X: (2, 2)

EHRData provides a flat 2D matrix field, `edata.X`, suited to classical tabular data with one value per observation and variable.

In [96]:
edata.X

array([[120, 121],
       [ 81,  81]])

When we have measurements along a time course, we want to represent an axis of time (e.g. clinical visits, calendar time, ...) and repeats of measurements. In the example above, the measurements could e.g. be repeated three times.

In [97]:
visit_dates = pd.DataFrame({"visit_number": ["1", "2", "3"], "visit_id": ["V001", "V002", "V003"]}).set_index(
    "visit_number"
)

repeated_measurements = np.array(
    [
        [
            [120, np.nan, 121],
            [81, np.nan, 81],
        ],
        [
            [130, 135, 125],
            [84, 81, 80],
        ],
    ]
)

edata = ed.EHRData(
    r=repeated_measurements,
)
edata

EHRData object with n_obs × n_vars = 2 × 2
    shape of .X: (2, 2)
    shape of .r: (2, 2, 3)

EHRData provides a 3D array field `edata.r`, suitable for this kind of data.

In [98]:
edata.r

array([[[120.,  nan, 121.],
        [ 81.,  nan,  81.]],

       [[130., 135., 125.],
        [ 84.,  81.,  80.]]])
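
Because the repeats live on the last axis of a regular NumPy array, missing values and per-patient summaries can be inspected with ordinary NumPy functions. The following is a small sketch on a plain array holding the same values as `repeated_measurements`; the variable names `missing_fraction` and `mean_over_repeats` are chosen here for illustration.

```python
import numpy as np

# Same values as `repeated_measurements`: (patients, variables, repeats).
r = np.array(
    [
        [[120, np.nan, 121], [81, np.nan, 81]],
        [[130, 135, 125], [84, 81, 80]],
    ]
)

# Fraction of missing repeats per patient and variable.
missing_fraction = np.isnan(r).mean(axis=2)

# Mean over the repeat axis, ignoring missing values.
mean_over_repeats = np.nanmean(r, axis=2)

print(missing_fraction)
print(mean_over_repeats)
```

The result of the aggregation is again a 2D array of shape (patients, variables).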

Now, we group the data together with its metadata, using the `obs`, `var`, and `t` fields of EHRData.

- the `obs` field stores static person-level metadata
- the `var` field stores variable-level metadata
- the `t` field stores time axis-level metadata

In [99]:
edata = ed.EHRData(
    r=repeated_measurements,
    obs=patients,
    var=clinical_parameters,
    t=visit_dates,
)
edata

EHRData object with n_obs × n_vars = 2 × 2
    obs: 'birthdate', 'gender'
    var: 'name', 'unit'
    shape of .X: (2, 2)
    shape of .r: (2, 2, 3)

### Subsetting EHRData
#### Subsetting with indices

The index values can be used to subset the EHRData, which provides a view of the EHRData object. This is useful for restricting the EHRData object to particular patients, variables, or time intervals of interest. The rules for subsetting EHRData are similar to those of a pandas DataFrame. You can use values in the `obs/var_names`, boolean masks, or integer indices.

In [100]:
edata[["P001"], ["BP_Systolic"]]

View of EHRData object with n_obs × n_vars = 1 × 1
    obs: 'birthdate', 'gender'
    var: 'name', 'unit'
    shape of .X: (1, 1)
    shape of .r: (1, 1, 3)

#### Subsetting using metadata

We can also subset the EHRData using the metadata:

In [101]:
edata[edata.obs["gender"] == "F"]

View of EHRData object with n_obs × n_vars = 1 × 2
    obs: 'birthdate', 'gender'
    var: 'name', 'unit'
    shape of .X: (1, 2)
    shape of .r: (1, 2, 3)
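
The expression inside the brackets is an ordinary boolean mask with one entry per observation. The sketch below shows the same idea with pandas and NumPy only; the array `r` holds placeholder values and is not taken from the example above.

```python
import numpy as np
import pandas as pd

obs = pd.DataFrame({"gender": ["M", "F"]}, index=["P001", "P002"])
r = np.arange(12, dtype=float).reshape(2, 2, 3)  # placeholder values

# Comparing a metadata column yields one boolean per observation.
mask = (obs["gender"] == "F").to_numpy()

# The same mask selects rows of the metadata table
# and slices along the first axis of the 3D array.
obs_subset = obs[mask]
r_subset = r[mask]

print(mask)
print(obs_subset.index.tolist())
print(r_subset.shape)
```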

## Observation/variable-level matrices

We might also have metadata at either level that has many dimensions to it, such as a UMAP embedding of the data. For this type of metadata, EHRData has the `.obsm/.varm` attributes. We use keys to identify the different matrices we insert. The restriction on `.obsm/.varm` is that the length of `.obsm` matrices must equal the number of observations, `.n_obs`, and the length of `.varm` matrices must equal `.n_vars`. Each matrix can independently have a different number of dimensions.

Let's start with a randomly generated matrix that we can interpret as a UMAP embedding of the data we'd like to store, as well as some random variable-level metadata:

In [102]:
edata.obsm["X_umap"] = np.random.normal(0, 1, size=(edata.n_obs, 2))
edata.varm["variable_stuff"] = np.random.normal(0, 1, size=(edata.n_vars, 5))
edata.obsm

AxisArrays with keys: X_umap
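
The length restriction can be sketched as a simple shape check. The helper `is_aligned` is hypothetical and written only for this illustration; it is not part of `EHRData` or `AnnData`, which perform their own validation on assignment.

```python
import numpy as np

n_obs, n_vars = 2, 2


def is_aligned(matrix: np.ndarray, axis_length: int) -> bool:
    """Return True if the first dimension of `matrix` matches `axis_length`."""
    return matrix.shape[0] == axis_length


rng = np.random.default_rng(0)  # seeded generator, so reruns give the same values
embedding = rng.normal(0, 1, size=(n_obs, 2))      # would fit in .obsm
var_features = rng.normal(0, 1, size=(n_vars, 5))  # would fit in .varm
too_long = rng.normal(0, 1, size=(n_obs + 1, 2))   # would not fit in .obsm

print(is_aligned(embedding, n_obs))
print(is_aligned(var_features, n_vars))
print(is_aligned(too_long, n_obs))
```

Only the first dimension is constrained; the number of columns (2 and 5 here) is free.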

A few more notes about `.obsm/.varm`:

1. The "array-like" metadata can originate from a pandas DataFrame, a SciPy sparse matrix, or a NumPy dense array.
2. When using scanpy, the values (columns) of `.obsm/.varm` entries are not easily plotted, whereas items from `.obs` can easily be plotted on, e.g., UMAP plots.

## Unstructured metadata

EHRData has `.uns`, which allows for any unstructured metadata. This can be anything, like a list or a dictionary with some general information that was useful in the analysis of our data.

In [103]:
edata.uns["random"] = [1, 2, 3]
edata.uns

OrderedDict([('random', [1, 2, 3])])

## Layers

Finally, we may have different forms of our original core data, perhaps one that is normalized and one that is not. These can be stored in different layers in EHRData. For example, let's log transform the original data and store it in a layer:

In [104]:
edata.layers["log_transformed"] = np.log1p(edata.r)
edata

EHRData object with n_obs × n_vars = 2 × 2
    obs: 'birthdate', 'gender'
    var: 'name', 'unit'
    uns: 'random'
    obsm: 'X_umap'
    varm: 'variable_stuff'
    layers: 'log_transformed'
    shape of .X: (2, 2)
    shape of .r: (2, 2, 3)
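
Two properties of this transformation are worth knowing when the data contains missing values: `np.log1p` leaves `NaN` entries as `NaN`, and `np.expm1` undoes the transformation. A short sketch on a plain NumPy array, with the names `raw`, `log_values`, and `restored` chosen for illustration:

```python
import numpy as np

raw = np.array([[120.0, np.nan, 121.0], [81.0, np.nan, 81.0]])

log_values = np.log1p(raw)       # log(1 + x), element-wise
restored = np.expm1(log_values)  # exp(x) - 1, the inverse of log1p

# Missing values stay missing after the transformation.
print(np.isnan(log_values[:, 1]).all())

# The round trip recovers the original values (treating NaN as equal to NaN).
print(np.allclose(restored, raw, equal_nan=True))
```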

## Writing the results to disk

`EHRData` comes with a persistent HDF5-based file format: `h5ad`. If string columns with a small number of categories are not yet categoricals, `EHRData` will automatically convert them to categoricals.

In [105]:
edata.write("my_results.h5ad", compression="gzip")
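
What it means for a string column to become a categorical can be sketched with pandas alone. This only illustrates the resulting data type; it is not the code path that `EHRData` uses when writing, and the example table is made up.

```python
import pandas as pd

obs = pd.DataFrame(
    {"gender": ["M", "F", "F", "M"]},
    index=["P001", "P002", "P003", "P004"],
)

# A categorical column stores each distinct value once, plus one integer code per row.
obs["gender"] = obs["gender"].astype("category")

print(obs["gender"].dtype)
print(obs["gender"].cat.categories.tolist())
print(obs["gender"].cat.codes.tolist())
```

For columns with few distinct values, this representation is more compact than storing every string separately.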

## Wrapping up the introduction

EHRData is straightforward to use and facilitates more reproducible analyses with its key-based storage.

Keep reading on the [AnnData tutorials](https://anndata.readthedocs.io/en/stable/tutorials/notebooks/getting-started.html) to better understand "views", on-disk backing, and other details.

<div class="alert alert-info">

**Note**

Similar to NumPy arrays, EHRData objects can either hold actual data or reference another `EHRData` object. In the latter case, they are referred to as a "view".

Subsetting EHRData objects always returns views, which has two advantages:

- no new memory is allocated
- it is possible to modify the underlying EHRData object

You can get an actual EHRData object from a view by calling `.copy()` on the view. Usually, this is not necessary, as any modification of elements of a view (calling `.[]` on an attribute of the view) internally calls `.copy()` and makes the view an EHRData object that holds actual data. See the example below.
    
</div>
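
The NumPy analogy mentioned in the note can be sketched as follows, using only NumPy. The names `data`, `view`, and `copy` are illustrative.

```python
import numpy as np

data = np.arange(6).reshape(2, 3)

view = data[:1]         # basic slicing returns a view: no data is copied
copy = data[:1].copy()  # an independent array with its own memory

print(np.shares_memory(data, view))
print(np.shares_memory(data, copy))

view[0, 0] = 99  # writing through the view changes the original
copy[0, 1] = -1  # writing to the copy does not
print(data[0].tolist())
```

The analogy has a limit: a NumPy view keeps pointing at the original after a write, whereas, as the note describes, modifying elements of an EHRData view turns the view into an object that holds its own data.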

Let's access a single patient and a single variable.

<div class="alert alert-info">

**Note**

As in `AnnData`, indexing into `EHRData` assumes that integer arguments to `[]` behave like `.iloc` in pandas, whereas string arguments behave like `.loc`. `AnnData` always assumes string indices.
    
</div> 
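
For readers less familiar with the two pandas indexers, here is a minimal pandas-only sketch of positional versus label-based selection. The table `df` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"BP_Systolic": [120, 130], "BP_Diastolic": [81, 84]},
    index=["P001", "P002"],
)

by_position = df.iloc[[0], [0]]               # integers select by position
by_label = df.loc[["P001"], ["BP_Systolic"]]  # strings select by label

print(by_position.equals(by_label))
```

Both expressions select the same single cell, once by position and once by name.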

In [65]:
edata[["P001"], ["BP_Systolic"]]

View of EHRData object with n_obs × n_vars = 1 × 1
    obs: 'birthdate', 'gender'
    var: 'name', 'unit'
    uns: 'random'
    obsm: 'X_umap'
    varm: 'variable_stuff'
    shape of .X: (1, 1)
    shape of .r: (1, 1, 3)

This is a view! If we want an `EHRData` object that holds the data in memory, we call `.copy()`.

In [66]:
edata_subset = edata[["P001"], ["BP_Systolic"]].copy()

If you try to write to parts of a view of an EHRData object, the content will be auto-copied and a data-storing object will be generated.

In [67]:
edata_subset = edata[["P001"], ["BP_Systolic"]]
edata_subset

View of EHRData object with n_obs × n_vars = 1 × 1
    obs: 'birthdate', 'gender'
    var: 'name', 'unit'
    uns: 'random'
    obsm: 'X_umap'
    varm: 'variable_stuff'
    shape of .X: (1, 1)
    shape of .r: (1, 1, 3)

In [68]:
edata_subset.obs["foo"] = "bar"


Now `edata_subset` stores the actual data and is no longer just a reference to `edata`.

In [71]:
edata_subset

EHRData object with n_obs × n_vars = 1 × 1
    obs: 'birthdate', 'gender', 'foo'
    var: 'name', 'unit'
    uns: 'random'
    obsm: 'X_umap'
    varm: 'variable_stuff'
    shape of .X: (1, 1)
    shape of .r: (1, 1, 3)