# Tutorial: Using EHR Data as Donor-Level Input in `cellink`

The `cellink` package was originally developed for integrating single-cell measurements with donor-level **genetic data**, but its modular design allows you to swap in **any structured donor-level modality**. In this tutorial, we demonstrate how to use **electronic health records (EHR)** as the donor-level input, replacing genotypes. 

This opens up applications such as relating blood pressure, lab values, or medical history to cell-level transcriptomics.

## Setup and Configuration

We start by importing the modules used throughout the tutorial: NumPy and pandas for the simulated tables, `anndata` for the single-cell data, `ehrdata` for the EHR container, and `DonorData` from `cellink`.

In [1]:
import numpy as np
import pandas as pd
import anndata as ad
import ehrdata as ed
from cellink import DonorData

## Create Repeated EHR Measurements

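We begin by simulating repeated clinical measurements for two patients across three visits. Each patient has two parameters (systolic and diastolic blood pressure), so the measurement array has shape (patients, parameters, visits). In practice, these measurements could represent any time-varying vital signs or lab results.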

In [2]:
patients = pd.DataFrame(
    {
        "patient_id": ["P001", "P002"],
        "birthdate": ["1980-01-01", "1975-05-15"],
        "gender": ["M", "F"]
    }
).set_index("patient_id")

clinical_parameters = pd.DataFrame(
    {
        "parameter_id": ["BP_Systolic", "BP_Diastolic"],
        "name": ["Systolic Blood Pressure", "Diastolic Blood Pressure"],
        "unit": ["mmHg", "mmHg"],
    }
).set_index("parameter_id")

visit_dates = pd.DataFrame({
    "visit_number": ["1", "2", "3"],
    "visit_id": ["V001", "V002", "V003"]
}).set_index("visit_number")

repeated_measurements = np.array([
    [
        [120, 118, 121],
        [81, 80, 82],
    ],
    [
        [130, 135, 125],
        [84, 81, 80],
    ]
])
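
The nesting of the array determines which axis is which: the outer list has one entry per patient, each entry has one row per clinical parameter, and each row has one value per visit. A quick check with plain NumPy, reusing the values above, makes that layout explicit. The variable names `n_patients`, `n_parameters`, and `n_visits` are labels chosen for illustration, not part of any API.

```python
import numpy as np

# Same values as above: axis 0 = patients, axis 1 = parameters, axis 2 = visits
repeated_measurements = np.array([
    [[120, 118, 121], [81, 80, 82]],   # P001: systolic, diastolic
    [[130, 135, 125], [84, 81, 80]],   # P002: systolic, diastolic
])

n_patients, n_parameters, n_visits = repeated_measurements.shape
print(n_patients, n_parameters, n_visits)  # 2 2 3

# Systolic blood pressure of the first patient across all visits
p001_systolic = repeated_measurements[0, 0, :]

# Average over the visit axis: one value per (patient, parameter)
visit_means = repeated_measurements.mean(axis=2)
print(visit_means.shape)  # (2, 2)
```

The first two axes line up with the rows of `patients` and `clinical_parameters`, and the last axis with the rows of `visit_dates`.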

## Construct the `EHRData` Object
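We wrap the patient information, clinical parameters, visit table, and repeated measurements into an `EHRData` object. This object mirrors the structure expected by `cellink` and will later be used in place of genetic data. We then re-index the observations by donor ID so that they use the same identifiers as the `donor_id` column of the single-cell data created in the next step.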

In [3]:
ehr = ed.EHRData(
    r=repeated_measurements,
    obs=patients,
    var=clinical_parameters,
    t=visit_dates,
)

ehr.obs["donor_id"] = ["D0", "D1"]
ehr.obs.index = ehr.obs["donor_id"]
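
Assigning donor IDs as a bare list relies on the row order of `obs`. A slightly more defensive variant, sketched here with plain pandas on a copy of the patient table, maps each `patient_id` to its donor ID explicitly. The `patient_to_donor` dictionary is a hypothetical lookup introduced for illustration; in a real study it would come from the cohort's sample sheet.

```python
import pandas as pd

patients = pd.DataFrame(
    {
        "patient_id": ["P001", "P002"],
        "birthdate": ["1980-01-01", "1975-05-15"],
        "gender": ["M", "F"],
    }
).set_index("patient_id")

# Hypothetical mapping from EHR patient IDs to the donor IDs used in the cell data
patient_to_donor = {"P001": "D0", "P002": "D1"}

# Keep patient_id as a regular column, then look up the donor ID for each row
obs = patients.reset_index()
obs["donor_id"] = obs["patient_id"].map(patient_to_donor)

# Fail early if any patient has no donor ID
if obs["donor_id"].isna().any():
    raise ValueError("Some patients have no donor ID in the mapping")

obs = obs.set_index("donor_id")
print(obs)
```

The resulting table is indexed by donor ID and still carries the original `patient_id`, which makes it easy to trace a donor back to the EHR record.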

## Simulate Single-Cell RNA-seq Data
To demonstrate multimodal integration, we generate a synthetic single-cell RNA-seq dataset. Each cell is assigned to a donor and annotated with a predicted cell type. We filter to keep only CD8 Naive cells.

In [4]:
n_cells = 200
n_genes = 100

X = np.random.poisson(1.5, size=(n_cells, n_genes)).astype(np.float32)

cell_obs = pd.DataFrame({
    "cell_id": [f"C{i}" for i in range(n_cells)],
    "donor_id": np.random.choice(["D0", "D1"], size=n_cells),
    "predicted.celltype.l2": np.random.choice(["CD8 Naive", "CD4 TCM"], size=n_cells)
}).set_index("cell_id")

gene_var = pd.DataFrame(index=[f"gene_{i}" for i in range(n_genes)])

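# Assemble the AnnData object: cells in rows, genes in columns
adata = ad.AnnData(X=X, obs=cell_obs, var=gene_var)
# Keep only CD8 Naive cells; .copy() turns the resulting view into an independent object
adata = adata[adata.obs["predicted.celltype.l2"] == "CD8 Naive", :].copy()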
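
Because donors and cell types are drawn at random, the number of CD8 Naive cells per donor changes from run to run. The sketch below uses only NumPy and pandas, with a seeded generator (the seed value is arbitrary and not part of the original tutorial), to count how many cells survive the filter for each donor.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n_cells = 200

cell_obs = pd.DataFrame({
    "cell_id": [f"C{i}" for i in range(n_cells)],
    "donor_id": rng.choice(["D0", "D1"], size=n_cells),
    "predicted.celltype.l2": rng.choice(["CD8 Naive", "CD4 TCM"], size=n_cells),
}).set_index("cell_id")

# Apply the same cell-type filter as in the tutorial, on the obs table only
cd8 = cell_obs[cell_obs["predicted.celltype.l2"] == "CD8 Naive"]

# Count remaining cells per donor; reindex so a donor with no cells shows up as 0
cells_per_donor = (
    cd8["donor_id"].value_counts().reindex(["D0", "D1"], fill_value=0)
)
print(cells_per_donor)
```

A donor with a count of zero would have no cell-level data to pair with its EHR record, which is worth knowing before merging.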

## Combine EHR and Single-Cell Data via `DonorData`
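We now use `DonorData` to merge the donor-level EHR data with the cell-level transcriptomics. The EHR object is passed as `G`, the argument that would otherwise receive the genetic data, and the single-cell `AnnData` is passed as `C`. The resulting object gives unified access to both levels and is used throughout the `cellink` pipeline.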

In [5]:
dd = DonorData(G=ehr, C=adata).copy()
dd
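
Before merging, it can help to confirm that every donor ID referenced by the cells also exists in the donor-level table. The check below is a plain-pandas sketch that does not call `cellink`; `check_donor_overlap` is a hypothetical helper name, and the toy IDs (including the unmatched `"D2"`) are made up for illustration. Whether `DonorData` performs a similar check internally is not covered here.

```python
import pandas as pd


def check_donor_overlap(donor_index, cell_donor_ids):
    """Compare donor IDs in the donor-level table with those used by the cells."""
    donors = set(donor_index)
    cell_donors = set(cell_donor_ids)
    return {
        "shared": sorted(donors & cell_donors),
        "donors_without_cells": sorted(donors - cell_donors),
        "cells_without_donor": sorted(cell_donors - donors),
    }


# Toy inputs: two donors in the donor table, one stray ID in the cell annotations
donor_index = pd.Index(["D0", "D1"], name="donor_id")
cell_donor_ids = pd.Series(["D0", "D1", "D1", "D2"])

report = check_donor_overlap(donor_index, cell_donor_ids)
print(report)
```

In this tutorial the inputs would be `ehr.obs.index` and `adata.obs["donor_id"]`; with the simulated data, both contain only `D0` and `D1`.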

