# The PhysioNet 2012 Dataset in the `EHRData` format


This tutorial demonstrates how `EHRData` structures real-world longitudinal clinical data using the **PhysioNet Challenge 2012** dataset as an example.

```{note}
It is helpful to work through the [Getting started with EHRData](getting_started) tutorial first to learn the basics of `EHRData` before diving into this one.
```

The [PhysioNet Challenge 2012](https://physionet.org/content/challenge-2012/1.0.0/) dataset contains ICU patient data collected during the first 48 hours after ICU admission. It was released for the PhysioNet/Computing in Cardiology Challenge 2012, whose task was to predict in-hospital mortality.

## Dataset Overview

The dataset includes:
- **12,000 ICU stays**, split into three challenge sets of 4,000 records each (set-a, set-b, set-c)
- **37 clinical variables** measured over time (vitals, lab values, etc.)
- **4 static features**: Age, Gender, ICUType, Height
- **Outcomes**: In-hospital death, survival time, and SAPS-I score

Let's explore how `EHRData` organizes this complex data structure!


## Loading the Dataset

The `ehrdata` package provides multiple datasets out-of-the-box, and this dataset is one of them.

See :func:`~ehrdata.datasets.physionet2012` for more details about how the dataset is loaded.

In [None]:
import ehrdata as ed
import numpy as np
import matplotlib.pyplot as plt

This downloads the data if needed and processes it into an `EHRData` object:


In [None]:
edata = ed.dt.physionet2012(layer="tem_data")
edata

```{note}
The first time you run this, it will download ~140MB of data. Subsequent runs will use the cached version.
```


## Reminder: the EHRData Structure

<p style="text-align:center; padding: 2em 0;">
<img src="../_static/tutorial_images/ehrdata_logo.png" width="400" height="400" alt="Logo">
</p>

An `EHRData` object organizes data across three dimensions:

- **`n_obs`**: Number of observations (patients/ICU stays)
- **`n_vars`**: Number of variables (clinical parameters)
- **`n_tem`**: Number of temporal measurements (time points)
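
To make the three axes concrete, here is a minimal NumPy sketch of such an array. It is only an illustration with made-up sizes and values, not the `EHRData` implementation; the name `toy` and the dimensions are assumptions chosen for this example.

```python
import numpy as np

# Toy dimensions (made up for illustration): 3 patients, 2 variables, 4 time points
n_obs, n_vars, n_tem = 3, 2, 4

# A 3D array laid out as (n_obs, n_vars, n_tem), filled with consecutive numbers
toy = np.arange(n_obs * n_vars * n_tem, dtype=float).reshape(n_obs, n_vars, n_tem)

# Fixing one axis yields a 2D slice over the other two
one_patient = toy[0]  # (n_vars, n_tem): all variables over time for patient 0
one_variable = toy[:, 1, :]  # (n_obs, n_tem): variable 1 over time for every patient
one_time_point = toy[:, :, 2]  # (n_obs, n_vars): all patients at time index 2

print(one_patient.shape, one_variable.shape, one_time_point.shape)
# (2, 4) (3, 4) (3, 2)
```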

Let's explore each component with PhysioNet Challenge 2012 data!


### The `.layers` Attribute: Time Series Data

The `.layers` attribute holds named arrays. Here, the `tem_data` layer (the name passed to `physionet2012()` above) contains the 3D tensor of shape `(n_obs, n_vars, n_tem)` with all time series measurements:


In [None]:
print(f"Shape of layers: {edata.layers['tem_data'].shape}")
print(f"Data type: {edata.layers['tem_data'].dtype}")
print("\nThis represents:")
print(f"  - {edata.n_obs} patients")
print(f"  - {edata.n_vars} clinical variables")
print(f"  - {edata.n_tem} time intervals")
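
ICU measurements are recorded irregularly, so many cells of such an array are typically empty. The sketch below shows one way to quantify that per variable. It uses a small synthetic array and assumes that missing values are encoded as `NaN`, which should be checked against the loaded data.

```python
import numpy as np

# Synthetic stand-in for a (n_obs, n_vars, n_tem) layer; NaN marks "not measured"
layer = np.array(
    [
        [[80.0, np.nan, 82.0, np.nan], [np.nan, np.nan, 37.1, np.nan]],
        [[np.nan, 95.0, 97.0, 96.0], [np.nan, np.nan, np.nan, np.nan]],
    ]
)

# Fraction of missing cells per variable, pooled over patients and time points
missing_per_var = np.isnan(layer).mean(axis=(0, 2))
print(missing_per_var)
# [0.375 0.875]
```

On the real data, the same expression could be applied to `edata.layers["tem_data"]`, provided it is a floating-point NumPy array.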

### The `.obs` Attribute: Static Patient Metadata

The `.obs` DataFrame contains static information and outcomes for each patient:


In [None]:
edata.obs.head()

The `.obs` table includes:
- **Static features**: Age, Gender, ICUType, Height
- **Outcomes**: In-hospital_death, Survival, SAPS-I (severity score)
- **Metadata**: set (which challenge set the record belongs to: set-a, set-b, or set-c)


### The `.var` Attribute: Dynamic Variable Metadata

The `.var` DataFrame contains information about each clinical variable being measured:


In [None]:
print(f"Number of variables: {edata.n_vars}\n")
print("All clinical parameters:")
edata.var

### The `.tem` Attribute: Temporal Information

The `.tem` DataFrame contains information about the time intervals:


In [None]:
print(f"Number of time intervals: {edata.n_tem}\n")
edata.tem.head(10)

## Exploring Individual Patients

Let's look at a specific patient's data and visualize their vital signs over time:


In [None]:
# Select the first patient
patient_id = edata.obs_names[0]
print(f"Patient ID: {patient_id}\n")

# View their static information
print("Static Information:")
print(edata.obs.loc[patient_id])

In [None]:
# Select a few vital signs to visualize
vital_signs = ["HR", "Temp", "SysABP", "MAP", "RespRate"]

fig, axes = plt.subplots(len(vital_signs), 1, figsize=(12, 10), sharex=True)
fig.suptitle(f"Vital Signs for Patient {patient_id} Over 48 Hours", fontsize=14)

for idx, var_name in enumerate(vital_signs):
    if var_name in edata.var_names:
        # Get the data for this variable
        var_idx = np.where(edata.var_names == var_name)[0][0]
        data = edata[patient_id].layers["tem_data"].squeeze()[var_idx, :]

        # Plot
        axes[idx].plot(range(edata.n_tem), data, marker="o", linestyle="-", markersize=3)
        axes[idx].set_ylabel(var_name)
        axes[idx].grid(visible=True, alpha=0.3)

axes[-1].set_xlabel("Hours since ICU admission")
plt.tight_layout()
plt.show()

The good news: you no longer need to write this much code for such visualizations!

`ehrapy` has many utility functions for processing and visualizing data in the `EHRData` format. For a more polished version of this plot, see for instance :func:`~ehrapy.plot.timeseries`.

## Subsetting and Filtering

`EHRData` supports subsetting along patients, variables, and time, with indexing similar to NumPy arrays:


In [None]:
# Filter by patients - get only those who died in hospital
deceased = edata[edata.obs["In-hospital_death"] == 1]
print(f"Deceased patients: {deceased.n_obs}")

# Filter by variables - get only cardiovascular measurements
cardio_vars = ["HR", "SysABP", "DiasABP", "MAP"]
cardio_data = edata[:, cardio_vars]
print(f"Cardiovascular data shape: {cardio_data.layers['tem_data'].shape}")

# Filter by time - get only the first 24 hours
first_24h = edata[:, :, :24]
print(f"First 24 hours shape: {first_24h.layers['tem_data'].shape}")

# Combined filtering (Gender is coded as 0 = female, 1 = male)
subset = edata[edata.obs["Gender"] == 0.0, cardio_vars, :12]
print(f"Female patients, cardiovascular vars, first 12h: {subset.layers['tem_data'].shape}")
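
Conceptually, the combined filter selects along all three axes at once. The following NumPy sketch mimics this on synthetic arrays. The names `gender`, `var_names`, and `data` are placeholders, and the code is not how `EHRData` implements indexing internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 5 patients, 4 variables, 48 hourly time points
var_names = np.array(["HR", "Temp", "SysABP", "MAP"])
gender = np.array([0.0, 1.0, 0.0, 0.0, 1.0])  # per-patient static feature
data = rng.normal(size=(5, 4, 48))

# Axis 0: boolean mask over patients
obs_mask = gender == 0.0
# Axis 1: positions of the requested variable names
wanted = ["HR", "MAP"]
var_idx = [int(np.flatnonzero(var_names == name)[0]) for name in wanted]
# Axis 2: the first 12 time points
time_slice = slice(0, 12)

# Index one axis at a time so each selection stays independent of the others
subset = data[obs_mask][:, var_idx][:, :, time_slice]
print(subset.shape)  # (3, 2, 12)
```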

## Choosing different time intervals

Depending on the question at hand, different time intervals are of interest.

The `physionet2012()` dataset comes from the intensive care unit setting, where patient data are recorded every few minutes to hours, and usually only for a few days.

Other observational health data are typically recorded weeks or months apart and can span many years.

The `physionet2012()` function provides arguments to specify more about the time intervals:


In [None]:
# Load with different time resolution (2-hour intervals)
edata_2h = ed.dt.physionet2012(
    interval_length_number=2,
    interval_length_unit="h",
    num_intervals=24,  # 48 hours / 2 hours = 24 intervals
    layer="tem_data",
)
print(f"2-hour intervals shape: {edata_2h.layers['tem_data'].shape}")
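
To see what a coarser resolution means, the sketch below bins irregularly timed measurements of a single variable into fixed-length intervals. It is a simplified illustration with made-up values. Averaging within each interval is an assumption made here, and the aggregation actually used by `physionet2012()` may differ.

```python
import numpy as np


def bin_measurements(times_h, values, interval_length_h, num_intervals):
    """Average measurements into fixed-length intervals; empty intervals become NaN."""
    times_h = np.asarray(times_h, dtype=float)
    values = np.asarray(values, dtype=float)
    binned = np.full(num_intervals, np.nan)

    # Interval index of each measurement, e.g. 3.5 h with 2 h intervals -> interval 1
    idx = np.floor(times_h / interval_length_h).astype(int)

    # Measurements falling beyond the last interval are ignored
    for i in range(num_intervals):
        in_interval = values[idx == i]
        if in_interval.size > 0:
            binned[i] = in_interval.mean()
    return binned


# Made-up heart-rate readings, timestamped in hours since admission
times_h = [0.5, 1.5, 3.0, 3.5, 7.0]
values = [80.0, 84.0, 90.0, 94.0, 100.0]

hourly = bin_measurements(times_h, values, interval_length_h=1, num_intervals=8)
two_hourly = bin_measurements(times_h, values, interval_length_h=2, num_intervals=4)
print(hourly)
print(two_hourly)
```

In this toy series, the 2-hour binning leaves 1 of 4 intervals empty instead of 4 of 8, at the cost of temporal detail.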

This gives a coarser but more easily digestible plot.

In [None]:
# Plot the same vital signs, now from the 2-hour resolution object
vital_signs = ["HR", "Temp", "SysABP", "MAP", "RespRate"]
interval_hours = 2  # must match interval_length_number used when loading edata_2h

fig, axes = plt.subplots(len(vital_signs), 1, figsize=(12, 10), sharex=True)
fig.suptitle(f"Vital Signs for Patient {patient_id} Over 48 Hours (2-Hour Intervals)", fontsize=14)

# Start of each interval, in hours since ICU admission
hours = np.arange(edata_2h.n_tem) * interval_hours

for idx, var_name in enumerate(vital_signs):
    if var_name in edata_2h.var_names:
        # Get the data for this variable
        var_idx = np.where(edata_2h.var_names == var_name)[0][0]
        data = edata_2h[patient_id].layers["tem_data"].squeeze()[var_idx, :]

        # Plot
        axes[idx].plot(hours, data, marker="o", linestyle="-", markersize=3)
        axes[idx].set_ylabel(var_name)
        axes[idx].grid(visible=True, alpha=0.3)

axes[-1].set_xlabel("Hours since ICU admission")
plt.tight_layout()
plt.show()

## Summary

In this tutorial, we learned:

- ✅ How to load the PhysioNet 2012 dataset with `ed.dt.physionet2012()`
- ✅ The structure of `EHRData` objects with three dimensions: obs × vars × tem
- ✅ How to visualize individual patient trajectories
- ✅ How to subset and filter the data
- ✅ How to customize the data loading parameters

The `ehrdata` package makes it easy to work with complex longitudinal clinical data in a structured, intuitive way!

## Where to go next

Now that you understand how `ehrdata` structures the PhysioNet 2012 dataset, you can:

### 🔍 Interactive Exploration
- **[Interactive Visualization with Vitessce](interactive_visualization_of_ehrdata)** - Explore your EHRData interactively in Jupyter notebooks with linked, coordinated views.


Or see the other tutorials for more advanced applications!
