To demonstrate ehrapy’s capability to analyze heterogeneous datasets from a broad patient set across multiple care units, we applied our exploratory strategy to the PIC database (ref. 43).

The PIC database is a single-center database hosting information on children admitted to critical care units at the Children’s Hospital of Zhejiang University School of Medicine in China. 

It contains 13,499 distinct hospital admissions of 12,881 individual pediatric patients admitted between 2010 and 2018 for whom demographics, diagnoses, doctors’ notes, vital signs, laboratory and microbiology tests, medications, fluid balances and more were collected (Extended Data Figs. 1 and 2a and Methods). 

After missing data imputation and subsequent pre-processing (Extended Data Figs. 2b,c and 3 and Methods), we generated a uniform manifold approximation and projection (UMAP) embedding to visualize variation across all patients using ehrapy (Fig. 2a).

This visualization of the low-dimensional patient manifold shows the heterogeneity of the collected data in the PIC database, with malformations, perinatal and respiratory being the most abundant International Classification of Diseases (ICD) chapters (Fig. 2b). The most common respiratory disease categories (Fig. 2c) were labeled pneumonia and influenza.

## Focus

We focused on pneumonia to apply ehrapy to a challenging, broad-spectrum disease that affects all age groups. Pneumonia is a prevalent respiratory infection that poses a substantial burden on public health (ref. 60) and is characterized by inflammation of the alveoli and distal airways (ref. 60). Individuals with pre-existing chronic conditions are particularly vulnerable, as are children under the age of 5 (ref. 61). Pneumonia can be caused by a range of microorganisms, encompassing bacteria, respiratory viruses and fungi.

## Population

We selected the age group ‘youths’ (13 months to 18 years of age) for further analysis, addressing a total of 265 patients who dominated the pneumonia cases and were diagnosed with ‘unspecified pneumonia’ (Fig. 2d and Extended Data Fig. 4). Neonates (0–28 d old) and infants (29 d to 12 months old) were excluded from the analysis as the disease context is significantly different in these age groups due to distinct anatomical and physical conditions. 
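The age-group boundaries above can be expressed as a small lookup. The following is a minimal sketch rather than the procedure used for the analysis: the function name `age_group`, the conversion of months and years to days, and the treatment of the interval between 12 and 13 months as 'infant' are all assumptions made for illustration.

```python
from typing import Optional

# Assumptions: average calendar lengths are used to convert the age
# boundaries quoted in the text (months, years) into days.
DAYS_PER_YEAR = 365.25
DAYS_PER_MONTH = DAYS_PER_YEAR / 12


def age_group(age_days: float) -> Optional[str]:
    """Map an age in days to the age groups named in the text.

    Returns None for ages above the pediatric range considered here.
    """
    if age_days < 0:
        raise ValueError("age must be non-negative")
    if age_days <= 28:
        return "neonate"   # 0-28 d
    if age_days < 13 * DAYS_PER_MONTH:
        return "infant"    # 29 d up to (but excluding) 13 months; assumption
    if age_days <= 18 * DAYS_PER_YEAR:
        return "youth"     # 13 months to 18 years
    return None


# Hypothetical ages in days, chosen to probe each boundary.
examples = [0, 28, 29, 365, 54 * DAYS_PER_MONTH, 20 * DAYS_PER_YEAR]
groups = [age_group(a) for a in examples]
```

Only records mapped to 'youth' would be retained under this sketch; neonates and infants would be dropped, as described above.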

Patients were 61% male, had a total of 277 admissions, had a mean age at admission of 54 months (median, 38 months) and had an average length of stay (LOS) of 15 d (median, 7 d). Of these, 152 patients were admitted to the pediatric intensive care unit (PICU), 118 to the general ICU (GICU), four to the surgical ICU (SICU) and three to the cardiac ICU (CICU).

### Missing Data

Laboratory measurements typically had 12–14% missing data, except for serum procalcitonin (PCT), a marker for bacterial infections, with 24.5% missing, and C-reactive protein (CRP), a marker of inflammation, with 16.8% missing. Measurements assigned as ‘vital signs’ contained between 44% and 54% missing values. 
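Per-feature missingness of the kind reported above is the share of records in which a measurement is absent. A minimal sketch follows, assuming each admission is a dictionary of measurements in which a missing value appears as `None`, as `NaN` or as an absent key; the function names, feature names and toy values are hypothetical and are not taken from the PIC database.

```python
import math
from typing import Dict, List, Optional


def is_missing(value: Optional[float]) -> bool:
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_fraction(records: List[Dict[str, Optional[float]]]) -> Dict[str, float]:
    """Return, per feature, the fraction of records with a missing value."""
    if not records:
        raise ValueError("at least one record is required")
    features = sorted({key for row in records for key in row})
    n = len(records)
    return {
        feature: sum(is_missing(row.get(feature)) for row in records) / n
        for feature in features
    }


# Toy records with made-up values, for illustration only.
toy_records = [
    {"crp": 12.0, "pct": None, "heart_rate": 110.0},
    {"crp": 48.5, "pct": 0.4, "heart_rate": None},
    {"crp": float("nan"), "pct": 1.1},  # heart_rate key absent
    {"crp": 30.2, "pct": 0.9, "heart_rate": 98.0},
]
fractions = missing_fraction(toy_records)
```

In this toy example, `crp` and `pct` are each missing in one of four records and `heart_rate` in two of four.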

Stratifying patients with unspecified pneumonia further enables a more nuanced understanding of the disease, potentially facilitating tailored approaches to treatment.

## Past Results

To deepen clinical phenotyping for the disease group ‘unspecified pneumonia’, we calculated a k-nearest neighbor graph to cluster patients into groups and visualize these in UMAP space (Methods). Leiden clustering (ref. 62) identified four patient groupings with distinct clinical features that we annotated (Fig. 2e).
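To make the graph-construction step concrete, the sketch below builds a brute-force k-nearest neighbor graph with Euclidean distance. It is an illustration under stated assumptions, not the implementation used by ehrapy: the function name `knn_graph`, the distance metric and the toy coordinates are hypothetical, and Leiden community detection, which would operate on such a graph, is not reproduced here.

```python
import math
from typing import Dict, List, Sequence


def knn_graph(points: Sequence[Sequence[float]], k: int) -> Dict[int, List[int]]:
    """Return, for each point index, the indices of its k nearest neighbors.

    Brute-force O(n^2) search with Euclidean distance; ties are broken by
    the lower point index.
    """
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")
    graph: Dict[int, List[int]] = {}
    for i, p in enumerate(points):
        distances = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        distances.sort()
        graph[i] = [j for _, j in distances[:k]]
    return graph


# Toy feature vectors forming two well-separated groups (made-up values).
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_graph = knn_graph(toy_points, k=2)
```

In the toy example every point's neighbors lie within its own group, which is the structure a community-detection method such as Leiden would then pick up as two clusters.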

To identify the laboratory values, medications and pathogens that were most characteristic for these four groups (Fig. 2f), we applied t-tests for numerical data and g-tests for categorical data between the identified groups using ehrapy (Extended Data Fig. 5 and Methods).
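For the categorical comparisons, a g-test contrasts observed counts in a contingency table with the counts expected under independence, using G = 2 Σ O ln(O/E). The sketch below is a minimal standalone version, assuming a 'group versus rest' table of counts; the function names and example counts are hypothetical, the P value shortcut holds only for one degree of freedom, and no multiple-testing adjustment is applied.

```python
import math
from typing import Sequence, Tuple


def g_test(table: Sequence[Sequence[int]]) -> Tuple[float, int]:
    """Return the G statistic and degrees of freedom of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if total == 0:
        raise ValueError("table must contain at least one observation")
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed == 0:
                continue  # the term O * ln(O / E) tends to 0 as O tends to 0
            expected = row_totals[i] * col_totals[j] / total
            g += observed * math.log(observed / expected)
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return 2.0 * g, dof


def p_value_one_dof(g: float) -> float:
    """Chi-square survival function for exactly one degree of freedom."""
    return math.erfc(math.sqrt(max(g, 0.0) / 2.0))


# Made-up counts: rows = in group / not in group, columns = feature yes / no.
g_assoc, dof_assoc = g_test([[20, 5], [5, 20]])
p_assoc = p_value_one_dof(g_assoc)
g_indep, _ = g_test([[10, 20], [20, 40]])  # perfectly proportional rows
```

The first toy table yields a large statistic and a small P value, whereas the proportional table yields G = 0, as expected when group membership carries no information about the feature.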

Based on this analysis, we identified patient groups with:
1. ‘sepsis-like’, 
2. ‘severe pneumonia with co-infection’, 
3. ‘viral pneumonia’ and 
4. ‘mild pneumonia’ phenotypes. 

The ‘sepsis-like’ group of patients (n = 28) was characterized by:
- rapid disease progression as exemplified by an increased number of deaths (adjusted P ≤ 5.04 × 10⁻³, 43% (n = 28), 95% confidence interval (CI): 23%, 62%); 
- indication of multiple organ failure, such as elevated creatinine (adjusted P ≤ 0.01, 52.74 ± 23.71 μmol L⁻¹) or reduced albumin levels (adjusted P ≤ 2.89 × 10⁻⁴, 33.40 ± 6.78 g L⁻¹); 
- and increased expression levels and peaks of inflammation markers, including PCT (adjusted P ≤ 3.01 × 10⁻², 1.42 ± 2.03 ng ml⁻¹), whole blood cell count, neutrophils, lymphocytes, monocytes and lower platelet counts (adjusted P ≤ 6.3 × 10⁻², 159.30 ± 142.00 × 10⁹ per liter) and changes in electrolyte levels, that is, lower potassium levels (adjusted P ≤ 0.09 × 10⁻², 3.14 ± 0.54 mmol L⁻¹). 

Patients whom we associated with the term ‘severe pneumonia with co-infection’ (n = 74) were characterized by:
- prolonged ICU stays (adjusted P ≤ 3.59 × 10⁻⁴, 15.01 ± 29.24 d);
- organ affection, such as higher levels of creatinine (adjusted P ≤ 1.10 × 10⁻⁴, 52.74 ± 23.71 μmol L⁻¹) and lower platelet count (adjusted P ≤ 5.40 × 10⁻²³, 159.30 ± 142.00 × 10⁹ per liter);
- increased inflammation markers, such as peaks of PCT (adjusted P ≤ 5.06 × 10⁻⁵, 1.42 ± 2.03 ng ml⁻¹), CRP (adjusted P ≤ 1.40 × 10⁻⁶, 50.60 ± 37.58 mg L⁻¹) and neutrophils (adjusted P ≤ 8.51 × 10⁻⁶, 13.01 ± 6.98 × 10⁹ per liter);
- detection of bacteria in combination with additional fungal pathogens in sputum samples (adjusted P ≤ 1.67 × 10⁻², 26% (n = 74), 95% CI: 16%, 36%);
- and increased application of medication, including antifungals (adjusted P ≤ 1.30 × 10⁻⁴, 15% (n = 74), 95% CI: 7%, 23%) and catecholamines (adjusted P ≤ 2.0 × 10⁻², 45% (n = 74), 95% CI: 33%, 56%).

Patients in the ‘mild pneumonia’ group were characterized by:
- positive sputum cultures in the presence of relatively lower inflammation markers, such as PCT (adjusted P ≤ 1.63 × 10⁻³, 1.42 ± 2.03 ng ml⁻¹) and CRP (adjusted P ≤ 0.03 × 10⁻¹, 50.60 ± 37.58 mg L⁻¹);
- and more frequent administration of antibiotics (adjusted P ≤ 1.00 × 10⁻⁵, 80% (n = 78), 95% CI: 70%, 89%) and additional medications (electrolytes, blood thinners and circulation-supporting medications) (adjusted P ≤ 1.00 × 10⁻⁵, 82% (n = 78), 95% CI: 73%, 91%).

Finally, patients in the ‘viral pneumonia’ group were characterized by:
- shorter LOSs (adjusted P ≤ 8.00 × 10⁻⁶, 15.01 ± 29.24 d);
- a lack of non-viral pathogen detection in combination with higher lymphocyte counts (adjusted P ≤ 0.01, 4.11 ± 2.49 × 10⁹ per liter);
- lower levels of PCT (adjusted P ≤ 0.03 × 10⁻², 1.42 ± 2.03 ng ml⁻¹);
- and reduced application of catecholamines (adjusted P ≤ 5.96 × 10⁻⁷, 15% (n = 97), 95% CI: 8%, 23%), antibiotics (adjusted P ≤ 8.53 × 10⁻⁶, 41% (n = 97), 95% CI: 31%, 51%) and antifungals (adjusted P ≤ 5.96 × 10⁻⁷, 0% (n = 97), 95% CI: 0%, 0%).
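The percentages above are reported with 95% CIs. How these intervals were computed is not stated in this section; one common choice is the normal-approximation (Wald) interval, sketched below as an assumption rather than as the method used. The function name `wald_interval` and the example counts are hypothetical. Under this approximation, a proportion of 0 out of n has an estimated standard error of zero, so the interval collapses to a single point, which is the same form as the ‘0%, 0%’ interval reported for antifungals.

```python
import math
from typing import Tuple


def wald_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion.

    z = 1.96 corresponds to an approximate 95% interval. The bounds are
    clipped to the valid range [0, 1].
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Made-up counts for illustration.
ci_half = wald_interval(50, 100)  # proportion 0.5
ci_zero = wald_interval(0, 97)    # no events observed
```

A degenerate interval for zero observed events is a known limitation of this approximation; it reflects the estimator rather than certainty that the event cannot occur.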



