# PyHealth Datasets Overview

This notebook provides an overview of all datasets supported in PyHealth. For each dataset, we include:
- **Description**: What the dataset contains.
- **Source/Link**: Where to find the original data.
- **Download Method**: How to obtain the data.
- **Restrictions**: Any access requirements or limitations.
- **Example Usage**: Code to load the dataset in PyHealth.

**Note**: Many datasets require accounts, credentials, or compliance with data use agreements.
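
The access requirements listed in the sections of this notebook can be collected into a small lookup table, which helps when choosing a dataset to try first. The sketch below is illustrative only: the `ACCESS` dictionary and the `no_login_needed` helper are hypothetical names, not part of PyHealth, and the values simply restate what this notebook lists (OMOP-formatted data and the built-in sample dataset are left out because their access varies or does not apply).

```python
# Access requirements as listed in this notebook (not queried from any API).
# "open_access": the full dataset can be fetched without an account.
# "demo_listed": this notebook lists a demo subset that needs no authentication.
ACCESS = {
    "MIMIC-III": {"open_access": False, "demo_listed": True},
    "MIMIC-IV": {"open_access": False, "demo_listed": True},
    "eICU": {"open_access": False, "demo_listed": False},
    "Sleep-EDF": {"open_access": True, "demo_listed": False},
    "SHHS": {"open_access": False, "demo_listed": False},
    "ISRUC": {"open_access": True, "demo_listed": False},
    "Cardiology": {"open_access": True, "demo_listed": False},
    "COVID-19 CXR": {"open_access": True, "demo_listed": False},
}


def no_login_needed(access):
    """Return the names of datasets usable without an account, sorted."""
    return sorted(
        name
        for name, info in access.items()
        if info["open_access"] or info["demo_listed"]
    )


print(no_login_needed(ACCESS))
```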

## 1. MIMIC-III

**Description**: A large dataset of de-identified health records from ICU patients at Beth Israel Deaconess Medical Center (2001-2012). Includes demographics, vital signs, lab results, medications, and more.

**Source/Link**: https://physionet.org/content/mimiciii/1.4/

**Download Method**:
- Create PhysioNet account and complete HIPAA training.
- Download: `wget -r -N -c -np --user [USERNAME] --ask-password https://physionet.org/files/mimiciii/1.4/`
- Demo (no auth): `wget -r -N -c -np https://physionet.org/files/mimiciii-demo/1.4/`

**Restrictions**: Requires PhysioNet account, HIPAA certification, and data use agreement. ~40GB.
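
The two `wget` invocations above differ only in the credential flags. As a rough sketch, the same command can be assembled from Python; `physionet_wget_command` is a hypothetical helper written for this notebook, and it only builds the argument list without contacting PhysioNet.

```python
def physionet_wget_command(files_url, username=None):
    """Build the wget argument list used in this notebook for a PhysioNet files URL."""
    cmd = ["wget", "-r", "-N", "-c", "-np"]
    if username is not None:
        # Credentialed datasets: wget prompts for the password interactively.
        cmd += ["--user", username, "--ask-password"]
    cmd.append(files_url)
    return cmd


demo_cmd = physionet_wget_command("https://physionet.org/files/mimiciii-demo/1.4/")
print(" ".join(demo_cmd))
```

To run the download, pass the list to `subprocess.run(demo_cmd, check=True)` on a machine where `wget` is installed.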

**Example Usage**:

In [18]:
from pyhealth.datasets import MIMIC3Dataset

# Load the MIMIC-III demo directly from its PhysioNet URL
mimic3_demo = MIMIC3Dataset(
    root="https://physionet.org/files/mimiciii-demo/1.4/",
    tables=["DIAGNOSES_ICD", "PRESCRIPTIONS"],
    dev=True  # Use dev mode for small subset
)


No config path provided, using default config
Initializing mimic3 dataset from https://physionet.org/files/mimiciii-demo/1.4/ (dev mode: True)
Scanning table: patients from https://physionet.org/files/mimiciii-demo/1.4/PATIENTS.csv.gz
Original path does not exist. Using alternative: https://physionet.org/files/mimiciii-demo/1.4/PATIENTS.csv
Scanning table: admissions from https://physionet.org/files/mimiciii-demo/1.4/ADMISSIONS.csv.gz
Original path does not exist. Using alternative: https://physionet.org/files/mimiciii-demo/1.4/ADMISSIONS.csv
Scanning table: icustays from https://physionet.org/files/mimiciii-demo/1.4/ICUSTAYS.csv.gz
Original path does not exist. Using alternative: https://physionet.org/files/mimiciii-demo/1.4/ICUSTAYS.csv
Scanning table: diagnoses_icd from https://physionet.org/files/mimiciii-demo/1.4/DIAGNOSES_ICD.csv.gz
Original path does not exist. Using alternative: https://physionet.org/files/mimiciii-demo/1.4/DIAGNOSES_ICD.csv
Joining with table: https://physione
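
The log above shows the loader looking for `<TABLE>.csv.gz` first and falling back to `<TABLE>.csv`. The sketch below mirrors that behavior for a local directory so you can check a download before loading it. It is not PyHealth's implementation; `resolve_table_path` is a hypothetical helper, and the temporary directory stands in for a real dataset root.

```python
import tempfile
from pathlib import Path


def resolve_table_path(root, table):
    """Return <root>/<table>.csv.gz if it exists, otherwise <root>/<table>.csv."""
    root = Path(root)
    gz_path = root / f"{table}.csv.gz"
    if gz_path.exists():
        return gz_path
    csv_path = root / f"{table}.csv"
    if csv_path.exists():
        return csv_path
    raise FileNotFoundError(f"No {table}.csv.gz or {table}.csv under {root}")


# Stand-in for a dataset root that only contains the uncompressed file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "PATIENTS.csv").write_text("subject_id\n1\n")
    resolved = resolve_table_path(tmp, "PATIENTS")
    print(resolved.name)
```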

In [19]:
mimic3_demo.stats()  # stats() prints the summary itself and returns None

Collecting global event dataframe...
Dev mode enabled: limiting to 1000 patients
Collected dataframe with shape: (12524, 46)
Dataset: mimic3
Dev mode: True
Number of patients: 100
Number of events: 12524


## 2. MIMIC-IV

**Description**: Updated version of MIMIC-III with EHR, clinical notes, and chest X-rays from 2008-2019.

**Source/Link**: https://physionet.org/content/mimiciv/2.2/

**Download Method**:
- PhysioNet account required.
- Download: `wget -r -N -c -np --user [USERNAME] --ask-password https://physionet.org/files/mimiciv/2.2/`
- Demo: `wget -r -N -c -np https://physionet.org/files/mimic-iv-demo/2.2/`

**Restrictions**: PhysioNet account, HIPAA training. ~200GB.

**Example Usage**:

In [20]:
from pyhealth.datasets import MIMIC4Dataset

# EHR only
mimic4_demo = MIMIC4Dataset(
    ehr_root="https://physionet.org/files/mimic-iv-demo/2.2/",
    ehr_tables=["diagnoses_icd", "prescriptions"],
    dev=True
)


Memory usage Starting MIMIC4Dataset init: 806.7 MB
Initializing MIMIC4EHRDataset with tables: ['diagnoses_icd', 'prescriptions'] (dev mode: True)
Using default EHR config: /home/ubuntu/PyHealth/pyhealth/datasets/configs/mimic4_ehr.yaml
Memory usage Before initializing mimic4_ehr: 806.7 MB
Initializing mimic4_ehr dataset from https://physionet.org/files/mimic-iv-demo/2.2/ (dev mode: False)
Scanning table: diagnoses_icd from https://physionet.org/files/mimic-iv-demo/2.2/hosp/diagnoses_icd.csv.gz
Joining with table: https://physionet.org/files/mimic-iv-demo/2.2/hosp/admissions.csv.gz
Scanning table: prescriptions from https://physionet.org/files/mimic-iv-demo/2.2/hosp/prescriptions.csv.gz
Scanning table: patients from https://physionet.org/files/mimic-iv-demo/2.2/hosp/patients.csv.gz
Scanning table: admissions from https://physionet.org/files/mimic-iv-demo/2.2/hosp/admissions.csv.gz
Scanning table: icustays from https://physionet.org/files/mimic-iv-demo/2.2/icu/icustays.csv.gz
Memory usag
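
The log above shows that MIMIC-IV tables live in module subdirectories under the root: `hosp/` for hospital-wide tables and `icu/` for ICU tables. A minimal sketch of that layout follows, covering only the tables that appear in the log. `MODULE_OF_TABLE` and `table_url` are hypothetical names for illustration, not PyHealth APIs.

```python
# Table-to-module mapping, taken from the paths printed in the log above.
MODULE_OF_TABLE = {
    "patients": "hosp",
    "admissions": "hosp",
    "diagnoses_icd": "hosp",
    "prescriptions": "hosp",
    "icustays": "icu",
}


def table_url(ehr_root, table):
    """Build the expected location of a MIMIC-IV table under ehr_root."""
    module = MODULE_OF_TABLE[table]
    return f"{ehr_root.rstrip('/')}/{module}/{table}.csv.gz"


print(table_url("https://physionet.org/files/mimic-iv-demo/2.2/", "icustays"))
```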

In [21]:
mimic4_demo.stats()  # stats() prints the summary itself and returns None

Collecting global event dataframe...
Dev mode enabled: limiting to 1000 patients
Collected dataframe with shape: (23108, 34)
Dataset: mimic4
Dev mode: True
Number of patients: 100
Number of events: 23108


## 3. eICU

**Description**: Multi-center ICU database from US hospitals (2014-2015), including demographics, diagnoses, treatments, labs, and vitals.

**Source/Link**: https://eicu-crd.mit.edu/

**Download Method**:
- Register and agree to terms.
- Download: `wget -r -N -c -np --user [USERNAME] --ask-password https://physionet.org/files/eicu-crd/2.0/`

**Restrictions**: Account required, data use agreement. ~10GB.

**Example Usage**:

In [None]:
from pyhealth.datasets import eICUDataset

# This example assumes a local copy of eICU; point root at your download directory
eicu = eICUDataset(
    root="/path/to/eicu-crd/2.0",
    tables=["diagnosis", "medication"],
    dev=True
)


In [None]:
eicu

<pyhealth.datasets.eicu.eICUDataset at 0x76e5605c3650>
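
Several examples in this notebook use placeholder roots such as `/path/to/eicu-crd/2.0`. A small pre-flight check avoids constructing a dataset from a directory that is missing or empty. This is a sketch under the assumption that the data sits in a local directory; `local_root_ready` is a hypothetical helper, not a PyHealth function.

```python
import tempfile
from pathlib import Path


def local_root_ready(root):
    """Return True if root is an existing directory containing at least one entry."""
    path = Path(root)
    return path.is_dir() and any(path.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    empty_ok = local_root_ready(tmp)  # directory exists but holds no files yet
    (Path(tmp) / "diagnosis.csv.gz").write_bytes(b"")
    filled_ok = local_root_ready(tmp)  # now contains one (dummy) table file

print(empty_ok, filled_ok)
```

In practice you would call `local_root_ready(root)` before `eICUDataset(root=root, ...)` and skip the cell, or raise a clear error, when it returns `False`.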

## 4. Datasets in OMOP format

**Description**: The Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM) is a standard for structuring healthcare data. PyHealth supports datasets formatted in OMOP-CDM, for example exports from EHR systems.

**Source/Link**: https://www.ohdsi.org/data-standardization/the-common-data-model/

**Download Method**: Varies by source; often requires partnerships or custom access to OMOP-formatted data.

**Restrictions**: Institutional access often required.

**Example Usage**:

In [40]:
%%script false --no-raise-error

from pyhealth.datasets import OMOPDataset

dataset = OMOPDataset(
    root="/path/to/omop/data",
    tables=["condition_occurrence", "drug_exposure"]
)


## 5. Sleep-EDF

**Description**: Sleep EEG recordings for sleep staging.

**Source/Link**: https://physionet.org/content/sleep-edfx/1.0.0/

**Download Method**: `wget -r -N -c -np https://physionet.org/files/sleep-edfx/1.0.0/`

**Restrictions**: Publicly available, no auth.

**Example Usage**:

In [43]:
%%script false --no-raise-error

from pyhealth.datasets import SleepEDFDataset

dataset = SleepEDFDataset(
    root="/path/to/sleep-edfx",
    dev=True
)
print(dataset.stats())

## 6. SHHS

**Description**: Sleep Heart Health Study polysomnography data.

**Source/Link**: https://sleepdata.org/datasets/shhs

**Download Method**: Register and download from site.

**Restrictions**: Account required.

**Example Usage**:

In [45]:
%%script false --no-raise-error

from pyhealth.datasets import SHHSDataset

dataset = SHHSDataset(
    root="/path/to/shhs",
    dev=True
)
print(f"Loaded {len(dataset.patients)} patients.")

## 7. ISRUC

**Description**: ISRUC-SLEEP dataset for sleep staging.

**Source/Link**: https://sleeptight.isr.uc.pt/?page_id=48

**Download Method**: Download from site.

**Restrictions**: Public.

**Example Usage**:

In [46]:
%%script false --no-raise-error

from pyhealth.datasets import ISRUCDataset

dataset = ISRUCDataset(
    root="/path/to/isruc",
    dev=True
)
print(f"Loaded {len(dataset.patients)} patients.")

## 8. Cardiology (PhysioNet Challenge 2020)

**Description**: ECG data from multiple sources for arrhythmia detection.

**Source/Link**: https://physionet.org/content/challenge-2020/1.0.2/

**Download Method**: `wget -r -N -c -np https://physionet.org/files/challenge-2020/1.0.2/`

**Restrictions**: Public.

**Example Usage**:

In [54]:

from pyhealth.datasets import CardiologyDataset

cardiology = CardiologyDataset(
    # Alternatively, point root at a local copy: root="/path/to/challenge-2020"
    root="https://physionet.org/files/challenge-2020/1.0.2/",
    dev=True
)
print(cardiology)

<pyhealth.datasets.cardiology.CardiologyDataset object at 0x76e557bf53d0>


## 9. COVID-19 CXR

**Description**: Chest X-ray images for COVID-19 classification.

**Source/Link**: Custom or public sources (check PyHealth docs).

**Download Method**: Varies; often from Kaggle or GitHub.

**Restrictions**: Public datasets.

**Example Usage**:

In [58]:
%%script false --no-raise-error

from pyhealth.datasets import COVID19CXRDataset

dataset = COVID19CXRDataset(
    root="/path/to/covid19-cxr",
    dev=True
)
print(f"Loaded {len(dataset.patients)} patients.")

## 10. Sample Dataset (Test/Synthetic)

**Description**: Synthetic dataset for testing and development, with customizable samples.

**Source/Link**: Built-in to PyHealth (no external source).

**Download Method**: No download needed; created programmatically.

**Restrictions**: None; for testing only.

**Example Usage**:

In [61]:
SampleDataset?

Init signature:
SampleDataset(
    samples: List[Dict],
    input_schema: Dict[str, Union[str, Type[pyhealth.processors.base_processor.FeatureProcessor]]],
    output_schema: Dict[str, Union[str, Type[pyhealth.processors.base_processor.FeatureProcessor]]],
    dataset_name: Optional[str] = None,
    task_name: Optional[str] = None,
    input_processors: Optional[Dict[str, pyhealth.processors.base_processor.FeatureProcessor]] = None,
    output_processors: Optional[Dict[str, pyhealth.processors.base_processor.FeatureProcessor]] = None,
) -> None
Docstring:
Sample dataset class for handling and processing data samples.

Attributes:
    samples (List[Dict]): List of data samples.
    input_schema (Dict[str, Union[str, Type[FeatureProcessor], Tuple[Union[str, Type[FeatureProcessor]], Dict[str, Any]]]]):
        Schema for input data. Values can be string aliases

In [None]:
from pyhealth.datasets import SampleDataset

# Create synthetic samples
samples = [
    {"patient_id": "1", "conditions": ["C001", "C002"], "label": 0},
    {"patient_id": "2", "conditions": ["C003"], "label": 1}
]

dataset = SampleDataset(
    samples=samples,
    input_schema={"conditions": "sequence"},
    output_schema={"label": "binary"}
)
print(f"Loaded {len(dataset)} synthetic samples.")

Label label vocab: {0: 0, 1: 1}


Processing samples:   0%|          | 0/2 [00:00<?, ?it/s]


TypeError: unhashable type: 'list'
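
The error above does not say which field triggered `TypeError: unhashable type: 'list'`. One way to narrow it down, independent of PyHealth, is to list the fields whose values cannot be hashed. The sketch below does only that: `unhashable_fields` is a hypothetical helper, and it neither identifies the root cause inside PyHealth nor fixes it.

```python
def unhashable_fields(samples):
    """Map each field name to the indices of samples whose value cannot be hashed."""
    report = {}
    for index, sample in enumerate(samples):
        for field, value in sample.items():
            try:
                hash(value)
            except TypeError:
                report.setdefault(field, []).append(index)
    return report


# Same samples as in the cell above.
samples = [
    {"patient_id": "1", "conditions": ["C001", "C002"], "label": 0},
    {"patient_id": "2", "conditions": ["C003"], "label": 1},
]

print(unhashable_fields(samples))
```

Here only `conditions` is reported, so the list-valued input field is the place to start when checking how the chosen processor handles its values.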

## Additional Datasets

- **DREAMT**: https://physionet.org/content/dreamt/ - Download from PhysioNet.
- **EHRShot**: Benchmark dataset for EHR tasks.
- **Medical Transcriptions**: Text data for NLP.
- **BMD_HS**: Bone mineral density data.
- **TUAB/TUEV**: EEG datasets from Temple University.
- **MIMIC-Extract**: Processed MIMIC data.

For full details, check the PyHealth documentation and dataset source links.