# PyHealth Datasets Overview

This notebook provides an overview of all datasets supported in PyHealth. For each dataset, we include:
- **Description**: What the dataset contains.
- **Source/Link**: Where to find the original data.
- **Download Method**: How to obtain the data.
- **Restrictions**: Any access requirements or limitations.
- **Example Usage**: Code to load the dataset in PyHealth.

**Note**: Many datasets require an account, credentials, or a signed data use agreement. Where an open-access demo exists (as for MIMIC-III and MIMIC-IV), start with it.

## 1. MIMIC-III

**Description**: A large dataset of de-identified health records from ICU patients at Beth Israel Deaconess Medical Center (2001-2012). Includes demographics, vital signs, lab results, medications, and more.

**Source/Link**: https://physionet.org/content/mimiciii/1.4/

**Download Method**:
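- Create a PhysioNet account, become a credentialed user, and complete the required CITI training ("Data or Specimens Only Research", which includes the HIPAA material).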
- Download: `wget -r -N -c -np --user [USERNAME] --ask-password https://physionet.org/files/mimiciii/1.4/`
- Demo (no auth): `wget -r -N -c -np https://physionet.org/files/mimiciii-demo/1.4/`
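
The two commands above differ only in the project path and in whether credentials are passed. A minimal sketch of that pattern follows; `physionet_wget_command` is a hypothetical helper written for this notebook, not part of PyHealth or PhysioNet, and `alice` is a placeholder username.

```python
import shlex
from typing import List, Optional

PHYSIONET_FILES = "https://physionet.org/files"


def physionet_wget_command(
    project: str, version: str, username: Optional[str] = None
) -> List[str]:
    """Hypothetical helper: build the wget argument list for a PhysioNet project.

    Credentials are only added when a username is given, which matches the
    difference between the full and demo commands shown above.
    """
    cmd = ["wget", "-r", "-N", "-c", "-np"]
    if username is not None:
        # --ask-password makes wget prompt interactively instead of
        # putting the password on the command line.
        cmd += ["--user", username, "--ask-password"]
    cmd.append(f"{PHYSIONET_FILES}/{project}/{version}/")
    return cmd


demo_cmd = physionet_wget_command("mimiciii-demo", "1.4")
full_cmd = physionet_wget_command("mimiciii", "1.4", username="alice")

# shlex.join renders the argument list as a copy-pasteable shell command.
print(shlex.join(demo_cmd))
print(shlex.join(full_cmd))
```

To run a command from Python rather than printing it, the list can be passed to `subprocess.run(cmd, check=True)`.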

**Restrictions**: Requires a PhysioNet account, completed training, and a signed data use agreement. The download is about 40 GB.

**Example Usage**:

In [None]:
from pyhealth.datasets import MIMIC3Dataset

# Load the open-access demo (no PhysioNet credentials needed)
dataset = MIMIC3Dataset(
    root="https://physionet.org/files/mimiciii-demo/1.4/",
    tables=["DIAGNOSES_ICD", "PRESCRIPTIONS"],
    dev=True,  # dev mode loads only a small subset of patients
)
print(f"Loaded {len(dataset.patients)} patients.")
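
When `root` points at a local download instead of the demo URL, it can help to confirm that the requested tables are actually present before constructing the dataset. The sketch below is one way to do that; `find_missing_tables` is a hypothetical helper, and the assumption that each table is stored as `<TABLE>.csv` or `<TABLE>.csv.gz` directly under `root` may not match every copy of the data.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def find_missing_tables(root: str, tables: Iterable[str]) -> List[str]:
    """Return the tables that have no matching CSV file under root.

    Assumption: each table is stored as <TABLE>.csv or <TABLE>.csv.gz
    directly inside root. Adjust the suffixes if your copy differs.
    """
    root_path = Path(root)
    missing = []
    for table in tables:
        candidates = [root_path / f"{table}.csv", root_path / f"{table}.csv.gz"]
        if not any(path.is_file() for path in candidates):
            missing.append(table)
    return missing


# Demonstration on a throwaway directory that holds only one of the two tables.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "DIAGNOSES_ICD.csv").touch()
    print(find_missing_tables(tmp, ["DIAGNOSES_ICD", "PRESCRIPTIONS"]))
```

An empty result means every requested table was found; anything else names the files to download before calling `MIMIC3Dataset`.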

## 2. MIMIC-IV

**Description**: Successor to MIMIC-III, covering 2008-2019. The core release contains EHR data; clinical notes and chest X-rays are distributed as separate linked modules (MIMIC-IV-Note and MIMIC-CXR).

**Source/Link**: https://physionet.org/content/mimiciv/2.2/

**Download Method**:
- PhysioNet account required.
- Download: `wget -r -N -c -np --user [USERNAME] --ask-password https://physionet.org/files/mimiciv/2.2/`
- Demo: `wget -r -N -c -np https://physionet.org/files/mimic-iv-demo/2.2/`

**Restrictions**: Requires a PhysioNet account and completed training. ~200GB.

**Example Usage**:

In [None]:
from pyhealth.datasets import MIMIC4Dataset

# EHR only
dataset = MIMIC4Dataset(
    ehr_root="https://physionet.org/files/mimic-iv-demo/2.2/",
    ehr_tables=["diagnoses_icd", "prescriptions"],
    dev=True
)
print(f"Loaded {len(dataset.patients)} patients.")

## 3. eICU

**Description**: Multi-center ICU database from US hospitals (2014-2015), including demographics, diagnoses, treatments, labs, and vitals.

**Source/Link**: https://eicu-crd.mit.edu/

**Download Method**:
- Register on PhysioNet and agree to the data use terms.
- Download: `wget -r -N -c -np --user [USERNAME] --ask-password https://physionet.org/files/eicu-crd/2.0/`

**Restrictions**: Account required, data use agreement. ~10GB.

**Example Usage**:

In [None]:
from pyhealth.datasets import eICUDataset

dataset = eICUDataset(
    root="/path/to/eicu-crd/2.0",
    tables=["diagnosis", "medication"],
    dev=True
)
print(f"Loaded {len(dataset.patients)} patients.")

## 4. OMOP

**Description**: Datasets formatted according to the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM).

**Source/Link**: https://www.ohdsi.org/data-standardization/the-common-data-model/

**Download Method**: Varies by source; often requires partnerships or custom access.

**Restrictions**: Institutional access often required.

**Example Usage**:

In [None]:
from pyhealth.datasets import OMOPDataset

dataset = OMOPDataset(
    root="/path/to/omop/data",
    tables=["condition_occurrence", "drug_exposure"]
)
print(f"Loaded {len(dataset.patients)} patients.")

## 5. Sleep-EDF

**Description**: Sleep EEG recordings for sleep staging.

**Source/Link**: https://physionet.org/content/sleep-edfx/1.0.0/

**Download Method**: `wget -r -N -c -np https://physionet.org/files/sleep-edfx/1.0.0/`

**Restrictions**: Publicly available; no authentication required.

**Example Usage**:

In [None]:
from pyhealth.datasets import SleepEDFDataset

dataset = SleepEDFDataset(
    root="/path/to/sleep-edfx",
    dev=True
)
print(f"Loaded {len(dataset.patients)} patients.")

## 6. SHHS

**Description**: Sleep Heart Health Study polysomnography data.

**Source/Link**: https://sleepdata.org/datasets/shhs

**Download Method**: Register at sleepdata.org and download from the site.

**Restrictions**: Account required.

**Example Usage**:

In [None]:
from pyhealth.datasets import SHHSDataset

dataset = SHHSDataset(
    root="/path/to/shhs",
    dev=True
)
print(f"Loaded {len(dataset.patients)} patients.")

## 7. ISRUC

**Description**: ISRUC-SLEEP dataset for sleep staging.

**Source/Link**: https://sleeptight.isr.uc.pt/?page_id=48

**Download Method**: Download from site.

**Restrictions**: Public.

**Example Usage**:

In [None]:
from pyhealth.datasets import ISRUCDataset

dataset = ISRUCDataset(
    root="/path/to/isruc",
    dev=True
)
print(f"Loaded {len(dataset.patients)} patients.")

## 8. Cardiology (PhysioNet Challenge 2020)

**Description**: ECG data from multiple sources for arrhythmia detection.

**Source/Link**: https://physionet.org/content/challenge-2020/1.0.2/

**Download Method**: `wget -r -N -c -np https://physionet.org/files/challenge-2020/1.0.2/`

**Restrictions**: Public.

**Example Usage**:

In [None]:
from pyhealth.datasets import CardiologyDataset

dataset = CardiologyDataset(
    root="/path/to/challenge-2020",
    dev=True
)
print(f"Loaded {len(dataset.patients)} patients.")

## 9. COVID-19 CXR

**Description**: Chest X-ray images for COVID-19 classification.

**Source/Link**: Custom or public sources (check PyHealth docs).

**Download Method**: Varies; often from Kaggle or GitHub.

**Restrictions**: Public datasets.

**Example Usage**:

In [None]:
from pyhealth.datasets import COVID19CXRDataset

dataset = COVID19CXRDataset(
    root="/path/to/covid19-cxr",
    dev=True
)
print(f"Loaded {len(dataset.patients)} patients.")

## Additional Datasets

- **DREAMT**: Available from PhysioNet (https://physionet.org/content/dreamt/).
- **EHRShot**: Benchmark dataset for EHR tasks.
- **Medical Transcriptions**: Text data for NLP.
- **BMD_HS**: Bone mineral density data.
- **TUAB/TUEV**: EEG datasets from Temple University.
- **MIMIC-Extract**: Processed MIMIC data.
- **Sample Dataset**: Synthetic data for testing.

For full details, check the PyHealth documentation and dataset source links.
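
The access requirements listed in sections 1-9 can be collected into a small lookup table, which makes it easy to pick the datasets that need no account. This is only a sketch for this notebook: `DatasetAccess` and `public_datasets` are hypothetical names, and the `public` flags simply restate the **Restrictions** entries above rather than any authoritative source.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DatasetAccess:
    """One row per dataset: display name, PyHealth class name, access summary."""

    name: str
    pyhealth_class: str
    public: bool  # True if this notebook lists the dataset as public
    note: str


# Values restate the Restrictions entries in sections 1-9 of this notebook.
ACCESS = [
    DatasetAccess("MIMIC-III", "MIMIC3Dataset", False, "PhysioNet account, training, data use agreement"),
    DatasetAccess("MIMIC-IV", "MIMIC4Dataset", False, "PhysioNet account, training"),
    DatasetAccess("eICU", "eICUDataset", False, "account, data use agreement"),
    DatasetAccess("OMOP", "OMOPDataset", False, "institutional access often required"),
    DatasetAccess("Sleep-EDF", "SleepEDFDataset", True, "publicly available"),
    DatasetAccess("SHHS", "SHHSDataset", False, "account required"),
    DatasetAccess("ISRUC", "ISRUCDataset", True, "public"),
    DatasetAccess("Cardiology", "CardiologyDataset", True, "public"),
    DatasetAccess("COVID-19 CXR", "COVID19CXRDataset", True, "public datasets"),
]


def public_datasets(entries: Iterable[DatasetAccess]) -> List[str]:
    """Return the names of datasets that can be downloaded without an account."""
    return [entry.name for entry in entries if entry.public]


print(public_datasets(ACCESS))
```

The open-access MIMIC demos are not listed separately here; the full MIMIC releases remain restricted even though their demos are not.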