This notebook contains examples of how to convert different datasets into .h5 format. We use the .h5 format because individual records can be read directly from disk, which is far quicker than with formats that require loading the whole dataset into memory. For more information about available datasets, see our [website page](https://wearablebp.github.io/datasets) on publicly available datasets.

The examples below create segments of arbitrary size. The segments of each record are stacked together in the form [segment number, sensor data number, segment length]. For example, in the MIMIC-II dataset from UCI Repository, record 0 will be saved with the shape [21, 3, 128], where 21 is the number of segments, 3 is the number of signals (ECG, PPG, ABP), and 128 is the segment length. Data can be retrieved by specifying '<record number>' and indexing (example: `hf['0'][0, :, :]`). The record numbers can be viewed using the command `hf.keys()`.
   
1. [MIMIC-II from UCI Repository](#mimic)
2. [PPG-BP](#ppgbp)
3. [University of Queensland Vital Signs Database](#uoq)
4. [PTT-PPG](#pttppg)
5. [CHARIS](#charis)
6. [HYPE](#hype)
7. [Non-Invasive Blood Pressure Estimation](#nibp)
8. [VitalDB](#vitaldb)
9. [Aurora-BP](#aurora)
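
The [segment number, sensor data number, segment length] layout described above can be sketched with plain NumPy. This is a minimal illustration on synthetic data, not the exact code used in the cells below; the helper name `segment_record` and the record length of 2700 samples are assumptions chosen so that the result matches the [21, 3, 128] example.

```python
import numpy as np

def segment_record(record, segment_length):
    """Split a [n_signals, n_samples] array into [n_segments, n_signals, segment_length].

    Samples that do not fill a complete segment at the end are dropped.
    """
    n_signals, n_samples = record.shape
    n_segments = n_samples // segment_length
    trimmed = record[:, :n_segments * segment_length]
    # reshape splits the time axis, transpose moves the segment axis to the front
    return trimmed.reshape(n_signals, n_segments, segment_length).transpose(1, 0, 2)

# synthetic stand-in for one record with 3 signals (hypothetical length)
record = np.random.default_rng(0).normal(size=(3, 2700))
segments = segment_record(record, 128)
print(segments.shape)
```

Segment `i` of signal `j` is then `segments[i, j]`, which corresponds to `record[j, i*128:(i+1)*128]`.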

In [None]:
import h5py
import os
import numpy as np
import pandas as pd
import json
from scipy import signal
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [None]:
# special packages used for particular datasets
!pip install openpyxl
!pip install devicely
!pip install glob2
!pip install wfdb
!pip install vitaldb

import glob
import devicely
import vitaldb
import wfdb  # used by the CHARIS example

In [None]:
def load_h5_dset(datapath):
    dset = h5py.File(datapath, 'r')
    return dset
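
As a rough illustration of why .h5 is convenient here, the sketch below writes one synthetic record to a temporary file and then reads back a single segment. The file name `example.h5` and the synthetic data are assumptions for illustration only. Indexing an open h5py dataset reads just the requested slice from disk rather than the whole array.

```python
import os
import tempfile

import h5py
import numpy as np

# synthetic stand-in for one record: 21 segments, 3 signals, 128 samples each
record = np.arange(21 * 3 * 128, dtype=np.float32).reshape(21, 3, 128)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'example.h5')  # hypothetical file name

    with h5py.File(path, 'w') as hf:
        hf.create_dataset('0', data=record)

    with h5py.File(path, 'r') as hf:
        record_names = list(hf.keys())     # names of the stored records
        dataset_shape = hf['0'].shape      # metadata only, no signal data is read
        first_segment = hf['0'][0, :, :]   # reads only this slice into memory

print(record_names, dataset_shape, first_segment.shape)
```

Using `with` blocks closes the file automatically, which matters when the same file is written and then reopened in one session.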

# MIMIC-II from UCI Repository ([dataset link](https://archive.ics.uci.edu/ml/datasets/Cuff-Less+Blood+Pressure+Estimation), [paper link](https://ieeexplore.ieee.org/document/7491263))



Contains raw ECG, PPG, and ABP signals from the PhysioNet [MIMIC-II Waveform Database](https://archive.physionet.org/physiobank/database/mimic2wdb/) processed by [Kachuee et al., (2015)](https://ieeexplore.ieee.org/document/7491263). It is unclear which records come from the same patient. According to the paper, approximately 1 in 12 signals come from the same person. A [follow up work](https://ieeexplore.ieee.org/document/8938751) with a co-author also used this dataset and reduced the data leakage by shuffling and random sampling.

<a id='mimic'></a>

In [None]:
datapath = '../../datasets/MIMIC-II/Part1234.mat'
dset = load_h5_dset(datapath)
segment_length = 128

with h5py.File('../../datasets/MIMIC-II/kachuee17_' + str(int(segment_length)) + '.h5', 'w') as hf:
    keys = list(dset.keys())
    for k in range(len(keys)):
        subj_data = []
        if dset[keys[k]][:].shape[1] >= segment_length:
            for i in range(0, segment_length*(dset[keys[k]][:].shape[1]//segment_length), segment_length):
                subj_data.append(dset[keys[k]][:, i:i+segment_length])
        if len(subj_data) > 0:
            hf.create_dataset(keys[k], data=np.array(subj_data))

In [None]:
# open the file written by the cell above (same segment_length)
with h5py.File('../../datasets/MIMIC-II/kachuee17_' + str(int(segment_length)) + '.h5', 'r') as hf:
    print(hf['0'])

# PPG-BP Dataset ([dataset link](https://figshare.com/articles/dataset/PPG-BP_Database_zip/5459299), [paper link](https://www.nature.com/articles/sdata201820))

Contains ~2.1 s PPG recordings, cuff BP measurements, and demographic information. Cuff BP measurements are contained in 'Table 1.xlsx'. The PPG data is filtered using SQI metrics from [Liang et al., (2018)](https://www.nature.com/articles/sdata201876).

Because .h5 files store data in arrays, the data need to be matched in all dimensions. Since cuff BP measurements are only reported once for each subject, an array of size <segment length> is created with the first half of values as SBP and second half of values as DBP.
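
The half-SBP, half-DBP channel described above can be sketched as follows. This is a simplified illustration on a synthetic PPG trace; the helper name `attach_bp_channel` and the BP values are hypothetical. Unlike the conversion cell, it also handles an odd number of samples by giving the extra sample to the DBP half.

```python
import numpy as np

def attach_bp_channel(ppg, sbp, dbp):
    """Stack a 1-D PPG trace with a BP channel of the same length.

    The first half of the BP channel holds SBP and the second half holds DBP.
    Returns an array of shape [2, n_samples].
    """
    n = ppg.shape[0]
    half = n // 2
    bp = np.concatenate((np.full(half, sbp, dtype=float),
                         np.full(n - half, dbp, dtype=float)))
    return np.stack((ppg, bp), axis=0)

# synthetic 2100-sample trace with hypothetical cuff readings
ppg = np.sin(np.linspace(0, 2 * np.pi, 2100))
segment = attach_bp_channel(ppg, sbp=120, dbp=80)
print(segment.shape)
```

When reading the data back, the reference values can be recovered from the first and last sample of the BP channel, for example `segment[1, 0]` for SBP and `segment[1, -1]` for DBP.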

<a id='ppgbp'></a>

In [None]:
def ppgbp_to_h5(dirpath):

    fnames = os.listdir(dirpath)
    subjs = []
    for fname in fnames:
        subjs.append(fname.split('_')[0])
    unique_subjs = np.unique(subjs)

    gt = pd.read_excel(dirpath + '../' + 'PPG-BP dataset.xlsx', header=1)
    subj_ids = gt['subject_ID'].astype(str).values
    sbp_gt = gt['Systolic Blood Pressure(mmHg)'].values
    dbp_gt = gt['Diastolic Blood Pressure(mmHg)'].values

    with h5py.File('../../datasets/PPG-BP/liang18.h5', 'w') as hf:
        for subj in unique_subjs:
            subj_data = np.array([])
            for filename in os.listdir(dirpath):
                parts = filename.split('_')
                if parts[0] == subj:
                    ppg = np.loadtxt(dirpath + filename)
                    ppg = np.expand_dims(ppg, axis=0)  # shape (1, n_samples)
                    n = ppg.shape[1]
                    match = np.where(subj_ids == subj)
                    # BP channel: first half SBP, second half DBP
                    bp = np.hstack((np.ones((1, n//2))*sbp_gt[match], np.ones((1, n//2))*dbp_gt[match]))
                    d = np.stack((ppg, bp), axis=1)  # shape (1, 2, n_samples)
                    if subj_data.size == 0:
                        # first recording of this subject
                        subj_data = d
                    else:
                        # later recordings are truncated to 2100 samples before stacking
                        subj_data = np.vstack((subj_data, d[:, :, :2100]))
            hf.create_dataset(subj, data=subj_data)
    
dirpath = '../../datasets/PPG-BP/0_subject/'
ppgbp_to_h5(dirpath)

In [None]:
datapath = '../../datasets/PPG-BP/liang18.h5'
dset = load_h5_dset(datapath)
dset['10']

# University of Queensland Vital Signs Database ([dataset link](https://outbox.eait.uq.edu.au/uqdliu3/uqvitalsignsdataset/index.html), [paper link](https://journals.lww.com/anesthesia-analgesia/Fulltext/2012/03000/University_of_Queensland_Vital_Signs_Dataset_.15.aspx))

Contains vital signs data of 32 surgical cases where patients underwent anaesthesia at the Royal Adelaide Hospital. Compared to MIMIC, UoQ provides a more complete dataset with simultaneous and synchronized recording of multiple vital sign parameters. Contains ECG, PPG, ABP signals and more.

<a id='uoq'></a>

In [None]:
def uoq_to_h5(dirpath):

    # first pass: collect the union of numeric column names across all files
    numeric_cols = np.array([])
    for case in os.listdir(dirpath):
        if 'case' in case:
            fulldata_dir = dirpath + case + '/fulldata/'
            for f in os.listdir(fulldata_dir):
                # on_bad_lines='skip' replaces error_bad_lines=False (pandas >= 1.3)
                d = pd.read_csv(fulldata_dir + f, on_bad_lines='skip')
                d = d._get_numeric_data()
                numeric_cols = np.append(numeric_cols, d.columns)
    numeric_cols = np.unique(numeric_cols)

    # second pass: write one dataset per file, all with the same column order
    with h5py.File(dirpath + 'uoq_dset.h5', 'w') as hf:
        for case in os.listdir(dirpath):
            if 'case' in case:
                fulldata_dir = dirpath + case + '/fulldata/'
                for f in os.listdir(fulldata_dir):
                    d = pd.read_csv(fulldata_dir + f, on_bad_lines='skip')
                    # reindex fills columns that are missing from this file with NaN
                    d = d.reindex(columns=numeric_cols)
                    for col in d.columns:
                        d[col] = pd.to_numeric(d[col], errors='coerce')
                    # inside the file loop so every file is saved, not only the last one
                    hf.create_dataset(case + '_' + f.split('_')[-1].split('.')[0], data=d.T)

dirpath = '../../datasets/uqvitalsignsdata/'
uoq_to_h5(dirpath)
dset = load_h5_dset('../../datasets/uqvitalsignsdata/uoq_dset.h5')

# PTT-PPG ([dataset link](https://physionet.org/content/pulse-transit-time-ppg/1.0.0/))

Contains time synchronised (and some multi-site) signals worn at different body locations including PPG, gyroscope, cuff BP, and ECG from 22 healthy subjects performing 3 physical activities. ECG data is also annotated. Also includes SpO2 from PPG.

<a id='pttppg'></a>

In [None]:
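def pttppg_to_h5(dirpath):
    # Assumption: files are named 's<subject>_<activity>.csv'; the original cell
    # used unique_fnames and unique_subjs without defining them.
    unique_fnames = np.unique([f for f in os.listdir(dirpath) if f.endswith('.csv')])
    unique_subjs = np.unique([f.split('_')[0][1:] for f in unique_fnames])

    # first pass: collect the union of numeric column names across all files
    numeric_cols = np.array([])
    for s in unique_subjs:
        for f in unique_fnames:
            if 's' + s + '_' in f:
                d = pd.read_csv(dirpath + f)
                d = d._get_numeric_data()
                numeric_cols = np.append(numeric_cols, d.columns)
    numeric_cols = np.unique(numeric_cols)

    # second pass: write one dataset per file, all with the same column order
    with h5py.File(dirpath + 'pttppg.h5', 'w') as hf:
        for s in unique_subjs:
            for f in unique_fnames:
                if 's' + s + '_' in f:
                    d = pd.read_csv(dirpath + f)
                    # reindex fills columns that are missing from this file with NaN
                    d = d.reindex(columns=numeric_cols)
                    for col in d.columns:
                        d[col] = pd.to_numeric(d[col], errors='coerce')
                    hf.create_dataset(f.split('.')[0], data=d.T)
    return numeric_cols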

dirpath = '../../datasets/pulse-transit-time-ppg/1.1.0/csv/'
cols = pttppg_to_h5(dirpath)
dset = load_h5_dset('../../datasets/pulse-transit-time-ppg/1.1.0/csv/pttppg.h5')

# CHARIS ([dataset link](https://physionet.org/content/charisdb/1.0.0/), [paper link](https://link.springer.com/article/10.1007/s10877-015-9779-3))

Contains multi-channel recordings of ECG, arterial blood pressure (ABP), and intracranial pressure (ICP) of 29 patients diagnosed with traumatic brain injury (TBI) over an 18-month period.

<a id="charis"></a>

In [None]:
def charis_to_h5(dirpath):
    fnames = np.array([])
    for f in os.listdir(dirpath):
        if ('charis' in f) & ('h5' not in f):
            fnames = np.append(fnames, f.split('.')[0])
    unique_fnames = np.unique(fnames)

    hf = h5py.File(dirpath + 'charis.h5', 'w')
    for f in unique_fnames:
        if ('charis' in f) & ('h5' not in f):
            signals, fields = wfdb.rdsamp(dirpath + f.split('.')[0])
            hf.create_dataset(f.split('.')[0].split('charis')[1], data=signals.T)
    hf.close()
    
dirpath = '../../datasets/charisdb/1.0.0/'
charis_to_h5(dirpath)
dset = load_h5_dset('../../datasets/charisdb/1.0.0/charis.h5')

# HYPE ([dataset request form](https://docs.google.com/forms/d/e/1FAIpQLSe9ak7gqdGeRhhvG5Z3DqsyLxvfUcQ3Ktbs7wFfay7VxmU9ag/viewform), [paper link](https://link.springer.com/chapter/10.1007/978-3-030-59137-3_29), [GitHub repo](https://github.com/arianesasso/aime-2020))

Contains PPG signals from the Empatica E4 watch for subjects performing stress tests (N=8) and over 24 hours (N=9) with cuff BP reference. Also includes age, gender, and BMI information.
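
Because the cuff reference is sparse while the watch signals are continuous, the conversion cells pair each cuff reading with the watch samples recorded from that reading until the next one. The sketch below shows the same interval rule with NumPy on synthetic timestamps. The helper name `label_samples` and all values are hypothetical, and it is a simplified stand-in for the pandas-based code, not a replacement for it.

```python
import numpy as np

def label_samples(sample_times, bp_times, bp_values):
    """Assign each sample the most recent BP reading.

    A sample at time t gets reading i when bp_times[i] <= t < bp_times[i+1].
    Samples before the first reading or from the last reading onwards are dropped.
    bp_times must be sorted in increasing order.
    """
    idx = np.searchsorted(bp_times, sample_times, side='right') - 1
    keep = (idx >= 0) & (idx < len(bp_times) - 1)
    return sample_times[keep], bp_values[idx[keep]]

# synthetic timestamps in seconds and hypothetical SBP readings
sample_times = np.arange(10)
bp_times = np.array([2, 5, 9])
bp_values = np.array([120, 130, 125])

kept_times, labels = label_samples(sample_times, bp_times, bp_values)
print(kept_times, labels)
```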

<a id="hype" ></a>

In [None]:
# process stress test data
def hype_to_h5(dirpath):
    hf = h5py.File(dirpath + 'hype.h5', 'w')
    for s in os.listdir(dirpath):
        if ('h5' not in s) & ('DS' not in s):
            patient_base_path = dirpath + s
            sources = {}
            if os.path.exists(patient_base_path+'/Tag'):
                sources['tag'] = glob.glob(patient_base_path+r'/Tag*').pop()
            if os.path.exists(patient_base_path+'/Empatica'):
                sources['empatica'] = glob.glob(patient_base_path+r'/Empatica*').pop()
            if os.path.exists(patient_base_path+'/SpaceLabs'):
                sources['spacelabs'] = glob.glob(patient_base_path+r'/SpaceLabs*').pop() 
            
            if len(sources) > 0:
                empatica = devicely.EmpaticaReader(sources['empatica'])
                if os.path.exists(sources['spacelabs']):
                    for file in os.listdir(sources['spacelabs']):
                        if file.endswith(".abp"):
                            spacelabsfile = os.path.join(sources['spacelabs'], file)
                            break
        #         print(spacelabsfile)
                bp = devicely.SpacelabsReader(spacelabsfile)
            #     bp.drop_EB()
                # lowercase 'h' is accepted by all pandas versions; 'H' is deprecated in recent ones
                bp.timeshift(pd.Timedelta(-2, unit='h'))

                edata = empatica.data._get_numeric_data()
                bpdata = bp.data._get_numeric_data()
                subj_data = pd.DataFrame()
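                for i in range(len(bpdata.index)-1):
                    # copy the slice so the BP columns are added to a new frame, not a view of edata
                    e = edata[(edata.index >= bpdata.index[i]) & (edata.index < bpdata.index[i+1])].copy()
                    for col in bpdata.iloc[i].index:
                        e[col] = bpdata.iloc[i][col]
                        columns = e.columns
                    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
                    subj_data = pd.concat([subj_data, e])
                hf.create_dataset(s, data=subj_data.T)
    # close the file so it is flushed to disk before it is reopened for reading
    hf.close()
    return columns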

dirpath = '../../datasets/hype-de/hype/2019/'
cols = hype_to_h5(dirpath)
dset = load_h5_dset('../../datasets/hype-de/hype/2019/hype.h5')

In [None]:
# process 24 hour data
def hype24H_to_h5(dirpath):
    hf = h5py.File(dirpath + 'hype24H.h5', 'w')
    for s in os.listdir(dirpath):
        if ('h5' not in s) & ('DS' not in s):
            patient_base_path = dirpath + s + '/24 Hours'
            sources = {}
            if os.path.exists(patient_base_path+'/Empatica'):
                sources['empatica'] = glob.glob(patient_base_path+r'/Empatica*').pop()
            if os.path.exists(patient_base_path+'/SpaceLabs'):
                sources['spacelabs'] = glob.glob(patient_base_path+r'/SpaceLabs*').pop() 

            if len(sources) > 0:
                empatica = devicely.EmpaticaReader(sources['empatica'])
                if os.path.exists(sources['spacelabs']):
                    for file in os.listdir(sources['spacelabs']):
                        if file.endswith(".abp"):
                            spacelabsfile = os.path.join(sources['spacelabs'], file)
                            break
        #         print(spacelabsfile)
                bp = devicely.SpacelabsReader(spacelabsfile)
            #     bp.drop_EB()
                # lowercase 'h' is accepted by all pandas versions; 'H' is deprecated in recent ones
                bp.timeshift(pd.Timedelta(-2, unit='h'))

                edata = empatica.data._get_numeric_data()
                bpdata = bp.data._get_numeric_data()
                subj_data = pd.DataFrame()
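                for i in range(len(bpdata.index)-1):
                    # copy the slice so the BP columns are added to a new frame, not a view of edata
                    e = edata[(edata.index >= bpdata.index[i]) & (edata.index < bpdata.index[i+1])].copy()
                    for col in bpdata.iloc[i].index:
                        e[col] = bpdata.iloc[i][col]
                        columns = e.columns
                    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
                    subj_data = pd.concat([subj_data, e])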
                hf.create_dataset(s, data=subj_data.T)
    hf.close()
    return columns

dirpath = '../../datasets/hype-de/hype/2019/'
cols = hype24H_to_h5(dirpath)
dset = load_h5_dset('../../datasets/hype-de/hype/2019/hype24H.h5')

# Non-invasive Blood Pressure Estimation ([dataset link](https://www.kaggle.com/datasets/mkachuee/noninvasivebp), [paper link](https://ieeexplore.ieee.org/document/8032000))

Contains PCG, ECG, and PPG signals from 26 subjects. Additional information includes age, weight, and height. Also contains signals from a force-sensing resistor (FSR) placed under the cuff BP device to identify the exact moments of the reference BP measurements.
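
The cell below locates the cuff measurements by picking the deepest minima of the differenced FSR signal, one per reported BP value, while forcing the picks to be well separated. The core idea can be sketched on a toy array as follows. The helper name `pick_minima` and the numbers are hypothetical, and this is a simplified version of the `find_mins` function used below.

```python
import numpy as np

def pick_minima(values, num_mins, window):
    """Greedily pick the num_mins lowest points that are at least window // 2 apart.

    After each pick, the neighbourhood around it is overwritten with the
    maximum value so that the next pick lands somewhere else.
    """
    work = np.array(values, dtype=float)  # copy, the input is left untouched
    fill = work.max()
    half = window // 2
    found = []
    for _ in range(num_mins):
        i = int(np.argmin(work))
        found.append(i)
        # clamp the start at 0 so a negative index does not wrap around
        work[max(0, i - half):i + half] = fill
    return sorted(found)

toy = [5, 1, 4, 5, 0, 5, 2, 5]
two_minima = pick_minima(toy, num_mins=2, window=4)
three_minima = pick_minima(toy, num_mins=3, window=4)
print(two_minima, three_minima)
```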

<a id="nibp" ></a>

In [None]:
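def find_mins(a, num_mins, window):
    # greedily pick the num_mins lowest points, masking a window around each pick
    found_mins = []
    a = np.array(a)  # work on a copy so the caller's array is untouched
    amax = a.max()
    hwindow = window // 2
    for i in range(num_mins):
        found_min = np.argmin(a)
        found_mins.append(found_min)
        # clamp the start at 0 so a negative index does not wrap around
        a[max(0, found_min-hwindow):found_min+hwindow] = amax
    return sorted(found_mins)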

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def find_bp_measurement_points(data, plot):
    data_FSR = -np.array(data['data_FSR'])
    max_diff = 50
    # np.float was removed in NumPy 1.24; the builtin float is equivalent
    data_FSR_clear = np.array(data_FSR, dtype=float)
    data_FSR_outliers = np.abs(data_FSR[1:] - data_FSR[:-1]) > max_diff
    data_FSR_outliers = np.append(data_FSR_outliers, False)
    data_FSR_clear[data_FSR_outliers] = np.nan
    if plot == True:
        plt.plot(1/np.array(data['data_FSR']))

    mean_window = 10
    data_FSR_roll_mean = np.nanmean(rolling_window(data_FSR_clear, mean_window), axis=-1)
    data_FSR_clear[np.isnan(data_FSR_clear)] = \
        data_FSR_roll_mean[np.isnan(data_FSR_clear)[:1-mean_window]]
    assert np.isnan(data_FSR_clear).sum() == 0
    data_FSR_smooth = signal.savgol_filter(data_FSR_clear, 51, 0)

    diff_n = 1000
    roll_window = 21
    data_FSR_diff = data_FSR_smooth[diff_n:] - data_FSR_smooth[:-diff_n]
    data_FSR_diff_roll = rolling_window(data_FSR_diff, roll_window).mean(axis=-1)

    num_mins = len(data['data_BP'])
    min_window = 15000        
    data_FSR_mins = find_mins(data_FSR_diff_roll, num_mins, min_window)

    if plot == True:
        plt.figure(figsize=(14, 6))
        plt.plot(data_FSR_smooth, label='Smoothed FSR')
        data_FSR_max, data_FSR_min = data_FSR_smooth.max(), data_FSR_smooth.min()
        for m in data_FSR_mins:
            plt.vlines(m + diff_n/2, data_FSR_min, data_FSR_max, color='red')
        plt.legend()
        plt.title('BP measures points')
    return data_FSR_mins

def kachueeNIBPE_to_h5(dirpath):
    data_keys = ['data_PPG', 'data_ECG', 'data_PCG', 'data_FSR']
    hf = h5py.File(dirpath + 'kachueeNIBPE.h5', 'w')
    for fname in os.listdir(dirpath):
        if ('eval' not in fname) & ('h5' not in fname) & ('ipynb' not in fname):
            with open(dirpath + fname, 'r') as f:
                data = json.load(f)
            idxs = find_bp_measurement_points(data, plot=False)
            data_keys = ['data_PPG', 'data_ECG', 'data_PCG', 'data_FSR']
            data['data_BP'].append({'SBP': 0, 'DBP': 0})
            d = pd.DataFrame()
            for col in data_keys:
                d[col] = data[col]
            temp = np.append([0], idxs)
            temp = np.append(temp, len(data['data_PPG']))
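            d['SBP'] = 0
            d['DBP'] = 0
            sbp_col = d.columns.get_loc('SBP')
            dbp_col = d.columns.get_loc('DBP')
            for i in range(len(temp)-1):
                # index rows and column in one .iloc call; chained indexing
                # (d.iloc[...]['SBP'] = ...) may write to a temporary copy
                d.iloc[temp[i]:temp[i+1], sbp_col] = data['data_BP'][i]['SBP']
                d.iloc[temp[i]:temp[i+1], dbp_col] = data['data_BP'][i]['DBP']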
            hf.create_dataset(fname.split('.')[0], data=d.to_numpy().T)
    hf.close()
    return d.columns

dirpath = '../../datasets/kachueeNIBPE/'
cols = kachueeNIBPE_to_h5(dirpath)
dset = load_h5_dset('../../datasets/kachueeNIBPE/kachueeNIBPE.h5')

# VitalDB ([dataset link](https://vitaldb.net/), [paper link](https://www.nature.com/articles/s41597-022-01411-5), [GitHub repo](https://github.com/vitaldb))

Contains high-resolution multi-parameter data from 6388 surgical patients, including 486451 waveform and numeric data tracks of 196 intraoperative monitoring parameters, 73 perioperative clinical parameters, and 34 time-series laboratory result parameters.

Not all data fields are simultaneously recorded. Therefore, in this example, we follow [Zhang et al., (2022)](https://iopscience.iop.org/article/10.1088/1361-6579/abf889/pdf) to extract continuous ECG and PPG data by keeping 5-minute windows with non-zero data and creating 8 s segments. The data is downloaded from the VitalDB database using the VitalDB package. Note that not all data is downloaded, only the cases that contain the requested signals.
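
The windowing rule can be illustrated on a toy signal: find the runs of non-zero samples, keep the runs that are long enough, and cut them into fixed-length segments. At 125 Hz, a 5-minute window corresponds to 37500 samples and an 8 s segment to 1000 samples; the toy example uses much smaller numbers. The helper names `nonzero_runs` and `split_runs` are hypothetical, and this is a simplified sketch rather than the exact code in the cell below.

```python
import numpy as np

def nonzero_runs(x):
    """Return [start, stop) index pairs, one row per run of non-zero values in x."""
    flags = np.concatenate(([0], (np.asarray(x) != 0).astype(np.int8), [0]))
    # a run starts or ends wherever the padded 0/1 flags change
    edges = np.flatnonzero(np.diff(flags) != 0)
    return edges.reshape(-1, 2)

def split_runs(runs, min_length, segment_length):
    """Cut every run of at least min_length samples into full segments."""
    segments = []
    for start, stop in runs:
        if stop - start >= min_length:
            for s in range(start, stop - segment_length + 1, segment_length):
                segments.append((s, s + segment_length))
    return segments

x = np.array([0, 1, 1, 0, 0, 2, 3, 4, 0])
runs = nonzero_runs(x)
segments = split_runs(runs, min_length=3, segment_length=2)
print(runs.tolist(), segments)
```

Multiplying the signals together before calling `nonzero_runs`, as the conversion cell does, keeps only the stretches where every signal is non-zero at the same time.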

<a id="vitaldb" ></a>

In [None]:
def find_nonzero_runs(a):
    # Create an array that is 1 where a is nonzero, and pad each end with an extra 0.
    isnonzero = np.concatenate(([0], (np.asarray(a) != 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(isnonzero))
    # Runs start and end where absdiff is 1.
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return ranges

def vitaldb_to_h5(dirpath, parameters, fsamp):
    caseids = vitaldb.find_cases(parameters)
    with h5py.File(dirpath + 'vitaldb_p1.h5', 'w') as hf:
        for case in caseids:
            print(case)
            d = vitaldb.load_case(case, parameters, fsamp)

            # filter data. keep 5 min windows with non-zero data
            # find nonzero segments > 8s
            d[np.isnan(d)] = 0
            d[d[:, 2] <= 30] = 0
            nonzero_segs = find_nonzero_runs(d[:, 0]*d[:, 1]*d[:, 2])
            valid_segs = []
            for seg in nonzero_segs:
                # > 5mins then split into 8s segments
                if seg[1]-seg[0] > 1/fsamp*60*5:
                    idxs = np.arange(seg[0], seg[1], 1/fsamp*8)
                    for i in range(len(idxs)-1):
                        valid_segs.append([int(idxs[i]), int(idxs[i+1])])
            subj_data = np.array([])
            for seg in valid_segs:
                if len(subj_data) == 0:
                    subj_data = np.expand_dims(d[seg[0]:seg[1], :].T, axis=0)
                else:
                    subj_data = np.vstack((subj_data, np.expand_dims(d[seg[0]:seg[1], :].T, axis=0)))
            hf.create_dataset(str(case), data=subj_data)
    return caseids

dirpath = '../../datasets/vitaldb/'
parameters = ['SNUADC/PLETH', 'SNUADC/ECG_II', 'SNUADC/ART']
fsamp = 1/125  # sampling interval in seconds (i.e. 125 Hz), despite the name
caseids = vitaldb_to_h5(dirpath, parameters, fsamp)
dset = load_h5_dset('../../datasets/vitaldb/vitaldb_p1.h5')

# Aurora-BP ([sample dataset link](https://github.com/microsoft/aurorabp-sample-data/tree/main/sample), [full dataset request link](https://microsoft.na3.adobesign.com/public/esignWidget?wid=CBFCIBAA3AAABLblqZhD74UtFW8mtjvfuL24R-oLahbMHQd2OJTLURiy0cT8RXlTEFf3n5Y8OzpPdEPSiEvY*), [GitHub repo](https://github.com/microsoft/aurorabp-sample-data))

Sample data contains [auscultatory](https://github.com/microsoft/aurorabp-sample-data/tree/main/sample/measurements_auscultatory) or [oscillometric](https://github.com/microsoft/aurorabp-sample-data/tree/main/sample/measurements_oscillometric) data of 5 subjects, including calibration data, exercise challenge, static challenge, static seated, and temporal challenge data. Also contains subject information such as height, weight, and age. The full dataset includes data from 1125 subjects as well as 24-hour BP monitoring data.

The data can be processed in many different ways. Here, we place subject data from each stage (i.e. calibration, exercise challenge) into separate subdicts.

<a id="aurora" ></a>

In [None]:
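def aurorabp_to_h5(dirpath):
    dirpath = os.path.expanduser(dirpath)  # os.listdir does not expand '~'
    cols = None
    with h5py.File(dirpath + 'aurorabp.h5', 'w') as hf:
        for subj in os.listdir(dirpath):
            # skip the output .h5 file and any other non-directory entries
            if not os.path.isdir(dirpath + subj):
                continue
            for fname in os.listdir(dirpath + subj):
                # the keyword is delimiter (or sep), not delimited
                d = pd.read_csv(dirpath + subj + '/' + fname, delimiter='\t')
                hf.create_dataset(fname.split('.tsv')[0], data=d.T)
                cols = d.columns
    return cols

dirpath = '~/datasets/aurorabp-sample-data/sample/measurements_auscultatory/'
cols = aurorabp_to_h5(dirpath)
# read the file back from the location it was written to
dset = load_h5_dset(os.path.expanduser(dirpath) + 'aurorabp.h5')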