# Data Exploration

Exploring MIMIC waveform data.

---

## Identify ICU stays

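Identify the ICU stays in a MIMIC waveform database. This tutorial refers to the [MIMIC III Waveform Database](https://doi.org/10.13026/c2607m); the code below reads from the MIMIC IV Waveform Demo Database (`mimic4wdb/0.1.0`).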

### Specify the required Python packages
- The WFDB package is imported using `import wfdb`

In [1]:
# Setup
import sys
from pathlib import Path

import wfdb # The WFDB Toolbox

wfdb.set_db_index_url('https://challenge.physionet.org/benjamin/db')

<div class="alert alert-block alert-warning"> <b>Resource:</b> You can find out more about the WFDB package <a href="https://physionet.org/content/wfdb-python/3.4.1/">here</a>. </div>

### Get a list of ICU stays in the database
- Use the [`get_record_list`](https://wfdb.readthedocs.io/en/latest/io.html#wfdb.io.get_record_list) function from the WFDB toolbox to get a list of ICU stays (here, corresponding to records) in the database.

In [3]:
icustay_records = []
database_name = 'mimic4wdb/0.1.0' # The name of the MIMIC IV Waveform Demo Database on Physionet
subjects = wfdb.get_record_list(f'{database_name}')
for subject in subjects:
    studies = wfdb.get_record_list(f'{database_name}/{subject}')
    for study in studies:
        icustay_records.append(Path(f'{subject}{study}'))
print("Done: Loaded list of {} ICU stays for '{}' database".format(len(icustay_records), database_name))

Done: Loaded list of 200 ICU stays for 'mimic4wdb/0.1.0' database


- Display the first few records

In [4]:
print("First five ICU stays: {}".format(str(icustay_records[0:5])))

First five ICU stays: [PosixPath('p100/p10014354/81739927/81739927'), PosixPath('p100/p10019003/87033314/87033314'), PosixPath('p100/p10020306/83404654/83404654'), PosixPath('p100/p10039708/83411188/83411188'), PosixPath('p100/p10039708/85583557/85583557')]


Note the formatting of these records: each starts with an intermediate directory ("p100" in this case), followed by a subject directory (e.g. "p10014354"), a record directory (e.g. "81739927"), and the record name.
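
The path components described above can be pulled apart with `pathlib`. This is a minimal sketch, assuming every record path has exactly four components as in the output above. The helper `split_record_path` and the field names are hypothetical and not part of the WFDB package.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class RecordParts(NamedTuple):
    group_dir: str    # intermediate directory, e.g. 'p100'
    subject_dir: str  # subject directory, e.g. 'p10014354'
    study_dir: str    # record directory, e.g. '81739927'
    record_name: str  # record name, e.g. '81739927'


def split_record_path(record_path):
    """Split a record path of the assumed form group/subject/study/record."""
    parts = PurePosixPath(record_path).parts
    if len(parts) != 4:
        raise ValueError(f"Expected 4 path components, got {len(parts)}: {record_path}")
    return RecordParts(*parts)


parts = split_record_path('p100/p10014354/81739927/81739927')
print("Subject: {}, record: {}".format(parts.subject_dir, parts.record_name))
```

Raising an error on an unexpected number of components makes a change in the directory layout visible immediately, rather than silently mislabelling the parts.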

<div class="alert alert-block alert-info"> <b>Q:</b> Can you print the names of the last five ICU stays? <br> <b>Hint:</b> in Python, the last five elements can be specified using '[-5:]' </div>

---
## Extract metadata for an ICU stay

Each ICU stay contains metadata stored in a header file, named "\<ICU stay record name\>.hea"

### Specify the online directory containing an ICU stay's data

In this case, each ICU stay corresponds to a record.

In [5]:
icustay_no = 3 # specify the fourth record (noting that in Python the first index is 0)
icustay_record = icustay_records[icustay_no]
# icustay_record_dir = database_name + '/' + icustay_record
icustay_record_dir = f'{database_name}/{icustay_record.parent}'
print("Physionet directory specified for this ICU stay: {}".format(icustay_record_dir))

Physionet directory specified for this ICU stay: mimic4wdb/0.1.0/p100/p10039708/83411188


### Specify the name of the ICU stay

Extract the ICU stay record name (e.g. '83411188') from the ICU stay record (e.g. 'p100/p10039708/83411188/83411188'):

In [6]:
icustay_record_name = icustay_record.name
print("ICU stay name: {}".format(icustay_record_name))

ICU stay name: 83411188


### Load the metadata for this ICU stay
- Use the [`rdheader`](https://wfdb.readthedocs.io/en/latest/io.html#wfdb.io.rdheader) function from the WFDB toolbox to load metadata from the record header file

In [9]:
icustay_record_data = wfdb.rdheader(icustay_record_name, pn_dir=icustay_record_dir, rd_segments=True)
# NOTE "https://physionet.org/content/" won't be correct until the MIMIC-IV-Waveform is officially released
print("Done: metadata loaded for ICU stay '{}' from header file at URL: {}".format(icustay_record_name, "https://physionet.org/content/" + icustay_record_dir + "/" + icustay_record_name + ".hea"))

Done: metadata loaded for ICU stay '83411188' from header file at URL: https://physionet.org/content/mimic4wdb/0.1.0/p100/p10039708/83411188/83411188.hea
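
The header file URL printed above is assembled from the database name, the record directory and the record name. The sketch below wraps that string construction in a hypothetical helper, `header_url`, so it can be reused for other records. The default base URL is an assumption taken from the cell above, which notes that it may not be correct until the database is officially released.

```python
from pathlib import PurePosixPath


def header_url(database_name, record_path, base_url="https://physionet.org/content/"):
    """Build the URL of a record's header file (the base URL is an assumption)."""
    record_path = PurePosixPath(record_path)
    # The record path already ends with the record name, so only '.hea' is appended
    return f"{base_url}{database_name}/{record_path}.hea"


url = header_url('mimic4wdb/0.1.0', 'p100/p10039708/83411188/83411188')
print(url)
```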


---
## Inspect details of physiological signals recorded in this ICU stay
- Printing a few details of the signals from the extracted metadata

In [10]:
print("- Number of signals: {}".format(icustay_record_data.n_sig))
print("- Duration: {:.1f} hours".format(icustay_record_data.sig_len/(icustay_record_data.fs*60*60))) 
# NOTE fs isn't 125 Hz
print("- Sampling frequency: {} Hz".format(icustay_record_data.fs))

- Number of signals: 6
- Duration: 14.2 hours
- Sampling frequency: 62.4725 Hz


Note that:
- Not all signals may be present throughout the duration of the record.
- The header for this record reports a sampling frequency of 62.4725 Hz, not the 125 Hz used in the MIMIC III Waveform Database.
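
The duration printed above is the number of samples divided by the sampling frequency. A minimal sketch of that arithmetic follows. The helper `duration_seconds` is hypothetical, and the sample count is an illustrative value rather than one read from the database.

```python
def duration_seconds(sig_len, fs):
    """Return the duration in seconds of sig_len samples recorded at fs Hz."""
    if fs <= 0:
        raise ValueError("Sampling frequency must be positive")
    return sig_len / fs


# Illustrative values only: 224901 samples at 62.4725 Hz corresponds to one hour
sig_len = 224901
fs = 62.4725
seconds = duration_seconds(sig_len, fs)
print("Duration: {:.1f} minutes ({:.1f} hours)".format(seconds / 60, seconds / 3600))
```

Reading `fs` from the header, rather than hard-coding 125 Hz, keeps the calculation correct for records with a different sampling frequency.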

---
## Inspect the segments making up an ICU stay
Each ICU stay is typically made up of several segments (which correspond to records)

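- List the studies recorded for the subject of this ICU stay (as noted in the code below, this does not yet list the individual files in the ICU stay)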

In [16]:
# NOTE This isn't what is needed: we need all .hea files under the 2nd level records
# I don't think this is possible with get_record_list unless Benjamin updates his 2nd level to include all .hea's
# icustay_record_data = wfdb.rdheader(icustay_record_name, pn_dir=icustay_record_dir, rd_segments=False)
icustay_files = wfdb.get_record_list(str(Path(icustay_record_dir).parent))
print("Done: Loaded list of {} studies for subject '{}'".format(len(icustay_files), str(Path(icustay_record_dir).parent.parts[-1])))

Done: Loaded list of 2 studies for subject 'p10039708'


Inspect the first two studies:

In [17]:
print("The first study is: '{}'".format(icustay_files[0]) )
print("The second study is: '{}'".format(icustay_files[1]) )

The first study is: '83411188/83411188'
The second study is: '85583557/85583557'
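
The same per-subject grouping can be derived offline from the record list loaded earlier, without another call to the database. The sketch below is one way to do this, assuming the four-component path layout shown earlier. The helper `group_studies_by_subject` is hypothetical, and the input list is copied from the "First five ICU stays" output.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_studies_by_subject(record_paths):
    """Map each subject directory to the list of its study directories."""
    studies = defaultdict(list)
    for record_path in record_paths:
        # Assumed layout: group/subject/study/record
        group_dir, subject_dir, study_dir, record_name = PurePosixPath(record_path).parts
        studies[subject_dir].append(study_dir)
    return dict(studies)


first_five = [
    'p100/p10014354/81739927/81739927',
    'p100/p10019003/87033314/87033314',
    'p100/p10020306/83404654/83404654',
    'p100/p10039708/83411188/83411188',
    'p100/p10039708/85583557/85583557',
]
by_subject = group_studies_by_subject(first_five)
print("Studies for subject 'p10039708': {}".format(by_subject['p10039708']))
```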


The waveform data for the ICU stay is split into segments, with one file per segment. The segment names are stored in the `seg_name` attribute of the record header loaded above.

In [21]:
# icustay_segments = [s for s in icustay_files if "_" in s]
# icustay_record_data.get_sig_segments()
# icustay_record_data.get_sig_name()
icustay_segments = icustay_record_data.seg_name
print("The {} segments from study {} are: {}".format(len(icustay_segments), icustay_record_name, icustay_segments) )

The 6 segments from study 83411188 are: ['83411188_0000', '83411188_0001', '83411188_0002', '83411188_0003', '83411188_0004', '83411188_0005']


Note the format of the names of the files containing waveform data for each segment: record name, "_", and a zero-padded four-digit segment number.
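
The naming convention above can be parsed and rebuilt with plain string operations. This is a small sketch, assuming segment numbers are always zero-padded to four digits as in the list above. Both helper names are hypothetical.

```python
def parse_segment_name(segment_name):
    """Split a segment name such as '83411188_0003' into (record name, segment number)."""
    record_name, separator, number = segment_name.rpartition('_')
    if not separator or not number.isdigit():
        raise ValueError(f"Unexpected segment name: {segment_name}")
    return record_name, int(number)


def build_segment_name(record_name, segment_number):
    """Rebuild a segment name, assuming a four-digit zero-padded segment number."""
    return f"{record_name}_{segment_number:04d}"


record_name, segment_number = parse_segment_name('83411188_0003')
print("Record: {}, segment number: {}".format(record_name, segment_number))
print("Rebuilt name: {}".format(build_segment_name(record_name, segment_number)))
```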

---
## Inspect an individual segment
### Read the metadata for this segment
- Read the metadata from the header file

In [30]:
# segment_name = icustay_segments[0]
# NOTE We can just use what was loaded above
# segment_metadata = wfdb.rdheader(record_name=segment_name, pn_dir=icustay_record_dir) 
segment_metadata = wfdb.rdheader(record_name=icustay_segments[1], pn_dir=icustay_record_dir) 
print("Header metadata loaded for the segment 1 in study '{}' for subject '{}'".format(icustay_record_name, str(Path(icustay_record_dir).parent.parts[-1])))

Header metadata loaded for the segment 1 in study '83411188' for subject 'p10039708'


### Find out what signals are present, and for how long

In [31]:
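# NOTE sig_name is read here from the ICU stay header (icustay_record_data), not from segment_metadata, so the list may include signals that are absent from this segment
print("This segment contains the following signals: {}".format(icustay_record_data.sig_name))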
print("The signals are measured in units of: {}".format(segment_metadata.units))

This segment contains the following signals: ['II', 'V', 'aVR', 'ABP', 'Pleth', 'Resp']
The signals are measured in units of: ['mV', 'mmHg', 'NU', 'Ohm']


See [here](https://archive.physionet.org/mimic2/mimic2_waveform_overview.shtml#signals-125-samplessecond) for definitions of signal abbreviations.
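
In the output above, six signal names are printed alongside four units, so the two lists cannot be matched up position by position. The sketch below shows one way to pair names with units while checking that both lists have the same length. The helper `pair_signals_with_units` is hypothetical, and the two-signal example uses illustrative values rather than values read from this segment.

```python
def pair_signals_with_units(sig_names, units):
    """Return a dict mapping each signal name to its unit, checking the lengths match."""
    if len(sig_names) != len(units):
        raise ValueError(
            f"{len(sig_names)} signal names but {len(units)} units: "
            "the lists may come from different headers"
        )
    return dict(zip(sig_names, units))


# Illustrative values only
print(pair_signals_with_units(['II', 'ABP'], ['mV', 'mmHg']))

# The lists printed above have different lengths, so pairing them is rejected
try:
    pair_signals_with_units(['II', 'V', 'aVR', 'ABP', 'Pleth', 'Resp'],
                            ['mV', 'mmHg', 'NU', 'Ohm'])
except ValueError as error:
    print("Mismatch: {}".format(error))
```

Passing `sig_name` and `units` from the same header object (for example both from `segment_metadata`) should avoid the mismatch.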

<div class="alert alert-block alert-info"> <b>Q:</b> Which of these signals is still present in segment '83411188_0005'? </div>

All signals in a segment are time-aligned, measured at the same sampling frequency, and last the same duration:

In [32]:
print("All the signals are sampled at {} Hz".format(segment_metadata.fs))
print("and they last for {:.1f} minutes".format(segment_metadata.sig_len/(segment_metadata.fs*60)) )

All the signals are sampled at 62.4725 Hz
and they last for 0.1 minutes
