# Data Exploration

Exploring MIMIC waveform data.

---

## Identify records

Identify the records in the MIMIC IV Waveform Database.

### Specify the required Python packages
- The WFDB package is imported using `import wfdb`

In [1]:
# Setup
import sys
from pathlib import Path

import wfdb # The WFDB Toolbox

wfdb.set_db_index_url('https://challenge.physionet.org/benjamin/db')

<div class="alert alert-block alert-warning"> <b>Resource:</b> You can find out more about the WFDB package <a href="https://physionet.org/content/wfdb-python/3.4.1/">here</a>. </div>

### Get a list of records in the database
- Use the [`get_record_list`](https://wfdb.readthedocs.io/en/latest/io.html#wfdb.io.get_record_list) function from the WFDB toolbox to get a list of records in the database.

In [2]:
records = []
database_name = 'mimic4wdb/0.1.0' # The name of the MIMIC IV Waveform Demo Database on Physionet (see URL: )
subjects = wfdb.get_record_list(f'{database_name}')
for subject in subjects:
    studies = wfdb.get_record_list(f'{database_name}/{subject}')
    for study in studies:
        records.append(Path(f'{subject}{study}'))
print("Done: Loaded list of {} records for '{}' database".format(len(records), database_name))

Done: Loaded list of 200 records for 'mimic4wdb/0.1.0' database


- Display the first few records

In [3]:
print("First five records: {}".format(str(records[0:5])))

First five records: [PosixPath('p100/p10014354/81739927/81739927'), PosixPath('p100/p10019003/87033314/87033314'), PosixPath('p100/p10020306/83404654/83404654'), PosixPath('p100/p10039708/83411188/83411188'), PosixPath('p100/p10039708/85583557/85583557')]


Note the formatting of these records: each starts with an intermediate directory ("p100" in this case), followed by a subject identifier and a record identifier.
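The path layout described above can be made explicit with a small helper. This is a minimal sketch: the function name `split_record_path` and the dictionary keys are hypothetical, and it assumes every record path has exactly four components, as in the five records printed above.

```python
from pathlib import PurePosixPath


def split_record_path(record):
    """Split a record path into its four components.

    Assumes the layout seen in the printed records:
    <intermediate dir>/<subject>/<record dir>/<record name>
    """
    parts = PurePosixPath(record).parts
    if len(parts) != 4:
        raise ValueError(f"Expected 4 path components, got {len(parts)}: {record}")
    group, subject, record_dir, record_name = parts
    return {
        "group": group,              # intermediate directory, e.g. 'p100'
        "subject": subject,          # subject identifier, e.g. 'p10039708'
        "record_dir": record_dir,    # directory named after the record
        "record_name": record_name,  # name of the record (header file stem)
    }


info = split_record_path("p100/p10039708/83411188/83411188")
print(info)
```

Raising an error on an unexpected number of components is a deliberate choice here, so that a change in the database layout is noticed rather than silently mis-parsed.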

<div class="alert alert-block alert-info"> <b>Q:</b> Can you print the names of the last five records? <br> <b>Hint:</b> in Python, the last five elements can be specified using '[-5:]' </div>

---
## Extract metadata for a record

Each record contains metadata stored in a header file, named "\<record name\>.hea"

### Specify the online directory containing a record's data

In [4]:
no = 3 # index of the record to inspect: 3 selects the fourth record (in Python the first index is 0)
record = records[no]
record_dir = f'{database_name}/{record.parent}'
print("Physionet directory specified for this record: {}".format(record_dir))

Physionet directory specified for this record: mimic4wdb/0.1.0/p100/p10039708/83411188


### Specify the name of the record

Extract the record name (e.g. '83411188') from the record path (e.g. 'p100/p10039708/83411188/83411188'):

In [5]:
record_name = record.name
print("Record name: {}".format(record_name))

Record name: 83411188


### Load the metadata for this record
- Use the [`rdheader`](https://wfdb.readthedocs.io/en/latest/io.html#wfdb.io.rdheader) function from the WFDB toolbox to load metadata from the record header file

In [6]:
record_data = wfdb.rdheader(record_name, pn_dir=record_dir, rd_segments=True)
# NOTE "https://physionet.org/content/" won't be correct until the MIMIC-IV-Waveform is officially released
print("Done: metadata loaded for record '{}' from header file at URL: {}".format(record_name, "https://physionet.org/content/" + record_dir + "/" + record_name + ".hea"))

Done: metadata loaded for record '83411188' from header file at URL: https://physionet.org/content/mimic4wdb/0.1.0/p100/p10039708/83411188/83411188.hea


---
## Inspect details of physiological signals recorded in this record
- Print a few details of the signals from the extracted metadata

In [7]:
print("- Number of signals: {}".format(record_data.n_sig))
print("- Duration: {:.1f} hours".format(record_data.sig_len/(record_data.fs*60*60))) 
# NOTE fs isn't 125 Hz
print("- Sampling frequency: {} Hz".format(record_data.fs))

- Number of signals: 6
- Duration: 14.2 hours
- Sampling frequency: 62.4725 Hz


Note that:
- Not all signals may be present throughout the duration of the record
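- The sampling frequency should be read from the header rather than assumed: this record reports 62.4725 Hz, not the 125 Hz used in earlier MIMIC waveform databases.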
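The duration printed above is the record length in samples divided by the sampling frequency, converted from seconds to hours. The sketch below isolates that arithmetic. The function name `duration_hours` and the value of `sig_len` are hypothetical (the real `sig_len` is not shown in the output above); only the sampling frequency of 62.4725 Hz is taken from the header output.

```python
def duration_hours(sig_len, fs):
    """Convert a length in samples to hours, given the sampling frequency in Hz."""
    if fs <= 0:
        raise ValueError("Sampling frequency must be positive")
    seconds = sig_len / fs
    return seconds / (60 * 60)


fs = 62.4725         # Hz, as reported in the record header above
sig_len = 3_200_000  # hypothetical number of samples, for illustration only

print("Duration: {:.1f} hours".format(duration_hours(sig_len, fs)))
```

The same function works for a segment by passing the segment's `sig_len` and `fs` instead of the record's.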

---
## Inspect the segments making up a record
Each record is typically made up of several segments.

- Inspect the files in this record

In [8]:
files = wfdb.get_record_list(str(Path(record_dir).parent))
print("Done: Loaded list of {} records for subject '{}'".format(len(files), str(Path(record_dir).parent.parts[-1])))

Done: Loaded list of 2 records for subject 'p10039708'


Inspect the first two entries in this list:

In [9]:
print("The first record is: '{}'".format(files[0]) )
print("The second record is: '{}'".format(files[1]) )

The first record is: '83411188/83411188'
The second record is: '85583557/85583557'


The number before the / is the record number, which names the record's directory; the number after it is the name of the multi-segment header file. This multi-segment header file provides general information about the record.

The remaining files under a record contain the waveform data, split into segments, with one file per segment.

In [10]:
segments = record_data.seg_name
print("The {} segments from study {} are: {}".format(len(segments), record_name, segments) )

The 6 segments from study 83411188 are: ['83411188_0000', '83411188_0001', '83411188_0002', '83411188_0003', '83411188_0004', '83411188_0005']


Note the format of the names of the files containing waveform data for each segment: record name, "_", four-digit segment number (e.g. '83411188_0000').
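The naming pattern can be reproduced with a short helper, which is a convenient way to check one's reading of it against the list printed above. This is a sketch only: the function name `segment_names` is hypothetical, and the four-digit zero padding is inferred from the six names in the output rather than from the database documentation. In practice the authoritative list is `record_data.seg_name`.

```python
def segment_names(record_name, n_segments, width=4):
    """Build segment names of the form '<record name>_<zero-padded index>'."""
    return [f"{record_name}_{i:0{width}d}" for i in range(n_segments)]


names = segment_names("83411188", 6)
print(names)
```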

---
## Inspect an individual segment
### Read the metadata for this segment
- Read the metadata from the header file

In [16]:
segment_name = segments[2] # the segment to inspect (the third in the list)
segment_metadata = wfdb.rdheader(record_name=segment_name, pn_dir=record_dir)
print("Header metadata loaded for the segment '{}' in study '{}' for subject '{}'".format(segment_name, record_name, str(Path(record_dir).parent.parts[-1])))

Header metadata loaded for the segment '83411188_0002' in study '83411188' for subject 'p10039708'


### Find out what signals are present, and for how long

In [17]:
print("This segment contains the following signals: {}".format(segment_metadata.sig_name))
print("The signals are measured in units of: {}".format(segment_metadata.units))

This segment contains the following signals: ['II', 'V', 'aVR', 'ABP', 'Pleth', 'Resp']
The signals are measured in units of: ['mV', 'mV', 'mV', 'mmHg', 'NU', 'Ohm']


See [here](https://archive.physionet.org/mimic2/mimic2_waveform_overview.shtml#signals-125-samplessecond) for definitions of signal abbreviations.

<div class="alert alert-block alert-info"> <b>Q:</b> Which of these signals is still present in segment '83411188_0000'? </div>

All signals in a segment are time-aligned, measured at the same sampling frequency, and last the same duration:

In [18]:
print("All the signals are sampled at {} Hz".format(segment_metadata.fs))
print("and they last for {:.1f} minutes".format(segment_metadata.sig_len/(segment_metadata.fs*60)) )

All the signals are sampled at 62.4725 Hz
and they last for 0.9 minutes
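Because the signals in a segment share one sampling frequency and one length, a single time axis serves all of them: sample `n` occurs `n / fs` seconds after the start of the segment. The sketch below builds that axis for the first few samples. The function name `sample_times` is hypothetical, and the sketch assumes one sample per signal at each time step, as the text above describes.

```python
def sample_times(n_samples, fs):
    """Return the time in seconds of each of the first n_samples samples."""
    if fs <= 0:
        raise ValueError("Sampling frequency must be positive")
    return [n / fs for n in range(n_samples)]


fs = 62.4725  # Hz, as reported in the segment header above
times = sample_times(5, fs)
for i, t in enumerate(times):
    print("sample {}: {:.4f} s".format(i, t))
```

Indexing every signal against the same list of times is what makes it possible to compare, for example, an ECG beat with the corresponding pulse in another signal.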
