# Data Exploration

In this tutorial we will explore data in the MIMIC Waveform Database.

The **objectives** are:
- To gain an understanding of the structure of the MIMIC Database, consisting of subjects, studies, records, and segments.
- To start using the WaveForm DataBase Toolbox to work with waveform data files.
- To understand how to find out what signals are present in a record or a segment, and how long they last.
- To gain experience in searching for records which contain the signals required for a particular study.

---
## Setup

### Specify the required Python packages
We'll import the following:
- _sys_: a module from the Python standard library
- _wfdb_: The WaveForm DataBase Toolbox package, which contains tools for processing waveform data such as those in MIMIC.
- _pathlib_ (specifically the _Path_ class from _pathlib_, which is used to handle file paths)

In [3]:
import sys
import wfdb
from pathlib import Path

Now that we have imported these packages (_i.e._ toolboxes) we have a set of tools (functions) ready to use.

<div class="alert alert-block alert-warning"> <b>Resource:</b> You can find out more about the WFDB package <a href="https://physionet.org/content/wfdb-python/3.4.1/">here</a>. </div>

### Specify the version of the MIMIC Waveform Database to use
We will use either the:
- [MIMIC III Waveform Database](https://doi.org/10.13026/c2607m), or
- MIMIC IV Waveform Database

- Specify which version of the database to use:

In [4]:
mimic_wfdb_version = 4 # either: 3 or 4

- Setup for this particular version of the database:

In [5]:
if mimic_wfdb_version == 3:
    wfdb.set_db_index_url('https://physionet.org/files') # use the main PhysioNet file index, which hosts MIMIC III.
    database_name = 'mimic3wdb/1.0' # The name of the MIMIC III Waveform Database on PhysioNet (see URL: https://physionet.org/content/mimic3wdb/1.0/)
elif mimic_wfdb_version == 4:
    wfdb.set_db_index_url('https://challenge.physionet.org/benjamin/db') # use the database index which hosts the MIMIC IV demo data.
    database_name = 'mimic4wdb/0.1.0' # The name of the MIMIC IV Waveform Demo Database on PhysioNet
print("Using MIMIC Waveform Database version: {}".format(mimic_wfdb_version))

Using MIMIC Waveform Database version: 4


---
## Identify the records in the database

### Get a list of records

- Use the [`get_record_list`](https://wfdb.readthedocs.io/en/latest/io.html#wfdb.io.get_record_list) function from the WFDB toolbox to get a list of records in the database.

In [6]:
records = []
no_records_to_load = 205
subjects = wfdb.get_record_list(database_name)
print("The '{}' database contains data from {} subjects".format(database_name, len(subjects)))
finished = False
for subject in subjects:
    if finished:
        break
    studies = wfdb.get_record_list(f'{database_name}/{subject}')
    for study in studies:
        # Note: 'and' is required here; '&' binds more tightly than '==' and would give the wrong result
        if mimic_wfdb_version == 3 and ("_" not in study): # This skips any files which don't contain waveform data (e.g. numerics files)
            continue
        records.append(Path(f'{subject}{study}'))
        if len(records) >= no_records_to_load: # stop if we've loaded enough records
            finished = True
            break
if not finished:
    print("Done: Went through entire database and loaded list of {} records for '{}' database".format(len(records), database_name))
else:
    print("Done: Went through first bit of database and loaded list of {} records for '{}' database".format(len(records), database_name))

The 'mimic4wdb/0.1.0' database contains data from 198 subjects
Done: Went through entire database and loaded list of 200 records for 'mimic4wdb/0.1.0' database


### Look at the records

- Display the first few records

In [7]:
print("First five records: {}".format(records[0:5]))
if mimic_wfdb_version == 3:
    print("\nNote the formatting of these records:\n - intermediate directory ('30' in this case)\n - subject identifier ('3000003' in this case)\n - record identifier (e.g. '3000003_0001')")
else:
    print("\nNote the formatting of these records:\n - intermediate directory ('p100' in this case)\n - subject identifier (e.g. 'p10014354')\n - record identifier (e.g. '81739927')")

First five records: [PosixPath('p100/p10014354/81739927/81739927'), PosixPath('p100/p10019003/87033314/87033314'), PosixPath('p100/p10020306/83404654/83404654'), PosixPath('p100/p10039708/83411188/83411188'), PosixPath('p100/p10039708/85583557/85583557')]

Note the formatting of these records:
 - intermediate directory ('p100' in this case)
 - subject identifier (e.g. 'p10014354')
 - record identifier (e.g. '81739927')
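
The components of a record path can also be separated programmatically. The following is a minimal sketch using `PurePosixPath` from the standard library on one of the MIMIC IV record paths printed above; the variable names are illustrative and not part of the WFDB toolbox.

```python
from pathlib import PurePosixPath

# One of the MIMIC IV record paths printed above
record = PurePosixPath('p100/p10014354/81739927/81739927')

# The first three path components are the intermediate directory,
# the subject identifier, and the record's own directory
intermediate_dir, subject_id, record_dir_name = record.parts[:3]

# The final component is the record name (the name of its header file)
record_name = record.name

print("Intermediate directory: {}".format(intermediate_dir))
print("Subject identifier: {}".format(subject_id))
print("Record name: {}".format(record_name))
```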


<div class="alert alert-block alert-info"> <b>Q:</b> Can you print the names of the last five records? <br> <b>Hint:</b> in Python, the last five elements can be specified using '[-5:]' </div>

---
## Extract metadata for a record

Each record contains metadata stored in a header file, named "\<record name\>.hea".

### Specify the online directory containing a record's data

In [8]:
if mimic_wfdb_version == 3:
    no = 0
else:
    no = 3 # specify the fourth record (noting that in Python the first index is 0)
record = records[no]
record_dir = f'{database_name}/{record.parent}'
print("Physionet directory specified for this record: {}".format(record_dir))

Physionet directory specified for this record: mimic4wdb/0.1.0/p100/p10039708/83411188


### Specify the record name

Extract the record name (e.g. '83411188') from the record (e.g. 'p100/p10039708/83411188/83411188'):

In [9]:
record_name = record.name
print("Record name: {}".format(record_name))

Record name: 83411188


### Load the metadata for this record
- Use the [`rdheader`](https://wfdb.readthedocs.io/en/latest/io.html#wfdb.io.rdheader) function from the WFDB toolbox to load metadata from the record header file

In [10]:
record_data = wfdb.rdheader(record_name, pn_dir=record_dir, rd_segments=True)
print("Done: metadata loaded for record '{}' from header file at URL: {}".format(record_name, "https://physionet.org/content/" + record_dir + "/" + record_name + ".hea"))
# NOTE "https://physionet.org/content/" won't be correct until the MIMIC-IV-Waveform is officially released

Done: metadata loaded for record '83411188' from header file at URL: https://physionet.org/content/mimic4wdb/0.1.0/p100/p10039708/83411188/83411188.hea


---
## Inspect details of physiological signals recorded in this record
- Printing a few details of the signals from the extracted metadata

In [11]:
print("- Number of signals: {}".format(record_data.n_sig))
print("- Duration: {:.1f} hours".format(record_data.sig_len/(record_data.fs*60*60))) 
# NOTE fs isn't 125 Hz
print("- Sampling frequency: {} Hz".format(record_data.fs))

- Number of signals: 6
- Duration: 14.2 hours
- Sampling frequency: 62.4725 Hz
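
The duration is obtained by dividing the number of samples by the sampling frequency (giving seconds), and then by 3600 to convert to hours. A small sketch of this arithmetic follows; the helper name `duration_hours` and the sample count are hypothetical and chosen only for illustration, whereas the sampling frequency is the one reported above.

```python
def duration_hours(sig_len, fs):
    """Convert a number of samples at sampling frequency fs (Hz) into hours."""
    return sig_len / (fs * 60 * 60)

# Hypothetical sample count, for illustration only
example_sig_len = 3_200_000
# Sampling frequency reported for the record above (Hz)
example_fs = 62.4725

hours = duration_hours(example_sig_len, example_fs)
print("- Duration: {:.1f} hours".format(hours))
```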


---
## Inspect the segments making up a record
Each record is typically made up of several segments.

### Inspect the files in a record

In [12]:
if mimic_wfdb_version == 3:
    files = wfdb.get_record_list(str(Path(record_dir)))
    print("Done: Loaded list of {} records for subject '{}'".format(len(files), str(Path(record_dir).parts[-1])))
else:
    files = wfdb.get_record_list(str(Path(record_dir).parent))
    print("Done: Loaded list of {} records for subject '{}'".format(len(files), str(Path(record_dir).parent.parts[-1])))

Done: Loaded list of 2 records for subject 'p10039708'


### Inspect the contents of the first two files

In [13]:
if mimic_wfdb_version == 3:
    print("The first file is: '{}', which provides metadata for the whole record".format(files[0]) )
    print("The second file is: '{}', which contains numerics data".format(files[1]) )    
else:
    print("The first file contains the first record: '{}'".format(files[0]) )
    print("The second file contains the second record: '{}'".format(files[1]) )
    print("where the number before the / is the record number and the number after is for the multi-segment header file. This multi-segment header file provides general information about the record.")

The first file contains the first record: '83411188/83411188'
The second file contains the second record: '85583557/85583557'
where the number before the / is the record number and the number after is for the multi-segment header file. This multi-segment header file provides general information about the record.


### Inspect the segments within a record

In [14]:
# --- this line removes non-waveform data from MIMIC III files
# (a new list is built, because removing items from a list while looping over it can skip entries)
segments = [seg for seg in record_data.seg_name if ("~" not in seg) and ("layout" not in seg)]
# ---
print("The {} segments from record {} are: {}".format(len(segments), record_name, segments) )

The 6 segments from record 83411188 are: ['83411188_0000', '83411188_0001', '83411188_0002', '83411188_0003', '83411188_0004', '83411188_0005']


Note the format of the names of the files containing waveform data for each segment: record name, "_", segment number (e.g. '83411188_0000').
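
When filtering a list of segment names, note that removing items from a list while looping over it can skip entries, so some unwanted names may survive. The sketch below compares that approach with building a new list; the segment names are hypothetical MIMIC III-style examples, not taken from the database.

```python
# Hypothetical MIMIC III-style segment names, for illustration only
seg_names = ['3000003_layout', '~', '~', '3000003_0001']

# Removing while iterating: after each removal the remaining items shift
# left, so the loop skips the item that moved into the vacated position
buggy = list(seg_names)
for name in buggy:
    if ("~" in name) or ("layout" in name):
        buggy.remove(name)

# Building a new list examines every name exactly once
clean = [name for name in seg_names
         if ("~" not in name) and ("layout" not in name)]

print("Removing while iterating: {}".format(buggy))
print("Building a new list: {}".format(clean))
```

In this example one '~' entry is left behind by the first approach, whereas the second approach keeps only the waveform segment.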

---
## Inspect an individual segment
### Read the metadata for this segment
- Read the metadata from the header file

In [15]:
segment_name = segments[2] # specify the third segment (noting that in Python the first index is 0)
segment_metadata = wfdb.rdheader(record_name=segment_name, pn_dir=record_dir)
print("Header metadata loaded for the segment '{}' in study '{}' for subject '{}'".format(segment_name, record_name, str(Path(record_dir).parent.parts[-1])))

Header metadata loaded for the segment '83411188_0002' in study '83411188' for subject 'p10039708'


### Find out what signals are present

In [16]:
print("This segment contains the following signals: {}".format(segment_metadata.sig_name))
print("The signals are measured in units of: {}".format(segment_metadata.units))

This segment contains the following signals: ['II', 'V', 'aVR', 'ABP', 'Pleth', 'Resp']
The signals are measured in units of: ['mV', 'mV', 'mV', 'mmHg', 'NU', 'Ohm']
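
The signal names and units are returned as two lists in the same order, so they can be paired up. The following sketch does this with `zip`, using the two lists printed above; the name `units_by_signal` is illustrative.

```python
# The signal names and units printed above, in matching order
sig_names = ['II', 'V', 'aVR', 'ABP', 'Pleth', 'Resp']
units = ['mV', 'mV', 'mV', 'mmHg', 'NU', 'Ohm']

# Pair each signal name with its unit
units_by_signal = dict(zip(sig_names, units))

for name, unit in units_by_signal.items():
    print("{}: {}".format(name, unit))
```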


See [here](https://archive.physionet.org/mimic2/mimic2_waveform_overview.shtml#signals-125-samplessecond) for definitions of signal abbreviations.

<div class="alert alert-block alert-info"> <b>Q:</b> Which of these signals is still present in segment '3000003_0014' (for MIMIC III) or segment '83411188_0000' (for MIMIC IV)? </div>

### Find out how long each signal lasts

All signals in a segment are time-aligned, measured at the same sampling frequency, and last the same duration:

In [17]:
print("All the signals are sampled at {} Hz".format(segment_metadata.fs))
print("and they last for {:.1f} minutes".format(segment_metadata.sig_len/(segment_metadata.fs*60)) )

All the signals are sampled at 62.4725 Hz
and they last for 0.9 minutes


## Identify records suitable for analysis

The signals available and their durations vary from one record (and segment) to the next. Since most studies require specific types of signals (e.g. blood pressure and photoplethysmography signals), it is important to identify which records (or segments) contain the required signals for the required duration.

### Setup

In [36]:
import pandas as pd
from pprint import pprint

In [22]:
print("Earlier, we loaded a list of {} records for '{}' database".format(len(records), database_name))

Earlier, we loaded a list of 200 records for 'mimic4wdb/0.1.0' database


### Specify requirements

- Required signals

In [27]:
required_sigs = ['ABP', 'Pleth']

- Required duration

In [28]:
req_seg_duration = 10*60  # converting from minutes to seconds
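
The two requirements can be expressed as a single check on a segment's metadata. The sketch below is one possible way to write it; the function name `segment_meets_requirements` and the sample counts are hypothetical, and the signal names and sampling frequency are those of the segment inspected earlier.

```python
def segment_meets_requirements(sig_names, sig_len, fs, required_sigs, min_duration_s):
    """Return True if a segment is long enough and contains all required signals."""
    duration_s = sig_len / fs  # number of samples / sampling frequency = seconds
    long_enough = duration_s >= min_duration_s
    has_signals = set(required_sigs).issubset(sig_names)
    return long_enough and has_signals

required_sigs = ['ABP', 'Pleth']
req_seg_duration = 10*60  # 10 minutes, in seconds

# Signal names and sampling frequency of the segment inspected earlier
sig_names = ['II', 'V', 'aVR', 'ABP', 'Pleth', 'Resp']
fs = 62.4725

# Hypothetical sample counts: about 10.7 minutes and about 0.8 minutes
long_ok = segment_meets_requirements(sig_names, 40000, fs, required_sigs, req_seg_duration)
too_short = segment_meets_requirements(sig_names, 3000, fs, required_sigs, req_seg_duration)
# A long segment which lacks the ABP signal
missing_sig = segment_meets_requirements(['II', 'Pleth'], 40000, fs, required_sigs, req_seg_duration)

print(long_ok, too_short, missing_sig)
```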

### Find out how many records meet the requirements

In [44]:
matching_recs = {'dir':[],'seg_name':[],'length':[]}
for record in records:
    print('Record: {}'.format(record), end="", flush=True)
    record_dir = f'{database_name}/{record.parent}'
    record_name = record.name
    print(' (reading data)')
    record_data = wfdb.rdheader(record_name, pn_dir=record_dir, rd_segments=True)
    # Get the segments for the record
    segments = record_data.seg_name
    # First check whether each segment is long enough; if not, move on to the next one
    gen = (segment for segment in segments if segment != '~')
    for segment in gen:
        print(' - Segment: {}'.format(segment), end="", flush=True)
        segment_metadata = wfdb.rdheader(record_name=segment, pn_dir=record_dir)
        seg_length = segment_metadata.sig_len/(segment_metadata.fs)
        if seg_length < req_seg_duration:
            print(' (too short at {:.1f} mins)'.format(seg_length/60))
            continue
        # Next check that all required signals are present in the segment
        sigs_present = segment_metadata.sig_name
        
        if all(x in sigs_present for x in required_sigs):
            matching_recs['dir'].append(record_dir)
            matching_recs['seg_name'].append(segment)
            matching_recs['length'].append(seg_length)
            print(' (met requirements)')
            # Since we only need one segment per record break out of loop
            break
        else:
            print(' (long enough, but missing signal(s))')

print("A total of {} records met the requirements".format(len(matching_recs['dir'])))
# Optionally, save the matching records to a CSV file:
#df_matching_recs = pd.DataFrame(data=matching_recs)
#df_matching_recs.to_csv('matching_records.csv', index=False)

Record: p100/p10014354/81739927/81739927 (reading data)
 - Segment: 81739927_0000 (too short at 0.0 mins)
 - Segment: 81739927_0001 (too short at 0.1 mins)
 - Segment: 81739927_0002 (too short at 0.9 mins)
 - Segment: 81739927_0003 (too short at 0.1 mins)
 - Segment: 81739927_0004 (too short at 0.0 mins)
 - Segment: 81739927_0005 (too short at 0.5 mins)
 - Segment: 81739927_0006 (too short at 0.1 mins)
 - Segment: 81739927_0007

KeyboardInterrupt: 
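
Once a search has completed, the `matching_recs` dictionary can be written to a CSV file, as the commented-out lines in the cell above suggest. The following is a minimal sketch which uses the standard library `csv` module instead of pandas; the entries in `matching_recs` are placeholder values for illustration only (not results from the database), and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Placeholder entries, for illustration only (not real search results)
matching_recs = {
    'dir': ['example_db/p000/p00000001/00000001'],
    'seg_name': ['00000001_0003'],
    'length': [725.0],
}

# An in-memory buffer is used here; to write to disk, replace it with
# open('matching_records.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(matching_recs.keys())            # header row
writer.writerows(zip(*matching_recs.values()))   # one row per matching segment

csv_text = buffer.getvalue()
print(csv_text)
```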