## Table of Contents
- Section 1: Introduction
- Section 2: Methodology
    - Section 2.i: Obtain Data
        - Section 2.i.a: Breath Tables
        - Section 2.i.b: Tidal breathing flow-loop loop (TBFVL) Table
        - Section 2.i.c: SPX Table
        - Section 2.i.d: Screenshot Table
    - Section 2.ii: Organize Data
        - Section 2.ii.a: Clean and combine REDCap
        - Section 2.ii.b: Link raw data
        - Section 2.ii.c: Split data set
    - Section 2.iii: Preprocess Data
    - Section 2.iv: Load and Process Data
    - Section 2.v: Model Training
- Section 3: Results
- Section 4: Discussion

## Section 1: Introduction<a name="section-1-introduction"></a>

Current spirometry values, such forced expiratory volume in 1 second (FEV<sub>1</sub>), are used to quantitatively assess the disease progression of restrictive diseases such as pulmonary fibrosis. FEV<sub>1</sub> is best suited for detecting changes in large airways and may not be sensitive enough to adequately detect peripheral changes in other lung diseases such as early stage cystic fibrosis.

The lung clearance index (LCI), a ratio of the cumulative expired volume to the functional residual capacity, has been proposed as an alternative metric to the current spirometry paradigm. The LCI can be thought of as the number of times the resting breath volume has to be turned over in order to remove a tracer gas from the lungs. If an individual has ventilation inhomogeneity due to lung obstructions, such as those associated with cystic fibrosis, a higher number of breaths and associated volume will be required to remove the tracer gas from the lung.

The LCI is obtained through the [multiple breath inert gas washout (MBW)](http://www.mbwtraining.com/ECFS_MBW_SOP.pdf). Currently, the LCI is a research metric and not a clinical metric. Industry research studies utilizing MBW usually involve a third party to review the MBW trials to determine if the MBW trial is research quality. The North American, European, and Australian MBW central over-reading centres (CORCs) exist to formally review the MBW data. The raw MBW signals (Figure 1; CO<sub>2</sub>, O<sub>2</sub>, N<sub>2</sub>, flow, volume) are quality controlled for the following critera:
- No leaks
- No coughing, laughing or talking
- No signal misalignment
- Adequate time between trials
- Clear end of test
- No abnormal tidal breathing patterns    

For a more complete discussion of the criteria of a success trial, please refer to the [Central Overreading Centre's training documentation](https://lab.research.sickkids.ca/ratjen/wp-content/uploads/sites/39/2017/09/MBWN2-Training-and-Qualification-Requirements_Aug-30-2017.pdf). There is potential for automation since reviewing MBW data can be labour intensive. The primary purpose of this project was to classify trial outcome ('accept' or 'reject') with trial grades ('A', 'B', 'C', 'D', 'E', 'F', 'N/A') as a secondary outcome, where a trial grade of 'A', 'B', or 'C' corresponds to an 'accept' outcome.

<p align="center">
  <img src="assets/mbw_example.png">
</p>
<p align="center">
  Figure 1. Example of raw multiple breath inert gas washout signals. From top to bottom, CO<sub>2</sub>, O<sub>2</sub>, N<sub>2</sub>, flow, volume.
</p>

## Section 2: Methodology<a name="section-2-methodology"></a>

### Section 2.i: Obtain Data<a name="section-2i-data"></a>
Data for this project was derived from two studies involving MBW: TRACK, and LONGITUDINAL. As part of those studies, data was collected and processed in Spiroware version 3.2 or earlier with each individual trial being given a grade and an outcome ('accept' or 'reject') by trained CORC over-readers. The end goal of this section is to describe how the data for this project was obtained.

In [1]:
import pandas as pd
import os
import numpy as np

### Section 2.i.a: Breath and Tidal breathing flow-volume loop (TBFVL) Tables<a name="section-2ia-breath-tbfvl"></a>

The Breath Tables provide summary information associated with a single breath such as inspiratory and expiratory volume. The TBFVL derives metrics, such as respiratory quotient, using a flow volume loop during tidal breathing. The Breath and TBFVL tables were previously obtained from the Spiroware software as part of the original study analysis.

### Section 2.i.b: SPX tables<a name="section-2ib-spx-tables"></a>
Like the Breath and TBFVL tables, the SPX tables were obtained from the Spiroware software. Unlike the Breath and TBFVL tables, the SPX tables contained summary information obtained at the conclusion of the trial (i.e. LCI). In contrast, the Breath and TBFVL tables report data collected during the trial. Because of how the data was exported, the raw SPX files contains data from multiple subjects and trials. The raw SPX data needed to be split so a single subject-trial combination was contained in a single file.

In [2]:
TRIAL_EXPORT_PATH = "../data/external/Primary Training Data Set/SPX Exports/"
SPX_PATH = "../data/raw/spx_exports/"

In [3]:
for root, _, files in os.walk(TRIAL_EXPORT_PATH):
    for file in files:
        # read data file with combined subject, trials
        if file.endswith('N2MultiBreathWashoutTest_TrialData.csv'):
            # all files have a semi-colon separator except for 'IND 241-259'
            if 'IND 241-259' in os.path.join(root, file):
                spx_export = pd.read_csv(os.path.join(root, file))
            else:
                spx_export = pd.read_csv(
                    os.path.join(root, file), sep=';', skiprows=1
                )
            
            # split into individual subject-trial combination and save
            for patient_trial in zip(
                spx_export['Patient-ID'], spx_export['Trial #']
            ):
                patient, trial = patient_trial

                patient_trial_spx = spx_export[
                    (spx_export['Patient-ID'] == patient)
                    & ((spx_export['Trial #'] == trial)) 
                ]

                # quality check to determine if data is missing
                if patient_trial_spx.shape[0] == 1:
                    patient_trial_spx.to_csv(
                        '{}{}_trial_{}_spx.csv'.format(
                            SPX_PATH, str(patient), str(trial)
                        ),
                        index=False
                    )
                else:
                    print(patient_trial_spx.head())

### Section 2.i.c: Screenshot tables<a name="section-2ic-screenshot-tables"></a>


The raw spirometry signals (N2, O2, CO2, volume, flow) is effectively the same as the Breath and TBFVL data but with a higher granularity. Obtaining the data from the raw spirometry signals was done in three steps:
1. Taking screenshots of the raw spirometry figure in Spiroware  
    An image of the Spiroware data was collected using the script `mbw_qc/mbw_qc/data/1-spiroware_screenshot.py`. An example of the script in operation is given in the video below.
2. Confirm screenshots integrity  
    A second python script, `mbw_qc/mbw_qc/data/2-confirm_screenshot.py`, was run on the captured screenshots to checking the integrity of the screenshots. The script worked by confirming the location and wording of the vertical axis. Any issues found with this script resulting in rerunning the `mbw_qc/mbw_qc/data/1-spiroware_screenshot.py` script on the trials with issues.
3. Digitize the screenshots  
    If no additional issues were identified, the script `mbw_qc/mbw_qc/data/3-digitize_screenshot.py` was used to convert the screenshot to numeric values. The script uses a modified version of [plotdigitizer](https://pypi.org/project/plotdigitizer/).

<p align="center">
  <video width='640' height='480' controls src='../notebooks/assets/spiroware_screenshots.mp4'>animation</video>
</p>


Once the screenshots were digitized, outlier values were checked to determine if there were any oustanding issues with the previous two scripts. Because of the nature of the task, it was not possible to test the functions with a testing suite such as `pytest`.

In [4]:
# outlier values to manually check
min_max ={
    'o2': {'l_bound': -5, 'u_bound': 105}, 
    'co2': {'l_bound': -5, 'u_bound': 105}, 
    'n2': {'l_bound': -5, 'u_bound': 105}, 
    'flow':{'l_bound': -4000, 'u_bound': 4000}, 
    'volume':{'l_bound': -4000, 'u_bound': 4000}
}

for table_type in ['o2', 'co2', 'n2', 'flow', 'volume']:
    tables_list = []
    # get path of all txt files in the breath tables folder
    for root, dirs, files in os.walk('../data/raw/digitize_screenshots'):
        for file in files:
            if file.endswith('_{}.csv'.format(table_type)):
                tables_list.append(os.path.join(root, file))

    for single_table in tables_list:
        try:
            table_temp = pd.read_csv(
                single_table, header=None, delim_whitespace=True
            )

            if table_temp.iloc[:, 1].max() > min_max[table_type]['u_bound']:
                print(single_table)
            if table_temp.iloc[:, 1].min() < min_max[table_type]['l_bound']:
                print(single_table)
        except pd.errors.EmptyDataError:
            pass

../data/raw/digitize_screenshots\216.6_trial_4_o2.csv
../data/raw/digitize_screenshots\241_25_trial_3_o2.csv
../data/raw/digitize_screenshots\245_29_trial_4_o2.csv
../data/raw/digitize_screenshots\313.2_trial_1_o2.csv
../data/raw/digitize_screenshots\LSC02_22_trial_3_o2.csv
../data/raw/digitize_screenshots\LSC02_28_trial_5_o2.csv
../data/raw/digitize_screenshots\LSC46_25_trial_1_o2.csv
../data/raw/digitize_screenshots\LSC49_26_trial_1_o2.csv
../data/raw/digitize_screenshots\LSC59_29_trial_1_o2.csv
../data/raw/digitize_screenshots\LSC68_26_trial_3_o2.csv
../data/raw/digitize_screenshots\203_24_trial_6_n2.csv
../data/raw/digitize_screenshots\205_24_trial_3_n2.csv
../data/raw/digitize_screenshots\205_30_trial_2_n2.csv
../data/raw/digitize_screenshots\206_21_trial_2_n2.csv
../data/raw/digitize_screenshots\231_22_trial_6_n2.csv
../data/raw/digitize_screenshots\242_25_trial_5_n2.csv
../data/raw/digitize_screenshots\245_29_trial_4_n2.csv
../data/raw/digitize_screenshots\LSC02_22_trial_3_n2.cs

The digitized values with outliers (above) from `mbw_qc/mbw_qc/data/3-digitize_screenshot.py` were manually compared to the screenshots obtained from `mbw_qc/mbw_qc/data/1-spiroware_screenshot.py`. Only digitized screenshots that differed from the source screenshot were futher investigated. Most of the issues occured because the test was longer than normal or there was a higher than normal breath frequency. This resulted in an occlusion of the axis which could not be identired by the `mbw_qc/mbw_qc/data/3-digitize_screenshot.py` script. The relevant issues were placed in the `data/raw/digitize_screenshots_manual` folder and manually corrected.

In [5]:
import shutil

screenshot_issues = [
    '../data/raw/spiroware_screenshots/216.6_trial_4_o2.png',
    '../data/raw/spiroware_screenshots/241_25_trial_3_o2.png',
    '../data/raw/spiroware_screenshots/245_29_trial_4_o2.png',
    '../data/raw/spiroware_screenshots/313.2_trial_1_o2.png',
    '../data/raw/spiroware_screenshots/LSC02_22_trial_3_o2.png',
    '../data/raw/spiroware_screenshots/LSC02_28_trial_5_o2.png',
    '../data/raw/spiroware_screenshots/LSC46_25_trial_1_o2.png',
    '../data/raw/spiroware_screenshots/LSC49_26_trial_1_o2.png',
    '../data/raw/spiroware_screenshots/LSC59_29_trial_1_o2.png',
    '../data/raw/spiroware_screenshots/LSC68_26_trial_3_o2.png',
    '../data/raw/spiroware_screenshots/203_24_trial_6_n2.png',
    '../data/raw/spiroware_screenshots/205_24_trial_3_n2.png',
    '../data/raw/spiroware_screenshots/205_30_trial_2_n2.png',
    '../data/raw/spiroware_screenshots/206_21_trial_2_n2.png',
    '../data/raw/spiroware_screenshots/231_22_trial_6_n2.png',
    '../data/raw/spiroware_screenshots/242_25_trial_5_n2.png',
    '../data/raw/spiroware_screenshots/245_29_trial_4_n2.png',
    '../data/raw/spiroware_screenshots/LSC02_22_trial_3_n2.png',
    '../data/raw/spiroware_screenshots/LSC02_28_trial_5_n2.png',
    '../data/raw/spiroware_screenshots/LSC03_30_trial_6_n2.png',
    '../data/raw/spiroware_screenshots/LSC09_27_trial_1_n2.png',
    '../data/raw/spiroware_screenshots/LSC12_29_trial_4_n2.png',
    '../data/raw/spiroware_screenshots/LSC18_23_trial_2_n2.png',
    '../data/raw/spiroware_screenshots/LSC18_23_trial_6_n2.png',
    '../data/raw/spiroware_screenshots/LSC41_22_trial_2_n2.png',
    '../data/raw/spiroware_screenshots/LSC41_22_trial_4_n2.png',
    '../data/raw/spiroware_screenshots/LSC41_30_trial_3_n2.png',
    '../data/raw/spiroware_screenshots/LSC43_28_trial_3_n2.png',
    '../data/raw/spiroware_screenshots/LSC45_21_trial_2_n2.png',
    '../data/raw/spiroware_screenshots/LSC45_21_trial_3_n2.png',
    '../data/raw/spiroware_screenshots/LSC45_23_trial_1_n2.png',
    '../data/raw/spiroware_screenshots/LSC49_26_trial_1_n2.png',
    '../data/raw/spiroware_screenshots/LSC49_26_trial_2_n2.png',
    '../data/raw/spiroware_screenshots/LSC49_27_trial_3_n2.png',
    '../data/raw/spiroware_screenshots/LSC49_28_trial_1_n2.png',
    '../data/raw/spiroware_screenshots/LSC53_30_trial_2_n2.png',
    '../data/raw/spiroware_screenshots/LSC56_25_trial_4_n2.png',
    '../data/raw/spiroware_screenshots/LSC59_29_trial_1_n2.png',
    '../data/raw/spiroware_screenshots/LSC59_29_trial_3_n2.png',
    '../data/raw/spiroware_screenshots/LSC59_29_trial_5_n2.png',
    '../data/raw/spiroware_screenshots/LSC61_22_trial_4_n2.png',
    '../data/raw/spiroware_screenshots/LSC61_31_trial_5_n2.png',
    '../data/raw/spiroware_screenshots/LSC68_26_trial_3_n2.png',
    '../data/raw/spiroware_screenshots/LSC77_21_trial_2_n2.png',
    '../data/raw/spiroware_screenshots/LSC78_27_trial_2_n2.png',
    '../data/raw/spiroware_screenshots/SPX_15_LSC03_1month_20130826_trial_5_n2.png',
    '../data/raw/spiroware_screenshots/202_31_trial_5_flow.png',
    '../data/raw/spiroware_screenshots/202_31_trial_5_flow.png',
    '../data/raw/spiroware_screenshots/213_26_trial_5_flow.png',
    '../data/raw/spiroware_screenshots/213_27_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/213_27_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/213_27_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/213_27_trial_4_flow.png',
    '../data/raw/spiroware_screenshots/213_27_trial_5_flow.png',
    '../data/raw/spiroware_screenshots/213_27_trial_6_flow.png',
    '../data/raw/spiroware_screenshots/215_25_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/215_31_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/215_31_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/215_31_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/215_31_trial_4_flow.png',
    '../data/raw/spiroware_screenshots/215_31_trial_5_flow.png',
    '../data/raw/spiroware_screenshots/215_31_trial_6_flow.png',
    '../data/raw/spiroware_screenshots/216_26_trial_12_flow.png',
    '../data/raw/spiroware_screenshots/256_27_trial_8_flow.png',
    '../data/raw/spiroware_screenshots/LSC01_25_trial_7_flow.png',
    '../data/raw/spiroware_screenshots/LSC07_21_trial_9_flow.png',
    '../data/raw/spiroware_screenshots/LSC07_26_trial_11_flow.png',
    '../data/raw/spiroware_screenshots/LSC07_26_trial_6_flow.png',
    '../data/raw/spiroware_screenshots/LSC12.0.9_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSC12.0.9_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSC18_22_trial_5_flow.png',
    '../data/raw/spiroware_screenshots/LSC34_24_1_trial_4_flow.png',
    '../data/raw/spiroware_screenshots/LSC34_30_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_21_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_21_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_21_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_21_trial_4_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_21_trial_5_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_21_trial_6_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_22_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_22_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_22_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_22_trial_4_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_23_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_23_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_23_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_23_trial_4_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_23_trial_5_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_24_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_24_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_24_trial_4_flow.png',
    '../data/raw/spiroware_screenshots/LSC41_24_trial_5_flow.png',
    '../data/raw/spiroware_screenshots/LSC44_25_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC44_25_trial_4_flow.png',
    '../data/raw/spiroware_screenshots/LSC44_25_trial_6_flow.png',
    '../data/raw/spiroware_screenshots/LSC44_25_trial_7_flow.png',
    '../data/raw/spiroware_screenshots/LSC47_24_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC49_21_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC49_21_trial_4_flow.png',
    '../data/raw/spiroware_screenshots/LSC50_23_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC50_23_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC50_23_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSC50_25_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC50_25_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC50_25_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSC52_24_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC52_24_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC52_24_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSC53_25_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC53_25_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC53_25_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSC53_26_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC53_26_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC53_26_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSC53_26_trial_4_flow.png',
    '../data/raw/spiroware_screenshots/LSC54_27_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC56_25_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC56_25_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC56_25_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSC56_25_trial_4_flow.png',
    '../data/raw/spiroware_screenshots/LSC56_27_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC56_27_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC56_27_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSC64_26_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC64_26_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSC64_26_trial_4_flow.png',
    '../data/raw/spiroware_screenshots/LSC68_23_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC68_23_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC68_23_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSC68_24_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC68_24_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC68_24_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSC68_24_trial_4_flow.png',
    '../data/raw/spiroware_screenshots/LSC69_23_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC69_23_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSC72_26_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSC72_26_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSC72_26_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSH09_24_trial_9_flow.png',
    '../data/raw/spiroware_screenshots/LSH19_29_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/LSH22_29_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSH31_24_2_trial_5_flow.png',
    '../data/raw/spiroware_screenshots/LSH31_28_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/LSH33_24_1_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSH33_26_trial_10_flow.png',
    '../data/raw/spiroware_screenshots/LSH33_26_trial_6_flow.png',
    '../data/raw/spiroware_screenshots/LSH33_26_trial_8_flow.png',
    '../data/raw/spiroware_screenshots/LSH40_24_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSH40_24_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/LSH40_27_trial_3_flow.png',
    '../data/raw/spiroware_screenshots/SPX_164_LSC08_3month_20130823_trial_2_flow.png',
    '../data/raw/spiroware_screenshots/SPX_165_LSC20_6month_20140401_trial_1_flow.png',
    '../data/raw/spiroware_screenshots/SPX_313_LSH24_6month_20150102_trial_11_flow.png',
    '../data/raw/spiroware_screenshots/212_21_trial_6_volume.png',
    '../data/raw/spiroware_screenshots/213_27_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/215_31_trial_6_volume.png',
    '../data/raw/spiroware_screenshots/232.5_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/238_25_trial_12_volume.png',
    '../data/raw/spiroware_screenshots/242_22_2_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/242_22_2_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/242_22_2_trial_4_volume.png',
    '../data/raw/spiroware_screenshots/245_24_trial_4_volume.png',
    '../data/raw/spiroware_screenshots/256.1_trial_6_volume.png',
    '../data/raw/spiroware_screenshots/259_21_trial_10_volume.png',
    '../data/raw/spiroware_screenshots/LSC10_24_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC16_28_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC28_29_trial_4_volume.png',
    '../data/raw/spiroware_screenshots/LSC29_25_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC35_21_trial_6_volume.png',
    '../data/raw/spiroware_screenshots/LSC38_21_trial_5_volume.png',
    '../data/raw/spiroware_screenshots/LSC41_23_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC41_23_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC41_23_trial_4_volume.png',
    '../data/raw/spiroware_screenshots/LSC41_23_trial_5_volume.png',
    '../data/raw/spiroware_screenshots/LSC41_30_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC44_21_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC44_21_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC44_21_trial_5_volume.png',
    '../data/raw/spiroware_screenshots/LSC44_21_trial_7_volume.png',
    '../data/raw/spiroware_screenshots/LSC44_21_trial_8_volume.png',
    '../data/raw/spiroware_screenshots/LSC44_25_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC44_25_trial_4_volume.png',
    '../data/raw/spiroware_screenshots/LSC44_25_trial_6_volume.png',
    '../data/raw/spiroware_screenshots/LSC45_24_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC46_21_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC46_21_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC47_24_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC48_25_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC48_25_trial_5_volume.png',
    '../data/raw/spiroware_screenshots/LSC48_26_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC48_26_trial_5_volume.png',
    '../data/raw/spiroware_screenshots/LSC48_26_trial_6_volume.png',
    '../data/raw/spiroware_screenshots/LSC50_29_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC50_29_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC51_29_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC51_29_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC53_23_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC53_23_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC53_23_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC53_24_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC53_25_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC53_25_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC53_25_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC53_27_trial_4_volume.png',
    '../data/raw/spiroware_screenshots/LSC53_28_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC53_28_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC53_28_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC54_22_trial_4_volume.png',
    '../data/raw/spiroware_screenshots/LSC54_27_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC56_25_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC56_25_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC56_25_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC56_26_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC59_24_trial_7_volume.png',
    '../data/raw/spiroware_screenshots/LSC59_28_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC59_28_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC59_29_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC59_29_trial_4_volume.png',
    '../data/raw/spiroware_screenshots/LSC60_23_trial_4_volume.png',
    '../data/raw/spiroware_screenshots/LSC60_25_trial_8_volume.png',
    '../data/raw/spiroware_screenshots/LSC60_30_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC60_30_trial_4_volume.png',
    '../data/raw/spiroware_screenshots/LSC64_26_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC64_26_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC64_26_trial_4_volume.png',
    '../data/raw/spiroware_screenshots/LSC67_29_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC67_29_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC68_22_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC68_28_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC69_24_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC69_24_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC69_24_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC69_24_trial_4_volume.png',
    '../data/raw/spiroware_screenshots/LSC72_21_trial_5_volume.png',
    '../data/raw/spiroware_screenshots/LSC72_26_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC72_26_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC72_26_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC73_25_trial_4_volume.png',
    '../data/raw/spiroware_screenshots/LSC74_28_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC75_22_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC75_22_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC75_26_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC75_26_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC75_30_trial_1_volume.png',
    '../data/raw/spiroware_screenshots/LSC76_21_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC79_23_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC79_28_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/LSC79_28_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSC79_29_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSH37_21_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSH40_23_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/LSH40_27_trial_3_volume.png',
    '../data/raw/spiroware_screenshots/SPX_14_LSC10_3month_20130924_trial_5_volume.png',
    '../data/raw/spiroware_screenshots/SPX_164_LSC08_3month_20130823_trial_2_volume.png',
    '../data/raw/spiroware_screenshots/SPX_164_LSC08_3month_20130823_trial_6_volume.png',
    '../data/raw/spiroware_screenshots/SPX_164_LSC08_3month_20130823_trial_7_volume.png',
    '../data/raw/spiroware_screenshots/SPX_164_LSC08_3month_20130823_trial_8_volume.png',
    '../data/raw/spiroware_screenshots/SPX_196_LSH02_3month_20131220_trial_7_volume.png',
    '../data/raw/spiroware_screenshots/SPX_263_LSH12_1month_20140112_trial_8_volume.png',

]

for screenshot_issue in screenshot_issues:
    shutil.copy(screenshot_issue, '../data/raw/spiroware_screenshots_manual')

A typical output looked like Figure 2 after successfully digitizing.

<p align="center">
  <img src="assets/digitize_example.png">
</p>
<p align="center">
  Figure 2. Typical digitization (right column) of the Spiroware screenshot (left column). From top to bottom, CO<sub>2</sub>, O<sub>2</sub>, N<sub>2</sub>, flow, volume.
</p>




### Section 2.ii: Organize Files<a name="section-2ii-data"></a>
After all the raw data was obtained, it was necessary to link the associated feature data with the correct trial data from REDCap. The REDCap data contains the grade and outcome (i.e. labels used for training). The end goal of this section was to create a dataframe where each row represents an individual trial and the associated path of the feature file.

#### Section 2.ii.a: Clean and combine REDCap<a name="section-2iia-clean-rc"></a>
Since the data came from more than one study, it was necessary to combine the REDCap data into a single data set.

In [6]:
import re
import collections

In [7]:
LONGITUDINAL_REDCAP_PATH = (
    '../data/external/Primary Training Data Set/PS LONGITUDINAL DATA SET/',
    'RedCap Quality Control Export/longitudinal_redcap_qc-20AUG2021.csv'
)
TRACK_REDCAP_PATH = (
    '../data/external/Primary Training Data Set/TRACK DATA SET/',
    'Track RedCap QC Export/Track RedCap QC Export/track_redcap_qc-16JUL2021.csv'
)

In [8]:
# Load LONGITUDINAL data and manipulate
longitudinal_redcap_qc = pd.read_csv(LONGITUDINAL_REDCAP_PATH)
longitudinal_redcap_qc = longitudinal_redcap_qc[[
    'studyno_rev', 'test_occasion', 'spx_filename', 'trial', 
    'trial_accepted', 'trial_accepted_label', 'qc_grade', 'qc_grade_label'
]]
longitudinal_redcap_qc.rename(columns={'studyno_rev': 'id'}, inplace=True)
longitudinal_redcap_qc.loc[
    longitudinal_redcap_qc['qc_grade'] == 7, 'qc_grade_label'
] = 'N/A'

# Load the TRACK data and manipulate
track_redcap_qc = pd.read_csv(TRACK_REDCAP_PATH)
track_redcap_qc = track_redcap_qc[[
    'id', 'trial', 'trial_accepted', 'trial_accepted_label', 'qc_grade', 
    'qc_grade_label'
]]

# TRACK file is missing 'spx_filename' and 'test_occasion'
# 'id' is equivalent to 'spx_filename'; incorporates 'test_occasion' info
track_redcap_qc['spx_filename'] = track_redcap_qc['id']

# only a few TRACK subjects have more than one test occasion; will have an 
# appended value (i.e. '_1' or '_2') to indicate test occasion
track_redcap_qc['id'] = (
    track_redcap_qc['spx_filename'].str.replace('_[12]$', '', regex=True)
)
track_redcap_qc['test_occasion'] = (
    track_redcap_qc['spx_filename'].str.replace('^[^_]+_[^_]+_*', '', regex=True)
)
track_redcap_qc.loc[
    track_redcap_qc['test_occasion'] == '', 'test_occasion'
] = '1'


Some TRACK instances were 'Rejected' but were not associated with a grade. The original data was reviewed to confirm the 'Rejected' instances were 'N/A' grades.


In [9]:
ORIG_TRACK_REDCAP_PATH = '../data/external/Raw REDCap QC/TRACK_QC.csv'

In [10]:
orig_track_redcap_qc = pd.read_csv(ORIG_TRACK_REDCAP_PATH)

aborted_col_name = []
for i in range(1, 20+1):
    old_col = 'not_accepted_why{}___7'.format(str(i))
    new_col = 'not_accepted_why___7{}'.format(str(i))
    aborted_col_name.append(new_col)
    orig_track_redcap_qc.rename(columns={old_col: new_col}, inplace=True)

aborted_col_name.append('visit_id')
track_aborted = pd.wide_to_long(
    orig_track_redcap_qc[
        orig_track_redcap_qc.columns.intersection(aborted_col_name)
    ], 
    ['not_accepted_why___7'], i='visit_id', j='trial'
)

track_aborted = track_aborted.reset_index(level=['visit_id', 'trial'])
track_aborted['id_trial'] = (
    'track_id_{}_trial_{}'.format(
        track_aborted['visit_id'], track_aborted['trial'].astype('str')
    )
)
track_aborted_id_trial = track_aborted.loc[
    track_aborted['not_accepted_why___7'] == 1, 'id_trial'
].tolist()

# combine the original and the summary data together
track_redcap_qc['id_trial'] = (
    'track_id_{}_trial_{}'.format(
        track_redcap_qc['id'], track_redcap_qc['trial'].astype('str')
    )
)
track_redcap_qc.loc[
   track_redcap_qc['id_trial'].isin(track_aborted_id_trial)
   & track_redcap_qc['qc_grade_label'].isna(),
   'qc_grade_label'
] = 'N/A'
track_redcap_qc.drop(['id_trial'], axis=1, inplace=True)

At this point, the TRACK and LONGITUDINAL REDCap data are in the same format. The data sets were then combined and modified. 

The grades 'A/B' and 'C' should correspond to an 'Accepted' decision while the grades 'D', 'E', 'F', and 'N/A' should correspond to a 'Rejected' trial. The grades and trial decision was checked to confirm consistency between trials and tests (a test contains multiple trials).

In [11]:
redcap_qc = pd.concat([track_redcap_qc, longitudinal_redcap_qc])

# drop columns with missing information
redcap_qc = redcap_qc.loc[
    redcap_qc['trial_accepted_label'].notna()
    & redcap_qc['qc_grade_label'].notna()
]

# remove trials deemed inappropriate by site
redcap_qc = redcap_qc.loc[
    redcap_qc['trial_accepted_label'] != 'Appropriately excluded by site'
]

# grades A and B are combined for some trials and not others; 
# make it consistent throughout
redcap_qc.loc[
    (
        (redcap_qc['qc_grade_label'] == 'A') 
        | (redcap_qc['qc_grade_label'] == 'B')
    ), 'qc_grade_label'
] = 'A/B'

# since "Rejected" trial decision was modified (i.e. additional "N/A"), change 
# the trial decision for all associated grades
redcap_qc.loc[
    (
        (redcap_qc['qc_grade_label'] == 'D')
        | (redcap_qc['qc_grade_label'] == 'E')
        | (redcap_qc['qc_grade_label'] == 'F')
        | (redcap_qc['qc_grade_label'] == 'N/A')
    ),
    'trial_accepted_label'
] = 'Rejected'

# check for consistency between trial and test
for qc_grade_label in ['A/B', 'C', 'D', 'E', 'F', 'N/A']:
    print(redcap_qc.loc[
        redcap_qc['qc_grade_label'] == qc_grade_label, 'trial_accepted_label'
    ].value_counts())

Accepted    4394
Name: trial_accepted_label, dtype: int64
Accepted    1767
Name: trial_accepted_label, dtype: int64
Rejected    1056
Name: trial_accepted_label, dtype: int64
Rejected    101
Name: trial_accepted_label, dtype: int64
Rejected    2913
Name: trial_accepted_label, dtype: int64
Rejected    3071
Name: trial_accepted_label, dtype: int64


#### Section 2.ii.b: Link raw data<a name="section-2iib-link-raw"></a>
There is now a data set with all the labels (i.e. grades and trial outcomes). The labels must now be linked with the raw feature files.

To aid in linking the features to the labels, a function was created to get all the path names in a folder.

In [12]:

def get_file_paths(folder_path, file_ext = '.txt'):
    """Get path of all files in all folder

    Parameters
    ----------
    folder_path : str
        Path to folder of interest
    file_ext : str, optional
        File extension of interest; will only return files ending in file_ext, 
        by default '.txt'

    Returns
    -------
    list of str
        Contains the file paths in folder_path of all files ending in file_ext
    """
    file_paths = []

    for root, _, files in os.walk(folder_path):
        for file in files:
            if file.endswith(file_ext):
                file_paths.append(os.path.join(root, file))

    return file_paths

In [13]:
TRACK_BREATH_TABLES_PATH = (
    '../data/external/Primary Training Data Set/TRACK DATA SET/',
    'Track Breath Tables TXT/'
)
TRACK_TBFVL_TABLES_PATH = (
    '../data/external/Primary Training Data Set/TRACK DATA SET/',
    'Track TBFVL Tables/'
)
LONG_BREATH_TABLES_PATH = (
    '../data/external/Primary Training Data Set/PS LONGITUDINAL DATA SET/',
    'breathtables/breath'
)
LONG_TBFVL_TABLES_PATH = (
    '../data/external/Primary Training Data Set/PS LONGITUDINAL DATA SET/',
    'tbfvltables/tbfvl'
)
SCREENSHOTS_PATH = '../data/raw/digitize_screenshots'
SPX_PATH = "../data/raw/spx_exports/"
SCREENSHOTS_PATH_MANUAL = '../data/raw/digitize_screenshots_manual'

In [14]:
breath_tables = (
    get_file_paths(TRACK_BREATH_TABLES_PATH) 
    + get_file_paths(LONG_BREATH_TABLES_PATH)
)
tbfvl_tables = (
    get_file_paths(TRACK_TBFVL_TABLES_PATH)
    + get_file_paths(LONG_TBFVL_TABLES_PATH)
)
spx = get_file_paths(SPX_PATH, file_ext='.csv')

screenshots_path = get_file_paths(SCREENSHOTS_PATH, file_ext='.csv')
screenshots_path_manual = get_file_paths(
    SCREENSHOTS_PATH_MANUAL, file_ext='.csv'
)

# if there is a file that was manually manipulated (Section 2.i.c)
# (i.e. in 'digitize_screenshots_manual'), use that version
screenshots_path = [
    screenshot.replace('digitize_screenshots', 'digitize_screenshots_manual') 
    if screenshot.replace(
        'digitize_screenshots', 'digitize_screenshots_manual'
    ) in screenshots_path_manual 
    else screenshot
    for screenshot in screenshots_path
]

co2s = [
    screenshot 
    for screenshot in screenshots_path if re.search('_co2', screenshot)
]
n2s = [
    screenshot 
    for screenshot in screenshots_path if re.search('_n2', screenshot)
]
flows = [
    screenshot 
    for screenshot in screenshots_path if re.search('_flow', screenshot)
]
o2s = [
    screenshot 
    for screenshot in screenshots_path if re.search('_o2', screenshot)
]
volumes = [
    screenshot 
    for screenshot in screenshots_path if re.search('_volume', screenshot)
]


All the file paths are collected in the associated objects (i.e. `breath_tables`, `tbfvl_tables`, `spx`, `co2s`, `n2s`, `flows`, `o2s`, `volumes`). A subject has more than one trial so it is necessary to assign the correct file path to the correct subject id-trial combination - the function `get_table_path` was created to map the correct file path to the correct subject id-trial.

In [15]:
def get_table_path(id_trials, table_list, is_screenshot=False):
    """Order file paths based on id-trial combinations

    Will order the associated file paths from table_list based on id-trial 
    combinations in id_trial_tuples

    Parameters
    ----------
    id_trial_tuples : zip
        Contains id and trial used to order table_list
    table_list : list of str
        Contains all file paths of a particular table type
    is_screenshot: bool
        Indicates if table_list is list of screenshots, by default 'False'

    Returns
    -------
    list of str
        Contains the file paths ordered according to id-trial
    """
    final_list = []
    
    for id_trial in id_trials:
        id, trial = id_trial
        
        add_path = None
        for path in table_list:
            if is_screenshot:
                if(
                    bool(re.search(('{}_'.format(id)), path)) 
                    & bool(re.search(('trial_{}_'.format(str(trial))), path))
                ):
                    add_path = path
                    break
            else:
                if(
                    bool(re.search(('-{}-'.format(id)), path)) 
                    & bool(re.search(('Trial-{}-'.format(str(trial))), path))
                ):
                    add_path = path
                    break

        final_list.append(add_path)   
         
    return final_list

We can add a new column to the REDCap dataframe that contains the file paths in the corresponding order.

In [16]:
redcap_qc['spx_export_path'] = get_table_path(
    zip(redcap_qc['spx_filename'], redcap_qc['trial']), spx, True
)
redcap_qc['breath_path'] = get_table_path(
    zip(redcap_qc['spx_filename'], redcap_qc['trial']), breath_tables
)
redcap_qc['tbfvl_path'] = get_table_path(
    zip(redcap_qc['spx_filename'], redcap_qc['trial']), tbfvl_tables
)
redcap_qc['o2_path'] = get_table_path(
    zip(redcap_qc['spx_filename'], redcap_qc['trial']), o2s, True
)
redcap_qc['n2_path'] = get_table_path(
    zip(redcap_qc['spx_filename'], redcap_qc['trial']), n2s, True
)
redcap_qc['flow_path'] = get_table_path(
    zip(redcap_qc['spx_filename'], redcap_qc['trial']), flows, True
)
redcap_qc['co2_path'] = get_table_path(
    zip(redcap_qc['spx_filename'], redcap_qc['trial']), co2s, True
)
redcap_qc['volume_path'] = get_table_path(
    zip(redcap_qc['spx_filename'], redcap_qc['trial']), volumes, True
)

The data set can be checked to make sure the assignment was done correctly.

In [17]:
# check if file(s) are assigned to more than one trial
for col_name in [
    'spx_export_path', 'breath_path', 'tbfvl_path', 'o2_path', 
    'n2_path', 'flow_path', 'co2_path', 'volume_path'
]:
    print([
        item 
        for item, count in collections.Counter(redcap_qc[col_name].tolist()).items() 
        if count > 1
    ])

# these were manually reviewed
# usually an issue with number of trials in spx not matching number of breath/tbfvl tables
# want to remove all trials associated with the spx file due to potential grade-trial misalignment
spx_filename_remove = redcap_qc.loc[redcap_qc[
    [
        'spx_export_path', 'breath_path', 'tbfvl_path', 'o2_path', 
        'n2_path', 'flow_path', 'co2_path', 'volume_path'
    ]
].isna().any(axis=1), 'spx_filename'].tolist()

[None]
[None]
[None]
[None]
[None]
[None]
[None]
[None]


In [18]:

redcap_qc['qc_grade_label'].value_counts()

A/B    4394
N/A    3071
F      2913
C      1767
D      1056
E       101
Name: qc_grade_label, dtype: int64

### Section 2.ii.c: Split data set<a name="section-2iic-split-data"></a>
The data set now contains the labels and the paths to the raw files which contain the features. The next step was to split the data. Since subjects can have more than one trial, the data is not independent and identically distributed; therefore, it was necessary to split the data by subjects instead of randomly splitting by trial. Randomly splitting by trial could lead to [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)#Training_example_leakage), an instance where the model is tested on a data set which is not representative the data used in subsequent predications. Data was split into a train (~75%), validation (~12.5%), and test (~12.5%) data sets.

In [19]:
train_prop = 0.75
validate_prop = 0.125
# test proportion implicitly is 1 - train_prop - validate_prop

np.random.seed(12345)

# we want to split by subject to avoid data leakage
unique_ids = redcap_qc['id'].unique()
np.random.shuffle(unique_ids)

train_temp, validate_temp, test_temp = np.split(
        unique_ids,
        [
            int(train_prop*len(unique_ids)), 
            int((train_prop + validate_prop)*len(unique_ids))
        ]
    )

# create new column based on lists
redcap_qc['split_group'] = np.where(
    redcap_qc['id'].isin(train_temp), 'train', np.where(
        redcap_qc['id'].isin(test_temp), 'test', 
        'validate'
    )
)

# data checks
redcap_qc['split_group'].value_counts()
for group_type in ['train', 'test', 'validate']:
    print(redcap_qc.loc[
        redcap_qc['split_group'] == group_type,'qc_grade_label'
    ].value_counts())


A/B    3282
N/A    2182
F      2165
C      1287
D       812
E        62
Name: qc_grade_label, dtype: int64
A/B    545
N/A    515
F      369
C      272
D      125
E       17
Name: qc_grade_label, dtype: int64
A/B    567
F      379
N/A    374
C      208
D      119
E       22
Name: qc_grade_label, dtype: int64


At this point, we have achieved our goal for this Section 2.ii. There is a dataframe were each row represents an individual trial and contains label information as well as the associated path of the feature files.

In [20]:
# save a copy of the dataframe
# probably should have done another data check before saving
redcap_qc[
    ~redcap_qc['spx_filename'].isin(spx_filename_remove)
].to_csv('../data/intermediary/main_id_associated_files.csv', index=False)

### Section 2.iii: Preprocess Data<a name="section-2i-data"></a>
Although we now have a dataframe containing the labels and the location of the feature files, the feature files still need to be preprocessed before being used in the model. By the end of this section, information that will used in the padding, interpolation, and standardization process.

In [21]:
RAW_PATH = '../data/raw/'
INTERMEDIARY_PATH = '../data/intermediary/'
PROCESSED_PATH = '../data/processed'

In [22]:
redcap_qc = pd.read_csv(os.path.join(
    INTERMEDIARY_PATH, 'main_id_associated_files.csv'
))

When processing sequence data, it is necessary to ensure that all the data have the standard length or size. Adding additional rows so the sequence data is a uniform size is a process known as 'padding'. In order to facilitate this process, it was necessary to find the maximum number of steps in each file. For the breath and TBFVL table, a step corresponds to a single breath. For the SPX data, it was not necessary to find the number of steps since they are uniform in the data set. The maximum number of steps will be used to pad the sequences to a uniform size.

In [23]:
# find the maximum number of steps (rows) in the files
# go through all breath tables and find the maximum number of rows;
# this information will be used in padding the files
for table_type in ['breath', 'tbfvl']:
    max_steps = 0
    for single_table in redcap_qc[table_type+'_path'].to_list():
        table_temp = pd.read_csv(single_table, sep='\t')

        if table_temp.shape[0] > max_steps:
            max_steps = table_temp.shape[0]
    print(table_type)
    print(str(max_steps))

# find the maximum number of steps (rows) in the files
# go through all digitized signals and find the maximum number of rows;
# this information will be used in padding the files
for table_type in ['o2', 'n2', 'co2', 'flow', 'volume']:
    max_steps = 0
    for single_table in redcap_qc[table_type+'_path']:
        try:
            table_temp  = pd.read_csv(
                single_table, header=None, delim_whitespace=True
            )
            if table_temp.iloc[:, 0].max() > max_steps:
                max_steps = table_temp.iloc[:, 0].max()
        # ignore files with no values
        except pd.errors.EmptyDataError:
            pass        
    print(table_type)
    print(str(max_steps))

breath
170
tbfvl
185
o2
572.002
n2
572.002
co2
574.701
flow
572.002
volume
572.002



The digitization of the raw screenshot data results in a non-uniform time steps. It is necessary to interpolate the screenshot so that there is uniform magnitude between time steps.

Additionally, the feature data will be standardized which improves the performance of the model. Standardization is accomplished by applying the following function to each feature.
$$z=\frac{x-\mu}{\sigma}$$
where the $\mu$ represents the mean and $\sigma$ represents the standard deviation. Therefore it is necessary to get both the mean and standard deviation for each feature.

The following two functions outline interpolation and obtaining the mean and standard deviation, respectively.

In [24]:
def interpolate_screenshot(raw_ss, round_dig = 1):
    """Interpolate the screenshot values

    Parameters
    ----------
    raw_ss : pandas.dataframe
        Screenshot values to be interpolated
    round_dig : int, optional
        Number of digits to round the time, by default 1

    Returns
    -------
    pandas.dataframe
        Modified raw_ss dataframe with interpolated values
    """
    raw_ss = raw_ss.rename(columns={0: 'time', 1: 'val'})
    raw_ss['time'] = raw_ss['time'].round(round_dig)
    raw_ss = raw_ss.drop_duplicates(subset='time', keep='first')

    time_inc = 10**-(round_dig)
    time_df = pd.DataFrame(
        [
            round(i, round_dig) 
            for i in np.arange(
                time_inc, raw_ss['time'].max() + time_inc, time_inc
            )
        ], columns =['time']
    )

    inter_ss = (
        raw_ss
        .merge(time_df, on='time', how='outer')
        .sort_values(by=['time'])
        .interpolate(axis='rows', limit_direction='both')
    )

    # inconsistency with capturing 0, so remove
    inter_ss = inter_ss.drop(inter_ss[inter_ss['time'] == 0].index)

    return inter_ss


def get_mean_sd(raw_df, ignore_cols=[]):
    """Get mean and standard deviation of columns in a dataframe

    Parameters
    ----------
    raw_df : pandas.dataframe
        Dataframe that we want to get the mean and standard deviations for
    ignore_cols : list, optional
        Columns to ignore, by default []

    Returns
    -------
    pandas.dataframe
        Dataframe containing the mean and standard deviation for all columns of
        raw_df
    """
    raw_df = raw_df.drop(ignore_cols, axis=1)
    raw_df.replace("", np.nan, inplace=True)
    raw_df_mean = raw_df.mean(axis=0)
    raw_df_std = raw_df.std(axis=0)

    processed_df = pd.concat(
        [raw_df_mean, raw_df_std], axis=1
    ).reset_index()

    return processed_df

To use the `get_mean_sd()` function successfully, we need to use it on a data set. It is best practice to use training data instead of the entire data set so the test data is wholly unseen. Therefore, we need to combine all the data of the training data set and apply the `get_mean_sd()` function. The combined data and the dataframe with the mean and standard deviation is saved during the process. The process is repeated for each data type.

In [25]:
breath_dfs = []
# combine all data from the breath tables   
for breath_path in redcap_qc.loc[
        # only training data
        redcap_qc['split_group'] == 'train', 'breath_path'
    ].to_list():
    breath_table = pd.read_csv(breath_path, sep='\t')
    try:
        breath_table = breath_table.drop(
            columns = [
                'Excluded for Sacin/Scond calculation', 'VdCO2 Langley [ml]'
            ]
        )
        breath_table = breath_table.rename(
            columns={'VdCO2 Fowler [ml]': 'VdCO2 [ml]'}
        )
    except:
        pass
    breath_dfs.append(breath_table)
breath_df = pd.concat(breath_dfs, ignore_index=True)

# save the dataframe containing all the training data
breath_df.to_csv(
    os.path.join(INTERMEDIARY_PATH, 'combine_data_type', 'breath_raw.csv'),
    index = False
)

# replace inf value with min value of column
for col_min in ['VdCO2 [ml]']:
    col_min_val = breath_df.loc[breath_df[col_min] != np.NINF, col_min].min()
    breath_df[col_min].replace(np.NINF, col_min_val, inplace=True)

# get and save the mean and SD of the breath tables
breath_mean_sd_df = get_mean_sd(breath_df)
breath_mean_sd_df.to_csv(
    os.path.join(INTERMEDIARY_PATH, 'combine_data_type', 'breath_mean_sd.csv')
)

# repeat the process for tbfvl tables
tbfvl_dfs = []
for tbfvl_path in redcap_qc.loc[
        redcap_qc['split_group'] == 'train', 'tbfvl_path'
    ].to_list():
    tbfvl_table = pd.read_csv(tbfvl_path, sep='\t')
    tbfvl_dfs.append(tbfvl_table)
tbfvl_df = pd.concat(tbfvl_dfs, ignore_index=True)

tbfvl_df.to_csv(
    os.path.join(INTERMEDIARY_PATH, 'combine_data_type', 'tbfvl_raw.csv'), 
    index = False
)

tbfvl_mean_sd_df = get_mean_sd(tbfvl_df, ['Phase', 'Timestamp (UTC)'])
tbfvl_mean_sd_df.to_csv(
    os.path.join(INTERMEDIARY_PATH, 'combine_data_type', 'tbfvl_mean_sd.csv')
)

# repeat the process for spx
spx_dfs = []
for spx_path in redcap_qc.loc[
        redcap_qc['split_group'] == 'train', 'spx_export_path'
    ].to_list():
    spx = pd.read_csv(spx_path)
    spx_dfs.append(spx)
spx_df = pd.concat(spx_dfs, ignore_index=True)

# replace inf value with max value of column
for col_max in ['W slower', 'N2Cet norm @ TO6 [%]']:
    col_max_val = spx_df.loc[spx_df[col_max] != np.inf, col_max].max()
    spx_df[col_max].replace(np.inf, col_max_val, inplace=True)

# replace inf value with min value of column
for col_min in ['Vd CO2 mean [ml]']:
    col_min_val = spx_df.loc[spx_df[col_min] != np.NINF, col_min].min()
    spx_df[col_min].replace(np.NINF, col_min_val, inplace=True)

# convert date of birth to unix time
spx_df['Date of birth'] = (
    pd.to_datetime(spx_df['Date of birth'], format='%d.%m.%Y')
    - pd.Timestamp('1970-01-01')
) // pd.Timedelta('1s')

spx_df.to_csv(
    os.path.join(INTERMEDIARY_PATH, 'combine_data_type', 'spx_raw.csv'),
    index = False
)

spx_mean_sd_df = get_mean_sd(
    spx_df.drop(columns=[
        'Patient-ID', 'Lastname', 'Firstname', 'Gender', 
        'Ethnicity', 'Smoker', 'Asthma', 'Notes', 'Test Date (UTC)',
        'Timestamp (UTC)', 'Comment', 'FRC @ TO6 [l]', 'Scond*VT [l]',
        'Sacin*VT [l]', 'Pacin*VT [l]', 'Scond', 'Sacin','Pacin',
    ])
)

spx_mean_sd_df.to_csv(
    os.path.join(INTERMEDIARY_PATH, 'combine_data_type', 'spx_mean_sd.csv')
)

# repeat the process for raw MBW signals
ss_dict = {}
for ss_type in ['o2', 'n2', 'co2', 'flow', 'volume']:
    ss_dfs = []
    for ss_path in redcap_qc.loc[
        redcap_qc['split_group'] == 'train', ss_type +'_path'
    ].to_list():
        try:
            raw_ss = pd.read_csv(ss_path, header=None, delim_whitespace=True)
            ss_dfs.append(interpolate_screenshot(raw_ss))
        # ignore files with no values; no use for determining mean/SD
        except pd.errors.EmptyDataError:
            pass
    ss_df = pd.concat(ss_dfs, ignore_index=True)
    ss_df.to_csv(
        os.path.join(
            INTERMEDIARY_PATH, 'combine_data_type',
            '{}_raw.csv'.format(ss_type)
        ), 
        index = False
    )

    mean_sd_df = get_mean_sd(ss_df)
    mean_sd_df.to_csv(
        os.path.join(
            INTERMEDIARY_PATH, 'combine_data_type',
            '{}_mean_sd.csv'.format(ss_type)
        )
    )
    ss_dict[ss_type] = mean_sd_df

After obtaining the mean and standard deviation, they can be printed out in the form of a dictionary. The mean and standard deviation will be incorporated into a function used for standardization.

In [26]:

for feature_df in [breath_mean_sd_df, tbfvl_mean_sd_df, spx_mean_sd_df]:
    for index, row in feature_df.iterrows():
        print(
            "'{}': {'mean': {}, 'sd': {}},".format(
                row['index'], str(row[0]), str(row[1])
            )
        )
    print("")

for signal in ['o2', 'n2', 'co2', 'flow', 'volume']:
    for index, row in ss_dict[signal].iterrows():
        print(
            "'{}': {'mean': {}, 'sd': {}},".format(
                row['index'], str(row[0]), str(row[1])
            )
        )
    print("")

'Breath #': {'mean': 24.945467123196384, 'sd': 18.281016819661676},
'N2 Cet [%]': {'mean': 12.144732241076664, 'sd': 15.141453997518099},
'TO': {'mean': 5.375122485703167, 'sd': 3.4907699447313254},
'FRC [l]': {'mean': 1.0826127705424118, 'sd': 0.9513189471754464},
'SnIIIms [1/l]': {'mean': 2.4440156651685347, 'sd': 44.02705373701091},
'N2 Cet-Start [%]': {'mean': 78.71963071900763, 'sd': 0.5698597652353171},
'N2 Cet Norm [%]': {'mean': 15.426211011264344, 'sd': 19.230296481082185},
'N2 C mean slope': {'mean': 11.789620986270418, 'sd': 14.894094723646074},
'N2 C mean breath': {'mean': 8.594905581729952, 'sd': 11.023586515408446},
'VolInsp [l]': {'mean': 0.30181482172647867, 'sd': 0.15608622971075944},
'VolExp [l]': {'mean': 0.29635537040883403, 'sd': 0.20077721360498088},
'CEV [l]': {'mean': 7.3472713624810115, 'sd': 7.316445365782281},
'CEV-DS [l]': {'mean': 6.327214485182281, 'sd': 6.515492731882587},
'N2InspMean [%]': {'mean': 0.349038145152956, 'sd': 2.4698162052311963},
'VolN2Exp 

### Section 2.iv: Load and Process Data<a name="section-2iv-load-process"></a>
At this point, we have a dataframe that contain the labels and the location of the feature files. We also now have information necessary to process the data. It is time to now to create our datasets used by TensorFlow. We will take the location of the feature files, apply our functions to preprocess the data, and store the data in the TFRecord format.


In [27]:
from tensorflow import keras
import tensorflow as tf
import keras_tuner as kt

In [28]:
RAW_PATH = '../data/raw/'
INTERMEDIARY_PATH = '../data/intermediary/'
PROCESSED_PATH = '../data/processed'

In [29]:
redcap_qc = pd.read_csv(
    os.path.join(INTERMEDIARY_PATH, 'main_id_associated_files.csv'), 
    keep_default_na=False
)

Not all columns were available for all trials. It was therefore necessary to define which columns would be used in the model.

In [30]:
BREATH = [
    'N2 Cet [%]', 'TO', 'FRC [l]', 'SnIIIms [1/l]', 'N2 Cet-Start [%]',
    'N2 Cet Norm [%]', 'N2 C mean slope', 'N2 C mean breath', 'VolInsp [l]',
    'VolExp [l]', 'CEV [l]', 'CEV-DS [l]', 'N2InspMean [%]', 'VolN2Exp [ml]',
    'VolN2Netto [ml]', 'CumVolN2Netto [ml]', 'VolN2Reinsp [ml]', 'SIII',
    'SnIII, C breath*VT', 'VdCO2 [ml]', 'FlowInsp. mean [ml/s]',
    'FlowExp. mean [ml/s]', 'RR', 'VolExp-DS [l]', 'VolN2Netto filtered [ml]',
    'VolN2Netto fast [ml]', 'VdN2 [ml]', 'VT alv. N2 [ml]',
]

TBFVL = [
    'Insp.Time [s]', 'Exp.Time [s]', 'Total breath time [s]', 'PIF [ml/s]',
    'PEF [ml/s]', 'Time to PIF [s]', 'Time to PEF [s]', 'Insp. Volume [ml]',
    'Exp. Volume [ml]', 'EEL [ml]', 'EEL cum. [ml]', 'Tidal Volume [ml]',
    'RR [1/min]', 'Ratio Insp./Tot. Time [%]', 'Ratio Exp./Tot. Time [%]',
    'Ratio Insp./Exp. Time [%]', 'Ratio PEF/Exp. Time [%]', 'MTIF [ml/s]',
    'MTEF [ml/s]', 'Minute ventilation [ml/min]', 'TEF75 [ml/s]',
    'TEF50 [ml/s]', 'TEF25 [ml/s]', 'TEF10 [ml/s]', 'TIF50 [ml/s]', 'VPIF [ml]',
    'VPEF [ml]', 'TEF50/TIF50 [%]', 'TEF75/PEF [%]', 'TEF50/PEF [%]',
    'TEF25/PEF [%]', 'TEF10/PEF [%]', 'PEF/Exp.Vol. [1/s]', 'VPEF/VT [%]',
    'AFV [l*l/s]', 'VTinsp/Tinsp [ml/s]', 'O2 consumed [ml]', 
    'CO2 emitted [ml]', 'RQ', 'et CO2 [%]', 'et O2 [%]', 'W', 'P'
]

SPX = [
    'Date of birth', 'Height [cm]', 'Weight [kg]', 'Trial #',
    'Washout time [s]', '# Washout Breaths', 'FRC [l]', 'LCI-2.5', 'LCI-5',
    'FidN2', 'VdF/VT [%]', 'W faster', 'W slower', 'W full',
    'VT alv. faster [ml]', 'VT alv. slower [ml]', 'VT alv. full [ml]',
    'FRC faster / FRC full [%]', 'FRC slower / FRC full [%]',
    'Specific ventilation faster [%]', 'Specific ventilation slower [%]',
    'Specific ventilation ratio', 'FRC faster [ml]', 'FRC slower [ml]',
    'FRC full [ml]', 'VT alv. N2 mean [ml]', 'M1/M0', 'M2/M0', 'M1/M0-6',
    'M2/M0-6', 'M1/M0-8', 'M2/M0-8', 'CEV [l]', 'N2 Cet-Start [%]',
    'Flow Insp. mean [ml/s]', 'Flow Exp. mean [ml/s]', 'VT Insp. mean [ml]',
    'VT Exp. mean [ml]', 'VT mean [ml]', 'RQ', 'VT mean/FRC',
    'N2Cet norm @ TO6 [%]', 'Vd CO2 mean [ml]', 'et CO2 mean [%]', 'Male', 'Female'
]

In Section 2.iii, we obtained information used in padding and standardization. We can now define some functions that will pad and standardize the data incorporating the information from Section 2.iii. We also define a function that will create a boolean column to indicate missing values.


In [31]:
def pad_rows(raw_table, n_rows):
    """Pad/add additional rows to dataframe

    Parameters
    ----------
    raw_table : pandas.dataframe
        Dataframe of interest
    n_rows : int
        Number of rows to add to raw_table

    Returns
    -------
    pandas.dataframe
        Dataframe with padded rows of 0
    """
    padded_table = raw_table.reindex(range(n_rows)).fillna(0)

    return padded_table

def standardize(raw_table, mean_sd_dict):
    for raw_table_cols in raw_table.columns.values:

        feature_mean = mean_sd_dict[raw_table_cols]['mean']
        feature_sd = mean_sd_dict[raw_table_cols]['sd']

        raw_table[raw_table_cols] = (
            (raw_table[raw_table_cols] - feature_mean)/(feature_sd)
        )

    # subset raw table using only the keys from dictionary
    # final_table should only contain z-scores
    standardize_table = raw_table[raw_table.columns.values]

    return standardize_table

def create_bool_col(raw_table, skip_col=None):
    """Create boolean columns which correspond to missing values

    Parameters
    ----------
    raw_table : pandas.dataframe
        [description]
    skip_col : list of strings, optional
        Names of columns to skip in raw_table, by default None

    Returns
    -------
    pandas.dataframe

    """
    raw_table_cols = raw_table.columns.values.tolist()

    if skip_col != None:
        raw_table_cols = [col for col in raw_table_cols if col not in skip_col]

    for col in raw_table_cols:
        raw_table['{}_bool'.format(col)] = raw_table[col].notna().astype(int)

    return raw_table

In [32]:
def process_breath(breath_table):
    """Process single breath table

    Parameters
    ----------
    breath_table : pandas.dataframe
        The dataframe to be processed

    Returns
    -------
    pandas.dataframe
        Breath table that is cleaned, standardized, padded, with a boolean
        column to indicate missing values
    """
    breath_table.replace([np.inf, -np.inf], np.nan, inplace=True)
    try:
        breath_table = breath_table.drop(columns = [
            'Excluded for Sacin/Scond calculation', 'VdCO2 Langley [ml]'
        ])
        breath_table = breath_table.rename(
            columns={'VdCO2 Fowler [ml]': 'VdCO2 [ml]'}
        )
    except:
        pass

    breath_table = breath_table.drop(columns = ['Breath #'])
    
    # obtained from Section 2.iii; used to standardize data
    breath_dict={
        'N2 Cet [%]': {'mean': 12.153882738375403, 'sd': 15.146946095000398},
        'TO': {'mean': 5.373229399786835, 'sd': 3.4905444722306527},
        'FRC [l]': {'mean': 1.0858187396262766, 'sd': 0.9541204193329395},
        'SnIIIms [1/l]': {'mean': 2.436704496635718, 'sd': 44.172888433741115},
        'N2 Cet-Start [%]': {'mean': 78.71975309865458, 'sd': 0.5702802883948145},
        'N2 Cet Norm [%]': {'mean': 15.437814726560148, 'sd': 19.237251687090218},
        'N2 C mean slope': {'mean': 11.7981740544857, 'sd': 14.899463233593318},
        'N2 C mean breath': {'mean': 8.600724857941282, 'sd': 11.027203854199875},
        'VolInsp [l]': {'mean': 0.3024531561143202, 'sd': 0.15620154067731323},
        'VolExp [l]': {'mean': 0.29696582278173644, 'sd': 0.20118360135868985},
        'CEV [l]': {'mean': 7.368848459186139, 'sd': 7.334883401302143},
        'CEV-DS [l]': {'mean': 6.345877225078502, 'sd': 6.532314023147374},
        'N2InspMean [%]': {'mean': 0.3476471823217033, 'sd': 2.4687831098982387},
        'VolN2Exp [ml]': {'mean': 23.762126946384935, 'sd': 36.342708423428164},
        'VolN2Netto [ml]': {'mean': 21.371505963447053, 'sd': 33.760733010916375},
        'CumVolN2Netto [ml]': {'mean': 758.0587540776661, 'sd': 464.2319361660444},
        'VolN2Reinsp [ml]': {'mean': 2.3906209829378824, 'sd': 3.694096528988632},
        'SIII': {'mean': 16.62026849119295, 'sd': 52.0866479377276},
        'SnIII, C breath*VT': {'mean': 0.3929482248530059, 'sd': 1.0926357552056198},
        'VdCO2 [ml]': {'mean': 53.52361250159101, 'sd': 228.6797951540405},
        'FlowInsp. mean [ml/s]': {'mean': -239.7494503273387, 'sd': 93.56379023866181},
        'FlowExp. mean [ml/s]': {'mean': 180.1956568715721, 'sd': 83.36041837364677},
        'RR': {'mean': 21.627064533951323, 'sd': 7.543466199691876},
        'VolExp-DS [l]': {'mean': 0.2571750119173467, 'sd': 0.1967484970715272},
        'VolN2Netto filtered [ml]': {'mean': 2274.8835729392886, 'sd': 416984.9262282386},
        'VolN2Netto fast [ml]': {'mean': -218576995892672.9, 'sd': 5.882170856957783e+16},
        'VdN2 [ml]': {'mean': -13948.381121424547, 'sd': 11526257.13744747},
        'VT alv. N2 [ml]': {'mean': 14209.873164337721, 'sd': 11526256.776634963},
    }
    
    breath_table = standardize(breath_table, breath_dict)
    breath_table = create_bool_col(breath_table)
    breath_table = pad_rows(breath_table, 187)

    return breath_table

def process_tbfvl(tbfvl_table):
    """Process single TBFVL table

    Parameters
    ----------
    tbfvl_table : pandas.dataframe
        The dataframe to be processed

    Returns
    -------
    pandas.dataframe
        TBFVL table that is cleaned, standardized, padded, with a boolean
        column to indicate missing values
    """
    tbfvl_table.replace([np.inf, -np.inf], np.nan, inplace=True)
    
    tbfvl_table_dum = pd.get_dummies(tbfvl_table['Phase'])
    if not 'W' in tbfvl_table_dum.columns:
        tbfvl_table_dum['W'] = 0
    if not 'P' in tbfvl_table_dum.columns:
        tbfvl_table_dum['P'] = 0
    tbfvl_table_dum = tbfvl_table_dum[['P', 'W']]
    
    tbfvl_table = tbfvl_table.drop(
        columns = ['Phase', 'Breath #', 'Timestamp (UTC)']
    )

    # obtained from Section 2.iii; used to standardize data
    tbfvl_dict={
        'Insp.Time [s]': {'mean': 1.2797538928784322, 'sd': 0.6363674460388641},
        'Exp.Time [s]': {'mean': 1.6933594892170798, 'sd': 0.6640419329601139},
        'Total breath time [s]': {'mean': 2.9731133820954883, 'sd': 1.080811473993286},
        'PIF [ml/s]': {'mean': 346.41345521990553, 'sd': 145.94254693899146},
        'PEF [ml/s]': {'mean': 274.0723334710935, 'sd': 140.59437380117248},
        'Time to PIF [s]': {'mean': 0.6032923247234108, 'sd': 0.34241641955579544},
        'Time to PEF [s]': {'mean': 0.6677677983135155, 'sd': 0.5051787708426468},
        'Insp. Volume [ml]': {'mean': 297.521038949005, 'sd': 174.8415245099567},
        'Exp. Volume [ml]': {'mean': 292.7281631780046, 'sd': 189.86628049098528},
        'EEL [ml]': {'mean': -4.792875771000758, 'sd': 160.02623802114027},
        'EEL cum. [ml]': {'mean': -165.78904804781897, 'sd': 1202.541203976086},
        'Tidal Volume [ml]': {'mean': 295.124601063505, 'sd': 164.03439385209424},
        'RR [1/min]': {'mean': 21.778889891023105, 'sd': 8.646669811816205},
        'Ratio Insp./Tot. Time [%]': {'mean': 43.01587195715785, 'sd': 9.443479736517506},
        'Ratio Exp./Tot. Time [%]': {'mean': 56.98412804284092, 'sd': 9.443479736517506},
        'Ratio Insp./Exp. Time [%]': {'mean': 82.61291770040698, 'sd': 66.35908419531079},
        'Ratio PEF/Exp. Time [%]': {'mean': 40.659256280378564, 'sd': 30.97466072212236},
        'MTIF [ml/s]': {'mean': 236.8927543675094, 'sd': 98.50483303756096},
        'MTEF [ml/s]': {'mean': 178.82293979782767, 'sd': 85.72743995849184},
        'Minute ventilation [ml/min]': {'mean': 5959.570739235493, 'sd': 4386.050686273252},
        'TEF75 [ml/s]': {'mean': 251.62685872209016, 'sd': 133.4192342201182},
        'TEF50 [ml/s]': {'mean': 247.73133034663553, 'sd': 127.89648521975207},
        'TEF25 [ml/s]': {'mean': 204.20211195639675, 'sd': 106.99888837092331},
        'TEF10 [ml/s]': {'mean': 155.70819128243988, 'sd': 86.41135066694149},
        'TIF50 [ml/s]': {'mean': 328.8390334997921, 'sd': 137.85935839822903},
        'VPIF [ml]': {'mean': 140.12219503186364, 'sd': 84.08025056339659},
        'VPEF [ml]': {'mean': 112.2833324539225, 'sd': 85.8482249837884},
        'TEF50/TIF50 [%]': {'mean': 891.6328732699907, 'sd': 256366.21199885794},
        'TEF75/PEF [%]': {'mean': 91.07214464273777, 'sd': 11.835260521315465},
        'TEF50/PEF [%]': {'mean': 90.10795074918406, 'sd': 10.567776696055914},
        'TEF25/PEF [%]': {'mean': 75.54909539549756, 'sd': 15.576441486292023},
        'TEF10/PEF [%]': {'mean': 58.72707472051545, 'sd': 18.101581522472163},
        'PEF/Exp.Vol. [1/s]': {'mean': 1.07114594132629, 'sd': 0.6180617273977341},
        'VPEF/VT [%]': {'mean': 39.40238099155333, 'sd': 20.81656112805978},
        'AFV [l*l/s]': {'mean': 0.17308893777746165, 'sd': 1.0289346569443358},
        'VTinsp/Tinsp [ml/s]': {'mean': 233.88182082354655, 'sd': 98.95444571414659},
        'O2 consumed [ml]': {'mean': 29.333572081442092, 'sd': 127.95411766895819},
        'CO2 emitted [ml]': {'mean': 10.211686279029067, 'sd': 6.711886962072948},
        'RQ': {'mean': 0.9347406301193033, 'sd': 17.567854972313437},
        'et CO2 [%]': {'mean': 5.122768724878801, 'sd': 0.5532392910456915},
        'et O2 [%]': {'mean': 61.97589131675811, 'sd': 33.532385379981235},
    }

    tbfvl_table = standardize(tbfvl_table, tbfvl_dict)
    tbfvl_table = pd.concat(
        [tbfvl_table.reset_index(drop=True), tbfvl_table_dum], axis=1
    )
    tbfvl_table = create_bool_col(tbfvl_table)
    tbfvl_table = pad_rows(tbfvl_table, 203)

    return tbfvl_table

def process_spx(spx_df):
    """Process single spx

    Parameters
    ----------
    spx_df : pandas.dataframe
        The dataframe to be processed

    Returns
    -------
    pandas.dataframe
        SPX dataframe that is cleaned, standardized, padded, with a boolean
        column to indicate missing values
    """
    spx_df.replace([np.inf, -np.inf], np.nan, inplace=True)
    
    # obtained from Section 2.iii; used to standardize data
    spx_dict={
        'Date of birth': {'mean': 1255353813.8860104, 'sd': 77344813.94431767},
        'Height [cm]': {'mean': 123.02784766839378, 'sd': 35.381900798194565},
        'Weight [kg]': {'mean': 25.898219689119173, 'sd': 11.436210456645634},
        'Trial #': {'mean': 4.181450777202072, 'sd': 2.912964856333764},
        'Washout time [s]': {'mean': 84.50111450777202, 'sd': 63.912161320303085},
        '# Washout Breaths': {'mean': 32.407046632124356, 'sd': 23.136903741207938},
        'FRC [l]': {'mean': 0.9037610567927414, 'sd': 4.168707549575682},
        'LCI-2.5': {'mean': 7.23461735138295, 'sd': 3.5658937036862435},
        'LCI-5': {'mean': 4.9238518879363635, 'sd': 1.9927743515469327},
        'FidN2': {'mean': 0.31119971174158695, 'sd': 0.309218475498321},
        'VdF/VT [%]': {'mean': 3.054105593110619, 'sd': 10099.663484440092},
        'W faster': {'mean': 0.7265870278816335, 'sd': 0.36449056345170167},
        'W slower': {'mean': 0.7680085543316435, 'sd': 0.3380721794524283},
        'W full': {'mean': 0.7644553735920073, 'sd': 0.3256802049735196},
        'VT alv. faster [ml]': {'mean': -283748734.15008336, 'sd': 21682192749.576374},
        'VT alv. slower [ml]': {'mean': 306337992.4861179, 'sd': 25539109422.600826},
        'VT alv. full [ml]': {'mean': 84.52938464882929, 'sd': 71.22668060754128},
        'FRC faster / FRC full [%]': {'mean': -13644883.865959585, 'sd': 1152467596.0237799},
        'FRC slower / FRC full [%]': {'mean': 13644966.302934375, 'sd': 1152467596.2317512},
        'Specific ventilation faster [%]': {'mean': 8.774090131798937, 'sd': 9.364790710266577},
        'Specific ventilation slower [%]': {'mean': 9.006844446364664, 'sd': 7.89722921185944},
        'Specific ventilation ratio': {'mean': 0.9588034971418548, 'sd': 1.0859257171938819},
        'FRC faster [ml]': {'mean': -256141920.24584636, 'sd': 21625969648.396786},
        'FRC slower [ml]': {'mean': 256142719.0246412, 'sd': 21625969661.17982},
        'FRC full [ml]': {'mean': 797.9417475990174, 'sd': 705.995209510256},
        'VT alv. N2 mean [ml]': {'mean': 19639.013292178577, 'sd': 2182165.383269265},
        'M1/M0': {'mean': 1.457213483160203, 'sd': 0.8935620055758221},
        'M2/M0': {'mean': 6.2429303824395515, 'sd': 7.082424679845044},
        'M1/M0-6': {'mean': 1.1437500933063385, 'sd': 0.6156695409978187},
        'M2/M0-6': {'mean': 3.07362169269162, 'sd': 3.2769037503294145},
        'M1/M0-8': {'mean': 1.2876635170531843, 'sd': 0.709284473605157},
        'M2/M0-8': {'mean': 4.221848493457361, 'sd': 3.8156767434547034},
        'CEV [l]': {'mean': 8.354690340015287, 'sd': 8.256338563387251},
        'N2 Cet-Start [%]': {'mean': 77.6401703260693, 'sd': 9.064655291744732},
        'Flow Insp. mean [ml/s]': {'mean': 199.3130596501723, 'sd': 101.73035209256297},
        'Flow Exp. mean [ml/s]': {'mean': 150.4402457097296, 'sd': 84.5176186736973},
        'VT Insp. mean [ml]': {'mean': 254.60975300784006, 'sd': 153.44963333195182},
        'VT Exp. mean [ml]': {'mean': 249.120262279826, 'sd': 159.786142750936},
        'VT mean [ml]': {'mean': 251.86500764394808, 'sd': 153.3183014405259},
        'RQ': {'mean': 0.9029283969167898, 'sd': 1.0792542484393604},
        'VT mean/FRC': {'mean': 0.2590167139748991, 'sd': 1.2382610542745873},
        'N2Cet norm @ TO6 [%]': {'mean': 3.139571762168011e+99, 'sd': 2.0326150948725518e+101},
        'Vd CO2 mean [ml]': {'mean': 42.1308592778324, 'sd': 59.58611484255497},
        'et CO2 mean [%]': {'mean': 4.505687988198351, 'sd': 1.7889436652717987},
    }

    spx_df_dum = pd.get_dummies(spx_df['Gender'])
    if not 'Male' in spx_df_dum.columns:
        spx_df_dum['Male'] = 0
    if not 'Female' in spx_df_dum.columns:
        spx_df_dum['Female'] = 0
    spx_df_dum = spx_df_dum[['Male', 'Female']]

    spx_df['Date of birth'] = (
        pd.to_datetime(spx_df['Date of birth'], format='%d.%m.%Y')
        - pd.Timestamp('1970-01-01')
    ) // pd.Timedelta('1s')

    spx_df = spx_df.drop(columns = [
        'Patient-ID', 'Lastname', 'Firstname', 'Gender', 
        'Ethnicity', 'Smoker', 'Asthma', 'Notes', 'Test Date (UTC)',
        'Timestamp (UTC)', 'Comment', 'FRC @ TO6 [l]', 'Scond*VT', 'Sacin*VT',
        'Pacin*VT', 'Scond [1/l]',	'Sacin [1/l]',	'Pacin [1/l]',
        '1st breath SnIII*VT',	'1st breath SnIII [1/l]', 'Scond*VT [l]',
        'Sacin*VT [l]', 'Pacin*VT [l]', 'Scond', 'Sacin','Pacin',
    ], errors='ignore')

    spx_df = standardize(spx_df, spx_dict)
    spx_df = pd.concat([spx_df.reset_index(drop=True), spx_df_dum], axis=1)
    spx_df = create_bool_col(spx_df)
    spx_df = pad_rows(spx_df , 1)

    return spx_df

def process_screenshot(screenshot, ss_type):
    """Process single screenshot type

    Parameters
    ----------
    screenshot : pandas.dataframe
        The dataframe to be processed
    ss_type : str
        Raw MBW signal to be processed

    Returns
    -------
    pandas.dataframe
        Screenshot data that is cleaned, standardized, padded, with a boolean
        column to indicate missing values
    """

    # obtained from Section 2.iii; used to standardize data
    screenshot_dict = {
        'o2':  {'mean': 68.23157418109633, 'sd': 35.262795056959945},
        'co2': {'mean': 2.196753669014523, 'sd': 2.153852461462564},
        'flow': {'mean': -42.97922884856061, 'sd': 356.93793207438284},
        'n2': {'mean': 28.151857103277152, 'sd': 34.40444632705889},
        'volume': {'mean': -26.96673191012796, 'sd': 967.9686423120496}
    }

    screenshot = interpolate_screenshot(screenshot)
    screenshot.rename(columns={'val': ss_type}, inplace=True)
    screenshot = screenshot.drop(columns=['time'])
    screenshot = standardize(screenshot, screenshot_dict)
    screenshot = pad_rows(screenshot, 5748)

    return screenshot

Because the training was done on a corporate computer with limited RAM, there was a potential the data set would not fit on memory. Therefore, it was necesary to create a data set using the TFRecord format. The functions below help with the conversion.

In [33]:

def float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

# types need to be uniform in tensor
def int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))


We now have all the functions defined to create our TFRecords.

In [34]:
# each TFRecord file will have a maximum of 1000 records
max_examples_rec = 1000

# we want separate data sets for train, validate and test
for group_type in ['train', 'validate', 'test']:
    redcap_split = redcap_qc.loc[redcap_qc['split_group'] == group_type]
    redcap_split = redcap_split.sample(frac=1).reset_index(drop=True)

    max_files = (redcap_split.shape[0]//max_examples_rec) + 1

    for tfrec_num in range(1, max_files+1):
        rec_start = (tfrec_num - 1) * max_examples_rec
        rec_stop = (tfrec_num) * max_examples_rec
        if rec_stop > redcap_split.shape[0]:
            redcap_split_sub = redcap_split.iloc[rec_start:]
        else:
            redcap_split_sub = redcap_split.iloc[rec_start:rec_stop]

        with tf.io.TFRecordWriter(
                PROCESSED_PATH + "/" + group_type + "/" 
                + group_type + "_%.2i.tfrec" % (tfrec_num)
            ) as writer:

            for single_record in zip(
                redcap_split_sub['o2_path'], redcap_split_sub['n2_path'],  
                redcap_split_sub['co2_path'],  redcap_split_sub['flow_path'],
                redcap_split_sub['volume_path'], 
                redcap_split_sub['breath_path'], redcap_split_sub['tbfvl_path'],
                redcap_split_sub['spx_export_path'], 
                redcap_split_sub['qc_grade_label'],
                redcap_split_sub['trial_accepted_label']
            ):
                (
                    o2_path, n2_path, co2_path, flow_path, 
                    volume_path, 
                    breath_path, tbfvl_path, spx_path,
                    qc_grade_label,
                    trial_accepted_label
                ) = single_record

                # used to store all data before being processed as a tf.record
                feature = {}

                # process the screenshot data
                screenshot_path_dict = {
                    'o2': o2_path,
                    'co2': co2_path, 
                    'n2': n2_path,
                    'flow': flow_path,
                    'volume': volume_path,
                }
                for key, val in screenshot_path_dict.items():
                    try:
                        screenshot = pd.read_csv(
                            val, header=None, delim_whitespace=True
                        )
                    except pd.errors.EmptyDataError:
                        # empty dataframe if screenshot is empty due to white 
                        # screenshot
                        screenshot = pd.DataFrame({'time':[0],'val':[0]})

                    processed_screenshot = process_screenshot(screenshot, key)
                    feature[key] = float_feature(
                        processed_screenshot[key].astype(float).tolist()
                    )

                # process breath table
                breath = pd.read_csv(breath_path, sep='\t')
                breath = process_breath(breath)
                i = 1
                for breath_col in BREATH:
                    feature['breath_{}'.format(str(i))] = float_feature(
                        breath[breath_col].tolist()
                    )
                    feature['breath_{}_bool'.format(str(i))] = float_feature(
                        breath[breath_col+'_bool'].astype(float).tolist()
                    )
                    i += 1

                # process tbfvl table
                tbfvl = pd.read_csv(tbfvl_path, sep='\t')
                tbfvl = process_tbfvl(tbfvl)
                i = 1
                for tbfvl_col in TBFVL:
                    feature['tbfvl_{}'.format(str(i))] = float_feature(
                        tbfvl[tbfvl_col].tolist()
                    )
                    feature['tbfvl_{}_bool'.format(str(i))] = float_feature(
                        tbfvl[tbfvl_col+'_bool'].astype(float).tolist()
                    )
                    i += 1

                # process spx data
                spx = pd.read_csv(spx_path)
                spx = process_spx(spx)
                i = 1
                for spx_col in SPX:
                    # spx has to be processed differently since there are 
                    # overlapping columns in breath and tbfvl tables
                    feature['spx_'.format(str(i))] = float_feature(
                        spx[spx_col].tolist()
                    )
                    feature['spx_{}_bool'.format(str(i))] = float_feature(
                        spx['{}_bool'.format(spx_col)].astype(float).tolist()
                    )
                    i += 1

                # process trial outcome
                if trial_accepted_label == 'Accepted':
                    feature['trial_outcome'] = int64_feature([1])
                else:
                    feature['trial_outcome'] = int64_feature([0])

                qc_grade_label_dict = {
                    'A/B': [1, 0, 0, 0, 0, 0],
                    'C': [0, 1, 0, 0, 0, 0],
                    'D': [0, 0, 1, 0, 0, 0],
                    'E': [0, 0, 0, 1, 0, 0],
                    'F': [0, 0, 0, 0, 1, 0],
                    'N/A': [0, 0, 0, 0, 0, 1]
                }
                feature['grade'] = int64_feature(
                    qc_grade_label_dict[qc_grade_label]
                )
                
                example = tf.train.Example(
                    features=tf.train.Features(feature=feature)
                )
                writer.write(example.SerializeToString())

We have sucessfully converted our data to the TFRecord format. We need to create some functions to decode the TFRecord data.

In [35]:
def parse_tfrecord_fn(example):
    """Parse TFRecord

    TFRecords contain a sequence of records. We want to process a single record

    Parameters
    ----------
    example : serialized Example
        TFRecord data to be processed

    Returns
    -------
    dict
        A dict mapping feature keys to Tensor and SparseTensor values. 
    """
    data_description = {}
    for i in range(1, len(BREATH)+1):
        data_description[
            'breath_{}'.format(str(i))
        ] = tf.io.FixedLenFeature([187], tf.float32)
        data_description[
            'breath_{}_bool'.format(str(i))
        ] = tf.io.FixedLenFeature([187], tf.float32)
    for i in range(1, len(TBFVL)+1):
        data_description[
            'tbfvl_{}'.format(str(i))
        ] = tf.io.FixedLenFeature([203], tf.float32)
        data_description[
            'tbfvl_{}_bool'.format(str(i))
        ] = tf.io.FixedLenFeature([203], tf.float32)
    for i in range(1, len(SPX)+1):
        data_description[
            'spx_'.format(str(i))
        ] = tf.io.FixedLenFeature([1], tf.float32)
        data_description[
            'spx_{}_bool'.format(str(i))
        ] = tf.io.FixedLenFeature([1], tf.float32)
    for screenshot in ['o2', 'co2', 'n2', 'flow', 'volume']:
        data_description[
            screenshot
        ] = tf.io.FixedLenFeature([5748], tf.float32)
        
    data_description['trial_outcome'] = tf.io.FixedLenFeature([1], tf.int64)
    data_description['grade'] = tf.io.FixedLenFeature([6], tf.int64)

    example = tf.io.parse_single_example(example, data_description)

    return example


def prepare_sample(features):
    """Modify the TFRecord data

    Change the column names and modify the data type

    Parameters
    ----------
    features : dict
        A dict mapping feature keys to Tensor and SparseTensor values.

    Returns
    -------
    input_dict : dict
        Processed data used for the features
    output_dict : dict
        Processed data used for the labels
    """
    input_dict = {}
    
    table_dict = {
        'breath_input': ['breath_', len(BREATH)+1],
        'tbfvl_input': ['tbfvl_', len(TBFVL)+1],
    }
    for key, val in table_dict.items():
        table_list = []
        for i in range(1, val[1]):
            table_list.append(features[val[0] + str(i)])
            table_list.append(features['{}{}_bool'.format(val[0], str(i))])
        input_dict[key] = tf.stack(table_list, axis=1)

    #spx must be processed differently since there are overlapping column names
    table_list = []
    for i in range(1, len(SPX)+1):
        table_list.append(features['spx_{}'.format(str(i))])
        table_list.append(features['spx_{}_bool'.format(str(i))])
    input_dict['spx_input'] = tf.stack(table_list, axis=1)

    for screenshot in ['o2', 'co2', 'n2', 'flow', 'volume']:
        input_dict['{}_input'.format(screenshot)] = tf.cast(
            features[screenshot], tf.float32
        )

    output_dict = {}
    output_dict['trial_outcome'] = tf.cast(features['trial_outcome'], tf.int32)
    output_dict['grade'] = tf.cast(features['grade'], tf.int32)

    return input_dict, output_dict

def get_dataset(filenames, batch_size):
    dataset = (
        tf.data.TFRecordDataset(filenames, num_parallel_reads=tf.data.AUTOTUNE)
        .map(parse_tfrecord_fn, num_parallel_calls=tf.data.AUTOTUNE)
        .map(prepare_sample, num_parallel_calls=tf.data.AUTOTUNE)
        .shuffle(batch_size * 10)
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
        .repeat()
        .shuffle(buffer_size=1000, reshuffle_each_iteration=True)
    )
    return dataset

### Section 2.v: Model Training<a name="section-2v-model-training"></a>
We have our data in a usable format. We can now train a model using the data. The goal of this section was to train a model.

#### Section 2.v.a: Multi-Head CNN-LSTM Model<a name="section-2va-multi-head-cnn-lstm"></a>


The model architecture is highlighted in Figure 3. Since the breath and TBFVL table contained time series data, they were ingested using a bidirectional LSTM layer. In contrast, the SPX data did not contain time series data so it was processed with a Dense and Dropout layer. The raw MBW signals were ingested based off a [multi-head CNN-LSTM](https://www.sciencedirect.com/science/article/abs/pii/S0925231219309877) architecture. The signals are ingested using a multi-head structure that uses a one-dimensional CNN layer to process the individual time series data by extracting convoluted features. The separate CNNs are concatenated before being ingested into an bidirectonal LSTM layer. It was theorized they separate channels would be more successful in capturing the significant features of each time series and ultimately producing a better prediction.

<p align="center">
  <img src="assets/model.png">
</p>
<p align="center">
  Figure 3. Architecture of TensorFlow model.
</p>

In [36]:
def model_builder(hp):
    """Model summary

    Parameters
    ----------
    hp : keras.HyperParameters
        Hyperparameter used to train the model instance

    Returns
    -------
    keras.Model
        Compiled model incorporating hyperparameters
    """
    breath_input = keras.Input(shape=(187, 56), name="breath_input")  
    tbfvl_input = keras.Input(shape=(203, 86), name="tbfvl_input")
    spx_input = keras.Input(shape=(1, 92), name="spx_input")
    o2_input = keras.Input(shape=(5748, 1), name="o2_input")
    co2_input = keras.Input(shape=(5748, 1), name="co2_input")
    n2_input = keras.Input(shape=(5748, 1), name="n2_input")
    flow_input = keras.Input(shape=(5748, 1), name="flow_input")
    volume_input = keras.Input(shape=(5748, 1), name="volume_input")

    screenshot_list = []
    hp_kernal_pool = hp.Int('kernal_pool', min_value=3, max_value=33, step=5)
 
    stride_pool = 1
    stride_conv1d = 2

    for ss_input in [o2_input, co2_input, n2_input, flow_input, volume_input]:
        conv_layer_1 = keras.layers.Conv1D(
            filters=512, kernel_size=hp_kernal_pool, activation='relu', 
            input_shape=(5748, 1), padding='valid', strides=stride_conv1d
        )(ss_input)
        pooling_layer_1 = keras.layers.MaxPooling1D(
            pool_size=hp_kernal_pool, padding='same', strides=stride_pool
        )(conv_layer_1)

        conv_layer_2 = keras.layers.Conv1D(
            filters=128, kernel_size=hp_kernal_pool, activation='relu', 
            input_shape=(2873, 512), padding='valid', strides=stride_conv1d
        )(pooling_layer_1)
        pooling_layer_2 = keras.layers.MaxPooling1D(
            pool_size=hp_kernal_pool, padding='same', strides=stride_pool
        )(conv_layer_2)

        conv_layer_3 = keras.layers.Conv1D(
            filters=64, kernel_size=hp_kernal_pool, activation='relu', 
            input_shape=(1436, 128), padding='valid', strides=stride_conv1d
        )(pooling_layer_2)
        pooling_layer_3 = keras.layers.MaxPooling1D(
            pool_size=hp_kernal_pool, padding='same', strides=stride_pool
        )(conv_layer_3)

        conv_layer_4 = keras.layers.Conv1D(
            filters=32, kernel_size=hp_kernal_pool, activation='relu', 
            input_shape=(717, 64), padding='valid', strides=stride_conv1d
        )(pooling_layer_3)
        pooling_layer_4 = keras.layers.MaxPooling1D(
            pool_size=hp_kernal_pool, padding='same', strides=stride_pool
        )(conv_layer_4)

        screenshot_list.append(pooling_layer_4)

    # combine all screen shots
    screenshot = keras.layers.concatenate(screenshot_list)

    # mask layers
    breath_mask = keras.layers.Masking()(breath_input)
    tbfvl_mask = keras.layers.Masking()(tbfvl_input)
    screenshot_mask = keras.layers.Masking()(screenshot)

    # LSTM for time series
    hp_units_1 = hp.Int('units_1', min_value=64, max_value=1024, step=128)
    breath_features = keras.layers.Bidirectional(
        keras.layers.LSTM(units=hp_units_1, input_shape=[187, 56])
    )(breath_mask)
    tbfvl_features = keras.layers.Bidirectional(
        keras.layers.LSTM(units=hp_units_1, input_shape=[203, 86])
    )(tbfvl_mask)
    screenshot_features = keras.layers.Bidirectional(
        keras.layers.LSTM(units=hp_units_1, )
    )(screenshot_mask)
    
    # dense layer to non-time series data
    hp_units_2 = hp.Int('units_2', min_value=64, max_value=1024, step=128)
    spx_features = keras.layers.Dense(
        units=hp_units_2, activation='relu'
    )(spx_input)

    # individual drop out layers
    hp_rate_1 = hp.Float('rate_1', min_value=0, max_value=0.8, step=0.2)
    breath_features = keras.layers.Dropout(rate=hp_rate_1)(breath_features)
    hp_rate_2 = hp.Float('rate_2', min_value=0, max_value=0.4, step=0.2)
    tbfvl_features = keras.layers.Dropout(rate=hp_rate_2)(tbfvl_features)
    hp_rate_3 = hp.Float('rate_3', min_value=0, max_value=0.8, step=0.2)
    screenshot_features = keras.layers.Dropout(rate=hp_rate_3)(screenshot_features)
    hp_rate_4 = hp.Float('rate_4', min_value=0, max_value=0.4, step=0.2)
    spx_features = keras.layers.Dropout(rate=hp_rate_4)(spx_features)
    
    spx_features = keras.layers.Flatten()(spx_features)

    total_features = keras.layers.concatenate(
        [breath_features, tbfvl_features, screenshot_features, spx_features]
    )

    total_features = keras.layers.Dense(
        units=1024, activation='relu', kernel_initializer='he_normal'
    )(total_features)


    trial_outcome = keras.layers.Dense(
        1, activation='sigmoid', name='trial_outcome'
    )(total_features)
    grade = keras.layers.Dense(
        6, activation='sigmoid', name='grade'
    )(total_features)


    model = keras.Model(
        inputs=[
            breath_input, tbfvl_input, spx_input,
            o2_input, co2_input, n2_input, flow_input, volume_input
        ],
        outputs=[trial_outcome, grade],
    )
    model.compile(
        loss=[
            keras.losses.BinaryCrossentropy(), 
            keras.losses.CategoricalCrossentropy()
        ],
        optimizer='adam',
        metrics=['acc']
    )

    return model

#### Section 2.v.b: Hyperparameter Tuning<a name="section-2vb-hyperparameter-tuning"></a>
Once we have a model, we can use the Keras tuner to determine the optimal hyperparameters.

In [37]:
train_filenames = tf.io.gfile.glob('{}/train/*.tfrec'.format(PROCESSED_PATH))
validation_filenames = tf.io.gfile.glob(
    '{}/validate/*.tfrec'.format(PROCESSED_PATH)
)
batch_size = 32
steps_per_epoch = 300

tuner = kt.BayesianOptimization(
    model_builder,
    objective=kt.Objective('val_trial_outcome_acc', direction='max'),
    max_trials=7,
    executions_per_trial=1,
    directory='../models',
    project_name='baseline_BO/tuner',
    overwrite=False
)

tuner.search(
    get_dataset(train_filenames, batch_size),
    epochs=3,
    steps_per_epoch=steps_per_epoch,
    validation_data=get_dataset(validation_filenames, batch_size), 
    validation_steps=51
)

INFO:tensorflow:Reloading Oracle from existing project ../models\baseline_BO/tuner\oracle.json
INFO:tensorflow:Reloading Tuner from ../models\baseline_BO/tuner\tuner0.json
INFO:tensorflow:Oracle triggered exit


We can print out the best hyperparameters according to the Keras tuner.

In [38]:
for tuned_value in [
    'kernal_pool', 'units_1', 'units_2', 'rate_1', 'rate_2', 'rate_3', 'rate_4'
]:
    print(
        '{}: {}'.format(
            tuned_value, 
            str(tuner.get_best_hyperparameters(num_trials=1)[0][tuned_value])
        )
    )

kernal_pool: 3
units_1: 960
units_2: 960
rate_1: 0.0
rate_2: 0.0
rate_3: 0.8
rate_4: 0.0


#### Section 2.v.c: Retrain Model<a name="section-2vc-retrain-model"></a>
It is best practice to retrain the model with the best hyperparmeters to get a final model.

In [39]:
MODELS_PATH = "../models/baseline_BO/checkpoints"

In [40]:
early_stop = keras.callbacks.EarlyStopping(
    monitor='val_grade_loss', patience=3
)
# create model checkpoints incase model stopped unexpectedly
model_checkpoint_callback = keras.callbacks.ModelCheckpoint(
    filepath=MODELS_PATH,
    save_weights_only=False,
    monitor='val_trial_outcome_acc',
    mode='max',
    save_best_only=False
)

In [41]:
# Get the optimal hyperparameters
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

# Retrain the model
model = tuner.hypermodel.build(best_hps)
history = model.fit(
    x=get_dataset(train_filenames, batch_size), 
    epochs=25, 
    validation_data=get_dataset(validation_filenames, batch_size),
    steps_per_epoch=steps_per_epoch,
    callbacks=[early_stop, model_checkpoint_callback], 
    validation_steps=51
)

Epoch 1/25



INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


Epoch 2/25



INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


Epoch 3/25



INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


Epoch 4/25



INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


Epoch 5/25



INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


Epoch 6/25



INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


Epoch 7/25



INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


Epoch 8/25



INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


Epoch 9/25



INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


Epoch 10/25



INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets


INFO:tensorflow:Assets written to: ../models/baseline_BO\checkpoints\assets




The model was scheduled to run for 25 epochs but ran for only 10. This is because an "early stopping" callback was defined that monitored the validation grade loss. The "patience" was set to 3 meaning the lowerest validation grade loss was achived in the 7th epoch (0.9188) but was allowed to train for 3 additional epochs to confirm.

## Section 3: Results<a name="section-3-results"></a>
Once we have a final model, we can load the test files to get our model accuracy.

In [42]:
test_filenames = tf.io.gfile.glob(f"{PROCESSED_PATH}/test/*.tfrec")

In [43]:
results = model.evaluate(
    x=get_dataset(test_filenames, batch_size),
    steps=57,
    return_dict=True
)



In [44]:
for key, value in results.items():
    print(key, ': ', value)

loss :  1.4068259000778198
trial_outcome_loss :  0.39172881841659546
grade_loss :  1.0150971412658691
trial_outcome_acc :  0.8422222137451172
grade_acc :  0.6538888812065125


In [45]:
model.save('../models/baseline_BO/final_model')



INFO:tensorflow:Assets written to: ../models/baseline_BO/final_model\assets


INFO:tensorflow:Assets written to: ../models/baseline_BO/final_model\assets


## Section 4: Discussion<a name="section-4-discussion"></a>


Although the 84.2% accuracy of the model is respectable, there is still room for improvement. Techniques such as data augmentation or feature engineering may have resulted in a higher accruacy. Additionally, modifying the model's training parameters such as training for longer, separately tuning the hyperparameters of each layer, or using a wegithed Kappy loss for the ordinal grade labels could have had a significant impact. Ultimately, it was decided to cease development of this model as it was believed that a logistic regression would be a more appropriate methodology for this use case.