# Introduction

The data were collected from ~70 insomnia patients who attended the Woolcock sleep clinic and had their sleep monitored with specialised equipment. They consist of the following 4 time-series streams:
* **Sleep staging**: a classification of each 30s window (epoch) of the PSG data. This is the ground truth
* **Heart rate**: beat-by-beat intervals, recorded in milliseconds
* **Skin temperature**: temperature at different sites on the body, measured every 15s
* **Actigraphy**: the amount of activity and the amount of light, measured every 30s

**Goal: Extract, clean and align all the data to the timestamp of sleep staging**

### Some challenges of time-series data
* Measurements taken at different intervals, giving series of uneven length
* Measurements starting at different times
* Missing data within a series

### Data processing procedure
1. Extract the data from different sources and files
2. Clean the data and convert it to the proper data types
3. Resample heart rate and skin temperature to 30s intervals
4. Truncate and align all the data to the timestamp of sleep staging data

Pandas provides a comprehensive set of tools for working with time-series data (http://pandas.pydata.org/pandas-docs/stable/timeseries.html).
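As a rough illustration of steps 3 and 4, the sketch below resamples a made-up 15s temperature stream to 30s bins and joins it onto a made-up staging index. All values, timestamps and variable names are hypothetical, and it uses pandas' default left-closed bins rather than the right-closed bins used later in this notebook.

```python
import pandas as pd

# Hypothetical staging epochs every 30 s (labels are invented for illustration)
staging = pd.DataFrame(
    {"Staging": ["SLEEP-S0", "SLEEP-S0", "SLEEP-S1"]},
    index=pd.date_range("2015-10-01 00:57:30", periods=3,
                        freq=pd.Timedelta(seconds=30)),
)

# Hypothetical 15 s temperature stream: starts earlier and has a gap
temp = pd.Series(
    [33.0, 33.2, 33.4, 33.6, 34.0],
    index=pd.to_datetime([
        "2015-10-01 00:57:00", "2015-10-01 00:57:15", "2015-10-01 00:57:30",
        "2015-10-01 00:57:45", "2015-10-01 00:58:30",
    ]),
    name="Temperature",
)

# Average into 30 s bins; bins with no samples become NaN
temp_30s = temp.resample(pd.Timedelta(seconds=30)).mean()

# Left-join onto the staging timestamps, which truncates to the staging window
aligned = staging.join(temp_30s)
print(aligned)
```

The join keeps exactly the staging timestamps, so samples before the first epoch are dropped and the gap shows up as a missing value.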

In [1]:
import pandas as pd
from pandas.tseries import offsets
import numpy as np
import os, re, glob, zipfile, tempfile
import xlrd
import datetime, time
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
%matplotlib inline

# sns.set(style='whitegrid', context='notebook')
matplotlib.style.use('ggplot')
pd.set_option('display.max_rows', 15)

## Process sleep staging file
The sleep staging file provides the ground-truth labels for our sleep-wake classification. Wake (SLEEP-S0) is to be encoded as 0 and all other sleep stages as 1; in the function below that mapping is commented out, so the raw stage labels are kept for now.
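A minimal sketch of the wake/sleep encoding, based on the mapping that is commented out in the function below. Only `SLEEP-S0` appears in the sample output of this notebook; the other stage labels here are assumed for illustration.

```python
import pandas as pd

# "SLEEP-S1" and "SLEEP-REM" are hypothetical labels used only for this example
stages = pd.Series(["SLEEP-S0", "SLEEP-S1", "SLEEP-REM", "SLEEP-S0"])

# 0 = wake (label contains "S0"), 1 = any other stage
is_sleep = stages.map(lambda s: 0 if "S0" in s else 1)
print(is_sleep.tolist())
```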

In [2]:
def process_one_staging_file(staging_file):
    """
    Extract epoch data from one sleep staging file, store into dataframe, convert to proper datatypes 
    and remove NaN.
    
    Args:
        staging_file: path and file name
    Return:
        pandas dataframe containing the epoch data
    """
    print("Processing {0} ...".format(staging_file))
    m = re.search(r"INS\s(\d+)", staging_file)
    # Leave patient_id as None when the file name does not match, rather than undefined
    patient_id = "INS_WI_" + m.group(1) if m else None
    
    staging_df = pd.read_csv(staging_file, sep="\t", header=None, skiprows=14)
    staging_df = staging_df.drop([0, 3], axis=1)
    staging_df.columns = ['Datetime', 'Staging']
#     staging_df['Staging'] = staging_df['Staging'].map(lambda s: 0 if 'S0' in s else 1)
    # Convert the datetime as index
    staging_df.index = pd.to_datetime(staging_df['Datetime'])
    staging_df.index.name = None
    staging_df.insert(0, 'Patient_id', patient_id)

    staging_df = staging_df.drop('Datetime', axis=1)
    staging_df = staging_df.asfreq('30S')
    
    return staging_df

In [3]:
staging_df = process_one_staging_file("./Input/Sleep stages/INS 011-Events.txt")

Processing ./Input/Sleep stages/INS 011-Events.txt ...


In [4]:
staging_df.head(3)
# staging_df.columns
# staging_df.info()
# staging_df.index.name = None
# staging_df.index

| | Patient_id | Staging |
|---|---|---|
| 2015-10-01 00:57:16 | INS_WI_011 | SLEEP-S0 |
| 2015-10-01 00:57:46 | INS_WI_011 | SLEEP-S0 |
| 2015-10-01 00:58:16 | INS_WI_011 | SLEEP-S0 |


## Process heart rate file
Heart rate is recorded as beat-to-beat intervals, in one txt file per subject per night.

Each file contains only **ONE recording date**, so timestamps that wrap past midnight must be moved to the next day.
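The midnight rollover can be sketched as follows. This is a simplified stand-in for the logic in the function below, not a replacement for it: the helper name `attach_dates` and the time format `%I:%M:%S %p` are assumptions, and like the original it only distinguishes AM from PM.

```python
import pandas as pd

def attach_dates(times, recording_date):
    """Prefix 12-hour clock strings with a date, moving AM times to the next
    day when the recording started in the evening (first record is PM)."""
    first_day = pd.to_datetime(recording_date, dayfirst=True)
    started_am = "AM" in times[0]
    # If the recording already starts after midnight, no rollover is needed
    next_day = first_day if started_am else first_day + pd.Timedelta(days=1)
    stamped = [
        (first_day if "PM" in t else next_day).strftime("%d/%m/%Y") + " " + t
        for t in times
    ]
    return pd.to_datetime(stamped, format="%d/%m/%Y %I:%M:%S %p")

evening = attach_dates(["11:59:59 PM", "12:00:01 AM"], "30/09/2015")
morning = attach_dates(["12:53:56 AM"], "01/10/2015")
print(evening)
print(morning)
```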

In [5]:
def process_one_hr_file(hr_file):
    """
    Extract data from one heart rate file, store into dataframe, convert to proper datatypes 
    and remove NaN.
    
    Args:
        hr_file: path and file name
    Return:
        pandas dataframe containing the epoch data
    """
    print("Processing {0}...".format(hr_file))
    
    # Extract patient_id and recording date
    patient_id, re_date = None, None
    num_lines = 0
    with open(hr_file, "r") as f:
#         if sum(1 for l in f) < 6: return None
        for line in f:
            num_lines += 1
            if line.find("Patient ID") > -1:
                patient_id = line.strip()[-10:]
            elif line.find("Recording Date") > -1:
                re_date = line.strip()[-(len(line)-len("Recording Date: ")):].strip()
#             if patient_id and re_date:
#                 break

    if num_lines < 6: return None
#     if '011' in patient_id:
#         next_date = (pd.to_datetime(re_date, dayfirst=True)).strftime('%d/%m/%Y')
#     else:
#         next_date = (pd.to_datetime(re_date, dayfirst=True) + offsets.Day(1)).strftime('%d/%m/%Y')
    # read hr records into dataframe
    hr_df = pd.read_csv(hr_file, header=None, sep="\t", skiprows=5)
    
    # check if the first record is started at "PM" or "AM"
#     print(hr_df.ix[0, 0])
    # .ix is deprecated and removed in pandas 1.0; .iloc gives the same positional lookup
    if 'AM' in hr_df.iloc[0, 0]:
        next_date = (pd.to_datetime(re_date, dayfirst=True)).strftime('%d/%m/%Y')
    else:
        next_date = (pd.to_datetime(re_date, dayfirst=True) + offsets.Day(1)).strftime('%d/%m/%Y')
    
    hr_df[0] = hr_df[0].apply(lambda s: re_date + " " + s if 'PM' in s else next_date + " " + s)
    # hr_df['Patient_id'] = patient_id
    hr_df['Datetime'] = pd.to_datetime(hr_df[0], dayfirst=True) + pd.to_timedelta(hr_df[1], unit='ms')
    
    ## the following doesn't work
#     hr_df['Datetime'] = pd.to_datetime(hr_df[0] + " " + hr_df[1].to_string(), format='%d/%m/%Y %I:%M:%S %p %f')
    hr_df = hr_df.drop([0, 1], axis=1)
    hr_df[2] = hr_df[2] / 1000
    hr_df.rename(columns={2: "HR_duration"}, inplace=True)
    hr_df.index = hr_df['Datetime']
    hr_df.index.name = None
    hr_df.drop('Datetime', axis=1, inplace=True)
    
    # resampling as 30S to compute the mean and stdev
    # hr_df = hr_df.resample('30S').mean()
    
    return hr_df

In [6]:
hr_df = process_one_hr_file("./Input/HRV/011, INSEKG_RR Intervals.txt")
# hr_df.head(3)

Processing ./Input/HRV/011, INSEKG_RR Intervals.txt...




In [10]:
hr_df.head(5)

| | HR_duration |
|---|---|
| 2015-10-01 00:53:56.398 | 0.84 |
| 2015-10-01 00:53:57.237 | 0.832 |
| 2015-10-01 00:53:58.069 | 0.825 |
| 2015-10-01 00:53:58.895 | 0.787 |
| 2015-10-01 00:53:59.682 | 0.796 |


### Align HR dataframe to staging dataframe by resampling to 30s interval
We need to aggregate the beat-by-beat heart rate intervals into 30s epochs aligned with the timestamps of the sleep staging, then compute the mean and standard deviation for each epoch.
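The sketch below shows the idea on made-up data: beats are grouped into right-closed, right-labelled 30s bins anchored at the first staging timestamp. It assumes a pandas version that supports the `origin` argument of `resample` (1.1 or later), which plays the role that `base` plays in the function below; the beat values and `staging_start` are invented.

```python
import pandas as pd

# Hypothetical beat-by-beat intervals in seconds, at irregular timestamps
rr = pd.DataFrame(
    {"HR_duration": [0.8, 0.9, 1.0, 1.1, 1.2]},
    index=pd.to_datetime([
        "2015-10-01 00:57:20", "2015-10-01 00:57:40",
        "2015-10-01 00:57:50", "2015-10-01 00:58:10",
        "2015-10-01 00:58:30",
    ]),
)
staging_start = pd.Timestamp("2015-10-01 00:57:16")  # assumed first epoch

# Each bin covers (label - 30 s, label], so every epoch is labelled by its end
epochs = (
    rr["HR_duration"]
    .resample(pd.Timedelta(seconds=30), origin=staging_start,
              closed="right", label="right")
    .agg(["mean", "std"])
    .rename(columns={"mean": "HR_mean", "std": "HR_stdev"})
)
print(epochs)
```

An epoch that contains a single beat has a mean but no sample standard deviation, so its `HR_stdev` is missing.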

In [8]:
def align_hr_df(staging_df, hr_df):
    """
    Align HR dataframe to staging dataframe by resampling to 30S interval
    Args:
        staging_df:
        hr_df:
    Return:
        aligned_hr_df:
    """
    start = (staging_df.index.min() - offsets.Second(30)).strftime('%Y-%m-%d %H:%M:%S')
    end = (staging_df.index.max()).strftime('%Y-%m-%d %H:%M:%S')

    aligned_hr_df = hr_df[start:end]
    # check if HR data available to align with sleep staging data
    if len(aligned_hr_df) == 0: return None
    
    aligned_hr_df = aligned_hr_df.resample('30S', base=staging_df.index.min().second, closed='right', label='right')\
                    .agg({'HR_mean': np.mean, 'HR_stdev': np.std})
    aligned_hr_df.columns = aligned_hr_df.columns.droplevel(1)
    
    return aligned_hr_df

In [11]:
aligned_hr_df = align_hr_df(staging_df, hr_df)
aligned_hr_df.head(3)
# aligned_hr_df.info()

| | HR_mean | HR_stdev |
|---|---|---|
| 2015-10-01 00:57:46 | 0.954478 | 0.079565 |
| 2015-10-01 00:58:16 | 0.906577 | 0.059958 |
| 2015-10-01 00:58:46 | 0.91203 | 0.145576 |


In [20]:
aligned_hr_df[(aligned_hr_df['HR_mean'].isnull()==False) & (aligned_hr_df['HR_stdev'].isnull()==True)]

| | HR_mean | HR_stdev |
|---|---|---|
| 2015-10-01 03:47:16 | 0.547 | NaN |
| 2015-10-01 06:16:16 | 1.111 | NaN |


## Process actigraphy file
The important part of the actigraphy file is the **"Epoch-by-Epoch"** section, which holds the activity and white-light measurements for every 30s. The line at which this section starts varies from file to file.
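One way to cope with the moving section start is to scan for the section marker, then for the column header, and hand that line number to `pd.read_csv`. The sketch below runs on an invented in-memory file; the exact layout of the real export, the helper name `find_epoch_header` and the sample rows are assumptions.

```python
import io
import pandas as pd

# Hypothetical miniature of an actigraphy export
raw = '''"Some header",,
"---- Epoch-by-Epoch Data ----"
"Line","Epoch","Day","Seconds","Date","Time","Activity","White Light"
1,1,1,0,"20/09/2015","11:59:30 AM",13,90.51
2,2,1,30,"20/09/2015","12:00:00 PM",7,148.96
'''

def find_epoch_header(lines):
    """Return the 0-based index of the column header that follows the
    Epoch-by-Epoch marker, or None if it is not found."""
    in_section = False
    for i, line in enumerate(lines):
        if "Epoch-by-Epoch" in line:
            in_section = True
        elif in_section and '"Epoch","Day","Seconds"' in line:
            return i
    return None

header_row = find_epoch_header(raw.splitlines())
# Skipping everything before the header lets pandas read the column names itself
act = pd.read_csv(io.StringIO(raw), skiprows=header_row)
print(act[["Date", "Time", "Activity", "White Light"]])
```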

In [12]:
def process_one_act_file(act_file):
    """
    Extract epoch data from one actigraphy csv file, store into dataframe, convert to proper datatypes 
    and remove NaN.
    
    Args:
        act_file: path and file name
    Return:
        pandas dataframe containing the epoch data
    """
    print("Processing {0}...".format(act_file))
    m = re.search(r".*/(INS_WI_\d+).*\.csv", act_file)
    if m:
        patient_id = m.group(1)
    else:
        return None
    rows = []
    with open(act_file, "r") as f:
        is_epoch = False
        nu_epoch_start = None
        i = 0
        for line in f:
            i += 1
            if line.find("Epoch-by-Epoch") > -1:
                is_epoch = True
            # elif re.search(r"\"Line\",\"Epoch\",\"Day\",\"Seconds\",\"Date\",\"Time\"", line) and is_epoch:
            elif is_epoch and line.find(',"Epoch","Day","Seconds"') > -1:# and is_epoch:
                #print(line)
                epoch_columns = line
                nu_epoch_start = i + 1
                #print(epoch_columns, nu_epoch_start)                
                break
#             elif is_epoch and len(line) > 1 and nu_epoch_start:
#                 fields = line.replace('"', '').strip().split(sep=",")[:-2]
#                 fields = tuple([fields[4], fields[5], fields[6], fields[8]])
#                 rows.append(fields)
                # print(rows)
                # break
    #act_df = pd.DataFrame.from_records(rows)
    act_df = pd.read_csv(act_file, header=None, skiprows=nu_epoch_start)
    act_df.drop(act_df.shape[1]-2, axis=1, inplace=True)
    
    act_df.columns = epoch_columns.replace('"', '').strip().split(sep=",")[:-1]
    act_df = act_df[['Date', 'Time', 'Activity', 'White Light']]
    for col in act_df.columns:
        if sum(act_df[col].isin(['NaN'])) > 1:
            act_df[col] = act_df[col].replace('NaN', np.nan)

    # drop any rows with NA
    act_df = act_df.dropna().reset_index(drop=True)
           
    # combine Date and Time columns
    act_df["Datetime"] = act_df["Date"] +" " + act_df["Time"]
    act_df["Datetime"] = pd.to_datetime(act_df["Datetime"], dayfirst=True)
    act_df = act_df.drop(["Date", "Time"], axis=1)
    
    
    # insert patient id 
    # act_df['Patient_id'] = patient_id
    # act_df = act_df[['Datetime', 'Activity', 'White Light']]
    act_df.index = act_df['Datetime']
    act_df.index.name = None
    act_df.drop('Datetime', axis=1, inplace=True)

    act_df[["Activity", "White Light"]] = \
        act_df[["Activity", "White Light"]].apply(pd.to_numeric)
    
    return act_df
    

In [13]:
def process_multi_act_files(act_path):
    """
    Process multiple actigraphy files for different patients and extract each epoch data for each patient

    Args:
        act_path: path prefix of the folder containing the actigraphy csv files
    Return:
        pandas dataframe containing the epoch data
    """
    file_format = "*.csv"
    act_file_list = glob.glob(act_path + file_format)
    act_df = pd.DataFrame()
    for act_file in act_file_list:
        one_act_df = process_one_act_file(act_file)
        if not isinstance(one_act_df, type(None)):
            act_df = act_df.append(one_act_df, ignore_index=True)
        
    # drop duplicated entries
    act_df.drop_duplicates(inplace=True)
    return act_df
    

In [14]:
act_df = process_one_act_file("./Input/Actigraphy/INS_WI_011_20_09_2015_9_00_00_AM_BL_AEM.csv")

Processing ./Input/Actigraphy/INS_WI_011_20_09_2015_9_00_00_AM_BL_AEM.csv...


In [15]:
act_df.head(3)
# act_df.info()
# (act_df.index[1] - act_df.index[0]).seconds > 30

| | Activity | White Light |
|---|---|---|
| 2015-09-20 11:59:30 | 13.0 | 90.51 |
| 2015-09-20 12:00:00 | 7.0 | 148.96 |
| 2015-09-20 12:00:30 | 39.0 | 114.86 |


### Align ACT dataframe to staging dataframe by resampling to 30s interval

In [16]:
def align_act_df(staging_df, act_df):
    """
    Align ACT dataframe to staging dataframe by resampling to 30S interval
    Args:
        staging_df:
        act_df:
    Return:
        aligned_act_df:
    """
    if (act_df.index[1] - act_df.index[0]).seconds > 30:
        start = (staging_df.index.min() - offsets.Second(60)).strftime('%Y-%m-%d %H:%M:%S')
        act_df = act_df / 2
    else:
        start = (staging_df.index.min() - offsets.Second(30)).strftime('%Y-%m-%d %H:%M:%S')
    end = (staging_df.index.max()).strftime('%Y-%m-%d %H:%M:%S')
    
    aligned_act_df = act_df[start:end]
    # check if ACT data available to align with sleep staging data   
    if len(aligned_act_df) == 0: return None
    
    if (aligned_act_df.index[1] - aligned_act_df.index[0]).seconds > 30:
        aligned_act_df = aligned_act_df.resample('30S', base=staging_df.index.min().second, closed='right', label='right').ffill()
    else:
        aligned_act_df = aligned_act_df.resample('30S', base=staging_df.index.min().second, closed='right', label='right').mean()
#     aligned_hr_df.columns = aligned_hr_df.columns.droplevel(1)
    
    return aligned_act_df

In [17]:
aligned_act_df = align_act_df(staging_df, act_df)

In [18]:
aligned_act_df.head(5)
# aligned_act_df.info()

| | Activity | White Light |
|---|---|---|
| 2015-10-01 00:57:16 | 98.0 | 0.68 |
| 2015-10-01 00:57:46 | 0.0 | 0.13 |
| 2015-10-01 00:58:16 | 0.0 | 0.13 |
| 2015-10-01 00:58:46 | 104.0 | 0.08 |
| 2015-10-01 00:59:16 | 64.0 | 0.21 |


## Process skin temperature file
Skin temperature is stored in one zip file per subject per night, each containing about 10 Excel files, one for each body site.

**We have to handle many human errors in the file names and contents.**

Glob the `"Input/Skin Temperature"` folder to get a list of file names.
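Because the file names are inconsistent in case, prefixes and stray spaces, the site is matched with a case-insensitive pattern. The sketch below shows that matching in isolation; the helper name `match_site` is hypothetical, and the sample names follow the style of the file names printed later in this notebook.

```python
import re

def match_site(file_name, sites):
    """Return the first site whose name appears before an underscore in an
    .xlsx file name, ignoring case; None if nothing matches."""
    for site in sites:
        pattern = re.escape(site) + r"_.*\.xlsx$"
        if re.search(pattern, file_name, flags=re.IGNORECASE):
            return site
    return None

sites = ["Forehead", "Fingertip", "Abdomen", "Upperarm"]
names = [
    "1437 FOREHEAD_INS_WI_011_N1.xlsx",   # numeric prefix, upper case
    "ABDOMEN_INS_WI_011_N1 .xlsx",        # stray space before the extension
    "1242_UpperArm_INS_WI_001_N2.xlsx",   # mixed case, underscore prefix
    "notes.txt",                          # not a measurement file
]
matched = [match_site(n, sites) for n in names]
print(matched)
```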

In [19]:
def process_one_st_file(st_file, measure_types=['Forehead', 'Fingertip', 'Abdomen']):
    """
    Extract epoch data from one skin temperature zip file for one patient, store into dataframe, convert to 
    proper datatypes and remove NaN.
    
    Args:
        st_file: a zip file containing skin temperature measurement for one patient
        measure_types: the body sites at which temperature was measured
    Return:
        pandas dataframe containing the epoch data
    """
    # base_dir = "./Input/Skin Temperature"
    base_dir = tempfile.TemporaryDirectory()
    one_st_df = pd.DataFrame()
    
    with zipfile.ZipFile(st_file) as st_zip:
        for name in st_zip.namelist():
            one_measure_df = pd.DataFrame()
            for measure in measure_types:
                records = []
                patient_id, mea_type, night = "", "", ""
                file_pattern = "(?i).*(" + measure  + ")_.*xlsx"
                if re.search(file_pattern, name):
                    # print(name)
                    file_name = st_zip.extract(name, path=base_dir.name)
                    print("processing {0} ...".format(file_name))
                    wb = xlrd.open_workbook(file_name)
                    sheet = wb.sheet_by_index(0)
                    # print(sheet.nrows)
                    row_ix = 0
                    col_time =0; col_temp = 1; is_found = False
                    while row_ix < sheet.nrows:                    #for row_ix in range(sheet.nrows):
                        # print(sheet.cell(row_ix, 0))
#                         if sheet.cell(row_ix, 0).ctype == xlrd.XL_CELL_TEXT and \
#                            "Description" in sheet.cell(row_ix, 0).value:
#                             desc = sheet.cell(row_ix, 1).value
#                             m = re.search(r"([a-zA-Z]+)_(.*)_(N\d)", desc)
#                             if m:
#                                 mea_type = m.group(1)
#                                 patient_id = m.group(2)
#                                 night = m.group(3)
#                                 print(mea_type, patient_id, night)
                            # else: print("can't get description")
                        while sheet.cell(row_ix, col_time).ctype == xlrd.XL_CELL_DATE and not is_found:
                            #print(sheet.cell(row_ix, col_temp).)
                            if (sheet.cell(row_ix, col_temp).ctype == xlrd.XL_CELL_TEXT and \
                                "C" in sheet.cell(row_ix, col_temp).value) or \
                               (sheet.cell(row_ix, col_temp).ctype == xlrd.XL_CELL_NUMBER):
                                is_found = True
                            else:
                                row_ix += 1; col_time += 1; col_temp += 1
                        if is_found and not sheet.cell(row_ix, col_time).ctype == xlrd.XL_CELL_EMPTY:
                            #print(sheet.cell(row_ix, col_time))
                            datetime_value = xlrd.xldate_as_tuple(sheet.cell(row_ix, col_time).value, wb.datemode)
                            temperature = sheet.cell(row_ix, col_temp).value
                            if sheet.cell(row_ix, col_temp).ctype == xlrd.XL_CELL_TEXT and "C" in temperature:
                                # temperature = temperature[:-2]
                                temperature = float(temperature[:-2])
                            records.append(tuple((datetime.datetime(*datetime_value), temperature)))
                        
                        row_ix += 1
#                             break
                if len(records) > 0:
                    one_measure_df = pd.DataFrame.from_records(records)
                    one_measure_df.columns = ["Datetime", "Temperature"]
                    if len(patient_id) < 2:
                        # print("can't get description")
                        m = re.search(r".*/(.*)_(N\d).*", st_file)
                        if m:
                            mea_type = measure
                            patient_id = m.group(1)
                            night = m.group(2)                            
                    # one_measure_df["Patient_id"] = patient_id
                    one_measure_df["Measure_type"] = mea_type
                    # one_measure_df["Night"] = night
                    one_st_df = one_st_df.append(one_measure_df, ignore_index=True)
        
    base_dir.cleanup()
    return one_st_df

In [20]:
def process_multi_st_files(st_file_list, 
            measure_types=['Forehead', 'Fingertip', 'Chest', 'Forearm', 'Upperarm', 'Hand', 'Scapula', 'Abdomen', 
                           'Upperleg', 'Calf', 'Foot', 'Toe']):
    """
    Process multiple Skin Temperature files for different patients and extract each epoch data for each patient

    Args:
        st_file_list: a list of zip files containing skin temperature measurement for different patients
        measure_types: the body sites at which temperature was measured
    Return:
        pandas dataframe containing the epoch data
    """
#     file_pattern = "*" + night_type + ".zip"
#     st_file_list = glob.glob(path + file_pattern)
    st_df = pd.DataFrame()
    for st_file in st_file_list:
        one_st_df = process_one_st_file(st_file, measure_types)
        st_df = st_df.append(one_st_df, ignore_index=True)
        
    # grouped by Datetime and measure_type
    if len(st_df) > 0:
        st_df = st_df.groupby(['Datetime', 'Measure_type'], as_index=False).mean()
        st_df = st_df.pivot(index='Datetime', columns='Measure_type', values='Temperature')
        st_df.index.name = None

    # drop duplicated entries
    # st_df.drop_duplicates(inplace=True)
    return st_df
    

In [21]:
st_file_list = glob.glob("./Input/Skin Temperature/*INS_WI_011*.zip")
len(st_file_list)

1

In [22]:
st_df = process_multi_st_files(st_file_list)

processing /tmp/tmpdbdp2hxb/INS_WI_011_N1/1437 FOREHEAD_INS_WI_011_N1.xlsx ...
processing /tmp/tmpdbdp2hxb/INS_WI_011_N1/1438 CHEST_INS_WI_011_N1.xlsx ...
processing /tmp/tmpdbdp2hxb/INS_WI_011_N1/1439 UPPERARM_INS_WI_011_N1.xlsx ...
processing /tmp/tmpdbdp2hxb/INS_WI_011_N1/ABDOMEN_INS_WI_011_N1 .xlsx ...
processing /tmp/tmpdbdp2hxb/INS_WI_011_N1/CALF_INS_WI_011_N1 .xlsx ...
processing /tmp/tmpdbdp2hxb/INS_WI_011_N1/FINGERTIP_INS_WI_011_N1 .xlsx ...
processing /tmp/tmpdbdp2hxb/INS_WI_011_N1/FOOT_INS_WI_011_N1 .xlsx ...
processing /tmp/tmpdbdp2hxb/INS_WI_011_N1/FOREARM_INS_WI_011_N1 .xlsx ...
processing /tmp/tmpdbdp2hxb/INS_WI_011_N1/HAND_INS_WI_011_N1 .xlsx ...
processing /tmp/tmpdbdp2hxb/INS_WI_011_N1/SCAPULA_INS_WI_011_N1 .xlsx ...
processing /tmp/tmpdbdp2hxb/INS_WI_011_N1/TOE_INS_WI_011_N1 .xlsx ...
processing /tmp/tmpdbdp2hxb/INS_WI_011_N1/UPPERLEG_INS_WI_011_N1 .xlsx ...


### Align ST dataframe to staging dataframe by resampling to 30s interval

In [23]:
def align_st_df(staging_df, st_df):
    """
    Align ST dataframe to staging dataframe by resampling to 30S interval
    Args:
        staging_df:
        st_df:
    Return:
        aligned_st_df:
    """
    if (st_df.index[1] - st_df.index[0]).seconds > 30:
        start = (staging_df.index.min() - offsets.Second(60)).strftime('%Y-%m-%d %H:%M:%S')
        # st_df = st_df / 2
    else:
        start = (staging_df.index.min() - offsets.Second(30)).strftime('%Y-%m-%d %H:%M:%S')
    end = (staging_df.index.max()).strftime('%Y-%m-%d %H:%M:%S')
    
    aligned_st_df = st_df[start:end]
    # check if ST data available to align with sleep staging data   
    if len(aligned_st_df) == 0: return None
    
    if (aligned_st_df.index[1] - aligned_st_df.index[0]).seconds > 30:
        aligned_st_df = aligned_st_df.resample('30S', base=staging_df.index.min().second, closed='right', label='right').ffill()
    else:
        aligned_st_df = aligned_st_df.resample('30S', base=staging_df.index.min().second, closed='right', label='right').mean()
#     aligned_hr_df.columns = aligned_hr_df.columns.droplevel(1)
    
    return aligned_st_df

In [24]:
aligned_st_df = align_st_df(staging_df, st_df)

## Process all the required files to extract data for a given subject
We define a function that extracts all the measurements for a given subject.

Each measurement must still be available after being aligned with the sleep staging. If it is **NOT**, the subject is skipped and the function returns **`None`**.

In [25]:
def extract_data_per_subject(subject="INS_WI_008", hr=True, st=True, act=True,
                st_types=['Forehead', 'Fingertip', 'Chest', 'Forearm', 'Upperarm', 'Hand', 'Scapula', 'Abdomen', 
                           'Upperleg', 'Calf', 'Foot', 'Toe']):
    """
    Extract actigraphy, skin temperature, heart rate and PSG scoring for a given subject
    
    Args:
        subject: given one subject ID
        hr, st, act: indicator whether the measurement is processed
        st_types: list of sites for skin temperature
    Return:
        pandas dataframe containing the required data
    """
    staging_path = "./Input/Sleep stages/" + "INS " + subject[-3:] + "*.txt"
    act_path = "./Input/Actigraphy/" + subject + "*.csv"
    st_path = "./Input/Skin Temperature/" + subject + "*.zip"
    hr_path = "./Input/HRV/" + subject[-3:] + "*.txt"
    staging_list = glob.glob(staging_path)
    if len(staging_list) == 0: return None       
    if act:
        act_list = glob.glob(act_path)
        if len(act_list) == 0: return None       
    if st:
        st_list = glob.glob(st_path)
        if len(st_list) == 0: return None
    if hr:
        hr_list =  glob.glob(hr_path)
        if len(hr_list) == 0: return None
    
    if len(staging_list) == 1: staging_df = process_one_staging_file(staging_list[0])
    if act and len(act_list) == 1:
        act_df = process_one_act_file(act_list[0])
        if isinstance(act_df, type(None)): return None
        aligned_act_df = align_act_df(staging_df, act_df)
        if isinstance(aligned_act_df, type(None)): return None        
    if hr and len(hr_list) == 1:
        hr_df = process_one_hr_file(hr_list[0])
        if isinstance(hr_df, type(None)): return None
        aligned_hr_df = align_hr_df(staging_df, hr_df)
        if isinstance(aligned_hr_df, type(None)): return None
    if st and len(st_list) > 0: 
        st_df = process_multi_st_files(st_list, measure_types=st_types)
        if isinstance(st_df, type(None)): return None
        aligned_st_df = align_st_df(staging_df, st_df)
        if isinstance(aligned_st_df, type(None)): return None        
            
    
    # Join whichever aligned measurements were requested onto the staging epochs
    combined_df = staging_df
    if act: combined_df = combined_df.join(aligned_act_df)
    if hr: combined_df = combined_df.join(aligned_hr_df)
    if st: combined_df = combined_df.join(aligned_st_df)
#     return staging_df, aligned_act_df, aligned_st_df, aligned_hr_df
    return combined_df

In [111]:
combined_df = extract_data_per_subject('INS_WI_039', act=False)
# staging_df, aligned_act_df, aligned_st_df, aligned_hr_df = extract_data_per_subject()

### Data extraction and alignment to sleep staging data for multiple subjects

In [26]:
def extract_data_multi_subjects(subjects=['INS_WI_008', 'INS_WI_012'], hr=True, st=True, act=True,
                    st_types=['Forehead', 'Fingertip', 'Chest', 'Forearm', 'Upperarm', 'Hand', 'Scapula', 'Abdomen', 
                           'Upperleg', 'Calf', 'Foot', 'Toe']):
    """
    Extract actigraphy, skin temperature, heart rate and PSG scoring for a given subject list
    
    Args:
        subjects: a list containing the patient_id
    Return:
        pandas dataframe containing the required data
    """
    subjects_df = pd.DataFrame()
    for subject in subjects:
        one_subject_df = extract_data_per_subject(subject, hr, st, act, st_types)
        if not isinstance(one_subject_df, type(None)):
            subjects_df = subjects_df.append(one_subject_df)#, ignore_index=True)
        
    return subjects_df

In [112]:
combined_df = extract_data_multi_subjects(subjects=['INS_WI_001'], act=False)

Processing ./Input/Sleep stages/INS 001-Events.txt ...
Processing ./Input/HRV/001, INSEKG_RR Intervals.txt...




processing /tmp/tmph1gox5s8/INS_WI_001_N2/1242 Chest_INS_WI_001_N2.xlsx ...
processing /tmp/tmph1gox5s8/INS_WI_001_N2/1242 Forehead_INS_WI_001_N2.xlsx ...
processing /tmp/tmph1gox5s8/INS_WI_001_N2/1242_UpperArm_INS_WI_001_N2.xlsx ...
processing /tmp/tmph1gox5s8/INS_WI_001_N2/1247 Scapula_INS_WI_001_N2.xlsx ...
processing /tmp/tmph1gox5s8/INS_WI_001_N2/1248 Forearm_INS_WI_001_N2.xlsx ...
processing /tmp/tmph1gox5s8/INS_WI_001_N2/1249 Hand_INS_WI_001_N2.xlsx ...
processing /tmp/tmph1gox5s8/INS_WI_001_N2/1250 Fingertip_INS_WI_001_N2.xlsx ...
processing /tmp/tmph1gox5s8/INS_WI_001_N2/1251 Abdomen_INS_WI_001_N2.xlsx ...
processing /tmp/tmph1gox5s8/INS_WI_001_N2/1252 Calf_INS_WI_001_N2.xlsx ...
processing /tmp/tmph1gox5s8/INS_WI_001_N2/1253 Foot_INS_WI_001_N2.xlsx ...
processing /tmp/tmph1gox5s8/INS_WI_001_N2/1254 Toe_INS_WI_001_N2.xlsx ...
processing /tmp/tmpc97q_64i/INS_WI_001_N1/Abdomen_INS_WI_001_N1.xlsx ...
processing /tmp/tmpc97q_64i/INS_WI_001_N1/Calf_INS_WI_001_N1.xlsx ...
processing

In [113]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 889 entries, 2015-09-02 22:49:52 to 2015-09-03 06:13:52
Data columns (total 16 columns):
Patient_id    889 non-null object
Staging       889 non-null object
HR_mean       801 non-null float64
HR_stdev      800 non-null float64
Abdomen       888 non-null float64
Calf          888 non-null float64
Chest         888 non-null float64
Fingertip     888 non-null float64
Foot          888 non-null float64
Forearm       888 non-null float64
Forehead      888 non-null float64
Hand          888 non-null float64
Scapula       888 non-null float64
Toe           888 non-null float64
Upperarm      888 non-null float64
Upperleg      888 non-null float64
dtypes: float64(14), object(2)
memory usage: 118.1+ KB


## Data extraction and processing for subjects with Actigraphy

In [48]:
# subject list with ACT available - 20 subjects in total
subject_list = ['INS_WI_008','INS_WI_012','INS_WI_013','INS_WI_017','INS_WI_019','INS_WI_020','INS_WI_021','INS_WI_022','INS_WI_025',
                'INS_WI_030','INS_WI_032','INS_WI_034','INS_WI_035', 'INS_WI_037','INS_WI_038', 'INS_WI_040', 'INS_WI_004','INS_WI_007'
                ,'INS_WI_011','INS_WI_041']
subjects_df = extract_data_multi_subjects(subject_list)

Processing ./Input/Sleep stages/INS 004-Events.txt ...
Processing ./Input/Actigraphy/INS_WI_004_26_08_2015_6_00_00_PM_BL_AEM.csv...
Processing ./Input/HRV/004, INSEKG_RR Intervals.txt...




processing /tmp/tmpjtjlryco/INS_WI_004_N2/1314 Forehead_INS_WI_004_N2.xlsx ...
processing /tmp/tmpjtjlryco/INS_WI_004_N2/1315 Chest_INS_WI_004_N2.xlsx ...
processing /tmp/tmpjtjlryco/INS_WI_004_N2/1316 UpperArm_INS_WI_004_N2.xlsx ...
processing /tmp/tmpjtjlryco/INS_WI_004_N2/1317 Scapula_INS_WI_004_N2.xlsx ...
processing /tmp/tmpjtjlryco/INS_WI_004_N2/1318 Forearm_INS_WI_004_N2.xlsx ...
processing /tmp/tmpjtjlryco/INS_WI_004_N2/1319 Hand_INS_WI_004_N2.xlsx ...
processing /tmp/tmpjtjlryco/INS_WI_004_N2/1320 Fingertip_INS_WI_004_N2.xlsx ...
processing /tmp/tmpjtjlryco/INS_WI_004_N2/1321 Abdomen_INS_WI_004_N2.xlsx ...
processing /tmp/tmpjtjlryco/INS_WI_004_N2/1322 Upperleg_INS_WI_004_N2.xlsx ...
processing /tmp/tmpjtjlryco/INS_WI_004_N2/1323 Calf_INS_WI_004_N2.xlsx ...
processing /tmp/tmpjtjlryco/INS_WI_004_N2/1325 Foot_INS_WI_004_N2.xlsx ...
processing /tmp/tmpjtjlryco/INS_WI_004_N2/1326 Toe_INS_WI_004_N2.xlsx ...
processing /tmp/tmp88np_k2h/INS_WI_004_N1/Abdomen_INS_WI_004_N1.xlsx ...
p

## Data extraction and processing for subjects without Actigraphy

In [85]:
subject_list = ['INS_WI_085','INS_WI_089']
subjects_df = extract_data_multi_subjects(subject_list, act=False)

Processing ./Input/Sleep stages/INS 085-Events.txt ...
Processing ./Input/HRV/085, INSEKG_RR Intervals.txt...




processing /tmp/tmpqrwwknex/INS_WI_085_N1/Abdomen_INS_WI_085_N1.xlsx ...
processing /tmp/tmpqrwwknex/INS_WI_085_N1/Calf_INS_WI_085_N1.xlsx ...
processing /tmp/tmpqrwwknex/INS_WI_085_N1/Chest_INS_WI_085_N1.xlsx ...
processing /tmp/tmpqrwwknex/INS_WI_085_N1/Fingertip_INS_WI_085_N1.xlsx ...
processing /tmp/tmpqrwwknex/INS_WI_085_N1/Foot_INS_WI_085_N1.xlsx ...
processing /tmp/tmpqrwwknex/INS_WI_085_N1/Forearm_INS_WI_085_N1.xlsx ...
processing /tmp/tmpqrwwknex/INS_WI_085_N1/Forehead_INS_WI_085_N1.xlsx ...
processing /tmp/tmpqrwwknex/INS_WI_085_N1/Hand_INS_WI_085_N1.xlsx ...
processing /tmp/tmpqrwwknex/INS_WI_085_N1/Scapula_INS_WI_085_N1.xlsx ...
processing /tmp/tmpqrwwknex/INS_WI_085_N1/Toe_INS_WI_085_N1.xlsx ...
processing /tmp/tmpqrwwknex/INS_WI_085_N1/Upperarm_INS_WI_085_N1.xlsx ...
processing /tmp/tmpqrwwknex/INS_WI_085_N1/Upperleg_INS_WI_085_N1.xlsx ...
Processing ./Input/Sleep stages/INS 089-Events.txt ...
Processing ./Input/HRV/089, INSEKG_RR Intervals.txt...
processing /tmp/tmphi6o5

# Feature engineering

### Compute the distal-proximal gradient (DPG) for skin temperature

In [102]:
df2['ST_Distal'] = df2[['Fingertip', 'Toe', 'Hand', 'Foot']].mean(axis=1)
df2['ST_Proximal'] = df2[['Abdomen', 'Chest', 'Upperarm', 'Upperleg']].mean(axis=1)
df2['ST_DPG'] = df2['ST_Proximal'] - df2['ST_Distal']