<h1>Explanation of code 'makeinput.py'</h1>


In [1]:
import glob
import pandas as pd
import re
from datetime import timedelta, datetime
import numpy as np
from pandas import DataFrame, Series

The data is spread over 4000 text files, one per patient RecordID, in the set-a folder of the data directory. We'll use the glob module to collect all the file paths.

In [2]:
path = 'data/set-a/*.txt'
files = glob.glob(path)
for name in files[:10]:
    print name

data/set-a/135052.txt
data/set-a/140525.txt
data/set-a/134872.txt
data/set-a/135365.txt
data/set-a/133493.txt
data/set-a/134633.txt
data/set-a/137593.txt
data/set-a/136083.txt
data/set-a/135534.txt
data/set-a/137746.txt


Each patient has at most 6 + 37 descriptors.
The 6 general descriptors are recorded for every patient at check-in (time 00:00) and don't change afterwards.

* RecordID, 
* Age,
* Gender (0-female, 1-male),
* Height,
* ICUType (1, 2, 3 or 4),
* Weight

The 37 time-series descriptors are measured at different points in time and change over the stay. The times are stored relative to 00:00. A patient need not have every descriptor measured even once, and any descriptor may be measured many times.

* ALP,
* ALT,
* AST,
* Albumin,
* BUN,
* Bilirubin,
* Cholesterol,
* Creatinine,
* DiasABP,
* FiO2,
* GCS,
* Glucose,
* HCO3,
* HCT,
* HR,
* K,
* Lactate,
* MAP,
* MechVent,
* Mg,
* NIDiasABP,
* NIMAP,
* NISysABP,
* Na,
* PaCO2,
* PaO2,
* Platelets,
* RespRate,
* SaO2,
* SysABP,
* Temp,
* TroponinI,
* TroponinT,
* Urine,
* WBC,
* Weight,
* pH

'Weight' appears in both lists: once as a general descriptor at 00:00 and again as a time-series variable. Not sure how that's meant to work out.

Each file has three columns: Time, Parameter and Value.

I'll apply each of the functions he's defined to one file to explain them.

In [3]:
df = pd.read_csv(files[0], sep = ',')
df

Unnamed: 0,Time,Parameter,Value
0,00:00,RecordID,135052.0
1,00:00,Age,84.0
2,00:00,Gender,0.0
3,00:00,Height,-1.0
4,00:00,ICUType,3.0
5,00:00,Weight,73.0
6,00:24,Lactate,3.5
7,01:26,FiO2,1.0
8,01:26,GCS,15.0
9,01:26,HR,104.0


We need to pivot this table so that each row shows everything measured at one timestamp.

In [4]:
df.drop_duplicates(subset=['Time', 'Parameter'], inplace=True)
df = df.pivot(index='Time', columns='Parameter', values='Value')
df.sort(axis=1, inplace=True)
df.reset_index(inplace=True)
df

Parameter,Time,ALP,ALT,AST,Age,Albumin,BUN,Bilirubin,Creatinine,FiO2,...,NISysABP,Na,Platelets,RecordID,RespRate,Temp,TroponinT,Urine,WBC,Weight
0,00:00,,,,84,,,,,,...,,,,135052,,,,,,73
1,00:24,,,,,,,,,,...,,,,,,,,,,
2,01:26,,,,,,,,,1.00,...,144,,,,21,36.2,,80,,
3,01:44,,,,,,,,,,...,,,,,,,,,,
4,02:26,,,,,,,,,,...,117,,,,21,,,140,,73
5,03:11,,,,,3.2,27,,0.9,,...,,142,128,,,,0.34,,38.3,
6,03:20,,,,,,,,,,...,,,,,,,,,,
7,03:26,,,,,,,,,,...,123,,,,24,,,60,,73
8,04:26,,,,,,,,,,...,142,,,,26,,,,,73
9,05:26,,,,,,,,,,...,115,,,,24,35.9,,,,73
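The cells here are Python 2 with an old pandas, where `DataFrame.sort` still existed. For anyone on a current pandas, the same pivot can be sketched on a few rows copied from the file above. The inline `raw` string and the names `long_df` and `wide` are only for illustration.

```python
import io

import pandas as pd

# A few rows in the same Time,Parameter,Value layout as the file above.
raw = """Time,Parameter,Value
00:00,RecordID,135052
00:00,Age,84
00:00,Weight,73
00:24,Lactate,3.5
01:26,HR,104
01:26,GCS,15
01:26,GCS,15
"""

long_df = pd.read_csv(io.StringIO(raw))

# pivot() raises on duplicate (Time, Parameter) pairs, so drop them first.
long_df = long_df.drop_duplicates(subset=["Time", "Parameter"])

# One row per timestamp, one column per parameter; unmeasured cells become NaN.
wide = long_df.pivot(index="Time", columns="Parameter", values="Value")

# sort_index(axis=1) is the current spelling of the old df.sort(axis=1).
wide = wide.sort_index(axis=1).reset_index()
print(wide)
```

The duplicated GCS row is there on purpose: without `drop_duplicates`, `pivot` would fail on it.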


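Since the 6 general descriptors are only recorded once, we can use pandas' *first_valid_index* to find the first non-null entry of each, read its value, and then delete the column from the dataframe. (Weight is the exception: its column is kept because it's also a time-series variable.) Note that gender is flipped with `1 - Gender`, so in his output 1 means female and 0 means male. Any remaining -1 values, which mark missing data, are replaced with NaN.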


In [5]:
recordId = int(df['RecordID'][df['RecordID'].first_valid_index()])
del df['RecordID']
age = int(df['Age'][df['Age'].first_valid_index()])
del df['Age']
gender = 1 - int(df['Gender'][df['Gender'].first_valid_index()])
gender = -1 if gender < 0 or gender > 1 else gender
del df['Gender']
height = df['Height'][df['Height'].first_valid_index()] 
del df['Height']
icuType = int(df['ICUType'][df['ICUType'].first_valid_index()])
del df['ICUType']
weight = df['Weight'][df['Weight'].first_valid_index()]

df.replace(to_replace=-1, value=np.nan, inplace = True)

print recordId, age, gender, height, icuType, weight
df

135052 84 1 -1.0 3 73.0


Parameter,Time,ALP,ALT,AST,Albumin,BUN,Bilirubin,Creatinine,FiO2,GCS,...,NIMAP,NISysABP,Na,Platelets,RespRate,Temp,TroponinT,Urine,WBC,Weight
0,00:00,,,,,,,,,,...,,,,,,,,,,73
1,00:24,,,,,,,,,,...,,,,,,,,,,
2,01:26,,,,,,,,1.00,15,...,85,144,,,21,36.2,,80,,
3,01:44,,,,,,,,,,...,,,,,,,,,,
4,02:26,,,,,,,,,,...,73,117,,,21,,,140,,73
5,03:11,,,,3.2,27,,0.9,,,...,,,142,128,,,0.34,,38.3,
6,03:20,,,,,,,,,,...,,,,,,,,,,
7,03:26,,,,,,,,,,...,79,123,,,24,,,60,,73
8,04:26,,,,,,,,,,...,81,142,,,26,,,,,73
9,05:26,,,,,,,,,,...,76,115,,,24,35.9,,,,73
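A toy version of the step above, assuming a current pandas: `first_valid_index` returns the index label of the first non-null entry, and that label is then used to read the value. The three-row frame and the helper name `pop_static` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy pivoted frame: static descriptors appear only in the 00:00 row.
df_toy = pd.DataFrame({
    "Time": ["00:00", "00:24", "01:26"],
    "Age": [84.0, np.nan, np.nan],
    "Height": [-1.0, np.nan, np.nan],   # -1 means "not recorded"
    "Weight": [73.0, np.nan, 73.0],     # also a time series, so it is kept
    "HR": [np.nan, np.nan, 104.0],
})

def pop_static(frame, column):
    """Return the first non-null value of `column` and drop the column in place."""
    value = frame.loc[frame[column].first_valid_index(), column]
    del frame[column]
    return value

age = int(pop_static(df_toy, "Age"))
height = pop_static(df_toy, "Height")
weight = df_toy.loc[df_toy["Weight"].first_valid_index(), "Weight"]

# Turn any -1 "missing" markers left in the time series into NaN.
df_toy = df_toy.replace(-1, np.nan)

print(age, height, weight)
print(df_toy)
```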


Now we need to convert the 'Time' string to a datetime object. He takes the current date and time as the reference point and adds each timestamp's hours and minutes to it.

In [6]:
def convert_time_str(s, today=None):
    m = re.match("(\d\d)\:(\d\d)", s) #to get the hours and the minutes
    assert(m)
    hours = int(m.group(1))
    minutes = int(m.group(2))
    if today is None:
        today = datetime.today()
    return today + timedelta(hours = hours, minutes = minutes)

The next function is a closure that fixes the reference time once, so the same converter can be passed to *apply* for the whole column instead of looping over each time.

In [7]:
def make_time_str_converter(today):
    def converter(s):
        return convert_time_str(s, today=today)
    return converter

In [8]:
df['TimeOriginal'] = df.Time
converter = make_time_str_converter(datetime.today())
df.Time = df.TimeOriginal.apply(converter)
del df['TimeOriginal']
df

Parameter,Time,ALP,ALT,AST,Albumin,BUN,Bilirubin,Creatinine,FiO2,GCS,...,NIMAP,NISysABP,Na,Platelets,RespRate,Temp,TroponinT,Urine,WBC,Weight
0,2015-10-07 04:36:39.106855,,,,,,,,,,...,,,,,,,,,,73
1,2015-10-07 05:00:39.106855,,,,,,,,,,...,,,,,,,,,,
2,2015-10-07 06:02:39.106855,,,,,,,,1.00,15,...,85,144,,,21,36.2,,80,,
3,2015-10-07 06:20:39.106855,,,,,,,,,,...,,,,,,,,,,
4,2015-10-07 07:02:39.106855,,,,,,,,,,...,73,117,,,21,,,140,,73
5,2015-10-07 07:47:39.106855,,,,3.2,27,,0.9,,,...,,,142,128,,,0.34,,38.3,
6,2015-10-07 07:56:39.106855,,,,,,,,,,...,,,,,,,,,,
7,2015-10-07 08:02:39.106855,,,,,,,,,,...,79,123,,,24,,,60,,73
8,2015-10-07 09:02:39.106855,,,,,,,,,,...,81,142,,,26,,,,,73
9,2015-10-07 10:02:39.106855,,,,,,,,,,...,76,115,,,24,35.9,,,,73
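Because `datetime.today()` includes the current time of day, the 00:00 row above lands at 04:36:39 rather than at midnight. That is harmless, since only differences between timestamps are used later, but a fixed reference date makes the output reproducible. A sketch, where the reference date `BASE` and the helper `elapsed_minutes` are my own choices and not something from his code:

```python
import re
from datetime import datetime, timedelta

# Arbitrary fixed reference; only offsets from it matter downstream.
BASE = datetime(2000, 1, 1)

def convert_time_str(s, base=BASE):
    """Convert an 'HH:MM' offset string to a datetime relative to `base`."""
    m = re.match(r"(\d\d):(\d\d)$", s)
    if m is None:
        raise ValueError("not an HH:MM string: %r" % s)
    return base + timedelta(hours=int(m.group(1)), minutes=int(m.group(2)))

def elapsed_minutes(s, base=BASE):
    """Minutes between `base` and the converted timestamp."""
    return int((convert_time_str(s, base) - base).total_seconds() // 60)

# Hours can exceed 23 because the records cover roughly two days.
for t in ["00:00", "01:26", "47:26"]:
    print(t, convert_time_str(t), elapsed_minutes(t))
```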


In [9]:
df.sort(axis=1, inplace=True)
df.sort(axis=0, inplace=True)

In [10]:
df

Parameter,ALP,ALT,AST,Albumin,BUN,Bilirubin,Creatinine,FiO2,GCS,Glucose,...,NISysABP,Na,Platelets,RespRate,Temp,Time,TroponinT,Urine,WBC,Weight
0,,,,,,,,,,,...,,,,,,2015-10-07 04:36:39.106855,,,,73
1,,,,,,,,,,,...,,,,,,2015-10-07 05:00:39.106855,,,,
2,,,,,,,,1.00,15,,...,144,,,21,36.2,2015-10-07 06:02:39.106855,,80,,
3,,,,,,,,,,,...,,,,,,2015-10-07 06:20:39.106855,,,,
4,,,,,,,,,,,...,117,,,21,,2015-10-07 07:02:39.106855,,140,,73
5,,,,3.2,27,,0.9,,,160,...,,142,128,,,2015-10-07 07:47:39.106855,0.34,,38.3,
6,,,,,,,,,,,...,,,,,,2015-10-07 07:56:39.106855,,,,
7,,,,,,,,,,,...,123,,,24,,2015-10-07 08:02:39.106855,,60,,73
8,,,,,,,,,,,...,142,,,26,,2015-10-07 09:02:39.106855,,,,73
9,,,,,,,,,,,...,115,,,24,35.9,2015-10-07 10:02:39.106855,,,,73


Now we need to handle the time-series variables. He subtracts the first timestamp from each time to get the elapsed number of minutes (these become the new timestamps), then builds a T x D numpy matrix of T timestamps and D variables. He also returns a vector of just the timestamps. An optional *hours* parameter trims the data to a maximum number of hours.

In [11]:
def generate_elapsed_timestamps(timestamps, first_dt):
    return (timestamps-first_dt).apply(lambda x: x / np.timedelta64(1, 'm'))

In [12]:
def as_nparray_with_timestamps(df, hours=None):
    df1 = df.copy().reset_index()
    df1['Elapsed'] = generate_elapsed_timestamps(df.Time, df.Time.min()).astype(int)
    if hours is not None:
        df1 = df1.ix[df1.Elapsed < hours*60]  # the Elapsed column lives on df1, not df
    df1.set_index('Elapsed', inplace=True)
    del df1['Time']
    df1.sort(axis=1, inplace=True)
    df1.sort_index(inplace=True)
    return df1, df1.as_matrix(), df1.index.to_series().as_matrix()

In [13]:
df_elapsed, X, Ts = as_nparray_with_timestamps(df)
print X, Ts
print '\n'
print df_elapsed

[[ nan  nan  nan ...,  nan  73.   0.]
 [ nan  nan  nan ...,  nan  nan   1.]
 [ nan  nan  nan ...,  nan  nan   2.]
 ..., 
 [ nan  nan  nan ...,  nan  73.  67.]
 [ nan  nan  nan ...,  nan  73.  68.]
 [ nan  nan  nan ...,  nan  73.  69.]] [   0   24   86  104  146  191  200  206  266  326  341  386  421  446  506
  566  593  596  611  626  686  746  776  806  818  866  896  926  947  986
 1046 1106 1136 1166 1205 1226 1286 1346 1406 1466 1496 1526 1566 1586 1646
 1706 1766 1826 1886 1914 1946 2006 2066 2126 2186 2216 2246 2306 2366 2426
 2466 2486 2546 2606 2635 2646 2666 2726 2786 2846]


Parameter  ALP  ALT  AST  Albumin  BUN  Bilirubin  Creatinine  FiO2  GCS  \
Elapsed                                                                    
0          NaN  NaN  NaN      NaN  NaN        NaN         NaN   NaN  NaN   
24         NaN  NaN  NaN      NaN  NaN        NaN         NaN   NaN  NaN   
86         NaN  NaN  NaN      NaN  NaN        NaN         NaN  1.00   15   
104        NaN  NaN  NaN  
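`.ix`, `DataFrame.sort` and `as_matrix` have since been removed from pandas. Assuming a current version, the same T x D matrix and timestamp vector can be sketched with `.loc`, `sort_index` and `to_numpy`. The function name `to_matrix_with_timestamps` and the toy frame are mine.

```python
import numpy as np
import pandas as pd

def to_matrix_with_timestamps(frame, hours=None):
    """Return (T x D matrix, T-vector of elapsed minutes) for a frame with a 'Time' column."""
    out = frame.copy().reset_index(drop=True)
    # Minutes elapsed since the earliest timestamp.
    elapsed = (out["Time"] - out["Time"].min()) / pd.Timedelta(minutes=1)
    out["Elapsed"] = elapsed.astype(int)
    if hours is not None:
        out = out.loc[out["Elapsed"] < hours * 60]
    out = out.set_index("Elapsed").drop(columns="Time")
    out = out.sort_index(axis=1).sort_index(axis=0)
    return out.to_numpy(), out.index.to_numpy()

base = pd.Timestamp("2000-01-01")
toy = pd.DataFrame({
    "Time": [base + pd.Timedelta(minutes=m) for m in (0, 24, 86, 146)],
    "HR": [np.nan, np.nan, 104.0, 98.0],
    "GCS": [np.nan, np.nan, 15.0, np.nan],
})

# Keep only the first two hours, so the row at 146 minutes is dropped.
X_toy, ts_toy = to_matrix_with_timestamps(toy, hours=2)
print(X_toy)
print(ts_toy)
```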

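He also builds the same array with the data resampled onto a regular time grid and, optionally, with missing values imputed. I had trouble getting this to work here, though it works in his code. The likely cause is that *resample* needs a dataframe indexed by Time, which his *from_file* sets up with *set_index* (see the commented-out line in cell 44).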

In [17]:
def as_nparray_resampled(df, hours=None, rate='H', bucket=True, impute=False):
    """Returns time series data as resampled T x D matrix. T is number of samples, D is
    number of variables. Leverages pandas.DataFrame.resample routine. Can impute missing values for
    time series with at least one measurement.
    :param hours: trim data to maximum number of hours
    :param rate: target sampling rate (in string format, as required by pandas.DataFrame.resample)
    :param bucket: if True, take mean of measurements in window; otherwise, use first measurement
    :param impute: if True, use forward- and backward-filling to impute missing measurements.
    :return: TxD matrix of data
    """
    df2 = df.copy()
    if impute:
        df2 = df2.resample(rate, how='mean' if bucket else 'first', closed='left', label='left', fill_method='ffill')
        df2.ffill(axis=0, inplace=True)
        df2.bfill(axis=0, inplace=True)
    else:
        df2 = df2.resample(rate, how='mean' if bucket else 'first', closed='left', label='left', fill_method=None)
    df2.reset_index(inplace=True)
    df2['Elapsed'] = generate_elapsed_timestamps(df2.Time, df2.Time.min()).astype(int)
    if hours is not None:
        df2 = df2.ix[df2.Elapsed < hours*60]
    df2.set_index('Elapsed', inplace=True)
    del df2['Time']
    df2.sort(axis=1, inplace=True)
    df2.sort_index(inplace=True)
    return df2, df2.as_matrix()

In [35]:
rng = pd.date_range('1/1/2011', periods=72, freq='H')
rng
ts = Series(np.random.rand(len(rng)), index=rng)
ts.head()
ts.resample('D', how = 'mean')

2011-01-01    0.494539
2011-01-02    0.544267
2011-01-03    0.569843
Freq: D, dtype: float64

In [44]:
#df.set_index('Time', inplace = True)
#print df.index
df_filled, X2= as_nparray_resampled(df, impute = True, bucket = False)
print X2
print df_filled

[[  63.    64.    69.  ...,   80.    38.3   73. ]
 [  63.    64.    69.  ...,   80.    38.3   73. ]
 [  63.    64.    69.  ...,   80.    38.3   73. ]
 ..., 
 [ 110.    65.    57.  ...,  160.    27.5   73. ]
 [ 110.    65.    57.  ...,  140.    27.5   73. ]
 [ 110.    65.    57.  ...,  140.    27.5   73. ]]
Parameter  ALP  ALT  AST  Albumin  BUN  Bilirubin  Creatinine  FiO2  GCS  \
Elapsed                                                                    
0           63   64   69      3.2   27        0.4         0.9  1.00   15   
60          63   64   69      3.2   27        0.4         0.9  1.00   15   
120         63   64   69      3.2   27        0.4         0.9  1.00   15   
180         63   64   69      3.2   27        0.4         0.9  1.00   15   
240         63   64   69      3.2   27        0.4         0.9  1.00   15   
300         63   64   69      3.2   27        0.4         0.9  1.00   15   
360         63   64   69      3.2   27        0.4         0.9  1.00   15   
420     
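The `how=` and `fill_method=` keywords of `resample` were removed in later pandas releases; the aggregation is now a method call on the resampler. A sketch of the same idea under that assumption, on a made-up frame indexed by Time (the function name `resample_frame` is mine, and `'60min'` is used as the hourly rate string):

```python
import numpy as np
import pandas as pd

def resample_frame(frame, rate="60min", bucket=True, impute=False):
    """Resample a DatetimeIndex-ed frame to a regular grid; optionally fill gaps."""
    resampler = frame.resample(rate, closed="left", label="left")
    out = resampler.mean() if bucket else resampler.first()
    if impute:
        # Carry the last value forward, then fill leading gaps backward.
        out = out.ffill().bfill()
    return out

base = pd.Timestamp("2000-01-01")
idx = pd.DatetimeIndex(
    [base + pd.Timedelta(minutes=m) for m in (0, 24, 86, 191)], name="Time"
)
toy = pd.DataFrame(
    {"HR": [np.nan, 100.0, 104.0, np.nan], "Na": [np.nan, np.nan, np.nan, 142.0]},
    index=idx,
)

hourly = resample_frame(toy)                # hourly means, gaps stay NaN
filled = resample_frame(toy, impute=True)   # gaps filled forward, then backward
print(hourly)
print(filled)
```

As in his docstring, a variable with no measurement at all stays NaN even with `impute=True`, because there is nothing to fill from.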

He also defines a method to merge pairs of variables that measure the same quantity by different techniques: *NIDiasABP* [Non-invasive diastolic arterial blood pressure (mmHg)] and *DiasABP* [Invasive diastolic arterial blood pressure (mmHg)]; *NIMAP* [Non-invasive mean arterial blood pressure (mmHg)] and *MAP* [Invasive mean arterial blood pressure (mmHg)]; *NISysABP* [Non-invasive systolic arterial blood pressure (mmHg)] and *SysABP* [Invasive systolic arterial blood pressure (mmHg)].

In [45]:
def merge_variables(df, to_merge):
    #to_merge: dictionary mapping new variable name to list of variables to be merged.
    dold = df.copy()
    s = Series(data=np.zeros((dold.shape[0],)), index=dold.index).replace(0, np.nan)
    dnew = DataFrame(dict([ (k, s) for k in to_merge.keys() if len(set(to_merge[k]).intersection(dold.columns))>0 ]))
    for newvar in dnew.columns:
        for oldvar in to_merge[newvar]:
            if oldvar in dold.columns:
                dnew[newvar][dold[oldvar].notnull()] = dold[oldvar][dold[oldvar].notnull()]
                del dold[oldvar]
    dnew = dnew.join(dold, how='outer')
    dnew.sort(axis=1, inplace=True)
    dnew.sort(axis=0, inplace=True)
    return dnew

This file doesn't contain both variables of any pair, so the shape doesn't change; the non-invasive columns it does have are simply renamed.

In [46]:
to_merge = { 'SysABP': ('NISysABP', 'SysABP'), 'DiasABP': ('NIDiasABP', 'DiasABP'), 'MAP': ('NIMAP', 'MAP') }
df_merged = merge_variables(df_elapsed, to_merge)
print df_merged.shape
print df_elapsed.shape

(70, 28)
(70, 28)
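The merge rule can be written more compactly with `Series.combine_first`. In his loop the variables are copied in the listed order, so a later entry (the invasive reading) overwrites an earlier one wherever both exist; the sketch below keeps that precedence. The function name `merge_columns` and the toy frame are mine, and a current pandas is assumed.

```python
import numpy as np
import pandas as pd

def merge_columns(frame, to_merge):
    """Merge each group of source columns into one column named by the dict key."""
    out = frame.copy()
    for new_name, old_names in to_merge.items():
        present = [c for c in old_names if c in out.columns]
        if not present:
            continue
        merged = out[present[0]]
        # Later columns take precedence wherever they have a value.
        for col in present[1:]:
            merged = out[col].combine_first(merged)
        out = out.drop(columns=present)
        out[new_name] = merged
    return out.sort_index(axis=1)

toy = pd.DataFrame({
    "NISysABP": [144.0, np.nan, 123.0],
    "SysABP": [np.nan, 110.0, 120.0],
    "HR": [104.0, 98.0, 95.0],
})
merged = merge_columns(toy, {"SysABP": ("NISysABP", "SysABP")})
print(merged)
```

In the last row both readings exist, and the invasive value (120) is the one that survives.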


Now we need to add the outcomes of the training sample from *Outcomes-a.txt*

In [47]:
recordId

135052

In [48]:
outcomes_filename = 'data/Outcomes-a.txt'
outcomes = DataFrame.from_csv(outcomes_filename, index_col = 'RecordID')
outcomes.head()

RecordID,SAPS-I,SOFA,Length_of_stay,Survival,In-hospital_death
132539,6,1,5,-1,0
132540,16,8,8,-1,0
132541,21,11,19,-1,0
132543,7,1,9,575,0
132545,17,2,4,918,0


The Outcomes file has these attributes for each record:
* RecordID,
* SAPS-I score,
* SOFA score,
* Length of stay (days),
* Survival (days),
* In-hospital death (0: survivor, or 1: died in-hospital)

[SAPS-I score](http://www.ncbi.nlm.nih.gov/pubmed/6499483) and [SOFA score](http://www.ncbi.nlm.nih.gov/pubmed/11594901) are severity-of-illness scores used to estimate the risk of death. I don't think we need to understand them in detail.

He sets invalid or conflicting Survival values to -999, in two cases:
* Survival is given as -1
* Survival is given as 0 while In-hospital_death is 0 (i.e. the patient survived)

In [49]:
outcomes.Survival[outcomes.Survival==-1] = -999
outcomes.Survival[(outcomes.Survival==0)&(outcomes['In-hospital_death']==0)] = -999

In [50]:
outcomes

RecordID,SAPS-I,SOFA,Length_of_stay,Survival,In-hospital_death
132539,6,1,5,-999,0
132540,16,8,8,-999,0
132541,21,11,19,-999,0
132543,7,1,9,575,0
132545,17,2,4,918,0
132547,14,11,6,1637,0
132548,14,4,9,-999,0
132551,19,8,6,5,1
132554,11,0,17,38,0
132555,14,6,8,-999,0
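On a current pandas, `DataFrame.from_csv` no longer exists and the chained assignment `outcomes.Survival[mask] = ...` is unreliable; `pd.read_csv(..., index_col=...)` and `.loc` are the usual replacements. A sketch on a few rows taken from the tables above, plus one made-up row (record 999999) to exercise the second rule:

```python
import io

import pandas as pd

raw = """RecordID,SAPS-I,SOFA,Length_of_stay,Survival,In-hospital_death
132539,6,1,5,-1,0
132543,7,1,9,575,0
132551,19,8,6,5,1
999999,10,3,4,0,0
"""

outcomes = pd.read_csv(io.StringIO(raw), index_col="RecordID")

# Rule 1: a Survival of -1 is replaced by the -999 marker.
outcomes.loc[outcomes["Survival"] == -1, "Survival"] = -999

# Rule 2: a Survival of 0 days contradicts a patient recorded as a survivor.
conflict = (outcomes["Survival"] == 0) & (outcomes["In-hospital_death"] == 0)
outcomes.loc[conflict, "Survival"] = -999

print(outcomes)
# Look up one value by RecordID and column name.
print(outcomes.loc[132551, "Survival"])
```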


Next we get the outcomes for our example record. He handles invalid data with separate exception classes; they're easy to understand, so I'm not including that part.

In [51]:
saps1 = outcomes['SAPS-I'][recordId]
sofa = outcomes['SOFA'][recordId]
los = outcomes['Length_of_stay'][recordId]
mortality = outcomes['In-hospital_death'][recordId]
survival = outcomes['Survival'][recordId]
saps1, sofa, los, mortality, survival
    

(17, 7, 6, 0, -999)

So that's how he does it for one record. Here's the full code for makeinput.py

In [52]:
# %load makeinput.py
"""
@author: dbell
@author: davekale
"""

import re
from datetime import timedelta, datetime

import numpy as np
from pandas import DataFrame, Series


class InvalidChallenge2012DataException(Exception):
    def __init__(self, field, err, value, recordid=None):
        s = '{0} has invalid {1}: {2}'.format(field, err, value)
        if recordid is not None:
            s = s + ' (record {0})'.format(recordid)
        Exception.__init__(self, s)
        self.field = field
        self.err = err
        self.value = value
        self.recordid = recordid

class Challenge2012Episode:
    def __init__(self, recordID, age, gender, height, icuType, weight, data, source_set='UNKNOWN'):
        """Constructor for Challenge2012Episode.
        :param recordID: Record ID
        :param age: patient age in years
        :param gender: patient gender (1: female, 0: male, -1: unknown/other)
        :param height: patient height in cm (-1: not available)
        :param icuType: type of ICU (1-4)
        :param weight: patient weight in kg (-1: not available)
        :param data: physiologic measurements as pandas.DataFrame object with
                datetime as index, one column per variable
        :param source_set: a, b, c, or UNKNOWN
        """
        self._recordId = recordID
        self._age      = age
        self._gender   = gender
        self._height   = height
        self._icuType  = icuType
        self._weight   = weight

        self._saps1     = -1
        self._sofa      = -1
        self._los       = -1
        self._survival  = -1
        self._mortality = -1

        self._data = data.copy()
        self._set = source_set

    @staticmethod    
    def generate_elapsed_timestamps(timestamps, first_dt):
        """Converts datetime to minutes elapsed since first_dt.
        Arguments:
        :param timestamps: pandas.Series containing datetime.datetime objects
        :param first_dt: datetime.datetime object
        """
        return (timestamps-first_dt).apply(lambda x: x / np.timedelta64(1, 'm'))

    @staticmethod
    def convert_time_str(s, today=None):
        """Converts time string in Challenge 2012 format to a datetime relative to "today."
        Challenge 2012 format is in HH:MM (hours:minutes).
        Arguments:
        :param s: string in HH:MM format
        :param today: datetime.datetime object
        """
        m = re.match("(\d\d)\:(\d\d)", s)
        assert(m)
        hours = int(m.group(1))
        minutes = int(m.group(2))
        if today is None:
            today = datetime.today()
        return today + timedelta(hours=hours, minutes=minutes)

    @staticmethod
    def make_time_str_converter(today):
        """Closure function that returns a Challenge 2012 time string converter, relative to
        today argument.
        Arguments:
        :param today: datetime.datetime object
        """
        def converter(s):
            return Challenge2012Episode.convert_time_str(s, today=today)

        return converter

    @staticmethod
    def from_file(filename, variables):
        """Read data for one Challenge2012Episode from one text file, including only specified variables.
        Arguments:
        :param filename: string with full path to file to be read.
        :param variables: list of variable names to keep, as strings.
        """
        match = re.search('\d{6}.txt', filename) #ensure that file matches given format
        if not match:
            raise InvalidChallenge2012DataException('file', 'name', filename)
        df = DataFrame.from_csv(filename, index_col=None)
        df.drop_duplicates(subset=['Time', 'Parameter'], inplace=True)
        df = df.pivot(index='Time', columns='Parameter', values='Value')

        variables = set(variables)
        variables.update(['RecordID', 'Age', 'Gender', 'Height', 'ICUType', 'Weight'])
        emptyvec = np.empty((df.shape[0],))
        emptyvec[:] = np.nan
        for v in variables:
            if v not in df.columns:
                df[v] = emptyvec
        for v in df.columns:
            if v not in variables:
                del df[v]

        df.sort(axis=1, inplace=True)
        df.reset_index(inplace=True)

        if df['RecordID'].notnull().sum() != 1:
            raise InvalidChallenge2012DataException('recordID', 'count', df['RecordID'].notnull().sum())
        recordId = int(df['RecordID'][df['RecordID'].first_valid_index()])
        del df['RecordID']

        if df['Age'].notnull().sum() != 1:
            raise InvalidChallenge2012DataException('Age', 'count', df['Age'].notnull().sum(), recordId)
        age = int(df['Age'][df['Age'].first_valid_index()])
        del df['Age']

        if df['Gender'].notnull().sum() != 1:
            raise InvalidChallenge2012DataException('Gender', 'count', df['Gender'].notnull().sum(), recordId)
        gender = 1 - int(df['Gender'][df['Gender'].first_valid_index()])
        gender = -1 if gender < 0 or gender > 1 else gender
        del df['Gender']

        if df['Height'].notnull().sum() != 1:
            raise InvalidChallenge2012DataException('Height', 'count', df['Height'].notnull().sum(), recordId)
        height = df['Height'][df['Height'].first_valid_index()]
        del df['Height']

        if df['ICUType'].notnull().sum() != 1:
            raise InvalidChallenge2012DataException('ICUType', 'count', df['ICUType'].notnull().sum(), recordId)
        icuType = int(df['ICUType'][df['ICUType'].first_valid_index()])
        if icuType not in {1,2,3,4}:
            raise InvalidChallenge2012DataException('ICUType', 'value', icuType, recordId)
        del df['ICUType']

        if df['Weight'].notnull().sum() < 1:
            raise InvalidChallenge2012DataException('Weight', 'count', df['Weight'].notnull().sum(), recordId)
        weight = df['Weight'][df['Weight'].first_valid_index()]

        df.replace(to_replace=-1, value=np.nan, inplace=True)

        df['TimeOriginal'] = df.Time
        converter = Challenge2012Episode.make_time_str_converter(datetime.today())
        try:
            df.Time = df.TimeOriginal.apply(converter)
        except:
            raise InvalidChallenge2012DataException('timestamp', 'format', df.TimeOriginal[0], recordId)
        del df['TimeOriginal']
        df.set_index('Time', inplace=True)
        df.sort(axis=1, inplace=True)
        df.sort(axis=0, inplace=True)

        m = re.search('/set-([abc])/\d{6}.txt', filename)
        if m:
            return Challenge2012Episode(recordId, age, gender, height, icuType, weight, df, source_set=m.group(1))
        else:
            return Challenge2012Episode(recordId, age, gender, height, icuType, weight, df)

    def printObj(self):
        print self.__dict__

    def merge_variables(self, to_merge):
        """Merges time series variables into new time series variables.
        :param to_merge: dictionary mapping new variable name to list of variables to be merged.
        :return:
        """
        dold = self._data.copy()
        s = Series(data=np.zeros((dold.shape[0],)), index=dold.index).replace(0, np.nan)
        dnew = DataFrame(dict([ (k, s) for k in to_merge.keys() if len(set(to_merge[k]).intersection(dold.columns))>0 ]))
        for newvar in dnew.columns:
            for oldvar in to_merge[newvar]:
                if oldvar in dold.columns:
                    dnew[newvar][dold[oldvar].notnull()] = dold[oldvar][dold[oldvar].notnull()]
                    del dold[oldvar]
        dnew = dnew.join(dold, how='outer')
        dnew.sort(axis=1, inplace=True)
        dnew.sort(axis=0, inplace=True)
        self._data = dnew
        
    def as_nparray_with_timestamps(self, hours=None):
        """Returns time series data as T x D matrix, along with T-vector of timestamps. T is number of samples, D is
        number of variables. Timestamps are in minutes elapsed and may be irregular.
        :param hours: trim data to maximum number of hours
        :return: tuple of TxD matrix of data, T vector of timestamps
        """
        df = self._data.reset_index()

        df['Elapsed'] = Challenge2012Episode.generate_elapsed_timestamps(df.Time, df.Time.min()).astype(int)
        if hours is not None:
            df = df.ix[df.Elapsed < hours*60]
        df.set_index('Elapsed', inplace=True)
        del df['Time']
        df.sort(axis=1, inplace=True)
        df.sort_index(inplace=True)
        return df.as_matrix(), df.index.to_series().as_matrix()
      
    def as_nparray_resampled(self, hours=None, rate='1H', bucket=True, impute=False): #, normal_values=None):
        """Returns time series data as resampled T x D matrix. T is number of samples, D is
        number of variables. Leverages pandas.DataFrame.resample routine. Can impute missing values for
        time series with at least one measurement.
        :param hours: trim data to maximum number of hours
        :param rate: target sampling rate (in string format, as required by pandas.DataFrame.resample)
        :param bucket: if True, take mean of measurements in window; otherwise, use first measurement
        :param impute: if True, use forward- and backward-filling to impute missing measurements.
        :return: TxD matrix of data
        """
        df = self._data #.reset_index()

        if impute:
            df = df.resample(rate, how='mean' if bucket else 'first', closed='left', label='left', fill_method='ffill')
            df.ffill(axis=0, inplace=True)
            df.bfill(axis=0, inplace=True)
            #        assert(df[varid].notnull().all())
        else:
            df = df.resample(rate, how='mean' if bucket else 'first', closed='left', label='left', fill_method=None)

        df.reset_index(inplace=True)
        df['Elapsed'] = Challenge2012Episode.generate_elapsed_timestamps(df.Time, df.Time.min()).astype(int)
        if hours is not None:
            df = df.ix[df.Elapsed < hours*60]
        df.set_index('Elapsed', inplace=True)
        del df['Time']
        df.sort(axis=1, inplace=True)
        df.sort_index(inplace=True)
        return df.as_matrix()

class ConflictingChallenge2012DataException(Exception):
    def __init__(self, field1, value1, field2, value2, recordid=None):
        s = '{0}, {1} values conflict: {2}, {3}'.format(field1, field2, value1, value2)
        if recordid is not None:
            s = s + ' (record {0})'.format(recordid)
        Exception.__init__(self, s)
        self.field1 = field1
        self.field2 = field2
        self.value1 = value1
        self.value2 = value2
        self.recordid = recordid

def add_outcomes(eps, outcomes_filename):
    """
    :param eps: list of Challenge2012Episode objects
    :param outcomes_filename: full path as string to outcomes CSV file
    :return: list of Challenge2012Episode objects with updated outcomes data
    """
    try:
        outcomes = DataFrame.from_csv(outcomes_filename, index_col='RecordID')
    except:
        raise InvalidChallenge2012DataException('outcome', 'filename', outcomes_filename)

    ## address conflicting outcomes data to make it consistent with
    ## rules described on http://physionet.org/challenge/2012/#data-correction
    #idx = (outcomes['In-hospital_death']==1)
    #idx = idx & (outcomes['Survival']<2) & (outcomes['Length_of_stay']>=2)
    #outcomes['Survival'][idx] = outcomes['Length_of_stay'][idx]
    #outcomes['Length_of_stay'][(outcomes['Length_of_stay']<2)&(outcomes['In-hospital_death']==0)] = -123456
    #outcomes['Survival'][(outcomes['Survival']>-1)&(outcomes['Survival']<2)] = -1
    ##
    outcomes.Survival[outcomes.Survival==-1] = -999
    outcomes.Survival[(outcomes.Survival==0)&(outcomes['In-hospital_death']==0)] = -999

    for ep in eps:
        if ep._recordId in outcomes.index:
            ep._saps1 = outcomes['SAPS-I'][ep._recordId]
            ep._sofa = outcomes['SOFA'][ep._recordId]
            ep._los = outcomes['Length_of_stay'][ep._recordId]
            ep._mortality = outcomes['In-hospital_death'][ep._recordId]
            if ep._mortality != 1 and ep._mortality != 0:
                raise InvalidChallenge2012DataException('mortality', 'value', ep._mortality, ep._recordId)
            ep._survival = outcomes['Survival'][ep._recordId]
            if (ep._survival > ep._los or ep._survival == -999) and ep._mortality != 0:
                raise ConflictingChallenge2012DataException('survival', ep._survival, 'mortality', ep._mortality, ep._recordId)
            if (ep._survival > 0 and ep._survival <= ep._los) and ep._mortality != 1:
                raise ConflictingChallenge2012DataException('survival', ep._survival, 'mortality', ep._mortality, ep._recordId)
            
    return eps
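The two consistency checks at the end of `add_outcomes` are easier to read in isolation. This sketch restates them as a standalone function (the name `outcome_conflict` is mine) that returns True when survival and mortality disagree, instead of raising an exception:

```python
def outcome_conflict(survival, los, mortality):
    """Return True if survival days, length of stay and the death flag disagree.

    survival: survival in days, or -999 if unknown
    los: length of stay in days
    mortality: 1 if the patient died in hospital, 0 otherwise
    """
    # Outlived the stay (or survival unknown) but flagged as died in hospital.
    if (survival > los or survival == -999) and mortality != 0:
        return True
    # Died within the stay but not flagged as an in-hospital death.
    if 0 < survival <= los and mortality != 1:
        return True
    return False

print(outcome_conflict(575, 9, 0))    # survivor who outlived the stay
print(outcome_conflict(5, 6, 1))      # died during the stay, flagged as such
print(outcome_conflict(5, 6, 0))      # died during the stay, flagged as survivor
print(outcome_conflict(-999, 5, 1))   # unknown survival, flagged as death
```

Only the last two calls report a conflict.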

# Explanation of code in processinput.py #

To do this for all the records, he has another script, which I've named processinput.py. Don't run the next cell; it only loads the code so you can read it. Run it from the cell two cells down, where I've passed the arguments it needs.

In [None]:
# %load processinput.py
"""
Created on Wed Apr  9 14:29:38 2014
@author: dbell
@author: davekale
"""

from __future__ import division

import argparse
import glob
import os
import sys

import numpy as np
import scipy.io as sio

import makeinput
from makeinput import Challenge2012Episode

parser = argparse.ArgumentParser()
parser.add_argument('data_dir', type=unicode)
parser.add_argument('out_dir', type=unicode)
parser.add_argument('-b', '--basename', type=unicode, default='physionet_challenge2012')
parser.add_argument('-v', '--variables', type=unicode, nargs='+', default=['ALP', 'ALT', 'AST', 'Albumin', 'BUN',
                                                                           'Bilirubin', 'Cholesterol', 'Creatinine',
                                                                           'DiasABP', 'FiO2', 'GCS', 'Glucose', 'HCO3',
                                                                           'HCT', 'HR', 'K', 'Lactate', 'MAP',
                                                                           'MechVent', 'Mg', 'NIDiasABP', 'NIMAP',
                                                                           'NISysABP', 'Na', 'PaCO2', 'PaO2',
                                                                           'Platelets', 'RespRate', 'SaO2', 'SysABP',
                                                                           'Temp', 'TroponinI', 'TroponinT', 'Urine',
                                                                           'WBC', 'Weight', 'pH'])
parser.add_argument('-r', '--resample_rate', type=int, default=60)
parser.add_argument('--merge_bp', action='store_true')
args = parser.parse_args()
args.variables = set(args.variables)

fns = glob.glob(os.path.join(os.path.join(args.data_dir, 'set-a'), '*.txt'))
#fns.extend(glob.glob(os.path.join(os.path.join(args.data_dir, 'set-b'), '*.txt')))
#eps = [ Challenge2012Episode.from_file(fn, args.variables) for fn in fns ]
eps = []
sentinel = 0.0
for i, fn in enumerate(fns):
    if i / len(fns) > sentinel:
        sys.stdout.write('.')
        sys.stdout.flush()
        sentinel += 0.01
    eps.append(Challenge2012Episode.from_file(fn, args.variables))
sys.stdout.write('\n')
eps = makeinput.add_outcomes(eps, os.path.join(args.data_dir, 'Outcomes-a.txt'))

variables = args.variables
if args.merge_bp:
    to_merge = { 'SysABP': ('NISysABP', 'SysABP'), 'DiasABP': ('NIDiasABP', 'DiasABP'), 'MAP': ('NIMAP', 'MAP') }
    variables = None
    for ep in eps:
        ep.merge_variables(to_merge)
        variables = set(ep._data.columns.tolist()) if variables is None else variables
variables = sorted(variables)

Xraw  = []
Traw  = np.zeros((len(eps),), dtype=int)
tsraw = []
Xmiss = []
X     = []
T     = np.zeros((len(eps),), dtype=int)

recordid = np.zeros((len(eps),), dtype=int)
age      = np.zeros((len(eps),), dtype=int)
gender   = np.zeros((len(eps),), dtype=int)
height   = np.zeros((len(eps),))
weight   = np.zeros((len(eps),))
icutype  = np.zeros((len(eps),), dtype=int)
#source   = np.zeros((len(eps),), dtype=int)

saps1 = np.zeros((len(eps),), dtype=int)
sofa  = np.zeros((len(eps),), dtype=int)
ym    = np.zeros((len(eps),), dtype=int)
ylos  = np.zeros((len(eps),), dtype=int)
ysurv = np.zeros((len(eps),), dtype=int)

for i, ep in enumerate(eps):
    x, ts = ep.as_nparray_with_timestamps()
    Xraw.append(x)
    tsraw.append(ts)
    Traw[i] = x.shape[0]

    x = ep.as_nparray_resampled(impute=False)
    Xmiss.append(x)
    T[i] = x.shape[0]

    x = ep.as_nparray_resampled(impute=True)
    X.append(x)

    recordid[i] = ep._recordId
    gender[i]   = ep._gender
    age[i]      = ep._age
    height[i]   = ep._height
    weight[i]   = ep._weight
    icutype[i]  = ep._icuType
    '''if ep._set == 'a':
        source[i] = 1
    elif ep._set == 'b':
        source[i] = 2
    elif ep._set == 'c':
        source[i] = 3
    else:
        source[i] = 0'''
    saps1[i] = ep._saps1
    sofa[i]  = ep._sofa
    ym[i]    = ep._mortality
    ylos[i]  = ep._los
    ysurv[i] = ep._survival

np.savez(os.path.join(args.out_dir, args.basename + '.npz'), Xraw=Xraw, tsraw=tsraw, Traw=Traw, Xmiss=Xmiss, X=X, T=T,
         recordid=recordid, gender=gender, age=age, height=height, weight=weight, icutype=icutype,# source=source,
         saps1=saps1, sofa=sofa, ym=ym, ylos=ylos, ysurv=ysurv)

sio.savemat(os.path.join(args.out_dir, args.basename + '.mat'), {'Xraw': Xraw, 'tsraw': tsraw, 'Traw': Traw, 'Xmiss': Xmiss,
            'X': X, 'T': T, 'recordid': recordid, 'gender': gender, 'age': age, 'height': height, 'weight': weight,
            'icutype': icutype, 'saps1': saps1, 'sofa': sofa, 'ym': ym, 'ylos': ylos, 'ysurv': ysurv}) #'source': source

f = open(os.path.join(args.out_dir, args.basename + '-variables.csv'), 'w')
for i,v in enumerate(variables):
    f.write('{0},{1}\n'.format(i+1,v))
f.close()

* He uses an argparse parser to declare all the command-line arguments we need (data folder, output folder, variable list and so on). I don't know the syntax well; we can look it up if we need it.
* Then he gets all the filenames using glob like we did.
* The whole sentinel thing is just to print dots while it's running. 
* *eps* is a list with one *Challenge2012Episode* object per file, as returned by the *from_file* function in the previous code. Each object stores the 6 fixed descriptors as attributes, plus a *_data* attribute holding the dataframe of the time-series variables.
* Then you add outcomes to each ep using the *add_outcomes* function above
* If you pass the *--merge_bp* flag, the invasive and non-invasive versions of the blood pressure variables are merged.
* Make a 4000 (number of training samples) sized zero-filled vector for the general descriptors and the outcomes.
* Traw is just the number of distinct timestamps for each record.
* tsraw is the list of timestamps for each record
* Xmiss is the T x D matrix of each record when we use the *as_nparray_resampled* method setting impute as False.
* T is the number of timestamps when we use the *as_nparray_resampled* method setting impute as False.
* X is the T x D matrix of each record when we use the *as_nparray_resampled* method setting impute as True.
* The general descriptors are set per record. Ignore the commented-out part: *source* was necessary for him because there were 3 datasets, a, b and c. We're only using a because the others were test sets without labels.
* The outcomes are also set per record.
* So now we have numpy arrays of size 4000 for each input attribute and each output attribute. These are stored in a numpy zip file, *physionet_challenge2012.npz*. I don't think we need the .mat file.
* The list of variables is stored in *physionet_challenge2012-variables.csv*
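On the argparse syntax from the first bullet: positional arguments are declared by a bare name, optional ones by a `-x`/`--long` pair, and `action='store_true'` makes a flag that defaults to False. A cut-down sketch of his parser (Python 3, so `str` replaces `unicode`; the shortened variable list and the argument list passed to `parse_args` are just examples):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("data_dir", type=str)                  # required, positional
parser.add_argument("out_dir", type=str)                   # required, positional
parser.add_argument("-b", "--basename", type=str,
                    default="physionet_challenge2012")     # optional with default
parser.add_argument("-v", "--variables", type=str, nargs="+",
                    default=["HR", "GCS"])                 # one or more values
parser.add_argument("--merge_bp", action="store_true")     # boolean flag

# The list stands in for the command line:
#   processinput.py data dataout --merge_bp
args = parser.parse_args(["data", "dataout", "--merge_bp"])
print(args.data_dir, args.out_dir, args.basename, args.variables, args.merge_bp)
```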

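To run this code you need a folder called **dataout** for the output. I also put empty files named **physionet_challenge2012.npz**, **physionet_challenge2012.mat** and **physionet_challenge2012-variables.csv** in it because the script didn't seem to create them for me. That said, *np.savez*, *sio.savemat* and *open* normally create their output files as long as the folder exists, so the folder alone may be enough. You also need your **set-a** folder and **Outcomes-a.txt** in a folder called **data**.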
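A small sketch of preparing the output folder from Python, assuming Python 3. A temporary directory stands in for the project folder so the snippet runs anywhere; in practice the path would simply be `dataout`.

```python
import os
import tempfile

# Stand-in for the project folder; replace with the real working directory.
project_dir = tempfile.mkdtemp()
out_dir = os.path.join(project_dir, "dataout")

# exist_ok=True makes this safe to call when the folder is already there.
os.makedirs(out_dir, exist_ok=True)
os.makedirs(out_dir, exist_ok=True)

# Writing into the folder now works; the file itself is created by open().
variables_path = os.path.join(out_dir, "physionet_challenge2012-variables.csv")
with open(variables_path, "w") as f:
    f.write("1,ALP\n")

print(os.listdir(out_dir))
```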

In [None]:
%run processinput.py data dataout

You can open the .npz file with np.load. The result behaves like a dict mapping each array name to its array. This is the final preprocessed data.

In [None]:
filename = 'dataout/physionet_challenge2012.npz'
npzfile = np.load(filename)
print npzfile.files
for name in npzfile.files:
    print len(npzfile[name]) #each array has 4000 entries
for arr in npzfile['X']:
    print arr.shape #X is the T x D matrix after imputing values, where T is number of timestamps and D is number of variables
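One caveat if this is rerun on a newer NumPy: `Xraw`, `tsraw`, `Xmiss` and `X` hold one matrix per record with a different T each, so they end up as object arrays, and recent NumPy versions refuse to load those unless `allow_pickle=True` is passed. A sketch with two made-up records (the file name `toy.npz` is a placeholder):

```python
import os
import tempfile

import numpy as np

# Two records with different numbers of timestamps (T=3 and T=2), D=2 variables.
records = [np.zeros((3, 2)), np.ones((2, 2))]

# Build the object array explicitly; newer NumPy will not infer it from a ragged list.
X = np.empty(len(records), dtype=object)
for i, rec in enumerate(records):
    X[i] = rec
T = np.array([rec.shape[0] for rec in records])

path = os.path.join(tempfile.mkdtemp(), "toy.npz")
np.savez(path, X=X, T=T)

# Object arrays are pickled inside the .npz, so loading needs allow_pickle=True.
with np.load(path, allow_pickle=True) as npzfile:
    names = sorted(npzfile.files)
    shapes = [arr.shape for arr in npzfile["X"]]
    T_loaded = npzfile["T"]

print(names, shapes, T_loaded)
```

Only load pickled arrays from files you produced yourself, since unpickling can execute arbitrary code.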

## Now we need to figure out how these can be transformed into input vectors for an ML algorithm. ##