In [None]:
!pip install -U wget
!rm -rf data.zip data lib
!mkdir lib

In [None]:
import wget
wget.download('https://github.com/shengpu1126/BDSI2019-ML/raw/master/lib/config.yaml', 'lib/config.yaml')
wget.download('https://github.com/shengpu1126/BDSI2019-ML/raw/master/lib/helper.py', 'lib/helper.py')
wget.download('https://github.com/shengpu1126/BDSI2019-ML/raw/master/data.zip', 'data.zip')

import zipfile
with zipfile.ZipFile("data.zip","r") as zip_ref:
    zip_ref.extractall(".")

# 2 - Feature Extraction

The goal of feature extraction is to transform each patient’s raw data into a $d$-dimensional feature vector, so that we can train a machine learning model on it. Here, we summarize each time-varying variable by taking its mean. See **Figure 1** for an illustration.

![title](lib/EHR_feature.png)

**Figure 1.** Transforming EHR data into a feature vector. `Age` and `Height` are time-invariant variables, each of which is encoded as a separate feature. For this patient, the feature `Age` keeps its original value, while the feature `Height` is `np.nan` since it is unknown (−1). `HR`, `Temp` and `RespRate` are time-varying variables. Here, we encode each of them by its mean. The feature `mean_HR` contains the mean of the heart rate measurements, whereas `mean_RespRate` is `np.nan` because no respiratory rate measurements were recorded for this patient.

In [None]:
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from lib.helper import load_data, config

## Helper functions to implement...

- `generate_feature_vector(df)`: <br> For the time-invariant variables, use the raw values. Replace unknown observations (−1) with undefined (use `np.nan`), and name these features with the original variable names. For each time-varying variable, compute the mean of all measurements for that variable. If no measurement exists for a variable, the mean is also undefined (use `np.nan`). Name these features as `mean_{Variable}` for each variable. For example, the variable `HR` would correspond to the feature with name `mean_HR`.
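
As a rough illustration of the rules above, here is a minimal sketch on a made-up patient record. It is not the reference solution: `toy_config`, `toy_df` and `sketch_feature_vector` are hypothetical names, the variable lists are assumptions standing in for `config['static']` and `config['timeseries']`, and the `Time` format is only a guess at what `load_data` returns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `config` from lib.helper; the real lists may differ.
toy_config = {
    'static': ['Age', 'Height'],
    'timeseries': ['HR', 'Temp', 'RespRate'],
}

def sketch_feature_vector(df, cfg=toy_config):
    """Illustrative only: keep static values, average time-varying ones."""
    # Unknown observations are coded as -1; treat them as missing
    df = df.replace({-1: np.nan})
    feature_dict = {}
    for var in cfg['static']:
        values = df.loc[df['Variable'] == var, 'Value']
        # Use the recorded value if the variable appears, otherwise NaN
        feature_dict[var] = values.iloc[0] if len(values) > 0 else np.nan
    for var in cfg['timeseries']:
        values = df.loc[df['Variable'] == var, 'Value']
        # Series.mean() skips NaN and yields NaN when nothing was recorded
        feature_dict['mean_' + var] = values.mean()
    return feature_dict

# Made-up record: Height is unknown (-1) and RespRate was never measured
toy_df = pd.DataFrame({
    'Time': ['00:00', '00:00', '00:30', '01:30', '02:00'],
    'Variable': ['Age', 'Height', 'HR', 'HR', 'Temp'],
    'Value': [32, -1, 80, 88, 37.0],
})
toy_features = sketch_feature_vector(toy_df)
print(toy_features)
```

In this toy record, `Height` and `mean_RespRate` both come out as `np.nan`, mirroring the patient in **Figure 1**.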

In [None]:
def generate_feature_vector(df):
    """
    Reads a dataframe containing all measurements for a single patient
    within the first 48 hours of the ICU admission, and converts it into
    a feature vector.
    
    Args:
        df: pd.DataFrame, with columns [Time, Variable, Value]
    
    Returns:
        a python dictionary of format {feature_name: feature_value}
        for example, {'Age': 32, 'Gender': 0, 'mean_HR': 84, ...}
    """
    static_variables = config['static']
    timeseries_variables = config['timeseries']

    # Replace unknown values
    df = df.replace({-1: np.nan})
    
    ## TODO: implement this function
    feature_dict = {}
    
    
    return feature_dict

## Test the function

In [None]:
# Load the dataset
# `raw_data` is a dictionary mapping patient ID to the data associated with that patient
raw_data, df_labels = load_data(N=2500)

In [None]:
IDs = sorted(raw_data.keys())
ID = IDs[0]
df = raw_data[ID]
df_i = generate_feature_vector(df)

In [None]:
print(ID)
df_i

## Generate features for the first 2500 patients

In [None]:
features = [generate_feature_vector(df) for _, df in tqdm(sorted(raw_data.items()), desc='Generating feature vectors')]

In [None]:
df_features = pd.DataFrame(features).sort_index(axis=1)
feature_names = df_features.columns.tolist()
X, y = df_features.values, df_labels['In-hospital_death'].values

In [None]:
## TODO:
# Report the dimensionality of the feature vector


In [None]:
## TODO:
# What is the name of each feature?


In [None]:
## TODO:
# For each feature, what is the fraction of patients having that feature missing?


In [None]:
## TODO:
# Report the average value of each feature (considering only recorded non-missing values)


## Reflection

1. Read the [documentation](https://physionet.org/challenge/2012/\#general-descriptors) on the variable `ICUType`, and reflect on the current feature representation of this variable. What does such a representation imply, when using a linear classifier? How else might you represent this variable (as possibly more than one feature)?
2. Here we only consider the mean of the numerical variables. What limitations are associated with this representation? What other summary statistics could be useful?
3. How should we handle missing values? 
4. Notice that features can have values of different orders of magnitude (age is between 18 and 100, while gender is 0 or 1). How should we handle these?
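
To make questions 1, 3 and 4 more concrete, the sketch below shows one possible treatment on a made-up feature table: one indicator column per `ICUType` code, mean imputation for missing values, and min-max scaling to [0, 1]. These are illustrative choices under stated assumptions, not the only reasonable answers. The `toy` table and its values are hypothetical, and for simplicity the statistics are computed on all rows, whereas in practice they would usually be fit on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up feature table; the values are for illustration only
toy = pd.DataFrame({
    'Age': [30.0, 60.0, 90.0],
    'ICUType': [1, 3, 4],
    'mean_HR': [70.0, np.nan, 110.0],
})

# Question 1: one indicator column per ICU type instead of a single
# integer code, so a linear model does not assume the codes are ordered
encoded = pd.get_dummies(toy, columns=['ICUType'], prefix='ICUType', dtype=float)

# Question 3: fill each missing entry with the mean of the observed
# values in the same column (one simple option among several)
imputed = encoded.fillna(encoded.mean())

# Question 4: rescale every column to the range [0, 1]
scaled = MinMaxScaler().fit_transform(imputed)
scaled_df = pd.DataFrame(scaled, columns=imputed.columns)
print(scaled_df)
```

After these steps every column lies between 0 and 1, so no single feature dominates purely because of its units.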