# 3 - Getting X and y



In [None]:
#@title Run this cell to download the data and helper files. { display-mode: "form" }
!pip install -U wget
!rm -rf data.zip data lib
!mkdir lib

import wget
wget.download('https://github.com/shengpu1126/BDSI2019-ML/raw/master/lib/config.yaml', 'lib/config.yaml')
wget.download('https://github.com/shengpu1126/BDSI2019-ML/raw/master/lib/helper.py', 'lib/helper.py')
wget.download('https://github.com/shengpu1126/BDSI2019-ML/raw/master/data.zip', 'data.zip')

import zipfile
with zipfile.ZipFile("data.zip","r") as zip_ref:
    zip_ref.extractall(".")

In [None]:
import numpy as np
import pandas as pd
from tqdm import tqdm
from lib.helper import load_data, config

## Generate features for the first 2500 patients

In [None]:
# From yesterday :)
def generate_feature_vector(df):
    """
    Reads a DataFrame containing all measurements for a single patient
    within the first 48 hours of the ICU admission, and converts it into
    a feature vector.
    
    Args:
        df: pd.DataFrame, with columns [Time, Variable, Value]
    
    Returns:
        a python dictionary of format {feature_name: feature_value}
        for example, {'Age': 32, 'Gender': 0, 'mean_HR': 84, ...}
    """
    static_variables = config['invariant']
    timeseries_variables = config['timeseries']

    # Replace unknown values (encoded as -1) with NaN
    df = df.replace({-1: np.nan})
    
    # Split time invariant and time series
    static, timeseries = df.iloc[0:5], df.iloc[5:]
    # Keyword arguments are required by recent pandas versions
    static = static.pivot(index='Time', columns='Variable', values='Value')

    feature_dict = static.iloc[0].to_dict()
    for variable in timeseries_variables:
        measurements = timeseries[timeseries['Variable'] == variable]['Value']
        feature_dict['mean_' + variable] = np.mean(measurements)
    
    return feature_dict

# Load the dataset
# `raw_data` is a dictionary mapping patient ID to the data associated with that patient
raw_data, df_labels = load_data(N=2500)
features = [generate_feature_vector(df) for _, df in tqdm(sorted(raw_data.items()), desc='Generating feature vectors')]

df_features = pd.DataFrame(features).sort_index(axis=1)
feature_names = df_features.columns.tolist()
X, y = df_features.values, df_labels['In-hospital_death'].values
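
To see why `X` can contain missing values, here is a minimal sketch on made-up data (the two toy patients and their values are assumptions, not taken from the dataset). A variable with no measurements has an empty `Value` series, whose mean is `nan`, and that `nan` carries through into the feature matrix.

```python
import numpy as np
import pandas as pd

# A variable with no measurements yields an empty series; its mean is nan.
empty = pd.Series([], dtype=float)
mean_of_nothing = np.mean(empty)

# Two hypothetical patients; the second has no heart-rate measurements.
toy_features = [
    {'Age': 32.0, 'mean_HR': 84.0},
    {'Age': 50.0, 'mean_HR': mean_of_nothing},
]

# Same construction as above: list of dicts -> DataFrame -> NumPy matrix.
toy_df = pd.DataFrame(toy_features).sort_index(axis=1)
toy_X = toy_df.values

print(toy_df.columns.tolist())
print(toy_X.shape)
print(np.isnan(toy_X).sum())  # number of missing entries
```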

### Before anything

In [None]:
# What is X?
X

In [None]:
# Tip: wrapping a matrix in pd.DataFrame displays it nicely
pd.DataFrame(X)

## Helper functions to implement...

- `impute_missing_values_by_mean(X)`: <br> Given a feature matrix `X` with missing values (each row corresponds to a patient admission and each column to a feature), we consider each feature column independently. For each column, we impute the missing values by replacing them with the mean of the observed values in that column. Hint: use `np.nanmean()` to compute the mean of an `np.array` that contains `np.nan` values.
- `normalize_feature_matrix(X)`: <br> Notice that many of these feature values lie on very different scales. Here, we address this issue by normalizing the features. Given a feature matrix `X` (each row corresponds to a patient admission and each column to a feature), now without any missing values, we use the following formula to normalize each feature column $x_d$ to the range $[0, 1]$:
$$\tilde{x}_d = \frac{(x_d - \min)}{(\max - \min)} \text{, where } \min = \min_{i=1\dots n} x_d^{(i)} \text{, and } \max = \max_{i=1\dots n} x_d^{(i)}$$
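
As a small worked illustration of the two ingredients above (the toy numbers are assumptions, not values from the dataset), the sketch below shows what `np.nanmean` returns per column and applies the min-max formula to a single column. It is not a complete solution to the exercise.

```python
import numpy as np

# Hypothetical 3-patient, 2-feature matrix with one missing entry per column.
X_toy = np.array([
    [60.0, np.nan],
    [80.0, 1.0],
    [np.nan, 3.0],
])

# np.nanmean ignores NaN entries; axis=0 gives one mean per column.
col_means = np.nanmean(X_toy, axis=0)
print(col_means)  # column means: 70 and 2

# Min-max formula applied to one fully observed column x_d.
x_d = np.array([60.0, 80.0, 70.0])
x_tilde = (x_d - x_d.min()) / (x_d.max() - x_d.min())
print(x_tilde)  # 60 -> 0, 80 -> 1, 70 -> 0.5
```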

In [None]:
def impute_missing_values_by_mean(X):
    """
    For each feature column, impute missing values (np.nan) with the 
    population mean for that feature.
    
    Args:
        X: np.array, shape (N, d). X could contain missing values
    Returns:
        X: np.array, shape (N, d). X does not contain any missing values
    """
    # TODO: implement this function
    return X


from sklearn.preprocessing import MinMaxScaler
def normalize_feature_matrix(X):
    """
    For each feature column, normalize all values to range [0, 1].

    Args:
        X: np.array, shape (N, d).
    Returns:
        X: np.array, shape (N, d). Values are normalized per column.
    """
    # TODO: implement this function
    return X


### Test: Mean imputation

In [None]:
X_imputed = impute_missing_values_by_mean(X)

In [None]:
# What is X_imputed?


### Test: 0/1 normalization

In [None]:
X_normalized = normalize_feature_matrix(X_imputed)

In [None]:
# What is X_normalized?


### Finally...

In [None]:
## TODO
# What's the average feature vector (after mean imputation and 0/1 normalization)? 


## Challenge

- Implement both functions without using for-loops (explicit for-loops in Python are slow compared with vectorized NumPy operations).
- Implement each function with one line. 
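
One possible loop-free approach relies on NumPy broadcasting. The sketch below works on a made-up matrix and is only one of several ways to do this; it assumes no column is constant, since a column with `max == min` would cause a division by zero.

```python
import numpy as np

# Hypothetical toy matrix with one missing entry per column.
X_toy = np.array([
    [60.0, np.nan],
    [80.0, 1.0],
    [np.nan, 3.0],
])

# np.where picks the column mean wherever an entry is NaN; the (d,) vector
# of means broadcasts across the (N, d) matrix.
X_filled = np.where(np.isnan(X_toy), np.nanmean(X_toy, axis=0), X_toy)

# Column-wise min and max broadcast the same way.
# Assumption: no column is constant (max == min would divide by zero).
X_scaled = (X_filled - X_filled.min(axis=0)) / (
    X_filled.max(axis=0) - X_filled.min(axis=0)
)

print(X_filled)
print(X_scaled)
```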

## Challenge+

References:
- scikit-learn – "imputation of missing values": https://scikit-learn.org/stable/modules/impute.html
- Imputation techniques with examples: https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779

Implement one (or more) of the other imputation techniques, and compare it with mean imputation in terms of:
- runtime/efficiency,
- the resulting feature distribution,
- the assumptions each technique makes.
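
As one example of an alternative, the sketch below uses median imputation on a made-up skewed column. The helper name `impute_by_median` is hypothetical, and median imputation is just one option among those in the references. It shows how the fill value differs from the mean when an outlier is present.

```python
import numpy as np

def impute_by_median(X):
    """Hypothetical helper: fill NaNs with the per-column median."""
    return np.where(np.isnan(X), np.nanmedian(X, axis=0), X)

# Toy skewed column: one outlier (100) and one missing entry.
X_toy = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])

mean_fill = np.nanmean(X_toy, axis=0)[0]      # pulled up by the outlier
median_fill = np.nanmedian(X_toy, axis=0)[0]  # robust to the outlier
X_imp_median = impute_by_median(X_toy)

print(mean_fill, median_fill)
```

On the real feature matrix, the same comparison can be repeated per column, for instance by plotting histograms of each imputed column.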

In [None]:
def impute_missing_values_by_????(X):
    """
    Impute missing values (np.nan) using 
    your choice of imputation technique. 
    
    Args:
        X: np.array, shape (N, d). X could contain missing values
    Returns:
        X: np.array, shape (N, d). X does not contain any missing values
    """
    # TODO: implement this function
    return X

In [None]:
X_imp_mean = impute_missing_values_by_mean(X)
X_imp_???? = impute_missing_values_by_????(X)

## Challenge++

References:
- scikit-learn – "preprocessing": https://scikit-learn.org/stable/modules/preprocessing.html
- https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e

Try out other feature scaling/normalization/transformation techniques, and compare them with 0/1 normalization. Consider:
- Range of resultant feature values
- What assumptions might a model make about its input features?
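
For instance, standardization (z-scoring) can be compared with 0/1 normalization on a made-up matrix. This is a sketch under the assumption that no column is constant. Standardization is only one of the techniques in the references.

```python
import numpy as np

# Hypothetical toy matrix with two features on different scales.
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# 0/1 normalization: every column ends up exactly in [0, 1].
X_minmax = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))

# Standardization: every column ends up with mean 0 and standard deviation 1,
# but values are not confined to a fixed interval.
# Assumption: no column is constant (std == 0 would divide by zero).
X_zscore = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_minmax.min(axis=0), X_minmax.max(axis=0))
print(X_zscore.min(axis=0), X_zscore.max(axis=0))
```

The printed ranges show the contrast: min-max scaling fixes the endpoints of each column, whereas z-scores center each column at zero and their range depends on the data.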

In [None]:
def normalize_feature_matrix_by_???(X):
    """
    For each feature column, apply your chosen "normalization".

    Args:
        X: np.array, shape (N, d).
    Returns:
        X: np.array, shape (N, d). Values are normalized per column.
    """
    # TODO: implement this function
    return X
