# SBAR
The SBAR (Situation, Background, Assessment, Recommendation) communication technique was originally developed by the military, specifically for nuclear submarines. It is now also popular in healthcare.

## Situation
This post is a fork of [Ahmet Erdem's][1] notebook [here][2].
Recent LOFO (Leave One Feature Out) importance results: **numerical** `R` and `C` were the top 2 features, each with more than 5 times the importance of `u_in_cumsum`.

## Background
-  Both `R` and `C` take only 3 distinct values by design in the Google Ventilator data, so they lack the variation expected of a useful continuous numerical feature.
-  Many shared ventilator kernels, including LGBM and LSTM models, used only **categorical** dummy coding of `R` and `C`.
-  Additionally, thanks to [Chris Deotte](https://www.kaggle.com/cdeotte), [LSTM permutation feature importance](https://www.kaggle.com/cdeotte/lstm-feature-importance) has also been posted, where `u_in_diff1` to `u_in_diff4` were the top 4 features.
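
To make the dummy coding concrete, here is a minimal sketch on a toy frame. The frame itself is made up; only the level values (`R` in 5/20/50, `C` in 10/20/50) and the `R_cat`/`C_cat` naming follow the columns used later in this notebook.

```python
import pandas as pd

# Toy frame (hypothetical rows) using the R and C levels seen in this notebook.
toy = pd.DataFrame({
    "R": [5, 20, 50, 5],
    "C": [10, 20, 50, 50],
})

# Casting to str marks the columns as categories rather than numbers.
toy["R_cat"] = toy["R"].astype(str)
toy["C_cat"] = toy["C"].astype(str)

# One indicator column per level; the numeric R and C columns are kept as-is.
encoded = pd.get_dummies(toy, columns=["R_cat", "C_cat"])

print(toy["R"].nunique(), toy["C"].nunique())
print(sorted(encoded.columns))
```

With only three levels per variable, the dummies can express any non-monotonic effect of `R` or `C`, which a single numeric column cannot do in a linear sense.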

## Assessment
How useful is LOFO here? Two simple experiments:
-  `R` and `C`: the predictive ability of the raw numerical variables was basically absorbed by the categorical codings.
-  Evaluated on the same 50 features from the LSTM.
    -  `u_in_diff1` to `u_in_diff4` were no longer among the top 5 features by LOFO.
    -  Categorical `R_50` and `C_10` were among the top 5 features.
    -  Most new features derived from `u_out` had little importance.
    -  Negative importance: `breath_id__u_in__diffmean`
        -  When it was removed from the LSTM, the OOF val_loss decreased slightly, by 0.0004.

## Recommendation
* Categorical one-hot encodings (OHE) are better than raw numerical features for `R` and `C`.
* Feature importance can differ a lot between LOFO and permutation methods, or between LGBM and LSTM, as they are very different models.
* LOFO needs only a few lines of code and doesn't need a TPU or other accelerators.

[1]: https://www.kaggle.com/aerdem4
[2]: https://www.kaggle.com/aerdem4/google-ventilator-lofo-feature-importance
[3]: https://www.kaggle.com/manabendrarout/single-bi-lstm-model-pressure-predict-gpu-infer


## References
1. https://www.kaggle.com/aerdem4/google-ventilator-lofo-feature-importance
1. https://www.kaggle.com/manabendrarout/single-bi-lstm-model-pressure-predict-gpu-infer
1. https://www.kaggle.com/cdeotte/lstm-feature-importance

In [None]:
!pip install lofo-importance

In [None]:
import numpy as np
import pandas as pd
from tqdm import tqdm

#from sklearn.preprocessing import RobustScaler, MinMaxScaler

import matplotlib.pyplot as plt
%matplotlib inline

import os, sys
pd.set_option('display.max_columns', None)

In [None]:
DEBUG = False
# DEBUG = True

PATH = "../input/ventilator-pressure-prediction"
train = pd.read_csv(f"{PATH}/train.csv")
print(train.shape)
train.head()

In [None]:
if DEBUG:
    df = train[:80*1000]
else:
    df = train   
df.tail()

# What we know

[LOFO](https://github.com/aerdem4/lofo-importance) states some advantages:
*  It is model agnostic
*  It gives negative importance to features that hurt performance upon inclusion
*  It can group the features, such as highly correlated features 
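
The leave-one-feature-out idea can be sketched by hand with scikit-learn. This is only an illustration of the principle on synthetic data, with a linear model and made-up feature names; it is not the internals of the `lofo-importance` package.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))
# Only the first two columns drive the target; the third is pure noise.
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)
names = ["strong", "weak", "noise"]  # hypothetical feature names

cv = KFold(n_splits=4, shuffle=True, random_state=0)
model = LinearRegression()


def cv_scores(cols):
    """Per-fold negative MAE for a model trained on the given columns."""
    return cross_val_score(model, X[:, cols], y, cv=cv,
                           scoring="neg_mean_absolute_error")


base = cv_scores([0, 1, 2])
lofo = {}
for i, name in enumerate(names):
    kept = [j for j in range(3) if j != i]
    # Per-fold importance: score with the feature minus score without it.
    drop = base - cv_scores(kept)
    lofo[name] = (drop.mean(), drop.std())

for name, (mean, std) in lofo.items():
    print(f"{name}: mean={mean:.4f} std={std:.4f}")
```

Because the model is retrained without each feature, a feature whose removal improves the score ends up with a negative mean importance.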

In [None]:
# from Version 1: https://www.kaggle.com/aerdem4/google-ventilator-lofo-feature-importance
def engineer_features(df):
    df = df.copy()
    df["u_in_sum"] = df.groupby("breath_id")["u_in"].transform("sum")
    df["u_in_cumsum"] = df.groupby("breath_id")["u_in"].cumsum()
    df["u_in_std"] = df.groupby("breath_id")["u_in"].transform("std")
    df["u_in_min"] = df.groupby("breath_id")["u_in"].transform("min")
    df["u_in_max"] = df.groupby("breath_id")["u_in"].transform("max")
    df["u_in_cumsum_reverse"] = df["u_in_sum"] - df["u_in_cumsum"]
    
    df["u_in_lag1"] = df.groupby("breath_id")["u_in"].shift(1)
    df["u_in_lead1"] = df.groupby("breath_id")["u_in"].shift(-1)
    df["u_in_lag1_diff"] = df["u_in"] - df["u_in_lag1"]
    df["u_in_lead1_diff"] = df["u_in"] - df["u_in_lead1"]
    
    df["u_out_sum"] = df.groupby("breath_id")["u_out"].transform("sum")
    
    df["time_passed"] = df.groupby("breath_id")["time_step"].diff()
    
    return df
    
df = engineer_features(df)

In [None]:
in_df = df[df["u_out"] == 0].reset_index(drop=True)
in_df.shape

In [None]:
from lofo import Dataset, LOFOImportance, plot_importance
from sklearn.model_selection import GroupKFold

cv = list(GroupKFold(n_splits=4).split(in_df, in_df["pressure"], groups=in_df["breath_id"]))

features = ["time_step", "u_in", "R", "C",
            "u_in_sum", "u_in_cumsum", "u_in_std", "u_in_min", "u_in_max", "u_in_cumsum_reverse",
            "u_in_lead1", "u_in_lag1", "u_in_lag1_diff", "u_in_lead1_diff",
            "u_out_sum", "time_passed"]

ds = Dataset(in_df, target="pressure", features=features,
    feature_groups=None,
    auto_group_threshold=0.9
)

LOFO will run **LightGBM** as a default model, if a model is not passed to it.

In [None]:
lofo_imp = LOFOImportance(ds, cv=cv, scoring="neg_mean_absolute_error")

importance_df = lofo_imp.get_importance()
# add coefficient of variation
importance_df['CoV'] = importance_df['importance_std'] / importance_df['importance_mean']
importance_df

The output is the same as in the original [kernel](https://www.kaggle.com/aerdem4/google-ventilator-lofo-feature-importance).

In [None]:
plot_importance(importance_df, figsize=(8, 8))

## Function

In [None]:
def LOFO_features(df, features, threshold=0.9, figsize=(8,8)):
    # Build the folds from the frame passed in, not from the global in_df
    cv = list(GroupKFold(n_splits=4).split(df, df["pressure"], groups=df["breath_id"]))
    ds = Dataset(df, target="pressure", features=features,
                feature_groups=None,
                auto_group_threshold=threshold)
    lofo_imp = LOFOImportance(ds, cv=cv, scoring="neg_mean_absolute_error")

    importance_df = lofo_imp.get_importance()
    importance_df['CoV'] = importance_df['importance_std'] / importance_df['importance_mean']
    
#     plot_importance(importance_df, figsize=(12, 12))
    plot_importance(importance_df, figsize=figsize, kind='box')
    
    return importance_df


In [None]:
def lab_features(df):
    '''
    Small experiments during inspiratory phase (u_out=0) for some relevant features. 
    '''
    df = df.copy()
    ## Top 5 Permutation features from LSTM (LB 0.152 after rounding)
    ## Refer: https://www.kaggle.com/cdeotte/lstm-feature-importance
    df['u_in_diff1'] = df.groupby('breath_id')['u_in'].diff(1)
    df['u_in_diff2'] = df.groupby('breath_id')['u_in'].diff(2)
    df['u_in_diff3'] = df.groupby('breath_id')['u_in'].diff(3)
    df['u_in_diff4'] = df.groupby('breath_id')['u_in'].diff(4)
    df['u_in_lag2'] = df.groupby('breath_id')['u_in'].shift(2)
    
    df['u_in_lag4'] = df.groupby('breath_id')['u_in'].shift(4) 
    
    df['u_in_cumsum'] = df.groupby('breath_id')['u_in'].cumsum()       
    
    df['u_in_first'] = df.groupby(['breath_id'])['u_in'].transform('first')
    df['u_in_last'] = df.groupby(['breath_id'])['u_in'].transform('last')
    df['u_in_max'] = df.groupby(['breath_id'])['u_in'].transform('max')
    df['u_in_min'] = df.groupby(['breath_id'])['u_in'].transform('min')
    df['u_in_mean'] = df.groupby(['breath_id'])['u_in'].transform('mean')
    df['breath_id__u_in__diffmean'] = df.groupby(['breath_id'])['u_in'].transform('mean') - df['u_in']    
   
    df['R_cat'] = df['R'].astype(str)
    df['C_cat'] = df['C'].astype(str)    
    df['RC'] = df['R'].astype(str) + '__' + df['C'].astype(str)
    df = pd.get_dummies(df)
    
    df = df.fillna(0)
    
    return df


## Lab Data

In [None]:
if DEBUG:
    df = train[:80*1000]
else:
    df = train    
df.tail()

In [None]:
df.shape

In [None]:
df = lab_features(df)
sorted(df.columns)

In [None]:
in_df = df[df["u_out"] == 0].reset_index(drop=True)
in_df.shape

In [None]:
in_df

# R/C: Numeric vs Categorical Coding

## Lab1A: Numeric R/C + Cat Combinations

In [None]:
## Keep numerical R and C, but do not include their dummies for now
features_1a = [col for col in df.columns if col not in ['pressure', 'id', 'breath_id',
                                                    'C_cat_10', 'C_cat_20', 'C_cat_50', 
                                                    'R_cat_5', 'R_cat_20', 'R_cat_50',                                  
                                                    ]]    
len(features_1a), sorted(features_1a) 

In [None]:
Dataset(in_df, target="pressure", features=features_1a,
    feature_groups=None,
    auto_group_threshold=0.9
)

### Set correlation threshold = 1, not auto-grouping features

In [None]:
importance_1a = LOFO_features(df = in_df,features = features_1a, threshold=1, figsize=(8,8))
importance_1a = importance_1a.reset_index(drop=True)
importance_1a

-  `u_in_diff1` to `u_in_diff4` were not among the top 9 features by LOFO.
-  The importance of the numerical `R` and `C` was reduced, but they were still among the top 5 features after including their combinations.

My guess on CoV: although some features had a larger coefficient of variation, their boxplot sizes look comparable to those of other features. With only 4-fold cross-validation, the coefficient of variation might not be a good exclusion indicator. A small CoV (say <10%) together with a high importance mean is likely a good inclusion indicator for a feature, such as `u_in_cumsum`.
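
The inclusion rule of thumb above can be written as a small filter. This is a sketch on a made-up importance table: the feature names and numbers are hypothetical, and taking the absolute mean is my own variation so that a negative importance does not flip the sign of CoV.

```python
import pandas as pd

# Hypothetical importance table; the values are invented for illustration only.
toy_importance = pd.DataFrame({
    "feature": ["feat_a", "feat_b", "feat_c"],
    "importance_mean": [0.50, 0.002, -0.001],
    "importance_std": [0.02, 0.004, 0.003],
})

# Coefficient of variation relative to the magnitude of the mean importance.
toy_importance["CoV"] = (
    toy_importance["importance_std"] / toy_importance["importance_mean"].abs()
)

# Rule of thumb from the text: small CoV (< 10%) and a positive mean importance.
stable = toy_importance[
    (toy_importance["CoV"] < 0.10) & (toy_importance["importance_mean"] > 0)
]
print(stable["feature"].tolist())
```

Features with a mean importance close to zero get a very large CoV by construction, which is one reason CoV alone is a weak exclusion signal.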

## Lab1B: Numeric + Cat Combinations + Dummies

In [None]:
features_1b = [col for col in df.columns if col not in ['pressure', 'id', 'breath_id']]
len(features_1b), sorted(features_1b) 

In [None]:
Dataset(in_df, target="pressure", features=features_1b,
    feature_groups=None,
    auto_group_threshold=0.9
)

In [None]:
importance_1b = LOFO_features(df = in_df,features = features_1b, threshold=1, figsize=(8,8))
importance_1b = importance_1b.reset_index(drop=True)
importance_1b

## Lab1C: Remove Numeric R/C

In [None]:
features_1c = [col for col in df.columns if col not in ['pressure', 'id', 'breath_id', 'C', 'R']]
len(features_1c), sorted(features_1c)

In [None]:
Dataset(in_df, target="pressure", features=features_1c,
    feature_groups=None,
    auto_group_threshold=0.9
)

importance_1c = LOFO_features(df = in_df,features = features_1c, threshold=1, figsize=(8,8))
importance_1c = importance_1c.reset_index(drop=True)
importance_1c

## Side-by-Side Comparison

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 8))

ax1.barh(importance_1a['feature'], importance_1a['importance_mean'])
ax1.invert_yaxis()
ax1.set_xlabel('LOFO Feature Importance')
ax1.title.set_text('1a.Numeric R/C + Combinations')
ax1.annotate('Numeric R/C still\n top features', xy=(0.43, .82), xycoords='axes fraction', fontsize=16, color='green')

ax2.barh(importance_1b['feature'], importance_1b['importance_mean'])
ax2.invert_yaxis()
ax2.set_xlabel('LOFO Feature Importance')
ax2.title.set_text('1b.Numeric R/C + Combinations + Dummies')
ax2.annotate('After their dummies included,\nnumeric R/C move down\nto non-importance', xy=(0.15, .3), xycoords='axes fraction', fontsize=16, color='green')
            
ax3.barh(importance_1c['feature'], importance_1c['importance_mean'])
ax3.invert_yaxis()
ax3.set_xlabel('LOFO Feature Importance')
ax3.title.set_text('1c.Numeric R/C Removed')
ax3.annotate('R_cat_50 and\n C_cat_10 move up', xy=(0.22, .75), xycoords='axes fraction', fontsize=16, color='red')
#ax3.text(0.3, 0.8, 'R_cat_50 and C_cat_10 move up', transform=ax3.transAxes)

fig.tight_layout()
plt.show()

### After adding categorical R and C dummies, the raw numerical features became less important.

-  `u_in_last` was the top feature here, even though the evaluation was restricted to the inspiratory phase (`u_out` = 0).
-  `u_in_diff1` to `u_in_diff4` were not among the top 5 features by LOFO importance.
    -  By contrast, they were the top 4 [permutation features](https://www.kaggle.com/cdeotte/lstm-feature-importance) for the LSTM. Note that LOFO uses LGBM here, which is a different model from the LSTM.
-  `u_in_lag4` had much less LOFO importance than `u_in_lag2`.
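
One possible reason the two methods disagree, independent of the model type, is correlated features. The sketch below uses synthetic data and a linear model (both my assumptions, not this notebook's setup): permutation importance shuffles a column while keeping the fitted model fixed, whereas a LOFO-style score retrains without the column, so a near-duplicate feature can take over.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 400
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)  # near-duplicate of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + x3 + 0.1 * rng.normal(size=n)

scoring = "neg_mean_absolute_error"
cv = KFold(n_splits=4, shuffle=True, random_state=0)

# Permutation importance of x1: shuffle the column, keep the fitted model.
# (Computed on the training data here, to keep the sketch short.)
fitted = LinearRegression().fit(X, y)
perm = permutation_importance(fitted, X, y, scoring=scoring,
                              n_repeats=10, random_state=0)
perm_x1 = perm.importances_mean[0]

# LOFO-style importance of x1: drop the column and retrain on x2 and x3.
full = cross_val_score(LinearRegression(), X, y, cv=cv, scoring=scoring).mean()
without_x1 = cross_val_score(LinearRegression(), X[:, [1, 2]], y,
                             cv=cv, scoring=scoring).mean()
lofo_x1 = full - without_x1

print(f"permutation importance of x1: {perm_x1:.3f}")
print(f"LOFO-style importance of x1:  {lofo_x1:.3f}")
```

In this toy setting the permutation score for `x1` is large while its LOFO-style score is close to zero. The lag and diff features of `u_in` are strongly related to each other, so a similar effect may contribute to the differences seen here, though this sketch does not establish that.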

# 50 Features by LSTM

In [None]:
# from https://www.kaggle.com/tenffe/finetune-of-tensorflow-bidirectional-lstm/notebook
def add_features(df):
    df = df.copy()
    df['area'] = df['time_step'] * df['u_in']
    df['area'] = df.groupby('breath_id')['area'].cumsum()
    
    df['u_in_cumsum'] = (df['u_in']).groupby(df['breath_id']).cumsum()
    
    df['u_in_lag1'] = df.groupby('breath_id')['u_in'].shift(1)
    df['u_out_lag1'] = df.groupby('breath_id')['u_out'].shift(1)
    df['u_in_lag_back1'] = df.groupby('breath_id')['u_in'].shift(-1)
    df['u_out_lag_back1'] = df.groupby('breath_id')['u_out'].shift(-1)
    df['u_in_lag2'] = df.groupby('breath_id')['u_in'].shift(2)
    df['u_out_lag2'] = df.groupby('breath_id')['u_out'].shift(2)
    df['u_in_lag_back2'] = df.groupby('breath_id')['u_in'].shift(-2)
    df['u_out_lag_back2'] = df.groupby('breath_id')['u_out'].shift(-2)
    df['u_in_lag3'] = df.groupby('breath_id')['u_in'].shift(3)
    df['u_out_lag3'] = df.groupby('breath_id')['u_out'].shift(3)
    df['u_in_lag_back3'] = df.groupby('breath_id')['u_in'].shift(-3)
    df['u_out_lag_back3'] = df.groupby('breath_id')['u_out'].shift(-3)
    df['u_in_lag4'] = df.groupby('breath_id')['u_in'].shift(4)
    df['u_out_lag4'] = df.groupby('breath_id')['u_out'].shift(4)
    df['u_in_lag_back4'] = df.groupby('breath_id')['u_in'].shift(-4)
    df['u_out_lag_back4'] = df.groupby('breath_id')['u_out'].shift(-4)
    df = df.fillna(0)
    
    df['breath_id__u_in__max'] = df.groupby(['breath_id'])['u_in'].transform('max')
    df['breath_id__u_out__max'] = df.groupby(['breath_id'])['u_out'].transform('max')
    
    df['u_in_diff1'] = df['u_in'] - df['u_in_lag1']
    df['u_out_diff1'] = df['u_out'] - df['u_out_lag1']
    df['u_in_diff2'] = df['u_in'] - df['u_in_lag2']
    df['u_out_diff2'] = df['u_out'] - df['u_out_lag2']
    
    df['breath_id__u_in__diffmax'] = df.groupby(['breath_id'])['u_in'].transform('max') - df['u_in']
    df['breath_id__u_in__diffmean'] = df.groupby(['breath_id'])['u_in'].transform('mean') - df['u_in']
    
    ## The following two lines should be deleted, per zhangxin's comment at https://www.kaggle.com/tenffe/finetune-of-tensorflow-bidirectional-lstm/comments
    ## They just duplicate u_in_diffmax and u_in_diffmean rather than the intended u_out_diffmax and u_out_diffmean
#     df['breath_id__u_in__diffmax'] = df.groupby(['breath_id'])['u_in'].transform('max') - df['u_in']
#     df['breath_id__u_in__diffmean'] = df.groupby(['breath_id'])['u_in'].transform('mean') - df['u_in']
    
    df['u_in_diff3'] = df['u_in'] - df['u_in_lag3']
    df['u_out_diff3'] = df['u_out'] - df['u_out_lag3']
    df['u_in_diff4'] = df['u_in'] - df['u_in_lag4']
    df['u_out_diff4'] = df['u_out'] - df['u_out_lag4']
    df['cross']= df['u_in']*df['u_out']
    df['cross2']= df['time_step']*df['u_out']
    
    df['R'] = df['R'].astype(str)
    df['C'] = df['C'].astype(str)
    df['R__C'] = df["R"].astype(str) + '__' + df["C"].astype(str)
    df = pd.get_dummies(df)
    return df
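
Note that `u_in_diff1` to `u_in_diff4` are not computed identically in `lab_features` and `add_features`. The first differences within each breath and then fills the leading NaN with 0; the second fills the lag with 0 first and then subtracts. A minimal sketch on a toy frame (the values are made up) shows that the two versions differ on the first row of each breath:

```python
import pandas as pd

# Toy data: two short breaths with hypothetical u_in values.
toy = pd.DataFrame({
    "breath_id": [1, 1, 1, 2, 2, 2],
    "u_in": [1.0, 3.0, 6.0, 2.0, 2.0, 5.0],
})

# Style of lab_features: difference first, then fill the leading NaN with 0.
diff_then_fill = toy.groupby("breath_id")["u_in"].diff(1).fillna(0)

# Style of add_features: fill the lag with 0 first, then subtract.
lag1 = toy.groupby("breath_id")["u_in"].shift(1).fillna(0)
fill_then_diff = toy["u_in"] - lag1

comparison = toy.assign(diff_then_fill=diff_then_fill,
                        fill_then_diff=fill_then_diff)
print(comparison)
```

In the second style the first row of each breath carries the raw `u_in` value instead of 0, so importances for the same feature name are not strictly comparable across the two experiments.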


## Re-loading Data

In [None]:
if DEBUG:
    df = train[:80*1000]
else:
    df = train    
df.tail()

In [None]:
df = add_features(df)

In [None]:
df.shape, sorted(df.columns)

In [None]:
features_LSTM = [col for col in df.columns if col not in ['pressure', 'id', 'breath_id']]

In [None]:
in_df = df[df["u_out"] == 0].reset_index(drop=True)
in_df.shape

In [None]:
Dataset(in_df, target="pressure", features=features_LSTM,
    feature_groups=None,
    auto_group_threshold=0.9
)

In [None]:
cv = list(GroupKFold(n_splits=4).split(in_df, in_df["pressure"], groups=in_df["breath_id"]))

In [None]:
ds = Dataset(in_df, target="pressure", features=features_LSTM,
    feature_groups=None,
    auto_group_threshold=1
)

lofo_imp = LOFOImportance(ds, cv=cv, scoring="neg_mean_absolute_error")
importance_LSTM_in = lofo_imp.get_importance().reset_index(drop=True)
importance_LSTM_in['CoV'] = importance_LSTM_in['importance_std'] / importance_LSTM_in['importance_mean']
importance_LSTM_in

In [None]:
importance_LSTM_in.to_csv('LOFO_importance_LSTM_in.csv')

In [None]:
# some negative features
plot_importance(importance_LSTM_in, figsize=(12, 12))

When `breath_id__u_in__diffmean` was removed from the LSTM, the OOF val_loss decreased slightly, by 0.0004.

## All Data `u_out` = 0 & 1
(Took > 2hr and failed twice)