# Google Brain Ventilator Pressure Competition
We can use this competition to practice with the fastai tabular module. See [Fastbook Chapter 9](https://nbviewer.jupyter.org/github/fastai/fastbook/blob/master/09_tabular.ipynb) for more details. Most of the code is adapted from that chapter and applied to a different dataset, as practice. Some of our own ideas, and ideas taken from the discussion forums, are integrated into this notebook as well. 

In [None]:
!pip install -Uqq fastbook kaggle waterfallcharts treeinterpreter dtreeviz
import fastbook
fastbook.setup_book()

In [None]:
import seaborn as sns
from fastbook import *
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG, clear_output

pd.options.display.max_rows = 20
pd.options.display.max_columns = 10

In [None]:
path = Path("../input/ventilator-pressure-prediction")

In [None]:
df = pd.read_csv(path/"train.csv", low_memory=False)
df_test = pd.read_csv(path/"test.csv", low_memory=False)
df.columns

In [None]:
df.head()

In [None]:
df["breath_id"].unique()

In [None]:
len(df)

In [None]:
df.R.unique(), df.C.unique()

In [None]:
# https://www.kaggle.com/c/ventilator-pressure-prediction/discussion/273974
df["u_in_cumsum"] = df.u_in.groupby(df.breath_id).cumsum()
df_test["u_in_cumsum"] = df_test.u_in.groupby(df_test.breath_id).cumsum()
df.head()

In [None]:
procs = [Categorify, FillMissing]

In [None]:
np.where(df.time_step == 0)

So does each breath span 80 time steps? Let's check. 

In [None]:
np.count_nonzero(np.where(df.time_step == 0)[0] % 80)

Yes: every row with `time_step == 0` sits at a multiple of 80, so each breath covers 80 steps. Instead of using real (floating point) time, we could turn `time_step` into a step index. Assuming `id` increases by one per row, we can get this index from `id` modulo 80. 
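
The modulo trick depends on `id` being consecutive and on every breath having the same number of rows. As a minimal sketch on a hypothetical toy frame (3 rows per breath instead of 80), the same index can also be derived with `groupby(...).cumcount()`, which does not look at `id` at all:

```python
import pandas as pd

# Hypothetical toy frame: two breaths of three rows each
# (the real data has 80 rows per breath).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "breath_id": [1, 1, 1, 2, 2, 2],
    "time_step": [0.00, 0.03, 0.07, 0.00, 0.03, 0.06],
})
rows_per_breath = 3

# Modulo approach: assumes consecutive ids and a fixed breath length.
toy["step_from_id"] = (toy.id - 1) % rows_per_breath

# Position of each row within its breath, independent of the id column.
toy["step_from_cumcount"] = toy.groupby("breath_id").cumcount()

print(toy)
```

Both columns agree on this toy data; `cumcount` would be the safer choice if the rows were ever filtered or reordered by breath.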

In [None]:
df["new_time_step"] = (df.id - 1) % 80
df_test["new_time_step"] = (df_test.id - 1) % 80
df.head()

In [None]:
df.new_time_step.unique()

In [None]:
df.head(82)

For now, we keep both `time_step` and `u_in`. We might decide to remove them later. 

In [None]:
df = df.drop(columns="id")
df_test = df_test.drop(columns="id")
df

In [None]:
df.isnull().values.any(), df.isnull().sum().sum()

In [None]:
# Split with GroupKFold so the same breath never appears in both the training and validation sets, to imitate the test set. 
from sklearn.model_selection import GroupKFold

groups = df.breath_id.to_numpy()
X = df.drop(columns=["pressure", "breath_id"]).to_numpy()
y = df.pressure.to_numpy()

gkfold = GroupKFold(n_splits=5)
for train_idx, valid_idx in gkfold.split(X, y, groups): pass

We simply use the last split as our train/validation split. You could also train an n-fold model if you would like to; here we skip that to save the hassle. 
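
To convince ourselves that a group-based split keeps breaths apart, here is a small sketch on hypothetical toy data (10 breaths of 4 rows each; the `toy_*` names are only for illustration). It takes the last fold, as in the cell above, and checks that no breath ends up on both sides:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 10 breaths of 4 rows each.
toy_groups = np.repeat(np.arange(10), 4)
toy_X = np.zeros((len(toy_groups), 1))
toy_y = np.zeros(len(toy_groups))

folds = list(GroupKFold(n_splits=5).split(toy_X, toy_y, toy_groups))
toy_train_idx, toy_valid_idx = folds[-1]  # keep only the last split

# Breaths that appear in both the training and the validation rows.
shared = set(toy_groups[toy_train_idx]) & set(toy_groups[toy_valid_idx])
print(len(toy_train_idx), len(toy_valid_idx), shared)
```

The same check could be run on the real `train_idx` and `valid_idx` with `groups` from the cell above.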

In [None]:
splits = (list(train_idx),list(valid_idx))
dep_var = "pressure"  # dependent variable

# use fastai function. 
cont, cat = cont_cat_split(df, 3, dep_var=dep_var)
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)

The second argument of `cont_cat_split` is a cardinality threshold: a column with at most that many `.unique()` values is treated as categorical. If we put 1, nothing is treated as categorical (unless a column has a single value, which is not predictive and should be removed anyway). Here we put 3, because we saw earlier that `R` and `C` each have 3 distinct values, so they are treated as categorical. `u_out` is binary, so it is treated as categorical too. 
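
The rule can be sketched in plain pandas. The helper `split_by_cardinality` below is hypothetical and only approximates what `cont_cat_split` does, under the assumption that float columns are always continuous and integer columns are continuous only when they have more distinct values than the threshold:

```python
import pandas as pd

def split_by_cardinality(frame, max_card, dep_var):
    """Simplified stand-in for the cardinality rule; not fastai's own code."""
    cont_names, cat_names = [], []
    for name in frame.columns:
        if name == dep_var:
            continue
        col = frame[name]
        is_float = pd.api.types.is_float_dtype(col)
        is_big_int = pd.api.types.is_integer_dtype(col) and col.nunique() > max_card
        (cont_names if is_float or is_big_int else cat_names).append(name)
    return cont_names, cat_names

# Hypothetical toy frame with the same kinds of columns as the competition data.
toy = pd.DataFrame({
    "R": [5, 20, 50, 5, 20, 50],
    "C": [10, 20, 50, 10, 20, 50],
    "u_out": [0, 0, 0, 1, 1, 1],
    "new_time_step": [0, 1, 2, 3, 4, 5],
    "u_in": [0.1, 0.5, 0.9, 0.0, 0.0, 0.0],
    "pressure": [5.0, 6.0, 7.0, 6.5, 6.0, 5.5],
})
toy_cont, toy_cat = split_by_cardinality(toy, max_card=3, dep_var="pressure")
print(toy_cont, toy_cat)
```

With a threshold of 3, `R`, `C` and `u_out` land in the categorical list, while the step index and `u_in` stay continuous.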

In [None]:
len(to.train), len(to.valid)

In [None]:
to.show(3)

In [None]:
TabularPandas(df, procs, cat, [], y_names=dep_var, splits=splits).show(3)

In [None]:
to.items.head(3)

Although `show` displays the original values, the underlying values used for training are numeric. You can see them with `to.items.head(...)`, which is what is actually passed to training; `show` decodes the internal values back to their original labels. 
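
The encode/decode round trip can be illustrated without fastai. This is only a sketch of the idea, assuming a vocabulary where index 0 is reserved for missing values; the names `vocab`, `encoded` and `decoded` are made up for the example:

```python
import pandas as pd

toy_R = pd.Series([20, 50, 5, 20])
categories = sorted(toy_R.unique().tolist())  # distinct values in sorted order
vocab = ["#na#"] + categories                 # index 0 reserved for missing values

# Encode: position in the vocabulary (the kind of value a model is trained on).
encoded = toy_R.map({value: idx for idx, value in enumerate(vocab)})

# Decode: look the original value back up (what a `show`-style display does).
decoded = encoded.map(lambda idx: vocab[idx])

print(encoded.tolist(), decoded.tolist())
```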

In [None]:
to.classes["R"]

`#na#` stands for a missing value. It is probably there just in case, because we previously checked that the dataframe doesn't have any NaN values. Let's check again to make sure this column has no NaN. 

In [None]:
df.R.isnull().values.any()

In [None]:
save_path = Path("/kaggle/working")
save_pickle(save_path/"to.pkl", to)  # save preprocessing steps. 

## Creating Decision Tree

In [None]:
# to = load_pickle(save_path/"to.pkl")

xs, y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y

In [None]:
m = DecisionTreeRegressor(max_leaf_nodes=4)
m.fit(xs, y)

In [None]:
draw_tree(m, xs, size=10, leaves_parallel=True, precision=2)

We can use dtreeviz to show information. 

In [None]:
samp_idx = np.random.permutation(len(y))[:500]
dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,
        fontname="DejaVu Sans", scale=1.6, label_fontsize=11, orientation="LR")

After `time_step` 1.065, the patient is breathing out, so the pressure stays "constant". Something looks odd with `u_in_cumsum`: do some patients have much larger lungs than the average patient? One doesn't know what this means lol... Let's plot the final value of `u_in_cumsum` for each breath and see what it means. 

In [None]:
cumsum_y = df[df.new_time_step == 79].u_in_cumsum.to_numpy()
cumsum_y

In [None]:
# histogram plot? 
plt.figure(dpi=100)
sns.distplot(cumsum_y)

In [None]:
cumsum_test_y = df_test[df_test.new_time_step == 79].u_in_cumsum.to_numpy()
plt.figure(dpi=100)
sns.distplot(cumsum_test_y)

Some patients have extra huge lungs. It looks like simulated data cannot really represent real data. (It must be simulated from some distribution, which is why such long-tailed distributions can occur.) In particular, **although the values look plausible, they are not real. No one has lungs this large, and this looks very fake**. 

Regardless of whether it's real or fake, one guesses we should limit the maximum size of a patient's lungs. We won't do it immediately, because a later step would overwrite this one, and we would like to see the distribution after that step first. See the next step. 

Second thing: let's see what happens if we stop accumulating `u_in_cumsum` once the patient is breathing out. That means stopping the cumsum at `time_step == 1.065`, which might be around `new_time_step = 40` (or something like that, we have to check). After this time step (or this `new_time_step`), we stop accumulating and keep the value constant. 
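
As a minimal sketch of this idea on a hypothetical toy frame, zeroing `u_in` wherever `u_out == 1` before taking the per-breath cumulative sum keeps the total constant once the patient starts breathing out (the column name `u_in_cumsum_insp` is made up for the example):

```python
import pandas as pd

# Hypothetical toy frame: two short breaths.
toy = pd.DataFrame({
    "breath_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "u_in":      [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0, 5.0],
    "u_out":     [0, 0, 1, 1, 0, 0, 0, 1],
})

# Plain cumulative sum per breath.
toy["u_in_cumsum"] = toy.u_in.groupby(toy.breath_id).cumsum()

# Only accumulate while u_out == 0; afterwards the total stays constant.
inhaled = toy.u_in.where(toy.u_out == 0, 0.0)
toy["u_in_cumsum_insp"] = inhaled.groupby(toy.breath_id).cumsum()

print(toy)
```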

**Note: in the end this is implemented by setting `u_in` to 0 wherever `u_out == 1`. The plot below shows why this could be useful:**

In [None]:
sns.distplot(df[df.u_out == 1].u_in.to_numpy())

**Notice that air is sometimes pumped hard into the lungs while the patient is breathing out. One cannot be sure, as one isn't a doctor, whether that would make it harder for the patient to breathe out.**

So this might be a simulation error again. If not, would the "artificial bellows test lung" be a safe product? **Note that this is AN OPINION. A proper answer requires medical experts to determine whether it is or not**. 

For now, let's continue with our task of setting them == 0. And then, update our `u_in_cumsum`. 

### Update
Since it doesn't improve the result much, we no longer set `df.u_in` to 0. 

In [None]:
# df.u_in.where(df.u_out == 0, 0, inplace=True)
# np.count_nonzero(df[df.u_out == 1].u_in.to_numpy())

In [None]:
# recalculate cumsum after setting to zero. 
# df["u_in_cumsum"] = df.u_in.groupby(df.breath_id).cumsum()
# df

Let's check what the original cumsum distribution looks like. 

In [None]:
cumsum_y = df[df.new_time_step == 79].u_in_cumsum.to_numpy()
plt.figure(dpi=100)
sns.distplot(cumsum_y)

We also consider using the final value of the cumulative sum of each breath as a new column. Then we make it categorical by binning it. The bin widths vary with the values, such that each bin has approximately the same count. Since `u_in` is never negative, the final value of the cumulative sum is also its maximum. 
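
Here is a small sketch of equal-count binning on hypothetical long-tailed toy data (the `train_vals` and `test_vals` series are randomly generated, not competition data). It also shows one possible way to reuse the training edges on another set: `retbins=True` returns the edges, and widening the outer edges to infinity avoids NaN for values outside the training range:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
train_vals = pd.Series(rng.exponential(scale=100.0, size=1000))  # long-tailed toy data
test_vals = pd.Series(rng.exponential(scale=100.0, size=200))

# Quantile bins: each bin holds about the same number of training rows.
train_bins, edges = pd.qcut(train_vals, q=5, retbins=True)
print(train_bins.value_counts().sort_index())

# Reuse the training edges for the test set, widening the outer edges
# so that values outside the training range still fall into a bin.
edges[0], edges[-1] = -np.inf, np.inf
test_bins = pd.cut(test_vals, bins=edges)
print(test_bins.isna().sum())
```

Note that widening the edges changes the interval labels of the outer bins, so the two columns do not share exactly the same categories; passing `labels=False` to both calls would give plain integer bin codes instead.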

In [None]:
# Ideas from https://www.kaggle.com/artgor/ventilator-pressure-prediction-eda-fe-and-models
df["max_cumsum"] = df.groupby("breath_id")["u_in_cumsum"].transform("last")
df_test["max_cumsum"] = df_test.groupby("breath_id")["u_in_cumsum"].transform("last")
df.head()

In [None]:
# We use q=5 bins. Feel free to change the value. 
# Remember to also raise the cardinality threshold passed to `cont_cat_split`
# (used when building `to.pkl`) to at least q, so that this column is treated
# as a categorical variable. 
df["binned_max_cumsum_5fold"] = pd.qcut(df.max_cumsum, q=5)
df.head()

In [None]:
# Use a separate name here so the fitted model `m` is not overwritten.
intervals = df.binned_max_cumsum_5fold.unique().to_numpy()
tbins = sorted([g.right for g in intervals] + [-0.001])

# To avoid NAN, we need to change the smallest and largest bins values to incorporate all.
tbins[0] = df_test.max_cumsum.min() - 0.01
tbins[-1] = df_test.max_cumsum.max()

df_test["binned_max_cumsum_5fold"] = pd.cut(df_test.max_cumsum, tbins)

There are also other transformations in https://www.kaggle.com/artgor/ventilator-pressure-prediction-eda-fe-and-models which we will use here. In particular, the LightGBM feature-importance chart there shows that some columns carry significant importance, so we use them here. We include them even though, for some of them, it isn't obvious why they are important. 

**However, we will check (and readers are welcome to check) whether there is any data leakage. Please report it in the comments if you notice any.**

Then, in the second notebook, we check whether features are "closely related" and "redundant", and remove those that are redundant. Before that, we choose which features give the best result with a greedy method: train and compare MAE while retaining one feature at a time and keeping all else constant. 
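
The lag and lead features in the next cell rely on `groupby(...).shift`, which shifts within each breath and never across breath boundaries. A minimal sketch on a hypothetical toy frame makes this visible: the first row of each breath has no previous value and the last row has no next value, so both are filled with 0.

```python
import pandas as pd

# Hypothetical toy frame: two breaths of three rows each.
toy = pd.DataFrame({
    "breath_id": [1, 1, 1, 2, 2, 2],
    "u_in": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
grouped = toy.groupby("breath_id").u_in

toy["lag1"] = grouped.shift(1).fillna(0)       # previous step within the same breath
toy["lagback1"] = grouped.shift(-1).fillna(0)  # next step within the same breath
toy["diff1"] = toy.u_in - toy.lag1

print(toy)
```

The `lagback` columns look one step into the future of `u_in`. Since `u_in` is an input that is also available for the test set, this is not a leak of the target, but it is worth keeping in mind when checking for leakage.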

In [None]:
gby = df.groupby("breath_id")
df["breathId_uIn_max"] = gby.u_in.transform("max")
df["breathId_uIn_diffmax"] = gby.u_in.transform("max") - df.u_in
df["breathId_uIn_diffmean"] = gby.u_in.transform("mean") - df.u_in

df["uIn_lag1"] = gby.u_in.shift(1)
df["uIn_lag2"] = gby.u_in.shift(2)
df["uIn_lag3"] = gby.u_in.shift(3)

df["uIn_diff1"] = df.u_in - df.uIn_lag1
df["uIn_diff2"] = df.u_in - df.uIn_lag2
df["uIn_diff3"] = df.u_in - df.uIn_lag3  # this is NA in the referenced notebook.

df["uIn_lagback1"] = gby.u_in.shift(-1)
df["uIn_lagback2"] = gby.u_in.shift(-2)
df["uIn_lagback3"] = gby.u_in.shift(-3)

for col in df.columns:
    if df[col].isnull().values.any(): df[col] = df[col].fillna(0)
df.isnull().values.any(), df.isnull().sum().sum()

In [None]:
# repeat for test
gby = df_test.groupby("breath_id").u_in
df_test["breathId_uIn_max"] = gby.transform("max")
df_test["breathId_uIn_diffmax"] = gby.transform("max") - df_test.u_in
df_test["breathId_uIn_diffmean"] = gby.transform("mean") - df_test.u_in

for i in range(1, 4):
    df_test[f"uIn_lag{i}"] = gby.shift(i)
    df_test[f"uIn_lagback{i}"] = gby.shift(-i)
    df_test[f"uIn_diff{i}"] = df_test.u_in - df_test[f"uIn_lag{i}"]  # use the test frame, not df
    
for col in df_test.columns:
    if df_test[col].isnull().values.any(): df_test[col] = df_test[col].fillna(0)
df_test.isnull().values.any(), df_test.isnull().sum().sum()

The reason we cap only now, after looking at `u_out`, is that we could not be sure whether the long-tailed distribution was caused by `u_out`. Seeing the graph above, we know it isn't. Hence, we can cap it. 

In [None]:
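# Assign the capped column back explicitly instead of relying on an in-place
# update through attribute access, which may not modify `df` in newer pandas.
df["u_in_cumsum"] = df.u_in_cumsum.clip(upper=1500)
df_test["u_in_cumsum"] = df_test.u_in_cumsum.clip(upper=1500)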
df.u_in_cumsum.describe()

In [None]:
cumsum_y = df[df.new_time_step == 79].u_in_cumsum.to_numpy()
plt.figure(dpi=100)
sns.distplot(cumsum_y)

At first one decided to cap at 1000, but there are still a lot of values above 1000, so one changed one's mind and capped at 1500. Otherwise most of the density would pile up at 1000. A cap of 1500 is more lenient. 

Of course, you won't see any change in score with decision trees (since they are robust to this). However, for the deep neural network (DNN) that we might experiment with later on, this could be useful. 
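
Capping can be written either as a mask or with `clip`. The sketch below uses a hypothetical toy series to show that the two forms give the same result; both return a new Series, so the result has to be assigned back to the dataframe column:

```python
import pandas as pd

toy_cumsum = pd.Series([120.0, 800.0, 1499.0, 1500.0, 4200.0])
cap = 1500

capped_where = toy_cumsum.where(toy_cumsum < cap, cap)  # mask-based form
capped_clip = toy_cumsum.clip(upper=cap)                # clip-based form

print(capped_clip.tolist())
```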

## Explanation of the long tail of `u_in_cumsum`

**Note**: In notebook 2 one finds that not capping gives slightly better results (something like 5e-4 less in MAE). Second, the reason some values are large and some are small is that when the patient breathes, the machine does not necessarily open the valve only a little (`u_in` is a percentage); it can open fully and let the patient breathe very hard. This is why most values are small (very shallow breaths) while some are large (deep breaths). If you pay attention to your own breathing, you usually breathe very shallowly (not even filling half of your lungs, mostly unconsciously), while at other times you breathe very deeply, "down to the bottom" (when breathing consciously). Unless you practice meditation or other breathing exercises, most people don't breathe that deeply most of the time. 

So the earlier remark about differences in lung size isn't quite correct. It's more about *how much of the lungs is filled during breathing*. 

One more thing: what is the distribution of the maximum `time_step`? One remembers they are not all equal, so let's see how spread out the distribution is before deciding whether or not to use `new_time_step`. 

In [None]:
tsdist = df[df.new_time_step == 79].time_step.to_numpy()
tsdist

In [None]:
plt.figure(dpi=100)
sns.distplot(tsdist)

It looks like the times aren't too far apart. We could try both and see how they work. Some patients have a longer breathing cycle and some a shorter one, but most are still in the range $2.45 \leq x \leq 2.75$. However, the distribution seems to have a long tail, so let's see what the maximum value is. 

In [None]:
tsdist.max()

In [None]:
tsdist = df_test[df_test.new_time_step == 79].time_step.to_numpy()
plt.figure(dpi=100)
sns.distplot(tsdist)

Perhaps this (artificial) patient knows meditative breathing hehehe... One doesn't know how to deal with this. You can't just cap it at a value: there's no such thing as "stopping the patient's time" or "asking the patient to breathe faster". 

We will skip the "no stopping criteria" part of the book's notebook. The data here is much larger than the data used in the book, so it might take a very long time. 

Another thing: instead of writing our own function for MAE (the competition metric), we use `sklearn`'s `mean_absolute_error`. We still define an `m_mae` helper, the equivalent of the book's `m_rmse`. 
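
For reference, MAE is simply the mean of the absolute differences between targets and predictions. A tiny worked sketch with made-up numbers, using only NumPy:

```python
import numpy as np

# Made-up targets and predictions, only for illustration.
y_true = np.array([5.0, 6.0, 7.0, 8.0])
y_pred = np.array([5.5, 6.0, 6.0, 10.0])

# Absolute errors are 0.5, 0.0, 1.0 and 2.0; MAE is their mean.
mae = np.mean(np.abs(y_true - y_pred))
print(mae)
```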

In [None]:
def m_mae(m, xs, y): return mean_absolute_error(y, m.predict(xs))

In [None]:
m_mae(m, xs, y), m_mae(m, valid_xs, valid_y)

Now that we have the baseline, let's make our dataloaders again using the new pandas dataframe. This time, we shall try our third point: make two models and see if using `time_step` is better or `new_time_step`. 

In [None]:
splits = (list(train_idx),list(valid_idx))
dep_var = "pressure"  # dependent variable

# use fastai function. 
df_drop = df.drop(columns="new_time_step")
cont, cat = cont_cat_split(df_drop, 5, dep_var=dep_var)
to = TabularPandas(df_drop, procs, cat, cont,
                   y_names=dep_var, splits=splits)

In [None]:
xs, y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y

m = DecisionTreeRegressor(min_samples_leaf=300)
m.fit(to.train.xs, to.train.y)
m_mae(m, xs, y), m_mae(m, valid_xs, valid_y)

In [None]:
m.get_n_leaves()

Now try with `new_time_step`. 

In [None]:
splits = (list(train_idx),list(valid_idx))
dep_var = "pressure"  # dependent variable

# use fastai function. 
df_drop = df.drop(columns="time_step")
cont, cat = cont_cat_split(df_drop, 5, dep_var=dep_var)
to = TabularPandas(df_drop, procs, cat, cont,
                   y_names=dep_var, splits=splits)

xs, y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y

m = DecisionTreeRegressor(min_samples_leaf=300)
m.fit(to.train.xs, to.train.y)
m_mae(m, xs, y), m_mae(m, valid_xs, valid_y)

In [None]:
m.get_n_leaves()

It looks like using `time_step` gives a better result. We keep `new_time_step` for now instead of deleting it, just to see how it does later in the tutorial. 

For the rest of the tutorial, please see [notebook 2](https://www.kaggle.com/wabinab/ventpressure2) as this notebook is getting a little bit long. (Also if you decide not to break here and continue, it's about time to free up some RAM). 

Before we end, we save the final preprocessed data and preprocessing steps so that they can be imported into notebook 2. 

In [None]:
# df = df.drop(columns="new_time_step")
df.to_csv("train_preprocessed.csv", index=False)
df_test.to_csv("test_preprocessed.csv", index=False)
df.head()

In [None]:
splits = (list(train_idx),list(valid_idx))
dep_var = "pressure"  # dependent variable

# use fastai function. 
cont, cat = cont_cat_split(df, 5, dep_var=dep_var)
to = TabularPandas(df, procs, cat, cont,
                   y_names=dep_var, splits=splits)

save_pickle(save_path/"to.pkl", to)  # overwrite the original to.pkl
save_pickle(save_path/"split.pkl", splits)

to_test = TabularPandas(df_test, procs)
save_pickle(save_path/"to_test.pkl", to_test)

# Experiment

# Other things to think about
- Whether or not to set `u_in` to 0 if `u_out` is True. According to https://www.kaggle.com/artgor/ventilator-pressure-prediction-eda-fe-and-models, it is the most important feature. You could argue that this feature makes sense, or that it makes no sense. It makes sense because if air is pumped in while you are breathing out, you would seem to accumulate greater pressure. It makes no sense because why "the last" value? 
