# Google Brain Ventilation Pressure Final Notebook
## Neural Networks
We shall try training a basic network of linear layers and see how it does. 

As usual, one doesn't know of any fastai function that predicts on a whole batch, so below is a `predict_batch` function adapted from `predict`.

In [None]:
!pip install -Uqq fastbook kaggle waterfallcharts treeinterpreter dtreeviz
import fastbook
fastbook.setup_book()

import seaborn as sns
from fastbook import *
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG, clear_output
from tqdm.auto import tqdm

pd.options.display.max_rows = 20
pd.options.display.max_columns = 10


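def predict_batch(self, df):
    "Predict on every row of `df`, reusing the learner's fitted preprocessing."
    # Build the test DataLoader from the argument, not from a global DataFrame
    dl = self.dls.test_dl(df)
    # Cast continuous columns to float32 so they match the model's dtype
    dl.dataset.conts = dl.dataset.conts.astype(np.float32)
    preds, targs = self.get_preds(dl=dl)
    return preds, targs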

Learner.predict_batch = predict_batch

In [None]:
path = Path("../input/ventpressure2")
save_path = Path("/kaggle/working")

df_nn_final = pd.read_csv("../input/ventpressure2/train_preprocessed.csv")
to_drop = ['breathId_uIn_diffmean', 'uIn_diff3', 'breathId_uIn_diffmax']
# Drop the columns if they are present; missing ones are skipped
df_nn_final = df_nn_final.drop(columns=to_drop, errors="ignore")

In [None]:
dep_var = "pressure"
splits = load_pickle("../input/ventpressure1/split.pkl")
cont_nn, cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)
cont_nn

In [None]:
df_nn_final[cat_nn].nunique()

We won't be dropping anything, but you could try to drop either `R` or `C` if you would like to. 

In [None]:
procs_nn = [Categorify, FillMissing, Normalize]
to_nn = TabularPandas(df_nn_final, procs_nn, cat_nn, cont_nn, splits=splits, 
                     y_names=dep_var)
dls = to_nn.dataloaders(1024)

It's a good idea to set `y_range` for regression models. Let's find the min and max of the dependent variable. 

In [None]:
y = to_nn.train.y
y.min(), y.max()
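
To see why the bounds matter, here is a minimal, dependency-free sketch of what `y_range` does: fastai passes the final activation through a sigmoid scaled to the given interval. The helper name `sigmoid_range` below is written out by hand for illustration, and the bounds `(-2, 65)` are the ones passed to `tabular_learner` in this notebook. Because a sigmoid never reaches its endpoints, the bounds should sit a little outside the observed minimum and maximum.

```python
import math


def sigmoid_range(x, lo, hi):
    """Map an unbounded activation x into the open interval (lo, hi)."""
    return 1.0 / (1.0 + math.exp(-x)) * (hi - lo) + lo


lo, hi = -2.0, 65.0  # same bounds as the y_range used in this notebook

# A raw activation of 0 lands exactly in the middle of the range
print(sigmoid_range(0.0, lo, hi))  # 31.5

# Large negative or positive activations approach, but never reach, the bounds
for raw in (-10.0, 10.0):
    value = sigmoid_range(raw, lo, hi)
    print(raw, lo < value < hi)
```

A target that equals one of the bounds exactly could only be predicted with an infinitely large activation, which is why a small margin on each side helps.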

As we have a much larger dataset than the original notebook, we will not only increase the number of hidden units but also change the number of linear layers. 

In [None]:
learn = tabular_learner(dls, y_range=(-2, 65), layers=[400, 300, 200, 100], 
                       n_out=1, loss_func=F.mse_loss, metrics=mae)
learn.lr_find()

We can already see that it does not start well: the loss is very high. However, a high loss before much training doesn't mean much, so we might as well train for a while and see whether the loss stays that high. 

In [None]:
learn.fit_one_cycle(7, 1e-2)

In [None]:
preds, targs = learn.get_preds()
mean_absolute_error(targs, preds)

In [None]:
learn.save(save_path/"nn")

## Ensembling

In [None]:
def rf(xs, y, n_estimators=40, max_samples=200_000, 
      max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True).fit(xs, y)


def m_mae(m, xs, y): return mean_absolute_error(y, m.predict(xs))

In [None]:
cont, cat = cont_cat_split(df_nn_final, 5, dep_var=dep_var)
procs = [Categorify, FillMissing]
to = TabularPandas(df_nn_final, procs, cat, cont, y_names=dep_var, splits=splits)
xs_final, y = to.train.xs, to.train.y
valid_xs_final, valid_y = to.valid.xs, to.valid.y

In [None]:
try:
    to_drop = ["breathId_uIn_diffmean", "uIn_diff3", "breathId_uIn_diffmax"]
    xs_final = xs_final.drop(to_drop, axis=1)
    valid_xs_final = valid_xs_final.drop(to_drop, axis=1)
    print("Dropped")
except Exception as e: print(f"{type(e)}: {e}")

m = rf(xs_final, y, n_estimators=45)
m_mae(m, valid_xs_final, valid_y)

In [None]:
rf_preds = m.predict(valid_xs_final)
ens_preds = (to_np(preds.squeeze()) + rf_preds) / 2
mean_absolute_error(valid_y, ens_preds)
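
The averaging step above can be sketched without any libraries. The numbers below are made up purely for illustration, and `simple_mae` and `average_ensemble` are hypothetical helper names, not part of fastai or scikit-learn. The sketch shows that a simple average helps when the two models' errors partly cancel, and hurts when one model is much worse than the other.

```python
def simple_mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def average_ensemble(preds_a, preds_b):
    """Element-wise mean of two prediction lists."""
    return [(a + b) / 2 for a, b in zip(preds_a, preds_b)]


# Made-up values, for illustration only
targets = [10.0, 20.0, 30.0]
nn_like = [12.0, 18.0, 33.0]    # errors of +2, -2, +3
rf_like = [9.0, 23.0, 28.0]     # errors of -1, +3, -2
flat_like = [17.0, 17.0, 17.0]  # a model that barely moves

# Errors with opposite signs partly cancel, so the average beats both models
good_ens = average_ensemble(nn_like, rf_like)
print(simple_mae(targets, nn_like), simple_mae(targets, rf_like),
      simple_mae(targets, good_ens))

# Averaging with a much weaker model drags the better model down
bad_ens = average_ensemble(nn_like, flat_like)
print(simple_mae(targets, bad_ens) > simple_mae(targets, nn_like))
```

This is why it is worth comparing the ensemble's MAE against each individual model's MAE before relying on it.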

## Ensemble Predictions

We can now make ensemble predictions for the test set. 

In [None]:
df_test = pd.read_csv("../input/ventpressure2/test_preprocessed.csv")
df_test.columns

In [None]:
df_nn_final.columns

In [None]:
to_drop_test = list(set(df_test.columns).difference(set(df_nn_final.columns)))
df_test = df_test.drop(to_drop_test, axis=1)
df_test.columns

In [None]:
preds, targs = learn.predict_batch(df_test)

In [None]:
cont, cat = cont_cat_split(df_test, 5, dep_var=dep_var)
to_test = TabularPandas(df_test, procs, cat, cont)
to_test.xs

In [None]:
rf_preds = m.predict(to_test.xs)

In [None]:
# del rf_preds, ens_preds
rf_preds = m.predict(to_test.xs)
ens_preds = (to_np(preds.squeeze()) + rf_preds) / 2

In [None]:
plt.plot(rf_preds[:80])

The graph above shows that the random forest predictions do not "protrude" where we want them to, particularly during inhalation. In other people's notebooks, such as [this notebook](https://www.kaggle.com/dmitryuarov/ventilator-pressure-eda-lstm-0-189/data?select=lstm.csv), the pressure ranges roughly from 6 to 25 rather than staying around 16-18. This gives the random forest a worse MAE on the test set, which in turn drags down the ensemble (MAE of around 16 after ensembling). 

In [None]:
plt.plot(preds.squeeze()[:80])

Although the upper range here is not around 25, this at least shows the shape we would like to see: a rise at the beginning up to a maximum, followed by a roughly exponential drop during exhalation towards some mean exhalation pressure. That mean is not necessarily zero: in one of the previous notebooks, the most common exhalation pressure was below 10, which is as expected. How far the predictions deviate from the true values is another question. 

In [None]:
ens_preds

In [None]:
submission = pd.read_csv("../input/ventilator-pressure-prediction/sample_submission.csv")
# Submit the neural network predictions only; the ensemble (ens_preds) scored worse
submission.pressure = to_np(preds.squeeze())
submission.to_csv("submission.csv", index=False)

Note that the result currently isn't great, and the reason is still being investigated. One possibility is that the `procs` applied to the test set do not `Categorify` with the same category mapping as in training, so the resulting category codes differ and are no longer meaningful. These are only hypotheses, and this is still a work in progress. 
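
The suspected mismatch can be illustrated with a small, library-free sketch. Everything here is hypothetical: `fit_vocab` and `encode` are hand-written stand-ins for a categorical encoder (not fastai's `Categorify`), the values are made up, and reserving code 0 for unseen values is an assumption of this sketch. The point is only that fitting a fresh mapping on the test set can assign different integer codes to the same value.

```python
def fit_vocab(values):
    """Build a value -> integer code mapping from the values seen.

    Code 0 is reserved for values that were not seen when fitting.
    """
    return {v: i + 1 for i, v in enumerate(sorted(set(values)))}


def encode(values, vocab):
    """Translate raw values into integer codes using a fitted mapping."""
    return [vocab.get(v, 0) for v in values]


# Made-up categorical column, for illustration only
train_col = [5, 20, 50, 20]
test_col = [20, 50, 50]  # the value 5 never appears in this test sample

train_vocab = fit_vocab(train_col)  # fitted on the training data
refit_vocab = fit_vocab(test_col)   # fitted again on the test data

# Reusing the training mapping keeps codes consistent with what the model saw
print(encode(test_col, train_vocab))

# Refitting on the test data shifts the codes, so the model sees the value 20
# under the code that meant 5 during training
print(encode(test_col, refit_vocab))
```

If this is indeed the cause, the fix would be to process the test set with the preprocessing fitted on the training data rather than fitting it anew, which is what `learn.dls.test_dl` already does for the neural network. For the random forest inputs, something like `to.train.new(df_test)` followed by `.process()` may achieve the same; this is an untested suggestion.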