# Ventilation Pressure Notebook 2
## Random Forest


**Note:** Because the preprocessed dataset is large, it is best to use PySpark and Parquet files, which speed up preprocessing and training. We don't do that here; a separate notebook may cover it, in which case it will be linked here.

In [None]:
!pip install -Uqq fastbook kaggle waterfallcharts treeinterpreter dtreeviz
import fastbook
fastbook.setup_book()

import seaborn as sns
from fastbook import *
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG, clear_output
from tqdm.auto import tqdm

pd.options.display.max_rows = 20
pd.options.display.max_columns = 10

In [None]:
df = pd.read_csv("../input/ventpressure1/train_preprocessed.csv", low_memory=False)
to = load_pickle("../input/ventpressure1/to.pkl")

In [None]:
for col, i in tqdm(zip(df.dtypes.index, df.dtypes)): 
    i = str(i)
    if i == "float64": df = df.astype({col: "float32"})
    elif i == "int64": df = df.astype({col: "int32"})

In [None]:
df.dtypes

We might also need the splits, though we haven't saved them yet. We might update that next time.

In [None]:
path = Path("../input/ventilator-pressure-prediction")
save_path = Path("/kaggle/working")

In [None]:
def rf(xs, y, n_estimators=40, max_samples=200_000, 
      max_features=0.5, min_samples_leaf=5, **kwargs):
    # Extra keyword arguments are forwarded to RandomForestRegressor.
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True, **kwargs).fit(xs, y)

In [None]:
xs, y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y
xs["u_in_cumsum_2"] = xs.u_in.groupby(xs.breath_id).cumsum()
valid_xs["u_in_cumsum_2"] = valid_xs.u_in.groupby(valid_xs.breath_id).cumsum()

m = rf(xs, y)

In [None]:
def m_mae(m, xs, y): return mean_absolute_error(y, m.predict(xs))

In [None]:
m_mae(m, xs, y), m_mae(m, valid_xs, valid_y)

The predictions are not too bad.

To see the impact of `n_estimators`, let's get the predictions from each individual tree in the forest.

In [None]:
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])

In [None]:
mean_absolute_error(valid_y, preds.mean(0))

In [None]:
plt.plot([mean_absolute_error(valid_y, preds[:i+1].mean(0)) for i in range(len(preds))])

In [None]:
mean_absolute_error(y, m.oob_prediction_)

## Model Interpretation
- How confident are we in our predictions using a particular row of data?
- For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?
- Which columns are the strongest predictors, which can we ignore?
- Which columns are effectively redundant with each other, for purposes of prediction?
- How do predictions vary, as we vary these columns? 

### Tree Variance for Prediction Confidence
Standard deviation tells us the *relative* confidence of predictions. 

In [None]:
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])
preds.shape

In [None]:
preds_std = preds.std(0)
preds_std[:5]

The standard deviations are very high, and they vary from row to row.

### Feature Importance
Feature importance tells us *how* the model is making predictions.

In [None]:
def rf_feat_importance(m, df):
    return pd.DataFrame({"cols": df.columns, "imp": m.feature_importances_}
                       ).sort_values("imp", ascending=False)

In [None]:
# show first few most important columns. 
fi = rf_feat_importance(m, xs)
fi

In [None]:
# plot of feature importances
def plot_fi(fi): return fi.plot("cols", "imp", "barh", figsize=(12, 7), legend=False)

plot_fi(fi)

### Removing Low-Importance Variables
In particular, we remove `breath_id`. One could also try removing `C` and `R` as an experiment.

In [None]:
to_keep = fi[fi.imp > 0.005].cols
len(to_keep)

In [None]:
# Retrain using subset
xs_imp = xs[to_keep]
valid_xs_imp = valid_xs[to_keep]

m = rf(xs_imp, y)
m_mae(m, xs_imp, y), m_mae(m, valid_xs_imp, valid_y)

It did worse than with all features.

In [None]:
# And if we remove C and R also
to_keep1 = fi[fi.imp > 0.05].cols
xs_imp1 = xs[to_keep1]
valid_xs_imp1 = valid_xs[to_keep1]

m1 = rf(xs_imp1, y)
m_mae(m1, xs_imp1, y), m_mae(m1, valid_xs_imp1, valid_y)

In [None]:
del to_keep1, xs_imp1, m1, valid_xs_imp1

In [None]:
plot_fi(rf_feat_importance(m, xs_imp))

### Removing Redundant Features

In [None]:
cluster_columns(xs_imp)

We don't have many columns, so we don't see any redundant features. Perhaps if we hadn't removed `new_time_step` in notebook 1, it would have shown up here as redundant, for example. As an example of a redundant pair, we created `u_in_cumsum_2`, which is extremely similar to `u_in_cumsum` except that it is not binned.
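As a rough way to see why two such columns count as redundant, one can compare their rank (Spearman) correlation on toy data. The sketch below uses made-up values, and the `cumsum_capped` column with its ceiling of 20 is a hypothetical stand-in for a binned or capped running total; it is not the preprocessing used in notebook 1.

```python
import numpy as np
import pandas as pd

# Toy data: 3 breaths of 5 steps each (values are made up for illustration).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "breath_id": np.repeat([1, 2, 3], 5),
    "u_in": rng.uniform(0, 10, size=15),
})

# Per-breath running total, the same pattern used for `u_in_cumsum_2`.
toy["cumsum_raw"] = toy.u_in.groupby(toy.breath_id).cumsum()
# Hypothetical "capped" variant: clip the running total at an arbitrary ceiling.
toy["cumsum_capped"] = toy["cumsum_raw"].clip(upper=20)

# Spearman correlation is the Pearson correlation of the ranks.
rank_corr = toy[["u_in", "cumsum_raw", "cumsum_capped"]].rank().corr()
print(rank_corr.round(3))
```

A rank correlation close to 1 between two columns suggests that a tree-based model can get nearly the same splits from either one, so dropping one of them is unlikely to hurt much.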

Next, we create a function that returns the OOB score.

In [None]:
def get_oob(df):
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=15, 
        max_samples=50000, max_features=0.5, n_jobs=-1, oob_score=True)
    m.fit(df, y)
    to_ret = m.oob_score_
    del m
    return to_ret

In our case, we remove just one variable, `breath_id`. It also looks like the non-capped cumsum gives a slightly better OOB score than the capped cumsum.

In [None]:
get_oob(xs_imp)

In [None]:
get_oob(xs)

In [None]:
to_drop = ["u_in_cumsum", "new_time_step"]
get_oob(xs.drop(to_drop, axis=1))

In [None]:
xs_final = xs_imp.drop(to_drop, axis=1)
valid_xs_final = valid_xs_imp.drop(to_drop, axis=1)
# xs_final = xs.drop(to_drop, axis=1)
# valid_xs_final = valid_xs.drop(to_drop, axis=1)

save_pickle(save_path/"xs_final.pkl", xs_final)
save_pickle(save_path/"valid_xs_final.pkl", valid_xs_final)

### Partial Dependence
We want to understand the relationship between the predictors and the dependent variable (pressure). We will do so for the two most predictive variables: `u_in_cumsum_2` and `time_step`.

In [None]:
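from sklearn.inspection import plot_partial_dependence

# Retrain on the final feature set so the model's columns match valid_xs_final.
m = rf(xs_final, y)

# Note: plot_partial_dependence was removed in scikit-learn 1.2;
# newer versions use PartialDependenceDisplay.from_estimator instead.
fig, ax = plt.subplots(figsize=(12, 4))
plot_partial_dependence(m, valid_xs_final, ["time_step", "u_in_cumsum_2"],
                       grid_resolution=20, ax=ax)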

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
plot_partial_dependence(m, valid_xs_final, ["R"],
                       grid_resolution=20, ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
plot_partial_dependence(m, valid_xs_final, ["uIn_lag1", "uIn_lag2", "uIn_lag3"],
                       grid_resolution=20, ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
plot_partial_dependence(m, valid_xs_final, ["uIn_diff1", "uIn_diff2", "uIn_diff3"],
                       grid_resolution=20, ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
plot_partial_dependence(m, valid_xs_final, ["breathId_uIn_diffmax"],
                       grid_resolution=20, ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
plot_partial_dependence(m, valid_xs_final, ["breathId_uIn_diffmean"],
                       grid_resolution=20, ax=ax)

For `time_step`, most of the pressure is accumulated before time 1.0 (or until slightly after it, where the steep drop occurs). This is expected because pressure is high when we breathe in and low when we breathe out (after the steep drop in the `time_step` plot). Is this *data leakage*?

Remember that `R` and `C` were assigned a mapping, which is why the plot does not show their original values.
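To make the plots above less of a black box, here is a minimal sketch of what a partial dependence curve computes: force one column to a fixed value for every row, predict, and average. The synthetic data and the helper name `manual_partial_dependence` are assumptions for illustration only; they are not part of scikit-learn or of this notebook's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): y depends on column 0 only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = 3 * X[:, 0] + rng.normal(0, 0.05, size=500)

model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

def manual_partial_dependence(model, X, col, grid):
    """Average prediction when column `col` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, col] = value  # overwrite one feature for every row
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)

grid = np.linspace(0.1, 0.9, 5)
pd_x0 = manual_partial_dependence(model, X, 0, grid)  # should rise with the grid
pd_x1 = manual_partial_dependence(model, X, 1, grid)  # should stay roughly flat
print(pd_x0.round(2), pd_x1.round(2))
```

A rising curve for column 0 and a nearly flat one for column 1 is the pattern one would expect here, since only column 0 drives the target.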

### Tree Interpreter

In [None]:
import warnings
warnings.simplefilter("ignore", FutureWarning)

from treeinterpreter import treeinterpreter
from waterfall_chart import plot as waterfall

Look at the contributions per variable. 

In [None]:
m = rf(xs_final, y)

In [None]:
row = valid_xs_final.iloc[:5]
prediction, bias, contributions = treeinterpreter.predict(m, row.values)
prediction[0], bias[0], contributions[0].sum()

In [None]:
waterfall(valid_xs_final.columns, contributions[0], threshold=0.08,
         rotation_value=45, formatting="{:,.3f}")
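The idea behind these contributions can be sketched for a single decision tree: start from the mean target at the root (the bias), follow the row down the tree, and credit each change in node mean to the feature that was split on. The code below is an illustrative re-implementation on synthetic data; `explain_row` is a hypothetical helper, not the `treeinterpreter` API.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3)).astype(np.float32)
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, size=200)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

def explain_row(tree, row):
    """Return (bias, per-feature contributions) for one row of a fitted tree."""
    t = tree.tree_
    node = 0
    bias = t.value[0, 0, 0]              # mean target at the root
    contributions = np.zeros(tree.n_features_in_)
    while t.children_left[node] != -1:   # -1 marks a leaf
        feat = t.feature[node]
        if row[feat] <= t.threshold[node]:
            child = t.children_left[node]
        else:
            child = t.children_right[node]
        # Credit the change in node mean to the feature that was split on.
        contributions[feat] += t.value[child, 0, 0] - t.value[node, 0, 0]
        node = child
    return bias, contributions

row = X[0]
bias, contributions = explain_row(tree, row)
prediction = tree.predict(row.reshape(1, -1))[0]
print(prediction, bias + contributions.sum())
```

The two printed numbers should agree, which is the same identity checked above for the forest: prediction equals bias plus the sum of contributions.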

## Extrapolation and Neural Networks
Random forests don't generalize well outside the range of the training data, because every prediction is an average of training targets. Neural networks generalize better.

Actually, our problem doesn't really require a neural network, because we don't need to generalize that far: the data covers quite a wide range, and what we mostly want is interpolation. We will still look at neural networks and see how they do, though.
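The extrapolation limit is easy to see on a toy problem. The sketch below assumes a made-up linear target `y = 2x` on the range 0 to 10, then asks the forest for a prediction at `x = 20`, well outside what it has seen.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data (an assumption for illustration): a perfectly linear target.
x_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * x_train.ravel()

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(x_train, y_train)

inside = forest.predict([[5.0]])[0]    # inside the training range
outside = forest.predict([[20.0]])[0]  # far outside the training range
print(inside, outside, y_train.max())
```

Inside the range the forest interpolates well, but at `x = 20` the prediction cannot exceed the largest training target, even though the true value would be 40.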

### Finding Out-of-Domain Data
We train a model to predict whether a row belongs to the validation set or the training set.

In [None]:
df_dom = pd.concat([xs_final, valid_xs_final])
is_valid = np.array([0] * len(xs_final) + [1] * len(valid_xs_final))

m = rf(df_dom, is_valid)
rf_feat_importance(m, df_dom)

Let's get a baseline MAE from the original random forest model, and see the effect of removing each of these columns in turn.

In [None]:
m = rf(xs_final, y)
print("orig", m_mae(m, valid_xs_final, valid_y))

for c in tqdm(("max_cumsum breathId_uIn_max u_in_cumsum_2 breathId_uIn_diffmax breathId_uIn_diffmean".split(" "))):
    m = rf(xs_final.drop(c, axis=1), y)
    print(c, m_mae(m, valid_xs_final.drop(c, axis=1), valid_y))

In [None]:
for c in tqdm(("uIn_diff1 uIn_diff2 uIn_diff3 uIn_lag1 uIn_lag2 uIn_lag3".split(" "))):
    m = rf(xs_final.drop(c, axis=1), y)
    print(c, m_mae(m, valid_xs_final.drop(c, axis=1), valid_y))

In [None]:
to_drop = ["breathId_uIn_max", "breathId_uIn_diffmean", "uIn_diff1", "uIn_diff2", 
          "uIn_lag2", "uIn_lag3"]
m = rf(xs_final.drop(to_drop, axis=1), y)
print("Multi_drop: ", m_mae(m, valid_xs_final.drop(to_drop, axis=1), valid_y))

In [None]:
to_drop = ["breathId_uIn_diffmean", "uIn_diff3", "breathId_uIn_diffmax"]
m = rf(xs_final.drop(to_drop, axis=1), y)
print("Multi_drop: ", m_mae(m, valid_xs_final.drop(to_drop, axis=1), valid_y))

Okay, these are the columns we want to drop. Next, let's keep only the training rows with `u_in_cumsum` below 1500 and see whether that improves the MAE.

In [None]:
to_test = pd.read_csv("../input/ventpressure1/test_preprocessed.csv")
to_test["u_in_cumsum_2"] = to_test["u_in"].groupby(to_test["breath_id"]).cumsum()
to_drop1 = ["u_in_cumsum", "new_time_step"]
to_test = to_test.drop(to_drop1, axis=1)
to_test = to_test.drop(to_drop, axis=1)
# to_test = to_test.drop("breath_id", axis=1)
to_test.to_csv("test_preprocessed.csv", index=False)
del to_test

In [None]:
filt = xs.u_in_cumsum < 1500
xs_filt = xs_final[filt]
y_filt = y[filt]

m = rf(xs_filt, y_filt)
m_mae(m, xs_filt, y_filt), m_mae(m, valid_xs_final, valid_y)

This doesn't give a better result than the original: the original MAE is 0.885 and this one is 0.909. Let's try a threshold of 1000 to confirm.

In [None]:
filt = xs.u_in_cumsum < 1000
xs_filt = xs_final[filt]
y_filt = y[filt]

m = rf(xs_filt, y_filt)
m_mae(m, xs_filt, y_filt), m_mae(m, valid_xs_final, valid_y)

## Using a Neural Network

In [None]:
del xs_filt, y_filt, filt, xs_imp, valid_xs_imp, preds, preds_std, to

In [None]:
try: del df
except Exception: pass
df_train = pd.concat([xs_final, y], axis=1)
df_valid = pd.concat([valid_xs_final, valid_y], axis=1)
df_nn_final = pd.concat([df_train, df_valid])
df_nn_final = df_nn_final.sort_index()
df_nn_final.to_csv("train_preprocessed.csv", index=False)
df_nn_final.dtypes

Because the neural network needs a GPU, while this notebook is best run on a CPU instance (Kaggle offers more CPU resources on instances without a GPU), we will continue in another notebook. We are also close to running out of memory (OOM).

In [None]:
del df_train, df_valid, m, is_valid, df_dom, xs, y, valid_xs, valid_y
import gc
gc.collect()

In [None]:
dep_var = "pressure"
cont_nn, cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)
cont_nn

In [None]:
df_nn_final[cat_nn].nunique()

In [None]:
procs_nn = [Categorify, FillMissing, Normalize]
splits = load_pickle("../input/ventpressure1/split.pkl")
to_nn = TabularPandas(df_nn_final, procs_nn, cat_nn, cont_nn, splits=splits, 
                     y_names=dep_var)

In [None]:
# Tabular models don't generally require much GPU RAM. 
# Hence we use larger batch size. 
dls = to_nn.dataloaders(1024)

It's a good idea to set `y_range` for regression models. Let's find the min and max of the dependent variable.

In [None]:
y = to_nn.train.y
y.min(), y.max()

In [None]:
learn = tabular_learner(dls, y_range=(-2, 65), layers=[1000, 500], 
                       n_out=1, loss_func=F.mse_loss, metrics=mae)
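As far as I understand, when `y_range` is given fastai squashes the final activation with a scaled sigmoid, so outputs approach but never reach the two bounds. That would explain why the range `(-2, 65)` is chosen a little wider than the observed targets. The NumPy function below is an illustrative re-implementation of that idea, not fastai's own code.

```python
import numpy as np

def sigmoid_range(x, low, high):
    """Map any real number into the open interval (low, high)."""
    return 1 / (1 + np.exp(-x)) * (high - low) + low

# Very negative, zero, and very positive raw activations.
raw = np.array([-10.0, 0.0, 10.0])
scaled = sigmoid_range(raw, -2, 65)
print(scaled)
```

A raw activation of zero lands at the midpoint of the range, and even large activations stay strictly inside the bounds.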

In [None]:
learn.lr_find()

In [None]:
# n_epoch is required; 5 epochs and lr_max=1e-2 are placeholders, pick lr_max from lr_find.
learn.fit_one_cycle(5, 1e-2)