# Random Forest Model interpretation
#### This is an annotated copy of the Fast.ai Machine Learning Lesson 2 notebook. Notes and edits by Joseph Catanzarite.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
%matplotlib inline

from fastai.imports import *
from fastai.structured import *
from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display
from sklearn import metrics
import feather


In [None]:
set_plot_sizes(12,14,16)

## Load in our data from last lesson and run proc_df to preprocess it

In [None]:
PATH = "C:/Users/jcat/fastai/data/bulldozers/"
# df_raw = pd.read_feather('tmp/bulldozers-raw')
df_raw = feather.read_dataframe('tmp/bulldozers-raw')
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')

In [None]:
# split data into training and validation parts
def split_vals(a,n): return a[:n], a[n:]
n_valid = 12000
n_trn = len(df_trn)-n_valid
X_train, X_valid = split_vals(df_trn, n_trn)
y_train, y_valid = split_vals(y_trn, n_trn)
raw_train, raw_valid = split_vals(df_raw, n_trn)

In [None]:
# functions to define and print scores
def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

In [None]:
df_raw

# Confidence based on tree variance

For model interpretation, there's no need to train each tree on the full dataset. Using a subset of rows per tree is faster and also gives better interpretability, since an overfit model will not show much variance across trees.

In [None]:
# use a subset of examples for each tree, 
#     instead of the full bootstrap sample 
set_rf_samples(50000)

In [None]:
??set_rf_samples
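`set_rf_samples` comes from `fastai.structured`. As a rough point of comparison, scikit-learn 0.22 and later expose a `max_samples` argument that also limits how many rows each tree's bootstrap sample draws. The sketch below uses synthetic data and made-up sizes (2000 rows, 500 per tree); it is an illustration of the idea, not a drop-in replacement for the cell above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real data (assumption: 2000 rows, 5 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=2000)

# Each tree is fit on a bootstrap draw of only 500 rows.
m_sub = RandomForestRegressor(n_estimators=40, max_samples=500,
                              min_samples_leaf=3, max_features=0.5,
                              oob_score=True, random_state=0)
m_sub.fit(X, y)

# Default: each bootstrap draw has as many rows as the training set.
m_full = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                               max_features=0.5, oob_score=True,
                               random_state=0)
m_full.fit(X, y)

# Trees grown on fewer rows are smaller, which is where the speed-up comes from.
nodes_sub = np.mean([t.tree_.node_count for t in m_sub.estimators_])
nodes_full = np.mean([t.tree_.node_count for t in m_full.estimators_])
print(m_sub.oob_score_, m_full.oob_score_, nodes_sub, nodes_full)
```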

In [None]:
# metric = 0.2509
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)

In [None]:
# compare to full bootstrap sample
# score = 0.2268, so it's better by 0.024
reset_rf_samples()
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)

# return to subsampling
set_rf_samples(50000)

We saw how the model averages predictions across the trees to get an estimate, but how can we know the confidence of that estimate? One simple way is to look at the standard deviation of the per-tree predictions, instead of just the mean. This tells us the *relative* confidence of predictions: for rows where the trees give very different results, we should be more cautious about using the prediction than for rows where the trees are more consistent. Using the same example as in the last lesson when we looked at bagging:

In [None]:
%time preds = np.stack([t.predict(X_valid) for t in m.estimators_])
np.mean(preds[:,0]), np.std(preds[:,0])
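The same computation can be tried end to end on synthetic data. The sketch below (data and sizes are assumptions) stacks the per-tree predictions, then takes the mean and standard deviation over the tree axis; the mean over trees should match what the forest itself predicts.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): 800 training rows, 200 validation rows.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 3))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)
X_trn, X_val, y_trn = X[:800], X[800:], y[:800]

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, random_state=0)
m.fit(X_trn, y_trn)

# One row per tree, one column per validation example.
preds = np.stack([t.predict(X_val) for t in m.estimators_])
pred_mean = preds.mean(axis=0)   # average over trees
pred_std = preds.std(axis=0)     # spread over trees: larger means less agreement

print(preds.shape, pred_mean[0], pred_std[0])
```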

When we use Python to loop through the trees like this, we're calculating each prediction in series, which is slow! We can use parallel processing to speed things up:

??parallel_trees
Signature: parallel_trees(m, fn, n_jobs=8)
Docstring: <no docstring>
Source:   
def parallel_trees(m, fn, n_jobs=8):
        return list(ProcessPoolExecutor(n_jobs).map(fn, m.estimators_))
File:      c:\users\jcat\fastai\courses\ml1\fastai\structured.py
Type:      function


In [None]:
# problem with parallelization
def get_preds(t): return t.predict(X_valid)
%time preds = np.stack(parallel_trees(m, get_preds))
np.mean(preds[:,0]), np.std(preds[:,0])
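`parallel_trees` uses a `ProcessPoolExecutor`, which has to pickle `fn` and send it to worker processes. If the problem noted in the cell above is of that kind (for example, on Windows a function defined interactively may not be importable in a spawned worker), a thread pool is one possible workaround. The helper name `parallel_trees_threads` is hypothetical and the data is synthetic; whether threads give a real speed-up was not measured here.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def parallel_trees_threads(m, fn, n_jobs=8):
    # Hypothetical thread-based variant of parallel_trees: threads share
    # memory, so `fn` does not need to be picklable.
    with ThreadPoolExecutor(max_workers=n_jobs) as ex:
        return list(ex.map(fn, m.estimators_))   # map preserves tree order

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=1000)
X_trn, y_trn, X_val = X[:800], y[:800], X[800:]

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, random_state=0)
m.fit(X_trn, y_trn)

def get_preds(t): return t.predict(X_val)

preds_serial = np.stack([get_preds(t) for t in m.estimators_])
preds_threads = np.stack(parallel_trees_threads(m, get_preds))
print(np.mean(preds_threads[:, 0]), np.std(preds_threads[:, 0]))
```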

We can see that different trees are giving different estimates for this auction. In order to see how prediction confidence varies, we can add this into our dataset.

In [None]:
preds.shape

In [None]:
x = raw_valid.copy()
x['pred_std'] = np.std(preds, axis=0)
x['pred'] = np.mean(preds, axis=0)
x.Enclosure.value_counts().plot.barh();

In [None]:
flds = ['Enclosure', 'SalePrice', 'pred', 'pred_std']
enc_summ = x[flds].groupby('Enclosure', as_index=False).mean()
enc_summ

In [None]:
# plot sale price grouped by enclosure category
enc_summ = enc_summ[~pd.isnull(enc_summ.SalePrice)]
enc_summ.plot('Enclosure', 'SalePrice', 'barh', xlim=(0,11));

In [None]:
# include error bars
enc_summ.plot('Enclosure', 'pred', 'barh', xerr='pred_std', alpha=0.6, xlim=(0,11));

*Question*: Why are the predictions nearly exactly right, but the error bars are quite wide?

In [None]:
raw_valid.ProductSize.value_counts().plot.barh();

In [None]:
flds = ['ProductSize', 'SalePrice', 'pred', 'pred_std']
summ = x[flds].groupby(flds[0]).mean()
summ

In [None]:
# fractional error in predicted price
(summ.pred_std/summ.pred).sort_values(ascending=False)
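To make the ratio concrete, here is a tiny worked example with made-up numbers (the values are purely illustrative, not taken from the bulldozers data). Groups are averaged first, then the mean standard deviation is divided by the mean prediction.

```python
import pandas as pd

# Hypothetical per-row predictions and per-row tree standard deviations.
x = pd.DataFrame({
    'ProductSize': ['Large', 'Large', 'Mini', 'Mini', 'Small'],
    'pred':        [10.0, 11.0, 9.0, 9.5, 10.0],
    'pred_std':    [0.40, 0.30, 0.20, 0.25, 0.10],
})

summ = x.groupby('ProductSize').mean()
ratio = (summ.pred_std / summ.pred).sort_values(ascending=False)
print(ratio)
```

With these numbers the `Large` group has the highest ratio (0.35 / 10.5), so it is the group whose predictions the trees agree on least, relative to the size of the prediction.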

# Feature importance

It's not normally enough just to know that a model can make accurate predictions; we also want to know *how* it's making predictions. The most important way to see this is with *feature importance*.

In [None]:
fi = rf_feat_importance(m, df_trn); fi[:10]
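For readers without the fastai library, the sketch below builds a similar table directly from scikit-learn's `feature_importances_` attribute. The helper `feat_importance` is a hypothetical stand-in that is assumed to behave like `rf_feat_importance` (columns `cols` and `imp`, sorted by importance); the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def feat_importance(m, df):
    # Assumed equivalent of rf_feat_importance: one row per column,
    # sorted from most to least important.
    return (pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_})
              .sort_values('imp', ascending=False))

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 4)), columns=['a', 'b', 'c', 'noise'])
y = 5 * df['a'] + 1 * df['b'] + rng.normal(scale=0.1, size=1000)

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, random_state=0)
m.fit(df, y)

fi = feat_importance(m, df)
to_keep = fi[fi.imp > 0.005].cols   # same threshold style as used in this notebook
print(fi)
```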

In [None]:
fi.plot('cols', 'imp', figsize=(10,6), legend=False);

In [None]:
def plot_fi(fi): return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)

In [None]:
plot_fi(fi[:30]);

In [None]:
to_keep = fi[fi.imp>0.005].cols; len(to_keep)

In [None]:
# keep features with importance > 0.005
df_keep = df_trn[to_keep].copy()
X_train, X_valid = split_vals(df_keep, n_trn)

In [None]:
# eliminating unimportant features improved score by 0.007
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5,
                          n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)

In [None]:
# feature importances within reduced feature set 
#     vary a bit from previous order
fi = rf_feat_importance(m, df_keep)
plot_fi(fi);

## One-hot encoding

proc_df's optional *max_n_cat* argument will turn some categorical variables into new columns.

For example, the column **ProductSize** which has 6 categories:

* Large
* Large / Medium
* Medium
* Compact
* Small
* Mini

gets turned into 6 new columns:

* ProductSize_Large
* ProductSize_Large / Medium
* ProductSize_Medium
* ProductSize_Compact
* ProductSize_Small
* ProductSize_Mini

and the column **ProductSize** gets removed.

This only happens for columns whose number of categories is no greater than the value of the *max_n_cat* argument.

Some of these new columns may turn out to be more important features than the original single column that held all the categories.
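The thresholding rule can be sketched with plain pandas. The helper `one_hot_small_cats` below is hypothetical and only illustrates the "at most *max_n_cat* categories" rule; it ignores missing-value handling and everything else that `proc_df` does.

```python
import pandas as pd

def one_hot_small_cats(df, max_n_cat):
    # Hypothetical helper: one-hot encode only the categorical columns whose
    # number of categories is at most max_n_cat; leave the rest untouched.
    cols = [c for c in df.columns
            if df[c].dtype.name == 'category'
            and len(df[c].cat.categories) <= max_n_cat]
    return pd.get_dummies(df, columns=cols)

df = pd.DataFrame({
    'ProductSize': pd.Categorical(['Large', 'Mini', 'Small', 'Mini']),  # 3 categories
    'ModelID': pd.Categorical(['a', 'b', 'c', 'd']),                    # 4 categories
    'YearMade': [1999, 2004, 2001, 1995],
})

out = one_hot_small_cats(df, max_n_cat=3)
print(list(out.columns))
```

Here `ProductSize` is replaced by three indicator columns, while `ModelID` has too many categories for `max_n_cat=3` and stays as a single column.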

In [None]:
# one-hot-encoding made metric worse by 0.01!
df_trn2, y_trn, nas = proc_df(df_raw, 'SalePrice', max_n_cat=7)
X_train, X_valid = split_vals(df_trn2, n_trn)

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.6, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)

In [None]:
# importance ordering is changed 
fi = rf_feat_importance(m, df_trn2)
plot_fi(fi[:25]);

#### Done with one-hot-encoding experiment!

# Removing redundant features

One thing that makes this harder to interpret is that there seem to be some variables with very similar meanings. Let's try to remove redundant features.

In [None]:
from scipy.cluster import hierarchy as hc

In [None]:
# spearman correlation
corr = np.round(scipy.stats.spearmanr(df_keep).correlation, 4)
corr_condensed = hc.distance.squareform(1-corr)
z = hc.linkage(corr_condensed, method='average')
fig = plt.figure(figsize=(16,10))
dendrogram = hc.dendrogram(z, labels=df_keep.columns, orientation='left', leaf_font_size=16)
plt.show()
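The clustering step can be checked on a small synthetic case without plotting. In the sketch below (data is an assumption), two columns are monotonically related, so their Spearman correlation is 1 and their distance `1 - corr` is 0; average-linkage clustering should merge them first.

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import squareform
from scipy.cluster import hierarchy as hc

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
# Columns 0 and 1 are monotonically related; column 2 is independent.
X = np.column_stack([a, np.exp(a), b])

corr = spearmanr(X).correlation            # 3x3 rank-correlation matrix
dist = np.clip(1 - corr, 0, None)          # correlation 1 -> distance 0
np.fill_diagonal(dist, 0.0)                # remove floating-point residue
z = hc.linkage(squareform(dist, checks=False), method='average')

# Each row of z is one merge; the first row joins the two closest columns.
first_pair = sorted(int(i) for i in z[0, :2])
print(first_pair, z[0, 2])
```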

Let's try removing some of these related features to see if the model can be simplified without impacting the accuracy.

In [None]:
def get_oob(df):
    # why vary parameters from original values?
    #     n_estimators = 40
    #     min_samples_leaf = 3
    #     max_features = 0.5
    # m = RandomForestRegressor(n_estimators=30, min_samples_leaf=5, max_features=0.6, n_jobs=-1, oob_score=True)
    # original parameter values improve metric by 0.005
    #     so let's keep them
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
    x, _ = split_vals(df, n_trn)
    m.fit(x, y_train)
    return m.oob_score_

Here's our baseline.

In [None]:
# revert to df_keep
get_oob(df_keep)

Now we try removing each variable one at a time.

In [None]:
# removing these features has little effect on oob score
for c in ('saleYear', 'saleElapsed', 'fiModelDesc', 'fiBaseModel', 'Grouser_Tracks', 'Coupler_System'):
    print(c, get_oob(df_keep.drop(c, axis=1)))
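The drop-one-column comparison can be reproduced on synthetic data where the redundancy is known in advance. In this sketch (column names are reused for readability, the data is made up, and `get_oob` takes `y` explicitly), `saleElapsed` carries the same information as `saleYear`, so dropping `saleYear` is expected to cost far less OOB score than dropping the unrelated `size` column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def get_oob(df, y):
    # Same hyperparameters as the notebook's get_oob; fixed seed for repeatability.
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                              max_features=0.5, oob_score=True, random_state=0)
    m.fit(df, y)
    return m.oob_score_

rng = np.random.default_rng(0)
n = 1500
year = rng.integers(1990, 2012, size=n)
df = pd.DataFrame({
    'saleYear': year,
    # Near-duplicate of saleYear: year plus a day offset within the year.
    'saleElapsed': year * 365 + rng.integers(0, 365, size=n),
    'size': rng.normal(size=n),
})
y = 0.1 * (year - 1990) + df['size'] + rng.normal(scale=0.1, size=n)

baseline = get_oob(df, y)
scores = {c: get_oob(df.drop(c, axis=1), y) for c in df.columns}
print(baseline, scores)
```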

It looks like we can try removing one feature from each group. Let's see what that does.

In [None]:
# metric is worse by 0.002
to_drop = ['saleYear', 'fiBaseModel', 'Grouser_Tracks']
get_oob(df_keep.drop(to_drop, axis=1))

Looking good! Let's use this dataframe from here. We'll save the list of columns so we can reuse it later.

In [None]:
df_keep.drop(to_drop, axis=1, inplace=True)
X_train, X_valid = split_vals(df_keep, n_trn)

In [None]:
# save list of columns to keep
np.save('tmp/keep_cols.npy', np.array(df_keep.columns))

In [None]:
# retrieve list of columns to keep
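# column names are stored as an object array, so recent NumPy needs allow_pickle=True
keep_cols = np.load('tmp/keep_cols.npy', allow_pickle=True)
# .copy() so that adding columns to df_keep later does not touch a slice of df_trn
df_keep = df_trn[keep_cols].copy()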

And let's see how this model looks on the full dataset.

In [None]:
# revert to full bootstrap sample
reset_rf_samples()

In [None]:
# metric improved to 0.227 using full bootstrap, 
#     which is what we had before
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)

### Conclusion: OHE, and eliminating unimportant and/or redundant features do _not_ improve the metric


# Partial dependence

In [None]:
# first, install pdpbox and plotnine
# pip install pdpbox
# conda install -c conda-forge plotnine
from pdpbox import pdp
from plotnine import *

In [None]:
set_rf_samples(50000)

This next analysis will be a little easier if we use the one-hot encoded categorical variables, so let's load them up again.

In [None]:
# start with metric 0.253
df_trn2, y_trn, nas = proc_df(df_raw, 'SalePrice', max_n_cat=7)
X_train, X_valid = split_vals(df_trn2, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.6, n_jobs=-1)
m.fit(X_train, y_train);
print_score(m)

In [None]:
plot_fi(rf_feat_importance(m, df_trn2)[:10]);

In [None]:
df_raw.plot('YearMade', 'saleElapsed', 'scatter', alpha=0.01, figsize=(10,8));

In [None]:
x_all = get_sample(df_raw[df_raw.YearMade>1930], 500)

In [None]:
# first install scikit-misc
# pip install scikit-misc
ggplot(x_all, aes('YearMade', 'SalePrice'))+stat_smooth(se=True, method='loess')

In [None]:
x = get_sample(X_train[X_train.YearMade>1930], 500)

In [None]:
def plot_pdp(feat_name, clusters=None):
    p = pdp.pdp_isolate(m, x, feature=feat_name, model_features=x.columns)
    return pdp.pdp_plot(p, feat_name, plot_lines=True, 
                        cluster=clusters is not None, n_cluster_centers=clusters)

In [None]:
plot_pdp('YearMade')

In [None]:
plot_pdp('YearMade', clusters=5)
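What a partial dependence plot computes can be written out by hand. The sketch below is a minimal illustration on synthetic data, not pdpbox's implementation: for each grid value, the chosen column is overwritten in every row, the model predicts, and the per-row curves are averaged. The function name `partial_dependence_1d` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def partial_dependence_1d(model, X, col, grid):
    # For each grid value: set column `col` to that value in every row,
    # predict, and keep one curve per row plus their average.
    curves = np.empty((len(grid), X.shape[0]))
    for i, v in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, col] = v
        curves[i] = model.predict(X_mod)
    return curves.mean(axis=1), curves

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.05, size=1000)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                          random_state=0).fit(X, y)

grid = np.linspace(0.05, 0.95, 10)
pdp_mean, pdp_lines = partial_dependence_1d(m, X[:200], col=0, grid=grid)
print(pdp_mean)
```

The individual rows of `pdp_lines` correspond to the thin lines drawn when `plot_lines=True`, and `pdp_mean` to the highlighted average curve.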

In [None]:
feats = ['saleElapsed', 'YearMade']
p = pdp.pdp_interact(m, x, feats)
pdp.pdp_interact_plot(p, feats)

In [None]:
plot_pdp(['Enclosure_EROPS w AC', 'Enclosure_EROPS', 'Enclosure_OROPS'], 5)#, 'Enclosure')

In [None]:
# define engineered feature 'age'
# .loc avoids chained assignment, which may not write back to df_raw
df_raw.loc[df_raw.YearMade<1950, 'YearMade'] = 1950
df_keep['age'] = df_raw['age'] = df_raw.saleYear-df_raw.YearMade

In [None]:
# age becomes the most important feature!
X_train, X_valid = split_vals(df_keep, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.6, n_jobs=-1)
m.fit(X_train, y_train)
plot_fi(rf_feat_importance(m, df_keep));

# Tree interpreter
#### not sure what this section demonstrates?

In [None]:
# install treeinterpreter
# pip install treeinterpreter
from treeinterpreter import treeinterpreter as ti

In [None]:
df_train, df_valid = split_vals(df_raw[df_keep.columns], n_trn)

In [None]:
row = X_valid.values[None,0]; row

In [None]:
prediction, bias, contributions = ti.predict(m, row)

In [None]:
len(contributions[0])

In [None]:
prediction[0], bias[0]

In [None]:
idxs = np.argsort(contributions[0])

In [None]:
df_valid.iloc[0]

In [None]:
# .iloc makes the positional lookup explicit (idxs holds integer positions)
[o for o in zip(df_keep.columns[idxs], df_valid.iloc[0].iloc[idxs], contributions[0][idxs])]

In [None]:
contributions[0].sum()
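The idea behind the decomposition is that a prediction equals a bias (the mean target at the root) plus one contribution per feature. The sketch below is a hand-rolled illustration of that idea using scikit-learn's tree internals, not the treeinterpreter package's code; `tree_contributions` is a hypothetical helper and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def tree_contributions(tree, x):
    # Walk one row down one fitted regression tree. At every split, credit the
    # change in node mean to the feature that was split on.
    t = tree.tree_
    x = np.asarray(x, dtype=np.float32)     # scikit-learn compares in float32
    node = 0
    bias = t.value[0][0][0]                 # mean target at the root
    contribs = np.zeros(len(x))
    while t.children_left[node] != -1:      # -1 marks a leaf
        f = t.feature[node]
        nxt = (t.children_left[node] if x[f] <= t.threshold[node]
               else t.children_right[node])
        contribs[f] += t.value[nxt][0][0] - t.value[node][0][0]
        node = nxt
    return bias, contribs

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=500)
m = RandomForestRegressor(n_estimators=20, min_samples_leaf=3,
                          random_state=0).fit(X, y)

row = X[0]
parts = [tree_contributions(t, row) for t in m.estimators_]
bias = np.mean([b for b, _ in parts])                   # averaged over trees
contributions = np.mean([c for _, c in parts], axis=0)  # one value per feature
print(bias, contributions, bias + contributions.sum())
```

Because the per-node changes telescope along each path, `bias + contributions.sum()` reproduces the forest's prediction for the row.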

# Extrapolation
#### It's not totally clear what this section says about extrapolation. We identify and eliminate unnecessary features, and thereby realize an improvement in the metric.

In [None]:
df_ext = df_keep.copy()
df_ext['is_valid'] = 1
# label the training rows 0; .iloc avoids chained assignment
df_ext.iloc[:n_trn, df_ext.columns.get_loc('is_valid')] = 0
x, y, nas = proc_df(df_ext, 'is_valid')

In [None]:
m = RandomForestClassifier(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(x, y);
m.oob_score_

In [None]:
fi = rf_feat_importance(m, x); fi[:10]
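A small synthetic version of this check may help show what the classifier is measuring. In the sketch below (all data is made up), the rows are time-ordered and the last 20% are labelled as validation. One column increases with time, so a classifier can tell the two sets apart almost perfectly, and that column should dominate the feature importances.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, n_trn = 2000, 1600
df = pd.DataFrame({
    'saleElapsed': np.arange(n),        # increases with time
    'size': rng.normal(size=n),         # unrelated to time
    'hours': rng.normal(size=n),
    'weight': rng.normal(size=n),
})
# 0 for the (earlier) training rows, 1 for the (later) validation rows.
is_valid = (np.arange(n) >= n_trn).astype(int)

clf = RandomForestClassifier(n_estimators=40, min_samples_leaf=3,
                             max_features=0.5, oob_score=True, random_state=0)
clf.fit(df, is_valid)

fi = pd.Series(clf.feature_importances_, index=df.columns)
fi = fi.sort_values(ascending=False)
print(clf.oob_score_)
print(fi)
```

A high OOB score here means the validation rows look different from the training rows, and the top-ranked features are the ones that differ, which makes them candidates for features that may not extrapolate well.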

In [None]:
# top 3 features
feats=['SalesID', 'saleElapsed', 'MachineID']

In [None]:
(X_train[feats]/1000).describe()

In [None]:
(X_valid[feats]/1000).describe()

In [None]:
# drop top three features
x.drop(feats, axis=1, inplace=True)

In [None]:
# score is a bit worse
m = RandomForestClassifier(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(x, y)
m.oob_score_

In [None]:
fi = rf_feat_importance(m, x); fi[:10]

In [None]:
#speed up by subsampling
set_rf_samples(50000)

In [None]:
# return to original sample
X_train, X_valid = split_vals(df_keep, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)

In [None]:
# top six features
feats=['SalesID', 'saleElapsed', 'MachineID', 'age', 'YearMade', 'saleDayofyear']

In [None]:
# remove top six features, one at a time to see effect on metric
# metrics vary between 0.245 and 0.255
for f in feats:
    df_subs = df_keep.drop(f, axis=1)
    X_train, X_valid = split_vals(df_subs, n_trn)
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
    m.fit(X_train, y_train)
    print(f)
    print_score(m)

In [None]:
# revert to full bootstrap sample
reset_rf_samples()

In [None]:
# removing these features gave a significant score reduction
#     recall that previously with full bootstrap sample we
#     got a score of 0.2182, original score was 0.2268
# drop SalesID, MachineID, saleDayofyear because dropping
#    them individually reduced the metric more than any of 
#    the other three.
df_subs = df_keep.drop(['SalesID', 'MachineID', 'saleDayofyear'], axis=1)
X_train, X_valid = split_vals(df_subs, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)

In [None]:
plot_fi(rf_feat_importance(m, X_train));

In [None]:
np.save('tmp/subs_cols.npy', np.array(df_subs.columns))

# Our final model!

In [None]:
# use more trees, and grow them completely
# our metric improved by 0.007 to 0.2114
m = RandomForestRegressor(n_estimators=160, max_features=0.5, n_jobs=-1, oob_score=True)
%time m.fit(X_train, y_train)
print_score(m)