# Useful ML Functions - for Scikit-Learn #  

These were collected or developed from various sources (e.g. books, online courses, and my own work with ML).

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np

## Random Forest - useful helpers and functions ##

Set RF samples. This makes each tree in a Random Forest train on a random sample of `n` rows from your dataset, rather than on a full-size bootstrap sample.

In [2]:
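# Note: this relies on scikit-learn's private forest module as found in older releases;
# newer releases moved it and offer a max_samples argument on the forest estimators instead.
from sklearn.ensemble import forest

def set_rf_samples(n):
    """ Changes Scikit learn's random forests to give each tree a random sample of
    n random rows.
    """
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n))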

Reset the samples that were set by the function above.

In [4]:
def reset_rf_samples():
    """ Undoes the changes produced by set_rf_samples.
    """
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n_samples))
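The two helpers above patch a private scikit-learn function, so they may not work on newer releases. A minimal sketch of a similar effect, assuming scikit-learn 0.22 or later, where the forest estimators accept a `max_samples` argument. The synthetic data and the value 200 are made up for illustration and are not from the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, purely illustrative.
rng = np.random.RandomState(0)
X = rng.rand(1000, 5)
y = X[:, 0] * 3 + rng.normal(scale=0.1, size=1000)

# max_samples=200: each tree is fit on a bootstrap sample of 200 rows
# (this needs bootstrap=True, which is the default).
m = RandomForestRegressor(n_estimators=10, max_samples=200, random_state=0)
m.fit(X, y)
print(m.score(X, y))

# max_samples=None (the default) draws as many rows as the dataset has,
# which plays the role of reset_rf_samples().
m_full = RandomForestRegressor(n_estimators=10, max_samples=None, random_state=0)
m_full.fit(X, y)
```

Unlike the monkey-patch, this setting applies per model, so nothing has to be reset afterwards.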

***Split Values***

In [8]:
def split_vals(a,n): return a[:n], a[n:]
n_valid = 12000
n_trn = len(df_trn)-n_valid  #df_trn is our training dataframe
X_train, X_valid = split_vals(df_trn, n_trn)
y_train, y_valid = split_vals(y_trn, n_trn)
raw_train, raw_valid = split_vals(df_raw, n_trn)
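To see what `split_vals` returns, here is a small sketch on a toy dataframe. The names `toy`, `n_valid_toy` and `n_trn_toy` are hypothetical stand-ins for `df_trn`, `n_valid` and `n_trn`.

```python
import pandas as pd

def split_vals(a, n):
    """Return the first n rows of a and the remaining rows."""
    return a[:n], a[n:]

# Hypothetical toy frame standing in for df_trn.
toy = pd.DataFrame({"x": range(10)})
n_valid_toy = 3
n_trn_toy = len(toy) - n_valid_toy

train_part, valid_part = split_vals(toy, n_trn_toy)
print(len(train_part), len(valid_part))  # 7 3
```

The split keeps row order and does no shuffling, so the validation set is always the last `n_valid` rows.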

Print Score

In [None]:
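import math

def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    # Prints [train RMSE, valid RMSE, train score, valid score, OOB score (if available)];
    # for regressors, m.score returns R^2
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)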
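As a quick check of the `rmse` helper, here is a worked example on two tiny arrays. The values are made up for illustration.

```python
import math
import numpy as np

def rmse(x, y):
    return math.sqrt(((x - y) ** 2).mean())

preds = np.array([3.0, 5.0])
actual = np.array([1.0, 5.0])

# Squared errors are [4, 0], their mean is 2, so the RMSE is sqrt(2).
print(rmse(preds, actual))
```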

Look at **Feature Importance**. The higher the number, the more important the feature.

In [3]:
#m is our model
def rf_feat_importance(m, df):
    return pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_}
                       ).sort_values('imp', ascending=False)

Then, if we want to work only with features/columns that have a score greater than 0.005, for instance, we could do the following:

In [6]:
fi=rf_feat_importance(m,df) #check feature importance using our model and dataframe
fi[:20] #look at, say, the first 20 features
to_keep = fi[fi.imp>0.005].cols; len(to_keep) #keep all those with a score greater than 0.005
df_keep = df_trn[to_keep].copy()
X_train, X_valid = split_vals(df_keep, n_trn)

Now check the scores on the validation set. If they get worse, keep more features and re-run the fit. You can also plot feature importance before and after reducing the number of features and compare the results.
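The keep-and-recheck workflow can be sketched end to end on synthetic data. Everything here (`df_demo`, `y_demo`, the column names, the forest settings) is hypothetical and only meant to show the shape of the comparison.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rf_feat_importance(m, df):
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                        ).sort_values('imp', ascending=False)

# Synthetic data: only 'a' and 'b' drive the target; the noise columns are random.
rng = np.random.RandomState(0)
df_demo = pd.DataFrame(rng.rand(2000, 4), columns=['a', 'b', 'noise1', 'noise2'])
y_demo = 5 * df_demo['a'] + 2 * df_demo['b'] + rng.normal(scale=0.05, size=2000)

# Order-preserving split: first 1500 rows for training, the rest for validation.
X_tr, X_va = df_demo[:1500], df_demo[1500:]
y_tr, y_va = y_demo[:1500], y_demo[1500:]

m_all = RandomForestRegressor(n_estimators=30, min_samples_leaf=3,
                              random_state=0, n_jobs=-1)
m_all.fit(X_tr, y_tr)

fi_demo = rf_feat_importance(m_all, X_tr)
keep = fi_demo[fi_demo.imp > 0.005].cols  # same threshold as in the cell above

# Refit on the kept columns only and compare validation scores.
m_keep = RandomForestRegressor(n_estimators=30, min_samples_leaf=3,
                               random_state=0, n_jobs=-1)
m_keep.fit(X_tr[keep], y_tr)

print(fi_demo)
print('all features :', m_all.score(X_va, y_va))
print('kept features:', m_keep.score(X_va[keep], y_va))
```

If the score with the kept features is clearly lower, lower the threshold and try again.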

***One Hot Encoding***

Not always required, despite claims to the contrary, but worth trying to see whether the score improves. Some key points:

- logistic regression (classification) - don't use ordinal encoding for unordered categories, as the model would treat the integer codes as a meaningful numeric order
- linear models - make sure to take the dummy variable problem into account. Many implementations handle this, as linear models are sensitive to collinearity.
- for Random Forests, we don't need to worry about the dummy variable problem when doing 1-hot encoding

Following on from the above, including every dummy variable in a Random Forest can improve performance, as the model doesn't need to infer the final dummy variable from the other columns.

**Note:** Don't encode every variable, as this can cause memory problems. A good rule of thumb: if a categorical variable has about 7 or fewer unique values, 1-hot encode it.

Then look at your feature_importance graph. The results might surprise you.
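The rule of thumb above can be sketched with `pd.get_dummies`. The toy dataframe, its column names, and the threshold of 3 are assumptions chosen to keep the example small; the text suggests a threshold of about 7.

```python
import pandas as pd

# Hypothetical toy data; column names are made up for illustration.
df_cat = pd.DataFrame({
    'size':  ['S', 'M', 'L', 'M', 'S'],       # 3 unique values -> one-hot encode
    'model': ['a1', 'b2', 'c3', 'd4', 'e5'],  # 5 unique values -> keep as codes
})

max_n_cat = 3  # illustrative threshold; the text above suggests about 7

low_card = [c for c in df_cat.columns if df_cat[c].nunique() <= max_n_cat]
high_card = [c for c in df_cat.columns if c not in low_card]

# One-hot encode the low-cardinality columns, keeping every dummy column
# (drop_first=False), which is fine for a Random Forest.
encoded = pd.get_dummies(df_cat, columns=low_card, drop_first=False)

# The remaining categorical columns become integer codes.
for c in high_card:
    encoded[c] = encoded[c].astype('category').cat.codes

print(encoded.columns.tolist())
```

For a linear model, passing `drop_first=True` drops one dummy per variable and avoids the dummy variable problem.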

***Correlation and Removing unneeded features***

In [10]:
import scipy.stats
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc

In [None]:
corr = np.round(scipy.stats.spearmanr(df_keep).correlation, 4)
corr_condensed = hc.distance.squareform(1-corr)
z = hc.linkage(corr_condensed, method='average')
fig = plt.figure(figsize=(16,10))
dendrogram = hc.dendrogram(z, labels=df_keep.columns, orientation='left', leaf_font_size=16)
plt.show()

The above is from the Fast.ai course. Alternatively, you can use seaborn's `sns.clustermap()`.
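To see why the dendrogram groups redundant columns, here is a plot-free sketch on synthetic data. The columns are made up, `squareform` is imported directly from `scipy.spatial.distance`, and the cut at distance 0.5 with `fcluster` is an illustrative choice, not part of the course code.

```python
import numpy as np
import pandas as pd
import scipy.stats
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import squareform

rng = np.random.RandomState(0)
x = rng.rand(500)
df_demo = pd.DataFrame({
    'x': x,
    'x_squared': x ** 2,            # monotonic in x, so the rank correlation is 1
    'unrelated': rng.rand(500),
})

# Spearman correlation works on ranks, so it also catches non-linear
# but monotonic relationships such as x versus x squared.
corr = np.round(scipy.stats.spearmanr(df_demo).correlation, 4)
dist = squareform(1 - corr)         # condensed distance vector
z = hc.linkage(dist, method='average')

# Cut the tree at distance 0.5 to get one flat cluster label per column.
labels = hc.fcluster(z, t=0.5, criterion='distance')
print(dict(zip(df_demo.columns, labels)))
```

Columns that share a label here are the ones that would sit on the same low branch of the dendrogram.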

After identifying features that seem correlated, one can proceed as below (example from the fast.ai course):

In [11]:
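from sklearn.ensemble import RandomForestRegressor

def get_oob(df):
    # Fit a small forest on the training rows of df and return its out-of-bag score
    m = RandomForestRegressor(n_estimators=30, min_samples_leaf=5, max_features=0.6, n_jobs=-1, oob_score=True)
    x, _ = split_vals(df, n_trn)
    m.fit(x, y_train)
    return m.oob_score_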

Get baseline

In [None]:
get_oob(df_keep)

and loop through each correlated candidate, drop it, and print the OOB score:

In [None]:
for c in ('saleYear', 'saleElapsed', 'fiModelDesc', 'fiBaseModel', 'Grouser_Tracks', 'Coupler_System'):
    print(c, get_oob(df_keep.drop(c, axis=1)))

then drop one from each related group, since they're measuring the same thing.
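The drop-and-compare loop can be tried on synthetic data where one column is a near copy of another. The names `df_demo`, `y_demo` and `get_oob_demo` are hypothetical, and `random_state=0` is added only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
n = 2000
a = rng.rand(n)
df_demo = pd.DataFrame({
    'a': a,
    'a_copy': a + rng.normal(scale=0.01, size=n),  # nearly redundant with 'a'
    'b': rng.rand(n),
})
y_demo = 4 * df_demo['a'] + 2 * df_demo['b'] + rng.normal(scale=0.05, size=n)

def get_oob_demo(df, y):
    # Same forest settings as get_oob above, plus a fixed seed.
    m = RandomForestRegressor(n_estimators=30, min_samples_leaf=5, max_features=0.6,
                              n_jobs=-1, oob_score=True, random_state=0)
    m.fit(df, y)
    return m.oob_score_

baseline = get_oob_demo(df_demo, y_demo)
print('baseline', baseline)

# Drop each redundant candidate in turn and compare with the baseline.
scores = {}
for c in ('a', 'a_copy'):
    scores[c] = get_oob_demo(df_demo.drop(c, axis=1), y_demo)
    print(c, scores[c])
```

If the OOB score barely moves when a column is dropped, that column is a reasonable candidate for removal.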