
[WIP] Permutation importances for Random Forest #8027

Closed

Conversation


dmwinslow commented Dec 9, 2016

This provides an alternative method for calculating feature importances for random forests by means of a permutation test. This test was described in the initial random forest paper [1] and codified in subsequent literature [2]. We based this work in part on a previous pull request, #3436, but opted to start a new pull request as our implementation differs substantially and uses somewhat different methodology in performing the actual permutation test (we could not find the method used in #3436 anywhere in the literature).

Relevant to issue #6024, which requests permutation importances specifically for RandomForestClassifier, but also makes the point that permutation importances could be applied more generally to classifiers and regressors.

References
.. [1] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.
.. [2] Jerome Paul and Pierre Dupont, "Inferring statistically significant features from random forests", Neurocomputing, 150, Part B:471–480, 2015.
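For context, the OOB permutation test described above can be sketched roughly as follows. All names here are illustrative, not the PR's actual implementation:

```python
import numpy as np

def permutation_importances(trees, oob_masks, X, y, random_state=None):
    """Sketch of Breiman-style permutation importance.

    For each tree: score it on its out-of-bag (OOB) rows, then re-score
    with one feature's OOB column shuffled; the importance of that feature
    is the mean accuracy drop across trees.
    """
    rng = np.random.RandomState(random_state)
    n_features = X.shape[1]
    importances = np.zeros(n_features)
    for tree, oob in zip(trees, oob_masks):
        X_oob, y_oob = X[oob], y[oob]
        base_acc = np.mean(tree.predict(X_oob) == y_oob)
        for feature_ind in range(n_features):
            X_perm = X_oob.copy()
            rng.shuffle(X_perm[:, feature_ind])  # permute one feature only
            perm_acc = np.mean(tree.predict(X_perm) == y_oob)
            importances[feature_ind] += base_acc - perm_acc
    return importances / len(trees)
```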

Outstanding issues

  • No support for RandomForestRegressor
    -- Regression would require a different accuracy metric. R^2 score is one option, but we should consider which metric to use as default, and if we want to support alternative accuracy metrics for permutation importance calculations.
  • No support for multiple outputs
    -- This is easy enough to rectify, but we will need to decide how to handle aggregate accuracy over multiple outputs. One way to do this might be a simple average over all outputs. Should consider how this aggregation might interact with other accuracy metrics, if we choose to support other options.
  • No support for class/sample weights
    -- This implementation does not currently account for class weights or sample weights when calculating feature importances. Supporting them may be desirable, but it seems more sensible to use the same weights that were used to train the forest. We still need to do a quick literature search on this; comments welcome if anyone else has an opinion.
  • Use of static methods for helper functions
    -- We opted to implement a number of helper functions as static methods on the ForestClassifier class in order to make the code more readable and avoid repeated code (i.e., _get_oob_data, _predict_oob, _calc_mislabel_rate). This doesn't seem like standard practice, so any advice on the preferred method for doing this would be appreciated.
  • Use of _feature_importances at the tree level
    -- The changes made to the BaseDecisionTree are a bit of a hack, but were necessary to ensure that the feature importances reported by the tree are self-consistent with those calculated at the Forest level (e.g., for the std bars reported in the example we added). We could also accomplish this by moving the permutation test logic into the tree itself, but the trees being aware of the out-of-bag sampling seems wrong as well. Suggestions definitely welcome here.
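On the first bullet, one possible shape for an R^2-based regression variant, shown here purely as a sketch (the default metric is still an open question, and the helper name is hypothetical):

```python
import numpy as np

def r2_drop(tree_predict, X_oob, y_oob, feature_ind, rng):
    """Drop in OOB R^2 for one tree when one feature column is permuted."""
    def r2(y_true, y_pred):
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - y_true.mean()) ** 2)
        return 1.0 - ss_res / ss_tot

    base = r2(y_oob, tree_predict(X_oob))
    X_perm = X_oob.copy()
    rng.shuffle(X_perm[:, feature_ind])  # permute the one feature
    return base - r2(y_oob, tree_predict(X_perm))
```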
y_oob_pred = self._predict_oob(X_oob, tree)

feature_importances = np.zeros(n_features)
for feature_ind in xrange(n_features):


dalmia Dec 10, 2016

Contributor

Python 3 removed xrange(); please change this to range().

Member

jnothman left a comment

This looks interesting. Some initial code comments while I fall asleep...

n_jobs=n_jobs,
random_state=random_state,
verbose=verbose,
warm_start=warm_start,
class_weight=class_weight)

def _set_permutation_feature_importances(self, X, y):
# Convert data to arrays, y to row vector
X = np.array(X)


jnothman Dec 12, 2016

Member

X will either be an array or sparse matrix already. If sparse matrix, we should be raising a NotImplementedError, I think

def _set_permutation_feature_importances(self, X, y):
# Convert data to arrays, y to row vector
X = np.array(X)
y = np.squeeze(np.array(y))


jnothman Dec 12, 2016

Member

y will already be an array.

y_oob_pred_perm = self._predict_oob(X_oob, tree,
shuffle_ind=feature_ind,
random_state=random_state)
n_wrong = self._calc_mislabel_rate(y_oob, y_oob_pred)


jnothman Dec 12, 2016

Member

it looks like this can be taken out of the loop


@staticmethod
def _calc_mislabel_rate(expected, predicted):
bool_mislabel = np.squeeze(expected) != np.squeeze(predicted)


jnothman Dec 12, 2016

Member

in what case should we expect expected.shape != predicted.shape that we should want to squeeze?
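One case where the squeeze could matter: if y arrives as a column vector, != broadcasts to a matrix instead of comparing elementwise. A small illustration (shapes are hypothetical, not taken from the PR):

```python
import numpy as np

# Comparing a (3,) array with a (3, 1) array broadcasts to (3, 3),
# which would silently give a wrong mislabel rate; squeezing both
# guards against a column-vector y sneaking through.
expected = np.array([0, 1, 1])
predicted = np.array([[0], [1], [0]])

broadcast_shape = (expected != predicted).shape                      # (3, 3)
safe_shape = (np.squeeze(expected) != np.squeeze(predicted)).shape   # (3,)
```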


ryanvarley commented Sep 6, 2017

Given the recent discussion in #8898 regarding eli5 implementing permutation importances is it worth progressing with this PR?

This technique for random forests is more standard (it is available in R and was mentioned in the original random forest paper) and can use the OOB data, meaning the API stays the same (no additional input is needed for this importance measure). It would be a nice addition to scikit-learn, especially given some of the issues with the existing importance measure (Gini importance), which many won't see if it's only available in an external package.

That said, the eli5 implementation is very nice and much more useful, as it can easily be applied to any classifier under multiple configurations, so the extra value we would gain here is minimal.

Thoughts @jnothman @kmike?

Member

jnothman commented Sep 7, 2017


ryanvarley commented Sep 18, 2017

In that case I suggest we close this PR and the inactive PR it was based on (#3436).

Member

jnothman commented Sep 18, 2017

Thanks for the reminder.
