
[MRG] Adds Permutation Importance #13146

Open · wants to merge 78 commits into base: master
Conversation

7 participants
@thomasjpfan (Member) commented Feb 12, 2019:

Reference Issues/PRs

Resolves #11187

What does this implement/fix? Explain your changes.

Adds permutation importance to a model_inspection module.

TODO

  • Initial implementation.
  • Add example demonstrating the differences between permutation importance and feature_importances_ when using trees.
  • Add to user guide.
  • Support pandas dataframes.
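A rough usage sketch of what is being proposed (hypothetical: the module path, the ``n_rounds`` parameter, and the returned array shape follow the draft discussed below and changed over the course of this PR):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.inspection import permutation_importance  # draft location, may move

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Drop in score when each column of the held-out set is shuffled in turn;
    # the draft returns an array of shape (n_features, n_rounds).
    importances = permutation_importance(clf, X_test, y_test, scoring='accuracy',
                                         n_rounds=10, random_state=0)
    print(importances.mean(axis=1))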

thomasjpfan added some commits Feb 8, 2019

thomasjpfan added some commits Feb 12, 2019

@jnothman (Member) left a comment:

Do we want to provide a meta-estimator giving feature_importances_ for use where that's expected?

Please also consider looking at eli5 for feature parity, and perhaps for testing ideas.

thomasjpfan added some commits Feb 13, 2019

@jnothman (Member) left a comment:

Hmmmm... By conducting cross validation over multiple splits, this determines feature importance for a class of model, rather than a specific model. If we are trying to inspect a specific model, surely we should not be fitting cv-many different models, but merely assessing the importance of features to prediction accuracy for the given model.

for column in columns:
    with _permute_column(X_test, column, random_state) as X_perm:
        feature_score = scoring(estimator, X_perm, y_test)
        permutation_importance_scores.append(baseline_score - feature_score)

@jnothman (Member), Feb 13, 2019:

What does it mean when this value is negative? Do we need to clip in that case??

@thomasjpfan (Author, Member), Feb 14, 2019:

Negative means that the model performed better with the feature permuted. This could mean that the feature should be dropped.

There is a paragraph about this in https://explained.ai/rf-importance/index.html at Figure 3(a)

@ogrisel (Member), Feb 18, 2019:

Interesting. I think both the docstring and the user guide should explain the meaning of negative importance.

@thomasjpfan (Author, Member) commented Feb 14, 2019:

Hmmmm... By conducting cross validation over multiple splits, this determines feature importance for a class of model, rather than a specific model.

This is correct. I will add a prefit option to inspect a specific model (turning off the cross validation).

The CV mode isn't inspecting the model; it is using multiple models to find the importance of the features. It is "inspecting the data". If the scope of the inspection module covers both data and model inspection, then this CV feature could be kept in.

@jnothman (Member) commented Feb 14, 2019 (comment minimized).

@ogrisel (Member) commented Feb 18, 2019:

+1 for focusing first on a tool used for the single (fitted) model inspection use case. Here are alternative implementations:

Then we could think of a tool for automated feature selection using a nested cross-validation loop that can be used in Pipeline as the SelectFromModel does. However, to me, it's less of a priority.

@ogrisel (Member) left a comment:

Because it's so cheap to resample the individual predictions (on the permuted validation set), we should take advantage of this to recompute the mean score on many resampled predictions (bootstrap estimates of the importance). I think it's very important that the default behavior of this tool makes it natural to get bootstrap confidence intervals on the feature importance (e.g. a 2.5%-97.5% percentile interval) in addition to the median importance across resampled importances.

Also, the feature importance plot in the example should use horizontal box-and-whisker plots to highlight the uncertainty of these feature importance estimates:

https://matplotlib.org/gallery/pyplots/boxplot_demo_pyplot.html#sphx-glr-gallery-pyplots-boxplot-demo-pyplot-py
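For illustration, a minimal sketch of such a plot (assuming the repeated importances are available as an array of shape (n_features, n_repeats) together with a feature_names list; both names are placeholders):

    import numpy as np
    import matplotlib.pyplot as plt

    # importances: array (n_features, n_repeats); feature_names: list of str
    sorted_idx = np.argsort(np.median(importances, axis=1))
    fig, ax = plt.subplots()
    ax.boxplot(importances[sorted_idx].T, vert=False,
               labels=np.array(feature_names)[sorted_idx])
    ax.axvline(0, color='k', linestyle='--', linewidth=1)  # reference line at zero importance
    ax.set_xlabel("Decrease in score when the feature is permuted")
    fig.tight_layout()
    plt.show()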

@ogrisel (Member) commented Feb 18, 2019:

We could even lower the opacity of the boxplots for features whose 2.5%-97.5% range includes 0, to highlight that those features are not predictive (given the others).

@ogrisel (Member) commented Feb 18, 2019:

Here are other interesting references that I have not carefully read yet:

@thomasjpfan (Author, Member) commented Feb 18, 2019:

@ogrisel Thank you for all the suggestions! I will focus this PR on inspecting a single fitted model and tune the API to make it easy to get bootstrap results.

thomasjpfan added some commits Feb 19, 2019

@jnothman (Member) commented Feb 19, 2019 (comment minimized).

@thomasjpfan (Author, Member) commented Feb 20, 2019:

It’s a prefix I use to mean “REFACTOR”.

@jnothman (Member) commented Feb 20, 2019 (comment minimized).

scores : array, shape (n_features, bootstrap_samples)
    Permutation importance scores
"""

@amueller (Member), Mar 5, 2019:

Needs a reference - and a user guide!

thomasjpfan added some commits Mar 14, 2019

thomasjpfan added some commits May 29, 2019

REV
@amueller (Member) left a comment:

some initial remarks

.. currentmodule:: sklearn.inspection

Permutation feature importance is a model inspection technique that can be used
for any `fitted` `estimator` and the data is rectangular. This is especially

@amueller (Member), May 29, 2019:

Suggested change:
- for any `fitted` `estimator` and the data is rectangular. This is especially
+ for any `fitted` `estimator` when the data is rectangular. This is especially

The :func:`permutation_importance` function calculates the feature importance
of `estimators` for a given dataset. The ``n_rounds`` parameter sets the number
of times a feature is randomly shuffled and returns a distribution of feature

@amueller (Member), May 29, 2019:

n_rounds returns a distribution?

The :func:`permutation_importance` function calculates the feature importance
of `estimators` for a given dataset. The ``n_rounds`` parameter sets the number
of times a feature is randomly shuffled and returns a distribution of feature
importances. Note that :func:`permutation_importance` can accept a training

@amueller (Member), May 29, 2019:

I would say "Note that permutation_importance can be computed on any dataset, though it is common practice to use a hold-out set since estimates on the training set might not be accurate." or something like that?

test set gives the importance of a feature on unseen data, which gives a sense
of the features that are *actually* important.

Tree based models provides their own feature importances based on statistics

@amueller (Member), May 29, 2019:

Suggested change:
- Tree based models provides their own feature importances based on statistics
+ Tree based models provides a different measure of feature importances based on statistics
test set gives the importance of a feature on unseen data, which gives a sense
of the features that are *actually* important.

Tree based models provides their own feature importances based on statistics

@amueller (Member), May 29, 2019:

Maybe add a header here "relation to feature importance in trees"?
Also maybe mention that that's based on the criterion, while here we're using an arbitrary metric on the actual outcome? This article summarizes it nicely btw:
https://medium.com/the-artificial-impostor/feature-importance-measures-for-tree-models-part-i-47f187c1a2c3

I think we should call the random forest feature importance MDI (mean decrease in impurity) and also use that name in the RF docs, to be explicit in what they compute. We can leave the update of the docs for a different PR though.

@@ -0,0 +1,182 @@
"""
==========================================================
Permutation Importance vs Random Forest Feature Importance

@amueller (Member), May 29, 2019:

Again, we should use a name for the RF importance, either MDI or gini importance (though it can also be computed using other criteria).

Permutation Importance vs Random Forest Feature Importance
==========================================================
.. currentmodule:: sklearn.inspection

@amueller (Member), May 29, 2019:

That doesn't do anything, right?

@thomasjpfan (Author, Member), Jun 17, 2019:

Makes it slightly easier to write

:func:`permutation_importance`

(Will be removed)

.. topic:: References:
.. [1] Strobl, C., Boulesteix, AL., Zeileis, A. et al. BMC Bioinformatics

@amueller (Member), May 29, 2019:

Can you add the title please? Also: Breiman already pointed this out, so we could also just reference the random forest paper.


rf = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', RandomForestClassifier(n_estimators=100, min_samples_leaf=1,

@amueller (Member), May 29, 2019:

min_samples_leaf=1 is the default, as is n_estimators=100, right?

@@ -0,0 +1,125 @@
"""
===================================================
Permutation Importance with Multicollinear Features

@amueller (Member), May 29, 2019:

Why multicollinear instead of correlated?

@amueller (Member), May 30, 2019:

I guess multicollinearity is more general though I imagine correlation is actually the most common case of multicollinearity, right? Maybe use both words in the title?

@codecov commented May 30, 2019:

Codecov Report

Merging #13146 into master will decrease coverage by 0.1%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #13146      +/-   ##
==========================================
- Coverage   96.84%   96.74%   -0.11%     
==========================================
  Files         393      395       +2     
  Lines       72087    71930     -157     
  Branches     7902     7872      -30     
==========================================
- Hits        69813    69589     -224     
- Misses       2251     2324      +73     
+ Partials       23       17       -6
Impacted Files Coverage Δ
...rn/inspection/tests/test_permutation_importance.py 100% <100%> (ø)
sklearn/inspection/__init__.py 100% <100%> (ø) ⬆️
sklearn/inspection/permutation_importance.py 100% <100%> (ø)
sklearn/utils/fixes.py 38.7% <0%> (-28.23%) ⬇️
sklearn/preprocessing/tests/test_encoders.py 94.56% <0%> (-5.44%) ⬇️
sklearn/linear_model/setup.py 16% <0%> (-4%) ⬇️
sklearn/decomposition/sparse_pca.py 96.1% <0%> (-3.9%) ⬇️
sklearn/ensemble/_hist_gradient_boosting/loss.py 97.61% <0%> (-2.39%) ⬇️
sklearn/manifold/spectral_embedding_.py 87.57% <0%> (-1.87%) ⬇️
sklearn/neighbors/tests/test_dist_metrics.py 97.5% <0%> (-0.84%) ⬇️
... and 45 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@amueller (Member) left a comment:

Looks good apart from inline comments.

Main concerns are: default value of n_rounds (maybe also the name of that parameter? n_repeats?), the shape of the returned importances, and how we refer to the previously existing importances, which we should decide on a name for - like impurity importance or mean decrease in impurity.

I don't like that this is not tested as well as the estimators because it's not included in the common tests.
For example it's not tested with read-only data (and 100000 other edge-cases).
Not sure what to do about that, though.
I was wondering if we can create a feature selection class from this somehow. It's non-trivial to wrap this in any feature selection right now, correct? That would allow testing it at least indirectly.

It might be good to also have a stronger test about the scores that are returned. If we train a non-penalized linear regression model and compute permutation importance on the training set, the feature importance should be related to the coefficient, right? at least with scoring being absolute error maybe?

#
# It might be possible to trade some accuracy on the training set for a
# slightly better accuracy on the test set by limiting the capacity of the
# trees (for instance by setting ``max_samples_leaf=5`` or

@amueller (Member), May 30, 2019:

Suggested change:
- # trees (for instance by setting ``max_samples_leaf=5`` or
+ # trees (for instance by setting ``min_samples_leaf=5`` or
# It might be possible to trade some accuracy on the training set for a
# slightly better accuracy on the test set by limiting the capacity of the
# trees (for instance by setting ``max_samples_leaf=5`` or
# ``max_samples_leaf=10``) so as to limit overfitting while not introducing too

@amueller (Member), May 30, 2019:

Suggested change:
- # ``max_samples_leaf=10``) so as to limit overfitting while not introducing too
+ # ``min_samples_leaf=10``) so as to limit overfitting while not introducing too


##############################################################################
# Tree Based Feature Importance

@amueller (Member), May 30, 2019:

Impurity based Feature Importance? MDI?

feature_names = []
for col, cats in zip(categorical_columns, ohe.categories_):
    for cat in cats:
        feature_names.append("{}_{}".format(col, cat))

@amueller (Member), May 30, 2019:

we have to do this manually because the SimpleImputer doesn't forward feature names, right? Can we get rid of the SimpleImputer to simplify the code here?

ax.barh(y_ticks, tree_feature_importances[sorted_idx])
ax.set_yticklabels(feature_names[sorted_idx])
ax.set_yticks(y_ticks)
ax.set_title("Random Forest Feature Importances")

@amueller (Member), May 30, 2019:

Impurity based feature importance?

Returns
-------
importances : array, shape (n_features, n_rounds)

@amueller (Member), May 30, 2019:

Should that be the default signature, not the mean? Returning the mean would require adding a parameter to return the raw thing so maybe this is better, but it feels a bit odd as it has a weird shape for something returned by an sklearn function.

@ogrisel (Member), Jun 14, 2019:

I agree with @amueller that returning the full array is a bit odd, but I don't like introducing a kwarg to change the returned value either. Maybe we should return some kind of data object with the mean and the std across rounds and the full importances array as attributes?

Also, in the future I would like to make it possible to compute bootstrap confidence intervals by bootstrapping on the samples from the concatenation of the permuted validation sets. That would require changing the scorers, so this should probably not be done as part of this PR.

One other benefit of using a custom structured data object for the result of this function is that we could have a nice notebook display for it, for instance printing the importances in decreasing order with feature names and mean +/- std values, especially when X is a dataframe.
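For concreteness, a sketch of such a result object (attribute names are illustrative, not final):

    import numpy as np
    from sklearn.utils import Bunch

    def _permutation_importance_result(raw_importances):
        # raw_importances: array of shape (n_features, n_repeats)
        raw_importances = np.asarray(raw_importances)
        return Bunch(importances=raw_importances,
                     mean=raw_importances.mean(axis=1),
                     std=raw_importances.std(axis=1))

A Bunch keeps attribute access (result.mean, result.std) while leaving room to add a richer notebook repr later.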

@amueller (Member), Jun 17, 2019:

How about a dictionary like cross_validate? Not entirely sure though.

n_jobs=None, random_state=None):
"""Permutation importance for feature evaluation. [BRE]_
The permutation importance of a feature is calculated as follows. First,

@amueller (Member), May 30, 2019:

I would formulate it differently. This sounds like training happens inside this function, which it doesn't.
Maybe say "estimator is required to be a fitted estimator. X can be the data set used to train the estimator or a hold-out set."

return scores


def permutation_importance(estimator, X, y, scoring=None, n_rounds=1,

@amueller (Member), May 30, 2019:

Why is n_rounds=1? That seems not very useful, right? What's a good default? What do other libraries do, what does Breiman do?

@ogrisel (Member), Jun 14, 2019:

We use cv=5 by default, so n_rounds=5 might feel like a consistent default for this function.

@amueller (Member), Jun 17, 2019:

Hm not sure they are comparable. What does the literature say? You could similarly argue that RandomForest uses 100 bootstraps so we should use 100 here ;)

    Data on which permutation importance will be computed.
y : array-like, shape = (n_samples, ...)
    Targets for supervised learning.

@amueller (Member), May 30, 2019:

I guess you could fit a GMM (or any other density model) and then compute permutation importance. High importance would indicate strong interactions, low importance would indicate the feature is independent.
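For example, a hand-rolled sketch of that idea with a density model (GaussianMixture and make_blobs are just illustrative choices; the "score" here is the mean log-likelihood rather than a supervised metric):

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=500, n_features=4, random_state=0)
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
    baseline = gmm.score(X)  # mean log-likelihood of the data

    rng = np.random.RandomState(0)
    for i in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, i] = rng.permutation(X_perm[:, i])
        # large drop -> feature interacts with the others,
        # near-zero drop -> feature is (roughly) independent
        print("feature %d: log-likelihood drop %.3f" % (i, baseline - gmm.score(X_perm)))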


if hasattr(X, 'iloc'):  # pandas dataframe
    # Need to copy in case where the dataframe is a view
    X = X.copy()

@amueller (Member), May 30, 2019:

Do we need a common test for read-only dataframes?

thomasjpfan added some commits May 30, 2019

@jnothman (Member) commented May 31, 2019 (comment minimized).

@ogrisel (Member) left a comment:

A few more comments:

# we plot a heatmap of the correlated features:
corr = spearmanr(X).correlation
corr_linkage = hierarchy.ward(corr)
dendro = hierarchy.dendrogram(corr_linkage, no_plot=True)

@ogrisel (Member), Jun 14, 2019:

Actually, why not plot the dendrogram as part of this example?
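For reference, a sketch of what plotting the dendrogram next to the heatmap could look like (reusing corr and corr_linkage from the snippet above; feature_names is assumed to hold the column names):

    import matplotlib.pyplot as plt
    from scipy.cluster import hierarchy

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
    dendro = hierarchy.dendrogram(corr_linkage, labels=feature_names,
                                  ax=ax1, leaf_rotation=90)
    # reorder the correlation matrix to follow the dendrogram leaves
    order = dendro['leaves']
    ax2.imshow(corr[order, :][:, order])
    ax2.set_xticks(range(len(order)))
    ax2.set_yticks(range(len(order)))
    ax2.set_xticklabels(dendro['ivl'], rotation='vertical')
    ax2.set_yticklabels(dendro['ivl'])
    fig.tight_layout()
    plt.show()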


thomasjpfan added some commits Jun 17, 2019

@thomasjpfan (Author, Member) commented Jun 17, 2019:

This PR was updated:

  1. User guide updated based on comments.
  2. Renames n_rounds to n_repeats and sets it to 5 by default.
  3. permutation_importance now returns a Bunch containing the mean, the std, and the raw importances per repeat.
  4. Adds dendrogram to example.

The biggest todo is to include more test coverage, either by introducing a SelectFromModel feature or by finding other ways to verify the permutation importance:

If we train a non-penalized linear regression model and compute permutation importance on the training set, the feature importance should be related to the coefficient, right? at least with scoring being absolute error maybe?

@amueller (Member) commented Jun 18, 2019:

I'm not 100% sold on the bunch, but I'm also not opposed. I think we can leave it for now and focus on testing.

I'm a bit confused. I didn't entirely think this through but my thoughts were along these lines:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=10000, n_features=20, random_state=0)
y -= y.mean()
X.shape

lr = LinearRegression().fit(X, y)

import numpy as np
np.set_printoptions(suppress=True, precision=3)
print(lr.coef_)

res = permutation_importance(lr, X, y, scoring='neg_mean_absolute_error', n_repeats=100)

res.mean - lr.coef_ / 0.889

The result is 0 up to like one decimal. But I have no idea where the 0.889 comes from.
Basically if you look at res.mean / lr.coef_ it's off by a constant factor (part from the near zero coefficients, for which the ratio is obviously meaningless).

So that seems to me like I wasn't entirely wrong but I have no idea where that magic number comes from :-/

@amueller (Member) commented Jun 18, 2019:

Hm ok I should really do something else (and/or try to solve this on the blackboard).
Note for later:
coef_[i] is clearly

np.sqrt(((np.dot(X_, lr.coef_) - y)**2).mean())

where X_ is X where the i-th column is replaced by zeros.

But replacing by zeros is not the same as replacing by the permutation, even if the permuted feature has expectation zero.

This should introduce an error that's related to coef_ and X[i].std() methinks?

Using neg_mean_squared_error (and taking the root) the magic number becomes 0.7.

@amueller (Member) commented Jun 18, 2019:

lol ok my blackboard says the magic number makes sense and it's 1/sqrt(2),

so the importance is sqrt(2) coef_ if we use RMSE for importance.
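A quick numeric sanity check of that factor (a sketch: a single noiseless feature, comparing the RMSE after permutation to the coefficient):

    import numpy as np

    rng = np.random.RandomState(0)
    coef = 3.0
    x = rng.randn(100000)
    y = coef * x                      # noiseless linear model
    x_perm = rng.permutation(x)
    rmse = np.sqrt(np.mean((coef * x_perm - y) ** 2))
    print(rmse / coef)                # ~ sqrt(2) ~ 1.414

Permuting replaces the feature's contribution coef * x with coef * x_perm, and x_perm - x has variance 2 * Var(x) when x is (approximately) independent of its permutation, hence the sqrt(2).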

@amueller (Member) commented Jun 18, 2019:

This raises an interesting question: would it be more intuitive to scale the result by sqrt(2) to be "consistent" with the coefficients? Though I guess when using other losses this is less meaningful and surprising?
