Add permutation based feature importance? #11187

Closed

amueller opened this issue Jun 2, 2018 · 28 comments
@amueller (Member) commented Jun 2, 2018

I think adding permutation based feature importances would be cool:
https://link.springer.com/article/10.1186%2F1471-2105-8-25

There is a python package that does it for our random forests with bagging:
https://github.com/parrt/random-forest-importances

But I'd rather see a generic permutation-based importance score with cross-validation or a hold-out set.
I think this would be a great analysis tool.

One easy way to implement it would be to provide a function (for plotting etc.). But ideally we'd be able to use it in feature selection, I think, so we'd need to create a meta-estimator PermutationImportanceCV that only provides feature_importances_, I guess, so it can be wrapped with SelectFromModel?
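
For concreteness, a rough sketch of the hold-out flavour of the idea; the function name, signature, and defaults here are illustrative, not a proposed API:

import numpy as np

def permutation_importance_holdout(estimator, X_test, y_test, n_repeats=5, random_state=0):
    # Baseline score on the intact hold-out data.
    rng = np.random.RandomState(random_state)
    baseline = estimator.score(X_test, y_test)
    importances = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        scores = []
        for _ in range(n_repeats):
            # Shuffle one column at a time, leaving the rest intact.
            X_perm = X_test.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            scores.append(estimator.score(X_perm, y_test))
        # Importance = drop in score when the feature's values are scrambled.
        importances[j] = baseline - np.mean(scores)
    return importances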

@amueller (Member, Author) commented Jun 2, 2018

Also see this blog post:
http://parrt.cs.usfca.edu/doc/rf-importance/index.html

I'm not suggesting we change our RF feature importances, but having a more expensive but possibly higher quality alternative would be great.

@glemaitre (Contributor) commented Jun 4, 2018

> I'm not suggesting we change our RF feature importances, but having a more expensive but possibly higher quality alternative would be great.

It was something we thought about with @ogrisel.

Moreover, the implementation described in the blog post is model agnostic, if I remember right.

@amueller (Member, Author) commented Jun 4, 2018

Their implementation is not model agnostic, but the general algorithm is. And I'd be happy to upstream from eli5 given that this seems a relatively natural and accepted method.

@VineethKanaparthi commented Jun 16, 2018

I am waiting for this method to be included. We need reliable results more than anything, even if they're not fast; we'd rather not be wrong. I implemented it myself for my model (for a competition).

@datajanko (Contributor) commented Jul 20, 2018

I think PermutationImportanceCV is a good approach, and I see the advantage of a uniform interface. The question is: wouldn't a keyword argument on the random forest classes, selecting a different feature importance algorithm, also work? We already provide multiple impurity measures, so why not maintain multiple importance measures?

@datajanko (Contributor) commented Aug 1, 2018

The rfpimp package allows treating multiple features as one. I think it would make sense to include this as well (maybe later); what are your thoughts on that? I think it would require changes in SelectFromModel. But it also immediately raises the question of whether one could be more flexible when splitting features in trees: by defining some kind of feature group, one could enforce that e.g. latitude and longitude always appear together in splits.
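
A rough sketch of the grouped variant: permute the rows of the grouped columns jointly, so relations within the group (e.g. latitude/longitude) are preserved while the group's relation to the target is destroyed. The function name and signature are illustrative, not rfpimp's actual API:

import numpy as np

def group_permutation_importance(estimator, X_test, y_test, group, random_state=0):
    # `group` is a list of column indices treated as a single feature.
    rng = np.random.RandomState(random_state)
    baseline = estimator.score(X_test, y_test)
    # One shared row permutation for all columns in the group, so the
    # within-group structure stays intact.
    idx = rng.permutation(X_test.shape[0])
    X_perm = X_test.copy()
    X_perm[:, group] = X_test[idx][:, group]
    return baseline - estimator.score(X_perm, y_test)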

@datajanko (Contributor) commented Aug 3, 2018

So instead of permuting, I tried the other approach presented in the articles: dropping columns.

To show some code, the FeatureImportanceCV class could roughly look like this:

import numpy as np
from sklearn.base import BaseEstimator, MetaEstimatorMixin, clone
from sklearn.model_selection import KFold


class FeatureImportanceCV(BaseEstimator, MetaEstimatorMixin):

    def __init__(self, estimator, cv=None):
        self.estimator = estimator
        self.cv = cv

    def fit(self, X, y):
        cv = self.cv if self.cv is not None else KFold(3)
        n_splits = cv.get_n_splits()
        fold_importances = np.zeros((n_splits, X.shape[1]))
        for i, (train, test) in enumerate(cv.split(X, y)):
            # Baseline score on this fold with all features present.
            baseline_fold_i = (clone(self.estimator)
                               .fit(X[train], y[train])
                               .score(X[test], y[test]))
            # Score on this fold with feature j dropped, for each j.
            except_feature_fold_i = [clone(self.estimator)
                                     .fit(np.delete(X[train], j, axis=1), y[train])
                                     .score(np.delete(X[test], j, axis=1), y[test])
                                     for j in range(X.shape[1])]
            fold_importances[i, :] = baseline_fold_i - np.array(except_feature_fold_i)
        self.feature_importances_ = fold_importances.mean(axis=0)
        return self

Would this be the way to go? Of course, the list comprehension should be replaced by something less dense. Permuting can be integrated here easily as well (except for maybe introducing more .copy() calls).

What should the class be capable of? Or, just looking at it, am I doing something completely wrong?

I was able to run this on my machine with:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10, n_features=5, n_clusters_per_class=1, n_informative=3)
ficv = FeatureImportanceCV(LogisticRegression())
ficv.fit(X, y)
ficv.feature_importances_
array([0.38888889, 0.        , 0.19444444, 0.        , 0.11111111])

Additionally, the call to SelectFromModel works as well:

sfm = SelectFromModel(ficv).fit(X, y)
sfm.transform(X)

which chooses a subset of the features:

array([[-0.58617384, -1.30229985],
       [-0.52552405, -1.07458527],
       [ 0.94055998, -0.91529262],
       [-1.04839412, -0.2912939 ],
       [ 0.22273055, -1.36454126],
       [ 1.32253391, -1.68018815],
       [-0.13443558, -0.66608385],
       [ 0.29920044,  0.44704055],
       [ 1.94086301, -3.04870667],
       [ 0.4591779 ,  0.22705557]])

@parrt commented Sep 2, 2018

I do a drop column thing too in rfpimp. It gives slightly different results but takes longer due to all the retraining.

@datajanko (Contributor) commented Sep 3, 2018

Compared to what? I think doing the random permutations and re-fitting should have the same complexity. You should only see performance gains when computing the feature importance out-of-bag. But that is what you meant, right?

@parrt commented Sep 3, 2018

Well, isn't it the case that training a model is vastly more expensive than using it? In my experience it is always that way, except for kNN. If you have 50 features, you have to retrain the model 50 times for drop-column importance, and the model might take overnight to train even once. Permuting and re-running predictions should be many times faster.

@datajanko (Contributor) commented Sep 3, 2018

Of course it is, and I'd really like to see this feature directly in sklearn for random forests. However, the approach might also be useful for other classifiers. Using, for example, dask (and dask-ml), one might be able to scale it quite easily.

@parrt commented Sep 3, 2018

I was just referring to your comment on complexity. Training complexity is very different from the complexity of running samples through the model, but now you are agreeing the cost is higher? Just turning on a few extra cores won't change the complexity. In any case, drop-column feature importance is still useful, which is why I implemented it as well. And yes, drop-column and permutation importance both work with any model.

@datajanko (Contributor) commented Sep 3, 2018

I think there is a slight misunderstanding: instead of dropping a column, you could also just permute it and train again with the permuted column, which has almost the same complexity as dropping the column. I understood your package to also provide that option (but maybe I just overlooked a small detail in your really nice notes on the feature importances).

However, one could of course also sample a specific proportion of the data and score before and after permuting the columns on that sample (one might even want to fit on a smaller set, but if I recall correctly, those ideas are already mentioned in the documentation of your package).
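
A rough sketch of that permute-and-retrain variant for a single feature j, assuming a scikit-learn style estimator; the function name is illustrative:

import numpy as np
from sklearn.base import clone

def permute_and_retrain_importance(estimator, X_train, y_train, X_test, y_test, j, random_state=0):
    rng = np.random.RandomState(random_state)
    # Baseline: fit and score on the unmodified data.
    baseline = clone(estimator).fit(X_train, y_train).score(X_test, y_test)
    # Refit with column j shuffled, so the model trains on destroyed signal
    # while keeping the same number of columns (unlike drop-column).
    X_perm = X_train.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    permuted = clone(estimator).fit(X_perm, y_train).score(X_test, y_test)
    return baseline - permuted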

@parrt commented Sep 3, 2018

Ok, I think we are in agreement! :)

@candalfigomoro commented Dec 19, 2019

@amueller @thomasjpfan

Sorry for the late participation.

An alternative (and, in my opinion, better) way to compute permutation importance is to permute the target instead of permuting the features.

See Altmann et al., "Permutation importance: a corrected feature importance measure" (Bioinformatics, 2010).

According to the paper:

> To preserve the relations between features, we use permutations of the outcome.

@glemaitre (Contributor) commented Dec 19, 2019

@candalfigomoro Do you mean https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.permutation_test_score.html?

However, that does not investigate the importance of each feature, but rather the variation of the model's predictions.

@amueller (Member, Author) commented Dec 23, 2019

@glemaitre that's not a per-feature measure. I wasn't aware of that paper; it seems interesting.

@candalfigomoro commented Dec 24, 2019

For completeness, this is the original R implementation: http://www.altmann.eu/documents/PIMP.R

@candalfigomoro commented Dec 24, 2019

Basically, it compares the feature importances (e.g. Gini importances) you get when the model is trained against the true target values with those you get when it is trained against permuted target values.

@parrt commented Dec 24, 2019

@jnothman Yep, with a null distribution you get empirical p-values for each feature. It's a simple loop around a valid feature importance mechanism (I built mine just yesterday). And make sure to compute the ratio correctly: this paper says don't use count/trials; use (count+1)/(trials+1).
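
A minimal sketch of that loop, assuming a scikit-learn style estimator exposing feature_importances_; the function name and defaults are illustrative:

import numpy as np
from sklearn.base import clone

def pimp_p_values(estimator, X, y, n_permutations=100, random_state=0):
    rng = np.random.RandomState(random_state)
    # Importances from a fit on the true target.
    observed = clone(estimator).fit(X, y).feature_importances_
    # Null distribution: importances from fits on permuted targets.
    null = np.empty((n_permutations, X.shape[1]))
    for k in range(n_permutations):
        null[k] = clone(estimator).fit(X, rng.permutation(y)).feature_importances_
    # Empirical p-values with the (count + 1) / (trials + 1) correction.
    count = (null >= observed).sum(axis=0)
    return (count + 1) / (n_permutations + 1)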

@Devilmoon commented Aug 20, 2020

Hi, has any work been done on implementing Altmann's approach to permutation importance in the end?

I am going to reimplement it myself in Python, because the expected run time of an approach leveraging scikit-learn's RFs would be about 50 minutes on my dataset, against an expected 25 hours in R. I just wanted to know whether this is still being considered for scikit-learn, or whether an implementation has already been proposed.

@glemaitre (Contributor) commented Aug 21, 2020

I would advise opening a new issue to discuss new features, as mentioned by @jnothman; you will get little attention on a closed issue.

But I think it would be interesting to check this method. It should be much faster than the current method. However, I don't know whether it has drawbacks in terms of biases we should be aware of (I need to read the paper thoroughly).

@Devilmoon commented Aug 21, 2020

Hey @glemaitre, thanks for the quick response. I will definitely open a new issue, since this method is very interesting and the only implementations I could find, in R, are quite slow given the RF implementation they leverage.

As for the drawbacks of this method, I believe the main one is the number of permutations needed to get reliable p-values. The original paper suggests 50-100 permutations of the response vector; however, one of the implementations I found (I believe in ranger, in R) suggests that many more than 100 are needed for accurate p-values.

Also, it is still based on random forests and Gini impurity measures, so I believe there is still a bias towards high-cardinality and correlated features, but the distribution estimation/p-value procedure should help filter out redundant or otherwise statistically insignificant importance scores.

The benefits are that it is easier and faster to implement than the conditional permutation scheme by Strobl et al., while leaving the dependence between features untouched, and that for a large number of features it would be faster to compute than standard permutation importance (although PIMP requires retraining the model for each permutation, while classical permutation importance does not). On the note of speed, I should also mention the paper by Janitza et al. (https://link.springer.com/article/10.1007/s11634-016-0276-4), which proposes an algorithm that is 100 to 125 times faster than Altmann's PIMP, although it needs a high number of features in the dataset, with many of them having non-positive importance scores.
