
Sequential Feature Selection should apply "groups" parameter #524

Open
arc12 opened this issue Apr 30, 2019 · 5 comments

@arc12 (Contributor) commented Apr 30, 2019

I've been attempting to use GroupKFold to preserve independence during CV. scikit-learn can do this with RFECV, but I've been looking at alternatives (long story, but the sklearn implementation makes assumptions which don't work for me).

If SFS were to propagate groups from fit() to the CV iterator's split(), in a similar manner to the way classifier parameters are propagated, it would make me very happy indeed!

Somewhere in the call sequence, "groups" should presumably be extracted from fit_params and promoted. I suppose sequential_feature_selector._calc_score() would be an appropriate location, since it is the point of interface to sklearn. A one-line change!
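For illustration, the kind of change I have in mind would look something like this (a sketch only; the attribute names est_, scorer, and cv are my assumptions about the mlxtend internals, not the confirmed implementation):

```python
from sklearn.model_selection import cross_val_score

# Hypothetical sketch of _calc_score with `groups` promoted; attribute
# names (est_, scorer, cv) are assumed, not taken from the actual source.
def _calc_score(selector, X, y, indices, groups=None, **fit_params):
    # cross_val_score forwards `groups` to the CV iterator's split()
    scores = cross_val_score(selector.est_, X[:, indices], y,
                             groups=groups,
                             cv=selector.cv,
                             scoring=selector.scorer,
                             fit_params=fit_params)
    return indices, scores
```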

Does that analysis sound right?
Would a PR with such a change be likely to be accepted (assuming done correctly)?
(And what's the likely timescale for the next "dot" release?)

@qiagu (Contributor) commented Apr 30, 2019
I am curious how SFS performs compared to RFE/RFECV or other equivalents. Can anyone provide a reference, please? I'd appreciate that very much!

@rasbt (Owner) commented Apr 30, 2019

That's a good idea. I am curious: have you tried using GroupKFold with SFS by chance, and did you run into problems? While I have not specifically considered or tried this, I think it could even work already, since you can pass KFold, StratifiedKFold, etc. iterators via the cv argument. As far as I understand, GroupKFold has a similar iterator structure. Instead of modifying fit, could the splits be specified in the constructor? But maybe that is not a good idea, because then trying different groups would require reinitialization each time ...
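For example, something like this should already be accepted today (an untested sketch; the open question is only how groups would reach the splitter's split() call):

```python
from sklearn.model_selection import GroupKFold
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# A CV splitter object (rather than an integer) can be passed via `cv`;
# without a `groups` hook, though, GroupKFold.split() fails downstream.
sfs = SFS(KNeighborsClassifier(n_neighbors=3),
          k_features=2,
          cv=GroupKFold(n_splits=5))
```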

How do you currently do that with RFECV? Are the groups specified in fit or via the cv argument of the constructor? I think we should maybe stick close to what's done there. In general, I'd be happy to have a PR for this.

> (And what's the likely timescale for the next "dot" release?)

I don't have a specific time/date in mind, as this is more of a hobby/side project and I usually make releases when enough changes have accumulated, but I was just thinking about cutting a new release the other day. Probably some time in May :).

> @qiagu: I am curious how SFS performs compared to RFE/RFECV or other equivalents. Can anyone provide a reference, please?

I think there is no easy/obvious answer, because it depends on a lot of things. RFE is usually based on linear models, and given their assumptions, it only really makes sense to use it if your data can be separated reasonably well by a linear decision boundary. Also, the selection is not compatible with non-parametric models such as decision trees, random forests, KNN, and kernel SVMs, which limits its use cases. However, for models like linear regression (and variants) as well as logistic regression, it's a fine choice because it's faster and can probably get you reasonably good results (and more control over the number of features compared to using L1 regularization with those).
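To make the compatibility point concrete, here is a quick illustration (my own example, not from the thread): RFE requires the fitted estimator to expose coef_ or feature_importances_, whereas SFS only needs fit/predict plus a scorer, so it works with KNN out of the box:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# RFE ranks features by coef_/feature_importances_, which KNN lacks:
try:
    RFE(knn, n_features_to_select=2).fit(X, y)
except Exception as e:
    print(type(e).__name__)  # raises, since KNN exposes neither attribute

# SFS scores candidate feature subsets with plain cross-validation instead:
sfs = SFS(knn, k_features=2, forward=True, cv=5).fit(X, y)
print(sfs.k_feature_idx_)
```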

@qiagu (Contributor) commented May 1, 2019

My god, I used to try RFECV a lot with XGBoost and RandomForest. Thank you very much for the clarification, @rasbt.

@arc12 (Contributor, Author) commented May 1, 2019


Using GroupKFold is part of an effort to "do it right", independent of which specific feature-selection approach we might use. There are two issues: a) there will typically be from 2 to 20 records relating to an individual, and these are often strongly correlated; b) the dataset is quite imbalanced in the target class, so 4x or 5x inflation of the minority class is not unusual. Simply applying standard feature-selection routines with default parameters is therefore quite misleading.

The sklearn API pattern is that "groups" is a parameter to cv.split(); it cannot be part of the constructor because it refers to the individual samples in X, y.

The sklearn API pattern is also that "groups" is a named parameter to RFECV.fit().
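For reference, a minimal demonstration of both patterns (toy data of my own):

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold

rng = np.random.RandomState(0)
X = rng.rand(12, 4)
y = np.array([0, 1] * 6)
groups = np.repeat([0, 1, 2, 3], 3)  # e.g. 4 individuals, 3 records each

# pattern 1: groups go to split(), not the constructor
cv = GroupKFold(n_splits=4)
for train_idx, test_idx in cv.split(X, y, groups=groups):
    pass  # one individual's records never straddle train/test

# pattern 2: RFECV forwards a named `groups` argument from fit()
selector = RFECV(LogisticRegression(solver='liblinear'), cv=cv)
selector = selector.fit(X, y, groups=groups)
```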

Hence I see two options for amending mlxtend:
a) add an explicit "groups" parameter to SFS fit()
b) handle it via **fit_params

(a) seems more "pure" from an API-design standpoint but represents a formal change of interface (both options are sketched in signature terms below).
@rasbt, I will take your advice on which you prefer.
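In signature terms, the two options would look roughly like this (sketches only, not actual mlxtend code):

```python
# option (a): a new explicit keyword argument on SFS.fit
def fit(self, X, y, groups=None, **fit_params):
    ...

# option (b): no interface change; pull it out of fit_params instead
def fit(self, X, y, **fit_params):
    groups = fit_params.pop('groups', None)
    ...
```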

@rasbt (Owner) commented May 1, 2019

Oh, you are right. I thought you could do something similar to

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

iris = load_iris()
X, y = iris.data, iris.target

cv = StratifiedKFold(shuffle=True, n_splits=5, random_state=123)
estimator = LogisticRegression(multi_class='multinomial', solver='newton-cg')
selector = RFECV(estimator, step=1, cv=cv)
selector = selector.fit(X, y)
```

i.e.,

```python
from sklearn.model_selection import GroupKFold

# (this doesn't actually work -- GroupKFold's constructor
# takes no `groups` argument, as noted below)
cv = GroupKFold(n_splits=5, groups=75*[0] + 75*[1])
estimator = LogisticRegression(multi_class='multinomial', solver='newton-cg')
selector = RFECV(estimator, step=1, cv=cv)
selector = selector.fit(X, y)
```

But I just saw that the group splitting is done via the additional get_n_splits/split call, like you mentioned.

Adding groups as an additional explicit parameter will probably break some existing code, but I think it's the better way to go, because then we don't need to create inelegant workarounds for parsing it out of **fit_params. So I would suggest that step instead. Also, there is probably only a very small subset of users who actually use **fit_params at all.
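Assuming the explicit-parameter route, end-user code would presumably look something like this (a hypothetical target API; this would not run against the current release):

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

rng = np.random.RandomState(0)
X = rng.rand(20, 5)
y = rng.randint(0, 2, 20)
groups = np.repeat(np.arange(5), 4)  # 5 individuals, 4 records each

sfs = SFS(KNeighborsClassifier(n_neighbors=3),
          k_features=2,
          cv=GroupKFold(n_splits=5))
# hypothetical: fit() forwards `groups` through to GroupKFold.split()
sfs = sfs.fit(X, y, groups=groups)
```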
