Sequential Feature Selection should apply "groups" parameter #524
I've been attempting to use GroupKFold to preserve independence during CV. Scikit-learn can do this with RFECV, but I've been looking at alternatives (long story, but the sklearn implementation makes assumptions which don't work for me).
If SFS were to propagate `groups` from fit() to the CV iterator's split(), in the same way that classifier fit parameters are already propagated, then it would make me happy indeed!
Somewhere in the call sequence, `groups` should presumably be extracted from fit_params and promoted. I suppose `sequential_feature_selector._calc_score()` would be an appropriate location, since that is the point of interface to sklearn. A one-line change!
Does that analysis sound right?
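To make the proposal concrete, here is a rough sketch of what I have in mind, assuming `_calc_score()` wraps sklearn's `cross_val_score` internally (the signature and attribute names are from memory and may not match the current mlxtend source exactly):

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def _calc_score(selector, X, y, indices, groups=None, **fit_params):
    if selector.cv:
        scores = cross_val_score(selector.est_,
                                 X[:, indices], y,
                                 groups=groups,  # <-- the proposed addition:
                                                 # forwarded to cv.split()
                                 cv=selector.cv,
                                 scoring=selector.scorer,
                                 n_jobs=1,
                                 fit_params=fit_params)
    else:
        selector.est_.fit(X[:, indices], y, **fit_params)
        scores = np.array([selector.scorer(selector.est_, X[:, indices], y)])
    return indices, scores
```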
That's a good idea. I am curious, have you tried to use GroupKFold with the SFS by chance and encountered any issues? While I have not specifically considered or tried this, I think it could even already work, since you can pass KFold, StratifiedKFold, etc. iterators via the `cv` parameter.
How do you currently do that with RFECV?
I don't have a specific time/date in mind, as this is more of a hobby/side project and I usually cut a release when enough changes have accumulated, but I was just thinking about making a new release the other day. Probably some time in May :).
I think there is no easy/obvious answer because it depends on a lot of things. RFE is usually based on linear models, and given its assumptions, it only really makes sense if your data can be separated reasonably well by a linear decision boundary. Also, the selection is not compatible with non-parametric models such as decision trees, random forests, KNN, and kernel SVMs, which limits its use cases. However, for models like linear regression (and variants) as well as logistic regression, it's a fine choice because it's faster and can probably get you reasonably good results (and more control over the number of features compared to using L1 regularization with those).
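For comparison, the L1 route mentioned above might look like this in plain scikit-learn (a minimal sketch; the dataset and the regularization strength `C=0.1` are arbitrary choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# L1-penalized logistic regression zeroes out coefficients of weak features;
# SelectFromModel keeps only the features with non-zero coefficients.
# Note that C controls sparsity only indirectly, so the number of selected
# features is not chosen directly, unlike with RFE.
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector = SelectFromModel(l1_model).fit(X, y)
print(selector.get_support().sum(), 'features selected')
```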
Using GroupKFold is part of an effort to "do it right", independent of which specific feature-selection approach we use. There are two issues: (a) there will typically be 2 to 20 records relating to each individual, and these are often strongly correlated; (b) the dataset is quite imbalanced in the target class, so 4x or 5x inflation of the minority class is not unusual. Simply applying standard feature-selection routines with default parameters is therefore quite misleading.
The sklearn API pattern is that `groups` is a parameter to cv.split(); it cannot be part of the constructor because it refers to the individual samples of X and y.
The sklearn API pattern is also that `groups` is a named parameter to RFECV's fit().
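To illustrate both patterns on synthetic data (the group labels here are artificial, and the second pattern assumes a recent scikit-learn version where RFECV.fit() accepts `groups`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
groups = np.repeat(np.arange(20), 5)  # 20 individuals, 5 records each

# Pattern 1: "groups" goes to split(), not to the constructor
cv = GroupKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y, groups=groups):
    pass  # train/test folds never mix records from the same individual

# Pattern 2: RFECV.fit() takes "groups" and forwards it to the splitter
selector = RFECV(LogisticRegression(max_iter=1000), step=1,
                 cv=GroupKFold(n_splits=5))
selector.fit(X, y, groups=groups)
```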
Hence I see two options for amending mlxtend:
(a) add `groups` as an explicit named parameter to SFS's fit(), mirroring RFECV; or
(b) extract `groups` from fit_params and promote it to the CV splitter.
(a) seems more "pure" from an API-design standpoint, but represents a formal change of interface.
Oh, you are right. I thought you could do something similar to:
```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

iris = load_iris()
X, y = iris.data, iris.target

cv = StratifiedKFold(shuffle=True, n_splits=5, random_state=123)
estimator = LogisticRegression(multi_class='multinomial', solver='newton-cg')
selector = RFECV(estimator, step=1, cv=cv)
selector = selector.fit(X, y)
```
```python
from sklearn.model_selection import GroupKFold

# two artificial groups of 75 samples each -- but this raises a TypeError,
# since GroupKFold takes no "groups" constructor argument
cv = GroupKFold(n_splits=5, groups=75 * [0] + 75 * [1])
estimator = LogisticRegression(multi_class='multinomial', solver='newton-cg')
selector = RFECV(estimator, step=1, cv=cv)
selector = selector.fit(X, y)
```
But I just see that the group splitting is done via the additional `groups` argument of the split() method, not via the constructor.
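So something like the following should work instead (an untested sketch; the group labels are artificial, and it assumes a scikit-learn version where RFECV.fit() forwards `groups` to the splitter):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold

X, y = load_iris(return_X_y=True)
groups = np.repeat(np.arange(10), 15)  # 10 artificial groups of 15 samples

estimator = LogisticRegression(multi_class='multinomial', solver='newton-cg')
selector = RFECV(estimator, step=1, cv=GroupKFold(n_splits=5))
# "groups" is passed at fit() time, which hands it on to cv.split():
selector = selector.fit(X, y, groups=groups)
```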