Fixes #502: Add `feature_groups` parameter to SequentialFeatureSelection
#965
Conversation
@rasbt
@rasbt What should the program return when Now, if we take a look at the
then I get the following error when I run pytest:
Which means we are testing
Good points. A counter argument for postprocessing, though, is that (particularly for the EFS) a run can take a very long time. Sometimes users want to interrupt it manually due to time constraints but still use the results up to that point. If we don't do the post-processing for them, the users would have to figure it out themselves. It's not a deal-breaker, but when it comes to convenience, I think in >80% of the cases users would probably prefer it if an interrupted SFS/EFS can later be used the same way as a fully fitted one, without extra steps. Or maybe a middle ground would be this: we do what you propose,
but we add a clean-up method that users have to call (at their own responsibility) after interruption. What would be a good name for this? I would say b) and c) are clearer, but if we go with a) or d), that's also something we can use internally (just the last steps refactored into a method). What do you think?
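The clean-up idea could look roughly like the following. This is a minimal sketch, assuming the selector accumulates per-step results in a `subsets_` dict during `fit()`; the method name `finalize_fit` is purely illustrative (the actual name was still under discussion), and `SelectorSketch` is not mlxtend's class.

```python
# Minimal sketch of a post-interruption clean-up method. Assumptions:
# the selector stores per-step results in `subsets_` as it fits, and
# `finalize_fit` is a placeholder name (the real name was undecided).

class SelectorSketch:
    def __init__(self):
        # step number -> {"feature_idx": tuple, "avg_score": float}
        self.subsets_ = {}
        self.k_feature_idx_ = None
        self.k_score_ = None

    def finalize_fit(self):
        """Derive the best subset from whatever steps completed."""
        if not self.subsets_:
            raise RuntimeError("No completed steps to finalize.")
        best = max(self.subsets_.values(), key=lambda s: s["avg_score"])
        self.k_feature_idx_ = best["feature_idx"]
        self.k_score_ = best["avg_score"]
        return self


selector = SelectorSketch()
# Pretend fit() completed two steps before a KeyboardInterrupt:
selector.subsets_ = {
    1: {"feature_idx": (2,), "avg_score": 0.90},
    2: {"feature_idx": (2, 3), "avg_score": 0.95},
}
selector.finalize_fit()
print(selector.k_feature_idx_, selector.k_score_)  # (2, 3) 0.95
```

In real use, the deliberate extra step would be a `try: selector.fit(X, y) except KeyboardInterrupt: selector.finalize_fit()` pattern, which is what makes an accidental interruption hard to mistake for a completed fit.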
Yes. That's correct.
Yes, I understand. This is just to give users something to work with.
I think this is a good approach.
Then, in the docstring, we can explain that when the run is interrupted via the keyboard, users can still get the best score and the corresponding subset of features by calling the clean-up method. Sounds good? (Is this what you had in mind?)
Yes, I think this is perfect. This way, people who, let's say, accidentally interrupt it are not misled, because it will take an extra deliberate step. But the step is not that inconvenient, so if you know what you are doing, it's a good, easy workaround!
@rasbt According to the test function
And this test function is passing. Now, if I run this function in Jupyter, I get a different result(!):
And I now see that Now, please take a look at Any thoughts?
@rasbt
@rasbt
I think there are three things we may want to do as the final steps (or leave them to someone else):
Thanks for the updates!
I can take care of this above
Hm, yeah, that's not an easy thing. In general, I think that scikit-learn does all checks in fit(). I forgot the details, but there was a rationale behind it. However, I agree with you that some things are better checked in __init__, like when two functionalities interact with or depend on each other, as fixed_features and feature_groups do. There could be an extra check in fit() to verify that the data indeed contains the specified features. On the other hand, I feel like this can also be the user's responsibility, and we don't have to check every detail :P
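For illustration, the cross-parameter check discussed here could be sketched as below. The function name is hypothetical (not mlxtend's API); the underlying idea is that since a group is added or removed as a unit, each group should be either entirely fixed or entirely free.

```python
# Illustrative sketch of an __init__-time consistency check between
# fixed_features and feature_groups (function name is hypothetical).
# Because a group is selected atomically, fixing only part of a group
# would be contradictory, so each group must be fully fixed or free.

def check_fixed_features_in_groups(fixed_features, feature_groups):
    fixed = set(fixed_features)
    covered = set()
    for group in feature_groups:
        g = set(group)
        partial = g & fixed
        if partial and partial != g:
            raise ValueError(f"Group {sorted(g)} is only partially fixed.")
        covered |= g
    missing = fixed - covered
    if missing:
        raise ValueError(
            f"fixed_features {sorted(missing)} do not appear in any group."
        )

# Consistent: both fixed features' groups are fully fixed
check_fixed_features_in_groups([0, 2, 3], [[2, 3], [0], [1]])
```

A separate, data-dependent check (that the fitted `X` actually contains those columns) would still belong in fit(), matching the split described above.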
Just added the unit tests. I think there's one unexpected behavior (second from the bottom) that doesn't look right (haha, or am I overlooking something?)

```python
import pandas as pd
from sklearn.datasets import load_iris

from mlxtend.classifier import SoftmaxRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS


def test_check_pandas_dataframe_with_feature_groups_and_fixed_features():
    iris = load_iris()
    X = iris.data
    y = iris.target
    lr = SoftmaxRegression(random_seed=123)
    df = pd.DataFrame(
        X, columns=["sepal length", "sepal width", "petal length", "petal width"]
    )
    sfs1 = SFS(
        lr,
        k_features=2,
        forward=True,
        floating=False,
        scoring="accuracy",
        feature_groups=[
            ["petal length", "petal width"],
            ["sepal length"],
            ["sepal width"],
        ],
        fixed_features=["sepal length", "petal length", "petal width"],
        cv=0,
        verbose=0,
        n_jobs=1,
    )
    sfs1 = sfs1.fit(df, y)
    assert sfs1.k_feature_names_ == (
        "sepal length", "petal length", "petal width",
    ), sfs1.k_feature_names_
```

This returns a result containing 'sepal width'. But I think that's not possible, i.e., it can't return 'sepal width' because it's not in the fixed features. Compare the index-based version:

```python
def test_check_feature_groups():
    iris = load_iris()
    X = iris.data
    y = iris.target
    lr = SoftmaxRegression(random_seed=123)
    sfs1 = SFS(
        lr,
        k_features=2,
        forward=True,
        floating=False,
        scoring="accuracy",
        feature_groups=[[2, 3], [0], [1]],
        fixed_features=[0, 2, 3],
        cv=0,
        verbose=0,
        n_jobs=1,
    )
    sfs1 = sfs1.fit(X, y)
    assert sfs1.k_feature_idx_ == (0, 2, 3), sfs1.k_feature_idx_
```

Or am I confused now? 😅
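One way to reason about the expected result in these tests: with feature_groups, k_features plausibly counts *groups* rather than individual columns, so two fully fixed groups already satisfy k_features=2. Below is a toy sketch of forward selection over groups under that assumption; it is not mlxtend's implementation, and `score_fn` merely stands in for cross-validated estimator scoring.

```python
# Toy forward selection over feature groups. Assumptions: `k_groups`
# counts groups, and any group fully contained in fixed_features is
# pre-selected. A sketch only, not mlxtend's actual algorithm.

def flatten(groups):
    return [f for g in groups for f in g]

def forward_select_groups(feature_groups, fixed_features, k_groups, score_fn):
    fixed = set(fixed_features)
    # Groups made entirely of fixed features start out selected
    selected = [g for g in feature_groups if set(g) <= fixed]
    remaining = [g for g in feature_groups if not set(g) <= fixed]
    while len(selected) < k_groups and remaining:
        # Greedily add the group that maximizes the score
        best = max(remaining, key=lambda g: score_fn(flatten(selected + [g])))
        selected.append(best)
        remaining.remove(best)
    return tuple(sorted(flatten(selected)))

# Mirrors the second test above: the two fixed groups already give
# k=2 groups, so feature 1 ('sepal width') can never be added.
result = forward_select_groups(
    [[2, 3], [0], [1]], fixed_features=[0, 2, 3],
    k_groups=2, score_fn=lambda feats: len(feats),
)
print(result)  # (0, 2, 3)
```

Under this reading, the index-based test's expected `(0, 2, 3)` is consistent, and the DataFrame variant returning anything with 'sepal width' would indeed be a bug.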
@rasbt
@rasbt
@rasbt
I think things should be good now :)
Thanks so much, this looks reasonable to me! I will see if there's anything to add on the documentation front, but other than that, I think this is good to merge. I am also planning to make a release this weekend so that people can actually start using the awesome features you've added :)
Cool :) Thanks for your help and support throughout this journey. It has been fun :)
This PR is created to add support for `feature_groups` in `SequentialFeatureSelection`, as discussed in #502.

Before developing the new feature, I decided to clean up the code first, as I noticed it has some `nested-if` blocks. The module `sequential_feature_selector.py` now has about 3070 lines less, and I think it might be a little bit more readable.

I will add the `feature_groups` parameter in the near future.
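Related to the DataFrame-based test earlier in the thread, name-based groups need to be mapped to positional indices at some point. A small sketch of that mapping (the helper `resolve_groups` is illustrative, not part of mlxtend):

```python
# Sketch: map string column groups to positional index groups, as one
# would need when feature_groups are given as DataFrame column names.
# `resolve_groups` is an illustrative helper, not mlxtend API.

def resolve_groups(columns, feature_groups):
    pos = {name: i for i, name in enumerate(columns)}
    return [[pos[name] for name in group] for group in feature_groups]

cols = ["sepal length", "sepal width", "petal length", "petal width"]
groups = [["petal length", "petal width"], ["sepal length"], ["sepal width"]]
print(resolve_groups(cols, groups))  # [[2, 3], [0], [1]]
```

This also shows why the name-based and index-based tests in the thread are expected to behave identically once the names are resolved.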