-
Notifications
You must be signed in to change notification settings - Fork 867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: having exhaustive feature selector take a custom iterator as input #834
base: master
Are you sure you want to change the base?
Conversation
Hello @jonathan-taylor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2021-11-29 18:26:37 UTC |
…uential as special cases; allows for categorical variables
Generic feature selector with categorical features and pd.DataFrame support
Obviously some linting needed, but have generalized the sequential / exhaustive feature selector to allow for categorical and custom search strategies. The previous searches default to special cases now. Post-processing of search results still need implementing. |
from sklearn.exceptions import NotFittedError | ||
|
||
|
||
class Column(NamedTuple): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is currently no type checking implemented in mlxtend, but maybe one day, so why not being proactive (or in other words, thanks for adhering to potential future best practices :)).
|
||
strategy = min_max(X, | ||
max_features=4, | ||
fixed_features=[2,3]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some pep8 issues, but we can fix them later, no worries.
The type checking is more just a declaration of what a strategy is at this point.
Will work on PEP8 issues and also will try to have EFS and SFS use the generic one. I think EFS is straightforward, but SFS I have to understand the "floating" logic a little better -- I gather this is just "search both forward and backward" but somehow in self.subsets_ we only keep a given model of a fixed size -- is this meant to be the best model of a fixed size? Or maybe I still have the logic wrong.
________________________________
From: Sebastian Raschka ***@***.***>
Sent: Tuesday, October 5, 2021 6:20 PM
To: rasbt/mlxtend ***@***.***>
Cc: Jonathan Taylor ***@***.***>; Mention ***@***.***>
Subject: Re: [rasbt/mlxtend] WIP: having exhaustive feature selector take a custom iterator as input (#834)
@rasbt commented on this pull request.
________________________________
In mlxtend/feature_selection/columns.py<#834 (comment)>:
@@ -0,0 +1,271 @@
+
+from typing import NamedTuple, Any
+from copy import copy
+
+import numpy as np
+from sklearn.base import clone
+from sklearn.preprocessing import (OneHotEncoder,
+ OrdinalEncoder)
+from sklearn.utils.validation import check_is_fitted
+from sklearn.exceptions import NotFittedError
+
+
+class Column(NamedTuple):
There is currently no type checking implemented in mlxtend, but maybe one day, so why not being proactive (or in other words, thanks for adhering to potential future best practices :)).
________________________________
In mlxtend/feature_selection/tests/test_generic_selector.py<#834 (comment)>:
+
+have_pandas = True
+try:
+ import pandas as pd
+except ImportError:
+ have_pandas = False
+
+def test_exhaustive_selector():
+
+ n, p = 100, 5
+ X = np.random.standard_normal((n, p))
+ Y = np.random.standard_normal(n)
+
+ strategy = min_max(X,
+ max_features=4,
+ fixed_features=[2,3])
There are some pep8 issues, but we can fix them later, no worries.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#834 (review)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AACTM23LRNTCT6Z7H3MLC6LUFOP5XANCNFSM47DDCTAA>.
Triage notifications on the go with GitHub Mobile for iOS<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
Thanks a lot for the PR, this is very exciting! Big picture-wise, there are a few thoughts.
it only supports the standard forward and backward. I assume that 'both' means it runs forward first and finds the best set. Then it runs backward (independently) to find the best set. Then, the best set is determined by comparing the results from forward and backward? This is actually a neat addition.
Amazing work, though. What you put together here is really exciting and impressive! |
Oh, I should have described this a bit better in the docs. In the floating forward variant, it allows you to delete a previously selected feature if it improves performance. In the floating backward variant, it allows you to add back a previously deleted feature. Maybe it is easier to explain with a toy example. Say we have a feature set [0, 1, 3, 4, 5] and we do sequential "floating" forward selection. Round 1Starting with the empty set [], we check all features and find that feature 1 is best:
Round 2Starting with the set [1], we check all features and find that feature 4 is best:
Round 3
Round 4
Note that we would not have selected [1, 2, 3] with regular forward selection. With regular forward selection, we would have had [ ] -> [1] -> [1, 4] -> [1, 2, 4] Or in other words, with regular forward selection, any feature set of length 3 would have included the value 4. However, in combination with other features, 4 may not be useful. So, in practice, the floating forward variant allows us to test a few more combinations than the regular forward variant, without being as expensive as the exhaustive feature selection of course. The floating backward variant works similar to the floating forward variant, except we are allowed to add a previously removed feature back. |
So it seems floating is as I thought. This is "both" - the stepwise will look to add or drop columns from current set. This is what "both" in R's step is.
My issue with the logic was just somehow how you tracked the progress within self.subsets_ as there is no longer necessarily a single model of size 3, say. The generic one just stores every state visited.
Get Outlook for Android<https://aka.ms/ghei36>
…________________________________
From: Sebastian Raschka ***@***.***>
Sent: Tuesday, October 5, 2021 6:49:56 PM
To: rasbt/mlxtend ***@***.***>
Cc: Jonathan Taylor ***@***.***>; Mention ***@***.***>
Subject: Re: [rasbt/mlxtend] WIP: having exhaustive feature selector take a custom iterator as input (#834)
I think EFS is straightforward, but SFS I have to understand the "floating" logic a little better -- I gather this is just "search both forward and backward" but somehow in self.subsets_ we only keep a given model of a fixed size -- is this meant to be the best model of a fixed size? Or maybe I still have the logic wrong.
Oh, I should have described this a bit better in the docs. In the floating forward variant, it allows you to delete a previously selected feature if it improves performance. In the floating backward variant, it allows you to add back a previously deleted feature.
Maybe it is easier to explain with a toy example.
Say we have a feature set [0, 1, 3, 4, 5] and we do sequential "floating" forward selection.
Round 1
Starting with the empty set [], we check all features and find that feature 1 is best:
* Select feature 1 to be added to [ ]
* Now we have the optional floating step where we can optionally delete a previously selected feature. Here, this does not apply because we only have the currently selected feature.
* Return feature set [1]
Round 2
Starting with the set [1], we check all features and find that feature 4 is best:
* Select feature 4 to be added to [1]
* Optional floating step. Here, we can try removing 1 but this can be skipped, because it would be the same as selecting the 4 in the first place in the previous round. We can try removing it for simplicity though.
* Return feature set [1, 4]
Round 3
* Select feature 2 to be added to [1, 4].
* Optional floating step. Here, we can try removing 1 or 4. Removing 4 can technically be skipped, because it would be the same as selecting the 2 in the first place in the previous round. We can try removing it for simplicity though. I.e., we try [1, 2], [2, 4].
* return [1, 2, 4] (assuming it is better than [1, 2], [2, 4])
Round 4
* Select feature 3 to be added to [1, 2, 4].
* Optional floating step. Here, we can try removing 1, 2, or 4. We try [1, 2, 3], [1, 3, 4], [2, 3, 4].
* return [1, 2, 3] (assuming it is better than [1, 3, 4], [2, 3, 4], [1, 2, 3, 4])
Note that we would not have selected [1, 2, 3] with regular forward selection. With regular forward selection, we would have had
[ ] -> [1] -> [1, 4] -> [1, 2, 4]
Or in other words, with regular forward selection, any feature set of length 3 would have included the value 4. However, in combination with other features, 4 may not be useful.
So, in practice, the floating forward variant allows us to test a few more combinations than the regular forward variant, without being as expensive as the exhaustive feature selection of course.
The floating backward variant works similar to the floating forward variant, except we are allowed to add a previously removed feature back.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#834 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AACTM22FMPR6UAAKBHH64ZTUFOTMJANCNFSM47DDCTAA>.
Triage notifications on the go with GitHub Mobile for iOS<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
Yeah, I think with the new version it will be way easier to track. Regarding "both" there is a difference between floating-forward and floating-backward though. They can both give you different end results. I think "both" is currently doing floating-forward? |
… have a model with no features (design==0)
Thanks for the updates. Actually, some of the failing tests are related to sklearn 1.0. Due to bug fixes or changes, their have been very slight changes in some of the accuracy values, which will cause values when comparing the results on 3 positions after the decimal point. I am currently looking into this and will update the tests |
Got it. I am pretty comfortable with the state of the code, but I think more tests are needed on my end.
It is possible to rewrite the two current estimators in this form. Do you think that's a high priority?
…________________________________
From: Sebastian Raschka ***@***.***>
Sent: Tuesday, November 2, 2021 11:42 AM
To: rasbt/mlxtend ***@***.***>
Cc: Jonathan Taylor ***@***.***>; Mention ***@***.***>
Subject: Re: [rasbt/mlxtend] WIP: having exhaustive feature selector take a custom iterator as input (#834)
Thanks for the updates. Actually, some of the failing tests are related to sklearn 1.0. Due to bug fixes or changes, their have been very slight changes in some of the accuracy values, which will cause values when comparing the results on 3 positions after the decimal point. I am currently looking into this and will update the tests
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#834 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AACTM23DC62Z5USSFS7MGM3UKA5I7ANCNFSM47DDCTAA>.
Triage notifications on the go with GitHub Mobile for iOS<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
I am not sure how many people use the floating versions. One option could be to deprecate/remove them. We could maybe add a module Here it is compared to the floating version: ![Screen Shot 2021-11-02 at 1 52 06 PM](https://user- So I would say having SFFS (floating forward) and SFBS (floating backward) is useful. But I can understand if it is too much work in terms of rewriting all the code so having only one of the two is probably sufficient (given that performance is very close anyway) |
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Finally got around fixing the CI. Then, I was trying to rebase this PR but since it is not in a feature branch but in the master branch itself, I couldn't figure out how. I manually applied all the changes to this PR, and it appears I created quite a mess |
Actually not sure why the workflow CI doesn't run. It worked fine in another PR that I just closed a few minutes ago. Maybe it would be easiest to just close this PR, and make a new PR with the 4 files you had initially submitted, i.e,
if you could do this in a separate feature branch instead of master, I think this would make things easier. Let me know if I can help with anything |
Code of Conduct
A working candidate for #833
Description
Related issues or pull requests
Pull Request Checklist
./docs/sources/CHANGELOG.md
file (if applicable)./mlxtend/*/tests
directories (if applicable)mlxtend/docs/sources/
(if applicable)PYTHONPATH='.' pytest ./mlxtend -sv
and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g.,PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv
)flake8 ./mlxtend