What Is Approach To Select The Best Subset Of Features? #238

Closed
sashml opened this issue Sep 8, 2017 · 4 comments
@sashml

sashml commented Sep 8, 2017

Hi there,

mlxtend offers a good feature selection approach via SFS(k_features=(5,10)). I have a few questions about it:

  1. When I put k_features=(5,7), I was thinking that only combinations of features 5,6,7 have to be considered during feature evaluation procedure.
    If the statement above is correct, why did I see feature estimates for the whole range from 1 to 7?
    If not, how can I achieve the flow I described?

  2. How can I determine the best subset of features to be passed to k_features model parameter?
    The range (10, 20) is just my first guess; for this particular dataset, the range (20, 25) might actually be the best choice. Is there any mechanism to detect that?

Thank you!

@rasbt
Owner

rasbt commented Sep 8, 2017

> When I put k_features=(5,7), I was thinking that only combinations of features 5,6,7 have to be considered during feature evaluation procedure.

If you are interested in different combinations of the features 5, 6, 7, I recommend using the ExhaustiveFeatureSelector. In the SequentialFeatureSelector, k_features means the number of features to select. If you set k_features=(5,10), it will return the best-performing subset that is between 5 and 10 features long.
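To illustrate why scores for subset sizes 1 through 7 can show up even with k_features=(5,7), here is a minimal pure-Python sketch of greedy forward selection. It is not mlxtend's implementation; `forward_selection`, `toy_score`, and `WEIGHTS` are hypothetical names with made-up numbers. The point is only that a forward search has to build the smaller subsets first, and the range restricts which sizes are eligible as the final answer.

```python
def forward_selection(n_features, score, k_range):
    """Toy greedy forward selection (illustrative, not mlxtend's code).

    Returns the best subset whose size lies in k_range, plus the list of
    subset sizes that were evaluated along the way.
    """
    k_min, k_max = k_range
    selected = ()
    history = {}  # subset size -> (subset, score)
    while len(selected) < k_max:
        remaining = [f for f in range(n_features) if f not in selected]
        # Try adding each remaining feature and keep the best candidate.
        candidates = [tuple(sorted(selected + (f,))) for f in remaining]
        selected = max(candidates, key=score)
        history[len(selected)] = (selected, score(selected))
    # Only sizes inside [k_min, k_max] are eligible for the final answer.
    eligible = {k: v for k, v in history.items() if k_min <= k <= k_max}
    best_k = max(eligible, key=lambda k: eligible[k][1])
    return eligible[best_k][0], sorted(history)


# Made-up per-feature usefulness values for 10 hypothetical features.
WEIGHTS = [0.9, 0.1, 0.8, 0.05, 0.7, 0.6, 0.02, 0.5, 0.4, 0.3]


def toy_score(subset):
    # Reward useful features, penalize subset size (purely illustrative).
    return sum(WEIGHTS[i] for i in subset) - 0.35 * len(subset)


best_subset, evaluated_sizes = forward_selection(len(WEIGHTS), toy_score, (5, 7))
print(evaluated_sizes)  # sizes visited on the way up to k_max
print(best_subset)      # best subset among the eligible sizes
```

With backward selection the same reasoning applies in the other direction: the search starts from all features and shrinks toward the lower bound.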

> How can I determine the best subset of features to be passed to k_features model parameter?

In this case, I recommend running backward selection all the way down to 1 feature (or running forward selection all the way up to the m features in your dataset). Then you can look at the performance of each subset size (e.g., via the plotting function mentioned in the docs) and decide.

@rasbt rasbt added the Question label Sep 8, 2017
@sashml
Author

sashml commented Sep 8, 2017

Thank you!

Regarding #2, I was hoping that I had missed something and would be able to avoid such complex work :)

@rasbt
Owner

rasbt commented Sep 8, 2017

One common way to decide which feature subset to choose (if the size of the feature subset doesn't matter) is to choose the smallest feature subset whose performance falls within 1 standard error of the best-performing one. I guess this could be easily automated and added via k_features='auto' or so.
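As a minimal sketch of that rule in plain Python: the function name `parsimonious_size` and the cross-validation scores below are made up for illustration, and using the standard error of the best subset as the tolerance is an assumption about how the rule is applied.

```python
from math import sqrt
from statistics import mean, stdev


def parsimonious_size(cv_scores_by_size):
    """Smallest subset size whose mean CV score is within one standard
    error (of the best subset) of the best mean CV score."""
    stats = {}
    for k, scores in cv_scores_by_size.items():
        # (mean score, standard error of the mean) per subset size
        stats[k] = (mean(scores), stdev(scores) / sqrt(len(scores)))
    best_k = max(stats, key=lambda k: stats[k][0])
    best_mean, best_se = stats[best_k]
    threshold = best_mean - best_se
    return min(k for k, (m, _) in stats.items() if m >= threshold)


# Hypothetical 5-fold CV accuracies per subset size (not real results).
cv_scores = {
    1: [0.70, 0.72, 0.68, 0.71, 0.69],
    2: [0.85, 0.88, 0.84, 0.86, 0.87],
    3: [0.90, 0.94, 0.89, 0.92, 0.93],
    4: [0.91, 0.94, 0.90, 0.92, 0.93],
    5: [0.90, 0.92, 0.91, 0.93, 0.89],
}

chosen = parsimonious_size(cv_scores)
print(chosen)
```

In this made-up data, size 4 has the highest mean score, but size 3 is close enough to be preferred because it uses fewer features.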

@rasbt
Owner

rasbt commented Sep 9, 2017

I added some capabilities for 'best' and most 'parsimonious' feature selection via #240. Hope that's useful!

@rasbt rasbt closed this as completed Sep 9, 2017