
Fixes #956 Add features_group parameter to ExhaustiveFeatureSelector [WIP] #957

Merged · 36 commits · Aug 10, 2022
Conversation

@NimaSarajpoor (Contributor) commented Jul 8, 2022

Description

This PR addresses issue #956. The idea is to allow the user to specify groups of features that are treated as a unit (selected or dropped together) throughout the feature selection process.

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modified documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./mlxtend -sv and made sure that all current unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

What I have done so far:

  • Added parameter features_group to mlxtend/feature_selection/exhaustive_feature_selector.py
  • All EXISTING tests for feature_selection pass

What needs to be done:

  • Review the changes and see whether the approach taken is reasonable
  • Write a test function for the new feature
  • Add this parameter to sequential_feature_selector.py

Please let me know what you think.


Side note:
I added some comments wherever I felt something should be pointed out. We should remove these comments before merging.

@NimaSarajpoor (Contributor, Author) commented Jul 16, 2022

Recently, I found another PR (#651) that adds fixed_features to ExhaustiveFeatureSelector. So, I think we are going to have a slight problem here.

Let's say our features are [1, 2, 3, 4, 5] and the user wants to:

  • consider feature_groups as [[1], [2], [3], [4, 5]]
  • And, consider fixed features: [1, 4]

So...how should we deal with this? We can:

  • throw a warning and consider [1, 4, 5] as fixed features?
  • Or, throw an error and let the user know that there is a conflict between these two parameters?
  • Or, ???

@rasbt (Owner) left a comment

This looks actually pretty good to me. If this works, this is a simple and elegant solution. I guess the next step would be unit tests. Would you like any help with it?

@rasbt (Owner) commented Jul 17, 2022

Thanks for this PR btw! I think having feature_group support would be an awesome feature (no pun intended) for so many people!

EDIT: We can also make this PR specific to the Exhaustive Feature Selector and revisit the sequential one at a later time. I think it will require more work for sure.

@rasbt (Owner) commented Jul 17, 2022

Recently, I found another PR (#651) that adds fixed_features to ExhaustiveFeatureSelector. So, I think we are going to have a slight problem here.
Let's say our features are [1, 2, 3, 4, 5] and the user wants to:
consider feature_groups as [[1], [2], [3], [4, 5]]
And, consider fixed features: [1, 4]
So...how should we deal with this? We can:
throw a warning and consider [1, 4, 5] as fixed features?
Or, throw an error and let the user know that there is a conflict between these two parameters?
Or, ???

Oh, good call. To be honest, I think it might be best to check for conflicts and raise an error so the user can fix it. It's probably better than having a silent behavior that is not clear to the user.

So in the case above, maybe an error message would make sense like

raise ValueError("Feature 4 was specified as a fixed feature but feature 4 has other features in its feature group.")

And in the docstring, we can maybe say something like

If both fixed_features and feature_groups are used, all group members of a fixed feature need to be specified as fixed features as well.

What do you think?

@NimaSarajpoor (Contributor, Author)

I think it might be best to check for conflicts and raise an error so the user can fix it. It's probably better than having a silent behavior that is not clear to the user.

You are right... it would be safer. We put a small extra load on users to resolve such conflicts themselves, but in return they get a better understanding of what's going on :)

@NimaSarajpoor (Contributor, Author)

Thanks for this PR btw! I think having feature_group support would be an awesome feature (no pun intended) for so many people!

Thanks for creating this opportunity :) I enjoy doing this :)

We can also make this PR specific to the Exhaustive Feature Selector and revisit the sequential one at a later time. I think it will require more work for sure.

Sure.

I guess the next step would be unit tests. Would you like any help with it?

I will probably need some help 😄 For now, I am going to take a look at the existing test function(s) and see how things are. I will try to do it myself. If I run into any issues, I will let you know.

@NimaSarajpoor (Contributor, Author) commented Jul 17, 2022

@rasbt
Also, during the weekend I do not have access to my main PC, which holds my local repo and my in-progress work. I use another PC on weekends and can only review my code, so you may sometimes notice that I review my own code and catch some issues (so I do not forget them later when I push the next set of commits).

@NimaSarajpoor (Contributor, Author) left a comment

I will resolve/address these comments in my next push.

@NimaSarajpoor (Contributor, Author) commented Jul 18, 2022

We also need to add two notes to the docstring:

  • for min_features / max_features: When the parameter feature_groups is provided (and not None), these counts refer to the number of feature groups rather than individual features. For instance, when feature_groups = [[0], [1], [2, 3], [4]], max_features can be at most 4, as there are only four groups.

  • Although two or more individual features may be treated as one group throughout the feature-selection process, this does not mean the individual features of that group have the same impact on the outcome. For instance, in linear regression, the coefficients of features 2 and 3 can differ even if they are grouped together in feature_groups.

@rasbt (Owner) commented Jul 19, 2022

I like the suggestion! Slightly more concise version for the first:

If parameter feature_groups is not None, the number of features is equal to the number of feature groups (len(feature_groups)). For example, if feature_groups=[[0], [1], [2, 3], [4]], the max_features value cannot exceed 4.

@NimaSarajpoor (Contributor, Author)

I like the suggestion! Slightly more concise version for the first:

Cool! Thanks for revising the note. I will add them to the docstring.

@pep8speaks commented Jul 20, 2022

Hello @NimaSarajpoor! Thanks for updating this PR.

Line 327:89: E501 line too long (94 > 88 characters)
Line 361:89: E501 line too long (93 > 88 characters)
Line 390:89: E501 line too long (89 > 88 characters)

Comment last updated at 2022-08-09 01:29:49 UTC

@NimaSarajpoor (Contributor, Author) commented Jul 20, 2022

@rasbt
UPDATE

  • I resolved the issues previously discussed (except one; please see the note below)
  • I added a new test function to test the new parameter feature_groups
  • All tests are passing :)

Notes:

  • I have not added the new parameter fixed_features (PR fixed_features for ExhaustiveFeatureSelector #579 #651) yet. Should I do it here in this branch, or should I work on it in its own dedicated branch?
  • I did not add a check for the validity of the input parameter feature_groups. I think I need to check whether the input is valid and provide an error message otherwise.

@rasbt changed the title from "Fixes #956 Add features_group parameter to features selection [WIP]" to "Fixes #956 Add features_group parameter to ExhaustiveFeatureSelector [WIP]" on Jul 22, 2022
@rasbt (Owner) commented Jul 22, 2022

Thanks so much for the update.

  1. If it is not too much hassle, maybe add the fixed_features also here. We could have a separate PR, but that would probably add extra work for you since some of the steps are basically repeated.

  2. I can take care of the documentation and the Changelog. I think these are the most "boring" things usually, and you are already doing so much.

For the check, maybe the following ones are good:

a) a check that the feature groups contain all feature indices

if _merge_lists(feature_groups) != set(range(X.shape[1])):
    raise ValueError(
        "`feature_groups` must contain all features within `range(X.shape[1])`"
    )

b) check that feature groups are not empty:

for fg in feature_groups:
    if not fg:
        raise ValueError("A feature group can't be empty.")

All tests are passing :)

Wohooo!

Ohh, check (a) reminds me that the ExhaustiveFeatureSelector actually supports pandas DataFrames (http://rasbt.github.io/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector/#example-7-working-with-pandas-dataframes). I am not sure this would work with the current feature_groups implementation.

Option 1 (ideal):

I suppose what we want in this case is that the user can specify the feature groups by column names. For example, [['Sepal length', 'Petal length'], ['Sepal width'], ['Petal width']].

Option 2 (ok):

We could also keep the feature group selection integer-based, e.g., [[1, 2], [3], [4]], and select DataFrame columns by index for now. We can then add string support later on.

Option 3:

For now, we can just add a check and say that feature_groups are currently not supported for pandas DataFrames. We can then add Option 1 or 2 in the future in another PR.

What do you think, is option 1 feasible at this point or do you prefer Option 2/3 and to revisit this later?

@NimaSarajpoor (Contributor, Author)

maybe add the fixed_features also here

Sure! Will do so.

2. I can take care of the documentation and the Changelog. I think these are the most "boring" things usually, and you are already doing so much.

Thank you! :)

For the check, maybe the following ones are good:

a) a check that the feature groups contain all feature indices

Just for the record, I think we "can" allow users to provide not all but just some of the features in the feature_groups parameter. For instance, I THINK that even if the parameter value is [[0], [1, 2], [3]], the exhaustive feature selection will be performed no matter how many features the data actually has. Having said that, if a user wants to do that, they should simply pass the chunk of their data that contains only those features. So, I think error message (a) is good, as it prevents users from making accidental mistakes.

b) check that feature groups are not empty:

for fg in feature_groups:
    if not fg:
        raise ValueError("A feature group can't be empty.")

Cool

Ohh, check (a) reminds me that the ExhaustiveFeatureSelector actually supports pandas DataFrames (http://rasbt.github.io/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector/#example-7-working-with-pandas-dataframes). I am not sure this would work with the current feature_groups implementation.

Option 1 (ideal):

I suppose what we want in this case is that the user can specify the feature groups by column names. For example, [['Sepal length', 'Petal length'], ['Sepal width'], ['Petal width']].

This is exactly what I thought after pushing the last set of commits! I am planning to work on this. And then, I will take care of fixed_features :)

@NimaSarajpoor (Contributor, Author)

@rasbt
I think everything should be good now. Something in my code was bothering me, and I fixed (cleaned) it. So, I am going to stop pushing now :)

@rasbt (Owner) commented Jul 27, 2022

Thanks for all the updates! On the one hand, I think the custom_feature_names parameter is very useful. However, it's from before we added pandas DataFrame support. I would say that it has become kind of redundant, and we can remove it for the sake of simplicity. Over time, things in the mlxtend library are becoming more and more complicated, so I guess it's fair that we remove one parameter in favor of adding the new feature_groups parameter 😉.

To keep it manageable right now, and to not overcomplicate things when we add feature_groups, I would say let's drop the custom_feature_names right away (versus deprecating it). Let me take care of it in this PR before I continue with the review. I can add a documentation example to illustrate how users can achieve the same thing with pandas:

from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.datasets import load_iris
import pandas as pd

iris_data = load_iris()
X, y = iris_data.data, iris_data.target

df_X = pd.DataFrame(X, columns=["Sepal length", "Sepal width", "Petal length", "Petal width"])

knn = KNeighborsClassifier(n_neighbors=3)

efs1 = EFS(knn,
           min_features=1,
           max_features=4,
           scoring='accuracy',
           print_progress=True,
           cv=5)

efs1 = efs1.fit(df_X, y)

print('Best accuracy score: %.2f' % efs1.best_score_)
print('Best subset (indices):', efs1.best_idx_)
print('Best subset (corresponding names):', efs1.best_feature_names_)

Output:

Features: 15/15
Best accuracy score: 0.97
Best subset (indices): (0, 2, 3)
Best subset (corresponding names): ('Sepal length', 'Petal length', 'Petal width')

@NimaSarajpoor (Contributor, Author)

To keep it manageable right now, and to not overcomplicate things when we add feature_groups, I would say let's drop the custom_feature_names right away (versus deprecating it). Let me take care of it in this PR before I continue with the review.

Sure. Removing the feature makes sense. Thanks for your effort.

I can add a documentation example to illustrate how users can achieve the same thing with pandas:

Sounds good :)


@rasbt (Owner) commented Jul 27, 2022

Alright, just removed it to simplify the code. I may have to call it a day here, but I will be back with the code reviews in the next couple of days!

@NimaSarajpoor (Contributor, Author)

@rasbt
Thanks for the changes! I am going to go through my code. If I notice anything that should be revised, I will comment on it and then wait to see your opinion.

Also, if you have an idea about what I should do as the next step in this PR, please let me know.

@NimaSarajpoor (Contributor, Author) left a comment

@rasbt
I reviewed the code and left some comments. Please feel free to let me know if you disagree with some/all of them, and also please add your comments if you have any.

@rasbt (Owner) commented Aug 6, 2022

Thanks for the comments! I would say that other than those you mentioned, it looks really good. I wouldn't worry about any additional changes to the code.

@NimaSarajpoor (Contributor, Author)

Thanks for the comments! I would say that other than those you mentioned, it looks really good. I wouldn't worry about any additional changes to the code.

Cool! I will take care of the comments and push the changes.

@NimaSarajpoor (Contributor, Author)

@rasbt
I took care of the issues. I think things should be good now :)

@rasbt (Owner) commented Aug 10, 2022

Wow, thanks so much! Looks great now and will merge!

@rasbt merged commit 2ed7ecf into rasbt:master on Aug 10, 2022
@NimaSarajpoor (Contributor, Author)

@rasbt
Thanks a lot for your guidance throughout this journey! I enjoyed working on this! So, thank you!

@rasbt (Owner) commented Aug 11, 2022

Likewise, thanks so much for the high-quality contribution @NimaSarajpoor. This is well appreciated! Actually, I may make a small release in the following days if time permits!
