
Fixes 502 Add feature_groups parameter to SequentialFeatureSelection #965

Merged: 93 commits merged into rasbt:master on Sep 14, 2022

Conversation

@NimaSarajpoor (Contributor) commented Aug 27, 2022

This PR is created to add support for feature_groups in SequentialFeatureSelection, as discussed in #502.

Before developing the new feature, I decided to clean up the code first, as I noticed it had some nested-if blocks. The module sequential_feature_selector.py is now about 70 lines shorter, and I think it is a little more readable.

I will add the feature_groups parameter in the near future.
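For illustration, here is a minimal sketch of how the new parameter is meant to be used (the exact final signature may differ; the grouping below is just an example):

# Sketch only -- illustrating the intended feature_groups usage; details may change.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=4)

# Features 2 and 3 form one group, so they are always added (or skipped) together.
sfs = SFS(knn,
          k_features=2,  # with feature_groups, this appears to count groups (cf. the tests below)
          forward=True,
          feature_groups=[[0], [1], [2, 3]])
sfs = sfs.fit(X, y)
print(sfs.k_feature_idx_)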

@NimaSarajpoor (Contributor Author)

@rasbt
Okay... I think we are now in a good position to add the feature_groups parameter. In the meantime, if you have any suggestions, please let me know.

@NimaSarajpoor (Contributor Author) commented Sep 7, 2022

@rasbt
Could you please let me know what you think about this?

What should the program return when KeyboardInterrupt is raised?
Let's consider sequential_feature_selector.py for now, and assume the user is interested in forward selection with k_features set to (3, 5). What would we want the user to see when the program is interrupted right after the first or second round of feature selection? Note that in such a case, the keys in the dictionary self.subsets_ are not yet in {3, 4, 5}. So, I think it might be better to avoid post-processing in that case. That is why I return self right after the exception block in the sequential case. Also, self.fitted should remain False in such a case (what do you think?).
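Roughly, what I did in the sequential case looks like this (simplified sketch, not the exact code):

# simplified sketch of the handling in sequential_feature_selector.py
self.interrupted_ = False
try:
    ...  # the sequential selection rounds
except KeyboardInterrupt:
    self.interrupted_ = True
    sys.stderr.write("\nSTOPPING EARLY DUE TO KEYBOARD INTERRUPT...")

if self.interrupted_:
    # self.subsets_ may not yet contain keys for all of {3, 4, 5},
    # so skip post-processing and leave self.fitted as False
    return self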

Now, if we take a look at the exhaustive module, we can see that it does not return self after the exception block. If I do:

# in exhaustive_feature_selector.py

except KeyboardInterrupt:
    self.interrupted_ = True
    sys.stderr.write("\nSTOPPING EARLY DUE TO KEYBOARD INTERRUPT...")
    return self  # NEW

then I get the following error when I run pytest:

# in tests/test_exhaustive_feature_selector.py

# in function `test_check_pandas_dataframe_fit`

        efs1._TESTING_INTERRUPT_MODE = True
        efs1 = efs1.fit(df, y)
>       assert efs1.best_idx_ == (0, 1), efs1.best_idx_
E       AssertionError: None
E       assert None == (0, 1)

This means we are testing efs.best_idx_ in that case. However, I think we should just return self inside the exception block (or right after it) and avoid post-processing; we should not test best_idx_. I mean, we can test the last feature indices or the dictionary subsets_ to test _TESTING_INTERRUPT_MODE. We should not give the user self.best_idx_. If the user is interested in the best-so-far score or indices, they should be able to get it from the dictionary .subsets_. What do you think?
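For example, a user could recover the best-so-far result from the dictionary themselves, roughly like this (sketch; using the avg_score / feature_idx fields of .subsets_):

# sketch: recovering the best-so-far result from .subsets_ after an interrupt
best_key = max(sfs.subsets_, key=lambda k: sfs.subsets_[k]["avg_score"])
best_score = sfs.subsets_[best_key]["avg_score"]
best_idx = sfs.subsets_[best_key]["feature_idx"]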

@rasbt (Owner) commented Sep 7, 2022

Good points. I think the concern you have is that best_idx_ is not the "best" index (or None) at the point when the run was interrupted.

A counter argument for postprocessing, though, is that (particularly for the EFS) a run can take a very long time. Sometimes, users want to manually interrupt it due to time constraints but still use the results up to that point.

If we don't do the post-processing for them, the users would have to figure it out themselves. It's not a deal-breaker, but when it comes to convenience, I think in >80% of cases users would probably prefer it if an interrupted SFS/EFS could later be used the same way as a fully fitted one, without extra steps.

Or, maybe a middle ground would be this:

we do what you propose regarding

  1. returning self
  2. leaving self.fitted=False

but we add a clean-up method that users have to call (at their own responsibility) after interruption. What would be a good name for this?

a. cleanup
b. cleanup_from_interrupt
c. postprocess_from_interrupt
d. finalize_fit

I would say b) & c) are more clear, but if we go with a) or d), that's also something we can use internally (just the last steps refactored into a method).

What do you think?

@NimaSarajpoor (Contributor Author) commented Sep 7, 2022

> I think the concern you have is that best_idx_ is not the "best" index (or None) at the point when the run was interrupted.

Yes. That's correct.

> A counter argument for postprocessing, though, is that (particularly for the EFS) a run can take a very long time. Sometimes, users want to manually interrupt it due to time constraints but still use the results up to that point.
>
> If we don't do the post-processing for them, the users would have to figure it out themselves. It's not a deal-breaker, but when it comes to convenience, I think in >80% of cases users would probably prefer it if an interrupted SFS/EFS could later be used the same way as a fully fitted one, without extra steps.

Yes, I understand. This is to just give users something to work with.

> Or, maybe a middle ground would be this:
>
> we do what you propose regarding
>
> 1. returning self
> 2. leaving self.fitted=False
>
> but we add a clean-up method that users have to call (at their own responsibility) after interruption. What would be a good name for this?
>
> a. cleanup b. cleanup_from_interrupt c. postprocess_from_interrupt d. finalize_fit
>
> I would say b) & c) are more clear, but if we go with a) or d), that's also something we can use internally (just the last steps refactored into a method).
>
> What do you think?

I think this is a good approach. I think d should work. We may do something like this:

def finalize_fit(self, ...):
    # compute the best score and the corresponding feature subset
    ...

def fit(self, ...):
    ...
    # if the fit was interrupted by the keyboard:
    return self

    # otherwise:
    self.finalize_fit(...)
    return self

And then, in the docstring, we can explain that when the fit is interrupted via the keyboard, users can still get the best score and the corresponding subset of features by calling .finalize_fit at their own responsibility.
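Usage would then look roughly like this (sketch; assuming the method ends up being called finalize_fit and that the interrupted_ flag is set as discussed above):

# sketch of the intended usage after a manual interrupt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=4)

sfs = SFS(knn, k_features=(2, 3), forward=True)
sfs = sfs.fit(X, y)          # suppose the user hits Ctrl-C partway through;
                             # fit() catches it and returns the partial object

if sfs.interrupted_:
    sfs.finalize_fit()       # deliberate extra step, at the user's own responsibility
    print(sfs.k_feature_idx_, sfs.k_score_)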

Sounds good? (Is this what you had in mind?)

@rasbt (Owner) commented Sep 7, 2022

Yes, I think this is perfect. This way, people who, let's say, accidentally interrupt it are not misled, because it takes an extra deliberate step. But the step is not that inconvenient, so if you know what you are doing, it's a good, easy workaround!

@NimaSarajpoor (Contributor Author) commented Sep 9, 2022

@rasbt
There is still 1 test function that fails. Something weird is going on here...

Looking at the test function tests/test_exhaustive_feature_selector.py::test_knn_wo_cv, we can see:

expect = {
        0: {
            "feature_idx": (0, 1),
            "feature_names": ("0", "1"),
            "avg_score": 0.82666666666666666,
            "cv_scores": np.array([0.82666667]),
        },
       ....

And this test function passes. Now, if I run the same thing in Jupyter, I get a different result(!):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
efs1 = EFS(
    knn,
    min_features=2,
    max_features=3,
    scoring="accuracy",
    cv=0,
    print_progress=False,
)
efs1 = efs1.fit(X, y)
efs1.subsets_

And I now see that the value corresponding to expect[0]['avg_score'] is 0.8333333333333334. How is that possible? Note that this test function passes when I run pytest, but it should fail, right? (Or maybe I am missing something.)
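One way to sanity-check which number is right (sketch; assuming that with cv=0 the score is simply the estimator's accuracy on the training data):

# with cv=0 the EFS score for a subset should match fitting and scoring
# the estimator on the training data directly (assumption, see above)
knn_check = KNeighborsClassifier(n_neighbors=4)
knn_check.fit(X[:, [0, 1]], y)
print(knn_check.score(X[:, [0, 1]], y))  # compare with expect[0]["avg_score"]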


Now, please take a look at tests/test_sequential_feature_selector_feature_groups.py::test_knn_wo_cv_with_fixed_features_and_feature_groups_case1 (this is the only failure we are getting in the unit tests). The expected score for features (0, 1) is 0.8266..., but it is different from the score we are getting, which is 0.83.

Any thoughts?

@NimaSarajpoor (Contributor Author) commented Sep 9, 2022

@rasbt
Weird! I am getting one failure in unit testing when I run it in my local repo! (I explained it in my previous comment)

@NimaSarajpoor (Contributor Author) commented Sep 9, 2022

@rasbt
Update:

  • I cannot figure out the problem I had in my local repo. My goal was to push the commits so you could see the single failure I got locally. But apparently, things are fine here.
  • I refactored several functions and moved them into utilities.py.
  • I added some checks for k_features, fixed_features, and feature_groups.

I think there are three things we may want to do as the final steps (or leave them to someone else):

  • Add test for dataframe/string in sequential feature selection with feature_groups
  • Move some of the checks to __init__. Currently, I put most checks in .fit(), and I moved some checks from __init__ to .fit(). However, it would be better to put each check in the appropriate place. For instance, if a user provides both fixed_features and feature_groups in SFS(...), it is better to raise an error there if there is a feature in fixed_features that is not in feature_groups; however, this error is currently raised in .fit(). And if the user does not provide feature_groups, we should do another check in .fit() to see whether the data X contains the features in fixed_features. This is not an easy process :) For now, I just wanted to make sure we have checks for the params and that things are working (see the sketch after this list).
  • Add test functions with assert_raises(..) to test the error messages.
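For the fixed_features / feature_groups interaction mentioned above, the check is roughly of this shape (illustrative sketch; the function name is hypothetical and not the actual helper in the PR):

# illustrative sketch -- not the actual helper used in this PR
def _check_fixed_features_in_feature_groups(fixed_features, feature_groups):
    in_groups = set()
    for group in feature_groups:
        in_groups.update(group)
    missing = set(fixed_features) - in_groups
    if missing:
        raise ValueError(
            "The following fixed_features do not appear in any "
            f"feature group: {sorted(missing)}"
        )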

@rasbt (Owner) commented Sep 11, 2022

Thanks for the updates!

> Add test for dataframe/string in sequential feature selection with feature_groups

I can take care of this one.

> Move some of the checks to ...

Hm, yeah, that's not an easy thing. In general, I think scikit-learn does all checks in fit(); I forget the details, but there was a rationale behind it. However, I agree with you that some things are better checked in __init__, like when two functionalities interact/depend on each other, as fixed_features and feature_groups do. There could be an extra check in fit() to verify that the data indeed contains the specified features. But on the other hand, I feel like this can also be left to the user's responsibility, and we don't have to check every detail :P

@rasbt (Owner) commented Sep 11, 2022

Just added the unit tests. I think there's one unexpected behavior (second from the bottom) that doesn't look right (haha, or am I overlooking something?)

def test_check_pandas_dataframe_with_feature_groups_and_fixed_features():
    iris = load_iris()
    X = iris.data
    y = iris.target
    lr = SoftmaxRegression(random_seed=123)

    df = pd.DataFrame(
        X, columns=["sepal length", "sepal width", "petal length", "petal width"]
    )
    sfs1 = SFS(
        lr,
        k_features=2,
        forward=True,
        floating=False,
        scoring="accuracy",
        feature_groups=[
            ["petal length", "petal width"],
            ["sepal length"],
            ["sepal width"],
        ],
        fixed_features=[ "sepal length", "petal length", "petal width"],
        cv=0,
        verbose=0,
        n_jobs=1,
    )

    sfs1 = sfs1.fit(df, y)
    assert sfs1.k_feature_names_ == (
        'sepal length', 'petal length', 'petal width',
    ), sfs1.k_feature_names_

This returns ('sepal width', 'petal length', 'petal width')

But I think that's not possible. I.e., it can't return 'sepal width' because it's not in fixed_features. For the integer index version, this seems to work correctly:

def test_check_feature_groups():
    iris = load_iris()
    X = iris.data
    y = iris.target
    lr = SoftmaxRegression(random_seed=123)
    sfs1 = SFS(
        lr,
        k_features=2,
        forward=True,
        floating=False,
        scoring="accuracy",
        feature_groups=[[2, 3], [0], [1]],
        fixed_features=[0, 2, 3],
        cv=0,
        verbose=0,
        n_jobs=1,
    )

    sfs1 = sfs1.fit(X, y)
    assert sfs1.k_feature_idx_ == (0, 2, 3), sfs1.k_feature_idx_

Or am I confused now? 😅
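For reference, a quick cross-check (sketch, using the df from the first test above) shows the two tests specify the same configuration once the column names are mapped to positions:

# map the string columns to their integer positions (sketch)
cols = ["sepal length", "petal length", "petal width"]
print([df.columns.get_loc(c) for c in cols])  # [0, 2, 3], same as fixed_features above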

@NimaSarajpoor (Contributor Author) commented Sep 11, 2022

@rasbt
Thanks for the help. Yeah... it seems something is off...I will investigate it. (It is good that you caught it! Thanks!!)

@NimaSarajpoor (Contributor Author)

@rasbt
It seems the error you mentioned is now resolved. Btw, I don't think I did anything specific to that test function. I am now going to resolve the one that still fails.

@NimaSarajpoor (Contributor Author) commented Sep 12, 2022

@rasbt
Updates

  • The decimal precision was set to 2 in the Exhaustive tests and 3 in the Sequential tests. I have now set both to 2.
  • Previously, with the decimal set to 3, I found some discrepancies (e.g., see the failed test). I do not know how the expected values were calculated in the first place, so maybe the expected values in a few cases should be revised. However, I was not sure, and I thought maybe the expected values were calculated in a particular way. So, I just set the decimal to 2 for testing both modules (see the sketch after this list).
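To make the effect of the looser tolerance concrete (illustrative only; the project's test helpers may compare slightly differently):

# the observed gap is ~0.0067 (0.8333... vs. 0.8266...), so:
round(0.8266666666, 3) == round(0.8333333333, 3)  # False -> fails at decimal=3
round(0.8266666666, 2) == round(0.8333333333, 2)  # True  -> passes at decimal=2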

I think things should be good now :)

@rasbt (Owner) commented Sep 14, 2022

Thanks so much, this looks reasonable to me! I will see if there's anything to add on the documentation front but other than that, I think this is good to merge. I am also planning to make a release this weekend so that people can actually start using the awesome features you've added :)

@NimaSarajpoor (Contributor Author)

Cool :) Thanks for your help and support throughout this journey. It has been fun :)

@rasbt rasbt merged commit 1f46260 into rasbt:master Sep 14, 2022