
Fixes 502 Add feature_groups parameter to SequentialFeatureSelection #965

Merged: 93 commits merged into rasbt:master on Sep 14, 2022

Conversation

@NimaSarajpoor (Contributor) commented Aug 27, 2022

This PR is created to add support for feature_groups in SequentialFeatureSelection, as discussed in #502.

Before developing the new feature, I decided to clean up the code first, as I noticed it had some nested-if blocks. The module sequential_feature_selector.py is now about 70 lines shorter, and I think it is a little more readable.

I will add the feature_groups parameter in the near future.
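For illustration, here is a minimal sketch of how the new parameter is meant to be used (the exact final signature may differ; the grouping below is just an example):

# Sketch only -- illustrating the intended feature_groups usage; details may change.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=4)

# Features 2 and 3 form one group, so they are always added (or skipped) together.
sfs = SFS(knn,
          k_features=2,  # with feature_groups, this appears to count groups (cf. the tests below)
          forward=True,
          feature_groups=[[0], [1], [2, 3]])
sfs = sfs.fit(X, y)
print(sfs.k_feature_idx_)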

@NimaSarajpoor (Contributor Author)

@rasbt
Okay... I think we are now in a good position to add the feature_groups parameter. In the meantime, if you have any suggestions, please let me know.

@NimaSarajpoor (Contributor Author) commented Sep 7, 2022

@rasbt
Could you please let me know what you think about this?

What should the program return when KeyboardInterrupt is raised?
Let's consider sequential_feature_selector.py for now, and assume the user is interested in forward selection with k_features set to (3, 5). What would we want the user to see when the program is interrupted right after the first or second round of feature selection? Note that in such a case, the keys in the dictionary self.subsets_ are not yet in {3, 4, 5}. So, I think it might be better to avoid post-processing in that case. That is why I return self right after the exception block in the sequential case. Also, self.fitted should remain False in such a case (what do you think?).
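Roughly, what I did in the sequential case looks like this (simplified sketch, not the exact code):

# simplified sketch of the handling in sequential_feature_selector.py
self.interrupted_ = False
try:
    ...  # the sequential selection rounds
except KeyboardInterrupt:
    self.interrupted_ = True
    sys.stderr.write("\nSTOPPING EARLY DUE TO KEYBOARD INTERRUPT...")

if self.interrupted_:
    # self.subsets_ may not yet contain keys for all of {3, 4, 5},
    # so skip post-processing and leave self.fitted as False
    return self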

Now, if we take a look at the exhaustive module, we can see that it does not return self after the exception block. If I do:

# in exhaustive_feature_selector.py

except KeyboardInterrupt:
    self.interrupted_ = True
    sys.stderr.write("\nSTOPPING EARLY DUE TO KEYBOARD INTERRUPT...")
    return self  # NEW

then I get the following error when I run pytest:

# in tests/test_exhaustive_feature_selector.py

# in function `test_check_pandas_dataframe_fit`

        efs1._TESTING_INTERRUPT_MODE = True
        efs1 = efs1.fit(df, y)
>       assert efs1.best_idx_ == (0, 1), efs1.best_idx_
E       AssertionError: None
E       assert None == (0, 1)

This means we are testing efs.best_idx_ in that case. However, I think we should just return self inside the exception block (or right after it) and avoid post-processing; we should not test best_idx_. I mean, we can test the last feature indices or the dictionary subsets_ to test _TESTING_INTERRUPT_MODE. We should not give the user self.best_idx_. If the user is interested in the best-so-far score or indices, they should be able to get it from the dictionary .subsets_. What do you think?
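For example, a user could recover the best-so-far result from the dictionary themselves, roughly like this (sketch; using the avg_score / feature_idx fields of .subsets_):

# sketch: recovering the best-so-far result from .subsets_ after an interrupt
best_key = max(sfs.subsets_, key=lambda k: sfs.subsets_[k]["avg_score"])
best_score = sfs.subsets_[best_key]["avg_score"]
best_idx = sfs.subsets_[best_key]["feature_idx"]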

@rasbt (Owner) commented Sep 7, 2022

Good points. I think the concern you have is that best_idx_ is not the "best" index (or None) at the point when the run was interrupted.

A counter argument for postprocessing, though, is that (particularly for the EFS) a run can take a very long time. Sometimes, users want to manually interrupt it due to time constraints but still use the results up to that point.

If we don't do the post-processing for them, the users would have to figure it out themselves. It's not a deal-breaker, but when it comes to convenience, I think in >80% of cases users would probably prefer it if an interrupted SFS/EFS could later be used the same way as a fully fitted one, without extra steps.

Or, maybe a middle ground would be this:

we do what you propose regarding

  1. returning self
  2. leaving self.fitted=False

but we add a clean-up method that users have to call (at their own responsibility) after interruption. What would be a good name for this?

a. cleanup
b. cleanup_from_interrupt
c. postprocess_from_interrupt
d. finalize_fit

I would say b) & c) are more clear, but if we go with a) or d), that's also something we can use internally (just the last steps refactored into a method).

What do you think?

@NimaSarajpoor (Contributor Author) commented Sep 7, 2022

> I think the concern you have is that best_idx_ is not the "best" index (or None) at the point when the run was interrupted.

Yes. That's correct.

> A counter argument for postprocessing, though, is that (particularly for the EFS) a run can take a very long time. Sometimes, users want to manually interrupt it due to time constraints but still use the results up to that point.
>
> If we don't do the post-processing for them, the users would have to figure it out themselves. It's not a deal-breaker, but when it comes to convenience, I think in >80% of cases users would probably prefer it if an interrupted SFS/EFS could later be used the same way as a fully fitted one, without extra steps.

Yes, I understand. This is to just give users something to work with.

> Or, maybe a middle ground would be this:
>
> we do what you propose regarding
>
> 1. returning self
> 2. leaving self.fitted=False
>
> but we add a clean-up method that users have to call (at their own responsibility) after interruption. What would be a good name for this?
>
> a. cleanup b. cleanup_from_interrupt c. postprocess_from_interrupt d. finalize_fit
>
> I would say b) & c) are more clear, but if we go with a) or d), that's also something we can use internally (just the last steps refactored into a method).
>
> What do you think?

I think this is a good approach. I think d should work. We may do something like this:

def finalize_fit(self, ...):
    # compute the best score and the corresponding feature subset
    ...

def fit(self, ...):
    ...
    # if the fit was interrupted by the keyboard:
    return self

    # otherwise:
    self.finalize_fit(...)
    return self

And then, in the docstring, we can explain that when the fit is interrupted via the keyboard, users can still get the best score and the corresponding subset of features by calling .finalize_fit at their own responsibility.
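Usage would then look roughly like this (sketch; assuming the method ends up being called finalize_fit and that the interrupted_ flag is set as discussed above):

# sketch of the intended usage after a manual interrupt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=4)

sfs = SFS(knn, k_features=(2, 3), forward=True)
sfs = sfs.fit(X, y)          # suppose the user hits Ctrl-C partway through;
                             # fit() catches it and returns the partial object

if sfs.interrupted_:
    sfs.finalize_fit()       # deliberate extra step, at the user's own responsibility
    print(sfs.k_feature_idx_, sfs.k_score_)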

Sounds good? (Is this what you had in mind?)

@rasbt (Owner) commented Sep 7, 2022

Yes, I think this is perfect. This way, people who, let's say, accidentally interrupt it are not misled, because it takes an extra deliberate step. But the step is not that inconvenient, so if you know what you are doing, it's a good, easy workaround!

@NimaSarajpoor (Contributor Author) commented Sep 9, 2022

@rasbt
There is still 1 test function that fails. Something weird is going on here...

Looking at the test function tests/test_exhaustive_feature_selector.py::test_knn_wo_cv, we can see:

expect = {
        0: {
            "feature_idx": (0, 1),
            "feature_names": ("0", "1"),
            "avg_score": 0.82666666666666666,
            "cv_scores": np.array([0.82666667]),
        },
       ....

And this test function passes. Now, if I run the same thing in Jupyter, I get a different result(!):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
efs1 = EFS(
    knn,
    min_features=2,
    max_features=3,
    scoring="accuracy",
    cv=0,
    print_progress=False,
)
efs1 = efs1.fit(X, y)
efs1.subsets_

And I now see that the value corresponding to expect[0]['avg_score'] is 0.8333333333333334. How is that possible? Note that this test function passes when I run pytest, but it should fail, right? (Or maybe I am missing something.)
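One way to sanity-check which number is right (sketch; assuming that with cv=0 the score is simply the estimator's accuracy on the training data):

# with cv=0 the EFS score for a subset should match fitting and scoring
# the estimator on the training data directly (assumption, see above)
knn_check = KNeighborsClassifier(n_neighbors=4)
knn_check.fit(X[:, [0, 1]], y)
print(knn_check.score(X[:, [0, 1]], y))  # compare with expect[0]["avg_score"]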


Now, please take a look at tests/test_sequential_feature_selector_feature_groups.py::test_knn_wo_cv_with_fixed_features_and_feature_groups_case1 (this is the only failure we are getting in the unit tests). The expected score for features (0, 1) is 0.8266..., but it is different from the score we are getting, which is 0.83.

Any thoughts?

@NimaSarajpoor (Contributor Author) commented Sep 9, 2022

@rasbt
Weird! I am getting one failure in unit testing when I run it in my local repo! (I explained it in my previous comment)

@NimaSarajpoor (Contributor Author) commented Sep 9, 2022

@rasbt
Update:

  • I cannot figure out the problem I had in my local repo. My goal was to push the commits so you could see the single failure I got locally. But apparently, things are fine here.
  • I refactored several functions and moved them into utilities.py.
  • I added some checks for k_features, fixed_features, and feature_groups.

I think there are three things we may want to do as the final steps (or leave them to someone else):

  • Add test for dataframe/string in sequential feature selection with feature_groups
  • Move some of the checks to __init__. Currently, I put most checks in .fit(), and I moved some checks from __init__ to .fit(). However, it would be better to put each check in the appropriate place. For instance, if a user provides both fixed_features and feature_groups in SFS(...), it is better to raise an error there if there is a feature in fixed_features that is not in feature_groups; however, this error is currently raised in .fit(). And if the user does not provide feature_groups, we should do another check in .fit() to see whether the data X contains the features in fixed_features. This is not an easy process :) For now, I just wanted to make sure we have checks for the params and that things are working (see the sketch after this list).
  • Add test functions with assert_raises(..) to test the error messages.
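For the fixed_features / feature_groups interaction mentioned above, the check is roughly of this shape (illustrative sketch; the function name is hypothetical and not the actual helper in the PR):

# illustrative sketch -- not the actual helper used in this PR
def _check_fixed_features_in_feature_groups(fixed_features, feature_groups):
    in_groups = set()
    for group in feature_groups:
        in_groups.update(group)
    missing = set(fixed_features) - in_groups
    if missing:
        raise ValueError(
            "The following fixed_features do not appear in any "
            f"feature group: {sorted(missing)}"
        )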

@rasbt (Owner) commented Sep 11, 2022

Thanks for the updates!

> Add test for dataframe/string in sequential feature selection with feature_groups

I can take care of this one.

> Move some of the checks to ...

Hm, yeah, that's not an easy thing. In general, I think scikit-learn does all checks in fit(); I forget the details, but there was a rationale behind it. However, I agree with you that some things are better checked in __init__, like when two functionalities interact/depend on each other, as fixed_features and feature_groups do. There could be an extra check in fit() to verify that the data indeed contains the specified features. But on the other hand, I feel like this can also be left to the user's responsibility, and we don't have to check every detail :P

@rasbt (Owner) commented Sep 11, 2022

Just added the unit tests. I think there's one unexpected behavior (second from the bottom) that doesn't look right (haha, or am I overlooking something?)

def test_check_pandas_dataframe_with_feature_groups_and_fixed_features():
    iris = load_iris()
    X = iris.data
    y = iris.target
    lr = SoftmaxRegression(random_seed=123)

    df = pd.DataFrame(
        X, columns=["sepal length", "sepal width", "petal length", "petal width"]
    )
    sfs1 = SFS(
        lr,
        k_features=2,
        forward=True,
        floating=False,
        scoring="accuracy",
        feature_groups=[
            ["petal length", "petal width"],
            ["sepal length"],
            ["sepal width"],
        ],
        fixed_features=[ "sepal length", "petal length", "petal width"],
        cv=0,
        verbose=0,
        n_jobs=1,
    )

    sfs1 = sfs1.fit(df, y)
    assert sfs1.k_feature_names_ == (
        'sepal length', 'petal length', 'petal width',
    ), sfs1.k_feature_names_

This returns ('sepal width', 'petal length', 'petal width')

But I think that's not possible. I.e., it can't return 'sepal width' because it's not in fixed_features. For the integer index version, this seems to work correctly:

def test_check_feature_groups():
    iris = load_iris()
    X = iris.data
    y = iris.target
    lr = SoftmaxRegression(random_seed=123)
    sfs1 = SFS(
        lr,
        k_features=2,
        forward=True,
        floating=False,
        scoring="accuracy",
        feature_groups=[[2, 3], [0], [1]],
        fixed_features=[0, 2, 3],
        cv=0,
        verbose=0,
        n_jobs=1,
    )

    sfs1 = sfs1.fit(X, y)
    assert sfs1.k_feature_idx_ == (0, 2, 3), sfs1.k_feature_idx_

Or am I confused now? 😅
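For reference, a quick cross-check (sketch, using the df from the first test above) shows the two tests specify the same configuration once the column names are mapped to positions:

# map the string columns to their integer positions (sketch)
cols = ["sepal length", "petal length", "petal width"]
print([df.columns.get_loc(c) for c in cols])  # [0, 2, 3], same as fixed_features above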

@NimaSarajpoor (Contributor Author) commented Sep 11, 2022

@rasbt
Thanks for the help. Yeah... it seems something is off...I will investigate it. (It is good that you caught it! Thanks!!)

@NimaSarajpoor (Contributor Author)

@rasbt
It seems the error you mentioned is now resolved. Btw, I don't think I did anything specific to that test function. I am now going to resolve the one that still fails.

@NimaSarajpoor (Contributor Author) commented Sep 12, 2022

@rasbt
Updates

  • The decimal precision was set to 2 in the Exhaustive tests and 3 in the Sequential tests. I have now set both to 2.
  • Previously, with the decimal set to 3, I found some discrepancies (e.g., see the failed test). I do not know how the expected values were calculated in the first place, so maybe the expected values in a few cases should be revised. However, I was not sure, and I thought maybe the expected values were calculated in a particular way. So, I just set the decimal to 2 for testing both modules (see the sketch after this list).
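To make the effect of the looser tolerance concrete (illustrative only; the project's test helpers may compare slightly differently):

# the observed gap is ~0.0067 (0.8333... vs. 0.8266...), so:
round(0.8266666666, 3) == round(0.8333333333, 3)  # False -> fails at decimal=3
round(0.8266666666, 2) == round(0.8333333333, 2)  # True  -> passes at decimal=2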

I think things should be good now :)

@rasbt (Owner) commented Sep 14, 2022

Thanks so much, this looks reasonable to me! I will see if there's anything to add on the documentation front but other than that, I think this is good to merge. I am also planning to make a release this weekend so that people can actually start using the awesome features you've added :)

@NimaSarajpoor (Contributor Author)

Cool :) Thanks for your help and support throughout this journey. It has been fun :)

@rasbt rasbt merged commit 1f46260 into rasbt:master Sep 14, 2022