
Make sure meta-estimators are lenient towards missing data #15319

Closed
adrinjalali opened this issue Oct 21, 2019 · 16 comments · Fixed by #17987
Labels
API · Moderate (Anything that requires some knowledge of conventions and best practices)

Comments

@adrinjalali
Member

This item is on our roadmap, and I don't fully understand it. I'm creating this issue to track the progress of those roadmap items. I'm not sure who wrote it. @amueller, maybe you could elaborate? (Feel free to edit the description of the issue.)

@glemaitre
Member

I think we developed some meta-estimators which call check_X_y and do not allow missing values. However, this check could be done by the underlying estimator instead.

Let's give an example: take a BaggingClassifier with a HistGradientBoostingClassifier as its base estimator. The gradient boosting estimator natively handles missing values, so it would be wrong for the BaggingClassifier to raise an error already at fit.

So the idea is to delay this check to the underlying estimator.
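
To illustrate the idea, here is a minimal hypothetical sketch (not the actual scikit-learn code; LenientMetaClassifier is a made-up name): the meta-estimator still validates shapes and dtypes, but passes force_all_finite="allow-nan" to check_X_y and leaves the decision about missing values to the wrapped estimator.

from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils.validation import check_X_y


class LenientMetaClassifier(BaseEstimator, ClassifierMixin):
    """Toy meta-estimator that defers the missing-value check to its base estimator."""

    def __init__(self, base_estimator):
        self.base_estimator = base_estimator

    def fit(self, X, y):
        # Validate shapes and dtypes, but do not reject NaN here; if the wrapped
        # estimator cannot handle missing values, it will raise its own error.
        X, y = check_X_y(X, y, force_all_finite="allow-nan")
        self.estimator_ = clone(self.base_estimator).fit(X, y)
        return self

    def predict(self, X):
        return self.estimator_.predict(X)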

@jnothman added the Easy (Well-defined and straightforward way to resolve) label on Dec 4, 2019
@jnothman
Member

jnothman commented Dec 4, 2019

This is also part of #9854

@jnothman
Member

jnothman commented Dec 4, 2019

Also related: #12072

@jnothman
Member

jnothman commented Dec 4, 2019

I think looking through our implementations to check which meta-estimators and ensembles enforce finiteness is pretty straightforward, so I've tagged this as such.
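
As a rough illustration (a throwaway sketch, not part of the test suite; the estimator choices here are just examples), one can probe which meta-estimators enforce finiteness by fitting them on data containing NaN, using a base estimator that handles missing values natively:

import numpy as np
from sklearn.datasets import load_iris
# On scikit-learn < 0.24 this also requires:
# from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import (BaggingClassifier, HistGradientBoostingClassifier,
                              StackingClassifier, VotingClassifier)

X, y = load_iris(return_X_y=True)
X[::7, 1] = np.nan  # sprinkle in some missing values

base = HistGradientBoostingClassifier(max_iter=5)
# `base_estimator` was the parameter name at the time; newer releases call it `estimator`.
meta_estimators = {
    "BaggingClassifier": BaggingClassifier(base_estimator=base),
    "VotingClassifier": VotingClassifier(estimators=[("a", base), ("b", base)]),
    "StackingClassifier": StackingClassifier(estimators=[("a", base), ("b", base)]),
}
for name, est in meta_estimators.items():
    try:
        est.fit(X, y)
        print(f"{name}: accepts NaN")
    except ValueError as exc:
        print(f"{name}: rejects NaN ({exc})")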

@NicolasHug
Member

NicolasHug commented Dec 4, 2019

IMHO, this is not an easy task from the perspective of new or inexperienced contributors (who are attracted by the help wanted tag).

Seeing something tagged as easy when it doesn't feel like it can be quite discouraging, and since our barrier to entry is already crazy high, I think we need to be careful when tagging issues as such.

@jnothman added the Moderate (Anything that requires some knowledge of conventions and best practices) label and removed the Easy (Well-defined and straightforward way to resolve) label on Dec 5, 2019
@jnothman
Member

jnothman commented Dec 5, 2019

I usually consider "easy" to require a grade of familiarity higher than "good first issue"

@venkyyuvy
Contributor

I'm interested in working on this issue.

Some pointers would be helpful.

@glemaitre
Member

@venkyyuvy The idea is to "pass through" NaN values. Since there are scikit-learn compatible estimators which internally handle missing values, it would be good if our meta-estimators (and intermediate preprocessing steps) did not raise any error but instead propagated the missing values.
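
As a small illustration (not from the original comment): HistGradientBoostingClassifier is one such estimator that accepts NaN natively, so a meta-estimator wrapping it only needs to pass the values through untouched.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = load_iris(return_X_y=True)
X[::10, 0] = np.nan  # introduce some missing values

# Fits without error: NaN entries are handled natively during tree building.
clf = HistGradientBoostingClassifier(max_iter=10).fit(X, y)
print(clf.score(X, y))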

@yashika51
Contributor

Hi, I am interested in working on this issue.
Which estimators should be able to pass NaN?

@jnothman
Member

jnothman commented Jul 9, 2020 via email

@yashika51
Contributor

yashika51 commented Jul 9, 2020

So just to confirm: the meta-estimators should be able to pass NaNs through and not raise any error at check_X_y, and the base estimators should be able to handle these missing values (which they already do).
Any changes would be made to the meta-estimators so that they pass these values through.

@venkyyuvy
Contributor

venkyyuvy commented Jul 25, 2020

Looks like BaggingClassifier is already lenient towards missing values.

Reference

I tried the following test and it is passing:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()


def test_leniency_for_missing_data():
    rng = np.random.RandomState(42)
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                        random_state=rng)

    # Mask out ~10% of the training entries with NaN.
    mask = rng.choice([1, 0], X_train.shape, p=[.1, .9]).astype(bool)
    X_train[mask] = np.nan
    hgbc = HistGradientBoostingClassifier(max_iter=3)
    clf = BaggingClassifier(base_estimator=hgbc)
    clf.fit(X_train, y_train)

    assert clf.score(X_test, y_test) > 0.8

Should I move on to the other meta-estimators next?

Update:

Even VotingClassifier and StackingClassifier are working fine with missing values in X:

from sklearn.ensemble import StackingClassifier, VotingClassifier


def test_leniency_for_missing_data():
    rng = np.random.RandomState(42)
    X, y = iris.data, iris.target  # reusing the iris data loaded above
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        random_state=rng)

    mask = rng.choice([1, 0], X_train.shape, p=[.1, .9]).astype(bool)
    X_train[mask] = np.nan
    hgbc = HistGradientBoostingClassifier(max_iter=3)
    clf = VotingClassifier(estimators=[('hgbc_1', hgbc), ('hgbc_2', hgbc)])
    clf.fit(X_train, y_train)

    assert clf.score(X_test, y_test) > 0.8

    clf = StackingClassifier(estimators=[('hgbc_1', hgbc), ('hgbc_2', hgbc)])
    clf.fit(X_train, y_train)
    assert clf.score(X_test, y_test) > 0.8

@jnothman
Member

jnothman commented Jul 26, 2020 via email

@adrinjalali
Member Author

@glemaitre does the PR indeed close this issue? I couldn't figure out from the conversation there whether it fixes all the meta-estimators or not.

@glemaitre
Member

I would say yes if we did not miss any: #17987 (comment)

@glemaitre
Member

glemaitre commented Aug 26, 2020 via email
