Make sure meta-estimators are lenient towards missing data #15319
I think that we developed some meta-estimators which would call `check_array` (or `check_X_y`) with the default parameters, i.e. `force_all_finite=True`, at `fit` and `predict` time. Let's give an example: if the underlying estimator supports missing values (e.g. `HistGradientBoostingClassifier`), the meta-estimator will still raise on NaN before the underlying estimator ever sees the data. So the idea is to delay this checking and let the underlying estimator validate the input.
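A minimal sketch of the problem (the `NaiveMetaEstimator` class below is a made-up illustration, not an actual scikit-learn estimator):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
# On scikit-learn < 1.0 this import also requires
# `from sklearn.experimental import enable_hist_gradient_boosting` first.
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.utils.validation import check_array


class NaiveMetaEstimator(BaseEstimator, ClassifierMixin):
    """Made-up wrapper that validates too eagerly (illustration only)."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # check_array rejects NaN/inf by default, before the inner
        # estimator ever sees the data.
        X = check_array(X)
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self


rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = (X[:, 0] > 0).astype(int)
X[rng.rand(*X.shape) < 0.1] = np.nan  # inject ~10% missing values

# The inner estimator handles NaN natively...
HistGradientBoostingClassifier(max_iter=3).fit(X, y)

# ...but the eager wrapper raises before it gets the chance.
try:
    NaiveMetaEstimator(HistGradientBoostingClassifier(max_iter=3)).fit(X, y)
except ValueError as exc:
    print(exc)  # "Input contains NaN..."
```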
This is also part of #9854.
Also related: #12072.
I think looking through our implementations to check which meta-estimators and ensembles enforce finiteness is pretty straightforward, so I've tagged this as such.
IMHO, this is not an easy task from the perspective of new or inexperienced contributors (who are attracted by the help wanted tag). Seeing something tagged as easy when it doesn't feel like it can be quite discouraging, and since our barrier to entry is already crazy high, I think we need to be careful when tagging issues as such.
I usually consider "easy" to require a degree of familiarity higher than "good first issue".
I'm interested in working on this issue. Some pointers would be helpful.
@venkyyuvy The idea is "passthrough": the meta-estimator should pass the data through to the underlying estimator instead of enforcing finiteness itself.
Hi, I am interested in working on this issue. |
The basic idea is to look at ensemble, multioutput, etc. and make sure that they do not validate their input beyond what is required for their algorithm to work.
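A sketch of what that could look like, reusing the hand-rolled wrapper from the earlier illustration (the real change lives inside each scikit-learn meta-estimator; note that `force_all_finite` was renamed `ensure_all_finite` in scikit-learn 1.6):

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils.validation import check_array


class LenientMetaEstimator(BaseEstimator, ClassifierMixin):
    """Made-up wrapper that defers the finiteness check (illustration only)."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # Validate only what this wrapper itself needs; do NOT reject
        # NaN/inf here -- the inner estimator runs its own validation and
        # raises if it genuinely cannot handle missing values.
        X = check_array(X, force_all_finite=False)
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def predict(self, X):
        return self.estimator_.predict(X)
```

With this change, `LenientMetaEstimator(HistGradientBoostingClassifier()).fit(X, y)` succeeds on the NaN-containing data from the sketch above, while wrapping an estimator that cannot handle NaN still fails with that estimator's own error message.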
So just to confirm, the meta-estimators should be able to pass data containing NaN through to the underlying estimator without raising?
Looks like `BaggingClassifier` is already lenient towards missing values. I tried the following test and it is passing:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()


def test_leniency_for_missing_data():
    rng = np.random.RandomState(42)
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, random_state=rng
    )
    # Blank out ~10% of the training entries (use rng so the mask is
    # reproducible; the original snippet used np.random.choice here).
    mask = rng.choice([1, 0], X_train.shape, p=[0.1, 0.9]).astype(bool)
    X_train[mask] = np.nan
    hgbc = HistGradientBoostingClassifier(max_iter=3)
    # `base_estimator` was renamed `estimator` in scikit-learn 1.2.
    clf = BaggingClassifier(base_estimator=hgbc)
    clf.fit(X_train, y_train)
    assert clf.score(X_test, y_test) > 0.8
```
Probably I should start with other meta-estimators?

Update: `VotingClassifier` and `StackingClassifier` pass the same kind of test as well:

```python
from sklearn.ensemble import StackingClassifier, VotingClassifier


def test_leniency_for_missing_data():
    rng = np.random.RandomState(42)
    X, y = iris.data, iris.target  # iris as in the previous snippet
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
    mask = rng.choice([1, 0], X_train.shape, p=[0.1, 0.9]).astype(bool)
    X_train[mask] = np.nan
    hgbc = HistGradientBoostingClassifier(max_iter=3)
    clf = VotingClassifier(estimators=[('hgbc_1', hgbc), ('hgbc_2', hgbc)])
    clf.fit(X_train, y_train)
    assert clf.score(X_test, y_test) > 0.8
    clf = StackingClassifier(estimators=[('hgbc_1', hgbc), ('hgbc_2', hgbc)])
    clf.fit(X_train, y_train)
    assert clf.score(X_test, y_test) > 0.8
```
If there is no existing test for it, please add it.
@glemaitre does the PR indeed close this issue? I couldn't figure out from the conversation there whether it fixes all the meta-estimators or not.
I would say yes, if we did not miss any: #17987 (comment)
Now we can move forward to run the common tests, which is a bit more tricky.
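For illustration, such a common test could be parametrized over the meta-estimators in question. The list, data, and threshold below are a sketch, not the actual common test that was merged (`estimator` is the scikit-learn >= 1.2 name of `BaggingClassifier`'s `base_estimator` parameter):

```python
import numpy as np
import pytest
from sklearn.ensemble import (
    BaggingClassifier,
    HistGradientBoostingClassifier,
    StackingClassifier,
    VotingClassifier,
)


def _inner():
    return HistGradientBoostingClassifier(max_iter=3)


@pytest.mark.parametrize(
    "meta",
    [
        BaggingClassifier(estimator=_inner()),
        VotingClassifier(estimators=[("a", _inner()), ("b", _inner())]),
        StackingClassifier(estimators=[("a", _inner()), ("b", _inner())]),
    ],
)
def test_meta_estimator_is_lenient_to_nan(meta):
    rng = np.random.RandomState(0)
    X = rng.randn(200, 4)
    y = (X[:, 0] > 0).astype(int)
    X[rng.rand(*X.shape) < 0.1] = np.nan  # ~10% missing values
    meta.fit(X, y)  # must not raise on NaN input
    assert meta.score(X, y) > 0.5
```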
This item is in our roadmap, and I don't totally understand it. I'm creating this issue to track the progress of those items. Not sure who wrote it. @amueller maybe you could elaborate? (Feel free to edit the description of the issue.)