
[MRG + 2] Add check to regression models to raise error when targets are NaN #5431

Merged
merged 19 commits into scikit-learn:master on Oct 22, 2015

Conversation

4 participants
hlin117 commented Oct 17, 2015

Addressing issue #5322.

This pull request addresses the following issues:

  • Regression models raise errors when targets are NaN / inf
    • Changes at least four files that contain models
    • Regression tests for this error
  • Fixes a bug in _yield_regressor_checks that prevented tests from being run on certain models
hlin117 commented Oct 17, 2015

I'm running the following test script:

from sklearn.utils.estimator_checks import check_estimator
from sklearn.ensemble import RandomForestRegressor as RF
from sklearn.tree import DecisionTreeRegressor as DT
from sklearn.ensemble import ExtraTreesRegressor as ETR

from sklearn.linear_model import MultiTaskElasticNet as MTE, MultiTaskLasso as MTL

check_estimator(RF)
check_estimator(DT)
check_estimator(ETR)
check_estimator(MTE)
check_estimator(MTL)

It seems to be failing on check_estimator(MTE). The line of code making this build fail is here. It seems that the y on this line is a 1-D array, which is why I get the following error message:

ValueError: For mono-task outputs, use ElasticNet

I'm not sure why that line is being executed when check_estimator is called for a MultiTaskElasticNet. Since that's not my piece of code, is it fine for me to change it?

amueller commented Oct 18, 2015

You shouldn't change that. Calling check_estimator directly is maybe not the best idea. I'm not entirely sure why this doesn't work, but it looks like the class name is not passed correctly for MultiTaskElasticNet.
Can you just add your test to _yield_regressor_checks?

hlin117 commented Oct 18, 2015

I think one of the diffs I've written (here) should already add a test to _yield_regressor_checks. Is this what you want to see?

Something strange I found while using a pycharm debugger was that when check_estimator(RandomForestRegressor) is called, name=ABCMeta. Is this expected behavior?

[screenshot: PyCharm debugger showing name == 'ABCMeta' during check_estimator(RandomForestRegressor)]

name = ABCMeta seems to be assigned inside the check_estimator function itself.

[screenshot: the assignment of name inside check_estimator]

Estimator().fit(X, y1)
except ValueError as e:
if e.message != errmsg:
warnings.warn("Estimator {0} raised warning as expected, but "

amueller replied Oct 18, 2015

it's an error, not a warning.

amueller commented Oct 18, 2015

You are right, that is odd. And probably a bug I introduced. It looks like the tests for check_estimator are not very tight.

Do you get an error when just running the test suite? The continuous integration seems to just time out, which is odd.

amueller commented Oct 18, 2015

(There are a bunch of errors, too.) I think we should open a separate issue for the check_array problem and focus on fixing the errors here.

amueller commented Oct 18, 2015

Sorry, after more coffee: the timing out is entirely expected. These are exactly the failure cases we want to protect against.

hlin117 commented Oct 18, 2015

The test suite hangs on the 40th test of sklearn.tests.test_common.test_non_meta_estimators('NuSVR', <class 'sklearn.svm.classes.NuSVR'>). That might be my test; NuSVR seemed to either segfault or hang when it's passed a y with NaN values.

hlin117 commented Oct 18, 2015

I think we discussed in #5322 that we wouldn't be changing SVR or NuSVR for now, and to just focus on the regressors that don't error out. Should we handle these problems now or later?

amueller commented Oct 18, 2015

I think your link is not linking where you think it's linking. Yeah, NuSVR seems to crash. We were talking about accepting y of a different shape before, which is a different issue. We should fix that it crashes on NaN.

hlin117 commented Oct 18, 2015

(I updated the link above.)

Great, I'll get to work on fixing NuSVR (and SVR) then.

hlin117 commented Oct 18, 2015

After my change in estimator_checks in commit e2d9da8, the MultiTaskElasticNet and MultiTaskLasso operated as expected. I'm not sure how you feel about my change to estimator_checks, but it certainly resolved the errors I was having before.

hlin117 commented Oct 19, 2015

My most recent commit changed the test in test_estimator_checks.test_check_estimator. This fixes the failure from commit 9ad3a45 in this build (search for "failure"), and it would be nice to have this approved.

hlin117 commented Oct 19, 2015

The latest travis build is failing on Python 3.5, thanks to my check_supervised_y_no_nan function. (See here.)

Basically, I think all of the estimators are failing. This is most likely because of some subclassing issue, which would explain why the Python 2 build passes but the Python 3 build does not.

Should I pursue fixing these errors?
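One possible culprit for a Python-3-only failure (an assumption on my part, not confirmed in this thread): Python 3 exceptions have no .message attribute, so a comparison like the e.message != errmsg quoted earlier raises AttributeError there, while str(e) works on both versions:

```python
# Python 2 exceptions exposed a .message attribute; Python 3 removed it.
# Comparing against str(e) is portable across both.
try:
    raise ValueError("Input contains NaN")
except ValueError as e:
    assert str(e) == "Input contains NaN"  # works on Python 2 and 3
    assert not hasattr(e, "message")       # True on Python 3
```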

hlin117 commented Oct 19, 2015

It looks like AppVeyor and Travis are passing with my latest commit, but ci/circleci is failing because of #5472. Can someone review my code and see whether it's suitable for merge?

@@ -75,7 +75,8 @@ def test_check_estimator():
msg = "Estimator doesn't check for NaN and inf in predict"
assert_raises_regex(AssertionError, msg, check_estimator, NoCheckinPredict)
# check for sparse matrix input handling
msg = "Estimator type doesn't seem to fail gracefully on sparse data"

"""
name = Estimator.__class__.__name__

amueller replied Oct 19, 2015
Argh, this is already the class, right. Can you add a regression test?

hlin117 replied Oct 20, 2015

(Addressed in 0977c9f)
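The name = Estimator.__class__.__name__ confusion can be reproduced with plain abc: when check_estimator receives the class itself rather than an instance, __class__ is the metaclass, so its __name__ is ABCMeta. A sketch with illustrative class names (not scikit-learn's actual hierarchy):

```python
from abc import ABCMeta

# Hypothetical stand-ins for scikit-learn's ABC-based estimator classes.
class BaseEstimator(metaclass=ABCMeta):
    pass

class SomeRegressor(BaseEstimator):
    pass

# Passing the class itself: __class__ is the metaclass, hence "ABCMeta".
assert SomeRegressor.__class__.__name__ == "ABCMeta"
# The fix: read __name__ from the class directly.
assert SomeRegressor.__name__ == "SomeRegressor"
```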

@@ -152,6 +152,7 @@ def fit(self, X, y, sample_weight=None, check_input=True,
random_state = check_random_state(self.random_state)
if check_input:
X = check_array(X, dtype=DTYPE, accept_sparse="csc")
y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)

amueller replied Oct 19, 2015

so we don't accept csr?

hlin117 replied Oct 19, 2015

I wrote accept_sparse='csc' because the original check for sparse matrices checked for csc matrices. Not sure whether we've actually tested whether DecisionTreeRegressor or DecisionTreeClassifier can take in csr matrices for targets.

MechCoder replied Oct 20, 2015

It should, but it converts them into CSC form. Probably the CSC format is what the DecisionTree code is optimized for.
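A quick illustration of the CSR-to-CSC conversion being discussed, using SciPy directly (a sketch; the actual validation in the tree code goes through check_array):

```python
import numpy as np
from scipy import sparse

# CSR input is accepted, then converted: tree building scans features
# (columns), which is the access pattern CSC stores efficiently.
X_csr = sparse.csr_matrix(np.eye(4))
X_csc = X_csr.tocsc()

assert X_csc.format == "csc"
assert (X_csc.toarray() == X_csr.toarray()).all()
```

The conversion is cheap relative to fitting and keeps the Cython tree code on its preferred column-major layout.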

"does not match expected error message" \
.format(name))
else:
raise ValueError("Estimator {0} should have raised error on fitting "

amueller replied Oct 19, 2015

nan in target / y maybe?

hlin117 replied Oct 20, 2015

(Addressed in commit 86487ae.)

amueller commented Oct 19, 2015

looks good apart from my nitpick about adding a regression test for check_estimator.

hlin117 commented Oct 19, 2015

Since check_estimator was failing on MultiTaskElasticNet before, all I did was call check_estimator(MultiTaskElasticNet) in test_check_estimator. How does this look?

amueller commented Oct 20, 2015

Hm it would be nice to actually check that the name is passed correctly but I guess that is not easy to test. So never mind, this looks good.

@amueller amueller changed the title Add check to regression models to raise error when targets are NaN [MRG + 1] Add check to regression models to raise error when targets are NaN Oct 20, 2015

@@ -147,7 +147,8 @@ def fit(self, X, y, sample_weight=None):
raise TypeError("Sparse precomputed kernels are not supported.")
self._sparse = sparse and not callable(self.kernel)

X = check_array(X, accept_sparse='csr', dtype=np.float64, order='C')
#X = check_array(X, accept_sparse='csr', dtype=np.float64, order='C')
X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')

MechCoder replied Oct 20, 2015

is this needed? looks like self._validate_targets does it anyway.

MechCoder replied Oct 20, 2015

Well, actually no, sorry (which makes _validate_targets a strange name).


warnings.simplefilter("ignore")
X = np.random.randn(10, 5)
y1 = np.random.randn(10) / 0.

MechCoder replied Oct 20, 2015

you should set the seed, I guess

hlin117 replied Oct 20, 2015

Should it matter? y1 will end up as an array of infs and -infs anyway.

MechCoder replied Oct 20, 2015

It will not matter in the absolute sense. But it is better to have deterministic input for every run (both X and y1)

"dtype('float64')."
try:
if "MultiTask" in name:
Estimator().fit(X, y2)

MechCoder replied Oct 20, 2015

You have multioutput_estimator_convert_y_2d to handle this. But I admit the name is too long.

hlin117 replied Oct 20, 2015

Ironically, if I hadn't added that if "MultiTask" in name statement, I wouldn't have discovered the bug fixed in a5860a1. Would you still want me to use multioutput_estimator_convert_y_2d?

MechCoder replied Oct 21, 2015

Whoops. Nice catch.

However I still think code reuse would be a good idea.

Can you, in multioutput_estimator_convert_y_2d:

  1. change if name in ([]) to if "MultiTask" in name
  2. change return y[:, np.newaxis] to return np.reshape(y, (-1, 1)) (using newaxis produces strange strides)?

hlin117 replied Oct 21, 2015

Addressed in commit 52a6acc.
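The revised helper, as I read the request above (a sketch; the real function lives in sklearn.utils.estimator_checks and may differ in detail):

```python
import numpy as np

def multioutput_estimator_convert_y_2d(name, y):
    # Multi-task estimators need a 2-D (n_samples, n_tasks) target;
    # all other estimators keep the 1-D y unchanged.
    if "MultiTask" in name:
        return np.reshape(y, (-1, 1))
    return y

y = np.arange(5.0)
assert multioutput_estimator_convert_y_2d("MultiTaskElasticNet", y).shape == (5, 1)
assert multioutput_estimator_convert_y_2d("Ridge", y).shape == (5,)
```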

amueller replied Oct 21, 2015

Actually, you would have still discovered the bug, I think, as that also checks the name.

MechCoder replied Oct 21, 2015

So, can you change this to use the existing function? Thanks !

hlin117 replied Oct 21, 2015

Check out commit 1f48814. Thanks!

@hlin117 hlin117 force-pushed the hlin117:nan-targets branch from 86487ae to 3bfbe79 Oct 20, 2015

def check_supervised_y_no_nan(name, Estimator):
# Checks that the Estimator targets are not NaN.

warnings.simplefilter("ignore")

MechCoder replied Oct 21, 2015

Why are there warnings? If it's due to the division by zero, you can replace it with y1 = np.inf * np.ones(10)


hlin117 replied Oct 21, 2015

Ahh, smart =] I'll push this change soon.
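Putting the two review suggestions together, a sketch of the warning-free, deterministic test setup (assuming nothing beyond NumPy):

```python
import numpy as np

rng = np.random.RandomState(888)  # local generator: no global side effect
X = rng.randn(10, 5)
y_inf = np.inf * np.ones(10)      # no divide-by-zero RuntimeWarning

assert not np.isfinite(y_inf).any()
assert X.shape == (10, 5)
```

Every run now sees identical X and y1, and nothing needs to be suppressed with warnings.simplefilter.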

@@ -151,7 +151,8 @@ def fit(self, X, y, sample_weight=None):
raise TypeError("Sparse precomputed kernels are not supported.")
self._sparse = sparse and not callable(self.kernel)

X = check_array(X, accept_sparse='csr', dtype=np.float64, order='C')
#X = check_array(X, accept_sparse='csr', dtype=np.float64, order='C')

MechCoder replied Oct 21, 2015

remove this commented line please

from ..utils import check_array, check_random_state, column_or_1d
from ..utils import compute_class_weight, deprecated
from ..utils import check_array, check_random_state, column_or_1d, check_X_y
from ..utils import ConvergenceWarning, compute_class_weight, deprecated

MechCoder replied Oct 21, 2015

looks like ConvergenceWarning is unused.

amueller replied Oct 21, 2015

this one wasn't addressed, right?

# Checks that the Estimator targets are not NaN.

warnings.simplefilter("ignore")
np.random.seed(888)

MechCoder replied Oct 21, 2015

nit: can you replace this with rng = np.random.RandomState(888), just to be consistent with the others?

MechCoder commented Oct 21, 2015

LGTM apart from inline comments!

@MechCoder MechCoder changed the title [MRG + 1] Add check to regression models to raise error when targets are NaN [MRG + 2] Add check to regression models to raise error when targets are NaN Oct 21, 2015

# Checks that the Estimator targets are not NaN.

np.random.seed(888)
X = np.random.randn(10, 5)

GaelVaroquaux replied Oct 21, 2015

You should use a local random number generator here, as in rng = np.random.RandomState(888), and not seed the global state, as that creates a side effect and is not concurrency-safe.

MechCoder replied Oct 21, 2015

I thought I commented the same thing. Maybe it disappeared in the git diffs.

hlin117 replied Oct 21, 2015

@MechCoder: I thought that np.random.seed(888) would provide a sufficient random number generator; I didn't know that it would seed the global, so sorry about that!

I fixed this in 290e0ea.

GaelVaroquaux commented Oct 21, 2015

LGTM aside from comment on random number generator.

hlin117 commented on sklearn/utils/estimator_checks.py in 1f48814, Oct 21, 2015

In retrospect, maybe this call should lie outside of the "try" block. What do you think, @MechCoder?

MechCoder replied Oct 21, 2015

Indeed.

GaelVaroquaux replied Oct 21, 2015

Yup

hlin117 replied Oct 21, 2015

Addressed in commit f725485. Thanks!
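The pattern being agreed on, sketched generically with illustrative names: only the call expected to raise sits inside the try block, so construction errors are not silently swallowed by the except clause:

```python
import numpy as np

def check_raises_on_bad_y(estimator_cls, X, y_bad, errmsg):
    est = estimator_cls()          # construction outside the try block
    try:
        est.fit(X, y_bad)          # only the call under test is guarded
    except ValueError as e:
        assert errmsg in str(e)
    else:
        raise AssertionError("fit() accepted invalid targets")

# A toy estimator that validates its targets, for demonstration only.
class DummyRegressor:
    def fit(self, X, y):
        if not np.isfinite(np.asarray(y, dtype=float)).all():
            raise ValueError("Input contains NaN")
        return self

check_raises_on_bad_y(DummyRegressor, np.zeros((3, 2)),
                      np.array([1.0, np.nan, 2.0]), "NaN")
```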

GaelVaroquaux commented Oct 22, 2015

Travis is happy. Merging!

GaelVaroquaux added a commit that referenced this pull request Oct 22, 2015

Merge pull request #5431 from hlin117/nan-targets
[MRG + 2] Add check to regression models to raise error when targets are NaN

@GaelVaroquaux GaelVaroquaux merged commit 744d161 into scikit-learn:master Oct 22, 2015

1 of 2 checks passed:

continuous-integration/appveyor: Waiting for AppVeyor build to complete
continuous-integration/travis-ci/pr: The Travis CI build passed
hlin117 commented Oct 22, 2015

Thanks everyone!
