New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stronger common tests for setting init params? / check_estimator #7738

Closed
amueller opened this Issue Oct 24, 2016 · 13 comments

Comments

Projects
None yet
4 participants
@amueller
Member

amueller commented Oct 24, 2016

In #7477 a solution was proposed that did something like

class Estimator(BaseEstimator):
    def __init__(self, param=None):
        self._param = param

    @property
    def param(self):
        return some_stuff(self._param)

The common tests let this pass, though that should wreck havoc on get_params and set_params.
I haven't looked into it but I think the tests should fail on this.

@amueller amueller added this to the 0.19 milestone Oct 24, 2016

@absolutelyNoWarranty

This comment has been minimized.

Contributor

absolutelyNoWarranty commented Oct 24, 2016

I am interested in contributing. It sounds to me like you want check_estimator to verify that there are no properties which are parameter names?

@amueller

This comment has been minimized.

Member

amueller commented Oct 24, 2016

Thanks for wanting to contribute.
I think we want to check that calling set_params is equivalent to passing parameters in __init__.
Actually, I'm a bit appalled to see that we never test set_params at all, it seems.

Can you please double check that with the current common tests the example I gave above passes?
The simplest test I can think of is to add a line to check_parameters_default_constructible
that does estimator.set_params(**estimator.get_params()). That should fail with the example I gave.

@absolutelyNoWarranty

This comment has been minimized.

Contributor

absolutelyNoWarranty commented Oct 25, 2016

Does "with the current common tests" mean use check_estimator?
Yes, your example can pass check_estimator but fail on set_params. However it is quite easy to bypass. The example below will pass check_estimator and set_params.

from sklearn.svm import LinearSVC
from sklearn.utils.estimator_checks import check_estimator


class ClipLinearSVC(LinearSVC):
    def __init__(self, penalty='l2', loss='squared_hinge', dual=True, tol=1e-4,
                 C=1.0, multi_class='ovr', fit_intercept=True,
                 intercept_scaling=1, class_weight=None, verbose=0,
                 random_state=None, max_iter=1000):
        self.dual = dual
        self.tol = tol
        self._C = C
        self.multi_class = multi_class
        self.fit_intercept = fit_intercept
        self.intercept_scaling = intercept_scaling
        self.class_weight = class_weight
        self.verbose = verbose
        self.random_state = random_state
        self.max_iter = max_iter
        self.penalty = penalty
        self.loss = loss

    @property
    def C(self):
        return self._C if self._C > 0 else 0

    @C.setter
    def C(self, value):
        self._C = value

check_estimator(ClipLinearSVC) # OK
e = ClipLinearSVC(C=-1000)
print e.set_params(**e.get_params())

# Output:
#ClipLinearSVC(C=0, class_weight=None, dual=True, fit_intercept=True,
#       intercept_scaling=1, loss='squared_hinge', max_iter=1000,
#       multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
#       verbose=0)

Written like this the param is actually completely transparent to get_param and set_param. I think it can only be detected by just checking the Estimator class param attribute.


from sklearn.utils.testing import assert_false

for param_name in Estimator._get_param_names():
    assert_false(hasattr(Estimator, param_name))
@absolutelyNoWarranty

This comment has been minimized.

Contributor

absolutelyNoWarranty commented Oct 25, 2016

Is it normal for check_estimator to throw AssertionErrors using build-in sklearn estimators when running from the command line? I am getting a lot of

AssertionError: TypeError not raised by fit

@amueller

This comment has been minimized.

Member

amueller commented Oct 25, 2016

@absolutelyNoWarranty can you give details of the assertion error? No, that's not normal.

And why would the hasattr fail? There's a property, right?

And it's not so much about preventing people from using properties. If they are transparent in a way that works with set_params, get_params and __init__ then this could work.

The test is mostly to avoid people implementing estimators that clearly don't work with set_params and get_params.

@absolutelyNoWarranty

This comment has been minimized.

Contributor

absolutelyNoWarranty commented Oct 25, 2016

I think I had some virtualenv issues on my end. Only DummyClassifier throws an assertion error now.

# sklearn=0.19.dev0
from sklearn.dummy import DummyClassifier
from sklearn.utils.estimator_checks import check_estimator
check_estimator(DummyClassifier)
#Output:
#AssertionError: TypeError not raised by fit

Yes the hasattr would be True. I thought that could be used in a blanket check to prevent all properties with the same name as a parameter in __init__.

And it's not so much about preventing people from using properties. If they are transparent in a way that works with set_params, get_params and init then this could work.

If I'm understanding this correctly, idempotent hyper-parameters is not a goal? It would be acceptable to have my_param_dict != estimator.set_params(**my_param_dict).get_params()?

@amueller

This comment has been minimized.

Member

amueller commented Oct 25, 2016

I'm not sure. Maybe that should be the bar?
I though estimator.set_params(estimator.get_params()) might be enough. But yours is stronger and why not. Least surprise I guess.

@jnothman

This comment has been minimized.

Member

jnothman commented Oct 25, 2016

Yes, we need setters in such cases, and the transformation needs to be invertible.

I've not read all the commentary above yet, but hasattr(Estimator, param_name) is not an appropriate test. Consider step names in Pipelines, for instance. But testing the relationship between get_params (maybe only for deep=False?) and set_params is appropriate.

(*"wreck havoc" -> "wreak havoc")

@absolutelyNoWarranty

This comment has been minimized.

Contributor

absolutelyNoWarranty commented Oct 26, 2016

After doing a search it turns out that TfidfVectorizer uses a lot of param-properties so I see why hasattr is not appropriate (though I don't understand yet why Pipeline would be incompatible.)

@jnothman

This comment has been minimized.

Member

jnothman commented Oct 26, 2016

p = Pipeline([('clf', LogisticRegression())])
assert(p.get_params()['clf'])
p.set_params(clf=SVC())
assert(not hasattr(p, 'clf'))

On 26 October 2016 at 14:58, absolutelyNoWarranty notifications@github.com
wrote:

After doing a search it turns out that TfidfVectorizer uses a lot of
param-properties so I see why hasattr is not appropriate (though I don't
understand yet why Pipeline would be incompatible.)


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#7738 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz6xVvq4cqA97vYeeq2PvfvOU3Zu1rks5q3s_qgaJpZM4Ke3du
.

@absolutelyNoWarranty

This comment has been minimized.

Contributor

absolutelyNoWarranty commented Oct 26, 2016

I meant for hasattr to check against ._get_param_names() so for Pipeline the only disallowed parameter would be steps.

I opened a PR with the modification suggested by @amueller .

@absolutelyNoWarranty

This comment has been minimized.

Contributor

absolutelyNoWarranty commented Nov 13, 2016

In PR 7477, the bug is that set_params just flat out doesn't work. In the review the comment is that the PR is against the API.

What I am wondering is if something like the below is acceptable.
It solves set_param but and still uses the property to do error-checking. Putting the error-checking in alpha.setter ensures that regardless of whether you use __init__ or set_params, any modification to alpha will go through the error checking.

class MultinomialNB(BaseDiscreteNB):

    def __init__(self, alpha=1.0, fit_prior=True, class_prior=None):
        self.alpha = alpha # assign to alpha not _alpha
        self.fit_prior = fit_prior
        self.class_prior = class_prior

    @property
    def alpha(self):
        return self._alpha
    @alpha.setter
    def alpha(self, value):
        # possible error checking but no modifying of _alpha with logic
        if value < _ALPHA_MIN:
            raise ValueError('alpha too small')
        self._alpha = value

Problem 2:
I don't really understand what is meant by invertibility. If it is just like TfidfVectorizer where values simply pass through to another object then I understand but is logic like +1, -1 actually something to be allowed? Where would this be useful?

@jnothman

This comment has been minimized.

Member

jnothman commented Nov 16, 2016

Something like that can be acceptable, used sparingly: we rather use the convention of validating in fit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment