Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG + 2] Classifier and regressor tags #4418

Merged
merged 3 commits into from
Apr 2, 2015

Conversation

amueller
Copy link
Member

This improved is_classifier by making it not depend on inheritance and introduces is_regressor.
It also fixes #2588, using ranking scorers on regressors.
Todo:

  • Describe the estimator_type in the dev docs
  • remove decision_function from all regressors.
  • use tag to decide whether to call predict or decision_function in multi_class.

@amueller
Copy link
Member Author

@mblondel for using ranking metrics on regressors, that is only possible with ground-truth being binary, right? Is that your use case?

@landscape-bot
Copy link

Code Health
Repository health increased by 0.00% when pulling 306d788 on amueller:classifier_regressor_tags into bc5acea on scikit-learn:master.

@landscape-bot
Copy link

Code Health
Repository health increased by 0.00% when pulling bc10d8f on amueller:classifier_regressor_tags into bc5acea on scikit-learn:master.

@landscape-bot
Copy link

Code Health
Repository health increased by 0.00% when pulling a39d237 on amueller:classifier_regressor_tags into bc5acea on scikit-learn:master.

@mblondel
Copy link
Member

For ndcg (not yet in scikit learn) any real is fine.
On Mar 20, 2015 5:50 AM, "Andreas Mueller" notifications@github.com wrote:

@mblondel https://github.com/mblondel for using ranking metrics on
regressors, that is only possible with ground-truth being binary, right? Is
that your use case?


Reply to this email directly or view it on GitHub
#4418 (comment)
.

@landscape-bot
Copy link

Code Health
Repository health increased by 0.00% when pulling ef712e2 on amueller:classifier_regressor_tags into bc5acea on scikit-learn:master.

@amueller
Copy link
Member Author

Should be good now.

@amueller amueller changed the title [WIP] Classifier and regressor tags [MRG] Classifier and regressor tags Mar 20, 2015
@landscape-bot
Copy link

Code Health
Repository health increased by 0.00% when pulling 6e50c77 on amueller:classifier_regressor_tags into bc5acea on scikit-learn:master.

@arjoly
Copy link
Member

arjoly commented Mar 21, 2015

In which extend does it break backward compatibility with pickled model?

@amueller
Copy link
Member Author

If you unpickle a class that has a new property added, the unpickled object will have the class attribute. So is_classifier will work on "old" objects.

@mblondel
Copy link
Member

LGTM. Since it's a core design issue, it would be nice to have other opinions. random pings @larsmans @ogrisel @agramfort @jnothman

Thanks for tackling this @amueller!

@amueller amueller changed the title [MRG] Classifier and regressor tags [MRG + 1] Classifier and regressor tags Mar 24, 2015
@amueller
Copy link
Member Author

Thanks for the review :) I agree we should have more opinions, but I think it is a pretty clear-cut improvement.

@amueller
Copy link
Member Author

Argh, I need to write some docs though

@@ -266,6 +266,7 @@ def __repr__(self):
###############################################################################
class ClassifierMixin(object):
"""Mixin class for all classifiers in scikit-learn."""
estimator_type = "classifier"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should make this variable private: people not implementing an estimator should not be looking at it / modifying it. Only people coding a certain class should be touching it. This is pretty much the definition of an class-level private variable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the point of this tag system is to let third-party developers implement classes without inheriting from our base classes. Private variables can be changed. On the other hand, a third-party developer doesn't want his code to break because we changed variable names. We need to make this tag part of our API and commit to it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the point of this tag system is to let third-party developers implement
classes without inheriting from our base classes. Private variables can be
changed.

On the other hand, a third-party developer doesn't want his code to
break because we changed variable names. We need to make this tag part
of our API and commit to it.

Agreed, but it's going to make tab completion ugly, and it's going to
worry/confuse our users.

In a sense, this is the same thing as an add function, which is a
private variable. So in Python land, typing-related class information are
usually coded why private attributes, I would say.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the __ in add is just a way to emphasize that this is a special
method rather than a private one. If the goal is to not mess w/ tab
completion I am +1 with the _ prefix but the variable should really be
documented as part of our public API (i.e. we won't change it).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On 03/24/2015 08:49 PM, Mathieu Blondel wrote:

I think the __ in add is just a way to emphasize that this is a
special method rather than a private one. If the goal is to not mess
w/ tab completion I am +1 with the _ prefix but the variable should
really be documented as part of our public API (i.e. we won't change it).
+1
I documented it in the dev docs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amueller
Copy link
Member Author

I added an explanation of the estimator_type attribute to the "Roll your own estimator" docs because this is when people need to know about it.
Is there a documentation of is_classifier somewhere? Should it also go to the developer docs? If so, where?

@landscape-bot
Copy link

Code Health
Repository health increased by 0.00% when pulling 06f3b75 on amueller:classifier_regressor_tags into bc5acea on scikit-learn:master.

@landscape-bot
Copy link

Code Health
Repository health decreased by 0.01% when pulling 4db5b73 on amueller:classifier_regressor_tags into bc5acea on scikit-learn:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.03%) to 95.11% when pulling 4db5b73 on amueller:classifier_regressor_tags into bc5acea on scikit-learn:master.

@amueller
Copy link
Member Author

@landscape-bot doesn't like me accessing private attributes ^^

@amueller
Copy link
Member Author

I'm not sure where @mblondel's comment went. I agree the point is to make it accessible to third-party developers without inheritance. But I'm not sure why this means it can't be private.

:func:`cross_validation.cross_val_score` defaults to being stratified when used
on a classifier, but not otherwise. Similarly, scorers for average precision
that take a continuous prediction need to call ``decision_function`` for classifiers,
but ``predict`` for regressors. This destinction between classifiers and regressors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

distinction

@mblondel
Copy link
Member

Still +1 for merge on my side.


def is_classifier(estimator):
"""Returns True if the given estimator is (probably) a classifier."""
estimator = _get_sub_estimator(estimator)
return isinstance(estimator, ClassifierMixin)
return getattr(estimator, "_estimator_type", None) == "classifier"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we enforce that all estimators should have a _estimator_type tag?

def is_classifier(estimator):
    """Returns True if the given estimator is a classifier."""
    if not hasattr(estimator, "_estimator_type"):
        raise ValueError("The given estimator instance does not have a _estimator_type tag.")

    return estimator._estimator_type.lower() == "classifier"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would not work with user defined estimators currently, but this would be helpful in framing a generic estimator test framework as wished for in #3810

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then this should be in the test framework, not the code. So people that want to be strict can run their tests, but people that don't care can still run their sloppy but sklearn compatible code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense... Thanks for the comment :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@landscape-bot
Copy link

Code Health
Repository health decreased by 0.01% when pulling 2041386 on amueller:classifier_regressor_tags into bc5acea on scikit-learn:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.03%) to 95.11% when pulling 2041386 on amueller:classifier_regressor_tags into bc5acea on scikit-learn:master.

@@ -1075,6 +1076,7 @@ def _decision_function(self, X):
predict_stages(self.estimators_, X, self.learning_rate, score)
return score

@deprecated(" and will be removed in 0.19")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also staged_decision_function.

(Ping @pprett )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering about that, but you are right, it should be removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@landscape-bot
Copy link

Code Health
Repository health decreased by 0.01% when pulling c9a9f34 on amueller:classifier_regressor_tags into bc5acea on scikit-learn:master.

@mblondel
Copy link
Member

Shall we merge?

@amueller
Copy link
Member Author

I'll rebase (master changed). Maybe @ogrisel @GaelVaroquaux or @agramfort (that's what you get for commenting on issues) want to have a look.

@landscape-bot
Copy link

Code Health
Repository health decreased by 0.01% when pulling d89c215 on amueller:classifier_regressor_tags into e2dfd23 on scikit-learn:master.


Parameters
----------
X : array-like, shape = [n_samples, n_features]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X : array-like, shape (n_samples, n_features)

can you go over the docstrings to make sure we use this convention here too?

@agramfort
Copy link
Member

besides docstring nitpicks LGTM

@landscape-bot
Copy link

Code Health
Repository health decreased by 0.01% when pulling 188fb11 on amueller:classifier_regressor_tags into e2dfd23 on scikit-learn:master.

@amueller
Copy link
Member Author

amueller commented Apr 1, 2015

fixed the docstrings.

@amueller amueller changed the title [MRG + 1] Classifier and regressor tags [MRG + 2] Classifier and regressor tags Apr 1, 2015
@amueller
Copy link
Member Author

amueller commented Apr 2, 2015

ping @ogrisel @GaelVaroquaux maybe check if anyone at the sprint opposes, otherwise merge?

is implemented using the ``_estimator_type`` attribute, which takes a string value.
It should be ``"classifier"`` for classifiers and ``"regressor"`` for regressors,
to work as expected. Inheriting from ``ClassifierMixin`` or ``RegressorMixin`` will
set the attribute automatically.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add 'clusterer' here, as I see it is used in the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I defined it, but didn't use it. I can add it to the docs or remove it from the mixin, I wasn't sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GaelVaroquaux
Copy link
Member

Once the docs are updated to add clusterer, +1 to merge.

@amueller
Copy link
Member Author

amueller commented Apr 2, 2015

Done. Thanks for the reviews :) merging.

amueller added a commit that referenced this pull request Apr 2, 2015
[MRG + 2] Classifier and regressor tags
@amueller amueller merged commit aae305c into scikit-learn:master Apr 2, 2015
@landscape-bot
Copy link

Code Health
Repository health decreased by 0.01% when pulling acb21bb on amueller:classifier_regressor_tags into e2dfd23 on scikit-learn:master.

@amueller amueller deleted the classifier_regressor_tags branch April 2, 2015 15:21
@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Apr 2, 2015 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

make_scorer needs_threshold makes wrong assumptions
10 participants