
Implement estimator tags #6599

Closed
amueller opened this issue Mar 28, 2016 · 47 comments · Fixed by #8022

@amueller (Member) commented Mar 28, 2016

Currently the common tests hard-code many things, like support for multi-output, requiring positive input, or not allowing specific kinds of predictions.
That's bad design, but it is also a big problem for third-party packages that need to add to the conditions in the common tests (you need to add your estimator to the hard-coded list of estimators with a certain property).
See scikit-learn-contrib/py-earth#96 for an example.

Currently we have a very poor mechanism for distinguishing classifiers and regressors for similar purposes (but also for use when deciding the default cross-validation strategy): the estimator_type attribute. It allows only a single tag (classifier, transformer, regressor).

I think we should deprecate estimator_type and instead add a more flexible estimator_properties dictionary.
This will allow us to programmatically encode the assumptions of the algorithms (like vectorizers taking non-numeric data, or NB requiring non-negative data), as well as clean up our act with the tests.
People wanting to add to scikit-learn-contrib, and auto-sklearn-like projects (tpot), will appreciate that ;)

List of tags that I am coming up with (a rough sketch of the resulting dictionary follows the list):

  • supports sparse data
  • positive data only
  • supports missing data
  • semi-supervised
  • multi-output only
  • multi-label support
  • multi-output regression
  • multi-label multi-output
  • 1d input only
  • multi-class support (or maybe "no multi-class support"?)
  • needs fitting (or maybe "stateless"? though the GP doesn't need fitting but is not stateless)
  • input dtype / dtype conversions?
  • sparse matrix formats / conversions
  • deterministic?
  • label transformation (not for data)
  • special input format? like for CountVectorizer and DictVectorizer? Or maybe we want a field "supported inputs" that lists ndarray, sparse formats, list, strings, dicts?
  • required parameters ?
  • integer input / categorical input supported?
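
A minimal sketch of what such a dictionary could look like for a single estimator; every key below is a hypothetical placeholder, not settled API:

def estimator_properties():
    # Hypothetical sketch only: keys and values are illustrative placeholders.
    return {
        "sparse_support": True,            # accepts scipy.sparse input
        "positive_data_only": False,       # e.g. True for MultinomialNB
        "missing_data_support": False,
        "multilabel_support": False,
        "multioutput_regression": False,
        "stateless": False,                # needs fitting before predict
        "deterministic": True,
        "valid_inputs": ["ndarray", "csr_matrix"],
        "input_dtypes": ["float64", "float32"],
        "required_parameters": [],         # mandatory __init__ arguments
    }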

cc @GaelVaroquaux @ogrisel @mblondel @jnothman @mfeurer @rhiever

@amueller amueller added the API label Mar 28, 2016
@amueller amueller added this to the 0.18 milestone Mar 28, 2016
@mblondel (Member) commented Mar 28, 2016

+1. Now that we have the contrib, estimator tags are becoming very important.

For the specific case of check_estimator, estimator_properties could also be an argument of the function.
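
For illustration, a hypothetical call signature (check_estimator does not accept such an argument today, and MyEstimator stands in for any third-party estimator):

from sklearn.utils.estimator_checks import check_estimator

# Hypothetical keyword argument: let third-party packages declare properties
# at check time instead of patching hard-coded lists in the common tests.
check_estimator(MyEstimator, estimator_properties={"sparse_support": True})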

@mfeurer (Contributor) commented Mar 30, 2016

This is what we're using in auto-sklearn for multinomial naive Bayes (NB just as an example).

@staticmethod
def get_properties(dataset_properties=None):
    # DENSE, SPARSE, SIGNED_DATA and PREDICTIONS are constants defined in
    # auto-sklearn's pipeline code.
    return {'shortname': 'MultinomialNB',
            'name': 'Multinomial Naive Bayes classifier',
            'handles_regression': False,
            'handles_classification': True,
            'handles_multiclass': True,
            'handles_multilabel': True,
            'is_deterministic': True,
            'input': (DENSE, SPARSE, SIGNED_DATA),
            'output': (PREDICTIONS,)}

With this information, we are able to automatically construct all valid pipelines we can configure with auto-sklearn for a given dataset. What we don't have yet, but might be of interest, is a preferred input data structure and dtype, as well as the output dtype, to prevent unnecessary memory copies.

If this is what you're looking for, we (developers of auto-sklearn) would be very interested in this and can help on it or even work on it, depending on the amount of work and on how this feature should finally look.

cc @KEggensperger who coded a lot of properties and pipeline creation in auto-sklearn

@jnothman (Member) commented Mar 30, 2016

I think the auto-sklearn stuff is a good start, but we need to be careful to define the scope of what are useful and not useful properties to annotate. Obviously, things that affect test applicability are worth annotating somehow, but working out what exact vocabulary describes the (in)applicability of tests is non-trivial. It would, for instance, be much easier in some senses to have a way of saying "don't run test x for this estimator", but that needs to be updated as new tests are added, and it is not useful for the meta-model inspection used by the likes of auto-sklearn. What we include that is not relevant to testing exclusions needs to be limited, IMO.

@raghavrv (Member) commented May 4, 2016

Just noting that if we can collect all the names of the fit attributes for an estimator in such a tag, check_is_fitted could be simplified and refactored into the base estimator!
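
A rough sketch of that idea, assuming a hypothetical fit-attributes tag (all names here are placeholders):

from sklearn.exceptions import NotFittedError

class BaseEstimator(object):
    # Hypothetical tag: the attributes an estimator sets during fit,
    # e.g. ["coef_", "intercept_"] for a linear model.
    _fit_attributes = []

    def _check_is_fitted(self):
        missing = [attr for attr in self._fit_attributes
                   if not hasattr(self, attr)]
        if missing:
            raise NotFittedError("%s is not fitted yet (missing %s)."
                                 % (type(self).__name__, ", ".join(missing)))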

@jnothman (Member) commented Jul 14, 2016

Perhaps we should resolve to remove estimators one-by-one from sklearn.utils.testing.DONT_TEST!

@amueller (Member, Author) commented Jul 17, 2016

Yes, that variable needs to go.
I would like to add annotation that is more semantic than "don't test this".
Regression vs classification is already encoded in the class hierarchy, but maybe we want to encode it again?
I would definitely like to see something that tells us about sparse and dense support, possibly the existence of predict_proba and decision_function (which are tricky), and restrictions to positive data.

We should go through the tests and see what other special cases there are apart from positive data. The bicluster methods take n_samples x n_samples affinity matrices. Maybe we should actually just change the API for that (with a precomputed=False parameter)?

@amueller (Member, Author) commented Jul 17, 2016

With the categorical variables in trees coming soon, we might also want to indicate support for that.

@kmike (Contributor) commented Jul 22, 2016

It may be a long shot, but what about using PEP 484 type hints for that?

@amueller (Member, Author) commented Jul 27, 2016

@kmike how would that work? These properties are not types. We could create mixins that correspond to the properties and do inheritance, but I feel that would be abusing the type system for a simple annotation task.

@kmike (Contributor) commented Jul 27, 2016

@amueller maybe you're right; I haven't put much thought into how exactly it could be implemented. But handles_... and input / output from #6599 (comment) look a lot like type annotations of the corresponding method arguments and output values.

@amueller (Member, Author) commented Jul 28, 2016

Only that "non-negative float" is not a numpy type ;)

@jnothman (Member) commented Aug 2, 2016

And mypy generally does not support NumPy typing, as far as I can tell from comments like this one.

@amueller (Member, Author) commented Aug 29, 2016

Hm... having this for 0.18 might be a bit ambitious.

@jnothman (Member) commented Aug 29, 2016

Quite!

@jnothman (Member) commented Aug 29, 2016

I can't recall if it's been raised here, but one complication of estimator tags is that traits may depend on parameters (e.g. SGDClassifier can return probabilities conditional on the loss parameter). See also #7270 (comment)
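
To make the complication concrete, a toy sketch (the class and tag names are made up; this is not the actual SGDClassifier code):

class TaggedSGDClassifier(object):
    def __init__(self, loss="hinge"):
        self.loss = loss

    def _get_tags(self):
        # Only the probabilistic losses give SGDClassifier a predict_proba.
        return {"predict_proba": self.loss in ("log", "modified_huber")}

TaggedSGDClassifier(loss="log")._get_tags()    # {'predict_proba': True}
TaggedSGDClassifier(loss="hinge")._get_tags()  # {'predict_proba': False}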

@jnothman (Member) commented Aug 29, 2016

(And again, looking to type annotations as a saviour is not the answer. We're not dealing with types but with domain semantics.)

@amueller amueller modified the milestones: 0.19, 0.18 Aug 30, 2016
@amueller (Member, Author) commented Aug 30, 2016

Parameter-conditional behavior is certainly a tricky thing. We kinda tried to fake it with the delegate_has_attribute mechanism and duck typing. Will that allow us to go all the way? Probably not.

This somewhat ties into the discussion on default parameter grids, though they are very distinct use cases. I usually wouldn't search over different solvers when doing a grid-search, but that's exactly what we want here.

Half-baked proposal with two parts (see the sketch after the list):

  • Allow each class to "register" its own dict of parameters to explore for testing (in a decentralized place, probably outside of the class definition?) that the common tests can pick up on.
  • Have traits be a class method that takes parameters? Do we need to know them before instantiation of an object? Instantiation is cheap, so we could always just instantiate and then ask for the traits in a property or method. [However, meta-estimators cannot easily be instantiated, so maybe we want a class method.]
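
A minimal sketch of the first bullet, assuming a hypothetical module-level registry (none of these names exist in scikit-learn):

from sklearn.linear_model import SGDClassifier

# Hypothetical decentralized registry of per-estimator test parameters,
# living outside the class definitions.
_TEST_PARAMS = {}

def register_test_params(estimator_cls, param_grid):
    # Record which parameter settings the common tests should exercise.
    _TEST_PARAMS[estimator_cls] = param_grid

register_test_params(SGDClassifier,
                     {"loss": ["hinge", "log", "modified_huber"]})
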
@amueller (Member, Author) commented Aug 30, 2016

I think the goal should be that the common tests have decent test coverage on their own, btw.

@GaelVaroquaux (Member) commented Aug 30, 2016

> I think the goal should be that the common tests have decent test coverage alone btw.

+1: let's limit the goal, and hence go for an 80/20 (or rather a 95/5) solution. I am afraid that if we go for more ambitious goals we will get too much sophistication going on in here, and might still not reach a goal such as making things plug together without user understanding (because, intrinsically, machine learning doesn't work like that :D).

@amueller (Member, Author) commented Dec 7, 2016

@jnothman @GaelVaroquaux if you have input on my questions above, that'll help me with creating a prototype.

@jnothman (Member) commented Dec 8, 2016

I think saying the predict_proba trait is only reliable after fit (or only making it available then) is the best we can do within reason.

I'm not sure what other questions I've not answered, seeing as I want an instance method!

@amueller (Member, Author) commented Dec 8, 2016

@jnothman Well, but then basically all tags are only "reliable" after fit. The question is: what are they before fit? We definitely want to assess these properties before fitting. Should we indicate that they might change?

One of the main motivations for the tags right now is to decide which tests to apply. But we need to decide that before fitting, right? At least we need to know whether something is multi-output or not.

@amueller (Member, Author) commented Dec 8, 2016

And for most estimators, we know for sure. Basically the SearchCV estimators can change anything, but most tags are known before fitting in nearly all estimators.

@amueller (Member, Author) commented Dec 8, 2016

Updated the issue description with a tentative list of things we might need to make our tests agnostic...
I'm not entirely sure how we should encode the supported input formats.
We could do something like

if "ndarray" in tags['valid_inputs']

maybe?
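
Spelled out slightly, assuming the hypothetical valid_inputs tag from the list above:

tags = {"valid_inputs": ["ndarray", "csr_matrix"]}

if "ndarray" in tags["valid_inputs"]:
    pass  # run the dense-array common tests
if "csr_matrix" in tags["valid_inputs"]:
    pass  # run the sparse-input common tests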

@amueller (Member, Author) commented Dec 9, 2016

@jnothman for grid-search we can search over steps in a pipeline, so whether the estimator accepts sparse X can change.

@amueller (Member, Author) commented Dec 9, 2016

Playing devil's advocate, I just realized this is possible and does not result in an error:

from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_blobs

X, y = make_blobs()

# A classifier, a clusterer and a density estimator compete in one grid:
pipe = Pipeline([("thing", KMeans())])
grid = GridSearchCV(pipe, {"thing": [DecisionTreeClassifier(), KMeans(),
                                     KernelDensity()]})
grid.fit(X, y)

I'm not sure how I feel about that.

@amueller (Member, Author) commented Dec 9, 2016

@GaelVaroquaux for SVC, we actually know this before fit from the __init__ parameters. The biggest issue is actually GridSearchCV.

I'm actually wondering where, outside of GridSearchCV, we don't know important properties before fit. If that is the only problematic case, then maybe we should treat it (and RandomizedSearchCV and the other CV objects) differently.

@amueller referenced this issue Dec 9, 2016
@amueller (Member, Author) commented Dec 9, 2016

One more design question:
how do we feel about

class BaseEstimator(object):
    def _get_tags(self=None):
        return {}

This would allow _get_tags to be called with or without instantiation.
I have not really seen that pattern before, and I am not sure it is a reasonable one. I think declaring mandatory constructor parameters would be a good tag, but I can't access that without instantiating ...
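
With the self=None default, both call styles do work in Python 3 (a quick illustration):

BaseEstimator._get_tags()    # called on the class: self defaults to None
BaseEstimator()._get_tags()  # called on an instance: self is bound as usual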

@amueller (Member, Author) commented Dec 9, 2016

Hm that messes with super though... ugh...

@amueller (Member, Author) commented Dec 9, 2016

OK, current issue: some models are really bad (like the Dummy ones and the NB ones), and they don't fulfill the minimum iris accuracy required in the tests. Does it make sense to try to solve that with tags? Should we not test for reasonable accuracy? Should we allow estimators to opt out of certain tests?

@amueller (Member, Author) commented Dec 12, 2016

> Sparse support needs to be known prior to fit, regardless of whether the fitted model can change in its ability to do a sparse transform etc. Your argument about GridSearch doesn't say anything with respect to the ability to be fit on sparse data.

I'm not sure I follow. If I grid-search and the grid involves a model that supports sparse data and one that doesn't, then it's unclear whether best_estimator_ will support fitting on sparse data. I guess that is not really what we're interested in, though.

Arguably we are interested in whether fitting the grid-search on sparse data will succeed. That requires knowing whether all grid points support fitting on sparse data, which is tricky, but not impossible.
Basically we would need to iterate over the grid, instantiate the models, get the sparse support tag from all of them, and call all() on it (see the sketch below).
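
A minimal sketch of that combination, assuming a hypothetical _get_tags method with a "sparse" key:

from sklearn.base import clone
from sklearn.model_selection import ParameterGrid

def grid_supports_sparse_fit(estimator, param_grid):
    # Fitting the grid search on sparse data can only succeed if every
    # candidate configuration in the grid supports it.
    return all(
        clone(estimator).set_params(**params)._get_tags().get("sparse", False)
        for params in ParameterGrid(param_grid)
    )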

I guess I was more thinking about "will calling predict succeed on a sparse matrix", which we can only say for sure after fit.

> Yes, I think we'll need to let estimators opt out of certain tests.

I implemented that in #8022.

@jeremiedbb (Member) commented Jul 18, 2018

@amueller Is this the right place to discuss the list of tags?

The new imputer classes in sklearn.impute accept both numeric and non-numeric inputs (e.g. strings). It might be good to add a tag for whether an estimator accepts non-numeric inputs or not.

@amueller (Member, Author) commented Mar 12, 2019

@jeremiedbb Probably not, I'll open a new issue?
