check_estimator not valid for vectorizers #12491

Open
ysig opened this issue Oct 30, 2018 · 6 comments
@ysig commented Oct 30, 2018

check_estimator is a really powerful tool for verifying the scikit-learn compatibility of estimators through automated checks.

However, it prevents estimators that expect not numpy arrays of shape [n_samples, n_features] but a vector of objects as input (i.e. vectorizers) from using this tool.

Having been redirected here from my issue in sklearn-template, I think it would be really helpful if all the constant input needed inside check_estimator could be collected as the values of an object, or produced as the return values of a function that the user of check_estimator could define, following a certain specification.

One could argue against this on the grounds that any Transformer requires numpy arrays of shape [n_samples, n_features] as input. But if we look at a very common Transformer in sklearn, namely the Tf-idf vectorizer (TfidfVectorizer), we can see that its input is not an np.array.
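For concreteness, a minimal illustration (the documents here are arbitrary):

from sklearn.feature_extraction.text import TfidfVectorizer

# The vectorizer consumes an iterable of raw documents, not a
# 2D numeric array of shape [n_samples, n_features]:
docs = ["the quick brown fox", "jumps over the lazy dog"]
X = TfidfVectorizer().fit_transform(docs)
print(X.shape)  # (2, vocabulary size), returned as a sparse matrix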

To sum up, as scikit-learn aims to become a more and more general library that can standardize a template for developing machine learning packages in Python, it would be essential for the automated consistency checks of its basic object, the Estimator, to support various input formats.

Thank you for your attention!

@rth (Member) commented Oct 30, 2018

That is a known limitation. #11622 might provide a short-term solution to this (the long-term one would be estimator tags).

A PR that allows optionally disabling certain checks of check_estimator, as outlined in the above issue, would be welcome.

@ysig (Author) commented Oct 30, 2018

I think it is not about disabling checks but transforming them.

For example, when you want to check that an estimator pickles, the whole test depends on the input initialization:

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=30, centers=[[0, 0, 0], [1, 1, 1]],
                  random_state=0, n_features=2, cluster_std=0.1)

By studying other checks (in order to adapt them to my package), I noticed that most of the time the required changes are all about changing the input. Considering that this input is either random/arbitrary or follows a certain pattern, it can in any case be produced as the result of a function, or as the values of an object parametrized for the given estimator.
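For instance, the pickling check above could take the input factory as an argument. A hypothetical sketch (check_pickles and make_text_data are illustrative names, not existing scikit-learn API):

import pickle

import numpy as np

def check_pickles(estimator, make_data):
    # The check itself stays generic; the caller supplies the input.
    X, y = make_data()
    estimator.fit(X, y)
    restored = pickle.loads(pickle.dumps(estimator))
    # The fitted state should survive the round trip
    # (sparse output assumed here, as for text vectorizers).
    before = estimator.transform(X).toarray()
    after = restored.transform(X).toarray()
    assert np.allclose(before, after)

def make_text_data():
    # For a text vectorizer the factory returns raw documents
    # instead of make_blobs output.
    return ["first document", "second document", "third one"], None

check_pickles(TfidfVectorizer(), make_text_data) would then exercise the same logic as the numeric case, with no hard-coded input.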

Also, the complexity of these checks doesn't allow someone inexperienced to generalize them; at best they can make them suit their needs, which in the end may not yield a sklearn-compatible solution.

That's why generalization seems a much better idea (if possible).
What are the estimator tags?

@amueller (Member)

@rth how does #11622 solve this?

FYI this is the PR @rth mentioned: #8022

@amueller (Member)

The current PR doesn't actually implement running the checks for text, but it adds the mechanisms for it. There's an input-type tag that will be text for a vectorizer, but the tests in the PR will simply not run checks that don't want numpy array input (see the sketch below).
A next step would be to generate the test data conditionally on the model, as you suggest.
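A rough sketch of declaring such a tag on a custom estimator (the _more_tags / "X_types" spelling follows what the tag mechanism later became in scikit-learn; treat the exact names as an assumption here):

from sklearn.base import BaseEstimator, TransformerMixin

class MyVectorizer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # X is an iterable of raw documents, not a 2D numeric array.
        return self

    def transform(self, X):
        return X

    def _more_tags(self):
        # Declares that this estimator consumes raw text, so the
        # common checks can skip or adapt numeric-array-based tests.
        return {"X_types": ["string"]}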

@rth (Member) commented Oct 30, 2018

@rth how does #11622 solve this?

Well, given that e.g. check_estimator(CountVectorizer) fails, if individual checks were exposed, a crude first step would be to at least skip those that fail for CountVectorizer and keep e.g. the pickling checks.
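For reference, the failing call mentioned above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils.estimator_checks import check_estimator

# Fails, because the common checks feed 2D numeric arrays to an
# estimator that expects an iterable of raw documents.
check_estimator(CountVectorizer)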

I agree, though, that estimator tags would be more general.

I think it is not about disabling checks but transforming them.
By studying other checks (in order to adapt them to my package), I noticed that most of the time the required changes are all about changing the input.

Well, the problem is that this would add quite a lot of complexity. Most checks assume that the input is a 2D array and the output is 1D or 2D. We are now considering text vectorizers, but in the same way one could say: well, I use ML on time series, or a recommendation algorithm, or metric learning, and my input data is not a 2D (n_samples, n_features) matrix, but I would still like to use check_estimator to check consistency with the scikit-learn API to the extent possible (i.e. consistency between fit + predict / fit_predict, picklability, etc.). I hope estimator tags will help, but in general it doesn't sound that simple to generalize the current estimator checks to work in all possible use cases.

You could say that text vectorizers are in scope (because they are in scikit-learn) and that the rest is not, as far as estimator checks are concerned, but I don't see a fundamental difference between this and anything else that's not the usual 2D array input.

@jnothman (Member) commented Oct 31, 2018 via email
