
Helpers for managing different set of classes #8100

jnothman opened this issue Dec 22, 2016 · 16 comments



@jnothman jnothman commented Dec 22, 2016

It seems like half the bugs we've solved in the past couple of months surround problems of having different sets of classes, in the context of:

  • cross-validation splitting that yields different subsets of classes in different training or testing subsets (and hence issues in alignment of class-wise outputs from predict_proba, decision_function or metrics, or in normalising macro-averaged scores)
  • partial_fit where classes are specified upfront, but then repeated calls need matching to those classes
  • warm_start where classes_ from the first fit must be identical to the set of classes in y in each call to fit()

These are all subtly different problems, but at the moment it seems like we're handling them on an ad-hoc (and too often a post-hoc) basis.

It would be amazing if someone could review these issues and identify where either API changes (classes as a constructor parameter to classifiers has been suggested) or helper utilities might help avoid these issues in the future.
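For illustration, the first kind of misalignment is easy to reproduce (a minimal sketch using LogisticRegression; the split is hand-picked so one "fold" misses a class):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.arange(12).reshape(-1, 1).astype(float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

# Fold A's training data happens to contain all three classes...
clf_a = LogisticRegression().fit(X, y)
# ...while fold B's training data only contains classes 0 and 1.
clf_b = LogisticRegression().fit(X[:8], y[:8])

# The probability matrices have different widths, so column j does not
# refer to the same class in both folds -- naive stacking misaligns them.
print(clf_a.predict_proba(X[:1]).shape)  # (1, 3)
print(clf_b.predict_proba(X[:1]).shape)  # (1, 2)
print(clf_a.classes_, clf_b.classes_)
```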

@dalmia dalmia commented Jan 12, 2017

I'll take this up.

@jnothman jnothman commented Jan 12, 2017

I think it will be a challenge, @dalmia, but go ahead.

@dalmia dalmia commented Jan 12, 2017

Thanks for the heads up @jnothman. :)

@amueller amueller commented Mar 3, 2017

@dalmia are you still working on this?

@dalmia dalmia commented Mar 8, 2017

@amueller No Sir, please retain the 'Need Contributor' label.

@amueller amueller commented Jun 28, 2017

Ok so in #6231 we talked about adding classes as a construction parameter, which I think might be the best solution.

If we do that, what's the shape of coef_ in linear models? I guess it needs to reflect all classes to solve the warm-start issue.
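For reference, coef_ in the linear models currently reflects only the classes seen in y, which is what makes the warm-start case awkward (a minimal sketch using LogisticRegression):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.RandomState(0).randn(30, 4)

# Binary targets: coef_ collapses to a single row.
y_bin = np.array([0, 1] * 15)
coef_bin = LogisticRegression().fit(X, y_bin).coef_
print(coef_bin.shape)    # (1, 4)

# Three classes: one row of coefficients per class.
y_multi = np.array([0, 1, 2] * 10)
coef_multi = LogisticRegression().fit(X, y_multi).coef_
print(coef_multi.shape)  # (3, 4)
```

So a warm start whose second fit sees a different set of classes inherits a coef_ of the wrong shape.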

@amueller amueller commented Jul 7, 2017

Should we sort the classes that the user provides? That might be surprising to the user, but not doing it might lead to subtle bugs as we rely on that in some places, I think?

It seems a bit strange to me that get_params and set_params would return / set classes, since searching over it in a grid search is nonsensical, but I guess it's not the worst....

@amueller amueller commented Aug 11, 2017

Should we have attributes reflect the additional classes in all cases? I think so, because otherwise it will be very confusing. @aarshayj points out that that requires changing coef_ and intercept_ in the linear models, if we want them to correspond to the entries in decision_function.

@aarshayj aarshayj commented Aug 11, 2017

Yeah, I think we'll have to do this. For instance, if the user sets classes=[1, 2, 3] while y contains only [1, 2], then an SVM will return just one set of coefficients and a 1-d decision function, because the underlying library only uses the data in y. But in our API, we need to pad both coef_ and the decision function to cover all three classes.

The returned values are for the positive class (2), so we can fill in zeros for class (3) and fill in class (1) as the negative of the values for class (2).
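The padding convention described here can be sketched in plain numpy (expand_binary_decision is a hypothetical helper, not an existing scikit-learn function):

```python
import numpy as np

def expand_binary_decision(d, fitted_classes, all_classes):
    """Pad a binary decision function to one column per declared class.

    Follows the convention above: the 1-d scores `d` belong to the
    positive (second) fitted class, the negative class gets -d, and
    classes never seen in y get 0.
    """
    d = np.asarray(d, dtype=float)
    out = np.zeros((d.shape[0], len(all_classes)))
    neg, pos = fitted_classes
    out[:, list(all_classes).index(pos)] = d
    out[:, list(all_classes).index(neg)] = -d
    return out

scores = expand_binary_decision([1.5, -0.2], fitted_classes=[1, 2],
                                all_classes=[1, 2, 3])
print(scores)
# [[-1.5  1.5  0. ]
#  [ 0.2 -0.2  0. ]]
```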

@amueller amueller commented Aug 30, 2017

So I'm not entirely certain this is the right solution any more, because of issues we face in the implementation.

There are two or three main points:

  • Are we requiring classes in our API contract for third-party estimators?
  • If we don't, then issues like the one in cross_val_predict will remain for third-party estimators, and we end up where we started.
  • How much of the internal structure of the estimators should reflect the new classes? "coef_ et al." is a vague answer. For GradientBoostingClassifier, what should estimators_ be? Should we actually create trees for classes that don't exist? What should coef_ and intercept_ be? Ideally we want zero probabilities, which means -np.inf for the decision function and intercept, which is maybe not great.

If we used a wrapper/meta-estimator approach, these issues would not exist as we could wrap third-party estimators and not expose any internal data structures. I'm not sure if a wrapper would be the right solution to ensure consistent scoring, but it's probably possible.

Ping @jnothman @GaelVaroquaux @ogrisel @agramfort.
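The -np.inf concern can at least be checked numerically: under a softmax, a score of -inf does yield exactly zero probability (plain numpy, not scikit-learn code):

```python
import numpy as np

def softmax(z):
    # Shift by the row max for numerical stability; exp(-inf) underflows to 0.
    z = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Decision values for three declared classes; the third class was never
# seen in y, so its score is forced to -inf.
scores = np.array([[2.0, 0.5, -np.inf]])
probs = softmax(scores)
print(probs)  # third entry is exactly 0.0
```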

@amueller amueller commented Aug 30, 2017

Hm for the third-party estimators, I guess we don't have to require classes, and just raise an error if we want to combine multiple estimators that have different classes_ like in cross_val_predict. Basically someone only needs to implement classes if they want these edge cases to work. Still seems like a bit of a burden on the estimator author that we could avoid with a wrapper approach.
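If the wrapper route were taken, a minimal sketch might look like this (ClassesWrapper is a hypothetical name, and LogisticRegression stands in for any wrapped classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class ClassesWrapper:
    """Sketch of the wrapper idea: pin the output columns of any
    classifier to a user-declared, sorted list of classes, without
    touching the wrapped estimator's internals (coef_, estimators_, ...)."""

    def __init__(self, estimator, classes):
        self.estimator = estimator
        self.classes = np.asarray(classes)  # assumed sorted

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def predict_proba(self, X):
        proba = self.estimator.predict_proba(X)
        out = np.zeros((proba.shape[0], len(self.classes)))
        # Scatter the fitted classes' columns into the declared positions;
        # classes absent from y keep probability 0.
        for j, c in enumerate(self.estimator.classes_):
            out[:, np.searchsorted(self.classes, c)] = proba[:, j]
        return out

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])  # class 2 never appears in y
wrapped = ClassesWrapper(LogisticRegression(), classes=[0, 1, 2]).fit(X, y)
print(wrapped.predict_proba(X).shape)  # (4, 3), even though y has 2 classes
```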

@jnothman jnothman commented Aug 31, 2017

are we requiring classes in our API contract for third-party estimators?

In short, yes. Eventually. We need to maintain the classes_ mismatch handling in cross_val_predict for now in any case.

We don't end up where we started in that we have a way for metaestimators to pass classes to supporting classifiers (and scorers), and that we remove the problem in partial_fit for us and for third parties who adopt the classes parameter. But yes, when the user has not specified classes, weird errors may be raised in partial_fit still (unless protected by common tests).

A wrapper approach for meta-estimators interpreting the output of decision_function? Okay, but it won't work for interpreting coef_, will it? It won't work for passing classes to scorers, but I suppose we can make that the user's problem as we currently do.

I do see your problems about making the models represent all the classes in a consistent way. I'm not yet sure what the solution is.
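The classes-upfront pattern in partial_fit referred to above — declare every class on the first call so later batches may omit some — already exists, e.g. in SGDClassifier (a minimal sketch):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 1, 1, 2, 2])

clf = SGDClassifier(random_state=0)
# All classes must be declared on the first call, since later batches
# are not guaranteed to contain every class.
clf.partial_fit(X[:4], y[:4], classes=[0, 1, 2])
clf.partial_fit(X[4:], y[4:])  # this batch contains only class 2: fine
print(clf.classes_)                        # [0 1 2]
print(clf.decision_function(X[:1]).shape)  # (1, 3)
```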
