Issue with CalibratedClassifierCV with multiclass classification problems #18709

glemaitre opened this issue Oct 29, 2020 · 4 comments

glemaitre (Member) commented Oct 29, 2020

While reviewing #17856, @ogrisel, @lucyleeow and I found some weird things going on in CalibratedClassifierCV.

EDIT by Olivier: in particular, we found out that our existing multiclass calibration test was very brittle and had a high likelihood of failing when changing the random seed.

The paper used as a reference is the following:

Zadrozny, Bianca, and Charles Elkan. "Transforming classifier scores into accurate multiclass probability estimates." Proceedings of the Eighth ACM SIGKDD international conference on Knowledge discovery and data mining. 2002.

The issues are linked with the way to combine probabilities in multiclass settings.

Issue with classifiers natively supporting the multiclass problem

The paper proposes tackling the multiclass problem as a set of binary problems. However, classifiers that natively support the multiclass problem, i.e. without using one-vs-rest, are not decoupled into one-vs-rest binary problems. Instead, we calibrate the probabilities that the classifier outputs directly. So we don't implement what is written in the reference paper.
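
A minimal sketch of the behaviour described above, assuming a held-out calibration split and isotonic calibration; the helper names (`calibrators`, `calibrated_proba`) are illustrative and this is not the actual implementation:

```python
# Sketch: the base classifier is fit natively on the multiclass problem,
# then each of its probability columns is calibrated in a one-vs-rest
# fashion (isotonic regression here) and renormalized.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           random_state=0)
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_fit, y_fit)
proba_cal = clf.predict_proba(X_cal)

# One calibrator per class, fit on that class's probability column
# against the one-vs-rest binary indicator.
calibrators = [
    IsotonicRegression(out_of_bounds="clip").fit(
        proba_cal[:, k], (y_cal == k).astype(float))
    for k in range(proba_cal.shape[1])
]

def calibrated_proba(X_new):
    raw = clf.predict_proba(X_new)
    cal = np.column_stack([c.predict(raw[:, k])
                           for k, c in enumerate(calibrators)])
    denom = cal.sum(axis=1, keepdims=True)
    denom[denom == 0] = 1.0  # guard against all-zero rows
    return cal / denom
```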

Issue with the performance of one-vs-rest and normalization

The paper describes three strategies to handle the multiclass case:

  1. train one-vs-one binary classifiers and use "coupling" for merging binary probabilities;
  2. train one-vs-one binary classifiers and use "least-squares" minimization for merging binary probabilities;
  3. train one-vs-rest binary classifiers and normalize the probabilities to sum to 1.

We implement strategy 3 (a sketch follows below). However, we should revisit this approach with extensive testing and reproduce the experiments shown in the paper using the Brier score (MSE).
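
A hedged sketch of strategy 3, assuming Platt (sigmoid) calibration of each one-vs-rest binary problem; it mirrors the strategy as stated in the paper rather than reproducing our exact internals:

```python
# Strategy 3: one calibrated binary classifier per class (one-vs-rest),
# then normalization of the per-class probabilities to sum to 1.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           random_state=0)

binary_models = []
for k in np.unique(y):
    # Platt (sigmoid) calibration of each one-vs-rest binary problem.
    clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
    binary_models.append(clf.fit(X, y == k))

# Column k holds the calibrated P(y == k); renormalize row-wise.
probs = np.column_stack([m.predict_proba(X)[:, 1] for m in binary_models])
probs /= probs.sum(axis=1, keepdims=True)
```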

ogrisel (Member) commented Oct 30, 2020

More details about the existing OvR case: we found out that a naive softmax normalization of the raw decision function of the OvR LinearSVC was often competitive, in terms of log loss, with Platt / isotonic calibration of the binary classifiers followed by a simple probs.sum(axis=1) normalization. Maybe the choice of the log loss is the source of the problem (see below), but after re-reading the main paper we reference for multiclass calibration, Transforming Classifier Scores into Accurate Multiclass Probability Estimates by Bianca Zadrozny and Charles Elkan, I have the feeling that the subject of multiclass calibration has not been properly investigated. In particular, there are only 2 toy experiments in the paper (pendigits and 20 newsgroups), with only Naive Bayes and boosted NB as base classifiers. Furthermore, the theoretical justification for binary calibration (OvR or other) followed by probs.sum(axis=1) normalization seems very weak.
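
A sketch of this comparison on a synthetic dataset (not the datasets we actually benchmarked, so the exact numbers will differ):

```python
# Compare softmax over the raw OvR decision function against sigmoid
# calibration followed by sum normalization, both scored with log loss.
from scipy.special import softmax
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=4000, n_classes=3, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: naive softmax of the raw OvR decision function.
svc = LinearSVC().fit(X_train, y_train)
proba_softmax = softmax(svc.decision_function(X_test), axis=1)

# Sigmoid calibration; the sum normalization happens internally.
cal = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
proba_cal = cal.fit(X_train, y_train).predict_proba(X_test)

print("softmax of decision_function:", log_loss(y_test, proba_softmax))
print("sigmoid calibration + renorm:", log_loss(y_test, proba_cal))
```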

I also think using the log loss to evaluate multiclass calibration was not necessarily a good idea: for imperfectly calibrated models, there is no guarantee that the model with the "best" (but imperfect) calibration has the lowest log loss. I changed the multiclass calibration test in deb75fc to use a multiclass version of the Brier loss, which seems slightly more stable (e.g. when changing the seed), but it also seems to be an imperfect calibration metric (see #10883). Maybe we should try to extend the Expected Calibration Error (#11096) to the multiclass setting, but I am not sure whether this is common practice or not.
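
For reference, one common multiclass generalization of the Brier score (the variant used in deb75fc may differ in normalization constants), assuming integer labels 0..K-1:

```python
# Mean squared distance between the predicted probability vector and the
# one-hot encoded label.
import numpy as np

def multiclass_brier_score(y_true, proba):
    y_true = np.asarray(y_true)
    one_hot = np.zeros_like(proba)
    one_hot[np.arange(len(y_true)), y_true] = 1.0
    return np.mean(np.sum((proba - one_hot) ** 2, axis=1))
```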

Another baseline we could compare to: stacking the uncalibrated model with a multinomial logistic regression or multinomial gradient boosting, possibly with positivity constraints (for LR) or monotonicity constraints (for GBRT).
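
A minimal sketch of this stacking baseline, assuming a held-out split for the second stage; the constrained variants mentioned above are not shown:

```python
# Train a multinomial logistic regression on the uncalibrated model's
# probabilities (computed on a held-out split) as a second-stage
# calibrator.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_classes=3, n_informative=6,
                           random_state=0)
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, random_state=0)

base = RandomForestClassifier(random_state=0).fit(X_fit, y_fit)
stacker = LogisticRegression().fit(base.predict_proba(X_cal), y_cal)

def calibrated_proba(X_new):
    # Pipe the uncalibrated probabilities through the second stage.
    return stacker.predict_proba(base.predict_proba(X_new))
```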

If the latter baseline proves to work in extensive benchmarks with various base classifiers on various multiclass classification datasets, it would be worth documenting it in an example and presenting this strategy as an alternative to CalibratedClassifierCV in the user guide.

ogrisel (Member) commented Oct 30, 2020

Mentioning @lucyleeow @dsleo @samronsin as you might be interested in this and may want to share your own insights.

ogrisel (Member) commented Oct 30, 2020

We could even introduce an additional temperature hyperparameter in the multinomial loss of LR / HGBRT. It would be set to 1.0 by default to recover regular LR, but could be grid-searched with a multiclass ECE when those models are used as second-stage calibrators (instead of relying on regularization alone).
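
A closely related, well-known post-hoc variant of this idea is temperature scaling (Guo et al., 2017), where a single scalar T rescales the logits and is tuned on held-out data. A sketch using grid search on the log loss (a multiclass ECE could be substituted as the search metric, as suggested above); it assumes logits_val has shape (n_samples, n_classes) and y_val covers all classes:

```python
# Grid-search the temperature T minimizing the validation log loss.
import numpy as np
from scipy.special import softmax
from sklearn.metrics import log_loss

def best_temperature(logits_val, y_val, grid=np.logspace(-1, 1, 41)):
    losses = [log_loss(y_val, softmax(logits_val / T, axis=1))
              for T in grid]
    return grid[int(np.argmin(losses))]
```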

ogrisel (Member) commented Oct 30, 2020

Cross-referencing a survey of our community on Twitter: https://twitter.com/ogrisel/status/1322119718334013443 with very relevant references in the replies.
