# [MRG] Implement calibration loss metrics #11096

Open · wants to merge 46 commits · +413 −25

## Conversation

Contributor

### samronsin commented May 15, 2018 • edited

 See discussion on issue #10883. This PR implements calibration losses for binary classifiers. It also updates the calibration documentation, in particular correcting inaccurate references to the Brier score.
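To make the metric concrete, here is a minimal sketch of a binned calibration error for binary classifiers. It is illustrative only (the function name, signature and binning details are assumptions, not the code added by this PR): predictions are grouped into equal-width bins, the gap between the mean predicted probability and the observed positive rate is computed per bin, and the gaps are aggregated.

```python
import numpy as np


def binned_calibration_error(y_true, y_prob, n_bins=10, reducer="sum"):
    """Illustrative binned calibration error (not the PR's implementation)."""
    # Assign each prediction to one of n_bins equal-width bins on [0, 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])  # values in {0, ..., n_bins - 1}

    gaps, weights = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if not np.any(mask):
            continue  # skip empty bins
        # Gap between mean predicted probability and observed positive rate.
        gaps.append(abs(y_prob[mask].mean() - y_true[mask].mean()))
        weights.append(mask.mean())  # fraction of samples falling in this bin

    gaps, weights = np.asarray(gaps), np.asarray(weights)
    if reducer == "sum":   # weighted sum of gaps, an ECE-style loss
        return float(np.sum(weights * gaps))
    if reducer == "max":   # worst bin, an MCE-style loss
        return float(np.max(gaps))
    raise ValueError("reducer must be 'sum' or 'max'")


# A well-calibrated synthetic model should score close to 0.
rng = np.random.RandomState(0)
p = rng.rand(10000)
y = (rng.rand(10000) < p).astype(float)
print(binned_calibration_error(y, p, n_bins=10, reducer="sum"))
```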
added 2 commits May 15, 2018
 Implement calibration loss metric (WIP) 
 7e20414 
 Add reference and replace center of bin with centroid of predictions within bin 
 8206172 
added 2 commits May 15, 2018
 Fix docstring examples and misc PEP8 issues 
 3282fe6 
 Fix docstring examples 
 aa01c7b 
reviewed
Member

### jnothman left a comment

 You will also need to update:
 - doc/modules/classes.rst
 - doc/modules/model_evaluation.rst
 - sklearn/metrics/tests/test_common.py
reviewed
Member

### jnothman left a comment

 Should this be available as a scorer for cross validation? Should it be available for CalibratedClassifierCV?
     sample_weight=None, pos_label=None):
     """Compute calibration loss.
     Across all items in a set N predictions, the calibration loss measures

#### jnothman May 15, 2018

Member

We usually put most of this detail in the user guide, not in API reference

#### samronsin May 16, 2018

Author Contributor

Thanks for the suggestion, I'll move it there then.

 reducer: string, must be among 'l1', 'l2', 'max'
     Aggregation method.
 nbins: int, positive, optional (default=10)

#### jnothman May 15, 2018

Member

Need space before colon. I also think n_bins or just bins might be more conventional.

     If true, return the mean loss per sample.
     Otherwise, return the sum of the per-sample losses.
 sample_weight : array-like of shape = [n_samples], optional

#### jnothman May 15, 2018

Member

I'd put this up next to y_prob

 y_prob : array, shape (n_samples,)
     Probabilities of the positive class.
 reducer: string, must be among 'l1', 'l2', 'max'

#### jnothman May 15, 2018

Member

Below you have 'sum'

 loss = 0.
 count = 0.
 for x in np.arange(0, 1, step_size):
     in_range = (x <= y_prob) & (y_prob < x + step_size)

#### jnothman May 15, 2018

Member

You could get the bin for each y_prob with searchsorted, then use bincount (or ufunc.at) to get weighted sums

#### samronsin May 16, 2018

Author Contributor

Indeed, I was contemplating this option, or use of np.digitize and np.bincount as in the calibration_curve function.

#### jnothman May 16, 2018

Member

For a small number of bins, digitize is faster than searchsorted, but otherwise, which is chosen is immaterial
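For reference, a sketch of that vectorized approach (illustrative; the helper name and exact edge handling are assumptions): bin each prediction once, then accumulate weighted per-bin sums with np.bincount instead of looping over bins.

```python
import numpy as np


def weighted_bin_sums(y_true, y_prob, sample_weight, n_bins=10):
    # One bin id per sample, computed with digitize on the interior edges.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])

    # Per-bin weighted totals, with no Python-level loop over bins.
    weight_sums = np.bincount(bin_ids, weights=sample_weight, minlength=n_bins)
    true_sums = np.bincount(bin_ids, weights=sample_weight * y_true, minlength=n_bins)
    prob_sums = np.bincount(bin_ids, weights=sample_weight * y_prob, minlength=n_bins)
    return weight_sums, true_sums, prob_sums
```

Dividing `true_sums` and `prob_sums` by `weight_sums` (where non-zero) gives the per-bin observed positive rate and the prediction centroid discussed above.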

 delta_count = (in_range * sample_weight).sum()
 avg_pred_true = ((in_range * y_true * sample_weight).sum()
                  / float(delta_count))
 bin_centroid = (in_range * y_prob).sum() / float(delta_count)

#### jnothman May 15, 2018

Member

Sums of element-wise products can be calculated with np.dot
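A small self-contained check of that identity (the variable names mirror the quoted snippet and are otherwise arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=100).astype(float)
y_prob = rng.rand(100)
sample_weight = rng.rand(100)
in_range = (0.2 <= y_prob) & (y_prob < 0.3)

# Sum of an element-wise product expressed as a dot product.
a = (in_range * y_true * sample_weight).sum()
b = np.dot(in_range * sample_weight, y_true)
assert np.isclose(a, b)
```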

 Fix misc docstring stuff + add test on sample weights + fix bug on sample weights 
 76ba581 
Member

### jnothman commented May 16, 2018

 You don't need to test sample weights as the common tests will
added 4 commits May 19, 2018
 Switch loss computation to algorithm based on sorting 
 d71da9e 
 Fix bug in test with sample_weight 
 2f2742a 
 Switch from Brier score to calibration loss in tests of isotonic / sigmoid calibration methods 
 52d6fe3 
 Update doc 
 7c69ae9 
Member

### amueller commented May 19, 2018

 I'm not sure it makes sense to have this as a ready-made scorer, so I might want to hold off on that. It should probably be mentioned in the calibration docs and examples.
Member

### amueller commented May 19, 2018 • edited

 Can you maybe also add https://www.math.ucdavis.edu/~saito/data/roc/ferri-class-perf-metrics.pdf to the references and maybe include some of the discussion there about this loss? In particular, this is one of two calibration losses they discuss, so maybe we should be more specific with the name?
 Fix misc formating 
 a69160e 
Contributor Author

### samronsin commented May 20, 2018

 Actually I did not implement any of the losses from the paper you mention @amueller, as I take non-overlapping bins instead of the sliding window used in CalB. CalB would totally make sense in this PR, although:

- n_bins would have a very different meaning, e.g. the number of non-overlapping bins
- taking sample_weight into account seems non-trivial because of the sliding window
- implementing CalB efficiently would actually be non-trivial in itself since it is based on a sliding window; a greedy implementation would be quite a drag on performance compared to non-overlapping binning
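To illustrate the sliding-window idea being discussed, here is a rough sketch (an illustration of the concept under simple assumptions, not the CalB definition from Ferri et al. nor the code added in the next commit): sort by predicted probability, slide a fixed-size window, and average the per-window gap between mean prediction and mean outcome.

```python
import numpy as np


def sliding_window_calibration_error(y_true, y_prob, window=100):
    # Sort samples by predicted probability so each window covers similar scores.
    order = np.argsort(y_prob)
    y_true = np.asarray(y_true, dtype=float)[order]
    y_prob = np.asarray(y_prob, dtype=float)[order]

    n = len(y_prob)
    gaps = [
        abs(y_prob[i:i + window].mean() - y_true[i:i + window].mean())
        for i in range(n - window + 1)  # overlapping windows, hence the cost
    ]
    return float(np.mean(gaps))
```

The overlap is also why sample weights and performance are harder to handle than with non-overlapping bins, as noted above.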
 Add sliding window implementation (calB from Ferri et al.) + doc 
 b6846af 
Contributor Author

### samronsin commented May 25, 2018

 I ended up implementing the calB loss suggested by @amueller, but did not provide support for sample weights. Also, I'll be happy to take suggestions regarding its implementation.
 Fix misc formatting 
 867a46d 
reviewed
Member

### jnothman left a comment

 Sorry again for not being in a position to review the main content...
 Therefore, the lower the calibration loss is for a set of predictions,
 the better the predictions are calibrated.
 The aggregation method can be either:
 - 'sum' for :math:`\sum_k P_k \delta_k`, denoted as expected calibration error

#### jnothman May 28, 2018

Member

I haven't checked the rendering, but I am pretty sure you need blank lines before and after this list.
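In the notation of the quoted paragraph, with P_k the fraction of samples in bin k and δ_k the gap between the bin's mean predicted probability and its observed positive rate, the two aggregations would read as follows (a sketch of the intended formulas, not the rendered documentation):

```latex
\delta_k = \left| \bar{p}_k - \bar{y}_k \right|, \qquad
\mathrm{ECE} = \sum_k P_k \, \delta_k \quad (\text{'sum'}), \qquad
\mathrm{MCE} = \max_k \delta_k \quad (\text{'max'})
```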

     sliding_window=False, normalize=True, pos_label=None):
     """Compute calibration loss.
     Across all items in a set N predictions, the calibration loss measures

#### jnothman May 28, 2018

Member

Please remove or abridge the description rather than duplicate the guide

     provided sufficient data in bins.
 sliding_window : bool, optional (default=False)
     If true, compute the

#### jnothman May 28, 2018

Member

Unfinished

 of the actual outcome.
 Therefore, the lower the calibration loss is for a set of predictions,
 the better the predictions are calibrated.
 The aggregation method can be either:

#### jnothman May 28, 2018

Member

Blank line before this

added 5 commits Jul 16, 2018
 Merge branch 'master' into calibration-loss 
 c81e9ec 
 Improve wording on calibration loss with sliding window 
 f150c57 
 Fix and update doc 
 554b4f4 
 Fix indenting 
 1b1dd39 
 WIP: improve doc 
 197f456 
reviewed
 @@ -62,6 +62,7 @@ Scoring Function
 'balanced_accuracy'   :func:`metrics.balanced_accuracy_score`   for binary targets
 'average_precision'   :func:`metrics.average_precision_score`
 'brier_score_loss'    :func:`metrics.brier_score_loss`
 'calibration_loss'    :func:`metrics.calibration_loss`

#### agramfort Jul 16, 2018

Member

to respect the convention higher is better we should maybe call it

neg_calibration_error

thoughts anyone?
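If it does become a scorer, one way to respect that convention would be make_scorer with greater_is_better=False, which negates the value (a sketch; calibration_loss is only importable with this PR applied, and the naming follows the suggestion above):

```python
from sklearn.metrics import make_scorer
from sklearn.metrics import calibration_loss  # added by this PR, not in released sklearn

# greater_is_better=False flips the sign so that model selection maximises the scorer;
# needs_proba=True because the metric consumes predicted probabilities.
neg_calibration_error = make_scorer(calibration_loss, greater_is_better=False,
                                    needs_proba=True)
```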

reviewed
 def calibration_loss(y_true, y_prob, sample_weight=None, reducer="sum",
                      n_bins=10, sliding_window=False, normalize=True,
                      pos_label=None):
     """Compute calibration loss.

#### agramfort Jul 16, 2018

Member

2 spaces before calibration

#### agramfort Jul 16, 2018

Member

reading the suggested papers is there a recommended default for the parameters?
for example I would expect the sliding_window=True option to be recommended
by default.

reviewed
 y_prob : array, shape (n_samples,)
     Probabilities of the positive class.
 sample_weight : array-like of shape = [n_samples], optional

#### agramfort Jul 16, 2018

Member

array-like of -> array-like,

#### agramfort Jul 16, 2018

Member

actually:

sample_weight : array-like, shape (n_samples,), optional

requested changes
 sample_weight : array-like of shape = [n_samples], optional
     Sample weights.
 reducer : string, must be among 'sum', 'max'

#### agramfort Jul 16, 2018

Member

reducer : 'sum' | 'max'

     If true, return the mean loss per sample.
     Otherwise, return the sum of the per-sample losses.
 pos_label : int or str, default=None

#### agramfort Jul 16, 2018

Member

pos_label : int or str, optional (default=None)


 @@ -2007,3 +2007,148 @@ def brier_score_loss(y_true, y_prob, sample_weight=None, pos_label=None):
     y_true = np.array(y_true == pos_label, int)
     y_true = _check_binary_probabilistic_predictions(y_true, y_prob)
     return np.average((y_true - y_prob) ** 2, weights=sample_weight)

 def calibration_loss(y_true, y_prob, sample_weight=None, reducer="sum",

#### agramfort Jul 16, 2018

Member

Is brier_score a subcase of what calibration_loss can do? If so, the code should be factorized so that brier_score calls calibration_loss with the correct options.

#### samronsin Jul 17, 2018

Author Contributor

To conclude on the experiments in this gist, I'd say that defaults that are close to the calibration curve (sliding_window=False and bin_size_ratio=0.1) would be good both in terms of metric quality (it measures calibration accurately) and consistency between the loss and the curve.

#### samronsin Jul 17, 2018

Author Contributor

Also, the Brier score is not quite a subcase of any of these metrics (this would require squaring the differences with sliding_window=True and bin_size_ratio=1./N).
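Written out, the distinction being made (my reading of the comment, using the same bin notation as above):

```latex
\mathrm{Brier} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{p}_i)^2
\qquad \text{vs.} \qquad
\mathrm{CE}_{\mathrm{binned}} = \sum_k P_k \, \lvert \bar{p}_k - \bar{y}_k \rvert
```

Only in the degenerate limit of one sample per bin, with squared instead of absolute differences, do the two coincide, which is the sliding_window=True, bin_size_ratio=1./N case mentioned above.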

added this to PR phase in Andy's pets Jul 17, 2019
Member

### amueller commented Sep 4, 2019

 Also see #12479
Member

### amueller commented Dec 11, 2019

 There's an interesting discussion of debiasing the calibration error in https://arxiv.org/pdf/1909.10155.pdf That's a current NeurIPS paper but the method they are discussing is actually already established, so it might be a good candidate. cc @thomasjpfan who has shown interest.
Member

### agramfort commented Dec 11, 2019

 I saw the spotlight this morning too. This indeed appears to be a good solution to our problem of calibration metrics.
Contributor Author

### samronsin commented Dec 13, 2019

 Thanks @amueller for the reference, I'll have a look asap.
Contributor Author

### samronsin commented Jan 27, 2020

 I eventually read the paper by Kumar, Liang and Ma, which triggered a few questions:

- The convergence results of the debiased estimator in Section 5 only apply to binned calibrators with a fixed number of bins IIRC. If we want to stick to their result, this implies a fixed number of (non-overlapping) bins. I didn't find a useful indication on the choice of the number of bins though, so that question remains open... I plan on running the experiments described above in this thread with the "debiased estimator" during the sprint this week and seeing how it fares on the toy models.
- They argue that bin-based calibration errors on continuous (non-binned) predictors are biased (Section 3), and thus advocate for binned calibrators. So should we implement the algorithm they describe (in Section 4), basically binning calibrators according to the quantiles (not clear how many tiles) of the empirical distribution of predicted probabilities? This would probably mean a different PR.
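A hypothetical sketch of the quantile binning mentioned in the second point above (the helper name and details are assumptions, and the paper's actual procedure may differ): place bin edges at empirical quantiles of the predicted probabilities so that the bins are roughly equally populated.

```python
import numpy as np


def quantile_bin_edges(y_prob, n_bins=10):
    # Bin edges at empirical quantiles, so each bin holds ~len(y_prob)/n_bins samples.
    edges = np.quantile(y_prob, np.linspace(0.0, 1.0, n_bins + 1))
    return np.unique(edges)  # ties can produce duplicate edges; drop them


rng = np.random.RandomState(0)
p = rng.beta(2, 5, size=1000)          # skewed predicted probabilities
edges = quantile_bin_edges(p, n_bins=10)
bin_ids = np.digitize(p, edges[1:-1])
print(np.bincount(bin_ids))            # roughly equal counts per bin
```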
added 4 commits Jan 30, 2020
 Merge branch 'master' into calibration-loss 
 2bd50f8 
 Merge branch 'master' of https://github.com/scikit-learn/scikit-learn into calibration-loss 
 1f50b60 
 Fix merge 
 81356c3 
 Fix merge 
 b7f95a7 
Member

### NicolasHug commented Jan 30, 2020

 LMK when this is ready for a review
 Check values outside of [0, 1] range 
 3724220 
Contributor Author

### samronsin commented Jan 30, 2020 • edited

 @agramfort as discussed earlier today, I looked around what was done in R regarding calibration estimation:

- Caret (version 6.0-85) has the module calibration.R that computes the calibration curve by binning the data.
- CalibratR is a more recent package that focuses on calibration and implements the ECE and MCE calibration errors (average and max distance between calibration curves computed by binning), which correspond to the current implementation with reducer set to avg and max.

I didn't find anything with overlapping bins, nor kernel smoothing methods for estimating the calibration curve.
added 4 commits Jan 31, 2020
 adding debiased option to calibration loss 
 9ee785f 
 fixing bias term 
 97708ff 
 hopefully last fix for morning mistake to the bias term of calibration loss 
 6d0b112 
 yet another fix for bias term 
 f58f448 
Contributor Author

### samronsin commented Jan 31, 2020

 Will do @NicolasHug, thanks! To recap our discussion with @agramfort earlier today:

- Estimation of E[Y|Y_hat]: this PR should focus on the current strategies implemented in calibration_curve (binned with either fixed-size bins or using quantiles) and follow its API; another PR should address other strategies (possibly non-binned, e.g. kernel-based).
- Estimation of the calibration error: we should stick with MCE, ECE and L2-CE (with the "debiased" term by default for L2-CE) for the time being, and properly document the effect of the "debiased" term on L2-CE, which could be added by default.
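For context on the "debiased" term (my paraphrase of the binned estimator in Kumar, Liang and Ma; the exact form should be checked against the paper): the plug-in binned estimate of the squared L2 calibration error is corrected by subtracting an estimate of the within-bin sampling noise, with n_k the number of samples in bin k:

```latex
\widehat{\mathrm{CE}_2^2}_{\text{plug-in}} = \sum_k \frac{n_k}{n} \left( \bar{p}_k - \bar{y}_k \right)^2,
\qquad
\widehat{\mathrm{CE}_2^2}_{\text{debiased}} = \sum_k \frac{n_k}{n}
\left[ \left( \bar{p}_k - \bar{y}_k \right)^2 - \frac{\bar{y}_k (1 - \bar{y}_k)}{n_k - 1} \right]
```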
 Cleaning calibration_loss 
 d2bbdd6 
Contributor

### dsleo commented Jan 31, 2020 • edited

 Following up on the discussion and to give more context before the proposed PRs, here is the proposed implementation of the histogram calibration error with bias reduction, as proposed in the above article of Kumar, Liang and Ma, for reference. We did the following experiments.

For a perfectly calibrated model (a Bernoulli model):

    def sample_calibrated(N):
        xs = []
        ys = []
        for _ in range(N):
            x = np.random.rand()
            xs.append(x)
            y = 1 if np.random.rand() < x else 0
            ys.append(y)
        x_arr = np.array(xs)
        y_arr = np.array(ys)
        return y_arr, x_arr

Running calibration_loss with and without the reduce_bias term over 50 iterations gives the following:

For a poorly calibrated model (a twisted Bernoulli model):

    def twist(x, e):
        if x < 0.5:
            return np.power(x, e) / np.power(0.5, e-1)
        else:
            return 1 - np.power(1-x, e) / np.power(0.5, e-1)

    def sample_poorly_calibrated(N, e):
        xs = []
        ys = []
        for _ in range(N):
            x = np.random.rand()
            xs.append(x)
            h = twist(x, e)
            y = 1 if np.random.rand() < h else 0
            ys.append(y)
        x_arr = np.array(xs)
        y_arr = np.array(ys)
        return y_arr, x_arr

Running calibration_loss with and without the reduce_bias term over 50 iterations gives the following (for a twist e=1./8):

The picture is less clear than in the perfectly calibrated model case. Note that for the time being the binning strategy is uniform, which gives a higher variance under twist. We'll try with quantile binning to see if this is corrected.
Member

### amueller commented Mar 18, 2020

 Hey @samronsin are you still working on this?
Contributor Author

### samronsin commented Mar 23, 2020

 Yes @amueller! My colleague @dsleo has fixes for the CI coming this week.
and others added 5 commits Mar 23, 2020
 Fixing test for calibration loss + new tests 
 f24f273 
 linting and fixing doc 
 e3635c1 
 some more linting :) 
 f694bf6 
 linting... 
 61679e1 
 Clean up mistakenly added files 
 2156bd5