[MRG] Implement calibration loss metrics #11096

Open · wants to merge 46 commits into base: master

Conversation

samronsin (Contributor) commented May 15, 2018

See discussion on issue #10883.

This PR implements calibration losses for binary classifiers.

It also updates the documentation on calibration, in particular fixing inaccurate references to the Brier score.

jnothman (Member) left a comment

You will also need to update:

  • doc/modules/classes.rst
  • doc/modules/model_evaluation.rst
  • sklearn/metrics/tests/test_common.py
jnothman (Member) left a comment

Should this be available as a scorer for cross validation? Should it be available for CalibratedClassifierCV?

sample_weight=None, pos_label=None):
"""Compute calibration loss.
Across all items in a set N predictions, the calibration loss measures

jnothman (Member) commented May 15, 2018

We usually put most of this detail in the user guide, not in the API reference.

samronsin (Author) commented May 16, 2018

Thanks for the suggestion, I'll move it there then.

reducer: string, must be among 'l1', 'l2', 'max'
Aggregation method.
nbins: int, positive, optional (default=10)

jnothman (Member) commented May 15, 2018

Need a space before the colon. I also think n_bins or just bins might be more conventional.

If true, return the mean loss per sample.
Otherwise, return the sum of the per-sample losses.
sample_weight : array-like of shape = [n_samples], optional

jnothman (Member) commented May 15, 2018

I'd put this up next to y_prob

y_prob : array, shape (n_samples,)
Probabilities of the positive class.
reducer: string, must be among 'l1', 'l2', 'max'

jnothman (Member) commented May 15, 2018

Below you have 'sum'

loss = 0.
count = 0.
for x in np.arange(0, 1, step_size):
in_range = (x <= y_prob) & (y_prob < x + step_size)

jnothman (Member) commented May 15, 2018

You could get the bin for each y_prob with searchsorted, then use bincount (or ufunc.at) to get weighted sums

samronsin (Author) commented May 16, 2018

Indeed, I was contemplating this option, or using np.digitize and np.bincount as in the calibration_curve function.

jnothman (Member) commented May 16, 2018

For a small number of bins, digitize is faster than searchsorted; otherwise, which one is chosen is immaterial.
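
For illustration, a minimal sketch (not the PR's code) of the vectorised binning suggested above, assuming n_bins uniform bins on [0, 1]; the helper name binned_weighted_sums is hypothetical:

```python
import numpy as np

def binned_weighted_sums(y_true, y_prob, sample_weight, n_bins=10):
    """Weighted per-bin sums without a Python loop over bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index of each prediction; np.searchsorted(edges, y_prob, side='right')
    # would work just as well as np.digitize here.
    bin_ids = np.minimum(np.digitize(y_prob, edges) - 1, n_bins - 1)
    weight_per_bin = np.bincount(bin_ids, weights=sample_weight, minlength=n_bins)
    true_per_bin = np.bincount(bin_ids, weights=y_true * sample_weight,
                               minlength=n_bins)
    prob_per_bin = np.bincount(bin_ids, weights=y_prob * sample_weight,
                               minlength=n_bins)
    return weight_per_bin, true_per_bin, prob_per_bin
```

From these three arrays, the per-bin averages follow by element-wise division.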

delta_count = (in_range * sample_weight).sum()
avg_pred_true = ((in_range * y_true * sample_weight).sum()
/ float(delta_count))
bin_centroid = (in_range * y_prob).sum() / float(delta_count)

jnothman (Member) commented May 15, 2018

Sums of element-wise products can be calculated with np.dot
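
For example (a standalone sketch with made-up arrays, mirroring the names in the hunk above):

```python
import numpy as np

# Toy data standing in for one bin of the loop above.
y_true = np.array([0., 1., 1., 0., 1.])
y_prob = np.array([0.10, 0.80, 0.65, 0.30, 0.90])
sample_weight = np.ones_like(y_prob)
in_range = ((0.6 <= y_prob) & (y_prob < 0.7)).astype(float)

# Sums of element-wise products expressed as dot products.
delta_count = np.dot(in_range, sample_weight)
avg_pred_true = np.dot(in_range * y_true, sample_weight) / delta_count
bin_centroid = np.dot(in_range, y_prob) / delta_count
```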

jnothman (Member) commented May 16, 2018

You don't need to test sample weights, as the common tests will.

amueller (Member) commented May 19, 2018

I'm not sure it makes sense to have this as a ready-made scorer, so I might want to hold off on that. It should probably be mentioned in the calibration docs and examples.

amueller (Member) commented May 19, 2018

Can you maybe also add https://www.math.ucdavis.edu/~saito/data/roc/ferri-class-perf-metrics.pdf to the references and maybe include some of the discussion there about this loss? In particular, this is one of two calibration losses they discuss, so maybe we should be more specific with the name?

samronsin (Author) commented May 20, 2018

Actually I did not implement any of the losses from the paper you mention, @amueller, since I use non-overlapping bins rather than the sliding window used in CalB.
CalB would totally make sense in this PR, although:

  • n_bins would have a very different meaning than it currently does, i.e. the number of non-overlapping bins
  • taking sample_weight into account seems non-trivial because of the sliding window
  • implementing CalB efficiently would be non-trivial in itself, since it is based on a sliding window
  • a greedy implementation would be quite a drag on performance compared to non-overlapping binning
samronsin (Author) commented May 25, 2018

I ended up implementing the CalB loss suggested by @amueller, but did not provide support for sample weights.
Also, I'll be happy to take suggestions regarding its implementation.

jnothman (Member) left a comment

Sorry again for not being in a position to review the main content...

Therefore, the lower the calibration loss is for a set of predictions, the
better the predictions are calibrated.
The aggregation method can be either:
- 'sum' for :math:`\sum_k P_k \delta_k`, denoted as expected calibration error

jnothman (Member) commented May 28, 2018

I haven't checked the rendering, but I am pretty sure you need blank lines before and after this list.

sliding_window=False, normalize=True, pos_label=None):
"""Compute calibration loss.
Across all items in a set N predictions, the calibration loss measures

jnothman (Member) commented May 28, 2018

Please remove or abridge the description rather than duplicate the guide

provided sufficient data in bins.
sliding_window : bool, optional (default=False)
If true, compute the

jnothman (Member) commented May 28, 2018

Unfinished

of the actual outcome.
Therefore, the lower the calibration loss is for a set of predictions, the
better the predictions are calibrated.
The aggregation method can be either:

jnothman (Member) commented May 28, 2018

Blank line before this

samronsin added 5 commits Jul 16, 2018
@@ -62,6 +62,7 @@ Scoring Function
'balanced_accuracy' :func:`metrics.balanced_accuracy_score` for binary targets
'average_precision' :func:`metrics.average_precision_score`
'brier_score_loss' :func:`metrics.brier_score_loss`
'calibration_loss' :func:`metrics.calibration_loss`

agramfort (Member) commented Jul 16, 2018

To respect the convention that higher is better, maybe we should call it

neg_calibration_error

Thoughts, anyone?
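
For context, the usual way to expose a loss while respecting that convention is make_scorer with greater_is_better=False, which negates the value. A sketch (the neg_calibration_error wiring is hypothetical, not part of this PR):

```python
from sklearn.metrics import log_loss, make_scorer

# Existing precedent: the 'neg_log_loss' scorer is built by negating log_loss.
neg_log_loss_scorer = make_scorer(log_loss, greater_is_better=False,
                                  needs_proba=True)

# A calibration loss could be wired up the same way (hypothetical):
# neg_calibration_error = make_scorer(calibration_loss,
#                                     greater_is_better=False,
#                                     needs_proba=True)
```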

def calibration_loss(y_true, y_prob, sample_weight=None, reducer="sum",
n_bins=10, sliding_window=False, normalize=True,
pos_label=None):
"""Compute calibration loss.

agramfort (Member) commented Jul 16, 2018

2 spaces before calibration

agramfort (Member) commented Jul 16, 2018

Reading the suggested papers, is there a recommended default for the parameters? For example, I would expect the sliding_window=True option to be the recommended default.

y_prob : array, shape (n_samples,)
Probabilities of the positive class.
sample_weight : array-like of shape = [n_samples], optional

agramfort (Member) commented Jul 16, 2018

array-like of -> array-like,

agramfort (Member) commented Jul 16, 2018

actually:

sample_weight : array-like, shape (n_samples,), optional

sample_weight : array-like of shape = [n_samples], optional
Sample weights.
reducer : string, must be among 'sum', 'max'

agramfort (Member) commented Jul 16, 2018

reducer : 'sum' | 'max'

If true, return the mean loss per sample.
Otherwise, return the sum of the per-sample losses.
pos_label : int or str, default=None

agramfort (Member) commented Jul 16, 2018

pos_label : int or str, optional (default=None)


@@ -2007,3 +2007,148 @@ def brier_score_loss(y_true, y_prob, sample_weight=None, pos_label=None):
y_true = np.array(y_true == pos_label, int)
y_true = _check_binary_probabilistic_predictions(y_true, y_prob)
return np.average((y_true - y_prob) ** 2, weights=sample_weight)


def calibration_loss(y_true, y_prob, sample_weight=None, reducer="sum",

agramfort (Member) commented Jul 16, 2018

Is brier_score a subcase of what calibration_loss can do? If so, the code should be factorized so that brier_score calls calibration_loss with the correct options.

samronsin (Author) commented Jul 17, 2018

So to conclude on the experiments in this gist, I'd say that defaults close to the calibration curve (sliding_window=False and bin_size_ratio=0.1) would be good both in terms of metric quality (it accurately measures calibration) and of consistency between the loss and the curve.

samronsin (Author) commented Jul 17, 2018

Also, the Brier score is not quite a subcase of any of these metrics (it would require squaring the differences, with sliding_window=True and bin_size_ratio=1./N).

amueller added this to PR phase in Andy's pets on Jul 17, 2019
amueller (Member) commented Sep 4, 2019

Also see #12479

amueller (Member) commented Dec 11, 2019

There's an interesting discussion of debiasing the calibration error in https://arxiv.org/pdf/1909.10155.pdf

That's a recent NeurIPS paper, but the method they discuss is actually already established, so it might be a good candidate. cc @thomasjpfan, who has shown interest.

agramfort (Member) commented Dec 11, 2019

samronsin (Author) commented Dec 13, 2019

Thanks @amueller for the reference, I'll have a look asap.

samronsin (Author) commented Jan 27, 2020

I eventually read the paper by Kumar, Liang and Ma, which triggered a few questions:

  • the convergence results for the debiased estimator in Section 5 only apply to binned calibrators with a fixed number of bins, IIRC. If we want to stick to their result, this implies a fixed number of (non-overlapping) bins. I didn't find any useful guidance on the choice of the number of bins though, so that question remains open... I plan on running the experiments described above in this thread with the "debiased estimator" during the sprint this week and seeing how it fares on the toy models.
  • they argue that bin-based calibration error estimates on continuous (non-binned) predictors are biased (Section 3), and thus advocate for binned calibrators. So should we implement the algorithm they describe in Section 4 -- basically binning calibrators according to quantiles (it is not clear how many) of the empirical distribution of predicted probabilities? This would probably mean a different PR.
samronsin added 4 commits Jan 30, 2020
…into calibration-loss
NicolasHug (Member) commented Jan 30, 2020

LMK when this is ready for a review

samronsin (Author) commented Jan 30, 2020

@agramfort as discussed earlier today, I looked around at what is done in R regarding calibration estimation:

  • Caret (version 6.0-85) has a calibration.R module that computes the calibration curve by binning the data.
  • CalibratR is a more recent package that focuses on calibration and implements the ECE and MCE calibration errors (the average and the maximum deviation of the binned calibration curve), which correspond to the current implementation with reducer set to 'avg' and 'max' (see the sketch right after this comment).

I didn't find anything with overlapping bins, nor any kernel smoothing methods for estimating the calibration curve.
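
For concreteness, a minimal sketch (not this PR's code) of those two quantities computed on top of scikit-learn's calibration_curve; the helper name ece_mce, the count-based weighting, and the uniform 10-bin default are my own choices:

```python
import numpy as np
from sklearn.calibration import calibration_curve

def ece_mce(y_true, y_prob, n_bins=10):
    """Expected and maximum calibration error from a uniformly binned curve."""
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    # calibration_curve drops empty bins, so recover per-bin counts to weight
    # the non-empty ones by their share of the samples.
    counts, _ = np.histogram(y_prob, bins=np.linspace(0.0, 1.0, n_bins + 1))
    weights = counts[counts > 0] / counts.sum()
    gaps = np.abs(prob_true - prob_pred)
    return np.sum(weights * gaps), np.max(gaps)  # (ECE, MCE)
```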

samronsin (Author) commented Jan 31, 2020

Will do @NicolasHug, thanks!

To recap our discussion with @agramfort earlier today:

  • estimation of E[Y|Y_hat]: this PR should focus on the strategies currently implemented in calibration_curve (binning with either fixed-size bins or quantiles) and follow its API; another PR should address other strategies (possibly non-binned, e.g. kernel-based)
  • estimation of the calibration error: we should stick with MCE, ECE and L2-CE for the time being (with the "debiased" term enabled by default for L2-CE), and properly document the effect of that "debiased" term (a rough formalisation is sketched below)
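
For reference, a rough formalisation of these binned estimators (my notation, not necessarily the PR's), with P_k the fraction of samples in bin k, and ybar_k and pbar_k the mean outcome and mean predicted probability in that bin:

```latex
% Sketch of the binned calibration errors discussed above (notation is mine).
\mathrm{ECE} = \sum_k P_k \,\lvert \bar{y}_k - \bar{p}_k \rvert, \qquad
\mathrm{MCE} = \max_k \,\lvert \bar{y}_k - \bar{p}_k \rvert, \qquad
\mathrm{L2\text{-}CE} = \sum_k P_k \,\bigl(\bar{y}_k - \bar{p}_k\bigr)^2 .
```

As I understand Kumar, Liang and Ma, the "debiased" variant of L2-CE subtracts a per-bin variance correction from this plug-in estimate.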
dsleo (Contributor) commented Jan 31, 2020

Following up on the discussion, and to give more context before the proposed PRs, here is a proposed implementation of the histogram calibration error for reference, with the bias reduction proposed in the Kumar, Liang and Ma article above.

We did the following experiments:

  1. For a perfectly calibrated model (a Bernoulli model):

```python
import numpy as np

def sample_calibrated(N):
    # Reported probability x is uniform on [0, 1]; the label is Bernoulli(x),
    # so the model is perfectly calibrated by construction.
    xs = []
    ys = []
    for _ in range(N):
        x = np.random.rand()
        xs.append(x)
        y = 1 if np.random.rand() < x else 0
        ys.append(y)

    x_arr = np.array(xs)
    y_arr = np.array(ys)
    return y_arr, x_arr
```

Running calibration_loss with and without the reduce_bias term over 50 iterations gives the following:

[plot: distribution of calibration_loss with vs. without reduce_bias over 50 iterations]

  2. For a poorly calibrated model (a twisted Bernoulli model):

```python
import numpy as np

def twist(x, e):
    # Monotone distortion of [0, 1] that keeps twist(0.5) == 0.5 but bends the
    # other probabilities, creating miscalibration.
    if x < 0.5:
        return np.power(x, e) / np.power(0.5, e - 1)
    else:
        return 1 - np.power(1 - x, e) / np.power(0.5, e - 1)

def sample_poorly_calibrated(N, e):
    # Reported probability is x, but the label is drawn as Bernoulli(twist(x, e)).
    xs = []
    ys = []
    for _ in range(N):
        x = np.random.rand()
        xs.append(x)
        h = twist(x, e)
        y = 1 if np.random.rand() < h else 0
        ys.append(y)

    x_arr = np.array(xs)
    y_arr = np.array(ys)
    return y_arr, x_arr
```

Running calibration_loss with and without the reduce_bias term over 50 iterations gives the following (for a twist e=1./8):

[plot: distribution of calibration_loss with vs. without reduce_bias over 50 iterations, twisted model]

The picture is less clear than in the perfectly calibrated case. Note that for the time being the binning strategy is uniform, which gives a higher variance under the twist. We'll try quantile binning to see whether this corrects it (a possible sketch follows below).
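
A possible quantile binning for that follow-up experiment (a sketch under my own naming, not the PR's code):

```python
import numpy as np

def quantile_bin_ids(y_prob, n_bins=10):
    """Assign each prediction to one of n_bins equal-frequency bins."""
    # Bin edges are empirical quantiles of y_prob, so bins hold roughly the
    # same number of samples; only the interior edges are passed to digitize.
    edges = np.quantile(y_prob, np.linspace(0.0, 1.0, n_bins + 1))
    return np.digitize(y_prob, edges[1:-1])
```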

amueller (Member) commented Mar 18, 2020

Hey @samronsin, are you still working on this?

samronsin (Author) commented Mar 23, 2020

Yes @amueller! My colleague @dsleo has fixes for the CI coming this week.

dsleo and others added 5 commits Mar 23, 2020
Projects: Andy's pets (PR phase)
Linked issues: none yet
7 participants