
[MRG] Implement calibration loss metrics #11096

Open · wants to merge 97 commits into main
Conversation

samronsin
Copy link
Contributor

@samronsin samronsin commented May 15, 2018

See discussion on issue #10883.
Closes #18268.

This PR implements calibration losses for binary classifiers.

It also updates the doc about calibration, especially inaccurate references to the Brier score.
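
For readers of this thread, the kind of binned calibration loss under discussion can be sketched as follows. The equal-width binning and the function name here are illustrative assumptions, not this PR's exact implementation:

```python
import numpy as np

def binned_calibration_error(y_true, y_prob, n_bins=10):
    # Sketch of an expected-calibration-error-style loss: partition [0, 1]
    # into equal-width bins and average |fraction of positives - mean
    # predicted probability| over bins, weighted by bin size.
    # Illustrative only; not necessarily what this PR implements.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])  # values in [0, n_bins - 1]
    error = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            weight = mask.sum() / len(y_true)
            error += weight * abs(y_true[mask].mean() - y_prob[mask].mean())
    return error
```

With perfectly calibrated predictions this returns 0; predictions systematically off by 0.2 within their bins score about 0.2.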

@jnothman (Member) left a comment

You will also need to update:

  • doc/modules/classes.rst
  • doc/modules/model_evaluation.rst
  • sklearn/metrics/tests/test_common.py

@jnothman (Member) left a comment

Should this be available as a scorer for cross validation? Should it be available for CalibratedClassifierCV?

@jnothman (Member) commented May 16, 2018

You don't need to test sample weights, as the common tests will cover that.

@amueller (Member) commented May 19, 2018

I'm not sure it makes sense to have this as a ready-made scorer, so I might want to hold off on that. It should probably be mentioned in the calibration docs and examples.

@amueller (Member) commented May 19, 2018

Can you maybe also add https://www.math.ucdavis.edu/~saito/data/roc/ferri-class-perf-metrics.pdf to the references and maybe include some of the discussion there about this loss? In particular, this is one of two calibration losses they discuss, so maybe we should be more specific with the name?

@samronsin (Contributor, author) commented May 20, 2018

Actually I did not implement any of the losses from the paper you mention, @amueller, as I take non-overlapping bins instead of the sliding window used in CalB.
CalB would totally make sense in this PR, although:

  • n_bins would have a very different meaning, e.g. the number of non-overlapping bins
  • taking sample_weight into account seems non-trivial because of the sliding window
  • implementing CalB efficiently would be non-trivial in itself, since it is based on a sliding window
  • a greedy implementation would be quite a drag on performance compared to non-overlapping binning

@samronsin (Contributor, author) commented May 25, 2018

I ended up implementing the CalB loss suggested by @amueller, but did not add support for sample weights.
Also, I'll be happy to take suggestions regarding its implementation.

@jnothman (Member) left a comment

Sorry for again not being in a position to review the main content...

@@ -62,6 +62,7 @@ Scoring Function
'balanced_accuracy' :func:`metrics.balanced_accuracy_score` for binary targets
'average_precision' :func:`metrics.average_precision_score`
'brier_score_loss' :func:`metrics.brier_score_loss`
'calibration_loss' :func:`metrics.calibration_loss`

@agramfort (Member) commented Jul 16, 2018

To respect the convention that higher is better, we should maybe call it

neg_calibration_error

Thoughts, anyone?

@ogrisel (Member) commented Jun 10, 2020

Yes: the grid search tools in scikit-learn always maximize the scoring criterion. If the criterion is derived from a loss to minimize, we instead maximize the negative criterion. It's important to name it accordingly to keep the column names and column values of pd.DataFrame(grid_search.cv_results_) consistent.

@ogrisel (Member) commented Jun 10, 2020

We might also want to provide standard aliases neg_max_calibration_error and neg_expected_calibration_error that preset the norm parameter, to make it possible to grid search for the ECE or MCE metrics traditionally used in the literature.

@samronsin (Contributor, author) commented Jul 14, 2020

Thanks @thomasjpfan and @lucyleeow for your comments. I think @dsleo and I addressed all of them.

@ogrisel unless I'm missing something, the only remaining discussion point is what should happen when no pos_label is explicitly given by the user (especially for non-integer targets). Is there a new consensus we should follow?

@ogrisel (Member) commented Aug 6, 2020

> @ogrisel unless I'm missing something, the only remaining discussion point is what should happen when no pos_label is explicitly given by the user (especially for non-integer target) ? Is there a new consensus we should follow ?

I have not followed recent development but this is related to this discussion: #17704

@lucyleeow (Member) commented Aug 11, 2020

For housekeeping, I think this will close #10971 and #12479.

@glemaitre (Contributor) left a comment

These are just some comments regarding the code itself. I still have to read more about the score itself and check that we don't miss anything in the tests.

In particular, I see that we don't test anything when it is used as a scorer. I am not sure we have a common test there.


if pos_label is None:
    pos_label = y_true.max()
y_true = np.array(y_true == pos_label, int)

@glemaitre (Contributor) commented Aug 14, 2020

If the score is not symmetric, I would make pos_label handling consistent with the scorers, which raise an error for string labels.
To be more specific, we should carry out this check:

# ensure binary classification if pos_label is not specified
# classes.dtype.kind in ('O', 'U', 'S') is required to avoid
# triggering a FutureWarning by calling np.array_equal(a, b)
# when elements in the two arrays are not comparable.
classes = np.unique(y_true)
if (pos_label is None and (
        classes.dtype.kind in ('O', 'U', 'S') or
        not (np.array_equal(classes, [0, 1]) or
             np.array_equal(classes, [-1, 1]) or
             np.array_equal(classes, [0]) or
             np.array_equal(classes, [-1]) or
             np.array_equal(classes, [1])))):
    classes_repr = ", ".join(repr(c) for c in classes)
    raise ValueError("y_true takes value in {{{classes_repr}}} and "
                     "pos_label is not specified: either make y_true "
                     "take value in {{0, 1}} or {{-1, 1}} or "
                     "pass pos_label explicitly.".format(
                         classes_repr=classes_repr))
elif pos_label is None:
    pos_label = 1.

We could make a small refactoring indeed. So pos_label would be in line with:

pos_label: int or str, default=None

    The label of the positive class. When pos_label=None, if y_true is in {-1, 1} or {0, 1},
    pos_label is set to 1, otherwise, an error will be raised.

raise ValueError("y_prob has values outside of [0, 1] range")

labels = np.unique(y_true)
if len(labels) > 2:

@glemaitre (Contributor) commented Aug 14, 2020

Here I think that we should use sklearn.utils.multiclass.type_of_target for consistency:

y_type = type_of_target(y_true)
if y_type != "binary":
    raise ValueError(...)
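
For reference, type_of_target lives in sklearn.utils.multiclass and infers the problem type from the values of y, which is what the suggested validation relies on:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# type_of_target distinguishes binary from multiclass and continuous targets.
print(type_of_target(np.array([0, 1, 1, 0])))  # binary
print(type_of_target(np.array([0, 1, 2])))     # multiclass
print(type_of_target(np.array([0.1, 0.5])))    # continuous
```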

y_pred = np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75, 0.75])
sample_weight = np.array([1, 1, 1, 1] + [3, 3, 3, 3])

assert_almost_equal(

@glemaitre (Contributor) commented Aug 14, 2020

We are now using the following pattern

assert calibration_error(...) == pytest.approx(0.1875)

instead of assert_almost_equal, since we introduced pytest.

0.2165063)


def test_calibration_error_raises():

@glemaitre (Contributor) commented Aug 14, 2020

We can parametrize this test with pytest:

@pytest.mark.parametrize(
    "y_pred",
    [np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75]),
     np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75, 1.75]),
     ...])
def test_calibration_error_raises(y_pred):
    ...
    with pytest.raises(ValueError, match=err_msg):
        calibration_error(y_true, y_pred)

@glemaitre (Contributor) commented Aug 14, 2020

If the err_msg is changing, we can include it in the parametrization as well.

y_true = np.array([0, 1, 0, 1])
y_pred = np.array([0., 0.25, 0.5, 0.75])

assert_almost_equal(

@glemaitre (Contributor) commented Aug 14, 2020

use pytest.approx

y_true = np.array([1, 0, 0, 1])
y_pred = np.array([0.25, 0.25, 0.75, 0.75])

assert_almost_equal(

@glemaitre (Contributor) commented Aug 14, 2020

use pytest.approx

@lucyleeow (Member) commented Aug 14, 2020

Sorry to add more work but would it be possible to add a quick note in model_evaluation.rst under Brier score:

This is because the Brier score can be decomposed as the sum of calibration loss and refinement loss [Bella2012]. Calibration loss is defined as the mean squared deviation from empirical probabilities derived from the slope of ROC segments.

to clarify that the calibration loss metric we implement here is not the same as the one referred to above, and maybe how they relate?

(ref: #18051 (comment))

@rth (Member) commented Sep 11, 2020

Thanks for this work! In terms of implementation, can't we re-use the calculation of prob_true, prob_pred done in calibration_curve, as suggested in #18268, instead of re-computing it from scratch? Or do we need a re-implementation for the reduce_bias=True case? In any case, if that works it would be good to add a test comparing a manual calculation, e.g. of ECE from calibration_curve, with the added function.
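
A cross-check along these lines could look like the following sketch. The data and tolerance are illustrative; it assumes calibration_curve's uniform-binning strategy and keeps predicted probabilities off the bin edges, where the two binning conventions can differ:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Manual ECE computed from calibration_curve, as a reference value a test
# could compare against the new calibration_error function.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.15, 0.3, 0.65, 0.7, 0.85, 0.9, 0.95])

n_bins = 5
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)

# calibration_curve drops empty bins, so recover the per-bin counts to
# weight each bin by its number of samples (empty bins contribute nothing).
counts, _ = np.histogram(y_prob, bins=np.linspace(0.0, 1.0, n_bins + 1))
counts = counts[counts > 0]
ece = np.sum(counts / counts.sum() * np.abs(prob_true - prob_pred))
# ece is approximately 0.1875 for this toy data
```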

dsleo and others added 2 commits Nov 5, 2020
Format fixes

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Base automatically changed from master to main Jan 22, 2021
The :func:`calibration_error` function computes the expected and maximum
calibration errors as defined in [1]_ for binary classes.

@rflperry commented Mar 2, 2021

Would the general ECE for multiclass problems be more desirable, as used in Guo 2017 (Ref 1)? This implementation seems targeted at binary classification. I believe you could use it as-is for multiclass if you pass in y_pred as the maximum predicted probability and y_true as 1 if the predicted class equals the true class and 0 otherwise. But I would think it more desirable to just have the multiclass logic under the hood.
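
The reduction described here can be sketched as follows (variable names are illustrative):

```python
import numpy as np

# Reduce a multiclass problem to binary calibration inputs (Guo et al. 2017
# style): use the top predicted probability as the "confidence" and whether
# the top class matches the label as the binary outcome.
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
y = np.array([0, 1, 2])

confidence = proba.max(axis=1)                     # top-class probability
correct = (proba.argmax(axis=1) == y).astype(int)  # 1 if prediction is right
# `correct` and `confidence` can now be fed to a binary calibration metric.
```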

sample_weight[i_start:i_end])
/ delta_count[i])
if norm == "l2" and reduce_bias:
delta_debias = (

@edwardclem commented Jun 6, 2021

Is this implementation of the bias term correct? I believe it should be (avg_pred_true[i] * (avg_pred_true[i] - 1) * delta_count[i]) / (count * (delta_count[i] - 1)), going off the definition here. The 1/count doesn't get distributed properly, I think. This will produce NaNs if there's only one element in a bin, so I'm not sure if this was changed for numerical stability reasons. The authors of the paper mention this in their implementation, so the "undefined if bin size is 1" issue seems expected.

@lorentzenchr (Member) commented Nov 29, 2021

We should have a look at https://cran.r-project.org/web/packages/reliabilitydiag/index.html and the accompanying paper https://doi.org/10.1073/pnas.2016191118. First, let us check whether this PR corresponds to their miscalibration measure (MCB). Second, they propose a binning scheme that might help us.

@ColdTeapot273K commented Jan 17, 2022

Folks, I've made a separate implementation in #22233 which has some specific benefits (e.g. it is binless). What do you think?

@lorentzenchr (Member) commented Jun 27, 2022

I would much prefer a more general solution for calibration scores by a general score decomposition as proposed in #23767.
