Sampling uncertainty on precision-recall and ROC curves #25856

Open
stephanecollot opened this issue Mar 14, 2023 · 12 comments · May be fixed by #26192
Labels
Needs Decision - Include Feature · New Feature

Comments

@stephanecollot
Contributor

stephanecollot commented Mar 14, 2023

Describe the workflow you want to enable

We would like to add the possibility to plot sampling uncertainty on precision-recall and ROC curves.

Describe your proposed solution

We (@mbaak, @RUrlus, @ilanfri and I) published a paper at AISTATS 2023 called Pointwise sampling uncertainties on the Precision-Recall curve, in which we compare multiple methods to compute and plot them.

We found out that a great way to compute them is to use profile likelihoods based on Wilks’ theorem.
It consists of the following steps:

  1. Get the curve
  2. Get the confusion matrix of each point of the curve
  3. For each observed point of the curve, estimate a surrounding 6-sigma (i.e. wider than the desired band) uncertainty grid rectangle (based on a first-order approximation of the covariance matrix, under a bivariate normal assumption)
  4. For each hypothesis point in the grid, compute the test statistic against the observed point, called the profile log-likelihood ratio (using the fact that the confusion matrix follows a multinomial distribution).
  5. Plot the 3-sigma contour (i.e. isoline) around each observed point (using Wilks’ theorem, which states that the profile log-likelihood ratio asymptotically follows a chi2 distribution)
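
In rough pseudocode, steps 2-5 for one observed point and one hypothesized (precision, recall) pair could look like the sketch below (a simplified illustration, not the notebook code; the names profile_llr and inside_band are made up for this sketch):

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import chi2

def _nll(probs, counts):
    # multinomial negative log-likelihood (up to an additive constant)
    return -np.sum(counts * np.log(np.clip(probs, 1e-12, None)))

def profile_llr(counts, precision, recall):
    # counts = [tp, fp, fn, tn] of the observed confusion matrix (step 2).
    # The constraints precision = p_tp / (p_tp + p_fp) and
    # recall = p_tp / (p_tp + p_fn) leave one free parameter, p_tp,
    # which is profiled out numerically; p_tn = 1 - p_tp - p_fp - p_fn.
    counts = np.asarray(counts, dtype=float)
    nll_hat = _nll(counts / counts.sum(), counts)  # unconstrained MLE = observed frequencies

    def constrained_nll(p_tp):
        p_fp = p_tp * (1.0 / precision - 1.0)
        p_fn = p_tp * (1.0 / recall - 1.0)
        p_tn = 1.0 - p_tp - p_fp - p_fn
        if p_tn <= 0.0:
            return np.inf
        return _nll(np.array([p_tp, p_fp, p_fn, p_tn]), counts)

    # profile over the remaining free parameter (step 4)
    upper = 1.0 / (1.0 / precision + 1.0 / recall - 1.0)
    res = minimize_scalar(constrained_nll, bounds=(1e-9, upper - 1e-9), method="bounded")
    return 2.0 * (res.fun - nll_hat)

def inside_band(counts, precision, recall, n_std=3):
    # Wilks' theorem: the statistic is asymptotically chi2 with 2 degrees of
    # freedom; convert n_std sigmas into the matching chi2(2) quantile (step 5).
    threshold = chi2.ppf(chi2.cdf(n_std ** 2, df=1), df=2)
    return profile_llr(counts, precision, recall) <= threshold

Evaluating inside_band over the grid from step 3 and tracing the boundary gives the contour of step 5.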

We have a minimal pure Python implementation:
https://github.com/RUrlus/ModelMetricUncertaintyResearch/blob/sklearn_pull_request/notebooks/pr_ellipse_validation/demo_ROC_PR_curves_sklearn_pull_request.ipynb

And a C++ implementation: the paper is supported by our package ModelMetricUncertainty, which has a C++ core with optional OpenMP support and Pybind11 bindings. Note that this package contains much more functionality than the above notebook. The core is binding-agnostic, allowing a switch to Cython if needed. The upside is that it is multiple orders of magnitude faster than the above Python implementation, at the cost of complexity.

The pure Python implementation would look like this:
[figure: precision-recall curve with a pointwise uncertainty band, produced by the pure Python implementation]

I’m also suggesting other visual improvements:

  1. Add x- and y-axis limits of [0, 1]; in sklearn the axes currently start at about -0.1
  2. Modify the plotting frame: either remove the top and right lines to see the curve better when values are close to 1, or plot the frame with a dotted line
  3. Fix the aspect ratio to square, since the two axes use the same scale.
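
For reference, these three tweaks map onto standard matplotlib calls roughly as follows (a sketch, assuming the curve has already been drawn on an axes object ax):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# ... draw the PR or ROC curve on ax here ...

# 1. limit both axes to [0, 1]
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
# 2. hide the top and right spines so curves close to 1 stay visible
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
# 3. square aspect ratio, since both axes share the same scale
ax.set_aspect("equal")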

With those it can look like this:
[figures: precision-recall and ROC curves with the suggested visual improvements and 3-sigma uncertainty bands]

Remark: I set the contour color to lightblue; let me know if that is fine.

We need to align on the API integration. I suggest adding the following parameters to PrecisionRecallDisplay and RocCurveDisplay:

  • uncertainty=True to enable plotting the uncertainty band (or plot_uncertainty_style= ?)
  • uncertainty_n_std=3 to set how many +/- standard deviations the band should cover
  • uncertainty_n_bins=100 to set how fine-grained the band should be (see the remark about running time)
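
As a sketch of how the proposed usage could look (the uncertainty* keywords below are exactly the suggestion above and do not exist in scikit-learn yet; y_test and y_score are placeholders):

from sklearn.metrics import PrecisionRecallDisplay

# hypothetical call: the three uncertainty* keywords are the proposal, not an existing API
display = PrecisionRecallDisplay.from_predictions(
    y_test,
    y_score,
    uncertainty=True,        # draw the sampling-uncertainty band around the curve
    uncertainty_n_std=3,     # band covers +/- 3 standard deviations
    uncertainty_n_bins=100,  # resolution of the grid around each curve point
)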

Describe alternatives you've considered, if relevant

Other ways to compute uncertainties are evaluated in our paper.

We have noticed that there is an open pull request on a related topic: #21211
That is great; however, cross-validation covers different sources of uncertainty and has some limitations: overlapping training folds introduce a bias by correlating the trained models, and this uncertainty depends on the size of a fold and is likely larger than on the test set (see ref.).

Additional context

Running time discussion

Here is an analysis of the running time of this pure Python method:

[figure: running time of the pure Python method as a function of the number of curve points and uncertainty_n_bins]

The execution time depends on the number of points (i.e. thresholds) plotted and on uncertainty_n_bins.
With a surrounding grid of uncertainty_n_bins=100 per point, it is fast enough and fine-grained enough.
There is barely any noticeable visual difference between 50 and 100 (or more) points (at least in this example), see the curves.
For, say, a 100k test set it is too slow for ROC, because there are many more thresholds, but this should be fixed soon by #24668. In any case, the uncertainties are then really small, so plotting them doesn't add much.

@stephanecollot added the Needs Triage and New Feature labels on Mar 14, 2023
@glemaitre
Member

I am actually interested in this topic. I reactivated some work that I started in #21211 last week.

The idea is to provide some uncertainty measures for the different displays. Our original thought was to use cross-validation (using cross_validate) and offer uncertainties in displays with a new from_cv_results method.

I will have a look at the paper to get a better understanding of the statistical aspect of the confidence intervals. Somehow, reading the thread tells me that we need to be extra careful when reporting error bars: we need to be explicit about what they mean, i.e. what type of uncertainties we are providing.

@stephanecollot
Contributor Author

stephanecollot commented Mar 20, 2023

Yes, we saw your pull request before; see my comment above.
You are right, here are some relevant extracts of our paper about uncertainties sources:

The sources of uncertainty on model metrics are many, such as data sampling, model initialization, and hyper-parameter optimization (Bouthillier et al., 2021). The sampling uncertainty is often the dominant source (Bouthillier et al., 2021, Fig. 1). Priority is given here to the sampling uncertainty of the test set, which we refer to as the classifier uncertainty. Since the test set is usually smaller than the training set, its sampling uncertainty is generally the largest.

A modern review on the topic of uncertainty estimation as related to machine learning can be found in Hüllermeier and Waegeman (2021). A recent and comprehensive study to cover the topic of uncertainty estimation as it relates to model selection and accounting for multiple sources of variation in realistic setups, is Bouthillier et al. (2021).
The present work focuses on the uncertainty due to sampling variability in the test set, in contrast to previous seminal works Nadeau and Bengio (1999); Dietterich (1998), which consider uncertainty due to training set variability.

Let me know if you want the suggested feature, and I will open the PR.

@lorentzenchr
Member

confidence intervals. Somehow, reading the thread tells me that we need to be extra careful when reporting error bars: we need to be explicit about what they mean, i.e. what type of uncertainties we are providing

IIUC, this issue is quite explicit in asking for the sampling uncertainty of the (test) data, given a fixed model (i.e. no cross-validation). This would be similar to the plotting capabilities of https://lorentzenchr.github.io/model-diagnostics/.

@stephanecollot It is easier to open separate issues for your suggested plot improvements.

@stephanecollot
Contributor Author

stephanecollot commented Mar 21, 2023

IIUC, this issue is quite explicit in asking for the sampling uncertainty of the (test) data, given a fixed model (i.e. no cross-validation).

Yes, exactly.

This would be similar to the plotting capabilities of https://lorentzenchr.github.io/model-diagnostics/.

Could you point me more specifically where (and how) the sample uncertainty is computed in model-diagnostics?

@stephanecollot It is easier to open separate issues for your suggested plot improvements.

Ok here is the separate issue for plot improvements: #25929

@lorentzenchr
Member

lorentzenchr commented Mar 21, 2023

Could you point me more specifically where (and how) the sample uncertainty is computed in model-diagnostics?

It's best seen in the example. The simplest function is compute_bias, which computes standard errors. The corresponding plot is then plot_bias.

@thomasjpfan added the Needs Decision - Include Feature label and removed the Needs Triage label on Mar 24, 2023
@stephanecollot
Contributor Author

@lorentzenchr, interesting package, thanks for sharing! I see that it plots uncertainties on the calibration curves (i.e. the standard deviation of the difference between predicted and observed values).

We are proposing something quite different: 2D uncertainties for PR or ROC curves. (In particular, for PR curves the 2D correlation is non-trivial.) We think it would be very nice to integrate this into sklearn, so anyone can use it.

The code is ready on my side; I'm waiting for confirmation that sklearn wants the feature before opening the pull request.

@lorentzenchr
Member

This issue proposes to add sampling uncertainties to ROC and PR curves. While the paper clearly fails our inclusion criteria, the proposed method using Wilks’ theorem is much older.

I personally like it for visualizing uncertainty.

@scikit-learn/core-devs @scikit-learn/contributor-experience-team @scikit-learn/communication-team opinions welcome.

@stephanecollot
Contributor Author

I don't think this section "What are the inclusion criteria for new algorithms?" applies here, since it is not really an algorithm that does fit/transform.

@betatim
Member

betatim commented Apr 11, 2023

I think this would be a useful tool to have. As I understand it, this tool lets you estimate how well you know the performance of your fitted estimator, given the test dataset. This means it should be easy to demonstrate in an example that the uncertainty shrinks as you increase the size of the test dataset (keeping everything else fixed).

If the above understanding is correct, then I'd vote for adding this.

@glemaitre
Member

glemaitre commented Apr 11, 2023

I still didn't get time to read the paper, but I am +1 for adding uncertainty visualization. Here, I think that we should expose the uncertainties via from_estimator and from_predictions.

I would find it complementary to from_cv_results, which provides another type of uncertainty.

@stephanecollot
Contributor Author

Yes, ok I will open the PR soon.

@RUrlus
Contributor

RUrlus commented Apr 11, 2023

I think this would be a useful tool to have. As I understand it, this tool lets you estimate how well you know the performance of your fitted estimator, given the test dataset. This means it should be easy to demonstrate in an example that the uncertainty shrinks as you increase the size of the test dataset (keeping everything else fixed).

If the above understanding is correct, then I'd vote for adding this.

@betatim you're correct. A quick example:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import mmu
from mmu.viz.utils import _set_plot_style

_ = _set_plot_style()

seeds = mmu.commons.utils.SeedGenerator(2343451)
X, y = make_classification(
    n_samples=2000, n_classes=2, random_state=seeds()
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=seeds()
)
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]

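# same test data at two sizes: the uncertainty band for 1000 points is
# narrower than for 500, illustrating that uncertainty shrinks with test size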
err_500 = mmu.PRU.from_scores(y=y_test[:500], scores=y_score[:500])
err_1000 = mmu.PRU.from_scores(y=y_test[:1000], scores=y_score[:1000])
ax = err_500.plot(other=err_1000)

produces
[figure: precision-recall uncertainty bands for 500 vs. 1000 test points; the larger test set gives a narrower band]
