Sampling uncertainty on precision-recall and ROC curves #25856

Open
stephanecollot opened this issue Mar 14, 2023 · 12 comments · May be fixed by #26192
Labels
Needs Decision - Include Feature · New Feature

Comments

@stephanecollot
Contributor

stephanecollot commented Mar 14, 2023

Describe the workflow you want to enable

We would like to add the possibility to plot sampling uncertainty on precision-recall and ROC curves.

Describe your proposed solution

We (@mbaak, @RUrlus, @ilanfri and I) published a paper at AISTATS 2023 called Pointwise sampling uncertainties on the Precision-Recall curve, in which we compare multiple methods to compute and plot them.

We found out that a great way to compute them is to use profile likelihoods based on Wilks’ theorem.
It consists of the following steps:

  1. Get the curve
  2. Get the confusion matrix of each point of the curve
  3. For each observed point of the curve, estimate a surrounding 6-sigma (i.e. wider than the desired band) uncertainty grid rectangle (based on a first-order approximation of the covariance matrix, under a bivariate normal assumption)
  4. For each hypothesis point in the grid, compute the test statistic against the observed point, called the profile log-likelihood ratio (using the fact that the confusion matrix follows a multinomial distribution).
  5. Plot the 3-sigma contour (i.e. isoline) around each observed point (using Wilks’ theorem, which states that the profile log-likelihood ratio asymptotically follows a chi2 distribution)
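
In rough pseudocode, steps 2-5 for one observed point and one hypothesized (precision, recall) pair could look like the sketch below (a simplified illustration, not the notebook code; the names profile_llr and inside_band are made up for this sketch):

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import chi2

def _nll(probs, counts):
    # multinomial negative log-likelihood (up to an additive constant)
    return -np.sum(counts * np.log(np.clip(probs, 1e-12, None)))

def profile_llr(counts, precision, recall):
    # counts = [tp, fp, fn, tn] of the observed confusion matrix (step 2).
    # The constraints precision = p_tp / (p_tp + p_fp) and
    # recall = p_tp / (p_tp + p_fn) leave one free parameter, p_tp,
    # which is profiled out numerically; p_tn = 1 - p_tp - p_fp - p_fn.
    counts = np.asarray(counts, dtype=float)
    nll_hat = _nll(counts / counts.sum(), counts)  # unconstrained MLE = observed frequencies

    def constrained_nll(p_tp):
        p_fp = p_tp * (1.0 / precision - 1.0)
        p_fn = p_tp * (1.0 / recall - 1.0)
        p_tn = 1.0 - p_tp - p_fp - p_fn
        if p_tn <= 0.0:
            return np.inf
        return _nll(np.array([p_tp, p_fp, p_fn, p_tn]), counts)

    # profile over the remaining free parameter (step 4)
    upper = 1.0 / (1.0 / precision + 1.0 / recall - 1.0)
    res = minimize_scalar(constrained_nll, bounds=(1e-9, upper - 1e-9), method="bounded")
    return 2.0 * (res.fun - nll_hat)

def inside_band(counts, precision, recall, n_std=3):
    # Wilks' theorem: the statistic is asymptotically chi2 with 2 degrees of
    # freedom; convert n_std sigmas into the matching chi2(2) quantile (step 5).
    threshold = chi2.ppf(chi2.cdf(n_std ** 2, df=1), df=2)
    return profile_llr(counts, precision, recall) <= threshold

Evaluating inside_band over the grid from step 3 and tracing the boundary gives the contour of step 5.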

We have a minimal pure Python implementation:
https://github.com/RUrlus/ModelMetricUncertaintyResearch/blob/sklearn_pull_request/notebooks/pr_ellipse_validation/demo_ROC_PR_curves_sklearn_pull_request.ipynb

And a C++ implementation: the paper is supported by our package ModelMetricUncertainty, which has a C++ core with optional OpenMP support and Pybind11 bindings. Note that this package contains much more functionality than the above notebook. The core is binding-agnostic, allowing a switch to Cython if needed. The upside is that it is multiple orders of magnitude faster than the above Python implementation, at the cost of complexity.

The pure Python implementation would look like this:
[figure: precision-recall curve with a pointwise uncertainty band, produced by the pure Python implementation]

I’m also suggesting other visual improvements:

  1. Add x- and y-axis limits of [0, 1]; in sklearn the axes currently start at about -0.1
  2. Modify the plotting frame: either remove the top and right lines to see the curve better when values are close to 1, or plot the frame with a dotted line
  3. Fix the aspect ratio to square, since the two axes use the same scale.
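
For reference, these three tweaks map onto standard matplotlib calls roughly as follows (a sketch, assuming the curve has already been drawn on an axes object ax):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# ... draw the PR or ROC curve on ax here ...

# 1. limit both axes to [0, 1]
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
# 2. hide the top and right spines so curves close to 1 stay visible
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
# 3. square aspect ratio, since both axes share the same scale
ax.set_aspect("equal")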

With those it can look like this:
[figures: precision-recall and ROC curves with the suggested visual improvements and 3-sigma uncertainty bands]

Remark: I set the contour color to lightblue; let me know if that is fine.

We need to align on the API integration. I suggest adding the following parameters to PrecisionRecallDisplay and RocCurveDisplay:

  • uncertainty=True to enable plotting the uncertainty band (or plot_uncertainty_style= ?)
  • uncertainty_n_std=3 to set how many +/- standard deviations the band should cover
  • uncertainty_n_bins=100 to set how fine-grained the band should be (see the remark about running time)
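
As a sketch of how the proposed usage could look (the uncertainty* keywords below are exactly the suggestion above and do not exist in scikit-learn yet; y_test and y_score are placeholders):

from sklearn.metrics import PrecisionRecallDisplay

# hypothetical call: the three uncertainty* keywords are the proposal, not an existing API
display = PrecisionRecallDisplay.from_predictions(
    y_test,
    y_score,
    uncertainty=True,        # draw the sampling-uncertainty band around the curve
    uncertainty_n_std=3,     # band covers +/- 3 standard deviations
    uncertainty_n_bins=100,  # resolution of the grid around each curve point
)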

Describe alternatives you've considered, if relevant

Other ways to compute uncertainties are evaluated in our paper.

We have noticed that there is an open pull request on a related topic: #21211
That is great; however, cross-validation covers different sources of uncertainty and has some limitations: overlapping training folds introduce a bias by correlating the trained models, and this uncertainty depends on the size of a fold and is likely larger than on the test set (see ref.).

Additional context

Running time discussion

Here is an analysis of the running time of this pure Python method:

[figure: running time of the pure Python method as a function of the number of curve points and uncertainty_n_bins]

The execution time depends on the number of points (i.e. thresholds) plotted and on uncertainty_n_bins.
With a surrounding grid of uncertainty_n_bins=100 per point, it is fast enough and fine-grained enough.
There is barely any noticeable visual difference between 50 and 100 (or more) points (at least in this example), see the curves.
For, say, a 100k test set it is too slow for ROC, because there are many more thresholds, but this should be fixed soon by #24668. In any case, the uncertainties are then really small, so plotting them doesn't add much.

@stephanecollot added the Needs Triage and New Feature labels on Mar 14, 2023
@glemaitre
Member

I am actually interested in this topic. I reactivated some work that I started in #21211 last week.

The idea is to provide some uncertainty measures for the different displays. Our original thought was to use cross-validation (using cross_validate) and offer uncertainties in displays with a new from_cv_results method.

I will have a look at the paper to get a better understanding of the statistical aspect of the confidence intervals. Somehow, reading the thread tells me that we need to be extra careful when reporting error bars: we need to be explicit about what they mean, i.e. what type of uncertainties we are providing.

@stephanecollot
Contributor Author

stephanecollot commented Mar 20, 2023

Yes, we saw your pull request before; see my comment above.
You are right, here are some relevant extracts of our paper about uncertainties sources:

The sources of uncertainty on model metrics are many, such as data sampling, model initialization, and hyper-parameter optimization (Bouthillier et al., 2021). The sampling uncertainty is often the dominant source (Bouthillier et al., 2021, Fig. 1). Priority is given here to the sampling uncertainty of the test set, which we refer to as the classifier uncertainty. Since the test set is usually smaller than the training set, its sampling uncertainty is generally the largest.

A modern review on the topic of uncertainty estimation as related to machine learning can be found in Hüllermeier and Waegeman (2021). A recent and comprehensive study to cover the topic of uncertainty estimation as it relates to model selection and accounting for multiple sources of variation in realistic setups, is Bouthillier et al. (2021).
The present work focuses on the uncertainty due to sampling variability in the test set, in contrast to previous seminal works Nadeau and Bengio (1999); Dietterich (1998), which consider uncertainty due to training set variability.

Let me know if you want the suggested feature, and I will open the PR.

@lorentzenchr
Member

confidence intervals. Somehow, reading the thread tells me that we need to be extra careful when reporting error bars: we need to be explicit about what they mean, i.e. what type of uncertainties we are providing

IIUC, this issue is quite explicit in asking for the sampling uncertainty of the (test) data, given a fixed model (i.e. no cross-validation). This would be similar to the plotting capabilities of https://lorentzenchr.github.io/model-diagnostics/.

@stephanecollot It is easier to open separate issues for your suggested plot improvements.

@stephanecollot
Contributor Author

stephanecollot commented Mar 21, 2023

IIUC, this issue is quite explicit in asking for the sampling uncertainty of the (test) data, given a fixed model (i.e. no cross-validation).

Yes, exactly.

This would be similar to the plotting capabilities of https://lorentzenchr.github.io/model-diagnostics/.

Could you point me more specifically where (and how) the sample uncertainty is computed in model-diagnostics?

@stephanecollot It is easier to open separate issues for your suggested plot improvements.

Ok here is the separate issue for plot improvements: #25929

@lorentzenchr
Member

lorentzenchr commented Mar 21, 2023

Could you point me more specifically where (and how) the sample uncertainty is computed in model-diagnostics?

It's best seen in the example. The simplest function is compute_bias, which computes standard errors. The corresponding plot is then plot_bias.

@thomasjpfan added the Needs Decision - Include Feature label and removed the Needs Triage label on Mar 24, 2023
@stephanecollot
Contributor Author

@lorentzenchr, interesting package, thanks for sharing! I see that it plots uncertainties on the calibration curves (i.e. the standard deviation of the difference between predicted and observed values).

We are proposing something quite different: 2D uncertainties for PR or ROC curves. (In particular, for PR curves the 2D correlation is non-trivial.) We think it would be very nice to integrate this into sklearn, so anyone can use it.

The code is ready on my side; I'm waiting for confirmation that sklearn wants the feature before opening the pull request.

@lorentzenchr
Member

This issue proposes to add sampling uncertainties to ROC and PR curves. While the paper clearly fails our inclusion criteria, the proposed method using Wilks’ theorem is much older.

I personally like it for visualizing uncertainty.

@scikit-learn/core-devs @scikit-learn/contributor-experience-team @scikit-learn/communication-team opinions welcome.

@stephanecollot
Contributor Author

I don't think this section "What are the inclusion criteria for new algorithms?" applies here, since it is not really an algorithm that does fit/transform.

@betatim
Member

betatim commented Apr 11, 2023

I think this would be a useful tool to have. As I understand it, this tool lets you estimate how well you know the performance of your fitted estimator, given the test dataset. This means it should be easy to demonstrate in an example that the uncertainty shrinks as you increase the size of the test dataset (keeping everything else fixed).

If the above understanding is correct, then I'd vote for adding this.

@glemaitre
Member

glemaitre commented Apr 11, 2023

I still didn't get time to read the paper, but I am +1 for adding uncertainty visualization. Here, I think that we should expose the uncertainties via from_estimator and from_predictions.

I would find it complementary to from_cv_results, which provides another type of uncertainty.

@stephanecollot
Contributor Author

Yes, ok I will open the PR soon.

@RUrlus
Contributor

RUrlus commented Apr 11, 2023

I think this would be a useful tool to have. As I understand it, this tool lets you estimate how well you know the performance of your fitted estimator, given the test dataset. This means it should be easy to demonstrate in an example that the uncertainty shrinks as you increase the size of the test dataset (keeping everything else fixed).

If the above understanding is correct, then I'd vote for adding this.

@betatim you're correct. A quick example:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import mmu
from mmu.viz.utils import _set_plot_style

_ = _set_plot_style()

seeds = mmu.commons.utils.SeedGenerator(2343451)
X, y = make_classification(
    n_samples=2000, n_classes=2, random_state=seeds()
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=seeds()
)
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]

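# same test data at two sizes: the uncertainty band for 1000 points is
# narrower than for 500, illustrating that uncertainty shrinks with test size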
err_500 = mmu.PRU.from_scores(y=y_test[:500], scores=y_score[:500])
err_1000 = mmu.PRU.from_scores(y=y_test[:1000], scores=y_score[:1000])
ax = err_500.plot(other=err_1000)

produces
[figure: precision-recall uncertainty bands for 500 vs. 1000 test points; the larger test set gives a narrower band]
