
FEA Implementation of "threshold-dependent metric per threshold value" curve #25639

Open

wants to merge 24 commits into main

Conversation

@vitaliset (Contributor) commented Feb 18, 2023

Towards #21391.

This PR implements the underlying curve, with the intention of later building a MetricThresholdCurveDisplay that follows the same structure as the other Displays (a rough sketch of what such a Display could look like follows the example below). I decided to break the original issue into two parts (curve and Display) for easier review, but I don't mind adding the Display to this PR as well.

A quick example of how the implementation in this PR can be used:

import matplotlib.pyplot as plt
from imblearn.datasets import fetch_datasets
from sklearn.inspection import metric_threshold_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import fbeta_score
from functools import partial

# "coil_2000" is an imbalanced binary dataset from the imbalanced-learn package.
dataset = fetch_datasets()["coil_2000"]
X, y = dataset.data, (dataset.target == 1).astype(int)

X_train_model, X_test, y_train_model, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Probability of the positive class on the held-out set.
model = RandomForestClassifier(random_state=0).fit(X_train_model, y_train_model)
predict_proba = model.predict_proba(X_test)[:, 1]

# Evaluate the F2 score on a grid of 500 candidate thresholds.
f2_values, thresholds = metric_threshold_curve(
    y_test, predict_proba, partial(fbeta_score, beta=2), threshold_grid=500
)

fig, ax = plt.subplots(figsize=(5, 2.4))
ax.plot(thresholds, f2_values)
ax.set_xlabel("thresholds")
ax.set_ylabel("f2 score")
plt.tight_layout()

[Figure: plot of the F2 score as a function of the decision threshold]
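As mentioned above, the Display itself is left for a follow-up PR. For context only, here is a minimal sketch (not part of this PR; all names, including MetricThresholdCurveDisplaySketch, are illustrative) of how a Display wrapping the curve outputs could follow the plot() pattern of existing Displays such as RocCurveDisplay:

import matplotlib.pyplot as plt

class MetricThresholdCurveDisplaySketch:
    """Illustrative sketch of a Display holding precomputed curve values."""

    def __init__(self, thresholds, metric_values, metric_name="score"):
        self.thresholds = thresholds
        self.metric_values = metric_values
        self.metric_name = metric_name

    def plot(self, ax=None):
        # Plot the metric as a function of the decision threshold.
        if ax is None:
            _, ax = plt.subplots()
        (self.line_,) = ax.plot(self.thresholds, self.metric_values)
        ax.set_xlabel("decision threshold")
        ax.set_ylabel(self.metric_name)
        self.ax_ = ax
        self.figure_ = ax.figure
        return self

With the example above, MetricThresholdCurveDisplaySketch(thresholds, f2_values, metric_name="f2 score").plot() would produce essentially the same figure.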

Most of the code for the metric_threshold_curve function is an adaptation of _binary_clf_curve.

Points of doubt:

  • I thought the inspection module would be suitable for this type of analysis, but it is not 100% clear to me that this curve (and later the Display) should live there - other current options would be metrics or model_selection._prediction (just like the related meta-estimator from [WIP] FEA New meta-estimator to post-tune the decision_function/predict_proba threshold for binary classifiers #16525).
  • I would appreciate some help with test ideas!
  • I wrote a first, preliminary version of the documentation to go along with the function. I didn't want to invest too much in it before we settle on what the function will look like. Ideas for it would be appreciated as well! :)


github-actions bot commented May 20, 2024

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


black

black detected issues. Please run black . locally and push the changes. Here you can see the detected issues. Note that running black might also fix some of the issues which might be detected by ruff. Note that the installed black version is black=24.3.0.


--- /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/_decision_threshold.py	2024-05-22 05:00:10.745270+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/_decision_threshold.py	2024-05-22 05:00:27.285693+00:00
@@ -107,11 +107,11 @@
                 "In a multiclass scenario, you must pass a `pos_label` to `scoring_kwargs`."
             )
         raise ValueError("{0} format is not supported".format(y_type))
 
     sample_weight = scoring_kwargs.get("sample_weight")
-    check_consistent_length( y_true, y_score, sample_weight)
+    check_consistent_length(y_true, y_score, sample_weight)
     y_true = column_or_1d(y_true)
     y_score = column_or_1d(y_score)
     assert_all_finite(y_true)
     assert_all_finite(y_score)
 
@@ -122,11 +122,11 @@
         sample_weight = _check_sample_weight(sample_weight, y_true)
         nonzero_weight_mask = sample_weight != 0
         y_true = y_true[nonzero_weight_mask]
         y_score = y_score[nonzero_weight_mask]
         sample_weight = sample_weight[nonzero_weight_mask]
-    
+
     pos_label = _check_pos_label_consistency(pos_label, y_true)
 
     # Make y_true a boolean vector.
     y_true = y_true == pos_label
 
@@ -134,14 +134,14 @@
     desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1]
     y_score = y_score[desc_score_indices]
     y_true = y_true[desc_score_indices]
     if sample_weight is not None:
         sample_weight = sample_weight[desc_score_indices]
-    
+
     if "sample_weight" in scoring_kwargs:
         scoring_kwargs["sample_weight"] = sample_weight
-    
+
     # Logic to see if we need to use all possible thresholds (distinct values).
     all_thresholds = isinstance(thresholds, int) and len(set(y_score)) < thresholds
 
     if all_thresholds:
         # y_score typically has many tied values. Here we extract
@@ -151,13 +151,11 @@
         threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]
         thresholds = y_score[threshold_idxs[::-1]]
     elif isinstance(thresholds, int):
         # It takes representative score points to calculate the metric
         # with these thresholds.
-        thresholds = np.percentile(
-            list(set(y_score)), np.linspace(0, 100, thresholds)
-        )
+        thresholds = np.percentile(list(set(y_score)), np.linspace(0, 100, thresholds))
     else:
         # If thresholds is an array then run some checks and sort
         # it for consistency.
         thresholds = column_or_1d(thresholds)
         assert_all_finite(thresholds)
@@ -165,11 +163,9 @@
 
     # For each threshold calculates the metric.
     metric_values = []
     for threshold in thresholds:
         preds_threshold = (y_score > threshold).astype(int)
-        metric_values.append(
-            scoring(y_true, preds_threshold, **scoring_kwargs)
-        )
+        metric_values.append(scoring(y_true, preds_threshold, **scoring_kwargs))
     # TODO: should we multithread the metric calculations?
 
     return np.array(metric_values), thresholds
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/_decision_threshold.py
--- /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/tests/test_decision_threshold.py	2024-05-22 05:00:10.749270+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/tests/test_decision_threshold.py	2024-05-22 05:00:27.932117+00:00
@@ -90,13 +90,11 @@
 
 
 def test_len_of_threshold_when_passing_int():
     y = [0] * 500 + [1] * 500
     y_score = list(range(1000))
-    _, thresholds = decision_threshold_curve(
-        y, y_score, accuracy_score, thresholds=13
-    )
+    _, thresholds = decision_threshold_curve(y, y_score, accuracy_score, thresholds=13)
 
     assert len(thresholds) == 13
 
 
 def test_passing_the_grid():
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/tests/test_decision_threshold.py

Oh no! 💥 💔 💥
2 files would be reformatted, 923 files would be left unchanged.

ruff

ruff detected issues. Please run ruff check --fix --output-format=full . locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.4.4.


sklearn/metrics/_decision_threshold.py:9:31: F401 [*] `numbers.Real` imported but unused
   |
 7 | # License: BSD 3 clause
 8 | 
 9 | from numbers import Integral, Real
   |                               ^^^^ F401
10 | 
11 | import numpy as np
   |
   = help: Remove unused import: `numbers.Real`

sklearn/metrics/_decision_threshold.py:107:89: E501 Line too long (92 > 88)
    |
105 |         if y_type == "multiclass":
106 |             raise ValueError(
107 |                 "In a multiclass scenario, you must pass a `pos_label` to `scoring_kwargs`."
    |                                                                                         ^^^^ E501
108 |             )
109 |         raise ValueError("{0} format is not supported".format(y_type))
    |

sklearn/metrics/_decision_threshold.py:127:1: W293 [*] Blank line contains whitespace
    |
125 |         y_score = y_score[nonzero_weight_mask]
126 |         sample_weight = sample_weight[nonzero_weight_mask]
127 |     
    | ^^^^ W293
128 |     pos_label = _check_pos_label_consistency(pos_label, y_true)
    |
    = help: Remove whitespace from blank line

sklearn/metrics/_decision_threshold.py:139:1: W293 [*] Blank line contains whitespace
    |
137 |     if sample_weight is not None:
138 |         sample_weight = sample_weight[desc_score_indices]
139 |     
    | ^^^^ W293
140 |     if "sample_weight" in scoring_kwargs:
141 |         scoring_kwargs["sample_weight"] = sample_weight
    |
    = help: Remove whitespace from blank line

sklearn/metrics/_decision_threshold.py:142:1: W293 [*] Blank line contains whitespace
    |
140 |     if "sample_weight" in scoring_kwargs:
141 |         scoring_kwargs["sample_weight"] = sample_weight
142 |     
    | ^^^^ W293
143 |     # Logic to see if we need to use all possible thresholds (distinct values).
144 |     all_thresholds = isinstance(thresholds, int) and len(set(y_score)) < thresholds
    |
    = help: Remove whitespace from blank line

sklearn/metrics/tests/test_decision_threshold.py:1:1: I001 [*] Import block is un-sorted or un-formatted
   |
 1 | / from functools import partial
 2 | | 
 3 | | import numpy as np
 4 | | import pytest
 5 | | 
 6 | | from sklearn.datasets import make_classification
 7 | | from sklearn.ensemble import RandomForestClassifier
 8 | | from sklearn.metrics import decision_threshold_curve
 9 | | from sklearn.metrics import (
10 | |     accuracy_score,
11 | |     f1_score,
12 | |     fbeta_score,
13 | |     precision_score,
14 | |     recall_score,
15 | | )
16 | | from sklearn.utils._testing import assert_allclose
17 | | from sklearn.utils.validation import check_random_state
18 | | 
19 | | 
20 | | def test_grid_int_bigger_than_set_then_all():
   | |_^ I001
21 |       """When `thresholds` parameter is bigger than the number of unique
22 |       `y_score` then `len(thresholds)` should be equal to `len(set(y_score))`.
   |
   = help: Organize imports

Found 6 errors.
[*] 5 fixable with the `--fix` option.

Generated for commit: 1fb1c13. Link to the linter CI: here

@glemaitre (Member)

OK, now that we have merged FixedThresholdClassifier and TunedThresholdClassifierCV, it gives me another perspective on the tool.

I think it is time to review and prioritize this feature.

@vitaliset would you have time to dedicate to work on this feature?

@glemaitre added this to the 1.6 milestone May 20, 2024
@glemaitre (Member) left a comment

Some initial thoughts. I did not look at the documentation or tests, but that will come later.

@@ -674,6 +674,7 @@ Kernels

inspection.partial_dependence
inspection.permutation_importance
inspection.metric_threshold_curve
Member

I don't think that we should include this feature in this module. Like the precision-recall and ROC curves, this is a curve metric. Also, by including it in the sklearn.metrics module, we can drop the metric_ prefix from the name.

I would therefore call it sklearn.metrics.decision_threshold_curve
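To illustrate the suggested relocation and rename (a sketch only, assuming the proposal is adopted in this PR), the import from the example in the description would become:

# Sketch of the proposed relocation/rename: the curve would live in
# sklearn.metrics instead of sklearn.inspection, without the metric_ prefix.
from sklearn.metrics import decision_threshold_curve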

"y_true": ["array-like"],
"y_score": ["array-like"],
"score_func": [callable],
"threshold_grid": [
Member

To be consistent with TunedThresholdClassifierCV, we need to call this thresholds. I would not allow None, because with many samples we are going to reinterpolate anyway. So let's be consistent with the classifier here as well.

def metric_threshold_curve(
    y_true,
    y_score,
    score_func,
Member

We might want to call this scoring as well for consistency. But here, we should only accept a callable.

Comment on lines 43 to 44
    pos_label=None,
    sample_weight=None,
Member

Those are actually not necessary per se: they are parameters of scoring, and we should accept any keyword argument and propagate it. I assume that we should have a scoring_kwargs that is a dictionary.
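Putting the naming suggestions above together, here is a rough sketch of how the call from the PR description might look with the proposed parameter names (scoring, thresholds, scoring_kwargs). This only illustrates the reviewed proposal and requires this PR's branch; it is not a released API:

# Reuses y_test and predict_proba from the example in the PR description.
from functools import partial

from sklearn.metrics import decision_threshold_curve, fbeta_score

f2_values, thresholds = decision_threshold_curve(
    y_test,                                # true binary labels
    predict_proba,                         # scores of the positive class
    scoring=partial(fbeta_score, beta=2),  # callable metric, renamed from score_func
    thresholds=500,                        # threshold points, renamed from threshold_grid
    scoring_kwargs={"pos_label": 1},       # extra keyword arguments forwarded to `scoring`
)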

Comment on lines 130 to 177
    # Make y_true a boolean vector.
    y_true = y_true == pos_label

    # Sort scores and corresponding truth values.
    desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1]
    y_score = y_score[desc_score_indices]
    y_true = y_true[desc_score_indices]
    if sample_weight is not None:
        sample_weight = sample_weight[desc_score_indices]

    # Logic to see if we need to use all possible thresholds (distinct values).
    all_thresholds = False
    if threshold_grid is None:
        all_thresholds = True
    elif isinstance(threshold_grid, int):
        if len(set(y_score)) < threshold_grid:
            all_thresholds = True

    if all_thresholds:
        # y_score typically has many tied values. Here we extract
        # the indices associated with the distinct values. We also
        # concatenate a value for the end of the curve.
        distinct_value_indices = np.where(np.diff(y_score))[0]
        threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]
        thresholds = y_score[threshold_idxs[::-1]]
    elif isinstance(threshold_grid, int):
        # It takes representative score points to calculate the metric
        # with these thresholds.
        thresholds = np.percentile(
            list(set(y_score)), np.linspace(0, 100, threshold_grid)
        )
    else:
        # If threshold_grid is an array then run some checks and sort
        # it for consistency.
        threshold_grid = column_or_1d(threshold_grid)
        assert_all_finite(threshold_grid)
        thresholds = np.sort(threshold_grid)

    # For each threshold calculates the metric.
    metric_values = []
    for threshold in thresholds:
        preds_threshold = (y_score > threshold).astype(int)
        metric_values.append(
            score_func(y_true, preds_threshold, sample_weight=sample_weight)
        )
    # TODO: should we multithread the metric calculations?

    return np.array(metric_values), thresholds
Member

All this code is already implemented by the _score function of the sklearn.model_selection._classification_threshold._CurveScorer class.

I think that we should leverage this code by creating such a scorer. We probably need to dissociate getting y_score from the scoring itself, so that here we only call the scoring part.
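For intuition, a minimal standalone sketch of this separation (not the actual _CurveScorer code, whose internals are private): the scoring part only turns a precomputed y_score into per-threshold metric values, while obtaining y_score from an estimator stays a separate step.

import numpy as np

def score_over_thresholds(y_true, y_score, score_func, n_thresholds=100):
    # Scoring part only: y_score is assumed to be precomputed (e.g. from
    # predict_proba), so this helper never touches the estimator.
    thresholds = np.linspace(np.min(y_score), np.max(y_score), n_thresholds)
    scores = np.array(
        [score_func(y_true, (y_score >= t).astype(int)) for t in thresholds]
    )
    return scores, thresholds

With the earlier example, score_over_thresholds(y_test, predict_proba, partial(fbeta_score, beta=2)) would produce a curve similar to the one plotted in the description.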

Member

So now, it makes sense to move the _CurveScorer in metrics.

Contributor Author

So now, it makes sense to move the _CurveScorer in metrics.

Do you want me to do this in this PR or create a separate one?

@glemaitre (Member) commented May 21, 2024

It would be better to do it in a separate PR. Depending on the schedule, I might start that PR myself.

@vitaliset (Contributor Author)

I think it is time to review and prioritize this feature.

@vitaliset would you have time to dedicate to work on this feature?

Awesome news! I might need a couple of weeks, but I would love to make this feature available! Will work on your comments as soon as I can, @glemaitre.
