
FEA Implementation of "threshold-dependent metric per threshold value" curve #25639

Open

wants to merge 24 commits into main

Conversation

@vitaliset (Contributor) commented Feb 18, 2023

Towards #21391.

This PR implements the underlying curve, with the intention of later building a MetricThresholdCurveDisplay that follows the same structure as the other Displays (a rough sketch of what such a Display could look like follows the example below). I decided to break the original issue into two parts (curve and Display) for easier review, but I don't mind adding the Display to this PR as well.

A quick example of how the implementation in this PR can be used:

import matplotlib.pyplot as plt
from imblearn.datasets import fetch_datasets
from sklearn.inspection import metric_threshold_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import fbeta_score
from functools import partial

# "coil_2000" is an imbalanced binary dataset from the imbalanced-learn package.
dataset = fetch_datasets()["coil_2000"]
X, y = dataset.data, (dataset.target == 1).astype(int)

X_train_model, X_test, y_train_model, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Probability of the positive class on the held-out set.
model = RandomForestClassifier(random_state=0).fit(X_train_model, y_train_model)
predict_proba = model.predict_proba(X_test)[:, 1]

# Evaluate the F2 score on a grid of 500 candidate thresholds.
f2_values, thresholds = metric_threshold_curve(
    y_test, predict_proba, partial(fbeta_score, beta=2), threshold_grid=500
)

fig, ax = plt.subplots(figsize=(5, 2.4))
ax.plot(thresholds, f2_values)
ax.set_xlabel("thresholds")
ax.set_ylabel("f2 score")
plt.tight_layout()

[Figure: plot of the F2 score as a function of the decision threshold]
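As mentioned above, the Display itself is left for a follow-up PR. For context only, here is a minimal sketch (not part of this PR; all names, including MetricThresholdCurveDisplaySketch, are illustrative) of how a Display wrapping the curve outputs could follow the plot() pattern of existing Displays such as RocCurveDisplay:

import matplotlib.pyplot as plt

class MetricThresholdCurveDisplaySketch:
    """Illustrative sketch of a Display holding precomputed curve values."""

    def __init__(self, thresholds, metric_values, metric_name="score"):
        self.thresholds = thresholds
        self.metric_values = metric_values
        self.metric_name = metric_name

    def plot(self, ax=None):
        # Plot the metric as a function of the decision threshold.
        if ax is None:
            _, ax = plt.subplots()
        (self.line_,) = ax.plot(self.thresholds, self.metric_values)
        ax.set_xlabel("decision threshold")
        ax.set_ylabel(self.metric_name)
        self.ax_ = ax
        self.figure_ = ax.figure
        return self

With the example above, MetricThresholdCurveDisplaySketch(thresholds, f2_values, metric_name="f2 score").plot() would produce essentially the same figure.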

Most of the code for the metric_threshold_curve function is an adaptation of _binary_clf_curve.

Points of doubt:

  • I thought the inspection module would be suitable for this type of analysis, but it is not 100% clear to me that this curve (and later the Display) should live there - other current options would be metrics or model_selection._prediction (just like the related meta-estimator from [WIP] FEA New meta-estimator to post-tune the decision_function/predict_proba threshold for binary classifiers #16525).
  • I would appreciate some help with test ideas!
  • I wrote a first, preliminary version of the documentation to go along with the function. I didn't want to invest too much in it before we settle on what the function will look like. Ideas for it would be appreciated as well! :)


github-actions bot commented May 20, 2024

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


black

black detected issues. Please run black . locally and push the changes. Here you can see the detected issues. Note that running black might also fix some of the issues which might be detected by ruff. Note that the installed black version is black=24.3.0.


--- /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/_decision_threshold.py	2024-05-22 05:00:10.745270+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/_decision_threshold.py	2024-05-22 05:00:27.285693+00:00
@@ -107,11 +107,11 @@
                 "In a multiclass scenario, you must pass a `pos_label` to `scoring_kwargs`."
             )
         raise ValueError("{0} format is not supported".format(y_type))
 
     sample_weight = scoring_kwargs.get("sample_weight")
-    check_consistent_length( y_true, y_score, sample_weight)
+    check_consistent_length(y_true, y_score, sample_weight)
     y_true = column_or_1d(y_true)
     y_score = column_or_1d(y_score)
     assert_all_finite(y_true)
     assert_all_finite(y_score)
 
@@ -122,11 +122,11 @@
         sample_weight = _check_sample_weight(sample_weight, y_true)
         nonzero_weight_mask = sample_weight != 0
         y_true = y_true[nonzero_weight_mask]
         y_score = y_score[nonzero_weight_mask]
         sample_weight = sample_weight[nonzero_weight_mask]
-    
+
     pos_label = _check_pos_label_consistency(pos_label, y_true)
 
     # Make y_true a boolean vector.
     y_true = y_true == pos_label
 
@@ -134,14 +134,14 @@
     desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1]
     y_score = y_score[desc_score_indices]
     y_true = y_true[desc_score_indices]
     if sample_weight is not None:
         sample_weight = sample_weight[desc_score_indices]
-    
+
     if "sample_weight" in scoring_kwargs:
         scoring_kwargs["sample_weight"] = sample_weight
-    
+
     # Logic to see if we need to use all possible thresholds (distinct values).
     all_thresholds = isinstance(thresholds, int) and len(set(y_score)) < thresholds
 
     if all_thresholds:
         # y_score typically has many tied values. Here we extract
@@ -151,13 +151,11 @@
         threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]
         thresholds = y_score[threshold_idxs[::-1]]
     elif isinstance(thresholds, int):
         # It takes representative score points to calculate the metric
         # with these thresholds.
-        thresholds = np.percentile(
-            list(set(y_score)), np.linspace(0, 100, thresholds)
-        )
+        thresholds = np.percentile(list(set(y_score)), np.linspace(0, 100, thresholds))
     else:
         # If thresholds is an array then run some checks and sort
         # it for consistency.
         thresholds = column_or_1d(thresholds)
         assert_all_finite(thresholds)
@@ -165,11 +163,9 @@
 
     # For each threshold calculates the metric.
     metric_values = []
     for threshold in thresholds:
         preds_threshold = (y_score > threshold).astype(int)
-        metric_values.append(
-            scoring(y_true, preds_threshold, **scoring_kwargs)
-        )
+        metric_values.append(scoring(y_true, preds_threshold, **scoring_kwargs))
     # TODO: should we multithread the metric calculations?
 
     return np.array(metric_values), thresholds
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/_decision_threshold.py
--- /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/tests/test_decision_threshold.py	2024-05-22 05:00:10.749270+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/tests/test_decision_threshold.py	2024-05-22 05:00:27.932117+00:00
@@ -90,13 +90,11 @@
 
 
 def test_len_of_threshold_when_passing_int():
     y = [0] * 500 + [1] * 500
     y_score = list(range(1000))
-    _, thresholds = decision_threshold_curve(
-        y, y_score, accuracy_score, thresholds=13
-    )
+    _, thresholds = decision_threshold_curve(y, y_score, accuracy_score, thresholds=13)
 
     assert len(thresholds) == 13
 
 
 def test_passing_the_grid():
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/tests/test_decision_threshold.py

Oh no! 💥 💔 💥
2 files would be reformatted, 923 files would be left unchanged.

ruff

ruff detected issues. Please run ruff check --fix --output-format=full . locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.4.4.


sklearn/metrics/_decision_threshold.py:9:31: F401 [*] `numbers.Real` imported but unused
   |
 7 | # License: BSD 3 clause
 8 | 
 9 | from numbers import Integral, Real
   |                               ^^^^ F401
10 | 
11 | import numpy as np
   |
   = help: Remove unused import: `numbers.Real`

sklearn/metrics/_decision_threshold.py:107:89: E501 Line too long (92 > 88)
    |
105 |         if y_type == "multiclass":
106 |             raise ValueError(
107 |                 "In a multiclass scenario, you must pass a `pos_label` to `scoring_kwargs`."
    |                                                                                         ^^^^ E501
108 |             )
109 |         raise ValueError("{0} format is not supported".format(y_type))
    |

sklearn/metrics/_decision_threshold.py:127:1: W293 [*] Blank line contains whitespace
    |
125 |         y_score = y_score[nonzero_weight_mask]
126 |         sample_weight = sample_weight[nonzero_weight_mask]
127 |     
    | ^^^^ W293
128 |     pos_label = _check_pos_label_consistency(pos_label, y_true)
    |
    = help: Remove whitespace from blank line

sklearn/metrics/_decision_threshold.py:139:1: W293 [*] Blank line contains whitespace
    |
137 |     if sample_weight is not None:
138 |         sample_weight = sample_weight[desc_score_indices]
139 |     
    | ^^^^ W293
140 |     if "sample_weight" in scoring_kwargs:
141 |         scoring_kwargs["sample_weight"] = sample_weight
    |
    = help: Remove whitespace from blank line

sklearn/metrics/_decision_threshold.py:142:1: W293 [*] Blank line contains whitespace
    |
140 |     if "sample_weight" in scoring_kwargs:
141 |         scoring_kwargs["sample_weight"] = sample_weight
142 |     
    | ^^^^ W293
143 |     # Logic to see if we need to use all possible thresholds (distinct values).
144 |     all_thresholds = isinstance(thresholds, int) and len(set(y_score)) < thresholds
    |
    = help: Remove whitespace from blank line

sklearn/metrics/tests/test_decision_threshold.py:1:1: I001 [*] Import block is un-sorted or un-formatted
   |
 1 | / from functools import partial
 2 | | 
 3 | | import numpy as np
 4 | | import pytest
 5 | | 
 6 | | from sklearn.datasets import make_classification
 7 | | from sklearn.ensemble import RandomForestClassifier
 8 | | from sklearn.metrics import decision_threshold_curve
 9 | | from sklearn.metrics import (
10 | |     accuracy_score,
11 | |     f1_score,
12 | |     fbeta_score,
13 | |     precision_score,
14 | |     recall_score,
15 | | )
16 | | from sklearn.utils._testing import assert_allclose
17 | | from sklearn.utils.validation import check_random_state
18 | | 
19 | | 
20 | | def test_grid_int_bigger_than_set_then_all():
   | |_^ I001
21 |       """When `thresholds` parameter is bigger than the number of unique
22 |       `y_score` then `len(thresholds)` should be equal to `len(set(y_score))`.
   |
   = help: Organize imports

Found 6 errors.
[*] 5 fixable with the `--fix` option.

Generated for commit: 1fb1c13. Link to the linter CI: here

@glemaitre (Member)

OK, now that we have merged FixedThresholdClassifier and TunedThresholdClassifierCV, it gives me another perspective on the tool.

I think it is time to review and prioritize this feature.

@vitaliset would you have time to dedicate to work on this feature?

@glemaitre added this to the 1.6 milestone May 20, 2024
@glemaitre (Member) left a comment

Some initial thoughts. I did not look at the documentation or tests, but that will come later.

@@ -674,6 +674,7 @@ Kernels

inspection.partial_dependence
inspection.permutation_importance
inspection.metric_threshold_curve
Member

I don't think that we should include this feature in this module. Like the precision-recall and ROC curves, this is a curve metric. Also, by including it in the sklearn.metrics module, we can drop the metric_ prefix from the name.

I would therefore call it sklearn.metrics.decision_threshold_curve
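To illustrate the suggested relocation and rename (a sketch only, assuming the proposal is adopted in this PR), the import from the example in the description would become:

# Sketch of the proposed relocation/rename: the curve would live in
# sklearn.metrics instead of sklearn.inspection, without the metric_ prefix.
from sklearn.metrics import decision_threshold_curve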

"y_true": ["array-like"],
"y_score": ["array-like"],
"score_func": [callable],
"threshold_grid": [
Member

To be consistent with TunedThresholdClassifierCV, we need to call this thresholds. I would not allow None, because with many samples we are going to reinterpolate anyway. So let's be consistent with the classifier here as well.

def metric_threshold_curve(
    y_true,
    y_score,
    score_func,
Member

We might want to call this scoring as well for consistency. But here, we should only accept a callable.

Comment on lines 43 to 44
    pos_label=None,
    sample_weight=None,
Member

Those are actually not necessary per se: they are parameters of scoring, and we should accept any keyword argument and propagate it. I assume that we should have a scoring_kwargs that is a dictionary.
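Putting the naming suggestions above together, here is a rough sketch of how the call from the PR description might look with the proposed parameter names (scoring, thresholds, scoring_kwargs). This only illustrates the reviewed proposal and requires this PR's branch; it is not a released API:

# Reuses y_test and predict_proba from the example in the PR description.
from functools import partial

from sklearn.metrics import decision_threshold_curve, fbeta_score

f2_values, thresholds = decision_threshold_curve(
    y_test,                                # true binary labels
    predict_proba,                         # scores of the positive class
    scoring=partial(fbeta_score, beta=2),  # callable metric, renamed from score_func
    thresholds=500,                        # threshold points, renamed from threshold_grid
    scoring_kwargs={"pos_label": 1},       # extra keyword arguments forwarded to `scoring`
)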

Comment on lines 130 to 177
    # Make y_true a boolean vector.
    y_true = y_true == pos_label

    # Sort scores and corresponding truth values.
    desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1]
    y_score = y_score[desc_score_indices]
    y_true = y_true[desc_score_indices]
    if sample_weight is not None:
        sample_weight = sample_weight[desc_score_indices]

    # Logic to see if we need to use all possible thresholds (distinct values).
    all_thresholds = False
    if threshold_grid is None:
        all_thresholds = True
    elif isinstance(threshold_grid, int):
        if len(set(y_score)) < threshold_grid:
            all_thresholds = True

    if all_thresholds:
        # y_score typically has many tied values. Here we extract
        # the indices associated with the distinct values. We also
        # concatenate a value for the end of the curve.
        distinct_value_indices = np.where(np.diff(y_score))[0]
        threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]
        thresholds = y_score[threshold_idxs[::-1]]
    elif isinstance(threshold_grid, int):
        # It takes representative score points to calculate the metric
        # with these thresholds.
        thresholds = np.percentile(
            list(set(y_score)), np.linspace(0, 100, threshold_grid)
        )
    else:
        # If threshold_grid is an array then run some checks and sort
        # it for consistency.
        threshold_grid = column_or_1d(threshold_grid)
        assert_all_finite(threshold_grid)
        thresholds = np.sort(threshold_grid)

    # For each threshold calculates the metric.
    metric_values = []
    for threshold in thresholds:
        preds_threshold = (y_score > threshold).astype(int)
        metric_values.append(
            score_func(y_true, preds_threshold, sample_weight=sample_weight)
        )
    # TODO: should we multithread the metric calculations?

    return np.array(metric_values), thresholds
Member

All this code is already implemented by the _score function of the sklearn.model_selection._classification_threshold._CurveScorer class.

I think that we should leverage this code by creating such a scorer. We probably need to dissociate getting y_score from the scoring itself, so that here we only call the scoring part.
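For intuition, a minimal standalone sketch of this separation (not the actual _CurveScorer code, whose internals are private): the scoring part only turns a precomputed y_score into per-threshold metric values, while obtaining y_score from an estimator stays a separate step.

import numpy as np

def score_over_thresholds(y_true, y_score, score_func, n_thresholds=100):
    # Scoring part only: y_score is assumed to be precomputed (e.g. from
    # predict_proba), so this helper never touches the estimator.
    thresholds = np.linspace(np.min(y_score), np.max(y_score), n_thresholds)
    scores = np.array(
        [score_func(y_true, (y_score >= t).astype(int)) for t in thresholds]
    )
    return scores, thresholds

With the earlier example, score_over_thresholds(y_test, predict_proba, partial(fbeta_score, beta=2)) would produce a curve similar to the one plotted in the description.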

Member

So now, it makes sense to move the _CurveScorer in metrics.

Contributor Author

So now, it makes sense to move the _CurveScorer in metrics.

Do you want me to do this in this PR or create a separate one?

@glemaitre (Member) commented May 21, 2024

It would be better to do it in a separate PR. Depending on the schedule, I might start that PR myself.

@vitaliset (Contributor Author)

I think it is time to review and prioritize this feature.

@vitaliset would you have time to dedicate to work on this feature?

Awesome news! I might need a couple of weeks, but I would love to make this feature available! Will work on your comments as soon as I can, @glemaitre.
