Build in an option to change the "positive" class in sklearn.metrics.precision_recall_curve and sklearn.metrics.roc_curve #26758

Open
ThorbenMaa opened this issue Jul 4, 2023 · 4 comments
Labels
Needs Investigation Issue requires investigation New Feature

Comments

@ThorbenMaa

Describe the workflow you want to enable

In some cases it is useful to compare a machine learning classifier with experimental data using ROC or precision-recall curves. For a logistic regression model, for example, the score returned by the model is a probability or a value of the corresponding decision function. Both are positively correlated with the "positive" class and can be used directly to compute fpr and tpr with sklearn.metrics.precision_recall_curve or sklearn.metrics.roc_curve. Experimental data, however, can also be negatively correlated with the "positive" class. When calculating confusion matrices, one would then need to assume that the "positive" events lie not on the right side of the threshold but on the left side. The same problem occurs for imbalanced test data sets. There it is useful to use precision-recall curves and to define the "positive" class as the minority class (see https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/). In some cases, however, the minority class is negatively correlated with the probabilities or decision function values of the classifier (i.e., the minority class corresponds to a probability of zero or to negative decision function values). Then it is not enough to just use the pos_label option, because the "positive" values are again on the left rather than the right side of the thresholds used to compute the confusion matrices / precision-recall curve (see above).

Describe your proposed solution

One could add a parameter to sklearn.metrics.precision_recall_curve and sklearn.metrics.roc_curve that describes whether the y_score parameter is positively or negatively correlated with the "positive" class (something like pos_corr=String, default='positive'; the options would be 'positive' or 'negative'). The functions would then get an additional if condition:

y_score = np.asarray(y_score)
if pos_corr == 'negative':
    y_score = -y_score

This would transform a negative correlation to a positive one and sklearn.metrics.precision_recall_curve and sklearn.metrics.roc_curve could be used without further changes.
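
For illustration, here is a rough sketch of how this could be used; the pos_corr parameter is hypothetical and not part of the current scikit-learn API, and today the same effect can only be obtained by negating the scores manually:

import numpy as np
from sklearn.metrics import roc_curve

y = np.array([1, 1, 2])
scores = np.array([0.1, 0.4, 0.35])  # scores are positively correlated with class 2

# hypothetical API: roc_curve(y, scores, pos_label=1, pos_corr='negative')
# current workaround: flip the sign so that larger scores indicate the desired positive class
fpr, tpr, thresholds = roc_curve(y, -scores, pos_label=1)
print(fpr, tpr, thresholds)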

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@ThorbenMaa ThorbenMaa added Needs Triage Issue requires triage New Feature labels Jul 4, 2023
@glemaitre
Member

@ThorbenMaa From your argumentation, I have the impression that this is already covered by the parameter pos_label. Could you provide a specific short code snippet to illustrate the shortcoming of pos_label?

@ThorbenMaa
Author

ThorbenMaa commented Jul 5, 2023

Hi @glemaitre, if I'm not mistaken, it's not. Here is an example:

import numpy as np
from sklearn import metrics

y = np.array([1, 1, 2])  # labels
scores = np.array([0.1, 0.4, 0.35])  # probabilities from (e.g.) a log reg model that a data point belongs to class 2

fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
print("labels:")
print(y)
print("scores:")
print(scores)
for i in range(len(thresholds)):
    print("threshold: " + str(thresholds[i]))
    print("fpr      : " + str(fpr[i]))
    print("tpr      : " + str(tpr[i]) + "\n")

print("At a threshold of 0.35, the first entry is <0.35 and should be tn, the second is >=0.35 and should be fp, and the third is >=0.35 and should be tp. This would give a fpr of 0.5 (fpr=fp/(fp+tn)). Everything is all right :-)")

print("However, defining 1 as pos_label gives you:")
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=1)
for i in range(len(thresholds)):
    print("threshold: " + str(thresholds[i]))
    print("fpr      : " + str(fpr[i]))
    print("tpr      : " + str(tpr[i]) + "\n")

print("At a threshold of 0.35, the first entry is <=0.35 and should be tp, the second is >0.35 and should be fn, and the third is >0.35 and should be tn. This would give a fpr of 0 (fpr=fp/(fp+tn)), but it is 1!")

print("This is because metrics.roc_curve assigns the positive class to values higher than the threshold. In other words, it expects a positive correlation between scores and the positive class.")

print("If one would like to have the tpr of the class labeled with 1 as the positive class, one would need to do 'scores=(-1)*scores'")

scores = (-1) * scores
print("resulting in labels (unchanged):")
print(y)
print("and scores (changed):")
print(scores)
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=1)
for i in range(len(thresholds)):
    print("threshold: " + str(thresholds[i]))
    print("fpr      : " + str(fpr[i]))
    print("tpr      : " + str(tpr[i]) + "\n")

print("At a threshold of -0.35, the first entry is >=-0.35 and should be tp, the second is < -0.35 and should be fn, and the third is >=-0.35 and should be fp. This would give a fpr of 1 (fpr=fp/(fp+tn)). Everything is all right :-)")

which gives you:

labels:
[1 1 2]
scores:
[0.1 0.4 0.35]
threshold: 1.4
fpr : 0.0
tpr : 0.0

threshold: 0.4
fpr : 0.5
tpr : 0.0

threshold: 0.35
fpr : 0.5
tpr : 1.0

At a threshold of 0.35, the first entry is <0.35 and should be tn, the second is >=0.35 and should be fp, and the third is >=0.35 and should be tp. This would give a fpr of 0.5 (fpr=fp/(fp+tn)). Everything is all right :-)
However, defining 1 as pos_label gives you:
threshold: 1.4
fpr : 0.0
tpr : 0.0

threshold: 0.4
fpr : 0.0
tpr : 0.5

threshold: 0.35
fpr : 1.0
tpr : 0.5

At a threshold of 0.35, the first entry is <=0.35 and should be tp, the second is >0.35 and should be fn, and the third is >0.35 and should be tn. This would give a fpr of 0 (fpr=fp/(fp+tn)), but it is 1!
This is because metrics.roc_curve assigns the positive class to values higher than the threshold. In other words, it expects a positive correlation between scores and the positive class.
If one would like to have the tpr of the class labeled with 1 as the positive class, one would need to do 'scores=(-1)*scores'
resulting in labels (unchanged):
[1 1 2]
and scores (changed):
[-0.1 -0.4 -0.35]
threshold: 0.9
fpr : 0.0
tpr : 0.0

threshold: -0.1
fpr : 0.0
tpr : 0.5

threshold: -0.35
fpr : 1.0
tpr : 0.5

At a threshold of -0.35, the first entry is >=-0.35 and should be tp, the second is < -0.35 and should be fn, and the third is >=-0.35 and should be fp. This would give a fpr of 1 (fpr=fp/(fp+tn)). Everything is all right :-)

The problem is that the functions assign the "positive class" to a score if the score is higher than the threshold. So if one only uses pos_label, one still uses this definition of how a score is assigned to a class. This is not what one wants in many cases, e.g. if you want to quantify the performance of a classifier in terms of its precision on class 1. This is for example very important if you have a very imbalanced test data set (explained in the reference I linked above; a quote from it is "The focus of the PR curve on the minority class makes it an effective diagnostic for imbalanced binary classification models."). You can also find the following in https://www.sciencedirect.com/science/article/abs/pii/S0169743917303441: "Realizing its disadvantages in dealing with class-imbalanced data, precision-recall curve (PRC) is more informative than ROC, and it has become a basis for assessing classification methods on class-imbalanced data [6]. The PRC has recall on the x-axis and precision on the y-axis. Precision is the proportion of true positives among the positive predictions, and recall measures the proportion of positives that are correctly identified as such. The potential advantages of PRC on class-imbalanced data are due to the fact that the PRC ignores true negatives altogether. As a result, PRC is suitable to assess the performance of the minority samples."

In my opinion, this could be easily solved by the approach mentioned in the initial post :-). What do you think?
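
As a rough sketch of the same workaround for the imbalanced case (toy data made up here, with the minority class 1 assumed to be negatively correlated with the scores):

import numpy as np
from sklearn.metrics import precision_recall_curve

y = np.array([1, 2, 2, 2])                # class 1 is the minority class
scores = np.array([0.1, 0.4, 0.35, 0.8])  # probabilities of class 2, i.e. low values indicate class 1

# negate the scores so that larger values correspond to the minority class
precision, recall, thresholds = precision_recall_curve(y, -scores, pos_label=1)
print(precision, recall, thresholds)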

@glemaitre
Member

glemaitre commented Jul 11, 2023

Let me summarize succinctly: we convert probabilistic predictions to deterministic predictions such that y_pred = pos_label if y_proba >= proba_threshold else neg_label. pos_label and neg_label are just defined by the user.

In your example, it seems that you expect >= to become > for some reason that I don't understand. Changing the label should not change the thresholding rule.

Regarding the example above, in the case that pos_label=1, y_pred would become [2, 1, 1] when comparing >= 0.35, which gives an fpr=1 and a tpr=0.5, as reported by the roc_curve function.
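
As a minimal check of this rule on the toy data from the snippet above (the confusion-matrix bookkeeping is written out by hand here for illustration):

import numpy as np

y_true = np.array([1, 1, 2])
y_proba = np.array([0.1, 0.4, 0.35])

# apply y_pred = pos_label if y_proba >= proba_threshold else neg_label
# with pos_label=1, neg_label=2, proba_threshold=0.35
y_pred = np.where(y_proba >= 0.35, 1, 2)  # -> [2, 1, 1]

tp = np.sum((y_pred == 1) & (y_true == 1))  # 1
fn = np.sum((y_pred == 2) & (y_true == 1))  # 1
fp = np.sum((y_pred == 1) & (y_true == 2))  # 1
tn = np.sum((y_pred == 2) & (y_true == 2))  # 0
print(tp / (tp + fn))  # tpr = 0.5
print(fp / (fp + tn))  # fpr = 1.0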

@ThorbenMaa
Author

"In your example, it seems that you expect >= to become > for some reason that I don't understand."

Let me elaborate:
Let's assume y_proba=1 corresponds to "person becomes sick" and pos_label, and y_proba=0 corresponds to "person stays healthy" and neg_label. Metrics like fpr and tpr can be calculated using y_proba, y_label, and y_pred as given in (I).
(I):

y_label=[neg_label, neg_label, pos_label]
y_proba=[0.1, 0.4, 0.35]
y_pred = pos_label if y_proba >= proba_threshold else neg_label

Now I want to get metrics for the "person stays healthy" class. Switching labels would do the following (see (II)):
(II)

y_label=[pos_label, pos_label, neg_label]
y_proba=[0.1, 0.4, 0.35]
y_pred = pos_label if y_proba >= proba_threshold else neg_label

Now, y_proba=1 corresponds to "person becomes sick" and neg_label, and y_proba=0 corresponds to "person stays healthy" and pos_label. If I calculate tpr and fpr for both cases, fpr(II)=tpr(I) (see the example above). However, to get metrics for the "person stays healthy" class, I would need something like y_proba=1 corresponding to "person stays healthy" and pos_label, and y_proba=0 to "person becomes sick" and neg_label. Or I would need to turn around the threshold condition.

Summing up, in addition to switching the labels I would need to either turn around y_proba = 1 - y_proba or turn around the threshold condition to y_pred = pos_label if y_proba < proba_threshold else neg_label. Just switching the labels does not do the job.

I hope my problem is now clear. Turning around y_proba = 1 - y_proba of course only works for probabilities, not for decision functions. That's why I think it is easier to do something like y_proba = (-1) * y_proba. Turning around the threshold condition would also work.
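
As a short check on the toy data from above (assuming the scores are probabilities): flipping via 1 - y_proba and flipping via (-1) * y_proba give identical fpr/tpr values, only the reported thresholds differ:

import numpy as np
from sklearn.metrics import roc_curve

y = np.array([1, 1, 2])
proba = np.array([0.1, 0.4, 0.35])

fpr_a, tpr_a, _ = roc_curve(y, 1 - proba, pos_label=1)
fpr_b, tpr_b, _ = roc_curve(y, -proba, pos_label=1)
print(np.array_equal(fpr_a, fpr_b), np.array_equal(tpr_a, tpr_b))  # True True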

@thomasjpfan thomasjpfan added Needs Investigation Issue requires investigation and removed Needs Triage Issue requires triage labels Jul 27, 2023