Build in an option to change the "positive" class in sklearn.metrics.precision_recall_curve and sklearn.metrics.roc_curve #26758

Open
ThorbenMaa opened this issue Jul 4, 2023 · 4 comments
Labels
Needs Investigation Issue requires investigation New Feature

Comments

@ThorbenMaa

Describe the workflow you want to enable

In some cases it is useful to compare a machine learning classifier with experimental data using ROC or precision-recall curves. For a logistic regression model, for example, the score returned by the model is a probability or a value of the corresponding decision function. Both are positively correlated with the "positive" class and can be used directly to compute fpr and tpr with sklearn.metrics.precision_recall_curve or sklearn.metrics.roc_curve. Experimental data, however, can also be negatively correlated with the "positive" class. When calculating confusion matrices, one would then need to assume that the "positive" events lie not on the right side of the threshold but on the left side. The same problem occurs for imbalanced test data sets. There it is useful to use precision-recall curves and to define the "positive" class as the minority class (see https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/). In some cases, however, the minority class is negatively correlated with the probabilities or decision function values of the classifier (i.e., the minority class corresponds to a probability of zero or to negative decision function values). Then it is not enough to just use the pos_label option, because the "positive" values are again on the left rather than the right side of the thresholds used to compute the confusion matrices / precision-recall curve (see above).

Describe your proposed solution

One could add a parameter to sklearn.metrics.precision_recall_curve and sklearn.metrics.roc_curve that describes whether the y_score parameter is positively or negatively correlated with the "positive" class (something like pos_corr=String, default='positive'; the options would be 'positive' or 'negative'). The functions would then get an additional if condition:

y_score = np.asarray(y_score)
if pos_corr == 'negative':
    y_score = -y_score

This would transform a negative correlation to a positive one and sklearn.metrics.precision_recall_curve and sklearn.metrics.roc_curve could be used without further changes.
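
For illustration, here is a rough sketch of how this could be used; the pos_corr parameter is hypothetical and not part of the current scikit-learn API, and today the same effect can only be obtained by negating the scores manually:

import numpy as np
from sklearn.metrics import roc_curve

y = np.array([1, 1, 2])
scores = np.array([0.1, 0.4, 0.35])  # scores are positively correlated with class 2

# hypothetical API: roc_curve(y, scores, pos_label=1, pos_corr='negative')
# current workaround: flip the sign so that larger scores indicate the desired positive class
fpr, tpr, thresholds = roc_curve(y, -scores, pos_label=1)
print(fpr, tpr, thresholds)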

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@ThorbenMaa ThorbenMaa added Needs Triage Issue requires triage New Feature labels Jul 4, 2023
@glemaitre
Member

@ThorbenMaa From your argumentation, I have the impression that this is already covered by the parameter pos_label. Could you provide a specific short code snippet to illustrate the shortcoming of pos_label?

@ThorbenMaa
Author

ThorbenMaa commented Jul 5, 2023

Hi @glemaitre, if I'm not mistaken, it's not. Here is an example:

import numpy as np
from sklearn import metrics

y = np.array([1, 1, 2])  # labels
scores = np.array([0.1, 0.4, 0.35])  # probabilities from (e.g.) a log reg model that a data point belongs to class 2

fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
print("labels:")
print(y)
print("scores:")
print(scores)
for i in range(len(thresholds)):
    print("threshold: " + str(thresholds[i]))
    print("fpr      : " + str(fpr[i]))
    print("tpr      : " + str(tpr[i]) + "\n")

print("At a threshold of 0.35, the first entry is <0.35 and should be tn, the second is >=0.35 and should be fp, and the third is >=0.35 and should be tp. This would give a fpr of 0.5 (fpr=fp/(fp+tn)). Everything is all right :-)")

print("However, defining 1 as pos_label gives you:")
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=1)
for i in range(len(thresholds)):
    print("threshold: " + str(thresholds[i]))
    print("fpr      : " + str(fpr[i]))
    print("tpr      : " + str(tpr[i]) + "\n")

print("At a threshold of 0.35, the first entry is <=0.35 and should be tp, the second is >0.35 and should be fn, and the third is >0.35 and should be tn. This would give a fpr of 0 (fpr=fp/(fp+tn)), but it is 1!")

print("This is because metrics.roc_curve assigns the positive class to values higher than the threshold. In other words, it expects a positive correlation between scores and the positive class.")

print("If one would like to have the tpr of the class labeled with 1 as the positive class, one would need to do 'scores=(-1)*scores'")

scores = (-1) * scores
print("resulting in labels (unchanged):")
print(y)
print("and scores (changed):")
print(scores)
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=1)
for i in range(len(thresholds)):
    print("threshold: " + str(thresholds[i]))
    print("fpr      : " + str(fpr[i]))
    print("tpr      : " + str(tpr[i]) + "\n")

print("At a threshold of -0.35, the first entry is >=-0.35 and should be tp, the second is < -0.35 and should be fn, and the third is >=-0.35 and should be fp. This would give a fpr of 1 (fpr=fp/(fp+tn)). Everything is all right :-)")

which gives you:

labels:
[1 1 2]
scores:
[0.1 0.4 0.35]
threshold: 1.4
fpr : 0.0
tpr : 0.0

threshold: 0.4
fpr : 0.5
tpr : 0.0

threshold: 0.35
fpr : 0.5
tpr : 1.0

At a threshold of 0.35, the first entry is <0.35 and should be tn, the second is >=0.35 and should be fp, and the third is >=0.35 and should be tp. This would give a fpr of 0.5 (fpr=fp/(fp+tn)). Everything is all right :-)
However, defining 1 as pos_label gives you:
threshold: 1.4
fpr : 0.0
tpr : 0.0

threshold: 0.4
fpr : 0.0
tpr : 0.5

threshold: 0.35
fpr : 1.0
tpr : 0.5

At a threshold of 0.35, the first entry is <=0.35 and should be tp, the second is >0.35 and should be fn, and the third is >0.35 and should be tn. This would give a fpr of 0 (fpr=fp/(fp+tn)), but it is 1!
This is because metrics.roc_curve assigns the positive class to values higher than the threshold. In other words, it expects a positive correlation between scores and the positive class.
If one would like to have the tpr of the class labeled with 1 as the positive class, one would need to do 'scores=(-1)*scores'
resulting in labels (unchanged):
[1 1 2]
and scores (changed):
[-0.1 -0.4 -0.35]
threshold: 0.9
fpr : 0.0
tpr : 0.0

threshold: -0.1
fpr : 0.0
tpr : 0.5

threshold: -0.35
fpr : 1.0
tpr : 0.5

At a threshold of -0.35, the first entry is >=-0.35 and should be tp, the second is < -0.35 and should be fn, and the third is >=-0.35 and should be fp. This would give a fpr of 1 (fpr=fp/(fp+tn)). Everything is all right :-)

The problem is that the functions assign the "positive class" to a score if the score is higher than the threshold. So if one only uses pos_label, one still uses this definition of how a score is assigned to a class. This is not what one wants in many cases, e.g. if you want to quantify the performance of a classifier in terms of its precision on class 1. This is for example very important if you have a very imbalanced test data set (explained in the reference I linked above; a quote from it is "The focus of the PR curve on the minority class makes it an effective diagnostic for imbalanced binary classification models."). You can also find the following in https://www.sciencedirect.com/science/article/abs/pii/S0169743917303441: "Realizing its disadvantages in dealing with class-imbalanced data, precision-recall curve (PRC) is more informative than ROC, and it has become a basis for assessing classification methods on class-imbalanced data [6]. The PRC has recall on the x-axis and precision on the y-axis. Precision is the proportion of true positives among the positive predictions, and recall measures the proportion of positives that are correctly identified as such. The potential advantages of PRC on class-imbalanced data are due to the fact that the PRC ignores true negatives altogether. As a result, PRC is suitable to assess the performance of the minority samples."

In my opinion, this could be easily solved by the approach mentioned in the initial post :-). What do you think?
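
As a rough sketch of the same workaround for the imbalanced case (toy data made up here, with the minority class 1 assumed to be negatively correlated with the scores):

import numpy as np
from sklearn.metrics import precision_recall_curve

y = np.array([1, 2, 2, 2])                # class 1 is the minority class
scores = np.array([0.1, 0.4, 0.35, 0.8])  # probabilities of class 2, i.e. low values indicate class 1

# negate the scores so that larger values correspond to the minority class
precision, recall, thresholds = precision_recall_curve(y, -scores, pos_label=1)
print(precision, recall, thresholds)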

@glemaitre
Member

glemaitre commented Jul 11, 2023

Let me summarize succinctly: we convert probabilistic predictions to deterministic predictions such that y_pred = pos_label if y_proba >= proba_threshold else neg_label. pos_label and neg_label are just defined by the user.

In your example, it seems that you expect >= to become > for some reason that I don't understand. Changing the label should not change the thresholding rule.

Regarding the example above, in the case that pos_label=1, y_pred would become [2, 1, 1] when comparing >= 0.35, which gives an fpr=1 and a tpr=0.5, as reported by the roc_curve function.
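
As a minimal check of this rule on the toy data from the snippet above (the confusion-matrix bookkeeping is written out by hand here for illustration):

import numpy as np

y_true = np.array([1, 1, 2])
y_proba = np.array([0.1, 0.4, 0.35])

# apply y_pred = pos_label if y_proba >= proba_threshold else neg_label
# with pos_label=1, neg_label=2, proba_threshold=0.35
y_pred = np.where(y_proba >= 0.35, 1, 2)  # -> [2, 1, 1]

tp = np.sum((y_pred == 1) & (y_true == 1))  # 1
fn = np.sum((y_pred == 2) & (y_true == 1))  # 1
fp = np.sum((y_pred == 1) & (y_true == 2))  # 1
tn = np.sum((y_pred == 2) & (y_true == 2))  # 0
print(tp / (tp + fn))  # tpr = 0.5
print(fp / (fp + tn))  # fpr = 1.0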

@ThorbenMaa
Author

"In your example, it seems that you expect >= to become > for some reason that I don't understand."

Let me elaborate:
Let's assume y_proba=1 corresponds to "person becomes sick" and pos_label, and y_proba=0 corresponds to "person stays healthy" and neg_label. Metrics like fpr and tpr can be calculated using y_proba, y_label, and y_pred as given in (I).
(I):

y_label=[neg_label, neg_label, pos_label]
y_proba=[0.1, 0.4, 0.35]
y_pred = pos_label if y_proba >= proba_threshold else neg_label

Now I want to get metrics for the "person stays healthy" class. Switching labels would do the following (see (II)):
(II)

y_label=[pos_label, pos_label, neg_label]
y_proba=[0.1, 0.4, 0.35]
y_pred = pos_label if y_proba >= proba_threshold else neg_label

Now, y_proba=1 corresponds to "person becomes sick" and neg_label, and y_proba=0 corresponds to "person stays healthy" and pos_label. If I calculate tpr and fpr for both cases, fpr(II)=tpr(I) (see the example above). However, to get metrics for the "person stays healthy" class, I would need something like y_proba=1 corresponding to "person stays healthy" and pos_label, and y_proba=0 to "person becomes sick" and neg_label. Or I would need to turn around the threshold condition.

Summing up, in addition to switching the labels I would need to either turn around y_proba = 1 - y_proba or turn around the threshold condition to y_pred = pos_label if y_proba < proba_threshold else neg_label. Just switching the labels does not do the job.

I hope my problem is now clear. Turning around y_proba = 1 - y_proba of course only works for probabilities, not for decision functions. That's why I think it is easier to do something like y_proba = (-1) * y_proba. Turning around the threshold condition would also work.
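
As a short check on the toy data from above (assuming the scores are probabilities): flipping via 1 - y_proba and flipping via (-1) * y_proba give identical fpr/tpr values, only the reported thresholds differ:

import numpy as np
from sklearn.metrics import roc_curve

y = np.array([1, 1, 2])
proba = np.array([0.1, 0.4, 0.35])

fpr_a, tpr_a, _ = roc_curve(y, 1 - proba, pos_label=1)
fpr_b, tpr_b, _ = roc_curve(y, -proba, pos_label=1)
print(np.array_equal(fpr_a, fpr_b), np.array_equal(tpr_a, tpr_b))  # True True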

@thomasjpfan thomasjpfan added Needs Investigation Issue requires investigation and removed Needs Triage Issue requires triage labels Jul 27, 2023