Build in an option to change the "positive" class in sklearn.metrics.precision_recall_curve and sklearn.metrics.roc_curve #26758
Comments
@ThorbenMaa From your argumentation, I have the impression that this is already covered by the `pos_label` parameter.
Hi @glemaitre, if I'm not mistaken, it's not. Here is an example:

```python
import numpy as np
from sklearn import metrics

y = np.array([1, 1, 2])  # labels
# probability (from e.g. a logistic regression model) that a data point belongs to class 2
scores = np.array([0.1, 0.4, 0.35])

fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
print("labels:", y)
print("scores:", scores)
# note: roc_curve can return more thresholds than samples, so iterate over the returned arrays
for i in range(len(thresholds)):
    print("threshold:", thresholds[i])
    print("fpr      :", fpr[i])
    print("tpr      :", tpr[i], "\n")
```

At a threshold of 0.35, the first entry is < 0.35 and should be a TN, the second is >= 0.35 and should be an FP, and the third is >= 0.35 and should be a TP. This would give an fpr of 0.5 (fpr = fp / (fp + tn)). Everything is all right :-)

However, defining 1 as `pos_label` gives you:

```python
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=1)
for i in range(len(thresholds)):
    print("threshold:", thresholds[i])
    print("fpr      :", fpr[i])
    print("tpr      :", tpr[i], "\n")
```

At a threshold of 0.35, the first entry is <= 0.35 and should be a TP, the second is > 0.35 and should be an FN, and the third is > 0.35 and should be a TN. This would give an fpr of 0 (fpr = fp / (fp + tn)), but it is 1!

This is because `metrics.roc_curve` assigns the positive class to values higher than the threshold. In other words, it expects a positive correlation between the scores and the positive class. If I would like to have the tpr of the class labeled 1 as the positive class, I need to do `scores = (-1) * scores`:

```python
scores = (-1) * scores
print("labels (unchanged):", y)
print("scores (changed):  ", scores)

fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=1)
for i in range(len(thresholds)):
    print("threshold:", thresholds[i])
    print("fpr      :", fpr[i])
    print("tpr      :", tpr[i], "\n")
```

At a threshold of -0.35, the first entry is >= -0.35 and should be a TP, the second is < -0.35 and should be an FN, and the third is >= -0.35 and should be an FP. This would give an fpr of 1 (fpr = fp / (fp + tn)). Everything is all right :-)

The problem is that the functions assign the "positive class" to a score if the score is higher than the threshold. So if one only uses `pos_label`, the curve is wrong whenever the scores are negatively correlated with the chosen positive class. In my opinion, this could be easily solved by the approach mentioned in the initial post :-). What do you think?
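The claim can also be checked compactly. The sketch below indexes the returned arrays by threshold value rather than by sample index, since `roc_curve` may return more thresholds than samples:

```python
import numpy as np
from sklearn import metrics

y = np.array([1, 1, 2])
scores = np.array([0.1, 0.4, 0.35])

# pos_label=2: at threshold 0.35 (positive iff score >= 0.35) the three
# samples are TN, FP, TP, so fpr = 0.5 and tpr = 1.
fpr2, tpr2, thr2 = metrics.roc_curve(y, scores, pos_label=2)
i = np.where(thr2 == 0.35)[0][0]
print(fpr2[i], tpr2[i])  # 0.5 1.0

# pos_label=1: at the same threshold the fpr is 1, not the 0 one would
# expect if LOW scores indicated class 1.
fpr1, tpr1, thr1 = metrics.roc_curve(y, scores, pos_label=1)
j = np.where(thr1 == 0.35)[0][0]
print(fpr1[j], tpr1[j])  # 1.0 0.5
```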
Let me summarize succinctly: we convert probabilistic predictions to deterministic predictions such that […]. In your example, it seems that you expect `>=` to become `>` for some reason that I don't understand. Regarding the example above, in the case that […]
"In your example, it seems that you expect `>=` to become `>` for some reason that I don't understand."

Let me elaborate: […]

Now I want to get metrics for the "person stays healthy" class. Switching the labels would do the following (see (II)): […]

Now, […]

Summing up, in addition to switching the labels I would need to either turn around `>=` […]. I hope it is now possible to understand my problem. Turning around `>=` […]
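The elaboration can be made concrete with a small, made-up "stays healthy / gets sick" example (a sketch; the labels and scores are invented for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Model score = probability that a person gets sick.
y = np.array(["sick", "healthy", "sick", "healthy"])
scores = np.array([0.9, 0.2, 0.7, 0.4])

# Positive class "sick": high scores indicate the positive class, as assumed.
fpr, tpr, _ = roc_curve(y, scores, pos_label="sick")
print(auc(fpr, tpr))  # 1.0 -- a perfect classifier

# Only switching the label to "healthy": high scores are STILL treated as
# positive, so the same perfect classifier now looks maximally wrong.
fpr_w, tpr_w, _ = roc_curve(y, scores, pos_label="healthy")
print(auc(fpr_w, tpr_w))  # 0.0

# Switching the label AND negating the scores gives the intended result.
fpr_h, tpr_h, _ = roc_curve(y, -scores, pos_label="healthy")
print(auc(fpr_h, tpr_h))  # 1.0
```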
Describe the workflow you want to enable
In some cases, it is nice to compare a machine learning classifier with experimental data using ROC or precision-recall curves. For (e.g.) a logistic regression model, the score returned by the model will be a probability value or a value from the corresponding decision function. Both values are positively correlated with the "positive" class and can readily be used to calculate fpr and tpr with sklearn.metrics.precision_recall_curve or sklearn.metrics.roc_curve.

However, experimental data can also be negatively correlated with the "positive" class. When calculating confusion matrices, one would then need to assume that the "positive" events are not on the right side of the threshold, but on the left side.

The same problem occurs for imbalanced test data sets. Here, it is useful to use precision-recall curves and to define the "positive" class as the minority class (see https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/). However, in some cases the minority class is negatively correlated with the probability values or decision function values of the classifier (i.e., the minority class corresponds to a probability of zero or to negative decision function values). Then it is not enough to just use the "pos_label" option, because the "positive" values are again on the left and not on the right side of the thresholds used for calculating the confusion matrices / precision-recall curve (see above).
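To illustrate the imbalanced case described above with a small synthetic sketch (all data here are made up): the minority class of interest corresponds to low scores, so using it as the positive class without flipping the scores yields an AUC below 0.5, and negating the scores is required today.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Imbalanced labels: class 0 is the rare class of interest (~10%).
y = (rng.random(500) < 0.9).astype(int)
# Classifier scores approximate the probability of class 1, i.e. they are
# NEGATIVELY correlated with class 0.
scores = np.clip(y + rng.normal(scale=0.3, size=500), 0.0, 1.0)

# Treating class 0 as positive without flipping the scores: AUC < 0.5.
auc_naive = roc_auc_score(y == 0, scores)
# Negating the scores restores the positive correlation: AUC > 0.5.
auc_fixed = roc_auc_score(y == 0, -scores)
# The two views are mirror images: the AUCs sum to 1 (up to float rounding).
print(auc_naive, auc_fixed)
```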
Describe your proposed solution
One could add a parameter to sklearn.metrics.precision_recall_curve and sklearn.metrics.roc_curve that describes whether the `y_score` parameter is positively or negatively correlated with the "positive" class (something like `pos_corr=String, default='positive'`, with the options `'positive'` and `'negative'`). The functions would then get an additional if condition that transforms a negative correlation into a positive one, so that sklearn.metrics.precision_recall_curve and sklearn.metrics.roc_curve could be used without further changes.
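A minimal sketch of what the proposal could look like, written here as a wrapper rather than as a patch to scikit-learn (the function name `roc_curve_with_corr` and its exact signature are hypothetical):

```python
import numpy as np
from sklearn import metrics

def roc_curve_with_corr(y_true, y_score, pos_label=None, pos_corr="positive"):
    """Hypothetical illustration of the proposed `pos_corr` parameter.

    If the scores are negatively correlated with the positive class, flip
    them so the usual "positive above the threshold" convention applies.
    """
    if pos_corr not in ("positive", "negative"):
        raise ValueError("pos_corr must be 'positive' or 'negative'")
    y_score = np.asarray(y_score)
    if pos_corr == "negative":
        # Transform the negative correlation into a positive one.
        y_score = -y_score
    return metrics.roc_curve(y_true, y_score, pos_label=pos_label)

# The example from the discussion above: the scores are probabilities of
# class 2, so they are negatively correlated with class 1.
y = np.array([1, 1, 2])
scores = np.array([0.1, 0.4, 0.35])
fpr, tpr, thresholds = roc_curve_with_corr(y, scores, pos_label=1, pos_corr="negative")
```

The same one-line flip would apply equally inside sklearn.metrics.precision_recall_curve.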
Describe alternatives you've considered, if relevant
No response
Additional context
No response