[MRG] Class weight imprecision #10515
Conversation
Thanks. This is a good start on improving my PR. I'm not sure that I'm entitled to approve it because it's mostly my work! Now, I'm sure this is helpful in the binary case. Yet it currently will apply jitter if there are differences in total weight between any two adjacent classes, which seems arbitrary, and the jitter in all classes may then be excessive. The former should be easy to solve; the latter I'm not sure about.
I hope this isn't complete overkill for a rare problem...
sklearn/utils/class_weight.py (Outdated)
@@ -54,13 +54,15 @@ def compute_class_weight(class_weight, classes, y):
        freq = np.bincount(y_ind).astype(np.float64)
        recip_freq = len(y) / (len(le.classes_) * freq)
        weight = recip_freq[le.transform(classes)]
+       if np.any(np.diff(freq * weight)):
+           freq_weight = np.reshape(freq * weight, (len(freq), 1))
I think this would be more readable with just len(np.unique(freq * weight)) > 1
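To make the two conditions concrete, here is a minimal sketch contrasting them, with a one-ulp perturbation injected via np.nextafter to stand in for the floating-point imprecision (the values are hypothetical, not the data from #10233):

import numpy as np

# Per-class totals (freq * weight) that should be identical, with the
# second one perturbed by a single ulp to simulate float imprecision.
products = np.array([5.0, np.nextafter(5.0, 6.0)])

print(np.any(np.diff(products)))      # condition as written in the diff: True
print(len(np.unique(products)) > 1)   # suggested, more readable form: True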
@@ -78,6 +80,26 @@ def compute_class_weight(class_weight, classes, y):
     return weight


+def _jitter_transform(true_order, bad_order):
wow, this got big :p
I feel I may have gone a little overboard by trying to have a smaller jitter :)
sklearn/utils/class_weight.py (Outdated)
+    # respect true order
+    k = len(true_order)
+    jitter = np.zeros(k)
+    for i in range(k // 2 + 1):
You're going to have to add some general comments on this algorithm!
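For readers without the full diff, here is a hedged reading of the general idea as a minimal sketch, not the PR's actual _jitter_transform: when float error makes the per-class totals tie or invert relative to the true class-frequency order, nudge them upward by single-ulp steps so ties break toward the true distribution. The helper name and the stepping strategy are assumptions.

import numpy as np

def jitter_toward_true_order(totals, freq):
    # Hypothetical helper, not the PR's implementation. `totals` holds
    # freq * weight per class; `freq` holds the true class counts.
    totals = np.asarray(totals, dtype=np.float64).copy()
    order = np.argsort(freq, kind="stable")  # true order, least frequent first
    prev = -np.inf
    for i in order:
        if totals[i] <= prev:                # tie or inversion vs. true order
            totals[i] = np.nextafter(prev, np.inf)  # smallest upward nudge
        prev = totals[i]
    return totals

The range(k // 2 + 1) loop in the diff suggests the actual PR distributes the jitter around the middle class to keep it small, which this sketch does not attempt.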
I have qualms about including this in practice. Let's say this is used for weighting samples in scoring, rather than training. Would that be a big problem?
I must admit that I am not sure if we should include this in practice. It is possible that other imprecision may lead to the same issue even if we had exact rationals for the weights. I tried using integers for the weights instead of fractions, and it seems that it is possible to recreate the same issue (or the opposite, with 0 accuracy).
Since nobody feels very comfortable with this change, maybe we could instead document the pitfall better?
During a triaging meeting, we decided to close this PR and move forward with #10233 (comment) |
Reference Issues/PRs
Fixes #10233
Continues the work of #10249
What does this implement/fix? Explain your changes.
Adds an epsilon to favour the true distribution when class_weight='balanced' is used and float imprecision does not allow the per-class sums of weights to be equal (float vs. rational).
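As a hedged illustration of the pitfall (with hypothetical class counts; the counts in #10233 may differ), the per-class totals are equal as exact rationals but need not be as float64 products:

from fractions import Fraction

import numpy as np

freq = np.array([3., 7.])               # hypothetical class counts
n, k = int(freq.sum()), len(freq)
weight = n / (k * freq)                 # float64 'balanced' weights
exact = [int(f) * Fraction(n, k * int(f)) for f in freq]
print(exact)          # exact rationals: both equal n/k == 5
print(freq * weight)  # float products; may differ in the last bit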
Any other comments?
Not sure if this is necessary.