
[MRG] Class weight imprecision #10515

Closed
Wants to merge 11 commits.

Conversation

@Johayon (Contributor) commented Jan 21, 2018

Reference Issues/PRs

Fixes #10233
Continues the work of #10249.

What does this implement/fix? Explain your changes.

Adds an epsilon to favour the true distribution when class_weight='balanced' is used and floating-point imprecision prevents the per-class weight totals from being exactly equal (float vs. rational arithmetic).
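
For context (not part of the PR): with class_weight='balanced', each class weight is n_samples / (n_classes * count), so in exact rational arithmetic every class's total weight equals n_samples / n_classes. A minimal sketch of that invariant, with made-up counts, alongside the same formula in float64 where the equality can break:

```python
import numpy as np
from fractions import Fraction

counts = [7, 24]  # hypothetical class counts
n_samples, n_classes = sum(counts), len(counts)

# exact rational arithmetic: count * weight == n_samples / n_classes for every class
exact_totals = {c * Fraction(n_samples, n_classes * c) for c in counts}
assert len(exact_totals) == 1

# the same 'balanced' formula evaluated in float64
freq = np.array(counts, dtype=np.float64)
weight = n_samples / (n_classes * freq)
print(np.unique(freq * weight))  # more than one value here is the imprecision at issue
```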

Any other comments?

Not sure if this is necessary.

@Johayon changed the title from "Class weight imprecision" to "[MRG] Class weight imprecision" on Jan 21, 2018
@jnothman (Member)

Thanks. This is a good start on improving my PR. I'm not sure that I'm entitled to approve it, because it's mostly my work!

Now, I'm sure this is helpful in the binary case. Yet it currently applies jitter whenever there is a difference in total weight between any two adjacent classes, which seems arbitrary; and the jitter applied across all classes may then be excessive. The former should be easy to solve; the latter I'm not sure about.

@jnothman (Member) left a comment


I hope this isn't complete overkill for a rare problem...

@@ -54,13 +54,15 @@ def compute_class_weight(class_weight, classes, y):
         freq = np.bincount(y_ind).astype(np.float64)
         recip_freq = len(y) / (len(le.classes_) * freq)
         weight = recip_freq[le.transform(classes)]
+        if np.any(np.diff(freq * weight)):
+            freq_weight = np.reshape(freq * weight, (len(freq), 1))
@jnothman (Member) commented on the diff:

I think this would be more readable with just len(np.unique(freq * weight)) > 1
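(Side note, illustrative values only: both spellings detect the same condition, "not all per-class totals are equal", since any non-constant array must contain an unequal adjacent pair.)

```python
import numpy as np

totals = np.array([2.5, 2.5, 2.5000000000000004])  # made-up per-class totals
print(np.any(np.diff(totals)))      # True: some adjacent pair differs
print(len(np.unique(totals)) > 1)   # True: not all values are equal
```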

@@ -78,6 +80,26 @@ def compute_class_weight(class_weight, classes, y):
     return weight


+def _jitter_transform(true_order, bad_order):
@jnothman (Member):

wow, this got big :p

@Johayon (Contributor, Author):

I feel I may have gone a little overboard by trying to have a smaller jitter :)

+    # respect true order
+    k = len(true_order)
+    jitter = np.zeros(k)
+    for i in range(k // 2 + 1):
@jnothman (Member):

You're going to have to add some general comments on this algorithm!

@jnothman (Member) commented Feb 5, 2018

I have qualms about including this in practice. Let's say this is used for weighting samples in scoring, rather than training. Would that be a big problem?

@Johayon (Contributor, Author) commented Feb 9, 2018

I must admit that I am not sure whether we should include this in practice. Other sources of imprecision may lead to the same issue even if we had exact rationals for the weights. I tried using integers for the weights instead of fractions, and it seems possible to recreate the same issue (or the opposite, with 0 accuracy).
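
A sketch of the kind of experiment described above, reconstructed from the linked issue; the dataset, sizes, and seed are assumptions, and whether the degenerate 0%/100% accuracy appears depends on the exact counts and rounding:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(18, 2)             # pure noise, so ~0.5 is the honest accuracy
y = np.array([0] * 9 + [1] * 9)

# under LeaveOneOut the held-out class is always the training minority, so any
# systematic tie-break in the weighted class totals can skew the mean accuracy
for cw in (None, "balanced", {0: 1, 1: 1}):
    scores = cross_val_score(SVC(class_weight=cw), X, y, cv=LeaveOneOut())
    print(cw, scores.mean())
```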

@amueller added the API and Needs Decision labels on Aug 5, 2019
Base automatically changed from master to main January 22, 2021 10:50
@cmarmo added the Needs Decision - Close label and removed the Needs Decision label on Dec 23, 2021
@ogrisel (Member) commented Jan 14, 2022

Since nobody feels very comfortable with this change, maybe we could instead document the pitfall better?

#10233 (comment)

@thomasjpfan (Member):

During a triaging meeting, we decided to close this PR and move forward with #10233 (comment)

CC @ogrisel @glemaitre

Linked issue: Possible bug when combining SVC + class_weights='balanced' + LeaveOneOut (#10233)