
[WIP] FIX? Add eps to obviate imprecision issues in balanced class_weight #10249

Closed

Conversation

jnothman (Member) commented Dec 4, 2017

This is a potential fix for the issue raised in #10233, wherein surprisingly good results were achieved under cross-validation due to numerical imprecision in compute_class_weight(..., 'balanced'). This situation is rare, but reportedly occurred in a grid search on a real-world dataset.

The fix works by ensuring that, when class weights cannot be exactly balanced for a particular training set due to numerical imprecision, the original ranking of class distributions is marginally preferred.

What do others think of this? Is the problem substantial enough to break backwards compatibility of results in a small way? Are there alternative fixes?
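
For context, compute_class_weight(..., 'balanced') assigns each sample of class c the weight n_samples / (n_classes * count_c). A minimal sketch of where the rounding enters (illustrative values, not the PR diff):

import numpy as np

# Encoded labels for a small, imbalanced two-class problem (illustrative).
y_ind = np.array([0] * 3 + [1] * 7)
counts = np.bincount(y_ind).astype(np.float64)    # per-class counts
recip_freq = len(y_ind) / (len(counts) * counts)  # 'balanced' per-sample weights

# Each class's total weight should be exactly n_samples / n_classes, but
# recip_freq is rounded to the nearest float, so the totals below can
# differ in the last bit -- sometimes leaving the populous class lighter.
print(recip_freq * counts)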

Review comments on the diff:

weight = recip_freq[le.transform(classes)]
if np.any(np.diff(freq)):
Member:

I don't understand this line. Doesn't that rely on the class order?

jnothman (Member, Author):

Sorry, I had intended freq * weight, to check whether there was exact float equality under the weighting. But we could use this hack in all cases if we think that's better (less architecture-dependent, for instance).

Member:

I don't understand the hack then ;)

jnothman (Member, Author) commented Dec 5, 2017

The test failures suggest that the tweaks should sometimes be subtracted rather than added.

jnothman (Member, Author):
@Johayon, perhaps you are interested in helping me fix this one, seeing as in some ways it bears similarity to your recent work. I'm still not sure it's worth fixing, though!

Johayon (Contributor) commented Jan 16, 2018

Thank you.
If you think I can help you, I will look at it.

Johayon (Contributor) commented Jan 16, 2018

@jnothman, if I understand the problem correctly: the numerical imprecision when computing class_weight allows the model to always choose the right class in the edge case presented in issue #10233. In that case the model simply chooses the class with the highest sum of weights, because of the intercept and the value of C.

Then maybe we can try to counter this by adding a bit of randomness to the weights:

recip_freq = (((len(y) / len(le.classes_)) + np.random.rand(len(le.classes_)) * 1e-8)
              / np.bincount(y_ind).astype(np.float64))
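
For concreteness, a runnable version of the above, with hypothetical stand-ins for y, le and y_ind:

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical data standing in for the variables used in the snippet.
y = np.array(['a'] * 3 + ['b'] * 7)
le = LabelEncoder()
y_ind = le.fit_transform(y)

# Jitter each class's numerator by up to 1e-8 to break exact float ties.
recip_freq = (((len(y) / len(le.classes_)) + np.random.rand(len(le.classes_)) * 1e-8)
              / np.bincount(y_ind).astype(np.float64))
print(recip_freq)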

jnothman (Member, Author):
If we need randomness, I'd rather not fix.

The problem is that the total sample weight for each class should be equal across all classes when balancing. However, a more populous class is more likely to have its samples' weights understated due to numerical imprecision, and when these are summed its total sample weight comes out slightly smaller than that of a less populous class. This results not just in imbalanced weights, but in an inverted class-weight ranking. My idea is that we might be able to add jitter that prefers the more populous class, but I'm not sure which function to use.
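
For example, a brute-force scan over two-class counts (illustrative only, not part of the PR) for cases where the populous class ends up with the smaller total:

import numpy as np

# Scan two-class problems for counts where the per-class totals, which
# should all equal n_samples / n_classes exactly, come out unequal --
# in particular where the larger class receives the smaller total.
for small in range(1, 100):
    for big in range(small + 1, 100):
        counts = np.array([small, big], dtype=np.float64)
        recip_freq = counts.sum() / (len(counts) * counts)
        totals = recip_freq * counts
        if totals[1] < totals[0]:
            print(small, big, totals)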

Johayon (Contributor) commented Jan 17, 2018

So maybe we just need to add the epsilon inside the weight computation rather than adding it to the weight directly; adding it directly breaks the test that ensures the sum of weighted frequencies stays close to the sum of frequencies.

computation: Johayon@9d1715b
test: Johayon@d72c734

@jnothman I created a branch from your repository with the change (test + computation), but I am not able to open a pull request against your repository. Should I create a new pull request?

jnothman (Member, Author) commented Jan 17, 2018 via email
