
[WIP] FIX? Add eps to obviate imprecision issues in balanced class_weight #10249

Closed

Conversation

jnothman (Member) commented Dec 4, 2017

This is a potential fix for the issue raised in #10233, wherein surprisingly good results were achieved under cross-validation due to numerical imprecision in compute_class_weight(..., 'balanced'). This situation is rare, but reportedly occurred in a grid search on a real-world dataset.

The fix works by ensuring that, when class weights cannot be exactly balanced for a particular training set due to numerical imprecision, the original ranking of class distributions is marginally preferred.

What do others think of this? Is the problem substantial enough to break backwards compatibility of results in a small way? Are there alternative fixes?
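
For context, compute_class_weight(..., 'balanced') assigns each sample of class c the weight n_samples / (n_classes * count_c). A minimal sketch of where the rounding enters (illustrative values, not the PR diff):

import numpy as np

# Encoded labels for a small, imbalanced two-class problem (illustrative).
y_ind = np.array([0] * 3 + [1] * 7)
counts = np.bincount(y_ind).astype(np.float64)    # per-class counts
recip_freq = len(y_ind) / (len(counts) * counts)  # 'balanced' per-sample weights

# Each class's total weight should be exactly n_samples / n_classes, but
# recip_freq is rounded to the nearest float, so the totals below can
# differ in the last bit -- sometimes leaving the populous class lighter.
print(recip_freq * counts)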

Review comments on the diff:

weight = recip_freq[le.transform(classes)]
if np.any(np.diff(freq)):
Member:

I don't understand this line. Doesn't that rely on the class order?

jnothman (Member, Author):

Sorry, I had intended freq * weight, to check whether there was exact float equality under the weighting. But we could use this hack in all cases if we think that's better (less architecture-dependent, for instance).

Member:

I don't understand the hack then ;)

jnothman (Member, Author) commented Dec 5, 2017

The test failures suggest that the tweaks should sometimes be subtracted rather than added.

jnothman (Member, Author):
@Johayon, perhaps you are interested in helping me fix this one, seeing as in some ways it bears similarity to your recent work. I'm still not sure it's worth fixing, though!

Johayon (Contributor) commented Jan 16, 2018

Thank you.
If you think I can help you, I will look at it.

Johayon (Contributor) commented Jan 16, 2018

@jnothman, if I understand the problem correctly: the numerical imprecision when computing class_weight allows the model to always choose the right class in the edge case presented in issue #10233. In that case the model simply chooses the class with the highest sum of weights, because of the intercept and the value of C.

Then maybe we can try to counter this by adding a bit of randomness to the weights:

recip_freq = (((len(y) / len(le.classes_)) + np.random.rand(len(le.classes_)) * 1e-8)
              / np.bincount(y_ind).astype(np.float64))
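
For concreteness, a runnable version of the above, with hypothetical stand-ins for y, le and y_ind:

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical data standing in for the variables used in the snippet.
y = np.array(['a'] * 3 + ['b'] * 7)
le = LabelEncoder()
y_ind = le.fit_transform(y)

# Jitter each class's numerator by up to 1e-8 to break exact float ties.
recip_freq = (((len(y) / len(le.classes_)) + np.random.rand(len(le.classes_)) * 1e-8)
              / np.bincount(y_ind).astype(np.float64))
print(recip_freq)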

jnothman (Member, Author):
If we need randomness, I'd rather not fix.

The problem is that the total sample weight for each class should be equal across all classes when balancing. However, a more populous class is more likely to have its samples' weights understated due to numerical imprecision, and when these are summed its total sample weight comes out slightly smaller than that of a less populous class. This results not just in imbalanced weights, but in an inverted class-weight ranking. My idea is that we might be able to add jitter that prefers the more populous class, but I'm not sure which function to use.
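
For example, a brute-force scan over two-class counts (illustrative only, not part of the PR) for cases where the populous class ends up with the smaller total:

import numpy as np

# Scan two-class problems for counts where the per-class totals, which
# should all equal n_samples / n_classes exactly, come out unequal --
# in particular where the larger class receives the smaller total.
for small in range(1, 100):
    for big in range(small + 1, 100):
        counts = np.array([small, big], dtype=np.float64)
        recip_freq = counts.sum() / (len(counts) * counts)
        totals = recip_freq * counts
        if totals[1] < totals[0]:
            print(small, big, totals)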

Johayon (Contributor) commented Jan 17, 2018

So maybe we just need to add the epsilon inside the weight computation rather than adding it to the weight directly; adding it directly breaks the test that ensures the sum of weighted frequencies stays close to the sum of frequencies.

computation: Johayon@9d1715b
test: Johayon@d72c734

@jnothman I created a branch from your repository with the change (test + computation), but I am not able to open a pull request against your repository. Should I create a new pull request?

jnothman (Member, Author) commented Jan 17, 2018 via email
