
[MRG]Fix nearly zero division #3725

Closed
wants to merge 10 commits into from

Conversation


@cocuh cocuh commented Sep 30, 2014

fixes #3722

  • separate the "reset zero std to 1.0" logic from _mean_and_std() into its own function
  • add a _replace_nearly_value() function
    • with isclose(), "nearly" zero values are also replaced with 1.0
  • add a floating point error test case (see the reproduction snippet below)
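
For reference, the floating point error from #3722 can be reproduced with a snippet like the following (the exact value is platform dependent; what matters is only that it is non-zero):

import numpy as np

# A constant feature: its standard deviation is mathematically zero, but
# accumulated rounding error can make np.std() return a tiny non-zero value,
# which scale() then divides by.
a = np.zeros(22) + 1e-5
print(np.std(a))  # tiny non-zero value on most platforms (e.g. ~1e-21)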

cocu added 3 commits October 1, 2014 03:32
- In the test case, arr.std() returns a non-zero value because of floating point error
    - Ex (#3722): np.zeros(22) + 1e-5
- separate resetting of zero-valued features from _mean_and_std
- with isclose(X), nearly zero values are also replaced with 1.0
@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 22f4c1a on cocu:fix_nearly_zero_division into 7522923 on scikit-learn:master.

@maximsch2

This is actually a trickier problem than it looks at first, and this fix is not enough. Consider another test case:

a = np.zeros(10) + np.log(1e-5)*1e150
np.std(a) --> 1.4536774485912138e+135
preprocessing.scale(a) -> array([-1., -1., -1., -1., -1., -1., -1., -1., -1., -1.])

So it is not enough to check that std is close to zero (in this case it is on the order of 1e135!). By the way, in the opposite case:

a = np.arange(10) * 1e-100
np.std(a) -> 2.8722813232690144e-100

std is very close to zero here, but this doesn't mean anything. What matters is the std relative to the size of the elements of the array.
I think the correct way is to first scale the entire array by max(abs(a)); see the sketch below. This scaling should obviously also be performed carefully in case max(abs(a)) is close to zero.
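
A rough sketch of that suggestion (the helper name robust_std and the 1e-10 tolerance are illustrative, not code from this PR):

import numpy as np

def robust_std(a):
    # Sketch: rescale by max(abs(a)) so the "is the std negligible?" question
    # is asked relative to the magnitude of the data, not in absolute terms.
    scale_factor = np.max(np.abs(a))
    if scale_factor == 0.0:
        return 0.0  # all zeros: the feature is constant
    rescaled_std = np.std(a / scale_factor)
    # After rescaling, entries lie in [-1, 1], so a fixed tolerance is
    # meaningful (the 1e-10 threshold is an arbitrary choice here).
    if rescaled_std < 1e-10:
        return 0.0  # effectively constant feature
    return rescaled_std * scale_factor

a = np.zeros(10) + np.log(1e-5) * 1e150
print(robust_std(a))  # 0.0, instead of ~1e135 from np.std(a)

b = np.arange(10) * 1e-100
print(robust_std(b))  # ~2.87e-100, kept because it is large relative to b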

@jnothman
Member

Alternatively, would checking that np.ptp(a) is close to 0 suffice?


@maximsch2

Well, in this case:

a = np.arange(10) * 1e-100
np.std(a) -> 2.8722813232690144e-100

np.ptp(a) will be very close to 0 (on its own), but compared to the elements of the array it is of the same order of magnitude.
Note that right now, scale(np.arange(10)*1.0) is close to scale(np.arange(10)*1e-100), as we would expect it to be. Both the ptp check and the currently proposed solution would break that.
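
To make the point concrete, a quick comparison (assuming default np.isclose tolerances):

import numpy as np

a = np.arange(10) * 1e-100
b = np.arange(10) * 1.0

# An absolute "close to zero" check treats the two arrays differently...
print(np.isclose(np.ptp(a), 0.0))  # True  (9e-100 is within the default atol)
print(np.isclose(np.ptp(b), 0.0))  # False (ptp is 9.0)

# ...yet relative to the data both arrays have exactly the same spread,
# so we would expect scale(a) and scale(b) to give the same result.
print(np.ptp(a) / np.max(np.abs(a)))  # 1.0
print(np.ptp(b) / np.max(np.abs(b)))  # 1.0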

ngoix pushed a commit to ngoix/scikit-learn that referenced this pull request Oct 9, 2014
…(X))+1), check if std_ is 'close to zero' instead of 'exactly equal to 0' (scikit-learn#3722, scikit-learn#3725).

-just a small change in the code to satisfy pep8 requirements
@ngoix
Contributor

ngoix commented Oct 9, 2014

Hi, concerning the end of your last message @maximsch2: in my opinion we cannot expect scale(np.arange(10)*1.0) to be the same as scale(np.arange(10)*1e-100); these two vectors represent totally different things even if they are mathematically proportional, don't you think?

@maximsch2

They are different originally, but they shouldn't be after scaling. scale(np.arange(10)*1.0) should be equal to both scale(np.arange(10)*1e100) and scale(np.arange(10)*1e-100).

@ngoix
Contributor

ngoix commented Oct 9, 2014

I'm not convinced of that. In practice, when you have something like np.arange(10)*1e100, it is the same as np.ones(10)*1e100. Likewise, for me a quantity like np.arange(10)*1e-100 is the same as np.ones(10)*1e-100. Their scale() should thus be the same...

@ngoix
Contributor

ngoix commented Oct 9, 2014

@agramfort, could you give your opinion on this?

@agramfort
Member

Has this been taken over by #3747? If so, close this one.

@maximsch2

OK, so you consider np.arange(10)*1e100 to be the same as np.ones(10)*1e100. You probably don't consider np.arange(10) and np.ones(10) to be the same, though, do you? So between np.arange(10)*X and np.ones(10)*X, where do you draw the line? What if X=1e50? 10? 1e10?

If you want to truncate all numbers that are greater than some threshold, that's certainly your right. But the scale function is supposed to make the data have zero mean and unit variance, so you have to treat np.arange(10)*X exactly like np.arange(10) for all X. Granted, doing a good job is easier for some values of X than for others, but that doesn't mean we shouldn't try to produce the correct result for all of them.

@ngoix
Contributor

ngoix commented Oct 10, 2014

Yes, I think we need an arbitrary line, for instance the one defined by isclose(), or another one that is harder to exceed.
It is indeed still hard for me to imagine that, in practice, you can consider np.arange(10)*1e-100 to be different from zero, i.e. from np.ones(10)*1e-100...

cocu added 2 commits October 11, 2014 02:22
This reverts commit e229783.

Conflicts:
	sklearn/preprocessing/data.py
- improve floating point error handling (#3722)
- idea
  - When array elements are all the same value, `std` sometimes has a floating point error (`std` should be 0).
@cocuh
Author

cocuh commented Oct 10, 2014

@maximsch2 That is a strange std value. I guess ndarray.std() is misbehaving (maybe an overflow...?).
I suppose ndarray.std() should be fixed to return the correct std (zero).

a = np.zeros(10) + np.log(1e-5)*1e150
np.std(a) --> 1.4536774485912138e+135

In this case, isclose() does not help.

a = np.arange(10) * 1e-100
np.std(a) -> 2.8722813232690144e-100

isclose()'s threshold is inadequate here because it uses the default value, and choosing a good threshold seems difficult to me.

I have an idea that does not use isclose():
when all elements of the array are the same value, std sometimes has a floating point error (std should be 0), so it should be replaced with zero in that case.
I have implemented this idea, but since the input may be an n-dimensional array, the code has become convoluted...
Please advise me on how to refactor it.
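
A minimal sketch of that idea (the helper name _reset_constant_features_std is hypothetical; this is not the exact code from this PR), assuming column-wise statistics as computed by _mean_and_std:

import numpy as np

def _reset_constant_features_std(X, std):
    # If every value in a column equals the first value, the true std of that
    # column is zero, even when np.std() returns a tiny rounding error.
    X = np.asarray(X)
    std = np.asarray(std, dtype=float).copy()
    constant_columns = np.all(X == X[0, :], axis=0)
    std[constant_columns] = 0.0
    return std

X = np.zeros((22, 3)) + 1e-5   # three constant columns
std = X.std(axis=0)            # may be a tiny non-zero value per column
print(_reset_constant_features_std(X, std))  # [0. 0. 0.]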

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 0795128 on cocu:fix_nearly_zero_division into 7522923 on scikit-learn:master.


@cocuh cocuh changed the title from Fix nearly zero division to [MRG]Fix nearly zero division on Oct 28, 2014
@amueller amueller added the Bug label Jan 16, 2015
@amueller amueller added this to the 0.16 milestone Jan 16, 2015
ngoix pushed a commit to ngoix/scikit-learn that referenced this pull request Jan 28, 2015
…0' (scikit-learn#3722)

-np.zeros(8) + np.log(1e-5)
-np.zeros(22) + np.log(1e-5)
-np.zeros(10) + 1e100
-np.ones(10) * 1e-100

in function scale() : -after a convenient normalization by 1/(max(abs(X))+1), check if std_ is 'close to zero' instead of 'exactly equal to 0' (scikit-learn#3722, scikit-learn#3725).
-just a small change in the code to satisfy pep8 requirements

Now, scale(np.arange(10.)*1e-100)=scale(np.arange(10.))

remove isclose which is now unnecessary

New test for extreme (1e100 and 1e-100) scaling

modification on Xr instead of X

abs->np.abs, pep8 checker, preScale->pre_scale

max->np.max()

Abandon the prescale method (problematic extra copy of the data)
-ValueError in the case of too large values in X which yields
(X-X.mean()).mean() > 0.
-But still cover the case of std() close to 0, by subtracting again the (new)
mean after scaling if needed.

-isclose -> np.isclose
-all([isclose()]) -> np.allclose

np.isclose -> isclose to avoid bug

np.allclose

warning
@ogrisel
Member

ogrisel commented Mar 3, 2015

Has this been taken over by #3747? If so, close this one.

I think this is the case. Closing this PR in favor of #3747. @ngoix, please reopen if this is not the case.
