
[MRG]Fix nearly zero division #3725

Closed
wants to merge 10 commits into from

Conversation


@cocuh cocuh commented Sep 30, 2014

fixes #3722

  • separate the "reset zero std to 1.0" logic from _mean_and_std() into its own function
  • add a _replace_nearly_value() function
    • with isclose(), "nearly" zero values are also replaced with 1.0
  • add a floating point error test case (see the reproduction snippet below)
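
For reference, the floating point error from #3722 can be reproduced with a snippet like the following (the exact value is platform dependent; what matters is only that it is non-zero):

import numpy as np

# A constant feature: its standard deviation is mathematically zero, but
# accumulated rounding error can make np.std() return a tiny non-zero value,
# which scale() then divides by.
a = np.zeros(22) + 1e-5
print(np.std(a))  # tiny non-zero value on most platforms (e.g. ~1e-21)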

cocu added 3 commits October 1, 2014 03:32
- In the test case, arr.std() returns a non-zero value because of floating point error
    - Ex (#3722): np.zeros(22) + 1e-5
- separate resetting of zero-valued features from _mean_and_std
- with isclose(X), nearly zero values are also replaced with 1.0
@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 22f4c1a on cocu:fix_nearly_zero_division into 7522923 on scikit-learn:master.

@maximsch2

This is actually a trickier problem than it looks at first, and this fix is not enough. Consider another test case:

a = np.zeros(10) + np.log(1e-5)*1e150
np.std(a) --> 1.4536774485912138e+135
preprocessing.scale(a) -> array([-1., -1., -1., -1., -1., -1., -1., -1., -1., -1.])

So it is not enough to check that std is close to zero (in this case it is on the order of 1e135!). By the way, in the opposite case:

a = np.arange(10) * 1e-100
np.std(a) -> 2.8722813232690144e-100

std is very close to zero here, but this doesn't mean anything. What matters is the std relative to the size of the elements of the array.
I think the correct way is to first scale the entire array by max(abs(a)); see the sketch below. This scaling should obviously also be performed carefully in case max(abs(a)) is close to zero.
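
A rough sketch of that suggestion (the helper name robust_std and the 1e-10 tolerance are illustrative, not code from this PR):

import numpy as np

def robust_std(a):
    # Sketch: rescale by max(abs(a)) so the "is the std negligible?" question
    # is asked relative to the magnitude of the data, not in absolute terms.
    scale_factor = np.max(np.abs(a))
    if scale_factor == 0.0:
        return 0.0  # all zeros: the feature is constant
    rescaled_std = np.std(a / scale_factor)
    # After rescaling, entries lie in [-1, 1], so a fixed tolerance is
    # meaningful (the 1e-10 threshold is an arbitrary choice here).
    if rescaled_std < 1e-10:
        return 0.0  # effectively constant feature
    return rescaled_std * scale_factor

a = np.zeros(10) + np.log(1e-5) * 1e150
print(robust_std(a))  # 0.0, instead of ~1e135 from np.std(a)

b = np.arange(10) * 1e-100
print(robust_std(b))  # ~2.87e-100, kept because it is large relative to b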

@jnothman
Member

Alternatively, would checking that np.ptp(a) is close to 0 suffice?


@maximsch2

Well, in this case:

a = np.arange(10) * 1e-100
np.std(a) -> 2.8722813232690144e-100

np.ptp(a) will be very close to 0 (on its own), but compared to the elements of the array it is of the same order of magnitude.
Note that right now, scale(np.arange(10)*1.0) is close to scale(np.arange(10)*1e-100), as we would expect it to be. Both the ptp check and the currently proposed solution would break that.
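
To make the point concrete, a quick comparison (assuming default np.isclose tolerances):

import numpy as np

a = np.arange(10) * 1e-100
b = np.arange(10) * 1.0

# An absolute "close to zero" check treats the two arrays differently...
print(np.isclose(np.ptp(a), 0.0))  # True  (9e-100 is within the default atol)
print(np.isclose(np.ptp(b), 0.0))  # False (ptp is 9.0)

# ...yet relative to the data both arrays have exactly the same spread,
# so we would expect scale(a) and scale(b) to give the same result.
print(np.ptp(a) / np.max(np.abs(a)))  # 1.0
print(np.ptp(b) / np.max(np.abs(b)))  # 1.0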

ngoix pushed a commit to ngoix/scikit-learn that referenced this pull request Oct 9, 2014
…(X))+1), check if std_ is 'close to zero' instead of 'exactly equal to 0' (scikit-learn#3722, scikit-learn#3725).

-just a small change in the code to satisfy pep8 requirements
@ngoix
Contributor

ngoix commented Oct 9, 2014

Hi, concerning the end of your last message @maximsch2: in my opinion we cannot expect scale(np.arange(10)*1.0) to be the same as scale(np.arange(10)*1e-100); these two vectors represent totally different things even if they are mathematically proportional, don't you think?

@maximsch2

They are different originally, but they shouldn't be after scaling. scale(np.arange(10)*1.0) should be equal to both scale(np.arange(10)*1e100) and scale(np.arange(10)*1e-100).

@ngoix
Contributor

ngoix commented Oct 9, 2014

I'm not convinced of that. In practice, when you have something like np.arange(10)*1e100, it is the same as np.ones(10)*1e100. Likewise, for me a quantity like np.arange(10)*1e-100 is the same as np.ones(10)*1e-100. Their scale() should thus be the same...

@ngoix
Contributor

ngoix commented Oct 9, 2014

@agramfort, could you give your opinion on this?

@agramfort
Member

Has this been taken over by #3747? If so, close this one.

@maximsch2

OK, so you consider np.arange(10)*1e100 to be the same as np.ones(10)*1e100. You probably don't consider np.arange(10) and np.ones(10) to be the same, though, do you? So between np.arange(10)*X and np.ones(10)*X, where do you draw the line? What if X=1e50? 10? 1e10?

If you want to truncate all numbers that are greater than some threshold, that's certainly your right. But the scale function is supposed to make the data have zero mean and unit variance, so you have to treat np.arange(10)*X exactly like np.arange(10) for all X. Granted, doing a good job is easier for some values of X than for others, but that doesn't mean we shouldn't try to produce the correct result for all of them.

@ngoix
Contributor

ngoix commented Oct 10, 2014

Yes, I think we need an arbitrary line, for instance the one defined by isclose(), or another one that is harder to exceed.
It is indeed still hard for me to imagine that, in practice, you can consider np.arange(10)*1e-100 to be different from zero, i.e. from np.ones(10)*1e-100...

cocu added 2 commits October 11, 2014 02:22
This reverts commit e229783.

Conflicts:
	sklearn/preprocessing/data.py
- improve floating point error handling (#3722)
- idea
  - When array elements are all the same value, `std` sometimes has a floating point error (`std` should be 0).
@cocuh
Author

cocuh commented Oct 10, 2014

@maximsch2 That is a strange std value. I guess ndarray.std() is misbehaving (maybe an overflow...?).
I suppose ndarray.std() should be fixed to return the correct std (zero).

a = np.zeros(10) + np.log(1e-5)*1e150
np.std(a) --> 1.4536774485912138e+135

In this case, isclose() does not help.

a = np.arange(10) * 1e-100
np.std(a) -> 2.8722813232690144e-100

isclose()'s threshold is inadequate here because it uses the default value, and choosing a good threshold seems difficult to me.

I have an idea that does not use isclose():
when all elements of the array are the same value, std sometimes has a floating point error (std should be 0), so it should be replaced with zero in that case.
I have implemented this idea, but since the input may be an n-dimensional array, the code has become convoluted...
Please advise me on how to refactor it.
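
A minimal sketch of that idea (the helper name _reset_constant_features_std is hypothetical; this is not the exact code from this PR), assuming column-wise statistics as computed by _mean_and_std:

import numpy as np

def _reset_constant_features_std(X, std):
    # If every value in a column equals the first value, the true std of that
    # column is zero, even when np.std() returns a tiny rounding error.
    X = np.asarray(X)
    std = np.asarray(std, dtype=float).copy()
    constant_columns = np.all(X == X[0, :], axis=0)
    std[constant_columns] = 0.0
    return std

X = np.zeros((22, 3)) + 1e-5   # three constant columns
std = X.std(axis=0)            # may be a tiny non-zero value per column
print(_reset_constant_features_std(X, std))  # [0. 0. 0.]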

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 0795128 on cocu:fix_nearly_zero_division into 7522923 on scikit-learn:master.


@cocuh cocuh changed the title from Fix nearly zero division to [MRG]Fix nearly zero division on Oct 28, 2014
@amueller amueller added the Bug label Jan 16, 2015
@amueller amueller added this to the 0.16 milestone Jan 16, 2015
ngoix pushed a commit to ngoix/scikit-learn that referenced this pull request Jan 28, 2015
…0' (scikit-learn#3722)

-np.zeros(8) + np.log(1e-5)
-np.zeros(22) + np.log(1e-5)
-np.zeros(10) + 1e100
-np.ones(10) * 1e-100

in function scale() : -after a convenient normalization by 1/(max(abs(X))+1), check if std_ is 'close to zero' instead of 'exactly equal to 0' (scikit-learn#3722, scikit-learn#3725).
-just a small change in the code to satisfy pep8 requirements

Now, scale(np.arange(10.)*1e-100)=scale(np.arange(10.))

remove isclose which is now unnecessary

New test for extreme (1e100 and 1e-100) scaling

modification on Xr instead of X

abs->np.abs, pep8 checker, preScale->pre_scale

max->np.max()

Abandon the prescale method (problematic extra copy of the data)
-ValueError in the case of too large values in X which yields
(X-X.mean()).mean() > 0.
-But still cover the case of std() close to 0, by subtracting again the (new)
mean after scaling if needed.

-isclose -> np.isclose
-all([isclose()]) -> np.allclose

np.isclose -> isclose to avoid bug

np.allclose

warning
@ogrisel
Member

ogrisel commented Mar 3, 2015

Has this been taken over by #3747? If so, close this one.

I think this is the case. Closing this PR in favor of #3747. @ngoix, please reopen if this is not the case.
