[MRG]Fix nearly zero division #3725
- In the test case, arr.std() can return a nonzero value for a constant array because of floating-point error (e.g. #3722: np.zeros(22) + 1e-5).
- Split the resetting of zero-valued features out of _mean_and_std.
- With isclose(X), nearly zero values are also reset to 1.0.
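The reset step described in this list can be sketched as follows. This is a plain-Python illustration, not the PR's code: the helper name `reset_near_zero`, the `abs_tol=1e-8` threshold, and the sample values are all assumptions made for the example.

```python
import math


def reset_near_zero(stds, abs_tol=1e-8):
    """Replace std values that are close to zero with 1.0 (hypothetical helper)."""
    # Dividing by 1.0 leaves a (nearly) constant feature unchanged instead of
    # blowing it up by a tiny, rounding-induced divisor.
    return [1.0 if math.isclose(s, 0.0, abs_tol=abs_tol) else s for s in stds]


# Made-up per-feature standard deviations; 3.5e-15 stands in for rounding noise.
stds = [0.0, 3.5e-15, 0.25, 2.0]
print(reset_near_zero(stds))  # [1.0, 1.0, 0.25, 2.0]
```

Note that the threshold here is absolute, so the result depends on the magnitude of the data.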
This is actually a trickier problem than it looks at first, and this fix is not enough. Consider another test case:
So it is not enough to check that std is close to zero (in this case it is on the order of 1e135!). By the way, in the opposite case:
std is very close to zero here, but this doesn't mean anything. What matters is std relative to the size of the elements of the array.
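A minimal sketch of the relative check argued for above, written in plain Python rather than NumPy so that it is self-contained. The names `pop_std` and `looks_constant` and the `rel_tol=1e-12` threshold are hypothetical and do not come from the PR.

```python
import math


def pop_std(values):
    """Population standard deviation computed in floating point."""
    n = len(values)
    mean = math.fsum(values) / n
    return math.sqrt(math.fsum((v - mean) ** 2 for v in values) / n)


def looks_constant(values, rel_tol=1e-12):
    """Hypothetical check: std is negligible relative to the largest magnitude."""
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        return True  # all zeros
    return pop_std(values) <= rel_tol * largest


huge_constant = [1e100] * 10                    # constant, huge magnitude
tiny_varying = [i * 1e-100 for i in range(10)]  # varying, tiny magnitude

print(looks_constant(huge_constant))  # True
print(looks_constant(tiny_varying))   # False

# An absolute test gets the second case wrong: its std is about 2.9e-100,
# which is "close to zero" even though the feature clearly varies.
print(math.isclose(pop_std(tiny_varying), 0.0, abs_tol=1e-8))  # True
```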
Alternatively, would np.ptp(a) close to 0 suffice? On 1 October 2014 07:07, Maxim Grechkin notifications@github.com wrote:
Well, in this case:
np.ptp(a) will be very close to 0 (on its own), but compared to the elements of the array it is on the same order of magnitude.
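To make the point about peak-to-peak range concrete, here is a small plain-Python sketch. The helper `relative_ptp` is hypothetical; it mimics what np.ptp computes (max minus min) and then divides by the largest magnitude in the array.

```python
def relative_ptp(values):
    """Peak-to-peak range divided by the largest magnitude (hypothetical helper)."""
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        return 0.0
    return (max(values) - min(values)) / largest


tiny_varying = [i * 1e-100 for i in range(10)]
huge_constant = [1e100] * 10

# The raw range is tiny in absolute terms (about 9e-100)...
print(max(tiny_varying) - min(tiny_varying))
# ...but it is as large as the data itself.
print(relative_ptp(tiny_varying))   # 1.0
# A truly constant array has a range of exactly zero at any magnitude.
print(relative_ptp(huge_constant))  # 0.0
```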
…(X))+1), check if std_ is 'close to zero' instead of 'exactly equal to 0' (scikit-learn#3722, scikit-learn#3725). -just a small change in the code to satisfy pep8 requirements
Hi, concerning the end of your last message @maximsch2: in my opinion we cannot expect scale(np.arange(10)*1.0) to be the same as scale(np.arange(10)*1e-100). These two vectors represent totally different things even if they are mathematically proportional, don't you think?
They are different originally, but they shouldn't be after scaling. scale(np.arange(10)*1.0) should equal both scale(np.arange(10)*1e100) and scale(np.arange(10)*1e-100).
I'm not convinced of that. In practice, when you have something like np.arange(10)*1e100, it is the same as np.ones(10)*1e100. For me, a quantity like np.arange(10)*1e-100 is the same as np.ones(10)*1e-100. The scale() result should thus be the same...
@agramfort could you give your opinion on this?
Has #3747 taken over? If so, close this one.
OK, so you consider np.arange(10)*1e100 to be the same as np.ones(10)*1e100. You probably don't consider np.arange(10) and np.ones(10) to be the same though, do you? So between np.arange(10)*X and np.ones(10)*X, where do you draw the line? What if X=1e50? 10? 1e10? If you want to truncate all numbers that are greater than something, that's certainly your right. But the scale function is supposed to make the data have zero mean and unit variance, so you have to treat np.arange(10)*X exactly as np.arange(10) for all X. Now, granted, doing a good job is easier for some values of X than for others, but that doesn't mean we shouldn't try to produce the correct result for all of them.
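The invariance claimed here can be checked with a naive standardization written in plain Python. This is only a sketch: `standardize` is a hypothetical stand-in, not scikit-learn's scale(), and it does not handle constant features (std of zero).

```python
import math


def standardize(values):
    """Naive zero-mean, unit-variance scaling (hypothetical stand-in for scale())."""
    n = len(values)
    mean = math.fsum(values) / n
    centered = [v - mean for v in values]
    std = math.sqrt(math.fsum(c * c for c in centered) / n)
    return [c / std for c in centered]  # raises ZeroDivisionError if std == 0


base = standardize([float(i) for i in range(10)])

for factor in (1e100, 1e-100):
    scaled = standardize([i * factor for i in range(10)])
    # Multiplying the input by a constant should not change the scaled output.
    same = all(math.isclose(b, s, rel_tol=1e-9) for b, s in zip(base, scaled))
    print(factor, same)  # True for both factors
```

Even this naive version only works while the squared deviations stay within float range; with a factor such as 1e200 the squares overflow to infinity.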
Yes, I think we need an arbitrary line, for instance the line defined by isclose(), or another one that is harder to exceed.
@maximsch2 That is a strange std value. I guess ndarray.std() is misbehaving (maybe overflow?).
In this case,
I have an idea without using
…0' (scikit-learn#3722)
- np.zeros(8) + np.log(1e-5)
- np.zeros(22) + np.log(1e-5)
- np.zeros(10) + 1e100
- np.ones(10) * 1e-100
in function scale():
- after a convenient normalization by 1/(max(abs(X))+1), check if std_ is 'close to zero' instead of 'exactly equal to 0' (scikit-learn#3722, scikit-learn#3725).
- just a small change in the code to satisfy pep8 requirements
Now, scale(np.arange(10.)*1e-100)=scale(np.arange(10.))
remove isclose which is now unnecessary
New test for extreme (1e100 and 1e-100) scaling
modification on Xr instead of X
abs->np.abs, pep8 checker, preScale->pre_scale
max->np.max()
Abandon of the prescale method (problematic extra copy of the data)
- ValueError in the case of too large values in X which yields (X-X.mean()).mean() > 0.
- But still cover the case of std() close to 0, by subtracting again the (new) mean after scaling if needed.
- isclose -> np.isclose
- all([isclose()]) -> np.allclose
np.isclose -> isclose to avoid bug np.allclose warning
fixes #3722
_mean_and_std()
_replace_nearly_value()
function isclose(), "nearly" zero values also replaced to 1.0