OverflowError in statistics.mean when summing large floats #69364
Comments
The following code produces an OverflowError:
import statistics
statistics.mean([8.988465674311579e+307, 8.98846567431158e+307])
The error is:
File "/home/david/.pyenv/versions/3.5.0/lib/python3.5/statistics.py", line 293, in mean
If this is intended behaviour then it is not documented: https://docs.python.org/3/library/statistics.html#statistics.mean only specifies that it may raise StatisticsError. |
I believe it's intended behavior, as Python's float has a limit after all. It's hard to reach it but definitely possible. |
I'm not sure what you mean by float having a limit here. It's certainly finite precision, but there is still a representable value with that finite precision closest to the mean. As an example where there is an obvious correct answer that will trigger this error: statistics.mean([sys.float_info.max, sys.float_info.max]), this should return sys.float_info.max (which is definitely representable!), but instead raises this overflow error. |
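As a minimal sketch of the point above (not the statistics module's own code), the snippet below compares naive float arithmetic with exact rational arithmetic for the mean of two copies of sys.float_info.max. The variable names are chosen for illustration only.

```python
import math
import sys
from fractions import Fraction

big = sys.float_info.max

# Naive float arithmetic: the intermediate sum overflows to infinity.
naive_mean = (big + big) / 2

# Exact rational arithmetic: the sum is held exactly, so the mean is exact too.
exact_mean = (Fraction(big) + Fraction(big)) / 2

print(math.isinf(naive_mean))       # the float sum overflowed
print(exact_mean == Fraction(big))  # the true mean is exactly float max
print(float(exact_mean) == big)     # and it converts back without overflow
```

The exact mean is a representable float, so an overflow here comes from the intermediate sum rather than from the result.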
Whoop! I see the reason for it now. By limit I don't mean the precision limit, I mean the top limit at which float converts to "inf". It is meant not to overflow, but the "/" operator is now "//". |
Yup, it indeed fixes the problem. Sorry for thinking it's intended. |
Alright, |
Seems like this is the only viable option. It fixes the OverflowError but comes at the cost of precision and time. |
That patch doesn't really help, I'm afraid, since it introduces problems at the other end of the floating-point range: for example, So currently, when computing the mean of a sequence of floats (possibly mixed with ints), the code:
Reversing steps 2 and 3 here would solve the issue, but would require some refactoring. It's really up to Steven whether he thinks that that refactoring is worth it for these corner cases at the extremes of the floating-point range. (It's difficult to imagine that such numbers would turn up frequently in practical applications.) |
Bar, thanks for the time you put into diagnosing this error, it is I'm reluctant to say that mean() will *never* raise OverflowError, but |
Alright, I issued a fix, now testing it |
Alright, this patch passed all tests. |
Any comments on the patch? |
On Sat, Oct 10, 2015 at 04:28:22PM +0000, Bar Harel wrote:
Not yet, I've been unable to look at it, but thank you. If I haven't |
Do you have any benchmarks on the before and after? I strongly suspect that moving from float to Fraction-based ratios is going to kill performance in the common case, particularly for longer input sequences, but that's a hunch only. |
The existing code already converts each of the input items to Fraction; the only difference is that the old code converts the sum of those Fractions to float (or whatever the target type is) *before* dividing by the count, while the new code performs the sum/count division in Fraction-land, and only *then* converts to float. That is, it's the difference between "float(exact_sum) / count" and "float(exact_sum / count)". IOW, the performance is already dead. Or rather, it's just resting: IIUC, the module design prioritises correctness over speed. I'm sure Steven would be open to suggestions for faster algorithms that maintain the current accuracy. |
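The two orderings described above can be sketched as follows. Both helper functions are hypothetical and written only for illustration; they are not the functions in statistics.py, and they assume the inputs are floats or ints that Fraction can represent exactly.

```python
import sys
from fractions import Fraction


def mean_convert_then_divide(data):
    """Hypothetical helper: float(exact_sum) / count."""
    exact_sum = sum(Fraction(x) for x in data)
    # The conversion to float happens while the value is still the full sum.
    return float(exact_sum) / len(data)


def mean_divide_then_convert(data):
    """Hypothetical helper: float(exact_sum / count)."""
    exact_sum = sum(Fraction(x) for x in data)
    # The division happens in Fraction-land; only the mean is converted.
    return float(exact_sum / len(data))


data = [sys.float_info.max, sys.float_info.max]

try:
    mean_convert_then_divide(data)
except OverflowError as exc:
    print("convert first:", exc)

print("divide first:", mean_divide_then_convert(data) == sys.float_info.max)
```

In both versions every input item is already converted to Fraction, so the second ordering adds one Fraction division rather than a new per-item cost.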
Has anyone confirmed that this bug actually exists? I'm afraid that I cannot verify it. I get these results on three different computers:
py> x = 8.988465674311579e+307
running Python 3.4.3, a backport on 3.3.0rc3, and the default branch in the repo (3.6.0a). |
Confirmed. The initial report is not quite correct: you need three
py> x = 8.988465674311579e+307
py> statistics.mean([x]*2) == x
True
py> statistics.mean([x]*3) == x
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "./statistics.py", line 289, in mean
return _sum(data)/n
File "./statistics.py", line 184, in _sum
return T(total)
File "/usr/local/lib/python3.3/numbers.py", line 296, in __float__
return self.numerator / self.denominator
OverflowError: integer division result too large for a float |
I can reproduce here (OS X 10.9, Python 3.5), exactly as described in the original post.
Python 3.5.0 (default, Sep 22 2015, 18:26:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import statistics
>>> statistics.mean([8.988465674311579e+307, 8.98846567431158e+307])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/statistics.py", line 293, in mean
return _sum(data)/n
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/statistics.py", line 184, in _sum
return T(total)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/numbers.py", line 291, in __float__
return self.numerator / self.denominator
OverflowError: integer division result too large for a float |
Note that the two input values given in the original report are not the same: [8.988465674311579e+307, 8.98846567431158e+307] != [8.988465674311579e+307] * 2. |
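A rough way to see why the two inputs behave differently is to check whether their exact sum still converts to a float. The helper below is hypothetical and only mirrors the "convert the sum, then divide" order seen in the tracebacks above; it is not code from statistics.py.

```python
from fractions import Fraction

a = 8.988465674311579e+307
b = 8.98846567431158e+307


def exact_sum_fits_in_float(values):
    """Hypothetical helper: True if the exact sum converts to float."""
    total = sum(Fraction(v) for v in values)
    try:
        float(total)
    except OverflowError:
        return False
    return True


print(a != b)                               # the two literals are different floats
print(exact_sum_fits_in_float([a, a]))      # same value twice
print(exact_sum_fits_in_float([a, b]))      # the pair from the original report
print(exact_sum_fits_in_float([a, a, a]))   # same value three times
```

This matches the reports above: [x]*2 works for the smaller value, while the original pair and [x]*3 both overflow when the sum is converted.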
Anyway, yes, it should be quite the same. I can provide some benchmarks tomorrow if you wish. |
New changeset 4bc9405c4f7b by Steven D'Aprano in branch '3.4':
New changeset ed45a09e5a69 by Steven D'Aprano in branch '3.5':
New changeset 0eeb39fc8ff5 by Steven D'Aprano in branch 'default': |
Larry, is it too late to get this into 3.5rc1? changeset 99407:ed45a09e5a69 Thanks. |
New changeset a7d2307055e7 by Victor Stinner in branch 'default': |
Steven's commit here also fixed bpo-24068. |