Add geometric mean to statistics module #71368
Steven, this seems like a reasonable suggestion (though I would expect someone else will immediately suggest a harmonic mean as well). Is this within the scope of what you were trying to do with the statistics module?
On Thu, Jun 02, 2016 at 09:04:54PM +0000, Raymond Hettinger wrote:

Yes, I think it is reasonable too. I'll aim to get this into 3.6.
To complicate things further... I implemented a geometric mean on my own, and then I figured out what I really want is a weighted geometric mean, so I implemented that for myself. If you'd want to include that, that'll be cool. Actually, I'm not sure if the goal of the […]

And of course, if the goal of the […]
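As an illustration only (not Ram's actual code), a weighted geometric mean can be written as a weighted average in log space:

    from math import exp, log

    def weighted_geometric_mean(data, weights):
        # Weighted geometric mean: exp of the weighted mean of the logs.
        # Assumes positive data values and non-negative weights that are
        # not all zero.
        total = sum(weights)
        return exp(sum(w * log(x) for x, w in zip(data, weights)) / total)

    # weighted_geometric_mean([2.0, 8.0], [1, 1]) == 4.0, up to rounding.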
Choice of algorithm is a bit tricky here. There are a couple of obvious algorithms that work mathematically but result in significant accuracy loss in an IEEE 754 floating-point implementation: one is the naive exp-mean-log computation, the other the bare product followed by an nth root. I'd suggest something along the lines of […]. There are also algorithms for improved accuracy in a product, along the same lines as the algorithm used in fsum. See e.g. the paper "Accurate Floating-Point Product and Exponentiation" by Stef Graillat. [1] (I didn't know about this paper: I just saw a reference to it in a StackOverflow comment [2], which reminded me of this issue.)

[1] http://www-pequan.lip6.fr/~graillat/papers/IEEE-TC-prod.pdf
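To make the two obvious-but-lossy algorithms concrete, here is a sketch (an illustration, not code from the thread):

    from math import exp, log

    def gmean_exp_log(data):
        # Obvious algorithm 1: exponentiate the mean of the logs.
        # Each log() call can lose a little accuracy, and the final
        # exp() amplifies the accumulated error.
        return exp(sum(log(x) for x in data) / len(data))

    def gmean_product(data):
        # Obvious algorithm 2: nth root of the running product.
        # The product overflows or underflows for long inputs; e.g.
        # 5000 copies of 0.7 underflow to 0.0 long before the root.
        product = 1.0
        for x in data:
            product *= x
        return product ** (1.0 / len(data))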
On the other hand, apparently SciPy's gmean just uses the naive log-then-exp approach:

    def gmean(a, axis=0):
        a, axis = _chk_asarray(a, axis)
        log_a = ma.log(a)
        return ma.exp(log_a.mean(axis=axis))
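For anyone wanting to try it, SciPy's version can be exercised directly (assuming SciPy is installed):

    >>> from scipy.stats import gmean
    >>> gmean([1, 2, 4])  # exact answer is 2.0; expect at most an ulp or so of error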
On Thu, Jun 09, 2016 at 09:24:04AM +0000, Mark Dickinson wrote:

Hmm, well, I don't have SciPy installed, but I've found that, despite […], statistics.mean handles cases like these better than numpy does:

    py> statistics.mean([1e50, 2e-50, -1e50, 2e-50])
    py> statistics.mean([1e50, 2e-50, -1e50, 2e-50]*1000)

On the other hand, np is probably a hundred times (or more) faster, so I […]
Agreed. And as Ram Rachum hinted, there seems little point aiming to duplicate things that already exist in the de facto standard scientific libraries. So I think there's a place for a non-naive, carefully computed geometric mean in the std. lib. statistics module, but I wouldn't see the point of simply adding an exp-mean-log implementation (not that anyone is advocating that).
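For context on what "carefully computed" buys: statistics.mean gets its accuracy by summing exactly in rational arithmetic and rounding only once at the end. A minimal sketch of that idea:

    from fractions import Fraction

    def exact_mean(data):
        # Convert each float to an exact Fraction, sum with no rounding,
        # and round only once in the final float() conversion.
        total = sum(map(Fraction, data))
        return float(total / len(data))

    # exact_mean([1e50, 2e-50, -1e50, 2e-50]) returns 1e-50, whereas a
    # plain float sum absorbs some of the tiny terms into 1e50 along the
    # way and gives the wrong answer.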
Does anyone have any strong feeling about the name for these functions: gmean and hmean, or geometric_mean and harmonic_mean? And "subcontrary_mean" is not an option :-)
I would like to see them spelled out: geometric_mean and harmonic_mean

+1
New changeset 9eb5edfcf604 by Steven D'Aprano in branch 'default':
Thanks for the patch, Steven! I won't comment on the code because I don't know enough about these algorithms, but I'm thinking: since you also did a refactoring of the statistics module, maybe these should be two separate patches/commits, so it'll be easy to see which part is the new feature and which part is moving existing code around?

Also... I like the detailed docstrings with the real-life examples! That stuff helps when coding and using an unfamiliar function (since I see the docs in a panel of my IDE), so I wish I'd see more detailed docstrings like these in the standard library. For […]
On Tue, Aug 09, 2016 at 06:44:22AM +0000, Ram Rachum wrote:

What do you mean? As in, the mathematical definition of geometric mean? Or do you mean a one-sentence description of the algorithm?
I meant the mathematical definition. |
Tests fail on a Power PC buildbot: http://buildbot.python.org/all/builders/PPC64LE%20Fedora%203.x/builds/1476/steps/test/logs/stdio

    Traceback (most recent call last):
      File "/home/shager/cpython-buildarea/3.x.edelsohn-fedora-ppc64le/build/Lib/test/test_statistics.py", line 1216, in testExactPowers
        self.assertEqual(self.nroot(x, n), i)
    AssertionError: 29.000000000000004 != 29

    ======================================================================

    Traceback (most recent call last):
      File "/home/shager/cpython-buildarea/3.x.edelsohn-fedora-ppc64le/build/Lib/test/test_statistics.py", line 1228, in testExactPowersNegatives
        self.assertEqual(self.nroot(x, n), i)
    AssertionError: -29.000000000000004 != -29
What, no patch for pre-commit review?! For computing nth roots, it may be worth special-casing the case n=2: for floats, math.sqrt should be both faster and more accurate than a general nth-root computation.
I thought about special-casing n=2 to math.sqrt, but as that's an optimization, it can happen later. I'd be interested in special-casing n=3 to math.cbrt (if and when it exists).
It certainly is worse than sqrt, both in terms of speed and accuracy. Whether the difference is enough to make it worth special-casing is another question, of course, and as you say, that can happen later.
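A sketch of what the special-casing might look like (illustrative only, not the patch under review):

    import math

    def nth_root(x, n):
        # math.sqrt is correctly rounded, so prefer it when n == 2;
        # fall back to pow() otherwise (assumes x >= 0 and n >= 1).
        if n == 2:
            return math.sqrt(x)
        return x ** (1.0 / n)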
I've created a new issue to track the loss of accuracy on PowerPC: http://bugs.python.org/issue27761
A failing case:

>>> statistics.geometric_mean([0.7 for _ in range(5000)])
Traceback (most recent call last):
File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 362, in float_nroot
isinfinity = math.isinf(x)
OverflowError: int too large to convert to float
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 595, in geometric_mean
s = 2**p * _nth_root(2**q, n)
File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 346, in nth_root
return _nroot_NS.float_nroot(x, n)
File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 364, in float_nroot
return _nroot_NS.bignum_nroot(x, n)
File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 489, in bignum_nroot
b = 2**q * _nroot_NS.nroot(2**r, n)
File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 384, in nroot
r1 = math.pow(x, 1.0/n)
OverflowError: int too large to convert to float
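One way to sidestep this family of overflow errors is to keep the product's binary exponent in an unbounded Python int, e.g. via math.frexp. A rough sketch of the idea (an illustration, not the stdlib code):

    import math

    def gmean_frexp(data):
        # Accumulate the product as mantissa * 2**exponent, with the
        # exponent held in a Python int, so huge or tiny products never
        # overflow or underflow a float.
        mantissa, exponent, n = 1.0, 0, 0
        for x in data:
            m, e = math.frexp(x)
            mantissa, exponent, n = mantissa * m, exponent + e, n + 1
            # Renormalize so the running mantissa stays in float range.
            m, e = math.frexp(mantissa)
            mantissa, exponent = m, exponent + e
        # nth root of mantissa * 2**exponent.
        return mantissa ** (1.0 / n) * 2.0 ** (exponent / n)

    # gmean_frexp([0.7] * 5000) returns approximately 0.7 instead of
    # raising OverflowError.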
FTR, multiple platforms are failing in various ways, not just PPC64, so bpo-27761 was expanded to cover them and has been marked as a "release blocker". |
Failure on s390x Debian 3.x: http://buildbot.python.org/all/builders/s390x%20Debian%203.x/builds/1455/steps/test/logs/stdio

    ======================================================================

    Traceback (most recent call last):
      File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/test_statistics.py", line 1216, in testExactPowers
        self.assertEqual(self.nroot(x, n), i)
    AssertionError: 29.000000000000004 != 29

    ======================================================================

    Traceback (most recent call last):
      File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/test_statistics.py", line 1228, in testExactPowersNegatives
        self.assertEqual(self.nroot(x, n), i)
    AssertionError: -29.000000000000004 != -29
New changeset 54288b160243 by Victor Stinner in branch 'default':
I would like to use the buildbots to check for regressions, but I see a lot of red buildbots, so the buildbots have become useless :-/ I skipped the failing test_statistics tests, since the failures are known. I set the priority to "release blocker". I suggest either reverting the change or finding a fix before 3.6b1.
For posterity, the following failure was observed on all (9, 10 and 11 (current)) FreeBSD buildbots:

    ======================================================================

    Traceback (most recent call last):
      File "/usr/home/buildbot/python/3.x.koobs-freebsd9/build/Lib/test/test_statistics.py", line 1247, in testFraction
        self.assertEqual(self.nroot(x**12, 12), float(x))
    AssertionError: 1.1866666666666665 != 1.1866666666666668
That looks like a case where the test should simply be weakened to an assertAlmostEqual.
As discussed with Ned by email, I'm currently unable to build 3.6 and won't have time to work on this before b1. As discussed on bpo-27761, my tests here are too strict and should be loosened, e.g. from assertEqual to assertAlmostEqual. Ned wrote: "If you are only planning to make changes to the tests themselves, I think that can wait for b2." I have no plans to change the publicly visible interface of geometric_mean.
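The loosening could be as simple as this kind of change (hypothetical test code, not the actual patch):

    import unittest

    class NrootRoundTrip(unittest.TestCase):
        def test_fraction_roundtrip(self):
            # The round trip can be off by a few ulps, so compare
            # approximately instead of exactly.
            x = 1.1866666666666668
            self.assertAlmostEqual((x ** 12) ** (1.0 / 12), x, places=12)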
Steven: any thoughts about the statistics.geometric_mean(0.7 for _ in range(5000)) failure? Should I open a separate bug report for that, or would you rather address it as part of this issue? |
>>> statistics.geometric_mean([0.7 for _ in range(5000)])
Traceback (most recent call last):
File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 362, in float_nroot
isinfinity = math.isinf(x)
OverflowError: int too large to convert to float

=> see also issue bpo-27975
On Mon, Sep 12, 2016 at 03:35:14PM +0000, Mark Dickinson wrote:
I've raised a new ticket: bpo-28111
I'm sorry to say that due to technical difficulties, geometric mean is not going to be in a fit state for beta 2 of 3.6, and so is going to be removed and delayed until 3.7.
New changeset 9dce0e41bedd by Steven D'Aprano in branch 'default':

New changeset de0fa478c22e by Steven D'Aprano in branch '3.6':
Thanks, Steven. Actually, we needed to remove geometric_mean from the 3.6 branch, not the default branch (which will become 3.7). I backported your removal patch to 3.6. Feel free to reapply geometric_mean to the default branch at your leisure.
I was wondering if this has been taken up again for 3.7. Thanks!
Updating the version in case this is wanted for consideration in 3.8.
Yes. It would be nice to get this wrapped up.
Almost three years have passed. In the spirit of "perfect is the enemy of good", would it be reasonable to start with a simple, fast implementation using exp-mean-log? Then if someone wants to make it more accurate later, they can do so. In some quick tests, I don't see much of an accuracy loss. It looks to be plenty good enough to use as a starting point:

--- Accuracy experiments ---

>>> # imports used below
>>> from functools import reduce
>>> from operator import mul
>>> from decimal import Decimal
>>> from math import exp, log, fabs
>>> from statistics import fmean
>>> from random import expovariate, triangular, lognormvariate

>>> # https://www.wolframalpha.com/input/?i=geometric+mean+12,+17,+13,+5,+120,+7
>>> data = [12, 17, 13, 5, 120, 7]
>>> print(reduce(mul, map(Decimal, data)) ** (Decimal(1) / len(data)))
14.94412420173971227234687688
>>> exp(fmean(map(log, map(fabs, data))))
14.944124201739715
>>> data = [expovariate(50.0) for i in range(1_000)]
>>> print(reduce(mul, map(Decimal, data)) ** (Decimal(1) / len(data)))
0.01140902688569587677205587938
>>> exp(fmean(map(log, map(fabs, data))))
0.011409026885695879
>>> data = [triangular(2000.0, 3000.0, 2200.0) for i in range(10_000)]
>>> print(reduce(mul, map(Decimal, data)) ** (Decimal(1) / len(data)))
2388.381301718524160840023868
>>> exp(fmean(map(log, map(fabs, data))))
2388.3813017185225
>>> data = [lognormvariate(20.0, 3.0) for i in range(100_000)]
>>> min(data), max(data)
(2421.506538652375, 137887726484094.5)
>>> print(reduce(mul, map(Decimal, data)) ** (Decimal(1) / len(data)))
484709306.8805352290183838500
>>> exp(fmean(map(log, map(fabs, data))))
484709306.8805349
I think that is a reasonable idea. On the basis that something is better than nothing, getting some tricky cases down for reference:

# older (removed) implementation
py> geometric_mean([3, 27])

# Raymond's newer (faster) implementation
py> exp(fmean(map(log, [3, 27])))

py> x = 2.5e15

On the other hand, sometimes rounding errors work in our favour:

py> geometric_mean([1e50, 1e-50])  # people might expect 1.0
py> exp(fmean(map(log, [1e50, 1e-50])))
Thanks. I'll put together a PR for your consideration.
Steven, how does this look? https://patch-diff.githubusercontent.com/raw/python/cpython/pull/12638.diff
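For readers without the diff handy, the approach in the PR is essentially the exp-mean-log one benchmarked above. A minimal sketch (assuming Python 3.8's statistics.fmean; not the exact PR text):

    from math import exp, log
    from statistics import fmean  # Python 3.8+

    def geometric_mean(data):
        # Average the logs with an accurate floating-point mean, then
        # exponentiate; log() raises ValueError for non-positive inputs.
        return exp(fmean(map(log, data)))

    # geometric_mean([12, 17, 13, 5, 120, 7]) gives 14.944124201739715,
    # matching the accuracy experiments above.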
Feel free to reopen this if something further needs to be changed or discussed.