Add statistics.fmean(seq) #80085
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:

    assignee = 'https://github.com/stevendaprano'
    closed_at = <Date 2019-02-21.23:07:02.053>
    created_at = <Date 2019-02-05.23:59:21.339>
    labels = ['3.8', 'type-bug', 'library']
    title = 'Add statistics.fmean(seq)'
    updated_at = <Date 2019-04-23.08:35:18.758>
    user = 'https://github.com/rhettinger'

bugs.python.org fields:

    activity = <Date 2019-04-23.08:35:18.758>
    actor = 'rhettinger'
    assignee = 'steven.daprano'
    closed = True
    closed_date = <Date 2019-02-21.23:07:02.053>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2019-02-05.23:59:21.339>
    creator = 'rhettinger'
    dependencies =
    files =
    hgrepos =
    issue_num = 35904
    keywords = ['patch']
    message_count = 19.0
    messages = ['334894', '334899', '334902', '334980', '334984', '335014', '335045', '335048', '335057', '335071', '335076', '335101', '335113', '335129', '335181', '335796', '335877', '336269', '340702']
    nosy_count = 5.0
    nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson', 'steven.daprano', 'josh.r']
    pr_nums = ['11892', '12919']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue35904'
    versions = ['Python 3.8']
The current mean() function makes heroic efforts to achieve last-bit accuracy and, when possible, to retain the data type of the input.
What is needed is an alternative with a simpler signature that is much faster, that is highly accurate without demanding perfection, and that does what people usually expect mean() to do, the same as their calculators or numpy.mean():
```python
def fmean(seq: Sequence[float]) -> float:
    return math.fsum(seq) / len(seq)
```
On my current 3.8 build, this code gives an approx 500x speed-up (almost three orders of magnitude). Note that having a fast fmean() function is important in resampling statistics, where the mean() is typically called many times: http://statistics.about.com/od/Applications/a/Example-Of-Bootstrapping.htm
```
$ ./python.exe -m timeit -r 11 -s 'from random import random' \
      -s 'from statistics import mean' \
      -s 'seq = [random() for i in range(10_000)]' 'mean(seq)'
50 loops, best of 11: 6.8 msec per loop

$ ./python.exe -m timeit -r 11 -s 'from random import random' \
      -s 'from math import fsum' \
      -s 'mean=lambda seq: fsum(seq)/len(seq)' \
      -s 'seq = [random() for i in range(10_000)]' 'mean(seq)'
2000 loops, best of 11: 155 usec per loop
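To make the resampling motivation concrete, here is a minimal percentile-bootstrap sketch (the `bootstrap_ci` helper, seed, and sample data are invented for illustration, not from the issue): the mean is recomputed for every resample, so its speed dominates the whole loop.

```python
import random
from math import fsum

def fmean(seq):
    # Proposed fast mean: one fsum() pass plus a division.
    return fsum(seq) / len(seq)

def bootstrap_ci(data, n_resamples=1000, alpha=0.05):
    # Percentile bootstrap: resample with replacement, recompute the
    # mean each time, and take the empirical (alpha/2, 1 - alpha/2) band.
    n = len(data)
    means = sorted(fmean(random.choices(data, k=n))
                   for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

random.seed(0)
sample = [random.gauss(10.0, 2.0) for _ in range(200)]
print(bootstrap_ci(sample))  # 95% CI for the mean, near 10.0
```

With 1000 resamples, fmean() is called 1000 times on one dataset, which is why a 30x-500x speed-up matters here.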
Correct me if I'm wrong, but at least initially, the first listed goal of statistics (per the PEP) was:
"Correctness over speed. It is easier to speed up a correct but slow function than to correct a fast but buggy one."
numpy already exists for people who need insane speed for these algorithms and are willing to compromise accuracy. Am I wrong in my impression that statistics is more about providing correct batteries-included functions that are fast enough for simple uses, not about reimplementing numpy piece by piece for hardcore number crunching?
Even if such a function were desirable, I don't like the naming symmetry between fsum and fmean; it's kind of misleading. math.fsum is a slower, but more precise, version of the built-in sum. Having statistics.fmean be a faster, less accurate, version of statistics.mean reverses that relationship between the f-prefixed and non-f-prefixed versions of a function.
Are you saying that "fsum(seq) / len(seq)" is incorrect because on some older builds there is a rare possibility of an error of 1 unit in the last place? Just about everyone who uses this module will not care one whit about that. ISTM, a requirement for exactness violates "practicality beats purity". We really don't want to be in the position of advising people to never use this module because of design choices that aren't in line with user needs.
Taken to an extreme, this could be used to justify a BubbleSort. In the case of statistics.mean(), slowness of three orders of magnitude makes this a nearly catastrophic choice for real world use.
Any reasonable name will suffice. I suggested an f prefix that could be taken to mean that it is "fast" or that it returns a "float". It isn't much different than isqrt() for an integer square root.
Let's see what Steven and Tim think about this.
Just to be clear, it's not that rare a possibility, and it's not restricted to older builds. I think Raymond is referring to the bug where on some machines that are still using the x87 for arithmetic, double rounding can lead to the fsum result being out (by potentially an arbitrary amount, not just 1 ulp). But even without that consideration, we'll still often be out by 1 ulp or more on typical new systems.
However, if I got my sums right (and assuming that fsum _is_ correctly rounded, and IEEE 754 arithmetic is in use, and the rounding mode hasn't been changed from its default of round-ties-to-even, and we're excluding corner cases like overflow, underflow, etc., etc.), the result of fsum(seq) / len(seq) can never be out by more than 1.5 ulps. That's better than NumPy can promise (even with its use of pairwise summation in some -- but not all -- cases), and should be good enough for almost any practical purpose.
I think I'd rather see the regular statistics.mean sacrifice the good-to-the-last-bit accuracy (and it's really not that much of a sacrifice: accurate to 1.5 ulps is a _very_ respectable compromise) than have a new function. I don't know how feasible that is, though, given all the type-handling in statistics.mean.
Bikeshedding time: the proposed fmean name seems a bit unfortunate in that it's reversing the sense of sum and fsum, where fsum is the correctly-rounded, slightly slower variant. My first assumption on seeing the name was that fmean was supposed to be to mean as fsum is to sum.
Double-checking my own assertions: here's an example of a list xs of floats for which fsum(xs) / len(xs) is out by more than 1 ulp:
```
Python 3.7.2 (default, Dec 30 2018, 08:55:50)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import fractions, math
>>> xs = [float.fromhex(h) for h in [
...     '0x1.88104e64ffa5cp-3', '0x1.9b793215ddca8p-3',
...     '0x1.754cbf6b09730p-1', '0x1.e2b4ca1df3680p-2',
...     '0x1.91689b66782e1p-1']]
>>> approx_mean = math.fsum(xs) / len(xs)
>>> approx_mean  # in [0.25, 0.5], so 1 ulp is 2**-54
0.47536945341150305
>>> exact_mean = sum(fractions.Fraction(x) for x in xs) / len(xs)
>>> exact_mean
Fraction(10704368466236809, 22517998136852480)
>>> error_in_ulps = abs(exact_mean - fractions.Fraction(approx_mean)) * 2**54
>>> float(error_in_ulps)
1.2
```
I ran checks on 1000000 such randomly generated lists, and the error never exceeded 1.5 ulps.
Sketch of proof of the 1.5 ulps bound: the fsum result is out by at most 0.5 ulps; the length n of the list is presumably represented exactly (lists with more than 2**53 elements would be infeasible). Division of the fsum result by n keeps the relative error the same, but potentially magnifies the ulps error by two, due to the usual "wobble" between relative error and ulps error, so that gives us up to 1 ulp error. Then the result of the division may need to be rounded again, giving another potential error of up to 0.5 ulps. The bound is strict: we can't actually attain 1.5 ulps, so the result we get can't be more than 1 ulp away from the correctly rounded result.
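A check along the lines of the million-list experiment above can be reproduced with a short script (a sketch; the seed, list length, and trial count are arbitrary choices, and the bound assumes a correctly rounded fsum(), i.e. IEEE 754 doubles without x87 double rounding; math.ulp requires Python 3.9+):

```python
import random
from fractions import Fraction
from math import fsum, ulp

def ulps_error(xs):
    # Compare fsum(xs)/len(xs) against the exact rational mean,
    # measured in units in the last place of the computed result.
    approx = fsum(xs) / len(xs)
    exact = sum(Fraction(x) for x in xs) / len(xs)
    return abs(exact - Fraction(approx)) / Fraction(ulp(approx))

random.seed(35904)
worst = max(ulps_error([random.random() for _ in range(5)])
            for _ in range(10_000))
print(float(worst))  # stays strictly below 1.5 per the proof sketch
```

Per the proof sketch, the 1.5 ulps bound is strict, so the printed worst case should never reach it.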
I see Josh already made this observation: apologies for the duplicate bikeshedding.
In the PEP, I did say that I was making no attempt to compete with numpy […]
That doesn't mean I don't care about speed. Nor do I necessarily care […]
I doubt that stdev etc would be able to promise that accuracy, so […]
But I'm not super-keen on having two separate mean() functions if that […]
In my ideal world, I'd have a single mean() function that had a […]
When I say "more accurate", this isn't a complaint about fsum(). It […]
Since we need both the sum and the length, this seemed like a good starting point. Also, the existing mean() function already covers the more general cases.
I suspect that it is common to keep the data in memory so that more than one descriptive statistic can be generated:
```python
data = load_measurements()
data.sort()
n = len(data)
mu = fastmean(data)
sigma = stdev(data, xbar=mu)
low, q1, q2, q3, high = data[0], data[n//4], data[n//2], data[3*n//4], data[-1]
popular = mode(data, first_tie=True)
```
It's possible (though possibly not desirable) to provide a fallback path:
```python
def fastmean(data: Iterable) -> float:
    try:
        return fsum(data) / len(data)
    except TypeError:
        # Slow alternative
        return float(mean(data))
        # Memory intensive alternative
        data = list(data)
        return fsum(data) / len(data)
        # Less accurate alternative
        total = n = 0
        for n, x in enumerate(data, start=1):
            total += x
        return total / n
```
On my system, I only get a 30x speed-up using your timeit code. Using […]
This just goes to show how sensitive these timing results are on […]
What do you think of this implementation?
```python
def floatmean(data: Iterable) -> float:
    try:
        n = len(data)
    except TypeError:
        # Handle iterators with no len.
        n = 0
        def count(x):
            nonlocal n
            n += 1
            return x
        total = math.fsum(map(count, data))
        return total/n
    else:
        return math.fsum(data)/n
```
Compared to the "no frills" fsum()/len() version:
On my computer, the difference between the sequence path and the […]
As for the name, I think we have three reasonable candidates: fast_mean, float_mean and fmean
(with or without underscores for the first two). Do people have a preference?
+1 from me as well.
I like your count() solution because 1) it gives the same answer for both an iterator and for an iterable 2) it preserves the memory friendly characteristics of iterators, 3), it is still reasonably fast, and 4) the function signature is still simple.
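Point 1 is easy to check with a self-contained sketch of the count() version above (reproduced here so the snippet runs on its own): both the len() path and the iterator path see the same elements and the same n, so they must agree exactly.

```python
import math
from typing import Iterable

def floatmean(data: Iterable) -> float:
    try:
        n = len(data)
    except TypeError:
        # No len(): count the elements as fsum() consumes them.
        n = 0
        def count(x):
            nonlocal n
            n += 1
            return x
        return math.fsum(map(count, data)) / n
    else:
        return math.fsum(data) / n

xs = [0.1, 0.2, 0.3, 0.4]
print(floatmean(xs) == floatmean(iter(xs)))  # True: list and iterator agree
```

The iterator path also never materializes the data, which is what preserves the memory-friendly behavior in point 2.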
My top name preference is "fmean" because I'm used to "isqrt" for integers, "fsqrt" for floats, and "cmath" for complex where the first letter means the type. Incidentally, that is why "fsum" is named with an "f". My second choice is "fastmean" without an underscore. To my ear, "float_mean" is confusing, as if "float" were a verb and "mean" were the direct object, the antonym of "sink_mean" ;-)