Rounding errors with statistics.variance #64698
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
```
assignee = 'https://github.com/rhettinger'
closed_at = <Date 2021-09-09.03:06:34.176>
created_at = <Date 2014-02-03.12:46:19.312>
labels = ['type-feature', 'library', '3.9', '3.10', '3.11']
title = 'Rounding errors with statistics.variance'
updated_at = <Date 2021-09-09.12:02:18.658>
user = 'https://bugs.python.org/oscarbenjamin'

activity = <Date 2021-09-09.12:02:18.658>
actor = 'steven.daprano'
assignee = 'rhettinger'
closed = True
closed_date = <Date 2021-09-09.03:06:34.176>
closer = 'rhettinger'
components = ['Library (Lib)']
creation = <Date 2014-02-03.12:46:19.312>
creator = 'oscarbenjamin'
dependencies =
files = ['33955', '33958', '33959', '33960']
hgrepos =
issue_num = 20499
keywords = ['patch']
message_count = 15.0
messages = ['210121', '210452', '210456', '210459', '210479', '210481', '210491', '210495', '210700', '210702', '399940', '401432', '401433', '401464', '401465']
nosy_count = 5.0
nosy_names = ['rhettinger', 'steven.daprano', 'oscarbenjamin', 'wolma', 'iritkatriel']
pr_nums = ['28230', '28248']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue20499'
versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']
```
The mean/variance functions in the statistics module don't quite round correctly.
The reason is that, although exact rational arithmetic is used internally in the _sum function, it is not used throughout the module. In particular, _sum should be changed to return an exact result, and exact arithmetic should be used right up to the point of returning to the user, at which point a single rounding/coercion should give the user their answer in the appropriate type, correctly rounded once.
Using exact arithmetic everywhere makes it possible to replace all of the variance* functions with single-pass algorithms based on the computational formula for variance, which should be more efficient as well.
For example the following implements pvariance so that it returns a perfectly rounded result for float input and output in a single pass:
```python
from fractions import Fraction

def pvariance(data):
    sx = 0
    sx2 = 0
    for n, x in enumerate(map(Fraction, data), 1):
        sx += x
        sx2 += x ** 2
    Ex = sx / n
    Ex2 = sx2 / n
    var = Ex2 - Ex ** 2
    return float(var)
```
Comparing the above with the statistics module:
```
>>> pvariance([0, 0, 1])
0.2222222222222222
>>> statistics.pvariance([0, 0, 1])
0.22222222222222224
```
The true answer is:
```
>>> from fractions import Fraction as F
>>> float(statistics.pvariance([F(0), F(0), F(1)]))
0.2222222222222222
```
The logic in the _sum function for computing exact integer ratios and coercing back to the output type could be moved into utility functions so that it does not need to be duplicated.
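For illustration, a minimal sketch of what such utility functions might look like; the names _exact_ratio and _coerce are my own for this sketch, not necessarily what the module would use:

```python
from fractions import Fraction

def _exact_ratio(x):
    """Return x as an exact (numerator, denominator) pair."""
    if isinstance(x, int):
        return x, 1
    if isinstance(x, float):
        return x.as_integer_ratio()
    f = Fraction(x)              # exact for Fraction and Decimal input
    return f.numerator, f.denominator

def _coerce(num, den, T):
    """Round the exact ratio num/den once, into the output type T."""
    if T is float:
        return num / den         # int/int true division rounds once
    return T(num) / T(den)       # e.g. Decimal divides/rounds per context
```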
Some examples of rounding issues:
Variance with ints or floats returns a float but the float is not quite the nearest possible float:
```
>>> variance([0, 0, 2])
1.3333333333333335
>>> float(variance([F(0), F(0), F(2)]))  # true result rounded once
1.3333333333333333
```
Another example with Decimal:
```
>>> getcontext().prec = 5
>>> getcontext()
Context(prec=5, rounding=ROUND_HALF_EVEN, Emin=-999999999, Emax=999999999, capitals=1, clamp=0, flags=[Rounded, Inexact], traps=[DivisionByZero, Overflow, InvalidOperation])
>>> variance([D(0), D(0), D(2)] * 2)  # rounded down instead of up
Decimal('1.0666')
>>> r = variance([F(0), F(0), F(2)] * 2)
>>> D(r.numerator) / r.denominator  # correctly rounded
Decimal('1.0667')
```
The mean function may also not be correctly rounded:
```
>>> getcontext().prec = 2
>>> r = mean((F('1.2'), F('1.3'), F('1.55')))
>>> r
Fraction(27, 20)
>>> D(r.numerator) / r.denominator  # correctly rounded
Decimal('1.4')
>>> mean([D('1.2'), D('1.3'), D('1.55')])
Decimal('1.3')
```
I have written a patch for this issue. (I'm uploading the complete new module so that everyone can try it; importing it into Python 3.3 works fine. A diff with additional tests against Oscar's examples will follow soon.)
The implementation I chose is a bit different from Oscar's suggestion. Essentially, it introduces a dedicated module-private class _ExactRatio that represents numbers as exact ratios and is passed between the functions of the module. The class borrows many of its algorithms from fractions.Fraction, but has specialized methods for the central tasks of the statistics module, making it much more efficient in this context than fractions.Fraction. It is currently really minimal, but can easily be extended if necessary.
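To make the idea concrete, here is a minimal sketch of what such a class might look like; the method names are illustrative guesses, not the actual API from the patch:

```python
class _ExactRatio:
    """Exact num/den pair; cheaper than Fraction for repeated addition."""
    __slots__ = ('num', 'den')

    def __init__(self, num, den=1):
        self.num = num
        self.den = den

    def __add__(self, other):
        # Unlike Fraction, skip the gcd normalisation on every addition;
        # intermediate results don't need to be in lowest terms.
        return _ExactRatio(self.num * other.den + other.num * self.den,
                           self.den * other.den)

    def to_float(self):
        # CPython's int/int true division is correctly rounded.
        return self.num / self.den
```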
As for performance, the gain Oscar imagined is not always realized, even though the variance functions now make a single pass over the data. Specifically, in the case of floats, the overhead of converting everything to exact ratios first eats up all the savings.
(table: performance gain(+)/loss(-) in % over the original module, by data type)
With Decimal input, the cost of converting to exact ratios depends on the number of digits in the Decimals, so with short Decimals the savings from the single-pass algorithm exceed the conversion costs, but this reverses for very long Decimals.
Try this, for example:
```python
from statistics import variance as v
from statistics_with_exact_ratios import variance as v2
from fractions import Fraction

data = [Fraction(1, x) for x in range(1, 2000)]

print('calculating variance using original statistics module ...')
print(float(v(data)))
print('now using exact ratio calculations ...')
print(float(v2(data)))
```
I invite everybody to test my implementation, which is very unlikely to be free of bugs at this stage.
We can add a fast Decimal.as_integer_ratio() in C.
That said, why is the sum of Decimals not done in decimal arithmetic? With enough context precision, the decimal sum is exact:
```
>>> c.prec = 50
>>> sum([D("1e50"), D(1), D("-1e50")] * 1000)
Decimal('0E+1')
>>>
>>> c.prec = 51
>>> sum([D("1e50"), D(1), D("-1e50")] * 1000)
Decimal('1000')
```
That would be a terrific thing to have! Currently, Decimals perform really poorly with the statistics functions, and this is entirely due to this bottleneck.
With regard to calculating the sum of Decimals in decimal arithmetic, the problem is how you'd detect that the input is all Decimals (or contains enough Decimals to justify a switch in the algorithm).
A fast Decimal.as_integer_ratio() would be useful in any case.
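For reference, a pure-Python sketch of the conversion for finite Decimals (CPython did later gain a fast C implementation, Decimal.as_integer_ratio(), in Python 3.6; the function name below is hypothetical):

```python
from math import gcd

def decimal_to_integer_ratio(d):
    """Exact (num, den) pair for a finite Decimal."""
    sign, digits, exp = d.as_tuple()
    num = int(''.join(map(str, digits)))
    if sign:
        num = -num
    if exp >= 0:
        return num * 10**exp, 1
    den = 10**-exp
    g = gcd(num, den)
    return num // g, den // g
```

For example, `decimal_to_integer_ratio(Decimal('1.25'))` gives `(5, 4)`.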
If you're going to use decimals, though, then you can trap Inexact and retry with increased precision until the sum is computed exactly.
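A sketch of that retry idea, assuming all inputs are Decimals (exact_decimal_sum is a made-up name):

```python
from decimal import Decimal, Inexact, localcontext

def exact_decimal_sum(numbers):
    """Sum Decimals exactly by raising the precision until nothing rounds."""
    numbers = list(numbers)
    with localcontext() as ctx:
        ctx.traps[Inexact] = True    # any rounding raises Inexact
        while True:
            try:
                return sum(numbers, Decimal(0))
            except Inexact:
                ctx.prec *= 2        # retry with more precision
```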
When I looked at this before, having special cases for everything from ints and floats to Decimals and Fractions complicated the code considerably.
One option is to do something like:
```python
import math
import itertools
from decimal import Decimal
from decimalsum import decimalsum

def _sum(numbers):
    subtotals = []
    for T, nums in itertools.groupby(numbers, type):
        if T is int:
            subtotals.append(sum(nums))
        elif T is float:
            subtotals.append(math.fsum(nums))
        elif T is Decimal:
            subtotals.append(decimalsum(nums))
        else:
            raise NotImplementedError
    return decimalsum(subtotals)
```
The main problem here is that fsum rounds every time it returns, so the float subtotals are no longer exact and the final result is not guaranteed to be rounded only once.
Also, having separate code blocks to manage all the different types is awkward to maintain.
In principle, your approach using itertools.groupby, the existing _sum, your decimalsum, and my _ExactRatio class could be combined.
You could call decimalsum for Decimal and subclasses, the current _sum for everything else.
The problem I see is that it would slow down the many cases where no Decimals, or only a few, are involved (think of mixes of ints and floats as a realistic scenario, and consider also that you would have to do an isinstance check to catch subclasses of Decimal).
Oscar Benjamin <firstname.lastname@example.org> wrote:
For sums that is not necessary. Create a context with MAX_EMAX, MIN_EMIN and MAX_PREC, and the result will be exact.
Of course, calculating 1/3 in MAX_PREC would be catastrophic.
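In other words, a decimalsum along the lines used above could be as simple as this sketch (exact for addition only; as noted, division at MAX_PREC must be avoided):

```python
from decimal import Decimal, Context, MAX_PREC, MAX_EMAX, MIN_EMIN

# A context so large that Decimal addition never needs to round.
_exact = Context(prec=MAX_PREC, Emax=MAX_EMAX, Emin=MIN_EMIN)

def decimalsum(numbers):
    total = Decimal(0)
    for x in numbers:
        total = _exact.add(total, x)
    return total
```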
Yes, I was thinking of a "don't do that" approach. Do people have mixed-type data sets in practice?
I worked out a slightly speedier version of decimal_to_ratio today (Stefan: that's when I duplicated your bug report):
```python
from decimal import Context

def _decimal_to_ratio(x):
    _, digits, exp = x.as_tuple()
    if exp in _ExactRatio.decimal_infinite:  # INF, NAN, sNAN
        assert not x.is_finite()
        raise ValueError
    if exp < 0:
        exp = -exp
        return int(x.scaleb(exp, Context(prec=len(digits)))), 10**exp
    return int(x), 1
```
This makes the variance functions in my re-implementation about 20-30% faster for Decimal. It's not a big breakthrough, but still.
Regarding the call to sorted that you removed: I just realised that it was buggy! In my testing, I found a consistent speed-up from summing fractions in order of increasing denominator (small denominators first):
```
>>> from fractions import Fraction
>>> from timeit import Timer
>>> from random import shuffle
>>>
>>> d1 = [Fraction(1, d) for d in range(1, 1000)]
>>> d2 = d1[:]
>>> shuffle(d2)
>>>
>>> t1 = Timer('sum(d)', setup='from __main__ import d1 as d')
>>> t2 = Timer('sum(d)', setup='from __main__ import d2 as d')
>>>
>>> min(t1.repeat(number=100, repeat=7))  # small dens first
1.048042511974927
>>> min(t1.repeat(number=100, repeat=7))
1.0449080819962546
>>>
>>> min(t2.repeat(number=100, repeat=7))  # random order
1.2761094039888121
>>> min(t2.repeat(number=100, repeat=7))
1.2763739289948717
```
Summing larger denominators before smaller denominators is even slower:
```
>>> d3 = list(reversed(d1))
>>> t3 = Timer('sum(d)', setup='from __main__ import d3 as d')
>>>
>>> min(t3.repeat(number=100, repeat=7))  # large dens first
1.6581684999982826
>>> min(t3.repeat(number=100, repeat=7))
1.6580656860023737
```
But then I screwed up by sorting the *tuples* (num, den) which would have the effect of summing small *numerators* first, not denominators.
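For the record, the intended optimisation would key the sort on the denominator itself, not on the whole tuple; a sketch:

```python
from fractions import Fraction

# Sort the (num, den) pairs on the denominator (t[1]), not on the tuple.
def sum_ratios(ratios):
    return sum(Fraction(num, den)
               for num, den in sorted(ratios, key=lambda t: t[1]))
```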
Thanks for catching that. The way I had written it, the sort would have had at best no real performance benefit.