Skip to content

statistics.quantiles(exclusive) returns unsorted cut points for duplicate float-inexact data #150318

@zhangjiashuo-cs

Description

@zhangjiashuo-cs

Bug description

statistics.quantiles(data, n=N, method='exclusive') with N ≥ 3 returns cut points that are not monotonically non-decreasing when data contains two equal but float-inexact values. The first and last cut points compute one ULP below the data value, while interior cut points compute exactly the data value, so the returned list is unsorted.

import statistics

# Two copies of float(pi)
data = [3.141592653589793, 3.141592653589793]
result = statistics.quantiles(data, n=9, method='exclusive')

print(result)
# [3.1415926535897927, 3.141592653589793, 3.141592653589793,
#  3.141592653589793, 3.141592653589793, 3.141592653589793,
#  3.141592653589793, 3.1415926535897927]

print(result == sorted(result))
# False  <-- cut points 0 and 7 are < cut points 1..6

Reproducer is one-liner:

>>> import statistics
>>> r = statistics.quantiles([3.141592653589793]*2, n=9, method='exclusive')
>>> r[0] <= r[1]
False

Why this is a bug

The quantiles documentation says it "Divide(s) data into n continuous intervals with equal probability. Returns a list of n - 1 cut points separating the intervals." Cut points that separate intervals must be monotonically non-decreasing — otherwise the "intervals" they delimit overlap or invert. Several downstream uses (e.g. bisect.bisect_left(cut_points, x) to bucket a value) rely silently on this invariant and will return wrong bucket indices when it's violated.

The 'exclusive' method computes each cut point with the formula:

# Lib/statistics.py
m = ld - 1
result = []
for i in range(1, n):
    j = i * m // n
    delta = i * m - j * n
    interpolated = (data[j-1] * (n - delta) + data[j] * delta) / n
    result.append(interpolated)

When data[j-1] == data[j] == x exactly, interpolated = (x*(n-delta) + x*delta)/n should be x mathematically, but for non-power-of-two delta/n ratios the float operations x*(n-delta) + x*delta do not always round to n*x, yielding a result one ULP off. The error happens asymmetrically across i, so adjacent cut points differ by 1 ULP and the list ends up unsorted.

'inclusive' is not affected because its formula is structured to be exact when data are identical.

Suggested fix

Sort the result before returning (cheap, O(n log n) on n − 1 elements which is already trivial):

--- a/Lib/statistics.py
+++ b/Lib/statistics.py
@@ -X,Y +X,Y @@ def quantiles(data, *, n=4, method='exclusive'):
         for i in range(1, n):
             j = i * m // n             # rescale i to m/n
             delta = i*m - j*n          # exact remainder
             interpolated = (data[j-1] * (n - delta) + data[j] * delta) / n
             result.append(interpolated)
-        return result
+        # The interpolation above can produce float-rounding artifacts that
+        # leave neighbouring identical-data cut points one ULP out of order
+        # (e.g. data=[pi, pi], n=9).  Restore the documented monotonic-
+        # non-decreasing invariant before returning.
+        result.sort()
+        return result

This preserves the existing values exactly when they are already monotonic and only reorders the few corner cases where float rounding broke the invariant. A regression test:

import statistics
def test_quantiles_monotonic_with_duplicates():
    for x in [3.141592653589793, 1/3, 0.1, 1e300, 1e-300]:
        for n in range(2, 20):
            for method in ('exclusive', 'inclusive'):
                r = statistics.quantiles([x]*2, n=n, method=method)
                assert r == sorted(r), (x, n, method, r)

CPython versions tested on

3.9, 3.12, 3.14

Operating systems tested on

macOS

Linked PRs

Metadata

Metadata

Assignees

Labels

stdlibStandard Library Python modules in the Lib/ directorytype-bugAn unexpected behavior, bug, or error
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions