
Add multi-dimensional Euclidean distance function to the math module #77270

Closed
rhettinger opened this issue Mar 16, 2018 · 26 comments
Labels
3.8 (EOL: end of life), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@rhettinger
Contributor

BPO 33089
Nosy @tim-one, @rhettinger, @mdickinson, @scoder, @stevendaprano, @skrah, @serhiy-storchaka, @miss-islington
PRs
  • bpo-33089: Multidimensional math.hypot() #8474
  • bpo-33089: Add math.dist() for computing the Euclidean distance between two points #8561
  • bpo-33089: Add math.dist() and math.hypot() to Whatsnew #11896
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2018-07-31.07:46:27.793>
    created_at = <Date 2018-03-16.18:50:58.696>
    labels = ['3.8', 'type-feature', 'library']
    title = 'Add multi-dimensional Euclidean distance function to the math module'
    updated_at = <Date 2019-02-16.19:00:48.319>
    user = 'https://github.com/rhettinger'

    bugs.python.org fields:

    activity = <Date 2019-02-16.19:00:48.319>
    actor = 'miss-islington'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-07-31.07:46:27.793>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2018-03-16.18:50:58.696>
    creator = 'rhettinger'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 33089
    keywords = ['patch']
    message_count = 26.0
    messages = ['313967', '313979', '313983', '313984', '314074', '314078', '314079', '314082', '314099', '314102', '314103', '314113', '314129', '320380', '320389', '322400', '322401', '322433', '322434', '322454', '322564', '322588', '322651', '322743', '322814', '335710']
    nosy_count = 8.0
    nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson', 'scoder', 'steven.daprano', 'skrah', 'serhiy.storchaka', 'miss-islington']
    pr_nums = ['8474', '8561', '11896']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue33089'
    versions = ['Python 3.8']

    @rhettinger
    Contributor Author

    A need for a distance-between-two-points function arises frequently enough to warrant consideration for inclusion in the math module. It shows up throughout mathematics -- everywhere from simple homework problems for kids to machine learning and computer vision.

    In the latter cases, the function is called frequently and would benefit from a fast C implementation that includes good error checking and is algorithmically smart about numerical issues such as overflow and loss-of-precision.

    A simple implementation would be something like this:

        from math import sqrt, fsum

        def dist(p, q):
            'Multi-dimensional Euclidean distance'
            # XXX needs error checking:  len(p) == len(q)
            return sqrt(sum((x0 - x1) ** 2 for x0, x1 in zip(p, q)))

    The implementation could also include value-added features such as hypot()-style scaling to mitigate overflow during the squaring step:

        def dist2(p, q):
            # https://en.wikipedia.org/wiki/Hypot#Implementation
            diffs = [x0 - x1 for x0, x1 in zip(p, q)]
            scale = max(diffs, key=abs)
            if scale == 0:
                return 0.0          # identical points; avoids dividing by zero below
            return abs(scale) * sqrt(fsum((d / scale) ** 2 for d in diffs))

    rhettinger added the 3.8, stdlib, and type-feature labels Mar 16, 2018
    @stevendaprano
    Member

    This was requested once before but rejected. I would like to see that decision reconsidered.

    https://bugs.python.org/issue1880

    Some months ago, there was a very brief discussion started by Lawrence D’Oliveiro about this on the comp.lang.python newsgroup. Unfortunately neither his posts nor any replies to them seem to have been archived at the Python-List mailing list. His original post is on google groups:

    https://groups.google.com/forum/#!topic/comp.lang.python/GmrGx9QQINQ

    but I see no replies there. There is one reply on comp.lang.python, from Stefan Ram, but that doesn't appear on either the mailing list or Google Groups.

    @stevendaprano
    Member

    The obvious workaround of calling hypot multiple times is not only tedious but also loses precision. For example, the body diagonal of a 1x1x1 cube should be exactly √3:

    py> from math import hypot, sqrt
    py> hypot(hypot(1, 1), 1) == sqrt(3)
    False

    I know of at least five languages or toolkits that support this feature, or something close to it: JavaScript, Julia, MATLAB, the GNU Scientific Library, and C++.

    JavaScript and Julia support an arbitrary number of arguments:

    https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/hypot

    JuliaLang/julia#15649

    Using Firefox's JavaScript console, I get the correct body diagonal for the cube:

    > Math.hypot(1, 1, 1) == Math.sqrt(3)
    true

    MATLAB's hypot() function takes only two arguments, but norm(vector) returns the Euclidean 2-norm of the vector, i.e. the equivalent of hypot over multiple arguments.

    https://au.mathworks.com/help/matlab/ref/norm.html

    The GNU Scientific Library and C++ support a three-argument form of hypot:

    http://git.savannah.gnu.org/cgit/gsl.git/commit/?id=e1711c84f7ba5c2b22d023dcb7e10810233fff27

    http://en.cppreference.com/w/cpp/numeric/math/hypot
    http://open-std.org/JTC1/SC22/WG21/docs/papers/2015/p0030r1.pdf

    So +1 on math.hypot supporting an arbitrary number of arguments.

    @stevendaprano
    Member

    Ah wait, I appear to have misunderstood Raymond's request. Sorry Raymond!

    (I've been spending too much time teaching Pythagoras' theorem and my mind was primed to go directly from Euclidean distance to hypotenuse.)

    Notwithstanding my misunderstanding, if hypot supported an arbitrary number of arguments, then the implementation of distance could simply defer to hypot:

    def distance(p, q):
        # TODO error checking that p and q have the same number of items
        return hypot(*(x-y for x,y in zip(p, q)))

    giving results like this:

    py> distance((11, 21), (14, 17))
    5.0

    In either case, I agree with Raymond that this would be a useful feature.

    @tim-one
    Member

    tim-one commented Mar 19, 2018

    I'd be +1 on generalizing math.hypot to accept an arbitrary number of arguments. It's the natural building block for computing distance, but the reverse is strained. Both are useful.

    Here's scaling code translated from the Fortran implementation of SNRM2 in LAPACK. It only looks at each element once, so it would work almost as-is for an argument that's an arbitrary iterable (just change the formal argument from *xs to xs):

        # http://www.netlib.org/lapack/explore-html/d7/df1/snrm2_8f_source.html
        def hypot(*xs):
            import math
            scale = 0.0
            sumsq = 1.0
            for x in xs:
                if x:
                    absx = abs(x)
                    if scale < absx:
                        sumsq = 1.0 + sumsq * (scale / absx)**2
                        scale = absx
                    else:
                        sumsq += (absx / scale)**2
            return scale * math.sqrt(sumsq)

    I believe it does the right thing with infinities by magic (returns math.inf). It returns 0.0 if no arguments are passed, which makes sense. For one argument, it's a convoluted way to return its absolute value. The "if x:" test is necessary to prevent division by 0.0 if the first argument is 0.0. It would need another blob of code to give up (and return math.nan) if one of the arguments is a NaN (there's no guessing what various C compilers would do without special-casing NaN).

    I doubt fsum() would add much value here: all the addends have the same sign, so cancellation is impossible.

    To get really gonzo, a single-rounding dot product would be the best building block (the vector norm is the square root of the dot product of a vector with itself). fsum() is a special case of that (the dot product of a vector with a same-length vector of all ones).

    @serhiy-storchaka
    Member

    With this implementation

    >>> hypot(*range(15), 3)
    32.00000000000001

    The exact result is 32, since 0**2 + 1**2 + ... + 14**2 + 3**2 = 1015 + 9 = 1024 = 32**2.

    @rhettinger
    Contributor Author

    [Uncle Timmy]

    I doubt fsum() would add much value here: all the addends have the
    same sign, so cancellation is impossible

    fsum() may be overkill for this problem. I mentioned it because the math module already had the requisite code and because it improved accuracy with high-dimensional data in machine learning examples I've encountered:

        >>> from math import fsum, sqrt
        >>> from decimal import Decimal, getcontext

        >>> n = 1000
        >>> sum([0.1] * n)
        99.9999999999986
        >>> fsum([0.1] * n)
        100.0

        >>> sqrt(sum([0.1] * n) / n)
        0.3162277660168357
        >>> sqrt(fsum([0.1] * n) / n)
        0.31622776601683794

        # fsum() version exactly matches the decimal crosscheck
        >>> getcontext().prec = 40
        >>> (sum([Decimal(0.1)] * n) / n).sqrt()
        Decimal('0.3162277660168379419769730258850242641698')

    If we care about those little differences (about 80 ulp in this example), the single-rounding dot product seems like a better way to go.

    @mdickinson
    Member

    +1 for a single-rounded dot product. If we're allowed to assume IEEE 754, it's straightforward to code up something that's not too inefficient and gives correctly rounded results for "normal" cases, using a combination of Veltkamp splitting, Dekker multiplication, and fsum. The difficulties come in if you want to maintain correct rounding in cases where any of the partial products overflows or (especially awkwardly) underflows.

    Also, if we can figure out how to do a correctly-rounded dot product, that gives us math.fma as a special case... (a*b + c = dot([a, c], [b, 1.0])). (Of course, that's a bit backwards, since usually you'd see fma as a primitive and *use* it in computing a dot product.)
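
    A rough sketch of the kind of approach described above (for illustration only, not the code that landed): assuming IEEE 754 binary64 and ignoring overflow/underflow in the partial products, each product is split into an exact sum of two floats via Veltkamp splitting and Dekker multiplication, and fsum() then rounds the whole thing once:

        from math import fsum, sqrt

        def split(a):
            # Veltkamp splitting: a == hi + lo, with each half narrow enough
            # that the partial products below are computed exactly.
            t = a * 134217729.0            # 2**27 + 1 for binary64
            hi = t - (t - a)
            lo = a - hi
            return hi, lo

        def two_product(a, b):
            # Dekker multiplication: p + e == a*b exactly (absent overflow).
            p = a * b
            ahi, alo = split(a)
            bhi, blo = split(b)
            e = ((ahi * bhi - p) + ahi * blo + alo * bhi) + alo * blo
            return p, e

        def dot(xs, ys):
            # Dot product with a single rounding at the end, via fsum().
            parts = []
            for x, y in zip(xs, ys):
                p, e = two_product(x, y)
                parts += [p, e]
            return fsum(parts)

        def norm(xs):
            # Vector 2-norm: two roundings total (the dot product and the sqrt).
            return sqrt(dot(xs, xs))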

    @tim-one
    Member

    tim-one commented Mar 19, 2018

    Some notes on the hypot() code I pasted in: first, it has to special-case infinities too - it works fine if there's only one of 'em, but returns a NaN if there's more than one (it ends up computing inf/inf, and the resulting NaN propagates).

    Second, it's not clear to me what the result "should be" if there's at least one infinity _and_ at least one NaN. At the start, "anything with a NaN input returns a NaN" was the rule everything followed. Later an exception to that was made for NaN**0 == 1, under the theory that it didn't really matter if the computation of the base screwed up, because anything whatsoever to the 0 power is 1 - or, viewing NaN as meaning "we have no idea what the true value is", it simply doesn't matter what the true value is in this context. By the same logic, if there's an infinite argument to hypot(), it doesn't matter what any other argument is - the result is +inf regardless. So that requires some investigation. Offhand I'm inclined to return NaN anyway.

    Finally, if there is a robust single-rounding dot product, of course scaling can be skipped (and so eliminate another source of small errors).

    @mdickinson
    Member

    By the same logic, if there's an infinite argument to hypot(), it doesn't matter what any other argument is - the result is +inf regardless.

    Yep, that's what IEEE 754-2008 says for the two-argument case, so I think that's the logic that should be followed in the many-argument case: if any of the inputs is an infinity, the output should be infinity.

    From section 9.2.1 of IEEE 754:

    For the hypot function, hypot(±0, ±0) is +0, hypot(± , qNaN) is + , and
    hypot(qNaN, ± ) is + .

    @mdickinson
    Member

    Okay, that cut and paste didn't work so well. Just imagine an infinity symbol in those missing spaces. Trying again:

    For the hypot function, hypot(±0, ±0) is +0, hypot(±∞, qNaN) is +∞, and
    hypot(qNaN, ±∞) is +∞.

    @tim-one
    Member

    tim-one commented Mar 19, 2018

    Mark, thanks! I'm happy with that resolution: if any argument is infinite, return +inf; else if any argument is a NaN, return a NaN; else do something useful ;-)

    Serhiy, yes, the scaling that prevents catastrophic overflow/underflow due to naively squaring unscaled values can introduce small errors of its own. A single-rounding dot product could avoid that, leaving two sources of single-rounding errors (the dot product, and the square root).

    Raymond, yes, fsum() on its own can reduce errors. Note that scaling on its own can also reduce errors (in particular, when the arguments are all the same, they're each scaled to exactly 1.0):

    >>> import math
    >>> n = 1000
    >>> math.sqrt(sum([0.1 ** 2] * n))
    3.1622776601683524
    >>> math.sqrt(math.fsum([0.1 ** 2] * n))
    3.1622776601683795
    >>> hypot(*([0.1] * n)) # using the code above
    3.1622776601683795
    >>> math.sqrt(n) * 0.1
    3.1622776601683795

    @tim-one
    Member

    tim-one commented Mar 20, 2018

    Mark, how about writing a clever single-rounding dot product that merely _detects_ when it encounters troublesome cases? If so, it can fall back to a (presumably) much slower method. For example, like this for the latter:

        def srdp(xs, ys):
            "Single rounding dot product."
            import decimal
            from decimal import Decimal, Inexact
            # XXX check that len(xs) == len(ys)
            with decimal.localcontext(decimal.ExtendedContext) as ctx:
                ctx.traps[Inexact] = True
                total = Decimal(0)
                for x, y in zip(map(Decimal, xs), map(Decimal, ys)):
                    while True:
                        try:
                            total += x * y
                            break
                        except Inexact:
                            ctx.prec += 1
                return float(total)

    So it just converts everything to Decimal; relies on decimal following all the IEEE rules for special cases; retries the arithmetic boosting precision until decimal gets an exact result (which it eventually will since we're only doing + and *); and relies on float() to get everything about final rounding, overflow, and denorms right. If that doesn't work, I'd say it's a bug in the decimal module ;-)

    I'd bet a dollar that, in real life, falling back to this would almost never happen, outside of test cases.

    @rhettinger
    Contributor Author

    Would it be reasonable for me to get started with a "mostly good enough" version using scaling and Kahan summation?

    from operator import sub
    from math import sqrt, fabs
    
    def kahan_summation(seq):
        # https://en.wikipedia.org/wiki/Kahan_summation_algorithm#The_algorithm
        csum = 0
        err = 0
        for x in seq:
            x -= err
            nsum = csum + x
            err = (nsum - csum) - x
            csum = nsum
        return csum
    
    def hypot(*sides):
        scale = max(map(fabs, sides))
        if scale == 0.0:
            return 0.0          # all sides are zero; avoids dividing by zero below
        return scale * sqrt(kahan_summation((s / scale)**2 for s in sides))
    
    def dist(p, q):
        return hypot(*map(sub, p, q))

    assert all(hypot(*([1]*d)) == sqrt(d) for d in range(1, 10000))
    print(dist(p=(11, 4, 10), q=(9, 10, 13.5)))

    @tim-one
    Member

    tim-one commented Jun 24, 2018

    Raymond, I'd say scaling is vital (to prevent spurious infinities), but complications beyond that are questionable, slowing things down for an improvement in accuracy that may be of no actual benefit.

    Note that your original range of use cases, "simple homework problems for kids to machine learning and computer vision", doesn't include cases where good-to-the-last-bit accuracy is important; but at least in machine learning and computer vision apps, primitives may be called an enormous number of times, so "speed matters" to them.

    Perhaps add an optional "summer" argument defaulting to __builtins__.sum? Then the user who wants to pay more for tighter error bounds can pass in whatever they like, from a raw Kahan summer, through one of its improvements, to math.fsum. There just isn't a "one size fits all" answer.
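
    For illustration, a minimal sketch of that idea (hypothetical signature; not the API that was adopted): the caller picks the summation routine and therefore the accuracy/speed trade-off.

        from math import sqrt, fsum, fabs

        def hypot(*coords, summer=sum):
            # Scale by the largest magnitude to avoid spurious overflow,
            # then let the caller-supplied summer add up the scaled squares.
            scale = max(map(fabs, coords), default=0.0)
            if scale == 0.0:
                return 0.0
            return scale * sqrt(summer((x / scale) ** 2 for x in coords))

        # hypot(3.0, 4.0)                 # fast path with the builtin sum
        # hypot(3.0, 4.0, summer=fsum)    # pay a bit more for tighter error bounds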

    @serhiy-storchaka
    Member

    I think we should guarantee some properties:

    1. The result of hypot(x, y, z, ...) shouldn't depend on the order of arguments.

    2. It should be monotonic for all arguments: hypot(..., x, ...) <= hypot(..., y, ...) for abs(x) < abs(y).

    3. hypot(..., inf, ...) = inf

    4. hypot(..., nan, ...) = nan (if no infinities)

    5. hypot(x, x, x, x) == 2*abs(x)
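
    For reference, a quick sanity check of these properties against the multi-argument math.hypot() that eventually shipped (assumes Python 3.8+; the exact-equality cases below hold for the committed implementation but are not documented guarantees):

        from math import hypot, inf, isnan, nan

        assert hypot(3.0, 4.0) == hypot(4.0, 3.0)              # 1. order-independent here
        assert hypot(1.0, 2.0, 3.0) <= hypot(1.0, 2.5, 3.0)    # 2. monotonic in each argument
        assert hypot(1.0, inf, nan) == inf                      # 3. any infinity dominates
        assert isnan(hypot(1.0, nan, 2.0))                      # 4. otherwise NaN propagates
        assert hypot(5.0, 5.0, 5.0, 5.0) == 2 * abs(5.0)        # 5. exact for equal arguments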

    @serhiy-storchaka
    Member

    1. hypot(x) == abs(x)

    2. hypot(x, y) should return the same result as in previous Python version, e.g. should call C lib implementation.

    3. hypot(x, y, 0, <all zeros>...) == hypot(x, y)

    @rhettinger
    Contributor Author

    I don't want to make any guarantees beyond what the doc patch says.

    Commutativity guarantees are difficult to deliver without exact summation. A switchover to C's hypot() creates an external dependency such that we can't make any more guarantees than were given to us (which is nothing). Also, algorithmic switchovers risk impairing monotonicity. The Inf/NaN rules already existed before this patch and I believe that Mark elected to not document them on a function by function basis. Another reason not to make guarantees is that this is a "mostly good-enough patch" and I expect that either Mark or Tim will someday improve it with a "single-rounding dot product". My aim is to make it good before any effort is given to making it perfect. In the meantime, it makes little sense to guarantee implementation details.

    @serhiy-storchaka
    Member

    Commutativity guarantees can be delivered by sorting arguments before summation. In any case the sorting is needed for more accurate summation (from smaller to larger).
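
    A minimal sketch of that suggestion (illustrative only; it was not adopted): sort the scaled squares by magnitude before summing, so the result no longer depends on argument order and small terms are accumulated first.

        from math import sqrt, fabs

        def hypot_sorted(*coords):
            scale = max(map(fabs, coords), default=0.0)
            if scale == 0.0:
                return 0.0
            squares = sorted((x / scale) ** 2 for x in coords)   # smallest terms first
            return scale * sqrt(sum(squares))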

    @rhettinger
    Contributor Author

    Commutativity guarantees can be delivered by sorting arguments before summation.

    No thanks -- that's too expensive for such a small payoff.

    @rhettinger
    Contributor Author

    New changeset c6dabe3 by Raymond Hettinger in branch 'master':
    bpo-33089: Multidimensional math.hypot() (GH-8474)
    c6dabe3

    @scoder
    Contributor

    scoder commented Jul 28, 2018

    > Commutativity guarantees can be delivered by sorting arguments before summation.

    > No thanks -- that's too expensive for such a small payoff.

    Since I don't really see people use this on vectors with hundreds of dimensions, let me still suggest using insertion sort. It's simple and short to implement in the current code (right in the first loop) and should be plenty fast for any reasonable usage of this function.

    @rhettinger
    Contributor Author

    I really don't want to introduce something slower than O(n) behavior here, particularly because some of my use cases have a large n and because this will be called many times. Tim has already opined that even using a variant of Kahan summation would increase the cost unacceptably. IMO, sorting is even worse and would be a foolish thing to do.

    @rhettinger
    Contributor Author

    New changeset 9c18b1a by Raymond Hettinger in branch 'master':
    bpo-33089: Add math.dist() for computing the Euclidean distance between two points (GH-8561)
    9c18b1a

    @mdickinson
    Member

    For the record, let me register my objection here. I'm not convinced by adding math.dist, and am surprised that this was committed. Almost all the discussion in this issue was about adding multi-argument hypot, which seems like an obviously useful building block. dist seems like a specialised function, and I don't think it belongs in the math module.

    @miss-islington
    Contributor

    New changeset 3ff5962 by Miss Islington (bot) (Raymond Hettinger) in branch 'master':
    bpo-33089: Add math.dist() and math.hypot() to Whatsnew (GH-11896)
    3ff5962

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022