Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Worst-case behaviour of hash collision with float NaN #87641

Closed
congma mannequin opened this issue Mar 11, 2021 · 40 comments
Closed

Worst-case behaviour of hash collision with float NaN #87641

congma mannequin opened this issue Mar 11, 2021 · 40 comments
Assignees
Labels
3.10 only security fixes 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@congma
Copy link
Mannequin

congma mannequin commented Mar 11, 2021

BPO 43475
Nosy @tim-one, @rhettinger, @mdickinson, @serhiy-storchaka, @miss-islington, @congma
PRs
  • bpo-43475: Fix worst case collision behavior for NaN instances #25493
  • bpo-43475: Fix the Python implementation of hash of Decimal NaN #26679
  • [3.10] bpo-43475: Fix the Python implementation of hash of Decimal NaN (GH-26679) #26706
  • bpo-43475: Add what's new entry for NaN hash changes #26725
  • [3.10] bpo-43475: Add what's new entry for NaN hash changes (GH-26725) #26743
  • Files
  • nan_key.py: PoC for NaN-collisions and possible fix for worst-case
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2021-06-16.11:36:54.791>
    created_at = <Date 2021-03-11.14:55:33.263>
    labels = ['interpreter-core', '3.11', 'library', '3.10', 'performance']
    title = 'Worst-case behaviour of hash collision with float NaN'
    updated_at = <Date 2021-11-24.13:46:50.396>
    user = 'https://github.com/congma'

    bugs.python.org fields:

    activity = <Date 2021-11-24.13:46:50.396>
    actor = 'cwg'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2021-06-16.11:36:54.791>
    closer = 'mark.dickinson'
    components = ['Interpreter Core', 'Library (Lib)']
    creation = <Date 2021-03-11.14:55:33.263>
    creator = 'congma'
    dependencies = []
    files = ['49869']
    hgrepos = []
    issue_num = 43475
    keywords = ['patch']
    message_count = 40.0
    messages = ['388508', '388511', '388512', '388513', '388514', '388519', '388522', '388579', '388603', '388604', '388605', '388608', '388702', '388717', '388748', '390698', '390701', '390767', '390770', '391605', '391606', '391607', '395643', '395653', '395685', '395745', '395746', '395748', '395749', '395750', '395761', '395773', '395783', '395892', '395894', '395918', '406917', '406925', '406927', '406928']
    nosy_count = 8.0
    nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson', 'serhiy.storchaka', 'miss-islington', 'cwg', 'realead', 'congma']
    pr_nums = ['25493', '26679', '26706', '26725', '26743']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue43475'
    versions = ['Python 3.10', 'Python 3.11']

    @congma
    Copy link
    Mannequin Author

    congma mannequin commented Mar 11, 2021

    Summary: CPython hash all NaN values to 0. This guarantees worst-case behaviour for dict if numerous existing keys are NaN. I think by hashing NaN using the generic object (or "pointer") hash instead, the worst-case situation can be alleviated without changing the semantics of either dict or float. However, this also requires changes to how complex and Decimal objects hash, and moreover incompatible change to sys.hash_info. I would like to hear how Python developers think about this matter.

    --------

    Currently CPython uses the hard-coded macro constant 0 (_PyHASH_NAN, defined in Include/pyhash.h) for the hash value of all floating point NaNs. The value is part of the sys.hashinfo API and is re-used by complex and Decimal in computing its hash in accordance with Python builtin-type documentation. [0]

    (The doc [0] specifically says that "[a]ll hashable nans have the same hash value.")

    This is normally not a great concern, except for the worst case performance. The problem is that, since they hash to the same value and they're guaranteed to compare unequal to any compatible numeric value -- not even to themselves, this means they're guaranteed to collide.

    For this reason I'd like to question whether it is a good idea to have all hashable NaNs with the same hash value.

    There has been some discussions about this over the Web for some time (see [1]). In [1] the demo Python script times the insertion of k distinct NaN keys (as different objects) into a freshly created dict. Since the keys are distinct and are guaranteed to collide with each other (if any), the run time of a single lookup/insertion is roughly linear to the existing number of NaN keys. I've recreated the same script using with a more modern Python (attached).

    I'd suggest a fix for this worst-case behaviour: instead of returning the hash value 0 for all NaNs, use the generic object (pointer) hash for these objects. As a PoC (also in the attached script), it roughly means

    class myfloat(float):
        def __hash__(self):
            if self != self:  # i.e., math.isnan(self)
                return object.__hash__(self)
            return super().__hash__(self)
    

    This will

    • keep the current behaviour of dict intact;
    • keep the invariant a == b implies hash(a) == hash(b) intact, where applicable;
    • uphold all the other rules for Python numeric objects listed in [0];
    • make hash collisions no more likely than with object() instances (dict lookup time is amortized constant w.r.t. existing number of NaN keys).

    However, it will

    • not keep the current rule "All hashable nans have the same hash value";
    • not be compatible with the current sys.hash_info API (requires the removal of the "nan" attribute from there and documenting the change);
    • require congruent modifications in complex and Decimal too.

    Additionally, I don't think this will affect module-level NaN "constants" such as math.nan and how they behave. The "NaN constant" has never been a problem to begin with. It's only the *distinct* NaN objects that may cause the worst-case behaviour.

    --------

    Just for the record I'd also like to clear some outdated info or misconception about NaN keys in Python dicts. It's not true that NaN keys, once inserted, cannot be retrieved (e.g., as claimed in [1][2]). In Python, they can be, if you keep the original key object around by keeping a reference to it (or obtaining a new one from the dict by iterating over it). This, I think, is because Python dict compares for object identity before rich-comparing for equality in lookdict() in Objects/dictobject.c, so this works for d = dict():

    f = float("nan")
    d[f] = "value"
    v = d[f]
    

    but this fails with KeyError, as it should:

    d[float("nan")] = "value"
    v = d[float("nan")]
    

    In this regard the NaN float object behaves exactly like the object() instance as keys -- except for the collisions. That's why I think at least for floats the "object" hash is likely to work. The solution using PRNG [1] (implemented with the Go language) is not necessary for CPython because the live objects are already distinct.

    --------

    Links:

    [0] https://docs.python.org/3/library/stdtypes.html#hashing-of-numeric-types
    [1] https://research.swtch.com/randhash
    [2] https://readafterwrite.wordpress.com/2017/03/23/how-to-hash-floating-point-numbers/

    @congma congma mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir performance Performance or resource usage labels Mar 11, 2021
    @mdickinson
    Copy link
    Member

    Sigh. When I'm undisputed ruler of the multiverse, I'm going to make "NaN == NaN" return True, IEEE 754 be damned. NaN != NaN is fine(ish) at the level of numerics; the problems start when the consequences of that choice leak into the other parts of the language that care about equality. NaNs just shouldn't be considered "special enough to break the rules" (where the rules here are that the equivalence relation being used as the basis of equality for a general container should actually *be* an equivalence relation - reflexive, symmetric, and transitive).

    Anyway, more constructively ...

    I agree with the analysis, and the proposed solution seems sound: if we're going to change this, then using the object hash seems like a workable solution. The question is whether we actually do need to change this.

    I'm not too worried about backwards compatibility here: if we change this, we're bound to break *someone's* code somewhere, but it's hard to imagine that there's *that* much code out there that makes useful use of the property that hash(nan) == 0. The most likely source of breakage I can think of is in test code that checks that 3rd-party float-like things (NumPy's float64, or gmpy2 floats, or ...) behave like Python's floats.

    @cong Ma: do you have any examples of cases where this has proved, or is likely to prove, a problem in real-world code, or is this purely theoretical at this point?

    I'm finding it hard to imagine cases where a developer *intentionally* has a dictionary with several NaN values as keys. (Even a single NaN as a key seems a stretch; general floats as dictionary keys are rare in real-world code that I've encountered.) But it's somewhat easier to imagine situations where one could run into this accidentally. Here's one such:

    >> import collections, numpy as np
    >> x = np.full(100000, np.nan)
    >> c = collections.Counter(x)

    That took around 4 minutes of non-ctrl-C-interruptible time on my laptop. (I was actually expecting it not to "work" as a bad example: I was expecting that NaNs realised from a NumPy array would all be the same NaN object, but that's not what happens.) For comparison, this appears instant:

    >> x = np.random.rand(100000)
    >> c = collections.Counter(x)

    And it's not a stretch to imagine having a NumPy array representing a masked binary 1024x1024-pixel image (float values of NaN, 0.0 and 1.0) and using Counter to find out how many 0s and 1s there were.

    On the other hand, we've lived with the current behaviour for some decades now without it apparently ever being a real issue.

    On balance, I'm +1 for fixing this. It seems like a failure mode that would be rarely encountered, but it's quite unpleasant when it *is* encountered.

    @mdickinson
    Copy link
    Member

    Hmm. On second thoughts, the proposed solution wouldn't actually *help* with the situation I gave: the elements (lazily) realised from the NumPy array are highly likely to all end up with the same address in RAM. :-(

    >>> x = np.full(10, np.nan)
    >>> for v in x:
    ...     print(id(v))
    ...     del v
    ... 
    4601757008
    4601757008
    4601757008
    4601757008
    4601757008
    4601757008
    4601757008
    4601757008
    4601757008
    4601757008

    @mdickinson
    Copy link
    Member

    On third thoughts, of course it *would* help, because the Counter is keeping references to the realised NaN values. I think I'll go away now and come back when my brain is working.

    @congma
    Copy link
    Mannequin Author

    congma mannequin commented Mar 11, 2021

    Thank you @mark.dickinson for the detailed analysis.

    In addition to your hypothetical usage examples, I am also trying to understand the implications for user code.

    If judging by the issues people open on GitHub like this: pandas-dev/pandas#28363 yes apparently people do run into the "NaN as key" problem every now and then, I guess. (I'm not familiar enough with the poster's real problem in the Pandas setting to comment about that one, but it seems to suggest people do run into "real world" problems that shares some common features with this one). There's also StackOverflow threads like this (https://stackoverflow.com/a/17062985) where people discuss hashing a data table that explicitly use NaN for missing values. The reason seems to be that "[e]mpirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation." (https://arxiv.org/pdf/0902.2206.pdf).

    I cannot imagine whether the proposed change would make life easier for people who really want NaN keys to begin with. However I think it might reduce the exposure to worst-case performances in dict (and maybe set/frozenset), while not hurting Python's own self-consistency.

    Maybe there's one thing to consider about future compatibility though... because the proposed fix depends on the current behaviour that floats (and by extension, built-in float-like objects such as Decimal and complex) are not cached, unlike small ints and interned Unicode objects. So when you explicitly construct a NaN by calling float(), you always get a new Python object. Is this guaranteed now on different platforms with different compilers? I'm trying to parse the magic in Include/pymath.h about the definition of macro Py_NAN, where it seems to me that for certain configs it could be a static const union translation-unit-level constant. Does this affect the code that actually builds a Python object from an underlying NaN? (I apologize if this is a bad question). But if it doesn't and we're guaranteed to really have Python NaN-objects that don't alias if explicitly constructed, is this something unlikely to change in the future?

    I also wonder if there's security implication for servers that process user-submitted input, maybe running a float() constructor on some string list, checking exceptions but silently succeeding with "nan": arguably this is not going to be as common an concern as the string-key collision DoS now foiled by hash randomization, but I'm not entirely sure.

    On "theoretical interest".. yes theoretical interests can also be "real world" if one teaches CS theory in real world using Python, see https://bugs.python.org/issue43198#msg386849

    So yes, I admit we're in an obscure corner of Python here. It's a tricky, maybe-theoretical, but seemingly annoying but easy-to-fix issue..

    @congma
    Copy link
    Mannequin Author

    congma mannequin commented Mar 11, 2021

    Sorry, please ignore my rambling about "float() returning aliased object" -- in that case the problem with hashing doesn't arise.

    @mdickinson
    Copy link
    Member

    I also wonder if there's security implication for servers that process user-submitted input

    Yes, the "malicious actor" scenario is another one to consider. But unlike the string hashing attack, I'm not seeing a realistic way for the nan hash collisions to be used in attacks, and I'm content not to worry about that until someone gives an actual proof of concept. Many of Python's hash functions are fairly predictable (by design!) and there are already lots of other ways to deliberately construct lots of hash collisions with non-string non-float values.

    @rhettinger
    Copy link
    Contributor

    Idea: We could make this problem go away by making NaN a singleton.

    @mdickinson
    Copy link
    Member

    We could make this problem go away by making NaN a singleton.

    That could work, though we'd annoy anyone who cared about preserving NaN payloads and signs in Python. I don't know if such people exist. Currently the sign is accessible through things like math.copysign, and the payload through struct manipulations (though on most platforms we still don't see the full range of NaNs: signalling NaNs are quietly silenced on my machine).

    We'd also want to do some performance checks: the obvious way to do this would be to have an "is_nan" check in PyFloat_FromDouble. I'd *guess* that a typical CPU would do a reasonable job of branch prediction on that check so that it had minimal impact in normal use, but only timings will tell for sure.

    @mdickinson
    Copy link
    Member

    signalling NaNs are quietly silenced on my machine

    I just tested this assertion, and it appears that I was lying: this doesn't seem to be true. I'm almost *sure* it used to be true, and I'm not sure what changed, and when. (FWIW: Python 3.9.2 from macPorts on macOS 10.14.6 + Intel hardware. Earlier versions of Python on the same machine give similar results to those below, so it's not a core Python change.)

    Here's an attempt to create an snan Python float:

    Python 3.9.2 (default, Feb 28 2021, 13:47:30) 
    [Clang 10.0.1 (clang-1001.0.46.4)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import struct
    >>> snan_bits = 0x7ff0000000123456
    >>> snan = struct.unpack("<d", struct.pack("<Q", snan_bits))[0]
    >>> snan
    nan

    Converting back shows that the attempt was successful: the payload, including the signalling bit (bit 51, counting with the lsb as bit 0), was preserved:

    >>> hex(struct.unpack("<Q", struct.pack("<d", snan))[0])
    '0x7ff0000000123456'

    As expected, doing any arithmetic on the value converts the signalling nan to the corresponding quiet nan (so that the leading 16 bits become 0x7ff8 instead of 0x7ff0), but preserves the rest of the payload.

    >>> hex(struct.unpack("<Q", struct.pack("<d", 0.0 + snan))[0])
    '0x7ff8000000123456'

    @congma
    Copy link
    Mannequin Author

    congma mannequin commented Mar 13, 2021

    Idea: We could make this problem go away by making NaN a singleton.

    Apart from the fact that NaN is not a specific value/object (as pointed out in the previous message by @mark.dickinson), currently each builtin singleton (None, True, False, etc.) in Python satisfies the following predicate:

    if s is a singleton, then s == s

    This is also satisfied by "flyweight" objects such as small ints.

    It feels natural and unsurprising to have this, if not officially promised. Requiring NaN to be a singleton would violate this implicit understanding about singletons, and cause another surprise on top of the other surprises with NaNs (or with rich comparison in general).

    Performance-wise, I think the current behaviour (returning _PyHASH_NAN == 0) is already nested inside two layers of conditionals in _Py_HashDouble() in Python/pyhash.c:

        if (!Py_IS_FINITE(v)) {
            if (Py_IS_INFINITY(v))
                return v > 0 ? _PyHASH_INF : -_PyHASH_INF;
            else
                return _PyHASH_NAN;
        }
    

    (v is the underlying C double, field ob_fval of PyFloatObject).

    I don't feel performance would be hurt by rewriting float_hash() in Objects/floatobject.c to the effect of

        if (!Py_IS_NAN(v->ob_fval)) {
            return _Py_HashDouble(v->ob_fval)
        }
        else {
            return _Py_HashPointer(v);
        }
    

    and simplify the conditionals in _Py_HashDouble(). But of course, only real benchmarking could tell. (Also, other float-like types would have to be adjusted, too).

    @mdickinson
    Copy link
    Member

    @cong Ma: Yes, I'm not worried about performance for the change you're proposing (though as you say, we should benchmark anyway). The performance thoughts were motivated by the idea of making NaN a singleton: adding a check to PyFloat_FromDouble would mean that almost every operation that produced a float would have to pass through that check.

    @rhettinger
    Copy link
    Contributor

    The performance thoughts were motivated by the idea of
    making NaN a singleton: adding a check to
    PyFloat_FromDouble would mean that almost every operation
    that produced a float would have to pass through that check.

    It may suffice to move the check upstream from PyFloat_FromDouble so that float('NaN') alway produces identically the same object as math.nan.

    That would handle the common cases where NaN is used for missing values or is generated from string conversions. We don't need a bullet-proof solution, just mitigation of harm.

    @serhiy-storchaka
    Copy link
    Member

    What about Decimal NaN? Even if make float NaN a singleton, there will be other NaNs.

    And making float('nan') returning a singleton, but 1e1000 * 0 returning different NaN would cause large confusion.

    @rhettinger
    Copy link
    Contributor

    And making float('nan') returning a singleton,
    but 1e1000 * 0 returning different NaN would cause large confusion.

    Not really, it would be just be an implementation detail, no different than int and strings being sometimes unique and sometimes not. Historically, immutable objects are allowed to be reused when it is convenient for the implementation.

    What about Decimal NaN?

    Decimal isn't used as much, so the need is less pressing, but we can do whatever is allowed by the spec. Presumably, that would be easier than with floats because we control all possible ways to construct Decimals.

    @rhettinger
    Copy link
    Contributor

    Mark, is there any reason hash(float('NaN')) and hash(Decimal('NaN')) have to be a constant? Since NaNs don't compare equal, the hash homomorphism has no restrictions. Why not have hash() return the id() like we do for instances of object?

    I understand that sys.hash_info.nan would be invalidated, but that was likely useless anyway.

    @mdickinson
    Copy link
    Member

    Why not have hash() return the id() like we do for instances of object?

    I think that's fine, and IIUC that's what Cong Ma was proposing. It seems like the least invasive potential fix.

    In principle I find the idea of making NaN a singleton rather attractive - the performance hit is likely negligible, and it solves a bunch of subtle NaN-related issues. (For example, there's still a proposal to provide an IEEE 754 total_ordering key so that lists of floats can be ordered sanely in the presence of nans, but even if you use that you don't know whether your nans will end up at the start or the end of the list, and you could potentially have nans in both places; fixing a single nan and its sign bit would fix that.) But there are bound to be some people somewhere making use of the NaN payload and sign bit, however inadvisedly, and Serhiy's right that this couldn't be extended to Decimal, where the sign bit and payload of the NaN are mandated by the standard.

    @tim-one
    Copy link
    Member

    tim-one commented Apr 11, 2021

    I agree hashing a NaN acting like the generic object hash (return rotated address) is a fine workaround, although I'm not convinced it addresses a real-world problem ;-) But why not? It might.

    But that's for CPython. I'm loathe to guarantee anything about this in the language itself. If an implementation wants __contains__() tests to treat all NaNs as equal instead, or consider them equal only if "is" holds, or never considers them equal, or resolves equality as bitwise representation equality ... all are fine by me. There's no truly compelling case to made for any of them, although "never considers them equal" is least "surprising" given the insane requirement that NaNs never compare equal under IEEE rules, and that Python-the-language doesn't formally support different notions of equality in different contexts.

    @rhettinger
    Copy link
    Contributor

    I'm loathe to guarantee anything about this in the language itself.

    There aren't language any guarantees being proposed. Letting the hash depend on the object id just helps avoid quadratic behavior. Making float('NaN') a singleton is also perfectly reasonable behavior for an immutable type. Neither is a guaranteed behavior, just a convenient one.

    I'm not convinced it addresses a real-world problem

    Maybe yes, maybe no. I would hope that NaNs arising from bogus calculations would be rare. OTOH, they are commonly used for missing values in Pandas where internal dict/set operations abound. Either way, I would like to close off a trivially easy way to invoke quadratic behavior unexpectedly.

    @rhettinger
    Copy link
    Contributor

    New changeset a07da09 by Raymond Hettinger in branch 'master':
    bpo-43475: Fix worst case collision behavior for NaN instances (GH-25493)
    a07da09

    @realead
    Copy link
    Mannequin

    realead mannequin commented Jun 11, 2021

    This change makes life harder for people trying to get sane behavior with sets for float-, complex- and tuple-objects containing NaNs. With "sane" meaning, that

    set([nan, nan, nan]) => {nan}

    Until now, one has only to override the equal-comparison for these types but now also hash-function needs to be overriden (which is more complicated e.g. for tuple).

    ----------------------------------------------------------------

    On a more abstract level: there are multiple possible equivalence relations.

    The one used right now for floats is Py_EQ and results in float("nan)!=float("nan") due to IEEE 754.

    In hash-set/dict we would like to use an equivalence relation for which all NaNs would be in the same equivalence class. Maybe a new comparator is called for (like PY_EQ_FOR_HASH_COLLECTION), which would yield float("nan") equivalent to float("nan") and which should be used in hash-set/dict?

    @serhiy-storchaka
    Copy link
    Member

    There is an error in the Python implementation of Decimal.__hash__. It calls super().__hash__(), but the C implementation calls object.__hash__().

    Also, the documentation for floating point hash has the same error.

    @serhiy-storchaka serhiy-storchaka added the 3.11 only security fixes label Jun 11, 2021
    @serhiy-storchaka serhiy-storchaka added the 3.11 only security fixes label Jun 11, 2021
    @serhiy-storchaka
    Copy link
    Member

    New changeset 9f1c5f6 by Serhiy Storchaka in branch 'main':
    bpo-43475: Fix the Python implementation of hash of Decimal NaN (GH-26679)
    9f1c5f6

    @mdickinson
    Copy link
    Member

    @realead

    This change makes life harder for people trying to get sane behavior with sets [...]

    Yes, code that assumes that all NaNs have the same hash value will need to change. But that doesn't seem unreasonable for objects that are already considered distinct with respect to the containment equivalence relation ("is" followed by "=="). I think this change was the right one to make, and my expectation was that there would be few cases of breakage, and that for those few cases it shouldn't be difficult to fix the breakage. If that's not true (either there are lots of cases of breakage, or the breakage is hard to fix), perhaps we should reconsider. I haven't yet seen evidence that either of those things is true, though.

    Maybe a new comparator is called for [...]

    Agreed that that seems like a possible technical solution: Python could add a new special method __container_eq__ which is required to provide (at the least) a reflexive and symmetric relation, and whose default implementation does the same is-followed-by-== check that's already used for list, dict and set containment. For what it's worth, this is close to the approach that Julia takes: they have an "is_equal" function that's mostly the same as the normal equality == but differs for NaNs (and incidentally for signed zeros). But whether this would make sense from a language design perspective is another matter, and it would be a significant language-level change (certainly requiring a PEP). It feels like an enormous conceptual change to introduce to the language just to deal with the annoyance of NaNs. And that really is all it would be good for, unless perhaps it also provides a solution to the problem of NumPy arrays in containers, again caused by NumPy arrays overriding __eq__ in a nonstandard way.

    @mdickinson
    Copy link
    Member

    @serhiy: can this be closed again? Does #70866 need to be backported to the 3.10 branch?

    @serhiy-storchaka
    Copy link
    Member

    Maybe a new comparator is called for (like PY_EQ_FOR_HASH_COLLECTION), which would yield float("nan") equivalent to float("nan") and which should be used in hash-set/dict?

    It is difficult to do from technical point of view because all rich comparison implementations support currently only 6 operations. If you call tp_richcomp with something else it will crash. So we need to add new slot tp_richcomp_ex and complex logic to call either tp_richcomp_ex or tp_richcomp and correctly handle inheritance. It is too high bar.

    Does #70866 need to be backported to the 3.10 branch?

    I thought that the original change was 3.11 only, but seems it was merged before the split of the 3.10 branch. So PR 26679 needs to be backported. I would add a NEWS entry if I knew this.

    @serhiy-storchaka serhiy-storchaka added 3.10 only security fixes labels Jun 13, 2021
    @miss-islington
    Copy link
    Contributor

    New changeset 128899d by Miss Islington (bot) in branch '3.10':
    bpo-43475: Fix the Python implementation of hash of Decimal NaN (GH-26679)
    128899d

    @serhiy-storchaka
    Copy link
    Member

    Does this change need to be mentioned in What's New?

    @realead
    Copy link
    Mannequin

    realead mannequin commented Jun 13, 2021

    @mark.dickinson

    ...my expectation was that there would be few cases of breakage, and that for those few cases it shouldn't be difficult to fix the breakage.

    This expectation is probably correct.

    My issue is somewhat only partly on-topic here: If one wants to have all NaNs in one equivalency class (e.g. if used as a key-value for example in pandas) it is almost impossible to do so in a consistent way without taking a performance hit. It seems to be possible builtin-types (even if frozenset won't be pretty), but already something like

    class A:
        def __init__(self, a):
            self.a=a
        def __hash__(self):
            return hash(self.a)
        def __eq__(self, other):
            return self.a == other.a

    is not easy to handle.

    A special comparator for containers would be an ultimative solution, but I see how this could be too much hassle for a corner case.

    @rhettinger
    Copy link
    Contributor

    If one wants to have all NaNs in one equivalency class
    (e.g. if used as a key-value for example in pandas) it
    is almost impossible to do so in a consistent way
    without taking a performance hit.

    ISTM the performance of the equivalent class case is far less important than the one we were trying to solve. Given a choice we should prefer helping normal unadorned instances rather than giving preference to a subclass that redefines the usual behaviors.

    In CPython, it is a fact of life that overriding builtin behaviors with pure python code always incurs a performance hit. Also, in your example, the subclass isn't technically correct because it relies on a non-guaranteed implementation details. It likely isn't even the fastest approach.

    The only guaranteed behaviors are that math.isnan(x) reliably detects a NaN and that x!=x when x is a NaN. Those are the only assured tools in the uphill battle to fight the weird intrinsic nature of NaNs.

    So one possible solution is to replace all the NaNs with a canonical placeholder value that doesn't have undesired properties:

    {None if isnan(x) else x for x in arr}
    

    That relies on guaranteed behaviors and is reasonably fast. IMO that beats trying to reprogram float('NaN') to behave the opposite of how it was designed.

    @mdickinson
    Copy link
    Member

    Does this change need to be mentioned in What's New?

    Yes, I think so, given that it's a change to documented behavior. It's also something that third party packages (like NumPy) potentially need to be aware of.

    @mdickinson
    Copy link
    Member

    New changeset 1d10bf0 by Mark Dickinson in branch 'main':
    bpo-43475: Add what's new entry for NaN hash changes (GH-26725)
    1d10bf0

    @mdickinson
    Copy link
    Member

    New changeset 8d0b2ca by Mark Dickinson in branch '3.10':
    [3.10] bpo-43475: Add what's new entry for NaN hash changes (GH-26725) (GH-26743)
    8d0b2ca

    @mdickinson
    Copy link
    Member

    I think this can be closed now.

    @cwg
    Copy link
    Mannequin

    cwg mannequin commented Nov 24, 2021

    Hello. I would like to point out a possible problem that this change to CPython has introduced.

    This change looks innocent, but it breaks the unwritten rule that the hash value of a number (or actually any built-in immutable type!) in Python depends only on its value.

    Thus, before this change, it was possible to convert a tuple of floats into a numpy array and back into a tuple, and the hash values of both tuples would be equal. This is no longer the case.

    Or, more generally, any hashable tuple could be pickled and unpickled, without changing its hash value. I could well imagine that this breaks real code in subtle ways.

    Likewise, it is now no longer possible to provide a library of sequences of numbers that always hashes like built-in tuples. (As "tinyarray", of which I am the author, did.)

    @mdickinson
    Copy link
    Member

    @cwg: Yep, we're aware of this. There are no good solutions here - only a mass of constraints, compromises and trade-offs. I think we're already somewhere on the Pareto boundary of the "best we can do" given the constraints. Moving to another point on the boundary doesn't seem worth the code churn.

    What concrete action would you propose that the Python core devs take at this point?

    it was possible to convert a tuple of floats into a numpy array and back into a tuple, and the hash values of both tuples would be equal. This is no longer the case.

    Sure, but the problem isn't really with hash; that's just a detail. It lies deeper than that - it's with containment itself:

    >>> import numpy as np
    >>> import math
    >>> x = math.nan
    >>> some_list = [1.5, 2.3, x]
    >>> x in some_list
    True
    >>> x in list(np.array(some_list))  # expect True, get False
    False

    The result of the change linked to this PR is that the hash now also reflects that containment depends on object identity, not just object value. Reverting the change would solve the superficial hash problem, but not the underlying containment problem, and would re-introduce the performance issue that was fixed here.

    @mdickinson
    Copy link
    Member

    Just for fun: I gave a somewhat ranty 10-minute talk on this topic at a (virtual) conference a few months ago: https://www.youtube.com/watch?v=01oeosRVwgY

    @cwg
    Copy link
    Mannequin

    cwg mannequin commented Nov 24, 2021

    What concrete action would you propose that the Python core devs take at this point?

    Nothing for now.

    I stumbled across this issue through https://gitlab.kwant-project.org/kwant/tinyarray/-/issues/20 and had the impression that the aspect that I raised (that, for example, hash values of immutable built-in objects now no longer survive pickling) was not examined in this otherwise in-depth discussion. So I added it for reference.

    If problems come up that are caused by this change, I would consider reverting it a possible solution.

    The result of the change linked to this PR is that the hash now also reflects that containment depends on object identity, not just object value.

    This is a nice way to phrase it. Thanks for the link to the entertaining talk. :-)

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants