Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dict lookups fail if sizeof(Py_ssize_t) < sizeof(long) #44512

Closed
ked-tao mannequin opened this issue Jan 27, 2007 · 16 comments
Closed

Dict lookups fail if sizeof(Py_ssize_t) < sizeof(long) #44512

ked-tao mannequin opened this issue Jan 27, 2007 · 16 comments
Assignees
Labels
build The build process and cross-build interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@ked-tao
Copy link
Mannequin

ked-tao mannequin commented Jan 27, 2007

BPO 1646068
Nosy @tim-one, @loewis, @birkenfeld, @mdickinson, @abalkin, @pitrou
Superseder
  • bpo-9778: Make hash values the same width as a pointer (or Py_ssize_t)
  • Files
  • dict.diff: Suggested patch (against python 2.5 release)
  • issue1646068.diff: Patch for py3k branch revision 82889
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/abalkin'
    closed_at = <Date 2010-11-20.18:53:51.223>
    created_at = <Date 2007-01-27.18:23:44.000>
    labels = ['interpreter-core', 'type-bug', 'build']
    title = 'Dict lookups fail if sizeof(Py_ssize_t) < sizeof(long)'
    updated_at = <Date 2010-11-20.18:53:51.223>
    user = 'https://bugs.python.org/ked-tao'

    bugs.python.org fields:

    activity = <Date 2010-11-20.18:53:51.223>
    actor = 'belopolsky'
    assignee = 'belopolsky'
    closed = True
    closed_date = <Date 2010-11-20.18:53:51.223>
    closer = 'belopolsky'
    components = ['Build', 'Interpreter Core']
    creation = <Date 2007-01-27.18:23:44.000>
    creator = 'ked-tao'
    dependencies = []
    files = ['2286', '18004']
    hgrepos = []
    issue_num = 1646068
    keywords = ['patch']
    message_count = 16.0
    messages = ['31105', '31106', '31107', '31108', '31109', '31110', '61525', '110293', '110300', '110303', '110306', '110307', '110310', '110315', '110329', '119021']
    nosy_count = 8.0
    nosy_names = ['tim.peters', 'loewis', 'georg.brandl', 'jimjjewett', 'mark.dickinson', 'belopolsky', 'pitrou', 'ked-tao']
    pr_nums = []
    priority = 'high'
    resolution = 'out of date'
    stage = 'patch review'
    status = 'closed'
    superseder = '9778'
    type = 'behavior'
    url = 'https://bugs.python.org/issue1646068'
    versions = ['Python 3.2']

    @ked-tao
    Copy link
    Mannequin Author

    ked-tao mannequin commented Jan 27, 2007

    Portation problem.

    Include/dictobject.h defines PyDictEntry.me_hash as a Py_ssize_t. Everywhere else uses a C 'long' for hashes.

    On the system I'm porting to, ints and pointers (and ssize_t) are 32-bit, but longs and long longs are 64-bit. Therefore, the assignments to me_hash truncate the hash and subsequent lookups fail.

    I've changed the definition of me_hash to 'long' and (in Objects/dictobject.c) removed the casting from the various assignments and changed the definition of 'i' in dict_popitem(). This has fixed my immediate problems, but I guess I've just reintroduced whatever problem it got changed for. The comment in the header says:

    /* Cached hash code of me_key. Note that hash codes are C longs.

    • We have to use Py_ssize_t instead because dict_popitem() abuses
    • me_hash to hold a search finger.
      */

    ... but that doesn't really explain what it is about dict_popitem() that requires the different type.

    Thanks. Kev.

    @ked-tao ked-tao mannequin assigned tim-one Jan 27, 2007
    @ked-tao ked-tao mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Jan 27, 2007
    @ked-tao ked-tao mannequin assigned tim-one Jan 27, 2007
    @ked-tao ked-tao mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Jan 27, 2007
    @birkenfeld
    Copy link
    Member

    This is your code, Tim.

    @jimjjewett
    Copy link
    Mannequin

    jimjjewett mannequin commented Feb 2, 2007

    The whole point of a hash is that if it doesn't match, you can skip an expensive comparison. How big to make the hash is a tradeoff between how much you'll waste calculating and storing it vs how often it will save a "real" comparison.

    The comment means that, as an implementation detail, popitem assumes it can store a pointer there instead, so hashes need to be at least as big as a pointer.

    Going to the larger of the two sizes will certainly solve your problem; it just wastes some space, and maybe some time calculating the hash.

    If you want to get that space back, just make sure the truncation is correct and consistent. I *suspect* your problem is that when there is a collision, either

    (1) It is comparing a truncated value to an untruncated value, or
    (2) The perturbation to find the next slot is going wrong, because of when the truncation happens.

    @ked-tao
    Copy link
    Mannequin Author

    ked-tao mannequin commented Feb 4, 2007

    Hi Jim. I understand what the problem is (perhaps I didn't state it clearly enough) - me_hash is a cache of the dict item's hash which is compared against the hash of the object being looked up before going any further with expensive richer comparisons. On my system, me_hash is a 32-bit quantity but hashes in general are declared 'long' which is a 64-bit quantity. Therefore for any object whose hash has any of the top 32 bits set, a dict lookup will fail as it will never get past that first check (regardless of why that slot is being checked - it has nothing to do with the perturbation to find the next slot).

    The deal is that my system is basically a 32-bit system (sizeof(int) == sizeof(void *) == 4, and therefore ssize_t is not unreasonably also 32-bit), but C longs are 64-bit.

    You say "popitem assumes it can store a pointer there", but AFAICS it's just storing an _index_, not a pointer. I was concerned that making that index a 64-bit quantity might tickle some subtlety in the code, thinking that perhaps it was changed from 'long' to 'Py_ssize_t' because it had to be 32-bit for some reason. However, it seems much more likely that it was defined like that to be more correct on a system with 64-bit addressing and 32-bit longs (which would be more common). With that in mind, I've attached a suggested patch which selects a reasonable type according to the SIZEOF_ configuration defines.

    WRT forcing the hashes 32-bit to "save space and time" - that means inventing a 'Py_hash_t' type and going through the entire python source looking for 'long's that might be used to store/calculate hashes. I think I'll pass on that ;)

    Regards, Kev.
    File Added: dict.diff

    @jimjjewett
    Copy link
    Mannequin

    jimjjewett mannequin commented Feb 4, 2007

    Yes, I'm curious about what system this is ... is it a characteristic of the whole system, or a compiler choice to get longer ints?

    As to using a Py_hash_t -- it probably wouldn't be as bad as you think. You might get away with just masking it to throw away the high order bits in dict and set. (That might not work with perturbation.)

    Even if you have to change it everywhere at the source, then there is some prior art (from when hash was allowed to be a python long), and it is almost certainly limited to methods with "hash" in the name which generate a hash. (eq/ne on the same objects may use the hash.) Consumers of hash really are limited to dict and derivatives. I think dict, set, and defaultdict may be the full list for the default distribution.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Feb 6, 2007

    ked-tao: as for "doesn't really explain", please take a look at this comment:

        /* Set ep to "the first" dict entry with a value.  We abuse the hash
         * field of slot 0 to hold a search finger:
         * If slot 0 has a value, use slot 0.
         * Else slot 0 is being used to hold a search finger,
         * and we use its hash value as the first index to look.
         */
    

    So .popitem first returns (and removes) the item in slot 0. Afterwards, it does a
    linear scan through the dictionary, returning one item at a time. To avoid
    re-scanning the emptying dictionary over and over again, the me_hash
    value of slot 0 indicates the place to start searching when the next .popitem
    call is made. Of course, this value may start out bogus and out-of-range,
    or may become out-of-range if the dictionary shrinks; in that case, it
    starts over at index 1. If it is bogus (i.e. never set as a search finger)
    and in-range, that's fine: it will just start searching for a non-empty
    slot at me_hash.

    Because it is a slot number, me_hash must be large enough to hold a
    Py_ssize_t. On some systems (Win64 in particular), long is not large
    enough to hold Py_ssize_t.

    I believe the proposed patch is fine.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 22, 2008

    The patch looks fine, but why isn't an union used instead of trying to
    figure out the longest of both types?

    @devdanzin devdanzin mannequin added build The build process and cross-build type-feature A feature request or enhancement type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels Mar 30, 2009
    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Jul 14, 2010

    As this is a small patch about which there is one statement from Martin that says "I believe the proposed patch is fine", there is only one query from Antoine, and because the issue discussed refers to problems with 32 and 64 bit sizes, could someone with the relevant knowledge please take a look with a view to moving this forward.

    @abalkin
    Copy link
    Member

    abalkin commented Jul 14, 2010

    Responding to Antoine question, I don't understand how you would use a union here. Certainly you cannot define Py_dicthashcache_t as a union of long and Py_ssize_t because it will not be able to easily assign long or Py_ssize_t values to it. I don't think ANSI C allows a cast from integer type to a union.

    I am OK with the patch, but if this goes into 2.7/3.x, I think the same change should be applied to the set type.

    @abalkin
    Copy link
    Member

    abalkin commented Jul 14, 2010

    On the second thought, this comment:

    • /* Cached hash code of me_key. Note that hash codes are C longs.
      • We have to use Py_ssize_t instead because dict_popitem() abuses
      • me_hash to hold a search finger.
    • */

    suggests that a union may be appropriate here. I am not sure of the standards standing of anonymous unions, but if we could do

    union {
      Py_ssize_t me_finger;
      long me_hash;
    };

    it would cleanly solve the problem. If anonymous unions are not available, a regular union could also do the trick:

    union {
      Py_ssize_t finger;
      long hash;
    } me;

    and use me.finger where me is used as search finger and me.hash where it stores hash. Less clever naming scheme would be welcome, though.

    @abalkin
    Copy link
    Member

    abalkin commented Jul 14, 2010

    Please ignore my comment about set type. Sets don't have this issue.

    @abalkin
    Copy link
    Member

    abalkin commented Jul 14, 2010

    Yet it looks like set has a bigger problem whenever sizeof(Py_ssize_t) > sizeof(long). setentry.hash is defined as long, but set_pop() stores a Py_ssize_t index in it.

    @abalkin
    Copy link
    Member

    abalkin commented Jul 14, 2010

    I am attaching a patch that uses a regular union of long and Py_ssize_t to store cached hash/index value in both set and dict entry. Using an anonymous union would simplify the patch and would reduce the likelihood of breaking extensions that access entry structs.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Jul 14, 2010

    If this goes in, it can't go into bug fix releases, as it may break the ABI.

    @mdickinson
    Copy link
    Member

    The approach in Alexander's patch looks fine to me. +1 on applying this patch if someone can test it on a platform where sizeof(long) > sizeof(Py_ssize_t), and also verify that there are current test failures that are fixed by this patch. (If there are no such tests, we should add some.)

    I've only tested this on machines with sizeof(long) == sizeof(Py_ssize_t) == 4 and sizeof(long) == sizeof(Py_ssize_t) == 8, which isn't really much of a test at all.

    ked-tao, if you're still listening: are you in a position to test Alexander's patch on a relevant machine?

    @abalkin
    Copy link
    Member

    abalkin commented Oct 18, 2010

    Issue bpo-9778 makes this out of date.

    @abalkin abalkin assigned abalkin and unassigned tim-one Oct 18, 2010
    @abalkin abalkin assigned abalkin and unassigned tim-one Oct 18, 2010
    @abalkin abalkin closed this as completed Nov 20, 2010
    @abalkin abalkin closed this as completed Nov 20, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    build The build process and cross-build interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants