
Hash function is not randomized properly #58826

Closed
VladoBoza mannequin opened this issue Apr 19, 2012 · 96 comments
Assignee: tiran
Labels: type-security (a security issue)

Comments


VladoBoza mannequin commented Apr 19, 2012

BPO 14621
Nosy @malemburg, @arigo, @gpshead, @mdickinson, @ncoghlan, @vstinner, @tiran, @benjaminp, @alex, @davidmalcolm, @PaulMcMillan, @serhiy-storchaka, @phmc, @dstufft
Files
  • find_hash_collision.py
  • hash.py
  • hash-attack-3.patch
  • _Py_Hash_Sip24.S
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/tiran'
    closed_at = <Date 2013-12-10.00:34:45.195>
    created_at = <Date 2012-04-19.17:58:09.068>
    labels = ['type-security']
    title = 'Hash function is not randomized properly'
    updated_at = <Date 2016-04-22.09:23:46.787>
    user = 'https://bugs.python.org/VladoBoza'

    bugs.python.org fields:

    activity = <Date 2016-04-22.09:23:46.787>
    actor = 'serhiy.storchaka'
    assignee = 'christian.heimes'
    closed = True
    closed_date = <Date 2013-12-10.00:34:45.195>
    closer = 'benjamin.peterson'
    components = []
    creation = <Date 2012-04-19.17:58:09.068>
    creator = 'Vlado.Boza'
    dependencies = []
    files = ['25281', '25375', '27917', '27919']
    hgrepos = ['159']
    issue_num = 14621
    keywords = ['patch']
    message_count = 89.0
    nosy_count = 28.0
    nosy_names = ['lemburg', 'arigo', 'gregory.p.smith', 'mark.dickinson', 'ncoghlan', 'vstinner', 'christian.heimes', 'benjamin.peterson', 'iElectric', 'Arfrever', 'alex', 'cvrebert', 'dmalcolm', 'Giovanni.Bajo', 'PaulMcMillan', 'serhiy.storchaka', 'bkabrda', 'Vlado.Boza', 'koniiiik', 'sbermeister', 'camara', 'pconnell', '\xc5\x81ukasz.Rekucki', 'ReneSac', 'Bob.Ziuchkovski', 'dstufft', 'isoschiz', 'editor-buzzfeed']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'security'
    url = 'https://bugs.python.org/issue14621'
    versions = ['Python 2.7', 'Python 3.3']


    VladoBoza mannequin commented Apr 19, 2012

The fix for http://bugs.python.org/issue13703 is broken.

tl;dr: There are only 256 different hash functions (compare that with the size of the _Py_HashSecret prefix and suffix), and whether two keys collide or not depends only on the last 8 bits of the prefix.

The problem with the current randomization of the hash function is the following:
The suffix does not influence whether two keys have the same hash (it is xor-ed in after everything else).
Everything except the last 8 bits of the prefix has no influence either. Try adding 0x474200 to the prefix and see what happens (it will add 0x474200 to the resulting hash).

To mount a DoS attack, an attacker must do the following:
Generate a set of colliding keys for each of the 256 possible values of the last 8 bits. Try each of these sets; one will have a significantly bigger response time, and then keep repeating that one.

    @VladoBoza VladoBoza mannequin added the type-security A security issue label Apr 19, 2012

    VladoBoza mannequin commented Apr 19, 2012

E.g. these strings collide for every prefix ending in 0xcd:
    0x27fd5a18, 0x26fe78fa


    davidmalcolm commented Apr 19, 2012

    Thanks for filing this bug report.

    I'm not seeing the equal hashes you describe.

    I'm using this recipe to hardcode a specific prefix and print the hashes using it:
    $ gdb --eval-command="break _PyRandom_Init" --eval-command="run" --eval-command="print _Py_HashSecret" --eval-command="set _Py_HashSecret.prefix=0xcdcdcdcd" --eval-command="print _Py_HashSecret" --eval-command="continue" -eval-command="continue" --args python -c "a='\x27\xfd\x5a\x18'; b='\x26\xfe\x78\xfa'; print(hash(a)); print(hash(b))"

    On a 32-bit build of Python 2.7.3 (i686), if I set _Py_HashSecret.prefix=0xcdcdcdcd, I get non-equal hashes for the data you specify (output trimmed somewhat for conciseness):

    $1 = {prefix = 0, suffix = 0}
    $2 = {prefix = -842150451, suffix = 0}
    Continuing.
    -121255142
    -1199906326

    Similarly, on a 64-bit build of Python 2.7.3 (x86_64), I get non-equal hashes:
    $1 = {prefix = 0, suffix = 0}
    $2 = {prefix = 3452816845, suffix = 0}
    -3992804574342296806
    -8147489705433570838

    Did I misunderstand the report? Thanks.


    vstinner commented Apr 19, 2012

    I don't understand this issue: can you write a short script to test a collision?

    "E.g this strings collide for every prefix ending on 0xcd"

    Do you mean that prefix & 0xff == 0xcd?

    "0x27fd5a18, 0x26fe78fa"

Is it a byte string or a Unicode string? b'\x27\xfd\x5a\x18' and b'\x26\xfe\x78\xfa'?

    --

    Using PYTHONHASHSEED environment variable, it's easy to find two values generating the same _Py_HashSecret. Just one example:

    PYTHONHASHSEED=3035016679:
    * _Py_HashSecret = {0xcd5192eff3fd4d58, 0x3926b1431b200720}
    PYTHONHASHSEED=4108758503:
    *  _Py_HashSecret = {0xcd5192eff3fd4d58, 0x3926b1431b200720}

    --

I wrote find_hash_collision.py to try to compute a collision, but the program fails with:
---
Fail to generate a new seed!
# seeds = 65298
---
So it fails to generate a new random seed after testing 65298 different seeds. I ran the script with a function generating seeds whose prefix ends with the 0xCD byte.

See the attached program: it generates a random seed. Uncomment "seed = generate_seed_0xCD()" if the prefix must end with the 0xCD byte.


    VladoBoza mannequin commented Apr 19, 2012

My bad (I checked only the function in C++, not the result in Python).
This should work on 32-bit:
Prefix: anything ending in 0x00
Suffix: anything
Strings: "\x00\xcf\x0b\x96\x19", "\x00\x6d\x29\x45\x18"


    VladoBoza mannequin commented Apr 19, 2012

For example, take this script (on 32-bit):

ha = hash("\x00\xcf\x0b\x96\x19")
hb = hash("\x00\x6d\x29\x45\x18")
if ha == hb:
    print "collision"

And run the following:

for i in $(seq 0 25); do echo $i; for j in $(seq 0 100); do ./python -R x.py; done; done

It gives a collision too many times (around 9 out of 2500 runs).


    davidmalcolm commented Apr 19, 2012

    $ gdb --eval-command="break _PyRandom_Init" --eval-command="run" --eval-command="print _Py_HashSecret" --eval-command="set _Py_HashSecret.prefix=0xcdcdcd00" --eval-command="print _Py_HashSecret" --eval-command="continue" -eval-command="continue" --args python -c 'a="\x00\xcf\x0b\x96\x19"; b="\x00\x6d\x29\x45\x18"; print(hash(a)); print(hash(b))'

    On 32-bit:
    $1 = {prefix = 0, suffix = 0}
    $2 = {prefix = -842150656, suffix = 0}
    1220138288
    1220138288

    On 64-bit:
    $1 = {prefix = 0, suffix = 0}
    $2 = {prefix = 3452816640, suffix = 0}
    Continuing.
    4087671194599937328
    -1679444439011306192


    vstinner commented Apr 20, 2012

    For example take this script (on 32bit): (...)
It gives a collision too many times (around 9 out of 2500).

    I tried this script on Linux 32 bits and Linux 64 bits: I didn't see any collision. What is your operating system and the version of your operating system please?


    koniiiik mannequin commented Apr 20, 2012

    @dmalcolm:
    As for the gdb example, you need to add --eval-command="set _Py_HashSecret_Initialized=1", otherwise _Py_HashSecret will get overwritten immediately after it is set by gdb, either to 0 if run without the -R switch, or to a random value.

    With the fresh pair of values Vlado provided, I managed to reproduce the collisions on Python 2.7.3, i686 (output trimmed like you did):

    for __PREFIX in 0x0 0x4700 0xdead00 0xcafe00; do gdb --eval-command="break _PyRandom_Init" --eval-command="run" --eval-command="print _Py_HashSecret" --eval-command="set _Py_HashSecret.prefix=${__PREFIX}" --eval-command="set _Py_HashSecret_Initialized=1" --eval-command="print _Py_HashSecret" --eval-command="continue" -eval-command="continue" --args ./python -c "a='\x00\xcf\x0b\x96\x19'; b='\x00\x6d\x29\x45\x18'; print(hash(a)); print(hash(b))"; done

    $1 = {prefix = 0, suffix = 0}
    $2 = {prefix = 0, suffix = 0}
    Continuing.
    1220138288
    1220138288

    $1 = {prefix = 0, suffix = 0}
    $2 = {prefix = 18176, suffix = 0}
    Continuing.
    -1483212240
    -1483212240

    $1 = {prefix = 0, suffix = 0}
    $2 = {prefix = 14593280, suffix = 0}
    Continuing.
    -972665808
    -972665808

    $1 = {prefix = 0, suffix = 0}
    $2 = {prefix = 13303296, suffix = 0}
    Continuing.
    1003122480
    1003122480
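The gdb sessions above can be reproduced without gdb. The sketch below reimplements the randomized 32-bit string hash of Python 2.7.3 in pure Python (constants follow stringobject.c; `hash32` is a name chosen here for illustration) and confirms that the collision depends only on the prefix ending in 0x00, while the suffix cannot matter:

```python
def hash32(data, prefix, suffix):
    """Pure-Python model of CPython 2.7.3's randomized 32-bit string hash."""
    mask = 0xFFFFFFFF
    x = prefix & mask
    x ^= (data[0] << 7) & mask
    for c in data:
        x = ((1000003 * x) ^ c) & mask
    x ^= len(data)
    x ^= suffix & mask
    return x

a = b"\x00\xcf\x0b\x96\x19"
b = b"\x00\x6d\x29\x45\x18"

# Each prefix tested in the gdb sessions above ends in 0x00, and each collides.
for prefix in (0x0, 0x4700, 0xdead00, 0xcafe00):
    assert hash32(a, prefix, 0) == hash32(b, prefix, 0)

# The suffix is xor-ed in last, so it can never separate colliding keys.
assert hash32(a, 0, 0x12345678) == hash32(b, 0, 0x12345678)
```

With prefix and suffix 0 the model reproduces the 32-bit value printed in the gdb session for both strings.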


    VladoBoza mannequin commented Apr 20, 2012

I tried this script on Linux 32 bits and Linux 64 bits: I didn't see any collision. What is your operating system and the version of your operating system please?

    uname -a
    Linux 3.0.0-17-generic #30-Ubuntu SMP Thu Mar 8 20:45:39 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Anyway, you should be able to do the following (on 32 bits):

    • generate two colliding keys (with some random seed)
    • try these keys with different random seeds and they will collide ~1 out of 256 times


    vstinner commented Apr 20, 2012

    hash.py: Python implementation of the 32-bit version of hash(bytes). Ok, I now see that only the lower 8 bits are really mixed with the input string.


    VladoBoza mannequin commented Apr 20, 2012

    One possible fix:
    Look for StringHasher in google v8 code (http://code.google.com/p/v8/source/search?q=stringhasher&origq=stringhasher&btnG=Search+Trunk). Main loop looks like this:
    raw_running_hash_ += c;
    raw_running_hash_ += (raw_running_hash_ << 10);
    raw_running_hash_ ^= (raw_running_hash_ >> 6);

    It seems not to have same collisions with many different hash seeds.
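The loop quoted above is the core of Jenkins's one-at-a-time hash, on which V8's StringHasher is based. A pure-Python sketch for reference (mixing the seed in as the initial running state, and the three final avalanche steps, follow the published one-at-a-time algorithm; V8's exact details may differ):

```python
def one_at_a_time(data, seed=0):
    """Jenkins's one-at-a-time hash (32-bit), with the seed as initial state."""
    mask = 0xFFFFFFFF
    h = seed & mask
    for c in data:
        h = (h + c) & mask
        h = (h + (h << 10)) & mask
        h ^= h >> 6
    # Final avalanche, as in the published one-at-a-time hash.
    h = (h + (h << 3)) & mask
    h ^= h >> 11
    h = (h + (h << 15)) & mask
    return h
```

Because every input byte is both added and then shifted back into the state twice, a one-bit input change spreads across the word far faster than in a single multiply-xor step.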


    vstinner commented Apr 26, 2012

    One possible fix: ...
    Main loop looks like this: ..

Well, it was decided not to hurt performance to work around one specific class of attack, since there are many other ways to DoS Python. So we chose to keep the main loop unchanged. Randomizing the hash is not a fix for the hash DoS; it only makes the attack harder.


    vstinner commented Apr 26, 2012

    Oops, I attached the wrong "hash.py" file.


    benjaminp commented Apr 26, 2012

    We should see if more security can be gained without sacrificing speed.


    vstinner commented Apr 26, 2012

Problem with current randomization of hash function
is following: Suffix does not influence whether two keys
have the same hash or not (it is xor-ed after everything).

    Yes, the suffix is used to "protect" the secret. Without the suffix, it would be too simple to compute the prefix: getting a single hash value of a known string would leak the prefix.

Suffix does not influence whether two keys have the same hash
    or not (...). Everything except last 8 bits in prefix does
    not influence it also.

    I don't know if we can do better and/or if it is a critical issue.


    tiran commented Oct 17, 2012

    I've modified unicodeobject's unicode_hash() function. V8's algorithm is about 55% slower for a 800 MB ASCII string on my box.

    Python's current hash algorithm for bytes and unicode:

    while (--len >= 0)
        x = (_PyHASH_MULTIPLIER * x) ^ (Py_uhash_t) *P++;

    $ ./python -m timeit -s "t = 'abcdefgh' * int(1E8)" "hash(t)"
    10 loops, best of 3: 94.1 msec per loop

    V8's algorithm:

        while (--len >= 0) {
            x += (Py_uhash_t) *P++;
            x += ((x + (Py_uhash_t)len) << 10);
            x ^= (x >> 6);
        }
    $ ./python -m timeit -s "t = 'abcdefgh' * int(1E8)" "hash(t)"
    10 loops, best of 3: 164 msec per loop


    arigo mannequin commented Oct 21, 2012

    Just to make it extra clear: Vlado showed that the "-R" switch of Python can easily be made fully pointless, with only a bit of extra work. Here is how:

    • Assume you have an algo that gives you as many strings with colliding hashes as you want, provided you know the last 8 bits of the secret prefix.

    • Say you want to attack a web server. You send it 256 requests, each with 100 strings that have identical hash for one of the 256 possible values. You measure which one is significantly slower than the others.

    • From there you are back in the original situation: you know which of the 256 values to pick, so you can make the web server crawl by sending it a large number of strings that have identical hashes for this particular value.

    It's interesting to note how this whole -R discussion made very long threads on python-dev, and python-dev has subsequently ignored (for the past 6 months!) the fact that their "fix" can be worked around in a matter of minutes.

    (For information, I'm sure that if the algorithm is improved to depend on all 32 or 64 bits of the prefix, it would still be easy to crack it. You don't actually need to send 2**32 or 2**64 requests to the web server: with careful design you can send only 32 or 64 requests that each leak one bit of information. Doing that requires a bit more research, but once the recipe is known, it can be freely reused, which seems to defeat the original point.)


    arigo mannequin commented Oct 21, 2012

    For reference, the above means that we can implement -R support for PyPy as a dummy ignored flag, and get "security" that is very close to CPython's. :-)

    @arigo arigo mannequin added the easy label Oct 21, 2012

    benjaminp commented Oct 21, 2012

That doesn't make it an easy CPython issue. :)

    @benjaminp benjaminp removed the easy label Oct 21, 2012

    tiran commented Oct 21, 2012

    As far as my understanding goes the issue can't be solved with our current hash algorithm. We'd have to use a crypto hash function or at least a hash algorithm that has an increased avalanche effect on the outcome. The current hash algorithm is designed and optimized for speed and not for security. Any other algorithm is going to slow down hashing.

Small strings and strings with lots of NUL bytes may leak too much information, too.
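The "avalanche effect" mentioned above can be made concrete with a toy measurement (an illustrative sketch: `fnv_style` mirrors the unrandomized 32-bit multiply-xor loop discussed in this thread, `avalanche` is a helper invented here, and a hash with good diffusion scores near 0.5):

```python
import random

MASK32 = 0xFFFFFFFF

def fnv_style(data):
    """The 32-bit multiply-xor string hash (no secret, for measurement only)."""
    if not data:
        return 0
    x = (data[0] << 7) & MASK32
    for c in data:
        x = ((1000003 * x) ^ c) & MASK32
    return x ^ len(data)

def avalanche(hash_fn, trials=500, nbytes=8):
    """Average fraction of the 32 output bits flipped by one input-bit flip."""
    rng = random.Random(42)
    flipped = 0
    for _ in range(trials):
        data = bytearray(rng.randrange(256) for _ in range(nbytes))
        h1 = hash_fn(bytes(data))
        bit = rng.randrange(nbytes * 8)
        data[bit // 8] ^= 1 << (bit % 8)     # flip one random input bit
        h2 = hash_fn(bytes(data))
        flipped += bin(h1 ^ h2).count("1")
    return flipped / (trials * 32)
```

In the multiply-xor loop a difference can only propagate toward higher bits (and a flip in the last input byte changes exactly one output bit), so the score stays well below the ideal 0.5.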


    vstinner commented Oct 21, 2012

    It's interesting to note how this whole -R discussion made very long
    threads on python-dev, and python-dev has subsequently ignored (for the
    past 6 months!) the fact that their "fix" can be worked around in a matter
    of minutes.

No, this issue has not been ignored. Nobody proposed anything to fix this
issue, but we are still working on it (sometimes in private).

In my opinion, we cannot solve this issue without slowing down Python. Or at
least I don't yet know of a fast and secure hash algorithm. I don't really
want to slow down Python for one specific issue when there are so many other
ways to DoS a (web) server.

    How do other languages (say Perl, Ruby, PHP, Javascript) handle this issue?


    serhiy-storchaka commented Oct 21, 2012

    $ ./python -m timeit -s "t = 'abcdefgh' * int(1E8)" "hash(t)"

I got different numbers (32-bit Linux, AMD Athlon 64 X2 4600+).

    Python's current hash algorithm:
    10 loops, best of 3: 343 msec per loop

    V8's algorithm:
    10 loops, best of 3: 244 msec per loop


    malemburg commented Oct 22, 2012

    On 21.10.2012 23:42, STINNER Victor wrote:

    STINNER Victor added the comment:

    > It's interesting to note how this whole -R discussion made very long
    threads on python-dev, and python-dev has subsequently ignored (for the
    past 6 months!) the fact that their "fix" can be worked around in a matter
    of minutes.

No, this issue has not been ignored. Nobody proposed anything to fix this
issue, but we are still working on it (sometimes in private).

In my opinion, we cannot solve this issue without slowing down Python. Or at
least I don't yet know of a fast and secure hash algorithm. I don't really
want to slow down Python for one specific issue when there are so many other
ways to DoS a (web) server.

Well, I did propose a different approach to the whole problem: counting
collisions. That would have avoided the usability issues you
have with the randomization approach, made it possible for the
application to detect the attack, and not have introduced any significant
runtime overhead for applications not being attacked.

    The proposal was shot down with the argument that it wouldn't
    fix the problem.

    It should also be noted that the randomization only applies to
    strings/bytes, dictionaries with other colliding keys are not protected
    at all.

Perhaps it's time to revisit the collision counting idea?

    It would work in much the same way as the stack recursion limit
    we have in Python.

    --
    Marc-Andre Lemburg
    eGenix.com

    Professional Python Services directly from the Source  (#1, Oct 22 2012)
    >>> Python Projects, Consulting and Support ...   http://www.egenix.com/
    >>> mxODBC.Zope/Plone.Database.Adapter ...       http://zope.egenix.com/
    >>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
    ________________________________________________________________________
    2012-09-27: Released eGenix PyRun 1.1.0 ...       http://egenix.com/go35
    2012-09-26: Released mxODBC.Connect 2.0.1 ...     http://egenix.com/go34
    2012-09-25: Released mxODBC 3.2.1 ...             http://egenix.com/go33
    2012-10-23: Python Meeting Duesseldorf ...                      tomorrow

    eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
    Registered at Amtsgericht Duesseldorf: HRB 46611
    http://www.egenix.com/company/contact/


    arigo mannequin commented Nov 6, 2012

Benjamin: oops, sorry. I don't remember setting the "easy" keyword, my mistake.

    Fwiw I'm +1 on Marc-Andre's solution. Make it a tunable setting, e.g. with sys.setcollisionlimit(). Defaults to sys.maxint on existing Pythons and some smaller value (70?) on new Pythons. It has the same benefits as the recursion limit: it's theoretically bad, but most of the time very useful.

    It would also crash on bad usages of custom __hash__() methods: e.g. if you put a lot of keys in a dict, all with a custom __hash__() that returns 42. I imagine that it can be considered a good thing to raise in this case rather than silently degrade performance forever.
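The degenerate __hash__ case is easy to reproduce today: every key below hashes to 42, so the dict stays correct, but each insertion has to probe past all earlier entries, which is quadratic overall (an illustrative sketch):

```python
class Bad:
    """Key whose hash is always 42: the worst case for a hash table."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 42
    def __eq__(self, other):
        return isinstance(other, Bad) and self.n == other.n

# The result is correct, but building it probes one ever-growing
# collision chain, so the cost grows quadratically with the key count.
d = {Bad(i): i for i in range(100)}
assert len(d) == 100 and d[Bad(7)] == 7
```

A collision limit would turn this silent performance degradation into an explicit error, as suggested above.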


    benjaminp commented Nov 6, 2012

    Here's the message that helped convince us to go against collision counting originally: http://mail.python.org/pipermail/python-dev/2012-January/115726.html


    camara mannequin commented Nov 6, 2012

    How about using a secure hash algorithm that's implemented in HW when available. It doesn't eliminate the issue on systems that lack this support but at least it limits the scope of the problem.

    Of course some testing would need to be done to make sure the hardware hashing doesn't have a significant impact on performance.


    serhiy-storchaka commented Nov 30, 2012

    René, a balanced tree requires rebalancing on every (or almost every) item for some special (sorted) data sequences.


    koniiiik mannequin commented Nov 30, 2012

    On Fri, Nov 30, 2012 at 08:06:32PM +0000, Serhiy Storchaka wrote:

    René, a balanced tree requires rebalancing on every (or almost
    every) item for some special (sorted) data sequences.

    That's perfectly true and it holds for most unsorted sequences as
    well -- that's why they are balanced. The fact that the tree is always
    balanced guarantees that each rebalance takes at most O(log n)
    operations.


    malemburg commented Nov 30, 2012

    On 30.11.2012 21:06, Serhiy Storchaka wrote:

    Serhiy Storchaka added the comment:

    René, a balanced tree requires rebalancing on every (or almost every) item for some special (sorted) data sequences.

    Sure, but that's still O(N*log N) for an attack scenario, not O(N^2).

    I think the main point here is that using hash tables or dictionaries
    for these things without any size limit is simply a wrong development
    approach.

    Developers need to be made aware of the problem and given data
    structures that they can use more safely to store the data and
    instead of trying to hide away the problem using some crypto
    approach, it's better to offer methods that allow developers to
    gain control over the problem (e.g. via an exception) rather than
    hoping for few hash collisions.

    If a developer has to build a lookup table from untrusted data,
    she should be able to say:

try:
    mapping = insert_untrusted_data(source)
except UnderAttackError:
    return 'no thank you'

    instead of:

    # Hmm, let's hope this doesn't bomb...
    mapping = insert_untrusted_data(source)

    At the moment, the best thing you can do is insert the data
    in chunks and measure the time it takes for each chunk. If it
    takes too long, you raise the UnderAttackError.

    If we make the collision counting limit a per-dictionary adjustable
    limit with some global default limit, the above could be written
    as:

try:
    mapping = {}
    mapping.max_collisions = 100
    mapping.update(source)
except CollisionLimitError:
    return 'no thank you'

Aside: It's better to know your worst case and program for it, rather
than to ignore the problem and hope an attacker won't find your secret
key. In the above case, if you know what the worst-case timing is for
a 100-item dictionary, you can safely deal with it, possibly adjusting
the limit to a more suitable value for your application.

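A per-dictionary collision limit of the kind proposed above can be sketched at the Python level with a toy separate-chaining table (CollisionLimitError and max_collisions mirror the names in the proposal; everything else here is hypothetical, and CPython's real dict uses open addressing in C):

```python
class CollisionLimitError(Exception):
    """Raised when one bucket chain exceeds the configured limit."""

class LimitedDict:
    """Toy separate-chaining hash table with a per-instance collision limit."""

    def __init__(self, nbuckets=64, max_collisions=100):
        self.max_collisions = max_collisions
        self.buckets = [[] for _ in range(nbuckets)]

    def __setitem__(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)      # overwrite existing key
                return
        if len(bucket) >= self.max_collisions:
            raise CollisionLimitError("too many collisions in one bucket")
        bucket.append((key, value))

    def __getitem__(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        raise KeyError(key)

    def update(self, source):
        for k, v in dict(source).items():
            self[k] = v
```

With max_collisions=100, an attacker feeding keys that all land in one bucket triggers CollisionLimitError after 100 entries instead of silently degrading every later operation.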


    ReneSac mannequin commented Nov 30, 2012

Serhiy Storchaka: Yes, but it is still O(log n) worst case. Even in the worst-case rebalancing, you only need to walk up/down rotating/splitting every node in your path. As the tree height is guaranteed to be x * log n (x from 1 to 2, depending on the algorithm), the rebalancing operation is always limited by O(log n).


    serhiy-storchaka commented Nov 30, 2012

Serhiy Storchaka: Yes, but it is still O(log n) worst case. Even in the
worst-case rebalancing, you only need to walk up/down rotating/splitting
every node in your path. As the tree height is guaranteed to be x * log n
(x from 1 to 2, depending on the algorithm), the rebalancing operation is
always limited by O(log n).

    Agree. However I think that for large enough data a balanced tree is slower
    than a hashtable with any slow hash.


    serhiy-storchaka commented Nov 30, 2012

try:
    mapping = {}
    mapping.max_collisions = 100
    mapping.update(source)
except CollisionLimitError:
    return 'no thank you'

    May be use a more general solution?

try:
    with run_with_timeout(timeout=100, timer=collisions_count):
        mapping = insert_untrusted_data(source)
except TimeoutError:
    return 'no thank you'

(You can use different measurements for the timeout: user time, real time, tick
count, collision count, or even a user-defined timer.)


    malemburg commented Nov 30, 2012

    On 30.11.2012 22:27, Serhiy Storchaka wrote:

    Serhiy Storchaka added the comment:

> try:
>     mapping = {}
>     mapping.max_collisions = 100
>     mapping.update(source)
> except CollisionLimitError:
>     return 'no thank you'

    May be use a more general solution?

try:
    with run_with_timeout(timeout=100, timer=collisions_count):
        mapping = insert_untrusted_data(source)
except TimeoutError:
    return 'no thank you'

(You can use different measurements for the timeout: user time, real time, tick
count, collision count, or even a user-defined timer.)

    Sure, as long as there's a way to break into the execution,
    any such method would do.

    The problem is that you'd have to check for the timeout at some
    point and the .update() call is running completely in C, so
    the only way to break into execution is either by explicitly
    adding a runtime check to the code, or use a signal (which is
    a can of worms better avoided :-)).

    Collision counting is one such method of detecting that
    something is likely not working according to plan, but it's
    really only another way of implementing the explicit runtime
    check. Using other counters or timers would work just as well.

    As long as you know that there are places in your code that can
    produce CPU time overloads, you can address those issues.

    The dictionary implementation is one such place, where you
    can run into the problem, but there are usually many other
    such areas in more complex applications as well, e.g. calculations
    that enter endless loops for some parameters, deadlocks,
    I/O operations that take unusually long (e.g. due to disk
    errors), poorly written drivers of all sorts, etc. etc.

    If there's a way to solve all these things in general
    and without explicit runtime checks, I'd like to know :-)


    BobZiuchkovski mannequin commented Dec 2, 2012

    Why not redefine -R to mean "use secure hashing algorithms for built-in types"?

    When specified, use hashing algorithms that are secure against denial-of-service and other known attacks, at the possible expense of performance. When not specified, use whatever hashing algorithms provide the most sensible defaults for every-day use (basically hash the way python currently hashes).

    Secure hashing would apply not just to strings but to numeric and other types as well. This would break the invariant of x == y implies hash(x) == hash(y) for numeric types that Mark mentioned. However, that seems like an implementation detail that python users shouldn't rely upon.


    iElectric mannequin commented Jan 1, 2013

    According to talk at 29c3: http://events.ccc.de/congress/2012/Fahrplan/events/5152.en.html

    Quote: We also describe a vulnerability of Python's new randomized hash, allowing an attacker to easily recover the 128-bit secret seed. As a reliable fix to hash-flooding, we introduce SipHash, a family of cryptographically strong keyed hash function competitive in performance with the weak hashes, and already adopted in OpenDNS, Perl 5, Ruby, and in the Rust language.
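The SipHash-2-4 construction mentioned in the quote is compact enough to sketch in pure Python (a sketch following the published reference algorithm; `siphash24` is a name chosen here, and a production implementation would of course live in C):

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def _rotl(x, b):
    """Rotate a 64-bit value left by b bits."""
    return ((x << b) | (x >> (64 - b))) & MASK64

def siphash24(key, data):
    """SipHash-2-4 of `data` under a 16-byte `key` (pure-Python sketch)."""
    k0 = int.from_bytes(key[:8], "little")
    k1 = int.from_bytes(key[8:], "little")
    v0 = k0 ^ 0x736F6D6570736575
    v1 = k1 ^ 0x646F72616E646F6D
    v2 = k0 ^ 0x6C7967656E657261
    v3 = k1 ^ 0x7465646279746573

    def sipround():
        nonlocal v0, v1, v2, v3
        v0 = (v0 + v1) & MASK64; v1 = _rotl(v1, 13); v1 ^= v0; v0 = _rotl(v0, 32)
        v2 = (v2 + v3) & MASK64; v3 = _rotl(v3, 16); v3 ^= v2
        v0 = (v0 + v3) & MASK64; v3 = _rotl(v3, 21); v3 ^= v0
        v2 = (v2 + v1) & MASK64; v1 = _rotl(v1, 17); v1 ^= v2; v2 = _rotl(v2, 32)

    # The message is consumed in 8-byte little-endian words; the final word
    # carries the leftover bytes plus (len mod 256) in its top byte.
    full = len(data) - (len(data) % 8)
    for i in range(0, full, 8):
        m = int.from_bytes(data[i:i + 8], "little")
        v3 ^= m
        sipround(); sipround()      # the "2": two compression rounds per word
        v0 ^= m
    m = int.from_bytes(data[full:], "little") | ((len(data) & 0xFF) << 56)
    v3 ^= m
    sipround(); sipround()
    v0 ^= m

    v2 ^= 0xFF                      # the "4": four finalization rounds
    for _ in range(4):
        sipround()
    return v0 ^ v1 ^ v2 ^ v3
```

Unlike the multiply-xor loop discussed earlier, the 128-bit key enters the full state, so no 8-bit slice of the secret determines collisions.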


    tiran commented Jan 2, 2013

    Thanks for the information! I'm working on a PEP for the issue at hand.

    @tiran tiran self-assigned this Jan 2, 2013

    ncoghlan commented Jan 2, 2013

    Bob, the hash invariant isn't a mere implementation detail, it is critical to making hash based data structures work properly - if two equal objects (say the integer zero and the float zero) ever end up in different hash bins, then the uniqueness property of dictionary keys and sets breaks down.

    The three proposed mitigation strategies (using SipHash for string hashing, a tunable collision counting hash map and providing a non-hash based mapping container in the standard library) are all reasonable approaches to the problem and, most importantly, they're *orthogonal* approaches to the problem. There's nothing stopping us doing all three if someone is willing and able to provide the code.
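The hash invariant is easy to observe directly in any Python session:

```python
# Numeric values that compare equal across int/float/bool must hash
# equally, so that dicts and sets treat them as a single key.
x, y = 0, 0.0
assert x == y and hash(x) == hash(y)
assert hash(True) == hash(1) == hash(1.0)

d = {x: "int"}
d[y] = "float"          # same key slot: replaces the value, adds no entry
assert len(d) == 1 and d[0] == "float"
```

If 0 and 0.0 hashed differently, the dict above would silently hold two entries for one logical key, which is exactly the uniqueness breakdown described above.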


    GiovanniBajo mannequin commented Jan 2, 2013

On 2 Jan 2013, at 00:20, Domen Kožar <report@bugs.python.org> wrote:

    Domen Kožar added the comment:

    According to talk at 29c3: http://events.ccc.de/congress/2012/Fahrplan/events/5152.en.html

    Quote: We also describe a vulnerability of Python's new randomized hash, allowing an attacker to easily recover the 128-bit secret seed. As a reliable fix to hash-flooding, we introduce SipHash, a family of cryptographically strong keyed hash function competitive in performance with the weak hashes, and already adopted in OpenDNS, Perl 5, Ruby, and in the Rust language.

    That is exactly the vulnerability that was previously mentioned in the context of this bug. SipHash is currently the only solution for a collision-resistant fast-enough hash.

    Giovanni Bajo


    GiovanniBajo mannequin commented Jan 2, 2013

    On 02 Jan 2013, at 06:52, Christian Heimes <report@bugs.python.org> wrote:

    Christian Heimes added the comment:

    Thanks for the information! I'm working on a PEP for the issue at hand.

    Since you're collecting ideas on this, I would like to stress that, in the Python 3 transition, it was deemed acceptable to switch all objects to use unicode strings for attribute names, making the hash computation of such attributes (in the context of the instance dictionary) at least twice as slow as it used to be (the 'at least' refers to the fact that longer strings might have even worse effects because of a higher number of cache misses). SipHash isn't twice as slow as the current hash function, not even for short strings.

    So there is a precedent in slowing down the hash computation time in a very important use case, and it doesn't look like hell froze over.

    Giovanni Bajo

    @benjaminp
    Copy link
    Contributor

    benjaminp commented Jan 2, 2013

    2013/1/2 Giovanni Bajo <report@bugs.python.org>:

    Giovanni Bajo added the comment:

    On 02 Jan 2013, at 06:52, Christian Heimes <report@bugs.python.org> wrote:

    >
    > Christian Heimes added the comment:
    >
    > Thanks for the information! I'm working on a PEP for the issue at hand.

    > Since you're collecting ideas on this, I would like to stress that, in the Python 3 transition, it was deemed acceptable to switch all objects to use unicode strings for attribute names, making the hash computation of such attributes (in the context of the instance dictionary) at least twice as slow as it used to be (the 'at least' refers to the fact that longer strings might have even worse effects because of a higher number of cache misses). SipHash isn't twice as slow as the current hash function, not even for short strings.
    >
    > So there is a precedent in slowing down the hash computation time in a very important use case, and it doesn't look like hell froze over.

    It's probably not too bad for attribute names because a) they're short, b) they're interned, c) the hash is cached.
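Points (b) and (c) can be illustrated from Python itself (a small demo added here for context):

```python
import sys

# (b) Interning: equal attribute-name strings collapse to a single
# shared object, so dict key comparison is often a pointer check.
a = sys.intern("attribute_name")
parts = ["attribute", "_", "name"]
b = sys.intern("".join(parts))      # built at runtime, then interned
assert a is b

# (c) CPython additionally caches a str's hash inside the object after
# the first hash() call, so re-hashing the same object never reruns
# the hash function:
h1 = hash(a)
h2 = hash(a)                        # served from the cached value
assert h1 == h2
```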

    @tiran
    Copy link
    Member

    tiran commented Jan 2, 2013

    Giovanni, why do you think that hashing of unicode strings is slower than byte strings?

    First of all, ASCII-only unicode strings are packed and use just one byte per char. CPython's FNV implementation processes one element in each cycle, that is, one byte for bytes and ASCII unicode, two bytes for UCS-2 and four bytes for UCS-4. Bytes and UCS-4 strings require the same amount of CPU instructions.
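The per-element loop Christian describes can be modelled in pure Python (a simplified sketch of the pre-3.4 FNV-style string hash; the multiplier 1000003 and the prefix/suffix secrets mirror the old C code, but this is not the exact implementation):

```python
def fnv_style_hash(elements, prefix=0, suffix=0, mask=2**64 - 1):
    """Simplified model of CPython's pre-3.4 randomized string hash.

    `elements` is the sequence of code units: one byte per char for
    bytes/ASCII, one code point per char for UCS-2/UCS-4 strings --
    either way, one multiply-and-xor per element.
    """
    if not elements:
        return 0
    x = (prefix ^ (elements[0] << 7)) & mask
    for e in elements:
        x = ((1000003 * x) ^ e) & mask
    x ^= len(elements)
    x ^= suffix
    return x

# Same code units => same work and same result, whether the elements
# came from a bytes object or from a str:
assert fnv_style_hash(b"spam") == fnv_style_hash([ord(c) for c in "spam"])
```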

    @GiovanniBajo
    Copy link
    Mannequin

    GiovanniBajo mannequin commented Jan 2, 2013

    Il giorno 02/gen/2013, alle ore 19:51, Christian Heimes <report@bugs.python.org> ha scritto:

    Christian Heimes added the comment:

    Giovanni, why do you think that hashing of unicode strings is slower than byte strings?

    First of all ASCII only unicode strings are packed and use just one byte per char. CPython's FNV implementation processes one element in each cycle, that is one byte for bytes and ASCII unicode, two bytes for UCS-2 and four bytes for UCS-4. Bytes and UCS-4 strings require the same amount of CPU instructions.

    Ah sorry, I stand corrected (though packing wasn't there in 3.0, was it? I was specifically referring to the 2.x -> 3.0 transition).

    Giovanni Bajo

    My Blog: http://giovanni.bajo.it

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Jan 2, 2013

    See the microbenchmark results in bpo-16427. Hashing of ASCII/UCS1 strings really is more than 2x slower than bytes hashing.
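The kind of comparison Serhiy refers to can be reproduced with a quick `timeit` sketch (illustrative only; the concatenation in the timed statement creates a fresh object on each call so CPython's per-object hash caching doesn't short-circuit the measurement, and the exact ratio depends on the interpreter build):

```python
import timeit

N = 10_000
# Hash a freshly created 1001-element object each iteration.
t_bytes = timeit.timeit("hash(b + b'!')", setup="b = b'x' * 1000", number=N)
t_str = timeit.timeit("hash(s + '!')", setup="s = 'x' * 1000", number=N)
print(f"bytes: {t_bytes:.4f}s  str: {t_str:.4f}s")
```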

    @tiran
    Copy link
    Member

    tiran commented Nov 20, 2013

    The issue has been solved for Python 3.4 with the integration of PEP-456.
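With PEP-456 integrated (and already with 3.3's randomization), the per-process secret can be observed via the `PYTHONHASHSEED` environment variable. A small demo added here for context, assuming a Python 3 interpreter at `sys.executable`:

```python
import os
import subprocess
import sys

def str_hash_with_seed(seed):
    """Spawn a fresh interpreter and report hash('spam') under `seed`."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('spam'))"], env=env)
    return int(out)

# Different seeds give different string hashes (collisions are
# astronomically unlikely)...
assert str_hash_with_seed(1) != str_hash_with_seed(2)
# ...while the same seed is reproducible across processes.
assert str_hash_with_seed(1) == str_hash_with_seed(1)
```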

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Dec 10, 2013

    This issue has belatedly had a CVE assigned: CVE-2013-7040 ("CPython hash secret can be recovered remotely")

    3.4 is not affected (due to PEP-456), but 3.3 and 2.7 are still affected.

    @benjaminp
    Copy link
    Contributor

    benjaminp commented Dec 10, 2013

    I think that's just WONTFIX at this point.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @radishlee
    Copy link

    radishlee commented Jun 30, 2022

    Hello, I want to know how to resolve the 2.7.x problem. Do you have any good plan?

    This issue has belatedly had a CVE assigned: CVE-2013-7040 ("CPython hash secret can be recovered remotely")

    3.4 is not affected (due to PEP-456), but 3.3 and 2.7 are still affected.

    @tiran
    Copy link
    Member

    tiran commented Jun 30, 2022

    Python 2.7 reached EOL over two years ago and is no longer supported by us. You either need to update to a supported version, fix security problems yourself, or contract a vendor.

    @radishlee
    Copy link

    radishlee commented Jun 30, 2022

    Python 2.7 reached EOL over two years ago and is no longer supported by us. You either need to update to a supported version, fix security problems yourself, or contract a vendor.

    ok .thanks

    @vstinner
    Copy link
    Member

    vstinner commented Jun 30, 2022

    Hello, I want to know how to resolve the 2.7.x problem. Do you have any good plan?

    Wow, I'm impressed that people ask security questions about Python 2.7. Are you still using Python 2.7 in production?

    @radishlee
    Copy link

    radishlee commented Jul 1, 2022

    Hello, I want to know how to resolve the 2.7.x problem. Do you have any good plan?

    Wow, I'm impressed that people ask security questions about Python 2.7. Are you still using Python 2.7 in production?

    Yes. Our project started in 2010; it's so huge that we cannot upgrade from Python 2.7. It's terrible.

    @vstinner
    Copy link
    Member

    vstinner commented Jul 1, 2022

    Yes. Our project started in 2010; it's so huge that we cannot upgrade from Python 2.7. It's terrible.

    You can patch Python yourself, but Python 2.7 no longer accepts any changes "upstream"; the branch in Git has been removed. I suggest you start porting your code base incrementally to Python 3, but this closed issue is the wrong place to discuss that (and I'm not available to help you). Good luck with that ;-)

    @radishlee
    Copy link

    radishlee commented Sep 3, 2022

    thanks

    Labels
    type-security A security issue
    10 participants