Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Two bytes objects of zero length don't compare equal #76612

Closed
jonathanunderwood mannequin opened this issue Dec 27, 2017 · 8 comments
Closed

Two bytes objects of zero length don't compare equal #76612

jonathanunderwood mannequin opened this issue Dec 27, 2017 · 8 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@jonathanunderwood
Copy link
Mannequin

jonathanunderwood mannequin commented Dec 27, 2017

BPO 32431
Nosy @bitdancer, @serhiy-storchaka, @zhangyangyu, @jonathanunderwood
PRs
  • bpo-32431: Ensure two bytes objects of zero length compare equal  #5021
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-10-22.02:38:03.196>
    created_at = <Date 2017-12-27.15:17:14.092>
    labels = ['interpreter-core', 'type-bug', 'invalid']
    title = "Two bytes objects of zero length don't compare equal"
    updated_at = <Date 2020-10-22.02:38:03.196>
    user = 'https://github.com/jonathanunderwood'

    bugs.python.org fields:

    activity = <Date 2020-10-22.02:38:03.196>
    actor = 'methane'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-10-22.02:38:03.196>
    closer = 'methane'
    components = ['Interpreter Core']
    creation = <Date 2017-12-27.15:17:14.092>
    creator = 'jonathanunderwood'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 32431
    keywords = ['patch']
    message_count = 8.0
    messages = ['309086', '309087', '309090', '309091', '309093', '309095', '309096', '311815']
    nosy_count = 4.0
    nosy_names = ['r.david.murray', 'serhiy.storchaka', 'xiang.zhang', 'jonathanunderwood']
    pr_nums = ['5021']
    priority = 'normal'
    resolution = 'not a bug'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue32431'
    versions = ['Python 3.6']

    @jonathanunderwood
    Copy link
    Mannequin Author

    jonathanunderwood mannequin commented Dec 27, 2017

    With the current logic in Objects/bytesobject.c in the function bytes_compare_eq it can be the case that zero length bytes object object created in an extension module like this:

    val = PyBytes_FromStringAndSize (NULL, 20);
    Py_SIZE(val) = 0;

    won't compare equal to b'' because the memory is not initialized, so the first two bytes won't be equal. Nonetheless, the Python interpreter does return b'' for print(repr(val)), so this behaviour is very confusing. To get the correct behaviour, one would have to initialize the memory:

    val = PyBytes_FromStringAndSize (NULL, 20);
    c = PyBytes_AS_STRING (val);
    c[0] = '\0';
    Py_SIZE(val) = 0;

    However, it would be more sensible to fix the logic in bytes_compare_eq in my opinion. That function should return true for two zero length bytes objects, irrespective of the memory contents.

    @jonathanunderwood jonathanunderwood mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Dec 27, 2017
    @bitdancer
    Copy link
    Member

    My suspicion is that this behavior/code is left over from when the code was handling strings in python2, where strings were always null terminated and so the equal-bytes test would always pass. I don't think this is appropriate for bytes objects, so I think the compare logic should be fixed. But I don't deal with the C code much, so I'd like an opinion from a core dev who does.

    @jonathanunderwood
    Copy link
    Mannequin Author

    jonathanunderwood mannequin commented Dec 27, 2017

    #5021

    @serhiy-storchaka
    Copy link
    Member

    Much C code implies that bytes are null terminated. For example bytes-to-number converters and most functions that accept bytes as file path. Changing just the compare logic will not help.

    The documented way of changing the size of the bytes object is _PyBytes_Resize(). But since this is a private function the only way of resizing the bytes object using public functions is creating a new object with PyBytes_FromStringAndSize(). If the new size is 0, it will return a singleton, thus this is a fast and memory efficient way.

    Using the Py_SIZE() macro with the bytes object is not even documented. We should either document this way with its caveats (for example misusing it can lead to inefficient use of memory), or admit that this is working with internal representation which we shouldn't encourage.

    @bitdancer
    Copy link
    Member

    I think what Py_SIZE "means" depends on the object type in the general case, not just this one, so documenting that it is mucking with the internal representation is probably the way to go.

    @jonathanunderwood
    Copy link
    Mannequin Author

    jonathanunderwood mannequin commented Dec 27, 2017

    Py_SIZE is actually precisely specified and documented[1] as it stands; I don't think a change there is needed. The usage I outlined is in line with that documentation, and many other uses of that macro in the cpython sources.

    The documentation issues are, at least:

    1. There is no documentation specifying that bytes objects should be null terminated.

    2. Nothing in the documentation of PyBytes_FromStringAndSize[2] specifies that passing 0 as the size results in a singleton being returned. This is undocumented behaviour, and it would seem fragile to rely on this.

    But there are more implementation inconsistencies: the documentation for PyBytes_AsString()[3] returns a buffer which is one byte longer than the length of the object *in order to store a terminating null*, which implies that the object need not itself have a terminating null. I could go on with other examples, but this is very poorly defined behaviour.

    Question: are bytes objects defined to be null terminated, or not?

    Because if they're not defined to be null terminated, the fix I propose is correct even if it doesn't solve the other 100 bugs lurking in the code.

    [Aside: even if bytes objects are in fact defined to be null terminated, I think the change proposed amounts to an optimization in any case.]

    [1] https://docs.python.org/3/c-api/structures.html#c.Py_SIZE
    [2] https://docs.python.org/3/c-api/bytes.html?highlight=pybytes_fromstringandsize#c.PyBytes_FromStringAndSize
    [3] https://docs.python.org/3/c-api/bytes.html?highlight=pybytes_asstring#c.PyBytes_AsString

    @jonathanunderwood
    Copy link
    Mannequin Author

    jonathanunderwood mannequin commented Dec 27, 2017

    Actually the commentary at the top of bytesobject.c for PyBytes_FromStringAndSize says:

    "... If str' is NULL then PyBytes_FromStringAndSize() will allocate size+1' bytes (setting the last byte to the null terminating character)... "

    So, perhaps that's as close to gospel as it gets - this does imply that bytes objects are expected to be null terminated. Why PyBytesAsString then adds an extra null terminator is a bit of a mystery.

    Perhaps what's needed is some documentation clarifications:

    1/ State early on that bytes objects are always expected to be null terminated.

    2/ As such, the string pointer returned by PyBytes_AsString will point to a null terminated string - I think the current docs could be misinterpreted to suggest that _AsString *adds* an extra byte for the null, which it doesn't.

    3/ Document that using Py_SIZE to reduce the length of a bytes object is dangerous, because the null terminator will be lost, and subsequent behaviour undefined.

    4/ Document that the preferred way to resize is to use PyBytes_FromStringAndSize with a new size.

    5/ Indicate clearly that _PyBytes_Resize is not a public interface and its use is discouraged in favour of PyBytes_FromStringAndSize

    @zhangyangyu
    Copy link
    Member

    In my mind I don't think here is any problem. The code

    val = PyBytes_FromStringAndSize (NULL, 20);
    Py_SIZE(val) = 0;

    doesn't create a zero length bytes object. A resizing I think should always means reorganizing the internal representation, including changing the size attribute. Using Py_SIZE won't trigger a full resizing. The code is not resizing but just breaks the internal representation and then leads to bugs. In C level, if you want, it's easy to muck Python objects. And exposing that much implementation details in documentation seems unnecessary.

    @methane methane closed this as completed Oct 22, 2020
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants