Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bytes.fromhex should ignore all whitespace #73113

Closed
nneonneo mannequin opened this issue Dec 10, 2016 · 19 comments
Closed

bytes.fromhex should ignore all whitespace #73113

nneonneo mannequin opened this issue Dec 10, 2016 · 19 comments
Assignees
Labels
3.7 interpreter-core type-feature

Comments

@nneonneo
Copy link
Mannequin

@nneonneo nneonneo mannequin commented Dec 10, 2016

BPO 28927
Nosy @terryjreedy, @ncoghlan, @vadmium, @serhiy-storchaka
PRs
  • #552
  • Files
  • fromhex.patch: patch to use isspace for fromhex
  • fromhex.patch
  • fromhex.patch: patch to skip all ASCII whitespace in bytes.fromhex
  • fromhex.patch: patch to skip all ASCII whitespace in bytes.fromhex
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2016-12-19.16:53:57.731>
    created_at = <Date 2016-12-10.00:07:26.864>
    labels = ['interpreter-core', 'type-feature', '3.7']
    title = 'bytes.fromhex should ignore all whitespace'
    updated_at = <Date 2017-03-31.16:36:25.899>
    user = 'https://bugs.python.org/nneonneo'

    bugs.python.org fields:

    activity = <Date 2017-03-31.16:36:25.899>
    actor = 'dstufft'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2016-12-19.16:53:57.731>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core']
    creation = <Date 2016-12-10.00:07:26.864>
    creator = 'nneonneo'
    dependencies = []
    files = ['45823', '45933', '45966', '45967']
    hgrepos = []
    issue_num = 28927
    keywords = ['patch']
    message_count = 19.0
    messages = ['282811', '282816', '282817', '283449', '283456', '283457', '283492', '283504', '283542', '283545', '283549', '283551', '283571', '283627', '283631', '283632', '283633', '283634', '283639']
    nosy_count = 6.0
    nosy_names = ['terry.reedy', 'ncoghlan', 'nneonneo', 'python-dev', 'martin.panter', 'serhiy.storchaka']
    pr_nums = ['552']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue28927'
    versions = ['Python 3.7']

    @nneonneo
    Copy link
    Mannequin Author

    @nneonneo nneonneo mannequin commented Dec 10, 2016

    bytes.fromhex ignores space characters now (yay!) but still barfs if fed newlines or tabs:

    >>> bytes.fromhex('ab\ncd')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: non-hexadecimal number found in fromhex() arg at position 2
    >>> bytes.fromhex('ab\tcd')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: non-hexadecimal number found in fromhex() arg at position 2

    It's often quite useful to paste blobs of hex into source code (or the REPL) and call ".fromhex" on them. These might include spaces, tabs and/or newlines, and barfing on these other whitespace characters is inconvenient.

    I propose that bytes.fromhex should ignore all whitespace. A patch + test is attached.

    @nneonneo nneonneo mannequin added the type-feature label Dec 10, 2016
    @vadmium
    Copy link
    Member

    @vadmium vadmium commented Dec 10, 2016

    Seems a reasonable feature. The documentation would also need updating.

    Which specific (whitespace) characters do you propose to ignore? Just ASCII ones, as in bytes.isspace(), or others like b"\xA0" (non-breaking space) and U+2028 (line separator), as in str.isspace()? Perhaps you should add test cases to clarify.

    @vadmium vadmium added the 3.7 label Dec 10, 2016
    @nneonneo
    Copy link
    Mannequin Author

    @nneonneo nneonneo mannequin commented Dec 10, 2016

    I used Py_ISSPACE, which uses the .strip() default charset - I think this is a reasonable choice. We don't have to go crazy and support all the Unicode spaces.

    @terryjreedy
    Copy link
    Member

    @terryjreedy terryjreedy commented Dec 16, 2016

    I am a bit dubious about this. There is a tradeoff here between convenience and bug detection. The patch is not strictly necessary.

    >>> bytes.fromhex('ab\ncd'.replace('\n', ''))
    b'\xab\xcd'

    Bytes (and bytearray) .fromhex already ignores spaces. Not ignoring newlines, by default, may have been intentional. Nick, you wrote the doc entry for 'fromhex'. Do you know anything about the design intention?

    @nneonneo
    Copy link
    Mannequin Author

    @nneonneo nneonneo mannequin commented Dec 17, 2016

    Terry, can you elaborate what you mean by a tradeoff? I feel like such a patch makes .fromhex more consistent with other string methods like .split() and .strip() which implicitly consider all whitespace equivalent.

    Martin, I've updated the patch to include documentation updates and more testcases.

    @nneonneo
    Copy link
    Mannequin Author

    @nneonneo nneonneo mannequin commented Dec 17, 2016

    Sorry, I should have clarified that these methods consider *ASCII whitespace* equivalent - just like my proposed patch.

    @ncoghlan
    Copy link
    Contributor

    @ncoghlan ncoghlan commented Dec 17, 2016

    My recollection is that fromhex() ignores spaces to account for particularly common ways of formatting hex numbers as space separated groups:

    "CAFE F00D"
    "CAFEF00D CAFEF00D CAFEF00D"
    "CA FE F0 0D"
    etc
    

    Those show up even in structured hexadecimal data (like hex editor output and memory dumps in log files).

    That recollection is supported by the example and specification of specifically "[0-9a-fA-F ]" in PEP-358 (https://www.python.org/dev/peps/pep-0358/) that guided Georg Brandl's initial implementation in http://bugs.python.org/issue1669379

    Generally speaking, the type level string parsers *aren't* permissive when it comes to their whitespace handling - if they allow whitespace at all, it's usually only at the beginning or end, where it gets ignored (hence the PEP for 3.6 to allow underscores in both numeric literals and in the numeric constructors).

    =====================

    >>> float(" 1.0")
    1.0
    >>> float("1.0 ")
    1.0
    >>> float("1 0")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: could not convert string to float: '1 0'
    >>> float("1. 0")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: could not convert string to float: '1. 0'
    >>> float("1 .0")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: could not convert string to float: '1 .0'

    =====================

    The general technique to strip whitespace from *any* input data before handing it to a constructor relies on str.translate:

    =====================

    >>> import string
    >>> _del_whitespace = str.maketrans('', '', string.whitespace)
    >>> def clean_whitespace(text):
    ...     return text.translate(_del_whitespace)
    ... 
    >>> clean_whitespace('CA FE\nF0\t0D')
    'CAFEF00D'

    =====================

    (http://stackoverflow.com/questions/3739909/how-to-strip-all-whitespace-from-string also points out "".join(text.split()) as a clever one liner for a similar outcome)

    So I'm inclined to advise *against* making any changes here - the apparent benefit is more a symptom of the fact that there isn't an immediately obvious spelling of "strip all whitespace, including that between other characters, from this string" that can be readily applied as an initial filter on the incoming data.

    (However, I do sometimes wonder if it would make sense to offer a "str.stripall()" that defaulted to removing all whitespace, rather than treating this purely as a particular use case for str.translate)

    @nneonneo
    Copy link
    Mannequin Author

    @nneonneo nneonneo mannequin commented Dec 17, 2016

    I see your point, Nick. Can I offer a counterpoint?

    Most of the string parsers operate only on relatively short inputs, like numbers. Numbers in particular are rarely written with inner spaces, so it makes sense not to ignore internal whitespaces.

    On the other hand, hexadecimal data can be very long, and is often formatted with spaces and newlines. For example, the default output of xxd -p, a format quite suitable for copy-paste, looks like this:

    cffaedfe07000001030000800200000015000000d8080000858021000000
    000019000000480000005f5f504147455a45524f00000000000000000000
    000000000000000001000000000000000000000000000000000000000000
    000000000000000000000000000019000000180300005f5f544558540000
    0000000000000000000000000100000000909d0100000000000000000000

    It would be desirable to write something like

    blob = bytes.fromhex('''
    cffaedfe07000001030000800200000015000000d8080000858021000000
    000019000000480000005f5f504147455a45524f00000000000000000000
    000000000000000001000000000000000000000000000000000000000000
    000000000000000000000000000019000000180300005f5f544558540000
    0000000000000000000000000100000000909d0100000000000000000000
    ''')

    and not have to worry about sticking in some whitespace remover, like this:

    blob = bytes.fromhex(''.join('''
    cffaedfe07000001030000800200000015000000d8080000858021000000
    000019000000480000005f5f504147455a45524f00000000000000000000
    000000000000000001000000000000000000000000000000000000000000
    000000000000000000000000000019000000180300005f5f544558540000
    0000000000000000000000000100000000909d0100000000000000000000
    '''.split()))

    or removing the newlines in the source code, which impacts readability.

    Similar kinds of whitespaced output (sometimes with spaces between octets, words or dwords, sometimes with tabs between 8-16 byte groups, sometimes with newlines between groups, etc.) can be found in the wild and from the "hex" clipboard output from various applications.

    We can already have newlines and other whitespace with base64, which is in principle quite similar:

    blob = base64.b64decode('''
    z/rt/gcAAAEDAACAAgAAABUAAADYCAAAhYAhAAAAAAAZAAAASAAAAF9fUEFHRVpFUk8AAAAAAAAAAAAA
    AAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAZAAAAGAMAAF9fVEVYVAAA
    AAAAAAAAAAAAAAAAAQAAAACQnQEAAAAAAAAAAAAAAAAAkJ0BAAAAAAcAAAAFAAAACQAAAAAAAABfX3Rl
    eHQAAAAAAAAAAAAAX19URVhUAAAAAAAAAAAAAAALAAABAAAARCF5AQAAAAAACwAACAAAAAAAAAAAAAAA
    ''')

    so I think it makes sense to support other whitespaces in fromhex. I'm happy to reconsider if there's a strong argument against adding this convenience.

    @ncoghlan
    Copy link
    Contributor

    @ncoghlan ncoghlan commented Dec 18, 2016

    Ah, the parallel with base64 decoding and embedding encoded data in multi-line string literals is indeed a compelling one - I'd missed that.

    Given that rationale, +1 from me.

    Perhaps it would make sense to call that out directly in the documentation? Something like a second example saying:

    ====
    Ignoring all ASCII whitespace provides compatibility with common hexdump formats (like the output of xxd), allowing such data to easily be read from a file or included directly in source code as a multiline string literal::

        >>> bytes.fromhex("""
        ... F0F1F2F3F4
        ... FAFBFCFDFE
        """)
        b'\xf0\xf1\xf2\xf4\xfa\xfb\xfc\xfd'

    ====

    And then a versionchanged note for 3.7 to say that this was switched from filtering out specifically space to filtering out any ASCII whitespace.

    My other question would be whether or not a separate issue should be filed to update the bytes-to-bytes "hex" codec to be consistent with this change - at the moment, it doesn't allow whitespace *at all* (not even ASCII spaces), while the base64 decoder is consistent with base64.b64decode and allows it.

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Dec 18, 2016

    Is it worth to ignore also non-ASCII whitespaces for compatibility with string-to-number convertions and base64 decoder?

    >>> float('\xa01.0')
    1.0
    >>> base64.decodebytes(b'\xa0YWJj')
    b'abc'

    Note that not all spaces are ignored. They shouldn't break hexadecimal pairs.

    >>> bytes.fromhex('a b')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: non-hexadecimal number found in fromhex() arg at position 1

    @vadmium
    Copy link
    Member

    @vadmium vadmium commented Dec 18, 2016

    As far as I know, non-ASCII newlines and whitespace are not supported in Python source code, so there is not a big need to support it in bytes.fromhex() either. But since bytes.fromhex() accepts Unicode strings, I think non-ASCII whitespace would be okay if it was easy to implement.

    Serhiy: Whitespace is not treated specially by the base-64 decoders, it is just treated like any non-alphabet character. See <https://docs.python.org/3/library/base64.html#base64.b64decode\> and b64decode(validate=True).

    Regarding hex-codec, I doubt many people use it in Python 3. To decode a whole string, binascii.unhexlify() or bytes.fromhex() is probably more obvious, and I think the incremental decoder never worked properly (bpo-20132).

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Dec 18, 2016

    No, it is not easy to implement. Currently bytes.fromhex() works only with ASCII strings and can just iterate over char*.

    If add support of non-ASCII whitespaces, we should add a support of non-ASCII digits, as in float(). This looks excessive.

    In case if there is a case for this, we rather would expose _PyUnicode_TransformDecimalAndSpaceToASCII() in Python level.

    In general fromhex.patch LGTM, but please add versionchanged directives and document the change in What's New.

    @ncoghlan
    Copy link
    Contributor

    @ncoghlan ncoghlan commented Dec 19, 2016

    +1 to the above Unicode whitespace discussion - the "ASCII space -> ASCII whitespace" change is relatively straightforward to implement, and clearly beneficial given Robert's point regarding the popularity of multi-line terminal-oriented hexdump formats.

    Neither of those points apply to Unicode whitespace (it would be much harder to implement, and we don't have a demonstrated use case for it), so it makes sense to avoid going that far.

    @nneonneo
    Copy link
    Mannequin Author

    @nneonneo nneonneo mannequin commented Dec 19, 2016

    OK, I've attached a new version of the patch with the requested documentation changes (versionchanged and whatsnew).

    @nneonneo
    Copy link
    Mannequin Author

    @nneonneo nneonneo mannequin commented Dec 19, 2016

    New patch with proper line lengths in documentation.

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Dec 19, 2016

    LGTM.

    @serhiy-storchaka serhiy-storchaka added the interpreter-core label Dec 19, 2016
    @serhiy-storchaka serhiy-storchaka self-assigned this Dec 19, 2016
    @python-dev
    Copy link
    Mannequin

    @python-dev python-dev mannequin commented Dec 19, 2016

    New changeset fcc09d9ee7d4 by Serhiy Storchaka in branch 'default':
    Issue bpo-28927: bytes.fromhex() and bytearray.fromhex() now ignore all ASCII
    https://hg.python.org/cpython/rev/fcc09d9ee7d4

    @serhiy-storchaka
    Copy link
    Member

    @serhiy-storchaka serhiy-storchaka commented Dec 19, 2016

    Thank you for your contribution Robert.

    @terryjreedy
    Copy link
    Member

    @terryjreedy terryjreedy commented Dec 19, 2016

    I think non-ASCII whitespace and digits are YAGNI until we are convinced otherwise by evidence from the field that people are routinely mixing other decimal digits with 'abcdef' as hex numerals. Anyone who does try such a thing can write a wrapper that first translates to ascii digits.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 interpreter-core type-feature
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants