Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle bytes comparisons in difflib.Differ #61647

Closed
warsaw opened this issue Mar 17, 2013 · 29 comments
Closed

Handle bytes comparisons in difflib.Differ #61647

warsaw opened this issue Mar 17, 2013 · 29 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@warsaw
Copy link
Member

warsaw commented Mar 17, 2013

BPO 17445
Nosy @warsaw, @terryjreedy, @ncoghlan, @pitrou, @bitdancer, @berkerpeksag
Files
  • bytes_diff.py: Wrap difflib.context/unified_diff
  • 13161c1d9c5f.diff
  • 13161c1d9c5-difflibf.diff
  • fa4c6160c518.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2015-04-21.01:22:35.351>
    created_at = <Date 2013-03-17.14:42:58.733>
    labels = ['type-bug', 'library']
    title = 'Handle bytes comparisons in difflib.Differ'
    updated_at = <Date 2015-04-21.03:06:42.469>
    user = 'https://github.com/warsaw'

    bugs.python.org fields:

    activity = <Date 2015-04-21.03:06:42.469>
    actor = 'berker.peksag'
    assignee = 'gward'
    closed = True
    closed_date = <Date 2015-04-21.01:22:35.351>
    closer = 'gward'
    components = ['Library (Lib)']
    creation = <Date 2013-03-17.14:42:58.733>
    creator = 'barry'
    dependencies = []
    files = ['29493', '39053', '39057', '39083']
    hgrepos = ['179']
    issue_num = 17445
    keywords = ['patch']
    message_count = 29.0
    messages = ['184379', '184381', '184382', '184409', '184424', '184428', '184438', '184487', '184490', '184534', '184535', '184547', '184566', '184580', '184596', '184597', '184699', '185157', '240717', '240718', '241098', '241107', '241172', '241174', '241175', '241239', '241302', '241304', '241691']
    nosy_count = 9.0
    nosy_names = ['barry', 'gward', 'terry.reedy', 'ncoghlan', 'pitrou', 'durin42', 'r.david.murray', 'python-dev', 'berker.peksag']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue17445'
    versions = ['Python 3.5']

    @warsaw
    Copy link
    Member Author

    warsaw commented Mar 17, 2013

    This came up at the Pycon 2013 Python 3 porting clinic. There are many cases in the stdlib that claim (either explicitly or implicitly) to accept bytes or strings, but that don't return the type of the arguments they accept. An example is urllib.parse.quote() which accepts bytes or str but always returns str. A similar example brought up at the clinic was difflib, which accepts both types, and works internally on both, but crashes when joining the results for return.

    It should be policy for the stdlib (i.e. codified in an informational PEP and including bug reports, because they *are* bugs, not features or baked-in API) where bytes or str are accepted but the right things are not done (i.e. return the type you accept).

    This bug captures the principle, and probably should be closed once such a PEP is accepted, with individual bugs opened for each individual case.

    @warsaw warsaw added the stdlib Python modules in the Lib dir label Mar 17, 2013
    @bitdancer
    Copy link
    Member

    bitdancer commented Mar 17, 2013

    There was a long thread about this on python-dev that might be worth going back over, where I had the same misconception (that functions should always return the same type as their arguments). While I think that should be the default design, it isn't always the best API. (The real rule, if I recall correctly, is that functions should never accept *mixed* argument types for input data.)

    @warsaw
    Copy link
    Member Author

    warsaw commented Mar 17, 2013

    On Mar 17, 2013, at 03:10 PM, R. David Murray wrote:

    There was a long thread about this on python-dev that might be worth going
    back over, where I had the same misconception (that functions should always
    return the same type as their arguments). While I think that should be the
    default design, it isn't always the best API. (The real rule, if I recall
    correctly, is that functions should never accept *mixed* argument types for
    input data.)

    Totally agree about the mixed type rule.

    But this is something different, and I know it's been discussed, and is a
    difficult problem. It's causing real-world pain for people though, so it's
    worth thinking about again.

    @gward
    Copy link
    Mannequin

    gward mannequin commented Mar 18, 2013

    The particular use case that triggered this: Mercurial's test suite. It runs "hg blah blah" and compares the output against known good output. But Mercurial's output is just bytes, because pretty much everything in a Mercurial repo is just bytes (file data of course, but also filenames and even changeset metadata like usernames).

    So attempting to run the Mercurial test suite under 3.x immediately fails hard. The boiled-down essence of the bug is this:

    >>> import difflib
    >>> a = b"hello world"
    >>> b = b"goodbye world"
    >>> [line for line in difflib.unified_diff(a, b)]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 1, in <listcomp>
      File "/home/greg/src/cpython/3.2/Lib/difflib.py", line 1224, in unified_diff
        yield '-' + line
    TypeError: Can't convert 'int' object to str implicitly

    @bitdancer
    Copy link
    Member

    bitdancer commented Mar 18, 2013

    That looks like a bug in difflib (regardless of what type it returns). But note that I'm agreeing that returning the same type you are given is generally preferrable.

    @terryjreedy
    Copy link
    Member

    terryjreedy commented Mar 18, 2013

    Changing behavior that already matches the docs is an enhancement, not a bugfix, and one that will almost certainly break code. It is therefore one that would normally require a deprecation period. I think the most you should ask for is to skip the deprecation period.

    I believe the urllib and difflib problems are quite different. I am going to presume that urllib simply converts bytes input to str and goes on from there, returning the result as str rather than (possibly) converting back to bytes. That is an example for this issue.

    Difflib.unified_diff, on the other hand, raises rather than returning an unexpected or undesired type. The 3 sections like the below have two problems given the toy input of two bytes objects.

                if tag in {'replace', 'delete'}:
                    for line in a[i1:i2]:
                        yield '-' + line

    First, iterating bytes or a slice of bytes returns ints, not 1-byte bytes. Hence the exception. Even if that were worked around, the mixed string constant + bytes expression would raise a TypeError. One fix for both problems would be to change the expression to '-' + str(line).

    Neither of these problems are bugs. The doc says "Compare a and b (lists of strings)". Actually, 'sequence of strings' is sufficient. For the operations of unified_diff, a string looks like a sequence of 1-char strings, which is why

    >> for l in difflib.unified_diff('ab', 'c'): print(l)


    +++

    @@ -1,2 +1 @@

    -a
    -b
    +c

    works.

    The other lines yielded by unified_diff are produced with str.format, and % formatting does not seem to work with bytes either. So a dual string/bytes function would not be completely trivial.

    Greg, can you convert bytes to strings, or strings to bytes, for your tests, or do you have non-ascii codes in your bytes? Otherwise, I think it might be better to write a new function 'unified_diff_bytes' that did exactly what you want than to try to make unified_diff accept sequences of bytes.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Mar 18, 2013

    At a glance, this just looks like a bug in difflib - it should use
    different literals when handling bytes. (Given that difflib newline
    processing assumes ASCII compatibility, a latin-1 based decode/encode
    solution may also be viable).

    @gward
    Copy link
    Mannequin

    gward mannequin commented Mar 18, 2013

    The original reproduction I posted was incorrect -- it makes difflib look worse than it should. (I passed strings rather than lists of strings.) Here is a more accurate version:

    >>> import difflib
    >>> a = [b'hello']
    >>> b = [b'hello!']
    >>> '\n'.join(line for line in difflib.unified_diff(a, b))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 1, in <genexpr>
      File "/home/greg/src/cpython/3.3/Lib/difflib.py", line 1223, in unified_diff
        yield '-' + line
    TypeError: Can't convert 'bytes' object to str implicitly

    So it still crashes, but the exception makes it pretty clear what the problem is.

    @gward
    Copy link
    Mannequin

    gward mannequin commented Mar 18, 2013

    Replying to Terry Reedy:

    So a dual string/bytes function would not be completely trivial.

    Correct. I have one working, but it makes my eyes bleed. I fail ashamed to have written it.

    Greg, can you convert bytes to strings, or strings to bytes

    Nope. Here is the hypothetical use case: I have a text file written in Polish encoded in ISO-8859-1 committed to a Mercurial repository. (Or saved in a filesystem somewhere: doesn't really matter, except that Mercurial repositories are immutable, long-term, and *must* *not* *lose* *data*.) Then I decide I should play nicely with the rest of the world and transcode to UTF-8, so commit a new rev in UTF-8.

    Years later, I need to look at the diff between those two old revisions. Rev 1 is a pile of ISO-8859-2 bytes, and rev 2 is a pile of UTF-8 bytes. The output of diff looks like

    • blah blah [iso-8859-2 bytes] blah
      + blah blah [utf-8 bytes] blah

    Note this: the output of diff has some lines that are iso-8859-2 bytes and some that are utf-8 bytes. *There is no single encoding* that applies.

    Note also that diff output must contain the exact original bytes, so that it can be consumed by patch. Diffs are read both by humans and by machines.

    Otherwise, I think it might be better to write a new function
    'unified_diff_bytes' that did exactly what you want than to try to
    make unified_diff accept sequences of bytes.

    Good idea. That might be much less revolting than what I have now. I'll give it a shot.

    @gward
    Copy link
    Mannequin

    gward mannequin commented Mar 18, 2013

    OK I now have two competing patches. Both are disgusting, but in different ways.

    1. http://hg.gerg.ca/cpython/rev/fcf3d27f20d9

      • factor out all the string constants
      • always concatenate, do not .format()
    2. http://hg.gerg.ca/cpython/rev/cebefce2cfd4

      • copy {unified,context}_diff() to {unified,context}_diff_bytes()
      • this is a future maintenance headache, guaranteed!

    Feedback welcome. If anyone can see a way to unify these two approaches, or a third way that sucks less, I'm all ears.

    @pitrou
    Copy link
    Member

    pitrou commented Mar 18, 2013

    Necessary when comparing files with unknown or inconsistent encodings:
    there's no way to convert everything to Unicode, so just leave it all
    bytes.

    You could simply use the surrogateescape error handler.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Mar 18, 2013

    Since we don't need to worry about ASCII incompatible encodings (difflib will already have issues with such files due to the assumptions about newlines), it should be possible to use the same approach as that used in urllib.parse, but based on latin-1 rather than ascii.

    It's the least bad option for this kind of use case (surrogateescape can be good too, but it doesn't work properly in this case where the two encodings may be different and we want to compare the raw bytes directly).

    (changed scope of issue to reflect the subsequent discussion)

    @ncoghlan ncoghlan changed the title Return the type you accept Handle bytes comparisons in difflib.Differ Mar 18, 2013
    @gward
    Copy link
    Mannequin

    gward mannequin commented Mar 19, 2013

    Take 3: http://hg.gerg.ca/cpython/rev/78bdb10551ee

    • uses surrogateescape as suggested by Antoine
    • seems to work

    @terryjreedy
    Copy link
    Member

    terryjreedy commented Mar 19, 2013

    I was about to suggested a simplified version of the original one-function version but the new one is better. One change: name = lambda... is discouraged in the stdlib (Guido, pydev, a few years ago). Def statements require only 3 more chars and produce properly named objects for tracebacks.
    encode = lambda s: s
    def encode(s): return s

    Under current rules, this is a 3.4 enhancement. For the context_diff and unified_diff doc change:
    -  Compare a and b (lists of strings);
    +  Compare string or bytes sequences a and b;   # (or)
    +  Compare a and b (both sequences of strings or sequences of bytes);

    Neither entry says anything at present about the type of from/tofile. Based on your patch, the following could go after the first sentence:
    +"Arguments *fromfile* and *tofile* are normally strings but may be bytes if the items of *a* and *b* are."

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Mar 19, 2013

    Ah, I forgot we didn't do within-line diffs. If we did those, then latin-1 would be less bad choice than ascii+surrogateescape. As it is, either should work in this case (since they just need to tunnel the raw bytes, and aren't being sliced at all).

    I agree with most of Terry's comments, but think the case can be made this is a bug worth fixing in 3.3 (it's definitely borderline, but making it feasible to port Mercurial is a pretty big gain for a relatively tiny risk).

    3.2 is about to enter security fix only mode though, so dropping that from the list of affected versions - while the fix should apply just fine, it's definitely not appropriate to make a change like this in the final planned maintenance release.

    @ncoghlan ncoghlan added the type-bug An unexpected behavior, bug, or error label Mar 19, 2013
    @gward
    Copy link
    Mannequin

    gward mannequin commented Mar 19, 2013

    Thanks for the review, Terry! Here is a revised patch, now on trunk:

    http://hg.gerg.ca/cpython/rev/6dedcdbe7cd5

    I believe I have addressed all of your concerns.

    Note also that the tests now highlight some dubious behaviour. Further feedback is welcome!

    I'm happy to rebase onto 3.3 if folks generally agree it's safe. (It seems fine to me; IMHO being unable to handle bytes is a regression relative to Python 2.)

    @terryjreedy
    Copy link
    Member

    terryjreedy commented Mar 19, 2013

    The surrogate escape approach embodies the 3.x recommendation:
    decode bytes to strings, manipulate strings, encode strings to bytes.
    It also makes it possible to wrap the existing context/unified_diff functions, without touching them, with a simple 12 line function. Function bytes_diff avoids the complexities of mixing and unmixing strings and bytes that remain in Greg's latest patch.

    I recommend the following: replace the simple test in the attached bytes_diff.py with Greg's unittest-based tests and adjust the __name__ == '__main__' incantation accordingly. Next upload to pypi to make it available to all 3.1-3.3 users. Then, after some minimal field testing, add the utility wrapper function to 3.4 difflib. These steps would make moot for difflib the sub-issue of whether the 3.x design is a bug fixable in bugfix releases. We could even add a reference to the pypi module in the 3.2 and 3.3 docs.

    @gward
    Copy link
    Mannequin

    gward mannequin commented Mar 24, 2013

    I recommend the following: replace the simple test in the attached bytes_diff.py with
    Greg's unittest-based tests and adjust the __name__ == '__main__' incantation
    accordingly.

    Latest patch, following Terry's suggestion: http://hg.gerg.ca/cpython/rev/6718d54cf9eb

    Pro:

    • does not touch unified_diff() or context_diff(), just adds a new function
    • no question that this is a new feature, so no debate about what branch to commit on
    • forces caller to know exactly what they are dealing with, strings or bytes

    Con:

    • not a bug fix, so 3.3 users won't get it ... but they can just copy the
      implementation of diff_bytes() (or, as Terry suggests, it could live on PyPI)
    • has explicit isinstance() checks (for a good reason: see the comments)
    • also has more Pythonic "raise TypeError if s.decode raised AttributeError",
      which is friendlier to duck typing but inconsistent with the isinstance() checks
      used elsewhere in the patch

    Overall I'm fairly happy with this. Not too thrilled with the explicit type checks; suggestions welcome.

    Regarding Terry's suggestion of putting diff_bytes() on PyPI: meh. If the only project that needs this is Mercurial, that would be pointless. Mercurial doesn't have PyPI dependencies, and that is unlikely to change. This might be one of those rare cases where copying the code is easier than depending on it.

    @gward gward mannequin self-assigned this Mar 24, 2013
    @durin42
    Copy link
    Mannequin

    durin42 mannequin commented Apr 13, 2015

    Friendly ping. With bytes formatting in Python 3.5a3, this is now the biggest pain port for getting our test runner working cleanly on Python 3.

    @durin42
    Copy link
    Mannequin

    durin42 mannequin commented Apr 13, 2015

    (For values of "our" == "Mercurial".)

    @gward
    Copy link
    Mannequin

    gward mannequin commented Apr 15, 2015

    OK I've revived my patch and rebased on latest trunk.

    http://hg.gerg.ca/cpython/rev/13161c1d9c5f

    Comments welcome. I'll push this in a couple of days if nobody objects.

    @berkerpeksag
    Copy link
    Member

    berkerpeksag commented Apr 15, 2015

    Some small comments:

    • diff_bytes needs to be documented in the difflib docs and in Doc/whatsnew/3.5.rst.

    • diff_bytes needs to be added to difflib.__all__

    • This looks like a new feature to me, so it would be better to just commit it to the default branch.

    + except AttributeError:
    + raise TypeError('all arguments must be bytes, not %r' % s)

    This could be changed to raise TypeError(...) from None

    + self.assertTrue(
    + isinstance(line, bytes),

    assertIsInstance

    + try:
    + list(difflib.unified_diff(a, b, fna, fnb))
    + self.fail('expected TypeError')
    + except TypeError:
    + pass

    with self.assertRaises(TypeError):
        list(difflib.unified_diff(a, b, fna, fnb))

    looks more readable to me.

    @terryjreedy
    Copy link
    Member

    terryjreedy commented Apr 15, 2015

    I tried the [Create Patch] button. Two problems: the result is about 90% concerned with other issues; it is not reviewable on Rietveld. I will unlink it and upload a cut-down version.

    Wtiht the test changes suggested by Berker, I agree that it is time to apply this, with whatever decision we make about 3.4.

    I am sympathetic to the notion that there is a regression from 2.x. There is precedent for adding a feature to fix a bug (in difflib, a new parameter for SequenceMatcher, for 2.7 3 (or thereabouts)). However, doing so was contentious (discussed on pydev) and not meant to be routine. The bug being fixed had been reported (as I remember) on four separate issues by four people and seconded by other people, so we really wanted the fix in 2.7.

    Would the following compromise work for Mercurial? The patch already adds a new private function _check_types. For 3.4, also add _diff_bytes as a private function. Merge both into 3.5. Create a 3.5 patch that makes _diff_bytes public by renaming it to diff_bytes, adds the new tests, and documents the new feature. The What's New entry could mention that the function was added privately in 3.4.4.

    @terryjreedy
    Copy link
    Member

    terryjreedy commented Apr 15, 2015

    Now the review button appears for the big patch. Lets see if another submission makes it appear for the smaller version.

    @durin42
    Copy link
    Mannequin

    durin42 mannequin commented Apr 15, 2015

    Changes to 3.4 aren't going to help Mercurial. Given that bytes formatting is new in 3.5, I won't be attempting a port that supports a version of Python 3 any older than 3.5. At this point, just a difflib.diff_bytes method in 3.5 would be sufficient to satisfy my needs.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Apr 16, 2015

    I think the convert to str -> process as str -> convert back to bytes approach is a good one - it's the same one we use in urllib.parse.

    In this case, since we explicit need to handle mixed encodings, I also agree with the idea of using surrogate escape to make it possible to tunnel arbitrary bytes through the process, and expose that as a new module level API for Python 3.5.

    @gward
    Copy link
    Mannequin

    gward mannequin commented Apr 17, 2015

    Just uploaded https://bugs.python.org/file39083/fa4c6160c518.diff. Pretty sure I've addressed all of @berker.peksag's review comments: thanks for that!

    I also fixed a number of subtle bugs in the tests. Pro tip: when asserting that something raises TypeError, inspect the exception message. There are many many ways that Python code can raise TypeError *other than* the condition you thought you were testing for. ;-)

    I'm happy with this patch now unless anyone else spots problems.

    @durin42: have you been trying this patch with your Mercurial-on-Python-3.5 patches? This would be a good time to re-update your difflib.py.

    @berkerpeksag
    Copy link
    Member

    berkerpeksag commented Apr 17, 2015

    Thanks, looks great. Two trivial comments about the documentation:

    • Needs .. versionadded:: 3.5
    • dfunc(a, b, fromfile, tofile, fromfiledate, tofiledate, n, lineterm) -> dfunc(a, b, fromfile, tofile, fromfiledate, tofiledate, n, lineterm)

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Apr 21, 2015

    New changeset 1764d42b340d by Greg Ward in branch 'default':
    bpo-17445: difflib: add diff_bytes(), to compare bytes rather than str
    https://hg.python.org/cpython/rev/1764d42b340d

    @gward gward mannequin closed this as completed Apr 21, 2015
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants