-
-
Notifications
You must be signed in to change notification settings - Fork 30.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle bytes comparisons in difflib.Differ #61647
Comments
This came up at the Pycon 2013 Python 3 porting clinic. There are many cases in the stdlib that claim (either explicitly or implicitly) to accept bytes or strings, but that don't return the type of the arguments they accept. An example is urllib.parse.quote() which accepts bytes or str but always returns str. A similar example brought up at the clinic was difflib, which accepts both types, and works internally on both, but crashes when joining the results for return. It should be policy for the stdlib (i.e. codified in an informational PEP and including bug reports, because they *are* bugs, not features or baked-in API) where bytes or str are accepted but the right things are not done (i.e. return the type you accept). This bug captures the principle, and probably should be closed once such a PEP is accepted, with individual bugs opened for each individual case. |
There was a long thread about this on python-dev that might be worth going back over, where I had the same misconception (that functions should always return the same type as their arguments). While I think that should be the default design, it isn't always the best API. (The real rule, if I recall correctly, is that functions should never accept *mixed* argument types for input data.) |
On Mar 17, 2013, at 03:10 PM, R. David Murray wrote:
Totally agree about the mixed type rule. But this is something different, and I know it's been discussed, and is a |
The particular use case that triggered this: Mercurial's test suite. It runs "hg blah blah" and compares the output against known good output. But Mercurial's output is just bytes, because pretty much everything in a Mercurial repo is just bytes (file data of course, but also filenames and even changeset metadata like usernames). So attempting to run the Mercurial test suite under 3.x immediately fails hard. The boiled-down essence of the bug is this: >>> import difflib
>>> a = b"hello world"
>>> b = b"goodbye world"
>>> [line for line in difflib.unified_diff(a, b)]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
File "/home/greg/src/cpython/3.2/Lib/difflib.py", line 1224, in unified_diff
yield '-' + line
TypeError: Can't convert 'int' object to str implicitly |
That looks like a bug in difflib (regardless of what type it returns). But note that I'm agreeing that returning the same type you are given is generally preferrable. |
Changing behavior that already matches the docs is an enhancement, not a bugfix, and one that will almost certainly break code. It is therefore one that would normally require a deprecation period. I think the most you should ask for is to skip the deprecation period. I believe the urllib and difflib problems are quite different. I am going to presume that urllib simply converts bytes input to str and goes on from there, returning the result as str rather than (possibly) converting back to bytes. That is an example for this issue. Difflib.unified_diff, on the other hand, raises rather than returning an unexpected or undesired type. The 3 sections like the below have two problems given the toy input of two bytes objects. if tag in {'replace', 'delete'}:
for line in a[i1:i2]:
yield '-' + line First, iterating bytes or a slice of bytes returns ints, not 1-byte bytes. Hence the exception. Even if that were worked around, the mixed string constant + bytes expression would raise a TypeError. One fix for both problems would be to change the expression to '-' + str(line). Neither of these problems are bugs. The doc says "Compare a and b (lists of strings)". Actually, 'sequence of strings' is sufficient. For the operations of unified_diff, a string looks like a sequence of 1-char strings, which is why
+++ @@ -1,2 +1 @@ -a works. The other lines yielded by unified_diff are produced with str.format, and % formatting does not seem to work with bytes either. So a dual string/bytes function would not be completely trivial. Greg, can you convert bytes to strings, or strings to bytes, for your tests, or do you have non-ascii codes in your bytes? Otherwise, I think it might be better to write a new function 'unified_diff_bytes' that did exactly what you want than to try to make unified_diff accept sequences of bytes. |
At a glance, this just looks like a bug in difflib - it should use |
The original reproduction I posted was incorrect -- it makes difflib look worse than it should. (I passed strings rather than lists of strings.) Here is a more accurate version: >>> import difflib
>>> a = [b'hello']
>>> b = [b'hello!']
>>> '\n'.join(line for line in difflib.unified_diff(a, b))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <genexpr>
File "/home/greg/src/cpython/3.3/Lib/difflib.py", line 1223, in unified_diff
yield '-' + line
TypeError: Can't convert 'bytes' object to str implicitly So it still crashes, but the exception makes it pretty clear what the problem is. |
Replying to Terry Reedy:
Correct. I have one working, but it makes my eyes bleed. I fail ashamed to have written it.
Nope. Here is the hypothetical use case: I have a text file written in Polish encoded in ISO-8859-1 committed to a Mercurial repository. (Or saved in a filesystem somewhere: doesn't really matter, except that Mercurial repositories are immutable, long-term, and *must* *not* *lose* *data*.) Then I decide I should play nicely with the rest of the world and transcode to UTF-8, so commit a new rev in UTF-8. Years later, I need to look at the diff between those two old revisions. Rev 1 is a pile of ISO-8859-2 bytes, and rev 2 is a pile of UTF-8 bytes. The output of diff looks like
Note this: the output of diff has some lines that are iso-8859-2 bytes and some that are utf-8 bytes. *There is no single encoding* that applies. Note also that diff output must contain the exact original bytes, so that it can be consumed by patch. Diffs are read both by humans and by machines.
Good idea. That might be much less revolting than what I have now. I'll give it a shot. |
OK I now have two competing patches. Both are disgusting, but in different ways.
Feedback welcome. If anyone can see a way to unify these two approaches, or a third way that sucks less, I'm all ears. |
You could simply use the surrogateescape error handler. |
Since we don't need to worry about ASCII incompatible encodings (difflib will already have issues with such files due to the assumptions about newlines), it should be possible to use the same approach as that used in urllib.parse, but based on latin-1 rather than ascii. It's the least bad option for this kind of use case (surrogateescape can be good too, but it doesn't work properly in this case where the two encodings may be different and we want to compare the raw bytes directly). (changed scope of issue to reflect the subsequent discussion) |
Take 3: http://hg.gerg.ca/cpython/rev/78bdb10551ee
|
I was about to suggested a simplified version of the original one-function version but the new one is better. One change: name = lambda... is discouraged in the stdlib (Guido, pydev, a few years ago). Def statements require only 3 more chars and produce properly named objects for tracebacks. Under current rules, this is a 3.4 enhancement. For the context_diff and unified_diff doc change:
- Compare a and b (lists of strings);
+ Compare string or bytes sequences a and b; # (or)
+ Compare a and b (both sequences of strings or sequences of bytes); Neither entry says anything at present about the type of from/tofile. Based on your patch, the following could go after the first sentence: |
Ah, I forgot we didn't do within-line diffs. If we did those, then latin-1 would be less bad choice than ascii+surrogateescape. As it is, either should work in this case (since they just need to tunnel the raw bytes, and aren't being sliced at all). I agree with most of Terry's comments, but think the case can be made this is a bug worth fixing in 3.3 (it's definitely borderline, but making it feasible to port Mercurial is a pretty big gain for a relatively tiny risk). 3.2 is about to enter security fix only mode though, so dropping that from the list of affected versions - while the fix should apply just fine, it's definitely not appropriate to make a change like this in the final planned maintenance release. |
Thanks for the review, Terry! Here is a revised patch, now on trunk: http://hg.gerg.ca/cpython/rev/6dedcdbe7cd5 I believe I have addressed all of your concerns. Note also that the tests now highlight some dubious behaviour. Further feedback is welcome! I'm happy to rebase onto 3.3 if folks generally agree it's safe. (It seems fine to me; IMHO being unable to handle bytes is a regression relative to Python 2.) |
The surrogate escape approach embodies the 3.x recommendation: I recommend the following: replace the simple test in the attached bytes_diff.py with Greg's unittest-based tests and adjust the __name__ == '__main__' incantation accordingly. Next upload to pypi to make it available to all 3.1-3.3 users. Then, after some minimal field testing, add the utility wrapper function to 3.4 difflib. These steps would make moot for difflib the sub-issue of whether the 3.x design is a bug fixable in bugfix releases. We could even add a reference to the pypi module in the 3.2 and 3.3 docs. |
Latest patch, following Terry's suggestion: http://hg.gerg.ca/cpython/rev/6718d54cf9eb Pro:
Con:
Overall I'm fairly happy with this. Not too thrilled with the explicit type checks; suggestions welcome. Regarding Terry's suggestion of putting diff_bytes() on PyPI: meh. If the only project that needs this is Mercurial, that would be pointless. Mercurial doesn't have PyPI dependencies, and that is unlikely to change. This might be one of those rare cases where copying the code is easier than depending on it. |
Friendly ping. With bytes formatting in Python 3.5a3, this is now the biggest pain port for getting our test runner working cleanly on Python 3. |
(For values of "our" == "Mercurial".) |
OK I've revived my patch and rebased on latest trunk. http://hg.gerg.ca/cpython/rev/13161c1d9c5f Comments welcome. I'll push this in a couple of days if nobody objects. |
Some small comments:
+ except AttributeError: This could be changed to raise TypeError(...) from None + self.assertTrue( assertIsInstance + try: with self.assertRaises(TypeError):
list(difflib.unified_diff(a, b, fna, fnb)) looks more readable to me. |
I tried the [Create Patch] button. Two problems: the result is about 90% concerned with other issues; it is not reviewable on Rietveld. I will unlink it and upload a cut-down version. Wtiht the test changes suggested by Berker, I agree that it is time to apply this, with whatever decision we make about 3.4. I am sympathetic to the notion that there is a regression from 2.x. There is precedent for adding a feature to fix a bug (in difflib, a new parameter for SequenceMatcher, for 2.7 3 (or thereabouts)). However, doing so was contentious (discussed on pydev) and not meant to be routine. The bug being fixed had been reported (as I remember) on four separate issues by four people and seconded by other people, so we really wanted the fix in 2.7. Would the following compromise work for Mercurial? The patch already adds a new private function _check_types. For 3.4, also add _diff_bytes as a private function. Merge both into 3.5. Create a 3.5 patch that makes _diff_bytes public by renaming it to diff_bytes, adds the new tests, and documents the new feature. The What's New entry could mention that the function was added privately in 3.4.4. |
Now the review button appears for the big patch. Lets see if another submission makes it appear for the smaller version. |
Changes to 3.4 aren't going to help Mercurial. Given that bytes formatting is new in 3.5, I won't be attempting a port that supports a version of Python 3 any older than 3.5. At this point, just a difflib.diff_bytes method in 3.5 would be sufficient to satisfy my needs. |
I think the convert to str -> process as str -> convert back to bytes approach is a good one - it's the same one we use in urllib.parse. In this case, since we explicit need to handle mixed encodings, I also agree with the idea of using surrogate escape to make it possible to tunnel arbitrary bytes through the process, and expose that as a new module level API for Python 3.5. |
Just uploaded https://bugs.python.org/file39083/fa4c6160c518.diff. Pretty sure I've addressed all of @berker.peksag's review comments: thanks for that! I also fixed a number of subtle bugs in the tests. Pro tip: when asserting that something raises TypeError, inspect the exception message. There are many many ways that Python code can raise TypeError *other than* the condition you thought you were testing for. ;-) I'm happy with this patch now unless anyone else spots problems. @durin42: have you been trying this patch with your Mercurial-on-Python-3.5 patches? This would be a good time to re-update your difflib.py. |
Thanks, looks great. Two trivial comments about the documentation:
|
New changeset 1764d42b340d by Greg Ward in branch 'default': |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: