tokenize/untokenize roundtrip fails with tabs #64586
Consider this simple unit test:

```python
import io
from tokenize import generate_tokens, untokenize

def test_tokenize():
    input = "if False:\n\tx=3\n\ty=3\n"
    g = list(generate_tokens(io.StringIO(input).readline))
    assert untokenize(g) == input
```

According to the docs, untokenize guarantees that its output equals the input that was given to tokenize: http://docs.python.org/2/library/tokenize.html#tokenize.untokenize

As this test fails on both Python 2 and Python 3 (use io.BytesIO on Python 2), I believe it demonstrates a violation of that guarantee. The second line in the if block gets its tab replaced by a space. Since the input is valid Python syntax, this behavior is particularly surprising. |
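For context on why only the second body line is affected, here is a small illustration (my own sketch, assuming the current tokenize API): the tokenizer emits a single INDENT token carrying the literal '\t' at the start of the block, so the leading whitespace of every later line in the block has to be regenerated from column positions, where a tab advances the column by only one.

```python
import io
from tokenize import generate_tokens, INDENT

source = "if False:\n\tx=3\n\ty=3\n"
for tok in generate_tokens(io.StringIO(source).readline):
    if tok.type == INDENT:
        print(tok)  # exactly one INDENT, string='\t', preceding 'x=3'
# 'y=3' has no INDENT token of its own, so the pre-fix untokenize
# rebuilt its indentation from coordinates as a single space.
```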
Whitespace equivalence is explicitly disclaimed: "The guarantee applies only to the token type and token string as the spacing between tokens (column positions) may change." The assert is not a valid test, so I think you should close this. (Note that there are several issues for bugs where untokenize produces semantically different text.) |
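The disclaimed, weaker guarantee can itself be tested by comparing only (type, string) pairs across a roundtrip. A minimal sketch of such a test (mine, not from the thread):

```python
import io
from tokenize import generate_tokens, untokenize

source = "if False:\n\tx=3\n\ty=3\n"
pairs = [t[:2] for t in generate_tokens(io.StringIO(source).readline)]
rebuilt = untokenize(pairs)  # 2-tuple input selects the "compat" path
repairs = [t[:2] for t in generate_tokens(io.StringIO(rebuilt).readline)]
assert pairs == repairs  # only token type and string are promised to survive
```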
I read the manual more carefully and noticed that the guarantee is that tokenizing the result of untokenize matches the input to untokenize: "The result is guaranteed to tokenize back to match the input so that the conversion is lossless and round-trips are assured." I ran the test in 3.4 and noticed that the exactness and direction of the matching do not matter, because only the 2nd \t is replaced by a space, producing invalid code that raises IndentationError. So this is definitely a bug and I withdraw the close suggestion. I believe the test should be that both lines of the body begin with the same indent:

```python
import io
from tokenize import tokenize, untokenize

def test_tab_indent(self):
    code = b"if False:\n\tx=3\n\ty=3\n"
    codelines = untokenize(tokenize(io.BytesIO(code).readline)).split(b'\n')
    self.assertEqual(codelines[1], codelines[2])
```

Of course, the tests of tokenize and untokenize should be separated. |
I think the problem is with untokenize.

```python
import io
from tokenize import tokenize, untokenize

s = b"if False:\n\tx=3\n\ty=3\n"
t = tokenize(io.BytesIO(s).readline)
for i in t: print(i)
```

produces a token stream that seems correct, starting with:

```
TokenInfo(type=56 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
```

The problem with untokenize and indents is this: in the old untokenize duples function, now called 'compat', INDENT strings were added to a list and popped by the corresponding DEDENT. While compat has the minor problems of returning a string instead of bytes (which is actually as I think it should be) and of adding extraneous spaces within and at the end of lines, it correctly handles tabs in your example and in this one:

```python
s = b"if False:\n\tx=1\n\t\ty=2\n\t\t\tz=3\n"
t = tokenize(io.BytesIO(s).readline)
print(untokenize(i[:2] for i in t).encode())
```

```
>>> b'if False :\n\tx =1 \n\t\ty =2 \n\t\t\tz =3 \n'
```

When tokenize was changed to produce 5-tuples, untokenize was changed to use the start and end coordinates, but all special processing of indents was cut in favor of add_whitespace(). So this issue is a regression due to inadequate testing. |
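To make the indent-stack idea concrete, here is a simplified sketch of the approach the old duples/compat path takes (a hypothetical reimplementation for illustration, not the actual compat code):

```python
import io
import tokenize
from tokenize import NAME, NUMBER, INDENT, DEDENT, NEWLINE, NL, ENCODING

def untokenize_compat_sketch(tokens):
    """Rebuild source from (type, string) pairs, keeping INDENT strings."""
    indents = []       # stack of literal indent strings; tabs survive here
    out = []
    startline = False
    for toknum, tokval in tokens:
        if toknum == ENCODING:
            continue
        if toknum in (NAME, NUMBER):
            tokval += ' '              # crude inter-token spacing, as in compat
        if toknum == INDENT:
            indents.append(tokval)     # remember the exact whitespace
            continue
        elif toknum == DEDENT:
            indents.pop()
            continue
        elif toknum in (NEWLINE, NL):
            startline = True
        elif startline and indents:
            out.append(indents[-1])    # replay the stored indent verbatim
            startline = False
        out.append(tokval)
    return ''.join(out)

s = b"if False:\n\tx=1\n\t\ty=2\n\t\t\tz=3\n"
toks = ((t.type, t.string) for t in tokenize.tokenize(io.BytesIO(s).readline))
print(untokenize_compat_sketch(toks).encode())
# b'if False :\n\tx =1 \n\t\ty =2 \n\t\t\tz =3 \n'
```

Because the INDENT string itself is replayed at the start of each logical line, tab characters come back verbatim; only the intra-line spacing is approximate.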
The untokenize docstring has a stronger guarantee, and in the direction you were claiming. "Round-trip invariant for full input: Untokenized source will match input source exactly". For this to be true, the indent strings must be saved and not replaced by spaces. |
bpo-24447, closed as duplicate, has another example. |
Sorry for the inconvenience. I failed to find this old bug. I think there is another problem. The docs of … The attached patch should address the problems above. While trying to make the patch, a tokenize bug was found: consider the new attached tab.py. The tabs between comments and code, and the tabs between expressions, are lost, so when untokenizing, position information is used to produce equivalent spaces instead of tabs. Despite that tokenization problem, the patch produces syntactically correct code as accurately as it can. PEP 8 recommends spaces for indentation, but the use of tabs should not be ignored. New tab.py (as a Python string):

```
'#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\ndef foo():\n\t"""\n\tTests tabs in tokenization\n\t\tfoo\n\t"""\n\tpass\n\tpass\n\tif 1:\n\t\t# not indent correctly\n\t\tpass\n\t\t# correct\ttab\n\t\tpass\n\tpass\n\tbaaz = {\'a\ttab\':\t1,\n\t\t\t\'b\': 2}\t\t# also fails\n\npass\n#if 2:\n\t#pass\n#pass\n'
```
|
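A hedged sketch of how one might check which tabs survive a roundtrip of content like tab.py (using a shortened excerpt of the string above; the expected outcome is stated per the description, not guaranteed):

```python
import difflib
import io
import tokenize

# Shortened excerpt of the attached tab.py content.
source = (b"def foo():\n"
          b"\tpass\n"
          b"\tbaaz = {'a\ttab':\t1,\n"
          b"\t\t\t'b': 2}\t\t# also fails\n")

result = tokenize.untokenize(tokenize.tokenize(io.BytesIO(source).readline))
diff = difflib.unified_diff(
    source.decode().splitlines(keepends=True),
    result.decode().splitlines(keepends=True),
    'original', 'roundtrip')
print(''.join(diff))  # expected: indentation tabs kept (with the patch),
                      # intra-line tabs replaced by equivalent spacing
```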
@gumblex: This is a good start. It certainly provides a candidate implementation. First, can I suggest that you remove the changes pertinent to the "at least two elements" issue and address that in a separate ticket/discussion? Second, any patch will need to include a test that fails before the patch and passes after it. Third, and least important, the proposed implementation adds a lot of branching and cognitive complexity to an already complicated function. I'd like to work on the implementation to see if we can make it easier to comprehend. That said, practicality beats purity, so if it passes the tests, it's surely acceptable. |
I've created a repo clone and have added a version of Terry's test to it and will now test Dingyuan's patch. |
I've committed the patch without the change for "at least two elements" as https://bitbucket.org/jaraco/cpython-issue20387/commits/b7fe3c865b8dbdb33d26f4bc5cbb6096f5445fb2. The patch corrects the new test, demonstrating its effectiveness, but yields two new test failures (apparent regressions):

```
$ ./python.exe Lib/test/test_tokenize.py
**********************************************************************
File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests
Failed example:
roundtrip("try: import somemodule\n"
"except ImportError: # comment\n"
" print('Can not import' # comment2\n)"
"else: print('Loaded')\n")
Exception raised:
Traceback (most recent call last):
File "/public/cpython-issue20387/Lib/doctest.py", line 1318, in __run
compileflags, 1), test.globs)
File "<doctest test.test_tokenize.__test__.doctests[14]>", line 1, in <module>
roundtrip("try: import somemodule\n"
File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line 698, in roundtrip
bytes_from5 = untokenize(tokens5)
File "/public/cpython-issue20387/Lib/tokenize.py", line 339, in untokenize
out = ut.untokenize(iterable)
File "/public/cpython-issue20387/Lib/tokenize.py", line 273, in untokenize
self.add_whitespace(start)
File "/public/cpython-issue20387/Lib/tokenize.py", line 236, in add_whitespace
.format(row, col, self.prev_row, self.prev_col))
ValueError: start (4,1) precedes previous end (4,4)
**********************************************************************
File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests
Failed example:
for testfile in testfiles:
if not roundtrip(open(testfile, 'rb')):
print("Roundtrip failed for file %s" % testfile)
break
else: True
Exception raised:
Traceback (most recent call last):
File "/public/cpython-issue20387/Lib/doctest.py", line 1318, in __run
compileflags, 1), test.globs)
File "<doctest test.test_tokenize.__test__.doctests[70]>", line 2, in <module>
if not roundtrip(open(testfile, 'rb')):
File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line 698, in roundtrip
bytes_from5 = untokenize(tokens5)
File "/public/cpython-issue20387/Lib/tokenize.py", line 339, in untokenize
out = ut.untokenize(iterable)
File "/public/cpython-issue20387/Lib/tokenize.py", line 273, in untokenize
self.add_whitespace(start)
File "/public/cpython-issue20387/Lib/tokenize.py", line 236, in add_whitespace
.format(row, col, self.prev_row, self.prev_col))
ValueError: start (73,8) precedes previous end (73,12)
**********************************************************************
1 items had failures:
2 of 74 in test.test_tokenize.__test__.doctests
***Test Failed*** 2 failures.
Traceback (most recent call last):
File "Lib/test/test_tokenize.py", line 1261, in <module>
test_main()
File "Lib/test/test_tokenize.py", line 1252, in test_main
support.run_doctest(test_tokenize, True)
File "/public/cpython-issue20387/Lib/test/support/__init__.py", line 1854, in run_doctest
raise TestFailed("%d of %d doctests failed" % (f, t))
test.support.TestFailed: 2 of 79 doctests failed
```

@gumblex, can you review the failures and address them? |
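For reference, the ValueError above comes from untokenize's coordinate bookkeeping: add_whitespace refuses a token whose start precedes the previous token's end, which is what happens when the replayed indent string pushes the output column past a token's recorded start. A minimal constructed trigger on current CPython (hypothetical token stream, not from the test suite):

```python
from tokenize import untokenize, NAME

# Two 5-tuples with overlapping coordinates: the second token starts at
# column 1 even though the first one ended at column 3.
bad = [
    (NAME, 'foo', (1, 0), (1, 3), 'foo bar\n'),
    (NAME, 'bar', (1, 1), (1, 4), 'foo bar\n'),
]
untokenize(bad)  # ValueError: start (1,1) precedes previous end (1,3)
```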
The new patch should now pass all tests correctly. The main idea is: …

The new test_tokenize.py fails: line 1244 should be … It seems that the tokens generated by tokenize.tokenize don't contain enough information to restore the original file. (From test/tokenize_tests.txt:)

```
# Backslash means line continuation:
-x = 1 \
+x = 1\
+ 1
```

My roundtrip test code, copied here from bpo-24447:

```
python2 -c 'import sys, tokenize; sys.stdout.write(tokenize.untokenize(tokenize.generate_tokens(sys.stdin.readline)))'
```
|
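For reference, an equivalent roundtrip check on Python 3 might use the bytes-based API (my own adaptation; `somefile.py` is a placeholder):

```
python3 -c 'import sys, tokenize; sys.stdout.buffer.write(tokenize.untokenize(tokenize.tokenize(sys.stdin.buffer.readline)))' < somefile.py
```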
@gumblex, I've applied your updated patch (though I couldn't figure out why it wouldn't apply mechanically; I had to paste it). I also corrected the test. Thanks for the advice on that. I don't understand the second half of your message. Are you stating caveats for our edification? Do you have outstanding concerns? Can you confirm your original report is corrected by the latest tip in the patch repository? I believe what remains to be done is to add a NEWS entry and merge the changes with the 3.4, 3.5, and default branches. If a 2.7 backport is required, that could be done also. |
I mean the patch only restores tabs in indentation; the problems reported above should be corrected by it. (I can't pull down the whole hg repo, so the patch was created manually by diffing two versions.) |
New changeset b784c842a63c by Jason R. Coombs in branch '3.4':
New changeset 49323e5f6391 by Jason R. Coombs in branch '3.4':
New changeset ff47efeeed48 by Dingyuan Wang in branch '3.4':
New changeset 4856ae883041 by Jason R. Coombs in branch '3.4':
New changeset 330e28b28334 by Jason R. Coombs in branch '3.4':
New changeset 9ce5c1f371f7 by Jason R. Coombs in branch '3.4':
New changeset 98380a6e037c by Jason R. Coombs in branch '3.5':
New changeset f2f5d1c928eb by Jason R. Coombs in branch 'default': |
Patch and test applied to 3.4+. I'm inclined to backport this to Python 2.7, as that was where I encountered it originally. |
Benjamin, any objections to a backport of this patch? |
New changeset 524a0e755797 by Jason R. Coombs in branch '2.7':
New changeset cb9df1ae287b by Jason R. Coombs in branch '2.7': |
For the sake of expediency, I've gone ahead and backported and pushed the fix to 2.7. Please back out the changes if appropriate. |
It seems the problem with tabs in the indentation is fixed. Can we close this? As for the problem with tabs between tokens within a line, I think that would be better handled in a separate report, but I suspect it is at worst a documentation problem: you can't guarantee that the exact source can be reconstructed from the tokens alone. |
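A small illustration of that last point (my own sketch): two sources that differ only in intra-line whitespace can yield token streams identical in everything but the raw `line` attribute, which untokenize does not consult, so no untokenizer could tell them apart.

```python
import io
from tokenize import generate_tokens

a = "x =\t1\n"  # tab between '=' and '1'
b = "x = 1\n"   # space between '=' and '1'

def toks(src):
    # Keep type, string, start, end; drop the raw `line` attribute.
    return [t[:4] for t in generate_tokens(io.StringIO(src).readline)]

print(toks(a) == toks(b))  # True: the streams are indistinguishable
```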