New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tokenize.generate_tokens treat '\f' symbol as the end of file (when reading in unicode) #63235
Comments
I execute the following code on the attached file 'text.txt': import tokenize
import codecs
with open('text.txt', 'r') as f:
reader = codecs.getreader('utf-8')(f)
tokens = tokenize.generate_tokens(reader.readline) The file 'text.txt' has the following structure: first line with some text, then '\f' symbol (0x0c) on the second line and then some text on the last line. The result is that the function 'generate_tokens' ignores everything after '\f'. I've made some debugging and found out the following. If the file is read without using codecs (in ascii-mode), there are considered to be 3 lines in the file: 'text1\n', '\f\n', 'text2\n'. However in unicode-mode there are 4 lines: 'text1\n', '\f', '\n', 'text2\n'. I guess this is an intended behaviour since 2.7.x, but this causes a bug in tokenize module. Consider the lines 317-329 in tokenize.py: column = 0
while pos < max: # measure leading whitespace
if line[pos] == ' ':
column += 1
elif line[pos] == '\\t':
column = (column//tabsize + 1)*tabsize
elif line[pos] == '\\f':
column = 0
else:
break
pos += 1
if pos == max:
break The last 'break' corresponds to the main parsing loop and makes the parsing stop. Thus the lines that consist of (' ', '\t', '\f') characters and don't end with '\n' are treated as the end of file. |
I suspect this isn't the only place where the change in what is considered a (unicode) line ending character between 2.6 and 2.7/python3 is an issue. As you observe, it causes very subtle bugs. I'm going to have to go trolling through the python3 email package looking for places where this could break things :(. |
Ha! Apparently this bug broke coverage for the Mailman 3 source code: https://bitbucket.org/ned/coveragepy/issue/360/html-reports-get-confused-by-l-in-the-code |
I can not reproduce this anymore in the latest main:
|
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: