
tokenize.generate_tokens treat '\f' symbol as the end of file (when reading in unicode) #63235

Closed
AlexeyUmnov mannequin opened this issue Sep 16, 2013 · 4 comments
Labels
stdlib: Python modules in the Lib dir
type-bug: An unexpected behavior, bug, or error

Comments


AlexeyUmnov mannequin commented Sep 16, 2013

BPO 19035
Nosy @warsaw, @bitdancer
Files
  • tokens.txt
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2013-09-16.14:12:54.000>
    labels = ['type-bug', 'library']
    title = "tokenize.generate_tokens treat '\\f' symbol as the end of file (when reading in unicode)"
    updated_at = <Date 2015-03-02.14:27:38.589>
    user = 'https://bugs.python.org/AlexeyUmnov'

    bugs.python.org fields:

    activity = <Date 2015-03-02.14:27:38.589>
    actor = 'barry'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2013-09-16.14:12:54.000>
    creator = 'Alexey.Umnov'
    dependencies = []
    files = ['31796']
    hgrepos = []
    issue_num = 19035
    keywords = []
    message_count = 3.0
    messages = ['197899', '197910', '237044']
    nosy_count = 3.0
    nosy_names = ['barry', 'r.david.murray', 'Alexey.Umnov']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue19035'
    versions = ['Python 2.7']


    AlexeyUmnov mannequin commented Sep 16, 2013

    I execute the following code on the attached file 'text.txt':

    import tokenize
    import codecs
    
    with open('text.txt', 'r') as f:
        reader = codecs.getreader('utf-8')(f)
        # generate_tokens is lazy; consume the generator to tokenize the file
        tokens = list(tokenize.generate_tokens(reader.readline))

    The file 'text.txt' has the following structure: a first line with some text, then a '\f' character (0x0c) on the second line, and then some text on the last line. The result is that 'generate_tokens' ignores everything after the '\f'.

    I did some debugging and found the following. If the file is read without codecs (in ascii mode), it is considered to have 3 lines: 'text1\n', '\f\n', 'text2\n'. In unicode mode, however, it has 4 lines: 'text1\n', '\f', '\n', 'text2\n'. I assume this has been the intended behaviour since 2.7.x, but it triggers a bug in the tokenize module.
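The difference comes from Python's Unicode line-boundary rules: codecs stream readers split lines with str.splitlines, which treats '\f' (form feed) as a line boundary, while byte-oriented reads split only on '\n'. A minimal sketch of the two splitting behaviours (using str.splitlines directly to stand in for the codecs reader):

```python
# The same three "lines" as text.txt: text, a bare form feed, text.
data = 'text1\n\f\ntext2\n'

# str.splitlines treats '\f' (form feed, 0x0c) as a line boundary,
# so the middle line '\f\n' is split into TWO lines: '\f' and '\n'.
unicode_lines = data.splitlines(keepends=True)
print(unicode_lines)   # ['text1\n', '\x0c', '\n', 'text2\n']

# Splitting only on '\n' (what the ascii-mode readline does) keeps
# '\f\n' together as a single line.
ascii_lines = [line + '\n' for line in data.split('\n')[:-1]]
print(ascii_lines)     # ['text1\n', '\x0c\n', 'text2\n']
```

The 4-line unicode split is exactly what the reporter observed: the '\f' arrives as a line of its own, with no trailing newline.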

    Consider the lines 317-329 in tokenize.py:

    column = 0
    while pos < max:                   # measure leading whitespace
        if line[pos] == ' ':
            column += 1
        elif line[pos] == '\t':
            column = (column//tabsize + 1)*tabsize
        elif line[pos] == '\f':
            column = 0
        else:
            break
        pos += 1
    if pos == max:
        break

    The last 'break' corresponds to the main parsing loop and makes the parsing stop. Thus the lines that consist of (' ', '\t', '\f') characters and don't end with '\n' are treated as the end of file.
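The failure mode can be reproduced in isolation: feed the scan a "line" consisting only of a form feed with no trailing newline, which is exactly what the codecs reader yields, and `pos` reaches `max`, triggering the outer break. A self-contained sketch of that logic (the helper name is illustrative, not from tokenize.py):

```python
def consumes_whole_line(line, tabsize=8):
    """Replicate tokenize's leading-whitespace scan and report whether it
    consumed the entire line (which tokenize then treats as end of file)."""
    pos, max_ = 0, len(line)
    column = 0
    while pos < max_:                  # measure leading whitespace
        if line[pos] == ' ':
            column += 1
        elif line[pos] == '\t':
            column = (column // tabsize + 1) * tabsize
        elif line[pos] == '\f':
            column = 0
        else:
            break
        pos += 1
    return pos == max_

print(consumes_whole_line('\f'))     # True: a bare '\f' line looks like EOF
print(consumes_whole_line('\f\n'))   # False: the '\n' stops the scan
```

With ascii-mode reading the form feed always arrives as '\f\n', so the scan stops at the newline and tokenization continues; the unicode split removes that safeguard.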

    AlexeyUmnov (mannequin) added the stdlib and type-bug labels on Sep 16, 2013
    @bitdancer

    I suspect this isn't the only place where the change in what is considered a (unicode) line ending character between 2.6 and 2.7/python3 is an issue. As you observe, it causes very subtle bugs. I'm going to have to go trolling through the python3 email package looking for places where this could break things :(.


    warsaw commented Mar 2, 2015

    Ha! Apparently this bug broke coverage for the Mailman 3 source code:

    https://bitbucket.org/ned/coveragepy/issue/360/html-reports-get-confused-by-l-in-the-code

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @pablogsal

    I cannot reproduce this anymore on the latest main:

    ❯ ./python.exe lel.py
    [TokenInfo(type=1 (NAME), string='text1', start=(1, 0), end=(1, 5), line='text1'),
     TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 5), end=(1, 6), line='text1'),
     TokenInfo(type=65 (NL), string='\n', start=(2, 1), end=(2, 2), line='\x0c'),
     TokenInfo(type=1 (NAME), string='text2', start=(3, 0), end=(3, 5), line='text2'),
     TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 5), end=(3, 6), line='text2'),
     TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')]
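The script itself is not shown in the thread; a plausible stand-in (the use of io.StringIO in place of a file on disk is an assumption) that produces equivalent output on current Python 3:

```python
import io
import tokenize
from pprint import pprint

# The same three "lines" as the original text.txt: text, form feed, text.
source = 'text1\n\f\ntext2\n'

# generate_tokens accepts any readline callable; io.StringIO splits lines
# only on '\n', so the '\f' no longer masquerades as end of file.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
pprint(tokens)

# Everything after the form feed is tokenized: 'text2' is present.
assert any(tok.string == 'text2' for tok in tokens)
```

The token stream ends with a normal ENDMARKER rather than stopping at the form-feed line, matching the output above.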
    
