
tokenize.generate_tokens treat '\f' symbol as the end of file (when reading in unicode) #63235

Closed
AlexeyUmnov mannequin opened this issue Sep 16, 2013 · 4 comments
Labels
stdlib: Python modules in the Lib dir
type-bug: An unexpected behavior, bug, or error

Comments


AlexeyUmnov mannequin commented Sep 16, 2013

BPO 19035
Nosy @warsaw, @bitdancer
Files
  • tokens.txt
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2013-09-16.14:12:54.000>
    labels = ['type-bug', 'library']
    title = "tokenize.generate_tokens treat '\\f' symbol as the end of file (when reading in unicode)"
    updated_at = <Date 2015-03-02.14:27:38.589>
    user = 'https://bugs.python.org/AlexeyUmnov'

    bugs.python.org fields:

    activity = <Date 2015-03-02.14:27:38.589>
    actor = 'barry'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2013-09-16.14:12:54.000>
    creator = 'Alexey.Umnov'
    dependencies = []
    files = ['31796']
    hgrepos = []
    issue_num = 19035
    keywords = []
    message_count = 3.0
    messages = ['197899', '197910', '237044']
    nosy_count = 3.0
    nosy_names = ['barry', 'r.david.murray', 'Alexey.Umnov']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue19035'
    versions = ['Python 2.7']


    AlexeyUmnov mannequin commented Sep 16, 2013

    I execute the following code on the attached file 'text.txt':

    import tokenize
    import codecs
    
    with open('text.txt', 'r') as f:
        reader = codecs.getreader('utf-8')(f)
        # generate_tokens is lazy; consume the generator to tokenize the file
        tokens = list(tokenize.generate_tokens(reader.readline))

    The file 'text.txt' has the following structure: a first line with some text, then a '\f' character (0x0c) on the second line, and then some text on the last line. The result is that 'generate_tokens' ignores everything after the '\f'.

    I did some debugging and found the following. If the file is read without codecs (in ascii mode), it is considered to have 3 lines: 'text1\n', '\f\n', 'text2\n'. In unicode mode, however, it has 4 lines: 'text1\n', '\f', '\n', 'text2\n'. I assume this has been the intended behaviour since 2.7.x, but it triggers a bug in the tokenize module.
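The difference comes from Python's Unicode line-boundary rules: codecs stream readers split lines with str.splitlines, which treats '\f' (form feed) as a line boundary, while byte-oriented reads split only on '\n'. A minimal sketch of the two splitting behaviours (using str.splitlines directly to stand in for the codecs reader):

```python
# The same three "lines" as text.txt: text, a bare form feed, text.
data = 'text1\n\f\ntext2\n'

# str.splitlines treats '\f' (form feed, 0x0c) as a line boundary,
# so the middle line '\f\n' is split into TWO lines: '\f' and '\n'.
unicode_lines = data.splitlines(keepends=True)
print(unicode_lines)   # ['text1\n', '\x0c', '\n', 'text2\n']

# Splitting only on '\n' (what the ascii-mode readline does) keeps
# '\f\n' together as a single line.
ascii_lines = [line + '\n' for line in data.split('\n')[:-1]]
print(ascii_lines)     # ['text1\n', '\x0c\n', 'text2\n']
```

The 4-line unicode split is exactly what the reporter observed: the '\f' arrives as a line of its own, with no trailing newline.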

    Consider the lines 317-329 in tokenize.py:

    column = 0
    while pos < max:                   # measure leading whitespace
        if line[pos] == ' ':
            column += 1
        elif line[pos] == '\t':
            column = (column//tabsize + 1)*tabsize
        elif line[pos] == '\f':
            column = 0
        else:
            break
        pos += 1
    if pos == max:
        break

    The last 'break' corresponds to the main parsing loop and makes the parsing stop. Thus the lines that consist of (' ', '\t', '\f') characters and don't end with '\n' are treated as the end of file.
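The failure mode can be reproduced in isolation: feed the scan a "line" consisting only of a form feed with no trailing newline, which is exactly what the codecs reader yields, and `pos` reaches `max`, triggering the outer break. A self-contained sketch of that logic (the helper name is illustrative, not from tokenize.py):

```python
def consumes_whole_line(line, tabsize=8):
    """Replicate tokenize's leading-whitespace scan and report whether it
    consumed the entire line (which tokenize then treats as end of file)."""
    pos, max_ = 0, len(line)
    column = 0
    while pos < max_:                  # measure leading whitespace
        if line[pos] == ' ':
            column += 1
        elif line[pos] == '\t':
            column = (column // tabsize + 1) * tabsize
        elif line[pos] == '\f':
            column = 0
        else:
            break
        pos += 1
    return pos == max_

print(consumes_whole_line('\f'))     # True: a bare '\f' line looks like EOF
print(consumes_whole_line('\f\n'))   # False: the '\n' stops the scan
```

With ascii-mode reading the form feed always arrives as '\f\n', so the scan stops at the newline and tokenization continues; the unicode split removes that safeguard.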

    AlexeyUmnov (mannequin) added the stdlib and type-bug labels on Sep 16, 2013
    @bitdancer

    I suspect this isn't the only place where the change in what is considered a (unicode) line ending character between 2.6 and 2.7/python3 is an issue. As you observe, it causes very subtle bugs. I'm going to have to go trolling through the python3 email package looking for places where this could break things :(.


    warsaw commented Mar 2, 2015

    Ha! Apparently this bug broke coverage for the Mailman 3 source code:

    https://bitbucket.org/ned/coveragepy/issue/360/html-reports-get-confused-by-l-in-the-code

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @pablogsal

    I cannot reproduce this anymore on the latest main:

    ❯ ./python.exe lel.py
    [TokenInfo(type=1 (NAME), string='text1', start=(1, 0), end=(1, 5), line='text1'),
     TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 5), end=(1, 6), line='text1'),
     TokenInfo(type=65 (NL), string='\n', start=(2, 1), end=(2, 2), line='\x0c'),
     TokenInfo(type=1 (NAME), string='text2', start=(3, 0), end=(3, 5), line='text2'),
     TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 5), end=(3, 6), line='text2'),
     TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')]
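The script itself is not shown in the thread; a plausible stand-in (the use of io.StringIO in place of a file on disk is an assumption) that produces equivalent output on current Python 3:

```python
import io
import tokenize
from pprint import pprint

# The same three "lines" as the original text.txt: text, form feed, text.
source = 'text1\n\f\ntext2\n'

# generate_tokens accepts any readline callable; io.StringIO splits lines
# only on '\n', so the '\f' no longer masquerades as end of file.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
pprint(tokens)

# Everything after the form feed is tokenized: 'text2' is present.
assert any(tok.string == 'text2' for tok in tokens)
```

The token stream ends with a normal ENDMARKER rather than stopping at the form-feed line, matching the output above.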
    
