
Token lines in the tokenize module always end in new line #104825

Closed
pablogsal opened this issue May 23, 2023 · 7 comments

Comments

@pablogsal
Member

pablogsal commented May 23, 2023

As a consequence of the changes in the Python tokenizer post-PEP 701, the line attribute now always ends in a newline character. We should remove this newline, as it adds no information and wasn't there before.
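A minimal reproducer of the reported behavior (a sketch: on an affected build each token's `line` attribute carries a trailing `'\n'`; exact output depends on the interpreter version):

```python
import io
import tokenize

# Tokenize a one-line source and inspect the `line` attribute of each token.
source = io.StringIO("1 + 2 + 3\n")
for tok in tokenize.generate_tokens(source.readline):
    # On affected post-PEP 701 builds, `tok.line` ends with '\n';
    # previously it matched the physical line without the trailing newline.
    print(tok.type, repr(tok.string), repr(tok.line))
```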

Linked PRs

@lysnikolaou
Contributor

Yeah, I ran across that while working on #104798. Are you already working on a fix or shall I take it up?

@pablogsal
Member Author

> Yeah, I ran across that while working on #104798. Are you already working on a fix or shall I take it up?

I am working on a fix, should be up in half an hour or so

pablogsal added a commit to pablogsal/cpython that referenced this issue May 24, 2023
pablogsal added a commit that referenced this issue May 24, 2023
pablogsal added and removed the "needs backport to 3.12" (bug and security fixes) label May 24, 2023
pablogsal added a commit to pablogsal/cpython that referenced this issue May 24, 2023
… in tokens emitted in the tokenize module (pythonGH-104846).

(cherry picked from commit c8cf9b4)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
@lysnikolaou
Contributor

#104846 now removes a newline, even when a newline exists in the original source.

❯ ./python.exe
Python 3.13.0a0 (heads/setuptools-peg-generator-dirty:c7503656c2, May 24 2023, 12:24:20) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tokenize
>>> import io
>>> file = io.StringIO('1 + 2 + 3\n')
>>> for token in tokenize.generate_tokens(file.readline):
...     print(token)
... 
TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1 + 2 + 3')
TokenInfo(type=55 (OP), string='+', start=(1, 2), end=(1, 3), line='1 + 2 + 3')
TokenInfo(type=2 (NUMBER), string='2', start=(1, 4), end=(1, 5), line='1 + 2 + 3')
TokenInfo(type=55 (OP), string='+', start=(1, 6), end=(1, 7), line='1 + 2 + 3')
TokenInfo(type=2 (NUMBER), string='3', start=(1, 8), end=(1, 9), line='1 + 2 + 3')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 9), end=(1, 10), line='1 + 2 + 3')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

pablogsal added a commit that referenced this issue May 24, 2023
…kens emitted in the tokenize module (GH-104846). (#104850)

(cherry picked from commit c8cf9b4)
@lysnikolaou
Contributor

@pablogsal Do you think the above should be fixed? Will you work on it?

@pablogsal
Member Author

> @pablogsal Do you think the above should be fixed? Will you work on it?

Hummmmm, not sure how easy it is going to be for us to distinguish between implicit and explicit newlines at the end. I would say, leave it as it is now unless you disagree, in which case let's see if we can fix it easily.
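For what it's worth, the two cases can be told apart from the NEWLINE token itself: at least on the CPython versions I checked, an explicit newline in the source yields a NEWLINE token whose string is `'\n'`, while a missing trailing newline makes the tokenizer synthesize a NEWLINE token with an empty string. A sketch:

```python
import io
import tokenize
from token import NEWLINE

def newline_token(src: str) -> tokenize.TokenInfo:
    """Return the first NEWLINE token produced when tokenizing `src`."""
    toks = tokenize.generate_tokens(io.StringIO(src).readline)
    return next(t for t in toks if t.type == NEWLINE)

explicit = newline_token("1 + 2\n")  # source ends with a real newline
implicit = newline_token("1 + 2")    # tokenizer synthesizes the NEWLINE

print(repr(explicit.string))  # '\n'
print(repr(implicit.string))  # typically '' for the synthesized token
```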

@lysnikolaou
Contributor

My only concern is that tests that rely on that will fail. For example, test_pegen now fails:

def test_repeat_0_simple(self) -> None:
    grammar = """
    start: thing thing* NEWLINE
    thing: NUMBER
    """
    parser_class = make_parser(grammar)
    node = parse_string("1 2 3\n", parser_class)
    self.assertEqual(
        node,
        [
            TokenInfo(NUMBER, string="1", start=(1, 0), end=(1, 1), line="1 2 3\n"),
            [
                TokenInfo(
                    NUMBER, string="2", start=(1, 2), end=(1, 3), line="1 2 3\n"
                ),
                TokenInfo(
                    NUMBER, string="3", start=(1, 4), end=(1, 5), line="1 2 3\n"
                ),
            ],
            TokenInfo(
                NEWLINE, string="\n", start=(1, 5), end=(1, 6), line="1 2 3\n"
            ),
        ],
    )
    node = parse_string("1\n", parser_class)
    self.assertEqual(
        node,
        [
            TokenInfo(NUMBER, string="1", start=(1, 0), end=(1, 1), line="1\n"),
            [],
            TokenInfo(NEWLINE, string="\n", start=(1, 1), end=(1, 2), line="1\n"),
        ],
    )

The failure is:

======================================================================
FAIL: test_repeat_0_simple (test.test_peg_generator.test_pegen.TestPegen.test_repeat_0_simple)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/lysnikolaou/repos/python/cpython/Lib/test/test_peg_generator/test_pegen.py", line 248, in test_repeat_0_simple
    self.assertEqual(
AssertionError: Lists differ: [Toke[67 chars]1 2 3'), [TokenInfo(type=2 (NUMBER), string='2[201 chars] 3')] != [Toke[67 chars]1 2 3\n'), [TokenInfo(type=2 (NUMBER), string=[209 chars]\n')]

First differing element 0:
TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1 2 3')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1 2 3\n')

Diff is 1018 characters long. Set self.maxDiff to None to see it.

The fix is very simple, so I'm okay with going with this as-is. I will fix test_pegen in my setuptools PR.
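One way to keep tests like this working across both behaviors is to compare the `line` attribute with any trailing newline stripped. `normalized_lines` below is a hypothetical helper for illustration, not part of the test suite:

```python
import io
import tokenize

def normalized_lines(src: str) -> list[str]:
    """Hypothetical helper: collect each token's `line` attribute with any
    trailing newline removed, so assertions pass whether or not the
    tokenizer keeps the '\n'."""
    return [
        tok.line.rstrip("\n")
        for tok in tokenize.generate_tokens(io.StringIO(src).readline)
    ]

# Usage: every normalized line is newline-free regardless of tokenizer version.
assert all("\n" not in line for line in normalized_lines("1 + 2\n"))
```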

terryjreedy added a commit to terryjreedy/cpython that referenced this issue May 24, 2023
Order of events:
Terry merged new idlelib test into main.
Ms. I. made a 3.12 backport; tests passed.
Pablo merged the tokenize change with idlelib test fix into main.
Pablo merged a 3.12 backport without the idle test fix
as the backport of the latter had not yet been merged.
Terry merged the idlelib test backport.  The new test failed
on at least 4 3.12 buildbots because of the tokenize change.
This PR backports the now needed idlelib test fix.

(cherry picked from commit c8cf9b4)
pablogsal pushed a commit that referenced this issue May 24, 2023
Order of events:
Terry merged new idlelib test into main.
Ms. I. made a 3.12 backport; tests passed.
Pablo merged the tokenize change with idlelib test fix into main.
Pablo merged a 3.12 backport without the idle test fix
as the backport of the latter had not yet been merged.
Terry merged the idlelib test backport.  The new test failed
on at least 4 3.12 buildbots because of the tokenize change.
This PR backports the now needed idlelib test fix.

(cherry picked from commit c8cf9b4)
@nedbat
Member

nedbat commented May 26, 2023

You say:

> As a consequence of the changes in the Python tokenizer post PEP701, now the line attribute always ends in a new line character. We should remove this as it adds no information and wasn't there before.

But it looks to me like the lines always used to end in a newline, except for ENDMARKER. This uses tokbug.py from #104972:

3.10.9 (main, Dec  7 2022, 07:15:24) [Clang 12.0.0 (clang-1200.0.32.29)]
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=1 (NAME), string='a', start=(2, 0), end=(2, 1), line='a + \\\n')
TokenInfo(type=54 (OP), string='+', start=(2, 2), end=(2, 3), line='a + \\\n')
TokenInfo(type=1 (NAME), string='b', start=(3, 0), end=(3, 1), line='b\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 1), end=(3, 2), line='b\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')
