
Python 3.12 tokenize generates invalid locations for f'\N{unicode}' #115154

Closed
smartbomb opened this issue Feb 8, 2024 · 3 comments
Labels
topic-parser type-bug An unexpected behavior, bug, or error

Comments

@smartbomb

smartbomb commented Feb 8, 2024

Bug report

Bug description:

from tokenize import untokenize, generate_tokens
from io import StringIO

untokenize(generate_tokens(StringIO("f'\\N{EXCLAMATION MARK}'").readline))

ValueError: start (1,22) precedes previous end (1,24)
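For contrast, a minimal round-trip on source that contains no \N{...} inside an f-string succeeds; this sketch (not from the report itself) shows the behavior the bug breaks:

```python
from io import StringIO
from tokenize import generate_tokens, untokenize

# With full (type, string, start, end, line) tuples, untokenize
# normally reconstructs the original source exactly.
src = "x = 1 + 2\n"
tokens = list(generate_tokens(StringIO(src).readline))
assert untokenize(tokens) == src
```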

CPython versions tested on:

3.12

Operating systems tested on:

Windows

Linked PRs

@smartbomb smartbomb added the type-bug An unexpected behavior, bug, or error label Feb 8, 2024
@Eclips4
Copy link
Member

Eclips4 commented Feb 8, 2024

cc @pablogsal

@terryjreedy
Member
The tokens from parsing f'\\N{EXCLAMATION MARK}', and the traceback from untokenizing the list (toks), are:

[TokenInfo(type=59 (FSTRING_START), string="f'", start=(1, 0), end=(1, 2), line="f'\\N{EXCLAMATION MARK}'"),
 TokenInfo(type=60 (FSTRING_MIDDLE), string='\\N{EXCLAMATION MARK}', start=(1, 2), end=(1, 22), line="f'\\N{EXCLAMATION MARK}'"),
 TokenInfo(type=61 (FSTRING_END), string="'", start=(1, 22), end=(1, 23), line="f'\\N{EXCLAMATION MARK}'"),
 TokenInfo(type=4 (NEWLINE), string='', start=(1, 23), end=(1, 24), line="f'\\N{EXCLAMATION MARK}'"),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
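The token list above can be reproduced with the following snippet (the FSTRING_* token types only appear on Python 3.12 and later):

```python
from io import StringIO
from tokenize import generate_tokens

# Tokenize the f-string source; on 3.12+ this yields FSTRING_START,
# FSTRING_MIDDLE, and FSTRING_END tokens as shown above.
toks = list(generate_tokens(StringIO("f'\\N{EXCLAMATION MARK}'").readline))
for tok in toks:
    print(tok)
```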
Traceback (most recent call last):
  File "F:\dev\tem\tem2.py", line 6, in <module>
    print(untokenize(toks))
  File "C:\Programs\Python313\Lib\tokenize.py", line 294, in untokenize
    out = ut.untokenize(iterable)
  File "C:\Programs\Python313\Lib\tokenize.py", line 223, in untokenize
    self.add_whitespace(start)
  File "C:\Programs\Python313\Lib\tokenize.py", line 176, in add_whitespace
    raise ValueError("start ({},{}) precedes previous end ({},{})"
ValueError: start (1,22) precedes previous end (1,24)

The problem is that when Untokenizer.untokenize (line 215 of tokenize.py) sees the FSTRING_MIDDLE token, it replaces '{' and '}' with '{{' and '}}' and bumps the end position by 2, making the end column 2 greater than the next token's start column. This is correct when the curly brackets result from the reverse replacement, but not when the lexer recognizes \N{name} as a unicode named escape and leaves it in place. Other escapes are resolved, as with '\ueeee' being tokenized as a single character. Unless the tokenizer replaces \N{name} with a character, the untokenizer must also recognize the sequence and skip the replacement there.
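A sketch of that suggested direction (a hypothetical helper, not the actual patch from the linked PRs): double braces during untokenization except where they belong to a \N{...} named escape:

```python
import re

# Matches a \N{NAME} named unicode escape in the token's source text.
_NAMED_ESCAPE = re.compile(r'\\N\{[^}]*\}')

def escape_braces_outside_named_escapes(s):
    """Double '{' and '}' for untokenization, but leave the braces of
    \\N{...} named escapes untouched (illustrative sketch only)."""
    out = []
    i = 0
    while i < len(s):
        m = _NAMED_ESCAPE.match(s, i)
        if m:
            # Copy the whole named escape verbatim, braces included.
            out.append(m.group())
            i = m.end()
        elif s[i] in '{}':
            out.append(s[i] * 2)
            i += 1
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)
```

With this, a literal brace still round-trips as '{{' while the named escape keeps its original length, so token end columns stay consistent with the next token's start.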

@pablogsal
Member

CC: @isidentical

pablogsal added a commit to pablogsal/cpython that referenced this issue Feb 8, 2024
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
pablogsal added a commit to pablogsal/cpython that referenced this issue Feb 11, 2024
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Feb 19, 2024
…ythonGH-115171)

(cherry picked from commit ecf16ee)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
pablogsal added a commit that referenced this issue Feb 19, 2024
…H-115171) (#115662)

gh-115154: Fix untokenize handling of unicode named literals (GH-115171)
(cherry picked from commit ecf16ee)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>

4 participants