
generate_tokens starts to give SyntaxError #105549

Closed
sylee957 opened this issue Jun 9, 2023 · 6 comments

Labels
3.12 bugs and security fixes · 3.13 new features, bugs and security fixes · type-bug An unexpected behavior, bug, or error

Comments


sylee957 commented Jun 9, 2023

Bug report

generate_tokens raises an error in Python 3.12:

from tokenize import generate_tokens
from io import StringIO
list(generate_tokens(StringIO('01234').readline))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sylee957/.pyenv/versions/3.12.0b1/lib/python3.12/tokenize.py", line 451, in _tokenize
    for token in _generate_tokens_from_c_tokenizer(source, extra_tokens=True):
  File "/home/sylee957/.pyenv/versions/3.12.0b1/lib/python3.12/tokenize.py", line 542, in _generate_tokens_from_c_tokenizer
    for info in c_tokenizer.TokenizerIter(source, extra_tokens=extra_tokens):
  File "<string>", line 1
    01234
    ^
SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers

In Python 3.11, the same call succeeds and returns a result:

[TokenInfo(type=2 (NUMBER), string='0', start=(1, 0), end=(1, 1), line='01234'),
 TokenInfo(type=2 (NUMBER), string='1234', start=(1, 1), end=(1, 5), line='01234'),
 TokenInfo(type=4 (NEWLINE), string='', start=(1, 5), end=(1, 6), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
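For reference, a valid numeric literal tokenizes as a single NUMBER token on both 3.11 and 3.12; a minimal check (an illustration, not part of the original report):

```python
from io import StringIO
from tokenize import generate_tokens, NUMBER

# '0o1234' is a valid octal literal, so every supported version
# tokenizes it the same way: one NUMBER token covering the literal.
tokens = list(generate_tokens(StringIO("0o1234").readline))
number_tokens = [t for t in tokens if t.type == NUMBER]
assert len(number_tokens) == 1 and number_tokens[0].string == "0o1234"
```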

This is related to an issue in SymPy:
sympy/sympy#25185

Your environment

  • CPython versions tested on: 3.12.0b1
  • Operating system and architecture: Ubuntu 22.04.2 LTS

@sylee957 sylee957 added the type-bug An unexpected behavior, bug, or error label Jun 9, 2023
pablogsal commented Jun 9, 2023

Please see #105238 regarding the new tokenize SyntaxErrors for invalid Python.

In particular, see #105238 (comment)

We could try to make it work to make your life easier, but we will not be able to preserve the previous output of NUMBER + NUMBER because that was an idiosyncrasy of the previous tokenizer for invalid Python. What we could do is emit this:

[TokenInfo(type=2 (NUMBER), string='01234', start=(1, 0), end=(1, 5), line='01234\n'),
 TokenInfo(type=4 (NEWLINE), string='', start=(1, 5), end=(1, 6), line='01234\n'),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]


pablogsal commented Jun 9, 2023

Would that work for you @sylee957 ?

pablogsal commented Jun 9, 2023

I have run the failing test with this change and it seems to work:

❯ python -m pytest sympy/core/tests/test_sympify.py -k test_sympify1 -v
/Users/pgalindo3/github/python/3.12/venv/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:281: DeprecationWarning: 'importlib.abc.TraversableResources' is deprecated and slated for removal in Python 3.14
  def get_resource_reader(self, name: str) -> importlib.abc.TraversableResources:  # type: ignore
================================================================================================= test session starts ==================================================================================================
platform darwin -- Python 3.12.0b1+, pytest-7.0.1, pluggy-1.0.0 -- /Users/pgalindo3/github/python/3.12/venv/bin/python
cachedir: .pytest_cache
architecture: 64-bit
cache:        yes
ground types: python

rootdir: /Users/pgalindo3/github/sympy, configfile: pyproject.toml
plugins: asyncio-0.21.0
asyncio: mode=Mode.STRICT
collected 54 items / 53 deselected / 1 selected

sympy/core/tests/test_sympify.py::test_sympify1 PASSED                                                                                                                                                           [100%]

=================================================================================================== warnings summary ===================================================================================================
conftest.py:29
  /Users/pgalindo3/github/sympy/conftest.py:29: ResourceWarning: unclosed file <_io.TextIOWrapper name='/Users/pgalindo3/github/sympy/.ci/durations.json' mode='rt' encoding='UTF-8'>
    veryslow_group, slow_group = [_mk_group(group_dict) for group_dict in json.loads(open(durations_path, 'rt').read())]

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================================================== 1 passed, 53 deselected, 1 warning in 0.07s ======================================================================================

@sylee957 sylee957 reopened this Jun 9, 2023
@pablogsal

I am a bit torn: making it work for this case is easy, but supporting the other things you seem to need, like 0x (or 0b or 0o) being valid on its own, is going to introduce a lot of complexity in the tokenizer that we really cannot take on, and it was never part of the guarantees of the tokenize module (because it is invalid Python, and the tokenize module "provides a lexical scanner for Python source code").

@lysnikolaou what do you think?
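For downstream code that must tokenize such invalid literals anyway, one possible workaround is to normalize the source before tokenizing. This is a hedged sketch under that assumption, not what SymPy or CPython actually do:

```python
import re
from io import StringIO
from tokenize import generate_tokens

# Naive pre-processing sketch: strip leading zeros from decimal
# integer literals so the 3.12 tokenizer accepts them.  Illustration
# only -- this regex would also rewrite digits inside strings and
# comments, so real code would need a smarter pass.
_LEADING_ZERO = re.compile(r"\b0+([1-9]\d*)\b")

def tokenize_lenient(source):
    cleaned = _LEADING_ZERO.sub(r"\1", source)
    return list(generate_tokens(StringIO(cleaned).readline))
```

Note that `tokenize_lenient` is a hypothetical helper name; the token positions it reports refer to the rewritten source, not the original.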

@AlexWaygood AlexWaygood added 3.12 bugs and security fixes 3.13 new features, bugs and security fixes labels Jun 9, 2023

aroberge commented Jun 9, 2023

I don't know if this would help solve the problem, but might it not be possible to emit a new subclass of SyntaxError when an exception is raised during the tokenizing stage? Users of the tokenize module might then be in a position to attempt some targeted error recovery.


pablogsal commented Jun 9, 2023

We are already emitting TokenError instead of SyntaxError in the latest 3.12. Is this what you are referring to?

Users of the tokenize module might then be in a position to attempt some targeted error recovery.

I don't think this helps, because most people are relying on very specific behavior for invalid code, like the emission of ERRORTOKEN, the lack of NL tokens before EOF, or similar details. I'm not sure how they could fix that on their side without bumping into this again in the future.
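Callers that previously tolerated invalid input can wrap tokenization defensively so both the old and new error behaviors are handled. A sketch that runs on 3.11 and 3.12 (illustrative, not an official recipe; `tokens_until_error` is a hypothetical helper name):

```python
import tokenize
from io import StringIO

def tokens_until_error(source):
    """Collect tokens, stopping at the first tokenization error.

    On 3.11, invalid literals like '01234' tokenized as NUMBER + NUMBER;
    early 3.12 betas raised SyntaxError for them; the latest 3.12 emits
    TokenError for invalid input instead.  Catching both keeps callers
    working across versions.
    """
    collected = []
    try:
        for tok in tokenize.generate_tokens(StringIO(source).readline):
            collected.append(tok)
    except (tokenize.TokenError, SyntaxError):
        pass
    return collected
```

Usage: valid source yields the full token stream ending in ENDMARKER, while input such as an unclosed `(` returns only the tokens seen before the error instead of propagating it.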

pablogsal added a commit to pablogsal/cpython that referenced this issue Jun 9, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jun 9, 2023
… 0-prefixed literals (pythonGH-105555)

(cherry picked from commit b047fa5)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
pablogsal added a commit that referenced this issue Jun 9, 2023
…w 0-prefixed literals (GH-105555) (#105602)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>