
generate_tokens starts to give SyntaxError #105549

Closed
sylee957 opened this issue Jun 9, 2023 · 6 comments

Labels
3.12 bugs and security fixes · 3.13 new features, bugs and security fixes · type-bug An unexpected behavior, bug, or error

Comments


sylee957 commented Jun 9, 2023

Bug report

generate_tokens raises an error in Python 3.12:

from tokenize import generate_tokens
from io import StringIO
list(generate_tokens(StringIO('01234').readline))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sylee957/.pyenv/versions/3.12.0b1/lib/python3.12/tokenize.py", line 451, in _tokenize
    for token in _generate_tokens_from_c_tokenizer(source, extra_tokens=True):
  File "/home/sylee957/.pyenv/versions/3.12.0b1/lib/python3.12/tokenize.py", line 542, in _generate_tokens_from_c_tokenizer
    for info in c_tokenizer.TokenizerIter(source, extra_tokens=extra_tokens):
  File "<string>", line 1
    01234
    ^
SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers

In Python 3.11, the same call succeeds and returns a result:

[TokenInfo(type=2 (NUMBER), string='0', start=(1, 0), end=(1, 1), line='01234'),
 TokenInfo(type=2 (NUMBER), string='1234', start=(1, 1), end=(1, 5), line='01234'),
 TokenInfo(type=4 (NEWLINE), string='', start=(1, 5), end=(1, 6), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
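For reference, a valid numeric literal tokenizes as a single NUMBER token on both 3.11 and 3.12; a minimal check (an illustration, not part of the original report):

```python
from io import StringIO
from tokenize import generate_tokens, NUMBER

# '0o1234' is a valid octal literal, so every supported version
# tokenizes it the same way: one NUMBER token covering the literal.
tokens = list(generate_tokens(StringIO("0o1234").readline))
number_tokens = [t for t in tokens if t.type == NUMBER]
assert len(number_tokens) == 1 and number_tokens[0].string == "0o1234"
```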

This is related to an issue in SymPy:
sympy/sympy#25185

Your environment

  • CPython versions tested on: 3.12.0b1
  • Operating system and architecture: Ubuntu 22.04.2 LTS

@sylee957 sylee957 added the type-bug An unexpected behavior, bug, or error label Jun 9, 2023
pablogsal commented Jun 9, 2023

Please see #105238 regarding the new tokenize SyntaxErrors for invalid Python.

In particular, see #105238 (comment)

We could try to make it work to make your life easier, but we will not be able to preserve the previous output of NUMBER + NUMBER because that was an idiosyncrasy of the previous tokenizer for invalid Python. What we could do is emit this:

[TokenInfo(type=2 (NUMBER), string='01234', start=(1, 0), end=(1, 5), line='01234\n'),
 TokenInfo(type=4 (NEWLINE), string='', start=(1, 5), end=(1, 6), line='01234\n'),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]


pablogsal commented Jun 9, 2023

Would that work for you @sylee957 ?

pablogsal commented Jun 9, 2023

I have run the failing test with this change and it seems to work:

❯ python -m pytest sympy/core/tests/test_sympify.py -k test_sympify1 -v
/Users/pgalindo3/github/python/3.12/venv/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:281: DeprecationWarning: 'importlib.abc.TraversableResources' is deprecated and slated for removal in Python 3.14
  def get_resource_reader(self, name: str) -> importlib.abc.TraversableResources:  # type: ignore
================================================================================================= test session starts ==================================================================================================
platform darwin -- Python 3.12.0b1+, pytest-7.0.1, pluggy-1.0.0 -- /Users/pgalindo3/github/python/3.12/venv/bin/python
cachedir: .pytest_cache
architecture: 64-bit
cache:        yes
ground types: python

rootdir: /Users/pgalindo3/github/sympy, configfile: pyproject.toml
plugins: asyncio-0.21.0
asyncio: mode=Mode.STRICT
collected 54 items / 53 deselected / 1 selected

sympy/core/tests/test_sympify.py::test_sympify1 PASSED                                                                                                                                                           [100%]

=================================================================================================== warnings summary ===================================================================================================
conftest.py:29
  /Users/pgalindo3/github/sympy/conftest.py:29: ResourceWarning: unclosed file <_io.TextIOWrapper name='/Users/pgalindo3/github/sympy/.ci/durations.json' mode='rt' encoding='UTF-8'>
    veryslow_group, slow_group = [_mk_group(group_dict) for group_dict in json.loads(open(durations_path, 'rt').read())]

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================================================== 1 passed, 53 deselected, 1 warning in 0.07s ======================================================================================

@sylee957 sylee957 reopened this Jun 9, 2023
@pablogsal

I am a bit torn: making it work for this case is easy, but supporting the other things you seem to need, like 0x (or 0b or 0o) being valid on its own, is going to introduce a lot of complexity in the tokenizer that we really cannot take on, and it was never part of the guarantees of the tokenize module (because it is invalid Python, and the tokenize module "provides a lexical scanner for Python source code").

@lysnikolaou what do you think?
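For downstream code that must tokenize such invalid literals anyway, one possible workaround is to normalize the source before tokenizing. This is a hedged sketch under that assumption, not what SymPy or CPython actually do:

```python
import re
from io import StringIO
from tokenize import generate_tokens

# Naive pre-processing sketch: strip leading zeros from decimal
# integer literals so the 3.12 tokenizer accepts them.  Illustration
# only -- this regex would also rewrite digits inside strings and
# comments, so real code would need a smarter pass.
_LEADING_ZERO = re.compile(r"\b0+([1-9]\d*)\b")

def tokenize_lenient(source):
    cleaned = _LEADING_ZERO.sub(r"\1", source)
    return list(generate_tokens(StringIO(cleaned).readline))
```

Note that `tokenize_lenient` is a hypothetical helper name; the token positions it reports refer to the rewritten source, not the original.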

@AlexWaygood AlexWaygood added 3.12 bugs and security fixes 3.13 new features, bugs and security fixes labels Jun 9, 2023

aroberge commented Jun 9, 2023

I don't know if this would help solve the problem, but might it not be possible to emit a new subclass of SyntaxError when an exception is raised during the tokenizing stage? Users of the tokenize module might then be in a position to attempt some targeted error recovery.


pablogsal commented Jun 9, 2023

We are already emitting TokenError instead of SyntaxError in the latest 3.12. Is this what you are referring to?

Users of the tokenize module might then be in a position to attempt some targeted error recovery.

I don't think this helps, because most people are relying on very specific behavior for invalid code, like the emission of ERRORTOKEN, the lack of NL tokens before EOF, or similar details. I'm not sure how they could fix that on their side without bumping into this again in the future.
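Callers that previously tolerated invalid input can wrap tokenization defensively so both the old and new error behaviors are handled. A sketch that runs on 3.11 and 3.12 (illustrative, not an official recipe; `tokens_until_error` is a hypothetical helper name):

```python
import tokenize
from io import StringIO

def tokens_until_error(source):
    """Collect tokens, stopping at the first tokenization error.

    On 3.11, invalid literals like '01234' tokenized as NUMBER + NUMBER;
    early 3.12 betas raised SyntaxError for them; the latest 3.12 emits
    TokenError for invalid input instead.  Catching both keeps callers
    working across versions.
    """
    collected = []
    try:
        for tok in tokenize.generate_tokens(StringIO(source).readline):
            collected.append(tok)
    except (tokenize.TokenError, SyntaxError):
        pass
    return collected
```

Usage: valid source yields the full token stream ending in ENDMARKER, while input such as an unclosed `(` returns only the tokens seen before the error instead of propagating it.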

pablogsal added a commit to pablogsal/cpython that referenced this issue Jun 9, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jun 9, 2023
… 0-prefixed literals (pythonGH-105555)

(cherry picked from commit b047fa5)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
pablogsal added a commit that referenced this issue Jun 9, 2023
…w 0-prefixed literals (GH-105555) (#105602)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>