bpo-33338: [lib2to3] Synchronize token.py and tokenize.py with stdlib #6572


Closed

Conversation

@ambv (Contributor) commented Apr 23, 2018

(This is Step 1 in BPO-33337. See there for larger context.)

lib2to3's token.py and tokenize.py were initially copies of the respective
files from the standard library. They were copied to allow Python 3 to read
Python 2's grammar.

Since 2006, lib2to3 has grown to be widely used as a concrete syntax tree,
including for parsing Python 3 code. Additions to support the Python 3 grammar
were made but, sadly, lib2to3's copies diverged from the main token.py and
tokenize.py.

This change brings them back together, reducing the differences to the bare
minimum that lib2to3 in fact requires. Before this change, almost every line
in lib2to3/pgen2/tokenize.py differed from tokenize.py. After this change, the
diff between the two files is only 200 lines long and consists entirely of
relevant Python 2 compatibility bits.

Merging the implementations brought numerous fixes to the lib2to3 tokenizer:

  • docstrings made as similar as possible
  • ported `TokenInfo`
  • ported `tokenize.tokenize()` and `tokenize.open()` (see the sketch after
    this list)
  • removed Python 2-only implementation cruft
  • made Unicode identifier handling the same
  • made string prefix handling the same
  • added Ellipsis to the Special group
  • Untokenizer backported bugfixes:
    - 5e6db31
    - 9dc3a36
    - 5b8d2c3
    - e411b66
    - BPO-2495
  • detect_encoding tries to figure out a filename and find_cookie uses
    the filename in error messages, if available
  • find_cookie bugfix: BPO-14990
  • BPO-16152: the tokenizer no longer crashes on a missing newline at the
    end of the stream (added \Z (end of string) to PseudoExtras)
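For context, a minimal sketch of the interface being ported, using the stdlib
Lib/tokenize.py that the lib2to3 copy now mirrors: `tokenize.tokenize()` takes
a `readline` callable over bytes, auto-detects the encoding, and yields
`TokenInfo` namedtuples, starting with an ENCODING token:

```python
import io
import tokenize

source = b"# -*- coding: utf-8 -*-\nx = ...\n"
for tok in tokenize.tokenize(io.BytesIO(source).readline):
    # Each tok is a TokenInfo namedtuple: (type, string, start, end, line).
    # exact_type distinguishes operators, e.g. ELLIPSIS for '...'.
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))
```

The first line printed is `ENCODING 'utf-8'`, and `...` surfaces as a single
three-character token rather than three DOTs.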

Improvements to token.py:

  • taken from the current Lib/token.py
  • tokens renumbered to match Lib/token.py
  • __all__ properly defined
  • ASYNC, AWAIT and BACKQUOTE exist under different numbers (100 + old number)
  • ELLIPSIS added
  • ENCODING added
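With both files derived from the same source, the renumbering can be
sanity-checked directly. A minimal sketch, assuming a checkout where this
change is applied (`tok_name` exists in both modules; the bookkeeping
constants N_TOKENS and NT_OFFSET are excluded since they legitimately differ):

```python
import token as stdlib_token
from lib2to3.pgen2 import token as pgen2_token

# Token names defined in both modules should now map to identical numbers;
# lib2to3-only tokens such as BACKQUOTE live at 100 + their old number.
skip = {"N_TOKENS", "NT_OFFSET"}
common = (set(stdlib_token.tok_name.values())
          & set(pgen2_token.tok_name.values())) - skip
for name in sorted(common):
    assert getattr(stdlib_token, name) == getattr(pgen2_token, name), name
```

Before this change the assertion fails for most names; after it, the loop
runs clean.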

https://bugs.python.org/issue33338

@@ -110,7 +110,7 @@ atom: ('(' [yield_expr|testlist_gexp] ')' |
'[' [listmaker] ']' |
'{' [dictsetmaker] '}' |
'`' testlist1 '`' |
NAME | NUMBER | STRING+ | '.' '.' '.')
Contributor
I'm pretty sure this is required to parse Python 2, where something like `x[. ..]` is a perfectly valid way of writing `x[...]`.

Contributor Author

Sigh, you're right. I doubt anybody actually does this in practice, though.

But yeah, for completeness we'd have to retain atoms with three dots. Technically not a loss in functionality, but definitely a loss in convenience for the programmer.
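A minimal sketch of the point under discussion, driving lib2to3 directly (on
an interpreter that still ships lib2to3): the Python 2 grammar spells Ellipsis
as three separate `'.'` tokens, so whitespace between the dots still parses.

```python
from lib2to3 import pygram, pytree
from lib2to3.pgen2 import driver

d = driver.Driver(pygram.python_grammar, convert=pytree.convert)
# Both spellings parse, because the grammar matches three DOT tokens
# rather than a single ELLIPSIS token.
for src in ("x[...]\n", "x[. ..]\n"):
    tree = d.parse_string(src)
    print(repr(src), "parses to", repr(str(tree)))
```

Dropping the three-dot atom would make the second spelling a parse error,
which is the Python 2 compatibility loss being weighed here.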

automatic encoding detection and yielding ENCODING tokens;
- Unicode identifiers are now supported;
- ELLIPSIS is its own token type now;
- Untokenizer improved with backports of 5e6db31, 9dc3a36, 5b8d2c3, e411b66,
  and BPO-2495.
Member

The problem was:

Warning, treated as error:
../build/NEWS:137:Bullet list ends without a blank line; unexpected unindent.

which I think means that this is the correct rst format:

- bla bla item one
  second line is indented

- bla bla second item after blank line
