bpo-33338: [lib2to3] Synchronize token.py and tokenize.py with stdlib #6572


Closed

Conversation

@ambv (Contributor) commented Apr 23, 2018

(This is Step 1 in BPO-33337. See there for larger context.)

lib2to3's token.py and tokenize.py were initially copies of the respective
files from the standard library. They were copied to allow Python 3 to read
Python 2's grammar.

Since 2006, lib2to3 has grown to be widely used as a concrete syntax tree,
including for parsing Python 3 code. Additions to support the Python 3 grammar
were made but, sadly, lib2to3's copies diverged from the main token.py and
tokenize.py.

This change brings them back together, reducing the differences to the bare
minimum that lib2to3 in fact requires. Before this change, almost every line
in lib2to3/pgen2/tokenize.py differed from tokenize.py. After this change, the
diff between the two files is only 200 lines long and consists entirely of
relevant Python 2 compatibility bits.

Merging the implementations brought numerous fixes to the lib2to3 tokenizer:

  • docstrings made as similar as possible
  • ported `TokenInfo`
  • ported `tokenize.tokenize()` and `tokenize.open()` (see the sketch after
    this list)
  • removed Python 2-only implementation cruft
  • made Unicode identifier handling the same
  • made string prefix handling the same
  • added Ellipsis to the Special group
  • Untokenizer backported bugfixes:
    - 5e6db31
    - 9dc3a36
    - 5b8d2c3
    - e411b66
    - BPO-2495
  • detect_encoding tries to figure out a filename and find_cookie uses
    the filename in error messages, if available
  • find_cookie bugfix: BPO-14990
  • BPO-16152: the tokenizer no longer crashes on a missing newline at the
    end of the stream (added \Z (end of string) to PseudoExtras)
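For context, a minimal sketch of the interface being ported, using the stdlib
Lib/tokenize.py that the lib2to3 copy now mirrors: `tokenize.tokenize()` takes
a `readline` callable over bytes, auto-detects the encoding, and yields
`TokenInfo` namedtuples, starting with an ENCODING token:

```python
import io
import tokenize

source = b"# -*- coding: utf-8 -*-\nx = ...\n"
for tok in tokenize.tokenize(io.BytesIO(source).readline):
    # Each tok is a TokenInfo namedtuple: (type, string, start, end, line).
    # exact_type distinguishes operators, e.g. ELLIPSIS for '...'.
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))
```

The first line printed is `ENCODING 'utf-8'`, and `...` surfaces as a single
three-character token rather than three DOTs.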

Improvements to token.py:

  • taken from the current Lib/token.py
  • tokens renumbered to match Lib/token.py
  • __all__ properly defined
  • ASYNC, AWAIT and BACKQUOTE exist under different numbers (100 + old number)
  • ELLIPSIS added
  • ENCODING added
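With both files derived from the same source, the renumbering can be
sanity-checked directly. A minimal sketch, assuming a checkout where this
change is applied (`tok_name` exists in both modules; the bookkeeping
constants N_TOKENS and NT_OFFSET are excluded since they legitimately differ):

```python
import token as stdlib_token
from lib2to3.pgen2 import token as pgen2_token

# Token names defined in both modules should now map to identical numbers;
# lib2to3-only tokens such as BACKQUOTE live at 100 + their old number.
skip = {"N_TOKENS", "NT_OFFSET"}
common = (set(stdlib_token.tok_name.values())
          & set(pgen2_token.tok_name.values())) - skip
for name in sorted(common):
    assert getattr(stdlib_token, name) == getattr(pgen2_token, name), name
```

Before this change the assertion fails for most names; after it, the loop
runs clean.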

https://bugs.python.org/issue33338

@@ -110,7 +110,7 @@ atom: ('(' [yield_expr|testlist_gexp] ')' |
'[' [listmaker] ']' |
'{' [dictsetmaker] '}' |
'`' testlist1 '`' |
NAME | NUMBER | STRING+ | '.' '.' '.')
Contributor
I'm pretty sure this is required to parse Python 2, where something like `x[. ..]` is a perfectly valid way of writing `x[...]`.

Contributor Author

Sigh, you're right. I doubt anybody actually does this in practice, though.

But yeah, for completeness we'd have to retain atoms with three dots. Technically not a loss in functionality, but definitely a loss in convenience for the programmer.
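A minimal sketch of the point under discussion, driving lib2to3 directly (on
an interpreter that still ships lib2to3): the Python 2 grammar spells Ellipsis
as three separate `'.'` tokens, so whitespace between the dots still parses.

```python
from lib2to3 import pygram, pytree
from lib2to3.pgen2 import driver

d = driver.Driver(pygram.python_grammar, convert=pytree.convert)
# Both spellings parse, because the grammar matches three DOT tokens
# rather than a single ELLIPSIS token.
for src in ("x[...]\n", "x[. ..]\n"):
    tree = d.parse_string(src)
    print(repr(src), "parses to", repr(str(tree)))
```

Dropping the three-dot atom would make the second spelling a parse error,
which is the Python 2 compatibility loss being weighed here.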

automatic encoding detection and yielding ENCODING tokens;
- Unicode identifiers are now supported;
- ELLIPSIS is its own token type now;
- Untokenizer improved with backports of 5e6db31, 9dc3a36, 5b8d2c3, e411b66,
  and BPO-2495.
Member

The problem was:

Warning, treated as error:
../build/NEWS:137:Bullet list ends without a blank line; unexpected unindent.

which I think means that this is the correct rst format:

- bla bla item one
  second line is indented

- bla bla second item after blank line
