
bpo-42729: Introduce ast.parse_tokens() to interface with "tokenize" #23922

Closed
wants to merge 1 commit

Conversation


@pfalcon pfalcon commented Dec 24, 2020

Currently, with Python it's possible:

  • To get from the stream-of-characters program representation to the
    AST representation (ast.parse()).
  • To get from an AST to a code object (compile()).
  • To get from a code object to a first-class function that executes
    the program.
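
For example, that existing pipeline can be driven end to end with nothing but the stdlib (a short illustration, not part of the patch):

```python
import ast

source = "def add(a, b):\n    return a + b\n"

tree = ast.parse(source)                   # characters -> AST
code = compile(tree, "<string>", "exec")   # AST -> code object
namespace = {}
exec(code, namespace)                      # code object -> live function
print(namespace["add"](2, 3))              # prints 5
```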

Python also offers the "tokenize" module, but it stands as a disconnected
island: the only thing it allows is to get from the stream-of-characters
program representation to a stream of tokens, and back. Yet conceptually,
tokenization is not a disconnected feature; it is the first stage of the
language processing pipeline. The fact that "tokenize" is disconnected
from the rest of the pipeline listed above is largely an artifact of the
CPython implementation: both the "ast" module and the compile() builtin
are backed by the underlying bytecode compiler written in C, and that is
what connects them.

On the other hand, the "tokenize" module is pure Python, while the
underlying compiler has its own tokenizer implementation (not exposed).
That is the likely reason for the disconnect between "tokenize" and the
rest of the infrastructure.
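
Today the round trip that "tokenize" supports looks like this (standard-library APIs only):

```python
import io
import tokenize

source = "x = 1 + 2\n"

# Stream-of-characters -> stream-of-tokens
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string))

# ... and back: stream-of-tokens -> stream-of-characters
print(tokenize.untokenize(tokens), end="")
```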

This patch closes that gap and establishes an API which allows parsing a
token stream (iterable) into an AST. The initial implementation for
CPython is naive, taking a round trip through the surface (source)
program representation. That is considered acceptable, as the idea is to
establish a standard API for going tokens -> AST; individual Python
implementations can then optimize it based on their needs.

The function introduced here is ast.parse_tokens(). It follows the
signature of the existing ast.parse(), except that the first parameter
is "token_stream" instead of "source".

An alternative would be to overload the existing ast.parse() to also
accept a token iterable. I guess that at the current stage, where we are
trying to tighten the type strictness of the API and have clear typing
signatures for API functions, that is not the favored solution.

Signed-off-by: Paul Sokolovsky <pfalcon@users.sourceforge.net>

https://bugs.python.org/issue42729

@lysnikolaou
Contributor

Thanks for the patch @pfalcon, but I'm closing this as discussed on the bpo issue.
