
bpo-42729: Introduce ast.parse_tokens() to interface with "tokenize" #23922

Closed
wants to merge 1 commit

Conversation


@pfalcon pfalcon commented Dec 24, 2020

Currently, with Python it's possible:

  • To get from the stream-of-characters program representation to the
    AST representation (ast.parse()).
  • To get from an AST to a code object (compile()).
  • To get from a code object to a first-class function that executes
    the program.
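
For example, that existing pipeline can be driven end to end with nothing but the stdlib (a short illustration, not part of the patch):

```python
import ast

source = "def add(a, b):\n    return a + b\n"

tree = ast.parse(source)                   # characters -> AST
code = compile(tree, "<string>", "exec")   # AST -> code object
namespace = {}
exec(code, namespace)                      # code object -> live function
print(namespace["add"](2, 3))              # prints 5
```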

Python also offers the "tokenize" module, but it stands as a disconnected
island: the only thing it allows is to get from the stream-of-characters
program representation to a stream of tokens, and back. Yet conceptually,
tokenization is not a disconnected feature; it is the first stage of the
language processing pipeline. The fact that "tokenize" is disconnected
from the rest of the pipeline listed above is largely an artifact of the
CPython implementation: both the "ast" module and the compile() builtin
are backed by the underlying bytecode compiler written in C, and that is
what connects them.

On the other hand, the "tokenize" module is pure Python, while the
underlying compiler has its own tokenizer implementation (not exposed).
That is the likely reason for the disconnect between "tokenize" and the
rest of the infrastructure.
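
Today the round trip that "tokenize" supports looks like this (standard-library APIs only):

```python
import io
import tokenize

source = "x = 1 + 2\n"

# Stream-of-characters -> stream-of-tokens
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string))

# ... and back: stream-of-tokens -> stream-of-characters
print(tokenize.untokenize(tokens), end="")
```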

This patch closes that gap and establishes an API which allows parsing a
token stream (iterable) into an AST. The initial implementation for
CPython is naive, taking a round trip through the surface (source)
program representation. That is considered acceptable, as the idea is to
establish a standard API for going tokens -> AST; individual Python
implementations can then optimize it based on their needs.

The function introduced here is ast.parse_tokens(). It follows the
signature of the existing ast.parse(), except that the first parameter
is "token_stream" instead of "source".

An alternative would be to overload the existing ast.parse() to also
accept a token iterable. I guess that at the current stage, where we are
trying to tighten the type strictness of the API and have clear typing
signatures for API functions, that is not the favored solution.

Signed-off-by: Paul Sokolovsky <pfalcon@users.sourceforge.net>

https://bugs.python.org/issue42729

@lysnikolaou
Contributor

Thanks for the patch @pfalcon, but I'm closing this as discussed on the bpo issue.
