bpo-35808: Retire pgen and use pgen2 to generate the parser #11814
Conversation
It doesn't appear that there are changes to configure.ac, so is there a need to churn aclocal.m4 and configure?
Do you need my review? I'm super behind on stuff so it may be a while, and I don't want to block you.
Hi, in the next line there is a reference to $(PGEN), which no longer exists. By the way, that doesn't raise any error because the variable is empty.
Good catch!
@gvanrossum Don't worry. I will try to get someone else to review, but the fact that the interpreter builds and passes the tests is a good assurance that the patch is not fundamentally wrong.
Hum. I'm not super excited by duplicating Lib/lib2to3/pgen2/ as Parser/pgen2/ "with a few changes". It sounds like a good recipe for duplicated maintenance. Would it be possible to modify Lib/lib2to3/pgen2/ to respect your constraints instead? Maybe even with something like "#ifdef" branching? Or maybe my question doesn't make sense and the code diverges enough to justify having two "copies" of the "same code".
@vstinner We could add some branching to the original pgen2, but IMHO I think having the code in Parser/ is cleaner. That being said, if we think it is OK to add branching to pgen2 and invoke that from the build, I can do it. P.S. Thanks for reviewing this big PR! 😄
Searching for PGEN in the CPython root I have some issues:
- .gitignore: I think we have to delete the Parser/pgen and Parser/pgen.exe entries, because now we won't have a pgen binary, right?
- In parsetok.h, on line 15, I don't know if the #ifndef PGEN ... is still necessary.
- There are several #ifndef PGEN blocks in other files (tokenizer.c, parsetok.c, ...); are they still necessary?
Also, I did not want to affect any lib2to3 behaviour, which is why I initially opted for "duplicating" pgen2.
There should be a way to modify lib2to3.pgen2 just enough to allow a user to change some constants and functions, no?
I compared Parser/pgen2/ and Lib/lib2to3/pgen2/ using meld. Most code is duplicated, but "some" is modified. That's my worry.
The 194 lines of grammar.py are almost all the same, except for the opmap_raw constant, which is different. That's why I mentioned "#ifdef". Would it be possible to modify lib2to3 to allow modifying opmap_raw? For this specific case, it looks like lib2to3 doesn't support PEP 572 whereas your change adds support for ":="... It means that we have 2 copies of the same file, but one is outdated? What's the point of having an outdated parser which doesn't support PEP 572 in lib2to3? lib2to3 should also be modified to support ":=", no?
Can't Parser/pgen2/ reuse Lib/lib2to3/pgen2/grammar.py? For a practical reason, I would suggest renaming Parser/pgen2/ to Parser/pgen/: it would avoid conflict with the lib2to3.pgen2 package and it reuses the "pgen" name :-)
Parser/pgen2/pgen.py: multiple changes are there to add support for older Python versions. Would it be acceptable to modify lib2to3 and reuse its pgen.py in Parser/pgen2/?
Parser/pgen2/tokenize.py: most code is the same, but you replaced " with ' and made other coding style changes. I'm not against coding style changes, except that here they make it harder to keep Parser/pgen2/tokenize.py in sync with Lib/lib2to3/pgen2/tokenize.py. Either revert the coding style changes or also update Lib/lib2to3/pgen2/tokenize.py.
Parser/pgen2/token.py: you added BACKQUOTE, removed ELLIPSIS, etc. That's maybe one of the trickiest differences, and it may prevent code reuse? Or should the other modules be made configurable to accept a "custom" token module, rather than being "hardcoded" to use a specific "token" module? I'm talking about from . import grammar, token, tokenize in pgen.py, for example.
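A minimal sketch of the decoupling being suggested here (hypothetical signature and parameter names, not code from the PR):

import token as default_token
import tokenize as default_tokenize

class ParserGenerator:
    # Hypothetical: accept the token/tokenize modules as parameters instead
    # of hardcoding "from . import token, tokenize", so the same generator
    # can run against a custom token module.
    def __init__(self, filename, token_module=None, tokenize_module=None):
        # Fall back to the running interpreter's stdlib modules.
        self.token = token_module or default_token
        self.tokenize = tokenize_module or default_tokenize
        self.stream = open(filename)
        # tokenize.generate_tokens() takes a readline callable.
        self.generator = self.tokenize.generate_tokens(self.stream.readline)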
@vstinner I will work on changes that address these issues, trying to reuse pgen2 as much as possible. Will ping you when it is ready :)
Signing off on the changes to the Windows build files. Haven't looked at the rest.
Lib/lib2to3/pgen2/tokenize.py (outdated diff)
**{f'{prefix}"""': double3prog for prefix in _strprefixes},
**{prefix: None for prefix in _strprefixes}}
"'''": single3prog, '"""': double3prog}
endprogs.update({"{}'''".format(prefix): single3prog for prefix in _strprefixes})
These are needed because Travis uses an old version of Python for the regeneration step.
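In other words, the 3.6+ constructs get rewritten into forms an older interpreter also accepts; roughly (using the names from the diff above):

# Python 3.6+ style (f-strings plus dict-literal **-unpacking), which the
# old interpreter on Travis rejects:
endprogs = {**{f"{prefix}'''": single3prog for prefix in _strprefixes},
            **{prefix: None for prefix in _strprefixes}}

# Downgraded equivalent using str.format() and dict.update():
endprogs = {}
endprogs.update({"{}'''".format(prefix): single3prog
                 for prefix in _strprefixes})
endprogs.update({prefix: None for prefix in _strprefixes})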
I have made some commits trying to use pgen2, but there are some problems that only show up on Travis and Windows... I will try to find out what is happening.
As I have rebased, the old conversations that were attached to commits can be found here: 80c1217#commitcomment-32381091 104b8e7#commitcomment-32380993

@gvanrossum Ok, I have decided to go with the option you initially suggested and leave lib2to3 untouched; that is why there is now a full implementation of the parser generator in Parser/pgen.
(Honestly I think at this point you might as well have started over with a new PR. But oh well.)
Getting closer. Maybe we can be completely independent from lib2to3.
Parser/pgen/__main__.py (outdated diff)
    "grammar", type=str, help="The file with the grammar definition in EBNF format"
)
parser.add_argument(
    "gramminit_h",
Normally this is spelled with one ‘m’. (Also below.)
Parser/pgen/pgen.py (outdated diff)
# Add token to the module cache so tokenize.py uses that exact one instead of
# the one in the stdlib of the interpreter executing this file.
sys.modules['token'] = token
tokenize = importlib.machinery.SourceFileLoader('tokenize',
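For context, the full pattern this hunk belongs to (the final .load_module() call is an assumed completion of the truncated last line; the rest is as posted):

import importlib.machinery
import os
import sys

# Use Lib/token.py and Lib/tokenize.py to obtain the tokens. To maintain this
# compatible with older versions of Python, we need to make sure that we only
# import these two files (and not any of the dependencies of these files).

CURRENT_FOLDER_LOCATION = os.path.dirname(os.path.realpath(__file__))
LIB_LOCATION = os.path.realpath(os.path.join(CURRENT_FOLDER_LOCATION, '..', '..', 'Lib'))
TOKEN_LOCATION = os.path.join(LIB_LOCATION, 'token.py')
TOKENIZE_LOCATION = os.path.join(LIB_LOCATION, 'tokenize.py')

# Load the in-tree Lib/token.py rather than the running interpreter's copy.
token = importlib.machinery.SourceFileLoader('token',
                                             TOKEN_LOCATION).load_module()
# Publish it in sys.modules so that tokenize.py's own "import token" resolves
# to the in-tree module, not the stdlib one.
sys.modules['token'] = token
tokenize = importlib.machinery.SourceFileLoader('tokenize',
                                                TOKENIZE_LOCATION).load_module()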
This still looks fragile. Why do we need to use the latest tokenize to parse the Grammar file? The “meta” grammar is super simple, it just has NAME, string literals, and some basic punctuation and operators. The tokenize module from Python 2.4 can handle this. :-)
The tokenize module from Python 2.4 can handle this.
:)
Why do we need to use the latest tokenize to parse the Grammar file?
It's not that tokenize cannot handle the grammar; it's that the tokenizer uses different values for the tokens, so it fails when constructing the dfas when calling self.parse():
Traceback (most recent call last):
File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/src/Parser/pgen/__main__.py", line 36, in <module>
main()
File "/src/Parser/pgen/__main__.py", line 29, in main
p = ParserGenerator(args.grammar, token_lines, verbose=args.verbose)
File "/src/Parser/pgen/pgen.py", line 20, in __init__
self.dfas, self.startsymbol = self.parse()
File "/src/Parser/pgen/pgen.py", line 173, in parse
self.expect(self.tokens['OP'], ":")
File "/src/Parser/pgen/pgen.py", line 337, in expect
type, value, self.type, self.value)
File "/src/Parser/pgen/pgen.py", line 356, in raise_error
self.end[1], self.line))
File "./Grammar/Grammar", line 13
single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
^
SyntaxError: expected 54/:, got 52/:
This is because OP has the value 52 in Python 3.5 (in this example) and 54 in the tokens that we construct from Grammar/Tokens (or in Lib/token.py). This difference exists because the value 52 is yielded by the tokenize module (from Python 3.5) when calling next(self.generator) in gettoken. Maybe I am missing something here, but that is the problem I found when trying to use the tokenize from the running Python :(
Parser/pgen/grammar.py (outdated diff)
@@ -0,0 +1,97 @@
from lib2to3.pgen2 import grammar
Maybe also copy this, so we’re completely independent from lib2to3?
Parser/pgen/pgen.py (outdated diff)
import collections
import importlib.machinery

# Use Lib/token.py and Lib/tokenize.py to obtain the tokens. To maintain this
I would get them directly from Grammar/Tokens
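A minimal sketch of that approach (hypothetical helper, not the PR's final code): number each token by its order of appearance in Grammar/Tokens, so the values no longer depend on which interpreter runs the generator.

def load_token_definitions(path):
    # Build a {token_name: numeric_value} mapping from Grammar/Tokens.
    # Non-comment lines name a token, optionally followed by its string
    # form (e.g. LPAR '('); values follow the order of appearance.
    tokens = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            name = line.split()[0]
            tokens[name] = len(tokens)
    return tokens

tokens = load_token_definitions('Grammar/Tokens')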
OK I will try to play with this myself.
I think I have the solution, at least for ...
Of course! Yeah, I can confirm that this works perfectly. Thanks for the patience in going through this problem with me. :) With this, if we read from Grammar/Tokens ...
Ship it!
@vstinner can you re-review this?
@pablogsal You can remove the other requested reviewers who haven't said anything yet.
Code changed a lot since I reviewed the PR
@gvanrossum @pablogsal: Sorry, I was busy with other stuff these last days and so failed to follow this PR. Well, I reviewed an earlier version of the code, and the code has changed a lot in the meanwhile. I trust Guido to be able to review such a change properly. Pablo: please don't wait for me if you want to merge this PR.
@pablogsal, feel free to merge.
Whee!
This problem turned out to be a bit more complex because pgen2 needed some modifications to make it compatible with the old pgen output (apart from the extra glue to generate graminit.c and graminit.h). I also had to downgrade some f-strings and '**'-unpacking because Travis uses an old Python for the regeneration step.

This PR implements the following:
Removes pgen and all related files from the source tree and Makefile.
Substitutes pgen with a modified and reduced (only pgen functionality is
included) version of Lib/lib2to3/pgen2:
Adds new tokens (like COLONEQUAL, TYPE_IGNORE, etc.) to
synchronize with the current Grammar.
All dfa names are included in the label list (this has the effect
of including 'file_input', 'eval_input', ... in the list of
labels to guarantee compatibility with the old pgen generated
files).
The order in which dfas are generated is preserved to maintain
compatibility with old pgen generated files and avoid empty stacks
when the parser processes the rule. If this change is not made,
the parser fails with:
Token ... It's a token we know
DFA 'and_expr', state 0: Push 'shift_expr'
DFA 'shift_expr', state 0: Push 'arith_expr'
DFA 'arith_expr', state 0: Push 'term'
DFA 'term', state 0: Push 'factor'
DFA 'factor', state 0: Push 'power'
DFA 'power', state 0: Push 'atom_expr'
DFA 'atom_expr', state 0: Push 'atom'
DFA 'atom', state 0: Shift.
Token NEWLINE/'' ... It's a token we know
DFA 'atom', state 5: Pop ...
DFA 'atom_expr', state 2: Pop ...
DFA 'power', state 1: Pop ...
DFA 'factor', state 2: Pop ...
DFA 'term', state 1: Pop ...
DFA 'arith_expr', state 1: Pop ...
DFA 'shift_expr', state 1: Pop ...
DFA 'and_expr', state 1: Pop ...
Error: bottom of stack.
This seems to happen because instead of starting with
'single_input' / 'file_input' ... etc. it starts with
'and_expr', just because it's the second element in the
dfa list in graminit.c. This makes the parser push
tokens from 'and_expr', so before the shift can happen,
it arrives at an empty stack when popping the nonterminals.
First sets are represented with Python sets instead of dictionaries
with 1 as every value (see the sketch after this description).
Added (optional) verbose output that dumps different elements of
the grammar and displays the first sets of every nonterminal.
pgen2 is now invoked as:
python -m Parser.pgen2 Grammar/Grammar Include/graminit.h Python/graminit.c
(optionally add '-v' for verbose output).
pgen included "<>" in the list of labels but pgen2 does not, as it
thinks it is already in the list of tokens because it collides with
the value of "!=". This is not a problem because this label is a
terminal that is handled manually, and Parser/token.c produces the
token NOTEQUAL anyway. This is the reason there are now 183 labels
where pgen generated 184.
Changes the regen-grammar to use pgen2 instead of pgen.
Regenerates Makefiles using autoreconf.
https://bugs.python.org/issue35808
This is just a (working) first proposal to start discussing the issue.
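As a small illustration of the first-set representation change mentioned in the list above (illustrative values only, not the PR's exact code):

# Old pgen2 style: a dict used purely for membership, every value is 1.
first_old = {'NAME': 1, 'NUMBER': 1, 'STRING': 1}

# Style described in this PR: a plain Python set.
first_new = {'NAME', 'NUMBER', 'STRING'}

# Membership tests read the same either way; the set just states the
# intent directly and drops the meaningless 1 values.
assert 'NAME' in first_old
assert 'NAME' in first_new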