
bpo-35808: Retire pgen and use pgen2 to generate the parser #11814

Merged
merged 9 commits into from Mar 1, 2019

Conversation

@pablogsal
Member

@pablogsal pablogsal commented Feb 11, 2019

This problem turned out to be a bit more complex because pgen2 needed some modifications to make it compatible with the old pgen output (apart from the extra glue to generate graminit.c and graminit.h). I also had to downgrade some f-strings and '**'-unpacking because Travis uses an old Python for the regeneration step.

This PR implements the following:

  • Removes pgen and all related files from the source tree and Makefile.

  • Replaces pgen with a modified and reduced version of Lib/lib2to3/pgen2
    (only the pgen functionality is included):

    • Adds new tokens (like COLONEQUAL, TYPE_IGNORE, etc.) to
      synchronize with the current Grammar.

    • All dfa names are included in the label list (this has the effect
      of including 'file_input', 'eval_input', ... in the list of
      labels to guarantee compatibility with the old pgen generated
      files).

    • The order in which dfas are generated is preserved to maintain
      compatibility with old pgen generated files and avoid empty stacks
      when the parser processes the rule. If this change is not made,
      the parser fails with:

      """' ... It's a token we know
      DFA 'and_expr', state 0: Push 'shift_expr'
      DFA 'shift_expr', state 0: Push 'arith_expr'
      DFA 'arith_expr', state 0: Push 'term'
      DFA 'term', state 0: Push 'factor'
      DFA 'factor', state 0: Push 'power'
      DFA 'power', state 0: Push 'atom_expr'
      DFA 'atom_expr', state 0: Push 'atom'
      DFA 'atom', state 0: Shift.
      Token NEWLINE/'' ... It's a token we know
      DFA 'atom', state 5: Pop ...
      DFA 'atom_expr', state 2: Pop ...
      DFA 'power', state 1: Pop ...
      DFA 'factor', state 2: Pop ...
      DFA 'term', state 1: Pop ...
      DFA 'arith_expr', state 1: Pop ...
      DFA 'shift_expr', state 1: Pop ...
      DFA 'and_expr', state 1: Pop ...
      Error: bottom of stack.

      This seems to happen because instead of starting with
      'single_input' / 'file_input' ... etc. it starts with
      'and_expr', just because it is the second element in the
      dfa list in graminit.c. This makes the parser push
      tokens from 'and_expr', so before the shift can happen
      it arrives at an empty stack when popping the nonterminals.

    • First sets are represented with Python sets instead of dictionaries
      with all values set to 1.

    • Added (optional) verbose output that dumps different elements of
      the grammar and displays the first sets of every nonterminal.

    • pgen2 is now invoked as:

      python -m Parser.pgen2 Grammar/Grammar Include/graminit.h Python/graminit.c

      (optionally add '-v' for verbose output).

    • pgen included "<>" in the list of labels but pgen2 does not, as it
      thinks it is already in the list of tokens because it collides with
      the value of "!=". This is not a problem because this label is a
      terminal that is handled manually, and Parser/token.c produces the
      NOTEQUAL token anyway. This is why there are now 183 labels
      where pgen generated 184.

  • Changes the regen-grammar target to use pgen2 instead of pgen.

  • Regenerates Makefiles using autoreconf.

https://bugs.python.org/issue35808

This is just a (working) first proposal to start discussing the issue
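One of the bullets above mentions switching first sets from dictionaries with all values set to 1 to plain Python sets. The idea can be pictured with a minimal, self-contained sketch; the grammar, the function name and the algorithm shape here are illustrative only, not the actual pgen code:

```python
# Minimal illustration of representing FIRST sets as Python sets rather
# than dicts whose values are all 1 (the historical pgen2 representation).
# The toy grammar below is hypothetical, not CPython's real Grammar, and
# for simplicity only the first symbol of each alternative is considered.

def first_sets(grammar):
    """Compute FIRST sets for a toy grammar {nonterminal: [alternatives]},
    where each alternative is a list of symbols and terminals are symbols
    that do not appear as keys."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                symbol = alt[0]
                new = first[symbol] if symbol in grammar else {symbol}
                if not new <= first[nt]:
                    first[nt] |= new  # plain set union, no dummy 1 values
                    changed = True
    return first

grammar = {
    "expr": [["term"], ["term", "+", "expr"]],
    "term": [["NAME"], ["(", "expr", ")"]],
}
print(sorted(first_sets(grammar)["expr"]))  # ['(', 'NAME']
```

Sets make the membership tests and unions direct, which is exactly what the dict-with-1-values representation was simulating.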

@pablogsal pablogsal requested a review from gvanrossum Feb 11, 2019
@pablogsal pablogsal self-assigned this Feb 11, 2019
@pablogsal pablogsal force-pushed the pablogsal:pgen2 branch 3 times, most recently from 23ff976 to 20fcc02 Feb 11, 2019
@pablogsal pablogsal requested a review from python/windows-team as a code owner Feb 11, 2019
@pablogsal pablogsal force-pushed the pablogsal:pgen2 branch 2 times, most recently from d60bfe5 to adc73e6 Feb 11, 2019
Member

@ned-deily ned-deily left a comment

It doesn't appear that there are changes to configure.ac so is there a need to churn aclocal.m4 and configure?

@pablogsal pablogsal force-pushed the pablogsal:pgen2 branch from adc73e6 to 5448767 Feb 11, 2019
@gvanrossum
Member

@gvanrossum gvanrossum commented Feb 11, 2019

Do you need my review? I'm super behind on stuff so it may be a while, and I don't want to block you.

@pablogsal pablogsal requested a review from ambv Feb 11, 2019
@eamanu
Contributor

@eamanu eamanu commented Feb 11, 2019

Hi, in the following line:

-rm -f $(BUILDPYTHON) $(PGEN) $(LIBRARY) $(LDLIBRARY) $(DLLLIBRARY) \

there is a reference to $(PGEN), which no longer exists.

By the way, that doesn't give any error because it is empty.

@pablogsal
Copy link
Member Author

@pablogsal pablogsal commented Feb 11, 2019

there is a reference to $(PGEN), which no longer exists.

Good catch!

@pablogsal
Member Author

@pablogsal pablogsal commented Feb 11, 2019

Do you need my review? I'm super behind on stuff so it may be a while, and I don't want to block you.

@gvanrossum Don't worry. I will try to get someone else to review, but the fact that the interpreter builds and passes the test is a good assurance that the patch is not fundamentally wrong.

@pablogsal pablogsal requested review from vstinner and serhiy-storchaka Feb 11, 2019
@vstinner
Member

@vstinner vstinner commented Feb 11, 2019

Hum. I'm not super excited by duplicating Lib/lib2to3/pgen2/ as Parser/pgen2/ "with a few changes". It sounds like a good recipe to duplicate the maintenance.

Would it be possible to modify Lib/lib2to3/pgen2/ to respect your constraints instead? Maybe even with "#ifdef" (if PGEN: ... ?).

Maybe my question doesn't make sense if the code diverges enough to justify having two "copies" of the "same code".

@pablogsal
Member Author

@pablogsal pablogsal commented Feb 11, 2019

Hum. I'm not super excited by duplicating Lib/lib2to3/pgen2/ as Parser/pgen2/ "with a few changes". It sounds like a good recipe to duplicate the maintenance.

@vstinner We could add some branching to the original pgen2, but IMHO having the code in Parser/pgen2 makes sense and helps with discovery, as opposed to invoking something in Lib/lib2to3/pgen2. Also, only the relevant files are ported (the parser and the driver; the rest of the suite was not moved), so it is not the full pgen2 that will need double maintenance. (Also, since we retired the original pgen, you can think of pgen2 as taking over the maintenance of the original pgen.) Finally, I did not want to affect any lib2to3 behaviour, which is why I opted for "duplicating" pgen2 initially.

That being said, if we think it is OK to add branching to pgen2 and invoke that from lib2to3, I am happy to do so :)

P.S. Thanks for reviewing this big PR! 😄

Contributor

@eamanu eamanu left a comment

Searching for PGEN in the CPython root I found some issues:

  • .gitignore: I think we have to delete Parser/pgen and Parser/pgen.exe, because now we won't have a pgen binary, right?
  • In parsetok.h, on line 15, I don't know if the #ifndef PGEN is necessary.
  • There are several #ifndefs in files such as tokenizer.c and parsetok.c; are they necessary?
Makefile.pre.in
Member

@vstinner vstinner left a comment

Also, I did not want to affect any lib2to3 behaviour, which is why I opted for "duplicating" pgen2 initially.

There should be a way to modify lib2to3.pgen2 just enough to allow a user to change some constants and functions, no?

I compared Parser/pgen2/ and Lib/lib2to3/pgen2/ using meld. Most code is duplicated, but "some" is modified. That's my worry.

The 194 lines of grammar.py are almost all the same, except for the opmap_raw constant, which is different. That's why I talked about "#ifdef". Would it be possible to modify lib2to3 to allow overriding opmap_raw? For this specific case, it looks like lib2to3 doesn't support PEP 572 whereas your change adds support for ":="... It means that we have two copies of the same file, but one is outdated? What's the point of having an outdated parser which doesn't support PEP 572 in lib2to3? lib2to3 should also be modified to support ":=", no?
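For context, opmap_raw is essentially a table mapping operator spellings to token names; a toy sketch of why two spellings can collide on one token (the numbers and the label-building loop are illustrative, not the real table — this is the same "<>" vs "!=" collision mentioned in the PR description):

```python
# Toy version of an operator-to-token table: both '!=' and the legacy '<>'
# spelling map to the same NOTEQUAL token. A generator that keys labels by
# token number therefore treats '<>' as a duplicate and drops it, which is
# why the new generator emits 183 labels where the old pgen emitted 184.
NOTEQUAL = 28  # illustrative token number

opmap = {'!=': NOTEQUAL, '<>': NOTEQUAL}

labels = {}
for text, tok in opmap.items():
    labels.setdefault(tok, text)  # first spelling wins; '<>' is skipped

print(labels)  # {28: '!='}
```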

Can't Parser/pgen2/ reuse Lib/lib2to3/pgen2/grammar.py? For a practical reason, I would suggest renaming Parser/pgen2/ to Parser/pgen/: it would avoid a conflict with the lib2to3.pgen2 package and it reuses the "pgen" name :-)

Parser/pgen2/pgen.py: multiple changes are there to add support for older Python versions. Would it be acceptable to modify lib2to3 and reuse its pgen.py in Parser/pgen2/?

Parser/pgen2/tokenize.py: most code is the same, but you replace " with ' and make other coding style changes. I'm not against coding style changes, except that here they make it harder to keep Parser/pgen2/tokenize.py in sync with Lib/lib2to3/pgen2/tokenize.py. Either revert the coding style changes or also update Lib/lib2to3/pgen2/tokenize.py.

Parser/pgen2/token.py: you added BACKQUOTE, removed ELLIPSIS, etc. That's maybe one of the trickiest differences, and it may prevent code reuse? Or should the other modules be made configurable to accept a "custom" token module, rather than being "hardcoded" to use a specific "token" module? I'm talking about from . import grammar, token, tokenize in pgen.py, for example.

@bedevere-bot

@bedevere-bot bedevere-bot commented Feb 11, 2019

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@pablogsal
Member Author

@pablogsal pablogsal commented Feb 11, 2019

@vstinner I will work on changes to address these issues, trying to reuse pgen2 as much as possible. Will ping you when it is ready :)

Member

@zooba zooba left a comment

Signing off on the changes to the Windows build files. Haven't looked at the rest.

**{f'{prefix}"""': double3prog for prefix in _strprefixes},
**{prefix: None for prefix in _strprefixes}}
"'''": single3prog, '"""': double3prog}
endprogs.update({"{}'''".format(prefix): single3prog for prefix in _strprefixes})


@pablogsal

pablogsal Feb 11, 2019
Author Member

These are needed because Travis uses an old version of Python for the regeneration step.
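The kind of downgrade being discussed can be shown in a standalone sketch (the names and values here are illustrative; the real code is in the ported tokenize.py):

```python
# The f-string / '**'-unpacking form below needs Python >= 3.6; the
# .format()/update() form is equivalent but also runs on the older Python
# that Travis used for the regeneration step. Names are illustrative.
_strprefixes = {'r', 'b', 'rb'}

# Python >= 3.6 only:
#   endprogs = {**{f'{prefix}"""': 'double3prog' for prefix in _strprefixes}}

# Python 3.5-compatible equivalent:
endprogs = {}
endprogs.update({'{}"""'.format(prefix): 'double3prog'
                 for prefix in _strprefixes})

print(sorted(endprogs))  # ['b"""', 'r"""', 'rb"""']
```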

@pablogsal
Member Author

@pablogsal pablogsal commented Feb 11, 2019

I have made some commits trying to use pgen2, but there are some problems that only show up on Travis and Windows... I will try to find out what is happening.

@pablogsal
Member Author

@pablogsal pablogsal commented Feb 20, 2019

As I have rebased, the old conversations that were attached to commits can be found here:

80c1217#commitcomment-32381091

104b8e7#commitcomment-32380993

@gvanrossum Ok, I have decided to go to the option you initially suggested and leave lib2to3 intact (at least in this PR). The implementation of this is in commit 2bc3198. The reasons are the following:

  1. Changing lib2to3's pgen can be backwards incompatible without really knowing due to the test suite not checking everything.
  2. Modifying pgen2 to allow it to use Grammar/Tokens involves a bigger diff because we need to make every file (optionally) compatible with a token file. Also, Grammar/Tokens and the tokens in lib2to3 are slightly different, which makes this a bit more complex, therefore generating a bigger diff.
  3. I don't feel comfortable with just a comment stating that lib2to3's pgen needs to be backwards compatible and compatible with a top-level package, because comments can be ignored and this can lead to future problems.
  4. This allows evolving the parser independently of lib2to3 if in the future we decide to change the Parser or the parser generator.

The reasons there is a full implementation of the parser generator in Parser/pgen/pgen.py are the following:

  1. Subclassing the parser generator in Lib/lib2to3/pgen2/pgen.py is not enough, as several methods need to be changed to stabilize the output files. This is because some of the containers and classes that the parser generator uses produce lists of dfas, names, labels, arcs and states that change between runs.
  2. There are several modifications that need to be done anyway to the parser generator, to make sure that the dfas are produced in the order that the Parser expects, making it compatible with the old pgen output.
  3. This PR is already very big. If in the future we want to unify lib2to3 and Parser/pgen we can do it, but I suggest doing it in another PR for another issue, because this is a different task in its own right. This way, we can discuss all the problems with the different approach without producing even bigger diffs.
  4. I have added a --verbose flag to the generation step that dumps some information about the process (lib2to3 does not do that). This helps a lot if there are problems when generating a modification of the Grammar.
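The "stabilize the output files" point in reason 1 boils down to never iterating an unordered container while emitting generated code. A minimal sketch of the idea (the helper and label names are hypothetical, not the actual generator):

```python
# Iterating a set directly can emit labels in a different order on each
# run or interpreter, producing spurious diffs in generated files such as
# graminit.c. Sorting first makes the generator's output deterministic.
def emit_labels(labels):
    """Return C-array lines for a set of label names, deterministically."""
    return ['    {"%s"},' % name for name in sorted(labels)]

labels = {'file_input', 'eval_input', 'and_expr'}
for line in emit_labels(labels):
    print(line)
```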
Member

@gvanrossum gvanrossum left a comment

(Honestly I think at this point you might as well have started over with a new PR. But alla.)

Makefile.pre.in
Parser/pgen/__main__.py
Parser/pgen/grammar.py
Parser/pgen/__main__.py
@pablogsal pablogsal force-pushed the pablogsal:pgen2 branch from f386c4f to c986980 Feb 21, 2019
@pablogsal pablogsal force-pushed the pablogsal:pgen2 branch from c986980 to 3d593ef Feb 21, 2019
Member

@gvanrossum gvanrossum left a comment

Getting closer. Maybe we can be completely independent from lib2to3.

"grammar", type=str, help="The file with the grammar definition in EBNF format"
)
parser.add_argument(
"gramminit_h",


@gvanrossum

gvanrossum Feb 21, 2019
Member

Normally this is spelled with one ‘m’. (Also below.)

# Add token to the module cache so tokenize.py uses that exact one instead of
# the one in the stdlib of the interpreter executing this file.
sys.modules['token'] = token
tokenize = importlib.machinery.SourceFileLoader('tokenize',


@gvanrossum

gvanrossum Feb 21, 2019
Member

This still looks fragile. Why do we need to use the latest tokenize to parse the Grammar file? The “meta” grammar is super simple, it just has NAME, string literals, and some basic punctuation and operators. The tokenize module from Python 2.4 can handle this. :-)


@pablogsal

pablogsal Feb 21, 2019
Author Member

The tokenize module from Python 2.4 can handle this.

:)

Why do we need to use the latest tokenize to parse the Grammar file?

It's not that tokenize cannot handle the grammar; it's that the tokenizer uses different values for the tokens, so it fails when constructing the dfas when calling self.parse():

Traceback (most recent call last):
  File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/src/Parser/pgen/__main__.py", line 36, in <module>
    main()
  File "/src/Parser/pgen/__main__.py", line 29, in main
    p = ParserGenerator(args.grammar, token_lines, verbose=args.verbose)
  File "/src/Parser/pgen/pgen.py", line 20, in __init__
    self.dfas, self.startsymbol = self.parse()
  File "/src/Parser/pgen/pgen.py", line 173, in parse
    self.expect(self.tokens['OP'], ":")
  File "/src/Parser/pgen/pgen.py", line 337, in expect
    type, value, self.type, self.value)
  File "/src/Parser/pgen/pgen.py", line 356, in raise_error
    self.end[1], self.line))
  File "./Grammar/Grammar", line 13
    single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
                ^
SyntaxError: expected 54/:, got 52/:

This is because OP has the value of 52 in Python 3.5 (in this example) and 54 in the tokens that we construct from Grammar/Tokens (or in Lib/token.py). This difference is because the value of 52 is yielded by the tokenize module (from Python 3.5) when calling next(self.generator) in gettoken. Maybe I am missing something here, but that is the problem I found when trying to use the tokenize from the running Python :(
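The diagnosis reduces to one observation: the stdlib token numbers are positions in an enumeration that shifts between Python versions, so they must never be compared against numbers from another source. A small demonstration (the hard-coded 52 is the older value quoted in the traceback above; exact numbers vary by version, so only self-consistency is asserted):

```python
# The token module's numeric values are version-dependent (OP was 52 in
# Python 3.5 and 54 after new tokens were added). Only the mapping within
# a single interpreter is guaranteed to be self-consistent.
import token

# Self-consistent within this interpreter:
assert token.tok_name[token.OP] == 'OP'

# Comparing against a number taken from a different version is the bug:
OP_IN_PYTHON_35 = 52  # value quoted in the traceback above
print(token.OP == OP_IN_PYTHON_35)  # may be False on the running Python
```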

@@ -0,0 +1,97 @@
from lib2to3.pgen2 import grammar


@gvanrossum

gvanrossum Feb 21, 2019
Member

Maybe also copy this, so we’re completely independent from lib2to3?

import collections
import importlib.machinery

# Use Lib/token.py and Lib/tokenize.py to obtain the tokens. To maintain this


@gvanrossum

gvanrossum Feb 21, 2019
Member

I would get them directly from Grammar/Tokens

@gvanrossum
Member

@gvanrossum gvanrossum commented Feb 22, 2019

I think I have the solution, at least for tokenize. If in pgen.py, in the function parse() and all parse_xxx() functions, you replace self.tokens.XXX with tokenize.XXX, the parser that reads the Grammar file (the meta-parser if you will) will work independently from the tokens defined for the target parser/lexer. POC:
@.patch.txt
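A sketch of what that patch amounts to (simplified; the expect() helper and variable names here are illustrative, not the actual pgen.py code): the meta-parser tokenizes the Grammar file with the running interpreter's own tokenize module and compares types against that same module's constants, so the target parser's token numbering never enters the picture.

```python
# Simplified sketch of the fix: the meta-parser tokenizes the Grammar file
# with the running interpreter's tokenize module and compares token types
# against that same module's constants (token.NAME, token.COLON, ...),
# instead of against the token numbering generated for the target parser.
import io
import token
import tokenize

GRAMMAR_LINE = "single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n"

def expect(tok, type_, value=None):
    """Fail unless the token has the given type (and value, if provided)."""
    if tok.exact_type != type_ or (value is not None and tok.string != value):
        raise SyntaxError('expected %s/%r, got %s/%r' % (
            type_, value, tok.exact_type, tok.string))

tokens = tokenize.generate_tokens(io.StringIO(GRAMMAR_LINE).readline)
name = next(tokens)
expect(name, token.NAME)           # 'single_input'
expect(next(tokens), token.COLON)  # ':' -- same module, so values agree
print(name.string)
```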

@pablogsal
Member Author

@pablogsal pablogsal commented Feb 22, 2019

I think I have the solution, at least for tokenize. If in pgen.py, in the function parse() and all parse_xxx() functions, you replace self.tokens.XXX with tokenize.XXX, the parser that reads the Grammar file (the meta-parser if you will) will work independently from the tokens defined for the target parser/lexer. POC:
@.patch.txt

Of course! Yeah, I can confirm that this works perfectly. Thanks for the patience on going with me through this problem. :)

With this, if we read from Grammar/Tokens as you suggested, we don't need any import hacks. I have pushed 8ff1e44 including this patch and some changes addressing the rest of your feedback.

@pablogsal pablogsal force-pushed the pablogsal:pgen2 branch from 8ff1e44 to b1fae75 Feb 22, 2019
Member

@gvanrossum gvanrossum left a comment

Ship it!

@gvanrossum
Member

@gvanrossum gvanrossum commented Feb 27, 2019

@vstinner can you re-review this?

@gvanrossum
Member

@gvanrossum gvanrossum commented Feb 27, 2019

@pablogsal You can remove the other requested reviewers who haven't said anything yet.

@vstinner vstinner dismissed their stale review Mar 1, 2019

Code changed a lot since I reviewed the PR

@vstinner
Member

@vstinner vstinner commented Mar 1, 2019

@gvanrossum @pablogsal: Sorry, I was busy with other stuff these last days and so I failed to follow this PR. Well, I reviewed an earlier version of the code; the code changed a lot in the meanwhile. I trust Guido to review such a change properly. Pablo: please don't wait for me if you want to merge this PR.

@gvanrossum
Member

@gvanrossum gvanrossum commented Mar 1, 2019

@pablogsal, feel free to merge.

@pablogsal pablogsal merged commit 1f24a71 into python:master Mar 1, 2019
5 checks passed
Azure Pipelines PR #20190222.38 succeeded
bedevere/issue-number Issue number 35808 found
bedevere/news News entry found in Misc/NEWS.d
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@gvanrossum
Member

@gvanrossum gvanrossum commented Mar 1, 2019

Whee!

ma8ma added a commit to ma8ma/cpython that referenced this pull request Apr 7, 2019
…-11814)

Pgen is the oldest piece of technology in the CPython repository; building it requires various #if[n]def PGEN hacks in other parts of the code and it also depends more and more on CPython internals. This commit removes the old pgen C code and replaces it with a new version implemented in pure Python. This is a modified and adapted version of lib2to3/pgen2 that can generate grammar files compatible with the current parser.

This commit also eliminates all the #ifdef and code branches related to pgen, simplifying the code and making it more maintainable. The regen-grammar step now uses $(PYTHON_FOR_REGEN), which can be any version of the interpreter, so the new pgen code maintains compatibility with older versions of the interpreter (this also allows regenerating the grammar with the current CI solution that uses Python 3.5). The new pgen Python module also makes use of the Grammar/Tokens file that holds the token specification, so it is always kept in sync and avoids having to maintain duplicate token definitions.
arnolddumas added a commit to arnolddumas/cpython that referenced this pull request May 3, 2019
…-11814)
