
bpo-35808: Retire pgen and use pgen2 to generate the parser #11814

Open
wants to merge 9 commits into
base: master
from

Conversation

@pablogsal
Member

pablogsal commented Feb 11, 2019

This problem turned out to be a bit more complex than expected because pgen2 needed some modifications to make it compatible with the old pgen output (apart from the extra glue to generate graminit.c and graminit.h). I also had to downgrade some f-strings and '**'-unpacking because Travis uses an old Python for the regeneration step.

This PR implements the following:

  • Removes pgen and all related files from the source tree and Makefile.

  • Replaces pgen with a modified and reduced version of
    Lib/lib2to3/pgen2 (only the pgen functionality is included):

    • Adds new tokens (like COLONEQUAL, TYPE_IGNORE, etc.) to
      synchronize with the current Grammar.

    • All dfa names are included in the label list (this has the effect
      of including 'file_input', 'eval_input', ... in the list of
      labels to guarantee compatibility with the old pgen generated
      files).

    • The order in which dfas are generated is preserved to maintain
      compatibility with old pgen generated files and avoid empty stacks
      when the parser processes the rule. If this change is not made,
      the parser fails with:

      """' ... It's a token we know
      DFA 'and_expr', state 0: Push 'shift_expr'
      DFA 'shift_expr', state 0: Push 'arith_expr'
      DFA 'arith_expr', state 0: Push 'term'
      DFA 'term', state 0: Push 'factor'
      DFA 'factor', state 0: Push 'power'
      DFA 'power', state 0: Push 'atom_expr'
      DFA 'atom_expr', state 0: Push 'atom'
      DFA 'atom', state 0: Shift.
      Token NEWLINE/'' ... It's a token we know
      DFA 'atom', state 5: Pop ...
      DFA 'atom_expr', state 2: Pop ...
      DFA 'power', state 1: Pop ...
      DFA 'factor', state 2: Pop ...
      DFA 'term', state 1: Pop ...
      DFA 'arith_expr', state 1: Pop ...
      DFA 'shift_expr', state 1: Pop ...
      DFA 'and_expr', state 1: Pop ...
      Error: bottom of stack.

      This seems to happen because instead of starting with
      'single_input', 'file_input', etc., the parser starts with
      'and_expr' just because it is the second element in the
      dfa list in graminit.c. This makes the parser push
      tokens from 'and_expr', so before the shift can happen,
      it arrives at an empty stack when popping the nonterminals.

    • First sets are represented with Python sets instead of
      dictionaries whose values are all 1.

    • Added (optional) verbose output that dumps different elements of
      the grammar and displays the first sets of every nonterminal.

    • pgen2 is now invoked as:

      python -m Parser.pgen2 Grammar/Grammar Include/graminit.h Python/graminit.c

      (optionally add '-v' for verbose output).

    • pgen included "<>" in the list of labels, but pgen2 does not, as it
      thinks it is already in the list of tokens because it collides with
      the value of "!=". This is not a problem because this label is a
      terminal that is handled manually, and Parser/token.c produces the
      token NOTEQUAL anyway. This is why there are now 183 labels
      where pgen generated 184.

  • Changes the regen-grammar to use pgen2 instead of pgen.

  • Regenerates Makefiles using autoreconf.
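The first-set representation change described above amounts to this (a minimal sketch with made-up label names, not code from the PR):

```python
# lib2to3's pgen stores first sets as dicts whose values are all 1;
# the ported generator uses plain Python sets instead.
old_style = {"NAME": 1, "NUMBER": 1, "STRING": 1}  # dict-with-1s style
new_style = {"NAME", "NUMBER", "STRING"}           # set style

# Membership tests and iteration behave the same for both:
assert "NAME" in old_style and "NAME" in new_style
assert set(old_style) == new_style
```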

https://bugs.python.org/issue35808

This is just a (working) first proposal to start discussing the issue.

@pablogsal pablogsal requested a review from gvanrossum Feb 11, 2019

@pablogsal pablogsal self-assigned this Feb 11, 2019

@pablogsal pablogsal force-pushed the pablogsal:pgen2 branch 3 times, most recently from 23ff976 to 20fcc02 Feb 11, 2019

@pablogsal pablogsal requested a review from python/windows-team as a code owner Feb 11, 2019

@pablogsal pablogsal force-pushed the pablogsal:pgen2 branch 2 times, most recently from d60bfe5 to adc73e6 Feb 11, 2019

@ned-deily
Member

ned-deily left a comment

It doesn't appear that there are changes to configure.ac so is there a need to churn aclocal.m4 and configure?

@pablogsal pablogsal force-pushed the pablogsal:pgen2 branch from adc73e6 to 5448767 Feb 11, 2019

@gvanrossum

Member

gvanrossum commented Feb 11, 2019

Do you need my review? I'm super behind on stuff so it may be a while, and I don't want to block you.

@pablogsal pablogsal requested a review from ambv Feb 11, 2019

@eamanu

Contributor

eamanu commented Feb 11, 2019

Hi, in the next line:

-rm -f $(BUILDPYTHON) $(PGEN) $(LIBRARY) $(LDLIBRARY) $(DLLLIBRARY) \

there is a reference to $(PGEN), which no longer exists.

By the way, it doesn't give any error because the variable is empty.

@pablogsal

Member Author

pablogsal commented Feb 11, 2019

there is a reference to $(PGEN), which no longer exists.

Good catch!

@pablogsal

Member Author

pablogsal commented Feb 11, 2019

Do you need my review? I'm super behind on stuff so it may be a while, and I don't want to block you.

@gvanrossum Don't worry. I will try to get someone else to review, but the fact that the interpreter builds and passes the test is a good assurance that the patch is not fundamentally wrong.

@pablogsal pablogsal requested review from vstinner and serhiy-storchaka Feb 11, 2019

@vstinner

Member

vstinner commented Feb 11, 2019

Hum. I'm not super excited by duplicating Lib/lib2to3/pgen2/ as Parser/pgen2/ "with a few changes". It sounds like a good recipe for duplicated maintenance.

Would it be possible to modify Lib/lib2to3/pgen2/ to respect your constraints instead? Maybe even with "#ifdef" (if PGEN: ... ?).

Or maybe my question doesn't make sense and the code diverges enough to justify having two "copies" of the "same code".

@pablogsal

Member Author

pablogsal commented Feb 11, 2019

Hum. I'm not super excited by duplicating Lib/lib2to3/pgen2/ as Parser/pgen2/ "with a few changes". It sounds like a good recipe for duplicated maintenance.

@vstinner We could add some branching to the original pgen2, but IMHO having the code in Parser/pgen2 makes sense and helps with discoverability, as opposed to invoking something in Lib/lib2to3/pgen2. Also, only the relevant files are ported (the parser and the driver; the rest of the suite was not moved), so it is not the full pgen2 that will need double maintenance. (Also, since we retired the original pgen, you can think of pgen2 as taking over the maintenance of the original pgen.) Finally, I did not want to affect any lib2to3 behaviour, which is why I initially opted for "duplicating" pgen2.

That being said, if we think it is OK to add branching to pgen2 and invoke that from lib2to3, I am happy to do so :)

P.S. Thanks for reviewing this big PR! 😄

@eamanu
Contributor

eamanu left a comment

Searching for PGEN in the CPython root, I found some issues:

  • .gitignore: I think we have to delete the Parser/pgen and Parser/pgen.exe entries, because we won't have a pgen binary anymore, right?
  • In parsetok.h, on line 15, I don't know if the #ifndef PGEN ... is still necessary.
  • There are several `#ifndef`s in other files (tokenizer.c, parsetok.c, ...): are they still necessary?
Resolved review thread: Makefile.pre.in
@vstinner
Member

vstinner left a comment

Also, I did not want to affect any lib2to3 behaviour, which is why I initially opted for "duplicating" pgen2.

There should be a way to modify lib2to3.pgen2 just enough to allow a user to change some constants and functions, no?

I compared Parser/pgen2/ and Lib/lib2to3/pgen2/ using meld. Most code is duplicated, but "some" is modified. That's my worry.

The 194 lines of grammar.py are almost all the same, except for the opmap_raw constant, which is different. That's why I talked about "#ifdef". Would it be possible to modify lib2to3 to allow overriding opmap_raw? For this specific case, it looks like lib2to3 doesn't support PEP 572 whereas your change adds support for ":="... It means that we have 2 copies of the same file, but one is outdated? What's the point of having an outdated parser which doesn't support PEP 572 in lib2to3? lib2to3 should also be modified to support ":=", no?

Can't Parser/pgen2/ reuse Lib/lib2to3/pgen2/grammar.py? For practical reasons, I would suggest renaming Parser/pgen2/ to Parser/pgen/: it would avoid a conflict with the lib2to3.pgen2 package and it reuses the "pgen" name :-)

Parser/pgen2/pgen.py: multiple changes are to add support for older Python versions. Would it be acceptable to modify lib2to3 and reuse its pgen.py in Parser/pgen2/?

Parser/pgen2/tokenize.py: most code is the same, but you replace " with ' and make other coding style changes. I'm not against coding style changes, except that here they make it harder to keep Parser/pgen2/tokenize.py in sync with Lib/lib2to3/pgen2/tokenize.py. Either revert the coding style changes or also update Lib/lib2to3/pgen2/tokenize.py.

Parser/pgen2/token.py: you added BACKQUOTE, removed ELLIPSIS, etc. That's maybe one of the trickiest differences, and it may prevent code reuse. Or should the other modules be made configurable to accept a "custom" token module, rather than being "hardcoded" to a specific token module? I'm talking about from . import grammar, token, tokenize in pgen.py, for example.
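A minimal sketch of that "configurable token module" idea (class and parameter names here are hypothetical, not from the PR): the generator takes the token module as a parameter instead of hardcoding an import, so lib2to3 and the CPython build could pass in different token sets.

```python
import token as default_token

class ParserGeneratorSketch:
    """Hypothetical generator front-end taking a pluggable token module."""

    def __init__(self, grammar_file, token_module=default_token):
        self.grammar_file = grammar_file
        # Build the name -> number mapping from whichever module we got,
        # instead of baking in one version's token numbers.
        self.tokens = {name: getattr(token_module, name)
                       for name in token_module.tok_name.values()
                       if hasattr(token_module, name)}

gen = ParserGeneratorSketch("Grammar/Grammar")
assert gen.tokens["OP"] == default_token.OP
```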

@bedevere-bot


bedevere-bot commented Feb 11, 2019

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@pablogsal

Member Author

pablogsal commented Feb 11, 2019

@vstinner I will work on changes to address these issues, trying to reuse pgen2 as much as possible. Will ping you when it is ready :)

@zooba
Member

zooba left a comment

Signing off on the changes to the Windows build files. Haven't looked at the rest.

**{f'{prefix}"""': double3prog for prefix in _strprefixes},
**{prefix: None for prefix in _strprefixes}}
"'''": single3prog, '"""': double3prog}
endprogs.update({"{}'''".format(prefix): single3prog for prefix in _strprefixes})


@pablogsal

pablogsal Feb 11, 2019

Author Member

These are needed because Travis uses an old version of Python for the regeneration step.

@pablogsal

Member Author

pablogsal commented Feb 11, 2019

I have made some commits trying to use pgen2, but there are some problems that only show up on Travis and Windows... I will try to find out what is happening.

@@ -2497,6 +2483,6 @@ static label labels[184] = {
grammar _PyParser_Grammar = {
92,
dfas,
{184, labels},
{183, labels},


@gvanrossum

gvanrossum Feb 19, 2019

Member

Do you understand why the regenerated graminit.c is different? Is it just that pgen2 processes some nodes in a different order? Or is the algorithm actually different in some way? Hopefully the order is stable? (If it were to use sets, it might not be!)


@pablogsal

pablogsal Feb 19, 2019

Author Member

Hopefully the order is stable? (If it were to use sets, it might not be!)

I can confirm that the order is stable, regardless of the Python version you use to run pgen2. The reason is that iteration over sets and regular dicts is stabilized by sorting the container first. As a test, I removed the sorting and used OrderedDicts everywhere, and there were still differences with the pgen version of the files.

Do you understand why the regenerated graminit.c is different? Is it just that pgen2 processes some nodes in a different order? Or is the algorithm actually different in some way?

The labels have different order because the arcs in the dfas have different order (labels are generated by make_label from the arcs in the dfas generated by self.parse()). The arcs themselves also follow the order in which dfas are generated. Note that dfas themselves are not changed from one file to the other, so they are parsed in the same fashion.

pgen seems to call addlabel both when an nfa is created and when an atom is compiled, which explains the difference in the order of the labels and, I think, in the majority of the file.

I did not put much effort in recreating the same order because the grammar files are equivalent and the test suite passes.
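A small illustration of the stability argument above (made-up nonterminal names): raw set iteration order is not something to rely on across runs, so the generator sorts containers before emitting arcs, labels, and states.

```python
# A first set as the generator might hold it internally:
first_set = {"term", "factor", "atom"}

# Iterating the set directly gives no reproducibility guarantee across
# runs or Python versions; sorting first makes the emitted order fixed.
emitted = sorted(first_set)
assert emitted == ["atom", "factor", "term"]
```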

-> RARROW
... ELLIPSIS
:= COLONEQUAL
` BACKQUOTE


@gvanrossum

gvanrossum Feb 19, 2019

Member

Why'd you indent all this by one space? (It just adds noise to the diff.)


class PgenGrammar(grammar.Grammar):
    pass
    def produce_graminit_h(self, writer):


@gvanrossum

gvanrossum Feb 19, 2019

Member

An alternative way of structuring this would be to have a subclass in Parser/pgen.py that adds this functionality.

@@ -31,43 +31,48 @@
EQUAL = 22
DOT = 23
PERCENT = 24
BACKQUOTE = 25


@gvanrossum

gvanrossum Feb 19, 2019

Member

By deleting BACKQUOTE you make it impossible for the 2to3 tool to parse (some) Python 2 code. I don't think that's an acceptable change. It would probably break python-modernize (which is a 3rd party extension of 2to3).


@vstinner

vstinner Feb 19, 2019

Member

Oh. The CI didn't complain. Does lib2to3 lack a test for that?


@gvanrossum

gvanrossum Feb 19, 2019

Member

Likely. It doesn't have very good tests. :-(


@pablogsal

pablogsal Feb 19, 2019

Author Member

BACKQUOTE is not deleted; it is now the 63rd token. It has been moved because I used the Grammar/Tokens file, to make it compatible with pulling that one in in the future instead of maintaining two separate sets of files.

Sorry for the confusion :S


@gvanrossum

gvanrossum Feb 19, 2019

Member

Got it!

# Copyright 2004-2005 Elemental Security, Inc. All Rights Reserved.
# Licensed to PSF under a Contributor Agreement.

"""The pgen2 package."""


@gvanrossum

gvanrossum Feb 19, 2019

Member

I don't think these comments and this docstring are needed.

@bedevere-bot


bedevere-bot commented Feb 19, 2019

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@pablogsal

Member Author

pablogsal commented Feb 19, 2019

PS. What does the fixup! convention do?

They are generated by git commit --fixup. It allows me to run git rebase -i --autosquash, and git will automatically merge each of those commits into the commit it is fixing. This lets me maintain separate commits and helps the reviewer check that the feedback was addressed. So as a reviewer, you can just review the commits you care about and the ones that keep iterating on them. You can read about it here:

https://fle.github.io/git-tip-keep-your-branch-clean-with-fixup-and-autosquash.html

So for example, if you are interested on the changes to pgen2, you only need to review:

  • Modify pgen2 to make it produce pgen compatible output
  • fixup! Modify pgen2 to make it produce pgen compatible output
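As an illustration of that workflow (a hypothetical throwaway repository; file names and commit messages are made up), the fixup!/autosquash cycle looks like this:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q
echo one > pgen.py; git add pgen.py
git commit -q -m "Modify pgen2 to make it produce pgen compatible output"
echo two > other.py; git add other.py
git commit -q -m "Other work"
# Address review feedback, recording it against the first commit:
echo tweak >> pgen.py; git add pgen.py
git commit -q --fixup HEAD~1   # creates "fixup! Modify pgen2 ..."
# Accept the generated todo list as-is; autosquash folds the fixup in:
GIT_SEQUENCE_EDITOR=: git rebase -q -i --autosquash --root
git log --oneline              # back to two commits, no "fixup!" left
```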

@pablogsal pablogsal force-pushed the pablogsal:pgen2 branch 3 times, most recently from 31a7847 to 09cef2e Feb 19, 2019

pablogsal added some commits Feb 12, 2019

@pablogsal pablogsal force-pushed the pablogsal:pgen2 branch from 09cef2e to e2ed0d1 Feb 20, 2019

@pablogsal

Member Author

pablogsal commented Feb 20, 2019

As I have rebased, the old conversations that were attached to commits can be found here:

80c1217#commitcomment-32381091

104b8e7#commitcomment-32380993

@gvanrossum Ok, I have decided to go with the option you initially suggested and leave lib2to3 intact (at least in this PR). The implementation is in commit 2bc3198. The reasons are the following:

  1. Changing lib2to3's pgen can be backwards incompatible without us really knowing, because the test suite does not check everything.
  2. Modifying pgen2 to allow it to use Grammar/Tokens involves a bigger diff, because we need to make every file (optionally) compatible with a token file. Also, Grammar/Tokens and the tokens in lib2to3 are slightly different, which makes this a bit more complex and the diff even bigger.
  3. I don't feel comfortable with just a comment stating that lib2to3's pgen needs to be backwards compatible and compatible with a top-level package, because comments can be ignored and this can lead to future problems.
  4. This allows evolving the parser independently of lib2to3 if in the future we decide to change the Parser or the parser generator.

The reasons there is a full implementation of the parser generator in Parser/pgen/pgen.py are the following:

  1. Subclassing the parser generator in Lib/lib2to3/pgen2/pgen.py is not enough, as several methods need to be changed to stabilize the output files. This is because some of the containers and classes that the parser generator uses produce lists of dfas, names, labels, arcs and states that change between runs.
  2. There are several modifications that need to be done anyway to the parser generator to make sure that the dfas are produced in the order that the Parser expects, making it compatible with the old pgen output.
  3. This PR is already very big. If in the future we want to unify lib2to3 and Parser/pgen we can do it, but I suggest doing it in another PR under another issue, because it is a different task in its own right. This way, we can discuss all the problems with the different approaches without producing even bigger diffs.
  4. I have added a --verbose flag to the generation step that dumps some information about the process (lib2to3 does not do that). This helps a lot if there are problems when regenerating after a modification of the Grammar.
@gvanrossum
Member

gvanrossum left a comment

(Honestly I think at this point you might as well have started over with a new PR. But alla.)

Resolved review threads: Makefile.pre.in (outdated), Parser/pgen/__main__.py, Parser/pgen/grammar.py (outdated), Parser/pgen/__main__.py (outdated)

@pablogsal pablogsal force-pushed the pablogsal:pgen2 branch from f386c4f to c986980 Feb 21, 2019

@pablogsal pablogsal force-pushed the pablogsal:pgen2 branch from c986980 to 3d593ef Feb 21, 2019

@gvanrossum
Member

gvanrossum left a comment

Getting closer. Maybe we can be completely independent from lib2to3.

"grammar", type=str, help="The file with the grammar definition in EBNF format"
)
parser.add_argument(
"gramminit_h",


@gvanrossum

gvanrossum Feb 21, 2019

Member

Normally this is spelled with one ‘m’. (Also below.)

# Add token to the module cache so tokenize.py uses that exact one instead of
# the one in the stdlib of the interpreter executing this file.
sys.modules['token'] = token
tokenize = importlib.machinery.SourceFileLoader('tokenize',


@gvanrossum

gvanrossum Feb 21, 2019

Member

This still looks fragile. Why do we need to use the latest tokenize to parse the Grammar file? The “meta” grammar is super simple, it just has NAME, string literals, and some basic punctuation and operators. The tokenize module from Python 2.4 can handle this. :-)


@pablogsal

pablogsal Feb 21, 2019

Author Member

The tokenize module from Python 2.4 can handle this.

:)

Why do we need to use the latest tokenize to parse the Grammar file?

It's not that tokenize cannot handle the grammar; it's that the tokenizer uses different values for the tokens, so it fails when constructing the dfas in self.parse():

Traceback (most recent call last):
  File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/src/Parser/pgen/__main__.py", line 36, in <module>
    main()
  File "/src/Parser/pgen/__main__.py", line 29, in main
    p = ParserGenerator(args.grammar, token_lines, verbose=args.verbose)
  File "/src/Parser/pgen/pgen.py", line 20, in __init__
    self.dfas, self.startsymbol = self.parse()
  File "/src/Parser/pgen/pgen.py", line 173, in parse
    self.expect(self.tokens['OP'], ":")
  File "/src/Parser/pgen/pgen.py", line 337, in expect
    type, value, self.type, self.value)
  File "/src/Parser/pgen/pgen.py", line 356, in raise_error
    self.end[1], self.line))
  File "./Grammar/Grammar", line 13
    single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
                ^
SyntaxError: expected 54/:, got 52/:

This is because OP has the value 52 in Python 3.5 (in this example) and 54 in the tokens that we construct from Grammar/Tokens (or in Lib/token.py). The value 52 is what tokenize (from Python 3.5) yields when calling next(self.generator) in gettoken. Maybe I am missing something here, but that is the problem I found when trying to use the tokenize from the running Python :(
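A minimal illustration of that mismatch: token numbers are assigned positionally and have shifted between releases (per the discussion above, OP was 52 under Python 3.5 vs 54 in current Lib/token.py), but each interpreter is internally self-consistent.

```python
import token
import tokenize

# Hardcoded numbers from one version's token module need not match what
# another interpreter's tokenize yields, yet within a single running
# interpreter the mapping is consistent both ways:
assert token.tok_name[token.OP] == "OP"
assert tokenize.OP == token.OP  # tokenize re-exports token's numbers
```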

@@ -0,0 +1,97 @@
from lib2to3.pgen2 import grammar


@gvanrossum

gvanrossum Feb 21, 2019

Member

Maybe also copy this, so we’re completely independent from lib2to3?

import collections
import importlib.machinery

# Use Lib/token.py and Lib/tokenize.py to obtain the tokens. To maintain this


@gvanrossum

gvanrossum Feb 21, 2019

Member

I would get them directly from Grammar/Tokens

@gvanrossum

Member

gvanrossum commented Feb 22, 2019

I think I have the solution, at least for tokenize. If in pgen.py, in the function parse() and all parse_xxx() functions, you replace self.tokens.XXX with tokenize.XXX, the parser that reads the Grammar file (the meta-parser if you will) will work independently from the tokens defined for the target parser/lexer. POC:
@.patch.txt

@pablogsal

Member Author

pablogsal commented Feb 22, 2019

I think I have the solution, at least for tokenize. If in pgen.py, in the function parse() and all parse_xxx() functions, you replace self.tokens.XXX with tokenize.XXX, the parser that reads the Grammar file (the meta-parser if you will) will work independently from the tokens defined for the target parser/lexer. POC:
@.patch.txt

Of course! Yeah, I can confirm that this works perfectly. Thanks for the patience on going with me through this problem. :)

With this, if we read from Grammar/Tokens as you suggested, we don't need any import hacks. I have pushed 8ff1e44 including this patch and some changes addressing the rest of your feedback.
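For reference, a hypothetical sketch of the approach (not the actual patch code): the meta-parser that reads the Grammar file matches tokens against the running interpreter's tokenize constants, so it no longer depends on the target grammar's token numbers.

```python
import io
import tokenize

def expect(tok, type_, value=None):
    # Compare against the running interpreter's tokenize constants; the
    # meta-grammar only needs NAME, STRING, OP and friends.
    if tok.type != type_ or (value is not None and tok.string != value):
        raise SyntaxError("expected %s/%s, got %s/%s"
                          % (type_, value, tok.type, tok.string))
    return tok.string

line = b"single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE\n"
toks = list(tokenize.tokenize(io.BytesIO(line).readline))
# toks[0] is the ENCODING token; the rule name and colon follow it.
assert expect(toks[1], tokenize.NAME) == "single_input"
assert expect(toks[2], tokenize.OP, ":") == ":"
```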

@pablogsal pablogsal force-pushed the pablogsal:pgen2 branch from 8ff1e44 to b1fae75 Feb 22, 2019

@gvanrossum
Member

gvanrossum left a comment

Ship it!
