-
-
Notifications
You must be signed in to change notification settings - Fork 276
Design Notes
When building up a pyparsing parser, the resulting strings are returned as a
ParseResults
object, with the parsed strings in a flat sequence. This is the case even
if you build up the parser from sub-expressions, and add them into a larger expression
using '+' and '|' operators.
Imagine we are writing a parser for a simplified form of a Python def
statement. Here is
a BNF:
def_stmt ::= 'def' identifier '(' arg_spec, ... ')' -> return_type_spec ':'
arg_spec ::= identifier
return_type_spec ::= identifier
identifier ::= <normal Python identifier>
This translates almost directly to pyparsing
(working from bottom to top):
import pyparsing as pp
ppc = pp.common
# define punctuation and keywords
LPAR, RPAR, COLON = map(pp.Suppress, "():")
ARROW = pp.Suppress("->")
DEF = pp.Keyword("def").suppress()
identifier = ppc.identifier
arg_spec = identifier()
return_type_spec = identifier()
def_stmt = (DEF + identifier
+ LPAR + pp.Opt(pp.delimitedList(arg_spec)) + RPAR
+ ARROW + return_type_spec + COLON)
We can test this parser using the run_tests()
method:
def_stmt.run_tests("""\
def sin() -> float:
def abs(x) -> float:
def pad(s, min, max) -> str:
""")
And we get our parsed results:
def sin() -> float:
['sin', 'float']
def abs(x) -> float:
['abs', 'x', 'float']
def pad(s, min, max) -> str:
['pad', 's', 'min', 'max', 'str']
This is okay, we can certainly parcel this out into function name, arguments, and return type. But now imagine we add support for optional types for the arguments as well:
arg_spec ::= identfier [':' type_spec]
type_spec ::= identfier
So that pad
s definition could be expanded to def pad(s: str, min: int, max:int) -> str:
.
We accomplish this just be changing the definition of arg_spec to:
type_spec = identifier()
arg_spec = identifier() + pp.Opt(COLON + type_spec)
And parsing the type annotated definition of pad now gives:
def pad(s: str, min: int, max: int) -> str:
['pad', 's', 'str', 'min', 'int', 'max', 'int', 'str']
SUCCESS! Well, sort of. We can see that we've parsed all the individual parts. But the
results are still just a sequence of the found strings. We could remove the suppressing of
all the punctuation, and get the separating ()
's and ,
's, but then we will have done little
more than tokenize the statement, and now we need logic to walk all the tokens.
Also, since the types are optional, we can't reliably walk this list of strings, assuming that after the function name will come the name/type pairs for each argument - some arguments might have omitted the type.
This is where Group
comes in. If we enclose arg_spec
in a Group
, this will define name/type
structures for the def
statement. Changing arg_spec
to:
arg_spec = pp.Group(identifier() + pp.Opt(COLON + type_spec))
With this change, the parser gives us a ParseResults
containing:
['pad', ['s', 'str'], ['min', 'int'], ['max', 'int'], 'str']
Now that we can group the results using Group, we can go further and gather all the parsed argument specs into a sub-structure, so we don't have to unpack the results by grabbing leading name and trailing return type, and assuming the rest are the arguments.
So next we change the def
statement itself to group the optional arguments:
def_stmt = (DEF + identifier
+ LPAR + pp.Group(pp.Opt(pp.delimitedList(arg_spec))) + RPAR
+ ARROW + return_type_spec + COLON)
Giving us this structured result:
['pad', [['s', 'str'], ['min', 'int'], ['max', 'int']], 'str']
If we run this parser against our original list of function definitions, we will
see another benefit: every function always returns exactly 3 elements. Even those
that have no defined arguments, such as the one for sin
, will include an empty
list for the second element.
And lastly, to make it easier to access the individual parts of the results, we'll add results names to the grammar definition:
def_stmt = (DEF + identifier("function_name")
+ LPAR + pp.Group(pp.Opt(pp.delimitedList(arg_spec)))("args") + RPAR
+ ARROW + return_type_spec("returns") + COLON)
And we get:
['pad', [['s', 'str'], ['min', 'int'], ['max', 'int']], 'str']
- args: [['s', 'str'], ['min', 'int'], ['max', 'int']]
[0]:
['s', 'str']
[1]:
['min', 'int']
[2]:
['max', 'int']
- function_name: 'pad'
- returns: 'str'
You can still access the contents of the ParseResults
as if they were items in a list,
but you can also get to them by name:
result = def_stmt.parse_string("def pad(s: str, min: int, max: int) -> str:")
print(result[0]) # prints 'pad'
print(result.function_name) # also prints 'pad'
We should do the same thing for the parts of each argument spec:
arg_spec = pp.Group(identifier("name") + pp.Opt(COLON + type_spec("type")))
To get:
['pad', [['s', 'str'], ['min', 'int'], ['max', 'int']], 'str']
- args: [['s', 'str'], ['min', 'int'], ['max', 'int']]
[0]:
['s', 'str']
- name: 's'
- type: 'str'
[1]:
['min', 'int']
- name: 'min'
- type: 'int'
[2]:
['max', 'int']
- name: 'max'
- type: 'int'
- function_name: 'pad'
- returns: 'str'
So we could then write:
for arg in result.args:
print(arg.name, arg.type)
Group
has done one more thing for us. Note that all the grouped argument specs have maintained
their separate argument names and types. If we remove the Group
() enclosing arg_spec
, this is the kind
of result we would get:
['pad', ['s', 'str', 'min', 'int', 'max', 'int'], 'str']
- args: ['s', 'str', 'min', 'int', 'max', 'int']
- name: 'max'
- type: 'int'
- function_name: 'pad'
- returns: 'str'
Saving results names into a ParseResults
is very similar to assigning key-values to a Python
dict: if duplicated keys are used, only the last assigned value is stored. When the arg_specs
are ungrouped, the name and type only have the values for the final max
argument.
So by grouping the argument specs, we maintain the separate names and types for each argument
in the list of arguments.
In summary, Group
serves two purposes:
- it adds structure
- it scopes results names
By using Group
and results names, your parser will generate results that are much easier
for your post-parsing code to process.
Here are some ideas for other enhancements to this parser:
- add an optional section after the return type to list possible exceptions that could be raised (similar to "throws" clause in Java method definitions)
- expand
arg_type
to handle multiple types, such asdef abs(x: int|float) -> float:
Here is the complete program for this parser:
import pyparsing as pp
ppc = pp.common
# define punctuation and keywords (they get suppressed because they are only
# needed at parse time, they don't contain useful information for post-parse work)
LPAR, RPAR, COLON = map(pp.Suppress, "():")
ARROW = pp.Suppress("->")
DEF = pp.Keyword("def").suppress()
# define the sub-expressions
identifier = ppc.identifier
type_spec = identifier()
arg_spec = pp.Group(identifier("name") + pp.Opt(COLON + type_spec("type")))
return_type_spec = type_spec()
# define the full def statement expression
def_stmt = (DEF + identifier("function_name")
+ LPAR + pp.Group(pp.Opt(pp.delimitedList(arg_spec)))("args") + RPAR
+ ARROW + return_type_spec("returns") + COLON)
# try it out!
result = def_stmt.parse_string("def pad(s: str, min: int, max: int) -> str:")
print(result[0]) # prints 'pad'
print(result.function_name) # also prints 'pad'
for arg in result.args:
print(arg.name, arg.type)
def_stmt.run_tests("""\
def sin() -> float:
def abs(x) -> float:
def pad(s, min, max) -> str:
def abs(x: float) -> float:
def pad(s: str, min: int, max: int) -> str:
""")