A small parser generator that generates recursive descent parsers. It also generates a lexer form the token defined in the grammar.
This parser generator is relatively simple. It can handle direct left recursion, but indirect recursion will result in an infinite loop.
Note: Because this git repository uses submodules, you should either clone it using the --recurse-submodules
flag, or run git submodule update --init --recursive
in the repositories directory after cloning it.
The generator will read in a file containing the grammar for your parser, and will
then generate an output C file that contains the lexer and parser for the defined
grammar. An example of how such a grammar lock can be found in the example.grammar
file in the repository root directory. After building using make
, it can be compiled
using the command ./build/bin/parsed example.grammar example.c
.
- Using
%return { type }
the type that will be returned from the parser can be specified. - Using
%args { type name, type name }
a list of arguments can be specified that will have to be passed into theparsedTokenize
and all parse functions and can be used in the inline code segments. %free { free($tofree); }
can be used to define how returned types that are not used should be discarded.
Tokens can be defined either using a simple string like so: 'while'
or "for"
, or
using a regular expression like so: /\/\/.*\n?/
or /\s+/
.
To tokenize a given string the function ParsedTokens parsedTokenize(const char* string, int length, USER_ARGS)
will be generated. The result from this function can then be given to one of the parsing functions.
To define tokens that should be ignored in the output you can use the %ignore
option in one of
the following ways:
%ignore 'token'
Specify a token, which will then be ignored, or%ignore { $length = amountToSkip($source, $maxlength); }
Use an inline code segment that sets$length
to the length of string that is to be skipped
If at any point no token can be found or ignored, an invalid token will be generated, that will
never be consumed, by the parsing functions. Similarly, the end of the input will be marked by a
special token, that can be checked using the int parsedReachedEnd(ParsedTokens tokens)
function.
Tokens have the following fields:
int kind
this stores an id generated by the generatorint offset
this stores the offset in the source string of the start of the tokenint len
this stores the length of the tokenconst char* start
this stores the pointer to the start of the token in the original string
A production rule can be defined as follows:
expr := expr '+' expr
| expr '-' expr
| number;
Every non-terminal must be defined in a single production rule, and multiple definitions with the
same name will result in an error. Multiple expansion options are separated by a |
and consists
of a list of tokens and names of non-terminals. Each definition must start with the name of the
non-terminal followed by :=
and end with a semicolon ;
.
To define how the return value should be computed, you can specify an inline code segment for every one of the possible expansions, like so:
expr := expr '+' expr { $return = $nonterm[0] + $nonterm[1]; }
| expr '-' expr { $return = $nonterm[0] - $nonterm[1]; }
| /[1-9]/ { $return = $term[0].start[0] - '0'; };
It is important to note that in this example because the grammar is ambiguous the input 1 - 2 + 3
will not be parsed in the conventional way (i.e. executing left-to-right) but as follows:
-
/ \
1 +
/ \
2 3
The following are all variables available for these inline code segments:
$return
write the value you want to return to this$term
array of tokens in the expansion (in order of definition)$numterm
number of tokens in $term$nonterm
array of non-terminals in the expansion (in order of definition)$numnonterm
number of non-terminals in $nonterm
If no code segment is supplied than the value in $nonterm[0]
will be returned,
even if this field contains only uninitialized data.
The functions that will be generated all have the following signature:
ParsedTokens parse_NAME(ParsedTokens tokens, RETURN_TYPE ret, USER_ARGS)
where:
NAME
is the name of the expressionRETURN_TYPE
is the type in the%return
setting and also where$return
will write toUSER_ARGS
is the value in the%args
setting
On the global level additional code segments can be specified using %{ code }
. This code
will be inserted unchanged into the output file in order of definition.