Skip to content

Latest commit

 

History

History
97 lines (81 loc) · 4.62 KB

README.md

File metadata and controls

97 lines (81 loc) · 4.62 KB

Parsed

A small parser generator that generates recursive descent parsers. It also generates a lexer form the token defined in the grammar.

This parser generator is relatively simple. It can handle direct left recursion, but indirect recursion will result in an infinite loop.

Note: Because this git repository uses submodules, you should either clone it using the --recurse-submodules flag, or run git submodule update --init --recursive in the repositories directory after cloning it.

How does it work?

The generator will read in a file containing the grammar for your parser, and will then generate an output C file that contains the lexer and parser for the defined grammar. An example of how such a grammar lock can be found in the example.grammar file in the repository root directory. After building using make, it can be compiled using the command ./build/bin/parsed example.grammar example.c.

Settings

  • Using %return { type } the type that will be returned from the parser can be specified.
  • Using %args { type name, type name } a list of arguments can be specified that will have to be passed into the parsedTokenize and all parse functions and can be used in the inline code segments.
  • %free { free($tofree); } can be used to define how returned types that are not used should be discarded.

Tokens

Tokens can be defined either using a simple string like so: 'while' or "for", or using a regular expression like so: /\/\/.*\n?/ or /\s+/.

To tokenize a given string the function ParsedTokens parsedTokenize(const char* string, int length, USER_ARGS) will be generated. The result from this function can then be given to one of the parsing functions.

To define tokens that should be ignored in the output you can use the %ignore option in one of the following ways:

  • %ignore 'token' Specify a token, which will then be ignored, or
  • %ignore { $length = amountToSkip($source, $maxlength); } Use an inline code segment that sets $length to the length of string that is to be skipped

If at any point no token can be found or ignored, an invalid token will be generated, that will never be consumed, by the parsing functions. Similarly, the end of the input will be marked by a special token, that can be checked using the int parsedReachedEnd(ParsedTokens tokens) function.

Tokens have the following fields:

  • int kind this stores an id generated by the generator
  • int offset this stores the offset in the source string of the start of the token
  • int len this stores the length of the token
  • const char* start this stores the pointer to the start of the token in the original string

Production rules

A production rule can be defined as follows:

expr := expr '+' expr
      | expr '-' expr
      | number;

Every non-terminal must be defined in a single production rule, and multiple definitions with the same name will result in an error. Multiple expansion options are separated by a | and consists of a list of tokens and names of non-terminals. Each definition must start with the name of the non-terminal followed by := and end with a semicolon ;.

To define how the return value should be computed, you can specify an inline code segment for every one of the possible expansions, like so:

expr := expr '+' expr { $return = $nonterm[0] + $nonterm[1]; }
      | expr '-' expr { $return = $nonterm[0] - $nonterm[1]; }
      | /[1-9]/ { $return = $term[0].start[0] - '0'; };

It is important to note that in this example because the grammar is ambiguous the input 1 - 2 + 3 will not be parsed in the conventional way (i.e. executing left-to-right) but as follows:

  -
 / \
1   +
   / \
  2   3

The following are all variables available for these inline code segments:

  • $return write the value you want to return to this
  • $term array of tokens in the expansion (in order of definition)
  • $numterm number of tokens in $term
  • $nonterm array of non-terminals in the expansion (in order of definition)
  • $numnonterm number of non-terminals in $nonterm

If no code segment is supplied than the value in $nonterm[0] will be returned, even if this field contains only uninitialized data.

The functions that will be generated all have the following signature: ParsedTokens parse_NAME(ParsedTokens tokens, RETURN_TYPE ret, USER_ARGS) where:

  • NAME is the name of the expression
  • RETURN_TYPE is the type in the %return setting and also where $return will write to
  • USER_ARGS is the value in the %args setting

Inline code

On the global level additional code segments can be specified using %{ code }. This code will be inserted unchanged into the output file in order of definition.