Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Failed to load latest commit information.|
Parser Combinators v 0.1 ======================== This branch is for parser combinators implemented in C. Features: --------- * Written in C * Error message support * Support for syntax macros * Support for adding operators with arbitrary precedence, fixity, associativity * Simple example language (calc) Dependencies: ------------- * Boehm-Demers-Weiser garbage collection Build: ------ Update the directory locations at the top of Makefile and type: make To run example, type: ./calc Usage: ------ To use the parser library, you must include parser.h in your C program. You must also currently have: extern jmp_buf exc; and define a setjmp(exc) somewhere in your parser (see the example calc program) so that the exception code has somewhere to jump in the event of an exception. Parser combinators are data structures which contain data about how to parse a given grammar production rule. One creates a parser by building up a grammar from combinators. One then passes the root combinator to the parse function to parse statements/expressions from the language described by the combinators. The basic type in the library is a combinator_t. There are two different ways to create combinator_t's. Some of the combinator functions return a combinator_t, others take an existing one and modify it. To create a new combinator data struct called stmt for example, we write: combinator_t * stmt = new_combinator(); The ability to create empty combinators like this is useful when we wish to have combinators which are (indirectly) self-referential. The combinator can be made before it is defined so that some component of it can refer to it. Combinators, once created and fully defined, can be parsed with the parse function: ast_t * parse(input_t * in, combinator_t * comb) If the input coming from the given input stream matches the grammar defined by comb, an ast_t is returned. The ast nodes are defined in ast.h and the input_t type is defined in input.h. Some combinators do not return AST details about what is parsed. They merely return an ast_t called ast_nil if the input matches the grammar and NULL otherwise. Such combinators can be used with the capture function (described below) to capture the exact text that was parsed. We now list the different functions available for creating and combining combinators. Literals ======== combinator_t * match(char * str) This combinator when parsed exactly matches a given string str. If the string is not matched the parser returns NULL, otherwise it returns ast_nil. For example we may write: match("if"); to create a combinator which matches the exact string "if" when parsed. This combinator skips any whitespace preceding the match string before parsing it. combinator_t * expect(combinator_t * c, char * msg) This combinator causes an exception to be raised if the given combinator returns NULL when parsed. This is a mechanism for printing error messages if given parts of the grammar are not matched. For example, one may have statements of the form if (expr) stmt; in one's grammar. Once the "if" has been parsed, one expects a left parenthesis. Thus one may write expect(match("("), "Error: \"(\" expected\n"); to expect a left parenthesis at this point. If it is not found then an exception is raised and the given error message printed. To raise an exception the library calls the exception function declared in exception.h. Currently this relies on longjmp which is available on posix systems, however it can be replaced with a user defined function if required. combinator_t * exact(char * str) This combinator is identical to match except that it does not skip preceding whitespace. This is useful in combination with other "lexical" combinators which can be used with the capture combinator to capture strings without whitespace in them. combinator_t * range(char * str) This combinator function takes a string consisting of precisely two non-null characters (an exception is raised if this is not the case). When parsed the combinator matches any character which lies between the two specified characters (inclusive). For example: range("ae"); The range combinator does not skip preceding whitespace. combinator_t * alpha() This combinator matches an alphabetical character (upper or lower case). It does not skip preceding whitespace. combinator_t * digit() This combinator matches a digit (0-9). It does not skip preceding whitespace. combinator_t * anything() This combinator matches any character. It does not skip preceding whitespace. combinator_t * integer() This combinator matches an integer, i.e. either 0 or a digit between 1 and 9 inclusive, followed by any number of digits. This combinator skips preceding whitespace. The AST node returned has tag T_INT. The integer is hashed and turned into a symbol (equal integers have the same symbol). The symbol is then stored in the returned AST node. combinator_t * cident() This combinator matches a C identifier, i.e. an underscore or letter followed by any number of underscores, letters or digits. This combinator skips preceding whitespace. The AST node returned has tag T_IDENT. The identifier is hashed and turned into a symbol (equal identifiers have the same symbol). The symbol is then stored in the returned AST node. Lexical combinators =================== combinator_t * not(combinator_t * c) This combinator returns ast_nil if the given combinator c returns NULL when parsed, otherwise it returns NULL. combinator_t * option(combinator_t * c) This combinator returns ast_nil regardless of whether the combinator c is matched or not. This allows one to specify that a given grammatical element is optional, i.e. may occur zero or one times. combinator_t * zeroplus(tag_t typ, combinator_t * c) This combinator is the kleene star operator. It returns ast_nil if the given combinator is matched zero or more times. It is greedy in that it attempts to match as many times as it can. combinator_t * oneplus(tag_t typ, combinator_t * c); This combinator returns ast_nil if the given combinator is matched one or more times, and NULL otherwise. It is greedy in that it attempts to match as many times as it can. combinator_t * capture(tag_t typ, combinator_t * c) This combinator wraps an existing "lexical" combinator and does two things. Firstly it skips preceding whitespace. Secondly it captures the entire string matched by the combinator c and returns an AST node with the given AST tag and the captured string as a symbol. The function for turning the captured string into a symbol is declared in symbol.h but can be replaced with a user defined function. The default behaviour is to hash the string against all previously captured strings. If two captured strings are parsed then (referentially) identical symbols are returned in the AST nodes. This is a useful performance enhancement that allows compilers/interpreters to compare two strings by comparing the (pointers to) their symbols instead of comparing the actual strings. Note that c can be any combinator, including a composite one, made of lexical combinators. Syntactic combinators ===================== combinator_t * seq(combinator_t * ret, tag_t typ, combinator_t * c1, ...) This function takes an existing (empty) combinator as first argument (and then returns it once defined). It defines ret to be a combinator which parses a sequence of other combinators in order. To terminate the list of combinators one must pass a final NULL argument. For example: seq(simple_if_stmt, T_IF1, match("if"), match("("), expr, match(")"), stmt, NULL); where expr and stmt are previously defined combinators. The seq combinator will receive an AST node from each matched combinator in the sequence. Those which return something other than ast_nil are then chained together using the next field of the ast_t type to produce a list. If typ is AST_NONE then this list is returned without decoration. However if typ is any other AST tag then the list is wrapped in a new AST node with that tag, i.e. the list becomes the child node of the returned AST node. combinator_t * multi(combinator_t * ret, tag_t typ, combinator_t * c1, ...) The multi combinator is a multiple choice combinator. It takes an existing (empty) combinator as first argument (and then returns it once defined). It defines ret to be a combinator which parses any one of the other combinators (checked in the order given). To terminate the list of combinators one must pass a final NULL argument. For example: multi(if_stmt, T_IF, simple_if_stmt, compound_if_stmt, NULL); where simple_if_stmt and compound_if_stmt are previously defined combinators. The multi combinator will receive an AST node from the first matched combinator in the sequence. If typ is AST_NONE then this is returned without decoration. However if typ is any other AST tag then the node is wrapped in a new AST node with that tag, i.e. it becomes the child node of the returned AST node. Expression combinators ====================== Because combinators do not handle left recursion correctly, expressions are handled a little differently in this library, using a special expression combinator. combinator_t * expr(combinator_t * exp, combinator_t * base) This combinator takes an existing (blank) combinator exp (and returns it once defined) and another combinator base. The result is a combinator which parses only expressions defined by the base combinator. However, the exp combinator can have other expression combinators added to it with the functions below. This is done for two reasons. Firstly, this allows for expression combinators to be added with a given precedence, associativity and fixity. Secondly, this allows new expression combinators to be added to the expression heirarchy at run-time. The base combinator is expected to have the highest precedence of all the expression combinators added to exp. It is usually used to parse literals at the base of the expression heirarchy. void expr_insert(combinator_t * expr, int prec, tag_t tag, expr_fix fix, expr_assoc assoc, combinator_t * comb) This function inserts a new operator into the expression heirarchy. The symbol for the operator is parsed by the combinator comb. The operator is added at the given precedence and is defined to have the given fixity and associativity. The precedence is just a number from 0 (lowest) to n (highest) where n is the number of non-base precedence levels already defined. The new operator is added with the given precedence and all higher precedence operators (and base) are shifted up in their precedence. Available options for inserted operators are EXPR_INFIX, EXPR_PREFIX and EXPR_POSTFIX depending on whether the operator symbol will appear between two expressions of higher precedence, before an expression of higher precedence or after such, respectively. Available options for associativity include ASSOC_LEFT and ASSOC_RIGHT for left associativity (e.g. addition is left associative) and right associativity respectively. For unary (prefix and postfix) operators, ASSOC_NONE can be specified. Only binary infix and unary prefix/postfix operators are currently supported. In the infix case an AST node with the given tag is returned. The arguments of the operator are returned as child and child->next nodes of the returned AST node. In the prefix and postfix cases an AST node with the given tag is return and the single argument of the operator is returned as the child node of the returned AST node. void expr_altern(combinator_t * expr, int prec, tag_t tag, combinator_t * comb) Sometimes more than one operator has the same precedence, fixity and associativity. In this case one can add it at an existing precedence level using exp_altern. The parameters for this function are as for expr_insert except that a fixity and associativity are not specified, these being derived from the existing operators at the specified precedence. Here is an example of the usage of the expression combinators from the calc program. It assumes that a base combinator exists and is defined and that an exp combinator has been created (but not defined yet): expr(exp, base); expr_insert(exp, 0, T_ADD, EXPR_INFIX, ASSOC_LEFT, match("+")); expr_altern(exp, 0, T_SUB, match("-")); expr_insert(exp, 1, T_MUL, EXPR_INFIX, ASSOC_LEFT, match("*")); expr_altern(exp, 1, T_DIV, match("/")); expr_altern(exp, 1, T_REM, match("%")); expr_insert(exp, 2, T_NEG, EXPR_PREFIX, ASSOC_NONE, match("-")); The example calc language: -------------------------- In the files calc.c and eval.c/h is an example of the usage of the combinator library. It defines a very simple language which has simple arithmetic operators, variable assignment and if and for statements. Here are the elements of the language: 1) Integer literals, e.g. 0, 123, 23478343 2) Unary minus, e.g. -1, -s, -(4*3*a - 1) 3) Addition/subtraction: 1 + 4, s + 3, 2 - 7 4) Multiplication/division/remainder: 3*4, 7/2, 7%s 5) Parentheses: 3 + (4 * 2 - 1), 3 - (2 * 3 + (4 + 1)) 6) Assignment: s = 1, v = t = 3, s = t + 2 7) Simple "if" statments: if s then t = 3 8) Simple "for" statements: for i in 1:100 do s = s + i Here is a sample session: Every statement must be followed by a semicolon. > s = 1 + 2 - (3*4 + 1); -10 > t = 0; 0 > if s then t = t + 1; 1 > t; 1 > for i in 1:100 do t = t + i; 4951