Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rethink whitespace handling #58

Closed
danieldietrich opened this issue Oct 3, 2014 · 6 comments
Closed

Rethink whitespace handling #58

danieldietrich opened this issue Oct 3, 2014 · 6 comments

Comments

@danieldietrich
Copy link
Member

See also #23.

  1. rule : does not return ws and does not combine children
  2. Rule : returns ws and does not combine children
  3. RULE : returns ws and combines children

Rule names consisting of one char:

  • a matches 1.
  • A matches 3.
  • there is no rule name which has length 1 and matches 2.
@danieldietrich danieldietrich modified the milestones: 1.1.0 M2 Additional Parser Features, 1.1.0 M1 - Parser Core Oct 3, 2014
@danieldietrich
Copy link
Member Author

Whitespace has to be handled in sequences and rule references. Quantifiers are out of scope here, because quantified lexical rules have not to be treated special regarding whitespace and quantified rule references are already handled (see above).

Current problem: For correct whitespace handling in sequences it is needed to know, if the current rulePart of the sequence is lexical or not. If the rulePart is a subrule, this is not clear before parsing the subrule.

Solution: We need to return whitespace but not to combine the ruleParts of the sequence (see 2nd case above).

@danieldietrich danieldietrich self-assigned this Oct 4, 2014
@danieldietrich
Copy link
Member Author

TODO: Fix GrammarTest.shouldParseRichStringWithoutEatingUpWhitespace()

danieldietrich added a commit that referenced this issue Oct 4, 2014
issues regarding whitespace handling (#58) and recursion (#54).
@danieldietrich
Copy link
Member Author

In general we talk about lexical tokens, which are atomic units of text / sequences of characters. A parser parses a stream of tokens. The Javaslang parser determines lexical tokens while parsing, i.e. there is only one parse phase. This is in contrast to Antlr, which has two phases - a lexing phase where tokens are gathered and a parsing phase where the sequence of known tokens is determined.

So the overall goal here is to determine, if the Javaslang parser has to skip whitespace and if tokens have to be combined, while reading the input stream char by char.

The underlying data structure of the parser is a tree. Because it is a descend parser, the children of a node are parsed first. Therefore the node has to decide
A) whether to combine its children or not and
B) whether surrounding whitespace can safely be skipped without raising the risk of inadvertently taking characters needed in future by a subsequent parser.

@danieldietrich
Copy link
Member Author

Syntax

The core grammar syntax describes a formal language targeted to parsers of finite character sequences.

Character Groups

These are the atomic building parts to match groups of characters:

'text'     // a Literal containing one or more characters
.          // Any, a single character
'a'..'z'   // a Range of characters
[a-zA-Z$]  // a Charset, equals ( 'a'..'z' | 'A'..'Z' | '$' )
EOF        // end of file, matches no character

Single character matchers (Any, Range and Charset) and EOF may be negated:

!T         // negation, character not in T

The following rules hold:

  • !!T = T
  • !. = EOF, !EOF = .

The operator precedence of ! is the highest of all, e.g. !T* is the same as (!T)*.

Rules

We distinguish between parser rules and lexer rules, which have the same syntax:

rule : alternative_1 | ... | alternative_n

where n > 0 and rule is the name of the rule. The name is unique in the scope of a grammar. The first matching alternative wins. The alternatives are composed of the building parts of character groups as described above and:

ruleRef           // reference to another rule by name
T?                // zero or one occurrence of T
T*                // zero or more occurrences of T
T+                // one or more occurrences of T
T{m,n}            // m to n occurrences of T
T{m}              // same as T{m,m}
T1 ... Tn         // a sequence which matches if all parts match
( T1 | ... | Tn ) // a subrule of alternatives, first match wins

In the following, the difference between lexer and parser rules is described.

Lexer rules

  • Lexer rules (short: token) produce the leafs of a parse tree.
  • They consist of a (non-empty) sequence of characters.
  • There is one empty token, EOF.
  • The name of a tokens is not part of the parse tree.
  • A token name starts with an upper case character.
  • Whitespace is not parsed automatically within a token rule.
  • Tokens may contain references to other tokens.
  • Tokens may not contain references to rules other than tokens.

Parser rules

  • Parser rules (short: rule) produce the inner nodes of a parse tree.
  • They have a name and a non-empty list of children.
  • The name of a rule and it's children are part of the parse tree.
  • A rule name starts with a lower case character.
  • Whitespace is parsed automatically within a parser rule.
  • Rules may contain references to other rules and tokens.

Lexical Token Definitions within Parser Rules

A parser rule part T is purely lexical if it is a combination of the following rule parts:

  • A single character matcher (Any ., Range 'a'..'z' or Charset [a-zA-Z$])
  • End of file matcher (EOF)
  • Negation (!T)
  • Quantification (T?, T*, T+, T{m,n}, T{m})

Purely lexical parse results are combined to a token.

@danieldietrich
Copy link
Member Author

Deducing the parser design from the grammar description above:

  • A rule propagates the lexical state to its children (parser rule or token). This has implications on how to handle whitespace and whether to combine results.

@danieldietrich
Copy link
Member Author

I think it is necessary to allow lexical token definitions within parser rules (see definition above of pure lexical parser rules).

This allows us to write rule : 'Hello' [a-zA-Z]+ '!'
which parses "Hello Daniel!"
but not "Hello D a n i e l!".

Note: Because a literal 'test' is not pure, rule : 'test'+ parses "test test test".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant