Rethink whitespace handling #58

danieldietrich · 2014-10-03T17:13:57Z

See also #23.

rule : does not return ws and does not combine children
Rule : returns ws and does not combine children
RULE : returns ws and combines children

Rule names consisting of one char:

a matches 1.
A matches 3.
there is no rule name which has length 1 and matches 2.

The text was updated successfully, but these errors were encountered:

danieldietrich · 2014-10-04T00:06:08Z

Whitespace has to be handled in sequences and rule references. Quantifiers are out of scope here, because quantified lexical rules have not to be treated special regarding whitespace and quantified rule references are already handled (see above).

Current problem: For correct whitespace handling in sequences it is needed to know, if the current rulePart of the sequence is lexical or not. If the rulePart is a subrule, this is not clear before parsing the subrule.

Solution: We need to return whitespace but not to combine the ruleParts of the sequence (see 2nd case above).

danieldietrich · 2014-10-04T00:32:32Z

TODO: Fix GrammarTest.shouldParseRichStringWithoutEatingUpWhitespace()

issues regarding whitespace handling (#58) and recursion (#54).

danieldietrich · 2014-10-04T09:01:14Z

In general we talk about lexical tokens, which are atomic units of text / sequences of characters. A parser parses a stream of tokens. The Javaslang parser determines lexical tokens while parsing, i.e. there is only one parse phase. This is in contrast to Antlr, which has two phases - a lexing phase where tokens are gathered and a parsing phase where the sequence of known tokens is determined.

So the overall goal here is to determine, if the Javaslang parser has to skip whitespace and if tokens have to be combined, while reading the input stream char by char.

The underlying data structure of the parser is a tree. Because it is a descend parser, the children of a node are parsed first. Therefore the node has to decide
A) whether to combine its children or not and
B) whether surrounding whitespace can safely be skipped without raising the risk of inadvertently taking characters needed in future by a subsequent parser.

danieldietrich · 2014-10-05T13:10:12Z

Syntax

The core grammar syntax describes a formal language targeted to parsers of finite character sequences.

Character Groups

These are the atomic building parts to match groups of characters:

'text'     // a Literal containing one or more characters
.          // Any, a single character
'a'..'z'   // a Range of characters
[a-zA-Z$]  // a Charset, equals ( 'a'..'z' | 'A'..'Z' | '$' )
EOF        // end of file, matches no character

Single character matchers (Any, Range and Charset) and EOF may be negated:

!T         // negation, character not in T

The following rules hold:

!!T = T
!. = EOF, !EOF = .

The operator precedence of ! is the highest of all, e.g. !T* is the same as (!T)*.

Rules

We distinguish between parser rules and lexer rules, which have the same syntax:

rule : alternative_1 | ... | alternative_n

where n > 0 and rule is the name of the rule. The name is unique in the scope of a grammar. The first matching alternative wins. The alternatives are composed of the building parts of character groups as described above and:

ruleRef           // reference to another rule by name
T?                // zero or one occurrence of T
T*                // zero or more occurrences of T
T+                // one or more occurrences of T
T{m,n}            // m to n occurrences of T
T{m}              // same as T{m,m}
T1 ... Tn         // a sequence which matches if all parts match
( T1 | ... | Tn ) // a subrule of alternatives, first match wins

In the following, the difference between lexer and parser rules is described.

Lexer rules

Lexer rules (short: token) produce the leafs of a parse tree.
They consist of a (non-empty) sequence of characters.
There is one empty token, EOF.
The name of a tokens is not part of the parse tree.
A token name starts with an upper case character.
Whitespace is not parsed automatically within a token rule.
Tokens may contain references to other tokens.
Tokens may not contain references to rules other than tokens.

Parser rules

Parser rules (short: rule) produce the inner nodes of a parse tree.
They have a name and a non-empty list of children.
The name of a rule and it's children are part of the parse tree.
A rule name starts with a lower case character.
Whitespace is parsed automatically within a parser rule.
Rules may contain references to other rules and tokens.

Lexical Token Definitions within Parser Rules

A parser rule part T is purely lexical if it is a combination of the following rule parts:

A single character matcher (Any ., Range 'a'..'z' or Charset [a-zA-Z$])
End of file matcher (EOF)
Negation (!T)
Quantification (T?, T*, T+, T{m,n}, T{m})

Purely lexical parse results are combined to a token.

danieldietrich · 2014-10-05T21:34:17Z

Deducing the parser design from the grammar description above:

A rule propagates the lexical state to its children (parser rule or token). This has implications on how to handle whitespace and whether to combine results.

danieldietrich · 2014-10-06T22:01:08Z

I think it is necessary to allow lexical token definitions within parser rules (see definition above of pure lexical parser rules).

This allows us to write rule : 'Hello' [a-zA-Z]+ '!'
which parses "Hello Daniel!"
but not "Hello D a n i e l!".

Note: Because a literal 'test' is not pure, rule : 'test'+ parses "test test test".

danieldietrich added enhancement labels Oct 3, 2014

danieldietrich modified the milestones: 1.1.0 M2 Additional Parser Features, 1.1.0 M1 - Parser Core Oct 3, 2014

danieldietrich mentioned this issue Oct 3, 2014

Bootstrap the parser #33

Closed

danieldietrich self-assigned this Oct 4, 2014

danieldietrich added a commit that referenced this issue Oct 4, 2014

Fixes #29. Towards #33. Fixed bootstrap grammar + impl. There are open

10caf2d

issues regarding whitespace handling (#58) and recursion (#54).

This was referenced Oct 5, 2014

Non-greedy quantifiers #30

Closed

Make subrules identifiable #57

Closed

Negating arbitrary rule parts #62

Closed

danieldietrich closed this as completed in f55eded Oct 5, 2014

danieldietrich reopened this Oct 6, 2014

danieldietrich closed this as completed in 2d5c115 Oct 6, 2014

danieldietrich added the design/refactoring/improvement label May 1, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rethink whitespace handling #58

Rethink whitespace handling #58

danieldietrich commented Oct 3, 2014

danieldietrich commented Oct 4, 2014

danieldietrich commented Oct 4, 2014

danieldietrich commented Oct 4, 2014

danieldietrich commented Oct 5, 2014

danieldietrich commented Oct 5, 2014

danieldietrich commented Oct 6, 2014

Rethink whitespace handling #58

Rethink whitespace handling #58

Comments

danieldietrich commented Oct 3, 2014

danieldietrich commented Oct 4, 2014

danieldietrich commented Oct 4, 2014

danieldietrich commented Oct 4, 2014

danieldietrich commented Oct 5, 2014

Syntax

Character Groups

Rules

Lexer rules

Parser rules

Lexical Token Definitions within Parser Rules

danieldietrich commented Oct 5, 2014

danieldietrich commented Oct 6, 2014