Fix whitespace handling #23

danieldietrich · 2014-09-02T00:15:15Z

Whitespace configuration:

Update: See also issue #27.

Whitespace can be declared via WS : ... -> skip ; or WHITESPACE : ... -> skip ;.
There is an implicit default whitespace declaration, e.g. WS : [ \t\r\n]+ -> skip ;.
Additionally, whitespace should be customizable, e.g. by overriding the default whitespace rule.
(Note: it is currently unclear how to distinguish fragment from -> skip, because there will be only a parsing phase and no lexing phase, in which case fragment and skip have the same semantics.)

Rules for handling whitespace:

When parsing (rule name starts with a lower-case letter), whitespace is consumed automatically.
When lexing (rule name starts with an upper-case letter), whitespace is not consumed automatically.
Lexer rules can only reference other lexer rules.
Parser rules can reference parser and lexer rules and can also declare anonymous lexer rules.
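
A minimal sketch of how these four rules could be checked from rule names alone. The class RuleKind and its method names are hypothetical and not part of the project; the sketch only assumes the naming convention stated above (upper-case first letter means lexer rule).

```java
final class RuleKind {
    /** A rule whose name starts with an upper-case letter is treated as a lexer rule. */
    static boolean isLexerRule(String ruleName) {
        return !ruleName.isEmpty() && Character.isUpperCase(ruleName.charAt(0));
    }

    /** Only parser rules consume whitespace automatically. */
    static boolean skipsWhitespace(String ruleName) {
        return !isLexerRule(ruleName);
    }

    /** Lexer rules may reference only lexer rules; parser rules may reference both kinds. */
    static boolean mayReference(String fromRule, String toRule) {
        return !isLexerRule(fromRule) || isLexerRule(toRule);
    }

    public static void main(String[] args) {
        System.out.println(isLexerRule("WORD"));           // true
        System.out.println(skipsWhitespace("group"));      // true
        System.out.println(mayReference("WORD", "group")); // false
    }
}
```

Anonymous (inline) lexer rules have no name, so they would need a separate flag in a real implementation.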


danieldietrich · 2014-09-13T15:41:07Z

Given WS : [ \t\r\n]+ ;, i.e. parser tokens have to be separated by non-empty whitespace, the following rules apply:

in a sequence, the parser rule references rule1 and rule2 are separated by whitespace
the repetitions of a quantified parser rule rule (quantifier ?, + or *) are separated by whitespace
a lexer rule RULE does not eat up whitespace automatically

Question:
How should whitespace be handled around punctuation / inline lexer rules, as in abc 'LEX' def?

danieldietrich · 2014-09-13T17:22:36Z

More specifically, the following grammar should parse these sentences:

(abc)(def ghi)
( abc ) ( def ghi )

Grammar:

groups : group*

group : '(' WORD+ ')'

WORD : 'a'..'z'+

given a whitespace configuration of WS : [ \t\r\n]+.

Does this make sense?
How should the parser behave differently for WS : [ \t\r\n]* and WS : [ \t\r\n]+?
The current implementation works with both WS definitions, using + and *:
// Lexer rules (lex == true) never skip whitespace; parser rules skip it if DEFAULT_WS matches at index.
static int skipWhitespace(String text, int index, boolean lex) {
    return lex ? index : endIndex(DEFAULT_WS.parse(text, index, true)).orElse(index);
}
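
One way to see why both definitions behave the same here: when the + variant fails to match, orElse(index) falls back to the current index, which is exactly what the * variant returns for an empty match. The sketch below is self-contained and uses hypothetical stand-ins (wsPlus, wsStar, and a BiFunction parameter) in place of the project's DEFAULT_WS parser and endIndex helper.

```java
import java.util.Optional;
import java.util.function.BiFunction;

final class SkipWhitespaceSketch {
    private static final String WS_CHARS = " \t\r\n";

    /** Index of the first non-whitespace character at or after index. */
    private static int scan(String text, int index) {
        int end = index;
        while (end < text.length() && WS_CHARS.indexOf(text.charAt(end)) >= 0) {
            end++;
        }
        return end;
    }

    /** Stand-in for WS with '+': fails (empty) when no whitespace starts at index. */
    static Optional<Integer> wsPlus(String text, int index) {
        int end = scan(text, index);
        return end > index ? Optional.of(end) : Optional.empty();
    }

    /** Stand-in for WS with '*': always succeeds, possibly with an empty match. */
    static Optional<Integer> wsStar(String text, int index) {
        return Optional.of(scan(text, index));
    }

    /** Same shape as the snippet above; a failed match falls back to index. */
    static int skipWhitespace(BiFunction<String, Integer, Optional<Integer>> ws,
                              String text, int index, boolean lex) {
        return lex ? index : ws.apply(text, index).orElse(index);
    }

    public static void main(String[] args) {
        String text = "a  b";
        // Whitespace present at index 1: both variants skip to index 3.
        System.out.println(skipWhitespace(SkipWhitespaceSketch::wsPlus, text, 1, false));
        System.out.println(skipWhitespace(SkipWhitespaceSketch::wsStar, text, 1, false));
        // No whitespace at index 0: both variants stay at index 0.
        System.out.println(skipWhitespace(SkipWhitespaceSketch::wsPlus, text, 0, false));
        System.out.println(skipWhitespace(SkipWhitespaceSketch::wsStar, text, 0, false));
    }
}
```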

Question: What if WORD : [ a-z]+, i.e. a word may contain spaces?

Question: Is this equivalent?

groups : ( '(' [a-z]+ ( [a-z]+ )* ')' )*

Does a quantifier participate in a lexical context (whitespace)?

Do outer braces participate in a lexical context (whitespace)?

One possible answer: A plain Literal is handled specially (whitespace ignored) as an inline lexer rule in a sequence, but not when bound to a quantifier. Braces make no difference to whitespace handling. Charset and Range are handled as normal tokens in a parser rule, i.e. separated by whitespace.

Example:

richString : '"""' .*? '"""'

Update:

The answer above does not seem to solve the problem in practice. Also, the richString example is wrong because, in a parser rule, the .*? does skip whitespace. It should be the lexer rule RichString : '"""' .*? '"""' instead. But what if the rule name richString is needed as part of the parse tree?

#23) * Made EOF parser token invisible in parse tree (fixes #26) * Better naming of instance variables (Supplier<Parser> is a 'parserSupplier', not a 'parser')


danieldietrich · 2014-09-14T22:49:29Z

Two whitespace-handling rules would be great:

defensive / non-greedy -> eat up whitespace only if the next parser does not need it
offensive / greedy -> eat up all whitespace

This should be achievable with

WS : [ \t\r\n]+? (defensive)
WS : [ \t\r\n]+ (offensive)

danieldietrich · 2014-09-15T15:03:24Z

Whitespace variants

WS : [ \t\r\n]+?  // (defensive / non-greedy)
WS : [ \t\r\n]+   // (offensive / greedy)

static int skipWhitespace(String text, int index, boolean lex) {
    return lex ? index : endIndex(WS.parse(text, index, true)).orElse(index);
}
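
A rough sketch of the difference between the two variants. It is not the project's implementation: the class WhitespaceModes is hypothetical, and the "next parser" is approximated by a java.util.regex.Pattern, purely as an assumption for illustration. The defensive variant stops at the first position where the next parser would match, so whitespace that the next parser needs is left in place.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WhitespaceModes {
    private static boolean isWs(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    /** Offensive / greedy: consume every whitespace character starting at index. */
    static int skipGreedy(String text, int index) {
        int end = index;
        while (end < text.length() && isWs(text.charAt(end))) {
            end++;
        }
        return end;
    }

    /** Defensive / non-greedy: stop at the first position where next matches. */
    static int skipDefensive(String text, int index, Pattern next) {
        int limit = skipGreedy(text, index);
        Matcher m = next.matcher(text);
        for (int i = index; i < limit; i++) {
            if (m.region(i, text.length()).lookingAt()) {
                return i; // the next parser needs the remaining whitespace
            }
        }
        return limit; // the next parser does not start with whitespace
    }

    public static void main(String[] args) {
        String text = "a  b";
        System.out.println(skipGreedy(text, 1));                          // 3
        System.out.println(skipDefensive(text, 1, Pattern.compile(" b"))); // 2
        System.out.println(skipDefensive(text, 1, Pattern.compile("b")));  // 3
    }
}
```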

Example 1: Group grammar

Variant 1

//
// group is identified => whitespace is eaten up.
// * quantifier inherits behavior of group => whitespace is eaten up.
//
// ==> the Quantifier parser is responsible for eating up whitespace before and after group.parse()
//
groups : group*

//
// sequence does not eat up whitespace by itself.
// children:
//   '(' is lexical and does not eat up whitespace.
//   WORD is identified => whitespace is eaten up.
//   + quantifier inherits behavior of WORD => whitespace is eaten up.
//   ')' is lexical and does not eat up whitespace.
//
// ==> the Sequence parser is responsible for eating up whitespace before and after WORD.parse()
//
group : '(' WORD+ ')'

//
// range 'a'..'z' is lexical and does not eat up whitespace.
// + quantifier inherits behavior of range => whitespace is not eaten up.
//
// ==> the Quantifier parser is responsible for *not* eating up whitespace before and after range.parse()
//
WORD : 'a'..'z'+

Variant 2

groups : ( '(' [a-z]+ ( [a-z]+ )* ')' )*

Input

(abc)(def ghi)
( abc ) ( def ghi )

// the spaces within '( ... )' are parsed because WORD+ eats up whitespace.
// the spaces between ') ... (' are parsed because group* eats up whitespace.

Parse Tree

// Result of both inputs parsed with variant 1, and of the 1st input parsed with variant 2. The 2nd input is not parsable by variant 2.
Tree(groups (group '(' 'abc' ')') (group '(' 'def' 'ghi' ')'))
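
To check the variant 1 reasoning by hand, here is a small hand-written recursive-descent sketch of the group grammar. It is a hypothetical stand-in, not the project's parser combinators: the class GroupGrammarSketch and its methods are made up for illustration. Whitespace is skipped around group (by the quantifier) and around WORD (by the sequence), while '(' and ')' and the characters inside WORD are matched lexically.

```java
import java.util.ArrayList;
import java.util.List;

final class GroupGrammarSketch {
    private final String text;
    private int pos = 0;

    private GroupGrammarSketch(String text) {
        this.text = text;
    }

    /** groups : group* ; returns the tree as text or throws on unparsable input. */
    static String parse(String text) {
        GroupGrammarSketch p = new GroupGrammarSketch(text);
        StringBuilder tree = new StringBuilder("Tree(groups");
        while (true) {
            p.skipWhitespace(); // the quantifier eats whitespace around each group
            if (p.pos >= text.length() || text.charAt(p.pos) != '(') {
                break;
            }
            tree.append(' ').append(p.group());
        }
        if (p.pos != text.length()) {
            throw new IllegalArgumentException("unexpected input at index " + p.pos);
        }
        return tree.append(')').toString();
    }

    /** group : '(' WORD+ ')' */
    private String group() {
        expect('(');
        List<String> words = new ArrayList<>();
        while (true) {
            skipWhitespace(); // the sequence eats whitespace before and after WORD
            String word = word();
            if (word.isEmpty()) {
                break;
            }
            words.add("'" + word + "'");
        }
        if (words.isEmpty()) {
            throw new IllegalArgumentException("expected WORD at index " + pos);
        }
        expect(')');
        return "(group '(' " + String.join(" ", words) + " ')')";
    }

    /** WORD : 'a'..'z'+ ; lexical, so no whitespace is skipped inside a word. */
    private String word() {
        int start = pos;
        while (pos < text.length() && text.charAt(pos) >= 'a' && text.charAt(pos) <= 'z') {
            pos++;
        }
        return text.substring(start, pos);
    }

    private void skipWhitespace() {
        while (pos < text.length() && " \t\r\n".indexOf(text.charAt(pos)) >= 0) {
            pos++;
        }
    }

    private void expect(char c) {
        if (pos >= text.length() || text.charAt(pos) != c) {
            throw new IllegalArgumentException("expected '" + c + "' at index " + pos);
        }
        pos++;
    }

    public static void main(String[] args) {
        System.out.println(parse("(abc)(def ghi)"));
        System.out.println(parse("( abc ) ( def ghi )"));
    }
}
```

Both calls in main should yield the same tree as shown above, which matches the claim for variant 1.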

Example 2: Rich string grammar

//
// sequence does not eat up whitespace by itself.
// children:
//   '"""' is lexical and does not eat up whitespace.
//   . is lexical and does not eat up whitespace.
//   * quantifier inherits behavior of . => whitespace is not eaten up.
//   ? non-greedy operator inherits behavior of .* => whitespace is not eaten up.
//   '"""' is lexical and does not eat up whitespace.
//
// ==> the Sequence parser is responsible for eating up whitespace before and after WORD.parse()
// ==> the Quantifier parser COMBINES results, if child parser is lexical
// ==> the non-greedy operator just makes a binary decision: eat up more chars with current parser or proceed with next parser, if applicable.
//
richString : '"""' .*? '"""'

Input

""" test """

Parse Tree

Tree(richString '"""' ' test ' '"""')
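
The intended lexical behavior (whitespace inside the delimiters is kept, and the non-greedy .*? stops at the first closing delimiter) can be illustrated with java.util.regex, whose reluctant quantifier *? works the same way. This is only an analogy under that assumption, not the project's parser; the class RichStringSketch and its methods are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RichStringSketch {
    // DOTALL lets '.' match line breaks too; (.*?) is the reluctant (non-greedy) part.
    private static final Pattern RICH_STRING =
            Pattern.compile("\"\"\"(.*?)\"\"\"", Pattern.DOTALL);

    /** Returns the text between the delimiters, whitespace included, if input starts with a rich string. */
    static Optional<String> content(String input) {
        Matcher m = RICH_STRING.matcher(input);
        return m.lookingAt() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Renders the match in the parse tree notation used above. */
    static String tree(String input) {
        return content(input)
                .map(c -> "Tree(richString '\"\"\"' '" + c + "' '\"\"\"')")
                .orElseThrow(() -> new IllegalArgumentException("not a rich string"));
    }

    public static void main(String[] args) {
        System.out.println(tree("\"\"\" test \"\"\""));
    }
}
```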

danieldietrich · 2014-09-17T19:15:31Z

It should be sufficient (and more effective) to introduce a new parser, RuleRef, which eats up whitespace. Then a quantifier does not have to inherit the behavior of its underlying parser.

A RuleRef is like a pointer to a Parser. If it is initialized with a Supplier, e.g. RuleRef( () -> new Literal("xyz") ) or RuleRef( SomeClass::parser ), then all parsers with children can be simplified to encapsulate a Parser instead of a Supplier.
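
A minimal sketch of the RuleRef idea, under stated assumptions: the Parser interface here is reduced to a hypothetical parse(text, index) that returns the end index of a match, and Literal is a simplified stand-in. The actual project types may look different.

```java
import java.util.Optional;
import java.util.function.Supplier;

final class RuleRefSketch {
    /** Hypothetical minimal parser interface: returns the end index of a match, if any. */
    interface Parser {
        Optional<Integer> parse(String text, int index);
    }

    /** Matches a fixed string; lexical, so it does not skip whitespace. */
    static final class Literal implements Parser {
        private final String literal;

        Literal(String literal) {
            this.literal = literal;
        }

        @Override
        public Optional<Integer> parse(String text, int index) {
            return text.startsWith(literal, index)
                    ? Optional.of(index + literal.length())
                    : Optional.empty();
        }
    }

    /** Pointer to a lazily resolved parser; eats up whitespace before delegating. */
    static final class RuleRef implements Parser {
        private final Supplier<Parser> supplier;

        RuleRef(Supplier<Parser> supplier) {
            this.supplier = supplier;
        }

        @Override
        public Optional<Integer> parse(String text, int index) {
            int start = index;
            while (start < text.length() && " \t\r\n".indexOf(text.charAt(start)) >= 0) {
                start++;
            }
            return supplier.get().parse(text, start);
        }
    }

    public static void main(String[] args) {
        Parser literal = new Literal("xyz");
        Parser ref = new RuleRef(() -> literal);
        System.out.println(literal.parse("  xyz", 0).isPresent()); // false
        System.out.println(ref.parse("  xyz", 0).isPresent());     // true
    }
}
```

Because the Supplier is only resolved inside parse, a rule can refer to itself or to rules defined later, and the whitespace decision lives in one place instead of being inherited by each quantifier.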

danieldietrich added the bug label Sep 2, 2014

danieldietrich self-assigned this Sep 2, 2014

danieldietrich added a commit that referenced this issue Sep 9, 2014

#23, #25: Whitespace handling & distinguishing between lexer and parser mode. Still unimplemented methods in Quantifier parser.

f6a1093

danieldietrich added a commit that referenced this issue Sep 10, 2014

Working parser but before fixing issues #23 and #25 more tests are needed.

c5e8274

danieldietrich added a commit that referenced this issue Sep 13, 2014

Fixes #25. Added more tests and some TODOs regarding #23.

f28999f

danieldietrich closed this as completed Sep 13, 2014

danieldietrich reopened this Sep 13, 2014

danieldietrich added a commit that referenced this issue Sep 13, 2014

Towards #23 / whitespace handling

0fb410f

danieldietrich added a commit that referenced this issue Sep 14, 2014

Steps towards #23 (whitespace handling):

* made GroupGrammar examples working
* RichString example does currently not work

1752895

danieldietrich closed this as completed in f6b2118 Sep 20, 2014

danieldietrich mentioned this issue Sep 25, 2014

Make whitespace configurable #48

Closed

danieldietrich changed the title ~~[parser] Fix whitespace handling~~ Fix whitespace handling Oct 4, 2014

danieldietrich added the [parser] label Oct 4, 2014

danieldietrich mentioned this issue Oct 4, 2014

Rethink whitespace handling #58

Closed

danieldietrich added this to the ?.?.? Parser milestone Oct 23, 2016
