Fix whitespace handling #23

Closed
danieldietrich opened this issue Sep 2, 2014 · 5 comments


@danieldietrich
Member

Whitespace configuration:

Update: See also issue #27.

  • Whitespace can be declared via WS : ... -> skip ; or WHITESPACE : ... -> skip ;.
  • There is an implicit default whitespace declaration, e.g. WS : [ \t\r\n]+ -> skip ;.
  • Additionally, whitespace should be customizable, i.e. by overriding the default whitespace rule.
  • (Note: it is currently unclear how to distinguish fragment from -> skip, because there is only a parsing phase and no lexing phase, so fragment and skip have the same semantics.)

Rules for handling whitespace:

  • When parsing (rule name starts with a lower-case letter), whitespace is parsed automatically.
  • When lexing (rule name starts with an upper-case letter), whitespace is not parsed automatically.
  • Lexer rules can only reference other lexer rules.
  • Parser rules can reference parser and lexer rules and can also declare anonymous lexer rules.
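
The naming convention and reference rule above could be checked mechanically. The following is only a minimal sketch; the class RuleKindSketch and its methods are hypothetical names for illustration and are not part of the project.

```java
import java.util.List;

final class RuleKindSketch {

    // Lexer rules start with an upper-case letter, parser rules with a lower-case letter.
    static boolean isLexerRule(String ruleName) {
        return Character.isUpperCase(ruleName.charAt(0));
    }

    // Lexer rules may only reference lexer rules; parser rules may reference both kinds.
    static boolean referencesAllowed(String ruleName, List<String> referencedRules) {
        if (!isLexerRule(ruleName)) {
            return true;
        }
        return referencedRules.stream().allMatch(RuleKindSketch::isLexerRule);
    }
}
```

For example, group may reference WORD, but WORD may not reference group.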
@danieldietrich danieldietrich self-assigned this Sep 2, 2014
danieldietrich added a commit that referenced this issue Sep 9, 2014
mode. Still unimplemented methods in Quantifier parser.
@danieldietrich
Member Author

Given WS : [ \t\r\n]+ ;, i.e. parser tokens have to be separated by non-empty whitespace, there are the following rules:

  • in a sequence, parser rule references rule1 and rule2 are separated by whitespace
  • the repetitions of a parser rule rule under a quantifier ?, + or * are separated by whitespace
  • a lexer rule RULE does not eat up whitespace automatically

Question:
How to cope with punctuation / inline lexer rules abc 'LEX' def regarding whitespace?

@danieldietrich
Member Author

More specifically, the following grammar should parse these sentences:

  • (abc)(def ghi)
  • ( abc ) ( def ghi )

Grammar:

groups : group*

group : '(' WORD+ ')'

WORD : 'a'..'z'+

given a whitespace configuration of WS : [ \t\r\n]+.

Does this make sense?
How should the parser behave differently for WS : [ \t\r\n]* and WS : [ \t\r\n]+?
The current implementation works with both WS definitions, using + and *:

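// lex == true: lexer rules do not skip whitespace.
// Otherwise skip a DEFAULT_WS match, or stay at index if there is none.
static int skipWhitespace(String text, int index, boolean lex) {
    return lex ? index : endIndex(DEFAULT_WS.parse(text, index, true)).orElse(index);
}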
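
Why both definitions behave the same here can be seen in a self-contained sketch. It assumes java.util.regex as a stand-in for the project's parsers; WhitespaceSkipSketch and matchEnd are hypothetical names, with matchEnd playing the role of endIndex(DEFAULT_WS.parse(...)).

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WhitespaceSkipSketch {

    static final Pattern WS_PLUS = Pattern.compile("[ \\t\\r\\n]+");
    static final Pattern WS_STAR = Pattern.compile("[ \\t\\r\\n]*");

    // Hypothetical stand-in for endIndex(WS.parse(...)):
    // the end of a whitespace match starting exactly at index, if there is one.
    static OptionalInt matchEnd(Pattern ws, String text, int index) {
        Matcher m = ws.matcher(text);
        m.region(index, text.length());
        return m.lookingAt() ? OptionalInt.of(m.end()) : OptionalInt.empty();
    }

    // Same shape as the snippet above, with the whitespace rule passed in.
    static int skipWhitespace(Pattern ws, String text, int index, boolean lex) {
        return lex ? index : matchEnd(ws, text, index).orElse(index);
    }
}
```

With +, a missing match yields an empty result and orElse(index) falls back to the current index; with *, the empty match ends at the current index anyway. In this sketch both variants therefore return the same position for every input.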

Question: What if WORD : [ a-z]+, i.e. may contain spaces?

Question: Is this equivalent?

groups : ( '(' [a-z]+ ( [a-z]+ )* ')' )*

Does a quantifier participate in a lexical context (whitespace)?

Do outer parentheses participate in a lexical context (whitespace)?

One possible answer: A plain Literal is handled specially (whitespace ignored) as an inline lexer rule in a sequence, but not when bound to a quantifier. Parentheses make no difference to whitespace handling. Charset and Range are handled as normal tokens in a parser rule, i.e. separated by whitespace.

Example:

richString : '"""' .*? '"""'

Update:

The answer above does not seem to solve the problem in practice. Also, the richString example is wrong because, as a parser rule, its .*? skips whitespace. It should be the lexer rule RichString : '"""' .*? '"""' instead. But what if the rule name richString is needed as part of the parse tree?

danieldietrich added a commit that referenced this issue Sep 13, 2014
danieldietrich added a commit that referenced this issue Sep 14, 2014
#23)
* Made EOF parser token invisible in parse tree (fixes #26)
* Better naming of instance variables (Supplier<Parser> is a
'parserSupplier', not a 'parser')
danieldietrich added a commit that referenced this issue Sep 14, 2014
* made GroupGrammar examples working
* RichString example does currently not work
@danieldietrich
Member Author

Two whitespace-handling rules would be great:

  • defensive / non-greedy -> eat up whitespace only if the next parser does not need it
  • offensive / greedy -> eat up all whitespace

This should be achievable with

  • WS : [ \t\r\n]+? (defensive)
  • WS : [ \t\r\n]+ (offensive)

@danieldietrich
Member Author

Whitespace variants

WS : [ \t\r\n]+?  // (defensive / non-greedy)
WS : [ \t\r\n]+   // (offensive / greedy)

static int skipWhitespace(String text, int index, boolean lex) {
    return lex ? index : endIndex(WS.parse(text, index, true)).orElse(index);
}
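
One way to read the two variants is sketched below. It rests on assumptions not stated in the issue: the defensive variant tries the next parser at every position of the whitespace run (including the start) and stops at the first position where it matches. WhitespaceModeSketch, its Parser interface, skipGreedy and skipDefensive are hypothetical names.

```java
final class WhitespaceModeSketch {

    // Hypothetical minimal parser shape: end index of a match at index, or -1.
    interface Parser {
        int parse(String text, int index);
    }

    static boolean isWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    // Offensive / greedy: consume the whole whitespace run.
    static int skipGreedy(String text, int index) {
        int i = index;
        while (i < text.length() && isWhitespace(text.charAt(i))) {
            i++;
        }
        return i;
    }

    // Defensive / non-greedy: consume as little whitespace as possible,
    // stopping at the first position where the next parser can match.
    static int skipDefensive(String text, int index, Parser next) {
        int end = skipGreedy(text, index);
        for (int i = index; i <= end; i++) {
            if (next.parse(text, i) >= 0) {
                return i;
            }
        }
        return end; // the next parser never matches, so behave like the greedy variant
    }
}
```

If the next parser itself starts with a space, the defensive variant leaves that space in place, while the greedy variant consumes it and makes the next parser fail.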

Example 1: Group grammar

Variant 1

//
// group is identified => whitespace is eaten up.
// * quantifier inherits behavior of group => whitespace is eaten up.
//
// ==> the Quantifier parser is responsible for eating up whitespace before and after group.parse()
//
groups : group*

//
// sequence does not eat up whitespace by itself.
// children:
//   '(' is lexical and does not eat up whitespace.
//   WORD is identified => whitespace is eaten up.
//   + quantifier inherits behavior of WORD => whitespace is eaten up.
//   ')' is lexical and does not eat up whitespace.
//
// ==> the Sequence parser is responsible for eating up whitespace before and after WORD.parse()
//
group : '(' WORD+ ')'

//
// range 'a'..'z' is lexical and does not eat up whitespace.
// + quantifier inherits behavior of range => whitespace is not eaten up.
//
// ==> the Quantifier parser is responsible for *not* eating up whitespace before and after range.parse()
//
WORD : 'a'..'z'+

Variant 2

groups : ( '(' [a-z]+ ( [a-z]+ )* ')' )*

Input

  1. (abc)(def ghi)
  2. ( abc ) ( def ghi )
// the spaces within '( ... )' are parsed because WORD+ eats up whitespace.
// the spaces between ') ... (' are parsed because group* eats up whitespace.

Parse Tree

// Result of both inputs parsed with variant 1, and of the 1st input parsed with variant 2. The 2nd input is not parsable by variant 2.
Tree(groups (group '(' 'abc' ')') (group '(' 'def' 'ghi' ')'))
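
The variant 1 behavior can be traced with a small hand-written recursive-descent sketch. It is not the project's parser: GroupGrammarSketch is a hypothetical class that hard-codes the three rules and skips whitespace around each group and each WORD, but not around the literals '(' and ')'.

```java
import java.util.ArrayList;
import java.util.List;

final class GroupGrammarSketch {

    private final String text;
    private int pos = 0;

    private GroupGrammarSketch(String text) {
        this.text = text;
    }

    static String parse(String input) {
        GroupGrammarSketch parser = new GroupGrammarSketch(input);
        String tree = parser.groups();
        if (parser.pos != input.length()) {
            throw new IllegalArgumentException("unexpected input at index " + parser.pos);
        }
        return "Tree(" + tree + ")";
    }

    // WS : [ \t\r\n]+ with a fallback to the current position if nothing matches.
    private void skipWhitespace() {
        while (pos < text.length() && " \t\r\n".indexOf(text.charAt(pos)) >= 0) {
            pos++;
        }
    }

    // groups : group*   (whitespace is skipped before and after each group)
    private String groups() {
        StringBuilder tree = new StringBuilder("groups");
        while (true) {
            skipWhitespace();
            String group = group();
            if (group == null) {
                break;
            }
            tree.append(' ').append(group);
            skipWhitespace();
        }
        return tree.toString();
    }

    // group : '(' WORD+ ')'   (whitespace is skipped before and after each WORD)
    private String group() {
        int start = pos;
        if (!literal('(')) {
            return null;
        }
        List<String> words = new ArrayList<>();
        while (true) {
            skipWhitespace();
            String word = word();
            if (word == null) {
                break;
            }
            words.add(word);
            skipWhitespace();
        }
        if (words.isEmpty() || !literal(')')) {
            pos = start; // backtrack: no group here
            return null;
        }
        return "(group '(' " + String.join(" ", words) + " ')')";
    }

    // WORD : 'a'..'z'+   (lexical: no whitespace skipping)
    private String word() {
        int start = pos;
        while (pos < text.length() && text.charAt(pos) >= 'a' && text.charAt(pos) <= 'z') {
            pos++;
        }
        return pos > start ? "'" + text.substring(start, pos) + "'" : null;
    }

    // Inline literal: lexical, no whitespace skipping.
    private boolean literal(char c) {
        if (pos < text.length() && text.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }
}
```

Calling GroupGrammarSketch.parse on either input from the list above produces the tree shown.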

Example 2: Rich string grammar

//
// sequence does not eat up whitespace by itself.
// children:
//   '"""' is lexical and does not eat up whitespace.
//   . is lexical and does not eat up whitespace.
//   * quantifier inherits behavior of . => whitespace is not eaten up.
//   ? non-greedy operator inherits behavior of .* => whitespace is not eaten up.
//   '"""' is lexical and does not eat up whitespace.
//
// ==> the Sequence parser is responsible for eating up whitespace before and after WORD.parse()
// ==> the Quantifier parser COMBINES results, if child parser is lexical
// ==> the non-greedy operator just makes a binary decision: eat up more chars with current parser or proceed with next parser, if applicable.
//
richString : '"""' .*? '"""'

Input

""" test """

Parse Tree

Tree(richString '"""' ' test ' '"""')
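
The binary decision of the non-greedy operator and the combining of lexical results can be sketched as a loop. RichStringSketch is a hypothetical class that hard-codes the richString rule; it is an illustration under these assumptions, not the project's Quantifier implementation.

```java
final class RichStringSketch {

    private static final String QUOTES = "\"\"\"";

    static String parse(String text) {
        if (!text.startsWith(QUOTES)) {
            throw new IllegalArgumentException("expected opening quotes");
        }
        int pos = QUOTES.length();
        StringBuilder body = new StringBuilder();
        // Non-greedy .*? : at each position, first ask whether the next parser
        // (the closing quotes) matches; only if it does not, consume one more char.
        while (!text.startsWith(QUOTES, pos)) {
            if (pos >= text.length()) {
                throw new IllegalArgumentException("unterminated rich string");
            }
            // '.' is lexical, so whitespace is kept and the chars are combined into one token.
            body.append(text.charAt(pos));
            pos++;
        }
        pos += QUOTES.length();
        if (pos != text.length()) {
            throw new IllegalArgumentException("unexpected input at index " + pos);
        }
        return "Tree(richString '" + QUOTES + "' '" + body + "' '" + QUOTES + "')";
    }
}
```

For the input above, the body token keeps its leading and trailing space, as in the parse tree shown.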

@danieldietrich
Member Author

It should be sufficient (and more effective) to introduce a new parser, 'RuleRef', which eats up whitespace. Then a quantifier does not have to inherit the behavior of its underlying parser.

A RuleRef is like a pointer to a Parser. If it is initialized with a Supplier, e.g. RuleRef( () -> new Literal("xyz") ) or RuleRef( SomeClass::parser ), then all parsers with children can be simplified to encapsulate a Parser instead of a Supplier<Parser>.
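
A minimal sketch of this idea follows. Only the names RuleRef, Literal, Parser and Supplier come from the discussion; the interface shape (an end index, or -1 on failure), the wrapper class RuleRefSketch and the whitespace handling inside parse are assumptions made for illustration.

```java
import java.util.function.Supplier;

final class RuleRefSketch {

    // Assumed parser shape: end index of a match starting at index, or -1.
    interface Parser {
        int parse(String text, int index);
    }

    // Lexical parser: matches its text exactly and never skips whitespace.
    static final class Literal implements Parser {
        private final String literal;

        Literal(String literal) {
            this.literal = literal;
        }

        @Override
        public int parse(String text, int index) {
            return text.startsWith(literal, index) ? index + literal.length() : -1;
        }
    }

    // Pointer to another parser, resolved lazily through a Supplier.
    // Whitespace is eaten up here, so quantifiers need no special handling.
    static final class RuleRef implements Parser {
        private final Supplier<Parser> parserSupplier;

        RuleRef(Supplier<Parser> parserSupplier) {
            this.parserSupplier = parserSupplier;
        }

        @Override
        public int parse(String text, int index) {
            int start = skipWhitespace(text, index);
            int end = parserSupplier.get().parse(text, start);
            return end < 0 ? -1 : skipWhitespace(text, end);
        }

        private static int skipWhitespace(String text, int index) {
            int i = index;
            while (i < text.length() && " \t\r\n".indexOf(text.charAt(i)) >= 0) {
                i++;
            }
            return i;
        }
    }
}
```

Because the referenced parser is only looked up inside parse, a rule can refer to itself or to a rule defined later without an initialization cycle.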

@danieldietrich danieldietrich changed the title from "[parser] Fix whitespace handling" to "Fix whitespace handling" Oct 4, 2014
@danieldietrich danieldietrich added this to the ?.?.? Parser milestone Oct 23, 2016