Fix whitespace handling #23
mode. Still unimplemented methods in Quantifier parser.
Given
Question: |
More specifically, the following grammar should parse these sentences:
Grammar:
given a whitespace configuration of
Question: What if Question: Is this equivalent?
Does a quantifier participate in a lexical context (whitespace)? Do outer braces participate in a lexical context (whitespace)? One possible answer: A plain Literal is handled specially (whitespace ignored) as an inline lexer rule in a sequence, but not when bound to a quantifier. Braces make no difference to whitespace handling. Charset and Range are handled as normal tokens in a parser rule, i.e. separated by whitespace. Example:
Update: The answer above seems to not solve the problem in practice. Also the richString example is wrong because the |
* made GroupGrammar examples work
* RichString example does not currently work
Two whitespace-handling rules would be great:
This should be accomplishable by
|
Whitespace variants
WS : [ \t\r\n]+? // (defensive / non-greedy)
WS : [ \t\r\n]+ // (offensive / greedy)
static int skipWhitespace(String text, int index, boolean lex) {
return lex ? index : endIndex(WS.parse(text, index, true)).orElse(index);
}
Example 1: Group grammar
Variant 1
//
// group is identified => whitespace is eaten up.
// * quantifier inherits behavior of group => whitespace is eaten up.
//
// ==> the Quantifier parser is responsible for eating up whitespace before and after group.parse()
//
groups : group*
//
// sequence does not eat up whitespace by itself.
// children:
// '(' is lexical and does not eat up whitespace.
// WORD is identified => whitespace is eaten up.
// + quantifier inherits behavior of WORD => whitespace is eaten up.
// ')' is lexical and does not eat up whitespace.
//
// ==> the Sequence parser is responsible for eating up whitespace before and after WORD.parse()
//
group : '(' WORD+ ')'
//
// range 'a'..'z' is lexical and does not eat up whitespace.
// + quantifier inherits behavior of range => whitespace is not eaten up.
//
// ==> the Quantifier parser is responsible for *not* eating up whitespace before and after range.parse()
//
WORD : 'a'..'z'+
Variant 2
groups : ( '(' [a-z]+ ( [a-z]+ )* ')' )*
Input
// the spaces within '( ... )' are parsed because WORD+ eats up whitespace.
// the spaces between ') ... (' are parsed because group* eats up whitespace.
Parse Tree
// result of both inputs parsed with variant 1, and of the 1st input parsed with variant 2. The 2nd input is not parsable by variant 2.
Tree(groups (group '(' 'abc' ')') (group '(' 'def' 'ghi' ')'))
Example 2: Rich string grammar
//
// sequence does not eat up whitespace by itself.
// children:
// '"""' is lexical and does not eat up whitespace.
// . is lexical and does not eat up whitespace.
// * quantifier inherits behavior of . => whitespace is not eaten up.
// ? non-greedy operator inherits behavior of .* => whitespace is not eaten up.
// '"""' is lexical and does not eat up whitespace.
//
// ==> the Sequence parser is responsible for eating up whitespace before and after WORD.parse()
// ==> the Quantifier parser COMBINES results, if child parser is lexical
// ==> the non-greedy operator just makes a binary decision: eat up more chars with current parser or proceed with next parser, if applicable.
//
richString : '"""' .*? '"""'
Input
""" test """
Parse Tree
Tree(richString '"""' ' test ' '"""')
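The expected result of Example 2 (the body ' test ' keeps its surrounding spaces) can be checked with a small stand-in. This sketch uses `java.util.regex` rather than the project's own parser classes, and the names `RichStringSketch` and `body` are hypothetical. It only illustrates the intended behavior: the reluctant quantifier `.*?` consumes one more character at a time until the closing delimiter matches, and no whitespace is skipped anywhere.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the richString rule; not the project's parser.
final class RichStringSketch {
    // '"""' .*? '"""' : DOTALL lets '.' also match line breaks.
    private static final Pattern RICH_STRING =
            Pattern.compile("\"\"\"(.*?)\"\"\"", Pattern.DOTALL);

    // Returns the text between the delimiters with whitespace preserved,
    // or null if the input does not start with a rich string.
    static String body(String input) {
        Matcher m = RICH_STRING.matcher(input);
        // lookingAt() anchors at the start but stops at the first closing delimiter.
        return m.lookingAt() ? m.group(1) : null;
    }
}
```

Calling `RichStringSketch.body("\"\"\" test \"\"\"")` yields the string `" test "`, matching the parse tree above.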
It should be sufficient (and more effective) to invent a new parser 'RuleRef', which eats up whitespace. Then quantifier does not have to inherit the behavior of its underlying parser. A RuleRef is like a pointer to a Parser. If it is initialized with a Supplier, i.e. |
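A minimal sketch of the RuleRef idea, applied to variant 1 of the group grammar. Everything here is an assumption made for illustration: the `SketchParser` interface, the helper names, and the sample input are hypothetical and are not taken from the project's code. Literals and ranges stay purely lexical, Sequence and Quantifier do no whitespace handling of their own, and only `ruleRef` skips whitespace before and after the rule it points to. The `Supplier` defers looking up the target until parse time.

```java
import java.util.OptionalInt;
import java.util.function.Supplier;

// Hypothetical parser interface: returns the end index of a match, or empty on failure.
interface SketchParser {
    OptionalInt parse(String text, int index);
}

final class RuleRefSketch {
    // Greedy skip, corresponding to WS : [ \t\r\n]+
    static int skipWhitespace(String text, int index) {
        while (index < text.length() && " \t\r\n".indexOf(text.charAt(index)) >= 0) index++;
        return index;
    }

    // Lexical: matches the literal exactly, never skips whitespace.
    static SketchParser literal(String s) {
        return (text, i) -> text.startsWith(s, i) ? OptionalInt.of(i + s.length()) : OptionalInt.empty();
    }

    // Lexical: 'a'..'z'+ without any whitespace skipping.
    static SketchParser lowerCaseWord() {
        return (text, i) -> {
            int j = i;
            while (j < text.length() && text.charAt(j) >= 'a' && text.charAt(j) <= 'z') j++;
            return j > i ? OptionalInt.of(j) : OptionalInt.empty();
        };
    }

    // RuleRef: resolves its target lazily and eats whitespace before and after it.
    static SketchParser ruleRef(Supplier<SketchParser> target) {
        return (text, i) -> {
            OptionalInt end = target.get().parse(text, skipWhitespace(text, i));
            return end.isPresent() ? OptionalInt.of(skipWhitespace(text, end.getAsInt())) : end;
        };
    }

    // Sequence: runs its children in order; no whitespace handling of its own.
    static SketchParser sequence(SketchParser... parts) {
        return (text, i) -> {
            int pos = i;
            for (SketchParser p : parts) {
                OptionalInt end = p.parse(text, pos);
                if (!end.isPresent()) return OptionalInt.empty();
                pos = end.getAsInt();
            }
            return OptionalInt.of(pos);
        };
    }

    // Quantifier: repeats the child greedily; inherits nothing from it.
    static SketchParser quantifier(SketchParser child, int min) {
        return (text, i) -> {
            int pos = i, count = 0;
            while (true) {
                OptionalInt end = child.parse(text, pos);
                if (!end.isPresent() || end.getAsInt() == pos) break;
                pos = end.getAsInt();
                count++;
            }
            return count >= min ? OptionalInt.of(pos) : OptionalInt.empty();
        };
    }

    static final SketchParser WORD = lowerCaseWord();                       // WORD : 'a'..'z'+
    static final SketchParser GROUP =                                       // group : '(' WORD+ ')'
            sequence(literal("("), quantifier(ruleRef(() -> WORD), 1), literal(")"));
    static final SketchParser GROUPS = quantifier(ruleRef(() -> GROUP), 0); // groups : group*

    // True if the whole input is consumed by the groups rule.
    static boolean matchesAll(String text) {
        OptionalInt end = GROUPS.parse(text, 0);
        return end.isPresent() && end.getAsInt() == text.length();
    }
}
```

With this arrangement, a made-up input such as `(abc) ( def ghi )` is accepted by `RuleRefSketch.matchesAll`, while `(abc` is rejected. The design choice is that whitespace handling lives in exactly one place, so Quantifier and Sequence need no knowledge of whether their child is lexical.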
Whitespace configuration:
Update: See also issue #27.
WS : ... -> skip ;
or WHITESPACE : ... -> skip ;
WS : [ \t\r\n]+ -> skip ;
fragment
and -> skip
because there will be only a parsing phase/no lexing phase and then fragment and skip have the same semantics...)
Rules for handling whitespace: