Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.
Sign upecma262 lexical grammar #81
Comments
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
benjamingr
Oct 12, 2015
Contributor
Well, trivially because it needs to support context free semantics. For example, a classical example of a grammar that is context free grammar is:
{x | x is an expression containing `{` and `}` and the parenthesis match }
That is, matching {s and } }s. It is trivial to construct a context free grammar for it, but it is impossible to build a regular expression that parses it. Namely, because otherwise it would violate the pumping lemma - there is a direct proof in Sipser's introductory book on the topic if you'd prefer that. Intuitively, the reason is that regular languages cannot count.
Now that we covered that the language is not regular - note that under the alphabet containing only { and } only the expressions that are in the language are valid ECMAScript statements. Namely:
{} // valid
{{}} // valid
{{}{}} // valid
{{{}}} // valid
{{{ // invalid
}}} // invalidAnd so on - so only words that are valid ECMAScript statements are in the langauge under that alphabet. That language isn't regular and it's a subset of ES created by limiting the alphabet so ECMAScript must not be regular too.
Of course, you can parse ECMAScript with regular expressions that contain various additions. Note that most real programming languages do not have a regular grammar.
|
Well, trivially because it needs to support context free semantics. For example, a classical example of a grammar that is context free grammar is:
That is, matching Now that we covered that the language is not regular - note that under the alphabet containing only {} // valid
{{}} // valid
{{}{}} // valid
{{{}}} // valid
{{{ // invalid
}}} // invalidAnd so on - so only words that are valid ECMAScript statements are in the langauge under that alphabet. That language isn't regular and it's a subset of ES created by limiting the alphabet so ECMAScript must not be regular too. Of course, you can parse ECMAScript with regular expressions that contain various additions. Note that most real programming languages do not have a regular grammar. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
benjamingr
Oct 12, 2015
Contributor
As for "why" - because programming languages need to be able to support arbitrary nesting of function calls/blocks/whatever.
|
As for "why" - because programming languages need to be able to support arbitrary nesting of function calls/blocks/whatever. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
mehdishojaei
Oct 12, 2015
@benjamingr, Please be informed that my question is about lexical grammar and not syntactic grammar.
mehdishojaei
commented
Oct 12, 2015
|
@benjamingr, Please be informed that my question is about lexical grammar and not syntactic grammar. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
bterlson
Oct 12, 2015
Member
I believe all the individual productions of the lexical grammar are regular and could be represented as a regular expression if someone cared to do so. But there are the complications, for example the different goal symbols depending on parser context (eg. whether a / is the start of a regexp literal or a divide token) and automatic semicolon insertion.
|
I believe all the individual productions of the lexical grammar are regular and could be represented as a regular expression if someone cared to do so. But there are the complications, for example the different goal symbols depending on parser context (eg. whether a / is the start of a regexp literal or a divide token) and automatic semicolon insertion. |
bterlson
added
the
question
label
Oct 12, 2015
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
mehdishojaei
Oct 13, 2015
I believe all the individual productions of the lexical grammar are regular and could be represented as a regular expression
@bterlson Thanks, Yes I believe too, but I think one of the main problems in converting this lexical grammar to a regular grammar is it's recursive productions.
mehdishojaei
commented
Oct 13, 2015
@bterlson Thanks, Yes I believe too, but I think one of the main problems in converting this lexical grammar to a regular grammar is it's recursive productions. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
|
Which lexical productions aren't regular? |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
mehdishojaei
Oct 14, 2015
General forms of a regular grammar are:
right regular grammar:

or
left regular grammar:

Where A,B are variables and
is sequence of terminals (including ε).
Any grammar witch has a production that is not in one of the above forms is not regular. As you know, Regular grammars can be mapped easily and efficiently to Regular Expressions that can be used in lexer generators like lex.
But lexical grammar of ecma262 is a Context-Free grammar, that cannot be represented as Regular Expressions.
General form of a Context-Free grammar is:

Where A is a variable and
is any sequence of terminals and variables.
mehdishojaei
commented
Oct 14, 2015
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
benjamingr
Oct 14, 2015
Contributor
@benjamingr, Please be informed that my question is about lexical grammar and not syntactic grammar.
Ah, I don't know how I missed that in the title. Anyway:
There are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements. This requires multiple goal symbols for the lexical grammar. The InputElementRegExpOrTemplateTail goal is used in syntactic grammar contexts where a RegularExpressionLiteral, a TemplateMiddle, or a TemplateTail is permitted. The InputElementRegExp goal symbol is used in all syntactic grammar contexts where a RegularExpressionLiteral is permitted but neither a TemplateMiddle, nor a TemplateTail is permitted. The InputElementTemplateTail goal is used in all syntactic grammar contexts where a TemplateMiddle or a TemplateTail is permitted but a RegularExpressionLiteral is not permitted. In all other contexts, InputElementDiv is used as the lexical goal symbol.
Is probably the only relevant reference in the spec.
Ah, I don't know how I missed that in the title. Anyway:
Is probably the only relevant reference in the spec. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
bterlson
Oct 14, 2015
Member
I think you might be hung up a bit on notation. This isn't my strong area but given this abbreviated production of the lexical grammar:
InputElementDiv :
WhiteSpace
LineTerminator
WhiteSpace:
<TAB>
<VT>
LineTerminator
<LF>
<CR>
You might think this is a CFG but I don't think it is. You could refactor to remove the indirection and end up with the syntactically identical:
InputElementDiv :
<TAB>
<VT>
<LF>
<CR>
Which is clearly regular.
Anyway, the non-regularity is ASI and multiple goal symbols. Their motivation should be clear by reading Clause 11.
|
I think you might be hung up a bit on notation. This isn't my strong area but given this abbreviated production of the lexical grammar:
You might think this is a CFG but I don't think it is. You could refactor to remove the indirection and end up with the syntactically identical:
Which is clearly regular. Anyway, the non-regularity is ASI and multiple goal symbols. Their motivation should be clear by reading Clause 11. |
mehdishojaei commentedOct 12, 2015
Why ecma262 lexical grammar is not a regular grammar (that can be represented as Regular Expressions)?