New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ecma262 lexical grammar #81

Closed
mehdishojaei opened this Issue Oct 12, 2015 · 9 comments

Comments

Projects
None yet
3 participants
@mehdishojaei

mehdishojaei commented Oct 12, 2015

Why ecma262 lexical grammar is not a regular grammar (that can be represented as Regular Expressions)?

@benjamingr

This comment has been minimized.

Show comment
Hide comment
@benjamingr

benjamingr Oct 12, 2015

Contributor

Well, trivially because it needs to support context free semantics. For example, a classical example of a grammar that is context free grammar is:

{x | x is an expression containing `{` and `}` and the parenthesis match }

That is, matching {s and } }s. It is trivial to construct a context free grammar for it, but it is impossible to build a regular expression that parses it. Namely, because otherwise it would violate the pumping lemma - there is a direct proof in Sipser's introductory book on the topic if you'd prefer that. Intuitively, the reason is that regular languages cannot count.

Now that we covered that the language is not regular - note that under the alphabet containing only { and } only the expressions that are in the language are valid ECMAScript statements. Namely:

{} // valid
{{}} // valid
{{}{}} // valid
{{{}}} // valid
{{{ // invalid
}}} // invalid

And so on - so only words that are valid ECMAScript statements are in the langauge under that alphabet. That language isn't regular and it's a subset of ES created by limiting the alphabet so ECMAScript must not be regular too.

Of course, you can parse ECMAScript with regular expressions that contain various additions. Note that most real programming languages do not have a regular grammar.

Contributor

benjamingr commented Oct 12, 2015

Well, trivially because it needs to support context free semantics. For example, a classical example of a grammar that is context free grammar is:

{x | x is an expression containing `{` and `}` and the parenthesis match }

That is, matching {s and } }s. It is trivial to construct a context free grammar for it, but it is impossible to build a regular expression that parses it. Namely, because otherwise it would violate the pumping lemma - there is a direct proof in Sipser's introductory book on the topic if you'd prefer that. Intuitively, the reason is that regular languages cannot count.

Now that we covered that the language is not regular - note that under the alphabet containing only { and } only the expressions that are in the language are valid ECMAScript statements. Namely:

{} // valid
{{}} // valid
{{}{}} // valid
{{{}}} // valid
{{{ // invalid
}}} // invalid

And so on - so only words that are valid ECMAScript statements are in the langauge under that alphabet. That language isn't regular and it's a subset of ES created by limiting the alphabet so ECMAScript must not be regular too.

Of course, you can parse ECMAScript with regular expressions that contain various additions. Note that most real programming languages do not have a regular grammar.

@benjamingr

This comment has been minimized.

Show comment
Hide comment
@benjamingr

benjamingr Oct 12, 2015

Contributor

As for "why" - because programming languages need to be able to support arbitrary nesting of function calls/blocks/whatever.

Contributor

benjamingr commented Oct 12, 2015

As for "why" - because programming languages need to be able to support arbitrary nesting of function calls/blocks/whatever.

@mehdishojaei

This comment has been minimized.

Show comment
Hide comment
@mehdishojaei

mehdishojaei Oct 12, 2015

@benjamingr, Please be informed that my question is about lexical grammar and not syntactic grammar.

mehdishojaei commented Oct 12, 2015

@benjamingr, Please be informed that my question is about lexical grammar and not syntactic grammar.

@bterlson

This comment has been minimized.

Show comment
Hide comment
@bterlson

bterlson Oct 12, 2015

Member

I believe all the individual productions of the lexical grammar are regular and could be represented as a regular expression if someone cared to do so. But there are the complications, for example the different goal symbols depending on parser context (eg. whether a / is the start of a regexp literal or a divide token) and automatic semicolon insertion.

Member

bterlson commented Oct 12, 2015

I believe all the individual productions of the lexical grammar are regular and could be represented as a regular expression if someone cared to do so. But there are the complications, for example the different goal symbols depending on parser context (eg. whether a / is the start of a regexp literal or a divide token) and automatic semicolon insertion.

@bterlson bterlson added the question label Oct 12, 2015

@mehdishojaei

This comment has been minimized.

Show comment
Hide comment
@mehdishojaei

mehdishojaei Oct 13, 2015

I believe all the individual productions of the lexical grammar are regular and could be represented as a regular expression

@bterlson Thanks, Yes I believe too, but I think one of the main problems in converting this lexical grammar to a regular grammar is it's recursive productions.

mehdishojaei commented Oct 13, 2015

I believe all the individual productions of the lexical grammar are regular and could be represented as a regular expression

@bterlson Thanks, Yes I believe too, but I think one of the main problems in converting this lexical grammar to a regular grammar is it's recursive productions.

@bterlson

This comment has been minimized.

Show comment
Hide comment
@bterlson

bterlson Oct 13, 2015

Member

Which lexical productions aren't regular?

Member

bterlson commented Oct 13, 2015

Which lexical productions aren't regular?

@mehdishojaei

This comment has been minimized.

Show comment
Hide comment
@mehdishojaei

mehdishojaei Oct 14, 2015

General forms of a regular grammar are:
right regular grammar:
rlrg
or
left regular grammar:
llrg

Where A,B are variables and alpha is sequence of terminals (including ε).

Any grammar witch has a production that is not in one of the above forms is not regular. As you know, Regular grammars can be mapped easily and efficiently to Regular Expressions that can be used in lexer generators like lex.

But lexical grammar of ecma262 is a Context-Free grammar, that cannot be represented as Regular Expressions.

General form of a Context-Free grammar is:
cfg

Where A is a variable and gamma is any sequence of terminals and variables.

mehdishojaei commented Oct 14, 2015

General forms of a regular grammar are:
right regular grammar:
rlrg
or
left regular grammar:
llrg

Where A,B are variables and alpha is sequence of terminals (including ε).

Any grammar witch has a production that is not in one of the above forms is not regular. As you know, Regular grammars can be mapped easily and efficiently to Regular Expressions that can be used in lexer generators like lex.

But lexical grammar of ecma262 is a Context-Free grammar, that cannot be represented as Regular Expressions.

General form of a Context-Free grammar is:
cfg

Where A is a variable and gamma is any sequence of terminals and variables.

@benjamingr

This comment has been minimized.

Show comment
Hide comment
@benjamingr

benjamingr Oct 14, 2015

Contributor

@benjamingr, Please be informed that my question is about lexical grammar and not syntactic grammar.

Ah, I don't know how I missed that in the title. Anyway:

There are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements. This requires multiple goal symbols for the lexical grammar. The InputElementRegExpOrTemplateTail goal is used in syntactic grammar contexts where a RegularExpressionLiteral, a TemplateMiddle, or a TemplateTail is permitted. The InputElementRegExp goal symbol is used in all syntactic grammar contexts where a RegularExpressionLiteral is permitted but neither a TemplateMiddle, nor a TemplateTail is permitted. The InputElementTemplateTail goal is used in all syntactic grammar contexts where a TemplateMiddle or a TemplateTail is permitted but a RegularExpressionLiteral is not permitted. In all other contexts, InputElementDiv is used as the lexical goal symbol.

Is probably the only relevant reference in the spec.

Contributor

benjamingr commented Oct 14, 2015

@benjamingr, Please be informed that my question is about lexical grammar and not syntactic grammar.

Ah, I don't know how I missed that in the title. Anyway:

There are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements. This requires multiple goal symbols for the lexical grammar. The InputElementRegExpOrTemplateTail goal is used in syntactic grammar contexts where a RegularExpressionLiteral, a TemplateMiddle, or a TemplateTail is permitted. The InputElementRegExp goal symbol is used in all syntactic grammar contexts where a RegularExpressionLiteral is permitted but neither a TemplateMiddle, nor a TemplateTail is permitted. The InputElementTemplateTail goal is used in all syntactic grammar contexts where a TemplateMiddle or a TemplateTail is permitted but a RegularExpressionLiteral is not permitted. In all other contexts, InputElementDiv is used as the lexical goal symbol.

Is probably the only relevant reference in the spec.

@bterlson

This comment has been minimized.

Show comment
Hide comment
@bterlson

bterlson Oct 14, 2015

Member

I think you might be hung up a bit on notation. This isn't my strong area but given this abbreviated production of the lexical grammar:

InputElementDiv :
   WhiteSpace
   LineTerminator

WhiteSpace:
    <TAB>
    <VT>

LineTerminator
    <LF>
    <CR>

You might think this is a CFG but I don't think it is. You could refactor to remove the indirection and end up with the syntactically identical:

InputElementDiv :
    <TAB>
    <VT>
    <LF>
    <CR>

Which is clearly regular.

Anyway, the non-regularity is ASI and multiple goal symbols. Their motivation should be clear by reading Clause 11.

Member

bterlson commented Oct 14, 2015

I think you might be hung up a bit on notation. This isn't my strong area but given this abbreviated production of the lexical grammar:

InputElementDiv :
   WhiteSpace
   LineTerminator

WhiteSpace:
    <TAB>
    <VT>

LineTerminator
    <LF>
    <CR>

You might think this is a CFG but I don't think it is. You could refactor to remove the indirection and end up with the syntactically identical:

InputElementDiv :
    <TAB>
    <VT>
    <LF>
    <CR>

Which is clearly regular.

Anyway, the non-regularity is ASI and multiple goal symbols. Their motivation should be clear by reading Clause 11.

@bterlson bterlson closed this Oct 14, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment