Support for ES6-style identifier characters #215

Closed
marijnh opened this Issue Mar 4, 2015 · 16 comments

Projects

None yet

3 participants

@marijnh
Member
marijnh commented Mar 4, 2015

Our current big fat regexps only recognize ES5-style identifier characters. See https://mathiasbynens.be/notes/javascript-identifiers-es6 . We need a separate set of, even bigger, regexps to properly parse ES6 identifiers. I am not really happy about just pasting those in and bloating up the library size. A clever alternative trick would be nice.

@mathiasbynens
Contributor

In terms of BMP characters in identifiers, the only difference between ES5 and ES6 is the one in the first example here: https://mathiasbynens.be/notes/javascript-identifiers-es6#acceptable-unicode-symbols

// Valid in ES5 & Unicode v5.1.0+, but invalid in ES6:
var ⸯ; // U+2E2F VERTICAL TILDE
var \u2E2F; // U+2E2F VERTICAL TILDE

In other words, once you add the ES6 regex you don’t need the ES5 regex anymore. You could just special-case U+2E2F and astral symbols.

@marijnh
Member
marijnh commented Mar 4, 2015

Okay, cool. I'll look into compressing down your ES6 regexp by extracting duplicated pieces.

@marijnh marijnh referenced this issue Mar 4, 2015
@marijnh marijnh Allow braced \u escapes in identifiers
We still can't properly recognize code points as ES6-style
identifier chars.

Issue #214
891d5d0
@mathiasbynens
Contributor

You could apply the same trick you’re using in Acorn at the moment, i.e. having separate IdentifierStart and IdentifierPartOnly parts and then dynamically concatenating the regex at runtime. This can easily be done by tweaking https://gist.github.com/mathiasbynens/6334847 (cfr. https://github.com/marijnh/acorn/blob/891d5d07ddbe587d605ff4e00538bc05ed482dd3/tools/generate-identifier-regex.js). However, keep in mind that by doing so, the resulting regex will be larger (in memory) than the current output. You’re trading library size for performance / memory usage.

@marijnh
Member
marijnh commented Mar 4, 2015

If I didn't make any mistake in constructing the regexps (see below), there appear to be more differences than just the one in the example you give. These identifiers from the tests are no longer considered valid once I drop in the new regexps:

T= []
price_9̶9̶_89

Here's how I'm generating them:

var regenerate = require('regenerate');
// Which Unicode version should be used?
var version = '7.0.0';
var ID_Start = require('unicode-' + version + '/properties/ID_Start/code-points');
var ID_Continue = require('unicode-' + version + '/properties/ID_Continue/code-points');

var identifierStart = regenerate(ID_Start).add("$", "_");
var identifierPart = regenerate(ID_Continue).add("\u200c", "\u200d").remove(identifierStart);

console.log(identifierStart.toString());
console.log(identifierPart.toString());
@mathiasbynens
Contributor

Both those examples are still valid in ES6. Here’s my validator that uses the regular expression I mentioned:

https://mothereff.in/js-variables#T%E2%80%BF
https://mothereff.in/js-variables#price%5f9%CC%B69%CC%B6%5f89

Your script looks good to me, so there must be something wrong with the way you use the regexps. Can I view that code anywhere?

@marijnh
Member
marijnh commented Mar 4, 2015

You're quite right, I still had the assumption that these were simple character ranges elsewhere in the code, though they now use the | operator. Fixing that, the new regexps work. Now to update the identifier reader to understand multibyte characters.

@mathiasbynens
Contributor

Ah yes, wrapping each part in (?:x) is now needed. Sorry for not mentioning that explicitly.

@marijnh marijnh added a commit that referenced this issue Mar 5, 2015
@marijnh marijnh Make the tokenizer aware of multi-byte characters in ES6 mode
Add a data structure to recognize astral identifier chars. Parse whole
code points when looking for identifiers.

Issue #215
d76ea4b
@marijnh
Member
marijnh commented Mar 5, 2015

Attached patch fixes this, using a non-regexp approach for testing the astral characters.

@marijnh marijnh closed this Mar 5, 2015
@mathiasbynens
Contributor

Looks like the implementation is broken:

$ cat foo.js
// Random example from https://mathiasbynens.be/notes/javascript-identifiers-es6
// Invalid in ES5, but valid in ES6:
var 𐊧; // U+102A7 CARIAN LETTER A2
var \u{102A7}; // U+102A7 CARIAN LETTER A2

$ acorn foo.js
Unexpected character '𐊧' (2:4)

https://mathiasbynens.be/notes/javascript-identifiers-es6 has more examples that may be useful when testing.

@RReverser
Collaborator

@mathiasbynens No, it's just that Acorn parses ES5 by default. acorn --ecma6 foo.js seems to work.

@mathiasbynens
Contributor

Thanks! That’s a strange default… Anyway, sorry for not RTFM’ing :)

@marijnh
Member
marijnh commented Jan 13, 2016

ES6 wasn't really a thing (and was certainly not supported) when the API was initially designed. Hence.

@RReverser
Collaborator

@marijnh Btw, I think now that it's a released & established spec, we can bump defaults in a major update.

@mathiasbynens
Contributor

ES6 wasn't really a thing (and was certainly not supported) when the API was initially designed. Hence.

Sure, but it’s supposed to be backwards compatible — ES5 code should still parse per ES6.

we can bump defaults in a major update.

👍

@marijnh
Member
marijnh commented Jan 13, 2016

Sure, but it’s supposed to be backwards compatible

Not entirely, as was pointed out to me recently (see for example let[1]), and also you'll get nodes in your AST that you may not have been expecting, so this is definitely not a backwards-compatible change.

@marijnh
Member
marijnh commented Jan 13, 2016

Discussion moved to #374

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment