
Optimise tokenisation and whitespace skipping #139

Merged
merged 6 commits into w3c:develop on Mar 9, 2018

Conversation

@ricea (Contributor) commented Mar 2, 2018

Tokenisation and whitespace skipping are two major hot spots for the parser.

Optimise by:

  1. Using re.exec() instead of str.replace()
  2. Using re.lastIndex with sticky expressions to avoid repeatedly trimming the front of the string
  3. Using str.indexOf('\n') when counting newlines
  4. Using a zero-width lookahead to quickly reject integers in the "float" regexp
  5. Tweaking other regexps
  6. Peeking at the next character in tokenise() to reduce the number of regular expressions that need to be checked at each position (points 1, 2 and 6 are sketched in the example below)
  7. Moving initialisation of re tables out of the hot path

Also optimise the removal of quotation marks from strings by using str.substring() instead of str.replace().

Due to optimisation of the "whitespace" regexp, tokenise() now splits " // foo" into two tokens (" " and "// foo") rather than one. This makes no difference to the output of the parser. Verified with the test suite and with the w-p-t "idlharness" tests.

With these changes tokenise() is 30% faster and all_ws() is no longer a major contributor to parse time.
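
To make the approach concrete, here is a minimal, self-contained sketch of points 1, 2 and 6: sticky regexps driven by re.lastIndex, exec() instead of replace(), and a one-character peek to choose which expression to try. The token table and function signatures are simplified for illustration and are not the PR's exact code (float and comment handling are omitted).

```js
// Simplified token table: sticky (/.../y) expressions match only at lastIndex.
const tokenRe = {
  "whitespace": /[\t\n\r ]+/y,
  "integer": /-?(0[Xx][0-9A-Fa-f]+|[0-9]+)/y,
  "identifier": /[A-Z_a-z][0-9A-Z_a-z-]*/y,
  "string": /"[^"]*"/y,
  "other": /[^\t\n\r 0-9A-Z_a-z]/y
};

// Try one sticky expression at the current offset. On success, push a token
// and return the new offset; otherwise return -1.
function attemptTokenMatch(str, type, lastIndex, tokens) {
  const re = tokenRe[type];
  re.lastIndex = lastIndex;
  const result = re.exec(str);      // exec() on the original string, no trimming
  if (!result) return -1;
  tokens.push({ type, value: result[0] });
  return re.lastIndex;
}

function tokenise(str) {
  const tokens = [];
  let lastIndex = 0;
  while (lastIndex < str.length) {
    const nextChar = str.charAt(lastIndex);   // peek to pick an expression
    let next = -1;
    if (/[\t\n\r ]/.test(nextChar)) {
      next = attemptTokenMatch(str, "whitespace", lastIndex, tokens);
    } else if (/[-0-9]/.test(nextChar)) {
      next = attemptTokenMatch(str, "integer", lastIndex, tokens);
    } else if (/[A-Z_a-z]/.test(nextChar)) {
      next = attemptTokenMatch(str, "identifier", lastIndex, tokens);
    } else if (nextChar === '"') {
      next = attemptTokenMatch(str, "string", lastIndex, tokens);
    }
    if (next === -1) next = attemptTokenMatch(str, "other", lastIndex, tokens);
    if (next === -1) throw new Error(`Token matching failed at index ${lastIndex}`);
    lastIndex = next;
  }
  return tokens;
}

// tokenise("interface Foo {};") produces: identifier "interface",
// whitespace " ", identifier "Foo", whitespace " ", other "{", "}" and ";".
```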

@marcoscaceres (Member)

@dontcallmedom, seems TravisCI was disabled for this repo too? Any idea what’s been causing that?

@marcoscaceres (Member) left a review comment

This looks great. Would like one more set of eyes on it tho.

@dontcallmedom (Member)

@marcoscaceres no - we've had that happen a number of times in the past few weeks, and we have no idea what's happening.

@marcoscaceres (Member)

Pinged GitHub support. Will see if they can hopefully clue us in.

@marcoscaceres (Member)

@dontcallmedom, sorry, could you check if it’s correctly re-enabled? I’m unsure if I have enough privileges (I clicked on things, but just want to make sure).

@dontcallmedom (Member)

@marcoscaceres confirmed

@marcoscaceres (Member)

@ricea, could you send this again (or with an empty commit)? That should trigger TravisCI again. Want to confirm tests passing.

@saschanaz (Member) commented Mar 2, 2018

That didn't trigger CI 🤔

Oh, it did!

lib/webidl2.js Outdated
@@ -97,7 +137,8 @@
if (!tokens.length || tokens[0].type !== type) return;
if (typeof value === "undefined" || tokens[0].value === value) {
last_token = tokens.shift();
if (type === ID) last_token.value = last_token.value.replace(/^_/, "");
if (type === ID && last_token.value.charAt(0) === '_')

Member:

I would prefer last_token.value.startsWith("_"), IMO it shouldn't be slower than charAt...

Contributor Author:

Done

lib/webidl2.js Outdated
@@ -429,7 +476,11 @@
return { type: "sequence", value: [] };
} else {
const str = consume(STR) || error("No value for default");
str.value = str.value.replace(/^"/, "").replace(/"$/, "");
if (str.value.charAt(0) !== '"')

Member:

startsWith here too.

Contributor Author:

Done.

lib/webidl2.js Outdated
str.value = str.value.replace(/^"/, "").replace(/"$/, "");
if (str.value.charAt(0) !== '"')
error(`string '${str.value}' doesn't start with a quote`);
if (str.value.charAt(str.value.length - 1) !== '"')

Member:

endsWith should be a cleaner alternative.

Contributor Author:

Done.

lib/webidl2.js Outdated
@@ -917,7 +968,7 @@
return ret;
}
const val = consume(STR) || error("Unexpected value in enum");
val.value = val.value.replace(/"/g, "");
val.value = val.value.substring(1, val.value.length - 1);

Member:

.slice(1, -1) would be cleaner.

Contributor Author:

Done.
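
For reference, the two forms are equivalent for stripping the surrounding quotes captured by the string regexp; slice() just accepts a negative end index, so it reads a little more directly (the value below is illustration only):

```js
const value = '"enum value"';            // as captured by the string token regexp
value.substring(1, value.length - 1);    // 'enum value'
value.slice(1, -1);                      // 'enum value'
```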

lib/webidl2.js Outdated
for (const type of wsTypes) {
w = w.replace(re[type], (tok, m1) => {
store.push({ type: type + (pea ? ("-" + pea) : ""), value: m1 });
for (var type in all_ws_re) {

Member:

Are we going back to for-var-in pattern on purpose? I'm not sure we should...

Contributor Author:

According to web-platform-tests/wpt@280005e, Servo does not support "const type of" yet. While we don't strictly need to care about it here, if we use it, it will have to be patched downstream again.

Member:

servo/servo#19535

Servo supports for (let type of foo) (but in a wrong way), so I think keeping the for-of syntax should be preferred.

Alternatively the repo may just use a transpiled file. Should we provide one?

Member:

Maybe we should meet them half-way and use let... As there is only one type in scope, Servo should not get confused, right?

Contributor Author:

I changed it to let type in all_ws_re. With the way the code is now it doesn't need const ... of anyway.

Member:

I locally confirmed that the latest nightly build has no issue with callback-less synchronous for (let type of array); can we use it so that we can keep of?

Member:

Additionally, a relevant comment would be great so that we won't accidentally roll this back.

Contributor Author:

We can't use of here unless we cache an array of the keys of all_ws_re. Is there a benefit in doing that?

I have added a comment about why we are not using const.

Member:

Ah, I just found that wsTypes is removed. Never mind, sorry.

I'd like to add a relevant Servo issue URL so that a future contributor can easily follow it, but I couldn't find one. Should I file a new issue?

Contributor Author:

If you have a running copy of Servo to repro on, that would be helpful, yes.

const all_ws_re = {
"ws": /([\t\n\r ]+)/y,
"line-comment": /\/\/(.*)\r?\n?/y,
"multiline-comment": /\/\*((?:[^*]|\*[^/])*)\*\//y

Member:

Curious about this multiline comment regex change: is there also a performance win?

Contributor Author:

Yes, but not huge. It's about 2% faster on V8.
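
As a rough, self-contained sketch of how a table of sticky expressions like this can drive whitespace and comment skipping without trimming the string (the all_ws_re table is copied from the diff above; the function name and shape are assumed for illustration, not the merged code):

```js
const all_ws_re = {
  "ws": /([\t\n\r ]+)/y,
  "line-comment": /\/\/(.*)\r?\n?/y,
  "multiline-comment": /\/\*((?:[^*]|\*[^/])*)\*\//y
};

// Repeatedly try each expression at the current offset. Sticky regexps only
// match exactly at lastIndex, so the front of the string is never trimmed.
// `let ... in` rather than `const ... of`, matching the Servo workaround
// discussed in the thread above.
function skipWsAndComments(str, start) {
  let lastIndex = start;
  let advanced = true;
  while (advanced) {
    advanced = false;
    for (let type in all_ws_re) {
      const re = all_ws_re[type];
      re.lastIndex = lastIndex;
      if (re.exec(str)) {
        lastIndex = re.lastIndex;   // jump past the whitespace or comment
        advanced = true;
        break;
      }
    }
  }
  return lastIndex;                 // offset of the first significant character
}

// skipWsAndComments("  /* comment */  interface Foo {};", 0) === 17
```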

const tokenRe = {
// This expression uses a lookahead assertion to catch false matches
// against integers early.
"float": /-?(?=[0-9]*\.|[0-9]+[eE])(([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)([Ee][-+]?[0-9]+)?|[0-9]+[Ee][-+]?[0-9]+)/y,

Member:

I want to note that this is now diverging from the spec. Maybe worth pushing upstream if this gives a significant win.

Contributor Author:

I think it would not be a good match for the spec because it makes the expression harder to read and understand. It offers a significant performance benefit in this tokeniser because of the way we run the expression against every digit, most of which will fail the lookahead assertion. But it might just get in the way in a different type of tokeniser.
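
To illustrate the effect: when the tokeniser tries the float expression at a position that actually holds a plain integer, the lookahead fails immediately instead of the whole alternation being explored.

```js
const float = /-?(?=[0-9]*\.|[0-9]+[eE])(([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)([Ee][-+]?[0-9]+)?|[0-9]+[Ee][-+]?[0-9]+)/y;

float.lastIndex = 0;
float.exec("123 ");   // null: no '.' or exponent ahead, rejected by the lookahead
float.lastIndex = 0;
float.exec("1.5 ");   // matches "1.5"
float.lastIndex = 0;
float.exec("2e6 ");   // matches "2e6"
```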

lib/webidl2.js Outdated
} else if (/[A-Z_a-z]/.test(nextChar)) {
result = attemptTokenMatch(str, "identifier", tokenRe.identifier,
lastIndex, tokens);
} else if (/"/.test(nextChar)) {

Member:

Just nextChar === '"'?

Contributor Author:

Done

lib/webidl2.js Outdated
++line;
++i;
}
}

Member:

How about making this a function count(str, char), for readability?

Contributor Author:

Done.
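
A sketch of what such a helper might look like (the count name comes from the suggestion above; the body is assumed for illustration), using indexOf as in point 3 of the PR description:

```js
// Count occurrences of `char` in `str` by hopping between indexOf() hits
// instead of examining every character.
function count(str, char) {
  let total = 0;
  let index = -1;
  while ((index = str.indexOf(char, index + 1)) !== -1) {
    ++total;
  }
  return total;
}

// e.g. advancing the line counter over a matched whitespace/comment token:
// line += count(matchedValue, "\n");
```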

@ricea (Contributor Author) commented Mar 6, 2018

@saschanaz I think that's everything. PTAL.

@ricea (Contributor Author) commented Mar 6, 2018

I spotted a bug in my changes where a '/' that was not part of a comment would fail to be tokenised as "other". It doesn't make much difference at the moment because it's never valid IDL syntax, but it might be in future. I added a regression test to ensure that the correct error is produced for a solitary '/'.

I took the opportunity to optimise tokenise() a bit more by treating whitespace as a separate case from comments (the parser still sees "whitespace").

@saschanaz (Member)

Filed servo/servo#20231.

lib/webidl2.js Outdated
if (!str.value.startsWith('"'))
error(`string '${str.value}' doesn't start with a quote`);
if (!str.value.endsWith('"'))
error(`string '${str.value}' doesn't end with a quote`);

Member:

Hmm, this cannot happen as the regex forces them, right? We have AST tests, so there's probably no need to do a runtime check here. What do you think? @marcoscaceres

Member:

@saschanaz, agree. @ricea, is there something we are overlooking?

Contributor Author:

I don't think so. I just left it there in case there was something I was missing.
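
For context, the reason the checks above can never fire, assuming the string token expression has its usual shape: anything that expression matches necessarily begins and ends with a double quote.

```js
const stringRe = /"[^"]*"/y;   // assumed shape of the STR token regexp
stringRe.lastIndex = 0;
const m = stringRe.exec('"default value" rest');
m[0];                                          // '"default value"'
m[0].startsWith('"') && m[0].endsWith('"');    // true for any match
```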

@saschanaz (Member) left a review comment

👍 for code quality, but note that I didn't run any benchmark so I have no real idea about the performance boost.

@ricea (Contributor Author) commented Mar 8, 2018

@saschanaz I mostly benchmarked against a memory sanitizer build, as that is where I was feeling the pain (see http://crbug.com/810963). I just tested with a normal optimised compile against the dom/interfaces.html web-platform-test (using the Blink layout test infrastructure). I got a median of 1.36s before the change and 1.26s after the change, which is a 7% improvement. The improvements under msan were larger, as it spends a larger portion of its time in the parser.

@marcoscaceres (Member)

As an aside, maybe we should add some performance regression testing, using something similar to what http://crbug.com/810963 is doing.

@ricea (Contributor Author) commented Mar 8, 2018

TravisCI appears to be stuck. 😞

@marcoscaceres (Member)

Yeah, stuck everywhere :( All w3c projects are currently affected. Hopefully it will come back soon.

@marcoscaceres (Member)

Restarted Travis build... 🤞

@marcoscaceres merged commit c357d4c into w3c:develop on Mar 9, 2018
@marcoscaceres (Member)

  • webidl2@10.2.1
