
Optimise tokenisation and whitespace skipping #139

Merged
merged 6 commits into w3c:develop on Mar 9, 2018

Conversation

@ricea (Contributor) commented Mar 2, 2018

Tokenisation and whitespace skipping are two major hot spots for the parser.

Optimise by:

  1. Using re.exec() instead of str.replace()
  2. Using re.lastIndex with sticky expressions to avoid repeatedly trimming the front of the string
  3. Using str.indexOf('\n') when counting newlines
  4. Using a zero-width lookahead to quickly reject integers in the "float" regexp
  5. Tweaking other regexps
  6. Peeking at the next character in tokenise() to reduce the number of regular expressions that need to be checked at each position (points 1, 2 and 6 are sketched in the example below)
  7. Moving initialisation of re tables out of the hot path

Also optimise the removal of quotation marks from strings by using str.substring() instead of str.replace().

Due to optimisation of the "whitespace" regexp, tokenise() now splits " // foo" into two tokens (" " and "// foo") rather than one. This makes no difference to the output of the parser. Verified with the test suite and with the w-p-t "idlharness" tests.

With these changes tokenise() is 30% faster and all_ws() is no longer a major contributor to parse time.
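
To make the approach concrete, here is a minimal, self-contained sketch of points 1, 2 and 6: sticky regexps driven by re.lastIndex, exec() instead of replace(), and a one-character peek to choose which expression to try. The token table and function signatures are simplified for illustration and are not the PR's exact code (float and comment handling are omitted).

```js
// Simplified token table: sticky (/.../y) expressions match only at lastIndex.
const tokenRe = {
  "whitespace": /[\t\n\r ]+/y,
  "integer": /-?(0[Xx][0-9A-Fa-f]+|[0-9]+)/y,
  "identifier": /[A-Z_a-z][0-9A-Z_a-z-]*/y,
  "string": /"[^"]*"/y,
  "other": /[^\t\n\r 0-9A-Z_a-z]/y
};

// Try one sticky expression at the current offset. On success, push a token
// and return the new offset; otherwise return -1.
function attemptTokenMatch(str, type, lastIndex, tokens) {
  const re = tokenRe[type];
  re.lastIndex = lastIndex;
  const result = re.exec(str);      // exec() on the original string, no trimming
  if (!result) return -1;
  tokens.push({ type, value: result[0] });
  return re.lastIndex;
}

function tokenise(str) {
  const tokens = [];
  let lastIndex = 0;
  while (lastIndex < str.length) {
    const nextChar = str.charAt(lastIndex);   // peek to pick an expression
    let next = -1;
    if (/[\t\n\r ]/.test(nextChar)) {
      next = attemptTokenMatch(str, "whitespace", lastIndex, tokens);
    } else if (/[-0-9]/.test(nextChar)) {
      next = attemptTokenMatch(str, "integer", lastIndex, tokens);
    } else if (/[A-Z_a-z]/.test(nextChar)) {
      next = attemptTokenMatch(str, "identifier", lastIndex, tokens);
    } else if (nextChar === '"') {
      next = attemptTokenMatch(str, "string", lastIndex, tokens);
    }
    if (next === -1) next = attemptTokenMatch(str, "other", lastIndex, tokens);
    if (next === -1) throw new Error(`Token matching failed at index ${lastIndex}`);
    lastIndex = next;
  }
  return tokens;
}

// tokenise("interface Foo {};") produces: identifier "interface",
// whitespace " ", identifier "Foo", whitespace " ", other "{", "}" and ";".
```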

@marcoscaceres (Member)

@dontcallmedom, seems TravisCI was disabled for this repo too? Any idea what’s been causing that?

@marcoscaceres (Member) left a review comment

This looks great. Would like one more set of eyes on it tho.

@dontcallmedom (Member)

@marcoscaceres no - we've had that happen a number of times in the past few weeks, and we have no idea what's happening.

@marcoscaceres (Member)

Pinged GitHub support. Will see if they can hopefully clue us in.

@marcoscaceres (Member)

@dontcallmedom, sorry, could you check if it’s correctly re-enabled? I’m unsure if I have enough privileges (I clicked on things, but just want to make sure).

@dontcallmedom (Member)

@marcoscaceres confirmed

@marcoscaceres (Member)

@ricea, could you send this again (or with an empty commit)? That should trigger TravisCI again. Want to confirm tests passing.

@saschanaz (Member) commented Mar 2, 2018

That didn't trigger CI 🤔

Oh, it did!

lib/webidl2.js Outdated
@@ -97,7 +137,8 @@
if (!tokens.length || tokens[0].type !== type) return;
if (typeof value === "undefined" || tokens[0].value === value) {
last_token = tokens.shift();
if (type === ID) last_token.value = last_token.value.replace(/^_/, "");
if (type === ID && last_token.value.charAt(0) === '_')

Member:

I would prefer last_token.value.startsWith("_"), IMO it shouldn't be slower than charAt...

Contributor Author:

Done

lib/webidl2.js Outdated
@@ -429,7 +476,11 @@
return { type: "sequence", value: [] };
} else {
const str = consume(STR) || error("No value for default");
str.value = str.value.replace(/^"/, "").replace(/"$/, "");
if (str.value.charAt(0) !== '"')

Member:

startsWith here too.

Contributor Author:

Done.

lib/webidl2.js Outdated
str.value = str.value.replace(/^"/, "").replace(/"$/, "");
if (str.value.charAt(0) !== '"')
error(`string '${str.value}' doesn't start with a quote`);
if (str.value.charAt(str.value.length - 1) !== '"')

Member:

endsWith should be a cleaner alternative.

Contributor Author:

Done.

lib/webidl2.js Outdated
@@ -917,7 +968,7 @@
return ret;
}
const val = consume(STR) || error("Unexpected value in enum");
val.value = val.value.replace(/"/g, "");
val.value = val.value.substring(1, val.value.length - 1);

Member:

.slice(1, -1) would be cleaner.

Contributor Author:

Done.
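
For reference, the two forms are equivalent for stripping the surrounding quotes captured by the string regexp; slice() just accepts a negative end index, so it reads a little more directly (the value below is illustration only):

```js
const value = '"enum value"';            // as captured by the string token regexp
value.substring(1, value.length - 1);    // 'enum value'
value.slice(1, -1);                      // 'enum value'
```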

lib/webidl2.js Outdated
for (const type of wsTypes) {
w = w.replace(re[type], (tok, m1) => {
store.push({ type: type + (pea ? ("-" + pea) : ""), value: m1 });
for (var type in all_ws_re) {

Member:

Are we going back to for-var-in pattern on purpose? I'm not sure we should...

Contributor Author:

According to web-platform-tests/wpt@280005e, Servo does not support "const type of" yet. While we don't strictly need to care about it here, if we use it, it will have to be patched downstream again.

Member:

servo/servo#19535

Servo supports for (let type of foo) (but in a wrong way), so I think keeping the for-of syntax should be preferred.

Alternatively the repo may just use a transpiled file. Should we provide one?

Member:

Maybe we should meet them half-way and use let... As there is only one type in scope, Servo should not get confused, right?

Contributor Author:

I changed it to let type in all_ws_re. With the way the code is now it doesn't need const ... of anyway.

Member:

I locally confirmed that the latest nightly build has no issue with callback-less synchronous for (let type of array); can we use it so that we can keep of?

Member:

Additionally, a relevant comment would be great so that we won't accidentally roll this back.

Contributor Author:

We can't use of here unless we cache an array of the keys of all_ws_re. Is there a benefit in doing that?

I have added a comment about why we are not using const.

Member:

Ah, I just found that wsTypes is removed. Never mind, sorry.

I'd like to add a relevant Servo issue URL so that a future contributor can easily follow it, but I couldn't find one. Should I file a new issue?

Contributor Author:

If you have a running copy of Servo to repro on, that would be helpful, yes.

const all_ws_re = {
"ws": /([\t\n\r ]+)/y,
"line-comment": /\/\/(.*)\r?\n?/y,
"multiline-comment": /\/\*((?:[^*]|\*[^/])*)\*\//y

Member:

Curious about this multiline comment regex change: is there also a performance win?

Contributor Author:

Yes, but not huge. It's about 2% faster on V8.
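
As a rough, self-contained sketch of how a table of sticky expressions like this can drive whitespace and comment skipping without trimming the string (the all_ws_re table is copied from the diff above; the function name and shape are assumed for illustration, not the merged code):

```js
const all_ws_re = {
  "ws": /([\t\n\r ]+)/y,
  "line-comment": /\/\/(.*)\r?\n?/y,
  "multiline-comment": /\/\*((?:[^*]|\*[^/])*)\*\//y
};

// Repeatedly try each expression at the current offset. Sticky regexps only
// match exactly at lastIndex, so the front of the string is never trimmed.
// `let ... in` rather than `const ... of`, matching the Servo workaround
// discussed in the thread above.
function skipWsAndComments(str, start) {
  let lastIndex = start;
  let advanced = true;
  while (advanced) {
    advanced = false;
    for (let type in all_ws_re) {
      const re = all_ws_re[type];
      re.lastIndex = lastIndex;
      if (re.exec(str)) {
        lastIndex = re.lastIndex;   // jump past the whitespace or comment
        advanced = true;
        break;
      }
    }
  }
  return lastIndex;                 // offset of the first significant character
}

// skipWsAndComments("  /* comment */  interface Foo {};", 0) === 17
```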

const tokenRe = {
// This expression uses a lookahead assertion to catch false matches
// against integers early.
"float": /-?(?=[0-9]*\.|[0-9]+[eE])(([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)([Ee][-+]?[0-9]+)?|[0-9]+[Ee][-+]?[0-9]+)/y,

Member:

I want to note that this is now diverging from the spec. Maybe worth pushing upstream if this gives a significant win.

Contributor Author:

I think it would not be a good match for the spec because it makes the expression harder to read and understand. It offers a significant performance benefit in this tokeniser because of the way we run the expression against every digit, most of which will fail the lookahead assertion. But it might just get in the way in a different type of tokeniser.
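
To illustrate the effect: when the tokeniser tries the float expression at a position that actually holds a plain integer, the lookahead fails immediately instead of the whole alternation being explored.

```js
const float = /-?(?=[0-9]*\.|[0-9]+[eE])(([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)([Ee][-+]?[0-9]+)?|[0-9]+[Ee][-+]?[0-9]+)/y;

float.lastIndex = 0;
float.exec("123 ");   // null: no '.' or exponent ahead, rejected by the lookahead
float.lastIndex = 0;
float.exec("1.5 ");   // matches "1.5"
float.lastIndex = 0;
float.exec("2e6 ");   // matches "2e6"
```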

lib/webidl2.js Outdated
} else if (/[A-Z_a-z]/.test(nextChar)) {
result = attemptTokenMatch(str, "identifier", tokenRe.identifier,
lastIndex, tokens);
} else if (/"/.test(nextChar)) {

Member:

Just nextChar === '"'?

Contributor Author:

Done

lib/webidl2.js Outdated
++line;
++i;
}
}

Member:

How about making this a function count(str, char), for readability?

Contributor Author:

Done.
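
A sketch of what such a helper might look like (the count name comes from the suggestion above; the body is assumed for illustration), using indexOf as in point 3 of the PR description:

```js
// Count occurrences of `char` in `str` by hopping between indexOf() hits
// instead of examining every character.
function count(str, char) {
  let total = 0;
  let index = -1;
  while ((index = str.indexOf(char, index + 1)) !== -1) {
    ++total;
  }
  return total;
}

// e.g. advancing the line counter over a matched whitespace/comment token:
// line += count(matchedValue, "\n");
```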

@ricea (Contributor Author) commented Mar 6, 2018

@saschanaz I think that's everything. PTAL.

@ricea (Contributor Author) commented Mar 6, 2018

I spotted a bug in my changes where a '/' that was not part of a comment would fail to be tokenised as "other". It doesn't make much difference at the moment because it's never valid IDL syntax, but it might be in future. I added a regression test to ensure that the correct error is produced for a solitary '/'.

I took the opportunity to optimise tokenise() a bit more by treating whitespace as a separate case from comments (the parser still sees "whitespace").

@saschanaz (Member)

Filed servo/servo#20231.

lib/webidl2.js Outdated
if (!str.value.startsWith('"'))
error(`string '${str.value}' doesn't start with a quote`);
if (!str.value.endsWith('"'))
error(`string '${str.value}' doesn't end with a quote`);

Member:

Hmm, this cannot happen as the regex forces them, right? We have AST tests, so there's probably no need to do a runtime check here. What do you think? @marcoscaceres

Member:

@saschanaz, agree. @ricea, is there something we are overlooking?

Contributor Author:

I don't think so. I just left it there in case there was something I was missing.
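
For context, the reason the checks above can never fire, assuming the string token expression has its usual shape: anything that expression matches necessarily begins and ends with a double quote.

```js
const stringRe = /"[^"]*"/y;   // assumed shape of the STR token regexp
stringRe.lastIndex = 0;
const m = stringRe.exec('"default value" rest');
m[0];                                          // '"default value"'
m[0].startsWith('"') && m[0].endsWith('"');    // true for any match
```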

@saschanaz (Member) left a review comment

👍 for code quality, but note that I didn't run any benchmark so I have no real idea about the performance boost.

@ricea (Contributor Author) commented Mar 8, 2018

@saschanaz I mostly benchmarked against a memory sanitizer build, as that is where I was feeling the pain (see http://crbug.com/810963). I just tested with a normal optimised compile against the dom/interfaces.html web-platform-test (using the Blink layout test infrastructure). I got a median of 1.36s before the change and 1.26s after the change, which is a 7% improvement. The improvements under msan were larger, as it spends a larger portion of its time in the parser.

@marcoscaceres (Member)

As an aside, maybe we should add some performance regression testing, using something similar to what http://crbug.com/810963 is doing.

@ricea (Contributor Author) commented Mar 8, 2018

TravisCI appears to be stuck. 😞

@marcoscaceres (Member)

Yeah, stuck everywhere :( All w3c projects are currently affected. Hopefully it will come back soon.

@marcoscaceres (Member)

Restarted Travis build... 🤞

@marcoscaceres merged commit c357d4c into w3c:develop on Mar 9, 2018
@marcoscaceres (Member)

  • webidl2@10.2.1
