Optimise tokenisation and whitespace skipping #139
Conversation
@dontcallmedom, seems TravisCI was disabled for this repo too? Any idea what’s been causing that?
This looks great. Would like one more set of eyes on it, though.
@marcoscaceres no - we've had that happen a number of times over the past few weeks, and we have no idea what's happening
Pinged GitHub support. Will see if they can hopefully clue us in.
@dontcallmedom, sorry, could you check if it’s correctly re-enabled? I’m unsure if I have enough privileges (I clicked on things, but just want to make sure).
@marcoscaceres confirmed
@ricea, could you send this again (or with an empty commit)? That should trigger TravisCI again. Want to confirm tests passing.
Tokenisation and whitespace skipping are two major hot spots for the parser. Optimise by:

1. Using re.exec() instead of str.replace()
2. Using re.lastIndex with sticky expressions to avoid repeatedly trimming the front of the string
3. Using str.indexOf('\n') when counting newlines
4. Using a zero-width lookahead to quickly reject integers in the "float" regexp
5. Tweaking other regexps
6. Peeking at the next character in tokenise() to reduce the number of regular expressions that need to be checked at each position
7. Moving initialisation of re tables out of the hot path
8. Also optimising the removal of quotation marks from strings by using str.substring() instead of str.replace()

Due to optimisation of the "whitespace" regexp, tokenise() now splits " // foo" into two tokens (" " and "// foo") rather than one. This makes no difference to the output of the parser. Verified with the test suite and with the w-p-t "idlharness" tests.

With these changes tokenize() is 30% faster and all_ws() is no longer a major contributor to parse time.
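A minimal sketch of points 1 and 2 (names assumed, not the library's actual code): a sticky (`/y`) expression matches only at `lastIndex`, so the tokeniser can advance a numeric cursor instead of repeatedly trimming the front of the string, which copies the remainder each time.

```javascript
// Sticky regexps anchor the match at lastIndex; exec() returns null unless
// the pattern matches exactly there, so no substring copies are needed.
const ws = /[\t\n\r ]+/y;

function skipWhitespace(str, position) {
  ws.lastIndex = position;
  const match = ws.exec(str); // null unless whitespace starts at `position`
  return match ? ws.lastIndex : position;
}
```

For example, `skipWhitespace("  interface", 0)` advances the cursor to 2, while `skipWhitespace("interface", 0)` leaves it at 0.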
Oh, it did!
lib/webidl2.js
Outdated
@@ -97,7 +137,8 @@
if (!tokens.length || tokens[0].type !== type) return;
if (typeof value === "undefined" || tokens[0].value === value) {
last_token = tokens.shift();
if (type === ID) last_token.value = last_token.value.replace(/^_/, "");
if (type === ID && last_token.value.charAt(0) === '_')
I would prefer `last_token.value.startsWith("_")`; IMO it shouldn't be slower than `charAt`.
Done
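For illustration (values are mine, not from the patch), both checks behave identically on the leading-underscore escape that the parser strips:

```javascript
// WebIDL identifiers can be escaped with a leading "_" to avoid clashing
// with keywords; the parser removes that underscore.
const value = "_interface";

const isEscaped = value.startsWith("_"); // same result as value.charAt(0) === "_"
const unescaped = isEscaped ? value.substring(1) : value;
```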
lib/webidl2.js
Outdated
@@ -429,7 +476,11 @@
return { type: "sequence", value: [] };
} else {
const str = consume(STR) || error("No value for default");
str.value = str.value.replace(/^"/, "").replace(/"$/, "");
if (str.value.charAt(0) !== '"')
`startsWith` here too.
Done.
lib/webidl2.js
Outdated
str.value = str.value.replace(/^"/, "").replace(/"$/, "");
if (str.value.charAt(0) !== '"')
error(`string '${str.value}' doesn't start with a quote`);
if (str.value.charAt(str.value.length - 1) !== '"')
`endsWith` should be a cleaner alternative.
Done.
lib/webidl2.js
Outdated
@@ -917,7 +968,7 @@
return ret;
}
const val = consume(STR) || error("Unexpected value in enum");
val.value = val.value.replace(/"/g, "");
val.value = val.value.substring(1, val.value.length - 1);
`.slice(1, -1)` would be cleaner.
Done.
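The two forms strip the surrounding quotes identically; `slice` just accepts a negative end index, which reads more cleanly (the token value here is illustrative):

```javascript
// An enum string token, quotes included, as the tokeniser produces it.
const quoted = '"fetch"';

const viaSubstring = quoted.substring(1, quoted.length - 1);
const viaSlice = quoted.slice(1, -1); // -1 counts from the end of the string
```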
lib/webidl2.js
Outdated
for (const type of wsTypes) {
w = w.replace(re[type], (tok, m1) => {
store.push({ type: type + (pea ? ("-" + pea) : ""), value: m1 });
for (var type in all_ws_re) {
Are we going back to for-var-in pattern on purpose? I'm not sure we should...
According to web-platform-tests/wpt@280005e, Servo does not support `const type of` yet. While we don't strictly need to care about it here, if we use it, it will have to be patched downstream again.
Servo supports `for (let type of foo)` (but in a wrong way); I think keeping for-of syntax should be preferred.
Alternatively the repo may just use a transpiled file. Should we provide one?
Maybe we should meet them half-way and use `let`... as there is only one `type` in scope, servo should not get confused, right?
I changed it to `let type in all_ws_re`. With the way the code is now it doesn't need `const ... of` anyway.
I locally confirmed the latest nightly build has no issue with callback-less synchronous `for (let type of array)`; can we use it so that we can keep `of`?
Additionally, a relevant comment will be great so that we won't accidentally do a rollback.
We can't use `of` here unless we cache an array of the keys of `all_ws_re`. Is there a benefit in doing that?
I have added a comment about why we are not using `const`.
Ah, I just found that `wsTypes` is removed. Never mind, sorry.
I'd like to add a relevant Servo issue URL so that a future contributor can easily follow it, but I couldn't find one. Should I file a new issue?
If you have a running copy of Servo to repro on, that would be helpful, yes.
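The trade-off in this thread, sketched with a cut-down regexp table: `for...in` walks the object's keys with no extra allocation, while `for...of` needs an array (e.g. from `Object.keys`) materialised first.

```javascript
// A reduced version of the whitespace regexp table, for illustration.
const all_ws_re = {
  "ws": /([\t\n\r ]+)/y,
  "line-comment": /\/\/(.*)\r?\n?/y
};

// for-in iterates keys directly, with no intermediate array (and no for-of
// support required in the engine).
const viaForIn = [];
for (var type in all_ws_re) viaForIn.push(type);

// for-of over the same keys requires building the key array first.
const viaForOf = [];
for (let key of Object.keys(all_ws_re)) viaForOf.push(key);
```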
const all_ws_re = {
"ws": /([\t\n\r ]+)/y,
"line-comment": /\/\/(.*)\r?\n?/y,
"multiline-comment": /\/\*((?:[^*]|\*[^/])*)\*\//y
Curious about this multiline comment regex change, is there also a performance win?
Yes, but not huge. It's about 2% faster on V8.
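The sticky multiline-comment expression above can be exercised standalone like this (the helper name is mine, not the parser's):

```javascript
// Group 1 captures the text between /* and */; sticky flag means the match
// must begin exactly at lastIndex.
const multilineComment = /\/\*((?:[^*]|\*[^/])*)\*\//y;

function readCommentBody(str, position) {
  multilineComment.lastIndex = position;
  const match = multilineComment.exec(str);
  return match ? match[1] : null;
}
```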
const tokenRe = {
// This expression uses a lookahead assertion to catch false matches
// against integers early.
"float": /-?(?=[0-9]*\.|[0-9]+[eE])(([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)([Ee][-+]?[0-9]+)?|[0-9]+[Ee][-+]?[0-9]+)/y,
I want to note that this is now branching from the spec. Maybe worth pushing to upstream if this gives a significant win.
I think it would not be a good match for the spec, because it makes the expression harder to read and understand. It offers a significant performance benefit in this tokeniser because we run the expression against every digit, most of which will fail the lookahead assertion. But it might just get in the way in a different type of tokeniser.
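To see the lookahead doing its job (helper name assumed): the zero-width assertion `(?=[0-9]*\.|[0-9]+[eE])` rejects plain integers before the expensive alternation runs.

```javascript
// The "float" expression from the patch, run as a sticky match at a position.
const float = /-?(?=[0-9]*\.|[0-9]+[eE])(([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)([Ee][-+]?[0-9]+)?|[0-9]+[Ee][-+]?[0-9]+)/y;

function matchFloat(str, position) {
  float.lastIndex = position;
  const match = float.exec(str);
  return match ? match[0] : null;
}
```

`matchFloat("42", 0)` returns null: the lookahead sees neither a decimal point nor an exponent, so the integer regexp gets a chance at that position instead.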
lib/webidl2.js
Outdated
} else if (/[A-Z_a-z]/.test(nextChar)) {
result = attemptTokenMatch(str, "identifier", tokenRe.identifier,
lastIndex, tokens);
} else if (/"/.test(nextChar)) {
Just `nextChar === '"'`?
Done
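A hypothetical condensation of the peek dispatch described in the commit message (function and branch names are illustrative, not the library's):

```javascript
// Classify the upcoming token from a single character, so at most one token
// regexp has to run at each source position.
function classifyNextChar(nextChar) {
  if (/[-0-9.]/.test(nextChar)) return "numeric";
  if (/[A-Z_a-z]/.test(nextChar)) return "identifier";
  if (nextChar === '"') return "string";
  if (/[\t\n\r ]/.test(nextChar)) return "whitespace";
  return "other";
}
```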
lib/webidl2.js
Outdated
++line;
++i;
}
}
How about making this a function `count(str, char)`, for readability?
Done.
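The suggested helper might look like this, using `indexOf` to jump between occurrences (per point 3 of the commit message) rather than visiting every character:

```javascript
// Count occurrences of `char` in `str`; indexOf skips straight to the next
// occurrence, which is faster than a per-character loop for sparse matches.
function count(str, char) {
  let total = 0;
  let index = -1;
  while ((index = str.indexOf(char, index + 1)) !== -1) {
    total += 1;
  }
  return total;
}
```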
@saschanaz I think that's everything. PTAL.
There was a regression where a '/' was not correctly tokenised as "other" when it was not part of a comment. Fix it. Add a regression test that a single '/' in an IDL file produces the expected parse error. Also split the whitespace and comment cases in the tokeniser to make them a bit faster.
I spotted a bug in my changes where a '/' that was not part of a comment would fail to be tokenised as "other". It doesn't make much difference at the moment because it's never valid IDL syntax, but it might in future. I added a regression test to ensure that the correct error is produced for a solitary '/'. I took the opportunity to optimise tokenise() a bit more by treating whitespace as a separate case from comments (the parser still sees "whitespace").
Filed servo/servo#20231.
lib/webidl2.js
Outdated
if (!str.value.startsWith('"'))
error(`string '${str.value}' doesn't start with a quote`);
if (!str.value.endsWith('"'))
error(`string '${str.value}' doesn't end with a quote`);
Hmm, this cannot happen as the regex forces them, right? We have AST tests, so probably no need to do runtime check here. What do you think? @marcoscaceres
@saschanaz, agree. @ricea, is there something we are overlooking?
I don't think so. I just left it there in case there was something I was missing.
👍 for code quality, but note that I didn't run any benchmark so I have no real idea about the performance boost.
@saschanaz I mostly benchmarked against a memory sanitizer build, as that is where I was feeling the pain (see http://crbug.com/810963). I just tested with a normal optimised compile against the dom/interfaces.html web-platform-test (using the Blink layout test infrastructure). I got a median of 1.36s before the change and 1.26s after the change, which is a 7% improvement. The improvements under msan were larger, as it spends a larger portion of its time in the parser.
As an aside, maybe we should add some performance regression testing using something similar to what http://crbug.com/810963 is doing.
TravisCI appears to be stuck. 😞
Yeah, stuck everywhere :( all w3c projects currently affected. Hopefully will come back soon.
Restarted Travis build... 🤞