-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-write throttling logic with queues #1587
Conversation
This re-writes the convoluted logic introduced in previous commit (and the also convoluted logic that pre-existed) as suggested in #1582, taking inspiration from the queue in the CG monitor: https://github.com/w3c/cg-monitor/blob/5ceea24a1eaa7c1736d38909bcbb47df4ae297a3/lib/caching-queued-fetch.js#L68 On top of serializing requests sent to a given origin and sleeping a bit in between these requests, the new queue controls parallelization, making sure that the code does not load too many specs at once (I stuck to 4 in parallel as before). The net result is essentially the same as that achieved with the previous commit: specs are processed as if origins had been interleaved. But the new code is more straightforward, less verbose and more readable. Also, the code for the throttled queue is the exact same one as the code in browser-specs (see w3c/browser-specs#1356). Code is duplicated here out of simplicity and laziness to create and maintain a shared dependency.
That seems to work well, with the caveat that crawl takes much longer due to the serialization per origin and the 2s pause in-between specs that target the same origin. The thing is there are >300 nightly specs with a We could split the different (I'll prepare a side PR to avoid requests on SVG and CSS resources, that works fine as well, and seems useful!) |
my personal sense is that we should only use the 2 seconds pacing for the CSS server, and a much smaller one for other "origins" |
Done in the latest commit. I kept a 1 second sleep time for www.w3.org because of the 180 requests/minute limit. Note that getOrigin now also returns the "https://" prefix, to be more consistent with what the The update does not fundamentally change the duration of a full crawl: loading the specs is what takes the most amount of time. It should speed up the 304 case in incremental crawls though. |
This makes the sleep time depend on the origin being crawled: - The CSS server breaks easily, the code sleeps 2 seconds in between requests. - The www.w3.org server has a limit of 180 requests per minute, the code sleeps 1 second in between requests. - Other origins seem more amenable to faster crawls, the code sleeps 100ms in between requests.
Breaking change: - Bump required version of Node.js to >=20.12.1 (#1590) Required by dependency on ReSpec, although note this dependency is only for tests (`devDependencies` in `package.json`). New features: - Reduce crawler load on servers (#1587, #1581) - Include data-xref-type as another hint for autolinks (#1585) Feature patches: - Skip requests on SVG and CSS resources (#1588) - Fix Puppeteer instance teardown (#1586) Dependency bumps: - Bump respec from 35.0.2 to 35.1.0 (#1583) - Bump ajv from 8.15.0 to 8.16.0 (#1580)
(Leaving as draft as this would need to be updated if the code for throttled-queue changes following review of w3c/browser-specs#1356)
This re-writes the convoluted logic introduced in previous commit (and the also convoluted logic that pre-existed) as suggested in #1582, taking inspiration from the queue in the CG monitor.
On top of serializing requests sent to a given origin and sleeping a bit in between these requests, the new queue controls parallelization, making sure that the code does not load too many specs at once (I stuck to 4 in parallel as before).
The net result is essentially the same as that achieved with the previous commit: specs are processed as if origins had been interleaved. But the new code is more straightforward, less verbose and more readable.
Also, the code for the throttled queue is the exact same one as the code in browser-specs (see w3c/browser-specs#1356). Code is duplicated here out of simplicity and laziness to create and maintain a shared dependency.