
Re-write throttling logic with queues #1587

Merged
merged 3 commits into from
Jun 7, 2024
Conversation

tidoust
Member

@tidoust tidoust commented Jun 6, 2024

(Leaving as draft as this would need to be updated if the code for throttled-queue changes following review of w3c/browser-specs#1356)

This re-writes the convoluted logic introduced in previous commit (and the also convoluted logic that pre-existed) as suggested in #1582, taking inspiration from the queue in the CG monitor: https://github.com/w3c/cg-monitor/blob/5ceea24a1eaa7c1736d38909bcbb47df4ae297a3/lib/caching-queued-fetch.js#L68

On top of serializing requests sent to a given origin and sleeping a bit in between these requests, the new queue controls parallelization, making sure that the code does not load too many specs at once (I stuck to 4 in parallel as before).

The net result is essentially the same as that achieved with the previous commit: specs are processed as if origins had been interleaved. But the new code is more straightforward, less verbose and more readable.

Also, the code for the throttled queue is the exact same as the code in browser-specs (see w3c/browser-specs#1356). The code is duplicated here for simplicity, to avoid creating and maintaining a shared dependency.
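The mechanism described above (per-origin serialization, a pause between requests to the same origin, and a global cap on parallel tasks) can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual code shared with browser-specs:

```javascript
// Sketch of a throttled queue: tasks for the same origin run one after
// another with a pause in between, and at most `maxParallel` tasks run
// at once across all origins. Class and method names are hypothetical.
class ThrottledQueue {
  constructor({ maxParallel = 4, sleepTime = 2000 } = {}) {
    this.maxParallel = maxParallel;   // global cap on concurrent tasks
    this.sleepTime = sleepTime;       // pause between requests to one origin
    this.ongoing = 0;                 // number of tasks currently running
    this.pending = [];                // resolvers waiting for a free slot
    this.lastPerOrigin = new Map();   // origin -> promise of last task there
  }

  async runThrottled(origin, task) {
    // Serialize per origin: wait for the previous task on this origin,
    // then sleep a bit before issuing the next request.
    const previous = this.lastPerOrigin.get(origin) ?? Promise.resolve();
    const run = (async () => {
      await previous;
      await new Promise(resolve => setTimeout(resolve, this.sleepTime));
      await this.waitForSlot();       // global parallelization control
      try {
        return await task();
      }
      finally {
        this.releaseSlot();
      }
    })();
    // Record the task (errors swallowed) so the next request to the
    // same origin queues behind it even if this one fails.
    this.lastPerOrigin.set(origin, run.catch(() => {}));
    return run;
  }

  waitForSlot() {
    if (this.ongoing < this.maxParallel) {
      this.ongoing++;
      return Promise.resolve();
    }
    return new Promise(resolve => this.pending.push(resolve));
  }

  releaseSlot() {
    const next = this.pending.shift();
    if (next) next();                 // hand the slot to a waiting task
    else this.ongoing--;
  }
}
```

With this shape, two specs on different origins can load in parallel immediately, while a second spec on the same origin waits for the first one plus the sleep time, which is what makes the crawl behave as if origins had been interleaved.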

@tidoust tidoust marked this pull request as ready for review June 7, 2024 12:53
@tidoust
Member Author

tidoust commented Jun 7, 2024

That seems to work well, with the caveat that the crawl takes much longer due to the serialization per origin and the 2s pause in between specs that target the same origin. The thing is, there are >300 nightly specs with a github.io URL. That means the crawl will take ~12 minutes at best, closer to 15 minutes in practice. That's fine in theory.

We could split the different github.io origins (I know, I argued against that in browser-specs ;)). We'll still have ~200 specs with a w3c.github.io URL, and ~85 specs with a wicg.github.io URL, but that's better than grouping them together.

(I'll prepare a side PR to avoid requests on SVG and CSS resources, that works fine as well, and seems useful!)

@dontcallmedom
Member

my personal sense is that we should only use the 2 seconds pacing for the CSS server, and a much smaller one for other "origins"

@tidoust
Member Author

tidoust commented Jun 7, 2024

my personal sense is that we should only use the 2 seconds pacing for the CSS server, and a much smaller one for other "origins"

Done in the latest commit. I kept a 1 second sleep time for www.w3.org because of the 180 requests/minute limit.

Note that getOrigin now also returns the "https://" prefix, to be more consistent with what the origin property returns.

The update does not fundamentally change the duration of a full crawl: loading the specs is what takes the most time. It should speed up the 304 case in incremental crawls though.

This makes the sleep time depend on the origin being crawled:
- The CSS server breaks easily, the code sleeps 2 seconds in between requests.
- The www.w3.org server has a limit of 180 requests per minute, the code sleeps
1 second in between requests.
- Other origins seem more amenable to faster crawls, the code sleeps 100ms in
between requests.
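The origin-dependent sleep times listed above can be sketched as a small lookup. This is an illustration only: `getSleepTime` is a hypothetical name, and the assumption that "the CSS server" means `drafts.csswg.org` is mine, not stated in the thread; the thread does mention that `getOrigin` now keeps the `https://` prefix to match the URL `origin` property:

```javascript
// Return the full origin, including the "https://" scheme, to be
// consistent with what the URL `origin` property returns.
function getOrigin(url) {
  return new URL(url).origin;
}

// Sleep times per origin, following the values in the commit message.
// The mapping of "the CSS server" to drafts.csswg.org is an assumption.
function getSleepTime(origin) {
  switch (origin) {
    case 'https://drafts.csswg.org':  // CSS server breaks easily
      return 2000;
    case 'https://www.w3.org':        // limited to 180 requests/minute
      return 1000;
    default:                          // other origins tolerate faster crawls
      return 100;
  }
}
```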
@tidoust tidoust merged commit 0364b86 into main Jun 7, 2024
1 check passed
@tidoust tidoust deleted the throttling branch June 7, 2024 14:55
tidoust added a commit that referenced this pull request Jun 7, 2024
Breaking change:
- Bump required version of Node.js to >=20.12.1 (#1590)

Required by dependency on ReSpec, although note this dependency is only for
tests (`devDependencies` in `package.json`).

New features:
- Reduce crawler load on servers (#1587, #1581)
- Include data-xref-type as another hint for autolinks (#1585)

Feature patches:
- Skip requests on SVG and CSS resources (#1588)
- Fix Puppeteer instance teardown (#1586)

Dependency bumps:
- Bump respec from 35.0.2 to 35.1.0 (#1583)
- Bump ajv from 8.15.0 to 8.16.0 (#1580)