Reduce crawler load on servers #1581

Merged: 1 commit merged into main from slowdown on Jun 6, 2024
Conversation

@tidoust (Member) commented on Jun 5, 2024

[Note: not fully tested for now because the drafts CSS server is currently down]

The crawler has a hard time crawling all specs nowadays: more stringent restrictions on servers lead to network timeouts and errors. See w3c/webref#1244.

The goal of this update is to reduce the load that the crawler puts on servers. Two changes (a sketch of both follows the list):

  1. The list of specs to crawl gets sorted to distribute origins. This should help dilute the requests sent to any single server at once. The notion of "origin" used in the code is loose; it identifies the server that serves the resource rather than the actual origin.

  2. Requests sent to a given origin are serialized, and sent at least 2 seconds after the last request was sent (and processed). The crawler otherwise still processes the list 4 specs at a time (provided the specs are to be retrieved from different origins).
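
As a minimal sketch of both changes (the `interleave` function and the `specsByOrigin` map match names in the diff below; everything else, such as `fetchFromOrigin` and the `pending` map, is illustrative and not the actual Reffy code):

```
// 1. Group specs by a loose notion of origin, then interleave the groups so
// that consecutive entries in the crawl list target different servers.
function interleave(...lists) {
  const result = [];
  const longest = Math.max(0, ...lists.map(list => list.length));
  for (let i = 0; i < longest; i++) {
    for (const list of lists) {
      if (i < list.length) {
        result.push(list[i]);
      }
    }
  }
  return result;
}

// 2. Serialize requests per origin: each request to an origin waits for the
// previous request to that origin to complete, plus a 2-second pause.
// (fetchFromOrigin and pending are hypothetical names for illustration.)
const pending = new Map();  // origin -> promise of the last request sent there
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function fetchFromOrigin(origin, sendRequest) {
  const previous = pending.get(origin);
  const next = (previous ?? Promise.resolve())
    .catch(() => {})                            // pacing must survive failures
    .then(() => previous ? sleep(2000) : null)  // 2s after the last request
    .then(() => sendRequest());
  pending.set(origin, next);
  return next;
}
```

With the list interleaved, the 4 specs processed in parallel are unlikely to target the same server at once, and the per-origin pacing covers the cases where they do.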

The consequence of 1. is that the specs are no longer processed in order, so logs will make the crawler look a bit drunk, processing specs seemingly randomly, as in:

```
  1/610 - https://aomediacodec.github.io/afgs1-spec/ - crawling
  8/610 - https://compat.spec.whatwg.org/ - crawling
 12/610 - https://datatracker.ietf.org/doc/html/draft-davidben-http-client-hint-reliability - crawling
 13/610 - https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-rfc6265bis - crawling
 12/610 - https://datatracker.ietf.org/doc/html/draft-davidben-http-client-hint-reliability - done
 16/610 - https://drafts.css-houdini.org/css-typed-om-2/ - crawling
 13/610 - https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-rfc6265bis - done
 45/610 - https://fidoalliance.org/specs/fido-v2.1-ps-20210615/fido-client-to-authenticator-protocol-v2.1-ps-errata-20220621.html - crawling
https://compat.spec.whatwg.org/ [error] Multiple event handler named orientationchange, cannot associate reliably to an interface in Compatibility Standard
  8/610 - https://compat.spec.whatwg.org/ - done
 66/610 - https://registry.khronos.org/glTF/specs/2.0/glTF-2.0.html - crawling
https://aomediacodec.github.io/afgs1-spec/ [log] extract refs without rules
  1/610 - https://aomediacodec.github.io/afgs1-spec/ - done
```

Comment on lines +126 to +131:

```
mockAgent
  .get("https://www.w3.org")
  .intercept({ method: "GET", path: "/StyleSheets/TR/2021/dark.css" })
  .reply(200, '')
  .persist();
```

@tidoust (Member, Author):

Forgot to add: that's unrelated, but now needed (I suspect the style sheet was added recently) to avoid a test complaint that the request for this resource is not mocked.

@dontcallmedom (Member) left a comment:

LGTM, although I'm not sure the interleaving/systematic pacing is optimal in terms of reducing processing time (see embedded comment); in any case, I think we should merge this as is and look for possible optimizations separately.

```
  }
  specsByOrigin[origin].push(spec);
}
const spreadList = interleave(...Object.values(specsByOrigin));
```
@dontcallmedom (Member) commented:

I wonder how much the interleaving buys us vs managing a proper per-"origin" queue of requests, which would let us process requests in parallel without pauses as long as there is no lock on the relevant origins.

I think e.g. https://github.com/w3c/cg-monitor/blob/5ceea24a1eaa7c1736d38909bcbb47df4ae297a3/lib/caching-queued-fetch.js#L68 accomplishes this for the CG monitor

@tidoust (Member, Author) replied:

Optimization-wise, I don't know that either approach really trumps the other. From a readability and maintenance perspective, though, a proper per-origin queue would be cleaner! I was more trying to patch the existing code than to rewrite it.

One constraint I think we need in Reffy is that parallelism needs to be kept to a reasonable level, say 4-5 specs in parallel maximum, because loading too many specs at once in Puppeteer may use a lot of RAM. If we just enqueue everything on a per-origin queue without throttling, we may end up in a situation where 10 specs from 10 different origins get processed at once.

Right now that throttling is done at the spec level, which is why I chose to shuffle the list instead, to reduce the likelihood that the 4 specs that get processed in parallel target the same origin server. We'd need to apply that throttling to the list of origins instead (process 4 origins in parallel among the list of origins). That can also very likely be done "properly" ;)
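
For illustration only, here is one possible shape of that "proper" per-origin queue combined with a global parallelism cap. All names (`MAX_PARALLEL`, `crawlSpec`, `loadSpec`) are assumptions, not actual Reffy or cg-monitor code:

```
const MAX_PARALLEL = 4;         // global cap on specs loaded at once
let active = 0;                 // number of slots currently in use
const waiting = [];             // resolvers waiting for a free slot
const originLocks = new Map();  // origin -> promise chain serializing requests

// Wait until one of the MAX_PARALLEL global slots is free.
function acquireSlot() {
  if (active < MAX_PARALLEL) {
    active += 1;
    return Promise.resolve();
  }
  return new Promise(resolve => waiting.push(resolve));
}

// Hand the slot over to the next waiter, or free it.
function releaseSlot() {
  const next = waiting.shift();
  if (next) {
    next();
  } else {
    active -= 1;
  }
}

// Serialize requests per origin, but only hold a global slot while actually
// loading, so specs from different origins proceed in parallel with no fixed
// pauses in between.
async function crawlSpec(origin, loadSpec) {
  const lock = originLocks.get(origin) ?? Promise.resolve();
  const run = lock.catch(() => {}).then(async () => {
    await acquireSlot();
    try {
      return await loadSpec();
    }
    finally {
      releaseSlot();
    }
  });
  originLocks.set(origin, run);
  return run;
}
```

In this shape, requests to the same origin still run one at a time, while the global cap keeps the number of specs loaded at once in Puppeteer (and hence RAM usage) bounded.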

@tidoust merged commit c63cb0a into main on Jun 6, 2024
1 check passed
@tidoust deleted the slowdown branch on June 6, 2024 at 05:54
@tidoust (Member, Author) commented on Jun 6, 2024

For info, I merged the PR but haven't managed to run a successful crawl locally so far because the drafts CSS server keeps crashing. That may be a signal that the crawl needs to be slowed down further for CSS drafts. As a result, I'm not releasing a new version of Reffy right away.

@tidoust added a commit that referenced this pull request on Jun 7, 2024:
Breaking change:
- Bump required version of Node.js to >=20.12.1 (#1590)

Required by dependency on ReSpec, although note this dependency is only for
tests (`devDependencies` in `package.json`).

New features:
- Reduce crawler load on servers (#1587, #1581)
- Include data-xref-type as another hint for autolinks (#1585)

Feature patches:
- Skip requests on SVG and CSS resources (#1588)
- Fix Puppeteer instance teardown (#1586)

Dependency bumps:
- Bump respec from 35.0.2 to 35.1.0 (#1583)
- Bump ajv from 8.15.0 to 8.16.0 (#1580)