Make reffy crawler http-cache friendly #850

dontcallmedom · 2022-02-02T11:51:32Z

An average reffy crawl takes around 6 to 8 minutes, and that duration is set to grow as we consider expanding the sources of crawled specifications.

But in practice, the vast majority of the time is spent fetching and processing documents that have not changed since the last crawl.

If we were to record the ETag and Last-Modified headers our crawler receives for each crawled URL, we could easily skip the large majority of the work using the fallback path developed in #825 when hitting a 304.
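A minimal sketch of the idea, assuming a hypothetical per-URL cache record saved from the previous crawl. The names `CacheRecord`, `conditional_headers`, and `next_step` are illustrative and not part of reffy. The stored validators are sent back as If-None-Match and If-Modified-Since, and a 304 response means the previous crawl result can be reused.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheRecord:
    """Validators saved from the previous crawl of one URL (hypothetical shape)."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(record: Optional[CacheRecord]) -> dict:
    """Build request headers that turn a plain GET into a conditional GET."""
    headers = {}
    if record is None:
        return headers
    if record.etag:
        headers["If-None-Match"] = record.etag
    if record.last_modified:
        headers["If-Modified-Since"] = record.last_modified
    return headers


def next_step(status: int, response_headers: dict) -> tuple:
    """Decide what to do with the response to a conditional GET.

    Returns ("reuse", None) on 304, otherwise ("process", new_record).
    """
    if status == 304:
        # Not modified: keep the results of the previous crawl.
        return ("reuse", None)
    # Header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in response_headers.items()}
    record = CacheRecord(
        etag=lowered.get("etag"),
        last_modified=lowered.get("last-modified"),
    )
    return ("process", record)
```

The functions are kept free of network calls so the decision logic can be checked on its own. An actual crawler would pass the returned headers to whatever HTTP client it uses.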

tidoust · 2022-02-02T13:34:47Z

> But in practice, the vast majority of the time is spent fetching and processing documents that have not changed since the last crawl.

It could be useful to evaluate how that time is split between fetching (or waiting in-between fetches to remain somewhat friendly with servers), and processing. I have no idea whether that split is 50/50, 80/20, or 20/80.
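One way to get at that split, sketched here with hypothetical names (`timed`, `totals`, and `shares` are illustrative and not part of reffy), is to wrap each phase in a small timer and compare the accumulated totals at the end of a crawl.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per phase, e.g. "fetch", "wait", "process".
totals = defaultdict(float)


@contextmanager
def timed(phase: str):
    """Add the elapsed time of the enclosed block to totals[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[phase] += time.perf_counter() - start


def shares() -> dict:
    """Return each phase's share of the total measured time (0.0 to 1.0)."""
    total = sum(totals.values())
    return {phase: (t / total if total else 0.0) for phase, t in totals.items()}
```

Wrapping the fetch call in `with timed("fetch"):` and the extraction step in `with timed("process"):` would then let `shares()` report whether the split is closer to 50/50, 80/20, or 20/80.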

For fetching, we would still need to fetch the main resource at least once (or send an initial HEAD request?). For multipage specs, we would still need to request pages individually as they may change independently of the first one. Also, I note that we would need to account for ReSpec drafts that use data-include to include additional resources (not sure how many do that in practice, perhaps none?), and I'm not sure how to do that.

Fetches of additional resources that are shared across specs are already caught by our local file caching mechanism, so they only trigger one network request per crawl.

The crawler already skips network requests to image resources.

In other words, I'm not sure that we will gain a lot in terms of network requests. Skipping processing would be nice, especially if it takes a significant bit of the total time.

dontcallmedom · 2022-02-02T13:40:02Z

> For fetching, we would still need to fetch the main resource at least once (or send an initial HEAD request?).

If we make the main request conditional, based on the cached ETag and Last-Modified values, the GET will only get back a 304 without a body when nothing has changed.

> For multipage specs, we would still need to request pages individually as they may change independently of the first one.

I think multipage specs update the date in their subtitle (or at least they definitely should).

> Also, I note that we would need to account for ReSpec drafts that use data-include to include additional resources

Good point, I hadn't thought of these.

I don't think they are critical blockers though: we could annotate specs to mark them as exceptions if that's a real issue, or run a no-cache crawl of specs every so often to catch these situations.

> In other words, I'm not sure that we will gain a lot in terms of network requests. Skipping processing would be nice, especially if it takes a significant bit of the total time.

I think the gains will be substantial for both.

dontcallmedom · 2022-02-02T13:41:57Z

> we could annotate specs to mark them as exceptions if that's a real issue

reffy could do that flagging itself (i.e. as part of its crawl results) whenever it detects a data-include (or equivalent) situation.
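A rough sketch of what such flagging could look like, assuming the crawler has access to the raw spec source before ReSpec has processed it. The names `IncludeDetector`, `find_includes`, and `can_skip_on_304` are hypothetical and not part of reffy.

```python
from html.parser import HTMLParser


class IncludeDetector(HTMLParser):
    """Collect the values of data-include attributes found in a document."""

    def __init__(self):
        super().__init__()
        self.includes = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names before calling this method.
        for name, value in attrs:
            if name == "data-include" and value:
                self.includes.append(value)


def find_includes(html: str) -> list:
    """Return the resources a spec pulls in through data-include."""
    detector = IncludeDetector()
    detector.feed(html)
    detector.close()
    return detector.includes


def can_skip_on_304(raw_source: str) -> bool:
    """Assumed policy: a 304 on the main page is only trusted when the
    spec does not include additional resources."""
    return not find_includes(raw_source)
```

The list returned by `find_includes` could be stored in the crawl results so that the next crawl knows to treat the spec as an exception.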

dontcallmedom added the enhancement label Feb 2, 2022

dontcallmedom mentioned this issue Feb 2, 2022

Near real-time updates to crawled data w3c/webref#486

Open

dontcallmedom added a commit that referenced this issue Feb 4, 2022

Record and use HTTP cache information to skip unnecessary crawling an…

0d54f81

…d processing close #850

dontcallmedom mentioned this issue Feb 4, 2022

Record and use HTTP cache information to skip unnecessary crawling and processing #856

Merged

dontcallmedom closed this as completed in #856 Feb 7, 2022
