Make reffy crawler http-cache friendly #850

Closed
dontcallmedom opened this issue Feb 2, 2022 · 3 comments · Fixed by #856

Comments

@dontcallmedom
Member

An average reffy crawl takes around 6 to 8 minutes, and that duration is set to grow as we consider expanding the sources of crawled specifications.

But in practice, the vast majority of the time is spent fetching and processing documents that have not changed since the last crawl.

If we were to record the ETag and Last-Modified headers that our crawler receives for each crawled URL, we could easily skip the large majority of the work by using the fallback path developed in #825 whenever a request returns a 304.
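As a rough illustration of the bookkeeping this would involve, here is a minimal sketch in Python (reffy itself is a Node.js project, so this is not its code). The `Validators` class and the `record_validators` and `conditional_headers` helpers are hypothetical names chosen for the example; only the HTTP header names come from the HTTP specification.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass
class Validators:
    """Cache validators remembered from the previous crawl of one URL (hypothetical structure)."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def record_validators(cache: Dict[str, Validators], url: str,
                      response_headers: Mapping[str, str]) -> None:
    """Store the ETag / Last-Modified values a server returned for `url`."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in response_headers.items()}
    cache[url] = Validators(lowered.get("etag"), lowered.get("last-modified"))


def conditional_headers(cache: Mapping[str, Validators], url: str) -> Dict[str, str]:
    """Build the request headers that turn the next GET into a conditional request."""
    headers: Dict[str, str] = {}
    known = cache.get(url)
    if known is None:
        return headers  # First crawl of this URL: plain, unconditional GET.
    if known.etag:
        headers["If-None-Match"] = known.etag
    if known.last_modified:
        headers["If-Modified-Since"] = known.last_modified
    return headers
```

A server that honors these validators answers 304 when the document is unchanged, which is the case where the previous crawl result could be reused.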

@tidoust
Member

tidoust commented Feb 2, 2022

> But in practice, the vast majority of the time is spent fetching and processing documents that have not changed since the last crawl.

It could be useful to evaluate how that time is split between fetching (including the waiting in between fetches to remain somewhat friendly with servers) and processing. I have no idea whether that split is 50/50, 80/20, or 20/80.
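One way to get that split would be to accumulate wall-clock time per phase during a crawl. The following Python sketch shows the idea; `timed` and `split_report` are hypothetical helpers, and the phase names are placeholders rather than anything reffy defines.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(totals: Dict[str, float], phase: str) -> Iterator[None]:
    """Add the wall-clock time spent inside the `with` block to totals[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[phase] = totals.get(phase, 0.0) + (time.perf_counter() - start)


def split_report(totals: Dict[str, float]) -> Dict[str, float]:
    """Return each phase's share of the total time, as a percentage."""
    overall = sum(totals.values())
    if overall == 0:
        return {}
    return {phase: round(100 * spent / overall, 1) for phase, spent in totals.items()}
```

Wrapping the fetch step in `with timed(totals, "fetch"):` and the extraction step in `with timed(totals, "process"):` would then let `split_report(totals)` answer the 50/50 versus 80/20 question directly.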

For fetching, we would still need to fetch the main resource at least once (or send an initial HEAD request?). For multipage specs, we would still need to request pages individually as they may change independently of the first one. Also, I note that we would need to account for ReSpec drafts that use data-include to include additional resources (not sure how many do that in practice, perhaps none?), and I'm not sure how to do that.

Fetches of additional resources that are shared across specs already get caught by our local file caching mechanism, so they only trigger one network request per crawl.

The crawler already skips network requests to image resources.

In other words, I'm not sure that we will gain a lot in terms of network requests. Skipping processing would be nice, especially if it takes a significant bit of the total time.

@dontcallmedom
Member Author

> For fetching, we would still need to fetch the main resource at least once (or send an initial HEAD request?).

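If we make the main request a conditional one (sending the cached ETag and Last-Modified values as validators), the GET will only get back a 304 without a body when the document has not changed.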
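To make the 304 behavior concrete, here is a minimal Python sketch using the standard library's `urllib`. The `conditional_get` function and its `opener` parameter are assumptions made for the example (the parameter exists only so the function can be exercised without a network). Note that `urllib` reports a 304 as an `HTTPError`, so the sketch catches it explicitly.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional, Tuple


def conditional_get(url: str, validators: Dict[str, str],
                    opener: Callable = urllib.request.urlopen) -> Tuple[int, Optional[bytes]]:
    """GET `url` with cache validators; return (status, body), with body None on a 304.

    `validators` holds request headers such as If-None-Match / If-Modified-Since.
    """
    request = urllib.request.Request(url, headers=validators)
    try:
        with opener(request) as response:
            # Document changed (or server ignores validators): full body returned.
            return response.status, response.read()
    except urllib.error.HTTPError as error:
        if error.code == 304:
            # Not modified: no body was transferred, reuse the previous crawl result.
            return 304, None
        raise  # Any other HTTP error is a genuine failure.
```

With this shape, a single GET is enough for the main resource: no preliminary HEAD request is needed, since an unchanged document costs only the headers of the 304 response.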

> For multipage specs, we would still need to request pages individually as they may change independently of the first one.

I think multipage specs update the date in their subtitle whenever they change (or at least they definitely should).

> Also, I note that we would need to account for ReSpec drafts that use data-include to include additional resources

Good point, I hadn't thought of these.

I don't think they are critical blockers, though: we could annotate specs to mark them as exceptions if that's a real issue, or run a no-cache crawl of specs every so often to catch these situations.
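Both mitigations amount to a small policy decision per spec. A hypothetical Python sketch follows; the function name, the `always_refetch` set, and the default interval of 10 crawls are all assumptions for illustration, not values discussed in this thread.

```python
from typing import AbstractSet


def should_send_validators(spec_url: str, crawl_number: int,
                           always_refetch: AbstractSet[str],
                           full_crawl_every: int = 10) -> bool:
    """Decide whether this crawl may rely on cached ETag / Last-Modified values.

    `always_refetch` lists specs annotated as exceptions (hypothetical annotation).
    `full_crawl_every` is an assumed interval for periodic no-cache crawls.
    """
    if spec_url in always_refetch:
        return False  # Annotated exception: always fetch and process in full.
    if full_crawl_every > 0 and crawl_number % full_crawl_every == 0:
        return False  # Periodic no-cache crawl to catch missed changes.
    return True
```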

> In other words, I'm not sure that we will gain a lot in terms of network requests. Skipping processing would be nice, especially if it takes a significant bit of the total time.

I think the gains will be substantial for both.

@dontcallmedom
Member Author

> we could annotate specs to mark them as exceptions if that's a real issue

Reffy could do that flagging itself (i.e., as part of its crawl results) whenever it detects a data-include (or equivalent) situation.
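A minimal sketch of such detection, in Python with the standard library's `html.parser`, could look like the following. It only inspects the static source for `data-include` attributes, and `find_includes` is a hypothetical helper rather than anything in reffy, which loads specs in a headless browser and would more likely query the DOM.

```python
from html.parser import HTMLParser
from typing import List


class IncludeDetector(HTMLParser):
    """Collect the targets of data-include attributes found in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.includes: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names before calling this method.
        for name, value in attrs:
            if name == "data-include" and value:
                self.includes.append(value)


def find_includes(html: str) -> List[str]:
    """Return the resources a spec source pulls in through data-include."""
    detector = IncludeDetector()
    detector.feed(html)
    detector.close()
    return detector.includes
```

A non-empty result could be stored in the crawl results as the flag, so that the next crawl knows not to trust a 304 on the main resource alone for that spec.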
