Make reffy crawler http-cache friendly #850
It could be useful to evaluate how that time is split between fetching (or waiting in between fetches to remain somewhat friendly with servers) and processing. I have no idea whether that split is 50/50, 80/20, or 20/80.

For fetching, we would still need to fetch the main resource at least once (or send an initial HEAD request?). For multipage specs, we would still need to request pages individually, as they may change independently of the first one. Also, I note that we would need to account for ReSpec drafts that use […].

Fetches to additional resources that are common across specs are already caught by our local file caching mechanism, so they only trigger one network request per crawl. The crawler also already skips network requests to image resources. In other words, I'm not sure that we will gain a lot in terms of network requests. Skipping processing would be nice, though, especially if it takes a significant share of the total time.
If we make the main request as a conditional GET based on the cached validators, the server will only send back a 304 without a body.
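A minimal sketch of what that conditional GET could look like (the helper names are illustrative, not reffy's actual API): the crawler turns previously recorded validators into `If-None-Match` / `If-Modified-Since` request headers, and treats a 304 response as "reuse the previous crawl result".

```javascript
// Hypothetical helpers, not reffy's actual code.

// Build conditional request headers from validators recorded on a
// previous crawl. An empty object means a plain, unconditional GET.
function conditionalHeaders(cached) {
  const headers = {};
  if (cached && cached.etag) headers['If-None-Match'] = cached.etag;
  if (cached && cached.lastModified) headers['If-Modified-Since'] = cached.lastModified;
  return headers;
}

// A 304 response carries no body: the resource has not changed and the
// crawler can skip both the download and the processing step.
function isNotModified(response) {
  return response.status === 304;
}
```

With these in place, the fetch call itself stays a normal GET; only the headers change, and the 304 branch short-circuits to the cached extract.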
I think multipage specs update their date subtitle (or at least they definitely should)
Good point, I hadn't thought of these. I don't think they are critical blockers, though: we could annotate specs to mark them as exceptions if that turns out to be a real issue, or run a no-cache crawl of all specs every so often to catch these situations.
I think the gains will be substantial for both.
reffy could do that flagging itself (i.e. as part of its crawl results), whenever it detects a […].
An average reffy crawl takes around 6 to 8 minutes, and that duration is set to grow as we consider expanding the set of crawled specifications.
But in practice, the vast majority of the time is spent fetching and processing documents that have not changed since the last crawl.
If we were to record the ETag and Last-Modified headers our crawler receives for each crawled URL, we could easily skip the large majority of that work by using the fallback path developed in #825 whenever we hit a 304.
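A sketch of the record-and-replay loop this describes, under the assumption of a simple in-memory store keyed by URL (the names `recordValidators` and `conditionalHeadersFor` are hypothetical, not reffy's actual functions):

```javascript
// Illustrative sketch, not reffy's actual code.
// url -> { etag, lastModified } as seen on the last successful fetch.
const validators = new Map();

// After a 200 response, remember the validators for the next crawl.
// `headers` is assumed to be a plain object with lower-cased keys.
function recordValidators(url, headers) {
  validators.set(url, {
    etag: headers['etag'],
    lastModified: headers['last-modified']
  });
}

// Before the next fetch of the same URL, turn the recorded validators
// into conditional request headers; a 304 answer then means the crawler
// can fall back to the previously saved crawl result.
function conditionalHeadersFor(url) {
  const v = validators.get(url);
  if (!v) return {};
  const h = {};
  if (v.etag) h['If-None-Match'] = v.etag;
  if (v.lastModified) h['If-Modified-Since'] = v.lastModified;
  return h;
}
```

In practice the store would be persisted between crawls (e.g. next to the saved crawl results) rather than kept in memory.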