
Record and use HTTP cache information to skip unnecessary crawling and processing #856

Merged
dontcallmedom merged 19 commits into main from http-cache on Feb 7, 2022

Conversation

dontcallmedom
Member

close #850

@dontcallmedom
Member Author

this depends on tidoust/fetch-filecache-for-crawling#6

from my exploration, this can reduce a no-change crawl to ~1min30; but there are server-side configuration issues (on www.w3.org and rfc-editor.org, and probably on csswg.org, although I haven't diagnosed those yet) that make the actual impact smaller at the moment.
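For context, the mechanism under discussion boils down to recording HTTP cache information from a previous crawl, replaying it as a conditional request, and skipping work when the server answers 304 Not Modified. A minimal sketch in plain JavaScript; function and field names are illustrative, not Reffy's actual API:

```javascript
// Sketch of the caching idea: record HTTP cache information
// (Last-Modified) from a previous crawl, send it back as a conditional
// request header, and skip processing when the server answers
// "304 Not Modified". Names are illustrative, not Reffy's actual API.

// Build conditional request headers from a previous crawl result.
function conditionalHeaders(previousCrawl) {
  const headers = {};
  if (previousCrawl && previousCrawl.lastModified) {
    headers["If-Modified-Since"] = previousCrawl.lastModified;
  }
  return headers;
}

// A 304 response means the spec did not change: crawling and
// processing can be skipped and the previous extracts reused.
function canSkipProcessing(statusCode) {
  return statusCode === 304;
}
```

A misconfigured server always answers 200 even when nothing changed, which is why the actual speed-up depends on server-side configuration.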

@dontcallmedom
Member Author

@tidoust if you get a chance to review the current direction, this would be very much appreciated

src/lib/specs-crawler.js (outdated review thread, resolved)
src/lib/specs-crawler.js (outdated review thread, resolved)
src/lib/util.js (outdated review thread, resolved)
@dontcallmedom
Member Author

Thanks for the review. I've fixed the bugs you mentioned and improved the cache semantics a bit to avoid relying on ETag; the CSS server is still problematic (which is unfortunate given their many, many specs), but I'm down to 3min30 of processing despite that.

@dontcallmedom
Member Author

dontcallmedom commented Feb 4, 2022

Added another option to skip processing that accounts for misbehaving servers; it now runs in 1min30, with a few servers/specs still not providing cache info (fxtf, RFCs (!))

@dontcallmedom
Member Author

Once tidoust/fetch-filecache-for-crawling#6 is merged and released as an updated npm package, the only thing missing from this PR should be a version bump in package.json (and presumably package-lock.json), which should also make the added test pass.

Member

@tidoust left a comment


A couple of nits.

This looks good to me otherwise. I note that one implicit hypothesis here is that the index page of a multipage spec is always going to be updated when one of the pages gets updated. I think that's the case because the index page contains the last modified date. I'm not sure that is always the case though. For instance, is the index page always updated when multiple updates get made to pages in the same day?

More importantly, as-is, this caching mechanism (not surprisingly?) creates a cache invalidation problem: if we fix or update one of the extraction modules in browserlib or even other parts of Reffy's code that could affect extracts, and continue to run Reffy with the fallback parameter, the update will only be reflected in extracts when specs get modified.

That problem already exists without this caching mechanism but is limited to specs that cannot be crawled (which we hope is going to be a temporary condition).

The obvious solution is to force a full crawl when that happens. However, from a Webref perspective:

  1. We don't really know which version of Reffy was used to crawl data, so we cannot easily run Reffy without the fallback parameter when a new version comes out.
  2. We probably want to keep the fallback parameter in any case, so as to reuse previous crawl results in case of errors and not introduce temporary updates in the extracts.

I'm thinking that a relatively simple solution would be to inject the version of Reffy that was used to crawl the data in index.json (which is useful info to have in any case), and to only enable this caching mechanism when the current version matches the one in the fallback crawl result.
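The guard suggested here could be as simple as the following sketch, assuming index.json gains a field recording the Reffy version that produced it (the `reffyVersion` field name is hypothetical):

```javascript
// Sketch of the proposed cache-invalidation guard: fallback crawl data
// is only trusted when it was produced by the same version of Reffy as
// the one currently running. The "reffyVersion" field name is
// hypothetical, not necessarily what the PR ended up using.
function canUseFallback(fallbackIndex, currentVersion) {
  return Boolean(fallbackIndex) &&
    fallbackIndex.reffyVersion === currentVersion;
}
```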

requestId,
errorReason: "Failed"
});
reuseExistingData = true;
Member


It's a bit sad that there's no easy way to pass that reason within the Fetch.failRequest message. errorReason: "BlockedByClient" could perhaps be the closest reason to what we're doing here.

In any case, #858 removes that part, so that comment will quickly become moot.

src/lib/util.js (outdated review thread, resolved)
src/lib/specs-crawler.js (outdated review thread, resolved)
@dontcallmedom marked this pull request as ready for review on February 7, 2022 07:14
@dontcallmedom
Member Author

I'm thinking that a relatively simple solution would be to inject the version of Reffy that was used to crawl the data in index.json (which is useful info to have in any case), and to only enable this caching mechanism when the current version matches the one in the fallback crawl result.

I had been thinking along the same lines, although I had imagined we would only do a full crawl when Reffy gets bumped to a new minor or major version (while also having a dispatchable workflow to do a full crawl on demand)

Do you want this in this PR, or can this be done separately?
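The looser match floated above (full crawl only on minor or major version bumps, not patch releases) could be sketched by comparing semver components while ignoring the patch level; a sketch of the idea, not what the PR implements:

```javascript
// Sketch of version-bump-driven invalidation: a full crawl is forced
// only when the major or minor component of the Reffy version changed;
// patch releases keep reusing cached crawl data. Assumes plain
// "major.minor.patch" version strings.
function needsFullCrawl(crawledVersion, currentVersion) {
  const [crawledMajor, crawledMinor] = crawledVersion.split(".").map(Number);
  const [currentMajor, currentMinor] = currentVersion.split(".").map(Number);
  return crawledMajor !== currentMajor || crawledMinor !== currentMinor;
}
```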

@tidoust
Member

tidoust commented Feb 7, 2022

Do you want this in this PR, or can this be done separately?

I would prefer to have it in this PR, so as not to start a loop in Webref that cannot be fixed until we get the feature. But if that's the next thing you do, I'm fine with doing it separately as well

(I still wonder about multipage specs though)

@dontcallmedom
Member Author

I've added support for checking the Reffy version before using fallback data; this could be made more subtle (i.e. not necessarily require an exact match), but it's probably enough as a starting point.

I note that one implicit hypothesis here is that the index page of a multipage spec is always going to be updated when one of the pages gets updated. I think that's the case because the index page contains the last modified date. I'm not sure that is always the case though. For instance, is the index page always updated when multiple updates get made to pages in the same day?

Looking at multi-page specs we have today:

  • HTML, CSS2.2 and ES are generated from a single source file, so the title page is guaranteed to be updated
  • CSS 2.1 and SVG 1.1 are TR-only (so would necessarily have their title page updated if they ever get updated)
  • SVG2 is problematic - its title page DOES NOT get updated when subpages are updated

In practice, since we would still get a fresh data update on each Reffy upgrade, and given that SVG2 gets updated very rarely, I'm not too concerned by that limitation; if we really care, we could check when fetching subpages whether their Last-Modified is (significantly) more recent than the title page's, and not save the cache info in that situation.
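The SVG2 safeguard mentioned above could be sketched as a freshness comparison between a subpage and the title page; the function name and the one-day tolerance are illustrative choices, not anything the PR specifies:

```javascript
// Sketch of the SVG2-style safeguard: do not save cache info for a
// multipage spec when a subpage's Last-Modified is significantly more
// recent than the title page's, since the title page then no longer
// reflects updates to the spec. The one-day tolerance is an arbitrary
// illustration.
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

function cacheInfoTrustworthy(titleLastModified, subpageLastModified,
                              toleranceMs = ONE_DAY_MS) {
  const titleDate = Date.parse(titleLastModified);
  const subpageDate = Date.parse(subpageLastModified);
  return subpageDate - titleDate <= toleranceMs;
}
```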

@tidoust
Member

tidoust commented Feb 7, 2022

HTML, CSS2.2 and ES are generated from a single source file, so the title page is guaranteed to be updated

But will that generation always trigger an actual git update? What is going to change in the title page? The date is per day in particular, so would the title page get git-updated the second time a spec receives an update on the same day? That situation is probably not that uncommon for the HTML spec.

Edit: Or are the generated pages not stored in and served from a Git branch?

@dontcallmedom
Member Author

But will that generation always trigger an actual git update? What is going to change in the title page? The date is per day in particular, so would the title page get git-updated the second time a spec receives an update on the same day? That situation is probably not that uncommon for the HTML spec.

For HTML, the title page includes a link to the relevant commit (for the snapshot), so if that's our primary concern, I think we're safe

@dontcallmedom
Member Author

Edit: Or are the generated pages not stored in and served from a Git branch?

HTML and CSS2 are not served from a git branch afaict; ES is, but it looks (from a cursory check) like the title page's date does get updated when one of the subpages is: tc39/ecma262@65905e9#diff-bf7934c87fd1a409fa27452a6461d0ca2916decce82025154e60df5a3e0e8215. We might lose some same-day updates, but I don't think that warrants strong concern. This presumably could be addressed with my earlier proposal discussed in the context of SVG2.

Member

@tidoust left a comment


Good! Go go go!

@dontcallmedom merged commit c371626 into main on Feb 7, 2022
@dontcallmedom deleted the http-cache branch on February 7, 2022 13:46
tidoust added a commit that referenced this pull request Feb 7, 2022
New features:
- Record and use HTTP cache information to skip unnecessary crawling and
processing (#856)
- Check cache validity before launching browser tab (#858)
Successfully merging this pull request may close these issues.

Make reffy crawler http-cache friendly