Explicitly encourage user agents to validate cache content via integrity attributes #101

Open
crisperdue opened this issue Jun 2, 2021 · 7 comments

@crisperdue

It has been recognized for a number of years that cryptographic hashes of content have the potential to improve the effectiveness of user agent caching. Ideas have been aired for shared caching keyed by cryptographic hashes, enabling sharing of identical content even across origins, but these have been rejected on various privacy and security grounds. In fact, browsers have been moving in the conceptually opposite direction with cache partitioning, to avoid privacy issues that arise when multiple websites access the same CDN content and the resulting access patterns can be observed.

With this background in mind, even within a conventional browser cache keyed by content URL, or a partitioned cache further keyed by the origin of the containing document, a user agent could considerably reduce the number of network round trips by using the SRI "integrity" attribute to validate existing browser cache content.

Scenario: Document https://example.com/A refers to resource https://example.com/C, perhaps a script, image, or stylesheet. References to C from A include a suitable "integrity" attribute. When document A is first loaded, the user agent also loads C and checks that its content matches the attribute. Later the user returns to document A. (Assume that freshness information such as max-age has expired, so the cached copy of C is no longer considered fresh.) On return to A, the user agent again needs C, which would normally trigger an HTTP request, possibly with an If-None-Match header and a 304 response indicating that the cached content is still valid. With an integrity attribute for C, the browser has enough information to validate C without involving any server. If document A still refers to the same content as before, the integrity attribute will be unchanged and will match C if it is still in the cache. If document A refers to different content at the same URL, the integrity attribute of the new version of A will differ, so the user agent can detect that the cached content is no longer valid and re-fetch it. The same sequences apply when the user visits another document B at the same origin that refers to resource C, whether updated or unchanged.
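
A minimal sketch of the client-side check in this scenario may help make it concrete. It is not taken from any browser; the function names and the four outcome labels ("fetch", "revalidate", "reuse", "refetch") are hypothetical, and it assumes the cached copy is already stale, as in the scenario.

```python
import base64
import hashlib

# Hash functions defined for SRI, listed weakest to strongest.
HASHES = {"sha256": hashlib.sha256, "sha384": hashlib.sha384, "sha512": hashlib.sha512}
ORDER = list(HASHES)


def strongest_digests(integrity):
    """Return (algorithm, [base64 digests]) for the strongest listed algorithm, or None."""
    by_algo = {}
    for token in integrity.split():
        algo, _, rest = token.partition("-")
        digest = rest.split("?", 1)[0]  # ignore any "?options" suffix
        if algo in HASHES and digest:
            by_algo.setdefault(algo, []).append(digest)
    if not by_algo:
        return None
    best = max(by_algo, key=ORDER.index)
    return best, by_algo[best]


def cache_decision(cached_body, integrity):
    """Hypothetical policy for a stale cache entry and the referencing integrity value."""
    if cached_body is None:
        return "fetch"  # nothing cached for this URL
    parsed = strongest_digests(integrity or "")
    if parsed is None:
        return "revalidate"  # no usable integrity metadata: ordinary HTTP revalidation
    algo, digests = parsed
    actual = base64.b64encode(HASHES[algo](cached_body).digest()).decode("ascii")
    # A match means the cached bytes are exactly what the document asks for.
    return "reuse" if actual in digests else "refetch"
```

With this policy, a document that still carries the old integrity value yields "reuse" with no network traffic, while an updated document with a new value yields "refetch" for the same URL.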

Network round trips can take as much time as the actual resource loading, or more, so the performance improvements can be considerable. They are comparable to the benefits of including version identifiers in URLs and configuring the server to respond with very large max-age values for versioned URLs, as is frequently done in CDNs for well-known resources (think jQuery). Removing network round trips this way can also free up network connections for access to other resources. This use of the integrity attribute requires no server modifications or even server configuration, and it is compatible with partitioned user agent caching.

As far as I can tell, this possibility is not well documented, and probably deserves to be called out as permissible, for the benefit of implementors of user agents as well as web developers.

@annevk
Member

annevk commented Jun 3, 2021

Unfortunately it's been discussed since at least 2015 and has exactly the same kind of privacy issues the normal cache had before it got partitioned. Duplicating this into #22.

Apologies, I misread the OP; the idea seems to be limited to avoiding the need for revalidation requests. This can already be achieved through Cache-Control: immutable, but this alternative is probably worth considering.

@annevk annevk closed this as completed Jun 3, 2021
@annevk annevk reopened this Jun 3, 2021
@crisperdue
Author

Thanks, yes, the intention is to reduce revalidation requests to the server, in effect doing the revalidation on the client. Unlike with Cache-Control: immutable, user agents would still request the resource at the same URL again whenever the cached content no longer matches the integrity attribute from a more recently updated webpage.

@jayaddison

> If document A refers to different content using the same URL, the integrity attribute of the new version of A will differ, and the user agent can detect that the currently cached content is not valid and re-fetch. Similar sequences would result from access to another document B at the same origin, referring to the same resource C, either updated or the same version as before.

While acknowledging the cache-hit use case (potentially avoiding a round trip when a referenced resource appears unmodified), I'd also like to highlight the cache-miss use case here.

I support cache-miss detection, if I understand it correctly, because it matches a desire I've had in various contexts: to cache-bust subresources referenced from HTML documents without changing the URIs of those subresources. (Changing the URIs is a common practice today, but one I have never felt entirely comfortable with.)

I believe that the approach proposed here would be orthogonal to HTTP-header-based cache control mechanisms, because the latter operate on individual URIs (in other words: by design, the cache response headers for an HTML document cannot effect expiry of subresources included in the document, nor vice-versa).

@jayaddison

To phrase this another way: as a web developer, I'd like my website's main stylesheet to always be hosted at https://example.org/website/main.css. I would like to reference the current stylesheet, with SHA-256 digest <hash-a>, from HTML by using <link rel="stylesheet" integrity="sha256-<hash-a>" href="https://example.org/website/main.css" /> or similar. When the content of the main stylesheet is updated and the digest becomes <hash-b>, I would like web browsers to consider the cached main.css with digest <hash-a> to be invalid. I prefer this cache-busting mechanism to URL-based cache-busting. I want the cache key to remain linked to the URL of the resource on this origin -- that is, I don't want other origins to be able to retrieve the resource from my website's origin from their own caches.
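
As an illustration of the developer side of this, here is a rough sketch of how a build step might compute the integrity value and emit the link element. The helper names sri_sha256 and stylesheet_link are hypothetical, and the stylesheet contents in the usage note are made up.

```python
import base64
import hashlib
from html import escape


def sri_sha256(content: bytes) -> str:
    """Return an SRI-style value: 'sha256-' followed by the base64 digest."""
    digest = hashlib.sha256(content).digest()
    return "sha256-" + base64.b64encode(digest).decode("ascii")


def stylesheet_link(href: str, content: bytes) -> str:
    """Build a link element whose integrity attribute describes `content`."""
    return '<link rel="stylesheet" integrity="{}" href="{}" />'.format(
        escape(sri_sha256(content), quote=True),
        escape(href, quote=True),
    )
```

Calling stylesheet_link("https://example.org/website/main.css", b"body { color: red }") and then again with b"body { color: blue }" produces two elements with the same href but different integrity values, which is the signal a browser could use to treat the cached main.css as invalid.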

@jayaddison

A potential problem with using the integrity attribute as a cache-buster for a given URL: without a cache-bust string in the request, an intermediate network cache could respond with stale content. The client would then refuse to load the response because it fails the integrity check, which could cause a frustrating user experience, not to mention a waste of bandwidth.

Perhaps a client should only cache-bust an existing record for a URL if the existing cache entry also contains an ETag. In that case, the client must include that seemingly-expired ETag in an If-None-Match header in the fetch request -- so that intermediate caches, and ultimately the origin webserver, can exclude that when considering cached responses.

@jayaddison

No, sorry - I think I've misunderstood how If-None-Match works. I'll need to do some more research.

@jayaddison

> Perhaps a client should only cache-bust an existing record for a URL if the existing cache entry also contains an ETag. In that case, the client must include that seemingly-expired ETag in an If-None-Match header in the fetch request -- so that intermediate caches, and ultimately the origin webserver, can exclude that when considering cached responses.

To ensure retrieval of a resource that matches the one specified on the integrity attribute, I think we'd want the client to use the If-Match header instead, to request exactly the resource indicated.

If-Match does allow multiple ETags to be listed -- so if the client took the digests for the strongest algorithm specified in a multi-checksum integrity value, its request could ask for any of them, and a cache or origin could honour that.
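
A rough sketch of how a client might build such a header value follows. It assumes, purely for illustration, that the server's ETags are literally SRI-formatted strings; nothing in HTTP guarantees that, and the function name is hypothetical.

```python
# Hash algorithm names defined for SRI, listed weakest to strongest.
STRENGTH = ["sha256", "sha384", "sha512"]


def if_match_from_integrity(integrity: str) -> str:
    """Return an If-Match field value listing the strongest-algorithm digests.

    Assumption: the server uses the SRI string itself as a strong ETag.
    Returns "" when the integrity value has no usable digest.
    """
    by_algo = {}
    for token in integrity.split():
        algo, sep, digest = token.partition("-")
        if sep and digest and algo in STRENGTH:
            # Drop any "?options" suffix; keep the "algo-digest" part.
            by_algo.setdefault(algo, []).append(token.split("?", 1)[0])
    if not by_algo:
        return ""
    strongest = max(by_algo, key=STRENGTH.index)
    # Entity tags are sent as quoted strings, comma-separated.
    return ", ".join('"{}"'.format(t) for t in by_algo[strongest])
```

For an integrity value that lists one sha256 digest and two sha384 digests, only the two sha384 entries end up in the header value.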

However, a more fundamental problem is that there's no intrinsic guarantee that the ETag format will match the format of an individual integrity checksum. Some webservers do use well-known hash algorithms to generate ETags, but in my experience the results don't tend to be transmitted in the SRI-specified format (algorithm-name-prefix, hyphen, base64-encoded digest).

An individual webserver operator could likely configure SRI-attribute / ETag equality for their domain, but it seems unrealistic to expect that to work across multiple web properties without further standardization.
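
To make the format gap concrete, here is a small sketch that converts one hypothetical ETag style (a quoted hexadecimal digest) into the SRI format. The function name and the assumption that the ETag is a plain hex digest of the response body are illustrative only; real servers vary, and the values only agree if the server hashed exactly the same bytes that SRI checks.

```python
import base64

# Raw digest lengths in bytes for the SRI hash algorithms.
DIGEST_BYTES = {"sha256": 32, "sha384": 48, "sha512": 64}


def hex_etag_to_sri(etag: str, algo: str = "sha256") -> str:
    """Convert a quoted hex-digest ETag into 'algo-base64digest' form.

    Raises ValueError for weak ETags, non-hex values, or a length mismatch.
    """
    value = etag.strip()
    if value.startswith("W/"):
        raise ValueError("weak ETags do not identify exact bytes")
    raw = bytes.fromhex(value.strip('"'))  # ValueError if the value is not hex
    if len(raw) != DIGEST_BYTES[algo]:
        raise ValueError("digest length does not match " + algo)
    return algo + "-" + base64.b64encode(raw).decode("ascii")
```

A client cannot know from the ETag alone which algorithm, encoding, or input bytes a server used, which is why a conversion like this only works when the operator controls both sides.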

It's possible I've misunderstood some details, and I'll continue to think about this, but short-term I believe that URI-based cache-busting likely continues to be necessary. This does not necessarily negate @crisperdue's original suggestion that cache invalidation could check for integrity attribute mismatches -- but if the URL is the same, then intermediate caches may not behave in the way we expect, and if the URL has changed, then we would not expect existing browser cache content to exist (in most cases).
