
Consider shared caching #22

Open
metromoxie opened this issue Dec 21, 2015 · 35 comments



@metromoxie metromoxie commented Dec 21, 2015

We've had a lot of discussions about using SRI for shared caching (see https://lists.w3.org/Archives/Public/public-webappsec/2015May/0095.html for example). An explicit issue was filed at w3c/webappsec#504 suggesting a sharedcache attribute to imply that shared caching is OK. We should consider leveraging SRI for more aggressive caching.

@metromoxie metromoxie added this to the v2 milestone Dec 21, 2015
@hillbrad hillbrad modified the milestone: v2 Jan 22, 2016

@btrask btrask commented Apr 24, 2016

I hope this is a reasonable place to comment. (If not please tell me where to go.)

I've been working on content addressing systems for several years. I understand that content addresses, which are "locationless," are inherently in conflict with the same-origin policy, which is location-based.

An additional/alternate solution is for a list of acceptable hashes to be published by the server at a well-known location.

For example, the user agent could request https://example.com/.well-known/sri-list, which would return a plain text file with a list of acceptable hashes, one per line. Hashes on this list would be treated as if they were hosted by the server itself, and thus could be fetched from a shared cache while being treated for all intents and purposes like they were fetched from the server in question.
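To make that concrete, here is a minimal sketch, assuming a hypothetical .well-known/sri-list format of one integrity value per line (the file name, format, and function below are illustrative, not part of any spec):

    // Hypothetical contents of https://example.com/.well-known/sri-list:
    //   sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC
    //   sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=

    // Illustrative user-agent-side check before satisfying a request from a shared cache.
    async function hashAllowedByOrigin(origin: string, integrity: string): Promise<boolean> {
      const res = await fetch(new URL("/.well-known/sri-list", origin).toString());
      if (!res.ok) return false;
      const allowed = new Set(
        (await res.text())
          .split("\n")
          .map((line) => line.trim())
          .filter((line) => line.length > 0)
      );
      return allowed.has(integrity);
    }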

This does add some complexity both for user agents and for site admins. On the other hand, the security implications are well understood, and wouldn't require new permission logic.

Thanks for your work on SRI.


@metromoxie metromoxie commented Apr 25, 2016

An interesting idea (although I know many folks who are vehemently against well-known-location solutions, and I won't pretend to fully grasp why). If implemented, though, it would still require a round trip to fetch .well-known/sri-list, right? That seems to lose a lot of the benefit of these resources acting as shared libraries.

Another suggestion, which I think I heard somewhere, is: if the page includes a CSP, only use a cross-origin cache for a resource with an integrity attribute if the CSP includes that integrity value in its script-hash whitelist. I think this would address @mozfreddyb's concerns listed in Synzvato/decentraleyes#26, but I haven't thought too hard about it. On the other hand, it also starts to look really weird and complicated :-/
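For illustration only (the hash value and URL are placeholders, and this is a sketch of the idea rather than a specified mechanism), the pairing might look like this, with the cross-origin cache consulted only because the same hash appears in both places:

    Content-Security-Policy: script-src 'self' 'sha384-oqVuAfXRKap7fdgc...'

    <script src="https://cdn.example.com/jquery.min.js"
            integrity="sha384-oqVuAfXRKap7fdgc..."
            crossorigin="anonymous"></script>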

Also, these solutions don't address timing attacks with cross-origin caches. Although, as a side note, someone recently pointed out to me that history timing attacks in this case are probably not too concerning from a security perspective, since it's a "one-shot" timing attack. That is, the resource is definitively loaded after the attack happens, so you can't attempt the timing again, and that makes the timing attack much more difficult to pull off, since timing attacks usually rely on repeated measurement.


@btrask btrask commented Apr 26, 2016

Using a script-hash whitelist in the HTTP headers (as part of CSP or separately) is better for a small number of hashes, since it doesn't require an extra round trip. Using a well-known list is better for a large number of hashes, since it can be cached for a long time.

I agree that well-known locations are ugly. Although it works for /robots.txt and /favicon.ico, there is a high cost for introducing new ones.

The privacy problem is worse than timing attacks: if you control the server, you can tell that no request is ever made. This seems insurmountable for cross-origin caching.

Perhaps the gulf between hashes and locations is too large to span. For true content-addressing systems (like what I'm working on), my preference is to treat all hashes as a single origin (so they can't reference or be referenced by location-based resources).

Thanks for your quick reply!


@mozfreddyb mozfreddyb commented Apr 26, 2016

I'd be slightly more interested in blessing the hashes for cross-origin caches by mentioning them in the CSP. .well-known would add another round trip, and I'm not sure whether that would hamper the performance benefit we wanted in the first place.

The idea to separate hashed resources into their own origin is interesting, but I don't feel comfortable drilling holes that deep into the existing weirdness of origins.


@btrask btrask commented Apr 26, 2016

To be clear, giving hashes their own origin only makes sense if you are loading top-level resources by hash. In that case, you can give access to all other hashes, but prohibit access to ordinary URLs. But that is a long way off for any web browsers and far from the scope of SRI.


@mozfreddyb mozfreddyb commented Oct 18, 2016

For the record, @hillbrad wrote a great document outlining the privacy and security risks of shared caching: https://hillbrad.github.io/sri-addressable-caching/sri-addressable-caching.html


@kevincox kevincox commented Oct 31, 2016

That document doesn't appear to consider an opt-in approach. While opt-in would reduce the number of sites that use it, it could still be quite useful.

<script src="jquery.js" integrity="..." public></script>

This tag should only be put on scripts for which timing is not an issue. Of course, deciding what is public is now the responsibility of the website. However, since the benefit would be negligible for anything that is website-specific, this might be a pretty clear decision. For example, a script specific to my site has a single URL anyway, so I may as well not mark it public; otherwise malicious sites could figure out who has been to my site recently even though I get no benefit from the content-addressed cache. If I am including jQuery, however, there will be a benefit, because there are many different copies of it on the internet, and at the same time knowing whether a user has jQuery in their cache is much less identifying.

That being said, if Firefox had a way to turn this on now I would enable it; I don't see the privacy hit as being large, and the performance would be nice to have.

@btrask btrask commented Dec 21, 2016

A "public" flag seems like a good solution to me. It seems to encapsulate both the benefits and the drawbacks of shared caching. It says, "yes, you can share files publicly, but that means anyone can see them."

That said, if it's opt-in, there's the question of how many sites would actually use it, and whether it's worth the trouble. Especially if it has to be set in HTML, rather than say by CDNs automatically. Maybe it would work better as an HTTP header?

@ScottHelme ScottHelme commented Dec 22, 2016

Setting it in the HTML doesn't seem to be a big problem. If large CDN providers include this in their example script/style tags, then sites will copy and paste support for it. A similar approach is currently being used for SRI, and although adoption isn't growing as fast as I'd like, usage is slowly increasing. Sites that are looking for those extra performance boosts would also be keen to implement it.

@kevincox kevincox commented Jan 2, 2017

The idea of a public header (or even another key in Cache-Control) sounds quite interesting and elegant. However, I think it would make this more difficult to use, since one significant use case is to let each site point to its own copy of a script rather than a centrally hosted one. That means each site would have to add headers to some of its scripts, rather than just making a modification in the HTML. Not that either is a huge barrier, but static site hosting often makes it difficult to set headers, especially for a subset of paths.

At the end of the day I have no major objections to either option, though.

@btrask btrask commented Jan 3, 2017

@kevincox Yes, I was suspecting that Cache-Control: public might be appropriate. It seems like the HTTP concept of a "shared cache" is fundamentally equivalent to SRI shared caching. See here for definitions of public and private: https://tools.ietf.org/html/rfc7234#section-5.2.2.5

The Cache-Control security concerns (cache poisoning, accidentally caching sensitive information) are prevented by hashing. The only remaining security consideration is information leaks, which Cache-Control: public seems to address.

I'm not opposed to using an HTML attribute instead, but I think it's good to reuse existing mechanisms when they fit. Caching has traditionally been controlled via HTTP, not HTML.
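As a sketch of that reuse (header values are illustrative, and this is not a specified behavior), the origin serving the script would opt in with ordinary response headers, and the UA would consider shared-cache reuse only for responses that are both marked public and match the requesting page's integrity value:

    HTTP/1.1 200 OK
    Content-Type: application/javascript
    Cache-Control: public, max-age=31536000, immutable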

There are a few other ways to break this down:

  • Does an HTML attribute make more sense for non-HTTP (file:, data:, ftp:, etc.) resources? (There's an argument for shared caching across protocols, which an HTTP header wouldn't really help with; on the other hand, caching doesn't make much sense for some protocols.)
  • Is publicness a property of the resource itself, or the use of that resource? (My intuition says the resource, since the point is that it can be shared between different contexts)
  • Which is better for third party resources (e.g. hotlinking)? (Either approach can be limiting)

I think framing this as "which method is easier for non-expert webmasters to deploy?" is likely to lead to a suboptimal solution. Yes, some people don't know how to set HTTP headers, and some hosts don't let users set them, but in that case they are already stuck with limited caching options, unless we're going to expose all of Cache-Control via HTML.

@brillout brillout commented Mar 8, 2017

@btrask A website highly concerned about privacy and loading <script src="/uncommon-datepicker.jquery.js" integrity="sha....."></script> will want to make sure that uncommon-datepicker.jquery.js is never loaded from the shared cache. Whether the shared cache should be used or not has to be controlled by the website using the resource, not by the server that first delivered the resource.

@btrask btrask commented Mar 8, 2017

@brillout: Yes, good point. Using a mechanism not in the page source defeats the purpose, when the page source is the only trusted information. Thanks for the tip!

@brillout brillout commented Mar 8, 2017

@metromoxie
@mozfreddyb
@kevincox
@ScottHelme

Are we missing any pieces?

The two concerns are:

  • CSP
  • Privacy / "history attacks"

Solution to privacy: we can make the shared cache an opt-in option via an HTML attribute. I'd say that would be enough. (But if we want more protection, browsers could add a resource to the shared cache only when many domains use that resource, as described in https://hillbrad.github.io/sri-addressable-caching/sri-addressable-caching.html#solution and w3c/webappsec#504 (comment).)

Solution to CSP: the UA should treat scripts with the shared cache enabled as inline scripts. (As described here: w3c/webappsec#504 (comment).)

It would be super exciting to be able to use a bunch of web components built with different frontend frameworks behind the web-component curtain: a date picker using Angular, an infinite scroll using React, and a video player using Vue. This is currently prohibitive KB-wise, but a shared cache would allow it.

And with WebAssembly the sizes of libraries will only get bigger, increasing the need for such a shared cache.

@nomeata Funny to see you on this thread, the world is small

@annevk annevk commented Mar 8, 2017

An opt-in privacy leak isn't a great feature to have.

@brillout brillout commented Mar 8, 2017

An opt-in privacy leak isn't a great feature to have.

How about opt-in + a resource is added to the shared cache only after the resource has been loaded by several domains?

@brillout brillout commented Mar 8, 2017

I don't think that really helps as the attacker can purchase two domains quite easily.

Yes, it can't be n domains where n is predefined. But making n probabilistic makes it considerably more difficult for an attack to be successful. (E.g. the last comment at w3c/webappsec#504 (comment).)

@strugee strugee commented Mar 10, 2017

CSP has (is getting?) a nonce-based approach. IIUC the concern with CSP is that an attacker would be able to inject a script that loaded an outdated/insecure library through the cache, thus bypassing controls based on origin. However, requiring nonces for SRI-based caching seems to solve this issue, as the attacker wouldn't know the nonce; it also creates a performance incentive for websites to move to nonces, which are more secure than domain whitelists for the same reason[1].
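A rough illustration of that interaction (the nonce and hash values are placeholders): an injected tag without the correct nonce wouldn't execute, so it couldn't probe the shared cache even if it guessed a popular hash.

    Content-Security-Policy: script-src 'nonce-d3b07384d113'

    <script nonce="d3b07384d113"
            src="https://cdn.example.com/jquery.min.js"
            integrity="sha384-oqVuAfXRKap7fdgc..."></script>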

I think it's possible that we could solve the privacy problem by requiring a certain number of domains to reference the script; it'd be really useful to have some metrics from browser telemetry here. For example, if we determined that enough users encounter a reference to a given jQuery version on more than 100 domains for that to be a workable minimum, we could load things from an SRI cache once they had been encountered in 100+ distinct top-level document domains (i.e. domains the user explicitly browsed to, not ones that were loaded in a frame or something). The idea being that, because of the top-level document requirement, the attacker would have to socially engineer the user into visiting 100 domains, which would be very, very difficult. However if telemetry told us that 100 is too high a number and it's actually more like 20 for a particular jQuery version, that'd be a different story.
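A minimal sketch of that kind of gating, with an entirely made-up threshold and function names:

    // Shared-cache promotion gated on distinct top-level origins (illustrative only).
    const MIN_TOP_LEVEL_ORIGINS = 100;

    // integrity value -> top-level origins it has been seen on
    const originsSeen = new Map<string, Set<string>>();

    function recordUse(integrity: string, topLevelOrigin: string): void {
      const seen = originsSeen.get(integrity) ?? new Set<string>();
      seen.add(topLevelOrigin);
      originsSeen.set(integrity, seen);
    }

    function eligibleForSharedCache(integrity: string): boolean {
      return (originsSeen.get(integrity)?.size ?? 0) >= MIN_TOP_LEVEL_ORIGINS;
    }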

[1]: consider e.g. being able to load an insecure Angular version from the Google CDN because the site loaded jQuery from the Google CDN

@zrm zrm commented Apr 5, 2017

For example, the user agent could request https://example.com/.well-known/sri-list, which would return a plain text file with a list of acceptable hashes, one per line.

For some domains that file could be too large and change too often. Consider Tumblr's image hosting (##.media.tumblr.com), where each of the domain names hosts billions of files and the list changes every second.

How about something similar to HTTP ETag but with a client-specified hash algorithm. If the hash is correct you only get a response affirming as much instead of the entire file, which the browser can cache. It doesn't save you the round trip but it saves you the data.

@BigBlueHat BigBlueHat commented Mar 22, 2018

How about something similar to HTTP ETag but with a client-specified hash algorithm. If the hash is correct you only get a response affirming as much instead of the entire file, which the browser can cache. It doesn't save you the round trip but it saves you the data.

RFC 3230: Instance Digests in HTTP defines a Digest header and a Want-Digest header that work exactly this way... or were meant to.

This would get the 304 Not Modified style responses, but it's still limited to a single URL check.
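The exchange being imagined might look roughly like this (the digest value is illustrative, and whether RFC 3230 digests can be combined with a 304-style response in exactly this way is part of what's being questioned here):

    GET /libs/jquery.min.js HTTP/1.1
    Host: cdn.example.com
    Want-Digest: sha-256

    HTTP/1.1 304 Not Modified
    Digest: sha-256=X48E9qOokqqrvdts8nOJRJN3OWDUoyWxBf7kbu9DBPE=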

Maybe it (or something like it) coupled with the immutable Cache-Control extension could be used to populate some amount of caching or "permanence," but the model is still about the "given name" of the object (its URL) and not about its intrinsic identification (its content hash).

Caching's one use case for these things, but the Web could also benefit from some "object permanence" where possible and appropriate.

@kevincox kevincox commented Mar 22, 2018

I don't see the benefit of Want-Digest. If the client has a whitelisted digest and the content backing it, why bother the server? There are three possible responses:

  1. 304 Not Modified: Use what you had in the cache.
  2. 200 + Contents that match digest. Redundant transfer of file.
  3. Other: Error.

This would wait around for a response that can only make the situation worse.

@ArneBab ArneBab commented Oct 5, 2018

However if telemetry told us that 100 is too high a number and it's actually more like 20 for a particular jQuery version, that'd be a different story.

Even if 100 is too high a number today, the load time advantages of using a popular version of the library could quickly push the usage of a specific version over that limit. Browser telemetry today might not be representative of the situation after shared caching has been rolled out.

@cben cben commented Mar 6, 2019

The discussion so far seems to assume JS libraries (e.g. jQuery) as the canonical use case.
I'd like to add web fonts as another use case of widely shared large subresources that could benefit from a cross-domain cache.

I'd think the security risks for fonts are milder, though the privacy implications might be similar. I'm talking about the font files themselves, not CSS — font CSS is small, and malicious CSS is dangerous.

Note that CSS does not yet support SRI at all on font file URLs: #40, w3c/webappsec#306
Note also that in practice optimized font delivery varies by browser; for example, Google Fonts doesn't want to support SRI: google/fonts#473. (This is not a blocker for hashing & sharing, just a tradeoff...)

@ArneBab ArneBab commented Mar 7, 2019

Couldn't you just embed the font as a data URI in the CSS? With shared caching that would be efficient.
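Something along these lines, with the base64 payload elided and the family name made up:

    @font-face {
      font-family: "Example Sans";
      src: url("data:font/woff2;base64,...") format("woff2");
    }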

@ArneBab ArneBab commented Mar 7, 2019

It helps on the second access. By externalizing the font loading to a self-contained CSS file with SRI-secured shared caching, the download could then be cached across multiple sites.

Yes, it would not be as good as specifying the integrity attribute directly on the font, but the same is true for images and other resources, so I don't see this as a blocker.
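So the sharing described here would hinge on the stylesheet itself, e.g. (URL and hash are placeholders):

    <link rel="stylesheet"
          href="https://fonts.example.com/example-sans.css"
          integrity="sha384-oqVuAfXRKap7fdgc..."
          crossorigin="anonymous">

Any site referencing the same stylesheet hash could then hit the shared cache, fonts included.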

@troglotit troglotit commented Oct 29, 2019

I had been thinking about shared caches for several months, googling proposals and not finding anything before this. I am extremely excited about shared caches and the opportunities they enable; for example, they would help TC39 identify more in-demand libraries to include in the standard library.

I have an idea about whitelisting hashes: cryptographic accumulators (https://en.wikipedia.org/wiki/Accumulator_(cryptography)). You could pass a single accumulator of all integrity hashes in the CSP header, saving bandwidth. I basically found it because I had an instinct that this should be possible, and got this answer: https://crypto.stackexchange.com/questions/22410/hash-which-can-be-used-to-verify-one-of-multiple-inputs, but I'm not sure it's 100% applicable.

@jeffkaufman jeffkaufman commented Nov 1, 2019

If you want to track me and you control both origins you want to track me from you can just use the same URL and you get a cookie which is better tracking and works today.

This was reasonable in 2016, but it's different now: browsers are partitioning caches (Safari has shipped it; Chrome and Firefox are in progress), browsers are reconsidering third-party cookies, and what we're talking about here could allow new ways of cross-site tracking.

@mozfreddyb mozfreddyb commented Nov 4, 2019

I'm afraid Jeff is right. He even wrote a good blog post summarizing the potential deprecation of shared caching.

While I don't know the timeline for this change, it seems rather unlikely that we can ever consider a shared cache 😕

@troglotit troglotit commented Nov 5, 2019

But what about a cache of artifacts that are explicitly declared "public artifacts" by the website owner?

@annevk annevk commented Nov 5, 2019

Website owners do not get to decide over end user privacy.

@troglotit troglotit commented Nov 5, 2019

In my opinion, they already do. All the Googles and Facebooks already have all the user data and sell it to third parties. They also have all the engineering power and a monopoly on user attention, so a shared cache wouldn't really benefit them. But a shared cache would enable small websites, indie game devs, and others to leverage caching and build more ambitious websites, which should benefit the end user. And to be fair, I have only a little idea of what the cache-hit rate would be.
