Proposal would allow pages to learn how the user is interacting with the site #50

Closed
pes10k opened this issue Nov 20, 2021 · 31 comments
Labels
privacy-needs-resolution Issue the Privacy Group has raised and looks for a response on.

Comments

@pes10k

pes10k commented Nov 20, 2021

This issue is part of the PING privacy review w3cping/privacy-request#61

The server (including third-parties) could observe the patch requests to learn how the user is interacting with the page. In certain cases (e.g., when JavaScript is blocked) this would provide a unique powerful capability to the server, since they otherwise wouldn't be able to learn what the user is reading on the page, or typing in a form, etc.

I appreciate that this is partially discussed in the Privacy Considerations section, but I think the analysis there is incorrect, or at least incomplete. Specifically, the text currently says:

One mitigation, which was originally introduced for reasons of networking efficiency so is likely to be implemented in practice, is to request additional, un-needed characters to dilute the ability to infer what content the user is viewing. Requesting characters which are statistically likely to occur may avoid a subsequent request.

In practice, though, this is unlikely to be a useful mitigation, at least in the way it's currently discussed. By definition, the client is likely to have already requested characters that are "statistically likely to occur" by the time it needs to fetch an uncommon glyph / patch, meaning that there will be few or no common characters left to request to obscure the uncommon, highly identifying characters.

I don't know what the best approach is here, but a partial solution here would be to define (or increase) the minimum patch size, reducing the granularity of the signal here.

@svgeesus
Contributor

The server (including third-parties) could observe the patch requests

How would third parties observe this, over HTTPS?

@pes10k
Author

pes10k commented Nov 23, 2021

In several ways. First, the font that's participating in IFT could be a third party to the page, no?

More generally, though, users might disable JS to prevent the kind of moment-to-moment interaction monitoring that JS enables. IFT would let servers regain (some of) the moment-to-moment, or keypress-by-keypress, monitoring the user was hoping to avoid by disabling JS.

Similarly, users of filter lists or similar tools (NoScript, etc.) might disable scripts but not fonts, in an effort to match user privacy goals. If a server is able to use IFT to learn about how an individual is interacting with a page, that increases the range of resources such tools need to restrict, and pushes users into a worse privacy/compatibility tradeoff.

I appreciate that some of this can be done by maliciously constructed CSS as well, by constructing selectors that perform different network requests when different page elements appear or when text fields contain certain characters. But since the target fonts here are enormous, with tens of thousands of glyphs (or maybe more?), I expect there are margins where these attacks would be infeasible in CSS (where you would have to enumerate every character targeted) but feasible with IFT (where only a single instruction would suffice).

@vlevantovsky
Contributor

I don't know what the best approach is here, but a partial solution here would be to define (or increase) the minimum patch size, reducing the granularity of the signal here.

We should probably consider the implications of possible approaches when different scripts are in use. The semantic meaning of a 'character' is very different across languages written in alphabetic, syllabic, or ideographic scripts, so the risks and mitigation approaches are likely to be very different and may not be easily resolved as part of the IFT Rec. Would it be sufficient to raise awareness and offer possible mitigation strategies as part of the privacy considerations section, but leave it for users to decide what to do? Until we know what fonts / scripts / languages are used as page content, it's hard to assess what the actual risks are.

We can certainly define a minimum patch size (and it is probably a good thing to do considering the overhead of asking for one), but would it really be a sufficient mitigation strategy? The same size may end up covering the full character set of an alphabet and yet not be enough for ideographic scripts.

@litherum
Contributor

define (or increase) the minimum patch size

This is an interesting idea.

We should probably consider the implications of possible approaches when different scripts are in use.

I think @vlevantovsky is correct here. For a language like English, requesting A-Z (26 distinct characters) doesn't tell you anything about the contents of the page. However, there are many sets of 26 distinct Chinese characters which would be extremely informative of the contents of the page. So, a minimum patch size, by itself, would probably be too simplistic.
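To make the contrast concrete, here is a minimal sketch in Python. It is not from the spec or any implementation: the page texts, the page names, and the helper names `codepoints` and `matching_pages` are invented for illustration, and it assumes the server simply checks which known pages could be rendered with the requested characters.

```python
from __future__ import annotations

import string


def codepoints(text: str) -> frozenset[str]:
    """Distinct non-whitespace characters a font would need to render `text`."""
    return frozenset(ch for ch in text.lower() if not ch.isspace())


def matching_pages(request: frozenset[str], pages: dict[str, str]) -> list[str]:
    """Names of pages that could be fully rendered with the requested characters."""
    return sorted(name for name, text in pages.items() if codepoints(text) <= request)


# Hypothetical pages; the texts are made up for this example.
english_pages = {"weather": "rain today", "finance": "stock prices fall"}
chinese_pages = {"weather": "今天下雨", "finance": "股票下跌"}

# An English client that asks for a-z reveals nothing: every page matches.
latin_request = frozenset(string.ascii_lowercase)
english_matches = matching_pages(latin_request, english_pages)

# A Chinese client that asks only for what it needs narrows the candidates.
chinese_matches = matching_pages(codepoints(chinese_pages["weather"]), chinese_pages)
```

Under these assumptions `english_matches` contains both pages, while `chinese_matches` contains only the weather page, which is the asymmetry described above.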

I think @pes10k is generally right here, though: we should be able to do better, and we should do so now, before anybody ships. It just requires some thought about the right mechanism.

@litherum added and then removed the "Range Request" label (Solution includes something specific to range-request method) on Dec 13, 2021
@svgeesus
Contributor

In several ways. First, the font that's participating in IFT could be a third party to the page, no?

No. In this scenario, the first party is the server with the actual content, and the second party is the (different domain) server hosting the fonts. The client connects to the font server over HTTPS. Third parties (proxies, other scripts on the same content page) are unable to observe this request.

@svgeesus
Contributor

More generally, though, users might disable JS to prevent the kind of moment-to-moment interaction monitoring that JS enables. IFT would let servers regain (some of) the moment-to-moment, or keypress-by-keypress, monitoring the user was hoping to avoid by disabling JS.

Because the limiting factor on slow networks is latency (round-trip time), not bandwidth, implementations need to keep the number of requests to a minimum. Thus, a per-keystroke request pattern would be extremely expensive, and browsers are unlikely to implement it in this manner.

@pes10k
Author

pes10k commented Feb 15, 2022

No. In this scenario, the first party is the server with the actual content, and the second party is the (different domain) server hosting the fonts. The client connects to the font server over HTTPS. Third parties (proxies, other scripts on the same content page) are unable to observe this request.

Sorry, I'm not sure I follow here. Google Fonts could be the party using IFT, no? Or is it only usable by fonts served by the same eTLD+1 as the top-level document?

Because the limiting factor on slow networks is latency (round-trip time), not bandwidth, implementations need to keep the number of requests to a minimum. Thus, a per-keystroke request pattern would be extremely expensive, and browsers are unlikely to implement it in this manner.

If browsers shouldn't implement it in this way, that seems like a very good thing to mention in the spec! :)

@svgeesus
Contributor

I think we have a terminology problem with the security and privacy use of the term "third party". If there is one client (first party) and one server (second party) then anyone else, presumed potentially malicious, is the "third party" in the sense of snooping proxies, person in the middle attacks, and so on.

That terminology becomes ambiguous where there is a second server (in this case the font server) and people start referring to it as the "third party", which carries along with it implications that are not intended.

As it is too late to encourage a new term like "external party", we are stuck with the existing definition of "third party", so I think it makes sense to continue to use the existing terminology and just consider each network transaction separately. So:

request 1: Client (first party) requests html page from server A (second party) over HTTPS. Third parties cannot look at this.
request 2: Client (first party) requests css from server B (second party) over HTTPS. Third parties, including server A, cannot look at this.
request 3: Client (first party) requests font from server C (second party) over HTTPS. Third parties, including servers A and B, cannot look at this.

@svgeesus
Contributor

@pes10k does my definition of "third party" help here?

@pes10k
Author

pes10k commented Jun 14, 2022

@svgeesus I believe I understand your comment / definitions correctly, but I don't see how that addresses the main concern in the issue.

@svgeesus
Contributor

Your original comment was:

The server (including third-parties) could observe the patch requests to learn how the user is interacting with the page

We have demonstrated that a third party cannot observe the patch requests.

The issue title is also odd:
"Proposal would allow pages to learn how the user is interacting with the site"
"pages" should probably be "font servers" there.

So this becomes

The font server could observe the patch requests to learn how the user is interacting with the page

Servers observing the requests made to those servers is kind of implicit in a client-server architecture, so the crux of the privacy issue is the claim that the font server can "learn how the user is interacting with the page"

This is already covered in the Privacy Considerations section:

However, for those languages, it is possible that individual requests might be analyzed by a rogue font server to obtain intelligence about the type of content which is being read. It is unclear how feasible this attack is, or the computational complexity required to exploit it, unless the characters being requested are very unusual.

Feedback, such as from the Chinese Web Interest Group, is that this is a theoretical possibility which does not seem feasible in practice without some sort of AI analysis.

I would like to test this, for example by having a set of subject-specific content pages and then having the server attempt to guess which page is being read. That requires the cooperation of native speakers in sourcing and analyzing such a corpus, however.

@pes10k
Author

pes10k commented Jun 16, 2022

I don't see why it would require AI or anything so significant. If a page wanted to learn how the user was interacting with it, it could construct the page so that uncommon font patches were needed to render different parts of the page. When the server saw the browser request that patch, the server would know the user was interacting with that part of the page. Possibly the Chinese Web IG folks were considering only situations where the pages were not specifically constructed for this purpose?
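A rough sketch of that server-side bookkeeping, in Python. Everything here is hypothetical: the `SECTION_BEACONS` table, the section names, and the function name are invented for illustration, and the sketch assumes the client requests a patch only at the moment the corresponding text needs to be rendered.

```python
# Hypothetical table built by a malicious page author: each section of the page
# contains one rarely used codepoint that appears nowhere else on the page.
SECTION_BEACONS: dict[int, str] = {
    0x1F9A4: "pricing",
    0x1F9AB: "cancel-account",
    0x1FAB6: "medical-form",
}


def sections_revealed(requested_codepoints: set[int]) -> set[str]:
    """Sections the server can infer were rendered, given one patch request."""
    return {
        SECTION_BEACONS[cp]
        for cp in requested_codepoints
        if cp in SECTION_BEACONS
    }


# A request containing an ordinary letter plus one beacon codepoint.
revealed = sections_revealed({0x41, 0x1F9AB})
```

No statistical analysis is involved: a dictionary lookup on the requested codepoints is enough once the page has been built for this purpose.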

This is already covered in the Privacy Considerations section:

Understood. This issue is me requesting that the proposal be updated / changed to prevent the attack, not merely mention its feasibility.

@svgeesus
Contributor

If a page wanted to learn how the user was interacting with it, it could construct the page so that uncommon font patches were needed to render different parts of the page. When the server saw the browser request that patch, the server would know the user was interacting with that part of the page.

Ah, I see. So, for example, rare or unusual emoji could be sprinkled through the sections of a page, as a beacon that the user is reading that section? Although that assumes a very on-demand, last-minute request for font updates, and not the browser optimizing ahead to be ready to render as soon as the content scrolls into view. Is that the sort of thing you were thinking of?

BTW that technique already seems doable just by sprinkling small images throughout a page, or using script to rewrite image urls based on content becoming visible.

@pes10k
Author

pes10k commented Jun 22, 2022

Is that the sort of thing you were thinking of?

Yes, exactly

BTW that technique already seems doable just by sprinkling small images throughout a page, or using script to rewrite image urls based on content becoming visible

Yep, I think you're right (and you could probably do the same thing with CSS too; I tried to highlight that above). This is one reason why very privacy-conscious users will disable (or tightly control) scripting and image fetches on pages. My goal with this issue is to prevent a situation where those same users also need to disable web fonts to maintain their current level of privacy.

@svgeesus
Contributor

This option is already addressed:

For especially privacy-sensitive contexts, options would include never downloading any webfonts (at the risk that some characters may be rendered incorrectly, or not at all), or always downloading all webfonts whether needed or not (ignoring unicode-range, and potentially downloading vast quantities of unused fonts each time the page is viewed).

I see that the Tor Browser has disabled downloadable fonts at the Safest security level since 9.0.6.

@pes10k
Author

pes10k commented Jun 23, 2022

Having the spec say "browsers that need to protect privacy shouldn't enable this feature" isn't addressing privacy issues with the spec, it's making them vendors' or users' problems. The purpose of this issue, and of privacy horizontal review in general, is to make it so new proposals are safe enough for privacy-sensitive users.

Judging from the above comments, it seems like at least one of the three editors of the proposal thinks this is a privacy threat worth addressing / mitigating. Is it the position of the WG that they do not think it's appropriate to address this issue, or are otherwise not interested in addressing it?

@svgeesus
Contributor

svgeesus commented Mar 8, 2023

Having the spec say "browsers that need to protect privacy shouldn't enable this feature" isn't addressing privacy issues with the spec,

It doesn't say that.

It says that for those few users for whom cast-iron privacy is the over-riding concern - greater than loss of functionality from scripting, or the complete absence of images, or mis-rendered illegible text due to the lack of a suitable font - then they can disable font download the same as they already disable all scripts, all image and video downloads, all stylesheets, and use techniques such as the Tor network to obscure their network location.

That isn't the only privacy-protecting mechanism, just the ultimate one.

However, there are many sets of 26 distinct Chinese characters which would be extremely informative of the contents of the page. So, a minimum patch size, by itself, would probably be too simplistic.

That was my understanding too, although we asked the Chinese Web IG and they did not think this was a significant risk. I guess we need a second opinion from people fluent in a Chinese language and aware of privacy attack methods.

@garretrieger
Contributor

I've run a simulation on the impact that adding random codepoints to the request has on the ability to match a request back to the page that generated it: https://github.com/w3c/PFE-analysis/blob/main/results/noise_simulations.md

As recommended in the conclusions, I'm planning to make some spec updates that require a minimum number of additional codepoints to be added (dependent on the script of the content).
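For readers who want a feel for the idea, here is a minimal Python sketch of padding a request with random codepoints. It is not the code used in the linked simulation: the function name `add_noise`, the `pool` of candidate codepoints, and the `extra` count are assumptions made for illustration.

```python
import random


def add_noise(needed: set[int], pool: range, extra: int, rng: random.Random) -> set[int]:
    """Return `needed` plus up to `extra` random codepoints drawn from `pool`.

    `pool` stands in for the codepoints of the content's script; the real
    minimum count per script is not modelled here.
    """
    candidates = [cp for cp in pool if cp not in needed]
    padding = rng.sample(candidates, min(extra, len(candidates)))
    return needed | set(padding)


# Example: a page needs two ideographs; five unrelated ones are mixed in.
needed = {0x4E2D, 0x6587}
pool = range(0x4E00, 0x4E00 + 1000)  # assumed slice of the CJK block
request = add_noise(needed, pool, extra=5, rng=random.Random(0))
```

The server then sees seven codepoints and cannot tell from the request alone which two the page actually used.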

@garretrieger
Contributor

garretrieger commented Jun 19, 2023

Alright, so the original thought was to add random codepoints to the request. Unfortunately, that completely kills any possibility of request caching. So we found a slightly different approach, where the set of codepoints that can be loaded is partitioned into groups of a minimum size. If a codepoint from a group needs to be loaded, then the whole group is always loaded. This has the same effect as adding random noise, in that it creates uncertainty about which specific codepoints were present on the page, but it is deterministic, so caching will still be possible.

I've updated the noise simulation doc with the results of also simulating the grouping approach and found it works as well as the random noise approach, but does cost a bit more in terms of the number of additional codepoints sent. Based on those results I went ahead and added the grouping approach to the specification (#148)
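The grouping idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the algorithm in the specification: the helper names, the choice to group by sorted codepoint order, and the handling of a short final group are all invented here.

```python
from __future__ import annotations


def build_groups(codepoints: list[int], min_size: int) -> list[frozenset[int]]:
    """Partition codepoints into consecutive groups of at least `min_size`."""
    ordered = sorted(set(codepoints))
    groups = [ordered[i:i + min_size] for i in range(0, len(ordered), min_size)]
    # Fold a short final group into its neighbour so every group meets the minimum.
    if len(groups) > 1 and len(groups[-1]) < min_size:
        tail = groups.pop()
        groups[-1].extend(tail)
    return [frozenset(group) for group in groups]


def expand_request(needed: set[int], groups: list[frozenset[int]]) -> frozenset[int]:
    """Request every group that contains at least one needed codepoint."""
    expanded: set[int] = set()
    for group in groups:
        if group & needed:
            expanded |= group
    return frozenset(expanded)


# Ten loadable codepoints, minimum group size of four.
groups = build_groups(list(range(0x4E00, 0x4E0A)), min_size=4)
request_a = expand_request({0x4E01}, groups)
request_b = expand_request({0x4E02, 0x4E03}, groups)
```

Here `request_a` and `request_b` are identical even though the two pages need different characters, and the same input always yields the same request, which is what keeps the responses cacheable.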

@svgeesus
Contributor

@pes10k it would be great if you could review Add codepoint groups and Appendix B: Codepoints Requiring Obfuscation to see how we have made content prediction by a malicious server statistically unlikely.

@pes10k
Author

pes10k commented Jun 20, 2023

@svgeesus @garretrieger I think this looks terrific.

One suggestion though (not intended to be blocking) is to also amend the spec to point to something external that could add codepoints to Appendix B, if in the future it's found that new fonts/codepoints need similar obscuring. That way the protection could be extended to new cases without needing to go through the entire W3C process again.

But either way, this is a terrific change, and I very much appreciate the group's work here!

@svgeesus
Contributor

@garretrieger I wonder if we can point to Unicode character classes, so that when those get expanded by the Unicode consortium our spec picks up that change?

@svgeesus
Contributor

The relevant character class is Unified_Ideograph. Here is the list, from the Unicode 15 properties list, although our spec should reference the latest (no version number) properties list.

This has 12 more characters than the list in the current spec, as the compatibility characters are also included:

# ================================================

3400..4DBF    ; Unified_Ideograph # Lo [6592] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DBF
4E00..9FFF    ; Unified_Ideograph # Lo [20992] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FFF
FA0E..FA0F    ; Unified_Ideograph # Lo   [2] CJK COMPATIBILITY IDEOGRAPH-FA0E..CJK COMPATIBILITY IDEOGRAPH-FA0F
FA11          ; Unified_Ideograph # Lo       CJK COMPATIBILITY IDEOGRAPH-FA11
FA13..FA14    ; Unified_Ideograph # Lo   [2] CJK COMPATIBILITY IDEOGRAPH-FA13..CJK COMPATIBILITY IDEOGRAPH-FA14
FA1F          ; Unified_Ideograph # Lo       CJK COMPATIBILITY IDEOGRAPH-FA1F
FA21          ; Unified_Ideograph # Lo       CJK COMPATIBILITY IDEOGRAPH-FA21
FA23..FA24    ; Unified_Ideograph # Lo   [2] CJK COMPATIBILITY IDEOGRAPH-FA23..CJK COMPATIBILITY IDEOGRAPH-FA24
FA27..FA29    ; Unified_Ideograph # Lo   [3] CJK COMPATIBILITY IDEOGRAPH-FA27..CJK COMPATIBILITY IDEOGRAPH-FA29
20000..2A6DF  ; Unified_Ideograph # Lo [42720] CJK UNIFIED IDEOGRAPH-20000..CJK UNIFIED IDEOGRAPH-2A6DF
2A700..2B739  ; Unified_Ideograph # Lo [4154] CJK UNIFIED IDEOGRAPH-2A700..CJK UNIFIED IDEOGRAPH-2B739
2B740..2B81D  ; Unified_Ideograph # Lo [222] CJK UNIFIED IDEOGRAPH-2B740..CJK UNIFIED IDEOGRAPH-2B81D
2B820..2CEA1  ; Unified_Ideograph # Lo [5762] CJK UNIFIED IDEOGRAPH-2B820..CJK UNIFIED IDEOGRAPH-2CEA1
2CEB0..2EBE0  ; Unified_Ideograph # Lo [7473] CJK UNIFIED IDEOGRAPH-2CEB0..CJK UNIFIED IDEOGRAPH-2EBE0
30000..3134A  ; Unified_Ideograph # Lo [4939] CJK UNIFIED IDEOGRAPH-30000..CJK UNIFIED IDEOGRAPH-3134A
31350..323AF  ; Unified_Ideograph # Lo [4192] CJK UNIFIED IDEOGRAPH-31350..CJK UNIFIED IDEOGRAPH-323AF

# Total code points: 97058
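As a sanity check on the listing, the ranges can be transcribed into a small Python table and summed. This is only a snapshot of the Unicode 15 data quoted above, not a substitute for referencing the property itself; the names `UNIFIED_IDEOGRAPH_RANGES` and `is_unified_ideograph` are invented for this sketch.

```python
# Inclusive (first, last) ranges copied from the Unified_Ideograph listing above.
UNIFIED_IDEOGRAPH_RANGES: list[tuple[int, int]] = [
    (0x3400, 0x4DBF),
    (0x4E00, 0x9FFF),
    (0xFA0E, 0xFA0F),
    (0xFA11, 0xFA11),
    (0xFA13, 0xFA14),
    (0xFA1F, 0xFA1F),
    (0xFA21, 0xFA21),
    (0xFA23, 0xFA24),
    (0xFA27, 0xFA29),
    (0x20000, 0x2A6DF),
    (0x2A700, 0x2B739),
    (0x2B740, 0x2B81D),
    (0x2B820, 0x2CEA1),
    (0x2CEB0, 0x2EBE0),
    (0x30000, 0x3134A),
    (0x31350, 0x323AF),
]


def is_unified_ideograph(codepoint: int) -> bool:
    """True if `codepoint` falls in one of the ranges listed above."""
    return any(first <= codepoint <= last for first, last in UNIFIED_IDEOGRAPH_RANGES)


# Number of codepoints covered by all ranges together.
total = sum(last - first + 1 for first, last in UNIFIED_IDEOGRAPH_RANGES)

# Codepoints in the compatibility block (U+FA00..U+FAFF) that carry the property.
compatibility_count = sum(
    last - first + 1 for first, last in UNIFIED_IDEOGRAPH_RANGES if 0xFA00 <= first <= 0xFAFF
)
```

Summing the ranges gives 97058, matching the total in the listing, and the compatibility block contributes the 12 extra characters mentioned above.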

@svgeesus
Contributor

svgeesus commented Jun 28, 2023

The reference would be Unicode® Standard Annex #44: Unicode Character Database https://www.unicode.org/reports/tr44/

Specref already knows about it; use [UAX44].

@svgeesus
Contributor

@pes10k would that change satisfy your remaining concern?

@pes10k
Author

pes10k commented Jun 29, 2023

@svgeesus I believe so, yes, though to be 100% honest, I'm not expert enough (or at all) in Unicode nitty-gritty, so I'm not certain.

My understanding is that the change would reference the kinds / categories of code points that are relevant here, instead of the actual ranges themselves? So that if additional code points are added to those categories in the future, they're automatically covered by the spec?

If I'm understanding correctly, that sounds terrific to me.

@garretrieger
Contributor

Referencing the Unified_Ideograph property instead of explicitly listing Unicode blocks sounds good to me; I'll update my PR.

@garretrieger
Contributor

garretrieger commented Jul 5, 2023

I've created a PR to change to use the unified_ideograph property (#150). Please take a look.

@svgeesus
Contributor

I've created a PR to change to use the unified_ideograph property (#150). Please take a look.

Perfect, merged.

@svgeesus
Contributor

If I'm understanding correctly, that sounds terrific to me.

Your understanding is correct.

@svgeesus
Contributor

Closed as original commenter is now fully satisfied.


5 participants