Partial freezing of the User-Agent string #467

Closed
yoavweiss opened this issue Jan 27, 2020 · 59 comments

Comments


@yoavweiss yoavweiss commented Jan 27, 2020

Goedenavond TAG!

This is not your typical spec review, and is highly related to #320. But, because @torgo asked nicely, I'm opening up a review for a specific application of UA-CH as a replacement for the User-Agent string.

We've had a lot of feedback on the intent, which resulted in changes to the API we want to ship. It also resulted in many open issues. Most either have pending PRs or will have ones shortly.

The latest summary is:

  • The User-Agent request header will be frozen other than its browser's significant version, and unified between different platforms and devices to reduce the amount of passive fingerprinting the browser is sending out by default.
  • navigator.userAgent and friends will be similarly frozen.
  • By default, the browser will send Sec-CH-UA and Sec-CH-UA-Mobile headers to enable most cases of content negotiation. As those headers are low-entropy, we can afford that trade-off, privacy-wise.
    • Sec-CH-UA is defined as a set, and likely to be GREASEd to avoid current abuse patterns of the User-Agent string.
  • Servers can opt in to request more information about the user agent using the Client Hints negotiation mechanisms: platform, architecture and model are what we currently have, but we're considering things like "input mode".
    • The opt-in mechanism to enable opt-in on the "very first view" is not yet ready. At the same time, the hints sent by default are likely to answer most HTML level content-adaptation use-cases.
  • An equivalent API will enable access to that information on the client. Access to low-entropy information will be synchronous, while access to high-entropy information will be through a Promise (to enable browsers to take their time when considering whether the site should really be granted access to potentially fingerprintable info).
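
A minimal server-side sketch of the negotiation summarised above, under stated assumptions: the high-entropy header names (`Sec-CH-UA-Platform`, `Sec-CH-UA-Arch`, `Sec-CH-UA-Model`) follow later UA-CH drafts rather than anything fixed in this thread, the split between default and opt-in hints mirrors the summary and not necessarily what ships, and the helper names are hypothetical.

```python
# Hints the summary says are sent by default (low entropy).
LOW_ENTROPY = ("Sec-CH-UA", "Sec-CH-UA-Mobile")
# Hints a server must opt in to (names are an assumption, see lead-in).
HIGH_ENTROPY = ("Sec-CH-UA-Platform", "Sec-CH-UA-Arch", "Sec-CH-UA-Model")


def accept_ch_header(wanted):
    """Build the Accept-CH response header value for the opt-in hints a server wants."""
    unknown = [h for h in wanted if h not in HIGH_ENTROPY]
    if unknown:
        raise ValueError(f"not an opt-in hint: {unknown}")
    return ", ".join(wanted)


def available_hints(request_headers):
    """Return the UA hints present on a request, matching header names case-insensitively."""
    lowered = {k.lower(): v for k, v in request_headers.items()}
    return {
        name: lowered[name.lower()]
        for name in LOW_ENTROPY + HIGH_ENTROPY
        if name.lower() in lowered
    }


# First request: only the default hints are present.
first_request = {"sec-ch-ua": '"Chrome"; v="70"', "sec-ch-ua-mobile": "?0"}
# The server's response asks for more on subsequent requests.
response_header = accept_ch_header(["Sec-CH-UA-Platform", "Sec-CH-UA-Model"])
```

The sketch also shows why the "very first view" is a gap: the first request carries only the default hints, and anything requested through `Accept-CH` can arrive on later requests at the earliest.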

Checkboxes:

  • Explainers: for the feature, as well as its infrastructure
  • Specifications: for the UA-CH feature, as well as its infrastructure.
  • Tests: WPT
  • Security and Privacy self-review: Security and privacy are covered for the feature as well as its infrastructure. As this is a privacy-positive feature, I don't think a separate review is necessary, but let me know if you think otherwise.
  • GitHub repo URL
  • Primary contacts:
    • Yoav Weiss (yoavweiss@), Google Chrome - editor/implementer
  • Organization driving the specification: Google Chrome
  • Key pieces of existing multi-stakeholder review or discussion of this specification: design-review#320
  • External status/issue trackers for this specification: Chrome status entry for UA-CH and for User-Agent freezing.

Further details:

  • I have reviewed the TAG's API Design Principles
  • Relevant time constraints or deadlines: hoping to start freezing in M83 (June) and unification in M85 (September)
  • The group where the work on this specification is being done: WICG
  • Major unresolved issues with or opposition to this specification:
  • This work is being funded by: Google Chrome

You should also know that...

[please tell us anything you think is relevant to this review]

We'd prefer the TAG provide feedback as (please delete all but the desired option):

🐛 open issues in our GitHub repo for each point of feedback

@torgo torgo self-assigned this Jan 27, 2020
@torgo torgo added this to the 2020-01-27-week milestone Jan 27, 2020
@dbaron dbaron self-assigned this Jan 27, 2020

@jwrosewell jwrosewell commented Jan 30, 2020

Dear Yoav,

Over the past two weeks we have sought evidence to justify this change. We have found none. None of the industry stakeholders we have spoken to were previously aware of the proposal. This means that the usual and necessary protocols of public review are lacking.

In its current form this could easily be interpreted as a partisan gerrymandering attempt by the incumbent dominant player in the field, to the disadvantage of other players.

In our conversations with other players, most recently at the Westminster Policy Forum, we found that they thought that this was all part of a cookie discussion. Solicitation of feedback from a wide global cross section of stakeholders via a neutral party is now required. The W3C comes to mind as fulfilling exactly that role.

This proposal is too radical and has too much potential for disruption to be pushed through quickly, even if you accept the privacy arguments, which we remain unconvinced by.

Considering purely the macro issues associated with good governance and controlled change we observe the following:

  1. The proposal has not been reviewed beyond a small group of dedicated, focused and elite engineers who in the main add – not take away – excellent features. This change impacts global industry sectors as diverse as publishing, marketing, advertising, technology, charities and more in yet undetermined ways. The risk is equivalent to the Y2K bug.

  2. Online platforms – and Google in particular – are the subject of a wide-ranging Competition and Markets Authority (CMA) review covering the subjects of this proposal. The full report is expected to be published on 2nd July 2020. An interim report was published on 18th December 2019. The interim report establishes a balance between the needs of individuals for privacy and for markets and technology to function efficiently. Its conclusions should inform this proposal.

  3. This proposal will disproportionately benefit Google, as in practice it will remove data for smaller platform operators and millions of others. Paragraph 60 of Appendix E of the CMA review states:

“Google is the platform with the largest dataset collected from its leading consumer-facing services such as YouTube, Google Maps, Gmail, Android, Google Chrome and from partner sites using Google pixel tags, analytical and advertising services. A Google internal document recognises this advantage saying that ‘Google has more data, of more types, from more sources than anyone else’.”

The appendix includes the following diagram to illustrate the point.

[Diagram from Appendix E of the CMA review; image not reproduced]

  4. The CMA are yet to comment on Google’s role in relation to influence over web standards via the Chromium project and other means. Microsoft’s decision to adopt Chromium, and the apparent decline of Firefox are likely to be topics they comment on in July 2020.

  5. We need more time to discuss the impact with our users. We believe this is true for others.

As just one example the AdCom specification needs to be updated. Only once this is done can all publishers, SSPs, exchanges and DSPs adopt the new schema. If any of these parties do not make the modifications all are disadvantaged. The change needs to be made in lockstep.

Many trade bodies and organisations are focused on the implications associated with the publicity concerning 3rd party cookies. They are only just becoming aware of this proposal. We are encouraging them to engage publicly but respect the demands on their time, limited resources and the sensitivities concerning the topic of privacy.

There are many more arguments concerning assumptions, insufficient evidence, implementation, control over privacy (who decides?), and the technical impacts of the proposal yet to be resolved.

In summary, this is a change that requires careful and widespread consideration, and a significant effort to socialise for it to be recognised by all as legitimately in the public interest. Without mature reflection and appropriate implementation delay it will be perceived as market manipulation by the incumbent player.

Regards,

James Rosewell - for self and 51Degrees

Member

@torgo torgo commented Feb 1, 2020

One specific concern I have about this proposal has to do with how "minority browsers" are impacted. Let's consider non-Chrome browsers that are also based on Chromium (as one aspect of browser diversity). It's not clear to me from reading the explainer how you intend non-Chrome browsers based on Chromium to make themselves known through Client Hints. If a hypothetical Chromium-based browser, let's call it Zamzung Zinternet, sends Sec-CH-UA: "Chrome"; v="70" (for example), that might match up with the Chromium engine they are shipping, but it won't line up with the feature set (since their feature set may not exactly match the Chromium engine number), and web site owners will lose all the analytics to understand which browsers their users are using. However, if they send Sec-CH-UA: "Zamzung Zinternet"; v="17.6", then it's very likely that many web sites will give their users a bad user experience, or flash a message up encouraging people to download Chrome.

You mentioned on this thread that "The UA-CH design is trying to tackle this by enabling browsers to define an easily parsable set of values, that will enable minority browsers to be visible, while claiming they are equivalent to other browsers"; however, I don't see that reflected in the explainer. Can you be more explicit about this?

Secondly, regarding analytics, have you validated this approach with your own Google Analytics team, who currently use the UA to extract this information?

Please note, the TAG's ethical web principles argue that there is an inherent value of having multiple browsers. We should not be introducing a change to the web platform that could result in making browser diversity less apparent / less measurable, as this could negatively impact browser diversity.

Author

@yoavweiss yoavweiss commented Feb 1, 2020

As @jyasskin pointed out, the examples there should clarify what we had in mind on that front

As for the Zamzung Zinternet case, I'd expect it to send out a set that looks something like Sec-CH-UA: "Chrome"; v="70", "Chromium"; v="70", "Zamzung Zinternet"; v="10".
That would enable sites that haven't bothered testing on it to consider it the same as they consider other Chromium browsers, enable sites that want to target it specifically, and will enable analytics (that are aware of it) to understand what specific browsers those users are coming from.
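
A minimal sketch of how a server might read such a set, assuming the `"brand"; v="version"` shape shown above. The regex is a deliberate simplification and not a full Structured Headers parser, and both function names are hypothetical.

```python
import re

# Matches one `"brand"; v="version"` item. Simplified: no escape handling,
# no other parameters, so treat it as illustrative only.
_ITEM = re.compile(r'"([^"]*)"\s*;\s*v="([^"]*)"')


def parse_sec_ch_ua(value):
    """Return {brand: version} for each item in a Sec-CH-UA header value."""
    return {brand: version for brand, version in _ITEM.findall(value)}


def treat_as_chromium(brands):
    """A site that never tested a given browser can key off the shared engine brand."""
    return "Chromium" in brands


header = '"Chrome"; v="70", "Chromium"; v="70", "Zamzung Zinternet"; v="10"'
brands = parse_sec_ch_ua(header)
```

With this shape, a site that only checks `treat_as_chromium(brands)` serves the same content to every Chromium-based browser, while an analytics tool that knows the brand can still read `brands.get("Zamzung Zinternet")`.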

Does that help alleviate your concerns?

Also, note that there's a discussion on maybe putting more emphasis on the engine vs. the UA brand: WICG/ua-client-hints#52

To me, the Zamzung Zinternet case sounds like a good example in which we should prefer the current spec, over switching over to sending only the engine by default.

> Please note, the TAG's ethical web principles argue that there is an inherent value of having multiple browsers. We should not be introducing a change to the web platform that could result in making browser diversity less apparent / less measurable, as this could negatively impact browser diversity.

Beyond the privacy benefits of this change, it has an explicit goal of discouraging unreliable UA sniffing, as well as problematic UA sniffing patterns such as allow and block lists. So its intent is to discourage patterns that harm browser diversity.


@mh0478025 mh0478025 commented Feb 2, 2020

Hi Yoav,

First and foremost, thank you for giving the opportunity to members of the community to engage in this discussion.

I'm very concerned regarding the reasoning behind this change: bits of entropy.

As a small ad network, we use IP and user agent data to combat ad fraud. These same bits of entropy are used when we detect ad fraud. This is virtually impossible to do if all we are getting is a rotating VPN-based IP address and "Chrome 74". At best, we have to wait for another request to get the rest of the UA data (significantly reducing our ad serving speed) or, in the worst case, we will "exceed the user's privacy budget" and be denied this information altogether.

Who decides how "the user agent can make reasonable decisions about when to honor requests for detailed user agent hints"? There is absolutely no doubt that your own properties will be ranked high in a hypothetical "trust/privacy" rating. There is nothing stopping you or your successors from abusing this power against smaller players like in the case with Yelp.

Because your organization has hundreds of millions of logged-in active users across your various web properties and devices (search, chrome, android, chrome os, youtube, gmail, pixel etc.), it is significantly easier for you to run ad fraud analysis and protect your own ad network, while the rest of us bite the dust.

On the github repo it states "Top-level sites a user visits frequently (or installs!) might get more granular data than cross-origin, nested sites, for example". What about smaller sites just starting out? With all the large players (that already have top-level sites a user visits frequently) remaining untouched, you are effectively crippling competition from smaller players.

The vast majority of internet users simply don't care about this change, and the handful that do are probably underestimating the anti-trust issues that it brings. One would have to be naive to think that there is absolutely no conflict of interest when a company that collects the most data on earth is limiting what other, less frequently visited sites are allowed to see.

Reducing bits of entropy is simply not a good enough reason to proceed with the change. I adore chrome and personally use it every day, however, I'd like to point out to the community that this change is not in everyone's best interest.

TL;DR: We need the full OS and browser version to survive as a small ad network. And more importantly, we need this data as part of every first HTTP request, without being discriminated against for being a less frequently visited / smaller website.

Regards,
Andy

Member

@torgo torgo commented Feb 2, 2020

@yoavweiss @jyasskin

> Sec-CH-UA: "Chrome"; v="70", "Chromium"; v="70", "Zamzung Zinternet"; v="10"
> Does that help alleviate your concerns?

Sort of...? But couldn't this just revert over time to being just as messy as the current UA String?

And yes, from the PoV of non-dominant browsers, who often need to maintain support for their development by citing usage numbers, I don't think it would be a good thing to suppress the browser name.

I do not think you can engineer a system to 100% eradicate browser sniffing and targeting. Some of this will always have to be tackled through best practice sharing and community pressure.

Author

@yoavweiss yoavweiss commented Feb 2, 2020

> But couldn't this just revert over time to being just as messy as the current UA String?

It could. I'm hoping a decent application of GREASE can help prevent it.
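
A rough sketch of what GREASE could look like on the browser side, purely as an assumption about one possible application: add a made-up brand and randomise the order, so that sites cannot rely on a fixed list or position. The function names, the fake brand and its version are hypothetical, and nothing in this thread fixes the actual scheme.

```python
import random


def grease_brands(brands, rng, fake_brand="NotBrowser", fake_version="99"):
    """Return a shuffled copy of the brand list with one made-up entry added."""
    greased = list(brands) + [(fake_brand, fake_version)]
    rng.shuffle(greased)
    return greased


def serialize(brands):
    """Serialize (brand, version) pairs in the `"brand"; v="version"` shape."""
    return ", ".join(f'"{name}"; v="{version}"' for name, version in brands)


real = [("Chrome", "70"), ("Chromium", "70")]
# A seeded generator keeps the sketch reproducible; a browser would not seed it.
header = serialize(grease_brands(real, random.Random(0)))
```

A parser that tolerates unknown entries and does not depend on ordering keeps working with such a header; one that matches on exact strings or positions does not.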

> I do not think you can engineer a system to 100% eradicate browser sniffing and targeting

I agree. But I want us to try and disincentivize negative behavior related to that (e.g. block and allow lists)


@kiwibrowser kiwibrowser commented Feb 4, 2020

Hi @yoavweiss

"I want us to try and disincentivize negative behavior related to that"

Did you consider removing the installation and Google-specific tracking headers (x-client-data) that Google Chrome is sending to Google properties?

Example:
https://www.youtube.com - in network headers, look for x-client-data

Now, go to https://ad.doubleclick.net/abc - and your browser also sends this magic x-client-data.

It's a unique ID to track a specific Chrome instance across all Google properties.

Really curious about your opinion, especially given that the GDPR explicitly forbids such tracking.
Moreover, it doesn't make sense to anonymise the user-agent if you have such a backdoor.


@gjsman gjsman commented Feb 4, 2020

From HN related to @kiwibrowser 's post: https://www.google.com/chrome/privacy/whitepaper.html

"We want to build features that users want, so a subset of users may get a sneak peek at new functionality being tested before it’s launched to the world at large. A list of field trials that are currently active on your installation of Chrome will be included in all requests sent to Google. This Chrome-Variations header (X-Client-Data) will not contain any personally identifiable information, and will only describe the state of the installation of Chrome itself, including active variations, as well as server-side experiments that may affect the installation."

"The variations active for a given installation are determined by a seed number which is randomly selected on first run. If usage statistics and crash reports are disabled, this number is chosen between 0 and 7999 (13 bits of entropy). If you would like to reset your variations seed, run Chrome with the command line flag “--reset-variation-state”. Experiments may be further limited by country (determined by your IP address), operating system, Chrome version and other parameters."
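
The "13 bits" figure in the quoted passage can be checked with a short calculation, assuming the seed is drawn uniformly from the 8000 values 0 to 7999 (the helper name is hypothetical):

```python
import math


def bits_of_entropy(n_values):
    """Bits needed to distinguish n equally likely values."""
    return math.log2(n_values)


# Seeds 0..7999 inclusive give 8000 possible values.
seed_bits = bits_of_entropy(8000)
rounded_up = math.ceil(seed_bits)
```

`seed_bits` comes out just under 13, so the whitepaper's figure is the rounded-up value. This covers the seed alone and says nothing about what the header reveals when combined with other signals.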


@chris-griffin chris-griffin commented Feb 4, 2020

> If usage statistics and crash reports are disabled, this number is chosen between 0 and 7999 (13 bits of entropy)

This is a misdirect. First, according to the same cited whitepaper, Usage statistics are "enabled by default for Chrome installations of version 54 or later". This means that nearly all Chrome installs will have a very high entropy.

And even if a user disables usage statistics, a low entropy seed will very likely still yield a high entropy string since it includes "the state of the installation of Chrome itself, including active variations, as well as server-side experiments that may affect the installation."

If you want to use this argument, the equivalent would be to allow users to disable their User Agent, but to send it by default. This seems like a much more sane approach.


@jwrosewell jwrosewell commented Feb 4, 2020

This thread is already highlighting a set of issues which can be summarised as:

  1. Consultation – ensure a multitude of stakeholder needs from "minority browser" vendors, fraud, advertising, publishing, technology and marketing industry sectors – among others - are all considered.

  2. Design – ensure the perceived problems with the existing User-Agent field value are not recreated in the replacement. Turn the "what we had in mind" musings of exceptional engineers into fully defined engineering specifications suitable for all to work with.

  3. Breaking the web – a full study of how the User-Agent [and other similar fields and practices] is used in practice – not in theory or based on individual bias – to inform an impact assessment. Understand the migration approach for each scenario.

There is no burning problem or "innovation impairment" to justify incomplete engineering, poor governance or risky implementation. National regulators do not require this change – and in fact are actively balancing the needs of privacy, business and fair competition.


@pluma pluma commented Feb 4, 2020

Just as a reminder for people who think an "anonymous" ID on a request is fine because the GDPR is only about personally identifiable information: if you attach that random ID to the (most likely fairly unique) configuration of the individual installation and then apply the resulting ID to various requests made by the user, the "anonymous" ID becomes "pseudonymous" as you can infer the user's identity from it, making the resulting data personally-identifiable.

Additionally, regardless of the legality and compliance of this, it is clearly a violation of the spirit of information scarcity present in the GDPR and shows a complete disregard for the idea of personal ownership of data and the right to privacy by default.


@fredgrott fredgrott commented Feb 4, 2020

Google, I think you can do better than this, and it might harm certain US state AG meetings about Google anti-trust issues. Please rethink!


@joneslloyd joneslloyd commented Feb 4, 2020

This is scary stuff for those living in oppressive countries who aren’t tech-savvy enough to use a proxy and change these settings.


@fightborn fightborn commented Feb 4, 2020

I'm just here to read advanced excuses from people thinking that other smart guys reporting the issue here are idiots. You do realize that Chrome is becoming a worse plague than Internet Explorer ever was?


@markentingh markentingh commented Feb 4, 2020

You could easily install a Chrome extension for modifying request headers and block the x-client-data header.


@AdamMurray AdamMurray commented Feb 4, 2020

@markentingh Privacy shouldn't be reserved for those with the knowledge to modify request headers.

@bhartvigsen

This comment was marked as disruptive content.


@immanuelfodor immanuelfodor commented Feb 8, 2020

Sites sending your browser special content can be a desired feature sometimes, for example, Emby transcodes the AC3 audio for Firefox which is ultimately broken when streamed if you fake the UA string to appear as Chrome (personal experience). If any of the above proposals can solve such cases with e.g. an opt-in feature to "unfreeze" the UA string for better compatibility on a user-edited whitelist of sites that are not yet ported to the new method(s), I'm all in for better privacy.


@mgol mgol commented Feb 8, 2020

@yoavweiss

> GREASEing by adding non-existent browser names would avoid blocking of unknown browsers (which as @mcatanzaro indicated, is a major problem today).

It's true this would avoid blocking unknown browsers. What browsers are unknown, though, depends on the tools you use. As @ocram rightly noticed, there are incentives to detect even minor browsers (e.g. for analytics purposes) which means tools are created that make such detection possible and these same tools are then used to apply some fixes, enable some features, etc. only to some browsers.

When you use a browser-detecting library that knows about minor browsers, this library will ignore tokens like "NotBrowser" but it will take tokens like "Vivaldi" into account.
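
A small sketch of that behaviour, with a hypothetical detection library whose known-brand list and lookup order are invented for illustration:

```python
# Hypothetical library data: brands it knows about, most specific first.
KNOWN_BRANDS = ("Vivaldi", "Chrome", "Chromium")


def detect_brand(brands):
    """Return the first known brand present in the parsed set, else None."""
    for name in KNOWN_BRANDS:
        if name in brands:
            return name
    return None


# A GREASEd entry is simply skipped, because the library has never heard of it.
greased = {"Chrome": "80", "Chromium": "80", "NotBrowser": "99"}
# A real minority brand is picked up, and can then be singled out.
vivaldi = {"Chrome": "80", "Chromium": "80", "Vivaldi": "2.10"}
```

Here `detect_brand(greased)` still reports Chrome while `detect_brand(vivaldi)` reports Vivaldi, so made-up tokens do nothing to stop a site from treating the real minority brand differently.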

Vivaldi has recently stopped using its own token in the user agent string for this very reason. They wrote a blog post showing a few examples of how Google sites were serving a degraded experience to Vivaldi when the Vivaldi token was present in the UA. One example is google.com where input text appears outside of the input frame. I repeated the test locally and the site was broken with the Vivaldi/2.10.1745.27 token but it worked correctly when I changed it to Vivald/2.10.1745.27 (i.e. I just removed i). It's clear, then, that it's not just that any extra UA suffix would break the site; it was specifically singling out Vivaldi.

These issues wouldn't exist if sites were targeting engines by default instead of browser names when applying changes related to engines' APIs. Since even Google often does it by browser, it's hard to expect companies with less cash to spend time on making sure they're not singling out minority browsers.

> One can also imagine sending invalid headers that would also be correctly parsed by valid Structured Headers parsers, to avoid error-prone regex based "parsing". (e.g. "Chrome"; v="73", "GibberishFirefox 66 dfgdfg")

This won't solve the case I described above.

> One could imagine a distant future where browsers would keep a similar list on which they perform targeted GREASEing (somewhat similar to what @mcatanzaro suggested) in order to dissuade known compatibility offenders from their practices. Enabling GREASEing in the first place seems like a good first step in that direction.

If such a strategy was applied against known offenders it'd have to be done carefully, as I can't imagine browser makers willfully breaking existing sites of these offenders. Also, note that this offenders list would have to include Google today, so Chromium would have to fight against the company that governs the project. It's hard to imagine it happening.


@ocram ocram commented Feb 9, 2020

So we seem to agree that preventing problems with unexpected entries is the only thing that GREASE solves.

Therefore, the absurd accumulation of complexity and size, and the lies about browser identities, are things that will quickly happen again, as described in detail above – because the proposal has nothing in it to stop this and the incentives all remain the same.

Finally, as @mgol said, I can’t see popular browsers start to intentionally lie about their identity for the greater good, especially not Google Chrome lying to Google Search, Google Docs, YouTube, etc. If popular browsers wanted to lie about their identity for the greater good, they could have been doing this already for a long time.


@Steve51D Steve51D commented Feb 9, 2020

If GREASE creates more problems than it solves then you are left with the question of what to do about the underlying problem it is trying to solve. This primarily seems to be an issue for browser developers, some of whom advocate removing the User-Agent or Sec-CH-UA entirely. There are also privacy campaigners who want it removed as well.

There are several issues down that road but I think that one of the most critical is that it puts far more power into the hands of the dominant browser. I.e. Google.

The fact that Google themselves have added additional tracking into Chrome to go beyond what User-Agent allows shows the value of this kind of information.
This x-client-data header is only sent to Google websites so only Google have access to that data.

If the browser were not identifiable in the request then it would just mean that Google would be the only ones with a picture of the browser landscape rather than one of many because they are in the unique position of having a huge share of the browser market as well as enough big website properties to funnel data through.

I think that browser developers are just going to have to continue dealing with this problem of incompatible websites as they come up. I'm sure that's a very frustrating position to be in but the alternatives seem far worse for everyone else.


@ghost ghost commented Feb 10, 2020

You should not have to be tech-savvy to prevent tracking or data gathering. Companies should not be able to gather data or track you without CLEAR and EXPLICIT opt-in, and without provision to ensure that you can at any time request the removal of all the data they hold on you. It is unfortunate that the approach of companies is that once you allow them to gather data, the gathered data belongs to them. At best, companies have permitted use of the user's data while the user is opted in; once the opt-in is cancelled, so is the permission to use it. If companies want to keep that data when a user opts out, there should be an explicit request from the company to the user asking whether they are permitted to keep using already captured data.

All of this should be implemented with a non-technical person as the intended user, with information displayed simply and correctly.


@scottlow scottlow commented Feb 13, 2020

As @yoavweiss mentioned above, issue #52 in the UA Client Hints repository attempts to summarize much of the ongoing debate here, particularly around GREASE.

My concern with browsers pretending to be other browsers some fixed percentage of the time is twofold:

  • First, as @torgo mentioned above, this will make it hard for minority browsers to have their share accurately tracked since it will be unclear how much share should be attributed to actual browser usage versus GREASEd values coming from more popular browsers.
  • Second, it will lead to "by design" compatibility issues. When we moved to Chromium, a substantial portion of the UA-related bugs we received from users were from them reporting security emails that stated "We noticed you just logged in from a new Chrome browser" rather than "Edge". This was the result of sites not yet detecting our new "Edg" token as "Edge". These types of issues (i.e. ones where sites legitimately need a stable, accurate per-browser identifier) will become more prevalent by design if all browsers start pretending to be other browsers some amount of the time.
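
The "Edg" mis-detection in the second point can be illustrated with a sketch of substring-based sniffing. The function names are hypothetical and the User-Agent string is an illustrative example of the Chromium-based Edge format, with version numbers chosen only for the example.

```python
def detect_browser(user_agent):
    """Check the more specific token first: Chromium-based Edge also carries "Chrome/"."""
    if "Edg/" in user_agent:
        return "Edge"
    if "Chrome/" in user_agent:
        return "Chrome"
    return "Unknown"


def detect_browser_outdated(user_agent):
    """A site that has not learned about the "Edg" token yet."""
    return "Chrome" if "Chrome/" in user_agent else "Unknown"


# Illustrative UA string; the exact version numbers are an assumption.
EDGE_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 Edg/80.0.361.48"
)
```

The outdated detector is the one that produces "you just logged in from a new Chrome browser" for an Edge user; if browsers claimed other identities some fraction of the time, even the updated detector would give such answers.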

While we certainly ran into a few sites that blocked the new Edge based on the fact that it had an unknown "Edg" token (web.whatsapp.com was one example), the far more common cause of breakage that we encountered was from sites that started detecting our "Edg" token as a unique browser, but failed to update their per-browser allow lists to include the new Edge. As @mgol mentioned above:

> These issues wouldn't exist if sites were targeting engines by default instead of browser names

While I admit that exposing engine by default and letting sites opt into receiving brand information using Accept-CH: UA does not address the issue of allow/block lists being created (at least not without some discouragement from opting into additional client hints via something like Privacy Budget), my hypothesis is that it would encourage site developers to build allow lists off of well-defined equivalence classes, thus reducing the number of compatibility issues caused by allow lists constructed from per-browser identifiers.


@ocram ocram commented Feb 13, 2020

Do you think that with engines instead of browser brands, website operators will suddenly all become responsible citizens of the web? That is, engines will not just be the new browser brands when it comes to browser identification?

If my browser has CustomEngine, but sites restrict certain features or serve a degraded experience due to that information, my browser will either send CustomEngine (Chromium) or "CustomEngine", "Chromium" or "Chromium"; version=80, "CustomEngine"; version=28.

Again, website operators may exclusively rely on true equivalence classes and everything may be great. But why should anything be different with Sec-CH-UA and Sec-CH-UA-Engine instead of User-Agent? With regard to the incentives and underlying problems, nothing has changed.

By the way, as for randomly returning different values (e.g. in 25 % of all cases), I think it’s obvious that this won’t work for the use cases that make User-Agent something that people rely on today. It’s the same situation as with including fake brands or dropping oneself from the set.


@scottlow scottlow commented Feb 13, 2020

> Do you think that with engines instead of browser brands, website operators will suddenly all become responsible citizens of the web?

Nope. I will readily admit that both the Sec-CH-UA-Engine and Sec-CH-UA proposals suffer from the fact that there are no technical provisions in place to prevent allow/block lists from being created as they can be from the User-Agent today.

My main point is that exposing both brand and engine in a single hint doesn't encourage developers to change their behavior in any way for the better of compatibility. We can provide guidance encouraging them to target true equivalence classes by default; however, providing both brand and engine in a single hint feels an awful lot like providing per-browser identifiers in the User-Agent header today, but recommending that feature detection be used instead.

By only exposing Sec-CH-UA-Engine by default, we are at least adding a hurdle (in the form of having to opt in to receiving brand information) between sites and per-browser identifiers.


@ocram ocram commented Feb 13, 2020

I agree that the separation of brand and engine is reasonable. It’s just that the hope for better usage by the community in the future is not a strong argument, and responsible developers could already do today what they should do in the future, i.e. rely on engines instead of brands where possible.

Turning passive fingerprinting into (detectable) active fingerprinting and offering information selectively is good as well. While most sites will request similar information and there won’t be much variation that could allow you to detect bad actors, this is still the strongest point of the proposal, I’d say.

But I really don’t think it will change anything about the complexity and length of strings (or sets), so maybe we should not put too much hope into that and avoid making the proposal more complex to make those dreams possible. It will not work.

All in all, it doesn’t appear to be a strong case for this new proposal replacing the current string where both have similar power and will suffer from similar problems. In the end, you will either have to support frozen old values forever or ultimately break backward compatibility.

@ronancremin

@ronancremin ronancremin commented Feb 14, 2020

I have some comments on the proposal in a few different areas. Some of these points have been made already but I nonetheless want to restate them.

Lack of industry consultation
The HTTP protocol has become deeply embedded globally over its lifetime. As envisaged by the authors of the HTTP protocol, the User-Agent string has been used in the ensuing decades for “statistical purposes, the tracing of protocol violations, and automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations”.

The User-Agent header has been part of the web since its inception. It has been a stable element of the HTTP protocol through all its versions from HTTP/1.0 in 1996 all the way to HTTP/2 in 2015 and thus has inevitably come to be relied upon, even if particular use cases are not apparent, or have been forgotten about, or its practitioners are not participants in standards groups. The User-Agent string is also likely being used in new ways not contemplated by the original authors of the specification.

There was a salutary example of the longevity of standards in a recent Tweet from the author of Envoy, a web proxy server. He has been forced to add elements of HTTP 1.0 to ensure it works in the real world, despite Envoy’s development starting 23 years after HTTP/1.1 was ratified and deliberately opting not to support HTTP 1.0. This is the reality of the web—legacy is forever.

Despite this reality, there is no public evidence of any attempt to consult with industry groups to understand the breadth and severity of the impact of this proposed change to HTTP. It is a testament to its original design that the HTTP protocol has endured so well despite enormous changes in the internet landscape. Such designs should not be changed lightly.

Issues with the stated aim of the proposal
The problem with the User-Agent string and the reason to propose Client Hints, per the explainer, is that “there's a lot of entropy wrapped up in the UA string” and that “this makes it an important part of fingerprinting schemes of all sorts.”

In subsequent discussions in the HTTP WG the privacy issues focused on passive fingerprinting, where the User-Agent string could potentially be used by entities for tracking users without their knowledge.

What is missing from the discussion is any concrete evidence of the extent or severity of this supposed tracking. Making changes to an open standard that has been in place for over 24 years should require a careful and transparent weighing of the benefits and costs of doing so, not the opinion of some individuals. In this case the benefits are unclear and the central argument is disputed by experts in the field. The costs on the other hand are significant. The burden of proof for making the case that this truly is a problem worth fixing clearly falls on the proposer of the change.

If active tracking is the main issue that this proposal seeks to address there are far richer sources of entropy than the User-Agent string. Google themselves have published a paper on a canvas-based tracking technique that can uniquely identify 52M client types with 100% accuracy. Audio fingerprinting, time skew fingerprinting and font-list fingerprinting can be combined to give very high entropy tracking.

Timeline of change
This proposed change is proceeding more quickly than the industry can keep up with. In January 2020 alone there were some important changes made to the proposal (e.g. sending the mobileness hint by default). It is difficult to fully consider the proposal and understand its impact until it is stable for a while. The community needs time to 1) notice the proposal and 2) consider its impact. There has not been enough time.

Move fast and break things is not the correct approach for making changes to an open standard.

Narrow review group
It’s difficult to be objective about this but the group discussing this proposal feels narrow and mostly comes from the web browser constituency, where the change would initially be enacted, but the impact not necessarily felt. It would be good to see more people from the following constituencies in the discussion:

  • advertisers
  • web analytics
  • HTTP servers
  • load balancers
  • CDNs
  • web caches

All of these constituencies make use of the User-Agent string and must be involved in the discussion for a meaningful consensus to be reached.

Obviously you can’t force people to contribute, but my sense is that this proposal is not widely known about amongst these impacted parties.

Diversity of web monetisation
Ads are the micropayments system of the web. Nobody likes them but they serve a crucial role in the web ecosystem.

The proposed change hurts web diversity by disproportionately harming smaller advertising networks that use the OpenRTB protocol. This essentially means most networks outside of Google and Facebook. Why? The User-Agent string is part of the OpenRTB BidRequest object where it is used to help inform bidding decisions, ad formatting and targeting. Why does it hurt Google less? Because Google is able to maintain a richer set of user data across its dominant web properties (90% market share in search), Chrome browser (69% market share) and Android operating system (74% market share).

The web needs diversity of monetisation just as much as it needs diversity in browsers.

Dismissive tone in discussions
Some of the commentary from the proposers has been dismissive in nature e.g. the following comments on the Intent to Deprecate and Freeze: The User-Agent string post in response to a set of questions:

  • “I’d expect analytics tools to adapt to this change.”
  • “CDNs will have to adapt as well.”

Entire constituencies of the web should not be dismissed out of hand. This tone has no place in standards setting.

Entangling Chrome releases with an open standards process
In the review request, Chrome release dates are mentioned. It doesn’t feel appropriate to link a commercial organisation’s internal dates to a proposed standard. There are mentions of shipping code and the Chrome intent.

Overstated support
This point has been made by others here but it is worth restating. It feels like there is an attempt to make this proposal sound as if it has broader support than it really does, in particular on the Chrome intent, linked explicitly by the requester.

Unresolved issues
The review states “Major unresolved issues with or opposition to this specification:” i.e. no unresolved issues or opposition. This is true only if you consider unilaterally closed issues to be truly closed. Here are a couple of issues that were closed rather abruptly, and coinciding with a Chrome intent.

Some closed HTTPWG issues:

@jwrosewell

@jwrosewell jwrosewell commented Feb 14, 2020

Significant time is now being spent by web engineers around the world "second guessing" how Google will proceed. In the interests of good governance and engineering @yoavweiss should now close this issue and the associated Intent at Chromium.org stating it will not be pursued. As many have pointed out the following is needed before it is ready to be debated by the W3C and TAG.

  1. Define and agree the objectives.
  2. Gather evidence to support the objectives - i.e. what is the impact on privacy in practice? How does this compare to other privacy weaknesses?
  3. Produce a robust design to accepted standards compatible with dependencies and their timeframes i.e. privacy sandbox.
  4. Articulate alternative designs that were considered and why they were rejected.
  5. Understand the current use of the User-Agent and first request optimisations in practice.
  6. Determine the effort impact of changing. The OpenRTB example provided by @ronancremin will consume hundreds of man-years' worth of engineering time across the AdTech industry alone. Any player that doesn't adopt the change will disadvantage the entire ecosystem. The vast majority of engineers are employed by Google's competitors and could otherwise be performing more value-adding work. Google has no such constraints. The disruption benefits Google more than anyone else.

Related to this issue we have written to the Competition and Markets Authority concerning Google's control over Chromium and web standards in general. A copy of that letter is available on our website.

@yoavweiss
Author

@yoavweiss yoavweiss commented Feb 17, 2020

Lack of industry consultation
The HTTP protocol has become deeply embedded globally over its lifetime. As envisaged by the authors of the HTTP protocol, the User-Agent string has been used in the ensuing decades for “statistical purposes, the tracing of protocol violations, and automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations”.

Indeed. Those are all use-cases that we intend to maintain.

The User-Agent header has been part of the web since its inception. It has been a stable element of the HTTP protocol through all its versions from HTTP/1.0 in 1996 all the way to HTTP/2 in 2015 and thus has inevitably come to be relied upon, even if particular use cases are not apparent, or have been forgotten about, or its practitioners are not participants in standards groups. The User-Agent string is also likely being used in new ways not contemplated by the original authors of the specification.

There was a salutary example of the longevity of standards in a recent Tweet from the author of Envoy, a web proxy server. He has been forced to add elements of HTTP 1.0 to ensure it works in the real world, despite Envoy’s development starting 23 years after HTTP/1.1 was ratified and deliberately opting not to support HTTP 1.0. This is the reality of the web—legacy is forever.

Despite this reality, there is no public evidence of any attempt to consult with industry groups to understand the breadth and severity of the impact of this proposed change to HTTP. It is a testament to its original design that the HTTP protocol has endured so well despite enormous changes in the internet landscape. Such designs should not be changed lightly.

The Client Hints infrastructure was thoroughly discussed at the IETF's HTTPWG, as well as its specific application as a replacement to the User Agent string.

Issues with the stated aim of the proposal
The problem with the User-Agent string and the reason to propose Client Hints, per the explainer, is that “there's a lot of entropy wrapped up in the UA string” and that “this makes it an important part of fingerprinting schemes of all sorts.”

In subsequent discussions in the HTTP WG the privacy issues focused on passive fingerprinting, where the User-Agent string could potentially be used by entities for tracking users without their knowledge.

What is missing from the discussion is any concrete evidence of the extent or severity of this supposed tracking. Making changes to an open standard that has been in place for over 24 years should require a careful and transparent weighing of the benefits and costs of doing so, not the opinion of some individuals. In this case the benefits are unclear and the central argument is disputed by experts in the field. The costs on the other hand are significant. The burden of proof for making the case that this truly is a problem worth fixing clearly falls on the proposer of the change.

There's a lot of independent research on the subject. Panopticlick is one from the EFF.

If active tracking is the main issue that this proposal seeks to address there are far richer sources of entropy than the User-Agent string. Google themselves have published a paper on a canvas-based tracking technique that can uniquely identify 52M client types with 100% accuracy. Audio fingerprinting, time skew fingerprinting and font-list fingerprinting can be combined to give very high entropy tracking.

I'm afraid there's been some confusion. This proposal tries to address passive fingerprinting, by turning it into active fingerprinting that the browser can then keep track of.

Timeline of change
This proposed change is proceeding more quickly than the industry can keep up with. In January 2020 alone there were some important changes made to the proposal (e.g. sending the mobileness hint by default). It is difficult to fully consider the proposal and understand its impact until it is stable for a while. The community needs time to 1) notice the proposal and 2) consider its impact. There has not been enough time.

Move fast and break things is not the correct approach for making changes to an open standard.

Regarding timelines, I updated the intent thread.

Narrow review group
It’s difficult to be objective about this but the group discussing this proposal feels narrow and mostly comes from the web browser constituency, where the change would initially be enacted, but the impact not necessarily felt. It would be good to see more people from the following constituencies in the discussion:

  • advertisers
  • web analytics
  • HTTP servers
  • load balancers
  • CDNs
  • web caches

The latter 4 are active at the IETF and at the HTTPWG. We've also received a lot of feedback from others on the UA-CH repo.

All of these constituencies make use of the User-Agent string and must be involved in the discussion for a meaningful consensus to be reached.

Obviously you can’t force people to contribute, but my sense is that this proposal is not widely known about amongst these impacted parties.

Diversity of web monetisation
Ads are the micropayments system of the web. Nobody likes them but they serve a crucial role in the web ecosystem.

The proposed change hurts web diversity by disproportionately harming smaller advertising networks that use the OpenRTB protocol. This essentially means most networks outside of Google and Facebook. Why? The User-Agent string is part of the OpenRTB BidRequest object where it is used to help inform bidding decisions, ad formatting and targeting.

A few points:

  • Existence of data in the OpenRTB BidRequest object doesn't mean that users and their user agents are obligated to provide it to advertisers. For example, I also see Geolocation in that same object as "recommended". I'm assuming you don't think that browsers should passively provide geolocation data on every request.
  • The implications of anti-fingerprinting work on OpenRTB are being discussed at the W3C’s Web Advertising Business Group.
  • The user agent information would still be available to advertisers, they'd just have to actively ask for it (using UA-CH or the equivalent JS API) in ways that enable browsers to know which origins are gathering that data.
  • Once something like Privacy Budget is in place, the change of UA data to active fingerprinting would enable advertisers to “spend” their entropy bits where they need them, rather than “pay” for entropy bits they potentially don’t need.
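As a rough sketch of what "spending" entropy bits could look like, the Python below models a per-origin ledger. Everything in it is an assumption made for illustration: Privacy Budget has no finalised API, and the `EntropyLedger` class, the budget size and the per-hint bit costs are invented numbers, not measurements.

```python
class EntropyLedger:
    """Hypothetical per-origin ledger; not an actual browser API."""

    def __init__(self, budget_bits, hint_costs):
        self.budget_bits = budget_bits  # assumed budget per origin, in bits
        self.hint_costs = hint_costs    # assumed cost of each hint, in bits
        self.spent = {}                 # origin -> {hint name: bits paid}

    def request_hint(self, origin, hint):
        """Grant the hint if the origin can still afford it."""
        granted = self.spent.setdefault(origin, {})
        if hint in granted:
            return True  # already paid for, no extra cost
        cost = self.hint_costs[hint]
        if sum(granted.values()) + cost > self.budget_bits:
            return False  # over budget: the browser can refuse or intervene
        granted[hint] = cost
        return True

    def remaining(self, origin):
        """Bits the origin has left to spend."""
        return self.budget_bits - sum(self.spent.get(origin, {}).values())


# Illustrative costs only; real entropy values would have to be measured.
ledger = EntropyLedger(5.0, {"platform": 2.0, "model": 4.0, "arch": 1.0})
origin = "https://ads.example"
print(ledger.request_hint(origin, "platform"))  # True
print(ledger.request_hint(origin, "model"))     # False (2.0 + 4.0 > 5.0)
print(ledger.request_hint(origin, "arch"))      # True
print(ledger.remaining(origin))                 # 2.0
```

The point of the sketch is only that an origin which skips hints it does not need keeps bits available for those it does; with passive delivery of the full string, that choice does not exist.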

Why does it hurt Google less? Because Google is able to maintain a richer set of user data across its dominant web properties (90% market share in search), Chrome browser (69% market share) and Android operating system (74% market share).

The web needs diversity of monetisation just as much as it needs diversity in browsers.

Dismissive tone in discussions
Some of the commentary from the proposers has been dismissive in nature e.g. the following comments on the Intent to Deprecate and Freeze: The User-Agent string post in response to a set of questions:

  • “I’d expect analytics tools to adapt to this change.”
  • “CDNs will have to adapt as well.”

Entire constituencies of the web should not be dismissed out of hand. This tone has no place in standards setting.

I apologize if this came across as dismissive. That wasn't my intention.

Entangling Chrome releases with an open standards process
In the review request, Chrome release dates are mentioned. It doesn’t feel appropriate to link a commercial organisation’s internal dates to a proposed standard. There are mentions of shipping code and the Chrome intent.

The TAG review process asks for relevant time constraints. I provided them.

Overstated support
This point has been made by others here but it is worth restating. It feels like there is an attempt to make this proposal sound as if it has broader support than it really does, in particular on the Chrome intent, linked explicitly by the requester.

Unresolved issues
The review states “Major unresolved issues with or opposition to this specification:” i.e. no unresolved issues or opposition. This is true only if you consider unilaterally closed issues to be truly closed. Here are a couple of issues that were closed rather abruptly, and coinciding with a Chrome intent.

Some closed HTTPWG issues:

I'm not sure what your point is here. These issues were raised (one by me), discussed, resolved and then closed.

@ronancremin
Copy link

@ronancremin ronancremin commented Feb 26, 2020

Thanks for the response.

Despite this reality, there is no public evidence of any attempt to consult with industry groups to understand the breadth and severity of the impact of this proposed change to HTTP. It is a testament to its original design that the HTTP protocol has endured so well despite enormous changes in the internet landscape. Such designs should not be changed lightly.

The Client Hints infrastructure was thoroughly discussed at the IETF's HTTPWG, as well as its specific application as a replacement to the User Agent string.

I'm saying that there is insufficient industry realisation that this is going on, despite discussions in the HTTPWG. However well-intentioned the discussions are it seems that some web constituents are only vaguely aware of what's being proposed. Obviously this isn't any particular person's fault but it feels like more time or outreach is required for the industry to become aware of the proposal and respond.

What is missing from the discussion is any concrete evidence of the extent or severity of this supposed tracking. Making changes to an open standard that has been in place for over 24 years should require a careful and transparent weighing of the benefits and costs of doing so, not the opinion of some individuals. In this case the benefits are unclear and the central argument is disputed by experts in the field. The costs on the other hand are significant. The burden of proof for making the case that this truly is a problem worth fixing clearly falls on the proposer of the change.

There's a lot of independent research on the subject. Panopticlick is one from the EFF.

With respect, I don't think this answers my concern at all, specifically the extent of this supposed passive tracking. Panopticlick and others like it say what's possible without saying anything about how widespread this tracking actually is, so I don't think that this counts as evidence of passive tracking. Furthermore, Panopticlick mixes both passive and active tracking. If there is independent research on the extent and severity of passive tracking maybe you could cite it here?

Existence of data in the OpenRTB BidRequest object doesn't mean that users and their user agents are obligated to provide it to advertisers. For example, I also see Geolocation in that same object as "recommended". I'm assuming you don't think that browsers should passively provide geolocation data on every request.

No, of course not. And user agents are not obligated to send User-Agent headers either, and can say whatever they want in them.

But the point is that most user agents have been sending useful User-Agent headers for the last 25 years or so and, for all its imperfection, the web ecosystem has grown up around this consensus, including the advertising industry that helps pay for so much of what we utilise on the web.

The user agent information would still be available to advertisers, they'd just have to actively ask for it (using UA-CH or the equivalent JS API) in ways that enable browsers to know which origins are gathering that data.

Yes, but they get it only on the second request—a significant drawback in an industry where time is everything, especially on mobile devices where connectivity issues are more likely.
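To illustrate the round trip in question, here is a minimal Python simulation of the opt-in flow as I understand it: the first request carries only the default hints, the server answers with `Accept-CH`, and only later requests to that origin carry the extra hints. `BrowserSketch`, the header values and the exact hint names listed in `Accept-CH` are assumptions for illustration, not the behaviour of any shipping browser.

```python
# Default low-entropy hints; the values are illustrative placeholders.
LOW_ENTROPY_DEFAULTS = {
    "Sec-CH-UA": '"Chromium"; version=80',
    "Sec-CH-UA-Mobile": "?1",
}


class BrowserSketch:
    """Toy model of a browser that honours Accept-CH per origin."""

    def __init__(self, high_entropy_values):
        self.high_entropy_values = high_entropy_values
        self.opted_in = {}  # origin -> set of hint names the origin asked for

    def request_headers(self, origin):
        """Headers sent on a request: defaults plus any opted-in hints."""
        headers = dict(LOW_ENTROPY_DEFAULTS)
        for hint in self.opted_in.get(origin, ()):
            if hint in self.high_entropy_values:
                headers[hint] = self.high_entropy_values[hint]
        return headers

    def handle_response(self, origin, response_headers):
        """Remember which hints the origin requested via Accept-CH."""
        accept_ch = response_headers.get("Accept-CH", "")
        hints = {h.strip() for h in accept_ch.split(",") if h.strip()}
        if hints:
            self.opted_in[origin] = hints


browser = BrowserSketch({"Sec-CH-UA-Platform": "Android", "Sec-CH-UA-Model": "Pixel 3"})
origin = "https://ads.example"

first = browser.request_headers(origin)
print("Sec-CH-UA-Model" in first)    # False: the bid for this view has no model

browser.handle_response(origin, {"Accept-CH": "Sec-CH-UA-Platform, Sec-CH-UA-Model"})
second = browser.request_headers(origin)
print(second["Sec-CH-UA-Model"])     # Pixel 3, but only from the second request on
```

In this model the very first ad request is always made without the device details, which is the case that matters for bidding.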

The review states “Major unresolved issues with or opposition to this specification:” i.e. no unresolved issues or opposition. This is true only if you consider unilaterally closed issues to be truly closed. Here are a couple of issues that were closed rather abruptly, and coinciding with a Chrome intent.

I'm not sure what your point is here. These issues were raised (one by me), discussed, resolved and then closed.

Perhaps this is the normal process but the closure felt forced/abrupt.

@jwrosewell

@jwrosewell jwrosewell commented Feb 28, 2020

There is still a general lack of awareness around this and related Google-driven changes to the web.

Note: I've edited this post to ensure compliance with the W3C code of conduct. I hope it has provoked, and will continue to provoke, thought.

The Economist last week touched on these subjects in a Special Report. In accessing it online one can experience the effects this change is already having on journalism and access to advertising-funded content. Basically you'll have to enter your email address every time you want to read something, or if that's too much hassle just use Google News, where Google will control the revenue for everyone. Small publishers will never get funding and will be commercially unviable. That's not a good thing and takes us a long way from Sir Tim's vision for the web.

Here's a link to The Economist article.

Who will benefit most from the data economy?

And my wider thoughts on the impact.

My wider thoughts on the changes and the impact to AdTech

@jwrosewell

@jwrosewell jwrosewell commented Mar 14, 2020

Last week the European Parliament and Council met to debate repealing legislation concerning Privacy and Electronic Communications regulation. The proposals recognise legitimate interest and the providers of electronic services. They call out the end user confusion associated with gaining and controlling consent. These are subjects that others and I have raised previously in comments on this proposal.

The debate explicitly recognises that “metadata can be useful for businesses, consumers and society as a whole”. Legitimate interest includes:

• identification of security threats;
• meeting quality of service requirements;
• aggregated analysis;
• providing services;
• websites without direct monetary payment;
• websites wholly or mainly financed by advertising;
• audience measuring;
• management or optimisation of the network;
• detecting technical faults;
• preventing phishing attacks; and
• anti-spam.

The debate recognises “providers should be permitted to process an end-user’s electronic communications metadata where it is necessary for the provision of an electronic communications service”.

Implementations should be performed in the “least intrusive manner”. The User-Agent meets this criterion.

There is an explicit list of the information contained within the end users’ terminal equipment that requires explicit consent. The list does not include metadata such as that contained in the User-Agent.

The legitimate interests of businesses are explicitly recognised “taking into consideration the reasonable expectations of the end-user based on her or his relationship with the provider”.

The debate advocates placing controls over consent and control within the terminal equipment (or user’s agent) not the removal of such data.

The outcome of the debate should inform the W3C, Chromium and other stakeholders. The UK (now no longer part of the EU) is also considering these matters via the Competition and Markets Authority (CMA) investigation and the Information Commissioner's Office (ICO). At least two of these three regulatory bodies are publicly progressing in a direction that is not aligned to this proposal.

It is not the business of the W3C to help pick winners and losers on the web. This proposal in practice will favour larger businesses. Technically and now regulatorily it looks like a solution looking for a problem. It should be rejected by the W3C at this time.

The full text of the EU document is available here.

https://data.consilium.europa.eu/doc/document/ST-5979-2020-INIT/en/pdf

@jwrosewell

@jwrosewell jwrosewell commented Mar 19, 2020

Google yesterday recognised the stresses many businesses are already under and is doing its bit to reduce that burden by delaying enhancements to Chromium and Chrome. Here's the short update.

"Due to adjusted work schedules at this time, we are pausing upcoming Chrome and Chrome OS releases. Our primary objectives are to ensure Chrome continues to be stable, secure, and work reliably for anyone who depends on them. We’ll continue to prioritize any updates related to security, which will be included in Chrome 80."

https://blog.chromium.org/2020/03

@jwrosewell

@jwrosewell jwrosewell commented May 14, 2020

@torgo for those who are following this issue, please could you add a comment / link on the output from the TAG review?

There were many related comments concerning the issues associated with removing a standard interface discussed under the Scheme-bound Cookies proposal.

Thank you.

@torgo
Member

@torgo torgo commented May 26, 2020

We see from this post from Yoav that the proposal has been put off until 2021 at the earliest. At the same time, Client Hints is progressing. We think this state of affairs could allow client hints to mature, both in terms of the spec, and in terms of the implementations and industry adoption. Right now we're going to close this issue to make way for other issues in our workload but we'll be happy to re-open when appropriate.

@jwrosewell

@jwrosewell jwrosewell commented Aug 31, 2020

FYI: a pull request has been made on the WICG draft specification to add feedback from those who have used the experiments and considered the specification.
