
"tracked" sessions: architectural concerns pending resolution with TAG #85

Open
ddorwin opened this issue Aug 27, 2015 · 70 comments

@ddorwin (Contributor) commented Aug 27, 2015

Pull request #54 was merged without addressing architectural concerns about “tracked” sessions. Unresolved questions are pending a discussion with the TAG. The outcome could result in modification (or removal) of “tracked” sessions.

This issue is a placeholder for that discussion and outcome.

Resolving #82 and #84 could help accelerate conclusion of this discussion.

@mwatson2 (Contributor) commented Aug 27, 2015

The group did not agree that the issue needed to be raised by the TAG. Of course companies are free to do so and that might result in TAG advice for the group to further consider. But it is not the case that we are blocked pending discussion with the TAG.

@ddorwin (Contributor, Author) commented Aug 28, 2015

@mwatson2: This issue was carefully worded to reflect the situation. While I am unable to find a path to your conclusion that it implies that "tracked" is "blocked pending discussion with the TAG," I want to clarify that you are right - "tracked" isn't blocked pending a TAG discussion. However, I also believe that this discussion 'could result in modification (or removal) of “tracked” sessions.' Do you disagree?

@mwatson2 (Contributor) commented Aug 28, 2015

My comment was just to clarify that the unresolved questions and concerns are Google's, not shared by the group, as this wasn't stated either way in the issue description.

mwatson2 added a commit to mwatson2/encrypted-media that referenced this issue Sep 3, 2015

@mwatson2 (Contributor) commented Sep 24, 2015

I offered on our call to provide information about the fraction of time we expect to / do receive the secure release information. I have a qualified answer: considering only streaming sessions where either:
(a) the session is closed gracefully and the secure release exchange with the server completes, or
(b) (a) does not hold, but the user later revisits the site and Local Storage information from the session has not been cleared
we expect to receive secure release messages for at least 99% of sessions. And indeed, in practice, we achieve this in the field on a desktop browser that has implemented secure release.

@paulbrucecotton commented Oct 30, 2015

At the Sapporo F2F meeting we have no update from the TAG on either EME ISSUE-85 or the related TAG Issue-73.

Paul will continue to chase after @travisleithead and Alex to get feedback on this matter.

@paulbrucecotton commented Nov 16, 2015

@travisleithead - Can you please give us an update on the TAG discussion of this EME issue?

@travisleithead (Member) commented Nov 17, 2015

@slightlyoff and I are still discussing this. I hope we can make some progress this week.

@paulbrucecotton commented Dec 1, 2015

@travisleithead and @slightlyoff: Can you give us an update on your progress?

@paulbrucecotton commented Dec 7, 2015

@travisleithead and @slightlyoff: Would it be possible for you to attend a Media TF meeting on Tue Dec 15 to discuss your progress on this issue?

@travisleithead (Member) commented Dec 9, 2015

I know recent conferences, holiday travel, and vacation have been a factor in keeping @slightlyoff and me from making progress. I'm available to join the call Dec 15th but am afraid I won't have much to report. Regarding subsequent calls, after the 16th, I won't be available until January 2016.

@slightlyoff commented Dec 15, 2015

Apologies.

@paulbrucecotton commented Jan 23, 2016

The W3C TAG has filed its review of this issue as "Architectural view on run-after-app-close behavior" at:
w3ctag/design-reviews#73 (comment)

Please ask your questions or add comments here on this review or on the TAG issue 73. If required I will request that Travis and/or Alex attend an upcoming Media TF teleconference to discuss this matter.

/paulc

@jdsmith3000 (Contributor) commented Feb 3, 2016

The TAG response seems clear that web specs should not require synchronous operations post shutdown, and puts particular emphasis on issues that would be encountered when documents are abruptly destroyed. It does not, however, make judgements on the EME spec or the persistent-usage-record feature, and in fact puts feature impact specifically out of scope, as stated in the second paragraph:

In this response, we only seek to clarify the architectural question of requiring steps to run after application close; we make no value judgement of the feature in question or implementation strategies vendors might choose.

In addition to not passing judgement on EME features, the guidance also leaves implementation choices open to implementers:

Implementations are welcome to add triggers and hooks to run operations on shutdown of specific web platform environments, of course.

This raises two points we should consider:

  1. Does the EME spec require synchronous processing on shutdown? It does not, at least not anywhere in the spec language. There have been discussions about the value of writing data when the session closes, but it's not been established as a spec requirement. A valid implementation could save timed data throughout playback to provide a useful record of key usage.
  2. Choices made by implementers are not restricted. The TAG judgement leaves open design choices made by implementers. In that context, a given implementation might choose to write data on shutdown. At least one current implementation does this now. That is allowable, but doesn’t establish any of its implementation choices as requirements.

Given the TAG guidance, we shouldn’t make changes to EME that require post shutdown processing. Beyond that, I don’t think the judgement invalidates the persistent-usage-record feature, and also don’t think that pull request #54 should be reverted.
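
(Editorial illustration: a minimal TypeScript sketch of the "save timed data throughout playback" approach from point 1 above. All names are hypothetical, not from the spec or any implementation.)

```ts
// Persist the key-usage record on a timer during playback, so no write
// is required at shutdown; a crash loses at most one interval of data.
interface UsageRecord {
  keyIds: string[];
  firstDecryptTime: number; // wall-clock ms
  lastDecryptTime: number;
}

function startUsageRecordTimer(
  getRecord: () => UsageRecord,
  persist: (record: UsageRecord) => void,
  intervalMs = 30_000,
): () => void {
  const timer = setInterval(() => persist(getRecord()), intervalMs);
  // Returned stop function performs a final best-effort write; abrupt
  // teardown may skip it, but earlier periodic writes survive.
  return () => {
    clearInterval(timer);
    persist(getRecord());
  };
}
```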

@mwatson2 (Contributor) commented Feb 3, 2016

I agree with Jerry's comments and would go a little further:

  • The discussion in the opinion on synchronous operations at shutdown refers to the execution of "arbitrary code" at page close or shutdown. I interpret this as referring to code supplied by the page. The execution of user agent code at shutdown is a different issue (and is clearly unavoidable, since it is the user agent which is shutting down the page).
  • Another valid implementation of the specification requirements would be to persist the secure release data some time after shutdown, in exactly the way Beacon or <a ping> send data after shutdown. The opinion says "These are reasonable models to follow".


@mwatson2 (Contributor) commented Feb 9, 2016

In the absence of further comments, can we close this issue?

@ddorwin (Contributor, Author) commented Feb 23, 2016

We interpret the text of the official TAG response differently, and below, I have described specific differences with the above interpretations. For the sake of resolving this as efficiently as possible, I propose that we ask the authors, @slightlyoff and @travisleithead, to clarify the intent of the text and accuracy of the interpretations.


@jdsmith3000 wrote:

It does not, however, make judgements on the EME spec or the persistent-usage-record feature...

The TAG opinion says the TAG “make[s] no value judgement of the feature.” That is, for example, whether it would be useful. Although the text is general, there is a clear conclusion on whether "a web-based feature should require executing steps in an environment that is already in the process of closing down."

  1. Does the EME spec require synchronous processing on shutdown? It does not, at least not anywhere in the spec language.

Actually, that is exactly how the observable behavior is defined. My interpretation is that the TAG response recommends that specs do not define such behavior.

An approach to consider is to try to change the feature definition to avoid this behavior. However, as discussed before, one possible conclusion of that path is tamper-evident storage, which is not always available, especially for non-first-party implementations.

  2. Choices made by implementers are not restricted. The TAG judgement leaves open design choices made by implementers. In that context, a given implementation might choose to write data on shutdown. At least one current implementation does this now. That is allowable, but doesn’t establish any of its implementation choices as requirements.

In this specific case, the choice is a Hobson's choice. As the feature is currently defined, some implementations can make only one choice - one that the TAG discourages specs from requiring. I interpreted this text as allowing implementations choice in how they implement features, possibly not even web platform features. But such choices are optional, i.e., intended to enable optimizations. I don't think this was referring to a case where something is the only possible solution for a large portion of implementations.


@mwatson2 wrote:

- the discussion in the opinion on synchronous operations at shutdown refers to the execution of "arbitrary code" at page close or shutdown. I interpret this as referring to code supplied by the page. The execution of user agent code at shutdown is a different issue (and is clearly unavoidable, since it is the user agent which is shutting down the page).

This is missing the context. "Arbitrary code" is immediately contrasted with "declarative, canned behaviors." Beacons and <a ping>, the existing web platform features discussed before that text, fall into the latter category. The user agent can process these when the page loads and prepare the actions for when it unloads. In contrast, this feature requires the user agent to allow the CDM - a separate entity - to execute code on its behalf for the page when the page is closed.

- another valid implementation of the specification requirements would be to persist the secure release data some time after shutdown, in exactly the way Beacon or <a ping> send data after shutdown. The opinion says "These are reasonable models to follow".

This is misleading and omits important statements from the same paragraph. The TAG response does not say persisting such data "some time after shutdown" is a reasonable model to follow.

Beacons and <a ping> have three very important properties, two of which are mentioned in the surrounding text, that this feature does not.

  • They have a declarative semantic for canned behaviors that can be replayed by the user agent.
  • They “are designed not to interfere or block navigation of a document nor shutdown of a browsing context.” Of particular note is that they do not modify exposed or persisted per-origin state.
  • They are (very) best-effort, as noted in the response.

These properties allow user agents to build a list of actions - or requests to replay later - and maintain that list independent of the page or origin. Thus, when the page is closed, the user agent can process the actions without the page's context and at a time of its choosing. In addition, it is acceptable - even expected - that some (potentially significant) percentage of the requests will not succeed.
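
(Editorial illustration: a TypeScript sketch of the replayable-actions model just described, with hypothetical names.)

```ts
// The UA records a canned request while the page is alive, then replays
// it best-effort after the page is gone; no page or origin state is
// touched at replay time.
interface CannedAction {
  url: string;  // destination captured up front, like a ping's target
  body: string; // fixed payload; nothing is computed at close time
}

const pendingActions: CannedAction[] = [];

function queueDeclaredAction(url: string, body: string): void {
  pendingActions.push({ url, body });
}

async function replayAfterClose(): Promise<void> {
  for (const action of pendingActions.splice(0)) {
    try {
      await fetch(action.url, { method: "POST", body: action.body });
    } catch {
      // Best-effort: some requests are expected never to succeed.
    }
  }
}
```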

In contrast, for this feature:

  • The user agent does not even know that it is supposed to perform an operation until the page is closed. (Because they might need to perform an operation, user agent implementations must wait for the CDM instance to tear down before destroying the page and its context.)
  • Since the operation is to generate and store data for the origin that may later be read by the app/origin, the user agent must keep the page's browsing context alive to perform that operation.
  • We are told above it requires somewhere around 99% reliability.


There are a lot of details in the body of the TAG response, but I think the conclusion is clear:

In conclusion, the TAG does not believe that a web-based feature should require executing steps in an environment that is already in the process of closing down as described above. In general, the TAG favors designs that promote asynchronous or deferred actions; in contrast, requiring run-at-close steps as requested would likely be synchronous in order to reliably work in such a scenario, and therefore not appropriate for the web platform.

@mwatson2 (Contributor) commented Feb 23, 2016

Just a couple of points in response:

The user agent can process these when the page loads and prepare the actions for when it unloads. In contrast, this feature requires the user agent to allow the CDM - a separate entity - to execute code on its behalf for the page when the page is closed.

The relationship between user agent and CDM in an implementation is a software architecture issue. I don't see how we can say that there is a web-architecture-visible difference between the user agent executing code and a CDM executing code even though those may have different characteristics in any particular implementation.

The user agent does not even know that it is supposed to perform an operation until the page is closed.

This is true of <a ping> too, since it is not until the user follows the link that it is known that the ping should be sent, or which of several on the page should be sent.

Since the operation is to generate and store data for the origin that may later be read by the app/origin, the user agent must keep the page's browsing context alive to perform that operation.

It's not clear to me that the entire browsing context is required: just as the user agent may keep a queue of pings or beacons to be sent later with their individual destinations, the user agent may keep a queue of blobs to be stored later with their individual origins.
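
(Editorial illustration of the alternative described in the preceding paragraph - a queue of origin-tagged blobs, sketched in TypeScript with hypothetical names.)

```ts
// The storage scope is captured when the record is queued, so the flush
// can run on a browser-process queue after the page has closed.
interface PendingRecord {
  origin: string;     // captured before shutdown, like a ping's destination
  record: Uint8Array; // opaque usage-record blob produced by the CDM
}

const storageQueue: PendingRecord[] = [];

function queueUsageRecord(origin: string, record: Uint8Array): void {
  storageQueue.push({ origin, record }); // no DOM or page state retained
}

function flushStorageQueue(
  writeForOrigin: (origin: string, record: Uint8Array) => void,
): void {
  // Runs whenever convenient, independent of any document lifetime.
  for (const { origin, record } of storageQueue.splice(0)) {
    writeForOrigin(origin, record);
  }
}
```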

@mwatson2 (Contributor) commented Mar 22, 2016

IIUC, we are expecting further TAG advice on this issue, or at least clarification of their previous advice. However, I don't think our progress should depend on this. So, I think this issue should be reclassified V1NonBlocking (the implication being that if there is no further progress on this issue, the specification is left unchanged).

@ddorwin (Contributor, Author) commented Mar 25, 2016

Yes, we are waiting for clarification from the TAG. If the feature were not currently in the spec, I would agree that V1NonBlocking makes sense. However, one interpretation of the most recent TAG response is that the feature is not acceptable as defined and thus should be removed from the spec unless/until it can be made acceptable. (For example, through incubation for VNext, which I think is probably the best path forward.) A Proposed Recommendation (PR) should not include such known issues, so we must get this clarity before V1 gets to that point, meaning the V1 milestone is appropriate.

@jdsmith3000 (Contributor) commented Mar 25, 2016

I agree that TAG clarification is still needed, but don't agree that in the absence of such clarification the feature should be pulled from the spec. Keeping this on V1 would force that outcome if clarification isn't obtained by our deadlines. Given the majority support for the feature in the TF, that does not seem appropriate.

I support moving this to V1NonBlocking based on that.

@ddorwin (Contributor, Author) commented Mar 28, 2016

Reclassifying this issue as V1NonBlocking would indicate that the spec could reach V1 without TAG clarification. However, this issue does block V1 - the spec will not reach REC without a clear resolution of the related concerns raised about the current spec contents, and TAG clarification is the most appropriate next step towards a resolution.

@jdsmith3000 (Contributor) commented Mar 28, 2016

This would imply that any issue that requests TAG clarification should be considered blocking until such clarification is received. I'm asserting that is not appropriate, especially on an issue with general working group support. In this specific case, if TAG chooses not to provide additional clarification, then I believe the existing feature in the spec should remain. That would imply V1NonBlocking.

I understand there is a difference in interpretation of the previous TAG opinion, but my reading of it does not make it clear that the feature should be removed. There is considerable discussion of challenges for implementation. I believe that at least two CDMs have working implementations now, and that they are providing useful data.

If TAG informs us that the issue should be considered blocking until additional clarification is provided by them, then it should certainly be marked V1.

@paulbrucecotton commented May 17, 2016

Can we get agreement to mark this feature "at risk" when EME transitions to Candidate Recommendation?

This will:
a) give us more time to demonstrate the number of implementations we have of this feature,
b) give us more time to discuss the matter further with the TAG, and
c) still leave open the possibility of removing the feature when we transition to Proposed Recommendation.

@mavgit commented May 18, 2016

@jdsmith3000 This is used in support of download count or device count use cases associated with usage limits. Here is a reply from my security team:

The service permits clients to submit a secure assertion to a DRM/security endpoint that license(s) in connection with a business identifier are not authorized, by way of confirming that licenses are not present in local cache. In current implementations, the application calls a release API passing in a business identifier. One or more licenses are deleted from local cache if one or more licenses are bound to a business-oriented attribute matching the specified business identifier. Since licenses may have previously expired or been removed from cache, actual license deletion would only occur in the event matching licenses are present. Regardless of whether license deletion executes, the license must no longer exist in local cache in order for the client to proceed to generating its assertion. Provided the client is able to guarantee that licenses in connection with the business identifier are not present in local cache, the client generates a digitally signed assertion identifying the business identifier and, optionally, license identifiers. The secured assertion is submitted to a DRM/security endpoint. The distribution service is then able to disassociate the client from the specified licenses in support of download / device count use cases associated with usage limits.
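
(Editorial paraphrase of the flow described above as a TypeScript sketch; every name here is hypothetical, not a real DRM API.)

```ts
// Key invariant from the description: the signed assertion is generated
// only once no matching licenses remain in the local cache.
interface License { id: string; businessId: string; }

interface LicenseCache {
  find(businessId: string): Promise<License[]>;
  remove(id: string): Promise<void>;
}

async function releaseByBusinessId(
  cache: LicenseCache,
  sign: (payload: object) => Promise<string>, // produces the signed assertion
  endpoint: string,
  businessId: string,
): Promise<void> {
  // Delete any cached licenses bound to the business identifier; this is
  // a no-op for licenses that already expired or were removed.
  const matches = await cache.find(businessId);
  for (const license of matches) {
    await cache.remove(license.id);
  }
  // Proceed only once absence from the local cache is guaranteed.
  if ((await cache.find(businessId)).length > 0) {
    throw new Error("licenses still present; cannot assert release");
  }
  const assertion = await sign({
    businessId,
    licenseIds: matches.map((l) => l.id), // optional per the description
  });
  await fetch(endpoint, { method: "POST", body: assertion });
}
```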

@ddorwin (Contributor, Author) commented May 19, 2016

@mwatson2 wrote:

No, that note stated their understanding that the TAG's "low-latency notification" meant synchronous work being done on page shutdown, and subsequent discussion with @travisleithead did not dispute my characterization of this as synchronous work by the page.

@mwatson2, can you explain how your characterization of "synchronous work by the page" (on shutdown/teardown) is different from "synchronous work being done on page shutdown?"


@paulbrucecotton wrote:

Can we get agreement to mark this feature "at risk" when EME transitions to Candidate Recommendation?

We agree that marking the feature "at risk" is the minimum required to ensure we do not put the schedule at risk if we do not resolve this before CR. However, keeping the feature in the spec through CR means we must continue to spend cycles debating related issues - at least one of which relates directly to the current discussion - before June 9th. Therefore, as before, I believe the prudent path forward is to file a VNext issue to track this feature, fork/branch the current spec text to enable continued development outside V1 and its schedule pressures, and remove the text from the V1 draft.

@paulbrucecotton commented May 19, 2016

@ddorwin wrote:

Therefore, as before, I believe the prudent path forward is to file a VNext issue to track this feature, fork/branch the current spec text to enable continued development outside V1 and its schedule pressures, and remove the text from the V1 draft.

If we fork/branch at CR then ALL changes that are made to the CR branch that are pertinent to the VNext branch must be made there as well. And since we have lots of V1NonBlocking and V1Editorial changes we plan to make during CR, this will mean more work for the Editors and WG during CR when we should be doing testing.

I believe a better plan is to fork/branch at PR just before we decide what "at risk" features need to be removed. This causes NO additional work for the Editors during CR.

@ddorwin (Contributor, Author) commented May 19, 2016

@paulbrucecotton, my comments were not in relation to the main "VNext branch", your proposed plan, or "at risk" features in general. I did not mean that such a fork/branch would be the VNext branch.

Forking/branching before removal was simply a suggestion for continuing to iterate on text that is no longer in the mainline. There are other options as well, including creating a branch from a historical commit and applying a git revert. However, such mechanics are really irrelevant to the proposal, so I'll simplify it as follows:

File a VNext issue to track this feature, remove the text from the V1 draft, and continue discussion and development outside the constraints of the V1 schedule pressures.

My concern about this particular "at risk" feature is that there are pending substantive changes for which the same unresolved concerns apply and others that only need to be addressed by June 9th if this feature is still in V1.

@mwatson2 (Contributor) commented May 19, 2016

@ddorwin asked:

can you explain how your characterization of "synchronous work by the page" (on shutdown/teardown) is different from "synchronous work being done on page shutdown?"

By "work done by the page", I mean execution of page Javascript.

The more general "synchronous work being done on page shutdown" could mean any work done by the CPU at the time the page is closed, before it is considered closed. This is hard to define, because a definition of "synchronous" in this context would mean that the work in question is done "before the page is considered closed", and I am not sure how we define "the point at which the page is considered closed"?

In any case, I can't think of an implementation-independent definition of this point before which secure release requires work to be done and where there isn't already plenty of other work that has to be done (by the CPU). For example, if we are concerned about work that needs to be done before "all resources associated with the page are released", say, then there are obviously things like releasing hardware resources that have to be done, and queuing storage of the secure release message is trivial by comparison and very similar to queuing an <a ping>.

@mwatson2 (Contributor) commented May 20, 2016

@ddorwin wrote:

My concern about this particular "at risk" feature is that there are pending substantive changes

I don't believe the changes that are pending are substantive. They describe more explicitly what is already required by the existing text. This is why the issue is classified V1NonBlocking.

Now that we have merged the session close re-factor under #181, the changes for #171 are very simple.

@hsivonen commented May 28, 2016

@ddorwin, I'd like to understand your current objections to the CDM requesting the browser to store a key usage record on the CDM's behalf in origin-partitioned storage when the browser shuts down a CDM instance. Specifically, this would not involve generating another EME message (synchronous messaging concern) at CDM shutdown. Specifically, this would not involve storing key usage records during playback, so even in the absence of "secure" storage, there wouldn't be a risk of rollback--just the risk of failing to write a record at all if either the CDM or the browser crashes (or electricity to the computer is cut, etc.). Also, the user deleting the storage between the CDM shutdown and next initialization for the same site would result in a lost record.

Your comment from April 14 strongly hints that the crux of your concern is that Chrome loses the association between a CDM instance and the origin to which the storage should be partitioned before the CDM instance is actually shut down. Is that the case? Is there a reason why the CDM instance can't have a potential (origin-partitioned) storage location assigned to it at CDM instance initialization time so that knowledge of the storage location would survive the destruction of the Document object that was responsible for the CDM instance getting initialized?
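
(Editorial sketch of the arrangement suggested here, in TypeScript with hypothetical names: the origin-partitioned storage handle is bound to the CDM instance at initialization, so the shutdown write does not depend on the Document.)

```ts
// The storage handle is captured once, at CDM initialization time, from
// the creating origin; the Document may be destroyed before shutdown.
interface OriginStorage {
  write(key: string, data: Uint8Array): Promise<void>;
}

class CdmInstance {
  constructor(private readonly storage: OriginStorage) {}

  async shutdown(usageRecord: Uint8Array | null): Promise<void> {
    if (usageRecord) {
      // Only the pre-assigned storage handle is needed at this point.
      await this.storage.write("persistent-usage-record", usageRecord);
    }
  }
}
```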

jdsmith3000 added a commit to jdsmith3000/encrypted-media that referenced this issue Jun 10, 2016

jdsmith3000 added a commit that referenced this issue Jun 10, 2016

@ddorwin (Contributor, Author) commented Jun 28, 2016

@hsivonen, most of the issues in your first paragraph were addressed over the last year and led us to this point. For example, requiring delayed application shutdown to send messages (and thus the ability to enforce concurrent stream limitations) was dropped; and the current definition does not require tamper-evident storage (and avoids trivial rollback attacks). The latter, however, leaves the feature in a state where it requires very high assurances that the single record write will occur when the page is closed, including when the tab or browser is closed.

#45 (comment) summarizes the state, before the discussion moved to the architectural issues. Specifically, the feature does not support enforcement like alternative mechanisms, and identifying abuse while dealing with the uncertainty over time (there will always be some percentage of key usage records that are never received and others that may not be received for days or weeks) on the server could be quite complex, especially across the broad spectrum of client device form factors and usage models.


tl;dr: Persisting state is not an explicit goal/purpose of EME, yet this feature requires something that even APIs with the explicit goal of persisting data do not support: ensuring persistence during destruction of an object or the entire page. Requiring client implementations to accommodate a limited-utility and orthogonal capability is unnecessary, especially when other options have not been duly considered.

Specific to Chrome: Chrome treats the CDM like any other part of the page, such as the media player backing the <video> element, and, where possible, stores CDM data using the same mechanism as other site data. This helps keep the EME implementation consistent with the rest of the platform and allows reuse of established mechanisms.

Even if a parallel storage mechanism for CDMs was added to Chrome, it would need to manage the CDM instance/process lifetime outside that of the browsing context, including the associated renderer process - something that is not and cannot currently be normatively described. This would be inconsistent with the current goals to treat EME and CDMs like any other part of the web platform.

Even if solutions for both were implemented in Chrome, this would only address the issue for the Chrome browser. Any Chromium-based user agents, including Opera and Chromium Embedded Framework, that use Chromium’s content layer but not the entire Chromium browser may have to independently address the issues above. It is important to ensure that implementers continue to have freedom to innovate and avoid imposing orthogonal requirements.


As has become apparent from recent discussions above, though, this is not just an implementation issue - the reason these potential implementation issues exist is that the required behavior is outside the bounds of the current web platform, including the extent of normatively-definable behavior. Thus, even if there was not an implementation problem, we would still have the problem of being unable to normatively describe the required behavior. I'll address this in more detail in my next comment.

Given the lack of evidence that this is a generally useful solution (as far as we know, only Netflix has actually deployed it, and mostly on platforms that have tamper-evident storage), it seems premature to consider this feature a Recommendation at this time, especially given all the other issues and implementation burden it involves.

@ddorwin (Contributor, Author) commented Jun 28, 2016

The spec for MediaKeySession currently (after #171) says:

If a MediaKeySession object becomes inaccessible to the page and the Session Closed algorithm has not already been run, the User Agent MUST run the MediaKeySession destroyed algorithm before User Agent state associated with the session is deleted.

This sounds like a destructor but maybe even broader. Is there precedent for such normative text related to destruction or inaccessibility of an object?


The MediaKeySession Destroyed algorithm says:

The following steps are run in parallel to the main event loop:

3. Use cdm to execute the following steps:

  1. Close the session associated with session.
  2. If session's session type is "persistent-usage-record", store session's record of key usage, if it exists.
    NOTE
    Since it has no effects observable to the document, this step may be run asynchronously, including after the document has unloaded.


So, when "a MediaKeySession object becomes inaccessible," the steps related to this feature "are run in parallel to the main event loop." Ignoring the destructor question, this might be okay if it was best effort. However, with an expectation of "at least 99%" reliability, the user agent would need to wait until the parallel steps complete before terminating the main loop. As far as I understand, there is currently no way to specify this behavior in the web platform (the current spec text likely does not provide the expected reliability), and there is no such precedent for delaying the teardown of the main event loop or browsing context. Such delays also seem to contradict trends in implementations towards quick shutdown.

The spec changes for #171 attempt to deal with this with a non-normative note that the persistence “may be run asynchronously, including after the document has unloaded.” This appears to be an acknowledgement of the issue above, but since there is no mechanism for normatively solving it (within the document lifetime), it relies on a non-normative suggestion for implementations of the specification to do work related to the browsing context outside the lifetime of that browsing context.
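
(Editorial illustration of the reliability gap described above, in TypeScript with hypothetical types: read literally, the "in parallel" storage step is an unawaited task.)

```ts
interface Cdm {
  closeSession(id: string): Promise<void>;
  storeUsageRecord(id: string): Promise<void>;
}

function runDestroyedAlgorithm(
  cdm: Cdm,
  sessionId: string,
  isPersistentUsageRecord: boolean,
): void {
  // "The following steps are run in parallel to the main event loop":
  void (async () => {
    await cdm.closeSession(sessionId);
    if (isPersistentUsageRecord) {
      await cdm.storeUsageRecord(sessionId); // may be cut off mid-write
    }
  })();
  // No join or completion signal here: if the UA tears down immediately
  // after this returns, the record above is silently lost.
}
```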

@mwatson2 (Contributor) commented Jul 8, 2016

@ddorwin wrote:

Persisting state is not an explicit goal/purpose of EME

I disagree with this statement. The goal/purpose of EME, from the beginning, was to provide access to Content Protection capabilities previously available only through plugins. Those pre-existing capabilities included persistent licenses and key release messaging. We documented the in-scope use-cases in our wiki, which lists both persistent licenses and key release as "supported". Key release was proposed in the very first EME draft.

Whilst we have agreed on constraining the CDM features accessible by EME to a smaller set than those supported in plugins, it has never been proposed to exclude or deprioritize those features requiring persistent state (except insofar as they are optional).

Requiring client implementations to accommodate a limited-utility and orthogonal capability is unnecessary, especially when other options have not been duly considered.

The feature is optional, so no one is required to implement it.

We have had long, long discussions of alternatives, and I've provided detailed explanations of why we think license renewal is overkill for this problem, so I'm not sure why you say other options have not been duly considered.

So, when "a MediaKeySession object becomes inaccessible," the steps related to this feature "are run in parallel to the main event loop."

The following is not a major issue, but as commented earlier the "in parallel" here is really a no-op. "In parallel" means only that pages cannot assume the steps will be complete before the next turn of the event loop. But in this case the effects of the steps can only be observed asynchronously (by loading a session and waiting for the release message), and in the page close case there may be no more turns of the event loop at all. Whether we say "in parallel" or not, it looks the same as far as the page is concerned.

However, with an expectation of "at least 99%" reliability, the user agent would need to wait until the parallel steps complete before terminating the main loop.

I don't think this follows, or even makes sense. The main event loop in our specifications is an abstract thing which expresses the serialization of the execution of tasks for the page. There can, by definition, be no more tasks related to this object and may be no more tasks at all, so how does it make sense to talk about "before terminating the main loop" in this context? The only thing that could mean is that you expect more tasks to be executed after these steps, but that is not proposed.

Furthermore, this is an optional feature and the specification does not specify a reliability target. The 99% figure I gave is our (Netflix's) "happy path" target, considering only cases of graceful session close / page shutdown and where local storage has not been cleared.

Finally, as noted here, a possible implementation is that the CDM is provided with the necessary origin-specific context at initialization time and has the ability to post the storage task to some separate page-independent queue for later execution, even after the page has closed (as could also be done with ping).

As far as I understand, there is currently no way to specify this behavior in the web platform (the current spec text likely does not provide the expected reliability), and there is no such precedent for delaying the teardown of the main event loop or browsing context. Such delays also seem to contradict trends in implementations towards quick shutdown.

The current text doesn't imply any particular level of reliability but does not constrain reliable implementation either. I can't see how the text constrains implementations (i.e., how it "does not provide the expected reliability"). AFAIK, there is no such thing in the web specifications as "teardown of the main event loop" - the UA decides when to stop executing tasks. The necessary storage scope can be captured earlier, independently of the browsing context, DOM, etc. So there is nothing that has to be delayed. The closest I can think of is if there were a requirement to execute a specific queued task or make some UI change or other page- or user-visible change after the storage of the persistent release data, but there is no such requirement.

In general, I find that these objections confuse specification / platform issues with implementation issues. Of course there are many, many aspects of browser implementation which are not addressed by the specification. That some aspects of a feature are in that unspecified implementation realm is not a valid criticism of the feature - all features have this property - so long as the observable behavior is well-defined.

The implementation complexity of a given feature will vary from browser to browser because of the prior implementation choices they have made. Some browsers may choose not to implement this optional feature as a result. That is all fine in a competitive market. It's a dangerous path, IMO, to say that the interoperability benefits of a standard specification should be denied to all players because one player finds the feature more difficult to implement than others.

@ddorwin (Contributor, Author) commented Jul 9, 2016

I think the next step is to get an updated response from the TAG (see my next comment). However, I want to address some points in @mwatson2's comment.

The goal/purpose of EME, from the beginning, was to provide access to Content Protection capabilities previously available only through plugins.

The purpose of EME, as stated in the first sentence of the Abstract, is to enable playback of encrypted content in HTMLMediaElements. This does not imply providing access to all content protection features available through legacy plugins. We have previously chosen not to expose other functionalities provided by such plugins (e.g. when they conflict with the principles or limitations of the web platform).

The feature is optional, so noone is required to implement it.

As has been discussed before, whether this feature is marked "optional" is meaningless when it is actually “required” - for the most basic access to content (online streaming) - by one of the leading users of the API.

The following is not a major issue, but as commented earlier the "in parallel" here is really a no-op. "In parallel" means only that pages cannot assume the steps will be complete before the next turn of the event loop. But in this case the effects of the steps can only be observed asynchronously (by loading a session and waiting for the release message), and in the page close case there may be no more turns of the event loop at all. Whether we say "in parallel" or not, it looks the same as far as the page is concerned.

Applications also assume that those steps will complete. If they do not, that is, as noted in this quote, observable by the application when it later loads the session.

The current text doesn't imply any particular level of reliability but does not constrain reliable implementation either. I can't see how the text constrains implementations (i.e., how it "does not provide the expected reliability").

Unless the feature or algorithm is labeled best effort - like other features to which it has been compared - the expectation must be the same as for other web features and algorithms - essentially 100% reliable. I don't believe the current spec text would ensure that. For example, there is no thread join or other mechanism specified to ensure the parallel steps complete before exiting. Either the spec should include such text or it should state that the feature is best effort. The latter would be inaccurate given the intended usage and Netflix's reliability target.

The main event loop in our specifications is an abstract thing which expresses the serialization of the execution of tasks for the page.

The main thread/event loop is fundamental to the web platform and synchronization. "The event loop" is defined in HTML5 and referenced in the unload algorithms. Behavior of other threads ("in parallel" steps) is not defined, and they are definitely not guaranteed to run to completion (see my paragraph immediately above).

Finally, as noted here, a possible implementation is that the CDM is provided with the necessary origin-specific context at initialization time and has the ability to post the storage task to some separate page-independent queue for later execution, even after the page has closed (as could also be done with ping).

While that is one possible implementation, that does not address these issues. The CDM is not defined as a special entity outside the scope of the browsing context, and any observable behaviors must be defined in terms of the context as with all web platform specs.

In general, I find that these objections confuse specification / platform issues with implementation issues.

I disagree, as I have outlined here and above. In addition to veering into areas, such as object destruction/inaccessibility, that are undefined in the web platform, the current text does not guarantee data will be persisted reliably. I believe it is clear that a simple literal implementation of the spec algorithms would not reliably persist the data without some unspecified synchronization mechanism. That is a specification / platform issue that has nothing to do with specific implementation(s).

It's a dangerous path, IMO, to say that the interoperability benefits of a standard specification should be denied to all players because one player finds the feature more difficult to implement than others.

It’s a dangerous precedent to let portions of a W3C Recommendation exist outside the defined web platform. “Interoperability benefits” seem questionable when the specification does not fully define all required behavior. Also, authors would benefit from a fully-specified and implementable specification that is widely available. Forcing the feature through as-is will not result in a consistent platform for authors. It also seems premature to finalize this feature in a Recommendation when, as far as we know, only one author has used it in production. (For example, see the potential issues for authors in the second paragraph of #85 (comment).)

Finally, most of the “benefits of a standard specification” can be achieved without including this feature in the v1 Recommendation. Specifically, nothing prevents maintenance of a proposed specification or implementers from implementing or experimenting with such a feature while work continues to explore the necessary platform hooks. Such an approach will benefit the web as a whole rather than introducing a significant new browser behavior only useful to a tiny fraction of sites.

@ddorwin (Contributor, Author) commented Jul 9, 2016

Since the last input from the TAG, the intended behavior has been further clarified (as outlined above). There appears to be disagreement on the implications and even the meaning of running steps "in parallel" and the "main event loop."

@travisleithead and @slightlyoff: Is the current text consistent with the TAG's existing understanding of the feature and feedback that "this feature doesn't seem to work well inside the currently spec'd web platform?" Does the TAG's "strong guidance is to move this to a V2" to allow investigation of the necessary hooks still apply?

@mwatson2 (Contributor) commented Jul 11, 2016

@ddorwin wrote:

As has been discussed before, whether this feature is marked "optional" is meaningless when it is actually “required” - for the most basic access to content (online streaming) - by one of the leading users of the API.

Our product plans should not really be relevant to this discussion, but for the record, Netflix is working on the Chrome browser today without this feature, so it is not "required" in that sense.

Regarding the "in parallel" steps to store the record of key usage:

Applications also assume that those steps will complete. If they do not, that is, as noted in this quote, observable by the application when it later loads the session.

A failure of this assumption would not be observable, since there is another reason why the data might not be present in a subsequent session (the user has cleared it), and the application cannot distinguish this reason from some kind of write failure on page close.

Unless the feature or algorithm is labeled best effort - like other features to which it has been compared - the expectation must be the same as for other web features and algorithms - essentially 100% reliable.

It is not required that this feature be 100% reliable. The level of reliability (essentially the frequency with which the write succeeds during normal page shutdown) is a quality-of-implementation issue. If it would help to note that in the specification, we could do so.

I don't see anything in the definition of "in-parallel" that implies those steps have a different status - in terms of whether they will run or not - compared to other steps. So, I still maintain that whether these steps are defined to be "in parallel" or not is unobservable. Still, they may certainly be carried out asynchronously with respect to other unload tasks and this is explicitly stated in a note.

@mwatson2 (Contributor) commented Jul 11, 2016

@ddorwin wrote:

Since the last input from the TAG, the intended behavior has been further clarified (as outlined above). There appears to be disagreement on the implications and even the meaning of running steps "in parallel" and the "main event loop."

I feel there is some disagreement as to the extent to which the web specifications are prescriptive. As I understand it, they are prescriptive in terms of observable behavior and no more.

So, when we say steps are run "in parallel", we are making a statement about the lack of serialization of observable effects with respect to other observable steps that are not "in parallel", but no more. The specifications do not say anything about when such "in parallel" steps run or what may cause them not to be run. There is nothing to say that the status of the main event loop (running or not) affects this. I don't think this silence can be interpreted as meaning all "in parallel" steps are best effort.

Equally, if steps are run in the main event loop, but their effects are not observable until later, implementations are free to execute the steps later, provided the observable behavior is unchanged.

Nevertheless, I agree the reliability of the feature in question is a quality-of-implementation issue and we could note as much in the specification.

@travisleithead (Member) commented Jul 12, 2016

Assuming there is no 100% reliability requirement, as has been conceded earlier, and there is no user agent requirement to notify the page of completion (of the write), as confirmed earlier, such that the side-effects would only be observable later (on the next page load or session), then I see this feature largely boiling down to a sendBeacon or a[ping]-style feature.

Using the "destructor" of a JavaScript object as a trigger is still a concern for me. In general, APIs should not be designed around the garbage collection semantics of script engines, nor make GC semantics visible. This is only a small concern for me because there are no observerable side effects that the page could learn. It's just the use of this odd behavior that concerns me. If there were another hook that could be used I'd feel much more comfortable, but as Alex and I have said already, this hook is probably years away from being specified.

Outside of that concern, if there remains an architectural problem around running the persistent record write task outside of the observable event loop, that problem should extend to sendBeacon and a[ping] as well. Certainly aspects of how these three features queue and dispatch their payloads outside the lifetime of a document are not precisely specified. Despite this, I'm not so sure there remains an architectural concern.

@travisleithead and @slightlyoff: Is the current text consistent with the TAG's existing understanding of the feature and feedback that "this feature doesn't seem to work well inside the currently spec'd web platform?"

"work well" - it seems to work well-enough as sendBeacon and a[ping] in my estimation. Is it "inside the currently spec'd web platform"? Well, not precisely. :-)

Does the TAG's "strong guidance is to move this to a V2" to allow investigation of the necessary hooks still apply?

Alex and I have previously said that the hooks needed for precise specification are years away. This is long enough that it extends beyond v1. The salient question is whether that precise specification is necessary, and given sendBeacon and a[ping], I'm not convinced that it is necessary, or we have to call into question those specs as well.

@mwatson2 (Contributor) commented Jul 13, 2016

@travisleithead wrote:

Using the "destructor" of a JavaScript object as a trigger is still a concern for me. In general, APIs should not be designed around the garbage collection semantics of script engines, nor make GC semantics visible.

The trigger is that the object "becomes inaccessible to the page", which is well-defined entirely in terms of the JavaScript objects that exist and the variable references to them (i.e., things visible to the page, independent of implementation). Now, of course, when the implementation detects this state depends on its GC implementation, but still this is not visible to the page, because the stored information is not visible until later.

I recently discovered that the unload a document steps include a hook for unloading document cleanup steps defined by other specifications. If the unload a document steps always occur, then an alternative formulation would be to make that the trigger for all unclosed MediaKeySession objects and also specify that browsers MAY close any unclosed MediaKeySession objects that are not accessible to the page at any time. This wouldn't result in any observable behavior difference, but might be a better specification formulation.
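
(Editorial sketch of that alternative formulation, in TypeScript with hypothetical names: the unload cleanup hook closes any still-open sessions, while the UA may also close inaccessible ones earlier.)

```ts
interface MediaKeySessionLike {
  closed: boolean;
  runDestroyedAlgorithm(): void;
}

const openSessions = new Set<MediaKeySessionLike>();

// Hook invoked from the "unload a document" steps (the extension point
// that other specifications may define cleanup steps for).
function unloadingDocumentCleanupSteps(): void {
  for (const session of openSessions) {
    if (!session.closed) session.runDestroyedAlgorithm();
  }
  openSessions.clear();
}

// Additionally, the UA MAY close a session that became inaccessible to
// the page at any earlier time (e.g., when GC detects unreachability).
function onSessionUnreachable(session: MediaKeySessionLike): void {
  if (!session.closed) session.runDestroyedAlgorithm();
  openSessions.delete(session);
}
```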

@travisleithead (Member) commented Jul 30, 2016

Note, the TAG has closed w3ctag/design-reviews#73 (comment). After review in our Stockholm F2F meeting, we have found no architectural concerns with the feature as currently understood.

@ddorwin (Contributor, Author) commented Aug 1, 2016

The following quotes are from the minutes of the referenced F2F.

@travisleithead said:

At least agreement that there's not a 100% requirement....Given not 100% reliability, and no side effects, I didn't see a super-strong concern.

@travisleithead, what is your understanding of the reliability requirement? @mwatson2 said, "we expect to receive secure release messages for at least 99% of sessions." The <= 1% of messages not received includes crashes, data clearing, etc., so Netflix expects the user agent implementation to be over 99% reliable. In other words, the implementation should be designed for 100% reliability. Even if we assume 99% implementation reliability, are you saying there is a meaningful difference between requiring 99% vs. 100%?

@slightlyoff said:

As long as it's isomorphic to sendBeacon and we're not adding extra constraints on shutdown, that's ok.

I'm not sure those assumptions are true. While sendBeacon() sends the origin, which is cached in the synchronous part of the algorithm, it does not actually perform any origin- or browsing-context-specific operations - certainly not in the asynchronous / "after the document has unloaded" steps. In contrast, this feature requires asynchronously storing origin-specific data that is generated by a CDM instance tied to the browsing context.

Also, note that there is no expected bad outcome for users of UAs that do not or cannot implement sendBeacon or <a ping> with 99-100% reliability, or that do not implement them at all. In contrast, for this feature, the former could lead to users being denied access to content (due to false positive concurrent stream detection), and the latter could lead to users being entirely denied access to content or to higher qualities of content.
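
(For contrast, a real sendBeacon usage - navigator.sendBeacon and the pagehide event are actual web APIs; the payload here is illustrative.)

```ts
// The destination and payload are captured synchronously while the page
// is alive; delivery afterwards is fire-and-forget, with no
// origin-specific work performed after unload.
window.addEventListener("pagehide", () => {
  const payload = JSON.stringify({ event: "session-end" });
  navigator.sendBeacon("/log", payload); // queued; may send after unload
});
```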

@slightlyoff said:

The open question -- got no response -- what is the limit? If you have to do behavioral monitoring of your users, what's the amount of information you need to get information you're obligated to expose. Seems the answer is "as much as possible".

@mwatson2, can you answer that open question?

@mwatson2 (Contributor) commented Aug 2, 2016

@ddorwin wrote:

@mwatson2 said, "we expect to receive secure release messages for at least 99% of sessions." The <= 1% messages not received includes crashes, data clearing, etc.

What I said was that we expect to receive secure release messages for 99% of sessions where either there is graceful shutdown of the session before the page is closed, or where the user revisits later without clearing data. So, the 1% does not include data being cleared or where the user hasn't revisited the site. I don't think we include crashes either, but I'd have to check that.

@mwatson2, can you answer that open question?

Yes, but since the TAG have closed their issue, I'm not sure how useful that's going to be. The title of this issue ends "...pending clarification from the TAG", which we now have, so I believe we should now close this one.

FWIW - and I don't recall the question being asked before, so I'm not sure who Alex was waiting for a response from - the specification requires recording of the key IDs that were used in the session and the first and last (wall clock) times content was decrypted by the session. This is sufficient for the use-cases we've discussed.

@ddorwin (Contributor, Author) commented Aug 2, 2016

On the first item, thank you for the clarification. However, I don't think that affects the point, implementation design targets, or questions.

On the second item, @mwatson2 has said that "The level of reliability... is a quality-of-implementation issue," but there must be some level that is required for the feature to be useful. In order to facilitate interoperability, app compatibility, and implementations that are useful for authors, implementers need to know the necessary [minimum] level of reliability. Currently, we have 99+%. I'll let @slightlyoff clarify his question, but I think one aspect is whether such behavior analysis could succeed with meaningfully lower levels of reliability, such as that of sendBeacon and <a ping>.

@mwatson2 (Contributor) commented Aug 2, 2016

Aside from the fact that it does not need to be 100% reliable (or five 9's or similar), I don't agree that Netflix's current requirements are relevant here.

The feature is proposed to be optional, so if a particular browser does not meet the requirements of a particular site at a particular time, the site can behave as if the feature were not present (which it has to support anyway, because the feature is optional).

FWIW, several implementors have reached a level of reliability where we find the feature is useful, so it is certainly possible with reasonable effort.

@ddorwin (Contributor, Author) commented Aug 3, 2016

Aside from the fact that it does not need to be 100% reliable (or five 9's or similar), I don't agree that Netflix's current requirements are relevant here.

The question was not necessarily about Netflix's current requirements. However, as the only author with implementation experience with this feature, Netflix's experience and usage is very relevant for the evaluation of the Candidate Recommendation.

The feature is proposed to be optional, so if a particular browser does not meet the requirements of a particular site at a particular time, the site can behave as if the feature were not present (which it has to support anyway, because the feature is optional).

Other than checking the user agent string, how would a site determine whether to "behave as if the feature were not present?" It seems like a design flaw for a web platform feature to a) have a critical dependency on reliability in order to be useful to applications, yet b) have no requirement or guidance for reliability, and c) provide no way to determine the level of reliability.

It is important to note that "behav[ing] as if the feature were not present" may (and likely does) include denying users access to content (or qualities of content) even though the implementation is fully capable of protecting the content and enforcing concurrent stream limitations.

FWIW, several implementors have reached a level of reliability where we find the feature is useful, so it is certainly possible with reasonable effort.

Two implementations use first-party OS-based DRMs that can write periodically rather than at teardown, and (as far as I understand) the third delays shutdown of the CDM and browser and stores the CDM data via a special path that is distinct from that used for other site data. We don't disagree that ~100% reliability is possible with such implementations. However, we don't think a W3C Recommendation should (effectively) force implementers down one of these paths in order to ensure their users have access to content on a handful of sites.

@slightlyoff commented Aug 3, 2016

To provide some color on the Stockholm consensus from the TAG meeting: what was debated was the extent to which the proposed feature pushes the boat out past what sendBeacon and <a ping> already provide. If the requirement is that logging be more reliable than those features, the TAG doesn't feel that the feature is in line with what could be explained in the near future and is therefore a risk to compatibility, platform coherence, and layering.

To the extent that folks here are OK with spec-ing this in terms of the language that <a ping> and sendBeacon() use for their processing, the feature seems roughly OK.

Hope that helps.

@paulbrucecotton commented Nov 1, 2016

@ddorwin - Please change the milestone for this issue from V1 to VNext since the "persistent-usage-record" feature is being removed from EME V1. See ISSUE-353.

@ddorwin modified the milestones (V1 → VNext) on Nov 1, 2016

@ddorwin (Contributor, Author) commented Nov 1, 2016

Moved to VNext per the above comment.
