Clarify presence of requests that don't return a response #12

Closed
igrigorik opened this Issue Feb 25, 2015 · 48 comments

@igrigorik
Member

https://lists.w3.org/Archives/Public/public-web-perf/2015Feb/0065.html

There are some differences in the way browsers treat requests that don't return a response. FF Nightly and IE 11 both create PerformanceResourceTiming entries for these requests, whereas Chrome & Canary don't. The RT spec isn't explicit as to whether requests that miss a response should be included. -- @andydavies

@andydavies

OK so my 0.02c having pondered about this for a few days:

Requests that are made but fail should be included in the waterfall e.g. DNS, TCP connection, SSL/TLS negotiation failures.

Requests that aren't made because they fail a browser security check e.g. Mixed Content, CSP failures should not be included.

One thing I'm not clear on is how/where server pushed resources fit.

We should get @bluesmoon and @nicjansma views on this as they use RT in their RUM product

@nicjansma

@andydavies I think those criteria for what should be included are great.

There should probably also be a new field on the PerformanceResourceTiming interface to indicate that it is a failure (and possibly, some classification on why).

@igrigorik
Member

@andydavies how do FF/IE surface failed requests? I assume some of the timing values are set to 0, or left undefined?

@bluesmoon

This is FF 37:

404s.

IMG element: [screenshot]

XMLHttpRequest: [screenshot]

Duration is 0 for the next two, so it's rather misleading, especially if it took a while for the DNS failure.

DNS failure: [screenshot]

Timeout (though you have to wait for it to time out): [screenshot]

@igrigorik
Member

@bluesmoon 4XX/5XX are valid responses, I don't think we should treat them any differently from 2XX. That's worth clarifying in the spec... And for connection failures, it seems like we should provide the timestamps for the parts of the connection establishment that we were able to observe.

/cc @sicking @marcoscaceres

@bluesmoon

@igrigorik right... except that Chrome does not include 4xx/5xx.

For connection failures, it becomes a little complicated with cross-origin requests. For example, in the blackhole.wpt.org case, we could say that duration was 60s, but without connect start/end, we wouldn't know where the failure was... I suppose TAO is the only way to get that, except that for a resource that times out, there is no TAO header.

@igrigorik
Member

@bluesmoon yes, something we would need to fix in Chrome as well. Re, cross-origin: right, we would only surface duration for cross-origin, since we wouldn't know if TAO applies or not. For more detailed reports you have NEL.

@andydavies

I wrote a few tests for this - feel free to pick holes in them

For the DNS and TCP failures FF sets responseEnd to the same as startTime (or fetchStart) so duration is 0

responseStart is after responseEnd for the 404 case in FF too.
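
The two Firefox quirks described above (a zero duration on failed fetches, and a responseStart that is later than responseEnd) can be flagged mechanically. What follows is only a minimal sketch, assuming the entries have been copied into plain objects; `TimingSample` and `findAnomalies` are hypothetical names for illustration and are not part of the Resource Timing API.

```typescript
// Hypothetical plain-object copy of the PerformanceResourceTiming fields used here.
interface TimingSample {
  name: string;
  startTime: number;
  duration: number;
  responseStart: number;
  responseEnd: number;
}

// Returns a note for each quirk described in this comment that the entry shows.
function findAnomalies(e: TimingSample): string[] {
  const notes: string[] = [];
  if (e.duration === 0 && e.responseEnd === e.startTime) {
    notes.push("responseEnd equals startTime, so duration is 0");
  }
  if (e.responseStart > e.responseEnd) {
    notes.push("responseStart is later than responseEnd");
  }
  return notes;
}

// Values taken from the Firefox Nightly HTTP 404 results reported in this thread.
const ff404: TimingSample = {
  name: "http://andydavies.github.io/image.png",
  startTime: 4749.947469,
  duration: 0,
  responseStart: 4906.206245,
  responseEnd: 4749.947469,
};
const ff404Notes = findAnomalies(ff404);
```

For the Firefox 404 entry both checks fire, which matches the two observations above.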

I really need to run these tests through WPT so they're in a clean test environment

DNS Lookup Failure

http://andydavies.github.io/rt-tests/dns-failure.html

| | FF Nightly 38.0a1 (2015-02-22) OSX 10.9.5 | IE 11.0.9600.17501 Win 7 |
| --- | --- | --- |
| name | http://some … registeredyet.com/image.png | http://some … registeredyet.com/image.png |
| entryType | resource | resource |
| startTime | 330.10028 | 1267.113837 |
| duration | 0 | 60.99783632 |
| initiatorType | img | img |
| redirectStart | 0 | 0 |
| redirectEnd | 0 | 0 |
| fetchStart | 330.10028 | 1312.24597 |
| domainLookupStart | 0 | 0 |
| domainLookupEnd | 0 | 0 |
| connectStart | 0 | 0 |
| connectEnd | 0 | 0 |
| secureConnectionStart | 0 | |
| requestStart | 0 | 0 |
| responseStart | 0 | 0 |
| responseEnd | 330.10028 | 1328.111673 |

TCP Connection Failure

http://andydavies.github.io/rt-tests/tcp-connection-failure.html

| | FF Nightly 38.0a1 (2015-02-22) OSX 10.9.5 | IE 11.0.9600.17501 Win 7 |
| --- | --- | --- |
| name | http://192.0.2.0/image.png | http://192.0.2.0/image.png |
| entryType | resource | resource |
| startTime | 242.767033 | 228.2401814 |
| duration | 0 | 42025.49786 |
| initiatorType | img | img |
| redirectStart | 0 | 0 |
| redirectEnd | 0 | 0 |
| fetchStart | 242.767033 | 250.4388826 |
| domainLookupStart | 0 | 0 |
| domainLookupEnd | 0 | 0 |
| connectStart | 0 | 0 |
| connectEnd | 0 | 0 |
| secureConnectionStart | 0 | |
| requestStart | 0 | 0 |
| responseStart | 0 | 0 |
| responseEnd | 242.767033 | 42253.73804 |

HTTP 404 Failure

http://andydavies.github.io/rt-tests/http-404-failure.html

| | FF Nightly 38.0a1 (2015-02-22) OSX 10.9.5 | IE 11.0.9600.17501 Win 7 |
| --- | --- | --- |
| name | http://andydavies.github.io/image.png | http://andydavies.github.io/image.png |
| entryType | resource | resource |
| startTime | 4749.947469 | 125.0158889 |
| duration | 0 | 298.3351236 |
| initiatorType | img | img |
| redirectStart | 0 | 0 |
| redirectEnd | 0 | 0 |
| fetchStart | 4749.947469 | 125.3303143 |
| domainLookupStart | 4749.947469 | 125.3303143 |
| domainLookupEnd | 4749.947469 | 125.3303143 |
| connectStart | 4749.947469 | 125.3303143 |
| connectEnd | 4749.947469 | 125.3303143 |
| secureConnectionStart | 0 | |
| requestStart | 4750.353281 | 125.3880032 |
| responseStart | 4906.206245 | 421.8555012 |
| responseEnd | 4749.947469 | 423.3510125 |

@igrigorik
Member

@andydavies thanks, this is really helpful. On first pass, IE behavior seems to make sense.. Do you see any issues with it?

@andydavies

I think IE's behaviour provides more clarity but I wonder if it can be improved on.

In the DNS case there's no value for domainLookupStart; in the TCP case there are no values for domainLookupStart or domainLookupEnd, and no value for connectStart, even though those events happened.

I re-ran the tests in WPT, and captured the relevant RT entries - Details > Custom Metrics

The WPT waterfalls aren't quite right as they miss entries when DNS lookup or TCP connection fails (known area for improvement)

DNS Failure

IE11 - http://www.webpagetest.org/result/150304_3C_87597985c7a090fd9637b2e43d57afef/
FF 39 - http://www.webpagetest.org/result/150304_H6_17609db8069a102c8a5dcfc1ecf4afda/

TCP Connection Failure

IE11 - http://www.webpagetest.org/result/150304_72_66f3513711180fc8fadd1fc1b1c84e57/
FF 39 - http://www.webpagetest.org/result/150304_VT_eb206bf462912e4795f5303e9a6c7d67/

404 Failure

IE11 - http://www.webpagetest.org/result/150304_NE_bb2b93c502a6422cd130b061610fd477/
FF 39 - http://www.webpagetest.org/result/150304_XE_029be40b79555a9934747aa5e0b2c6ac/

@bluesmoon

We should also look into how 301s & 302s are handled. @gui-poa has done some research on this.

@andydavies

@bluesmoon As in, when the included resource redirects and the resource it redirects to fails for some reason?

@bluesmoon

well, there are probably several different cases we should look at. Failure being one of them. We also need to check whether both 301 & 302 responses actually show up, and whether it is different if the 301 state was cached by the browser.

@gui-poa
gui-poa commented Mar 5, 2015
  1. Could the redirect time, in Nav Timing, be measured between different subdomains? (www.example.com X m.example.com) Could be another reason to not use different urls (desktop x mobile). We dropped our m. redirect based on other redirect's time.

  2. I have some cases in Resource Timing that the transfer time is < 1 RTT. Could be cache, but I didn't find any doc about this.

@igrigorik
Member

In the DNS case there's no value for domainLookupStart; in the TCP case there are no values for domainLookupStart or domainLookupEnd, and no value for connectStart, even though those events happened.

Makes sense. As a general rule: we should report for all successful substeps up to the point where the failure occurred.
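
If entries were reported according to this rule, a consumer could work out how far a failed fetch got by finding the last non-zero timestamp. The sketch below is illustrative only: `MILESTONES`, `MilestoneTimes` and `lastReportedMilestone` are hypothetical names, and the sample values are made up to show a TCP connection failure as it might look under the proposed rule.

```typescript
// Timestamps in the order a fetch reaches them. responseEnd is left out
// because the IE 11 and Firefox data above set it even when the fetch fails.
const MILESTONES = [
  "fetchStart",
  "domainLookupStart",
  "domainLookupEnd",
  "connectStart",
  "connectEnd",
  "requestStart",
  "responseStart",
] as const;

type Milestone = (typeof MILESTONES)[number];
type MilestoneTimes = Record<Milestone, number>;

// Returns the last milestone with a non-zero value, or null if none was reported.
function lastReportedMilestone(times: MilestoneTimes): Milestone | null {
  let last: Milestone | null = null;
  for (const m of MILESTONES) {
    if (times[m] > 0) last = m;
  }
  return last;
}

// Made-up values: DNS completed, the TCP handshake started but never finished.
const tcpFailure: MilestoneTimes = {
  fetchStart: 10,
  domainLookupStart: 12,
  domainLookupEnd: 30,
  connectStart: 30,
  connectEnd: 0,
  requestStart: 0,
  responseStart: 0,
};
const reached = lastReportedMilestone(tcpFailure); // "connectStart"
```

This only applies where the detailed timestamps are exposed at all; for cross-origin fetches the intermediate values may be zero for reasons unrelated to failure.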

@gui-poa @bluesmoon redirects are a separate discussion, see: https://lists.w3.org/Archives/Public/public-web-perf/2015Feb/0059.html

@toddreifsteck
Member

Andy, thanks for the testing and issues for IE. I'll get those issues on our list.

It is possible that the DNS entries are accurate due to DNS caching depending on the testing methodology. Is the VM the browser runs on guaranteed to have a clean DNS cache?

Also, these tests seem useful for standards validation if we come to agreement on behavior.

@andydavies

@toddreifsteck Good point on the DNS front, I realised this after I posted the first set of numbers above, so I repeated the tests in WebPageTest which guarantees a clean DNS cache.

The WPT tests show the same behaviour; this is the IE11 TCP connection failure case, for example:

http://www.webpagetest.org/custom_metrics.php?test=150304_72_66f3513711180fc8fadd1fc1b1c84e57&run=1&cached=0

[
  {
    "connectEnd": 0,
    "connectStart": 0,
    "domainLookupEnd": 0,
    "domainLookupStart": 0,
    "fetchStart": 161.1152,
    "initiatorType": "img",
    "redirectEnd": 0,
    "redirectStart": 0,
    "requestStart": 0,
    "responseEnd": 42171.1321,
    "responseStart": 0,
    "duration": 42010.3838,
    "entryType": "resource",
    "name": "http://192.0.2.0/image.png",
    "startTime": 160.7483
  }
]
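
As a quick sanity check on the entry above, duration should equal responseEnd minus startTime, so the roughly 42 seconds spent waiting for the connection to fail is still visible even though every intermediate timestamp is zero. A small sketch of that check, with the values copied from the JSON above (`ie11Entry` and `computedDuration` are illustrative names only):

```typescript
// Subset of the IE 11 entry shown above, copied as a plain object.
const ie11Entry = {
  startTime: 160.7483,
  fetchStart: 161.1152,
  responseEnd: 42171.1321,
  duration: 42010.3838,
};

// duration is responseEnd minus startTime.
function computedDuration(e: { startTime: number; responseEnd: number }): number {
  return e.responseEnd - e.startTime;
}

// Floating-point subtraction is inexact, so compare with a small tolerance.
const durationMatches =
  Math.abs(computedDuration(ie11Entry) - ie11Entry.duration) < 1e-6;
```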

@igrigorik igrigorik added a commit that referenced this issue Mar 26, 2015
@igrigorik igrigorik surface failed fetches in the performance timeline
- fetches that are blocked (CSP, CORS, etc) are omitted
- fetches aborted due to network/other errors must be included
- failed fetches must surface initialized attributes up to point of
  failure

Closes #12.
0eb0f69
@igrigorik
Member

First run at resolving this: 0eb0f69 -- thoughts, feedback?

@toddreifsteck
Member

The update looks good however... thinking a bit more on this.. and noting the thought here to bubble on it for a day... Are there any privacy concerns for revealing network errors for fetches without a NEL registration or a Timing-Allow-Origin on file?

@igrigorik
Member

@toddreifsteck thanks, good point. A couple of related discussions... /cc @annevk

Are there any privacy concerns for revealing network errors for fetches without a NEL registration or a Timing-Allow-Origin on file?

TAO should apply, just as it does to any other PerformanceResourceTiming object:

  • Same origin fetches are implicitly allowed by TAO, and should surface relevant timestamps.
  • Cross-origin fetches are subject to: https://w3c.github.io/resource-timing/#cross-origin-resources ... Except, since this is a failed fetch and we don't have a TAO header to inspect, we should just assume that it's disallowed - i.e. we can return startTime and duration, but all other values are set to zero.

Does that sound reasonable?

p.s. FWIW, I think NEL registrations are orthogonal to this discussion..
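
To make the proposal above concrete, here is a rough sketch of what a failed fetch would expose under it. It reads the comment literally (for cross-origin failures only startTime and duration survive, everything else is zero); the exact set of zeroed attributes is an assumption, and `FetchTimes` and `exposeFailedFetch` are hypothetical names, not spec text.

```typescript
// Hypothetical plain-object stand-in for a resource timing entry.
interface FetchTimes {
  name: string;
  startTime: number;
  duration: number;
  fetchStart: number;
  domainLookupStart: number;
  domainLookupEnd: number;
  connectStart: number;
  connectEnd: number;
  requestStart: number;
  responseStart: number;
  responseEnd: number;
}

// A failed fetch carries no Timing-Allow-Origin header, so a cross-origin
// failure is treated as disallowed: only name, startTime and duration are kept.
function exposeFailedFetch(e: FetchTimes, isSameOrigin: boolean): FetchTimes {
  if (isSameOrigin) {
    return { ...e }; // same-origin is implicitly allowed, keep every timestamp
  }
  return {
    ...e,
    fetchStart: 0,
    domainLookupStart: 0,
    domainLookupEnd: 0,
    connectStart: 0,
    connectEnd: 0,
    requestStart: 0,
    responseStart: 0,
    responseEnd: 0,
  };
}
```

Passing `isSameOrigin` in as a flag keeps the sketch independent of how the origin comparison is done.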

@igrigorik igrigorik closed this in #19 Apr 29, 2015
@annevk
Member
annevk commented Apr 30, 2015

Historically we have not surfaced detailed error information even same-origin. This would be a change in security practices around that.

@igrigorik
Member

@annevk fwiw, a quick recap of what we're proposing here...

  1. Requests that don't initiate fetch (blocked by CSP, Mixed Content, etc) are omitted from timeline.
  2. Requests that initiate a fetch but fail, for whatever reason, are surfaced in the timeline:
    1. The application can already get startTime and duration on their own (e.g. time from dispatch to error callback), regardless of same or third party origin, so there is nothing new here. At a minimum, we can provide an entry with a URL, startTime, and duration fields.

So, the question is whether more detailed timing data is available for failed requests (2i) -- correct?

It seems odd to me that we would provide detailed timing data for successful fetches, but then omit it for failed ones. If the concern is that someone could use said timing data on failed fetches to infer something about the user, or their network, then it seems that this same argument should apply to successful fetches... As in, I don't think we're exposing any additional surface area, as far as security/privacy is concerned, as long as we follow the TAO model. But, perhaps you disagree?

@annevk
Member
annevk commented May 1, 2015

I guess the difference is that you can figure out where the failure happens since you provide a more detailed breakdown in the comments above, such as DNS lookup time.

@bluesmoon

That only happens if Timing-Allow-Origin is on, which is the intention. Note that with DNS or TCP failures, there will be no TAO header since there are no headers, so that information doesn't come back. See the full conversation for details of this.

@annevk
Member
annevk commented May 1, 2015

@bluesmoon I was discussing same-origin, not cross-origin. It is my understanding TAO implicitly allows same-origin and therefore effectively does not apply.

@bluesmoon

yes that's right, but I don't believe this raises any new security concerns.

If you're able to get timing information of a same-origin resource, then you already have control over the HTML of the page either because you own the page or because you have already XSSed it.

For an attacker, it doesn't matter if the resource you're trying to time is successful or an error. Can you get DNS, TCP & SSL timing? Doesn't matter because you can get that from navtiming of the page itself, which you have control over. Are you trying to check if the user's ISP/client has a different DNS/TCP timeout than the main page? You can get that with a successful resource as well.

For a site owner however, the benefits of having this are way more important -- I can report and alert on whether my site has problems. And there are no downsides. Every time we talk to site owners about resource timing, and once they understand it, they want to know if it will tell them which resources aren't responding, because very few people care about timing the resources that work correctly.

@igrigorik
Member

@annevk: any thoughts on @bluesmoon's response? I'm reopening the ticket to make sure we come to a resolution. FWIW, I agree with @bluesmoon.

@igrigorik igrigorik reopened this May 6, 2015
@annevk
Member
annevk commented May 7, 2015

I'm not sure. I'm curious what each browser's security team has to say. Since historically this kind of stuff (e.g. exposing this for XMLHttpRequest) has blocked on them.

@toddreifsteck
Member

I also agree with @bluesmoon. Having Microsoft security folk take a look at this to ensure they don't have an alternate opinion.

@toddreifsteck
Member

There was no significant pushback on this proposal but that doesn't necessarily make revealing this information secure.

@igrigorik igrigorik self-assigned this Sep 9, 2015
@plehegar plehegar added this to the V1 milestone Mar 3, 2016
@wesleyhales
Contributor

Trying to get this one wrapped up.

As I caught up from over a year's worth of history, I created the following summary:
Today applications must register explicit error callbacks on every fetch and element request. When a failure occurs and the error callback is invoked, the platform provides little or inaccurate information about why it failed.
To identify and resolve these problems, the platform will handle failed fetches in the following manner:

  • If a resource fetch was aborted due to a networking error (e.g. DNS, TCP, or TLS error), then the fetch would be included as a PerformanceResourceTiming object in the Performance Timeline with initialized attribute values up to the point of failure - e.g. a TCP handshake error should report DNS timestamps for the request, and so on.
  • If a resource fetch is aborted because it failed a fetch precondition (e.g. mixed content, CORS restriction, CSP policy, etc), then this resource will not be included as a PerformanceResourceTiming object in the Performance Timeline.
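
The two rules above amount to a simple classification, sketched below. The outcome labels and the `appearsInTimeline` name are invented for illustration; HTTP error statuses and timeouts are counted as included because earlier comments in this thread treat 4XX/5XX as valid responses and list timeouts among aborted fetches that are still queued.

```typescript
// Hypothetical labels for how a resource fetch ended.
type FetchOutcome =
  | "success"
  | "http-error" // 4XX/5XX: a valid response
  | "dns-error"
  | "tcp-error"
  | "tls-error"
  | "timeout"
  | "mixed-content"
  | "cors"
  | "csp";

// True when the fetch should produce a PerformanceResourceTiming entry.
function appearsInTimeline(outcome: FetchOutcome): boolean {
  switch (outcome) {
    case "mixed-content":
    case "cors":
    case "csp":
      return false; // failed a fetch precondition, so it is omitted
    default:
      return true; // the fetch was initiated, so it is surfaced
  }
}
```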

With this detailed failure data being reported accurately, security and privacy implications are said to be no different than they are today. The risks are outweighed by the benefits of application reliability and better monitoring.

@annevk
Member
annevk commented Apr 26, 2016

With this detailed failure data being reported accurately, security and privacy implications are said to be no different than they are today.

Based on what, exactly?

@plehegar plehegar modified the milestone: V2, V1 Apr 26, 2016
@plehegar plehegar added the security label Apr 26, 2016
@wesleyhales
Contributor

Based on @bluesmoon's comment above and @igrigorik's and @toddreifsteck's agreement with it. Also, Microsoft is the only one who came back without any pushback on their side. We need input from at least one other browser imo.

@annevk
Member
annevk commented Apr 26, 2016

None of those are browser security engineers.

@wesleyhales
Contributor
wesleyhales commented Apr 26, 2016 edited

Ok, so we can count Microsoft's sec team as one. Correct? It took @toddreifsteck 4 months to come back with:

There was no significant pushback on this proposal but that doesn't necessarily make revealing this information secure.

From Microsoft's security team.
If we're going to do this again (hopefully with a quick turnaround), it would be great to know who else should review this. Suggestions?

@toddreifsteck
Member

I believe this is waiting on a review by Google and/or Mozilla's security folks. @igrigorik @annevk

@igrigorik
Member

Pinged our security and privacy folks as well. My summary of the above discussion and the proposal: https://bugs.chromium.org/p/chromium/issues/detail?id=460879#c11

@yoavweiss
Contributor

Seems like there are no concerns from Chrome's privacy team: https://bugs.chromium.org/p/chromium/issues/detail?id=460879#c13

@igrigorik
Member

@annevk as @yoavweiss already mentioned, Chrome's security and privacy team is good with proposed behavior, same for Edge (#12 (comment)). I don't see any activity on the mozilla.dev.security thread. Have you had any feedback or discussions via other channels?

Unless we hear strong pushback, proposed next steps:

  • Document more clearly the desired behavior in the processing section of the spec
  • Add example(s) illustrating the behavior for failed requests
  • Extend privacy/security section to explain the decision
@plehegar plehegar added the privacy label Jun 1, 2016
@toddreifsteck
Member

For transparency, IE11 still has these failures in resource timing, but Edge 14 does not currently expose them. [During refactoring and bug fixing (and without tests to guarantee these behaviors), they go away.... Sigh...]

@toddreifsteck toddreifsteck assigned igrigorik and unassigned igrigorik Jun 1, 2016
@annevk
Member
annevk commented Jun 7, 2016

FWIW, I haven't heard much through other channels. If the conclusion is indeed going to be that we're doing this, Fetch should be refactored as well to expose these errors and then APIs can decide whether to expose errors at this granularity.

@igrigorik
Member

@wesleyhales for some reason github won't let me assign this one to you (cc @plehegar). Are you still up for taking a first run at a PR for this one?

@wesleyhales
Contributor

@igrigorik should be fixed now go ahead and assign. I'll take a stab at first run.

@igrigorik
Member

Great, thanks Wes!

@wesleyhales
Contributor
wesleyhales commented Jul 31, 2016 edited

My security wording might be a little weak. You summed it up well with your fetch precondition examples in the Resources Included section. Should I be more descriptive here? It really comes down to 1) is the fetch same origin 2) and/or does it have TAO opt in.

Finally, are we doing a fetchFailed flag? I have in my notes that the consensus was on a flag that was set to true by responseStart and responseEnd both being equal to zero. Should we add this to the spec?
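
For reference, the flag as described in those notes would reduce to something like the sketch below. This is only an illustration of the rule as stated here (both responseStart and responseEnd equal to zero); `fetchFailed` is the name floated in this comment, not an attribute that exists in the spec, and `ResponseTimes` is a hypothetical type.

```typescript
// Hypothetical minimal view of an entry: only the two fields the rule uses.
interface ResponseTimes {
  responseStart: number;
  responseEnd: number;
}

// Proposed rule: the fetch is considered failed when both values are zero.
function fetchFailed(e: ResponseTimes): boolean {
  return e.responseStart === 0 && e.responseEnd === 0;
}

// The IE 11 TCP failure entry captured earlier in this thread.
const ie11TcpFailure: ResponseTimes = { responseStart: 0, responseEnd: 42171.1321 };
const flaggedByRule = fetchFailed(ie11TcpFailure); // false
```

Note that the IE 11 entry captured earlier reports a non-zero responseEnd for a failed fetch, so this rule on its own would not flag it.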

@igrigorik
Member

Hmm, I think this is heading in the right direction but I'm wondering if we should be more explicit?

For example...

  • In Processing model (5.1) add a clause after step 1 that an aborted fetch should immediately go to step 18 (queue the record). That would make clear that records are always queued, regardless of whether the request was aborted due to timeout, protocol error, or 'failed' status code.
  • Update each of "on getting" clauses for each attribute under PerformanceResourceTiming Interface to explicitly state what we expect to see.
    • E.g. "On getting, the domainLookupEnd attribute must return as follows: ... ~The time immediately before the user agent aborts the request due to failed resolve request for same-origin resource, and zero otherwise", and so on.

Also, we'll need carveouts for blocked fetches due to CORS, mixed content, etc.

@plehegar @toddreifsteck wdyt?

@igrigorik igrigorik referenced this issue in w3c/performance-timeline Sep 18, 2016
Closed

performance.getEntries() can't get failed requests. #59

@igrigorik
Member

Discussed this at TPAC today with Todd & Yoav...

  • Removing privacy + security labels. We addressed those questions earlier in the thread and proposed solution (#12 (comment)) passed Edge/Chrome's reviews.
  • We need to address 'status code' as part of this work as well; developers need a way to distinguish successful requests from 4xx/5xx and errors. We'll open a separate bug for this. @wesleyhales would you be up for driving that one as well?
  • I've created an L2 branch (https://github.com/w3c/resource-timing/tree/V2) and main branch (gh-pages) will track L3 work moving forward. I'm merging this into L3.
@igrigorik igrigorik closed this in ed1689f Sep 20, 2016
@igrigorik igrigorik modified the milestone: Level 3, Level 2 Sep 20, 2016