
Getting all bytes in a body #661

Closed
domenic opened this issue Jan 12, 2018 · 14 comments · Fixed by #1172

Comments

@domenic
Member

domenic commented Jan 12, 2018

(This spans HTML, Encoding, Streams, and Fetch. Let's tag at least @ricea to help while we're here, and @jakearchibald since he was helpful in whatwg/infra#181.)

I need help figuring out how at a spec level to get all bytes in a body. This is part of solving whatwg/html#3316.

Right now specs are using a few patterns:

In general the problem here is Streams' use of the JS formalism, including promises, and how that interacts poorly with spec-level code, with its different conventions and type system. The confusion between Encoding's streams and Streams's streams is also tricky.

Looking at https://html.spec.whatwg.org/#fetch-a-classic-script my first thought is that it should be written as:

  1. Let body bytes be the byte sequence obtained by [reading all the bytes] of the [body] of response.
  2. Let source text be the result of [decoding] body bytes to Unicode, using character encoding as the fallback encoding.
  3. ... muted errors ...
  4. Let script be the result of creating a classic script given body bytes, source text, ... the other stuff.

This doesn't quite work on a few levels:

  • "reading all the bytes" would at the very least need to be asynchronous, if it goes through Streams's promise machinery.
  • "reading all the bytes" might fail. This is currently unaccounted for, but if we go through Streams's machinery, it's more explicit.
  • "reading all the bytes" will need to hand-wave to get from Uint8Array to byte sequence.
  • "decoding" doesn't accept byte sequences, only "byte streams".
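In JavaScript terms, the "reading all the bytes" operation being discussed amounts to pumping a stream's reader and concatenating the resulting Uint8Array chunks into one byte sequence. A minimal sketch (the name `readAllBytes` is illustrative, not spec text):

```javascript
// Pump a ReadableStream's reader to completion and concatenate the
// Uint8Array chunks into a single byte sequence.
async function readAllBytes(stream) {
  const reader = stream.getReader();
  const chunks = [];
  let total = 0;
  while (true) {
    const { done, value } = await reader.read();
    if (done) break; // end of stream
    chunks.push(value);
    total += value.byteLength;
  }
  // Copy all chunks into one contiguous Uint8Array.
  const bytes = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    bytes.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return bytes;
}
```

Note that this is inherently asynchronous (each `read()` returns a promise), and a stream error rejects somewhere in the loop — exactly the two points the list above calls out.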

So here is my proposal, which is essentially trying for a minimal delta from today:

  • We define "reading all the bytes", probably in whatwg/fetch. It returns either a byte sequence, or null to signal failure. Either it's asynchronous, or we use "wait" (see whatwg/infra#181, "Define async algorithms (or "spec promises"?)") to make it synchronous. It wraps up all the promise/Uint8Array stuff so that spec authors don't have to worry about it, hand-waving as appropriate.
  • We define in Encoding that "byte sequences" can be used as "byte streams" implicitly, when appropriate.

An alternate approach, which you might prefer, is to double down on the spec-level concept of a byte stream. We'd provide some way of translating a ReadableStream, and thus a body, into a spec-level byte stream. It creates a reader, locks the ReadableStream, and from then on, specs only manipulate the byte stream.

The hardest part of this, I think, is putting the asynchronicity of "reading" from the byte stream on solid ground. It seems very hand-wavey in Encoding, and most of Encoding's clients, right now. We could either:

  • Say that you can read "synchronously", in that it waits for more data to come in before "read from a byte stream" returns
  • or say that read needs to be asynchronous

This ties back to whatwg/infra#181 again. The problem with synchronous reads is that you can only do them from in-parallel sections, and I think in most cases "decode" does not run in those sections.

Indeed, the larger issue, where encode/decode are often used on strings/byte sequences instead of on "character streams"/"byte streams", seems pretty prevalent: see e.g. https://html.spec.whatwg.org/#form-submission-algorithm:encode or https://html.spec.whatwg.org/#navigating-across-documents:utf-8-decode. So maybe we need to do something about that anyway. Those cases wouldn't need to run in parallel.
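In JavaScript terms, the static usage those specs assume corresponds to the one-shot TextDecoder path: one call, a complete byte sequence, no carry-over state. A sketch, using a hardcoded UTF-8 sequence:

```javascript
// Static decode: the whole byte sequence is handed over at once, which is
// what most spec callers of "decode" actually have in hand.
const bytes = new Uint8Array([0xE2, 0x9C, 0x93]); // U+2713 CHECK MARK in UTF-8
const text = new TextDecoder('utf-8').decode(bytes);
```

This is the non-streaming case; Encoding's byte-stream model only really earns its keep when input arrives incrementally.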

@domenic
Member Author

domenic commented Jan 12, 2018

I guess I should also mention that, since this is blocking whatwg/html#3316 which is a quite old issue (going back to w3c/ServiceWorker#1212), maybe we should implement my minimal-delta proposal first, and then worry about cleaning up this larger space (especially the Encoding byte streams mess) later. I'd like @annevk's blessing on that route though first before sending PRs.

@annevk
Member

annevk commented Jan 13, 2018

Your approach sounds reasonable, though I'd appreciate if you could read through whatwg/html#1077 (comment) (and maybe earlier comments) and whatwg/encoding#128 to see if that changes your mind.

I agree that it would be good to have a shorthand for the body.

@annevk
Member

annevk commented Jan 13, 2018

Note that we do have https://fetch.spec.whatwg.org/#concept-read-all-bytes-from-readablestream which goes from ReadableStream to a byte sequence. And it's rejected if the stream somehow errors (which may not be what we always want I guess, that's a little unclear to me).

@ricea
Collaborator

ricea commented Jan 22, 2018

In the short term, it looks like a bit of cut-and-paste of "reading the body" language from the Fetch standard to the HTML standard will solve the immediate issue. Is that what you are proposing?

What I'd like to avoid is adding explicit support for synchronous reading to the Streams standard. That's a big can of worms.

My understanding is that the glue between the JavaScript world and the HTML world generally happens in the WebIDL standard. So in the longer term I'd like to see the ReadableStream glue moved from Fetch to WebIDL. I'd also like to add some extra glue for TransformStream, so we can avoid things like explicitly creating Promises and calling abstract operations in the Encoding standard.

In the next few days I'm planning to try a search-and-replace from "streams" to "token sequences" in the Encoding standard. Once we see what that looks like, we can make a judgement as to whether that is the way forward to resolve the streams/streams confusion.

@domenic
Member Author

domenic commented Jan 24, 2018

In the short term, it looks like a bit of cut-and-paste of "reading the body" language from the Fetch standard to the HTML standard will solve the immediate issue. Is that what you are proposing?

Basically yes, although I'd like to locate it somewhere more reusable. Also, it would bridge away from promises and into "spec async".

But yes, I think that's the right short-term plan. I'm more confused about the long-term vision.

What I'd like to avoid is adding explicit support for synchronous reading to the Streams standard. That's a big can of worms.

What about if it were done using a magic "wait" primitive which was only available "in parallel" (i.e. on background threads)? Well, still a pretty big can of worms, given you'd essentially be talking about manipulating the stream JS objects in parallel :-/. Yeah, let's not do that, at least not until I've really psyched myself up into believing it's clearly the best path forward...

My understanding is that the glue between the JavaScript world and the HTML world generally happens in the WebIDL standard. So in the longer term I'd like to see the ReadableStream glue moved from Fetch to WebIDL. I'd also like to add some extra glue for TransformStream, so we can avoid things like explicitly creating Promises and calling abstract operations in the Encoding standard.

Where it lives is less concerning to me than the model. Do we try to emulate Web IDL's style by treating streams as JS objects and then translating them into spec language? E.g. wrapping promises up into "spec async", or Uint8Arrays into Infra byte sequences. Or, do we try to say that there is some "byte stream" or "byte token sequences" concept that is more fundamental, and the JS API lives on top of?

whether that is the way forward to resolve the streams/streams confusion.

I think that'll be a great first step. However I'm also worried about the byte sequences/token sequences confusion I discovered, where lots of other specs treat Encoding's algorithms as if they operate on static byte sequences, whereas Encoding defines them as operating on streams/token sequences that can change over time.

I think if we had more explicit static byte-sequence-taking endpoints in Encoding, it'd be clearer how much of encoding really needs "streaming" (i.e. token sequences) and thus how much of Encoding needs asynchronicity and all the attendant complexity.

@ricea
Collaborator

ricea commented Jan 24, 2018

What about if it were done using a magic "wait" primitive which was only available "in parallel" (i.e. on background threads)? Well, still a pretty big can of worms, given you'd essentially be talking about manipulating the stream JS objects in parallel :-/. Yeah, let's not do that, at least not until I've really psyched myself up into believing it's clearly the best path forward...

I hadn't thought about that aspect. There needs to be a clear separation between the parts that are notionally taking place on some other thread and the parts that may be calling into JS. But it naively appears to be doable with the right wording; something like "await with JavaScript context Foo ...".

Where it lives is less concerning to me than the model. Do we try to emulate Web IDL's style by treating streams as JS objects and then translating them into spec language? E.g. wrapping promises up into "spec async", or Uint8Arrays into Infra byte sequences. Or, do we try to say that there is some "byte stream" or "byte token sequences" concept that is more fundamental, and the JS API lives on top of?

What's the cost of getting it wrong? My inclination is just to start with the text from the Fetch standard and expand on that. But obviously we shouldn't go down a dead-end if it's going to be hard to get back out.

However I'm also worried about the byte sequences/token sequences confusion I discovered, where lots of other specs treat Encoding's algorithms as if they operate on static byte sequences, whereas Encoding defines them as operating on streams/token sequences that can change over time.

I think they are probably just missing the:

  1. If input is given, then push a copy of input to stream.

step from decode() and the:

  1. If result is finished, convert output into a byte sequence, and then return a Uint8Array object wrapping an ArrayBuffer containing output.

step from encode(). Since I don't find it attractive to export the Encoding Standard's "stream" concept, I think it would be preferable to add a couple of algorithmic wrappers around the run algorithm for use by other standards.

I think if we had more explicit static byte-sequence-taking endpoints in Encoding, it'd be clearer how much of encoding really needs "streaming" (i.e. token sequences) and thus how much of Encoding needs asynchronicity and all the attendant complexity.

Everything in Encoding is synchronous, so there's no problem there. The kind of asynchronicity I need is an implicit conversion from a synchronous algorithm to an asynchronous one, so that the algorithms for transform stream support can just say "return" and the Streams standard will see a Promise resolved with undefined.

Encoding's concept of a "stream" is, technically speaking, a double-ended queue (deque). It's almost a simple queue; it only needs to be double-ended to support the "prepend" operation. No "streaming" as such is required, and I don't think browsers actually implement it this way; it just makes the algorithms clearer.
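The deque described above fits in a few lines. A sketch, where the class and method names are illustrative rather than the Encoding standard's own terms:

```javascript
// Encoding's "stream": reads pop from the front, "prepend" pushes back onto
// the front (e.g. to push back bytes after BOM sniffing), and producers
// append to the back.
class EncodingStream {
  constructor(tokens) {
    this.tokens = [...tokens];
  }
  read() {
    return this.tokens.length ? this.tokens.shift() : 'end-of-stream';
  }
  prepend(token) {
    this.tokens.unshift(token);
  }
  push(token) {
    this.tokens.push(token);
  }
}
```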

There's also the confusingly-named "stream" option to decode(), which doesn't "need" streaming but enables callers to implement it.
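That "stream" option in action: the decoder carries partial-sequence state across calls, so a multibyte character split between chunks still decodes correctly, which is how a caller layers streaming on top.

```javascript
// With { stream: true } the decoder holds back an incomplete trailing
// sequence instead of emitting a replacement character.
const decoder = new TextDecoder('utf-8');
// U+2713 CHECK MARK is 0xE2 0x9C 0x93; split it across two chunks.
const first = decoder.decode(new Uint8Array([0xE2, 0x9C]), { stream: true });
const second = decoder.decode(new Uint8Array([0x93])); // final call flushes
```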

@annevk
Member

annevk commented Apr 18, 2018

FWIW, I think "wait" is fine here. If some HTML feature invokes fetch, that'll go in parallel. The JavaScript involved would run in a service worker, but as the JavaScript writes things there, it'll then get copied over and be out of JavaScript's control, so we can perfectly wait for it in some HTML feature definition. What scenario are you thinking of where that would not be the case?

@annevk
Member

annevk commented Feb 8, 2021

I wonder if we should solve this together with #536, where, if all you need is the headers and the response body as a byte sequence, you'd simply register for specification callbacks that give you exactly that. The tasks that fetch queues were essentially meant to provide this, but the design was done before streams.

@annevk
Member

annevk commented Feb 15, 2021

As far as I can tell we need two operations here and both would make the overall setup more verbose...

  1. To read body incrementally, given a body body, an algorithm processBytes, an optional algorithm processEndOfBody, and an optional algorithm processError, ...
  2. To read body, given a body body, an algorithm processBody, and an optional algorithm processError, ...

Both 1 and 2 could be used by XMLHttpRequest so it would no longer have to talk about streams directly.

The algorithms would be invoked from a parallel queue so the caller will have to use "queue a global task" as needed. We could create a wrapper that does that as well I suppose. (Maybe that's better, we add "in parallel" to the names of the algorithms above and the names above would queue tasks instead. What realm though? The current realm? The magical realm we might add for class construction and such?)

However, Fetch has "process response end-of-body" which I suppose we could remove. But Fetch also needs the entire body to be readable for integrity metadata checks. That's what https://fetch.spec.whatwg.org/#concept-body-wait is for. Should that basically tee (and wait, to avoid returning it before we know if it's safe)? Otherwise we cannot read it twice, right?

@domenic
Member Author

domenic commented Feb 16, 2021

As far as I can tell we need two operations here and both would make the overall setup more verbose...

Hmm. Yeah, I think this is unavoidable. In particular, the typical flow even for simple cases like fetching scripts or for <link> is going to be something like

  1. Setup request.
  2. Fetch request, with processResponse set to the following steps given response:
    1. If response's type is "error", or response's status is not an ok status, then do some failure thing and abort these steps.
    2. Read response's body, with processBody set to the following steps given body:
      1. Do something with the byte sequence body.
      and with processError set to the following steps:
      1. Do some failure thing.

Note that body-reading failure is not handled today in specs, so this is an improvement.
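A rough JavaScript analogue of the flow above, under the assumption that "read body" maps to fully consuming the body; the callback names mirror the proposed processBody/processError hooks and are illustrative only:

```javascript
// Check the response first, then fully read the body, keeping body-reading
// failure separate from HTTP-level failure.
async function handleResponse(response, { processBody, processError, onNetworkFailure }) {
  if (response.type === 'error' || !response.ok) {
    onNetworkFailure(response); // the "do some failure thing and abort" branch
    return;
  }
  try {
    // "Read response's body": hand processBody the complete byte sequence.
    const body = new Uint8Array(await response.arrayBuffer());
    processBody(body);
  } catch (e) {
    processError(e); // body-reading failure, distinct from the branch above
  }
}
```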

The algorithms would be invoked from a parallel queue so the caller will have to use "queue a global task" as needed. We could create a wrapper that does that as well I suppose. (Maybe that's better, we add "in parallel" to the names of the algorithms above and the names above would queue tasks instead. What realm though? The current realm? The magical realm we might add for class construction and such?)

Oh, that's annoying. It would add an indentation level to the above I guess? Maybe instead we could do

Read response's body, with global set to some global, processBody set to the following steps given body:

I think we can safely assume the task source is always the networking task source. And yeah, when we have a magic global we could maybe omit the global... I should prioritize that.

However, Fetch has "process response end-of-body" which I suppose we could remove. But Fetch also needs the entire body to be readable for integrity metadata checks. That's what https://fetch.spec.whatwg.org/#concept-body-wait is for. Should that basically tee (and wait, to avoid returning it before we know if it's safe)? Otherwise we cannot read it twice, right?

Umm, hmm. I kind of doubt implementations are making a full tee of the stream just to do integrity checking? That is:

  • If you do fetch() with an integrity option, and then start reading from res.body as soon as it's available, I imagine either implementations don't do the integrity checking at all, or (more like the current spec) they put the entire response in memory, check the integrity, and then feed you a stream backed by memory instead of network. @ricea @yutakahirano, any insights on what we do?

  • If you do <script src="..." integrity="...">, I imagine the integrity checking is built in to the processBody steps.

@yutakahirano
Member

For fetch(), as currently specified, we don't run the process response callback until we check the SRI.

For scripts, we start running some tasks (e.g., compiling) before calculating the hash, but such tasks should be side-effect free, and we abort everything when we find an SRI error.

@annevk
Member

annevk commented Feb 17, 2021

@domenic and I discussed this on IRC as well and my tentative plan is to fully read response's body's stream and then set response's body stream to a new stream that user agents can enqueue chunks on. A buffering proxy that does SRI, if you will. This would not allow for doing work in parallel, but I think that is fine as that would be an optimization.
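That buffering-proxy plan can be sketched as: fully read the original stream, run the integrity check over the buffered bytes, and only then expose a new stream backed by those bytes. `checkIntegrity` here stands in for the SRI digest comparison and is a hypothetical callback, not Fetch's actual hook:

```javascript
// Buffer the whole body, verify it, then serve the verified bytes from a
// fresh in-memory stream so the caller never sees unverified data.
async function bufferedBody(stream, checkIntegrity) {
  const bytes = new Uint8Array(await new Response(stream).arrayBuffer());
  if (!(await checkIntegrity(bytes))) {
    throw new TypeError('Integrity check failed');
  }
  return new ReadableStream({
    start(controller) {
      controller.enqueue(bytes); // already verified; backed by memory, not network
      controller.close();
    },
  });
}
```

As the comment says, this forgoes incremental delivery: nothing reaches the consumer until the whole body has arrived, which is the "no work in parallel" trade-off accepted above.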

annevk added a commit that referenced this issue Feb 17, 2021
annevk added a commit that referenced this issue Feb 18, 2021
annevk added a commit that referenced this issue Feb 20, 2021
This refactoring allows specifications building on top of Fetch to
avoid having to directly integrate with Streams. It makes these
changes:

* body's transmitted bytes is removed. body's total bytes will likely
  be removed in #604.
* body's done and wait concepts are removed.
* It introduces three primitives to read a body: incrementally read
  (process each chunk), fully read (get all), and
  fully reading body as promise (get all as promise).
* Removed transmit request body in favor of HTTP-network fetch using
  incrementally read directly to transmit the request body.
* Fetch's processRequestBody and processRequestEndOfBody are no longer
  passed a request. The former is passed the chunk length for progress
  reporting. (Needed by XMLHttpRequest.)
* Fetch's processResponseEndOfBody when passed will now cause the body
  to be fully read and will pass the result in the second argument.
  (This means that callers can no longer read the body themselves.)
* Subresource Integrity changed slightly with Fetch handing it the
  bytes and integrity metadata and no longer a response. Essentially
  fetch fully reads the body, does the check, and then creates a new
  body from the same bytes to hand to the caller.

Subresource Integrity PR: w3c/webappsec-subresource-integrity#97.

XMLHttpRequest PR: whatwg/xhr#313.

Closes #661.
annevk added a commit that referenced this issue Feb 24, 2021
annevk added a commit that referenced this issue Feb 24, 2021
@jimmywarting

Hmm, I saw some integrity question earlier on another repo. I really thought that it was possible to process the readable stream before the integrity was even calculated, and that it would just throw an error at the very end of the stream if it did not match, instead of successfully closing the stream. Is this really how we want it to work?

// this only console logs 1 final chunk
fetch('https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.4.2/p5.min.js', {
  integrity: 'sha512-rCZdHNB0AePry6kAnKAVFMRfWPmUXSo+/vlGtrOUvhsxD0Punm/xWbEh+8vppPIOzKB9xnk42yCRZ5MD/jvvjQ=='
}).then(res => {
  const reader = res.body.getReader()
  function pump() {
    reader.read().then(val => {
      console.log(val)
      if (!val.done) pump()
    })
  }
  pump()
})

// whereas this logs multiple times
fetch('https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.4.2/p5.min.js').then(res => {
  const reader = res.body.getReader()
  function pump() {
    reader.read().then(val => {
      console.log(val)
      if (!val.done) pump()
    })
  }
  pump()
})

@annevk
Member

annevk commented Aug 15, 2022

@jimmywarting, please don't comment on long-closed issues with novel questions. I believe I've called this out a couple of times now, and it's starting to get a little annoying. Perhaps discuss it in https://whatwg.org/chat first next time, or ask on Stack Overflow.
