
Pull based streams: ability to read $num items #74

Closed · bmeck opened this issue Feb 8, 2014 · 6 comments

bmeck commented Feb 8, 2014

The ability to read a specific number of things at a time and leave remaining queued data is an important feature of streams. This is present in the low level OS APIs as well:

  • Unix read: http://man7.org/linux/man-pages/man2/read.2.html
  • AIO: http://man7.org/linux/man-pages/man7/aio.7.html , http://man7.org/linux/man-pages/man3/aio_read.3.html
  • Windows: http://msdn.microsoft.com/en-us/library/windows/desktop/aa365467(v=vs.85).aspx
  • IOCP: http://msdn.microsoft.com/en-us/library/windows/desktop/aa365198(v=vs.85).aspx (uses ReadFile)

In particular, use cases around reading binary file formats are greatly simplified if we do not have to maintain a buffer in addition to the OS buffer for a stream. Even if the resulting read does not contain all of the data we need, we can maintain a single buffer for the segment of the file we are processing, rather than one for that segment plus another in case we need to "unshift" (in user land) data that lies outside of it.

While we could just add user-land wrappers to do this, the situation gets murky around memory consumption and takes us a step away from back pressure. If we naively queue up chunks of memory in a user-land buffer, we can negate back pressure altogether, so a user-land buffering stream has to take care to preserve it. Beyond back pressure, if a stream is blocked by a long-standing operation (such as waiting on an HTTP response) before wanting more data, the user-land solution also wastes memory and some CPU (for re-forming a pull-based stream).

Currently, there is no way to work with the low-level buffer that backs a stream without also maintaining a separate buffer for pull-based workflows. I propose we include the ability to limit the maximum number of items that can be pulled off a stream when calling read.
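For concreteness, here is a minimal sketch of what consuming such a capped read could look like; the `ByteSource` interface and `readExactly` helper are hypothetical illustrations, not part of any existing spec:

```ts
// Hypothetical: read(maxBytes) resolves with at most maxBytes bytes and leaves
// anything beyond that in the stream's internal (OS-backed) queue.
interface ByteSource {
  read(maxBytes: number): Promise<Uint8Array | null>; // null signals EOF
}

// Read an exact number of bytes without any user-land queue: short reads just
// ask the source again, and surplus bytes are simply never pulled.
async function readExactly(source: ByteSource, n: number): Promise<Uint8Array> {
  const out = new Uint8Array(n);
  let filled = 0;
  while (filled < n) {
    const chunk = await source.read(n - filled); // never over-reads
    if (chunk === null) throw new Error("unexpected end of stream");
    out.set(chunk, filled);
    filled += chunk.length;
  }
  return out;
}
```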

domenic commented Feb 8, 2014

This is good data, thank you!

I would be especially interested in @isaacs's experiences with this, as I believe he has mentioned a few times that node's read(n), which returns <= n bytes or characters, is a mistake. (It also works awkwardly in Node's "object mode.") I've heard him say that read() + unshift(chunk) would be better (see #3).

I've also heard him say that, toward the backpressure strategy point, high water marks are probably unnecessary (see #13). The latter would, to me, indicate we'd probably want to expose even more low-level buffer/backpressure-management primitives, and force most of the work into userspace (see #24)---but without the hazards you mention which cause waste by maintaining a separate buffer. In this case the idea might be to allow building higher-level things like unshift(chunk) or readUntil(specificByteSequence) in userspace. (This is all kind of fuzzy though; don't try to pin me down on the details. I still don't fully understand how a zero-HWM-advocate looks at the world.)
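To make the userspace-layering idea concrete, here is a rough sketch of a `readUntil` helper built on a hypothetical `read()` + `unshift(chunk)` pair; the names and signatures are illustrative only:

```ts
// Hypothetical primitives: read() yields the next queued chunk (or null at EOF),
// unshift(chunk) puts unconsumed bytes back at the front of the queue.
interface UnshiftableSource {
  read(): Promise<Uint8Array | null>;
  unshift(chunk: Uint8Array): void;
}

// Read up to and including the first occurrence of `delimiter`,
// pushing any bytes after it back onto the stream.
async function readUntil(source: UnshiftableSource, delimiter: number): Promise<Uint8Array> {
  const parts: Uint8Array[] = [];
  while (true) {
    const chunk = await source.read();
    if (chunk === null) throw new Error("delimiter not found before EOF");
    const idx = chunk.indexOf(delimiter);
    if (idx === -1) {
      parts.push(chunk);
      continue;
    }
    parts.push(chunk.subarray(0, idx + 1));
    source.unshift(chunk.subarray(idx + 1)); // return the remainder to the stream
    // concatenate the collected parts
    const total = parts.reduce((n, p) => n + p.length, 0);
    const out = new Uint8Array(total);
    let offset = 0;
    for (const p of parts) { out.set(p, offset); offset += p.length; }
    return out;
  }
}
```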

On the other hand, the idea of mapping to low-level OS APIs is a very compelling one, and probably what drove Node in that direction in the first place. I am hesitant to do anything that takes us away from that. I am just unsure whether they might be an attractive nuisance, or something that is implemented inconsistently among streams, or only works for binary streams, or similar.

What has been your experience for these binary-file-format use cases when working with Node's APIs? Are they sufficient? Are they ergonomic? Is read(n) by itself enough, or do you need unshift(chunk) also? Would read() + unshift(chunk) be enough?

bmeck commented Feb 9, 2014

I am unfamiliar with using read(n) in "object mode". Generally, though, abstractions have a type that you are concerned with (similar to how typed arrays use views), and it may be handy to swap between them. This can be recreated with either unshift or read(n_bytes), though the latter may require generating views.

In nodejs/readable-stream#30 @TooTallNate suggests a peek instead. Having unshift causes me two concerns:

  1. Mutating the backing buffer means buffer payloads can change during read/unshift workflows. This can be useful, but it opens a can of worms: imagine unshifting something with external references that allow it to change (looking at you, fs). A short sketch of this hazard follows the list.
  2. unshift means that we must have a user-land backing structure to support pushing data back onto the stream. This gets even messier once you bring in WebWorkers/processes/etc. and potentially pass file descriptors around: the user-land backing structure cannot pass these around safely without buffering anything that may be needed by the final destination (this is still somewhat true for non-binary streams).
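A minimal sketch of the aliasing hazard from concern 1, assuming a hypothetical unshift and typed-array chunks that share a backing ArrayBuffer:

```ts
// Hypothetical primitives; the hazard is independent of the exact API shape.
interface UnshiftableSource {
  read(): Promise<Uint8Array | null>;
  unshift(chunk: Uint8Array): void;
}

declare const source: UnshiftableSource; // some byte stream, for illustration

async function demo(): Promise<void> {
  const chunk = await source.read();
  if (chunk === null) return;

  const header = chunk.subarray(0, 4);   // view we intend to keep
  source.unshift(chunk.subarray(4));     // shares chunk's backing ArrayBuffer

  // If anything else holds a view over that same ArrayBuffer (e.g. a pooled
  // fs buffer) and writes to it, both `header` and the re-queued bytes change
  // underneath the stream:
  new Uint8Array(chunk.buffer).fill(0);  // header and queued data are now zeroed
}
```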

A good start to working with binary data structures is .tar parsing. It is very well documented, well used, potentially extensible without creating new formats, and complex enough to give you some problems.
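For example, pulling one 512-byte ustar header block with an exact-size read might look like the sketch below; `ByteSource` and `readExactly` are the hypothetical helpers sketched earlier, and only two header fields are decoded:

```ts
// Hypothetical helpers, re-declared so this snippet stands alone.
interface ByteSource { read(maxBytes: number): Promise<Uint8Array | null>; }
declare function readExactly(source: ByteSource, n: number): Promise<Uint8Array>;

interface TarHeader {
  name: string; // entry name: offset 0, 100 bytes, NUL-padded ASCII
  size: number; // entry size: offset 124, 12 bytes, octal ASCII
}

async function readTarHeader(source: ByteSource): Promise<TarHeader> {
  const block = await readExactly(source, 512); // tar headers are 512-byte blocks
  const field = (offset: number, length: number) =>
    new TextDecoder().decode(block.subarray(offset, offset + length)).split("\0")[0];
  return {
    name: field(0, 100),
    size: parseInt(field(124, 12), 8),
  };
}
```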

As for the developer experience:

  • read(n) is amazing; readUInt32BE etc. are often not enough for binary file formats. Fields are frequently defined as a string of bytes of size n, and it is more convenient to read the entire field at once when a segment or header has a very specific size.
  • read(n) by itself is enough, but it leaves some horrifying problems if used for reading arbitrarily sized structures. Take reading a string in JSON: if you do not have unshift, you will either call read(1) a lot trying to find the ending " and destroy performance, or create a buffered stream with your own unshift (see the sketch after this list). Some thoughts around this have been a peek(n) and a readUntil(token), but both of these also have problems:
  • peek(n) has no clear means to signal that the buffered data is not enough and more should be queued
  • readUntil(token) does not handle context-free grammars well
  • Reading arbitrarily sized structures is fairly simple with unshift, as you can unshift any data that is not processed; however, a lot of care must be taken to deal with the problems above.
  • Processing programming languages means being able to parse LR(k) grammars. Creating a tailored streaming parser for these is something of a nightmare and often leaves me dealing with entire strings rather than stream-based parsing. I am currently looking into generators and how they might enable resumable recursive descent instead.
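As a sketch of the read(1) problem mentioned above (hypothetical `ByteSource` again; escape handling omitted), scanning for the closing quote one byte at a time costs one async round trip per byte:

```ts
interface ByteSource { read(maxBytes: number): Promise<Uint8Array | null>; }

// Without unshift, the only way to stop exactly at the closing quote without
// over-reading is to pull a single byte per call.
async function readJsonStringBody(source: ByteSource): Promise<string> {
  const bytes: number[] = [];
  while (true) {
    const b = await source.read(1); // one await per byte: the performance problem
    if (b === null) throw new Error("unterminated string");
    if (b[0] === 0x22) break;       // closing '"' (escape sequences ignored here)
    bytes.push(b[0]);
  }
  return new TextDecoder().decode(new Uint8Array(bytes));
}
```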

Overall I am not against unshift, but I would very much like to be able to read strings and segments without forcing user-land buffering/wrapping if possible.

tyoshino commented

Before joining here to define a single primitive set together with Domenic, I had been editing another streams spec at the W3C [1]. That spec basically focused on binary handling. Please take a look.

I'm planning to provide a read(n)-style interface as an extension for binary (or similar) streams in a separate doc.

Since the core primitive is going to support arbitrary objects, which in general are neither divisible nor joinable, introducing read(n) into the core would complicate it a lot. I actually tried to do so in the W3C version, and it was very hard. The only feasible way to introduce read(n) to the core is to define the "size" of every object as 1.

[1] https://dvcs.w3.org/hg/streams-api/raw-file/tip/Overview.htm
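To illustrate the "size of every object is 1" point, a count-based queuing strategy might look like the following (the `QueuingStrategy` shape here is a hypothetical sketch, not spec text):

```ts
// Hypothetical strategy hook: the stream calls size(chunk) when enqueueing and
// compares the running total against highWaterMark to decide on backpressure.
interface QueuingStrategy<T> {
  size(chunk: T): number;
  highWaterMark: number;
}

// Counting every chunk as 1 makes read(n) well defined for arbitrary objects:
// "n items" still means something even when chunks cannot be split or joined.
const countStrategy: QueuingStrategy<unknown> = {
  size: () => 1,
  highWaterMark: 16,
};
```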

tyoshino commented

My idea is, briefly (a rough sketch follows the list):

  • make the "is-readable" check overridable
  • override it with "n bytes or more ready, or not"
  • add a wait() variant that accepts a "size" argument
  • do binary cut/merge in the read() method to generate and return an ArrayBuffer of the exact size
  • don't touch the pull/push operations; the strategy works separately
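A rough sketch of how those pieces could fit together, with all names hypothetical (this is not spec text):

```ts
// Hypothetical byte reader built around the ideas above: wait(n) resolves once
// at least n bytes are queued, and read(n) cuts/merges queued chunks into an
// ArrayBuffer of exactly n bytes. Pull/push and the strategy are untouched.
class ByteReader {
  private queue: Uint8Array[] = [];
  private queued = 0;
  private waiters: { n: number; resolve: () => void }[] = [];

  // Called by the underlying source when data arrives.
  enqueue(chunk: Uint8Array): void {
    this.queue.push(chunk);
    this.queued += chunk.length;
    this.waiters = this.waiters.filter((w) => {
      if (this.queued >= w.n) {
        w.resolve();
        return false; // drop satisfied waiters
      }
      return true;
    });
  }

  // The "is-readable" check, overridden to "n bytes or more ready, or not".
  wait(n: number): Promise<void> {
    if (this.queued >= n) return Promise.resolve();
    return new Promise((resolve) => this.waiters.push({ n, resolve }));
  }

  // Cut/merge queued chunks to return an ArrayBuffer of the exact size.
  read(n: number): ArrayBuffer {
    if (this.queued < n) throw new RangeError("not enough bytes queued; wait(n) first");
    const out = new Uint8Array(n);
    let filled = 0;
    while (filled < n) {
      const head = this.queue[0];
      const take = Math.min(head.length, n - filled);
      out.set(head.subarray(0, take), filled);
      if (take === head.length) this.queue.shift();
      else this.queue[0] = head.subarray(take);
      filled += take;
    }
    this.queued -= n;
    return out.buffer;
  }
}
```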

tyoshino commented Oct 1, 2014

I believe almost all of the features discussed here have been implemented. See e601d69
