rsReader.readBatch(numberOfChunks)? rbsReader.read(view, { waitUntilDone: true })? #320

Closed
domenic opened this Issue Apr 7, 2015 · 50 comments

5 participants
@domenic
Member

domenic commented Apr 7, 2015

Some ideas came up in conversation with @wanderview in IRC today and I thought would be worth logging.

The first is a potential readBatch(numberOfChunks) method that would allow you to wait for multiple chunks at once. This would presumably return a promise fulfilled with an array; an empty array, or an array shorter than numberOfChunks, would mean you reached the end of the stream. The motivation here was performance (on the theory that it'd be more efficient for the underlying source to assemble an array of chunks and deliver it in one promise, rather than having consumer code extract from multiple promises and then reassemble into an array). But performance arguments are weak without data, so my instinct is to push this off into the future.
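A consumer of such a method might look like the following sketch. readBatch and its array-fulfilled promise are assumptions taken from the proposal above, modeled here with a plain in-memory object rather than a real stream.

```javascript
// Hypothetical sketch of the proposed readBatch(numberOfChunks) API,
// backed by an in-memory chunk list. Not spec'd API.
function makeBatchReader(chunks) {
  let pos = 0;
  return {
    // Resolves with up to numberOfChunks chunks; a short (or empty)
    // array signals end of stream.
    readBatch(numberOfChunks) {
      const batch = chunks.slice(pos, pos + numberOfChunks);
      pos += batch.length;
      return Promise.resolve(batch);
    }
  };
}

// Usage: drain the stream one batch (one promise) at a time.
async function drain(reader) {
  const out = [];
  for (;;) {
    const batch = await reader.readBatch(4);
    out.push(...batch);
    if (batch.length < 4) return out; // short batch => end of stream
  }
}
```

The point of the shape is that the consumer pays one promise per batch instead of one per chunk.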

The second was the question of how read(view) works for BYOB readable byte stream readers. I had been envisioning it as a direct pass-through to the underlying byte source. For files, for example, you'll usually get back view.byteLength bytes (unless you're at the end of the file), whereas for sockets you'll get back some unknown number of bytes <= view.byteLength. However, we could instead say that the underlying source needs to either fill the entire view or close the stream. (I'm not sure we could enforce this programmatically, but it could be something we advise underlying source implementations to do, and in particular the fetch body stream.) We could even add an option that lets you toggle between the behaviors.
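The distinction between the two behaviors can be sketched like this. The reader here is a hypothetical in-memory model of a BYOB-style source that delivers partial, socket-like fills; the { waitUntilDone: true } behavior is emulated on top of it by looping.

```javascript
// Hypothetical in-memory BYOB-style reader whose source delivers partial
// fills (socket-like). makeByteReader and its shape are illustrative
// assumptions, not the spec'd API.
function makeByteReader(bytes, chunkSizes) {
  let pos = 0;
  let call = 0;
  return {
    read(view) {
      if (pos >= bytes.length) {
        return Promise.resolve({ value: view.subarray(0, 0), done: true });
      }
      // Fill at most the next chunk's worth of bytes into the caller's view.
      const n = Math.min(
        chunkSizes[call++ % chunkSizes.length],
        view.byteLength,
        bytes.length - pos
      );
      view.set(bytes.subarray(pos, pos + n));
      pos += n;
      return Promise.resolve({ value: view.subarray(0, n), done: false });
    }
  };
}

// Emulates the proposed { waitUntilDone: true } behavior on top of
// partial reads: keep reading until the view is full or the stream closes.
async function readFully(reader, view) {
  let filled = 0;
  while (filled < view.byteLength) {
    const { value, done } = await reader.read(
      new Uint8Array(view.buffer, view.byteOffset + filled, view.byteLength - filled)
    );
    if (done) break; // stream closed before the view was full
    filled += value.byteLength;
  }
  return filled; // < view.byteLength only if the stream closed early
}
```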

@wanderview

Member

wanderview commented Apr 7, 2015

Partial motivation for this is to let the client code amortize the cost of .read() Promises across more stream data.

@domenic

Member

domenic commented Apr 7, 2015

Right, that's what I meant by the parenthetical:

the theory that it'd be more efficient for the underlying source to assemble an array of chunks and deliver it in one promise, rather than having consumer code extract from multiple promises and then reassemble into an array

But again, without data showing that promise costs are problematic, this seems premature.

The read(view, { waitUntilDone }) issue is interesting independently of performance, though, IMO. If only because there's no way to reassemble multiple array buffers into a single array buffer without copying -_-.

@tyoshino

Member

tyoshino commented Apr 8, 2015

readBatch(): Though the situation has changed (now read() is async), this is basically a revival of #74. In http://lists.w3.org/Archives/Public/public-webapps/2013JulSep/0481.html, I proposed this functionality as readableThreshold, FYI.

waitUntilDone: This thread #177 (comment) is related. Basically I don't think we need the option.

fill the entire view or close the stream

By "close the stream", do you mean "close the stream after filling the view with the bytes read though that is smaller than originalView.byteLength"?

@domenic

Member

domenic commented Apr 8, 2015

Basically I don't think we need the option

Any particular reasoning?

By "close the stream", do you mean "close the stream after filling the view with the bytes read though that is smaller than originalView.byteLength"?

Right, indeed.

@wanderview

Member

wanderview commented Apr 9, 2015

So I agree that promises are probably not the end of the world in these two scenarios:

  • stream pipeline throttled by source I/O
  • stream pipeline throttled by sink I/O

But I don't think that describes every scenario. It's quite conceivable to envision stream sources and sinks that are completely memory-driven. Should pipelines based on these types be pessimized?

The classic example here is where you write data to a pipe. The chunks go into a memory buffer until it's "full", causing back pressure on the source. I/O occurred when populating the pipe, but reading from the pipe may find many chunks available without any I/O.

In these conditions I think the requirement that every read() produces a separate Promise is quite expensive. The cost here is mainly the object creation and the required async microtask scheduling.
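The cost being described can be sketched minimally as two draining styles over an already-buffered queue; the names are illustrative, not spec API.

```javascript
// The queue stands in for a pipe whose chunks are already buffered
// in memory, so no I/O masks the per-read overhead.
function drainSync(queue) {
  const out = [];
  while (queue.length) out.push(queue.shift()); // no per-chunk allocation or scheduling
  return out;
}

async function drainAsync(queue) {
  const out = [];
  // One promise object plus one microtask turn per chunk, even though
  // the data is already in memory -- this is the overhead under discussion.
  while (queue.length) out.push(await Promise.resolve(queue.shift()));
  return out;
}
```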

In IRC @domenic asked for citation that the promise approach was "much slower" than a sync read in these cases. So I wrote a small test to simulate reading from a pipe or memory source:

https://github.com/wanderview/streams-promise-read

You can execute the test by visiting this page and looking at your js console:

https://blog.wanderview.com/streams-promise-read/bluebird.html

Here are some cleaned up results from firefox nightly running on my beefy MBP.

   10 chunks in pipe: sync-read: 18.8 us/chunk vs. promise-read: 625.4 us/chunk
  100 chunks in pipe: sync-read:  0.3 us/chunk vs. promise-read:  20.5 us/chunk
 1000 chunks in pipe: sync-read:  0.2 us/chunk vs. promise-read:  28.1 us/chunk
10000 chunks in pipe: sync-read: 0.03 us/chunk vs. promise-read:   7.4 us/chunk

The real kicker here for me is that it takes over 6ms just to drain a pipe with 10 chunks. That's a very realistic workload for pipes and a huge amount of time considering what it's doing.

@wanderview

Member

wanderview commented Apr 9, 2015

Also consider many streams running in parallel. While each stream is throttled at I/O, collectively they are putting a larger load on the system due to increased object creation and stack unwinding/rewinding. I might try to write a test for this case as well.

@wanderview

Member

wanderview commented Apr 9, 2015

@domenic correctly pointed out on IRC that I need to prime the jit before my first measurement. Doing this gives more realistic results for the 10 chunk case:

   10 chunks in pipe: sync-read: 0.5 us/chunk vs. promise-read: 39.3 us/chunk

So ~0.4ms to read 10 chunks buffered in a pipe. Still much slower than the sync case, but not so slow in absolute terms as to be considered ridiculously broken.

@tyoshino

Member

tyoshino commented Apr 9, 2015

#320 (comment)

Any particular reasoning?

Sorry, it seems I forgot to write the reason; I stepped away, came back, and just clicked the button. I guessed the overhead is comparatively small, as we're already passing chunks (ArrayBufferViews), while for object streams we only have a non-batch, one-by-one reading interface. That said, I agree that a benchmark helps.

@tyoshino

Member

tyoshino commented Apr 9, 2015

The result of Ben's benchmark is similar on Chrome Canary.

@domenic

Member

domenic commented Apr 9, 2015

A version of Ben's benchmark using a benchmarking tool like benchmark.js gives drastically different results (wanderview/streams-promise-read#1); still investigating.

@wanderview

Member

wanderview commented Apr 9, 2015

Thanks for the benchmarking improvements! I merged that PR and also modified it to make testing easier on mobile. You can now see results in the window and control the number of chunks to test with using ?chunks=100 in the URL.

@wanderview

Member

wanderview commented Apr 10, 2015

So, I was really curious why we were seeing ~210 ops/sec for the sync case, but upwards of 8000 ops/sec for the native promise case in Firefox.

I investigated with bz's help, and we discovered that benchmark.js's deferred.resolve() calls setTimeout(). This means that using deferred: true in a synchronous fashion, as we were, will always be limited by the nested-setTimeout() clamp of 4ms. A 4ms clamp means an upper bound of 250 ops/sec; add in some overhead and we can very easily see where the ~210 ops/sec was coming from.

I changed the sync case to do Promise.resolve().then(function() { deferred.resolve() }). This uses the microtask schedule to break the nested setTimeout() loop. This allows the sync case to run at 16000 ops/sec on my desktop.

Unfortunately, bluebird promises are unavoidably going to be limited by the setTimeout() clamp. So that library still starts at ~210 ops/sec, but its performance characteristics allow it to do better than native at higher chunk counts.
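The scheduling difference described above can be sketched as follows; microtasks run before any pending timer callback, which is why the Promise.resolve().then(...) trick breaks out of the nested-setTimeout() clamp. The function names here are illustrative.

```javascript
// Two ways to defer work to "later".
function microtaskDefer(fn) {
  // Runs on the microtask queue: executes before any pending timers,
  // and is not subject to the nested-setTimeout() 4 ms clamp.
  Promise.resolve().then(fn);
}

function timeoutDefer(fn) {
  // Runs as a macrotask; browsers clamp nested zero-delay timers to >= 4 ms,
  // capping a timeout-driven loop near 250 iterations per second.
  setTimeout(fn, 0);
}
```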

I put full test results in this google doc:

https://docs.google.com/spreadsheets/d/1rl6mbD2z1x1bgJLD6y9KJLYWjppB7BujfiWvUMjYTVs/edit?usp=sharing

To summarize:

With a modest 4 chunks in the pipe, an async read() takes about 4 times longer to consume the pipe than a sync read() would. In absolute terms, the results show that the mobile devices I tested take longer than 1ms to consume this small 4-chunk pipe.

Based on these results I think we should consider providing a sync read() on getReader(). The ReadableByteStream can offer an async read(view) on the getByobReader(). I feel very strongly that we should not pessimize the common case to accommodate the aesthetics of the ByobReader optimization.

Note: I don't think this test shows any problem with Promises themselves. I think the same issues would be present with callbacks. It's just difficult to test the callback case without hitting the setTimeout() clamp.

Bonus Note: The test case now parses the URL to get values from the search params. URLSearchParams is not yet available in Chrome, though, so this doesn't work there. If anyone could provide a better solution I'd love to run these tests on Chrome as well.

@domenic

Member

domenic commented Apr 10, 2015

Thanks for the investigation into the benchmarks. They're tricky indeed, and I don't feel that we've really got anything rigorous yet. But at least we have some idea of the range.

I'm not opposed to adding a readBatch(). It has some nice symmetry with writev or similar (#27). However, sync read() is not very sensible at this point: to actually use it, you need an async wait-for-readable to accompany it. You might as well fold the async wait-for-readable plus the sync read-until-empty into a single async readBatch().

I feel very strongly that we should not pessimize the common case to accommodate the aesthetics of the ByobReader optimization.

And I feel equally strongly that you are misrepresenting "the common case." You've found very specific scenarios where this will matter. In particular:

  • The consumer (e.g. writable stream) must be slower than the reader, so as to cause backpressure to accumulate chunks in the buffer
  • There must be no I/O involved at any point in the chain---the data must be entirely generated in memory---otherwise the I/O cost would dwarf the read() costs, as we saw with just a simple setTimeout(,0).
  • The person who's generating all this data in memory must not respect the backpressure signals from the readable stream telling it that its queue is full.
  • The person who's generating all this data in memory has decided to use a readable stream, instead of simply using, say, an array, to represent a bunch of data being held in memory. (Which might be advantageous sometimes, I agree---just like Promise.resolve(5) is sometimes more useful than 5.)
  • The consumer is concerned about time spent on read operations, but is not concerned about time spent on GC or concerned with memory consumption. (If they were, they would be using BYOB.)
  • None of the identity transform optimizations (#325) apply

I think this set of circumstances is extremely rare. My instinct when faced with this kind of scenario is to not include readBatch, or to include it as experimental in implementations behind a flag, and then wait for true real-world streaming programs to appear which we can use to test: does readBatch actually make a difference?

It may be that these circumstances do all come together with enough frequency to advocate for API addition. However, I don't think we can know that without seeing some serious stream deployment in the wild, and I certainly don't think we can call this confluence of circumstances "the common case."

@wanderview

Member

wanderview commented Apr 10, 2015

First, let me say that I regret making this feedback at such a late stage. The read() method became async a while ago and I really should have said something then. I apologize for not being more involved earlier.

That being said, I don't think we can ignore this kind of data. If we're going to fix it we should do it sooner rather than later.

While I could debate you point for point, I'd instead like to discuss the idea that every chunk is directly tied to an equivalent amount of I/O. I think this is a reasonable position for byte-oriented streams, but it seems ReadableStream is also designed to support message-oriented streams.

I think message-oriented streams can very easily produce many chunks from relatively little I/O. Allow me to lay out an example:

Consider a framework or application that wants to modularize its software. To do this it connects the modules with message streams. The framework is designed such that each stream chunk is a Message object.

Now consider that the framework has the ability to send and receive Message objects over the network, or to load Message objects from a file. Let's say the wire protocol for a Message takes on average 128 bytes.

Now consider a MessageParserTransform. This is a ReadableStream/WritableStream pair that accepts ArrayBuffer chunks and produces Message objects. One ArrayBuffer may produce many Message objects.

If a MessageParserTransform consumes a 4096-byte buffer, it will produce 32 Message objects.

Based on the benchmark, reading these 32 message objects out of MessageParserTransform will take over 7ms on a mobile device. This is 7ms of time added on top of the I/O.

Now, you may be saying to yourself "but those 7ms can happen while you're doing more I/O". That's true, but it only helps solve the bandwidth problem. Unfortunately, people also care about latency. In this case the async scheduling required by read() is adding unavoidable latency.

This is just one example, but I hope it demonstrates a realistic use case for sending larger numbers of chunks through a stream without increased external I/O.
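The fan-out in this example can be sketched as a toy parser. The fixed 128-byte framing and the Message shape are simplifying assumptions (the comment says 128 bytes on average), and parseMessages is an illustrative name, not part of any real framework.

```javascript
// Toy MessageParserTransform core: one ArrayBuffer chunk in, many
// Message chunks out. Assumes fixed-size 128-byte wire frames.
const WIRE_SIZE = 128;

function parseMessages(buffer) {
  const bytes = new Uint8Array(buffer);
  const messages = [];
  for (let off = 0; off + WIRE_SIZE <= bytes.byteLength; off += WIRE_SIZE) {
    // A real parser would decode fields; here each Message just keeps
    // a view of its slice of the wire bytes.
    messages.push({ payload: bytes.subarray(off, off + WIRE_SIZE) });
  }
  return messages;
}
```

With a 4096-byte input this yields 32 Message chunks, which is the workload whose per-chunk async read() cost is being estimated above.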

Finally, I'd just like to clarify that I'm now advocating for a sync read with a ready promise, as was previously spec'd. Given these numbers I'd like to see sync be the default.

I'm unsure about sync vs. async .write() at this point. It seems we expose a state which could be checked synchronously, but it's a bit inconvenient to require a second statement to detect back pressure.

@yutakahirano

Member

yutakahirano commented Apr 10, 2015

@wanderview I don't understand why bluebird is worse than native promise when the number of chunks is small but better when it is large. Could you tell me why?

@wanderview

Member

wanderview commented Apr 10, 2015

@yutakahirano it's due to the async primitives the polyfill has available. It has to use something like setTimeout(). Unfortunately, if you nest more than 4 or 5 setTimeout() calls, the browser begins forcing the delay to a minimum of 4ms. This places the upper bound of an async loop like this at 250 ops/sec.

In theory something like postMessage could avoid the 4ms clamping, but I think browsers have started doing it for nested postMessage recently as well.

The polyfill needs something like node's nextTick to perform well on this test.
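A minimal way to observe the clamp described here is a chain of zero-delay timeouts (timeNestedTimeouts is an illustrative name; the 4ms clamp applies in browsers after a few nesting levels, while Node's timer behavior differs, so absolute numbers vary by environment).

```javascript
// Runs `iterations` chained zero-delay timeouts and reports elapsed time.
// In browsers, once timers nest ~5 deep each one is clamped to >= 4 ms,
// so the loop converges on roughly 1000 / 4 = 250 iterations per second.
function timeNestedTimeouts(iterations, done) {
  const start = Date.now();
  let i = 0;
  (function tick() {
    if (++i >= iterations) return done(Date.now() - start);
    setTimeout(tick, 0);
  })();
}
```

On a clamped browser, 100 iterations should take on the order of 400ms, whereas a microtask-driven chain of the same length completes far faster.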

@tyoshino

Member

tyoshino commented Apr 10, 2015

It seems even just creating a Promise is heavy. I ran sync.html with new Promise(function() {}); at the top of the read definition in makeSyncReader on Firefox 36.0.4. It's ~3 times faster than native.html, but degrades similarly to native.html.

@tyoshino

Member

tyoshino commented Apr 10, 2015

Same curve if new Promise(function() {}) is replaced with Promise.resolve();.

@wanderview

Member

wanderview commented Apr 10, 2015

We have work underway to optimize our native promises. The main cost you are probably seeing here, though, is the increased load due to cycle-collecting so many C++-backed JS objects.

The native promise will improve for sure, but I don't see it improving by 300%. I think the async microtask scheduling is the main source of delay here, and I don't see how that can be optimized away.

So I see this as an "async required" problem and not Promise-specific. If we could implement this with just callbacks without hitting setTimeout() clamping, I think we would see similar results.

Happy to be proven wrong, though!

@yutakahirano

Member

yutakahirano commented Apr 10, 2015

I profiled a simple script creating and resolving a large promise chain with V8. Roughly half of the time was consumed by GC.

@yutakahirano

Member

yutakahirano commented Apr 10, 2015

@wanderview So, one setTimeout is slower than one microtask execution, but many setTimeouts are faster than many microtask executions? Interesting...

@domenic

Member

domenic commented Apr 10, 2015

Based on the benchmark, reading these 32 message objects out of MessageParserTransform will take over 7ms on a mobile device. This is 7ms of time added on top of the I/O.

In this case the async scheduling required by read() is adding unavoidable latency.

This is a good scenario, which I appreciate you taking the time to outline. However, I think we need to significantly improve our benchmarking techniques if we're going to make this sort of claim. For example, we need to test in a ready-plus-read() scenario instead of a simple sync read(). I also want to see something more realistic, like the scenario you outline, where we do I/O for a message and then split it into several objects. And we need to eliminate these scheduler artifacts.

All this talk about GC reminds me that when I ask the VM people about this sort of stuff, they say "generational GC will solve this." Apparently promises currently always outlive the young generation, whereas they probably should not if they're immediately .thened or if the result of .then is not used. But who knows when that can get fixed...

@domenic

Member

domenic commented Apr 10, 2015

Also, I don't see why you want sync read + state + ready over readBatch; I didn't see you address this point.

@tyoshino

Member

tyoshino commented Apr 10, 2015

How does .readBatch() on a non-BYOB reader of a ReadableByteStream work? Is numberOfChunks a number of chunks, or a number of bytes? E.g., if there are two ArrayBufferViews abv0, abv1 available for read, and abv0.byteLength < N < abv0.byteLength + abv1.byteLength, do we return an Array of abv0 and abv1.subarray(0, N - abv0.byteLength)?

(Below are moved from later comment to avoid discussion interference)

Just to satisfy the need to reduce promise overhead, it's fine for it to be a number of chunks.

I guess readBatch(numberOfChunks) is proposed here instead of providing only readAllAvailable() in order to control back pressure precisely (if the consumer is ready to consume only up to 3 chunks, it can call readBatch(3)). If we want to give the same functionality to readBatch(N) of ReadableByteStream, I have no idea what value we should set N to if N means a number of chunks.

@tyoshino

Member

tyoshino commented Apr 10, 2015

(Comment merged to #320 (comment))

@domenic

Member

domenic commented Apr 10, 2015

@wanderview given

So I see this as an "async required" problem and not Promise specific. If we could implement this with just callbacks without hitting set time out clamping I think we would see similar results.

Maybe a good way to cut out the noise here is to actually test callbacks + a loop (i.e., manually emulate the microtask queue) and compare that to sync read. Then we can see if an ideal fast microtask queue/promise implementation still introduces 7 ms of latency, or if it's just a Firefox-on-mobile optimization issue.

@tyoshino

Member

tyoshino commented Apr 10, 2015

(Comment merged to #320 (comment))

@domenic

Member

domenic commented Apr 10, 2015

readAllAvailable() sounds better, actually; it addresses the use case without causing confusion in the byte cases.

@wanderview

Member

wanderview commented Apr 10, 2015

@wanderview So, one setTimeout is slower than one microtask execution, but many setTimeouts are faster than many microtask executions? Interesting...

I don't think this is the case. It's more that other factors such as JS execution time, GC, etc. become more significant in the results. This means that the 4ms clamping no longer comes into effect.

I think if setTimeout() ran without 4ms clamping then bluebird promises would be faster than native promises, but slower than the sync case.

@wanderview

Member

wanderview commented Apr 10, 2015

For example, we need to test in a ready-plus-read() scenario instead of simple sync read().

The sync test in the benchmark already does an async action. It does effectively:

sync read
sync read
...
sync read
async resolve

I highly doubt moving the async resolve before the sync reads will materially impact the results.

And I want to see something more realistic, like the scenario you outline, where we do I/O for a message and then split it into several objects. And we need to eliminate these scheduler artifacts.

This is difficult to benchmark when you also claim the reference implementation is unoptimized. Open to suggestions here (or benchmark code!).

Also, I don't see why you want sync read + state + ready over readBatch; I didn't see you address this point.

I have a number of reasons for favoring a read+state+ready solution in .getReader() over an additive solution:

  • Sync read+state+ready is more primitive. Async read can be built on top of it easily, but not the other way around.
  • If we believe performance is a problem with async read (which I do) then making it the default is a footgun.
  • I think we will see people recommending that everyone always use the additive API to avoid the footgun. This is fine, but it suggests to me it should just be the default then.

In terms of possible additive APIs, I think I prefer something other than readBatch() now. Requiring code to always deal with a possibly returned array (or single value) is just a bad API, and I don't want to require it to achieve performance.

If additive is the only possible route, I think I might prefer a stream.getSyncReader() that provides a read+state+ready.
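For concreteness, the read+state+ready shape being advocated could look something like the following toy model. The names mirror the earlier-spec'd design as described in this thread (sync read(), a state string, a ready promise); makeSyncReader and its batch-based source are illustrative assumptions, not real API.

```javascript
// Toy model of a sync-read reader: read() is synchronous, state says
// whether data is buffered, and ready resolves when more data arrives.
function makeSyncReader(batches) {
  // batches: array of arrays; each inner array becomes available after
  // one microtask turn, simulating data arriving between waits.
  let queue = [];
  let i = 0;
  return {
    get state() {
      if (queue.length) return 'readable';
      return i < batches.length ? 'waiting' : 'closed';
    },
    read() {
      return queue.shift(); // sync: no promise allocated per chunk
    },
    get ready() {
      if (i < batches.length) {
        // Resolve after delivering the next batch into the buffer.
        return Promise.resolve().then(() => { queue = batches[i++].slice(); });
      }
      return Promise.resolve();
    }
  };
}

// Consumer pattern: one await per *batch* of buffered data,
// synchronous reads within each batch.
async function consumeAll(reader) {
  const out = [];
  for (;;) {
    while (reader.state === 'readable') out.push(reader.read());
    if (reader.state === 'closed') return out;
    await reader.ready;
  }
}
```

The design point is that the promise cost scales with the number of waits, not the number of chunks.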

@domenic

Member

domenic commented Apr 10, 2015

getSyncReader() seems possible. However, I do disagree that we've established performance is a problem with async, and indeed will try to create better benchmarks to help out :). What did you think of my suggestion to test an idealized microtask queue, to remove implementation-specific slowness? E.g. Angular has one, where once control returns to Angular code they run a while() loop of framework-specific tasks. We could do something similar.

@wanderview

Member

wanderview commented Apr 10, 2015

Maybe a good way to cut out the noise here is to actually test callbacks + a loop (i.e., manually emulate the microtask queue) and compare that to sync read. Then we can see if an ideal fast microtask queue/promise implementation still introduces 7 ms of latency, or if it's just a Firefox-on-mobile optimization issue.

As far as I know, a callback async loop cannot be implemented without hitting the 4ms timer clamp. This is why developers want something like nextTick or setImmediate. It seems the promise microtask is the solution.

Also, I believe Firefox has a reasonably fast microtask queue implementation. It's almost certainly faster than something that puts the next event in the main event queue, like setTimeout.

Finally, if you think this is a firefox issue, please help me fix the benchmark to run in chrome. :-) Sorry I didn't have time to do this myself, but my time is shorter than usual this week due to travel.

Angular has one, where once control returns to Angular code they run a while() loop of framework-specific tasks. We could do something similar.

I would be interested to see how we could construct this with the non-promise async primitives without triggering timer clamping in an async loop.

@wanderview

Member

wanderview commented Apr 10, 2015

All this talk about GC reminds me that when I ask the VM people about this sort of stuff, they say "generational GC will solve this." Apparently promises currently always outlive the young generation, whereas they probably should not if they're immediately .thened or if the result of .then is not used. But who knows when that can get fixed...

I don't think this necessarily applies to Firefox. We definitely have overhead for native promises due to our cycle collector (like Chrome's Oilpan, I think), but I haven't seen or heard evidence of GGC nursery issues. We can make improvements to the CC (and other native promise bits), but we're unlikely to see anything approaching a 300% improvement.

@domenic

Member

domenic commented Apr 10, 2015

I would be interested to see how we could construct this with the non-promise async primitives without triggering timer clamping in an async loop.

The trick is to not actually be async at all---just like microtasks are in some sense "not actually async." Microtasks run a loop after exiting user code. If we just abstract the definition of "user code" to be "this code over here, but not this code over here" (like Angular does), we can program this in JavaScript.
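The Angular-style approach described here can be sketched as a hand-rolled task queue drained in a plain loop at the "user code has returned" boundary; makeTaskQueue and its methods are illustrative names.

```javascript
// A hand-rolled task queue: tasks are scheduled during "user code" and
// drained synchronously afterwards, the way microtasks drain after user
// code returns -- no timers, therefore no clamping.
function makeTaskQueue() {
  const tasks = [];
  return {
    schedule(fn) {
      tasks.push(fn);
    },
    // Called at the boundary where "user code" has finished: run every
    // queued task, including tasks scheduled by earlier tasks.
    drain() {
      while (tasks.length) tasks.shift()();
    }
  };
}
```

A framework would call drain() each time control returns from the code it wraps, which is exactly the "abstract the definition of user code" trick.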

@wanderview

Member

wanderview commented Apr 10, 2015

The trick is to not actually be async at all---just like microtasks are in some sense "not actually async." Microtasks run a loop after exiting user code. If we just abstract the definition of "user code" to be "this code over here, but not this code over here" (like Angular does), we can program this in JavaScript.

Interested, but with the caveat that dropping the JS-to-C++ transition may not be indicative of actual native promise performance. (Although I think someone told me this week that Chrome promises are self-hosted.)

@domenic

Member

domenic commented Apr 10, 2015

Indeed, they are, as are most (all?) primitives that appear in the ES language spec in SpiderMonkey. Promises are a strange exception there...

@domenic

Member

domenic commented Apr 10, 2015

Anyway, kind of the point is to separate the "X's promises are slow" and "X's microtasks are slow" from "streams are slow." Since the source of slowness claimed here is usage of the microtask queue (via promises), what we really want to investigate is the slowness of streams with an idealized microtask queue.

@wanderview

Member

wanderview commented Apr 10, 2015

Indeed, they are, as are most (all?) primitives that appear in the ES language spec in SpiderMonkey. Promises are a strange exception in SpiderMonkey...

I am told this is because SpiderMonkey does not have the concept of an event loop. All previous timer and async callback APIs were in the DOM. So Promises are a unique case. Long term plan is to add an event loop to SpiderMonkey.

@wanderview

Member

wanderview commented Apr 10, 2015

I don't think this necessarily applies to Firefox. We definitely have overhead for native promises due to our cycle collector (like Chrome's Oilpan, I think), but I haven't seen or heard evidence of GGC nursery issues. We can make improvements to the CC (and other native promise bits), but we're unlikely to see anything approaching a 300% improvement.

I was able to talk to one of our CC experts on the way out of the office today. He informed me that C++-backed JS objects like promises are not currently permitted in the GGC nursery, so we do have that problem. It's unclear whether this is fixable, since promises can hold references back to other JS objects.

@trevnorris

trevnorris commented Apr 12, 2015

Here's an example of a super simple message queue that runs in io.js:

'use strict';
const SlowBuffer = require('buffer').SlowBuffer;
const ITER = 1e7;


function Readable() {
  this._messages = [];
  this._onreadable = undefined;
  this._onreadable_args = undefined;
  this.isreadable = false;
}

Readable.prototype.push = function push(data){
  this._messages.push(data);
  if (this._messages.length === 1) {
    this.isreadable = true;
    if (this._onreadable_args)
      this._onreadable.apply(this, this._onreadable_args);
    else
      this._onreadable();
  }
};

Readable.prototype.read = function read() {
  if (this._messages.length > 0)
    return this._messages.shift();
  if (this._messages.length === 0)
    this.isreadable = false;
};

Readable.prototype.onreadable = function onreadable(cb) {
  this._onreadable = cb;
  if (arguments.length > 1)
    this._onreadable_args = parseArgs(arguments, 1);
};


/* actual benchmark */

var inst = new Readable();
var t = process.hrtime();
var iter = 0;

inst.onreadable(function onReadable(foo) {
  if (foo !== 'foo')
    throw new Error('where is foo?');

  if (++iter > ITER)
    return printTime(process.hrtime(t));

  while (this.isreadable)
    this.read();
  var self = this;
  process.nextTick(function nt() {
    self.push(new SlowBuffer(1024 * 64));
  });
}, 'foo');
inst.push();


/* utility functions */

function parseArgs(args, start) {
  var arr = [];
  for (var i = start; i < args.length; i++)
    arr.push(args[i]);
  return arr;
}

function printTime(t) {
  console.log(((t[0] * 1e9 + t[1]) / ITER).toFixed(1) + ' ns/op');
  console.log((ITER / (t[0] + t[1] / 1e9)).toFixed(1) + ' op/sec');
}

Output:

$ iojs --nouse-osr --nouse-inlining bench.js
1474.5 ns/op
678194.4 op/sec

A few key points of things intentionally implemented:

  • The use of SlowBuffer() that forces a JS -> C++ round trip.
  • Allocating 64KB, same size as a TCP packet to place appropriate strain on GC.
  • Force the pushing of new data to be "async" using process.nextTick().
  • Emitting the actual onreadable callback event so the data can be read.
  • Allow extra arguments to be passed to propagate to onreadable callback to force the use of .apply().
  • Disabled OSR and inlining for the benchmark.

As you can see, I'm still able to process over 600k messages per second. Performance-wise, I'd like to see the final result take no more than 2us per operation. "Good enough" isn't really my thing. IMO the implementation should be no more than a footnote in total measured CPU time.

BTW, if SlowBuffer() allocations are taken down to the very small size of 32 bytes performance increases to 2 million ops/sec.

EDIT: Updated the example and performance numbers to use .shift() instead of .pop() (which would have made it a stack, instead of a queue).

EDIT: Added .isreadable as a way to detect if .read() should be run again.

@wanderview


Member

wanderview commented Apr 12, 2015

@trevnorris, if I'm reading this example correctly, I don't think you would ever have more than one iteration in this loop, right?

  while (this.isreadable)
    this.read();

Overall, though, are you saying that an async callback for every read() is not a problem? That seems to be what your example is showing.

@trevnorris


trevnorris commented Apr 12, 2015

@wanderview For this benchmark that's correct. I used a while() for illustration's sake, to show that all data must be read off the internal queue so that onreadable() is guaranteed to be called again.

Both V8's microtask queue and Node's next-tick queue can iterate through callbacks in around 100ns per call, so invoking the callbacks asynchronously doesn't add much overhead. I created the benchmark to set the performance bar: the native Stream implementation should be able to iterate through chunks of data in no more than 2us per 64KB allocation, and 500ns for trivially small allocations.
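To put that per-callback figure in context, the microtask-queue turnaround can be measured directly. This is a minimal sketch, not the benchmark from this thread; absolute numbers will vary widely by engine and machine, so treat the ~100ns figure as machine-dependent.

```javascript
// Minimal sketch: measure the average turnaround of the promise
// microtask queue by awaiting an already-resolved promise in a loop.
// Absolute numbers depend heavily on the engine and machine.
async function measureMicrotaskTurn(n) {
  var t = process.hrtime();
  for (var i = 0; i < n; i++)
    await Promise.resolve(); // one trip through the microtask queue
  var d = process.hrtime(t);
  return (d[0] * 1e9 + d[1]) / n; // ns per turn
}

measureMicrotaskTurn(1e5).then(function (ns) {
  console.log(ns.toFixed(1) + ' ns per microtask turn');
});
```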

@tyoshino


Member

tyoshino commented Apr 13, 2015

@wanderview @domenic In the resolve() definition in benchmark.js, setTimeout(fn, 0) is called for each operation (each function added by suite.add()). Chrome seems to always impose a 4-5ms delay on it, which hides the actual performance differences (~200 ops/sec for everything). The setTimeout can be omitted as long as the number of cycles is small. I commented it out (and implemented URLSearchParams) and ran the test on Chrome 41.0.2272.118 (64-bit) on Linux.

Testing 1 chunks of 1 bytes per operation.
sync x 35,994 ops/sec ±1.99% (84 runs sampled)

Testing 32 chunks of 1 bytes per operation.
sync x 28,294 ops/sec ±0.98% (83 runs sampled)

Testing 1 chunks of 1 bytes per operation.
promise x 21,132 ops/sec ±1.94% (81 runs sampled)

Testing 32 chunks of 1 bytes per operation.
promise x 2,628 ops/sec ±0.86% (87 runs sampled)
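The clamping problem tyoshino describes is easy to see in isolation: browsers throttle nested setTimeout(fn, 0) calls to roughly 4ms, while microtask-based scheduling has no such floor. A hedged sketch of the two deferral strategies a harness could use (function names here are illustrative, not from benchmark.js):

```javascript
// Two ways a benchmark harness can defer work between operations.
// Nested setTimeout(fn, 0) is clamped to ~4ms in browsers, which
// swamps sub-millisecond differences; a microtask has no such clamp.
function deferWithTimeout(fn) {
  setTimeout(fn, 0); // subject to the ~4ms nested-timer clamp
}

function deferWithMicrotask(fn) {
  Promise.resolve().then(fn); // runs at the end of the current task
}
```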
@tyoshino


Member

tyoshino commented Apr 13, 2015

So, the cause is the same as in Ben's analysis in #320 (comment), but the workaround he suggested for Firefox doesn't work for Chrome.

@wanderview


Member

wanderview commented Apr 13, 2015

@tyoshino Can you send me a pull request to remove the setTimeout() and parse the args for Chrome?

@domenic


Member

domenic commented May 8, 2015

@wanderview given

# [16:38] Domenic: so, first glance result from this system benchmark is that async-read is roughly equivalent to sync-read+ready-promise... except when the browser is under memory pressure... then the sync-read+ready-promise pulls away... I'm not sure how I feel about that yet

and the fact that https://github.com/wanderview/streams-time-echo seems to be using native promises anyway (so that even the under-memory-pressure scenario could possibly be improved with a better promise implementation), where do we stand?

@wanderview


Member

wanderview commented May 9, 2015

where do we stand?

I still need to finish the benchmark to run different configurations systematically, test with bluebird, and then run the tests on desktop/mobile. I can try to prioritize this after my PTO late next week.

By the way, does my promise usage in the streams-time-echo thing look reasonable? It would be nice to have someone confirm I'm testing what I think I'm testing.

Anyway, if the further testing results in values similar to the preliminary test results you mention in your quote above, then I'm leaning towards declaring it "good enough".

@wanderview


Member

wanderview commented May 15, 2015

Ok, I have a semi-working benchmark:

https://github.com/wanderview/streams-time-echo

It basically works by streaming a series of timestamps from a node.js server. The browser receives these with a fetch Response stream, parses the timestamps out into individual "objects", and then passes these on via either an asyncRead() or syncRead(). Every time we get a chunk of data from the server we end up with many timestamp objects to pass on, so it should test the "populated pipe" concept I was worried about above.

I say "semi" because it locks up now and then. I believe this is due to the interaction between my own back-pressure logic and the buffering of data in the Chrome body stream, or maybe the node http lib. (Unfortunately the Chrome fetch body stream does not do back pressure itself yet.)
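For reference, the two read styles being compared can be sketched roughly as follows. The reader shape used here (a state property and a ready promise for the sync style) is assumed for illustration only, not the final spec API.

```javascript
// Style 1: promise-per-chunk ("asyncRead") -- every chunk delivered
// costs one promise resolution.
async function consumeAsync(reader, handle) {
  for (;;) {
    const { value, done } = await reader.read();
    if (done) return;
    handle(value);
  }
}

// Style 2: synchronous read behind a ready promise ("syncRead") --
// drain everything already buffered, then wait once per refill.
async function consumeSync(reader, handle) {
  for (;;) {
    while (reader.state === 'readable')
      handle(reader.read()); // no promise per chunk
    if (reader.state === 'closed') return;
    await reader.ready; // one promise per batch of buffered chunks
  }
}
```

With a "populated pipe" (many parsed objects per network chunk), style 2 pays for one promise per refill instead of one per chunk, which is the difference this benchmark is probing.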

In any case, it works well enough that I got some reasonable looking data. See the "macro benchmark" tab:

https://docs.google.com/spreadsheets/d/1rl6mbD2z1x1bgJLD6y9KJLYWjppB7BujfiWvUMjYTVs/edit?usp=sharing

Summary of results:

  • Desktop shows no difference between async and sync read paths.
  • Nexus5 with native promises shows a statistical difference between async read and sync read.
  • Nexus5 with bluebird promises shows no difference between async and sync read.

The only other good benchmark I can think of is a very "wide" server that has to handle numerous simultaneous streams. This is basically the node.js use case. It doesn't seem very applicable to js running in the browser, though. Unfortunately I don't have time to investigate this particular use case.

At this point I'm inclined to accept the async promise-returning read(). It seems reasonable that native promises can achieve the same performance as bluebird in the long term.

Thoughts?

@wanderview


Member

wanderview commented May 15, 2015

And I'm inclined not to add any kind of batchRead() or getSyncReader() right now. We can add something like this if real-world usage shows a need.

@domenic


Member

domenic commented May 18, 2015

Thanks very much for the testing, @wanderview! It's great to have data on this kind of thing. I agree with your conclusions: we can add something like these later if necessary. Closing for now!

@domenic domenic closed this May 18, 2015
