Add support for streaming stdout/stderr from Child invocations #75


@brianmario
Collaborator
brianmario commented Dec 30, 2016 edited

As the title says, this adds support for passing a block to receive chunks of stdout/stderr output as they're being read.

I'm not super happy with how the API turned out, and I could probably add a few more tests - so suggestions are definitely welcome.

@tmm1 @peff @piki @carlosmn @simonsj @vmg @scottjg @mclark @arthurschreiber @tma in case any of you have time to review this 🙏

@@ -337,6 +337,10 @@ class MaximumOutputExceeded < StandardError
class TimeoutExceeded < StandardError
end
+ # Exception raised when output streaming is aborted early.
+ class Aborted < StandardError
@brianmario
brianmario Dec 30, 2016 Collaborator

This could maybe be CallerAborted or something more specific?

@mclark
mclark Jan 3, 2017

I'm fine with either.

lib/posix/spawn/child.rb
+ @err << chunk
+ end
+ end
+ end
@brianmario
brianmario Dec 30, 2016 Collaborator

This whole block is pretty gross, but the alternative may involve being tricky (and less readable?) with Ruby.

@mhagger
mhagger Jan 3, 2017

This might be the "tricky" Ruby you're talking about, but it seems to me that when streaming is not requested, you could set @stdout_block anyway, to

Proc.new do |chunk|
  @out << chunk
  false
end

(like you do for the tests) and do the equivalent for @stderr_block. Then you could avoid the inner conditionals here.

To shrink the code even further, you could set

@blocks = { stdout => @stdout_block, stderr => @stderr_block }

(in which case you wouldn't even need @stdout_block and @stderr_block anymore, but you get the idea) then this whole processing code could become

if @blocks[fd].call(chunk)
  raise Aborted
end
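
Putting both parts of that suggestion together, the processing code might collapse to something like this (a sketch only; readers stands in for whatever IO list the select loop yields):

@blocks = {
  stdout => @stdout_block || Proc.new { |chunk| @out << chunk; false },
  stderr => @stderr_block || Proc.new { |chunk| @err << chunk; false }
}

# later, inside the read loop:
readers.each do |fd|
  chunk = fd.readpartial(BUFSIZE)
  # a truthy return from the block means "abort streaming"
  raise Aborted if @blocks[fd].call(chunk)
end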
lib/posix/spawn/child.rb
+ end
+
+ if @streaming && abort
+ raise Aborted
@brianmario
brianmario Dec 30, 2016 Collaborator

I don't love raising here, but it enforces proper cleanup (and killing the subprocess) up in https://github.com/rtomayko/posix-spawn/blob/0c02f3337575a9fcdd1ba6135882afcbb0aaf290/lib/posix/spawn/child.rb#L168.
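
For reference, the cleanup pattern being relied on is roughly the following (a sketch of the linked code, not a verbatim copy):

begin
  read_and_write(input, stdin, stdout, stderr, timeout, max) # may raise Aborted
rescue Object
  # any exception, including Aborted, tears the child down
  [stdin, stdout, stderr].each { |fd| fd.close rescue nil }
  ::Process.kill('TERM', @pid) rescue nil
  raise
end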

@mhagger

I added some comments. It would be great to have docs, too.

lib/posix/spawn/child.rb
+ end
+ end
+
+ if @streaming && abort
@mhagger
mhagger Jan 3, 2017

I think @streaming && is redundant here, since when @streaming is not set, abort retains its initial value, false.

+ })
+ end
+ end
+
@mhagger
mhagger Jan 3, 2017

There are no tests that involve reading more than one BUFSIZE worth of output, or reading from both stdout and stderr. Those might be worthwhile additions.
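
For instance, something like this hypothetical test (written against the proc-based :streams API the PR later settles on; the exact size is arbitrary, just several times BUFSIZE):

def test_streams_chunks_from_both_pipes
  size = 256 * 1024
  out_chunks, err_chunks = [], []
  Child.new('ruby', '-e', "$stdout.write('o' * #{size}); $stderr.write('e' * #{size})",
            :streams => {
              :stdout => proc { |chunk| out_chunks << chunk.dup; true },
              :stderr => proc { |chunk| err_chunks << chunk.dup; true }
            })
  assert_equal 'o' * size, out_chunks.join
  assert_equal 'e' * size, err_chunks.join
end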

@mclark
mclark Jan 3, 2017 edited

We should also add a test for passing in a minimal custom object, just to ensure the interface contract is maintained. May be a good time to use a spy.
No longer applicable as we are using Procs now.

@brianmario brianmario Switch to a writer interface from using a proc
This requires that the stdout and stderr stream objects passed respond to
`#write` and `#string` methods.
b6e35f8
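
Under that contract, a minimal conforming object could be as small as this (illustrative, not from the diff):

class ChunkCollector
  def initialize
    @chunks = []
  end

  # Receives each chunk as it comes off the pipe.
  def write(chunk)
    @chunks << chunk.dup
    chunk.bytesize
  end

  # Returns everything seen so far.
  def string
    @chunks.join
  end
end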
@mclark

I'm liking the duck typed object interface MUCH better than the two Procs. I think that was the right way to go in this case.

lib/posix/spawn/child.rb
[stdin, stdout, stderr].each do |fd|
- fd.set_encoding('BINARY', 'BINARY')
+ fd.set_encoding(bin_encoding, bin_encoding)
@mclark
mclark Jan 3, 2017

Can we do this on a per fd basis?

bin_encoding = Encoding::BINARY
[stdin, stdout, stderr].each do |fd|
  fd.set_encoding(bin_encoding, bin_encoding) if fd.respond_to?(:set_encoding)
end

Also, are we intentionally dropping the force_encoding calls on stdout and stderr here?

lib/posix/spawn/child.rb
- @out.force_encoding('BINARY')
- @err.force_encoding('BINARY')
- input = input.dup.force_encoding('BINARY') if input
+ @stdout_buffer.set_encoding(bin_encoding)
@mclark
mclark Jan 3, 2017

Are these duplicate calls to the above intentional? We really can't assume these objects respond to these methods any more.

lib/posix/spawn/child.rb
+ abort = false
+ if chunk
+ if fd == stdout
+ abort = (@stdout_buffer.write(chunk) == 0)
@mclark
mclark Jan 3, 2017

I don't feel like this is a safe way to test for aborting the operation. The output object could simply be refusing to write the current chunk without being done consuming the stream.
Why not use an exception for this test instead? If the consumer raises POSIX::Spawn::Aborted then we clearly know to abort.
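
Under that scheme, a consumer that wants to stop early might look like this (a sketch, assuming the Aborted class added earlier in this diff):

class FirstLineWriter
  attr_reader :string

  def initialize
    @string = String.new
  end

  def write(chunk)
    @string << chunk
    # abort by raising, not by a magic return value
    raise POSIX::Spawn::Aborted if @string.include?("\n")
    chunk.bytesize
  end
end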

lib/posix/spawn/child.rb
@@ -262,12 +288,10 @@ def read_and_write(input, stdin, stdout, stderr, timeout=nil, max=nil)
end
# maybe we've hit our max output
- if max && ready[0].any? && (@out.size + @err.size) > max
+ if max && ready[0].any? && (@stdout_buffer.size + @stderr_buffer.size) > max
@mclark
mclark Jan 3, 2017

again, we can't assume there is a #size method on these objects...

I think we should probably keep a local count of the bytes we have written instead of calling #size anyway. We can't trust these objects any more as they could be anything.
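
A sketch of that bookkeeping, which is essentially what a later revision of this PR does with its bytes_seen counter:

bytes_seen = 0
# inside the read loop, after each successful readpartial:
bytes_seen += chunk.bytesize
raise MaximumOutputExceeded if max && bytes_seen > max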


@mclark
mclark commented Jan 3, 2017

Oh right, we also need some thorough docs on this once we've nailed down the exact 🦆 interface we are using

lib/posix/spawn/child.rb
begin
- buf << fd.readpartial(BUFSIZE)
+ chunk = fd.readpartial(BUFSIZE)
@peff
peff Jan 4, 2017

If I'm understanding this right, the old code would always append directly into the final buffer, whereas this one reads a chunk and then appends that chunk to the buffer. Not knowing anything about how Ruby operates under the hood, is this a potential performance problem? It should just be an extra memcpy in the worst case, but I recall that we've hit bottlenecks on reading into Ruby before. I suspect those were mostly about arrays and not buffers, though (e.g., reading each line into its own buffer can be slow).

@brianmario
brianmario Jan 4, 2017 Collaborator

I could be wrong here (cc @tenderlove) but I'm pretty sure the previous code would actually create a new string (down inside readpartial), and then that string would have been appended to buf, which would potentially require resizing that buffer first, then the memcpy.

This new code just keeps that first string as a local var, so we can later decide where to write it. In the default case we're using a StringIO, so the result is essentially the same as before (potential buffer resize then a copy). Though iirc we saw some pretty significant speedups by using an array to keep track of chunks, then calling join at the end when it was all needed. That's because it avoids reallocating the resulting buffer as we're reading, and instead lets join allocate a buffer exactly the size that's needed and copy all the chunks into it.

Basically this (the old way):

buffer = ""
buffer << "one,"
buffer << "two,"
buffer << "three"

return buffer

vs this (the array optimized version I just mentioned):

buffer = []
buffer << "one,"
buffer << "two,"
buffer << "three"

# buffer is an array with 3 elements at this point
# and this join call figures out how big all of the strings inside are, then creates a single buffer to append them to.
return buffer.join

Using that approach efficiently may change the API contract here slightly though...

@peff
peff Jan 4, 2017

@brianmario Ah, right, that sort of return value optimization would be pretty easy to implement, and would mean we end up with the same number of copies. Though if we're just counting memcpys anyway, I suspect it doesn't matter much either way.

The reason for that is because it avoids the reallocation of the resulting buffer as we're reading, and instead allows join to allocate a buffer exactly the size that's needed then copying all the chunks in to it.

Interesting. It sounds like appending doesn't grow the buffer aggressively in that case, because you should be able to get amortized constant-time.

Anyway, this is well past my level for intelligent discussion of Ruby internals. The interesting question is whether reading the output of a spawned cat some-gigabyte-file is measurably any different. Probably not, but it's presumably easy to test.

@brianmario

So, I decided to go back to the proc-based API because the requirements on the caller are much simpler. The objects passed as streams need only respond to call with an arity of 1 (the current chunk) and return a boolean: true to continue, false to abort (note this is the opposite of how I originally had it, though I think it makes more sense).

I'll keep going on tests and the documentation changes, but wanted to give folks one last chance to review this direction.
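
In other words, usage would look roughly like this (a sketch based on the description above and the :streams option shape in the tests later in this thread):

out = String.new
begin
  Child.new('yes', :streams => {
    :stdout => proc { |chunk|
      out << chunk
      out.bytesize < 64 * 1024 # true: keep streaming; false: abort
    }
  })
rescue POSIX::Spawn::Aborted
  # expected once the proc returned false
end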

+ streams = {stdout => @stdout_stream, stderr => @stderr_stream}
+
+ bytes_seen = 0
+ chunk_buffer = ""
@brianmario
brianmario Jan 4, 2017 edited Collaborator

This buffer is reused by readpartial below. Internally, so far as I can tell, it will be resized to BUFSIZE on the first call to readpartial and then that underlying buffer will be reused from then on out.
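
The reuse pattern in question, roughly (fd and stream are placeholders):

chunk_buffer = String.new
begin
  loop do
    # readpartial fills and returns chunk_buffer, reusing its storage
    chunk = fd.readpartial(BUFSIZE, chunk_buffer)
    stream.call(chunk) # consumers must dup if they keep a reference
  end
rescue EOFError
  # pipe drained
end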

@carlosmn
carlosmn Jan 4, 2017

Due to the issue with appending to strings mentioned above, we might want to consider having #readpartial allocate a new string and give ownership of it to the stream. But since we're now re-using this buffer, it's probably already more efficient than what we had before, so we can probably leave it until we actually find a perf issue we can trace back to this specifically.

@jbarnette
jbarnette Jan 5, 2017

Another "" that might be better as String.new here.

raise MaximumOutputExceeded
end
end
-
- [@out, @err]
@brianmario
brianmario Jan 4, 2017 Collaborator

I decided to just drop returning these in an attempt at consistency, since one or both of these ivars are useless if we're streaming.


test/test_child.rb
+ begin
+ Child.new('yes', :streams => {:stdout => stdout_stream}, :max => limit)
+ rescue POSIX::Spawn::MaximumOutputExceeded
+ end
@carlosmn
carlosmn Jan 4, 2017

This should probably be an assert_raises block so we assert that the exception was raised; as-is this would not fail the test even if we never raise the error.

@brianmario
brianmario Jan 4, 2017 Collaborator

Good call 👍
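
Rewritten per the suggestion (stdout_stream and limit as in the test above, assuming the suite's Minitest-style assert_raises):

assert_raises(POSIX::Spawn::MaximumOutputExceeded) do
  Child.new('yes', :streams => {:stdout => stdout_stream}, :max => limit)
end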

@carlosmn
carlosmn commented Jan 4, 2017

This doesn't seem to be new, but the semantics of :max seem rather confusing. It doesn't mean "send me at most X bytes" but rather "if the process sends over X amount, give it to me and then raise an error", which means that for a :max which is a multiple of the effective buffer size, you get a whole extra chunk.

While I was playing around with the added test here, I noticed that all chunks are 16kB (which seems to be the default pipe buffer size here), so a :max of BUFSIZE * 2 is four times the chunk size (since yes outputs so quickly, we always find the pipe full), which means the output stream gets a whole extra 16kB beyond what is specified as :max. These semantics seem to predate this PR, but they're really hard to plan for if I can get an arbitrary number of bytes beyond what I asked for.
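
The arithmetic, spelled out (assuming BUFSIZE is 32kB and 16kB pipe chunks, per the observation above):

max  = 2 * 32 * 1024 # :max => BUFSIZE * 2 = 65_536
seen = 0
5.times do |i|
  seen += 16 * 1024  # each 16kB chunk reaches the stream first
  puts "over :max after chunk #{i + 1} (#{seen} bytes delivered)" if seen > max
end
# chunks 1-4 total exactly 65_536, which is not > max, so a fifth
# 16kB chunk is delivered before MaximumOutputExceeded is raised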

@brianmario
Collaborator

@carlosmn went ahead and added a failing test for that. Will get things fixed up so we only ever hand the caller :max bytes. I don't think that'll break anything because it seems like that's been the assumption all along.

@@ -95,6 +95,31 @@ def initialize(*args)
@options[:pgroup] = true
end
@options.delete(:chdir) if @options[:chdir].nil?
+
+ @out, @err = "", ""
@jbarnette
jbarnette Jan 5, 2017

Might want to use String.new to avoid breaking when someone passes --enable-frozen-string-literal.
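
Concretely, the failure mode under that flag:

# ruby --enable-frozen-string-literal ...
out = ""           # literal: frozen under the flag
out << "chunk"     # raises: can't modify frozen String
out = String.new   # never frozen by the flag
out << "chunk"     # fine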

+ @out << chunk
+
+ true
+ end
@jbarnette
jbarnette Jan 5, 2017
@stdout_stream = @out.method(:<<)

:trollface: 🚲 🏠

@brianmario
brianmario Jan 5, 2017 Collaborator

No, I love it! Being able to use method was one of the main reasons for going back to the proc-based API ;)
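
The trick works because Method objects quack like procs (a toy demonstration):

out = String.new
stream = out.method(:<<) # a Method object; responds to #call like a proc
stream.call("chunk")     # appends to out and returns out itself
# String#<< returns the (truthy) receiver, so a truthiness-based
# "keep going?" check reads it as success.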
