Synchronize access to zstream to prevent segfault in multithreaded use #27
Conversation
I'm not sure whether this handles all multithreaded use cases, but this handles the example that crashes almost immediately and does 10,000,000 total deflates using 100 separate threads. To prevent the tests from taking forever, the committed test for this uses only 10,000 deflates across 10 separate threads, which still causes a segfault in the previous implementation almost immediately. Fixes [Bug #17803]
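For reference, a minimal sketch of the kind of stress test described above (this is an assumed shape, not the committed test; thread and iteration counts are illustrative):

```ruby
require "zlib"

# Many threads hammering one shared Zlib::Deflate instance. Before this
# patch, this kind of program segfaults almost immediately.
zd = Zlib::Deflate.new

threads = 10.times.map do
  Thread.new do
    1_000.times { zd.deflate("x" * 128) }
  end
end

threads.each(&:join)
zd.close
```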
Is there actually a valid use case, or is this just about avoiding a segfault?
The original report describes using a thread pool with EventMachine and the permessage-deflate gem.
I don't think the deflate instance should be shared between threads. As you said, there are plenty of situations where it might fail, not just this one. Now, everyone who does the right thing (uses one per thread, etc.) pays the cost of an additional mutex + locking for every operation. Is it fair?
I'm not sure all the cases are handled, true. However, there are a lot of code paths that run through zstream. In terms of fairness, I see your point, but Ruby generally goes to considerable lengths to avoid segfaults from pure Ruby code (modulo fiddle/ffi use). I think the alternative would be just accepting and documenting that the code can segfault in the multithreaded case. I don't know if it is a myth, but I've heard that uncontested mutex access is not slow. Do you expect this to cause measurable performance issues in real-world cases? If so, is there an alternative approach that would prevent the segfault without the overhead of a mutex? To check the performance difference, as a worst-case benchmark, I used:

```ruby
require "zlib"
require 'benchmark/ips'

zd = Zlib::Deflate.new

Benchmark.ips do |x|
  x.report("deflate"){zd.deflate("x")}
end

zd.close
```

Output:
I'm sure the new code is not faster, but I doubt you'd be able to even measure the difference. And this is the worst case scenario, deflating a single byte. Any code that deflates multiple bytes will pay a proportionally smaller cost. Considering that, do you still think it's a better idea to accept the current code and allow segfaults?
I agree with everything you say, and there is no visible performance difference in your case. However, it's still uncomfortable to me because I'm almost certain the original use case is still broken. Most C libraries are not thread safe, and the only way we can protect users is to synchronize everything. Performance will be impacted IMHO, but the analysis may be a little tricky. Also, my understanding is that you cannot inflate a stream of chunks out of order, so it's confusing to me how this even works (multiple threads decoding chunks of data without any synchronisation/ordering). These are two separate issues though: whether we make the interface thread safe, and the bugs in the original code...
In Ruby, because of the GVL, an uncontested Mutex is not slow, because there is no synchronisation required that isn't already provided by the GVL. That being said, I don't believe you can use a Ruby mutex without the GVL. So, I could be wrong, but I'd be comfortable suggesting that the reason why an uncontested Mutex is not slow is because it's already so slow no matter what. My only other suggestion would be that we don't release the GVL during the deflate operation, or we try to detect multi-threaded use cases and fail predictably somehow - probably impossible.
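For what it's worth, the "uncontested Mutex is not slow" claim is easy to measure directly; a hypothetical micro-benchmark along the lines of the one above:

```ruby
require 'benchmark/ips'

m = Mutex.new

Benchmark.ips do |x|
  # Same trivial work with and without an uncontested lock around it,
  # to isolate the cost of Mutex#synchronize itself.
  x.report("bare")    { :noop }
  x.report("mutexed") { m.synchronize { :noop } }
end
```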
LGTM
```ruby
s = Zlib.deflate("x" * 10000)
(0...10).map do |x|
  Thread.new do
    1000.times { zi.inflate(s) }
```
This is invalid usage and we should not be encouraging it.
https://zlib.net/zlib_faq.html#faq21
> Of course, you should only operate on any given zlib or gzip stream from a single thread at a time.
This isn't encouraging it, merely testing that it doesn't crash. I agree the spec isn't good, as you don't get the expected result when doing it. One possible way to avoid that would be to wrap the entire call to inflate and deflate in the mutex (and possibly other methods). I'll experiment with that.
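A Ruby-level sketch of that idea, assuming a hypothetical wrapper class (the actual change would live in zlib.c):

```ruby
require "zlib"

# Hypothetical wrapper: hold one mutex for the entire inflate call, so the
# whole operation is atomic rather than just the individual zstream steps.
class GuardedInflate
  def initialize(*args)
    @mutex   = Mutex.new
    @inflate = Zlib::Inflate.new(*args)
  end

  def inflate(data)
    @mutex.synchronize { @inflate.inflate(data) }
  end
end
```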
Sorry, after seeing this merged, I have more concerns. I looked at the test case. Calling inflate the way the test case does is akin to reading every page in a book in a totally random order. The output cannot possibly make sense. I understand we want to prevent the segfault... but I could not actually reproduce it. What I could reproduce is that the output doesn't make any sense:

```ruby
require 'securerandom'
require "zlib"

# If you set this to 1, it is a valid program. Any number > 1 is invalid
# and is likely to produce errors.
COUNT = 10

data = SecureRandom.random_bytes(1024**2)
compressed_data = Zlib.deflate(data)
decompressed_data = String.new.b

Thread.report_on_exception = true

IO.pipe do |input, output|
  reader = Thread.new do
    z = Zlib::Inflate.new
    threads = COUNT.times.map do
      Thread.new do
        while chunk = input.read(1024)
          decompressed_data << z.inflate(chunk)
        end
      end
    end
    threads.each(&:join)
    decompressed_data << z.finish
    z.close
    pp reader: "finished"
  end

  writer = Thread.new do
    output.write(compressed_data)
    output.close
    pp writer: "finished"
  end

  reader.join
  writer.join

  pp data == decompressed_data
end
```

This program is similar to the failure case. The output shows the round trip does not succeed: with COUNT > 1, the threads raise errors or the final comparison is false. Yes, it might work in some limited cases where you are decompressing the same data over and over again. But generally, it's invalid.
In my testing, before the patch, reproducing the crash was trivial. The test resulted in a crash every time, almost immediately. I tested it again with the changes backed out from zlib.c (leaving only the added tests), and it crashed almost immediately all 3 times I tried. Wrapping the whole call to inflate and deflate in the mutex is something I still plan to experiment with. @ioquatix I think we both agree that simultaneous usage of a single stream from multiple threads doesn't produce meaningful output; the question is whether it should be allowed to segfault. @mame what are your thoughts?
I agree we don't want to crash in general. However, there must be many C extensions which can have this kind of issue when used from multiple threads, as this is the nature of the underlying synchronisation problems when releasing the GVL. For example, I wonder how many other extensions that release the GVL have the same problem. So, while I don't think "because we can't fix them all we shouldn't fix one", I'm also concerned about the message this sends about thread safety. In terms of the cost, I haven't really had time to evaluate it, but allocating extra objects does impact GC overhead. This cost is implicit in a web server like falcon, as one extra mutex/object allocation per request. It can be a burden - or at least, we seek to reduce per-request object allocation as much as possible. There is no benefit to correct usage of Zlib::Deflate and Zlib::Inflate. The key point to me is, we are spending resources to protect an invalid use case. That does bother me a little bit, since in general, this will penalise correctly written code. With all that being said and done, I don't think I'm being all that helpful. So here is an idea: when we have a C extension which is not thread safe, and we want it to fail when used in a thread-unsafe way, why don't we do the following:
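The code from the original comment isn't preserved here, but the idea might look roughly like this in Ruby terms (the real check would sit in the C extension around the point where the GVL is released; `do_deflate_without_gvl` is a made-up placeholder):

```ruby
def deflate(data)
  # Fail loudly if another thread is already inside this method.
  raise ThreadError, "deflate already in progress" if @in_progress

  @in_progress = true
  begin
    do_deflate_without_gvl(data) # placeholder for the C-level work
  ensure
    @in_progress = false
  end
end
```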
This will immediately blow up if the method is called from two threads at the same time, i.e. while the GVL is released, another Ruby thread calls the same method. If we want to limit an object to only one thread, we could do the following:
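Again as a hypothetical sketch, one variant is to bind the stream to the thread that created it:

```ruby
class OwnedStream
  def initialize
    # Remember the creating thread and refuse use from any other.
    @owner = Thread.current
  end

  def check_owner!
    return if Thread.current == @owner
    raise ThreadError, "stream owned by #{@owner}, called from #{Thread.current}"
  end
end
```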
or
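...perhaps a variant that allows an explicit hand-off between threads (also a made-up sketch):

```ruby
def transfer_ownership!
  # Speculative alternative: let the caller re-bind the stream to the
  # current thread instead of forbidding cross-thread use outright.
  @owner = Thread.current
end
```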
Out-of-order calls on a stream due to threading are out of scope here; that's a logical issue. The fault appeared by using https://github.com/faye/permessage-deflate-ruby (which deals with a full set of data at once, and sometimes keeps the inflate/deflate instance) on a websocket along with EM on a pool of threads. (By the way @ioquatix, I plan to rewrite this with async-websocket 😉) I admit that in practice this combination makes a poor use case, but Ruby should not segfault. And I agree that being too conservative would kill performance or concurrency for those who do things right or run in less convoluted cases. The issue is also quite general, as most things are not thread safe in C land.
@maximeg Thanks for chiming in with more details. I don't fully understand what logic is leading to a segfault. The zlib stream could be used safely from multiple threads provided you are decoding one chunk at a time in sequence, such that the deflate/inflate steps don't overlap - either logically (e.g. a loop/stream) or enforced by a mutex. Can you explain the issue in more detail? The zlib stream is stateful, so any out-of-order processing is liable to create incorrect results.
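For illustration, a hypothetical sketch of the "enforced by a mutex" case: several worker threads, but each read + inflate pair is done atomically, so chunks still pass through the stream in order:

```ruby
require "zlib"

def ordered_parallel_inflate(input, workers: 4)
  z     = Zlib::Inflate.new
  out   = String.new.b
  mutex = Mutex.new

  workers.times.map do
    Thread.new do
      loop do
        mutex.synchronize do
          # Reading and inflating under one lock keeps chunks in order.
          chunk = input.read(1024) or raise StopIteration
          out << z.inflate(chunk)
        end
      end
    end
  end.each(&:join)

  out << z.finish
  z.close
  out
end
```

Of course, with the lock held across both steps the threads cannot actually overlap, which is exactly the point: the stream itself only ever sees one chunk at a time.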
By the way, I'd welcome this as part of async-websocket.
I'm not sure I have enough context to meaningfully assess all this, but I'll mention what did occur to me while reading this. Firstly, this construction is a good illustration of the problem:

```ruby
threads = COUNT.times.map do
  Thread.new do
    while chunk = input.read(1024)
      decompressed_data << z.inflate(chunk)
    end
  end
end
```

This will result in your data getting mangled because it can cause input chunks to be reordered, i.e. two concurrent threads could do this:

1. Thread A reads chunk 1 from the pipe.
2. Thread B reads chunk 2.
3. Thread B passes chunk 2 to `z.inflate` before thread A has passed in chunk 1.
Data must flow through a zlib session in the same order as it goes through the I/O stream or it will make no sense. It's relatively uncommon to have an implementation where multiple threads handle reading for a single WebSocket connection; it's more likely that in a multi-threaded app you accidentally have multiple threads writing to a WebSocket, which causes a similar problem to the one above, except you're dealing with input being compressed and then written to the TCP/TLS stream. Without knowing how you're using the permessage-deflate gem, it's hard to say exactly where the reordering happens. As an aside, it may be worth comparing with the node.js version of permessage-deflate.