
Synchronize access to zstream to prevent segfault in multithreaded use #27

Merged · 1 commit into ruby:master on Jul 16, 2021

Conversation

jeremyevans
Contributor

I'm not sure whether this handles all multithreaded use cases,
but this handles the example that crashes almost immediately
and does 10,000,000 total deflates using 100 separate threads.

To prevent the tests from taking forever, the committed test
for this uses only 10,000 deflates across 10 separate threads,
which still causes a segfault in the previous implementation
almost immediately.

Fixes [Bug #17803]

@jeremyevans jeremyevans requested a review from mame June 15, 2021 22:38
@ioquatix
Member

Is there actually a valid use case, or is this just about avoiding a segfault?

@jeremyevans
Contributor Author

Is there actually a valid use case, or is this just about avoiding a segfault?

The original report describes using a thread pool with EventMachine and the permessage_deflate gem, so I think the current behavior causes problems in a valid use case.

@ioquatix
Member

I don't think the deflate instance should be shared between threads. As you said, there are plenty of situations where it might fail, not just this one. Now, everyone who does the right thing (uses one instance per thread, etc.) pays the cost of an additional mutex + locking for every operation. Is that fair?

@jeremyevans
Contributor Author

I'm not sure all the cases are handled, true. However, there are a lot of code paths that run through zstream_run, which uses the mutex. So the current patch probably handles at least the majority of multithreading issues.

In terms of fairness, I see your point, but Ruby generally goes to considerable lengths to avoid segfaults from pure Ruby code (modulo fiddle/ffi use). I think the alternative would be just accepting and documenting that the code can segfault in the multithreaded case.

I don't know if it is a myth, but I've heard that uncontested mutex access is not slow. Do you expect this to cause measurable performance issues in real world cases? If so, is there an alternative approach that would prevent the segfault without the overhead of a mutex?

To check the performance difference, I used this as a worst-case benchmark:

require "zlib"
require "benchmark/ips"

zd = Zlib::Deflate.new

Benchmark.ips do |x|
  x.report("deflate") { zd.deflate("x") }
end

zd.close

Output:

Before:
deflate    166.392k (± 1.8%) i/s -    833.442k in   5.010592s

After:
deflate    167.529k (± 2.1%) i/s -    845.050k in   5.046563s

I'm sure the new code is not faster, but I doubt you'd be able to even measure the difference. And this is the worst case scenario, deflating a single byte. Any code that deflates multiple bytes will pay a proportionally smaller cost.

Considering that, do you still think it's a better idea to accept the current code and allow segfaults?

@ioquatix
Member

I agree with everything you say, and there is no visible performance difference in your case. However, it's still uncomfortable to me because I'm almost certain the original use case is still broken. Most C libraries are not thread safe, and the only way we can protect users is to synchronize everything. Performance will be impacted IMHO, but the analysis may be a little tricky.

Also, my understanding is that you cannot inflate a stream of chunks out of order, so it's confusing to me how this even works (multiple threads decoding chunks of data without any synchronisation/ordering). These are two separate issues, though - whether we make the interface thread safe, and the bugs in the original code...

I don't know if it is a myth, but I've heard that uncontested mutex access is not slow

In Ruby, because of the GVL, an uncontested Mutex is not slow, because there is no synchronisation required that isn't already provided by the GVL. That being said, I don't believe you can use a Ruby mutex without the GVL. So, I could be wrong, but I'd be comfortable suggesting that the reason an uncontested Mutex is not slow is that everything is already so slow no matter what.

My only other suggestion would be that we don't release the GVL during the deflate operation, or we try to detect multi-threaded use and fail predictably somehow - probably impossible.
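As a rough sanity check of the "uncontested Mutex is not slow" claim, here is a stdlib-only micro-benchmark sketch (timings are illustrative and vary by machine and Ruby version):

```ruby
require "benchmark"

N = 500_000
mutex = Mutex.new
counter = 0

# Baseline: a trivial operation with no locking.
bare = Benchmark.realtime do
  N.times { counter += 1 }
end

# The same operation wrapped in an uncontested Mutex#synchronize.
locked = Benchmark.realtime do
  N.times { mutex.synchronize { counter += 1 } }
end

puts format("bare:   %.4fs", bare)
puts format("locked: %.4fs", locked)
puts format("extra cost per call: ~%.0f ns", (locked - bare) / N * 1e9)
```

The per-call overhead this reports is the upper bound on what the patch adds, since a real deflate call does far more work per iteration than the empty increment above.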


@mame mame left a comment


LGTM

@jeremyevans jeremyevans merged commit 4b1023b into ruby:master Jul 16, 2021
zi = Zlib::Inflate.new
s = Zlib.deflate("x" * 10000)
(0...10).map do |x|
  Thread.new do
    1000.times { zi.inflate(s) }
  end
end.each(&:join)
Member

This is invalid usage and we should not be encouraging it.

Member

https://zlib.net/zlib_faq.html#faq21

Of course, you should only operate on any given zlib or gzip stream from a single thread at a time.

Contributor Author

This isn't encouraging it, merely testing that it doesn't crash. I agree the spec isn't good, as you don't get the expected result when doing it. One possible way to avoid that would be to wrap the entire call to inflate and deflate in the mutex (and possibly other methods). I'll experiment with that.
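For illustration, wrapping every call in a per-object mutex at the Ruby level would look something like the sketch below. The ThreadSafeDeflate wrapper is hypothetical, not the committed C change; it prevents overlapping calls from crashing, but chunks interleaved from different threads can still produce a stream whose ordering makes no sense:

```ruby
require "zlib"

# Hypothetical wrapper: serializes all access to one Zlib::Deflate
# behind a single external Mutex. This guards against concurrent
# entry into zlib, not against logically out-of-order chunks.
class ThreadSafeDeflate
  def initialize(*args)
    @mutex = Mutex.new
    @deflate = Zlib::Deflate.new(*args)
  end

  def deflate(data, flush = Zlib::NO_FLUSH)
    @mutex.synchronize { @deflate.deflate(data, flush) }
  end

  def finish
    @mutex.synchronize { @deflate.finish }
  end

  def close
    @mutex.synchronize { @deflate.close }
  end
end

# Hammer one instance from 10 threads; outputs are discarded because
# this only demonstrates that serialized access doesn't crash.
zd = ThreadSafeDeflate.new
threads = 10.times.map do
  Thread.new { 100.times { zd.deflate("x" * 100) } }
end
threads.each(&:join)
compressed = zd.finish
zd.close
```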

@ioquatix
Member

ioquatix commented Jul 16, 2021

Sorry, after seeing this merged, I have more concerns. I looked at the test case.

Calling inflate the way the test case does is akin to reading every page of a book in a totally random order. The output cannot possibly make sense. I understand we want to prevent the segfault... but I could not actually reproduce it. I could, however, reproduce that the output doesn't make any sense:

require 'securerandom'
require "zlib"

# If you set this to 1, it is a valid program. Any number > 1 is invalid and is likely to produce errors.
COUNT = 10

data = SecureRandom.random_bytes(1024**2)
compressed_data = Zlib.deflate(data)
decompressed_data = String.new.b

Thread.report_on_exception = true

IO.pipe do |input, output|
	reader = Thread.new do
		z = Zlib::Inflate.new
		
		threads = COUNT.times.map do
			Thread.new do
				while chunk = input.read(1024)
					decompressed_data << z.inflate(chunk)
				end
			end
		end
		
		threads.each(&:join)
		decompressed_data << z.finish
		z.close
		
		pp reader: "finished"
	end
	
	writer = Thread.new do
		output.write(compressed_data)
		output.close
		pp writer: "finished"
	end
	
	reader.join
	writer.join
	
	pp data == decompressed_data
end

This program is similar to the failure case. The output decompressed_data is completely invalid, or the library fails outright.

Yes, it might work in some limited cases where you are decompressing the same data over and over again. But generally, it's invalid.

@jeremyevans
Contributor Author

In my testing, before the patch, reproducing the crash was trivial. The test resulted in a crash every time, almost immediately. I tested it again with the changes backed out from zlib.c (leaving only the added tests), and it crashed almost immediately all 3 times I tried.

Wrapping the whole call to inflate or deflate doesn't work. You still don't get the results you expect, because of the internal buffering between calls.

@ioquatix I think we both agree that simultaneous usage of Zlib::Inflate or Zlib::Deflate in multiple threads is not going to work. Ideally, raising an exception would be best, but I cannot see a way to raise an exception that will disallow all incorrect usage and allow all correct usage. So the question becomes, do we crash or not in this case? If you think crashing is best, we can just revert this, and maybe add some documentation that multithreaded usage will result in crashes. I don't have a strong opinion.

@mame what are your thoughts?

@ioquatix
Member

ioquatix commented Jul 16, 2021

I agree we don't want to crash in general.

However, there must be many C extensions that can have this kind of issue when used from multiple threads, as this is the nature of the underlying synchronisation problems when releasing the GVL. For example, I wonder how the sqlite3 or mysql2 gems behave - do they crash?

So, while I don't think "because we can't fix them all we shouldn't fix one", I'm also concerned about the message this sends about thread safety.

In terms of the cost, I haven't really had time to evaluate it, but allocating extra objects does impact GC overhead. This cost is implicit in a web server like falcon: one extra mutex/object allocation per request. It can be a burden - or at least, we seek to reduce per-request object allocation as much as possible. There is no benefit to correct usage of zlib in this case; it only prevents incorrect usage.

The key point to me is that we are spending resources to protect an invalid use case. That bothers me a little, since in general it penalises correctly written code.

With all that being said and done, I don't think I'm being all that helpful. So here is an idea:

When we have a C extension that is not thread safe, and we want it to fail when used in a thread-unsafe way, why don't we do the following:

# Pseudo C code.
def inflate
  raise "Thread unsafe state detected" if @token
  @token = Thread.current / Fiber.current
  nogvl {inflate_implementation}
  @token = nil
end

This will immediately blow up if the method is called from two threads at the same time, i.e. while the GVL is released, another Ruby thread calls inflate. This has only a single pointer overhead, and rather than giving a false sense of security, it blows up.

If we want to limit an object to only one thread, we could do the following.

# Pseudo C code.
def initialize
  @token = Thread.current / Fiber.current
end

def inflate
  raise "Invalid thread usage" unless @token == Thread.current
end

or

def inflate
  raise "Thread unsafe state detected" unless @token == Thread.current
  @token = Thread.current / Fiber.current # Bound to current thread forever
  nogvl {inflate_implementation}
end
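As an illustration only, the first variant can be sketched in pure Ruby. The GuardedInflate name is made up here, the sleep stands in for a long GVL-releasing section, and in real C the check-and-set would happen atomically under the GVL:

```ruby
require "zlib"

class GuardedInflate
  class ThreadUnsafeUsage < RuntimeError; end

  def initialize
    @inflate = Zlib::Inflate.new
    @token = nil
  end

  def inflate(data)
    # Detect an overlapping call from another thread and fail loudly.
    raise ThreadUnsafeUsage, "Thread unsafe state detected" if @token
    @token = Thread.current
    begin
      sleep(0.2) # simulate work done with the GVL released
      @inflate.inflate(data)
    ensure
      @token = nil
    end
  end
end

data = Zlib.deflate("hello world")
gi = GuardedInflate.new

t1 = Thread.new { gi.inflate(data) }
sleep(0.05) # give t1 time to enter the guarded section

overlap_error = nil
begin
  gi.inflate(data) # overlaps t1's call, so it blows up
rescue GuardedInflate::ThreadUnsafeUsage => e
  overlap_error = e
end

result = t1.value
```

The overlapping call raises instead of corrupting state or segfaulting, while the in-flight call in t1 completes normally.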

@maximeg

maximeg commented Jul 17, 2021

Out-of-order calls on a stream due to threading are out of scope here; that's a logic issue.

The fault appeared when using https://github.com/faye/permessage-deflate-ruby (which deals with a full set of data at once, and sometimes keeps the inflate/deflate instance) on a websocket, along with EM on a pool of threads. (By the way @ioquatix, I plan to rewrite this with async-websocket 😉)
Basically, I deferred the handling of the incoming websocket message and sent a response later, on that pool. The segfault happened during the deflate of that response. @jcoglan, what do you think of this?

I admit that in practice this combination makes a poor use case, but Ruby should not segfault.

And I agree that being too conservative would kill performance or concurrency for ones that do things right or run in less convoluted cases.

The issue is also quite general, as most things are not thread safe in C land.
Maybe we could have an opt-in mechanism that releases the GVL only when the object is seen to be running on a single thread.

# C pseudo code
class SafeNoGvl
  def initialize
    @threads = Set.new
  end

  def safe_nogvl
    @threads << Thread.current
    if @threads.length == 1
      nogvl { yield }
    else
      yield
    end
  end
end

@ioquatix
Member

ioquatix commented Jul 17, 2021

@maximeg Thanks for chiming in with more details. I don't fully understand what logic is leading to a segfault. The zlib stream could be used safely from multiple threads provided you are decoding one chunk at a time in sequence, such that the deflate/inflate steps don't overlap - either logically (e.g. a loop/stream) or enforced by a mutex. Can you explain the issue in more detail? The zlib stream is stateful, so any out-of-order processing is liable to create incorrect results.

@ioquatix
Member

By the way, I'd welcome this as part of protocol-websocket.

@jcoglan

jcoglan commented Jul 17, 2021

I'm not sure I have enough context to meaningfully assess all this, but I'll mention what occurred to me while reading. Firstly, this construction is a good illustration of the problem:

		threads = COUNT.times.map do
			Thread.new do
				while chunk = input.read(1024)
					decompressed_data << z.inflate(chunk)
				end
			end
		end

This will result in your data getting mangled because it can cause input chunks to be reordered, i.e. two concurrent threads could do this:

    thread 1                 | thread 2
    -------------------------+-------------------------
    chunk = input.read(1024) |
                             | chunk = input.read(1024)
                             | z.inflate(chunk)
    z.inflate(chunk)         |

Data must flow through a zlib session in the same order as it goes through the I/O stream or it will make no sense.
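A minimal sketch of that ordering requirement: feeding the compressed chunks to a single Zlib::Inflate in stream order reconstructs the data, precisely because each chunk arrives in sequence:

```ruby
require "zlib"

data = "a" * 50_000 + "b" * 50_000
compressed = Zlib.deflate(data)

# Split the compressed stream into 1 KiB chunks (any chunk sizes work,
# as long as the order is preserved).
chunks = compressed.scan(/.{1,1024}/m)

z = Zlib::Inflate.new
in_order = chunks.map { |chunk| z.inflate(chunk) }.join
in_order << z.finish
z.close

puts in_order == data # prints true: in-order chunks round-trip correctly
```

Shuffle `chunks` before the map and the round trip breaks: zlib either raises or produces garbage, which is the reordering failure described above.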

It's relatively uncommon to have an implementation where multiple threads handle reading for a single WebSocket connection; it's more likely that in a multi-threaded app you accidentally have multiple threads writing to a WebSocket, which causes a similar problem to the one above, except you're dealing with input being compressed and then written to the TCP/TLS stream.

The permessage-deflate extension to the WebSocket protocol can operate in two modes: either a fresh zlib session is used per message, or a single zlib session is used for the entire connection, so state is retained between messages. The latter mode results in better compression at the cost of higher memory usage. It also requires that messages are written to TCP/TLS in the same order they went through zlib compression, or you'll reorder chunks of your zlib stream and it won't make any sense.

Without knowing how you're using the permessage_deflate gem I can't really comment further on your system design but my gut reaction is that papering over a segfault isn't especially useful because, per the above, there is an actual logic error in programs with this behaviour. The race condition is in the relationship between ordering of messages passing through zlib and being written to the socket, not in concurrent use of zlib by itself. Applications should ensure that this order is maintained.

As an aside, the node.js version of websocket-extensions includes machinery to maintain message order between extensions because they might do asynchronous processing. We can't do the same thing in Ruby because we're deliberately agnostic about the I/O and concurrency libraries people are running, so it's left up to applications to ensure concurrent use of WebSocket APIs is done safely.
