Deadlock when using parallelization and blocking on async code #864
I'm working on getting Npgsql's Entity Framework Core test suite to run in parallel, and am running into an issue where my tests freeze whenever parallelization is turned on. Everything works fine if I turn parallelization off.
Npgsql has a logic path whereby it sends database query information asynchronously. Without awaiting for this send to complete, it then reads synchronously, waiting for the database response. In other words, an async send is going on "in the background" while a blocking sync read is started (here's the relevant code). When parallelization is turned on, I can clearly see that the async send operation does not return - its continuation is never executed.
This resembles the classic async/sync deadlock where the continuation needs to be scheduled on a specific thread which is itself synchronously blocking on the result - except I have no idea why xunit's parallelization feature causes this (note that the test method itself isn't even an async one).
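Something like this minimal sketch captures the pattern being described (the names are hypothetical; this is not Npgsql's actual code):

```csharp
using System.IO;
using System.Threading.Tasks;

class QueryPipeline
{
    // The async send, started "in the background"...
    static async Task SendAsync(Stream stream, byte[] query)
    {
        await stream.WriteAsync(query, 0, query.Length);
        // Anything after the await is a continuation; under xUnit's
        // parallel sync context, it gets queued to that context.
    }

    // ...followed by a blocking synchronous read.
    static int Execute(Stream stream, byte[] query)
    {
        Task send = SendAsync(stream, query); // deliberately not awaited yet
        int firstByte = stream.ReadByte();    // blocking sync read
        send.Wait();                          // deadlocks if the continuation can never be scheduled
        return firstByte;
    }
}
```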
I have run into this exact same issue in a different scenario.
The root cause seems to be that when running tests in parallel, Xunit uses a problematic mechanism for regulating the rate at which it starts new tests. (This has already been discussed in this issue, but that discussion focuses on the TPL. There's an extra dimension to this if you're using async tests.)
When running tests concurrently, Xunit creates a task scheduler associated with a custom synchronization context of type `MaxConcurrencySyncContext`, which limits how many threads process the queued work.
That works very well when everything is synchronous, but unfortunately it causes two problems when you introduce asynchronous tests. First, it allows the number of tests concurrently in flight to grow in an unbounded fashion. (I'll explain why shortly.) Second, it's very prone to deadlock: if all of the threads block, all other work just sits in the queue, and if all of the threads are blocked waiting for the completion of work sitting in that queue, progress grinds to a halt. I've worked on a project where exactly this happens if you run it on a computer whose CPU has fewer than 8 hardware threads.
I've put together a couple of repros of this problem here: https://github.com/idg10/XunitAsyncHang. There are two projects, illustrating a simple case and a slightly more subtle case in which you can hit this problem; the first demonstrates it in its most direct form.
Both make the same mistake: they block in an environment where xUnit's concurrency-limiting synchronization context is current, waiting synchronously for asynchronous work to complete (a reduced form of the mistake is sketched below).
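A hypothetical reduction of the mistake - not the actual repro code, which lives in the linked repository:

```csharp
using System.Threading.Tasks;
using Xunit;

public class BlockingTests
{
    static async Task<int> ComputeAsync()
    {
        await Task.Delay(10); // continuation is posted to the ambient sync context
        return 42;
    }

    [Fact]
    public void BlocksOnAsyncWork()
    {
        // Blocking here while xUnit's concurrency-limiting context is
        // current: the continuation of ComputeAsync is queued to that
        // context, and if all of its threads are blocked like this,
        // nothing is left to run it.
        int result = ComputeAsync().Result;
        Assert.Equal(42, result);
    }
}
```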
This commit doesn't help because it only hides the scheduler. The problem is that the synchronization context is still in effect, so continuations are still funneled through its limited set of threads.
I'd like to pre-empt suggestions of the form "you should always call `ConfigureAwait(false)`": that only helps if every `await` in the entire call chain beneath the blocking call does the same, which is not something you can generally guarantee.
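For reference, this is what that advice looks like in practice - a minimal sketch, and note again that it only works if every layer below does the same:

```csharp
using System.IO;
using System.Threading.Tasks;

static class LibraryCode
{
    static async Task<string> ReadAllAsync(Stream stream)
    {
        using (var reader = new StreamReader(stream))
        {
            // ConfigureAwait(false): the continuation is allowed to run
            // on a thread-pool thread instead of being posted back to
            // the ambient SynchronizationContext.
            return await reader.ReadToEndAsync().ConfigureAwait(false);
        }
    }
}
```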
Now, earlier I said the scheduler/sync context design in Xunit can lead to unbounded growth in the number of concurrent tests. I'll now explain why. Any time a test hits an `await` that doesn't complete immediately, the thread it was running on is freed, and the sync context will happily use that freed thread to start another test.
Now, as it happens, you won't necessarily see it attempting to run all of the tests at once. Because of the structure of the various test runner classes, there are some limits to the degree of concurrency possible within a single class. But given a sufficiently large number of test classes, it's possible for the number of tests in flight to grow very large. And in particular, if you have a lot of tests that hit an `await` early on, the runner can end up starting nearly all of them before any of them finish (see the sketch below).
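A hypothetical illustration: imagine dozens of classes shaped like this. Each test yields almost immediately, so the runner keeps starting more of them:

```csharp
using System.Threading.Tasks;
using Xunit;

public class WidgetTests // imagine many classes like this one
{
    [Fact]
    public async Task LoadsSlowly()
    {
        // This await yields almost at once, freeing the worker thread;
        // the sync context uses that freed thread to start yet another
        // test, so the number of tests "in flight" keeps climbing.
        await Task.Delay(5000);
    }
}
```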
A simple solution would be to limit the number of tests in flight directly - if you have N tests in progress, don't start the next one until another completes. However, this would in some circumstances lower throughput considerably. The current design has the virtue that if for some reason your tests spend a lot of time parked on outstanding asynchronous operations (awaiting I/O, say), the threads this frees up can be used to make progress elsewhere.
So I'd like to suggest a more subtle approach: use a synchronization context that keeps track of how many threads are actively processing work, and stop starting new tests once that count is over the threshold, but continue to allow more threads to spin up. So if you have 8 busy threads and a configured concurrency level of 8, don't start any new tests until you drop down to at most 7 busy threads, but allow the thread count to go up to 10, 20, or whatever it takes. (And in practice, it would probably be best to let the built-in thread pool do the actual spinning up of threads. Use a custom scheduler to keep track of how much work is in progress, but let the thread pool decide whether to spin up new threads or queue new work items, because it already does a good job of working out the right number of threads when the work in progress exceeds the hardware thread count. A sketch follows below.)
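Here's a minimal sketch of the kind of context I have in mind. The names are mine, and this is far from a drop-in replacement for xUnit's internals - it just shows the shape of the idea: queue everything to the built-in thread pool, track how much work is executing, and have the runner consult that count only when deciding whether to start a *new test*:

```csharp
using System.Threading;

sealed class BusyTrackingSyncContext : SynchronizationContext
{
    int busy;

    public int BusyCount => Volatile.Read(ref busy);

    public override void Post(SendOrPostCallback d, object state)
    {
        // Let the thread pool decide whether to spin up a thread or
        // queue the work; we only keep the bookkeeping.
        ThreadPool.QueueUserWorkItem(_ =>
        {
            Interlocked.Increment(ref busy);
            try { d(state); }
            finally { Interlocked.Decrement(ref busy); }
        });
    }

    // The runner's gate: only start another test when under the limit.
    // Work belonging to tests already in flight is never gated.
    public bool CanStartNewTest(int maxBusy) => BusyCount < maxBusy;
}
```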
The use of the thread pool in this way wouldn't eliminate every problem, of course. Certainly, if you have tests that can block, then thread starvation is still going to be a risk.
I'd be happy to try a fix.
Note that in my case the problematic test(s) isn't an async test at all - the blocking happens inside Npgsql itself, beneath a synchronous test method.
For example, suppose you write an `await` in the compact form shown first in the sketch below, and then look at the split-out equivalent that follows it.
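(A sketch with hypothetical method names; the split-out form is only approximately what the compiler generates, but it makes the scheduling visible.)

```csharp
using System.Threading.Tasks;

class Example
{
    static Task<int> ComputeAsync() => Task.FromResult(42); // hypothetical
    static void Use(int value) { /* ... */ }

    // Compact form:
    static async Task CompactAsync()
    {
        int result = await ComputeAsync();
        Use(result);
    }

    // Split-out equivalent:
    static void SplitOut()
    {
        var awaiter = ComputeAsync().GetAwaiter();
        awaiter.OnCompleted(() =>
        {
            // OnCompleted captures the current SynchronizationContext
            // and posts this continuation back through it. Under
            // parallel xUnit, that is the concurrency-limiting context.
            Use(awaiter.GetResult());
        });
    }
}
```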
And when you split it out like this, it should be clear that the continuation has to be scheduled through whatever `SynchronizationContext` was current at the `await` - which, under parallelized xUnit, is the concurrency-limiting context.
Oren, if you take a look at https://github.com/idg10/xunit/tree/limit-concurrency (and idg10@e68cd17 in particular) there's a very simple fix that verifies that the root of the problem is as described.
However, this probably isn't good enough. I have to admit that I don't understand how your custom synchronization context interacts with the async test context, so my change may well break something there.
That workaround is probably going to hose the async context:
It's really important that the `AsyncTestSyncContext` is the current context when the test method runs; otherwise async tests (and especially `async void` tests) won't be tracked correctly.
With these deadlocks, do we know if things are completing? Adding some trace/debug statements to that context could help expose something.
A couple of other things (I don't have it set up to debug at the moment):
The async context is created here - is that before or after the point where you're nulling it out?
That may be set lower anyway.
Second, what if you set `maxParallelThreads` to something huge (see http://xunit.github.io/docs/configuring-with-json.html) - say, 100? That effectively gives you an unbounded set.
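For reference, per the docs linked above, the suggested configuration would look something like this in xunit.runner.json:

```json
{
  "maxParallelThreads": 100
}
```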
I can, and I do.
If you're writing a library, then any use of these blocking functions without ensuring you're not on a sync context is absolutely, 100% guaranteed to cause deadlocks in some applications. The sample code that illustrates the "bug" in xUnit.net is just plain broken code. It's like writing a sample test showing a "bug" in xUnit.net because the test throws an exception of its own making.
If you can't guarantee that you've transitioned off of any sync context, then you cannot safely call any of those problematic methods. Period, end of story. There is no wiggle room in this rule. The sample code absolutely violates those rules.
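For what it's worth, a common way to satisfy that rule when you must block is to hop onto the thread pool first. This is a sketch of a general technique, not something xUnit itself prescribes; it avoids the context deadlock but can still exhaust the thread pool under load:

```csharp
using System;
using System.Threading.Tasks;

static class SyncOverAsync
{
    // Task.Run moves the async work (and its continuations) onto the
    // default thread pool, away from the ambient SynchronizationContext,
    // so blocking on the result here cannot deadlock against that context.
    public static T Run<T>(Func<Task<T>> work)
        => Task.Run(work).GetAwaiter().GetResult();
}
```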
The difference with `await` is that it doesn't block the thread while waiting; it gives the thread back to the scheduler.
My code might not be a (real) library; I might be writing tests for routines in my own executable programs, which aren't meant for consumption by others. In my program there is no sync context (it's not a GUI app), and xunit forces me to worry about these things by artificially bringing a sync context into my life for its parallelization implementation.
And if that isn't convincing, think of your users who depend on a badly-written library. Of course you can claim it's the library's fault, but that doesn't provide a solution to anyone (and neither does suggesting unbounded concurrency).
I admit I simply assumed that, to implement parallelization, xunit spins off threads as needed and takes care not to execute more than N tests concurrently. I'm sure you have your reasons for doing it via a sync context, but that approach does have its adverse effects.
Brad, note that I said this was intended only as an attempt to test my hypothesis about where the problem lies. I was not about to submit a PR...
I think you are mistaken to think there's no problem here. For one thing, deadlocking is not the only issue caused by the current design: xUnit's tendency to let the number of tests in flight spiral out of control when confronted with lots of async tests causes trouble (spurious timeouts, for example) even when nothing deadlocks.
Second, although in most scenarios blocking on asynchronous work is a bad idea, there are situations in which code can do it safely, and such code will still deadlock under xUnit's synchronization context.
Even if you decided that xUnit was serving a greater moral purpose by deliberately deadlocking when people use these techniques in the code under test (which I believe would be a mistake, because of the second point above), you still don't need to deadlock when such techniques are used purely in test code.
(Regarding your proposed workaround: how do I control that in the VS test runner? I don't see how to pass it that switch.)
I think it won't blow away the async context, because that only gets set at the very last minute - right before invoking the test method itself. All the problems my repros illustrate occur in the earlier stages of the test lifecycle, before the test method is invoked.
Really, what you want is to rate limit specifically at the point where new tests are started, rather than throttling all work queued to the context.
However, this may all be moot since it seems like Brad's not open to changing this. A proper fix would be non-trivial, and I'm not going to put any time into it if it's almost certain to be rejected.
Yes. Purposefully. To help catch broken (deadlocking) code.
If your code deadlocks, use the workaround that stops the deadlocks.
They can (and, philosophically, one could argue that they should) mock away dependencies.
That is exactly what it does, through the use of `MaxConcurrencySyncContext`.
No, this is working as designed. The intention is not to limit the number of tests which have been started, but rather to limit the number of tests which are currently simultaneously executing. There is no reason to artificially slow down and wait when a test has voluntarily (via `await`) yielded its thread.
What would you propose that we do about these worker threads that we cannot (by your own admission) have known anything about?
Limiting the number of concurrently executing threads is not an attempt to make things "easy for you to see"; it's there so that users can opt to increase or reduce the number of threads such that CPU utilization matches their desired rate (our default of 1 thread per logical CPU thread is the best default for this on a system which is otherwise not busy).
If you want to make it easy to see things, turn off parallelization.
As already stated many times, this design is purposeful, to help people catch their own broken (deadlocking) code, and there is an available workaround for people who do not wish to have this capability.
And there's no reason to use these techniques in your test code, since you most definitely have access to `async` and `await` there.
Use a configuration file, and set `maxParallelThreads` to a suitably large value.
Oren is correct, and you are incorrect. The sync context that is current when the test method is called must be the `AsyncTestSyncContext`, or async tests cannot be tracked properly.
You are correct. We have no intention of making any changes here. We thought carefully about what we wanted to achieve, and believe we have successfully achieved those goals.
By your own statement, it's only broken if it's a library. There's other code that needs to get tested in this world.
More importantly, if one of your goals is to help users catch deadlocking code, it's inconsistent to do that only if parallelization is turned on. That again makes it seem that an implementation detail of parallelization is being recast as a purposeful "goal".
Test frameworks are frequently used for integration tests, not just unit tests - in which case mocks make no sense at all.
We've already been through this, you can only say it's "inappropriate" if it's a general-purpose library.
I'm going to drop this now; it would have been nice to at least get some acknowledgement of an issue. What I can suggest is that you make this very clear in the documentation - i.e., that when parallelization is on, a special sync context is set up which will trigger deadlocks in code that synchronously blocks on asynchronous work, unless every `await` beneath the blocking call uses `ConfigureAwait(false)`.
Separating out this reply from the other points because it doesn't really have any bearing on what ought to be done, but it's peripherally useful because it demonstrates that you're not thinking as clearly as you think you are about this whole issue, and that an evidence-based approach might produce better results.
Sorry Brad, but it takes only the simplest of experiments to show that, as I said, the async context does not get blown away. (As ever, the scientific method tends to be more reliable than relying purely on thinking things through. As it happens, my thought processes worked better than yours here, but it's what can be shown that matters more than what any individual thinks.) I put this inside one of my test methods in the repro project:
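(The original snippet isn't preserved in this thread; the following is a reconstruction of the sort of check described, using xUnit's `ITestOutputHelper`.)

```csharp
using System.Threading;
using Xunit;
using Xunit.Abstractions;

public class ContextProbeTests
{
    readonly ITestOutputHelper output;

    public ContextProbeTests(ITestOutputHelper output) => this.output = output;

    [Fact]
    public void ReportAmbientContext()
    {
        // Print the type of the sync context that is current while the
        // test method body runs.
        SynchronizationContext ctx = SynchronizationContext.Current;
        output.WriteLine(ctx == null ? "No context" : ctx.GetType().FullName);
    }
}
```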
If you were right to say I was wrong (i.e., if the async context was in fact blown away), this would print either "No context" or the name of some context type other than the async context. In fact, it prints out `AsyncTestSyncContext`.
I suggest you check your assumptions the next time you accuse someone of being wrong in public.
Unfortunately, as this detail shows, thinking alone is not reliable. Unless you also take evidence into account, you will often end up at the wrong conclusion. And it remains my view that the results of your thought processes about what you wanted to achieve have left xUnit.net in a slightly problematic state.
You misunderstand what I was proposing. I'm not suggesting slowing down merely because a test has yielded via `await`.
(I should state an assumption here: I assume the underlying goal here is to minimize the amount of time it takes to execute all the tests in a solution. I would hope that's the real goal, rather than implementation-driven things like "keeping exactly N threads busy running tests, or context-bound work originated by tests". If running tests quickly is not a goal, then none of what follows is relevant. With that in mind, I will throw in an observation: throughput can often be made worse by having too many operations in flight, because it puts pressure both on CPU caches and the GC, so although it's often desirable to have enough concurrency to keep your CPU cores busy, any more than that is typically harmful. Consequently, once a test has begun, work requiring CPU time to move that test to completion should take priority over tests that haven't started yet.)
My suggestion is quite different from your oversimplified interpretation of it. What I'm suggesting is only slightly different from what you already do, but the difference is crucial: currently, when the number of active threads hits the configured limit, you refuse to start any more work of any kind in the sync context (whether that's starting a new test or processing work queued by a test already in progress). I'm suggesting that at this point you refuse only to start new tests. The critical difference is that you would still be open to starting new work kicked off by tests in progress, allowing the thread count to increase if necessary. (The obvious thing to do would be to let the TPL thread pool handle the actual creation of threads, not least because its LIFO-per-thread, FIFO-across-threads approach does a much better job than your sync context's pure LIFO of (a) finishing the work already started before beginning anything new, and (b) getting better locality, thus improving throughput by using the CPU cache more efficiently.)
Although this conversation started with deadlocks, deadlocks are not the only problem caused by the current design. I work on one project (in which I never see the deadlock issue) where some tests have timeouts that keep firing spuriously unless I disable parallel execution in xUnit.net. This happens because xUnit merrily throws more and more new tests into the mix even though all the CPU cores are busy (and it can't see that this is happening, because the work in question runs on the thread pool rather than going through its synchronization context).
The best way to handle this would be with a custom scheduling policy of the kind described above: gate the starting of new tests on the number of busy threads, while always letting work for tests already in progress proceed.
As it happens this would also prevent the deadlocks. But even if you don't care about them, perhaps you might care about the way tests already in progress can get bogged down as a result of new tests being started even when all CPU cores are busy. Or perhaps you have a philosophical objection to tests that contain timeouts?
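A hypothetical example of how such work can become invisible to the runner's accounting (names are mine, for illustration):

```csharp
using System.Threading.Tasks;
using Xunit;

public class EscapingWorkTests
{
    static void CpuHeavyWork() { /* burn CPU for a while */ }

    [Fact]
    public async Task DoesCpuWorkOffContext()
    {
        // Task.Run puts the work on the default thread pool, and
        // ConfigureAwait(false) keeps the continuation there too.
        // None of this runs on xUnit's sync context, so the runner
        // keeps starting new tests as if the CPU were idle.
        await Task.Run(() => CpuHeavyWork()).ConfigureAwait(false);
    }
}
```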
I don't actually think that that's what the current design does. Each parallelization boundary (test class, by default) schedules one task to start running without regard to the number of tasks already scheduled, and depends on the concurrency sync context to guarantee that only a specific number of tasks are in the running state at any given time. It is expected behavior that many, many tasks will be scheduled almost immediately and simultaneously - often many more than could run simultaneously, for any but the smallest unit test projects.
It is working as designed.
We explicitly tell people (when they ask why we removed timeout support from `[Fact]`) that timeouts cannot coexist reliably with parallelized tests.
If you've opted to add back in a timeout feature, then you must do it understanding that you will either be wildly inaccurate, or need to disable parallelization.
If you have an alternative implementation which you feel is better, yet does not compromise our design goals, by all means please send a PR.
For now, this thread has reached the limit of its usefulness, so I'm locking it. We can have further discussion once a PR has been issued, on the technical merits of that PR itself. I am not spending any more time defending our design decisions for an issue with a simple and effective workaround.