Deadlock when using parallelization and blocking on async code #864

Closed
roji opened this Issue May 30, 2016 · 23 comments

@roji
Contributor

roji commented May 30, 2016

I'm working on getting Npgsql's Entity Framework Core test suite to run in parallel, and am running into an issue where my test freezes whenever parallelization is turned on. Everything works fine if I specify [assembly: CollectionBehavior(DisableTestParallelization = true)], but the issue manifests the moment I remove it, including if I specify [assembly: CollectionBehavior(MaxParallelThreads = 1)]. I'm using dotnet-test-xunit 1.0.0-rc3-*, so xunit 2.1.0.

Npgsql has a logic path whereby it sends database query information asynchronously. Without awaiting for this send to complete, it then reads synchronously, waiting for the database response. In other words, an async send is going on "in the background" while a blocking sync read is started (here's the relevant code). When parallelization is turned on, I can clearly see that the async send operation never completes - its continuation is not executed.
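
To make the shape of the problem concrete, here's an illustrative sketch (hypothetical names, not the actual Npgsql code) of the pattern I mean:

using System.Net.Sockets;
using System.Threading.Tasks;

static class QueryRunner
{
    // Kick off the query send asynchronously; its continuation must eventually run on some thread.
    static async Task SendQueryAsync(NetworkStream stream, byte[] query)
    {
        await stream.WriteAsync(query, 0, query.Length).ConfigureAwait(false);
    }

    // Meanwhile, block synchronously waiting for the response. If the send's continuation
    // can never be scheduled, the server never receives the query and this read waits forever.
    public static int ExecuteQuery(NetworkStream stream, byte[] query)
    {
        Task sendTask = SendQueryAsync(stream, query);
        int firstByteOfResponse = stream.ReadByte();
        sendTask.GetAwaiter().GetResult();
        return firstByteOfResponse;
    }
}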

This resembles the classic async/sync deadlock where the continuation needs to be scheduled on a specific thread that is itself synchronously blocking on the result - except I have no idea why xunit's parallelization feature causes this (note that ConfigureAwait(false) is properly specified). I've also seen this issue, this commit and the release notes for 2.0 which mention what seems to be this exact issue - but I'm still getting it with 2.1.

@idg10

idg10 commented Jun 1, 2016

I have run into this exact same issue in a different scenario.

The root cause seems to be that when running tests in parallel, Xunit uses a problematic mechanism for regulating the rate at which it starts new tests. (This has already been discussed in this issue but that discussion focuses on the TPL. There's an extra dimension to this if you're using async/await so it's worth revisiting the root of the problem in the light of the usage patterns associated with async code rather than the use of pure TPL tasks.)

When running tests concurrently, Xunit creates a task scheduler associated with a custom synchronization context of type MaxConcurrencySyncContext. That context creates a fixed number of threads and distributes work through a queue that all of those threads listen to. This means that at any given instant, no more than N tasks can be executing code, where N is the concurrency level.

That works very well when everything is synchronous, but unfortunately it causes two problems when you introduce asynchronous tests. First, it allows the number of tests concurrently in flight to grow in an unbounded fashion. (I'll explain why shortly.) Second, it's very prone to deadlock: if all of the threads block, all other work just sits in the queue, and if all of the threads are blocked waiting for the completion of work that's sitting in the queue, progress grinds to a halt. I've worked on a project where exactly this happens if you run it on a machine whose CPU has fewer than 8 hardware threads.
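
For illustration, here's a minimal sketch of that general shape (this is the pattern only, not xUnit's actual MaxConcurrencySyncContext): a fixed set of worker threads all drain a single queue of posted callbacks, so if every worker is blocked waiting for work that is still sitting in that queue, nothing can make progress.

using System.Collections.Concurrent;
using System.Threading;

class FixedWorkerSyncContext : SynchronizationContext
{
    // All posted continuations go through this one queue.
    readonly BlockingCollection<(SendOrPostCallback callback, object state)> queue =
        new BlockingCollection<(SendOrPostCallback callback, object state)>();

    public FixedWorkerSyncContext(int workerCount)
    {
        for (int i = 0; i < workerCount; i++)
        {
            var worker = new Thread(() =>
            {
                foreach (var item in queue.GetConsumingEnumerable())
                    item.callback(item.state);   // a callback that blocks ties up this worker indefinitely
            });
            worker.IsBackground = true;
            worker.Start();
        }
    }

    public override void Post(SendOrPostCallback d, object state) => queue.Add((d, state));
}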

I've put together a couple of repros of this problem here: https://github.com/idg10/XunitAsyncHang. There are two projects illustrating a simple and a slightly more subtle case where you can hit this problem. The first (in SimpleAsyncHang) will start running as many tests as you have hardware threads in your CPU and then stop, making no further progress. The second (the SubtleAsyncHang project) will get a bit further. (On my 4-core, 2-way hyperthreaded, i.e. 8-hardware-thread machine, it seems to start 16 tests before hanging.)

Both make the same mistake: they block in an async method. The 'subtle' one ends up letting more work happen because it yields first, enabling the thread to return to the thread pool for a bit, with the upshot that Xunit may decide to start running another test without waiting for the current one to complete. If you replace the await Task.Yield(); on line 12 of YieldThenWaitInInit.cs with await Task.Delay(10); and bump the delay on line 20 from 10 to 20, then on my machine the first 12 tests complete successfully before it deadlocks with 16 tests in progress.
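
The essential pattern (a hedged sketch in outline, not code copied from the repro) looks roughly like this; assuming the worker thread's current SynchronizationContext is the limited-concurrency one, the inner await captures it, its continuation gets posted onto the already-saturated queue, and the Wait() holds the very worker thread that continuation needs:

using System.Threading.Tasks;
using Xunit;

public class BlockInAsyncInit : IAsyncLifetime
{
    static async Task DoWorkAsync()
    {
        await Task.Delay(10);   // this continuation is posted back to the captured sync context
    }

    public Task InitializeAsync()
    {
        DoWorkAsync().Wait();   // blocks a limited worker thread while waiting for that continuation
        return Task.CompletedTask;
    }

    public Task DisposeAsync() => Task.CompletedTask;

    [Fact]
    public void Test() { }
}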

This commit doesn't help. The problem is that await isn't looking for the current scheduler - it schedules continuations on the current SynchronizationContext, and the change Stephen Toub made there hides only the scheduler, not the synchronization context.

I'd like to pre-empt suggestions of the form "You should always call ConfigureAwait." Some people believe this to be a rule, so no doubt someone will say that's the fix. But in fact a dogmatic "Always use ConfigureAwait" rule is wrong. The await keyword continues on the same context it started on by default for a good reason: you very often want that. If you're writing UI code you very often want that. And if you're writing web code that needs access to anything in the context of the request, you typically need it too. In practice you need to decide what to do for each await but since there are plenty of scenarios in which the default context-affine behaviour is absolutely what code requires, it's unsupportable for a test framework to force you to do something else. So you cannot solve this problem by telling people they should slap ConfigureAwait warts onto every await.

Now, earlier I said the scheduler/sync context design in Xunit can lead to unbounded growth in the number of concurrent tests. I'll now explain why. Any time a test hits an await that does not complete immediately, the method returns, meaning the thread comes back into the custom synchronization context's worker loop and pulls another work item off the queue. And since, unlike the TPL's thread pool, Xunit's custom synchronization context does not operate a LIFO queue (strictly speaking, the TPL uses thread-local LIFO and then shifts to FIFO for cross-CPU work stealing once local work is complete), it's highly likely that work items for as-yet-unstarted tests will be processed ahead of work items pertaining to tests already in progress. So every time a test uses an await to wait for something to happen, it's quite likely that yet another test will start.

Now as it happens you won't necessarily see it attempting to run all of the tests at once. Because of the structure of the various test runner classes, there are some limits to the degree of concurrency possible within a single class. But given a sufficiently large number of test classes, it's possible for the number of tests in flight to grow very large. And in particular, you can find that if you have a lot of tests that have an await early on, you can end up with hundreds of tests all competing for time on the CPU, with none of them making particularly fast progress.

A simple solution would be to limit the number of tests in flight directly - if you have N tests in progress, don't start the next one until another one completes. However, this will, in some circumstances, lower throughput considerably. The current design has the virtue that if for some reason your tests spend a lot of time with outstanding await expressions that are waiting for something properly slow, it will increase the number of concurrent tests to make better use of the CPU. If things go well, it will have exactly as many tests in flight as it takes to saturate however many CPU cores you have. In fact, that only works if tests never actually block threads, but doubtless for some users the simple fix would significantly slow down test suites for no apparent benefit.

So I'd like to suggest a more subtle approach: use a synchronization context that keeps track of how many threads are actively processing work, and stop starting new tests once over the threshold, but continue to allow more threads to spin up. So if you have 8 busy threads and a configured concurrency level of 8, don't start any new tests until you drop down to at most 7 busy threads, but allow the thread count to go up to 10, 20, or whatever it takes. (And in practice, it would probably be best to just let the built-in thread pool do the actual spinning up of threads. Use a custom scheduler to keep track of how much work is in progress, but let the thread pool decide whether to spin up new threads or queue new work items, because it already does a good job of working out how many threads are needed when the work in progress exceeds the hardware thread count.)
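
To make that concrete, here's a rough sketch of the gating idea (hypothetical names, not working xUnit code). Here busyThreads is assumed to report how many threads are actively executing work at this instant, and only the start of brand-new tests is gated on it, so work belonging to tests already in progress is never starved:

using System;
using System.Threading.Tasks;

class NewTestGate
{
    readonly Func<int> busyThreads;   // supplied by whatever tracks active worker threads
    readonly int maxBusy;

    public NewTestGate(Func<int> busyThreads, int maxBusy)
    {
        this.busyThreads = busyThreads;
        this.maxBusy = maxBusy;
    }

    // Called only before starting a brand-new test; never for work queued by running tests.
    public async Task WaitToStartNewTestAsync()
    {
        while (busyThreads() >= maxBusy)
            await Task.Delay(10).ConfigureAwait(false);   // crude polling; a real version would use signalling
    }
}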

@roji

Contributor

roji commented Jun 1, 2016

@idg10 thanks for the analysis, but the odd thing is that in my case I am using ConfigureAwait(false) when awaiting and the issue still occurs - I am not sure how this can happen.

@idg10

idg10 commented Jun 1, 2016

The use of ConfigureAwait(false) doesn't necessarily help, because it only ever affects a single continuation. So unless you can get to every single await in the code path under test (including any in libraries you use but don't control) it might not help.

Certainly, if you have tests that can block then thread starvation is still going to be a risk.

@roji

Contributor

roji commented Jun 1, 2016

Can you elaborate? I was under the impression that using ConfigureAwait(false) on a top-level await is enough, i.e. that awaits in code called from that await implicitly execute on the thread pool (i.e. no synchronization context). Is that wrong?

@onovotny

Member

onovotny commented Jun 1, 2016

Interesting conversation -- any chance you can test a fix? The other purpose of that sync context is to properly handle async void test cases as it tracks the OperationStarted/Completed callbacks that the async/await state machines generate.

@roji

Contributor

roji commented Jun 1, 2016

I'd be happy to try a fix.

Note that in my case the problematic test(s) aren't async void ones; they're actually sync tests that somewhere in their logic spawn off an async task and then perform a sync blocking read that depends on that async task completing (which it never does).

@idg10

idg10 commented Jun 1, 2016

roji, ConfigureAwait(false) determines only what will happen when the continuation for the await in question occurs. It cannot possibly have any effect on nested code because you're calling it after that nested code has returned.

For example, if you write await Foo().ConfigureAwait(false) and Foo() was unable to complete synchronously, then it must already have kicked off its first async operation before returning; but that call to ConfigureAwait obviously can't occur until after Foo() returns, because it needs the Task (or other awaitable) returned by Foo. The code is in effect this:

var t = Foo();
var ct = t.ConfigureAwait(false);
await ct;

And when you split it out like this, it should be clear that Foo() has no way of knowing that you called ConfigureAwait.
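
To illustrate (hypothetical code, not from any real library): the ConfigureAwait(false) at the call site affects only the caller's own continuation, while any await inside Foo has already captured whatever SynchronizationContext was current when Foo started and will still try to resume on it.

using System.Threading.Tasks;

static class Example
{
    static async Task Foo()
    {
        await Task.Delay(10);              // captures the caller's sync context
        // ...this line wants to resume on that captured context...
    }

    static async Task Caller()
    {
        await Foo().ConfigureAwait(false); // only *this* continuation ignores the context
    }
}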

@idg10

idg10 commented Jun 1, 2016

Oren, if you take a look at https://github.com/idg10/xunit/tree/limit-concurrency (and idg10@e68cd17 in particular) there's a very simple fix that verifies that the root of the problem is as described.

However, this probably isn't good enough. I have to admit that I don't understand how your custom synchronization context handles async void tests, so my change may well be breaking that. I've not really thought through the consequences of cutting this out at this particular point - it may well in fact be too early.

@onovotny

Member

onovotny commented Jun 1, 2016

That workaround is probably going to hose the async context:
https://github.com/xunit/xunit/blob/master/src/xunit.execution/Sdk/AsyncTestSyncContext.cs

It's really important that the AsyncTestSyncContext be the current context.

With these deadlocks, do we know if things are completing? Adding some trace/debug statements to that context could help expose something.

@onovotny

Member

onovotny commented Jun 1, 2016

A couple of other things -- I don't have it set up to debug at the moment --

The async context is created here -- is it before or after the place you're nulling it out?
https://github.com/xunit/xunit/blob/6381c93738860481af501a2372f796492b6e8d42/src/xunit.execution/Sdk/Frameworks/Runners/TestInvoker.cs

That may be set lower anyway.

Second, what if you set maxParallelThreads to something huge (http://xunit.github.io/docs/configuring-with-json.html), like 100 or so? That effectively gives you an unbounded set.

@bradwilson

Member

bradwilson commented Jun 1, 2016

@idg10 If that's your "fix", then you should use -maxthreads unlimited, which simply bypasses the max concurrency sync context entirely.

It almost assuredly means your async code is broken, FWIW. :)

@bradwilson

Member

bradwilson commented Jun 1, 2016

Repeat after me:

Calling .Wait() causes deadlocks.
Calling .Result causes deadlocks.
Calling .GetAwaiter().GetResult() causes deadlocks.

xUnit.net is working as expected. Your code is broken.
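
For the avoidance of doubt, this is the shape of code those statements are aimed at (an illustrative sketch, not anyone's actual test): on a single-threaded or saturated sync context, blocking on an async method pins the very thread its continuation needs.

using System.Threading.Tasks;

static class Broken
{
    static async Task<int> GetValueAsync()
    {
        await Task.Delay(10);            // continuation is posted back to the captured context
        return 42;
    }

    public static int GetValueBlocking()
    {
        return GetValueAsync().Result;   // blocks the context's only available thread: deadlock
    }
}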

@roji

Contributor

roji commented Jun 1, 2016

@bradwilson I agree with @idg10, xunit is imposing a very specific environment on its tests here via its SynchronizationContext, which is quite invasive and can be a problem. Wait(), Result and others can make sense in some applications - you can't just say any code using them is broken...

@bradwilson

Member

bradwilson commented Jun 1, 2016

There is a workaround (turning off the max concurrency sync context). We have no intention of doing anything else in regards to this issue, because it's working as we designed it.

@bradwilson

Member

bradwilson commented Jun 1, 2016

you can't just say any code using them is broken...

I can, and I do.

If you're writing a library, then any use of these functions without ensuring you're not on a sync context is absolutely, 100% guaranteed to cause deadlocks in some applications. The sample code that illustrates the "bug" in xUnit.net is just plain broken code. It's like writing a sample test showing a "bug" in xUnit.net because it throws a DivideByZeroException when the test code is x = 1 / 0;. This is the expected behavior of the system.

If you can't guarantee that you've transitioned off of any sync context, then you cannot safely call any of those problematic methods. Period, end of story. There is no wiggle room in this rule. The sample code absolutely violates those rules.

@roji

Contributor

roji commented Jun 1, 2016

The difference with DivideByZeroException is that in the async case, xunit is setting things up in a very specific way that triggers the deadlock.

I might not be a (real) library, I might be writing tests for routines for my executable programs which aren't meant for consumption by others. In my program there is no sync context (I'm not a GUI), and xunit is forcing me to worry about these things by artificially bringing a sync context into my life for its parallelization implementation.

And if that isn't convincing, think of your users who depend on a badly-written library. Of course you can claim it's the library's fault, but that's not providing a solution to anyone (neither does suggesting unbounded concurrency).

I admit I simply assumed that to implement parallelization xunit simply spins off threads as requested and takes care not to execute more than N tests concurrently. I'm sure you guys have reasons for doing things via a sync context but that does have its adverse effects.

@idg10

idg10 commented Jun 1, 2016

Brad, note that I said this was intended only as an attempt to test my hypothesis about where the problem lies. I was not about to submit a PR...

I think you are mistaken to think there's no problem here. For one thing, deadlocking is not the only issue caused by the current design: if xUnit's tendency to spiral the number of tests in flight way out of control when confronted with lots of awaits is "expected" then that's a pretty low standard to which to hold yourself. I have a test suite in which the xunit test runner output crawls across the screen like it's coming over a 1200 bps modem because there are so many threads competing for a time slice. (And the reason there are so many threads is a combination of the fact that a) my code is in fact using ConfigureAwait so my continuations spin up worker threads where your throttling can't see them and b) xunit splurges a load of tests out all at once because doing things 'properly' with ConfigureAwait has meant it can't see what I'm doing, so this rate limiting code doesn't really rate limit in a way that's terribly useful. This same problem has caused me to hit timeouts when running tests. This is a different symptom from the deadlocks, but the same root cause, which is that a SynchronizationContext is really not a good fit for the problem at hand. Mismatches like this tend to be a source of trouble.)

Second, although in most scenarios Wait and friends are indeed a recipe for deadlock, you have not considered two important things. 1) Even if you avoid this stuff like the plague in your code under test, it's perfectly legitimate to use it in the test code, and often a good deal more convenient than the alternatives, so it's a bit feeble of xUnit to fail to cope with it; the fact that it would very often be bad to do this in production code is no excuse, because the code you've identified as "broken" isn't production code and it isn't library code. The rules it "absolutely violates" are, as you said yourself, only really for "if you're writing a library" and therefore not applicable here. 2) In highly constrained circumstances there are legitimate (highly unusual, yes, but still legitimate) reasons for using these. (Obviously one of the constraints is that you have to know the entirety of the context in which the call will occur. You qualified your own statements with "If you're writing a library", but if I'm writing a test setup, then I'm not writing a library, and I do know the context. And there's no reason for it not to work in that case beyond xunit's failure to cope.)

Even if you decided that xUnit was serving a greater moral purpose by deliberately deadlocking when people use these techniques in the code under test (which I believe would be a mistake because of point 2), you don't also need to deadlock when such techniques are used purely in test code.

(Regarding your proposed workaround: how do I control that in the VS test runner? I don't see how to pass it that switch.)

Oren,

I think it won't blow away the async context because that only gets set at the very last minute - right before invoking the test method itself. All the problems my repro illustrates are in the IAsyncLifetime.InitializeAsync method, which is upstream of where the async context gets inserted. But it's not really relevant, because I don't think a proper fix would resemble what I've done.
I think the right way to solve this will involve not attempting to use a SynchronizationContext to rate limit test execution - it's just the wrong tool for the job. We want to rate-limit one specific thing: execution of new tests; a SynchronizationContext doesn't enable us to distinguish between that and additional work queued up by tests already in progress, and we very much don't want to rate-limit the latter, because doing that is what's causing the deadlocks.

Really, what you want is to rate limit specifically at TestInvoker.RunAsync. That's where it's going wrong (not just for deadlocks, but also for the too-much-concurrency problems I've had in projects that don't deadlock).

However, this may all be moot since it seems like Brad's not open to changing this. A proper fix would be non-trivial, and I'm not going to put any time into it if it's almost certain to be rejected.

@bradwilson

Member

bradwilson commented Jun 2, 2016

@roji wrote:

The difference with DivideByZeroException is that in the async case, xunit is setting things up in a very specific way that triggers the deadlock.

Yes. Purposefully. To help catch broken (deadlocking) code.

I might not be a (real) library, I might be writing tests for routines for my executable programs which aren't meant for consumption by others.

If your code deadlocks, use the workaround that stops the deadlocks.

And if that isn't convincing, think of your users who depend on a badly-written library.

They can (and, philosophically, argue that they should) mock away dependencies.

I admit I simply assumed that to implement parallelization xunit simply spins off threads as requested and takes care not to execute more than N tests concurrently.

That is exactly what it does, through the use of Tasks. When you use Tasks inappropriately in such a way as to block the working thread, how exactly is xUnit.net to know that you're just blocking instead of, say, doing something? It cannot.

@bradwilson

Member

bradwilson commented Jun 2, 2016

@idg10 wrote:

if xUnit's tendency to spiral the number of tests in flight way out of control when confronted with lots of awaits is "expected" then that's a pretty low standard to which to hold yourself.

No, this is working as designed. The intention is not to limit the number of tests which have been started, but rather to limit the number of tests which are currently simultaneously executing. There is no reason to slow down and wait artificially when a test has voluntarily (via await) yielded back control of the thread while it waits for some other operation to complete.

a) my code is in fact using ConfigureAwait so my continuations spin up worker threads where your throttling can't see them

What would you propose that we do about these worker threads that we cannot (by your own admission) have known anything about?

b) xunit splurges a load of tests out all at once because doing things 'properly' with ConfigureAwait has meant it can't see what I'm doing, so this rate limiting code doesn't really rate limit in a way that's terribly useful.

Limiting the number of concurrently executing threads is not an attempt to make things "easy for you to see"; it's there so that users can opt to increase or reduce the number of threads such that CPU utilization matches their desired rate (our default of 1 thread per logical CPU thread is the best default for this on a system which is otherwise not busy).

If you want to make it easy to see things, turn off parallelization.

it's a bit feeble of xUnit to fail to cope with it

As already stated many times, this design is purposeful, to help people catch their own broken (deadlocking) code, and there is an available workaround for people who do not wish to have this capability.

you don't also need to deadlock when such techniques are used purely in test code.

And there's no reason to use these techniques in your test code, since you most definitely have access to await.

(Regarding your proposed workaround: how do I control that in the VS test runner? I don't see how to pass it that switch.)

Use a configuration file, and set maxParallelThreads to -1.

http://xunit.github.io/docs/configuring-with-json.html
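
For example, a minimal xunit.runner.json along those lines (see the linked docs for the full schema) would be:

{
  "maxParallelThreads": -1
}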

I think it won't blow away the async context because that only gets set at the very last minute - right before invoking the test method itself.

Oren is correct, and you are incorrect. The sync context at the time the test method is called must be the async void sync context, or else it cannot correctly support async void.

However, this may all be moot since it seems like Brad's not open to changing this. A proper fix would be non-trivial, and I'm not going to put any time into it if it's almost certain to be rejected.

You are correct. We have no intention of making any changes here. We thought carefully about what we wanted to achieve, and believe we have successfully achieved those goals.

@roji

Contributor

roji commented Jun 2, 2016

The difference with DivideByZeroException is that in the async case, xunit is setting things up in a very specific way that triggers the deadlock.

Yes. Purposefully. To help catch broken (deadlocking) code.

By your own statement, it's only broken if it's a library. There's other code that needs to get tested in this world.

More importantly, if one of your goals is to help users catch deadlocking code, it's inconsistent to do that only if parallelization is turned on. That again makes it seem that an implementation detail of parallelization is being recast as a purposeful "goal".

And if that isn't convincing, think of your users who depend on a badly-written library.

They can (and, philosophically, argue that they should) mock away dependencies.

Test frameworks are frequently used for integration tests, not just unit tests - in which case mocks make no sense at all.

I admit I simply assumed that to implement parallelization xunit simply spins off threads as requested and takes care not to execute more than N tests concurrently.

That is exactly what it does, through the use of Tasks. When you use Tasks inappropriately in such a way as to block the working thread, how exactly is xUnit.net to know that you're just blocking instead of, say, doing something? It cannot.

We've already been through this, you can only say it's "inappropriate" if it's a general-purpose library.

I'm going to drop this now; it would have been nice to at least get some acknowledgement of an issue. What I can suggest is that you make this very clear in the documentation - i.e. that when parallelization is on, a special sync context is set up which will trigger deadlocks in code that synchronously blocks without ConfigureAwait(false).

@roji roji closed this Jun 2, 2016

@idg10

idg10 commented Jun 2, 2016

Separating out this reply from the other points because it doesn't really have any bearing on what ought to be done, but it's peripherally useful because it demonstrates that you're not thinking as clearly as you think you are about this whole issue, and that an evidence-based approach might produce better results.

@idg10: I think it won't blow away the async context because that only gets set at the very last minute - right before invoking the test method itself.

@bradwilson: Oren is correct, and you are incorrect. The sync context at the time the test method is called must be the async void sync context, or else it cannot correctly support async void.

Sorry Brad, but it takes only the simplest of experiments to show that, as I said, the async context does not get blown away. (As ever, the scientific method tends to be more reliable than relying purely on thinking things through. As it happens, my thought processes worked better than yours here, but it's what can be shown that matters more than what any individual thinks.) I put this inside one of my test methods in the repro project:

Console.WriteLine(SynchronizationContext.Current?.GetType()?.Name ?? "No context");

If you were right to say I was wrong (i.e., if the async context was in fact blown away) this would print either "No context", or the name of some context type other than the async context. In fact, it prints out AsyncTestSyncContext. This demonstrates clearly that as I predicted, the async test context has not been blown away. (Insofar as you said that the async context must be present you are of course correct. But you have no grounds for saying I was incorrect when I said I thought it would still be present after my change.)

I suggest you check your assumptions the next time you accuse someone of being wrong in public.

We thought carefully about what we wanted to achieve, and believe we have successfully achieved those goals.

Unfortunately, as this detail shows, thinking alone is not reliable. Unless you also take evidence into account, you will often end up at the wrong conclusion. And it remains my view that the results of your thought processes about what you wanted to achieve have left xUnit.net in a slightly problematic state.

@idg10

idg10 commented Jun 2, 2016

@bradwilson: this is working as designed. The intention is not to limit the number of tests which have been started, but rather to limit the number of tests which are currently simultaneously executing. There is no reason to slow down and wait artificially when a test has voluntarily (via await) yielded back control of the thread while it waits for some other operation to complete.

You misunderstand what I was proposing. I'm not suggesting slowing down merely because a test has yielded via await. I'm suggesting that once the number of threads actively performing work reaches a particular threshold, you stop adding more tasks. (You appear to think that this is what the current design does based on your reply to @roji. But it does not. It's almost, but not quite what it does. If the goal was to do this, then xUnit.net is not in fact behaving as designed.)

(I should state an assumption here: I assume the underlying goal here is to minimize the amount of time it takes to execute all the tests in a solution. I would hope that's the real goal, rather than implementation-driven things like "keeping exactly N threads busy running tests, or context-bound work originated by tests". If running tests quickly is not a goal, then none of what follows is relevant. With that in mind, I will throw in an observation: throughput can often be made worse by having too many operations in flight, because it puts pressure both on CPU caches and the GC, so although it's often desirable to have enough concurrency to keep your CPU cores busy, any more than that is typically harmful. Consequently, once a test has begun, work requiring CPU time to move that test to completion should take priority over tests that haven't started yet.)

My suggestion is quite different from your oversimplified interpretation of it. What I'm suggesting is only slightly different from what you already do, but the difference is crucial: currently, when the number of active threads hits the configured limit, you refuse to start any more work of any kind in the sync context (whether it be starting a new test, or processing some work queued by a test already in progress). I'm suggesting that you refuse only to start new tests at this point. The critical difference is that you would still be open to starting new work kicked off by tests in progress (allowing the thread count to increase if necessary; the obvious thing to do would be just to let the TPL thread pool handle actual creation of threads, not least because its LIFO-per-thread, FIFO-cross-thread approach does a much better job than your sync context's pure FIFO of a) finishing the work already started before beginning anything new, and b) getting better locality, thus improving throughput by using the CPU cache more efficiently).

Although this conversation started with deadlocks, deadlocks are not the only problem caused by the current design. I work on one project (in which I never see the deadlock issue) in which some tests have timeouts that keep occurring spuriously unless I disable parallel execution in xUnit.net. This happens because xUnit merrily throws more and more new tests into the mix even though all the CPU cores are busy (and it can't see that this is happening because I'm using ConfigureAwait(false)) and this dramatically extends the time taken for any single test to complete, significantly increasing the chances of a spurious timeout.

The best way to handle this would be with a custom TaskScheduler that was not hidden from downstream tasks. (You currently have to hide your scheduler because you're using it as the means by which you rate-limit. I'm suggesting instead a custom scheduler whose role is to monitor usage, with control being enacted elsewhere.) That way, you'd be able to take all work into account when determining what the load level already is. This scheduler would defer to the TPL for actual scheduling of work, and its main job would be to keep track of how much work is going on. You could then use that load level as a gate to decide whether starting more tests is a good idea right now. You'd have the best of all worlds: just like now, if you have tests that spend a significant amount of time in await you'd be able to keep starting more tests, stopping only once you've hit the configured thread count, but unlike now, you'd be able to process additional work fired off by tests already in progress. If you used the TPL thread pool (or implemented your own LIFO scheme), tests that initially awaited and then suddenly got busy after the thing they were waiting for completed would tend to get prioritised rather than sitting at the back of the queue, reducing the likelihood of timeout problems.
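
For illustration, here is a rough sketch of that kind of monitoring scheduler (hypothetical names, not a finished implementation): it hands all queued work to the regular thread pool and simply counts how many items are executing right now; the runner would consult BusyCount before starting a brand-new test, while work queued by tests already in progress is never throttled.

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class CountingTaskScheduler : TaskScheduler
{
    int busy;

    // How many work items are executing at this instant.
    public int BusyCount => Volatile.Read(ref busy);

    protected override void QueueTask(Task task)
    {
        ThreadPool.UnsafeQueueUserWorkItem(_ =>
        {
            Interlocked.Increment(ref busy);
            try { TryExecuteTask(task); }
            finally { Interlocked.Decrement(ref busy); }
        }, null);
    }

    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
        => TryExecuteTask(task);   // allow inlining to keep the sketch simple (not counted)

    protected override IEnumerable<Task> GetScheduledTasks() => Array.Empty<Task>();
}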

As it happens this would also prevent the deadlocks. But even if you don't care about them, perhaps you might care about the way tests already in progress can get bogged down as a result of new tests being started even when all CPU cores are busy. Or perhaps you have a philosophical objection to tests that contain timeouts?

@bradwilson

Member

bradwilson commented Jun 2, 2016

@idg10 wrote:

(You appear to think that this is what the current design does based on your reply to @roji. But it does not. It's almost, but not quite what it does. If the goal was to do this, then xUnit.net is not in fact behaving as designed.)

I don't actually think that that's what the current design does. Each parallelization boundary (test class, by default) schedules one task to start running without regard to the number of actual tasks, and depends on the concurrency sync context to guarantee that only a specific number of tasks are in the running state at any given time. It is expected behavior that many, many tasks will be scheduled almost immediately and simultaneously (often many more than could run simultaneously, for any but the smallest unit test projects).

It is working as designed.

I work on one project (in which I never see the deadlock issue) in which some tests have timeouts that keep occurring spuriously unless I disable parallel execution in xUnit.net

We explicitly tell people (when they ask why we removed Timeout from [Fact]) that it's because it's impossible to get accurate run-time execution timing when many things are running in parallel. We felt parallelism was more valuable than timeouts, which is why we don't offer a timeout feature at all.

If you've opted to add back in a timeout feature, then you must do it understanding that you will either be wildly inaccurate, or need to disable parallelization.

The best way to handle this would be with a custom TaskScheduler that was not hidden from downstream tasks.

If you have an alternative implementation which you feel is better, yet does not compromise our design goals, by all means please send a PR.

For now, this thread has reached the limit of its usefulness, so I'm locking it. We can have further discussion once a PR has been issued, on the technical merits of that PR itself. I am not spending any more time defending our design decisions for an issue with a simple and effective workaround.

@xunit xunit locked and limited conversation to collaborators Jun 2, 2016
