thread blocked indefinitely in an MVar operation with parMap #23

Closed
Shimuuar opened this Issue Oct 2, 2012 · 16 comments

Shimuuar commented Oct 2, 2012

This bug was originally reported against criterion, but Till Berger created a reduced test case:

import Control.Monad.Par

test :: [Int] -> IO [Int]
test xs = do
    let list = runPar $ parMap (\x -> x + 1) xs
    putStrLn $ show list
    test list

main = do
    test [1]

If compiled with -threaded, the program fails after a few (5-30) iterations with the message "thread blocked indefinitely in an MVar operation" or, occasionally, with the message "Impossible state in globalWorkComplete". If it is compiled without threading it still fails, but it requires many more iterations (tens of thousands).

bos commented Oct 3, 2012

Still a concern here.

Owner
simonmar commented Oct 4, 2012

This is likely the same as issue #21. Sorry about this - we've known about the problem for a while, but unfortunately the code in question was written by Daniel Winograd-Cort during his internship at Microsoft and it is a bit inscrutable.

There are workarounds:

  1. Use the direct scheduler: import Control.Monad.Par.Scheds.Direct instead of Control.Monad.Par
  2. Use monad-par-0.1.0.3 instead of 0.3

I propose to do one of the following (Ryan, please let me know your preference): either

  1. make the direct scheduler the default, for the time being, or
  2. go back to the original non-nested Trace scheduler from 0.1.0.3
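
For illustration, here is a minimal sketch of workaround 1 applied to the reduced test case at the top of this issue. It makes two assumptions that are not stated in the thread: that parMap can be imported from the scheduler-agnostic Control.Monad.Par.Combinator module, and that an iteration cap (the extra Int argument to test, added here only so the program terminates) is acceptable for a demonstration. It requires the monad-par package.

```haskell
-- Workaround 1: take runPar from the direct scheduler rather than
-- from Control.Monad.Par (which defaults to the Trace scheduler).
import Control.Monad.Par.Scheds.Direct (runPar)
-- Assumption: parMap is provided by the scheduler-agnostic combinator module.
import Control.Monad.Par.Combinator (parMap)

-- | One step of the original loop: increment every element in parallel.
step :: [Int] -> [Int]
step xs = runPar $ parMap (+ 1) xs

-- | Repeat the step a bounded number of times. The bound is not part of
-- the original test case; it only makes this sketch terminate.
test :: Int -> [Int] -> IO [Int]
test 0 xs = pure xs
test n xs = do
  let list = step xs
  print list
  test (n - 1) list

main :: IO ()
main = do
  _ <- test 100 [1]
  pure ()
```

Build with -threaded, as in the original report, to exercise the same code path that failed under the default scheduler.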

kfl commented Nov 14, 2012

It is not even necessary to use parMap; a simple spawn will do.

For instance,

import Data.List(foldl')
import qualified Control.Monad.Par as P

psum :: [Int] -> Int
psum xs = foldl' fun 0 xs
  where fun acc i = P.runPar $ (P.spawn.return $ i+acc) >>= P.get >>= return

main = do
    print $ psum [1..128]

Compiled with -threaded, this fails with "thread blocked indefinitely in an MVar operation", even with +RTS -N1.

But as @simonmar says, using Control.Monad.Par.Scheds.Direct seems to fix it.

Owner

After looking at this a bit more, I'm not sure it has anything to do with nesting. There's no nesting going on in this particular example, unlike #21. I think it's just a flat-out bug, triggered by a particular interleaving of threads while runPar is shutting down. Here is the RTS debugging output:

2b3dcd264b40: cap 0: running thread 3 (ThreadRunGHC)
2b3dcd264b40: cap 0: created thread 11
2b3dcd264b40: cap 0: thread 3 stopped (blocked on an MVar)
        thread    3 @ 0x2b3dcda05ee0 is blocked on an MVar @ 0x2b3dcda16f50 (TSO_DIRTY)
2b3dcd264b40: giving up capability 0
2b3dcd264b40: passing capability 0 to worker 0x2b3dcdf01700
2b3dcdf01700: woken up on capability 0
2b3dcdf01700: resuming capability 0
2b3dcdf01700: cap 0: running thread 11 (ThreadRunGHC)
2b3dcdf01700: cap 0: waking up thread 3 on cap 0
2b3dcdf01700: cap 0: thread 11 stopped (yielding)
2b3dcdf01700: giving up capability 0
2b3dcdf01700: passing capability 0 to bound task 0x2b3dcd264b40
2b3dcd264b40: woken up on capability 0
2b3dcd264b40: resuming capability 0
2b3dcd264b40: cap 0: running thread 3 (ThreadRunGHC)
2b3dcd264b40: cap 0: thread 3 stopped (blocked on an MVar)
        thread    3 @ 0x2b3dcda05ee0 is blocked on an MVar @ 0x2b3dcda18828 (TSO_DIRTY)
2b3dcd264b40: giving up capability 0
2b3dcd264b40: passing capability 0 to worker 0x2b3dcdf01700
2b3dcdf01700: woken up on capability 0
2b3dcdf01700: resuming capability 0
2b3dcdf01700: cap 0: running thread 11 (ThreadRunGHC)
2b3dcdf01700: cap 0: thread 11 stopped (finished)
2b3dcdf01700: giving up capability 0
2b3dcdf01700: freeing capability 0
2b3dcdd00700: returning; I want capability 0
2b3dcdd00700: resuming capability 0
2b3dcdd00700: cap 0: running thread 2 (ThreadRunGHC)
2b3dcdd00700: cap 0: thread 2 stopped (suspended while making a foreign call)
2b3dcdd00700: passing capability 0 to worker 0x2b3dcdf01700
2b3dcdf01700: woken up on capability 0
2b3dcdf01700: resuming capability 0
2b3dcdf01700: deadlocked, forcing major GC...

Thread 11 is the Par monad thread and thread 3 is the main thread. Thread 11 wakes up thread 3 and then yields (this seems to be crucial). Thread 3 then blocks again, on a different MVar, and never wakes up.

I don't understand the nested trace scheduler well enough to say why, but maybe this will help Daniel.
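
For readers unfamiliar with the error text: "thread blocked indefinitely in an MVar operation" is how GHC renders the BlockedIndefinitelyOnMVar exception, which the RTS throws to a thread that is blocked on an MVar no other live thread can reach. The following is a minimal base-only sketch of that mechanism. It does not reproduce the monad-par scheduler bug itself, and the function name waitOnOrphanedMVar is made up for this illustration.

```haskell
import Control.Concurrent.MVar (MVar, newEmptyMVar, takeMVar)
import Control.Exception (BlockedIndefinitelyOnMVar (..), try)

-- | Block on an MVar that no other thread holds a reference to, and
-- report whether the RTS turned the deadlock into an exception.
waitOnOrphanedMVar :: IO Bool
waitOnOrphanedMVar = do
  mv <- newEmptyMVar :: IO (MVar ())
  -- Nobody can ever fill mv, so takeMVar can only end via the exception.
  result <- try (takeMVar mv)
  pure $ case result of
    Left BlockedIndefinitelyOnMVar -> True
    Right ()                       -> False

main :: IO ()
main = do
  detected <- waitOnOrphanedMVar
  putStrLn $
    if detected
      then "RTS reported: " ++ show BlockedIndefinitelyOnMVar
      else "takeMVar returned unexpectedly"
```

In the log above the same thing happens to thread 3: once thread 11 has finished, nothing remains that could fill the MVar thread 3 is waiting on, so the deadlock check that follows the forced major GC raises the exception in the main thread.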

bgamari commented Jan 9, 2013

Any progress here?

Owner

@rrnewton is preparing a release that will have the fix (workaround actually). See #26.

Owner

Also, I backed off the trace scheduler to the non-nested version (18e1968), because the nested version has at least two separate bugs (this one and #21).

Owner

Released version 0.3.4 that doesn't suffer from this bug.

simonmar closed this Feb 15, 2013

cartazio commented Jun 4, 2013

I seem to have hit this bug, or something very much like it, today when running criterion with the new haskell platform release. I'll try to see whether it's the same one or not.

@bos
@simonmar

cartazio commented Jun 4, 2013

The test case at the top of this ticket doesn't trigger the problem. Will investigate more; it might be a criterion-side problem instead.

Owner
simonmar commented Jun 6, 2013

@cartazio: this ticket is closed, we released a version of monad-par without the bug (0.3.4). Maybe you're using an older version?

cartazio commented Jun 6, 2013

@simonmar I'm on the haskell platform. the one released last week

it might be an unrelated problem in criterion that triggers a similar error message.

The test case at the opening of the ticket doesn't seem to trigger the bug, but building my criterion test suite with -threaded triggers the error.

Might not be a monad-par bug, but if I can figure out a simple small repro, I'll share it here as well as opening a suitable criterion ticket.

Collaborator
rrnewton commented Jun 6, 2013

(In airport, on the way back into the US.)

I'd like to take a look at this. Alas, if it is monad-par you're hitting
it indirectly through criterion so you can't play around with different
schedulers as easily.

But you can fairly easily play around with different monad-par versions
which expose different schedulers. For example, install criterion along
with:

  • monad-par 0.1.0.3 -- Trace scheduler without nesting
  • monad-par 0.3 -- Trace scheduler + nesting (known bugs)
  • 0.3.4.2 -- Direct scheduler (with idling + parent stealing to be
    specific, but no nesting)

A reproducer would be great...

-Ryan

Collaborator
rrnewton commented Jun 6, 2013

Actually, 0.3.4.1 was a mistake, but is perhaps another useful datapoint.
It was released in a debugging mode. (Busy waiting, no idling.)

cartazio commented Jun 8, 2013

@rrnewton here's the repro: bos/criterion#28

i'm just using vanilla haskell platform 64bit released last week on my mac

#31 is my repro with current monad-par. (I've not had the time to unpeel the statistics / criterion wrapper from it, but it seems related to this issue, since the only use of monad-par in the code is indirect, via parMap and runPar.)

rrnewton pushed a commit to rrnewton/criterion that referenced this issue Sep 6, 2014
Shimuuar: Workaround for the bug in the monad-par
* simonmar/monad-par#23

As suggested by Simon Marlow direct scheduler is used.
1e4dd69