
Tests hang: possible race condition #285

Open · kammoh opened this issue Mar 30, 2021 · 4 comments

kammoh commented Mar 30, 2021

I'm seeing this happen with both the Treadle and Verilator backends: when running long tests with ~10000 I/O elements, the test usually hangs (forever).
With ~100 I/O elements, the tests usually conclude with no issue.

My test harness looks like this:

        fork {
          dut.io.in.enqueueSeq(inputs)
        }.fork {
          dut.io.cmd.enqueueSeq(instructions)
        }.fork {
          dut.io.out.expectDequeueSeq(expected)
        }.join()

I did some printf debugging, and it seems the tests hang at a clock .step(1) call in DecoupledDriver/Monitor. I also looked at the threads in a JVM debugger, and they seem to deadlock at either a step() or joinAndStep() call.
On the Treadle backend, I see threads locked with a call stack something like this:

Unsafe.park(boolean,long)[native method] (myproject/.metals/readonly/java.base/jdk/internal/misc/Unsafe.java:-1)
LockSupport.park(Object) (myproject/.metals/readonly/java.base/java/util/concurrent/locks/LockSupport.java:211)
AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer$Node,int,boolean,boolean,boolean,long) (myproject/.metals/readonly/java.base/java/util/concurrent/locks/AbstractQueuedSynchronizer.java:714)
AbstractQueuedSynchronizer.acquireSharedInterruptibly(int) (myproject/.metals/readonly/java.base/java/util/concurrent/locks/AbstractQueuedSynchronizer.java:1046)
Semaphore.acquire() (myproject/.metals/readonly/java.base/java/util/concurrent/Semaphore.java:318)
ThreadedBackend.doJoin(Seq,Option) (myproject/.metals/readonly/chiseltest/internal/ThreadedBackend.scala:680)
ThreadedBackend.doJoin$(ThreadedBackend,Seq,Option) (myproject/.metals/readonly/chiseltest/internal/ThreadedBackend.scala:658)
TreadleBackend.doJoin(Seq,Option) (myproject/.metals/readonly/chiseltest/backends/treadle/TreadleBackend.scala:15)
TesterThreadList.joinAndStep(Clock) (myproject/.metals/readonly/chiseltest/internal/BackendInterface.scala:21)
DecoupledDriver.$anonfun$expectDequeue$1(DecoupledDriver,Data) (myproject/.metals/readonly/chiseltest/DecoupledDriver.scala:96)
DecoupledDriver$$Lambda$2931/0x00000008012f6520.apply$mcV$sp() (Unknown Source:-1)
TreadleBackend.doTimescope(Function0) (myproject/.metals/readonly/chiseltest/backends/treadle/TreadleBackend.scala:112)
package$.timescope(Function0) (myproject/.metals/readonly/chiseltest/package.scala:290)
DecoupledDriver.expectDequeue(Data) (myproject/.metals/readonly/chiseltest/DecoupledDriver.scala:87)
DecoupledDriver.$anonfun$expectDequeueSeq$2(DecoupledDriver,Data) (myproject/.metals/readonly/chiseltest/DecoupledDriver.scala:112)
DecoupledDriver.$anonfun$expectDequeueSeq$2$adapted(DecoupledDriver,Data) (myproject/.metals/readonly/chiseltest/DecoupledDriver.scala:111)
DecoupledDriver$$Lambda$2930/0x00000008012f6140.apply(Object) (Unknown Source:-1)
Stream.foreach(Function1) (myproject/.metals/readonly/scala/collection/immutable/Stream.scala:533)
DecoupledDriver.$anonfun$expectDequeueSeq$1(DecoupledDriver,Seq) (myproject/.metals/readonly/chiseltest/DecoupledDriver.scala:111)
DecoupledDriver$$Lambda$2929/0x00000008012f5e50.apply$mcV$sp() (Unknown Source:-1)

The behaviour is quite non-deterministic, which I guess makes the case for a race condition stronger.
I've not been able to come up with a "Minimum Working Example", and unfortunately the code has not been made public yet, but I'm pretty sure this is an issue with chiseltest and not related to my particular tests.

EDIT:

Minimum Working Example

The following test hangs for a large number of inputs:

import chisel3._
import chisel3.util._
import chiseltest._
import org.scalatest.FlatSpec

class Mwe extends Module {
  val io = IO(new Bundle {
    val in  = Flipped(Decoupled(UInt(16.W)))
    val out = Decoupled(UInt(16.W))
  })
  io.out <> Queue(io.in)
}

class MweSpec extends FlatSpec with ChiselScalatestTester {

  behavior.of("MWE Bug")

  val inputs = Seq.tabulate(8000) { i => i.U }

  it should "conclude" in {
    test(new Mwe) { dut =>
      dut.io.in.initSource().setSourceClock(dut.clock)
      dut.io.out.initSink().setSinkClock(dut.clock)

      fork {
        dut.io.in.enqueueSeq(inputs)
      }.fork {
        dut.io.out.expectDequeueSeq(inputs)
      }.join()
    }
  }
}

https://scastie.scala-lang.org/kammoh/7kDyWnXaSOGcYcT9Qgnf4A/2

kammoh commented Mar 30, 2021

This happens on both the "0.3.2" release and the master branch.

kammoh commented Mar 31, 2021

I was actually able to observe it in the following simple MWE. If the number of inputs is small (<1000), it will most probably be fine. With a larger number of inputs (>8000), it always hangs on my machine.

import chisel3._
import chisel3.util._
import chiseltest._
import org.scalatest.FlatSpec

class Mwe extends Module {
  val io = IO(new Bundle {
    val in  = Flipped(Decoupled(UInt(16.W)))
    val out = Decoupled(UInt(16.W))
  })
  io.out <> Queue(io.in)
}

class MweSpec extends FlatSpec with ChiselScalatestTester {

  behavior.of("MWE Bug")

  val annos = Seq(
    // VerilatorBackendAnnotation,
    // WriteVcdAnnotation,
  )

  val inputs = Seq.tabulate(8000) { i => i.U }

  it should "conclude" in {
    test(new Mwe).withAnnotations(annos) { dut =>
      dut.io.in.initSource().setSourceClock(dut.clock)
      dut.io.out.initSink().setSinkClock(dut.clock)

      fork {
        dut.io.in.enqueueSeq(inputs)
      }.fork {
        dut.io.out.expectDequeueSeq(inputs)
      }.join()
    }
  }
}

kammoh commented Mar 31, 2021

I guess it's also somewhat reproducible on Scastie: https://scastie.scala-lang.org/kammoh/7kDyWnXaSOGcYcT9Qgnf4A/2
With the number of inputs <= 7000, the test passes with no issue:

Playground$MweSpec:
MWE Bug
Elaborating design...
Done elaborating.
Starting Test
Test Finished!
test Mwe Success: 0 tests passed in 7003 cycles in 16.966041 seconds 412.77 Hz
- should conclude

Increasing to >=8000, the test sometimes concludes (in total time < 20 seconds), but sometimes it times out:

timed out after 30 seconds when running code

It mostly times out with >=18000.

ducky64 commented Mar 31, 2021

Strange. I don't know what's going on but:

  • As architected, chiseltest shouldn't actually be running anything in parallel. Chiseltest spawns threads only to track execution context, but threads are scheduled to run one at a time.
  • enqueueSeq and expectDequeueSeq should both be fully serial - that is, each should create no more than one sub-thread at a time, blocking on the completion of each sub-thread before spawning the next (see the sketch after this list).
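
For reference, a simplified sketch of that fully serial behavior - this is illustrative code in the style of chiseltest's DecoupledDriver, not the actual implementation:

import chisel3._
import chisel3.util._
import chiseltest._

// Illustrative sketch: enqueue one element at a time, forking a single
// sub-thread that waits for ready and joining it before moving on to the
// next element, so at most one sub-thread is ever alive.
def enqueueSeqSketch[T <: Data](io: DecoupledIO[T], clock: Clock, data: Seq[T]): Unit = {
  for (elt <- data) {
    io.bits.poke(elt)
    io.valid.poke(true.B)
    fork {
      while (!io.ready.peek().litToBoolean) {
        clock.step(1)
      }
    }.joinAndStep(clock) // blocks here until the sub-thread completes
    io.valid.poke(false.B)
  }
}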

If you wanted to dig further:

  • It's normal for threads to be blocked on a step or joinAndStep call - those are effectively scheduler invocations. But a scheduler invocation should mean that another thread is unblocked.
  • Do you know where inside step or joinAndStep the threads are blocked? (A full JVM thread dump can answer this; see the sketch after this list.)
  • Are there any exceptions thrown from any of the threads?
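
One low-tech way to get that thread dump is jstack <pid> from another terminal, or a watchdog inside the test process. A minimal sketch (hypothetical helper, not part of chiseltest):

import scala.jdk.CollectionConverters._ // use scala.collection.JavaConverters on Scala 2.12

// Print the stack of every live JVM thread; call this from a watchdog
// thread once the test appears hung to see where each thread is parked.
object StackDump {
  def dumpAllThreads(): Unit = {
    for ((thread, frames) <- Thread.getAllStackTraces.asScala) {
      println(s"--- ${thread.getName} (state: ${thread.getState}) ---")
      frames.foreach(frame => println(s"    at $frame"))
    }
  }
}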

The one thing that might be a concern is that the scheduler releases the next thread's semaphore before acquiring its own, and this isn't an atomic operation. It might be possible that, in between, all the other threads run and the original thread's semaphore is released (by another thread) before it is acquired, causing deadlock.
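
To make the suspected window concrete, here is a minimal sketch of such a non-atomic handoff (hypothetical names, not chiseltest's actual scheduler):

import java.util.concurrent.Semaphore

// Each thread parks on its own semaphore; a context switch releases the
// next thread's semaphore and then acquires our own. The two calls are
// not one atomic operation: in the marked window the current thread has
// logically yielded but has not parked yet, so every other thread can
// run, and a release aimed at this thread may race with the acquire
// below, leaving the scheduler's bookkeeping and the semaphore state
// out of sync.
class ThreadHandle {
  val semaphore = new Semaphore(0)
}

def switchTo(next: ThreadHandle, self: ThreadHandle): Unit = {
  next.semaphore.release() // wake the next thread
  // <-- window: other threads run here before this thread parks
  self.semaphore.acquire() // park until control is handed back
}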

That being said, I'm not sure what the recommended way to do an atomic thread switch is, since this generally isn't what synchronization structures are designed to do. A mildly hacky solution might be to check the currently active thread (and have that be a synchronized variable) before acquiring its own semaphore - thinking of the semaphore as more of an optimization than the primary way to schedule threads.
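
A sketch of that idea, with the synchronized active-thread variable as the source of truth and the semaphore used only as a way to sleep (all names hypothetical):

import java.util.concurrent.Semaphore

// Hypothetical sketch: `activeThread` is the primary scheduling state.
// A thread only proceeds once it really is the active thread; the
// semaphore acquire is just a sleep, and the condition is re-checked
// after every wakeup, so an early or stray release can't deadlock the
// handoff (a banked permit only costs one extra loop iteration).
class SchedulerSketch {
  private val lock = new Object
  private var activeThread: Thread = _

  def activate(next: Thread, nextSem: Semaphore): Unit = {
    lock.synchronized { activeThread = next }
    nextSem.release()
  }

  def parkUntilActive(selfSem: Semaphore): Unit = {
    while (lock.synchronized { activeThread ne Thread.currentThread() }) {
      selfSem.acquire() // sleep; re-check the condition on wakeup
    }
  }
}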
