
Nondeterminism in diameter reported by TLC #883

Open
ahelwer opened this issue Feb 21, 2024 · 14 comments
Labels: enhancement, help wanted, Tools

ahelwer commented Feb 21, 2024

Description

TLC occasionally reports varying diameter values when run with multiple workers, as discovered in this PR: tlaplus/Examples#122

Expected Behavior

Diameter is expected to be deterministic, per Markus.

Actual Behavior

TLC occasionally reports a varying state graph diameter, although the total and distinct state counts remain the same.

Steps to Reproduce

  1. Check out the tlaplus/examples repo
  2. Download a recent version of tla2tools.jar
  3. Repeatedly run the model for specifications/glowingRaccoon/clean.tla or specifications/Paxos/MCVoting.cfg with -workers greater than 1
  4. Observe that the reported diameter varies

Steps Taken to Fix

N/A

Possible Fix

N/A

Your Environment

Has been reproduced across multiple OSs and (recent) versions of TLC.

@lemmy added the "bug" and "Tools" labels Feb 21, 2024
@lemmy added the "help wanted" label Jun 11, 2024

FedericoPonzi commented Jun 11, 2024

Issue: When running model checking with multiple workers, the maxLevel (or max depth) is sometimes larger than the maxLevel computed with a single worker.

Some findings first:

  1. The states are generated correctly: the same number of states is generated, and the state graph is (I suspect) the same.
  2. It does not reproduce with all specifications.
  3. I went as far back as 2018, and the bug was already present.

The ModelChecker involved runs a BFS. I wanted to try the DFS one, but apparently it does not yet work with multiple workers (#548 seems like a good issue to look into).

Anyway, what happens is that when one worker is faster than another, a node can be discovered via a longer path.
Suppose we have this graph:

[state graph diagram]

A sequential BFS would visit the tree in the following order:

  1. visit A
  2. visit B
  3. visit C
  4. visit E
  5. visit D

E and D are effectively at depth 2. A concurrent visit with two threads (the prefix is the thread id):

  1. 0: Visit A
  2. 0: Take B, go to sleep
  3. 2: Visit C
  4. 2: Visit D
  5. 2: Visit E
  6. 0: Visit B

Now node E is at depth 3, and boom. For the specification under examination, I first reduced the constants DNA and PRIMER to 4 (to shrink the state graph), then generated the DOT file of the state graph:

java -cp ./dist/tla2tools.jar tlc2.TLC -dump dot test.dot /home/fponzi/dev/tla+/Examples/specifications/glowingRaccoon/clean.tla


Then in the codebase, I patched the code here with:

// Print a trace whenever a successor state is assigned a suspiciously deep level.
if (succState.getLevel() > 10) {
    this.tlc.trace.printTrace(curState, succState);
}

To get the attached stack trace: stacktrace.txt

If you follow the stack trace alongside the graph above, you will find that it is a fair execution. This is possible because next states are generated from previous states within the worker itself. I have not thought of a solution yet, but wanted to share my findings.
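The race described above can be reproduced with a small simulation. The sketch below (Python) uses a hypothetical 5-node graph reconstructed from the two visit orders; the actual edges of the diagram are an assumption, since the image is not included here. Expanding states in level order yields E at level 2, while the unfair schedule in which B is expanded last yields E at level 3.

```python
# Hypothetical graph reconstructed from the visit orders above:
# A -> B, A -> C, B -> E, C -> D, D -> E
graph = {"A": ["B", "C"], "B": ["E"], "C": ["D"], "D": ["E"], "E": []}

def levels_for(expansion_order):
    """Assign each state a level the first time it is discovered,
    expanding states in the given (possibly unfair) order."""
    level = {"A": 0}
    for state in expansion_order:
        for succ in graph[state]:
            if succ not in level:
                level[succ] = level[state] + 1
    return level

# Fair sequential BFS expands states in level order:
seq = levels_for(["A", "B", "C", "D"])
# Worker 0 parks B while worker 2 races ahead through C and D:
racy = levels_for(["A", "C", "D", "B"])

print(seq["E"], max(seq.values()))    # E at level 2, diameter 2
print(racy["E"], max(racy.values()))  # E at level 3, diameter 3
```

The total and distinct state counts are identical in both runs; only the reported levels, and hence the diameter, differ.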


ahelwer commented Jun 11, 2024

Is this deterministic-depth feature even implemented for multiple workers, then? The algorithm should ensure that all states with depth < N (where N is the depth of the error state) are visited before terminating, right? I can't imagine what it would be otherwise.


lemmy commented Jun 11, 2024

> Is this deterministic depth feature even implemented for multiple workers then? The algorithm should be to ensure all states with depth < N (where N is the depth of the error state) are visited before terminating, right? I can't imagine what it would be otherwise.

@FedericoPonzi Thanks for the refresher! No, a deterministic diameter is not what TLC implements, as that would require additional synchronization, limiting scalability. For small models, use a single worker. For large models, the diameter will almost always appear deterministic.

Does anyone have the time to find and update the relevant documentation?

@lemmy removed the "bug" label Jun 11, 2024
@FedericoPonzi

Unless @ahelwer wants to do it, I can try to find a good place to persist this information.


ahelwer commented Jun 11, 2024

All yours!


lemmy commented Jun 11, 2024

There are the Toolbox help, the command-line help, and the current-tools page (the Markdown clone). The PDF is under Leslie's control.

@lemmy added the "enhancement" label Jun 11, 2024
FedericoPonzi added a commit to FedericoPonzi/tlaplus that referenced this issue Jun 12, 2024
Persist the learnings from issue tlaplus#883. Diameter is
non-deterministic when the BFS is run with multiple workers.

Signed-off-by: Federico Ponzi <me@fponzi.me>

ahelwer commented Jun 12, 2024

Actually, I was mistaking this for an error-trace reporting issue in my description of the algorithm. There could be other methods. The simplest would be to ensure workers finish exploring all nodes at distance N from the origin before any worker explores one at distance N+1. If work stealing from concurrent queues were implemented, we could avoid thread starvation. Would that be worth it?
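A level-synchronized scheme along these lines can be sketched as follows (Python; the graph, worker count, and function names are illustrative assumptions, not TLC's actual StateQueue implementation). Workers fully drain the frontier for level N before any of them starts level N+1, which makes the computed diameter deterministic at the cost of a barrier per level.

```python
import threading
from collections import deque

# Illustrative graph; in TLC the successors would come from the next-state relation.
graph = {"A": ["B", "C"], "B": ["E"], "C": ["D"], "D": ["E"], "E": []}

def level_synchronized_bfs(root, n_workers=2):
    level = {root: 0}
    frontier = [root]
    lock = threading.Lock()
    while frontier:
        work = deque(frontier)
        next_frontier = []

        def worker():
            while True:
                with lock:
                    if not work:
                        return
                    state = work.popleft()
                for succ in graph[state]:
                    with lock:
                        if succ not in level:
                            level[succ] = level[state] + 1
                            next_frontier.append(succ)

        threads = [threading.Thread(target=worker) for _ in range(n_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()  # barrier: level N is fully explored before level N+1 begins
        frontier = next_frontier
    return level

levels = level_synchronized_bfs("A", n_workers=4)
print(levels["E"])  # always 2, regardless of thread interleaving
```

The per-level barrier is exactly the extra synchronization mentioned earlier in the thread; work stealing within a level would reduce starvation without breaking the level invariant.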


lemmy commented Jun 12, 2024

For larger, real-world specifications and models, a non-deterministic diameter is generally not an issue, except in pathological cases.

The state queue remains the primary bottleneck in TLC's breadth-first search, so efforts to improve scalability should focus on replacing it. Replacing it is not trivial because TLC workers globally synchronize on the state queue during checkpoints and liveness checking. A few years ago, I began designing a more scalable state queue based on dynamically sized sets of unexplored states.


lemmy commented Jun 12, 2024

More background:

If I recall correctly, @Calvin-L once created a prototype for a priority queue. However, I am unsure whether this prototype replaced StateQueue.java or was derived from it. My recent work on StateDeque.java does derive from StateQueue.java, as I did not have time to refactor the global synchronization out of it. The StateQueue.java verification model verifies deadlock-freedom of StateQueue.java. Note that our performance and scalability benchmarks are currently offline because we no longer have access to suitable hardware; these benchmarks are essential for any performance- or scalability-related work on TLC.


ahelwer commented Jun 12, 2024

That sounds like an interesting concurrency problem! I also wonder what it would take to get a decent benchmarking system in place again. Honestly, I don't know much about scientifically valid benchmarking techniques.


lemmy commented Jun 12, 2024

We have a decent benchmarking system that can easily be extended with additional workloads, i.e., specs/models. However, the system currently lacks dedicated hardware since our previous sponsor withdrew. Perhaps it is possible to request resources from the TLA+ Foundation.

Calvin-L pushed a commit that referenced this issue Jun 14, 2024
Persist the learning from issue #883. Diameter is
non-deterministic when the BFS is run with multiple workers.

Signed-off-by: Federico Ponzi <me@fponzi.me>