
Nondeterminism in diameter reported by TLC #883

Open
ahelwer opened this issue Feb 21, 2024 · 14 comments
Labels: enhancement, help wanted, Tools

ahelwer commented Feb 21, 2024

Description

TLC occasionally reports varying diameter values when run with multiple workers, as discovered in this PR: tlaplus/Examples#122

Expected Behavior

Diameter is expected to be deterministic, per Markus.

Actual Behavior

TLC occasionally reports a varying state graph diameter, although the total and distinct state counts remain the same.

Steps to Reproduce

  1. Check out the tlaplus/examples repo
  2. Download a recent version of tla2tools.jar
  3. Repeatedly run the model for specifications/glowingRaccoon/clean.tla or specifications/Paxos/MCVoting.cfg with -workers greater than 1
  4. Observe that the reported diameter varies

Steps Taken to Fix

N/A

Possible Fix

N/A

Your Environment

Has been reproduced across multiple OSs and (recent) versions of TLC.

@lemmy added the "bug" and "Tools" labels Feb 21, 2024
@lemmy added the "help wanted" label Jun 11, 2024

FedericoPonzi commented Jun 11, 2024

Issue: When running model checking with multiple workers, the maxLevel (or max depth) is sometimes larger than the maxLevel computed with a single worker.

Some findings first:

  1. The states are generated correctly: the same number of states is generated, and the state graph is (I suspect) the same.
  2. It does not reproduce with all specifications.
  3. I went as far back as 2018, and the bug was already present.

The ModelChecker involved runs a BFS. I wanted to try the DFS one, but apparently it does not yet work with multiple workers (#548 seems like a good issue to look into).

Anyway, what happens is that when one worker is faster than another, a node can be discovered via a longer path.
Suppose we have this graph:

[state graph diagram]

A sequential BFS would visit the tree in the following order:

  1. visit A
  2. visit B
  3. visit C
  4. visit E
  5. visit D

E and D are effectively at depth 2. A concurrent visit with two threads (the prefix is the thread id):

  1. 0: Visit A
  2. 0: Take B, go to sleep
  3. 2: Visit C
  4. 2: Visit D
  5. 2: Visit E
  6. 0: Visit B

Now node E is at depth 3, and boom. For the specification under examination, I first reduced the constants DNA and PRIMER to 4 (to shrink the state graph), then generated the DOT file of the state graph:

java -cp ./dist/tla2tools.jar tlc2.TLC -dump dot test.dot /home/fponzi/dev/tla+/Examples/specifications/glowingRaccoon/clean.tla


Then in the codebase, I patched the code here with:

// Print a trace whenever a successor state is assigned a suspiciously deep level.
if (succState.getLevel() > 10) {
    this.tlc.trace.printTrace(curState, succState);
}

To get the attached stack trace: stacktrace.txt

If you follow the stack trace alongside the graph above, you will find that it is a fair execution. This is possible because next states are generated from previous states within the worker itself. I have not thought of a solution yet, but wanted to share my findings.
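The race described above can be reproduced with a small simulation. The sketch below (Python) uses a hypothetical 5-node graph reconstructed from the two visit orders; the actual edges of the diagram are an assumption, since the image is not included here. Expanding states in level order yields E at level 2, while the unfair schedule in which B is expanded last yields E at level 3.

```python
# Hypothetical graph reconstructed from the visit orders above:
# A -> B, A -> C, B -> E, C -> D, D -> E
graph = {"A": ["B", "C"], "B": ["E"], "C": ["D"], "D": ["E"], "E": []}

def levels_for(expansion_order):
    """Assign each state a level the first time it is discovered,
    expanding states in the given (possibly unfair) order."""
    level = {"A": 0}
    for state in expansion_order:
        for succ in graph[state]:
            if succ not in level:
                level[succ] = level[state] + 1
    return level

# Fair sequential BFS expands states in level order:
seq = levels_for(["A", "B", "C", "D"])
# Worker 0 parks B while worker 2 races ahead through C and D:
racy = levels_for(["A", "C", "D", "B"])

print(seq["E"], max(seq.values()))    # E at level 2, diameter 2
print(racy["E"], max(racy.values()))  # E at level 3, diameter 3
```

The total and distinct state counts are identical in both runs; only the reported levels, and hence the diameter, differ.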


ahelwer commented Jun 11, 2024

Is this deterministic-depth feature even implemented for multiple workers, then? The algorithm should ensure that all states with depth < N (where N is the depth of the error state) are visited before terminating, right? I can't imagine what it would be otherwise.


lemmy commented Jun 11, 2024

> Is this deterministic depth feature even implemented for multiple workers then? The algorithm should be to ensure all states with depth < N (where N is the depth of the error state) are visited before terminating, right? I can't imagine what it would be otherwise.

@FedericoPonzi Thanks for the refresher! No, a deterministic diameter is not what TLC implements, as that would require additional synchronization, limiting scalability. For small models, use a single worker. For large models, the diameter will almost always appear deterministic.

Does anyone have the time to find and update the relevant documentation?

@lemmy removed the "bug" label Jun 11, 2024
@FedericoPonzi

Unless @ahelwer wants to do it, I can try to find a good place to persist this information.


ahelwer commented Jun 11, 2024

All yours!


lemmy commented Jun 11, 2024

There are the Toolbox help, the command-line help, and the current-tools page (the Markdown clone). The PDF is under Leslie's control.

@lemmy added the "enhancement" label Jun 11, 2024
FedericoPonzi added a commit to FedericoPonzi/tlaplus that referenced this issue Jun 12, 2024
Persist the learnings from issue tlaplus#883. Diameter is
non-deterministic when the BFS is run with multiple workers.

Signed-off-by: Federico Ponzi <me@fponzi.me>

ahelwer commented Jun 12, 2024

Actually, I was mistaking this for an error-trace reporting issue in my description of the algorithm. There could be other methods. The simplest would be to ensure workers finish exploring all nodes at distance N from the origin before any worker explores one at distance N+1. If work stealing from concurrent queues were implemented, we could avoid thread starvation. Would that be worth it?
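A level-synchronized scheme along these lines can be sketched as follows (Python; the graph, worker count, and function names are illustrative assumptions, not TLC's actual StateQueue implementation). Workers fully drain the frontier for level N before any of them starts level N+1, which makes the computed diameter deterministic at the cost of a barrier per level.

```python
import threading
from collections import deque

# Illustrative graph; in TLC the successors would come from the next-state relation.
graph = {"A": ["B", "C"], "B": ["E"], "C": ["D"], "D": ["E"], "E": []}

def level_synchronized_bfs(root, n_workers=2):
    level = {root: 0}
    frontier = [root]
    lock = threading.Lock()
    while frontier:
        work = deque(frontier)
        next_frontier = []

        def worker():
            while True:
                with lock:
                    if not work:
                        return
                    state = work.popleft()
                for succ in graph[state]:
                    with lock:
                        if succ not in level:
                            level[succ] = level[state] + 1
                            next_frontier.append(succ)

        threads = [threading.Thread(target=worker) for _ in range(n_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()  # barrier: level N is fully explored before level N+1 begins
        frontier = next_frontier
    return level

levels = level_synchronized_bfs("A", n_workers=4)
print(levels["E"])  # always 2, regardless of thread interleaving
```

The per-level barrier is exactly the extra synchronization mentioned earlier in the thread; work stealing within a level would reduce starvation without breaking the level invariant.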


lemmy commented Jun 12, 2024

For larger, real-world specifications and models, a non-deterministic diameter is generally not an issue, except in pathological cases.

The state queue remains the primary bottleneck in TLC's breadth-first search, so efforts to improve scalability should focus on replacing it. Replacing it is not trivial because TLC workers globally synchronize on the state queue during checkpoints and liveness checking. A few years ago, I began designing a more scalable state queue based on dynamically sized sets of unexplored states.


lemmy commented Jun 12, 2024

More background:

If I recall correctly, @Calvin-L once created a prototype for a priority queue. However, I am unsure whether this prototype replaced StateQueue.java or was derived from it. My recent work on StateDeque.java does derive from StateQueue.java, as I did not have time to refactor the global synchronization out of it. The StateQueue.java verification model verifies deadlock-freedom of StateQueue.java. Note that our performance and scalability benchmarks are currently offline because we no longer have access to suitable hardware; these benchmarks are essential for any performance- or scalability-related work on TLC.


ahelwer commented Jun 12, 2024

That sounds like an interesting concurrency problem! I also wonder what it would take to get a decent benchmarking system in place again. Honestly, I don't know much about scientifically valid benchmarking techniques.


lemmy commented Jun 12, 2024

We have a decent benchmarking system that can easily be extended with additional workloads, i.e., specs/models. However, the system currently lacks dedicated hardware since our previous sponsor withdrew. Perhaps it is possible to request resources from the TLA+ Foundation.

Calvin-L pushed a commit that referenced this issue Jun 14, 2024
Persist the learning from issue #883. Diameter is
non-deterministic when the BFS is run with multiple workers.

Signed-off-by: Federico Ponzi <me@fponzi.me>