Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TLC with multiple workers in DFID mode intermittently fails to find safety invariant violation #548

Closed
ahelwer opened this issue Dec 2, 2020 · 2 comments
Labels
bug error, glitch, fault, flaw, ... Tools The command line tools - TLC, SANY, ...
Milestone

Comments

@ahelwer
Copy link
Contributor

ahelwer commented Dec 2, 2020

Version Info

TLC version: 1.7.0 release
JRE version: AdoptOpenJDK build 15.0.1+9
OS version: Windows 10

What happened?

I continually ran TLC in DFID mode against a simple test spec which produces a safety invariant violation. I noticed a rare failure where TLC did not find a safety violation, seemingly only when a -workers 4 command line parameter was present. Here's an example where I ran the same command twice in a row on the same files and got different results:

PS C:\Users\ahelwer\downloads\dfid> java tlc2.TLC -config .\TESpecSafetyTest.cfg -workers 4 -dfid 10 .\TESpecTest.tla
TLC2 Version 2.15 of Day Month 20?? (rev: eb3ff99)
Warning: Please run the Java VM which executes TLC with a throughput optimized garbage collector by passing the "-XX:+UseParallelGC" property.
(Use the -nowarning option to disable this warning.)
Running depth-first search Model-Checking with fp 109 and seed -2587986588159494214 with 4 workers on 4 cores with 4082MB heap and 64MB offheap memory [pid: 5144] (Windows 10 10.0 amd64, AdoptOpenJDK 15.0.1 x86_64).
Parsing file C:\Users\ahelwer\Downloads\dfid\TESpecTest.tla
Parsing file C:\Users\ahelwer\AppData\Local\Temp\Naturals.tla
Semantic processing of module Naturals
Semantic processing of module TESpecTest
Starting... (2020-12-02 10:29:04)
Finished computing initial states: 1 states generated, with 1 of them distinct.
Starting level 2: 1 states generated, 1 distinct states found.
Starting level 3: 4 states generated, 2 distinct states found.
Starting level 4: 7 states generated, 3 distinct states found.
Starting level 5: 17 states generated, 4 distinct states found.
Starting level 6: 25 states generated, 5 distinct states found.
Starting level 7: 39 states generated, 6 distinct states found.
Starting level 8: 55 states generated, 7 distinct states found.
Starting level 9: 68 states generated, 8 distinct states found.
Starting level 10: 83 states generated, 9 distinct states found.
100 states generated, 10 distinct states found.
Finished in 00s at (2020-12-02 10:29:04)

PS C:\Users\ahelwer\downloads\dfid> java tlc2.TLC -config .\TESpecSafetyTest.cfg -workers 4 -dfid 10 .\TESpecTest.tla
TLC2 Version 2.15 of Day Month 20?? (rev: eb3ff99)
Warning: Please run the Java VM which executes TLC with a throughput optimized garbage collector by passing the "-XX:+UseParallelGC" property.
(Use the -nowarning option to disable this warning.)
Running depth-first search Model-Checking with fp 122 and seed 8604517234160323322 with 4 workers on 4 cores with 4082MB heap and 64MB offheap memory [pid: 9448] (Windows 10 10.0 amd64, AdoptOpenJDK 15.0.1 x86_64).
Parsing file C:\Users\ahelwer\Downloads\dfid\TESpecTest.tla
Parsing file C:\Users\ahelwer\AppData\Local\Temp\Naturals.tla
Semantic processing of module Naturals
Semantic processing of module TESpecTest
Starting... (2020-12-02 10:29:20)
Finished computing initial states: 1 states generated, with 1 of them distinct.
Starting level 2: 1 states generated, 1 distinct states found.
Starting level 3: 3 states generated, 2 distinct states found.
Starting level 4: 6 states generated, 3 distinct states found.
Error: Invariant Safety is violated.
Error: The behavior up to this point is:
State 1:
/\ x = 0
/\ y = FALSE

State 2:
/\ x = 1
/\ y = TRUE

State 3:
/\ x = 2
/\ y = FALSE

State 4:
/\ x = 3
/\ y = TRUE

10 states generated, 4 distinct states found.
Finished in 00s at (2020-12-02 10:29:20)

Further testing shows this occurs about 1/50-100 runs of the tool.

What did you expect to happen?

I expected TLC to find the safety invariant violation every time. I'm pretty sure DFID should fully explore the state space and find the safety invariant violation.

Steps to reproduce

  1. Download the attached specs: dfidsafetyfailure.zip
  2. Repeatedly run TLC on the specs with the command java tlc2.TLC -config .\TESpecSafetyTest.cfg -workers 4 -dfid 10 .\TESpecTest.tla

It helps to use a loop on the command line.

@lemmy
Copy link
Member

lemmy commented Dec 10, 2020

TLC's test suite has only two basic tests for DFID mode (checked by adding Assert.check(false, EC.GENERAL) to the ctor of DFIDModelChecker), which are tlc2.tool.DepthFirstDieHardTest and tlc2.tool.DepthFirstErrorTraceTest.


Below is a patch that fixes this bug at the expense of breaking TLC's continue flag. However, the DFID implementation appears generally broken when it comes to multi-threading, why it's safer to disable multi-threaded DFID (see second patch).

diff --git a/tlatools/org.lamport.tlatools/src/tlc2/tool/DFIDModelChecker.java b/tlatools/org.lamport.tlatools/src/tlc2/tool/DFIDModelChecker.java
index b8efccf91..5a541c2e9 100644
--- a/tlatools/org.lamport.tlatools/src/tlc2/tool/DFIDModelChecker.java
+++ b/tlatools/org.lamport.tlatools/src/tlc2/tool/DFIDModelChecker.java
@@ -4,6 +4,7 @@ package tlc2.tool;
 
 import java.io.IOException;
 import java.util.Set;
+import java.util.concurrent.atomic.AtomicBoolean;
 import java.util.concurrent.atomic.AtomicLong;
 import java.util.stream.Collectors;
 
@@ -629,21 +630,36 @@ public class DFIDModelChecker extends AbstractChecker
        }
     }
 
+    private final AtomicBoolean foundError = new AtomicBoolean(false);
+    
     /**
      * Set the error state. 
      * <strong>Note:</note> this method must be protected by lock 
      */
     public boolean setErrState(TLCState curState, TLCState succState, boolean keep, int errorCode)
     {
-        boolean result = super.setErrState(curState, succState, keep, errorCode);
-        if (!result)
-        {
-            return false;
-        } else
-        {
-            this.setStop(2);
-            return true;
+        assert Thread.holdsLock(this) : "Caller thread has to hold monitor!";
+        //MAK 12/28/2020: Instead of branching on the return value of super.setErrState, which
+        // is determined on the value of this.done, branch on a dedicated variable foundError.
+        // The variable done is set to true under *two* condition: (i) the first worker from the set
+        // of workers found an (invariant) violation and (ii) when a subset of the workers do not find
+        // unexplored initial states (see tlc2.tool.DFIDWorker.getInit()).  Case (ii) is what
+        // interferes with this code and can cause the worker that attempts to report a violation
+        // to read done=true and erroneously skip reporting the violation.  What this change breaks
+        // is TLC's continue functionality because foundError is never set back to false.
+        // The variable foundError wouldn't have to be an AtomicBoolean, because foundError is only
+        // read and written from synchronized blocks on this instance.
+        if (foundError.get()) {
+               return false;
         }
+               IdThread.resetCurrentState();
+               this.predErrState = curState;
+               this.errState = (succState == null) ? curState : succState;
+               this.errorCode = errorCode;
+               this.done = true;
+               this.keepCallStack = keep;
+               this.setStop(2);
+               return true;
     }
 
     /**

Disable multi-threaded DFID:

diff --git a/tlatools/org.lamport.tlatools/src/tlc2/tool/DFIDModelChecker.java b/tlatools/org.lamport.tlatools/src/tlc2/tool/DFIDModelChecker.java
index b8efccf91..847af4a54 100644
--- a/tlatools/org.lamport.tlatools/src/tlc2/tool/DFIDModelChecker.java
+++ b/tlatools/org.lamport.tlatools/src/tlc2/tool/DFIDModelChecker.java
@@ -20,6 +20,7 @@ import tlc2.util.IStateWriter;
 import tlc2.util.IdThread;
 import tlc2.util.LongVec;
 import tlc2.util.SetOfStates;
+import util.Assert;
 import util.FileUtil;
 import util.UniqueString;
 
@@ -61,10 +62,15 @@ public class DFIDModelChecker extends AbstractChecker
         this.theInitStates = null;
         this.theInitFPs = null;
         this.theFPSet = new MemFPIntSet(); // init the state set
-        this.theFPSet.init(TLCGlobals.getNumWorkers(), this.metadir, this.tool.getRootFile());
+        final int numWorkers = TLCGlobals.getNumWorkers();
+               if (numWorkers > 1) {
+                       Assert.fail(EC.GENERAL,
+                                       "Multi-threaded DFID is broken (https://github.com/tlaplus/tlaplus/issues/548).  Please remove the '-workers' flag from your command-line.");
+               }
+               this.theFPSet.init(numWorkers, this.metadir, this.tool.getRootFile());
 
         // Initialize all the workers:
-        this.workers = new DFIDWorker[TLCGlobals.getNumWorkers()];
+        this.workers = new DFIDWorker[numWorkers];
         this.numOfGenStates = new AtomicLong(0);
     }

Given the popularity of DFID mode and the existence of a workaround (-workers 1), this has low priority.

exec

@lemmy lemmy added bug error, glitch, fault, flaw, ... help wanted We need your help Tools The command line tools - TLC, SANY, ... labels Dec 10, 2020
@lemmy
Copy link
Member

lemmy commented Dec 28, 2020

Related: #544

@lemmy lemmy mentioned this issue Dec 28, 2020
25 tasks
@lemmy lemmy changed the title TLC with DFID fails to find safety invariant violation around 1/50-100 times when using -workers parameter TLC with multiple workers in DFID mode intermittently fails to find safety invariant violation Dec 29, 2020
@lemmy lemmy added this to the 1.7.1 milestone Dec 29, 2020
lemmy added a commit to lemmy/tlaplus that referenced this issue Dec 29, 2020
safety invariant violation.

Addresses Github issue tlaplus#548
tlaplus#548

[Bug][TLC]
@lemmy lemmy closed this as completed Dec 29, 2020
lemmy added a commit that referenced this issue Dec 30, 2020
safety invariant violation.

Addresses Github issue #548
#548

[Bug][TLC]
@lemmy lemmy removed the help wanted We need your help label Dec 31, 2020
lemmy added a commit that referenced this issue Dec 31, 2020
safety invariant violation.

Addresses Github issue #548
#548

[Bug][TLC]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug error, glitch, fault, flaw, ... Tools The command line tools - TLC, SANY, ...
Development

No branches or pull requests

2 participants