Skip to content


Subversion checkout URL

You can clone with
Download ZIP
Simulator for message passing protocols, supporting (really) unfair process scheduling and dropped messages
Latest commit acc8fee @slfritchie Major insight with McErlang: size of per-process state *matters*.
So, silly me, McErlang (and most/all other finite state model
checkers) need to be able to identify states that have been seen
before.  The SPIN book warned against the use of counters and has a
long commentary on how process IDs are managed to try to avoid using
too many.  And there's a section in the McErlang user's guide about
avoiding things like counters, references, timestamps,
`erlang:now()`, and other things are end up being unique.

Did I pay attention?  Well, I am now.

Here are various hacks to reduce the size and variability of process
state records:

* Always sort lists (unless order is important).
* Avoid `now()` and use `make_ref()` instead.  McErlang's parse
  transform takes care of making as-undynamic-as-possible substitute
  values for PIDs and references, but it doesn't do anything with
* Remove unused `#s` record members: watchers2 and ph2_now.
* Use a constant atom instead of time for vclock timestamp.

The `#c` client record and protcols both could be slightly (more than
slightly?) simplified by making the cookie constant.  However, as
the protocol needs cookie uniqueness to avoid mistaking delayed
messages with current operations.  Bleh.  Same for the `clop`
identifier (for selective receive).  Bleh bleh.

So, these changes make a difference ... but, alas, they don't make a
tremendously huge difference.  When I tried running a 1 client x 3
server x 2 counter ops by that single client, it finishes in under an
hour and a half, but ... 66 million states?  {sigh}

    > mce:apply(slf_msgsim_qc, prop_mc_simulate2, [distrib_counter_2phase_vclocksetwatch_sim, [{no_generator, true}], 1, 3, 1, lists:duplicate(2, {c3, {counter_op, [s1,s2,s3]}}), [{c3,[],counter_client1}], [{s1, {s,{obj,[],[0]},undefined,undefined,[]}, counter_server1}, {s2, {s,{obj,[],[0]},undefined,undefined,[]}, counter_server1}, {s3, {s,{obj,[],[0]},undefined,undefined,[]}, counter_server1}]], #mce_opts{table={mce_table_bitHash,[1024*1024*510]}, pathLimit=100}).

    *** Run ending. 66524923 states explored, stored states 15383564.

    Execution terminated normally.
    Table has 15383564 states and 66524922 transitions.

    Reductions: 67208.03 mreds; Runtime: 4081.100 seconds; 4139.29 elapsed seconds

Message Passing (and Dropping) Protocol Simulator

Wouldn't it be great if you could:

  • pick up a distributed algorithms textbook (Nancy Lynch's is my favorite) or a research paper or a crazy idea scribbled in a notebook for some message passing protocol, then...
  • write a small amount of code to implement the algorithm, then...
  • run a simulator to tell you if it works?

Ja, those things exist today. But there are things to be careful of, including but not limited to:

  1. What language are you writing your algorithm in?
  2. What language are you verifying the algorithm's properties in?
  3. Can the protocol simulator control process scheduling patterns? (Irregular scheduling is far more interesting than regular/consistent scheduling.)
  4. Can the protocol simulator simulate dropped/lost messages? (Most protocols do pretty well in perfect environments. It's much more interesting to see what a protocol does when it starts dropping messages.)
  5. Can the simulator explore all possible scheduling and message dropping states to verify that my protocol is correct.

This message passing simulator attempts to address almost all of the above list.

  1. We use Erlang. It's a nice, high-level language that's frequently used for distributed algorithms anyway. If you're using Erlang, it's not as useful to simulate your protocol outside of Erlang and then reimplement it in Erlang.
  2. We use Erlang also. High level languages make clear expression of your protocol's desired properties pretty easy.
  3. Yes. We use the QuickCheck tool to simulate a token-based scheduler, where QuickCheck generates random lists of tokens. The scheduler is as fair or as unfair as you wish. There is nothing in the implementation (that I know of) that would prevent the use of PropEr for generating any of the test cases within the simulator. I simply haven't had time to try using PropEr yet.
  4. Yes. We use QuickCheck again to specify when network partitions happen during a simulated test run: For none/some/many {t0,t1,A,B}, then between simulated time t0 and t1, any message from a process in set A that sent to a process in set B will be dropped.
    • Note that these network partitions can be asymmetric: process C can be partitioned from process D, but D can send messages to C.
  5. No. Exhaustive exploration of the entire state space is beyond the scope of this tool. However, it's my hope that this tool points the way for something similarly easy for McErlang which is capable of performing full state space exploration.

How to run simulated protocols

$ cd /path/to/top/of/msgdropsim
$ make
$ erl -pz ./ebin
[... Erlang VM starts up]
> eqc:quickcheck(slf_msgsim_qc:prop_simulate(echomany_sim, [])).

Also: See the top of each .erl file in the src directory for instructions or pointers to instructions.

General usage:

eqc:quickcheck(slf_msgsim_qc:prop_simulate(SimModuleName, OptionsList)).
  • SimModuleName is a protocol simulation implementation module such as echo_sim or distrib_counter_2phase_vclocksetwatch.
  • OptionsList is a list of zero or more of the following options:
    [{min_clients, N}, {max_clients, M},  %% defaults: N=1, M=9
     {min_servers, N}, {max_servers, M},  %% defaults: N=1, M=9
     disable_partitions,                  %% disable network partitions
     disable_delays                       %% disable message delays
     {min_keys, N},                       %% typically not implemented
     {max_keys, N},                       %% typically not implemented
     crash_report,                        %% enable verbose report upon crash
     {stop_step, N}]                      %% stop execution at step N

For example:

eqc:quickcheck(slf_msgsim_qc:prop_simulate(distrib_counter_2phase_vclocksetwatch_sim, [])).

By default, QuickCheck will run 100 random test cases. For a protocol simulation, that usually isn't enough cases to find bugs in even a simple protocol.

  • The argument to eqc:quickcheck/1 is a property.
  • The return value of slf_msgsim_qc:prop_simulate(SimModuleName, OptionsList) is a property.
    • So the example above, eqc:quickcheck(slf_msgsim_qc:prop_simulate(SimModuleName, OptionsList))., works as you'd expect.
  • If you wish to run 5,000 test cases with a property Property, then use the construct eqc:quickcheck(eqc:numtests(5000, Property)).
  • If you wish to make each individual test case longer, i.e., to contain longer sequences of protocol operations, use the construct eqc:quickcheck(eqc_gen:resize(R, Property)). where R is an integer larger than 40. The effect seems to be exponential: test cases with R=100 are much, much longer than tests that use R=50.
  • The resize() and numtests() wrappers can be used together, e.g., eqc:quickcheck(eqc:numtests(5000, eqc_gen:resize(60, Property)))

For example, to run 10,000 test cases of the distrib_counter_2phase_vclocksetwatch_sim simulator:

eqc:quickcheck(eqc:numtests(10*1000,slf_msgsim_qc:prop_simulate(distrib_counter_2phase_vclocksetwatch_sim, [{max_clients,9}, {max_servers,9}]))).

Describing the evolution of two flawed protocols

The simulator source contains code for simulating two protocols:

  • An echo service, one flawed and one ok.
    • The source for these are echo_bad1_sim.erl and echo_sim.erl, respectively.
  • A distributed counter service, where all clients are supposed to generate strictly-increasing counters. There are five variations of the protocol; all of them are buggy.
    • The sources for these are distrib_counter_bad1_sim.erl through distrib_counter_bad5_sim.erl.

All of the buggy simulator code in a file foo.erl has a corresponding text file called foo.txt which contains:

  • Instructions on how to run the test case
  • Output from the test case
  • Annotations within the output, marked by %% characters, that help explain what the output means.

The foo.txt file has annotated simulator output and discussion of what's wrong with each implementation, e.g. echo_bad1_sim.txt and distrib_counter_bad1_sim.txt.

For the distributed counter simulations, it can be instructive to use "diff" to compare each implementation, in sequence, to see what changed.

  • diff -u distrib_counter_bad1_sim.erl distrib_counter_bad2_sim.erl
  • diff -u distrib_counter_bad2_sim.erl distrib_counter_bad3_sim.erl
  • diff -u distrib_counter_bad3_sim.erl distrib_counter_bad4_sim.erl
  • diff -u distrib_counter_bad4_sim.erl distrib_counter_bad5_sim.erl

How the simulator works

TODO Finish this section

The simulator attempts to maintain Erlang message passing semantics. Those semantics are not formally documented but can loosely be described as "send and pray", i.e. no guarantee that any message will be delivered. In the case where process X sends messages A and B to process Y, if Y receives both messages B and A, then message A will be delivered before B. (I hope I got that right ... if not, the Async Message Passing Police will come and arrest me.)

  • Write a callback module
    • gen_initial_ops/4 The simulator scheduler sends messages from created by this generator to each of the simulated processes. QuickCheck will randomly choose some number of client & server processes for each test case.
    • gen_client_initial_states/2 Generate the local process state data for each client process.
    • gen_server_initial_states/2 Generate the local process state data for each server process.
    • verify_property/11 After a simulated test case has run, verify that whatever protocol properties should be true are indeed true. Any failure will cause QuickCheck to try to find a smaller-but-still-failing counterexample.
  • Compile
  • Run via eqc:quickcheck(slf_msgsim_qc:prop_simulate(YourSimModuleName, PropertyList)).

    • If your test passes 100 test cases, then you probably need to run for thousands or even millions of test cases. Use eqc:numtests() and/or/both eqc_gen:resize(N, YourProperty) where N is a large number on the range of 50-100.

Using the simulators with McErlang

The current work on McErlang integration is ... well, barely recognizable as "integration". But it's trying to get there, slowly.

Short answer: look at the commit log entries starting on May 28, 2011. There are cut-and-paste'able examples and a fair amount of commentary there.

One major complication is the simulator's support for Erlang's "selective receive" feature. Take this bit of code from distrib_counter_2phase_sim.erl:

client_ph1_waiting({ph1_ask_ok, ClOp, _Server, _Cookie, _Count} = Msg,
                   C = #c{clop = ClOp, num_responses = Resps, ph1_oks = Oks}) ->
    cl_p1_next_step(false, C#c{num_responses = Resps + 1,
                               ph1_oks       = [Msg|Oks]});
client_ph1_waiting({ph1_ask_sorry, ClOp, _Server, _LuckyClient} = Msg,
                   C = #c{clop = ClOp,
                          num_responses = Resps, ph1_sorrys = Sorrys}) ->
    cl_p1_next_step(false, C#c{num_responses = Resps + 1,
                               ph1_sorrys    = [Msg|Sorrys]});
client_ph1_waiting(timeout, C) ->
    cl_p1_next_step(true, C).

If the simulated process receives a {unexpected, ...} message while in the client_ph1_waiting state, that message will be ignored. Why? "Selective receive" will only pull a message out of a mailbox when there is a sufficiently general pattern to match it. In the code for client_ph1_waiting() above, there are exactly three possible messages that can be processed while in that state:

  • {ph1_ask_ok, ClOp, _, _, _} where ClOp is exactly equal to the second argument's C#c.clop value. (Remember, the underscore _ means "don't care".)
  • {ph1_ask_sorry, ClOp, _, _} where ClOp is exactly equal to the second argument's C#c.clop value.
  • timeout

It may be possible to perform an automatic (or semi-automatic) transformation of this code into something like:

client_ph1_waiting(C) ->
    C = #c{clop = ClOp, num_responses = Resps, ph1_oks = Oks, ph1_sorrys = Sorrys},
        {ph1_ask_ok, ClOp, _Server, _Cookie, _Count} = Msg ->
            cl_p1_next_step(false, C#c{num_responses = Resps + 1,
                                       ph1_oks       = [Msg|Oks]});
        {ph1_ask_sorry, ClOp, _Server, _LuckyClient} = Msg ->
            cl_p1_next_step(false, C#c{num_responses = Resps + 1,
                                       ph1_sorrys    = [Msg|Sorrys]})
    after ?SomeTimeOut ->
            cl_p1_next_step(true, C)

Future work

  • See if it's feasible to use one of the following to convert slfmsgsim-style code into straight Erlang code:
    • parse transform
    • The OTP distribution's syntax_tools tools.
  • Add support for simulated Erlang monitors, Erlang's method for informing processes that messages may have been dropped.
  • There's gotta be some cool opportunities for visualization and animation in here....

Contact the author

Contact Scott Lystig Fritchie via GitHub email or via slfritchie (at} snookles{dot)com.

Something went wrong with that request. Please try again.