Skip to content

stephen-hlx/overcooked

Repository files navigation

Overview

Overcooked is a tool that provides a way to run formal verification against a distributed system, in the CI/CD pipeline, using the real code.

In a distributed system, there are usually multiple actors. The distributed system works by these actors interacting with each other. Their interactions form a number of sequences, with different interleaving. The number of these sequences grows rapidly as the number of actors and actions increase. It may not be easy to reach a conclusion that all these sequences are allowed, or in other words, the system does not break after each of all these sequences.

By exhausting all possible sequences, including simulating failures like network partition, Overcooked verifies whether or not the distributed system's invariants are honoured during and after these sequences of actions, in which case we can be confident that the system's correctness is guaranteed.

Overcooked currently supports only JAVA, and it requires the service and the client of the system's applications to have an in-memory version of its implementations.

The remaining of this document comprises 2 main parts:

How does it work?

  • A distributed system is said to be in a correct state if all its invariants are honoured
  • A distributed system's invariants are defined by its states satisfying a set of conditions
  • A distributed system's state comprises the states of each of its actors
  • A state transition of the distributed system is the result of a single action taken by one of its actors

Starting from an initial state of the distributed system, Overcooked triggers all actions of each actor, one action per step, to exhaust the sequences of all possible interleaving and verify that all of them leave the system in a state that with all its invariants honoured. Overcooked also allows the injection of failures by having an actor reject actions performed by one or more specific actors, simulating situations like network partitioning.

Quick Example

The water pouring puzzle can be considered as 2 actors in a distributed system, one is a jug with a capacity of 3 litres, and the other with 5 litres. Let's call them Jug3 and Jug5. They both are able to perform following actions:

  • empty itself
  • fill itself (to the full capacity)
  • add water to a different jug (till it empties itself or the other is full)

For a distributed system with these 2 actors, Jug3 and Jug5, the state machine looks like this:

Among these states, there are 2 of them with Jug5 holding 4 litres of water. If the system has an invariant that neither of these 2 jugs can be holding 4 litres of water (Jug3 intrinsically cannot hold 4 litres of water), that means the system has a bug in there allowing some paths leading to the states with the invariants violated. With Overcooked, an invariant can be defined as Jug5 not holding 4 litres of water. Then, when Overcooked builds the state machine, for every new state the invariant is checked and the paths leading to the states with Jug5 holding 4 litres of water will be identified. Following is one of the 2 paths:
waterjug_shortest_path

This example is included in the waterjug package.

Key Components

In the rest of this section, the Two Phase Commit example (code) will be frequently used to demonstrate how TransactionManager works multiple ResourceManagers to commit a transaction together.

Actor and Action

A distributed system consists of more than one actor. The system works by these actors interacting with each other.

The Two Phase Commit example has two types of actors in the system, ResourceManager and TransactionManager. The actions between these two actors are:
RmTmInteraction

Actor State

The state of an actor represents the state of an individual actor at the application level.

An actor has its behaviour. It can perform an action against itself, and it can also perform an action against a different actor. Each of these actions may or may not transition the action's performer and receiver into a different state, which can be described via a finite state machine.

The ResourceManager for example, has its own finite state machine:
(RM: ResourceManager, TM: TransactionManager)
ResourceManagerFSM

Actor Environment State

The state of the environment an actor lives in. It is used to keep the state that indicates whether the actor accepts actions performed by a different actor. It is a deny list based state, so by default, the list is empty and the actor accepts actions performed by other actors in the system.

Following is an example: ActorEnvState
(For simplicity, most of the diagram in this document omits the Actor Environment State because it makes the diagram big and hard to read)

Local State

Local state consists of both the Actor State and the Actor Environment State of an actor, it looks like the example in the Actor Environment State.

Global State

A distributed system is made up of a number of actors. The system's state is therefore a collection of the states of the actors. This collection is called GlobalState.

Following is an example from the Two Phase Commit sample:
(TM's local state consists of a view of the states of the RMs)
GlobalStateExample

Invariant

A distributed system's correctness is defined by the upholding of its invariants. Its invariants are defined by its global states satisfying a set of conditions. For instance, one of the Two Phase Commit protocol's invariants requires that if any of the ResourceManagers is ABORTED, no ResourceManager can be in a state of COMMITTED. Therefore, the following global state would be violating such an invariant:
GlobalStateExample2

Another invariant example is that, TransactionManager should always have a view of the ResourceManagers' states that is consistent with all ResourceManagers' states.

In-Memory Implementation

OvercookedCodeStructure

Model verification cares only the state of the actors in the distributed system, and is agnostic to the network communication and the underlying storage.

Actors in a distributed system interact with each other via the counterpart's client, of which the implementation could be REST, gRPC, etc. The model verification however, is not concerned with the communication implementation. Instead, it needs an in-memory implementation of these clients to restore the actors using their local states.

Similarly, service implementation may include a durable storage, e.g. AWS S3. However, such integration could slow down and even complicate the restoration of an actor. Therefore, an in-memory service implementation for Overcooked simplifies the model verification.

Persona

Usually, actors in a system interact with each other via the counterpart's client. For example, in production code, ResourceManager would use the TransactionManagerClient to let the TransactionManager know that it is prepared for commit:

class ResourceManagerService {
  private final ResourceManager resourceManager;
  private final TransactionManagerClient transactionManagerClient;
  void processRequest(Request request) {
    if (shouldPrepareForCommit(request)) {
      resourceManager.prepare(transactionManagerClient);
    }
  }
}

To express interactions in this way, it requires two pairs of actors:

  • ResourceManager vs TransactionManagerClient, e.g.
    resourceManager.prepare(transactionManagerClient);
    
  • TransactionManager vs ResourceManagerClient, e.g.
    transactionManager.commit(resourceManagerClient);
    

To simplify this, Overcooked personifies the service and client as a single actor and requires that both of the in-memory implementations of the service and client share the same state:

class ResourceManagerActor implements ResourceManagerClient, ResourceManager { }
class TransactionManagerActor implements TransactionManagerClient, TransactionManager { }

The interaction then can be represented as:

resourceManagerActor.prepare(transactionManagerActor);
transactionManagerActor.commit(resourceManagerActor);

More specifically, following is what the real definition of an action looks like:

ActionTemplate.<ResourceManagerActor, TransactionManagerActor>builder()
  .actionPerformerId(actionPerformerId)
  .actionType(new TransitiveActionType(TM))
  .actionLabel("prepare")
  .action(ResourceManagerActor::prepare)
  .build()

Failure Injection

Failures in a distributed system have the symptom of the service being unreachable, either due to network issue (e.g. partition) or the node is actually down. The observable symptom by the dependent systems would be an exception thrown when accessing the service using the client. To simulate such failures, Overcooked provides a way to specify that an actor (e.g. ResourceManager) throws an exception when a specific actor (e.g. TransactionManager) invokes a specific action (e.g. commit) of the actor. In the following example, once this "rejectActionFrom" action is performed, subsequent calls to RM0.commit() by TMwill result in a RuntimeException being thrown:

ActionTemplate.<ResourceManagerActor, Void>builder()
  .actionPerformerId(RM0)
  .actionType(new IntransitiveActionType())
  .actionLabel("rejectCommitFromTM")
  .action(((rm, unused) -> rm.rejectActionFrom(TM,
    new SimulatedFailure("rejectCommit",
      obj -> ((ResourceManagerActor) obj).commit(),
      new RuntimeException()))))
  .build()

This "rejectActionFrom" action simulates the transition from RM0, the ResourceManagerActor, being reachable by TM to being unreachable.

Similarly, there is a corresponding reverse action named "acceptActionFrom" for restoring connection to an actor. For example, the following "acceptActionFrom" action

ActionTemplate.<ResourceManagerActor, Void>builder()
  .actionPerformerId(RM0)
  .actionType(new IntransitiveActionType())
  .actionLabel("acceptCommitFromTM")
  .action((rm, unused) -> rm.acceptActionFrom(TM))
  .build()

restores access to RM0, the ResourceManagerActor from TM, the TransactionManagerActor.

Model Verifier

The ModelVerifier encapsulates the necessities of running the model verification. That includes implementations like:

  • ActorActionConfig - the actors and the actions they can perform
  • ActorStateTransformerConfig - the implementation that extracts the local states from actors and reconstructs actors using a given local state
  • InvariantVerifier - the verification implementation of a given GlobalState

The ModelVerifier then produces an execution context that consists of the finite state machine (FSM) of the distributed system and a set of the states that violate the invariant specified, if there is any. This information can be used to generate the paths from its initial state to the invariant violating states.

For example:

class TwoPhaseCommitModelVerifier {
  void run() {
    ModelVerifier modelVerifier = ModelVerifier.builder()
        .actorActionConfig(actorActionConfig())
        .actorStateTransformerConfig(actorStateTransformerConfig())
        .invariantVerifier(new TransactionStateVerifier())
        .build();

    StateMachineExecutionContext stateMachineExecutionContext =
        modelVerifier.runWith(initialGlobalState());
    
    ReportGenerator reportGenerator = ReportGenerator.builder()
        .analyser(new JgraphtAnalyser())
        .dotGraphExporter(DotGraphExporterFactory.create())
        .outputDirName(outputDirName)
        .build();
    Report report = reportGenerator.generate(stateMachineExecutionContext.getData());
  }
}

How to use it?

Overcooked requires the system to be verified be implemented in a certain way in order for it to work effectively. You will need:

  • an in-memory implementation of each of the actors in the system
  • an implementation to extract the local state of each actors
  • an implementation to reconstruct the actors using their local states
  • specifications of the actions that are going to take place between these actors
  • definition of the invariants of the system using the local states

The Two Phase Commit sample has an in-memory implementation for both the client and service of the ResourceManager

as well as the TransactionManager:

For action specifications and invariants of the Two Phase Commit sample, please see its modelverifier package.

Next Steps

Origin of this project

This library was inspired by the thesis "[Stefanescu, 2006] Stefanescu, A. (2006). Automatic Synthesis of Distributed Transition Systems. PhD thesis, Universitat Stuttgart." and by a former project at work. The former project was about a data pipeline that consists of multiple data sources funnelling into a single component. Back then we were not able to verify whether the design and implementation had a flaw due to the large number of possible interleaving of events happening between the components. Hence the conception of this library.

The samples included in this library were inspired by the examples of TLA+.

License

This project is licensed under the MIT license.

Why is the name "Overcooked"?

When I finally got around doing this, I needed a name for it. I asked my wife for a name after describing what it does, "a tool that verifies that when a couple of actors can simultaneously perform their actions, are there any certain sequence of actions by certain actors that are going to cause a problem", she immediately recalled the video game we played at our friends' place, Overcooked. I was like, "aha, that's it!".

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages