Overview of Sui implementation

The following is an overview of the Sui implementation, as of version v1.9.0 (September 2023). This overview was prepared in the scope of this issue.

General Structure

Sui is a smart contract platform maintained by a permissionless set of authorities that play a role similar to validators or miners in other blockchain systems. Sui uses Narwhal and Bullshark as the mempool and consensus engines. It is mostly implemented in async Rust, using the tokio framework. The code seems to define abstract traits rather sparingly in favor of concrete types, especially for smaller components, perhaps to avoid the verbosity of generic code and the costs of dynamic dispatch; it also uses the enum_dispatch crate to replace dynamically dispatched trait accesses for the sake of performance. The larger components mostly interact by calling each other's methods. To provide shared access to components and their mutable state, the code often resorts to traditional synchronization primitives, such as mutexes and reader-writer locks from tokio::sync, together with reference-counted smart pointers; sometimes atomic types are also used.

The smaller components are often interconnected using various types of asynchronous primitives from the tokio::sync module, such as broadcast, oneshot, mpsc, and watch channels, as well as Notify. There is also a wrapped version of the tokio::sync::mpsc channel provided by mysten_metrics::metered_channel, which collects metrics on the number of queued items. The Sui implementation clearly avoids unbounded channels, which helps limit memory consumption and maintain flow control (back-pressure) between components; it is worth noting, though, that broadcast channel receivers can lag behind and miss messages. Watch channels are useful for keeping track of a changing piece of state from multiple points in the code base without blocking the writer or building up a queue of messages. There is also NotifyOnce in the mysten_common crate, which is similar to the watch channel, except that the waiter always receives a notification, even if it subscribed after the value was changed.
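As a rough illustration of two of these primitives (a generic tokio sketch, not code from the Sui code base), a bounded mpsc channel applies back-pressure to the producer, while a watch channel exposes only the latest value of a changing piece of state:

```rust
use tokio::sync::{mpsc, watch};

#[tokio::main]
async fn main() {
    // Bounded channel: `send` awaits when the buffer is full, applying back-pressure.
    let (tx, mut rx) = mpsc::channel::<u64>(16);
    // Watch channel: receivers only see the most recent value, so no queue builds up.
    let (round_tx, mut round_rx) = watch::channel(0u64);

    let producer = tokio::spawn(async move {
        for i in 0..100 {
            tx.send(i).await.expect("receiver dropped");
            let _ = round_tx.send(i); // publish the latest "round"
        }
    });

    let consumer = tokio::spawn(async move {
        while let Some(item) = rx.recv().await {
            if item % 10 == 0 {
                // Observe the latest published value without draining a queue.
                let latest = *round_rx.borrow_and_update();
                println!("processed {item}, latest round {latest}");
            }
        }
    });

    let _ = tokio::join!(producer, consumer);
}
```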

A notable pattern in async code in Sui is receiving messages from a channel, creating futures for handling the received messages, queuing those futures to an instance of FuturesOrdered or FuturesUnordered from futures_util::stream (up to a certain limit of concurrent futures), and polling the queue concurrently with receiving further messages using tokio::select. Messages communicated through channels often contain an instance of oneshot::Sender for the response/acknowledgement.
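A minimal sketch of this pattern, using hypothetical message and handler types rather than the actual Sui code, might look as follows:

```rust
use futures_util::stream::{FuturesUnordered, StreamExt};
use tokio::sync::{mpsc, oneshot};

const MAX_IN_FLIGHT: usize = 32; // limit on concurrently handled messages

struct Request {
    payload: String,
    // Sender end for the acknowledgement, carried inside the message.
    reply: oneshot::Sender<bool>,
}

async fn handle(req: Request) {
    // Stand-in for the actual message handling.
    let ok = !req.payload.is_empty();
    // Acknowledge via the oneshot sender carried inside the message.
    let _ = req.reply.send(ok);
}

async fn run(mut rx: mpsc::Receiver<Request>) {
    let mut in_flight = FuturesUnordered::new();
    loop {
        tokio::select! {
            // Only accept new messages while below the concurrency limit.
            maybe_req = rx.recv(), if in_flight.len() < MAX_IN_FLIGHT => {
                match maybe_req {
                    Some(req) => in_flight.push(handle(req)),
                    None => break, // channel closed
                }
            }
            // Drive the queued futures concurrently with receiving new messages.
            Some(()) = in_flight.next() => {}
        }
    }
    // Drain the remaining futures after the channel has closed.
    while in_flight.next().await.is_some() {}
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(128);
    tokio::spawn(run(rx));
    let (reply_tx, reply_rx) = oneshot::channel();
    tx.send(Request { payload: "hello".into(), reply: reply_tx }).await.unwrap();
    assert!(reply_rx.await.unwrap());
}
```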

A Sui node is represented by the SuiNode structure, which gets created with the start or start_async method. The node consists of many components, such as the epoch store, checkpoint store, state sync store, authority state, the P2P network with state sync and peer discovery services, and the validator components. Almost all of the high-level components are wrapped in reference-counted smart pointers (Arc), or solely consist of other Arc-wrapped components; a notable exception is ValidatorComponents, which is kept behind a mutex (although the ValidatorComponents structure itself mostly consists of Arc-wrapped sub-components). Some components are supplied as dependencies when creating other components; apart from that, the trusted-peer-change and end-of-epoch channels are also used to connect components, as is the state sync handle, which consists of the sender ends of async channels.
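In simplified form, with made-up component names standing in for the real Sui ones, this composition style looks roughly like this:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

// Hypothetical components standing in for the real Sui ones.
struct CheckpointStore;
struct AuthorityState {
    checkpoint_store: Arc<CheckpointStore>,
}
struct ValidatorComponents {
    // ... validator-only sub-components, themselves mostly Arc-wrapped ...
}

struct Node {
    checkpoint_store: Arc<CheckpointStore>,
    state: Arc<AuthorityState>,
    // Present only on validators and replaced on reconfiguration,
    // hence kept behind a mutex rather than a bare Arc.
    validator_components: Mutex<Option<ValidatorComponents>>,
}

impl Node {
    fn start() -> Arc<Self> {
        let checkpoint_store = Arc::new(CheckpointStore);
        // The store is shared with the authority state by cloning the Arc.
        let state = Arc::new(AuthorityState {
            checkpoint_store: checkpoint_store.clone(),
        });
        Arc::new(Node {
            checkpoint_store,
            state,
            validator_components: Mutex::new(None),
        })
    }
}

fn main() {
    let _node = Node::start();
}
```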

Communication

Sui nodes communicate with each other using anemo, a peer-to-peer networking library built on top of QUIC with mutual TLS. The anemo library has a layered design, based on the tower crate. The communication layer in anemo is represented by the Network type, which can be constructed with the Network::bind and Builder::start methods, following the builder pattern. Network supports only async RPC-style communication with peer nodes, providing the rpc method; connections with other peers can be established with the connect method and terminated with the disconnect method; other components can learn about connecting/disconnecting peers through a broadcast channel returned by the subscribe method. The anemo library provides a set of tower middleware layers through the anemo::middleware module and the anemo-tower crate, which include layers for authentication, tracing, custom callbacks, per-peer in-flight and rate limits on inbound requests, request timeouts, setting request/response headers, and attaching IDs and arbitrary values to requests.

Narwhal also defines the ReliableNetwork trait, which provides the send and broadcast methods for sending a request to a single peer or multiple peers. Both methods return an instance of CancelOnDropHandler for each request destination, which is a wrapper around tokio::task::JoinHandle that aborts the corresponding tokio task when dropped. There is an implementation of ReliableNetwork for anemo::Network with WorkerBatchMessage as the request type. The implementation spawns a tokio task for each request that keeps periodically trying to send the message to the peer until it receives a response or is aborted; the tasks are then wrapped into CancelOnDropHandler and returned to the caller.
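The cancel-on-drop retry pattern itself is straightforward to reproduce with plain tokio; the sketch below is illustrative (with a placeholder try_send in place of the actual anemo RPC), not Narwhal's actual code:

```rust
use std::time::Duration;
use tokio::task::JoinHandle;

/// Wrapper around a JoinHandle that aborts the task when dropped.
struct CancelOnDrop<T>(JoinHandle<T>);

impl<T> Drop for CancelOnDrop<T> {
    fn drop(&mut self) {
        self.0.abort();
    }
}

/// Keep retrying to deliver a message until it succeeds (or the handle is dropped).
fn reliable_send(msg: String) -> CancelOnDrop<()> {
    let handle = tokio::spawn(async move {
        loop {
            // Stand-in for an RPC call to the peer.
            match try_send(&msg).await {
                Ok(()) => break,
                Err(()) => tokio::time::sleep(Duration::from_millis(200)).await,
            }
        }
    });
    CancelOnDrop(handle)
}

async fn try_send(_msg: &str) -> Result<(), ()> {
    // In the real implementation this would issue an anemo RPC.
    Ok(())
}

#[tokio::main]
async fn main() {
    let _handle = reliable_send("batch".to_string());
    // Dropping `_handle` before completion would abort the retry task.
    tokio::time::sleep(Duration::from_millis(10)).await;
}
```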

Network can handle several different services on the same connection. This is achieved through the Router service, which allows composing other services identified by their paths using the route and add_rpc_service methods.

Code for anemo server and client stubs can be generated using the anemo-build crate. Sui node generates anemo stubs with the build_anemo_services function in the build script of the sui_network crate; similarly, Narwhal does that in the build script of the narwhal-types crate.

For each outbound RPC request, anemo opens a new bidirectional stream with the peer, taking advantage of QUIC streams, which are cheap and instantaneous to open. Individual RPC requests in anemo provide end-to-end flow control (backpressure).

Consensus Protocol Implementation

As mentioned before, Sui's consensus protocol relies on Narwhal and Bullshark as the mempool and consensus engines. Narwhal ensures reliable and efficient payload propagation, providing a DAG-based structured mempool; whereas Bullshark allows reaching consensus among the nodes on the ordering of the batches of transactions by interpreting Narwhal mempool's DAG. A Narwhal peer consists of a single primary node and at least one worker node. The worker nodes are responsible for availability and dissemination of the payload; whereas the primary node maintains the mempool structure and ensures its properties.

Consensus-related components get constructed by the construct_validator_components method of SuiNode, which creates an instance of ConsensusAdapter for submitting transactions to Narwhal and an instance of NarwhalManager for operating the Narwhal primary and worker nodes; it also creates and starts EpochDataRemover, which is responsible for cleaning up persistent storage of old epoch data. The function then invokes the start_epoch_specific_validator_components method, which starts consensus-specific components, such as the CheckpointService, ConsensusHandler, and NarwhalManager; those components get shut down and restarted upon reconfiguration.

ConsensusAdapter is created with the construct_consensus_adapter method, using LazyNarwhalClient, a Narwhal client that instantiates LocalNarwhalClient lazily. The Arc-wrapped ConsensusAdapter implements the ReconfigurationInitiator and SubmitToConsensus traits, so it is supplied as a dependency when creating CheckpointService.
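The lazy-initialization aspect can be illustrated with tokio::sync::OnceCell; this is a simplified stand-in with hypothetical types, not the actual LazyNarwhalClient code:

```rust
use std::sync::Arc;
use tokio::sync::OnceCell;

// Hypothetical client type standing in for LocalNarwhalClient.
struct LocalClient;

impl LocalClient {
    async fn connect() -> Arc<LocalClient> {
        // In the real code this would obtain the locally running Narwhal instance.
        Arc::new(LocalClient)
    }
}

/// Instantiates the inner client only on first use.
struct LazyClient {
    inner: OnceCell<Arc<LocalClient>>,
}

impl LazyClient {
    fn new() -> Self {
        Self { inner: OnceCell::new() }
    }

    async fn get(&self) -> Arc<LocalClient> {
        self.inner
            .get_or_init(|| async { LocalClient::connect().await })
            .await
            .clone()
    }
}

#[tokio::main]
async fn main() {
    let lazy = LazyClient::new();
    let _client = lazy.get().await; // the client is created here, on first access
    let _again = lazy.get().await;  // subsequent calls reuse the same instance
}
```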

CheckpointService is created and started with the start_checkpoint_service method. CheckpointService spawns two async tasks: one to run CheckpointBuilder and the other to run CheckpointAggregator. CheckpointBuilder produces new checkpoints, submits them for consensus, stores them in the checkpoint store, and notifies CheckpointAggregator; CheckpointAggregator aggregates checkpoint signatures into certified checkpoints, stores them in the checkpoint store, and submits the checkpoint certificates to the state sync service. Both CheckpointBuilder and CheckpointAggregator use tokio::sync::Notify to wait for and get notified about new events to be processed in their async tasks. CheckpointBuilder requires an instance of the CheckpointOutput interface as a dependency to submit checkpoints for consensus, which is implemented by SubmitCheckpointToConsensus using ConsensusAdapter. CheckpointAggregator requires an instance of the CertifiedCheckpointOutput interface as a dependency to submit checkpoint certificates to the state sync service, which is implemented by SendCheckpointToStateSync, a simple wrapper around state_sync::Handle.
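The Notify-based wait loop can be sketched as follows; the names and the processing step are hypothetical placeholders, not the actual CheckpointBuilder code:

```rust
use std::sync::Arc;
use tokio::sync::Notify;

/// A simplified event-driven worker in the style of CheckpointBuilder/Aggregator:
/// it processes whatever work has accumulated and then sleeps until notified.
async fn run_builder(notify: Arc<Notify>) {
    loop {
        // Process all currently pending work (e.g. build pending checkpoints).
        process_pending().await;
        // Wait until someone signals that there is new work.
        notify.notified().await;
    }
}

async fn process_pending() {
    // Placeholder for the actual work.
}

#[tokio::main]
async fn main() {
    let notify = Arc::new(Notify::new());
    let worker = tokio::spawn(run_builder(notify.clone()));

    // Some other component signals that new work is available.
    notify.notify_one();

    // In a real system the worker runs until shutdown; here we just let it spin briefly.
    tokio::time::sleep(std::time::Duration::from_millis(10)).await;
    worker.abort();
}
```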

ConsensusHandler implements the ExecutionState async trait to consume and execute ConsensusOutput, i.e., transaction batches ordered by consensus, so it is later supplied as a dependency to NarwhalManager::start. When created, ConsensusHandler starts AsyncTransactionScheduler, which creates a tokio mpsc channel and spawns an async task that simply receives transactions from the channel and hands them over to TransactionManager. AsyncTransactionScheduler::schedule then just sends transactions to that channel. Transactions get submitted to AsyncTransactionScheduler in ConsensusHandler::handle_consensus_output.
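In outline, with hypothetical stand-ins for TransactionManager and the transaction type, such a scheduler amounts to a bounded channel plus a forwarding task:

```rust
use tokio::sync::mpsc;

// Hypothetical stand-ins for the real Sui types.
struct Transaction(u64);

struct TransactionManager;

impl TransactionManager {
    fn enqueue(&self, txns: Vec<Transaction>) {
        // In the real code, this hands the transactions over for execution.
        println!("enqueued {} transactions", txns.len());
    }
}

/// Scheduler in the style of AsyncTransactionScheduler: a bounded channel
/// plus a task that forwards received batches to the transaction manager.
struct Scheduler {
    sender: mpsc::Sender<Vec<Transaction>>,
}

impl Scheduler {
    fn start(manager: TransactionManager) -> Self {
        let (sender, mut receiver) = mpsc::channel::<Vec<Transaction>>(16);
        tokio::spawn(async move {
            while let Some(batch) = receiver.recv().await {
                manager.enqueue(batch);
            }
        });
        Self { sender }
    }

    async fn schedule(&self, batch: Vec<Transaction>) {
        // Back-pressure: awaits if the channel is full.
        let _ = self.sender.send(batch).await;
    }
}

#[tokio::main]
async fn main() {
    let scheduler = Scheduler::start(TransactionManager);
    scheduler.schedule(vec![Transaction(1), Transaction(2)]).await;
    tokio::time::sleep(std::time::Duration::from_millis(10)).await;
}
```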

NarwhalManager, when created by construct_validator_components, just initializes PrimaryNode and WorkerNodes, which get started by calling NarwhalManager::start in start_epoch_specific_validator_components. NarwhalManager::start opens Narwhal NodeStorage, creates its NetworkClient, and starts the primary and worker nodes. Local Narwhal primary and worker nodes communicate with each other through NetworkClient, which implements the PrimaryToWorkerClient and WorkerToPrimaryClient traits by using local instances of WorkerToPrimary and PrimaryToWorker anemo services set with the set_worker_to_primary_local_handler and set_primary_to_worker_local_handler methods of NetworkClient.

PrimaryNode, when started, spawns async tasks for consensus and the primary, which are connected by mpsc channels for new and committed certificates, as well as a watch channel for consensus round updates; they also share the same instance of LeaderSchedule created in spawn_consensus. The spawn_consensus method spawns the consensus core and the executor, which are connected by an mpsc channel for committed sub-DAGs. The consensus core receives new certificates and sends committed certificates, consensus round updates, and committed sub-DAGs, whereas the executor receives the committed sub-DAGs and uses the supplied instance of ExecutionState to handle the consensus output.

The consensus core is represented by the Consensus structure. It maintains the consensus state, represented by the ConsensusState structure, and uses Bullshark as the consensus protocol to process received Narwhal certificates and update the consensus state. First, it reconstructs the consensus state from the consensus and certificate persistent stores; it then spawns an async task that receives new certificates from the supplied channel, processes them using the protocol (updating the consensus state along the way), sends the results to the committed sub-DAG and committed certificate channels, and signals consensus round updates. It is worth noting that the logic in Bullshark is completely sequential and synchronous, i.e., there is no concurrency when processing certificates.
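A schematic of this loop, with hypothetical types in place of the actual Narwhal certificates and sub-DAGs, could look like this; note that the protocol step itself is a plain synchronous function:

```rust
use tokio::sync::{mpsc, watch};

// Hypothetical stand-ins for Narwhal/Bullshark types.
struct Certificate {
    round: u64,
}
struct CommittedSubDag {
    leader_round: u64,
}

/// Purely sequential protocol logic: no await points, no concurrency.
fn process_certificate(cert: Certificate) -> Option<CommittedSubDag> {
    // A real protocol would update the consensus state and possibly commit.
    Some(CommittedSubDag { leader_round: cert.round })
}

async fn run_consensus(
    mut new_certificates: mpsc::Receiver<Certificate>,
    committed: mpsc::Sender<CommittedSubDag>,
    round_updates: watch::Sender<u64>,
) {
    while let Some(cert) = new_certificates.recv().await {
        let round = cert.round;
        // Certificates are processed one by one, strictly in order.
        if let Some(sub_dag) = process_certificate(cert) {
            let _ = committed.send(sub_dag).await;
        }
        let _ = round_updates.send(round);
    }
}

#[tokio::main]
async fn main() {
    let (cert_tx, cert_rx) = mpsc::channel(16);
    let (commit_tx, mut commit_rx) = mpsc::channel(16);
    let (round_tx, _round_rx) = watch::channel(0u64);

    tokio::spawn(run_consensus(cert_rx, commit_tx, round_tx));
    cert_tx.send(Certificate { round: 1 }).await.unwrap();
    if let Some(sub_dag) = commit_rx.recv().await {
        println!("committed sub-DAG at round {}", sub_dag.leader_round);
    }
}
```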

The executor is implemented as two async tasks connected by a channel for consensus output. One task receives committed sub-DAGs from the supplied channel, creates a limited number of concurrent futures to fetch the batches from the Narwhal worker nodes, and sends the committed batches to the other task. The other task simply receives the batches from the channel and invokes the handle_consensus_output method of the supplied instance of ExecutionState.

The Narwhal primary node, represented by the Primary type, spawns its own anemo network layer and a number of asynchronous components interconnected by channels. Synchronizer contains most of the certificate processing logic in Narwhal; it is the central component, which is wrapped in Arc and supplied to most of the other components as a dependency. Synchronizer spawns a number of async tasks and uses channels, atomic types, and mutexes to synchronize interaction between those async tasks, public method invocations, and incoming RPC request handling.

Each Narwhal worker node, represented by the Worker type, spawns its own anemo network layer to communicate with other Narwhal nodes. It also spawns a gRPC server, based on the tonic crate, and a number of asynchronous tasks responsible for handling client transactions, interconnected by channels.

Tracing and Controlling Execution

For diagnostic logging and telemetry functionality, such as distributed tracing, logging, and Prometheus metrics, the Sui node uses the telemetry-subscribers crate, which is based on Tokio's tracing library. The code extensively adds traces and collects various metrics. Some functions are instrumented with the tracing::instrument attribute macro, which automatically creates and enters a tracing span every time the function is called. The crate also supports tokio-console, a diagnostics and debugging tool for asynchronous Rust programs.
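For example, an async function can be instrumented as shown below; this is a generic tracing/tracing-subscriber example with made-up function names, not code from Sui:

```rust
use tracing::{info, instrument};

// Creates and enters a span named after the function on every call,
// recording the non-skipped arguments as span fields.
#[instrument(level = "debug", skip(payload))]
async fn handle_request(peer: &str, payload: Vec<u8>) -> usize {
    // Events emitted here are attached to the function's span.
    info!(len = payload.len(), "handling request");
    payload.len()
}

#[tokio::main]
async fn main() {
    // A subscriber must be installed for the traces to go anywhere;
    // Sui uses the telemetry-subscribers crate for this.
    tracing_subscriber::fmt().init();
    let n = handle_request("peer-1", vec![0u8; 3]).await;
    info!(n, "done");
}
```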

Sui has a powerful simulator, based on madsim, that supports deterministic, randomized execution of an entire Sui network in a single process, with simulated network latency and packet loss. The simulator has the following main features:

  • a simulation runtime with a randomized but deterministic executor that allows stopping, restarting, and terminating nodes, and provides a simulated clock, timers, and a pseudo-random number generator;
  • a network simulator to deliver network messages between nodes adding latency and simulating packet loss;
  • an API-compatible replacement for tokio;
  • an interceptor of various POSIX API calls in order to enforce determinism;
  • procedural macros to conveniently run test code inside a testing environment.

The sui_macros crate provides the fail_point and fail_point_async macros for injecting fail points into code; tests and benchmarks register callbacks to be executed at those fail points.