Overview of AptosBFT implementation
The following is an overview of the AptosBFT implementation, as of version aptos-cli-v1.0.13 (April 2023). This overview was prepared as part of this issue.
The consensus component supports state machine replication using the AptosBFT consensus protocol, a variant of the HotStuff consensus protocol. The protocol is mostly implemented in async Rust, using the tokio framework, and generally following the actor programming model.
Actors are typically represented as types with an associated async `start` function, e.g. the `NetworkTask` structure and its `start` method. Such a `start` function usually runs a loop that polls async streams with the `futures::select` macro and handles the received values. It is supposed to be executed in its own asynchronous task, e.g. using `tokio::runtime::Runtime::spawn` or `tokio::runtime::Handle::spawn`. Those tasks typically interact with each other using two kinds of async channels, both implemented in the `aptos-channels` crate. The channels are constructed using the following constructor functions:
- `aptos_channels::new`, which creates a simple bounded FIFO channel with backpressure;
- `aptos_channels::aptos_channel::new`, which creates a bounded multi-queue channel with no backpressure and one of the supported policies for dropping and retrieving messages:
  - LIFO (oldest messages are dropped),
  - FIFO (newest messages are dropped),
  - KLAST (oldest messages are dropped, but the remaining ones are retrieved in FIFO order).
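The three drop policies can be illustrated with a minimal, self-contained sketch. This is a plain `VecDeque` model written for this overview, not the actual `aptos-channels` implementation:

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy)]
enum QueueStyle { Lifo, Fifo, Klast }

/// Minimal model of a bounded queue with the three drop policies.
struct BoundedQueue<T> {
    style: QueueStyle,
    capacity: usize,
    items: VecDeque<T>,
}

impl<T> BoundedQueue<T> {
    fn new(style: QueueStyle, capacity: usize) -> Self {
        Self { style, capacity, items: VecDeque::new() }
    }

    /// Push a message, dropping one according to the policy when full.
    fn push(&mut self, item: T) {
        if self.items.len() == self.capacity {
            match self.style {
                // LIFO and KLAST: drop the oldest message to make room.
                QueueStyle::Lifo | QueueStyle::Klast => { self.items.pop_front(); }
                // FIFO: drop the newest message (the one being pushed).
                QueueStyle::Fifo => return,
            }
        }
        self.items.push_back(item);
    }

    /// Pop a message: LIFO retrieves the newest, FIFO/KLAST the oldest.
    fn pop(&mut self) -> Option<T> {
        match self.style {
            QueueStyle::Lifo => self.items.pop_back(),
            QueueStyle::Fifo | QueueStyle::Klast => self.items.pop_front(),
        }
    }
}

fn main() {
    let mut q = BoundedQueue::new(QueueStyle::Klast, 2);
    q.push("a");
    q.push("b");
    q.push("c"); // "a" is dropped
    while let Some(m) = q.pop() {
        println!("{m}");
    }
}
```

In this model, KLAST behaves as a sliding window over the last K messages: old entries are evicted on overflow, but consumers still read in arrival order.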
Some values that are sent over channels include the sender part of a one-shot channel from `futures::channel::oneshot`, for communicating back the response. Occasionally, `futures::channel::mpsc` and `tokio::sync::mpsc` channels are also used instead of the channels from the `aptos-channels` crate.
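The reply-channel pattern can be sketched with standard-library channels. The real code embeds `futures::channel::oneshot` senders in async messages; the responder loop below only mirrors the shape of the actor-style `start` loop in synchronous form, and all names are illustrative:

```rust
use std::sync::mpsc;
use std::thread;

/// A request carrying the sender half of its own reply channel,
/// mimicking the oneshot-sender-inside-a-message pattern.
struct Request {
    payload: String,
    reply_tx: mpsc::Sender<String>,
}

/// Actor-style loop: handle each request and send the reply back
/// through the channel embedded in the request itself.
fn spawn_responder(rx: mpsc::Receiver<Request>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for req in rx {
            let response = format!("ack: {}", req.payload);
            let _ = req.reply_tx.send(response);
        }
    })
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let handle = spawn_responder(rx);

    // Each request gets its own fresh reply channel.
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(Request { payload: "hello".into(), reply_tx }).unwrap();
    println!("{}", reply_rx.recv().unwrap());

    drop(tx); // close the request channel so the responder loop ends
    handle.join().unwrap();
}
```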
Communication between Aptos nodes happens according to the AptosNet protocol. Nodes try to maintain at most one connection with each remote peer; the application protocols, such as consensus, mempool, etc., are multiplexed over the single TCP connection. Application protocols are identified with a numerical `ProtocolId`. Peers are identified by their account addresses. AptosNet provides application protocols with two primary interfaces:
- `DirectSend`, for fire-and-forget style message delivery;
- `RPC`, for unary remote procedure calls.
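The difference between the two interfaces can be summarized with a trait sketch; the trait, type aliases, and signatures below are hypothetical stand-ins written for this overview, not the actual AptosNet API:

```rust
use std::time::Duration;

// Illustrative stand-ins for AptosNet concepts; not the real types.
type PeerId = u64;
type ProtocolId = u8;
type Bytes = Vec<u8>;

trait ApplicationNetwork {
    /// DirectSend: fire-and-forget, no delivery guarantee, no reply.
    fn direct_send(&self, peer: PeerId, protocol: ProtocolId, msg: Bytes);

    /// RPC: send a request and wait for a single response (or time out).
    fn rpc(&self, peer: PeerId, protocol: ProtocolId, req: Bytes, timeout: Duration)
        -> Result<Bytes, String>;
}

/// Trivial in-process implementation used only to exercise the trait.
struct Loopback;

impl ApplicationNetwork for Loopback {
    fn direct_send(&self, _peer: PeerId, _protocol: ProtocolId, _msg: Bytes) {
        // fire-and-forget: nothing comes back
    }
    fn rpc(&self, _peer: PeerId, _protocol: ProtocolId, req: Bytes, _timeout: Duration)
        -> Result<Bytes, String> {
        Ok(req) // echo the request back as the "response"
    }
}

fn main() {
    let net = Loopback;
    net.direct_send(1, 0, vec![1]);
    let resp = net.rpc(1, 0, vec![1, 2, 3], Duration::from_secs(1));
    println!("{resp:?}");
}
```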
To interact with the networking layer, application protocols are supplied with the `NetworkClient` and `NetworkServiceEvents` types, both generic over the message type. The message type has to satisfy the `Message` trait constraint; the `Message` trait has a blanket implementation for any type implementing both the `serde::Serialize` and `serde::de::DeserializeOwned` traits.
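The blanket-implementation pattern behind the `Message` trait can be sketched without pulling in serde, using placeholder traits in place of the serde bounds; everything below is illustrative:

```rust
// Placeholder traits standing in for serde's Serialize / DeserializeOwned.
trait SerializeLike { fn to_bytes(&self) -> Vec<u8>; }
trait DeserializeLike: Sized { fn from_bytes(bytes: &[u8]) -> Option<Self>; }

/// The marker trait: anything both serializable and deserializable
/// automatically counts as a Message via the blanket impl below.
trait Message: SerializeLike + DeserializeLike {}
impl<T: SerializeLike + DeserializeLike> Message for T {}

struct Ping(u8);

impl SerializeLike for Ping {
    fn to_bytes(&self) -> Vec<u8> { vec![self.0] }
}
impl DeserializeLike for Ping {
    fn from_bytes(bytes: &[u8]) -> Option<Self> {
        bytes.first().copied().map(Ping)
    }
}

/// A generic function that only accepts Message types, like the
/// network layer's generic send/receive machinery.
fn roundtrip<M: Message>(msg: &M) -> Option<M> {
    M::from_bytes(&msg.to_bytes())
}

fn main() {
    let p = roundtrip(&Ping(7)).unwrap();
    println!("roundtrip ok: {}", p.0);
}
```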
`NetworkClient` implements the `NetworkClientInterface` trait and provides basic support for sending messages, disconnecting from peers, notifying the network stack of new peers, and managing some application-specific metadata for each peer. In particular, its `send_to_peer` and `send_to_peers` methods allow sending a message to one or many peers, though there is no guarantee of successful message delivery. The async `send_to_peer_rpc` method provides RPC-style communication with a timeout: it sends the given message to the remote peer, then awaits and returns a response message, or fails after a timeout.
The implementation of `NetworkClient` is based on the `NetworkSender` structure, which is responsible for de-/serializing messages and communicating with the lower network layer. `NetworkSender` is a wrapper around `PeerManagerRequestSender` and `ConnectionRequestSender`, both of which are, in turn, convenience wrappers around `aptos_channels::aptos_channel::Sender`. `PeerManagerRequestSender` and `ConnectionRequestSender` are used to issue communication and connection requests, respectively, and await the responses from `PeerManager`. For each RPC request, `PeerManagerRequestSender` creates an instance of the one-shot channel from `futures::channel::oneshot`, in order to receive the response message.
`NetworkServiceEvents` is used to receive and respond to network events. It is just a collection of `NetworkEvents` structures, one per `NetworkId`, which are responsible for de-/serializing messages and communicating with the lower network layer. `NetworkEvents` represents a stream of events from the lower network layer. It is a wrapper around a pair of `aptos_channels::aptos_channel::Receiver`s, receiving communication and connection notifications from `PeerManager`. `NetworkEvents` implements the async `Stream` interface by merging those two channels into a single stream of `Event` values.
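Merging two notification channels into one event stream can be sketched with standard-library channels. The real `NetworkEvents` polls two async receivers and yields `Event` values as they arrive; this synchronous sketch simply chains the drained queues, and the names are illustrative:

```rust
use std::sync::mpsc;

/// Illustrative event type merging two notification sources.
#[derive(Debug, PartialEq)]
enum Event {
    Message(String), // communication notification
    NewPeer(u64),    // connection notification
}

/// Drain both receivers into a single stream of Event values.
/// (An async implementation would interleave by polling both channels.)
fn merged_events(
    messages: mpsc::Receiver<String>,
    connections: mpsc::Receiver<u64>,
) -> impl Iterator<Item = Event> {
    messages
        .into_iter()
        .map(Event::Message)
        .chain(connections.into_iter().map(Event::NewPeer))
}

fn main() {
    let (msg_tx, msg_rx) = mpsc::channel();
    let (conn_tx, conn_rx) = mpsc::channel();
    msg_tx.send("hello".to_string()).unwrap();
    conn_tx.send(42).unwrap();
    // Dropping the senders lets the merged iterator terminate.
    drop(msg_tx);
    drop(conn_tx);
    for ev in merged_events(msg_rx, conn_rx) {
        println!("{ev:?}");
    }
}
```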
The consensus protocol implementation mostly follows the actor programming model. The top-level component is `EpochManager`, which manages consensus components across epochs and connects to other components constructed in the `start_consensus` function:
- `StorageWriteProxy`, which implements the `PersistentLivenessStorage` trait, for storing and retrieving proposed but not yet committed blocks and certificates to and from the persistent storage;
- `QuorumStoreDB`, which implements the `QuorumStoreStorage` trait, apparently for interacting with the persistent storage for batches of transactions;
- `ExecutionProxy`, which implements the `StateComputer` trait, for managing the results of (speculative) execution of proposed blocks;
- `ClockTimeService`, which implements the `TimeService` trait, for performing operations that depend on time;
- `ConsensusNetworkClient`, a convenience wrapper around `NetworkClient`;
- `NetworkTask`, for pulling incoming messages from the network layer and dispatching them to appropriate internal channels.
Messages generated in the consensus components and addressed to the node itself are looped back to `NetworkTask` through a channel created in the `start_consensus` function. Even though the loopback channel never drops messages and provides backpressure, the intermediate channels used internally by `NetworkTask` can drop excess messages.
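The loopback arrangement can be sketched as a simple dispatch on the destination peer; the function and channel names below are hypothetical:

```rust
use std::sync::mpsc;

type PeerId = u64;

/// Route a message either to the network or back to the local node
/// through the loopback channel, depending on the destination.
fn route(
    own_id: PeerId,
    dest: PeerId,
    msg: String,
    network_tx: &mpsc::Sender<String>,
    loopback_tx: &mpsc::Sender<String>,
) {
    if dest == own_id {
        // Self-addressed messages skip the network stack entirely.
        let _ = loopback_tx.send(msg);
    } else {
        let _ = network_tx.send(msg);
    }
}

fn main() {
    let (net_tx, net_rx) = mpsc::channel();
    let (loop_tx, loop_rx) = mpsc::channel();
    route(1, 1, "vote".into(), &net_tx, &loop_tx);
    route(1, 2, "proposal".into(), &net_tx, &loop_tx);
    println!("loopback: {:?}", loop_rx.try_recv());
    println!("network: {:?}", net_rx.try_recv());
}
```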
`EpochManager`, in turn, creates a number of other components, such as:
- `SafetyRulesManager`, which provides an instance of the `TSafetyRules` trait, responsible for the safety of the consensus protocol;
- `EpochState`, which implements the `Verifier` trait, for validating messages in the epoch;
- `RoundState`, for keeping information about a specific round and moving forward when receiving new certificates;
- `ProposerElection`, which incorporates the logic of choosing a leader among multiple candidates;
- `NetworkSender`, which implements support for all consensus messaging using `ConsensusNetworkClient`;
- `QuorumStoreBuilder`, which apparently coordinates interaction with the mempool and quorum store through `PayloadManager` and a number of channels;
- `QuorumStoreClient`, which implements the `PayloadClient` trait and can pull information about transactions from the mempool;
- `BlockStore`, which implements the `BlockReader` trait, for maintaining all the blocks of payload and the dependencies of those blocks;
- `ProposalGenerator`, for generating proposal blocks on demand.
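The simplest proposer-election strategy can be sketched as round-robin over the validator set; this only illustrates the rotating-proposer idea, not the actual `ProposerElection` logic:

```rust
// Illustrative aliases; the real types live in the Aptos codebase.
type Author = u64;
type Round = u64;

/// Pick the proposer for a round by rotating through the validators.
fn round_robin_proposer(validators: &[Author], round: Round) -> Author {
    validators[(round as usize) % validators.len()]
}

fn main() {
    let validators = [10, 20, 30];
    for round in 0..5 {
        println!("round {round}: proposer {}", round_robin_proposer(&validators, round));
    }
}
```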
Details of the interaction between those components and further sub-components are complicated, difficult to follow, and therefore fall outside the scope of this overview. It is worth noting, though, that `EpochManager` uses a bounded async task executor to spawn tasks for verifying incoming messages and forwarding them to appropriate internal channels for further processing. The implementation also does not fully follow the actor programming model; e.g., `BlockStore` is accessed in parallel by several sub-components.
The Aptos node implements its own diagnostic logging mechanism in the `aptos-logger` crate. It provides a set of logging macros for emitting logs at different levels, and sampled logging for cases where logging a large amount of data would be expensive; it also supports attaching structured data to a formatted text message, as well as typed logging schemas. The code also extensively collects metrics using the `prometheus` crate; e.g., the channel implementations in the `aptos-channels` crate accept metrics counters to keep track of the number of items in the queue and the number of dropped items.
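The way channels carry metrics counters can be sketched with atomics in place of `prometheus` counters; the structure and names are illustrative:

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicU64, Ordering};

/// Counters mirroring the kind of metrics attached to channels:
/// current queue length and total dropped messages.
#[derive(Default)]
struct ChannelMetrics {
    queued: AtomicU64,
    dropped: AtomicU64,
}

/// A bounded queue that updates its metrics on every operation.
struct MeteredQueue<T> {
    items: VecDeque<T>,
    capacity: usize,
    metrics: ChannelMetrics,
}

impl<T> MeteredQueue<T> {
    fn new(capacity: usize) -> Self {
        Self { items: VecDeque::new(), capacity, metrics: ChannelMetrics::default() }
    }

    fn push(&mut self, item: T) {
        if self.items.len() == self.capacity {
            // Count the drop instead of blocking (no backpressure).
            self.metrics.dropped.fetch_add(1, Ordering::Relaxed);
            return;
        }
        self.items.push_back(item);
        self.metrics.queued.fetch_add(1, Ordering::Relaxed);
    }

    fn pop(&mut self) -> Option<T> {
        let item = self.items.pop_front();
        if item.is_some() {
            self.metrics.queued.fetch_sub(1, Ordering::Relaxed);
        }
        item
    }
}

fn main() {
    let mut q = MeteredQueue::new(2);
    q.push(1);
    q.push(2);
    q.push(3); // the third push is dropped and counted
    println!(
        "queued={} dropped={}",
        q.metrics.queued.load(Ordering::Relaxed),
        q.metrics.dropped.load(Ordering::Relaxed)
    );
}
```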
To dynamically inject errors at runtime in integration tests, the code includes fail points using the `fail` crate. The consensus implementation doesn't use system time for time-dependent operations, such as scheduling timeouts or sleeping; instead, it uses a supplied abstraction, represented by the `TimeService` trait. This allows using simulated time in tests, implemented by `SimulatedTimeService`.
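The idea of the time abstraction can be sketched as follows; the trait and types are simplified, synchronous stand-ins for the actual async `TimeService` and `SimulatedTimeService`:

```rust
use std::cell::Cell;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Minimal time abstraction: consensus code asks this trait for the
/// current time instead of reading the system clock directly.
trait TimeService {
    fn now(&self) -> Duration;
    fn sleep(&self, d: Duration);
}

/// Production-style implementation backed by the system clock.
struct ClockTime;
impl TimeService for ClockTime {
    fn now(&self) -> Duration {
        SystemTime::now().duration_since(UNIX_EPOCH).unwrap()
    }
    fn sleep(&self, d: Duration) {
        std::thread::sleep(d);
    }
}

/// Test implementation: "sleeping" just advances a virtual clock.
struct SimulatedTime {
    now: Cell<Duration>,
}
impl TimeService for SimulatedTime {
    fn now(&self) -> Duration {
        self.now.get()
    }
    fn sleep(&self, d: Duration) {
        self.now.set(self.now.get() + d);
    }
}

fn main() {
    let sim = SimulatedTime { now: Cell::new(Duration::ZERO) };
    sim.sleep(Duration::from_secs(5)); // completes instantly in tests
    println!("virtual time: {:?}", sim.now());
}
```

Swapping the clock behind a trait like this is what lets tests fast-forward through timeouts without actually waiting.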