0075-PLUP-protocol_upgrades.md

Protocol upgrades

Summary

Current state of upgrading the vega network

As of today, upgrading the protocol when the upgrade involves major changes to the state is near impossible without proceeding with a Limited Network Life (LNL) checkpoint restore. This functionality has the following significant issues:

  • A synchronous restart is required
  • All nodes need to be restarted in a very short time so that all state can be restored from Ethereum, and the network can start properly with a checkpoint.

Limited Network Life is not the end goal. This spec outlines how the protocol evolves from LNL checkpoints to rolling software updates, controlled by a reasonable set of governance and user controls.

How other protocols proceed

Other protocols, e.g. Ethereum, carry out updates in an asynchronous manner. Usually a new version of the protocol is made available, and a hardcoded block height is set at which a new code path will be enabled. If enough node runners have deployed the updated code, then the network will continue with the new code path. Other nodes which haven't updated will then fork the blockchain.
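The height-gated activation described above can be sketched as follows. This is a minimal illustration, not code from any real client: `UPGRADE_HEIGHT`, the two rule functions and the one-unit fee are all hypothetical, chosen only to show why a node that has not upgraded diverges from the activation height onwards.

```python
# Hypothetical activation height; a real release would hardcode its own value.
UPGRADE_HEIGHT = 1_000


def apply_rules_v1(balance: int, amount: int) -> int:
    """Old code path: the amount is credited in full."""
    return balance + amount


def apply_rules_v2(balance: int, amount: int) -> int:
    """New code path: a 1-unit fee is deducted (illustrative change only)."""
    return balance + amount - 1


def process(height: int, balance: int, amount: int, upgraded: bool) -> int:
    """Upgraded nodes switch code path at UPGRADE_HEIGHT; other nodes never do."""
    if upgraded and height >= UPGRADE_HEIGHT:
        return apply_rules_v2(balance, amount)
    return apply_rules_v1(balance, amount)


# Before the activation height both kinds of node compute the same state.
same_before = process(999, 10, 5, upgraded=True) == process(999, 10, 5, upgraded=False)
# From the activation height on, the results differ: the chains fork.
forked_after = process(1_000, 10, 5, upgraded=True) != process(1_000, 10, 5, upgraded=False)
```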

Cosmos uses a small program called cosmovisor. Cosmovisor listens for proposals and, when an upgrade proposal is accepted, prepares the binary for the new release and handles the upgrade (see the cosmovisor docs).

The idea in this spec is to draw inspiration from the design of cosmovisor and build vega-visor to manage protocol upgrades.

Prior example of upgrading the nodes asynchronously

Back in December 2021, Vega performed an LNL restore. Unfortunately, a bug in the code prevented the dispatch of the network parameters after the restore. This left the network in a semi-invalid state, in which the network parameters were those from the genesis block instead of those from the checkpoint of the previous network.

The solution employed at the time was to:

  • implement a patch fix;
  • keep the fix behind a guard until a given time, after which the network parameters would be dispatched;
  • distribute the code to the validators so they could test it;
  • then decide on an actual date, giving the validators a week to update their nodes asynchronously.

This upgrade went smoothly, without any incident. It was only possible because the fix was located in a single place and did not impact much of the state (for example, no changes to the protobuf files were required).

Protocol upgrades mechanism

One of the main challenges with upgrading the network both asynchronously and without a restart or downtime is the management of the state, and of possible incompatible changes to the state between two versions of the protocol.

The following describes the general workflow of upgrades:

  1. Vega continuously makes new releases available.
  2. Validators can suggest an upgrade to the network, with a given release tag and a block height at which the upgrade is to take place.
  3. If a proposal gets enough votes (validators.vote.required of the validators, counted per validator and not by stake), then at the given height vega will automatically take a snapshot and stop processing further blocks until it is restarted by vega-visor, the vega process manager.
  4. The validators manually download and build/prepare the new binaries.
  5. When restarted, vega will load from the last snapshot and start processing blocks with the new version.
  6. If the required majority isn't reached and the block height of the proposal has passed, the proposal is rejected.
  7. Proposals result in an event emitted from the core indicating the proposed upgrade tag, the proposed block height for the upgrade, and the validators supporting it.
  8. Only tendermint validators can propose an upgrade (i.e. ersatz validators cannot).
  9. The process manager polls vega to ask whether an upgrade is expected; when one is expected, it asks whether vega is ready to be stopped.
  10. Vega is ready to be stopped only once it has completed taking the snapshot and the state from the upgrade block has been committed.
  11. When vega core reports (via the admin RPC) that it is ready to restart, vega-visor stops vega and restarts it from the last snapshot taken.
  12. The network resumes with the new software version.
  13. Active proposals should be trackable via data-node.
  14. The process manager can manage validator nodes, non-validator nodes and data nodes. However, only tendermint validators can propose an upgrade.

NB: by the block height of the upgrade, the validators must have downloaded and built the binaries and placed them in the location required by vega-visor.
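The polling interaction between the process manager and vega (steps 9 to 11) can be sketched as below. This is only an illustration under stated assumptions: `FakeVegaAdmin` stands in for the admin RPC, and the names `ready_to_stop`, `advance_block` and `Visor.poll` are hypothetical, not the real vega-visor or vega API.

```python
from dataclasses import dataclass, field


@dataclass
class FakeVegaAdmin:
    """Stand-in for a vega node's admin RPC (hypothetical interface)."""
    upgrade_height: int
    height: int = 0
    snapshot_done: bool = False

    def advance_block(self) -> None:
        # The node stops processing blocks once the upgrade height is reached.
        if self.height < self.upgrade_height:
            self.height += 1
            if self.height == self.upgrade_height:
                # Snapshot taken and upgrade-block state committed.
                self.snapshot_done = True

    def ready_to_stop(self) -> bool:
        return self.snapshot_done


@dataclass
class Visor:
    """Stand-in for the process manager."""
    running_version: str
    log: list = field(default_factory=list)

    def poll(self, node: FakeVegaAdmin, new_version: str) -> bool:
        """One polling iteration; returns True once the node was restarted."""
        if not node.ready_to_stop():
            return False
        self.log.append(f"stop {self.running_version}")
        self.running_version = new_version
        self.log.append(f"start {new_version} from snapshot at height {node.height}")
        return True


node = FakeVegaAdmin(upgrade_height=3)
visor = Visor(running_version="v1")
restarted = False
while not restarted:
    restarted = visor.poll(node, "v2")
    if not restarted:
        node.advance_block()
```

The loop only restarts the node after the snapshot flag is set, which mirrors the requirement that vega is not stopped before the upgrade-block state has been committed.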

Framework / data structures

As protobuf:

message ProtocolUpgradeProposal {
  // The block height at which to perform the upgrade
  uint64 upgrade_block_height = 1;
  // The release tag for the vega binary
  string vega_release_tag = 2;
  // The release tag for the data-node binary
  string data_node_release_tag = 3;
}

enum ProtocolUpgradeProposalStatus {
  PROTOCOL_UPGRADE_PROPOSAL_STATUS_UNSPECIFIED = 0;
  // The proposal is pending
  PROTOCOL_UPGRADE_PROPOSAL_STATUS_PENDING = 1;
  // The proposal is approved
  PROTOCOL_UPGRADE_PROPOSAL_STATUS_APPROVED = 2;
  // The proposal is rejected
  PROTOCOL_UPGRADE_PROPOSAL_STATUS_REJECTED = 3;
}

message ProtocolUpgradeEvent {
  // The block height at which to perform the upgrade
  uint64 upgrade_block_height = 1;
  // The release tag for the vega binary
  string vega_release_tag = 2;
  // The release tag for the data-node binary
  string data_node_release_tag = 3;
  // Tendermint validators that have agreed to the upgrade
  repeated string approvers = 4;
  // The status of the proposal
  ProtocolUpgradeProposalStatus status = 5;
}
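To show how the proposal fields carry over into the emitted event, here is a small sketch that mirrors the messages above as plain Python types. It is not generated protobuf code; the class names `Proposal`, `Event` and `Status`, the helper `to_event` and the example tag are hypothetical and used only for illustration.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Status(IntEnum):
    """Mirrors ProtocolUpgradeProposalStatus."""
    UNSPECIFIED = 0
    PENDING = 1
    APPROVED = 2
    REJECTED = 3


@dataclass(frozen=True)
class Proposal:
    """Mirrors ProtocolUpgradeProposal."""
    upgrade_block_height: int
    vega_release_tag: str
    data_node_release_tag: str


@dataclass
class Event:
    """Mirrors ProtocolUpgradeEvent."""
    upgrade_block_height: int
    vega_release_tag: str
    data_node_release_tag: str
    approvers: list = field(default_factory=list)
    status: Status = Status.UNSPECIFIED


def to_event(proposal: Proposal, approvers: list, status: Status) -> Event:
    """Copy the proposal fields and attach the approvers and status."""
    return Event(
        upgrade_block_height=proposal.upgrade_block_height,
        vega_release_tag=proposal.vega_release_tag,
        data_node_release_tag=proposal.data_node_release_tag,
        approvers=sorted(approvers),
        status=status,
    )


proposal = Proposal(1000, "v0.0.2", "v0.0.2")  # example tag is made up
event = to_event(proposal, ["validator-b", "validator-a"], Status.PENDING)
```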

Acceptance criteria

Invalid proposals - Rejections

  • A network with 5 validators
  • (0075-PLUP-001) Validator proposes a protocol upgrade to an invalid tag - should result in an error
  • (0075-PLUP-002) Validator proposes a protocol upgrade on a block height preceding the current block - should result in an error
  • (0075-PLUP-003) Propose and enact a version downgrade
  • (0075-PLUP-004) Non-validator attempts to propose upgrade
  • (0075-PLUP-005) Ersatz validator (standby validator) attempts to propose upgrade
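The rejection cases above could be checked along these lines. This is a sketch under assumptions: the release tag format (`vMAJOR.MINOR.PATCH`) is a guess, since the document does not define what makes a tag invalid, and `validate_proposal` and its error strings are hypothetical names. Downgrade handling (0075-PLUP-003) is left out because the expected outcome is not stated.

```python
import re

# Assumed tag format; the spec does not define what a valid tag looks like.
TAG_PATTERN = re.compile(r"^v\d+\.\d+\.\d+$")


def validate_proposal(proposer, tag, upgrade_height, current_height, consensus_validators):
    """Return None if the proposal is acceptable, otherwise an error string."""
    # Non-validators and ersatz validators are not in the consensus set.
    if proposer not in consensus_validators:
        return "proposer is not a consensus validator"
    if not TAG_PATTERN.match(tag):
        return "invalid release tag"
    # Only heights preceding the current block are rejected here, as in
    # 0075-PLUP-002; treatment of the current height itself is not specified.
    if upgrade_height < current_height:
        return "upgrade block height precedes the current block"
    return None


validators = {"val-1", "val-2", "val-3", "val-4", "val-5"}
ok = validate_proposal("val-1", "v0.0.2", 2000, 1500, validators)
bad_tag = validate_proposal("val-1", "not-a-tag", 2000, 1500, validators)
bad_height = validate_proposal("val-1", "v0.0.2", 1000, 1500, validators)
bad_proposer = validate_proposal("ersatz-1", "v0.0.2", 2000, 1500, validators)
```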

Block height validation

Proposal will not be accepted as valid if validator:

VISOR

  • (0075-PLUP-010) Visor can be seen to automatically download the proposed tagged version once it is available at the source location and the file meets the defined format criteria
  • (0075-PLUP-011) Visor automatically upgrades validators to the proposed version if the required majority has been reached

Epochs

  • (0075-COSMICELEVATOR-012) Propose an upgrade block which ought to be the end of an epoch. After the upgrade takes place, confirm rewards are distributed, any pending delegations take effect, and validator joining/leaving takes effect.
  • (0075-PLUP-013) Propose an upgrade block which should result in a network running the new code version in the same epoch.
  • (0075-PLUP-014) Ensure end of epoch processes still run after restore, e.g. reward calculation and distribution

Required Majority

For the purposes of protocol upgrade, each validator that participates in consensus has one vote. The required majority is set by the validators.vote.required network parameter.

  • (0075-PLUP-015) Counting proposal votes to check if required majority has been reached occurs when any proposed target block has been reached
  • (0075-PLUP-016) Only proposals from validators participating in consensus are counted when any proposed target block has been reached.
  • (0075-PLUP-017) Events are emitted for all proposals which fail to reach required majority when target block is reached
  • (0075-PLUP-018) When the required majority is reached, then during the upgrade those validators which didn't propose will stop producing blocks
  • (0075-PLUP-019) Proposals for multiple versions at same block height will be rejected if majority has not been reached, network continues with the current running version
  • (0075-PLUP-020) Propose with a validator which is moved to Ersatz by the time the upgrade is enacted. If there are 5 validators, 3 vote yes, 2 vote no: One of the yes voters is kicked in favour of a new one, leaving the vote at 2-2 so the upgrade should not happen as counting votes happens at block height only
  • (0075-PLUP-036) Changing validators.vote.required network parameter to a value above two thirds is respected.
  • (0075-PLUP-037) The value of validators.vote.required is checked at upgrade block, i.e: vote on a proposal with all validators, then change the validators.vote.required network parameter before upgrade block, to a higher value, which would cause the upgrade to be rejected. Upgrade fails.
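A minimal sketch of the vote count at the target block follows. It assumes one vote per consensus validator, as stated above, and that both the validator set and validators.vote.required are read at the target block. Whether the comparison against the parameter is strict or inclusive is not stated in this document, so the `>=` below is an assumption, as are the function and validator names.

```python
from fractions import Fraction


def upgrade_approved(approvers: set, consensus_validators: set, required: Fraction) -> bool:
    """Tally at the target block: only current consensus validators count."""
    if not consensus_validators:
        return False
    votes = len(approvers & consensus_validators)
    # Assumption: meeting the threshold exactly is enough.
    return Fraction(votes, len(consensus_validators)) >= required


validators = {"a", "b", "c", "d", "e"}

# 4 of 5 validators approve with a two-thirds requirement: approved.
four_of_five = upgrade_approved({"a", "b", "c", "d"}, validators, Fraction(2, 3))

# Same votes, but the parameter was raised to 0.9 before the target block.
after_raise = upgrade_approved({"a", "b", "c", "d"}, validators, Fraction(9, 10))

# A voter demoted to ersatz before the target block no longer counts:
# "d" is replaced by "f", so only 3 of the 5 current validators approve.
after_demotion = upgrade_approved(
    {"a", "b", "c", "d"}, {"a", "b", "c", "e", "f"}, Fraction(2, 3)
)
```

Using `Fraction` avoids floating-point rounding when a tally lands exactly on the threshold.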

Multiple proposals (0075-PLUP-021)

  • If multiple proposals are submitted from a validator before the block heights are reached then only the last proposal is considered
  • Excessive numbers of proposals from a single validator within an epoch should be detected and rejected - (Future requirement)
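The "only the last proposal is considered" rule can be sketched by keying proposals on the validator, so a later submission replaces an earlier one. The tuple layout and the name `latest_proposals` are illustrative assumptions.

```python
def latest_proposals(submissions):
    """Keep only the most recent (tag, height) submitted by each validator.

    `submissions` is an iterable of (validator, tag, height) in arrival order.
    """
    latest = {}
    for validator, tag, height in submissions:
        latest[validator] = (tag, height)  # a later submission overwrites
    return latest


submissions = [
    ("val-1", "v0.0.2", 1000),
    ("val-2", "v0.0.2", 1000),
    ("val-1", "v0.0.3", 1200),  # replaces val-1's earlier proposal
]
current = latest_proposals(submissions)
```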

Snapshots

  • (0075-PLUP-023) Once a validator becomes a consensus-participating validator, it should immediately be allowed to propose an upgrade and be included in the overall total count
  • (0075-PLUP-024) Ensure that the required majority is not met when enough validators join between the validator proposals and the target block, i.e. in a network with 5 validators where the required majority is two thirds, 4 vote to upgrade, then 2 more validators join before the upgrade block and do not vote. The upgrade does not take place.
  • (0075-PLUP-025) For a node starting from a snapshot which has a proposal at a given block, ensure that during replay the new version is loaded when the block height is reached, and that after loading an upgrade takes place at the target block.
  • (0075-PLUP-045) Arrange a network where n nodes are required for consensus, and at least n+1 nodes in the network. Schedule a protocol upgrade where n-1 nodes automatically start on the new version after upgrade, i.e: No consensus after upgrade. Start the (n+1)th node and consensus is achieved. For the nth node, clear vega and tm, and restart the node using state-sync at the upgrade block height. All nodes produce blocks.
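As a worked check of the 0075-PLUP-024 scenario, with one vote per consensus validator and a required majority of two thirds, the approval ratio before and after the two extra validators join is:

```latex
\[
\text{before the joins: } \frac{4}{5} = 0.8 \ge \frac{2}{3},
\qquad
\text{at the upgrade block: } \frac{4}{5+2} = \frac{4}{7} \approx 0.571 < \frac{2}{3}.
\]
```

Because votes are counted at the target block, the second ratio is the one that applies and the upgrade does not take place.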

LNL Checkpoints

  • (0075-PLUP-026) Validator proposals should not be stored in the checkpoints and restored into the network
  • (0075-PLUP-027) After a checkpoint restore, an upgrade will not occur until new proposals are made and the block height is reached

API

  • (0075-PLUP-028) A data-node API should be available to provide information on the upcoming confirmed proposal, including total proposals, block details and versions

Successful upgrade (0075-PLUP-029)

  • A new release is made available, and is successfully deployed
  • Set up a network with 5 validators running version x
  • Have 4 validators submit a request to upgrade to a release >x at block height 1000
  • At the end of block height 1000 a snapshot is taken and vega is stopped by vega-visor
  • All nodes start from the snapshot of block 1000 and the network resumes with version >x

Failing consensus

  • (0075-PLUP-030) Upgrade takes place at block N. Restart with a number of validators whose voting power is <= two thirds. Restart one more validator whose voting power would take the total voting power >= two thirds, with an incorrect version. Consensus is not achieved. Now restart that validator with the correct version. Consensus is achieved.
  • (0075-PLUP-031) 5 validator network. Upgrade takes place at block N. Start 3 validators immediately. Allow several seconds to pass: no blocks are produced, as 3 validators do not have enough weight (70% weight is needed to produce blocks). Start the two remaining validators. All validators continue to work.
  • (0075-PLUP-032) Upgrade takes place, but insufficient validators are restored for 1, 5 and 10 minutes. Validators which are restored immediately wait patiently for consensus to be achieved, after which block production continues.
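The stalls in these scenarios follow from the consensus threshold: Tendermint only commits a block when validators holding strictly more than two thirds of the total voting power have signed. The sketch below assumes equal voting power per validator, which the scenarios above do not state, and `has_consensus` is a hypothetical helper name.

```python
from fractions import Fraction


def has_consensus(online_power: int, total_power: int) -> bool:
    """True if online validators hold strictly more than 2/3 of voting power."""
    return Fraction(online_power, total_power) > Fraction(2, 3)


# 5 validators with equal power (assumption): 3 online is 60%, not enough.
three_of_five = has_consensus(3, 5)
# A fourth validator on the correct version brings the total to 80%.
four_of_five = has_consensus(4, 5)
# Exactly two thirds is still insufficient because the bound is strict.
exactly_two_thirds = has_consensus(2, 3)
```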

Mainnet

  • (0075-COSMICELEVATOR-033) Check that we can protocol upgrade a system which has been restarted from mainnet snapshots with current mainnet version, to next intended release version. Check all data available pre-upgrade is still available.
  • (0075-PLUP-046) Check that we can protocol upgrade a system which has been restarted from latest mainnet checkpoint with current mainnet version, to next intended release version. Check all data available pre-upgrade is still available.

Overwriting transactions

  • (0075-PLUP-034) A proposal made to upgrade to the currently running version will retract previous proposals. i.e: System is running version V. Make a proposal for block height H and version V + 1 and vote with all validators. Before block height H, submit a new proposal for version V and any future block height, with all validators. Upgrade proposals are retracted, and upgrade does not take place.
  • (0075-PLUP-035) Rejected proposals do not overwrite previous valid upgrade proposals.
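One way to picture the two rules above is a per-validator table of active proposals, where a valid proposal for the currently running version removes that validator's entry and a rejected proposal leaves the table untouched. That retraction is tracked per validator is an assumption drawn from the 0075-PLUP-034 example, and `apply_proposal` and its parameters are hypothetical.

```python
def apply_proposal(active, validator, tag, height, running_version, valid=True):
    """Return the new table of active proposals after one submission.

    `active` maps validator -> (tag, height) and is not mutated.
    """
    if not valid:
        # A rejected proposal does not overwrite a previous valid one.
        return dict(active)
    updated = dict(active)
    if tag == running_version:
        # Proposing the running version retracts the validator's proposal.
        updated.pop(validator, None)
    else:
        updated[validator] = (tag, height)
    return updated


running = "v0.0.1"
active = {}
active = apply_proposal(active, "val-1", "v0.0.2", 1000, running)
after_rejected = apply_proposal(active, "val-1", "bad-tag", 1000, running, valid=False)
after_retract = apply_proposal(active, "val-1", running, 1500, running)
```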

Data is preserved

  • (0075-PLUP-038) An open market with active orders which is available prior to upgrade, is still available, active, and can be traded on, post-upgrade.
  • (0075-PLUP-039) Stake available prior to upgrade is still available post upgrade.
  • (0075-PLUP-040) Active and pending delegations made prior to upgrade are still active post upgrade.
  • (0075-PLUP-041) A market due to expire during an upgrade will terminate and/or settle post-upgrade.
  • (0075-PLUP-042) Trader balances available prior to upgrade are still available post upgrade.
  • (0075-PLUP-043) Pending and active assets available prior to upgrade are still available post upgrade.
  • (0075-PLUP-044) Network parameter, market and asset proposals can span a protocol upgrade.

Ethereum events during outage

  • (0075-PLUP-051) Deposit events that take place during protocol upgrade are registered by the network once the upgrade is complete.
    1. Schedule an upgrade on a network that is not using visor.
    2. When the nodes stop processing blocks for the upgrade, shut down the nodes.
    3. Deposit tokens via the ERC20 bridge.
    4. Start the network using the upgrade binary.
    5. Balance reported as added in the appropriate account(s).
  • (0075-PLUP-052) Staking events that take place during protocol upgrade are registered by the network once the upgrade is complete.
    1. Ensure parties A & B have some stake, which is delegated to a/some node(s).
    2. Schedule an upgrade on a network that is not using visor.
    3. When the nodes stop processing blocks for the upgrade, shut down the nodes.
    4. Add stake to party A.
    5. Remove some (not all) stake from party B.
    6. Start the network using the upgrade binary.
    7. Additional stake reported for party A and auto-delegated. Stake removed for party B and delegation reduced.
  • (0075-PLUP-047) Multisig events that take place during protocol upgrade are registered by the network once the upgrade is complete.
    1. Arrange a network where one validator is promoted to replace another validator. Collect signatures to update the multisig contract, but do not yet update the multisig.
    2. Schedule an upgrade on the network (should not be using visor).
    3. When the nodes stop processing blocks for the upgrade, shut down the nodes.
    4. Update the multisig contract to reflect the correct validators.
    5. Start the network using the upgrade binary.
    6. At the end of the current epoch, rewards are paid out.
  • (0075-PLUP-048) Multisig events that take place during protocol upgrade are registered by the network once the upgrade is complete.
    1. Arrange a network where one validator is promoted to replace another validator. Collect signatures to update the multisig contract, but do not yet update the multisig.
    2. Schedule an upgrade on the network (should not be using visor).
    3. When the nodes stop processing blocks for the upgrade, shut down the nodes.
    4. Do not update the multisig contract to reflect the correct validators.
    5. Start the network using the upgrade binary.
    6. At the end of the current epoch, rewards are not paid out.
    7. Update the multisig contract to reflect the correct validators.
    8. At the end of the current epoch, rewards are paid out.

Transactions during upgrade

  • (0075-PLUP-049) Network handles filled mempool during upgrade.
    1. Schedule a protocol upgrade in a network with no nodes using visor.
    2. When the nodes stop processing blocks for the upgrade, shut down the nodes.
    3. Start one node on the new binary.
    4. Send enough transactions to the node to fill the tendermint mempool. (Expect sane rejection once mempool is full)
    5. Start the other nodes on the correct upgrade binary.
    6. Expect all transactions that reached the mempool without being rejected to be correctly processed over several blocks.
  • (0075-PLUP-050) Transactions can be made in block immediately before protocol upgrade.
    1. Schedule a protocol upgrade in a network with no nodes using visor.
    2. Continuously send transactions as the upgrade block approaches.
    3. When the nodes stop processing blocks for the upgrade, make a note of all transactions which reached blocks already (transactions which did not are expected to be discarded).
    4. Shut down the nodes.
    5. Start all nodes on the new binary.
    6. Expect all transactions that reached blocks prior to upgrade to have taken effect. None of the other transactions did.