This repository has been archived by the owner on Dec 26, 2023. It is now read-only.

Updates Part I: node auto-update #32

Open
lrettig opened this issue Nov 11, 2020 · 9 comments

Comments

@lrettig
Member

lrettig commented Nov 11, 2020

Requirements

  • Don't cause the network to fail, or security assumptions to be violated, after a hard fork upgrade
  • Don't take agency away from node operators or make minority forks unreasonably burdensome
  • Allow emergency fixes to be deployed to as many nodes as possible as quickly as possible, while giving node operators more time to respond to non-emergency fixes
  • Minimize node downtime around an update
  • Give clarity and transparency into which nodes are running which version of the software
  • Simplest UX and implementation possible subject to these constraints: make it as quick and easy as possible for node operators to stay up to date.

Non-requirements

  • Easy for non-technical users (this should be handled at the application layer, which is outside the scope of this proposal, not at the node layer)

Design

  • Opt-in --auto-update flag. When enabled, the node tracks a Spacemesh team-managed beacon that announces when an update is available, automatically downloads and installs the update, and restarts the node.
    • Optional but desirable: this flag requires a tracker URL and/or a pub key so that the user can choose to follow another update "channel."
    • When an update is detected, the app automatically downloads it, using HTTP or over the P2P network if it’s available from a peer.
    • The update includes metadata that indicates when the change should be activated (layer number or clock time) and whether it's an emergency fix. If there is still time before it's set to go live, a countdown warning is printed, giving the node operator the opportunity to cancel the update (e.g., simply by restarting the node). This period should not be less than 24 hours, and should be longer for non-emergency updates than for emergency ones (a rough sketch follows this list).
    • A warning message is logged when the node first starts, to the effect of, “You’ve enabled auto-updates. This means that updates released by the Spacemesh core team will automatically be downloaded and installed. Please read for more information on how to participate in network governance.” This message is printed every time the node updates itself and restarts.
    • Note: The application layer (smapp) is outside the scope of this proposal, and this flag can be enabled or disabled by the application according to its own criteria.
    • If the update is old for some reason, i.e., its hardcoded activation layer number has already passed, it’s discarded, and an error message is printed.
  • Add a --testnet flag that sets the right network ID and network/consensus params for the testnet, and turns auto-updates on by default. They can be turned off with --auto-update=false. This information will be printed in a warning message when the node first starts.
  • The node adds a small piece of data, containing a hash of the client and/or protocol version it's running, to each ATX or block it publishes.
    • To discuss: versioning for client, core protocol, P2P protocol, and how to compress/convey all of this succinctly and clearly (see Version information metadata #75)
  • Optional: hardcode an end-of-life (EOL) for each node release. This would be some number of months (measured in layers) after the release. On mainnet, after this point the node would cease to function and would print an error if started. Warnings would be printed regularly leading up to the EOL point. Note that the node software could still be used on another network, such as a private network.
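
A rough Go sketch of the update metadata and grace-period handling described above (all type, field, and function names are hypothetical, not actual go-spacemesh code; the grace-period values follow the discussion below and are not finalized):

```go
package autoupdate

import (
	"errors"
	"fmt"
	"time"
)

// UpdateManifest is a hypothetical shape for the metadata announced by the
// update beacon; field names are illustrative only.
type UpdateManifest struct {
	Version         string    // e.g. "v0.2.1"
	BinaryURL       string    // HTTP(S) location of the signed release
	ActivationLayer uint64    // layer at which the change takes effect
	Emergency       bool      // emergency fixes get the shorter grace period
	PublishedAt     time.Time // when the update was announced
}

// GracePeriod returns the minimum time to wait before installing, giving the
// operator a chance to veto (e.g. by restarting the node). The 24h floor and
// 72h non-emergency figure follow the discussion below; neither is final.
func (m UpdateManifest) GracePeriod() time.Duration {
	if m.Emergency {
		return 24 * time.Hour
	}
	return 72 * time.Hour
}

// Validate discards stale updates whose activation layer has already passed.
func (m UpdateManifest) Validate(currentLayer uint64) error {
	if m.ActivationLayer <= currentLayer {
		return fmt.Errorf("update %s is stale: activation layer %d already passed (current layer %d)",
			m.Version, m.ActivationLayer, currentLayer)
	}
	return nil
}

// InstallAfterGrace waits out the grace period and then hands off to an
// installer callback. A real implementation would persist the pending update
// so that restarting the node cancels it, as proposed above.
func InstallAfterGrace(m UpdateManifest, currentLayer uint64, install func(UpdateManifest) error) error {
	if err := m.Validate(currentLayer); err != nil {
		return err
	}
	if install == nil {
		return errors.New("no installer configured")
	}
	if wait := time.Until(m.PublishedAt.Add(m.GracePeriod())); wait > 0 {
		fmt.Printf("update %s will be installed in %s; restart the node to cancel\n", m.Version, wait)
		time.Sleep(wait)
	}
	return install(m)
}
```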

Tasks

  • Research, design, and develop auto-update mechanism
  • Research, design, and develop update beacon
  • Finalize testnet params and add --testnet flag
  • Finalize the design of the update/version “signature” that nodes add to ATXs and/or blocks. Consider pros and cons of adding to ATXs versus blocks.
  • Finalize criteria for an "emergency update"
  • Finalize copy, order, and timing of warning messages, and write a longer doc explaining this
  • Add tests for upgrade mechanism
@tal-m

tal-m commented Nov 12, 2020

I would separate this into two SMIPs, since there are actually two independent features here. The first is dealing with automatic node updates, and the second with protocol updates.

Node Update

The node auto-update procedure is agnostic to the update contents --- it consists of what you called "Phase I", but doesn't involve beacons or anything protocol-related. This is the mechanism that will also be used for emergency updates. The node auto-update should have a minimal grace period (during which the node operator can decide to veto an update) even for emergency updates. The reason is that veto power is intended, among other things, to mitigate an attack by an adversary with access to the update signing key. In that case, the adversary could always claim the update is an emergency update if doing so sidesteps the grace period.

I think it makes sense to have a much shorter grace period during the initial phase of mainnet (when emergency updates are much more likely), but in version 1.0 it should be long enough to allow for human response to an attack. (After 1.0, if there's a need for a faster emergency update, we'll have to rely on getting the word out to enough node operators who will override the grace period manually).

Protocol Update

The protocol update procedure describes the rules for when (and if) to switch to a new protocol version, based on the consensus about intent to update. This mechanism doesn't really care how the node is updated; it's relevant even if there are no automatic node updates at all. Essentially, the code for executing an updated protocol will always include the code for executing the current version of the protocol, and a decision mechanism that determines when to switch versions. (This would be the case even if the decision mechanism does not check on-mesh consensus --- e.g., "always switch to the new protocol version at layer 5000").
In terms of content, I think your "Phase II" sounds good.
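
For illustration, a minimal Go sketch of that structure (all names are hypothetical, not actual go-spacemesh code): the binary carries both rule sets plus a decision mechanism that picks which one applies at a given layer.

```go
package protocol

// Rules stands in for whatever interface the state-transition / validation
// logic exposes; the name is illustrative.
type Rules interface {
	ApplyLayer(layer uint64) error
}

// versionedRules bundles the current rule set, the upgraded rule set, and the
// decision mechanism that determines when to switch. Here the decision rule is
// the simplest possible one: "switch at a hardcoded activation layer".
type versionedRules struct {
	current, next   Rules
	activationLayer uint64
}

// RulesFor picks which protocol rules apply at a given layer. A richer
// decision mechanism could instead consult on-mesh consensus about intent to
// upgrade, without changing this structure.
func (v versionedRules) RulesFor(layer uint64) Rules {
	if layer >= v.activationLayer {
		return v.next
	}
	return v.current
}
```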

@lrettig
Member Author

lrettig commented Nov 12, 2020

Hi @tal-m, thanks for the feedback.

I would separate this into two SMIPs, since there are actually two independent features here. The first is dealing with automatic node updates, and the second with protocol updates.

I generally agree with this. I think we can and probably should separate them as you propose. As written they are not completely independent since, e.g., I included the "protocol version signature" message in "Phase I", which is the vote that "Phase II" relies on.

Also agree with your points on grace period and veto power.

@lrettig changed the title from "Node updates" to "Updates Part I: node auto-update" on Nov 18, 2020
@lrettig
Member Author

lrettig commented Nov 18, 2020

@tal-m @avive @iddo333 I broke this out into three separate proposals: this one, for the node, #33, for the protocol (per Tal's suggestion), and #34, for the app.

@y0sher

y0sher commented Nov 24, 2020

Looks good generally; some comments:

  • The --testnet flag could probably make things easier for some users, but essentially it's a hard-coded config file, right?
  • About versioning for the client, core protocol, and P2P protocol: in the past I thought this was necessary, but I don't see a reason not to version the whole thing as one piece. We do need to think of a scheme to support the protocol across clients.
  • Also, you mentioned something about downloading the update from a peer. This feels weird to me when we're informed about the new version by a centralized source. Alternatively, we could write a decentralized protocol to negotiate versions with all your peers, decide which is the latest valid version, and then download it from peers.
  • If we assume mostly tech-savvy users will run the go-spacemesh client without smapp, maybe we can skip auto-update and count on them to update manually, opting in by doing so? Or, again, we could enable updating via the API or something; then we'd cover both smapp and tech-savvy users, who could easily opt in without needing to download and restart by themselves.

@lrettig
Member Author

lrettig commented Nov 24, 2020

The --testnet flag could probably make things easier for some users, but essentially it's a hard-coded config file, right?

I thought about it more as a shortcut for passing a bunch of other CLI flags (e.g., --network 123 --auto-updates ... ), but I suppose you could think of it that way too. We'd have to work out precedence between it and the config file.

About versioning for the client, core protocol, and P2P protocol: in the past I thought this was necessary, but I don't see a reason not to version the whole thing as one piece.

Well, they're just sort of fundamentally different things. E.g., no P2P protocol negotiation should be necessary between two nodes if one of them installs a new version of go-spacemesh that doesn't touch the P2P code.

Also, you mentioned something about downloading the update from a peer. This feels weird to me when we're informed about the new version by a centralized source. Alternatively, we could write a decentralized protocol to negotiate versions with all your peers, decide which is the latest valid version, and then download it from peers.

This is tricky, I think we should not try to build this for 0.2. Downloading over HTTP is much easier.

If we assume mostly tech-savvy users will run the go-spacemesh client without smapp, maybe we can skip auto-update and count on them to update manually, opting in by doing so?

This isn't a bad idea. Especially if we expect that most users will be running smapp early on, and that smapp will be able to handle auto-updating go-spacemesh.

@noamnelke
Member

Most of these comments are minor, but I think the last point is very important.

Adding a --testnet flag

I don't think this is needed at all. There should be a distinct config file for each network anyway and this config file can enable auto-updates for the testnet. Opting out would be done by overriding the setting from the config file using a command line arg (or simply changing the config file).

Versioning

I don't think that every little thing should have its own version, but specifically the p2p should have a version which is negotiated as part of the handshake. Clients should each send the highest version they support - if the other party's version is too low (we dropped support for that version) the connection is terminated, and otherwise the lower value between the two should be used. This allows gradually updating the p2p protocol in a heterogeneous network, esp. for smaller changes, like encoding optimizations.
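
A minimal Go sketch of that handshake rule (a hypothetical function, not the actual go-spacemesh p2p code):

```go
package p2p

import "fmt"

// negotiateP2PVersion sketches the rule described above: each side sends the
// highest version it supports; if the peer's version is below our minimum
// supported version the connection is terminated, otherwise both sides use
// the lower of the two advertised values.
func negotiateP2PVersion(ours, theirs, minSupported uint32) (uint32, error) {
	if theirs < minSupported {
		return 0, fmt.Errorf("peer p2p version %d is below minimum supported version %d", theirs, minSupported)
	}
	if theirs < ours {
		return theirs, nil
	}
	return ours, nil
}
```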

I actually don't think we should version the protocol at all. We should have a bit array in ATXs where miners can indicate support for specific upgrades to the protocol. I'm totally fine with making this a single byte since I can't imagine having more than 8 proposals up concurrently, but if someone thinks we should support 16 concurrent proposals then fine. These bits will always be zero, unless a proposal is up for a vote in which case we select a bit and use that for voting.
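
For illustration, a tiny Go sketch of such a signal byte (bit assignments and names are made up): each concurrently active proposal gets one bit, and a set bit in an ATX indicates support.

```go
package signal

// Hypothetical bit assignments for concurrently active proposals; all bits
// stay zero when no proposal is up for a vote.
const (
	ProposalA byte = 1 << iota // bit 0: first active proposal
	ProposalB                  // bit 1: a second, overlapping proposal
)

// SupportsProposal reports whether the signal byte carried in an ATX has the
// bit for a given proposal set, i.e. whether the miner's node supports it.
func SupportsProposal(signals, proposalBit byte) bool {
	return signals&proposalBit != 0
}
```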

This enables a use-case where we have a proposal up for voting that would take effect in 2 months, and a month after voting started we want to propose another upgrade that would take effect a month later.

The problem with rolling both proposals into one number (or other value) would be that a miner that supports (as in their node has code that supports it) the first, but not the second proposal, wouldn't know about the second proposal and wouldn't know that miners who vote for it also vote for the first - making it impossible for that miner to correctly tell if they should activate the first proposal or not.

End of life

I'm against this proposal. A node should be able to operate indefinitely, unless something happens that prevents it.

While I don't like the idea of ever retiring a node proactively, I can live with what @tal-m suggested about nodes shutting down when they detect that the network accepted an upgrade that they don't have code support for.

But just dying of old age - what's the benefit in that?

Auto-update wait period

As discussed on the call, the minimal waiting period should be hard coded in the node and not shorter than 24 hours, imo.

I don't believe we should have a lower minimum for the testnet. The stakes are much lower in the testnet - no real money is at stake, only some hassle and perhaps reputation. So the tools available to us in mainnet should suffice.

Downloading code via P2P

I don't think this has any advantage. If we monitor updates via a URL we can also download via a URL.

If we feel that a URL is too centralized (I think that for the purpose of auto-updates it isn't) then let's start with announcing new versions over gossip instead of using a URL. Then we can include a bittorrent client library (like this one) and include trackers for the binaries of all supported platforms with the version published via gossip. This makes much more sense than implementing our own file transfer protocol. BUT I THINK WE SHOULDN'T DO EVEN THIS. URLs are great! CDNs for the binaries are awesome. Let's not fix what ain't broken.

Timing of upgrade

We should be careful with timing the actual upgrade of the node. If we push an update and exactly 24h later 2/3 of the nodes restart - this will kill the network.

At a minimum, miners should restart at a random time slot within a ~6 hour window, to give the first wave of upgraders a chance to sync and start mining again before starting another wave.

The predictability of the upgrade time is a serious security issue, IMO. During the upgrade window attackers know in advance that some nodes will miss their blocks and Hare messages and can take advantage. This is even more true when the upgrade is due to some disclosed security issue that's being fixed...

A more advanced (and safer) version is for miners to intelligently select the best time for them to upgrade, when they aren't eligible for any blocks or Hare participation (or at least are eligible for as few blocks and Hare votes as possible).
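
For illustration, a minimal Go sketch of the simpler random-window option above (the window size and function name are assumptions, not an implementation decision): each node draws a uniformly random restart offset within the spread window.

```go
package autoupdate

import (
	"crypto/rand"
	"math/big"
	"time"
)

// randomRestartDelay draws a uniformly random restart offset within the spread
// window (e.g. ~6h), so upgraded nodes don't all go down at once. The window
// size is whatever the release policy chooses; it must be greater than zero.
func randomRestartDelay(window time.Duration) (time.Duration, error) {
	n, err := rand.Int(rand.Reader, big.NewInt(int64(window)))
	if err != nil {
		return 0, err
	}
	return time.Duration(n.Int64()), nil
}
```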

@lrettig
Member Author

lrettig commented Nov 25, 2020

The problem with rolling both proposals into one number (or other value) would be that a miner that supports (as in their node has code that supports it) the first, but not the second proposal, wouldn't know about the second proposal and wouldn't know that miners who vote for it also vote for the first - making it impossible for that miner to correctly tell if they should activate the first proposal or not.

I'm having some trouble understanding this. I guess I don't think about protocol updates as discrete "proposals" that can be adopted or not adopted independent of one another. In many cases, one proposal will depend on another and there may be a complex web of interdependencies. That's why I think it's better to think of a particular instantiation of the protocol as monolithic, and give it a unique, meaningless ID, like a hash or something.

But just dying of old age - what's the benefit in that?

It makes upgrades much easier. We know that, by a known point in time, all nodes running a particular, old version will have reached their "end-of-support" date and shut down (unless the user modified the source code and recompiled). Zcash has been using this to great success, see:

the minimal waiting period should be hard coded in the node and not shorter than 24 hours, imo.

We can debate the exact right number but my gut tells me around 72 hours. 24 feels too short because someone could, e.g., be on a long flight (or, you know, a meditation retreat ;) for that long and "not get the memo."

I don't believe we should have a lower minimum for the testnet

I agree strongly with the case you made on the call, @noamnelke: we should strive to operate the testnet with identical parameters to mainnet wherever possible.

URLs are great! CDNs for the binaries are awesome. Let's not fix what ain't broken.

Agree. If you're explicitly trusting a particular developer to notify you of updates, and to provide you with signed updates (see #36), that's already "centralized." You can always choose to "track another updates channel" (to borrow Linux terminology).

Timing of upgrade

Very good point. We could maybe key this on one of the existing beacons to do it securely and in an unpredictable fashion. As long as the upgrade happens before the protocol activation layer height, which of course all nodes do need to agree on!

@brusherru
Member

I'm a bit confused by the large number of issues for things that are fairly coupled.
I'd prefer to make a decision about the whole strategy and user workflows first, then decompose it into separate SMIPs / issues and dive into details. Anyway...

I've posted some of my thoughts (mainly related to Smapp, but not only) here: #34 (comment). Please check it out.

Beyond that, I have some thoughts specifically about updating the Node itself.

Centralization

I think this is one of the most important things determining how we deliver updates. Since we have signed apps, we already have centralization: only we can introduce a new version. So in this case I don't see a reason to gossip about updates over p2p.
However, I think the idea of decentralizing this is a very good one: we would no longer have to worry about what happens to the network if something happens to us, or if someone blocks our domain, etc. But we would face a much harder set of problems: how to deliver updates, how to trust them, and how to avoid vulnerabilities.
But I don't think these difficulties are the highest priority for now, so we can use a centralized source of updates. I do think we should definitely have a CLI flag to set a custom trusted source of updates.

Gradual update of nodes

If the Nodes notify the network about which version they're running, I think we can make updates gradual by using the highest byte of sha256(NodeID + UpdateHash) to determine each node's "place in the queue".
For example, use the highest byte to determine how long to wait from the moment the update is available: 0 for "update immediately" and 255 for "update when almost everyone else has updated", or group it into ranges (e.g., 0-15 first in the queue, 16-31 second, and so on).
This means it could tell the Node not just "wait N hours before updating", but to check the network for some percentage of updated nodes. For example, if my node is second in the queue, it will wait until it sees some percentage (e.g., 10%) of the Nodes on the network have updated, then update and tell the network that it has updated too.
But there might be a vulnerability: an attacker might build his own node that doesn't tell the network it has updated. If there's a big group of such malicious nodes on the network, it would block all other nodes from updating.
Also, I'm not sure how this would work for non-backward-compatible updates.

End-of-support

@lrettig can you sum up the benefits of such a solution?
I hope that our network will grow and develop, but I can imagine that at some point in the future we may not have an actual update to the Node, yet we'll still need to release a new version.
I'd prefer to mark older nodes as outdated not just by elapsed time, but by the existence of newer versions (and probably by the share of versions in use on the network).
For example:

  • a patch-level semver release outdates the previous version within the next ~72 hours, since it is a "hotfix"
  • a minor-level release outdates previous versions once fewer than 10% of nodes on the network are still running a previous version
  • a major release outdates previous versions at a specific layer number

Grace period

I think this is a good idea. But:

  • What if we face a critical error on the network and we need to fix it asap?
  • Should we wait the same 24-72 hours as for other updates? Until then, will our network be down and the price of SMH fall for three days in a row?
  • If not, how do we determine that it is a hotfix?

As long as we're a centralized source of updates, we can just check the "patch" part of the semver and install such updates much faster than others: a 24-72 hour grace period for minor updates, and some kind of "wait until layer NNN000" rule (I mean the next layer number that ends in some zeroes) for major ones.
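
A rough Go sketch of that semver-based rule (the thresholds are only the ones suggested in this comment, not a finalized policy, and all names are hypothetical):

```go
package autoupdate

import "time"

// ReleaseKind mirrors which semver component changed in a release.
type ReleaseKind int

const (
	Patch ReleaseKind = iota // hotfix
	Minor
	Major
)

// graceFor returns how a node would schedule an auto-update under the rule
// suggested above: patches install after the minimum grace period, minor
// releases wait the longer window, and major releases are gated on an agreed
// activation layer instead of a timer. Values are illustrative only.
func graceFor(kind ReleaseKind) (wait time.Duration, waitForActivationLayer bool) {
	switch kind {
	case Patch:
		return 24 * time.Hour, false // the 24h floor discussed above
	case Minor:
		return 72 * time.Hour, false
	default: // Major
		return 0, true // wait until the agreed activation layer
	}
}
```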

Auto-update flag and changing the mind

First of all, since the recommended way is to have auto-updates turned on, I propose naming the flag --no-auto-updates, so that passing it turns them off.
Secondly, what if we're running the Node and then decide to switch auto-updating on or off? If we only have a CLI flag for it, we have to restart the Node, but it's desirable not to shut down a node at all. So maybe we need another way to handle it, for example via the API. But that raises other questions too; I wrote more about this in the comment mentioned above in #34.

@lrettig
Member Author

lrettig commented Feb 4, 2022

I think this is one of the most important things determining how we deliver updates. Since we have signed apps, we already have centralization: only we can introduce a new version. So in this case I don't see a reason to gossip about updates over p2p.

It's not true that "only we can introduce a new version." We need to allow people to fork our code and offer competing versions. There's nothing enshrined or special about the software released by the Spacemesh team, other than the fact that we're releasing the first version.

But I don't think these difficulties are the highest priority for now, so we can use a centralized source of updates. I do think we should definitely have a CLI flag to set a custom trusted source of updates.

I agree.

If the Nodes notify the network about which version they're running, I think we can make updates gradual by using the highest byte of sha256(NodeID + UpdateHash) to determine each node's "place in the queue".
But there might be a vulnerability: an attacker might build his own node that doesn't tell the network it has updated. If there's a big group of such malicious nodes on the network, it would block all other nodes from updating.

This is a clever idea! I like the idea that not all nodes auto-update at the same time. I think we can work around the vulnerability you describe by using the highest byte to pick an update time and removing the notion of a queue or of checking how many other nodes have already updated. E.g., nodes auto-upgrade over a period of 24 hours, and the exact time each one performs the update within that window depends on its ID.
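
A tiny Go sketch of that scheme (hypothetical names, not actual go-spacemesh code): derive a deterministic per-node offset within the spread window from sha256(NodeID + UpdateHash), with no queue and no need to observe how many peers have already updated.

```go
package autoupdate

import (
	"crypto/sha256"
	"time"
)

// updateDelay derives a deterministic per-node offset within the spread window
// from sha256(NodeID || UpdateHash): every node gets a different but
// reproducible slot, independent of what other nodes report.
func updateDelay(nodeID, updateHash []byte, window time.Duration) time.Duration {
	h := sha256.Sum256(append(append([]byte{}, nodeID...), updateHash...))
	// Use the first byte of the hash as the node's position in [0, 255].
	return window * time.Duration(h[0]) / 256
}
```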

End-of-support

@lrettig can you sum up the benefits of such a solution?

I answered this above:

It makes upgrades much easier. We know that, by a known point in time, all nodes running a particular, old version will have reached their "end-of-support" date and shut down (unless the user modified the source code and recompiled). Zcash has been using this to great success, see:

Grace period

I think this is a good idea. But:

  • What if we face a critical error on the network and we need to fix it asap?
  • Should we wait the same 24-72 hours as for other updates? Until then, will our network be down and the price of SMH fall for three days in a row?

All of this only applies to auto-updates. Node operators always have the option of manually installing updates without waiting for an auto-update or a grace period. In practice, in case of a critical error, we'd need to communicate directly with node operators and ask them to update immediately.

First of all, since the recommended way is to have auto-updates turned on, I propose naming the flag --no-auto-updates, so that passing it turns them off.

Defaults are important, and I think the default should always be not to auto-update (for governance reasons). We can recommend that users enable auto-update but I think it should be explicit opt-in.

Secondly, what if we're running the Node and then decide to switch auto-updating on or off? If we only have a CLI flag for it, we have to restart the Node, but it's desirable not to shut down a node at all. So maybe we need another way to handle it, for example via the API.

Agree, we can add this to the API, it should be pretty straightforward.
