Skip to content
This repository has been archived by the owner on Jun 3, 2020. It is now read-only.

Double-signing prevention (MVP for launch) #60

Closed
tarcieri opened this issue Oct 3, 2018 · 16 comments
Closed

Double-signing prevention (MVP for launch) #60

tarcieri opened this issue Oct 3, 2018 · 16 comments

Comments

@tarcieri
Copy link
Contributor

tarcieri commented Oct 3, 2018

This is a tracking issue for KMS double-signing prevention.

Launch goal: attempt to prevent validator bugs or data loss from causing the validator to double sign by making the KMS aware of the current block height.

Longer-term goal: provide defensive capabilities / survive compromise of validator hosts. See #115 for discussion of post-launch double signing improvements.

Previous discussion:

Current Status

Launch Plan

Add support to the KMS for a user-configurable subcommand for obtaining the current block height. This can be used to "bootstrap" a block height value when a KMS process is started. From there, the KMS can track the last block it signed.

For example, the KMS could call out to a shell script which hits the /status RPC endpoint for a validator's sentries, piping the output through e.g. jq to extract "latest_block_height" and sorting the results, taking the highest value. An example script can be included in the KMS repo which people can customize to their needs.

This should allow validators to choose whatever mechanism they like for providing the KMS with the current block height, and implement e.g. storing the current block height in external databases as proposed in #11.

Longer-Term Plan

See #115.

@jleni
Copy link
Contributor

jleni commented Oct 3, 2018

Just a short clarification: signatory-ledger-cosval is agnostic to these concepts, it will just pass messages to the ledger device. It is actually the ledger app/firmware that will track current block height and round.

The device only signs blocks in incremental order. To define the initial/current block height, the block height of the first KMS request after plugging is used as a reference. Setting this initial value requires manual user confirmation in the ledger device.

A similar approach could be used by signatory providers. A good idea would be to create a signatory provider wrapper with this functionality.

@tarcieri
Copy link
Contributor Author

tarcieri commented Oct 3, 2018

The device only signs blocks in incremental order.

@jleni does it just preserve monotonicity, or does it require each signed block immediately follow the previous one?

I am wondering about things like failover.

@jleni
Copy link
Contributor

jleni commented Oct 3, 2018

This behavior is according to the specs. It will sign in monotonic order, they do not need to be sequential.
For instance:

  • n, n+1, n+3 (all are signed)
  • n, n+2, n+1, n+3. (only n+1 is rejected)

Adding Failover/HA to KMS is an interesting follow up. It might actually need KMS/signatory arbitration + something like Raft (consensus) to handle these cases.

@tarcieri tarcieri added question Further information is requested security Security-critical issues labels Nov 22, 2018
@mdyring
Copy link
Contributor

mdyring commented Nov 23, 2018

My understanding - please correct me if wrong - is that HSM2 double signing prevention will be implemented by tracking the last signed height, which is persisted in one of the slots of the device.

KMS will then need to update this slot before signing each block, and should ideally read the data back to ensure it was stored correctly.

To make this as robust as possible, if the update/read cycle fails, KMS should complain loudly, but continue operating in degraded state. It could still prevent double signing by using locally cached data, and I guess(?) the HSM2 might still continue signing.

Apologies if this seems trivial, but I thought it was worth stressing since I couldn't find any MTBF data on the HSM2. Validator failure due to write wear on the HSM would be the worst.

TL;DR - the failure characteristics of the underlying devices (YubiHSM2, Ledger etc.) should be carefully considered. They might not be designed for write intensive operations, including double signing prevention, and could wear out.

@tarcieri
Copy link
Contributor Author

@mdyring that's a good point. I will investigate that before I go forward with this approach.

@jleni
Copy link
Contributor

jleni commented Nov 23, 2018 via email

@tarcieri tarcieri changed the title Double-signing prevention (launch and beyond) Double-signing prevention (MVP for launch) Nov 23, 2018
@tarcieri tarcieri removed question Further information is requested security Security-critical issues labels Nov 23, 2018
@mdyring
Copy link
Contributor

mdyring commented Nov 24, 2018

Block number/rounds are tracked in RAM, not nvram to avoid this issue. When the device is plugged will skip first votes and request the user for confirmation to align with current values. I hope this answers your question.

That seems like a sensible solution. In case KMS needs to be restarted, would this require physical access to the ledger device for confirmation? Ideally it should be possible to restart/update software remotely.

Alternatively, KMS could signal a "clean shutdown" to the device which can then write to nvram and use that information on next start? (this could be useful in cases where a server needs to be power cycled)

@tarcieri
Copy link
Contributor Author

We talked with Yubico about wear on the flash. One of their reps suggested it wouldn't be an issue, although it's something I'm loathe to risk without a precise MTBF. The last thing we want is a bunch of validators dying at the same time because they all wore out their flash roughly at the same time.

@adrianbrink
Copy link
Contributor

Alternatively, KMS could signal a "clean shutdown" to the device which can then write to nvram and use that information on next start? (this could be useful in cases where a server needs to be power cycled)

This case won't work with the Ledger, since a power-cycle will restart the Ledger. In order to unlock the ledger you have to physically enter your PIN, which means you have to be at the datacenter anyway. That's why persistent storage for height/round is not as important with the ledger.

@tarcieri
Copy link
Contributor Author

tarcieri commented Feb 27, 2019

I'm just about ready to (finally) start work on this. Here is a tentative plan:

  1. Add a configuration section to tmkms.toml for Tendermint blockchain networks the KMS is operating on. This can include the Bech32 prefixes used by that network (addressing Consider support different consensus pubkey prefix by config? #178, although it's still unclear if all Tendermint networks will adopt Bech32)
  2. Make chain ID information configured for keys first-class (it's presently a string). This will tie signing keys directly to specific chains (addressing Access control: validator AuthN and signing key AuthZ #111, which is something of a showstopper security issue)
  3. Persist state files for different network containing e.g. the block height. The exact details of this are still TBD.

There was some earlier discussion of persisting this information in e.g. YubiHSM2's opaque data. Due to concerns about write wear, I don't think this is a good idea.

Another alternative would be introducing some kind of embedded database, e.g. sled, LevelDB, or LMDB. That seems like a complicated change to introduce right now, and also something where the other potential / future persistence needs of the KMS need to be considered.

The existing priv_validator_state.json files provide at least something of a known quantity people are familiar with. I would suggest implementing a similar approach (but with at least one of these files per Tendermint network/chain). This is a KISS solution that can be potentially be replaced by an embedded (or non-embedded) backend database, but at such a time where we're actually ready to cross that bridge.

@jleni
Copy link
Contributor

jleni commented Feb 27, 2019

with respect to 2.
Can we have something like this? #177 (comment)

keys = [{ id = "gaia-6000", pubkey="123....", key = 1 }]
keys = [{ id = "gaia-7000", pubkey="123....", key = 2 }]
keys = [{ id = "gaia-9000", pubkey="456....", key = 1 }]

This would be more secure and allow support several devices connected to the same KMS.

@tarcieri
Copy link
Contributor Author

tarcieri commented Mar 2, 2019

@jleni I'm not sure that syntax is valid TOML... it seems like you want:

[[keys]]
id = "gaia-6000"
pubkey="123...."
key = 1

[[keys]]
id = "gaia-7000"
pubkey="123...."
key = 2

?

I'd agree the config syntax needs changes, but the main thing that needs to be solved, particularly in the context of this issue, is expressing an m:n mapping between Tendermint networks/chains and keys.

With your proposed syntax, I think we'd need at least:

{ key = 1, chains = ["gaia-6000", "gaia-7000"], ... }

@jleni
Copy link
Contributor

jleni commented Mar 2, 2019 via email

@mdyring
Copy link
Contributor

mdyring commented Mar 6, 2019

Regarding support for multiple chains, it would be very nice if a single tmkms instance could support multiple (tendermint based, running compatible version, etc.) projects, such as IRIS, Cosmos, IOV, etc.

Could be it something as simple as making the HRP of bech32 addresses configurable?

@tarcieri
Copy link
Contributor Author

tarcieri commented Mar 6, 2019

@mdyring that's definitely the plan. See #178

tarcieri pushed a commit that referenced this issue Mar 10, 2019
This commit moves the last sign state tracking for chains into the
global chain registry.

It also allows configurable locations for the chain state tracking
files, which should probably make it easier to ensure they don't clobber
each other in tests.

This doesn't fully implement the double signing plan in #60 but at this
point I think we're close enough. The remaining item is (optionally)
running a user-specified command on startup to query the current block
height.
tarcieri added a commit that referenced this issue Mar 10, 2019
@tarcieri
Copy link
Contributor Author

Double signing tracking was added in #193 (thanks @zaki!) and the rest of this plan (i.e. hook support) implemented in #205.

It will be released in 0.5.0. I plan on having 0.5.0-beta1 out later today.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants