Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bridge/pkg/solana: retry VAA submission on transient errors #48

Closed
wants to merge 3 commits into from

Conversation

leoluk
Copy link
Contributor

@leoluk leoluk commented Oct 21, 2020

Stack from ghstack:

In particular, this fixes a race condition where the Solana devnet would
take longer to deploy than the ETH devnet to deploy and we'd end up
with an outdated guardian set on Solana.

We currently create a Goroutine for every pending resubmission, which
waits and blocks on the channel until solwatch is processing requests
again. This is effectively an unbounded queue. An alternative approach
would be a channel with sufficient capacity plus backoff.

Test Plan: Deployed without solana-devnet, waited for initial guardian
set change VAA to be requeued, then deployed solana-devnet.

The VAA was successfully submitted once the transient error resolved:

[...]
21:08:44.712Z	ERROR	wormhole-guardian-0.supervisor	Runnable died	{"dn": "root.solwatch", "error": "returned error when NODE_STATE_HEALTHY: failed to receive message from agent: EOF"}
21:08:44.712Z	INFO	wormhole-guardian-0.supervisor	rescheduling supervised node	{"dn": "root.solwatch", "backoff": 0.737286432}
21:08:45.451Z	INFO	wormhole-guardian-0.root.solwatch	watching for on-chain events
21:08:50.031Z	ERROR	wormhole-guardian-0.root.solwatch	failed to submit VAA	{"error": "rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL", "digest": "79[...]"}
21:08:50.031Z	ERROR	wormhole-guardian-0.root.solwatch	requeuing VAA	{"error": "rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL", "digest": "79[...]"}
21:09:02.062Z	INFO	wormhole-guardian-0.root.solwatch	submitted VAA	{"tx_sig": "4EKmH[...]", "digest": "79[...]"}

In particular, this fixes a race condition where the Solana devnet would
take longer to deploy than the ETH devnet to deploy and we'd end up
with an outdated guardian set on Solana.

We currently create a Goroutine for every pending resubmission, which
waits and blocks on the channel until solwatch is processing requests
again. This is effectively an unbounded queue. An alternative approach
would be a channel with sufficient capacity plus backoff.

Test Plan: Deployed without solana-devnet, waited for initial guardian
set change VAA to be requeued, then deployed solana-devnet.

The VAA was successfully submitted once the transient error resolved:

```
[...]
21:08:44.712Z	ERROR	wormhole-guardian-0.supervisor	Runnable died	{"dn": "root.solwatch", "error": "returned error when NODE_STATE_HEALTHY: failed to receive message from agent: EOF"}
21:08:44.712Z	INFO	wormhole-guardian-0.supervisor	rescheduling supervised node	{"dn": "root.solwatch", "backoff": 0.737286432}
21:08:45.451Z	INFO	wormhole-guardian-0.root.solwatch	watching for on-chain events
21:08:50.031Z	ERROR	wormhole-guardian-0.root.solwatch	failed to submit VAA	{"error": "rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL", "digest": "79[...]"}
21:08:50.031Z	ERROR	wormhole-guardian-0.root.solwatch	requeuing VAA	{"error": "rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL", "digest": "79[...]"}
21:09:02.062Z	INFO	wormhole-guardian-0.root.solwatch	submitted VAA	{"tx_sig": "4EKmH[...]", "digest": "79[...]"}
```

[ghstack-poisoned]
Leo added 2 commits October 22, 2020 12:18
In particular, this fixes a race condition where the Solana devnet would
take longer to deploy than the ETH devnet to deploy and we'd end up
with an outdated guardian set on Solana.

We currently create a Goroutine for every pending resubmission, which
waits and blocks on the channel until solwatch is processing requests
again. This is effectively an unbounded queue. An alternative approach
would be a channel with sufficient capacity plus backoff.

Test Plan: Deployed without solana-devnet, waited for initial guardian
set change VAA to be requeued, then deployed solana-devnet.

The VAA was successfully submitted once the transient error resolved:

```
[...]
21:08:44.712Z	ERROR	wormhole-guardian-0.supervisor	Runnable died	{"dn": "root.solwatch", "error": "returned error when NODE_STATE_HEALTHY: failed to receive message from agent: EOF"}
21:08:44.712Z	INFO	wormhole-guardian-0.supervisor	rescheduling supervised node	{"dn": "root.solwatch", "backoff": 0.737286432}
21:08:45.451Z	INFO	wormhole-guardian-0.root.solwatch	watching for on-chain events
21:08:50.031Z	ERROR	wormhole-guardian-0.root.solwatch	failed to submit VAA	{"error": "rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL", "digest": "79[...]"}
21:08:50.031Z	ERROR	wormhole-guardian-0.root.solwatch	requeuing VAA	{"error": "rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL", "digest": "79[...]"}
21:09:02.062Z	INFO	wormhole-guardian-0.root.solwatch	submitted VAA	{"tx_sig": "4EKmH[...]", "digest": "79[...]"}
```

[ghstack-poisoned]
In particular, this fixes a race condition where the Solana devnet would
take longer to deploy than the ETH devnet to deploy and we'd end up
with an outdated guardian set on Solana.

We currently create a Goroutine for every pending resubmission, which
waits and blocks on the channel until solwatch is processing requests
again. This is effectively an unbounded queue. An alternative approach
would be a channel with sufficient capacity plus backoff.

Test Plan: Deployed without solana-devnet, waited for initial guardian
set change VAA to be requeued, then deployed solana-devnet.

The VAA was successfully submitted once the transient error resolved:

```
[...]
21:08:44.712Z	ERROR	wormhole-guardian-0.supervisor	Runnable died	{"dn": "root.solwatch", "error": "returned error when NODE_STATE_HEALTHY: failed to receive message from agent: EOF"}
21:08:44.712Z	INFO	wormhole-guardian-0.supervisor	rescheduling supervised node	{"dn": "root.solwatch", "backoff": 0.737286432}
21:08:45.451Z	INFO	wormhole-guardian-0.root.solwatch	watching for on-chain events
21:08:50.031Z	ERROR	wormhole-guardian-0.root.solwatch	failed to submit VAA	{"error": "rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL", "digest": "79[...]"}
21:08:50.031Z	ERROR	wormhole-guardian-0.root.solwatch	requeuing VAA	{"error": "rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL", "digest": "79[...]"}
21:09:02.062Z	INFO	wormhole-guardian-0.root.solwatch	submitted VAA	{"tx_sig": "4EKmH[...]", "digest": "79[...]"}
```

[ghstack-poisoned]
@leoluk leoluk closed this in 91241ee Oct 22, 2020
@leoluk leoluk deleted the gh/leoluk/6/head branch October 22, 2020 10:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants