Add health probe to p2w relayer #107

Merged
merged 1 commit into from
Mar 24, 2022

Conversation

@ali-bahjati ali-bahjati commented Mar 23, 2022

Adds /health to the REST API (default port 4200) if it is enabled. It returns healthy if the relayer has processed at least one VAA and the time elapsed since the last VAA is less than MAX_HEALTHY_NO_RELAY_DURATION_IN_SECONDS, which defaults to 120 seconds.

This will help Tilt understand that this service is working correctly, and we can use it in prod to take actions (restart, alert, ...) based on it.

There was a readiness probe before which only captured successful initialization of the service. This one replaces it because it's more complete.
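
A minimal sketch of the time-based check described above, reading it as "healthy while the time since the last VAA stays under the limit", which is what the variable name suggests. Only MAX_HEALTHY_NO_RELAY_DURATION_IN_SECONDS and its 120-second default come from this description; the class, method names, the strict `<` comparison, and the 200/503 status codes are assumptions for illustration, not the relayer's actual code.

```typescript
// Name and default taken from the PR description; everything else is hypothetical.
const MAX_HEALTHY_NO_RELAY_DURATION_IN_SECONDS = 120;

class RelayHealthTracker {
  // Time (ms since epoch) of the last processed VAA; undefined until the first one.
  private lastRelayTimeMs: number | undefined = undefined;
  private readonly maxNoRelaySeconds: number;

  constructor(maxNoRelaySeconds: number = MAX_HEALTHY_NO_RELAY_DURATION_IN_SECONDS) {
    this.maxNoRelaySeconds = maxNoRelaySeconds;
  }

  // Call whenever a VAA has been processed.
  recordRelay(nowMs: number = Date.now()): void {
    this.lastRelayTimeMs = nowMs;
  }

  // Healthy only if at least one VAA was processed and the last one is recent enough.
  isHealthy(nowMs: number = Date.now()): boolean {
    if (this.lastRelayTimeMs === undefined) {
      return false;
    }
    const elapsedSeconds = (nowMs - this.lastRelayTimeMs) / 1000;
    return elapsedSeconds < this.maxNoRelaySeconds;
  }
}

// Status code a /health handler could return (assumed mapping: 200 healthy, 503 otherwise).
function healthStatusCode(tracker: RelayHealthTracker, nowMs: number = Date.now()): number {
  return tracker.isHealthy(nowMs) ? 200 : 503;
}
```

Passing the current time in as a parameter keeps the check easy to test without waiting on a real clock.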

Notes:

@ali-bahjati ali-bahjati changed the title from "Add health probe" to "Add health probe to p2w relayer" Mar 23, 2022

@drozdziak1 drozdziak1 left a comment

Looks complete otherwise!

- name: RETRY_MAX_ATTEMPTS
  value: '4'
- name: RETRY_DELAY_IN_MS
  value: '250'
- name: MAX_MSGS_PER_BATCH
  value: '1'
- name: MAX_HEALTHY_NO_RELAY_DURATION_IN_SECONDS

Contributor

Note: The ability to relay depends directly on the ability to attest. An attester that is down for 2 minutes (not that hard to do on Solana) would fail this healthcheck. For this reason, I think an alternative not based on time is worth considering, e.g. a count of successes/failures in a row. Start the service with isHealthy set to false. If we relay N batches successfully, switch isHealthy to true; if we fail to relay N batches, switch it back to false.
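
The count-based alternative could be sketched roughly as follows. This is only an illustration of the idea in the comment above: the class and method names are hypothetical, and it assumes "N batches" means N outcomes in a row, with any opposite outcome resetting the streak.

```typescript
// Hypothetical sketch of a streak-based health flag; not code from this PR.
class ConsecutiveOutcomeHealth {
  private healthy = false; // the service starts unhealthy
  private successesInARow = 0;
  private failuresInARow = 0;
  private readonly threshold: number; // the "N" from the comment

  constructor(threshold: number) {
    if (!Number.isInteger(threshold) || threshold < 1) {
      throw new Error("threshold must be a positive integer");
    }
    this.threshold = threshold;
  }

  // A successful relay extends the success streak and resets the failure streak.
  recordSuccess(): void {
    this.successesInARow += 1;
    this.failuresInARow = 0;
    if (this.successesInARow >= this.threshold) {
      this.healthy = true;
    }
  }

  // A failed relay extends the failure streak and resets the success streak.
  recordFailure(): void {
    this.failuresInARow += 1;
    this.successesInARow = 0;
    if (this.failuresInARow >= this.threshold) {
      this.healthy = false;
    }
  }

  isHealthy(): boolean {
    return this.healthy;
  }
}
```

Because the flag only changes after a full streak, a single failed batch between successes does not flip the service to unhealthy. A relayer that never attempts a relay would also never change state under this scheme.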

Contributor Author

I agree. It depends on how we define ready. If the definition is "the service is ready and all of its dependencies are ready", then it makes sense. I made it like this specifically because we cannot verify some third-party middleware, namely wormhole and spy, which we currently cannot track.

What do you think? @erancx @jayantk I'd also like to hear your feedback.

Contributor

Right, we should be fine with your solution either way, especially if we tweak the value against what we see in prod.

Contributor

It depends on what this check is used for. I think @ali-bahjati's definition is better for tracking whether the system as a whole is operating correctly, as it is more likely to flag errors. However, it's not a good definition for determining when to restart the service automatically. @drozdziak1's definition seems better for the second thing, though I think it fails to capture some error cases (e.g., the relayer is stuck and never tries to relay anything that it receives).

I think we should be clear on what we're going to do with the check and decide based on that. If the check is for raising an alert that something is wrong, I think the current definition in the code is correct.

@jayantk jayantk left a comment

Approving this assuming the use of this endpoint is to alert. See inline comment


erancx commented Mar 23, 2022

First, just to be clear, the idea here is for the pod to decide whether it is deadlocked and restart itself.
I think having no log is a good indication of whether the app is stale. We will be monitoring on our part for errors that could pop up and see if that gives us any good indication too.

I also think it would be a good idea to have a readiness probe once the relay is connected to spy?
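
One way to picture the split suggested here is two separate predicates: readiness tied to the spy connection, and liveness tied to recent relay activity. The sketch below is an assumption-laden illustration; the `RelayerState` shape, the function names, and the 200/503 mapping are all hypothetical and not taken from the relayer.

```typescript
// Hypothetical state a relayer could expose to its probes.
interface RelayerState {
  connectedToSpy: boolean; // set once the spy connection is established
  lastRelayTimeMs?: number; // time of the last relayed VAA, if any
}

// Readiness: the relayer can receive work once it is connected to spy.
function isReady(state: RelayerState): boolean {
  return state.connectedToSpy;
}

// Liveness: at least one relay happened, and it happened recently enough.
function isLive(state: RelayerState, nowMs: number, maxNoRelaySeconds: number): boolean {
  if (state.lastRelayTimeMs === undefined) {
    return false;
  }
  return nowMs - state.lastRelayTimeMs < maxNoRelaySeconds * 1000;
}

// Assumed mapping from a probe result to an HTTP status code.
function probeStatus(ok: boolean): number {
  return ok ? 200 : 503;
}
```

If the liveness result drives automatic restarts, the probe would need some initial grace period, since the relayer reports not live until it has relayed its first VAA.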

devnet/p2w-terra-relay.yaml (resolved)
@ali-bahjati ali-bahjati requested a review from erancx March 24, 2022 08:40
@ali-bahjati ali-bahjati merged commit 2c37465 into dev.v2 Mar 24, 2022
@ali-bahjati ali-bahjati deleted the abehjati/add-readiness-to-relayer branch March 25, 2022 11:12
ali-bahjati added a commit that referenced this pull request Apr 11, 2022
* Move js sdk on p2w-sdk to js folder

Also modifies other dependencies to correct path

* Reversed removal of wasm build for nodejs

* Add newline to a file

* pyth2wormhole: Fix attestation validation bug

commit-id:567942d7

* Add p2w sdk

It uses Pyth clients structs and cleans some of definitions for Pyth2Wormhole structures.

* Add emitter type and add wasm function for it

- It requires solitaire and it requires nightly rust
- No logic is applied, code is from p2w solana contract. (Eventually will be removed from there)

* Add new line

* Move WASM gen docker to root

It is because wasm is going to be used for p2w-sdk too.

* Fix unchanged cache mount paths

* Move terra relayer into the repo

* Update readme

* p2w-client: Add lib target, make helpers into lib functions there

commit-id:3aeb9ee6

* pyth2wormhole-client: Implement retries

commit-id:462677a2

* Make p2w-sdk js use p2w-sdk rust wasm bindings (#65)

* Make p2w-sdk js use p2w-sdk rust wasm bindings (instead of solana contract bindings)
- Removes `wasm.rs` in solana contract too.

* p2w attester contract use p2w-sdk (#68)

* Make solana pyth2wormhole contract to use the sdk

* Use threadpool to set up price symbols (#69)

* Add solana feature flag for p2w sdk (#71)

* Pyth bridge terra contract support batch attestation + use p2w sdk (#72)

* Make terra contract to use pyth2wormhole-sdk and support batch attestation

* Update packages + code format

* Move terra dockerfile out to support third-party dependency

* pyth2wormhole-client: Add polling-based concurrent tx confirmation

commit-id:5d16d035

* chore: p2w spy guarding improve Dockerfile

* fix: p2w_autoattest don't die after initialization

also minimal formatting

* add P2W_EXIT_ON_ERROR

* set P2W_EXIT_ON_ERROR default to True

* Remove bool test

* hopefully this time.

* add tilt p2w-attest P2W_EXIT_ON_ERROR

* convert P2W_EXIT_ON_ERROR to "true"

* Fix pyth test publisher (#76)

* Fix test pyth publisher to actually publish price

- Uses newer pyth images and removes existing hacks for old versions. It essentially makes dockers cleaner.
- Also improve some adds in dockers to cache more efficiently

* Support Batch Price attestation for terra relay (#75)

* Support Batch Price attestation for terra relay

* Abehjati/update p2w sdk to pyth sdk (#83)

* Make p2w-sdk use pyth-sdk

* Correct test values to reflect .env.test

* update p2w sdk to use ema instead of twa (#84)

* Rename twa to ema in terra relay (#85)

* Bring PythStructs.PriceAttestation struct in line with new API

* Add ability to parse batch price attestations

* Pyth terra remove wormhole governance (#87)

* Pyth in terra: remove wormhole governance

* [WIP] p2w-relay-iface: Add NPM package with relayer interface PoC

commit-id:efcb9b34

* Define Pyth SDK Price struct

* Define internal PythStructs.PriceInfo struct

* Cache price updates in standardised PriceInfo format

* Cache price updates from batch attestations

* p2w-relay-iface -> p2w-relay-terra/src/relay/iface.ts

commit-id:ed9846e3

* p2w relay interface: remove config from Relay iface

commit-id:0359d886

* Remove now unused parsePriceAttestation function

* Pyth terra bridge: add contract deployment script (#88)

* Add pyth deployment script

- Also updates build.sh to build pyth completely
- Add a readme for deployment guide

* Add test for partial update behaviour

* update p2w sdk to new pyth (#91)

* p2w-sdk/rust use pyth sdk solana v2

* Dockerfile.client: solana 1.8.1 -> 1.9.4

commit-id:643299d3

* p2w-terra-relay: ignore lib and node, own project dir in docker

commit-id:b084bc40

* p2w-terra-relay: iface.ts review nits, naive impl for Terra

commit-id:0ecbfdd6

* Terra contract public api (#79)

* Use pyth-sdk in terra contract
* Update terra contract according to agreed API
- Also adds v2 suffix to price_info key because this migration is breaking.

* p2w-terra-relay: apply review nits

commit-id:aec39c85

* p2w-terra-relay: make worker.ts generic w.r.t. Relay interface

commit-id:5937a08c

* terra.ts: add missing return statement

commit-id:ba0365e6

* Update worker to handle timeout in callback correctly (#97)

* Remove wormhole-based governance

* Remove now unused legacy governance state and variables

* Remove Pyth Implementation implementation

* p2w-terra-relay: run formatter

commit-id:df311e23

* p2w-terra-relay: apply review nits

commit-id:5034b061

* Run formatter to trigger CI

commit-id:7c643d79

* p2w-terra-relay: EVM boilerplate

commit-id:8ad73ded

* Remove old PythProxy inheritance hierarchy

* Remove now unused initialized implementations map

* Remove old mock bridge implementation

* Remove dependency to wormhole sdk as path and cleanup wrong eth copies (#104)

* Dockerfile.pyth_relay: Fix lockfile issue in ethereum

This commit fixes a lockfile issue resulting from newer NPM in our
container.

Specifically, our Dockerfile is pinned, relaxes Ethereum's
lockfile (npm ci -> npm install) and hardens our lockfile (npm install
-> npm ci)

commit-id:3381c8ec

* p2w-terra-relay: Admit loss against mkdir -p

commit-id:3abdb58d

* Remove unused components from wormhole (#108)

* Remove unused components from wormhole

Removes the following:
- explorer
- e2e
- bridge_ui
- algorand stuff (teal dockerfile and third_party/algorand)
- ci_tests (testing directory)  which are for JS/Bridge UI

* Remove unused terra contracts (#109)

- Note: Terra contract addresses are changed by this PR due to deterministic ordering.
- Removed unused nft and token bridge, and migration contracts in Terra
- Modified documentation to remove info regarding removed contracts.(docs/devnet.md)

* Remove unused solana contracts and their wasm creations (#110)

Removes token bridge, nft bridge, migration. Also removes them from deployments and docs.

* Add fee estimate for terra relay (#112)

* Removes directories which are not related to p2w (#111)

Removes
- audits
- dashboards (dashboard is removed from Tilt)
- event_database (all of its dependencies are removed from Tilt and it's not for p2w)
- lp_ui: a project (presumably liquidity pool) not related to p2w
- sdk: wormhole sdk, p2w depends on its npm package and there is no dependency on the rust one
- spydk: it's not anywhere in p2w
- staging/algorand: these are for algorand which is not used in p2w
- whitepapers: these are for wormhole

* Add and update openzeppelin packages

* Add initializer to Pyth contract

* Add upgradable PythProxy contract

* Update tests to work with new proxy setup

* Update migrate script to work with new proxy setup

* Add tests for new proxy setup

* Inline PythStorage.Provider struct

* Make Pyth.verifyPythVM function internal

* Fix struct field names

* Rename Price to PriceFeed to be consistent with SDK

* Replace PythGetters.latestPriceInfo with Pyth.queryPriceFeed in public API

* p2w-terra-relay: Add a query() EVM call and Tilt boilerplate

commit-id:f97d0c16

* Clarify test comments

* Add health probe (#107)

* Rename PythProxy to PythUpgradable

* p2w-evm-relay: Backport the proxy address change from debug session

commit-id:55b63ed5

* p2w-terra-relay -> p2w-relay, split EVM relay into new service

commit-id:36d0db6e

* Tiltfile: typo

commit-id:3bbba986

* p2w-evm-relay.yaml: typo

commit-id:35c87c79

* p2w-evm-relay.yaml: typo 2: electric boogaloo

commit-id:40892265

* Add build folder to dockerignore

* Rename attestPriceBatch to updatePriceBatchFromVm

* Update comment on time check

* Trigger Build

* Tiltfile: Fix port forwards for p2w-evm-relay

commit-id:6e5e9c14

* p2w-relay: PythImplementation -> PythUpgradable

commit-id:bfea7eb5

* Remove unused Pyth Chain ID metadata

* Add the query() call

commit-id:02966ce5

* p2w-terra-relay: Fix evm.ts after contract rename

commit-id:87381bec

* Make truffle migrations directory configurable

* p2w-evm-relay: Fix wrong EVM contract ID, add a check for it

This commit takes care of an outdated pyth2wormhole EVM contract
address and implements a contract/non-contract check using
web3.eth.getCode() (empty for non-contracts).

This problem cost us several hours of debugging and resulted from an
EVM gotcha - a contract call to a non-contract address will simply
ignore the call payload and make a plain transfer. Additionally, ETH
accounts don't have a notion of initialization - used and unused
addresses are equally valid tx recipients. Resulting from both
properties, any unused address could potentially yield wrongly
successful calls, wasting funds and debug time over p2w-relay. Thus
the heuristic to protect us from this is to see if the address' code
storage is populated.

commit-id:b655a720

* p2w-relay: Also implement the contract check in EVM relay()

commit-id:e28709e5

* evm.ts: Fix wording in changed/unchanged logs

commit-id:13c81625

* Make terra relayer more resilient (#120)

- Increase retry attempts (4 to 6) and retry_delay (250ms to 1s) to be more resilient
  - This is because when an account sequence mismatch happens it might take some time to be fixed
- Removed estimate fee because it's being done in wallet.createAndSignTx (fewer requests)
- Improved logging when an error happens

* Update dockerfile to chown less files (#121)

* Update dockerfile to chown sooner

* p2w-relay: review nits

* p2w-evm-relay: make feed verification queries configurable

* p2w-relay: cache wormhole import

* p2w-relay: formatter, remove getcode() from relay(), add comments

commit-id:1a65c52c

* p2w-relay: typos and leftovers

commit-id:9b523b25

* Change websocket to json socket to support bsc testnet + improves env vars (#139)

* Change websocket to json socket to support bsc testnet + improving env vars

* Add unit test to Pyth Terra Contract (#123)

* Add unit test to the terra contract

- Refactors the code into multiple functions to make unit testing easier
- Adds build and test of terra contract to CI according to #73

* p2w-relay: harden exception handling, yell about uncaught stuff

commit-id:24e14835

* p2w-relay: Correct outdated comment

commit-id:d0b57d33

* p2w-evm-relay: s/async (e)/(e)/

commit-id:11b3a474

* Modify proto docker and tiltfile to stop creating unnecessary files (#144)

* Remove sdk/spydk from wasm and remove buf gen web yaml (#145)

* Remove wormhole contract from wasm generation (#160)

* pyth2wormhole: Add num_publishers to libraries and contracts

commit-id:f7263eed

* pyth2wormhole: add max_num_publishers to cross-chain metadata

commit-id:7550fa50

* Move p2w relayer parsing to p2w sdk js (#162)

* Move Price Attestation parsing logic to the sdk

* pyth2wormhole: Add contract testing boilerplate for attest()

commit-id:51949fbe

* Create p2w-api base (from p2w-relay) (#142)

* Create p2w-api base (from p2w-relay)

* Refactor project structure

* Rename p2w to pyth price service (#166)

* Abehjati/price-service-add-rest-layer (#167)

* Add rest api for latest vaa

Co-authored-by: Stan Drozd <stan@nexantic.com>
Co-authored-by: Eran Davidovich <edavidovich@jumptrading.com>
Co-authored-by: Eran Davidovich <erancx@users.noreply.github.com>
Co-authored-by: Tom Pointon <tom@teepeestudios.net>
Co-authored-by: Stan Drozd <drozdziak1@gmail.com>
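
The contract/non-contract check described in the "p2w-evm-relay: Fix wrong EVM contract ID, add a check for it" commit above can be sketched as below. The commit message only says the check uses web3.eth.getCode() and that the result is empty for non-contracts; the helper names, the injected `GetCode` function type, and the treatment of "0x" and "0x0" as empty are assumptions for illustration.

```typescript
// Shape of a code-fetching function; web3.eth.getCode(address) can be passed in here.
type GetCode = (address: string) => Promise<string>;

// True if the returned hex string holds any bytecode.
// Assumption: nodes report empty code as "0x" (some report "0x0").
function hasContractCode(code: string): boolean {
  const hex = code.replace(/^0x/i, "");
  return hex !== "" && hex !== "0";
}

// Refuse to proceed if the target address has no code, because a call to a
// non-contract address would succeed as a plain transfer and hide the mistake.
async function assertIsContract(getCode: GetCode, address: string): Promise<void> {
  const code = await getCode(address);
  if (!hasContractCode(code)) {
    throw new Error(`No contract code found at ${address}`);
  }
}
```

With web3.js this might be called as `await assertIsContract((a) => web3.eth.getCode(a), contractAddress)` before the first relay, where `web3` and `contractAddress` are placeholders for values from the caller's own setup.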