
computeRollupGas eats all the CPUs #117

Closed
arnaudbriche opened this issue Jan 10, 2024 · 11 comments


@arnaudbriche

arnaudbriche commented Jan 10, 2024

System information

Erigon version: erigon version 0.02.0-unstable (docker image testinprod/op-erigon:2.48.1-0.2.0-amd64)

OS & Version: Linux

Commit hash:

Erigon Command (with flags/config): erigon --datadir=/data/op-erigon --ethash.dagdir=/data/op-erigon --snapshots=false --private.api.addr=0.0.0.0:9090 --http.addr=0.0.0.0 --http.port=8545 --http.vhosts=* --http.corsdomain=* --http.compression=true --authrpc.addr=0.0.0.0 --authrpc.port=8551 --authrpc.vhosts=* --authrpc.jwtsecret=/data/op-erigon/jwt.hex --http.api=eth,erigon,web3,net,debug,trace,txpool,engine --ws=true --ws.compression=true --db.pagesize=16KB --db.size.limit=8TB --db.read.concurrency=96 --torrent.port=42069 --port=30303 --nat=any --networkid=8453 --metrics=true --metrics.addr=0.0.0.0 --metrics.port=6060 --pprof=true --pprof.addr=0.0.0.0 --pprof.port=6061 --rpc.batch.concurrency=2 --rpc.batch.limit=10000 --rpc.returndata.limit=1048576 --nodiscover --rollup.sequencerhttp=https://mainnet-sequencer.base.org

Consensus Layer: op-node

Consensus Layer Command (with flags/config): op-node --l1=<L1_RPC_URL> --l2=http://localhost:8551 --l2.jwt-secret=/data/op-erigon/jwt.hex --rpc.addr=0.0.0.0 --rpc.port=9545 --l1.trustrpc --l1.rpckind=erigon --l1.rpc-rate-limit=0 --l1.rpc-max-batch-size=100 --l1.http-poll-interval=12s --metrics.enabled --metrics.addr=0.0.0.0 --metrics.port=6062 --rollup.config=/data/op-node/rollup.json

Chain/Network:

Expected behaviour

op-erigon should not spend most of its CPU time on an atomic store in the computeRollupGas function.

Actual behaviour

Most of the CPU is spent on an atomic store in the computeRollupGas function.

Screenshot 2024-01-10 at 17:31:20

Here is a pprof profile taken on the node.

profile.pb.gz
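The profile shape described above is consistent with many goroutines unconditionally re-storing the same value in a shared atomic: even when the value never changes, each store invalidates the cache line for every other core. A minimal illustrative sketch of that pattern (hypothetical names, not Erigon's actual code):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// hammer has every worker unconditionally re-store the same value in a
// shared atomic. The value never changes, but each Store still forces the
// cache line to bounce between cores, so with enough workers the profile
// becomes dominated by the atomic store itself.
func hammer(workers, iters int) int64 {
	var cached atomic.Int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				cached.Store(42) // unconditional store: cache-line ping-pong
			}
		}()
	}
	wg.Wait()
	return cached.Load()
}

func main() {
	fmt.Println(hammer(8, 100000)) // prints 42
}
```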

Steps to reproduce the behaviour

The node is in sync. I just ran an RPC client doing calls at relatively modest concurrency (100 RPS, 5 concurrent calls). The calls are mostly trace calls.
The observed behaviour triggers quickly after the client starts sending requests.

@arnaudbriche
Author

Possibly of interest: even after I stopped the RPC client for 24h, CPU usage on the machine is still high and the profile looks nearly the same, with no RPC traffic at all.
profile (2).pb.gz

@ImTei
Member

ImTei commented Jan 16, 2024

@arnaudbriche Are you still having this issue, or was it a one-time issue?

@arnaudbriche
Author

arnaudbriche commented Jan 16, 2024

@ImTei I had the issue for a long time, and it was very easy to reproduce.
Then I tried to upgrade op-node and op-erigon, and now my node is stuck.

op-node image: us-docker.pkg.dev/oplabs-tools-artifacts/images/op-node:v1.4.0
op-erigon image: testinprod/op-erigon:2.51.0-0.3.0-amd64

I can see this message in erigon logs:

[WARN] [01-16|10:20:58.080] Served conn=[::1]:48042 method=engine_forkchoiceUpdatedV2 reqid=18168 t=100.809µs err="missing withdrawals list"

And this one is op-node:

t=2024-01-16T10:23:59+0000 lvl=warn msg="Derivation process temporary error" attempts=16344 err="engine stage failed: temp: temporarily cannot insert new safe block: failed to create new block via forkchoice: unrecognized rpc error: missing withdrawals list"

@ImTei
Member

ImTei commented Jan 17, 2024

@arnaudbriche That error seems to be caused by a missing Canyon config. You're using a manual rollup config JSON when you run op-node. Does the config have the Canyon time? I recommend using the --network=base-mainnet flag.

And can you give me an example of RPC calls to reproduce?

@arnaudbriche
Author

@ImTei Yes, that was the issue. I had to resync my node from scratch with the --network=base-mainnet flag.

Regarding my previous issue, the calls were mostly trace_block and eth_getBlockReceipts. Doing tens of these in parallel always led to the contention issue on the atomic.

@ImTei
Member

ImTei commented Jan 23, 2024

@arnaudbriche Sorry for the delayed response. Due to a lack of resources, we are unable to investigate this issue right now. We plan to handle it in Q1. Please be patient.

@arnaudbriche
Author

@ImTei No problem. My node is now synced again and running the latest version. The issue still exists. Please let me know if I can help debug whenever you're working on this.

@arnaudbriche
Author

Hi @ImTei, I spent a bit of time debugging the issue and arrived at a fix.
I sent a PR.
The fix has been running on my node for 3 days without any issue.
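The PR itself isn't quoted in this thread, but a common remedy for this class of contention is the check-before-store pattern: read the atomic first (cheap, doesn't dirty other cores' caches) and store only on an actual change. A sketch of that pattern, under the assumption that it resembles the fix:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// costCache is a hypothetical illustration of check-before-store; the
// type and field names are invented for this sketch.
type costCache struct {
	blockNum atomic.Uint64
}

// update only writes the atomic when the value actually changed. Plain
// loads keep the cache line in the shared state across cores, so the
// read-mostly fast path no longer triggers cache-line ping-pong.
func (c *costCache) update(blockNum uint64) {
	if c.blockNum.Load() != blockNum {
		c.blockNum.Store(blockNum)
	}
}

func main() {
	var c costCache
	for i := 0; i < 5; i++ {
		c.update(100) // same block: only the first call performs a store
	}
	fmt.Println(c.blockNum.Load()) // prints 100
}
```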

@ImTei
Member

ImTei commented Feb 13, 2024

@arnaudbriche Thank you for your great work! Our team will review the PR and get back to you.

@arnaudbriche
Author

arnaudbriche commented Jun 18, 2024

Thanks guys! Closing the issue.

@ImTei
Member

ImTei commented Jun 20, 2024
