fix: computeRollupGas high CPU usage under load #135
Conversation
```diff
 // cache how much gas the tx takes on L1 for its share of rollup data
-rollupGas atomic.Value
+rollupGas atomic.Pointer[types2.RollupCostData]
```
Thanks for your contribution. However, I spotted a few issues, and I feel that sharing the overall op-erigon maintenance policy would be beneficial for all of us.

Our policy is to minimize the code diff from the original erigon implementation. This helps us isolate optimism-specific code and eases maintenance.

In this PR, I see that you have changed three field types (`hash`, `from`, and `rollupGas`) from `atomic.Value` to `atomic.Pointer[T]`. Your intention was to speed up the `Store` operation, and in that sense you are correct: an `atomic.Pointer` store is much cheaper than an `atomic.Value` store.

However, this change adds diff against upstream, so we will hit merge conflicts every time we fetch the erigon upstream. Can you please apply the type change only to the optimism-specific field, `rollupGas`? After that, can you test whether the CPU issue is solved? On our side, this issue was not reproducible.

If the issue still persists after changing only the type of `rollupGas`, that is a problem and we shall dive in deeper. If it does not, we are almost happy, but still have some issues.

The failed CI's linter is complaining that `atomic.Pointer` must not be copied. The proper fix is to adjust the `copy` method of transactions so it copies the `atomic.Pointer` fields manually. I believe the original erigon code has the same problem, although for legacy golang reasons the linter could not catch this error there. I think this copy bug can be fixed first at upstream (only for the `hash` and `from` fields, since upstream does not know about optimism) and then fetched down to op-erigon.

TL;DR:
- Our policy is to minimize diff with upstream erigon.
- Let's see whether the bug is fixed when only the type of `rollupGas` is updated, to minimize diff.
- Still, a bug persists because atomic types cannot be copied. This must also be fixed.
- Totally makes sense.
- I pushed a commit with changes to `rollupGas` only. I'm currently stress-testing it and will profile again to see if it fixes the issue.
- From https://pkg.go.dev/sync/atomic regarding `atomic.Value`:

> Once Store has been called, a Value must not be copied. A Value must not be copied after first use.

It seems upstream Erigon already does not respect this, from what I see. There should be no change of behaviour with my change (except that it triggers the linter).
For point 3, opening a PR to upstream to fix the issue seems valid. If you want to contribute upstream as well, please go ahead. If not, I shall open the PR when I'm available. Maybe opening an issue would be enough.
Also, can you run CIs again for the new commits? I think the nocopy issue still persists because of the usage of `atomic.Pointer`, so I expect the linter to complain. FYI, for golang legacy reasons, although copying an `atomic.Value` is also invalid, CI does not complain because `atomic.Value` does not have a nocopy field in it.
Oh, I found I can trigger CIs for external branches.
@arnaudbriche Can you please share the profiling results after the change?
Sure, here it is: profile.pb.gz | profile.svg

Regarding the linter, as you said, it's related to nocopy, but the actual behaviour is the same: the previous version of the code also copies the transaction structs holding `atomic.Value`.
(force-pushed from 2207d7e to 4795314)
@arnaudbriche I have selected option 3 after experimenting at #143. Also, an upstream PR is opened at ledgerwatch#9667. Let's take some time for the patch to trickle down into downstream. It will take some time to merge this PR, but I think it is the best option; I hope you understand. Also, can you confirm that your issue is solved? You have attached the profiling results here, and they seem to resolve the CPU consumption issue, but can you confirm that the issue is fixed?
@pcw109550 I confirm this fix makes the perf issue disappear completely.
Hi guys, anything new regarding this issue?
@arnaudbriche Hi, sorry for the delay in updating this issue. We opened a follow-up PR in the upstream repo, but it's still pending, so I think we should merge this PR for now. This will cause a CI failure, but we can handle it by ignoring it.
(force-pushed from eb35d43 to accacae)
@ImTei I just rebased from upstream as you wanted.
@arnaudbriche Thank you! We will merge this and handle the CI error, and this fix will be available in the next release. We will let you know in the issue (#117) once it's released.
After a bit of investigation regarding #117, I finally pinpointed it to the usage of `atomic.Value`. It seems the `Store` method does quite a few things and can cause a lot of busy looping under high contention. I replaced the cache values on `TransactionMisc` with `atomic.Pointer`, and it seems to fix the perf issue. No more `atomic.*` frames in the profiles during high query load.