Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: computeRollupGas high CPU usage under load #135

Merged
merged 1 commit into from
Jun 17, 2024

Conversation

arnaudbriche
Copy link

After a bit of investigation regarding #117 , I finally pinpointed it to the usage of atomic.Value.
It seems like the Store method does quite a few things and could cause a lot of busy looping in case of high contention.
A replaced the cache values on TransactionMisc by atomic.Pointer, and it seems to fix the perf issue.
No more atomic.* in the profiles during high query load.

@pcw109550 pcw109550 self-requested a review February 14, 2024 05:39

// cache how much gas the tx takes on L1 for its share of rollup data
rollupGas atomic.Value
rollupGas atomic.Pointer[types2.RollupCostData]
Copy link
Member

@pcw109550 pcw109550 Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution. However, I spot some few issues and feel like sharing the overall op-erigon maintenance policy would be beneficial for all of us.

Our policy is to minimize code diff from the original erigon implementation. This helps us to narrow down optimism specific code and helps us to maintain.

In this PR, I see that you have altered three field types: hash, from, rollupGas from atomic.Value to atomic.Pointer[T]. Your intention was to speed up the Store operation, and in that sense you are correct: atomic.Pointer store is way cheaper than atomic.Value store.

However, this change adds diff comparing with upstream. We will have merge conflicts every time while fetching erigon upstream. Can you please only apply the type changes only to optimism specific fields: rollupGas? After that, can you please test that the CPU issue is solved? In our side, this issue was not reproducible.

If only we change the type of rollupGas and issue still persists, this is a problem and we shall dive in deeper. If is not, we are all almost happy, but still have some issues.

Failed CI's linter is complaining about atomic.Pointer must not be copied. The proper way to do this is to fix copy method of transactions, and manually copy the atomic.Pointer fields. I believe that the original erigon code has same problem, although for legacy golang issues, linter could not catch this error. I think this copy bug can be first fixed at upstream(only for hash, from fields since upstream does not know about optimism), and fetch down to op-erigon.

TL; DR:

  1. Our policy is to minimize diff with upstream erigon.
  2. Lets see the bug is fixed when only updating the type of rollupGas; to minimize diff.
  3. Still, bug persists because atomic types cannot be copied. This must be also fixed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Totally make sense
  2. I did a commit with only changes to rollupGasg. I'm currently stress-testing it and will profile again to see if it fixes the issue.
  3. From https://pkg.go.dev/sync/atomic regarding atomic.Value: Once Store has been called, a Value must not be copied. A Value must not be copied after first use.
    It seems like upstream Erigon does not respect this already from what I see. There should be no change of behaviour with my change (except that it triggers the linter)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For point 3, opening a PR to upstream to fix the issue seems to be valid. I think if you want to contribute upstream as well, please go ahead. If not, I shall open the PR when I'm available. Maybe opening an issue would be enough.

Copy link
Member

@pcw109550 pcw109550 Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, Can you run CIs again for the new commits? I think nocopy issue still persists, because of usage of atomic.Pointer. Expecting linter to complain. FYI, for golang legacy reasons, although copying atomic.Value is invalid, CI does not complain because it does not have nocopy field in it.

Copy link
Member

@pcw109550 pcw109550 Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I found I can trigger CIs for external branches.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arnaudbriche Can you please share the profiling results after the change?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, here it is: profile.pb.gz | profile.svg

Regarding linter, as you said, it's related to nocopy, but the actual behaviour is the same as previous version of the code does copy the transactions structs holding atomic.Value.

@pcw109550
Copy link
Member

pcw109550 commented Mar 11, 2024

@arnaudbriche
We have three options.

  1. Merge this PR directly. This will
  • (Bad) Every followup commit's linter will break.
  • (Somewhat good? ¯\_(ツ) _/¯) This PR is merged.
  1. Fix the nocopy issue in this PR. This will
  • (Bad) May cause merge conflicts while fetching upstream
  • (Bad) Much more complicated; and have no review inputs from upstream maintainers.
  • (Bad) Unlocking can of worms in this PR; must be separated because it is a separate issue.
  • (Good) Linter CI will pass.
  1. Fix the upstream first, and wait till the patch trickles down to here. This will
  • (Bad) Will quite take some time.
  • (Good) Can fix the root cause.
  • (Good) Can get reviews from upstream maintainers.
  • (Good) After some time, Linter CI will pass.

I have selected option 3 after experimenting at #143. Also, upstream PR is opened at ledgerwatch#9667.

Lets take some time to make the patch trickle into downstream. It would take some time to merge this PR, but I think it is the best option. I hope you understand.

Also, can you confirm that your issue is solved? You have attached the profiling results at here, and they seem to solve the CPU consumption issue. However, can you confirm that issue is fixed?

@arnaudbriche
Copy link
Author

@pcw109550
Make total sense. I will keep my fork alive until you can merge this PR.

Also, can you confirm that your issue is solved? You have attached the profiling results at #135 (comment), and they seem to solve the CPU consumption issue. However, can you confirm that issue is fixed?

I confirm this fix makes the perf issue disappear completely.

@arnaudbriche
Copy link
Author

Hi guys,

Anything new regarding this issue ?

@ImTei
Copy link
Member

ImTei commented Jun 17, 2024

@arnaudbriche Hi, sorry for the delay in updating this issue. We opened a follow-up PR in the upstream repo, but it's still pending. So I think we should merge this PR for now. This will cause CI failure, but we can handle it by ignoring nocopy linter.
Please back merge latest op-erigon branch and resolve confliction. we will merge this PR and handle CI failure in the following PR. Thank you!

@arnaudbriche
Copy link
Author

@ImTei I just rebased from upstream as you wanted.

@ImTei
Copy link
Member

ImTei commented Jun 17, 2024

@arnaudbriche Thank you! we will merge this and handle CI error, and this fix will be available in the next release. We will let you know in the issue(#117) once it's released.

@ImTei ImTei merged commit 6021b13 into testinprod-io:op-erigon Jun 17, 2024
2 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants