Coroutinize commit log #8954
Conversation
Please rebase on top of #8980.
Looks mostly good (the complexity of commitlog is still astounding). I'm concerned about regressions. Please check whether perf_simple_query --write uses commitlog (add an option if not). It reports instructions per op and allocations per op as well as throughput, so we can compare before/after. We can tolerate a small regression and improve it later, but if it's large (>3%) we'll need to think.
Fair enough. Will do.
master:
coroutine:
Seems to be a small perf drop. Best tasks/op for master is 14.155 vs 14.115 for coroutine, a ~0.3% drop.
Rebased on master.
ping?
The numbers are extremely noisy. The only one I trust is allocs_per_op, and it shows the 3 extra allocations I pointed out. To get less noisy numbers, please run on tmpfs, set the cpu frequency to some fixed value, and run with --smp 1 --task-quota-ms 10; this gives more stable numbers. I'll also add instructions per op to the json report (it's already in the text report); it's more stable than tps. I think we should un-coroutinize those three allocations (or maybe un-coroutinize just the fast paths), but we can also decide based on more accurate perf stats.
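For reference, the low-noise setup asked for above might look something like the following sketch. It is not runnable as-is: it assumes Linux with cpupower installed, root access, and a locally built perf_simple_query; mount point and tmpfs size are arbitrary choices.

```
# Sketch of a lower-noise benchmark environment (requires root):
sudo mount -t tmpfs -o size=4g tmpfs /mnt/bench    # keep I/O in RAM, off the disk
sudo cpupower frequency-set -g performance         # pin the cpu frequency governor
./perf_simple_query --write --smp 1 --task-quota-ms 10
```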
Ran on tmpfs with the opts above. Unfortunately, I could not get any instruction counts (probably because of running inside a container, which is weird, since it is unconfined). I can maybe spin up an AWS instance or similar and test there instead. old:
coroutine:
And with instruction count: about a 1.4% increase in instructions. Not great, but perhaps not unexpected without actually trying to fold coroutines. old:
coroutine:
"opts above" is the de-coroutinization of the fast paths? Maybe they aren't effective? Hard to believe without switching to batch mode and running on a slow disk.
No, "opts above" as in the
The batch mode comment was in the context of me thinking you had applied the de-coroutinization, and attempting to explain why it didn't help. Given the optimizations weren't made, it's not surprising they didn't help. Batch mode still amortizes, so it's not one more alloc; it's three allocs per {as many writes as fit in a disk write}, which can be quite a lot.
So please try undoing those trouble spots and see if it helps. It should recover all of the performance, since the other cases are executed rarely, and those should be faster with coroutines (once you wait, you get fewer allocs with coroutines than with continuations).
Yes, but batch mode is likely to run into suspend states anyway (i.e., waiting for I/O), in which case raw futures will allocate as well, so there is not much difference.
Dropping the patches for pure alloc yields: coroutine:
It means we missed a fast-path coroutine. Perhaps it can be split into a non-coroutine fast path and a coroutine slow path.
Most likely
Yup. Verified. Dropping coroutine in
The calling frame keeps the object alive in all paths. Use references in allocate()/allocate_when_possible().
Change args to values so they stay on the coroutine frame. Remove pointless subscription/stream usage; just iterate.
Removes 2 coroutine frames in the fast path (as long as a segment + space is available). Puts IPS back on track with master.
And call it by value with the polymorphic writers. This eliminates the outer coroutine frame and ensures we use only one for fast-case allocation.
Rebased, removed gate_guard (using gate::holder now), and updated the PR blurb with perf info.
Le ping?
No real refactoring, just moving the various methods to coroutines. Because coroutines are neat.
Broken down into one method per change to make review easier. And hoping I get tipped per change.
The grand idea being that using coroutines will eventually make real refactoring easier.
Unit tests + relevant dtest.
As discussed below, simply coroutinizing the code will, at least in the fast path, cause the
slightly naive compiler to generate multiple unneeded coroutine frames, dropping raw performance a bit.
The last two patches in this series address this by breaking the fast path into non-coroutine
subroutines (no futures involved) and one coroutine main loop.
Results, as collected by perf_simple_query, are:

Master (before changes):

This PR (coroutines + fast path optimization patches):

I.e. both allocations/op and instruction count seem to be on par.