Performance Regression - 12% degradation of write throughput #8159
@slivne please assign someone.
"This caused all sorts of problems, even beyond user queries running in I think fixing this bad behavior may be the cause of the regression (i.e. the bad behavior caused a performance improvement). If I'm correct, then 0c6bbc8 should show a performance improvement compared to the previous commit. The way that 0c6bbc8 "helps" is by using the system reader concurrency semaphore, which has lower concurrency, and so thrashes the cpu caches less. We could prove this by running 3f84d41 with the user reader concurrency semaphore limit reduced to match the system reader concurrency semaphore. Longer term the fix is #5718 (if I'm correct). |
@avikivity let's try to make it clearer - do you want us to try and apply 0c6bbc8 and retest?
Interesting hypothesis. So you think we've seen a false improvement due to 0c6bbc8 restricting the read concurrency, freeing up resources for writes. This presumes a mixed workload - @roydahan, does this test execute a mixed workload? Another alternative explanation is that writes too were mistakenly running in the system group, enjoying the elevated shares. These elevated shares should affect reads too, although this might be countered by the much restricted concurrency. Note though that in both cases the bug introduced in 0c6bbc8 only affected reads/writes that arrived from a remote coordinator, so not 100% of the operations are affected.
No, this would be bad even in a read-only workload. With high concurrency we need to cycle between unrelated reads, loading their state from the data cache each time and missing. With a smaller concurrency the data cache would be able to hold all the active reads. Mind you, anything can be explained by cache effects; these theories are more or less worthless until proven.
No, it's a write-only test.
Haha, so the entire theory was worthless.
No, let's test 0c6bbc8 and its predecessor.
RPMs
Please note these are from before scylla held all submodules, so they contain only scylla, not everything - I hope it works.
Good Result - 120K ops.
Bad Result - 110K ops.
So, now we have to understand why 0c6bbc8 improved performance by almost 10%.
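(For reference, the arithmetic behind the percentages quoted in this thread: 110K → 120K ops is (120 − 110) / 110 ≈ 9%, hence "almost 10%"; the 124K → 110K drop in the issue description is (124 − 110) / 124 ≈ 11%, which the title rounds to 12%.)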
@avikivity - do you want Botond to look at this?
@denesb - it's all yours... let's come up with a theory for why this matters so much.
My immediate gut feeling is that by running in the system/default scheduling group, writes can somehow have a higher concurrency or priority at the expense of compactions, and are thus able to sustain a higher throughput. However, the system scheduling group has the same amount of shares as the statement one, so using the former instead of the latter shouldn't yield any benefits in this regard. I don't know of any semaphore or other means of restricting concurrency that depends on the current scheduling group on the write path. I'll need to compare the metrics to be able to come up with a theory. @roydahan, can you re-run the test with all metrics captured and give me the metrics databases afterwards so I can compare them? Alternatively, you can explain to me how to run this test. I'd prefer not to come up with a test of my own.
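For context, a minimal sketch of the shares point above, assuming the Seastar scheduling-group API (create_scheduling_group / with_scheduling_group; exact signatures vary between Seastar versions, and the share value of 1000 is arbitrary for the example): two groups created with equal shares are treated the same by the CPU scheduler, so merely moving writes from the "statement" group to a "system"-like group with the same shares should not, by itself, buy extra CPU time.

```cpp
// Hedged sketch of the Seastar scheduling-group mechanism, simplified.
#include <seastar/core/app-template.hh>
#include <seastar/core/scheduling.hh>
#include <seastar/core/thread.hh>

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        return seastar::async([] {
            // Both groups are created with the same number of shares,
            // mirroring the claim above that "system" and "statement"
            // have equal shares.
            auto statement = seastar::create_scheduling_group("statement", 1000).get();
            auto system    = seastar::create_scheduling_group("system", 1000).get();
            // Work submitted under either group competes for CPU in
            // proportion to its shares; equal shares -> equal treatment.
            seastar::with_scheduling_group(statement, [] { /* write path */ }).get();
            seastar::with_scheduling_group(system,    [] { /* write path */ }).get();
        });
    });
}
```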
Sure, I’ll check if you can restore the monitors of both runs using hydra.
@denesb here is a monitor from a bad run on 097a5e9. I just noticed it also has many warnings saying: [...] (I have no idea if it's related or whether these also appear in successful runs, but I can check if you want).
Thanks @roydahan, how long will this be up? Do you also have a monitor for a good run?
It's up until you tell me to terminate it.
@denesb - do you need anything else?
No, the two monitors are all I need for now, thanks.
@avikivity please review
Quoting Avi: "The bug favored writes over reads - and the fix balanced this back." The expectation is that a mixed workload will work better with the fix.
The scale factor between compaction backlog and compaction shares is arbitrary. It doesn't take into account the user's read-vs-write preference, or the size of mutations (small mutations are harder to compact, per byte). We need a way to adjust this scale factor.
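A hedged sketch of what such a tunable scale factor could look like. This is not Scylla's actual backlog controller; the function, bounds and factor values are invented purely for illustration of the idea above:

```cpp
// Illustrative sketch only: compaction shares derived from the compaction
// backlog through a single scale factor. Because the factor is one fixed
// constant today, it cannot express "I care more about reads than writes",
// nor that small mutations cost more to compact per byte; making it tunable
// is what the comment above asks for.
#include <algorithm>
#include <cstdio>

// Hypothetical bounds, chosen only for the example.
constexpr float kMinShares = 50.0f;
constexpr float kMaxShares = 1000.0f;

float compaction_shares(float backlog, float scale_factor) {
    // Larger backlog -> more shares for compaction -> less CPU for the
    // foreground read/write workload, and vice versa.
    return std::clamp(backlog * scale_factor, kMinShares, kMaxShares);
}

int main() {
    const float backlog = 3.0f; // dimensionless backlog measure
    // A user who prefers foreground write throughput would pick a smaller
    // factor; one who prefers to keep the backlog (and read amplification)
    // low would pick a larger one.
    std::printf("write-biased: %.0f shares\n", compaction_shares(backlog, 50.0f));
    std::printf("read-biased:  %.0f shares\n", compaction_shares(backlog, 200.0f));
}
```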
Installation details
Affected version: 666.development 20200727(a7df848) (4.3 is also affected).
Write throughput BEFORE: 124K ops.
Write throughput AFTER: 110K ops.
After bisecting the range of commits between 20200724(d08e22c) and 20200727(a7df848),
the commit that caused the degradation was found to be 3f84d41.
I used throughput test_write for bisecting.
The degradation was masked by the temporary change we made between these two builds, when we switched from the el7 kernel stream to the Amazon Linux 2 kernel.
Hence, I used ami-0eedbef0ebb289b75 as the base AMI, which runs kernel 5.7.10-1.el7.elrepo.x86_64.