
Performance Regression - 12% degradation of write throughput #8159

Closed
roydahan opened this issue Feb 24, 2021 · 26 comments
Labels: status/regression, symptom/performance (Issues causing performance problems)

@roydahan (Author)

Installation details
Affected version: 666.development 20200727(a7df848) (4.3 is also affected).

Write throughput BEFORE: 124K ops.
Write throughput AFTER: 110K ops.

After bisecting the range of commits between 20200724 (d08e22c) and 20200727 (a7df848),
the commit that caused the degradation was found to be 3f84d41.

I used the test_write throughput test for bisecting.
The degradation was masked by a temporary change we made between these two builds, when we switched from the el7 kernel stream to the Amazon Linux 2 kernel.
Hence, I used ami-0eedbef0ebb289b75 as the base AMI, which runs kernel 5.7.10-1.el7.elrepo.x86_64.

@roydahan added the symptom/performance label on Feb 24, 2021
@roydahan (Author)

@slivne please assign someone.

@slivne (Contributor) commented Feb 25, 2021

the commit is:

3f84d41 Merge "messaging: make verb handler registering independent of current scheduling group" from Botond

@denesb / @bhalevy

@avikivity (Member)

@denesb wrote (in 3f84d41):

"This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore."

I think fixing this bad behavior may be the cause of the regression (i.e. the bad behavior caused a performance improvement). If I'm correct, then 0c6bbc8 should show a performance improvement compared to the previous commit.

The way that 0c6bbc8 "helps" is by using the system reader concurrency semaphore, which has lower concurrency, and so thrashes the cpu caches less. We could prove this by running 3f84d41 with the user reader concurrency semaphore limit reduced to match the system reader concurrency semaphore.

Longer term the fix is #5718 (if I'm correct).
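
To make the cache argument concrete, here is a minimal, self-contained sketch of the idea behind a reader concurrency semaphore (this is not Scylla's reader_concurrency_semaphore; the limit value and the helper names are invented for illustration): by capping how many reads are in flight at once, the combined working set of the active reads stays small enough to remain resident in the CPU caches.

```cpp
// Conceptual sketch only: a fixed-count semaphore bounding read concurrency,
// in the spirit of a reader concurrency semaphore. All names are illustrative.
#include <semaphore>
#include <thread>
#include <vector>
#include <cstdio>

// Hypothetical per-shard limit; the real limits differ between the
// "user" and "system" semaphores discussed above.
constexpr int max_concurrent_reads = 10;
std::counting_semaphore<max_concurrent_reads> read_permits(max_concurrent_reads);

void read_one_partition(int id) {
    // Placeholder for the actual read work; what matters is that at most
    // max_concurrent_reads of these run at a time, keeping the set of
    // "hot" read states small enough to stay in the CPU caches.
    std::printf("read %d\n", id);
}

void handle_read(int id) {
    read_permits.acquire();   // wait for a free slot
    read_one_partition(id);
    read_permits.release();   // hand the slot to the next read
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 100; ++i) {
        workers.emplace_back(handle_read, i);
    }
    for (auto& t : workers) {
        t.join();
    }
}
```

With the (made-up) limit of 10 instead of, say, 100, at most 10 read states compete for cache at any moment, which is the effect avikivity attributes to the stricter system semaphore.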

@slivne (Contributor) commented Feb 25, 2021

@avikivity let's try to make it clearer: do you want us to try to apply 0c6bbc8 and retest?

@denesb (Contributor) commented Feb 25, 2021

> @denesb wrote (in 3f84d41):
>
> "This caused all sorts of problems, even beyond user queries running in
> the system group. Also as of 0c6bbc8 queries on the replicas are
> classified based on the scheduling group they are running on, so user
> reads also ended up using the system concurrency semaphore."
>
> I think fixing this bad behavior may be the cause of the regression (i.e. the bad behavior caused a performance improvement). If I'm correct, then 0c6bbc8 should show a performance improvement compared to the previous commit.
>
> The way that 0c6bbc8 "helps" is by using the system reader concurrency semaphore, which has lower concurrency, and so thrashes the cpu caches less. We could prove this by running 3f84d41 with the user reader concurrency semaphore limit reduced to match the system reader concurrency semaphore.
>
> Longer term the fix is #5718 (if I'm correct).

Interesting hypothesis. So you think we've seen a false improvement due to 0c6bbc8 restricting the read concurrency, freeing up resources for writes. This presumes a mixed workload; @roydahan, does this test execute a mixed workload? Another alternative explanation is that writes too were mistakenly running in the system group, enjoying the elevated shares. These elevated shares should affect reads too, although this might be countered by the much more restricted concurrency. Note, though, that in both cases the bug introduced in 0c6bbc8 only affected reads/writes that arrived from a remote coordinator, so not 100% of the operations are affected.
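
For context, a rough sketch of what "classified based on the scheduling group they are running on" means on the replica side (this is not Scylla's actual code; the group names, semaphore limits, and classify helper are all hypothetical): the semaphore is chosen from the group the request happens to be executing in, so a user read that arrives in the default/system group is throttled by the system semaphore.

```cpp
// Illustrative only: semaphore selection keyed on the *current* scheduling
// group, which is how a user read running in the wrong group ends up being
// throttled by the system semaphore. All names and limits are hypothetical.
#include <cstdio>

enum class scheduling_group { system_default, statement };

struct concurrency_semaphore { const char* name; int max_concurrency; };

concurrency_semaphore system_sem{"system", 10};
concurrency_semaphore user_sem{"user", 100};

// Before 3f84d41, a verb handler registered from the wrong context could run
// user requests under system_default; this classifier then picks system_sem.
concurrency_semaphore& classify(scheduling_group current) {
    return current == scheduling_group::statement ? user_sem : system_sem;
}

int main() {
    // A user read that leaked into the default group:
    auto& sem = classify(scheduling_group::system_default);
    std::printf("user read classified into the '%s' semaphore (limit %d)\n",
                sem.name, sem.max_concurrency);
}
```

This is why registering verb handlers independently of the caller's current scheduling group (3f84d41) changes which semaphore, and which shares, remote requests end up using.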

@avikivity (Member)

> @denesb wrote (in 3f84d41):
> "This caused all sorts of problems, even beyond user queries running in
> the system group. Also as of 0c6bbc8 queries on the replicas are
> classified based on the scheduling group they are running on, so user
> reads also ended up using the system concurrency semaphore."
> I think fixing this bad behavior may be the cause of the regression (i.e. the bad behavior caused a performance improvement). If I'm correct, then 0c6bbc8 should show a performance improvement compared to the previous commit.
> The way that 0c6bbc8 "helps" is by using the system reader concurrency semaphore, which has lower concurrency, and so thrashes the cpu caches less. We could prove this by running 3f84d41 with the user reader concurrency semaphore limit reduced to match the system reader concurrency semaphore.
> Longer term the fix is #5718 (if I'm correct).
>
> Interesting hypothesis. So you think we've seen a false improvement due to 0c6bbc8 restricting the read concurrency freeing up resources for writes. This presumes a mixed workload,

No, this would be bad even in a read-only workload. With high concurrency we need to cycle between unrelated reads, loading their state from the data cache each time and missing. With a smaller concurrency the data cache would be able to hold all the active reads.

Mind you, anything can be explained by cache effects; these theories are more or less worthless until proven.

> @roydahan does this test execute a mixed workload? Another alternative explanation is that writes too were mistakenly running in the system group, enjoying the elevated shares. This elevated shares should affect reads too, although this might be countered by the much restricted concurrency. Note though that in both cases the bug introduced in 0c6bbc8 only affected reads/writes that arrived from a remote coordinator, so not 100% of the operations are affected.

@roydahan (Author)

> @denesb wrote (in 3f84d41):
> "This caused all sorts of problems, even beyond user queries running in
> the system group. Also as of 0c6bbc8 queries on the replicas are
> classified based on the scheduling group they are running on, so user
> reads also ended up using the system concurrency semaphore."
> I think fixing this bad behavior may be the cause of the regression (i.e. the bad behavior caused a performance improvement). If I'm correct, then 0c6bbc8 should show a performance improvement compared to the previous commit.
> The way that 0c6bbc8 "helps" is by using the system reader concurrency semaphore, which has lower concurrency, and so thrashes the cpu caches less. We could prove this by running 3f84d41 with the user reader concurrency semaphore limit reduced to match the system reader concurrency semaphore.
> Longer term the fix is #5718 (if I'm correct).
>
> Interesting hypothesis. So you think we've seen a false improvement due to 0c6bbc8 restricting the read concurrency freeing up resources for writes. This presumes a mixed workload, @roydahan does this test execute a mixed workload?

No, it's a write-only test.

@avikivity (Member)

Haha, so the entire theory was worthless.

@avikivity (Member)

> @avikivity lets try to make it clearer - do you want us to try and apply 0c6bbc8 and retest ?

No, let's test 0c6bbc8 and its predecessor.

@slivne (Contributor) commented Mar 2, 2021

rpms

  • s3://roy-rpms/perf_8159_1/scylla_perf_0c6bbc84c.tar.gz
  • s3://roy-rpms/perf_8159_2/scylla_perf_097a5e9e0.tar.gz

Please note these are from before Scylla held all the submodules, so they don't contain everything, only Scylla itself. I hope it works.

@roydahan (Author) commented Mar 3, 2021

> rpms
>
>   • s3://roy-rpms/perf_8159_1/scylla_perf_0c6bbc84c.tar.gz

Good Result - 120K ops.

>   • s3://roy-rpms/perf_8159_2/scylla_perf_097a5e9e0.tar.gz

Bad Result - 110K ops.

> Please note these are prior to scylla holding all submodules so they do not contain everything only scylla - I hope it works

@avikivity (Member)

So, now we have to understand why 0c6bbc8 improved performance by almost 10%.

@slivne (Contributor) commented Mar 4, 2021

@avikivity - do you want Botond to look at this?

@slivne (Contributor) commented Mar 7, 2021

@denesb - it's all yours... let's come up with a theory of why this matters so much.

@denesb (Contributor) commented Mar 8, 2021

My immediate gut feeling is that by running in the system/default scheduling group, writes can somehow get higher concurrency or priority at the expense of compaction, and so sustain a higher throughput. However, the system scheduling group has the same amount of shares as the statement one, so using the former instead of the latter shouldn't yield any benefit in this regard. I don't know of any semaphore or other means of restricting concurrency on the write path that depends on the current scheduling group. I'll need to compare the metrics to be able to come up with a theory.

@roydahan can you re-run the test with all metrics captured and give me the metrics databases afterwards so I can compare them? Alternatively, you can explain to me how to run this test; I'd prefer not to come up with a test of my own.

@roydahan (Author) commented Mar 8, 2021

Sure, I’ll check if you can restore the monitors of both runs using hydra.

@roydahan (Author) commented Mar 9, 2021

@denesb here is a monitor from a bad run on 097a5e9.
http://52.2.44.149:3000/d/BK53z5UMk/scylla-per-server-metrics-nemesis-master?orgId=1&from=1615219862368&to=1615223610640&var-by=instance&var-cluster=&var-dc=All&var-node=All&var-shard=All

I just noticed it also has many warnings saying:
WARNING | scylla: [shard 0] cdc - Could not retrieve CDC streams with timestamp 2021/03/08 15:53:50: std::runtime_error (Could not find CDC generation with timestamp 2021/03/08 15:53:50 in distributed system tables (current time: 2021/03/08 16:42:36), even though some node gossiped about it.). Will retry again.

(I have no idea whether it's related or whether it also appears in successful runs, but I can check if you want.)

@denesb (Contributor) commented Mar 9, 2021

Thanks @roydahan, how long will this be up? Do you also have a monitor for a good run?

@roydahan (Author)

It's up until you tell me to terminate it.
I'll provide one from a good run as well.

@fgelcer commented Mar 11, 2021

@denesb , the monitor is currently alive and you can access it here

@slivne (Contributor) commented Mar 14, 2021

@denesb - do you need anything else?

@denesb (Contributor) commented Mar 15, 2021

No, the two monitors are all I need for now, thanks.

@denesb (Contributor) commented Mar 16, 2021

I think the key is this:
Good run (pre fix):
[screenshot]

Bad run (post fix):
[screenshot]

Notice how, in the case of the "good" run, roughly half of the work leaks to the main group due to the bug. This lets the write workload make use of two 1000-share groups, effectively using 2000 shares, suppressing compaction, which uses the same amount of shares in both runs. The result is higher write throughput at the expense of compaction throughput.
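
As a back-of-the-envelope check of that reading (assuming a simple proportional-shares model, which is only an approximation of what the scheduler actually does): with compaction at ~1000 shares and the writes spread over two ~1000-share groups, writes get about 2000/3000 ≈ 67% of the contended CPU and compaction ~33%; after the fix the split is 1000/2000 = 50% each, so writes give CPU back to compaction. That matches the direction of the regression, even if the exact magnitude also depends on other bottlenecks.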

@slivne (Contributor) commented Mar 16, 2021

@avikivity please review

@slivne (Contributor) commented Mar 16, 2021

Quoting Avi

"The bug favored writes over reads - and the fix balanced this back".

The expectation is that a mixed workload will work better with the fix.

@slivne closed this as completed on Mar 16, 2021
@avikivity (Member)

The scale factor between compaction backlog and compaction shares is arbitrary. It doesn't take into account the user's read-vs-write preference, or the size of mutations (small mutations are harder to compact, per byte). We need a way to adjust this scale factor.
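
As a rough illustration of what such an adjustable scale factor could look like (this is not Scylla's compaction backlog controller; the config structure, field names, and the linear mapping are assumptions): the shares given to compaction are derived from the measured backlog multiplied by a user-tunable factor, so the read-vs-write preference becomes an explicit knob rather than a hard-coded constant.

```cpp
// Hypothetical sketch: mapping a compaction backlog measure to scheduling
// shares through a tunable scale factor. Not Scylla's actual controller.
#include <algorithm>
#include <cstdio>

struct backlog_controller_config {
    float scale_factor = 1.0f;   // assumed user-tunable read-vs-write preference knob
    float min_shares = 100.0f;   // never starve compaction completely
    float max_shares = 1000.0f;  // never let compaction swamp the statement group
};

float compaction_shares(float backlog, const backlog_controller_config& cfg) {
    // Linear mapping chosen only for illustration; the point is that the
    // slope (scale_factor) is exposed instead of being hard-coded.
    float shares = cfg.scale_factor * backlog;
    return std::clamp(shares, cfg.min_shares, cfg.max_shares);
}

int main() {
    backlog_controller_config write_heavy{0.5f};  // favor write throughput
    backlog_controller_config read_heavy{2.0f};   // favor compaction (reads)
    for (float backlog : {100.0f, 400.0f, 800.0f}) {
        std::printf("backlog %6.1f -> shares %6.1f (write-heavy) / %6.1f (read-heavy)\n",
                    backlog,
                    compaction_shares(backlog, write_heavy),
                    compaction_shares(backlog, read_heavy));
    }
}
```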
