
Performance Regression - 12% degradation of write throughput #8159

Closed
roydahan opened this issue Feb 24, 2021 · 26 comments
Labels: status/regression, symptom/performance (Issues causing performance problems)

@roydahan (Author)

Installation details
Affected version: 666.development 20200727(a7df848) (4.3 is also affected).

Write throughput BEFORE: 124K ops.
Write throughput AFTER: 110K ops.

After bisecting the range of commits between 20200724 (d08e22c) and 20200727 (a7df848),
the commit that caused the degradation was found to be 3f84d41.

I used the test_write throughput test for bisecting.
The degradation was masked by a temporary change we made between these two builds, when we switched from the el7 kernel stream to the Amazon Linux 2 kernel.
Hence, I used ami-0eedbef0ebb289b75 as the base AMI, which runs kernel 5.7.10-1.el7.elrepo.x86_64.

@roydahan added the symptom/performance label on Feb 24, 2021
@roydahan (Author)

@slivne please assign someone.

@slivne (Contributor) commented Feb 25, 2021

the commit is:

3f84d41 Merge "messaging: make verb handler registering independent of current scheduling group" from Botond

@denesb / @bhalevy

@avikivity (Member)

@denesb wrote (in 3f84d41):

"This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore."

I think fixing this bad behavior may be the cause of the regression (i.e. the bad behavior caused a performance improvement). If I'm correct, then 0c6bbc8 should show a performance improvement compared to the previous commit.

The way that 0c6bbc8 "helps" is by using the system reader concurrency semaphore, which has lower concurrency, and so thrashes the cpu caches less. We could prove this by running 3f84d41 with the user reader concurrency semaphore limit reduced to match the system reader concurrency semaphore.

Longer term the fix is #5718 (if I'm correct).
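
To make the cache argument concrete, here is a minimal, self-contained sketch of the idea behind a reader concurrency semaphore (this is not Scylla's reader_concurrency_semaphore; the limit value and the helper names are invented for illustration): by capping how many reads are in flight at once, the combined working set of the active reads stays small enough to remain resident in the CPU caches.

```cpp
// Conceptual sketch only: a fixed-count semaphore bounding read concurrency,
// in the spirit of a reader concurrency semaphore. All names are illustrative.
#include <semaphore>
#include <thread>
#include <vector>
#include <cstdio>

// Hypothetical per-shard limit; the real limits differ between the
// "user" and "system" semaphores discussed above.
constexpr int max_concurrent_reads = 10;
std::counting_semaphore<max_concurrent_reads> read_permits(max_concurrent_reads);

void read_one_partition(int id) {
    // Placeholder for the actual read work; what matters is that at most
    // max_concurrent_reads of these run at a time, keeping the set of
    // "hot" read states small enough to stay in the CPU caches.
    std::printf("read %d\n", id);
}

void handle_read(int id) {
    read_permits.acquire();   // wait for a free slot
    read_one_partition(id);
    read_permits.release();   // hand the slot to the next read
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 100; ++i) {
        workers.emplace_back(handle_read, i);
    }
    for (auto& t : workers) {
        t.join();
    }
}
```

With the (made-up) limit of 10 instead of, say, 100, at most 10 read states compete for cache at any moment, which is the effect avikivity attributes to the stricter system semaphore.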

@slivne (Contributor) commented Feb 25, 2021

@avikivity let's try to make it clearer: do you want us to try to apply 0c6bbc8 and retest?

@denesb (Contributor) commented Feb 25, 2021

> @denesb wrote (in 3f84d41):
>
> "This caused all sorts of problems, even beyond user queries running in
> the system group. Also as of 0c6bbc8 queries on the replicas are
> classified based on the scheduling group they are running on, so user
> reads also ended up using the system concurrency semaphore."
>
> I think fixing this bad behavior may be the cause of the regression (i.e. the bad behavior caused a performance improvement). If I'm correct, then 0c6bbc8 should show a performance improvement compared to the previous commit.
>
> The way that 0c6bbc8 "helps" is by using the system reader concurrency semaphore, which has lower concurrency, and so thrashes the cpu caches less. We could prove this by running 3f84d41 with the user reader concurrency semaphore limit reduced to match the system reader concurrency semaphore.
>
> Longer term the fix is #5718 (if I'm correct).

Interesting hypothesis. So you think we've seen a false improvement due to 0c6bbc8 restricting the read concurrency, freeing up resources for writes. This presumes a mixed workload; @roydahan, does this test execute a mixed workload? Another alternative explanation is that writes too were mistakenly running in the system group, enjoying the elevated shares. These elevated shares should affect reads too, although this might be countered by the much more restricted concurrency. Note, though, that in both cases the bug introduced in 0c6bbc8 only affected reads/writes that arrived from a remote coordinator, so not 100% of the operations are affected.
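
For context, a rough sketch of what "classified based on the scheduling group they are running on" means on the replica side (this is not Scylla's actual code; the group names, semaphore limits, and classify helper are all hypothetical): the semaphore is chosen from the group the request happens to be executing in, so a user read that arrives in the default/system group is throttled by the system semaphore.

```cpp
// Illustrative only: semaphore selection keyed on the *current* scheduling
// group, which is how a user read running in the wrong group ends up being
// throttled by the system semaphore. All names and limits are hypothetical.
#include <cstdio>

enum class scheduling_group { system_default, statement };

struct concurrency_semaphore { const char* name; int max_concurrency; };

concurrency_semaphore system_sem{"system", 10};
concurrency_semaphore user_sem{"user", 100};

// Before 3f84d41, a verb handler registered from the wrong context could run
// user requests under system_default; this classifier then picks system_sem.
concurrency_semaphore& classify(scheduling_group current) {
    return current == scheduling_group::statement ? user_sem : system_sem;
}

int main() {
    // A user read that leaked into the default group:
    auto& sem = classify(scheduling_group::system_default);
    std::printf("user read classified into the '%s' semaphore (limit %d)\n",
                sem.name, sem.max_concurrency);
}
```

This is why registering verb handlers independently of the caller's current scheduling group (3f84d41) changes which semaphore, and which shares, remote requests end up using.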

@avikivity (Member)

> @denesb wrote (in 3f84d41):
> "This caused all sorts of problems, even beyond user queries running in
> the system group. Also as of 0c6bbc8 queries on the replicas are
> classified based on the scheduling group they are running on, so user
> reads also ended up using the system concurrency semaphore."
> I think fixing this bad behavior may be the cause of the regression (i.e. the bad behavior caused a performance improvement). If I'm correct, then 0c6bbc8 should show a performance improvement compared to the previous commit.
> The way that 0c6bbc8 "helps" is by using the system reader concurrency semaphore, which has lower concurrency, and so thrashes the cpu caches less. We could prove this by running 3f84d41 with the user reader concurrency semaphore limit reduced to match the system reader concurrency semaphore.
> Longer term the fix is #5718 (if I'm correct).
>
> Interesting hypothesis. So you think we've seen a false improvement due to 0c6bbc8 restricting the read concurrency freeing up resources for writes. This presumes a mixed workload,

No, this would be bad even in a read-only workload. With high concurrency we need to cycle between unrelated reads, loading their state from the data cache each time and missing. With a smaller concurrency the data cache would be able to hold all the active reads.

Mind you, anything can be explained by cache effects; these theories are more or less worthless until proven.

> @roydahan does this test execute a mixed workload? Another alternative explanation is that writes too were mistakenly running in the system group, enjoying the elevated shares. This elevated shares should affect reads too, although this might be countered by the much restricted concurrency. Note though that in both cases the bug introduced in 0c6bbc8 only affected reads/writes that arrived from a remote coordinator, so not 100% of the operations are affected.

@roydahan (Author)

> @denesb wrote (in 3f84d41):
> "This caused all sorts of problems, even beyond user queries running in
> the system group. Also as of 0c6bbc8 queries on the replicas are
> classified based on the scheduling group they are running on, so user
> reads also ended up using the system concurrency semaphore."
> I think fixing this bad behavior may be the cause of the regression (i.e. the bad behavior caused a performance improvement). If I'm correct, then 0c6bbc8 should show a performance improvement compared to the previous commit.
> The way that 0c6bbc8 "helps" is by using the system reader concurrency semaphore, which has lower concurrency, and so thrashes the cpu caches less. We could prove this by running 3f84d41 with the user reader concurrency semaphore limit reduced to match the system reader concurrency semaphore.
> Longer term the fix is #5718 (if I'm correct).
>
> Interesting hypothesis. So you think we've seen a false improvement due to 0c6bbc8 restricting the read concurrency freeing up resources for writes. This presumes a mixed workload, @roydahan does this test execute a mixed workload?

No, it's a write-only test.

@avikivity (Member)

Haha, so the entire theory was worthless.

@avikivity (Member)

> @avikivity lets try to make it clearer - do you want us to try and apply 0c6bbc8 and retest ?

No, let's test 0c6bbc8 and its predecessor.

@slivne (Contributor) commented Mar 2, 2021

rpms

  • s3://roy-rpms/perf_8159_1/scylla_perf_0c6bbc84c.tar.gz
  • s3://roy-rpms/perf_8159_2/scylla_perf_097a5e9e0.tar.gz

Please note these are from before Scylla held all the submodules, so they don't contain everything, only Scylla itself. I hope it works.

@roydahan (Author) commented Mar 3, 2021

> rpms
>
>   • s3://roy-rpms/perf_8159_1/scylla_perf_0c6bbc84c.tar.gz

Good Result - 120K ops.

>   • s3://roy-rpms/perf_8159_2/scylla_perf_097a5e9e0.tar.gz

Bad Result - 110K ops.

> Please note these are prior to scylla holding all submodules so they do not contain everything only scylla - I hope it works

@avikivity (Member)

So, now we have to understand why 0c6bbc8 improved performance by almost 10%.

@slivne (Contributor) commented Mar 4, 2021

@avikivity - do you want Botond to look at this?

@slivne (Contributor) commented Mar 7, 2021

@denesb - it's all yours... let's come up with a theory of why this matters so much.

@denesb (Contributor) commented Mar 8, 2021

My immediate gut feeling is that by running in the system/default scheduling group, writes can somehow get higher concurrency or priority at the expense of compaction, and so sustain a higher throughput. However, the system scheduling group has the same amount of shares as the statement one, so using the former instead of the latter shouldn't yield any benefit in this regard. I don't know of any semaphore or other means of restricting concurrency on the write path that depends on the current scheduling group. I'll need to compare the metrics to be able to come up with a theory.

@roydahan can you re-run the test with all metrics captured and give me the metrics databases afterwards so I can compare them? Alternatively, you can explain to me how to run this test; I'd prefer not to come up with a test of my own.

@roydahan (Author) commented Mar 8, 2021

Sure, I’ll check if you can restore the monitors of both runs using hydra.

@roydahan (Author) commented Mar 9, 2021

@denesb here is a monitor from a bad run on 097a5e9.
http://52.2.44.149:3000/d/BK53z5UMk/scylla-per-server-metrics-nemesis-master?orgId=1&from=1615219862368&to=1615223610640&var-by=instance&var-cluster=&var-dc=All&var-node=All&var-shard=All

I just noticed it also has many warnings saying:
WARNING | scylla: [shard 0] cdc - Could not retrieve CDC streams with timestamp 2021/03/08 15:53:50: std::runtime_error (Could not find CDC generation with timestamp 2021/03/08 15:53:50 in distributed system tables (current time: 2021/03/08 16:42:36), even though some node gossiped about it.). Will retry again.

(I have no idea whether it's related or whether it also appears in successful runs, but I can check if you want.)

@denesb (Contributor) commented Mar 9, 2021

Thanks @roydahan, how long will this be up? Do you also have a monitor for a good run?

@roydahan (Author)

It's up until you tell me to terminate it.
I'll provide one from a good run as well.

@fgelcer commented Mar 11, 2021

@denesb , the monitor is currently alive and you can access it here

@slivne (Contributor) commented Mar 14, 2021

@denesb - do you need anything else?

@denesb (Contributor) commented Mar 15, 2021

No, the two monitors are all I need for now, thanks.

@denesb (Contributor) commented Mar 16, 2021

I think the key is this:
Good run (pre fix):
[screenshot]

Bad run (post fix):
[screenshot]

Notice how, in the case of the "good" run, roughly half of the work leaks to the main group due to the bug. This lets the write workload make use of two 1000-share groups, effectively using 2000 shares, suppressing compaction, which uses the same amount of shares in both runs. The result is higher write throughput at the expense of compaction throughput.
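
As a back-of-the-envelope check of that reading (assuming a simple proportional-shares model, which is only an approximation of what the scheduler actually does): with compaction at ~1000 shares and the writes spread over two ~1000-share groups, writes get about 2000/3000 ≈ 67% of the contended CPU and compaction ~33%; after the fix the split is 1000/2000 = 50% each, so writes give CPU back to compaction. That matches the direction of the regression, even if the exact magnitude also depends on other bottlenecks.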

@slivne (Contributor) commented Mar 16, 2021

@avikivity please review

@slivne (Contributor) commented Mar 16, 2021

Quoting Avi

"The bug favored writes over reads - and the fix balanced this back".

The expectation is that a mixed workload will work better with the fix.

@slivne closed this as completed on Mar 16, 2021
@avikivity (Member)

The scale factor between compaction backlog and compaction shares is arbitrary. It doesn't take into account the user's read-vs-write preference, or the size of mutations (small mutations are harder to compact, per byte). We need a way to adjust this scale factor.
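
As a rough illustration of what such an adjustable scale factor could look like (this is not Scylla's compaction backlog controller; the config structure, field names, and the linear mapping are assumptions): the shares given to compaction are derived from the measured backlog multiplied by a user-tunable factor, so the read-vs-write preference becomes an explicit knob rather than a hard-coded constant.

```cpp
// Hypothetical sketch: mapping a compaction backlog measure to scheduling
// shares through a tunable scale factor. Not Scylla's actual controller.
#include <algorithm>
#include <cstdio>

struct backlog_controller_config {
    float scale_factor = 1.0f;   // assumed user-tunable read-vs-write preference knob
    float min_shares = 100.0f;   // never starve compaction completely
    float max_shares = 1000.0f;  // never let compaction swamp the statement group
};

float compaction_shares(float backlog, const backlog_controller_config& cfg) {
    // Linear mapping chosen only for illustration; the point is that the
    // slope (scale_factor) is exposed instead of being hard-coded.
    float shares = cfg.scale_factor * backlog;
    return std::clamp(shares, cfg.min_shares, cfg.max_shares);
}

int main() {
    backlog_controller_config write_heavy{0.5f};  // favor write throughput
    backlog_controller_config read_heavy{2.0f};   // favor compaction (reads)
    for (float backlog : {100.0f, 400.0f, 800.0f}) {
        std::printf("backlog %6.1f -> shares %6.1f (write-heavy) / %6.1f (read-heavy)\n",
                    backlog,
                    compaction_shares(backlog, write_heavy),
                    compaction_shares(backlog, read_heavy));
    }
}
```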
