
feature(advanced RPC compression): enable dictionary training and ZSTD compression #7401

Draft · wants to merge 1 commit into base: master
Conversation

michoecho

This patch:

  • enables dictionary training (by changing `rpc_dict_training_when` from the default value of 'never' to 'when_leader'),
  • increases the frequency of dictionary training and updates by 4x, to stress the feature more,
  • enables ZSTD internode compression (provided that `internode_compression` is also enabled) by giving it a nonzero CPU usage limit.

However, this patch doesn't enable internode_compression itself, because it's a performance-affecting option. To actually put the feature to use, a test must enable internode_compression on its own.
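Spelled out, the three changes above would surface in each node's scylla.yaml roughly as follows. `rpc_dict_training_when` and `internode_compression` are the options named in this PR; the commented lines use placeholders only, since the patch description doesn't spell out the names of the training-frequency and CPU-limit options:

```yaml
# Sketch of the resulting scylla.yaml fragment (placeholders marked below).
rpc_dict_training_when: when_leader   # changed from the default 'never'

# Placeholders: the real names of the training-frequency and ZSTD
# CPU-limit options are not given in this PR description.
# <training-interval option>: <default / 4>       # train/update dictionaries 4x more often
# <zstd CPU-limit option>: <some nonzero value>   # a nonzero limit enables ZSTD

# Deliberately NOT set by this patch; each test must opt in on its own:
# internode_compression: all
```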

Testing

  • Syntax check.
  • Untested.

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Refs #7364

@michoecho michoecho marked this pull request as draft May 6, 2024 08:07
@michoecho
Author

Marking as draft, because I haven't tested it.

This PR enables (I hope) dictionary training by default. It should be non-invasive, so there should be no reason to enable it only for a subset of tests.

The next step would be to also enable internode_compression: all for some longevity tests. It can't be enabled by default for all tests because it could disturb performance tests.

@michoecho michoecho requested a review from soyacz May 6, 2024 08:10
@michoecho
Author

I have no idea how to test this.

@fruch added the test-provision-aws (Run provision test on AWS), test-provision-gce (Run provision test on GCE), and test-provision-azure labels May 6, 2024
@fruch
Contributor

fruch commented May 6, 2024

> I have no idea how to test this.

I've enabled the labels, so we can have a basic sanity run for it (even though it defaults to running 5.2; we'll need to run it again with latest:master).

But you should define the specific cases you want to try, i.e. a longevity test for which you have a reference run to compare with,
for example tier1 (which are run on a weekly basis with master):

https://argus.scylladb.com/workspace?state=WyJlMTdmZjM5YS1mM2Y2LTRhYzYtYTE2MS04NzEyNjdmZDM5NTUiLCI3ZjYwNDIxMy01Y2FkLTQ0MTgtOTIxZS0xY2E5ODkxODA3MWMiLCI5OTY1MDFhZC1jMTM3LTQwZWYtOGRjYy1hN2QxMmYwYmEzYjYiLCI0ODEyMmRmZS0xNDhiLTRhOTktYTQzYy0xYTcxYzY4Njk5NzMiLCI0MjhiMmZiMy1lOWFjLTQyNzYtYjcxYS05YmY0YjQ1YTQzMWEiLCI5ODA1MDczMi1kZmUzLTQ2NGMtYTY2YS1mMjM1YmFkMzA4MjkiLCJlY2Q0OTdjMC04MmQ2LTQyNjktYjA1My1mNWMyMTU3ZTA0YWUiLCIxZTMzM2RmOC1hN2U4LTQxNzEtOGFiNy0xZDdiZGVhOTA3ZDUiXQ

You can start with short variations of some of those, like longevity-100gb-4h, which also runs quite regularly.

@@ -147,6 +147,11 @@ def set_endpoint_snitch(cls, endpoint_snitch: str):
internode_send_buff_size_in_bytes: int = None # 0
internode_recv_buff_size_in_bytes: int = None # 0
internode_compression: Literal['none', 'all', 'dc'] = None # "none"
rpc_dict_training_when: Literal['always', 'never', 'when_leader'] = 'when_leader'
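The diff above relies on a convention worth making explicit: attrs defaulted to None are omitted from the generated scylla.yaml, so Scylla's own default applies, while a concrete value (as this PR sets) is emitted for every node. A minimal standalone sketch of that convention (a simplified stand-in, not SCT's actual rendering code):

```python
# Minimal sketch (assumption: simplified stand-in for SCT's real attrs class)
# of the None-means-"use Scylla's default" convention in the diff above.
from dataclasses import dataclass, asdict
from typing import Literal, Optional

@dataclass
class ScyllaYamlAttrs:
    # None => key omitted from scylla.yaml; Scylla's built-in default ("none") applies
    internode_compression: Optional[Literal['none', 'all', 'dc']] = None
    # Concrete value => emitted for every node, overriding Scylla's default ('never')
    rpc_dict_training_when: Optional[Literal['always', 'never', 'when_leader']] = 'when_leader'

def render_scylla_yaml(attrs: ScyllaYamlAttrs) -> dict:
    """Drop None-valued attrs so their keys are left out entirely."""
    return {key: value for key, value in asdict(attrs).items() if value is not None}

# With this PR's defaults, only the dictionary-training option is emitted;
# internode_compression stays at whatever Scylla itself defaults to.
print(render_scylla_yaml(ScyllaYamlAttrs()))
# -> {'rpc_dict_training_when': 'when_leader'}
```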
Contributor

I would recommend putting the actual defaults here (i.e. None),

and create an SCT configuration for a specific test that would enable it, for the sake of testing. Enabling it for all by default should preferably be done by scylla-core itself, if decided as such.

See the configurations folder, and as an example configurations/tablets-initial-32.yaml, for how a specific test can set its own scylla.yaml parameters.
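Following that suggestion, a per-test configuration could look roughly like this; the filename is hypothetical, and the sketch assumes SCT's `append_scylla_yaml` mechanism for injecting scylla.yaml parameters, modeled on how configurations/tablets-initial-32.yaml sets its own parameters:

```yaml
# Hypothetical configurations/rpc-compression-dict-training.yaml,
# modeled on configurations/tablets-initial-32.yaml.
append_scylla_yaml:
  rpc_dict_training_when: 'when_leader'
  internode_compression: 'all'
```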

Author

> I would recommend putting the actual defaults here (i.e. None),
>
> and create an SCT configuration for a specific test that would enable it, for the sake of testing. Enabling it for all by default should preferably be done by scylla-core itself, if decided as such.

Why? The good thing about enabling it by default is that it gives more coverage with little effort. What's the drawback?

Contributor

The drawback is that we won't be testing the defaults from that point onwards.

If it's zero risk, change the default in Scylla; if not, the reasons not to do so apply here in SCT as well.

Author

> I would recommend putting the actual defaults here (i.e. None),
> and create an SCT configuration for a specific test that would enable it, for the sake of testing.

@fruch

In that case, would it be acceptable to enable dict training and internode compression for all longevity tests?

We can't enable this feature by default in Scylla because it has a performance cost. For some users, RPC compression is useless (e.g. because they have an on-premise deployment and they don't pay for the amount of data sent over network), and we don't want to regress them.

But we do want it to be stable and relatively cheap "by default". We might want to enable it by default in the cloud, because every cluster pays the network costs there.

> The drawback is that we won't be testing the defaults from that point onwards.

Internode compression only adds code paths, it doesn't remove them. So when you enable it, you are testing strictly more code. And I think that's a good thing?

Author

@fruch ^Ping.

Contributor

So you are saying it's not going to be enabled by default in core, because of fear of regressions; then it might cause regressions in some of the tests as well.

As for the "cheap" argument: every time we said something like that, we regretted it (see enabling audit by default, which after a while got reverted).

I would still argue you want a specific case that can show the different compression options against the other options.

And in this PR you are not enabling internode_compression at all, so it would only affect the cases already using it, which are not that many.

Contributor

Anyhow, we should try it on a few cases before merging:

  • one short longevity
  • one case of rolling upgrades

As follow-up work after merge:

  • one performance test (the during-operation/nemesis one)

@michoecho
Author

> But you should define the specific cases you want to try, i.e. a longevity test for which you have a reference run to compare with,
> for example tier1 (which are run on a weekly basis with master):

@fruch Do you mean that I should manually clone the relevant jenkins job and run it, or is there some better way?

@fruch
Contributor

fruch commented May 6, 2024

> > But you should define the specific cases you want to try, i.e. a longevity test for which you have a reference run to compare with,
> > for example tier1 (which are run on a weekly basis with master):
>
> @fruch Do you mean that I should manually clone the relevant jenkins job and run it, or is there some better way?

No, there's no better way; you can try out the new Argus clone command:
https://argus.scylladb.com/test/f05fea04-eb74-4961-94fb-f71c67df52cb/runs?additionalRuns[]=c221d97d-1b34-4c27-b75a-283abf8bd790

@soyacz
Contributor

soyacz commented May 9, 2024

In upgrade tests we start from 2024.1 or 5.4. How does this change affect those versions? Ignoring or failing?

@michoecho
Author

michoecho commented May 9, 2024

> In upgrade tests we start from 2024.1 or 5.4. How does this change affect those versions? Ignoring or failing?

I don't understand the question. None of these options are present in 2024.1/5.4, so it shouldn't affect their behaviour at all.

@fruch
Contributor

fruch commented May 9, 2024

> > In upgrade tests we start from 2024.1 or 5.4. How does this change affect those versions? Ignoring or failing?
>
> I don't understand the question. None of these options are present in 2024.1/5.4, so it shouldn't affect their behaviour at all.

I think @soyacz meant that we do enable internode compression in rolling upgrades from 5.4/2024.1, so we can run those cases and see the effect of these configuration options before and after quite clearly.
