
feature(advanced RPC compression): enable dictionary training and ZSTD compression #7401

Draft · wants to merge 1 commit into base: master
Conversation

michoecho

This patch:

  • enables dictionary training (by changing `rpc_dict_training_when` from the default value of 'never' to 'when_leader'),
  • increases the frequency of dictionary training and updates by 4x, to stress the feature more,
  • enables ZSTD internode compression (provided that `internode_compression` is also enabled) by giving it a nonzero CPU usage limit.

However, this patch doesn't enable internode_compression itself, because it's a performance-affecting option. To actually put the feature to use, a test must enable internode_compression on its own.
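Spelled out, the three changes above would surface in each node's scylla.yaml roughly as follows. `rpc_dict_training_when` and `internode_compression` are the options named in this PR; the commented lines use placeholders only, since the patch description doesn't spell out the names of the training-frequency and CPU-limit options:

```yaml
# Sketch of the resulting scylla.yaml fragment (placeholders marked below).
rpc_dict_training_when: when_leader   # changed from the default 'never'

# Placeholders: the real names of the training-frequency and ZSTD
# CPU-limit options are not given in this PR description.
# <training-interval option>: <default / 4>       # train/update dictionaries 4x more often
# <zstd CPU-limit option>: <some nonzero value>   # a nonzero limit enables ZSTD

# Deliberately NOT set by this patch; each test must opt in on its own:
# internode_compression: all
```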

Testing

  • Syntax check.
  • Untested.

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Refs #7364

@michoecho michoecho marked this pull request as draft May 6, 2024 08:07
@michoecho
Author

Marking as draft, because I haven't tested it.

This PR enables (I hope) dictionary training by default. It should be non-invasive, so there should be no reason to enable it only for a subset of tests.

The next step would be to also enable internode_compression: all for some longevity tests. It can't be enabled by default for all tests because it could disturb performance tests.

@michoecho michoecho requested a review from soyacz May 6, 2024 08:10
@michoecho
Author

I have no idea how to test this.

@fruch added the test-provision-aws (Run provision test on AWS), test-provision-gce (Run provision test on GCE), and test-provision-azure labels May 6, 2024
@fruch
Contributor

fruch commented May 6, 2024

> I have no idea how to test this.

I've enabled the labels, so we can have a basic sanity run for it (even though it defaults to running 5.2; we'll need to run it again with latest:master).

But you should define the specific cases you want to try, i.e. a longevity test for which you have a reference run to compare with,
for example tier1 (which are run on a weekly basis with master):

https://argus.scylladb.com/workspace?state=WyJlMTdmZjM5YS1mM2Y2LTRhYzYtYTE2MS04NzEyNjdmZDM5NTUiLCI3ZjYwNDIxMy01Y2FkLTQ0MTgtOTIxZS0xY2E5ODkxODA3MWMiLCI5OTY1MDFhZC1jMTM3LTQwZWYtOGRjYy1hN2QxMmYwYmEzYjYiLCI0ODEyMmRmZS0xNDhiLTRhOTktYTQzYy0xYTcxYzY4Njk5NzMiLCI0MjhiMmZiMy1lOWFjLTQyNzYtYjcxYS05YmY0YjQ1YTQzMWEiLCI5ODA1MDczMi1kZmUzLTQ2NGMtYTY2YS1mMjM1YmFkMzA4MjkiLCJlY2Q0OTdjMC04MmQ2LTQyNjktYjA1My1mNWMyMTU3ZTA0YWUiLCIxZTMzM2RmOC1hN2U4LTQxNzEtOGFiNy0xZDdiZGVhOTA3ZDUiXQ

You can start with short variations of some of those, like longevity-100gb-4h, which also runs quite regularly.

@@ -147,6 +147,11 @@ def set_endpoint_snitch(cls, endpoint_snitch: str):
internode_send_buff_size_in_bytes: int = None # 0
internode_recv_buff_size_in_bytes: int = None # 0
internode_compression: Literal['none', 'all', 'dc'] = None # "none"
rpc_dict_training_when: Literal['always', 'never', 'when_leader'] = 'when_leader'
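The diff above relies on a convention worth making explicit: attrs defaulted to None are omitted from the generated scylla.yaml, so Scylla's own default applies, while a concrete value (as this PR sets) is emitted for every node. A minimal standalone sketch of that convention (a simplified stand-in, not SCT's actual rendering code):

```python
# Minimal sketch (assumption: simplified stand-in for SCT's real attrs class)
# of the None-means-"use Scylla's default" convention in the diff above.
from dataclasses import dataclass, asdict
from typing import Literal, Optional

@dataclass
class ScyllaYamlAttrs:
    # None => key omitted from scylla.yaml; Scylla's built-in default ("none") applies
    internode_compression: Optional[Literal['none', 'all', 'dc']] = None
    # Concrete value => emitted for every node, overriding Scylla's default ('never')
    rpc_dict_training_when: Optional[Literal['always', 'never', 'when_leader']] = 'when_leader'

def render_scylla_yaml(attrs: ScyllaYamlAttrs) -> dict:
    """Drop None-valued attrs so their keys are left out entirely."""
    return {key: value for key, value in asdict(attrs).items() if value is not None}

# With this PR's defaults, only the dictionary-training option is emitted;
# internode_compression stays at whatever Scylla itself defaults to.
print(render_scylla_yaml(ScyllaYamlAttrs()))
# -> {'rpc_dict_training_when': 'when_leader'}
```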
Contributor

I would recommend putting the actual defaults here (i.e. None),

and create an SCT configuration for a specific test that would enable it, for the sake of testing. Enabling it for all by default should preferably be done by scylla-core itself, if decided as such.

See the configurations folder, and as an example configurations/tablets-initial-32.yaml, for how a specific test can set its own scylla.yaml parameters.
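Following that suggestion, a per-test configuration could look roughly like this; the filename is hypothetical, and the sketch assumes SCT's `append_scylla_yaml` mechanism for injecting scylla.yaml parameters, modeled on how configurations/tablets-initial-32.yaml sets its own parameters:

```yaml
# Hypothetical configurations/rpc-compression-dict-training.yaml,
# modeled on configurations/tablets-initial-32.yaml.
append_scylla_yaml:
  rpc_dict_training_when: 'when_leader'
  internode_compression: 'all'
```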

Author

> I would recommend putting the actual defaults here (i.e. None),
>
> and create an SCT configuration for a specific test that would enable it, for the sake of testing. Enabling it for all by default should preferably be done by scylla-core itself, if decided as such.

Why? The good thing about enabling it by default is that it gives more coverage with little effort. What's the drawback?

Contributor

The drawback is that we won't be testing the defaults from that point onwards.

If it's zero risk, change the default in Scylla; if not, the reasons not to do so apply here in SCT as well.

Author

> I would recommend putting the actual defaults here (i.e. None),
> and create an SCT configuration for a specific test that would enable it, for the sake of testing.

@fruch

In that case, would it be acceptable to enable dict training and internode compression for all longevity tests?

We can't enable this feature by default in Scylla because it has a performance cost. For some users, RPC compression is useless (e.g. because they have an on-premise deployment and they don't pay for the amount of data sent over network), and we don't want to regress them.

But we do want it to be stable and relatively cheap "by default". We might want to enable it by default in the cloud, because every cluster pays the network costs there.

> The drawback is that we won't be testing the defaults from that point onwards.

Internode compression only adds code paths, it doesn't remove them. So when you enable it, you are testing strictly more code. And I think that's a good thing?

Author

@fruch ^Ping.

Contributor

So you are saying it's not going to be enabled by default in core, because of fear of regressions; then it might cause regressions in some of the tests as well.

As for the "cheap" argument: every time we said something like that, we regretted it (see enabling audit by default, which after a while got reverted).

I would still argue you want a specific case that can show the different compression options against the other options.

And in this PR you are not enabling internode_compression at all, so it would only affect the cases already using it, which are not that many.

Contributor

Anyhow, we should try it on a few cases before merging:

  • one short longevity
  • one case of rolling upgrades

As follow-up work after merge:

  • one performance test (the during-operation/nemesis one)

@michoecho
Author

> But you should define the specific cases you want to try, i.e. a longevity test for which you have a reference run to compare with,
> for example tier1 (which are run on a weekly basis with master):

@fruch Do you mean that I should manually clone the relevant jenkins job and run it, or is there some better way?

@fruch
Contributor

fruch commented May 6, 2024

> > But you should define the specific cases you want to try, i.e. a longevity test for which you have a reference run to compare with,
> > for example tier1 (which are run on a weekly basis with master):
>
> @fruch Do you mean that I should manually clone the relevant jenkins job and run it, or is there some better way?

No, there's no better way; you can try out the new Argus clone command:
https://argus.scylladb.com/test/f05fea04-eb74-4961-94fb-f71c67df52cb/runs?additionalRuns[]=c221d97d-1b34-4c27-b75a-283abf8bd790

@soyacz
Contributor

soyacz commented May 9, 2024

In upgrade tests we start from 2024.1 or 5.4. How does this change affect those versions? Ignoring or failing?

@michoecho
Author

michoecho commented May 9, 2024

> In upgrade tests we start from 2024.1 or 5.4. How does this change affect those versions? Ignoring or failing?

I don't understand the question. None of these options are present in 2024.1/5.4, so it shouldn't affect their behaviour at all.

@fruch
Contributor

fruch commented May 9, 2024

> > In upgrade tests we start from 2024.1 or 5.4. How does this change affect those versions? Ignoring or failing?
>
> I don't understand the question. None of these options are present in 2024.1/5.4, so it shouldn't affect their behaviour at all.

I think @soyacz meant that we do enable internode compression in rolling upgrades from 5.4/2024.1, so we can run those cases and see the effect of these configuration options before and after quite clearly.
