
feature(test-cases): Add test cases for low and asymmetric loads #6517

Merged — 1 commit merged from 5.3-repair-low-load-longevities into scylladb:master on Sep 13, 2023

Conversation

@k0machi (Contributor) commented Aug 18, 2023

This change adds new configurations for the 200GB-48h longevities and one
low-load 4-hour longevity, intended to simulate a low load occurring
during repair, to cover potential overhead like in
scylladb/scylladb#14093.

Task: scylladb/qa-tasks#1416
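
For reference, the core of the asymmetric-load shape being added — the usual heavy prepare workload plus a tiny, single-threaded keyspace alongside it (condensed from the diffs reviewed below):

prepare_write_cmd:
- "cassandra-stress write cl=ALL n=200200300 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=1000 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=1..200200300 -log interval=15"
- "cassandra-stress write cl=ALL n=500 -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate threads=1 -col 'size=FIXED(50) n=FIXED(1)' -pop seq=1..500 -log interval=15"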

PR pre-checks (self review)

  • I followed the KISS principle and best practices
  • I didn't leave commented-out/debugging code
  • I added the relevant backport labels
  • New configuration options are added and documented (in sdcm/sct_config.py)
  • I have added tests to cover my changes (Infrastructure only - under unit-test/ folder)
  • All new and existing unit tests passed (CI)
  • I have updated the Readme/doc folder accordingly (if needed)

@k0machi self-assigned this on Aug 18, 2023
@k0machi (Contributor Author) commented Aug 18, 2023

Jobs

@k0machi requested review from roydahan and removed the request for aleksbykov on August 18, 2023 10:47
@k0machi (Contributor Author) commented Aug 22, 2023

Jobs

Some of the jobs failed due to Manager Restore nemesis failing:

2023-08-18 12:50:15.749: (DisruptionEvent Severity.ERROR) period_type=end event_id=a5160f01-9cca-4eba-8d10-54f1ccddfd82 duration=49s: nemesis_name=MgmtRestore target_node=Node longevity-200gb-48h-verify-limited--db-node-701edbff-4 [52.18.240.42 | 10.4.1.194] (seed: True) errors=Encountered an error on sctool command: restore -c a40ff04f-4e42-4a4f-bc83-e549e01208d9 --restore-schema --location s3:manager-backup-tests-permanent-snapshots-us-east-1  --snapshot-tag sm_20230703000157UTC: Encountered a bad command exit code!
Command: 'sudo sctool restore -c a40ff04f-4e42-4a4f-bc83-e549e01208d9 --restore-schema --location s3:manager-backup-tests-permanent-snapshots-us-east-1  --snapshot-tag sm_20230703000157UTC'
Exit code: 1
Stdout:
Stderr:
Error: create restore units: get CQL cluster session: gocql: unable to create session: unable to discover protocol version: authentication required (using "org.apache.cassandra.auth.PasswordAuthenticator")
Trace ID: cBseMcX8TFCFz5fXELwRGQ (grep in scylla-manager logs)
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 1125, in run
res = self.manager_node.remoter.sudo(f"sctool {cmd}")
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/base.py", line 123, in sudo
return self.run(cmd=cmd,
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 613, in run
result = _run()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 70, in inner
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 604, in _run
return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 537, in _run_execute
result = connection.run(**command_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 655, in _complete_run
raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: 'sudo sctool restore -c a40ff04f-4e42-4a4f-bc83-e549e01208d9 --restore-schema --location s3:manager-backup-tests-permanent-snapshots-us-east-1  --snapshot-tag sm_20230703000157UTC'
Exit code: 1
Stdout:
Stderr:
Error: create restore units: get CQL cluster session: gocql: unable to create session: unable to discover protocol version: authentication required (using "org.apache.cassandra.auth.PasswordAuthenticator")
Trace ID: cBseMcX8TFCFz5fXELwRGQ (grep in scylla-manager logs)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4954, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2768, in disrupt_mgmt_restore
restore_task = mgr_cluster.create_restore_task(restore_schema=True, location_list=location_list,
File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 544, in create_restore_task
res = self.sctool.run(cmd=cmd, parse_table_res=False)
File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 1128, in run
raise ScyllaManagerError(f"Encountered an error on sctool command: {cmd}: {ex}") from ex
sdcm.mgmt.common.ScyllaManagerError: Encountered an error on sctool command: restore -c a40ff04f-4e42-4a4f-bc83-e549e01208d9 --restore-schema --location s3:manager-backup-tests-permanent-snapshots-us-east-1  --snapshot-tag sm_20230703000157UTC: Encountered a bad command exit code!
Command: 'sudo sctool restore -c a40ff04f-4e42-4a4f-bc83-e549e01208d9 --restore-schema --location s3:manager-backup-tests-permanent-snapshots-us-east-1  --snapshot-tag sm_20230703000157UTC'
Exit code: 1
Stdout:
Stderr:
Error: create restore units: get CQL cluster session: gocql: unable to create session: unable to discover protocol version: authentication required (using "org.apache.cassandra.auth.PasswordAuthenticator")
Trace ID: cBseMcX8TFCFz5fXELwRGQ (grep in scylla-manager logs)

Pretty sure that's a known issue, but it's unrelated to this PR.

@k0machi force-pushed the 5.3-repair-low-load-longevities branch from 5ec36bc to c45f6d9 on August 22, 2023 10:36
@temichus requested a review from fgelcer on August 23, 2023 14:13
@@ -0,0 +1,30 @@
test_duration: 330

prepare_write_cmd: "cassandra-stress write cl=QUORUM n=2097152 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=1) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=40 -pop seq=1..2097152 -col 'n=FIXED(10) size=FIXED(512)' -log interval=5"
Contributor:
why rf=1?

Contributor Author (@k0machi), Aug 23, 2023:
Roy requested a longevity that behaves like this, RF=1, CL=1, it's to cover extremely low loads during repairs

Contributor:
You don't need to duplicate the entire yaml file in order to add one or two parameters, and in this case you don't even need to add a new configuration, because under configurations/ you already have one called "db_nodes_shards_selection.yaml".

So, the only thing you need is a new pipeline that adds this configuration.

backend: 'aws',
region: 'eu-west-1',
test_name: 'longevity_test.LongevityTest.test_custom_time',
test_config: 'test-cases/longevity/longevity-200GB-48h-verifier-LimitedMonkey-tls-asymmetric.yaml'
Contributor:
As mentioned in the comment below, here you just need to have the original test_config + "configurations/db_nodes_shards_selection.yaml".
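
For example, the new pipeline could then point at the original yaml plus that configuration — a sketch only; the exact multi-config syntax and the base file name here are assumptions:

backend: 'aws',
region: 'eu-west-1',
test_name: 'longevity_test.LongevityTest.test_custom_time',
test_config: '''["test-cases/longevity/longevity-200GB-48h-verifier-LimitedMonkey-tls.yaml", "configurations/db_nodes_shards_selection.yaml"]'''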

Contributor Author:
Done

prepare_write_cmd: "cassandra-stress write cl=ALL n=200200300 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=1000 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=1..200200300 -log interval=15"
prepare_write_cmd:
- "cassandra-stress write cl=ALL n=200200300 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=1000 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=1..200200300 -log interval=15"
- "cassandra-stress write cl=ALL n=500 -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=50 -col 'size=FIXED(1024) n=FIXED(4)' -pop seq=1..500 -log interval=15"
Contributor:
Suggested change
- "cassandra-stress write cl=ALL n=500 -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=50 -col 'size=FIXED(1024) n=FIXED(4)' -pop seq=1..500 -log interval=15"
- "cassandra-stress write cl=ALL n=500 -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=1 -col 'size=FIXED(50) n=FIXED(1)' -pop seq=1..500 -log interval=15"

Contributor:
The -rate threads value is what makes it "low load".
I also changed the column size so it will be a tiny table.

Contributor Author:
Done

# prepare_verify_cmd: "cassandra-stress read cl=ALL n=200200300 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=2000 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=1..200200300 -log interval=15"

stress_cmd: ["cassandra-stress write cl=QUORUM duration=2860m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=250 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=400200300..600200300 -log interval=15"]
stress_read_cmd: ["cassandra-stress read cl=ONE duration=2860m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=250 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=1..200200300 -log interval=5"]
stress_read_cmd:
Contributor:
If we're changing this anyway, it's a good opportunity to remove stress_read_cmd and move it under stress_cmd.
IIRC it just starts a few minutes later, but it doesn't work with round_robin.

Contributor Author:
Done

stress_read_cmd: ["cassandra-stress read cl=ONE duration=2860m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=250 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=1..200200300 -log interval=5"]
stress_read_cmd:
- "cassandra-stress read cl=ONE duration=2860m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=250 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=1..200200300 -log interval=5"
- "cassandra-stress read cl=ONE duration=2860m -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=25 -col 'size=FIXED(1024) n=FIXED(4)' -pop seq=1..500 -log interval=5"
Contributor:
Suggested change
- "cassandra-stress read cl=ONE duration=2860m -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=25 -col 'size=FIXED(1024) n=FIXED(4)' -pop seq=1..500 -log interval=5"
- "cassandra-stress read cl=ONE duration=2860m -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=1 -col 'size=FIXED(50) n=FIXED(1)' -pop seq=1..500 -log interval=5"

Contributor Author:
Done

Contributor:
The "low load" in the name here is a mistake; the purpose of this longevity is RF=1.

Contributor Author:
Done

backend: 'aws',
region: 'eu-west-1',
test_name: 'longevity_test.LongevityTest.test_custom_time',
test_config: 'test-cases/longevity/longevity-10gb-4h-low-load.yaml',
Contributor:
In any case, I'd prefer you use the 200gb-48h test as the baseline (with test_duration we can run it shorter than 48h).
Also here, instead of duplicating the entire yaml, you can add a configuration that overrides only the prepare command to use replication_factor=1 (if it's mentioned in the other commands, it's irrelevant anyway), and sets the nemesis class to "NonDisruptive".
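
A minimal override configuration along those lines could look roughly like this — the file name and the exact nemesis class name are assumptions, and only the keys that differ from the baseline are set:

# hypothetical configurations/rf1-prepare.yaml
test_duration: 330
prepare_write_cmd: "cassandra-stress write cl=ALL n=200200300 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=1) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=1000 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=1..200200300 -log interval=15"
nemesis_class_name: 'NonDisruptiveMonkey'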

Contributor:
This longevity requires extensive testing with random nemesis_seed values to see that it actually works as expected for all nemeses.
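
(If needed, the seed can be pinned per run — assuming the nemesis_seed option from sdcm/sct_config.py — e.g.:)

nemesis_seed: 23  # any fixed integer, value here is just an example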

Contributor Author:
Will start running it today with different nemesis seeds

@k0machi force-pushed the 5.3-repair-low-load-longevities branch from c45f6d9 to 51fa7a4 on August 24, 2023 13:36
@k0machi requested a review from roydahan on August 24, 2023 13:36
@k0machi (Contributor Author) commented Aug 24, 2023

Jobs, again, this time with fixes.

@k0machi force-pushed the 5.3-repair-low-load-longevities branch from 51fa7a4 to b990bf6 on August 24, 2023 13:59
prepare_write_cmd: "cassandra-stress write cl=ALL n=200200300 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=1000 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=1..200200300 -log interval=15"
prepare_write_cmd:
- "cassandra-stress write cl=ALL n=200200300 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=1000 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=1..200200300 -log interval=15"
- "cassandra-stress write cl=ALL n=500 -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=1 -col 'size=FIXED(50) n=FIXED(1)' -pop seq=1..500 -log interval=15"
Contributor:
Suggested change
- "cassandra-stress write cl=ALL n=500 -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=1 -col 'size=FIXED(50) n=FIXED(1)' -pop seq=1..500 -log interval=15"
- "cassandra-stress write cl=ALL n=500 -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate threads=1 -col 'size=FIXED(50) n=FIXED(1)' -pop seq=1..500 -log interval=15"

Contributor Author:
Done

stress_cmd:
- "cassandra-stress write cl=QUORUM duration=2860m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=250 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=400200300..600200300 -log interval=15"
- "cassandra-stress read cl=ONE duration=2860m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=250 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=1..200200300 -log interval=5"
- "cassandra-stress read cl=ONE duration=2860m -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=1 -col 'size=FIXED(50) n=FIXED(1)' -pop seq=1..500 -log interval=5"
Contributor:
Suggested change
- "cassandra-stress read cl=ONE duration=2860m -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=1 -col 'size=FIXED(50) n=FIXED(1)' -pop seq=1..500 -log interval=5"
- "cassandra-stress read cl=ONE duration=2860m -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate threads=1 -col 'size=FIXED(50) n=FIXED(1)' -pop seq=1..500 -log interval=5"

Contributor Author:
Done

Contributor:
Please remove the low-load from the name.

Contributor Author:
Done

Comment on lines 3 to 6
- "cassandra-stress write cl=ALL n=500 -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=1 -col 'size=FIXED(50) n=FIXED(1)' -pop seq=1..500 -log interval=15"

stress_cmd:
- "cassandra-stress read cl=ONE duration=2860m -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=1 -col 'size=FIXED(50) n=FIXED(1)' -pop seq=1..500 -log interval=5"
Contributor:
These aren't the correct c-s commands for RF=1.
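
For illustration only, a minimal sketch of what an RF=1 version of these commands could look like, following the earlier RF=1/CL=ONE request (not the actual corrected diff):

- "cassandra-stress write cl=ONE n=500 -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=1)' -mode cql3 native -rate threads=1 -col 'size=FIXED(50) n=FIXED(1)' -pop seq=1..500 -log interval=15"
- "cassandra-stress read cl=ONE duration=2860m -schema 'keyspace=lowload1 replication(strategy=NetworkTopologyStrategy,replication_factor=1)' -mode cql3 native -rate threads=1 -col 'size=FIXED(50) n=FIXED(1)' -pop seq=1..500 -log interval=5"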

Contributor Author:
Corrected

Contributor:
Is it a job just for testing?

Contributor Author:
We could keep it afterwards, but yes, I'm testing with it.

@k0machi force-pushed the 5.3-repair-low-load-longevities branch from b990bf6 to 77bb238 on August 24, 2023 15:25
@k0machi requested a review from roydahan on August 24, 2023 15:26
@k0machi force-pushed the 5.3-repair-low-load-longevities branch from 77bb238 to 2f84dbc on August 24, 2023 15:30
@k0machi force-pushed the 5.3-repair-low-load-longevities branch 3 times, most recently from 72d7eb5 to 857a6ae, on September 8, 2023 10:19
This change adds new configurations for 200gb-48 longevities and one
low load 4 hour longevity, intended to simulate a low load happening
during repair processes, to cover potential overhead like in
scylladb/scylladb#14093.

Task: scylladb/qa-tasks#1416
@k0machi force-pushed the 5.3-repair-low-load-longevities branch from 857a6ae to 818b2aa on September 13, 2023 09:56
@roydahan merged commit 9fd18c8 into scylladb:master on Sep 13, 2023
5 checks passed
@fruch added the backport/none (Backport is not required) label on Nov 5, 2023