feature(aws): add suport to multi az #5934

soyacz · 2023-03-21T14:00:48Z

We need to support multi-az deployments in SCT to cover new
test cases. In particular, we aim to run grow/shrink cluster tests in
multi-rack deployments.

Add multi-az support to the AWS provisioning code (old provision). To use
multi-az, specify multiple comma-separated availability zones
(e.g. a,b,c).
When multi-az is specified, the job's provision step is skipped.

GrowShrinkClusterNemesis can work with multi-az on the AWS
backend. When rack is set to None in disrupt_grow_shrink_cluster,
the cluster grows evenly across racks; otherwise it grows only in the specified rack.
Adapted 3 jobs to use multi-az on AWS.
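The rack selection described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `parse_availability_zones` and `plan_new_node_racks` are hypothetical names, and round-robin placement is an assumption about what "growing evenly" means.

```python
from __future__ import annotations

from itertools import cycle, islice


def parse_availability_zones(value: str) -> list[str]:
    """Split a comma-separated AZ string such as 'a,b,c' into ['a', 'b', 'c']."""
    return [az.strip() for az in value.split(",") if az.strip()]


def plan_new_node_racks(zones: list[str], add_node_cnt: int, rack: int | None = None) -> list[int]:
    """Return the rack index for each node that is about to be added.

    rack=None spreads the new nodes round-robin over all racks (assumed
    meaning of "evenly"); otherwise every new node goes to the given rack.
    """
    if rack is None:
        return list(islice(cycle(range(len(zones))), add_node_cnt))
    if not 0 <= rack < len(zones):
        raise ValueError(f"rack {rack} is out of range for {len(zones)} zones")
    return [rack] * add_node_cnt


zones = parse_availability_zones("a,b,c")
print(plan_new_node_racks(zones, 3))          # [0, 1, 2]
print(plan_new_node_racks(zones, 2, rack=1))  # [1, 1]
```

With three zones and three added nodes, each rack receives exactly one node, which keeps a balanced cluster balanced.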

refs: https://github.com/scylladb/qa-tasks/issues/1070

PR pre-checks (self review)

I followed KISS principle and best practices
I didn't leave commented-out/debugging code
I added the relevant backport labels
New configuration option are added and documented (in sdcm/sct_config.py)
I have added tests to cover my changes (Infrastructure only - under unit-test/ folder)
All new and existing unit tests passed (CI)
I have updated the Readme/doc folder accordingly (if needed)

sdcm/cluster.py

soyacz · 2023-03-21T14:36:15Z

test drive: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/lukasz/job/longevity-mini-test-1h-test/2/

fgelcer · 2023-03-21T20:53:41Z

> test drive: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/lukasz/job/longevity-mini-test-1h-test/2/

i'm also giving it a ride here

soyacz · 2023-03-22T07:33:02Z

I selected 3 longevities to use multi-az 696d48c
@roydahan please review if selection is right

soyacz · 2023-03-22T07:50:46Z

> test drive: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/lukasz/job/longevity-mini-test-1h-test/2/

Ride was successful, testing eks: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/lukasz/job/longevity-scylla-operator-3h-eks/7/

soyacz · 2023-03-22T12:54:50Z

Ready for review.
I propose doing the following in separate PRs so that multi-az testing can start sooner:

- Adapt the new AWS provisioning to support multi-az
- Make AddRemoveRackNemesis work with the AWS backend

test-cases/longevity/longevity-topology-changes-3h.yaml

test-cases/longevity/longevity-sla-100gb-4h.yaml

fgelcer

agree with the longevity-sla not being stable enough, so i would suggest using the longevity 4h for it... or add a short longevity to have this configuration, and add it to the weekly trigger (think the 2nd is the best option)

test-cases/longevity/longevity-sla-100gb-4h.yaml

fruch · 2023-03-23T11:40:34Z

> agree with the longevity-sla not being stable enough, so i would suggest using the longevity 4h for it... or add a short longevity to have this configuration, and add it to the weekly trigger (think the 2nd is the best option)

we specifically said we don't want it on short longevities that are automatically triggered
until we know how stable the multi AZ runs are, we don't want to generate too much noise for ourselves

fgelcer · 2023-03-23T12:51:25Z

> don't

so we could add a new short job that does it, as we must exercise it as much as possible to get it stable ASAP

fruch · 2023-04-03T08:37:38Z

@roydahan, waiting for your input on which case to add this one to (and a review in general)

soyacz · 2023-04-03T09:41:36Z

@fruch @fgelcer I replaced SLA with test-cases/longevity/longevity-cdc-100gb-4h.yaml. Triggered weekly, 3 AZs.
@roydahan please review if the test selection looks ok.

sdcm/sct_config.py

sdcm/nemesis.py

sdcm/cluster.py

vponomaryov

LGTM

roydahan · 2023-04-03T18:03:06Z

@fgelcer did you try your scenario for performance of bootstrap with this code?
If not, let's try it before merging.

roydahan · 2023-04-03T18:05:32Z

jenkins-pipelines/longevity-cdc-100gb-4h.jenkinsfile

@@ -6,6 +6,7 @@ def lib = library identifier: 'sct@snapshot', retriever: legacySCM(scm)
 longevityPipeline(
    backend: 'aws',
    region: 'eu-west-1',
+    availability_zone: 'a,b,c',


Why do we put it as a pipeline parameter and not in the test_yaml?
It's hard to track it like this.

soyacz · 2023-04-04

@roydahan
Because the pipelines define availability_zone defaults, the pipeline value will override the yaml.
Solutions I see:

1. Add yet another param to the test yaml, which will be confusing (the az set in the jenkins pipeline will differ from the actual az's).

2. Add availability_zone to the test yaml and leave the jenkins params as they are now (so it's clearer when reading the test yaml) - my preference.

@roydahan please let me know if you have another idea or which you prefer

fruch · 2023-04-04

Also, disabling the new provision step currently depends on this pipeline parameter.

So we need some indication on the pipeline level as well.

roydahan · 2023-04-09

I see.
I don't like option 1 either.
The problem with option 2 is that someone may want to change it and will change it only in the yaml, but it won't take effect since the pipeline overrides it, right?
OTOH, it will be really hard for someone to understand that their test uses multi AZ....
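The override concern in this thread can be illustrated with a small sketch. It is not SCT's actual config loader: `resolve_option` and both dictionaries are hypothetical, and the only behavior taken from the discussion is that a pipeline parameter wins over the test yaml.

```python
def resolve_option(name, yaml_config, pipeline_params):
    """Return (value, source); a pipeline parameter wins over the test yaml."""
    if pipeline_params.get(name) is not None:
        return pipeline_params[name], "pipeline"
    if name in yaml_config:
        return yaml_config[name], "yaml"
    raise KeyError(name)


# Someone edits only the yaml, expecting three zones...
yaml_config = {"availability_zone": "a,b,c"}
# ...but the pipeline still carries its own default.
pipeline_params = {"availability_zone": "a"}

value, source = resolve_option("availability_zone", yaml_config, pipeline_params)
print(value, source)  # a pipeline
```

Under this precedence, the yaml edit silently has no effect, which is the usability problem raised for option 2.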

roydahan · 2023-04-03T18:07:44Z

jenkins-pipelines/longevity-topology-changes-3h.jenkinsfile

@@ -6,6 +6,7 @@ def lib = library identifier: 'sct@snapshot', retriever: legacySCM(scm)
 longevityPipeline(
    backend: 'aws',
    region: 'eu-west-1',
+    availability_zone: 'a,b',


I think that we should put it in the yaml and set it to 3 az.

roydahan · 2023-04-03T18:09:39Z

test-cases/longevity/longevity-topology-changes-3h.yaml

@@ -22,6 +23,7 @@ nemesis_class_name: 'SisyphusMonkey'
 nemesis_selector: ['topology_changes']
 nemesis_interval: 5
 nemesis_filter_seeds: false
+nemesis_add_node_cnt: 2


It's a weird case to test RF=3 with 2 AZs and adding 2 nodes.
I don't even know what we should expect.
For now, let's go with the recommended and supported case: RF=3, 3 AZs, adding 3 nodes every time.

Also, let's change:
n_db_nodes to 3
seeds_num: 3

@aleksbykov please review these suggested changes as well.

roydahan · 2023-04-03T18:13:02Z

jenkins-pipelines/longevity-cdc-8h-multi-dc-topology-changes.jenkinsfile

@@ -6,6 +6,7 @@ def lib = library identifier: 'sct@snapshot', retriever: legacySCM(scm)
 longevityPipeline(
    backend: 'aws',
    region: '''["eu-west-1", "us-east-1"]''',
+    availability_zone: 'a,b',


Let's move to the test yaml and set the AZ and RF to be the same.

soyacz · 2023-04-04

This test has 4 nodes per DC, so in that case the cluster would be unbalanced from the start.
I changed the multi-DC case to longevity-counters-multidc.yaml.
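The balance argument can be checked with a few lines of arithmetic. The sketch below assumes nodes are placed round-robin over the zones, which is an assumption for illustration rather than something stated in this PR; `nodes_per_zone` and `is_balanced` are hypothetical helpers.

```python
def nodes_per_zone(n_nodes, zones):
    """Count nodes per zone, assuming round-robin placement over the zones."""
    counts = {zone: 0 for zone in zones}
    for index in range(n_nodes):
        counts[zones[index % len(zones)]] += 1
    return counts


def is_balanced(n_nodes, zones):
    """A layout is balanced when every zone holds the same number of nodes."""
    return n_nodes % len(zones) == 0


zones = ["a", "b", "c"]
print(nodes_per_zone(4, zones))  # {'a': 2, 'b': 1, 'c': 1}
print(is_balanced(4, zones))     # False
print(is_balanced(3, zones))     # True
```

With three zones to match RF=3, four nodes per DC cannot be spread evenly, while three nodes (and three added nodes per grow step) can.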

soyacz · 2023-04-04T09:42:44Z

@roydahan I applied changes, please see my responses.
Also note that I changed the multi-dc scenario to longevity-counters-multidc.yaml, as the previous one was using 4 nodes per dc.

test-cases/longevity/longevity-cdc-100gb-4h.yaml

roydahan

Overall it looks good to me.
There are still some usability issues making our life a bit harder.
It shouldn't block the PR, so we can start using it, but it should be addressed as part of a followup task.

sdcm/nemesis.py

vponomaryov

LGTM

fruch · 2023-04-11T08:31:14Z

@soyacz

Test case linting is failing:

10:31:38      raise ValueError(res)
10:31:38  ValueError: Unsupported config option/s found:
10:31:38  	 * 'availability_zones: a,b,c'
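The log shows the key availability_zones (plural), while the pipelines in this PR use availability_zone, so the linter appears to reject a key it does not know. A minimal sketch of that kind of check follows; `KNOWN_OPTIONS`, `find_unsupported_options` and `validate` are hypothetical and do not reproduce the real linting code.

```python
# Hypothetical subset of known options, taken from names mentioned in this PR.
KNOWN_OPTIONS = {"availability_zone", "n_db_nodes", "seeds_num", "nemesis_add_node_cnt"}


def find_unsupported_options(config):
    """Return "'key: value'" strings for every key that is not a known option."""
    return [f"'{key}: {value}'" for key, value in config.items() if key not in KNOWN_OPTIONS]


def validate(config):
    """Raise ValueError listing unsupported options, similar to the log above."""
    unsupported = find_unsupported_options(config)
    if unsupported:
        lines = "\n".join(f"\t * {item}" for item in unsupported)
        raise ValueError(f"Unsupported config option/s found:\n{lines}")


try:
    validate({"availability_zones": "a,b,c"})
except ValueError as err:
    print(err)
```

Renaming the key to availability_zone makes `validate` pass in this sketch.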

soyacz · 2023-04-12T06:11:39Z

@fruch checks fixed

soyacz requested review from fruch, vponomaryov and fgelcer March 21, 2023 14:00

github-actions bot assigned soyacz Mar 21, 2023

soyacz marked this pull request as draft March 21, 2023 14:01

fruch reviewed Mar 21, 2023

sdcm/cluster.py (outdated, resolved)

soyacz force-pushed the 1070-aws-support-multi-az branch from d14f6bb to 9ea4ec3 Compare March 22, 2023 07:28

soyacz force-pushed the 1070-aws-support-multi-az branch from 696d48c to b1aef82 Compare March 22, 2023 12:48

soyacz changed the title from "WIP: multi-az support for AWS" to "feature(aws): add suport to multi az" Mar 22, 2023

soyacz marked this pull request as ready for review March 22, 2023 12:50

soyacz requested review from fruch and roydahan March 22, 2023 12:54

fruch reviewed Mar 22, 2023

test-cases/longevity/longevity-topology-changes-3h.yaml (outdated, resolved)

fruch reviewed Mar 22, 2023

test-cases/longevity/longevity-sla-100gb-4h.yaml (outdated, resolved)

fgelcer reviewed Mar 23, 2023

test-cases/longevity/longevity-sla-100gb-4h.yaml (outdated, resolved)

soyacz requested review from fruch and fgelcer March 27, 2023 06:35

soyacz force-pushed the 1070-aws-support-multi-az branch from b1aef82 to be20df5 Compare April 3, 2023 09:29

vponomaryov reviewed Apr 3, 2023

sdcm/sct_config.py (resolved)

sdcm/nemesis.py (outdated, resolved)

sdcm/nemesis.py (resolved)

sdcm/cluster.py (outdated, resolved)

soyacz force-pushed the 1070-aws-support-multi-az branch 2 times, most recently from fe10ac4 to ebd459e Compare April 3, 2023 13:58

vponomaryov previously approved these changes Apr 3, 2023

roydahan reviewed Apr 3, 2023

roydahan requested changes Apr 3, 2023

soyacz dismissed vponomaryov’s stale review via 841d9e8 April 4, 2023 09:15

soyacz force-pushed the 1070-aws-support-multi-az branch 3 times, most recently from dc87c80 to f38111b Compare April 4, 2023 09:38

soyacz requested a review from roydahan April 4, 2023 09:40

roydahan reviewed Apr 9, 2023

test-cases/longevity/longevity-cdc-100gb-4h.yaml (resolved)

roydahan previously approved these changes Apr 9, 2023

vponomaryov requested changes Apr 10, 2023

sdcm/nemesis.py (resolved)

sdcm/nemesis.py (outdated, resolved)

sdcm/nemesis.py (outdated, resolved)

soyacz dismissed roydahan’s stale review via 0960c20 April 11, 2023 07:08

soyacz force-pushed the 1070-aws-support-multi-az branch from f38111b to 0960c20 Compare April 11, 2023 07:08

soyacz requested a review from vponomaryov April 11, 2023 07:13

vponomaryov previously approved these changes Apr 11, 2023

fruch previously approved these changes Apr 11, 2023

soyacz dismissed stale reviews from fruch and vponomaryov via cecb0b4 April 11, 2023 08:40

soyacz force-pushed the 1070-aws-support-multi-az branch from 0960c20 to cecb0b4 Compare April 11, 2023 08:40

vponomaryov approved these changes Apr 11, 2023

soyacz requested a review from fruch April 11, 2023 08:57

fruch approved these changes Apr 13, 2023

fruch merged commit 798db4f into scylladb:master Apr 13, 2023

soyacz mentioned this pull request May 8, 2024

Grow-shrink cluster perf test is not using multi-az #7384

Open
