feature(aws): add suport to multi az #5934

Merged
fruch merged 1 commit into scylladb:master from 1070-aws-support-multi-az
Apr 13, 2023

Conversation

soyacz
Contributor

@soyacz soyacz commented Mar 21, 2023

We need to support multi-az deployments in SCT to be able to cover new
test cases. In particular, we aim for grow/shrink cluster tests in
multi-rack deployments.

Add multi-az support to the AWS provisioning code (old provisioning). To
use multi-az, specify multiple comma-separated availability zones
(e.g. a,b,c).
When multi-az is specified, the job's provision step is skipped.
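
To illustrate the comma-separated format, here is a minimal sketch of how a value such as `a,b,c` could be expanded into full AWS zone names. The function names `parse_availability_zones` and `is_multi_az` are hypothetical and are not taken from the SCT code; the only AWS-specific assumption is that a zone name is the region name followed by the zone letter.

```python
from __future__ import annotations


def parse_availability_zones(region: str, availability_zone: str) -> list[str]:
    """Expand a comma-separated list of zone letters into full zone names.

    Hypothetical helper: 'a,b,c' in region 'eu-west-1' becomes
    ['eu-west-1a', 'eu-west-1b', 'eu-west-1c'].
    """
    # Tolerate spaces around commas and ignore empty entries such as 'a,,b'.
    letters = [az.strip() for az in availability_zone.split(",") if az.strip()]
    if not letters:
        raise ValueError("at least one availability zone is required")
    if len(set(letters)) != len(letters):
        raise ValueError(f"duplicate availability zones in {availability_zone!r}")
    return [f"{region}{letter}" for letter in letters]


def is_multi_az(availability_zone: str) -> bool:
    """Return True when more than one zone letter is specified."""
    return len([az for az in availability_zone.split(",") if az.strip()]) > 1
```

A check like `is_multi_az` is the kind of condition a pipeline could use to decide whether to skip the provision step, as described above.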

GrowShrinkClusterNemesis can work with multi-az on the AWS backend.
When rack is set to None in disrupt_grow_shrink_cluster, the cluster
grows evenly across racks; otherwise it grows only in the specified rack.
Adapted 3 jobs to use multi-az on AWS.
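
The two growth modes can be sketched as follows. This is only an illustration under the assumption that "evenly" means round-robin placement over the racks; `plan_growth` and its parameters are hypothetical names, not the nemesis implementation.

```python
from __future__ import annotations


def plan_growth(add_node_cnt: int, rack_count: int, rack: int | None = None) -> list[int]:
    """Return the rack index chosen for each new node.

    rack=None: spread the new nodes round-robin over all racks (assumed
    meaning of "evenly"). Otherwise: place every new node in the given rack.
    """
    if rack_count < 1:
        raise ValueError("rack_count must be at least 1")
    if rack is None:
        return [i % rack_count for i in range(add_node_cnt)]
    if not 0 <= rack < rack_count:
        raise ValueError(f"rack {rack} is outside 0..{rack_count - 1}")
    return [rack] * add_node_cnt
```

With 3 racks, `plan_growth(3, 3)` places one node in each rack, while `plan_growth(2, 3, rack=1)` places both nodes in rack 1.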

refs: https://github.com/scylladb/qa-tasks/issues/1070

PR pre-checks (self review)

  • I followed KISS principle and best practices
  • I didn't leave commented-out/debugging code
  • I added the relevant backport labels
  • New configuration options are added and documented (in sdcm/sct_config.py)
  • I have added tests to cover my changes (Infrastructure only - under unit-test/ folder)
  • All new and existing unit tests passed (CI)
  • I have updated the Readme/doc folder accordingly (if needed)

@soyacz soyacz marked this pull request as draft March 21, 2023 14:01
sdcm/cluster.py (outdated, resolved)
@soyacz
Contributor Author

soyacz commented Mar 21, 2023

@fgelcer
Contributor

fgelcer commented Mar 21, 2023

@soyacz
Contributor Author

soyacz commented Mar 22, 2023

I selected 3 longevity tests to use multi-az (696d48c).
@roydahan please review whether the selection is right.

@soyacz soyacz changed the title from "WIP: multi-az support for AWS" to "feature(aws): add suport to multi az" Mar 22, 2023
@soyacz soyacz marked this pull request as ready for review March 22, 2023 12:50
@soyacz
Contributor Author

soyacz commented Mar 22, 2023

Ready for review.
I propose doing the following in separate PRs so that multi-az testing can start sooner:

  1. Adapt new AWS provisioning to support multi-az
  2. Make AddRemoveRackNemesis work with AWS backend

@soyacz soyacz requested review from fruch and roydahan March 22, 2023 12:54
Contributor

@fgelcer fgelcer left a comment

agree with the longevity-sla not being stable enough, so i would suggest using the longevity 4h for it... or add a short longevity to have this configuration, and add it to the weekly trigger (think the 2nd is the best option)

test-cases/longevity/longevity-sla-100gb-4h.yaml (outdated, resolved)
@fruch
Contributor

fruch commented Mar 23, 2023

> agree with the longevity-sla not being stable enough, so i would suggest using the longevity 4h for it... or add a short longevity to have this configuration, and add it to the weekly trigger (think the 2nd is the best option)

We specifically said we don't want it on short longevities that are triggered automatically.
Until we know how stable the multi-AZ runs are, we don't want to generate too much noise for ourselves.

@fgelcer
Contributor

fgelcer commented Mar 23, 2023

> don't

so we could add a new short job that will do it, as we must exercise it as much as possible to get it stable ASAP

@soyacz soyacz requested review from fruch and fgelcer March 27, 2023 06:35
@fruch
Contributor

fruch commented Apr 3, 2023

@roydahan, waiting for your input on which test case to add this to (and a review in general)

@soyacz soyacz force-pushed the 1070-aws-support-multi-az branch from b1aef82 to be20df5 Compare April 3, 2023 09:29
@soyacz
Contributor Author

soyacz commented Apr 3, 2023

@fruch @fgelcer I replaced SLA with test-cases/longevity/longevity-cdc-100gb-4h.yaml. It is triggered weekly and uses 3 AZs.
@roydahan please review whether the test selection looks OK.

sdcm/sct_config.py (resolved)
sdcm/nemesis.py (outdated, resolved)
sdcm/nemesis.py (resolved)
sdcm/cluster.py (outdated, resolved)
@soyacz soyacz force-pushed the 1070-aws-support-multi-az branch 2 times, most recently from fe10ac4 to ebd459e Compare April 3, 2023 13:58
vponomaryov
vponomaryov previously approved these changes Apr 3, 2023
Contributor

@vponomaryov vponomaryov left a comment

LGTM

@roydahan
Contributor

roydahan commented Apr 3, 2023

@fgelcer did you try your scenario for performance of bootstrap with this code?
If not, let's try it before merging.

@@ -6,6 +6,7 @@ def lib = library identifier: 'sct@snapshot', retriever: legacySCM(scm)
longevityPipeline(
backend: 'aws',
region: 'eu-west-1',
availability_zone: 'a,b,c',
Contributor

Why do we put it as a pipeline parameter and not in the test_yaml?
It's hard to track it like this.

Contributor Author

@roydahan
because pipelines define availability_zone defaults, so the pipeline value will override the yaml.
Solutions I see:

  1. add yet another param to the test yaml, which will be confusing (the az set in the jenkins pipeline will differ from the actual az's)
  2. add availability_zone to the test yaml and leave the jenkins params as they are now (so it's clearer when reading the test yaml) - my preference.

@roydahan please let me know if you have another idea or which you prefer
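
The override behaviour discussed in this thread can be sketched as below. This is a simplified model using plain dictionaries, assuming only what the thread states (a pipeline parameter takes precedence over the test yaml); `resolve_availability_zone` and `shadowed_yaml_value` are hypothetical helpers, not SCT functions.

```python
from __future__ import annotations


def resolve_availability_zone(test_yaml: dict, pipeline_params: dict) -> tuple[str, str]:
    """Return (value, source) for availability_zone.

    Assumed precedence, as described in the thread: a pipeline parameter,
    when set, overrides whatever the test yaml says.
    """
    if pipeline_params.get("availability_zone"):
        return pipeline_params["availability_zone"], "pipeline"
    if test_yaml.get("availability_zone"):
        return test_yaml["availability_zone"], "test yaml"
    raise KeyError("availability_zone is not set in the pipeline or the test yaml")


def shadowed_yaml_value(test_yaml: dict, pipeline_params: dict) -> str | None:
    """Return the yaml value that is silently ignored, or None if there is none."""
    yaml_value = test_yaml.get("availability_zone")
    pipeline_value = pipeline_params.get("availability_zone")
    if yaml_value and pipeline_value and yaml_value != pipeline_value:
        return yaml_value
    return None
```

A check like `shadowed_yaml_value` is one possible way to surface the case where someone edits only the yaml and the change does not take effect.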

Contributor

Also the current disable of the new provision step depends on this pipeline parameter.

So we need some indication on the pipeline level as well

Contributor

@roydahan roydahan Apr 9, 2023

I see.
I don't like option 1 either.
The problem with option 2 is that someone may want to change it and will change it only in the yaml, but it won't take effect since the pipeline overrides it, right?
OTOH, it will be really hard for someone to understand that their test uses multi AZ....

@@ -6,6 +6,7 @@ def lib = library identifier: 'sct@snapshot', retriever: legacySCM(scm)
longevityPipeline(
backend: 'aws',
region: 'eu-west-1',
availability_zone: 'a,b',
Contributor

I think that we should put it in the yaml and set it to 3 az.

@@ -22,6 +23,7 @@ nemesis_class_name: 'SisyphusMonkey'
nemesis_selector: ['topology_changes']
nemesis_interval: 5
nemesis_filter_seeds: false
nemesis_add_node_cnt: 2
Contributor

It's a weird case to test RF=3 with 2 AZs and adding 2 nodes.
I don't even know what we should expect.
For now, let's go with the recommended and supported case: RF=3, 3 AZs, adding 3 nodes every time.
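
As a rough illustration of the recommended shape (RF=3, 3 AZs, adding 3 nodes every time), the sketch below checks whether a node count spreads evenly over the AZs. It assumes round-robin placement and reads "recommended" as: the number of AZs equals RF, and both the initial node count and the number of added nodes are multiples of the AZ count. The helper names are hypothetical.

```python
from __future__ import annotations


def nodes_per_az(n_nodes: int, n_azs: int) -> list[int]:
    """Node count in each AZ under round-robin placement (an assumption)."""
    base, extra = divmod(n_nodes, n_azs)
    # The first `extra` AZs receive one node more than the others.
    return [base + 1 if i < extra else base for i in range(n_azs)]


def is_recommended_layout(rf: int, n_azs: int, n_db_nodes: int, nemesis_add_node_cnt: int) -> bool:
    """True when AZ count equals RF and every AZ always holds the same number of nodes."""
    return (
        rf == n_azs
        and n_db_nodes % n_azs == 0
        and nemesis_add_node_cnt % n_azs == 0
    )
```

For example, 3 nodes over 3 AZs gives `[1, 1, 1]` and growing by 3 gives `[2, 2, 2]`, whereas 4 nodes over 3 AZs gives `[2, 1, 1]`, which is uneven from the start.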

Contributor

Also, let's change:
n_db_nodes to 3
seeds_num: 3

@aleksbykov please review these suggested changes as well.

@@ -6,6 +6,7 @@ def lib = library identifier: 'sct@snapshot', retriever: legacySCM(scm)
longevityPipeline(
backend: 'aws',
region: '''["eu-west-1", "us-east-1"]''',
availability_zone: 'a,b',
Contributor

Let's move it to the test yaml and set the number of AZs to be the same as the RF.

Contributor Author

This test has 4 nodes per DC, so it would make the cluster unbalanced from the start in that case.
I changed the multi-DC case to longevity-counters-multidc.yaml.

@soyacz soyacz force-pushed the 1070-aws-support-multi-az branch 3 times, most recently from dc87c80 to f38111b Compare April 4, 2023 09:38
@soyacz soyacz requested a review from roydahan April 4, 2023 09:40
@soyacz
Contributor Author

soyacz commented Apr 4, 2023

@roydahan I applied changes, please see my responses.
Also note that I changed the multi-dc scenario to longevity-counters-multidc.yaml, as the previous one was using 4 nodes per DC.

roydahan
roydahan previously approved these changes Apr 9, 2023
Contributor

@roydahan roydahan left a comment

Overall it looks good to me.
There are still some usability issues making our life a bit harder.
It shouldn't block the PR, so we can start using it, but it should be addressed as part of a followup task.

sdcm/nemesis.py (resolved)
sdcm/nemesis.py (outdated, resolved)
sdcm/nemesis.py (outdated, resolved)
vponomaryov
vponomaryov previously approved these changes Apr 11, 2023
Contributor

@vponomaryov vponomaryov left a comment

LGTM

fruch
fruch previously approved these changes Apr 11, 2023
@fruch
Contributor

fruch commented Apr 11, 2023

@soyacz

Test case linting is failing:

10:31:38      raise ValueError(res)
10:31:38  ValueError: Unsupported config option/s found:
10:31:38  	 * 'availability_zones: a,b,c'
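
The rejected key is `availability_zones` (plural), while the pipelines in this PR use `availability_zone`. A check of this kind can be sketched as follows; this is not the SCT linter, `KNOWN_OPTIONS` is a small illustrative subset of option names that appear in this conversation, and `find_unsupported_options` is a hypothetical helper that also suggests the closest known name.

```python
import difflib

# Illustrative subset only; the real list of supported options lives in SCT.
KNOWN_OPTIONS = {"availability_zone", "n_db_nodes", "seeds_num", "nemesis_add_node_cnt"}


def find_unsupported_options(config: dict, known=frozenset(KNOWN_OPTIONS)) -> dict:
    """Map every unsupported key to a list with its closest known option name.

    The list is empty when no known option is similar enough.
    """
    return {
        key: difflib.get_close_matches(key, sorted(known), n=1)
        for key in config
        if key not in known
    }


if __name__ == "__main__":
    problems = find_unsupported_options({"availability_zones": "a,b,c", "n_db_nodes": 3})
    for key, suggestion in problems.items():
        hint = f" (did you mean {suggestion[0]!r}?)" if suggestion else ""
        print(f"Unsupported config option: {key!r}{hint}")
```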

@soyacz soyacz dismissed stale reviews from fruch and vponomaryov via cecb0b4 April 11, 2023 08:40
@soyacz soyacz requested a review from fruch April 11, 2023 08:57
@soyacz
Contributor Author

soyacz commented Apr 12, 2023

@fruch checks fixed

@fruch fruch merged commit 798db4f into scylladb:master Apr 13, 2023
5 participants