feature(aws): add support to multi-az #5934
Conversation
I'm also giving it a try here.
Ready for review.
I agree with the longevity-sla not being stable enough, so I would suggest using the 4h longevity for it, or adding a short longevity with this configuration and adding it to the weekly trigger (I think the second is the best option).
We specifically said we don't want it on short longevities that are automatically triggered.
So we could add a new short job that will do it, as we must exercise it as much as possible to make it stable ASAP.
@roydahan, waiting for your input on which case to add this one to (and a review in general).
LGTM
@fgelcer did you try your bootstrap performance scenario with this code?
```diff
@@ -6,6 +6,7 @@ def lib = library identifier: 'sct@snapshot', retriever: legacySCM(scm)
 longevityPipeline(
     backend: 'aws',
     region: 'eu-west-1',
+    availability_zone: 'a,b,c',
```
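For illustration, the comma-separated value passed above could be expanded into full AWS zone names roughly like this (a hypothetical helper sketch, not SCT's actual code; `expand_azs` is an invented name):

```python
def expand_azs(region: str, az_suffixes: str) -> list:
    """Expand comma-separated AZ suffixes such as 'a,b,c' into full
    zone names for the given region, e.g. 'eu-west-1a'."""
    return [region + s.strip() for s in az_suffixes.split(",") if s.strip()]

print(expand_azs("eu-west-1", "a,b,c"))  # ['eu-west-1a', 'eu-west-1b', 'eu-west-1c']
```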
Why do we put it as a pipeline parameter and not in the test_yaml?
It's hard to track it like this.
@roydahan
Because the pipelines define availability_zone defaults, they will override the yaml.
Solutions I see:
- add yet another param to the test yaml, which will be confusing (setting the AZ in the Jenkins pipeline will differ from the actual AZs)
- add availability_zone to the test yaml and leave the Jenkins params as they are now (so it's clearer when reading the test yaml) - my preference.

@roydahan please let me know if you have another idea or which option you prefer.
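The precedence problem under discussion can be sketched as follows (hypothetical resolution logic, only illustrating why a pipeline default shadows the yaml value; not SCT's real configuration code):

```python
def resolve_az(pipeline_value, yaml_value):
    # The Jenkins pipeline defines a default, so it wins whenever it is set;
    # the yaml value only takes effect if the pipeline leaves it empty.
    return pipeline_value if pipeline_value else yaml_value

print(resolve_az("a,b,c", "a"))  # 'a,b,c' -- the yaml setting is silently ignored
print(resolve_az(None, "a"))     # 'a'
```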
Also, the current disabling of the new provision step depends on this pipeline parameter, so we need some indication at the pipeline level as well.
I see.
I don't like option 1 either.
The problem with option 2 is that one may want to change it and will change it only in the yaml, but it won't take effect since the pipeline overrides it, right?
On the other hand, it will be really hard for someone to understand that their test uses multi-AZ...
```diff
@@ -6,6 +6,7 @@ def lib = library identifier: 'sct@snapshot', retriever: legacySCM(scm)
 longevityPipeline(
     backend: 'aws',
     region: 'eu-west-1',
+    availability_zone: 'a,b',
```
I think that we should put it in the yaml and set it to 3 az.
```diff
@@ -22,6 +23,7 @@ nemesis_class_name: 'SisyphusMonkey'
 nemesis_selector: ['topology_changes']
 nemesis_interval: 5
 nemesis_filter_seeds: false
+nemesis_add_node_cnt: 2
```
It's a weird case to test RF=3 with 2 AZs and adding 2 nodes.
I don't even know what we should expect.
For now, let's go with the recommended and supported case: RF=3, 3 AZs, adding 3 nodes every time.
Also, let's change:
- n_db_nodes to 3
- seeds_num to 3

@aleksbykov please review these suggested changes as well.
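A quick sketch of why growing by multiples of the AZ count keeps the cluster balanced (illustration only; `balanced_grow_counts` is an invented name, not part of SCT):

```python
def balanced_grow_counts(n_azs: int, max_step: int) -> list:
    """Grow steps that keep every AZ equally sized: multiples of the AZ count."""
    return [n for n in range(1, max_step + 1) if n % n_azs == 0]

# With 3 AZs, only steps of 3, 6, 9, ... add the same number of nodes to
# each zone, hence adding 3 nodes every time for the RF=3 / 3-AZ case.
print(balanced_grow_counts(3, 9))  # [3, 6, 9]
```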
```diff
@@ -6,6 +6,7 @@ def lib = library identifier: 'sct@snapshot', retriever: legacySCM(scm)
 longevityPipeline(
     backend: 'aws',
     region: '''["eu-west-1", "us-east-1"]''',
+    availability_zone: 'a,b',
```
Let's move to the test yaml and set the AZ and RF to be the same.
This test has 4 nodes per DC, so it would make the cluster unbalanced from the start in that case.
I changed the multi-DC case to longevity-counters-multidc.yaml.
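The imbalance mentioned here is simple arithmetic: with round-robin placement, 4 nodes per DC cannot be spread evenly over 3 AZs (a sketch, not SCT code; `nodes_per_az` is a made-up helper):

```python
def nodes_per_az(n_nodes: int, n_azs: int) -> list:
    """Per-AZ node counts under round-robin placement."""
    base, extra = divmod(n_nodes, n_azs)
    return [base + 1 if i < extra else base for i in range(n_azs)]

print(nodes_per_az(4, 3))  # [2, 1, 1] -- unbalanced from the start
print(nodes_per_az(3, 3))  # [1, 1, 1] -- the recommended RF=3 / 3-AZ layout
```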
@roydahan I applied the changes, please see my responses.
Overall it looks good to me.
There are still some usability issues making our life a bit harder.
It shouldn't block the PR, so we can start using it, but it should be addressed as part of a follow-up task.
LGTM
Test case linting is failing.
We need to support multi-AZ deployments in SCT to be able to cover new test cases; in particular, we aim for grow/shrink cluster tests when working in multi-rack deployments. Add support for multi-AZ in the AWS provisioning code (old provision). To use multi-AZ, specify multiple comma-separated availability zones (e.g. `a,b,c`). When multi-AZ is specified, the job's provision step is skipped. `GrowShrinkClusterNemesis` is able to work with multi-AZ on the AWS backend. When `rack` is set to `None` in `disrupt_grow_shrink_cluster`, it will grow the cluster evenly; otherwise it grows only in the specified rack. Adapted 3 jobs to use multi-AZ on AWS.
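The `rack` behavior described above could look roughly like this (a minimal sketch assuming a round-robin spread; `pick_racks` is a made-up name, not the nemesis's real API):

```python
from itertools import cycle

def pick_racks(racks, count, rack=None):
    """Choose a rack for each new node: spread round-robin over all racks
    when `rack` is None, otherwise pin every new node to the given rack."""
    if rack is not None:
        return [rack] * count
    rr = cycle(racks)
    return [next(rr) for _ in range(count)]

print(pick_racks(["a", "b", "c"], 3))            # ['a', 'b', 'c'] -- even growth
print(pick_racks(["a", "b", "c"], 2, rack="b"))  # ['b', 'b'] -- single-rack growth
```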
@fruch checks fixed.
refs: https://github.com/scylladb/qa-tasks/issues/1070
PR pre-checks (self review)
- backport labels
- `sdcm/sct_config.py`
- `unit-test/` folder