
test(manager): introduce Manager installation test on different OS #7456

Merged

Conversation

mikliapko
Contributor

Closes scylladb/scylla-manager#3852

The test is designed for execution on non-main OS distributions (for example, debian10), where the main goal is to run the installation test and verify that the Manager server and agent are up and running.

The motivation for having such a test:

  • We have never faced OS-specific failures in backup/restore/repair functionality. Thus, there is no need to run the sanity test (which includes verification of this functionality) against non-main operating systems;
  • The installation test takes only ~20 minutes on average versus ~2 hours for the Manager sanity test;
  • The Manager sanity test remains and will still be executed on the main OS, Ubuntu 22.

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)


# Check scylla-manager and scylla-manager-agent are up and running (/ping endpoint)
manager_node.is_manager_server_up()
scylla_node.is_manager_agent_up()
Contributor

shouldn't the test also verify communication between the agent and the server?
Or at least that the cluster is registered in the server?
I believe several short functional checks could find some regressions without sacrificing a lot of time.

Contributor Author

Yes, makes sense; added several checks for cluster addition and health check.
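For illustration, a minimal sketch of what such checks could look like (apart from the ping checks shown in the diff, the helper names and signatures below are assumptions for the sketch, not necessarily the exact SCT API):

# Sketch: verify the cluster is registered in the server and its hosts are healthy.
# from sdcm import mgmt  # SCT's Manager helpers; import path assumed
manager_tool = mgmt.get_scylla_manager_tool(manager_node=manager_node)  # assumed helper
mgr_cluster = manager_tool.add_cluster(name="installation-test-cluster",
                                       db_cluster=self.db_cluster)  # assumed signature
assert mgr_cluster.id in [c.id for c in manager_tool.cluster_list()], \
    "Cluster was not registered in the scylla-manager server"  # cluster_list() assumed
assert all(host.status == "UP" for host in mgr_cluster.get_hosts_health()), \
    "Manager health check reports hosts that are not UP"  # get_hosts_health() assumed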

Contributor

please retry the test builds and check that everything passes


@mikliapko mikliapko force-pushed the installation-test-for-manager-3852 branch from 37e8d7e to e12f771 on May 21, 2024 10:21
@mikliapko mikliapko requested a review from soyacz May 21, 2024 10:52
soyacz
soyacz previously approved these changes May 21, 2024
Contributor

@soyacz soyacz left a comment

LGTM

mgmt_cli_test.py (outdated review thread, resolved)
@roydahan
Contributor

I don't understand why we need this PR.
Doesn't the per-OS sanity do the exact same thing?

@ShlomiBalalis ?

@mikliapko
Contributor Author

mikliapko commented May 21, 2024

I don't understand why we need this PR. Doesn't the per-OS sanity do the exact same thing?

It does the same and even more, but it takes longer.
That's the idea: since we have never faced any OS-specific issues related to backup/restore/repair, it doesn't make sense to run the whole test_sanity_manager.

@roydahan
Contributor

How much longer?
I don't think we never faced such an issue; we did, rarely.
But anyway, if sanity is not that long, and one of the comments above suggests that you actually verify that Manager works, this PR isn't worth it.

If sanity became too long over the years, we can consider shortening it.

@fruch
Contributor

fruch commented May 21, 2024

How much longer?
I don't think we never faced such an issue; we did, rarely.
But anyway, if sanity is not that long, and one of the comments above suggests that you actually verify that Manager works, this PR isn't worth it.

If sanity became too long over the years, we can consider shortening it.

It's ~2h for each sanity, and those are using spot instances.

From my POV, it is more than fine to minimize the multiple distro cases to mostly installation.

@roydahan
Contributor

From my POV it's unneeded duplication.
As I wrote above, we can/should shorten the sanity back to what it was originally: a quick sanity test.

@fruch
Contributor

fruch commented May 21, 2024

From my POV it's unneeded duplication. As I wrote above, we can/should shorten the sanity back to what it was originally: a quick sanity test.

Yet again, I'm all for letting whoever owns it make the calls on how to structure it.
I don't see any duplication in this; it's the other way around: it replaces some of the jobs with shorter versions.

@mikliapko
Contributor Author

From my POV it's unneeded duplication. As I wrote above, we can/should shorten the sanity back to what it was originally: a quick sanity test.

Yes, it's not a duplication because the goal is to replace sanity jobs for debian10, debian11, ubuntu20 distributions with these installation jobs and use the sanity job only for ubuntu22.

One more thing: these short installation jobs can be very useful when we need to verify changes in SCT setup steps, images, etc. We already had such a case (#7459), and instead of running ~1.5-2h jobs for every distro we ran these installation jobs, which gave the result in 20 minutes.

@mikliapko
Contributor Author

But anyway, if sanity is not that long, and one of the comments above suggests that you actually verify that Manager works, this PR isn't worth it.

Some extra checks were added to the test to address this comment.

The test is intended for execution on non-main OS distributions (for
example, debian10) where the main goal is to execute installation test
and verify manager + agent are up and running.
@roydahan
Contributor

From my POV it's unneeded duplication. As I wrote above, we can/should shorten the sanity back to what it was originally: a quick sanity test.

Yes, it's not a duplication because the goal is to replace sanity jobs for debian10, debian11, ubuntu20 distributions with these installation jobs and use the sanity job only for ubuntu22.

One more thing: these short installation jobs can be very useful when we need to verify changes in SCT setup steps, images, etc. We already had such a case (#7459), and instead of running ~1.5-2h jobs for every distro we ran these installation jobs, which gave the result in 20 minutes.

In that case, why do we need it also for ubuntu22 in this PR?
And what are the trigger changes?

@roydahan
Contributor

But anyway, if sanity is not that long, and one of the comments above suggests that you actually verify that Manager works, this PR isn't worth it.

Some extra checks were added to the test to address this comment.

And now it looks like the sanity test when it was first introduced.
In the past we had an issue on one platform where Manager repair failed with “too many open files”.
With this change we will never hit it, because we stop testing repair when Manager runs on Debian, for example.
And if we continue using the sanity, half of it is duplicated by this PR.

@mikliapko
Contributor Author

In that case, why do we need it also for ubuntu22 in this PR?

From my point of view, we can also benefit from having such a job if we need to quickly verify the ubuntu22 installation process for particular cases (e.g., changes in the setup procedure).

And what are the trigger changes?

It's not implemented yet, but we are planning to implement the triggers for them.

@mikliapko
Contributor Author

And now it looks like the sanity test when it was first introduced.

But it takes 15 minutes instead of 1.5-2 hours.

In the past we had an issue on one platform where Manager repair failed with “too many open files”. With this change we will never hit it, because we stop testing repair when Manager runs on Debian, for example. And if we continue using the sanity, half of it is duplicated by this PR.

I didn't know about such an issue. In that case we may consider keeping, for example, one per-release run of all sanity tests on all platforms (manager-3.2 release branch) and running installation tests only in master.

When I joined the project, all the sanity tests except Ubuntu22 and CentOS7 (which is already deprecated) had been disabled in the Jenkins master branch for more than a year. So nobody really cared much about these runs in master all that time.

@roydahan
Contributor

And now it looks like the sanity test when it was first introduced.

But it takes 15 minutes instead of 1.5-2 hours.

In the past we had an issue on one platform where Manager repair failed with “too many open files”. With this change we will never hit it, because we stop testing repair when Manager runs on Debian, for example. And if we continue using the sanity, half of it is duplicated by this PR.

I didn't know about such an issue. In that case we may consider keeping, for example, one per-release run of all sanity tests on all platforms (manager-3.2 release branch) and running installation tests only in master.

When I joined the project, all the sanity tests except Ubuntu22 and CentOS7 (which is already deprecated) had been disabled in the Jenkins master branch for more than a year. So nobody really cared much about these runs in master all that time.

Because the developers probably never looked at master, so it was worthless to run it on master.
Once you run it only for releases, it means that if there is an issue you fix it and it's a release blocker.
Hence you don't really care if it takes 15 minutes or 2 hours, because it runs only when you have releases.

IMO, if you want to have it in master, it should be tracked by developers, and if you want it to be tracked by developers, it should be part of their CI.
And if it’s part of their CI, the place for it is probably GitHub actions like the current CI they have.

@mikliapko
Contributor Author

Hence you don't really care if it takes 15 minutes or 2 hours, because it runs only when you have releases.

From my perspective, it is more efficient not to run a 2-hour job when the same result can be obtained from a 15-minute job. It's not only a question of time but of cloud resources as well.

IMO, if you want to have it in master, it should be tracked by developers, and if you want it to be tracked by developers, it should be part of their CI. And if it’s part of their CI, the place for it is probably GitHub actions like the current CI they have.

I don't really think they run anything from SCT in GitHub Actions.

@gmizrahi

@roydahan as we now own Manager testing and these are Manager-specific changes:
a. please add @mikliapko, @rayakurl, and myself as maintainers for this part of the code;
b. approve the PR (two of your team members have already approved it, and Manager is our domain).

cc @mykaul

@mykaul
Contributor

mykaul commented May 23, 2024

  • We have never faced OS-specific failures in backup/restore/repair functionality. Thus, there is no need to run the sanity test (which includes verification of this functionality) against non-main operating systems;

Isn't this an argument why it's not very important?

  • The installation test takes only ~20 minutes on average versus ~2 hours for the Manager sanity test;

Ouch! 20 minutes? What is taking so loooong? I'm very interested to know. Installation logs might help.

@mikliapko
Contributor Author

Ouch! 20 minutes? What is taking so loooong? I'm very interested to know. Installation logs might help.

It's not the test itself that takes 20 minutes but the infra preparation steps.
We hope this will be improved by the provisioning step added to the Manager pipeline yesterday (#7469). We haven't tested these installation tests with this step yet.

@mikliapko
Contributor Author

So, I'm not going to duplicate the tests in SCT for this. There are already tests for all distributions which do install. Since I'm also a maintainer, I'm not approving that.

My suggestion here, instead of duplicating tests, is to support a new mode for the sanity that will skip some sub-tests. Then, if you decide to re-enable the tests in master, you can enable them with the skip mode in the trigger, and for releases without the skip mode.

P.S. Better to link to the trigger PRs in scylla-pkg if you want a complete review.

In the current implementation I don't really see many intersections between these two tests:

Sanity test:

  1. Generate load
  2. Do basic backup test
  3. Do restore test
  4. Do repair
  5. Test cluster CRUD operations
  6. Health check test
  7. Suspend and resume test
  8. Client encryption test

Installation test:

  1. Check scylla-manager and scylla-manager-agent are up and running (/ping endpoint)
  2. Check the sctool version method is callable and returns valid info
  3. Add cluster and verify hosts health

So, of course, they can be joined, but from my point of view it would make the sanity test more complex.
Moreover, I don't see a big issue in introducing a new test.
To be honest, the Manager tests are quite far from being concise and free of duplicated code. I agree that that's not a reason to introduce new duplicates, but I personally don't see much duplication here.
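For reference, the three installation-test steps listed above could be sketched roughly as follows; only is_manager_server_up/is_manager_agent_up appear in the diff shown earlier, while the remaining helper names and signatures are illustrative assumptions rather than the exact SCT implementation:

# Rough sketch of the installation test flow; not the exact SCT implementation.
# from sdcm import mgmt  # SCT's Manager helpers; import path assumed
def test_manager_installation(self):
    manager_node = self.monitors.nodes[0]
    scylla_node = self.db_cluster.nodes[0]

    # 1. scylla-manager and scylla-manager-agent are up and running (/ping endpoint)
    assert manager_node.is_manager_server_up(), "scylla-manager server is not up"
    assert scylla_node.is_manager_agent_up(), "scylla-manager-agent is not up"

    # 2. `sctool version` is callable and returns valid info
    manager_tool = mgmt.get_scylla_manager_tool(manager_node=manager_node)  # assumed helper
    assert manager_tool.sctool.version, "sctool returned no version info"  # assumed attribute

    # 3. Add the cluster to Manager and verify hosts health
    mgr_cluster = manager_tool.add_cluster(name="installation-test",
                                           db_cluster=self.db_cluster)  # assumed signature
    assert all(host.status == "UP" for host in mgr_cluster.get_hosts_health()), \
        "Manager health check reports hosts that are not UP"  # assumed helper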

@soyacz
Contributor

soyacz commented May 23, 2024

As to test speed: the collect-logs phase takes long mainly because of getting monitoring screenshots.
I assume this is not relevant for these tests, and we could provide a means to skip it for such cases.
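As an illustration of what such a skip switch could look like, a rough sketch; the option name and hooks below are hypothetical, not existing SCT configuration:

# Hypothetical sketch: let short installation jobs skip the slow monitoring-screenshot step.
def collect_logs(self):
    if self.params.get("skip_monitor_screenshots"):  # hypothetical config option
        self.log.info("Skipping monitoring screenshots for short installation job")
    else:
        self.collect_monitor_screenshots()  # hypothetical hook around the existing collector
    self.collect_node_logs()  # hypothetical hook; node logs are still collected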

@roydahan
Contributor

@roydahan as we now own Manager testing and these are Manager-specific changes: a. please add @mikliapko, @rayakurl, and myself as maintainers for this part of the code; b. approve the PR (two of your team members have already approved it, and Manager is our domain).

Sure, once each reaches a few hundred significant commits we will consider adding new maintainers.

cc @mykaul

@roydahan
Contributor

So, I'm not going to duplicate the tests in SCT for this. There are already tests for all distributions which do install. Since I'm also a maintainer, I'm not approving that.
My suggestion here, instead of duplicating tests, is to support a new mode for the sanity that will skip some sub-tests. Then, if you decide to re-enable the tests in master, you can enable them with the skip mode in the trigger, and for releases without the skip mode.
P.S. Better to link to the trigger PRs in scylla-pkg if you want a complete review.

In the current implementation I don't really see many intersections between these two tests:

Sanity test:

  1. Generate load
  2. Do basic backup test
  3. Do restore test
  4. Do repair
  5. Test cluster CRUD operations
  6. Health check test
  7. Suspend and resume test
  8. Client encryption test

Installation test:

  1. Check scylla-manager and scylla-manager-agent are up and running (/ping endpoint)
  2. Check the sctool version method is callable and returns valid info
  3. Add cluster and verify hosts health

So, of course, they can be joined, but from my point of view it would make the sanity test more complex. Moreover, I don't see a big issue in introducing a new test. To be honest, the Manager tests are quite far from being concise and free of duplicated code. I agree that that's not a reason to introduce new duplicates, but I personally don't see much duplication here.

@mikliapko again, if it’s only installation of manager, have you considered using simple GH actions that will be part of CI?

@mykaul
Contributor

mykaul commented May 23, 2024

@roydahan as we now own Manager testing and these are Manager-specific changes: a. please add @mikliapko, @rayakurl, and myself as maintainers for this part of the code; b. approve the PR (two of your team members have already approved it, and Manager is our domain).

Sure, once each reaches a few hundred significant commits we will consider adding new maintainers.

Hundreds?

cc @mykaul

@mykaul
Contributor

mykaul commented May 23, 2024

As to test speed: the collect-logs phase takes long mainly because of getting monitoring screenshots. I assume this is not relevant for these tests, and we could provide a means to skip it for such cases.

I hope there's an issue on this - why would taking a screenshot take long? (I assume there's a reason, perhaps we can improve that).

@mikliapko
Contributor Author

@mikliapko again, if it’s only installation of manager, have you considered using simple GH actions that will be part of CI?

No, we haven't tried this option. But what for?
The QA team doesn't look into the GH Actions jobs that the devs take care of. We want to have all the tests that are interesting and important to us in one place, together with the rest of the Manager SCT jobs.

@roydahan
Contributor

As to test speed: the collect-logs phase takes long mainly because of getting monitoring screenshots. I assume this is not relevant for these tests, and we could provide a means to skip it for such cases.

I hope there's an issue on this - why would taking a screenshot take long? (I assume there's a reason, perhaps we can improve that).

Probably because the monitor is very, very slow.
It's a small node that runs Manager (with its own Scylla) and the monitor.

@roydahan
Contributor

@mikliapko again, if it’s only installation of manager, have you considered using simple GH actions that will be part of CI?

No, we haven't tried this option. But what for? The QA team doesn't look into the GH Actions jobs that the devs take care of. We want to have all the tests that are interesting and important to us in one place, together with the rest of the Manager SCT jobs.

That's exactly my whole point.
We don't want only QA to watch these jobs and to find out AFTER the fact that there was a breaking commit.
That's why these tests were disabled in the first place: developers didn't watch them, hence no one watched them, so it was a waste to run them.
Otherwise the same thing will happen with your new tests. Having a simple GH action that only installs it will allow developers to detect these issues as part of their CI. It can also be set up as an extended CI.

@rayakurl
Contributor

@mikliapko again, if it’s only installation of manager, have you considered using simple GH actions that will be part of CI?

No, we haven't tried this option. But what for? The QA team doesn't look into the GH Actions jobs that the devs take care of. We want to have all the tests that are interesting and important to us in one place, together with the rest of the Manager SCT jobs.

That's exactly my whole point. We don't want only QA to watch these jobs and to find out AFTER the fact that there was a breaking commit. That's why these tests were disabled in the first place: developers didn't watch them, hence no one watched them, so it was a waste to run them. Otherwise the same thing will happen with your new tests. Having a simple GH action that only installs it will allow developers to detect these issues as part of their CI. It can also be set up as an extended CI.

@roydahan we have a defined testing strategy; we have tests in GH Actions maintained and watched by @karol-kokoszka and @Michal-Leszczynski.
So we have a few levels of tests and a clear plan for how to test Manager.
We are defining a procedure for who looks at the runs and have close collaboration between QA and the developers, as we are part of the same group now.

Having said that, as far as I know this is not a concern of an SCT maintainer.
As far as I understand, a maintainer of a repository should check the correctness of the code, verify that the new code is not introducing regressions, etc.
The Manager tests are owned by us, and we decide how they are tested, when, and by whom. This has been discussed and pre-approved by everyone involved in the project.
And IMO this PR is not the place to discuss it.

For the scope of this PR, Mikita already showed that the new test doesn't duplicate the old test; Israel, who is a maintainer, didn't think so either, nor did Lukasz, who initially approved the PR. We have approval from 2 maintainers and 2 members of the Manager task force, which is more than we need.
So nothing should keep this PR from being merged.

@mykaul, @gmizrahi - if I am wrong, and the SCT maintainers' job is to define the test strategy (I've never witnessed this before), then all the maintainers should attend all the Manager meetings that we have (bi-weekly, weekly grooming, etc.), be part of the Manager channel, and keep up to date with all the changes that we make in real time.
I don't think that this is productive, but it's your call.

@roydahan
Contributor

It does introduce duplication.
It doesn't remove the older sanities nor link to a scylla-pkg PR that changes the triggers accordingly.

@mikliapko
Contributor Author

It does introduce duplication.

It seems like we already discussed the duplication question yesterday. Here you can find my view and explanation
(#7456 (comment)).
Could you please specify the exact places where you see code or test duplication?

It doesn’t remove the older sanities

There is no need to remove the sanities. This test should remain enabled for the main distro (ubuntu22). In addition, we are planning to keep it for secondary distributions as well, but run it only on request (manually) if we suspect any possible OS-related issues.

nor link to a scylla-pkg PR that changes the triggers accordingly.

There is an ongoing effort to rework the test triggering approach a bit (scylladb/scylla-manager#3856). It will be addressed there.

@rayakurl
Contributor

This PR has been pending for a few days now.
@roydahan it's waiting for you to reply to Mikita's comment.

@roydahan
Contributor

Waiting to see the triggers PR, which is blocking these tests anyway.

@mikliapko
Contributor Author

Waiting to see the triggers PR, which is blocking these tests anyway.

Is it a must-have for any tests to be merged?

We are planning to introduce the triggers for these installation tests as part of this task (scylladb/scylla-manager#3856), where we plan to rework the triggering process itself and, for manual builds, allow choosing the group of tests to execute depending on their priority. I just don't want to do double work adding triggers for the installation tests only for them to be reworked again in the near future.

Moreover, not having such triggers yet doesn't prevent us from manually executing these tests for now.

@roydahan

@gmizrahi

gmizrahi commented Jun 2, 2024

@roydahan please approve the PR or grant me permissions to approve it.

Thanks!

Contributor

@vponomaryov vponomaryov left a comment

I read the whole thread.
The reasoning for this PR makes sense to me.
The common SCT code is not affected, only Manager-specific code.

LGTM

@mikliapko
Contributor Author

@roydahan

While waiting for this PR to be merged, we've implemented the task (scylladb/scylla-manager#3856) where the test triggering approach has been reworked a bit. Installation tests were also added.

See the PR https://github.com/scylladb/scylla-pkg/pull/4109

@roydahan roydahan merged commit 1293b0a into scylladb:master Jun 6, 2024
6 checks passed
@gmizrahi

gmizrahi commented Jun 6, 2024

thank you @roydahan

@mikliapko mikliapko deleted the installation-test-for-manager-3852 branch June 6, 2024 08:52
@vponomaryov
Contributor

@mergify backport manager-3.2


mergify bot commented Jun 6, 2024

backport manager-3.2

✅ Backports have been created


Successfully merging this pull request may close these issues.

[SCT] Introduce the test for Manager installation and smoke checks