
test(manager): introduce Manager installation test on different OS #7456

Merged

Conversation

mikliapko
Contributor

Closes scylladb/scylla-manager#3852

The test is designed for execution on non-main OS distributions (for example, debian10), where the main goal is to run the installation test and verify that the Manager server and agent are up and running.

The motivation for having such a test:

  • We have never faced OS-specific failures in backup/restore/repair functionality. Thus, there is no need to run the sanity test (which includes verification of this functionality) against non-main operating systems;
  • The installation test takes only ~20 minutes on average versus ~2 hours for the Manager sanity test;
  • The Manager sanity test remains and will still be executed on the main OS, Ubuntu 22.

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)


# Check scylla-manager and scylla-manager-agent are up and running (/ping endpoint)
manager_node.is_manager_server_up()
scylla_node.is_manager_agent_up()
Contributor

shouldn't the test also verify communication between the agent and the server?
Or at least that the cluster is registered in the server?
I believe several short functional checks could find some regressions without sacrificing a lot of time.

Contributor Author

Yes, makes sense; added several checks for cluster addition and health check.
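For illustration, a minimal sketch of what such checks could look like (apart from the ping checks shown in the diff, the helper names and signatures below are assumptions for the sketch, not necessarily the exact SCT API):

# Sketch: verify the cluster is registered in the server and its hosts are healthy.
# from sdcm import mgmt  # SCT's Manager helpers; import path assumed
manager_tool = mgmt.get_scylla_manager_tool(manager_node=manager_node)  # assumed helper
mgr_cluster = manager_tool.add_cluster(name="installation-test-cluster",
                                       db_cluster=self.db_cluster)  # assumed signature
assert mgr_cluster.id in [c.id for c in manager_tool.cluster_list()], \
    "Cluster was not registered in the scylla-manager server"  # cluster_list() assumed
assert all(host.status == "UP" for host in mgr_cluster.get_hosts_health()), \
    "Manager health check reports hosts that are not UP"  # get_hosts_health() assumed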

Contributor

please retry the test builds and check that everything passes


@mikliapko mikliapko force-pushed the installation-test-for-manager-3852 branch from 37e8d7e to e12f771 on May 21, 2024 10:21
@mikliapko mikliapko requested a review from soyacz May 21, 2024 10:52
soyacz
soyacz previously approved these changes May 21, 2024
Contributor

@soyacz soyacz left a comment

LGTM

mgmt_cli_test.py (outdated review thread, resolved)
@roydahan
Contributor

I don't understand why we need this PR.
Doesn't the per-OS sanity do the exact same thing?

@ShlomiBalalis ?

@mikliapko
Contributor Author

mikliapko commented May 21, 2024

I don't understand why we need this PR. Doesn't the per-OS sanity do the exact same thing?

It does the same and even more, but it takes longer.
That's the idea: since we have never faced any OS-specific issues related to backup/restore/repair, it doesn't make sense to run the whole test_sanity_manager.

@roydahan
Contributor

How much longer?
I don't think we never faced such an issue; we did, rarely.
But anyway, if sanity is not that long, and one of the comments above suggests that you actually verify that Manager works, this PR isn't worth it.

If sanity became too long over the years, we can consider shortening it.

@fruch
Contributor

fruch commented May 21, 2024

How much longer?
I don't think we never faced such an issue; we did, rarely.
But anyway, if sanity is not that long, and one of the comments above suggests that you actually verify that Manager works, this PR isn't worth it.

If sanity became too long over the years, we can consider shortening it.

It's ~2h for each sanity, and those are using spot instances.

From my POV, it is more than fine to minimize the multiple distro cases to mostly installation.

@roydahan
Contributor

From my POV it's unneeded duplication.
As I wrote above, we can/should shorten the sanity back to what it was originally: a quick sanity test.

@fruch
Contributor

fruch commented May 21, 2024

From my POV it's unneeded duplication. As I wrote above, we can/should shorten the sanity back to what it was originally: a quick sanity test.

Yet again, I'm all for letting whoever owns it make the calls on how to structure it.
I don't see any duplication in this; it's the other way around: it replaces some of the jobs with shorter versions.

@mikliapko
Contributor Author

From my POV it's unneeded duplication. As I wrote above, we can/should shorten the sanity back to what it was originally: a quick sanity test.

Yes, it's not a duplication because the goal is to replace sanity jobs for debian10, debian11, ubuntu20 distributions with these installation jobs and use the sanity job only for ubuntu22.

One more thing: these short installation jobs can be very useful when we need to verify changes in SCT setup steps, images, etc. We already had such a case (#7459), and instead of running ~1.5-2h jobs for every distro we ran these installation jobs, which gave the result in 20 minutes.

@mikliapko
Contributor Author

But anyway, if sanity is not that long, and one of the comments above suggests that you actually verify that Manager works, this PR isn't worth it.

Some extra checks were added to the test to address this comment.

The test is intended for execution on non-main OS distributions (for
example, debian10) where the main goal is to execute installation test
and verify manager + agent are up and running.
@roydahan
Contributor

From my POV it's unneeded duplication. As I wrote above, we can/should shorten the sanity back to what it was originally: a quick sanity test.

Yes, it's not a duplication because the goal is to replace sanity jobs for debian10, debian11, ubuntu20 distributions with these installation jobs and use the sanity job only for ubuntu22.

One more thing: these short installation jobs can be very useful when we need to verify changes in SCT setup steps, images, etc. We already had such a case (#7459), and instead of running ~1.5-2h jobs for every distro we ran these installation jobs, which gave the result in 20 minutes.

In that case, why do we need it also for ubuntu22 in this PR?
And what are the trigger changes?

@roydahan
Contributor

But anyway, if sanity is not that long, and one of the comments above suggests that you actually verify that Manager works, this PR isn't worth it.

Some extra checks were added to the test to address this comment.

And now it looks like the sanity test when it was first introduced.
In the past we had an issue on one platform where Manager repair failed with “too many open files”.
With this change we will never hit it, because we stop testing repair when Manager runs on Debian, for example.
And if we continue using the sanity, half of it is duplicated by this PR.

@mikliapko
Contributor Author

In that case, why do we need it also for ubuntu22 in this PR?

From my point of view, we can also benefit from having such a job if we need to quickly verify the ubuntu22 installation process for particular cases (e.g., changes in the setup procedure).

And what are the trigger changes?

It's not implemented yet, but we are planning to implement the triggers for them.

@mikliapko
Contributor Author

And now it looks like the sanity test when it was first introduced.

But it takes 15 minutes instead of 1.5-2 hours.

In the past we had an issue on one platform where Manager repair failed with “too many open files”. With this change we will never hit it, because we stop testing repair when Manager runs on Debian, for example. And if we continue using the sanity, half of it is duplicated by this PR.

I didn't know about such an issue. In that case we may consider keeping, for example, one per-release run of all sanity tests on all platforms (manager-3.2 release branch) and running installation tests only in master.

When I joined the project, all the sanity tests except Ubuntu22 and CentOS7 (which is already deprecated) had been disabled in the Jenkins master branch for more than a year. So nobody really cared much about these runs in master all that time.

@roydahan
Contributor

And now it looks like the sanity test when it was first introduced.

But it takes 15 minutes instead of 1.5-2 hours.

In the past we had an issue on one platform where Manager repair failed with “too many open files”. With this change we will never hit it, because we stop testing repair when Manager runs on Debian, for example. And if we continue using the sanity, half of it is duplicated by this PR.

I didn't know about such an issue. In that case we may consider keeping, for example, one per-release run of all sanity tests on all platforms (manager-3.2 release branch) and running installation tests only in master.

When I joined the project, all the sanity tests except Ubuntu22 and CentOS7 (which is already deprecated) had been disabled in the Jenkins master branch for more than a year. So nobody really cared much about these runs in master all that time.

Because the developers probably never looked at master, so it was worthless to run it on master.
Once you run it only for releases, it means that if there is an issue you fix it and it's a release blocker.
Hence you don't really care if it takes 15 minutes or 2 hours, because it runs only when you have releases.

IMO, if you want to have it in master, it should be tracked by developers, and if you want it to be tracked by developers, it should be part of their CI.
And if it’s part of their CI, the place for it is probably GitHub actions like the current CI they have.

@mikliapko
Contributor Author

Hence you don't really care if it takes 15 minutes or 2 hours, because it runs only when you have releases.

From my perspective, it is more efficient not to run a 2-hour job when the same result can be obtained from a 15-minute job. It's not only a question of time but of cloud resources as well.

IMO, if you want to have it in master, it should be tracked by developers, and if you want it to be tracked by developers, it should be part of their CI. And if it’s part of their CI, the place for it is probably GitHub actions like the current CI they have.

I don't really think they run anything from SCT in GitHub Actions.

@gmizrahi

@roydahan as we now own Manager testing and these are Manager-specific changes:
a. please add @mikliapko, @rayakurl, and myself as maintainers for this part of the code;
b. approve the PR (two of your team members have already approved it, and Manager is our domain).

cc @mykaul

@mykaul
Contributor

mykaul commented May 23, 2024

  • We have never faced OS-specific failures in backup/restore/repair functionality. Thus, there is no need to run the sanity test (which includes verification of this functionality) against non-main operating systems;

Isn't this an argument why it's not very important?

  • The installation test takes only ~20 minutes on average versus ~2 hours for the Manager sanity test;

Ouch! 20 minutes? What is taking so loooong? I'm very interested to know. Installation logs might help.

@mikliapko
Contributor Author

Ouch! 20 minutes? What is taking so loooong? I'm very interested to know. Installation logs might help.

It's not the test itself that takes 20 minutes but the infra preparation steps.
We hope this will be improved by the provisioning step added to the Manager pipeline yesterday (#7469). We haven't tested these installation tests with this step yet.

@mikliapko
Contributor Author

So, I'm not going to duplicate the tests in SCT for this. There are already tests for all distributions which do install. Since I'm also a maintainer, I'm not approving that.

My suggestion here, instead of duplicating tests, is to support a new mode for the sanity that will skip some sub-tests. Then, if you decide to re-enable the tests in master, you can enable them with the skip mode in the trigger, and for releases without the skip mode.

P.S. Better to link to the trigger PRs in scylla-pkg if you want a complete review.

In the current implementation I don't really see many intersections between these two tests:

Sanity test:

  1. Generate load
  2. Do basic backup test
  3. Do restore test
  4. Do repair
  5. Test cluster CRUD operations
  6. Health check test
  7. Suspend and resume test
  8. Client encryption test

Installation test:

  1. Check scylla-manager and scylla-manager-agent are up and running (/ping endpoint)
  2. Check the sctool version method is callable and returns valid info
  3. Add cluster and verify hosts health

So, of course, they can be joined, but from my point of view it would make the sanity test more complex.
Moreover, I don't see a big issue in introducing a new test.
To be honest, the Manager tests are quite far from being concise and free of duplicated code. I agree that that's not a reason to introduce new duplicates, but I personally don't see much duplication here.
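For reference, the three installation-test steps listed above could be sketched roughly as follows; only is_manager_server_up/is_manager_agent_up appear in the diff shown earlier, while the remaining helper names and signatures are illustrative assumptions rather than the exact SCT implementation:

# Rough sketch of the installation test flow; not the exact SCT implementation.
# from sdcm import mgmt  # SCT's Manager helpers; import path assumed
def test_manager_installation(self):
    manager_node = self.monitors.nodes[0]
    scylla_node = self.db_cluster.nodes[0]

    # 1. scylla-manager and scylla-manager-agent are up and running (/ping endpoint)
    assert manager_node.is_manager_server_up(), "scylla-manager server is not up"
    assert scylla_node.is_manager_agent_up(), "scylla-manager-agent is not up"

    # 2. `sctool version` is callable and returns valid info
    manager_tool = mgmt.get_scylla_manager_tool(manager_node=manager_node)  # assumed helper
    assert manager_tool.sctool.version, "sctool returned no version info"  # assumed attribute

    # 3. Add the cluster to Manager and verify hosts health
    mgr_cluster = manager_tool.add_cluster(name="installation-test",
                                           db_cluster=self.db_cluster)  # assumed signature
    assert all(host.status == "UP" for host in mgr_cluster.get_hosts_health()), \
        "Manager health check reports hosts that are not UP"  # assumed helper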

@soyacz
Contributor

soyacz commented May 23, 2024

As to test speed: the collect-logs phase takes long mainly because of getting monitoring screenshots.
I assume this is not relevant for these tests, and we could provide a means to skip it for such cases.
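As an illustration of what such a skip switch could look like, a rough sketch; the option name and hooks below are hypothetical, not existing SCT configuration:

# Hypothetical sketch: let short installation jobs skip the slow monitoring-screenshot step.
def collect_logs(self):
    if self.params.get("skip_monitor_screenshots"):  # hypothetical config option
        self.log.info("Skipping monitoring screenshots for short installation job")
    else:
        self.collect_monitor_screenshots()  # hypothetical hook around the existing collector
    self.collect_node_logs()  # hypothetical hook; node logs are still collected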

@roydahan
Contributor

@roydahan as we now own Manager testing and these are Manager-specific changes: a. please add @mikliapko, @rayakurl, and myself as maintainers for this part of the code; b. approve the PR (two of your team members have already approved it, and Manager is our domain).

Sure, once each reaches a few hundred significant commits we will consider adding new maintainers.

cc @mykaul

@roydahan
Contributor

So, I'm not going to duplicate the tests in SCT for this. There are already tests for all distributions which do install. Since I'm also a maintainer, I'm not approving that.
My suggestion here, instead of duplicating tests, is to support a new mode for the sanity that will skip some sub-tests. Then, if you decide to re-enable the tests in master, you can enable them with the skip mode in the trigger, and for releases without the skip mode.
P.S. Better to link to the trigger PRs in scylla-pkg if you want a complete review.

In the current implementation I don't really see many intersections between these two tests:

Sanity test:

  1. Generate load
  2. Do basic backup test
  3. Do restore test
  4. Do repair
  5. Test cluster CRUD operations
  6. Health check test
  7. Suspend and resume test
  8. Client encryption test

Installation test:

  1. Check scylla-manager and scylla-manager-agent are up and running (/ping endpoint)
  2. Check the sctool version method is callable and returns valid info
  3. Add cluster and verify hosts health

So, of course, they can be joined, but from my point of view it would make the sanity test more complex. Moreover, I don't see a big issue in introducing a new test. To be honest, the Manager tests are quite far from being concise and free of duplicated code. I agree that that's not a reason to introduce new duplicates, but I personally don't see much duplication here.

@mikliapko again, if it’s only installation of manager, have you considered using simple GH actions that will be part of CI?

@mykaul
Contributor

mykaul commented May 23, 2024

@roydahan as we now own Manager testing and these are Manager-specific changes: a. please add @mikliapko, @rayakurl, and myself as maintainers for this part of the code; b. approve the PR (two of your team members have already approved it, and Manager is our domain).

Sure, once each reaches a few hundred significant commits we will consider adding new maintainers.

Hundreds?

cc @mykaul

@mykaul
Contributor

mykaul commented May 23, 2024

As to test speed: the collect-logs phase takes long mainly because of getting monitoring screenshots. I assume this is not relevant for these tests, and we could provide a means to skip it for such cases.

I hope there's an issue on this - why would taking a screenshot take long? (I assume there's a reason, perhaps we can improve that).

@mikliapko
Contributor Author

@mikliapko again, if it’s only installation of manager, have you considered using simple GH actions that will be part of CI?

No, we haven't tried this option. But what for?
The QA team doesn't look into the GH Actions jobs that the devs take care of. We want to have all the tests that are interesting and important to us in one place, together with the rest of the Manager SCT jobs.

@roydahan
Contributor

As to test speed: the collect-logs phase takes long mainly because of getting monitoring screenshots. I assume this is not relevant for these tests, and we could provide a means to skip it for such cases.

I hope there's an issue on this - why would taking a screenshot take long? (I assume there's a reason, perhaps we can improve that).

Probably because the monitor is very, very slow.
It's a small node that runs Manager (with its own Scylla) and the monitor.

@roydahan
Contributor

@mikliapko again, if it’s only installation of manager, have you considered using simple GH actions that will be part of CI?

No, we haven't tried this option. But what for? The QA team doesn't look into the GH Actions jobs that the devs take care of. We want to have all the tests that are interesting and important to us in one place, together with the rest of the Manager SCT jobs.

That's exactly my whole point.
We don't want only QA to watch these jobs and to find out AFTER the fact that there was a breaking commit.
That's why these tests were disabled in the first place: developers didn't watch them, hence no one watched them, so it was a waste to run them.
Otherwise the same thing will happen with your new tests. Having a simple GH action that only installs it will allow developers to detect these issues as part of their CI. It can also be set up as an extended CI.

@rayakurl
Contributor

@mikliapko again, if it’s only installation of manager, have you considered using simple GH actions that will be part of CI?

No, we haven't tried this option. But what for? The QA team doesn't look into the GH Actions jobs that the devs take care of. We want to have all the tests that are interesting and important to us in one place, together with the rest of the Manager SCT jobs.

That's exactly my whole point. We don't want only QA to watch these jobs and to find out AFTER the fact that there was a breaking commit. That's why these tests were disabled in the first place: developers didn't watch them, hence no one watched them, so it was a waste to run them. Otherwise the same thing will happen with your new tests. Having a simple GH action that only installs it will allow developers to detect these issues as part of their CI. It can also be set up as an extended CI.

@roydahan we have a defined testing strategy; we have tests in GH Actions maintained and watched by @karol-kokoszka and @Michal-Leszczynski.
So we have a few levels of tests and a clear plan for how to test Manager.
We are defining a procedure for who looks at the runs and have close collaboration between QA and the developers, as we are part of the same group now.

Having said that, as far as I know this is not a concern of an SCT maintainer.
As far as I understand, a maintainer of a repository should check the correctness of the code, verify that the new code is not introducing regressions, etc.
The Manager tests are owned by us, and we decide how they are tested, when, and by whom. This has been discussed and pre-approved by everyone involved in the project.
And IMO this PR is not the place to discuss it.

For the scope of this PR, Mikita already showed that the new test doesn't duplicate the old test; Israel, who is a maintainer, didn't think so either, nor did Lukasz, who initially approved the PR. We have approval from 2 maintainers and 2 members of the Manager task force, which is more than we need.
So nothing should keep this PR from being merged.

@mykaul, @gmizrahi - if I am wrong, and the SCT maintainers' job is to define the test strategy (I've never witnessed this before), then all the maintainers should attend all the Manager meetings that we have (bi-weekly, weekly grooming, etc.), be part of the Manager channel, and keep up to date with all the changes that we make in real time.
I don't think that this is productive, but it's your call.

@roydahan
Contributor

It does introduce duplication.
It doesn't remove the older sanities nor link to a scylla-pkg PR that changes the triggers accordingly.

@mikliapko
Contributor Author

It does introduce duplication.

It seems like we already discussed the duplication question yesterday. Here you can find my view and explanation
(#7456 (comment)).
Could you please specify the exact places where you see code or test duplication?

It doesn’t remove the older sanities

There is no need to remove the sanities. This test should remain enabled for the main distro (ubuntu22). In addition, we are planning to keep it for secondary distributions as well, but run it only on request (manually) if we suspect any possible OS-related issues.

nor link to a scylla-pkg PR that changes the triggers accordingly.

There is an ongoing effort to rework the test triggering approach a bit (scylladb/scylla-manager#3856). It will be addressed there.

@rayakurl
Contributor

This PR has been pending for a few days now.
@roydahan it's waiting for you to reply to Mikita's comment.

@roydahan
Contributor

Waiting to see the triggers PR, which is blocking these tests anyway.

@mikliapko
Contributor Author

Waiting to see the triggers PR, which is blocking these tests anyway.

Is it a must-have for any tests to be merged?

We are planning to introduce the triggers for these installation tests as part of this task (scylladb/scylla-manager#3856), where we plan to rework the triggering process itself and, for manual builds, allow choosing the group of tests to execute depending on their priority. I just don't want to do double work adding triggers for the installation tests only for them to be reworked again in the near future.

Moreover, not having such triggers yet doesn't prevent us from manually executing these tests for now.

@roydahan

@gmizrahi

gmizrahi commented Jun 2, 2024

@roydahan please approve the PR or grant me permissions to approve it.

Thanks!

Contributor

@vponomaryov vponomaryov left a comment

I read the whole thread.
The reasoning for this PR makes sense to me.
The common SCT code is not affected, only Manager-specific code.

LGTM

@mikliapko
Contributor Author

@roydahan

While waiting for this PR to be merged, we've implemented the task (scylladb/scylla-manager#3856) where the test triggering approach has been reworked a bit. Installation tests were also added.

See the PR https://github.com/scylladb/scylla-pkg/pull/4109

@roydahan roydahan merged commit 1293b0a into scylladb:master Jun 6, 2024
6 checks passed
@gmizrahi

gmizrahi commented Jun 6, 2024

thank you @roydahan

@mikliapko mikliapko deleted the installation-test-for-manager-3852 branch June 6, 2024 08:52
@vponomaryov
Contributor

@mergify backport manager-3.2


mergify bot commented Jun 6, 2024

backport manager-3.2

✅ Backports have been created


Successfully merging this pull request may close these issues.

[SCT] Introduce the test for Manager installation and smoke checks