Manager tests failures related to Spot instances #7413

mikliapko · 2024-05-08T16:03:23Z

Issue description

Over the last month I've seen multiple failures in manager tests related to Spot instances (manager jobs use this instance type by default).
A couple of examples:

https://argus.scylladb.com/workspace?state=WyI1NDI5NjNlYS1lZGYwLTQ0YmItOGQzNy1iYzc4OGFkMDgxMTMiXQ

15:34:02  ----- LAST ERROR EVENT -------------------------------------------------------
15:34:02  2024-05-02 13:33:44.128: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=27a9d539-dc2b-42ac-bc8d-51dc742f28ea, source=MgmtCliTest.SetUp()
15:34:02  exception=Failed to get spot instances: capacity-not-available

https://jenkins.scylladb.com/view/scylla-manager/job/manager-master/job/centos-sanity-test/1138

23:40:29  ----- LAST CRITICAL EVENT ----------------------------------------------------
23:40:29  2024-04-17 20:54:31.807: (SpotTerminationEvent Severity.CRITICAL) period_type=one-time event_id=edb9ca5a-4f3c-4ce2-a04d-d717c16dd97c: node=Node manager-regression-master-loader-node-9e05767c-1 [44.193.201.15 | 10.12.2.105] (dc name: us-east-1) message={'action': 'terminate', 'time': '2024-04-17T20:56:29Z', 'time-left': 117.1921648979187}

Switching to on_demand type of instances solved the problem in each case.

@fruch

Is there any way to mitigate this issue while keeping the spot instance type in place?
If not, I'd suggest switching all manager pipelines to use on_demand instance types.

Impact

High

How frequently does it reproduce?

I'd say ~50% of test executions last month.


fruch · 2024-05-08T17:24:35Z

@fruch

Is there any way to mitigate this issue while keeping the spot instance type in place?
If not, I'd suggest switching all manager pipelines to use on_demand instance types.

No, there is no magic fix: spot instances can be reclaimed during a test, and sometimes they aren't available at all.
There are regions where this may be less likely to happen, or AZs that may have better availability.
One can also check whether switching to a different instance type helps, since some instance types are more available than others.

In the end it's a matter of cost. In core we also run the longer tests with on_demand, and when there is a release we do the same, but only for a very small set of regularly triggered jobs.
So they are triggered with spot, and if needed we trigger them again with on_demand, to get a result.

In your case you might want the trigger for releases to use on_demand, and the ones on master to use spot.
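
For illustration, the split suggested above could be sketched roughly like this. It is only a sketch under assumptions: the function name, the trigger labels "release" and "master", and the `instance_provision` key are hypothetical placeholders, not the actual pipeline configuration.

```python
def provision_params(trigger):
    """Return hypothetical job parameters for a given trigger kind.

    `trigger` is assumed to be "release" or "master"; both labels and the
    `instance_provision` key are illustrative assumptions.
    """
    if trigger == "release":
        # Release runs must produce a result, so pay for on_demand.
        return {"instance_provision": "on_demand"}
    # Regular runs on master can tolerate an occasional spot failure.
    return {"instance_provision": "spot"}


print(provision_params("release"))
print(provision_params("master"))
```

The idea is simply that the cost/reliability trade-off is decided per trigger rather than per pipeline.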

mikliapko · 2024-05-09T09:18:49Z

In your case you might want the trigger for releases to use on_demand, and the ones on master to use spot.

Agree, sounds like a good solution in our case.
Also, in case of future failures I'll try to experiment with different instance types.

mykaul · 2024-05-15T10:34:37Z

I'd argue there's a difference between no spot available and spot termination. The former - we can easily fallback to ondemand, and I think it makes sense. The latter - harder to deal with - but I'd like to hope is less common - and happens when the tests are 1h or longer, I reckon?
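
A minimal sketch of the "fall back to on-demand when no spot is available" case, to make the distinction concrete. Everything here is hypothetical: `SpotCapacityError`, `provision_with_fallback`, and the `provision(instance_type, count)` callable stand in for whatever the test framework actually uses. It does not help with spot termination in the middle of a test.

```python
class SpotCapacityError(Exception):
    """Raised by the (hypothetical) provisioner when no spot capacity is available."""


def provision_with_fallback(provision, count, fallback_enabled=True):
    """Try spot first; fall back to on_demand only when spot capacity is missing.

    `provision` is a hypothetical callable: provision(instance_type, count)
    returning a list of instance ids.
    """
    try:
        return "spot", provision("spot", count)
    except SpotCapacityError:
        if not fallback_enabled:
            raise
        # No spot capacity: pay for on_demand rather than fail the setup.
        return "on_demand", provision("on_demand", count)


def fake_provision(instance_type, count):
    """Stand-in provisioner that always reports missing spot capacity."""
    if instance_type == "spot":
        raise SpotCapacityError("capacity-not-available")
    return [f"{instance_type}-{i}" for i in range(count)]


used, nodes = provision_with_fallback(fake_provision, 2)
print(used, nodes)
```

With the stand-in provisioner the call ends up on on_demand; with `fallback_enabled=False` the original error is re-raised, so the current behaviour stays available.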

fruch · 2024-05-15T15:07:54Z

I'd argue there's a difference between no spot available and spot termination. The former - we can easily fallback to ondemand, and I think it makes sense. The latter - harder to deal with - but I'd like to hope is less common - and happens when the tests are 1h or longer, I reckon?

Those tests are longer than 1 hour, and currently both cases are handled the same way, since someone needs to manually re-run the tests if they fail (we are not doing it automatically).
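
If the re-run were ever automated, one possible first step would be telling the two failure modes apart from the log. The sketch below is an assumption-laden illustration: the two patterns are copied from the log excerpts in the issue description, while the function names and the re-run policy are hypothetical.

```python
import re

# Markers taken from the two log excerpts quoted in the issue description.
NO_CAPACITY = re.compile(r"Failed to get spot instances: capacity-not-available")
TERMINATED = re.compile(r"SpotTerminationEvent")


def classify_spot_failure(log_text):
    """Return "no_capacity", "terminated", or None for non-spot failures."""
    if NO_CAPACITY.search(log_text):
        return "no_capacity"
    if TERMINATED.search(log_text):
        return "terminated"
    return None


def should_rerun_on_demand(log_text):
    """Hypothetical policy: re-run on on_demand only for spot-related failures."""
    return classify_spot_failure(log_text) is not None


print(classify_spot_failure("exception=Failed to get spot instances: capacity-not-available"))
print(classify_spot_failure("(SpotTerminationEvent Severity.CRITICAL) period_type=one-time"))
```

Failures that match neither pattern are left alone, so a genuine test failure would not be hidden by an automatic re-run.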

mykaul · 2024-05-16T09:48:57Z

I'd argue there's a difference between no spot available and spot termination. The former - we can easily fallback to ondemand, and I think it makes sense. The latter - harder to deal with - but I'd like to hope is less common - and happens when the tests are 1h or longer, I reckon?

Those tests are longer than 1 hour, and currently both cases are handled the same way, since someone needs to manually re-run the tests if they fail (we are not doing it automatically).

All of them are longer than 1h?

rayakurl · 2024-05-28T09:20:55Z

longer

@mykaul - all of them.
@mikliapko is working on shorter tests, but they have not been merged yet - #7456

mykaul · 2024-05-28T09:22:29Z

longer

@mykaul - all of them.

That's too bad. I don't have time now, but I'd be happy to review this at a later point. It makes little sense to me - we should be able to be more efficient.

rayakurl · 2024-05-28T09:24:01Z

longer

@mykaul - all of them.

That's too bad. I don't have time now, but I'd be happy to review this at a later point. It makes little sense to me - we should be able to be more efficient.

@mikliapko is working on shorter tests, but they have not been merged yet - #7456

rayakurl · 2024-05-28T10:00:35Z

@mykaul - if you wish to review - https://docs.google.com/spreadsheets/d/1enOmxToYVXQEQgGPBCPZIA0JblV5zaBBEFNRaKMS5Ho/edit#gid=605769695
@mikliapko created a document with all the pipelines that existed before he joined the project. @mikliapko - it's worth creating a new tab with all the new pipelines, both on master and 3.2

github-actions bot assigned mikliapko May 8, 2024

mikliapko assigned fruch May 8, 2024

rayakurl closed this as completed May 28, 2024
