[bugfix] Fix job cancellation when job is pending due to `ReqNodeNotAvail` #2417

vkarak · 2022-02-03T23:43:33Z

@casparvl Can you check if this works for you?

victorusu

lgtm

casparvl · 2022-02-07T14:53:08Z

It took me a while to recreate the node status... In the end, I only seem to get this status if:

- I have a reservation (to which I am eligible)
- I submit to the node in the reservation
- That node is already running some other job
- There are no nodes in idle+drain or down+drain status in the same partition to which I submit (if there are, I get `(ReqNodeNotAvail, UnavailableNodes: <nodename>)`, where nodename is not the node I requested, but the one in the drain status...)

But yes, this fix works! I checked out vkarak:bugfix/reqnodenotavail-cancel-job, ran ./bootstrap and ran ./bin/reframe on a dummy test. I now see:

             JOBID            PARTITION                 NAME     USER    STATE       TIME TIME_LIMI  NODES   PRIORITY           START_TIME  REQ_NODES NODELIST(REASON)
           8734232                   sw      rfm_testXYZ_job  casparl COMPLETI       0:03      5:00      1     685795  2022-02-07T15:50:10  software1 software1
           8734233                   sw      rfm_testXYZ_job  casparl  PENDING       0:00      5:00      1     685794  2022-02-07T15:55:10  software1 (ReqNodeNotAvail, May be reserved for other job)
           8734234                   sw      rfm_testXYZ_job  casparl  PENDING       0:00      5:00      1     685794  2022-02-07T15:55:10  software1 (ReqNodeNotAvail, May be reserved for other job)

The two pending jobs are no longer cancelled; they keep waiting until the node becomes available.
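For readers following along, here is a minimal sketch of how the two forms of `ReqNodeNotAvail` seen above could be told apart when parsing the `NODELIST(REASON)` column. This is not ReFrame's actual code: the helper names (`parse_pending_reason`, `unavailable_nodelist`, `needs_node_check`) are hypothetical, and the regular expressions assume the reason strings look exactly like the ones quoted in this comment.

```python
import re

# Hypothetical helpers for illustration only; not ReFrame's implementation.
# Assumes the field looks like "(Reason)" or "(Reason, free-text details)".
_REASON_RE = re.compile(r'\((?P<reason>\w+)(?:,\s*(?P<details>.*))?\)')


def parse_pending_reason(field):
    """Split squeue's NODELIST(REASON) field into (reason, details).

    Returns (None, None) if the field is a plain node list.
    """
    m = _REASON_RE.fullmatch(field.strip())
    if m is None:
        return None, None
    return m.group('reason'), m.group('details')


def unavailable_nodelist(details):
    """Return the raw node list after 'UnavailableNodes:', or None."""
    if not details:
        return None
    m = re.match(r'UnavailableNodes:\s*(\S+)', details)
    return m.group(1) if m else None


def needs_node_check(field):
    """True only if Slurm names specific unavailable nodes.

    When the reason is ReqNodeNotAvail without an explicit node list
    (e.g. 'May be reserved for other job'), the job is simply left waiting.
    """
    reason, details = parse_pending_reason(field)
    if reason != 'ReqNodeNotAvail':
        return False
    return unavailable_nodelist(details) is not None
```

With the output above, `needs_node_check('(ReqNodeNotAvail, May be reserved for other job)')` is false, so the job would be left pending, whereas a reason of the form `(ReqNodeNotAvail, UnavailableNodes: <nodename>)` would still warrant inspecting the listed nodes before deciding anything.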

vkarak · 2022-02-07T18:10:58Z

Thanks @casparvl for the detailed analysis! Indeed, Slurm is a bit strange in how and what it reports with ReqNodeNotAvail. I think I have seen the same problem in a reservation as well, but didn't have the time to investigate it as you did, so I learnt something new!

codecov-commenter · 2022-02-07T18:35:10Z

Codecov Report

Merging #2417 (628a0a6) into master (cd924f8) will increase coverage by 0.00%.
The diff coverage is 0.00%.

@@           Coverage Diff           @@
##           master    #2417   +/-   ##
=======================================
  Coverage   85.66%   85.67%           
=======================================
  Files          56       56           
  Lines       10467    10466    -1     
=======================================
  Hits         8967     8967           
+ Misses       1500     1499    -1

| Impacted Files | Coverage Δ | |
|---|---|---|
| reframe/core/schedulers/slurm.py | `52.57% <0.00%> (+0.14%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cd924f8...628a0a6.

Fix job cancellation when job is pending due to ReqNodeNotAvail

f2484ab

vkarak added prio: normal bugfix schedulers labels Feb 3, 2022

vkarak added this to the ReFrame Sprint 22.01.2 milestone Feb 3, 2022

vkarak requested review from teojgo and victorusu February 3, 2022 23:43

vkarak self-assigned this Feb 3, 2022

vkarak changed the title ~~[bugfix] Fix job cancellation when job is pending due to ReqNodeNotAvail~~ [bugfix] Fix job cancellation when job is pending due to `ReqNodeNotAvail` Feb 3, 2022

victorusu approved these changes Feb 4, 2022

Merge branch 'master' into bugfix/reqnodenotavail-cancel-job

628a0a6

vkarak merged commit 46a00d1 into reframe-hpc:master Feb 7, 2022

vkarak deleted the bugfix/reqnodenotavail-cancel-job branch February 7, 2022 21:28
