@vkarak vkarak commented Feb 3, 2022

Fixes #2411.

@casparvl Can you check if this works for you?

@vkarak vkarak added this to the ReFrame Sprint 22.01.2 milestone Feb 3, 2022
@vkarak vkarak requested review from teojgo and victorusu February 3, 2022 23:43
@vkarak vkarak self-assigned this Feb 3, 2022
@vkarak vkarak changed the title [bugfix] Fix job cancellation when job is pending due to ReqNodeNotAvail [bugfix] Fix job cancellation when job is pending due to ReqNodeNotAvail Feb 3, 2022

@victorusu victorusu left a comment

lgtm

casparvl commented Feb 7, 2022

It took me a while to recreate the node status... In the end, I only seem to get this status if:

  • I have a reservation (to which I am eligible)
  • I submit to the node in the reservation
  • That node is already running some other job
  • There are no nodes in idle+drain or down+drain status in the same partition to which I submit (if there are, I get a (ReqNodeNotAvail, UnavailableNodes: <nodename>), where nodename is not the node I requested, but the one in the drain status...)

But yes, this fix works! I checked out vkarak:bugfix/reqnodenotavail-cancel-job, ran ./bootstrap and ran ./bin/reframe on a dummy test. I now see:

             JOBID            PARTITION                 NAME     USER    STATE       TIME TIME_LIMI  NODES   PRIORITY           START_TIME  REQ_NODES NODELIST(REASON)
           8734232                   sw      rfm_testXYZ_job  casparl COMPLETI       0:03      5:00      1     685795  2022-02-07T15:50:10  software1 software1
           8734233                   sw      rfm_testXYZ_job  casparl  PENDING       0:00      5:00      1     685794  2022-02-07T15:55:10  software1 (ReqNodeNotAvail, May be reserved for other job)
           8734234                   sw      rfm_testXYZ_job  casparl  PENDING       0:00      5:00      1     685794  2022-02-07T15:55:10  software1 (ReqNodeNotAvail, May be reserved for other job)

And the two pending jobs are no longer cancelled; they keep waiting until the node becomes available.
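
For illustration, the distinction between the two kinds of ReqNodeNotAvail reason seen above can be sketched in Python. This is a minimal sketch, not ReFrame's actual code: the helper names `parse_reason`, `unavailable_nodes` and `should_keep_waiting` are hypothetical, and the policy (keep waiting unless Slurm lists explicit `UnavailableNodes`) is a simplifying assumption based only on the output shown here.

```python
import re

# Hypothetical helpers for illustration only; not taken from ReFrame.
_REASON_RE = re.compile(r'^\((?P<reason>[^,)]+)(?:,\s*(?P<details>.*))?\)$')


def parse_reason(field):
    """Split squeue's NODELIST(REASON) value into (reason, details)."""
    match = _REASON_RE.match(field.strip())
    if match is None:
        # Not a parenthesised reason, e.g. a plain node list
        return None, None
    return match.group('reason').strip(), match.group('details')


def unavailable_nodes(details):
    """Return the node-list string after 'UnavailableNodes:', if present."""
    if details is None:
        return None
    match = re.match(r'UnavailableNodes:\s*(?P<nodes>\S+)', details.strip())
    return match.group('nodes') if match else None


def should_keep_waiting(field):
    """Assumed policy: wait unless Slurm names explicit unavailable nodes."""
    reason, details = parse_reason(field)
    if reason != 'ReqNodeNotAvail':
        return True
    return unavailable_nodes(details) is None
```

With this sketch, `should_keep_waiting('(ReqNodeNotAvail, May be reserved for other job)')` is true, so such a job would be left pending. A real check would likely also need to look at the state of the listed nodes, since, as noted above, the node reported under `UnavailableNodes` may not be the one that was requested.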

vkarak commented Feb 7, 2022

Thanks @casparvl for the detailed analysis! Indeed, Slurm is a bit strange in how and what it reports with ReqNodeNotAvail. I think I have seen the same problem in a reservation as well, but didn't have the time to investigate it as you did, so I learnt something new!

Codecov Report

Merging #2417 (628a0a6) into master (cd924f8) will increase coverage by 0.00%.
The diff coverage is 0.00%.

@@           Coverage Diff           @@
##           master    #2417   +/-   ##
=======================================
  Coverage   85.66%   85.67%           
=======================================
  Files          56       56           
  Lines       10467    10466    -1     
=======================================
  Hits         8967     8967           
+ Misses       1500     1499    -1     
Impacted Files Coverage Δ
reframe/core/schedulers/slurm.py 52.57% <0.00%> (+0.14%) ⬆️

@vkarak vkarak merged commit 46a00d1 into reframe-hpc:master Feb 7, 2022
@vkarak vkarak deleted the bugfix/reqnodenotavail-cancel-job branch February 7, 2022 21:28
Development

Successfully merging this pull request may close these issues.

Job that gets cancelled prematurely by ReFrame

4 participants