@sleak-lbl
Contributor

Fixes #1330

Exceptions raised from build tasks within compile_wait with the async policy are not caught by reschedule_task, so they bubble up and stop the ReFrame process. Any queued jobs are then abandoned.

This PR catches exceptions that should cause the test, but not the whole ReFrame run, to fail.
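
To illustrate the intended behaviour, here is a minimal, self-contained sketch. It is not the actual patch and does not use ReFrame's classes: `ToyTaskError`, `ToyTask` and `run_all` are hypothetical names made up for this example, and the real scheduling in the async policy is more involved. The point is only that a failure raised while waiting on a build is handled per task, so the tasks queued behind it still run.

```python
class ToyTaskError(Exception):
    """Hypothetical per-task failure, e.g. a failed build."""


class ToyTask:
    """Hypothetical stand-in for a scheduled test task."""

    def __init__(self, name, build_ok=True):
        self.name = name
        self.build_ok = build_ok
        self.failed = False

    def compile_wait(self):
        # Stand-in for waiting on the build job; raises if the build failed.
        if not self.build_ok:
            raise ToyTaskError(f'{self.name}: build failed')

    def run(self):
        return f'{self.name}: ran'


def run_all(tasks):
    """Process every queued task; a build failure fails only that task."""
    results = []
    for task in tasks:
        try:
            task.compile_wait()
        except ToyTaskError as err:
            # Record the failure and carry on with the remaining tasks
            # instead of letting the exception abort the whole run.
            task.failed = True
            results.append(str(err))
            continue

        results.append(task.run())

    return results


tasks = [ToyTask('t0'), ToyTask('t1', build_ok=False), ToyTask('t2')]
results = run_all(tasks)
```

Without the `try`/`except` around `compile_wait()`, the failure of `t1` would propagate out of `run_all` and `t2` would never run, which is the symptom described in the linked issue.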

@pep8speaks

pep8speaks commented May 21, 2020

Hello @sleak-lbl, Thank you for updating!

Cheers! There are no PEP8 issues in this Pull Request! Do see the ReFrame Coding Style Guide.

Comment last updated at 2020-05-28 05:12:54 UTC

@jenkins-cscs
Collaborator

Can I test this patch?

@codecov-commenter

codecov-commenter commented May 21, 2020

Codecov Report

Merging #1331 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1331      +/-   ##
==========================================
+ Coverage   91.75%   91.77%   +0.01%     
==========================================
  Files          83       83              
  Lines       12493    12523      +30     
==========================================
+ Hits        11463    11493      +30     
  Misses       1030     1030              
| Impacted Files | Coverage Δ |
|---|---|
| reframe/frontend/executors/policies.py | 98.13% <100.00%> (+0.01%) ⬆️ |
| unittests/resources/checks/frontend_checks.py | 99.34% <100.00%> (+0.03%) ⬆️ |
| unittests/test_policies.py | 99.45% <100.00%> (+0.03%) ⬆️ |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9ddadeb...4524181.

@sleak-lbl changed the title from "[bugfix] Fix early abort when compile step fails with async policy" to "WIP: [bugfix] Fix early abort when compile step fails with async policy" on May 22, 2020
@sleak-lbl
Contributor Author

I've just had an early abort with this patch in place, so it's clearly not the solution (or at least, not the full solution). I've added a WIP marker to the PR and am digging a bit more now.

@vkarak marked this pull request as draft on May 23, 2020 22:23
@vkarak
Contributor

vkarak commented May 23, 2020

Check my comment in the issue for a possible cause of this bug.

As a result of another task's exit.
@vkarak marked this pull request as ready for review on May 24, 2020 18:19
@vkarak changed the title from "WIP: [bugfix] Fix early abort when compile step fails with async policy" to "[bugfix] Fix early abort when compile step fails with async policy" on May 24, 2020
@vkarak
Contributor

vkarak commented May 24, 2020

@sleak-lbl Can you test this PR now with a real workload?

@vkarak added this to the ReFrame sprint 20.08 milestone on May 24, 2020
Contributor

@ekouts left a comment

lgtm

Contributor

@vkarak left a comment

The current fix lgtm. @sleak-lbl did you have time to try it? My plan is to merge this PR for ReFrame 3.0 since it indeed fixes a bug, which I hope is your case.

@sleak-lbl
Contributor Author

Tested with a real workload and verified that it worked (and, as far as I can tell, for the right reasons), so I think it is good to go.

@vkarak
Contributor

vkarak commented May 28, 2020

Thanks for testing and confirming, @sleak-lbl!

Development

Successfully merging this pull request may close these issues.

Tasks waiting to run are silently aborted if a later task fails compile stage, with async policy

6 participants