Skip to content

Conversation

@ekouts
Copy link
Contributor

@ekouts ekouts commented Sep 16, 2021

This PR refactors the policies of ReFrame, and especially how the asynchronous policy is managed. Additionally we allow the asynchronous execution of the build phase.

The main ideas of the PR are:

  • In the policy we keep a set of all the "current_tasks".
  • All of the tests start from the wait stage.
  • After all tests are processed (and are in the current_tasks set) we go into a busy loop in which:
    • We poll all the schedulers but without immediately advancing any test.
    • We check all the tasks of the current_tasks set (green stages) and try to advance them to the next stage.
  • Retired tasks (the tasks that have successfully finished but haven't cleaned up their resources yet) are managed differently because of the dependencies. A test is not cleaned up until all tests that are depending on it have successfully finished. If a test (or any of the tests that depend on it) fail the stage directory will not be "cleaned up" and consequently the ReFrame execution might finish without emptying the retired stack.

reframe_policies

With the async building a new issue was created. By default ReFrame will build the test locally (and not in the partition). So the max_jobs limit of the partition might not be good enough for the "forced" local tests (the ones that have the attribute local to True) + the building jobs. For this reason we added a new limit (current name is rfm_max_jobs but might change) for these jobs.

Finally there is also to possibility to not advance all tests in every loop but we can set a time limit (to avoid for example that all tests in the beginning are serially doing some "setup work", like conflict module resolution or hook that perform heavy operations) without submitting jobs to the schedulers. I would recommend we set arbitrarily the time limit to maybe 10 seconds(?). This would make it more relevant in the beginning, where the tests are setting up, doing submissions etc and not so much after the jobs are submitted and we poll for them.

Closes #1708.
Closes #1617.
Closes #2109.

@pep8speaks
Copy link

pep8speaks commented Sep 16, 2021

Hello @ekouts, Thank you for updating!

Cheers! There are no PEP8 issues in this Pull Request!Do see the ReFrame Coding Style Guide

Comment last updated at 2022-01-21 17:21:40 UTC

@ekouts ekouts requested review from victorusu and vkarak September 16, 2021 09:05
@ekouts ekouts marked this pull request as ready for review September 27, 2021 12:27
@codecov-commenter
Copy link

codecov-commenter commented Sep 27, 2021

Codecov Report

Merging #2194 (695aea5) into master (3125b5b) will decrease coverage by 0.28%.
The diff coverage is 89.51%.

❗ Current head 695aea5 differs from pull request most recent head ab2b0d0. Consider uploading reports for the commit ab2b0d0 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2194      +/-   ##
==========================================
- Coverage   86.14%   85.86%   -0.29%     
==========================================
  Files          56       56              
  Lines       10225    10270      +45     
==========================================
+ Hits         8808     8818      +10     
- Misses       1417     1452      +35     
Impacted Files Coverage Δ
reframe/frontend/executors/policies.py 92.79% <88.12%> (-6.91%) ⬇️
reframe/core/pipeline.py 92.88% <100.00%> (+0.05%) ⬆️
reframe/frontend/cli.py 76.51% <100.00%> (+0.08%) ⬆️
reframe/frontend/executors/__init__.py 97.50% <100.00%> (-0.79%) ⬇️
reframe/core/schedulers/local.py 93.27% <0.00%> (-3.37%) ⬇️
reframe/core/systems.py 88.15% <0.00%> (-1.32%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3125b5b...ab2b0d0. Read the comment docs.

@vkarak
Copy link
Contributor

vkarak commented Jan 14, 2022

We will also need to change the version annotations in the documentation for things added or changed by this PR to mention 3.10 instead of 3.9.3.

Copy link
Contributor

@vkarak vkarak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of minor changes still.

Copy link
Contributor

@vkarak vkarak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

5 participants