[feat] Redesign job polling mechanism in the framework #1402

ekouts · 2020-07-07T13:47:34Z

Main idea of this PR:

- The jobs' state is now updated in `poll(*jobs)` instead of `_update_state`. Job-specific errors are not raised immediately; their delivery is deferred until the job's status is explicitly queried through its `finished()` method.
- The scheduler object no longer holds any information about individual jobs.
- Each partition has one common scheduler for all the tests, and the policies should have access to the scheduler objects.
- The former `RegressionTask.poll()`, now renamed to `run_complete()`, checks only the current state and no longer triggers any backend command to the scheduler. It also raises an exception for any job-specific error encountered during the polling performed by the scheduler.
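
The points above can be illustrated with a minimal sketch. It is not the code of this PR: the class bodies, the `JobError` exception, the `fake_backend` function, the state strings and the `num_backend_calls` counter are all assumptions made for illustration. It only shows the general shape of the idea: one aggregated `poll(*jobs)` call updates every job, errors are recorded on the job, and they surface only when `finished()` is called.

```python
class JobError(Exception):
    """Hypothetical job-specific error (illustration only)."""


class Job:
    def __init__(self, jobid):
        self.jobid = jobid
        self.state = 'PENDING'
        self.exception = None   # error recorded by the scheduler while polling

    def finished(self):
        # Deferred delivery: the error surfaces only on an explicit query
        if self.exception is not None:
            raise self.exception

        return self.state == 'COMPLETED'


class Scheduler:
    """Keeps no per-job bookkeeping; jobs are passed in on every poll."""

    def __init__(self, backend):
        self._backend = backend   # callable: list of job ids -> {jobid: state}
        self.num_backend_calls = 0

    def poll(self, *jobs):
        if not jobs:
            return   # nothing to ask the backend about

        self.num_backend_calls += 1
        states = self._backend([job.jobid for job in jobs])
        for job in jobs:
            job.state = states.get(job.jobid, 'UNKNOWN')
            if job.state == 'FAILED':
                # Record the error instead of raising it here
                job.exception = JobError(f'job {job.jobid} failed')


def fake_backend(jobids):
    # Stand-in for a single aggregated status query to the backend
    return {jid: 'FAILED' if jid == 2 else 'COMPLETED' for jid in jobids}


sched = Scheduler(fake_backend)
jobs = [Job(1), Job(2), Job(3)]
sched.poll(*jobs)   # one backend call for all jobs; nothing is raised here

results = []
for job in jobs:
    try:
        results.append((job.jobid, job.finished()))
    except JobError as err:
        results.append((job.jobid, str(err)))

print(results)
```

In this sketch the failure of one job does not interrupt the polling of the others, which is the motivation for deferring the error until the job is queried.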

- the jobs' state is updated in poll_jobs instead of update_state
- the scheduler object doesn't have any information for individual jobs anymore
- each partition has its one common scheduler for all the tests
- the RegressionTask poll method should only check the current state and not trigger any backend command to the scheduler anymore

pep8speaks · 2020-07-07T13:47:40Z

Hello @ekouts, Thank you for updating!

In the file reframe/core/pipeline.py:

Line 37:80: E501 line too long (80 > 79 characters)

Do see the ReFrame Coding Style Guide

Comment last updated at 2020-10-06 14:42:01 UTC

codecov-commenter · 2020-07-15T07:52:31Z

Codecov Report

Merging #1402 into master will decrease coverage by 0.21%.
The diff coverage is 66.06%.

@@            Coverage Diff             @@
##           master    #1402      +/-   ##
==========================================
- Coverage   91.72%   91.50%   -0.22%     
==========================================
  Files          83       83              
  Lines       13029    13190     +161     
==========================================
+ Hits        11951    12070     +119     
- Misses       1078     1120      +42

| Impacted Files | Coverage Δ | |
|---|---|---|
| reframe/core/schedulers/torque.py | `20.58% <7.84%> (-6.20%)` | ⬇️ |
| reframe/core/schedulers/slurm.py | `53.20% <24.59%> (-3.70%)` | ⬇️ |
| reframe/core/schedulers/pbs.py | `64.42% <37.20%> (-0.42%)` | ⬇️ |
| unittests/test_cli.py | `90.97% <50.00%> (ø)` | |
| reframe/core/pipeline.py | `92.43% <75.00%> (-0.44%)` | ⬇️ |
| unittests/test_launchers.py | `92.45% <75.00%> (+0.07%)` | ⬆️ |
| unittests/test_utility.py | `99.39% <80.00%> (+<0.01%)` | ⬆️ |
| reframe/core/systems.py | `88.31% <81.25%> (+0.11%)` | ⬆️ |
| unittests/test_schedulers.py | `94.25% <87.50%> (+0.45%)` | ⬆️ |
| reframe/frontend/executors/__init__.py | `98.68% <90.00%> (-0.32%)` | ⬇️ |
... and 17 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 068dccb...f48cf44. Read the comment docs.

vkarak

Some parts were a bit hard for me to follow because of unclear variable names, especially in the schedulers. I have suggested some alternatives. The execution policies part was fine.

reframe/core/pipeline.py

reframe/core/schedulers/__init__.py

reframe/core/schedulers/local.py

reframe/frontend/executors/policies.py

unittests/test_policies.py

vkarak · 2020-10-02T14:00:37Z

I have fixed the polling mechanism and fine-tuned it. The only remaining thing for this PR now is to revisit the use of task.poll().

ekouts · 2020-10-02T14:02:47Z

> I have fixed the polling mechanism and fine-tuned it. The only remaining thing for this PR now is to revisit the use of task.poll().

Would it make sense to rename it to task.update_status()?

vkarak · 2020-10-04T20:44:53Z

@jenkins-cscs retry daint

vkarak

Still not ready... It crashes with compile-only regression tests. For example:

./bin/reframe --report-file=report.json --prefix=$SCRATCH/rfm-stage/ -C config/cscs.py -c cscs-checks/compile/haswell_fma_check.py -R -t production -t craype -r

./bin/reframe: unexpected error: 'NoneType' object has no attribute 'scheduler'
Traceback (most recent call last):
  File "/users/karakasv/Devel/reframe/reframe/frontend/cli.py", line 718, in main
    runner.runall(testcases)
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/__init__.py", line 375, in runall
    self._runall(testcases)
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/__init__.py", line 428, in _runall
    self._policy.runcase(t)
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/policies.py", line 315, in runcase
    if not self._setup_task(task):
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/policies.py", line 284, in _setup_task
    sched_options=self.sched_options)
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/__init__.py", line 258, in setup
    self._notify_listeners('on_task_setup')
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/__init__.py", line 221, in _notify_listeners
    callback(self)
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/policies.py", line 243, in on_task_setup
    self.schedulers.setdefault(partname, task.check.job.scheduler)
AttributeError: 'NoneType' object has no attribute 'scheduler'

This is expected, because `job` is never set for this type of test. The real issue is that the unit tests do not catch this failure.
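
The failing access can be reproduced in isolation with stand-in objects. The sketch below is only an illustration: the `on_task_setup` function mirrors the access pattern of the last frame in the traceback, while the `SimpleNamespace` objects and the partition name `'daint:gpu'` are assumptions, not the framework's real classes. A unit test that runs a compile-only test through the policy would exercise the same path.

```python
from types import SimpleNamespace


def on_task_setup(schedulers, partname, task):
    # Same access pattern as the failing line in the traceback
    schedulers.setdefault(partname, task.check.job.scheduler)


# A compile-only test never gets a job, so `job` stays None
compile_only_task = SimpleNamespace(check=SimpleNamespace(job=None))

message = None
try:
    on_task_setup({}, 'daint:gpu', compile_only_task)
except AttributeError as err:
    message = str(err)

print(message)
```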

- The scheduler is now part of the `SystemPartition` and it is instantiated once on first access. This is to avoid unnecessary copies of the scheduler upon cloning of `SystemPartition` when the test cases are generated.
- The rest of the implementation of the policies was adapted to this.
- A bug fix in unit tests that was working until now due to side effects.

- Remove stale print
- Make unit test of find_modules() more robust
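
As a rough sketch of the "instantiated once on first access" idea, assuming a simplified `SystemPartition` with only a name and a scheduler type (the constructor signature, the `LocalScheduler` stand-in and its `instances` counter are hypothetical): the partition stores the scheduler class and creates the object lazily, so cloning a partition before the first access copies no scheduler instance.

```python
import copy


class LocalScheduler:
    """Hypothetical stand-in for a scheduler backend."""

    instances = 0   # counts how many scheduler objects were created

    def __init__(self):
        type(self).instances += 1


class SystemPartition:
    def __init__(self, name, sched_type):
        self.name = name
        self._sched_type = sched_type
        self._scheduler = None   # created lazily on first access

    @property
    def scheduler(self):
        if self._scheduler is None:
            self._scheduler = self._sched_type()

        return self._scheduler


part = SystemPartition('login', LocalScheduler)

# Cloning before the first access copies only the name and the class
clones = [copy.deepcopy(part) for _ in range(3)]
instances_after_cloning = LocalScheduler.instances

first = part.scheduler    # the scheduler is instantiated here
second = part.scheduler   # the same object is returned afterwards
print(instances_after_cloning, first is second, LocalScheduler.instances)
```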

vkarak

It should be fine now.

vkarak · 2020-10-06T09:30:15Z

@jenkins-cscs retry dom

ekouts

I had a look at the latest changes and they look good to me.

vkarak · 2020-10-06T12:44:50Z

There was still a small bug, which I have now fixed. I am waiting for an extended run to finish; then I will push the fix and merge the PR.

vkarak · 2020-10-06T14:43:34Z

Final bug fixed. After the CI passes, I think that this PR is finally good to go!

ekouts added request for enhancement prio: normal labels Jul 7, 2020

ekouts self-assigned this Jul 7, 2020

ekouts marked this pull request as draft July 7, 2020 13:49

Merge branch 'master' of https://github.com/eth-cscs/reframe into fea…

379ed53

…t/aggregate_polling

vkarak self-requested a review July 8, 2020 22:22

vkarak added this to the ReFrame sprint 20.11 milestone Jul 9, 2020

remove commented lines

342f43c

vkarak reviewed Jul 9, 2020


ekouts added 8 commits July 10, 2020 08:58

Merge branch 'master' of https://github.com/eth-cscs/reframe into fea…

b15eab5

…t/aggregate_polling

Move scheduler variables back to the scheduler class

70cd2a3

Merge branch 'master' of https://github.com/eth-cscs/reframe into fea…

390cdd0

…t/aggregate_polling

fix setup signature in CompileOnlyRegressionTest

d5371e9

allign serial policy with the async one

fad80d1

fix torque polling bug

be0dc6c

fix pep8 issues

06b5573

Merge branch 'master' of https://github.com/eth-cscs/reframe into fea…

79cb0a1

…t/aggregate_polling

ekouts added 2 commits July 15, 2020 10:04

fix pep8 issues

528d02e

fix indentation

e841fe3

ekouts requested review from victorusu and vkarak July 15, 2020 08:09

ekouts marked this pull request as ready for review July 15, 2020 08:12

vkarak requested changes Jul 17, 2020

Address PR comments

1a08f1d

vkarak removed this from the ReFrame sprint 20.11 milestone Jul 22, 2020

vkarak mentioned this pull request Jul 23, 2020

[bugfix] Capture all pending test tasks in the exit loop of the asynchronous execution policy #1434

Merged

vkarak added this to the ReFrame sprint 20.12 milestone Aug 24, 2020

Vasileios Karakasis added 4 commits October 3, 2020 00:43

Fix local scheduler polling

0904c3b

Use better names for the poll and wait functions

a9391ff

Defer job errors that happen during polling

a3b103f

Unit test improvements and fixes

bcda18d

Merge branch 'master' into feat/aggregate_polling

92c33a3

vkarak approved these changes Oct 4, 2020

vkarak requested changes Oct 5, 2020

Vasileios Karakasis added 3 commits October 5, 2020 22:28

Fix failure in unit tests and other fixes

89bdb0e

Adjust some sleep times in local scheduler unit tests

a1d5829

vkarak approved these changes Oct 6, 2020

Add missing sleep

db90266

vkarak added enhancement code quality and removed request for enhancement labels Oct 6, 2020

ekouts commented Oct 6, 2020

vkarak mentioned this pull request Oct 6, 2020

test_submit_timelimit unit test for the local scheduler fails on Debian #1494

Closed

Treat empty job lists correctly in poll() functions

f48cf44

vkarak changed the title ~~[feat] Aggregate job status polls into a single backend command~~ [feat] Redesign job polling mechanism in the framework Oct 6, 2020

vkarak merged commit 01b5dd2 into reframe-hpc:master Oct 6, 2020

vkarak mentioned this pull request Oct 7, 2020

.local fails on slurm partition #1511

Closed

ekouts deleted the feat/aggregate_polling branch October 8, 2020 12:16

This was referenced Oct 9, 2020

Minimise Slurm DB hit rate #508

Closed

Provide more accurate timings for the different test phases #1527

Open

4 participants
