Skip to content

Conversation

@ekouts
Copy link
Contributor

@ekouts ekouts commented Jul 7, 2020

Main idea of this PR:

  • the jobs' state is updated in poll(*jobs) instead of _update_state. Job-specific errors are not raised immediately but their delivery is deferred until the job status is explicitly queried by its finished() method.
  • the scheduler object doesn't have any information for individual jobs anymore
  • each partition has one common scheduler for all the tests, and the policies should have access to the scheduler objects
  • the former RegressionTask.poll(), now renamed to run_complete(), checks only the current state and does not trigger any backend command to the scheduler anymore. It also raises an exception for any job-specific error encountered during the polling performed by the scheduler.

Fixes #443
Fixes #1290
Fixes #1494

 - the jobs' state is updated in poll_jobs instead of update_state
 - the scheduler object doesn't have any information for individual jobs anymore
 - each partition jass its one common scheduler for all the tests
 - the RegressionTask poll method should only check the current state and not trigger any backend command to the scheduler anymore
@pep8speaks
Copy link

pep8speaks commented Jul 7, 2020

Hello @ekouts, Thank you for updating!

Line 37:80: E501 line too long (80 > 79 characters)

Do see the ReFrame Coding Style Guide

Comment last updated at 2020-10-06 14:42:01 UTC

@ekouts ekouts marked this pull request as draft July 7, 2020 13:49
@vkarak vkarak self-requested a review July 8, 2020 22:22
@vkarak vkarak added this to the ReFrame sprint 20.11 milestone Jul 9, 2020
@codecov-commenter
Copy link

codecov-commenter commented Jul 15, 2020

Codecov Report

Merging #1402 into master will decrease coverage by 0.21%.
The diff coverage is 66.06%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1402      +/-   ##
==========================================
- Coverage   91.72%   91.50%   -0.22%     
==========================================
  Files          83       83              
  Lines       13029    13190     +161     
==========================================
+ Hits        11951    12070     +119     
- Misses       1078     1120      +42     
Impacted Files Coverage Δ
reframe/core/schedulers/torque.py 20.58% <7.84%> (-6.20%) ⬇️
reframe/core/schedulers/slurm.py 53.20% <24.59%> (-3.70%) ⬇️
reframe/core/schedulers/pbs.py 64.42% <37.20%> (-0.42%) ⬇️
unittests/test_cli.py 90.97% <50.00%> (ø)
reframe/core/pipeline.py 92.43% <75.00%> (-0.44%) ⬇️
unittests/test_launchers.py 92.45% <75.00%> (+0.07%) ⬆️
unittests/test_utility.py 99.39% <80.00%> (+<0.01%) ⬆️
reframe/core/systems.py 88.31% <81.25%> (+0.11%) ⬆️
unittests/test_schedulers.py 94.25% <87.50%> (+0.45%) ⬆️
reframe/frontend/executors/__init__.py 98.68% <90.00%> (-0.32%) ⬇️
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 068dccb...f48cf44. Read the comment docs.

@ekouts ekouts requested review from victorusu and vkarak July 15, 2020 08:09
@ekouts ekouts marked this pull request as ready for review July 15, 2020 08:12
Copy link
Contributor

@vkarak vkarak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a bit hard for me to understand at some points due to not so good variable names, especially in the schedulers. I have suggested some alternatives. The part of the execution policies was fine.

@vkarak
Copy link
Contributor

vkarak commented Oct 2, 2020

I have fixed the polling mechanism and fined tuned it. The only remaining thing for this PR now is to revisit the use of task.poll().

@ekouts
Copy link
Contributor Author

ekouts commented Oct 2, 2020

I have fixed the polling mechanism and fined tuned it. The only remaining thing for this PR now is to revisit the use of task.poll().

Would it make sense to rename to task.update_status()?

@vkarak
Copy link
Contributor

vkarak commented Oct 4, 2020

@jenkins-cscs retry daint

Copy link
Contributor

@vkarak vkarak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not ready... It crashes with compile-only regression tests. For example:

./bin/reframe --report-file=report.json --prefix=$SCRATCH/rfm-stage/ -C config/cscs.py -c cscs-checks/compile/haswell_fma_check.py -R -t production -t craype -r
./bin/reframe: unexpected error: 'NoneType' object has no attribute 'scheduler'
Traceback (most recent call last):
  File "/users/karakasv/Devel/reframe/reframe/frontend/cli.py", line 718, in main
    runner.runall(testcases)
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/__init__.py", line 375, in runall
    self._runall(testcases)
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/__init__.py", line 428, in _runall
    self._policy.runcase(t)
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/policies.py", line 315, in runcase
    if not self._setup_task(task):
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/policies.py", line 284, in _setup_task
    sched_options=self.sched_options)
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/__init__.py", line 258, in setup
    self._notify_listeners('on_task_setup')
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/__init__.py", line 221, in _notify_listeners
    callback(self)
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/policies.py", line 243, in on_task_setup
    self.schedulers.setdefault(partname, task.check.job.scheduler)
AttributeError: 'NoneType' object has no attribute 'scheduler'

This is expected, because job is never set for this type of tests. The problem is that this problem is not caught by the unit tests.

Vasileios Karakasis added 3 commits October 5, 2020 22:28
- The scheduler is now part of the `SystemPartition` and it is instantiated once
  on first access. This is to avoid unnecessary copies of the scheduler upon
  cloning of `SystemPartition` when the test cases are generated.
- The rest of the implementation of the policies was adapted to this.
- A bug fix in unit tests that was working until now due to side effects.
- Remove stale print
- Make unit test of find_modules() more robust
Copy link
Contributor

@vkarak vkarak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be fine now.

@vkarak
Copy link
Contributor

vkarak commented Oct 6, 2020

@jenkins-cscs retry dom

Copy link
Contributor Author

@ekouts ekouts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a look at the latest changes and it looks good to me

@vkarak
Copy link
Contributor

vkarak commented Oct 6, 2020

There is still a small bug that I have fixed. I am waiting for an extended run to finish, push the fix and merge it.

@vkarak
Copy link
Contributor

vkarak commented Oct 6, 2020

Final bug fixed. After the CI passes, I think that this PR is finally good to go!

@vkarak vkarak changed the title [feat] Aggregate job status polls into a single backend command [feat] Redesign job polling mechanism in the framework Oct 6, 2020
@vkarak vkarak merged commit 01b5dd2 into reframe-hpc:master Oct 6, 2020
@ekouts ekouts deleted the feat/aggregate_polling branch October 8, 2020 12:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

4 participants