[feat] Redesign job polling mechanism in the framework #1402
- The jobs' state is updated in poll_jobs instead of update_state.
- The scheduler object no longer holds any information about individual jobs.
- Each partition has its own common scheduler for all the tests.
- The RegressionTask poll method should only check the current state and no longer trigger any backend command to the scheduler.
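To make the aggregate polling idea above concrete, here is a minimal sketch, not the PR's actual code. The `Job` and `Scheduler` classes, the `num_queries` counter and the `poll_all` helper are hypothetical stand-ins; only the notion of one shared scheduler per partition that updates all job states in a single `poll_jobs` call comes from the commit message.

```python
from collections import defaultdict


class Job:
    """Hypothetical stand-in for a job submitted by a test."""

    def __init__(self, jobid, partition):
        self.jobid = jobid
        self.partition = partition
        self.state = 'PENDING'


class Scheduler:
    """Hypothetical scheduler shared by all the tests of one partition."""

    def __init__(self):
        self.num_queries = 0

    def poll_jobs(self, *jobs):
        # A single simulated backend query updates every job passed in
        self.num_queries += 1
        for job in jobs:
            job.state = 'COMPLETED'


def poll_all(jobs, schedulers):
    """Group the jobs by partition and poll each scheduler exactly once."""
    by_partition = defaultdict(list)
    for job in jobs:
        by_partition[job.partition].append(job)

    for partname, partjobs in by_partition.items():
        schedulers[partname].poll_jobs(*partjobs)


schedulers = {'gpu': Scheduler(), 'mc': Scheduler()}
jobs = [Job(1, 'gpu'), Job(2, 'gpu'), Job(3, 'mc')]
poll_all(jobs, schedulers)
```

With this layout the number of backend queries grows with the number of partitions rather than with the number of jobs, which is the point of aggregating the polling.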
Hello @ekouts, Thank you for updating!
Do see the ReFrame Coding Style Guide.
Comment last updated at 2020-10-06 14:42:01 UTC
…t/aggregate_polling
Codecov Report
@@            Coverage Diff             @@
##           master    #1402      +/-   ##
==========================================
- Coverage   91.72%   91.50%   -0.22%
==========================================
  Files          83       83
  Lines       13029    13190     +161
==========================================
+ Hits        11951    12070     +119
- Misses       1078     1120      +42
vkarak
left a comment
Some parts were hard for me to follow because of unclear variable names, especially in the schedulers. I have suggested some alternatives. The execution policies part was fine.
I have fixed the polling mechanism and fine-tuned it. The only remaining thing for this PR now is to revisit the use of |
Would it make sense to rename to |
@jenkins-cscs retry daint
vkarak
left a comment
Still not ready... It crashes with compile-only regression tests. For example:
./bin/reframe --report-file=report.json --prefix=$SCRATCH/rfm-stage/ -C config/cscs.py -c cscs-checks/compile/haswell_fma_check.py -R -t production -t craype -r
./bin/reframe: unexpected error: 'NoneType' object has no attribute 'scheduler'
Traceback (most recent call last):
  File "/users/karakasv/Devel/reframe/reframe/frontend/cli.py", line 718, in main
    runner.runall(testcases)
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/__init__.py", line 375, in runall
    self._runall(testcases)
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/__init__.py", line 428, in _runall
    self._policy.runcase(t)
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/policies.py", line 315, in runcase
    if not self._setup_task(task):
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/policies.py", line 284, in _setup_task
    sched_options=self.sched_options)
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/__init__.py", line 258, in setup
    self._notify_listeners('on_task_setup')
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/__init__.py", line 221, in _notify_listeners
    callback(self)
  File "/users/karakasv/Devel/reframe/reframe/frontend/executors/policies.py", line 243, in on_task_setup
    self.schedulers.setdefault(partname, task.check.job.scheduler)
AttributeError: 'NoneType' object has no attribute 'scheduler'
This is expected, because `job` is never set for this type of test. The real issue is that the unit tests do not catch it.
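For illustration only, the failure mode in the traceback can be reproduced and guarded against with a few lines. This is a hedged sketch and not necessarily the fix adopted in the PR: `FakeJob`, `FakeCheck` and `register_scheduler` are hypothetical names, and the only detail taken from the traceback is the `schedulers.setdefault(partname, ...job.scheduler)` pattern.

```python
class FakeJob:
    """Hypothetical job that carries a reference to its scheduler."""

    def __init__(self, scheduler):
        self.scheduler = scheduler


class FakeCheck:
    """Hypothetical check; `job` stays None for compile-only tests."""

    def __init__(self, job=None):
        self.job = job


def register_scheduler(schedulers, partname, check):
    """Record the partition's scheduler only if the check has a job."""
    job = check.job
    if job is None:
        # Compile-only test: no job is submitted, so there is nothing to poll
        return False

    schedulers.setdefault(partname, job.scheduler)
    return True
```

A unit test that calls this path with a check whose `job` is `None` would have caught the crash reported above.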
- The scheduler is now part of the `SystemPartition` and it is instantiated once on first access. This is to avoid unnecessary copies of the scheduler upon cloning of `SystemPartition` when the test cases are generated.
- The rest of the implementation of the policies was adapted to this.
- A bug fix in unit tests that was working until now due to side effects.
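A minimal sketch of the "instantiated once on first access" idea from the commit message above, under stated assumptions: the `LocalScheduler` class, its `instances` counter and the `_sched_type` attribute are hypothetical, and the real `SystemPartition` has a different constructor. The sketch only shows that cloning a partition before its scheduler is accessed copies no scheduler object.

```python
import copy


class LocalScheduler:
    """Hypothetical scheduler backend that counts its instances."""

    instances = 0

    def __init__(self):
        type(self).instances += 1


class SystemPartition:
    """Simplified partition that creates its scheduler lazily."""

    def __init__(self, name, sched_type):
        self.name = name
        self._sched_type = sched_type
        self._scheduler = None

    @property
    def scheduler(self):
        # Create the scheduler on first access and reuse it afterwards
        if self._scheduler is None:
            self._scheduler = self._sched_type()

        return self._scheduler


part = SystemPartition('gpu', LocalScheduler)

# Cloning before first access copies only the scheduler type, not an instance
clones = [copy.deepcopy(part) for _ in range(3)]
instances_after_cloning = LocalScheduler.instances

first = part.scheduler
second = part.scheduler
```

In this sketch `instances_after_cloning` is 0, and repeated accesses on `part` return the same scheduler object.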
- Remove stale print
- Make unit test of find_modules() more robust
vkarak
left a comment
It should be fine now.
@jenkins-cscs retry dom
ekouts
left a comment
I had a look at the latest changes and they look good to me.
There was still a small bug, which I have fixed. I am waiting for an extended run to finish; then I will push the fix and merge.
Final bug fixed. After the CI passes, I think that this PR is finally good to go!
Main idea of this PR:
- Job states are now updated by the scheduler in `poll(*jobs)` instead of `_update_state`. Job-specific errors are not raised immediately; their delivery is deferred until the job status is explicitly queried by its `finished()` method.
- `RegressionTask.poll()`, now renamed to `run_complete()`, checks only the current state and does not trigger any backend command to the scheduler anymore. It also raises an exception for any job-specific error encountered during the polling performed by the scheduler.

Fixes #443
Fixes #1290
Fixes #1494
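The deferred error delivery described in the main idea can be sketched roughly as follows. This is an illustration under assumptions, not the framework's implementation: `JobError`, the `_exception` attribute and the dictionary that simulates the backend are hypothetical; only `poll(*jobs)` and `finished()` are names taken from the description.

```python
class JobError(Exception):
    """Hypothetical job-specific error."""


class Job:
    def __init__(self, jobid):
        self.jobid = jobid
        self.state = 'RUNNING'
        self._exception = None

    def finished(self):
        # Errors recorded during polling are delivered only at this point
        if self._exception is not None:
            raise self._exception

        return self.state in ('COMPLETED', 'FAILED')


class Scheduler:
    def __init__(self, backend_states):
        # Simulated backend: maps job ids to their reported state
        self._backend_states = backend_states

    def poll(self, *jobs):
        for job in jobs:
            state = self._backend_states.get(job.jobid)
            if state is None:
                # Record the error on the job instead of raising it here,
                # so the remaining jobs are still polled
                job._exception = JobError(f'job {job.jobid} not found')
                continue

            job.state = state


sched = Scheduler({1: 'COMPLETED'})
good, bad = Job(1), Job(2)
sched.poll(good, bad)   # does not raise, even though job 2 is unknown
```

Here `good.finished()` returns `True`, while `bad.finished()` raises the `JobError` recorded during polling, so one problematic job does not interrupt the polling of the others.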