@ekouts
Contributor

@ekouts ekouts commented Jan 7, 2021

When ReFrame catches one of the following errors (KeyboardInterrupt, ReframeForceExitError, AssertionError), it unnecessarily marks retired tasks as failures. This leads to confusing output in which tasks are reported twice, as with T1 and T5 in this example:

[----------] waiting for spawned checks to finish
[       OK ] ( 1/10) T0 on generic:default using builtin [compile: 0.012s run: 0.183s total: 0.223s]
[       OK ] ( 2/10) T4 on generic:default using builtin [compile: 0.012s run: 0.179s total: 0.218s]
[       OK ] ( 3/10) T5 on generic:default using builtin [compile: 0.012s run: 0.208s total: 0.247s]
[       OK ] ( 4/10) T1 on generic:default using builtin [compile: 0.012s run: 0.208s total: 0.248s]
[     FAIL ] ( 5/10) T8 on generic:default using builtin [compile: n/a run: n/a total: 0.003s]
==> test failed during 'setup': test staged in '/path/to/reframe/stage/generic/default/builtin/T8'
[     FAIL ] ( 6/10) T9 [compile: n/a run: n/a total: n/a]
==> test failed during 'startup': test staged in '<not available>'
[     FAIL ] ( 7/10) T6 on generic:default using builtin [compile: 0.012s run: 0.004s total: 0.063s]
==> test failed during 'run': test staged in '/path/to/reframe/stage/generic/default/builtin/T6'
[     FAIL ] ( 8/10) T2 [compile: n/a run: n/a total: n/a]
==> test failed during 'startup': test staged in '<not available>'
[     FAIL ] ( 9/10) T3 [compile: n/a run: n/a total: n/a]
==> test failed during 'startup': test staged in '<not available>'
[     FAIL ] (10/10) T7 [compile: n/a run: n/a total: n/a]
==> test failed during 'startup': test staged in '<not available>'
[     FAIL ] (11/10) T5 on generic:default using builtin [compile: 0.012s run: 0.208s total: 0.247s]
==> test failed during 'finalize': test staged in '/path/to/reframe/stage/generic/default/builtin/T5'
[     FAIL ] (12/10) T1 on generic:default using builtin [compile: 0.012s run: 0.208s total: 0.248s]
==> test failed during 'finalize': test staged in '/path/to/reframe/stage/generic/default/builtin/T1'
[----------] all spawned checks have finished

[  FAILED  ] Ran 10 test case(s) from 10 check(s) (8 failure(s))
[==========] Finished on Thu Jan  7 10:47:59 2021 

==============================================================================
SUMMARY OF FAILURES
------------------------------------------------------------------------------
FAILURE INFO for T5 
  * Test Description: T5
  * System partition: generic:default
  * Environment: builtin
  * Stage directory: /path/to/reframe/stage/generic/default/builtin/T5
  * Job type: local (id=29155)
  * Dependencies (conceptual): ['T4']
  * Dependencies (actual): [('T4', 'generic:default', 'builtin')]
  * Maintainers: []
  * Failing phase: finalize
  * Rerun with '-n T5 -p builtin --system generic:default'
  * Reason: process lookup error: [Errno 3] No such process
Traceback (most recent call last):
  File "/path/to/reframe/reframe/frontend/executors/__init__.py", line 318, in abort
    self.check.job.cancel()
  File "/path/to/reframe/reframe/core/schedulers/__init__.py", line 409, in cancel
    return self.scheduler.cancel(self)
  File "/path/to/reframe/reframe/core/schedulers/local.py", line 127, in cancel
    self._term_all(job)
  File "/path/to/reframe/reframe/core/schedulers/local.py", line 116, in _term_all
    os.killpg(job.jobid, signal.SIGTERM)
ProcessLookupError: [Errno 3] No such process

------------------------------------------------------------------------------
FAILURE INFO for T1 
  * Test Description: T1
  * System partition: generic:default
  * Environment: builtin
  * Stage directory: /path/to/reframe/stage/generic/default/builtin/T1
  * Job type: local (id=29166)
  * Dependencies (conceptual): ['T4', 'T5']
  * Dependencies (actual): [('T4', 'generic:default', 'builtin'), ('T5', 'generic:default', 'builtin')]
  * Maintainers: []
  * Failing phase: finalize
  * Rerun with '-n T1 -p builtin --system generic:default'
  * Reason: process lookup error: [Errno 3] No such process
Traceback (most recent call last):
  File "/path/to/reframe/reframe/frontend/executors/__init__.py", line 318, in abort
    self.check.job.cancel()
  File "/path/to/reframe/reframe/core/schedulers/__init__.py", line 409, in cancel
    return self.scheduler.cancel(self)
  File "/path/to/reframe/reframe/core/schedulers/local.py", line 127, in cancel
    self._term_all(job)
  File "/path/to/reframe/reframe/core/schedulers/local.py", line 116, in _term_all
    os.killpg(job.jobid, signal.SIGTERM)
ProcessLookupError: [Errno 3] No such process

This is a bit hard to reproduce, so I added an extra _failall call by hand to get this output.
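
To illustrate the idea behind the fix, here is a minimal sketch. It is not ReFrame's actual code: `Task`, `finished`, and `fail_all` are hypothetical names chosen for this example. It assumes that each task records whether it has already completed or retired, and that the abort path skips such tasks so they are not reported a second time.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a test case task; not ReFrame's class."""
    name: str
    finished: bool = False  # True once the task has completed or retired
    failed: bool = False


def fail_all(tasks, skip_finished=True):
    """Mark tasks as failed on abort and return the names newly reported."""
    reported = []
    for task in tasks:
        if skip_finished and task.finished:
            # Already reported once; failing it again would duplicate it.
            continue
        task.failed = True
        reported.append(task.name)
    return reported


tasks = [
    Task('T1', finished=True),
    Task('T5', finished=True),
    Task('T2'),
    Task('T3'),
]
print(fail_all(tasks))  # ['T2', 'T3']
```

With `skip_finished=False` the sketch reproduces the behaviour described above: T1 and T5 would be reported again as failures even though they had already finished.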

ekouts added 2 commits January 7, 2021 10:50
When reframe catches one of the following errors (KeyboardInterrupt,
ReframeForceExitError, AssertionError), it unnecessarily marks retired and
completed tasks as failures. This leads to confusing output.
@ekouts ekouts requested review from teojgo and vkarak January 7, 2021 10:23
@ekouts ekouts self-assigned this Jan 7, 2021
@codecov-io
codecov-io commented Jan 7, 2021

Codecov Report

Merging #1673 (2a7a5eb) into master (1ffe893) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #1673   +/-   ##
=======================================
  Coverage   87.21%   87.21%           
=======================================
  Files          45       45           
  Lines        7546     7546           
=======================================
  Hits         6581     6581           
  Misses        965      965           
Impacted Files                            Coverage Δ
reframe/frontend/executors/policies.py    99.33% <ø> (ø)

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
@vkarak
Contributor

vkarak commented Jan 8, 2021

@ekouts Didn't we have an open issue about that? If yes, could you link it to this PR?

@ekouts
Contributor Author

ekouts commented Jan 8, 2021

> @ekouts Didn't we have an open issue about that? If yes, could you link it to this PR?

@vkarak I couldn't find an issue, which is why I opened the PR directly: the fix is small and I need it for #1676.

@vkarak
Contributor

vkarak commented Jan 11, 2021

@ekouts I think I found the issue. It's #1398, but we closed it because we could not reproduce it.

@vkarak vkarak left a comment

lgtm

@vkarak vkarak changed the title from "[bugfix] Fix unnecessary marking of failed tasks when reframe execution is aborted" to "[bugfix] Fix numbering of failed tasks when reframe execution is aborted" Jan 11, 2021
@vkarak vkarak merged commit 9d85b5f into reframe-hpc:master Jan 11, 2021
@vkarak vkarak added this to the ReFrame sprint 21.01 milestone Jan 11, 2021
@ekouts ekouts deleted the bugfix/abort_tasks branch January 12, 2021 07:26