Bug in Summary of Failures #2009

@casparvl

TLDR: see my minimal reproducer in my reply below

=====

I have designed a ReFrame test that

  • uses find_modules
  • checks if GPUs are present in the current partition
  • if GPUs are present, calls skip_test if the module name does not include CUDA
  • if GPUs are not present, calls skip_test if the module name does include CUDA
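The skip rule in the list above can be sketched in plain Python. This is only an illustration of the decision logic, not the actual test code: `should_skip` is a hypothetical helper, the GPU check is reduced to a boolean, and the module names are written the way they appear in the test names in the log further down.

```python
def should_skip(module_name: str, partition_has_gpu: bool) -> bool:
    """Return True when the module variant does not match the partition.

    Hypothetical helper: in the real test this decision would feed the
    skip call; here it is isolated so the rule can be checked on its own.
    """
    is_cuda_module = "CUDA" in module_name
    # GPU partition: run only CUDA builds. CPU partition: run only non-CUDA builds.
    return is_cuda_module != partition_has_gpu


# Module names as they appear (sanitized) in the test names of the log output.
modules = [
    "GROMACS_2020_3_foss_2020a",
    "GROMACS_2020_3_intel_2020a_CUDA_11_0_3",
]

# On a GPU partition, only the non-CUDA module is skipped.
skipped_on_gpu = [m for m in modules if should_skip(m, partition_has_gpu=True)]
# On a CPU partition, only the CUDA module is skipped.
skipped_on_cpu = [m for m in modules if should_skip(m, partition_has_gpu=False)]
```

With this rule, every module is skipped on exactly one of the two partition types, which matches the 2 skipped / 3 run split described below for the GPU partition.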

I have 5 modules on my system: 2 CPU-based and 3 GPU-based modules. When running on my GPU partition:

reframe --config-file=config/settings_cartesius.py --checkpath eessi-checks/applications/gromacs.py -r --performance-report --system=example_system:gpu -t singlenode

it correctly generates 5 tests and skips 2 of them. Of the remaining 3 tests, one fails its sanity check (this is expected, since that module is broken).

So far, so good. However, ReFrame's SUMMARY OF FAILURES output is unexpected:

...
[----------] started processing Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_foss_2020a__ (GROMACS Prace Benchmark Suite case A)
[ RUN      ] Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_foss_2020a__ on example_system:gpu_short using builtin
GPU is present on this partition, skipping CPU-based test
[     SKIP ] (1/5) None
[----------] finished processing Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_foss_2020a__ (GROMACS Prace Benchmark Suite case A)
...
[----------] started processing Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_intel_2020a_CUDA_11_0_3__ (GROMACS Prace Benchmark Suite case A)
[ RUN      ] Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_intel_2020a_CUDA_11_0_3__ on example_system:gpu_short using builtin
[----------] finished processing Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_intel_2020a_CUDA_11_0_3__ (GROMACS Prace Benchmark Suite case A)
...
[     FAIL ] (3/5) Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_intel_2020a_CUDA_11_0_3__ on example_system:gpu_short using builtin [compile: 0.007s run: 23.463s total: 23.554s]
==> test failed during 'sanity': test staged in '/nfs/home4/casparl/EESSI/software-layer/tests/reframe/stage/example_system/gpu_short/builtin/Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_intel_2020a_CUDA_11_0_3__'
...
[  FAILED  ] Ran 3/5 test case(s) from 5 check(s) (1 failure(s), 2 skipped)
[==========] Finished on Wed Jun  9 18:23:02 2021

==============================================================================
SUMMARY OF FAILURES
------------------------------------------------------------------------------
FAILURE INFO for Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_foss_2020a__
  * Test Description: GROMACS Prace Benchmark Suite case A
  * System partition: example_system:gpu_short
  * Environment: builtin
  * Stage directory: /nfs/home4/casparl/EESSI/software-layer/tests/reframe/stage/example_system/gpu_short/builtin/Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_foss_2020a__
  * Node list:
  * Job type: batch job (id=None)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: ['casparvl']
  * Failing phase: None
  * Rerun with '-n Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_foss_2020a__ -p builtin --system example_system:gpu_short -r'
  * Reason: None
/home/casparl/.local/easybuild/RedHatEnterpriseServer7/2020/software/ReFrame/3.6.2/bin/reframe: run session stopped: key error: 'fail_info'
/home/casparl/.local/easybuild/RedHatEnterpriseServer7/2020/software/ReFrame/3.6.2/bin/reframe: Traceback (most recent call last):
  File "/home/casparl/.local/easybuild/RedHatEnterpriseServer7/2020/software/ReFrame/3.6.2/lib/python3.6/site-packages/ReFrame_HPC-3.6.2-py3.6.egg/reframe/frontend/cli.py", line 1022, in main
    runner.stats.print_failure_report(printer)
  File "/home/casparl/.local/easybuild/RedHatEnterpriseServer7/2020/software/ReFrame/3.6.2/lib/python3.6/site-packages/ReFrame_HPC-3.6.2-py3.6.egg/reframe/frontend/statistics.py", line 237, in print_failure_report
    tb = ''.join(traceback.format_exception(*r['fail_info'].values()))
KeyError: 'fail_info'

Log file(s) saved in: '/nfs/home4/casparl/EESSI/software-layer/tests/reframe/reframe.log'

There are three problems with this output:

  1. It prints failure info for Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_foss_2020a__ (which is one of the tests that is actually skipped) rather than for Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_intel_2020a_CUDA_11_0_3__ (the test that actually failed).
  2. The traceback: the run session stops with KeyError: 'fail_info'.
  3. It doesn't print the performance log (though I guess that's because it crashes, resulting in the traceback).

The original test is here https://github.com/casparvl/software-layer/blob/gromacs_libtest/tests/reframe/eessi-checks/applications/gromacs.py (which uses the library test at https://github.com/casparvl/software-layer/tree/gromacs_libtest/tests/reframe/testlib/applications/gromacs).

I'll try to create a more minimal reproducer. My best guess is that it has something to do with the skipped tests. Maybe a list is being indexed based on 5 tests being run, but then this indexing fails since only 3 tests ran in the end? Anyway, I'll try to confirm that with a minimal reproducer that runs 2 tests, of which one is skipped and the other fails...
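To make the suspected failure mode concrete, here is a small standalone sketch. It does not use ReFrame and is not its actual code: the record layout (`name`, `result`) and both report functions are assumptions for illustration. Only the `fail_info` key and the `traceback.format_exception(*r['fail_info'].values())` expression are taken from the traceback above. It shows that if a skipped test ends up in the list being reported as a failure, the lookup raises the same KeyError, while a guarded loop reports only the real failure.

```python
import traceback


def make_fail_info():
    """Build an (exc_type, exc_value, traceback) mapping for a failed check."""
    try:
        raise ValueError("sanity check failed")
    except ValueError as exc:
        return {
            "exc_type": type(exc),
            "exc_value": exc,
            "traceback": exc.__traceback__,
        }


# Hypothetical run records: the skipped test carries no 'fail_info' key.
records = [
    {"name": "cpu_test", "result": "skipped"},
    {"name": "gpu_test", "result": "failure", "fail_info": make_fail_info()},
]


def report_naive(records):
    """Treat every record as a failure; raises KeyError on the skipped one."""
    return [
        "".join(traceback.format_exception(*r["fail_info"].values()))
        for r in records
    ]


def report_guarded(records):
    """Report only records that are failures and actually carry 'fail_info'."""
    reports = []
    for r in records:
        if r.get("result") != "failure" or "fail_info" not in r:
            continue
        reports.append(
            "".join(traceback.format_exception(*r["fail_info"].values()))
        )
    return reports
```

Calling `report_naive(records)` raises `KeyError: 'fail_info'` on the skipped record, whereas `report_guarded(records)` returns a single formatted traceback for `gpu_test`. Whether ReFrame's failure report really includes skipped tests in this way is exactly what the minimal reproducer should confirm.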
