TLDR: see my minimal reproducer in my reply below
=====
I have designed a ReFrame test that
- uses find_modules
- checks if GPUs are present in the current partition
- if GPUs are present, calls skip_test if the module name does not include CUDA
- if GPUs are not present, calls skip_test if the module name does include CUDA
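The skip rule above can be sketched in plain Python, independent of ReFrame. This is only an illustration of the decision logic, not the code of the actual test: the helper name `skip_reason`, the module-name spellings, and the second skip message are assumptions (the first message is the one that appears in the output below). In the real test, a non-None result would lead to a call to skip_test.

```python
from typing import Optional


def skip_reason(module_name: str, partition_has_gpu: bool) -> Optional[str]:
    """Return a skip message, or None if the test should run.

    Hypothetical helper: mirrors the rule described above, where CUDA
    modules only run on GPU partitions and non-CUDA modules only run
    on CPU partitions.
    """
    is_cuda_module = "CUDA" in module_name
    if partition_has_gpu and not is_cuda_module:
        return "GPU is present on this partition, skipping CPU-based test"
    if not partition_has_gpu and is_cuda_module:
        # Assumed wording; this message does not appear in the report.
        return "GPU is not present on this partition, skipping GPU-based test"
    return None


# Assumed module-name spellings, derived from the test names in the log.
modules = [
    "GROMACS/2020.3-foss-2020a",
    "GROMACS/2020.3-intel-2020a-CUDA-11.0.3",
]

for name in modules:
    print(name, "->", skip_reason(name, partition_has_gpu=True))
```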
I have 5 modules on my system: 2 CPU-based and 3 GPU-based modules. When running on my GPU partition:
reframe --config-file=config/settings_cartesius.py --checkpath eessi-checks/applications/gromacs.py -r --performance-report --system=example_system:gpu -t singlenode
it correctly generates 5 tests and skips 2 of them. Of the remaining 3 tests, one fails its sanity check (this is expected, that module is broken).
So far, so good. However, ReFrame's SUMMARY OF FAILURES output is unexpected:
...
[----------] started processing Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_foss_2020a__ (GROMACS Prace Benchmark Suite case A)
[ RUN ] Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_foss_2020a__ on example_system:gpu_short using builtin
GPU is present on this partition, skipping CPU-based test
[ SKIP ] (1/5) None
[----------] finished processing Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_foss_2020a__ (GROMACS Prace Benchmark Suite case A)
...
[----------] started processing Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_intel_2020a_CUDA_11_0_3__ (GROMACS Prace Benchmark Suite case A)
[ RUN ] Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_intel_2020a_CUDA_11_0_3__ on example_system:gpu_short using builtin
[----------] finished processing Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_intel_2020a_CUDA_11_0_3__ (GROMACS Prace Benchmark Suite case A)
...
[ FAIL ] (3/5) Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_intel_2020a_CUDA_11_0_3__ on example_system:gpu_short using builtin [compile: 0.007s run: 23.463s total: 23.554s]
==> test failed during 'sanity': test staged in '/nfs/home4/casparl/EESSI/software-layer/tests/reframe/stage/example_system/gpu_short/builtin/Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_intel_2020a_CUDA_11_0_3__'
...
[ FAILED ] Ran 3/5 test case(s) from 5 check(s) (1 failure(s), 2 skipped)
[==========] Finished on Wed Jun 9 18:23:02 2021
==============================================================================
SUMMARY OF FAILURES
------------------------------------------------------------------------------
FAILURE INFO for Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_foss_2020a__
* Test Description: GROMACS Prace Benchmark Suite case A
* System partition: example_system:gpu_short
* Environment: builtin
* Stage directory: /nfs/home4/casparl/EESSI/software-layer/tests/reframe/stage/example_system/gpu_short/builtin/Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_foss_2020a__
* Node list:
* Job type: batch job (id=None)
* Dependencies (conceptual): []
* Dependencies (actual): []
* Maintainers: ['casparvl']
* Failing phase: None
* Rerun with '-n Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_foss_2020a__ -p builtin --system example_system:gpu_short -r'
* Reason: None
/home/casparl/.local/easybuild/RedHatEnterpriseServer7/2020/software/ReFrame/3.6.2/bin/reframe: run session stopped: key error: 'fail_info'
/home/casparl/.local/easybuild/RedHatEnterpriseServer7/2020/software/ReFrame/3.6.2/bin/reframe: Traceback (most recent call last):
File "/home/casparl/.local/easybuild/RedHatEnterpriseServer7/2020/software/ReFrame/3.6.2/lib/python3.6/site-packages/ReFrame_HPC-3.6.2-py3.6.egg/reframe/frontend/cli.py", line 1022, in main
runner.stats.print_failure_report(printer)
File "/home/casparl/.local/easybuild/RedHatEnterpriseServer7/2020/software/ReFrame/3.6.2/lib/python3.6/site-packages/ReFrame_HPC-3.6.2-py3.6.egg/reframe/frontend/statistics.py", line 237, in print_failure_report
tb = ''.join(traceback.format_exception(*r['fail_info'].values()))
KeyError: 'fail_info'
Log file(s) saved in: '/nfs/home4/casparl/EESSI/software-layer/tests/reframe/reframe.log'
There are three problems with this output:
- It prints failure info for Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_foss_2020a__ (which is one of the tests that is actually skipped) rather than for Gromacs_EESSI_singlenode___example_system_gpu_short____builtin____GROMACS_2020_3_intel_2020a_CUDA_11_0_3__ (the test that actually failed).
- The traceback.
- It doesn't print the performance log (though I guess that's because it crashes, resulting in the traceback).
The original test is here https://github.com/casparvl/software-layer/blob/gromacs_libtest/tests/reframe/eessi-checks/applications/gromacs.py (which uses the library test at https://github.com/casparvl/software-layer/tree/gromacs_libtest/tests/reframe/testlib/applications/gromacs).
I'll try to create a more minimal reproducer. My best guess is that it has something to do with the skipped tests. Maybe a list is being indexed based on 5 tests being run, but then this indexing fails since only 3 tests ran in the end? Anyway, I'll try to confirm that with a minimal reproducer that runs 2 tests, of which one is skipped and the other fails...
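To make the suspicion more concrete, here is a small sketch of one way a skipped test could trigger this kind of KeyError. It is not ReFrame's code and not a confirmed diagnosis: the record layout, the `result` values, and both function names are assumptions. Only the `fail_info` key is taken from the traceback. The idea is that if the failure report walks over every record that did not pass, a skipped record that never stored `fail_info` will break it, whereas selecting only real failures will not.

```python
# Hypothetical per-test records; only the 'fail_info' key comes from the traceback.
records = [
    {"name": "cpu_test", "result": "skipped"},  # skipped: no 'fail_info' stored
    {"name": "gpu_test", "result": "failure",
     "fail_info": {"reason": "sanity error"}},
]


def report_all_non_passed(records):
    """Treat every non-passed record as a failure (the assumed faulty behaviour)."""
    return [(r["name"], r["fail_info"]["reason"])
            for r in records if r["result"] != "success"]


def report_failures_only(records):
    """Report only the records that actually failed."""
    return [(r["name"], r["fail_info"]["reason"])
            for r in records if r["result"] == "failure"]


try:
    report_all_non_passed(records)
except KeyError as err:
    # The skipped record has no 'fail_info', so the lookup fails.
    print("KeyError:", err)

print(report_failures_only(records))
```

This would also be consistent with the first problem above: in such a scheme the skipped test is the first non-passed record the report reaches, so its (empty) failure info gets printed before the crash.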