Description
I think this is related to the fix for this bug I filed: #2531
Another use case we have is running many multi-node tests with sbatch from the login node:
'partitions': [
    {
        'name': 'sbatch',
        'descr': 'sbatch from login node',
        'scheduler': 'slurm',
        'launcher': 'srun',
        'max_jobs': 10
    }
]

Those tests are run through GitLab CI after using reframe --ci-generate.
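For context, the CI generation step looks roughly like this (file names and paths below are placeholders, not our actual layout):

```sh
# Sketch: generate a GitLab CI child pipeline with one CI job per ReFrame test;
# the parent pipeline then triggers the generated file as a child pipeline.
reframe -C settings.py -c checks/ -R --ci-generate=pipeline.yml
```

Since each generated CI job is its own reframe invocation, every test ends up polling Slurm independently.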
As described in #2531 (comment), I switched to the serial execution policy, and I thought this would be fine with the GitLab CI approach since it uses one GitLab CI job per ReFrame test (i.e. one reframe invocation per test).
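Concretely, each per-test invocation forces the serial policy on the command line, roughly like this (a sketch, with placeholder paths):

```sh
# Sketch: one test per reframe invocation, serial execution policy
reframe -C settings.py -c checks/ -n OSUP2PBandwidth_D_D --exec-policy=serial -r
```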
But after switching to 3.11.2, all those tests (~45) fail with an error like the following:
[ FAILED ] Ran 1/1 test case(s) from 1 check(s) (1 failure(s), 0 skipped)
[==========] Finished on Tue Jun 14 11:22:46 2022
==============================================================================
SUMMARY OF FAILURES
------------------------------------------------------------------------------
FAILURE INFO for OSUP2PBandwidth_D_D
* Expanded name: OSUP2PBandwidth %src_dests=D D
* Description: OSUP2PBandwidth %src_dests=D D
* System partition: XXX:sbatch
* Environment: builtin
* Stage directory: XXX
* Node list:
* Job type: batch job (id=12175)
* Dependencies (conceptual): []
* Dependencies (actual): []
* Maintainers: []
* Failing phase: run
* Rerun with '-n OSUP2PBandwidth_D_D -p builtin --system XXX:sbatch -r'
* Reason: spawned process error: command 'sacct -S 2022-06-14 -P -j 12175 -o jobid,state,exitcode,end,nodelist' failed with exit code 1:
--- stdout ---
--- stdout ---
--- stderr ---
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm-XXX:6819: Cannot assign requested address
sacct: error: Sending PersistInit msg: Cannot assign requested address
sacct: error: Problem talking to the database: Cannot assign requested address
--- stderr ---

This is likely because all the separate reframe invocations are now polling too fast after #2534, so the system exhausts its available ephemeral ports (using netstat I noticed a lot of ports in use or in the TIME_WAIT state). But I guess we were already close to the edge before the 3.11.2 upgrade, so the better solution is probably to make SLEEP_MIN configurable, as mentioned in #2534.
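For reference, the port exhaustion is easy to spot with something along these lines (a sketch of the kind of check I did, not the exact commands I ran):

```sh
# Count TCP connections stuck in TIME_WAIT, which hold on to ephemeral ports
netstat -tan | grep -c TIME_WAIT

# Equivalent check with ss
ss -tan state time-wait | wc -l
```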
Thank you!