Description
I think this is related to the fix for this bug I filed: #2531
Another use case we have is running many multi-node tests with sbatch from the login node:
'partitions': [
    {
        'name': 'sbatch',
        'descr': 'sbatch from login node',
        'scheduler': 'slurm',
        'launcher': 'srun',
        'max_jobs': 10
    }
]

Those tests are run through GitLab CI after using reframe --ci-generate.
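For context, the CI generation step looks roughly like this (file names and paths below are placeholders, not our actual layout):

```sh
# Sketch: generate a GitLab CI child pipeline with one CI job per ReFrame test;
# the parent pipeline then triggers the generated file as a child pipeline.
reframe -C settings.py -c checks/ -R --ci-generate=pipeline.yml
```

Since each generated CI job is its own reframe invocation, every test ends up polling Slurm independently.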
As described in #2531 (comment), I switched to the serial execution policy, and I thought this would be fine with the GitLab CI approach since it uses one GitLab CI job per ReFrame test (i.e. one reframe invocation per test).
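Concretely, each per-test invocation forces the serial policy on the command line, roughly like this (a sketch, with placeholder paths):

```sh
# Sketch: one test per reframe invocation, serial execution policy
reframe -C settings.py -c checks/ -n OSUP2PBandwidth_D_D --exec-policy=serial -r
```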
But after switching to 3.11.2, all those tests (~45) fail with an error like the following:
[ FAILED ] Ran 1/1 test case(s) from 1 check(s) (1 failure(s), 0 skipped)
[==========] Finished on Tue Jun 14 11:22:46 2022
==============================================================================
SUMMARY OF FAILURES
------------------------------------------------------------------------------
FAILURE INFO for OSUP2PBandwidth_D_D
* Expanded name: OSUP2PBandwidth %src_dests=D D
* Description: OSUP2PBandwidth %src_dests=D D
* System partition: XXX:sbatch
* Environment: builtin
* Stage directory: XXX
* Node list:
* Job type: batch job (id=12175)
* Dependencies (conceptual): []
* Dependencies (actual): []
* Maintainers: []
* Failing phase: run
* Rerun with '-n OSUP2PBandwidth_D_D -p builtin --system XXX:sbatch -r'
* Reason: spawned process error: command 'sacct -S 2022-06-14 -P -j 12175 -o jobid,state,exitcode,end,nodelist' failed with exit code 1:
--- stdout ---
--- stdout ---
--- stderr ---
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm-XXX:6819: Cannot assign requested address
sacct: error: Sending PersistInit msg: Cannot assign requested address
sacct: error: Problem talking to the database: Cannot assign requested address
--- stderr ---

This is likely because all the separate reframe invocations are now polling too fast after #2534, so the system exhausts its available ephemeral ports (using netstat I noticed a lot of ports in use or in the TIME_WAIT state). But I guess we were already close to the edge before the 3.11.2 upgrade, so the better solution is probably to make SLEEP_MIN configurable, as mentioned in #2534.
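For reference, the port exhaustion is easy to spot with something along these lines (a sketch of the kind of check I did, not the exact commands I ran):

```sh
# Count TCP connections stuck in TIME_WAIT, which hold on to ephemeral ports
netstat -tan | grep -c TIME_WAIT

# Equivalent check with ss
ss -tan state time-wait | wc -l
```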
Thank you!