Some Slurm configuration changes lead to invalid node state #199

@priteau

Take a configuration such as:

openhpc_nodegroups:
  - name: cpu
    node_params:
      CPUSpecList: 92-95

Initial deployment works fine. However, if CPUSpecList is changed (e.g. to 90-95), deploying the new configuration puts the affected nodes into an invalid state with Reason=CoreSpec differ. This happens as soon as slurmctld is restarted, probably due to a mismatch between slurmctld and slurmd.
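For concreteness, the change that triggers the problem is just the following edit to the nodegroup shown earlier. Only the CPUSpecList value differs, and 90-95 is the example value mentioned above:

```yaml
openhpc_nodegroups:
  - name: cpu
    node_params:
      # Previously 92-95; widening the reserved range is enough to trigger it
      CPUSpecList: 90-95
```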

As a workaround, forcing another configuration update elsewhere in the Slurm configuration fixes the node state.

Is there a safe way to roll out this change? Perhaps stop the slurmd services first, then restart slurmctld, and finally start all slurmd services?
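The ordering suggested above could look roughly like the following Ansible plays. This is an untested sketch, not something the role does today: the inventory group names `compute` and `control` are hypothetical, it assumes the updated slurm.conf has already been written to all hosts, and I have not confirmed that this ordering actually avoids the invalid state.

```yaml
# Sketch only: stop slurmd everywhere, restart slurmctld, then start slurmd.
# Group names "compute" and "control" are hypothetical placeholders.
- name: Stop slurmd before the controller picks up the new config
  hosts: compute
  become: true
  tasks:
    - name: Stop slurmd
      ansible.builtin.service:
        name: slurmd
        state: stopped

- name: Restart the controller with the new configuration
  hosts: control
  become: true
  tasks:
    - name: Restart slurmctld
      ansible.builtin.service:
        name: slurmctld
        state: restarted

- name: Bring slurmd back up against the restarted controller
  hosts: compute
  become: true
  tasks:
    - name: Start slurmd
      ansible.builtin.service:
        name: slurmd
        state: started
```

Because each play finishes on all of its hosts before the next one begins, no slurmd is running while slurmctld restarts. The obvious cost is that slurmd is down on every compute node for the duration.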
