@sjpb sjpb commented Jan 7, 2021

In configless mode, compute nodes have an environment variable set to tell slurmd which host runs slurmctld. Login-only nodes don't have this, so Slurm commands can't work on them.

There are three possible approaches (docs):

  1. Use DNS SRV records to provide the slurmctld host name - that's not under the control of this role, so we can rule it out.
  2. Template the config onto the login nodes - could do that, but then login images aren't "fixed", and there are multiple possible sources of truth for cluster config when resizing etc., so I don't like that.
  3. Run slurmd on the login nodes - this is the docs suggestion when 1) isn't possible.

This PR implements 3). I don't particularly like the implementation, but here's why it is the best I could come up with:

  1. The current "runtime"-node package list (in e.g. vars/ohpc-2) already includes packages for slurmd.
  2. We need ohpc_slurm_services = slurmd so that slurmd gets configured. This could be done simply by setting openhpc_enable.batch=True.
  3. The slurm.conf needs a NodeName entry for each login-only node. Entries don't need any other information and don't need to be included in any partition.
  4. Due to the openhpc_slurm_partitions design, it's very hard to extract the set of batch nodes which aren't listed in any partition (this set is the login-only nodes). Therefore a new openhpc_enable.login flag is added to explicitly identify such nodes, rather than setting .batch = True.
  5. As openhpc_enable is a role var, it doesn't end up in hostvars, so there's no way to find the .login nodes from within the templating. Hence, runtime.yml has to set a fact openhpc_login_only to get this info into hostvars (I considered calling this openhpc_enable_login, but thought the similarity to openhpc_enable.login would be confusing rather than helpful).
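
To illustrate points 3 to 5, here is a minimal Python sketch of the selection logic, not the role's actual Jinja template. It assumes hostvars can be treated as a plain mapping of host name to facts, and that each login-only node has had `openhpc_login_only` set to true. The function names and the example host names are hypothetical.

```python
from typing import Any, List, Mapping


def login_only_nodes(hostvars: Mapping[str, Mapping[str, Any]]) -> List[str]:
    """Return the hosts whose openhpc_login_only fact is truthy, sorted by name.

    Hosts that never had the fact set (e.g. ordinary compute nodes) are
    treated as not login-only.
    """
    return sorted(
        host
        for host, facts in hostvars.items()
        if facts.get("openhpc_login_only", False)
    )


def render_login_node_entries(hostvars: Mapping[str, Mapping[str, Any]]) -> str:
    """Emit one bare NodeName line per login-only host.

    Per point 3, these entries carry no other information and are not
    added to any partition.
    """
    return "\n".join(f"NodeName={host}" for host in login_only_nodes(hostvars))


# Hypothetical hostvars; real hostvars would contain many more keys.
example_hostvars = {
    "compute-0": {"openhpc_login_only": False},
    "compute-1": {},  # fact never set
    "login-0": {"openhpc_login_only": True},
}

print(render_login_node_entries(example_hostvars))
```

The point of the sketch is that the lookup only works because the flag is visible per host: a role var such as openhpc_enable.login would not appear in this mapping at all, which is why the fact is needed.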

@sjpb sjpb requested review from JohnGarbutt and jovial January 7, 2021 11:19
@sjpb sjpb self-assigned this Jan 7, 2021
@sjpb sjpb added the bug Something isn't working label Jan 7, 2021
@sjpb sjpb linked an issue Jan 7, 2021 that may be closed by this pull request
@jovial jovial self-requested a review January 8, 2021 10:04

@jovial jovial left a comment

Looks pretty clean - nice work

@sjpb sjpb merged commit ee2fb9c into master Jan 8, 2021
@sjpb sjpb deleted the fix/login-only branch January 8, 2021 14:58
Development

Successfully merging this pull request may close these issues.

Login-only node in configless mode fails to get config
