Conversation


@sjpb sjpb commented Sep 24, 2021

Most of the autoscaling setup can be done either before running `runtime.yml` or by using the `openhpc_config` variable to pass additional `slurm.conf` parameters.

This PR adds an `extra_nodes` option to the group/partition definitions in `openhpc_partitions`, allowing additional node definitions to be added to the `slurm.conf` node/partition definitions. As well as autoscaling (`State=CLOUD`) nodes, this could also be used to add normal, non-role-controlled nodes to a cluster using this role.
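As a sketch of the intended usage (the option and attribute names here are illustrative, not necessarily the final schema merged in this PR), a partition carrying both role-controlled nodes and extra `State=CLOUD` nodes might look like:

```yaml
openhpc_partitions:
  - name: small
    # role-controlled nodes still come from the matching inventory group
    extra_nodes:
      # hypothetical burst nodes, passed through to slurm.conf as-is
      - NodeName: burst-[0-3]
        State: CLOUD
        CPUs: 8
        RealMemory: 16000
```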

It also modifies the docs, as I realised they were a bit messy/confusing in places.

There are some subtleties which needed changes to the slurm.conf templating:

  • "Matching" inventory groups (`<cluster_name>_<group_name>`) won't exist if a group/partition contains only `extra_nodes` nodes.
  • The `NodeName=DEFAULT` approach can't be used, as we don't want CPU information to "fall through" to subsequent partitions which might not define all of the current `threads_per_core` etc. variables. I found existing code to group nodenames, so the template uses that instead (renamed to avoid total confusion) to keep node definitions concise. As a bonus, this makes the config MUCH shorter in the usual case of sequential nodenames, which will help slurmctld performance and startup time for large clusters.
  • Given the above changes, the opportunity was taken to make the templating of node definitions clearer; this templating has generally been really hard to follow, so hopefully this helps.
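The nodename grouping mentioned above can be illustrated with a standalone sketch. The role's actual Jinja filter differs in detail, and `fold_nodenames` is a hypothetical name; this just shows why sequential names compress the config so much, assuming numeric suffixes without leading zeros:

```python
import re
from itertools import groupby

def fold_nodenames(names):
    """Collapse node names sharing a prefix and numeric suffix into a
    Slurm-style hostlist expression, e.g. node-[1-3,5]."""
    groups = {}   # prefix -> list of numeric suffixes
    plain = []    # names with no numeric suffix are passed through
    for name in names:
        m = re.fullmatch(r"(.*?)(\d+)", name)
        if m:
            groups.setdefault(m.group(1), []).append(int(m.group(2)))
        else:
            plain.append(name)
    parts = list(plain)
    for prefix, nums in groups.items():
        nums = sorted(set(nums))
        ranges = []
        # consecutive integers have a constant (value - index) within a run
        for _, run in groupby(enumerate(nums), lambda t: t[1] - t[0]):
            run = [n for _, n in run]
            ranges.append(str(run[0]) if len(run) == 1 else f"{run[0]}-{run[-1]}")
        parts.append(f"{prefix}[{','.join(ranges)}]")
    return ",".join(parts)

print(fold_nodenames(["node-1", "node-2", "node-3", "node-5"]))  # -> node-[1-3,5]
```

With hundreds of sequential nodes this collapses hundreds of `NodeName=` lines into one, which is the slurmctld startup-time win described above.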

sjpb added 30 commits September 8, 2021 13:12

sjpb commented Oct 14, 2021

Note test13 for `openhpc_config` was not actually being run, and its verification needed fixing. Done in this PR, as we also need `openhpc_config` for autoscaling.

@sjpb sjpb marked this pull request as ready for review October 14, 2021 11:53
@sjpb sjpb requested review from jovial and JohnGarbutt October 14, 2021 11:54

@jovial jovial left a comment

Overall, change seems pretty reasonable. Templating in slurm.conf is pretty hairy, but then again it was before this change. Had a couple of little questions.

@jovial jovial left a comment

Overall, change seems pretty reasonable. Just that niggle about openhpc_config really.


sjpb commented Nov 11, 2021

centos:8.2.2004: test4 failed in CI but worked OK with local molecule. Rerunning...


sjpb commented Nov 11, 2021

Passed on 2nd attempt, ready for re-review @jovial.

Co-authored-by: jovial <will@stackhpc.com>

sjpb commented Nov 11, 2021

Passed tests on 2nd attempt


sjpb commented Nov 29, 2021

@jovial can you re-review please? I think this is ready to go.

Member

That looks like it's backwards incompatible? Could we keep both old and new easily enough?

Collaborator Author

The problem is that the filter name is massively confusing when used in the templating, which also used `group_hosts` as a variable containing the list of hosts in that group. I can't see why backward compatibility is required, as I can't see a use-case for another playbook using this filter.

```diff
   register: slurm_config
 - assert:
-    that: "item in slurm_config.stdout"
+    that: "item in slurm_config.stdout_lines | map('replace', ' ', '')"
```
Member

Might be worth adding brackets here? I am unclear on the ordering; I think it's:
`"item in (slurm_config.stdout_lines | map('replace', ' ', ''))"`

Collaborator Author

Done in 505661e - testing...

```yaml
# Need to specify IPs as even with State=DOWN, if slurmctld can't look up a
# host it just excludes it from the config entirely.
# Can't add to /etc/hosts via Ansible due to Docker limitations on modifying /etc/hosts.
- NodeName: fake-x,fake-y
  NodeAddr: 0.42.42.0,0.42.42.1
```
Member

Should we use internal IPs here? like 10.42.42.0?

Member

Does this also affect cloud nodes that do not exist?

Collaborator Author

@JohnGarbutt does 9b04782 clarify why I'm using invalid IPs rather than internal ones?

It doesn't affect cloud nodes which don't exist, as they will have been listed in the config with `State=CLOUD`, not `State=DOWN`. The former specifically means slurmctld knows it cannot contact them until they are "resumed".
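A hypothetical `slurm.conf` fragment showing the distinction (node names and addresses are the test fixtures from this PR, not real hosts):

```
# State=CLOUD: slurmctld does not expect to resolve or contact these until resumed
NodeName=burst-[0-3] State=CLOUD

# State=DOWN: the hostname must still resolve or the node is silently dropped
# from the config, hence the explicit (deliberately invalid) NodeAddr
NodeName=fake-x,fake-y NodeAddr=0.42.42.0,0.42.42.1 State=DOWN
```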

@sjpb sjpb merged commit a0d6f53 into master Jan 20, 2022
@sjpb sjpb deleted the feature/autoscale branch January 20, 2022 15:27
@sjpb sjpb mentioned this pull request Jan 20, 2022