Add support for autoscaling #120
Conversation
This reverts commit f787c51.
Note test13 for
Overall, change seems pretty reasonable. Templating in slurm.conf is pretty hairy, but then again it was before this change. Had a couple of little questions.
Overall, change seems pretty reasonable. Just that niggle about openhpc_config really.
centos:8.2.2004, test4 failed in CI but worked OK in local molecule. Rerunning...
Passed on 2nd attempt, ready for re-review @jovial.
Co-authored-by: jovial <will@stackhpc.com>
Passed tests on 2nd attempt.
@jovial can you re-review please? I think this is ready to go.
That looks like it's backwards-incompatible? Could we keep both old and new easily enough?
The problem is that the filter name is massively confusing when used in the templating, which also used group_hosts
as a variable containing the list of hosts in that group. I can't see why backward compatibility is required, as I can't see a use-case for another playbook using this filter.
molecule/test13/verify.yml
Outdated
register: slurm_config
- assert:
    that: "item in slurm_config.stdout"
    that: "item in slurm_config.stdout_lines | map('replace', ' ', '')"
Might be worth adding brackets here? I'm unclear on the ordering; I think it's:
"item in (slurm_config.stdout_lines | map('replace', ' ', ''))"
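For what it's worth, the parse can be checked directly with jinja2 (the engine Ansible uses). A standalone sketch, not part of the test suite; the example `item`/`lines` values are made up:

```python
from jinja2 import Environment  # templating engine used by Ansible

env = Environment()

# In Jinja2 a filter pipe binds more tightly than the `in` test, so the
# unbracketed expression already parses as "item in (lines | map(...))".
unbracketed = env.from_string(
    "{{ item in lines | map('replace', ' ', '') }}")
bracketed = env.from_string(
    "{{ item in (lines | map('replace', ' ', '')) }}")

args = dict(item="NodeName=fake-x", lines=["NodeName = fake-x"])
print(unbracketed.render(**args))  # True
print(bracketed.render(**args))    # True - identical parse
```

So the brackets are redundant, but arguably clearer for the next reader.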
Done in 505661e - testing...
# Need to specify IPs as even with State=DOWN, if slurmctld can't look up a host it just excludes it from the config entirely
# Can't add to /etc/hosts via Ansible due to Docker limitations on modifying /etc/hosts
- NodeName: fake-x,fake-y
  NodeAddr: 0.42.42.0,0.42.42.1
Should we use internal IPs here, like 10.42.42.0?
Does this also affect cloud nodes that do not exist?
@JohnGarbutt does 9b04782 clarify why I'm using invalid IPs rather than internal ones?
It doesn't affect cloud nodes which don't exist, as they will have been listed in the config with State=CLOUD, not State=DOWN. The former specifically means slurmctld
knows it doesn't know how to contact them until they're "resumed".
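To illustrate the distinction (a sketch, not the role's literal output; node names and CPU counts are made up):

```ini
# CLOUD nodes need no resolvable address up front; slurmctld defers
# contacting them until they are "resumed" by the autoscaler.
NodeName=cloud-[0-3] State=CLOUD CPUs=8

# DOWN nodes must still resolve to *some* address, otherwise slurmctld
# silently drops them from the config - hence the dummy NodeAddr in test13.
NodeName=fake-x,fake-y NodeAddr=0.42.42.0,0.42.42.1 State=DOWN
```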
Most of the autoscaling setup can be done either before running runtime.yml, or by using the openhpc_config variable to pass additional slurm.conf parameters.
This PR adds an option
extra_nodes
to the group/partition definitions in openhpc_partitions, allowing additional node definitions to be added into the slurm.conf node/partition definitions. As well as autoscaling/State=CLOUD nodes, these could also be used to add normal, non-role-controlled nodes into a cluster using this role. It also modifies the docs, as I realised they were a bit messy/confusing in places.
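A hypothetical sketch of how this could look (key names other than extra_nodes, NodeName, NodeAddr and State are illustrative and may differ from the role's actual schema):

```yaml
openhpc_partitions:
  - name: compute
    extra_nodes:
      # autoscaled nodes, addresses resolved on resume
      - NodeName: cloud-[0-3]
        State: CLOUD
      # placeholder nodes; dummy NodeAddr keeps them in the config
      - NodeName: fake-x,fake-y
        NodeAddr: 0.42.42.0,0.42.42.1
        State: DOWN
```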
There are some subtleties which needed changes to the slurm.conf templating:
The
NodeName=DEFAULT
approach can't be used, as we don't want CPU information to "fall through" to subsequent partitions which might not define all the current threads_per_core etc. variables. I found existing code to group nodenames, so it uses that instead (renamed to avoid total confusion) to keep node definitions concise. As a bonus this will make the config MUCH shorter in the usual case of sequential nodenames, which will help slurmctld performance and startup time for large clusters.
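The grouping idea can be sketched as follows; this is a simplified illustration of Slurm-style hostlist compression, not the role's actual filter (which may handle zero-padding and other cases differently):

```python
import re

def group_hostnames(hosts):
    """Collapse runs of sequential hostnames into Slurm hostlist ranges.

    Sketch only: assumes names end in a decimal index; anything else
    passes through unchanged. E.g. ["n0","n1","n2"] -> ["n[0-2]"].
    """
    # Split each name into (prefix, numeric suffix or None).
    parsed = []
    for h in hosts:
        m = re.fullmatch(r"(.*?)(\d+)", h)
        parsed.append((m.group(1), int(m.group(2))) if m else (h, None))

    out = []
    i = 0
    while i < len(parsed):
        prefix, num = parsed[i]
        if num is None:          # no numeric suffix: emit verbatim
            out.append(prefix)
            i += 1
            continue
        # Extend j while the next host continues the same prefix and
        # its index is exactly one higher.
        j = i
        while (j + 1 < len(parsed)
               and parsed[j + 1][0] == prefix
               and parsed[j + 1][1] == parsed[j][1] + 1):
            j += 1
        if j > i:
            out.append(f"{prefix}[{num}-{parsed[j][1]}]")
        else:
            out.append(f"{prefix}{num}")
        i = j + 1
    return out

print(group_hostnames(["node-0", "node-1", "node-2", "login-0"]))
# ['node-[0-2]', 'login-0']
```

Emitting `node-[0-2]` instead of three separate NodeName lines is what shrinks the rendered slurm.conf for large clusters.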