Add Slurm autoscaling #18

sjpb · 2023-07-25T15:47:49Z

Requires KUBECONFIG to be defined in the shell used to run helm. The kubeconfig is injected as a secret, which is then used to create pods on demand.

Uses host networking for slurmd pods. These pods have a hostPort defined (for slurmd), which means no two of them get scheduled onto the same k-node.
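The scheduling effect of the hostPort can be illustrated with a toy model. This is a minimal sketch, not the Kubernetes scheduler: the `place` function and the node and pod names are hypothetical, and port 6818 is assumed because it is slurmd's default port (the chart may configure a different one).

```python
from typing import Dict, List, Optional, Set

SLURMD_PORT = 6818  # assumption: slurmd's default port; the chart may differ


def place(pod: str, host_port: int, nodes: Dict[str, Set[int]]) -> Optional[str]:
    """Place a pod on the first node whose host port is still free.

    `nodes` maps a k-node name to the set of host ports already claimed on it.
    Returns the chosen node name, or None if the pod would stay Pending.
    """
    for name in sorted(nodes):
        if host_port not in nodes[name]:
            nodes[name].add(host_port)  # the port is now taken on this node
            return name
    return None  # every node already has a pod holding this host port


# Two hypothetical k-nodes, three slurmd pods requesting the same host port.
nodes: Dict[str, Set[int]] = {"k-node-0": set(), "k-node-1": set()}
placements: List[Optional[str]] = [
    place(f"slurmd-{i}", SLURMD_PORT, nodes) for i in range(3)
]
print(placements)
```

With two k-nodes the third pod has nowhere to go, which corresponds to a pod staying Pending until another k-node is available.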

Note that no ResumeFailProgram is defined. Assuming the slurmd pod definition is correct, the most likely cause of a "resume failure" from Slurm's point of view is the pod staying pending beyond ResumeTimeout, e.g. because there are not enough k-nodes. The s-node then shows DOWN, but if k-resources become available later the pod will launch, at which point the s-node changes from DOWN to IDLE (this has been tested). This differs from e.g. autoscaling on OpenStack, where a VM launch that fails for lack of cloud resources is not retried, and the s-node state has to be reset before Slurm will try to launch it again. So leaving the s-node showing as DOWN accurately reflects the state of cloud resources here.
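The behaviour described above can be summarised as a small decision function. This is a simplified model under stated assumptions, not Slurm's implementation: the `s_node_state` function and the state labels are illustrative, and "pod running" is taken to mean that slurmd has started and registered with the controller.

```python
from enum import Enum


class SNodeState(Enum):
    # Simplified labels for this sketch; Slurm's own state display is richer.
    POWERING_UP = "POWERING_UP"
    DOWN = "DOWN"
    IDLE = "IDLE"


def s_node_state(pod_running: bool, seconds_since_resume: float,
                 resume_timeout: float) -> SNodeState:
    """Return the s-node state implied by the pod status and elapsed time."""
    if pod_running:
        # The pod launched, possibly long after the timeout: node is usable.
        return SNodeState.IDLE
    if seconds_since_resume > resume_timeout:
        # Pod still pending beyond ResumeTimeout: Slurm marks the node DOWN.
        return SNodeState.DOWN
    return SNodeState.POWERING_UP


# Hypothetical timeline with ResumeTimeout = 300 seconds.
print(s_node_state(False, 10, 300))   # still within the timeout
print(s_node_state(False, 400, 300))  # pending too long
print(s_node_state(True, 900, 300))   # k-resources appeared later
```

Because the pending pod is retried by Kubernetes, no reset step appears in this model, unlike the OpenStack case described above.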

Convert Rook NFS to a Helm chart and install it as a Slurm chart dependency

Moved secret generation from scripts to helm

Only permit one slurmd pod per k8s node

Use host networking

Replaced mounted kubeconfig with service account

Hook race fix

sjpb · 2023-08-18T15:39:51Z

NB: this CANNOT make use of #27

wtripp180901 added 30 commits July 14, 2023 10:47

Added Open Ondemand to image

648db22

Running ood portal generator

b241c36

Trying adding ood user before starts

1995fd9

Apache runs but auth errors

26a4750

Creating htpasswd file and adding user on startup

6abcad0

Now adds rocky as authenticated user and uses htdbm to generate auth file

494a7a5

Updated image + mounted cluster config

547428b

Trying creating shell directory on startup

a1bd370

Trying adding env file to shell directory

ee321c9

Bump values.yaml

d48976b

Trying installing modules in Dockerfile

e3b8774

Trying to configure clusters (not working)

2172d7b

Trying entrypoint tweaks

3f86fbe

Trying to configure cluster with the login nodes

7c541b0

Image now sets up rocky OOD password with env variable from secret

c24c181

Rocky OOD password now set as secret from generate-secrets.sh

ad79e16

Fixed broken mountpath for cluster config

2655d12

Fixed incorrect slurm binaries path

44e71b4

Updated docs

804c74d

Changed image to allow self-sshing

0e2666a

Fixed incorrect path

7513b72

Added newline to avoid breaking authorized_keys file

4ba0991

Bumped image

833b0d2

Removed host key generation from login image

d38e241

Updated image to copy and set permissions for host keys from mount

a89e584

Server now has persistent set of host keys from mount

a6c8e38

Removed comments

7a2480b

Added https (fixes job composer)

1345a58

Now generates keys for rocky to self-ssh if don't already exist (in image)

0f286ed

Updated image tag

c094754

sjpb and others added 21 commits August 16, 2023 12:01

Merge pull request #23 from stackhpc/feature/helm-install-nfs

9cde995

Convert Rook NFS to a Helm chart and install it as a Slurm chart dependency

Merge image rebuild

d3daba4

Updated image

7c5b6c4

Merge pull request #24 from stackhpc/azimuth-helm

e839442

Moved secret generation from scripts to helm

Replaced kubeconfig mount with ServiceAccount

968515e

Added debug to k8s files

e25332e

only permit one slurmd pod per k8s node

d313063

Added more debugging for k8s

e90f227

use host networking

6530f78

Sending debug to log files

f5c1261

Adding kubectl output to logs

10b8e8e

Adding error check

ef184aa

Merge pull request #28 from stackhpc/feat/hostport

a0193a6

Only permit one slurmd pod per k8s node

Merge branch 'main' into feat/hostnetwork

4d4a15b

Adding /dev/tty pipes

a9ea92b

Debug

be00d24

Added error redirection

63795d3

Merge pull request #29 from stackhpc/feat/hostnetwork

def4a77

Use host networking

Fixed missing environment variables in power up/down scripts

a731c60

Updated values.yaml and gave all pod permissions to account

a2ca5e3

Merge pull request #30 from stackhpc/feat/autoscaler-service-account

96d933f

Replaced mounted kubeconfig with service account

sd109 changed the title ~~Add autoscale~~ Add Slurm autoscaling Aug 18, 2023

wtripp180901 and others added 7 commits August 18, 2023 13:34

Rebuilding image with fixed merged conflicts

1f51003

Updated image tag

89981e6

Merge pull request #21 from stackhpc/hook-race-fix

a0a2323

Hook race fix

Image rebuild with fixed merge conflicts

6ca2cd0

Updated image

3ebcfe4

Pre-merge image rebuild

0602876

Updated image tag

344b9b2
