Pre-merge image rebuild
wtripp180901 committed Aug 18, 2023
2 parents 3ebcfe4 + a0a2323 commit 0602876
Showing 4 changed files with 50 additions and 2 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -77,6 +77,8 @@ Subsequent releases can be deployed using:
helm upgrade <deployment-name> slurm-cluster-chart
```

Note: When updating the cluster with `helm upgrade`, a pre-upgrade hook blocks the upgrade if there are running jobs in the Slurm queue. Attempting to upgrade sets all Slurm nodes to the `DRAINED` state so that no new jobs start. If an upgrade fails because of running jobs, you can either wait for them to complete and retry the upgrade (a successful upgrade undrains the nodes), or undrain the nodes manually by accessing the cluster as a privileged user. Alternatively, you can bypass the hook by running `helm upgrade` with the `--no-hooks` flag, although this may result in running jobs being lost.
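
The options in the note can be sketched as shell helpers. This is an illustrative sketch, not part of the chart: the function names and the `DEPLOYMENT_NAME` variable are hypothetical, and the `scontrol` command must be run from somewhere with privileged access to the Slurm controller.

```shell
# Hypothetical variable: the release name used when the chart was installed.
DEPLOYMENT_NAME="${DEPLOYMENT_NAME:-my-slurm-cluster}"

# Option 1: once running jobs have finished, retry the upgrade.
retry_upgrade() {
    helm upgrade "$DEPLOYMENT_NAME" slurm-cluster-chart
}

# Option 2: give up on the upgrade and return the nodes to service.
undrain_nodes() {
    scontrol update NodeName=all State=UNDRAIN
}

# Option 3: skip the hooks entirely (running jobs may be lost).
force_upgrade() {
    helm upgrade "$DEPLOYMENT_NAME" slurm-cluster-chart --no-hooks
}
```

Nothing runs until one of the functions is called, for example `retry_upgrade` after `squeue` shows no active jobs.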

## Accessing the Cluster

Retrieve the external IP address of the login node using:
12 changes: 10 additions & 2 deletions image/docker-entrypoint.sh
@@ -141,15 +141,23 @@ elif [ "$1" = "check-queue-hook" ]
then
start_munge

scontrol update NodeName=all State=DRAIN Reason="Preventing new jobs running before upgrade"

RUNNING_JOBS=$(squeue --states=RUNNING,COMPLETING,CONFIGURING,RESIZING,SIGNALING,STAGE_OUT,STOPPED,SUSPENDED --noheader --array | wc --lines)

if [[ $RUNNING_JOBS -eq 0 ]]
then
exit 0
else
exit 1
fi

elif [ "$1" = "undrain-nodes-hook" ]
then
start_munge
scontrol update NodeName=all State=UNDRAIN
exit 0

elif [ "$1" = "generate-keys-hook" ]
then
mkdir -p ./temphostkeys/etc/ssh
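
The decision made by `check-queue-hook` above can be exercised without a live cluster. The following is a minimal sketch under stated assumptions: `count_active_jobs` and `upgrade_allowed` are hypothetical helpers that do not exist in `docker-entrypoint.sh`, and the job list is read from stdin instead of being produced by `squeue`. It counts non-empty lines with `grep -c .` where the hook itself uses `wc --lines`.

```shell
# Hypothetical helper: count active jobs from squeue-style output on stdin.
# grep -c . counts non-empty lines; it exits non-zero when the count is 0,
# so "|| true" keeps the function from failing on an empty queue.
count_active_jobs() {
    grep -c . || true
}

# Hypothetical helper: succeed (allow the upgrade) only when no jobs are active.
upgrade_allowed() {
    [ "$1" -eq 0 ]
}

# In the real hook the input would be the output of the squeue call shown in
# the diff (with --noheader --array), so each line corresponds to one job.
```

A caller would combine the two as `upgrade_allowed "$(squeue ... | count_active_jobs)"`, with the `squeue` arguments taken from the hook. A non-zero status corresponds to the hook's `exit 1`, which makes Helm abort the upgrade and leaves the nodes drained.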
34 changes: 34 additions & 0 deletions slurm-cluster-chart/templates/undrain-nodes-hook.yaml
@@ -0,0 +1,34 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: undrain-nodes-hook
  annotations:
    "helm.sh/hook": post-upgrade
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 0
  template:
    metadata:
      name: undrain-nodes-hook
    spec:
      restartPolicy: Never
      containers:
        - name: undrain-nodes-hook
          image: {{ .Values.slurmImage }}
          args:
            - undrain-nodes-hook
          volumeMounts:
            - mountPath: /tmp/munge.key
              name: munge-key-secret
              subPath: munge.key
            - mountPath: /etc/slurm/
              name: slurm-config-volume
      volumes:
        - name: munge-key-secret
          secret:
            secretName: {{ .Values.secrets.mungeKey }}
            defaultMode: 0400
        - name: slurm-config-volume
          configMap:
            name: {{ .Values.configmaps.slurmConf }}
4 changes: 4 additions & 0 deletions slurm-cluster-chart/values.yaml
@@ -1,4 +1,8 @@
<<<<<<< HEAD
slurmImage: ghcr.io/stackhpc/slurm-k8s-cluster:6ca2cd0
=======
slurmImage: ghcr.io/stackhpc/slurm-docker-cluster:1f51003
>>>>>>> main

login:
# Deployment resource name