Description
The following is described in the node roles doc:
## Dedicated GPU & CPU Nodes
Separate nodes into those that:
* Run GPU workloads
* Run CPU workloads
* Do not run Run:ai at all. These jobs will not be monitored using the Run:ai Administration User interface.
This is actually not true: all nodes in the cluster are displayed under the Nodes
tab in the Administration UI, including Run:ai worker nodes, Run:ai system nodes, regular workers, and cluster masters.
Any node that contains GPUs and has DCGM exporting metrics from it counts as a "GPU node" in the Overview
dashboard.
That includes nodes that do not have the runai-container-toolkit
and runai-container-toolkit-exporter
DaemonSets running on them; no Run:ai pod will be scheduled on such nodes, yet they are still counted.
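One way to see which nodes the Overview dashboard would count is to list nodes by advertised GPU capacity. The sketch below assumes the standard `nvidia.com/gpu` extended resource name exposed by the NVIDIA device plugin; the `gpu_nodes` helper is hypothetical, not part of Run:ai.

```shell
# Print the names of nodes that advertise at least one NVIDIA GPU.
# Expects "NAME GPUS" columns on stdin; nodes without the resource
# show up as "<none>" and are skipped, as are nodes reporting 0 GPUs.
gpu_nodes() {
  awk 'NR > 1 && $2 != "<none>" && $2 + 0 > 0 { print $1 }'
}

# Query the cluster only when kubectl is available.
# nvidia.com/gpu is an assumption; adjust if your cluster uses
# a different extended resource name.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPUS:.status.capacity.nvidia\.com/gpu' \
    | gpu_nodes
fi
```

Comparing this list against the nodes where the runai-container-toolkit DaemonSet actually runs shows which "GPU nodes" are counted without being schedulable for Run:ai pods.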
Review node names using `kubectl get nodes`. For each such node, run:
```
runai-adm set node-role --gpu-worker <node-name>
```
or
```
runai-adm set node-role --cpu-worker <node-name>
```
Nodes not marked as GPU worker or CPU worker will not run Run:ai at all.
That is also not true: nodes marked neither as GPU workers nor as CPU workers will still run any kind of Run:ai workload.
The same behavior occurs when both roles are assigned to a node.