Autopilot provides a `/status` handler that can be queried to get the entire system status, meaning that it will run all the tests on all the nodes. Autopilot is reachable by the service name `autopilot-healthchecks.autopilot.svc` in-cluster only, meaning it can be reached from a pod running in the cluster, or through port forwarding (see below).
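For instance, from any pod inside the cluster, the service can be queried directly by its DNS name. This is a minimal sketch, assuming the default service port `3333`:

```bash
# Query Autopilot from inside the cluster via the service DNS name
# (assumes the default service port 3333)
curl "http://autopilot-healthchecks.autopilot.svc:3333/status?check=pciebw&host=nodename1"
```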
Health check names are `pciebw`, `dcgm`, `remapped`, `ping`, `iperf`, `pvc`, `gpumem`.
For example, using port forwarding to localhost or by exposing the service:
kubectl port-forward service/autopilot-healthchecks 3333:3333 -n autopilot
# or oc expose service autopilot-healthchecks -n autopilot in OpenShift
If using port forwarding, launch `curl` in another terminal:
curl "http://localhost:3333/status?check=pciebw&host=nodename1"
Alternatively, retrieve the route with `kubectl get routes autopilot-healthchecks -n autopilot`.
When using routes, it is recommended to increase the timeout with the following command:
oc annotate route autopilot-healthchecks -n autopilot --overwrite haproxy.router.openshift.io/timeout=30m
Then:
curl "http://<route-name>/status?check=pciebw&host=nodename1"
All tests can be tailored by a combination of:
- `host=<hostname1,hostname2,...>`: run all tests on a specific node or on a comma-separated list of nodes.
- `check=<healthcheck1,healthcheck2,...>`: run a single test (`pciebw`, `dcgm`, `remapped`, `gpumem`, `ping`, `iperf`, or `all`) or a comma-separated list of tests. When no parameters are specified, only the `pciebw`, `dcgm`, `remapped`, and `ping` tests are run.
- `job=<namespace:key=value>`: run tests on nodes running a job labeled with `key=value` in a specific namespace.
- `nodelabel=<key=value>`: run tests on nodes having the `key=value` label.
- `batch=<#hosts>`: how many hosts to check at a single moment. Requests within a batch run in parallel asynchronously. Batching avoids running too many requests in parallel as the number of worker nodes grows. Defaults to all nodes.
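As an example of combining these parameters, the following sketch (with a hypothetical node label and an illustrative batch size) runs all checks on GPU-labeled nodes, four nodes at a time:

```bash
# Hypothetical example: run all checks on nodes labeled gpu=true, four nodes per batch
curl "http://localhost:3333/status?check=all&nodelabel=gpu=true&batch=4"
```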
Some health checks provide further customization. More details on all the tests can be found here.
Note that if multiple node selection parameters (`host`, `job`, `nodelabel`) are provided together, Autopilot will run tests on nodes that match any of the specified parameters (set union). For example, the following command will run the `pciebw` test on all nodes that either have the label `label1` OR are running the job `jobKey=job2`, because both the `nodelabel` and `job` parameters are provided in the input:
curl "http://<route-name>/status?check=pciebw&nodelabel=label1&job=default:jobKey=job2"
This test runs `dcgmi diag`, and only `r` is supported as a parameter. The default is `1`, but it can be customized, e.g., `/status?check=dcgm&r=2`.
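For instance, through port forwarding, a deeper DCGM diagnostic run could be requested as follows (the level `3` here is just an illustrative value):

```bash
# Run the DCGM check at a higher diagnostic level (illustrative value)
curl "http://localhost:3333/status?check=dcgm&r=3"
```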
As part of this workload, Autopilot will generate the Ring Workload and then start `iperf3 servers` on each interface on each Autopilot pod, based on the configuration options provided by the user. Only after the `iperf3 servers` are started does Autopilot begin executing the workload by starting `iperf3 clients`, again based on the configuration options provided by the user. All results are logged back to the user.
- For each network interface on each node, an `iperf3 server` is started. The number of `iperf3 servers` depends on the `number of clients` intended to be run. For example, if the `number of clients` is `8`, then `8` `iperf3 servers` are started per interface, each on a unique `port`.
- Invocation from the exposed Autopilot API is as follows:
# Invoked via the `status` handle:
curl "http://127.0.0.1:3333/status?check=iperf&workload=ring&pclients=<NUMBER_OF_IPERF3_CLIENTS>&startport=<STARTING_IPERF3_SERVER_PORT>"
# Invoked via the `status` with defaults (iperf clients = 8, starting server port = 5200, workload = ring):
curl "http://127.0.0.1:3333/status?check=iperf"
# Invoked via the `iperf` handle directly:
curl "http://127.0.0.1:3333/iperf?workload=ring&pclients=<NUMBER_OF_IPERF3_CLIENTS>&startport=<STARTING_IPERF3_SERVER_PORT>"
# Invoked via the `iperf` handle directly (iperf clients = 8, starting server port = 5200, workload = ring):
curl "http://127.0.0.1:3333/iperf"
In this example, we target one node, check the PCIe bandwidth, and use the port-forwarding method.
In this scenario, one of the measured values is lower than 8 GB/s, which results in an alert. This error will be exported to the OpenShift web console and to Slack, if that is enabled by admins.
curl "http://127.0.0.1:3333/status?check=pciebw"
The output of the command above will be similar to the following (edited to save space):
Checking status on all nodes
Autopilot Endpoint: 10.128.6.187
Node: hostname
url(s): http://10.128.6.187:3333/status?host=hostname&check=pciebw
Response:
Checking system status of host hostname (localhost)
[[ PCIEBW ]] Briefings completed. Continue with PCIe Bandwidth evaluation.
[[ PCIEBW ]] FAIL
Host hostname
12.3 12.3 12.3 12.3 5.3 12.3 12.3 12.3
Node Status: PCIE Failed
-------------------------------------
Autopilot Endpoint: 10.131.4.93
Node: hostname2
url(s): http://10.131.4.93:3333/status?host=hostname2&check=pciebw
Response:
Checking system status of host hostname2 (localhost)
[[ PCIEBW ]] Briefings completed. Continue with PCIe Bandwidth evaluation.
[[ PCIEBW ]] SUCCESS
Host hostname2
12.1 12.0 12.3 12.3 11.9 11.5 12.1 12.1
Node Status: Ok
-------------------------------------
Node Summary:
{'hostname': ['PCIE Failed'],
'hostname2': ['Ok']}
runtime: 31.845192193984985 sec