The nvidia-smi command provided by NVIDIA can be used to manage and monitor GPU-enabled Compute Nodes. In conjunction with the xCAT xdsh command, you can easily manage and monitor the entire set of GPU-enabled Compute Nodes remotely from the Management Node.
Example:
# xdsh <noderange> "nvidia-smi -i 0 --query-gpu=name,serial,uuid --format=csv,noheader"
node01: Tesla K80, 0322415075970, GPU-b4f79b83-c282-4409-a0e8-0da3e06a13c3
...
Warning
The following commands are provided as a convenience. Always consult the nvidia-smi manpage for the latest supported functions.
Some useful nvidia-smi example commands for management.
Set persistence mode. When persistence mode is enabled, the NVIDIA driver remains loaded even when there are no active clients. DISABLED by default:
nvidia-smi -i 0 -pm 1
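One way to verify the change is to query the persistence_mode field (a standard --query-gpu property):
nvidia-smi -i 0 --query-gpu=persistence_mode --format=csv,noheader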
Disable ECC support for the GPU (0 disables, 1 enables). The new mode is pending until the next reboot; use --query-gpu=ecc.mode.pending to check it [Reboot required]:
nvidia-smi -i 0 -e 0
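For example, to check the pending ECC mode before rebooting:
nvidia-smi -i 0 --query-gpu=ecc.mode.pending --format=csv,noheader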
Reset the ECC error counters for the target GPU (0 resets the VOLATILE counts, 1 resets the AGGREGATE counts):
nvidia-smi -i 0 -p 0/1
Set the compute mode for compute applications (0/DEFAULT, 1/EXCLUSIVE_THREAD, 2/PROHIBITED, 3/EXCLUSIVE_PROCESS); query with --query-gpu=compute_mode:
nvidia-smi -i 0 -c 0/1/2/3
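For example, to allow only one compute process at a time on GPU 0 and then confirm the change:
nvidia-smi -i 0 -c 3
nvidia-smi -i 0 --query-gpu=compute_mode --format=csv,noheader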
Trigger a reset of the GPU:
nvidia-smi -i 0 -r
Enable or disable Accounting Mode, which allows statistics to be gathered for each compute process running on the GPU; query with --query-gpu=accounting.mode:
nvidia-smi -i 0 -am 0/1
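Once accounting is enabled, per-process statistics can be retrieved with the --query-accounted-apps option. The exact field names can vary by driver version, so consult nvidia-smi --help-query-accounted-apps; a sketch:
nvidia-smi -i 0 --query-accounted-apps=pid,gpu_utilization,mem_utilization,max_memory_usage,time --format=csv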
Set the maximum power management limit, in watts; query with --query-gpu=power.limit:
nvidia-smi -i 0 -pl 200
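Any of the management commands above can be applied across the cluster by wrapping them in xdsh, as in the introductory example. For instance, to enable persistence mode on GPU 0 of every node (substitute your own noderange):
xdsh <noderange> "nvidia-smi -i 0 -pm 1"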
Some useful nvidia-smi example commands for monitoring.
The number of NVIDIA GPUs in the system:
nvidia-smi --query-gpu=count --format=csv,noheader
The version of the installed NVIDIA display driver:
nvidia-smi -i 0 --query-gpu=driver_version --format=csv,noheader
The BIOS of the GPU board:
nvidia-smi -i 0 --query-gpu=vbios_version --format=csv,noheader
Product name, serial number and UUID of the GPU:
nvidia-smi -i 0 --query-gpu=name,serial,uuid --format=csv,noheader
Fan speed:
nvidia-smi -i 0 --query-gpu=fan.speed --format=csv,noheader
The compute mode flag indicates whether individual or multiple compute applications may run on the GPU (known as exclusivity modes):
nvidia-smi -i 0 --query-gpu=compute_mode --format=csv,noheader
Percent of time over the past sample period during which one or more kernels were executing on the GPU:
nvidia-smi -i 0 --query-gpu=utilization.gpu --format=csv,noheader
Total corrected ECC errors detected across the entire chip; the sum of the device_memory, register_file, l1_cache, l2_cache and texture_memory counters:
nvidia-smi -i 0 --query-gpu=ecc.errors.corrected.aggregate.total --format=csv,noheader
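The per-location counters that make up this total can also be queried individually; the field names below follow the nvidia-smi --help-query-gpu listing:
nvidia-smi -i 0 --query-gpu=ecc.errors.corrected.aggregate.device_memory,ecc.errors.corrected.aggregate.l2_cache --format=csv,noheader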
Core GPU temperature, in degrees C:
nvidia-smi -i 0 --query-gpu=temperature.gpu --format=csv,noheader
The ECC mode that the GPU is currently operating under:
nvidia-smi -i 0 --query-gpu=ecc.mode.current --format=csv,noheader
The power management status:
nvidia-smi -i 0 --query-gpu=power.management --format=csv,noheader
The last measured power draw for the entire board, in watts:
nvidia-smi -i 0 --query-gpu=power.draw --format=csv,noheader
The minimum and maximum values, in watts, to which the power limit can be set:
nvidia-smi -i 0 --query-gpu=power.min_limit,power.max_limit --format=csv
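As with the management commands, these monitoring queries can be fanned out to all GPU nodes with xdsh. For example, to collect the temperature, power draw and utilization of GPU 0 on every node:
xdsh <noderange> "nvidia-smi -i 0 --query-gpu=temperature.gpu,power.draw,utilization.gpu --format=csv,noheader"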