title: Azure Node Health Checks (AzNHC)
description: Documentation for using Azure Node Health Checks (AzNHC) to validate AI-HPC VM offerings.
ms.service: azure-node-health-checks
ms.topic: overview
author: rafsalas19
ms.author: rafaelsalas
ms.date: 04/15/2024

AzureHPC Node Health Checks

Description

Azure Node Health Checks (AzNHC) provides a suite of tests that target Azure AI/HPC VM offerings. AzNHC extends the existing LBNL Node Health Checks (NHC) test suite. It combines Azure custom checks with the default NHC checks to target the AI and HPC components inside a VM/node.

Key Features

  1. All checks run on a single node, and total runtime is short (up to 5 minutes).
  2. Easy to set up and use. See the quickstart guide.
  3. Targets AI/HPC hardware components (GPU, InfiniBand, CPU). See the test coverage page.
  4. Customizable tests. See the detailed running AzNHC guide.

Recommended Usage

AzNHC can be used in a few ways to validate node health.

  1. Validate VM health before launching the intended AI/HPC workload.
  2. Validate idle VM capacity at a recurring interval.
  3. Validate suspected faulty VMs to troubleshoot and isolate issues, for example when performance degradation is observed by other means (e.g. a performance drop during an AI/HPC workload).
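
The first usage pattern above (validate, then launch) can be sketched as a small gate. This is a minimal illustration, not part of AzNHC: the helper `run_if_healthy` is hypothetical, and it assumes that the health check command signals failure through a non-zero exit status, which this document does not state. Check how your AzNHC version reports failures before relying on it.

```python
import subprocess
import sys
from typing import Sequence


def run_if_healthy(health_cmd: Sequence[str], workload_cmd: Sequence[str]) -> int:
    """Run workload_cmd only when health_cmd exits with status 0.

    Returns the workload's exit status, or the health check's non-zero
    status when the workload was skipped.
    """
    # ASSUMPTION: a non-zero exit status means the node failed validation.
    health = subprocess.run(health_cmd)
    if health.returncode != 0:
        print(
            f"health check failed (exit {health.returncode}); workload not started",
            file=sys.stderr,
        )
        return health.returncode

    # The node passed validation, so start the actual workload.
    return subprocess.run(workload_cmd).returncode
```

A call might look like `run_if_healthy(["./run-health-checks.sh"], ["python", "train.py"])`, where both the script path and the workload command are placeholders for your own environment.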

Supported VM SKU Offerings

Refer to the AzNHC GitHub page for the supported VM SKUs.

Design

AzNHC is launched via the run script, which deploys a Docker container that performs the targeted health checks. Refer to the Work Flow and Container Diagram sections below for more details.

Work Flow

The AzNHC process begins by launching the run-health-checks script. The workflow is shown in the following graphic:

[Figure: workflow]

Container Diagram

[Figure: container diagram]

Health Checks

The checks presented here are Azure custom checks. To learn more about the default NHC tests, see the Node Health Checks project.

The following are Azure custom checks added to the existing NHC suite of tests:

| Check | Component Tested | nd96asr_v4 expected | nd96amsr_a100_v4 expected | nd96isr_h100_v5 expected | hx176rs expected | hb176rs_v4 expected |
| --- | --- | --- | --- | --- | --- | --- |
| check_gpu_count | GPU count | 8 | 8 | 8 | NA | NA |
| check_nvlink_status | NVLink | no inactive links | no inactive links | no inactive links | NA | NA |
| check_gpu_xid | GPU XID errors | not present | not present | not present | NA | NA |
| check_nvsmi_healthmon | Nvidia-smi GPU health check | pass | pass | pass | NA | NA |
| check_gpu_bandwidth | GPU DtH/HtD bandwidth | 23 GB/s | 23 GB/s | 52 GB/s | NA | NA |
| check_gpu_ecc | GPU Mem Errors (ECC) | 20000000 | 20000000 | 20000000 | NA | NA |
| check_gpu_clock_throttling | GPU Throttle codes assertion | not present | not present | not present | NA | NA |
| check_nccl_allreduce | GPU NVLink bandwidth | 228 GB/s | 228 GB/s | 460 GB/s | NA | NA |
| check_ib_bw_gdr | InfiniBand device (GDR) bandwidth | 180 GB/s | 180 GB/s | 380 GB/s | NA | NA |
| check_ib_bw_non_gdr | InfiniBand device (non GDR) bandwidth | NA | NA | NA | 390 GB/s | 390 GB/s |
| check_nccl_allreduce_ib_loopback | GPU/GPU Direct RDMA (GDR) + InfiniBand device bandwidth | 18 GB/s | 18 GB/s | NA | NA | NA |
| check_hw_topology | InfiniBand/GPU device topology/PCIe mapping | pass | pass | pass | NA | NA |
| check_ib_link_flapping | InfiniBand link flap occurrence | not present | not present | not present | not present | not present |
| check_cpu_stream | CPU compute/memory bandwidth | NA | NA | NA | 665500 MB/s | 665500 MB/s |

Note: See the test coverage page for more detailed descriptions.
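
To illustrate how the expected values in the table can be read, the sketch below encodes the bandwidth rows as a lookup and compares a measured value against them. It is not AzNHC's implementation: the names `EXPECTED_BANDWIDTH`, `expected_for`, and `evaluate` are hypothetical, and treating each expected value as a minimum acceptable bandwidth is an assumption rather than something this page states.

```python
from typing import Dict, Optional, Tuple

# SKU order matches the table columns.
SKUS = ("nd96asr_v4", "nd96amsr_a100_v4", "nd96isr_h100_v5", "hx176rs", "hb176rs_v4")

# Expected bandwidths copied from the table, in the table's own units
# (GB/s, except check_cpu_stream in MB/s). None stands for "NA".
EXPECTED_BANDWIDTH: Dict[str, Tuple[Optional[float], ...]] = {
    "check_gpu_bandwidth": (23, 23, 52, None, None),
    "check_nccl_allreduce": (228, 228, 460, None, None),
    "check_ib_bw_gdr": (180, 180, 380, None, None),
    "check_ib_bw_non_gdr": (None, None, None, 390, 390),
    "check_nccl_allreduce_ib_loopback": (18, 18, None, None, None),
    "check_cpu_stream": (None, None, None, 665500, 665500),
}


def expected_for(check: str, sku: str) -> Optional[float]:
    """Return the table's expected value, or None when the check is NA."""
    return EXPECTED_BANDWIDTH[check][SKUS.index(sku)]


def evaluate(check: str, sku: str, measured: float) -> str:
    """Classify a measurement as 'pass', 'fail', or 'skipped' (NA)."""
    expected = expected_for(check, sku)
    if expected is None:
        return "skipped"
    # ASSUMPTION: the expected value is a lower bound on measured bandwidth.
    return "pass" if measured >= expected else "fail"
```

For example, `evaluate("check_gpu_bandwidth", "nd96isr_h100_v5", 50.0)` is classified as a failure under this assumption because 50 is below the expected 52 GB/s, while any CPU-only check on a GPU SKU is reported as skipped.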

References