Fix A800 GPU detection fields used by gpu-info and health checks #1

Open
csh0101 wants to merge 1 commit into qinyusen:main from csh0101:fix/a800-health-check-detection

csh0101 commented Apr 28, 2026

This patch fixes field selection and service detection in `gpu_info.py` and
`health_check.py` so the suite reports correct PCIe, CUDA, throttling, and
persistence status on our A800 validation host.
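
For reviewers who want to reproduce the field-mismatch class of failure, here is a minimal sketch of parsing `nvidia-smi --query-gpu=... --format=csv,noheader` output defensively. It is not the code in `gpu_info.py`: the field list, the `UNAVAILABLE` placeholder set, and the `parse_query_output` helper are assumptions made for illustration, and the placeholder set may need extending for other driver branches.

```python
import csv
import io

# Fields passed to `nvidia-smi --query-gpu=...`. This selection is an
# illustrative assumption, not the list used by gpu_info.py.
QUERY_FIELDS = [
    "index",
    "name",
    "pcie.link.gen.current",
    "pcie.link.width.current",
    "persistence_mode",
]

# Placeholders nvidia-smi prints when a field is not available on a GPU.
# Assumed set; extend it if a driver reports other markers.
UNAVAILABLE = {"[N/A]", "N/A", "[Not Supported]"}


def parse_query_output(text, fields=QUERY_FIELDS):
    """Parse `--format=csv,noheader` output into one dict per GPU.

    Unavailable values become None so a caller can report "unknown"
    instead of treating a placeholder string as a real reading.
    """
    rows = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for raw in reader:
        if not raw:  # skip blank lines
            continue
        if len(raw) != len(fields):
            # A column-count mismatch usually means the query and the
            # parser disagree about which fields were requested.
            raise ValueError(
                f"expected {len(fields)} columns, got {len(raw)}: {raw}"
            )
        values = [v.strip() for v in raw]
        rows.append(
            {f: (None if v in UNAVAILABLE else v) for f, v in zip(fields, values)}
        )
    return rows
```

The output text would come from running `nvidia-smi --query-gpu=` followed by the comma-joined field names and `--format=csv,noheader`. Mapping placeholders to `None` keeps an unsupported field from being reported as a hardware fault.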

Constraint: Validated only on an 8x NVIDIA A800-SXM4-80GB host with driver 570.195.03 and CUDA 12.8
Constraint: Original failure path came from `nvidia-smi` field mismatches and stale service-state assumptions, not GPU hardware faults
Rejected: Keep current fields and document false alarms | leaves GPU info and health output misleading on A800
Confidence: medium
Scope-risk: narrow
Directive: This has only been tested on A800; it has not been regression-tested on B200/H200, so those paths should be rechecked before relying on identical behavior
Tested: `python3 h200_tester.py --test gpu-info` and `python3 h200_tester.py --test health` on A800 after driver recovery
Not-tested: B200, H200, other driver branches, non-NVSwitch systems
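
One way to avoid stale service-state assumptions is to treat the per-GPU `persistence_mode` value from `nvidia-smi` as the primary signal and use the daemon state only as context. The sketch below assumes that approach; it is not taken from `health_check.py`, and the `service_state` and `persistence_status` helpers and their return values are hypothetical.

```python
import subprocess


def service_state(unit="nvidia-persistenced"):
    """Return the `systemctl is-active` state string, or None if it cannot be read."""
    try:
        out = subprocess.run(
            ["systemctl", "is-active", unit],
            capture_output=True, text=True, timeout=5, check=False,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return None  # no systemd on this host, or the call hung
    return out.stdout.strip() or None


def persistence_status(mode_values, daemon_state):
    """Classify persistence from per-GPU `persistence_mode` values.

    mode_values: list of "Enabled"/"Disabled"/None, one entry per GPU.
    daemon_state: result of service_state(); only affects the detail text.
    Returns a (level, detail) tuple where level is "ok", "warn", or "unknown".
    """
    if not mode_values or any(v is None for v in mode_values):
        return ("unknown", "persistence_mode not reported for every GPU")
    if all(v == "Enabled" for v in mode_values):
        if daemon_state == "active":
            return ("ok", "enabled on all GPUs; nvidia-persistenced is active")
        # Persistence can also be enabled with `nvidia-smi -pm 1`, so an
        # inactive daemon alone is not treated as a failure here.
        return ("ok", "enabled on all GPUs; nvidia-persistenced is not active")
    disabled = [i for i, v in enumerate(mode_values) if v != "Enabled"]
    return ("warn", f"persistence not enabled on GPU positions {disabled}")
```

A caller would pass the `persistence_mode` column from the GPU query together with `service_state()`, so the health output reflects what the driver reports rather than what the service manager last recorded.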