Fix A800 GPU detection fields used by gpu-info and health checks #1
Open
csh0101 wants to merge 1 commit into qinyusen:main
This patch fixes field selection and service detection in `gpu_info.py` and `health_check.py` so the suite reports correct PCIe, CUDA, throttling, and persistence status on our A800 validation host.

Constraint: Validated only on an 8x NVIDIA A800-SXM4-80GB host with driver 570.195.03 and CUDA 12.8
Constraint: Original failure path came from `nvidia-smi` field mismatches and stale service-state assumptions, not GPU hardware faults
Rejected: Keep current fields and document false alarms | leaves GPU info and health output misleading on A800
Confidence: medium
Scope-risk: narrow
Directive: This has only been tested on A800; it has not been regressed on B200/H200, and those paths should be rechecked before relying on identical behavior
Tested: `python3 h200_tester.py --test gpu-info` and `python3 h200_tester.py --test health` on A800 after driver recovery
Not-tested: B200, H200, other driver branches, non-NVSwitch systems
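
For reviewers who want to see the kind of field mismatch described above, here is a minimal sketch of querying `nvidia-smi` and treating unsupported fields as missing rather than as faults. It is not the code in this patch: the helper names `parse_gpu_csv` and `query_gpus` and the choice of fields are assumptions made for illustration, and the diff itself is not shown here. The field names in the list are standard `--query-gpu` properties.

```python
import csv
import io
import subprocess

# Assumed selection of `nvidia-smi --query-gpu` properties; the fields the
# patch actually uses are not visible in this PR description.
QUERY_FIELDS = [
    "name",
    "driver_version",
    "persistence_mode",
    "pcie.link.gen.current",
    "pcie.link.width.current",
]

# Placeholders nvidia-smi prints when a property is unavailable on a GPU.
UNSUPPORTED = {"[N/A]", "[Not Supported]", "N/A"}


def parse_gpu_csv(text, fields=QUERY_FIELDS):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU.

    Unsupported values become None so a caller can report "unknown"
    instead of raising a false hardware alarm.
    """
    rows = []
    for raw in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not raw:
            continue  # skip blank lines
        values = [v.strip() for v in raw]
        if len(values) != len(fields):
            raise ValueError(
                f"expected {len(fields)} columns, got {len(values)}: {raw!r}"
            )
        rows.append(
            {k: (None if v in UNSUPPORTED else v) for k, v in zip(fields, values)}
        )
    return rows


def query_gpus(fields=QUERY_FIELDS):
    """Run nvidia-smi and return parsed rows (requires the NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out, fields)
```

Keeping the parser separate from the `subprocess` call means the field handling can be checked against captured output from other hosts (for example B200 or H200) without access to that hardware.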