Refactor bounds_check_indices offset checks to condition-first (Phase 1) (#5682)#5682
Closed
gchalump wants to merge 2 commits into
Closed
Refactor bounds_check_indices offset checks to condition-first (Phase 1) (#5682)#5682gchalump wants to merge 2 commits into
gchalump wants to merge 2 commits into
Conversation
Contributor
|
@gchalump has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101718260. |
Summary:
This diff bundles the two pieces needed to benchmark the
bounds_check_indices stack across architectures, so a single diff
number can be forwarded to collaborators (NVIDIA A100/H100/B200,
AMD MI300/MI350) for cross-arch perf testing before any of the
stack lands.
# Part 1: --bounds-check-version flag
Adds a `--bounds-check-version` CLI flag to the
`bounds-check-indices` subcommand of
`fbcode/deeplearning/fbgemm/fbgemm_gpu/bench/tbe/tbe_utils_benchmark.py`,
plumbed through to `torch.ops.fbgemm.bounds_check_indices` as the
`bounds_check_version` kwarg (which already exists in the C++ op
signature).
Without this, v2 was only reachable by either flipping the
`BOUNDS_CHECK_INDICES_V2` JK gate (requires JK perms; not
available to OSS/external runners) or manually editing the bench
file inline (fragile, doesn't survive `sl goto`).
Default `1` preserves existing bench behavior (no behavior change
for any current caller).
# Part 2: portable bench tooling
Three bash scripts in
`fbcode/ai_codesign/nonprod/gchalump/scripts/fbgemm/`:
1. **run_bounds_check_indices_bench.sh** — inner runner. Sweeps
{modes} × {versions} × {trials} on the currently-checked-out
commit. Captures `system.json` (GPU/driver/CUDA/host/checkout).
Mirrors the arch dispatch (`-c h100`/`-c mi350`/`-c b200`) and
`--gpu N` (sets both `CUDA_VISIBLE_DEVICES` and
`HIP_VISIBLE_DEVICES`) conventions from supadchaya's existing
TBE bench scripts. Auto-falls-back to v1-only on checkouts
that predate Part 1.
2. **run_bounds_check_indices_on_stack.sh** — outer driver. Loops
`sl goto` across {Baseline, Phase 0.5, 1, 1.5, 2, 3}, runs
the inner per phase, restores original checkout via EXIT trap.
Forwards all bench flags. Pre-pulls all diffs to avoid
mid-sweep network calls. Auto-runs the analyzer at the end.
3. **analyze_bounds_check_results.sh** — log parser + reporter
(pure bash). Emits `results.csv` and `summary.md` (median per
cell with Δ T vs Baseline, per kernel version). Uploads both
to pastry by default.
# Why folded into one diff
Originally Part 1 was the bottom of the bounds_check stack and
Part 2 was a parallel branch off master (D102682427, now
abandoned). Folded into one diff so collaborators only need to
remember one diff number to reproduce the full benchmark.
Differential Revision: D102675932
… 1) (pytorch#5682) Summary: X-link: facebookresearch/FBGEMM#2624 Phase 1 of the bounds_check_indices race-condition cleanup stack. This is a pure refactor — no behavior change. The per-b_t offset validation in both v1 and v2 CUDA kernels is rewritten from mode-first to condition-first. **Before (mode-first):** ``` if (FATAL) { asserts } else if (WARNING) { if (bad) { warn + adjust } } else if (IGNORE) { adjust unconditionally } ``` **After (condition-first):** ``` if (bad) { if (FATAL) { asserts } else { if (WARNING) { warn } adjust } } ``` **Why**: the mode-first form had IGNORE call `adjust_offset_kernel` on every (b, t) pair, even when offsets were valid — wasted writes plus needless contention on the offsets buffer. Condition-first runs adjustment only when bounds are actually violated. **Stack:** - (this) Phase 1 — Condition-first refactor - Phase 1.5 — Gate offset correction on lane 0 + shfl_sync broadcast - Phase 2 — Make bad offsets unconditionally fatal Reviewed By: q10 Differential Revision: D101718260
56a9d90 to
e53328b
Compare
Contributor
|
This pull request has been merged in b6f2db1. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2624
Phase 1 of the bounds_check_indices race-condition cleanup stack.
This is a pure refactor — no behavior change. The per-b_t offset validation in both v1 and v2 CUDA kernels is rewritten from mode-first to condition-first.
Before (mode-first):
After (condition-first):
Why: the mode-first form had IGNORE call
adjust_offset_kernelon every (b, t) pair, even when offsets were valid — wasted writes plus needless contention on the offsets buffer. Condition-first runs adjustment only when bounds are actually violated.Stack:
Reviewed By: q10
Differential Revision: D101718260