Skip to content

Refactor bounds_check_indices offset checks to condition-first (Phase 1) (#5682)#5682

Closed
gchalump wants to merge 2 commits into
pytorch:mainfrom
gchalump:export-D101718260
Closed

Refactor bounds_check_indices offset checks to condition-first (Phase 1) (#5682)#5682
gchalump wants to merge 2 commits into
pytorch:mainfrom
gchalump:export-D101718260

Conversation

@gchalump
Copy link
Copy Markdown
Contributor

@gchalump gchalump commented Apr 23, 2026

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2624

Phase 1 of the bounds_check_indices race-condition cleanup stack.

This is a pure refactor — no behavior change. The per-b_t offset validation in both v1 and v2 CUDA kernels is rewritten from mode-first to condition-first.

Before (mode-first):

if (FATAL) { asserts }
else if (WARNING) { if (bad) { warn + adjust } }
else if (IGNORE) { adjust unconditionally }

After (condition-first):

if (bad) {
  if (FATAL) { asserts }
  else { if (WARNING) { warn } adjust }
}

Why: the mode-first form had IGNORE call adjust_offset_kernel on every (b, t) pair, even when offsets were valid — wasted writes plus needless contention on the offsets buffer. Condition-first runs adjustment only when bounds are actually violated.

Stack:

  • (this) Phase 1 — Condition-first refactor
  • Phase 1.5 — Gate offset correction on lane 0 + shfl_sync broadcast
  • Phase 2 — Make bad offsets unconditionally fatal

Reviewed By: q10

Differential Revision: D101718260

@meta-cla meta-cla Bot added the cla signed label Apr 23, 2026
@meta-codesync
Copy link
Copy Markdown
Contributor

meta-codesync Bot commented Apr 23, 2026

@gchalump has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101718260.

Summary:
This diff bundles the two pieces needed to benchmark the
bounds_check_indices stack across architectures, so a single diff
number can be forwarded to collaborators (NVIDIA A100/H100/B200,
AMD MI300/MI350) for cross-arch perf testing before any of the
stack lands.

# Part 1: --bounds-check-version flag

Adds a `--bounds-check-version` CLI flag to the
`bounds-check-indices` subcommand of
`fbcode/deeplearning/fbgemm/fbgemm_gpu/bench/tbe/tbe_utils_benchmark.py`,
plumbed through to `torch.ops.fbgemm.bounds_check_indices` as the
`bounds_check_version` kwarg (which already exists in the C++ op
signature).

Without this, v2 was only reachable by either flipping the
`BOUNDS_CHECK_INDICES_V2` JK gate (requires JK perms; not
available to OSS/external runners) or manually editing the bench
file inline (fragile, doesn't survive `sl goto`).

Default `1` preserves existing bench behavior (no behavior change
for any current caller).

# Part 2: portable bench tooling

Three bash scripts in
`fbcode/ai_codesign/nonprod/gchalump/scripts/fbgemm/`:

1. **run_bounds_check_indices_bench.sh** — inner runner. Sweeps
   {modes} × {versions} × {trials} on the currently-checked-out
   commit. Captures `system.json` (GPU/driver/CUDA/host/checkout).
   Mirrors the arch dispatch (`-c h100`/`-c mi350`/`-c b200`) and
   `--gpu N` (sets both `CUDA_VISIBLE_DEVICES` and
   `HIP_VISIBLE_DEVICES`) conventions from supadchaya's existing
   TBE bench scripts. Auto-falls-back to v1-only on checkouts
   that predate Part 1.

2. **run_bounds_check_indices_on_stack.sh** — outer driver. Loops
   `sl goto` across {Baseline, Phase 0.5, 1, 1.5, 2, 3}, runs
   the inner per phase, restores original checkout via EXIT trap.
   Forwards all bench flags. Pre-pulls all diffs to avoid
   mid-sweep network calls. Auto-runs the analyzer at the end.

3. **analyze_bounds_check_results.sh** — log parser + reporter
   (pure bash). Emits `results.csv` and `summary.md` (median per
   cell with Δ T vs Baseline, per kernel version). Uploads both
   to pastry by default.

# Why folded into one diff

Originally Part 1 was the bottom of the bounds_check stack and
Part 2 was a parallel branch off master (D102682427, now
abandoned). Folded into one diff so collaborators only need to
remember one diff number to reproduce the full benchmark.

Differential Revision: D102675932
… 1) (pytorch#5682)

Summary:
X-link: facebookresearch/FBGEMM#2624


Phase 1 of the bounds_check_indices race-condition cleanup stack.

This is a pure refactor — no behavior change. The per-b_t offset validation in both v1 and v2 CUDA kernels is rewritten from mode-first to condition-first.

**Before (mode-first):**
```
if (FATAL) { asserts }
else if (WARNING) { if (bad) { warn + adjust } }
else if (IGNORE) { adjust unconditionally }
```

**After (condition-first):**
```
if (bad) {
  if (FATAL) { asserts }
  else { if (WARNING) { warn } adjust }
}
```

**Why**: the mode-first form had IGNORE call `adjust_offset_kernel` on every (b, t) pair, even when offsets were valid — wasted writes plus needless contention on the offsets buffer. Condition-first runs adjustment only when bounds are actually violated.

**Stack:**
- (this) Phase 1 — Condition-first refactor
- Phase 1.5 — Gate offset correction on lane 0 + shfl_sync broadcast
- Phase 2 — Make bad offsets unconditionally fatal

Reviewed By: q10

Differential Revision: D101718260
@meta-codesync meta-codesync Bot changed the title Refactor bounds_check_indices offset checks to condition-first (Phase 1) Refactor bounds_check_indices offset checks to condition-first (Phase 1) (#5682) Apr 28, 2026
@gchalump gchalump force-pushed the export-D101718260 branch from 56a9d90 to e53328b Compare April 28, 2026 23:24
@meta-codesync meta-codesync Bot closed this in b6f2db1 May 1, 2026
@meta-codesync
Copy link
Copy Markdown
Contributor

meta-codesync Bot commented May 1, 2026

This pull request has been merged in b6f2db1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant