Harden repeat_arange benchmark with input validation and trace export (#5676) by q10 · Pull Request #5676 · pytorch/FBGEMM

q10 · 2026-04-22T22:27:17Z

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2618

Harden the repeat_arange benchmark with input validation, trace export, and documentation fixes.

This diff backports hardening improvements from the tritonbench port into the original FBGEMM GPU repeat_arange benchmark. Specifically:

1. Input validation (_validate_inputs): Checks batch_size and max_length against int32 limits to prevent silent overflow in the CUDA kernel's PackedTensorAccessor32 indexing
2. Kineto trace export (_export_kineto_trace + --export-trace CLI flag): Enables profiler trace export for kernel-level performance analysis on all 3 CLI subcommands (bench-repeat-arange, bench-repeat-arange-quick, bench-repeat-arange-scaling)
3. Documentation fix: Clarifies that the "reference" PyTorch implementation actually calls torch.ops.fbgemm.asynchronous_complete_cumsum (not pure PyTorch)
4. Docstring changelog: Records what was changed and the tritonbench port provenance
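For readers unfamiliar with profiler trace export, the following is a minimal sketch of the general pattern, not the PR's actual _export_kineto_trace. It assumes one trace file per implementation; the helper names (trace_path, export_traces), the file naming scheme, and the decision to synchronize inside the profiled region are all illustrative assumptions. Only torch.profiler.profile, ProfilerActivity, and export_chrome_trace are real PyTorch APIs here.

```python
from pathlib import Path
from typing import Callable, Dict


def trace_path(out_dir: str, label: str) -> Path:
    """Build the output path for one trace (hypothetical naming scheme)."""
    return Path(out_dir) / f"repeat_arange_{label}.trace.json"


def export_traces(
    impls: Dict[str, Callable[[], object]], out_dir: str
) -> Dict[str, Path]:
    """Profile each callable and write a Chrome-compatible JSON trace for it.

    `impls` maps a label (for example "fbgemm" or "pytorch") to a
    zero-argument callable that runs that implementation once.
    """
    # Imported lazily so this module can be loaded without PyTorch installed.
    import torch
    from torch.profiler import ProfilerActivity, profile

    # Always record CPU activity; add CUDA only when a GPU is present.
    activities = [ProfilerActivity.CPU]
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        activities.append(ProfilerActivity.CUDA)

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written: Dict[str, Path] = {}
    for label, fn in impls.items():
        with profile(activities=activities) as prof:
            fn()
            if use_cuda:
                # Wait for queued kernels so they land inside the trace.
                torch.cuda.synchronize()
        path = trace_path(out_dir, label)
        prof.export_chrome_trace(str(path))
        written[label] = path
    return written
```

A caller would pass something like `export_traces({"fbgemm": run_fbgemm, "pytorch": run_reference}, "traces")`, where both callables are placeholders for the two benchmarked implementations. The resulting JSON files can be opened in a Chrome-trace-compatible viewer.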

Detailed Changes

repeat_arange_benchmark.py

- Added INT32_MAX constant and _validate_inputs() function that checks batch_size >= 1, max_length >= 1, batch_size <= INT32_MAX, and batch_size * max_length <= INT32_MAX
- Added _export_kineto_trace() helper that runs both implementations under torch.profiler.profile and exports Chrome-compatible JSON traces
- Added --export-trace/--no-export-trace click option to all three CLI subcommands
- Added NOTE to repeat_arange_pytorch docstring documenting the fbgemm.asynchronous_complete_cumsum dependency
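The four validation checks listed above can be sketched as follows. This is a hypothetical stand-in rather than the PR's _validate_inputs: the function name validate_inputs_sketch, the use of ValueError, and the message wording are assumptions. The wrap_int32 helper is added purely to illustrate what a 32-bit index would hold if an oversized product went unchecked.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit index can hold


def wrap_int32(value: int) -> int:
    """Return `value` as a signed 32-bit integer would store it (two's complement)."""
    return (value + 2**31) % 2**32 - 2**31


def validate_inputs_sketch(batch_size: int, max_length: int) -> None:
    """Reject shapes whose element count cannot be indexed with int32.

    The conditions follow the PR description; the exception type and
    messages are assumptions made for this sketch.
    """
    if batch_size < 1:
        raise ValueError(f"batch_size must be >= 1, got {batch_size}")
    if max_length < 1:
        raise ValueError(f"max_length must be >= 1, got {max_length}")
    if batch_size > INT32_MAX:
        raise ValueError(f"batch_size {batch_size} exceeds INT32_MAX")
    # Python integers do not overflow, so the product is exact here.
    total = batch_size * max_length
    if total > INT32_MAX:
        raise ValueError(
            f"batch_size * max_length = {total} exceeds INT32_MAX; "
            f"a 32-bit index would wrap to {wrap_int32(total)}"
        )
```

For example, batch_size=65536 with max_length=32768 gives a product of 2147483648, one past INT32_MAX, which a signed 32-bit index would store as -2147483648. Rejecting such shapes up front turns a silent indexing error into an explicit failure.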

Reviewed By: henrylhtsang

Differential Revision: D102041711

meta-codesync · 2026-04-22T22:27:25Z

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D102041711.

meta-codesync · 2026-04-23T23:15:00Z

This pull request has been merged in 90596c2.

meta-cla Bot added the cla signed label Apr 22, 2026

meta-codesync Bot added fb-exported meta-exported labels Apr 22, 2026

meta-codesync Bot changed the title ~~Harden repeat_arange benchmark with input validation and trace export~~ Harden repeat_arange benchmark with input validation and trace export (#5676) Apr 23, 2026

q10 force-pushed the export-D102041711 branch from 6a90a73 to 5b07bc4 Compare April 23, 2026 03:53

q10 force-pushed the export-D102041711 branch from 5b07bc4 to b063179 Compare April 23, 2026 03:53

q10 force-pushed the export-D102041711 branch from b063179 to 4cd26e0 Compare April 23, 2026 03:56

meta-codesync Bot closed this in 90596c2 Apr 23, 2026

facebook-github-tools Bot added the Merged label Apr 23, 2026

gchalump added category:improvement contributor:Meta feature:benchmarks feature:better-engineering labels May 14, 2026

q10 wants to merge 1 commit into pytorch:main from q10:export-D102041711
