
Harden repeat_arange benchmark with input validation and trace export (#5676)

Closed: q10 wants to merge 1 commit into pytorch:main from q10:export-D102041711



q10 commented Apr 22, 2026

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2618

Harden the repeat_arange benchmark with input validation, trace export, and documentation fixes

This diff backports hardening improvements from the tritonbench port into the original FBGEMM GPU repeat_arange benchmark. Specifically:

  1. Input validation (_validate_inputs): Checks batch_size and max_length against int32 limits to prevent silent overflow in the CUDA kernel's PackedTensorAccessor32 indexing
  2. Kineto trace export (_export_kineto_trace + --export-trace CLI flag): Enables profiler trace export for kernel-level performance analysis on all 3 CLI subcommands (bench-repeat-arange, bench-repeat-arange-quick, bench-repeat-arange-scaling)
  3. Documentation fix: Clarifies that the "reference" PyTorch implementation calls torch.ops.fbgemm.asynchronous_complete_cumsum and is therefore not pure PyTorch
  4. Docstring changelog: Records what was changed and the tritonbench port provenance
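
The overflow guard from item 1 can be sketched roughly as follows. This is an illustration based only on the checks listed in this description, not the code from the diff: the function name `validate_inputs`, the use of `ValueError`, and the message wording are assumptions.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit index can hold


def validate_inputs(batch_size: int, max_length: int) -> None:
    """Raise ValueError if the sizes could overflow 32-bit indexing.

    Hypothetical sketch; the real _validate_inputs may differ.
    """
    if batch_size < 1:
        raise ValueError(f"batch_size must be >= 1, got {batch_size}")
    if max_length < 1:
        raise ValueError(f"max_length must be >= 1, got {max_length}")
    if batch_size > INT32_MAX:
        raise ValueError(f"batch_size {batch_size} exceeds INT32_MAX")
    # Python ints are arbitrary precision, so this product cannot itself
    # wrap around the way a 32-bit multiply inside a kernel could.
    total = batch_size * max_length
    if total > INT32_MAX:
        raise ValueError(
            f"batch_size * max_length = {total} exceeds INT32_MAX ({INT32_MAX})"
        )
```

For example, `validate_inputs(65536, 32768)` is rejected because 65536 * 32768 = 2**31, which is one past `INT32_MAX`, while `validate_inputs(1024, 1024)` passes.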

Detailed Changes

repeat_arange_benchmark.py

  • Added INT32_MAX constant and _validate_inputs() function that checks batch_size >= 1, max_length >= 1, batch_size <= INT32_MAX, and batch_size * max_length <= INT32_MAX
  • Added _export_kineto_trace() helper that runs both implementations under torch.profiler.profile and exports Chrome-compatible JSON traces
  • Added --export-trace/--no-export-trace click option to all three CLI subcommands
  • Added NOTE to repeat_arange_pytorch docstring documenting the fbgemm.asynchronous_complete_cumsum dependency
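
The trace-export helper described above could look something like the sketch below. It is a minimal illustration under stated assumptions, not the diff's `_export_kineto_trace`: the names `export_kineto_trace` and `trace_path`, the `impls` mapping, and the `<name>_trace.json` file naming are all hypothetical. Only `torch.profiler.profile`, `ProfilerActivity`, and `export_chrome_trace` are real PyTorch APIs.

```python
from pathlib import Path
from typing import Callable, Dict, List


def trace_path(out_dir: str, name: str) -> Path:
    """Build the output file path for one implementation's trace (hypothetical naming)."""
    return Path(out_dir) / f"{name}_trace.json"


def export_kineto_trace(
    impls: Dict[str, Callable[[], object]], out_dir: str = "."
) -> List[Path]:
    """Profile each zero-argument callable and write a Chrome-compatible JSON trace."""
    # Imported lazily so the path helper above works without PyTorch installed.
    import torch
    from torch.profiler import ProfilerActivity, profile

    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)

    paths = []
    for name, fn in impls.items():
        with profile(activities=activities) as prof:
            fn()
            if torch.cuda.is_available():
                # Wait for queued kernels so they are captured in the trace.
                torch.cuda.synchronize()
        path = trace_path(out_dir, name)
        prof.export_chrome_trace(str(path))
        paths.append(path)
    return paths
```

The resulting JSON files can be opened in a Chrome-trace viewer such as `chrome://tracing` or Perfetto to inspect individual kernels.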

Reviewed By: henrylhtsang

Differential Revision: D102041711

meta-cla Bot added the "cla signed" label Apr 22, 2026

meta-codesync Bot commented Apr 22, 2026

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D102041711.

meta-codesync Bot changed the title from "Harden repeat_arange benchmark with input validation and trace export" to "Harden repeat_arange benchmark with input validation and trace export (#5676)" Apr 23, 2026
q10 added a commit to q10/FBGEMM that referenced this pull request Apr 23, 2026
q10 force-pushed the export-D102041711 branch from 6a90a73 to 5b07bc4 April 23, 2026 03:53
q10 added a commit to q10/FBGEMM that referenced this pull request Apr 23, 2026
q10 force-pushed the export-D102041711 branch from 5b07bc4 to b063179 April 23, 2026 03:53

meta-codesync Bot commented Apr 23, 2026

This pull request has been merged in 90596c2.
