[interformer] batch pointwise op + unbind stack pass in post grad #126959
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126959. Note: links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV — there is 1 currently active SEV; if your PR is affected, please view it below.
✅ You can merge normally! (13 unrelated failures) As of commit c0e0b38 with merge base 12d6446:
- FLAKY — the following jobs failed, but were likely due to flakiness present on trunk.
- BROKEN TRUNK — the following job failed, but was also failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
- UNSTABLE — the following jobs failed, were likely due to flakiness present on trunk, and have been marked as unstable.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D57595173

Force-pushed from 148f87e to 67a8133.
[interformer] batch pointwise op + unbind stack pass in post grad (torch#126959)

Summary: Tested on H100 with a single GPU; batch size is set to 64.

Test Plan:

local script:
```
buck2 run mode/opt scripts/jackiexu0313/pt2:uniarch_perf_benchmark -- single-module-benchmark --provider interformer --enable_pt2 True --batch_size 64
```

baseline: P1370993922

| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 120.84 ms    |
| Model size         | 5.93 GB      |
| Flops/example      | 62.22 GFLOPs |
| TFLOPS             | 32.95        |
| MFU                | 4.12%        |
| Activation/example | 128.17 MB    |

proposal: P1371676068

config:
```
torch._inductor.config.pre_grad_fusion_options = {}
torch._inductor.config.post_grad_fusion_options = {
    "batch_aten_mul": {"min_fuse_set_size": 50},
    "batch_aten_sigmoid": {"min_fuse_set_size": 50},
    "batch_aten_relu": {"min_fuse_set_size": 50},
    "batch_linear_post_grad": {"min_fuse_set_size": 50},
    "unbind_stack_aten_pass": {},
}
```

| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 117.30 ms    |
| Model size         | 5.93 GB      |
| Flops/example      | 62.65 GFLOPs |
| TFLOPS             | 34.18        |
| MFU                | 4.27%        |
| Activation/example | 163.12 MB    |

Differential Revision: D57595173
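The throughput rows are internally consistent, and can be checked with a few lines of arithmetic. This sketch assumes batch size 64 (from the test plan) and that "Flops/example" is in FLOPs; the ~800 TFLOPS peak implied by the MFU column is an inference, not stated anywhere above.

```python
# Recompute achieved TFLOPS from per-example FLOPs and step latency.
batch_size = 64  # from the test plan (--batch_size 64)

def tflops(flops_per_example, latency_ms):
    """Achieved TFLOPS = (FLOPs per step) / (step time in seconds) / 1e12."""
    return batch_size * flops_per_example / (latency_ms / 1e3) / 1e12

baseline = tflops(62.22e9, 120.84)
proposal = tflops(62.65e9, 117.30)
print(round(baseline, 2), round(proposal, 2))  # 32.95 34.18, matching the tables

# The MFU column implies an assumed peak of roughly 32.95 / 0.0412 ≈ 800 TFLOPS.
implied_peak = baseline / 0.0412
```

This also confirms the speedup is real compute-throughput gain (same FLOPs, lower latency), not a change in per-example work.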
Force-pushed from 67a8133 to 36959a5.
This pull request was exported from Phabricator. Differential Revision: D57595173

Force-pushed from 36959a5 to 8060625.
[interformer] batch pointwise op + unbind stack pass in post grad (torch#126959)

Summary: Tested on H100 with a single GPU; batch size is set to 64.

Test Plan:

local script:
```
buck2 run mode/opt scripts/jackiexu0313/pt2:uniarch_perf_benchmark -- single-module-benchmark --provider interformer --enable_pt2 True --batch_size 64
```

baseline: P1370993922
trace: https://our.intern.facebook.com/intern/perfdoctor/trace_view?filepath=tree/traces/test/interformer.May_22_10_58_35_trace.json.gz&bucket=pyper_traces

| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 120.84 ms    |
| Model size         | 5.93 GB      |
| Flops/example      | 62.22 GFLOPs |
| TFLOPS             | 32.95        |
| MFU                | 4.12%        |
| Activation/example | 128.17 MB    |

proposal: P1371676068
trace: https://our.intern.facebook.com/intern/perfdoctor/trace_view?filepath=tree/traces/test/interformer.May_22_22_08_09_trace.json.gz&bucket=pyper_traces

config:
```
torch._inductor.config.pre_grad_fusion_options = {}
torch._inductor.config.post_grad_fusion_options = {
    "batch_aten_mul": {"min_fuse_set_size": 50},
    "batch_aten_sigmoid": {"min_fuse_set_size": 50},
    "batch_aten_relu": {"min_fuse_set_size": 50},
    "batch_linear_post_grad": {"min_fuse_set_size": 50},
    "unbind_stack_aten_pass": {},
}
```

```
Counter({'pattern_matcher_nodes': 4677, 'pattern_matcher_count': 3621, 'extern_calls': 1096, 'batch_linear_post_grad': 3, 'batch_aten_mul': 2})
```

| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 117.30 ms    |
| Model size         | 5.93 GB      |
| Flops/example      | 62.65 GFLOPs |
| TFLOPS             | 34.18        |
| MFU                | 4.27%        |
| Activation/example | 163.12 MB    |

### Experiments after removing split

1. Patch D57739439.
2. Detailed experiments can be found in P1376777898.

### Observations

- Batch fusion introduces stack/cat nodes, which are costly. It only helps when we batch "enough" ops; otherwise a regression can be seen (e.g., with the default fusion size the latency increases, P1370978074).
- The unbind+stack pass does not always take effect; it only fires when the graph has the right structure and order. For example, on the graph from test_group_batch_fusion, we can delete the stack nodes introduced by batch_relu and batch_sigmoid.
- Even removing the split nodes cannot remove the "batch linear" stack nodes, since addmm = mm + add, where the add leverages broadcasting.
- We also tested enabling broadcast on the add and then stacking the add ops to cancel the stack; it was not very helpful because 1) broadcasting is not free, and 2) batched linear always introduces two stack nodes (stack of inputs + stack of weights), which cannot be eliminated.

Reviewed By: jackiexu1992

Differential Revision: D57595173
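The observations above can be sketched outside the compiler. This is a toy NumPy stand-in, not Inductor's actual pass implementation: it shows that batching pointwise ops costs a stack node, that stack(unbind(t)) along the same dim is the identity a cleanup pass may delete, and that batching linears requires stacking both inputs and weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Batching N independent pointwise ops: one stack + one batched op + one unbind.
xs = [np.array([i, i + 1.0]) for i in range(4)]
unbatched = [sigmoid(x) for x in xs]        # the original N ops
stacked = np.stack(xs)                      # the stack node batch fusion adds
batched = list(sigmoid(stacked))            # single batched kernel, then unbind
assert all(np.allclose(a, b) for a, b in zip(unbatched, batched))

# Cleanup opportunity: stack(unbind(t)) over the same dim is the identity,
# so back-to-back batched ops let the pass delete the intermediate pair.
t = np.arange(6.0).reshape(3, 2)
assert np.array_equal(np.stack(list(t)), t)  # unbind (iteration) then stack

# Batched linear: fusing K ops x_i @ W_i.T needs TWO stacks (inputs and
# weights) feeding one batched matmul -- the pair the last observation
# says cannot be eliminated.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((2, 2)) for _ in range(3)]
xs2 = [rng.standard_normal((4, 2)) for _ in range(3)]
per_op = [x @ W.T for x, W in zip(xs2, Ws)]
bmm = np.einsum("kij,kmj->kim", np.stack(xs2), np.stack(Ws))
assert all(np.allclose(a, b) for a, b in zip(per_op, bmm))
```

The asserts confirm the batched forms are numerically identical to the unbatched ones; the win or loss is purely in how many of the extra stack/unbind nodes a later pass can cancel.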
This pull request was exported from Phabricator. Differential Revision: D57595173
Force-pushed from 8060625 to 812ee22.
This pull request was exported from Phabricator. Differential Revision: D57595173

Force-pushed from 812ee22 to cb45e61.
This pull request was exported from Phabricator. Differential Revision: D57595173

Force-pushed from cb45e61 to 323a974.
This pull request was exported from Phabricator. Differential Revision: D57595173
Force-pushed from 323a974 to 51f66a5.
This pull request was exported from Phabricator. Differential Revision: D57595173
Force-pushed from 51f66a5 to c0e0b38.
This pull request was exported from Phabricator. Differential Revision: D57595173
@pytorchbot merge -f 'Landed internally' (Initiating merge automatically since the Phabricator diff has merged; using force because this PR might not pass merge_rules.json but landed internally.)

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merged commit carries the same message as above, with:

Pull Request resolved: pytorch#126959
Approved by: https://github.com/jackiexu1992
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang