[interformer] batch pointwise op + unbind stack pass in post grad #126959
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126959. Note: links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV — there is 1 currently active SEV; if your PR is affected, please view it below.
✅ You can merge normally! (13 unrelated failures) As of commit c0e0b38 with merge base 12d6446:
- FLAKY — the following jobs failed, but were likely due to flakiness present on trunk.
- BROKEN TRUNK — the following job failed, but was also failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
- UNSTABLE — the following jobs failed, were likely due to flakiness present on trunk, and have been marked as unstable.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D57595173

Force-pushed from 148f87e to 67a8133.
[interformer] batch pointwise op + unbind stack pass in post grad (torch#126959)

Summary: Tested on H100 with a single GPU; batch size is set to 64.

Test Plan:

local script:
```
buck2 run mode/opt scripts/jackiexu0313/pt2:uniarch_perf_benchmark -- single-module-benchmark --provider interformer --enable_pt2 True --batch_size 64
```

baseline: P1370993922

| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 120.84 ms    |
| Model size         | 5.93 GB      |
| Flops/example      | 62.22 GFLOPs |
| TFLOPS             | 32.95        |
| MFU                | 4.12%        |
| Activation/example | 128.17 MB    |

proposal: P1371676068

config:
```
torch._inductor.config.pre_grad_fusion_options = {}
torch._inductor.config.post_grad_fusion_options = {
    "batch_aten_mul": {"min_fuse_set_size": 50},
    "batch_aten_sigmoid": {"min_fuse_set_size": 50},
    "batch_aten_relu": {"min_fuse_set_size": 50},
    "batch_linear_post_grad": {"min_fuse_set_size": 50},
    "unbind_stack_aten_pass": {},
}
```

| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 117.30 ms    |
| Model size         | 5.93 GB      |
| Flops/example      | 62.65 GFLOPs |
| TFLOPS             | 34.18        |
| MFU                | 4.27%        |
| Activation/example | 163.12 MB    |

Differential Revision: D57595173
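The throughput rows are internally consistent, and can be checked with a few lines of arithmetic. This sketch assumes batch size 64 (from the test plan) and that "Flops/example" is in FLOPs; the ~800 TFLOPS peak implied by the MFU column is an inference, not stated anywhere above.

```python
# Recompute achieved TFLOPS from per-example FLOPs and step latency.
batch_size = 64  # from the test plan (--batch_size 64)

def tflops(flops_per_example, latency_ms):
    """Achieved TFLOPS = (FLOPs per step) / (step time in seconds) / 1e12."""
    return batch_size * flops_per_example / (latency_ms / 1e3) / 1e12

baseline = tflops(62.22e9, 120.84)
proposal = tflops(62.65e9, 117.30)
print(round(baseline, 2), round(proposal, 2))  # 32.95 34.18, matching the tables

# The MFU column implies an assumed peak of roughly 32.95 / 0.0412 ≈ 800 TFLOPS.
implied_peak = baseline / 0.0412
```

This also confirms the speedup is real compute-throughput gain (same FLOPs, lower latency), not a change in per-example work.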
Force-pushed from 67a8133 to 36959a5.
This pull request was exported from Phabricator. Differential Revision: D57595173

Force-pushed from 36959a5 to 8060625.
[interformer] batch pointwise op + unbind stack pass in post grad (torch#126959)

Summary: Tested on H100 with a single GPU; batch size is set to 64.

Test Plan:

local script:
```
buck2 run mode/opt scripts/jackiexu0313/pt2:uniarch_perf_benchmark -- single-module-benchmark --provider interformer --enable_pt2 True --batch_size 64
```

baseline: P1370993922
trace: https://our.intern.facebook.com/intern/perfdoctor/trace_view?filepath=tree/traces/test/interformer.May_22_10_58_35_trace.json.gz&bucket=pyper_traces

| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 120.84 ms    |
| Model size         | 5.93 GB      |
| Flops/example      | 62.22 GFLOPs |
| TFLOPS             | 32.95        |
| MFU                | 4.12%        |
| Activation/example | 128.17 MB    |

proposal: P1371676068
trace: https://our.intern.facebook.com/intern/perfdoctor/trace_view?filepath=tree/traces/test/interformer.May_22_22_08_09_trace.json.gz&bucket=pyper_traces

config:
```
torch._inductor.config.pre_grad_fusion_options = {}
torch._inductor.config.post_grad_fusion_options = {
    "batch_aten_mul": {"min_fuse_set_size": 50},
    "batch_aten_sigmoid": {"min_fuse_set_size": 50},
    "batch_aten_relu": {"min_fuse_set_size": 50},
    "batch_linear_post_grad": {"min_fuse_set_size": 50},
    "unbind_stack_aten_pass": {},
}
```

```
Counter({'pattern_matcher_nodes': 4677, 'pattern_matcher_count': 3621, 'extern_calls': 1096, 'batch_linear_post_grad': 3, 'batch_aten_mul': 2})
```

| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 117.30 ms    |
| Model size         | 5.93 GB      |
| Flops/example      | 62.65 GFLOPs |
| TFLOPS             | 34.18        |
| MFU                | 4.27%        |
| Activation/example | 163.12 MB    |

### Experiments after removing split

1. Patch D57739439.
2. Detailed experiments can be found in P1376777898.

### Observations

- Batch fusion introduces stack/cat nodes, which are costly. It only helps when we batch "enough" ops; otherwise a regression can be seen (e.g., with the default fusion size the latency increases, P1370978074).
- The unbind+stack pass does not always take effect; it only fires when the graph has the right structure and order. For example, on the graph from test_group_batch_fusion, we can delete the stack nodes introduced by batch_relu and batch_sigmoid.
- Even removing the split nodes cannot remove the "batch linear" stack nodes, since addmm = mm + add, where the add leverages broadcasting.
- We also tested enabling broadcast on the add and then stacking the add ops to cancel the stack; it was not very helpful because 1) broadcasting is not free, and 2) batched linear always introduces two stack nodes (stack of inputs + stack of weights), which cannot be eliminated.

Reviewed By: jackiexu1992

Differential Revision: D57595173
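The observations above can be sketched outside the compiler. This is a toy NumPy stand-in, not Inductor's actual pass implementation: it shows that batching pointwise ops costs a stack node, that stack(unbind(t)) along the same dim is the identity a cleanup pass may delete, and that batching linears requires stacking both inputs and weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Batching N independent pointwise ops: one stack + one batched op + one unbind.
xs = [np.array([i, i + 1.0]) for i in range(4)]
unbatched = [sigmoid(x) for x in xs]        # the original N ops
stacked = np.stack(xs)                      # the stack node batch fusion adds
batched = list(sigmoid(stacked))            # single batched kernel, then unbind
assert all(np.allclose(a, b) for a, b in zip(unbatched, batched))

# Cleanup opportunity: stack(unbind(t)) over the same dim is the identity,
# so back-to-back batched ops let the pass delete the intermediate pair.
t = np.arange(6.0).reshape(3, 2)
assert np.array_equal(np.stack(list(t)), t)  # unbind (iteration) then stack

# Batched linear: fusing K ops x_i @ W_i.T needs TWO stacks (inputs and
# weights) feeding one batched matmul -- the pair the last observation
# says cannot be eliminated.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((2, 2)) for _ in range(3)]
xs2 = [rng.standard_normal((4, 2)) for _ in range(3)]
per_op = [x @ W.T for x, W in zip(xs2, Ws)]
bmm = np.einsum("kij,kmj->kim", np.stack(xs2), np.stack(Ws))
assert all(np.allclose(a, b) for a, b in zip(per_op, bmm))
```

The asserts confirm the batched forms are numerically identical to the unbatched ones; the win or loss is purely in how many of the extra stack/unbind nodes a later pass can cancel.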
This pull request was exported from Phabricator. Differential Revision: D57595173
Force-pushed from 8060625 to 812ee22.
This pull request was exported from Phabricator. Differential Revision: D57595173

Force-pushed from 812ee22 to cb45e61.
This pull request was exported from Phabricator. Differential Revision: D57595173

Force-pushed from cb45e61 to 323a974.
This pull request was exported from Phabricator. Differential Revision: D57595173
Force-pushed from 323a974 to 51f66a5.
This pull request was exported from Phabricator. Differential Revision: D57595173
Force-pushed from 51f66a5 to c0e0b38.
This pull request was exported from Phabricator. Differential Revision: D57595173
@pytorchbot merge -f 'Landed internally' (Initiating merge automatically since the Phabricator diff has merged; using force because this PR might not pass merge_rules.json but landed internally.)

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merged commit carries the same message as above, with:

Pull Request resolved: pytorch#126959
Approved by: https://github.com/jackiexu1992
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang