
[interformer] batch pointwise op + unbind stack pass in post grad #126959

Closed
wants to merge 1 commit

Conversation

@mengluy0125 (Contributor) commented May 23, 2024

Summary: Tested on H100 with a single GPU; batch size is set to 64.

Test Plan:

# local script

```
buck2 run mode/opt scripts/jackiexu0313/pt2:uniarch_perf_benchmark -- single-module-benchmark --provider interformer --enable_pt2 True --batch_size 64
```

baseline: P1370993922

| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 120.84 ms    |
| Model size         | 5.93 G bytes |
| Flops/example      | 62.22 GB     |
| TFLOPS             | 32.95        |
| MFU                | 4.12%        |
| Activation/example | 128.17 MB    |

proposal: P1371676068

config
```
torch._inductor.config.pre_grad_fusion_options = {}
torch._inductor.config.post_grad_fusion_options = {
        "batch_aten_mul": {"min_fuse_set_size": 50},
        "batch_aten_sigmoid": {"min_fuse_set_size": 50},
        "batch_aten_relu": {"min_fuse_set_size": 50},
        "batch_linear_post_grad": {"min_fuse_set_size": 50},
        "unbind_stack_aten_pass": {},
}
```
| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 117.30 ms    |
| Model size         | 5.93 G bytes |
| Flops/example      | 62.65 GB     |
| TFLOPS             | 34.18        |
| MFU                | 4.27%        |
| Activation/example | 163.12 MB    |
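The batching idea behind these passes can be illustrated with a framework-free sketch (hypothetical, plain Python; the real passes rewrite ATen graph nodes): many independent pointwise ops are replaced by one op over stacked inputs, whose result is then unbound back into per-op outputs.

```python
import math

def sigmoid(v):
    # elementwise sigmoid over one vector
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

# Many small independent ops: one sigmoid call per vector (the unfused graph).
inputs = [[0.1 * i + j for j in range(4)] for i in range(50)]
unfused = [sigmoid(v) for v in inputs]

# Batched form: "stack" the inputs, run one pointwise op over the flat buffer,
# then "unbind" the result back into per-vector outputs.
stacked = [x for v in inputs for x in v]                    # stack
fused_flat = sigmoid(stacked)                               # one batched op
fused = [fused_flat[i * 4:(i + 1) * 4] for i in range(50)]  # unbind

assert fused == unfused
```

The unbind_stack pass then tries to cancel adjacent stack/unbind pairs so that the stacking overhead disappears whenever consumers already accept the batched layout.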

Differential Revision: D57595173

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot bot commented May 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126959

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (13 Unrelated Failures)

As of commit c0e0b38 with merge base 12d6446:

FLAKY - The following jobs failed, but likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed, but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following jobs failed, likely due to flakiness present on trunk, and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D57595173

mengluy0125 added a commit to mengluy0125/pytorch that referenced this pull request May 23, 2024
…torch#126959)

mengluy0125 added a commit to mengluy0125/pytorch that referenced this pull request May 23, 2024
…torch#126959)
@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 23, 2024
pytorch-bot bot pushed a commit that referenced this pull request May 28, 2024
…26959)

Summary:

Tested on H100 with a single GPU; batch size is set to 64.

Test Plan:
# local script

```
buck2 run mode/opt scripts/jackiexu0313/pt2:uniarch_perf_benchmark -- single-module-benchmark --provider interformer --enable_pt2 True --batch_size 64
```

baseline: P1370993922
trace: https://our.intern.facebook.com/intern/perfdoctor/trace_view?filepath=tree/traces/test/interformer.May_22_10_58_35_trace.json.gz&bucket=pyper_traces

| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 120.84 ms    |
| Model size         | 5.93 G bytes |
| Flops/example      | 62.22 GB     |
| TFLOPS             | 32.95        |
| MFU                | 4.12%        |
| Activation/example | 128.17 MB    |

proposal: P1371676068
trace:
https://our.intern.facebook.com/intern/perfdoctor/trace_view?filepath=tree/traces/test/interformer.May_22_22_08_09_trace.json.gz&bucket=pyper_traces
config
```
torch._inductor.config.pre_grad_fusion_options = {}
torch._inductor.config.post_grad_fusion_options = {
        "batch_aten_mul": {"min_fuse_set_size": 50},
        "batch_aten_sigmoid": {"min_fuse_set_size": 50},
        "batch_aten_relu": {"min_fuse_set_size": 50},
        "batch_linear_post_grad": {"min_fuse_set_size": 50},
        "unbind_stack_aten_pass": {},
}
```

Counter({'pattern_matcher_nodes': 4677, 'pattern_matcher_count': 3621, 'extern_calls': 1096, 'batch_linear_post_grad': 3, 'batch_aten_mul': 2})


| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 117.30 ms    |
| Model size         | 5.93 G bytes |
| Flops/example      | 62.65 GB     |
| TFLOPS             | 34.18        |
| MFU                | 4.27%        |
| Activation/example | 163.12 MB    |


### Experiments after removing splits

1. Patch D57739439
2. Detailed experiments can be found in P1376777898

### Observations:

- Batch fusion introduces stack/cat nodes, which are costly. It only helps when we batch "enough" ops; otherwise a regression can be seen (e.g., with the default fusion size the latency increases, P1370978074).

- The unbind-stack pass only takes effect when the graph has a favorable structure and order; e.g., in the example from test_group_batch_fusion, we can delete the stack nodes introduced by batch_relu and batch_sigmoid.

- Even removing the split nodes cannot remove the "batch linear" stack nodes, since addmm = mm + add, where the add leverages the broadcasting rule.

- We also tested enabling broadcast for the add and then stacking the add ops to cancel the stack; it is not very helpful either: 1) broadcasting is not free, and 2) linear always introduces two stack nodes (stack input + stack other), which cannot be eliminated.
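The addmm point above can be checked with a framework-free sketch (hypothetical, plain Python): the 1-D bias is broadcast across every row of the matmul product, so the fused addmm is not a plain elementwise add that a stack-cancellation pass could fold away.

```python
def mm(a, b):
    # naive matrix multiply: (n x k) @ (k x m)
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def addmm(bias, a, b):
    # addmm(bias, a, b) == mm(a, b) + bias, with the 1-D bias
    # broadcast across every row of the product
    prod = mm(a, b)
    return [[prod[i][j] + bias[j] for j in range(len(prod[0]))]
            for i in range(len(prod))]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
bias = [0.5, -0.5]

out = addmm(bias, a, b)
expected = [[row[j] + bias[j] for j in range(2)] for row in mm(a, b)]
assert out == expected
```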

Reviewed By: jackiexu1992

Differential Revision: D57595173
mengluy0125 added a commit to mengluy0125/pytorch that referenced this pull request May 28, 2024
…torch#126959)

mengluy0125 added a commit to mengluy0125/pytorch that referenced this pull request May 29, 2024
…torch#126959)

mengluy0125 added a commit to mengluy0125/pytorch that referenced this pull request May 29, 2024
…torch#126959)

mengluy0125 added a commit to mengluy0125/pytorch that referenced this pull request May 29, 2024
…torch#126959)

@facebook-github-bot (Contributor)

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

petrex pushed a commit to petrex/pytorch that referenced this pull request Jun 5, 2024
…torch#126959)

Pull Request resolved: pytorch#126959
Approved by: https://github.com/jackiexu1992