
[inductor] post_grad batched linear fusion #112504

Closed
wants to merge 1 commit

Conversation

xuzhao9
Contributor

xuzhao9 commented Oct 31, 2023

Summary: Fuses independent nn.Linear() calls into a single aten.bmm, concatenating their inputs with aten.cat.
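To make the transformation concrete, here is a minimal PyTorch sketch of what the fused pattern computes. The shapes and layer count are hypothetical, for illustration only; they are not taken from the test module.

```
import torch

# Five independent Linear(64, 64) layers applied to five inputs.
linears = [torch.nn.Linear(64, 64) for _ in range(5)]
inputs = [torch.randn(4, 64) for _ in range(5)]

# Unfused: one aten::mm (plus a bias add) per nn.Linear call.
unfused = [lin(x) for lin, x in zip(linears, inputs)]

# Fused: batch the inputs and weights (stack is cat + reshape),
# then run a single aten::bmm over all five layers at once.
bx = torch.stack(inputs)                                      # (5, 4, 64)
bw = torch.stack([lin.weight.t() for lin in linears])         # (5, 64, 64)
bb = torch.stack([lin.bias for lin in linears]).unsqueeze(1)  # (5, 1, 64)
fused = torch.bmm(bx, bw) + bb                                # (5, 4, 64)

assert all(torch.allclose(u, f, atol=1e-5)
           for u, f in zip(unfused, fused.unbind(0)))
```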

Test Plan:
Without the BMM fusion:

```
buck2 run @mode/opt //pytorch/benchmark:run -- test_module -d cuda --module test_linear_module --torchdynamo inductor --torchinductor_cudagraph 0 --torchinductor_batch_fusion 0
```

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/test/torchbench_test_module_20231030_072536_6535183793.json.gz&bucket=pyper_traces

100 aten::mm operators

With the BMM fusion:

```
buck2 run @mode/opt //pytorch/benchmark:run -- test_module -d cuda --module test_linear_module --torchdynamo inductor --torchinductor_cudagraph 0 --torchinductor_batch_fusion 1
```

20 aten::bmm operators

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/test/torchbench_test_module_20231030_072157_6535183793.json.gz&bucket=pyper_traces

Passes the accuracy test:

```
$ buck2 run @mode/opt //pytorch/benchmark:run -- test_module -d cuda --module test_linear_module --torchdynamo inductor --torchinductor_cudagraph 0 --torchinductor_batch_fusion 1 --accuracy
Running eval method from test_module on cuda in dynamo inductor mode with input batch size 4 and precision tf32.
Accuracy:                            pass
```

It looks like the bmm and the input cat have been fused successfully.

Checking the Triton codegen:

```
TORCH_LOGS=+dynamo,+aot,+inductor buck2 run @mode/opt //pytorch/benchmark:run -- test_module -d cuda --module test_linear_module --torchdynamo inductor --torchinductor_cudagraph 0 --torchinductor_batch_fusion 1 --dump_triton 1
```

Triton code dump: https://www.internalfb.com/intern/everpaste/?handle=GHp1ABaqYuTjYCUBALiTWmteaI1PbsIXAAAB
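For anyone reproducing this outside the internal buck setup, something along these lines should exercise the same pass under torch.compile. Note that the config knob below is an assumption inferred from the --torchinductor_batch_fusion benchmark flag; it is not confirmed by this PR.

```
import torch
import torch._inductor.config as inductor_config

# Assumption: the pass is gated by a batch_fusion config flag,
# mirroring the benchmark's --torchinductor_batch_fusion switch.
inductor_config.batch_fusion = True

class ManyLinears(torch.nn.Module):
    def __init__(self, n=20, dim=64):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(dim, dim) for _ in range(n)]
        )

    def forward(self, x):
        # Independent linears over the same input: bmm-fusion candidates.
        return torch.cat([layer(x) for layer in self.layers], dim=-1)

model = torch.compile(ManyLinears().cuda())
out = model(torch.randn(4, 64, device="cuda"))
```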

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler


pytorch-bot bot commented Oct 31, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112504

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 155ffe8 with merge base ec124b9:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D46910718

xuzhao9 added a commit to xuzhao9/pytorch that referenced this pull request Oct 31, 2023

xuzhao9 added a commit to xuzhao9/pytorch that referenced this pull request Nov 1, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

Advanced Debugging: check the merge workflow status here.

xuzhao9 added a commit to xuzhao9/pytorch that referenced this pull request Dec 1, 2023

Reviewed By: yanboliang, jackiexu1992

Differential Revision: D46910718

@pytorchmergebot
Collaborator

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

Details for Dev Infra team: raised by workflow job.


@xuzhao9
Contributor Author

xuzhao9 commented Dec 1, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@xuzhao9
Contributor Author

xuzhao9 commented Dec 1, 2023

@pytorchbot merge -f "Skip failed lintrunner job timeout. Need to land this to unblock another prod fix"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort, and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

Advanced Debugging: check the merge workflow status here.

@xuzhao9 xuzhao9 deleted the export-D46910718 branch December 1, 2023 19:36
huydhn added a commit to pytorch/test-infra that referenced this pull request Dec 5, 2023
Fixes #4741

This strengthens Dr. CI's flaky classification for the generic GHA `Process completed with exit code 1` failure by comparing the failure context, i.e. the last command executed, in addition to the failure itself; the error string alone carries no signal in this case.

The failure context has been gathered for a while and stored in Rockset under `job.torchci_classification.context`. Now it's time to start utilizing it. The context is a list of the last N commands executed, traced backward from where the failure occurred, for example:
```
[
  "+ python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --shard 1 5 --verbose",
  "+ [[ -z 5 ]]",
  "+ test_python_shard 1",
  "+ '[' -n '' ']'",
  "+ pip install --progress-bar off --no-use-pep517 --user git+https://github.com/pytorch/vision.git@893b4abdc0c9df36c241c58769810f69e35dab48",
  "+ pip_install --no-use-pep517 --user git+https://github.com/pytorch/vision.git@893b4abdc0c9df36c241c58769810f69e35dab48",
  "+ '[' -n '' ']'",
  "+ orig_preload=",
  "+ commit=893b4abdc0c9df36c241c58769810f69e35dab48",
  "++ cat .github/ci_commit_pins/vision.txt",
  "++ get_pinned_commit vision",
  "+ local commit",
]
```

This change extracts and compares the last command, i.e. `+ python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --shard 1 5 --verbose`, in addition to the job name and the failure string, as sketched below.
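A minimal sketch of the resulting matching rule, in Python for illustration (the real implementation lives in Dr. CI's codebase; the record and field names here are hypothetical):

```
from typing import Optional

def last_command(context: list[str]) -> Optional[str]:
    # The context is ordered backward from the failure, so the first
    # entry is the last command that ran before the job failed.
    return context[0] if context else None

def matches_known_flaky(job: dict, flaky: dict) -> bool:
    # Hypothetical records with keys: name, failure_line, context.
    return (
        job["name"] == flaky["name"]
        and job["failure_line"] == flaky["failure_line"]
        # New: also require the last executed command to match, so a
        # generic "Process completed with exit code 1" only counts as
        # flaky when it failed at the same step as the known flake.
        and last_command(job["context"]) == last_command(flaky["context"])
    )
```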

### Testing

Try this out on pytorch/pytorch#112504, which has failures:

```
curl --request POST \
--url "http://localhost:3000/api/drci/drci?prNumber=112504" \
--header "Authorization: TOKEN" \
--data 'repo=pytorch'
```
hyperfraise pushed a commit to hyperfraise/pytorch that referenced this pull request Dec 21, 2023

Pull Request resolved: pytorch#112504
Approved by: https://github.com/yanboliang