
Conversation

@dmpots (Contributor) commented Feb 3, 2025

Summary:
This commit fixes a crash in the gemm template lowering caused by hitting an assert that checks whether a buffer was previously removed.

The assert triggers because the first gemm lowering uses a local accumulation buffer, which causes the original buffer name to be added to the `removed_buffers` set. The next gemm lowering then uses the global buffer for accumulation, but that buffer name is already in the `removed_buffers` set.

The fix is to add a unique suffix to the buffer name so that different gemm lowerings no longer trigger the assert.
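
To make the failure mode concrete, here is a minimal self-contained sketch. It is not the actual Inductor code: `removed_buffers`, `lower_gemm`, and the buffer names are simplified stand-ins for the codegen state described above.

```
import itertools

# Shared codegen state across lowerings; a stand-in for Inductor's real
# bookkeeping, not the actual class or attribute.
removed_buffers: set[str] = set()
_unique_id = itertools.count()


def lower_gemm(acc_buf_name: str, use_local_acc: bool) -> str:
    """Toy gemm lowering that tracks which buffers were removed."""
    if use_local_acc:
        # Accumulate into a kernel-local buffer; the original output buffer
        # is no longer written directly, so its name is recorded as removed.
        removed_buffers.add(acc_buf_name)
        return f"{acc_buf_name}_local"
    # Accumulate into the global buffer; it must not have been removed
    # by an earlier lowering (this models the assert that fired).
    assert acc_buf_name not in removed_buffers, (
        f"buffer {acc_buf_name} was already removed"
    )
    return acc_buf_name


# Before the fix: both lowerings used the same name, so the second asserted.
lower_gemm("acc_buf", use_local_acc=True)
# lower_gemm("acc_buf", use_local_acc=False)  # AssertionError

# After the fix: a unique suffix keeps names from different lowerings apart,
# so state left behind by one lowering cannot collide with the next.
lower_gemm(f"acc_buf_{next(_unique_id)}", use_local_acc=True)
lower_gemm(f"acc_buf_{next(_unique_id)}", use_local_acc=False)
```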

Differential Revision: D68814625

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @desertfire @chauhang @aakhundov

@pytorch-bot bot commented Feb 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146353

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 5 Pending, 1 Unrelated Failure

As of commit b212a54 with merge base bc01918:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D68814625

@dmpots (Contributor, Author) commented Feb 3, 2025

@pytorchbot label "topic: not user facing"

@pytorch-bot bot added the "topic: not user facing" label on Feb 3, 2025
dmpots added a commit to dmpots/pytorch that referenced this pull request Feb 3, 2025
Summary:

This commit fixes a crash in the gemm template lowering caused by hitting an [assert](https://github.com/pytorch/pytorch/blob/fd515e4f59bfa0ac9faa5185b7a02f3222c4cd08/torch/_inductor/codegen/common.py#L1181) that a buffer was previously removed.

The assert triggers because in the first gemm lowering we use a local accumulation buffer, which causes the original buffer name to be added to the `removed_buffers` set. Then in the next gemm lowering we use the global buffer for accumulation, but that buffer name is already in the `removed_buffers` set.

The fix is to add a unique suffix to the buffer name to avoid triggering the assert from different gemm lowerings.

Test Plan:
# Reduced test case
```
TRITON_LOCAL_BUILD=1 buck2 run 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cpu_select_algorithm_cpu  -- caffe2.test.inductor.test_cpu_select_algorithm.TestSelectAlgorithmCPU.test_local_and_global_accumulator_cpu_float32
```
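
For context, a rough sketch of the kind of module such a test exercises follows. This is hypothetical; the actual `test_local_and_global_accumulator` test added with this fix may construct its model and shapes differently.

```
import torch


class M(torch.nn.Module):
    # Two back-to-back linears, sized so that autotuned gemm template
    # lowerings can end up using a local accumulation buffer for one matmul
    # and the global output buffer for the other within the same graph.
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(64, 128, bias=False)
        self.linear2 = torch.nn.Linear(128, 64, bias=False)

    def forward(self, x):
        return self.linear2(self.linear1(x))


if __name__ == "__main__":
    x = torch.randn(32, 64)
    with torch.no_grad():
        # max-autotune on CPU is what routes the matmuls through the
        # cpp gemm template lowering in the first place.
        compiled = torch.compile(M(), options={"max_autotune": True})
        compiled(x)  # crashed on the removed-buffer assert before this fix
```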

# Original failure
```
$ manifold get aitemplate/tree/aotinductor_cpu/915857944_1.input.predictor.disagg.remote_other /tmp/915857944_1.input.predictor.disagg.remote_other

$ buck2 run @//mode/opt //deeplearning/aot_inductor/cpu:cli -- --local-model-path /tmp/915857944_1.input.predictor.disagg.remote_other --submodule remote_other --preset ads_second_stage_ranking_type_1
```

Differential Revision: D68814625
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D68814625

@hl475 (Contributor) commented Feb 4, 2025

@leslie-fang-intel and @frost-intel - could you please help take a look? We found another issue when trying autotune on Meta internal models, and David came up with this fix.

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D68814625

dmpots added a commit to dmpots/pytorch that referenced this pull request Feb 6, 2025
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D68814625

dmpots added a commit to dmpots/pytorch that referenced this pull request Feb 7, 2025
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D68814625

@pytorch-bot bot added the "ciflow/trunk" label (Trigger trunk jobs on your pull request) on Feb 7, 2025
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D68814625

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D68814625

@hl475 (Contributor) commented Feb 7, 2025

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral)

Details for Dev Infra team (raised by workflow job).

@hl475 (Contributor) commented Feb 8, 2025

@pytorchbot merge -f "the failed test is irrelevant with this PR"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
