[AOTI] set alignment for aot constant #124272
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124272
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 unrelated failure) As of commit d4439de with merge base d0211e2. FLAKY: the following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
The GPU path copies the constant blob into aligned memory ([RAII_cudaMalloc](https://github.com/pytorch/pytorch/blob/d0211e207c78fafac2edaf2e14954f668e898b4a/torch/csrc/inductor/aoti_runtime/model.h#L46), [64-alignment](https://github.com/pytorch/pytorch/blob/d0211e207c78fafac2edaf2e14954f668e898b4a/torch/csrc/inductor/aoti_runtime/model.h#L324)), while the CPU path has no such copy for the constant blob. This can result in sub-optimal performance when we want to use the constant blob buffer directly in the computation (for example, when these constant blobs are the weight tensors for a oneDNN primitive). We set the alignment on `constant.o` directly so that there is no need to copy the data to aligned memory on CPU (when using `--rename-section`, the original section name must be specified for `--set-section-alignment`). cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang
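The objcopy flags involved can be sketched with a toy object file. This is an illustrative sketch, not the PR's build command: the file names, the example bytes, and the 64-byte alignment value are assumptions; `.data` is the section name objcopy assigns to raw binary input.

```shell
# Build a toy object from raw bytes, rename its section, and set its alignment.
# --set-section-alignment requires binutils >= 2.33 (older objcopy fails), and,
# per the PR description, the alignment must be given against the ORIGINAL
# section name when --rename-section is also used.
printf 'constant-blob-bytes' > consts.bin
objcopy -I binary -O elf64-x86-64 -B i386:x86-64 \
  --rename-section .data=.lrodata,alloc,load,readonly,data,contents \
  --set-section-alignment .data=64 \
  consts.bin consts.o
objdump -h consts.o
```

With the alignment applied, `objdump -h` reports the renamed section with alignment `2**6` (i.e. 64 bytes).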
Somewhat relevant to #124034, but there we need to align each tensor.
@desertfire could you help review this PR?
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#124272. Approved by: https://github.com/jgong5, https://github.com/desertfire
This causes a failure with older versions of objcopy:
I saw this as well; is this fixed? Oh, looks like this is the fix: https://github.com/pytorch/torchchat/pull/497. Thanks!
#124272 set the alignment on `consts_o`, but if any tensor's `data_size` in `consts_o` is not divisible by the alignment, the tensors that follow it are no longer aligned, resulting in poor performance on CPU. This PR aligns the `data_size` as well and pads the serialized bytes. Since the tensor's `size`, rather than its `data_size`, is used when creating a tensor from the serialized bytes ([link](https://github.com/pytorch/pytorch/blob/f4d7cdc5e63c786b1f6588eafa53bbc6d33c3826/torch/csrc/inductor/aoti_runtime/model.h#L236-L259)), there is no correctness issue; `data_size` is only used to record the [bytes_read](https://github.com/pytorch/pytorch/blob/f4d7cdc5e63c786b1f6588eafa53bbc6d33c3826/torch/csrc/inductor/aoti_runtime/model.h#L217).

This PR improves CPU performance for 4 models in HF, 7 models in TIMM, and 1 model in Torchbench.

For the unit test, I add a bias value whose original `data_size` is not divisible by the alignment to test the correctness:
```
constants_info_[0].dtype = static_cast<int32_t>(at::kFloat);
constants_info_[0].data_size = 64; // was 40 before this PR
constants_info_[0].shape = {10};

constants_info_[1].dtype = static_cast<int32_t>(at::kFloat);
......
```
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

Pull Request resolved: #127610. Approved by: https://github.com/jgong5, https://github.com/desertfire
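The padding rule behind the new `data_size` values is plain round-up arithmetic to the next multiple of the alignment. A minimal sketch (the 64-byte alignment matches the value used in these PRs; the function name is made up):

```shell
# Round a tensor's data_size up to the next multiple of the 64-byte alignment.
align_up() {
  echo $(( ($1 + 63) / 64 * 64 ))
}

align_up 40   # the 10-float bias from the unit test: prints 64
align_up 64   # already aligned: prints 64, unchanged
align_up 65   # prints 128
```

This is exactly why the unit test's bias changed from `data_size = 40` to `data_size = 64`: 40 bytes (10 floats) rounds up to the next 64-byte boundary.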