
[AOTI] set alignment for aot constant #124272

Closed
wants to merge 2 commits

Conversation

chunyuan-w
Collaborator

@chunyuan-w chunyuan-w commented Apr 17, 2024

Stack from ghstack (oldest at bottom):

On GPU, the constant blob is copied to aligned memory ([RAII_cudaMalloc](https://github.com/pytorch/pytorch/blob/d0211e207c78fafac2edaf2e14954f668e898b4a/torch/csrc/inductor/aoti_runtime/model.h#L46), [64-byte alignment](https://github.com/pytorch/pytorch/blob/d0211e207c78fafac2edaf2e14954f668e898b4a/torch/csrc/inductor/aoti_runtime/model.h#L324)), while CPU has no such copy step for the constant blob. This may result in sub-optimal performance when we want to use the constant blob buffer directly in the computation (for example, when these constants are the weight tensors of a oneDNN primitive).

We set the alignment on `constant.o` directly so that there is no need to copy the data to aligned memory on CPU (note that when using `--rename-section`, the original section name must be specified for `--set-section-alignment`).
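For illustration, here is a minimal sketch of how the two flags fit together; the section names, the object-file name, and the subprocess wrapper are assumptions for this example rather than the exact command Inductor generates:

```
import subprocess

# consts_o: hypothetical object file that holds the serialized constant blob.
consts_o = "constant.o"

cmd = [
    "objcopy",
    # Rename the (assumed) original ".data" section that contains the constants.
    "--rename-section", ".data=.lrodata,alloc,load,readonly,data,contents",
    # The alignment must be set on the ORIGINAL section name, not the renamed one.
    "--set-section-alignment", ".data=64",  # 64-byte alignment, matching the GPU path
    consts_o,
]
# Requires an objcopy new enough to know --set-section-alignment (see the
# comments about older objcopy further down in this thread).
subprocess.run(cmd, check=True)
```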

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot bot commented Apr 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124272

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit d4439de with merge base d0211e2:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

chunyuan-w added a commit that referenced this pull request Apr 17, 2024
ghstack-source-id: 51c6f943d72289906d0ece2085a38556be262d23
Pull Request resolved: #124272
@chunyuan-w chunyuan-w marked this pull request as draft April 17, 2024 08:26
@chunyuan-w chunyuan-w added the topic: not user facing topic category label Apr 17, 2024
chunyuan-w added a commit that referenced this pull request Apr 18, 2024
ghstack-source-id: f62a7a4690367880c6e5f4b0ea8dc1008058f51f
Pull Request resolved: #124272
@chunyuan-w chunyuan-w marked this pull request as ready for review April 18, 2024 01:37
@chunyuan-w chunyuan-w requested a review from jgong5 April 18, 2024 01:37
@desertfire
Contributor

Somewhat relevant to #124034, but there we need to align each tensor.

@jgong5 jgong5 requested a review from desertfire April 19, 2024 04:15
@chunyuan-w
Collaborator Author

@desertfire could you help review this PR?

@chunyuan-w chunyuan-w added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 25, 2024
@chunyuan-w
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

alat-rights pushed a commit to alat-rights/pytorch that referenced this pull request Apr 26, 2024
Pull Request resolved: pytorch#124272
Approved by: https://github.com/jgong5, https://github.com/desertfire
@malfet
Contributor

malfet commented Apr 26, 2024

This causes failures with older objcopy:

 objcopy: unrecognized option '--set-section-alignment'
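As a side note, here is a minimal sketch of one way to guard against this, by probing whether the local objcopy advertises the flag at all; the helper is purely illustrative and not necessarily the fix that was applied in the follow-up linked below:

```
import shutil
import subprocess

def objcopy_supports_set_section_alignment() -> bool:
    """Heuristic probe: return True if the local objcopy lists
    --set-section-alignment in its --help output."""
    objcopy = shutil.which("objcopy")
    if objcopy is None:
        return False
    result = subprocess.run([objcopy, "--help"], capture_output=True, text=True)
    return "--set-section-alignment" in result.stdout
```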

andoorve pushed a commit to andoorve/pytorch that referenced this pull request May 1, 2024
@jerryzh168
Contributor

jerryzh168 commented May 2, 2024

This causes failures with older objcopy:

 objcopy: unrecognized option '--set-section-alignment'

I saw this as well. Is this fixed?

Oh, looks like this is the fix: https://github.com/pytorch/torchchat/pull/497. Thanks!

pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
chunyuan-w added a commit that referenced this pull request May 31, 2024
@github-actions github-actions bot deleted the gh/chunyuan-w/11/head branch June 1, 2024 01:58
pytorchmergebot pushed a commit that referenced this pull request Jun 8, 2024
#124272 set the alignment for the `consts_o`, but if the `data_size` of a tensor in the `consts_o` is not divisible by the alignment, the following tensors are no longer aligned, resulting in poor performance on CPU.
We align the `data_size` as well in this PR and pad the serialized bytes. Since the `size` of the tensor, rather than the `data_size`, is used when creating a tensor from the serialized bytes ([link](https://github.com/pytorch/pytorch/blob/f4d7cdc5e63c786b1f6588eafa53bbc6d33c3826/torch/csrc/inductor/aoti_runtime/model.h#L236-L259)), there is no correctness issue. `data_size` is only used to record the [bytes_read](https://github.com/pytorch/pytorch/blob/f4d7cdc5e63c786b1f6588eafa53bbc6d33c3826/torch/csrc/inductor/aoti_runtime/model.h#L217).

This PR improves CPU performance for 4 models in HF, 7 models in TIMM and 1 model in Torchbench.

For the unit test, I add a bias whose original `data_size` is not divisible by the alignment to check correctness:
```
constants_info_[0].dtype = static_cast<int32_t>(at::kFloat);
constants_info_[0].data_size = 64; // was 40 before this PR
constants_info_[0].shape = {10};

constants_info_[1].dtype = static_cast<int32_t>(at::kFloat);
......
```

Pull Request resolved: #127610
Approved by: https://github.com/jgong5, https://github.com/desertfire
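To make the padding rule concrete, here is a minimal sketch of rounding a constant's serialized byte count up to the alignment (the 64-byte value and the helper name are assumptions for this example):

```
ALIGN_BYTES = 64  # assumed alignment for this sketch

def padded_data_size(raw_nbytes: int, align: int = ALIGN_BYTES) -> int:
    # Round up to the next multiple of `align` so that the constant stored
    # after this one in consts_o starts at an aligned offset.
    return (raw_nbytes + align - 1) // align * align

# A float32 bias of shape {10} occupies 10 * 4 = 40 raw bytes; after padding,
# its data_size becomes 64, matching the unit-test values above.
assert padded_data_size(10 * 4) == 64
```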
TharinduRusira pushed a commit to TharinduRusira/pytorch that referenced this pull request Jun 14, 2024