Remove unnecessary CUDA sync for better perf (#17594)
Conversation
Pull Request resolved: #17315 Right now we always run a CUDA sync before exiting cudabackend.execution(). However, that sync is only needed when copying data from GPU to CPU; operations issued on the same stream are implicitly ordered and do not need an explicit sync. We also introduced a new cuda_delegate_handle to remove CUDA-specific information from aoti_delegate_handle for a cleaner hierarchy. ghstack-source-id: 343032381 @exported-using-ghexport Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)
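A minimal sketch of the idea behind this change (hypothetical names; `scale` and `execute` stand in for the delegate's stream-ordered work and are not the actual ExecuTorch backend code): operations enqueued on the same CUDA stream execute in issue order on the device, so an explicit synchronize is only required at the point where the host must read data copied back from the GPU.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the delegate's GPU work.
__global__ void scale(float* data, float factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

void execute(cudaStream_t stream, float* d_buf, float* h_out, int n) {
  // Both launches go onto the same stream, so they run in issue order
  // on the device -- no explicit sync is needed between them.
  scale<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, 2.0f, n);
  scale<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, 0.5f, n);

  // A sync is only needed once the host must observe device results,
  // i.e. after the device-to-host copy.
  cudaMemcpyAsync(h_out, d_buf, n * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // required before reading h_out on CPU
}
```

With this pattern, an unconditional sync at the end of every execution is wasted work whenever no device-to-host copy follows, which is what the PR removes.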
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17594
Note: Links to docs will display an error until the docs builds have completed.
❌ 5 New Failures, 1 Unrelated Failure as of commit abc1514 with merge base 9591a67.
NEW FAILURES: the following jobs have failed.
BROKEN TRUNK: the following job failed but was also failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #17315 by @Gasoonjia
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/gasoonjia/116/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/gasoonjia/116/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/gasoonjia/116/orig
Differential Revision: D92193164
@diff-train-skip-merge