Use explicit templates in gpu_kernel_with_scalars #40992

Closed

@malfet (Contributor) commented Jul 5, 2020

This trick should have no effect on performance, but it reduces the size of kernels that use the template by about 10%.
For example, the size of BinaryMulDivKernel.cu.o compiled by the CUDA 10.1 toolchain for sm_75 was 4.2 MB before the change and 3.8 MB after.
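
To illustrate the kind of change the title describes, here is a minimal host-only C++ sketch, not the actual PyTorch code. It binds one scalar argument of a binary operation first through a lambda and then through an explicit functor template. The names `apply_unary`, `BindFirst`, `with_scalar_lambda`, `with_scalar_functor` and `Mul` are hypothetical, and `int` stands in for the real scalar and tensor element types.

```cpp
#include <cassert>

// Hypothetical stand-in for a kernel launcher templated on the functor type.
template <typename F>
int apply_unary(F f, int x) {
  return f(x);
}

// Lambda-based version: the scalar `a` is bound by a closure.
// The closure type is unnamed and generated by the compiler.
template <typename BinaryOp>
int with_scalar_lambda(BinaryOp op, int a, int b) {
  auto bound = [=](int y) { return op(a, y); };
  return apply_unary(bound, b);
}

// Explicit-template version: the same binding expressed as a named
// functor template, so the type passed to apply_unary has a stable name.
template <typename BinaryOp>
struct BindFirst {
  BinaryOp op;
  int a;
  int operator()(int y) const { return op(a, y); }
};

template <typename BinaryOp>
int with_scalar_functor(BinaryOp op, int a, int b) {
  return apply_unary(BindFirst<BinaryOp>{op, a}, b);
}

// Hypothetical binary operation used to exercise both versions.
struct Mul {
  int operator()(int a, int b) const { return a * b; }
};
```

Both wrappers compute the same value for the same inputs; they differ only in the type that `apply_unary` is instantiated with, which is the property the size reduction is assumed to depend on.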

@zasdfgbnm (Collaborator) left a comment

Why is the binary size reduced?

@facebook-github-bot (Contributor) left a comment

@malfet is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@malfet (Contributor Author) commented Jul 6, 2020

@zasdfgbnm I'm not entirely sure, to tell the truth, but my guess is that too many lambdas confuse both the host and GPU compilers into emitting multiple identical instances of the same template.
For example, nm torch_cuda_generated_BinaryMulDivKernel.cu.o returns 2578 symbols before the change, but only 2325 after.
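
One mechanism consistent with this guess (a sketch only, not a confirmed diagnosis of the actual build) is that every lambda expression has its own unique closure type, even when two lambdas have identical bodies. A template instantiated with each of them therefore produces separate instantiations, whereas a named functor type always maps to a single instantiation. The names `launch`, `AddOne`, `lambdas_share_type` and `functors_share_type` below are hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for a templated kernel launcher.
template <typename F>
int launch(F f, int x) {
  return f(x);
}

// Named functor: every use refers to the same type.
struct AddOne {
  int operator()(int x) const { return x + 1; }
};

// Two textually identical lambdas still have two distinct closure types,
// so launch<...> is instantiated once per lambda.
inline bool lambdas_share_type() {
  auto l1 = [](int x) { return x + 1; };
  auto l2 = [](int x) { return x + 1; };
  (void)launch(l1, 1);  // instantiates launch with the type of l1
  (void)launch(l2, 1);  // instantiates launch again with the type of l2
  return std::is_same<decltype(l1), decltype(l2)>::value;
}

// Two objects of the named functor share one type,
// so launch<AddOne> is instantiated only once.
inline bool functors_share_type() {
  AddOne f1, f2;
  (void)launch(f1, 1);
  (void)launch(f2, 1);
  return std::is_same<decltype(f1), decltype(f2)>::value;
}
```

Whether the linker later folds such identical instantiations depends on the toolchain, so the symbol counts above remain the real evidence; the sketch only shows why duplicates can arise in the first place.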

@facebook-github-bot (Contributor)

@malfet merged this pull request in 87f9b55.

@malfet malfet deleted the malfet/CUDALoops-expilcit-templates branch July 7, 2020 00:24
csarofeen pushed a commit to csarofeen/pytorch that referenced this pull request Jul 7, 2020
Summary:
This trick should have no effect on performance, but it reduces the size of kernels that use the template by about 10%.
For example, the size of BinaryMulDivKernel.cu.o compiled by the CUDA 10.1 toolchain for sm_75 was 4.2 MB before the change and 3.8 MB after.

Pull Request resolved: pytorch#40992

Differential Revision: D22398733

Pulled By: malfet

fbshipit-source-id: 6576f4da00dc5fc2575b2313577f52c6571d5e6f
facebook-github-bot pushed a commit that referenced this pull request Jul 9, 2020
Summary:
Follow up after #40992
Use explicit templates instead of lambdas to reduce binary size by 100-200 KB per arch per CU without affecting performance, namely:
BinaryMulDivKernel.cu 3.8 MB -> 3.5 MB
CompareEQKernel.cu 1.8 MB -> 1.7 MB
BinaryAddSubKernel.cu 2.0 MB -> 1.8 MB
BinaryBitwiseOpsKernels.cu 2.6 MB -> 2.3 MB

Pull Request resolved: #41059

Differential Revision: D22458928

Pulled By: malfet

fbshipit-source-id: cca623bb6e769cfe372977b08463d98b1a02dd14
csarofeen added a commit to csarofeen/pytorch that referenced this pull request Aug 16, 2020