
Conversation

xuhdev (Collaborator) commented on Jun 22, 2019

Stack from ghstack:

Previously, when n is small and the dtype is Half, randperm on CUDA would
offload to the CPU using a Float tensor and then convert the result; this
commit changes the offload to use the Half type directly.

This commit basically swaps the following two blocks:

if (result.scalar_type() == at::ScalarType::Half)

and

if (n < 30000)

Differential Revision: D16153585
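
For orientation, here is a simplified sketch of the reordered control flow, written against public ATen calls; the helper name, the elided Thrust path, and the exact allocation calls are illustrative assumptions rather than the actual diff:

```cpp
#include <ATen/ATen.h>

// Illustrative sketch only; mirrors the reordered checks described above.
at::Tensor& randperm_out_cuda_sketch(at::Tensor& result, int64_t n) {
  result.resize_({n});

  // The small-input check now comes first, so a small Half request is
  // generated on the CPU directly in the requested dtype and copied over,
  // with no Float round-trip.
  if (n < 30000) {
    auto result_cpu = at::empty({n}, result.options().device(at::kCPU));
    at::randperm_out(result_cpu, n);
    return result.copy_(result_cpu);
  }

  // Large Half inputs: half support in thrust is spotty, so compute in a
  // Float tensor on the GPU and convert at the end.
  if (result.scalar_type() == at::ScalarType::Half) {
    auto result_float = at::empty({n}, result.options().dtype(at::kFloat));
    return result.copy_(at::randperm_out(result_float, n));
  }

  // Large non-Half inputs take the GPU (thrust-based) path, elided here.
  return at::randperm_out(result, n);
}
```

The actual change lives inside ATen's native CUDA implementation; the point of the sketch is only the ordering of the two checks and the dtype used for the CPU offload.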

pytorchbot added the module: cuda and module: operators labels on Jun 22, 2019
xuhdev requested a review from ngimel on June 22, 2019
xuhdev added 2 commits June 22, 2019 09:38
Swap detection order in randperm_out_cuda to avoid unnecessary conversion from float when the input is small.
xuhdev (Collaborator, Author) commented on Jun 23, 2019

@pytorchbot retest this please

xuhdev added 3 commits June 23, 2019 10:17
Swap detection order in randperm_out_cuda to avoid unnecessary conversion from float when the input is small.
soumith requested a review from gchanan on June 25, 2019
soumith added the triaged label on Jun 25, 2019
xuhdev (Collaborator, Author) commented on Jun 30, 2019

@pytorchbot retest this please

xuhdev added 2 commits June 30, 2019 01:20
Swap detection order in randperm_out_cuda to avoid unnecessary conversion from float when the input is small.
xuhdev (Collaborator, Author) commented on Jun 30, 2019

@pytorchbot retest this please

xuhdev added 2 commits July 1, 2019 15:33
Swap detection order in randperm_out_cuda to avoid unnecessary conversion from float when the input is small.
xuhdev requested reviews from li-roy and yf225 on July 2, 2019
xuhdev (Collaborator, Author) commented on Jul 8, 2019

@gchanan @ngimel @yf225 @li-roy Could you give a quick review? This is actually a pretty simple change, if you view the diff without whitespace changes: https://github.com/pytorch/pytorch/pull/22103/files?w=1

ngimel (Collaborator) commented on Jul 8, 2019

Can we have some performance benchmarks? I'm concerned about the int64_t to half conversion on the CPU that now happens for small tensors as a result of this PR:

r__data[i*r__stride_0] = static_cast<scalar_t>(i);

Usually those conversions are slow, slower than doing things in float on the CPU and then copying to half.
Also, initialTensorOptions() implicitly sets the scalar type of the tensor to float, so it works now, but the comment in the file says that it's not a stable API, and what if that scalar type ever changes? https://github.com/pytorch/pytorch/blob/4453a1ff887dec226355b375d4f1bfa1eb016728/aten/src/ATen/InitialTensorOptions.h
I'd prefer the scalar type to be explicitly set to kFloat.
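
As a reference for that last point, here is a minimal sketch of an explicitly Float intermediate; the helper name is hypothetical and this is not the code in the PR:

```cpp
#include <ATen/ATen.h>

// Hypothetical helper: allocate the CUDA intermediate with an explicit Float
// dtype rather than relying on initialTensorOptions() defaulting to float.
at::Tensor make_float_intermediate(int64_t n) {
  return at::empty(
      {n}, at::TensorOptions().device(at::kCUDA).dtype(at::kFloat));
}
```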

…sary conversion from float when the input is small."

Swap detection order in randperm_out_cuda to avoid unnecessary conversion from float when the input is small.

Previously, when n is small and dtype is not Half, randperm on cuda
would offload to CPU with a Float type, which has been changed to Half
type in this commit.

gh-metadata: pytorch pytorch 22103 gh/xuhdev/2/head
xuhdev (Collaborator, Author) commented on Jul 8, 2019

@ngimel The conversion from int64_t to float is checked in #22102: https://github.com/pytorch/pytorch/pull/22102/files#diff-37ce10604989fdc0a02a62d4949658b2

I have added an explicit dtype.

Is there a way to trigger the benchmark?

ngimel (Collaborator) commented on Jul 8, 2019

I don't think there are ready-made benchmarks you can trigger; you can use benchmarks similar to those in the original issue #7606, run before and after your PR.
Don't forget to run a few warmup iterations for the sizes you are going to benchmark, to make sure the allocator settles, and synchronize after CUDA benchmarks. Since the changes are for small tensors only, you should be fine: the int64_t -> half conversion should not slow you down much compared to the exposed h2d copy latencies.
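
A sketch of that measurement pattern (warmup first, then make sure queued CUDA work has finished before the clock stops). This is only an illustration against the ATen C++ API; it is not the script used for the numbers reported below:

```cpp
#include <ATen/ATen.h>
#include <chrono>
#include <cstdio>

// Time `repeat` calls to randperm for a given size/dtype/device, with a few
// warmup iterations and a device-to-host copy used as a synchronization point.
double time_randperm(int64_t n, at::ScalarType dtype, at::Device device, int repeat) {
  auto options = at::TensorOptions().device(device).dtype(dtype);
  at::Tensor out;
  for (int i = 0; i < 3; ++i) {
    out = at::randperm(n, options);  // warmup, lets the caching allocator settle
  }
  if (device.is_cuda()) out.cpu();  // drain pending work before timing starts

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < repeat; ++i) {
    out = at::randperm(n, options);
  }
  if (device.is_cuda()) out.cpu();  // kernels on a stream finish in order, so
                                    // waiting for the last one covers them all
  std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
  return elapsed.count();
}

int main() {
  if (!at::hasCUDA()) return 0;
  std::printf("randperm(10), half, cuda: %.3f s\n",
              time_randperm(10, at::kHalf, at::Device(at::kCUDA), 100000));
  return 0;
}
```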

xuhdev (Collaborator, Author) commented on Jul 9, 2019

Here's the perf. It looks like this patch significantly improves performance for half on CUDA. I turned off CPU turbo boost and always ran the benchmark three times as warmup before the run that is reported, so the results should be reliable.

before this patch:

randperm(10) 100000 times
cpu, half	2.8076633399905404
cpu, float	2.7852712649910245
cpu, double	2.7890169029997196
cuda, half	18.277449170011096
cuda, float	10.359675675004837
cuda, double	10.433411638994585
randperm(1000) 1000 times
cpu, half	0.09014522400684655
cpu, float	0.06891961999644991
cpu, double	0.06935312499990687
cuda, half	0.2106321010069223
cuda, float	0.14546358000370674
cuda, double	0.1470240909984568

after this patch:

randperm(10) 100000 times
cpu, half	2.798496307004825
cpu, float	2.8195796069921926
cpu, double	2.827226548004546
cuda, half	11.731637821998447
cuda, float	10.317510107008275
cuda, double	10.415388538996922
randperm(1000) 1000 times
cpu, half	0.09070123700075783
cpu, float	0.06949712400091812
cpu, double	0.06988387599994894
cuda, half	0.16773680900223553
cuda, float	0.14579828100977466
cuda, double	0.14784032000170555

xuhdev (Collaborator, Author) commented on Jul 9, 2019

I also realized that the if (result.scalar_type() == at::ScalarType::Half) block should never be reached, because if n >= 30000 and the type is Half, check_supported_max_int_with_precision would already have reported an error. Instead of removing this block completely, I left the code here for the sake of clarity (because half support in thrust is spotty, and we do not want a future change to be unaware of this).
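
For context on why that branch is unreachable: float16 has 10 mantissa bits, so it can only represent consecutive integers exactly up to 2^11 = 2048, far below the 30000 threshold, which is presumably what check_supported_max_int_with_precision enforces here. A tiny standalone check of that representability limit (using c10::Half; not code from this PR):

```cpp
#include <c10/util/Half.h>
#include <cstdio>

// float16 has 10 mantissa bits, so consecutive integers are exactly
// representable only up to 2^11 = 2048; beyond that, values collide.
int main() {
  c10::Half a(2048.0f);
  c10::Half b(2049.0f);  // rounds to the nearest representable value
  std::printf("%g %g\n", static_cast<float>(a), static_cast<float>(b));
  // Expected output: "2048 2048" -- a permutation of 0..n-1 with n >= 30000
  // could not even be stored losslessly in a Half tensor.
  return 0;
}
```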

zou3519 deleted the gh/xuhdev/2/head branch on July 10, 2019
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 10, 2019
Swap detection order in randperm_out_cuda to avoid unnecessary conversion from float when the input is small.

Summary: Pull Request resolved: pytorch/pytorch#22103

Test Plan: Imported from OSS

Differential Revision: D16153585

Pulled By: li-roy

fbshipit-source-id: 0801b91e7b352c8de8fdfbe929be85d69182b8da
facebook-github-bot (Contributor) commented

@li-roy merged this pull request in 32709af.

xuhdev added a commit to xuhdev/pytorch that referenced this pull request Jul 17, 2019
One important comment is missing from pytorch#22103 (not sure what happened); this commit adds it back.
facebook-github-bot pushed a commit that referenced this pull request Jul 17, 2019
Summary:
One important comment is missing from #22103 (not sure what happened); this commit adds it back.
Pull Request resolved: #22984

Differential Revision: D16347044

Pulled By: ezyang

fbshipit-source-id: 0903909a5fb6740b43195136f1a23c28cfb2a02f
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 18, 2019