
Conversation

xuhdev (Collaborator) commented on Jun 22, 2019

Stack from ghstack:

Previously, when n is small and the dtype is Half, randperm on CUDA would
offload to the CPU using a Float tensor and then convert the result; this
commit changes the offload to use the Half type directly.

This commit basically swaps the following two blocks:

if (result.scalar_type() == at::ScalarType::Half)

and

if (n < 30000)

Differential Revision: D16153585
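
For orientation, here is a simplified sketch of the reordered control flow, written against public ATen calls; the helper name, the elided Thrust path, and the exact allocation calls are illustrative assumptions rather than the actual diff:

```cpp
#include <ATen/ATen.h>

// Illustrative sketch only; mirrors the reordered checks described above.
at::Tensor& randperm_out_cuda_sketch(at::Tensor& result, int64_t n) {
  result.resize_({n});

  // The small-input check now comes first, so a small Half request is
  // generated on the CPU directly in the requested dtype and copied over,
  // with no Float round-trip.
  if (n < 30000) {
    auto result_cpu = at::empty({n}, result.options().device(at::kCPU));
    at::randperm_out(result_cpu, n);
    return result.copy_(result_cpu);
  }

  // Large Half inputs: half support in thrust is spotty, so compute in a
  // Float tensor on the GPU and convert at the end.
  if (result.scalar_type() == at::ScalarType::Half) {
    auto result_float = at::empty({n}, result.options().dtype(at::kFloat));
    return result.copy_(at::randperm_out(result_float, n));
  }

  // Large non-Half inputs take the GPU (thrust-based) path, elided here.
  return at::randperm_out(result, n);
}
```

The actual change lives inside ATen's native CUDA implementation; the point of the sketch is only the ordering of the two checks and the dtype used for the CPU offload.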

pytorchbot added the module: cuda and module: operators labels on Jun 22, 2019
xuhdev requested a review from ngimel on June 22, 2019
xuhdev added 2 commits June 22, 2019 09:38
Swap detection order in randperm_out_cuda to avoid unnecessary conversion from float when the input is small.
xuhdev (Collaborator, Author) commented on Jun 23, 2019

@pytorchbot retest this please

xuhdev added 3 commits June 23, 2019 10:17
Swap detection order in randperm_out_cuda to avoid unnecessary conversion from float when the input is small.
soumith requested a review from gchanan on June 25, 2019
soumith added the triaged label on Jun 25, 2019
xuhdev (Collaborator, Author) commented on Jun 30, 2019

@pytorchbot retest this please

xuhdev added 2 commits June 30, 2019 01:20
Swap detection order in randperm_out_cuda to avoid unnecessary conversion from float when the input is small.
xuhdev (Collaborator, Author) commented on Jun 30, 2019

@pytorchbot retest this please

xuhdev added 2 commits July 1, 2019 15:33
Swap detection order in randperm_out_cuda to avoid unnecessary conversion from float when the input is small.
xuhdev requested reviews from li-roy and yf225 on July 2, 2019
xuhdev (Collaborator, Author) commented on Jul 8, 2019

@gchanan @ngimel @yf225 @li-roy Could you give a quick review? This is actually a pretty simple change, if you view the diff without whitespace changes: https://github.com/pytorch/pytorch/pull/22103/files?w=1

ngimel (Collaborator) commented on Jul 8, 2019

Can we have some performance benchmarks? I'm concerned about the int64_t to half conversion on the CPU that now happens for small tensors as a result of this PR:

r__data[i*r__stride_0] = static_cast<scalar_t>(i);

Usually those conversions are slow, slower than doing things in float on the CPU and then copying to half.
Also, initialTensorOptions() implicitly sets the scalar type of the tensor to float, so it works now, but the comment in the file says that it's not a stable API, and what if that scalar type ever changes? https://github.com/pytorch/pytorch/blob/4453a1ff887dec226355b375d4f1bfa1eb016728/aten/src/ATen/InitialTensorOptions.h
I'd prefer the scalar type to be explicitly set to kFloat.
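
As a reference for that last point, here is a minimal sketch of an explicitly Float intermediate; the helper name is hypothetical and this is not the code in the PR:

```cpp
#include <ATen/ATen.h>

// Hypothetical helper: allocate the CUDA intermediate with an explicit Float
// dtype rather than relying on initialTensorOptions() defaulting to float.
at::Tensor make_float_intermediate(int64_t n) {
  return at::empty(
      {n}, at::TensorOptions().device(at::kCUDA).dtype(at::kFloat));
}
```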

…sary conversion from float when the input is small."

Swap detection order in randperm_out_cuda to avoid unnecessary conversion from float when the input is small.

Previously, when n is small and dtype is not Half, randperm on cuda
would offload to CPU with a Float type, which has been changed to Half
type in this commit.

gh-metadata: pytorch pytorch 22103 gh/xuhdev/2/head
xuhdev (Collaborator, Author) commented on Jul 8, 2019

@ngimel The conversion from int64_t to float is checked in #22102: https://github.com/pytorch/pytorch/pull/22102/files#diff-37ce10604989fdc0a02a62d4949658b2

I have added an explicit dtype.

Is there a way to trigger the benchmark?

ngimel (Collaborator) commented on Jul 8, 2019

I don't think there are ready-made benchmarks you can trigger; you can use benchmarks similar to those in the original issue #7606, run before and after your PR.
Don't forget to run a few warmup iterations for the sizes you are going to benchmark, to make sure the allocator settles, and synchronize after CUDA benchmarks. Since the changes are for small tensors only, you should be fine: the int64_t -> half conversion should not slow you down much compared to the exposed h2d copy latencies.
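
A sketch of that measurement pattern (warmup first, then make sure queued CUDA work has finished before the clock stops). This is only an illustration against the ATen C++ API; it is not the script used for the numbers reported below:

```cpp
#include <ATen/ATen.h>
#include <chrono>
#include <cstdio>

// Time `repeat` calls to randperm for a given size/dtype/device, with a few
// warmup iterations and a device-to-host copy used as a synchronization point.
double time_randperm(int64_t n, at::ScalarType dtype, at::Device device, int repeat) {
  auto options = at::TensorOptions().device(device).dtype(dtype);
  at::Tensor out;
  for (int i = 0; i < 3; ++i) {
    out = at::randperm(n, options);  // warmup, lets the caching allocator settle
  }
  if (device.is_cuda()) out.cpu();  // drain pending work before timing starts

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < repeat; ++i) {
    out = at::randperm(n, options);
  }
  if (device.is_cuda()) out.cpu();  // kernels on a stream finish in order, so
                                    // waiting for the last one covers them all
  std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
  return elapsed.count();
}

int main() {
  if (!at::hasCUDA()) return 0;
  std::printf("randperm(10), half, cuda: %.3f s\n",
              time_randperm(10, at::kHalf, at::Device(at::kCUDA), 100000));
  return 0;
}
```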

xuhdev (Collaborator, Author) commented on Jul 9, 2019

Here's the perf. It looks like this patch significantly improves performance for half on CUDA. I turned off CPU turbo boost and always ran the benchmark three times as warmup before the run that is reported, so the results should be reliable.

before this patch:

randperm(10) 100000 times
cpu, half	2.8076633399905404
cpu, float	2.7852712649910245
cpu, double	2.7890169029997196
cuda, half	18.277449170011096
cuda, float	10.359675675004837
cuda, double	10.433411638994585
randperm(1000) 1000 times
cpu, half	0.09014522400684655
cpu, float	0.06891961999644991
cpu, double	0.06935312499990687
cuda, half	0.2106321010069223
cuda, float	0.14546358000370674
cuda, double	0.1470240909984568

after this patch:

randperm(10) 100000 times
cpu, half	2.798496307004825
cpu, float	2.8195796069921926
cpu, double	2.827226548004546
cuda, half	11.731637821998447
cuda, float	10.317510107008275
cuda, double	10.415388538996922
randperm(1000) 1000 times
cpu, half	0.09070123700075783
cpu, float	0.06949712400091812
cpu, double	0.06988387599994894
cuda, half	0.16773680900223553
cuda, float	0.14579828100977466
cuda, double	0.14784032000170555

xuhdev (Collaborator, Author) commented on Jul 9, 2019

I also realized that the if (result.scalar_type() == at::ScalarType::Half) block should never be reached, because if n >= 30000 and the type is Half, check_supported_max_int_with_precision would already have reported an error. Instead of removing this block completely, I left the code here for the sake of clarity (because half support in thrust is spotty, and we do not want a future change to be unaware of this).
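
For context on why that branch is unreachable: float16 has 10 mantissa bits, so it can only represent consecutive integers exactly up to 2^11 = 2048, far below the 30000 threshold, which is presumably what check_supported_max_int_with_precision enforces here. A tiny standalone check of that representability limit (using c10::Half; not code from this PR):

```cpp
#include <c10/util/Half.h>
#include <cstdio>

// float16 has 10 mantissa bits, so consecutive integers are exactly
// representable only up to 2^11 = 2048; beyond that, values collide.
int main() {
  c10::Half a(2048.0f);
  c10::Half b(2049.0f);  // rounds to the nearest representable value
  std::printf("%g %g\n", static_cast<float>(a), static_cast<float>(b));
  // Expected output: "2048 2048" -- a permutation of 0..n-1 with n >= 30000
  // could not even be stored losslessly in a Half tensor.
  return 0;
}
```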

zou3519 deleted the gh/xuhdev/2/head branch on July 10, 2019
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 10, 2019
Swap detection order in randperm_out_cuda to avoid unnecessary conversion from float when the input is small.

Summary: Pull Request resolved: pytorch/pytorch#22103

Test Plan: Imported from OSS

Differential Revision: D16153585

Pulled By: li-roy

fbshipit-source-id: 0801b91e7b352c8de8fdfbe929be85d69182b8da
facebook-github-bot (Contributor) commented

@li-roy merged this pull request in 32709af.

xuhdev added a commit to xuhdev/pytorch that referenced this pull request Jul 17, 2019
One important comment is missing from pytorch#22103 (not sure what happened); this commit adds it back.
facebook-github-bot pushed a commit that referenced this pull request Jul 17, 2019
Summary:
One important comment is missing from #22103 (not sure what happened); this commit adds it back.
Pull Request resolved: #22984

Differential Revision: D16347044

Pulled By: ezyang

fbshipit-source-id: 0903909a5fb6740b43195136f1a23c28cfb2a02f
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 18, 2019