
Implement fast pass for CPU scalars /number literals #29915

Closed
wants to merge 1 commit

Conversation


@ailzhang ailzhang commented Nov 15, 2019

The main changes in this PR are:

  • Skip device dispatch for CPU scalars (number literals also fall into this category). In most cases scalars should live on CPU for best performance, but if a user explicitly puts a scalar on another device, we respect that setting and bail out of the fast pass.
  • Directly manipulate the Tensor's data_ptr when filling a scalar into a 1-element tensor (a rough sketch of the idea follows below).
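
For intuition, the fast path amounts to roughly the sketch below (illustrative only; `scalar_to_tensor_fast` is a made-up name, and the real PR modifies existing functions rather than adding this helper):

```cpp
#include <ATen/ATen.h>

// Illustrative sketch: wrap a CPU scalar into a 0-dim tensor without going
// through full device dispatch or TensorIterator.
at::Tensor scalar_to_tensor_fast(const at::Scalar& s) {
  // Exclude the variable dispatch key so the factory call goes straight to
  // the CPU kernel (mirrors the AutoNonVariableTypeMode usage shown later).
  at::AutoNonVariableTypeMode non_var_type_mode(true);
  at::Tensor t = at::empty({}, at::device(at::kCPU).dtype(s.type()));
  // Fill the single element by writing through data_ptr directly.
  AT_DISPATCH_ALL_TYPES(t.scalar_type(), "fill_fast", [&]() {
    *static_cast<scalar_t*>(t.data_ptr()) = s.to<scalar_t>();
  });
  return t;
}
```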

Some perf benchmark numbers:
[Update 11/19/2019]: the old numbers were from a debug build; they have been replaced with numbers from a normal build.

## Before
In [2]: def test(x):
   ...:     x = x + 2
   ...:     return x
   ...:

In [6]: with torch.no_grad():
   ...:     x = torch.ones(100)
   ...:     %timeit test(x)
   ...:
   ...:
9.9 µs ± 84.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


## After
In [1]: import torch

In [2]: def test(x):
   ...:     x = x + 2
   ...:     return x
   ...:

In [3]: with torch.no_grad():
   ...:     x = torch.ones(100)
   ...:     %timeit test(x)
   ...:
6.85 µs ± 420 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Before the patch `tensor_slow` took 15.74% of total time.
[Screenshot: profiler output before the patch, 2019-11-15 12:49 PM]
After the patch `tensor_slow` takes 3.84% of total time.
[Screenshot: profiler output after the patch, 2019-11-15 1:13 PM]

cc: @roosephu who originally reported this issue to me.


zou3519 commented Nov 15, 2019

Some before/after numbers would be great

@ailzhang changed the title from "Implement fast pass for CPU scalars /python number literals" to "Implement fast pass for CPU scalars /number literals" on Nov 15, 2019
// but we also want to skip compute_types, which is not avoidable
// in TensorIterator for now.
if (self.device() == at::kCPU && self.numel() == 1) {
  AT_DISPATCH_ALL_TYPES(self.scalar_type(), "fill_out", [&]() {
@ailzhang (Contributor Author) commented:
Test failures remind me that "all types" are not really all types. I'm experimenting locally to see which macro is best to use, but the rest of the PR is ready for review. ;)
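
(For illustration only, not necessarily the macro the PR ended up using: one of the wider dispatch macros would cover types like Half and Bool that AT_DISPATCH_ALL_TYPES skips. The lambda body and the `value` parameter name are assumed from the fill_fast excerpt further below.)

```cpp
// Hedged sketch: dispatch over the standard types plus Half and Bool.
AT_DISPATCH_ALL_TYPES_AND2(at::ScalarType::Half, at::ScalarType::Bool,
                           self.scalar_type(), "fill_out", [&]() {
  fill_fast<scalar_t>(self, value);  // 'value' assumed to be the Scalar argument
});
```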

A reviewer (Collaborator) commented:
The rest of the PR looks good :-)

template <typename scalar_t>
inline void fill_fast(Tensor& self, Scalar value_scalar) {
  auto value = value_scalar.to<scalar_t>();
  scalar_t* dptr = reinterpret_cast<scalar_t*>(self.data_ptr());
A reviewer (Contributor) commented:
Use of a reinterpret_cast here shouldn't be necessary; data_ptr() gives you a void*, so you can just static_cast it.
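
A minimal sketch of that suggestion, reusing the names from the excerpt above (the write through `dptr` is assumed from context, since the excerpt cuts off after the declaration):

```cpp
// Hedged sketch: data_ptr() returns void*, so static_cast suffices.
scalar_t* dptr = static_cast<scalar_t*>(self.data_ptr());
*dptr = value;  // write the converted scalar into the 1-element tensor
```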

@ailzhang ailzhang force-pushed the SCALAR branch 2 times, most recently from 0f8e225 to 1b18169 Compare November 18, 2019 23:37
// In the future, when we remove the overhead of device dispatch, we'll happily
// revert this to the following:
// auto result = at::empty({}, options);
at::AutoNonVariableTypeMode non_var_type_mode(true);
@ailzhang (Contributor Author) commented on Nov 18, 2019:

@ezyang @ngimel I made a minor change in the latest commit, since `empty_cpu` requires `VariableTypeId` to be excluded.
Currently I'm using `AutoNonVariableTypeMode` here to turn it off and then call directly into `empty_cpu`. Alternatively I could use `auto result = at::empty({}, options)`, but there's a small perf difference (for the same script as in the description).

(On a debug build)
## with AutoNonVariableTypeMode + empty_cpu
61.3 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
## with at::empty
63.1 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Since the total time before this PR was ~80µs, the 2µs difference here doesn't sound too appealing. I'm noting it here to check whether we want those 2µs when we know for sure something is definitely on CPU.

A reviewer (Contributor) commented:
Use of AutoNonVariableTypeMode here is fine. (It's a bit of black magic inserting these, but that's what we're doing right now.)

@ailzhang (Contributor Author) commented:
Hmm, my favorite CircleCI jobs are not showing; trying closing and reopening.

@facebook-github-bot (Contributor) left a comment:
@ailzhang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot commented:
@ailzhang merged this pull request in 2b02d15.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 19, 2019
Summary:
The main changes in this PR are:
- Skip device dispatch for CPU scalars (number literals also fall into this category). In most cases scalars should live on CPU for best performance, but if a user explicitly puts a scalar on another device, we respect that setting and bail out of the fast pass.
- Directly manipulate the Tensor's data_ptr when filling a scalar into a 1-element tensor.

Some perf benchmark numbers:
```
## Before
In [4]: def test(x):
   ...:     x = x + 2
   ...:     return x
   ...:

In [5]: with torch.no_grad():
   ...:     x = torch.ones(100)
   ...:     %timeit {test(x)}
   ...:
79.8 µs ± 127 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

## After
In [2]: def test(x):
   ...:     x = x + 2
   ...:     return x
   ...:

In [3]: with torch.no_grad():
   ...:     x = torch.ones(100)
   ...:     %timeit {test(x)}
   ...:
60.5 µs ± 334 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

Before the patch `tensor_slow` took 15.74% of total time.
<img width="1186" alt="Screen Shot 2019-11-15 at 12 49 51 PM" src="https://user-images.githubusercontent.com/5248122/68976895-cc808c00-07ab-11ea-8f3c-7f15597d12cf.png">
After the patch `tensor_slow` takes 3.84% of total time.
<img width="1190" alt="Screen Shot 2019-11-15 at 1 13 03 PM" src="https://user-images.githubusercontent.com/5248122/68976925-e28e4c80-07ab-11ea-94c0-91172fc3bb53.png">

cc: roosephu who originally reported this issue to me.
Pull Request resolved: pytorch/pytorch#29915

Differential Revision: D18584251

Pulled By: ailzhang

fbshipit-source-id: 2353c8012450a81872e1e09717b3b181362be401