
Implement fast pass for CPU scalars /number literals #29915

Closed
wants to merge 1 commit

Conversation


@ailzhang ailzhang commented Nov 15, 2019

The main changes in this PR are:

  • Skip device dispatch for CPU scalars (number literals also fall into this category). In most cases scalars should live on CPU for best performance, but if a user explicitly puts a scalar on another device, we respect that setting and bail out of the fast pass.
  • Directly manipulate the Tensor's data_ptr when filling a scalar into a 1-element tensor (a rough sketch of the idea follows below).
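
For intuition, the fast path amounts to roughly the sketch below (illustrative only; `scalar_to_tensor_fast` is a made-up name, and the real PR modifies existing functions rather than adding this helper):

```cpp
#include <ATen/ATen.h>

// Illustrative sketch: wrap a CPU scalar into a 0-dim tensor without going
// through full device dispatch or TensorIterator.
at::Tensor scalar_to_tensor_fast(const at::Scalar& s) {
  // Exclude the variable dispatch key so the factory call goes straight to
  // the CPU kernel (mirrors the AutoNonVariableTypeMode usage shown later).
  at::AutoNonVariableTypeMode non_var_type_mode(true);
  at::Tensor t = at::empty({}, at::device(at::kCPU).dtype(s.type()));
  // Fill the single element by writing through data_ptr directly.
  AT_DISPATCH_ALL_TYPES(t.scalar_type(), "fill_fast", [&]() {
    *static_cast<scalar_t*>(t.data_ptr()) = s.to<scalar_t>();
  });
  return t;
}
```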

Some perf benchmark numbers:
[Update 11/19/2019]: the old numbers were from a debug build; they have been replaced with numbers from a normal build.

## Before
In [2]: def test(x):
   ...:     x = x + 2
   ...:     return x
   ...:

In [6]: with torch.no_grad():
   ...:     x = torch.ones(100)
   ...:     %timeit test(x)
   ...:
   ...:
9.9 µs ± 84.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


## After
In [1]: import torch

In [2]: def test(x):
   ...:     x = x + 2
   ...:     return x
   ...:

In [3]: with torch.no_grad():
   ...:     x = torch.ones(100)
   ...:     %timeit test(x)
   ...:
6.85 µs ± 420 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Before the patch `tensor_slow` took 15.74% of total time.
[Screenshot: profiler output before the patch, 2019-11-15 12:49 PM]
After the patch `tensor_slow` takes 3.84% of total time.
[Screenshot: profiler output after the patch, 2019-11-15 1:13 PM]

cc: @roosephu who originally reported this issue to me.


zou3519 commented Nov 15, 2019

Some before/after numbers would be great

@ailzhang changed the title from "Implement fast pass for CPU scalars /python number literals" to "Implement fast pass for CPU scalars /number literals" on Nov 15, 2019
// but we also want to skip compute_types, which is not avoidable
// in TensorIterator for now.
if (self.device() == at::kCPU && self.numel() == 1) {
  AT_DISPATCH_ALL_TYPES(self.scalar_type(), "fill_out", [&]() {
@ailzhang (Contributor Author) commented:
Test failures remind me that "all types" are not really all types. I'm experimenting locally to see which macro is best to use, but the rest of the PR is ready for review. ;)
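
(For illustration only, not necessarily the macro the PR ended up using: one of the wider dispatch macros would cover types like Half and Bool that AT_DISPATCH_ALL_TYPES skips. The lambda body and the `value` parameter name are assumed from the fill_fast excerpt further below.)

```cpp
// Hedged sketch: dispatch over the standard types plus Half and Bool.
AT_DISPATCH_ALL_TYPES_AND2(at::ScalarType::Half, at::ScalarType::Bool,
                           self.scalar_type(), "fill_out", [&]() {
  fill_fast<scalar_t>(self, value);  // 'value' assumed to be the Scalar argument
});
```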

A reviewer (Collaborator) commented:
The rest of the PR looks good :-)

template <typename scalar_t>
inline void fill_fast(Tensor& self, Scalar value_scalar) {
  auto value = value_scalar.to<scalar_t>();
  scalar_t* dptr = reinterpret_cast<scalar_t*>(self.data_ptr());
A reviewer (Contributor) commented:
Use of a reinterpret_cast here shouldn't be necessary; data_ptr() gives you a void*, so you can just static_cast it.
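
A minimal sketch of that suggestion, reusing the names from the excerpt above (the write through `dptr` is assumed from context, since the excerpt cuts off after the declaration):

```cpp
// Hedged sketch: data_ptr() returns void*, so static_cast suffices.
scalar_t* dptr = static_cast<scalar_t*>(self.data_ptr());
*dptr = value;  // write the converted scalar into the 1-element tensor
```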

@ailzhang ailzhang force-pushed the SCALAR branch 2 times, most recently from 0f8e225 to 1b18169 Compare November 18, 2019 23:37
// In the future, when we remove the overhead of device dispatch, we'll happily
// revert this to the following:
// auto result = at::empty({}, options);
at::AutoNonVariableTypeMode non_var_type_mode(true);
@ailzhang (Contributor Author) commented on Nov 18, 2019:

@ezyang @ngimel I made a minor change in the latest commit, since `empty_cpu` requires `VariableTypeId` to be excluded.
Currently I'm using `AutoNonVariableTypeMode` here to turn it off and then call directly into `empty_cpu`. Alternatively I could use `auto result = at::empty({}, options)`, but there's a small perf difference (for the same script as in the description).

(On a debug build)
## with AutoNonVariableTypeMode + empty_cpu
61.3 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
## with at::empty
63.1 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Since the total time before this PR was ~80µs, the 2µs difference here doesn't sound too appealing. I'm noting it here to check whether we want those 2µs when we know for sure something is definitely on CPU.

A reviewer (Contributor) commented:
Use of AutoNonVariableTypeMode here is fine. (It's a bit of black magic inserting these, but that's what we're doing right now.)

@ailzhang (Contributor Author) commented:
Hmm, my favorite CircleCI jobs are not showing; trying closing and reopening.

@facebook-github-bot (Contributor) left a comment:
@ailzhang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot commented:
@ailzhang merged this pull request in 2b02d15.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 19, 2019
Summary:
The main changes in this PR are:
- Skip device dispatch for CPU scalars (number literals also fall into this category). In most cases scalars should live on CPU for best performance, but if a user explicitly puts a scalar on another device, we respect that setting and bail out of the fast pass.
- Directly manipulate the Tensor's data_ptr when filling a scalar into a 1-element tensor.

Some perf benchmark numbers:
```
## Before
In [4]: def test(x):
   ...:     x = x + 2
   ...:     return x
   ...:

In [5]: with torch.no_grad():
   ...:     x = torch.ones(100)
   ...:     %timeit {test(x)}
   ...:
79.8 µs ± 127 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

## After
In [2]: def test(x):
   ...:     x = x + 2
   ...:     return x
   ...:

In [3]: with torch.no_grad():
   ...:     x = torch.ones(100)
   ...:     %timeit {test(x)}
   ...:
60.5 µs ± 334 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

Before the patch `tensor_slow` took 15.74% of total time.
<img width="1186" alt="Screen Shot 2019-11-15 at 12 49 51 PM" src="https://user-images.githubusercontent.com/5248122/68976895-cc808c00-07ab-11ea-8f3c-7f15597d12cf.png">
After the patch `tensor_slow` takes 3.84% of total time.
<img width="1190" alt="Screen Shot 2019-11-15 at 1 13 03 PM" src="https://user-images.githubusercontent.com/5248122/68976925-e28e4c80-07ab-11ea-94c0-91172fc3bb53.png">

cc: roosephu who originally reported this issue to me.
Pull Request resolved: pytorch/pytorch#29915

Differential Revision: D18584251

Pulled By: ailzhang

fbshipit-source-id: 2353c8012450a81872e1e09717b3b181362be401