Motivation
Launching a FusionGroup for a JIT LSTM cell takes ~40us CPU time, as reported by the autograd profiler. This is not good because the kernel itself takes < 10us CUDA time and could probably be faster. After seeing nothing noticeably wrong with the JIT, I am looking into core performance overheads.
Methodology
Performance is measured with gbenchmark microbenchmarks: https://github.com/pytorch/benchmark/tree/master/timing/cpp2. Here is some sample output that motivates many of the tasks below.
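To give a concrete idea of the setup, here is a minimal sketch of such a microbenchmark (assuming the google/benchmark and ATen C++ headers; the benchmark names and exact build setup in the linked repo may differ):

```cpp
#include <benchmark/benchmark.h>
#include <ATen/ATen.h>

// Times the fixed overhead of constructing an empty CPU float tensor,
// which is the kind of per-op overhead these investigations target.
static void BM_AtenEmptyZero(benchmark::State& state) {
  for (auto _ : state) {
    auto t = at::empty({0}, at::TensorOptions().dtype(at::kFloat));
    benchmark::DoNotOptimize(t);
  }
}
BENCHMARK(BM_AtenEmptyZero);

BENCHMARK_MAIN();
```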
Investigations
- [Speed up tensor.resize_(sizes) when tensor has correct size #12824] Speed up `torch.empty({0})`. It is implemented by making a tensor and then calling `resize_({0})`; that `resize_` should be a no-op, but it takes 300ns. (See the resize sketch after this list.)
- [Speed up tensor.get_device(), is_cuda(), is_sparse() by avoiding dispatches #12841] [codemod tensor.type().is_cuda(), tensor.type().is_sparse() #13590] `get_device` is slow because it does two dispatches. It's used for the `DeviceGuard(tensor)` ctor. (See the cached-device sketch after this list.)
- [Speed up tensor.get_device(), is_cuda(), is_sparse() by avoiding dispatches #12841] `is_cuda` is slow. It's used pretty commonly, most notably in DeviceGuard.
- [Reimplement as_strided in ATen. #13185 @zou3519] Rewrite `torch.as_strided`.
- [Speed up tensor.storage_offset #13267 @zou3519] Speed up `tensor.storage_offset()`. `as_strided` is ~100ns slower when called without a storage_offset argument than with one (the absence of the argument triggers a call to `tensor.storage_offset()`).
- [Stop unnecessarily setting storage in as_strided. #13411 @zou3519] Stop allocating and throwing away a StorageImpl during `torch.as_strided`.
- [probably done @zou3519] Investigate `torch.as_strided` perf (it shouldn't be that much slower than just creating an empty tensor, since all it does is set fields on the new tensor).
- [Speed up tensor.options() by avoiding type dispatch #13330 @zou3519] `tensor.options()` takes 50ns.
- [Don't allocate empty Storage/StorageImpl for Variable. #13580 @zou3519] I think Variable::Impl's ctor constructs an empty StorageImpl, which costs around 200ns.
- [Use SmallVector for TensorImpl sizes and strides. #13649 @zou3519] Use at::SmallVector for TensorImpl's sizes and strides. This saves some time when allocating a tensor by avoiding dynamic allocations. (See the SmallVector sketch after this list.)
- [Avoid grabbing DeviceGuard in at::empty when possible #13785 @zou3519] Speed up at::empty by avoiding DeviceGuard. at::empty_cuda now only does 1 DeviceGuard.
- [Avoid grabbing DeviceGuard in at::empty when possible #13785] Speed up torch.resize_(...), which is important in creating tensors of non-empty size. Killed the resizing logic in torch.empty: it no longer performs a resize, which results in a small speedup.
- [done: [study] Are we calling DeviceGuard too much? #13974] Use fewer DeviceGuards in common ops: Operators that never (re)allocate memory do not need DeviceGuard #13269, Remove some more unnecessary DeviceGuard #13741 (motivation, timings).
- [[study] Are we calling DeviceGuard too much? #13974: DeviceGuard is no longer a problem!] DeviceGuard took ~300ns (before Speed up tensor.get_device(), is_cuda(), is_sparse() by avoiding dispatches #12841) and now takes ~100-200ns. It should ideally only perform a cudaGetDevice and be ~34ns. Also, it would be great if we could cache a thread-local device, but that may lead to consistency issues (bad idea). A typical JIT LSTM does 11,100 DeviceGuards at 100ns each, which is 1.1ms of overhead; cudnn does forward + bwd in a total of 17ms. (See the DeviceGuard sketch after this list.)
- [open] Library overhead: it takes 4us from an at::mm call until a gemm kernel begins launching. It shouldn't need to take that long; we need to allocate the output tensor, but that should be < 2us.
- [in progress @zou3519] at::detail::infer_type takes 500-700ns typically (35ns when run in a tight loop). This is pretty bad. Should run an experiment to cache the type (in a hacky way) on TensorImpl and see whether that improves the timing.
- [open] Eliminate the `tensor.type().scalarType()` idiom in favor of `tensor.scalarType` or `tensor.dtype()`. Not sure if this is actually perf-critical.
- [open] Investigate THCCachingAllocator performance.
- [open] Investigate caching the device on TensorImpl ([perf] Investigate caching device on TensorImpl #12934). (See the cached-device sketch after this list.)
- [open] Look into the overhead of creating a Variable. We currently create a StorageImpl, then use `make_tensor`, and finally use `as_variable`; `as_variable` seems pretty expensive. LSTM forward + backward creates 1100 tensors at 400ns of Variable overhead each, which is 0.4ms of overhead; cudnn does forward + bwd in a total of 17ms.
- [in progress @yf225 Variable/Tensor Merge Proposal #13638] The Tensor-Variable merge will get rid of our Variable overhead, leading to significant wins.
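Resize sketch (for the torch.empty({0}) / resize_ item): the fix boils down to bailing out before any storage work when the requested sizes already match. This is a minimal illustrative check, not the actual ATen resize code (IntArrayRef is named IntList in PyTorch versions from around the time of this issue):

```cpp
#include <ATen/ATen.h>

// Illustrative: resize_ can return immediately when the requested sizes equal
// the tensor's current sizes, which is exactly the
// torch.empty({0}) -> resize_({0}) case tracked above.
bool resize_would_be_noop(const at::Tensor& t, at::IntArrayRef new_sizes) {
  return t.sizes().equals(new_sizes);
}
```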
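Cached-device sketch (for the get_device/is_cuda items and for caching the device on TensorImpl): the goal is for these queries to be plain field reads on the impl instead of going through Type dispatch. The struct below is a hypothetical toy, not PyTorch's actual TensorImpl:

```cpp
#include <cstdint>

// Hypothetical toy illustrating the idea: answer is_cuda()/is_sparse()/
// get_device() from fields cached on the impl, with no virtual dispatch.
enum class ToyBackend : int8_t { CPU, CUDA, SparseCPU, SparseCUDA };

struct ToyTensorImpl {
  ToyBackend backend = ToyBackend::CPU;
  int16_t device_index = -1;  // CUDA device index, -1 for CPU tensors

  bool is_cuda() const {
    return backend == ToyBackend::CUDA || backend == ToyBackend::SparseCUDA;
  }
  bool is_sparse() const {
    return backend == ToyBackend::SparseCPU || backend == ToyBackend::SparseCUDA;
  }
  int64_t get_device() const {
    return device_index;  // just a field read
  }
};
```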
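SmallVector sketch (for the sizes/strides item): keeping a small inline capacity means that constructing metadata for typical low-dimensional tensors performs no heap allocation, unlike std::vector<int64_t>. Illustrative only; the inline capacity and header path used in the actual PR may differ:

```cpp
#include <c10/util/SmallVector.h>
#include <cstdint>

// Illustrative: sizes and strides live inline for up to 5 dimensions, so a
// typical 2-D tensor's metadata needs no dynamic allocation.
struct ToyTensorMetadata {
  c10::SmallVector<int64_t, 5> sizes;
  c10::SmallVector<int64_t, 5> strides;
};

ToyTensorMetadata make_2d_metadata(int64_t rows, int64_t cols) {
  ToyTensorMetadata m;
  m.sizes.push_back(rows);
  m.sizes.push_back(cols);
  m.strides.push_back(cols);  // row-major contiguous strides
  m.strides.push_back(1);
  return m;
}
```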
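DeviceGuard sketch (for the DeviceGuard-cost item): the ~34ns target corresponds roughly to a guard that only touches the CUDA runtime. The RAII class below is a hand-rolled illustration, not PyTorch's at::DeviceGuard (which also has to handle CPU tensors, device index -1, and dispatch):

```cpp
#include <cuda_runtime.h>

// Illustrative RAII guard: one cudaGetDevice on entry, plus a cudaSetDevice
// on entry and exit only when the device actually changes.
struct NaiveCudaDeviceGuard {
  explicit NaiveCudaDeviceGuard(int device) {
    cudaGetDevice(&original_device_);
    if (device != original_device_) {
      cudaSetDevice(device);
      changed_ = true;
    }
  }
  ~NaiveCudaDeviceGuard() {
    if (changed_) {
      cudaSetDevice(original_device_);
    }
  }
  int original_device_ = -1;
  bool changed_ = false;
};
```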