Motivation
Launching a FusionGroup for a JIT LSTM cell takes ~40us CPU time, as reported by the autograd profiler. This is not good because the kernel itself takes < 10us CUDA time and could probably be faster. After seeing nothing noticeably wrong with the JIT, I am looking into core performance overheads.
Methodology
Performance is measured with gbenchmark microbenchmarks: https://github.com/pytorch/benchmark/tree/master/timing/cpp2. Here is some sample output that motivates many of the tasks below.
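To give a concrete idea of the setup, here is a minimal sketch of such a microbenchmark (assuming the google/benchmark and ATen C++ headers; the benchmark names and exact build setup in the linked repo may differ):

```cpp
#include <benchmark/benchmark.h>
#include <ATen/ATen.h>

// Times the fixed overhead of constructing an empty CPU float tensor,
// which is the kind of per-op overhead these investigations target.
static void BM_AtenEmptyZero(benchmark::State& state) {
  for (auto _ : state) {
    auto t = at::empty({0}, at::TensorOptions().dtype(at::kFloat));
    benchmark::DoNotOptimize(t);
  }
}
BENCHMARK(BM_AtenEmptyZero);

BENCHMARK_MAIN();
```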
Investigations
- [Speed up tensor.resize_(sizes) when tensor has correct size #12824] Speed up `torch.empty({0})`. It is implemented by making a tensor and then calling `resize_({0})`; that `resize_` should be a no-op, but it takes 300ns. (See the resize sketch after this list.)
- [Speed up tensor.get_device(), is_cuda(), is_sparse() by avoiding dispatches #12841] [codemod tensor.type().is_cuda(), tensor.type().is_sparse() #13590] `get_device` is slow because it does two dispatches. It's used for the `DeviceGuard(tensor)` ctor. (See the cached-device sketch after this list.)
- [Speed up tensor.get_device(), is_cuda(), is_sparse() by avoiding dispatches #12841] `is_cuda` is slow. It's used pretty commonly, most notably in DeviceGuard.
- [Reimplement as_strided in ATen. #13185 @zou3519] Rewrite `torch.as_strided`.
- [Speed up tensor.storage_offset #13267 @zou3519] Speed up `tensor.storage_offset()`. `as_strided` is ~100ns slower when called without a storage_offset argument than with one (the absence of the argument triggers a call to `tensor.storage_offset()`).
- [Stop unnecessarily setting storage in as_strided. #13411 @zou3519] Stop allocating and throwing away a StorageImpl during `torch.as_strided`.
- [probably done @zou3519] Investigate `torch.as_strided` perf (it shouldn't be that much slower than just creating an empty tensor, since all it does is set fields on the new tensor).
- [Speed up tensor.options() by avoiding type dispatch #13330 @zou3519] `tensor.options()` takes 50ns.
- [Don't allocate empty Storage/StorageImpl for Variable. #13580 @zou3519] I think Variable::Impl's ctor constructs an empty StorageImpl, which costs around 200ns.
- [Use SmallVector for TensorImpl sizes and strides. #13649 @zou3519] Use at::SmallVector for TensorImpl's sizes and strides. This saves some time when allocating a tensor by avoiding dynamic allocations. (See the SmallVector sketch after this list.)
- [Avoid grabbing DeviceGuard in at::empty when possible #13785 @zou3519] Speed up at::empty by avoiding DeviceGuard. at::empty_cuda now only does 1 DeviceGuard.
- [Avoid grabbing DeviceGuard in at::empty when possible #13785] Speed up torch.resize_(...), which is important in creating tensors of non-empty size. Killed the resizing logic in torch.empty: it no longer performs a resize, which results in a small speedup.
- [done: [study] Are we calling DeviceGuard too much? #13974] Use fewer DeviceGuards in common ops: Operators that never (re)allocate memory do not need DeviceGuard #13269, Remove some more unnecessary DeviceGuard #13741 (motivation, timings).
- [[study] Are we calling DeviceGuard too much? #13974: DeviceGuard is no longer a problem!] DeviceGuard took ~300ns (before Speed up tensor.get_device(), is_cuda(), is_sparse() by avoiding dispatches #12841) and now takes ~100-200ns. It should ideally only perform a cudaGetDevice and be ~34ns. Also, it would be great if we could cache a thread-local device, but that may lead to consistency issues (bad idea). A typical JIT LSTM does 11,100 DeviceGuards at 100ns each, which is 1.1ms of overhead; cudnn does forward + bwd in a total of 17ms. (See the DeviceGuard sketch after this list.)
- [open] Library overhead: it takes 4us from an at::mm call until a gemm kernel begins launching. It shouldn't need to take that long; we need to allocate the output tensor, but that should be < 2us.
- [in progress @zou3519] at::detail::infer_type takes 500-700ns typically (35ns when run in a tight loop). This is pretty bad. Should run an experiment to cache the type (in a hacky way) on TensorImpl and see whether that improves the timing.
- [open] Eliminate the `tensor.type().scalarType()` idiom in favor of `tensor.scalarType` or `tensor.dtype()`. Not sure if this is actually perf-critical.
- [open] Investigate THCCachingAllocator performance.
- [open] Investigate caching the device on TensorImpl ([perf] Investigate caching device on TensorImpl #12934). (See the cached-device sketch after this list.)
- [open] Look into the overhead of creating a Variable. We currently create a StorageImpl, then use `make_tensor`, and finally use `as_variable`; `as_variable` seems pretty expensive. LSTM forward + backward creates 1100 tensors at 400ns of Variable overhead each, which is 0.4ms of overhead; cudnn does forward + bwd in a total of 17ms.
- [in progress @yf225 Variable/Tensor Merge Proposal #13638] The Tensor-Variable merge will get rid of our Variable overhead, leading to significant wins.
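Resize sketch (for the torch.empty({0}) / resize_ item): the fix boils down to bailing out before any storage work when the requested sizes already match. This is a minimal illustrative check, not the actual ATen resize code (IntArrayRef is named IntList in PyTorch versions from around the time of this issue):

```cpp
#include <ATen/ATen.h>

// Illustrative: resize_ can return immediately when the requested sizes equal
// the tensor's current sizes, which is exactly the
// torch.empty({0}) -> resize_({0}) case tracked above.
bool resize_would_be_noop(const at::Tensor& t, at::IntArrayRef new_sizes) {
  return t.sizes().equals(new_sizes);
}
```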
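Cached-device sketch (for the get_device/is_cuda items and for caching the device on TensorImpl): the goal is for these queries to be plain field reads on the impl instead of going through Type dispatch. The struct below is a hypothetical toy, not PyTorch's actual TensorImpl:

```cpp
#include <cstdint>

// Hypothetical toy illustrating the idea: answer is_cuda()/is_sparse()/
// get_device() from fields cached on the impl, with no virtual dispatch.
enum class ToyBackend : int8_t { CPU, CUDA, SparseCPU, SparseCUDA };

struct ToyTensorImpl {
  ToyBackend backend = ToyBackend::CPU;
  int16_t device_index = -1;  // CUDA device index, -1 for CPU tensors

  bool is_cuda() const {
    return backend == ToyBackend::CUDA || backend == ToyBackend::SparseCUDA;
  }
  bool is_sparse() const {
    return backend == ToyBackend::SparseCPU || backend == ToyBackend::SparseCUDA;
  }
  int64_t get_device() const {
    return device_index;  // just a field read
  }
};
```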
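SmallVector sketch (for the sizes/strides item): keeping a small inline capacity means that constructing metadata for typical low-dimensional tensors performs no heap allocation, unlike std::vector<int64_t>. Illustrative only; the inline capacity and header path used in the actual PR may differ:

```cpp
#include <c10/util/SmallVector.h>
#include <cstdint>

// Illustrative: sizes and strides live inline for up to 5 dimensions, so a
// typical 2-D tensor's metadata needs no dynamic allocation.
struct ToyTensorMetadata {
  c10::SmallVector<int64_t, 5> sizes;
  c10::SmallVector<int64_t, 5> strides;
};

ToyTensorMetadata make_2d_metadata(int64_t rows, int64_t cols) {
  ToyTensorMetadata m;
  m.sizes.push_back(rows);
  m.sizes.push_back(cols);
  m.strides.push_back(cols);  // row-major contiguous strides
  m.strides.push_back(1);
  return m;
}
```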
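DeviceGuard sketch (for the DeviceGuard-cost item): the ~34ns target corresponds roughly to a guard that only touches the CUDA runtime. The RAII class below is a hand-rolled illustration, not PyTorch's at::DeviceGuard (which also has to handle CPU tensors, device index -1, and dispatch):

```cpp
#include <cuda_runtime.h>

// Illustrative RAII guard: one cudaGetDevice on entry, plus a cudaSetDevice
// on entry and exit only when the device actually changes.
struct NaiveCudaDeviceGuard {
  explicit NaiveCudaDeviceGuard(int device) {
    cudaGetDevice(&original_device_);
    if (device != original_device_) {
      cudaSetDevice(device);
      changed_ = true;
    }
  }
  ~NaiveCudaDeviceGuard() {
    if (changed_) {
      cudaSetDevice(original_device_);
    }
  }
  int original_device_ = -1;
  bool changed_ = false;
};
```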